Software and hardware annotations 2008 January

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


080122 Number of NFS server instances
Recently a performance issue with an NFS server reminded me of a classic episode from many years ago. The recent issue was an NFS server being very slow, serving data to some clients at around 1MB/s read or write, even though its hardware was not overloaded and could sustain over 50MB/s locally. Tracing the NFS traffic with the classic tcpdump showed that the low network performance involved a few seconds of continuous packet traffic followed by several seconds of pause, resulting in a low average rate. This pointed to some bottleneck on the server, which however could transfer data over the network at the full 1gb/s interface speed. It then transpired that a highly parallel job had been started on a nearby cluster and its processes were all doing IO in parallel, and since there were 24 of them they were monopolizing the 16 NFS server processes, which had become the critical resource.
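In a case like this the immediate remedy is simply to raise the number of NFS server processes. On a Linux NFS server the thread count can be inspected and raised roughly as below (the count of 32 is illustrative; the right number depends on the workload, and the persistent setting name varies by distribution):

```shell
# Number of nfsd server processes currently running
cat /proc/fs/nfsd/threads

# Thread usage statistics (the "th" line) from the in-kernel NFS server
grep '^th' /proc/net/rpc/nfsd

# Raise the pool, e.g. to 32 processes, so that 24 parallel clients
# can no longer monopolize it; this takes effect immediately
rpc.nfsd 32

# To make it persistent, set e.g. RPCNFSDCOUNT in
# /etc/default/nfs-kernel-server (Debian) or /etc/sysconfig/nfs (Red Hat)
```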
Which reminded me of an amusing episode long ago: on some Sun 3 server the CPU would become very busy as soon as more than 8 processes, and in particular NFS processes, became active. I found that there was a page table cache with capacity for the page tables of 8 processes, and that evicting a table from the cache and loading a new one was rather expensive in CPU time; neither was a problem in itself. But the big problem was that while processes were scheduled FIFO (round robin), the page table cache had an LRU (least recently used) replacement policy, which guaranteed that with more than 8 active processes there would be a page table cache eviction at every context switch, because under round robin scheduling the least recently used entry is exactly the one about to be needed next. This was because the OS team had read in some textbook that processes should be scheduled FIFO, the hardware team that caches should be managed LRU, and neither could see that the combination would be catastrophic.
How did Sun solve the issue? Well, as far as I know it was never solved for the Sun 3, and for the Sun 4 the design team had a beautiful idea: to increase the number of page table cache slots to 64, while keeping the LRU replacement policy for the cache. Fortunately at the time there were few situations where more than 64 processes would become active.
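The pathology is easy to reproduce in a few lines. This is a toy model, not the Sun 3 hardware: it simulates an LRU cache of page table slots while processes are scheduled round robin, and shows that one extra process beyond the slot count turns the cache into a 100% miss machine, while 64 slots make the problem disappear for small process counts:

```python
# Toy model of the Sun 3 pathology: round robin scheduling over an
# LRU-replaced page table cache. One process more than the cache has
# slots for means a miss (an expensive table reload) on every switch.
from collections import OrderedDict

def miss_rate(slots, processes, switches):
    cache = OrderedDict()            # keys in LRU-to-MRU order
    misses = 0
    for i in range(switches):
        p = i % processes            # round robin: next process in turn
        if p in cache:
            cache.move_to_end(p)     # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= slots:
                cache.popitem(last=False)  # evict least recently used
            cache[p] = True
    return misses / switches

print(miss_rate(8, 8, 8000))   # 8 processes fit: only warm-up misses
print(miss_rate(8, 9, 9000))   # one extra process: every switch misses
print(miss_rate(64, 9, 9000))  # the Sun 4 "fix": more slots than processes
```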
080119 Another difficulty with centralized computing
In general, as already argued, I prefer Internet-like network service structures composed of many relatively small workgroup-sized cells connected by a (hierarchically meshed) backbone. The main reasons are resiliency and flexibility, as problems tend to be local rather than global, and the sysadm overhead is not that huge if one follows reasonable tools and practices of mass sysadm.
But there are some other concerns, which are about scalability of system administration:
Performance tuning
The difficulty and inconvenience of performance tuning grows rather faster than capacity, because if a load is partitionable the easiest strategy is to throw discrete bits of hardware at it. Consider serving files to 400 concurrent users from 20 servers each supporting 20 users, or from a single file server. Providing ample network and storage bandwidth for 400 users is very hard; providing it for 20 is almost trivial:
  • For example one 1gb/s network interface might well be enough to support 20 users, but how easily can one provide 20 1gb/s network interfaces on a 400 user file server? One would have to use perhaps 3-4 10gb/s interfaces and then a distribution network to the edge user stations.
  • The same applies to storage bandwidth: the sort of subsystem that can serve adequately 20 users is currently a simple RAID with a few drives and a single filesystem of a few TiB. A storage system capable of providing enough capacity and especially bandwidth for 400 users is a rather bigger undertaking, and it takes significant effort to tune or cost to buy a ready-optimized package.
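The network arithmetic above can be made explicit (all numbers are the illustrative ones from the example):

```python
# Illustrative sizing arithmetic for the 400-user example above.
users = 400
users_per_cell = 20
gbit_per_cell = 1          # one 1gb/s interface serves a 20-user cell

cells = users // users_per_cell          # 20 separate small servers
aggregate_gbit = cells * gbit_per_cell   # what one central server must match
interfaces_10g = aggregate_gbit / 10     # raw minimum of 10gb/s interfaces

print(cells, aggregate_gbit, interfaces_10g)
# in practice 3-4 10gb/s interfaces rather than 2, to allow for burstiness
# and overheads, plus a distribution network down to the 1gb/s edge ports
```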
But there is a bigger point that strengthens both of the above: insufficient or poorly tuned capacity has a much bigger impact in the centralized case, because it affects everybody. In particular, performance in the central case must be high enough to satisfy the most demanding user, while in the decentralized case one need only worry about the cells where the demanding users are (and they are often clustered by job).
Configuration maintainability

Much the same argument can also be made about system administration: any suboptimal configuration, not just performance configuration, impacts everybody, and the configuration of the central system must then be the union of those suitable for all users, including the most demanding ones. So for example a central server must not only have every possible package installed, but also several different versions of each package, as different users will require different ABI levels because they use binary packages with different version dependencies, and so on.
Of course, if centralized capacity can be tuned and configured optimally for everybody, and kept that way over time, then one gets the best possible outcome.
That is not however what my experience suggests: most organisations are slightly or rather disjointed, and inevitable imperfections and mishaps prevent perfect execution on a global scale. Good execution in the few locales where it really matters is already a challenging goal in an imperfect world, but at least it is a goal that usually captures most of the business benefit and is often more easily achievable.