Software and hardware annotations 2008 April

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

[file this blog page at: digg del.icio.us Technorati]

080430 Wed The tragic depth of the Microsoft cultural hegemony

In Fedora 8 the system log dæmon is rsyslog which is a backwards compatible upgrade of the usual BSD style syslog. I don't really need the extensions, so I did not much bother to read the relevant parts of the manual, until now, and I have been lucky. The reason is that therein lies the syntax for the log file format, and the syntax is the same as for MS-DOS environment variables, as this quote makes tragically clear:

A template for RFC 3164 format:

              $template RFC3164fmt,"<%PRI%>%TIMESTAMP% %HOSTNAME% %syslogtag%%msg%"

The Microsoft cultural hegemony seems nearly absolute.

080428 Mon Large decreases in flash disk prices expected

Some hopeful quotes from an article on future improvements in solid state storage capacity:

Saito stated that this should help lower Toshiba's NAND manufacturing costs by 40-50 percent each year.

Moving towards 2011, Saito also stated that the price ratio between SSDs and HDDs will likely dissipate as well, as long as NAND manufacturing costs keep reducing by his 50 percent per year goal.

This target of compound cost reduction is major news, because if achieved it will substantially alter many current tradeoffs betwen capacity, speed and power density in the design of storage systems, especially because flash memory keeps information even when powered off just like disks, but powers on much more quickly.
An equally epochal moment happened a dozen years ago when hard disk manufacturers saw that RAM was closing in to hard disk price/capacity ratio and decided to push their engineers to keep or widen the gap by aiming for 80% yearly capacity increases largely thanks to new disk head technology using AMR, GMR and finally TMR technology. The rate of capacity improvement has slowed down recently to perhaps 40% per year but this was before the introduction of TMR platters.
As to power, flash memory draws a lot less power than hard disks of course, and thus allows much higher density storage or a reduction in cooling requirements for large computer farms, even if they are rather slow at 18MB/s writing and 25MB/s reading for 64GB. Which is about the same as for much smaller flash mini-drives, which seems to indicate that either access is serial or is parallel to rather slower (and presumably cheaper) chips than those used in the high speed mini-drives. The price is still quite high, as that 64GB drive seems to currently retail for $2,200 (around $34/GB) which is the same as announced almost two years ago. By comparison, non-persistent DDR2 memory sells for around $300 per 4GB or $4,800 per 64GB (around $75/MB) which is twice as much per GB as the flash disk above, and does not include interface electronics (not a big deal though). I would have expected a much bigger difference though by now.
It is interesting to note that MRAM has still lower power requirements, is persistent, and is rather faster than flash but unfortunately it is still not much available, at least until someone with deep pockets places a large order for it justifying the allocation of a large chip production facility to it.

080420 Sun Multiple CPUs and developers

Another interesting opinion on the usefulness of multiple CPU chips seems to me to argue that while they are currently only useful for a very narrow set of applications they should be adopted by developers because:

Multithreaded IDEs and build systems in particular are among the few applications that benefits from several CPUs to reduce the duration of pauses during development.
More tellingly, it is hoped that if developers have multiple CPUs, they will develop applications that take advantage of multiple CPUs.

The latter point seems very credible to me, as it is an application of the scratch my itch logic. But perhaps it will be applied a bit too much: it seems likely to me that development tools will be the first to be enhanced to take advantage of multiple CPUs on a chip, because they are the main itch of a developer, and they are also relatively easy to parallelize.
So I would not be too optimistic about applications other than embarassingly parallel ones or development tools taking much advantage of parallellism anytime soon. I am more optimistic about their diffusion though: while as I am typing on this 2-CPU laptop only my editor is expending a minuscule amount of CPU time on one at a time, in a different context someone might be listening to music or watching streaming video or compiling the kernel in the background.
Or perhaps multiple CPU chips are the enabler for highly parallel multi streaming servers and that is an important enough market that multiple CPU chips become the default even if they are a bit wasted on desktop users. Perhaps multiple CPU chips will be the key enabler for Snow crash Metaverse style clients, which are in effect servers too.
Or perhaps There Is No Alternative and multiple CPU chips are all we can get. As to that, it is not so easy to find single CPU chips anymore, at least for desktops and laptops.
Then perhaps the best feature of multiple CPU chips is indeed that most of the time only one CPU is used, as this then allows saving power by shutting down the other CPUs (in a somewhat clever way to avoid differential thermal expansion issues). This might help save more power than just dynamically underclocking a single large CPU. Especially perhaps for desktops and laptops that tend to have burstier loads than servers.

080417 Thu Large storage pools and Lustre

In the past few weeks I have been discussing the network load implications of large storage pools, so I have had a look at recent storage pool technology, and I have not been surprised to see that a lot of new software has appeared in the last few years, and there are some interesting developments. But first here are my personal impressions as to what the subject is:

A large storage seems to me to be large enough that cannot be implemented easily as a single disk array (currently around 5-10TB and above) or also something that cannot be reasonably implemented as a single filesystem (also 5-10TB and above).
A pool because the ambition when setting up so much storage is to use it flexibly, that is to share it among many different application areas, with logical, flexible subdivisions instead of physical, rigid ones.

The first aspect means that one needs to use multiple servers, to provide aggregate capacity beyond that of a single array, and the second aspect that a single logical filesystem must be possibile even if a single physical one is not.
It is also desirable that the pool be scalable by adding more capacity to support more physical filesystems (rather than enlarging the existing filesystems). Of these large storage pools there are two varieties, those where only capacity is scalable, and those where performance grows with capacity.
Recent or not so recent software packages that implement something like the above are Lustre, Gluster, Ceph, the not quite finished HAMMER plus various hopeful proprietary designs such as Isilon, or Ibrix. Among these I have been especially interested in Lustre, in large part because I have noticed how there are some very large installations that have been using it successfully for quite a while (for example CEA, Sanger and others), but also because it offers at least a temporary solution to one of my biggest worries, the time and space needed for filesystem checking. Lustre is based on two basic ideas:

To use a filesystem as a storage device.
To split the files in a filesystem across multiple storage devices (that is, lower level filesystems).

This means having a filesystem tree spanning several lower level filesystems on several different machines. To provide a filesystem single-system image it has a (possibly replicated) metadata server with pointers to the various bits of the single image. It is a bit of a weak point, but a filesystem server usually must have some kind of entry point; or conceivably use a lot of multicastingcasting, which has its own downsides. In practice the most obvious downside is that Lustre performs well when used with large files, as then the metadata to data ratio is low.
The Lustre design as a second order file system has the advantage that the lower order filesystems can be checked in parallel, and the second order metadata can also be checked in parallel. This allows quite a bit of scaling, at least in capacity, which is often as important or more so than speed.

080417 Thu Number of patches in Red Hat kernels over time

The RHEL kernels have lots of patches. In large part because the RHEL kernels contain several backported features, but also because of the social nature of the concept of works of any large software system: that not too many people complain at any one time. Which means that a long lived software package accumulates a lot of complaints and (ideally...) fixes.
What matters in the end is that the cross section between the software and user issues be small enough. Using the metaphor of software as a lens in a macroscope, what matters is that it be transparent enough for it to remain useful. Unfortunately many fixes introduce new problems as they trigger new bugs, or removing spots on the lens might introduce new blemishes. Thus the reluctance of software vendors to fix bugs that affect only a minority of users: fixing them can create new bugs that affect the majority, and that would be really bad.
The alternative is to make and effort and design things to be sort of simple and code them mostly right so there be few mistakes by construction, not by endless scratching of itches.

080416 Wed Two rather different types of clusters

It may not be always clear, but it is important when talking about clusters to say whether their aim is to offer:

Resilience, usually by transparent failover.
Speed, usually by exploiting parallelism.

The ambiguity of the term is especially noticeable when talking about clusters with a storage component also because it is relatively rare for compute clusters to offer resilience, in the sense of redundant computation; while some (very expensive) storage clusters target both relience and speed.
Also because speed and resilience for computation used to be achieved by tightly coupled multiprocessing (for example Tandem and Stratus products) and the first clusters were loosely coupled arrangements (not many perhaps still remember JES3) with replicated storage rather than parallel computations. Then Beowulf style pure compute clusters became popular as tightly coupled multiprocessing remained expensive and 32 and 64 bit microprocessors based PCs and Ethernet became rather inexpensive. To the point that in many cases cluster on its own is used to indicate the Beowulf variety.

080415 Tue Dimensions of filesystem performance

When discussing file system performance it is not very sensible to think of it as a single figure of merit as the performance envelopes of various file systems are often quite differently shaped, and that shape is often misperceived. Among different types of performance those more commonly recognized are:

Read or write.
Data or metadata.
Small or large file.
Sequential or random.
Bandwidth or latency.

Another quite important dimension is parallelism, of which the most important cases are, roughly in order of increasing challenge:

Single thread to single file, for example data collection.
Single thread to many files, for example a compilation or the creation of an archive.
Many threads each to a different file, for example a time-sharing system or a file server.
Many threads to a single file, for example a DBMS.

The more challenging degrees of parallelism require fine grained yet low overhead locking for in-memory structures and coarse yet flexible allocation of storage media space (to acoomodate the simultaneous growth of many files without interleaving them too much).
Among commonly used file system types two that cover most of those spaces well are JFS and XFS but there are some perfomance differences: XFS is more suitable for high degrees of parallelism, in particular with many threads writing to a file, and JFS is more msuitable when there are accesses to many small files (even if not quite as good as ReiserFS if there are just accesses to many very small files). Some other less commonly recognized but important dimensions of performance are:

Time and memory required for recovery.
Write speed with or without barriers.
Application level locking overheads.
Access in large directories (a sub-case of metadata performance).
Access of extended attributes (a sub-case of metadata performance).

080414 Mon Performance tests of 2-CPU and 4-CPU chips

Curiosity about the performance of recent multiple CPUs is being satisfied by an increasing number of recent tests which do not much change the conclusions from some earlier ones: 4-CPU chips work really well, but most applications cannot take advantage of them, and that relatively few applications benefit much from 2-CPU chips. Even games designed for multithreading (usually for the Xbox 360 and PS3 consoles) struggle a bit to to take advantage of a second CPU, even if it usually helps, and sometimes significantly.
However multiple CPU chips are still broadly useful for multitasking, or inter-applications parallelism, and some useful CPU intensive media related scientific applications do make good use of 2 CPUs and even 4. But 2 seem to be best right now for most desktop style usage.

080407 Mon A cheap large reliable storage pool system

Discussing how to design a cheap large storage pool, my very summary proposal was based on my usual favourites for large storage, the Sun X4500 storage server (Thumper), Dell 2950 generic servers, Lustre, Linux RAID10, and DRBD for mirroring over the network. More specifically the target is around 200TB of cheap, highly available storage as a single pool or a few large ones. Some of the details:

Two 42U racks.
8 Sun Thumpers each with 48×1TB disks, 4 per rack.
2 Dell 2950s each with 6 SAS drives, 1 per rack.
Each drive on a Sun Thumper in rack 1 mirrored with a drive on a Sun Thumper in rack 2 using DRBD.
On each Sun Thumper 8 RAID0 arrays of 6 DRBD RAID1s, the Nth array made of the Nth drives on each of the 6 host adapters in a Sun Thumper.
Each of the 32 RAID10s holding a 6TB OST.
The 2950s holding the array's MDT master and backup copies.
Each Thumper and 2950 with a nice 10gb/s card or two.
Each rack with a recent Force 10 S2410P switch with 24 10gb/s ports.

In other words some of the latest and greatest, more or less. Well, despite it being so, this outline has some obvious weaknesses (too few too small MDTs given the numbers of OSTs), but one MDT server per 4 OST servers (even if each of these has 8 pretty large OSTs) may still be tolerable and would give more than 200TB of single filesystem image, something that I am not sure is worth doing.
Assuming an average file size of 1MB a 200TB storage pool will have 200 million files, and that is pretty high. Fortunately Lustre checks each OST in parallel and each would then have on average only 6 million objects, and serial checking is sort of reasonably fast for that size of OST.
The cost for the whole might be under US$400,000 for an academic site, as Sun do some generous academic pricing for the X4500 (and several other excellent products), and as to overheads the products above seem to be quite power, cooling and space efficient.
Each 6x(1+1)-drive OST might return between 250-500MB/s (for example depending on whether reading or writing, outer or iiner tracks, and whether the RAID10 has near or far layout), and striping large files across OSTs might improve that. Striping large files across OSTs in different Thumpers, of which there are 4 per rack, gives a stripe width of 4, and probably at least 1GB/s of single client performance (with RAID10 with options -p f2 read speed might be significantly higher at the expense of lower write speed). In my dreams perhaps, but not entirely delusionary.
I have lots more comments on this kind of setup, and a draft post on large storage pool that has been pending for months, and this is all

080406 Sun Much improved file system checking for XFS

Ah, I have forgotten to mention some rare good news as to file system checking. Recently the memory requirements for XFS checking have gone down a lot, apparently because of some algorithm improvements by Barry Naujok:

General rule of thumb at the moment is 128MB of RAM/TB of filesystem plus 4MB/million inodes on that filesystem.

Right now, I can repair a 9TB filesystem with ~150 million inodes in 2GB of RAM without going to swap using xfs_repair 2.9.4 and with no custom/tuning/config options.

Which is a significant improvement on previously reported space requirements.

080405c Sat Some other recent CPU developments

While reading some new about the most recent Xeon and Opterons the low power consumption of a Penryn 45nm quad-CPU chip:

As Intel launches the L5420, a low power Xeon at 2.5 GHz. This CPU consumes 50 W (TDP), less than 12.5W per core thus, and only 16W (4 W per core) when running idle. The CPU consumes as little power as the previous 65 nm L5335, but performs about 30% better in for example Povray, Sungard and Cinebench.

and in another article: a description of the physical chareacteristics of that type of IC (each quad-core Xeon chip has two of these ICs each with two CPUs):

These days, Intel manufacturers millions of Core 2 Duo processors each made up of 410 million transistors (over 130 times the transistor count of the original Pentium) in an area around 1/3 the size.

In the same article another amazing statement as to the physical cost of an x86 CPU decoder:

Back when AMD first announced its intentions to extend the x86 ISA to 64-bits Iasked Fred Weber, AMD's old CTO, whether it really made sense to extend x86 or if Intel made the right move with Itanium and its brand new ISA. His response made sense at the time, but I didn't quite understand the magnitude of what he was saying.

Fred said that the overhead of maintaining x86 compatibility was negligible, at the time around 10% of the die was the x86 decoder and that percentage would only shrink over time. We're now at around 8x the transistor count of the K8 processor that Fred was talking about back then and the cost of maintaining x86 backwards compatibility has shrunk to a very small number.

As I have been fond of repeating over many years, for now CPU architecture is dead, as immense transistor budgets make it almost irrelevant. The same article again also talks about another one of my favourite topics, chips with many CPUs, of which a CPU design with relatively few transistors bould be a building block:

Built on Intel's 45nm manufacturing process, the Atom is Intel's smallest x86 microprocessor with a < 25 mm^2 die size and 13 mm x 14 mm package size. Unlike previous Intel processors that targeted these market segments, Atom is 100% x86 compatible (Merom ISA to be specific, the Penryn SSE4 support isn't there due to die/power constraints).

The article reports quite a few interesting detail about this building block CPU, initially to be used for palmtops, but which can still run 64 bit code.

080405b Sat Recent AMD and Intel quad-CPU chip test

While reading a rather interesting report of performance for quad and dual CPU AMD and Intel chips I found a most unusual page with tests that exercise all CPUs in a quad CPU chip. The tests are not from actual running games, but from a demo and a tool. The results are fairly interesting and to me they show the advantage of low latency and large caches.

080405 Sat Another example of inappropriate error messages

One of my usual frustrations with incompetently written software is the extremely poor quality of error messages. Writing useful error messages does not take that much longer than writing useless ones, but it seems beyond the intellectual capacity of many programmers. I have been very frustrated trying to figure out problems with a Kerberos 5 setup, the error messages printed have appalled me, for example:

Apr  5 15:05:12 tree kernel: gss_create: Pseudoflavor 390005 not found!<6>RPC: Couldn't create auth handle (flavor 390005)
Apr  5 15:05:23 tree kernel: gss_create: Pseudoflavor 390005 not found!<6>RPC: Couldn't create auth handle (flavor 390005)

But one of the most ironic case is the printing of error messages from the syslog dæmon itself:

Apr  4 18:03:16 tree syslogd: select: Invalid argument

Apr  4 18:04:16 tree syslogd: select: Invalid argument

Fortunately there is strace to figure out what is the actual error, no thanks to the authors of so many bad error messages:

time(NULL)                              = 1207404572
writev(1, [{"Apr  5 15:09:32", 15}, {" ", 1}, {"", 0}, {"base", 4}, {" ", 1}, {"syslogd: select: Invalid argumen"..., 33},
fsync(1)                                = -1 EINVAL (Invalid argument)
writev(2, [{"Apr  5 15:09:32", 15}, {" ", 1}, {"", 0}, {"base", 4}, {" ", 1}, {"syslogd: select: Invalid argumen"..., 33},
fsync(2)                                = -1 EINVAL (Invalid argument)
writev(6, [{"Apr  5 15:09:32", 15}, {" ", 1}, {"", 0}, {"base", 4}, {" ", 1}, {"syslogd: select: Invalid argumen"..., 33},

So the syslogd error message is not just useless, it is wrong too. The file descriptors on which the fsync report an error for for files that not plain block device files:

*.info                                          |/var/spool/xconsole
*.=debug                                        |/var/spool/xconsoled

as these are named pipes. Adding a - before the | removes the error, as the - tells syslogd not to fsync on every line.

080402 Wed More data on the cost of the Linux page cache

As usual I have been doing some tests as to the performance profile of Linux block oriented storage and among the several disappoinments a confirmation of just how expensive is the GNU/Linux page cache. A subset of some of my more recent tests is:

readahead	cached write	direct write	cached read	direct read
16	192MiB 28%	253MiB 10%	84MiB 28%	522MiB 24%
512	165MiB 24%	253MiB 10%	478MiB 68%	496MiB 22%
16384	176MiB 25%	262MiB 11%	673MiB 61%	481MiB 21%

In each entry there is a transfer rate in MiB/s and the corresponding CPU utilization as reported by Bonnie 1.4. The tests were done on a moderately high end system with these characteristics:

Two four-CPU Xeon chips at 2.4GHz with 8GiB of RAM.
Two MegaRAID SAS host adpters, each connected to 4 hard drives (2 SAS 400GB and 2 SATA2 1TB) arranged as a 4×(1+1) RAID10.
RHEL51 GNU/Linux with JFS.

The point here is not the scandalous dependence on read-ahead, but the scandalous CPU overhead of using the page cache instead of O_DIRECT. There should be a difference as going through the page cache means an extra memory-to-memory copy, which in this singular test is not amortized over multiple uses, but it is the level of overhead here that is amazing.
In a test in which both direct and cached reads reach much the same transfer rate of nearly 500MB/s the CPU overhead of the page cache is three times larger than that for direct IO. That is, memory-to-memory copying overheads costs twice as much in CPU time as reading from disk. The latter benefits from assistance from a DMA capable host adapter, but the CPU overhead of the page cache is not that agreeable, as it suggests that direct IO can do at 100% CPU around 2250MiB/s, and cached IO only around 725MiB/s (with a read-ahead of 256KiB). This on a machine whose memory subsystem has a bandwidth of several GiB/s:

# blockdev --setra 16384 /dev/md0
# hdparm -tT /dev/md0

/dev/md0:
 Timing cached reads:   20748 MB in  2.00 seconds = 10396.44 MB/sec
 Timing buffered disk reads:  1966 MB in  3.00 seconds = 655.21 MB/sec

So what's going here? Which part of the page cache subsystem is being awful? Even more confusingly, why the much lower number for cached reads here, when specifying not to do caching?

# hdparm -tT --direct /dev/md0

/dev/md0:
 Timing O_DIRECT cached reads:   3452 MB in  2.00 seconds = 1725.17 MB/sec
 Timing O_DIRECT disk reads:  1430 MB in  3.00 seconds = 476.20 MB/sec

At least in this something can be discovered: that cached reads means reading once the first 2MiB of the block device, and then timing repeated reads of the same, but in that case it is odd that it is faster than sequential reading of the same. Also, the array drives even when all are transferring can at most do around 800MB/s aggregate, so obviously some caching is going on despite the O_DIRECT, probably though in the on-drive RAM buffers, which are rather larger than 2MiB. So probably 1.7GB/s is the maximum speed over the two SAS host adapters and buses being used, which sort of figures, as that is just under 8 PCIe lanes.