Software and hardware annotations 2011 April

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

110420 Wed Comprehensive page on using 'find'

Discovered a web page devoted to the find utility in its numerous variants. It has many useful notes, including some subtleties that are somewhat unexpected.

110417b Sun Wide variation in SATA chipset transfer rates

While browsing the same site where I read the review mentioned in the previous entry I found a test of the transfer speeds of common SATA chipsets, and I was surprised to see very wide variations: 130MB/s to 362MB/s for large block sequential reads and 123MB/s to 228MB/s for large block sequential writes.

There is something a bit weird here though, as the cluster of results around 120-130MB/s hints that the tests were done with a contemporary hard disk drive (these are capable of reading or writing sequentially at that rate on the outer tracks), but the tests actually were done with a 256GB Crucial Real SSD C300, so the difference in transfer rates might actually be somewhat chipset dependent rather than drive dependent.

What is more interesting perhaps is the very large difference in transfer rates on the small block size sequential reads and writes, where with 4KiB blocks reads ranged between 44MB/s and 147MB/s (clustering around 80MB/s), and writes ranged between 45MB/s and 110MB/s.

Obviously 4KiB/block IO not only has higher overheads than 4MiB/block IO, but some chipsets have much higher overheads than others. Or perhaps it is the MS-Windows drivers of those chipsets, as it is pretty difficult to disentangle the two in a test like that (and for MS-Windows users fairly pointless too).

110417 Sun Very high write latencies on SSDs

I was explaining recently to a bright guy that the most important parameters of SSDs are erase block size and time to erase, as those drive the ability to do simultaneous read-write and in general to reduce read-erase-write cycles.

Then I read a fairly decent review of one of the best SSDs around, and in the comparative tests there are very interesting graphs showing that some SSDs have catastrophic maximum random write latencies of the order of hundreds of milliseconds, including one of the well-regarded Intel MLC based drives.

This is quite interesting, even if not necessarily typical, as it obviously depends on the vagaries of read-erase-write opportunities.

I was a bit disappointed that the review did not seem to have an interleaved read-write test, as these also have highly variable results depending on read-erase-write cycles.

110411 Sat Library to disable data persistency guarantees

The (should be) famous O_PONIES controversy about whether file systems should ensure persistence of data on request has a new interesting aspect: the aptly named eatmydata library, which is designed to nullify the persistency requests made by applications. The rationale for its existence is described as:

This package contains a small LD_PRELOAD library (libeatmydata) and a couple of helper utilities designed to transparently disable fsync and friends (like open(O_SYNC)). This has two side-effects: making software that writes data safely to disk a lot quicker and making this software no longer crash safe.

You will find eatmydata useful if particular software calls fsync(), sync() etc. frequently but the data it stores is not that valuable to you and you may afford losing it in case of system crash. Data-to-disk synchronization calls are typically very slow on modern file systems and their extensive usage might slow down software significantly.

Here the sort-of-true but very worrying statement is the last. It is true that "Data-to-disk synchronization calls are typically very slow", but that is entirely unrelated to "modern file systems", because the cause is modern (and not so modern) storage systems.

Persistent storage systems tend to have colossal per-transaction overheads, and in particular latencies, because they are often only weakly random-addressable (it is possible to access them randomly, just much more slowly than sequentially), and because ensuring consistent data persistency requires a number of scattered accesses, except for special cases, and that number can be dramatically increased by common poor choices of storage system organization.
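The per-transaction cost can be observed with a rough sketch like the following, which writes the same records either with one fsync per record or with a single fsync at the end. The function names, record count and record size are my own illustrative choices, and the timings it returns depend entirely on the storage system underneath, so no particular ratio should be expected.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each):
    """Write records to path, with fsync after every record or once at the end.

    Returns (elapsed_seconds, number_of_fsync_calls).
    """
    syncs = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for rec in records:
            os.write(fd, rec)
            if sync_each:
                os.fsync(fd)  # one persistence transaction per record
                syncs += 1
        if not sync_each:
            os.fsync(fd)  # a single persistence transaction for everything
            syncs += 1
    finally:
        os.close(fd)
    return time.perf_counter() - start, syncs


def compare(n=200, size=4096):
    """Time both strategies on a temporary directory (illustrative defaults)."""
    records = [b"x" * size] * n
    with tempfile.TemporaryDirectory() as d:
        each = write_records(os.path.join(d, "each"), records, True)
        once = write_records(os.path.join(d, "once"), records, False)
    return {"each": each, "once": once}
```

On a rotating disk I would expect the per-record variant to be slower by a large factor, but that is exactly the kind of thing that should be measured rather than assumed.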

However the impression that it is only recently that "Data-to-disk synchronization calls are typically very slow on modern file systems" may have originated in some independent accidental circumstances:

In many popular recent PC-level OSes Data-to-disk synchronization has not been a particularly pressing concern, giving many users the illusion of cheapness.
The amount of data written by applications has been probably increasing a lot, in part because of ever greater automagic persistence of state, but also because media files (photos, movies, ...) have become far more common.
Application programmers have also unlearned that it is very, very expensive for file systems and storage layers to handle many small files, and collections of files (for example ar collections, but also index files and databases) have been forgotten. I routinely see on various mailing lists inane requests for advice on how to optimize filesystems for storing hundreds of millions of 1-4KiB files.

The notable point is that application programmers seem now to have gone from almost never issuing Data-to-disk synchronization calls to perhaps issuing them too frequently and automatically. One problem here is for example that if one wants to persist 10 related files one has to issue 10 separate OS calls, instead of one, paying a huge additional price.
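To illustrate the 10-files-10-calls point, here is a minimal sketch that contrasts persisting each file separately with packing them into a single tar collection that is persisted with one call. The function names and the archive name are hypothetical, and the sketch ignores the additional fsync of the containing directory that strict durability of newly created files would also need.

```python
import io
import os
import tarfile
import tempfile


def persist_separately(directory, files):
    """Write and fsync each file on its own; returns the number of fsync calls."""
    syncs = 0
    for name, data in files.items():
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs


def persist_as_archive(directory, files, archive_name="bundle.tar"):
    """Pack all files into one tar archive and fsync it once."""
    path = os.path.join(directory, archive_name)
    with open(path, "wb") as f:
        with tarfile.open(fileobj=f, mode="w") as tar:
            for name, data in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        f.flush()
        os.fsync(f.fileno())  # the only persistence request for all members
    return 1
```

The archive variant pays the synchronization price once regardless of how many related files are involved, which is the old argument for collections of files.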

eatmydata is certainly an interesting development, but it is a symptom of a profoundly wrong situation with applications, storage systems and operating system APIs. There is little hope that wishful looking forward to O_PONIES will stop.

110401 Fri One reason why parity RAID is mistakenly popular

I have seen many times system administrators set up very wide (up to 48 drives) parity RAID sets, usually as RAID6, when they are often ludicrously inappropriate.

One aspect is the extreme constraining of good performance due to very wide stripes with high alignment requirements to avoid frequent RMW cycles.

But only recently I realized why the other point that I often mention gets underestimated: that one (RAID5) or two (RAID6) drives' worth of redundancy among dozens is just too little.

While mentioning this to someone I got a startled reply as to the low probability of two drives failing at the same time.

The implication that I suddenly realized is that the speaker assumed that the probability of two drives failing out of a set is independent of the size of the set, something that amazingly had not occurred to me as a possible (and popular) way of thinking before.
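That the probability does depend on the size of the set can be checked with a small binomial calculation. This is only a sketch: it assumes that drive failures are independent and that every drive has the same probability p of failing within some period, both of which are simplifications (real failures are often correlated), and the function name is my own.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n drives fail within a period.

    Assumes independent failures, each with the same probability p.
    """
    # 1 minus the probability of fewer than k failures.
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def compare_widths(k, widths, p):
    """Return (width, probability) pairs for several RAID set widths."""
    return [(n, prob_at_least(k, n, p)) for n in widths]
```

For example compare_widths(2, [4, 8, 16, 48], 0.03), with a purely illustrative per-drive probability of 3%, gives a probability that grows steadily with the width of the set.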

Another seemingly popular way of thinking relates the probability of drives failing to the length of the rebuild period once one drive has failed already; here the more plausible argument is that since drive speed grows much more slowly than drive capacity, rebuild times have become ever longer, and during a long rebuild time (several days is not unheard of on busy RAID sets) the probability of further failure is proportional to the length of that rebuild time.

This argument has elements of plausibility, as drive probability of failure is indeed in percent of population per unit of time, so it is time dependent, but that is the lesser issue. The bigger issue is percent of population. There are secondary effects like extra RAID set stress during rebuild, and those may matter more.
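The interplay of time and population can be made concrete with a sketch based on a constant failure rate, which is an assumption of mine and ignores the extra stress during rebuild mentioned above. What matters in this model is the exposure in drive-years, that is the number of surviving drives multiplied by the rebuild time; the function and parameter names are hypothetical.

```python
from math import exp

HOURS_PER_YEAR = 8760.0


def prob_failure_during_rebuild(surviving_drives, rebuild_hours, annual_rate):
    """Probability that at least one surviving drive fails during the rebuild.

    Assumes a constant failure rate (annual_rate, in failures per drive-year)
    and independent drives.
    """
    exposure = surviving_drives * rebuild_hours / HOURS_PER_YEAR  # drive-years
    return 1.0 - exp(-annual_rate * exposure)
```

In this model doubling the number of surviving drives has exactly the same effect as doubling the rebuild time, so a wide set with a long rebuild is penalized twice.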

But the bigger point is that this assumes that very long rebuild times are fine, and this is so wrong: they aren't. Rebuild periods are dangerous, and they should be as short as possible. Again here RAID10 has the advantage, as rebuilding a pair is a much simpler, quicker and more reliable operation than rebuilding a parity stripe (only two disks are involved in the rebuild, instead of the whole width of the array), and having wide RAID sets is far less of a risk to reliability and performance with RAID10 than with parity RAID.

It is still a worry, however, that RAID sets should not be large, and that with ever larger drives this means that the IOPS/TB of each RAID set goes down with time. But that is because the IOPS/TB of each drive goes down with time, and however unpleasant that is, it is pretty much unavoidable. Except with SSDs, which however have their own RMW issues with large erase blocks.
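The IOPS/TB arithmetic is trivial but worth spelling out. The sketch below assumes that the random IOPS of a rotating drive stay roughly constant across drive generations while capacity grows; the figure of 100 IOPS used in the example is an illustrative assumption, not a measurement.

```python
def iops_per_tb(drive_iops, drive_tb):
    """Random IOPS available per TB of capacity for one drive."""
    return drive_iops / drive_tb


def trend(drive_iops, capacities_tb):
    """IOPS/TB for a series of drive capacities at a fixed IOPS figure."""
    return [(tb, iops_per_tb(drive_iops, tb)) for tb in capacities_tb]
```

With an assumed 100 IOPS per drive, trend(100, [0.5, 1, 2]) goes from 200 to 100 to 50 IOPS/TB, and since a RAID set of identical drives multiplies both IOPS and capacity by the same number of drives, the ratio for the set follows that of the single drive.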