This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
Discovered a web page devoted to the find utility in its numerous variants. Many useful notes, including some subtleties that are somewhat unexpected.
While browsing the same site that hosts the review mentioned in the previous entry I found a test of the transfer speeds of common SATA chipsets, and I was surprised to see very wide variations: 130MB/s to 362MB/s for large block sequential reads and 123MB/s to 228MB/s for large block sequential writes.
There is something a bit weird here though, as the cluster of results around 120-130MB/s hints that the tests were done with a contemporary hard disk drive (such drives are capable of reading or writing sequentially at that rate on the outer tracks), but the tests actually were done with a 256GB Crucial RealSSD C300, so the difference in transfer rates might actually be somewhat chipset dependent rather than drive dependent.
What is more interesting perhaps is the very large difference in transfer rates on the small block size sequential reads and writes, where with 4KiB blocks reads ranged between 44MB/s and 147MB/s (clustering around 80MB/s), and writes ranged between 45MB/s and 110MB/s.
Obviously 4KiB/block IO not only has higher overheads than 4MiB/block IO, but some chipsets have much higher overheads than others. Or perhaps it is the MS-Windows drivers of those chipsets, as it is pretty difficult to disentangle the two in a test like that (and for MS-Windows users fairly pointless too).
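To put the small block figures in perspective, here is a minimal sketch that converts the quoted 4KiB read rates into requests per second and time per request. It assumes that the test reports decimal megabytes per second; the function name is mine, not from the test.

```python
KIB = 1024
MB = 1_000_000  # assumption: the test reports decimal megabytes per second


def per_request_cost(throughput_mb_s, block_bytes):
    """Return (requests per second, microseconds per request)."""
    requests_per_s = throughput_mb_s * MB / block_bytes
    return requests_per_s, 1e6 / requests_per_s


# 4KiB sequential read rates quoted above: slowest, typical cluster, fastest.
for label, mb_s in [("slowest", 44), ("typical", 80), ("fastest", 147)]:
    rate, usec = per_request_cost(mb_s, 4 * KIB)
    print(f"{label}: {mb_s}MB/s is {rate:,.0f} requests/s, {usec:.1f}us per request")
```

At 80MB/s each 4KiB request takes about 51 microseconds, so a few tens of microseconds of extra per-request overhead in a chipset or its driver is enough to explain the spread.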
I was explaining recently to a bright guy that the most important parameters of SSDs are erase block size and time to erase, as those drive the ability to do simultaneous read-write and in general reduce read-erase-write cycles.
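A rough illustration of why erase block size matters: in the worst case, where an update forces a full read-erase-write of every erase block it touches, the amount physically rewritten can be far larger than the amount logically written. The sizes below are illustrative assumptions, not figures for any specific drive, and real firmware usually avoids the worst case by remapping.

```python
def worst_case_write_amplification(erase_block_bytes, write_bytes):
    """Bytes physically rewritten per byte logically written, assuming an
    aligned write that forces read-erase-write of every block it touches."""
    blocks_touched = -(-write_bytes // erase_block_bytes)  # ceiling division
    return blocks_touched * erase_block_bytes / write_bytes


KIB = 1024
# Assumed sizes: 512KiB erase blocks, 4KiB and 1MiB writes.
print(worst_case_write_amplification(512 * KIB, 4 * KIB))
print(worst_case_write_amplification(512 * KIB, 1024 * KIB))
```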
Then I read a fairly decent review of one of the best SSDs around, and in the comparative tests there are very interesting graphs showing that some SSDs have catastrophic maximum random write latencies of the order of hundreds of milliseconds, including one of the highly regarded Intel MLC based drives.
This is quite interesting, even if not necessarily typical, as it obviously depends on the vagaries of read-erase-write opportunities.
I was a bit disappointed that the review did not seem to have an interleaved read-write test, as these also have highly variable results depending on read-erase-write cycles.
The (should be) famous O_PONIES controversy about file systems that ensure or not persistence of data on request has a new interesting aspect: the aptly named eatmydata library that is designed to nullify the calls to persistency requests in applications. The rationale for its existence is described as:
This package contains a small LD_PRELOAD library (libeatmydata) and a couple of helper utilities designed to transparently disable fsync and friends (like open(O_SYNC)). This has two side-effects: making software that writes data safely to disk a lot quicker and making this software no longer crash safe.
You will find eatmydata useful if particular software calls fsync(), sync() etc. frequently but the data it stores is not that valuable to you and you may afford losing it in case of system crash. Data-to-disk synchronization calls are typically very slow on modern file systems and their extensive usage might slow down software significantly.
Here the sort-of-true but very worrying statement is the last. It is true that "synchronization calls are typically very slow", but that's entirely unrelated to "modern file systems", because the cause is modern (and not so modern) storage.
Persistent storage systems tend to have colossal per-transaction overheads, and in particular latencies, because they are often only weakly random-addressable (it is possible to access them randomly, just much more slowly than sequentially) and because ensuring consistent data persistency requires a number of scattered accesses, except for special cases, and that number can be dramatically increased by common poor choices of storage system organization.
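The per-transaction cost is easy to observe. The sketch below is only an illustration (the function name and the record counts are mine): it times the same writes with and without a synchronization request after each record. Results depend entirely on the storage behind the temporary directory; on a memory-backed file system the difference may nearly vanish.

```python
import os
import tempfile
import time


def time_writes(count, payload, sync):
    """Write `count` records to a temporary file and return elapsed seconds.

    If `sync` is true, request persistence after every record."""
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        for _ in range(count):
            f.write(payload)
            if sync:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS to push it to stable storage
        f.flush()
        return time.perf_counter() - start


if __name__ == "__main__":
    record = b"x" * 4096
    print("buffered:", time_writes(200, record, sync=False))
    print("fsync per record:", time_writes(200, record, sync=True))
```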
However the impression that it is only recently that "Data-to-disk synchronization calls are typically very slow on modern file systems" may have originated in some independent accidental
Data-to-disk synchronization has not been a particularly pressing concern, giving many users the illusion of cheapness.
optimize filesystems for storing hundreds of millions of 1-4KiB files.
The notable point is that application programmers seem
now to have gone from almost never issuing
Data-to-disk synchronization calls to perhaps
issuing them too frequently and automatically. One
problem here is for example that if one wants to persist
10 related files one has to issue 10 separate OS calls,
instead of one, paying a huge additional price.
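As a sketch of what that looks like in practice (the helper name is hypothetical), persisting a group of related files means one synchronization call per file, each paying its own latency:

```python
import os
import tempfile


def persist_all(directory, files):
    """Write each file and fsync it individually.

    Returns the number of synchronization calls issued: one per file,
    since there is no single call covering the whole group."""
    sync_calls = 0
    for name, data in files.items():
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
            sync_calls += 1
    return sync_calls


with tempfile.TemporaryDirectory() as d:
    related = {f"part{i}.dat": b"payload" for i in range(10)}
    print(persist_all(d, related), "synchronization calls for", len(related), "files")
```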
eatmydata is certainly an interesting development, but it is a symptom of a profoundly wrong situation with applications, storage systems and operating system APIs. There is little hope that wishful looking forward to O_PONIES will stop.
I have seen many times system administrators set up very wide (up to 48 drives) parity RAID sets, usually as RAID6, when they are often ludicrously inappropriate.
One aspect is the extreme constraining of good performance due to very wide stripes with high alignment requirements to avoid frequent RMW cycles.
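To make the alignment requirement concrete, here is a minimal sketch, assuming a 64KiB chunk size (an illustrative value, not a recommendation): a write avoids RMW only if it starts on a stripe boundary and covers whole stripes, and on a wide set the stripe is very large.

```python
KIB = 1024


def full_stripe_bytes(total_drives, parity_drives, chunk_bytes):
    """Size of the data portion of one full stripe."""
    return (total_drives - parity_drives) * chunk_bytes


def needs_rmw(offset, length, stripe_bytes):
    """True if a write does not cover whole stripes exactly, so parity
    cannot be computed from the new data alone."""
    return offset % stripe_bytes != 0 or length % stripe_bytes != 0


stripe = full_stripe_bytes(48, 2, 64 * KIB)  # 48-drive RAID6, assumed 64KiB chunks
print(stripe // KIB, "KiB per full stripe")
print("1MiB aligned write needs RMW:", needs_rmw(0, 1024 * KIB, stripe))
```

With these assumptions a full stripe is 2944KiB, an awkward size that few applications or file systems will write in aligned multiples.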
But only recently I realized why the other point that I often mention gets underestimated: that one (RAID5) or two (RAID6) drives' worth of redundancy among dozens is just too little.
While mentioning this to someone I got a startled reply as to the low probability of two drives failing at the same time.
The implication that I suddenly realized is that the speaker assumed that the probability of two drives failing out of a set is independent of the size of the set, something that amazingly had not occurred to me as a possible (and popular) way of thinking before.
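That the set size matters can be checked with a simple binomial calculation. This is a sketch under loud assumptions: drive failures are treated as independent (real ones are often correlated, which makes things worse) and the 3% per-period failure probability is an arbitrary illustrative value.

```python
from math import comb


def p_at_least(k, n, p):
    """Probability that at least k of n independent drives fail within a
    period in which each fails with probability p."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


P_FAIL = 0.03  # assumed per-drive failure probability over the period
for n in (4, 8, 16, 48):
    print(f"{n:2d} drives: P(2 or more fail) = {p_at_least(2, n, P_FAIL):.4f}")
```

The probability of two or more failures grows quickly with the number of drives in the set, while the redundancy of RAID5 or RAID6 stays fixed at one or two drives.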
Another seemingly popular way of thinking relates the probability of drives failing to the length of the rebuild period once one drive has failed already; here the more plausible argument is that since drive speed grows much more slowly than drive capacity, rebuild times have become ever longer, and during a long rebuild time (several days is not unheard of on busy RAID sets) the probability of further failure is proportional to the length of that rebuild time.
This argument has elements of plausibility, as drive probability of failure is indeed in percent of population per unit of time, so it is time dependent, but that is the lesser issue. The bigger issue is percent of population. There are secondary effects like extra RAID set stress during rebuild, and those may matter more.
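Both factors can be put in one small model. The sketch below assumes independent failures at a constant rate derived from an annual failure rate, and ignores the extra stress of the rebuild itself; the 3% rate and the 72 hour rebuild are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760


def p_further_failure(surviving_drives, annual_failure_rate, rebuild_hours):
    """Probability that at least one surviving drive fails during the rebuild."""
    survive_one = (1 - annual_failure_rate) ** (rebuild_hours / HOURS_PER_YEAR)
    return 1 - survive_one ** surviving_drives


AFR = 0.03  # assumed annual failure rate per drive
for drives in (1, 7, 47):
    for hours in (12, 72):
        p = p_further_failure(drives, AFR, hours)
        print(f"{drives:2d} surviving drives, {hours:2d}h rebuild: {p:.5f}")
```

In this model the risk scales roughly with the product of rebuild time and number of surviving drives, so a wide set multiplies whatever risk a long rebuild already carries.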
But the bigger point is that this assumes that very long rebuild times are fine, and this is so wrong: they aren't. Rebuild periods are dangerous, and they should be as short as possible. Again here RAID10 has the advantage, as rebuilding a pair is a much simpler, quicker and more reliable operation than rebuilding a parity stripe (only two disks are involved in the rebuild, instead of the whole width of the array), and wide RAID sets are far less of a risk to reliability and performance with RAID10 than with parity RAID.
It is still however a worry that RAID set size should not be large, and that with ever larger drives this means that the degree of IOPS/TB per RAID set goes down with time. But that's because the IOPS/TB of each drive goes down with time, and however unpleasant that is it is something pretty much unavoidable. Except with SSDs, which however have their own RMW issues with large erase blocks.
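The IOPS/TB arithmetic is trivial but worth seeing. The sketch assumes a drive that delivers about 100 random IOPS regardless of capacity, which is an illustrative round number rather than a measurement.

```python
def iops_per_tb(drive_iops, drive_tb):
    """Random IOPS available per terabyte stored on one drive."""
    return drive_iops / drive_tb


ASSUMED_DRIVE_IOPS = 100  # illustrative figure, roughly constant across capacities
for capacity_tb in (0.5, 1, 2, 4):
    print(f"{capacity_tb}TB drive: {iops_per_tb(ASSUMED_DRIVE_IOPS, capacity_tb):.0f} IOPS/TB")
```

Since a RAID set of a fixed number of drives inherits this ratio, keeping the set narrow while drives grow means fewer IOPS for each terabyte it holds.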