Computing notes 2013 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


131229 Sun: Comparing tape and disk for archival storage

Somewhat related to the recent post on the performance envelope of 4TB disk drives for bulk, but non-archival, storage, I was reminded of a recent article in The Economist arguing that for archival purposes tape drives are more cost effective than disk drives.

The article quotes the person responsible for the tape archival system at CERN, without pointing out that implicitly that person and the article are comparing offline tapes, that is tapes archived outside a tape drive, on a shelf, with online disks, that is disks that are powered up, whether rotating or not, inside a disk storage system, as this quote makes clear:

The third benefit of tapes is that they do not need power to preserve data held on them. Stopping a disk rotating by temporarily turning off the juice—a process called power cycling—increases the likelihood that it will fail.

All the other points listing the advantages of tape over disk, with one exception, are based on the same hot versus cold assumption, for example also:

Tapes can still be read reliably three decades after something is recorded on them. For disks, that figure is around five years.

That is a wholly unfair comparison. Disks can be put in caddies and shelved just like tapes are, and when powered down and on a shelf they are far more robust and cheaper than tape cartridges; sure, tapes can be spliced together when part of them fails, but they can also stretch and break more easily when active, particularly because for a long time tapes have been recorded with many serpentine tracks, and a full tape scan requires dozens of back-and-forth passes.

There are other differences like the presence of the disk controller, an electronics board, in the disk drive, but overall I think that the lifetime, robustness and cost etc. of a shelved disk drive are equivalent to or better than those of a tape cartridge.

The only detail that perplexes me is the comparison of the sequential transfer rate of disk drives to that of tape drives, where the article claims that:

Although it takes about 40 seconds for an archive robot to select the right tape and put it in a reader, once it has loaded, extracting data from that tape is about four times as fast as reading from a hard disk.

As we have seen recently, a current 4TB disk drive, when used as a mostly-sequential storage medium, can reach transfer rates from 80MB/s to 160MB/s. Tape drives cannot realistically reach four times that, or 320MB/s to 640MB/s, because that requires IO buses that are not yet available, or not yet entirely popular, like 6 gigabits/s SATA3 or 8 gigabits/s FC, and even those cannot quite do 640MB/s in practice.

If I look at top end archival tape systems like the IBM TS3500 or the Fujitsu ETERNUS LT270 S2, they are based on either the LTO-6 standard or the advanced IBM 3592 cartridge for the TS1140 tape drives, with these characteristics:

LTO-6 tape cartridges have a native capacity of 2.5TB and a maximum transfer rate of 160MB/s.
3592 cartridges can go up to 4TB, and the TS1140 has a maximum transfer rate of 250MB/s, in rather unrealistic scenarios with IO transactions larger than 180MB, else they can do around 40-80MB/s (see page 14 of this document).

Note: Arguably the big archival system StorageTek StreamLine 8500, based on the very recent T10000D tape drive (250MB/s, 8.5TB tape cartridge capacity), should be discussed as well, but it is difficult to find online cost estimates, and while being a bit more advanced, it is in the same ballpark as the LTO-6 standard and the IBM TS1140 drives and related products.

As to price, from some random online suppliers I see $130.29 for LTO-6 2.5TB cartridges (but from another supplier just $75.29 for LTO-6 2.5TB cartridges) and $290.00 for 3592 4TB cartridges; while for enterprise disk drives I found $350 for HGST 4TB disk drives and $190 for HGST 2TB SAS disk drives.

But to this it must be added that while a disk drive contains both the medium and the drive and controllers to read and write it, and needs only a caddy and a suitable enclosure to hold it, a tape cartridge also needs a tape drive, and tape drives are phenomenally expensive.

It is not easy to get public prices for large tape libraries, so let's just look at the cheapest, commodity products. A simple LTO-6 tape drive costs $2,500 and an LTO-6 16-cartridge tape library costs $3,600, which means that with the accompanying LTO-6 20× cartridge pack cost of $2,000 a 16×2.5TB (40TB) tape archival system costs $5,600; while 10×4TB disk drives cost $3,500 and a disk enclosure for 10 drives costs $530, or going for a luxurious HP disk enclosure for 12 drives, $1,700. Which means that the 40TB fully online disk alternative is cheaper than a tape library with a single tape drive.
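The arithmetic behind that comparison can be laid out as a small sketch. It uses only the prices quoted above; the constant and function names are made up for illustration.

```python
# All prices (USD) are the ones quoted in the text above; no new data.
TAPE_LIBRARY_16_SLOTS = 3600   # LTO-6 library with one drive, 16 cartridge slots
CARTRIDGE_PACK_OF_20 = 2000    # pack of 20 LTO-6 cartridges
DISK_DRIVE_4TB = 350           # one enterprise 4TB disk drive
ENCLOSURE_10_BAYS = 530        # simple 10-drive enclosure
ENCLOSURE_HP_12_BAYS = 1700    # the "luxurious" 12-drive enclosure


def cost_per_tb(total_cost, capacity_tb):
    """Return the cost in USD of one TB of capacity."""
    return total_cost / capacity_tb


tape_total = TAPE_LIBRARY_16_SLOTS + CARTRIDGE_PACK_OF_20
tape_capacity_tb = 16 * 2.5    # only 16 cartridges fit in the library

disk_capacity_tb = 10 * 4
disk_total_simple = 10 * DISK_DRIVE_4TB + ENCLOSURE_10_BAYS
disk_total_hp = 10 * DISK_DRIVE_4TB + ENCLOSURE_HP_12_BAYS

if __name__ == "__main__":
    for label, total, capacity in [
        ("tape library", tape_total, tape_capacity_tb),
        ("disks, simple enclosure", disk_total_simple, disk_capacity_tb),
        ("disks, HP enclosure", disk_total_hp, disk_capacity_tb),
    ]:
        print(f"{label}: ${total} total, ${cost_per_tb(total, capacity):.2f}/TB")
```

With these figures both disk options come in below the tape library for the same 40TB, before even counting the 4 spare cartridges of the 20-pack that do not fit in the library.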

My impression is that there is little justification for the existence of tape anymore, because the volume (rather than areal) densities of tape and disk drives are comparable (see page 3 of this very detailed and informative comparison) except that there are several manufacturers of automated offline tape libraries, while there is none of automated disk cartridge libraries, so offline disk cartridge archival is not for very big archives (or it is for archives so big that nearly all data is on a shelf instead of a cartridge library).

Recent announcements of long term archival storage services by Seagate and by Amazon are both based on large semi-online disk libraries. Since the Amazon service has a restoration latency of several hours it might be based on some kind of fully offline tape library.

However the point of my previous post on multi-TB disk drives was that a 4TB drive with an 80MB/s-160MB/s sequential transfer rate and a 1MB/s-4MB/s small block random transfer rate is hardly a random access device: it is more like a tape cartridge with a much faster random positioning time and no need of an additional very costly drive, and should be used mostly for sequential workloads such as logging and backup archives.

131228 Sat: Some statistics on the Kexi usage

I have found some details interesting in this summary of statistics of usage of Kexi which is a major KDE application for DBMS access.

It is interesting because to some (unknown) extent Kexi usage statistics can be used as proxies for KDE usage statistics.

The more interesting points are the expected one that very few people have installed the most recent versions, something common to most software applications, and that it is most commonly used with Ubuntu (and derivatives like Mint and Kubuntu) and SUSE. These are also the distributions that we expect to have the most desktop users and in particular KDE users.

Instead Debian and RedHat's EL distributions and their derivatives seem to me mostly targeted to or at least popular for servers, even if they offer a full choice of desktop environments, but even so the distance from Ubuntu and SUSE here is enormous: Debian gets only 2% of Kexi installations, and RedHat (and derivatives) does not even make the list. This is a significant surprise.

It is only slightly surprising that the most popular distribution versions for Ubuntu are the LTS ones, but it is rather surprising that at least half of the SUSE versions seem to be the somewhat less stable, experimental versions of the community based openSUSE distribution instead of the stable, commercially supported one, whose latest release is 11SP3, while 12.1, 12.2 and 12.3 must be openSUSE, with newer versions being increasingly less popular as expected.

It is also slightly surprising, but in keeping with the theme that older versions are still the most popular, that 32-bit and 64-bit versions are equally popular.

131227 Fri: The issue with disk drives with multi-TB capacities

A smart guy from a major UK LCG Tier 2 site has posted on a related mailing list some perplexity about buying cheaper-per-GB 4TB drives:

I am in 2 minds now whether 4 TB drives (and up to 36 of them per server) is taking things too far.

The perplexity is easy to guess: whether 4TB disk drives have too few IOPS and too slow a transfer rate for their capacity to be usable as data processing drives instead of archival drives.

The CERN computing department has a cleverly defined minimum performance criterion for their storage: at least 18MB/s of interleaved read and write streams per TB of capacity. This was designed to discourage vendors from winning storage bids with cheap, large, but unusably slow drives.

Note the other cleverness of the criterion: the minimum performance is not just per TB of capacity, but measured in terms of interleaved read and write streams, so that there must be at least two concurrent IO streams per TB of storage, and at least half of them must be writing, each to a different file from the one the other is reading, and thus to a different area of the storage system. In other words this implies a significant test of aggregate IOPS.

Note: the criterion is given purely in terms of throughput, as a combined minimum transfer rate; perhaps they ought to have specified a maximum latency for writes, or both reads and writes, but I suspect they are not concerned because their data processing application don't have tight latency constraints, all being batch jobs.

The above criterion is based on the knowledge that for a long time disk drives have increased capacity faster than transfer rate (as capacity is proportional to both linear transfer rate and latitudinal track density), and transfer rate has been growing faster than IOPS (as transfer rate is proportional to detector speed and sensitivity, both electronic properties, while IOPS is proportional to arm acceleration and deceleration, both mechanical properties).

My reckoning is that:

The minimum target of 18MB/s of interleaved read/write streams per TB of capacity is generally desirable except for pure archival workloads.
That usually 1TB disk drives can deliver it, but often 2TB drives cannot, and most likely 3TB and 4TB drives cannot.

To reach the CERN target a 4TB drive ought to be able to sustain 72MB/s of interleaved IO over 8 threads, 4 for reading and 4 for writing.
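As a minimal sketch of that scaling, assuming the criterion is read as above (18MB/s per TB, with one read stream and one write stream per TB); the names are illustrative and not CERN's own:

```python
CERN_MIN_MB_PER_S_PER_TB = 18  # from the criterion described in the text
STREAMS_PER_TB = 2             # assumed reading: one reader plus one writer per TB


def required_rate_mb_per_s(capacity_tb):
    """Minimum interleaved transfer rate for a drive of the given capacity."""
    return capacity_tb * CERN_MIN_MB_PER_S_PER_TB


def required_streams(capacity_tb):
    """Number of concurrent streams implied, half reading and half writing."""
    return capacity_tb * STREAMS_PER_TB


def meets_target(capacity_tb, measured_mb_per_s):
    """True if a measured interleaved rate satisfies the criterion."""
    return measured_mb_per_s >= required_rate_mb_per_s(capacity_tb)


if __name__ == "__main__":
    for capacity in (1, 2, 3, 4):
        print(f"{capacity}TB: {required_rate_mb_per_s(capacity)}MB/s "
              f"over {required_streams(capacity)} streams")
```

The required rate grows linearly with capacity, while the interleaved rate a single drive can deliver does not, which is the whole problem.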

As to numbers relative to available drives, there is a first (somewhat shallow) comparative test of some 4TB disk drives with different profiles that shows that regular ones have, for only-read or only-write accesses:

Sequential transfer rates between 160MB/s (outer cylinders) and 80MB/s (inner cylinders) with an average of 130MB/s.
Semi-sequential (512KiB per IO) transfer rates of around 60MB/s (reads) and around 100MB/s (writes).
Random (4KiB per IO) transfer rates between 2MB/s (writes) and 1MB/s (single threaded reads).

It looks rather unlikely, especially in the inner tracks, that a semi-random 8-thread workload would be able to even remotely reach 80MB/s.

The numbers above also mean that duplicating a 4TB drive will take around 31,000 seconds or nearly 9 hours, and that filling it or reading it by regular applications will take at least twice as long, single threaded, and probably a lot slower, if multithreaded. Even worse, the resync times for parity RAID will be much, much longer than the duplication times, probably extending into days.
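The duplication time follows directly from the capacity and the average sequential rate quoted above. A small sketch, assuming decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and illustrative function names:

```python
TB = 10**12  # decimal terabyte, as used by drive manufacturers
MB = 10**6   # decimal megabyte


def duplication_seconds(capacity_tb, rate_mb_per_s):
    """Time to stream a whole drive once at a constant transfer rate."""
    return capacity_tb * TB / (rate_mb_per_s * MB)


def hours(seconds):
    return seconds / 3600


if __name__ == "__main__":
    # 130MB/s is the average sequential rate from the test cited above,
    # 80MB/s is the inner-cylinder rate, shown as a pessimistic bound.
    for rate in (130, 80):
        seconds = duplication_seconds(4, rate)
        print(f"4TB at {rate}MB/s: {seconds:.0f}s, {hours(seconds):.1f}h")
```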

Another interesting (less shallow) test with 5 drives has graphs for interleaved read-write, with 70% reads and 30% writes, for 2 threads up to 16 threads. With 2 threads and 4 queued IOs or 4 threads the results are delivered in terms of IOPS, between 75 IOPS and 150 IOPS for various drives, with 8KiB blocks, that is an aggregate transfer rate of less than 2MB/s (two megabytes per second). Never mind an average latency of several hundred milliseconds, and a maximum latency of 1 second or more.

The File Server [Throughput] graphs, with mixed IO sizes between 512B and 64KiB, show IOPS between 60 and 120; assuming an average IO size of 32KiB that results in transfer rates of 2MB/s to 4MB/s.
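The conversion from IOPS to transfer rate used in the last two paragraphs is just a multiplication. A sketch with the figures quoted above, where the 32KiB average IO size is the same assumption as in the text:

```python
KIB = 1024
MB = 10**6


def iops_to_mb_per_s(iops, io_size_bytes):
    """Aggregate transfer rate for a given IOPS figure and IO size."""
    return iops * io_size_bytes / MB


if __name__ == "__main__":
    # 70/30 read-write test with 8KiB blocks.
    for iops in (75, 150):
        print(f"{iops} IOPS x 8KiB = {iops_to_mb_per_s(iops, 8 * KIB):.2f}MB/s")
    # File server profile, assuming a 32KiB average IO size.
    for iops in (60, 120):
        print(f"{iops} IOPS x 32KiB = {iops_to_mb_per_s(iops, 32 * KIB):.2f}MB/s")
```

Either way the result is one or two orders of magnitude below the 72MB/s that a 4TB drive would need.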

But from other literature it may be apparent that CERN for processing LHC event data may be considering larger IO sizes. Another even more detailed comparative drive test has screenshots from the benchmark tool HD Tune Pro 5.00 that show 8MB single threaded random IO delivers 65MB/s for reads and 54MB/s for writes.

The above discussion is only relevant to people that plan for the medium and longer term; cunning people who know that what matters in many cases is short term impressions will be budget heroes and choose the cheaper 3TB and 4TB drives rather than the 1TB ones because at the beginning of their lifetime their used capacity will be rather less than 4TB for quite a long time, and initial data will be mostly in the faster outer tracks, and thus for quite a while the 4TB disk drives will perform like a 1TB one, and even better because of the use of only a smaller range of outer cylinders.

Eventually as the disk drives fill up their overall performance will collapse to a fraction of the desired target. That has happened before even to well meaning people who forgot that inner cylinders have half the transfer rate of outer cylinders and free space fragmentation tends to become worse the more of a storage unit is used.

But then I know that people that care for the longer term sustainability of performance of their system still buy 146GB 15000RPM 2.5in drives for IOPS critical storage, and 1TB 2.5in enterprise disk drives for bulk (but non-archival) storage.

131226 Thu: Steam and Team Fortress 2 on 64 bit GNU/Linux

I have wished to get Steam and its GNU/Linux ported games on my Ubuntu LTS 12.04 (Precise Pangolin) 64-bit system for a while, with hardware acceleration. So I decided to spend some time looking in depth into this, during the current holidays.

Since my graphics card is an AMD/ATi HD 7850 it is only fully supported by the manufacturer's own binary drivers (called fglrx for legacy reasons), which have been improving with time, and are reportedly fairly usable.

These drivers come prepackaged for Ubuntu in the packages fglrx-updates-12.104 and fglrx-amdcccle-updates-12.104 but I have downloaded and installed instead the latest version 13.12 which when built results in packages fglrx-13.256 and fglrx-13.256.

Well, that almost works, as I could run the games based on the first Half-Life engine, but not those based on the Source engine.

The steam command would start and report that DRI is not available, and on starting Source engine games like Team Fortress 2 would report that the extension GL_EXT_draw_buffers2 was missing and the entry point glColorMaskIndexedEXT was not available:

This system DOES NOT support the OpenGL extension GL_EXT_draw_buffers2.

Could not find required OpenGL entry point 'glColorMaskIndexedEXT'!

Both these issues are commonly reported by other users, but usually in the context of older AMD/ATi graphics cards or even older AMD/ATi drivers that don't support that extension, while the HD7850 is fairly recent and fully supports them, and so should the 13.12 driver version.

This was really perplexing because both the commands glxinfo and fglrxinfo -v reported that those were available, and that DRI was enabled.

After some time I remembered that the steam command and the Source engine game executable are provided as 32-bit executables, and by default the glxinfo and fglrxinfo executables are 64-bit ones. In other words the 64-bit setup was fine, but the 32-bit setup was not.
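Whether an executable or shared object is 32-bit or 64-bit can be read from its ELF header: after the 4-byte magic, the byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. A minimal sketch of such a check follows; the function name is hypothetical, and the same information is reported by the standard file command.

```python
import sys

ELF_MAGIC = b"\x7fELF"
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}  # values of the EI_CLASS byte


def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file, based on its header."""
    with open(path, "rb") as handle:
        header = handle.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError(f"{path} is not an ELF file")
    return ELF_CLASSES.get(header[4], "unknown")


if __name__ == "__main__":
    # Pass the executables or libraries to inspect on the command line.
    for name in sys.argv[1:]:
        print(name, elf_class(name))
```

Running it on the game executable and on glxinfo would have shown the mismatch immediately.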

I looked at issues and realized that the two critical components that make a difference between 32-bit and 64-bit are the OpenGL library shared object libGL.so.1.2 and the DRI module for the Xorg server fglrx_dri.so.

Well, these were not quite in the right place, in part because the right place has shifted with time: it used to be that on systems with the ability to run both IA32 and AMD64 binaries the 64-bit libraries and modules would be under /usr/lib/ and the 32-bit ones under /usr/lib32/, this being called the bi-arch layout.

Fortunately a better layout was remembered and now 32-bit libraries and modules should be under /usr/lib/i386-linux-gnu/ and the 64-bit ones under /usr/lib/x86_64-linux-gnu/, this being called the multi-arch layout.

But during a transition period, some need to be in both places, as some applications have not been modified accordingly. I found that if I put the OpenGL library and the DRI module under both I got 32-bit games to work right. The setup for the fglrx packages creates a symbolic link for one, but it is not enough. So I have just hard-linked the library and module, which are installed by the AMD/ATi packages under the bi-arch locations even if they are looked for under the multi-arch ones.

So these are the locations for the 64-bit OpenGL library and the DRI module for the Xorg server:

-rw-r--r-- 2 root root   795296 Dec 25 12:16 /usr/lib/fglrx/libGL.so.1.2
-rw-r--r-- 2 root root   795296 Dec 25 12:16 /usr/lib/x86_64-linux-gnu/libGL.so.1.2
-rw-r--r-- 2 root root 34801856 Dec 25 12:16 /usr/lib/fglrx/dri/fglrx_dri.so
-rw-r--r-- 2 root root 34801856 Dec 25 12:16 /usr/lib/x86_64-linux-gnu/dri/fglrx_dri.so

And these are the locations for the 32-bit OpenGL library and the DRI module for the Xorg server:

-rw-r--r-- 2 root root   543244 Dec 25 12:16 /usr/lib32/fglrx/libGL.so.1.2
-rw-r--r-- 2 root root   543244 Dec 25 12:16 /usr/lib/i386-linux-gnu/libGL.so.1.2
-rw-r--r-- 2 root root 36041952 Dec 25 12:16 /usr/lib32/fglrx/dri/fglrx_dri.so
-rw-r--r-- 2 root root 36041952 Dec 25 12:16 /usr/lib/i386-linux-gnu/dri/fglrx_dri.so

After double linking I have checked that both bi-arch and multi-arch paths were listed in the ld.so configuration files under /etc/ld.so.conf.d/, and executed ldconfig to update the ld.so cache.
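The link count of 2 in the listings above already suggests that each pair is one file with two names. A small sketch to confirm that, using the paths from the listings (adjust them for other systems); the helper names are illustrative:

```python
import os

# Path pairs taken from the listings above: (bi-arch, multi-arch).
LINKED_PAIRS = [
    ("/usr/lib/fglrx/libGL.so.1.2",
     "/usr/lib/x86_64-linux-gnu/libGL.so.1.2"),
    ("/usr/lib/fglrx/dri/fglrx_dri.so",
     "/usr/lib/x86_64-linux-gnu/dri/fglrx_dri.so"),
    ("/usr/lib32/fglrx/libGL.so.1.2",
     "/usr/lib/i386-linux-gnu/libGL.so.1.2"),
    ("/usr/lib32/fglrx/dri/fglrx_dri.so",
     "/usr/lib/i386-linux-gnu/dri/fglrx_dri.so"),
]


def same_file(path_a, path_b):
    """True if both paths exist and refer to the same file (same inode)."""
    try:
        return os.path.samefile(path_a, path_b)
    except OSError:
        return False


def check_pairs(pairs):
    """Map each pair of paths to whether the two names are the same file."""
    return {pair: same_file(*pair) for pair in pairs}


if __name__ == "__main__":
    for (bi_arch, multi_arch), linked in check_pairs(LINKED_PAIRS).items():
        status = "ok" if linked else "NOT LINKED"
        print(f"{status}: {bi_arch} <-> {multi_arch}")
```

Note that os.path.samefile follows symbolic links, so it also reports a symlinked pair as the same file.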

I have also uninstalled the 64-bit mesa-utils package that contains glxinfo and replaced it with the 32-bit mesa-utils:i386 package, so I can verify the state of OpenGL and GLX with glxinfo for 32-bit applications and with fglrxinfo for 64-bit ones.

With the double-linking above both reported that DRI was active and the OpenGL extension GL_EXT_draw_buffers2 was available, and indeed steam stopped reporting that DRI was not active, and the Team Fortress 2 executable stopped reporting that the entry point glColorMaskIndexedEXT was not available and started the game, which runs pretty well and smoothly.

Note: probably I could have just used the older fglrx-updates and fglrx-updates-amdcccle packages from the Ubuntu repositories rather than building the very latest version from AMD/ATi.

Note: I have also used a newer version of the Xorg server and the MESA OpenGL packages than the one released with Ubuntu LTS 12.04: the newer packages have suffix -lts-raring and versions 7.7/1.13 for the X server and 9.1 for Mesa OpenGL, instead of 7.6/1.11 and 9.0. Probably the newer versions are necessary. I haven't tried running steam or Team Fortress 2 with the older ones.

131215 Sun: GNU/Linux distribution review shallow as usual

Amusing Kubuntu review entitled Just got better with better animations, social network integration and much more!. The review is in the style of other not so useful reviews as mentioned before.

Almost all the review is about the GUI, the applications, and how easy it is to install for a given random piece of hardware. There is towards the end a mention of repositories, but as to the all-important issues of maintainability there is essentially no discussion.

This may be an appropriate style of review for mobile phones which are largely a one-use appliance to throw away after a few years, or Mac-OSX or personal pre-installed MS-Windows systems that are also typically thrown away or reinstalled when maintenance becomes an issue.

But UNIX-like systems like GNU/Linux ones are supposed to be configurable and updateable and maintainable in the long term, and that's their main, and huge value.

As to shiny applications, they are much the same across most GNU/Linux distributions, and only the details differ, and most of the time the major detail that differs is how old the release of the distribution is, as that determines how recent the version of a shiny application is.

The reviewed distribution is Kubuntu version 13.10, which is not a Long Term Support one, so it is targeted at people who want a maintainable system that can be easily upgraded, and has recent system components and applications.

There is a passing mention of the new window compositor API and protocol Wayland but for example no mention or test of its reference implementation Weston, no mention of the Mir situation that means that there will not be a KDE UI using the native Mir protocol.

There is no mention of how easy is package management and the quality of the packaging.

There is instead only a short mention of some aspects of available drivers and hardware support, whereas this is a crucial aspect because with GNU/Linux systems it is essential to purchase only devices known to be well supported, as manufacturers as a rule do not supply usable drivers, and this means that usually only popular and somewhat older devices can be expected to work reliably.

The Performance section has a useful subjective start, but the following statistics are pointless. Most of the perceived responsiveness of a system depends on disk scheduler and page flusher algorithms and parameters and the configuration of the dynamic linker and libraries, which are not discussed.

An operating system is not just for Christmas, it is for its life.

131213 Fri: The usefulness of RAID14

As per some previous explanations RAID6 is only appropriate in a limited number of cases, and the typical RAID set should be RAID10.

However RAID10 is criticised by many for there being one case where it does not have continuity of operations if two members are lost (both members of a pair), while RAID6 is capable of continuity of operation in all cases of two lost members, no matter how many members it is composed of.

As I pointed out previously RAID resilience is a statistical and not geometric property so that argument is unfounded as such. Even worse, the probability of two members failing goes up with the total number of members, and then the cost of repairs matters a lot.

For this discussion the two most important weaknesses of RAID6 are:

The redundancy is fixed at two members, while that of RAID10 is proportional to the size of the RAID set, and thus grows with the growing risk of 2 failures.
The redundancy is global, and doubly so, and therefore repair involves work on all members, while for RAID10 it is local and involves only the mirror of the failed member.

It is possible to achieve mirrored RAID sets with the ability to continue operating despite the loss of any two members by creating a RAID10 made of mirror triples instead of pairs, but of course the degree of redundancy seems a bit excessive, even if it has advantages.

There is another and rather uncommon way to achieve the same property with mirror pairs by adding a single extra member: RAID14.

That is, create a RAID4 composed of N mirror pairs plus a single extra member (or two mirrored members if desired). This ensures that if a whole mirror pair becomes unavailable the RAID set can continue operating because the top-level RAID4 set has redundancy in the single extra member; and if the single extra member plus one mirror member fail, the single extra member can be reconstructed from the other members, and the failed mirror member from its paired member.
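The reconstruction of a wholly lost mirror pair can be sketched as follows, assuming the usual XOR parity of RAID4 computed over one chunk per mirror pair. This is a toy illustration on byte strings, not how any real RAID implementation is structured, and all names are made up.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild_lost_pair(surviving_chunks, parity):
    """Recover the chunk of a mirror pair that lost both members.

    XOR-ing the parity with the chunks of all surviving pairs cancels
    them out and leaves the chunk of the missing pair.
    """
    return xor_chunks(list(surviving_chunks) + [parity])


if __name__ == "__main__":
    # One stripe over 4 mirror pairs; each chunk is held twice, once per member.
    stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    parity = xor_chunks(stripe)            # held on the single extra member

    lost = 2                               # both members of pair 2 are gone
    survivors = stripe[:lost] + stripe[lost + 1:]
    print(rebuild_lost_pair(survivors, parity) == stripe[lost])
```

When only one member of a pair fails none of this is needed: the chunk is simply copied from the surviving mirror.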

Just like with parity RAID in general the prices paid are:

Writes become in part correlated, even if only at the top level, and therefore RMW also becomes possible. But the RMW for RAID14 involves single parity, therefore partial stripe writes don't involve full stripe reads. Therefore, as for RAID4, certain configurations have stripe sizes small enough to be tolerable.
In case of failure repair may involve RMW of every stripe, but only if the extra (or both) parity member fails, as parity ties together the state of only the top level set.

It is interesting to compare RAID14 and RAID6 in some reasonable configurations, where each data stripe is composed of 2, 4, 8 chunks; the RAID6 set has 2+2, 4+2, 8+2 members, the RAID14 set has 2×(1+1)+1, 4×(1+1)+1, 8×(1+1)+1 members, with a top RAID4 level being 2+1, 4+1, 8+1.

Cost and resilience

As a common note, regardless of the number of chunks in a data stripe, RAID14 has redundancy proportional to that number, because each additional mirror pair brings its own extra redundancy; there is always just 1 parity member, but it must be noted that in effect it is the parity for each mirror pair, as it is only ever needed if both members of a pair fail. Sure, the probability of both members of some pair failing increases with the number of pairs, but for numbers of pairs within reason the combined probability is very small.

Also, RAID14 protects against most failures of 3 members.
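Both claims can be checked by brute force under a simple model, assumed here for illustration: the top-level RAID4 has one component per mirror pair plus the parity member, a pair counts as lost only when both its members have failed, and the set keeps operating as long as at most one component is lost.

```python
from itertools import combinations


def raid14_members(pairs):
    """All members of a RAID14 set: two per mirror pair, plus parity."""
    members = [(pair, side) for pair in range(pairs) for side in (0, 1)]
    members.append("parity")
    return members


def raid14_survives(failed, pairs):
    """True if the set keeps operating with the given failed members."""
    failed = set(failed)
    lost = sum(1 for pair in range(pairs)
               if (pair, 0) in failed and (pair, 1) in failed)
    if "parity" in failed:
        lost += 1
    return lost <= 1  # single parity tolerates one lost component


def surviving_combinations(pairs, failures):
    """Return (survivable, total) counts over all sets of failed members."""
    combos = list(combinations(raid14_members(pairs), failures))
    survivable = sum(raid14_survives(combo, pairs) for combo in combos)
    return survivable, len(combos)


if __name__ == "__main__":
    for pairs in (2, 4, 8):
        for failures in (2, 3):
            ok, total = surviving_combinations(pairs, failures)
            print(f"{pairs} pairs, {failures} failures: {ok}/{total} survivable")
```

Under this model every 2 member failure is survivable, and the only fatal 3 member failures are those that take out both members of one pair together with the parity member, so there are exactly as many of them as there are pairs.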

With 2 data chunks per stripe RAID6 has 4 members and RAID14 has 5 members. RAID6 has a small cost advantage, but RAID14 is rather more resilient.

With 4 data chunks per stripe RAID6 has 6 members and RAID14 has 9 members. RAID14 is 50% more expensive, but RAID14 is far more resilient, as its redundancy has increased proportionally to the number of data members.

With 8 data chunks per stripe RAID6 has 10 members and RAID14 has 17 members, or 70% more expensive. But we are at the very limit of prudent use for RAID6 in several cases, while the redundancy of RAID14 has again increased in nearly perfect proportion to the increase in members.
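The member counts and the extra cost in the three cases above follow from two simple formulas: k+2 members for RAID6 and 2k+1 for RAID14, with k data chunks per stripe. A short sketch, with illustrative function names:

```python
def raid6_member_count(data_chunks):
    """Data members plus two parity members."""
    return data_chunks + 2


def raid14_member_count(data_chunks):
    """One mirror pair per data chunk plus a single parity member."""
    return 2 * data_chunks + 1


def extra_cost_percent(data_chunks):
    """How much more members RAID14 needs than RAID6, in percent."""
    ratio = raid14_member_count(data_chunks) / raid6_member_count(data_chunks)
    return round(100 * (ratio - 1))


if __name__ == "__main__":
    for k in (2, 4, 8):
        print(f"{k} data chunks: RAID6 {raid6_member_count(k)} members, "
              f"RAID14 {raid14_member_count(k)} members, "
              f"+{extra_cost_percent(k)}%")
```

The ratio tends towards 2 as the stripe gets wider, which is the same cost as a plain RAID10 plus one member.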

Read effort

With 2 data chunks per stripe RAID6 can read in parallel from 2 members and RAID14 from 4 (given some multithreading).

With 4 data chunks per stripe RAID6 can read in parallel from 4 members and RAID14 from 8 (given some multithreading).

With 8 data chunks per stripe RAID6 can read in parallel from 8 members and RAID14 from 16 (given some multithreading).

Write effort

With 2 data chunks per stripe RAID6 needs 3 member reads, a 4 member heavy calculation and 3 writes for 1 chunk updates, and 4 member heavy calculation and 4 writes for whole stripe writes. RAID14 needs 2 member reads and 3 member simple calculations and 3 writes for 1 chunk updates, and 3 member light calculation and 5 writes for whole stripe writes.

With 4 data chunks per stripe RAID6 needs 4 member reads, a 4 member heavy calculation and 3 member writes for 1 chunk updates, and 6 member calculations and 6 writes for whole stripe writes. RAID14 needs 2 member reads and 3 member simple calculations and 3 writes for 1 chunk updates, and 5 member simple calculation and 9 writes for whole stripe writes.

With 8 data chunks per stripe RAID6 needs 8 member reads, a 10 member heavy calculation, and 3 writes for 1 chunk updates, and an 8 member heavy calculation and 10 member writes for whole stripe writes. RAID14 needs 2 member reads and 3 member writes for 1 chunk updates, and 9 member simple calculation and 17 writes for whole stripe writes.

Resyncing after 1 failure

Here let's look at the cost of a single member failure, which is by far the most common.

With 2 data chunks per stripe RAID6 requires a read scan of 3 members, a 4 member heavy calculation, and write scan of 3 members. RAID14 requires a read scan of 1 member and write scan of 1 member in the 4 cases of 5 where a mirror member failed; if the parity member failed, a 2 member read scan, a 3 member light calculation and a 1 member write scan.

With 4 data chunks per stripe RAID6 requires a read scan of 5 members, a 6 member heavy calculation, and write scan of 3 members. RAID14 still requires a read scan of 1 member and write scan of 1 member in the 8 cases of 9 where a mirror member failed; if the parity member failed, a 4 member read scan, a 5 member light calculation and a 1 member write scan.

With 8 data chunks per stripe RAID6 requires a read scan of 7 members, a 10 member heavy calculation, and write scan of 3 members. RAID14 still requires a read scan of 1 member and write scan of 1 member in the 16 cases of 17 where a mirror member failed; if the parity member failed, an 8 member read scan, a 9 member light calculation and a 1 member write scan.

All of the above ought to be compared with having multiple RAID sets, for multiple RAID5 or RAID6 sets to give the same data capacity. But the discussion here is about large single free space pools, even if they are often unnecessary.

The long list of cases above shows that RAID14 has large advantages over RAID6, and the higher cost is (potentially) justified by much better statistical continuity like for RAID10. The advantages of RAID14 (or RAID15) are founded on its redundancy being based on mirroring, which is local and proportional to the number of members in the set, and on simple one parity member RAID4 at the top, with much lower entanglement among members (because simple parity calculations are simple to invert), and much faster parity calculation.

But the important deal is resilience and resyncing: it has the geometrical property that regardless of the size of the RAID set all possible 2 member failures don't impact continuity, and the far more important statistical one that nearly all 3 member failures don't impact continuity either; plus the desirable property that almost all 1 member failures result in a simple localised mirroring, and those involving the parity member involve only reading one of the members in each mirror pair, and writing just to the parity member.

While I have described the above as a RAID4 over a set of RAID1 pairs, plus a single parity member, there are at least two natural variants:

Using a RAID1 for the parity member of the RAID4 too, which results in all 1 member failures and most 2 member failures not requiring a parity resync. I think this is not a significant advantage in most cases, and usually, and in particular with larger sets, I would rather keep an extra member as a spare than as a mirror for the parity member.
Using a RAID5 top-level. In simple RAID5 situations this spreads the parity write effort for partial stripe updates across all RAID5 members. But in a RAID14 situation writes always impact both members of a pair anyhow. However RAID14 makes it feasible to have much larger RAID4 sets, which may make partial stripe updates rather more common. On the other hand RAID14 has the property that it is physically a RAID10 plus a parity member, which is a bit more flexible. Probably for smaller sets a RAID14 is good, and a RAID15 becomes more desirable for larger ones.

Overall I think that a RAID14 with 4 or 8 data chunks per stripe or a RAID15 with 8 or 16 data chunks per stripe can be a much better alternative to RAID6 for demanding applications, where someone really requires the geometrical property of continuity after any 2 member failures, and at the price only of one extra member, and rare cases of parity reconstruction.

131210 Tue: An unusual way to use OpenAFS

AFS is one of the oldest designs of networked, distributed file-systems, and it has become somewhat popular thanks to the free software OpenAFS implementation, especially on GNU/Linux systems.

This despite some legacy issues with code and implementation quality, which have been slowly fixed over time, with the result that the current 1.6 release branch is very much improved over its predecessors; there are even some commercial reimplementations and support companies (Your File System, Sine Nomine Associates) and an alternative implementation (arla).

The reasons for this persistence have been these crucial advantages of AFS:

Being one of the original distributed meta-file-systems, with metadata servers holding lists of file-trees called volumes, the list pointing at data servers holding file-trees, with clients accessing the data servers once they have located them via the metadata servers, it is very scalable, supporting up to dozens of data servers holding hundreds of terabytes of data.
Having (potentially strong) authentication and encryption built-in, thanks to integration with Kerberos.
The ability to create read-only snapshots of file-trees in real-time from one data-server to another, called replicas.
The ability to transparently move file-trees from one data server to another (or from one location on a data-server to another).
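The split between volume location metadata and data servers can be illustrated with a toy model. This is only a sketch of the idea, not the OpenAFS interfaces; the class and all the names in it are hypothetical.

```python
class Cell:
    """Toy model of a cell: a volume location table plus data servers."""

    def __init__(self):
        self.locations = {}  # volume name -> data server name (the metadata)
        self.servers = {}    # data server name -> {volume name: file-tree}

    def create(self, volume, server, tree):
        # A data server becomes known simply by holding a volume.
        self.servers.setdefault(server, {})[volume] = tree
        self.locations[volume] = server

    def move(self, volume, new_server):
        # Moving a volume only changes where the metadata points.
        old_server = self.locations[volume]
        tree = self.servers[old_server].pop(volume)
        self.servers.setdefault(new_server, {})[volume] = tree
        self.locations[volume] = new_server

    def read(self, volume, path):
        server = self.locations[volume]             # step 1: ask the metadata
        return self.servers[server][volume][path]   # step 2: ask the data server

cell = Cell()
cell.create("home.alice", "fs1", {"notes.txt": "hello"})
cell.move("home.alice", "laptop-alice")
print(cell.read("home.alice", "notes.txt"))
```

Clients keep using the same volume name before and after the move, which is what makes the move transparent.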

AFS has other interesting and rare properties, such as having implementations for a large number of operating systems, and having by convention a world-wide namespace, but these can be secondary attributes compared to those above.

AFS is part of a line of distributed file-systems that aimed at realizing partitionable distributed file-systems with disconnected operation for clients, like Coda, its successor.

Originally AFS-volumes were supposed to be cacheable in their entirety on clients, allowing for disconnected operation, but this did not quite work out, and so that work continued onto Coda.

However there is an interesting alternative possible with AFS which can give a lot of the same abilities: add an AFS data server to every client.

In this way AFS-volumes stored on the client will be available to that client always, and to other clients when it is connected to the same internet.

The crucial detail here is that the AFS metadata servers only record AFS-volumes, and AFS data-servers only as a consequence of that. There is no explicit configuration needed to add an AFS data server to an AFS cell which is a named set of synchronized AFS metadata servers.

A typical setup would be to have the home file-tree of users stored on their own main desktop or laptop, efficiently accessed locally, yet available from any other system, at the price of network access; and similarly for workgroup file-trees (for example documents), and site-wide file-trees (for example mirrors of distant repositories).

Except for one vital and unfortunate detail: all servers in an AFS cell need to share the same authentication and encryption key, including the data-servers, and this means that AFS data servers cannot be reasonably set up on systems entrusted to end-users.

However for system infrastructures where AFS clients and servers are managed together this should not be a problem, giving clients local access to AFS-volumes where useful, and at the same time allowing remote access by other clients. For example where there are site AFS data servers, and workgroup AFS data servers, with user desktops and laptops accessing their home file-trees on the latter.

This is possible also with NFSv4 where it is even easier and safer, as each server has a separate authentication and encryption key, and local shared files can be accessed directly and without passing through the NFS protocol, as NFSv4 is a meta-file-system with an identity mapping between its files and those of the underlying file-system in which they are stored.

But NFSv4 does not have the crucial features listed above, such as distributed transparent read-only snapshots or moving file-trees. But it allows putting home file-trees directly onto end-user managed systems, which may be an advantage in some cases.

131025 Fri: Decay of flash SSD speed with use

Since flash SSD storage devices are based on memory chips with extremely anisotropic performance profiles, in particular being unable to overwrite data, being able to only reset data to zeroes in large blocks, writing to them in small blocks is simulated, and the simulation depends on recent history.

Some recent group tests from the usual excellent X-bit labs of flash SSD devices show this clearly, for example this page shows a graph of how much random write speed on a used device is lower than on a virgin one, immediately after some previous use, after a 30 minute interval, and after using the TRIM command to help the drive firmware manage for best effect its simulation of a writable device.

The differences can be rather significant, with many drives writing at less than 20% of their virgin-state speed after some use, and even after a 30 minute pause; in the same test all but one however revert to virgin-state speed after use of the TRIM command.

From the same group test another page shows that for some drives tested speeds are significantly reduced when the drive is 50% full.

It is notable however that the flash drives have still fairly good transfer rates compared to a disk drive; but I suspect that latency is more impacted than throughput.

131006 Sun: IPv6 traffic ramping up after many years

Interesting article with a graph that shows a definite ramping up of IPv6 traffic as seen by Google, hitting 2% after a long time with negligible impact.

It is also interesting that Teredo traffic has essentially disappeared, after being most of the IPv6 traffic they were seeing.

The adoption of IPv6 is clearly going to be dominated though by mobile devices (tablets, smartphones), as demonstrated by previous news that even a USA-based carrier is using IPv6 for their most recent network.

131005 Sat: ZFS and 'fsck'

Noticed an interesting thread on using ZFS as the storage layer for the OpenAFS meta-filesystem, started with an interesting series of questions:

Are you using ZFS-on-Linux in production for file servers?

If not, and you looked into it, what stopped you?

If you are, how is it working out for you?

ext3/ext4 people: What is your fsck strategy?

The most interesting for me is the last one about fsck strategy, given my long standing interest in the lack of scalability in the running time of most current filetree repair tools, and the implication that ZFS does not need a filetree repair tool like fsck.

This is a persistent myth, and it is based on the half-true notion that ZFS and BTRFS and other copy-on-write, versioning filesystem designs don't need a filetree checking tool because having pervasive checksums they can detect malformed parts of a filetree natively, without waiting for a whole-filetree scan.

It is true that they don't need a filetree checking tool to detect malformed parts of the filetree when they are accessed, but there is still a need to detect malformed parts of the filetree periodically even if they are not otherwise accessed by applications. Indeed ZFS has such a tool, and running it is called scrubbing.

Moreover hopefully the terminology switch above has been noticed: I started talking about filetree repair tools and then switched to filetree checking tools. Those are two very different concepts, and while pervasive checksums half-obviate and facilitate filetree checking they don't obviate or facilitate filetree repair, which requires in most cases that matter a whole-filetree scan and some heuristics to rebuild a well-formed filetree structure.

There is the argument that a versioning filesystem does not need filetree repair tools because if the current version gets malformed, for example by a system crash, it is easy to rollback to a well-formed earlier version. But that argument only covers the easy case of simple, easy damage of the sort that journaling filesystems were designed for, to reduce the need to run filetree repair tools, but not to eliminate it.

The damage that a filetree repair tool is meant for is typically that arising from limited storage layer issues, such as damaged recording media or coding mistakes which can cause random, not immediately reported, corruption of data.

The whole-tree scan and repair is worthwhile if it is a cheaper operation than a whole-tree recovery from backups, which is the case as a rule. The absence of such a tool means that backups and recovery from backups have to happen more frequently than if such a tool was available, which is not an insignificant issue.

ZFS and BTRFS and similar filesystem designs try to reduce the need for more frequent whole-tree backups and recovery from backups by using piecemeal storage redundancy, that is by keeping two or more copies of at least some parts of a filetree, so that if a copy is detected as malformed repair can be simply restoring it from a non-malformed copy. But this makes backups and restores incremental and online, reducing their latency, which is good, but does not reduce much their overall cost; ZFS at least has a scrub tool that does whole-tree checking and repair, if possible using the redundancy of the storage layer.
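The difference between detecting damage with checksums and repairing it from a redundant copy can be shown with a toy model. This is only a sketch of the principle and not how ZFS or BTRFS are implemented; the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(copies, expected):
    """Check every copy of one block against its recorded checksum.

    Bad copies are overwritten from a good copy when one exists.
    Returns "ok", "repaired" or "unrecoverable".
    """
    good = [c for c in copies if checksum(c) == expected]
    if len(good) == len(copies):
        return "ok"
    if not good:
        # The checksum detects the damage but offers nothing to repair it with.
        return "unrecoverable"
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good[0]
    return "repaired"

block = b"directory block"
recorded = checksum(block)
print(scrub([block, b"bit rot"], recorded))      # one copy damaged
print(scrub([b"bit rot", b"more rot"], recorded))  # every copy damaged
```

In the second case only detection is possible, and that is where either a filetree repair tool or a restore from backup is needed.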

Besides the cost, the schemes used by ZFS and BTRFS for storage redundancy can be used separately with other filesystem designs, and they don't obviate the advantage of having a filetree repair tool either. ZFS and BTRFS integrated to some extent the filesystem layer with the storage redundancy layer, which makes it more convenient, and reduces perhaps the frequency with which to run filetree repair tools just like journaling does, but still does not eliminate it.

Never mind eliminating the need for conventional backups.

130829 Thu: Blaming latency evasively

While reading a recent blog post about latency adding hours to a large database job:

That’s when I figured I’d ping the PostgreSQL server from the ETL server and there it was, 1ms latency. That was something I would expect between availability zones on Amazon, but not in a enterprise data center.

It turned out that they had a firewall protecting their database servers and the latency was adding 4-5 hours to their load times. The small amount of 1ms really adds up when you need to do millions of network round trips.

I was amused by the evasiveness of the analysis of the situation:

A bulk database update job takes 7 hours to finish.
The job can update around 1,000 records per second.
The job accesses the database over a link with 1ms RTT.
CPU, memory and disk resources are very lightly loaded.

The details above strongly imply that the job runs in half-duplex mode, with each record update being synchronous, that is each record update is a transaction.

The 1ms latency has an impact on this job only because someone coded the job to be synchronous with each record update while processing millions of records, which seems ridiculous to me; unless there is a strongly compelling reason for updating each record synchronously, which is very rare for bulk ETL workloads.
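A back-of-the-envelope check of that implication: with one round trip per update a 1ms RTT caps the job at 1,000 updates per second, which matches the observed rate. The sketch below takes the listed rate and duration at face value; the batch sizes are hypothetical and the model ignores server-side work unless it is passed in.

```python
def job_hours(records, rtt_s, batch_size=1, work_s_per_record=0.0):
    """Elapsed hours when each batch of updates waits for one round trip."""
    round_trips = -(-records // batch_size)  # ceiling division
    return (round_trips * rtt_s + records * work_s_per_record) / 3600

RTT = 0.001                # 1ms round trip time, as in the post
RECORDS = 1000 * 7 * 3600  # implied by about 1,000 records/s for 7 hours

print(f"ceiling with one update per round trip: {1 / RTT:.0f} records/s")
for batch in (1, 100, 10_000):
    hours = job_hours(RECORDS, RTT, batch)
    print(f"batch size {batch:>6}: {hours:.3f} hours spent waiting on the network")
```

Sending many updates per round trip makes the network wait negligible, which is why the latency matters only for a job coded one synchronous update at a time.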

130818 Sun: Debate on the size of BGP tables for IPv6

One of the big stories about networking apart from the difference between switched and routed internets is the large difference between routing within an AS (organizational internet) and routing across the global Internet.

The main difference is that routing in the global Internet needs a routing table entry for each independently routable IPv4 subnet, leading to IPv4 routing tables with nearly 500,000 entries.

This is due to global IPv4 Internet routing being designed to support an unrestricted mesh, where every subnet may be connected anywhere, perhaps multiple times, and thus every router has to store a complete list of routes to every subnet, even if almost all of those subnets are connected to a much smaller number of communication company subnets.

IPv6 was meant to have the opposite design, of being based on nearly hierarchical routing, where local IPv6 prefixes are subsets of some communication company prefix, and thus can be subsumed in the latter prefix.
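The effect of subsuming local prefixes in a provider prefix can be sketched with the Python ipaddress module. The ISP prefix here is the IPv6 documentation prefix standing in for a hypothetical ISP, and the "scattered" prefixes are a crude stand-in for prefixes that cannot be aggregated.

```python
import ipaddress
from itertools import islice

def table_entries(prefixes):
    """Number of routes left after merging everything that can be merged."""
    return len(list(ipaddress.collapse_addresses(prefixes)))

# Hypothetical ISP using the documentation prefix.
isp = ipaddress.ip_network("2001:db8::/32")

# 1,000 customer /48 prefixes carved out of the ISP prefix.
customers = list(islice(isp.subnets(new_prefix=48), 1000))
assert all(c.subnet_of(isp) for c in customers)

# Hierarchical case: the ISP advertises its own prefix, which covers them all.
print("hierarchical:", table_entries(customers + [isp]))

# Non-aggregable case: every other /48, with no covering route advertised.
scattered = list(islice(isp.subnets(new_prefix=48), 0, 2000, 2))
print("scattered:", table_entries(scattered))
```

In the first case the rest of the Internet needs one entry for all the customers, in the second one entry per customer.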

There is some debate (For example 1, 2) as to whether this will happen, and Internet IPv6 routing tables will be significantly smaller than IPv4 ones.

The debate is about whether IPv6 addresses will be hierarchical as intended, with prefixes for an AS all subsumed by one or more ISP prefixes, or whether AS owners will buy globally routable IPv6 prefixes, which can be quite cheap.

AS owners may want globally routable prefixes in two cases, when they have multiple uplinks to different ISPs, and to avoid changing IPv6 addresses when switching ISPs (for example in the case of corporate mergers and ISP consolidation).

The first case of multiple ISPs is weak, because it is possible to advertise a prefix from one ISP through another (usually for a fee). The second case is where things get more difficult, because purely hierarchical addressing does require renumbering when leaving one branch of the hierarchy and joining another branch.

But in theory such renumbering should not be an issue, as anyhow IP addresses are just numbers, like Ethernet addresses, and should not be imbued with any special significance, as they are usually invisible to both applications and users, as both as a rule identify services by DNS entries.

However in practice IP addresses, and even Ethernet addresses, do get imbued with special meanings, for example as authentication tokens, or are not managed as flexibly as the DNS for example.

Therefore the demand for globally rather than hierarchically routable prefixes. Or for IPv6 NPT to be applied to ULA prefixes, which may be rather worse.

Fortunately however for now as per one of the links above there are around 10 IPv4 prefixes per each of the 40,000 active AS numbers, while for IPv6 that seems to be 0.2 prefixes, as most ASes don't advertise IPv6 prefixes.

The prediction that every AS will eventually advertise to the global Internet routing table at least one IPv6 prefix is not correct if hierarchical addressing is used, and indeed that is the main reason to use a hierarchical prefix scheme. Because while each AS may well advertise one or more IPv6 prefixes, almost all will only advertise them to their uplink providers, which will not readvertise them individually.

Right now it is not clear to anybody how this will play out. Those who think that there will be an explosion of routes like for IPv4 point at how cheap a globally routable IPv6 prefix is, and how convenient it is for someone buying one to squander Internet router resources for which they are not billed.

130817 Sat: Chromium consumes 100% CPU on Google sites

I was amused by the somewhat belated realization by a blogger that Chromium consumes 100% CPU on Google sites because that is exactly what has been happening for a long time. Most browsers and sites are designed for the convenience of site publishers, not that of browser users and that convenience means capturing and driving the attention of those users to sell something to them.

Therefore as soon as the browser lets the site owner run some code, that permission will be used to the hilt to distract the browser user from reading the content they are interested in and draw their attention to sales related content. Whether it is JavaScript, Flash or HTML5.

Because after all the processing power needed to run that is free to the site owner, and therefore they have a strong incentive to use as much as they can to make their pitch; plus it is free to the people hired by the site owner to develop the site, while writing efficient code costs them money.

Therefore there are browser modules like NoScript to disable JavaScript and Flash. Unfortunately many site owners try hard to deliver the main content of a site using the same code that runs the sales content so that disabling code in web pages makes them unreadable.

130726 Fri: VLAN tagging, broadcast and ordinary addresses

It just occurred to me that the extra 4 bytes for the VLAN tag in a tagged Ethernet frame are not necessary for frames addressed to ordinary addresses, because these are specified to be globally unique, and it does not matter in which specific VLAN they are, at least as to the primary function of VLAN tagging, which is to create limited broadcast domains.

Only broadcast frames need to carry a tag that matches those of the ports they are intended for, as they only need to be forwarded to the ports with the same tag, and since Ethernet broadcast frames are such by virtue of a single bit in the address, there is plenty of broadcast address space to put a tag in without extending the frame format.

But there is already a case for frame addresses with the broadcast bit on and the rest not all ones: multicast frames. VLANs could have then been defined or can now be reinterpreted as multicast groups.
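To make the idea concrete: the bit that marks a group (broadcast or multicast) address is the least significant bit of the first octet. The encoding of a VLAN id into a group address below is entirely hypothetical, made up only to show that there is room for a 12-bit tag in the address itself.

```python
def parse_mac(text: str) -> bytes:
    octets = bytes(int(part, 16) for part in text.split(":"))
    if len(octets) != 6:
        raise ValueError("an Ethernet address has 6 octets")
    return octets

def is_group_address(mac: bytes) -> bool:
    """The group bit is the least significant bit of the first octet."""
    return bool(mac[0] & 0x01)

def vlan_group_address(vlan_id: int) -> bytes:
    """HYPOTHETICAL encoding, not any standard: a locally administered
    group address whose last two octets carry a 12-bit VLAN id."""
    if not 0 <= vlan_id < 4096:
        raise ValueError("VLAN ids are 12 bits")
    # 0x03 = group bit (0x01) plus locally administered bit (0x02).
    return bytes([0x03, 0x00, 0x00, 0x00]) + vlan_id.to_bytes(2, "big")

print(is_group_address(parse_mac("ff:ff:ff:ff:ff:ff")))
print(is_group_address(parse_mac("00:11:22:33:44:55")))
print(vlan_group_address(100).hex(":"))
```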

The only real difference between a multicast group and using tagged frames, apart from the syntactic difference of an extra few bytes of address, is that Ethernet addresses in different VLANs are not mutually visible. But this matters only if one uses Ethernet addresses or VLAN tags as authentication or security tokens, which is popular because it is expedient, but authentication or security should better be handled at a much higher level.

130713 Sat: Hardware routing, switched internets, and VLANs

In the previous entry I referred to 6to4 as a fairly cheap way to work around the lack of IPv6 hardware routing in some less recent routing products.

Some time ago discussing the merits of switched and routed internets I was astonished to hear that someone reckoned that enterprise switches also capable of routing could not route IPv4 in hardware and that was why switched internets were common.

Note: Hardware routing is a major issue because switches and routers are usually designed as embedded systems, with a low wattage, rather slow CPU with very limited memory, controlling one or more hardware modules doing the switching among modules over a dedicated backbone bus. The hardware modules have forwarding tables in dedicated memory, and dedicated chips that extract addresses from frames or packets and look them up in the hardware tables.

That used to be true many years ago: IPv4 packets have a TTL header field that must be decremented by 1 every time they pass through a router, and a checksum field that must be recomputed whenever a header field is modified, and doing that in realtime using hardware logic used to be a difficult problem, but that problem was solved over ten years ago.

Note: The hard part is recomputing the checksum, and therefore IPv6 was designed without a header checksum, to facilitate hardware routing.
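What a router has to do to each IPv4 header can be sketched as follows: decrement the TTL and fix the checksum, either by summing the whole header again or by adjusting the old checksum for the one 16-bit word that changed (the form given in RFC 1624). The sample header and the function names are only for illustration.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit big-endian words with end-around carry."""
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def route_full(header: bytes) -> bytes:
    """Decrement the TTL (byte 8) and recompute the checksum from scratch."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("TTL expired, the packet must be dropped")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"  # the checksum field is zero while summing
    hdr[10:12] = struct.pack("!H", ~ones_complement_sum(bytes(hdr)) & 0xFFFF)
    return bytes(hdr)

def route_incremental(header: bytes) -> bytes:
    """Decrement the TTL and update the checksum from the changed word only."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        raise ValueError("TTL expired, the packet must be dropped")
    (old_word,) = struct.unpack("!H", hdr[8:10])   # TTL and protocol
    (old_sum,) = struct.unpack("!H", hdr[10:12])
    hdr[8] -= 1
    (new_word,) = struct.unpack("!H", hdr[8:10])
    # HC' = ~(~HC + ~m + m') in one's-complement arithmetic.
    folded = ones_complement_sum(
        struct.pack("!HHH", ~old_sum & 0xFFFF, ~old_word & 0xFFFF, new_word)
    )
    hdr[10:12] = struct.pack("!H", ~folded & 0xFFFF)
    return bytes(hdr)

# A 20-byte sample header with TTL 0x40 and a valid checksum.
SAMPLE = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(route_full(SAMPLE).hex())
print(route_incremental(SAMPLE) == route_full(SAMPLE))
```

The incremental form needs only a few 16-bit additions per packet, which is the sort of thing that is easy to put in hardware logic.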

When routers could not route at line speed because it was done in software but switches could switch at line speed there was also a performance reason for having switched internets instead of routed ones.

So flooding and then updating of switch forwarding tables was seen as the cheap alternative to routing, either static or with RIP or OSPF (optionally having enabled ECMP); plus it enabled internet-ing of popular protocols that would not support routing, and it allowed spreading IPv4 subnets over multiple switches, enabling the use of IP addresses for purposes unrelated to networking such as access control.

Fortunately for several years enterprise-level so-called level-3 switches (routers with an embedded switch) have been able to route in hardware at the same speed as switching.

The only major difference is that router routing tables tend to be smaller than switch forwarding tables; mid-range products tend to have routing tables capable of handling hundreds to thousands of routes, while their forwarding tables can handle dozens of thousands of Ethernet address forwardings.

This difference means that in some site internets it is not possible to use single-host routes for all nodes, which sounds extreme but is the equivalent of switching holding a forwarding table entry for all Ethernet addresses in the internet.

Because inter-switch frame forwarding is layer 2 routing, as some form of routing must be involved in a network made of multiple networks, and each switch defines one (or more than one) network.

As previously noted at length, I think that level 2 inter switch frame forwarding was not a good idea many years ago but it might have been expedient because of performance reasons.

Currently the performance reasons no longer apply and if the rather unnecessary evil of location independent IP addresses is desired it is possible to use single-host routes instead of relying entirely on per-Ethernet address forwarding tables.

Similarly autoconfiguration and resilience can be handled with a suitable OSPF and ECMP configuration and even as to this Ethernet frame forwarding can be avoided; especially as flooding-and-learning of individual Ethernet addresses is a fairly poor routing algorithm, especially compared to OSPF and ECMP.
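For reference, flooding-and-learning amounts to very little, as this toy model of a single switch shows. It is a sketch of the general technique only; the class is hypothetical and leaves out entry ageing, VLANs and spanning tree.

```python
class LearningSwitch:
    """Toy flooding-and-learning switch."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.table = {}  # Ethernet address -> port it was last seen on

    def receive(self, port, src, dst):
        """Return the set of ports the frame is sent out of."""
        self.table[src] = port           # learn where the sender is
        out = self.table.get(dst)
        if out is None:
            return self.ports - {port}   # unknown destination: flood
        return set() if out == port else {out}

switch = LearningSwitch([1, 2, 3])
print(switch.receive(1, "A", "B"))  # B not yet seen, so the frame is flooded
print(switch.receive(2, "B", "A"))  # A was learned on port 1
print(switch.receive(1, "A", "B"))  # B is now known on port 2
```

The table ends up with one entry per individual address seen, with no aggregation at all, and every unknown destination costs a flood.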

130712 Fri: 6to4 to 6to4 dual-protocol nodes, 6to4 routers, 6to4 speed

I previously wrote that setting up 6to4 tunnel interfaces is best done on Linux by setting up two distinct ones: one for the 6to4 prefix itself, as if it were a local network, for direct packet exchange between 6to4 nodes, and another one for the rest of the IPv6 address space, giving as tunnel endpoint that of a 6to4 gateway. I have also shown how to do this with Debian style /etc/network/interfaces configuration files.

Apparently something that is not obvious is that at the very least the 6to4 to 6to4 tunnel interface with its associated 2002::/16 route should be set up by default if one wants IPv6 connectivity and the node has a publicly routable IPv4 address, even if the node already has a native IPv6 address with associated default (::/0 or 2000::/3) route.

Because it means that traffic with other 6to4 nodes does not need to go through an IPv6 router or 6to4 forward or reverse gateway at any point, and the cost of 6to4 encapsulation is quite small.

In effect 6to4 is so useful that it should always be set up if the node has a publicly routable IPv4 address and IPv6 connectivity is desired, and especially if it is a client, as clients usually are relatively low traffic anyway. It can mean avoiding setting up a native IPv6 address (if a 6to4 relay gateway or interface is also set up), and each 6to4 address gives a free /48 subnet.

Which means that an IPv4 router for an IPv4 LAN can be turned into an IPv6 router for the associated /48 practically for free, and the nodes in that IPv6 prefix don't need to be 6to4 nodes themselves, just native IPv6, even if they should be because of the previous argument.
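How the free /48 follows from the IPv4 address can be sketched with the Python ipaddress module: the 32 bits of the IPv4 address are placed right after the 2002::/16 prefix. The function name is made up for the example, and the address used is just a sample.

```python
import ipaddress

def sixtofour_prefix(ipv4: str) -> ipaddress.IPv6Network:
    """The 6to4 /48 for an IPv4 address: 2002, then the 32 address bits."""
    v4 = ipaddress.IPv4Address(ipv4)
    base = (0x2002 << 112) | (int(v4) << 80)
    return ipaddress.IPv6Network((base, 48))

prefix = sixtofour_prefix("192.168.1.34")
print(prefix)

# Every 6to4 prefix falls under the single 2002::/16 route.
print(prefix.subnet_of(ipaddress.ip_network("2002::/16")))

# The mapping is reversible: the IPv4 tunnel endpoint is in the address.
print(ipaddress.ip_address("2002:c0a8:122::1").sixtofour)

# A /48 leaves 16 bits to number /64 subnets.
print(2 ** (64 - prefix.prefixlen), "possible /64 subnets")
```

This is why 6to4 nodes can exchange packets directly: the IPv4 address to send the encapsulated packet to can be read off the destination IPv6 address.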

For a server with an expected significant traffic load the choice between both an IPv6 and 6to4 address or just a 6to4 address depends on whether the router upstream can function as a gateway between IPv6 and 6to4 at high speed, because then there is no need to have a native IPv6 address: native IPv6 packets are routed to the upstream router, which then just encapsulates them 6to4 and sends them to the server, and viceversa.

This can be a very attractive option in one not so uncommon case: when one has a very optimized, highly reliable IPv4 campus net, but one that is expensive to reconfigure for IPv6 or that cannot handle IPv6 packets at all or at the same speed, for example because most routers don't have hardware routing for IPv6 or IPv6 routing is (unconscionably) an expensive additional option.

In this case one only needs to purchase or upgrade 6to4 border routers, accepting IPv6 traffic incoming, or 6to4 traffic outgoing, and the rest of the network can remain cheaply IPv4 without much compromise. Thanks also to the wonders of unicast routes (/32 for IPv4 and /128 for IPv6) and ECMP the 6to4 gateway routes can be transparently multiple and load balancing (or not, depending on route priorities).

In this way one trades the cost, in terms of engineering effort, hardware upgrades, license upgrades of routing IPv6 natively across a perhaps vast local internetwork for the relatively cheap cost of 6to4 encapsulation at a few border routers or LAN routers, or at any leaf nodes, which can be an attractive proposition.

How cheap is 6to4? For one data point we can compare native IPv4 vs. 6to4 transfer rates and CPU times on a 1Gb/s LAN:

# nuttcp -4 -t 192.168.1.34
  836.4609 MB /  10.06 sec =  697.4131 Mbps 5 %TX 36 %RX 0 retrans 0.31 msRTT
# nuttcp -6 -t fd00:c0a8:100::1
  741.4141 MB /  10.07 sec =  617.3787 Mbps 5 %TX 53 %RX 0 retrans 0.29 msRTT
# nuttcp -6 -t 2002:c0a8:122::
  718.3125 MB /  10.04 sec =  600.2773 Mbps 29 %TX 46 %RX 0 retrans 0.35 msRTT

# nuttcp -6 -r 2002:c0a8:122::
  602.0693 MB /  10.04 sec =  503.1609 Mbps 21 %TX 69 %RX 0 retrans 0.35 msRTT
# nuttcp -6 -r fd00:c0a8:100::1
  499.7539 MB /  10.05 sec =  417.1046 Mbps 12 %TX 57 %RX 0 retrans 0.34 msRTT
# nuttcp -4 -r 192.168.1.34
  691.3633 MB /  10.09 sec =  574.8813 Mbps 11 %TX 83 %RX 0 retrans 0.34 msRTT

As the unexciting native IPv4 rates indicate this is not a very high speed LAN setup, but the 6to4 numbers are within 10-15% of the native IPv4 numbers even if the CPU usage is rather higher, and a significant part of that can be attributed to the higher cost of processing IPv6 packets, as the native IPv6 transfer rates show (those using a non canonical form of fd00::/7 unique local addresses).

Note: it is unexpected that the receiving rates for native IPv6 traffic are significantly lower than those of 6to4 packets, but I suspect that is due to the IPv4 code being more tuned for latency critical work like receiving than IPv6 code.

Note: The two nodes involved are an i3 laptop and a Phenom X3 desktop in low power mode, maximum CPU speeds of 2.4GHz and 2.8GHz, and both have very consumer grade and somewhat old network interface chips, and so is the switch chipset.
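The relative differences in the nuttcp figures above can be recomputed with a few lines. The rates are copied from the output shown; the grouping into "transmit" (-t) and "receive" (-r) and the function name are just for the example.

```python
# Transfer rates in Mbps, copied from the nuttcp output above.
rates_mbps = {
    "transmit": {"IPv4": 697.4131, "native IPv6": 617.3787, "6to4": 600.2773},
    "receive": {"IPv4": 574.8813, "native IPv6": 417.1046, "6to4": 503.1609},
}

def percent_below_ipv4(direction: str, kind: str) -> float:
    """How much lower a rate is than the native IPv4 rate, in percent."""
    table = rates_mbps[direction]
    return 100 * (1 - table[kind] / table["IPv4"])

for direction in rates_mbps:
    for kind in ("native IPv6", "6to4"):
        pct = percent_below_ipv4(direction, kind)
        print(f"{direction:8} {kind:11} {pct:5.1f}% below native IPv4")
```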

This scheme can be extended in another way: it is probably possible to publish routes to native IPv6 addresses via 6to4 router addresses, giving the possibility of nodes with native IPv6 addresses yet accessible only via an established, well tuned IPv4-only network infrastructure.

Sure, it would be better to avoid encapsulation entirely, and be able to afford upgrading a whole router mesh to support native IPv6 at full speed, but that may require a very large investment of engineering effort and hardware and license spending.

In particular because it is somewhat unlikely that anytime soon IPv6-only servers will be set up, having servers that are 6to4-only or have both native IPv6 addresses and 6to4 prefix-only addresses can be a good idea.

130704 Thu: Well done comparative test of recent flash SSD products

I have just read a (rarely) well-done test of a recent and well regarded flash SSD device, the Toshiba THNSNH.

Among the more interesting aspects is the extensive discussion of the particularities of the flash SSD performance envelope, for example:

One area of performance that isn't mentioned often is the condition referred to as "steady state." This is the performance that will be experienced after an extended period of time with the SSD. We will be using Iometer testing to show the difference between a brand new shiny SSD, and that same SSD after it is loaded with data and subjected to continuous use over a period of time.

We will also use SNIA guidelines to place the SSD into steady state, and then test with our trace applications. Steady State trace-based testing will illuminate a bit of the difference between actual application performance after the SSD has been used for a period of time.

Another is the illuminating comparison of various drives as to minimum, average and maximum figures of merit, and well chosen ones, for example in particular maximum read latency:

	Max Read (ms)
256GB OCZ Vector	0.812
256GB Samsung 840 Pro	0.819
480GB Crucial M500	1.554
250GB Samsung 840 TLC	1.710
256GB Toshiba THNSNH	0.716

and maximum write latency:

	Max Write (ms)
256GB OCZ Vector	0.784
256GB Samsung 840 Pro	0.720
480GB Crucial M500	3.414
250GB Samsung 840 TLC	0.735
256GB Toshiba THNSNH	13.540

It would be interesting to see maximum latencies in the case of interleaved reads and writes, and I have seen tests that include those, but even just reads and just writes are highly informative.

The test also shows the difference between performance when new and used:

	256GB Toshiba THNSNH	250GB Samsung 840 TLC	480GB Crucial M500	256GB Samsung 840 Pro	256GB OCZ Vector
Steady (ms)	53,890	46,400	38,865	61,348	49,221
Fresh (ms)	72,778	78,981	86,721	89,028	86,013

130702 Tue: Too many options difficult to test

Someone I know that is still a developer in a first-world country prefers not to change options from defaults when configuring applications because he thinks that applications only get tested with defaults.

His implicit expectation is that applications get written haphazardly, so they don't work at all to start with, in the sense that they are fundamentally broken, and then after many cycles of testing what gets tested is made to work, but only as far as it is tested, because most managers responsible for a software product don't see any point in wasting budget trying to get to work something that is not tested.

Admittedly I have seen quite a few cases where software gets written broken and then parts of it are made to work, instead of being written right with a few mistakes here and there.

It has worried me a bit to see that someone is worried that ext4 has many options:

I don't recommend nodelalloc just because I don't know that it's thoroughly tested. Anything that's not the default needs explicit and careful test coverage to be sure that regressions etc. aren't popping up.

(One of ext4's weaknesses, IMHO, is its infinite matrix of options, with wildly different behaviors. It's more a filesystem multiplexer than a filesystem itself. ;) Add enough knobs and there's no way you can get coverage of all combinations.)

I reckon that in large part the ext[23] code has been written more carefully than haphazardly slapped together, but it is older code that has been extended many times and the number of options is a reflection of that and can have some different bad consequences:

Many maintainers over time tend to reduce the architectural integrity of the code, if it had one to start with, and this can indeed introduce mistakes.
It becomes difficult for users to understand how the application will behave, both in terms of semantics and performance (see the sdrawkcab article) or documenting it can become too difficult.