Software and hardware annotations 2007 April

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


070429b DVD-RAM, packet writing and shoddiness
During a recent trip to Hamburg I found a shop selling exotic DVD media (and even Blu-ray media, both writable and rewritable), and in particular I got this nice 8cm dual sided DVD-RAM disc. I have some regular 12cm single sided DVD-RAM discs, but I have not used them recently with Linux, as kernel CD driver support for writable formats (DVD-RW, DVD+RW, DVD-RAM) has not really been working for a long time. However I have also recently switched to kernel version 2.6.21 and miraculously now writing works.
Even better, the packet CD/DVD block layer works too. This is particularly important for writable DVD formats, as they have a hardware block size of 32KiB, but an advertised one of 2KiB, and writing data using 2KiB blocks is very inefficient as it involves read-modify-write; the pktcdvd driver ensures that data gets written in 32KiB natural blocks. Despite previous experiences that this too was not quite working for a long time, it seems to have been fixed too; even more miraculously the UDF filesystem layer works too.
However performance is a bit disappointing: copying a filesystem tree of 860MB with around 30 directories containing 6,200 files was done at an average of 1.6MB/s. Top write speed is around 2MB/s even with pktcdvd, and it can be much worse than that, especially when writing trees of small files, in part because DVD-RAM seek times are especially slow. However sequential rates are:
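To make the read-modify-write cost concrete, here is a minimal sketch of a naive model in Python. It assumes sequential writes, no caching and no coalescing anywhere in the driver or the drive, which is a simplification of mine and not a description of what the kernel CD driver or pktcdvd actually do; the function name and the counting rule are illustrative only.

```python
HW_BLOCK = 32 * 1024  # hardware block size of writable DVD formats, as stated in the text


def transfers(total_bytes, write_size, hw_block=HW_BLOCK):
    """Return (hardware_reads, hardware_writes) for a sequential write.

    Naive model (an assumption): every logical write is sent to the drive
    on its own, and a hardware block that is only partly covered by a
    write must first be read back, patched, and then rewritten.
    """
    reads = 0
    writes = 0
    offset = 0
    while offset < total_bytes:
        length = min(write_size, total_bytes - offset)
        first = offset // hw_block
        last = (offset + length - 1) // hw_block
        for block in range(first, last + 1):
            block_start = block * hw_block
            block_end = block_start + hw_block
            covers_whole = offset <= block_start and offset + length >= block_end
            if not covers_whole:
                reads += 1  # read-modify-write: fetch the old contents first
            writes += 1
        offset += length
    return reads, writes
```

Under this model writing 128KiB in 2KiB pieces costs 64 hardware reads and 64 hardware writes, while writing it in 32KiB pieces costs 4 hardware writes and no reads.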
Mode   Block size  MB/s /dev/hdd  MB/s /dev/pktcdvd/1
write  2KiB        1.7            0.8
write  32KiB       1.8            1.3
read   2KiB        2.7            2.7
read   32KiB       2.7            2.7
Ahhhh, rather surprising. It looks like the CD driver already does 32KiB blocking, and the pktcdvd one does not (it is not surprising that write rates are much lower than read rates: DVD-RAM drives do verify-after-write).
Another disappointment is the usual shoddy quality of software, as the pktsetup utility that binds or unbinds the packet driver to the underlying CD/DVD devices has some classical Microsoftisms:
070429 Promise eSATA not quite hotplug under Linux
So I bought a nice eSATA card and disk and I have already had a disappointment: the Linux driver for it is not quite hotplug. In order to trigger a rescan of the attached devices one must, like with so many Linux drivers, compile the driver as a module and unload and reload it.
In this USB2 and Firewire and their drivers have the advantage: they get notified when devices are attached and detached.
070426c Fedora requires the root filesystem to be called "/"
Having just written about annoying drive letter changes for the MS Windows boot partition I have to add that as part of the continuing Microsoft cultural hegemony someone at Fedora has decided to emulate them, only worse: Fedora will not boot using a Fedora prebuilt kernel unless the root filesystem is labelled /, and that label of course should be unique. The reason is that / is hard coded for mounting the root filesystem in the initrd filesystem. This means that one cannot have multiple partitions with multiple versions of Fedora, unless the obvious fix is applied: to build a custom kernel to which one can specify a root= argument that then does not get ignored.
070426b Best way to change MS Windows boot drive letter is via registry
Having recently added a new hard drive host adapter, my MS Windows 2000 has reset drive letter assignments, including that for the boot volume (note that under MS Windows the boot volume is the one from which MS Windows itself is loaded, equivalent to the UNIX root filesystem, while the system volume is the one that contains the boot loader, equivalent to the UNIX boot filesystem). If one has applied the essential registry fix for the MS Windows login problem when the boot volume letter has changed, then the easiest method to change it back to what it was is to directly edit the drive letter registry keys.
070426 Vibrations cause hard disk to die or to run twice as slow
Chatting about a perception of decreasing computer reliability, a smart systems person told me about his latest discoveries as to why some of his systems need the hard disk replaced every few months and others have unusually low hard disk performance: in both cases vibration is the cause.
Some of the systems have five fans on the motherboard, and this makes the motherboard vibrate, and those vibrations get transmitted via the SATA cable to the hard disk, which then gets shaken until it dies after a couple of months. In other systems the vibrations are transmitted to the motherboard, and then much attenuated to the hard disk, where they cause the head and arm to misposition or take longer to settle. In both cases putting the disk on a softer, vibration absorbing base cured the problem: in the first type of system the disk no longer dies after a couple of months, in the second the disk performance has nearly doubled. Quite interesting, especially as cooling and fans seem to be an ever increasing need, and also as these are not cheap assemblages of random bits, but systems from major integrators or manufacturers.
070424 Laptop monitor by default causes extra power draw
I have just been using a laptop unplugged for some period of time, running Fedora 6. With a bit of care it is possible to reduce considerably the amount of power drawn, but then I discovered that something caused the disk to spin up 2-3 times a minute. With a bit of logging thanks to /proc/sys/vm/block_dump the source was ironically kcmlaptop, the KDE applet that (mostly) monitors the battery status: by default it checks the battery status every 20 seconds, and writes its status to its kcmlaptoprc file. Changing the battery check interval to something like 5-10 minutes mostly solves the issue. However saving the battery status to a file seems to me rather pointless.
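The logging mentioned above can be summarized with a few lines of Python. This is a minimal sketch: it assumes the kernel log has been saved as text after enabling /proc/sys/vm/block_dump, and that the messages look like "name(pid): WRITE block N on dev" or "name(pid): dirtied inode N (file) on dev". The sample lines, process IDs and device names below are made up for illustration, not taken from my logs.

```python
import re
from collections import Counter

# Matches block_dump style kernel messages, with or without a syslog prefix.
LINE = re.compile(
    r"(?P<comm>[^\s(]+)\((?P<pid>\d+)\): "
    r"(?P<what>READ|WRITE|dirtied) .* on (?P<dev>\S+)"
)


def tally_writers(lines):
    """Count WRITE and 'dirtied inode' events per process name."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group("what") != "READ":
            counts[match.group("comm")] += 1
    return counts


# Hypothetical sample log lines, for illustration only.
SAMPLE = [
    "kernel: kcmlaptop(4321): dirtied inode 98765 (kcmlaptoprc) on hda2",
    "kernel: pdflush(150): WRITE block 123456 on hda2",
    "kernel: kcmlaptop(4321): dirtied inode 98766 (kcmlaptoprc.new) on hda2",
    "kernel: bash(5000): READ block 4242 on hda2",
]

if __name__ == "__main__":
    for name, count in tally_writers(SAMPLE).most_common():
        print(name, count)
```

Sorting by count puts the process that keeps dirtying files at the top, which is how a culprit like the battery applet stands out.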
070419 External SATA delivers, positional device naming dead
Having had a look at eSATA recently I have then bought for my main home PC some SATA-II and eSATA stuff: a Promise TX4302 PCI card with 2 eSATA and 2 SATA-II ports, a WD MyBook Premium ES 500GB SATA hard disc, and two internal SATA-II 500GB hard discs, from Seagate and Hitachi. These are bulk archives of easily replaced data (mostly GNU/Linux distribution ISO images and free software and data downloads), so they are arranged as an active internal disk, an internal nightly automatic backup, and an external weekly backup.
The TX4 card was a bit expensive, but it is pretty nice and has a nice balance between external and internal ports (my motherboard is old and only has two SATA ports), and is well supported by a Linux driver. I got the retail instead of the OEM package as the price difference was small, and the retail package includes 2 eSATA cables which are not easy to find. The MyBook ES external drive was chosen as it has both eSATA and USB2 ports, which means I can use eSATA for speed and fall back on USB2 if I carry it around.
Well, the eSATA arrangement just worked, and the speed was remarkable: on the outer tracks hdparm -t reported a bulk sequential read speed of over 80MB/s, same as for internal disks. A remarkable event: to have the same high speed interface for both external and internal peripherals, a great convenience which I last experienced with parallel SCSI many years ago (even if I still have a SCSI-2 host adapter in my PC, with both external and internal sockets). I also have a Firewire host adapter with both internal and external sockets, but internal native Firewire peripherals did not quite happen. It is fortunate that eSATA is here, but it would have been a lot better if Firewire had emerged instead, especially given the excellent performance recent chipsets have even for non-native peripherals.
There are some disappointments though, mostly that under GNU/Linux eSATA is not quite hotpluggable yet, or at least it did not seem so to me in a quick test, as apparently the SATA drivers in the SCSI subsystem don't quite support rescanning. Other than that there is the usual issue: that the order in which the various host adapters and devices get enumerated is arbitrary and somewhat unpredictable. It is particularly annoying that the Promise TX4 card enumerates first the two external sockets of its four; combined with the Linux kernel listing SCSI peripherals only if they are present (unlike for ATA peripherals) this means that internal SATA drives get a different drive letter depending on the number of external drives active.
Positional device naming is indeed virtually impossible because of this, and the only way out is to name devices by some unique ID. I'll have to change to that soon. I have already largely reverted to using LILO instead of GRUB because the latter only identifies boot drives positionally, while LILO can use UUIDs; this is important as the BIOS and Linux can have very different ideas of the position of a peripheral (for example Linux lists peripherals strictly in order of PCI id attachment, unlike the BIOS). Yet another ancient UNIX tradition goes.
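As a rough illustration of naming by unique ID, the sketch below lists which device node each filesystem UUID currently resolves to. It assumes a system where udev populates /dev/disk/by-uuid with symlinks (an assumption about the setup, not something tested on the machine described here); the function name is mine.

```python
import os


def devices_by_uuid(directory="/dev/disk/by-uuid"):
    """Map each filesystem UUID to the device node its symlink points at.

    Assumes `directory` holds one symlink per filesystem, named after the
    UUID, as udev usually arranges. Returns an empty mapping if the
    directory does not exist.
    """
    mapping = {}
    if not os.path.isdir(directory):
        return mapping
    for name in sorted(os.listdir(directory)):
        link = os.path.join(directory, name)
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for uuid, device in devices_by_uuid().items():
        print(uuid, "->", device)
```

The point is that the UUID stays the same whether the internal disk shows up as the first, second or third SCSI device on a given boot; only the right hand side of the mapping changes.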
070409c 3com is nearly out of the Ethernet card market
In my search for 10gb/s and 1gb/s optical products I have also had a look at the 3com web site and on their front page they no longer list network cards as a product category. On the network cards page they only list IPsec accelerator 100mb/s cards and in their SMB product guide or enterprise product guide they only mention wireless cards and a firewall card. It is a clear sign of the trends of the industry, and even more amazing as 3com is the original Ethernet company, and their network cards have been for decades the standard to compare others to.
070409b Optical fibre network cards, availability and prices
Over the past few months I have been looking ahead at 10gb/s networking, and have taken a closer look at 1gb/s optical networking. Equipment for both is rather less common and rather more expensive than perhaps it should be, while 1gb/s copper networking equipment can be bought for cheap from corner shops nowadays.
It is easier to find switches for 10gb/s and 1gb/s optical networking than network cards, as evidently they are perceived as infrastructure options, with 1gb/s (or slower) copper links as the only needed option for computer links. Unfortunately this is not quite right, as storage and computer servers can find the speed of 1gb/s links a bit limiting (and I am pondering with a mix of worry and elation a request to achieve 400MB/s data collection write rates sustained for a day or so), and also I would like the cards to build a test machine, so I can peek and probe into inter-switch optical links for testing and monitoring purposes.
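A quick back-of-the-envelope check of why a 1gb/s link is limiting for that 400MB/s request, as a small Python sketch. It assumes decimal units (1MB = 10^6 bytes, 1gb = 10^9 bits) and ignores protocol overhead, so the real requirement is somewhat higher; the function names are illustrative.

```python
def link_gbps_needed(mb_per_s):
    """Raw link speed in gigabits per second needed for a payload rate in MB/s."""
    return mb_per_s * 8 / 1000


def volume_tb(mb_per_s, hours):
    """Total data volume in terabytes for a rate sustained over some hours."""
    return mb_per_s * 3600 * hours / 1_000_000


if __name__ == "__main__":
    print(link_gbps_needed(400))  # more than three 1gb/s links' worth
    print(volume_tb(400, 24))     # data written in one day
```

So 400MB/s needs about 3.2gb/s of raw link capacity, and sustaining it for a day means storing roughly 35TB.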
Also, I am only interested in long reach optical cards, 10GBASE-LR and 1000BASE-LX. This restricts things quite a bit, as there seems to be a bit more choice with multimode fibre and short reach frequencies, in part because existing fibre cable plant and equipment are mostly multimode and fit well within the 300m limit of short reach. But the future is single mode fibre, and long reach eliminates most considerations of reach (10km instead of 300m).
Anyhow these are the 1gb LX cards I could find:
Intel
1000BASE-LX PCI/PCI-X (product code PWLA8490LX).
and this is the list of 10gb/s cards:
Intel
S2IO also known as Neterion
Myrinet
Chelsio
Tehuti Networks
SolarFlare
HP
From the above the most notable thing is that there aren't many 1gb/s LX cards. One, as far as I know; there used to be at least another one, but it is no longer manufactured. It is also sad that all the 1gb/s optical cards I could find have a fixed transceiver, when they could instead have a nice SFP socket like most switches have. At least some 10gb/s cards do have the equivalent XFP socket. There is indeed a lot more availability of 10gb/s singlemode cards than 1gb/s singlemode, and I suspect this is because 1gb/s tends to be a legacy speed mostly for older multimode cable plants, and in any case it suffers from deadly competition from 1gb/s on CAT5e, which is so much cheaper for end nodes. Conversely, if one wants to do more than 100MB/s with a single server (and I have got to do that soon) there is little alternative to 10gb/s optical, for now (10GBASE-CX4 is a bit of a joke, and 10GBASE-T is not yet there).
Now, as to prices, I had a big surprise: those one gets from web sites via shopping search engines are ridiculously high, with 10gb/s cards (especially the Intel one) being listed for $4,000, while if one asks for quotes the prices quoted are a lot more reasonable (still pretty high). Probably the most attractive proposition is in the Myrinet published price list, where a 10GBASE-LR card can be purchased for around $1,700, and the card is PCIe and has pluggable XFP modules. I am also eagerly waiting to see the Chelsio prices for its 10GBASE-T products and a few others.
070409 Some disk reliability studies
At a recent conference two papers about hard drive reliability over a large population followed for some years were presented, one from Google Labs and another from CMU.
Some people have already posted some interesting comments, and what impressed me from those papers was:
070407 eSATA is out, good for servers too, FireWire 800 not yet dead
I was recently very impressed with FireWire 800 performance but I also mused that it was too late, as eSATA products were imminent. I have recently been looking for eSATA stuff, and found that there are now cheap 4-port eSATA cards (available with PCIe or PCI-X interfaces). Even cheaper if all one wants is two eSATA ports and PCI, and some motherboards now come with one built-in eSATA port.
Four ports is a good number, and one can always use a couple (or more) of these cards. This means that one can build fairly large servers with all external and hot-pluggable drives, just like in the old times with parallel SCSI chains, only much better, because there are now no terminators, hot pluggability is by default, and it is much, much cheaper. For example the nice 500GB WD drive can be currently bought for around £75+tax OEM bare and around £100+tax as either retail external USB2+Firewire 800 or retail external USB2+eSATA. The £25-30 difference is almost trivial, considering that one gets also a retail warranty and the external case.
Given that eSATA drives are now available one can build highly modular yet very powerful servers using SFF PCs, some of which are available in amazingly powerful configurations. The one downside to this is the increased number of external cables (both power and data), but that can be handled with tidiness and some cable dressing.
070406 Fontconfig on Fedora 6 and snappy application startup times
On a system with Fedora 6 I noticed unusually long application startup times. A bit of investigation and it appeared that there were no Fontconfig cache files, and each application had to scan all font files in directories mentioned in /etc/fonts/ files.
The reason seems to be that Fedora 6 has a buggy fc-cache command which complains that write cache file failed even if there is no reason for that. So I downloaded the Fedora 7 version of the fontconfig RPM and its fc-cache command just worked.
Just one of those cases where caching and hinting mean that things continue to appear to work even if they are broken.
070401 Simplicity by deliberate design and ancient texts
Thinking again about simplicity, an ancient paper should be mentioned that has had a profound impact on my attitude to system design,
%A Peter J. Denning
%T Why our approach to performance evaluation is SDRAWKCAB
%J ACM SIGMETRICS Performance Evaluation Review
%V 2
%D SEP 1973
%P 13-16
where a very good argument is made about performance evaluation, which is of more general application. It is that instead of creating complex systems and then trying to create complex performance models for them, it is preferable to design systems with the goal that they have a simple performance model. The author of the paper notably contributed an example of this with the working set page replacement policy, which has a single very simple tunable and a simple performance model. Too bad that many contemporary systems have bewilderingly complex sets of tunables and complex multidimensional performance models.
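To show how little there is to tune, here is a minimal sketch in Python of a working set style policy over a page reference string. It is my own simplified rendering, not code from the paper: time is counted in references, the single tunable is the window size, and a page stays resident only if it was referenced within the last `window` references.

```python
def working_set_faults(references, window):
    """Simulate a working set policy; return (page_faults, mean_resident_pages).

    `references` is a sequence of page identifiers, `window` the single
    tunable: how many references back a page may have last been used and
    still count as part of the working set.
    """
    last_use = {}  # page -> virtual time of its most recent reference
    faults = 0
    resident_total = 0
    for now, page in enumerate(references):
        # Pages not referenced within the window leave the working set.
        for stale in [p for p, t in last_use.items() if now - t > window]:
            del last_use[stale]
        if page not in last_use:
            faults += 1
        last_use[page] = now
        resident_total += len(last_use)
    mean_resident = resident_total / len(references) if references else 0.0
    return faults, mean_resident
```

For the reference string 1, 2, 1, 3, 1, 2 a window of 0 gives 6 faults, a window of 2 gives 4, and a window of 10 gives 3 (one per distinct page): a larger window never increases the fault count, it only costs more resident memory, and that tradeoff is the whole performance model.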
However the generalization of the idea, that instead of trying to solve complex problems it is better if possible to adjust requirements to create more general simpler ones, has become one of my principles. Things that are simpler to understand tend to work better, because as Edsger Dijkstra well argued in Notes on Structured Programming the main problem with complex designs is our inability to do much. Dijkstra therefore argued for appropriate simplicity, partitioning of problems into parts and disciplined thinking for a long time, and I am not surprised by the title of a retrospective on his work:

To mark the occasion of Dijkstra's retirement in November 1999 from the Schlumberger Centennial Chair in Computer Sciences, which he had occupied since 1984, and to celebrate his forty-plus years of seminal contributions to computing science, the Department of Computer Sciences organized a symposium, In Pursuit of Simplicity which took place on his birthday in May 2000.

Some relevant quotes by Dijkstra:

Simplicity is prerequisite for reliability.

The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task with full humility, and avoids clever tricks like the plague.

How do we convince people that in programming simplicity and clarity -- in short: what mathematicians call "elegance" -- are not a dispensable luxury, but a crucial matter that decides between success and failure.

All lost in time, which reminds me that once I attended a lecture by D. E. Knuth on the history of structured programming and he mentioned that the wonderful classic mentioned above:
%A E. W. Dijkstra
%A C. A. R. Hoare
%A O. J. Dahl
%T Structured programming
%I ACPRESS
%D 1972
was still in print, and then sadly added that when he asked how many copies were being sold, the publisher said that on average three copies per year were being sold (I bought one a long time ago). Which reminds me of this somewhat sad but apposite cartoon.