Software and hardware annotations q4 2005

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

December 2005

051227b

I am still thinking about which hard drives to buy for my desktop PC (1 internal to use, 3 backups, of which 1 internal and 2 external).
For the main hard drive that will go into my desktop PC I care about seek times and noise and I dont care about power draw for it and the internal backup drive; But for the external backup drives I care only about power draw, but not noise or seek times (the external case is already noisy, and anyhow it is going to be powered up rarely, and performance over USB2/FW is already poor).
I also want to buy 4 drives from 4 different manufacturers or at least product families (and I shall split my order between two suppliers), to minimize the risk that the main drive and its 3 backups all share the same defects.
The problem is that the HGST and Maxtor drives are both low power and have the lowest seek times, which means they fit well my requirements for both internal an external drives, but getting only two brands goes against my diversification of risk principle.
In the end I shall probably buy two Maxtor drives, one as main internal, and one as external backup, a HGST for the other external backup, and the Seagate one for internal backup, as it is quieter than the WD one.

051227

What's wrong with Linux swapping? Well, almost everything, in particular page clustering, but also for example pitiful swap-in thruput. When I use some program that grows quite a bit, like Konqueror or UT2004, other processes get paged out, and when I want to use them again they get paged in, at something like 1MiB/s on a disc capable of around 40MiB. Listening to the noise my discs make the cause is obvious: lots and lots of arm movement, which means the swapped-out pages are roughly in random order in the swap partition.
But this is astonishing! How can it be? Disc space is much cheaper than disc bandwidth or memory space, and there is reason to preallocate contiguous stretches of swap space ahead of pages being paged out, so that on paging in they will be neatly clustered nearby.
Traditional UNIX systems reserve swap space for every memory segment allocated anyhow, even if it is never going to be used, so that swap space must be at least as large as main memory. Linux does not, but the kernel could at least do some preallocation.
Of course the main issue here is that ideally the Linux kernel would swap process working sets in and out, not just individual pages, as the former can be done quickly in large transfers. Or at least swap mappings in and out.
But we can dream: I would be very surprised if many kernel developers ever experienced swapping nowadays on their well configured systems (it is very easy to say that memory is cheap when it is paid for by your employer) and most benchmarks marketing departments care about can be run with as much memory is needed to avoid swapping.

051226e

Time passes and I had meant to do some speed tests to illustrate an idea that I have mentioned some months ago in a message to the JFS mailing list, that one can get much better performance out of mass market, large capacity ATA drives, by using just the outer 1/2 or 1/3 of the cylinders in a disc drive. This is in part because the outer tracks have higher linear density, but mostly because restricting arm movement can deliver much smaller maximum and average seek times. It won't help with rotation speed and thus angular latency, but the advantage of higher linear density and reduced seek times are not trivial.
Indeed it seems that the server oriented high speed disc drives available have capacities that are much smaller than those of desktop ones in large part because pf a similar reasoning; in particular those with a higher rotation speed (10,000 or 15,000 RPM) seem to have 2.5" platters even if the casing is still 3.5".
A five platter 400GB hard drive currently costs around US$220 and a 147GB enterprise drive costs around US$380. So using just a third of the 400GB drive would be still way cheaper than the SCSI drive. The latter has advantages, like the 10,000 instead of 7,200 RPM, but I guess that the outer third of the 5 platter 400GB drive could deliver even better performance.
Eventually I shall get around to providing some numbers on how effective this simple technique is.

051226d

I have spent a significant portion of the Christmas weekend playtesting, for the sake of technology analysis ;-), a few popular games.
Most often I play online FPS team based games, for a few reasons, for example:

Online FPS games have a play time of around half an hour per bout, which is more or less ideal for jumping in, having a break, and then continuining what was one doing. Strategy games instead tend to require much longer periods of time for example.
I like the team based aspect of these games, as the behavioural dynamics they elicit are endlessly varying and usually quite interesting. Offline games tend to be much more repetitive and a bit navel gazing instead.
They are mostly simulations, and I tend to be interested in simulation technology in general.

What I have noticed is the death of FPS, in its other meaning, that of frames per second. My system is an average one for 2005 (Athlon XP 2000+, 512MB, NVIDIA 6800LE/Radeon X800 SE) and I tend to get between 20FPS and 30FPS (like this reviewer), with average visual quality settings, in games such as PC Halo, Doom 3, Quake 4, F.E.A.R., UT2004. What are the designers of these games thinking? Well, that it runs just fine on their own top of the line PCs. In other words, they are designing games for other game designers, who tend to have top of the line systems too, in a social dynamic not dissimilar to that in the recent evolution of the Linux culture.
It is no surprise that could buy F.E.A.R. and Quake 4, which are of recent release, almost half price just before the weekend, and that the store clerk asked me if I knew about the minimum hardware requirements thereof.
Yes, I knew of course, and I was curious to see how they would play. But I can imagine that lots of average users that buy them will return them next week as they are all but unplayable on their PCs, and thus discounted sale.

051226c

At the beginning of the new year I am looking to expand my desktop PC's storage capacity, substituting my 80GiB main disc with a 250GiB/300GiB one. Since I like to do backups, and hard drives have been for a few years overall the best way to backup other hard drives, I shall be buying 4 such discs, two of which wilol go into my PC case (one of them for nightly backup) and two will go into a caddy to be used with my external USB2/FW case.
The latter poses a problem, that just like most others it has a puny PSU, which a maximum draw on the 12V line of 1.8A and this is a problem because many disks require more to spin-up. This means that I have had to search the technical documentation on the sites of the few hard drive manufacturers left, and come up with the specs of the various models I am interested in, and this is a table with some features of some parallel ATA models:

3.5" hard drive selected characteristics
Firm	Model	Capacity	Seek min/avg/max	Spin up current	Noise	Misc
Maxtor	6L250R0	250GiB	0.8/9.3/17ms	1.7A@12V	25dB	NCQ
HGST	7K250	250GiB	1.1/8.5/15.1ms	1.8A@12V	28dB	NCQ
Seagate	ST3250823A	250GiB	0.8/9/??ms	2.8A@12V	28dB
WD	WD2500JB	250GiB	2/8.9/21ms	2.4A@12V	34dB

In general, hard drives from the same manufacturer come as families with very similar characteristics, where the major difference is the number of platters in the hard drive (usually 1 to 3) and the surfaces (1 to 6). Most other characteristics are the same, and even the spin up power requirements are much the same (the mass of the platters is such that 1 or 3 does not makes a lot of difference).
What appears to me is that at least for current desktop high capacity drives, WD and Seagate have gone for high power draw, and Maxtor (DiamondMax family) and HGST for lower power requirements, presumably with longer spin-up times.
However I am also interested in seek times, and it appears that Maxtor and HGST have particularly attractive seek times, and they also support NCQ (even if NCQ support in parallel ATA drives is of somewhat dubious value).

051226b

Some more numbers in my ongoing quest to figure out how file system designs degrade in performance with usage.
It is now 4 weeks since I installed Fedora 4 on a JFS partition, and I have been upgrading it a bit, even if less frequently than I did with Debian. The number, in much the same conditions as a previous test, are:

Used vs. new JFS filesystem test
File system	Repack	Avg. transfer rate
used JFS	17m31s 52s	6.3MiB/s
new JFS	09m46s 56s	11.4MiB/s

which indicates a 1.8 times slowdown, which seems fairly limited to me, so far that is.

051226

Spotted a fairly fascinating article on multithreaded Quake 4 performance (minus AI and physics, as it is a timedemo, that is a graphics walkthrough) on Intel and AMD dual core/HT systems.
The results are fairly encouraging, as frame rate, at (low) frame sizes where the game is CPU bound, goes up rather significantly, even as much as 80%.
Quake 3 was multithreaded too, but when run on an SMP machine the advantage was nugatory, so probably Quake 4 was designed to perform well in a multi CPU environment. After all John Carmack has stated that his main development environment now is the Xbox 360, which has three dual thread CPUs.
Which raises however questions about the baseline; sure, there is a large increase in performance, but that could be because a program structure designed to perform well on a multiple CPU system performs poorly on a single CPU one (especially given that it is a timedemo).
The problem really is that it is not that easy to write good parallel code for a game, because much of its nature is sequential, as each step in the pipeline for a frame needs the completion of the previous step to start, and then there is always Amdahl's law.
Given this the more obvious choices for taking advantage of multiple CPUs in a game are:

Try to overlap steps: For example one could try to figure out in each pipeline step which part of the computation has already produced final results and then compute the next step with respect just to those.
Try to parallelize each step: Some steps take a lot more time than others, and these may be parallelized, even leaving a strict sequence of steps, assigning a different CPU to a subset of the step.

Of the two, parallelizing within each step is the easiest, except that not many steps both take a lot of time and can be easily parallelized. The obvious ones to consider are AI, physics and most importantly graphics.
A game is essentially a tree or graph, and trees or graphs don't parellelize well in general, even if tree/graph walking can be parallelized to some extent. The singular advantage of graphics is that each node of the graph is pretty big and vectorial, being a geometric shape with a surface to be filled with pixels drawn from textures.
Now most of the pixel filling effort is done well by dedicated graphics chips, so we must look at shape processing and texture creation.
Having multiple CPUs makes it possible to create textures on the fly, for example, and even to some extent shapes on the fly. Now there are two types of textures, those that describe the surface properties of a shape, and those that describe the lighting properties of that shape.
As to the latter it is famous that the Doom 3 engine does dynamic multi light processing, generating several lighting texture maps that are then blended by the graphic chip. Well, most games so far have had only a single static or dynamic light, as lighting is a very expensive operation, as it has to be done for each light, and it turns out that most of the computation for each light can be done independently of that for the other lights.
If each CPU handles a light source, a lot of work can be overlapped. This would require a bit of carefuly memory layout, because the CPUs would be otherwise scanning the same data and interfering. I suspect a degree of data duplication would be useful.
The same can be thought of whenever there are some number of things that need processing, like say AI for NPCs, or dynamic texture effects per say vehicle, or whatever.
It will be interesting what will happen on PS3 and Xbox360 in particular.

051219

So, tired with the various limitations of the FAT32 filesystem, in particular the 2GiB limit on files, and the increasing cluster size with increasing filesystem size, which effectively limits FAT32 to 32GiB filesystems, I have switched to ext2 the filesystems that I need to share between MS Windows 2000 and GNU/Linux Fedora on my system (almost only games), using Ext2IFS and it has just been working reliably and speedily, and with the removal of those limitations it is a large improvement over FAT32.

051217

Thinking again about optimization, I used to be in the past and I am going to do some again, both scalar and vector oriented, for numerical simulations, for example using the various modern SIMD (4 wide as a rule) extensions to popular architectures, like AltiVec for PPC, VU for EE, and the various 3DNow! and SSE versions for x86 and AMD64.
As a rule it is best to do whole-algorithm optimization, rather than trying to speed up individual operations, for example by recoding single functions like matrix multiplication using the very convenient GNU C style inline assembler statements, which is rather more flexible than MS C style inline assembler.
However quite a bit of advantage can be obtained just by recoding individual functions this way, but with one important qualification: if such operations are of the and becomes sort, that is assignment combined with the operation on two operands, as in

x = y
x += z

rather than

x = y + z

So for example by defining for example in C:

void Matrix33PlusAB(Matrix33 *const x,const Matrix33 *const y)

or in C++:

void operator *=(Matrix 33 *const x,const Matrix33 *const y)

Less preferably one might accept three operand functions as in

void Matrix33SetPlus(Matrix33 *const x,
   const Matrix33 *const y,const Matrix33 *const z)

where unfortunately one has to handle the case where the first operand is the same as one of the last two.
Organizing calculations with a two operand and becomes style has three considerable benefits:

Functions do not need temporaries to return results.
If the compiler supports register operand specifiers, or similar, for inline assembler, then each function does not need to start with a series of register loads, and terminate with a series of register stores, which often dominate execution time especially if several operations follow each other, as often the peephole optimizer cannot reliably elide redundant ones.
When optimizing the whole computation, not just single operations, the structure of the calculation is already in the right form, and does not need to be rearranged (and let's hope there is no operator overloading either, as if there is figuring out what is going on can be really hard).

The fundamental issue is that CPUs and vector units are not expression oriented, and in particular lack arbitrary sized registers as temporary storage, and while many languages simulate expression oriented virtual machines on top of ordinary CPUs, using stack or heap based temporaries, that just goes against the grain, and is only efficient in the case of scalars (with a very few exceptions).

051216

All is not lost as to programming quality, in particular as to memory management: I discovered recently a really rare and valuable book, Algorithms for Memory Hierarchies a splendid collection of papers in how to improve memory reference locality in programs. It is amazing that some people still care. The quality of the papers is usually pretty good, even if some feel a bit flat, and there are some of the usual misconceptions (spatial locality for example, as if it meant something useful).
I found several familiar but little known techniques in the book, and some details that were new and interesting. There was for example some discussion of the issues in matrix transposition, a problem of great importance and which is quite difficult to solve efficiently (something that comes as a surprise to many).

051204

More stunning filesystem performance discoveries! So, I have switched my PC to Fedora a few days ago, first installing it to an ext3 filesystem, and then applied many updates with the latest and greatest. Then I have decided to switch the root filesystem to JFS like most of the others, and to apply many more upgrades. So after few days and many updates on ext3, I copied the result to a freshly made JFS filesystem, and then many more updates, and kept the ext3 one around.
So now it is time to make some speed tests in the usual way, on an Athlon XP 2000+ PC, with 512MiB of SDRAM, on a slowish 80GB drive, capable of doing around 30-35MiB/s peak sustained on sequential access, and using the cfq scheduler: with some surprising results:

New vs. used file system test
File system	Elapsed (system)	Bandwidth	Notes
used `ext3` 1KiB	25m37s (41s)	3.9MiB/s	`dir_index`, ACLs
new `ext3` 1KiB	17m45s (44s)	5.6MiB/s	`dir_index`, ACLs
used JFS 4KiB	23m33s (34s)	4.2MiB/s	ACLs
new JFS 4KiB	09m57s (37s)	10.9MiB/s	ACLs

Now these are interesting numbers as they related to each other, but I was surprised by how large the elapsed time is for the fresh ext3 filesystem, as previously, on a rather similar filesystem on the same disk it was much lower.
After a bit of thinking I realized that the more recent filesystem had both hash tree directory indexes (dir_index) and SELinux file attributes, and either may be responsible, and indeed without either the speed for ext3 goes up a lot:

Speed with and without ACLs and indexes
File system	Elapsed (system)	Bandwidth	Notes
new `ext3` 4KiB	04m55s (52s)	21.8 MiB/s	no `dir_index`, no ACLs
new `ext3` 4KiB	04m57s (52s)	21.7MiB/s	no `dir_index`, ACLs
new `ext3` 4KiB	18m38s (56s)	5.7 MiB/s	`dir_index`, no ACLs
new JFS 4KiB	09m21s (53s)	11.5MiB/s	no ACLs

It is pretty obvious that hash tree directory indexes cause a lot of extra arm movement. I suspect that this is because as files get added to a directory the index expands and its blocks gets scattered. I tried then rebuilding the directory indexes after loading with fsck.ext3 -D, but that did no change the result, probably because it only restructures the hash table, does not improve its locality.
Surprisingly a full directory scan using find is quick, at around 3 minutes, which increases the mystery. Perhaps the hash trees are fairly well clustered together, but far away from the directory or its inodes.
My conclusions are to avoid the dir_index option for ext3 unless really necessary and the usual: that ext3 filesystems are amazingly fast when freshly loaded, and degrade rather quickly, while JFS filesystems are less fast to start with, but degrade less quickly; they also support directory indexes quite efficiently.
As a last note, I need to find something less exciting to do over a weekend :-).

051203

While restructuring my filesystems I had to produce a backup of a root partition with a few GiB of data, so I wanted to create a compressed tar archive of it. To compare I decided to do it with three different compressors, and to check the time needed to decompress it, as I have already had a look at compression time and ratio. Decompressing the same data on an Athlon XP 2000+ with 512MB of SDRAM gives:

decompression speed:
`lzop -d`, `gunzip`, `bunzip2`
	lzop -d	gunzip	bunzip2
size of compressed data	3.36GiB	2.78GiB	2.50GiB
CPU user	45s	117s	327s
CPU system	18s	12s	6s

It looks pretty obvious that bunzip2 is really rather slow at decompressing too, and while gzip is over three times slower at compression than lzop, and almost three times slower at decompression, the absolute amount of time is not that bad.

051202b

After the Release of KDE 3.5 the KDE-RedHat project has provided the appropriate testing-level RPM packages. Lots of thanks to them, as upgrading seems to have fixed (with a bit of twiddling options in the printing configuration dialogs) an ugly problem when printing from Konqueror.

051202

In this article on running Doom III on a Voodoo2 SLI (german language) there are two among several interesting screenshots (1, 2) which show how Doom III looks on an older graphics system that lacks recent graphics primitives. Well, it looks a bit worse than Quake, mostly because of the lack of bump mapping, but less so because of the lack of dynamic lighting.
The lack of bump mapping reveals how crude are the based geometry and textures in Doom III, and just how effective is John Carmack's decision to have low complexity geometry compensated for by high detail bump maps.
However, Doom III is definitely unaligned with trends in graphics cards technology, which are going towards extremely parallel multiple texture merging in a single pass on very high polygon count geometries. John Carmack has previously remarked that instead Doom III uses many passes of simple textures over rather simple geometries to implement dynamic lights.
But perhaps the general shift towards programmable shaders as the basic unit of graphics card architecture will allow both styles to coexist.

November 2005

051130

Now that I have been upgrading and customizing my new Fedora 4 install I have been using Yum and its frontends KYum and YumEx and I have noticed some of their shortcomings (more of implementation than of design).
I used to use the RedHat distribution before it changed name to Fedora (there is currently no distribution called RedHat, RedHat's distributions are called RedHat Enterprise), so I am quite familiar with RPM (a major reason for the change).
These are some of the shortcomings:

As described here quite frequently the table of contents of a repository is updated, but then its checksum takes a while to update, which means that in the interval attempts to use that repository fail.
It has happened that the repository table of contents file downloaded failed to parse; this because it seemed to be a 404 error page. It looks as if updates of the metadata when new packages are uploaded has not been coded as an atomic transaction, and actually is slow enough that clients do see inconsistent intermediate states.
Just like APT, Yum is only really usable with a broadband connection. Fortunately for those without, Fedora is released onto CDs twice a year, and that helps.
The functionality of KYum and YumEx is nowhere near that of Aptitude for example; things like listing the various versions of the same package and selecting one, or showing the dependencies of a version, are quite useful and missing.
When installing multiple packages with Yum any error terminates the operation (but see the recently introduced -t options). Unfortunately this means that KYum and YumEx also terminate it, which means that if one selects several packages on which to operate with them, any operation will succeed only if it can on all of them. In other words, one is driven to select only a few at a time.
KYum sorts the field containing the package size as a string, not a number, and the size is left justified and expressed as either K or M...

There is nice document about using Yum with Fedora.

051129

I recently discovered ALSA Lisp mostly because I have switched one of my PCs from Debian Sarge+ to Fedora 4 over the past few days (I am keeping Debian Sarge on my web server). The actual switch was quite quick, and then it is has just been bringing over most bits of my configuration.
This has been relatively easy because as a rule whenenever I modify a config file I also (hard) link it with the same path under a directory, for example /root/cfg/, so that among others /etc/passwd and /root/cfg/etc/passwd are linked (the actual scheme I use is a little more involved than that, to account for multiple hosts).
This means that under /root/cfg/ I have a complete and current list of all the configuration files I need to carry over if I do a distribution reinstallation distribution change. Then there will be some incompatibilities, as the locations of the same files will be different, or their contents need to be adjusted, or some distributions use different base tools. For example I am now using CUPS as the print spooling system instead of LPRng (a move about which I am not that happy).
In the general trend towards low quality software and the cultural hegemony of the Microsoft Windows mindset, CUPS configuration is supposed to be a point'n'click thing via its embedded web server and configuration pages. Sure, one can edit the resulting configuration file, but it is not that well documented.

051128

It is embarassing to admit that only today I discovered ALSA Lisp which looks like a Lisp implementation semi-integrated with the ALSA library (whose asound.conf syntax is vaguely Lisp-like itself).

051127

While discussing partitioning and in particular things like LVM2 my point has been that usually many partitions, and logical volume managers, are somewhat pointless under UNIX-like systems, because they do not have drive letters which tie file name spaces to volumes and they handle things by subtree, not by volume.
Volumes are just there as an inconvenient reminder of the storage media underlying. Therefore in the general case one might just want to have a single partition per hard drive, and then move things around by subtree, or aggregate them by using symbolic links or mount points.
There are of course special cases (e.g. root filesystem, encrypted filesystem, ...), and it just occurred to me that there is one that is justified purely by the poor state of file system technology, that one might want to split what should be a single large partition into multiple partitions to make filesystem checking faster:

File system checkers are not multithreaded within a filesystem, but they are across filesystems. So a single large filesystem that spans two drives will be checked sequentially, but two each of half the size, and each on its own drive, will be checked in parallel.
When a system restarts, weakly damaged filesystem only need their journal replaying, but more damaged one need a full check. Given that usually, especially on a desktop, only a small subset of a filesystem tree is being written at any one time, means that usually if one splits a filesystem in two, and a check is needed, only one of the two smaller filesystems will need checking, the other will get away with just replaying the journal, and this means that even if they are on the same drive the check will be significantly shorter.
Splitting one large filesystem into two smaller ones will usually halve the maximum memory needed because some file systems like XFS require an amount of memory proportional to the filesystem size (and to number of inodes in the filesystem) for checking:

From some quick tests I just ran, for 32bit binaries xfs_check needs around 1GiB RAM per TiB of filesystem plus about 100MiB RAM per 1million inodes in the filesystem (more if you have lots of fragmented files). Double this for 64bit binaries. e.g. it took 1.5GiB RAM for 32bit xfs_check and 2.7GiB RAM for a 64bit xfs_check on a 1.1TiB filesystem with 3million inodes in it.

For xfs_repair, there is no one-size fits all formula as memory consumption depends not only on the size of the filesystem but what is in the filesystem, how it is laid out, what is corrupted in the filesystem, etc. For example, the filesystem I checked above only required ~150MiB for repair to run but that is a consistent filesystem. I've seen equivalently size filesystems (~1TiB) take close to 1GiB of RAM to repair when they've been significantly corrupted.

All these mean that while it would be nice and ideal to have one filesystem per drive, the limitations of current checking programs mean that there is in effect a system dependent upper bound to just how large a filesystem should be, and if this is smaller than a drive, perhaps one should have multiple filesystems on that drive.

051125

Some perplexing experiments in programming today. Quite interesting, though. I have been discussing game console programming, and the use or misuse of C++, and memory management thereof. The big problems I see is that consoles in particular have very slow (in particular latency, small caches) memory subsystems, and C++ encourages reference counting, which hits memory hard, in particular because of extra writes and with random strides.
A palliative is to use an ancient technique. in which reference counts are not updated directly; rather the address of the reference count is appended (with a low order bit toggled for example to indicate whether it is increment or decrement, other schemes are easy to imagine)) to a queue of pending reference count updates (an intention log), and when the queue fills up, all updates are performed at once (this might delay freeing some memory, but usually is not a significant effect).
The benefit of doing so is to append updates well clustered to the queue most of the time, disturbing the cache and the memory subsystem with scattered read-modify-writes only periodically. Hopefully the write-only appending to the queue only disturbs a very narrow set of cache lines.
Sorting the queue by address before running the queue adds two more potential benefits:

That the updates happen in strictly increasing and possible clustered address order.
That when the queue is full reference count updates can be summarized, so if there have been several on the same reference count at nearby times can just compute the net result and then update the reference count with one operation (or none) instead of several.

So I wrote a nice little LGPL'ed header and library to perform this kind of delayed reference counting (to be cleaned up a lot, but for now: 1, 2, 3), and started to measure things a bit on my Athlon XP 1.6GHz with 256KiB of level 2 cache, and comparing it to a direct, in-place reference count update scheme.
The test setup is the generation of a large number of blocks, whose addresses are put in random order in an array. Each block is then subjects to half a dozen increment/descrement operation, in two way: multiple passes are made and the same operation is applied to all blocks, or one pass is made and all operations are applied after another to the one block at a time.
These are two extreme and unrealistic cases; most programs will have locality in between the no locality of the first case, and perfect locality of the second case. Also, in both cases the only work done is reference count updates; in more realistic scenarios they would be interleaved with many more data processing operations.

Without sorting the queue of updates, delayed update is about about as fast as direct update (for both operation patterns) which is surprising, as it does significantly more work; the compensating gain probably is in better write locality to the queue.
With sorting it is rather slower for the no locality operation pattern, and somewhat faster for the high locality one; from the profile, running the queue takes a lot less time for the high locality case, but the sort is expensive for both. It takes about 1 second of CPU time to sort 7 million 32 bit addresses on my CPU, in chunks of a few hundred at a time.

Unfortunately I coded a far too simplistic test cases, and I am going to try and find some easily modified application.
Overall queueing updates is not much better than direct updates of reference counts; what I suspect is that the onchip Athlon XP cache flatter a bit too much direct update. I reckon that on systems like game consoles, which usually have extremely limited cache and very high latency memory, delaying and batching reference count updates would win in more cases.

051123

Just noticed the fascinating announcement that GNU/Linux is officially available for Cell from SCEI. Apparently it has a clever approach to SPE access (via the /proc filesystem).
I had previosuly seens a screenshot of Fedora RawHide and KDE running on a dual CPU Cell server which was interesting. As to SPE programming, this PS3 article sounds optimistic to me.

051121c

There has been a recent release of the latest version of Kpager2 a nice improved virtual desktop pager for the Kicker panel of KDE 3.

051121b (updated 051122)

Some friend has sent me a pointer to a particularly ugly Makefile portion in the OpenSolaris top level Makefile:

    CODE_COVERAGE=
    $(RELEASE_BUILD)CODE_COVERAGE:sh=   echo \\043
    $(CODE_COVERAGE)$(OBJS_DIR)/unix_bb.o   := CPPFLAGS     += -DKCOV
    $(CODE_COVERAGE)$(OBJS_DIR)/unix_bb.ln  := CPPFLAGS     += -DKCOV

It is specially horrid because the intent seems to be to rely on comments being parsed after macro expansion, as code 043 is for the # comment character, the intended effect apparently being that if $(RELEASE_BUILD) is empty, then CODE_COVERAGE is set to the comment character, and the next two lines are commented out.
Alternatively, the Make version is use does not actually process comments after macro expansion, and while the author of the Makefile thinks that the next two lines be commented out, what actually happens is that the next two macro definition lines apply to target paths beginning with #$(OBJS_DIR) (for example #build), instead of $(OBJS_DIR) (for example build), and since presumably there is no rule for targets beginning #$(OBJS_DIR), but only rules for targets beginning $(OBJS_DIR), those macro definitions have no effect on the build.
The trick used to put a comment character in a variable is also in itself slightly distasteful, but probably required by the lack of proper quoting in the Make variant used.
What is quite regrettable is that such hideous contortions are pointless, because conditional assignment of macros can be achieve much more cleanly and legibly with the age old practice of using variable macro names, as follows for example (assuming GNU Make for rule-specific macro assingments):

# 'CODE_COVERAGE' and 'RELEASE_BUILD' can be either "yes" or
# empty, and we want code coverage only if the former is "yes"
# and the latter is empty.
KCOV+yes+                       =KCOV
KCOV                            =${KCOV+${CODE_COVERAGE}+${RELEASE_BUILD}}

CPPFLAGS+unix_bb+KCOV           =-DKCOV

${OBJ_DIR}/unix_bb.o:           CPPFLAGS += ${CPPFLAGS+unix_bb+${KCOV}}
${OBJ_DIR}/unix_bb.ln:          CPPFLAGS += ${CPPFLAGS+unix_bb+${KCOV}}

There is an equivalent technique to do conditional execution of rules, using phony targets, for example:

# Conditional assignment to CFLAGS depending on ${ARCH}
CFLAGS-x86      =-O3
CFLAGS-PPC      =-O2 -funroll-loops
CFLAGS          =${CFLAGS-${ARCH}}

# Conditional build depending on ${ARCH}
build:          build-${ARCH}
build-x86: ....
        ....
build-PPC: ....
        ....

which avoids cleanly a lot of inappropriate (because un-Make-like) conditionals even in Make variants that have them.

051121

While browsing for something else, I found this very interesting but not new message by Andrew Morton on changes in Linux kernel development of which I particularly liked this:

I don't think 2.2 and 2.4 models are applicable any more. There are more of us, we're better (and older) than we used to be, we're better paid (and hence able to work more), our human processes are better and the tools are better. This all adds up to a qualitative shift in the rate and accuracy of development. We need to take this into account when thinking about processes.

As previously remarked the new and improved status of the developers while lucky for them also has significant downsides.
I was also amused by the next paragraph, about the flood of kernel contributions:

It's important to remember that all those kernel developers out there *aren't going to stop typing*. They're just going to keep on spewing out near-production-quality code with the very reasonable expectation that it'll become publically available in less than three years. We need processes which will allow that.

where I suspect that near-production-quality is often rather optimistic :-).

051120

Today I wanted to try to add a new ATA host adapter to my PC, as I would rather have all my ATA discs on separate channels, like for SATA, and thanks to a lucky incident I got an ATA PCI card based on the ITE IT8212F chipset. The card itself is a really cheap Q-TEC 14426 which is a bit alarming as Q-TEC, which seems to be just the component brand of Trust International BV seems to me (like its parent) a low cost, low value reseller of bottom of the barrel stuff.
Well, in this case, like in the case of one of the other Trust products I have dared to buy (a Trust 514DX sound card), the accidental choice of a good base chipset makes a lot of difference, because the ITE IT8212F has some really good points:

It is a full hardware ATA-RAID host adapter, where the onboard firmware handles all the RAID0, RAID1, RAID10 processing. So far the only comparable cards were the excellent 3ware ones, which may be fairly better quality; but the equivalent one, the 7006-2, costs 8 times as much.
That it is a full hardware RAID means that the Linux kernel only sees the logical RAID volumes the firmware defines after they are configured in the BIOS of the card. The Linux kernel does not need to have the sw RAID drivers, can boot off the RAID directly, unlike the other fake RAID cards that have been out so far.
It can operate either as an ATA HBA, emulating the less sophisticated ITE IT8211F chipset, but also as a SCSI-style host adapter, with the attendant fairly sophisticated abilities.
The chipset is fully and rather well documented in a manufacturer supplied specification, which means it is possible to openly develop and maintain drivers for it.
There are not one, but two Linux drivers for it, one which presents it as an ATA HBA, maintained for a while by Alan Cox, the other that handles it as eithe a full SCSI host adapter, or as an ATA HBA again, and that has been very kindly and generously developed and released as GPL by ITE themselves.

Basically this chipset (and cards based on it) is pretty good and a lucky find, an excellent value, a bit like the ZyDAS 1201 802.11b chipset and the little Origo USB stick I bought that uses it. A comment in the driver by Alan Cox says:

 *  The ITE8212 isn't exactly a standard IDE controller. It has two
 *  modes. In pass through mode then it is an IDE controller. In its smart
 *  mode its actually quite a capable hardware raid controller disguised
 *  as an IDE controller. Smart mode only understands DMA read/write and
 *  identify, none of the fancier commands apply. The IT8211 is identical
 *  in other respects but lacks the raid mode.

The only issue I can see with the IT8212F wrt Linux is that neither of the two drivers have made it yet into the mainline Linux kernel, which has caused some slight maintenance complications in the past, as the internal kernel driver API churn has occurred.
However I have managed to track down two fairly current (2.6.14 compatible) versions of both drivers:

The Alan Cox driver, drivers/ide/it821x.h, is available as a patch for Linux kernel 2.6.12 from a message to the LKML, and which has become a standard part of the Linux kernel with the name drivers/ide/it821x.c.
The ITE driver, drivers/scsi/iteraid.[hc], by Mark Lu and colleagues, is available as ZIP file with sources and binaries from the ITE web site, but only for Linux kernels up to 2.6.10, as an external module.
It has been also available as a patch in the -mm series of experimental Linux patches, the latest patch being for Linux kernel 2.6.13-rc3-mm3, after which the Alan Cox driver has become part of the mainline kernel instead. But it can still be used with this patch for Linux kernel 2.6.14 from the enhanced kernel of the GRML live/rescue GNU/Linux CD (which contains much other goodness).

Between the two drivers I have decided to use the ITE one because it is integrated in the SCSI, not the ATA/IDE, subsystem of the Linux kernel, and I like the SCSI subsystem rather more. The driver source also seems tidy and professionally written just as that from Alan Cox, which is quite impressive.
The IT8212F is also present on several recent motherboards as the onboard chipset to provide RAID ATA channels, and both because of its good value and features, and its designers providing a nice GPL Linux driver for it, I hope it is successful and widely adopted.
So what are the downsides? Well, unfortunately after trying it a plain ATA/IDE HBA (I have really no need for RAID yet):

Like all HAs that virtualize drives into RAID, one loses the ability to access the drive directly, for example to read the temperature.
If the Q-TEC card is the second of two ATA PCI HBAs, then the first configures itself, and the Q-TEC card does too, but if I put it first the BIOS of the second does not seem to get called.
I tried the iteraid driver as updated for 2.6.14 and it just did not work for me. I suspect this is due to it being oriented to RAID operation, but it should still work in plain IDE mode. The problem was that the two drives attached to it were detected, but with size 0, and without the right names.
I tried the it821x driver and it seemed to work well and fast in IDE mode; but only for reading. When trying to write at first I got lots of IO errors due to CRC issues. I imagined that since the drives were actually in a dodgy enclosure this could be a problem, so I attached them direct to the card. This seemed to avoid the issue, but then it reappeared even if far less frequently. Usually CRC errors are due to bad cables creating electrical problems, but these are cables that give no trouble on a similar IDE HBA. What I suspect is poor build quality of the Q-TEC card more than problems with the IT8212F chipset or the drivers.

In the end I spent most of two days trying things out, and while I am impressed by the potential of this chipset, the card I have got is not quite usable in my configuration for IDE-only drives, so I reverted to putting all four of my drives on the two channels of my existing CMD0649/SiL PCI card, which has a couple of annoyances of its own:

It does not handle LBA48, so it cannot boot from areas beyond 138GiB. I ended up restructuring my partitions to work around this.
Probably some aspect of the chipset is not multiplexed, because if two drives are on the same channel IO is rather slower (even if between drives on different channels).

051116

Unfortunately among its many technical issues Debian has that it uses DPKG 1 and APT as package and dependency managers, and they cannot handle having installed at the same time multiple packages with the same base name but different versions for different architectures. This is right annoying because it makes harder to have applications that use different ABI versions of the same library, or supporting systems that can handle multiple architectures, either for execution or even merely for file sharing.
There is a mythical multiple architecture support conventions, but it is not supported yet. So the all too clever solution is to re-encode the version and architecture into the base name of the package, which of course means that Debian needs to repackage exactly the same stuff with several different names, and because the package name is inside the package files, and there are checksums too, this cannot be done by simply giving the same package file different names, therefore:

There is already a minimal set of IA32 libraries packaged for use in a 64bit Debian system. Simply do an 'apt-get install ia32-libs' and you will be able to run most 32bit binaries within your system.

The dumb practice of putting version numbers into package base names has been going on for a while:

i   356kB  807kB  4.1.25-18  4.1.25-18 libdb4.1
i   368kB  868kB  4.2.52-20  4.2.52-20 libdb4.2
i   <N/A>  979kB  4.2.52-19  4.2.52-19 libdb4.2++
i   399kB  926kB  4.3.28-3   4.3.28-3  libdb4.3
i   427kB 1020kB  4.3.28-3   4.3.28-3  libdb4.3++c2

but since the AMD64 architecture has become popular the Debian packagers have indulged in the even more nefarious one of putting the architecture name into the based name, and inconsistently, as this list of i386 architecture package names I can select by searching for those which contains the strings amd64 or lib64 shows (forgive the messiness dues to my rather daring /etc/apt/sources.list file):

p   4492kB 11.0MB <none>     1.1        stable           amd64-libs
p   18.6MB 76.2MB <none>     1.1        stable           amd64-libs-dev
p   2020B  8192B  <none>     103        stable           kernel-headers-2.6-amd6
p   2038B  8192B  <none>     103        stable           kernel-headers-2.6-amd6
p   2026B  8192B  <none>     103        stable           kernel-headers-2.6-amd6
p   224kB  14.6MB <none>     2.6.8-14   stable           kernel-headers-2.6.8-11
p   223kB  14.5MB <none>     2.6.8-14   stable           kernel-headers-2.6.8-11
p   219kB  14.2MB <none>     2.6.8-14   stable           kernel-headers-2.6.8-11
p   2072B  8192B  <none>     103        stable           kernel-image-2.6-amd64-
p   2082B  8192B  <none>     103        stable           kernel-image-2.6-amd64-
p   2092B  8192B  <none>     103        stable           kernel-image-2.6-amd64-
p   12.6MB 44.6MB <none>     2.6.8-14   stable           kernel-image-2.6.8-11-a
p   13.2MB 46.6MB <none>     2.6.8-14   stable           kernel-image-2.6.8-11-a
p   13.2MB 46.7MB <none>     2.6.8-14   stable           kernel-image-2.6.8-11-a
p   34.8kB 115kB  <none>     1.0.2-7ubu breezy           lib64bz2-1.0
p   29.3kB 106kB  <none>     1.0.2-7ubu breezy           lib64bz2-dev
p   7460B  20.5kB <none>     4.1-0exp0                   lib64ffi4
p   54.6kB 127kB  <none>     1:3.4.4-9  testing,unstable lib64g2c0
p   84.6kB 135kB  <none>     1:3.4.3-13 stable           lib64gcc1
p   107kB  352kB  <none>     4.0.2-3    unstable,unstabl lib64gfortran0
p   91.0kB 246kB  <none>     4.1-0exp0                   lib64mudflap0
p   319kB  676kB  <none>     5.5-1      unstable,unstabl lib64ncurses5
p   382kB  1282kB <none>     5.5-1      unstable,unstabl lib64ncurses5-dev
p   43.9kB 115kB  <none>     4.0.2-3    unstable,unstabl lib64objc1
p   4598B  16.4kB <none>     4.1-0exp0                   lib64ssp0
p   326kB  1004kB <none>     3.4.3-13   stable           lib64stdc++6
p   8705kB 34.8MB <none>     4.0.2-3    unstable,unstabl lib64stdc++6-4.0-dbg
p   8661kB 35.3MB <none>     4.1-0exp0                   lib64stdc++6-4.1-dbg
p   53.2kB 135kB  <none>     1:1.2.3-6  unstable,unstabl lib64z1
p   56.3kB 168kB  <none>     1:1.2.3-6  unstable,unstabl lib64z1-dev
p   3253kB 8098kB <none>     2.3.5-8    unstable,unstabl libc6-amd64
p   2000kB 9462kB <none>     2.3.5-8    unstable,unstabl libc6-dev-amd64

The above are i386.deb packages that allow executing some AMD64 stuff on a base 32 bit system; there are of course amd64.deb packages that have their name warped to indicate they contain 32 bit libraries and executables.

051115

I have been spending the past two days almost entirely reorganizing the partitions on my two hard discs, in particular to ensure that some bootable ones are below the dreaded LBA/138GiB limit that many BIOSes still have, and also to convert to ext2 my bulk archive FAT32 partitions, as one can find pretty good ext2 drivers for MS Windows 2000 (e.g. 1, 2).
One of the issues I have incurred, as always, has been that it is rather inconvenient to change the MS Windows 2000/XP boot and/or system partition(s), as its boot process is both rickety and fragile. This time, instead of reinstalling, I have decided to confront the issue.
First of all it is extremely useful to have a nice MS Window 2000/XP boot floppy, with a set of boot lines for various partitions, using the ancient ARC notation.
Then there is the issue of the drive letter of the system drive. In theory this drive letter is contained in the SystemDrive environment variable, but in practice it is hard-coded in very many paths. In general it gets hard-coded in the paths of any installed software packages, but unconscionably it also gets hard-coded into the path to a particular system command, Userinit.exe, which is an essential component of the login process.
In practice it is essential to ensure that the drive letter of the system drive is changed back to what it was when MS Windows 2000/XP was installed, even if whenever the partition is moved around the drive letter changes. That the drive letter changes at all is rather annoying, also because volume management under MS Windows 2000/XP is also done by
Fortunately, like previous versions of MS Windows, 2000/XP can change the drive letter assigned to a partition, but unfortunately this can only be done rather awkwardly.
Also, it helps if when install MS Windows 2000/XP one installs it so that its system drive letter is pretty high, like W:, so that it does not take one of the lower drive letters leaving them to be assigned to non-system partitions in the ordinary way, but I am not aware of any easy way to achieve this. Also, it might help to make all installations to a virtual drive letter, one assigned either with subst or with the ability to assign multiple letters to the same volume, which is possible using Disk Management. Or perhaps one should just use the mountvol command and do mostly away with drive letters. The downside of this is that then all paths will be wrong until the one drive letter that one cannot get rid of, the system drive one, is restored.
As to restoring the system drive letter, the most irritating aspect is that as described in a Microsoft KB article the full path prefix of Userinit.exe is pointless and can be removed because it is in the system directory anyhow. Indeed it would be advisable to just remove that prefix at any time after installation, as it is not necessary and prevents logins if the system drive letter changes.
Now the question is how to edit the registry if you cannot login, and the Microsoft KB article suggests trying a remote edit etc., but there is an alternative, the ntpasswd GNU/Linux utility and bootdisk which also supports offline editing of a MS Windows 2000/XP registry, even if it is on a NTFS volume.
Once in, it is possible to change the drive letter assigned to a volume by editing the registry again as described in another Microsoft KB article but this is somewhat awkward because the drive letter assignments in the registry (in the HKEY_LOCAL_MACHINE hive under SYSTEM\MountedDevices) more or less the equivalent of /etc/fstab under GNU/Linux) are more or less by volume ID, and there is no easy way to check which volume, including the one with the system, has which ID. By the way, since drive letters are mostly by volume ID, cloning a MS Windows volume creates two partitions with the same ID, which can be one of the reasons why the system drive letter has changed.
One could think of using

Disk
	    Management

to change the drive letter, and that works well for all volumes except the system volume. Fortunately I also tried Partition Magic 8.0 which does allow changing the drive letter of the system partition, and that was easy.
It would have been less easy if I had MS Windows XP instead of 2000, because the drive letter etc. has also a role in the product licence activation scheme introduced for XP, and to fix that one has to do more registry resetting.

051114

Today I have been restructuring some of my filesystems, which has involved copying them around (restructuring in place is slow and dangerous). Like before I have noticed a disappointing aspect of performance, that on average on my 1.6GHz Ahtlon XP CPU with 512MiB of 133MHz SDRAM it takes around 200MHz of kernel time to shift around 10MiB via the buffer cache:

# vmstat 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
...
 1  1    232   5948   1140 350224    0    0 19279 19489 1051  2749  1 48  0 50
 1  1    232   5784   1076 350212    0    0 18374 18668 1057  2772  2 46  0 51
 1  1    232   6428    952 349612    0    0 21343 21600 1130  2958  2 55  0 43
 1  1    232   5824   1140 350228    0    0 20692 20371 1200  3228  1 55  0 44
 1  1    232   5700   1208 350108    0    0 20916 21106 1089  2900  1 55  0 44
 1  1    232   5824   1268 349948    0    0 19301 19504 1044  2752  1 50  0 50

This was a particularly bad case; in particularly good cases I have seen the cost as low as half of that, 100MHz per 10MiB/s. This is quite awful in absolute terms, because it means that on modern disks, that can easily do 40-50MiB/s, IO becomes CPU bound on a 0.5-1GHz CPU, which is a joke, considering that DMA means there is almost no CPU involvement in the IO itself, it is all about buffer cache management overhead.
I have used the cfq elevator, but with the anticipatory and noop ones the cost is roughly the same. Interestingly though while the average read or write rate with cfq is around 20MiB/s, for this double-tar copy that with noop is just around 16MiB/s, but with anticipatory it is around 24MiB/s. Probably this is because of seeking to update metadata while a file body is written sequentially.
Looking at it another way, it takes around 10-20 cycles to shift one byte of IO, even with DMA, or more precisely it takes 5-10,000 cycles to process a 512 byte sector or almost 50-100,000 cycles to manage a 4KiB page in the page cache.
But of course the kernel developers responsible for this hardly notice the obscenity on their top of the line multi-GHz multi-CPU PCs. It is not their itch.

051113

Answering a request for help from someone who could install Fedora 4 on a recent laptop, it turned out that the SATA chipset used by the laptop was not yet supported by the 2.6.11 Linux kernel used in the original Fedora 4 installer; but Knoppix V4.0.2 and OpenSUSE 10.0 which have been released in the past few weeks instead of some months ago had a more recent kernel and would install.
However the user asking for help really would have preferred Fedora 4, so I looked for updated Fedora installers, as it is sort of the wrong time to install Fedora on a very recent system, as we are almost exactly halfway between two Fedora releases, so the current release is already some months old, and the next one is not yet stable enough, and the Fedora project does not do official midpoint releases.
Fortunately I was told that some volunteers have been respinning updated install ISO images incorporating into them the Fedora updates, which include a newer kernel.

051110

Well, another historical SUSE person has resigned from Novell, Hubert Mantel, one of the cofounders of SUSE and its main kernel packager. The most amusing detail of his resignation is in this thinly veiled message:

I'm very confident the Novell management will find a competent successor very quickly. After all, there are lots of extremely skilled people over there in the Ximian division.

where it is significant that he says Ximian, the GNOME and Mono developers, not SUSE. He seems to be alluding that Ximian have won some internal battle with SUSE, and Novell is now committed to Mono and GNOME, and bye-bye SUSE and Linux.
I have also been reached by rumours that a Novell development facility of around 400 people in India is being switched from NetWare maintenance to SUSE Linux maintenance, which probably means that the ex-SUSE staff in Germany is not going to linger for long at Novell.

051108

Quite amusing report of problems with a 138GB filesystem with 4 million inodes on a 256MiB PC, which do not happen if the memory is upgraded to 1024MiB. Totally unsurprising to me, as per my previous notes on memory related issues with large filesystems and in particular integrity checking.
Most current software is developed and tested (usually an euphemism) on large memory (at least 1GiB) systems, so having less memory means going against the grain, or into uncharted waters.

051106

Quite well presented screenshot comparison among Microsoft Outlook, GNOME Evolution and KDE Kontact. it is pretty obvious that they have much the same functionality, and just how much Evolution is inspired by Outlook. Quite entertaining to see.

051105b

Slashdot reports some news that Novell SUSE has chosen GNOME as the flagship environment which is not unexpected; GNOME is the USA centric alternative to KDE, and Ximian are way nearer to Novell's management than SUSE can be. I would expect that eventually SUSE's base in Germany will be reorganized too.
It is quite sad as GNOME is way more bloated and less polished than KDE, and SUSE has been a small but significant counterbalance to GNOME's adoption by RedHat and the remaining UNIX companies.

051105

I was doing, as part of my backup scheme, a disc-to-disc copy of my partitions, and noticing it was going pretty slowly at only 12MiB/s, and that seemed because of slow writing, rather than reading. So first I played around with the elevators, without much relief, and then I enabled /proc/sys/vm/block-dump because of a suspicions and it turned out that both kswapd and pdflush were writing to the destination disk, and in a non-fully sequential order.
So I decided that pdflush was being too lazy, and I looked at its various parameters in section 2.4 of Documentation/filesystems/proc.txt and I realized that just like other parameters like swappiness it is set to favour too much caching of disc blocks, which is largely pointless and/or damaging, because of lack of locality beyong a certain threshold. So I have set dirty_background_ratio to 3 (percent) and dirty_ratio to 6 (percent) as (except probably on laptops) it is fairly obscene to leave more than a few percent of system memory dirty; indeed the parameters should be not a percentage, but a quantity, as the amount of unwritten blocks should not depends on how much memory there is, but things like disc speed and the user's tolerance for mishaps.
Leaving a lot of unwritten buffer to accumulate also means that when they need to be sync'ed back to disc there will be suddently a lot of writes queued, and that will raise the chances of increasing disc latency for other activities, especially as with some elevators writes have greater priority over reads.
121215, 170114, 170211 Updates: The dirty_ratio should be set to a much larger value than 3 (percent), as the system usually becomes rather unresponsive when it is hit, as many processes occasionally write to disk, for example for logging, and then block. Sometimes 10-40 (percent) may be good. Also dirty_background_ratio on contemporary systems should be much smaller than 3 (percent), as on an 8GiB system it is 245MiB, or 2-3 seconds of writing on many disk drives.
Also it is a good idea to use the newer dirty_bytes and dirty_background_bytes: set dirty_background_bytes to 1 second of writing bandwidth, and to set dirty_bytes to several (eg. 10) seconds of writing bandwidth.

051104

While looking for information on Xbox 360 programming I chanced onto an interview about new technology to use in a popular american football game and I found this entertaining quote on overhyped technology:

MN: NaturalMotion's endorphin animation technology is one of the hottest topics on Madden fan sites right now. Is there a chance that we might see this technology or some form of rag doll/on the fly physics in Madden at some point in the future?
JS: I can't speak towards specific technologies at this point, but let me just say this is something that is a lot easier in demos than a full game, especially on the scale of a football game.

The a lot easier in demos point is lovely, not just because I reckon that it defines the whole startup/dotcom scene, but also regrettably a lot of the software one in general.

051101

Well, I switched my filesystem from ext3 to JFS around six weeks ago, and I have been just using my system normally since, and I have done several package updates (updated X a couple of time, KDE a couple of times, several other bits and pieces).
Time to see how JFS copes with churn. So first image copied my root filesystem to an otherwise unused disc, and measured how much time it takes to find a few files in it and to read it all with tar; and I did it again after reformatting the partition, and loading it from fresh. The results (elapsed and system CPU times) are:

Used vs. new JFS filesystem test
File system	Repack	Find
used JFS	26m45s 63s	04m17s 06s
new JFS	10m20s 55s	03m15s 06s

The filesystem size is around 7.5GiB out of 9.8GiB available, with 360k files; test on an Athlon XP 2000+, 512MiB RAM, and a slowish 80GB drive) capable of doing around 30-35MiB/s peak sustained on sequential access. I have taken the suitable precautions, including unmount the filesystem before each measurement. The number above do not include sync time afterwards; in the find test this adds around 4m as all the atimes get updated and have to be finalized.
It looks like a bad case of ongoing fragmentation; so I had a look at several large files and seeing the difference in speed reading them:

Used vs. new JFS filesystem
file read rate test
File name	Size	Used or new JFS filesystem	Transfer rate
var/tmp/db	130,476MiB	used	4.5MiB/s
var/tmp/db	130,476MiB	new	27.0MiB/s
var/cache/apt/srcpkgcache.bin	21,788MiB	used	16.8MiB/s
var/cache/apt/srcpkgcache.bin	21,788MiB	new	35.0MiB/s

Other files read at full speed, those that had not been updated. Now var/tmp/db is a test Berkeley DB I filled from scratch some time ago and var/cache/apt/srcpkgcache.bin is a file that is rebuilt every time I update the repository package lists, and is created sequentially. This seems to be reflected in their respective levels of fragementation deducible from the slowdowns above.
Overall I had hoped better: it can be said that an increase in bulk read time of 2.5 times in six weeks is not what I would wish for. But then essentially all file system benchmarking is done on freshly loaded filesystems, so I guess file system designers could not care less.
Note also that in this case essentially all the churn in the filesystem is overwrites due to package updates. The filesystem also has /var, and in that there is a 100MiB Squid cache, but that cannot skew the result that much. As the couple of file transfers above insinuate, it is the ordinary files that get shredded into somewhat separated extents.

October 2005

051031

So far so good: I have been using JFS for while and it seems to be doing all right. But I am also looking at it as a possible vehicle for large media/virtual volume files as it has a nice buddy system based allocator and tend to allocate large single extent files if possible and lucky.
But I am considering some changes, and this require understanding better what it does currently. I have been rereading a few times a five year old description of its data structures but admittedly it is a bit opaque, so I have decided to summarize it in my own words.

051030c

Noticed a fairly decent test of dual core and multiprocessor Opteron systems by Sun as to database performance and other tests, under the recently (partially) released Solaris 10 and under Novell's SLES 9.

051030b

Thanks to a reader for pointing me towards Sun's new Zettabyte File System which has some interesting aspects, even if regrettably its delivery has been delayed.
I feel rather skeptical about several aspects of its design, in particular the commendable autotuning, and specifically the to me laughable idea of having per-file adaptable block sizes. It sounds entirely laughable to me as larger block sizes are just a static, stupid way of doing cluster IO and/or pre-allocation or read-ahead, and all these are much better done statically or dynamically by aggregating smaller blocks.
The need for zettabyte storage pools also seems a bit questionable, as several larger storage pools are probably going to be more common. But there is little harm in having them though, even if some existing file system designs support ceilings of 2⁶⁴ of 4096B blocks, or 2⁷⁶ byte filesystems. According to the authors, 2⁵⁰ byte storage pools are already in existence, and every year requires another bit, so 2⁷⁶ byte filesystems are still good for another 26 years, which seems fine to me.

051030

A recent interesting OOo 2.0 benchmark which compares its spreadsheet calculator unfavourably to the MS Excel. The benchmark is interesting for several reasons; one of them is that the author seems to have an axe to grind, and in particular to find fault with OOo, so it seems significant that the attack is concentrated on the spreadsheet component, and this makes me think that the others are after all fairly competitive (but not in memory usage) with their MS Office counterparts.
It is also interesting that the benchmark involves a rather large spreadheet, with 16 sheets each with 16,000 rows of 13 columns, which is around 27-30MiB in XML form.
I have done some testing on my Athlon XP 2000+ with Linux kernel 2.6.13 and GNU LIBC 2.3.5; loading the .sxc version with OOo 1.1.5 takes 325s elapsed, 269s of which are user CPU time and 26s of which are system CPU time; with OOo 2.0 it took 321s elapsed, 282s of which user CPU time and 6 seconds of which are system CPU time. I tried also with KSpread 1.4.2 and it just crashed after a while.
Now there are several ways of doing XML parsing, and especially if one uses a tree based validating XML parser it can take a lot more than a non validating streaming parser; besides the very bulk of an XML representation, especially for the relatively small amounts of data in each row of the sample spreadsheet, means that a lot of time is spent just processing XML verbiage and not data. This applies, but less so, also to the Microsoft XML format, as it is less verbose.
So I decided to try and save the data to .xls format. Well, with OOo 1.1.5 this failed, after several minutes, and after the size of the program had grown to more than 900MiB, more than 20 minutes had elapsed, and much paging had occurred, as my PC has only got 512MiB of RAM.
Then i decided to have a look with a smaller spreadsheet, as the full spreadsheet just takes too long for my taste on my several years old PC. So I saved a single sheet out of the 16 (the May 2002 data), as a CSV format file to strip it down to the essentials and look at the actual volume of data in it (the metadata in CSV is almost nothing), and that's 1.1MiB; loading it was very quick, less than 10s of elapsed time (warm start), 8s of user CPU time, and 0.5s of system time.
I also tried to load the 16,282 line CSV into KSpread 1.4.2 and it is taking more than 5 minutes of CPU time. Well, I have decided to kill it after it had gotten to 7 minutes of CPU time.
So I tried to save the 16,282 line CSV into .xls format using OOo 2.0 in various formats with crashes and bizarre error messages.
Then I realized that my /tmp filesystem had become full with the temporary files of previous crashes, and that was the cause of the problems, even if the error messages were entirely misleading. Once I cleaned up /tmp I was able to save the 16,282 line CSV into various formats, fairly quickly in each case (around a dozen seconds) and with the following resulting sizes and loading times (the elapsed times and the RAM used numbers are approximate):

Warm load of a 13x16,282 sheet with OOo 2.0
Linux 2.6.13, GLIBC 2.3.5, Athlon Xp 2000+
Program	Format	Size	Elapsed time	User CPU time	System CPU time	RAM
OOo Calc	none	0	5.7s	4.6s	0.3s	19MiB
OOo Writer	`.txt`	1180KiB	5.9s	4.7s	0.5s	30MiB
XEmacs	`.txt`	1180KiB	2.1s	1.6s	0.2s	8MiB
OOo Calc	`.csv`	1180KiB	9.4s	7.4s	0.5s	32MiB
OOo Calc	`.xls`	3220KiB	7.8s	6.2s	0.4s	29MiB
OOo Calc	`.sxc`	248KiB uncompressed: 19.3MiB	24.6s	22.3s	0.7s	35MiB
OOo Calc	`.ods`	248KiB uncompressed: 19.3MiB	19.5s	17.7s	0.8s	35MiB

The main message here is that non XML formats load three times as fast as the XML ones, in about 2s net (that is) and that the fastest (and smallest) loading format is the .xls one, which is also the fastest for MS Excel 2003.
As to memory usage, loading OOo 2.0 Calc seems to take around 19MiB of RAM, and an extra 10MiB to 15MiB are used after loading the 13x16,282 sheet, or about 8-12 times larger than the data contained in it.
As to MS Excel 2003 the reported time to load the full 16x13x16,282 spreadsheet in .xls format is 2s, which means around 25MiB/s of data reading and parsing and unpacking speed, which sounds somewhat optimistic. Perhaps MS Excel is optimized to read just what it needs (first screenful of the first sheet) in the .xls case.
However altogether OOo Calc and KSpread on rather large spreadsheets are rather disappointing, both as to speed, memory, reliability and polish. It is pretty obvious that while they are quite usable, especially on large but not immense spreadsheets like 13x16,282, MS Excel 2003 is rather more mature.
In my imagination the authors of OOo Calc have done some cool snazzy demos with small sheets on huge PCs, and that looked good enough...

051026

I was looking at the licensing terms for the compilation copyright for SUSE OSS 10.0:

The Software is a collective work of Novell. You may make and use unlimited copies of the Software for Your distribution and use within Your Organization. You may make and distribute unlimited copies of the Software outside Your organization provided that: 1) You receive no consideration; and, 2) you do not bundle or combine the Software with another offering (e.g., software, hardware, or service).

and Fedora Core 4:

The Software is a collective work under U.S. Copyright Law. Subject to the following terms, Fedora Project grants to the user ("User") a license to this collective work pursuant to the GNU General Public License.

and it is pretty obvious that there is a big difference: SUSE OSS, while being composed entirely of free/open software:

At openSUSE.org, anyone can download the OSS (Open Source Software) version of SUSE Linux 10 for free. This code is strictly OSS and does not have any of the licensed components like RealPlayer, Adobe, and licensed drivers that you would find in SUSE Linux 10 (non-OSS).

is not itself as a whole free software, while Fedora is clearly so, the licence for the compilation copyright being the GPL.
So for example Fedora can be bought from those services that download and burn cheap CDs/DVDs of Linux distributions, but SUSE OSS cannot. Probably this is because SUSE on CD/DVD is still for sale by Novell, and for fairly cheap, but this is a bit disappointing.

051023

Some recent discoveries:

Qish a generational GC for C/C++, an alternative to the well known Boehm-Demers one.
An implementation of a subset of RSYNC written in portable Python.
An interesting if limited UNIX/POSIX implementation, developed in Romania around 1995, Thix.
A quite nice survey of disc-to-disc backup technology in particular using RSYNC and derivatives.

051016

In another discussion, someone was asking how to optimize the creation of a set of around one million small files, in a directory tree with a fanout of 50, of size between 50 and 100 bytes each, as this was taking him around 15 hours, not unexpectedly.
Some people including myself suggested that using a file tree for this was not as good as using a database, and I decided to show how this could be done, by using a couple of small Perl scripts with a Berkeley DB database. The results are for creating the database:

$  time perl megamake.pl /var/tmp/db 1000000 50 100

real    6m28.947s
user    0m35.860s
sys     0m45.530s
$  ls -sd /var/tmp/db*
130604 /var/tmp/db

and for doing 100,000 random record accesses in it:

$  time perl megafetch.pl /var/tmp/db 1000000 100000
average length: 75.00628

real    3m3.491s
user    0m2.870s
sys     0m2.800s

that is about 2,500 records inserted per second and 500 random record accesses per second; the total space used is 130MiB, instead of the 4GiB needed for 1M files whose minimum size in most filesystems is 4KiB.

051015

Belated discovery that Jamie Zawinski has written a detailed illustration of a rather confusing subject, X11 cutting and pasting, and with Emacs too.

051014

During a debate on OOo startup times, I pointed out that the large difference between in timing between a cold and a warm start is probably also due to many accesses to scattered files during startup. So I decided to count how many files some applications look at during startup, using:

strace -f -e trace=open,stat64 app 2> /tmp/app.strace
egrep -c '\<(open|stat64)\(' /tmp/app.strace
egrep '\<(open|stat64)\(' /tmp/app.strace | egrep -c ' = [0-9]'

to count total files (inodes) accessed and accessed successfully, with the following results (under KDE, so this is sort of biased in favour of KWord):

Inodes accesses, total and successful
Application	Accesses	Successful
`soffice`	1660	715
`abiword`	1597	1274
`kword`	1836	1632
`vim`	187	92

The numbers are quite scary; even if a good percentage of these accesses are repetitions, thus benefiting from caching, they are still very very many for some of those applications.
A lot of those file accesses seem to be done by framework libraries or for framework purposes, like accesssing icons. Yet another illustration that process creation has become expensive if not directly, indirectly. With more and more programs becoming long lived servers with short lived client windows, as in Emacs and gnuserv and gnuclient.

051013

it is delightful to discover that someone has done a set of installers for games for GNU/Linux Many of these games are great classics, and have been open sourced to some degree and ported to Linux by group of devoted people around Ryan "Icculus" Gordon.

051012d

Finally for today, I was discussing the advising system calls, which are particularly useful if often unimplemented. The discussion was with someone who is doing PhotoShop plugins, and was wondering why it and similar packages implement their own software virtual memory system called tiling instead of relying on the virtual memory provided by the operating system.
In part the reason is historical, as these applications were initially designed for operating system which provided no virtual memory or a very small address space; but today's reason to keep them is that the VM policies of most operating systems around either suck or are unsuited to the access patterns of such programs or both.
This is nothing new, Michael Stonebraker made exactly the same point as to DBMSes about twenty five years ago in his Operating System Support for Database Management article. Some relevant quotes (remember, 25 years ago):

Although the folklore indicates that LRU is a generally good tactic for buffer management, it appears to perform only marginally in a database environment. Database access in INGRES is a combination of:

sequential access to blocks which will not be rereferenced;

sequential access to blocks which will be cyclically rereferenced;

random access to blocks which will not be referenced again;

random access to blocks for which there is a nonzero probability of rereference.

Although LRU works well for case 4, it is a bad strategy for other situations. Since a DBMS knows which blocks are in each category, it can use a composite strategy. For case 4 it should use LRU while for 1 and 3 it should use toss immediately. For blocks in class 3 the reference pattern is 1, 2, 3 ..... n, 1, 2, 3 ..... Clearly, LRU is the worst possible replacement algorithm for this situation. Unless all n pages can be kept in the cache, the strategy should be to toss immediately. Initial studies[9] suggest that the miss ratio can be cut 10-15% by a DBMS specific algorithm.
In order for an OS to provide buffer management, some means must be found to allow it to accept "advice" from an application program (e.g., a DBMS) concerning the replacement strategy. Designing a clean buffer management interface with this feature would be an interesting problem.

Although UNIX correctly prefetches pages when sequential access is detected, there are important instances in which it fails.
Except in rare cases INGRES at (or very shortly after) the beginning of its examination of a block knows exactly which block it will access next. Unfortunately, this block is not necessarily the next one in logical file order. Hence, there is no way for an OS to implement the correct prefetch strategy.

However, for the sake of my team buffered sequential copy program I have investigated what kind of advising calls are available under Linux some time ago, and I recently revisited the subject. I have been grateful that the effect of the various options under Linux is now documented in posix_fadvise(2) but when I examined it I felt very disappointed, because it says as to the policies on offer:

Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely. These changes affect the the entire file, not just the specified region (but other open file handles to the same file are unaffected).

which seems to me quite misdesigned. These options apparently only affect read-head, and do not address keep-behind, and are rather incomplete, as a LIFO policy is missing. In theory, the NORMAL policy should be suitable for LIFO access patterns (so read-ahead no, but keep-behind yes), SEQUENTIAL for FIFO access patterns (so read-ahead yes, but keep-behind no) and RANDOM for random ones (so neither read-ahead not keep-behind).
I suspect that however with filesystem data being now handled by the page cache under Linux the default is indeed keep-behind, as well as read-ahead, which I reckon to be the worst of all possible policies. This makes itself particularly felt when running my team (or any other) program to do large sequential copies, as I do to backup whole partitions on my disks, in particular as keep-behind leads to the just read, and not likely to be used again, blocks just copied to be kept around in the file page cache, which leads them to crowd out other more useful pages.
Given that the policies above don't see to influence the keep-behind side, and may or may not be actually implemented properly, I have decided for team to use also on every read and write also POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED as suitable; adding to the mess, one of the usual suspect has masterfully added the O_STREAMING option to fcntl(2), which might as well be used too.
This means that I use this sequence of code on opening the input and output files:

#ifdef POSIX_FADV_SEQUENTIAL
  (void) posix_fadvise(fdin,(off_t) 0,(off_t) ilength,POSIX_FADV_SEQUENTIAL);
  (void) posix_fadvise(fdout,(off_t) 0,(off_t) olength,POSIX_FADV_SEQUENTIAL);
  errno = 0;
#endif

#ifdef O_STREAMING
  (void) fcntl(fdin,F_SETFL,O_STREAMING);
  (void) fcntl(fdout,F_SETFL,O_STREAMING);
  errno = 0;
#endif

and something like this code for reading

  int readbytes = read(fdin,buffer,bufbytes);

#ifdef POSIX_FADV_DONTNEED
  if (donebytes >= 0 && readbytes >= 0)
  {
    (void) posix_fadvise(fdin,
      donebytes,(off_t) readbytes,POSIX_FADV_DONTNEED);
    errno = 0;
  }
#endif

#ifdef POSIX_FADV_WILLNEED
  if (donebytes >= 0 && readbytes > 0)
  {
    (void) posix_fadvise(fdin,
      donebytes+readbytes,(off_t) bufbytes,POSIX_FADV_WILLNEED);
    errno = 0;
  }
#endif

to start reading ahead the next buffer and discarding the just read blocks, and something like this code for writing:

  int writtenbytes = write(fdout,buffer,readbytes);

#ifdef POSIX_FADV_DONTNEED
  if (donebytes >= 0 && writtenbytes > 0)
  {
    (void) posix_fadvise(fdout,
      donebytes,(off_t) writtenbytes,POSIX_FADV_DONTNEED);
    errno = 0;
  }
#endif

which to me seems rather redundantly pleonastic, especially considering that this could be easily autodetected heuristically in the standard C library or by the OS itself.

051012c

Having downloaded via a BitTorrent the latest KNOPPIX DVD I went on to burn it to a DVD disc, and then I verified it with a checksum, which failed. Ha! Unlikely, because the burn seemes to have gone well. So I verified the ISO file, and it obviously checked OK, as BitTorrent does a running checksum as it downloads. So I used cmp -l to compare the ISO image and the DVD disc it had been burned to, and there were no differences: but cmp -l reported that the ISO image was shorter.
Well, bad news: DVDs write in 32KiB sector, and CDs in 2KiB sectors, and the ISO image was long a whole number of 2KiB blocks but not a whole number of 32KiB blocks. Therefore it has been padded with zeros to a 32KiB boundary by the burning program and this of course caused the verification with a checksum to fail. Yet another little detail that is easy to miss.

051012b

Some very funny fellow has sent a message to a few file system oriented mailing list reporting some alleged benchmarks he has performed. The methodology employed is not explained, and neither the context, but parts can be gleaned from the script used to run the benchmarks, which made me think that these are based on comical misunderstandings.
This prompted some comments for example (among the least of the issues) as to the worth of using as a test data set the Linux kernel sources, which amount to 240MiB, which is likely to be less than the amount of memory used for caching in most modern systems, and the splendid (in the realm of stand up comedy I guess) answer to this particular point was:

> Consider for example the relative size of the kernel source tree
> and that of your PC's memory, and that 'sync' does not have yet
> magical :-) powers.

why should i sync to disk, when i can use RAM?

which to me sounds all the more funny as the script does invoke sync repeatedly, which does clean dirty blocks, but does not remove them from the cache...

051012

Well, I was rereading some of my earlier notes about file systems, and I was also reading some entries in the rather interesting XFS mailing list; several raised interesting issues, not just about XFS, but generally interesting ones too, for example:

This thread is about write caching defaults and it points out that on most server class SCSI drives it is disabled by default, while on essentially all ATA drives it is enabled by default, and in many cannot be disabled.
This thread is about fsck of large filesystems and it turns out that since the XFS fsck utilities keep everything in memory, for somewhat large filesystems with many files a 32 bit address space is simply insufficient, and a 64 bit address space is needed. Note that the limit is indeed the address space, not just the physical memory. This reminds me of similar problems, for example with programs with many threads where a 32 bit address space is exhausted.
Clearly this space issue is, together with fsck speed, a potentially large problem with very large filesystems even when, like XFS, they can warm restart very quickly thanks to journaling and internal parallelism. Sooner or later a cold restart has to be done with an external fsck style tool.
The mailing list has fairly scary comments (1, 2, 3 for example) on the vagaries of what flushing written buffers should do and when. Basically it is not well defined, unless one checks carefully what actually happens for the particular configuration, and never mind that some drives don't actually write data to discs when asked but lie about it. The discussion was also related to a complaint by someone who had not read the conspicuous warnings in the documentation that XFS does not journal data updates, and had not ensured that these were handled properly by the application. Because it is only in the application that proper transactional semantics can be defined for it, and the right form of journaling or whatever else defined.
I also read a series of messages on preallocation, which was recently mentioned here. This thread is about it. A related message notes that writing, as opposed to overwriting, a file throws away all preallocated, as well as already allocated blocks which can have very detrimental performance issue, especially with slowly growing files like logs and downloads. Slowly growing files, in the presence of multiple programs writing, result in them being allocated several small, interleaved extents, which can only be cured with preallocation.
A large number of extents in a file can cause not just performance problems, for the file itself and others (as the free list becomes fragmented too) but also quite remarkable problems in filesystems whose internal data structures have been designed under the assumption that files consist mostly as a few chunky extents.
For logs and downloads the logging (or logrotate on their behalf) or downloading program can preallocate enough space; the downloading program knows exactly how much to preallocate, and the logging program can be configured to rotate logs every N megabytes, so it can do that too in such a case, or at least keep a significant chunk preallocated.

051011b

Having just discussed IO operation clustering and advising there is something to add as to preallocation of clusters, this can be done too via heuristics related to the speed at which the application is writing to a file: the faster the application writes, the higher the degree of clustering, so ideally the application never has to wait for a block to be allocated.
There is a special case of preallocation which to me seems rather important, but it is somewhat obscure: what to do when overwriting a file.
By default overwriting a file is accomplished by first deleting or truncating to zero length the file, and then creating or appending the new content. This is extraordinarily wasteful because the blocks previously allocated for the file are in effect a preallocation for the new contents of the file, because in general when a file is overwritten, it is overwritten with contents that are in general of a similar size: consider for example editing a document, object files resulting from compilation, or extracting a package upgrade from an archive.
This is extraordinarily wasteful not because one has to rerun the allocation code needlessly, which hopefully is cheap but for another reason: that in general dellocating a file and then reallocating it may well result in more fragmentation, because usually the free inode and block lists of the filesystem will be more fragmented than existing files, as these free lists as a rule updated rather more frequently.
Therefore ideally the heuristics should be to rewrite a file by opening it in overwrite mode, and then to truncate it at the end if it has become shorter. Again this heuristic can be done quite easily in the standard C library or equivalent, it is virtually cost free.
There are however pathological cases when there is relatively little free space, multiple files are being written to, and they are large relative to the free space: for example, if two file A and B of 20MiB and 70MiB are being rewritten to sizes 50MiB and 40MiB, and there are only 20MiB of free space, not truncating will cause problems, as A will attempt to grow by 30MiB, but B will not shrink by 30MiB until it is closed and truncated. These pathological cases can be ignored, or some obvious rule used to truncate instead of overwrite if the amount of free space in the filesystems falls below some minimum level.

051011

Well I was mentioning adaptive clustering in yesterdy's annotations and I would like add something about that, because it can be surprisingly easy. The current ext2 implementation does attempt to detect sequential file accesses in a simple minded way and then switches then to fixed-size clustering.
The detection of the access patterns can trigger one of the types of advice for both memory mapped and buffered IO accesses (and invoke madvise(2), posix_madvise(2) and posix_fadvise(2) as appropriate) in different places:

The application itself can be coded to explicitly advise the expected access patterns, and this is often easy and profitable, as Larry McVoy discovered after my advising him :-) as to cp under SunOS when Sun switched to mmap'ed IO.
But there is no need to change most applications, as pretty easy and effectively heuristics can be applied inside the standard C library. Most applications use stdio (or the PLAN 9 or C++ equivalents) to perform file accesses, and from the parameters passed to fopen(3) and the first few subsequent operations it is possible to guess the file access pattern:
- If the open mode is "r", "w" or "a" there are extraordinarily good changes that the access pattern is POSIX_FADV_SEQUENTIAL; if in particular it is "r" one can expect POSIX_FADV_DONTNEED to apply.
- If the open mode ends in "+" there are pretty good chances that the file access mode is POSIX_FADV_RANDOM if writing and POSIX_FADV_NORMAL (should be LIFO, but under Linux it actually does a small degree of FIFO) if reading.
- By observing whether fseek operations precede reads or wirtes or not, corrections can be made dynamically. For example if a file with a mode ending in "+" but there is no seeking one can expect this is an overwrite operation and switch to POSIX_FADV_SEQUENTIAL.
- The amount of clustering can be adjusted dynamically by watching the interval between IO operations and whether they have to wait for blocks or not, using common sense rules, like enlarging the cluster (up to a maximum) if sequential reads constantly have to wait for blocks to arrive.
Similar heuristics can be applied in the kernel, and indeed some simple minded ones already are there as described previously.

Unfortunately the advising system calls seem to have little effect, but adding them and some heuristics to the standard C library and the most significant operations costs little and does no harm, and most likely obtains large gains when and if they are fully implemented.

051010

I have just noticed the announcement of a new release of the davtools suite, which contains a programs that scans an ext[23] filesystem and draws a map of its fragmentation. Interesting especially considering the shocking discovery of a sevenfold decrease in ext3 performance over time.
As to that some more considerations: one is that there are three types of fragmentation:

inodes being far away from the directory that links to them;
data blocks being far away from the inode or the metadata (like tree or indirect blocks) that extends from the inode;
data (or metadata) blocks being far away from each other.

The last type of fragmentation is about accessing the data, the first two are about how fast is traversal of a subtree of the filesystem, which is a relatively frequent operation, because most files are small, and accessing many small files can mean more IO operations for the traversal than for the data in the files (this is another reason why it is important to cache filesystem metadata but much less importan to cache the data, especially as data is accessed sequentially most of the time).
There are several strategies that file system designers can employs to try and reduce the average distances involved; for example many file system design are based on groups which are subset of the storage area of a filesystem in which a subset of a filesystem is contained; for example cylinder groups for ext3 and allocation groups for JFS.
Also ext2 and the block IO subsystem, after a suggestion I made in a news discussion long ago, do both allocation and read/write clustering, instead of using large blocks. The discussion was about the idea that ext2 perhaps should be like the Berkeley FFS, and have large blocks (for speed) and small tail fragments (for saving space).
My objection to that large blocks were a very bad idea, and it was not about wasted space due to internal fragmentation (the tail fragment avoids that problem), but rather than a large block size is tantamount to mandatory allocation and read/write clustering, with a couple of large disadvantages over clustering:

The implied prefetching on read of an extra blocks is only appropriate in some cases of sequential access, and it is an extraordinarily bad idea in the case on non sequential access.
When clustering, for example with an 8KiB cluster of 1KiB blocks, the first block of a cluster is available immediately as it is read, while with a large block size of 8KiB the first 1KiB of data is only available when the 8^th 1KiB has been read. In other words, the prefeching is synchronous with large blocks, and asynchronous with a cluster of smaller blocks.
When reading the second block of a cluster, the first block can be discarded immediately; conversely, with a large block the previous N-1 subblocks can only be discarded when the last has been read, as only the block as a whole can be discarded.
When writing the second block of a cluster, the first can be written to the disc and thereafter discarded; with large blocks, the whole block can only be written after the last part of it has been wholly processed.
An obscene amount of copying can happen if the fragment size is smaller than the size of a virtual memory page, and the application writes less than a page of data at a time.

So I explained that all the benefits of large blocks plus fragments and none of the disadvantages can be obtained by using small blocks and fixed or adaptive clustering, at allocation time or when reading or writing. For allocation, it is easy to allocate more blocks than needed, because in any case the file system code must handle the case where the physical size of a file is larger than its logical size, and in case must implement the ftruncate(2) system call.
Therefore all that is needed is to allocate some blocks more than need and apply the truncate logic on file close. As to the read/write clustering, the SCSI subsystem of the time could already support it with little effort, and it was especially efficient if mailboxing was supported by the host adapter.
My points were well received and that is the reason why the FFS style block+fragment logic was never introduced to ext2 and to this day the manual page says:

mke2fs accepts the -f option but currently ignores it because the second extended file system does not support fragments yet.

and even if this phrase appears in the BUGS section it describes, for once, not a bug, but really a feature, as confirmed in this paper:

Ext2fs takes advantage of the buffer cache management by performing readaheads: when a block has to be read, the kernel code requests the I/O on several contiguous blocks. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are normally performed during sequential reads on files and Ext2fs extends them to directory reads, either explicit reads (readdir(2) calls) or implicit ones (namei kernel directory lookup).
Ext2fs also contains many allocation optimizations. Block groups are used to cluster together related inodes and data: the kernel code always tries to allocate data blocks for a file in the same group as its inode. This is intended to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
When writing data to a file, Ext2fs preallocates up to 8 adjacent blocks when allocating a new block. Preallocation hit rates are around 75% even on very full filesystems. This preallocation achieves good write performances under heavy load. It also allows contiguous blocks to be allocated to files, thus it speeds up the future sequential reads. These two allocation optimizations produce a very good locality of:

related files through block groups

related blocks through the 8 bits clustering of block allocations.

and this is probably one of the reason why (freshly loaded) ext2 filesystem routinely figure at the top of every benchmark.
This logic can be improved, for example by making the degree of clustering adaptive, and by making a certain degree of preallocation persistent, and this might be added sooner or later (1, 2).
Unfortunately preallocation clustering has been taken out of ext3 for a long time (as it complicates journaling a bit), and I suspect that this may be part of the reason why I observed a sevenfold reduction in performance in a well used filesystem as opposed to a freshly loaded one.

051009

In the previous entry I forgot to mention another interesting aspect of seeing filesystem as a database, and a file system driver as a DBMS: that some filesystems are so large that they are in effect VLDBs: a VLDB is defined as a database that is so big that backing it up (or restoring it) takes so long that it is unaffordable, at least offline.
For filesystems a filesystem can indeed be so large that it cannot be taken offline, and therefore it is possible now to do virtual snapshots of a filesystem for backup purposes.
But there is another dimension to the VLDB story, and it is that in general they are so large that checking their integrity can take an unaffordable amount of time.
Now considering large filesystems, imagine one spread over 20 discs with 20 threads; if the file system code is well designed with fine grained locking etc., and the data layout and access patterns are suitable, then most of the discs will be active and most of the threads will make progress at the same time, and indeed there are several benchmarks that show this does happen.
But what happens in the case of a warm restart, when only the journal need to be replayed, and in case of a cold restart, when the file system checking utility needs to be run? Are the jorunal replay code and the file system check utility multithreaded too? For the journal replay that does not matter a lot, because usually the journal size is not that large, and even replaying it sequentially is not going to take that much time, and one can expect that since in effect the file system code is replaying the log incrementally while in operation, if that can keep multiple discs busy, so it will during replay at startup.
But the big issue is whether the file system checking utility, running which is inevitable sooner or later, is multithreaded itself, because as a rule it must scan all the metadata, which usually will be spread across several discs, and if it does it sequentially it can take a long time.
Indeed it can take a long time for other reasons: this reports fsck as taking more then one month on a large but not that large (around 1.6 TiB) filesystem on a RAID, and one of the reasons is that recovery involves attaching very many files under lost+found/, which can be very slow if it contains very many entries. Thsi is a special case misfortune, but the issue of multithreaded checking remains.
It is not easy to find information as to which file system checking utilities are fast and multithreaded unfortunately. The one for the ext3 file system seems not to be, and neither seems to be the ReiserFS one. Probably the XFS is, and I hope that the JFS one is too.

051008

When someone mentioned the poor interactive response of his Linux workstation under conditions of some load I mentioned some of my recent investigations of filesystems and that I had decided to use these parameters:

hdparm -a 32 /dev/hda to ensure that the kernel block device readahead is smaller than the rather excessive default of 256.
echo >/sys/block/hda/queue/scheduler cfq to ensure the latency-impairing anticipatory elevator is not used, and the cfq one is used instead, which prevents one process hogging a disc.
echo >/proc/sys/vm/page-cluster 0 to disable the counterproductive and ridiculous swap-ahead logic.
echo >/proc/sys/vm/swappiness 20 to reduce the amount of memory that the system devotes to the file page cache.

Update: it turns out that in nearly all kernel versions the cfq IO scheduler is quite buggy and should be avoided, and only deadline can be recommended.

Update: the block device read-ahead should be much larger than the default of 256 because of very debatable choices in the block IO subsystem that prevent proper sequential read streaming without that.

Of these the one that took me longest to explain was the last one, about the vm/swappiness parameter of the virtual memory subsystem, because in a sense it is representative of all the others, and the poor reasoning behind each.
Historically UNIX like systems have separated process address space swapping from storage media data caching, and used rather different if not always well though policies for each.
Then the temptation to unify the two arises and a unified mechanism is introduced, where usually process address spaces are then regarded as ephemeral files in the swap/paging area; more rarely files come to be regarded as persistent address space instead.
This has relatively recently happened in the Linux kernel too, but as usual (this has happened before!) without giving much thought to the very different policies and access patterns for ephemeral process data and persistent file data.
The number one difference is that usually file data is read or written sequentially, and usually process data is read or written in timewise subsets (the so called phase hypothesis), therefore usually file data should be handled with a FIFO oriented policy (except for top level metadata) and process data with a LIFO oriented policy.
Unfortunately this is often forgotten, and for various reasons all data blocks end up being managed with a LIFO oriented policy, which is usually particularly catastrophic when large files are involved. When small files, that is smaller than one page (or two) are involved, it does not make a lot of difference, even if it is negative.
The problem is that with FIFO accesses the best policy is read-ahead and free-behind, that is to prefetch some blocks ahead of the user process requesting them, and then to discard the blocks immediately as soon the user process has used them, while a LIFO oriented policy does exactly the opposite, and a LIFO oriented policy applied to FIFO accesses guarantees the worst possible performance, as the most recently read (or written) blocks are those that are kept preferentially in memory, instead of being immediately discarded.
This is blatantly obvious for example when copying a partition around for backup purposes, where one see the block page cache expand needlessly and crowd out the process page cache; another classic is running a package upgrade, which involves extracting files from potentially large archives (the just written blocks are most unlikely to be used soon again, especially if one is installing several packages).
Of course all this madness could be prevented by the use of the madvise(2) and posix_fadvise(2) (or the mysterious, why-ever, O_STREAMING option to the fcntl(2) system call) to inform the kernel policy modules of the expected access patterns, for example for file data with POSIX_FADV_SEQUENTIAL, but that does not seem to work that well (as in, they seem ignored by the kernel).
However, setting the disc subsystem read-ahead with hdparm -a sort of does the prefetch bit (inflexibly and always) and as to discard-behind one can just reduce the size of the file page cache by decreasing vm/swappiness drastically.
Besides there is quite a bit of (ancient, forgotten) literature that shows that large file page caches are rarely beneficial. The argment also can be made that:

The definition of database is that of a collection of data whose working set cannot fit in memory, that is every data access involves at least one disc access.
Under this definition, A filesystem is a kind of large database for files.
With a database, or a filesystem, the best one can hope in the general case is to cache the metadata that leads to the data, so that data accessed involves just the disc accesses needed for the data, without adding those for the metadata.
For a filesystem the metadata is usually the top levels of the directory tree, and the top levels of each file's block tree. These do not amount to much.
In any case the amount of filesystem page caching should depends not on the amount of memory available, which is what vm/swappiness about, but on the size of the filesystems (and of their metadata) and the access patterns of the applications to them.
Usually a few megabytes are good, and more than a one or a few dozen megabytes is probably going to be wasted; as usual with caching the knee of the curve is rather sharp.

In particular cases subsets of the filesystem do get reused repeatedly, for example in repeated compilations of the same program, and then that particular working set of that data subset do fit in memory. But these subsets are often small, and also often there are concurrent activities that prevent caching anyhow, as such caching would have to occur over human-sized timescales, like several seconds or minutes.
Another interesting argument is that for various (mostly social) reasons swap partition reading and writing seems much slower than filesystem access, which encourages to avoid page swapping even at the expense of more file data accesses, by increasing the amount of memory that goes to process pages.
Also, users tolerate more delays due to data accesses, which can be related to the amount of work they do, than sluggishness in program operation, like slow menu popups due to the need to swap-in the related part of a program. Said in a slightly different way, for data page access what matters is throughput (cost) more than latency (lag), and conversely for program page access.
So, for various causes, related to both expediency in the face of missing or misimplemented kernel functionality, and to fundamental access patterns, it is best to have a small amount of memory for file page caching, and not larger than a few dozen megabytes at most, even if one has gigabytes of memory (unless much of it is not used by the programs one has loaded).

051007

A few days ago I switched my graphics card from one based on a somewhat old GeForce3 Ti200 to one based on a much more recent GeForce 6800LE, and at the same time I switched back from the open source nv X driver, which does not support OpenGL 3D acceleration, to the proprietary NVIDIA one, which does.
The upgrade and switch has been smooth, and I can now play demanding 3D games like Doom 3, Tribes 2 or Wolfenstein Enemy Territory with high graphical detail at decent speed, and under Linux too.
The GF6800LE is the cut down version of the GF6800 and GF6800GT, and it has enabled by default only 8 pixel pipelines and 4 vertex pipelines, out of 16 and 6 available. This allows the reuse of partially defective chips. The clock speed is also limited.
However it still performs well, around the level of a GeForce 6600GT, but it runs cooler and because of the twice as wide memory bus it runs better at higher resolutions and with antialiasing and anisotropic filtering. Antialiasing does not do much to improve the quality of the picture, but anisotropic filtering can make a large difference, and it is very expensive.
The particular model I bought is one of the few that are still available, as the GF6800LE chipset is not supposed to be sold retail, and is probably also being phased out. But it is cheap, and good value, even if it runs at a low clock speed by default.
There are somewhat risky tools for both MS Windows (most notably RivaTuner) and GNU/Linux (most notably nvtuner and nvclock) that allow tweaking the chipset, which is extremely configurable, to enable the disabled bits and to change the clock speed, and using these I have managed to get a little extra out of the chipset.
One interesting oddity is that I have configured X for dual monitor operations, but I have a single screen on my monitor.
I could do it because both the card and the monitor have dual connectors, in both cases a DVI and a VGA connector and crucially the monitor accepts input on both its connector, so by pressing the button on its from that toggles which one is active it can behave as if it were two monitors, or a monitor with two screens. Obviously it is best to configure them as either independent X displays or independent screens within the same display, and I decided to do the latter. No TwinView (which is evil) or Xinerama.

051006

Today I have been at the Linux World Expo in London. I haven't attended the conference sessions, but I have walked around the exhibition and attended the Fedora FUDCon 5 conference, and these are my impressions:

Lots of geeks, not that many suits

It seemed to me that the geeks still outnumbered the suits among the attendees, which is not good for business. GNU/Linux events in the USA reportedly now attract lots more suits than geeks, as potential customers do turn up. They seemed outnumbered by enthusiasts at this expo.

Fedora very visible, but OpenSUSE kept invisible

On the day in which the first OpenSUSE project release was made, the OpenSUSE project and that release were virtually invisible at the Novell SUSE which was large and luxurious. It seemed pretty obvious that Novell is only interested in selling SUSE Linux Enterprise Server and OpenSUSE was only mentioned in a small poster in a corner (update: eWeek comments on this release start with Without any fanfare, Novell Inc. has released the latest version of its flagship Linux distribution: SuSE Linux 10).
None of this short sighted attitude at RedHat: their equally slick stand had free Fedora 4 CDs, and FUDcon 5 was well attended by their top people, like Michael Tiemann and Alan Cox. This disparity seemed significant to me; I am considering switching to either Fedora or OpenSUSE, and while methinks that OpenSUSE is technically more elegant, this display of corporate noninterest by its sponsor worries me.

FUDcon 5 impressions

FUDcon 5 attracted me as I wish to switch back to an RPM based distribution. There was an interesting schedule of presentations; I only a attended a few, the ones about Java, Xen, and LVM2, and the BOF session.
The presentations about Xen and the LVM2 over DM combination were well delivered but not particularly novel, even if they had some interesting updates on new features and future directions; the one about Java discussed mostly the GCJ project. The main topic was the switch from a relatively static, C++ style implementation for natively compiled Java to one that is more dynamic and better integrated with Java running under the JVM.
I suspect this was due to the realization that a fully natively compiled standalone Java application was not a common possibility, and that therefore GCJ had willing or not to accomodate mixed native/JVM situations.
However after a question of mine it was confirmed that the static implementation was to be retained indefinitely alongside the new dynamic one.
One of the minor points is that even if the GNU CC based compiler is itself fast, for compiling very many sources the JVM based compilers run from a Java environment like Eclipse is much faster, because it is loaded and initialized only once, while gcj spawns the compiler executable many times.
At the Fedora BOF session it was confirmed that the Fedora Foundation had been created but not fully setup yet. After my question, it was said that the Fedora trademarks and other Fedora intellectual properties would continue to be owned by RedHat for a while, but probably would eventually be transferred to the foundation.

Freeware projects

There were several booths representing freeware projects, and among others I noticed:

Hula is an e-mail suite released by Novell; it contains both a mail storage server and a mail tranport agent, and quite cool web frontends for both, as well as a web base client.
Ubuntu was well presented, with neatly packaged free CDs/DVDs being given away, and was also prominently displayed at the GNOME and KDE (in its Kubuntu form) booths.
CentOS was present too, and it was a always quite interesting. It is a RedHat Enterprise clone, but with some interesting delivery improvements, like respins with updates and more repository types than Up2date, which is not awesome. It occupies a very useful spot between RH Enterprise and RH Fedora, as it is freely downloadable like RH Fedora but is stable like RH Enterprise. A good project, and many sites standardize on RH Enterprise for truly large or critical systems, and on the equivalent CentOS release for more development or desktop oriented systems, without switching to the rather more experimental RH Fedora, which is more of a developer oriented distributions.
Karoshi was a fairly elegant project for centralized network management, oriented to schools, developed and presented by a female developer who did not look particularly geeky either.
At the emDebian and OpenBSD booths there was an amazing variety of tiny computers.
Finally I was quite impressed by the ReactOS project demos, where this MS Windows NT/2000/XP compatible (not a clone) operating system was running very convincingly several applications, including Doom (with software rendering), and Abiword for example.
It is quite obvious that ReactOS is already fairly usable, and that is a significant achievement. It is easy to download it even as a small live CD ISO image, and boot it directly or under QEMU to try it out.
It is doubtful whether the WIN32 APIs are worth implementing again (or even once), but the developers think that an open sources alternative for the most widely deployed platform around has a great potential for disruptive change for the better.
Personally I think that developing for a platform that one does not control only adds to the value of that platform and benefits its owner. For example even a project like Cygwin by making available a number of desirable free software tools under MS Windows makes the latter platform more desirable and entrenched at the expense of free software platforms, most of which implement the POSIX API.

Overall it was fairly interesting; another detail that was somewhat remarkable but not unexpected that most hardware vendors were pushing rack clusters or rack servers.

051003

I am still using JFS for all my Linux partitions, and NTFS for the MS Windows system partition and FAT32 for the partitions shared between MS Windows and Linux.
So far I am quite satifisfied with the reliability and performance of JFS, and I think that I will keep it in preference to ext3, if in the longer term it results no worse than ext3 at performance degrading over time; which is likely, because of my shock at discovering that ext3 performance can degrade sevenfold over time
From some quick testing that I have done the JFS performance has also degraded noticeably even if not dramatically (perhaps around 10-20%) in the past 2 weeks since I switched.
Also, most of my disc space is actually devoted to somewhat large FAT32 partitions, containing my data archives, so I can share them between MS Windows and GNU/Linux when dual booting. That is unsatisfactory, as FAT32 is not that ideal; for one things it still has a 2GiB limit on file size, its cluster size grows with partition size to absurd levels, and the vfat file system code in Linux is very CPU intensive for whatever reason.
Given the availability of ext3 drivers for MS Windows 2000 I am considering switching those partitions over to ext3. This is somewhat ironic as I just switched from ext3 to JFS for my Linux filesystems, but it would still be a large improvement.
Sometimes I wonder how hard it would be to port the JFS code to the MS Windows 2000 filesystem kernel ABI; after all the ext3 driver for MS Windows 2000 is actually a port of the Linux one, and that's encouraging.

051002

While helping some guy in the IRC channel #LinuxHelp to get his external USB disc drive working, it occurred to me to ask him for the power rating of the case's brick power supply and the spin-up power required for his disc drive.
The poor guy was astonished by the request, but it turned out that most likely his problems were related to the power required to spin-up his disc drive exceeding that supplied by the USB case's power supply.
This is not uncommon: hard disc drives often require several times more power to start rotating than for operating, because spin-up requires accelerating the mass of the disc platters to around a hundred revolutions per second, and quickly. Most external USB boxes have power supplies designed for CD and DVD drives or 2.5" disc drives where the rotating element is a lot lighter.
To give an idea of what to expect, the typical external USB or FireWire case has an internal or external power supply rated at 1.8A or 2.0A on the 12V rail (the 12V rail powers the motor; the 5V rail powers the electronics, and its power rating is usually rather higher than required, so not an issue), and 3.5" hard discs on spinning-up draw usually from 1.3A to 2.8A on the 12V rail.
For example I have a fairly nice and cheap USB2/FireWire case with an internal power supply rated at 1.8A at 12V and 1.5A at 5V; my hard discs have power draws of around 0.5A at 5V and 0.5A-1A at 12V when operating, but on spinning-up they draw 1.3A (Western Digital WD1600BB), 1.8A (Maxtor 4D080H4), 2.2A (Western Digital WD800AB) and 2.8A (Seagate ST3160021A) at 12V.
The 1.3A one works well in the case, the 2.2A and 2.8A do not manage to achieve operational speed (so I use them only internally), and the 1.8A most of the time starts up OK, as it is just at the edge of what the power supply is rated for, but sometimes it does not, and I have to power it down and restart.
The message here is to check carefully the startup powers rating of any drives you buy for use in an external case, and similarly the power supply of the case.