Software and hardware annotations, 2005

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

Note: since this file has grown too big, I have switched to a file per quarter scheme. I have already modified all the links in this page to point to the new files into which it has been split.

September 2005

050923
Argh.... Having decided to have a look at how /proc/sys/vm/page-cluster really behaves I have had a look in the Linux kernel sources and found these astonishing bits of code:
  	/* Use a smaller cluster for small-memory machines */
	if (megs < 16)
		page_cluster = 2;
	else
		page_cluster = 3;
int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
{
	int ret = 0, i = 1 << page_cluster;
	unsigned long toff;
	struct swap_info_struct *swapdev = swp_type(entry) + swap_info;

	if (!page_cluster)	/* no readahead */
		return 0;
	toff = (swp_offset(entry) >> page_cluster) << page_cluster;
	if (!toff)		/* first page is swap header */
		toff++, i--;
	*offset = toff;

	swap_device_lock(swapdev);
	do {
		/* Don't read-ahead past the end of the swap area */
		if (toff >= swapdev->max)
			break;
		/* Don't read in free or bad pages */
		if (!swapdev->swap_map[toff])
			break;
		if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
			break;
		toff++;
		ret++;
	} while (--i);
	swap_device_unlock(swapdev);
	return ret;
}
and both make me feel sick and depressed (why is left as an exercise to the reader :-)).
Conclusion: it looks ever more important to set /proc/sys/vm/page-cluster to 0 instead of the usual default of 3.
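To make the arithmetic in valid_swaphandles concrete, here is a minimal Python sketch of the window computation only. It leaves out the locking, models the swap map as a plain list in which 0 means a free slot, and uses an assumed value for the bad-slot marker; the name readahead_window is hypothetical and not a kernel name.

```python
SWAP_MAP_BAD = 0x8000  # assumed marker value, for illustration only


def readahead_window(offset, page_cluster, swap_map):
    """Mimic the window computation of the valid_swaphandles() excerpt.

    Returns (start, count): the first swap slot that would be read and
    how many consecutive slots are read with it.
    """
    if page_cluster == 0:               # readahead disabled
        return (offset, 0)
    budget = 1 << page_cluster          # 2**page_cluster slots per cluster
    start = (offset >> page_cluster) << page_cluster  # align downwards
    if start == 0:                      # slot 0 holds the swap header
        start, budget = 1, budget - 1
    toff, count = start, 0
    while budget > 0:
        if toff >= len(swap_map):       # past the end of the swap area
            break
        if swap_map[toff] == 0 or swap_map[toff] == SWAP_MAP_BAD:
            break                       # free or bad slot ends the run
        toff += 1
        count += 1
        budget -= 1
    return (start, count)


swap_map = [1] * 64                     # hypothetical fully used swap area
print(readahead_window(13, 3, swap_map))
print(readahead_window(13, 0, swap_map))
```

With the default of 3, a fault on slot 13 of a fully used swap area pulls in slots 8 to 15, so seven of the eight pages read were not asked for; with 0 nothing extra is read.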
050921
Having decided to transform my changelog for this site into a proper syndication feed, I had to decide which format and how. I was swayed by Dave Winer's arguments for RSS 2.0 mostly because it is quite simple and backwards compatible with RSS 0.91 which was and still is so popular; I do have sympathy for the arguments in favour of the political correctness of RSS 1.0 which is based on RDF, but sometimes I get impressed by expediency too.
So I set out to find some DTD for both version 0.91 and 2.0 of RSS, for use with the PSGML mode of [X]Emacs or perhaps another DTD-validating editor like jEdit. I then created a suitable SGML CATALOG and also an XML CATALOG, and template headers for RSS files:
<?xml version="1.0"?>
<!DOCTYPE rss
 PUBLIC "-//IDN Netscape.com/DTD RSS 0.91//EN"
 "http://My.Netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
</rss>
<?xml version="1.0"?>
<!DOCTYPE rss
 PUBLIC "-//IDN Silmaril.IE/DTD RSS 2.0//EN"
 "http://WWW.Silmaril.IE/software/rss2.dtd">
<rss version="2.0">
</rss>
The RSS 0.91 DTD seems mostly fine, but the RSS 2.0 DTD is not quite right, as it is based on the idea that "A channel can apparently either have one or more items, or just a title, link, and description of its own", which is not quite correct, as an authentic sample attests.
It also imposes an order on the title, description and link subelements of item that is also quite wrong.
The RSS 0.91 DTD is conversely extremely lax as to ordering of subelements, so I fixed the RSS 2.0 DTD in an intermediate way, replacing the definitions for channel and item as follows:
<!ELEMENT channel
  ((title|link|description)+,
    (language|copyright
    |managingEditor|webMaster|pubDate|lastBuildDate
    |category|generator|docs|cloud|ttl|image
    |textInput|skipHours|skipDays)*,
   item+)>

<!ELEMENT item
  ((title|link|description)+,
   (author|category|comments|enclosure|guid|pubDate|source)*)>
The customary order is title, link, description, but the definitions above leave that unenforced as long as they precede all other subelements.
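As an illustration of the rule the modified definitions express, here is a small Python sketch that checks only the "leading elements first" part: title, link and description may come in any order but must precede every other subelement. It is not a DTD validator, the sample feed is made up, and the function name leading_elements_first is hypothetical.

```python
import xml.etree.ElementTree as ET

LEADING = {"title", "link", "description"}


def leading_elements_first(element):
    """True if the element starts with at least one of title, link or
    description and none of them appears after any other child."""
    tags = [child.tag for child in element]
    seen_other = False
    for tag in tags:
        if tag in LEADING:
            if seen_other:
                return False
        else:
            seen_other = True
    return bool(tags) and tags[0] in LEADING


# Made-up sample feed, for illustration only
SAMPLE = """<rss version="2.0"><channel>
<link>http://example.org/</link><title>Example</title>
<description>Any order of the first three is accepted</description>
<language>en</language>
<item><title>Entry</title><guid>entry-1</guid></item>
</channel></rss>"""

channel = ET.fromstring(SAMPLE).find("channel")
print(leading_elements_first(channel))
print(all(leading_elements_first(item) for item in channel.findall("item")))
```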
For extra value I have cleaned up my subtle RSS style in CSS and its accompanying RSS JavaScript helper.
The RSS style allows clean rendering of (most of) an RSS feed directly in a browser, if the browser supports CSS styling of XML; Mozilla and Firefox, Konqueror 3 and Opera 8 do this pretty well. This can be achieved by inserting this processing instruction near the top of the RSS file:
<?xml-stylesheet type="text/css"
  href="style/rss.css"?>
The RSS JavaScript helper turns the link elements into active, clickable links (only in Mozilla and Firefox); to enable this transformation, this element should be added as the last one in the body of the RSS file:
<script xmlns="http://www.w3.org/1999/xhtml" type="text/javascript"
    src="style/rss.js"></script>
050920
I got another complete machine lockup while doing a large partition copy with dd and using heavily a JFS partition at the same time. This did not use to happen before I switched to JFS, even if in normal interactive use by itself JFS seems now very reliable.
After looking at the malloc() tunables I also had a look at the kernel tunables for memory allocation and swapping. I set /proc/sys/vm/swappiness long ago to be way lower than the default, to 40, as the buffer cache does not work very well, because it uses LIFO policies when most accesses are FIFO, and tragically file (and memory) access pattern advising is not implemented. The result is good.
While reviewing the others I noticed that I should make sure that /proc/sys/vm/page-cluster is set to 0 because prefetching and/or large pages (even worse) are a very bad idea. Well, the bad news and the good news are:
  • After an extended period of really heavy web browsing it turns out that JFS by itself does not quite bring much improved responsiveness beyond the effect of having a freshly loaded filesystem.
  • Setting the page clustering parameters seems instead to result in a large improvement in responsiveness when memory is tight and there is quite a bit of memory swapped out.
050919
Sometimes my caution concerning newly released stuff is overcome by other considerations, and I set about switching piecemeal my Debian install from being mostly Sarge edition to allowing Etch (that is currently Sid) which involves some wrenching ABI transitions. One of these transitions is from GNU LIBC 2.3.2 to 2.3.5 and this causes trouble.
The reason is that version 2.3.5 does more stringent malloc checks, and therefore some applications with previously undetected bugs now crash, and this happens right at the time of the installation of updated packages.
The more stringent checks can be disabled by setting the environment variable MALLOC_CHECK_ to 0 (or 1), which of course is a bit sad.
However I decided to have a look into GNU LIBC to see if there are other interesting allocator-tweaking environment variables, and indeed there are several, and I was lucky to find a page that lists them and several others in various parts of GNU LIBC. The ones relevant for allocation are:
  • MALLOC_TOP_PAD_: if the heap has to be grown at its end, add this much to the allocation in bytes.
  • MALLOC_TRIM_THRESHOLD_: if the heap has more than these many free bytes at its end, shrink it.
  • MALLOC_MMAP_MAX_: maximum number of blocks to allocate via mmap.
  • MALLOC_MMAP_THRESHOLD_: blocks of this size or larger (in bytes) are allocated via mmap.

I also found some discussion and numbers about various aspects of GNU LIBC memory allocation in a nice if a bit old article.
I also found a report of high system overheads due to mmap allocations, which reveals that handling larger allocations via mmap can be very expensive if these allocations are short lived. The same goes for reducing the heap size when its top can be freed.
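Since these tunables are plain environment variables, they can be set per process. The following Python sketch builds such an environment and starts a child with it; the helper name malloc_env and the example values are hypothetical, chosen only to show the mechanism, and are not recommended settings.

```python
import os
import subprocess
import sys


def malloc_env(check=None, top_pad=None, trim_threshold=None,
               mmap_max=None, mmap_threshold=None):
    """Return a copy of the current environment with the given GNU LIBC
    malloc variables set; a value of None leaves a variable untouched."""
    env = dict(os.environ)
    settings = {
        "MALLOC_CHECK_": check,
        "MALLOC_TOP_PAD_": top_pad,
        "MALLOC_TRIM_THRESHOLD_": trim_threshold,
        "MALLOC_MMAP_MAX_": mmap_max,
        "MALLOC_MMAP_THRESHOLD_": mmap_threshold,
    }
    for name, value in settings.items():
        if value is not None:
            env[name] = str(value)
    return env


# Arbitrary illustrative values: send fewer allocations through mmap and
# trim the heap less eagerly, to reduce the overheads described above.
env = malloc_env(mmap_threshold=1024 * 1024,
                 trim_threshold=4 * 1024 * 1024)
subprocess.run([sys.executable, "-c", "print('child started')"],
               env=env, check=True)
```

Raising the mmap threshold and the trim threshold together is the obvious experiment for a program that makes many short-lived large allocations, at the cost of holding on to more memory.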
050918
Today a happy moment for further ALSA understanding: with a bit of deductive reasoning I figured a way around some ALSA library plugin restrictions to achieve sw mixed, 2 to 4 channel, playback. The key was to (re)realize that for my CMI8738 card device 1 on the card is the multichannel one, but only in four channel mode. This allows the dmix plugin to use it directly.
I have also had another look at the poorly documented syntax for parametric ALSA library configuration entries so perhaps I will be able to replace the duplicate definition of my configuration for cards 0 and 1 with a single parametric one for most configurations.
On a completely different topic, a very wise note by my hero Ulrich Drepper on LSB and its technical and social aspects. The author is becoming suitably cynical while keeping his scruples, too bad for his career.
And some interesting comment on the endless cycle of software reinvention for which I have another example: I often look at the LKML archive using an RSS feed, even if it is after all a mailing list, because it is a much more efficient way to fetch and read just the things that interest me, in other words to use the mailing list archive as a newsgroup.
050917
It turns out that most likely the -o noatime crashes, which recurred, are due to non-JFS issues. Also, this JFS quick patch seems to have removed one cause of trouble.
I have converted all my ext3 filesystems to JFS to see how JFS performance degrades with usage, after the sevenfold slowdown over time shock of ext3. It is obvious that virtually all file system tests and benchmarks happen on a freshly loaded filesystem, so there is very little incentive for file system authors to reduce performance degradation over time; all that matters is performance on a fresh load.
Overall KDE and X under JFS seem more responsive (in particular program startup) than under ext3; but this may be because it is a freshly loaded filesystem, or because of the switch from 4KiB to 1KiB blocks, rather than because of better latency or performance for JFS as such.
So I have created a copy of my JFS root filesystem as a 4KiB freshly loaded ext3 filesystem, and I have used that for a day. It feels better than the well used ext3 1KiB filesystem I had originally, but my impression is that it is not quite as good as JFS, perhaps because some operations that take time involve directory scanning, and I have not enabled indexed directories under ext3, but of course all non trivially small JFS directories are indexed.
As to the surprise that under JFS Konqueror does not grow that crazily, but actually occasionally shrinks, which I speculated was mmap related: well, a freshly loaded ext3 with 4KiB blocks seems to behave like JFS. It may be that I am seeing things that are not there, or that the real issue is to have the filesystem block size equal to the page size, in which case perhaps the Linux kernel does some special mmap optimization.
050916
Quite a bit more testing of file systems, more later, and new surprises. Some performance profiles are highly anisotropic and nonlinear... The latest is that in order to get the full performance of my hard drives reported by hdparm -t I have to set them up with a soft readahead of 32 sectors or more; 16 or fewer cause a huge falloff in the reported speed, for example for my /dev/hda (a WD 80GB 7200 unit) from 40MiB/s to 13MiB/s, one third (with a readahead of 24 sectors it gets to 24MiB/s). Now that looks related to back-to-back transfers, probably because of the firmware in the unit, as the others are slower with a smaller readahead but not as much.
A huge readahead is of course going to help with sequential streaming accesses, but can lead to prefetching of stuff that is not otherwise needed.
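To put the sector counts into more familiar units, here is a tiny Python sketch of the arithmetic, assuming the traditional 512 byte ATA sector; the three measurements are the ones quoted above for /dev/hda.

```python
SECTOR_BYTES = 512  # assumed: traditional ATA sector size

# (readahead in sectors, MiB/s reported by hdparm -t), from the text above
measurements = [(16, 13), (24, 24), (32, 40)]


def readahead_kib(sectors):
    """Size of a readahead window given in sectors, in KiB."""
    return sectors * SECTOR_BYTES / 1024


best = max(rate for _, rate in measurements)
for sectors, rate in measurements:
    print(f"{sectors} sectors = {readahead_kib(sectors):.0f}KiB: "
          f"{rate}MiB/s, fraction of best {rate / best:.3f}")
```

So the threshold for full speed on this drive is a readahead of just 16KiB, which is small compared to what a streaming workload would read anyhow.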
I have also noticed in the recent OLS 2005 paper on ext3 evolution that as of Linux 2.6.10 ext3 locking is rather less coarse than before, which should help a lot with scalability to highly parallel benchmarks.
I also appreciated and liked the emphasis the paper gives to stability and backwards compatibility, as well as recoverability, as goals for ext3. From the paper I learned that the indexed directories in recent ext3 versions use indexes carefully designed to be on top of an unmodified directory data format, so even if the index is corrupted the directory is still readable.
The paper reported also the valiant attempts by some to add more modern features, like extents, most of which break backward compatibility, but I dislike them. If one wants extents there are already extent based file systems out there that are quite good. The goal of ext3 should be to be itself, not to mutate into something else. But then it may be a case of job protection: if somebody's job title is to be an ext3 developer it may be hard to talk oneself out of a salary by saying that things are fine as they are; the same logic as the constant innovation in marketing or pricing plans: if your job is marketing manager or pricing manager, it may be counterproductive to tell your boss that the current marketing campaign or price plan are just fine.
050915
More gripping discoveries with filesystems... Not unexpectedly my newly loaded JFS based file systems have resulted in a dramatic improvement in GUI responsiveness as widely scattered blocks in widely scattered files have been replaced presumably by contiguous blocks in nearby files.
But rather astonishingly my KDE apps seem no longer as memory mad and in particular they no longer grow crazily with time (so far). Their total allocated memory also shrinks occasionally, and using pmap I have checked that the anonymous mappings do indeed shrink.
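The same check that pmap allows by eye can be scripted. Below is a minimal Python sketch that sums the anonymous mappings in text formatted like /proc/<pid>/maps; it treats any mapping without a pathname as anonymous, so on kernels that label regions such as [heap] those are not counted. The sample text is made up and the function name anonymous_kib is hypothetical.

```python
def anonymous_kib(maps_text):
    """Sum the sizes, in KiB, of the mappings that have no pathname in
    text formatted like /proc/<pid>/maps."""
    total = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:        # not a mapping line
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        if len(fields) == 5:       # address perms offset dev inode, no path
            total += (end - start) // 1024
    return total


# Made-up sample in the /proc/<pid>/maps layout
SAMPLE = """\
08048000-08056000 r-xp 00000000 03:01 12345 /usr/bin/example
08056000-08057000 rw-p 0000d000 03:01 12345 /usr/bin/example
40017000-40018000 rw-p 00000000 00:00 0
40018000-4001b000 rw-p 00000000 00:00 0
"""

print(anonymous_kib(SAMPLE), "KiB anonymous")  # prints "16 KiB anonymous"
```

Feeding it the contents of /proc/<pid>/maps for a running browser at intervals would show whether the anonymous total really shrinks over time.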
I suspect this is not at all due to the switch to JFS as such, but to JFS supporting only 4KiB blocks, while my previous file system was ext3 with 1KiB blocks. I had often wondered just how on a CPU with 4KiB pages was mmap dealing with files broken in 1KiB blocks, and now I guess that:
  • mmap does not deal very well with non default block sizes that are smaller than the size of a page.
  • If mmap sometimes cannot deal at all with file blocks being smaller than pages then file IO will fall back to read and write via the buffer cache. Perhaps it turns out that this is not well tuned because most people use page-sized file system blocks.
  • There is something demented in either the VM subsystem or the memory allocator in GNU LIBC that handles very poorly the case where file system block size is smaller than the CPU page size. Recent versions of the allocator are supposed to turn large allocations into their own memory segments, and indeed pmap shows quite a few 4KiB, 8KiB and 12KiB anonymous mappings in existence for Konqueror (but there is a single 59KiB anonymous mapping, presumably the main allocator arena).
  • For executables, which are presumably mmaped into memory on exec, the alternative is to read them into memory where they get reblocked into 4KiB pages that then get swapped out. The dreadful suspicion is:
    • copy-on-write applies to mmap'ed executables, which means all processes running the same executable share the same pages (minus the copy-on-write ones), whether or not they descend from a common ancestor;
    • for executables that have been read on exec that creates a distinct copy.
To see deeper into this startling behaviour I am duplicating my root file system into a newly loaded ext3 file system, first with 1KiB blocks and then with 4KiB blocks, to be doubly sure.
Well, first surprise: my previous notes that overall JFS saved memory with respect to ext3 with 1KiB may not be totally reliable, because my new JFS root and its fresh ext3 1KiB give:
Space under ext3 and JFS
File system       JFS               ext3 1KiB         ext3 4KiB
hdc1 (8032MiB)    7220 Used         6495 Used         7230 Used
                  781 Available     1061 Available    327 Available
This may simply be because the root file system (which in my case includes /var and thus the Squid cache, as well as the library headers etc.) contains a very large percentage of small files, and thus the loss due to the larger block size is greater than the gain on metadata; this seems reasonable as ext3 with 4KiB blocks has about the same space used as JFS which also has 4KiB blocks, but a lot less space available.
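The loss due to block size is easy to sketch: each file occupies a whole number of blocks, so the tail of its last block is wasted. The Python fragment below shows the effect on a purely hypothetical mix of file sizes; it ignores metadata, tail packing and anything else a real file system does.

```python
def allocated_bytes(file_sizes, block_size):
    """Space taken by file data when every file occupies whole blocks."""
    # -(-a // b) is ceiling division on integers
    return sum(-(-size // block_size) * block_size for size in file_sizes)


# Hypothetical mix: many small files and a few large ones
sizes = [300] * 1000 + [2500] * 500 + [5_000_000] * 3

for block_size in (1024, 4096):
    used = allocated_bytes(sizes, block_size)
    wasted = used - sum(sizes)
    print(f"{block_size} byte blocks: {used} allocated, {wasted} wasted")
```

With a mix dominated by small files nearly all of the waste comes from them, which is consistent with a root file system doing worse with 4KiB blocks.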
By the way, in the reading back of this optimally laid out fresh file system, ext3 with 4KiB blocks was 40-50% faster than JFS, which is slightly puzzling.
050914
As I want to experience what happens to the performance of well used JFS file systems, I have converted my ext3 ones to JFS; and I got my first metadata corruption (in the dtree of a directory) when unpacking a file from a FAT32 file system into a newly formatted JFS one, and this after the crash with noatime a while ago. It may be it is not the JFS code after all: it could be some dodgy code somewhere else overwriting things where it should not, after all I am using a bleeding edge 2.6.13 kernel.
It might be some hardware problem, so just to err on the safe side I have recently run memtest86 overnight and no problems were reported. I am also real sure CPU etc. temperatures are low, and I have a superstable 550W power supply which is wildly overspecified for my box.
However, wonders never cease, and surprising news: despite the move from 1KiB blocks to 4KiB blocks the amount of free space has grown substantially, and at the same time the amount of space used for data has also grown, as reported by df -m:
Space under ext3 and JFS
File system        ext3               JFS
hdc6 (4016MiB)     3332 Used          3381 Used
                   441 Available      620 Available
hdc7 (24097MiB)    20944 Used         21164 Used
                   1839 Available     2901 Available
hdc8 (9028MiB)     7997 Used          8098 Used
                   558 Available      900 Available
This miraculous situation is probably because while JFS has large blocks (which explains in part why the amount of space used has grown) it also probably has, especially in these conditions, a much smaller metadata overhead because:
  • JFS only allocates space for most metadata as needed, while ext3 allocates most statically and usually one makes sure it is overallocated.
  • For the metadata that ext3 allocates dynamically, the indirect blocks in the file space tree, JFS uses extents instead.
    While the number of indirect blocks is solely a function of the number of blocks in a file, the number of extents in a file under JFS depends also on how contiguous is the free space area.
    Since the JFS numbers above are for a load into an empty file system, in which the free space area is entirely contiguous, it is likely that most if not all files will be described by a single extent.
These considerations suggest that:
  • JFS can be after all more space efficient than ext3 even when the latter has a smaller block size.
  • If there are many large files, ext3 with a larger block size might take less space than with a smaller block size, because the internal fragmentation at the tail is less important, and many fewer indirect blocks are needed because the file is chunked in many fewer data blocks, or in other words a lower number of bigger fixed size extents.
  • The available space reported in a JFS file system is not quite the same as that for an ext3 file system, as part of it will be taken by the metadata of newly added files, which is mostly preallocated for ext3.
  • As the free space area becomes less contiguous, the JFS filesystem will have rising metadata overheads because of an increased number of extent descriptors used, up to the limit of needing an extent descriptor per each 4096 byte basic block.
  • While reloading a file system into a freshly made filesystem does wonders for ext3 speed, it is likely that it also increases the available space under JFS, as files that previously needed several extent descriptors end up in a single extent.
  • In order to increase the chances of proximity in allocating inodes and data blocks, ext3 should not be fully allocated, its available space should never fall too low; indeed I think that the default 5% reserve is way too low, considering the sevenfold slowdown over time possible. The same probably applies to JFS, and doubly so, as a larger available free space reserve raises the chances that longer contiguous extents are found, and therefore that both speed and space occupied are better.
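The point about indirect blocks can be made concrete with a rough Python sketch of the classic ext2/ext3 block mapping: 12 direct pointers in the inode, then single, double and triple indirection, with 4 byte block pointers. This is a simplified model written for illustration (the function name is hypothetical), not code taken from ext3.

```python
def ext3_indirect_blocks(file_bytes, block_size):
    """Count the indirect blocks a file needs under the classic scheme:
    12 direct pointers, then single, double and triple indirection."""
    ptrs = block_size // 4                 # pointers held by one block
    blocks = -(-file_bytes // block_size)  # data blocks, rounded up
    remaining = max(0, blocks - 12)
    total = 0
    # single indirection: one block maps up to ptrs data blocks
    level1 = min(remaining, ptrs)
    if level1:
        total += 1
    remaining -= level1
    # double indirection: one top block plus one block per ptrs data blocks
    level2 = min(remaining, ptrs * ptrs)
    if level2:
        total += 1 + -(-level2 // ptrs)
    remaining -= level2
    # triple indirection: one top block, then two lower levels
    if remaining:
        singles = -(-remaining // ptrs)
        doubles = -(-singles // ptrs)
        total += 1 + doubles + singles
    return total


for size in (1 << 20, 100 << 20):
    for block_size in (1024, 4096):
        count = ext3_indirect_blocks(size, block_size)
        print(f"{size >> 20}MiB file, {block_size} byte blocks: "
              f"{count} indirect blocks")
```

In this model a 1MiB file needs 5 indirect blocks with 1KiB blocks and just 1 with 4KiB blocks, while under JFS the same file freshly loaded into contiguous free space would likely be described by a single extent.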
050913
More investigation of Linux filesystem performance and features. It has occurred to me that in the recent testing I did one of the main limitations was that speed tests were done on a freshly loaded filesystem, presumably one where layout was optimal, and that I had not tested the time taken to fsck (there are others, but minor -- I hope).
So I decided to take my main root filesystem, which is around 7GiB in size, and has been rather thoroughly mixed up by upgrades, spool work and so on, copy it to a quiescent disc first block-by-block, then file-by-file (thus in effect optimally re-laying it out), both as ext3 with 1KiB block size (which is also its current setup) and as JFS. Then to apply the read/find/remove tests and a new fsck test.
Well, all this takes a long time, which is the main problem with extensive filesystem tests; because small scale tests are just not realistic (and a few GiB is at the lower end of plausibility too, unfortunately). I wish someone with more time and money did some more extensive tests. But then too bad that most of those I have seen have been somewhat less than well constructed.
Another factor is the large number of kernel problems I have encountered, necessitating frequent reboots; the general principle that non default configurations are dangerous seems to hold; for example as I was writing a restore from a tar file on a vfat partition to a JFS partition just got hung, and I am about to reboot. Perhaps the combination of FAT32 and JFS transfers has not been used much...
However, as to the issue of well used filesystems, shocking news (and this is really a like-for-like comparison):
New vs. used file system test
File system                 Repack          Find            fsck            Notes
used ext3 1KiB              64m10s / 81s    06m43s / 06s    06m44s / 04s    13% non contiguous
new ext3 1KiB               09m12s / 74s    03m03s / 03s    04m31s / 04s    1% non contiguous
new JFS 4KiB                11m56s / 64s    02m50s / 05s    02m14s / 04s    558MiB free instead of 829MiB
new ReiserFS 4KiB notail    26m53s / 70s    05m34s / 06s    02m34s / 16s    1293MiB free instead of 829MiB
I made really sure these are like-for-like comparisons; the file system is my root one (around 420k between files and directories, and 6.7GiB of data), and I have copied it for each test to an otherwise quiescent disc, first with dd to get it as-is, highly used, and then I reformatted the partition and used tar to copy it again file-by-file to get a neatly laid out version. For the sake of double checking I then rebooted into the newly created partition and reran the same tests on the original file system, and the results were coherent with those above (the exception is that they were around 25% lower, as the original disc is 7200RPM vs. 5400RPM and so on).
For pure metadata based operations (find, fsck) the newly loaded version is roughly twice as fast; but for reading all the data it is seven times faster. To me this indicates that metadata (directories, inodes) is fairly randomized even in a freshly loaded version (and indeed running vmstat 1 shows very low transfer rates, and the disc rattles with seeking), but data is laid out fairly contiguously. But after repeated package upgrades and the like the data also becomes rather randomized, and indeed this is also borne out impressionistically by looking at the output of vmstat 1 and the rattling of the disc (a lot less).
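The ratios quoted above can be recomputed from the table. Here is a short Python sketch, assuming that the first figure in each cell is the elapsed time; the helper name seconds is hypothetical.

```python
import re


def seconds(text):
    """Convert a time such as '64m10s' or '81s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes, secs = match.groups()
    return int(minutes or 0) * 60 + int(secs)


# Elapsed times for the used and the freshly loaded ext3 1KiB copies
used = {"repack": "64m10s", "find": "06m43s", "fsck": "06m44s"}
new = {"repack": "09m12s", "find": "03m03s", "fsck": "04m31s"}

for operation in used:
    ratio = seconds(used[operation]) / seconds(new[operation])
    print(f"{operation}: {ratio:.1f} times faster when freshly loaded")
```

This gives 7.0 for the repack, 2.2 for find and 1.5 for fsck, so "roughly twice" is a generous summary of the metadata operations while "seven times" for reading all the data is exact.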
In the table above there is also a line with JFS numbers; these are for the same stuff as a freshly loaded JFS file system. Since I don't have a well used JFS file system I have decided to convert my root one to JFS, and then in a month or two check out how much it degrades after the usual frequent package install and upgrade that I do. With JFS the speed as freshly loaded is a bit slower or a bit faster than for ext3 freshly loaded, but there is an extra 5% of space used as JFS uses 4KiB blocks instead of 1KiB (and there are lots of small files in a root file system).
In a further illustration that non default configurations are dangerous, at one point I compared two otherwise identical JFS file systems, one of which however was 2.5 times slower than the other on all operations. I then remembered that for the slower one I had whimsically set the journal size to 30MiB, while for the other, 2.5 times faster, one I had let the journal size default to 32MiB. It astonished me that such a small detail has such an impressive impact on performance, but then I guess that the JFS code has never been tested with a 30MiB journal size...
Update: I have not been able to reproduce this result, and I have come to the conclusion that it was due to a bizarre hardware issue.
There is also a line with ReiserFS numbers, as an outlier point of comparison (it saves a fair bit of space, but it is much slower than either ext3 or JFS). No Reiser4 data out of arbitrary laziness (it still needs to be manually patched into the kernel).
So for now the conclusion is: at least for ext3 with time the layout becomes rather fragmented, with extremely large impact on performance in at least some cases. The cost of seeking is so large that a rise in the non-contiguous percentage reported by fsck.ext3 from 1% to 13% involves a sevenfold decrease in sequential reading performance.
To avoid this file systems should be regularly straightened out by dumping them to something and then copying them back file by file.
050912
Another anecdote about ATA/IDE drives not flushing when asked:
However, the disk hasn't actually written out its cache yet. It lied to the OS / file system and said it had, but it hasn't, it's busy doing something else. Poof, the power goes out.
Now, the journal doesn't have our data, we've already cleared it out, and the file system, which is supposed to have been coherent because we fsynced it, is not, and it is now corrupted.
I have reproduced this behavior a dozen or more times on IDE based systems. The only way to stop it is to tell the drive to stop using it's write cache.
A while ago I had mentioned similar gossip and then added flush the buffer cache more frequently (the kernel one) as a possibly useful palliative.
Unfortunately from my investigation of filesystem features it turns out that only ext3 allows tuning the flushing frequency (which is also useful for laptops, where one wants to make it less frequent); JFS does not, and XFS has a policy of doing it as rarely as possible, which they call delayed allocation because it raises the chances of being able to allocate a large contiguous extent, and to write it in a single block IO.
050911 (updated 050915)
I have collected and listed below some online resources about filesystems. Older ones are not quite accurate, because things in kernel 2.6 are quite better than in kernel 2.4 and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. So the references below are ordered by most recent first.
General
Descriptions
Benchmarks
Warnings: many of these benchmarks not only are designed somewhat naively, some truly essential aspects of the context, like the elevator or the filesystem readahead, are not mentioned; benchmarks under Linux 2.6 can give very different results from under Linux 2.4; SCSI and ATA/IDE disc drives have very, very different performance profiles, including sync reporting.
050910
During my filesystem experiments I have chanced on this nice SlashDot comment:
I've been using ReiserFS _EXCLUSIVELY_ since about 2.4.11 and I've never had a single problem. It's important to format with the defaults and not specify 'special' arguments to mkreiserfs or you can run into trouble.
which is a classic case of the social way of defining that a program works: it works if most users do not run into bugs. Usually such programs are misdesigned and misimplemented, so that they mostly do not work, and sometimes (usually only for a demo to the boss) they seem to work. Then the bugs most complained about get fixed, and that's it.
The alternative is to design and implement the program so that it works almost always save for inevitable rare mistakes, which eventually get found and corrected.
As to the social definition of working, in 2.6.13 the XFS code crashes for blocksizes of 1024, the JFS code crashes if a JFS filesystem is mounted with -o noatime, and UDF if one deletes files.
The obvious inference is that very few users have used blocksizes other than the default for XFS, have used non default mount options for JFS, or have deleted files from a UDF filesystem, all rather plausible assumptions.
BTW, I have done some light testing (not with JFS of course) about mounting journaled filesystems with -o noatime and indeed some operations like long searches are faster, even if not dramatically. This is as expected, because each directory traversal and file read generates by default an access time update, that has to be journaled, and the journaling involves locking etc., and -o noatime avoids all of that. Probably the benefit is much larger on parallel systems.
I also think some points about reliability of various filesystems need to be expanded. XFS for example very aggressively caches updates in memory, in order to be able to coalesce them in large write-to-disc transactions. Considering that most discs handle reading a lot better than writing, this can be very worthwhile. Unfortunately it also means that crashes can do very large damage, with the loss of a lot of data updates.
Probably ReiserFS does the same. However as to ReiserFS there is another problem: its metadata is very tightly packed and not duplicated. This means that bad blocks in the metadata area can cause very extensive damage, even if it is one of the few (with ext3) that has full bad block handling. By contrast ext3 duplicates the superblock many times, and divides the disk into several semi independent cylinder groups.
050909
Some more notes on filesystem tests. First a comparison of features (under Linux 2.6, under 2.4 things can be quite different):
Desktop filesystem features
Feature                         ext3              JFS                 XFS
Block sizes                     1024-4096         4096                512-4096
Tunable commit interval         yes               no                  no
Supports VFS lock               yes               yes                 yes
Has own lock/snapshot           no                no                  yes
Small data in inodes            no                some                auto
fsck speed                      slow              ?                   fast
Redundant metadata              yes               yes                 ?
Bad block handling              yes               mkfs only           no
Supported by GRUB               yes               yes                 mostly
Names                           8 bit             UTF-16 or 8 bit     8 bit
noatime                         yes               yes                 yes
sync                            yes               no                  no
O_DIRECT                        ?                 ?                   yes
DMAPI                           no                patch               option
Quotas                          both              patch               both
Max fs size                     2-8TiB            32PiB               18EB
Max file size                   1-4TiB            4PiB                9EB
Max files/fs                    2^32              2^32                2^32
Max files/dir                   2^32              2^31                2^32
Max subdirs/dir                 2^15              2^16                2^32
Number of inodes                fixed             dynamic             dynamic
Resize journal                  offline           ?                   offline
Journal on another partition    yes               yes                 yes
Journals data                   option            no                  no
Journal disabling               yes               yes                 no
Case independent                no                option              option
Can grow                        online            online only         online only
Can shrink                      no                no                  no
Journals what                   blocks            operations          operations
Journal size                    fixed             fixed               grow/shrink
Indexed dirs                    option            auto                yes
Quotas                          both              both                both
EA/ACLs                         both              both                both
Special features or misfeatures:
  ext3: In place convert from ext2. MS Windows drivers.
  JFS: Case independent option. Low CPU usage. DCE DFS compatible. OS2 compatible.
  XFS: Real time (streaming) section. IRIX compatible. Very large write behind. Superblock on block 0.
050908
For the sake of getting a very approximate idea of desktop filesystem performance under Linux I have done some mini benchmarks involving:
  • A PC with an Athlon XP 2000+ and 512MB.
  • Linux kernel 2.6.13 with no GUI, otherwise quiescent.
  • A 4GB partition on a 160GB 7200rpm 100MHz ATA hard drive.
  • A .tar.gz of a SUSE 9.3 root filesystem (3132712960 bytes uncompressed, 173759 entries), chosen because it contains a lot of small files and a number of fairly large files.
  • The operations of restoring the filesystem, re-tar-ing it, finding a file based on a non-name property, and deleting all files in the filesystem.
The goal has been to see both the elapsed time and the CPU system time for each operation, and how much space is left free when the file system is empty, has been restored, and has been deleted (to see the space efficiency of the filesystem). I have taken reasonable precautions to have the operations not skewed (too much) by various sources of bias.
The results for various types of filesystem (and various block sizes for some filesystems) are:
Desktop filesystem test (each time cell is elapsed / CPU system)
Filesystem  Code size  Free after mkfs  Restore          Free after restore  Repack         Find           Delete         Free after delete
ext3 1KiB   195,163B   3770KiB          5m22s / 0m39s    783KiB              3m03s / 0m27s  0m47s / 0m01s  2m12s / 0m06s  3770KiB
ext3 4KiB   195,163B   3747KiB          5m06s / 0m30s    454KiB              2m39s / 0m25s  0m38s / 0m01s  1m19s / 0m04s  3747KiB
JFS 4KiB    189,084B   4000KiB          5m38s / 0m31s    683KiB              3m46s / 0m21s  1m01s / 0m03s  2m44s / 0m05s  3988KiB
XFS 4KiB    549,809B   4007kB           5m05s / 0m56s    720KiB              3m50s / 0m35s  0m44s / 0m26s  1m41s / 0m27s  3923KiB
UDF 2KiB    72,157B    4016KiB          10m45s / 1m07s   768KiB              2m55s / 1m02s  1m40s / 0m34s  n.a.           n.a.
Notes:
  • The system times above are probably not complete, as they relate only to the overheads directly incurred.
  • The repack, find, delete times above are for cleanly loaded data; it would be interesting to see how performance would degrade after a lot of file additions and deletions, which would presumably result in a less optimal layout.
  • The elevator used was the anticipatory one; some tests with the cfq elevator show slightly increased elapsed times, which is not surprising as the anticipatory elevator optimizes throughput at the expense of latency, and vice versa for cfq.
  • The numbers for ext3 were with data=writeback, but tests showed that with data=ordered the restore took only a little more time, so the latter, which is the default, is good. With data=journal the restore took 40% more time.
  • The specific commands given were:
    # gunzip -dc /tmp/SUSE.tar.gz | (time tar -x -p -f -)
    # (time tar -c -f - .) | cat > /dev/null
    # (time find * -type d -links +500)
    # (time rm -rf *)
    and the execution of each was preceded by unmounting, flushing the buffer cache, and remounting; for the restore, the archive was on a different drive.
  • JFS (because of its OS/2 lineage) has a unique feature, case independent file name matching (which can only be enabled when the filesystem is created).
    This of course is not POSIXly correct, but it can be a (rather large) advantage for filesystems exported via Samba.
  • The UDF filesystem crashed on deletion; also mkudffs supports block sizes of 1KiB, 2KiB and 4KiB, but the udf system module only supports 2KiB.
  • The ReiserFS and Reiser4 were not tested at all.
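As an aside on the measurements in the notes above: the same two quantities that time reports (elapsed time and CPU system time) can also be collected from a small script. This is only a sketch of the idea in Python, not what I actually ran; the function name time_command is made up, and it times a single command rather than a pipeline. Like time, it only sees CPU time charged directly to the child, so the caveat about incomplete system times applies here too.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (elapsed, user CPU, system CPU) in seconds."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=True)
    elapsed = time.monotonic() - start
    after = os.times()
    # children_user/children_system accumulate the CPU time of children
    # that have been waited for, which subprocess.run does for us.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return elapsed, user, system


if __name__ == "__main__":
    # Stand-in command for illustration; the real runs used tar, find and rm.
    elapsed, user, system = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed {elapsed:.2f}s  user {user:.2f}s  system {system:.2f}s")
```

The unmount, cache flush and remount steps before each run are deliberately left out, as they need root and depend on the machine.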
Some opinions:
  • ext3 for a small desktop machine seems the best bet overall, and the choice is between a somewhat faster 4KiB block and a rather more space efficient 1KiB version. Considering the availability of a lot of tools (including MS Windows drivers) for ext[23] that are not available for other filesystem types, this impression is reinforced.
    The one possibly large weakness of this filesystem is that directories are linearly scanned by default. There is now an option to have them as trees, but if directory size is an issue then I'd suggest JFS.
  • UDF has been included as an oddity, for one thing it is not journaled. But UDF amazingly is the smallest in code and still puts in a fairly credible performance. It is slow at writing, but maybe this could be optimized. The UDF filesystem code feels less mature than the others.
  • It is utterly amazing to me that the code for the XFS filesystem amounts to half a megabyte.
  • JFS and XFS are slow (but still quite reasonable), but other tests show they perform much better under highly parallel setups. Of the two JFS is slightly slower in elapsed time, and XFS has much higher CPU overheads. Those other tests show that XFS scales better. I would still prefer JFS in most cases because of smaller code and because at high throughput rates higher CPU overhead could matter a lot more.
Overall my preference goes to ext3 with 1KiB blocks (still fast, saves a fair bit of space), or to JFS for more demanding environments or for filesystems exported with Samba (parallelizes well, good features, low CPU usage).
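The space efficiency remark can be checked with a little arithmetic on the free space columns of the table above: space consumed by the restore is free after mkfs minus free after restore. A minimal sketch in Python, with the figures copied from the table in the units printed there (the helper names are made up for the illustration):

```python
# (free after mkfs, free after restore), copied from the results table.
FREE_SPACE = {
    "ext3 1KiB": (3770, 783),
    "ext3 4KiB": (3747, 454),
    "JFS 4KiB": (4000, 683),
    "XFS 4KiB": (4007, 720),
    "UDF 2KiB": (4016, 768),
}


def space_used(free_after_mkfs, free_after_restore):
    """Space consumed by restoring the archive."""
    return free_after_mkfs - free_after_restore


def usage_table(free_space):
    """Map each filesystem name to the space its restore consumed."""
    return {name: space_used(*pair) for name, pair in free_space.items()}


if __name__ == "__main__":
    used = usage_table(FREE_SPACE)
    smallest = min(used.values())
    # List filesystems from most to least compact.
    for name, amount in sorted(used.items(), key=lambda item: item[1]):
        print(f"{name:10s} used {amount:5d}  (+{amount - smallest} over the most compact)")
```

By these figures ext3 with 1KiB blocks consumes 2987 units against 3293 for ext3 with 4KiB blocks, which is the "fair bit of space" mentioned above.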
050907
I have been trying out the various elevator algorithms on my little desktop machine, and reading about them, as well as trying various filesystems, and my approximate conclusions are:
  • As to filesystems, ext3 for most configurations (especially desktops, in particular if dual booting with MS Windows), but JFS for really large partitions and/or for systems with several processors and RAID with many discs, and perhaps switch to XFS for really really large numbers of processors and disc arrays.
  • As to elevators, cfq for desktops as it minimizes latency, as (the anticipatory elevator) when throughput is more important (but it is not good on subsystems with many heads, like RAIDs), and the deadline elevator for DBMSes (best for random access patterns); noop is good for storage subsystems with their own intrinsic scheduling.
The use of cfq in particular has been useful for me to reduce the hogging of the disc by particularly large disc operations, like installing packages or filesystem scans, that would make most other processes rather unresponsive with as, for example.
However each of these elevators has blind spots. For example I use a multiprocess piping program to clone some of my partitions to a backup hard disc every night, and while with as, which favours large sequential transfers, it does 20-25MiB/s, with cfq it does only 4-6MiB/s if it runs as 3 processes, returning to 11-12MiB/s if run as 2 processes. Pretty amazing.
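For trying out elevators per drive, 2.6 kernels expose, as far as I recall, a file per block device that lists the available elevators with the active one in square brackets. The path and the file format below are my assumption and are not taken from the notes above, and the helper names are made up. A sketch in Python of reading that format:

```python
def parse_schedulers(text):
    """Return (active, available) from the text of a queue/scheduler file.

    Assumed format: names separated by spaces, the active one in brackets,
    for example "noop anticipatory deadline [cfq]".
    """
    available = []
    active = None
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]
            active = word
        available.append(word)
    return active, available


def scheduler_path(device):
    # Assumption: per-device sysfs location; "device" is a name like "hda".
    return f"/sys/block/{device}/queue/scheduler"


if __name__ == "__main__":
    sample = "noop anticipatory deadline [cfq]"
    active, available = parse_schedulers(sample)
    print(f"active: {active}; available: {', '.join(available)}")
```

On a kernel that has the file, the current choice could be read with parse_schedulers(open(scheduler_path("hda")).read()), where hda is only an example device name; writing an elevator name to the same file as root should switch it.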
050906
I was assisting somebody asking what kind of filesystem to use for a small network storage server with a small RAID array, and then I got asked about various Linux filesystem tradeoffs. My take is more or less this:
  • The ext2 usually has awesome performance for almost anything, but does not journal, so bad news for large filesystems.
  • The ext3 has pretty good performance across the board except that since it uses kernel based coarse locking (particularly essential for the journal) it does not scale well to highly parallel hardware and process configurations (presumably because locking the journal becomes a bottleneck, as ext2 scales well).
    It is probably by far the best for desktop style computers; for larger systems it can result in very long fsck times in part due to having index based directories disabled by default (which is right, because ext3 is designed to be simple, and index based directories are sort of unnatural for it, even if they are now available).
  • JFS and XFS both seem rather more scalable, because they are designed for high degrees of both internal (their own fine grained locking mechanism) and external (ability to work with several drives) parallelism.
    XFS seems even more scalable than JFS, but at lower levels of scale the advantage is eaten by the high overheads of XFS. XFS is also awesomely complex code, and even if it is mature, well tested code, that worries me.
  • Elevator choice can have a truly dramatic impact on performance even greater than filesystem choice.
050905
As to ridiculous memory usage by Konsole here is my latest:
 USER     PRI  NI  VIRT   RES   SHR S CPU% MEM% Command
 pcg       15   0 64192 38788  8120 S  0.0  2.6 kicker [kdeinit]
 pcg       15   0 62412 38960  7076 S  0.0  2.6 konsole [kdeinit]
 pcg       15   0 60752 37108  7160 S  0.0  2.5 konsole [kdeinit]
 pcg       15   0 46908 27868 10480 S  0.0  1.9 konsole [kdeinit]
and this is what it was just after startup a few days before:
 USER     PRI  NI  VIRT   RES   SHR S CPU% MEM% Command
 pcg       15   0 34064 17656 13520 S  0.0  4.3 kicker [kdeinit]
 pcg       15   0 33200 17072 12508 S  0.0  4.1 konsole [kdeinit]
 pcg       15   0 32460 16124 12292 S  0.0  3.9 konsole [kdeinit]
 pcg       15   0 32452 16116 12292 S  0.0  3.9 konsole [kdeinit]
This is just ridiculous and sick: well, each konsole process has a few tabs open, but having grown to a resident set of several dozen megabytes is just utterly sick (never mind the over 60MB of reserved total memory), even if almost a dozen of those megabytes are shared KDE libraries. And KDE is not as bad as others...
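To put a number on the growth, one can subtract SHR from RES in the two listings above, which gives a rough estimate (no more than that) of the unshared resident memory of each process. A small sketch in Python with the lines copied from the listings; the function name is made up:

```python
LATEST = """
pcg       15   0 64192 38788  8120 S  0.0  2.6 kicker [kdeinit]
pcg       15   0 62412 38960  7076 S  0.0  2.6 konsole [kdeinit]
pcg       15   0 60752 37108  7160 S  0.0  2.5 konsole [kdeinit]
pcg       15   0 46908 27868 10480 S  0.0  1.9 konsole [kdeinit]
"""

AT_STARTUP = """
pcg       15   0 34064 17656 13520 S  0.0  4.3 kicker [kdeinit]
pcg       15   0 33200 17072 12508 S  0.0  4.1 konsole [kdeinit]
pcg       15   0 32460 16124 12292 S  0.0  3.9 konsole [kdeinit]
pcg       15   0 32452 16116 12292 S  0.0  3.9 konsole [kdeinit]
"""


def private_kib(listing):
    """Return [(command, RES - SHR)] for each process line of a listing."""
    rows = []
    for line in listing.strip().splitlines():
        fields = line.split()
        if fields[0] == "USER":
            continue  # skip a header line if one was pasted in
        res, shr = int(fields[4]), int(fields[5])
        rows.append((fields[9], res - shr))
    return rows


if __name__ == "__main__":
    for (name, then), (_, now) in zip(private_kib(AT_STARTUP), private_kib(LATEST)):
        print(f"{name:8s} unshared resident: {then:6d} KiB at startup, {now:6d} KiB now")
```

The two listings are paired line by line, which assumes the processes appear in the same order in both.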
050904
Rather interesting observation from an interesting comparative test of various DVD rewritable media:
What we did witness and it seems to be the case with most of the media we tested, is that they all need a couple writes/erasures in order to "settle in" after which we had lower levels and fewer errors.
This is not a big problem, because rewritable discs should usefully be fully formatted and written over before use, in part to ensure they are good, in part to initialize them, even if DVD+RW can format incrementally. For DVD-RW formatting is pretty much essential, as by default they come unformatted and in incremental sequential mode, while for maximum convenience they should be in restricted overwrite mode, and as the word overwrite suggests, they must first be written to become fully randomly rewritable.
So what I do is to format fresh discs, for example with
dvd+rw-format -force=full /dev/hdg
and then fully write them over with noise data, that is data that has been encrypted with a random password, using a script like this:
#!/bin/bash

: "'stdin' encrypted with a random password"

case "$1" in '-h')
  echo 1>&2 "usage: $0 [ARGS...]"
  exit 1;;
esac

if test -x '/usr/bin/aespipe' -o -x '/usr/local/bin/aespipe'
then
  NOISE_KEY="/tmp/noise$$.key"
  trap "rm -f '$NOISE_KEY'" 0
  dd 2>/dev/null if=/dev/random bs=1 count=40 of="$NOISE_KEY"
  exec 3< "$NOISE_KEY"
  aespipe -e aes256 -p 3
else
  export MCRYPT_KEY MCRYPT_KEY_MODE
  MCRYPT_KEY="`dd 2>/dev/null if=/dev/random bs=1 count=32 \
	       | od -An -tx1 | tr -d '\012\040'`"
  MCRYPT_KEY_MODE='hex'
  exec mcrypt -b -a twofish ${1+"$@"}
fi

On the general subject of rewritable quality and reliability, rewritables are based on phase change recording layers, and these apparently decay with time, so they are not suitable for long term archiving.
I have also been writing some notes on how to use packet writing under Linux as recent DVD+RW and DVD-RW are particularly suitable for it.
As to this, very funny news about 16x DVD-RAM drives and media:
Another issue is that the new 16x DVD-RAM media do not support a high overwriting cycle, which means that the discs will perform the best before 10,000 overwrites (100,000 for the 1x, 2x, 3x media).
What is so funny is that DVD+RW (and also DVD-RW to some extent) already supports full random access operation which was one of the two main features distinguishing DVD-RAM, and its main remaining difference with DVD-RAM was that it supported only 1,000 rewrite cycles.
In other words, it may be that the new DVD-RAM standard is really just a rebadging of DVD+RW, with a somewhat higher rewrite cycle.
050903
I have been compressing and encrypting my backups with gzip -2 and mcrypt and it turned out that the latter was many times slower than aespipe. After reading a comprehensive article about time and compression tradeoffs for several compressors I decided to have a look and experimented with lzop, which has a good reputation for being particularly suitable to backups, being fast and still offering good compression.
Indeed, as a test compression on my Athlon XP 2000+ CPU of a tar stream of my 3.47GiB home dir (mostly text, but lots of photos too) shows:
lzop vs. gzip -2
            gzip -2  lzop
CPU user    372s     126s
CPU system  15s      22s
output      2.57GiB  2.69GiB
It looks like lzop is a clear winner here, even if there is a small, expected, increase in the size of the compressed output. The reduction in CPU time cost allows for higher output speed given the same CPU, as the CPU time spent compressing plus encrypting almost makes backups CPU bound here.
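The tradeoff can be made explicit with some arithmetic on the figures above: total CPU seconds, input processed per CPU second, and output size relative to the 3.47GiB input. A sketch in Python using only the numbers from the table; the names are made up for the illustration:

```python
INPUT_GIB = 3.47  # size of the tar stream of the home dir

RESULTS = {
    "gzip -2": {"user_s": 372, "system_s": 15, "output_gib": 2.57},
    "lzop": {"user_s": 126, "system_s": 22, "output_gib": 2.69},
}


def summarize(results, input_gib):
    """Derive total CPU time, MiB of input per CPU second and output ratio."""
    summary = {}
    for name, r in results.items():
        cpu = r["user_s"] + r["system_s"]
        summary[name] = {
            "cpu_s": cpu,
            "mib_per_cpu_s": input_gib * 1024 / cpu,
            "ratio": r["output_gib"] / input_gib,
        }
    return summary


if __name__ == "__main__":
    summary = summarize(RESULTS, INPUT_GIB)
    for name, s in summary.items():
        print(f"{name:8s} {s['cpu_s']:4d}s CPU  "
              f"{s['mib_per_cpu_s']:5.1f} MiB/CPU-s  output {s['ratio']:.0%} of input")
    speedup = summary["gzip -2"]["cpu_s"] / summary["lzop"]["cpu_s"]
    print(f"lzop needs {speedup:.1f} times less CPU time than gzip -2")
```

So gzip -2 costs 387 CPU seconds against 148 for lzop, roughly 2.6 times as much, for an output that is under 5% smaller.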
050902
I am always impressed by the power of the Aptitude dependency manager (despite its numerous warts and limitations), in particular as to fine control and especially its filtering. As I still (why?) use Debian, tracking packages in its various editions, I am trying to avoid for now getting on with the several ABI changes being introduced. Aptitude allows me to exclude from display all packages that depend on either libc6-2.3.5 or packages that depend on it, with this filter expression:
~D(libc6~V2\.3\.5|!~D(libc6~V2\.3\.5))
which is slow but fairly impressive.
050901
Extremely peculiar statement by a developer who knows his stuff about the relationship among the USB chipset drivers in the Linux 2.6 series kernels:
Use uhci_hcd or ehci_hcd, but never both at the same time. ehci_hcd will work with all lo-speed ports, so uhci_hcd is then no needed.
Ahhh, that's interesting, and poses a somewhat irritating conundrum, at least for me. I have compiled all three of the USB HCD drivers (UHCI, OHCI, EHCI) in my kernel binary, not even as modules, to get universal support without the need for an initrd. Will the EHCI driver support a pure UHCI chipset? Time will tell.
On a completely different note, a startup by Erik Troan and others will use a new package repository system called Conary.

August 2005

050831
Well, while playing around with lsof I noticed that Konqueror had opened and mapped the Arial Unicode MS TrueType font, which is about 24MiB large and has lots of glyphs.
In itself this is not tragic; since this is now done with mmap(2) if many processes open and map the same font the font will be read into memory only once, just like with a font server. But when a font is read into memory it also involves per process data, and with a font server this only happens once.
And on the subject of memory consumption, a very revealing and fascinating discussion in a RedHat issue entry; the revelation is that in order to minimize locking among threads, by default the GNU LIBC allocator creates not only a largish stack region per thread, but also a largish 1MiB initial heap arena per thread, which also needs to be naturally aligned. This results in a fair degree of address space fragmentation, and probably also in some data space wastage.
Among the various fascinating details, this one amazes me:
> Alloced memory amount grew to ~2266 Mb (from 1433 Mb
> before) but allocation speed dropped significantly
> (several times).

Were you building the i686 glibc?  I.e. rpmbuild --target
i686 -ba -v glibc.spec?
which seems to indicate that when compiling for the 686 architecture some instructions give a large speedup; my guess is that it is about cheap locking.
050830
Well, I have been investigating a bit the gamin server on which KDE depends for getting notified of filesystem changes (mostly, but not only, to refresh the lists of files when a directory is open in Konqueror).
I got interested in it as after backing up to an external FireWire/IEEE1394 hard drive I could not remove the sbp2 module as it was in use, but no other module was using it. So I started suspecting various monitoring facilities, and looked at gamin because it has a bit of a reputation for keeping things open. So I discovered that it cannot be disabled or removed, but one can configure it to choose polling or kernel notifications (polling is slower but safer) by path, or to choose the same or to disable it entirely for some filesystem types.
On a totally different note, just discovered a really fascinating set of notes by a Sun engineer who attended the OLS 2004 scalability session both as to the numbers for Linux 2.6 and as to the worries he has about Solaris scalability. Random quote:
stop using cmpxchg on multiproc systems, doesn't scale.. 10x slower on 4p, 100x slower on 32p.
050827
Rather baffling discovery while prefilling media like swap partitions or DVD-RW with noise (pseudorandom stream encrypted with a random key, which seems to be rather noise like): it turns out that AES (Rijndael-128) encryption using mcrypt is many times slower than using aespipe which uses the excellent AES implementation by Dr. Brian Gladman.
050826
After listening to the Kamaelia talk at the FAVE about the BBC R&D developing a massive, massively parallel streaming system, little surprise to read that the BBC plans to broadcast programmes onto IP streams as well as over the EM spectrum. I wonder whether this means that UK internet users (to which the webcasts will be restricted) will be required to pay some kind of license fee (I don't have a television as I don't have much time and/or enjoyment watching it).
050825
Ha, another masterful example of programming: when I delete or move a bookmark in the KDE Konqueror bookmark editor, that takes around 4 seconds of CPU time on an Athlon XP 2000+. Why? Well, that's obvious: because they can!
More in detail, I have a few thousand bookmarks, and it looks like operations on a bookmark have a cost proportional to the number of bookmarks, as if the whole bookmark tree were re-laid-out on every update. And why not, if the goal is to look good on a demo with a dozen bookmarks?
050825
Considering my previous rants about memory misuse in many programs, I was rather amused by this article:
A Poltergeist in My Plasma TV
LG's $5,000 set worked for a month. Then things got weird as the unit developed a disturbing "memory leak"
050823
The lead programmer for the OpenBSD project has decided to deal with long standing memory misuse patterns via better allocation in both the kernel and the C library. He well realizes that this will break backwards compatibility with many buggy programs:
we expect that our malloc will find more bugs in software, and this might hurt our user community in the short term. We know that what this new malloc is doing is perfectly legal, but that realistically some open source software is of such low quality that it is just not ready for these things to happen
I admire his determination, but it is quite brave. Also, while some open source software is of such low quality, some proprietary software is even worse, and MS Windows allegedly contains deliberate support for bugs without which many important proprietary packages no longer seem to work properly.
050822
More grief and upset because of the merry t*ssers inflicting their incredible astuteness on the poor X server.
My issue as before is that I would like to specify X fonts in XLFD format with a screen DPI independent format, for example
-adobe-courier-medium-r-*-*-*-100-0-0-*-*-iso8859-*
that is 10 points (100 decipoints) with the horizontal and vertical DPI defaulting to those returned by the X server.
Note that specifying those DPI as * instead of 0 has completely different semantics, and selects the first (more or less random) DPI actually available. Well, in the file dix/dixfonts.c, procedure GetClientResolutions, there is this marvelous bit of code:
/*
 * XXX - we'll want this as long as bitmap instances are prevalent
 * so that we can match them from scalable fonts
 */
if (res.x_resolution < 88)
    res.x_resolution = 75;
else
    res.x_resolution = 100;
which forces (clumsily) the DPI to be either 75 or 100, regardless of the actual screen DPI, and whether there are fonts for the actual screen DPI available. Now virtually all 15" 1024x768 LCD monitors are 85DPI, and I have created a nice font.alias that does define 85DPI versions of all the bitmap (PCF) fonts with the proper size (roughly). But these get ignored, and I get the 75DPI bitmap fonts instead.
The code above looks like an attempt to violate the X font specs to work around a common issue in the shoddiest possible way; the common issue is that if there is no font at the proper DPI, the X server should return no font available, or scale a bitmap font forcibly to the right DPI, which causes trouble or ugliness. Since the X server computes the DPI from the declared screen size in millimetres and pixels, unusual DPIs can easily result.
Well, yes but the correct procedure is to select the font with the nearest DPI, and if nothing near be available, rescale, and if that is not possible, fail, not to preempt shoddily any availability of fonts (or font names) with the right DPI that might be present.
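To make the difference concrete, here is a sketch in Python (not the server's code) that computes the DPI of a 15" 1024x768 panel, applies the 75-or-100 forcing quoted above, and compares it with choosing the nearest available DPI. The 10% tolerance for what counts as near, and the function names, are my own assumptions for the illustration.

```python
MM_PER_INCH = 25.4


def screen_dpi(pixels, millimetres):
    """DPI along one axis from the declared size in pixels and millimetres."""
    return pixels * MM_PER_INCH / millimetres


def forced_dpi(dpi):
    """Mirror of the quoted server logic: only 75 or 100 can come out."""
    return 75 if dpi < 88 else 100


def nearest_dpi(dpi, available, tolerance=0.1):
    """Pick the available DPI nearest to the screen's, or None if none is near.

    A None result means the caller would have to rescale a font or fail.
    The tolerance is a hypothetical parameter, not taken from any spec.
    """
    if not available:
        return None
    best = min(available, key=lambda candidate: abs(candidate - dpi))
    if abs(best - dpi) > tolerance * dpi:
        return None
    return best


if __name__ == "__main__":
    # A 15" 4:3 panel is 12" wide, that is 304.8mm, for 1024 pixels.
    dpi = screen_dpi(1024, 304.8)
    print(f"screen DPI      : {dpi:.1f}")
    print(f"forced by server: {forced_dpi(dpi)}")
    print(f"nearest of 75, 85, 100: {nearest_dpi(dpi, [75, 85, 100])}")
```

With 85DPI aliases installed the nearest match is 85, while the forcing yields 75 whatever is installed.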
I had fondly but erroneously believed that since the freetype font module can handle bitmap (PCF) fonts too it might be used to work around this, but this is not possible because the bitmap module is loaded not just forcibly but it is also loaded first (not last, as a default), and preempts any subsequent font modules from handling bitmap fonts, and not vice versa; in any case the forcing of the DPI is in the server itself, not in one of the font modules.
050821
I have been looking at my complicated (don't try this at home kids!) Debian APT setup, which is complicated because it references several repositories from different distributions (Ubuntu, Debian, Libranet, ...) and from different editions (Debian Sarge, Etch and Sid, Ubuntu Hoary and Breezy, ...).
The /etc/apt/sources.list was changed to reference Debian editions not by level (unstable, testing, stable plus experimental which is a state more than a level) but by name, as I finally decided, after many months of indecision, that tracking specific editions is a lot safer than tracking a state (which must be why Ubuntu does use only the edition names); especially when, as now, the testing and unstable levels are very incomplete and inconsistent just after the release of a new edition of Debian, and as major ABI transitions are in flux.
Therefore I had another look at my APT configuration and pinning preferences which I revised. I also added some entries to assign priorities to packages that contain in their version string an edition or distribution name (something not that clean, but it happens a lot), and I got a stupid warning, motivated by this inane observation:
You can't use 'Package: *' and 'Pin: version' together, it's nonsensical.
Well, in theory, the versions of two random packages are not related; but since people (including the official Debian distribution maintainers) encode edition and distribution names in the version string, in practice it can make a lot of sense, as for example in:
Package: *
Pin: release o=Debian,a=sarge
Pin-Priority: 990

Package: *
Pin: version *sarge*
Pin-Priority: 990
which is too bad.
050820
Quite interesting presentations at the Bristol Linux media FAVE. I am writing most of these notes as the event progresses, thanks to a very convenient WiFi cell set up by Bristol Wireless who have also set up 20 public access workstations on the same wireless cells using the LTSP package.
CSound
Quite impressive free music synthesis system. It is quite ancient, but it is still being improved. It has two programming languages, one high level, one low level, and astonishingly musicians seem to be not bothered by the latter, which is assembler-level.
Kamaelia
The issue is massive multimedia streaming, with a 10 year horizon.
  • BBC currently delivers 10-50,000 concurrent streams, 1-6 million streams served per day.
  • With RealAudio they have to pay per stream licensing costs, and 1,000 times growth anticipated.
  • Mass distribution: P2P and multicast. P2P starts slow, but multicast can be used to seed it much more quickly.
  • CPUs are not growing in speed as fast as storage and network speeds.
  • New codecs (like Dirac) needed for growth.
  • Concurrency is the future, and it is easy.
  • BBC is a broadcaster, more precisely a program maker, however we have to develop technology. Making technology available helps with feedback and standards.
  • Real has been used only because it was most scalable and platform independent at the time it was chosen, even if proprietary.
  • Kamaelia is componentized, based on an architecture with scheduler and other infrastructure, and then data flow. It has a plugin framework, and a nice GUI frontend for the pipeline/graphline.
  • Concurrency can be tamed with read only stuff or single reader/single writer pipeline/graphline.
  • All written in Python, with growing C/C++ bits.
  • It is not very mature yet, so for production you may want to look into Fluendo's Flumotion which is based on the Twisted framework and GStreamer.
  • It has been released as free software to expose it to actual usage and garner feedback.
The Rosegarden sequencer and Studio To Go!
Nice tutorial on how to use these MIDI tools.
Access Space
Access Space is an open lab for creative digital activities in Sheffield. It has been set up for creative artists (a euphemism for unemployed people, the speaker said) and they have both availability of free software based technology and courses and seminars.
Studio To Go! is a distribution packaged with smoothly working GNU/Linux audio applications, a bit like aGNUla's DeMuDi; a couple of users present said that they preferred it even to DeMuDi.
A presentation on open source film making
I started to be a bit busy with user enquiries, so I was not following that much. From what I could glimpse the two themes were that free software is one of the enabling technologies for a lot of grassroots film making. I was reminded of the StarWars Revelations movie, which is a pretty good example.
After the talks there were a few sessions of GNU/Linux aided music, including some (angry-sounding) singing by RachelAPP (formerly Natalie Masse, described here) and some pleasant synthetic music later on. But I was very distracted with handing out installation or live CDs, and helping people with various GNU/Linux and sound problems, which went on for a while, so I left the event much later than I had expected, but it was fairly worthwhile.
050816
Curious moves in the distribution world: just as RedHat decided some time ago to create a Fedora Foundation (but they haven't actually done so yet), NOVELL have decided to create an OpenSUSE project (and organization?) to mimic Fedora. The perplexity is still about how seriously either corporate is prepared to let go, considering the ancient Mandrake fork of RedHat, which after all barely dented, if at all, RedHat's market position.
050815
From a report about the Ottawa Linux Symposium 2005, some moderately startling observations about the relative popularity of Novell's SUSE and RedHat, where Novell seems to be increasing in popularity and RedHat becoming a lot less visible. This may be due to Novell having a much bigger distribution channel, from the times when they seemed to be the networking suite quasi monopolists.
050814
While looking at linking technologies for a friend, in part because of interface linking hell, interesting discovery of a new static library idea that claims to have most of the advantages of shared libraries.

July 2005

050730
I was looking at the issues in Debian with C++ package (in particular KDE) ABI transitions and feeling disconsolate, for example because:
  • Since C (and C++) were designed for static libraries, they have extremely poor (or nonexistent) primitives for source and binary interface handling. Since the UNIX style toolchains prevalent today were also designed for static libraries they also don't offer much to handle the issues created by shared libraries.
  • In particular for Debian, since both its DPKG package manager and (in particular) its APT dependency manager not only do not handle the notion of dependency on an ABI as a consequence of the previous point, but also do not handle having installed multiple versions of the same package, which means that different ABI versions of a library must be packaged with different names.
050723
Discovered Ulrich Drepper's blog because he is at RedHat and he works intensely on the GNU LIBC and in particular on the dynamic linker; as a result he also has to deal with dynamic linking mispractices and then realize that not many people care, except for some other nice (mostly KDE) people.
050715
Thanks to this blog entry I have discovered Launchpad, a site which combines something of Freshmeat.net and SourceForge.org plus some more features, and a special twist: packages are being uploaded to its version control system with the various patches applied by various distributions.
This is a very good idea (it would be nice even if only for the Linux kernel), and I sort of share the blogger's reservations. Agreed, those that take the initiative often get away with some degree of control or even just influence, and while the project sponsor and owner, Canonical surely has done a lot of good to free software, it is a rather opaque private corporation; contrast with the Free Software Foundation or Software in the Public Interest.
But frankly it is a bit too late to worry; yes, there are the examples of Google and Canonical, or the lesser ones of SourceForge.org or FreshMeat.net, but I regard as rather more dangerous that big corporates are buying indirect control of free software by purchasing and funding more than 90% of the top kernel developers, most of the GNOME developers, and a significant number of the KDE ones. Canonical themselves have purchased quite a few Debian developers.
This creates dependency not only directly, but indirectly; the job market for programmers is pretty dire in many countries, and I doubt that many free software developers have failed to notice that if they manage to get a slam dunk in, like becoming top committers for large free software projects, in one way or another (not necessarily a nice way), they stand a good chance of being purchased for a fancy salary by a safe, friendly big corporate.
I reckon this has already started to drive some behaviour (consider some people's ever expanding and brutal land grabs in the kernel), such as job-security-through-obscurity (free software has never suffered much from documentation, but when documentation allows your employer to replace you at will with another unemployed free software developer, many people with lower ethical standards would think that providing good documentation can be suicidal), and gratuitous patches and rewritings.
This kind of stuff may be far more damaging, as similar incentives have caused warped behaviour in the NBA.
050715
The discussion about the software clock tick rate continues, and it looks like the power impact is indeed significant. In the meantime I have made my own little patch.
Just found an utterly fascinating article on the first programmed graphical computer game, on the first personal computer, that is SpaceWar (aka Asteroids) on the PDP-1 and then on the Xerox PARC workstations.
It is a scan of a RollingStone article from December 7th, 1972 and I also have a photo of the original PDP-1 display with SpaceWar.
050710
Quite funny and interesting debate reported in kernelTrap.org about the Linux kernel software clock interrupt frequency, which is 100Hz in 2.4 and 1000Hz in 2.6, and it turns out that the latter causes rather faster battery depletion in laptops, which once it has been pointed out seems pretty obvious.
050701
I have been looking (again) at the prelinking story (part of the general application startup time story) and it surely sounds attractive (even if someone found virtually no benefit). The reason is that as reported by someone at the GNOME memory reduction project about a GNOME Hello world! program:
That simple program uses 73 shared libraries that allocate a total of 13Mb of non-shared static data.
In other words in shared objects there are substantial impure areas, which are usually about relocation and links to other libraries.
Since these impure areas are mapped copy-on-write, they only become unshared if modified, and prelinking is a way to raise the odds they don't get modified.
This is achieved by pre-modifying them in the shared object itself to values that are likely to be valid; and since the values are mostly relocation entries and inter library links, this is achieved by assigning each library a possibly unique default base address, and setting the relocation and inter library link entries accordingly.
This is so obvious that it is a bit astonishing that it is not done by default by the linker; after all with the previous a.out shared library system the impure problem did not much exist because each shared object had to be linked at a unique, statically defined, fixed (not merely default) base address, and instead of going all the way to having no default base address one could have just made the default a hint changeable at runtime.
However, I have decided in the end against using prelinking, for two reasons:
  • Prelinking by the user means that all shared objects and binaries involved get modified, which means that all the checksums held by the package manager become invalid.
    Since I reckon it is pretty valuable to verify that executables and libraries have not been modified after installation, this is a major difficulty.
    As to this, Gentoo handles this difficulty:
    Current versions of Portage can handle, via prelink, the changing MD5sums and mtimes of the binaries.
    but this is easier because of the source-based nature of Portage.
  • Prelinking actually would not help me a lot, because I use mostly KDE and its developers have implemented a completely different workaround to the same problem, which works well enough, even if it is less elegant.
    The workaround is to have most KDE apps as parts (shared objects) mapped into processes forked off running the kdeinit service, which already has mapped most of the KDE libraries. This means that the child process inherits the already mapped in and prelinked shared libraries of the parent. Thus in effect prelinking is done at the start of every session instead of statically, but that is good enough, especially as few systems run multiple sessions at a time.
  • Some distributions have started using an additional technique to reduce dynamic linking cost, which is to mark entries internal to a shared object as not globally visible using the new -fvisibility option to GNU CC, which especially helps with C++ code.
In the end I agree with Léon Bottou and John Ryland that prelinking or some variant ought to be the default, in the sense that shared objects and binaries should have a default nonzero base address, and lazy external symbol resolution; what would be needed in addition is a mechanism for registering base addresses that are likely to be nonclashing, which could be done without too much trouble with a simple utility, possibly an extension of ldconfig.
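A minimal sketch of what such a registration utility might compute, assuming each library reports only the size of its mapping. The function name assign_base_addresses, the starting address, the page size and the library names and sizes are all hypothetical choices for illustration; this is not what prelink or ldconfig actually do.

```python
PAGE_SIZE = 4096  # assumed page size


def assign_base_addresses(sizes, start=0x40000000, page_size=PAGE_SIZE):
    """Give each library a page-aligned, non-overlapping default base address.

    `sizes` maps a library name to the size in bytes of its mapping.
    Libraries are laid out in sorted name order so the result is repeatable.
    """
    bases = {}
    next_free = start
    for name in sorted(sizes):
        bases[name] = next_free
        # Round the size up to a whole number of pages before moving on.
        pages = (sizes[name] + page_size - 1) // page_size
        next_free += pages * page_size
    return bases


# Hypothetical libraries and sizes, only to show the layout.
libs = {"libqt.so": 7_000_000, "libkdecore.so": 2_500_000, "libc.so": 1_300_000}
for name, base in assign_base_addresses(libs).items():
    print(f"{name:16s} 0x{base:08x}")
```

Since the addresses are only defaults, a clash at load time would still be handled by ordinary relocation; the registry merely makes clashes unlikely.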
Ulrich Drepper has written a rather comprehensive paper/HOWTO on GNU/Linux ELF dynamic linking.
As a general comment the high overheads imposed by dynamic linking and horrors like client side fonts invalidate one of the fundamental assumptions in UNIX-like systems, that process creation is cheap. The results are mechanisms like kdeinit which remind me of the discussion of new_proc in Multics, which famously had expensive process creation.

June 2005

050630
My ALSA latency notes recommend setting the PCI latency timer to a uniform lowish value like 32 cycles, and I have been told by many that this often fixes sound quality issues when there are PCI/AGP cards, like many video cards, that hog the bus by default.
But I have just been surprised to discover that in particular setting the host bridge latency timer to any non zero value reduces the effective transfer rate of the IDE disk subsystem (as detected with hdparm -t), for example from about 55MB/s to about 33MB/s for a value of 32 cycles.
This is somewhat surprising, because the latency timer is an upper limit, and if the device does not need to hog the bus for that long it is supposed not to. But it looks instead as if at least the host bridge in my chipset uses it as a lower limit.
050629
I have been trying out the Xen hypervisor for virtual partitions under Fedora 4 and I have noticed that some people have had difficulty figuring out how to boot it under GRUB, which is indeed not totally obvious, and here is an example (with the bleeding edge experimental kernels for Fedora 5):
title           Xen Fedora 5
kernel          (hd1,5)/boot/xen.gz dom0_mem=120000
module          (hd1,5)/boot/vmlinuz-2.6.12-1.1400_FC5xen0 ro \
                  vga=ext reboot=warm pci=biosirq \
                  elevator=anticipatory root=/dev/hdb6
module          (hd1,5)/boot/initrd-2.6.12-1.1400_FC5xen0.img
So far Xen works pretty well and is quite fast, even if I have had some lockups. I suspect the bleeding edge Fedora kernel; also the Fedora firewall seems to have some issues. It seems quite practical to just run always under Xen, to enjoy for example fast save and restore to disk.
050628
The dumbness of sysfs and hotplug can only get worse: the zd1201 driver requires that the firmware for the peripheral it supports be loaded, and what now happens is that it creates on loading a sysfs entry to enable the loading, and after a short timeout this entry is removed. This means that just about the only way to load the firmware is to have hotplug enabled.
Note that there is no problem with just leaving the firmware loading entries around permanently until the firmware is loaded, for example manually. Just more job security via complexity for the maintainers of hotplug and sysfs.
050627
Just noticed that the ZyDAS 1201 driver (zd1201) is incorporated in the mainline kernel as of release 2.6.12 so that the patch mentioned previously is no longer necessary.
050623
More client side font pain as I discovered while having a look at Fedora 4 that under KDE (and GNOME too I guess):
  • It is not possible to access the X font system, only the evil Fontconfig/Xft2 one (this probably can be fixed only by recompiling).
  • The bitmap fonts are only available in the X font system, not the Fontconfig/Xft2 one (this can be fixed by adding a few lines to /etc/fonts/local.conf).
050618
Got a pointer to an interview with Theo de Raadt comparing Linux with OpenBSD as to code quality. I agree mostly that OpenBSD code quality is broadly higher, both in the kernel and in many userspace utilities. But the point with Linux is that it is GPL'ed and has lots more drivers.
Linus seems well aware that large portions of Linux are messy but that's just one of the issues, however painful it is to me.
050617
I have been looking into making bootable CDs or DVDs with Linux, and it is now possible with recent versions of GNU GRUB to use it to boot directly, instead of using a floppy image or SYSLINUX (more precisely ISOLINUX). A simple example of a GRUB menu for booting from a hard disk partition:
title		hda1
root		(hd0,0)
kernel		(hd0,0)/boot/bzimage root=/dev/hda1

title		C:
rootnoverify	(hd0,2)
chainloader	(hd0,2)+1
and a mkisofs line to create a filesystem image for booting GRUB:
mkisofs -no-emul-boot \
  -boot-info-table -c boot/boot.catalog \
  -boot-load-size 32 -b boot/grub/iso9660_stage1_5 \
  -r -J -l -o /tmp/grubboot.iso /tmp/grubboot/
If the bootloader one uses does not support boot choices/menus it is possible to construct a CD image with multiple boot image choices using the -eltorito-alt-boot option to mkisofs, but this requires multiple boot choice handling in the BIOS, and this is not implemented in a significant number of cases.
I have also found some quite interesting Microsoft Windows based utilities for those interested in such practices.
050616
Just spotted an amusing if slightly confused article which suggests that splitting a 512MB swap partition into four improved swapping performance, as long as they have the same priority.
This is highly surprising, because four swap partitions on different discs should give better performance, but not on the same disc.
But a note says that the swap space handler in the Linux kernel at least in 2.4.x is poorly implemented, and uses a linear scan to find unused swap space, and splitting the swap partition is a workaround to that.
Or perhaps having four partitions with the same priority gets the kernel to issue four times as many IO requests for swapping, allowing better disc scheduling.
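To illustrate the first explanation, here is a toy model of a linear scan for a free swap slot. It is not the kernel's code: find_free_slot and probes_when_nearly_full are names invented here, and the nearly full swap area is an assumed worst case, chosen only to show how the scan length shrinks when one area is split into four.

```python
def find_free_slot(swap_map):
    """Linear scan: return (index of first free slot, slots examined)."""
    for index, in_use in enumerate(swap_map):
        if not in_use:
            return index, index + 1
    return None, len(swap_map)


def probes_when_nearly_full(total_slots, areas):
    """Slots examined to find the single free slot at the end of one area.

    The total swap space is split evenly into `areas` areas; only the last
    slot of the scanned area is free, which is the worst case for one scan.
    """
    per_area = total_slots // areas
    swap_map = [True] * (per_area - 1) + [False]
    _, examined = find_free_slot(swap_map)
    return examined


# 512MB of 4KiB pages is 131072 slots.
slots = 512 * 1024 * 1024 // 4096
print(probes_when_nearly_full(slots, 1))  # one 512MB area
print(probes_when_nearly_full(slots, 4))  # four 128MB areas
```

In this model the worst case scan is four times shorter with four areas, which would be consistent with the note, though it says nothing about the second explanation.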
050615
Well, Fedora 4 has been released and I had a look, because I am still looking for something that does not have Debian's awesomely irritating packaging suckages. Unfortunately it turns out that dumb n00b packaging ideas are contagious: Debian's comical practice of calling a binary package with a base name different from that of its source package, by adding a leading lib in front of the name if it is a runtime library package, has appeared in Fedora too, after having been dumbly adopted by Mandriva.
For example the source for the termcap library is called termcap-2.0.8.tar.bz2 and is packaged as libtermcap-2.0.8-39.src.rpm. Looking inside the source package one also finds some silly inconsistency in naming, with patches having different base names:
$ rpm -qlp libtermcap-2.0.8-39.src.rpm
libtermcap-2.0.8-ia64.patch
libtermcap-aaargh.patch
libtermcap.spec
termcap-116934.patch
termcap-2.0.8-bufsize.patch
termcap-2.0.8-colon.patch
termcap-2.0.8-compat21.patch
termcap-2.0.8-fix-tc.patch
termcap-2.0.8-glibc22.patch
termcap-2.0.8-ignore-p.patch
termcap-2.0.8-instnoroot.patch
termcap-2.0.8-setuid.patch
termcap-2.0.8-shared.patch
termcap-2.0.8-xref.patch
termcap-2.0.8.tar.bz2
termcap-buffer.patch

This is not the sole example, but of course for the sake of confusion some packages that contain mostly libraries are packaged without the annoying spurious lib prefix, for example ncurses-5.4.tar.bz2 is packaged quite properly:
$ rpm -qlp ncurses-5.4-17.src.rpm
ncurses-5.4-20041218.patch
ncurses-5.4-20041225.patch
ncurses-5.4-20050101.patch
ncurses-5.4-20050108.patch
ncurses-5.4-20050115.patch
ncurses-5.4-20050122.patch
ncurses-5.4-xterm-kbs.patch
ncurses-5.4.tar.bz2
ncurses-linux
ncurses-linux-m
ncurses-resetall.sh
ncurses.spec
patch-5.4-20041211.sh
still note the marvelous idea of having one of the patches called patch.
This silly, contagious mispackaging idea has two bad consequences. One is that it breaks simple, direct mappings from original package names to RPM package names, and with them sorting consistency. The more important one is that, since it is gratuitous (in the sense of having no obvious advantages), has significant disadvantages, and is also plain ugly, it demonstrates a lack of what Linus Torvalds calls good taste.
Now good taste in packaging or programming is very important because it guides people to do things that suck less, and that matters greatly in the complicated issues that arise from packaging. In other words it puts the judgement of a packaging team into serious question; if they get simple things like issues of good taste wrong, one wonders what else is wrong.
For example once upon a time RedHat switched from putting the internal components of INN2 in the various traditional package specific directories, as for example in /usr/lib/news/bin/procbatch, to merging all those internal components into the public base directories, like /usr/bin/procbatch. I then submitted a bug report, and was told to lump it. Cool, and roughly on the same day I stopped using RedHat, because if they had people who could gleefully do that and persist obviously things were going downhill.
Overall my impression is that of the major distributions the least badly packaged is still SUSE, but it has always been politically incorrect (playing the 99.9% free but with proprietary bits as limpet mines game that RedHat have later finessed to a truly clever degree with proprietary _trademarked_ names logos and icons). Then Mandriva at one point switched to the stupid lib prefix naming convention, and at that demonstration of loss of good taste I just switched to Debian. After all if one is prepared to tolerate dumb bad taste, let's get it direct from the source, which at least is politically correct.
Unfortunately Debian policy mandates other dumb examples of bad taste, like the use of DPKG (who ever thought that ar archives containing tar.gz archives was a good package format? Never mind the other huge problems), the starting of daemons on package upgrade even if they are disabled in the init runlevel config, and the social problems that cause very infrequent releases, and so on.
I have been attracted to Fedora at this point because for those who want to try the latest and greatest and are prepared to cope with the resulting issues it seems pretty OK. Wait a moment, now that I think of it, does Fedora still have the same trademarked names and icons issues as RedHat? It looks like it has fewer of them, but the resulting FUD effect is still rather unpleasant. There is some hope in the move by RedHat to endow a separate entity, the Fedora Foundation, but the endowment is described as:
Red Hat will create the Fedora Foundation with the intent of moving Fedora project development work and copyright ownership of contributed code to the Foundation.
with no mention of trademarks, a beautiful example of corporate cleverness. Copyrights matter a lot less because those are GPL'ed anyhow, but there is no GPL for trademarks or GPL equivalent permissions for RedHat's trademarks.
050612
More memory horrors... Konqueror has grown from around 50MB to around 250MB (about 70-80% memory resident) so I tried to quit it and restart it. However engagingly it did not terminate, because I had specified in the misnamed Performance preferences to preload a copy in the background. My naive expectation was that on quitting it would because of that restart it. Fat chance.
Also, how a Konqueror process where all pages have been closed can still have almost 200MB resident when on creation it has only got around 35MB baffles me. I suspect grave mistakes in the caching logic. The curious thing is that even when running Konqueror under gc that memory is not collected, so it is not quite a leak.
050611
Returning from a meeting of the pretty good Greater London Linux User Group where there were four quite interesting talks. Of the four talks, I was personally most interested in the one on the SVK source control system by Chia Liang Kao and the one on the Xen hypervisor for virtual partitions, because I often have to deal with document/source archives and I'd like to run different Linuxes in different virtual partitions.
The SVK presentation made a convincing case for looking seriously at it, as it is based on the reuse of existing known working code and ideas from other projects, and was designed to be quite compatible with some popular existing version control systems, and to avoid some of their most hateful limitations.
Among the more interesting aspects of SVK is that as of now it is purely a client side system, which relies for now on, and is therefore compatible with, the subversion server for shared version storage.
The Xen talk was particularly enjoyable because of the quality of the delivery by Dr. Ian Pratt.
Xen is a very interesting take on the virtualization idea. It is a kind of microkernel that creates a set of virtual machines whose architecture is not identical to the underlying hardware architecture, but is much simpler so it can be simulated much more efficiently. As such it is a paravirtualization system like UML or CoLinux but with a radically different architecture strongly reminiscent (and probably the most useful subset) of MERT (aka UNIX/RT) by Heinz Lycklama (see also this reference for the place of MERT in UNIX history), but considerably modernized and extended in one crucial way: the Xen hypervisor, by defining a simple abstract software virtual machine, can snapshot it incrementally, and either save the snapshots or transfer them to another machine running the Xen hypervisor, thus achieving dynamic process migration, which was the original motivation for the development of Xen in the Xenoserver project. Another operating system kernel famously based on a hypervisor architecture was Mach from CMU which eventually evolved into OSF/1 and the GNU Hurd. Even the Microsoft Windows NT HAL is a sort of hypervisor for the various operating system personalities (WIN32, OS/2, POSIX) that it supports.
The Xen architecture is very different from that of UML, and is based on booting a minimal hypervisor kernel which creates one or more virtual machines, into which a Linux (or FreeBSD, or other) kernel, ported to the Xen virtual machine architecture, is then booted.
The hypervisor itself has few or no drivers because of an astute ruse: one (or more) of the virtual machines (usually number 0) is given more or less direct access to the peripherals, and this acts as a server for the other virtual machines. So for example the disc driver runs in virtual machine 0, and all other virtual machines access the disc by sending requests to virtual machine 0. This architecture reminds me somewhat of IBM's MVS or VM clusters using real (MVS) or virtual (VM) CTCAs to communicate among individual systems.
050608
While helping someone with getting a good intensity calibration on his monitor(s) I (re)discovered that many people don't quite get the subtle issues related to color and intensity. Some very nice people have helped by writing interesting papers with illustrations, for example:
050607
There is an intensely ironic part of an interview about the recent OpenBSD 3.7 release which is:
ORN: A lot of companies have been using OpenSSH in their products (Sun Microsystems, Cisco, Apple, GNU/Linux vendors, etc.). Did they give anything back, like donations or hardware?
Henning Brauer: Nobody ever gave us anything back. A plethora of vendors ship OpenSSH --commercial Unix vendors (basically all of them), all of the Linux distributors, and lots of hardware vendors (like HP in their switches)-- but none of them seem to care; none of them ever gave us anything back. All of them should very well know that quality software doesn't "just happen," but needs some funding. Yet, they don't help at all.
It is so ironic because the OpenBSD project is committed to the BSD license, which does not require anything other than credit or, in its new edition, nothing at all, in the way of contributions from adopters. At least the GPL requires vendors to contribute back their improvements, and this has worked really well in a number of cases.
I have been particularly amused by this other remark further down:
ORN: This is the first release that includes X.Org. Why did you choose to import it instead of XFree86 4.5.0?
Matthieu Herrb: The primary reason is that the new revision 1.1 of the XFree86 license is less free than the old MIT license that had been used for years by XFree86. OpenBSD already avoided shipping the final XFree86 4.4 release that also uses the new license in 3.6. Then, as many other projects moved away from XFree86 because of the license, it became obvious that most new developments in the X window system now take place in X.Org. Having said that, projects like OpenBSD have to stay vigilant that X.Org doesn't turn into a Linux-only project (that would slowly slip to a GNU General Public License).
These people are worried by a slow switch to the GPL; so ironic too. Especially as I was once following a discussion on the Xorg IRC channels and one of the authors said he had quite a bit of code to do an improvement someone was requesting, but since the server was not GPL licensed, the code was proprietary and could not be shared. It used to be that the development of the X reference server code was mostly funded by major corporates, so they chose the licence that best served their embrace and extend competitive advantage strategies.

May 2005

050531
I have recently published sabishape and dokde.
050527
Still looking at memory usage horrors including the bizarre discovery that under Debian unstable, using the unofficial 3.4.0 packages, the KDE Konqueror browser grows to inordinate size (like more than 200MB) while under SUSE 9.2 the KDE 3.3.2 Konqueror stays at around 50MB.
So I decided to use the latest BD conservative garbage collector compiled in malloc-override mode (with build option --enable-redirect-malloc), then to set export GC_PRINT_STATS=1, and then to preload it with export LD_PRELOAD=/usr/local/lib/libgc.so.1.0.2, and it is such good fun watching what it snitches on something like Konqueror.
050525
Some to-remain-nameless culprit working on the Koha open source library automation program has been asking me how to improve the speed of their catalog searching program, which uses MySQL tables as inversion indexes, which is inappropriate in itself. Even worse, I have had a look at the schema and queries and code that does that and I have found it an appallingly awful farrago, and then found that MySQL has what looks to me a poorly written query planner. So I have supplied first some decently written SQL that works around MySQL's query planner limitations, and then suggested looking at proper text indexing systems, like Lucene.
Koha however is mostly based on Perl, not Java like Lucene, so that someone had a look at Plucene, a translation of Lucene to Perl. Unfortunately this translation seems slow. How slow? Well, on a 2GHz CPU it takes over one hour to index 170,000 records totaling around 7MB, and the worst part of it is that it is solidly CPU bound.
Now this is quite extraordinary: over one hour of CPU time to index that amount of data requires skills previously unsuspected, as it is usually expected that indexing and searching be IO bound operations.
So in order to learn from such marvels I have had a look at the lower levels of the system, as I fervently hope that at least the overall hash table index strategy is not unusual.
By looking at the lower level datum and aggregate code I have learned so much! For example, this piece of low level code to read a variable length binary integer:
sub read_vint {
        my $b = ord CORE::getc($_[0]->[0]);
        my $i = $b & 0x7F;
        for (my $s = 7 ; ($b & 0x80) != 0 ; $s += 7) {
                $b = ord CORE::getc $_[0]->[0];
                $i |= ($b & 0x7F) << $s;
        }
        return $i;
}
brought me tears and screams because of its depth and daring, and never mind this other splendid example of Perl programming, whose style seems to be representative of much other code in Plucene:
sub doc {
        my ($self, $n) = @_;
        $self->{index}->seek($n * 8, 0);
        my $pos = $self->{index}->read_long;
        $self->{fields}->seek($pos, 0);
        my $doc = Plucene::Document->new();
        for (1 .. $self->{fields}->read_vint) {
                my $fi = $self->{field_infos}->{bynumber}->[ $self->{fields}->read_vint ];
                my $bits = $self->{fields}->read_byte;
                $doc->add(
                        bless {
                                name         => $fi->name,
                                string       => $self->{fields}->read_string,
                                is_stored    => 1,
                                is_indexed   => $fi->is_indexed,
                                is_tokenized => (($bits & 1) != 0)              # No, really
                                } => 'Plucene::Document::Field'
                );
        }
        return $doc;
}
Watch and learn! And these apparently are the literal translations into Perl from Java code of equivalent magnificence.
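For reference, the encoding handled by read_vint stores an integer seven bits at a time, least significant group first, with the top bit of each byte marking that more bytes follow. Here is a minimal sketch in Python (rather than Perl, and not taken from Plucene or Lucene) that decodes from an in-memory buffer instead of issuing one getc call per byte; encode_vint and decode_vint are names made up for this illustration.

```python
def encode_vint(value):
    """Encode a non-negative integer as a variable-length byte string."""
    if value < 0:
        raise ValueError("vint values must be non-negative")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte has the top bit clear
    return bytes(out)


def decode_vint(buf, pos=0):
    """Decode one vint from `buf` at `pos`; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


data = encode_vint(300) + encode_vint(5)
first, pos = decode_vint(data)
second, pos = decode_vint(data, pos)
print(first, second)
```

Returning the next position along with the value lets a caller walk a whole buffer of packed integers without any per-byte I/O calls.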
050523
I have been reading several articles about filesystems alternative to Ext3, in particular JFS and XFS. In a related article about Ext3 I have noticed this scary story:
Laptops...beware?
Ext3 has a stellar reputation for being a rock-solid filesystem, so I was surprised to learn that quite a few laptop users were having filesystem corruption problems when they switched to ext3. [ ... ] had nothing to do with ext3 itself, but were being caused by certain laptop hard drives.
The write cache
[ ... ] Unfortunately, certain laptop hard drives now on the market have the dubious feature of ignoring any official ATA request to flush their write cache to disk. This isn't a wonderful design feature, although it has been allowed by the ATA spec up until recently [ ... ]
However, it gets worse. Some modern laptop hard drives have an even nastier habit of throwing away their write cache whenever the system is rebooted or suspended. Obviously, if a hard drive has both of these problems, it's going to regularly corrupt data, and there's nothing that Linux can do to prevent it from doing so.
The same article has interesting details about the various modes of journaling that Ext3 offers, and in particular that data=journal can be very fast in some special cases (probably reading from the journal as it is writing to it). It also notes that making it flush the buffer cache more frequently prevents huge write storms.
050518
While reading an article in Linux Journal about the new RedHat EL 4 release process I was greatly amused by some points about their kernel release manager, as to how committed RedHat are to share their QA process with the kernel mainline, unlike other vendors:
During the past year, more than 4,100 patches from Red Hat employees were integrated into the upstream 2.6 kernel. In contrast other companies boast that their offering contains the most patches on top of the community kernel.
and
Upstream - doing all our development in an open community manner. We don't sit on our technology for competitive advantage, only to spring it on the world as late as possible.
These statements are commendable, and reflect some of my own thoughts but one can make some points:
  • The issue is not that other distributions do not publish their patches, because they do, and in a timely manner, thanks to the GPL, but that they do not actively submit them upstream, and since upstream is busy enough, upstream people do not go out of their way to scout for patches, but only deal with those that are actively submitted.
    I think this is a very important observation on the actual process of free software development, which is totally dominated by scratch my itch logic.
  • My perception is that RedHat used to be as reluctant to actively contribute the results of their QA to upstream as the others. It is very welcome that their official position is now different.
    However, I think that the kernels in their products are still very different from upstream kernels because of the number of feature patches they use to differentiate it from upstream. Yes, it is good that differentiation is not based on the number of semi-hoarded fix patches, but they are still semi-forking the kernel.
  • I think there is little competitive advantage to be had from being reluctant to contribute QA to the upstream kernel for Linux companies; competitors, unlike upstream, can always proactively fetch those patches, so not actively contributing them is rather futile.
    I reckon that semi-hoarding patches instead benefits the kernel maintainers by enhancing their job security. After all a policy of having a kernel that is kept quite different from the upstream kernel implies the need for a team to maintain it instead of relying on upstream and community maintenance.
    Most of RedHat's competitors have already done rounds of downsizing, in the middle of a big IT recession, and probably their remaining employees are very keen to make themselves as indispensable as possible (and think of Mandrake who have just acquired Conectiva, that is a team of distribution developers that cost one third per year and are as good).
    Job security through obscurity (whether of source or of process) is a winner in many cases, and every little helps, unfortunately :-(.
050513
As further proof of the frequently useless sensationalism of SlashDot there has been a big discussion about this apparent discovery that it is hard to guarantee something has actually been written to disc. This has been well known for years at least to those reading the comp.arch newsgroup. Things also are much subtler and more complex than is apparent from this late discovery.
To me the most interesting question is why if disk write caching is disabled then sequential write performance goes down typically by a factor of as much as ten (e.g. from 40MB/s to 4MB/s).
050512
After an article about the evolution of GNOME I have had a look at the GNOME memory reduction project which has comments like:
The plan is to reduce the amount of memory that Gnome applications consume. Gnome is barely usable on a machine with 128 MB of RAM; contrast this with Windows XP, which is very snappy on such a configuration.
Even more amusing is the question that comes next:
Why do you want to reduce memory consumption?
to which some answers are given. Unfortunately none seems convincing to me; I reckon that there is a vanishingly small and largely powerless constituency for fixing the many and horrifying memory wastages in most GNU/Linux applications, so the answers are merely pious hopes.
GNU/Linux projects are based on the volunteer side on the scratch my itch principle, and many volunteers by now just have very large PCs, with at least 1GB of memory.
Also many volunteers are now richly paid employees of big corporates instead; for example this article on BusinessWeek.com points out that:
Looking at the top 25 contributors to the Linux kernel today, you'll discover that more than 90% of them are on the corporate payroll full-time for companies such as HP (HPQ), IBM, Intel (INTC), Novell (NOVL), Oracle, Red Hat (RHAT) and Veritas (VRTS), among many others.
and obviously these big corporations are not limiting their highly paid employees with PCs having only 128MB RAM, and I would imagine that most of them have no interest whatsoever in wasting their expensive time minimizing memory consumption in the kernel or in applications. The employees themselves are now paid large enough salaries that buying more memory even for their personal system is simply no longer an issue for them.
To them high memory usage is not an itch that has to be scratched other than by buying more memory; adding transparent windows and other cool effects seem to be high priority itches, also perhaps because they help feature/demo driven career progression: try to imagine them wowing the manager who decides their raises and bonuses saying one of:
  • I spent the last month halving the memory used by these applications
  • I spent the last year adding to this application these snazzy transparency effects and ten more cool features as you can see in this demo
As I have remarked before some virtual console applications have memory footprints of 5 megabytes (optimistically: my current Konsole processes have resident set sizes of over 30 megabytes, of which only 5-8 are shared), that is more than one thousand (4KiB) pages, when all they do 99% of the time is to receive a string of a few characters and send a request to BLIT them on the screen.
Also, the hideous memory usage is a system problem; most applications, on their own, fit into 128MB systems, even virtual terminals that have 30MB memory footprints. It is when several are used together that laughably large memory footprints have an impact. But the author of each one has the easy defense that it is fine by itself, and in any case what really matters is the demo, not actual use in a loaded system.
As to the merits of what is causing enormous aggregate memory footprints, here are some of my guesses, corroborated by quite a bit of observation:
Benchmarks and lack of visibility
Most popular benchmarks are pure speed benchmarks and can be run on systems with enough memory that the whole application being benchmarked is memory resident. Also, most popular benchmarks are application benchmarks, and not system benchmarks, and that's another reason why the whole application can be memory resident.
There are also some intrinsic problems with (mostly nonexistent) system and memory benchmarks:
  • A memory benchmark for a system is intrinsically less repeatable, and thus a less impressive fact, than a speed benchmark for a single application.
  • Creating a realistic system and memory benchmark and running it takes a lot more effort than a synthetic or single application speed benchmark.
Large pages
I have seen very persuasive research evidence that for unoptimized programs the optimal page size is 256 bytes, and with page sizes above 1024 bytes the number of pages in the working set is roughly constant; in other words a 4096 byte page size means that other things being equal the working set of a program is four times larger in bytes than with a 1024 byte page size. Larger page sizes amplify considerably the impact of the other issues.
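The arithmetic behind that claim, as a small sketch; the figure of 300 pages is an arbitrary assumption used only to make the scaling visible, not a measurement.

```python
def working_set_bytes(pages, page_size):
    """Bytes kept resident if the working set is `pages` pages."""
    return pages * page_size


PAGES = 300  # assumed constant page count for page sizes of 1024 bytes or more
for page_size in (1024, 2048, 4096):
    kib = working_set_bytes(PAGES, page_size) // 1024
    print(f"{page_size:5d} byte pages: {kib} KiB resident")

ratio = working_set_bytes(PAGES, 4096) / working_set_bytes(PAGES, 1024)
print(ratio)
```

Whatever the page count, the ratio only depends on the page sizes, which is where the factor of four comes from.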
Use of shared libraries
Shared libraries mean that any use of a function on a page brings in the whole page. Bad news if functions in a shared library are not clustered with respect to expected dynamic usage. Usually functions are clustered in random order or in alphabetic order...
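A toy model of this effect: functions are laid out back to back in some order, and we count how many distinct pages hold the functions a program actually calls. The function names, their sizes and the set of hot functions are invented for the example.

```python
PAGE_SIZE = 4096


def pages_touched(layout, sizes, used, page_size=PAGE_SIZE):
    """Count distinct pages holding any byte of a function in `used`.

    `layout` is the order in which functions are placed in the library,
    `sizes` maps each function name to its size in bytes.
    """
    touched = set()
    offset = 0
    for name in layout:
        size = sizes[name]
        if name in used:
            first = offset // page_size
            last = (offset + size - 1) // page_size
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


# 26 hypothetical functions of 2048 bytes each; four of them are hot.
sizes = {f"{c}_func": 2048 for c in "abcdefghijklmnopqrstuvwxyz"}
hot = {"a_func", "h_func", "p_func", "z_func"}

alphabetic = sorted(sizes)
clustered = sorted(hot) + [n for n in alphabetic if n not in hot]

print(pages_touched(alphabetic, sizes, hot))  # hot functions scattered
print(pages_touched(clustered, sizes, hot))   # hot functions adjacent
```

With these made-up numbers the alphabetic layout touches four pages and the clustered one two; real libraries have far more functions and far more scattered usage.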
Use of many shared libraries with lots of impure data
Shared libraries contain a significant number of impure (writable) pages, typically containing either external (interlibrary) links or internal (intralibrary) links. Internal links contain mostly relocation entries.
When any internal or external link in a library is fixated, the entire page on which it resides must be duplicated for the process in which this happens. As this page says about a GNOME Hello world! program:
That simple program uses 73 shared libraries that allocate a total of 13Mb of non-shared static data.
This can be alleviated in the following ways:
  • Better library and language design.
  • Prerelocating libraries to a nominal address range, and prelinking them (KDE does a bit of this).
  • Forking processes that use the same set of libraries from a master one that has them already prerelocated and linked in (KDE does this too by default).
  • Generating PIC (position independent code), which eliminates most of the intralibrary links, usually at the price of higher register usage.
  • Merging libraries together.
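Why scattered links are so costly can be sketched by counting the distinct pages that get written to when links are fixed up. This is a minimal model, assuming 4096 byte pages and a hypothetical library with 64 link slots; the offsets are invented for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed, as on common x86 systems

def dirtied_pages(reloc_offsets, page_size=PAGE_SIZE):
    """Return the number of distinct pages touched by writes at the given offsets."""
    return len({offset // page_size for offset in reloc_offsets})

# Hypothetical library with 64 link slots of 4 bytes each.
scattered = [i * PAGE_SIZE + 16 for i in range(64)]  # one slot per page
clustered = [i * 4 for i in range(64)]               # all slots adjacent

# Bytes that become private (unshared) to the process in each layout.
print(dirtied_pages(scattered) * PAGE_SIZE)
print(dirtied_pages(clustered) * PAGE_SIZE)
```

The same 256 bytes of links cost one private page when clustered and 64 private pages when spread out, which is the effect that prelinking and better library layout try to reduce.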
Bad memory allocators
Many common memory allocators are designed to maximize memory usage and resident sets, for example one or more of:
  • Use of buddy-system logic, which tends to overallocate memory by about 30%. But it is fast, so benchmarks on infinite memory reference PCs are flattered.
  • Extensive list walks to find or free allocated blocks, so they touch a lot of pages.
  • No attempt to cluster block allocations.
  • Inability to coalesce freed blocks and to release back large chunks of memory to the operating system, which does not directly impact resident sets, but indirectly via larger virtual memory mapping tables.
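The overallocation of buddy-style allocators comes from rounding every request up to a fixed block size. Below is a minimal sketch, assuming the simplest binary buddy scheme (power of two block sizes) and a hypothetical minimum block of 16 bytes; the request sizes are invented for illustration, not measurements.

```python
def buddy_block_size(request: int, min_block: int = 16) -> int:
    """Smallest power-of-two multiple of min_block that fits the request."""
    size = min_block
    while size < request:
        size *= 2
    return size

def overallocation(requests):
    """Fraction of extra memory handed out beyond what was asked for."""
    asked = sum(requests)
    given = sum(buddy_block_size(r) for r in requests)
    return (given - asked) / asked

# Invented request sizes in bytes.
sample = [24, 100, 520, 1000, 3000]
print(f"{overallocation(sample):.0%} extra memory allocated")
```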
Careless program designs
Many programs embody deep and thin flow trees, where going from event to event resolution can involve dozens of nested procedures, each of which is often on a different page from the next.
A legendary case is the reference X server implementation: most of the time the X server just receives bitmaps to BLIT on the screen, so only one or two pages should be touched; but to get from reading the BLIT request to the BLIT code that satisfies it may involve a very deep and dispersed call sequence, and so very many pages are kept constantly referenced.
Some programs also involve maximization of data, not just code, memory resources. For example client side fonts imply that each X client needs to have a list and cache of rendered glyphs, which it constantly references, instead of having them in the graphics card's framebuffer memory, as it happens with server side fonts.
Also, a lot of cool graphics tricks require a lot of memory, for example to handle screen damage events (for example background pictures or transparency or non rectangular shapes).
050510
Interesting discovery on some details of font systems. The X11 bitmap font module (for PCF format bitmap fonts) is loaded by default and this is unfortunate because it has a misdesign in which it forces the DPI of fonts to be either 75 or 100 only. Omitting it from the list of modules to load is no good.
However I found that if the freetype font module (for PCF, TrueType and Type1 fonts and a few other types too) is specified before bitmap, it registers itself for PCF fonts and seems to preempt the bitmap module, which seems an acceptable workaround.
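As a sketch, the workaround amounts to a module ordering like the following in the X server configuration file. Only the two font modules are shown; that load order decides which module handles PCF fonts is just the behaviour observed above, not something I have seen documented.

```
Section "Module"
    Load "freetype"   # listed first, so it registers for PCF fonts
    Load "bitmap"
EndSection
```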
050503
I have discovered that some nice people have produced two fairly high quality repackagings in Type1 of the Computer Modern typefaces with fairly standard (not TeX) encodings (1, 2).
The on screen quality seems fairly good (particularly for Latin Modern), and probably better than that of the Bitstream Vera typefaces, which really require antialiasing to mask the lack of hinting.
The above only applies if the rasterization is left to the X11 freetype font module, and the type1 font module is not loaded. The freetype font module uses the FreeType library which seems to have a fairly good Type 1 rasterizer, one that does some decent autohinting (a side effect I guess of FreeType having to have an autohinter for TrueType fonts), while the original type1 module does not, and produces fairly crude low DPI bitmaps.
The font packages are the Latin Modern (there is a nice PDF paper describing them) and Computer Modern Unicode sets.

April 2005

050419
I have been wondering a bit why my external Firewire backup hard disk is so awkward to use, as in requiring such great care to find the exact sequence of actions that doesn't result in OOPSes or hangs, and here is the not so good news from a kernel developer:
April 08, 2005
Pete Zaitcev: Thomas in a cage with Firewire
http://thomas.apestaart.org/log/index.php?p=291
Yay. Thomas is about to find why Firewire is unsupported even on Fedora (let alone on RHEL).
But If someone asked me what Firewire needed, I would answer, "only a hacker with a brain". He may be able to pull it off, though I'm not too optimistic. Firewire is about as complex as USB and we all know how well that goes despite a sustained effort by Greg K-H, David-B, Stern and myself. My approach was to hide whenever Firewire came too close and wait for it to die in the marketplace (which, I suspect, is inevitable at this point).
Now, my external hard drive box also supports USB2, so I tried that. Bad news! It does not work at all with the usb-storage driver, presumably because it uses a chipset (ALi) that is designed to the MS USB driver interface, not the official one. It works, very slowly, with the slow device ub driver, but this anyhow only supports up to 138GB drives, and my drive has a 160GB capacity.
050418
Cleaning up things in my host and user configurations, I have been as usual feeling very disappointed with client side fonts particularly as implemented in Fontconfig/Xft2 (and Xft1 is best forgotten).
Client side fonts are not a good idea in general for X:
  • Architecturally, client side fonts don't fit with the X architecture: X is designed so that a single host may support multiple X servers, a single X server may have multiple screens with completely different characteristics, a process may be connected to multiple X servers on multiple hosts, and a single X server may be used by many processes from different hosts. X is designed to support this flexibility, not quite as transparently as it should, but at least with some display independence, where font rendering is delegated to the server, which is what directly drives the specific target screens.
    A process connected to multiple X displays, or to multiple screens of the same X display, needs to have multiple font rasterizer contexts if it uses client side fonts. With server side fonts all the process has to do is to send strings and they will be rendered optimally for the specific target display, no matter its characteristics, which host it is connected to, and so on.
  • From a performance point of view client side fonts are bad now and will be even worse in the future. Their only performance advantage is some modest win as to protocol latency: since the rendering of glyphs happens on the client side, the client knows exactly the geometry of the rendered text, and the X server does not need to send back font metrics; so, it is possible to contrive a demo which gives a small overall advantage in a very special case.
As to the client side performance disadvantages, present and future:
  • Rasterization and caching of the rasters has to be performed in every single application connected to the same X server; if an application is connected to multiple X screens, it has to rasterize and cache the same glyphs multiple times. Also, almost always the majority of applications connected to an X server use the same fonts.
    Of course if memory and CPU time are in effect infinite this does not matter; and therefore it is almost unnoticeable for single processes, but unfortunately I often have several dozen processes running connected to one or two X servers with one or two screens with different characteristics each.
  • Bandwidth is consumed by the sending of text rasters. This applies to multiple levels of the communication chain: even if the client and the X server are on the same host, with client side fonts the X server must rasterize the glyphs every time to the display card, thus consuming AGP or PCI bus bandwidth.
    With server side fonts the X server can use all of the framebuffer that is not used for the screen as a glyph cache, making the rasterization of text almost entirely an onboard operation, saving on AGP and PCI bus bandwidth and latency.
  • Latency increases because each application has to first rasterize all the glyphs in a text string, and then send the raster as a whole, and the X server can only start blitting it when it is fully received.
    With server side font rendering not only can the X server cache the glyphs in the frame buffer, but the text of the string to render is received quickly, and then the X server can in principle render that text incrementally, glyph by glyph.
  • Sending pixmaps to blit instead of character strings to render currently is way less efficient than sending text strings in relative terms, but not that much in absolute terms because most current screens have rather low DPI values, so the bitmaps aren't that big.
    Unfortunately now that the world has switched to LCD screens higher DPIs are becoming more common, and higher DPI raise dramatically the size of the pixmaps to be sent, especially if they are antialiased, as this both increases the size of the pixmap and reduces the effectiveness of compression, if any; but then antialiasing is fortunately less necessary on high DPI displays.
    In any case rasterizing text over the wire on a 1280x1024 pixel display is not quite the same as on a 4000x3000 display.
    The growing size of client side text pixmaps impacts performance severely in several ways:
    • Each client has to spend that much more CPU time to create bigger glyph rasters, and that much more RAM to cache them.
    • The bandwidth needed to send the text rasters increases considerably, both at the network level and at the AGP/PCI/... bus level.
    • The latency also increases.
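How quickly text rasters grow with DPI can be estimated with a rough sketch. It assumes every glyph occupies a square cell one em wide, 12 point text, no compression, and 8 bits per pixel for antialiased rasters; all of these are simplifying assumptions, not properties of any particular rasterizer.

```python
def glyph_raster_bytes(point_size, dpi, chars, bits_per_pixel):
    """Rough size of a rasterized text run, assuming square em-sized cells."""
    pixels_per_em = round(point_size * dpi / 72)  # 72 points per inch
    pixels = pixels_per_em * pixels_per_em * chars
    return pixels * bits_per_pixel // 8

TEXT = "x" * 80  # an 80 character line; sent as a string it is 80 bytes

for dpi in (72, 96, 144, 288):
    mono = glyph_raster_bytes(12, dpi, len(TEXT), 1)  # 1 bit per pixel
    aa = glyph_raster_bytes(12, dpi, len(TEXT), 8)    # antialiased, assumed 8 bpp
    print(dpi, len(TEXT), mono, aa)
```

Under these assumptions the raster size grows with the square of the DPI, while the string itself stays at 80 bytes.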
These issues apply to any strictly client side scheme, but Fontconfig/Xft2 have their own specific problems, for example:
  • A considerable lack of documentation. For example the Fontconfig font specification is not that clearly defined anywhere, and there are several obscurities in the specification of the Fontconfig configuration files; and the Fontconfig configuration files are a classic example of moronic abuse of text markup, but then this is a much wider issue, admittedly.
  • A number of gross misdesigns have been made in the implementation of Fontconfig; version 1 for example did not even cache font descriptions, and version 2 does, but in a rather clumsy way. After quite a wait now Fontconfig maps font files instead of copying them into memory, but of course this only works for filesystems that support mmap efficiently.
  • Each client process must depend on a number of additional shared libraries, and since different clients will be executing different parts of the shared libraries at any one time, the memory residence footprint will be way larger than if those shared libraries were in use just by the X server.
  • Each client process has a private glyph cache, and I am not aware of any user level mechanism to influence the caching policy, including the minimum and maximum size of the cache; perhaps this is possible, but I am pretty sure most people do not use it. Instead cache policy settings are obvious in server side embedded and standalone font servers.
  • It seems impossible to specify which precise font one wants to select, in particular I have been unable to ensure that the PCF version of Helvetica is selected instead of the Type 1 version. This may be due to lack of documentation or my misunderstanding, but I suspect not.
050409
While sorting out my enormous archive of articles and papers I found some old (around 1990) UNIX for PC software pricelists, with several entertaining items:
  • ESIX Unix System V dev system multi user US$1499, std system 2 users US$619.
  • UNIX System V TCP/IP US$359.
  • UNIX System V Windowing runtime US$219, dev sys US$599.
and so on (some scans, including some modem prices: 1, 2, 3). Over the years I have bought, and I still have the installation media, Microport System V.2 for the 286, then ESIX System V.3 (ESIX was a brand from Everex), and finally Dell System V.4, before switching to GNU Linux.
050408
Just found a fairly nice page on the more important design patterns.
It looks to me like the usual confusion between design, patterns, algorithms, methods, recipes and pious homilies.
So I am reminded that they may be considered a buzzword bandied by some to make their employers or customers believe they possess some kind of secret sauce that makes their work tastier and less fattening at the same time.
Even more usefully, like most previous software fads, their underlying promise is that they give managers more power over their minions by deskilling them; that is, they enable hiring low skilled labourers and telling them to follow a set of simple principles or rules obediently and dumbly, as long as they are the secret sauce ones.
As to those of these patterns that seem actually design oriented, they tend to be trivializations of simple databased design rules.
There is of course a benefit, small but significant, to the talk about patterns: whatever they are, they name recurrent and topical practices, whether related to design or not, and this might sometimes improve communication with and between otherwise unskilled people; which may be of benefit given the industry tendency towards employing unskilled workers.
050407
I have been pleasantly surprised by KIAX, an IAX2 protocol based software phone. It feels much better designed and more reliable than either Linphone or KPhone, as described more extensively in my (updated) more specific comments on them.
050405
I have been experimenting with Asterisk, which is an IP and PSTN telephony multiprotocol switch program, and to this effect I have been experimenting with both some GNU/Linux based software phones and the Gradwell VoIP services.
As to the Gradwell VoIP services, they have a somewhat attractive IAX2 based service, by which those who run an Asterisk instance can just register with an Asterisk server at Gradwell's, using it to switch calls from and to the PSTN.
The service charge is just a £1+VAT registration fee plus (low) call charges. However, this is not enough to use their PSTN gateway service, because to use it in the outgoing direction one must supply an authenticated caller ID number, which must be either a manually authenticated PSTN number already assigned to the user (this manual authentication costs £35+VAT), or one already supplied to the same customer by Gradwell themselves. Gradwell can register a block of 10 numbers for £3+VAT per month (minimum term 3 months) or a single number with a SIP account for £4+VAT per month, first three months free (minimum term 3 months).
So in practice one has to register for the IAX2 services, and then either pay a once only but fairly high fee, or register for a monthly-fee service. Since I wanted to experiment with SIP telephony anyhow I have registered for the SIP account.
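For comparison, the two options can be put side by side with a little arithmetic. This sketch uses only the prices quoted above, excludes VAT, call charges and the registration fee common to both, and assumes the monthly fee never changes.

```python
ONE_OFF_AUTH = 35  # pounds, manual caller ID authentication, paid once

def sip_account_cost(months, monthly_fee=4, free_months=3):
    """Cumulative cost in pounds of the SIP account after a number of months."""
    return monthly_fee * max(0, months - free_months)

def break_even_month():
    """First month in which the SIP account has cost more than the one-off fee."""
    month = 1
    while sip_account_cost(month) <= ONE_OFF_AUTH:
        month += 1
    return month

print(break_even_month())
```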
Readily available to me for Debian are two SIP software phones, Linphone and KPhone.
They are both half-finished, mostly working, mostly undocumented; they are sort of useful but shoddy.
Of the two it is easier to get Linphone to work, but KPhone seems more reliable and lightweight; I have written up some specific comments.
050404
Another day, another great Debian script annoyance: I have installed zeroconf a nice little utility which is like a DHCP client, but without the server, in that it can configure a network interface automagically a bit like for IPv6, but (usually) in the IPv4 169.254.0.0/16 range reserved for that.
So far it is good and nice, but the Debian zeroconf package installs also /etc/network/if-up.d/zeroconf-up which is a script that runs the application every time an interface is activated, whether or not I want it. Allegedly autoconfiguration should always work, but there are cases where I simply want to bring up an interface in a fully passive way.
However, I have had a look at the various /etc/network/if-X.d/zeroconf-up directories and the scripts therein, and as usual the Debian Way is to try to do things automagically and in ways that to me feel rather shoddy.
Having something like zeroconf (or IPv4LL) is definitely a nice idea; having it started by default is not nice. But this pales compared to the horror I feel when updating a package containing a daemon starts the daemon even if I have disabled its activation in the runlevel configuration.
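The address selection idea behind IPv4LL can be sketched in a few lines. This is not how the Debian zeroconf package is implemented, just an illustration of picking a candidate in the reserved range; a real implementation must then probe with ARP to check that nobody else is using the address, which is omitted here.

```python
import ipaddress
import random

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

def pick_candidate(rng):
    """Pick a random candidate address for IPv4 link-local configuration."""
    # RFC 3927 reserves the first and last /24 of the range, so candidates
    # run from 169.254.1.0 to 169.254.254.255.
    third = rng.randint(1, 254)
    fourth = rng.randint(0, 255)
    return ipaddress.ip_address(f"169.254.{third}.{fourth}")

addr = pick_candidate(random.Random(0))
print(addr, addr in LINK_LOCAL, addr.is_link_local)
```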

March 2005

050329
I have finally gotten around to add the Ubuntu sources to my Debian /etc/apt/sources.list:
deb http://archive.Ubuntu.com/ubuntu/ hoary main universe restricted multiverse
deb http://archive.Ubuntu.com/ubuntu/ warty-security main universe restricted multiverse
deb http://archive.Ubuntu.com/ubuntu/ warty-updates main universe restricted multiverse
deb http://archive.Ubuntu.com/ubuntu/ warty main universe restricted multiverse
in order to have the option of installing Ubuntu-only packages or package versions. To ensure that only Debian packages are considered by default I have had to release-pin packages with an origin of Ubuntu to a low priority, by putting these lines in /etc/apt/preferences:
Package: *
Pin: release o=Ubuntu
Pin-Priority: 90
050321
I have just had a look at the initrd for Debian and I was quite amazed to see it is 2MB compressed and about 5MB uncompressed. It is a pretty largish root filesystem, and some mini distributions are smaller.
I was looking at it to answer a question by someone as to how to prevent Debian from loading a specific SCSI driver at startup, even if the HA was present. Now it turns out that this is not easy because there are many excessively helpful mechanisms that try to automate system driver loading and configuration, starting with those in the initrd.
050319
Having a renewed interest in VoIP I have started looking at developments a bit more recent than H323, and this obviously means SIP, IAX2 and the Asterisk software exchange.
Unsurprisingly it looks like the usual: the whole area is underdocumented and the design of stuff is awkward and inconsistent.
050315
Playing with the 2.6.11.2 kernel and looking at the SUSE patches I have noticed they have a ZyDAS 1201 driver as a patch, version 0.15 of the ZyDAS 1201 driver. That patch applies cleanly and the driver just works, even the firmware loading seems good (I have put the ZyDAS firmware files in /usr/local/lib/firmware which is the right place for manually installed firmware files).
The newer releases of the driver are much better than the older ones (I had tried release 0.8 a while ago) and the 2.6 USB code seems to have improved a fair bit too. Even better, there is a note saying that ZyDAS is helping by giving documentation. This means ZyDAS joins the good Linux WiFi chipsets which is particularly welcome as ZyDAS chipset USB thingies are easy to find, cheap and small.
Just out of curiosity I have tried to measure the effective 802.11b speed one can get in not very optimal conditions. I have my AP in one room and a PC in the next room, with a wall in between that is rather radio opaque, and thus a signal strength of 35/128, and default parameters.
Under these conditions I could get around 600KiB/s, or around 5Mb/s, out of the theoretical maximum of 11Mb/s. This is not too bad, and very similar to the actual utilization, around 50%, with 802.11g, but I suspect that in better conditions and with a little tuning (frame size, MSS, ...) this can be improved.
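The conversion between the measured figure and the nominal rate is simple; the sketch below takes 1KiB as 1024 bytes and 1Mb as one million bits.

```python
def kib_per_s_to_mbit_per_s(kib_per_s):
    """Convert KiB/s (1024 bytes) to Mb/s (10**6 bits)."""
    return kib_per_s * 1024 * 8 / 1_000_000

rate = kib_per_s_to_mbit_per_s(600)
utilization = rate / 11  # fraction of the nominal 802.11b rate
print(f"{rate:.2f} Mb/s, {utilization:.0%} of nominal")
```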
050314
Well, I have been playing around with the 2.6.11.2 Linux kernel release, and it seems a lot more reliable than previous releases in the 2.6.x series. I have made a couple of interesting discoveries:
  • It is now possible to select the elevator on a per-block-device basis, by using /sys/block/dev/queue/scheduler.
  • I figured out why I was getting much lower hdparm -t results under 2.6 than under 2.4: it appears that to get the same results under 2.6 I must raise the filesystem readahead, set with hdparm -an, to some large value like 512 blocks.
    Evidently 2.6 kernels don't automagically do as much readahead as 2.4 kernels do.
    Note that lots of readahead makes streaming tests look better, but may be terrible for other uses...
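The per-device scheduler file lists the available elevators on one line, with the active one in square brackets; writing one of the names to the file selects it. A minimal sketch of reading that format follows. The sample line is an assumed example, since the actual list depends on what was compiled into the kernel.

```python
def parse_schedulers(text):
    """Split the contents of a queue/scheduler file into (available, active)."""
    names = text.split()
    active = next((n.strip("[]") for n in names if n.startswith("[")), None)
    return [n.strip("[]") for n in names], active

sample = "noop anticipatory deadline [cfq]\n"  # assumed sample contents
print(parse_schedulers(sample))
```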
050312
Since the Linux 2.6.11 kernel release, and its two official updates up to 2.6.11.2, are so recent, I have had another look at a major distribution's variant of the same kernel, the SUSE kotd package, to see how much of a variant it is; other kernels from major distributions are similar, for example there is a list of extra drivers in recent Mandrake kernels.
Well, it has almost 500 patches. Some of these add functionality (like UML and Xen). But there are very many fixes... These are the patch collections:
Collection   # of files   total # of lines
arch                 19              16130
drivers             150             365887
fixes               106              14786
rpmify               13                911
suse                157             236323
uml                  10               2817
xen                  14              36169
Of these the suse and drivers collections seem to be mostly extensions, but they also contain a lot of fixes.
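As a quick check on the "almost 500" figure, the per-collection counts listed above can be totalled; the numbers are copied from the list, and nothing else is assumed.

```python
# (number of files, total number of lines) per patch collection, as listed.
COLLECTIONS = {
    "arch": (19, 16130),
    "drivers": (150, 365887),
    "fixes": (106, 14786),
    "rpmify": (13, 911),
    "suse": (157, 236323),
    "uml": (10, 2817),
    "xen": (14, 36169),
}

total_files = sum(files for files, _ in COLLECTIONS.values())
total_lines = sum(lines for _, lines in COLLECTIONS.values())
print(total_files, total_lines)
```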
Now the question is, if these things are good for SUSE, who are careful people who test things and have many happy users, why aren't these patches in the original kernel? My comments on this are:
  • Well, some of those patches are not really relevant to the original kernel, because they are SUSE specific or not relevant to a general purpose kernel. But still, there are so many fixes.
  • In order to become part of the original kernel, they have to be submitted to the original kernel maintainers. I can easily see that the major distributions may think it is not in their best interests to proactively contribute their collections of fixes to the original kernel.
Another point is that of these changes to the kernel, very many are extensions, and they are distributed as source patches. I know that Linus Torvalds prefers it like this, but I still think that the Linux kernel should have some mechanism to allow modularization of the source. Even accepting that it is a monolithic kernel at runtime does not imply that its source should be monolithic to the extent it is now.
050301
Things with P2P seem a bit better, or perhaps worse, than the impressions I got from looking at the people queued to download from my machine. Other statistics show that the percentage of users that cannot provide uploads is way less than the 40-50% I had summarily estimated. According to statistics by Razorback2, which is probably the biggest eDonkey directory site, only about 15% of users cannot provide uploads (LowID). So the mystery of why queues are so long and terrible and download speeds so low persists.
However, I tried a few other downloads. One, and a fairly large one, for an ISO9660 image with some media test files, started almost immediately and proceeded at high speed (around 40KiB/s). The reason seems to be that the file was seeded by some high speed servers from Razorback2 itself, thus showing how effective seeding is.
I then decided to try and download something that was a bit large but also with very many available sources. Finding something suitable was not easy, in part because of the variously objectionable nature of most of the really popular stuff (unfortunately most was not freedom software), in part because not a lot of files are popular.
Well, even with many complete sources, queuing took a fairly long time and downloads were not speedy. Typically once started there were 3-4 sources (out of a few hundred) each delivering around 3-10KiB/s.
050301
Well, more observations on the dreadful P2P situation. I have left my eDonkey client running with a tasty selection of free software ISOs. The top uploads served are reported as KANOTIX-2005-01.iso with 1.2GB, ubcd32-full.iso with 0.9GB, knoppix-std-0.1.iso with 0.7GB, and I haven't had any download running for a bit.
I have occasionally tried to download something to test the download side, and while my upload side is constantly busy, when I try to download things often there is a single host offering them, and there are huge queues.
I am not at all surprised that my experience so far (and many others I have read about online) has been so negative, with extremely poor download rates, scarce availability of seeds, and long queueing times.

February 2005

050223
In the past few days I have tried out some P2P systems like eDonkey, Gnutella, OpenFT, and it is pretty obvious there are fairly big problems with the P2P model of operation. The main problem however is simply lack of bandwidth, that is of seeds for downloads. This is going to become worse and worse as ISPs are now trying to switch from high fixed monthly fees to low monthly fees plus charges for bandwidth, in both directions, but the situation is ugly enough as it is.
The main symptom is that I have had both aMule and giFT running for a week now constantly and I have uploaded well over six times more than I have downloaded. When I have tried to download something, it just gets stuck for many hours or days waiting for some site to become free, and then it downloads at very very slow speeds (so it took several days to get the ISO image of System Rescue CD which downloaded in a few dozen minutes from a SourceForge mirror). This has been worse for eDonkey than for Gnutella/OpenFT.
This seems to be a rather common experience, and indeed there are some obvious and systematic causes:
  • Almost all P2P hosts are either on a modem or an ADSL line. This means that at best the theoretical up bandwidth is half the down bandwidth, and in several cases it is one eighth, as many services offer 2Mb/s accounts with a 256kb/s up limit (there are also technical reasons for inefficient line utilization).
  • About half of P2P hosts seem to be behind firewalls that forbid incoming connections completely. This is less of a problem for Gnutella/OpenFT, but it is just like that with eDonkey.
Now the combined effect of these two inevitable issues is that in theory download and upload bandwidth should equalize to about half the typical/most common up bandwidth, which is about 256kb/s, or in practice (taking into account some technicalities) an effective limit of 28KiB/s. So I would expect that my up bandwidth would be close to 28KiB/s, but my effective down bandwidth would be around 14KiB/s, assuming that P2P is indeed peer to peer, that is, sharing happens symmetrically.
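The expected figures can be reconstructed roughly as below. The 10% protocol overhead is my assumption, chosen because it brings the raw 256kb/s close to the 28KiB/s effective limit; the 50% of hosts accepting incoming connections is the estimate given above.

```python
def kbit_to_kib_per_s(kbit_per_s):
    """Convert kb/s (1000 bits) to KiB/s (1024 bytes)."""
    return kbit_per_s * 1000 / 8 / 1024

OVERHEAD = 0.10   # assumed protocol and line overhead
REACHABLE = 0.5   # share of hosts that accept incoming connections

raw_up = kbit_to_kib_per_s(256)
effective_up = raw_up * (1 - OVERHEAD)

# In a symmetric network total downloads equal total uploads, so if only
# half of the hosts can serve, the average download rate per host halves.
expected_down = effective_up * REACHABLE

print(f"{raw_up:.2f} {effective_up:.1f} {expected_down:.1f} KiB/s")
```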
But I usually observe that my aggregate downloads are for a lot less than that, that is when downloading happens it is for 3-6KB/s, and as a rule downloading doesn't happen at all for hours, as transfer requests are queued before getting a short burst of 3-6KB downloading, for an average download speed well below the typical 3-6KB, never mind being equal to the upload speed.
Not only does my up link run at top capacity all the time (which is not good, as I am on a theoretically 1:50 contended service), I also see around 80-100 hosts in the queue of people waiting for a chance to download from my host, and some of them have been queued for days.
All this indicates not only that maximum upload speeds are on average much lower than download speeds, because of asymmetric speeds for both V.90 modems and ADSL, and that around half of the hosts participating don't accept clients at all because of firewalls, but also that very very very few sites are actually sharing.
In other words the typical usage pattern is that people get online, wait a long time to download something they are interested in, and then once the download is complete, they close the connection/sharing.
In other words, files are shared just about only while they are being downloaded, and then only in half of the cases, and at less than half speed, and most crucially when they are mostly incomplete.
The waiting happens because very few of the P2P hosts have a complete file image to share, and everybody else has got incomplete ones that are incomplete in the same way.
In other words:
  • P2P networks actually have very hierarchical, download style usage patterns.
  • There are very few seed hosts to kickstart the temporary sharing that is what actually takes place, and these seeds are on relatively slow and overloaded connections.
In effect P2P networks are not fully peer to peer, they are shared download systems (much like BitTorrent), with not much in the way of sites to download from to start with.
The consequence is that P2P systems currently are just about useless for an important and interesting use, which is to replace or augment FTP/HTTP/RSYNC/Torrent sites as the primary distribution mechanisms for free software, and in particular for ISO images of free software operating system install CDs.
This is highly regrettable, because P2P could instead be a particularly efficient viral marketing channel for free software installers.
Two fixes are possible, one both weak and unfeasible, and the other unlikely but in theory excellent but for a detail:
Change the behaviour of peers to continue sharing even after the download completes.
This is weak because peers, most of whom have consumer grade, contended modem or ADSL connections, have pitiful upload bandwidths to offer, and it is unfeasible because it just goes against the grain of user behaviour, and the more commonly ISPs charge for bandwidth, against their self interest.
Put the same repositories that currently offer their archives on FTP/HTTP/RSYNC/Torrent on P2P networks too
This can work really well and greatly improve the reliability of downloads from those sites, at the same time relieving them of a large part of the bandwidth cost, as after all while people download they also end up sharing. It is unlikely however that this will happen, as many repositories are publicly funded (e.g. hosted by universities) and P2P systems have been demonized as vehicles for dishonest and criminal behaviour. The technical problem is that P2P systems typically present a completely flat view of the namespace of available files, and most existing archives are arranged, for very good reasons, hierarchically. This can be fixed by having P2P servers flatten file paths into file names, which is not hard.
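Flattening paths into names can indeed be done reversibly in a few lines. The sketch below assumes a hypothetical separator character and percent-encodes each path component so that the separator can never appear inside one; no existing P2P server is claimed to work this way.

```python
from urllib.parse import quote, unquote

SEPARATOR = "!"  # hypothetical choice; any character that quote() escapes works

def flatten(path):
    """Turn a hierarchical archive path into a single flat file name."""
    parts = [quote(part, safe="") for part in path.strip("/").split("/")]
    return SEPARATOR.join(parts)

def unflatten(name):
    """Recover the hierarchical path from a flattened name."""
    return "/".join(unquote(part) for part in name.split(SEPARATOR))

flat = flatten("pub/linux/debian/sarge.iso")
print(flat, unflatten(flat))
```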
As a final note, I suspect that the current popularity of P2P systems despite their awful performance is due to historical causes; in the beginning all P2P systems probably were in effect seeded by university students, and in particular computer science ones, who enjoyed symmetrical and very high bandwidth connections thanks to attachment to their campus network.
Then the enormous amount of bandwidth consumed and the illegal nature of much of the content offered for sharing led universities to forbid such seeding, and the P2P systems remaining out there are now seedless and sad ghosts of what they were, still popular thanks to fresh memories of a golden age that is no more.
050223
I am looking into P2P programs, mostly based around the eDonkey or the OpenFT protocols.
The motivation of this research is that freedom software packages are becoming ever more sophisticated and bigger, in particular for the albums/compilations known as distributions, especially the live CD ones.
The existing methods are all somewhat unsatisfactory:
FTP or HTTP
  • Download is from a single server per file, putting huge loads on server.
  • No built in verification of the integrity of the transferred file.
  • When an MD5 checksum file is also available, this only tells whether the download failed, not where.
  • Partial downloads in practice can only be restarted from the end.
  • Fortunately there are very many FTP and HTTP servers, even if they are prone to congestion, unfortunately there are few systematic catalogs of servers and indexes of their contents, with the result that the well known servers are even more prone to congestion.
RSYNC
  • RSYNC downloads in chunks and verifies the integrity of each chunk, and can redownload any arbitrary chunk, so that's pretty nice.
  • There is still a single download source at a time.
  • There are relatively few download servers.
  • There are even fewer ways to find catalogs of RSYNC servers and indexes of their contents than for FTP and HTTP.
  • Existing RSYNC clients are slightly more awkward than FTP or HTTP clients, which have nice shell-style or commander-style interfaces.
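The difference between a whole-file checksum and per-chunk verification can be shown with a small sketch. This is only an illustration of the general idea, not the actual RSYNC algorithm (which as far as I know uses rolling checksums too); the names chunk_digests and damaged_chunks and the tiny chunk size are my own assumptions.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, just to keep the example readable


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Return one MD5 hex digest per fixed-size chunk of data."""
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def damaged_chunks(expected: list, received: bytes,
                   chunk_size: int = CHUNK_SIZE) -> list:
    """Return the indexes of chunks that are missing or do not match."""
    actual = chunk_digests(received, chunk_size)
    return [n for n, digest in enumerate(expected)
            if n >= len(actual) or actual[n] != digest]


original = b"0123456789abcdef"
corrupted = b"01234567XXabcdef"  # two bytes damaged, inside the third chunk

if __name__ == "__main__":
    # A whole-file checksum only says that something is wrong...
    print(hashlib.md5(original).hexdigest() == hashlib.md5(corrupted).hexdigest())
    # ...while per-chunk digests say which chunks to fetch again.
    print(damaged_chunks(chunk_digests(original), corrupted))
```

With the list of damaged chunk indexes in hand, only those chunks need to be downloaded again, from the same server or from any other that has them.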
BitTorrent
  • BitTorrent is basically RSYNC where chunks can come from many different servers, which all register with the original BitTorrent server, which may or may not be the one with the original content.
eDonkey
TBD
Gnutella
TBD
050218
The ridiculous font situation under Linux is getting ever worse. I have been looking at changing the font used for the GUI elements (toolbars, menus, not the page) in Mozilla and Firefox. The following disgusting issues arose:
  • Mozilla uses GTK 1, which uses the X11 native font system, and Firefox uses GTK 2, which completely ignores it in favour of that idiocy, Fontconfig/Xft2.
  • One can change the Mozilla GUI font by editing/overriding the theme description for its GTK 1 theme, that is by adding some poorly documented lines to $HOME/.gtkrc.
  • In theory, and as documented, one can change the GUI font for Firefox by similarly editing/overriding the GTK 2 theme, by adding some poorly documented lines to $HOME/.gtkrc-2.0.
  • The GTK 2 per-user theme file is called .gtkrc-2.0 even in the 2.2 and 2.4 releases of GTK 2.
  • The font specification in the .gtkrc-2.0 file uses the setting gtk-font-name whose syntax is similar to, but incompatible with that of Fontconfig/Xft2 font names, which in turn is hardly documented, and the differences seem gratuitous. For example, in Fontconfig font names the point size is separated from the font name by a dash, but not in GTK 2 settings.
  • In any case, a bright guy has made sure that several settings which are possible in .gtkrc-2.0 are actually overridden by equivalent settings in the GConf database, which apparently is only documented in an email announcing this patch to a mailing list; this requires Firefox to be dependent not just on the GTK libraries, but also on the GNOME libraries, or at least the GConf ones.
  • Even after all this idiocy has been worked out, if I choose a bitmap/PCF font it is bold by default, and I haven't been able to switch that off. Why? Why? Why?
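To make the gratuitous difference between the two font name syntaxes concrete, here is a rough sketch that turns a Fontconfig style name into the form that gtk-font-name expects. It is deliberately incomplete: the function name fontconfig_to_gtk2 is hypothetical, it only handles a single family and a point size, it silently drops any ":property=value" suffix, and it does not cope with family names that themselves contain a dash.

```python
def fontconfig_to_gtk2(name: str) -> str:
    """Convert e.g. "Sans-10" (Fontconfig) to "Sans 10" (gtk-font-name)."""
    # Assumption: anything after the first ":" is a Fontconfig property
    # list, which this sketch simply discards.
    base = name.split(":", 1)[0]
    family, dash, size = base.rpartition("-")
    # Only treat the trailing part as a size if it looks like a number.
    if not dash or not size.replace(".", "", 1).isdigit():
        return base
    return f"{family} {size}"


if __name__ == "__main__":
    print(fontconfig_to_gtk2("Bitstream Vera Sans-10"))
```

The converted string is what would go between the quotes of the gtk-font-name setting in .gtkrc-2.0.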
In this, as in many other cases (ALSA springs to mind), the unwillingness and perhaps inability to think things through and go beyond the cool half-assed demo stage seem to me the driving forces.
050215
The insanity of Linux kernel development is becoming ever more manifest in the 2.6.x series. For the sake of entertainment I have had a look at the 2.6.x kernel packages by RedHat and SUSE among many. Well, the RH ES 4.0 2.6.7 kernel has over 250 patches, and the SUSE 2.6.10 kernel source package has several archives of patches, including a 4 gigabytes one of fixes.
Sure, some of these will be cool little features that don't really need to be in the mainline kernel (like UML and Xen support), but the number of mere bug fixes, especially inside drivers, is amazing.
Understandably Linus says that his main worry is to make sure that the overall core structure of Linux be right, and this has meant paying a lot less attention to device issues, but it is getting a bit ridiculous.
Also, RedHat and SUSE are hardly untrustworthy as to the stuff they do with their kernel; one might be tempted to just include almost all their patches into the mainline kernel, as if they are good for them, probably they are good for everybody.
050202
Thanks to a letter by Michael Forbes to Linux Magazine I have discovered the recently introduced --link-dest option to rsync and the Perl wrapper script rsnapshot that uses it to automate creating backups of filesystems that are both incremental and full, using forests of hard links.
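The trick behind backups that are both incremental and full is easy to sketch: every snapshot is a complete directory tree, but files unchanged since the previous snapshot are hard links to the previous copy and so take no extra space. The following is my own minimal illustration of that idea, not how rsync or rsnapshot are implemented; the function name snapshot is hypothetical, it ignores symlinks, permissions and deletions of directories, and it compares file contents, whereas rsync by default compares size and modification time.

```python
import filecmp
import os
import shutil
from typing import Optional


def snapshot(source: str, previous: Optional[str], target: str) -> None:
    """Copy source into target, hard-linking files unchanged since previous."""
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        dest_dir = os.path.normpath(os.path.join(target, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share the data blocks with the older snapshot.
                os.link(old, dst)
            else:
                # New or modified: store a fresh copy in this snapshot.
                shutil.copy2(src, dst)
```

Calling snapshot(source, None, "snap.0") makes the first full copy, and each later call such as snapshot(source, "snap.0", "snap.1") produces another tree that looks full but only costs the space of what changed. Deleting an old snapshot is then just removing its directory, as the shared files survive as long as one link remains.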
050201
As to ALSA, I haven't had the time yet to check whether there is a mixer plugin in 1.0.8 but it has a reworked alsamixer with a rather less misleading user interface, in particular for controls that do not correspond to sound channels.

January 2005

050112
Quite entertaining interview with Linus Torvalds in the January 2005 issue of Linux Magazine; among the interesting points are that he is currently using a dual PPC G5 system, to practice code portability, that he lists along with x86 and PPC the ARM architecture as one of the crucial Linux architectures, and the importance he gives to embedded Linux, as well as to SMP (on which he says his pessimism was wrong, which I disagree with).
Very interesting blog entry about the consequences of defining pseudo-OO in base C which then suggests the use of a preprocessor to autogenerate all the plumbing:
Lets face it, because of C's constraints, writing GTK code, and especially widgets, can be ridiculously slow due to all the long names and the object orientation internals that C can't hide.
C with Classes anyone? :-)
Also, found an interesting product that traces a lot of WIN32 API calls.
050111
Good news for those concerned with the slightly primitive state of ALSA mixing: apparently version 1.0.8rc2 has a mixer abstraction plugin in the ALSA library, and a new graphical mixer application, Mix2005, has been announced.
050109
Rather fascinating article on Tomcat and general web serving performance issues, "So you want high performance" by Peter Lin. It discusses issues like the very high cost of parsing XML, optimal JNDC architectures, how much time and money it takes to get physical high speed lines, and the cost of power and cooling for faster CPUs and disks in racks.