Software and hardware annotations q3 2006

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

September 2006

060928c Intel researching chip with 80 FPUs
Well, I have mentioned before chips with 16 CPUs, and now Intel reportedly is looking instead at more specialized 80-FPU chips:
Today we were given a chance to see the first prototype wafer they will be using for production of 80-core processors. Each CPU like that will have 80 simple floating-point dies and each die is capable of teraflop performance and can transfer terabytes of data per second. They claim that it will be commercially available in a 5-year window and will be ideal for such tasks as real-time speech translation or massive search, for instance.
To some extent the ClearSpeed CSX600 is already like that, with 96 FPUs each with around 6KiB of onchip memory for operands. It has a somewhat different usage profile, and I suspect that for now it costs a lot more than what would be the target for regular desktop usage.
060928b Fixing Belkin ADSL gateway bugs by using SMC firmware
I have long been suffering from the many terrible bugs in the firmware of my Belkin F5D7630 ADSL modem/router, and I finally decided to take advice found on some self help forum and loaded onto it the firmware of the equivalent SMC 7804WBRA, which shares the same platform. Well, the SMC firmware, user interface and manual are way better, even though they start from the same base (the web user interface code is still written in a way that I find quite terrible).
Belkin and SMC both selected the same low cost supplier for the product, which is common. It then looks like the Belkin buyers are paid bonuses on how much money they save, and piffling inanities like quality control cost money and reduce bonuses. I guess that somebody at SMC still cares about quality control, thus the various functions were tested and the bits that did not work fixed, and professional documentation was commissioned. I am going to buy SMC in the future rather than Belkin.
060928 Configuring 'automount' to use '/etc/fstab'
Now that I have decided to use automount I have realized that I sometimes still want to disable it and mount things manually and permanently. Unfortunately the format of automounter maps is different from that of /etc/fstab, therefore I would have to maintain the same information twice. Fortunately the automounter can synthesize map lines on the fly by invoking a script, and it is pretty easy to select the relevant /etc/fstab line (to which noauto should be added of course if necessary) and turn it into an automounter map entry dynamically, with a script like this:
#!/bin/sh

SP='[:space:]'

: '
  We need to look up the key to find type, options and device to
  mount. This means we need the name of an "fstab" file and the
  prefix under which the key appears in it. These can be different
  depending on the "autofs" mount point, under which this script
  is run.
'
case "$PWD" in
*)	FSTAB='/etc/fstab'; FSPFX="/fs/";;
esac

KEY="$1"

grep '^['"$SP"']*[^\#'"$SP"']\+['"$SP"']\+'"$FSPFX$KEY"'/\?['"$SP"']' \
  "$FSTAB" | if read DEV DIR TYPE OPTS REST
    then
      case "$TYPE" in ?*)
	case "$OPTS" in
	'') OPTS="fstype=$TYPE";;
	?*) OPTS="fstype=$TYPE,$OPTS";;
        esac;;
      esac
      case "$DEV" in *':'*) :;; *) DEV=":$DEV";; esac

      : '
	We must omit the key in a map program like this.
      	key -[fstype=TYPE,][OPTION]* [HOST]:RESOURCE
      '
      echo "-$OPTS $DEV"
    fi
and then running the automounter like this from /etc/inittab:
af:2345:wait:/usr/sbin/automount -t 300 /media program /etc/auto.fstab
By the way, too bad that automount is designed to always background itself, because if it did not, that line could have action respawn instead of wait.
I have extended the unmount timeout to 300 seconds (-t 300) instead of 60 because some server I use intermittently accesses some directories on some mounted filesystems, which causes a large number of mounts. This adds pointless lines to the log, and some filesystems like ext3 impose a periodic check every so many mounts. To avoid the latter I have actually disabled the forced check every some number of mounts and retained only the one every some number of days.
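To make the translation concrete, here is a minimal sketch of the same conversion applied to a single fstab-style line read from standard input. The function name fstab_line_to_map and the sample line are made up for illustration; the logic just mirrors the script above.

```sh
# Sketch: turn one fstab-style line (read from stdin) into the map
# entry text that the script above would print for it.
# 'fstab_line_to_map' is a hypothetical name used only here.
fstab_line_to_map() {
  if read DEV DIR TYPE OPTS REST
  then
    # Put the filesystem type in front of any other mount options.
    case "$TYPE" in ?*)
      case "$OPTS" in
      '') OPTS="fstype=$TYPE";;
      ?*) OPTS="fstype=$TYPE,$OPTS";;
      esac;;
    esac
    # A device without a "host:" part is local, so prefix it with ":".
    case "$DEV" in *':'*) :;; *) DEV=":$DEV";; esac
    echo "-$OPTS $DEV"
  fi
}

# Sample line, invented for the example (not from a real system).
echo '/dev/sdb1 /fs/usb vfat noauto,user 0 0' | fstab_line_to_map
```

With that sample line the output should be -fstype=vfat,noauto,user :/dev/sdb1, while a line whose device already contains a host part (like server:/export) keeps its device field unchanged.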
060924c Some reports on dual core versus hyperthreading performance
Spotted some interesting comments in the mailing list for the PostgreSQL DBMS about price/performance of recent dual core AMD CPU chips and SMT Intel CPU chips based on the older Netburst microarchitecture. The first one definitely supports the AMD dual CPU chips vs. the Intel ones:
I have been able to try out a dual-dual-core Opteron machine, and it flies.
In fact, it flies so well that we ordered one that day. So, in short £3k's worth of dual-opteron beat the living daylights out of our Xeon monster. I can't praise the Opteron enough, and I've always been a firm Intel pedant - the HyperTransport stuff must really be doing wonders. I typically see 500ms searches on it instead of 1000-2000ms on the Xeon)
which is not unexpected, while the second mentions that for something as highly parallel as a DBMS running many transactions, Intel's Hyper-Threading seemed to work pretty well too:
Actually, believe it or not, a coworker just saw HT double the performance of pgbench on his desktop machine.
Granted, not really a representative test case, but it still blew my mind. This was with a database that fit in his 1G of memory, and running windows XP. Both cases were newly minted pgbench databases with a scale of 40. Testing was 40 connections and 100 transactions. With HT he saw 47.6 TPS, without it was 21.1.
This is rather unexpected, as the author hints, because Hyper-Threading tends to result, even on custom coded applications, in something like 20-30% better performance (I did some work on that custom coding).
060924b IBM to deliver hybrid Opteron and Cell cluster
I have just noticed that IBM is building a supercomputer as a cluster (all supercomputers today are clusters) of elements with a similar number of Opteron and Cell BE CPUs. Quite astonishingly the system will be built out of distinct cluster elements: fairly impressive Opteron based System x3755 racks, and BladeCenter H blade carriers, most likely with Cell BE based blades.
This combination is a bit surprising because of a major and a minor reason:
  • That the Opteron and the Cell BE will not be on the same board, but in entirely separate computers. I could have imagined using the Cell BE as a coprocessor for the Opteron.
  • That not only the Opteron and Cell elements will be distinct, but that the Opteron ones are based on a different physical format, even if there are pretty good looking Opteron based blades for the BladeCenter H infrastructure used for the Cell based systems.
The press release says that "The machine is to be built entirely from commercially available hardware", which explains why there are no custom dual Opteron and Cell BE boards; but the BladeCenter H is commercially available too. I suspect that with several thousand racks involved a common physical format is not that important.
Having a hybrid cluster with elements of two very different architectures is going to be a challenge, but IBM boasts that the issues will be solved with a new software infrastructure:
Roadrunner's construction will involve the creation of advanced "Hybrid Programming" software which will orchestrate the Cell B.E.-based system and AMD system and will inaugurate a new era of heterogeneous technology designs in supercomputing. These innovations, created collaboratively among IBM and LANL engineers will allow IBM to deploy mixed-technology systems to companies of all sizes, spanning industries such as life sciences, financial services, automotive and aerospace design.
which seems to be a remarkably optimistic statement.
060924 Web transaction test shows 64b slower than 32b
Some recently published benchmark report shows that running the same system with 32b Linux delivers more web transactions per minute with many clients than with 64b Linux. The advantage is not large but clear: the peak for both is with 128 clients, where 32b mode delivers 6,405 transactions per minute and 64b mode 4,932.
This is slightly surprising, because the CPU is an Opteron and these seem to perform a little better in 64b mode than in 32b mode, while Intel's Core and Pentium 4 with the AMD-compatible EM64T seem to perform noticeably better in 32b mode than in 64 bit mode.
By looking at the full graph reporting transactions per minute versus number of clients one sees that 32b and 64b mode are equivalent up to 64 clients, the widest difference is with 128 clients, and there is again no difference over 400 clients. Also and crucially the transactions per minute in 64b mode decline after 64 clients, while those of 32b mode are mostly constant with 128, 192 and 256 clients, after which they decline.
My impression is that this is due to memory being the bottleneck: the benchmark system has only 2GiB of main memory, and while many 64b applications tend to run a bit faster than 32b ones on an AMD chip, they take significantly more memory. Not quite a doubling, because in practice only pointers double in size, but still often significantly larger.
What it looks like from the graph is that in 32b mode maximum utilization of the CPU, IO and net is achieved starting with 128 clients, and memory then becomes the limiting factor above 256 clients; while in 64b mode memory limits performance above 64 clients.
Perhaps the limit is reached because of paging, or instead because of higher memory traffic. I suspect to some extent the latter; the motherboard, the Tyan Thunder K8SR S2881, seems good with 4 memory sockets per processor socket, but I wonder whether the test system had its 2GiB as 2x1GiB sticks (no dual channel per processor) or 4x512MiB.
Another benchmark report on the same site illustrates the effect of different memory sizes on a similar workload, and it shows that transactions per minute drop precipitously rather than gently when the workload exceeds the size, rather than the bandwidth, of available memory. This would tend to suggest that the issue here is more memory bandwidth.
Curiously the graph of transactions per minute versus number of clients strongly resembles both the profile and absolute level of that for the Opteron benchmark in 64b mode, even if it is for Athlon 32b only CPUs. The motherboard used is the Tyan Thunder K7 S2462. Being designed for Athlon MP processors it has a much weaker memory subsystem: 4xDDR266 sockets instead of 8xDDR400, and chipset managed multiprocessing and memory controller rather than HyperTransport and CPU integrated memory controller. Also note that the Athlon MP CPUs have significantly less cache than the Opterons.
My guess is that the S2462 in 32b mode and the S2881 in 64b mode both hit the maximum memory bandwidth at 64 clients, while the S2881 in 32b mode has more headroom (like 20-40%) and hits it at around 320 clients. Probably this means that AMD realized that Athlon 64s and Opterons in 64 bit mode needed more memory bandwidth than Athlons, and thus pushed the memory subsystem specs so that 64b mode would not be starved for memory access; which delivers the incidental benefit that the memory subsystem is in some sense oversized for 32b mode, and for memory bandwidth intensive benchmarks this more than compensates for slightly slower 32b operations.
060920b Polipo, another nice proxy cache for small sites
While Apache with my patch is working well as a proxy cache, I also had a look at another one that seems especially promising:
Polipo is a small and fast caching web proxy (a web cache, an HTTP proxy, a proxy server) designed to be used by one person or a small group of people. I like to think that is similar in spirit to WWWOFFLE, but the implementation techniques are more like the ones used by Squid.
Polipo has some features that are, as far as I know, unique among currently available proxies:
  • Polipo will use HTTP/1.1 pipelining if it believes that the remote server supports it, whether the incoming requests are pipelined or come in simultaneously on multiple connections (this is more than the simple usage of persistent connections, which is done by e.g. Squid);
  • Polipo will cache the initial segment of an instance if the download has been interrupted, and, if necessary, complete it later using Range requests;
  • Polipo will upgrade client requests to HTTP/1.1 even if they come in as HTTP/1.0, and up- or downgrade server replies to the client's capabilities (this may involve conversion to or from the HTTP/1.1 chunked encoding);
  • Polipo has complete support for IPv6 (except for scoped (link-local) addresses).
  • Polipo can optionally use a technique known as Poor Man's Multiplexing to reduce latency even further.
In short, Polipo uses a plethora of techniques to make web browsing (seem) faster.
The attention to detail in the features listed above is notable. I am still using Apache as a proxy cache though, because I have to run it anyhow as a web server, and it seems to be adequate.
060920 Using Apache 2.2 as a proxy cache, with patch
For years I have mostly been using Squid as my web proxy cache, but also occasionally Apache, which has proxying and caching modules (they are distinct: Apache can proxy without caching and cache non-proxy requests, and both modules must be enabled to provide proxy caching like Squid).
The advantages of Squid are that it is rather more flexible, notably it can filter URIs via a custom external program or script, and it scales to very large cache sizes. The main advantage of Apache as proxy cache is that one does not need to run an extra dæmon and it is simpler and smaller.
Anyhow I have been using for a while Privoxy for filtering (it can filter both URIs and the content associated with them as these examples show) and there are secondary differences between Squid and Apache as proxy caches that matter: Squid does not support IPv6 (even if there are somewhat dodgy IPv6 patches available) and Apache does, and Squid runs periodically and fairly frequently a cache cleanup thread, while Apache has a separate cache cleanup program that can be run at any time. The latter is an advantage on laptops, where Squid's cleanup thread wakes up the disk too often.
So I enabled forward proxy mode in my Apache 2.2 configuration with something like:
<IfModule mod_proxy.c>
  ProxyRequests         on
  Listen                *:3128

  NoProxy               127.0.0.1
  NoProxy               192.0.2.0/24
  NoProxy               .sabi.co.UK
  ProxyDomain           .sabi.co.UK
  ProxyVia              on
  ProxyTimeout          50

  <Proxy *>
    Satisfy             any
    Order               deny,allow
    Deny from           all
    Allow from          127.0.0.1
    Allow from          192.0.2.
  </Proxy>
</IfModule>
The supposed effect of the above is to allow proxying (on port 3128 too, but then Apache does both local service and proxy service on all ports); servers with local addresses are not proxied, and only clients with local addresses are allowed to proxy (unrestricted proxying is a very bad idea). Even without caching Apache proxying has a value in a mixed IPv4 and IPv6 environment, because Apache has full support for both protocols, and it can proxy between them, so that IPv6-only clients can access IPv4-only servers and vice versa.
Then I enabled caching too with something like this:
<IfModule mod_cache.c>
  CacheDefaultExpire            3600
  CacheMaxExpire                86400
  CacheLastModifiedFactor       0.1
  CacheIgnoreNoLastMod          on
  CacheIgnoreCacheControl       off

  <IfModule mod_disk_cache.c>
    CacheEnable                 disk http://

    CacheRoot                   /var/cache/httpd
    CacheDirLength              1
    CacheDirLevels              2
    CacheMinFileSize            1
    CacheMaxFileSize            900000
  </IfModule>

  <IfModule mod_mem_cache.c>
    CacheEnable                 mem /

    MCacheSize                  10000
    MCacheMinObjectSize         1
    MCacheMaxObjectSize         200000
    MCacheMaxObjectCount        2000
  </IfModule>

  CacheDisable                  http://[::1]/
  CacheDisable                  http://127.0.0.1/
  CacheDisable                  http://localhost/
  CacheDisable                  http://ip6-localhost/
  CacheDisable                  http://example.com/
  CacheDisable                  http://.example.com/
</IfModule>
The supposed effect of the above is to allow disk caching of proxy requests involving HTTP URIs (FTP could also be cached, but I don't have a use for it) and memory caching for content served by Apache itself; and to permit no caching of any content on the local server or any server within my own domain.
Then I noticed that nothing was getting disk cached because of long standing issues with URI matching in the CacheEnable and CacheDisable directives:
  • The / pattern would match both local and proxy URIs.
  • No way to match all subdomains of a domain.
  • Anyhow, proxy URIs could otherwise only be matched by fully specifying the hostname part of the URI, that is http:// would match no URIs.
These issues were at variance with the relevant documentation or just did not make much sense. So I had a look and found that there was a known bug, and then I decided to fix those issues myself, and rewrote the relevant code and submitted a patch for a much improved and somewhat extended URI matching.
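As an illustration only, this is a small sketch of the matching behaviour that the three issues above suggest one would want. It is not the code of the patch, and the function name cache_pattern_matches and the exact rules are assumptions: a pattern starting with / only matches local URIs, a bare scheme like http:// matches every proxied URI of that scheme, and a host starting with a dot matches any subdomain.

```sh
# Sketch of the wished-for matching rules; NOT the actual patch.
# 'cache_pattern_matches' is a hypothetical name.
# Returns 0 (true) if the URI is covered by the pattern.
cache_pattern_matches() {
  pat="$1"; uri="$2"
  case "$pat" in
  *://.*)
    # Subdomain pattern such as "http://.example.com/":
    # the scheme must be equal and the host must end with the suffix.
    scheme="${pat%%://*}"
    suffix="${pat#*://}"; suffix="${suffix%%/*}"
    case "$uri" in "$scheme"://*) ;; *) return 1;; esac
    host="${uri#*://}"; host="${host%%/*}"; host="${host%%:*}"
    case "$host" in *"$suffix") return 0;; *) return 1;; esac;;
  *)
    # Local paths, bare "scheme://" and full prefixes: prefix match.
    # A local pattern like "/" cannot match a proxied URI, because
    # such a URI starts with its scheme and not with "/".
    case "$uri" in "$pat"*) return 0;; *) return 1;; esac;;
  esac
}

cache_pattern_matches 'http://' 'http://www.example.org/x' && echo 'proxied URI matched'
cache_pattern_matches '/' 'http://www.example.org/x' || echo 'local pattern did not match'
```

Under these assumed rules the dotted form matches only proper subdomains, so http://.example.com/ covers www.example.com but not example.com itself, which is why the configuration above lists both.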
060917c IPv6 prefix non-portability and unique local addresses
In a blog entry on unique local IPv6 addresses there is an interesting but flawed argument for private address ranges in IPv6:
they are stable (because they don't depend on your ISP),
Here the argument is that in IPv6 all global unicast address ranges allocated to users are not portable, because while IPv4 routed each subnet independently, IPv6 only allows routing by aggregation, that is hierarchically. The argument above, expanded, looks like:
  • If one gets an IPv6 prefix from an ISP, assigns addresses under it to nodes, and then changes ISP, one gets a different prefix, and has to change the address of all nodes.
  • If one uses from the beginning a private prefix, changes in ISP only require changes at the border gateways that remap the private prefix into the changing global one, not all nodes.
This argument is flawed because in the first case, if the public prefix changes, it is quite possible to start mapping the old global prefix into the new one at the border gateways. In the second case one has to map the private prefix to the public one at all times, just in case the public prefix changes in the future; in the first case one needs to map the old public prefix to the new one only if there has been a change.
Again the case for private prefixes is based on the desire of network managers to force all nodes to access the Internet via controlled gateways; this depends critically on those prefixes not being globally routable, rather than them being private or portable.
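The prefix mapping at the border gateways discussed above amounts to swapping the leading groups of an address. Here is a minimal sketch of that on textual addresses, assuming fully written out /48 prefixes (three lowercase 16 bit groups, no :: compression inside the prefix); the name remap_prefix and the 2001:0db8:0001 documentation prefix are illustrative assumptions.

```sh
# Sketch of remapping one address from an old prefix to a new one.
# 'remap_prefix' is a hypothetical helper; prefixes must be written
# in full, in the same case and form as they appear in the address.
remap_prefix() {
  old="$1"; new="$2"; addr="$3"
  case "$addr" in
  "$old":*)
    rest=${addr#"$old":}      # the part below the prefix is kept as is
    echo "$new:$rest";;
  *)
    echo "$addr";;            # not under the old prefix: left unchanged
  esac
}

remap_prefix 2001:0630:0050 2001:0db8:0001 2001:0630:0050:0001::1
```

The same operation serves both cases in the argument: mapping a private prefix to the current public one all the time, or mapping an old public prefix to a new one only after a change of ISP.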
060917b Reintroducing private addresses, that is NAT, in IPv6
One of the advantages of IPv6 is that the abundance of potential addresses means that there is never a need to use private address ranges and thus to perform NAT, which is something with grave implications for networking. Unfortunately some network and system administrators have been clamoring for private addresses and NAT in IPv6, because they like the control and the modest degree of isolation and thus partial security that NAT gives in an IPv4 context. IPv6 used to have site local addresses, which however have been deprecated as ambiguous, because of the problem of what to do with nodes that are homed on multiple sites, and of the complications associated with zone identifier suffixes. The solution to the perceived demand for private IPv6 addresses has now become the comically named unique local unicast addresses, which raises two questions: if they are local, why do they need to be unique, and if they are unique, why should they be local?
My impression is that unique local addresses are a fudge to satisfy the demand for private address space. It is a fudge because the modest advantage of IPv4 private addresses and NAT is precisely that they are ambiguous, and thus require NAT. The benefits obtained by some network administrators from private address spaces are that:
  • Since IPv4 private addresses are ambiguous, and they are drawn from well known address ranges, even if datagrams carrying them leak onto the Internet they are not routable because no Internet router (hopefully) will accept routes for those well known address ranges, and most will simply drop packets to or from such addresses (which are listed in well known bogon lists).
  • Since IPv4 private addresses are ambiguous and thus not routable on the Internet, a node with only a private address must use a NAT proxy to access the Internet, and vice versa, and this means that administrators can control and monitor all traffic from and to the Internet for nodes which only have an IPv4 private address.
In other words the side effect of private IPv4 address ranges is the ability to define separate, isolated internets (called intranets sometimes) with one way address translation to the global Internet.

Note: it is of course possible to create a separate intranet which reuses the full 32 bit address space of IPv4 instead of just the private address ranges. But while the latter only requires NAT one way (because the intranet and the Internet address sets then do not overlap, only the intranet addresses at different sites overlap), a full separate internet would require two way NAT, mapping in-use Internet addresses to a subset of the intranet's address space, as well as in-use intranet addresses to a subset of the Internet's address space, and addresses in the DNS protocol would have to be mapped too.

The reason why unicast site local addresses have been deprecated was not that they were ambiguous, but that they were ambiguous and without NAT: because if there had been NAT then nodes could not really belong to multiple sites ambiguously. The zone suffixes were an attempt to disambiguate the site local addresses, but using relative instead of absolute addressing, and as Robert Stroud's thesis showed, if absolute addresses work but do not scale, relative addresses scale but do not work.
The new proposal for unique local addresses turns mostly ambiguous relative addresses into mostly unambiguous global ones by inserting in each address a randomly generated site specific prefix, where it is unlikely that two prefixes get the same random number, which is after all Robert Stroud's conclusion in his thesis:
The only alternative is to sacrifice a deterministic notion of identity by using random identifiers to approximate global uniqueness with a known probability of failure (which can be made arbitrarily small if the overall size of the system is known in advance).
But why bother for IPv6? Why not allocate a proper prefix? There is very little difference between that and a (partially) randomly generated one. The control (and weak security) of IPv4 private addresses is derived mostly from the necessity to use a NAT proxy to gain any access to the proper Internet, not from ambiguity in itself.
This need can be enforced by any site administrator even with globally unique addresses by not publishing routing tables for the prefix addresses that should not be routable on the Internet either way. What about accidental routing table leaks? To help with this, a prefix, or another distinguishing feature of the address, could be allocated with the convention that Internet routers would not accept routing tables for subprefixes and would drop any packets to any address under that subprefix.
So for example the JA.net UK network have been given the subprefix 2001:0630:0050::/48, and they could have been given 2001:0630:0051::/48 as well, with the idea that they would never publish any routing table for the second prefix; or been given something like fc00:0630:0050::/48 as well, with the idea that no Internet router would route datagrams with addresses under fc00::/16. This would force any node with addresses under 2001:0630:0051::/48 or fc00:0630:0050::/48 prefixes to route via a proxy under 2001:0630:0050::/48 to communicate with the Internet, with the proxy having full flexibility on how to remap (NAT) either internal prefix to the externally routable one.
Of course this would have been possible with IPv4 too, and no need would have arisen for the IPv4 ambiguous private address space, except that it would have required the allocation of many more addresses. Private addresses in IPv4 have served mainly to conserve publicly routable address allocations, not to create isolated intranets, and given that IPv6 has no scarcity of address space, private addresses of any form are not needed to create isolated intranets in IPv6.
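The "known probability of failure" of random identifiers in the quoted conclusion above can be estimated with the usual birthday bound. This is a rough sketch, assuming 40 random bits per site prefix (the size of the Global ID in RFC 4193, which is not stated above); the helper name collision_probability is made up.

```sh
# Birthday-bound estimate of the chance that at least two of N sites
# pick the same random prefix of BITS random bits.
# BITS defaults to 40, an assumption taken from RFC 4193.
# 'collision_probability' is a hypothetical helper name.
collision_probability() {
  awk -v n="$1" -v bits="${2:-40}" 'BEGIN {
    pairs = n * (n - 1) / 2               # number of distinct site pairs
    printf "%.2e\n", 1 - exp(-pairs / 2 ^ bits)
  }'
}

collision_probability 1000
```

For 1000 interconnected sites this gives a probability of about 4.5 in ten million, which is the sense in which such prefixes are "mostly unambiguous".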
060917 Weird and not so weird long standing bugs
I was chatting with someone about software reliability and the free software development process, based on the social definition of works, and I gave some examples of long standing bugs in software that I use:
  • In XEmacs, often, after doing a regular expression based search and replace, the next search and replace does not match its target. If I repeat it identically it does.
  • In KDE's Konsole, after I have used it for a while, the 8 key stops having an effect.
  • In Fedora the Yum dependency manager as a rule does not clean after itself, even if one uses the clean all command, sometimes leaving hundreds of MiBs of data in its cache.
Why do these happen? Well, I guess because few people have encountered them before, or in the case of Yum, most hardly notice the issue. Why do they happen to me? Probably because my usage patterns are a bit different from the social average. My XEmacs configuration file is large, and I use Konsole with non default options, for example.
Why haven't I fixed them yet? Because they are not fatal errors, and I don't like to maintain my own patches, and submitting them to the official maintainers is often futile. Or even worse than futile if they contain design fixes and improvements, because they are perceived (entirely understandably) as criticism, resulting in very time consuming discussions.
060916 Large IPv6 datagram sizes and why they matter
In a previous entry I mentioned that one advantage of IPv6 is that it consolidates some useful extensions added as options to IPv4 over the years, most importantly large TCP6 window sizes, which are very useful for getting better TCP performance on high bandwidth and high latency links. But there is a bigger story as to that: it adds one option to permit IPv6 datagrams, that is the MTU for TCP6 and UDP6, to be larger than 64KiB. This is important for very high speed links, for example 10Gb/s links, whether regional, continental or intercontinental, and not just for high speed and high latency ones.
To fully utilize a 10Gb/s link with 64KiB datagrams one must be able to route 20,000 packets per second (assuming that one datagram is in a single packet, which will be the rule in IPv6), roughly 50 microseconds per 64KiB, no matter what the latency is. That can be sort of expensive, because routers and network interfaces have two limits: the number of bytes per second they can shift, and the number of packets per second. Now the latter is often more of a limit than the former, because it requires expensive operations like interrupts, flushing caches etc., so on very high bandwidth links it is useful to have large packet sizes.
The same issue arises on a smaller scale with already common 1Gb/s networks and cheap cards and switches. Most cheap 1Gb/s cards and switches can only process Ethernet frames up to 1500B, which unfortunately does not allow actual transmission speeds anywhere near the 1Gb/s limit. One should use jumbo frames of 9000B instead. Note however that since jumbo frames are an Ethernet extension they don't work that well unless all the connected equipment supports them, while IPv6 jumbograms are part of the specification, so they should be handled (even if not supported, because there is no requirement to support packets larger than 1280B) by all IPv6 implementations.
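The packet rates above are easy to check. Here is a back of the envelope sketch that ignores framing and header overheads (so the figures are upper bounds); the helper name packets_per_second is made up for the example.

```sh
# Packets per second needed to fill a link, and time per packet.
# Arguments: link speed in bits/s, packet size in bytes.
# 'packets_per_second' is a hypothetical helper; overheads are ignored.
packets_per_second() {
  awk -v bps="$1" -v size="$2" 'BEGIN {
    pps = bps / 8 / size
    printf "%.0f packets/s, %.1f microseconds each\n", pps, 1000000 / pps
  }'
}

packets_per_second 10000000000 65536   # 10Gb/s with 64KiB datagrams
packets_per_second 1000000000 1500     # 1Gb/s with standard frames
packets_per_second 1000000000 9000     # 1Gb/s with jumbo frames
```

The first line comes out at about 19,000 packets per second and 52 microseconds each, in line with the rounded figures above; at 1Gb/s standard frames need about 83,000 packets per second against about 14,000 for jumbo frames.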
As discussed in RFC 1263 extensions are a somewhat dangerous activity: because while they preserve backwards compatibility, they also introduce excessive variation and are prone to incomplete or poor implementation. RFC 1263 offered the design of new updated protocols as an alternative. It turned out that this was impractical in the short term, but with IPv6 it demonstrated
Thus one obvious way of lowering the per-packet overhead costs is larger packets. There are two big problems with larger packets though: the first is that adaptive packet length configuration, called path MTU discovery, does not work well in IPv4 (largely because it was a later extension to IPv4 and many routers are misconfigured for it), and the other is that IP datagram size in IPv4 is limited to 64KiB. These issues have become increasingly frustrating and are discussed forcefully at the large MTU advocacy site, in which MTU and datagram sizes of several hundred KiB are advocated for 10Gb/s links, with a target of around 500 microseconds per packet, or 2,000 packets per second.
And this is where IPv6 has the advantage: it is the only widely available protocol that supports large datagrams and path MTU discovery (as well as large window sizes and UDP datagrams) built in and not as extensions, in large part, ironically, because IPv6 has been so long in coming.
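As a quick check of the "several hundred KiB" figure, this sketch computes the packet size that occupies a link for a given time, again ignoring overheads; the helper name size_for_time_kib is an assumption.

```sh
# Packet size (in KiB) that takes a given time on the wire.
# Arguments: link speed in bits/s, target time in microseconds.
# 'size_for_time_kib' is a hypothetical helper name.
size_for_time_kib() {
  awk -v bps="$1" -v us="$2" 'BEGIN {
    printf "%.0f\n", bps / 8 * us / 1000000 / 1024
  }'
}

size_for_time_kib 10000000000 500   # 10Gb/s, 500 microseconds per packet
```

This gives about 610KiB per packet, roughly ten times the 64KiB limit of IPv4 datagrams.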
060915 Thousands of queries per second for private addresses
NAT and private IPv4 addresses (for example those in the 10.0.0.0/8 range) are a crime, and an additional reason why is in this FAQ about the blackhole DNS servers:
The blackhole servers generally answer thousands of queries per second. In the past couple of years the number of queries to the blackhole servers has increased dramatically.
It is believed that the large majority of those queries occur because of "leakage" from intranets that are using the RFC 1918 private addresses. This can happen if the private intranet is internally using services that automatically do reverse queries, and the local DNS resolver needs to go outside the intranet to resolve these names.
For well-configured intranets, this shouldn't happen.
Usual problem here: the average network administrator will not waste his precious time figuring out how to set up reverse mappings for the private address range he uses, especially if someone else pays the cost. Also, this reminds me of an article on Internet root DNS servers reporting that one of the most common queries to them was for the top level domain WORKGROUP (never mind all those desktops with MS Windows 2000 or later trying to register themselves via DNS in the root servers). The lesson learned by Michael Stonebraker long ago about Ingres catalog indices:
Users are not always able to make crucial performance decisions correctly. For example, the INGRES system catalogs are accessed very frequently and in a predictable way.
There are clear instructions concerning how the system catalogs should be physically structured (they begin as heaps and should be hashed when their size becomes somewhat stable). Even so, some users fail to hash them appropriately.
Of course, the system continues to run; it just gets slower and slower. We have finally removed this particular decision from the user's domain entirely. It makes me a believer in automatic database design (e.g., [11]).
has been lost in time, like so many others.
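For reference, the reverse queries that leak to the blackhole servers ask for names built from the address with its octets reversed. A small sketch of that construction for IPv4 follows (the helper name reverse_name is made up):

```sh
# Build the in-addr.arpa name that a reverse (PTR) lookup asks for.
# 'reverse_name' is a hypothetical helper; input is a dotted-quad IPv4.
reverse_name() {
  echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 ".in-addr.arpa" }'
}

reverse_name 10.1.2.3
```

For 10.1.2.3 this prints 3.2.1.10.in-addr.arpa, so a local resolver that is authoritative for the 10.in-addr.arpa zone would answer such queries itself instead of sending them outside the intranet.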
060914b Game load times, fragmentation; reporting to base
Another fascinating quote from the previously mentioned interview with Valve is about poor performance in loading levels due to poor filesystem locality (which matters under GNU/Linux too):
For an example of that on the customer side, we want to improve performance. The engineers said: "Rather than guessing what's bottlenecking performance, let's go and measure what's actually going on." We instrumented all the Steam clients, and the answer was surprising. We thought that we should go and build a deferred level-loader, so that levels would swap in. It turned out that the real issue was that gamers' hard drives were really fragmented, and all of the technology we wanted wouldn't have made a difference, as we were spending all our time waiting on the disk-heads spinning round.
Note also the implication: installing Valve games (at least via their Steam service) means installing some software that does an exhaustive scan of the hard disk and reports its contents to Valve. This is most likely well advertised in the license.
060914 Episodic game content four times cheaper to develop
Noticed an interesting bit of an interview with the CEO of Valve, the developers of the Half-Life series, about cost of development of major game titles:
The solution that we're trying is to break things into smaller chunks and to do them more regularly. So far, it seems to be working. When we look at how long it took us to build a minute of gameplay for Half-Life 2, versus how many man-months it takes us to build a minute of gameplay for Episode One or Episode Two, we seem to be about four times as productive. But we'll go through all three episodes to see... We sort of made a commitment to do it three times and then assess.
That four times more productive is a big thing. It reminds me of claims by Mark Rein of Epic that games based on their Unreal engine development tools only require about 15 developers. My impression is that the case they are making is not about episodic content or development tools, it is that games that are really mods, reusing a lot of the engine and art of previous games, are much cheaper to develop than games developed from scratch. Which probably is possible because most platforms have sort of leveled in terms of functionality, so game engines don't have to be rewritten from scratch frequently, and thus can become somewhat stable platforms. Never mind middleware, whether third party libraries or tools: a game engine is in effect its own middleware, when used as a modding platform.
The business implications are tremendous too: Valve made a lot of money with Half-Life and Half-Life 2, which were both developed from scratch and retailed for around $40-50. Now they are releasing episodes that retail for around $20-30 and where a major component of development cost is one fourth. Developing mods can be very profitable. Especially when they release them via Steam: they get the whole price, instead of a fraction of it when released retail. No surprise that Valve is rather cautious when talking about Steam:
What is the split between sales of Episode One via Steam and boxed sales?
Gabe Newell: That isn't something that we've talked about. It's something we're keeping to ourselves.
So, how do you manage your relationship with EA when you're selling games via Steam?
Gabe Newell: Our relationship with EA is fine. I think that retailers are really frightened of these kinds of changes in the industry, and I think that we're learning stuff that is going to be very important for them. For example, Steam enables new ways of doing promotion: [ ... ] I think retailers are starting to understand that communicating more efficiently with customers is a way, not of taking money away from them, but of driving people into stores. It's not a way of cutting them out of the equation.
Fabulously disingenuous argument about promotion: the goal of online game delivery is not to promote games, for which one can do web sites and demos; it is to cut out the middleman as much as possible, and that is also why massive online games like World of Warcraft are so enormously profitable for game developers (and most are not available for consoles, where the console brand owners really insist on getting a large cut). If online content becomes even more popular then games publishers will be reduced to the role of marketing and PR agencies, not resellers, and the other role that they perform, project venture capitalists, will probably split off into independent entities.
Another point he makes is about how dominant in business terms is World of Warcraft:
Right now, I think the benchmark game in the industry is World of Warcraft, and every platform could be measured against its ability to give advantage, or fail to give advantage, to building a better World of Warcraft.
a notion that I have previously discussed as that game probably explains a lot of the falling sales of other PC games. As to that, I suspect that if World of Warcraft were available on consoles, the sales of other console games would be impacted too.
060913b Socket duty cycles matter
I am quite pleased to see that the designers of FireWire considered the important mechanical problems in plugging and unplugging cables, something that not everybody understands is a significant issue:
One of the primary features is that the moving parts were all in the *plug* (all connectors have moving parts ... those little springy bits that apply pressure and make sure the connection is good and tight). This way the part of the connector/socket system that wears out is in the cable. When something goes wrong (as it will in any mechanical system), you throw out the cheap, easily replaceable component ... i.e., the cable.
I was talking about this recently again in the context of a large scientific facility, which expects a lot of visiting scholars for relatively short periods of time. My argument was that for the user facing part of the network a wireless network is a good idea (security can be sorted out), simply because it avoids a lot of plugging and unplugging of cables by hasty and not very careful people, which is especially damaging as the number of insertion cycles for which many sockets are designed is pretty low.
In general an important difference in the quality of different connectors is their design insertion cycle limit. This can be for USB, or RAM, or CPU, or PCI, AGP sockets. Low quality connectors can break even after only half a dozen insertions. It is one of those aspects of quality that only careful buyers consider. Some people think that one is not going to do that many insertions, for example into RAM sockets. But testing a defective RAM stick can change that calculation quite a bit, and so on for upgrades.
060913 Astonishing FireWire security vulnerability
I was reading the Wikipedia article on FireWire (also known as IEEE 1394 and i.Link) and I was astonished to belatedly learn that most FireWire host adapters have a large security issue: by default they allow any device to read or write any address of main memory. This does not seem grave, because after all usually it is the kernel that sends commands to the device instructing it where to read and write memory, in other words operations are initiated by the kernel. However, FireWire is essentially a peer-to-peer system, where operations may be initiated by any connected device; indeed it can be used to connect two (or more) systems directly, as if it were a network link. In which case they have full access to each other's memory, which is a bit too loose.
It is also amusing that this little issue has been proactively turned into a special case advantage: a FireWire link can be used as a memory debugging tool:
This feature can also be used to debug a machine whose operating system has crashed, and in some systems for remote-console operations. On FreeBSD, the dcons driver provides both, with using gdb as debugger. Under Linux, firescope and fireproxy exist.
These are classic examples of the if life gives you lemons, make lemonade principle.
FireWire still remains vastly preferable to USB2, as it is a much better defined protocol with much more reliable implementations, notably from Oxford Semiconductor (but also arguably very bad ones, for example the notorious Prolific PL3507). It is a pity that Apple decided to start asking for licensing fees on FireWire, triggering others to do so, which has made Intel and others design and adopt USB2.
060911 IPv6 options for IPv4 interoperability
Talking again about IPv6, what are the options for IPv6 connectivity? Well, many, but most are obsolete or not very practical. IPv6 adoption is ever rising (slide 6), but mostly for corporate or government use. Overall the practical methods of getting IPv6 connectivity for home users are:
  • Direct IPv6 support by the ISP: Some ISPs support it directly, for example in the UK there are Andrews & Arnold at a retail level, Bytemark for web hosting, and UK6X (an offshoot of BT) at a wholesale level. ISP direct support in theory allows configuring an IPv6 only system, but since most servers are still IPv4 only, it is necessary to configure a dual stack as those ISPs that do support IPv6 rarely also provide an IPv4-to-IPv6 proxy.
  • IPv6-in-IPv4 using protocol 41: this can use the convenient 2002 prefix encapsulation for IPv4 addresses and then it is usually autorouted or explicitly configured and routed via a tunnel broker. Except in rare cases it is not supported by consumer grade ADSL/cable gateways, but can be used if one connects to the Internet via a simple PtP connection via a modem (either a phone or ADSL one). Autorouting usually results in poor performance, so registering with a 6in4 tunnel broker is usually better.
  • IPv6-in-UDP (or TCP or SCTP, but UDP is much better), which means using the Teredo protocol, supported natively by MS Windows and by Miredo under GNU/Linux, or the AYIYA scheme, supported by AICCU. These schemes rely on the fact that UDP must work for Internet access to happen. The overhead of encapsulating both within UDP and IPv4 is regrettable, but buys simplicity.
Thus for most home users Teredo or AYIYA will be the only choice; for those using direct modem connections or a web hosting ISP, protocol 41 encapsulation will be easiest, usually via a tunnel broker, but sometimes autorouted is good too. The very lucky ones have ISPs and web hosts that support IPv6 directly, and an ADSL/cable gateway that routes IPv6 too.
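As to the 2002 prefix encapsulation, the mapping is simple enough to sketch in a few lines of Python: the 32 bits of the public IPv4 address are placed right after the 16 bit 2002 prefix, giving a /48 per IPv4 address. The function names here are mine, and 192.0.2.1 is just a documentation address, not a real endpoint:

```python
import ipaddress


def sixtofour_prefix(ipv4_address):
    """Return the 2002:WWXX:YYZZ::/48 prefix derived from an IPv4 address."""
    v4 = int(ipaddress.IPv4Address(ipv4_address))
    # 16 bits of 0x2002, then the 32 bits of the IPv4 address, then zeroes.
    prefix = (0x2002 << 112) | (v4 << 80)
    return ipaddress.IPv6Network((prefix, 48))


def embedded_ipv4(ipv6_address):
    """Return the IPv4 address embedded in a 2002::/16 address, else None."""
    return ipaddress.IPv6Address(ipv6_address).sixtofour
```

So a host with the IPv4 address 192.0.2.1 gets the whole of 2002:c000:201::/48 to number its own networks with, and any router can recover the IPv4 tunnel endpoint from the IPv6 address alone, which is what makes autorouting possible.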
060910b QoS shaping and a nice paper with a strange setup
As the author of sabishape I was interested to find a nice detailed discussion about shaping and QoS. The author makes some useful points, for example about jitter limitation requiring a far higher rate of examination on low bandwidth links, and it is somewhat surprising how high a rate is needed:
The only timer source of use at SME bandwidths are the high performance timers available in post-PentiumPro CPUs. These allow bandwidth estimation and policying at speeds lower than 64K on FreeBSD (with the HZ raised beyond 2KHz) and lower than 128K on Linux (at 1KHz HZ).
and that high overhead, interrupt-per-packet cards are better for that purpose:
At the same time, cards that are considered "horrid" like Realtek 8139 (rl) provide much more interrupts and much more scheduling opportunities. As a result they provide considerably better estimator precision and policy performance. The difference is especially obvious on speeds under 2MBit. It is nearly impossible to achieve a working ALTQ hierarchy where some classes are in the sub-64Kbit range for a 2MB (E1) using Intel EtherExpress Pro (fxp). It is hard, but possible on a Tulip (dc). It is trivial to do this using Realtek (rl).
Linux estimator entry points differ from BSD. As a result, the effects of hardware are less pronounced and system is more dependant on the precision of the timer source. Still, the same rules are valid for Linux as well. QoS on low bandwidth links (sub-2MB) cannot be performed on server class network hardware.
The reason is that if we want to be sure that the maximum bandwidth used is under a certain level, the more variable it is, the lower the average must be to ensure the limit is not crossed. The overall discussion is quite agreeable, even if I think that:
In reality the diagram is likely to contain 16-20 classes for an average company or 5-10 classes for an average home office network.
exaggerates a bit. Because really one can do one class per source of traffic (e.g. sharer of the link) and three classes per type of traffic, like low, medium, high. More classes just make the situation more complex and don't really work that well especially if there is little bandwidth to share.
Then I was a bit surprised by this point:
HTB is not a good choice for an SME or hobby network. Bandwidth will not be utilised fully and the link efficiency is considerably worse. Its only advantage is that its "ease of understanding" and "predictability" are easier to express in a subletting agreement.
HTB is a looser policy than CBQ (both also described in this Linux specific HOWTO) but it should not necessarily lead to lower utilization, if one specifies the ceiling for each class as the limit for the whole link.
I also found the sample configuration a bit odd because it shapes incoming traffic (where shaping is really necessary, and yet not so effective) only when it is going out again, in part because the assumption is that no traffic goes to the host doing the shaping, only to hosts behind it. But still it is quite surprising, because incoming traffic cannot really be shaped by queueing, only by dropping packets.
Shaping incoming traffic by queueing it at the time of sending it on only has the effect of creating large queues on the shaping host, which eventually fill up and cause packets to be dropped, but likely in a less favourable pattern than if the dropping is done periodically by the ingress discipline. The goal of dropping packets on incoming is in effect to simulate a congested link and thus trigger quenching at the source; dropping packets regularly as ingress does simulates a constant, steady bandwidth limit. But letting incoming packets queue up and then dropping them as the queue becomes full simulates a full bandwidth link that occasionally becomes very congested, and congestion control algorithms don't react well to that. I'll ask the author of the site about his reasons for that.
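To make the difference concrete, here is a minimal sketch of policing by dropping, using the usual token bucket idea. It is my own illustration, not the code of any actual ingress discipline, and the class name, the rate and the burst figures are all assumptions:

```python
class TokenBucketPolicer:
    """Accept or drop packets on arrival; nothing is ever queued."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = float(rate_bytes_per_s)
        self.burst = float(burst_bytes)
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = None                  # arrival time of previous packet

    def accept(self, now, size):
        """Return True to pass a packet of 'size' bytes arriving at 'now'."""
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # dropped at once, so the sender sees the loss early


def police(arrivals, rate_bytes_per_s, burst_bytes):
    """Apply a fresh policer to a list of (time, size) arrivals."""
    policer = TokenBucketPolicer(rate_bytes_per_s, burst_bytes)
    return [policer.accept(now, size) for now, size in arrivals]
```

With a 1000 bytes/s limit and 1500 byte packets arriving every half second, `police([(i * 0.5, 1500) for i in range(10)], 1000, 1500)` passes every third packet and drops the others, that is the drops are spread evenly over time instead of arriving all together when some queue finally overflows.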
060910 Using IPv6 for a large site
I was recently discussing with some interesting people the use of IPv6 for a large scientific site with a vast internal network infrastructure and both internal and external users. The main advantages of IPv6 are:
  • Lots of addresses. Never do NAT.
  • Better autoconfiguration, especially nice for temporary and mobile clients.
  • Better performance over high bandwidth and/or high latency links (jumbograms, incorporates the common TCP extensions).
  • Fairly easy to convert applications to use it, alone or with IPv4, by using getaddrinfo.
  • Cheaper to process in high traffic routers.
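As to converting applications with getaddrinfo, the idea is that the application never mentions an address family: it asks for all the addresses of a name and tries them in turn. A minimal sketch in Python, where the helper names and the choice to try IPv6 addresses first are my own assumptions (the standard library's socket.create_connection already does something similar, in the order that the resolver returns):

```python
import socket


def prefer_ipv6(infos):
    """Put IPv6 results first, otherwise keeping the resolver's order."""
    # sorted() is stable, and False sorts before True.
    return sorted(infos, key=lambda info: info[0] != socket.AF_INET6)


def connect_any(host, port, timeout=5.0):
    """Connect over TCP to the first address of 'host' that works."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in prefer_ipv6(infos):
        try:
            sock = socket.socket(family, socktype, proto)
        except OSError as error:
            # This address family is not available on this host.
            last_error = error
            continue
        try:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock
        except OSError as error:
            last_error = error
            sock.close()
    raise last_error or OSError("no usable address for %r" % (host,))
```

The same code then works unchanged on IPv4-only, IPv6-only and dual stack hosts, which is the whole point.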
The main disadvantages are:
  • It is completely incompatible with IPv4, which means that various encapsulation options are needed to pass IPv6 traffic through IPv4-only networks.
  • Its security architecture is based on IPSEC which is complicated.
  • Minimum packet length is somewhat longer.
  • Addresses are quite long to write.
  • Having globally unique addresses without NAT makes it much easier to track usage by PC, which often means by user.
  • The dynamic autoconfiguration abilities are sort of pointless if one has to configure the same host for both IPv6 and IPv4 operation, because then the IPv4 must be configured explicitly anyhow.
There are also a couple of serendipitous security advantages:
  • It is possible to assign IPv6 addresses where the lower 64 bits are randomly generated, thus creating a fairly strong measure of security via sparse capabilities against networking scanning attacks.
  • More weakly, given the relative scarcity of IPv6 targets, security attacks against IPv6-only hosts are rather more unlikely than against IPv4 hosts.
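The first of those two points is easy to illustrate. A minimal sketch, assuming a /64 site prefix (the documentation prefix 2001:db8:1:2::/64 below is just an example) and with function names of my own invention, which picks a random lower 64 bits and estimates how long an exhaustive scan of one such subnet would take:

```python
import ipaddress
import secrets


def random_host_address(prefix):
    """Return an address in the /64 'prefix' with random lower 64 bits."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("expected a /64 prefix")
    interface_id = secrets.randbits(64)
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


def years_to_scan(probes_per_second):
    """Years needed to probe every address of a single /64 subnet."""
    seconds = 2 ** 64 / probes_per_second
    return seconds / (365.25 * 24 * 3600)
```

Even at a million probes per second `years_to_scan(1e6)` comes out at several hundred thousand years, which is why the address itself works as a sort of sparse capability, as long as it is not published somewhere like the DNS.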
Overall IPv6 is a win, except for the lack of connectivity at ISP level. There is also a peculiar advantage: MS Windows Vista is by default dual IPv6 and IPv4, and IPv6 connections are attempted first if an IPv6 address can be discovered for the target. This will cause some delay if there is no IPv6 connectivity at the source. So better to have it anyhow.
There are a few distinct IPv6 based deployment scenarios (some discussed in RFC 4057), and at the time my opinion was that especially in a large scientific environment one might as well have all servers and clients and services on both IPv6 and IPv4. Well, on reflection it would be better to have all internal servers (and clients) to be IPv6-only, and have dual IPv6 and IPv4 only on externally visible hosts. Because IPv6-only adds the extra serendipitous security described above, and is feasible, and only requires one configuration.

August 2006

060828 A new 'init' design from Ubuntu
Having recently discussed the init dæmon it is pleasing to see that the Ubuntu project is replacing the standard System V style one with a redesigned one called upstart for which they have written a neat paper describing it and comparing it with similar projects like OpenSolaris SMF, initng or Apple's launchd.
My first impression is that most are a bit misguided because most attempt to unify a too wide notion of service management, merging the functionalities of init, cron and inetd (and more). Never mind that launchd uses XML for configuration files...
This is quite incorrect, especially as to including the inetd functionality, because it has been traditional to allow a UNIX like system to startup without any networking, for various reasons.
Then there is the non trivial problem that there is essentially no precedent in UNIX like systems for service dæmon management: traditional init (almost) just runs enough scripts to reach some state of readiness and back at shutdown, cron just runs commands, not dæmons, and inetd really manages sockets, not services.
However something good may come out of this mess. Even if I detect signs of bad news. For example for upstart:
In fact, any process on the system may send events to the init daemon over its control socket (subject to security restrictions, of course) so there is no limit.
which seems to indicate that it is yet another mess like udev. And it was pretty scary to read that some people wanted to integrate upstart with D-Bus, another mess. Overall I think initng is the more UNIX like solution, simple and dependency based, but event based upstart may be not too bad, even if it is already decided that it will use /etc/event.d as its configuration directory (UNIX style is not to have a .d suffix on directories). Let's hope that the socket is actually a named pipe, at least.
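By dependency based I mean nothing more complicated than starting each service after the ones it needs. A minimal sketch of that idea in Python (3.9 or later, for graphlib); it is not how initng is actually implemented, and the service names in the usage note are hypothetical:

```python
from graphlib import TopologicalSorter


def start_order(dependencies):
    """Return an order in which every service starts after its dependencies.

    'dependencies' maps each service name to the set of services that it
    needs to be running first. A dependency cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For example with `{"syslog": set(), "network": {"syslog"}, "sshd": {"network", "syslog"}}` the order returned puts syslog first, then network, then sshd, and the shutdown order is simply the reverse. An event based design has no such static order to inspect, which is part of why I find it harder to reason about.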
060825 Much better performance with direct HyperTransport interface
Quite impressive numbers in an article about network messaging comparing a cool InfiniPath HTX card with the same with a PCI-E interface; performance with the direct HyperTransport interface is several times higher in both throughput (10 times more messages/second) and latency (3 times lower) than with a PCI-E 8-lane interface. It is also interesting that HyperTransport now comes with a PCI-style slot design called HTX as well as the more familiar AMD style CPU socket interfaces.
AMD or Intel CPUs no longer have a coprocessor interface, unlike MIPS style CPUs, but AMD CPUs have this great HyperTransport interface that is good enough (in throughput and latency) to support inter-CPU communications in a multiprocessor, and can be used also by non-CPU chips as a general system bus, which can allow all sorts of tightly coupled coprocessors, network oriented or otherwise. Many years ago for example Weitek had a line of extra-performance x87 socket compatible floating point coprocessors. One can now imagine putting all sorts of things like that in the HyperTransport capable, or more probably HTX capable, AMD64 motherboards.
060822 AMD hopes for large market share in servers
One of AMD's vice-presidents has stated that he expects AMD to reach 40% market share in servers by 2009, which sounds fairly plausible. If AMD want to grow, they have to do that: laptop sales currently outnumber desktop and server sales in most countries, and Intel's overwhelming market share on laptops seems unassailable, especially after the release of the Core 2 series. On servers AMD may have the advantage, because its on-chip links for the HyperTransport bus, part of the Direct Connect Architecture, give it a considerable advantage for 4-chip and 8-chip systems, which do not therefore require the SMP chipsets that Intel CPUs need.
Part of the story is that until relatively recently Intel's market strategy for servers was based on the Itanium architecture whose lack of success has given a large opportunity to AMD. Conversely laptops have traditionally been mostly purchased by corporations, and corporations tend to prefer Intel over AMD chips, simply out of retentive habits.
It could well happen that Intel will end up dominating the laptop market, AMD the server market, and both will continue to sell in the shrinking desktop market, with the advantage to Intel as usual, mostly thanks to their new focus on low cost, integrated graphics, systems.
060821c The 32-bit Linux real and virtual memory boundaries
This is a little detail that should be probably discussed a bit more widely. The Linux kernel usually maps in each process address space both the virtual memory for that process and the whole of the system's real memory, as well as a workarea for itself. This means that on CPUs with 32 bit addressing, the total sum of virtual memory and real memory and system work area cannot exceed 4GiB.
The default system workarea is 128MiB, and the default per process virtual memory space is 3GiB, so the default real memory mapping window is 896MiB. What happens if the system has more than 896MiB? Well, either the excess gets ignored, or high memory support is enabled, and then real memory above 896MiB is mapped temporarily in a subwindow of the per-process kernel memory area of 128MiB, which involves a modest slowdown and some complications.
My reckoning is that up to a point it is better to map more real memory and have less per-process address space. The boundary between the two can be changed by defining the CONFIG_PAGE_OFFSET setting in the .config file when building the kernel, or for older kernels by redefining the macro __PAGE_OFFSET in the kernel header include/asm-i386/page.h before building it. The default is 0xC0000000 and the values that I think are sensible are:
Kernel process/real memory space boundaries

Boundary    Process space            Kernel space  Real memory window
0xB8000000  2944MiB (3GiB-128MiB)    128MiB        1GiB
0xC0000000  3GiB                     128MiB        896MiB (1GiB-128MiB)
0x98000000  2432MiB (2.5GiB-128MiB)  128MiB        1.5GiB
0x78000000  1920MiB (2GiB-128MiB)    128MiB        2GiB
0x38000000  896MiB (1GiB-128MiB)     128MiB        3GiB
Of these I think that the most useful is 0x78000000, where the real memory map window is 2GiB and the per process address space is just under that, as this allows direct mapping of most desktop real memory sizes, and the 1920MiB per process address space is still pretty large, and it still allows the full amount of real memory to be used up by a single process.
The least useful is the default 0xC0000000, because unless high memory is enabled it wastes 128MiB of a by now common 1GiB real memory endowment, for the dubious benefit of a 3GiB per process address space. More useful would have been 0xB8000000 as at least it allows full mapping of the 1GiB, at the insignificant cost of 128MiB less of per process address space.
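The arithmetic behind those boundaries is just a subtraction from the 4GiB of a 32 bit address space. A small sketch in Python that recomputes the split for a given boundary, assuming as above a fixed 128MiB kernel workarea (the function and constant names are mine, not the kernel's):

```python
MIB = 1 << 20
ADDRESS_SPACE = 1 << 32          # 4GiB of 32 bit addressing
KERNEL_WORKAREA = 128 * MIB      # assumed fixed, as discussed above


def split_for(page_offset):
    """Return (process space, kernel workarea, real memory window) in MiB."""
    process_space = page_offset
    real_memory_window = ADDRESS_SPACE - page_offset - KERNEL_WORKAREA
    return (process_space // MIB,
            KERNEL_WORKAREA // MIB,
            real_memory_window // MIB)
```

So `split_for(0xC0000000)` gives `(3072, 128, 896)` for the default, and `split_for(0x78000000)` gives `(1920, 128, 2048)` for the boundary that I prefer.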
060821b Another chip with 16 CPUs
Having just seen an interesting 16-CPU chip it is not that surprising to discover that Boston Circuits are planning another 16-CPU chip, with some hardware queue manager and some special hardware IPC:
BCI's gCORE processors employ a new Grid on Chip architecture which arranges system elements such as processor cores, memory, and peripherals on an internal "grid" network. gCORE is one of the first commercial applications of "network on chip" technology, that has been at the forefront of research at leading universities in recent years. It is widely accepted that traditional bus architectures are no longer valid for large scale system on chip implementations in the sub 90 nanometer geometries. Traditional buses become too large, and too slow to support 16 or more processor cores. Ease of use has been one of the biggest obstacles for the widespread adoption of multi-core processors. BCI has taken an unique approach of incorporating a "Time Machine" module in the chip to dynamically assign tasks to each of the processor cores. By alleviating the need to explicitly program each core, this approach greatly simplifies the software development process.
This chip is also designed to run Linux. Not surprising: just as the ready and cheap availability of UNIX source licenses considerably reduced initial cost to develop new minicomputers and workstation systems in the 1980s, Linux has done the same for small servers and PCs and even desktop boxes and ADSL routers in the 1990s.
060821 15 years of Linux
Linux® was famously announced 15 years ago as a hobby project. As discussed by Linus Torvalds and Red Herring in this interview it is also now a big business, especially for startups. I have been using it almost as long, since I was doing my degree.
In the computer industry 15 years is a long time, and it is even more remarkable that Linux is a compatible reimplementation of the UNIX® kernel, which has been around for over 30 years. But August 2006 is not only the 15th anniversary of Linux, but also the 1st of the closure of the UNIX department at AT&T. Indeed my previous OS was Dell's UNIX System V.4, and while it was not bad, GNU/Linux was rather more flexible, as well as free to modify and enhance, which is a big advantage. Perhaps Plan 9 should have become popular in its place; but at the time the license did not allow it. I was also fond, and perhaps fonder, of FreeBSD but at the time it was the target of a license case from AT&T, and anyhow its Berkeley-style license does not reward contributions as much as the GPL which means that I have easily resisted the temptation to switch to a *BSD distribution even if I think that they are technically more elegant (except for their package managers).
There have been several other UNIX kernel compatible reimplementations, and several other free kernels some of them quite unlike UNIX, but interesting. But for now and the near future it is going to be Linux (and a few other UNIX compatible reimplementations) for most of those who choose a Microsoft alternative.
060815 A chip with 16 CPUs
Given my long standing interest in parallel and vector processors, I was delighted to see an article on Movidis, a company selling a server based on a processor with an unusual set of tradeoffs: while clock speed is limited to around 600MHz, the chip has 16 processors, and a total power consumption of around 30W. If the 16 processors delivered under load processing power at say a 50% efficiency and thus 8 times the power of a single processor, the chip would have then equivalent performance to a 5GHz processor, which is pretty remarkable.
The CPU is the OCTEON CN3860 which seems to have been designed by networking company Cavium for multiple-packet-flow handling networking appliances.
With a few others I had expected processor chips to evolve in two different branches as silicon budgets have gone way up in the past 10-15 years, one being to use the budget for ever more complicated single-CPU chips with bigger caches, for backwards compatibility, and another branch of chips using all those transistors to put a whole multiple-CPU system on a chip.
But multiple-CPU chips have only started happening relatively recently, and they have been about a few complicated CPUs on a chip, not many simpler ones. The OCTEON family is one of the few chips with a multiple-CPU architecture that goes for numbers. The others I have seen so far are Sun's 8-CPU UltraSPARC T1 and the rather more specialized 24-CPU (soon to become 48) Vega from Azul Systems.
What can a MIPS-architecture based 64 bit general purpose chip be best used for? Well, to process multiple streams of network packets, as Cavium does, or of media, which is what Movidis initially positioned it for. As to the latter I was a bit perplexed because I designed and delivered a streaming video media server years ago with standard PC parts and it did not really need a lot of CPU power, and the first instance could easily cope with 20 simultaneous MPEG1 streams with a Pentium 100MHz chip. Media servers seem to me more disk bound than network bound, and not much CPU bound.
But as a general purpose processor it is really quite good for workloads with many users, and that for example means web servers. Which suits the somewhat network-oriented tradeoffs in the CN3860, as the 16 CPUs on this chip have no floating point coprocessor, and only 32KiB of level 1 instruction cache and 8KiB of level 1 data cache on them, and 1MiB of shared level 2 cache, which might be equivalent to 64KiB per CPU (but probably more).
For example currently a number of web hosting companies already use various software based virtualization techniques like Linux VServer, UML or Xen to provide 8-16 virtual partitions per hosting PC, and with an OCTEON (or UltraSPARC-T1) CPU they can provide a real CPU per user, which solves a number of issues, and for a lot less watts than an equivalent 3-5GHz x86 style single-CPU chip.
Another good use of such a chip is for build systems, even if compilers would use more cache than that provided on the CN3860. Lack of floating point of course makes it wholly unsuitable for most scientific processing tasks, but another suitable application area is image and data processing and compression, as a lot of relevant algorithms either do not require floating point or can be reformulated in fixed point, and the CN3860 has plenty of precision for that as its CPUs are fully 64 bit.
For multiprocessor scientific computing in-a-box there used to be Orion MultiSystems (cached home page) which sold boxes with 96 single-CPU, low wattage Transmeta Efficeon chips, which however were not that good at floating point.
060810b Finer tagging of text in HTML and the semantic web
Having just written against the way microformats are used to tag data for automatic processing (as the Semantic Web is not quite here yet), I feel like confessing that I use the sensible alternative to microformats, which is finer resolution text tagging. The reason for this is that often there is a need both to put in evidence some parts of speech and to point out types of discourse (usually in the special case, but not only, of levels of discourse, where the type is the degree of abstraction).
A part of speech is a word with a specific semantic role, for example a verb or a preposition. Most parts of speech do not need to be put in evidence because their role is obvious and implicit in the language. HTML already has some tags to indicate some parts of speech, for example abbr and acronym.
However often the parts of speech that are not obvious are proper names, because some proper names are not easy to distinguish from ordinary words (many are derived from them). In many languages there is some convention to indicate a proper name, and the one I use is to capitalize the first letter of each word of a proper name. I also reckon that the cite tag of HTML is the neatest way of tagging proper names in general. So I would write:
<cite>Smith</cite> is an engineer, not a smith.
Then however there are different types of proper names, and I wish often to be able to differentiate them. This is particularly valuable in talking about computer and business related matters, because often different entities, like companies and their products, have the same proper name; and many companies and products have all-lower case or mixed case names, and first letter capitalization cannot be used to indicate a proper name role for the word. For this I use not just the cite tag but also the class attribute to indicate what type of proper name is being tagged, and then a slightly different CSS rule to give them slightly different renderings. For example, I would write:
<cite class="corp">Oracle</cite>'s main product is
<cite class="thing">Oracle</cite>, and only an oracle can
predict whether they will release a similarly named Linux distribution.
because Oracle is the name of a corporate person and Oracle that of a thing. I have tried not to define too many classes of proper names, and currently the values I use for the class of a cite element are corp for corporate persons, thing for objects, place for locations, and uom for units of measure.
A bit less consistently I also use cite to tag bibliographic citations, not just citations of proper names, thus I also have classes author for proper names of authors, title for names of the article or book, part for names of the specific part of a book, which for the name of the issue, publ for publisher names, and date for the date of publication.
As to types of discourse, different abstraction and speech levels often get mixed, for example in quotations. HTML already has a few generic tags for types of discourse: blockquote for large textual quotations, q for smaller quotations, code for quoting code, and a few others.
I tend to use different classes of q to indicate the type of discourse of single words or longer sequences, for example fl for foreign language, nlq for non-literal quotes, and toa for terms of art.
So for example I would write:
A <q class="toa">type of discourse</q> can be recognized
because it must be read in a non-<q class="nlq">plain</q> way
to get the correct meaning: in <cite class="thing">French</cite>
the word <q class="fl">chair</q> does not mean
<q>chair</q>, but <q>flesh</q>.
Apart from being useful to remove ambiguities, the use of finer pseudo-tagging of text has another advantage: it becomes a lot easier to search for stuff in text, for example using tools like sgrep, a version of grep which has matching operators specifically for SGML style syntax.
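To illustrate the kind of search that such tagging makes possible, here is a minimal sketch in Python rather than sgrep. It only assumes the cite and q classes described above; the names TaggedTextFinder and find_tagged are made up for this example, and it uses only the standard html.parser module.

```python
from html.parser import HTMLParser


class TaggedTextFinder(HTMLParser):
    """Collect the text of every <tag class="cls"> element in a document."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.depth = 0        # > 0 while inside a matching element
        self.current = []     # text pieces of the element being read
        self.found = []       # text of all matching elements seen so far

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1           # same tag nested inside a match
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.cls in classes:
                self.depth = 1
                self.current = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.found.append("".join(self.current))

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def find_tagged(html, tag, cls):
    """Return the text of all <tag> elements whose class list contains cls."""
    finder = TaggedTextFinder(tag, cls)
    finder.feed(html)
    finder.close()
    return finder.found


SAMPLE = """<cite class="corp">Oracle</cite>'s main product is
<cite class="thing">Oracle</cite>, and only an oracle can
predict what comes next."""
```

With this, find_tagged(SAMPLE, "cite", "corp") picks out only the company name, while a plain text search for Oracle could not tell the company from the product or from the ordinary word.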
060810 Fonts, antialiasing and low DPI
While trying to help some of the usual people lost in the GNU/Linux fonts mess I pointed out that antialiasing is usually not a good idea for bitmap and well hinted outline fonts (which excludes also those designed to rely on subpixel antialiasing) because it blurs the character boundaries and makes focusing too hard, while the shapes of the characters are already pretty good. But I have to admit that there might be an exception, and I have been impressed by how increasingly common it is: if the display only supports low resolutions like 72DPI or 75DPI then antialiasing might have some merit, as at those resolutions glyphs are really rather pixelated. I always try to make sure that any monitor I use is at 96DPI or 100DPI or even preferably 120DPI, and starting at those DPI well hinted and bitmap fonts look indeed fine (and sharper) without antialiasing.
There are two reasons why 72DPI or 75DPI resolutions are common. The first is that 19 inch diagonal LCD screens are increasingly common, and they almost all have pixel dimensions of 1280x1024 (instead of 1600x1200) and that means 72DPI. Why are LCDs manufactured to such low resolutions? In part perhaps to minimize rejects, but I suspect in large part because of the second reason: many people don't know that they can set font sizes in logical, resolution independent, terms, so that no matter the resolution glyphs stay the same apparent size. So for the many middle aged, computer illiterate people out there a 72DPI screen looks like it has bigger, more readable lettering, even if it is coarse.
Under both MS Windows and GNU/Linux and X it is regrettably somewhat involved to get glyphs scaled by the correct resolution. For X one has to inform the X server of the screen DPI, which can be done:
  • with the option -dpi dpi in the X server's command line (usually X server command lines are specified in the display manager's configuration file);
  • or by specifying the right screen area size in millimeters in the X configuration file's relevant Monitor section's DisplaySize directive.
In addition to this, the specifiers for the requested fonts must either include the same DPI value, or a special value which means to get it from the X server. For native X fonts specified with an XLFD I have written a summary of the issues.
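The arithmetic behind these settings is simple, and a small sketch may help. The Python below is only an illustration under stated assumptions: square pixels, 25.4mm to the inch, and 72 points to the inch; the function names are invented for this example and are not part of X.

```python
import math

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def screen_dpi(width_px, height_px, diagonal_in):
    """DPI implied by the pixel dimensions and the physical diagonal."""
    return math.hypot(width_px, height_px) / diagonal_in


def display_size_mm(width_px, height_px, dpi):
    """Width and height in mm that make the X server compute the given DPI."""
    return (round(width_px / dpi * MM_PER_INCH),
            round(height_px / dpi * MM_PER_INCH))


def glyph_pixels(point_size, dpi):
    """Height in pixels of a font of point_size points at the given DPI."""
    return point_size * dpi / POINTS_PER_INCH
```

For example display_size_mm(1600, 1200, 100) gives the two numbers to put in a DisplaySize directive for a 1600x1200 screen meant to run at 100DPI, and glyph_pixels(12, 96) shows that a 12 point font needs 16 pixels at 96DPI but only 12 pixels at 72DPI, which is why glyphs asked for in points keep the same apparent size only if the DPI is right.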
Under MS Windows it is not possible to specify directly the font characteristics of GUI fonts, but it is possible either to ask for larger fonts or to override the system's default DPI, which sometimes is not computed correctly; even when it is computed correctly one may wish to override it with a higher one to get larger glyphs.
060809 XML and HTML microformats
Thanks to some (not appreciative) blog entry on them I have sadly become aware of HTML microformats:
Every once in a long while, I read about an idea that is a stroke of brilliance, and I think to myself, "I wish I had thought of that, it's genius!" Microformats are just that kind of idea. You see, for a while now, people have tried to extract structured data from the unstructured Web. You hear glimmers of these when people talk about the "semantic Web," a Web in which data is separated from formatting. But for whatever reason, the semantic Web hasn't taken off, and the problem of finding structured data in an unstructured world remains. Until now.
Microformats are one small step forward toward exporting structured data on the Web. The idea is simple. Take a page that has some event information on it -- start time, end time, location, subject, Web page, and so on. Rather than put that information into the Hypertext Markup Language (HTML) of the page in any old way, add some standardized HTML tags and Cascading Style Sheet (CSS) class names. The page can still look any way you choose, but to a browser looking for one of these formatted -- or should I say, microformatted -- pieces of HTML, the difference is night and day.
Ahhhhh the pain the pain, the memories :-). In a discussion a long time ago I was not happy with the tendency to use SGML or XML to define not markup languages but data description languages:
I'll spare myself the DTDs, but consider two instances/examples of two
hypothetical SGML architectural forms; the first is called MarkDown:

  <ELEMENT TAG=html>

    <ELEMENT TAG=head>
      <ELEMENT TAG=title TEXT="A sample MarkDown document"></>
    </>

    <ELEMENT TAG=body ATTRS="bgcolor" ATTVALS="#ffffff">
      <ELEMENT TAG=h1 ATTRS="center" ATTVALS="yes" TEXT="What is MarkDown?"></>
      <ELEMENT TAG=p>
        MarkDown is a caricature of SGML; it is an imaginary
        architectural form whose semantics are document markup, where
        the <ELEMENT TAG=code TEXT="tag"></> attribute is the one to
        which the MarkDown semantics are attached.
      </>
    </>
  </>

Now this is a monstrosity, but I hope the analogy is clear, even if a
bit forced in some respects.
and here I find a variant of that monstrosity. The examples of microformats provided are also particularly repulsive because of less than optimal choice of the HTML tags to pervert into data descriptions:
    <div class="vevent">
      <a class="url" href="http://myevent.com">
        <abbr class="dtstart" title="20060501">May 1</abbr> - 
        <abbr class="dtend" title="20060502">02, 2006</abbr>
        <span class="summary">My Conference opening</span> - at
        <span class="location">Hollywood, CA</span>
      </a>
      <div class="description">The opening days of the conference</div>
    </div>
as there is too much use of generic tags like div and span, and the abuse of abbr, where the data is in the title attribute and its verbose description as the body of the abbr element. Now trying to imagine the above data as text the second example might be less objectionably marked up as:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <style><!--
      dt			{ font-weight: bold; }
      cite.title:before		{ content: open-quote; }
      cite.title:after		{ content: close-quote; }
      cite.location:before	{ content: "["; }
      cite.location		{ font-style: normal; }
      cite.location:after	{ content: "]"; }
      q.abstract:before		{ content: ""; }
      q.abstract		{ display: block; font-size:90%; }
      q.abstract:after		{ content: "."; }
      --></style>
  </head>
  <body>
    <dl class="vevent">
      <dt id="20060501-20060502">
	<a href="http://WWW.MyEvent/#opening">
	  <abbr class="dtstart" title="May 1 2006">20060501</abbr> to
	  <abbr class="dtend" title="May 2 2006">20060502</abbr></a></dt>
      <dd><cite class="title">My Conference opening</cite>
	<cite class="location">Hollywood, CA</cite>:
	<q class="abstract">The opening days of the conference.</q></dd>

      <dt id="20060503-20060504">
	<a href="http://WWW.MyEvent/#closing">
	  <abbr class="dtstart" title="May 3 2006">20060503</abbr> to
	  <abbr class="dtend" title="May 4 2006">20060504</abbr></a></dt>
      <dd><cite class="title">My Conference closing</cite>
	<cite class="location">Hollywood, CA</cite>:
	<q class="abstract">The closing days of the conference.</q></dd>
    </dl>
  </body>
</html>
It might seem similar, but it isn't, because my version is just text with text markup, structured as text, not data (and never mind the use of SGML and XML instances that contain only markup, with no data or text). The idea of microformats is to decorate structured data with pseudo-markup so programs can extract individual data elements more easily. As already mentioned by this blogger this is an abuse of HTML, which is about text markup, not data markup, and some XML instance would be more appropriate. However there is a decent case to be made for finer resolution markup of text so that parts of it may be easier to identify and extract, and doing this by some finer grain HTML markup is not too bad. Using the class attribute to indicate finer classes of semantics is a bit of an abuse, as classes are meant to indicate finer classes of rendering, but one can make the case that different classes of rendering do relate to finer classes of meaning, at least in the eye of the average beholder.
060808 What is the resident set size, 'exmap', working sets
Someone was asking me what the RSS column in the output of ps on Linux is, and I said that in theory it is the number of resident pages the process has, but that this is somewhat unsatisfactory.
There are two reasons for being unsatisfied with RSS, and one is incidental and the other deeper. The incidental one is that shared library (or other mappings) resident pages are accounted for in every process using them, with the result that the total sum of the RSS figures is larger than the memory actually used.
This counting problem can be partially avoided by using a tool which I recently discovered called exmap, which accounts for shared mapping resident pages in proportion to how many processes share them:
Exmap is a memory analysis tool which allows you to accurately determine how much physical memory and swap is used by individual processes and shared libraries on a running system. In particular, it accounts for the sharing of memory and swap between different processes.
To my knowledge, other tools can determine that some memory is shared, but can't determine how many processes are making use of that memory and so fairly apportion the cost between the processes making use of it.
Now this useful tool accounts for shared pages equally among processes, but there might be other ways of accounting, like a more dynamic usage based count. But it already needs a kernel extension just to collect per-process ownership data for pages, because the Linux kernel does not:
Exmap uses a loadable kernel module to assign a unique id to each physical or swap page in use by each process. This information is then collated and 'effective' usage numbers calculated.
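The difference between the two ways of counting can be shown with a toy calculation. This is a sketch of the accounting idea only, not of how exmap itself is implemented; the page table below is hypothetical data and the function names are invented for this example.

```python
from collections import defaultdict
from fractions import Fraction


def rss(page_users):
    """Plain RSS: a resident page counts in full for every process mapping it."""
    totals = defaultdict(int)
    for users in page_users.values():
        for pid in users:
            totals[pid] += 1
    return dict(totals)


def proportional(page_users):
    """Each resident page is split equally among the processes mapping it."""
    totals = defaultdict(Fraction)
    for users in page_users.values():
        for pid in users:
            totals[pid] += Fraction(1, len(users))
    return dict(totals)


# Hypothetical data: physical page number -> processes that have it mapped.
PAGES = {
    0: {"a"}, 1: {"a"},             # private to process a
    2: {"b"},                       # private to process b
    3: {"a", "b"}, 4: {"a", "b"},   # shared library pages
}
```

Here rss(PAGES) charges 4 pages to a and 3 to b, a total of 7 for only 5 resident pages, while proportional(PAGES) charges 3 and 2, which sum to the memory actually used.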
That the Linux kernel does not already keep track of this leads to the deeper issue with determining how much memory a process is actually using: that Linux uses a global replacement policy for memory management, that is it treats all processes together, in other words pages from different processes compete as to residency.
Global policies are popular because they are simple to implement, and they make it unnecessary to separately implement the operations of paging (of individual pages) and of swapping (of whole processes).
Local policies (like my favourite, PFF, as in W. W. Chu and H. Opderbeck Program Behavior and the Page-Fault-Frequency Replacement Algorithm IEEE Computer November 1976) instead require defining for each process a working set of pages that are most active in that process, and they try to estimate only which pages should be part of that working set. The reason why that is better done per process than globally is that the working set of a process is supposed to change with time but slowly, under the phase behaviour hypothesis: that process execution happens in distinct phases, and each phase usually has a different working set, which is something that a global policy does not take advantage of.
The crucial point of the working set is that its size is determined such that adding a page would hardly reduce the page fault rate, and removing a page would increase it substantially. A global policy thus will take a page from a process to give it to another process even if it is part of the working set of the original process, as it ignores process boundaries. But rather than stealing pages in the working set of another process it would be better usually to swap out entirely the other process, and steal all its pages, and then swap it in again when it gets scheduled.
Therefore what we would really like to know is not how many pages are resident per each process, but how large is the (local) working set of each process. Possibly adjusted by shared ownership of common pages, where the static measure would be enough, as pages in a working set are by construction deemed necessary.
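To make the notion of a local policy a bit more concrete, here is a minimal simulation of PFF for a single process. It is a sketch based on the usual textbook description of the algorithm, not a transcription of the paper: time is counted in memory references, the threshold is the critical interval between faults, and the name pff is just for this example.

```python
def pff(references, threshold):
    """Simulate PFF on one process.

    references: sequence of page numbers touched by the process.
    threshold:  critical interval between faults, in references.
    Returns (number of faults, resident set size after each reference).
    """
    resident = set()     # pages currently held by the process
    used = set()         # pages referenced since the last fault ("use bits")
    last_fault = None
    faults = 0
    sizes = []
    for now, page in enumerate(references):
        if page not in resident:
            faults += 1
            if last_fault is not None and now - last_fault > threshold:
                # Faults are rare: shrink to the pages used since the last one.
                resident &= used
            # In both cases the faulting page is brought in.
            resident.add(page)
            used = set()
            last_fault = now
        used.add(page)
        sizes.append(len(resident))
    return faults, sizes
```

With the trace [1, 2, 3] * 3 + [4, 5] * 4 + [6] and a threshold of 3, the resident set grows to 3 pages during the first phase, temporarily to 5 when the second phase starts, and is cut back to 3 at the next fault, once the pages of the first phase have gone unused. The resident set size tracks the phases of that one process, whatever other processes are doing.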
Of course this discussion as to global and local policies is pointless, because Linux kernel developers seem interested only in the case where there is enough RAM that no paging or swapping occurs.
060806c Solved problem with automounted filesystems and 'updatedb'
After switching from /etc/fstab to an automounter map for most of my filesystems I have been disappointed to see that I missed a subtle detail that means that the locate database created by updatedb only lists files in filesystems mounted for other reasons.
I had counted on my choice to specify the -g (ghosting) option to automount to create directories to act as virtual mount points, which updatedb would then descend, triggering the mounting of the relevant filesystem. But currently I use Fedora on my desktop PC, and the updatedb in it is from the highly optimized mlocate variant, which checks whether a directory is empty before descending into it; unfortunately the mere check does not trigger the mounting of the filesystem, and the mountpoint directory created by the -g option is empty before the filesystem is mounted. Indeed I just checked and the stat64 system call does not trigger mounting, but getdents or getxattr do. Which is incorrect, because mounting should be triggered by any access to the inode or its data contents, not just the data contents (the extended attributes read by getxattr are not in the inode).
I have tried then to configure the updatedb in mlocate to scan the relevant directories explicitly, which triggers the mount, but then the locate database gets overwritten unless I create a separate database file. But well, I can do so indeed, and simplify several issues, by creating a separate mlocate database for each filesystem in a given list. The database by default will be in the top directory of the filesystem, but optionally the list will have a second field for an explicit name (to cater to the case where the filesystem is read-only).
My /etc/cron.daily/mlocate.cron file now contains:
/usr/bin/updatedb

DBDIRS='/etc/updatedb.dirs'

if test -e "$DBDIRS"
then
  grep -v '^[[:space:]]*#\|^[[:space:]]*$' "$DBDIRS" \
  | while read DIR DB
  do : ${DB:="$DIR/mlocate.db"}
    /usr/bin/updatedb -U "$DIR" -o "$DB"
  done
fi
and my profile initializes the LOCATE_PATH environment variable like this:
# Same list of directories as used by the cron script above.
DBDIRS='/etc/updatedb.dirs'

if test -e "$DBDIRS"
then
  export LOCATE_PATH

  dbdirspath() {
    DBPATH="$1"; DBDIRS="$2"

    { grep -v '^[[:space:]]*#\|^[[:space:]]*$' "$DBDIRS"; echo ''; } \
    | while read DIR DB
    do
      case "$DIR" in
      '') echo "$DBPATH";;
      ?*) : ${DB:="$DIR/mlocate.db"}
	case "$DBPATH" in
	'') DBPATH="$DB";;
	?*) DBPATH="$DBPATH:$DB";;
	esac;;
      esac
    done

    unset DBPATH
    unset DIR
    unset DB
  }

  LOCATE_PATH="`dbdirspath \"$LOCATE_PATH\" \"$DBDIRS\"`"
fi
060806b The 'init' daemon and 'inittab'
Having just mentioned that I prefer to start the automounter from /etc/inittab perhaps some discussion of init is useful to justify that.
The original init was a simple thing that just ran some script chain, and the script chain would define a few nested run levels corresponding usually to modes, for example single user mode and multi user mode, where the system would pass through each lower level to the target level on startup, and vice versa on shutdown.
In the obvious way: init would execute first thing something like a /etc/singleuser script, which would contain single user initialization commands, a call to /etc/multiuser and single user termination commands; /etc/multiuser would contain multiuser initialization code, and a call to an upper level or just the spawning of getty, and multiuser termination commands.
This simple and easily comprehended structure is still present in the various BSD derivatives and the GNU/Linux distribution Slackware uses a variant of this scheme.
For whatever reason, the UNIX System V developers decided to adopt a seemingly more general mechanism: to have run states instead of run levels (even if the terminology did not change, they are actually still called levels), in the sense that states are not ordered and the system can go from any state to any other state. To support this they added a configuration file /etc/inittab where each line is tagged by the states in which it is valid.
Another interesting feature was added: each command line can be tagged as to whether it is a simple command or whether it runs a dæmon, in which case init can monitor it and restart it if it terminates.
The init program reads the configuration file, and when requested to switch run state, works out which commands are unique to the source state, which are unique to the destination state, and which are common to both, dividing them into simple commands and dæmons, and does the following:
  • Kill all the dæmons that are in the source state but not the destination state.
  • Run the once only commands for the destination state.
  • Start the dæmons unique to the destination state.
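The three steps above can be sketched as follows. This is a deliberately simplified model, not the real init: it only knows the once, wait and respawn actions, the sample entries are hypothetical, and parse_inittab and plan_switch are names invented for this example.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "ident levels action command")


def parse_inittab(text):
    """Parse id:levels:action:command lines, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ident, levels, action, command = line.split(":", 3)
        entries.append(Entry(ident, set(levels), action, command))
    return entries


def plan_switch(entries, src, dst):
    """Return (daemons to kill, commands to run, daemons to start)."""
    daemons = [e for e in entries if e.action == "respawn"]
    simple = [e for e in entries if e.action in ("once", "wait")]
    kill = [e.ident for e in daemons if src in e.levels and dst not in e.levels]
    run = [e.ident for e in simple if dst in e.levels]
    start = [e.ident for e in daemons if dst in e.levels and src not in e.levels]
    return kill, run, start


# Hypothetical entries, only to exercise the functions above.
SAMPLE = """\
# id:levels:action:command
l2:2:wait:/etc/rc 2
l3:3:wait:/etc/rc 3
t1:23:respawn:/sbin/getty tty1
xd:3:respawn:/usr/bin/xdm -nodaemon
sl:2:respawn:/usr/sbin/syslogd -n
"""
```

Switching from state 2 to state 3 with plan_switch(parse_inittab(SAMPLE), "2", "3") kills sl, runs l3 and starts xd, while t1, which is valid in both states, is left alone. Note that nothing in the plan lets l2 run any command on the way out of state 2.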
As is pretty obvious there are a few significant problems with this scheme, which the nested run level scheme does not have:
  • There is no way to specify simple commands to execute on leaving a run state.
  • Every possible run state combination must be well defined.
Also, both the run-level and the run-state models share an issue: to change the definition of a level or state one must manually edit the run-level scripts or /etc/inittab. This creates a problem if one has the goal of allowing package installation without manual intervention, because the package may be a dæmon that has to be added to some level or state.
Because of this something very funny happened: the original run level scheme was hacked back into the System V init, only much worse. It is based on having nested run levels by convention (and many distributions use different conventions, even if there is a nominal Linux standard), but the nesting is not checked, because the scripts run at each level are put in a separate per-level directory. Even more insanely, which script is run in which level is not configured in a file: the mere absence or presence of a script (or a symbolic link to one) determines what is run.
Which makes keeping track of what happens a bit hard, and has resulted in a number of utilities to manage the situation. Some systems, mostly Linux based ones, have even created a dual system of determining which script to run in a given run level or state: each script can be present or absent, but if present it also checks a setting in some other configuration file before actually doing anything else.
Linux based distributions have generally followed System V and adopted its init model, with variants; for example to enable/disable individual services RedHat uses files in /etc/sysconfig, SUSE first used the file /etc/rc.config and then split it into separate files under /etc/rc.config.d, and Debian goes for gold by having only two levels, where which scripts run in the upper one is determined solely by configuration in various random files.
A sorry situation, and what astonishes me is that so far very few people have realized that the correct solution is either to improve the run level scheme a bit, or just to fix the System V init, for example:
  • Add the ability to specify commands to be executed on exit from a run state, not just on entry (as in once, wait).
  • Add the ability to include all the files in a directory.
  • Add some way to tell dæmons that a state change is occurring, to avoid stopping them on leaving a state only to restart them on entering the next.
Ideally one would also add some form of command/dæmon dependency management, to avoid the issue of the order in which to execute them within a run state.
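Dependency management of this kind amounts to a topological sort of the services. A minimal sketch, assuming Python 3.9 or later for the standard graphlib module; the service names and their dependencies are hypothetical, and start_order and stop_order are names invented for this example.

```python
from graphlib import TopologicalSorter


def start_order(requires):
    """Order services so that each one starts after everything it requires.

    requires: mapping of service name -> set of services it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(requires).static_order())


def stop_order(requires):
    """Services are stopped in the reverse of the start order."""
    return list(reversed(start_order(requires)))


# Hypothetical dependencies, only to exercise the functions above.
SERVICES = {
    "syslogd": set(),
    "network": {"syslogd"},
    "portmap": {"network"},
    "nfsd": {"network", "portmap"},
}
```

With these dependencies start_order(SERVICES) can only come out as syslogd, network, portmap, nfsd. When several orders are valid the choice among them is arbitrary, which is fine, since only the declared dependencies matter.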
There are however already several alternative init redesigns. The chances that any of them will be used in a production distribution are very small, as the inertia of historical accidents will probably prove too strong. It may still be interesting to have a look at some, for example LFSinit, boot-scripts, and the NetBSD 1.5 rcorder system.
060806 An amusing example of (moderately) bad code
While revising some kernel patches (to figure out why writing to DVD[-+]RW and DVD-RAM started working in 2.6.17) I found this amusing bit of bad code:
      for (unit = 0; unit < MAX_DRIVES; ++unit) {
              ide_drive_t *drive = &hwif->drives[unit];
              if (hwif->no_io_32bit)
                      drive->no_io_32bit = 1;
              else
                      drive->no_io_32bit = drive->id->dword_io ? 1 : 0;
      }
The perhaps inadvertent double obfuscation of the logical condition is all too typical. I would have written:
      {
	      const ide_drive_t *const end = &hwif->drives[MAX_DRIVES];
	      ide_drive_t *drive;
	      for (drive = &hwif->drives[0]; drive < end; drive++)
		      drive->no_io_32bit = hwif->no_io_32bit || drive->id->dword_io;
      }
I also wonder whether scanning all drives up to MAX_DRIVES is correct, without checking whether the drive is actually present. But probably harmless. As a final note, the code above is not, by far, the worst I have seen recently; it is just obviously lame.
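That the shorter form computes the same value as the original can be checked by brute force over the possible inputs. The sketch below models the two C fragments in Python, under the assumption that the fields involved are used as simple truth values; the function names are made up for this example.

```python
from itertools import product


def original(hwif_no_io_32bit, dword_io):
    """The nested if/else with the redundant '? 1 : 0'."""
    if hwif_no_io_32bit:
        return 1
    else:
        return 1 if dword_io else 0


def rewritten(hwif_no_io_32bit, dword_io):
    """The single '||' expression; in C it yields 0 or 1, as int(bool()) does here."""
    return int(bool(hwif_no_io_32bit or dword_io))


def equivalent(values=(0, 1, 2)):
    """Compare both versions over every combination of the given values."""
    return all(original(a, b) == rewritten(a, b)
               for a, b in product(values, repeat=2))
```

The value 2 is included in the default inputs to show that the result is still 0 or 1 when a field holds a nonzero value other than 1.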
060805 Using automounter maps instead of '/etc/fstab'
After some consideration I have decided to switch to Linux automounter maps to mount most of my local filesystems, instead of relying on /etc/fstab and the boot scripts. The two are mostly equivalent, but the automounter maps are used by a dæmon to mount filesystems dynamically, on usage, instead of statically; mounted filesystems are then unmounted after they haven't been accessed for a while.
The main advantage for me of automounter maps is not that by mounting automatically they remove the need to issue a mount command before accessing a filesystem, which is a trifling issue, but that they do so dynamically, that is filesystems only stay mounted for as long as they are used. This has the not inconsiderable advantage that most filesystems will stay unmounted most of the time, and an unmounted filesystem is clean and does not need to be checked if there is a crash, and this is particularly useful if one does system development and these crashes occur during debugging. Sure, most file system types have journaling and they recover fairly quickly, but because my PC has MS Windows dual boot, I also have a few (for historical reasons) FAT32 and ext2 filesystems that don't have journaling. Moreover keeping filesystems inactive and unmounted reduces other chances of accidental damage.
Currently I have only one map for mounting my local filesystems under /fs, called /etc/auto.fs, and it has two sections, one for mounting filesystems (mostly from removable media) that might exist on any system, and another for mounting PC specific filesystems, and it looks like:
# vim:nowrap:ts=8:sw=8:noet:ft=conf
#MOUNTP -fstype=TYPE[,OPTION]*                                  :RESOURCE

# Host independent
##################

0       -fstype=ext2,user,rw,defaults,noatime,nosuid,nodev      :/dev/fd0
a       -fstype=vfat,user,rw,nocase,showexec,noatime,umask=077  :/dev/fd0
A       -fstype=msdos,user,rw,noatime,umask=077                 :/dev/fd0

1       -fstype=ext2,user,rw,defaults,noatime,nosuid,nodev      :/dev/fd1
b       -fstype=vfat,user,rw,nocase,showexec,noatime,umask=077  :/dev/fd1
B       -fstype=msdos,user,rw,noatime,umask=077                 :/dev/fd1

sda1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sda1
sdb1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdb1
sdc1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdc1
sdd1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdd1
sde1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sde1
sdf1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdf1

uba1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/uba1
ubb1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubb1
ubc1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubc1
ubd1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubd1
ube1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ube1
ubf1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubf1

pkt     -fstype=udf,user,rw                                     :/dev/pktcdvd/0
udf     -fstype=udf,user,ro                                     :/dev/cdrom
cd      -fstype=cdfs,user,ro,mode=0775,exec                     :/dev/cdrom
iso     -fstype=iso9660,user,ro,mode=0775,exec                  :/dev/cdrom
r       -fstype=iso9660,user,ro,mode=0774,norock                :/dev/cdrom

pkt1    -fstype=udf,user,rw                                     :/dev/pktcdvd/1
udf1    -fstype=udf,user,ro                                     :/dev/cdrom1
cd1     -fstype=cdfs,user,ro,mode=0775,exec                     :/dev/cdrom1
iso1    -fstype=iso9660,user,ro,mode=0775,exec                  :/dev/cdrom1
s       -fstype=iso9660,user,ro,mode=0774,norock                :/dev/cdrom1

# Host dependent
################

home    -fstype=jfs,defaults,noatime                            :/dev/hda2

c       -fstype=vfat,rw,nocase,showexec,noatime,umask=02        :/dev/hda3
d       -fstype=ext3,rw,defaults,noatime                        :/dev/hda6
e       -fstype=ext3,rw,defaults,noatime                        :/dev/hda7
Interestingly autofs maps can be scripts that given the mountpoint print the options and resource to mount on it, but for my simple purposes of /etc/fstab replacement that is not necessary, a static map is good enough. I have of course kept an /etc/fstab, but it now contains only the indispensable static mounts, which are about virtual filesystems and swap:
# vim:nowrap:ts=8:sw=8:noet:
#DEVICE         MOUNT           TYPE    OPTIONS                   DUMP PASS

# Host independent
##################

none            /proc           proc    auto,defaults                   0 0
none            /sys            sysfs   auto,defaults                   0 0
none            /dev/pts        devpts  auto,mode=0620,gid=tty          0 0
none            /proc/bus/usb   usbfs   auto,defaults                   0 0

none            /proc/nfs/nfsd          nfsd    noauto                  0 0
none            /var/lib/nfs/rpc_pipefs pipefs  noauto                  0 0

# Host dependent
################

/dev/hda1	/		jfs	defaults,errors=remount-ro	4 1
none            /tmp            tmpfs   auto,mode=0777,size=700m,exec   0 0
none            /dev/shm        tmpfs   auto,mode=0777,size=100m,exec   0 0

/dev/loop7      swap            swap    noauto,swap,pri=8               0 0
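As to maps that are scripts, a minimal sketch of one may be useful, even if a static map is enough here. autofs runs an executable map with the key as its argument and expects the rest of the entry on standard output. The code below generates the repetitive sdX1 and ubX1 entries of the map above; the names map_entry and main are hypothetical, and a real script would need to be executable and would end by calling sys.exit(main(sys.argv)).

```python
import re

# Options copied from the static vfat entries in the map above.
VFAT_OPTS = "-fstype=vfat,user,ro,nocase,showexec,noatime,umask=077"


def map_entry(key):
    """Return the options and resource for key, or None if it is not ours."""
    if re.fullmatch(r"(sd|ub)[a-f]1", key):
        return f"{VFAT_OPTS}\t:/dev/{key}"
    return None


def main(argv):
    """Print the entry for the key in argv[1]; nonzero status if there is none."""
    entry = map_entry(argv[1]) if len(argv) > 1 else None
    if entry is None:
        return 1
    print(entry)
    return 0
```

So map_entry("sdc1") returns the same options and device as the sdc1 line of the static map, and keys like home, which this sketch does not handle, yield no entry and so no mount.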
Instead of using the autofs rc script to start the automount dæmon, I have chosen to start it directly in /etc/inittab, because that is where it should be, and thus I have added these lines to it:
# Use 'automount' instead of '/etc/fstab'. Local and remote maps.
af:2345:wait:/usr/sbin/automount -g -t 60 /fs file /etc/auto.fs
aa:345:wait:/usr/sbin/automount -g -t 180 /am file /etc/auto.am
Note that I have specified the ghosting option -g to make the mountpoints visible under the main mountpoint, and slightly different and somewhat longish autounmounting timeouts. One can always issue umount explicitly if one wants a quick unmount, for example for a removable medium.
One possible downside to this is that startup becomes a little more fragile, but not really, because the essential filesystems (root and possibly /usr and /var if separate) should in any case be specified in /etc/fstab and be mounted statically at boot. Surely nowhere as fragile as udev. Another possible downside is that mounting is slow for some file system types, most notably ReiserFS, because some extended consistency checks are performed. In such a case perhaps lengthening the default mount timeout is a palliative.
Instead of using autofs I might have used AMD. AMD is a similar dæmon which is system independent (instead of using the special autofs module of Linux it pretends to be an NFS server) and rather more sophisticated. AMD is more suitable for large networked installations, where it has considerably simplified my work a few times, but I just wanted a dynamic extension to /etc/fstab and for that autofs is quite decent.
autofs in effect is a subset of the ability in Plan 9 to mount processes in the filesystem name space, and vaguely similar to the BSD portal feature for example as in mount_portalfs.

July 2006