This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
• NEC won tender 4.5PB over three deliveries
• 500 TB already installed (450 x 400GB 7k SATA) connected to 28 storage controllers RAID6

and from other notes on the same event:

- 500TB already installed -- 400GB 7k SATA drives and 50TB FC -- 140GB 15k
- RAID 6 SATA
- 2 servers at 2 gbit every 20TB
- Jan 2007 -- 300GB FC and 500GB SATA

Both sets of notes are about the same presentation, and I was quite amused: 450 drives (actually 400+140) over 28 RAID6s means over 16 per RAID6, which to me seems quite perverse.
• 2MW today, in near future .... 5.5 MW total into the building by end of year, by 2017 predict 64MW
:-)
orthogonal base for configuration space via environment variables, of which I use three:

HULL
    is for names of systems, which implies a certain hardware
    configuration, including for example which filesystems are
    available.
SITE
    is for names of locations, which usually imply different
    networking setups for servers.
NODE
    is for names of individual configurations, and is a recent
    addition, as I realized that several configurations may be needed
    on the same HULL at the same SITE, for example because of dual
    booting different GNU/Linux distributions, or virtual machines,
    or different roles.
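For concreteness (using the same example names that appear in the case statements below), on my laptop at home running its first GNU/Linux installation the three might be set like this:

  HULL=laptop1    # this particular machine, hence its hardware configuration
  SITE=home       # where it currently is, hence its network setup
  NODE=linux1     # which installed configuration/role is running
  export HULL SITE NODE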
These variables are then tested in case statements with pattern matching, something that the authors of most shell scripts I see seem to eschew in their quest for the most appalling style, as follows for example (many variants are possible, but a delimiter like the + below is necessary):
case "$SITE" in 'home+laptop1') : 'Whatever purely site specific';; ... esac case "$SITE+$HULL" in *'+laptop1') : 'Whatever purely hull specific';; ... esac case "$SITE+$HULL+$NODE" in 'home+laptop1+linux1') : 'Whatever for this specific situation';; *'+laptop1+linux1') : 'Whatever hull and node specific';; ... esacFor configuration files that are not shell scripts I use other variants on the idea:
/root/CONF/work+laptop1
might contain configuration files specific to that site and
hull. Then switching configuration can be as simple as
  cp -alf /root/CONF/work/.          /.
  cp -alf /root/CONF/work+laptop1/.  /.

(even if I actually use a slightly different scheme and script).
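A minimal sketch of such a switching script, assuming the /root/CONF layout above and that SITE and HULL are already set in the environment (my actual script differs, as mentioned):

  #!/bin/sh
  # Overlay the site-wide tree first, then the site+hull specific one,
  # hard linking so unchanged files stay shared with /root/CONF.
  test -d "/root/CONF/$SITE"       && cp -alf "/root/CONF/$SITE/."       /.
  test -d "/root/CONF/$SITE+$HULL" && cp -alf "/root/CONF/$SITE+$HULL/." /.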
Another variant is to use a make file to copy or generate the specialized configuration file, using the name of site, node or hull in the make variables and the filenames. For example:
  ${HOME}/.emacs: emacs-${SITE}.el; cp -p emacs-${SITE}.el '$@'

and sometimes I preprocess the files to be installed with cpp, which requires some slight trickery instead of a mere cp:
  ${HOME}/.Xresources: Xresources; cpp-dot < Xresources > .tmp && cp -p .tmp '$@'

where
.tmp
is used to prevent overwriting the
older target in case the preprocessing fails, and
cpp-dot
looks like:
  ENV=''

  CPP='gcc -E -x c'
  test -f /lib/cpp          && CPP=/lib/cpp
  test -f /usr/lib/cpp      && CPP=/usr/lib/cpp
  test -f /usr/ccs/lib/cpp  && CPP=/usr/ccs/lib/cpp
  test -f "$LOCAL/bin/cpp"  && CPP="$LOCAL/bin/cpp"

  for VAR in HOME SITE HULL NODE
  do
    eval VAL=\"\$"$VAR"\"
    ENV="$ENV -DEnv$VAR=$VAL"
  done

  exec $CPP $ENV ${1+"$@"} \
    | exec egrep -v '^[ ]*$|^[#!]' \
    | exec sed 's/^ *//;s/ *\%\% *//g;s/\^^/"/g'

In the above there is a special trick: the sequence
%%
is used where white space around it should
be deleted, as some variants of cpp
insert
white space around expanded macros.
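For illustration (a made-up fragment, not from my actual files): with SITE set to home, a template line like

  XTerm*iconName: xterm@ %% EnvSITE %%

should come out of cpp-dot as

  XTerm*iconName: xterm@home

with the %% markers and the white space around them stripped by the final sed.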
/etc/env-NODE, /etc/env-SITE and /etc/env-HULL, which typically read like this:
  #!/bin/sh
  export SITE
  : ${SITE:='home'}

and a generic one that evaluates those and can override them by injecting assignments from the kernel command line into the environment like this:
  #!/bin/sh

  for S in /etc/env-SITE /etc/env-HULL /etc/env-NODE
  do
    test -r "$S" && . "$S"
  done

  if test -r /proc/cmdline
  then
    # This should be: tr ' ' '\012' | while read P
    # but cannot be because then the 'while' is a subshell.
    for P in `cat /proc/cmdline`
    do
      if N="`expr \"$P\" : '\([A-Z_][A-Z_0-9]*\)='`"
      then
        export "$N"
        V="`expr \"$P\" : \"$N=\(.*\)\"`"
        eval "$N"="'$V'"
      fi
    done
  fi

Then the script above is
sourced at the beginning of the global profile script for users, or the rc scripts for init.
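For example (a minimal sketch; that the generic script above is saved as /etc/env is a name I am making up here), the global profile could start with:

  # Pick up SITE, HULL and NODE, plus any kernel command line overrides.
  test -r /etc/env && . /etc/env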
hdparm -T, and system CPU usage was around 11% for FireWire 400 and a still fairly low 28% for FireWire 800. Compare with the same figures for my USB2 test with much faster systems at lower data rates.
non-Fibre Channel for small-to-middle situations, which can add considerable flexibility to configurations.
  # dd bs=32k if=/dev/hda of=/dev/sdi
  777871+0 records in
  777871+0 records out
  25489276928 bytes (25 GB) copied, 903.105 seconds, 28.2 MB/s

and
vmstat 10
was reporting:
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   1  3   4240  10208 676080  59140    0    0 25138 27612 2727 5682  2 19  0 78  0
   0  8   4240   9944 679664  58456    0    0 29947 27198 2960 6169  2 21  0 77  0
   2  2   4240   9608 670924  61236    0    0 25709 26650 2761 5985  8 19  0 73  0
   1  5   4240   9396 682068  60600    0    0 30333 28282 2928 6029  7 24  0 69  0
   0  5   4240   9800 686468  58512    0    0 27962 28345 2782 5709  2 20  0 78  0

which is pretty decent, even if I have seen people claiming to reach 35MB/s with other ATA-USB2 chipsets (I haven't checked yet which one is in this enclosure). There is, for both reading and writing, a 20% CPU use on a 3GHz Athlon 64. Just reading from the same external drive gives much the same rate with a 10% use of CPU:
  # dd bs=32k if=/dev/sdi of=/dev/null count=100000
  100000+0 records in
  100000+0 records out
  3276800000 bytes (3.3 GB) copied, 117.09 seconds, 28.0 MB/s

with
vmstat 10
reporting:
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
   r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
   0  1   4848   9952 574724  61264    0    0 27789     1 2318 4529  1 10  0 88  0
   1  1   4848   9652 575004  61240    0    0 27202     0 2273 4442  1 12  0 87  0
   1  1   4848  10320 574240  61260    0    0 27750     1 2363 4563  1  9  0 90  0
   0  1   4848   9532 575164  61208    0    0 27251     0 2275 4438  1 12  0 87  0
   2  1   4848   9892 576376  59660    0    0 27789     0 2316 4525  1  9  0 90  0
   0  1   4848  10132 576192  59588    0    0 27265     0 2274 4441  1 12  0 87  0

and neither transfer rate is disappointing. Overall a lot better than many others.
bus because of that.
Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5-80 floating-point operations per cache miss to main memory were typical.

which is very well supported by my experiences (and Ben's, hi!
:->
) in optimizing game
code on various consoles, where more cache means better
performance, given that most programmers, even game programmers,
don't understand memory-friendly or microparallel algorithms.
By scaling to N communicating processes, we are able to spread the data movement task over N independent memory access streams. Scaling is, of course, limited by the cost of communication.

This is a bit too vague: the
cost of communication is ambiguous, as communication is rated in both bandwidth and latency, which can have very different cost profiles. For massively parallel algorithms latency probably matters more, but here it is indeed not clear whether the cost of bandwidth or the cost of latency has been targeted.
Our hardware design was guided by a simple idea: while traditional clusters are built upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a way that delivers maximum performance per dollar, watt, and square foot.

This balancing is surely wise, and reminds me of a very good point by Edsger Dijkstra about optimal page replacement algorithms (to be discussed some other time), where optimum utilization of one resource is not the goal, but cheapest utilization of all resources (that is, cost-weighting). Unfortunately shallow customers (which exist in the HPC market too) buy on raw performance benchmarks, and selling to the wise and discriminating restricts the potential market.
Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use parallel links, giving higher bandwidth. And with a small, single-cabinet system we were able to use a single master clock, resulting in reduced synchronization delays. Our low-power design also has cascading benefits in reducing infrastructure costs such as building and air conditioning, and in reducing operational costs for electricity.

Well said, and especially for lower end applications power requirements can impact costs severely. At the higher end not many have the resources of Google, who can afford semi-custom PC designs and to build gigantic facilities where land and power are cheap.
The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent caches, two interleaved memory interfaces, high speed I/O, and a programmable interface to the interconnect fabric.

The usual MIPS-style instruction set, and a fairly decent amount of cache considering the CPUs are packed six to a chip. It sounds not too unlike a Sony/IBM Cell design, MIPS rather than PowerPC based, and the double precision floating point is obviously targeted at scientific rather than gaming markets. It remains to be seen whether double precision performance is much slower than single precision floating point, as in similar designs. But the kicker is that:
The processors are based on a low power 64-bit MIPS® implementation. Each processor has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB segment of the Level 2 cache. The processor contains a 64-bit, floating-point pipeline and has a peak floating-point rate of 1 GFLOPs. The processor's six-stage pipeline provides in-order execution of up to two instructions per cycle.
This simple design dissipates less than one watt per processor core.

which suggests a power draw of under 6W per chip, which is fairly impressive. The 500MHz clock frequency however is far lower than the 3GHz clock of the Cell in the PS3, which has a similar number of CPUs per chip; but then the PS3 board carries only one such chip, not 27 as a SiCortex node board does. But then I also can only agree with the point that:
The processor's rather modest instruction-level parallelism is well suited to HPTC applications which typically spend most of their time waiting for memory accesses to complete.
interior nodes in the hierarchy cannot have data associated with them: in the DNS both a domain name and its subdomain names can have addresses associated with them.
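For example (an illustrative fragment using reserved documentation addresses), a zone can hold both of these at once:

  Example.com.      A  192.0.2.1
  www.Example.com.  A  192.0.2.2

that is, Example.com itself resolves to an address even though it also has subdomains under it.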
(srv1) being preferred for floor 1 and the other (srv2) for floor 2. Let's
consider then these possible naming schemes, which I will
present all at once so the differences can be seen at a glance,
with case-by-case comments afterwards:
  ; There should be '$ORIGIN Example.com' or equivalent here.

  ; #1
  srv1.Example.com.               A      192.168.1.1
  srv1.Example.com.               A      192.168.2.1
  srv2.Example.com.               A      192.168.1.2
  srv2.Example.com.               A      192.168.2.2

  ; #2
  eth0.srv1.Example.com.          A      192.168.1.1
  eth1.srv1.Example.com.          A      192.168.2.1
  eth0.srv2.Example.com.          A      192.168.1.2
  eth1.srv2.Example.com.          A      192.168.2.2

  ; #2
  NFS-1.Example.com.              A      192.168.1.1
  IPP-1.Example.com.              A      192.168.1.1
  NFS-2.Example.com.              A      192.168.2.2
  IPP-2.Example.com.              A      192.168.2.2

  ; #3
  floor1.NFS.Example.com.         A      192.168.1.1
  floor1.IPP.Example.com.         A      192.168.1.1
  floor1.NFS2.Example.com.        A      192.168.1.2
  floor1.IPP2.Example.com.        A      192.168.1.2
  floor2.NFS.Example.com.         A      192.168.2.2
  floor2.IPP.Example.com.         A      192.168.2.2
  floor2.NFS2.Example.com.        A      192.168.2.1
  floor2.IPP2.Example.com.        A      192.168.2.1

  ; #4
  ; #4.1
  floor1.srv1.Example.com.        A      192.168.1.1
  floor2.srv1.Example.com.        A      192.168.2.1
  floor1.srv2.Example.com.        A      192.168.1.2
  floor2.srv2.Example.com.        A      192.168.2.2
  ; #4.2
  srv1.Example.com.               CNAME  floor1.srv1.Example.com.
  srv1.Example.com.               CNAME  floor2.srv1.Example.com.
  srv2.Example.com.               CNAME  floor1.srv2.Example.com.
  srv2.Example.com.               CNAME  floor2.srv2.Example.com.
  ; #4.3
  NFS.main.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  IPP.main.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  NFS.bkup.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  IPP.bkup.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  NFS.main.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  IPP.main.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  NFS.bkup.floor2.Example.com.    CNAME  floor2.srv1.Example.com.
  IPP.bkup.floor2.Example.com.    CNAME  floor2.srv1.Example.com.

As to these:
flat for my tastes, and does not take enough advantage of the hierarchical nature of DNS, and does not make it easy to switch from the main to the backup server for each floor.
search lines in the resolv.conf (or equivalent DHCP) file, and/or by changing the pointed-to interfaces. The naming could be simplified to something like:

  NFS.floor1.Example.com.    CNAME  floor1.srv1.Example.com.
  IPP.floor1.Example.com.    CNAME  floor1.srv2.Example.com.
  NFS.floor2.Example.com.    CNAME  floor2.srv2.Example.com.
  IPP.floor2.Example.com.    CNAME  floor2.srv1.Example.com.

if one had clustering or load balancing for NFS and IPP across the two servers.
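For instance (an illustrative sketch; that srv1 also answers DNS queries is just my assumption here), clients on floor 1 could have in their resolv.conf:

  search floor1.Example.com Example.com
  nameserver 192.168.1.1

so that a plain NFS or IPP resolves via NFS.floor1.Example.com or IPP.floor1.Example.com, and repointing a floor means editing the DNS records or the search line rather than every client's configuration.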