Software and hardware annotations 2007 January

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


070128 Laptops as servers
Recently I installed a few really small PC-style boxes as network appliances, to be special purpose firewalls. These machines are quite neat and useful, and could be used as general purpose servers even if they have relatively low performance (after all I did a video-on-demand server on a Pentium 100MHz years ago). But after working with many racks full of largish contemporary servers and having to get rackable consoles etc., I have been looking at today's laptops, and I have started to think that they may be pretty good servers in many cases, not just desktop replacements, for several reasons, not least the built-in UPS that a laptop's own battery provides.

There are obvious downsides too: laptops are more expensive than low end boxes (but not more expensive than industrial grade 1U servers), and most lamentably they are not that expandable. However, as to expandability USB2 overcomes most issues, with two major exceptions: additional Gigabit network interfaces and additional hard drives. For the former one could use the classic CardBus or the newer ExpressCard sockets; the same for extra hard drives, as there are now FireWire 800 and even eSATA plug-in cards for laptops, and those two standards are fast enough for fairly serious performance. Of course external hard drives would not enjoy the advantage of the built-in UPS of the laptop itself, but once spinning, hard drives draw relatively little power. Anyhow, I have been using a couple of laptops as backup servers with external USB2 or FireWire 400 hard drives, roughly along the lines sketched below.
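As an illustration of that backup-server use, here is a minimal sketch of the kind of job such a laptop might run; the device name, mount point and source host are hypothetical examples, not details from the entry above.

#!/bin/sh
# Minimal backup sketch: pull a tree from a remote host onto an
# external USB2 or FireWire drive attached to the laptop.
# All names below (device, mount point, source) are examples only.
set -e

DRIVE=/dev/sdb1                # the external disk's partition
MOUNT=/mnt/backup
SOURCE=fileserver:/home/       # host and tree to back up

mount "$DRIVE" "$MOUNT"
# rsync transfers only changed files; '-a' preserves attributes,
# '--delete' mirrors deletions on the backup copy.
rsync -a --delete "$SOURCE" "$MOUNT/home/"
umount "$MOUNT"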
070127 More RAID5/RAID6 madness
I was looking, for various other reasons, at a page of notes about large site computing infrastructure and I noticed these two bullet points:
• NEC won tender 4.5PB over three deliveries
• 500 TB already installed (450 x 400GB 7k SATA) connected to 28 storage controllers RAID6
and from other notes on the same event:
- 500TB already installed -- 400GB 7k SATA drives and 50TB FC -- 140GB 15k
- RAID 6 SATA
- 2 servers at 2 gbit every 20TB
- Jan 2007 -- 300GB FC and 500GB SATA
about the same presentation, and I was quite amused: 450 drives (actually 400+140) over 28 RAID6 arrays means over 16 drives per array, which to me seems quite perverse: with sets that wide, rebuilds take a long time and the window for a further failure during a rebuild becomes uncomfortably large.
Another amusing bullet point about power demands of large data centers:
• 2MW today; in near future 5.5MW total into the building by end of year; by 2017 predict 64MW
070122 Crazily high cable prices
Walking home I passed by a computer product superstore belonging to a national chain and briefly visited it. It looks like there are ways of making a lot of money selling extraordinarily expensive Belkin branded cables to gullible consumers: as a typical example, an internal ATA 80 wire cable priced at £16.99 (around $34 or €26). A simple electrical power cable might then seem a bargain for a mere £9.99 ($20 or €15). Or perhaps still not :-).
070120b Simple configuration-driving environment variables
I have a few PCs here and there that not only have rather different hardware but also exist in different locations and are sometimes moved among them, and I try to share configuration files among them. Fortunately many of these files are actually shell scripts that set environment variables or run commands.
After many years and trying several approaches, my current preference is to define, either at boot time or via some overall scripts, a minimal orthogonal base for the configuration space via environment variables, of which I use three: SITE (the location), HULL (the physical machine) and NODE (the system instance installed on it). One could add more variables for a finer discrimination of the configuration space, but I am reluctant to do so. I then use these variables in configuration shell scripts with nice traditional case statements with pattern matching, something that the authors of most shell scripts I see seem to eschew in their quest for the most appalling style, as follows for example (many variants are possible, but a delimiter like the + below is necessary; one way the variables themselves might be set is sketched after the example):
case "$SITE" in
'home')
   : 'Whatever purely site specific';;
...
esac

case "$SITE+$HULL" in
*'+laptop1')
   : 'Whatever purely hull specific';;
...
esac

case "$SITE+$HULL+$NODE" in
'home+laptop1+linux1')
   : 'Whatever for this specific situation';;
*'+laptop1+linux1')
   : 'Whatever hull and node specific';;
...
esac
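For completeness, here is a minimal sketch of one way the three variables might be defined, for example from a small file sourced at boot or login; the file names and values are hypothetical, not taken from the entry above:

# /etc/profile.d/confspace.sh -- hypothetical example.
# HULL and NODE describe the physical machine and the installed
# system and rarely change; SITE changes when the machine moves.
NODE="linux1"
HULL="laptop1"

# The site can be kept in a small file updated when moving:
if [ -r /etc/site ]
then SITE="$(cat /etc/site)"
else SITE="unknown"
fi

export SITE HULL NODE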
For configuration files that are not shell scripts I use other variants of the same idea (one possible such variant is sketched at the end of this entry). It sounds complicated, but it is the product of quite a bit of thinking and practice, and there are some details I haven't discussed explicitly. Still better than the registry, even if the usual victims of the Microsoft cultural hegemony want to add it to GNU/Linux too.
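One possible variant for non-shell configuration files (not necessarily the one meant above, just an illustration of the same SITE/HULL/NODE scheme) is to assemble the file from per-case fragments, most specific last; all the paths here are invented:

#!/bin/sh
# Hypothetical sketch: build a configuration file from fragments named
# after the same configuration-space coordinates; missing fragments
# are simply skipped.
FRAGS=/etc/confspace/ntp.conf.d
TARGET=/etc/ntp.conf

cat "$FRAGS/common" \
    "$FRAGS/site-$SITE" \
    "$FRAGS/hull-$HULL" \
    "$FRAGS/node-$NODE" 2>/dev/null > "$TARGET"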
070120 Impressive FireWire 800 performance
Having recently tried two external hard drives in USB2 mode, I managed to get a cheap FireWire 800 host adapter with a decent TI chip, and what a very pleasant surprise: the same My Book™ Pro Edition™ II which delivered around 22MB/s with USB2 also did about the same with FireWire 400 (which is a surprise too, as it should be faster), but delivered 68MB/s reading and 54MB/s writing with FireWire 800. The test was done on a relatively slow Celeron 2.4GHz system, which returns a bandwidth of only 700MB/s from hdparm -T, and system CPU usage was around 11% for FireWire 400 and a still fairly low 28% for FireWire 800. Compare with the same figures for my USB2 test, which were obtained on much faster systems at lower data rates.
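For reference, figures of this kind can be collected with hdparm and dd; the sketch below shows the sort of commands involved (device and mount point names are hypothetical, and this is not a transcript of the actual test):

# Cached-read bandwidth of the host itself (the 700MB/s figure above):
hdparm -T /dev/sda

# Sequential read from the external drive, here assumed to be /dev/sdb:
dd if=/dev/sdb of=/dev/null bs=1M count=4096

# Sequential write to a filesystem on it, forcing data to disk at the end:
dd if=/dev/zero of=/mnt/external/testfile bs=1M count=4096 conv=fsync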
All this is quite impressive, and means that at least some current FireWire 800 host adapter and ATA bridge chips (I am going to try to figure out which ATA bridge chip is inside the MyBook Pro) are far more efficient (probably lower latency) than USB2 or FireWire 400 chips. Indeed the speed achievable via FireWire 800 is high enough that the need for eSATA is called into question, or even that of internal hard drives. I have known people who have servers with RAID on external FireWire 400 drives, and I can see that as being far more interesting with the much faster version: a kind of non-Fibre Channel for small-to-middle situations, which can add considerable flexibility to configurations.
070115 Trying two recent branded external hard drives
So I have been looking at various external hard drives for data exchange purposes, around the 250GB range. I had already had a look some time ago, and a bit earlier too, and I was rather disappointed by the vexed problem of spin-up current and the lack of sufficient power in the usual power bricks, never mind the somewhat dodgy nature of many USB and FireWire chipsets, which Linux drivers cannot quite work around in every case. Well, I have just bought a LaCie P3 external USB2 250GB hard disk and I was pleasantly surprised by it. I also briefly checked out a Western Digital external drive, a luxury My Book™ Pro Edition™ II with 2x500GB of capacity and various interfaces. I copied to it in sequence a tree containing about 22GB of data, to 40 different directories, also doing sequential reads and writes; it is slightly slower on USB2 at around 22MB/s, but it also seems quite reliable, as it copied the nearly 900GB (over 15 hours) without trouble (the test loop is sketched below). I didn't try the FireWire and FireWire 800 interfaces.
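The bulk copy test could be scripted roughly as below; the source and destination paths are invented for the sketch, which only shows the general shape of such a test:

#!/bin/sh
# Copy the same ~22GB tree into 40 separate directories on the
# external drive and report the elapsed time; paths are examples only.
SRC=/data/sample-tree
DST=/mnt/mybook/copytest

start=$(date +%s)
n=1
while [ "$n" -le 40 ]
do
    mkdir -p "$DST/run$n"
    cp -a "$SRC/." "$DST/run$n/"
    n=$((n + 1))
done
sync
echo "elapsed: $(( $(date +%s) - start )) seconds"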
But I am rather interested in two new interfaces for external small scale storage systems, one of them being the usual Ethernet, using either AoE or some NAS protocol like NFS or SMB (a small AoE sketch follows). Ethernet and NFS or SMB are very reliable, well understood protocols, even if some low end implementations are still somewhat dubious. Ethernet also seems to be becoming the nearly universal bus because of that.
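As a small illustration of the AoE option (my sketch, not something from the entry itself), exporting a disk over Ethernet with the vblade daemon from the aoetools suite and picking it up on a Linux client looks roughly like this; device and interface names are examples:

# On the box holding the disk: export /dev/sdb as AoE shelf 0, slot 1
# on interface eth0 (vblade runs in the foreground; run it as root).
vblade 0 1 eth0 /dev/sdb

# On a Linux client on the same Ethernet segment:
modprobe aoe                      # the AoE initiator driver
ls /dev/etherd/                   # the export shows up as e0.1
mount /dev/etherd/e0.1 /mnt/aoe   # assuming it contains a filesystem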
The other is eSATA (see also 1, 2, 3, 4, 5), which is the long-awaited serial ATA for external connections, as for example in this very recent (indeed not yet shipping) LaCie product, the d2 eSATA II 3Gbits Hard Drive 500GB. The hope here is that SATA is such a crucial standard that it is going to be hard to find buggy implementations, and that, being the same standard used for internal drives, it will not require the extra layers of conversion, as in the ATA-to-USB2 or ATA-to-FireWire bridge in the external box (there being no native USB2 or FireWire drives), which of course is a big advantage.
USB and USB2 will continue to be useful for simple and/or odd devices, and FireWire will continue to be a great missed opportunity, as it could have made SATA itself pointless, because in effect FireWire is cheap, simple, fast serial SCSI, and could have been used internally too. But SATA and eSATA are sort of good enough, so let's hope eSATA replaces all the dodgy implementations of USB and FireWire for external storage devices.
070108 Parallel system with low power multi-CPU chips
Just spotted an interesting new startup, SiCortex, which is doing very dense massively parallel systems based on low power multiple-CPU chips, an ongoing interest of mine. Their technology summary paper contains some interesting and agreeable points:
Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5-80 floating-point operations per cache miss to main memory were typical.
which is very well supported by my experiences (and Ben's, hi! :->) in optimizing game code on various consoles, where more cache means better performance, given that most programmers, even game programmers, don't understand memory-friendly or microparallel algorithms.
By scaling to N communicating processes, we are able to spread the data movement task over N independent memory access streams. Scaling is, of course, limited by the cost of communication.
This is a bit too vague: the cost of communication is ambiguous, as communication is rated in both bandwidth and latency, which can have very different cost profiles. For massively parallel algorithms latency probably matters more, but here it is not clear whether the cost of bandwidth or the cost of latency has been targeted.
Our hardware design was guided by a simple idea: while traditional clusters are built upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a way that delivers maximum performance per dollar, watt, and square foot.
This balancing is surely wise, and reminds me of a very good point by Edsger Dijkstra about optimal page replacement algorithms (to be discussed some other time), where optimum utilization of one resource is not the goal, but rather the cheapest utilization of all resources (that is, cost-weighting). Unfortunately shallow customers (which exist in the HPC market too) buy on raw performance benchmarks, and selling to the wise and discriminating restricts the potential market.
Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use parallel links, giving higher bandwidth. And with a small, single-cabinet system we were able to use a single master clock, resulting in reduced synchronization delays. Our low-power design also has cascading benefits in reducing infrastructure costs such as building and air conditioning, and in reducing operational costs for electricity.
Well said, and especially for lower end applications power requirements can impact costs severely. At the higher end not many have the resources of Google who can afford semi-custom PC designs and to build gigantic facilities where land and power are cheap.
The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent caches, two interleaved memory interfaces, high speed I/O, and a programmable interface to the interconnect fabric.
The processors are based on a low power 64-bit MIPS® implementation. Each processor has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB segment of the Level 2 cache. The processor contains a 64-bit, floating-point pipeline and has a peak floating-point rate of 1 GFLOPs. The processor's six-stage pipeline provides in-order execution of up to two instructions per cycle.
The usual MIPS-style instruction set, and a fairly decent amount of cache considering the CPUs are packed six to a chip. It sounds not too unlike a Sony/IBM Cell design, MIPS rather than PowerPC based, and double precision floating point is obviously targeted at scientific rather than gaming markets. It remains to be seen whether double precision performance is much slower than single precision as in similar designs. But the kicker is that:
This simple design dissipates less than one watt per processor core.
which suggests a power draw of around 6W per chip, which is fairly impressive. The 500MHz clock frequency is however far lower than the 3GHz clock of the Cell in the PS3; but the PS3 has only one such chip (even if with a similar number of CPUs), and it is the only such chip on its board, not one of 27 as with the SiCortex nodes. But then I can only agree with the point that:
The processor's rather modest instruction-level parallelism is well suited to HPTC applications which typically spend most of their time waiting for memory accesses to complete.
070105 The DNS, interfaces, nodes, directories and naming schemes
I have been chatting with different people about similar issues related to DNS taxonomies. On reflection my guess is that how the DNS works and what its purpose is are somewhat widely misunderstood, in part because understanding the DNS depends on some subtle concepts and terminology.
One misunderstanding is that the DNS provides names for computers, which is quite inaccurate. Very little in Internet standards is about computers: the main entities are interfaces, which have addresses. The Internet architecture has a concept of nodes, which are entities with one or more interfaces, and which are routers if they forward traffic between some of those interfaces. What happens when multiple interfaces are on the same node? Well, Internet standards don't say much about that, except in the particular case of routing across them, and even that is just supposed to happen. IP implementations handle multiple interfaces on the same node differently; for example, by default the Linux kernel will handle them in effect as if they were all one interface on multiple subnets, which can lead to somewhat surprising behaviour (for example, as to ARP responses), as illustrated below.
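For instance, by default a Linux node will answer an ARP request for any of its local addresses on whichever interface the request arrives, the so-called ARP flux effect; the sysctls below (interface names are just examples) make the behaviour strictly per-interface on recent 2.6 kernels:

# Reply to ARP requests only for addresses configured on the interface
# the request came in on, and use an address local to the outgoing
# interface in ARP announcements; 'eth0'/'eth1' are example names.
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2

# The same can be set per interface instead of for 'all':
sysctl -w net.ipv4.conf.eth0.arp_ignore=1
sysctl -w net.ipv4.conf.eth1.arp_ignore=1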
The DNS is a hierarchical distributed property list system, where the properties to be associated with a symbol are those of interfaces, not nodes. So WWW.Example.com and LDAP.Example.co.UK could be names for properties of two interfaces on the same node, or even of the same interface. To the point that there is no way in the DNS (or in IP) to list interfaces by node, or to check whether two interfaces are on the same node. And it is even worse of course: the DNS is actually (mostly) about giving names to addresses, and IP implementations allow defining interfaces that have multiple addresses, or even (far less commonly) several interfaces that share the same address. The DNS is also not quite a directory-and-file system: unlike UNIX filesystems, where interior nodes of the hierarchy (directories) cannot have file data associated with them, in the DNS both a domain name and its subdomain names can have addresses associated with them.
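This is easy to see with a couple of queries; a small sketch using dig, borrowing the srv1 name and addresses from scheme #1 below (all of it illustrative, of course):

# A single name can map to several addresses (several interfaces,
# or several addresses on one interface: the DNS cannot tell which).
dig +short A srv1.Example.com
# 192.168.1.1
# 192.168.2.1

# Nothing ties together two names that happen to be on the same node;
# the only (optional, often missing) link back from an address is PTR:
dig +short -x 192.168.1.1
# srv1.Example.com.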
All this poses some interesting issues, one of them being what to do about naming multiple interfaces on the same node. Well, several possibilities present themselves; as an example, let's imagine a situation with 2 servers, both of them on 2 subnets (thus a total of 4 interfaces), one subnet per floor, each server providing both file and print service, redundantly, as in my preferred structure, but with one server (srv1) being preferred for floor 1 and the other (srv2) for floor 2. Let's consider then these possible naming schemes, which I will present all at once so the differences can be seen at a glance, with comments on some of them afterwards:
; There should be '$ORIGIN Example.com' or equivalent here.

; #1
srv1.Example.com.		A	192.168.1.1
srv1.Example.com.		A	192.168.2.1
srv2.Example.com.		A	192.168.1.2
srv2.Example.com.		A	192.168.2.2

; #2
eth0.srv1.Example.com.		A	192.168.1.1	
eth1.srv1.Example.com.		A	192.168.2.1
eth0.srv2.Example.com.		A	192.168.1.2
eth1.srv2.Example.com.		A	192.168.2.2

; #3
NFS-1.Example.com.		A	192.168.1.1
IPP-1.Example.com.		A	192.168.1.1
NFS-2.Example.com.		A	192.168.2.2
IPP-2.Example.com.		A	192.168.2.2

; #4
floor1.NFS.Example.com.		A	192.168.1.1
floor1.IPP.Example.com.		A	192.168.1.1
floor1.NFS2.Example.com.	A	192.168.1.2
floor1.IPP2.Example.com.	A	192.168.1.2

floor2.NFS.Example.com.		A	192.168.2.2
floor2.IPP.Example.com.		A	192.168.2.2
floor2.NFS2.Example.com.	A	192.168.2.1
floor2.IPP2.Example.com.	A	192.168.2.1

; #5
;   #5.1
floor1.srv1.Example.com.	A	192.168.1.1
floor2.srv1.Example.com.	A	192.168.2.1
floor1.srv2.Example.com.	A	192.168.1.2
floor2.srv2.Example.com.	A	192.168.2.2
;   #5.2
srv1.Example.com.		CNAME	floor1.srv1.Example.com.
srv1.Example.com.		CNAME	floor2.srv1.Example.com.
srv2.Example.com.		CNAME	floor1.srv2.Example.com.
srv2.Example.com.		CNAME	floor2.srv2.Example.com.
;   #5.3
NFS.main.floor1.Example.com.	CNAME	floor1.srv1.Example.com.
IPP.main.floor1.Example.com.	CNAME	floor1.srv1.Example.com.
NFS.bkup.floor1.Example.com.	CNAME	floor1.srv2.Example.com.
IPP.bkup.floor1.Example.com.	CNAME	floor1.srv2.Example.com.

NFS.main.floor2.Example.com.	CNAME	floor2.srv2.Example.com.
IPP.main.floor2.Example.com.	CNAME	floor2.srv2.Example.com.
NFS.bkup.floor2.Example.com.	CNAME	floor2.srv1.Example.com.
IPP.bkup.floor2.Example.com.	CNAME	floor2.srv1.Example.com.
As to #5.3, one could simplify it to:
NFS.floor1.Example.com.		CNAME	floor1.srv1.Example.com.
IPP.floor1.Example.com.		CNAME	floor1.srv2.Example.com.

NFS.floor2.Example.com.		CNAME	floor2.srv2.Example.com.
IPP.floor2.Example.com.		CNAME	floor2.srv1.Example.com.
if one had clustering or load balancing for NFS and IPP across the two servers; a client-side sketch of using such service names follows.
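To show why such service-oriented names are convenient (a sketch of mine, with an invented export path), clients on each floor refer only to the service name for their floor and never need to know which server currently answers:

# A floor-1 client mounts home directories from whatever server the
# floor's NFS service name currently points to:
mount -t nfs NFS.floor1.Example.com:/export/home /home

# And prints to the default queue of the floor's IPP service (CUPS):
lp -h IPP.floor1.Example.com /etc/motd

# Moving a service to the other server then only requires changing a
# CNAME in the zone, with no client reconfiguration.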