Software and hardware annotations 2009 May

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


090521 Thu Resilience without VRRP/UCARP or SMLT

There are various ways to achieve resilience through link redundancy, including somewhat explicit technologies like VRRP and CARP at the routing level or SMLT at the network level, and usually one has to choose among resilience, load balancing and fast switchover. But there is an alternative that uses nothing more than OSPF and basic routing to achieve most of these objectives, and it seems fairly obscure, so perhaps it is useful to document it here, as I have been using it with excellent results for a while, most recently in a medium-high performance network (it is an idea that has probably been discovered many times).

The basic idea is that OSPF will automagically create a global routing state from status exchanges among adjacent routers, that multiple routes to the same destination will offer load balancing (on a per-connection basis) if ECMP is available, and that OSPF will also happily propagate host routes to virtual, canonical IP addresses for hosts.

The basic setup is very simple:

* give each host or router a virtual interface carrying a canonical IP address as a host (/32) route;
* run OSPF on all the routers (and the relevant servers), so that the host route to each canonical address is published via every network the node is connected to;
* enable ECMP where available, so that equal-cost routes get used in parallel.

That does not take a lot of work. The effects are however quite interesting, if one refers to each such host or router not by the address of any of its network interfaces, but by that of the virtual interface. The reason is that OSPF will propagate not just routes to each physical network, but also to each virtual interface, via all the networks its node is connected to.

If ECMP is enabled, and some of these routes have the same cost, load balancing will occur across all the routes that have the same cost; if any of these routes becomes invalid, traffic to the virtual IP address will be instantly rerouted via whichever route is still available. Instantly because when an interface fails the routes through it are withdrawn, and any equal (or higher) cost route will then immediately be used for the next packet.
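
The failover behaviour can be illustrated with a toy model (this is not any real routing stack: the table layout and the hash-based path choice are simplified assumptions, purely for illustration):

```python
import hashlib

# Simplified ECMP routing table: destination -> list of (cost, nexthop)
# entries, as they might have been learned via OSPF.
routes = {
    "10.0.0.9": [(10, "192.168.1.2"), (10, "192.168.2.2"), (20, "192.168.3.2")],
}

def next_hop(dest, flow_id):
    """Pick a next hop: hash the flow over the equal-lowest-cost routes,
    so a given connection sticks to one path while all links are up."""
    entries = routes[dest]
    best = min(cost for cost, _ in entries)
    candidates = [nh for cost, nh in entries if cost == best]
    h = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return candidates[h % len(candidates)]

def withdraw(dest, nexthop):
    """When a link fails OSPF withdraws its routes; the very next lookup
    then uses whatever routes remain, equal or higher cost."""
    routes[dest] = [(c, nh) for c, nh in routes[dest] if nh != nexthop]
```

Withdrawing a route reroutes traffic on the next lookup, with no timeouts and no tricks with IP-to-Ethernet address mappings.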

It is also very easy to use anycast addresses or other forms of host routes to distribute services in a resilient way; the canonical addresses of routers and important servers are similar to anycast addresses.

As long as connections are from one virtual IP address to another virtual IP address, they will eventually arrive, as OSPF creates and reshapes the set of routes across the various networks.

This technique has some drawbacks, mainly:

As an example of a particular setup, imagine a site with:

In the above discussion a canonical address is not indispensable, but very useful. The idea is that a given router or server cannot reliably be referred to by any of the addresses of its physical interfaces, as in most such systems or routers when a link dies its associated interface disappears and any address bound to it vanishes as well. Therefore each system or router needs an IP address bound to a virtual interface (dummy under Linux, circuitless for some Nortel routers, loopback in the CISCO and many other cultures) in order to be always reachable no matter which particular links and interfaces are active. For each such system there will be a host route published for its canonical address, but in most networks with dozens or hundreds of subnet routes, a few dozen or hundred more host routes are not a big deal, as most routers can handle thousands or tens of thousands of routes.
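
On a Linux host running Quagga or FRR the setup might look like the sketch below (interface names, addresses and the area number are purely illustrative, and the exact syntax varies between versions):

```
# create the virtual interface and bind the canonical address to it
ip link add dummy0 type dummy
ip addr add 10.0.0.9/32 dev dummy0
ip link set dummy0 up

# ospfd.conf: publish the canonical address as a host route
# alongside the physical subnets
router ospf
 ospf router-id 10.0.0.9
 network 10.0.0.9/32 area 0.0.0.0
 network 192.168.1.0/24 area 0.0.0.0
 network 192.168.2.0/24 area 0.0.0.0
```

The dummy0 interface never goes down, so the host route to 10.0.0.9 stays published for as long as at least one physical link is up.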

This scheme is rather more reliable and simpler than the use of floating router IP addresses with VRRP or CARP or other load balancing or redundancy solutions, as it does not rely on tricks with the mapping between IP and Ethernet addresses. It can also be extended to fairly arbitrary topologies, and with the use of BGP and careful publication of routes it can be extended beyond a single AS.

It has also some interesting properties, for example:

090518 Mon Pretty good disk and JFS bulk IO performance

I was just comparing a 1TB drive with a 500GB drive, and both perform pretty well. The 1TB one (Hitachi) can do over 100MB/s through the file system:

Using O_DIRECT for block based I/O
Writing with putc()...         done:  61585 kB/s  82.2 %CPU
Rewriting...                   done:  29108 kB/s   1.4 %CPU
Writing intelligently...       done: 101481 kB/s   2.0 %CPU
Reading with getc()...         done:   7787 kB/s  14.4 %CPU
Reading intelligently...       done: 104591 kB/s   2.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char- -DIOBlock- -DRewrite- -Per Char- -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
base.t 4* 400 61585 82.2 101481  2.0 29108  1.4  7787 14.4 104591  2.3  170.9  0.3

and the 500GB one (Western Digital) can do over 60MB/s:

Using O_DIRECT for block based I/O
Writing with putc()...         done:  52489 kB/s  70.8 %CPU
Rewriting...                   done:  25733 kB/s   1.3 %CPU
Writing intelligently...       done:  62586 kB/s   1.2 %CPU
Reading with getc()...         done:   7946 kB/s  13.4 %CPU
Reading intelligently...       done:  63825 kB/s   1.3 %CPU
Seek numbers calculated on first volume only
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char- -DIOBlock- -DRewrite- -Per Char- -DIOBlock- --04k (03)-
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
base.t 4* 400 52489 70.8 62586  1.2 25733  1.3  7946 13.4 63825  1.3  216.0  0.4
The file system used is the excellent JFS, which is still my favourite, even if the news that XFS will become part of RHEL 5.4 may tilt the balance towards it. Anyhow, even though the files used by the test above are large at 400MB and the filesystems used are fairly full, JFS achieves transfer rates very close (80-90%) to the speed of the underlying devices for both reading and writing.

090510b Sun New line of intel RAID host adapters

I had somehow missed that intel is now selling a line of interesting RAID host adapters based on their own ARM architecture CPUs as IO processors and on LSI MegaRAID chips. They cover quite a range of configurations, and usually intel takes more care in engineering and documenting its products than most other chip and card suppliers.

090510 Sun Power consumption per task instead of instantaneous

As to Atom CPUs, I have been wondering how power efficient they are, given that they seem to be around 3 times slower than an equivalent mainline CPU. So I found this article reporting a test of the amount of energy (in watt-hours) consumed by some servers running the same benchmarks, and it turns out that among intel CPUs the Core2Duo consumes the least energy: while its power draw is higher, it is faster, and this more than compensates for the higher power draw. What this tells me is that the Atom is better for mostly-idle systems, that is IO- or network-bound ones, and the Core2Duo is better for mostly-busy, that is CPU-bound, ones.
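
The arithmetic behind this is simple: energy per task is power draw times time to completion. With purely illustrative figures (these are assumptions, not measurements from the cited article), say an Atom drawing 30W but taking three times as long as a 65W Core2Duo on the same job:

```python
def energy_wh(power_w, hours):
    """Energy consumed to complete a task: power draw times elapsed time."""
    return power_w * hours

# Illustrative, assumed numbers only:
atom = energy_wh(power_w=30, hours=3.0)      # slower, lower draw: 90 Wh
core2duo = energy_wh(power_w=65, hours=1.0)  # faster, higher draw: 65 Wh

# the faster CPU finishes so much sooner that it uses less total energy
assert core2duo < atom
```

For a mostly-idle system the calculation flips: idle draw dominates, which is where the Atom's low power draw actually pays off.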

090509b Sat Strange PSU fault

Well, I often think that several mysterious computer issues are due to PSU faults. Indeed recently one of my older PCs stopped working reliably: it would boot, and often allow installing some software or hardware, but would also stop working abruptly, seemingly because of a hard disk issue, as IO to the disk would stop and the IO activity light would stay locked on.

Having tried some alternative hard disks and some alternative hard disk host adapter cards with the same results, I reckoned that not all of them could be faulty, so I checked the voltage on one of the Berg style connectors and was rather surprised to see that the 12V line was actually delivering 13V and, most fatally, the 5V line was actually delivering 4.5V, which is probably rather insufficient. I wonder why; the PSU was not a super-cheap one (they can catch fire) but a fairly high end Akasa one.
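
For reference, the ATX power supply design guide allows roughly ±5% on both the 5V and 12V rails, so both readings above are out of spec (the tolerance figure is from the ATX guide; the check itself is just arithmetic):

```python
def in_tolerance(nominal, measured, pct=5.0):
    """True if the measured voltage is within +/- pct percent of nominal."""
    margin = nominal * pct / 100.0
    return nominal - margin <= measured <= nominal + margin

# measured values from the faulty PSU above
assert not in_tolerance(12.0, 13.0)  # 13V is about 8% high (limit 12.6V)
assert not in_tolerance(5.0, 4.5)    # 4.5V is 10% low (limit 4.75V)
assert in_tolerance(5.0, 5.1)        # a healthy reading would pass
```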

This is one of the stranger PSU failures that I have seen so far, where the voltage on one rail actually rises and on the other sags to just too low, without the PSU failing outright.

090509 Sat Small Atom based rackmount servers

It had to happen, yet I was still a bit surprised to see an intel Atom CPU based rackmount server, whose most notable characteristic is that it is half depth. That sort of makes sense, as one can then mount them without rails, one in the front and one in the back of a rack.

It is also a bit disappointing to see that the hard disk is not hot pluggable and is a 3.5" one. The funniest detail however is that the motherboard chipset is actively cooled but the CPU is not.

The design logic seems to be a disposable unit for rent-a-box web server companies, where most such servers are used for relatively low traffic, low size sites, and anyhow the main bottleneck is the 1Gb/s network interface, with such servers often bandwidth-limited to well below that, typically 10-100Mb/s.

At the other end of the range the same manufacturer has announced another interesting idea: 1U and 2U server boxes with the recent i7 class Xeon 5500 CPUs, configured as a 1U/2U blade cabinet. That is mildly amusing, and seems to me the logical extension of Supermicro putting two servers side by side into a 1U box, as a fixed configuration.

090508 Fri Amazing file system news from Red Hat

While reading the blog of a CentOS developer I had a huge surprise reading a comment about recently discovered Red Hat plans for RHEL 5.4:

We discovered xfs was coming to RHEL,

That is amazing as it represents a really large change in Red Hat's strategy, which was based on an in-place upgrade from ext3 to ext4 in RHEL6, and on not introducing major new functionality in 'stable' releases. Some factors that I suppose might have influenced the decision:

The Red Hat sponsorship of XFS shifts my preferences a bit; I have been using JFS for a while as my default filesystem, as it is very stable and covers a very wide range of system sizes and usage patterns pretty well, but I might (with regrets) move to XFS.

CentOS has had some things like XFS available in the CentOSPlus repository, and XFS has been a standard part of Scientific Linux 5 too (which also includes OpenAFS).