Software and hardware annotations 2007 December

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

I have a backlog of draft blog entries, and the following are just some random quick notes.

071223 Sun BCM5752 jumbo frames work only in output

I have been doing some transfer rate tests among hosts with Broadcom BCM5752 or BCM5708 (tg3 Linux driver), Intel 82541PI (e1000 Linux driver), and Realtek RTL-8169 (r8169 Linux driver) chipsets (more details on these tests sometime later).
During these I have been quite confused by the behaviour of the Broadcom BCM5752 with jumbo frames, until I discovered that officially it does not allow them, but in practice supports them in transmission only. In addition the RTL-8169 supports jumbo frames but only up to 7000B instead of the more common 9000B.
Anyhow the BCM5752 has various forms of network processing offloading, including TCP segmentation offloading, which means that it can handle small frames on the wire fairly efficiently, while the RTL-8169 does not, so jumbo frame support is rather more useful for the latter.
Also, taking advantage of the ability of the BCM5752 to transmit but not receive jumbo frames is easy: change the relevant routes to have an MTU value higher than 1500, but set or leave the advertised MSS value at no more than 1460 (1500 minus 40 for the headers). For example (the advmss 1460 part is redundant), instead of issuing

ifconfig eth0 mtu 9000

which is not valid, this would work:

ip route change 192.168.1.0/24 dev eth0 mtu 9000 advmss 1460
ip route change default via 192.168.1.1 dev eth0 mtu 9000 advmss 1460

As in the example above, each route via the interface associated with the BCM5752 chip must be changed, where appropriate.
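
Since every relevant route needs the same treatment, here is a minimal sketch of how the commands could be generated rather than typed by hand. The function names mss_for_mtu and route_change_cmd are my own hypothetical helpers, not part of iproute2, and the sketch assumes IPv4 with 20 bytes each of IP and TCP header and no options. It only prints the ip route change commands, so they can be inspected before being run as root.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of iproute2.

# TCP MSS for a given IPv4 MTU: MTU minus 20 bytes of IP header
# and 20 bytes of TCP header (assumes no IP or TCP options).
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Print (do not run) an "ip route change" command that raises the
# route MTU for transmission while advertising an MSS that fits the
# MTU the card can actually receive.
# Usage: route_change_cmd DEST DEV ROUTE_MTU RECEIVE_MTU [GATEWAY]
route_change_cmd() {
    dest=$1
    dev=$2
    route_mtu=$3
    recv_mtu=$4
    gw=${5:-}
    advmss=$(mss_for_mtu "$recv_mtu")
    if [ -n "$gw" ]; then
        echo "ip route change $dest via $gw dev $dev mtu $route_mtu advmss $advmss"
    else
        echo "ip route change $dest dev $dev mtu $route_mtu advmss $advmss"
    fi
}

# The two routes from the example above.
route_change_cmd 192.168.1.0/24 eth0 9000 1500
route_change_cmd default eth0 9000 1500 192.168.1.1
```

Printing instead of executing is deliberate: the list of routes and the gateway address differ per host, so it seems safer to review the output first.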

071215 Sat Network accelerators run Linux

I have read some reviews about the Killer NIC game network accelerator, which is quite a remarkable network card for a number of reasons. One is that it is quite expensive, and the other is that it reduces the network latency of MS-Windows games by offloading network processing to a card with a Linux kernel on it (and some sort of Freescale CPU and 64MB of RAM). In some reviews it is reported as actually lowering latency, but the effect is neither large nor uniform. Anyhow Cavium have released some general purpose network accelerator product based on their previously mentioned Octeon multi-CPU chip. A bit pricey, as they seem to cost about the same as, or more than, low latency 10Gb/s cards.

071215 Sat 2.5" hard disks differences

My impression is that there are very few commodities, as often even very similar products have important differences, which may or may not matter to everybody, but matter to someone. One of the latest illustrations is the large differences among 2.5" hard disc drives as to sequential small block performance, which is quite useful to have. The other tests also show remarkable differences (around 25%), and in different ways, as different onboard firmware reacts differently to different usage patterns. For example multithreaded reading and writing shows a fourfold difference in performance between fastest and slowest.

071210 Mon Much better read latency with AMD's HyperTransport

From an interesting review of the new AMD X2 5500 there are some impressive numbers about memory latency differences between current AMD and Intel CPUs. Latency in main RAM still matters, especially for systems with smaller caches, and less so the larger the cache. As to this Intel has the lead, thanks to their superior capital base that enables investing in better process technology which results in more onchip cache, and Intel are clearly driving the memory market towards memory with high latencies and high bandwidths, which is the combination that best feeds the large caches of their CPU chips via their high latency memory buses.
SDR memory had latencies of a few cycles, DDR of half a dozen cycles, DDR2 of a dozen cycles, and now DDR3 of a couple dozen cycles. A little known detail is that the intrinsic speed of a memory cell has improved very little over the past decade or two (a mere doubling, from around 100MHz to 200MHz), and all transfer rate increases have come from higher degrees of pipelining and parallelism at the integrated circuit level. Unfortunately pipelining and parallelism only work well in the aggregate, as they involve those ever higher latencies already mentioned, especially noticeable for random accesses to memory.
Intel seem to be driving RAM to become what used to be called bulk store in the mainframe era, a cheap vast repository of seldom used data that can be recalled to main memory faster than from disk. The level 2 cache is the new main memory. In other words, RAM is no longer meant to be that random access, and is meant to be a commodity, where the performance and value reside in the onchip memory. It is no coincidence that Intel exited the RAM market long ago, and have invested massive capital in processes that allow ever greater amounts of onchip memory.