Software and hardware annotations 2007 September

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


070929 Sat TOMOYO Linux and command of English
TOMOYO is a Linux security subsystem that is now based on the LSM API. I was not aware of TOMOYO, and anything that is an alternative to the hardly usable SELinux sounds interesting. The one thing I understood about TOMOYO is that it identifies security entities by their pathname. This is like Novell's AppArmor module, and it has been objected to on the ground that pathnames in UNIX-like systems are not essential, because they name directory entries and not files (which are numbered, not named); a small shell sketch of the objection follows. Other than that I had some difficulty understanding what TOMOYO does and how, because the documentation is not written in good English and is therefore hard to follow.
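A minimal shell illustration of the objection (the pathnames below are hypothetical): a hard link gives the same file a second name, so a policy keyed on one pathname says nothing about the other.

    # create a file, then give the same inode a second name via a
    # hard link (both names must be on the same filesystem)
    touch /srv/data/secret.txt
    ln /srv/data/secret.txt /srv/data/alias.txt

    # 'ls -i' shows both names map to the same inode number, so a
    # policy that restricts /srv/data/secret.txt does not cover the alias
    ls -i /srv/data/secret.txt /srv/data/alias.txt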
The difficulty of reading poor English is of course a general problem, especially but not just with Asian publications: poor English is hard to distinguish from confused ideas, and there is a high regard among native English speakers for smooth command of the language, which is quite a subtle and ambiguous one. This applies not just to technical writing but also to speech: some people I know are poor English speakers, and they come across as hesitant and confused, where in their native language they make a completely different impression thanks to their fluency and the assuredness that comes with it.
In general this is going to get worse, but this time for native English speakers, as a lot of the economically valuable action moves from western or English speaking countries to Asian ones. It is already the case that a lot of interesting technology research papers appear in Japanese, and when I do web searches on technical topics, usually hardware, I get quite a few hits in Japanese, and many in Chinese too. Perhaps the hottest research in hi-tech will move to China or Japan, and then it will be westerners who will sound or read confused and hard to understand to the Chinese or Japanese reading in their native language.
070928 Fri SELinux brittleness and layering wizards
A discussion about the usability of SELinux is interesting in three respects. The first is that SELinux, like most fine-grained access-list based systems, is very hard to configure and maintain unless one has full-time security officers:
Every medium to large Linux deployment that I am aware of has switched SELinux off. Once you stray from the default configurations that the system distributors ship with, the default policies no longer work and things start to break. In my admittedly limited experience, this happens very quickly.
If the policy language was halfway sane then this wouldn't be so bad - a skilled administrator could adjust the policy
The second issue is that the user interface layer is poorly designed, overly complex and clever (which is usually what wannabes produce), and thus very un-UNIX-like; and the third is that:
The Linux solution to #2 seems to be to add various wizards and other abstraction between the administrator and the policy, rather than tossing the horrid mess and replacing it with something more comprehensible.
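As a concrete illustration of the wizard approach (a sketch using the standard SELinux tools, not a recommendation): instead of amending comprehensible policy source, administrators are steered towards generating loadable modules straight from audit-log denials.

    # the "wizard" route: take whatever accesses were denied (per the
    # audit log) and generate a policy module that permits them all
    audit2allow -a -M mylocal

    # load the generated module without ever reading the underlying
    # type-enforcement rules it contains, which look like:
    #   allow httpd_t user_home_t:file { read getattr };
    semodule -i mylocal.pp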
But I have the impression that this wizards-over-mess approach is the general policy for GNU/Linux development, because it is also the basis of most other enhancements: hide a poorly designed base under a layer of look-good wizardry. The two areas that come to mind are system administration, where instead of the fundamentals being cleaned up and simplified they are being layered under limited, fragile GUIs, and the desktop environment itself.
But this is exactly Microsoft's approach to their own software, where they keep mangling what is already a hopelessly messy base and layering "make it easy" GUIs on top. The UNIX philosophy instead has always been one of simplicity, terseness and conceptual economy. But the Microsoft cultural hegemony influences large parts of the GNU/Linux culture.
Then there is the case of the udev subsystem, which replaced the simple, robust and easy to understand devfs with something poorly designed that requires dæmons and stacks of scripts to be workable. A good example of replacing something more comprehensible with a horrid mess, which is something that Microsoft would find particularly admirable.
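For comparison, a sketch of what the udev machinery involves (the rule below is hypothetical, though it uses standard udev rule syntax): where devfs named device nodes by itself in the kernel, udev needs a dæmon evaluating rule files like this on every hotplug event, often with helper scripts on top.

    # /etc/udev/rules.d/10-local.rules (hypothetical local rules file)
    # match a block device by kernel name and add a friendlier symlink
    KERNEL=="sdb1", SUBSYSTEM=="block", SYMLINK+="backupdisk"

    # run an external helper too: another moving part devfs never needed
    KERNEL=="sdb1", RUN+="/usr/local/sbin/mount-backup.sh"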
070923b Sun So the cases where RAID5 makes sense are...
I have just bravely stated that in at least one special case RAID5 makes some sense, but there is another one. For the sake of clarity, both cases involve mostly read-only data: one with a small 2+1 array, and one with a wider 4+1 array. There is also an overall condition: that a proper RAID10 is simply unaffordable, that is 4 disks instead of 3 are really too many in the first case, and 8 instead of 5 in the second case (the sketch below illustrates the first comparison).
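A sketch of the two alternatives in the first case, using Linux mdadm (device names are hypothetical): the same usable capacity costs 3 drives as RAID5 but 4 as RAID10.

    # RAID5 over 3 drives: 2 drives of usable capacity plus parity
    mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        /dev/sda1 /dev/sdb1 /dev/sdc1

    # the RAID10 equivalent with the same usable capacity needs 4 drives
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1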
070923 Sun Yet another RAID5 perversity
Well, a new RAID5 perversity is bound to surface every now and then, and the XFS mailing list is one of the best sources:
we have a new large raid array, the shelf has 48 disks, the max. amount of disks in a single raid 5 set is 16.
What a pity! Such pettily arbitrary limitations can reduce the effectiveness of a positive-thinking, can-do plan (even if the truly inspired can work around such limitations):
There will be one global spare disk, thus we have two raid 5 with 15 data disks and one with 14 data disks.
The data on these raid sets will be video data + some meta data. Typically each set of data consists of a 2 GB + 500 MB + 100 MB + 20 KB + 2 KB file. There will be some dozens of these sets in a single directory - but not many hundreds or thousands.
Often the data will be transferred from the Windows clients to the server in some parallel copy jobs at night (e.g. 5-10, for each new data directory). The clients will access the data later (mostly) read-only; the data will not be changed after it is stored on the file server.
Well, if it is almost read-only that's one of the few cases in which RAID5 may be tolerable. The problem here however is what happens when one of the disks fails: is a large drop in read performance acceptable while the rebuild goes on? Is total data loss acceptable if a second failure happens during that?
For some applications it is, for example the online caching of a vast read-only taped archive which is not needed 24x7 (a corner case, but for example important to the smart guy who persuaded me that it is a good case in which to use RAID5, with plausible setups), but it is not clear that this is such a case.
Each client then needs a data stream of about 17 MB/s (max. 5 clients are expected to access the data in parallel). I expect the filesystems, each of which will have a size of 10-11 TB, to be filled > 90%. I know this is not ideal, but we need every GB we can get.
Exactly! "We need every GB we can get" is the refrain of most RAID5 songs. A RAID developer I know made the very wise remark that RAID5 is salesman's RAID, because a good salesman with a positive-thinking, can-do attitude can promise that RAID5 will deliver all three of low cost, high performance, and great safety. And some salesmen will even say that about RAID3 or RAID4...
Any ideas what options I should use for mkfs.xfs? At the moment I get about 150 MB/s in seq. writing (tiobench) and 160 MB/s in seq. reading. This is ok, but I'm curious what I could get with tuned xfs parameters.
Well, a 14 or 15 disk wide RAID5 will have a pretty long stripe, especially with a largish chunk size; if writes are not stripe-wide and stripe-aligned, bad news. It is more surprising that read performance is just as poor, but then I know a RAID subsystem that seems poorly designed enough (perhaps intentionally, as part of a marketing strategy, as a smart guy mused) that the 60-70MB/s drives in it can only deliver about 7-10MB/s (and this is actually stated in the vendor's literature).
However in ideal conditions even RAID5 can deliver pretty high data rates, with the right filesystem and the right parameters (e.g. as discussed in these threads: 1, 2).
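As to the mkfs.xfs question above, a sketch of the kind of geometry parameters involved (the chunk size and device name are hypothetical): XFS can be told the RAID layout so that it aligns allocations to full stripes.

    # for a 15-drive RAID5 set (14 data disks plus parity) with a
    # hypothetical 64KiB chunk size: su is the chunk size, sw the
    # number of data disks, so a full stripe is 14 x 64KiB
    mkfs.xfs -d su=64k,sw=14 /dev/sdX1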
070920 Thu More Cell and Niagara style chips planned
Well, so far I have missed two interesting developments in multiple-CPU chips, somewhat similar to the Sun Microsystems Niagara or the IBM/Sony/Toshiba Cell, which are not that dissimilar from the ClearSpeed designs or the SiCortex (except that the latter is decidedly targeted at low power non floating point work). The first is Intel's Larrabee family of chips, which had been hinted at by some Intel announcement last year:
The new slide says that Larrabee products will have from 16 to 24 cores and adds the detail that these cores will operate at clockspeeds between 1.7 and 2.5GHz (150W minimum). The number of cores on each chip, as well as the clockspeed, will vary with each product, depending on its target market (mid-range GPU, high-end GPU, HPC add-in board, etc.)
Typically the CPUs will be in-order dual-issue ones, and it is also mentioned that they may be in effect a bunch of Pentiums clustered on a single die, which was one of the potential but unrealized ways in which CPUs could have evolved, and which seems to be coming back. From the same article, some details on an upcoming Intel 8-CPU SMT chip:
The presentation also includes some details about Intel's 32nm "Gesher" CPU, due out in 2009. In brief, it's 4-8 cores, 4GHz, 7 double-precision FLOPs/cycle (scalar + SSE), 32KB L1 (3 clocks), 512KB L2 (9 clocks), and 2-3MB L3 (33 clocks). The cores are arranged on a ring bus, just like Larrabee's, that transmits 256 bytes/cycle. Gesher is due out sometime in 2009.
The other thing I missed is that the similar-purpose design from AMD/ATi, the AMD Stream Processor, is advanced enough to have been demonstrated and even benchmarked:
According to an AMD demonstrated system [6], with two dual-core AMD Opteron processors and two next-generation AMD Stream Processors based on the Radeon R600 GPU core running on Microsoft Windows XP Professional, 1 TFLOPS can be achieved by a universal multiply-add (MADD) calculation, which is 10 times the performance of the current top-class dual-processor systems (based on the figure of 48 GFLOPS per Intel Core 2 top-model processor). This shows the benefit of stream processors as large scale parallel processors, providing enhancements in FP computational performance.
Recent demonstrations showed that with AMD Stream Processor optimized Kaspersky SafeStream anti-virus scanning, a system with two AMD Stream Processors and dual Opteron processors reached 6.2 Gbit/s (775 MiB/s) of scanning bandwidth, 21 times that of a plain dual-processor system without AMD Stream Processors, with only 1-2% CPU utilization on the dual processors, showing significant FP offloading from the CPUs to the Stream Processors[7].
Larrabee has also just been speculated to be the reason why Intel bought Havok, an otherwise irrelevant developer of game middleware, paying as much as $110 million for it:
Intel can make Havok's physics engine and other software scream on Larrabee, so that when the new GPU launches the company can advertise "good enough DX10 performance" coupled with "but check out what we can do with physics and AI." If Intel can entice enough consumers with a combination of Havok performance and the promise of real-time ray tracing (RTRT) goodies to come, then the company can deliver a large enough installed base to developers to make the effort of putting out a RTRT version of their games worthwhile.
Well, that is a bit ridiculous: with that money Intel could have bought more than one game studio, and most game studios haven't licensed middleware like Havok, but have developed their own highly optimized multi-platform middleware, because it is actually quite easy to do; and there are lots of people who have written simulation libraries for GPUs apart from HavokFX. Also, real-time ray-tracing can be done today in software, even if of course a multi-CPU chip would help, as was argued earlier here about the Cell.
070919 Wed The high CPU cost of the Linux page cache, more numbers
I have previously noticed that the Linux page cache (the equivalent of the buffer pool in UNIX) code is so inefficient that it almost turns IO-bound operations into CPU-bound ones, unless one has a very fast CPU. Well, while repacking my root filesystem I have also checked the underlying speed of reading from the root filesystem partition with dd (with a block size of 1MiB, writing to /dev/null), both with the iflag=direct option:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  1      0 964224   8952  13880    0    0 68608     0  503  307  1  2  0 97
 2  1      0 964224   8952  13880    0    0 69632     0  446  192  1  4  0 95
 2  1      0 964224   8952  13880    0    0 64512     0  434  174  0  5  0 95
 2  1      0 964224   8952  13880    0    0 68608     0  441  169  1  0  0 99
 2  1      0 964224   8952  13880    0    0 66560     0  436  183  1  5  0 94
 2  1      0 964224   8952  13880    0    0 63488     0  430  164  1  8  0 91
 2  1      0 964224   8952  13880    0    0 67584     0  438  183  0  1  0 99
 2  1      0 964224   8952  13880    0    0 68608     0  451  232  1  3  0 96
 2  1      0 964224   8952  13880    0    0 68608     0  447  205  1  0  0 99
 2  1      0 964224   8952  13880    0    0 67584     0  450  210  0  3  0 97
 2  1      0 964224   8952  13880    0    0 69632     0  451  204  1  1  0 98
 3  1      0 964224   8952  13880    0    0 68608     0  451  217  1  1  0 98
 2  1      0 964224   8952  13880    0    0 67584     0  443  190  0  1  0 99
 3  1      0 964224   8952  13880    0    0 68608     0  456  228  1  0  0 99
and without:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 3  1      0  93408 859992  18832    0    0 68864     0  852 1122  1 19  0 80
 3  1      0  93116 860504  18792    0    0 69376     0  875 1268  1 18  0 81
 3  1      0  93296 860248  18788    0    0 66048     0  914 1323  0 15  0 85
 2  1      0  92996 860632  18820    0    0 69248     0  848 1220  1 16  0 83
 2  1      0  92548 861016  18800    0    0 69248     0  849 1213  1 18  0 81
 2  1      0  92848 860760  18816    0    0 67328     0  833 1184  0 17  0 83
 2  1      0  92316 861272  18800    0    0 68224     0  841 1196  1 17  0 82
 2  1      0  93088 860504  18832    0    0 69376     0  849 1215  1 18  0 81
 2  1      0  92848 860764  18832    0    0 67840     0  879 1267  1 16  0 83
 0  1      0  93088 860508  18808    0    0 67328     0  852 1215  1 18  0 81
 2  1      0  92976 860636  18812    0    0 69120     0  863 1244  1 17  0 82
 1  1      0  92668 860892  18788    0    0 69120     0  857 1231  1 17  0 82
 1  1      0  92308 861276  18820    0    0 66688     0  854 1227  0 16  0 84
In both cases the transfer rate is around 65-70MB/s, but with the page cache in use there is an additional 15-20% of CPU overhead. In other words, 15-20% of this 2GHz socket 754 Athlon 64 in 32 bit mode is 300-400MHz, which spread over 65-70MB/s comes to roughly 5MHz per 1MB/s, and that sounds a bit expensive to me. Also, I have just noticed that while I used to get around 1500MB/s from hdparm -T now I seem to be getting only 600MB/s, and I shall investigate.
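For reference, a minimal sketch of the two dd invocations (assuming the root partition is /dev/hda1, a hypothetical name here), with vmstat 1 running alongside to capture the tables above:

    # direct read, bypassing the page cache
    dd if=/dev/hda1 of=/dev/null bs=1M iflag=direct

    # buffered read, going through the page cache
    dd if=/dev/hda1 of=/dev/null bs=1M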
Anyhow my CPU is 2-4 times slower than the contemporary high-end multi-CPU chips that presumably grace the desktops of top Linux developers (I myself have a top end Core 2 Duo system at work), so they could not care less about a few percent of CPU time wasted.
070918b Tue AMD's 3 CPU chip
In a somewhat anticipated move AMD have introduced a 3 CPU chip to provide an intermediate step between their 2 CPU products and the recently announced 4 CPU ones. That is moderately amusing, but makes sense: since AMD's 4 CPU products have all four CPUs on the same die, a 3 CPU product increases yield, by accepting instead of rejecting chips with one faulty CPU; conversely, current Intel 4-CPU chips have two 2-CPU dies on a carrier, which presumably gives better yields than a single 4-CPU die. AMD also now have a new price point. A clever move.
070918 Tue Sun's 8 CPU, 32/64 thread new chip
I have been amused by this informative description of the design of the new Niagara II chip from Sun. The most remarkable design feature is that the 8 CPUs are in-order dual-issue superscalar. This means that their execution model is quite simple, and Niagara single CPU performance sucks on dusty decks. This is countered by the number of CPUs, and by the ability to run up to 8 threads per CPU. In other words this chip is likely to be terrible for single threaded dusty decks, but can be pretty awesome for custom-written highly threaded new developments, for example in Java. Its architecture is also somewhat reminiscent of the IBM/Sony/Toshiba Cell and the IBM/Microsoft Xenon chips, each of which has multiple CPUs, each being a 2-issue in-order superscalar RISC CPU with multiple threads, and which also suck on code that has not been written for that kind of design. Conversely the Sun AMD64 servers based on AMD Opterons have a reputation for very good performance on dusty decks. An interesting marketing and technical strategy for Sun.
070914 Fri Another used file system test
So after a long delay I have upgraded from Fedora 5 to Fedora 7, and after that I have dumped and reloaded the root filesystem, which has 260k files taking 6.9GB in a 10GB partition. I have timed a tar dump of the root filesystem before and after reloading it, on two different 250GB disks: one (hda1) with a 57MB/s sequential transfer rate and faster seeks, and one (hdb1) with a 66MB/s rate and slower seeks:
Desktop filesystem speed tests (6.9GiB, 260k inodes)

Status    Repack hda1    Repack hdb1
old        11.9MiB/s      10.5MiB/s
new        33.2MiB/s      33.6MiB/s
The ratio here is 2.8, much the same as in a previous upgrade scenario for JFS that I had already reported. Why only 33.2MB/s on a disk capable of a 57MB/s peak sequential rate? Well, lots of small files, as demonstrated by the lower rate on the disk with the faster sequential rate but the slower seeks. On directories with larger files I have seen 60MB/s and more sustained on the newly loaded copy.
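For reference, a sketch of the kind of repack timing involved (the mount point is hypothetical): piping a tar of the tree into dd discards the data but reports the sustained read rate for the whole file set.

    # read the whole tree as one tar stream and discard it; the final
    # dd summary line reports the sustained transfer rate
    tar cf - -C /mnt/root . | dd of=/dev/null bs=1M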