Software and hardware annotations 2008 February
This document contains only my personal opinions and calls of
judgement, and where any comment is made as to the quality of
anybody's work, the comment is an opinion, in my judgement.
- 080226 Tue
Free software and the value of proprietary platforms
- Today I was discussing software usage patterns with a bright,
sensible scientist, and he told me that a few years ago he was using
GNU/Linux because he could get excellent free software on it, and
only on it. But now he is using MS-Windows because all that
excellent free software has been ported to MS-Windows, and he also
has the option of using all the MS-Windows packages on the same
computer without switching systems.
In other words he was using the argument that the value of
a platform is in the value of the software available for the
platform, and that a platform like MS-Windows on which both
proprietary and free software packages are easily available is
more valuable than one where only free software is available.
For advocates of free software like
Richard Stallman, porting free software to
proprietary platforms is indeed a bad idea, because it just adds
value to the proprietary platform, reducing the viability of the
whole free software system. There are two ways to add value to a
competing platform, and it does not matter whether either platform
is free software or proprietary:
- Make valuable software that was previously available only on one
platform available on the competing platform too. This is why, for
example, game console suppliers are so keen on game title
exclusives for their console platforms.
- Make one's platform compatible with the competing platform
(a more direct variant of the portability argument). Developers
will then develop for the competing platform only, because that
opens access to the user bases of both platforms.
However useful in the short term, projects like Gimp for MS-Windows
or Apache for MS-Windows add great value to MS-Windows as a runtime
platform, and compatibility layers like Cedega add a lot of value to
MS-Windows as a (games) development platform. Cedega for example is
likely to have greatly reduced the incentive for game developers to
commission a GNU/Linux port of their products. In the same way OS/2
suffered greatly from its very good runtime compatibility with
MS-Windows, as that removed most of the reason to develop OS/2
specific applications.
What should the supporters of a platform (proprietary or
free) do then? Well, avoid supporting competing
platforms by extending to them runtime or development
compatibility. It is not by chance that Microsoft have been rather
unenthusiastic about supporting the JVM on their platforms, and have
made their browser rather less compatible with other browsers than
it could have been. Had they done otherwise they would have enhanced
the value of the JVM as a runtime platform and the value of
standards-based browsers as development platforms, and they own
neither.
Sure, providing runtime or development compatibility for a platform
that has few native applications may provide a short-term increase
in its popularity, but ultimately destroys it, because the only
recipe for platform success is the hard one: to build an ecosystem
of native applications and of developers that have invested in the
platform.
- 080220 Wed
A 48 drive setup, it had to happen
- I have often been wondering how some people are setting up all the
48-drive Thumper boxes that Sun has been selling in large numbers,
and here is one of the first sightings:
The box presents 48 drives, split across 6 SATA controllers.
So disks sda-sdh are on one controller, etc. In our
configuration, I run a RAID5 MD array for each controller,
then run LVM on top of these to form one large VolGroup. I
found that it was easiest to setup ext3 with a max of 2TB
partitions. So running on top of the massive LVM VolGroup are
a handful of ext3 partitions, each mounted in the filesystem.
but we're rewriting some software to utilize the [ ... ]
In this setup the only saving grace is the splitting of the total
capacity into 2TB slices, which is wise also because of file system
check times. My reckoning is that the rest of the setup is
remarkably inappropriate (a rough reconstruction of the commands is
sketched after this list):
- Parity RAID sets that are far too wide (RAID5 across the 8 drives
on each controller).
- It is likely that the volume group is a linear concatenation of
the 6 underlying parity RAID sets. This is both the least reliable
and the lowest performing combination.
- Assuming the standard Thumper 500GB disks, each RAID set has a
capacity of 3.5TB. This means that several of the 2TB filesystem
volumes will straddle a RAID set boundary, thus leading to total
loss of that filesystem if either underlying set fails.
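To make the criticism concrete, here is a rough sketch of the mdadm
and LVM commands implied by the quoted description (a reconstruction
under assumptions: the device and volume group names, the linear
allocation and the exact options are guesses, not taken from the
report):
# One RAID5 set per SATA controller, 8 drives each (repeat for md1..md5).
mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[a-h]
# One large volume group over all six sets; LVM allocates linearly by
# default, so logical volumes are laid end to end across the sets.
pvcreate /dev/md[0-5]
vgcreate thumper /dev/md[0-5]
# 2TB logical volumes, each holding its own ext3 filesystem.
lvcreate --size 2T --name vol00 thumper
mkfs.ext3 /dev/thumper/vol00
With 3.5TB per RAID5 set and 2TB per logical volume, several volumes
inevitably cross a set boundary, so losing one set can destroy more
than one filesystem.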
- 080219 Tue
Switching my main work PC to a laptop
- For the past few weeks a laptop
has become my main working system. The main reason for this is
that the 160GB disk capacity is large enough that I can put on
it not just my home directory but enough of my archives (papers,
documentation and software) that I can work standalone. That
disk is also quite fast.
I am keeping my older, slower desktop, which has a 250GB and a 500GB
disk set, as a backup and as a repository for archives that I don't
need that often or on the move (for example games).
Another reason why I can use the laptop as my main system is
that its battery life is long enough, the screen is good enough,
and the size is small enough that I can make use of it during a
somewhat cramped commute, reclaiming a bit of time otherwise
wasted, and then it is powerful enough (it is actually a bit
faster than my desktop) to use at home over Ethernet with my desktop
as an X server.
Since I am somewhat conservative for a geek, this sort of means that
laptops have taken over even my computing habits. Not fully yet, as
I still need the desktops for demanding uses like large storage and
games, and for random peripherals that need more power than a laptop
battery or more bandwidth than a USB port can provide. Also, I am
still acutely aware that laptops are not as easy, quick and cheap to
self-repair as desktops, so extended outages are possible, and I
shall keep a backup desktop for quite a long time.
- 080217 Sun
Another RAID and volume management perversity
- The Linux RAID mailing list
sees a constant stream of amusing perversions, and today there
is another one:
start w/ 3 drives in RAID5, and add drives as I run
low on free space, eventually to a total of 14 drives (the max
the case can fit). But when I add the 5th or 6th drive, I'd
like to switch from RAID5 to RAID6 for the extra [ ... ]
Apart from the perversity, the questions above come without
any hint as to the intended use and expected access patterns,
without which it is difficult to offer topical comments, and
that's typical too of many queries to any mailing list.
I'm also interested in hearing people's opinions
about LVM / EVMS. I'm currently planning on just using RAID
w/out the higher level volume management, as from my reading I
don't think they're worth the performance penalty, [
However, I will go out on a limb and list here my general
(context-free) advice on RAID and storage setup:
- As a rule, assume that you don't know what you are doing. Anybody
can use mdadm, but that does not mean that they understand the
performance and reliability implications. If you cannot quote from
memory about a dozen research papers on storage and filesystems,
you usually cannot assess those implications. If you think that
this is elitism, go ahead and make your day.
- If you don't know what you are doing, use RAID10, and thanks to
Neil Brown Linux has a very nice software RAID10 implementation
(a minimal setup sketch follows this list). If you know what you
are doing, you have already chosen to use RAID10.
- Expect both occasional single drive failures, in the 3-10% per
year range, and, perhaps once every year or two, a multiple drive
failure per array (drive failures within an array are not
independent, because of very many common failure modes).
- Arrays with more than say 20 drives are not
a good idea even with RAID10.
- Usually don't bother subdividing your arrays, just use the block
device directly; rather create multiple arrays than subdivide a
large array.
- If you have to subdivide your array, first partition the disks,
and then build arrays on top of those partitions rather than vice
versa (see the sketch after this list).
- Using DM (and its frontends such as LVM or EVMS) is usually
pointless except for a few special cases.
- Worry a lot about backing up large arrays.
- Unless you can afford a tape robot, assume that backups can only
be done to a similar array, especially for arrays meant to contain
large filesystems.
- Don't build arrays larger than 5 to 10TB if you plan to
put a filesystem in them.
- If you really need parity based RAID, worry a lot about
alignment and stripe size.
- Prefer smaller chunk sizes to larger ones.
- Worry a lot about
- If you need to build a really large filesystem, use something
like several smaller filesystems over several smaller arrays
rather than a single large array.
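As a concrete illustration of the RAID10, partitioning and chunk
size points above, here is a minimal sketch (the device names, sizes
and parameter values are illustrative assumptions, not
recommendations for any particular hardware):
# Partition each disk first, then build the array out of the partitions,
# so that a replacement disk only needs a matching partition table.
for d in /dev/sd[a-d]; do
  parted -s "$d" mklabel gpt mkpart primary 1MiB 100%
done
# A 4-drive md RAID10 with a smallish chunk size.
mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=64 /dev/sd[a-d]1
# If parity RAID really is needed, align the filesystem to the stripe:
# a 4-drive RAID5 with 64KiB chunks has 3 data disks per stripe
# (assuming such an array already exists as /dev/md1).
mkfs.xfs -d su=64k,sw=3 /dev/md1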
- 080216 Sat
A RAID and filesystem perversity
- Another wonderfully amusing
entry from the XFS mailing list:
I'm testing xfs for use in storing 100 million+ small files
(roughly 4 to 10KB each) and some directories will contain tens
of thousands of files. There will be a lot of random reading,
and also some random writing, and very little deletion. The
underlying disks use linux software RAID-1 managed by mdadm with
5X redundancy. E.g. 5 drives that completely mirror each other.
The entertainment value arises both because of the plan to use a
filesystem as a simple database manager and because of the RAID1
with 5 drives, with its implication that a RAID1 with 1 drive has 1X
redundancy. As to the use of a filesystem as a database, here are
some numbers from a particularly glorious example:
I have a little script, the job of which is to create a lot of
very small files (~1 million files, typically ~50-100bytes each).
[ ... ]
It's a bit of a one-off (or twice, maybe) script, and currently
due to finish in about 15 hours, hence why I don't want to spend
too much effort on rebuilding the box. Would rather take the
chance to maybe learn something useful about tuning...
and here is a comment on the results of using a simple database,
with some reasons why the filesystem alternative is not such a good
idea:
First, I have appended two little Perl scripts
(each rather small), one creates a Berkeley DB database of K
records of random length varying between I and J bytes, the
second does N accesses at random in that database.
I have a 1.6GHz Athlon XP with 512MB of memory, and a relatively
standard 80GB disc 7200RPM. The database is being created on a
70% full 8GB JFS filesystem which has been somewhat recently [ ... ]
$ time perl megamake.pl /var/tmp/db 1000000 50 100
$ ls -sd /var/tmp/db*
I am often flummoxed by the ability of people to dream up schemes
like the above, in part because I am envious: it must be very
liberating to be unconstrained by common sense, and to have the
courage to explore the vast space of syntactically valid
combinations, even those that seem implausible to people like me,
chained by the yearning for pragmatism.
- The size of the tree will be around 1M filesystem blocks on most
filesystems; the block size usually defaults to 4KiB, for a total
of around 4GiB, or can be set as low as 512B, for a total of
around 0.5GiB.
- With 1,000,000 files and a fanout of 50, we need 20,000
directories above them, 400 above those and 8 above those. So 3
directory opens/reads every time a file has to be accessed, in
addition to opening and reading the file itself (the arithmetic
is sketched after this list).
- Each file access will therefore involve four inode accesses and
four filesystem block accesses, probably rather widely scattered.
Depending on the size of the filesystem block, and on whether the
inode is contiguous to the body of the file, this can involve
anything between 2KiB and 32KiB of logical IO per file access.
- It is likely that the logical IOs relating to the top levels of
the subtree (those comprising 8 and 400 directories) will be
avoided by caching between 200KiB and 1.6MiB, but the other two
levels, the 20,000 bottom directories and the 1,000,000 leaf
files, are unlikely to be cached.
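The directory arithmetic above can be checked with a few lines of
shell (a sketch, using the fanout of 50 and the 1,000,000 files
assumed in the discussion):
# Sizes of the levels of a 50-way directory tree with 1,000,000 leaf files.
files=1000000; fanout=50
l1=$(( (files + fanout - 1) / fanout ))   # bottom directories: 20000
l2=$(( (l1 + fanout - 1) / fanout ))      # next level up: 400
l3=$(( (l2 + fanout - 1) / fanout ))      # top level: 8
echo "directories per level, top to bottom: $l3 $l2 $l1"
echo "opens per file access: 3 directory levels plus the file itself"
That matches the 8, 400 and 20,000 directories mentioned above, and
the four opens (and thus roughly four inode plus four block reads)
per file access.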
- 080210 Sun
Some more data on filesystem checking speed
- Usual points about
fsck speed being an issue:
As a followup, a couple of years back when I was
deploying U320 1TiB arrays at work, we filled each of them with
~800Gb of MP3s, forcibly powered down, and did fsck tests. ext3
was ~12 hours. reiserfs was ~6 hours. xfs was under 2 hours.
XFS got used.
Yesterday I had fun time repairing 1.5Tb ext3 partition,
containing many millions of files. Of course it should have
never happened - this was decent PowerEdge 2850 box with
RAID volume, ECC memory and reliable CentOS 4.4 distribution
but still it did. We had "journal failed" message in kernel
log and filesystem needed to be checked and repaired even
though it is journaling file system which should not need
checks in normal use, even in case of power failures.
Checking and repairing took many hours especially as
automatic check on boot failed and had to be run manually.
That's for 1TB, 1.5TB, and 2-3TB filesystems, but currently many
people would like to deploy 8TB filesystems (and some excessively
brave chancers would like much larger filesystems).
> I'll definitely be considering that, as I already had to wait hours for
> fsck to run on some 2 to 3TB ext3 filesystems after crashes. I know it
> can be disabled, but I do feel better forcing a complete check after a
> system crash, especially if the filesystem had been mounted for very
> long, like a year or so, and heavily used.
The decision process for using ext3 on large volumes is simple:
Can you accept downtimes measured in hours (or days) due to fsck?
No - don't use ext3.
There's no workaround for that. Do not ever ignore the need to run fsck
periodically. It's a safe thing to do. You can remount the xfs volume as
read-only and then run fsck on that - that's another thing to take into
account when setting things up.
Would that then be 20 hours? Admittedly a very large part of the
checking time is proportional to the number of files, and many 8TB
filesystems will contain much larger files than MP3s. But some
filesystems will have lots of small files. In general I suspect that
chunked file systems are a good idea.
Note: this entry was mostly written in 0708.
Slow transfer rate over SSH and improvements
- More or less by default the SSH2 protocol has become the standard
for simple inter-node operations, effectively replacing TELNET, RSH
and FTP, and it has become the favourite proxy for X11, SOCKS, etc.;
the main reason is not so much that it is secure, but that its
implementations (and their MS-Windows equivalents) are so
convenient: just the SSH agent makes it very easy to automagically
connect to various hosts without prompts, and the reliance on a
single TCP port allows easy firewall traversal for a number of
different proxied protocols. These are considerable advantages over
implementations of alternatives like TELNET over SSL.
The most significant problem with SSH is that when used as a
transport for bulk data it is quite slow. The most recent example
was a report that an otherwise very convenient SFTP-based file
transfer tool could only transfer around 1.7MB/s over a link
measured as able to transfer around 100MB/s over TCP, so I checked
and that was indeed the case. A web search confirms that this is
indeed a common complaint. The main reasons are:
- The SFTP protocol is half-duplex, and this makes it particularly
slow. It is much faster to use the SCP protocol with WinSCP (even
if it has some serious limitations, most importantly
non-restartability) or FISH as implemented by several clients.
However the best protocol over SSH, and arguably the best protocol
for bulk data transfer overall, is rsync.
- The default SSH encryption algorithms can take a lot of CPU time,
and CPU time can cause relatively long pauses between packets. The
solution is to use the least expensive ciphers (see the example
after this list). The quickest is arcfour, which is reasonable
even if it is weak over long sessions, and the second cheapest is
blowfish, which is recommendable for most uses. In particular many
AES implementations are rather slow (notably the one in WinSCP)
and 3DES is usually very slow.
- SSH is intrinsically a poor protocol for bulk data
transfer, and most implementations use small packets with
small buffers as that works well with SSH as a substitute for
TELNET. There is a
patch for the OpenSSH implementation
that improves performance for bulk data transfer over SSH.
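As an illustration of the cipher point above, here is how the
cheaper ciphers can be requested for a bulk copy (a sketch: the host
name and paths are made up, and arcfour is only appropriate where
its weakness over long sessions is acceptable):
$ scp -c arcfour bigfile.tar user@example.org:/data/
$ rsync -a -e 'ssh -c blowfish-cbc' archive/ user@example.org:/data/archive/
The same -c option can be given to ssh itself for tunnels and
proxied sessions.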
- 080202 Sat
Coupling and active redundancy
- Active redundancy as in clustering involves coupling --
because the exercise of the redundancy depends on
communications. Ideally this does not cause problems, but in
theory and in practice there are two big issues.
The theoretical big issue is that loss of communications cannot in
the general case be distinguished from loss of redundancy, and even
polling does not help unless there is a central synchronizing agent,
which is then itself a single
common mode of failure
(the even bigger problem is that it is not widely understood that
distributed computation without a central synchronizing agent
belongs to a fundamentally different class of models of computation
from those equivalent to Turing machines).
The practical problem is even worse: in practice it is
extremely difficult to allow for communication and recovery
among redundant elements without introducing common modes of
failure, simply because it is very difficult to communicate and
recover across entirely different designs and technologies,
because communication and recovery require a high degree of
compatibility.
Consider for example one of the most relied upon redundancy schemes,
RAID: nearly always the disks share the same location, power supply,
cooling, receive the same commands from the same OS over the same
bus, are subject to much the same access and wear patterns, and
ridiculously enough are often of the same model and even from the
same manufacturing batch.
More insidiously, when complicated communication protocols
are used, it is often easiest to ensure compatibility by using
the same software from the same manufacturer on all the members
of the redundancy set. The compatibility requirement thus
creates couplings in many cases. For example it is fairly common
for manufacturers to state that clustered systems only work
reliably if the same software product of the same
version is used, thus guaranteeing that the set
of bugs is exactly the same across all supposedly redundant
elements.
This is a general problem, and I remember reading that in effect in
modern experimental physics independent verification of results is
much harder than just repeating an experiment somewhere else,
because physicists tend to share and reuse the same software codes
and products.