Sabi notes draft posts

This document contains only my personal opinions and judgement calls, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

This document contains unpublished drafts. Please do not refer to it or link to it publicly.

171212 Tue: Ansible configuration structures

I shall illustrate some issues with Ansible, a configuration management system that is both popular and makes the issues clearer than most.

In Ansible the main entities and relationships are:

Customization of a configuration file depending on context can happen for example in the following ways:

The data that drives such conditionality can be expressed:

There are very many choices available, and not all choices are equally valuable for minimizing dependencies and maximizing maintainability. Writing tasks, templates, data files, and inventory groups with a haphazard approach produces something that is very difficult to unravel later.
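
For concreteness, a minimal sketch of a conventional layout (all names hypothetical), showing how many competing places can hold the tasks, templates, and data just mentioned:

inventory/hosts                      # inventory, with group definitions
group_vars/webservers.yml            # data per inventory group
host_vars/web01.yml                  # data per individual host
roles/nginx/defaults/main.yml        # role default data
roles/nginx/vars/main.yml            # role fixed data
roles/nginx/tasks/main.yml           # tasks, possibly conditional
roles/nginx/templates/nginx.conf.j2  # template with embedded conditionals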

170713 Fri: Firefox image size

  PID    VIRT    RES    DATA    SHR CPU %MEM     TIME+ TTY      COMMAND
 6716 5383024 1.804g 3642440 237080 1.6 23.7 328:24.73 ?        /usr/lib/firefox/firefox -P def+
  405 4799456 1.467g 3542988  84496 0.0 19.3 491:40.11 pts/9    /usr/lib/firefox/firefox --new-+
27053 4052620 235516 3132780  45660 0.0  3.0 227:52.88 ?        kdeinit4: plasma-desktop [kdein+
27035 3209436 167740 2492344  23948 0.0  2.1 254:37.13 ?        kdeinit4: kwin [kdeinit] --repl+
 3743 2155024  11236 1918840      0 0.0  0.1   2:23.27 ?        akonadiserver
 2935 2102680   1620 2052944      0 0.0  0.0   0:00.86 ?        /usr/sbin/console-kit-daemon --+
16194 2090552   2444 2064968      0 0.0  0.0   0:16.61 ?        /usr/sbin/privoxy --pidfile /va+
 3813 1674824  53280  832428      0 0.0  0.7 124:47.88 ?        /usr/bin/krunner
 3613 1386240 208072  649288  10996 0.0  2.6 169:22.29 ?        kdeinit4: kded4 [kdeinit]      +
 3471 1161784 567036  161904 488544 8.1  7.1   1249:47 tty9     /usr/lib/xorg/Xorg :9 vt9 -dpi +
 3935 1056268  11620  493524      0 0.0  0.1   2:06.43 ?        /usr/bin/knotify4
 4437 1050408 233420  353716  18176 0.0  2.9  83:38.47 pts/8    emacs -f service-start

170713 Fri: The routed Ethernet wave

As mentioned previously, I have quite liked, as the general approach to connectivity redundancy and service mobility, using IP service addresses that are fully routed by themselves (fully unicast), with the servers carrying those services running OSPF (or similar) to advertise their reachability. It is in practice like multicasting but without the added complications (and in effect OSPF distributes route updates by multicasting).

The main advantage of simply leveraging IP routers to distribute reachability updates is indeed simplicity: it requires no tunneling and no weird layer-2 tricks; there are plenty of troubleshooting aids for IP; routing is needed regardless; and in recent implementations it is very fast even for large numbers of individual addresses.
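
As a minimal sketch (addresses hypothetical, and the routing daemon only indicated): the host side of this scheme is just the service address on the loopback interface, plus a routing daemon that advertises it:

# assign the fully unicast service address to the loopback interface
ip address add 192.0.2.10/32 dev lo
# an OSPF daemon (e.g. Quagga ospfd with 'redistribute connected')
# then advertises the /32 host route to the local routers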

However, the ability to use Ethernet addresses as identifiers and the quasi-broadcast nature of Ethernet, plus habit from the time when layer-2 switching was much faster than layer-3 routing, have meant that so-called layer-2 adjacency is highly prized, even in its physically discontiguous variant.

Now how can physically discontiguous layer-2 adjacency in Ethernet be a thing? The premise is that adjacency is a property of physical links (the ether in the name) or at least of physical switches.

The usual answer is to somehow merge physical links or switches to have VLANs spanning multiple areas, but that has the downside that one needs lots of switches, or long cables, and results in large broadcast domains, all issues that fully unicast addresses don't have.

Well, a pretty major datacenter product that all but requires layer-2 adjacency (1, 2) is VMware, which is popular despite the severe disadvantages of virtualization, and therefore in recent years major network vendors and entities have sought to support fine-grained, dynamic layer-2 adjacency: first by dynamic layer-2 forwarding, for example by virtual wires where layer-2 frames are encapsulated in other layer-2 circuits or virtual circuits, such as Ethernet-in-Ethernet (e.g. Q-in-Q) or Ethernet-in-whatever (e.g. VPLS, VxLAN layer 2); but the recent trend is routed Ethernet, in various guises and usually disguises.

The most common technique (if that is not too noble a word for what is base cheating) is encapsulation of layer-2 frames into layer-3 packets or even layer-4 datagrams, and then routing them; techniques like SPB, layer-3 TRILL, or the Enterasys scheme.

Therefore horrors like using IS-IS to route Ethernet frames over MPLS. And lesser horrors like the Enterasys scheme:

That is still amazingly perverse: in order to use Ethernet identifiers in a location-independent way, they are mapped onto IP addresses used as identifiers, by associating each with a route, and then these routes are distributed by IP multicast, mapped onto multiple Ethernet broadcasts.

It is less bad in the straight Ethernet-mapped-to-IP-and-OSPF-and-then-back-to-Ethernet version than in the insane versions where MPLS and IS-IS are involved as an extra layer. The complexity of involving another protocol family and layer of encapsulation, with a potentially different logical topology, and one in effect circuit based rather than datagram based, has severe downsides, including a loss of security and safety, due to the extra complexity preventing easier auditability of configurations and traffic.

160501 Sun: The ELF-a.out story, 'dbus'/'systemd', containers

In a recent typical discussion about systemd on the usual LWN site, there is a comparison between the transition from System V init to systemd and the transition among the Linux executable formats a.out, COFF, and ELF around 1995.

That transition, while far less damaging in the long term than the transition to dbus and systemd potentially is, was not wholly glorious either, because it was a mixed blessing at best: an unnecessarily complex mode of program execution that, along with other related changes, caused a lot of bloat and confusion.

The key issue was the handling of dynamically loaded and shared libraries, up from the (usually static ???

160104 Sat: Btrfs update

I am now using Btrfs for most of my filetrees, and it seems quite reliable in basic usage even with the Ubuntu LTS 14 kernel 3.13 and even better with the more recent backported kernels 3.19 (since ULTS 14.04.3) and 4.2 (since ULTS 14.04.4).

I use it primarily because of the checksumming, and secondarily because of the copy-on-write, not because of the storage management features. So I occasionally run the btrfs scrub operation that verifies data, and recently in around 5TB of data and backup filetrees it found 1 bit error. I checked the system logs, and since the error was unreported there it was obviously one of those silent errors, perhaps related to the usual situation that storage devices are typically rated for 1 unreported bit error per 10^14 bits (about 11.4TiB).
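
A scrub is a simple online operation; something like this (mountpoint hypothetical):

# verify all checksums on the filesystem mounted at /fs
btrfs scrub start /fs
# report progress and the number of errors found (and corrected)
btrfs scrub status /fs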

As to the storage features they are not quite as strong as those of a dedicated storage layer like MD RAID but they are convenient sometimes. In general they help with the UNIX style of operating on trees rather than block devices.

As to the storage features, the best is clearly the ability to create subvolumes cheaply, that is multiple roots in a filetree, as in NILFS2 and JFS (where it is not really available in Linux) and ZFS; also to have cheap snapshots of subvolumes thanks to something like copy-on-write, as in NILFS2 and ZFS. The RAID10 implementation is also fairly usable and quite reliable.
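
For example (paths hypothetical), creating a subvolume and then a cheap read-only snapshot of it:

btrfs subvolume create /fs/home
btrfs subvolume snapshot -r /fs/home /fs/home-20160104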

The parity-RAID implementation is still unreliable, and it may even be a bit too flexible, as it will create parity stripes as wide as the number of devices that have available space, so if devices are of different sizes it will silently create stripes of different widths, down to 2 chunks for RAID5, one for data and one for parity (effectively equivalent to RAID1). Which is admirably complete, but rather more expensive than a RAID1 of 2 devices or a RAID10 of 3 devices.

The big problem with Btrfs is that it is very, very complex, with a lot of code, and large opportunities for race conditions and corner cases. But the base functionality seems to have been quite reliable for a few years already. It is unfortunate that simpler filesystems like JFS or NILFS2 would require an on-disk format change to add checksums.

150525 Sat: Very long time to upgrade Ubuntu LTS

The two desktops and the laptop at home had Ubuntu 12 LTS installed, and I decided yesterday to upgrade them to Ubuntu 14 LTS without waiting for Ubuntu 16 LTS, in part because Ubuntu 14 LTS is the last LTS version that does not use systemd, and it will be maintained until 2019.

I decided to do an in-place upgrade, by switching to the trusty archives instead of the precise ones. I expected some difficulty, as those systems have many packages installed (around 5,000), since I have installed many just in case or to try them out, and some are backports or from less official sources.
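
The switch itself is simple, roughly like this (as root; details vary with any extra APT source files):

# point APT at the trusty archives instead of the precise ones
sed -i 's/precise/trusty/g' /etc/apt/sources.list
apt-get update
apt-get dist-upgrade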

I did not expect it to take so long... first I did a test in-place upgrade on the desktop with the fewest packages installed; that started the day before yesterday around 1pm and was mostly finished by 8pm, with some more tweaking yesterday. The desktop with many more files I started around 3pm yesterday, and it was mostly finished around noon today. The laptop upgrade started around 7pm yesterday and was mostly finished around 10am earlier today. While I left downloads running during the night, some of the night hours were idle.

The biggest reason for the long time to upgrade, as the difference between the desktops and the laptop shows, is extremely slow file operations on disk drives: the laptop has a flash SSD, which made deleting old files and unpacking archives and writing them out far quicker. The issue is that the root filetree has lots of small files, in my case over 700,000 in around 18GB of data, and nearly 500,000 of these are less than 4KiB in size, and 300,000 are less than 1KiB in size. Indeed on the desktops I could hear the constant seeking of the disk drives during the upgrade. Many of these files should be part of archive files. This issue is more significant when using filesystems that implement fsync semantics properly, as DPKG also uses fsync properly, which is a good idea, but perhaps less so when there are very many updates involving very many small files. Yet another instance of the mailstore problem.
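
Counting such small files is easy, for example (as root, confined to the root filetree; the thresholds here are exact byte sizes):

find / -xdev -type f -size -4096c | wc -l   # files under 4KiB
find / -xdev -type f -size -1024c | wc -l   # files under 1KiB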

Two other reasons that involved quite a bit of time were sort of inevitable: I specified that I wanted to be asked whether to keep or overwrite existing configuration files, and my choice of having backported or unofficial packages required some manual handling.

But the second biggest reason for the long time to upgrade, which was of course even bigger on the laptop, was that the Debian and Ubuntu packaging format and related tools (DPKG is the package manager, APT is the dependency management system) have some large flaws:

150328 Sat: Designing UIs for skewed aspect ratio monitors

Reading about adapting some aspects of GUI in Ubuntu's Unity design to portable device screens:

Today’s scrollbars are optimized for cursor driven UI but they became easily unnecessary and bulky on touchable and small screen devices. In those cases, optimization of the screen’s real-estate becomes essential. Other platforms optimized for touch input like Android and iOS are already using a light-weight solution visible only while dragging the content.

This is interesting because there is one important detail that GUI designers usually get wrong: as I have pointed out, given the current skewed aspect ratios of monitors, GUIs should put their decorations and items mostly on the sides, where there is most space. To add to that earlier post, I was amused to see that this is an issue for others too, as shown by this point that vertical maximization matters (I use it a lot too) to make the most of the limited height of landscape-skewed displays.

So what's the solution? As I mentioned, in practice the only options are either to use a GUI framework that allows moving most GUI elements to the left or right, or to get a taller display.

As to GUI elements I should add that KDE also has repositionable icon bars (called toolbars by KDE) that can be moved to any side, including left and right, and I do that for most.

This is particularly important with a 16:9 aspect ratio, such as 21in to 23in 1920×1080 displays, and even more so on 12in to 14in 1366×768 laptop displays, as vertical space is so limited. The issue is less urgent on 1920×1200 displays, as the aspect ratio is less skewed, or on 27in or larger 2560×1440 displays, as they have rather more vertical space both in pixels and in physical extent.

150218 Sat: Startup dependencies, some cases and reflections

In the previous discussion about resource and daemon states, in mentioning upstart, it was pointed out that it implements a generic dependency mechanism, and that the policy can be either to start daemons when their dependencies become available, as in dependency push, or to request the availability of those resources before starting a daemon, as in dependency pull. The two policies are not equivalent in two important ways:

Which means that daemon managers should be based on a pull logic. The obvious example of that is the classic inetd daemon manager, which only starts a daemon when its service port is accessed. Put another way, daemon managers should be more like makefiles than (parallel) sh scripts.
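
The classic inetd.conf format expresses exactly this pull logic, one service per line, the daemon being started only when its port is accessed (daemon path hypothetical):

# service  socket  proto  wait    user  program            arguments
ftp        stream  tcp    nowait  root  /usr/sbin/in.ftpd  in.ftpd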

In other words when the system starts the only process started should be the daemon manager, and nothing else should happen by itself. No driver should be loaded, no disk should be recognized, no other process should be started until the daemon manager senses a request for service on one of the service ports it monitors. These may be serial lines, consoles, or IP ports.

For example if a SAK is detected on a console the login daemon should be started on it, and this may try to access the /usr filesystem, and that might trigger a mount, which might trigger an access to the NFS server, which can then trigger bringing up the relevant IP interface, which may load the appropriate driver module, and its configuration.

In the UNIX tradition the make-style logic of pulling resource activation has been implemented in several different ways other than inetd, for example getty for text consoles, xdm for X11 displays, amd and autofs for mounting filesystems.
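
As a small autofs sketch (map names and server hypothetical), where a first access below the mountpoint pulls in the mount on demand:

# /etc/auto.master: mount things under /srv on demand
/srv  /etc/auto.srv  --timeout=60
# /etc/auto.srv: accessing /srv/data triggers the NFS mount
data  -fstype=nfs,rw  fileserver:/export/data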

In general the latter looks to me the better model: that services be represented in the filetree as mountpoints, and accesses below that mountpoint should trigger activation of the relevant service daemon. While it would be nice to have a clear single model of daemon activation and

150124 Sat: Startup dependencies, some cases and reflections

In the recent and not so recent past I have noticed cases where startup or other actions were done in the wrong order or at least at the wrong time:

Both instances are symptoms of a general issue that has little to do with dependency-based service and daemon management, even if somewhat related.

The issue is that UNIX-style systems (like many others) have no well defined conceptual model of resource states and transitions, never mind managing their dependencies.

Which makes me remember how the MUSS operating system had a well defined, realistic, and still simple notion of which states devices (applicable to many resources) could be in, and of the allowed transitions (plus a very well designed system for interprocess signaling and communication), and that was many decades ago.

In the absence of a well defined model of states and transitions talking about daemon management is a bit vacuous, but there is another specific issue that applies regardless, and it is which type of dependencies ought to be managed.

There are service and daemon management systems like upstart which are explicitly designed around receiving and sending events: when an event is received by upstart a process is instantiated, but processes can send events at any time, not just when they end, and receive them at any time too.

This is a fairly low level mechanism because it can be used to implement fairly arbitrary dependency policies, either those where a daemon is started to use a resource when that resource becomes available, or those where a daemon that needs a resource waits until it becomes available.

The difficult part thus is not in the mechanism that allows daemons to depend on other daemons or resources, but in designing well structured policies for representing those dependencies, and this still depends on a clean conceptual model of resource and daemon states.

150110 Sat: The problem of multiple states

One of the more fundamental aspects of the Microsoft style of system design that is being adopted more and more in GNU/Linux systems is the avoidance of fixing issues with existing software, replaced by working around those issues with layers of wrappers.

This is often well motivated, mostly by organizational issues: it is often very difficult to get existing software modified, in part because of territorialism by existing maintainers, in part because of the fear of the possible consequences for stability.

It is sometimes necessary to combine several diverse programs to achieve a goal; for example recently I set up a system with an AFS fileserver relying on a filetree stored on a DRBD storage area relying on IPsec for data privacy; or my laptop with a cellular network connection, which needs a driver for the USB cellular modem, a program to enable it with a PIN, a PPP daemon to establish a connection, and a firewall script to delimit it.

In all these cases it is very hard to document or automate how to operate each setup. The difficulty is that each of the components has several states, and the possible combinations of states across all components can be many, thus requiring variable but significant sequences of steps to reach a desired end state.

To some extent this is unavoidable in a system made of several moving parts: a lot of component states must align the right way to start up a GUI session on my desktop too.

The abundance of possible state combinations is what makes finding the cause of a fault difficult and lengthy: difficult because the fault finder must have a mental model of all the possible state combinations, and lengthy because it can be pretty time consuming to explore them to figure out which part is in a faulty state.

The abundance and intricacy of the dependencies among combined states for non-trivial systems requires problem-solving when trying to get from a given current state to a desired other state, which can be for example having a browser capable of accessing the Internet over a PPP connection via a 3G modem.

However I have noticed that a lot of people try to avoid problem solving, and they want instead a series of steps that they can follow mindlessly to produce the result they desire, a way of proceeding that in management jargon is called deskilling.

I have noticed this quite often in IRC channels or on blog posts or self-help books targeted at practitioners, which tend to offer simple, linear N-step copy-and-paste procedures to do things that are actually very subtle, such as how to configure GNU/Linux for high-speed 3D graphics, or MS-Windows for large distributed storage setups. The usual context is the notion that busy people don't have time to understand what they are doing, and they just want to be able to do very difficult and complicated things in 12 easy steps.

But doing difficult and complicated things usually involves many layers of abstraction and implementation, and several interfacing parts, each with many states.

It is sometimes possible to wrap these with simple processes for people to follow or scripts for computers to execute, but it is almost never possible to do so in a failproof way, because handling potential mistakes or failures cannot as a rule be done with any simple process or script. And it is a big challenge to devise a suitable complex process or script that can handle every possible mistake or error state, because when many state combinations are possible, usually only a small number are good.

This is why people who attempt to configure non-trivial network setups with NetworkManager under GNU/Linux or the MS-Windows Control Panel, even within the limits of those tools, often run into significant confusion: those tools are designed to hide the complexities of the states of the lower level tools they use, but this cannot be done fully, because those states are far more complicated than appears via the higher level tool, creating deep difficulties in investigation in case of problems, plus notable restrictions in what is achievable.

This is a problem even with simple wrappers like my firewall script: even if it is rather limited in what it can configure, it cannot mask the far more complicated states beneath it, but then it does not pretend to.

The solution for me is to define clearly the expected states, in an economical way and with a view to minimizing interactions with other states, to reduce the number of possible combinations of states; plus to show how operations change states, and to give guidelines on how to build sequences of operations to return to a desired state from an invalid one.

In my experience and from many sources the key to the learnability of a topic is for learners to form a mental model of how it works.

140926 Fri: A large backup pool that works well only over a short time

As to storage insanities (for example 1, 2, ...) one of the most common syntactic delusions is about single large storage pools with very many small files, and there is a recent excellent example:

Subject: Is XFS suitable for 350 million files on 20TB storage?

Hi,

i have a backup system running 20TB of storage having 350 million files. This was working fine for month.

But now the free space is so heavily fragmented that i only see the kworker with 4x 100% CPU and write speed beeing very slow. 15TB of the 20TB are in use.

Overall files are 350 Million - all in different directories. Max 5000 per dir.

The author is well meaning of course, and seems to be a strong adherent of the syntactic approach, which is to consider all syntactically valid configurations equally plausible and desirable from a technical point of view, but is having second thoughts after some months. Regrettably most syntactically valid configurations don't work that well (and it takes insight and skill to understand which ones do and why), and in particular single large free-space pools don't scale, except at the beginning when they are empty and there is no load; then things get interesting as storage fills up and load increases.

Note: Many companies claim they offer very scalable large storage pools, both in size and speed; the most recent news I read was about a company quite suggestively called StoragePool.

The specifics of the madness above are moderately interesting:

140228b Fri: Applications and how to scale font glyph sizes

Continuing the previous post on distance-scaled font glyphs, the conclusion was that scaling display DPI is the easiest option. But things are not as simple as that.

Old style X Windows configurations allow preparing several different potential display configurations at different pixel dimensions, and the DPI is recomputed accordingly, but that is not the right mechanism, as it changes the logical pixel dimensions.

Various releases of the regrettable RANDR extension allow changing dynamically the reported DPI of the display, in which case the reported size of the display in millimeters (rather than in pixels) is scaled accordingly. This can be done with one of the following (depending on whether it is for the whole screen or for a single output only):

xrandr --dpi 130
xrandr --dpi 130/VGA1

Most applications read the current DPI from the X server only when they start, which is inconvenient. But desktop environments cause more trouble, as some keep persistent state that may include the DPI and will be given to newly started applications. I have tried a few and have had mixed results:

xterm -fa 'liberation mono:size=10'

With xterm using the FontConfig font system text is correctly scaled for the current DPI when it starts, but subsequent DPI changes are ignored.

xterm -fn '-*-liberation mono-medium-r-*-*-*-100-0-0-*-*-iso10646-1'

With xterm using the old style X11 font system with a scalable DPI-independent (the -0-0- part) XLFD font specification there is no sensitivity to DPI changes either at startup or after.

This is probably because the laughable code that forces the DPI in a scalable font specification to be either 75 or 100 is still there.

xterm -fn '-*-liberation mono-medium-r-*-*-*-100-130-130-*-*-iso10646-1'

With xterm using the old style X11 font system with a scaled, DPI-explicit font XLFD I do get on startup fonts scaled according to the DPI; the results are good for an outline font like Liberation Mono but of course much uglier for a bitmap font like -sony-fixed-*.

GTK+ 2 applications

The GTK+ 2 library gets its settings from the ~/.gtkrc-2.0 file, from the X server resources, and from the X server itself.

It gets the font specification in particular from ~/.gtkrc-2.0, and that does not include the DPI, and gets the DPI from the Xft.dpi X resource if present else from the X server idea of DPI.

GTK+ 2.0 applications as a rule only read the DPI at startup.
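
So to change the DPI that newly started GTK+ 2 applications will see, one can set the Xft.dpi resource; a minimal sketch:

# merge an explicit DPI into the X server resource database
echo 'Xft.dpi: 130' | xrdb -merge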

Qt 3 and 4 applications
Emacs 23

Emacs font support is somewhat complicated because it can be built in two different X Windows supporting versions, one with the Lucid library, which only uses the X11 font system, and one with the GTK+ 2 library.

The font renderer in the GTK+ 2 version can use either the old style X11 font system or the new style FontConfig one, just like xterm, but in a haphazard way: which one is used depends on the syntax of the font specification given to commands like set-frame-font (which autocompletes only XLFD font specifications); while the menu-set-font dialog only recognizes GNOME style font names (similar but not the same as the FontConfig ones) and the mouse-select-font dialog only allows choosing X11 font system fonts.

With XLFD style fonts Emacs behaves like xterm with the same.

With GNOME style fonts it is impossible to specify the DPI, so one must specify a large point size, as in for example Liberation Mono 14 with set-default-font, or choose explicitly a larger point size with mouse-set-font or the Options > Set Default Font dialog. At least this applies the default fonts to all Emacs content windows.

The above only applies to the text displayed by Emacs in its main panel. The menus and dialogs are rendered by GTK+ 2 and therefore the font sizes are fixed when Emacs starts like for other GTK+ 2 applications.

konsole

This is a typical KDE application and has two different behaviours: when it is launched from an existing KDE wrapper like the application menu, or equivalently from the command line with kwrapper4, it does not notice changes in DPI, because the KDE session caches the DPI.

I haven't been able to find a way to refresh the cached value.

However when started on the command line on its own it does notice the current DPI at startup. Later changes are not noticed.

However KDE offers Enlarge Font and Shrink Font commands accessible by the traditional key combinations C−+ and C−-.

Regrettably this is on a per-window basis, so it does not scale all konsole instances once and for all.

140112 Sun: Kerberos tickets addressless or not
140131 Fri: Three approaches to user environments

I was reading a blog entry about user environments, and somewhat accidentally it pointed out that there is a third type of user environment, which should have been obvious to me, even if I used to think that there are two types of user environments:

The main difference in the above is interoperability:

A recent entry in another blog (on the Debian or KDE planet) points out that there is a combined type, by saying that GNOME is evolving from an open environment into a launcher one, where the environment launches applications that are themselves in essence workspaces.

MS-DOS and MS-Windows were and are popular examples of this, and so are smart cellphones (while non-smart cellphones run workspace environments).

To some extent launchers combine the disadvantages of workspace and open environments: the users suffer from a much reduced ability to make applications interact, and often the only way is to dump data into a file and reread it from another application, if the file formats are vaguely compatible.

So the question is why they are so common, and I think that is because they are good for developers: precisely because of the isolation of the applications from each other, developers have a much simpler task, and can make each single application look best at what it does, rather narrowly. At the moment of choosing an application most users evaluate it narrowly too, for how well it looks and does whatever it is targeted at, and only later does it become clear how awkward interoperability is.

140204 Tue: UNIX hierarchies, paths, and the LFS

The established UNIX tradition is to use hierarchical filetrees as keyword based classification systems, and since they are hierarchical there must be a priority among classification criteria. The established UNIX tradition is to order keywords into:

In this classification major trees are administered differently, minor trees are collections of optional material, and role trees aggregate files that might be used in similar ways.

131214 Sat: Booting Linux and the 'init' controversy

Booting a system is often a delicate and fragile operation, as it relies on the cooperation of several bits of software and firmware written by very different people and organizations, and recovery from boot problems can be quite awkward, precisely because until the system is booted very few tools are available.

For UNIX this usually meant having a boot program that is loaded and started by the firmware, which then loads and starts the kernel itself, which then loads and starts a shell script which, on the basis of user input, forks a set of services to enter either single-user or multi-user mode.

Early on this simple picture was slightly complicated by the introduction of a program called init that would be loaded and started by the kernel and would fork itself in two, one fork loading and starting the shell scripts. The existence of init was due to two technicalities of the UNIX system:

131113 Wed: MD RAID huge stripe cache effect on aligned writes

Configuring a system with 24×1TB 2.5in drives resulted in a choice of 4× (4+2) RAID6 sets. It is not as good as a RAID10 would have been, but considering the usage pattern (a lot of reading largish archived files) it is one of the few RAID6
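
The entry breaks off here; the tunable the title refers to is presumably the per-array MD stripe cache, which can be enlarged like this (array name hypothetical; the value is in pages per device):

# the default is 256; a huge cache gives aligned writes a better
# chance of coalescing into full-stripe writes
echo 8192 > /sys/block/md0/md/stripe_cache_size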

130409 Tue: The mail store issue, IOPS, directory locking, spooling

Having discussed in the past the message store problem, my main points were in summary that:

But I was discussing this briefly some time ago as to the decision of a site to store messages for a Dovecot IMAP server in Maildir, and I was told it was not a problem for them, and I was astonished. Then I saw their racks full of 146GB 15,000RPM SAS disk drives and realized that they were willing to pay for the privilege, with a storage layer capable of massive random IOPS.

A disk tray with 16× 146GB 15K RPM SAS disks has a capacity of around 2TB, and has 16 disk arms where a single 2TB disk has only one. It might also have much higher transfer rates, perhaps not 16 times higher, but that is far less impressive than 16 independent positioning arms.

Because for rotating disk storage the achievable IOPS for page sized (4KiB) transfers is around 2-3,000 for sequential transfers and 100-150 for random ones, or alternatively transfer rates of 80-120MB/s sequential and 0.5MB/s random.
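
The random figure is just the IOPS times the page size, as a quick shell check shows:

echo $((125 * 4096))   # 512000 bytes/s, i.e. ~0.5MB/s at 125 random IOPS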

There is a factor of over 100× between sequential and random transfer rates, and even a tray of 16 fast small drives only partially bridges that, even if it goes a long way. The overall result however is that it pays to arrange for data to be laid out sequentially, to the point that it is quicker to have files that contain coarsely selected data, and read it all and then keep only that which is relevant.

Which is one of the reasons why an archive containing many members is usually a win for collections of small, even weakly related, data items like e-mail messages, as long as it pays to process them in bulk and TBC.

However an archive containing many members has one seemingly trivial issue: when it is being updated it must be locked to prevent interleaved updates.

One reason why the Maildir structure was introduced to replace the mbox archive format was indeed locking: many implementations of the POSIX/UNIX API implement file locking unreliably, especially over popular network filesystems, but implement reliably, with implicit locking, operations over directories. It is important to understand that locking still occurs when adding or removing a message to or from the mail archive, as that is implemented by adding or removing a file in a directory, which means atomically updating the directory; so the following argument is false by the omission of file between no and locks:

Why should I use maildir?

Two words: no locks. An MUA can read and delete messages while new mail is being delivered: each message is stored in a separate file with a unique name, so it isn't affected by operations on other messages.

An important detail is that it is therefore designed for the very rare case of a mail archive which is frequently written concurrently by multiple programs. What is notable is that Maildir was introduced together with an MTA, and in particular as its spool queue format and its delivery agent inbox format.

In other words for the case not of a mail archive, but of a mail spool queue, and a fairly small one too. Then a spool directory might be expedient, even if not quite right either, and spool queues are fairly often implemented with members being individual files in a directory, instead of parts of an archive file. But again for small numbers of members, because as the newsspool story demonstrates even spools have been converted to a dense single file when they became large (in the case of the newsspools the spool is also accessed as an archive).
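
Schematically (filename hypothetical), Maildir delivery still relies on the implicitly locked, atomic directory update mentioned above:

# deliver the new message out of sight of readers, under tmp/
cat >Maildir/tmp/1365500000.12345.host
# then the atomic directory update: rename it into new/
mv Maildir/tmp/1365500000.12345.host Maildir/new/1365500000.12345.host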

130320 Wed: Some angles on the large asset management problem

I heard an interesting presentation on distributed development where the issue is that IC related description and simulation files are big, are generated relatively frequently, and are part of a multisite workflow.

The talk was about the case where the IC description files are generated centrally and used in other sites, and that the critical attributes of the generated datasets are:

There are a lot of similar use cases that result in similar situations:

In many situations there is a very considerable overlap between datasets close in time.

The discussion was mostly on how to reduce the time to replicate the datasets offsite, and to minimize the cost of storage.

As to that, using rsync with --link-dest and --fuzzy as in BackupPC or rsnapshot achieves large degrees of bandwidth and space reduction, and further space reduction may be achieved with similar, more fine grained techniques like copy-on-write in filesystems like ZFS, Btrfs, COW ext3, which is somewhat similar to log structured storage.
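
For example (paths hypothetical) a snapshot-style rsync run, where files unchanged since the previous snapshot are hard-linked instead of copied:

rsync -a --fuzzy --link-dest=/backups/2013-03-19 \
    /data/ /backups/2013-03-20/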

But in the scenario above there is a relatively common and big issue: that IO involving 200GB over 1M inodes can be very slow, as in most cases the physical layout of the inodes does not correlate well with the logical access patterns within the data set.

Given the extremely different transfer rates of rotating disk storage between small and large records and between sequential and random access patterns, the dominant factor in this situation, both for backup and for loading and writing the data set, is to increase physical locality.

Unfortunately trying to save space by sharing unmodified files or by copy-on-write makes locality worse, often much worse:

130308 Fri: Filesystems and applications designers

A recent query about XFS allocation patterns is one of a class:

one of our customers has application that write large (tens of GB) files using direct IO done in 16 MB chunks. They keep the fs around 80% full deleting oldest files when they need to store new ones. Usually the file can be stored in under 10 extents but from time to time a pathological case is triggered and the file has few thousands extents (which naturally has impact on performance).
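
As an aside, the extent list of such a file, and thus its degree of fragmentation, can be inspected with the xfs_bmap tool (file name hypothetical):

# list the extents allocated to a file
xfs_bmap -v /fs/archive/bigfile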

121222 Sat: Standardized storage sizes and leaving unused areas

It used to be the case that disk drives of the same gross capacity could have slightly different sizes in sector counts, and this made it somewhat unwise to use all the space on a drive for partitions or for RAID set members. The issue was that replacing a disk might happen with one slightly smaller, and then the data no longer fit.

Since the differences used to be small, I used to leave the last few dozen MiB, and currently the last GiB, of a storage device unused; I also usually left the first few dozen MiB, and currently the first GiB, of a storage device unused.

I also standardize partition sizes, so that partitions would not cross typical disk size boundaries, for example 80GB, 160GB, 250GB, 500GB, 1TB, 2TB, in particular the smallest numbers expressed in conventional cylinders of 255*63*512B sectors, or 8225280B. I found in the past the minimum sizes in cylinders were:

Disk drives minimum size in 8225280B units

gross capacity  cylinders
 80GB                7297
160GB               19457
250GB               30400
500GB               60800
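
The arithmetic behind the unit and the table is easy to check with shell arithmetic:

echo $((255 * 63 * 512))     # 8225280 bytes per conventional cylinder
echo $((19457 * 8225280))    # 160039272960, i.e. just over 160GB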

However I have noticed that in the past few years disk drives from the few remaining manufacturers have exactly identical sizes expressed in sectors, which makes me suspect that finally some industry body has published a relevant standard, precisely to avoid issues with slightly different sizes in RAID sets.

110817 Wed: A comparison of some aspects of version control systems

Some discussions of the finer points of Git, Monotone, Mercurial and Bazaar:

A list of some non-obvious VCS features:

                            Monotone                           GIT                             Bazaar                     Mercurial
Language                    C++                                C, shell                        Python                     Python
Metadata per ...            One file per repo                  Per repo                        Per commit, per directory  Per file
How many files              One                                Depends on packing              .                          2 per managed file
Size of repo files          large                              large and small                 medium                     small
Notes                       .                                  .                               .                          .
Access via                  HTTP, or other port, or SSH        port/SSH, HTTP                  port/SSH                   HTTP
Special features            Very clean design, uses database.  Commits need periodic packing.  .                          Large number of metadata files.
Scalability and speed       .                                  fast                            .                          slow
Integration with Trac or
 other ticketing system     yes                                yes                             yes                        yes
Whether there is a GUI tool no                                 yes                             yes                        yes
Preservation of history
 across renames             yes                                yes                             yes                        plugin
Can cherrypick              .                                  yes                             .                          .
Can rebase                  .                                  yes                             .                          .
How much file metadata
 is tracked                 .                                  .                               .                          .
Signing of changes          yes                                yes                             .                          .
Global or local version
 and file ids               .                                  .                               .                          .
Workflows                   .                                  .                               .                          .
Conversion from CVS and SVN .                                  .                               .                          .
Special features            Very clean design                  Tracks content, not files.      .                          Extensive plugins

120907 Fri: Trading among latency, throughput, safety vs. cost, capacity

The main tradeoffs in computer systems designs involve latency vs. throughput vs. safety, and all of these against cost and capacity.

For most computer systems designs cost is roughly fixed, and the goal becomes to build the best system the budget allows for, and this usually involves tradeoffs between throughput and latency.

These tradeoffs are pervasive because for almost every system component it is possible to find implementations that have lower latency but also lower capacity, or higher latency but higher capacity, and therefore most system components are arranged in pairs or deeper hierarchies which take advantage of locality to obtain most of the advantage of the lower latency implementation (for more active work) and of the higher capacity one (for less active work).

The tradeoffs are especially important for storage components, at every level of the storage hierarchy.

At one extreme, system memory components have achieved higher throughput over the years by increasing their internal degree of parallelism and thus the minimum physical transaction size, which has significantly impacted latency; this has been made less negative by the growth in capacity of CPU chip caches, which have smaller transaction sizes and much smaller latencies.

Today I was reminded of the importance of these tradeoffs when bulk-overwriting an enterprise disk drive capable of peak sustained 90MB/s and noticed it could sustain as much as 33MB/s with drive caching turned off. This was amazing as I usually see something like 4MB/s.

120907 Fri: Virtualization approaches

070603 Sun: Again on distribution of services to servers, some examples

Now another more significant example: a department which occupies an entire building, with a variety of office workers, some of whom use internal client-server apps based on some relational DBMS, like Sybase or Oracle. There are 200 client nodes over 5 floors in 2 different buildings.

A single LAN with several VLANs, all switches and servers shared in a single room with about 20 servers, one for each service or for a few specific services.

With this configuration if everything works all is fine. Except that the 20 servers need to be individually configured and set up, require lots of cooling and power, and are all in a single location.

10 distinct LANs, one per floor or half-floor, a central many-to-many router, each LAN having some servers locally, but these and the clients depending for some crucial services like DNS, authentication, home directories on servers in each of the two buildings.

In this case quite a few things can continue working in case of some failures. But all clients will stop working if the shared servers become inaccessible or fail.

10 distinct LANs, one per half-floor, routed via a backbone LAN with a one-to-one router to the Internet, all services on servers on the workgroup LAN, backed up or mirrored from shared servers in both buildings.

This is more like the Internet itself, where many individual sections can fail, and others become isolated, but most continue working thanks to replication.
140712 Sat: Some links to popular sites and shops for premium keyboards

Useful sites about premium keyboards:

Plus specialized or semi-specialized manufacturers:

Plus specialized or semi-specialized on-line vendors:

121209 Sun: Times to 'fsck' filesystems update

When a server with several 1TB ext3 filetrees crashed they needed fsck, and these are some of the times:

name   time    size      used     fragm.  max inodes  used inodes
b      13m41s  1099.5GB  139.8GB   5.6%   67.1m           609,448
c      31m35s  1099.5GB  534.6GB  16.4%   67.1m         1,029,415
d      16m41s  1099.5GB  354.4GB  14.8%   67.1m             4,670
e      18m36s  1099.5GB  264.0GB  28.6%   67.1m           136,142
f      25m18s  1099.5GB  370.5GB  41.0%   67.1m           223,616
g      27m48s  1099.5GB  549.4GB  17.0%   67.1m           355,585
i       9m53s  1099.5GB   27.7GB  11.2%   67.1m               178

In the above the times are all for fsck -n, that is a mere check, and all the filetrees checked had no or minimal inconsistencies, and as reported most were far from full, a fairly good situation; but the data in some of them was quite scattered.

Also all the filetrees were partitions on a RAID6 of 16 total drives, thus behaving as a 14 drive RAID0, except that one of the disks was missing, thus t

The rough estimate is around 1TB per hour (faster if the average file size is large, slower if it is small), which is roughly in line with other reports for mostly fine, not too scattered ext3 filetrees.
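
That estimate can be checked against, for example, filetree c in the table above:

# 534.6GB checked in 31m35s, that is 1895s
echo '534.6 / 1895 * 3600' | bc -l   # ~1015 GB/hour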