Computing notes 2016 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


161229 Thu: Bookmarks with 'org' mode

The WWW is a wonderful if messy library and I keep lists of hyperlinks (bookmarks) to notable parts of it, currently around 6,000. I have switched the tool I use to keep these lists a few times:

Initially I used the Konqueror browser and therefore the KDE bookmark manager. But the bookmark manager was extremely slow and awkward (it would autosave all bookmarks on any edit of one) and Konqueror, while excellent, was not getting frequent security fixes. Also the KDE bookmark manager, while independent of any browser, is well integrated only with the Konqueror and Chromium browsers.
Then I used the bookmark manager built into the Firefox browser, which was much faster but still awkward, and still integrated with only one browser; given that I had already switched main browser, that was limiting.
Then I used the KDE basKet note manager, which has a note type that is a hyperlink and can be used with any browser supporting drag-and-drop of hyperlinks. But it was still awkward to use, and had some limitations when managing lists of notes instead of random collections of notes, for example the inability to sort a list of notes (that is, bookmarks). It was also quite slow, incomprehensibly often taking many seconds to open a collection of notes.

The main problems were actually similar across these three GUI bookmark managers:

Inability to perform some or all mass-update operations on a set of bookmarks: sorting, search-and-replace, moving.
Awkward updates to a bookmark: usually click on it to pop up a form to update its attributes instead of editing them in place.
Undocumented, even if usually reverse-engineerable, format for the stored form of the bookmarks.
Often very slow operation, probably due to use of very slow XML libraries and in-core representation as complicated trees with many pointers.
Also inability to use the bookmark lists outside of the GUI in text-mode connections.

So I started looking for text-based bookmark list managers, and I first found Buku and tried it. It is a purely command line bookmark database, and does not organize bookmarks by list, but by keywords. It works well, but that is not quite what I wanted. So I thought about what I wanted and I realized that I really wanted a nested list manager, and that an outline-oriented editor could be used in that way.

Then I remembered that there is a mode for outline editing in EMACS, and that EMACS can open files or hyperlinks in the middle of text, and that there was a further evolution of outline editing with added functionality for maintaining lists of notes, which is org mode. I reckon that in general outline editing is not that useful, but it occurred to me that it may match well looking at and editing nested lists.

Org mode extends outline editing mode by incorporating a kind of lightweight markup and by offering more operations, among them easy ways of sorting lists, moving groups of list items, and displaying only entries that match a regular expression, so I started using it.

It is a lot better than the GUI-based list managers I had tried so far for managing lists of bookmarks, and even citations. In particular it is much, much faster, every operation being essentially instantaneous, as it should be on a few hundred KiB of data, and it is very convenient to use: I have finally been able to re-sort, tag, and update my bookmark collection. Since EMACS also runs in pure text mode, it also works on the command line in full-screen mode. Plus of course, since it is built inside a full-featured text editor, it has all the power of the text editing tools available.
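
Because an org file is plain text, the bookmark lists can also be processed outside EMACS with ordinary text tools. The following is only a sketch under stated assumptions: the sample headings, tags, and example.org links are made up, the function names org_sample and org_links are hypothetical, and it assumes at most one link per line, written in the usual org form [[URL][description]].

```shell
# Hypothetical sample bookmark file; headings, tags, and links are made up.
org_sample() {
cat <<'EOF'
* Filesystems                                              :storage:
** [[https://example.org/jfs][JFS overview]]
** [[https://example.org/nilfs2][NILFS2 notes]]
* Networking                                               :net:
** [[https://example.org/6to4][6to4 setup]]
EOF
}

# Print "URL<TAB>description" for every org link of the form
# [[URL][description]], assuming at most one link per line.
# Reads the named files, or standard input if none is given.
org_links() {
  tab=$(printf '\t')
  sed -n "s/.*\[\[\([^]]*\)]\[\([^]]*\)]].*/\1${tab}\2/p" "$@"
}

# List the bookmarks of the sample file.
org_sample | org_links
```

The output can then be piped into sort, grep, or similar tools, which also works over a text-mode connection.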

While I am happy that I found in my old EMACS a good solution I am also sad because I usually find applications built within EMACS like the email reader/writer VM and the SGML/XML structure editor PSGML to be preferable to dedicated tools. I think that this is because of some fairly fundamental issues:

Some programmers who write shiny GUIs have learned their craft from computer science cliches (usually MS-Windows ones). In particular they do not care to give the data structures they define a sufficient vocabulary of operations (not just adding, editing, and deleting a member: copying, moving, bulk operations, sorting, and filtering are also needed in most cases), nor do they appreciate that code modularity is good because users want to do many different things.
Unfortunately it is still much easier to add a new module to an existing coherent yet highly modular and programmable environment like EMACS than to reuse modules in a system environment like UNIX/Linux, with the exception of shell pipelines that process text.

So org mode is working well for me for bookmark lists, and I guess that it would work well for keeping other types of lists. There are indeed a number of org mode enthusiasts (for example 1, 2) who do nearly everything with org mode, as nearly anything can be turned into a list of items, but I haven't reached that stage yet.

161217 Sat: The most interesting filesystem types

One of the more interesting aspects of Linux (the kernel) is the number of filesystem designs that have been added to it. In part I think this is because it started with the Minix filesystem which had a number of limitations, then someone implemented xiafs which also had some limitations, and then things started to snowball and everybody tried to write better alternatives.

Among the current filesystem designs there are some that are generic UNIX-style filesystems and some that are rather specialized as to features, purpose, or not being very UNIX-like. Having tried many, the ones that I like most are:

JFS is the all-round generic filesystem I like. It has a small, efficient, flexible implementation, it copes fairly well with most workloads, it has a well-rounded performance envelope, it is very reliable and exceptionally stable, and it works well on small and large filetrees. It also has the interesting feature of being able to be case-independent.
ReiserFS is second to JFS among the generic filesystems, it has become very reliable and stable, and it has the special feature of coping very well with small files.
NILFS2 could be another generic filesystem choice, even if it is a bit slower in reads than the best, though not by a large margin. But it has the really useful special feature of doing continuous temporary snapshots that can be made permanent, and being log-structured it can handle writing pretty well. It is also quite reliable, recovers well from crashes, and is quite stable. It really needs, though, regular runs of the compactifying daemon nilfs_cleanerd to reclaim and consolidate the space left by expired snapshots, which does interfere (usually very little) with other use.
XFS can be used as a generic filesystem, but it has the special feature of coping particularly well with highly multithreaded workloads that access the same file.
Otherwise it is quite complex: the implementation is large (IIRC at one point it contained 5 different B-tree variant implementations), it has significant resource demands, and it is being excessively actively maintained.
Btrfs can be used as a generic filesystem too, but it really is best when its two main features, checksums and subvolumes, are desired, even if checksums and other features come at a large cost in CPU time (much reduced if hardware acceleration is available). Another excellent special feature is that thanks to having (something like) a copy-on-write design it can handle some fsync(2) cases particularly efficiently.
It has many more features, but many of them are still buggy or have surprisingly unpleasant corner cases, notably multiple device management, and in particular (but not only) the parity RAID profiles.
Even if it has been around for 10 years the non-base features are still not quite ready, its implementation is large and complex, and it is also being excessively maintained. It also really needs periodic free space compaction, and can suffer from fragmentation more than others.
UDF is standard on DVD and BD disks (it can handle block devices and files larger than 4GiB which is the limit for ISO filesystems). Therefore its special feature is that it is supported by virtually any operating system, and is therefore extremely portable and allows widespread media sharing.
It is little known that it is read-write and can be used on ordinary disk drives or flash drives too, and actually performs fairly well as a generic filesystem too, and this is supported under MS-Windows too. The Linux implementation used to be a bit buggy, but that seems to have been fixed.
OCFS2 has the special feature of having interlocking so that a single filetree can be shared read-write by multiple computers at the same time in a cluster. But it also works well as a generic filesystem, and has some interesting features such as metadata checksums and reflinking.

161206 Tue: GRUB2 and Linux console video modes

When GRUB2 sets up the system console, if it is graphical, it is set to some video mode (resolution and depth), and the same happens when the loaded Linux kernel starts; the two modes are not necessarily the same.

Usually the defaults are suggested by the BIOS (if it is an IA system), or by the DDC information retrieved from the monitor. Sometimes the defaults are not appropriate or available, so they must be set explicitly.

Having had a look at the GRUB2 and Linux kernel documentation and done some experiments I discovered that, as usual, the documentation is somewhat reticent and misleading, and that the more reliable and complete story is:

The GRUB2 shell has several console output drivers, which can be selected with the terminal_output command. Among them:
- The console driver sets it on a PC in VGA text mode, whichever that is, and its video mode cannot be changed.
- The gfxterm driver sets it on a PC in a graphics mode selectable with the gfxmode GRUB2 variable, which must be set before terminal_output is switched to gfxterm; it can be further customized with a configurable text font.
The Linux kernel can be configured for an initial console video mode with:
- If loaded in 16 bit mode (with the linux16 and initrd16 GRUB2 commands) with the vga Linux kernel parameter, which takes the decimal number of the BIOS video mode desired.
- If loaded in 32 bit mode (with the linux and initrd GRUB2 commands) and with a framebuffer graphics card driver, the mode can be set with the GRUB2 gfxpayload variable, which is passed to the kernel as a boot environment variable, which can be set to the string keep to have it the same as the GRUB2 graphics mode. Unfortunately this did not work with my Ubuntu 4.4 kernel.
- With the framebuffer driver video kernel parameter. This worked well with my Ubuntu 4.4 kernel.
If one is using the default GRUB2 configuration generation scripts driven by parameters in /etc/default/grub, then these correspondences exist:
- GRUB_TERMINAL_OUTPUT sets the terminal_output parameter.
- GRUB_GFXMODE sets the gfxmode variable.
- GRUB_GFXPAYLOAD_LINUX sets the gfxpayload variable.
- GRUB_CMDLINE_LINUX_DEFAULT and/or GRUB_CMDLINE_LINUX can contain the definitions of the value of the vga or video Linux kernel parameters.

I have found that good defaults are 1024x768x32,1024x768,auto for gfxmode and 1024x768-32 for video: virtually any graphics card supports them, virtually all monitors have that resolution or higher, and that resolution allows for a decently sized console, around 128x38 characters with typical fonts.
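
As a rough illustration, these settings might be combined in /etc/default/grub as follows. This is only a sketch: the values are the defaults suggested above, gfxterm is assumed as the output driver, and the file is assumed to be processed by the default configuration generation scripts (on Debian and Ubuntu, by running update-grub after editing it).

```shell
# Sketch of /etc/default/grub settings; the file uses shell variable syntax.

# Use the graphical GRUB2 output driver instead of the VGA text console.
GRUB_TERMINAL_OUTPUT='gfxterm'

# GRUB2 graphics mode: try 1024x768 at 32 bits, then at any depth, then auto.
GRUB_GFXMODE='1024x768x32,1024x768,auto'

# Ask the kernel to keep the GRUB2 graphics mode. As noted above, this did
# not work with an Ubuntu 4.4 kernel, hence the video parameter below.
GRUB_GFXPAYLOAD_LINUX='keep'

# Linux framebuffer mode, passed on the kernel command line.
GRUB_CMDLINE_LINUX_DEFAULT='video=1024x768-32'
```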

161126 Sat: Homogeneous systems and heterogeneous workloads

Having pondered the match (or rather more often the mismatch) between system and workload performance envelopes I found in an article a pithy way to express this in the common case:

A homogeneous system cannot be optimal for a heterogeneous workload.

My work would have often been a lot more enjoyable if my predecessors had set up systems accordingly. But the only case where a homogeneous cluster can run a heterogeneous workload well is when it is oversized for that workload, so that the performance envelope of the workload is entirely contained within that of the system. Consequently first impressions of a newly set up homogeneous cluster can be positive for a while, only to evolve into disappointment as the cluster workload nears capacity (or worse, when it enters a post-saturation regime).

The author of the presentation in which the quote is contained has done a lot of interesting work on latency, which is usually the limiting factor in homogeneous systems, and particularly on memory and interconnect latency and bandwidth. Since my usual workloads tend to be analysis rather than simulation workloads, my interest has been particularly in network and storage latency (and bandwidth), which is possibly an even worse issue.

161124 Sat: Specifications of a 1990 UNIX System V workstation

Quite amused by looking at an old email list post about advice on buying a UNIX system from March 1990, suggesting a 386SX 16/32b processor at 20MHz with 4MiB of RAM, a 40MB-80MB disk, and UNIX System V versions Xenix or ESIX, with prices in either UK pounds or USA dollars:

Best configuration is a 386SX, 4 Megs, 1/2 discs either RLL, ESDI or SCSI, for a total of 80-120 megabytes, a VGA card with a 14" monitor, and a QIC-24 tape drive. Software should be either Xenix 386 (more efficient) or ESIX rev. D (cheaper, fast file system).

Here are UK and USA prices for all the above:
	Xenix 386 complete		#905		$1050
	Xenix TCP/IP			#350

	ESIX complete					$800

	386SX with 1 meg		#660		$900
	additional 3 meg		#240		$300

	VGA 16 bit to 800x600		#100		$190
	VGA 16 bit to 1024x768		#150		$250

	VGA mono 14"			#110		$240
	VGA color 14"			#260		
	VGA color 14" multisync		#300		$440
	VGA color 14" to 1024x768	#450		$590

	1 RLL controller		#100		$140
	1 RLL disc 28 ms 40/60 meg	#260		$380
	1 RLL disc 25 ms 40/60 meg	#300		
	1 RLL disc 22 ms 71/100 meg	#450		$570
	1 RLL disc 28 ms 80/120 meg	#520		$570

	1 ESDI controller		#170		$200
	1 ESDI disc 18 ms 150 meg	#660		$1200

	Epson SQ 850			#490
	Desk Jet +			#510		$670
	LaserJet IIP			#810		$990
	Postscript for IIP		#545		$450

	Archive 60 meg tape		#420		$580

Note: given inflation roughly double the prices above to get somewhat equivalent 2016 prices.

That's just over 25 years ago, and it is amusing to me because:

The proposed configuration could comfortably run Xenix (UNIX System V.2) or ESIX (UNIX System V.3) in multiuser mode, and even with X11 or the MGR graphics server (1, 2, 3, 4).
Xenix was by far the most popular UNIX, and it was the leading product of Microsoft, which at the time was the largest UNIX company.
At the time Microsoft products, both Xenix and MS-Windows, were indeed the cheap, subversive, affordable alternative to UNIX products.
Still, the price of UNIX/Xenix products was rather significant.
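
To put a number on the last point, here is a sketch that totals one possible selection of USA prices from the list above and applies the rough doubling for inflation. The choice of items (ESIX, a mono monitor, one 80/120 meg RLL disc, and the Archive 60 meg drive taken as the tape drive) is an illustrative assumption, not a configuration priced in the original post.

```shell
# USA prices in dollars, taken from the list above.
esix=800          # ESIX complete
base=900          # 386SX with 1 meg
ram=300           # additional 3 meg
vga=190           # VGA 16 bit to 800x600
monitor=240       # VGA mono 14"
controller=140    # 1 RLL controller
disc=570          # 1 RLL disc 28 ms 80/120 meg
tape=580          # Archive 60 meg tape

total_1990=$((esix + base + ram + vga + monitor + controller + disc + tape))

# Rough 2016 equivalent: double the 1990 price, as suggested in the note.
total_2016=$((total_1990 * 2))

echo "1990: \$$total_1990, roughly 2016: \$$total_2016"
```

With this selection the total is $3720 in 1990 prices, or roughly $7440 in 2016 terms, before adding a printer.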

161112 Sat: Old ADSL modem-routers still working well

Recently I changed ISP so that my previously current O2 Wireless Box V stopped being suitable, and I tried two other ADSL modem-routers that I had kept from years ago:

A Belkin F5D7630 which I bought in 2004. This (now reflashed with the firmware of the SMC Barricade 7804WBRA, of which it is a rebranding) worked well, but being limited to ADSL2 it could only connect at 800kb/s uplink and 8000kb/s downlink.
A Draytek Vigor 2800 which I bought in 2006, and which also worked well once I used a new 15V power supply, as its own had decayed and was supplying insufficient voltage. Since it supports ADSL2+ it has connected at the somewhat higher rates of 900kb/s uplink and 12000kb/s downlink, which are adequate.

Note: in a previous post I surmised that the Vigor 2800 does not support 6to4 IPv4 packets. That was a mistake: it does allow them through, it just does not do NAT on the encapsulated IPv6 packet, and this just requires a slightly different setup.

It is somewhat remarkable that 12 and 10 year old electronics still work. They have not been used for half of that time, but the age is still there. I guess it helps that they have no moving parts.

It is more remarkable that they are still usable, and connect at the top speeds available on regular broadband lines to consumers even today, same as 10-12 years ago. Current FTTC lines with VDSL2 and cable lines can do higher speeds, but they cost more and they have required extensive recabling. It looks like ADSL2+ is about as good as it will ever get on electrical lines, and will be stuck in the 2000s.

161106 Sun: Auditing and "cloud" virtual machines

Recently while discussing "cloud" virtual machine hosting options for a small business a smart person mentioned that other customers of the fairly good GNU/Linux virtual machine hosting business he was using were perplexed by the hosting provider's default (which can be opted out of) of keeping root access to the virtual machines of customers.

My comment on that is that many hosting providers realize that most of their customers are not particularly computing literate and thus can be forgetful of bad situations in their virtual machines, such as security breaches, that can cause trouble to other customers or to the hosting provider itself, and that in such situations a hosting provider has a number of options, listed in order of increasing interventionism:

Notify the customer.
Suggest to the customer remedial action.
Give a deadline to the customer for remedial action.
Intervene directly inside the VM with root access to fix the issue.
Block network access to/from the VM.
Stop the VM entirely.

Many issues can be easily fixed by direct intervention inside the VM, which both solves the issue from the point of view of the hosting provider and avoids impacting the availability of the customer's system.

The comment on that was that some customers were reluctant to leave the default root access with a third party, for privacy or commercial confidentiality reasons, or for auditability. As to this I pointed out that a virtual machine hosting provider (that is, any of their employees with access, including those working covertly for other organizations) has anyhow complete control over and access into a customer's virtual machines, including any encrypted storage or memory, as they can snapshot and observe or modify any aspect of the virtual machine's state, essentially invisibly to any auditing tools inside the virtual machine, so direct root access was just a convenience for the benefit of the customer.

Note: in a sense a system hosting VMs is a giant all-powerful backdoor to all the VMs it hosts.

Of course, given the complete access that "cloud" hosting providers have to the VMs and data of their customers, only a rather incompetent or underfunded intelligence gathering organization would refrain from infiltrating as many popular hosting providers as possible with loyal maintenance (building, hardware, software) engineers, as most intelligence gathering organizations have mandates from their sponsors to engage in pervasive industrial espionage, at the very least.

Note: in the case of a hosting provider manager discovering somehow that one of their engineers has been accessing customer VMs, their strongest incentive is to maintain customer confidence either by doing nothing or by terminating the engineer's contract with excellent benefits and a confidentiality agreement, without investigating the hosting systems to remedy the situation, because such an investigation might cost a lot of money for no customer-visible, never mind customer-invoiceable, benefit.

To which someone else pointed out that in some countries (Germany was mentioned) many businesses keep their own physical systems hosted on their own premises, with suitable Internet links. My comment to this is that this is understandable: even hosted physical systems can be compromised, even if less easily than VMs. Systems purchased and installed by the customer in cages at a hosting provider can also be compromised, even if less easily still. For moderately confidential business data, co-located customer-purchased and installed systems in a cage with a mechanical padlock purchased and installed by the customer (cage locks controlled by the hosting provider are obviously not a problem for insiders) seem good enough, but protecting more confidential business data requires keeping the systems on-premises to remove a further class of potential and unauditable access.

That "cloud" VMs (or physical systems) can be compromised at will by hosting provider engineers in an unauditable way is perhaps not commonly considered by their customers with even moderately sensitive data.

Note: Hopefully no bank or medical data are processed on a "cloud" hosted system.

Note: Well encrypted data held on a "cloud" storage system is of course not an issue. But most "cloud" storage vendors make it quite expensive to access data from outside their "cloud", and that often involves significant latency anyhow, so usually "cloud" storage is accessed by systems hosted in the same "cloud" (usually the same region of it).

What "cloud" hosting providers seem most useful for (1, 2) as to both cost and confidentiality is to provide a large redundant international CDN for nearly-public data.

161103 Thu: IPv6 6to4 with NAT

I have mentioned previously how easy (but subtle) it is to set up 6to4 (1, 2, 3, 4), but I have left a bit vague the issue of combining 6to4 with NAT, which is unfortunately very common for home users.

As 6to4 encapsulates an IPv6 packet inside an IPv4 packet there are two possible layers of NAT, on the containing IPv4 addresses and on the contained IPv6 addresses, and a limitation: since IPv4 packets don't have ports, the internal-to-external address mapping of NAT must be one-to-one. If IPv6 was encapsulated in an IPv4 UDP datagram the port number in the UDP header could be used to map multiple internal addresses to one external address.

This means that there can only be as many 6to4 systems on an internal LAN as there are external addresses available to the NAT gateway, and often that is a single one. The rest of the discussion is based on assuming that there is a single external address.

It is still possible to have many internal systems on IPv6 as long as they route via the 6to4 system, and the internal IPv6 subnet is chosen appropriately. The latter is not difficult as each IPv4 address defines a very large 6to4 subnet, whose 48-bit prefix consists of the 16-bit 6to4 prefix (2002) followed by the 32 bits of the IPv4 address. So for example:

The public IPv4 address 192.0.2.230 (0xc00002e6) defines a prefix 2002:c000:02e6::0/48, and a subnet of it may be 2002:c000:02e6:0001::0/64.
The private IPv4 address 10.168.1.10 (0x0aa8010a) defines a prefix 2002:0aa8:010a::0/48, and a subnet of it may be 2002:0aa8:010a:0001::0/64.
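
The derivation of the prefix in the two examples above can be sketched as a small shell function. The function name sixto4_prefix is hypothetical, and the sketch assumes a well-formed dotted-quad address whose octets are decimal numbers without leading zeros (which printf would read as octal).

```shell
# Print the 6to4 prefix (the first 48 bits, without the trailing "::/48")
# derived from a dotted-quad IPv4 address.
sixto4_prefix() {
  # Split the address on dots into the four positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  # 2002 is the 6to4 prefix, followed by the four octets in hexadecimal.
  printf '2002:%02x%02x:%02x%02x\n' "$1" "$2" "$3" "$4"
}

sixto4_prefix 192.0.2.230    # prints 2002:c000:02e6
sixto4_prefix 10.168.1.10    # prints 2002:0aa8:010a
```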

The first two issues are whether the NAT gateway allows passing IPv4 packets with a protocol type of 41 (IPv6) at all, and whether it performs NAT on them. Both must be possible to use 6to4 in a NAT'ed subnet.

The goal then is to ensure that when a 6to4 packet exits the NAT gateway it has these properties:

The IPv4 target address is 192.88.99.1 or the IPv4 address of some suitable 6to4-IPv6 gateway.
The IPv4 source address is the external address of the NAT gateway, for example 192.0.2.230.
The IPv6 target address is the target address of the original IPv6 packet.
The IPv6 source address is the original one, with its prefix either already being, or being NAT'ed to, the prefix derived from the IPv4 external address of the NAT gateway, for example 2002:c000:02e6.

The last point is the really important one, so that when replying the target IPv6 system knows that the reply IPv6 packet must be encapsulated in an IPv4 packet or sent to a 6to4 gateway (because of the 2002::0/16 prefix), and the target address of the IPv4 packet is the external address of the NAT gateway.

Therefore the next issues are whether the NAT gateway also NATs the contained IPv6 addresses, and whether the external address is static or dynamic:

NAT gateway mapping of internal IPv6 source addresses:
- If mapped: the internal IPv6 prefix can be either the prefix based on the external address of the NAT gateway (2002:c000:02e6:0001::0/64 if it is 192.0.2.230), or based for example on the internal IPv4 address of the 6to4 gateway (2002:0aa8:010a:0001::0/64 if it is 10.168.1.10).
- If not mapped: the internal IPv6 prefix must be based on the external address of the NAT gateway, (2002:c000:02e6:0001::0/64 if it is 192.0.2.230).
NAT gateway external IPv4 address lifetime:
- If static: Works whether the source IPv6 address is mapped or not, as long as the IPv6 internal prefix is based on the NAT gateway external address.
- If dynamic: If the NAT gateway cannot map the source IPv6 address then the prefix of the internal IPv6 network must change every time the external dynamic IPv4 address changes, or the 6to4 gateway needs to do that. Neither option is attractive, so in the following this case will not be considered.

So in practice the two simple cases are to use an internal IPv6 prefix based on the static external IPv4 address of the NAT gateway, or, if the NAT gateway can map the IPv6 source addresses, to use a prefix based on the internal IPv4 address of the 6to4 gateway.

Two full examples based on the following values from the example above, assuming the IPv4 configuration is already done:

Definition of 6to4 gateway addresses:

# NAT gateway external IPv4 address if any.
EXT4='192.0.2.230'
PFX6="`printf '2002:%02x%02x:%02x%02x' 192 0 2 230`"

# 6to4 gateway internal IPv4 address.
INT4='10.168.1.10'
PFX4="`printf '2002:%02x%02x:%02x%02x' 10 168 1 10`"

# If "$EXT4" is nonempty then the NAT gateway external IPv4 address is
# static and known, so the internal IPv6 prefix will be based on it.
case "$EXT4" in
?*)     IP6P="$PFX6";;
'')     IP6P="$PFX4";;
esac

IPv6 internal address on the 6to4 gateway:

ip -6 address add "$IP6P":0001::000a/128 dev eth0
ip -6 route add "$IP6P":0001::/64 dev eth0

Encapsulation of IPv6 packets to 6to4 addresses on the 6to4 gateway:

ip tunnel add 6to4net mode sit local "$INT4" remote any ttl 64
ip link set dev 6to4net mtu 1280 up
ip -6 addr add dev 6to4net "$IP6P":0001::000a/16

Encapsulation of IPv6 packets to non 6to4 addresses on the 6to4 gateway:

ip tunnel add 6to4rly mode sit local "$INT4" remote 192.88.99.1 ttl 48
ip link set dev 6to4rly mtu 1280 up
ip -6 addr add dev 6to4rly "$IP6P":0001::000a/128
ip -6 route add 2000::/3 dev 6to4rly metric 100000

IPv6 interface configuration on an internal host:

ip -6 address add "$IP6P":0001::0022/128 dev eth0
ip -6 route add "$IP6P":0001::0/64 dev eth0
ip -6 route add default via "$IP6P":0001::000a

The same in abbreviated form for Debian's /etc/network/interfaces:

For the 6to4 gateway:

auto		eth0
iface		eth0 inet6 static
  netmask	64
  address	2002:c000:02e6:0001::000a
# address 	2002:0aa8:010a:0001::000a

auto            6to4net
iface           6to4net inet6 v4tunnel
  endpoint      any
  local         10.168.1.10
  netmask       16
  address       2002:c000:02e6:0001::000a
# address       2002:0aa8:010a:0001::000a

auto            6to4rly
iface           6to4rly inet6 v4tunnel
  endpoint      192.88.99.1
  local         10.168.1.10
  netmask       3
  address       2002:c000:02e6:0001::000a
# address       2002:0aa8:010a:0001::000a

For an internal host:

auto		eth0
iface		eth0 inet6 static
  netmask	64
  gateway       2002:c000:02e6:0001::000a
  address	2002:c000:02e6:0001::0022
# gateway 	2002:0aa8:010a:0001::000a
# address 	2002:0aa8:010a:0001::0022

161009 Sun: Backup drives with NILFS2 filesystems

Currently I am using mostly Btrfs for my home computer filetrees, largely because of Btrfs's extensive support for data checksums. My setup also pairs the active disk drive with a backup one, where the second is synchronized via rsync to the original every night, and then I also do periodic manual backups to an external drive or two.

I have recently chosen to use NILFS2 for the backup drives, for these main reasons:

I like for original and backup data to be on different technologies, such as disks from different manufacturers, and different types of filesystems, in general.
More specifically, while the Btrfs core functionality seems stable and fairly reliable, NILFS2 has been quite stable and very reliable for a long time, so if there is some big unexpected bug with Btrfs I can rely on the NILFS2 backup.
Not only is NILFS2 particularly suitable for backups that get updated by rsync and virtually never read, as it is log-structured, but it also provides continuous snapshotting, which means that the backup drive contains many previous versions of the original filesystem.
NILFS2 does data checksumming, even if currently it does not check them on read but only on restart.
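
A nightly synchronization of this kind could be sketched as follows. This is an illustration under stated assumptions rather than my actual script: the paths and device name are hypothetical, the rsync options are one plausible choice, and the final step uses mkcp -s from the NILFS2 utilities to turn the state after the sync into a permanent snapshot. With DRY_RUN=1 the commands are only printed.

```shell
# With DRY_RUN=1 print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Mirror a source filetree onto a mounted NILFS2 backup filetree, then
# make a permanent snapshot of the result on the backup device.
backup_to_nilfs2() {
  src=$1
  dst=$2
  dev=$3
  # Archive mode, preserving hard links, deleting files gone from the source.
  run rsync -aH --delete "$src"/ "$dst"/ || return 1
  # Create a checkpoint and mark it as a snapshot so it is not reclaimed.
  run mkcp -s "$dev"
}

# Hypothetical paths and device, printed only.
DRY_RUN=1
backup_to_nilfs2 /home /mnt/backup/home /dev/sdb1
```

Older versions can then be listed with lscp and mounted read-only when something needs to be recovered.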

161008 Sat: The "spines-leaves" topologies and crossbar switching

Among others Google have adopted (the main part of) the purely-routed network topology that I described a few years ago and that I had implemented earlier at a largish science site, and is based on routers (and servers) being multihomed on multiple otherwise unconnected backbones with connectivity managed by OSPF and ECMP, and using /32 routes to provide topology independent addresses for services.

The Google paper and other descriptions (1, 2, 3) refer to them as leaf-spine (or more properly spines-leaves) topologies and claim that they are crossbar-switch topologies like those introduced by Charles Clos quite some time ago to interlink several crossbar switches into a whole that performed almost like a single large crossbar. These topologies have also long been known as multi-stage distribution topologies and typically have 3 stages.

The multiple independent backbone network spines-leaves topologies I have used are not quite Clos-style networks: the shape is different as Clos topologies need at least three stages, and spines-leaves topologies are as a rule on two levels; also Clos topologies are pass-through, while spines-leaves topologies connect endpoints among themselves. The main common point is that the second stage or level as a rule involves fewer switches than the first level.

Note: There is thus a semi-plausible argument that spines-leaves topologies are degenerate Clos topologies where the first and third stage switches are the same, but I don't like it.

The main point is not however about the different shape, but the rather different purposes:

Clos networks are designed to achieve a symmetric mapping of N-by-N incoming to outgoing circuits, 1-to-1, sacrificing latency by using multiple smaller switches at a lower cost than a single N-by-N switch, which would be otherwise preferable.
The spines-leaves topology is designed to provide scalable resilience and capacity for C-by-S asymmetrical but bidirectional packet flows, where typically there are many C clients and fewer S servers. The C clients are supposed to be attached to leaf/client routers and the servers to leaf/server routers, both types routing thanks to OSPF and ECMP via multiple otherwise unconnected spine/backbone routers.

The shape of spines-leaves topologies is driven by their primary goal which is resilience achieved by:

Multiple spine/backbone routers that are not connected directly to each other.
Having only routed connects between leaf/local routers and spine/backbone routers to avoid sharing broadcast domains.
Using OSPF and ECMP both to have quick take-down of routes through failed links while the others continue operating, and to allow spine/backbone routers and leaf/local routers to be from multiple vendors to minimize common modes of failure.
Using /32 host routes on virtual interfaces for service endpoints (typically on server hosts running an OSPF daemon) to ensure that they are reachable regardless of the current link topology and thus of which interface packets are coming in or going out of.
An interesting resilience property of using OSPF/ECMP and /32 host routes is that links and topologies can then be reconfigured in an essentially arbitrary way dynamically and even without service interruption (if staggered): for example if a leaf/local router for servers dies the servers can be re-connected to any other leaf/local router (for example by bodily transporting them to a different computer room, or by using a long fibre) nearly transparently.
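
A minimal sketch of that resilience property, in Python: a leaf router picks among equal-cost spine next hops by hashing the flow, and when the routes through one spine are withdrawn the same flows simply hash onto the survivors. This is a toy model, not how any particular router implements ECMP; the spine names, the example addresses and the use of SHA-256 as the flow hash are all assumptions made for illustration.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick one of the equal-cost next hops for a flow (a tuple such as
    (src, dst, sport, dport)), so that a given flow sticks to one path."""
    if not next_hops:
        raise RuntimeError("destination unreachable: no next hops left")
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return sorted(next_hops)[index]

if __name__ == "__main__":
    spines = {"spine-a", "spine-b", "spine-c"}   # hypothetical names
    flow = ("10.1.2.3", "192.0.2.10", 40000, 443)
    print("all spines up:", ecmp_next_hop(flow, spines))
    # The routing protocol withdraws the routes via a failed spine;
    # the flow is then carried by one of the remaining spines.
    print("spine-b down: ", ecmp_next_hop(flow, spines - {"spine-b"}))
```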

Note: the use of /32 host routes on virtual interfaces might allow using entirely LAN-local dynamically allocated addresses (such as the 169.254.0.0/16 IPv4 range) for link-interface addresses.

This delivers resilience which is scalable by adding more spine/backbone routers and/or more leaf/local routers; as to the latter servers (and even clients) can well be multiple-homed themselves onto multiple routers, and running an OSPF daemon on a host is a fairly trivial task. I even had an OSPF daemon running on my laptop. In my largish science installation I had servers that were critical to overall resilience and performance multiple-homed directly on the spine/backbone routers, or those important to specific areas on the leaf/local routers.

The spines-leaves topology is not designed to deliver full symmetrical N-to-N trunking of circuits like Clos topologies because client-server packet flows are not at all like that. However it also gives nice scalable capacity and latency options: adding more spine/backbone or leaf/local routers and thus links and switching capacity to the whole, again largely transparently thanks to OSPF/ECMP routing and /32 host routes.

Finally spines-leaves topologies are very maintainable: because of their dynamic and scalable resilience and capacity it is possible to overprovision them, allowing to take down small or large parts of the infrastructure for maintenance quite transparently, while the rest of the infrastructure continues to provide service. It is also important for maintainability that the router configurations end up being trivially simple:

Both spine/backbone and leaf/local routers need a /32 host address on a virtual interface.
Both spine/backbone and leaf/local routers also need a subnet definition for the hosts or routers connected to them.
Leaf/local routers also need a point-to-point definition for each link to each spine/backbone router they are connected to.

161006 Mon: Cluster post-saturation regime and disk based storage

A famous principle of system optimization is that removing the main performance limiter exposes the next one. So faster CPUs make memory the next performance limiter, and so on.

But another important detail is that what is a performance limiter depends critically on the match or mismatch between the workload and the performance envelope of each system component.

In particular many system components like disks have highly anisotropic performance envelopes, so that their effective capacity is highly dependent on workload.

In particular capacity can be highly dependent not just on the profile of the workload, but also on its size, because effective component capacity often depends nonlinearly on its utilization.

As an example, consider a typical parallel batch processing cluster, with around 20 systems, with a total capacity of around 600 threads and 200 large disk drives, with around 100 threads devoted to background daemons, and 500 available for parallel jobs. Each of the disk drives can do between 100MB/s when used by a single thread and 0.5MB/s if used by many threads.

In such a cluster total capacity is at some point inversely proportional to number of threads running that do IO, even if that IO is sequential, because the more threads that do IO on the same disk the lower the total transfer rate that the disk can deliver.

It must be emphasized that what decreases at some point is the total capacity of the cluster, not merely per-thread resources; the per-thread resources will fall at some point more than linearly.

Therefore suppose that overall 1.5 threads can share a disk given their average transfer rate and IOPS requirements: then if more than 200 threads are added to the 100 background threads total cluster capacity will fall because each disk will deliver a lower, possibly a much lower, aggregate transfer rate. This even if 300 total threads are well below the 600 total capacity of the cluster.

Sometimes this situation is called a post-saturation regime, where in the example above cluster saturation is reached with 300 threads. Once that regime is reached additional load will further reduce capacity, and total time to complete may well be longer than with sequential execution. In an extreme case I saw many years ago running two jobs in parallel took twice as long as running them in series. In the example above running an additional 400 threads reaching 100% CPU occupancy will result in a total completion time probably rather higher than running two sets of 200 threads serially.
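
The shape of a post-saturation regime can be sketched with a toy Python model of the example cluster. The disk count, the 100MB/s and 0.5MB/s extremes and the 1.5 threads per disk figure come from the example above; the curve that joins them (the DECAY exponent) is purely an assumption chosen to make the effect visible, not a measurement.

```python
DISKS = 200
PEAK_MBPS = 100.0        # one sequential stream per disk (from the example)
FLOOR_MBPS = 0.5         # heavily interleaved access (from the example)
THREADS_PER_DISK = 1.5   # sharing level a disk tolerates (from the example)
DECAY = 4                # ASSUMPTION: how steeply a disk degrades past that

def disk_rate(threads_per_disk):
    """Aggregate MB/s delivered by one disk at a given sharing level."""
    if threads_per_disk <= THREADS_PER_DISK:
        return PEAK_MBPS * min(1.0, threads_per_disk)
    overload = THREADS_PER_DISK / threads_per_disk
    return max(FLOOR_MBPS, PEAK_MBPS * overload ** DECAY)

def cluster_rate(io_threads):
    """Aggregate MB/s of the whole cluster with threads spread evenly."""
    return DISKS * disk_rate(io_threads / DISKS)

if __name__ == "__main__":
    for threads in (100, 200, 300, 400, 500, 600):
        print(f"{threads:4d} threads: {cluster_rate(threads):8.0f} MB/s")
```

Under these assumptions the aggregate rate rises up to 300 threads and then falls, so the same amount of IO takes longer with 600 concurrent threads than with two successive batches of 300.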

Note: for virtual memory the same situation is called thrashing, and outside computing it often leads to gridlock: imagine a city that has a saturation point of 200,000 cars: with 250,000 cars traffic speed will slow down enormously, and with 300,000 cars probably traffic speed will be close to zero. For car traffic the resources whose total capacity reduces past saturation point are intersections, as cars must stop and start again the more often the more cars use them.

In a similar situation an observant user asked to configure the cluster job scheduler to limit the number of concurrent threads to the number that would not trigger a post-saturation regime on the disks. Most cluster job schedulers can be configured for user-defined consumable resources, for example in SGE as complex values using qconf -Mc.

For the example above, of the 200 total disks the capacity of around 50-70 is consumed by the background threads, and one can define complex values for total IOPS and total sequential transfer rate available, to accommodate jobs with mostly random or mostly sequential access patterns, as follows, given 140 disks remaining capacity:

complex_values IOPS=1300,MBPS=1300

Which is actually a bit optimistic because we have assumed that 200 threads can saturate on average 200/1.5 => 133 disks, but if there are hot spots, because of non-uniform distribution of IO from threads to disk, the post-saturation regime can happen with fewer than 200 threads.
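
How a scheduler uses such consumables can be sketched in a few lines of Python: each job declares how much of each resource it needs and is admitted only while the pool has room. This is a simplified model of the idea, not of SGE internals; the class and function names are hypothetical, and the per-job request sizes in the usage example are made up.

```python
class Consumable:
    """A pool of some resource that running jobs draw down."""
    def __init__(self, name, capacity):
        self.name, self.capacity, self.in_use = name, capacity, 0

    def try_acquire(self, amount):
        if self.in_use + amount > self.capacity:
            return False
        self.in_use += amount
        return True

    def release(self, amount):
        self.in_use = max(0, self.in_use - amount)

def admit(job_requests, pools):
    """Admit a job only if every requested consumable is available;
    roll back partial reservations otherwise."""
    taken = []
    for name, amount in job_requests.items():
        if pools[name].try_acquire(amount):
            taken.append((name, amount))
        else:
            for taken_name, taken_amount in taken:
                pools[taken_name].release(taken_amount)
            return False
    return True

if __name__ == "__main__":
    pools = {"IOPS": Consumable("IOPS", 1300), "MBPS": Consumable("MBPS", 1300)}
    for job in range(4):
        # Hypothetical sequential job asking for 500MB/s of transfer rate.
        print("job", job, "admitted:", admit({"MBPS": 500}, pools))
```

Jobs that are not admitted wait in the queue even though thread slots are free, which is exactly the behaviour that keeps the disks out of the post-saturation regime.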

Note: that configuration is even more optimistic because it gives as capacity the raw physical capacity of the 200 example disks. The actual capacity in IOPS and MBPS available to user threads can be significantly lower if there are layers of virtualization and replication in the IO subsystem, for example if the jobs run in virtual machines over a SAN or NAS storage layer.

When I configured consumables for IOPS and MBPS for the cluster as requested by that observant user I got complaints from other users and management that since this limited the number of concurrent threads it limited the utilization of the cluster. But the relevant utilization was that of the main performance limiter, in that case disk capacity; ignoring it and considering the number of available thread slots instead would overload disk capacity, thus achieving lower overall utilization. It seemed as if the number of thread slots occupied was considered more important than the number of thread slots utilized, so I had to make the specification of those consumable resources optional, which rather defeated their purpose. But the experienced user who had made the request continued to use them, and his cluster jobs tended to run at times when nothing else was running, so at least he benefited.

Note: the capacity reduction in a post-saturation regime on disk is due to a transition from sequential accesses with a per-disk transfer rate of 100-150MB/s to interleaved (and thus random-ish) accesses with a per-disk transfer rate that can be as low as 0.5-1MB/s. Seek times are for disks the equivalent of stopping and starting at intersections for cars in a city, that is periods of time in which useful activity is not performed, and which increase in frequency or duration when the load goes up.

160924 Sat: The Linux auditing subsystem and limits to auditing

I have been testing for a while a configuration for the Linux auditing subsystem (which seems to me awfully designed and implemented) to monitor changes in system files, and I was looking some time ago at the included rules examples and that left me somewhat amused. The examples mostly are about reporting modification or access for a few critical system files. That is somewhat weak: as per my testing one has to check also all the executables and libraries, because every modified executable can misbehave.

So my experiments have been about adding the major system library and executable directories to be monitored by the auditing subsystem, as a kind of continuous integrity checking, instead of the periodic one by checksum-based integrity checkers or periodic snapshot using filesystems like NILFS2 or Btrfs.
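
As an illustration of what such a configuration looks like, here is a small Python sketch that generates watch rules in the form accepted by auditctl and audit rules files (-w for the path, -p wa for writes and attribute changes, -k for a key to tag the records). The directory list and the key name are only examples, not the set used in my testing.

```python
# Illustrative list of directories holding executables and libraries.
WATCHED_DIRS = ["/bin", "/sbin", "/usr/bin", "/usr/sbin", "/lib", "/usr/lib"]

def watch_rules(directories, key="system-files"):
    """Build one audit watch rule per directory: report writes (w) and
    attribute changes (a), tagging records with the given key."""
    return [f"-w {directory} -p wa -k {key}" for directory in directories]

if __name__ == "__main__":
    print("\n".join(watch_rules(WATCHED_DIRS)))
```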

The major functionality of the audit subsystem is indeed to monitor file accesses, using in-kernel access to the inotify subsystem; but it can also monitor use of system entry points, which is unfortunately largely pointless, as there are very many system entry point calls on a running system, and the audit subsystem does not allow a fine grain of filtering as to which specific calls to a system entry point to monitor.

These are reasonable just-in-case measures, but they are not very effective against a well-funded adversary: every library or program in a system or the system hardware can be compromised at the source.

As to libraries and programs, given that the main technique of major security services is to infiltrate other organizations, it is hard to imagine that they would not get affiliated engineers hired by major software companies, or volunteer them to free software projects, so that they could add to the code some backdoors cleverly disguised as subtle bugs. It is hard to guess how many of the security issues in software that get regularly discovered and patched were genuine mistakes or carefully designed ones. But I suspect that the most valuable backdoors are very well disguised and hard to find, probably triggered only by obscure combinations of behaviours.

Probably companies like Microsoft, Google, Redhat, Facebook, SAP, Apple, EA, etc., have had (unknowingly) for a long time dozens if not hundreds of engineers affiliated with the security services (or large criminal organizations) of most major countries (India, UK, China, Israel, Russia, ...). Indeed I would be outraged if the security services funded by my taxes were not doing that.

As to hardware probably there are also engineers affiliated with various security services (or large criminal organizations) in virtually all major hardware design companies, Intel, Apple, ABB, CISCO, etc., where they can also insert into the design of a CPU or another chip or the firmware of peripherals like disks or printers some backdoors cleverly disguised as design mistakes.

But, as the files disclosed by Edward Snowden as to the activities of the NSA in the USA show, hardware can be compromised at the product level, not just at the component level: security agencies can afford the expense to intercept shipping boxes and insert in them surveillance devices or backdoored parts, or enter into premises hosting already installed products and do the same, for example replacing USB cables with (rather expensive) identical looking ones containing radios.

Note: a former coworker mentioned reading an article that showed how easy it is to put a USB logger, that is a keylogger, in a computer mouse, or any other peripheral, and there are examples of far more sophisticated and surprising techniques.

Fully auditing third party libraries and programs (whether proprietary or free software), component hardware designs, boxed hardware, and installed hardware is in practice impossible or too difficult, and anyhow very expensive as far as it can go. For serious security requirements the only choice is to make use only of hardware and software that has been developed entirely (down to the cabling, consider again USB cables with built in radios) by fully trusted parties, that is full auditing of the source; for example the government of China have funded the native development of a CPU based on the MIPS instruction set, and no doubt also of compilers, libraries, operating systems entirely natively developed. Probably most other major governments have done the same.

For system administrators in ordinary places the bad news is that they cannot afford to do anything like auditing at the source, and therefore every hardware and software component must be presumed compromised; the good news is that the systems administered are usually rather unlikely to be of such value as to attract the attention of major security services or to be regarded by them as deserving the risk and expense of making use of the more advanced backdoors, or of using NSA-style field teams to intercept hardware being delivered or to modify in place hardware already installed. Also probably for ordinary installations a degree of physical isolation is sufficient to make effective use of the less advanced backdoors too difficult in most cases.

Therefore for ordinary installations the Linux audit subsystem is moderately useful, together with other similar measures, as long as it does not give a feeling of security beyond its limits.

It is also useful for system troubleshooting and profiling, as it can give interesting information on the actual system usage of processes and applications, complementing the strace(1) and inotifywatch(1) user level tools and the SystemTap kernel subsystem.

160907 Wed: Something new in recent times: Arvados Keep

Today in a discussion the topic of what is new and recent in systems came up. In general not much that is new has happened in systems design for a while, never mind recently. Things like IPv6, Linux, even flash SSDs feel new, but they are relatively old. Many other recent developments are not new, but rediscoveries of older designs that had gone out of practice as tradeoffs changed, and have come back into practice as they changed back.

After a bit of thinking I mentioned a distributed filesystem, Arvados Keep because it contains a genuinely different design feature, that needs some explaining.

As mentioned in numerous previous posts designing scalable filesystems is hard, especially when they are distributed. Scaling data performance envelopes is relatively easy in some aspects, for example by using parallelism as in RAID to scale up throughput for mass data access; the difficulty is scaling up metadata operations, which includes both file attributes and internal data structures. The difficulty arises from metadata being highly structured and interlinked, which means that mass metadata operations, like integrity auditing or indexing or backups, tend to happen sequentially.

Arvados Keep is a distributed filesystem which has two sharply distinct layers:

A segment-storage layer, where each segment is up to 64MiB long, is unmodifiable, and has a unique identifier which is its hash, like in git.
A file-naming layer, where files (actually collections of files) are represented as a sequence of segment identifiers plus storage attributes.

The distinctive characteristic is that there is no metadata that lists which hash (that is, which segment) is on which server and storage device.

Since there is no metadata for segment location, and each segment identifier uniquely identifies the content of the segment, whole-metadata checks can be parallelized quite easily: each storage server can enumerate in parallel all the segments it has, and then check the integrity of the content with the hash; at the same time the file-naming layer does its own integrity checks in parallel. Periodically a garbage-collector looks at the file-naming database, queries in parallel the storage servers, and reconciles the lists of hashes for the files and those available on the storage backends, deleting the segments that are not referenced by any files.
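
The idea can be sketched in Python as follows. This is a toy in-memory model of content-addressed segments, written only to show why audits and garbage collection need no location metadata; it is not the Arvados Keep code, the class and function names are hypothetical, and SHA-256 is used merely as a convenient hash for the illustration.

```python
import hashlib

def segment_id(data):
    """The identifier of a segment is the hash of its content."""
    return hashlib.sha256(data).hexdigest()

class StorageServer:
    def __init__(self):
        self.segments = {}          # identifier -> bytes

    def put(self, data):
        sid = segment_id(data)
        self.segments[sid] = data
        return sid

    def audit(self):
        """Self-contained integrity check: list segments whose content no
        longer matches their identifier. Needs no external metadata."""
        return [sid for sid, data in self.segments.items()
                if segment_id(data) != sid]

def garbage_collect(servers, referenced):
    """Delete segments that no file in the naming layer refers to."""
    removed = 0
    for server in servers:          # each server could do this in parallel
        for sid in list(server.segments):
            if sid not in referenced:
                del server.segments[sid]
                removed += 1
    return removed
```

Each server's audit touches only its own segments, so adding servers adds audit capacity in proportion.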

That is quite new and relatively recent, and helps a lot with scalability, which has been a problem for quite a while.

The absence of metadata linking explicitly files to storage segments is made possible by the use of content addressing for the segments, and it is that which makes parallelism possible in metadata scans. It has however a downside: that locating a segment requires indeed content addressing via the segment identifier. That potentially means a scan of all storage servers and devices every time a segment needs to be located.

That could be improved in several ways, for example via a hint database, or by using multicasting. The current way used by Arvados Keep is to first calculate a likely location based on the number of servers, and check that first; if the segment is not found on that server, it falls back to a linear scan.

That works surprisingly well, in part because segment identifiers are essentially random numbers, and the same calculation is of course used when creating a new segment and when looking it up. The calculation is also highly scalable: it takes the same time whether the segments are distributed across 10 or 10,000 servers.

However it does not work that well in another sense of scalable, where the number of servers is increased over time, because the location of a segment is fixed by the calculation based on the number of servers when it was created, but it is searched for with the current number of servers.
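
The effect of growth can be estimated with a short Python sketch. It assumes the simplest possible calculation, the identifier taken modulo the number of servers; the actual calculation used by Arvados Keep may well differ, so the percentages only illustrate the general problem.

```python
import hashlib

def likely_server(segment_id, n_servers):
    """First server to probe: a pure function of identifier and server count
    (ASSUMPTION: plain modulo, chosen for simplicity)."""
    return int(segment_id, 16) % n_servers

def moved_fraction(ids, old_n, new_n):
    """Share of segments whose computed location differs after growing."""
    moved = sum(likely_server(i, old_n) != likely_server(i, new_n) for i in ids)
    return moved / len(ids)

if __name__ == "__main__":
    ids = [hashlib.sha256(str(k).encode()).hexdigest() for k in range(10000)]
    for new_n in (11, 20):
        missed = moved_fraction(ids, 10, new_n)
        print(f"10 -> {new_n} servers: {missed:.0%} miss on the first probe")
```

With this naive calculation most old segments are no longer where the first probe looks, and each of those lookups falls back to the scan.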

For a similar situation the Ceph distributed object system uses a CRUSH map algorithm, which is a table-driven calculation where the table is recomputed when the number of servers changes, but in a way that preserves the location of segments created with a different number of servers.

This probably could be added to Arvados Keep too, but currently it is not particularly necessary. In part because usually when expansion is done it happens in large increments. However in extreme cases it can result in file opening times of hundreds of milliseconds, as many servers need to be probed to find the segments that make up the file. Arvados Keep is targeted at storage of pretty large files, typically of several gigabytes, and therefore it has a large segment size of 64MiB (Ceph uses 4MiB) which minimizes the number of segments per file, and also means that the cost of opening a file is amortized over a large file.

The other cost is that since there is no explicit tracking of whether a given segment is in use or not, a periodic garbage collection needs to be performed, but that is needed regardless as an integrity audit, for any type of filesystem design, and it is easy to parallelize.

Overall the design works well for the intended application area, and the fairly unusual and novel decision to use content addressing without any explicit metadata structure cross referencing files and storage segments provides (at some tolerable cost) metadata scalability.

160817 Wed: A subtle downside of filesystem checksums

Today on the Btrfs IRC channel a user who mentioned building his computer out of old parts asked about some corruption reported by Btrfs checksumming, which he wanted to recover from. Some of the more experienced people present then noticed that it was a classic bitflip (one-bit error), inside a filesystem metadata block, and added:

[17:55]        kdave | pipe: the key sequence is ordered, anything that looks 
                       odd is likely a bitflip
[17:55]     darkling | Look at that one and the ones above and below, and 
                       you'll usually see something that fairly obviously 
                       doesn't fit.
[17:55]        kdave | I even have a script that autmates that
[17:56]     darkling | Convert the numbers to hex and sit back in smug delight 
                       at the amazement of onlookers. ;)
[17:56]     darkling | There's even a btrfs check patch that fixes them when 
                       it can...
[17:56]        kdave | we've done the bitflip hunt too many times
[17:56]     darkling | It's my superpower.
[17:56]     darkling | (Well, that and working out usable space in my head 
                       given a disk configuration)

The point being that bitflips are rather more common in their experience as Btrfs users and developers, than in the experience of most users, and in the present case it was likely due to unreliable memory chips.
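
The "bitflip hunt" described in the chat can be sketched in Python. This is a toy heuristic written for illustration, not the script or the btrfs check patch mentioned above: it looks for a key that breaks the ascending order of its neighbours and then tries every single-bit flip to see which ones would make it fit again. The function name and the 64-bit key width are assumptions.

```python
def find_bitflip(keys, bits=64):
    """Return (index, candidates) for the first key that breaks ascending
    order, where each candidate differs from that key by exactly one bit
    and fits strictly between its neighbours; None if nothing looks odd."""
    for i in range(1, len(keys) - 1):
        lo, bad, hi = keys[i - 1], keys[i], keys[i + 1]
        if lo < hi and not (lo < bad < hi):
            fixes = [bad ^ (1 << b) for b in range(bits)
                     if lo < (bad ^ (1 << b)) < hi]
            return i, fixes
    return None

if __name__ == "__main__":
    keys = [256, 257, 258 ^ (1 << 40), 259, 260]   # one key with bit 40 flipped
    index, fixes = find_bitflip(keys)
    print(f"key {index} = {keys[index]:#x} looks odd; candidates: {fixes}")
```

In less clear-cut cases the neighbours can be ambiguous (either of two adjacent keys might be the damaged one), which is why a human looking at the hex values is still useful.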

Obviously bitflips are not caused by Btrfs, so their experience is down to Btrfs detecting bitflips with checksums, which makes apparent those bitflips that otherwise would not be noticed.

That detection of otherwise unnoticed bitflips brings with it the obvious advantage, but also a subtle disadvantage, which is the reason why it is often recommended to use ECC memory when using ZFS which also does checksumming. This seems counterintuitive: if filesystem software checksums can be used to detect bitflips, hardware checksums in the form of ECC may seem less necessary.

The subtle disadvantage is that as a rule checksums are not applied by Btrfs or ZFS to individual fields, but to whole blocks, and that on detecting a bitflip the whole block is marked as unreliable, and a block can contain a lot of fields: for example a block that contains internal metadata, such as a directory or a set of inodes.

Note: in a sense detecting a bitflip in a block via a checksum results in a form of damage amplification, as every bit in the block must be presumed damaged.

In many cases an undetected bitflip happens in parts of a block that are not used, or are not important, and therefore results in no damage or unimportant damage. For example as to filesystem metadata, a bitflip in a block pointer matters a lot, but one in a timestamp field matters a lot less.

ECC memory not only detects but corrects most memory bitflips, avoiding in many cases the writing to disk of blocks that don't match their checksums, or cases of checksum failure because of a memory bitflip after reading a block from disk. This prevents a number of cases of loss of whole blocks of filesystem internal metadata (as well as data).

This is obviously related to another Btrfs aspect, that by default its internal metadata blocks on disk are duplicated even on single disk filesystems. Obviously duplicating metadata on a single disk does not protect against failure of the disk, unlike a RAID1 arrangement (but it may protect against a localized failure of the disk medium though).

But the real purpose of duplicating metadata blocks by default is to recover from detected bitflips: when checksum verification fails on a metadata block its duplicate can be used instead (if it checksums correctly, and usually it will). This minimizes the consequences of most cases of non-ECC memory bitflips.
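
Both effects, the whole block being rejected for one flipped bit and the duplicate rescuing it, can be shown with a minimal Python sketch. It is only a model of the principle: zlib.crc32 stands in for the filesystem checksum, and the block layout and function names are invented for the example.

```python
import zlib

def store(block):
    """Keep a block together with the checksum computed when it was written."""
    return {"data": bytes(block), "csum": zlib.crc32(block)}

def is_valid(copy):
    return zlib.crc32(copy["data"]) == copy["csum"]

def flip_bit(copy, bit):
    """Simulate a bitflip somewhere in the stored block."""
    data = bytearray(copy["data"])
    data[bit // 8] ^= 1 << (bit % 8)
    copy["data"] = bytes(data)

def read_metadata(copies):
    """Return the first copy that still matches its checksum."""
    for copy in copies:
        if is_valid(copy):
            return copy["data"]
    raise IOError("all copies fail their checksum: whole block lost")

if __name__ == "__main__":
    block = b"inode-table-entry;" * 200
    first, second = store(block), store(block)
    flip_bit(first, 5)                       # one bit out of several thousand
    print("first copy valid: ", is_valid(first))
    print("recovered intact: ", read_metadata([first, second]) == block)
```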

Obviously duplicating metadata on the same disk is expensive in terms of additional IO when updating it. Therefore it might be desirable in some cases to have more frequent backups, ECC memory, and turn off the default duplication of metadata blocks in Btrfs. So the summary is:

A lot of metadata or data corruption remains unnoticed because it does very little damage.
Filesystem checksumming makes nearly all cases of data or metadata corruption noticeable.
Filesystem checksumming applies to filesystem blocks and this means that a whole block of metadata or data that fails its checksum is discarded even if it was damaged in an unimportant way.
Since discarding a whole block of metadata can have large negative effects, Btrfs by default is set to duplicate them even on a single disk, so if one gets discarded because of checksumming failure, the other can be used.

160806 Sat: Blu-Ray discs need deformatting before reformatting

I have been using dual-layer Blu-Ray discs (50GB nominal capacity) for offline backups, both write-once (BD-R DL) and read-write (BD-RE DL), and recently I wanted to reformat one of the latter (to remap some defective sectors), and found that dvd+rw-format cannot do it, as the Blu-Ray drive rejects the command with:

FORMAT UNIT failed with SK=5h/INVALID FIELD IN PARAMETER LIST

After some web searching I discovered that BD-RE discs must be deformatted with xorriso -blank deformat before a BD drive will accept a command to format them. In this they are slightly different in behaviour from both DVD-+RW and CD-RW discs.

160717 Sun: A typical cold-storage system and its cost

While looking for related material I chanced upon a message describing a typical system configured for local cold-storage using Ceph instead of remote third party based services discussed recently:

OSD Nodes:

6 x Dell PowerEdge R730XD & MD1400 Shelves

2x Intel(R) Xeon(R) CPU E5-2650

128GB RAM

2x 600GB SAS (OS - RAID1)

2x 200GB SSD (PERC H730)

14x 6TB NL-SAS (PERC H730)

12x 4TB NL-SAS (PERC H830 - MD1400)

The use of 4TB-6TB NL drives is typical of bulk-data cold-storage (or archival) systems with a very low expected degree of parallelism in the workload, mostly a few uploading threads and a few more threads to download to warm-storage (or cold-storage if used in archive-storage).

There are also a couple of enterprise (the capacity of 200GB suggests this) flash SSDs, very likely for Ceph journaling, and two 600GB SAS disks for the system software.

Such a system could be used for cold-storage or archival-storage, but to me it seems more suitable for archival, as it has a lot of drives and most of them are as big as 6TB. But it could be suitable for cold-storage as long as upload and download concurrency is limited.

Note: The system configuration seems to be inspired by the throughput category on pages 13 and 16 of these guidelines.

The total raw data capacity is 120TB, and that would translate to a logical capacity of around 40TB using default 3-way Ceph replication for cold-storage, or of 80TB with a 12+4 erasure code set for archival-storage (BackBlaze B2 uses 17+3 sets, but that seems a bit optimistic to me).

Note: as to speed, with 26 bulk-data disks each capable of perhaps 20-40MB/s of IO over 2-4 threads, and guessing wildly, aggregate rates might be: for cold-storage 200-400MB/s for pure writing over 3-6 processes and 300-800MB/s (depending on how many threads) for pure reading over 5-10 processes; for archival storage it might be half that for writing (because of likely read-modify-write using huge erasure code blocks and waiting for all of them to commit) over 2-4 processes, and 400-800MB/s for pure reading.

Note: To coarsely double check my guess of achievable rates I have quickly set up an MD RAID10 set of 6 mid-disk partitions, with replication of 3, on 1TB or 2TB contemporary SATA drives, and done with fio randomish IO in 1MiB blocks:

soft#  fio blocks-randomish.fio >| /tmp/blocks-randrw-1Mi-24t.txt                                                                   
soft#  tail -13 /tmp/blocks-randrw-1Mi-24t.txt

Run status group 0 (all jobs):
   READ: io=1529.0MB, aggrb=51349KB/s, minb=1886KB/s, maxb=2418KB/s, mint=30191msec, maxt=30491msec
  WRITE: io=1540.0MB, aggrb=51718KB/s, minb=1920KB/s, maxb=2317KB/s, mint=30191msec, maxt=30491msec

Disk stats (read/write):
    md127: ios=24096/23557, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4310/12770, aggrmerge=210/484, aggrticks=114720/152362, aggrin_queue=267136, aggrutil=94.43%
  sdb: ios=5315/12139, merge=570/1118, ticks=291976/493200, in_queue=785520, util=94.43%
  sdc: ios=4547/12434, merge=456/840, ticks=68920/48540, in_queue=117456, util=35.66%
  sdd: ios=2438/12447, merge=238/828, ticks=42272/38664, in_queue=80936, util=24.40%
  sde: ios=7084/13200, merge=0/40, ticks=153356/143492, in_queue=296840, util=54.44%
  sdf: ios=4252/13204, merge=0/38, ticks=85368/96032, in_queue=181396, util=41.64%
  sdg: ios=2227/13201, merge=0/41, ticks=46428/94248, in_queue=140672, util=33.84%
soft#  cat blocks-randomish.fio 
# vim:set ft=ini:

[global]

kb_base=1024

fallocate=keep
directory=/mnt/md40
filename=FIO-TEST
size=100G

ioengine=libaio
io_submit_mode=offload

runtime=30
iodepth=1
numjobs=24
thread
blocksize=1M
buffered=1
fsync=100

[rand-mixed]

rw=randrw
stonewall

Note: The setup is a bit biased to optimism as it is over a narrow 100GiB slice of the disk, and IO is entirely local and not over the network, and it is for block IO within a file, therefore no file creation/deletion metadata overhead. The outcome is a total of 100MiB/s, or around 17MiB/s per disk. With pure reading (that is without writing the same data three times) the aggregate goes up to 160MiB/s, or a bit over 26MiB/s per disk, and with 16MiB blocks the aggregate transfer rate goes up to 170MB/s, or around 28MiB/s per disk. I have watched the output of iostat -dkzyx 1 while fio was running and it reports the same numbers. For another double check, CERN have seen over 50GB/s of usable aggregate transfer rate by 150 clients over 150 servers (each with 200TB of raw storage as 48×4TB disks), or 350MB/s per server.
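
The 100MiB/s and 17MiB/s figures can be rederived from the fio summary with a few lines of Python. The two summary lines are copied from the output above; the helper names are made up for this sketch, and the only assumption is that with kb_base=1024 the KB/s figures are KiB/s.

```python
import re

FIO_SUMMARY = """\
   READ: io=1529.0MB, aggrb=51349KB/s, minb=1886KB/s, maxb=2418KB/s, mint=30191msec, maxt=30491msec
  WRITE: io=1540.0MB, aggrb=51718KB/s, minb=1920KB/s, maxb=2317KB/s, mint=30191msec, maxt=30491msec
"""

def aggregate_kib_per_s(summary):
    """Sum the aggrb= figures of all directions (read and write)."""
    return sum(int(v) for v in re.findall(r"aggrb=(\d+)KB/s", summary))

def per_disk_mib_per_s(summary, disks):
    """Average transfer rate per member disk, in MiB/s."""
    return aggregate_kib_per_s(summary) / 1024 / disks

if __name__ == "__main__":
    total = aggregate_kib_per_s(FIO_SUMMARY) / 1024
    print(f"aggregate {total:.1f} MiB/s, "
          f"per disk {per_disk_mib_per_s(FIO_SUMMARY, 6):.1f} MiB/s")
```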

As to comparative pricing, the same one-system capacities on Amazon's S3 for about 5 years have a base price currently of $30,000 for 40TB in S3 IA and of $33,600 for 80TB in Glacier, plus traffic and access charges; these are difficult to estimate, but I would add $6,500 to the S3 IA cost and $9,000 to the S3 Glacier cost.

Note: $6,500 is around the price of accessing around 20TB (of the 40TB) per year from another region (within region accesses are free) and reading back 10TB per year from S3 IA; $9,000 is around the price of restoring 1TB of the 80TB per year in 2 different months over 4 hours each time (10 downloads of 500GB each at 125GB/hour).

Rounding up a bit (10%) to account for other charges that apply may give around $40,000 over 5 years for 40TB of S3 IA, or $47,000 over 5 years for 80TB of Glacier.

The total cost of ownership of local system hardware plus a minimum system administration would have to compete with that. The purchase price of one such similar system is around $14,000 with similar or better SSDs and disks plus around $5,500 for a 12×4TB disk expansion unit (prices include 5 years NBD support but not tax), for a total of around $20,000 (the pricing I looked at is for SATA disks, but SAS NL disks are not much more expensive); to this one should add data center charges (for space, power and connectivity), and my rough estimate for that is around $12,000 over 5 years for colocation per system.

So the total cost of ownership for one of the cold-storage or archival servers mentioned at the beginning would be around $32,000 over 5 years, compared with $40,000 for 40TB of S3 IA, or $47,000 for 80TB of Glacier (both before tax). That $8,000 or $15,000 difference can pay for a lot of custom system administration, especially given that this is largely a setup-and-forget server. Considering that many sites have rack space and networking already in place the cash difference can be bigger.
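
The arithmetic behind those 5-year totals, as a small Python sketch using only the figures given above (the function names are just for the example; the text rounds the results to the nearest thousand or so):

```python
def cloud_cost(base, access, overhead=0.10):
    """Base storage price plus access charges, plus 10% for other charges."""
    return (base + access) * (1 + overhead)

def local_cost(server, expansion, colocation):
    """Purchase price of server and expansion unit plus colocation charges."""
    return server + expansion + colocation

if __name__ == "__main__":
    s3_ia = cloud_cost(30_000, 6_500)            # 40TB in S3 IA
    glacier = cloud_cost(33_600, 9_000)          # 80TB in Glacier
    local = local_cost(14_000, 5_500, 12_000)    # one local or colocated system
    print(f"S3 IA ${s3_ia:,.0f}  Glacier ${glacier:,.0f}  local ${local:,.0f}")
    print(f"difference: ${s3_ia - local:,.0f} or ${glacier - local:,.0f}")
```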

The main advantage of Glacier of course is that it is guaranteed offsite, but then colocation can do that too, even if it is harder to organize.

Overall for a 5-year period my impression is that local or colocated cold-storage (or archive-storage) is less expensive than remote third-party storage, so for those who don't need a worldwide CDN that is still the way to go.

160706 Wed: First impressions of the ZOTAC CI323 mini-PC

I have been using the ZOTAC CI323 mini-PC for a couple of weeks. I bought it (from Deals4Geeks for around £150 including VAT, without memory and disk) to check a bit the state of the art with low power, small size servers; mini-PCs are in general based on laptop components, and they may be considered as laptops without keyboard and screen. I have previously argued that indeed laptops make good low power, small size servers and I have been curious to check out the alternative.

Since my terms of comparison are laptops, which usually have few ports, I chose a mini-PC that has better ports than a laptop, to compensate for the lack of builtin screen and keyboard, and the CI323 comes with an excellent set of ports, notably:

3 external display ports, of all 3 popular types: DVI, HDMI, DP.
4x USB3 ports, one of them USB-C, and 2x powered USB2 ports. With UASP, USB3 ports are almost as good as eSATA for mass storage.
Two 1Gb/s Ethernet sockets.

Compared to many other mini-PCs it also has some other notable core features:

Four-core Celeron N3150 chip with nice features like AES and virtualization acceleration. It is also from a very recent low power CPU family, with a typical power draw of 4W.
Two DDR3 SODIMM sockets for RAM, instead of just one.
A full 2.5in disk drive slot with a SATA socket, instead of a flash-stick M.2 slot. The latter may be faster, but flash-sticks can get very hot under load and are somewhat more expensive.
Completely passive cooling, which is remarkable.

The PCI device list and the CPU profile are:

#  lspci
00:00.0 Host bridge: Intel Corporation Device 2280 (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Device 22b1 (rev 21)
00:10.0 SD Host controller: Intel Corporation Device 2294 (rev 21)
00:13.0 SATA controller: Intel Corporation Device 22a3 (rev 21)
00:14.0 USB controller: Intel Corporation Device 22b5 (rev 21)
00:1a.0 Encryption controller: Intel Corporation Device 2298 (rev 21)
00:1b.0 Audio device: Intel Corporation Device 2284 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 22c8 (rev 21)
00:1c.1 PCI bridge: Intel Corporation Device 22ca (rev 21)
00:1c.2 PCI bridge: Intel Corporation Device 22cc (rev 21)
00:1c.3 PCI bridge: Intel Corporation Device 22ce (rev 21)
00:1f.0 ISA bridge: Intel Corporation Device 229c (rev 21)
00:1f.3 SMBus: Intel Corporation Device 2292 (rev 21)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
04:00.0 Network controller: Intel Corporation Wireless 3160 (rev 83)

#  lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 76
Stepping:              3
CPU MHz:               479.937
BogoMIPS:              3199.82
Virtualisation:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3

#  cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 480 MHz - 2.08 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 480 MHz and 2.08 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 480 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes

Given all this, how good is it? Well, so far it seems really pretty good. The things I like less are very few and almost irrelevant:

The lack of a builtin screen and keyboard, as on a laptop, makes it inconvenient to do maintenance. However I think this is a reasonable tradeoff for better connectivity and a smaller, boxy, form factor.
The disk and memory are on the bottom, connected to a metal heat dissipation plate via thermal pads; so I have the box sitting on its side (I have added some tall rubber feet on that side), which is also more convenient for the space I have for it, and seems good given the internal layout shown in this article. They still get a bit warm, around 40-45°C or 20°C above ambient temperature.
I wish it had an eSATA socket, just-in-case, but I don't think it is really necessary as the internal disk slot is sufficient for most uses for such a small box.
It is a bit bigger than a NUC style mini-PC, but not much, and for me quite acceptable.

The things I particularly like about it are:

Connectivity is excellent indeed. The two RAM slots make for easy upgrades, the two Ethernet sockets and the WiFi interface allow using it as a router or a bridge/firewall, the USB3 ports are quite fast. The WiFi chip supports hostapd which gives extra flexibility.
It does not get that hot even with passive cooling. A bit warm, with CPU and disk usually around 40-45°C, that is around 20°C above ambient. Other testers report that it can work for a while at full speed, for example for games. Regardless it is completely silent; even with a hard disk I can't hear a sound.
Power consumption is very good: other testers have reported 7-8W idle, and 15-20W under load, same as a laptop.
I tried it with Ubuntu 14.04.4 with a 4.2 series Linux kernel and it seems all parts are well supported.
ZOTAC tend to do stuff well, even if they are a somewhat minor brand, and it seems quite well built.
It is priced right, a bit cheaper than an equivalent laptop or a NUC, and has better connectivity.
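To put the power figures in the list above in perspective for an always-on server, here is a rough conversion from watts to energy per year. It only uses the wattages reported by other testers; no electricity price is assumed, and the function name is mine.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year


def annual_kwh(watts):
    """Energy used in a year by a device drawing `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000.0


# Idle and load ranges as reported by other testers.
for label, watts in [("idle low", 7), ("idle high", 8),
                     ("load low", 15), ("load high", 20)]:
    print(f"{label:9s} {watts:2d}W -> {annual_kwh(watts):6.1f} kWh/year")
```

So a CI323 that is mostly idle would use somewhere around 61-70 kWh per year if left on all the time.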

So I have decided, especially because of the very low power consumption, to use it as the shared-services server of my 2 home desktops and laptop. For that I was previously using one of the desktops, which is mainly for archive/backups, and so has four disks and draws a lot more power. I can now put that desktop mostly to sleep, and keep the CI323 always on.

So far I think it is a very good product, with a very nice balance of useful features and a good price, and as a low power server it is impressive.

Testers of this and similar models (1, 2, 3, 4, 5, 6) have looked at it as a desktop PC, a media PC or even a gaming console, and they seem pretty decent for all of those uses; the GPU is good enough for games that are not the most recent, and the overall performance envelope seems both good and isotropic.

160701 Fri: Cold storage, archiving, Blu-Ray, large disks

A somewhat fascinating story revolves around cold-storage and archival storage. They are two storage types that are designed to be used in a rather atypical way:

Low IOPS per TB are acceptable.
Writes tend to be sequential and large but infrequent.
Reads are quite rare for cold storage and almost never happen for archival storage, and when they happen they tend to be sequential within each file, but the files read are random.

Note: recent backups should go to cold-storage, not archival-storage (1, 2).

Decades ago hot-storage used to be on magnetic drums and warm-storage on magnetic disks, while cold-storage and archival-storage were both on tape.

Currently hot-storage is on flash SSD or low-latency and low-capacity magnetic disk, warm-storage on smaller capacity (300GB-1TB) magnetic disk, cold-storage on larger capacity (2TB and larger) low IOPS-per-TB magnetic disks, and archival-storage on tape cartridges inside large automated cartridge shelves.

Cold-storage is meant to be read infrequently and only for staging to warm- or hot-storage, and archival-storage is meant to be almost never read back, to be write-only in most cases.

The typical cold-storage hardware is currently low IOPS-per-TB disks in the 4TB-6TB-8TB range, arranged in 2-3 way replication or, at the boundary between cold- and archival-storage, using erasure codes groups that have a very high cost for small writes or any reads when some members of the group are unavailable, but have better capacity utilization at the cost of only a small loss of resilience.
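The tradeoff between replication and erasure codes can be made concrete with a small sketch. It assumes an MDS code (such as Reed-Solomon) where any `data` surviving fragments are enough to rebuild the rest; the function names and the returned fields are mine, and the 17+3 layout is the BackBlaze one mentioned later in this entry.

```python
def replication(copies):
    """Properties of keeping `copies` full replicas of each block."""
    return {
        "utilization": 1 / copies,        # usable fraction of raw capacity
        "losses_tolerated": copies - 1,   # members that can fail without data loss
        "degraded_read_fetches": 1,       # any surviving replica serves the read
    }


def erasure_code(data, parity):
    """Properties of a data+parity erasure code group (assumed MDS)."""
    return {
        "utilization": data / (data + parity),
        "losses_tolerated": parity,
        # Rebuilding one missing fragment needs `data` surviving fragments.
        "degraded_read_fetches": data,
    }


for name, props in [("2-way replication", replication(2)),
                    ("3-way replication", replication(3)),
                    ("17+3 erasure code", erasure_code(17, 3))]:
    print(f"{name}: {props['utilization']:.0%} usable, "
          f"survives {props['losses_tolerated']} losses, "
          f"{props['degraded_read_fetches']} fetches per degraded read")
```

The last column is the reason reads become expensive when a member of an erasure code group is unavailable: the missing fragment has to be recomputed from many others instead of being read from another copy.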

What has happened in the recent past is that cloud storage companies have been offering all types of storage, and in particular for cold and backup storage they have offered relatively low cost per TB used.

That is remarkable because usually cloud shared computing capacity is priced rather higher (1, 2, 3) than the cost of dedicated computing capacity, despite the myth being otherwise. The high cost of cloud computing is only worth it for those who require its special features, mostly that it effectively comes with a built-in CDN that otherwise would have to be built or rented from a CDN business.

The most notable examples are Amazon Glacier and BackBlaze B2 and they are very different.

They have rather dissimilar pricing for capacity: between $80 and $140 per year per TB for Glacier, and $60 per TB per year for B2.

The cost of S3 capacity is 10 times higher than Glacier; there are other services like rsync.net that cost about the same as S3, and Google Nearline, which costs about the same as Glacier.

The main differences are that Glacier has very strange pricing for reading and a highly structured CDN, and Amazon keeps the details of how it is implemented a trade secret, while B2 has a simple pricing structure, a well known implementation, and a much simpler CDN. Starting with B2, the implementation is well documented (1, 2, 3, 4):

Custom built, but with an entirely standard design, for storage servers with 45 or 60 disk drives.
They are organized in erasure code sets of 17+3.
The disk drives are ordinary cold storage ones in the 4TB-6TB-8TB range:
- 4TB hard drives => 3.6 petabytes/vault (Deploying today.)
- 6TB hard drives => 5.4 petabytes/vault (Currently testing.)
- 8TB hard drives => 7.2 petabytes/vault (Small-scale testing.)
- 10TB hard drives => 9.0 petabytes/vault (Announced by WD & Seagate.)
There is no CDN or inter-site resilience.
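The vault figures quoted above imply a fixed number of drives per vault, which a few lines of arithmetic can check. This sketch assumes decimal units (1PB = 1000TB) and that the quoted petabytes are raw capacity before the 17+3 parity overhead; both are my assumptions, as are the names used.

```python
# Drive size in TB -> quoted capacity per vault in PB.
VAULT_RAW_PB = {4: 3.6, 6: 5.4, 8: 7.2, 10: 9.0}
DATA, PARITY = 17, 3  # erasure code set layout quoted in the text


def drives_per_vault(drive_tb, vault_pb):
    """Number of drives implied by a vault capacity (decimal units)."""
    return round(vault_pb * 1000 / drive_tb)


def usable_pb(vault_pb):
    """Capacity left for data if the quoted figure is raw capacity."""
    return vault_pb * DATA / (DATA + PARITY)


for drive_tb, vault_pb in VAULT_RAW_PB.items():
    print(f"{drive_tb}TB drives: {drives_per_vault(drive_tb, vault_pb)} drives, "
          f"{usable_pb(vault_pb):.2f} PB usable of {vault_pb} PB")
```

Every row works out to the same drive count, so under these assumptions moving to larger drives changes the capacity of a vault but not its layout.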

In other words, this is just a cost optimized version of a conventional cold-storage system. As to this BackBlaze have published several interesting insights, notably:

Unsurprisingly the cost difference per TB among 4TB, 6TB and 8TB drives is small: the main advantage of the larger drives is a lower number of storage servers, at the cost of even lower IOPS-per-TB and higher cost-per-TB. This reflects what a perceptive author noticed: Kryder's Law is no longer valid and storage price/performance improvements have considerably slowed down.

As to Glacier instead there is the mystery of the trade secret implementation: the Amazon warm-storage service costs 10 times as much, and it is known that it is a cost optimized version of a conventional warm-storage system, based on ordinary lower capacity disk drives. Potentially some clue is in the strange pricing structure (1, 2, 3, 4) for reading back data from Glacier:

Reading can begin only 4 hours after being requested.
Reads are charged at the cost of the largest retrieval request divided by the 4 hour waiting period, times the number of hours in a month, minus a free allowance of 0.17% of the stored data per day.
Therefore reading as slowly as possible (the smallest amount possible for each 4 hour period compatible with stored object size) all of the time costs a lot less than reading fast even once and slowly at any other time.
If instead of reading data the data is deleted there is an extra charge for deleting within 3 months of writing it, which in effect means that there is a minimum billing period once the data has been uploaded.

Therefore if one wants to read 100GiB out of 500GiB it is 10 times cheaper to request 10GiB every 4 hours than request 100GiB at once, as the first incurs a read charge of 10GiB for the whole month, and the latter one of 100GiB for the whole month.
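The 10 times difference follows directly from the rule that the peak request sets the charge for the whole month. The sketch below is a simplified model of that rule as described in this entry, not Amazon's actual billing formula: it assumes a 30-day month, leaves out the price per GiB and the per-object fees, and reports the free allowance separately instead of subtracting it. All names are mine.

```python
import math

HOURS_PER_MONTH = 30 * 24        # assumption: a 30-day month
WINDOW_HOURS = 4                 # waiting period quoted in the text
FREE_FRACTION_PER_DAY = 0.0017   # 0.17% of stored data per day


def billable_gib(peak_request_gib):
    """Billed volume for the month, driven only by the largest request."""
    peak_rate = peak_request_gib / WINDOW_HOURS   # GiB per hour
    return peak_rate * HOURS_PER_MONTH


def hours_to_read(total_gib, chunk_gib):
    """Elapsed time when requesting one chunk per 4 hour window."""
    return math.ceil(total_gib / chunk_gib) * WINDOW_HOURS


def daily_free_allowance_gib(stored_gib):
    """Amount that can be read per day at no charge."""
    return stored_gib * FREE_FRACTION_PER_DAY


fast = billable_gib(100)   # one request for all 100GiB
slow = billable_gib(10)    # ten requests of 10GiB each
print(f"fast/slow charge ratio: {fast / slow:.0f}x")
print(f"slow read takes {hours_to_read(100, 10)} hours")
print(f"free allowance on 500GiB: {daily_free_allowance_gib(500):.2f} GiB/day")
```

In this model the slow read costs a tenth as much but takes 40 hours instead of 4, which is the tradeoff the pricing seems designed to push users towards.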

Note: there are other pricing details like a relatively high per-object fee.

This pricing structure suggests that Glacier relies on a two tier storage implementation, and the cost of reading is the cost of copying the data between the two tiers:

There is a slow but cheap second bulk tier that cannot be accessed directly by users, and a faster but more expensive first staging tier that can be accessed directly by users.
The minimum billing period implied by the deletion charge suggests that not just allocating space on the bulk second tier has a cost, but also that uploading to it is expensive.
The pricing for storing data relates to the cost of capacity in the second bulk tier.
Reading back involves allocating space on the first staging tier to hold the entire amount of data in the read back request.
The way to minimize storage tier occupancy is therefore to read at a constant slow rate: this allows requesting only a small allocation of first staging tier storage.
The allocation of the first staging tier space lasts until the end of the month (but is charged for the whole month).
The 0.17% free allowance per day means that a first staging tier allocation of up to 0.17% of the total data stored is free.

My guess is that the staging first tier is just the warm-storage Amazon S3. There has been much speculation (1, 2, 3, 4, etc.) as to what the bulk second tier is implemented with, and the two leading contenders are:

The bulk second tier is similar to that of BackBlaze, 4TB-6TB-8TB disks that are densely packed and kept spun down to minimize power consumption and heat and vibration, but with custom low-cost features, such as a low RPM, and/or SMR technology.
I remember that until several years ago it was possible to buy 3600RPM 5.25in higher capacity disks even when most disks were already 3.5in 7200RPM ones, and presently surveillance class disks tend to run at 5900RPM, and are targeted at the same usage profile as cold- and archival-storage: recording of video, which is very rarely read back.
The bulk second tier is made of packs of high capacity, triple layer Blu-Ray disks held in cartridges stored in automated cartridge shelves.

The Blu-Ray hypothesis has recently got a boost as an Amazon executive involved in the Glacier project has commented upon the official release of a Blu-Ray cold-storage system, Sony's Everspan, and he writes:

But, leveraging an existing disk design will not produce a winning product for archival storage. Using disk would require a much larger, slower, more power efficient and less expensive hardware platform. It really would need to be different from current generation disk drives.
Designing a new platform for the archive market just doesn’t feel comfortable for disk manufacturers and, as a consequence, although the hard drive industry could have easily won the archive market, they have left most of the market for other storage technologies.

This is quite interesting, because it seems a strong hint that Glacier does not use magnetic disks for its bulk tier; at the same time it is ironic, because BackBlaze has created a competitive product (for those that don't need a built-in CDN with multi-region resilience) based on ordinary commodity cold-storage 4TB-6TB-8TB disks.