This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
The WWW is a wonderful if messy library and I keep lists of hyperlinks (bookmarks) to notable parts of it, currently around 6,000. I have switched the tool I use to keep these lists a few times:
drag-and-drop of hyperlinks, but it was still awkward to use and had some limitations when used to manage lists of notes instead of random collections of notes, for example the inability to sort a list of notes (that is, bookmarks). It was also quite slow, often taking many seconds to open a collection of notes, incomprehensibly.
The main problems were actually similar across these three GUI bookmark managers:
attributes instead of editing them in place.
So I started looking for text-based bookmark list managers, and I first found Buku and tried it. It is a purely command line bookmark database, and does not organize bookmarks by list, but by keywords. It works well, but that is not quite what I wanted. So I thought about what I wanted and I realized that I really wanted a nested list manager, and that an outline-oriented editor could be used in that way.
Then I remembered that there is a mode for
outline editing in EMACS, and that EMACS can open files or
hyperlinks in the middle of text, and that there was a further
evolution of outline editing with added functionality for
maintaining lists of notes, which is Org mode.
I reckon that in general outline editing is not that useful,
but it occurred to me that it may match well looking at and
editing nested lists.
Org mode is an extension of outline editing mode with more
operations available, among them easy ways of sorting lists,
moving groups of list items, and displaying only entries that
match a regular expression, so I started using it.
It is a lot better than the GUI-based list managers I have tried so far for managing lists of bookmarks, and even citations. In particular it is much, much faster, every operation being essentially instantaneous, as it should be on a few hundred KiB of data, and very convenient to use, and I have finally been able to re-sort and tag and update my bookmark collection. Since EMACS also runs in pure text mode, it also works on the command line in full screen mode. Plus of course since it is built inside a full-featured text editor it has all the power of the text editing tools available.
While I am happy that I found in my old EMACS a good solution I am also sad because I usually find applications built within EMACS like the email reader/writer VM and the SGML/XML structure editor PSGML to be preferable to dedicated tools. I think that this is because of some fairly fundamental issues:
computer science cliches (and usually MS-Windows ones). In particular they don't care to provide the data structures they define with a sufficient vocabulary of operations (not just add a member, edit a member, delete a member; operations such as copying, moving, bulk operations and sorting and filtering are also needed in most cases), and that code modularity is good because users want to do many different things.
So org mode is working well for me for bookmark lists, and I
guess that it would work well for keeping other types of
lists. There are indeed a number of org mode enthusiasts
who do nearly
everything with org mode, as
anything can be turned into a list
of items, but I haven't reached that stage yet.
One of the more interesting aspects of Linux (the kernel) is the number of filesystem designs that have been added to it. In part I think this is because it started with the Minix filesystem which had a number of limitations, then someone implemented xiafs which also had some limitations, and then things started to snowball and everybody tried to write better alternatives.
Among the current filesystem designs there are some that are generic UNIX-style filesystems and some that are rather specialized as to features or purpose, or are not very UNIX-like. Having tried many, the ones that I like most are:
When GRUB2 sets up the system console, if it is graphical, it is set to some video mode (resolution and depth), and the same happens when the loaded Linux kernel starts; the two modes are not necessarily the same.
Usually the defaults are suggested by the BIOS (if it is an IA system), or by the DDC information retrieved from the monitor. Sometimes the defaults are not appropriate or available, so they must be set explicitly.
Having had a look at the GRUB2 and Linux kernel documentation
and done some experiments I discovered that as usual it is
reticent and misleading, and the
more reliable and complete story is:
framebuffer graphics card driver, the mode can be set with the GRUB2 gfxpayload variable, which is passed to the kernel as a boot environment variable, which can be set to the string keep to have it the same as the GRUB2 graphics mode. Unfortunately this did not work with my Ubuntu 4.4 kernel.
framebuffer driver video kernel parameter. This worked well with my Ubuntu 4.4 kernel.
I have found that good defaults are 1024x768x32,1024x768,auto for gfxmode and 1024x768-32 for video: virtually all graphics cards support them, virtually all monitors have that resolution or higher, and that resolution allows for a decently sized console, around 128x38 characters with typical fonts.
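As a sketch of how those defaults might be written down, assuming a Debian or Ubuntu style system where GRUB2 is configured through /etc/default/grub (a file in shell syntax) and the menu is regenerated afterwards with update-grub:

```shell
# Fragment for /etc/default/grub; assumes the distribution's GRUB2
# scripts honour these three variables.

# Modes for GRUB2 itself, tried in order until one is available.
GRUB_GFXMODE='1024x768x32,1024x768,auto'

# Ask GRUB2 to hand its own mode over to the kernel; this may or may
# not take effect depending on the kernel's graphics driver.
GRUB_GFXPAYLOAD_LINUX='keep'

# Mode requested from the kernel framebuffer driver; on a real system
# append this to any existing value instead of replacing it.
GRUB_CMDLINE_LINUX='video=1024x768-32'
```

Setting both the payload and the video parameter is deliberately redundant, since which of the two is honoured depends on the kernel and driver in use.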
Having pondered the match (or rather more often the mismatch) between system and workload performance envelopes I found in an article a pithy way to express this in the common case:
A homogeneous system cannot be optimal for a heterogeneous workload.
My work would often have been a lot more enjoyable if my
predecessors had set up systems accordingly. But the only case
where a homogeneous cluster can run a heterogeneous workload
well is when it is oversized for that workload, so that the
performance envelope of the workload is entirely contained
within that of the system. Consequently first impressions
of a newly set up homogeneous cluster can be positive for a
while, only to sour
as the cluster workload nears capacity (or worse, when it
exceeds it).
The author of the presentation in which the quote is contained has done a lot of interesting work on latency, which is usually the limiting factor in homogeneous systems, and particularly on memory and interconnect latency and bandwidth. Since my usual workloads tend to be analysis rather than simulation workloads, my interest has been particularly in network and storage latency (and bandwidth), which is possibly an even worse issue.
Quite amused by looking at an old email list post about advice on buying a UNIX system from March 1990, suggesting a 386SX 16/32b processor at 20MHz with 4MiB of RAM, a 40MB-80MB disk, and UNIX System V versions Xenix or ESIX, with prices in either UK pounds or USA dollars:
Best configuration is a 386SX, 4 Megs, 1/2 discs either RLL, ESDI or SCSI, for a total of 80-120 megabytes, a VGA card with a 14" monitor, and a QIC-24 tape drive. Software should be either Xenix 386 (more efficient) or ESIX rev. D (cheaper, fast file system).
Here are UK and USA prices for all the above:

    Item                            UK      USA
    Xenix 386 complete              #905    $1050
    Xenix TCP/IP                    #350
    ESIX complete                           $800
    386SX with 1 meg                #660    $900
    additional 3 meg                #240    $300
    VGA 16 bit to 800x600           #100    $190
    VGA 16 bit to 1024x768          #150    $250
    VGA mono 14"                    #110    $240
    VGA color 14"                   #260
    VGA color 14" multisync         #300    $440
    VGA color 14" to 1024x768       #450    $590
    1 RLL controller                #100    $140
    1 RLL disc 28 ms 40/60 meg      #260    $380
    1 RLL disc 25 ms 40/60 meg      #300
    1 RLL disc 22 ms 71/100 meg     #450    $570
    1 RLL disc 28 ms 80/120 meg     #520    $570
    1 ESDI controller               #170    $200
    1 ESDI disc 18 ms 150 meg       #660    $1200
    Epson SQ 850                    #490
    Desk Jet +                      #510    $670
    LaserJet IIP                    #810    $990
    Postscript for IIP              #545    $450
    Archive 60 meg tape             #420    $580
Note: given inflation roughly double the prices above to get somewhat equivalent 2016 prices.
That's just over 25 years ago, and it is amusing to me because:
Recently I changed ISP so that my previously current O2 Wireless Box V stopped being suitable, and I tried two other ADSL modem-routers that I had kept from years ago:
Note: in a previous post I surmised that the Vigor 2800 does not support 6to4 IPv4 packets. That was a mistake: it does allow them through, it just does not do NAT on the encapsulated IPv6 packet, and this just requires a slightly different setup.
It is somewhat remarkable that 12 and 10 year old electronics still work. They have not been used for half of that time, but the age is still there. I guess it helps that they have no moving parts.
It is more remarkable that they are still usable, and connect at the top speeds available on regular broadband lines to consumers even today, same as 10-12 years ago. Current FTTC lines with VDSL2 and cable lines can do higher speeds, but they cost more and they have required extensive recabling. It looks like ADSL2+ is about as good as it will ever get on electrical lines, and will be stuck in the 2000s.
Recently while discussing "cloud" virtual machine hosting
options for a small business a smart person mentioned that other
customers of the fairly good GNU/Linux virtual machine hosting
business he was using were perplexed by the hosting provider's
default (which customers can opt out of) of keeping
root access to the virtual machines.
My comment on that is that many hosting providers realize that most of their customers are not particularly computing literate and thus can be forgetful of bad situations in their virtual machines, such as security breaches, that can cause trouble to other customers or to the hosting provider itself, and that in such situations a hosting provider has a number of options, listed in order of increasing interventionism:
Many issues can be easily fixed by direct intervention inside the VM, which both solves the issue from the point of view of the hosting provider and avoids impacting the availability of the customer's system.
The comment on that was that some customers were reluctant to leave the default root access with a third party, for privacy or commercial confidentiality reasons, or for auditability. As to this I pointed out that a virtual machine hosting provider (that is, any of their employees with access, including those working covertly for other organizations) has anyhow complete control over and access into a customer's virtual machines, including any encrypted storage or memory, as they can snapshot and observe or modify any aspect of the virtual machine's state, essentially invisibly to any auditing tools inside the virtual machine, so direct root access was just a convenience for the benefit of the customer.
Note: in a sense a system hosting
VMs is a giant all-powerful
all the VMs it hosts.
Of course given the complete access that "cloud" hosting providers have to the VMs and data of their customers only a rather incompetent or underfunded intelligence gathering organization would refrain from infiltrating as many popular hosting providers as possible with loyal maintenance (building, hardware, software) engineers, as most intelligence gathering organizations have mandates from their sponsors to engage in pervasive industrial espionage, at the very least.
Note: in the case of a hosting
provider manager discovering somehow that one of their
engineers has been accessing customer VMs, their strongest
incentive is to maintain
customers either by doing nothing or by terminating the
engineer's contract with excellent benefits and a
confidentiality agreement, without investigating the hosting
systems to remedy the situation, because such an investigation
might cost a lot of money for no customer-visible, never mind
To which someone else pointed out that in some countries
(Germany was mentioned) many businesses keep their own
physical systems hosted on their own premises, with suitable
Internet links. My comment to this is that this is
understandable: even hosted physical systems can be
compromised, even if less easily than VMs. Systems purchased
and installed by the customer in
cages at a
hosting provider can also be compromised, even if less easily
still. For moderately confidential business data co-located
customer-purchased and installed systems in a cage with a
mechanical padlock purchased and installed by the customer
(cage locks controlled by the hosting provider are obviously
not a problem for insiders) seems good enough, but protecting
more confidential business data requires keeping the systems
on-premises to remove a further class of potential and
That "cloud" VMs (or physical systems) can be compromised at will by hosting provider engineers in an unauditable way is perhaps not commonly considered by their customers with even moderately sensitive data.
Note: Hopefully no bank or medical data are processed on a "cloud" hosted system.
Note: Well encrypted data held on a
"cloud" storage system is of course not an issue. But most
"cloud" storage vendors make it quite expensive to access data
from outside their "cloud", and that often involves
significant latency anyhow, so usually "cloud" storage is
accessed by systems hosted in the same "cloud" (usually the
region of it).
What "cloud" hosting providers seem most useful for (1, 2) as to both cost and confidentiality is to provide a large redundant international CDN for nearly-public data.
I have mentioned previously how easy (but subtle) it is to set up 6to4 (1, 2, 3, 4), but I have left a bit vague the issue of NAT with 6to4, and NAT is unfortunately very common for home users.
As 6to4 encapsulates an IPv6 packet inside an IPv4 packet there are two possible layers of NAT, on the containing IPv4 addresses and on the contained IPv6 addresses, and a limitation: since IPv4 packets don't have ports, the internal-to-external address mapping of NAT must be one-to-one. If IPv6 was encapsulated in an IPv4 UDP datagram the port number in the UDP header could be used to map multiple internal addresses to one external address.
This means that there can only be as many 6to4 systems on an internal LAN as there are external addresses available to the NAT gateway, and often that is a single one. The rest of the discussion is based on assuming that there is a single external address.
It is still possible to have many internal systems on IPv6 as long as they route via the 6to4 system, and the internal IPv6 subnet is chosen appropriately. The latter is not difficult as each IPv4 address defines a very large 6to4 subnet, whose prefix contains the 6to4 prefix (2002::/16) in the first 16 bits and the IPv4 address in the next 32 bits. So for example:
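The mapping from an IPv4 address to its 6to4 prefix is mechanical, and a minimal shell sketch of it follows; the function name sixtofour_prefix is a hypothetical helper of mine, and the two addresses are the ones used in the full examples further down.

```shell
# Print the 48-bit 6to4 prefix for a dotted-quad IPv4 address:
# the 16 bits of 2002 followed by the 32 bits of the IPv4 address.
# (sixtofour_prefix is a name made up for this sketch.)
sixtofour_prefix() (
    # Runs in a subshell, so changing IFS does not leak to the caller.
    IFS=.
    # Split the dotted quad into four positional parameters.
    set -- $1
    printf '2002:%02x%02x:%02x%02x\n' "$1" "$2" "$3" "$4"
)

sixtofour_prefix 192.0.2.230    # external address of the NAT gateway
sixtofour_prefix 10.168.1.10    # internal address of the 6to4 gateway
```

The first call prints 2002:c000:02e6 and the second 2002:0aa8:010a, which are the prefixes that appear in the interface configurations below.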
The first two issues are whether the NAT gateway allows passing IPv4 packets with a protocol type of 41 (IPv6) at all, and whether it performs NAT on them. Both must be possible to use 6to4 in a NAT'ed subnet.
The goal then is to ensure that when a 6to4 packet exits the NAT gateway it has these properties:
The last point is the really important one, so that when replying the target IPv6 system knows that the reply IPv6 packet must be encapsulated in an IPv4 packet or sent to a 6to4 gateway (because of the 2002::0/16 prefix), and the target address of the IPv4 packet is the external address of the NAT gateway.
Therefore the next issues are whether the NAT gateway also NATs the contained IPv6 addresses, and whether the external address is static or dynamic:
So in practice the two simple cases are to use an internal IPv6 prefix based on the static external IPv4 address of the NAT gateway or, if the NAT gateway can map the IPv6 source addresses, a prefix based on the internal IPv4 address of the 6to4 gateway.
Two full examples based on the following values from the example above, assuming the IPv4 configuration is already done:
    # NAT gateway external IPv4 address if any.
    EXT4='192.0.2.230'
    PFX6="`printf '2002:%02x%02x:%02x%02x' 192 0 2 230`"

    # 6to4 gateway internal IPv4 address.
    INT4='10.168.1.10'
    PFX4="`printf '2002:%02x%02x:%02x%02x' 10 168 1 10`"

    # If "$EXT4" is nonempty then the NAT gateway external IPv4 address is
    # static and known, so the internal IPv6 prefix will be based on it.
    case "$EXT4" in
    ?*) IP6P="$PFX6";;
    '') IP6P="$PFX4";;
    esac

    # 6to4 gateway: address and route on the internal LAN interface.
    ip -6 address add "$IP6P":0001::000a/128 dev eth0
    ip -6 route add "$IP6P":0001::/64 dev eth0

    # 6to4 gateway: tunnel for traffic to other 6to4 sites (2002::/16).
    ip tunnel add 6to4net mode sit local "$INT4" remote any ttl 64
    ip link set dev 6to4net mtu 1280 up
    ip -6 addr add dev 6to4net "$IP6P":0001::000a/16

    # 6to4 gateway: tunnel to the anycast 6to4 relay for other IPv6 traffic.
    ip tunnel add 6to4rly mode sit local "$INT4" remote 192.88.99.1 ttl 48
    ip link set dev 6to4rly mtu 1280 up
    ip -6 addr add dev 6to4rly "$IP6P":0001::000a/128
    ip -6 route add 2000::/3 dev 6to4rly metric 100000

    # Another internal host, routing via the 6to4 gateway.
    ip -6 address add "$IP6P":0001::0022/128 dev eth0
    ip -6 route add "$IP6P":0001::0/64 dev eth0
    ip -6 route add default via "$IP6P":0001::000a
The same in abbreviated form for Debian's /etc/network/interfaces:
    # On the 6to4 gateway:

    auto eth0
    iface eth0 inet6 static
      netmask 64
      address 2002:c000:02e6:0001::000a
      # address 2002:0aa8:010a:0001::000a

    auto 6to4net
    iface 6to4net inet6 v4tunnel
      endpoint any
      local 10.168.1.10
      netmask 16
      address 2002:c000:02e6:0001::000a
      # address 2002:0aa8:010a:0001::000a

    auto 6to4rly
    iface 6to4rly inet6 v4tunnel
      endpoint 192.88.99.1
      local 10.168.1.10
      netmask 3
      address 2002:c000:02e6:0001::0001
      # address 2002:0aa8:010a:0001::000a

    # On another internal host:

    auto eth0
    iface eth0 inet6 static
      netmask 64
      gateway 2002:c000:02e6:0001::000a
      address 2002:c000:02e6:0001::0022
      # gateway 2002:0aa8:010a:0001::000a
      # address 2002:0aa8:010a:0001::0022
Currently I am using mostly Btrfs for my home computer filetrees, mainly because of Btrfs's extensive support for data checksums. My setup also pairs the active disk drive with a backup one, where the second is synchronized via rsync with the original every night, and then I also do periodic manual backups to an external drive or two.
I have recently chosen to use NILFS2 for the backup drives, for these main reasons:
continuous snapshotting which means that the backup drive contains many previous versions of the original filesystem.
Among others Google have adopted (the main part of) the purely-routed network topology that I described a few years ago and that I had implemented earlier at a largish science site, and is based on routers (and servers) being multihomed on multiple otherwise unconnected backbones with connectivity managed by OSPF and ECMP, and using /32 routes to provide topology independent addresses for services.
The Google paper and other descriptions
refer to them as
leaf-spine (or more
properly spines-leaves) topologies and claim that
they are crossbar-switch topologies as those
introduced by Charles Clos
quite some time ago to interlink several crossbar
switches into a whole that performed almost like a single
large crossbar. These topologies have long been known also as
multi-stage distribution topologies and typically have 3
stages.
The multiple independent backbone network spines-leaves topologies I have used are not quite Clos-style networks; the shape is different as Clos topologies need at least three stages, and spines-leaves topologies are as a rule on two levels; also Clos topologies are pass-through, while spines-leaves topologies connect endpoints among themselves. The main common point is that the second stage or level as a rule involves fewer switches than the first level.
Note: There is thus a semi-plausible argument that spines-leaves topologies are degenerate Clos topologies where the first and third stage switches are the same, but I don't like it.
The main point is not however about the different shape, but the rather different purposes:
The shape of spines-leaves topologies is driven by their primary goal which is resilience achieved by:
Note: the use of /32 host routes on virtual interfaces might allow using entirely LAN-local dynamically allocated addresses (such as the 169.254.0.0/16 IPv4 range) for link-interface addresses.
This delivers resilience which is scalable by adding more spine/backbone routers and/or more leaf/local routers; as to the latter, servers (and even clients) can well be multiple-homed themselves onto multiple routers, and running an OSPF daemon on a host is a fairly trivial task. I even had an OSPF daemon running on my laptop. In my largish science installation I had servers that were critical to overall resilience and performance multiple-homed directly on the spine/backbone routers, or those important to specific areas on the leaf/local routers.
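As a rough illustration of a topology independent service address on a multiple-homed server, the sketch below puts a /32 on a virtual interface; the address (from a documentation range), the interface name and the run wrapper are all assumptions of mine, and the commands are only printed unless APPLY=1 is set.

```shell
# Hypothetical values chosen for this sketch.
SVC4='192.0.2.53'   # service address, independent of any physical link
SVCIF='svc0'        # name of the virtual (dummy) interface

# Print the commands by default; run with APPLY=1 as root to execute them.
run() {
    if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# A dummy interface stays up whichever physical link or router fails.
run ip link add name "$SVCIF" type dummy
run ip address add "$SVC4/32" dev "$SVCIF"
run ip link set dev "$SVCIF" up

# The OSPF daemon on the host must then be configured to advertise the
# /32 over every physical interface; that depends on the daemon used
# and is not shown here.
```

Clients then reach the service at the /32 address through whichever router currently advertises a working path to it.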
The spines-leaves topology is not designed to deliver full symmetrical N-to-N trunking of circuits like Clos topologies because client-server packet flows are not at all like that. However it also gives nice scalable capacity and latency options: adding more spine/backbone or leaf/local routers and thus links and switching capacity to the whole, again largely transparently thanks to OSPF/ECMP routing and /32 host routes.
Finally spines-leaves topologies are very maintainable: because of their dynamic and scalable resilience and capacity it is possible to overprovision them, which makes it possible to take down small or large parts of the infrastructure for maintenance quite transparently, while the rest of the infrastructure continues to provide service. It is also important for maintainability that the router configurations end up being trivially simple:
A famous principle of system optimization is that removing the main performance limiter exposes the next one. So faster CPUs make memory the next performance limiter, and so on.
But another important detail is that what is a performance
limiter depends critically on the match or mismatch between
the workload and the performance envelope
of each system component.
In particular many system components like disk have highly anisotropic performance envelopes, so that their effective capacity is highly dependent on workload.
In particular capacity can be highly dependent not just on the profile of the workload, but also on its size, because effective component capacity often depends nonlinearly on its utilization.
As an example, consider a typical parallel batch processing cluster with around 20 systems, a total capacity of around 600 threads and 200 large disk drives, with around 100 threads devoted to background daemons and 500 available for parallel jobs. Each of the disk drives can deliver anywhere between 100MB/s when used by a single thread and 0.5MB/s when used by many threads.
In such a cluster total capacity is at some point inversely proportional to number of threads running that do IO, even if that IO is sequential, because the more threads that do IO on the same disk the lower the total transfer rate that the disk can deliver.
It must be emphasized that what decreases at some point is the total capacity of the cluster, not merely per-thread resources; the per-thread resources will fall at some point more than linearly.
Therefore suppose that overall 1.5 threads can share a disk given their average transfer rate and IOPS requirements: then if more than 200 threads are added to the 100 background threads total cluster capacity will fall because each disk will deliver a lower, possibly a much lower, aggregate transfer rate. This happens even though 300 total threads are well below the 600 total capacity of the cluster.
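The arithmetic of the example can be sketched in a few lines of shell; all the figures are the ones given above, and the variable names are mine.

```shell
# Figures from the example cluster.
disks=200
slot_capacity=600
background_threads=100
threads_per_disk_x10=15     # 1.5 threads per disk, in tenths

# Thread count at which the disks saturate, and how many of those
# threads are left for jobs once the background daemons are counted.
saturation_threads=$(( disks * threads_per_disk_x10 / 10 ))
job_threads=$(( saturation_threads - background_threads ))

# Aggregate transfer rate at the two extremes of per-disk behaviour:
# 100MB/s per disk with one thread, 0.5MB/s per disk with many threads.
best_mbps=$(( disks * 100 ))
worst_mbps=$(( disks * 5 / 10 ))

echo "saturation at $saturation_threads of $slot_capacity thread slots"
echo "room for $job_threads job threads"
echo "aggregate transfer rate between $worst_mbps and $best_mbps MB/s"
```

So the disks saturate at 300 threads, half of the nominal 600 slots, and the aggregate transfer rate can range over a factor of 200 depending on how many threads share each disk.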
Sometimes this situation is called a post-saturation regime, where in the example above cluster saturation is reached with 300 threads. Once that regime is reached additional load will further reduce capacity, and total time to complete may well be longer than with sequential execution. In an extreme case I saw many years ago running two jobs in parallel took twice as long as running them in series. In the example above running an additional 400 threads reaching 100% CPU occupancy will result in a total completion time probably rather higher than running two sets of 200 threads serially.
Note: for virtual memory the same situation is called thrashing, and outside computing it often leads to gridlock: imagine a city that has a saturation point of 200,000 cars: with 250,000 cars traffic speed will slow down enormously, and with 300,000 cars traffic speed will probably be close to zero. For car traffic the resources whose total capacity reduces past saturation point are intersections, as cars must stop and start again the more often the more cars use them.
In a similar situation an observant user asked me to configure
the cluster job scheduler to limit the number of concurrent
threads to the number that would not trigger a post-saturation
regime on the disks. Most cluster job schedulers can be
configured with user-defined consumable resources, for example
in the case of Grid Engine
using qconf -Mc.
For the example above, of the 200 total disks the capacity of around 50-70 is consumed by the background threads, and one can define complex values for total IOPS and total sequential transfer rate available, to accommodate jobs with mostly random or mostly sequential access patterns, as follows, given 140 disks of remaining capacity:
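A sketch of how such totals might be derived follows. The per-disk IOPS figure is purely my assumption (it is not given in the text), the resource names iops and mbps are illustrative, and the two definition lines only imitate the column layout that qconf -sc displays for Grid Engine complexes.

```shell
remaining_disks=140
mbps_per_disk=100    # sequential MB/s per disk, from the example above
iops_per_disk=100    # ASSUMPTION: rough figure for a rotating disk

total_mbps=$(( remaining_disks * mbps_per_disk ))
total_iops=$(( remaining_disks * iops_per_disk ))

# Consumable definitions, in the column order
# name shortcut type relop requestable consumable default urgency.
printf '%s\n' 'iops iops INT <= YES YES 0 0' 'mbps mbps INT <= YES YES 0 0'

# Cluster-wide capacity to attach to those consumables.
echo "complex_values iops=$total_iops,mbps=$total_mbps"
```

A job with mostly sequential access would then request an amount of mbps, and one with mostly random access an amount of iops, so that the scheduler stops starting jobs once either total is used up.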
Which is actually a bit optimistic because we have assumed that 200 threads can saturate on average 200/1.5 => 133 disks, but if there are hot spots, because of non-uniform distribution of IO from threads to disk, the post-saturation regime can happen with fewer than 200 threads.
Note: that configuration is even more optimistic because it gives as capacity the raw physical capacity of the 200 example disks. The actual capacity in IOPS and MBPS available to user threads can be significantly lower if there are layers of virtualization and replication in the IO subsystem, for example if the jobs run in virtual machines over a SAN or NAS storage layer.
When I configured consumables for IOPS and MBPS for the cluster as requested by that observant user I got complaints from other users and management that, since this limited the number of concurrent threads, it limited the utilization of the cluster. But the relevant utilization was that of the main performance limiter, in that case disk capacity; ignoring it and considering the number of available thread slots instead would overload disk capacity, thus achieving lower overall utilization. It seemed as if the number of thread slots occupied was considered more important than the number of thread slots utilized, so I had to make the specification of those consumable resources optional, which rather defeated their purpose. But the experienced user who had made the request continued to use them, and his cluster jobs tended to run at times when nothing else was running, so at least he benefited.
Note: the capacity reduction in a post-saturation regime on disk is due to a transition from sequential accesses with a per-disk transfer rate of 100-150MB/s to interleaved (and thus random-ish) accesses with a per-disk transfer rate that can be as low as 0.5-1MB/s. Seek times are for disks the equivalent of stopping and starting at intersections for cars in a city, that is periods of time in which useful activity is not performed, and which increase in frequency or duration when the load goes up.
I have been testing for a while a configuration for the Linux auditing subsystem (which seems to me awfully designed and implemented) to monitor changes in system files, and I was looking some time ago at the included rules examples and that left me somewhat amused. The examples mostly are about reporting modification or access for a few critical system files. That is somewhat weak: as per my testing one has to check also all the executables and libraries, because every modified executable can misbehave.
So my experiments have been about adding the major system library and executable directories to be monitored by the auditing subsystem, as a kind of continuous integrity checking, instead of the periodic one by checksum-based integrity checkers or periodic snapshot using filesystems like NILFS2 or Btrfs.
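As an illustration of that kind of rule set, the sketch below generates one watch rule per directory in the -w PATH -p PERMISSIONS -k KEY syntax accepted by auditctl and by audit rules files; the list of directories, the key name and the function name are assumptions of mine, not the configuration I actually use.

```shell
# Directories to watch: an assumed, typical list to adjust per system.
watched_dirs='/bin /sbin /usr/bin /usr/sbin /lib /usr/lib'

# Emit one watch rule per directory (audit_watch_rules is a name made
# up for this sketch).
audit_watch_rules() {
    for dir in $watched_dirs; do
        # w = writes, a = attribute changes; the key is an arbitrary
        # label used later to search the audit log.
        printf '%s\n' "-w $dir -p wa -k system-files"
    done
}

audit_watch_rules
```

The output can be saved as a rules file for the audit daemon, or each line can be passed as arguments to auditctl; matching events can then be listed with ausearch -k system-files.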
The major functionality of the audit subsystem is indeed to monitor file accesses, using in-kernel access to the inotify subsystem; but it can also monitor use of system entry points, which is unfortunately largely pointless, as there are very many system entry point calls on a running system, and the audit subsystem does not allow a fine grain of filtering as to which specific calls to a system entry point to monitor.
These are reasonable just-in-case measures, but they are not very effective against a well-funded adversary: every library or program in a system or the system hardware can be compromised at the source.
As to libraries and programs, given that the main technique
of major security services is to infiltrate other
organizations, it is hard to imagine that they would not get
affiliated engineers hired by major software companies, or
volunteer them to free software projects, so that they could
add to the code some backdoors
disguised as subtle bugs. It is hard to guess how many of the
security issues in software that get regularly discovered and
patched were genuine mistakes or carefully designed ones. But
I suspect that the most valuable backdoors are very well
disguised and hard to find, probably triggered only by obscure
combinations of behaviours.
Probably companies like Microsoft, Google, Redhat, Facebook, SAP, Apple, EA, etc., have had (unknowingly) for a long time dozens if not hundreds of engineers affiliated with the security services (or large criminal organizations) of most major countries (India, UK, China, Israel, Russia, ...). Indeed I would be outraged if the security services funded by my taxes were not doing that.
As to hardware probably there are also engineers affiliated with various security services (or large criminal organizations) in virtually all major hardware design companies, Intel, Apple, ABB, CISCO, etc., where they can also insert into the design of a CPU or another chip or the firmware of peripherals like disks or printers some backdoors cleverly disguised as design mistakes.
But, as shown by the files disclosed by Edward Snowden about the activities of the NSA in the USA, hardware can be compromised at the product level, not just at the component level: security agencies can afford the expense of intercepting shipping boxes and inserting in them surveillance devices or backdoored parts, or of entering premises hosting already installed products and doing the same, for example replacing USB cables with (rather expensive) identical looking ones containing radios.
Note: a former coworker mentioned
reading an article that showed how easy it is to put a USB
logger, that is a
keylogger, in a computer
mouse, or any other peripheral, and there are examples of far
more sophisticated and surprising techniques.
Fully auditing third party libraries and programs (whether proprietary or free software), component hardware designs, boxed hardware, and installed hardware is in practice impossible or too difficult, and anyhow very expensive as far as it can go. For serious security requirements the only choice is to make use only of hardware and software that has been developed entirely (down to the cabling, consider again USB cables with built in radios) by fully trusted parties, that is full auditing of the source; for example the government of China have funded the native development of a CPU based on the MIPS instruction set, and no doubt also of compilers, libraries, operating systems entirely natively developed. Probably most other major governments have done the same.
For system administrators in ordinary places the bad news is that they cannot afford to do anything like auditing at the source, and therefore every hardware and software component must be presumed compromised; the good news is that the systems administered are usually rather unlikely to be of such value as to attract the attention of major security services or to be regarded by them as deserving the risk and expense of making use of the more advanced backdoors, or of using NSA-style field teams to intercept hardware being delivered or to modify in place hardware already installed. Also probably for ordinary installations a degree of physical isolation is sufficient to make effective use of the less advanced backdoors too difficult in most cases.
Therefore for ordinary installations the Linux audit subsystem is moderately useful, together with other similar measures, as long as it does not give a feeling of security beyond its limits.
It is also useful for system troubleshooting and profiling, as it can give interesting information on the actual system usage of processes and applications, complementing the strace(1) and inotifywatch(1) user level tools and the SystemTap kernel subsystem.
Today in a discussion the topic of what is new and recent in systems came up. In general not much that is new has happened in systems design for a while, never mind recently. Things like IPv6, Linux, even flash SSDs feel new, but they are relatively old. Many other recent developments are not new, but rediscoveries of older designs that had gone out of practice as tradeoffs changed, and have come back into practice as they changed back.
After a bit of thinking I mentioned a distributed filesystem, Arvados Keep, because it contains a genuinely different design feature that needs some explaining.
As mentioned in numerous previous posts, designing scalable filesystems is hard, especially when they are distributed. Scaling data performance envelopes is relatively easy in some aspects, for example by using parallelism as in RAID to scale up throughput for mass data access; the difficulty is scaling up metadata operations, where metadata includes both file attributes and internal data structures. The difficulty arises from metadata being highly structured and interlinked, which means that mass metadata operations, like integrity auditing or indexing or backups, tend to happen sequentially.
Arvados Keep is a distributed filesystem which has two sharply distinct layers:
hash, like in the git.
The distinctive characteristic is that there is no metadata that lists which hash (that is, which segment) is on which server and storage device.
Since there is no metadata for segment location, and each
segment identifier uniquely identifies the content of
the segment, whole-metadata checks can be parallelized quite
easily: each storage server can enumerate in parallel all the
segments it has, and then check the integrity of the content
with the hash; at the same time the file naming layer does its
own integrity checks in parallel. Periodically a
garbage-collector looks at the file-naming
database, queries in parallel the storage servers, and
reconciles the lists of hashes for the files and those
available on the storage backends, deleting the segments that
are not referenced by any files.
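The audit and the garbage collection described above can be sketched in a few lines of Python. This is only an illustration under assumptions: SHA-256 as the segment hash, dictionaries standing in for storage servers and for the file-naming database, and the function names are all hypothetical, not taken from Arvados Keep.

```python
import hashlib

def segment_id(content: bytes) -> str:
    """Content address of a segment (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(content).hexdigest()

def audit_server(store: dict[str, bytes]) -> set[str]:
    """Identifiers on one server whose content no longer matches the hash.
    Each server can run this independently, hence in parallel."""
    return {sid for sid, content in store.items() if segment_id(content) != sid}

def unreferenced(files: dict[str, list[str]],
                 servers: list[dict[str, bytes]]) -> list[set[str]]:
    """For each server, the stored identifiers that no file references."""
    referenced = {sid for segments in files.values() for sid in segments}
    return [set(store) - referenced for store in servers]

# Two toy storage servers and a file-naming layer that references two segments.
a, b, c = b"alpha", b"beta", b"gamma"
servers = [{segment_id(a): a, segment_id(c): c}, {segment_id(b): b}]
files = {"/data/one": [segment_id(a), segment_id(b)]}

# Garbage collection: delete what is stored but not referenced.
garbage = unreferenced(files, servers)
for store, dead in zip(servers, garbage):
    for sid in dead:
        del store[sid]
```

Neither function needs a table saying which segment lives where: the audit only needs the local content, and the collector only needs two sets of hashes.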
That is quite new and relatively recent, and helps a lot with scalability, which has been a problem for quite a while.
The absence of metadata linking explicitly files to storage segments is made possible by the use of content addressing for the segments, and it is that which makes parallelism possible in metadata scans. It has however a downside: that locating a segment requires indeed content addressing via the segment identifier. That potentially means a scan of all storage servers and devices every time a segment needs to be located.
That could be improved in several ways, for example via a
hint database, or by using multicasting. The
current way used by Arvados Keep is to first
calculate a likely location based on the number of
servers, and check that first, and if the segment identifier
is not found on that server, to fall back to a linear scan.
That works surprisingly well, in part because segment identifiers are essentially random numbers, and the same calculation is of course used when creating a new segment and when looking it up. The calculation is also highly scalable: it takes the same time whether the segments are distributed across 10 or 10,000 servers.
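A minimal sketch of this kind of lookup follows. The text does not give the actual calculation, so the one used here (the identifier read as a number, modulo the server count) is an assumption, as are SHA-256 and all the names.

```python
import hashlib

def segment_id(content: bytes) -> str:
    # SHA-256 is an assumption; any hash with well-spread output would do.
    return hashlib.sha256(content).hexdigest()

def home_server(sid: str, n_servers: int) -> int:
    """Likely location: identifier as a number, modulo the server count.
    The cost does not depend on how many servers there are."""
    return int(sid, 16) % n_servers

def put(servers: list[dict], content: bytes) -> str:
    sid = segment_id(content)
    servers[home_server(sid, len(servers))][sid] = content
    return sid

def get(servers: list[dict], sid: str):
    """Return (content, probes): try the likely server, then scan the rest."""
    n = len(servers)
    start = home_server(sid, n)
    for probes in range(1, n + 1):
        store = servers[(start + probes - 1) % n]
        if sid in store:
            return store[sid], probes
    raise KeyError(sid)

servers = [dict() for _ in range(10)]
sid = put(servers, b"some segment content")
content, probes = get(servers, sid)
```

As long as the number of servers is the same at creation and at lookup, the first probe finds the segment.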
However it does not work that well in another sense of
scalable, where the number of servers is increased over
time: because the location of a segment is fixed by the
calculation based on the number of servers when it was
created, and it is searched for with the current number of
servers.
For a similar situation the Ceph
distributed object system uses the CRUSH
algorithm, which is a table-driven calculation where the table
is recomputed when the number of servers changes, but in a way
that preserves the location of segments created with a
different number of servers.
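How much a change in the number of servers hurts a purely calculated location can be illustrated with a small experiment. The modulo placement below is an assumption made for the illustration, not the actual Arvados Keep calculation, and it is deliberately not a location-preserving algorithm of the kind just described.

```python
import hashlib

def home_server(sid: str, n_servers: int) -> int:
    # Assumed placement rule: identifier as a number, modulo the server count.
    return int(sid, 16) % n_servers

# 10,000 made-up segment identifiers, created when there were 10 servers.
ids = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(10_000)]

# After growing to 11 servers, count how many lookups start at the wrong server.
moved = sum(home_server(s, 10) != home_server(s, 11) for s in ids)
fraction_moved = moved / len(ids)
print(f"{fraction_moved:.0%} of segments are not on their newly computed server")
```

With this rule an identifier keeps its computed server only when it gives the same remainder for both counts, which happens for about 1 in 11 identifiers, so nearly all lookups of older segments fall back to the scan.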
This probably could be added to Arvados Keep too, but currently it is not particularly necessary, in part because usually when expansion is done it happens in large increments. However in extreme cases it can result in file opening times of hundreds of milliseconds, as many servers need to be probed to find the segments that make up the file. Arvados Keep is targeted at storage of pretty large files, typically of several gigabytes, and therefore it has a large segment size of 64MiB (Ceph uses 4MiB) which minimizes the number of segments per file, and also means that the cost of opening a file is amortized over a large file.
The other cost is that since there is no explicit tracking of whether a given segment is in use or not, a periodic garbage collection needs to be performed, but that is needed regardless as an integrity audit, for any type of filesystem design, and it is easy to parallelize.
Overall the design works well for the intended application area, and the fairly unusual and novel decision to use content addressing without any explicit metadata structure cross referencing files and storage segments provides (at some tolerable cost) metadata scalability.
Today on the
channel a user who mentioned building his
computer out of old parts asked about some corruption reported
by Btrfs checksumming, which he wanted to recover from. Some
of the more experienced people present then noticed that it
was a classic
bitflip (one-bit error),
inside a filesystem metadata block, and added:
[17:55] kdave | pipe: the key sequence is ordered, anything that looks odd is likely a bitflip
[17:55] darkling | Look at that one and the ones above and below, and you'll usually see something that fairly obviously doesn't fit.
[17:55] kdave | I even have a script that autmates that
[17:56] darkling | Convert the numbers to hex and sit back in smug delight at the amazement of onlookers. ;)
[17:56] darkling | There's even a btrfs check patch that fixes them when it can...
[17:56] kdave | we've done the bitflip hunt too many times
[17:56] darkling | It's my superpower.
[17:56] darkling | (Well, that and working out usable space in my head given a disk configuration)
The point being that bitflips are rather more common in their experience as Btrfs users and developers, than in the experience of most users, and in the present case it was likely due to unreliable memory chips.
Obviously bitflips are not caused by Btrfs, so their experience is down to Btrfs detecting bitflips with checksums, which makes apparent those bitflips that otherwise would not be noticed.
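The bitflip hunt described in the chat can be sketched as follows. This is a hypothetical illustration of the idea (keys are ordered, so the odd one out is suspect, and flipping a single bit should make it fit again); it is not the script or the btrfs check patch mentioned above, and the key values are made up.

```python
def find_bitflip(keys: list[int], width: int = 64) -> list[tuple[int, int]]:
    """Return (index, repaired_value) candidates for an ordered key list
    in which one key has had a single bit flipped."""
    candidates = []
    for i, key in enumerate(keys):
        lo = keys[i - 1] if i > 0 else -1
        hi = keys[i + 1] if i + 1 < len(keys) else 1 << width
        if lo < key < hi:
            continue              # this key fits between its neighbours
        if not lo < hi:
            continue              # a neighbour is the odd one, not this key
        for bit in range(width):
            fixed = key ^ (1 << bit)
            if lo < fixed < hi:   # one flipped bit restores the ordering
                candidates.append((i, fixed))
    return candidates

keys = [256, 257, 258, 259, 260, 261]
keys[3] ^= 1 << 20                # simulate a one-bit memory error
print([(i, hex(v)) for i, v in find_bitflip(keys)])
```

Printing the values in hexadecimal makes the damaged key stand out, as it differs from its neighbours in a single hex digit far from the low-order end.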
That detection of otherwise unnoticed bitflips brings with it the obvious advantage, but also a subtle disadvantage, which is the reason why it is often recommended to use ECC memory when using ZFS which also does checksumming. This seems counterintuitive: if filesystem software checksums can be used to detect bitflips, hardware checksums in the form of ECC may seem less necessary.
The subtle disadvantage is that as a rule checksums are not
applied by Btrfs or ZFS to individual fields, but to whole
blocks, and that on detecting a bitflip the whole block is
marked as unreliable, and a block can contain a lot of fields:
for example a block that contains internal metadata, such as a
directory or a set of
Note: in a sense detecting a bitflip in a block via a checksum results in a form of damage amplification, as every bit in the block must be presumed damaged.
In many cases an undetected bitflip happens in parts of a block that are not used, or are not important, and therefore results in no damage or unimportant damage. For example as to filesystem metadata, a bitflip in a block pointer matters a lot, but one in a timestamp field matters a lot less.
ECC memory not only detects but corrects most memory bitflips, avoiding in many cases the writing to disk of blocks that don't match their checksums, or cases of checksum failure because of a memory bitflip after reading a block from disk. This prevents a number of cases of loss of whole blocks of filesystem internal metadata (as well as data).
This is obviously related to another Btrfs aspect, that by default its internal metadata blocks on disk are duplicated even on single disk filesystems. Obviously duplicating metadata on a single disk does not protect against failure of the disk, unlike a RAID1 arrangement (but it may protect against a localized failure of the disk medium though).
But the real purpose of duplicating metadata blocks by default is to recover from detected bitflips: when checksum verification fails on a metadata block its duplicate can be used instead (if it checksums correctly, and usually it will). This minimizes the consequences of most cases of non-ECC memory bitflips.
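Both effects, the loss of a whole block to a single flipped bit and its recovery from a duplicate, can be shown with a toy model. It is a sketch under assumptions: CRC-32 from zlib stands in for the filesystem checksum, the block size is arbitrary, and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096

def write_block(payload: bytes):
    """Store a block together with the checksum of the whole block."""
    return bytearray(payload), zlib.crc32(payload)

def read_block(copies):
    """Return the first copy whose checksum verifies, else None."""
    for data, checksum in copies:
        if zlib.crc32(bytes(data)) == checksum:
            return bytes(data)
    return None

payload = bytes(range(256)) * (BLOCK_SIZE // 256)
primary, duplicate = write_block(payload), write_block(payload)

# One flipped bit somewhere in the block: every field in it becomes suspect.
primary[0][1234] ^= 0x08

single = read_block([primary])                # no duplicate: block is lost
recovered = read_block([primary, duplicate])  # duplicate checksums correctly
```

The checksum says only that something in the 4096 bytes changed, not which field, so without the duplicate the reader has to give up on the whole block.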
Obviously duplicating metadata on the same disk is expensive in terms of additional IO when updating it. Therefore it might be desirable in some cases to have more frequent backups, ECC memory, and turn off the default duplication of metadata blocks in Btrfs. So the summary is:
I have been using dual-layer Blu-Ray discs (50GB nominal capacity) for offline backups, both write-once (BD-R DL) and read-write (BD-RE DL), and recently I wanted to reformat one of the latter (to remap some defective sectors), and found that dvd+rw-format cannot do it, as the Blu-Ray drive rejects the command with:
FORMAT UNIT failed with SK=5h/INVALID FIELD IN PARAMETER LIST
After some web searching I discovered that
BD-RE discs must be deformatted
with xorriso -blank deformat before a BD drive will
accept a command to format them. In this they are slightly
different in behaviour from both DVD-+RW and CD-RW discs.
While looking for related material I chanced upon a message describing a typical system configured for local cold-storage using Ceph instead of remote third party based services discussed recently:
6 x Dell PowerEdge R730XD & MD1400 Shelves
- 2x Intel(R) Xeon(R) CPU E5-2650
- 128GB RAM
- 2x 600GB SAS (OS - RAID1)
- 2x 200GB SSD (PERC H730)
- 14x 6TB NL-SAS (PERC H730)
- 12x 4TB NL-SAS (PERC H830 - MD1400)
The use of 4TB-6TB NL drives is typical of bulk-data cold-storage (or archival) systems with a very low expected degree of parallelism in the workload, mostly a few uploading threads and a few more threads to download to warm-storage (or cold-storage if used in archive-storage).
There are also a couple of
capacity of 200GB suggests this) flash SSDs, very likely for
Ceph journaling, and two 600GB SAS disks for the system
Such a system could be used for cold-storage or archival-storage, but to me it seems more suitable for archival, as it has a lot of drives and most of them are as big as 6TB. But it could be suitable for cold-storage as long as upload and download concurrency is limited.
Note: The system configuration seems to
be inspired by the
throughput category on pages 13 and
The total raw data capacity is 120TB, and that would
translate to a logical capacity of around 40TB using default
3-way Ceph replication for cold-storage, or of 80TB with a 12+4
erasure code set (BackBlaze B2 uses 17+3
sets, but that seems a bit optimistic to me).
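The capacity figures above can be roughly cross-checked from the drive list. The sketch assumes that the 120TB figure is in binary units, and it computes ceilings before any filesystem overhead or free-space headroom; both points are assumptions of the sketch, not statements from the message quoted.

```python
TB, TiB = 10**12, 2**40

raw_bytes = (14 * 6 + 12 * 4) * TB   # 14x 6TB plus 12x 4TB bulk-data disks
raw_tib = raw_bytes / TiB

def usable(raw: float, data_parts: int, redundancy_parts: int) -> float:
    """Logical capacity when each group stores data_parts out of the total."""
    return raw * data_parts / (data_parts + redundancy_parts)

replicated = usable(raw_tib, 1, 2)      # 3-way replication: 1 copy + 2 more
erasure_12_4 = usable(raw_tib, 12, 4)   # 12 data + 4 redundancy segments
erasure_17_3 = usable(raw_tib, 17, 3)   # the BackBlaze-style set, for comparison

print(f"raw {raw_tib:.0f}, 3-way {replicated:.0f}, "
      f"12+4 {erasure_12_4:.0f}, 17+3 {erasure_17_3:.0f}")
```

The 12+4 ceiling comes out somewhat higher than the figure quoted, which is consistent with leaving some headroom; the 17+3 set gains more capacity by keeping proportionally less redundancy.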
Note: as to speed, with 26 bulk-data disks each capable of perhaps 20-40MB/s of IO over 2-4 threads, and guessing wildly, aggregate rates might be: for cold-storage 200-400MB/s for pure writing over 3-6 processes and 300-800MB/s (depending on how many threads) for pure reading over 5-10 processes; for archival storage it might be half that for writing (because of likely read-modify-write using huge erasure code blocks and waiting for all of them to commit) over 2-4 processes, and 400-800MB/s for pure reading.
Note: To coarsely double check my guess of achievable rates I have quickly set up an MD RAID10 set of 6 mid-disk partitions, with replication of 3, on 1TB or 2TB contemporary SATA drives, and done with fio randomish IO in 1MiB blocks:
soft# fio blocks-randomish.fio >| /tmp/blocks-randrw-1Mi-24t.txt
soft# tail -13 /tmp/blocks-randrw-1Mi-24t.txt
Run status group 0 (all jobs):
   READ: io=1529.0MB, aggrb=51349KB/s, minb=1886KB/s, maxb=2418KB/s, mint=30191msec, maxt=30491msec
  WRITE: io=1540.0MB, aggrb=51718KB/s, minb=1920KB/s, maxb=2317KB/s, mint=30191msec, maxt=30491msec
Disk stats (read/write):
    md127: ios=24096/23557, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4310/12770, aggrmerge=210/484, aggrticks=114720/152362, aggrin_queue=267136, aggrutil=94.43%
  sdb: ios=5315/12139, merge=570/1118, ticks=291976/493200, in_queue=785520, util=94.43%
  sdc: ios=4547/12434, merge=456/840, ticks=68920/48540, in_queue=117456, util=35.66%
  sdd: ios=2438/12447, merge=238/828, ticks=42272/38664, in_queue=80936, util=24.40%
  sde: ios=7084/13200, merge=0/40, ticks=153356/143492, in_queue=296840, util=54.44%
  sdf: ios=4252/13204, merge=0/38, ticks=85368/96032, in_queue=181396, util=41.64%
  sdg: ios=2227/13201, merge=0/41, ticks=46428/94248, in_queue=140672, util=33.84%
soft# cat blocks-randomish.fio
# vim:set ft=ini:
[global]
kb_base=1024
fallocate=keep
directory=/mnt/md40
filename=FIO-TEST
size=100G
ioengine=libaio
io_submit_mode=offload
runtime=30
iodepth=1
numjobs=24
thread
blocksize=1M
buffered=1
fsync=100
[rand-mixed]
rw=randrw
stonewall
Note: The setup is a bit biased to optimism as it is over a narrow 100GiB slice of the disk, and IO is entirely local and not over the network, and it is for block IO within a file, therefore with no file creation/deletion metadata overhead. The outcome is a total of 100MiB/s, or around 17MiB/s per disk. With pure reading (that is without writing the same data three times) the aggregate goes up to 160MiB/s, or a bit over 26MiB/s per disk, and with 16MiB blocks the aggregate transfer rate goes up to 170MB/s, or around 28MiB/s per disk. I have watched the output of iostat -dkzyx 1 while fio was running and it reports the same numbers. For another double check, CERN have seen over 50GB/s of usable aggregate transfer rate by 150 clients over 150 servers (each with 200TB of raw storage as 48×4TB disks), or around 350MB/s per server.
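The total and per-disk figures for the mixed test can be derived from the two summary lines of the fio output. The following is a small sketch of that arithmetic; the regular expression and variable names are just for illustration.

```python
import re

summary = """\
   READ: io=1529.0MB, aggrb=51349KB/s, minb=1886KB/s, maxb=2418KB/s
  WRITE: io=1540.0MB, aggrb=51718KB/s, minb=1920KB/s, maxb=2317KB/s
"""

# The job file sets kb_base=1024, so "KB" in this output means KiB.
rates_kib = [int(m) for m in re.findall(r"aggrb=(\d+)KB/s", summary)]

total_mib = sum(rates_kib) / 1024   # read plus write aggregate
per_disk_mib = total_mib / 6        # six member disks in the MD set

print(f"total {total_mib:.1f}MiB/s, per disk {per_disk_mib:.1f}MiB/s")
```

This counts only the logical IO issued by fio; the disks themselves write more than that because every block is stored three times.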
As to comparative pricing, the same one-system capacities on Amazon's S3 for about 5 years have a base price currently of $30,000 for 40TB in S3 IA and of $33,600 for 80TB in Glacier, plus traffic and access charges; these are difficult to estimate, but I would add $6,500 to the S3 IA cost and $9,000 to the S3 Glacier cost.
Note: $6,500 is around the price of accessing around 20TB (of the 40TB) per year from another region (within region accesses are free) and reading back 10TB per year from S3 IA; $9,000 is around the price of restoring 1TB of the 80TB per year in 2 different months over 4 hours each time (10 downloads of 500GB each at 125GB/hour).
Rounding up a bit (10%) to account for other charges that apply may give around $40,000 over 5 years for 40TB of S3 IA, or $47,000 over 5 years for 80TB of Glacier.
The total cost of ownership of local system hardware plus a minimum of system administration would have to compete with that. The purchase price of one such similar system is around $14,000 with similar or better SSDs and disks plus around $5,500 for a 12×4TB disks expansion unit (prices include 5 years NBD support but not tax), for a total of around $20,000 (the pricing I looked at is for SATA disks, but SAS NL disks are not much more expensive); to this one should add data center charges (for space, power and connectivity), and my rough estimate for that is around $12,000 over 5 years for colocation per system.
So the total cost of ownership for one of the cold-storage or archival servers mentioned at the beginning would be around $32,000 over 5 years, compared with $40,000 for 40TB of S3 IA, or $47,000 for 80TB of Glacier (both before tax). That $8,000 or $15,000 difference can pay for a lot of custom system administration, especially given that this is largely a setup-and-forget server. Considering that many sites have rack space and networking already in place the cash difference can be bigger.
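The comparison can be replayed with the unrounded figures quoted above. This is only the same arithmetic written out; the function name is hypothetical.

```python
def five_year_cloud(base: int, access: int, other_charges: float = 0.10) -> float:
    """Base price plus estimated access charges, plus 10% for other charges."""
    return (base + access) * (1 + other_charges)

s3_ia = five_year_cloud(30_000, 6_500)     # 40TB in S3 IA
glacier = five_year_cloud(33_600, 9_000)   # 80TB in Glacier

# Local system: server, disk expansion unit, 5 years of colocation.
local = 14_000 + 5_500 + 12_000

print(f"S3 IA   ${s3_ia:,.0f}  (difference ${s3_ia - local:,.0f})")
print(f"Glacier ${glacier:,.0f}  (difference ${glacier - local:,.0f})")
print(f"Local   ${local:,.0f}")
```

Before rounding the differences come out a little above the rounded $8,000 and $15,000, so the rounding does not flatter the local option.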
The main advantage of Glacier of course is that it is guaranteed offsite, but then colocation can do that too, even if it is harder to organize.
Overall for a 5-year period my impression is that local or colocated cold-storage (or archive-storage) is less expensive than remote third-party storage, so for those who don't need a worldwide CDN that is still the way to go.
I have been using the ZOTAC CI323 mini-PC for a couple of weeks. I bought it (from Deals4Geeks for around £150 including VAT, without memory and disk) to check a bit the state of the art with low power, small size servers; mini-PCs are in general based on laptop components, and they may be considered as laptops without keyboard and screen. I have previously argued that indeed laptops make good low power, small size servers and I have been curious to check out the alternative.
Since my terms of comparison are laptops, which usually have few ports, I have decided to get a mini-PC that has better ports than a laptop, to compensate for the lack of builtin screen and keyboard, and the CI323 comes with an excellent set of ports, notably:
Compared to many other mini-PCs it also has some other notable core features:
The PCI device list and the CPU profile are:
# lspci
00:00.0 Host bridge: Intel Corporation Device 2280 (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Device 22b1 (rev 21)
00:10.0 SD Host controller: Intel Corporation Device 2294 (rev 21)
00:13.0 SATA controller: Intel Corporation Device 22a3 (rev 21)
00:14.0 USB controller: Intel Corporation Device 22b5 (rev 21)
00:1a.0 Encryption controller: Intel Corporation Device 2298 (rev 21)
00:1b.0 Audio device: Intel Corporation Device 2284 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 22c8 (rev 21)
00:1c.1 PCI bridge: Intel Corporation Device 22ca (rev 21)
00:1c.2 PCI bridge: Intel Corporation Device 22cc (rev 21)
00:1c.3 PCI bridge: Intel Corporation Device 22ce (rev 21)
00:1f.0 ISA bridge: Intel Corporation Device 229c (rev 21)
00:1f.3 SMBus: Intel Corporation Device 2292 (rev 21)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
04:00.0 Network controller: Intel Corporation Wireless 3160 (rev 83)
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 76
Stepping: 3
CPU MHz: 479.937
BogoMIPS: 3199.82
Virtualisation: VT-x
L1d cache: 24K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-3
# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 480 MHz - 2.08 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 480 MHz and 2.08 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 480 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
Given all this, how good is it? Well, so far it seems really pretty good. The things I like less are very few and almost irrelevant:
The things I particularly like about it are:
So I have decided, especially because of the very low power consumption, to use it as the shared-services server of my 2 home desktops and laptop. I was previously using for that one of the desktops that mainly is for archive/backups, and so has four disks, and draws a lot more power. I can now put that desktop mostly to sleep, and keep the CI323 always on.
So far I think it is a very good product, with a very nice balance of useful features and a good price, and as a low power server it is impressive.
Testers of this and similar models
have looked at it as a desktop PC, or media PC or even as a
gaming console, and they seem pretty decent for all of
those uses; the GPU is good enough for not the most recent
games, and the overall
seems both good and isotropic.
A somewhat fascinating story revolves around
cold-storage and archival-storage.
They are two storage types that are designed to be used in a
rather atypical way:
Note: recent backups should go to cold-storage, not archival-storage (1, 2).
Decades ago hot-storage used to be on magnetic drums, warm-storage on magnetic disks, and cold-storage and archival-storage were both on tape.
Currently hot-storage is on flash SSD or low-latency and low-capacity magnetic disk, warm-storage on smaller capacity (300GB-1TB) magnetic disk, cold-storage on larger capacity (2TB and larger) low IOPS-per-TB magnetic disks, and archival-storage on tape cartridges inside large automated cartridge shelves.
Cold-storage is meant to be read infrequently and only for
staging to warm- or hot-storage, and
archival-storage is meant to be almost never read back, to be
write-only in most cases.
The typical cold-storage hardware is currently
low IOPS-per-TB disks in the 4TB-6TB-8TB range,
arranged in 2-3 way replication or, at the boundary between
cold- and archival-storage, using erasure
code groups that have a very high cost for small writes,
or for any reads when some members of the group are unavailable,
but have better capacity utilization at the cost of only a
small loss of resilience.
What has happened in the recent past is that
cloud storage companies have been offering
all types of storage, and in particular for cold and backup
storage they have offered relatively low cost per TB used.
That is remarkable because usually cloud shared computing capacity is priced rather higher (1, 2, 3) than the cost of dedicated computing capacity, despite the myth being otherwise. The high cost of cloud computing is only worth it for those who require its special features, mostly that it effectively comes with a built-in CDN that otherwise would have to be built or rented from a CDN business.
The most notable examples are Amazon Glacier and BackBlaze B2 and they are very different.
They have rather dissimilar pricing for capacity: between $80 and $140 per year per TB for Glacier, and $60 per TB per year for B2.
The cost of S3 capacity is 10 times higher than that of Glacier; there are other services like rsync.net that cost about the same as S3, and Google Nearline that costs about the same as Glacier.
The main differences are that Glacier has very strange pricing for reading, a highly structured CDN, and Amazon keeps the details of how it is implemented a trade secret, and B2 has a simple pricing structure, its implementation is well known, and has a much simpler CDN. Starting with B2, the implementation is well documented (1, 2, 3, 4):
- 4TB hard drives => 3.6 petabytes/vault (Deploying today.)
- 6TB hard drives => 5.4 petabytes/vault (Currently testing.)
- 8TB hard drives => 7.2 petabytes/vault (Small-scale testing.)
- 10TB hard drives => 9.0 petabytes/vault (Announced by WD & Seagate.)
In other words, this is just a cost optimized version of a conventional cold-storage system. As to this BackBlaze have published several interesting insights, notably:
As to Glacier instead there is the mystery of the trade secret implementation: the Amazon warm-storage service costs 10 times as much, and it is known that it is a cost optimized version of a conventional warm-storage system, based on ordinary lower capacity disk drives. Potentially some clue is in the strange pricing structure (1, 2, 3, 4), for reading back data from Glacier:
Therefore if one wants to read 100GiB out of 500GiB it is 10 times cheaper to request 10GiB every 4 hours than request 100GiB at once, as the first incurs a read charge of 10GiB for the whole month, and the latter one of 100GiB for the whole month.
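The 10 times figure follows from charging the largest single request as if it lasted the whole month. The sketch below models only that one rule, as inferred from the paragraph above; the rate is a hypothetical unit price, and the real price list has further terms that are not modelled.

```python
def monthly_read_charge(requests_gib: list[float], rate_per_gib: float) -> float:
    """Charge the peak request of the month, whatever the total read was."""
    return max(requests_gib) * rate_per_gib

RATE = 1.0   # hypothetical price unit per GiB of peak retrieval

at_once = monthly_read_charge([100], RATE)      # 100GiB in one request
spread = monthly_read_charge([10] * 10, RATE)   # 10GiB every 4 hours, 10 times

ratio = at_once / spread
print(f"reading at once costs {ratio:.0f} times as much")
```

The total read is 100GiB in both cases; only the peak differs, which is why spreading requests over 40 hours is so much cheaper.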
Note: there are other pricing details like a relatively high per-object fee.
This pricing structure suggests that Glacier relies on a two tier storage implementation, and the cost of reading is the cost of copying the data between the two tiers:
My guess is that the staging first tier is just the warm-storage Amazon S3. There has been much speculation (1, 2, 3, 4, etc.) as to what the bulk second tier is implemented with, and the two leading contenders are:
surveillance class disks tend to run at 5900RPM, and are targeted at the same usage profile as cold- and archival-storage: recording of video, which is very rarely read back.
The Blu-Ray hypothesis has recently got a boost as an Amazon executive involved in the Glacier project has commented upon the official release of a Blu-Ray cold-storage system, Sony's Everspan, and he writes:
But, leveraging an existing disk design will not produce a winning product for archival storage. Using disk would require a much larger, slower, more power efficient and less expensive hardware platform. It really would need to be different from current generation disk drives.
Designing a new platform for the archive market just doesn’t feel comfortable for disk manufacturers and, as a consequence, although the hard drive industry could have easily won the archive market, they have left most of the market for other storage technologies.
This is quite interesting, because it seems a strong hint that Glacier does not use magnetic disks for its bulk tier; at the same time it is ironic because BackBlaze has created a competitive product (for those that don't need a built-in CDN with multi-region resilience) based on commodity ordinary cold-storage 4TB-6TB-8TB disks.