Computing notes 2017 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

170513 Sat: Preventing Linux partition scanning

Some user on IRC reported a very inconvenient situation: for some reason a disk to be accessed from Linux had a malformed partition table that crashed the kernel code that attempted to parse it. It would have been very easy to clear the relevant area of the disk, but that could not be done because by default Linux attempts to read the partition table as soon as a disk is connected.

This was obviously one of the many cases of violation of sensible practices in Linux design due to an excessive desire for automagic, and I could relate to it as I had to maintain a system which used unpartitioned disks and printed on every boot a lot of spurious error messages about malformed partition tables (without crashing at least).

It is easy to verify that in the Linux kernel the rescan_partitions function is invoked unconditionally by the __blkdev_get function, and therefore it is impossible to avoid an attempt to find and parse a partition table. Ten years ago a simple patch was proposed to avoid scanning for a partition table at boot time, but it was not adopted.

So I thought about a workaround that ought to work: to tell the kernel that a specific block device has a given partition table by using the kernel boot parameter blkdevparts. Since the code in rescan_partitions stops looking for a volume label when the first match is found, this ought to work. In the check_part table, which is organized alphabetically, the cmdline_partition entry does follow several checks for some variants of the Acorn volume label type, but that should mostly not be a problem.
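As a minimal sketch of what such a parameter could look like, the helper below builds a blkdevparts string following the syntax in the kernel's command-line partition documentation as I understand it (device:size(name),... with a size of "-" meaning all remaining space). The device name sdb and the partition name whole are purely hypothetical, the helper itself is not part of any kernel tooling, and the kernel must have been built with command-line partition support for the parameter to have any effect.

```python
def blkdevparts(devices):
    """Build a blkdevparts= kernel parameter.

    devices maps a block device name (without /dev/) to a list of
    (size, name) pairs; a size of "-" means "all remaining space".
    """
    defs = []
    for dev, parts in devices.items():
        partdefs = ",".join(f"{size}({name})" for size, name in parts)
        defs.append(f"{dev}:{partdefs}")
    return "blkdevparts=" + ";".join(defs)


# Hypothetical example: declare the whole of sdb as one partition, so that
# the command-line parser matches before any on-disk table is trusted.
print(blkdevparts({"sdb": [("-", "whole")]}))
```

The resulting string would be appended to the kernel command line in the boot loader configuration.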

The overall issue is that because of the ever deepening microsoftization of GNU/Linux culture improper designs are made to be as automagic as possible: in this case the impropriety is that the kernel autodiscovers all possible devices, and then autoactivates them, including partition scanning, and then upcalls the appallingly misdesigned udev subsystem to ensure devices are mounted or started as soon as possible.

The proper UNIX-style logic would be for the default to be inverted, that is for the kernel to initiate discovery of devices only on user request, and then to activate discovered devices again only on user request, and to mount or start them again only on user request, with the ability for the system startup scripts to assume those user requests on boot, and for the user to override that default. But then the Linux kernel designers have even forgotten to specify an abstract device state machine, so it does not seem surprising that an improper activation logic is used.
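To make the idea of an abstract device state machine concrete, here is a small sketch of the inverted default described above. The state and request names are my own invention for illustration, not anything that exists in the kernel: the point is only that each step happens on an explicit request, and that skipping a step is an error rather than something done automagically.

```python
from enum import Enum


class DeviceState(Enum):
    UNKNOWN = "unknown"        # not yet looked for
    DISCOVERED = "discovered"  # found, but not touched
    ACTIVE = "active"          # activated (e.g. partition table read)
    IN_USE = "in use"          # mounted or started


# (request, current state) -> next state; every request moves one step only.
TRANSITIONS = {
    ("discover", DeviceState.UNKNOWN): DeviceState.DISCOVERED,
    ("activate", DeviceState.DISCOVERED): DeviceState.ACTIVE,
    ("start", DeviceState.ACTIVE): DeviceState.IN_USE,
    ("stop", DeviceState.IN_USE): DeviceState.ACTIVE,
    ("deactivate", DeviceState.ACTIVE): DeviceState.DISCOVERED,
    ("forget", DeviceState.DISCOVERED): DeviceState.UNKNOWN,
}


def request(state, action):
    """Apply one explicit user (or startup script) request to a device."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a device that is {state.value}") from None


# A startup script would simply replay the requests it assumes on boot.
state = DeviceState.UNKNOWN
for action in ("discover", "activate", "start"):
    state = request(state, action)
print(state)
```

With such a scheme a disk with a malformed partition table could be discovered and cleared without ever being activated.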

170506 Sat: Web pages and laptop heat

So I have previously remarked that many web pages keep the CPU busy and that laptops and handhelds are limited by heat, and now I too am a double victim: in the past couple of weeks, as I have been looking at news and forum sites like The Guardian, my laptop (which I have been using as my main system) has shut down abruptly a few times because of overheating, caused entirely by CPU load from free-running JavaScript in web pages.

This is a graph (using psensor) of the effect on CPU usage and temperature of closing a tab containing a JavaScript-heavy Disqus forum thread:

This is in part due to having a somewhat older CPU chip that does not throttle its clock when it gets hotter, but that in a sense would be cheating: a nominally fast CPU that almost transparently becomes slow when it runs at its rated speed.

There is no workaround: like many I use browser extensions that disable JavaScript code (for example NoScript), but several sites insist on dynamic content generation or update via JavaScript, and once a site is given free access to a browser's JavaScript virtual machine it can take as much advantage as possible. Like many I think that executable content is a very bad idea for several reasons, but it is here to stay for a while. Let's hope that enough handheld users scald their hands or laps or suffer from extreme slowness because of executable content sites that they become less popular.

170427 Thu: NFS, GNOME and KDE issues with auto-mounting

I have long used auto-mounters like am-utils (once known as AMD) and recently the Linux-native autofs5 as a more UNIX-style alternative to complex and insecure schemes modelled on MS-Windows, like udev-based mounting on attachment of a network device, as mounting on attempted access is both safer and results in shorter mounting periods. Because of this I have written a script for autofs5 that turns /etc/fstab into an auto-mount map.
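To illustrate the general idea of deriving an auto-mount map from /etc/fstab, here is a rough sketch (it is not the script mentioned above). It assumes the usual autofs indirect map entry form of key, then -options, then location, with local devices written as :device and NFS locations as server:/path; using the last component of the mount point as the key, and skipping the root filesystem and swap, are my own simplifying assumptions.

```python
import posixpath


def fstab_to_automap(fstab_text):
    """Turn fstab lines into autofs-style map entries (a sketch only)."""
    entries = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        spec, mountpoint, fstype, options = fields[:4]
        if fstype == "swap" or mountpoint == "/":
            continue
        # Assumption: the map key is the last component of the mount point.
        key = posixpath.basename(mountpoint.rstrip("/"))
        # Drop options that only make sense for boot-time mounting.
        opts = [o for o in options.split(",")
                if o not in ("defaults", "auto", "noauto")]
        location = spec if fstype.startswith("nfs") else ":" + spec
        entries.append(f"{key} -" + ",".join([f"fstype={fstype}"] + opts)
                       + f" {location}")
    return entries


sample = """\
# hypothetical fstab
/dev/sdb1       /fs/data  xfs   noauto,noatime  0 0
server:/export  /fs/home  nfs4  defaults        0 0
"""
for entry in fstab_to_automap(sample):
    print(entry)
```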

Since I recently upgraded my NFS server configuration I also decided to look at the somewhat related matter of my auto-mounting situation, which has had two slightly annoying problems:

The motivation in both cases is that I reckon it is a good practice to leave filetrees unmounted as much as possible, and mount them only for the period where they are needed, because:

My investigation about the NFS server Ganesha and auto-unmounting shows that Ganesha opens an exported filetree's top directory on startup and keeps it open, and this means the filetree stays mounted all the time instead of being mounted only when accessed by clients:

#  lsof -a -c ganesha.nfsd /fs/*/.
ganesha.n 27646 root    5r   DIR   0,43      918  256 /mnt/export1/.
ganesha.n 27646 root    6r   DIR   0,47       84  256 /mnt/export2/.
ganesha.n 27646 root    7r   DIR    8,8    16384    5 /mnt/export3/.
ganesha.n 27646 root    8r   DIR   8,38     4096    5 /mnt/export5/.
ganesha.n 27646 root    9r   DIR   0,51      106  256 /mnt/export6/.
ganesha.n 27646 root   10r   DIR   0,55      156  256 /mnt/export7/.

The problem with both GNOME and KDE is that their open-file menus and their file managers are not designed to be UNIX-like, where there is a single tree of directories with user-invisible mount points, but they are designed to resemble MS-Windows and MacOS X in having a main filetree and a list of volumes with independent and separate filetrees. The other systems do this because on those systems by default when a mountable volume is attached to the system it is mounted statically until it is detached, and they don't have a UNIX-like mount-on-access behaviour.

Because of this they open the system table of mountpoints, read the list of those mountpoints, and access them periodically to get their status, and this of course triggers constant automounting. This has caused a number of complaints (1, 2, 3, 4), and fortunately these have resulted in some ready-made solutions for GNOME, and similar ones for KDE:

GNOME solution

The problem is that GNOME uses a filesystem layer and daemons called GVFS that by default mounts all mount-points listed in system configuration files. This can be disabled in three ways:

  • Per user with:
    gsettings set \
      automount false
    gsettings set \
      automount-open false
  • Per mount-point by adding to the relevant /etc/fstab line the flag x-gvfs-hide.
  • For the whole system change the default by creating a file under /etc/dconf/db/local.d/ containing:
    and then running dconf update.
KDE solution

KDE uses the kio system, but that is not really involved; instead there is a per-user automounter daemon, and some common plugins mount by default any available mount-points. To disable all these for a user:

  • In the KDE Runner settings ensure that the Devices plugin is disabled.
  • In the KDE Dolphin Places panel mark as hidden all mountpoints.
  • Run kcmshell4 device_automounter_kcm and enable Only automatically mount removable media that has been manually mounted before and disable the other two options, and Forget Device those listed under Device Overrides (not really sure about this).
  • Run kcmshell4 kcmkded and disable Removable devices automounter.

There is a patch to disable mount-point auto-mounting entirely (but only for those of type autofs) but it is not in the KDE version I am using.

170424 Mon: Btrfs and NFS service and NFS daemon Ganesha

So I am using Btrfs, despite it having some design and implementation flaws in its more advanced functionality, because it is fairly reliable as a single device filesystem with checksums and snapshots.

Because of its somewhat unusual redirect-on-write (a variant of COW) design and its many features it has somewhat surprising corner cases, and one of them is that direct IO writes do not update the data checksums (and direct IO reads do not verify them). Unfortunately the Linux nfs-kernel-server uses a kernel-internal version of direct IO and has the same issue with writing (it is I think safe to export Btrfs filetrees in read-only mode via nfs-kernel-server).

The obvious solution is to use the NFS daemon server Ganesha which runs like the SMB/CIFS daemon server Samba as a user process using ordinary IO, like many server daemons for other local and distributed filesystems. A kernel based server is going to have lower overheads, but for a file server the biggest costs are in storage and in network transfers, and running in user mode is comparatively a very modest cost.

Ganesha is something that I kept looking at, and years ago it was not quite finished and awkward to install and poorly documented. It is currently instead fairly polished, it is a standard package for Ubuntu, and newer versions are easily obtained via a fairly well maintained Ubuntu PPA.

It is still not very well documented, and in particular the existing example configurations are overly simplistic, as well as with some default paths not quite right for my Ubuntu 14 system, so here is my own example:

# /usr/share/doc/nfs-ganesha-doc/config_samples/config.txt.gz
# /usr/share/doc/nfs-ganesha-doc/config_samples/export.txt.gz


  Nb_Worker		=32;

  # Compare the port numbers with those in 
  # /etc/default/nfs-kernel-server (MNT_Port)
  # /etc/default/quota (Rquota_port)
  # /etc/modprobe.d/local.conf (NLM_Port)
  #   options lockd nlm_udpport=893 nlm_tcpport=893
  # /etc/sysctl.conf (NLM_Port)
  #   fs.nfs.nlm_tcpport=893
  #   fs.nfs.nlm_udpport=893

  NFS_Protocols		=4;
  NFS_Port		=2049;
  NFS_Program		=100003;
  MNT_Port		=892;
  MNT_Program		=100005;

  Enable_NLM		=true;
  NLM_Port		=893;
  NLM_Program		=100021;

  Enable_RQUOTA		=true;
  Rquota_Port		=894;
  Rquota_Program	=100011;

  Enable_TCP_keepalive	=false;
  Bind_addr		=;

  DomainName		="WHATEVER";
  IdmapConf		="/etc/idmapd.conf";

  Active_krb5		=true;
  # This could be "host" to recycle the host principal for NFS.
  PrincipalName		="nfs";
  KeytabPath		="/etc/krb5.keytab";
  CCacheDir		="/run/ganesha.nfs.kr5bcc";

  Protocols		=4;
  Transports		=TCP;

  Access_Type		=NONE;
  SecType		=krb5p;
  Squash		=Root_Squash;

  Access_Type		=RW;
  SecType		=krb5p;
  Squash		=Root_Squash;

  Export_ID		=1;
  FSAL			{ Name="XFS"; }
  # Must not have symlinks
  Path			="/var/data/pub";
  Pseudo		="/var/data/pub";

  PrefRead		=262144;
  PrefWrite		=262144;
  PrefReadDir		=262144;

Two of the interesting aspects of Ganesha are that it can serve not just NFSv3 and NFSv4, but also NFSv4.1, pNFS and even the 9p protocol, and that it has specialized backends called FSAL for various types of underlying filesystems, notably CephFS and GlusterFS but also various other cluster filesystems, plus XFS and ZFS. The backends for the cluster filesystems give direct access to those filesystems without the need of a local client for them, which reduces considerably complications and overheads.

My experience with Ganesha has been so far very positive: the speed is good as expected, and it is rather easier to set up and configure than nfs-kernel-server, in particular as to Kerberos authentication; being a user-level daemon it is also easier as to administration, monitoring and investigation.

170324 Fri: IPv6 reaches 14% of Google traffic, SixXS will close down

Three years ago IPv6 traffic was seen by Google as hitting 2% of total IP traffic and it is quite remarkable that today it hit 14.22%, and that is 14% of a rather bigger total. It is also remarkable that:

At 14% of total traffic IPv6 is now commercially relevant, in the sense that services need to be provided over IPv6, as IPv6 users cannot be ignored. Accordingly even mass market ISPs like Sky Broadband UK have been giving IPv6 connectivity as standard for several months, and it is therefore sad but understandable that transition services like SixXS are being closed down:


SixXS will be sunset in H1 2017. All services will be turned down on 2017-06-06, after which the SixXS project will be retired. Users will no longer be able to use their IPv6 tunnels or subnets after this date, and are required to obtain IPv6 connectivity elsewhere, primarily with their Internet service provider.

My current ISP does not yet support IPv6, but it has a low delay to a fairly good 6to4 gateway, so I continue to use 6to4 with NAT.

170302 Thu: Some coarse speed tests with Btrfs etc. and small files

In the previous note about a simple speed test of several Linux filesystems star was used because unlike GNU tar it does fsync(2) before close(2) of a written file, and I added that for Btrfs I also ensured that metadata was (as per default) duplicated as per the dup profile, and these were challenging details.

To demonstrate how challenging I have done some further coarse tests on a similar system, copying from a mostly root filesystem (which has many small files) to a Btrfs filesystem, with and without the star option -no-fsync and with Btrfs metadata single and dup in order of increasing write rate with fsync:

                  elapsed    rate       sys CPU
dup fsync         381m 28s    3.3MB/s   13m 14s
single fsync      293m 26s    4.2MB/s   13m 15s
dup no-fsync       20m 09s   61.7MB/s    3m 50s
single no-fsync    19m 58s   62.3MB/s    3m 41s

For comparison the same source and target and copy command with some other filesystems also in order of increasing write rate with fsync:

                  elapsed    rate       sys CPU
XFS fsync         318m 49s    3.9MB/s    7m 55s
XFS no-fsync       21m 24s   58.2MB/s    4m 10s
F2FS fsync        239m 04s    5.2MB/s    9m 09s
F2FS no-fsync      21m 32s   57.8MB/s    5m 02s
NILFS2 fsync      118m 27s   10.5MB/s    7m 52s
NILFS2 no-fsync    23m 17s   53.5MB/s    5m 03s
JFS fsync          21m 45s   57.2MB/s    5m 11s
JFS no-fsync       19m 42s   63.2MB/s    4m 24s

Source filetree was XFS on a very fast flash SSD, so not a bottleneck.
Source filetree 70-71GiB (74-75GB) and 0.94M inodes (0.10M of which directories), of which 0.48M under 1KiB, 0.69M under 4KiB and 0.8M under 8KiB (the Btrfs allocation block size).
I occasionally observed IO with blktrace and blkparse, and with fsync virtually all IO was synchronous, as fsync on a set of files of 1 block is essentially the same as fsync on every block.
Because of its (nearly) copy-on-write nature Btrfs handles transactions with multiple fsyncs particularly well.
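As a cross-check of the Btrfs figures above, this small sketch recomputes the write rates from the elapsed times and derives the fsync slowdown and the system CPU ratio. The source size of 74.5GB is my assumption (the midpoint of the 74-75GB quoted in the notes), so the recomputed rates only approximately match the tabulated ones.

```python
def seconds(minutes, secs):
    return minutes * 60 + secs


SOURCE_BYTES = 74.5e9  # assumed midpoint of the quoted 74-75GB

# name -> (elapsed seconds, system CPU seconds), from the Btrfs table
runs = {
    "dup fsync": (seconds(381, 28), seconds(13, 14)),
    "single fsync": (seconds(293, 26), seconds(13, 15)),
    "dup no-fsync": (seconds(20, 9), seconds(3, 50)),
    "single no-fsync": (seconds(19, 58), seconds(3, 41)),
}


def rate_mb_per_s(elapsed):
    """Average write rate in MB/s for the whole copy."""
    return SOURCE_BYTES / elapsed / 1e6


for name, (elapsed, sys_cpu) in runs.items():
    print(f"{name:16s} {rate_mb_per_s(elapsed):5.1f}MB/s  sys CPU {sys_cpu}s")

# How much slower, and how much more system CPU, with fsync (dup metadata).
slowdown = runs["dup fsync"][0] / runs["dup no-fsync"][0]
cpu_ratio = runs["dup fsync"][1] / runs["dup no-fsync"][1]
print(f"fsync slowdown: {slowdown:.1f}x, system CPU ratio: {cpu_ratio:.2f}x")
```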

Obviously fsync per-file on most files is very expensive, and dup file metadata is also quite expensive, but nowhere near as much as fsync. This is a very strong demonstration that the performance envelope of storage systems is rather anisotropic. It is also ghastly interesting that given the same volume of data, IO with fsync costs more than 3 times as much system CPU time as without. Some filesystem specific notes:

170228 Tue: Some coarse speed tests with various Linux filesystems

Having previously mentioned my favourite filesystems I have decided to do again, in a different form, a rather informative (despite being simplistic and coarse) test of their speed, similar to one I did a while ago, with some useful results. The test is:

It is coarse, it is simplistic, but it gives some useful upper bounds on how a filesystem does in a fairly optimal case. In particular the write test, involving as it does a fair bit of synchronous writing and seeking for metadata, is fairly harsh; even if, as the test involves writing to a fresh, empty filetree, pretty much an ideal condition, it does not account at all for fragmentation on rewrites and updates. The results, commented below, sorted by fastest, in two tables for writing and reading:

type (write)  elapsed    rate        sys CPU
JFS           148m 01s    72.4MB/s   24m 36s
F2FS          170m 28s    62.9MB/s   26m 34s
OCFS2         183m 52s    58.3MB/s   36m 00s
XFS           198m 06s    54.1MB/s   23m 28s
NILFS2        224m 36s    47.7MB/s   32m 04s
ZFSonLinux    225m 09s    47.6MB/s   18m 37s
UDF           228m 47s    46.9MB/s   24m 32s
ReiserFS      236m 34s    45.3MB/s   37m 14s
Btrfs         252m 42s    42.4MB/s   21m 42s

type (read)   elapsed    rate        sys CPU
F2FS          106m 25s   100.7MB/s   66m 57s
Btrfs         108m 59s    98.4MB/s   71m 25s
OCFS2         113m 42s    94.3MB/s   66m 39s
UDF           116m 35s    92.0MB/s   66m 54s
XFS           117m 10s    91.5MB/s   66m 03s
JFS           120m 18s    89.1MB/s   66m 38s
ZFSonLinux    125m 01s    85.8MB/s   23m 11s
ReiserFS      125m 08s    85.7MB/s   69m 52s
NILFS2        128m 05s    83.7MB/s   69m 41s

The system was otherwise quiescent.
I have watched the various tests with iostat, vmstat, and by looking at graphs produced by collectd and displayed by kcollectd, and sometimes I have used blkstat and blktrace; I have also occasionally used strace to look at the IO operations requested by star.
Having looked at actual behaviour, I am fairly sure that all involved filesystems respected fsync semantics.
The source disk seemed at all times to not be the limiting factor for the copy, in particular as streaming reads are rather faster, as shown above, than writes.

The first comment on the numbers above is the obscene amount of system CPU time taken, especially for reading. That the system CPU time taken for reading is 2.5 times or more that for writing is also absurd. The test system has an 8 thread AMD 8270e CPU with a highly optimized microarchitecture, 8MiB of level 3 cache, and a 3.3GHz clock rate.

The system CPU time for most filesystem types is roughly the same, again especially for reading, which indicates that there is a common cause that is not filesystem specific. For F2FS the system CPU time for reading is more than 50% of the elapsed time, an extreme case. It is interesting to see that ZFSonLinux, which uses its own cache implementation, ARC, has a system CPU time of roughly 1/3 that of the others.
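The ratios mentioned in the last two paragraphs can be recomputed from the tables; the sketch below does so for three of the filesystems only (F2FS, XFS and ZFSonLinux, my choice for brevity), using the elapsed and system CPU times as tabulated.

```python
def seconds(minutes, secs):
    return minutes * 60 + secs


# type -> (read elapsed, read sys CPU, write sys CPU), in seconds
times = {
    "F2FS": (seconds(106, 25), seconds(66, 57), seconds(26, 34)),
    "XFS": (seconds(117, 10), seconds(66, 3), seconds(23, 28)),
    "ZFSonLinux": (seconds(125, 1), seconds(23, 11), seconds(18, 37)),
}


def read_sys_fraction(fs):
    """System CPU time for reading as a fraction of elapsed read time."""
    elapsed, read_sys, _ = times[fs]
    return read_sys / elapsed


def read_write_sys_ratio(fs):
    """System CPU time for reading divided by that for writing."""
    _, read_sys, write_sys = times[fs]
    return read_sys / write_sys


for fs in times:
    print(f"{fs:10s} read sys/elapsed {read_sys_fraction(fs):.2f}  "
          f"read/write sys {read_write_sys_ratio(fs):.2f}")

# ZFSonLinux read system CPU time relative to F2FS.
zfs_vs_f2fs = times["ZFSonLinux"][1] / times["F2FS"][1]
print(f"ZFSonLinux/F2FS read sys CPU: {zfs_vs_f2fs:.2f}")
```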

That Linux block IO involves an obscene amount of system CPU time to do IO I had noticed already over 10 years ago and that the issue has persisted so far is a continuing assessment of the Linux kernel developers in charge of developing the block IO subsystem.

Another comment that applies across all filesystems is that the range of speeds is not that wide: all of them had fairly adequate, reasonable speeds given the device. While there is a range of better to worse, this is to be expected from a coarse test like this, and a different ranking will apply to different workloads. What this coarse test says is that none of these filesystems is particularly bad on this, all of them are fairly good.

Another filesystem independent aspect is that the absolute values are much higher, at 6-7 times better, than those I reported only four years ago. My guess is that this is mostly because the previous test involved the Linux source tree, which contains a large number of very small files; but also because the hardware was an old system that I was no longer using in 2012, indeed I had not used since 2006.

As to the selection of filesystem types tested, the presence of F2FS, OCFS2, UDF, NILFS2 may seem surprising, as they are considered special-case or odd filesystems. Even if F2FS was targeted at flash storage, OCFS2 at shared-device clusters, UDF at DVDs and BDs, and NILFS2 at "continuous snapshotting", they are actually all general purpose, POSIX-compatible filesystems that work well on disk drives and with general purpose workloads. I have also added ZFSonLinux, even if I don't like it for various reasons, as a paragon. I have omitted a test of ext4 because I reckon that it is a largely pointless filesystem design, that exists only because of in-place upgradeability from ext3, which in turn was popular only because of in-place upgradeability from ext2, when the installed base of Linux was much smaller. Also OCFS2 has a design quite similar to that of the ext3 filesystem, and has some more interesting features.

Overall the winners, but not by large margins, from this test seem to be F2FS, JFS, XFS, OCFS2. Some filesystem specific notes, in alphabetical order:

170214 Tue: New Seagate disc drives and their declared duty cycles

Some time ago I mentioned an archival disk drive model from HGST which had a specific lifetime rating for total reads and writes, of 180TB per year as compared to the 550TB per year of a similar non-archival disk drive.

The recent series of IronWolf, IronWolf Pro and Enterprise Capacity disk drives from Seagate which are targeted largely at cold storage use have similar ratings:

It is interesting also that the lower capacity drives are rated for Nonrecoverable Read Errors per Bits Read, Max of 1 per 10E14 and the large ones for 1 in 1E15 (pretty much industry standard), because 180TB per year is roughly 10E13, and the drives probably are designed to last 5-10 years (they have 3-5 years warranties).
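As a rough order-of-magnitude check, the sketch below converts a workload rating in TB per year into bits and into the number of unrecoverable read errors that the maximum error rate would allow in a year. It reads the ratings as one error per 10^14 and one per 10^15 bits read, which is my assumption about how the data-sheet notation is meant, and it treats the rating as a worst case, not as an expected failure count.

```python
TB = 1e12  # decimal terabyte, as used in drive data sheets


def bits_per_year(tb_per_year):
    """Bits transferred per year at the rated workload."""
    return tb_per_year * TB * 8


def max_errors_per_year(tb_per_year, bits_per_error):
    """Upper bound on unrecoverable read errors per year at that workload."""
    return bits_per_year(tb_per_year) / bits_per_error


for bits_per_error in (1e14, 1e15):
    errors = max_errors_per_year(180, bits_per_error)
    print(f"180TB/year at 1 per {bits_per_error:.0e} bits: "
          f"up to {errors:.1f} errors per year")
```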

170205 Sun: A straightforward alternative to the setuid mechanism too

Having just illustrated a simple confinement mechanism for UNIX/Linux systems that uses the regular UNIX/Linux style permissions, I should add that the same mechanism can also replace in one simple unified mechanism the setuid protection domain switching of UNIX/Linux systems. The mechanism would be to add to each process, along with its effective id (user/group) what I would now call a preventive id with the following rules:

Note: there are some other details to take care of, like apposite rules for access to a process via a debugger. The logic of the mechanism is that it is safe to let a process operate under the preventive id of its executable, because the program logic of the executable is under the control of the owner of the executable, and that should not be subverted.

The mechanism above is not quite backwards compatible with the UNIX/Linux semantics because it makes changes in the effective or preventive ids depend on explicit process actions, but it can be revised to be backwards compatible with the following alternative rules:

Note: the implementation of either variant of the mechanism is trivial, and in particular adding preventive id fields to a process does not require backward incompatible changes as process attributes are not persistent.

The overall logic is that in the UNIX/Linux semantics for a process to work across two protection domains it must play between the user and group ids; but it is simpler and more general to have the two protection domains identified directly by two separate ids for the running process.

170204 Sat: Aggregate cost of AWS servers

BusinessWeek has an interesting article on some businesses that offer tools to minimize cloud costs and a particularly interesting example they make contains these figures:

Proofpoint rents about 2,000 servers from Amazon Web Services (AWS),’s cloud arm, and paid more than $10 million in 2016, double its 2015 outlay. “Amazon Web Services was the largest ungoverned item on the company’s budget,” Sutton says, meaning no one had to approve the cloud expenses.

$10m for 2,000 virtual machines means $5,000 per year per VM, or $25,000 over the 5-year period where a physical server would be depreciated. That buys a very nice physical server and 5 years of colocation including power and cooling and remote hands, plus a good margin of saving, with none of the inefficiencies or limitations of VMs; actually if one buys 2,000 physical servers and colocation one can expect substantial discounts over $25,000 for a single server over 5 years.
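The arithmetic above, spelled out as a small sketch; the figures are the round numbers from the quoted article, and the 10-20% range is the kind of saving from better capacity utilization that such tools aim at.

```python
annual_spend = 10_000_000  # dollars paid to AWS in 2016 (round figure)
vm_count = 2_000           # approximate number of rented servers
depreciation_years = 5     # typical period for a physical server

per_vm_per_year = annual_spend / vm_count
per_vm_over_period = per_vm_per_year * depreciation_years
print(f"per VM per year: ${per_vm_per_year:,.0f}")
print(f"per VM over {depreciation_years} years: ${per_vm_over_period:,.0f}")

# What a 10-20% cut from better utilization is worth per year in total.
for cut in (0.10, 0.20):
    print(f"{cut:.0%} saving: ${annual_spend * cut:,.0f} per year")
```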

Note: those 2,000 servers are unlikely to have a large configuration, more likely to be small thin servers for purposes like running web front-ends.

The question is whether those who rent 2,000 VMs from AWS are mad. My impression is that they are more foolish than mad, and the key words in the story above are "no one had to approve the cloud expenses". The point is not just lack of approval but that cloud VMs became the default option, the path of least resistance for every project inside the company.
The company probably started with something like 20 VMs to prototype their service to avoid investing in a fixed capital expense, and then since renting more VMs was easy and everybody did it, that grew by default to 2,000 with nobody really asking themselves for a long time whether a quick and easy option for starting with 20 systems is as sensible when having 2,000.

Cutting VM costs by 10-20% by improving capacity utilization is a start but fairly small compared to rolling their own.

I have written before that cloud storage is also very expensive, and cloud systems also seem to be. Cloud services seem to me premium products for those who love convenience more than price and/or have highly variable workloads, or those who need a built-in content distribution network. Probably small startups are like that, but eventually they start growing slowly or at least predictably, and keep using cloud services by habit.

170202 Thu: A straightforward alternative to confinement mechanisms

There are two problems in access control, read-up and write-down, and two techniques (access lists and capabilities). The regular UNIX access control aims to solve the read-up problem using an abbreviated form of access control lists, and POSIX added extended access control lists.

Preventing read-up with access control lists is a solution for preventing unintended access to resources by users, but does not prevent unintended access to resources by programs, or more precisely by the processes running those programs, because a process running with the user's id can access any resource belonging to that user, and potentially transfer the content of the resource to third parties such as the program's author, or someone who has managed to hack into the process running that program. That is, it does not prevent write-down.

The typical solution to confinement is to use some form of container, that is to envelop a process running a program in some kind of isolation layer that prevents it from accessing the resources belonging to the invoking user. The isolation layer can be usually:

There is a much simpler alternative (with a logic similar to distinct effective and real ids for inodes) that uses the regular UNIX/POSIX permissions and ACLs: to create a UNIX/POSIX user (and group) id per program, and then to allow access only if both the process owner's id and the program's id have access to a resource.

This is in effect what SELinux does in a convoluted way, and is fairly similar also to AppArmor profile files, which however suffer from the limitation of imposing policies to be shared by all users.

Instead allowing processes to be characterized by both the user id of the process owner and the user id of the program (and similarly for group ids) would allow users to make use of regular permissions and ACLs to tailor access by programs to their own resources, if they so wished.

Note: currently access is granted if the effective user id or the effective group id of a process owner have permission to access a resource. This would change to granting access if [the process effective user id and the program user id both have permission] or [the process effective group id and the program group id both have permission]. Plus some additional rules, for example that a program id of 0 has access to everything, and can only be set by the user with user id 0.

Note: of course multiple programs could share the same program user and group id, which perhaps should be really called foreign id or origin id.
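The access rule in the notes above can be sketched as a small check. The types and field names below are hypothetical, permissions are reduced to plain sets of allowed ids instead of real mode bits and ACLs, and treating a program id of 0 as "the program side imposes no restriction" is my reading of the additional rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    euid: int         # effective user id of the process owner
    egid: int         # effective group id of the process owner
    program_uid: int  # user id assigned to the program being run
    program_gid: int  # group id assigned to the program being run


@dataclass(frozen=True)
class Resource:
    allowed_uids: frozenset  # user ids granted access (mode bits or ACL)
    allowed_gids: frozenset  # group ids granted access


def may_access(proc, res):
    """Grant access only if both the owner's and the program's ids are allowed."""
    user_ok = proc.euid in res.allowed_uids
    group_ok = proc.egid in res.allowed_gids
    if proc.program_uid == 0:
        # Program id 0 is unrestricted: this reduces to the current rule.
        return user_ok or group_ok
    return ((user_ok and proc.program_uid in res.allowed_uids)
            or (group_ok and proc.program_gid in res.allowed_gids))


# Hypothetical ids: user 1000 runs a browser with program id 2001.
browser = Process(euid=1000, egid=1000, program_uid=2001, program_gid=2001)
private = Resource(frozenset({1000}), frozenset())
downloads = Resource(frozenset({1000, 2001}), frozenset())
print(may_access(browser, private), may_access(browser, downloads))
```

In this sketch the user keeps the browser away from private files simply by not granting the browser's program id any permission on them.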

170129 Sun: The mainframe development problem and MPI

At some point, in order to boost the cost and lifestyles of their executives, most IT technology companies try to move to higher margin market segments, which usually are those with the higher priced products. In the case of mainframes this meant an abandonment of lower priced market segments to minicomputer suppliers. This created a serious skills problem: to become a mainframe system administrator or programmer a mainframe was needed for learning, but mainframe hardware and operating systems were available only in large configurations at very high price levels, and therefore used only for production.

IBM was particularly affected by this as they really did not want to introduce minicomputers with a mainframe compatible hardware and operating system, to make sure customers locked-in to them would not be tempted to fall back on minicomputers, and the IBM lines of minicomputers were kept rigorously incompatible with and much more primitive than the IBM mainframe line. Their solution, which did not quite succeed, was to introduce PC-sized workstations with a compatible instruction set, to run on them a version of the mainframe operating system, and even to create plug-in cards for the IBM PC line, all to make sure that the learning systems could not be used as cheap production systems.

Note: The IBM 5100 PC-sized mainframe emulator became an interesting detail in the story of time traveller John Titor.

The problem is more general: in order to learn to configure and program a system, one has to buy that system or a compatible one. Such a problem is currently less visible because most small or large systems are based on the same hardware architectures, the Intel IA32 or AMD64 ones, and one of two operating systems, MS-Windows or Linux, and a laptop or a desktop thus has the same runtime environment as a larger system.

Currently the problem happens in particular for large clusters for scientific computing, and it manifests particularly for highly parallel MPI programs. In particular many users of large clusters develop their programs on their laptops and desktops, and these programs read data from local files using POSIX primitives, rather than using MPI2 IO primitives. Thus the demand for highly parallel POSIX-like filesystems like Lustre that however are not quite suitable for highly parallel situations.

The dominant issue of such programs is that the issues that arise with them, mostly synchronization and latency impacting speed, cannot be reproduced on a workstation, even if it can run the program with MPI or other frameworks. Many of these programs cannot even be tested on a small cluster, because their issues arise only at grand scale, and may be even different on different clusters, as they may be specific to the performance envelope of the target.

The same problems happen with OpenMP, which is designed for shared memory systems with many CPUs: while even laptops today have some CPUs that share memory, the real issues happen with systems that have a few dozen CPUs and non-uniform shared memory, and with systems that have several dozen or even hundreds of CPUs, and the issues they have also arise only at scale, and vary depending on the details of the implementation.

It is not a simple problem to solve, and is a problem that limits severely the usefulness of highly parallel programs on large clusters, as it limits considerably their ecosystem.

170128 Sat: An interesting blog post on namespaces

The bottom of this site's index page has a list of sites and blogs similar to this in containing opinions about computer technology, mostly related to Linux and system and network engineering and programming, and I have recently discovered a blog by the engineers and programmers offering their services via one of the main project based contract work sites.

The blog like every blog has a bit of a promotional role for the site and the contract workers it lists, but the technical content is not itself promotional, but has fairly reasonable and interesting contributions.

I have been particularly interested by a posting on using Linux based namespaces to achieve program and process confinement.

It is an interesting topic in part because it is less than wonderfully documented, and it can have surprising consequences.

The posting is a bit optimistic in arguing that using namespacing, it is possible to safely execute arbitrary or unknown programs on your server, as there are documented cases of programs breaking out (moderately easily) of containers and even virtual machines. But namespaces do make it more difficult to do bad things, and often raising the level of cost and difficulty achieves good-enough security.

Note: The difficulty with namespaces and isolation is that namespaces are quite complicated mechanisms that need changes to a lot of Linux code, and they are a somewhat forced retrofit into the logic of a POSIX-style system, while dependable security mechanisms need to be very simple to describe and code. Virtual machine systems are however even more complicated and error prone.

The posting discusses process id, network interface, and mount namespaces, giving simple illustrative examples of code to use them, and briefly discusses also user id, IPC and host-name namespaces. Perhaps user id namespaces would deserve a longer discussion.
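
As a small complement (not taken from the posting), this is a minimal sketch of how to inspect which namespaces a process currently belongs to, assuming the usual Linux /proc layout in which each entry under /proc/PID/ns is a symbolic link whose target looks like "pid:[4026531836]". The function names parse_ns_link and namespaces_of are hypothetical. Two processes are in the same namespace of a given type when the inode numbers match.

```python
import os
import re

# Link targets under /proc/<pid>/ns have the form "type:[inode]".
_NS_LINK = re.compile(r"^([a-z_]+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry of /proc/<pid>/ns to its namespace inode number.

    Returns an empty dict where /proc is missing or unreadable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return result
    for name in names:
        try:
            _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # entry vanished, is unreadable, or has another format
        result[name] = inode
    return result
```

Comparing namespaces_of() for a process inside a container with that of a process outside it shows which namespace types the container actually separates.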

The posting indeed can serve as a useful starting point for someone who is interested in knowing more about a complicated topic. It would be nice to see it complemented by another article on the history and rationale of the design of namespaces and related ideas, and maybe I'll write something related to that in this blog.

170124 Tue: A legitimate Unsolicited Commercial Email!

With great surprise I have received recently for the first time in a very, very long time a legitimate and thus non-spam unsolicited commercial email.

The reason why unsolicited commercial emails are usually considered spam is that they are as a rule mindless time wasting advertising, usually automated and impersonal, and come in large volume as a result. The email I received was unsolicited and commercial, but it was actually a reasonable business contact email specifically directed at me from an actual person who answered my reply.

Of course there are socially challenged people who regard any unsolicited attempt at contact as a violation (especially from the tax office I guess :-)), but unsolicited contact is actually pretty reasonable if done in small doses and for non-trivial personal or business reasons.

It is remarkable how rare they are and that's why I use my extremely effective anti-spam wildcard domain scheme.

170120 Fri: What is the "Internet"?

While chatting the question was raised of what is the Internet. From a technical point of view that is actually an interesting question with a fairly definite answer:

There are IPv4 or IPv6 internets that are distinct from the two Internets, but usually they adopt the IANA conventions and have some kind of gateway (usually it needs to do NAT) to the two Internets.

Note: these internets use the same IPv4 or IPv6 address ranges as the two Internets, but use them for different hosts. Conceivably they could also use the same port numbers for different services: as port 80 has been assigned by IANA for HTTP service, a separate internet could use port 399. But while this is possible I have never heard of an internet that uses assigned numbers different from the IANA ones, except for the root DNS servers.

But it is more common to have IPv4 or IPv6 internets that differ from the two Internets only in having a different set of DNS root name server addresses, but are otherwise part, at the transport and lower levels, of the two Internets.

There is a specific technical term to indicate the consequences of having different sets of DNS root name servers: naming zone. Usually the naming zones for internets directly connected to the two Internets overlap with and extend those of the two Internets: they just add to (and sometimes redefine or hide) the domains of the Internets.

Note: Both the IPv4 and the IPv6 Internets share the same naming zone, in the sense that the IPv4 DNS root servers and the IPv6 DNS root servers serve the same root zone content by convention. This is not necessarily the case at deeper DNS hierarchy levels: it is a local convention whether a domain resolves to an IPv4 and IPv6 address that are equivalent as in being on the same interface and the same service daemons being bound to them.
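
As a minimal sketch of the first half of that check, the following groups the addresses that a name resolves to by address family. The function names group_by_family and resolve_both are hypothetical, and the resolver used is whatever the local system is configured with. Seeing both families listed only shows that the name is published in both; whether the two addresses reach the same interface and the same daemons remains a local convention that this cannot verify.

```python
import ipaddress
import socket


def group_by_family(addresses):
    """Group textual IP addresses into IPv4 and IPv6 lists."""
    grouped = {"ipv4": [], "ipv6": []}
    for text in addresses:
        # Drop any "%scope" suffix that link-local IPv6 addresses may carry.
        addr = ipaddress.ip_address(text.split("%")[0])
        key = "ipv4" if addr.version == 4 else "ipv6"
        grouped[key].append(str(addr))
    return grouped


def resolve_both(host):
    """Resolve `host` with the system resolver and group the results."""
    infos = socket.getaddrinfo(host, None, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr) and
    # sockaddr[0] is the textual address for both families.
    return group_by_family(sorted({info[4][0] for info in infos}))
```

For example resolve_both("example.org") returns a dictionary with an "ipv4" and an "ipv6" list, either of which may be empty depending on what the domain publishes and on the local resolver.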

170107 Sat: F2FS and Bcachefs

Two relatively new filesystem designs and implementations for Linux are F2FS and Bcachefs.

The latter is a personal project of the author of Bcache, a design to cache data from slow storage onto faster storage. It seems very promising, and it is one of the few with full checksumming, but it is not yet part of the default Linux sources, and work on it seems to be interrupted, even if the implementation of the main features seems finished and stable.

F2FS was initially targeted at flash storage devices, but is generally usable as a regular POSIX filesystem, and performs well as such. Its implementation is also among the smallest with around half or less the code size of XFS, Btrfs, OCFS2 or ext4:

   text    data     bss     dec     hex filename
 237952   32874     168  270994   42292 f2fs/f2fs.ko

Many congratulations to its main author Jaegeuk Kim, a random Korean engineer in the middle of huge corporation Samsung, for his work.

Since the work has been an official Samsung project, and F2FS is part of the default Linux sources, and is widely used on Android based cellphone and tablet devices, it is likely to be well tested and to have long term support.