Computing notes 2021 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


2021 June

20210623 Tue: Identifiable or self-identified targets

A friend was discussing the privacy and economic implications of online ordering and home delivery for books, but then the discussion went to home delivery in general, and someone observed that home delivery means that certain actions can then be targeted quite precisely to a certain person, because home delivery usually is to a named individual.

Indeed for computers the Snowden revelations have confirmed that the targeted operations department of the NSA (and no doubt those of every other major player) routinely intercepts and compromises deliveries of computer equipment.

This need not apply just to computers: it applies also to food, medicines, even books, and all sorts of household items. At one extreme, poisoned or explosive products and devices delivered to named individuals have often been used as attacks.

Overall there is much greater safety in numbers and anonymity: buying products with cash in random shops makes it much more likely that they have not been tampered with.

20210612 Sat: A really nice typeface

I have been looking at typefaces and fonts and rendering for a while and one of the questions has always been which text (that is non-decorative) typefaces to use. Once upon a time the ideal was to use the five typefaces in the Adobe 14 font set but they have always been proprietary (except for the donation of Courier to X Windows) and most suited to high DPI rendering for printers, mostly because of their design. The URW clones of the Adobe 14 donated to X Windows are good, and have been recently resurrected as the Gyre typefaces, but they are not popular.

The best alternative then became the Microsoft web core fonts as they were designed for low DPI devices and well hinted for them, but while they are free-of-charge to use, they are not otherwise free to modify and improve.

With the passing of years of use I have become ever fonder of the Ubuntu typefaces because both of their pretty nice design, and their being well hinted, complete and free to modify and extend.

There are other freeware typeface families, and the best alternative to the Ubuntu one is the DejaVu typefaces, but I like them less because I find them a bit too tall and print oriented. They are as complete (and perhaps even more so) and as well hinted as the Ubuntu typefaces though.

The other freeware typeface families are either less complete, not well hinted, or have some other issues; for example the Google Noto font collection has too many different and unnecessary types.

Typefaces that are almost as good as the Ubuntu and DejaVu ones include the Google Roboto ones and the KDE Oxygen fonts, but they are not as well hinted or are not as complete.

But overall I still think that the Ubuntu typefaces are very good, perhaps still the best autonomous Ubuntu project (the second best being probably Bazaar, at long last).

2021 May

20210523 Sun: Updated CSS styles for ATOM and RSS feeds

I have used in the past both RSS and Atom for the feeds of this web site, and I had written a CSS style sheet to display the RSS feeds. I have updated that a bit, and written an equivalent one for Atom. The sort-of special feature of both is that they turn the link elements into clickable links by using a JavaScript postprocessing script that turns the main link elements into either:

link elements in the http://www.w3.org/1999/xlink namespace.
a elements in the http://www.w3.org/1999/xhtml namespace; this latter is slightly preferable because the browsers handle it better.

Note: in Atom the text content of some elements is supposed to be full HTML, but the simple style sheets I have written treat the content as simple text.

Note: On this site the feeds are served both with a .atom or .rss suffix and with an .xml suffix, because the first two are served with the application/atom+xml and application/rss+xml MIME types, and the styling code in the browsers seems to be enabled only for the base XML MIME type (as well as that for HTML).

The setup seems to work with Microsoft Edge and with Brave, and the links to the relevant files are:

Atom: CSS, Javascript.
RSS: CSS, Javascript.

Note: Here are links to both current ATOM feeds (and older RSS feed) for this site to be rendered using the stylesheets above: changed.xml, changed2.xml (changes.xml).

20210503 Sun: Short proper names for devices and media

In natural language there are category names like chair and proper names like Japan, where the proper names denote individual entities. I reckon that it is usually quite important to distinguish between the two in computing too, not just in natural language.

In computing there is a further distinction, as there are generic proper names like site29host317 (commonly used for cattle type resources) and specific ones like sunflower (commonly used for pet type resources). A further distinction can be made among generic proper names between arbitrary ones and path based ones, that is names that embody some information on how to find the named resource.

Names apply not just to computers, but also to other resources, for example accounts, devices, filesystems, media, processes, files.

Most names used in computing are generic proper names, and most are at least in part path based. However I believe that proper specific names are useful in many more cases, because they are useful in labeling independent resources.

So I usually give filesystems a proper name, because as a rule to me which computer I am using is sort of not very important, what matters is which filesystems I am accessing; for example if a computer dies I usually just move the storage units containing its filesystems to some other computer.

A difficulty with that is that while it is possible to use filesystem specific proper names in various places, most notably /etc/fstab (and I have written a nice script that allows it to be used as a map for the automounter), that might require scanning all available devices and media to determine which one holds which filesystem, as ultimately it is block device names that have to be used to mount filesystems.

Also usually block device names depend on physical device names, which are usually based partially on paths, for example /dev/nvme0n1p7 or /dev/sdak3, and the naming of devices in a contemporary GNU/Linux system is typically opaque and possibly semi-random.

Because of that I have decided to give to several devices and media on my pet systems specific individual names, treating them as pets too, to avoid having to remember the details of their paths.

This has required me to deal with the demented udevd framework that replaced (because of the execrable GKH) the much simpler and better devfsd (1, 2) system (designed and implemented well by Richard Gooch), with a scheme of symbolic-link based aliases like this:

Using usually their serial numbers and/or brand and model names, give specific proper names to devices with interchangeable media, and partitions thereof, putting them under /dev/disk/, for example for a dock (I have used a similar configuration for my two digital cameras):
```
# Alxum USB3 UASP dock

KERNEL=="sd*[!0-9]",	SUBSYSTEMS=="scsi", \
  ENV{SCSI_MODEL}=="ASMT1153e", \
  ENV{DEVTYPE}=="disk", SYMLINK+="disk/alxum"
KERNEL=="sd*",		SUBSYSTEMS=="scsi", \
  ENV{SCSI_MODEL}=="ASMT1153e", \
  ENV{DEVTYPE}=="partition", SYMLINK+="disk/alxum%n"
```
or the SDXC slot on the side of a laptop:
```
# ThinkPad E495 MMC slot on the side.

KERNEL=="mmcblk0",	SUBSYSTEMS=="block", \
  ENV{DEVTYPE}=="disk", SYMLINK+="disk/side"
KERNEL=="mmcblk0p*",	SUBSYSTEMS=="block", \
  ENV{DEVTYPE}=="partition", SYMLINK+="disk/side%n"
```
This scheme breaks down a bit if there are multiple docks with the same model numbers, but that is uncommon.
The advantages of this scheme are:
- Whichever physical device name the kernel assigns to the device, I can use its proper name to refer to it.
- Whichever media is held in that device I can use the same simple and meaningful proper name to refer to it and its partitions.

Using their disk label ids and/or serial numbers, give specific proper names to specific media, for example for the main media of a system:

```
# Media of laptop "daisy"

KERNEL=="nvme?n?",	SUBSYSTEMS=="block", \
  ENV{ID_PART_TABLE_UUID}=="ef4b523b-bb01-48b1-9396-7c60f5df2c2f", \
  ENV{DEVTYPE}=="disk", SYMLINK+="media/daisy0"
KERNEL=="nvme*",	SUBSYSTEMS=="block", \
  ENV{ID_PART_TABLE_UUID}=="ef4b523b-bb01-48b1-9396-7c60f5df2c2f", \
  ENV{DEVTYPE}=="partition", SYMLINK+="media/daisy0p%n"

KERNEL=="sd*[!0-9]",	SUBSYSTEMS=="scsi", \
  ENV{ID_PART_TABLE_UUID}=="42643b7a-2d2a-45a8-9412-06d23ba63f2f", \
  ENV{DEVTYPE}=="disk", SYMLINK+="media/daisy1"
KERNEL=="sd*",		SUBSYSTEMS=="scsi", \
  ENV{ID_PART_TABLE_UUID}=="42643b7a-2d2a-45a8-9412-06d23ba63f2f", \
  ENV{DEVTYPE}=="partition", SYMLINK+="media/daisy1p%n"
```

or for a backup disk:

```
# Seagate 5T backup disk

KERNEL=="sd*[!0-9]",	SUBSYSTEMS=="scsi", \
  ENV{ID_PART_TABLE_UUID}=="03871ee3-2a23-4cbc-b40e-dc17a7ff5ed6", \
  ENV{DEVTYPE}=="disk", SYMLINK+="media/both"
KERNEL=="sd*",		SUBSYSTEMS=="scsi", \
  ENV{ID_PART_TABLE_UUID}=="03871ee3-2a23-4cbc-b40e-dc17a7ff5ed6", \
  ENV{DEVTYPE}=="partition", SYMLINK+="media/both%n"
```

In the case of media I have chosen to put their names under a new /dev/media/ directory.
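The pairs of rules for media follow a fixed pattern (one rule for the whole disk, one for its partitions), so they could be generated from a small table rather than typed by hand. A minimal Python sketch, assuming the same match keys as in the rules above; the function name `media_rules` and its parameters are hypothetical, and the output should be reviewed before being installed as a udev rules file:

```python
def media_rules(name, table_uuid, kernel_disk, kernel_part, subsystem, part_sep="p"):
    """Return a disk rule and a partition rule for one medium, as text."""
    # The match on the partition table UUID is shared by both rules.
    match = (
        f'SUBSYSTEMS=="{subsystem}", \\\n'
        f'  ENV{{ID_PART_TABLE_UUID}}=="{table_uuid}", \\\n'
    )
    disk = (
        f'KERNEL=="{kernel_disk}",\t{match}'
        f'  ENV{{DEVTYPE}}=="disk", SYMLINK+="media/{name}"'
    )
    part = (
        f'KERNEL=="{kernel_part}",\t{match}'
        f'  ENV{{DEVTYPE}}=="partition", SYMLINK+="media/{name}{part_sep}%n"'
    )
    return disk + "\n" + part + "\n"


# Example: the first medium of laptop "daisy" from the rules above.
print(media_rules("daisy0", "ef4b523b-bb01-48b1-9396-7c60f5df2c2f",
                  "nvme?n?", "nvme*", "block"))
```

Passing `part_sep=""` gives partition names without the `p` separator, as in the backup disk example.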

There are of course already aliases under some subdirectories of /dev/disk/, but they are either very verbose, or are path based, and the names I give are much better mnemonics. Some relevant notes:

There are filesystem proper names under /dev/disk/by-label/ but sometimes I like to backup filesystems by way of byte-by-byte copy, which duplicates their name label. Using device or media name plus partition number avoids that (and avoids the long pathname prefix).
There are partition proper names under /dev/disk/by-partlabel/, but those are only available for GPT disks and the prefix is really long. I have used them though for things like ZFS and Btrfs member names, as it allows replacing a medium with another with a different medium name (as it should be) but with the same name for the member block device.
Unfortunately neither the MBR nor the GPT partitioning labels for media allow proper symbolic names for a medium, so the scheme above helps translate the numeric identifier of a medium to a convenient proper name.

The main values of the scheme above are that it makes it easier to document some aspects of a system configuration, and that it prevents making a lot of mistakes that can happen when using generic (especially path based) names wrongly.

Update 2022-04-30: The entries in /etc/fstab can then be written more portably and easily, for example as:

```
/dev/media/daisy0p1 /		jfs	defaults,auto,atime,nodiratime	7 1
/dev/media/daisy0p2 none	swap	sw,pri=10			0 0
/dev/media/daisy0p3 /fs/home1	jfs	defaults,noauto,noatime		14 2
/dev/media/daisy1p1 /fs/home2	jfs	defaults,noauto,noatime		14 2

/dev/media/both1 /fs/both_root	f2fs	defaults,noauto,nofail,noatime	0 0
/dev/media/both2 /fs/both_home1	f2fs	defaults,noauto,nofail,noatime	0 0
/dev/media/both3 /fs/both_home2	f2fs	defaults,noauto,nofail,noatime	0 0

/dev/disk/alxum1 /fs/alxum1	auto	defaults,noauto,nofail,noatime	0 0
/dev/disk/alxum2 /fs/alxum2	auto	defaults,noauto,nofail,noatime	0 0
/dev/disk/alxum3 /fs/alxum3	auto	defaults,noauto,nofail,noatime	0 0

/dev/disk/alxum1 /fs/alxum1-n	nilfs2	defaults,noauto,nofail,noatime,nogc 0 0
/dev/disk/alxum2 /fs/alxum2-n	nilfs2	defaults,noauto,nofail,noatime,nogc 0 0
/dev/disk/alxum3 /fs/alxum3-n	nilfs2	defaults,noauto,nofail,noatime,nogc 0 0

/dev/disk/alxum1 /fs/alxum1-x	xfs	defaults,noauto,nofail,noatime,inode64 0 0
/dev/disk/alxum2 /fs/alxum2-x	xfs	defaults,noauto,nofail,noatime,inode64 0 0
/dev/disk/alxum3 /fs/alxum3-x	xfs	defaults,noauto,nofail,noatime,inode64 0 0

/dev/disk/alxum1 /mnt/alxum1-t	exfat	ro,rw,noauto,user,noatime,async,umask=0077 0 0
/dev/disk/alxum2 /mnt/alxum2-t	exfat	ro,rw,noauto,user,noatime,async,umask=0077 0 0
/dev/disk/alxum3 /mnt/alxum3-t	exfat	ro,rw,noauto,user,noatime,async,umask=0077 0 0

/dev/disk/alxum1 /mnt/alxum1-v 	vfat	ro,rw,noauto,user,noatime,async,umask=0077,showexec,shortname=mixed 0 0
/dev/disk/alxum2 /mnt/alxum2-v 	vfat	ro,rw,noauto,user,noatime,async,umask=0077,showexec,shortname=mixed 0 0
/dev/disk/alxum3 /mnt/alxum3-v 	vfat	ro,rw,noauto,user,noatime,async,umask=0077,showexec,shortname=mixed 0 0
```

Note: the -n suffix on the mount directories (and similarly the -x, -t and -v suffixes) is there to select the type of filesystem when using the automounter or mount --target as user, because in neither case is it possible to pass the filesystem type as a parameter.

That works particularly well if the /fs/ directory is managed by the automounter using my script to turn /etc/fstab into a dynamic automount map.
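To give an idea of what turning /etc/fstab into an automount map involves, here is a minimal Python sketch. It is not my script: the function name `fstab_to_automount` is hypothetical, it assumes an indirect autofs map mounted on /fs/ with the usual `key -options :device` line format, and the set of options it drops is an assumption too.

```python
def fstab_to_automount(fstab_text, base="/fs/"):
    """Turn fstab entries mounted under `base` into indirect autofs map lines."""
    # Options that only make sense in fstab itself (an assumption of this sketch).
    skip = {"auto", "noauto", "nofail", "defaults"}
    lines = []
    for raw in fstab_text.splitlines():
        fields = raw.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # blank line, comment, or malformed entry
        device, mountpoint, fstype, options = fields[:4]
        if not mountpoint.startswith(base):
            continue  # only entries under the automounted directory
        key = mountpoint[len(base):]
        opts = [o for o in options.split(",") if o not in skip]
        map_options = "-fstype=" + fstype + "".join("," + o for o in opts)
        lines.append(f"{key}\t{map_options}\t:{device}")
    return "\n".join(lines)


sample = "/dev/media/daisy0p3 /fs/home1 jfs defaults,noauto,noatime 14 2\n"
print(fstab_to_automount(sample))
```

The sketch only handles mount points directly under the base directory; nested mount points would need a different kind of map.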

2021 April

20210417 Sat: Linux and BSD extensions should be UNIXy

There are among many others two celebrated reasons why GNU/Linux or also BSD (the base for MacOS X and iOS) have become popular:

As a rule they are no-cost options, there are many variants that can be copied free of charge.
They are non-proprietary, as they can be modified and redistributed without restrictions.

But there is a third reason that has been forgotten: they are not just, or even mainly, non-proprietary, no-cost alternatives to MS-Windows and other products; they are based on the UNIX architecture, which is arguably better, as in simpler, more consistent, and more flexible, than that of most other operating systems.

The original UNIX was widely adopted because of that being arguably better even if it was a proprietary product with a significant price, for example (also 1, 2):

This seems often to be forgotten by those (I think many associated with FreeDesktop.org and GNOME) who seem to be interested only in it being a non-proprietary, no-cost alternative to MS-Windows, and effectively take the MS-Windows design style as a model to follow. UNIX systems are supposed to be arguably better, not just cheaper or more free.

That is particularly important as to designing extensions to the base UNIX system, which had significant limitations: those extensions should be as simple, consistent, flexible as the rest of the UNIX design (for example 1, 2, 3) rather than being MS-Windows like messy extensions.

20210416 Fri: runit versus systemd

Having recently written a simplified history of early UNIX init designs, this will be mostly an evaluation of the design of systemd, using runit as a counterpart, as many people seem to misunderstand both. In particular they do not seem to be aware of the central issue, which is not being an init system, which is trivial, but the separate issue of the supervision of daemons, taking into account that they need to communicate with each other and with arbitrary other processes. That therefore involves discussing the single greatest strength of the UNIX architecture, that is how well it allows IPC among related processes.

Note: in UNIX processes are related if they are the descendants of the same process, that is they share in part the environment and file descriptors of their ancestor process, which can to an extent control them.

The clearest example of how well UNIX allows dealing with related processes is shell pipelines that are composed of related processes, which allow IPC of many processes with automatic data flow control and implicit synchronization, without even having to write any explicit code, just by having the shell set up pipes as communication channels among the related processes it creates for the pipeline.

Conversely in the classic UNIX architecture IPC among unrelated processes is extremely awkward, because pipes are not available, and IPC happens only through files, and there are only pretty awkward synchronization mechanisms for files, they must be coded explicitly, and there is no automatic data flow control.
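What the shell does for a two-stage pipeline can be sketched in a few lines of Python, as an illustration only: the parent creates both children, connects the standard output of the first to the standard input of the second with a pipe, and the kernel takes care of flow control and end-of-file. The function name `run_pipeline` and the two tiny child programs are made up for the example.

```python
import subprocess
import sys


def run_pipeline(text):
    """Run two related child processes joined by a pipe, like `producer | consumer`."""
    producer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.argv[1])", text],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"],
        stdin=producer.stdout,  # the read end of the pipe becomes the consumer's stdin
        stdout=subprocess.PIPE,
    )
    # The parent drops its copy of the pipe, so the consumer sees end-of-file
    # as soon as the producer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.decode()


print(run_pipeline("hello from related processes\n"), end="")
```

Neither child contains any synchronization code: the consumer simply blocks on its standard input until the producer has written something or has closed the pipe.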

Note: In the classic UNIX architecture doing IPC via files was flow controlled and synchronized manually: run manually a command on a file, once the file is done, run manually another command on that file.

However there are many services that should be provided by daemons to unrelated processes, typically involving some form of spooling, and doing that manually is quite bad.

Therefore since shortly after the classic UNIX editions there have been many attempts to provide better IPC among unrelated processes (for example Edition 7 multiplexes (by Greg Chesson), 4BSD sockets (by Bill Joy), early System V named pipes, later System V STREAMS). Most have attempted to replicate pipes as named entities in the filesystem, and all have substantially failed, in part because it is a difficult design problem, and in large part because it cannot be solved by random hacking: it requires a level of knowledge and insight even greater than that of designing IPC among related processes using pipes, which was the major contribution from the authors of UNIX.

The big issues for supervising unrelated processes then involve IPC:

Starting a daemon is not the same as starting a service, because the service is only properly ready after an initialization phase, and vice versa stopping a service is not the same as stopping a process, because the service needs some shutdown phase before its process can be stopped.
Therefore starting and stopping processes cannot be the same as starting and stopping services except in trivial cases.
Dependencies and thus ordering among processes are therefore not a very interesting detail, because what matters is dependencies and ordering among services. For example an SMTP service as a rule depends on a DNS service, but starting the DNS daemon process does not mean that the DNS daemon service is also started.
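The gap between "the process has been started" and "the service is ready" can be illustrated with a small Python sketch that waits until a TCP service actually accepts connections before anything that depends on it is started. This is only an illustration: the function name `wait_until_ready` is hypothetical, and accepting a connection is just a rough proxy for readiness, not a real service state model.

```python
import socket
import time


def wait_until_ready(host, port, timeout=5.0, interval=0.1):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # the service answered, not merely the process exists
        except OSError:
            time.sleep(interval)  # not ready yet: refused, unreachable or timed out
    return False
```

A dependent daemon would be started only after something like `wait_until_ready("127.0.0.1", 53)` returns true, which is quite different from starting it right after the DNS daemon process has been forked.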

As to these issues runit and systemd take completely different approaches, as runit is a well designed, simple, robust and efficient supervisor of processes, and essentially ignores the issue of dependency ordering, while systemd aims to manage services and their dependencies, and does so in a poorly designed and misunderstood way.

What is common to them is the fundamental approach taken: they both turn unrelated daemon processes into related ones by having all daemon processes managed by the supervisor component, which in the case of runit is a separate program from init (one quite similar to daemontools), and in the case of systemd is part of the init program. In both cases the supervisor keeps a table of the child process numbers, and in both cases process management is done not via per-daemon scripts as in the case of the System V init with /etc/init.d/ but through requests to the supervisor, which then signals the child processes appropriately.

Note: Some people make a big deal of the latter aspect, but having the supervisor keep the table of process numbers and send the signals, instead of keeping the process numbers in a set of .pid files and sending signals to them directly, is nicer but not such a big deal.

There is an important point about both aspects, that runit manages processes and with a separate supervisor from init while systemd attempts to manage services and with a supervisor integrated with init: the main goal of systemd is to minimize boot times by maximizing the parallelism with which service daemons can be launched, assuming that there are many services with a complex web of dependencies among them, while runit seems designed to supervise a relatively small number of service daemons that can be ordered in a much less parallel way.
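The common approach, a supervisor that is the parent of all daemons and keeps a table of them, can be sketched in Python as follows. This is a toy under stated assumptions, not how runit or systemd are implemented: the function name `supervise` and the restart limit are invented for the example, and a real supervisor would block waiting for child termination instead of polling.

```python
import subprocess
import time


def supervise(commands, max_restarts=3, poll_interval=0.05):
    """Start each command as a child and restart it when it exits.

    `commands` maps a name to an argument list. Returns the number of
    restarts done for each name once every child has been given up on.
    """
    # The supervisor's table: name -> child process it is the parent of.
    table = {name: subprocess.Popen(argv) for name, argv in commands.items()}
    restarts = {name: 0 for name in commands}
    while table:
        for name, child in list(table.items()):
            if child.poll() is None:
                continue  # still running
            if restarts[name] < max_restarts:
                restarts[name] += 1
                table[name] = subprocess.Popen(commands[name])
            else:
                del table[name]  # give up on this child
        time.sleep(poll_interval)
    return restarts
```

Because the supervisor is the parent it learns directly when a child terminates, and it can signal the child without looking up a process number anywhere else.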

The reason why systemd was designed with that main goal seems to me that modern GNU/Linux desktop GUI environments can have hundreds of services with complex relations among them, just like under MS-Windows, and minimizing the time to the appearance of a graphical login prompt by parallelizing services as much as possible gives the impression of a more responsive system, just like under MS-Windows. It is not by mere chance that systemd and related software are endorsed by FreeDesktop.org which seems an organization devoted to making UNIX as similar to MS-Windows as possible.

Note: Therefore systemd process supervision must be integrated with init to ensure that system initialization is managed by systemd too, including the initial activation of devices and the mounting of filestores, so they can also be parallelized to reduce boot times.

Therefore in order to manage optimally (in terms of parallelism) the dependencies among unrelated services, by making them all related by managing them all as children of the same systemd, every resource must be turned into a systemd service or a pseudo-service, and this includes traditionally separate concepts like initial boot activity, activating storage devices, mounting filesystems, and much else.

Note: Consider the example of two services each of which has a spool area on a separate filestore on two different block devices on two distinct physical devices: they can be started in parallel as long as activating the physical devices, configuring the block devices, and mounting the filestores can be done in parallel before the service is started, and all the relevant steps are done lazily as-needed.

Part of that is due to a rather fundamental problem, that while process start does not imply service start, there is no established convention by which a daemon can show that its service has been initialized and is ready. To work around this the idea has been to overlay on top of UNIX a universal IPC system for unrelated services (rather than processes) called D-Bus (which is similar to DCOM in MS-Windows), so that all requests to a daemon are turned into unrelated process IPC, and therefore requesting processes automatically wait for a reply from the services they depend on:

```
$ qdbus --system | grep -v '^:' | sort
  com.redhat.NewPrinterNotification
  com.redhat.PrinterDriversInstaller
  org.bluez
  org.freedesktop.Accounts
  org.freedesktop.ColorManager
  org.freedesktop.PolicyKit1
  org.freedesktop.RealtimeKit1
  org.freedesktop.UDisks2
  org.freedesktop.UPower
  org.freedesktop.login1
  org.freedesktop.systemd1
  org.freedesktop.DBus
```

Then several existing UNIX services have to be rewritten into D-Bus accessible services, and integrated within systemd or packaged with it, in order to maximize parallelism while respecting dependencies, and indeed the systemd documentation advises:

Note that while systemd offers a flexible dependency system between units it is recommended to use this functionality only sparingly and instead rely on techniques such as bus-based or socket-based activation which make dependencies implicit, resulting in a both simpler and more flexible system.

My overall evaluation is that runit achieves its limited aims successfully and elegantly and robustly, and systemd achieves its much more sophisticated aims, even if worthy, mostly but not wholly successfully and with much complexity and fragility. I think these are the main reasons for the latter:

Designing such a sophisticated wrapper layer on top of the UNIX architecture to remedy some fundamental limitations of the latter as to unrelated processes and as to service state management is a huge problem that was not necessarily approached insightfully.
The main authors of D-Bus and systemd seem to me idiot savants in being clever but prone to mindless hacking, and I have the impression that there was a considerable lack of insight into the UNIX architectural choices and excessive admiration for Microsoft style designs (down to details like using .ini MS-DOS style configuration files).
Most UNIX daemons are not written to offer a distinction between process start and service readiness, and between service shutdown and process end. Therefore any attempt to work around this, by using wrappers or by rewriting them, is going to be quite awkward (usually sleeping for some kind of estimated startup time).
Where wrappers are not sufficient that may mean having to rewrite them to use D-Bus, which is itself complex and awkward, and whose standard interfaces and service state model are not necessarily thoughtfully designed.
Given all the overwrought complexity, the quantity and quality of the documentation seems to me not necessarily at the level of completeness and quality of the original UNIX documentation either, which is a pity as the D-Bus and systemd infrastructure is overriding a lot of the simpler and better documented underlying UNIX architecture.

Note: systemd is thus a large and complex program dependent on a lot of complex libraries:

$ sudo lsof -p 1 |& grep -w REG | sort -k 9
  systemd   1 root  mem       REG              259,5  2454496    1616052 /lib/systemd/libsystemd-shared-245.so
  systemd   1 root  txt       REG              259,5  1620224    1616398 /lib/systemd/systemd
  systemd   1 root  mem       REG              259,5   191472     465694 /lib/x86_64-linux-gnu/ld-2.31.so
  systemd   1 root  mem       REG              259,5   133200     465682 /lib/x86_64-linux-gnu/libaudit.so.1.0.0
  systemd   1 root  mem       REG              259,5   351352     466666 /lib/x86_64-linux-gnu/libblkid.so.1.1.0
  systemd   1 root  mem       REG              259,5  2029224     466299 /lib/x86_64-linux-gnu/libc-2.31.so
  systemd   1 root  mem       REG              259,5    27064     465688 /lib/x86_64-linux-gnu/libcap-ng.so.0.0.0
  systemd   1 root  mem       REG              259,5    31120     465689 /lib/x86_64-linux-gnu/libcap.so.2.32
  systemd   1 root  mem       REG              259,5   202760     465692 /lib/x86_64-linux-gnu/libcrypt.so.1.1.0
  systemd   1 root  mem       REG              259,5   454192     369757 /lib/x86_64-linux-gnu/libcryptsetup.so.12.5.0
  systemd   1 root  mem       REG              259,5   431472     465699 /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
  systemd   1 root  mem       REG              259,5    18816     466300 /lib/x86_64-linux-gnu/libdl-2.31.so
  systemd   1 root  mem       REG              259,5   137584     465710 /lib/x86_64-linux-gnu/libgpg-error.so.0.28.0
  systemd   1 root  mem       REG              259,5   162264     465747 /lib/x86_64-linux-gnu/liblzma.so.5.2.4
  systemd   1 root  mem       REG              259,5  1369352     466301 /lib/x86_64-linux-gnu/libm-2.31.so
  systemd   1 root  mem       REG              259,5   387768     465707 /lib/x86_64-linux-gnu/libmount.so.1.1.0
  systemd   1 root  mem       REG              259,5    68320     465717 /lib/x86_64-linux-gnu/libpam.so.0.84.2
  systemd   1 root  mem       REG              259,5   157224     466313 /lib/x86_64-linux-gnu/libpthread-2.31.so
  systemd   1 root  mem       REG              259,5    40040     466315 /lib/x86_64-linux-gnu/librt-2.31.so
  systemd   1 root  mem       REG              259,5   163200     465772 /lib/x86_64-linux-gnu/libselinux.so.1
  systemd   1 root  mem       REG              259,5   178528     114923 /lib/x86_64-linux-gnu/libudev.so.1.6.17
  systemd   1 root  mem       REG              259,5    30936     389164 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
  systemd   1 root   10r      REG               0,23        0   13427082 /proc/1/mountinfo
  systemd   1 root   14r      REG               0,23        0 4026532084 /proc/swaps
  systemd   1 root  mem       REG              259,5    39088     789550 /usr/lib/x86_64-linux-gnu/libacl.so.1.1.2253
  systemd   1 root  mem       REG              259,5    80736     789565 /usr/lib/x86_64-linux-gnu/libapparmor.so.1.6.1
  systemd   1 root  mem       REG              259,5    34872     789575 /usr/lib/x86_64-linux-gnu/libargon2.so.1
  systemd   1 root  mem       REG              259,5  2954080     789770 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
  systemd   1 root  mem       REG              259,5  1168056     790022 /usr/lib/x86_64-linux-gnu/libgcrypt.so.20.2.5
  systemd   1 root  mem       REG              259,5   129096     790260 /usr/lib/x86_64-linux-gnu/libidn2.so.0.3.6
  systemd   1 root  mem       REG              259,5    35440     790279 /usr/lib/x86_64-linux-gnu/libip4tc.so.2.0.0
  systemd   1 root  mem       REG              259,5    67912     790309 /usr/lib/x86_64-linux-gnu/libjson-c.so.4.0.0
  systemd   1 root  mem       REG              259,5   104656     790328 /usr/lib/x86_64-linux-gnu/libkmod.so.2.3.5
  systemd   1 root  mem       REG              259,5   129248     790402 /usr/lib/x86_64-linux-gnu/liblz4.so.1.9.2
  systemd   1 root  mem       REG              259,5   584392     938151 /usr/lib/x86_64-linux-gnu/libpcre2-8.so.0.9.0
  systemd   1 root  mem       REG              259,5   133568     789590 /usr/lib/x86_64-linux-gnu/libseccomp.so.2.5.1
  systemd   1 root  mem       REG              259,5  1575112     938483 /usr/lib/x86_64-linux-gnu/libunistring.so.2.1.0

Therefore I am not very satisfied with either runit or systemd because while both work-ish enough to be viable:

runit is nice and robust but part of that is because it avoids some difficult issues.
systemd attempts to solve those difficult issues but does so incompletely and is ugly and fragile.
Neither involves a systematic, disciplined attempt to confront those difficult issues, for example by defining a clear and simple model of resource (not just service) states, and of unrelated process interactions.

While waiting for the attempt in the latter point, I feel that a system more sophisticated than runit, yet less awkward, monolithic and complex than systemd, would have been possible, based on something like the automounter and inetd, and more filesystem based like runit and Plan 9.

20210415 Thu: A simplified history of UNIX 'init'

For the sake of looking at a minimalist GNU/Linux distribution I have been running in a VM a live DVD image of Void Linux which notably uses musl libc and runit.

For various reasons this has finally prompted me to write about init implementation alternatives, starting with a bit of history here:

Traditional init

In traditional UNIX systems the init had very limited roles:

Become the parent of otherwise parentless processes and reap them when they terminate.
Start a shell script.

Things started to become more complex when in Edition 7 the initial shell script began to start daemons in the order in which they were mentioned, and startup happened in sequential phases called run levels where it was guaranteed that all daemons in a run level were started before any daemon in the next run levels.

System V init with native configuration

Because the daemons started by the Edition 7 initial shell script were unsupervised, when they started to multiply someone designed a new init that did some process supervision as configured in the /etc/inittab file. Supervision is very limited: the list of supervised processes to start is static, and processes in practice can just be started and optionally restarted if they terminate, and are started in the order in which they are listed in that /etc/inittab.

While the System V init documentation mentions run levels, they are not sequential phases: they are a far more general notion of run level states, which was in practice almost never used, where when transitioning from a run level state to another:

Any transition between run states is allowed, they are not sequential phases.
All daemons listed in the old state but not in the new state are stopped.
All daemons listed in the new state but not in the old state are started.
The daemons common to both old and new state are not touched.
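The transition rules above amount to simple set arithmetic, which can be sketched in Python as follows; the function name `transition` and the daemon names in the example are made up for illustration.

```python
def transition(old_state, new_state):
    """Compute what to do when moving from one run level state to another.

    Each state is the collection of daemon names listed for it.
    """
    old, new = set(old_state), set(new_state)
    return {
        "stop": sorted(old - new),   # listed in the old state only
        "start": sorted(new - old),  # listed in the new state only
        "keep": sorted(old & new),   # common to both, not touched
    }


print(transition(["cron", "getty", "sshd"], ["cron", "getty", "xdm"]))
```

Nothing in the computation depends on any ordering of the states, which is what makes them states rather than sequential phases.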

System V init with init.d/ scripts

Because of the limitations of the System V init someone modified it by adding an extra layer of self-supervision similar in part to that of the Edition 7 service starting shell scripts: apart from a few directly started daemons, most others were started by a per-daemon shell script under /etc/init.d/ with the following logic:

The script would accept options to start, stop, restart, reload the daemon.
On start the daemon processes would usually write their process number somewhere conventional.
The other options would usually be implemented by sending UNIX signals with that process number.
The scripts would be run initially with the start option in alphabetical order (so the name of most scripts would start with a sequence number).

The critical aspect of this scheme is that daemon supervision is, as in Edition 7 and previous schemes, no longer done by init (except for reaping) but by a per-daemon supervision script, according to a scheme unique to that daemon (even if usually based on signals sent to a stored process number).
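The usual logic of such a per-daemon script can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function names are hypothetical, the launcher rather than the daemon writes the process number, and the choice of SIGTERM for stop and SIGHUP for reload is a common convention rather than a rule.

```python
import os
import signal
import subprocess


def start(argv, pidfile):
    """Start the daemon and record its process number in pidfile."""
    proc = subprocess.Popen(argv)
    with open(pidfile, "w") as f:
        f.write(f"{proc.pid}\n")
    return proc


def send(pidfile, sig):
    """Send a signal to the process number stored in pidfile."""
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, sig)
    return pid


def stop(pidfile):
    """Ask the daemon to terminate and forget its process number."""
    pid = send(pidfile, signal.SIGTERM)
    os.remove(pidfile)
    return pid


def reload(pidfile):
    """Ask the daemon to reread its configuration (by convention SIGHUP)."""
    return send(pidfile, signal.SIGHUP)
```

The weakness is visible in the sketch: nothing notices if the daemon dies on its own, and a stale stored process number may by then refer to an unrelated process.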

Overall the critical differences are in daemon supervision, and in particular in two aspects:

How individual daemons are controlled:
- Merely started in traditional init.
- Started or stopped or restarted according to static configuration in System V init with native configuration.
- Started or stopped or restarted with their own supervision script in System V init with /etc/init.d/.
How dependencies among daemons are handled:
- All daemons in one run level are started before all daemons in the next sequential run level in traditional init, and there are no explicit dependencies.
- Daemons are grouped in run level states, with no predefined sequence of states, in System V init with native configuration, and there are no explicit dependencies.
- Daemons are started in a specific locally defined order in System V init with /etc/init.d/, and the order is defined by dependencies listed in a comment in their supervision scripts.
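As an illustration of the last case, the start order can be derived from the dependency comments by a topological sort. The sketch below assumes LSB-style comment headers with Provides and Required-Start fields, which is one common convention; the function names are hypothetical and real tools handle many more fields and virtual facilities.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later


def parse_header(script_text):
    """Extract the fields of an LSB-style comment header as lists of words."""
    fields = {}
    inside = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            inside = True
        elif line.startswith("### END INIT INFO"):
            break
        elif inside and line.startswith("#") and ":" in line:
            key, _, value = line.lstrip("# ").partition(":")
            fields[key.strip()] = value.split()
    return fields


def start_order(scripts):
    """scripts maps a script name to its text; returns names in start order."""
    provides = {}   # facility name -> script that provides it
    requires = {}   # script name -> facilities it needs started first
    for name, text in scripts.items():
        header = parse_header(text)
        for facility in header.get("Provides", [name]):
            provides[facility] = name
        requires[name] = header.get("Required-Start", [])
    graph = {
        name: {provides[f] for f in needed
               if f in provides and provides[f] != name}
        for name, needed in requires.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

The resulting order could then be turned into the sequence numbers that prefix the script names. Note that this still orders daemons by when they are started, not by when they are ready.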

Then in BSD UNIX and later a second daemon starting mechanism was added for daemons contactable via IPC: inetd (and its successor xinetd), which has the remarkable property of having the option to start a daemon lazily on access rather than pre-starting it at some initial point, and the even more remarkable implicit property that, since access to its daemons is via IPC, dependencies among daemons are both automatic and defined by daemons being ready (replying to IPC) rather than merely started.

The latter point is quite important: in all previous schemes ordering and dependencies of daemons were defined by the starting of a daemon, rather than by it being ready, and most daemons have an initialization phase before becoming ready. It is entirely possible that if daemon A is a dependency of daemon B and daemon A is started before daemon B, when daemon B is ready daemon A is not ready yet.
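The difference between started and ready can be made concrete with a small sketch: a dependent that wants a ready dependency has to check that it actually replies, for example by polling its service socket. The function name and the polling parameters here are assumptions for illustration only.

```python
import socket
import time


def wait_ready(host, port, timeout=5.0, interval=0.1):
    """Return True once host:port accepts a TCP connection, False on timeout.

    A daemon that has merely been started may not be listening yet; one
    that accepts connections has at least got as far as opening its socket.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Accepting a connection is itself only an approximation of being ready, but it is much closer to it than the mere fact that the process was started.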

Fortunately the case where a daemon started before another becomes ready after the one started after it is rare. The only major cases of daemons that take a while to get ready are network service daemons, for which that is not a great problem, and storage activation; in most UNIX like systems storage activation commands complete only when activation is ready, and they are usually run before all other daemons.

To avoid that, something like inetd was created for network services, and the various types of automounter services were created to activate storage lazily.

2021 March

20210319 Tue: Account and identity, PulseAudio, etc.

In UNIX like systems user authorization is based on numeric user (and group) identifiers, but there are also user (and group) names that define accounts, that is bundles of resources and configurations. While a user name cannot be associated with multiple user numbers, more than one user name can have the same user number, that is the same authorizations, even if this situation is rare.

Despite that, some programs try to canonicalize user names by resolving them to user numbers and then looking up the user number to get back a user name, which can then be different from the original user name used for logging in.
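The round trip can be sketched with the standard password database lookups. This is a minimal illustration, not the code of any particular program; the function name is hypothetical and the lookups are passed as parameters only so that the example can be tried against a made-up database.

```python
import pwd


def canonical_name(login_name, getpwnam=pwd.getpwnam, getpwuid=pwd.getpwuid):
    """Map a user name to its number and back to a user name.

    If several user names share one user number, the name that comes back
    is whichever entry the number lookup finds first, which need not be
    the name that was used to log in.
    """
    uid = getpwnam(login_name).pw_uid
    return getpwuid(uid).pw_name
```

With two accounts sharing one number the result for the second name is the first name, and with it the first account's home directory.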

In particular this is done by the PulseAudio daemon, which then creates its service socket in the wrong home directory, at least the wrong one by the logic of many other programs that simply use the user name they find in the environment.

The workaround in that particular case is to hardcode the service socket path in the files:

$ cat .config/pulse/default.pa
load-module module-native-protocol-unix auth-anonymous=0 socket=/tmp/pulse-username
.include /etc/pulse/default.pa

$ cat .config/pulse/client.conf
default-server=unix:/tmp/pulse-username

Note: that configures a socket file under /tmp/ rather than under the user home directory because often the home directory is accessed via NFS or another network shared filesystem type, and sockets on those sometimes have less desirable behaviour.

20210306 Sat: Extension snaps and packages

It seems quite funny to find a post about Ubuntu Snap containers with the following argument:

Better yet, you can use extensions, a framework designed to make snap usage more consistent, faster, and easier. We talked about extensions last year, with the KDE extension as an example. Similarly, there are several other supported extensions, including GNOME, ROS and Flutter.

In addition to having snaps behave in a more predictable way, extensions help you gain – or rather lose – size! [...]

For instance, the KDE’s KCalc snap, which typically weighs around 100 MB as a standalone application, comes in at a very small, neat 972 KB – a 99% reduction from the original target and a number worth the 1980s gaming scene. Of course, the necessary libraries still need to exist somewhere – and they are contained in the KDE frameworks snap, which is used for all KDE applications.

That prompts the question of how that is different from putting the KDE runtime environment in a single ordinary package and then having an ordinary kcalc package depend on that.

2021 February

20210213 Sat: Some simple data on filesystem type allocators

Previous posts show that I am quite interested in filesystem types (1, 2, 3, etc.) and storage (1, 2, 3, etc.) and that is because I think that they are really important; in part because in several situations I have seen there were terrible storage systems, which caused significant delays to users.

Note: however I have seen several other situations in which the storage systems were also terrible but this did not matter much because the requirements were so low.

So recently I wanted to get a first idea of how good allocators are at avoiding fragmentation of file layout, at least in the simplest case, filling up a filesystem. In this case I copied the root filesystem of a laptop, which is on a 2GB/s NVMe stick (so reading it has negligible impact on the speed of the test), to an old slow 5400RPM 2.5in 250GB HDD over USB3, as I like to see how things go with high latency, slow transfer rate devices (it tops out at around 80MB/s); fast SSDs are too easy.

The filesystem tree takes around 81GiB and the partition in which it resides had a capacity of around 88GiB; being a root filesystem tree it has lots of small files, but I have added some subtrees also with somewhat large files (just several GiB though). I have done the copies with rsync -axHOJ after freshly formatting the filesystem. Then I have used a little script with find and filefrag to count the extents in files larger than 16KiB and ordered the results by number of fragments:
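The counting step can be sketched roughly as below. This is not the script actually used, only an equivalent outline: the function names are hypothetical, it assumes the usual "name: N extents found" summary lines printed by filefrag, and unlike find with -xdev it does not stop at mount points.

```python
import os
import re
import stat
import subprocess

# Summary line printed by filefrag for each file.
SUMMARY = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def large_files(root, min_size=16 * 1024):
    """Yield regular files under root that are larger than min_size bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_size > min_size:
                yield path


def parse_filefrag(output):
    """Turn filefrag summary lines into (extent count, path) pairs."""
    pairs = []
    for line in output.splitlines():
        m = SUMMARY.match(line)
        if m:
            pairs.append((int(m.group("count")), m.group("path")))
    return pairs


def extent_counts(paths, batch=500):
    """Run filefrag on the paths in batches (usually needs root)."""
    paths = list(paths)
    pairs = []
    for i in range(0, len(paths), batch):
        done = subprocess.run(["filefrag", *paths[i:i + batch]],
                              capture_output=True, text=True, check=False)
        pairs.extend(parse_filefrag(done.stdout))
    return pairs


def format_report(pairs):
    """Format pairs as 'count<TAB>path' lines ordered by extent count."""
    return "\n".join(f"{n}\t{path}" for n, path in sorted(pairs))
```

Something like format_report(extent_counts(large_files("/mnt/sdc6"))) would then give a list in the same shape as the ones that follow, of which only the tail is shown.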

#  tail -n5 frag*
==> frag-sdc6-bcachefs <==
5       /mnt/sdc6/loc/data/fonts/lm1.106bas.zip
5       /mnt/sdc6/usr/lib/llvm-11/lib/libclang-cpp.so.11
5       /mnt/sdc6/usr/share/keyrings/debian-keyring.gpg
5       /mnt/sdc6/usr_src/pkg7/Cyberbit/CyberCJK.ZIP
6       /mnt/sdc6/usr/share/skypeforlinux/resources/app.asar.unpacked/modules/slimcore.node

==> frag-sdc6-btrfs <==
3       /mnt/sdc6/loc/data/distrib/gentoo-amd64-20190615-zfs-0.8.1.iso
3       /mnt/sdc6/opt/brave.com/brave/brave
5       /mnt/sdc6/var_data/recoll/xapiandb/termlist.glass
10      /mnt/sdc6/var_data/recoll/xapiandb/postlist.glass
14      /mnt/sdc6/var_data/recoll/xapiandb/position.glass

==> frag-sdc6-ext4 <==
10      /mnt/sdc6/loc/data/distrib/systemrescuecd-amd64-6.1.3.iso.part
12      /mnt/sdc6/loc/data/dbase/freedb-complete-20140101.tar.bz2
12      /mnt/sdc6/loc/data/distrib/gentoo-amd64-20190615-zfs-0.8.1.iso
15      /mnt/sdc6/var_data/recoll/xapiandb/docdata.glass
32      /mnt/sdc6/var_data/recoll/xapiandb/position.glass

==> frag-sdc6-f2fs <==
17      /mnt/sdc6/loc/bsp/xwi6/worldofpadman.run
27      /mnt/sdc6/loc/data/dbase/freedb-complete-20140101.tar.bz2
29      /mnt/sdc6/var_data/recoll/xapiandb/termlist.glass
32      /mnt/sdc6/var_data/recoll/xapiandb/position.glass
38      /mnt/sdc6/var_data/recoll/xapiandb/postlist.glass

==> frag-sdc6-jfs <==
12      /mnt/sdc6/usr/src/linux-oem-5.10-headers-5.10.0-1023/arch/mips/include/asm/octeon/cvmx-npei-defs.h
12      /mnt/sdc6/var_data/recoll/xapiandb/termlist.glass
14      /mnt/sdc6/usr/share/efitools/efi/ReadVars.efi
19      /mnt/sdc6/var_data/recoll/xapiandb/postlist.glass
43      /mnt/sdc6/var_data/recoll/xapiandb/position.glass

==> frag-sdc6-ocfs2 <==
467     /mnt/sdc6/usr/lib/firefox/libxul.so
491     /mnt/sdc6/usr/share/AAVMF/AAVMF_CODE.fd
513     /mnt/sdc6/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-3adda86a16d1040e.rlib
597     /mnt/sdc6/var/lib/bogofilter/wordlist.db
662     /mnt/sdc6/usr/share/code/code

==> frag-sdc6-reiserfs <==
464     /mnt/sdc6/var/lib/apt/lists/ftp.ch.debian.org_debian_dists_buster_main_source_Sources
661     /mnt/sdc6/var/lib/cdebconf/templates.dat
698     /mnt/sdc6/var_data/recoll/xapiandb/termlist.glass
2472    /mnt/sdc6/var_data/recoll/xapiandb/postlist.glass
4187    /mnt/sdc6/var_data/recoll/xapiandb/position.glass

==> frag-sdc6-xfs <==
2       /mnt/sdc6/loc/data/distrib/gentoo-amd64-20190615-zfs-0.8.1.iso
2       /mnt/sdc6/loc/data/distrib/systemrescuecd-amd64-6.1.3.iso.part
2       /mnt/sdc6/var_data/recoll/xapiandb/termlist.glass
3       /mnt/sdc6/var_data/recoll/xapiandb/position.glass
3       /mnt/sdc6/var_data/recoll/xapiandb/postlist.glass

Some comments in a meaningful order:

This is a crude metric for several reasons, one of which is that it also matters whether the extents are close together or far apart.
Number of extents has a significant influence not just on write rates but also and especially on read rates, as most UNIX file accesses are for full sequential reads of whole files.
The .glass files are the largest files in the filesystem, most of them some GiB long.
OCFS2 and ReiserFS: quite disappointing, and the average write rates have been quite low for both. What is weird with OCFS2 is that its most fragmented files are not the largest, nor clustered by copy time; this suggests that its allocator has a fragmented free list. What is not weird with ReiserFS is that the high degree of fragmentation tends to be on the largest files; what is weird is that the degree of fragmentation is so high, as if the allocator had an upper limit on the amount of contiguous space it will allow; however looking at segment offsets shows they are fairly nearby and tend to be sequential.
JFS and XFS: JFS is one of my favourite filesystems, and XFS is quite popular despite its complexity, and both do very well on fragmentation. The write transfer rates at 28.7MB/s for either are not that good, but looking at statistics and write rates while writing many smaller files, which are quite low, I think it is because of updating inodes and other metadata (which has to be done synchronously because of POSIX metadata rules). I do not blame them for that.
bcachefs, Btrfs, F2FS: these COW filesystem types have very nicely unfragmented allocation, and they also had very good write rates. By looking at write rates, which do not drop much even when writing lots of small files and comparing with ext4, JFS and XFS where that happens to varying degrees, I think that is because COW filesystems do not update metadata like inodes and internal trees in-place and thus are able to keep streaming. Alternatively they need less stringent persistence (as per POSIX and metadata updates) thanks to COW updates being both journaled and being effectively a journal.
ext4: pretty good like JFS and XFS, but with a higher average write rate at around 48.9MB/s. Since it too has to update metadata in place that is fairly suspicious. My guess is that, as reported at the time of the famous O_PONIES controversy (1, 2, 3, etc.), its default persistency conditions are weaker than those of JFS and XFS (in particular I wonder about the auto_da_alloc option, but it is just a guess).

There have been recent indications that (sequential) COW systems (filesystems or databases) end up with a lot of write amplification, but at least in this case the results are not bad, because the write amplification is sequential, and consumes bandwidth instead of much scarcer random IOPS, and seems to save even more IOPS by reducing updates in-place for metadata.

Overall as usual I like JFS, F2FS, and also Btrfs (even if I would not use its storage management aspects), and bcachefs looks very promising too (I like most of its storage management aspects but they are not fully mature yet).

2021 January

20210123 Sat: Rough sizes of main Linux filesystem modules end of 2020

So I have been looking again at filesystems and rough measures of their complexity in the Linux kernel version 5.10 for bcachefs, and here are the sizes of the sources and those of the compiled code:

# D="udf jfs nilfs2 reiserfs gfs2 f2fs ocfs2 bcachefs ../drivers/md/bcache xfs btrfs"
# for N in $D ext4; do L=`cat $N/*.[chsyl] | wc -l`; echo -e "$L\t$N"; done | sort -k1n
11297   udf
17656   ../drivers/md/bcache
22793   nilfs2
31976   gfs2
32384   jfs
32397   reiserfs
41828   f2fs
59809   ext4
67184   bcachefs
67474   xfs
70832   ocfs2
133746  btrfs
# for N in $D; do size $N/*.ko; done | sort -u -k1n
   text    data     bss     dec     hex filename
   2472    1228       0    3700     e74 ocfs2/ocfs2_stack_o2cb.ko
   4982    2224      38    7244    1c4c ocfs2/ocfs2_stackglue.ko
   6023    1420       8    7451    1d1b ocfs2/ocfs2_stack_user.ko
  87502    6592       8   94102   16f96 udf/udf.ko
 157573    3364     968  161905   27871 jfs/jfs.ko
 157754   10908      48  168710   29306 nilfs2/nilfs2.ko
 169614   25283     152  195049   2f9e9 ../drivers/md/bcache/bcache.ko
 202212    3588    4368  210168   334f8 reiserfs/reiserfs.ko
 261705  121432   33048  416185   659b9 gfs2/gfs2.ko
 474565   76180     560  551305   86989 f2fs/f2fs.ko
 679404  155492     240  835136   cbe40 ocfs2/ocfs2.ko
 689345   44535      72  733952   b3300 bcachefs/bcachefs.ko
 823525  252029     384 1075938  106ae2 xfs/xfs.ko
 984066  113236   15040 1112342  10f916 btrfs/btrfs.ko

Note: the code size for ext4 is missing as it is compiled to be built-in.

Note: the total size of bcachefs includes that of bcache which it uses for low level storage management.

Also, separately, here are the not exactly but fairly reliably comparable sizes for OpenZFS:

# cat `find git-zfs/include git-zfs/lib git-zfs/module -name '*.[chyl]'` | wc -l
482914
# size /lib/modules/5.8.0-40-generic/kernel/zfs/zfs.ko
   text    data     bss     dec     hex filename
1917113   74096 1538072 3529281  35da41 /lib/modules/5.8.0-40-generic/kernel/zfs/zfs.ko

Note: these are quite rough measures of complexity and functionality, as source code line counts depend also on coding style, and compiled code size depends also on inlining.

Pretty obviously UDF, JFS, NILFS2, Reiser3, and even F2FS are in a class of their own: they are all sophisticated designs with full functionality and they are much, much simpler than OCFS2, bcachefs and especially XFS, Btrfs, ZFS.

In particular XFS seems to be very complex considering that it is largely a plain vanilla filesystem, without the complex parallel access logic of OCFS2 or the complex RAID and volume management logic of bcachefs, Btrfs and ZFS.

Given the enormous sizes of the latter I feel that it is a miracle that they work reliably, and that is probably mostly due to being around for a long time (except bcachefs which is relatively new).

My usual preference is for the simpler filesystems and in particular for JFS, but UDF and NILFS2 have interesting special features and work well too, and F2FS has some considerable special features and in particular it is heavily used in high end cellphones and tablets, so it can be expected to be well tested and maintained.

Of the more fantastically complex designs I have become more skeptical of XFS as while it is reliable its complexity is not matched by equivalent features (it was rumoured to have included five different B-tree implementations). I have been using mostly JFS and NILFS2, and also Btrfs quite a bit, at home, without using its questionable storage management design, and I have used ZFS often at various work places, because other people were familiar with it.

I have been using Btrfs (which is going to be the default for Fedora) in part because of its snapshotting (which I rarely use, as I keep a small series of backups that give pretty much the same ability to go back in time), but primarily for the checksums.

I am thinking of trying to stop using Btrfs at home and to use F2FS more, and, where I reckon that I want data checksums, to use bcachefs, as it seems quite good and has the big advantage of having a single main developer with good taste, and perhaps to continue using ZFS at work.