This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
This document contains unpublished drafts. Please do not refer to it or link to it publicly.
I shall illustrate some issues with Ansible, a configuration management system that is both popular and makes the issues clearer than most.
In Ansible the main entities and relationships are:
Customization of a configuration file depending on context can happen for example in the following ways:
The data that drives such conditionality can be expressed:
There are very many choices available, and not all choices
are equally valuable for minimizing dependencies and
maximizing maintainability. Writing tasks, templates, data
files, inventory groups with a whatever
approach is something that is very difficult to unravel
later.
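As a small hedged sketch of the choices involved (the role, group and variable names below are hypothetical), the same template can be customized per inventory group purely by where the driving data is placed:

```sh
# Hypothetical Ansible role layout: the value of "ntp_server" lives in
# group_vars/, so which configuration a host gets depends only on the
# inventory group it belongs to.
mkdir -p group_vars roles/ntp/tasks roles/ntp/templates

cat > group_vars/webservers.yml <<'EOF'
ntp_server: ntp1.example.com
EOF

cat > roles/ntp/templates/ntp.conf.j2 <<'EOF'
server {{ ntp_server }} iburst
EOF

cat > roles/ntp/tasks/main.yml <<'EOF'
- name: install ntp.conf customized per inventory group
  template:
    src: ntp.conf.j2
    dest: /etc/ntp.conf
EOF
```

The same effect could equally be obtained with a when: condition on a task or with per-host variables, and that is precisely the kind of choice that becomes hard to unravel later.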
| PID | VIRT | RES | DATA | SHR | CPU | %MEM | TIME+ | TTY | COMMAND |
|---|---|---|---|---|---|---|---|---|---|
| 6716 | 5383024 | 1.804g | 3642440 | 237080 | 1.6 | 23.7 | 328:24.73 | ? | /usr/lib/firefox/firefox -P def+ |
| 405 | 4799456 | 1.467g | 3542988 | 84496 | 0.0 | 19.3 | 491:40.11 | pts/9 | /usr/lib/firefox/firefox --new-+ |
| 27053 | 4052620 | 235516 | 3132780 | 45660 | 0.0 | 3.0 | 227:52.88 | ? | kdeinit4: plasma-desktop [kdein+ |
| 27035 | 3209436 | 167740 | 2492344 | 23948 | 0.0 | 2.1 | 254:37.13 | ? | kdeinit4: kwin [kdeinit] --repl+ |
| 3743 | 2155024 | 11236 | 1918840 | 0 | 0.0 | 0.1 | 2:23.27 | ? | akonadiserver |
| 2935 | 2102680 | 1620 | 2052944 | 0 | 0.0 | 0.0 | 0:00.86 | ? | /usr/sbin/console-kit-daemon --+ |
| 16194 | 2090552 | 2444 | 2064968 | 0 | 0.0 | 0.0 | 0:16.61 | ? | /usr/sbin/privoxy --pidfile /va+ |
| 3813 | 1674824 | 53280 | 832428 | 0 | 0.0 | 0.7 | 124:47.88 | ? | /usr/bin/krunner |
| 3613 | 1386240 | 208072 | 649288 | 10996 | 0.0 | 2.6 | 169:22.29 | ? | kdeinit4: kded4 [kdeinit] + |
| 3471 | 1161784 | 567036 | 161904 | 488544 | 8.1 | 7.1 | 1249:47 | tty9 | /usr/lib/xorg/Xorg :9 vt9 -dpi + |
| 3935 | 1056268 | 11620 | 493524 | 0 | 0.0 | 0.1 | 2:06.43 | ? | /usr/bin/knotify4 |
| 4437 | 1050408 | 233420 | 353716 | 18176 | 0.0 | 2.9 | 83:38.47 | pts/8 | emacs -f service-start |
As mentioned previously I have quite liked, as the general approach to connectivity redundancy and service mobility, the use of IP service addresses that are fully routed by themselves (fully unicast), with servers carrying those services running OSPF (or similar) to advertise their reachability. It is in practice like multicasting but without the added complications (and in effect OSPF distributes route updates by multicasting).
The main advantage of simply leveraging IP routing to distribute reachability updates is indeed simplicity: it requires no tunneling and no weird layer-2 tricks, there are plenty of troubleshooting aids for IP, routing is needed regardless, and in recent implementations it is very fast even for large numbers of individual addresses.
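As a minimal sketch of that approach (the addresses are placeholders and I assume Quagga's ospfd as the routing daemon), the service address is simply added to the loopback interface of whichever server carries the service, and advertised as a connected route:

```sh
# Put the fully-routed /32 service address on the loopback interface.
ip addr add 192.0.2.10/32 dev lo

# Quagga ospfd fragment: redistribute connected addresses into OSPF so
# that the routers learn where the service address currently lives.
cat >> /etc/quagga/ospfd.conf <<'EOF'
router ospf
 redistribute connected
 network 198.51.100.0/24 area 0.0.0.0
EOF
```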
However, the ability to use Ethernet addresses as identifiers and the quasi-broadcast nature of Ethernet, plus habit from the time when layer-2 switching was much faster than layer-3 routing, have meant that so-called layer-2 adjacency is highly prized even in its physically discontiguous variant.
Now how can physically discontiguous layer-2 adjacency in Ethernet be a thing? The idea is that adjacency is a property of physical links (the ether in the name) or at least of physical switches.
The usual answer is to somehow merge physical links or
switches to have VLANs spanning multiple areas, but that has
the downside that one needs lots of switches, or long cables,
and results in large broadcast domains, all issues that fully unicast addresses don't have.
Well, a pretty major datacenter product that all but requires layer-2 adjacency (1, 2) is VMware, which is popular despite the severe disadvantages of virtualization, and therefore in recent years major network vendors and entities have sought to support fine-grained, dynamic layer-2 adjacency. At first this was done by dynamic layer-2 forwarding, for example by virtual wires where layer-2 frames are encapsulated in other layer-2 circuits or virtual circuits, such as Ethernet-in-Ethernet (e.g. Q-in-Q) or Ethernet-in-whatever (e.g. VPLS, VxLAN layer 2), but the recent trend is routed Ethernet in various guises and usually disguises.
The most common technique (if that isn't too noble a word for what is base cheating) is encapsulation of layer-2 frames into layer-3 packets or even layer-4 datagrams which are then routed; techniques like SPB, layer-3 TRILL, or the Enterasys scheme.
Hence horrors like using IS-IS to route Ethernet frames over MPLS, and lesser horrors like the Enterasys scheme:
That is still amazingly perverse: in order to use Ethernet identifiers in a location-independent way, they are mapped onto IP addresses used as identifiers by associating each with a route, and then those routes are distributed by IP multicast, itself mapped onto multiple Ethernet broadcasts.
It is less bad in the straight Ethernet-to-IP-and-OSPF-and-back-to-Ethernet version than in the insane versions where MPLS and IS-IS are involved as an extra layer. The complexity of involving another protocol family and layer of encapsulation, with a potentially different logical topology, and one in effect circuit based rather than datagram based, has severe downsides, including a loss of security and safety because the extra complexity prevents easy auditability of configurations and traffic.
In a recent typical discussion about systemd on the usual LWN site, there is a comparison between the transition from System V init to systemd and the transition of the Linux executable format from a.out to COFF and ELF around 1995.
That transition, while far less damaging in the long term than the transition to dbus and systemd potentially is, was not wholly glorious either, because it was a mixed blessing at best: an unnecessarily complex mode of program execution that, along with other related changes, caused a lot of bloat and confusion.
The key issue was the handling of dynamically loaded and shared libraries, up from the (usually static ???
I am now using Btrfs for most of my filetrees, and it seems quite reliable in basic usage even with the Ubuntu LTS 14 kernel 3.13 and even better with the more recent backported kernels 3.19 (since ULTS 14.04.3) and 4.2 (since ULTS 14.04.4).
I use it primarily because of the checksumming, and
secondarily because of the copy-on-write, not because of the
storage management features. So I occasionally run the
btrfs scrub operation that verifies data, and
recently in around 5TB of data and backup filetrees it found 1 bit error. I checked the system logs, and since the error was not reported there it was obviously one of those silent errors, perhaps related to the usual situation that storage devices are typically rated for 1 unreported bit error per 10^14 bits (11.4TiB).
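For reference this is the sort of check involved (the mount point is a placeholder): the scrub runs in the background verifying checksums, and its status reports any mismatches found:

```sh
# Start a background scrub of the whole filesystem, then query its progress
# and the count of checksum errors found so far.
btrfs scrub start /fs/data
btrfs scrub status /fs/data
```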
As to the storage features they are not quite as strong as those of a dedicated storage layer like MD RAID but they are convenient sometimes. In general they help with the UNIX style of operating on trees rather than block devices.
As to the storage features the best is clearly the ability to create subvolumes cheaply, that is multiple roots in a filetree, as in NILFS2, ZFS and JFS (where however it is not really available under Linux); also to have cheap snapshots of subvolumes thanks to something like copy-on-write, as in NILFS2 and ZFS. The RAID10 implementation is also fairly usable and quite reliable.
The parity-RAID implementation is still unreliable, and it may even be a bit too flexible, as it will create parity stripes as wide as the number of devices that currently have available space, so if devices are of different sizes it will silently create stripes of different widths, down to 2 chunks for RAID5, one for data and one for parity (effectively equivalent to RAID1). Which is admirably complete, but rather more expensive than RAID1 of 2 devices or RAID10 of 3 devices.
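A hedged example of how the profiles are chosen and inspected (device names are placeholders); the data and metadata profiles are set independently at mkfs time and btrfs filesystem df shows which ones are actually in use:

```sh
# Create a Btrfs volume with RAID10 data and RAID1 metadata across four
# devices, then check which block group profiles are in use.
mkfs.btrfs -d raid10 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mount /dev/sdb /fs/data
btrfs filesystem df /fs/data
```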
The big problem with Btrfs is that it is very, very complex, with a lot of code, and large opportunities for race conditions and corner cases. But the base functionality seems to have been quite reliable for a few years already. It is unfortunate that simpler filesystems like JFS or NILFS2 would require an on-disk format change to add checksums.
The two desktops and the laptop at home had Ubuntu 12 LTS installed, and I decided yesterday to upgrade that to Ubuntu 14 LTS without waiting for Ubuntu 16 LTS, in part because Ubuntu 14 LTS is the last LTS version that does not use systemd and it will be maintained until 2019.
I decided to do an in-place upgrade, by switching to the trusty archives instead of the precise archives. I expected some difficulty as those systems have many packages installed (around 5,000), many of them just-in-case or to try them out, and some of them backports or from less official sources.
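The switch itself is mechanically simple (a sketch, assuming the stock sources.list layout); the time goes into what follows:

```sh
# Point APT at the trusty archives instead of the precise ones,
# then do the in-place upgrade.
sed -i 's/precise/trusty/g' /etc/apt/sources.list
apt-get update
apt-get dist-upgrade
```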
I did not expect it to take so long... First I did a test in-place upgrade on the desktop with the fewest packages installed: that started the day before yesterday around 1pm and was mostly finished by 8pm, with some more tweaking yesterday. The desktop with many more files I started around 3pm yesterday and it was mostly finished around noon today. The laptop upgrade started around 7pm yesterday and was mostly finished around 10am earlier today. While I left downloads running during the night some of the night hours were idle.
The biggest reason for the long time to upgrade, as the difference between the desktops and the laptop shows, is extremely slow file operations on disk drives: the laptop has a flash SSD, which made deleting old files and unpacking archives and writing them out far quicker. The issue is that the root filetree has lots of small files, in my case over 700,000 in around 18GB of data; nearly 500,000 of these are less than 4KiB in size, and 300,000 are less than 1KiB in size. Indeed on the desktops I could hear the constant seeking of the disk drive during the upgrade. Many of these files should be part of archive files.
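The file counts above can be reproduced with something like the following (the figures will obviously vary per system):

```sh
# Count files in the root filetree only (-xdev avoids /proc, /sys and
# other mounted filetrees), then those smaller than 4KiB and 1KiB.
find / -xdev -type f | wc -l
find / -xdev -type f -size -4k | wc -l
find / -xdev -type f -size -1k | wc -l
```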
This issue is more significant when using filesystems that implement fsync semantics properly, as DPKG also uses fsync properly, which is a good idea, but perhaps less so when there are very many updates involving very many small files. Yet another instance of the mailstore problem.
Two other reasons that involved quite a bit of time were sort of inevitable: I specified that I wanted to be asked whether to keep or overwrite existing configuration files, and my choice of having backported or unofficial packages required some manual handling.
But the second biggest reason for the long time to upgrade, which was of course even bigger on the laptop, was that the Debian and Ubuntu packaging format and related tools (DPKG is the package manager, APT is the dependency management system) have some large flaws:
Reading about adapting some aspects of GUI in Ubuntu's Unity design to portable device screens:
Today’s scrollbars are optimized for cursor driven UI but they became easily unnecessary and bulky on touchable and small screen devices. In those cases, optimization of the screen’s real-estate becomes essential. Other platforms optimized for touch input like Android and iOS are already using a light-weight solution visible only while dragging the content.
This is interesting because there is one important detail that GUI designers usually get wrong: as I have pointed out, given the current skewed aspect ratios of monitors, GUIs should put their decorations and items mostly on the side where there is most space. To add to that earlier post, I was amused to see that it is an issue for others too, as shown by this point that vertical maximization matters (I use it a lot too) to make the most of the limited height of landscape-skewed displays.
So what's the solution? As I mentioned, in practice the only options are either to use a GUI framework that allows moving most GUI elements to the left or right, or to get a taller display.
As to GUI elements I should add that KDE also has
repositionable icon bars (called toolbars
by KDE) that can be moved to any side, including left and
right, and I do that for most.
This is particularly important with a 16:9 aspect ratio, such as 21in to 23in 1920×1080 displays, and even more on laptop 12in to 14in 1366×768 displays, as vertical space is so limited. The issue is less urgent on 1920×1200 displays, as the aspect ratio is less skewed, or on 27in or larger 2560×1440 displays, as they have rather more vertical space both in pixels and in physical extent.
In the previous discussion about resource and daemon states in mentioning upstart it was pointed out that it implements a generic dependency mechanism, and that the policy can be either to start daemons when their dependencies become available, as in dependency push, or to request the availability of those resources before starting a daemon, as in dependency pull. The two policies are not equivalent in two important ways:
Which means that daemon managers should be based on a pull logic. The obvious example of that is the classic inetd daemon manager that only starts a daemon if its service port is accessed. Put another way, daemon managers should be more like make files than (parallel) sh scripts.
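A classic inetd.conf entry captures the pull logic: nothing runs until a connection arrives on the service port, and only then is the daemon started (the telnet service here is just a familiar example):

```sh
# Append a traditional inetd.conf entry: in.telnetd is only started when
# a connection arrives on the telnet service port.
cat >> /etc/inetd.conf <<'EOF'
telnet stream tcp nowait root /usr/sbin/in.telnetd in.telnetd
EOF
```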
In other words when the system starts the only process started should be the daemon manager, and nothing else should happen by itself. No driver should be loaded, no disk should be recognized, no other process should be started until the daemon manager senses a request for service on one of the service ports it monitors. These may be serial lines, consoles, IP ports.
For example if a SAK is detected on a console the login daemon should be started on it, and this may try to access the /usr filesystem, and that might trigger a mount, which might trigger an access to the NFS server, which can then trigger bringing up the relevant IP interface, which may load the appropriate driver module, and its configuration.
In the UNIX tradition the make style logic to
pull
resource activation has been
implemented in several different ways other than
inetd, for example getty for text consoles,
xdm for X11 displays, amd and
autofs for mounting filesystems.
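For example a minimal autofs setup (server and path names are hypothetical) mounts nothing until some process first accesses a path below the mount point:

```sh
# Hypothetical autofs maps: /net/data is mounted from the NFS server only
# on first access below /net/data, and unmounted again after a timeout.
cat >> /etc/auto.master <<'EOF'
/net /etc/auto.net --timeout=300
EOF
cat > /etc/auto.net <<'EOF'
data -fstype=nfs,ro nfsserver:/export/data
EOF
service autofs reload
```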
In general the latter looks to me the better model: that services be represented in the filetree as mountpoints and accesses below that mountpoint should trigger activation of the relevant service daemon. While it would be nice to have a clear single model of daemon activation and
In the recent and not so recent past I have noticed cases where startup or other actions were done in the wrong order or at least at the wrong time:
Both instances are symptoms of a general issue which has little to do with dependency based service and daemon management, even if somewhat related.
The issue is that UNIX-style systems (like many others) have no well defined conceptual model of resource states and transitions, never mind managing their dependencies.
Which makes me remember how the MUSS operating system had a well defined, realistic and still simple, notion of which states devices (applicable to many resources) could be in and the allowed transitions (plus a very well designed system for interprocess signaling and communication), and that was many decades ago.
In the absence of a well defined model of states and transitions talking about daemon management is a bit vacuous, but there is another specific issue that applies regardless, and it is which types of dependencies ought to be managed.
There are service and daemon management systems like upstart which are explicitly designed to be event receiving and sending: when an event is received by upstart a process is instantiated, but processes can send events at any time, not just when they end, and receive them also at any time.
This is a fairly low level mechanism because it can be used to implement fairly arbitrary dependency policies, either those where when a resource becomes available a daemon can be started to use it, or those where when a daemon needs a resource it can wait until it becomes available.
The difficult part thus is not in the mechanism that allows daemons to depend on other daemons or resources, but in designing well structured policies for representing those dependencies, and this still depends on a clean conceptual model of resource and daemon states.
One of the more fundamental aspects of the Microsoft style of system design that is being adopted more and more into GNU/Linux systems is the avoidance of fixing issues with existing software, replaced by working around those issues with layers of wrappers.
This is often well motivated, mostly by organizational issues: it is often very difficult to get existing software modified, in part because of territorialism by existing maintainers and in part because of the fear of the possible consequences on stability.
It is sometimes necessary to combine several diverse programs to achieve a goal; for example recently I set up a system with an AFS fileserver relying on a filetree stored on a DRBD storage area relying on IPsec for data privacy; or my laptop with a cellular network connection needs a driver for the USB cellular modem, a program to enable it with a PIN, a PPP daemon to establish a connection and a firewall script to delimit it.
In all these cases it is very hard to document or automate how to operate each setup. The difficulty is that each of the components has several states, and the possible combinations of states across all components can be many, thus requiring variable but significant paths to reach a desired end state.
To some extent this is unavoidable in a system made of several moving parts: a lot of component states must align the right way to start up a GUI session on my desktop too.
The abundance of possible state combinations is what makes finding the cause of a fault difficult and lengthy: difficult because the fault finder must have a mental model of all the possible state combinations, and lengthy because it can be pretty time consuming to explore them to figure out which part is in a faulty state.
The abundance and intricacy of the dependencies among combined states for non-trivial systems requires problem solving when trying to get from a given current state to a desired other state, which can be for example having a browser capable of accessing the Internet over a PPP connection via a 3G modem.
However I have noticed that a lot of people try to avoid problem solving, and they want instead a series of steps that they can follow mindlessly to produce the result they desire, a way of proceeding that in management jargon is called deskilling.
I have noticed this quite often in IRC channels or on blog posts or self-help books targeted at practitioners, which tend to offer simple, linear N-step copy-and-paste procedures to do things that are actually very subtle, such as how to configure GNU/Linux for high-speed 3D graphics, or MS-Windows for large distributed storage setups. The usual context is the notion that busy people don't have time to understand what they are doing, and they just want to be able to do very difficult and complicated things in 12 easy steps.
But doing difficult and complicated things usually involves many layers of abstraction and implementation, and several interfacing parts, each with many states.
It is sometimes possible to wrap these with simple processes for people to follow or scripts for computers to execute, but it is almost never possible to do so in a failproof way: because handling potential mistakes or failures cannot be done as a rule with any simple process or script. And it is a big challenge to devise a suitable complex process or script that can handle every possible mistake or error state, because when many state combinations are possible, usually only a small number are good.
This is why people who attempt to configure non-trivial network setups with NetworkManager under GNU/Linux or the MS-Windows Control Panel, even within the limits of those tools, often run into significant confusion: those tools are designed to hide the complexities of the states of the lower level tools they use, but this cannot be done fully because those states are far more complicated than appears via the higher level tool, creating deep difficulties in investigation in case of problems, plus notable restrictions in what is achievable.
This is a problem even with simple wrappers like my firewall script, which, even if rather limited in what it can configure, cannot mask the far more complicated states beneath it; but then it does not pretend to.
The solution for me is to define clearly the expected states, in an economical way and with a view to minimizing interactions with other states so as to reduce the number of possible combinations of states; plus to show how operations change states, and to give guidelines on how to build sequences of operations to return to a desired state from an invalid one.
In my experience and from many sources the key to the learnability of a topic is for learners to form a mental model of how it works.
As to storage insanities (for example 1, 2, ...) one of the most common syntactic delusions is about single large storage pools with very many small files, and there is a recent excellent example:
Subject: Is XFS suitable for 350 million files on 20TB storage?
Hi,
i have a backup system running 20TB of storage having 350 million files. This was working fine for month.
But now the free space is so heavily fragmented that i only see the kworker with 4x 100% CPU and write speed beeing very slow. 15TB of the 20TB are in use.
Overall files are 350 Million - all in different directories. Max 5000 per dir.
The author is well meaning of course, and seems to be a strong adherent of the syntactic approach, which is to consider all syntactically valid configurations equally plausible and desirable from a technical point of view, but he is having second thoughts after some months.
Regrettably most syntactically valid configurations
don't work that well
(and it takes insight and skill to understand which ones do
and why) and in particular
single large free space pools
don't scale, except at the beginning when they are empty and
there is no load, and then things get interesting as storage
fills up and load increases.
Note: many companies claim they offer very scalable large storage pools, both in size and speed; the most recent news I read is about a company quite suggestively called StoragePool.
The specifics of the madness above are moderately interesting:
Continuing the previous post on distance-scaled font glyphs, the conclusion was that scaling display DPI is the easiest option. But things are not as simple as that.
Old style X Windows configurations allow one to prepare several different potential display configurations at different pixel dimensions, and the DPI is recomputed accordingly, but that is not the right mechanism as it changes the logical pixel dimensions.
Various releases of the regrettable RANDR extension allow changing dynamically the reported DPI of the display, in which case the reported size of the display in millimeters (rather than its size in pixels) is scaled accordingly. This can be done with one of the following (the second form applies to a single output only):
xrandr --dpi 130
xrandr --dpi 130/VGA1
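The DPI currently reported by the X server, and the display size in millimeters it is derived from, can be checked with:

```sh
# Show the logical screen dimensions and the resolution the X server reports.
xdpyinfo | grep -E 'dimensions|resolution'
```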
Most applications read the current DPI from the X server only when they start, which is inconvenient. But desktop environments cause more trouble as some keep persistent state and this may include DPI and will be given to newly started applications. I have tried a few and have had mixed results:
With xterm using the FontConfig font system text is correctly scaled for the current DPI when it starts, but subsequent DPI changes are ignored.
With xterm using the old style X11 font system with a scalable DPI-independent (the -0-0- part) XLFD font specification there is no sensitivity to DPI changes either at startup or after.
This is probably because the laughable code that forces the DPI in a scalable font specification to be either 75 or 100 is still there.
With xterm using the old style X11 font system with a scaled, DPI-explicit XLFD font specification I do get at startup fonts scaled according to the DPI; the results are good for an outline font like Liberation Mono but of course much uglier for a bitmap font like -sony-fixed-*.
The GTK+ 2 library gets its settings from the ~/.gtkrc-2.0 file, from the X server resources, and from the X server itself.
It gets the font specification in particular from ~/.gtkrc-2.0, and that does not include the DPI, and gets the DPI from the Xft.dpi X resource if present else from the X server idea of DPI.
GTK+ 2.0 applications as a rule only read the DPI at startup.
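So a hedged way to influence newly started GTK+ 2 applications is to set that resource before launching them, for example:

```sh
# Set the Xft.dpi X resource; GTK+ 2 applications started after this will
# pick up the new DPI, while already running ones will not.
echo 'Xft.dpi: 130' | xrdb -merge
```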
Emacs font support is somewhat complicated because it can be built in two different X Windows supporting versions, one with the Lucid library, which only uses the X11 font system, and one with the GTK+ 2 library.
The font renderer in the GTK+ 2 version can use either the old style X11 font system or the new style FontConfig one, just like xterm, but in a haphazard way: which one is used depends on the syntax of the font specification given to commands like set-frame-font (which autocompletes only XLFD font specifications); while the menu-set-font dialog only recognizes GNOME style font names (similar but not the same as the FontConfig ones) and the mouse-select-font dialog only allows choosing X11 font system fonts.
With XLFD style fonts Emacs behaves like xterm with the same.
With GNOME style fonts it is impossible to specify DPI, so one must specify a large point size as in for example Liberation Mono 14 with set-default-font or choose explicitly a larger point size with mouse-set-font or the Options>Set Default Font dialog. At least this applies the default fonts to all Emacs content windows.
The above only applies to the text displayed by Emacs in its main panel. The menus and dialogs are rendered by GTK+ 2 and therefore the font sizes are fixed when Emacs starts like for other GTK+ 2 applications.
Konsole is a typical KDE application and has two different behaviours: when it is launched from an existing KDE wrapper like the application menu, or equivalently from the command line with kwrapper4, it does not notice changes in DPI, because the KDE session caches the DPI.
I haven't been able to find a way to refresh the cached value.
However when started on the command line on its own it does notice at startup the current DPI. Changes are not noticed.
However KDE offers Enlarge Font and Shrink Font commands accessible by the traditional key combinations C−+ and C−-. Regrettably this is on a per-window basis, so it does not scale all konsole instances once and for all.
I was reading a blog entry about user environments and somewhat accidentally it pointed out that there is a third type of user environment, which should have been obvious to me, even if I used to think that there were only two types of user environments:
The main difference in the above is interoperability:
Workspaces usually have within them a uniform high level interaction mechanism, which is what makes it easy to interact within the workspace. But it also makes it rather difficult to cross the workspace boundary, especially in the direction of programs outside the workspace interacting with processes and resources inside the workspace.
Open environments usually make it somewhat easy to interact among any element in the environment, but not as convenient as in workspaces; in particular interaction methods are often fairly low level and there are multiple incompatible high level interaction methods.
A recent entry in another blog (Debian or KDE planet) points out that there is a combined type, by saying that GNOME is evolving from an open environment to a launcher one, where the environment launches applications that are themselves in essence workspaces.
MS-DOS and MS-Windows were and are popular examples of this, and so are smart cellphones (while non-smart cellphones run workspace environments).
To some extent launchers combine the disadvantages of workspace and open environments: the users suffer from a much reduced ability to make applications interact, and often the only way is to dump data into a file and reread it from another application, if the file formats are vaguely compatible.
So the question is why are they so common, and I think that is because they are good for developers: precisely because of the isolation of the applications from each other, developers have a much simpler task, and can make each single application look best at what it does, rather narrowly. At the moment of choosing an application most users evaluate it narrowly too, for how well it looks and does whatever it is targeted at, and only later it becomes clear how interoperability is awkward.
The established UNIX tradition is to use hierarchical filetrees as keyword based classification systems, and since they are hierarchical there must be a priority among classification criteria. The established UNIX tradition is to order keywords into:
In this classification major trees are administered differently, minor trees are collections of optional material, and role trees aggregate files that might be used in a similar way.
Booting a system is often a delicate and fragile operation as it relies on the cooperation of several bits of software and firmware written by very different people and organizations, and recovery from boot problems can be quite awkward, precisely because until the system is booted very few tools are available.
For UNIX this usually meant having a boot program that is loaded and started by firmware and then loads and starts the kernel itself, which then loads and starts a shell script which, on the basis of user input, forks a set of services to enter either single-user or multi-user mode.
Early on this simple picture was slightly complicated by the introduction of a program called init that would be loaded and started by the kernel and it would fork itself in two, one fork to load and start the shell scripts. The existence of init was due to two technicalities of the UNIX system:
Configuring a system with 24×1TB 2.5in drives resulted in a choice of 4× (4+2) RAID6 sets. It is not as good as a RAID10 would have been, but considering the usage pattern (a lot of reading largish archived files) it is one of the few RAID6
Having discussed in the past the message store problem, my main points were in summary that:
But I was discussing this briefly some time ago as to the decision of a site to store messages for a Dovecot IMAP server in Maildir, and I was told it was not a problem for them, and I was astonished. Then I saw their racks full of 146GB 15,000RPM SAS disk drives and realized that they were willing to pay for the privilege, with a storage layer capable of massive random IOPS.
A disk tray with 16× 146GB 15K RPM SAS disks has a capacity of around 2TB, and has 16 disk arms where a single 2TB disk has only one. It might also have much higher transfer rates, perhaps not 16 times higher, but that is far less impressive than 16 independent positioning arms.
Because for rotating disk storage the achievable IOPS for page sized (4KiB) transfers is around 20-30,000 for sequential transfers and 100-150 for random ones, or alternatively transfer rates of 80-120MB/s sequential and around 0.5MB/s random.
There is a factor of over 100× between sequential and random transfer rates, and even a tray of 16 fast small drives only partially bridges that, even if it goes a long way. The overall result however is that it pays to arrange for data to be laid out sequentially, to the point that it is quicker to have files that contain coarsely selected data, read it all, and then keep only that which is relevant.
Which is one of the reasons why an archive containing many members is usually a win for collections of small, even weakly related data items like e-mail messages, as long as it pays to process them in bulk and TBC.
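A rough worked example with the indicative figures above: reading 10,000 messages of 4KiB each as individually, randomly placed files takes on the order of a minute or two, while reading them sequentially out of a single archive takes well under a second:

```sh
# 10,000 messages of 4KiB each, using the indicative figures above.
echo "random:     $((10000 / 125)) seconds at ~125 random IOPS"   # ~80s
echo "sequential: $((10000 * 4 / 1024))MiB at ~80MB/s, under 1 second"
```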
However an archive containing many members has one seemingly trivial issue: when it is being updated it must be locked to prevent interleaved updates.
One reason why the Maildir structure was introduced to replace the mbox archive format was indeed locking: many implementations of the POSIX/UNIX API implement file locking unreliably, especially over popular network filesystems, but implement reliably the implicit locking of operations over directories. It is important to understand that locking still occurs when adding or removing a message to the mail archive, as that is implemented by adding or removing a file in a directory, which means atomically updating the directory, so the following argument is false by omission of file between no and locks:
Why should I use maildir?
Two words: no locks. An MUA can read and delete messages while new mail is being delivered: each message is stored in a separate file with a unique name, so it isn't affected by operations on other messages.
An important detail is that it is therefore designed for the very rare case of a mail archive which is frequently written concurrently by multiple programs. What is notable is that Maildir was introduced together with an MTA, and in particular as its spool queue format and its delivery agent inbox format.
In other words for the case not of a mail archive, but a mail spool queue, and a fairly small one too. Then a spool directory might be expedient, even if not quite right either, and spool queues are fairly often implemented with members being individual files in a directory, instead of parts of an archive file. But again for small numbers of members, because as the newsspool story demonstrates even spools have been converted to a dense single file when they became large (in the case of the newsspools the spool is also accessed as an archive).
I heard an interesting presentation on distributed development where the issue is that IC related description and simulation files are big, are generated relatively frequently, and are part of a multisite workflow.
The talk was about the case where the IC description files
are generated centrally and used in other sites, and that the
critical attributes of the generated datasets
are:
There are a lot of similar use cases that result in similar situations:
Monte Carlo modeling.
In many situations there is a very considerable overlap between datasets close in time.
The discussion was mostly on how to reduce the time to replicate the datasets offsite, and to minimize the cost of storage.
As to that, using rsync with --link-dest and --fuzzy as in BackupPC or rsnapshot achieves large degrees of bandwidth and space reduction, and further space reduction may be achieved with similar more fine grained techniques like copy-on-write as in filesystems like ZFS, Btrfs, COW ext3, which is somewhat similar to log structured storage.
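A sketch of the rsnapshot-style invocation (paths and host are placeholders): files unchanged since the previous snapshot become hard links to it, and --fuzzy lets similar existing files be used as delta bases for renamed or regenerated ones:

```sh
# Replicate today's dataset, hard-linking anything unchanged relative to
# the previous snapshot and using similar files as delta bases.
rsync -a --delete --fuzzy \
      --link-dest=/backups/dataset.1/ \
      remotehost:/data/dataset/ /backups/dataset.0/
```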
But in the scenario above there is a relatively common and big issue: that IO involving 200GB over 1M inodes can be very slow, as in most cases the physical layout of the inodes does not correlate well with the logical access patterns within the data set.
Given the extremely different transfer rates of rotating disk storage between small and large records and between sequential and random access patterns, the dominant factor in this situation, both for backup and for loading and writing the data set, is to increase physical locality.
Unfortunately trying to save space by sharing unmodified files or by copy-on-write makes locality worse, often much worse:
A recent query about XFS allocation patterns is one of a class:
one of our customers has application that write large (tens of GB) files using direct IO done in 16 MB chunks. They keep the fs around 80% full deleting oldest files when they need to store new ones. Usually the file can be stored in under 10 extents but from time to time a pathological case is triggered and the file has few thousands extents (which naturally has impact on performance).
It used to be the case that disk drives of the same gross capacity could have slightly different sizes in sector counts, and this made it somewhat unwise to use all the space on a drive for partitions or for RAID set members. The issue was that replacing a disk might happen with one slightly smaller, and then the data no longer fit.
Since the differences used to be small, I used to leave the last few dozen MiB, and currently the last GiB, of a storage device unused; I also usually left the first few dozen MiB, and currently the first GiB, of a storage device unused.
I also standardize partition
sizes, so
that partitions would not cross typical disk size boundaries,
for example 80GB, 160GB, 250GB, 500GB, 1TB, 2TB, in particular
the smallest numbers expressed in conventional cylinders of
255*63*512B sectors, or 8225280B. I found in the past the
minimum sizes in cylinders were:
| gross capacity | cylinders |
|---|---|
| 80GB | 7297 |
| 160GB | 19457 |
| 250GB | 30400 |
| 500GB | 60800 |
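As a consistency check on the table above (using the conventional cylinder of 255×63×512B):

```sh
# One conventional "cylinder" is 255*63*512 bytes; 60800 of them come to
# just over 500GB, matching the last row of the table.
echo $((255 * 63 * 512))      # 8225280 bytes per cylinder
echo $((60800 * 8225280))     # 500097024000 bytes, about 500GB
```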
However I have noticed that in the past few years disk drives from the few remaining manufacturers have exactly identical sizes expressed in sectors, which makes me suspect that finally some industry body has published a relevant standard precisely to avoid issues with slightly different sizes in RAID sets.
Some discussions of the finer points of Git, Monotone, Mercurial and Bazaar:
A list of some non obvious VCS features:
|   | Monotone | GIT | Bazaar | Mercurial |
|---|---|---|---|---|
| Language | C++ | C, shell | Python | Python |
| Metadata per ... | One file per repo | Per repo, Per commit | per directory | Per file |
| How many files | One | Depends on packing | . | 2 per managed file |
| Size of repo files | large | large and small | medium | small |
| Notes | . | . | . | . |
| Access via HTTP, or via other port, or via SSH | port/SSH | HTTP | port/SSH | HTTP |
| Special features | Very clean design, uses database. | Commits need periodic packing. | . | Large number of metadata files. |
| Scalability and speed | . | fast | . | slow |
| Integration with Trac or other ticketing system | yes | yes | yes | yes |
| Whether there is a GUI tool | no | yes | yes | yes |
| Preservation of history across renames | yes | yes | yes | plugin |
| Can cherrypick | . | yes | . | . |
| Can rebase | . | yes | . | . |
| How much file metadata is tracked | . | . | . | . |
| Signing of changes | yes | yes | . | . |
| Global or local version and file ids | . | . | . | . |
| Workflows | . | . | . | . |
| Conversion from CVS and SVN | . | . | . | . |
| Special features | Very clean design | Tracks content, not files. | . | Extensive plugins |
The main tradeoffs in computer systems design involve latency vs. throughput vs. safety, and all of these against cost and capacity.
For most computer systems designs cost is roughly fixed, and
the goal becomes to build the best
system
the budget allows for, and this usually involves tradeoffs
between throughput and latency.
These tradeoffs are pervasive because for almost every system component it is possible to find implementations that have lower latency but also lower capacity, or higher latency but higher capacity, and therefore most system components are arranged in pairs or deeper hierarchies which take advantage of locality to obtain most of the advantage of the lower latency implementation (for more active work) and of the higher capacity one (for less active work).
The tradeoffs are especially important for storage components, at every level of the storage hierarchy.
At one extreme system memory components have achieved higher throughput over the years by increasing their internal degree of parallelism and thus the minimum physical transaction size, which has significantly impacted latency; this has been made less negative by the growth in capacity of CPU chip caches, which have smaller transaction sizes and much smaller latencies.
Today I was reminded of the importance of these tradeoffs when bulk-overwriting an enterprise disk drive capable of peak sustained 90MB/s and noticed it could sustain as much as 33MB/s with drive caching turned off. This was amazing as I usually see something like 4MB/s.
Useful sites about premium keyboards:
Plus specialized or semi-specialized manufacturers:
Plus specialized or semi-specialized on-line vendors:
When a server with several 1TB ext3 filetrees crashed they needed fsck, and these are some of the times:
| name | time | size | used | fragm. | max inodes | used inodes |
|---|---|---|---|---|---|---|
| b | 13m41s | 1099.5GB | 139.8GB | 5.6% | 67.1m | 609,448 |
| c | 31m35s | 1099.5GB | 534.6GB | 16.4% | 67.1m | 1,029,415 |
| d | 16m41s | 1099.5GB | 354.4GB | 14.8% | 67.1m | 4,670 |
| e | 18m36s | 1099.5GB | 264.0GB | 28.6% | 67.1m | 136,142 |
| f | 25m18s | 1099.5GB | 370.5GB | 41.0% | 67.1m | 223,616 |
| g | 27m48s | 1099.5GB | 549.4GB | 17.0% | 67.1m | 355,585 |
| i | 9m53s | 1099.5GB | 27.7GB | 11.2% | 67.1m | 178 |
In the above the times are all for fsck -n, that is a mere check, and all the filetrees checked had no or minimal inconsistencies, and as reported most were far from full, a fairly good situation, but also the data in some of them was quite scattered.
Also all the filetrees were partitions on a RAID6 of 16 total drives, thus behaving as a 14 drive RAID0, except that one of the disks was missing, thus t
The rough estimate is around 1TB per hour (faster if the average file size is large, slower if it is small), which is roughly in line with other reports for mostly fine, not too scattered ext3 filetrees.
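As a quick check of that estimate, filetree c above had 534.6GB of used space checked in 31m35s (1895 seconds):

```sh
# 534.6GB * 3600s / 1895s, with both factors scaled by 10 to stay integer:
echo $((5346 * 3600 / 18950))   # ~1015 GB/hour, i.e. roughly 1TB per hour
```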