Computing notes 2016 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

160626 Sat: Checksumming inside filesystems

A report by CERN in 2007 showed that undetected storage system errors are far more common (1, 2) than manufacturer estimates of undetected storage device errors, mainly, but not only, because of hardware and firmware bugs.

Stored data corruption can be detected by well-placed strong checksums, and can be corrected by using redundancy, either in the form of replication or of error correction codes.

The only proper way to protect data against corruption is end-to-end, that is to associate checksums and redundancy with the logical data object itself, so that they are carried with it wherever it is stored.

But that imposes an overhead for curation of data, which is expensive for data of middling importance. Therefore checksums, and sometimes even redundancy, are a common feature of storage devices: for example every data block on a magnetic disk drive has them.

For some years now server (and personal computer) CPUs have been fast enough to allow just-in-case computing of checksums inside filesystems, every time data is written into a file, optionally using them for data verification every time data is read from a file. The Linux filesystems that use them currently are:

name          metadata  data  type              inline  notes
Arvados Keep  no        yes   MD5               no      Distributed filesystem. Checksums identify data blocks.
ZFS           yes       yes   Fletcher, SHA256  no      Scrubbing can verify all checksums.
Reiser4       yes       yes   CRC32c            yes     Since version 4.0.1 of Reiser4. Requires at least version 1.1.0 of reiser4progs.
Btrfs         yes       yes   CRC32c            no      Scrubbing can verify all checksums. Checksum field is 256b, currently only used for CRC32c. Checksumming can be disabled explicitly or by turning off copy-on-write mode.
NILFS2        yes       yes   CRC32?            no?     Not checked on read. Used to detect failed writes during log recovery rather than data corruption.
ext4          yes       no    CRC32c            no      Since Linux 3.6. Requires: conversion to 64b layout, version 1.43 of e2fsprogs.
XFS           yes       no    CRC32c            yes     Since Linux 3.15. Requires: conversion to version 5 layout, version 3.2.0 of xfsprogs.

Note: in the above inline indicates whether the checksum is embedded with the metadata or data it refers to. If the checksum is inline it indicates only whether the data is corrupt or not, but it could be the wrong data; if the checksum is not inline, but in some kind of data descriptor, it also allows detecting mismatches between expected and actual data, but the check is more expensive. The authors of Reiser4 prefer inline checksums while the ZFS authors prefer checksums in data descriptors.

Note: among non-Linux filesystems there are the BSD filesystem Hammer, the MS-Windows filesystem ReFS, the distributed filesystem GPFS, and Apple's not-quite-finished APFS.
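As a practical illustration, this is roughly how checksumming gets enabled or exercised on some of the Linux filesystems in the table above (the pool, mount point and device names are just examples):

# ZFS and Btrfs: scrubbing reads everything back and verifies the checksums
zpool scrub tank
zpool status tank
btrfs scrub start /srv/data
btrfs scrub status /srv/data

# ext4: metadata checksums need the 64b layout and the metadata_csum feature
mkfs.ext4 -O 64bit,metadata_csum /dev/sdb1

# XFS: metadata checksums need the version 5 ("CRC enabled") layout
mkfs.xfs -m crc=1 /dev/sdb1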

It seems widely believed that redundancy-based systems also have checksums that can be used to verify data integrity. Such redundancy can take the form of replication as in RAID1, of parity and syndromes as in RAID5 and RAID6, or of erasure codes in more complex schemes. For example the MD RAID module of Linux has check and repair operations that perform something similar to the scrubbing of ZFS and Btrfs and use the available redundancy to fix detected issues.
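For example with the MD module such a check (and a repair) can be started via sysfs and the result inspected; the device name here is just an example:

# read all members of the RAID set and count inconsistencies among them
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt

# rewrite inconsistent stripes using the available redundancy
echo repair > /sys/block/md0/md/sync_action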

While these checks can be useful as far as they go, they do not go very far, because they are consistency checks at best, not integrity checks, and they exist only because it is possible to double-use redundancy schemes to perform limited consistency checks.

The basic issue is that it is possible to use some type of code or another (of which replication is a simple form) for identity, integrity and redundancy purposes, but the types of codes that are good for one purpose are not necessarily as good for the others.

For example MD5 is still good to check integrity but not as good to check identity: if you download a file along with its MD5 checksum, and the computed MD5 checksum of the downloaded file matches the published one, then it is quite likely that the file was not corrupted during download, but it is rather less likely that the file downloaded matches the original, even if the two seem the same statement; because the probability of a random error resulting in the same MD5 code is low, but it is possible to deliberately construct a file with the same MD5 checksum as another. Even if that is possible it is expensive, so a match of MD5 checksums provides some assurance of identity too.
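For example the usual way of checking a download against a published list of checksums (the file names here are hypothetical) is:

# integrity: the download most likely was not corrupted in transit
md5sum -c MD5SUMS

# identity: a stronger hash gives rather better assurance that what was
# downloaded is the same file that was published
sha256sum -c SHA256SUMS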

Note: in the not-inline case mentioned in a previous note an integrity checksum in a data descriptor is effectively dual-used as an identity code. But most such integrity checksums are CRC32c, which is a lot weaker for identity checking than a strong identity code like MD5 or SHA256.

Similar considerations apply to using redundancy codes to verify integrity. And that's just about the codes; proper integrity checking also relies on careful staging of the checks, which may be different from that for creating and using redundancy.

160607 Tue: Awesome but warm mobile phone and debt-fueled selling

While recently reading a fairly enthusiastic review of the Samsung Galaxy S7 Edge mobile phone it was unsurprising to read:

With a renewed focus on mobile gaming, and perhaps also an eye on the problems that rival manufacturers' phones have suffered with overheating, the Samsung Galaxy S7 Edge now has an internal liquid cooling system.

In practice, it will still heat up if you put it under pressure. Run the full set of GFXBench graphics tests while charging, for instance, and the phone will still get uncomfortably warm. I measured it at a peak of 43°C. It’s good to know that Samsung is acknowledging the issue, though, even if not solving it entirely.

Unsurprising because I had already noted that desktops are never going away, because laptops, tablets and phones are normally in contact with parts of the user's body, and are difficult to cool, so if they are powerful they are going to get uncomfortably warm.

The review was enthusiastic because it is a really well designed, powerful mini-tablet, with very high quality components, including a very good AMOLED display with 2560×1440 pixels and a 5.5in diagonal.

Like the similar top rated Apple mobile phones all this does not come cheap: it costs £639 inc VAT at retail.

That seems currently unremarkable, but let's ponder that figure again: it is equivalent to around US$ 1,000, and in the USA, one of the richest countries in history, about half of workers would have difficulty paying an unplanned US$ 400 expense.

Clearly US$ 1,000 mobile phones are conspicuous consumption items, status symbols, as smart mobile phones with fairly reasonable functionality cost $100-$200, so one would expect the top end phones of Apple and Samsung to be elite products that sell in very small numbers to those for whom US$ 1,000 is a small matter; Apple's founder once stated that Apple products were meant to be the equivalent of BMW cars, that is indeed luxury products for the affluent.

However Apple and Samsung top end products are in effect mass market ones and many are purchased by the people in the income class that would not be able to pay an unplanned US$ 400 expense, and that's why Apple is so huge and immensely profitable, instead of being merely a high-margin, small-sales niche producer. There are three main reasons for this:

The last point is far and away the most important, because it means that Apple's business model is no longer really that of selling premium electronics products, but of selling (indirectly via phone companies) small mortgages with a high interest rate (what used to be called usury). Without a phenomenally loose credit policy Apple would not be able to have both such large profits and such huge unit sales.

Consider other Apple products in the same price range, like laptops: many upper-middle class persons have an Apple MacBook, which is a status symbol too, but also a more plausibly utilitarian gadget than a US$ 1,000 mobile phone; that however is a small market of affluent consumers for whom an up-front expense of US$ 1,000 is eminently affordable. That used to be the natural market for Apple and BMW products, and Apple did well out of it, but not as spectacularly as when it started effectively selling small mortgages by introducing the iPhone line.

Note: Also compare with wristwatches, briefcases or purses: there are many more people with US$ 500 or US$ 1,000 cellphones than with wristwatches, briefcases or purses of a similar price.

160513 Fri: An interesting review of the HTC Vive

The HTC and Valve Vive is a virtual reality access device that sits between the VR room (CAVE), where most of the access equipment is part of the room, and personal access devices like stereoscopic glasses, which have recently become popular in the variant with an embedded mobile phone. The Vive includes both VR room equipment, in the form of locators that provide a room-based frame of reference, and stereoscopic glasses that provide user-based viewing and the pointer for the locators: the combined effect is that when the user wears the glasses it is as if the user were in a CAVE.

The Vive is one of the first VR access devices that has both a price accessible to (affluent) consumers and quality that is good enough for sustained immersion. A guest on a science-fiction author's website has given it a very positive review, with the observation that such VR access devices may have interesting side-effects.

Many years ago I was involved in some work on something similar, and I had a few early VR access devices, including an early pair of nVidia stereoscopic glasses that worked pretty well at the time, even if with much reduced quality compared to what is possible today. At the time I had suggested something similar with ultrasound room-based locators, as in ultrasound 3D mice (I bought a fairly cheap commercial model 15 years ago and it worked quite well), which I think is still a good option for example for desk or body based locators; another VR access device uses webcams as locators, and other techniques can be used.

What is notable about the review is mostly the very strong sense it conveys that the device provides good quality access, that the immersion from the VR access is on a similar level of quality to that from a conventional spectator monitor.

160624 Fri: Update on mechanical keyboards and mice

A while ago I got some mechanical keyboards and gave some first impressions, so here is an update.

As to the CM Storm QuickFire TK, I have been disappointed to see that four of its keys became unreliable quite a while ago. These are supposed to be highly reliable Cherry MX brown switches, and I could accept that one of them may have been defective, but that four of them (among them the much used Enter key) have become unreliable may mean that they were fakes.

Note: The Enter key is mostly dead. But it works if I remove the keycap and, as I press it, I also push it northwards, where presumably the keyswitch is. Which means that the metal lip of the keyswitch has become dirty, or is out of position, or has already become enervated.

I have replaced it as my main keyboard with the Ducky DK-9087 Shine 3 keyboard, which has continued to work very well, and is nicely backlit. The black rubber lacquer on its backlit keys has worn a bit on the left-Shift, left-Ctrl, A, I, O and Enter keycaps, but that is expected, and the advantage of having Cherry MX keys is that there is a choice of spare keycaps. I still prefer the texture of the PBT keycaps I bought, but they are less suitable for backlit keys, so I have not put them on yet.

The Corsair K65 still works well but I have not used it much, as I was using it for my rarely used test and gaming desktop, and then I replaced it, for testing, with a Zalman KM-500.

The KM-500 is one of the cheapest mechanical keyboards, and it does not use Cherry MX switches, but its switches are very similar to the Cherry MX Black ones, and take Cherry MX compatible keycaps. The version I got had a non-international layout with a thin Enter key, UK keycaps but 104 keys instead of 105 keys.

The missing key is the one with the vertical bar and backslash symbols. Under X-Windows these can be typed with ISO_Level3_Shift-`, which is marked on the keycap, and with ISO_Level3_Shift--, which is not marked, and somewhat inconvenient, especially for MS-Windows users that use the backslash a lot.

Note: ISO_Level3_Shift in X-Windows is usually mapped on the Right-Alt or Alt-Gr key.
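If ISO_Level3_Shift is not already mapped it can be assigned to the Right-Alt key with an XKB option, and the current mapping can be inspected with xmodmap, for example:

# map Right-Alt as the third-level chooser (ISO_Level3_Shift)
setxkbmap -option lv3:ralt_switch

# list which keycodes produce backslash and bar in the current mapping
xmodmap -pke | grep -E 'backslash|bar'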

After light use I am fairly happy with it; I have seen reviews on the web saying that after some time some keys become unreliable, but this has not happened to me yet.

Also while all the keyboards and mice that I bought work-ish with my PCs, some use advanced USB protocols, and those do not work with some of my USB hubs or KVM switches. Fortunately the K65 has a hardware switch to enable various different protocol modes, and so does the SHINE3. Some of the mice however just are not supported by some hubs or KVM switches.

160512 Thu: Four year old flash SSD still doing well, but slower

My laptop is four years old and soon after buying it I also replaced its disk drive with a 256GB flash SSD (1, 2, 3) which I use for the / and /home filetrees.

It is still going without problems and it reports that only 3% of its rated total writes have been used after 36,279 hours (around 1,500 full days) of use:

# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-22-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       36279
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       620
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       65
173 Wear_Leveling_Count     0x0033   097   097   010    Pre-fail  Always       -       101
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       133
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       67 8 59
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       4
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       81
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   097   097   001    Old_age   Offline      -       3
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0

I have another two 256GB flash SSDs, one from SK Hynix that is in a PC that I rarely switch on:

#  smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-18-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 0
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   166   166   006    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0032   253   253   036    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3342
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       51
100 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2982997
171 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   100   100   000    Old_age   Offline      -       12
175 Program_Fail_Count_Chip 0x0032   253   253   000    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   253   253   000    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   000    Old_age   Always       -       2744576
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   000    Old_age   Always       -       29
179 Used_Rsvd_Blk_Cnt_Tot   0x0032   100   100   000    Old_age   Always       -       214
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032   100   100   000    Old_age   Always       -       5098
181 Program_Fail_Cnt_Total  0x0032   253   253   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   253   253   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   253   253   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   253   253   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   253   253   000    Old_age   Always       -       0
191 Unknown_SSD_Attribute   0x0032   253   253   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   029   000   000    Old_age   Always       -       29 (Min/Max 14/41)
195 Hardware_ECC_Recovered  0x0032   253   253   000    Old_age   Always       -       0
201 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
204 Soft_ECC_Correction     0x000e   100   100   ---    Old_age   Always       -       0
231 Temperature_Celsius     0x0033   253   253   ---    Pre-fail  Always       -       0
234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       10896
241 Total_LBAs_Written      0x0032   100   100   ---    Old_age   Always       -       8702
242 Total_LBAs_Read         0x0032   100   100   ---    Old_age   Always       -       2014
250 Read_Error_Retry_Rate   0x0032   100   100   ---    Old_age   Always       -       1720

Another from Samsung which I bought 11 months ago:

#  smartctl -A /dev/sde
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-21-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       8813
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       56
177 Wear_Leveling_Count     0x0013   097   097   000    Pre-fail  Always       -       127
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   073   059   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       9
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       39
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       37982825467

Both are used also for the / and /home filetrees, and they have also been problem-free. In general my experience is that flash SSDs are more reliable than disk drives, and that reports of low flash SSD reliability depend on poorly designed hardware or firmware, or very high rates of rewriting.

What has gone wrong is raw speed (sequential, 1,000×1MiB blocks): the 4 year old Crucial device has slowed down from the SATA2 top speed of 250-270MB/s either reading or writing to 128MB/s reading and between 49MB/s and 117MB/s writing; the SK Hynix device has slowed down from around 500MB/s reading or writing to 318MB/s reading and 395MB/s writing; the Samsung device still does around 500MB/s reading, but writing oscillates between that and 190MB/s.
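That kind of sequential figure can be obtained in several ways; one simple way (the device and file names are just examples) is with dd using direct IO:

# sequential read of 1,000×1MiB blocks straight from the device
dd if=/dev/sda of=/dev/null bs=1M count=1000 iflag=direct

# sequential write of 1,000×1MiB blocks to a scratch file on the filesystem
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1000 oflag=direct conv=fsync
rm /tmp/ddtest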

Obviously the random access times are still very good, so the responsiveness is good, but the slowdown on the four year old drive is pretty huge and surprising. My current guess is that since it is used for the / and /home filetrees it is subject to a large number of small rewrites, e.g. from rsyslog and collectd. Presumably then the distribution of free blocks among erase blocks is very fragmented, and so will be that of data blocks when they get rewritten; I guess that if I were to backup the contents, run SECURITY ERASE, and reload, it would go back to the top rated speed. After all the volume of rewrites is small, as the devices all report very little use of their rated write lifetimes.
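If I ever try that, the usual hdparm sequence is something like the following (the device name and password are placeholders, and the drive must not be frozen or in use); a much gentler measure that sometimes helps is simply trimming the free space:

# set a temporary password, then issue the ATA SECURITY ERASE command
hdparm --user-master u --security-set-pass Pass /dev/sdX
hdparm --user-master u --security-erase Pass /dev/sdX

# gentler alternative: tell the device which blocks are actually free
fstrim -v /
fstrim -v /home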

160506 Fri: Visual programming languages and programming based on metaphors

A knowledgeable colleague gave me a link to a very amusing presentation based on pretending that it is still the 1970s, that major new programming concepts have been developed, and that it is easy to predict that they will be dominant in 20-40 years time; while in fact none of them became dominant. One of these was visual programming languages, those based on flowchart-like organization of the code.

This brought me to the site for the Self language, in which visual programming is possible, and in particular to another presentation on it and related topics by David Ungar.

The latter presentation made me cringe because while the presenter correctly points out that Self is not widely used, the claims that it is in some grand sense better than stuff which is widely available are grossly misguided.

The first is the naive conceit that visual programming in the sense of a reactive virtual world on screen in which programs are represented as flowcharts somehow makes programming easier. The challenge of programming is in managing scale, and developing and maintaining the interactions and dependencies of vast amounts of complicated code, not building cool looking pictures on a screen. This is because programming is based on understanding what is being coded, and this depends on the ability to build a modular abstract mental model of what is being programmed, and on-screen flowcharts or even dynamic runtime simulations by themselves do not achieve that. The idea that visual programming works is as simplistic as the idea behind COBOL that since it looked like English as in ADD 1 TO number of accounts it would make programming easy even for untrained people.

Another major and related one is the idea that programming can be made more approachable by adopting some kind of real-world metaphor, for example that of sending messages among people, or switches connected to objects inside a simulation. While I think that visual programming is essentially useless at building a modular abstract model of a program, real-world metaphors seem to me to make it harder, because the behaviour of program entities only very superficially can be made to fit in some kind of metaphor based on real world objects.
A computer desktop or folder does not behave like a desk-top or a folder, except in the vaguest sense, and to operate them users have to realize that they do not, which sometimes takes time.

Note: Also in a classic experiment by Mike Lesk two groups of secretaries were taught to use the same text editor program in two different fashions: one group using familiar metaphors for operations like cut and paste and scroll, and the other using made-up nonsensical names. The second group was at least equally proficient, because the key was building an abstract mental model of what the text editor program behaved like, and the familiar metaphors did not help at that.

The other is that in the specific case of Self two very big related misunderstandings are embodied in it and its terminology, even if they are presented as cool, insightful features by the presenter:

Overall while visual programming and programming based on metaphors are pointless or bad ideas, the research on Self involved developing or improving several interesting implementation techniques, which were used for other languages and systems. In part because they were pretty smart, but also because the hardware that made those techniques viable had just been developed or was being developed at the same time, and a lot of computer research in essence is the exploration of the possibilities opened up by better hardware.

160501 Sun: Two hundred kernel threads on a laptop

On my GNU/Linux laptop I have 228 kernel space threads:

tree$  ps ax | grep '[0-9] \[' | wc -l
228

Plus whatever number of creepy little user space background processes, but let's skip that. That 228 count of kernel threads is ridiculous enough. I do not even have 228 peripherals or anything that might remotely require, by working in parallel with other stuff, that many running threads.

What's going on? All those threads are mostly inactive. They are either waiting to be polled by something or are waiting to poll something. That is indeed ridiculous. In a typical quiescent system, as my laptop is most of the time, either because I am thinking or typing and the computer is in the endless space between the words, the system should have essentially no threads. Perhaps one user thread for my X terminal, and another one for the X server it communicates with and is running asynchronously to it. But nothing else should be going on.

Those threads that are waiting to poll or be polled are unnecessary: they can be replaced by functions that get called by the threads that are properly necessary.

A properly necessary thread is something that runs because it represents inside a computer an external source of activity, like a user or network client (not a daemon). If a user does nothing, or there is no active network client, the computer should just lock the CPU into idle mode, and have nearly zero processes or threads (kernel or user space).

An operating system is a collection of (privileged) libraries, not a program or even worse a process, or even worse still a collection of processes or threads. On a just booted computer the only active process should be:

Nothing else is needed: there is no need to do local IO or network communications except when a user or network client require it. Computer systems should be passive devices, not running two hundred (mostly low activity) threads. Old style systems, like Multics or MUSS or UNIX were naturally like that. What went wrong then? Some guesses:

Given the above guesses, in my imagination many hipster developers (or people called Lennart, or Greg) conceive of reading a disk block or allocating memory as a service to be implemented in its own thread, or something similarly inane.

Perhaps we shall soon experience the glorious moment when the trend is taken to its logical conclusion and Linux will have a kernel-side process creation service running as a thread (and user-side ls, cat, cp microservices implemented as daemons, and a new-style shell daemon that orchestrates calls to them).

160413 Wed: Monitors with 3840x2160 pixels, 24in, 27in, 32in displays

I was wishing some years ago for higher DPI displays for desktop monitors and they are slowly appearing, probably because they have become popular on mobile phones, tablets and laptops, in large part thanks to Apple.

I have just noticed that there are now more examples of a 24in monitor that has 3840×2160 pixels, which amounts to about 184DPI. It is almost print resolution and the pixels stop being visible at a viewing distance of around 50cm or 20in.

The price of £230 is also remarkable because it has an IPS panel and it is much the same as that of a similar 24in monitor with 1920×1200 pixels.

There are also similar monitors with a 27in display with 160DPI for around £360, and a 32in display with nearly 140DPI for around £600.
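Those pixel density figures can be double-checked from the diagonal resolution with bc:

# pixels along the diagonal divided by the diagonal length in inches
echo 'sqrt(3840^2 + 2160^2) / 24' | bc -l   # about 184 DPI for the 24in display
echo 'sqrt(3840^2 + 2160^2) / 27' | bc -l   # about 163 DPI for the 27in display
echo 'sqrt(3840^2 + 2160^2) / 32' | bc -l   # about 138 DPI for the 32in display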

My 32in 2560×1440 monitor still looks good to me (even if it still feels a bit too large), so I am not upgrading soon; plus a monitor with a display with more than 1920×1200 pixels is not supported well by older laptops like my current one or by cheaper KVM switches.

160409 Sat: The wonder of the world wide web

For a long time I have been doing computer science work and some research, and most of it was about infrastructure, and a lot of it in recent years was about the Internet and the World Wide Web. I continue to have a sense of wonder at some of the most remarkable outcomes of their history, outcomes that give me greater hope for the evolution of civilization, if they can be maintained.

The most obvious examples are Google Earth, Wikipedia, Google Youtube, Yahoo Flickr, The Internet Archive, The Gutenberg Project. These in essence are all libraries, and indeed I am a bookish person, but that is not the only reason I am awed by them: it is not just for the vastness of their content, or for their worldwide availability, or for the number of volunteers that maintain them, but mostly because that content was difficult and expensive to access before the Internet and the World Wide Web, and is now much easier to find and download.

These sites provide not just content, including a long tail of content which is important but of narrow interest, but very accessible content. Because of the very large improvement in accessibility they are probably in the long term as important as the reinvention of printing with movable type by Gutenberg in the 15th century, which at first made existing content far more accessible.

In a similar way I appreciate very much (in different ways) blogging and statistical sites like Wordpress, Blogspot, University of Oxford podcasts, Tumblr, even if much of their content is vanity publishing, and sites like St. Louis FRED, Yahoo! Finance, EuroStat, OECD Statistics, as they support the publishing of much interesting content.

I feel wonder at being able to look at places like Socotra on Google Earth, find on Wikipedia that it used to be the Dioscurides bishopric of the Church of the East, and read about the history of all of these in the same afternoon, where in years past it would have taken several trips to research libraries.

160401 Fri: More complexification of the Linux based systems

A few days ago I read the May 25th, 2015 open letter by Linas Vepstas (and there is a somewhat defeatist reply) to Debian and Ubuntu developers in which he quite rightly notes how simple critical aspects of their systems have been complexified into fragility. The prime examples are the usual ones of DBUS and systemd, and he names the usual Kay Sievers and Lennart Poettering as influential instigators (to which I would add the similarly destructive Keith Packard and Greg Kroah-Hartman, and the latter seems to me one of the first and most despicable).

A minor but telling symptom happened to me today: when replacing a failing hard disk I found that I could not reformat one of its partitions as it was reported as still in use by the Linux kernel. This despite me having unmounted it, checked /proc/mounts, and checked dmsetup ls and losetup -a just in case; plus, knowing that the system was running some services in LXC-style containers, I had also checked whether it was mounted or used in any of those containers.

I was perplexed, especially as I noticed that the four kernel threads that the XFS filesystem creates on a mount were still active for the relevant block device, so the kernel was reporting it as in use correctly, rather than hitting a generic bug. So I started looking for a bug in XFS unmounting related to IO errors, as the disk was being replaced because of IO errors in the filesystem metadata, which might have resulted in an aborted and stuck unmount.

I found some mentions of similar situations in various web pages, but the best hint came from a page that suggested a further check, unrelated to XFS issues: to check also the /proc/*/mounts pseudo-files, which list the per-process (rather than per-system or per-container) mount points maintained by Linux. It turned out that five of the hundreds of processes had a private (per-process) mount of the relevant block device, and none of them actually needed it.

That is, each of those five processes had its own per-process mount namespace. So I checked and the system had several such namespaces:

/proc# ls -ld */ns/mnt | sed 's/.* //' | sort | uniq -c
    518 mnt:[4026531840]
      1 mnt:[4026531856]
      1 mnt:[4026532278]
     29 mnt:[4026532853]
     78 mnt:[4026532856]
     77 mnt:[4026532930]
     37 mnt:[4026532998]
     77 mnt:[4026533066]
    107 mnt:[4026533134]
     29 mnt:[4026533200]
      1 mnt:[4026533522]
      1 mnt:[4026533524]
      1 mnt:[4026533649]
      1 mnt:[4026533650]
      1 mnt:[4026533686]

Here the one with the highest count is the per-system one, those with the next highest counts are those for the LXC containers, and those with single-digit counts are per-process and used by a single process. The latter are somewhat inexplicable but most likely exist by accident: the relevant processes seem related to the shameless OpenStack cluster and VM management system, which includes the overcomplicated Neutron virtual network system; Neutron uses network namespaces, and it is possible that someone coded it to unshare all namespace types instead of just networking.

But even given this there is a complexifying consequence of having per-process namespaces, and of how they have been defined: in effect checking /proc/mounts, or the /proc/mounts of all LXC containers, is no longer sufficient; one must always check the mount namespaces of all running processes.

That is because namespaces can be recursively redefined, and containing namespaces (at least mount ones) are not supersets of the contained ones, which means that it is not possible to look only at the resources used by a containing namespace to manage it: one has to look explicitly at those of all contained ones too.
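In practice that means a check like the following (the device name and the PID are just examples): scan the per-process mount tables directly, and then enter the offending namespace to identify or unmount the culprit:

# find which processes still have the device mounted in their own namespace
grep -l /dev/sdb1 /proc/[0-9]*/mounts

# inspect (or clean up) the mount namespace of one such process
nsenter --target 12345 --mount cat /proc/mounts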

That makes investigating and reasoning about systems a lot more complex, and I have already noticed that a lot of people do not seem able to fully understand the consequences of layers of virtualization and partitioning, such as VLANs, or even simpler things like the consequences of sharing a disk arm among too many virtual machines, never mind sophisticated classification and access control schemes like SELinux.

160326 Sat: Fixing excessive logging by BCM4313 driver

I have finally decided to find the reason why I have been getting this kind of kernel report every few seconds:

cfg80211: World regulatory domain updated:
cfg80211:  DFS Master region: unset
cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp), (dfs_cac_time)
cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm), (N/A)
cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (300 mBi, 2000 mBm), (N/A)
cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm), (N/A)
cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm), (N/A)
cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm), (N/A)

That took a bit too much time and insight, and the general context is:

But there are some details that cause trouble:

In practice the simplest solution is to disable the udev rules for regulatory domains, on my system 40-crda.rules and 85-regulatory.rules, because if there is no rule for an event it gets ignored.
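Since a rule file in /etc/udev/rules.d shadows a file of the same name shipped in /lib/udev/rules.d (at least on Debian-style systems), disabling them amounts to creating empty files with those names:

# shadow the stock rules with empty files so the events get ignored
touch /etc/udev/rules.d/40-crda.rules
touch /etc/udev/rules.d/85-regulatory.rules

# reload the udev rules without rebooting
udevadm control --reload-rules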

Another solution would be patching crda so it can send the configuration for a domain under the name of a different domain, so as to send it always with domain code 00.

But I am not sure whether the latter would work, because the crda documentation says that the regulatory domain configuration that is used by the kernel is the intersection of that received by it and that stored on the device:

In order to achieve this devices always respect their programmed regulatory domain and a country code selection will only enhance regulatory restrictions.

Unfortunately the regulatory domain programmed into the device as X2 is the default world domain which is already the most restrictive.

160322 Tue: Examples of anisotropy in Isilon and Qumulo

After looking at Isilon as to anisotropy I have noticed that some founders of Isilon have created another company to develop and market an improved product, Qumulo.

From a few articles on Qumulo it seems that Isilon had very anisotropic behaviour as to recovery (one of the figures of merit for filesystems that I had listed previously):

To prove that QSFS can handle lots of files, Qumulo stacked up a minimum configuration of its appliance, which has four storage server nodes, and loaded it up with over 4 billion files and nearly 300,000 directories.

Godman says you cannot do this with Isilon arrays because if you blew a disk drive it would take you months to rebuild the data. But the Qumulo appliance can rebuild the data from a lost drive in about 17 hours.

If recovery is done by enumerating objects then it amounts to a whole-tree operation, and the consequence is that large numbers of objects take a lot of operations. I have previously remarked on how slow and expensive fsck-style operations can be (1, 2, 3).

As to the recovery time of Qumulo, that article continues with:

But the Qumulo appliance can rebuild the data from a lost drive in about 17 hours.

(The company is not saying how it is protecting the data and what mechanism it is using to rebuild lost data if a drive fails, but says the process takes orders of magnitude less time than in other scale-out storage arrays sold today.) Godman says that the system can do a rebuild and continue to perform its analytics regardless of the load on this entry system.

Previously the entry system had been described as:

The base Qumulo Q0626 appliance comes with four server nodes, each of them coming in a 1U form factor with four 6 TB Helium disk drives from HGST/Western Digital and four 800 GB solid state disks from Intel.

The nodes in the appliance have a single Intel Xeon E5-1650 v2 processor with six cores running at 3.5 GHz and 64 GB of main memory to run the Qumulo Core software stack. The storage servers have two 10 Gb/sec Ethernet ports that allows them to be clustered together

Those 6TB drives can be duplicated sequentially in around 11-12 hours, and that is as fast as it can be done, so a recovery time of 17 hours while the system is under load largely implies that the recovery process is similar to the resynchronization of a RAID set.
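That 11-12 hour figure is just the size of the drive divided by a plausible sustained transfer rate (around 145MB/s is my assumption for those 6TB drives):

# 6TB at about 145MB/s of sustained sequential transfer
echo '6000000 / 145 / 3600' | bc -l   # about 11.5 hours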

Note: Perhaps a RAID3 set, as the recovery time is claimed to not impact operation, or perhaps a RAID10 set (4 drives) or perhaps a log-structured object replicating scheme.

But then there is implicit in the above another massive anisotropy: the base system has 4×6TB drives and 2×800GB flash SSDs. The 6TB drives are big and cheap, but with an horrifyingly low IOPS-per-TB (1, 2) and exactly the opposite for the 800GB flash SSDs.

Obviously the expectation is that the metadata will reside entirely on the 1.6TB capacity of the flash SSDs, and that the working set of the 24TB of raw disk capacity will fit long-term in the flash SSDs as well. Given the relative raw sizes the assumption is likely to be that the working set is 10-20 times smaller than the capacity of the system. It is likely that the metadata will fit on the flash SSDs, but when the working set of the data does not fit on them those 6TB drives will end up in the oversaturated regime of arm contention. As someone wrote latency does not scale like bandwidth.

Clearly a double flash SSD failure would be catastrophic, but it would also be interesting to see the consequences of anisotropy on any realistic workload as to how the system would behave if one of the flash SSDs were removed and the flash SSD cache capacity were thus halved.

160317 Thu: Notes from FLOSSUK spring workshop day 2
Still quite draft.
Admonitor: simple server monitoring
Introduction
  • Andy Beverley, Simplelists.com, Ctrl O, SaaS.
  • Monitoring needed to comply with ISO27001. Information assurance, but also availability.
  • Spent a day looking at Icinga and Nagios. Wanted something more lightweight.
  • Follow process: spend money, spend time, then if both fail, code.
  • Modular, centralized, with plugins (agents on target host, checkers for central host, alerters are hard coded).
Details:
  • Agents are HTTP daemon, logs resources, Sqlite3 database.
  • Checkers do simple things or full email loop checks.
  • Alerters so far email. Maybe SMS one day.
  • Simple graphing interface via web.
  • No discovery, database with list of hosts, each host agent configured with a local YAML file.
Implementation:
  • Implemented in Perl5.
  • The local Sqlite3 database means that constant connectivity is not that important.
  • Agent takes 50MB of RAM, maybe rewrite into push.
  • Controller DB uses abstract database interface, currently MySQL. Also DBIC::Migration.
  • Web interface using Perl5 Dancer.
Configuration management and containers
Introduction:
  • Stephen Grier.
  • UCL web and core services (DNS, email, ....) using free software.
  • Talk about configuration management and containers, mostly Puppet and Docker.
  • CM is used mostly for maintainability.
  • Containerisation is an old technique, more like chroot than VMs. In 2007 two people at Google started to define cgroup facilities. Namespaces were introduced at roughly the same time.
  • LXC was result of namespaces and control groups.
  • Docker provides tools on top of LXC to build and deploy images. Similarly Rkt/CoreOS.
Comments on containerisation:
  • Runtime benefits: isolation.
  • Self contained images.
  • We have lots of Java web apps, and we need a simple way to isolate them.
  • Isolation means that deployment from prebuilt images can be really quick, and does not matter what it contains.
  • So traditional CM converge to a declared state, but golden images do not need that.
  • But how are the images configured? Maybe Jenkins, usually manually. But would really like to define contents.
  • The Docker system uses a Dockerfile.
Why are CM tools still useful then?
  • To setup and install docker hosts unless they run CoreOS or Red Hat Atomic.
  • CoreOS has a shared registry which is remotely accessed by clients and updated centrally.
  • Puppet has modules to manage both Docker hosts and build images and define which containers are run by which host.
  • Sample garethr-docker module configurations look to me like Juju charms.
  • Using Puppet module hiera one can do mass deploys like in Kubernetes.
  • CM tools can be used to build golden images, for example by using puppet apply inside a Dockerfile.
  • It is also possible to run CM tools inside the running Docker image to update it on the fly.
Solving alligator problems one alligator at a time
Introduction:
  • Matt S. Trout, Shadowcat.
  • I work for a lot of early startups, or older startups that work like startups.
  • Lots of technical debt, but that beats running out of money while building a good version.
  • How to fix mess while running legacy systems.
Basics:
  • Inventory systems. Your colo host should be able to tell you. Assume everything is terrible and undocumented and lots of forgotten systems.
  • Inventory services on the systems. Ask the OS.
  • Inventory packages, repositories, checkouts.
  • Check whether someone does development on production machines. Up to 2002 perllocal.pod was a good list of installed packages. There is now Dist::Surveyor, which slowly scans disks and verifies Perl sources against original checksums.
  • Check what is talking to what. Check any service management daemon. Tools: lsof, netstat, ps ax.
Analysis:
  • You can feed status files and logs into MediaWiki, or into git for example as JSON, e.g. with JSON::Diffable.
  • Check for IP addresses. Not everything is in DNS.
  • Really important to communicate that reworking things is doing things for the better, often using catchy metaphors.
What to do:
  • Do not upgrade/rework in place: setup completely new systems.
  • Existing systems may be security compromised.
  • Internal firewall configurations are great to ensure connections are documented.
  • For CM: I really prefer pull-based. Sysadmins tend to pick Puppet, developers tend to pick Chef, but for the basics they are all good enough.
  • Pull based CM systems will converge.
  • Don't try to be clever and exciting.
  • Backup everything. Full system image backups too.
  • Eliminate any IP based configuration. If DNS is a mess you can rsync /etc/hosts.
  • Restore backups onto fresh clean machines. Try to update from the dev side.
Migration strategy:
  • Don't trust DNS timeouts. There are people who have arbitrary floors on TTL, like 24 hours, like BT.
  • So create new domain names, and slowly migrate names, and always have a way to quickly backout the migration.
After migration:
  • Now you have a trustable DNS or /etc/hosts.
  • Redundancy strategies are often very fragile. A dumb way that converges is rsync, and then kill, switch machine and restart.
  • Many automatic failover systems cause more outages than they fix. Auto failover can have zero outage or go horribly wrong.
  • Repeat for all services.
Questions:
  • Whether he used the SUSE tool called "machinery" to do the inventory: interesting, but prefer to use simple low level tools.
  • Push or pull, pull has the same problem on missed machines: but pull usually runs periodically, so converges. Also pull can be in base image.
Data Revolution Catalyst
Introduction:
  • I was a system administrator for 2 years, I am from Namibia and now more into business intelligence and big data and open data.
  • Access to information leads to free societies and free markets.
  • UN calls for open data to lead to a better sustainable world.
  • Open data is freely sharable data, and mostly but not just government data.
  • Some big challenge is to make the open data accessible to non technical people.
  • Open data is now increasingly published in online form rather than in book form. For recently developing countries that means mobile phone access.
Why publish open data?
  • Promotes social values.
  • To drive the internet of everything.
  • Create value in general.
Challenges:
  • Digital divide is due to competence and equipment access because of social and economic differences.
Some solutions:
  • Example: Walkamomics allows sharing data about London streets. GoToVote in Kenya was about voter self registration. The ADB has simple tools that show changes in development applications.
  • Namibia ODIH is developing information sharing applications, like about broken traffic lights. Other FixMyCity-style apps were developed.
  • Namibia ODIH also wants to develop interfaces to smart utility meters.
  • Namibia ODIH allows travellers in remote or difficult areas to send requests for assistance to emergency services or some nearby user that is willing to help.
  • Namibia ODIH also has an app to track usage by food bank users and non-users.
  • Namibia ODIH also eHealth to help pay for health expenses, and also related expenses like taxi to hospital.
  • Also app to call buses on request.
  • Non mobile apps: Open data Namibia service.
How does open data lead to big data:
  • Open data accumulates, and eventually become big data.
  • Big data means large or very diverse.
  • So need for scalable capacity and speed but also flexible data structure.
  • Also standards for data interchange within a country and between countries.
Computational models:
  • Ways to store and access the data so it is really accessible.
  • Problem: data ingestion, message queue, flexible schema.
  • I like firebase as an API for mobile applications.
Conclusions:
  • Data essential for our lives.
  • More data means more flexibility for our lives.
Questions:
  • My experience is that African governments do not cooperate: we are getting cooperation even if there is always a trust issue. Our new government seems keen and we are connected to the independent stats agency. We are not government funded, but just volunteers.
  • data protection issues: there are already many examples on how to handle these issues thanks to previous history in western countries. But we learn by doing and testing the boundaries. We are not concerned with sensitive data like defense data. We are not dealing with private or personal data, but data already published in booklets.
160316 Wed: Notes from FLOSSUK spring workshop day 1
Still quite draft.
When applications make promises
Presentation:
  • Myself: Tom May.
  • Debian or Apache developer.
  • Evangelist for Chef.
  • Used to work for Claranet, used Cfengine.
Promise theory:
  • Cfengine implementation of promise theory.
  • Promises rather than obligations: how do you know what the other party is going to do?
  • Promise is a contract from agent to master.
  • Can reason about state of infrastructure in terms of promises fulfilled or not.
  • A. Lincoln: we must not promise what we ought not, lest we be called on to perform what we cannot.
Look at promises:
  • Cfengine example about file creation: promise -> promisee, followed by attributes of file.
  • Chef nodes are also promisers. For example, become a web server or tell you I failed.
  • Configuration management should be data driven.
Orchestration:
  • Command and control.
  • TV remote controls have got us used to that (Mark Burgess).
  • At Google scale it is not practical: cattle versus pets.
  • Battle command and control: send synchronous messages.
  • Orchestration systems tend to be static and synchronous, also quite complex.
  • Each application team must talk to the orchestrator team.
  • Ansible needs 6GiB of RAM per thousand nodes.
  • Error handling is difficult: Ansible keeps track of which commands failed so they can be rerun. That is not so good.
Actor model:
  • Way of programming that has been gaining favour: Erlang, Scala, for example.
  • Receives messages via a mailbox.
  • Messages are asynchronous.
  • Can send new messages.
  • Can spawn new actors.
  • Elixir based on Erlang, example.
  • Example with actor implemented as process.
Choreography:
  • Example: will you wave your arms right to left, when the previous one has done it.
  • No specific details: just promises.
  • This can be made dynamic. Google and "bin packing". They found the best level is local: node and cluster.
  • This means scalability. Distributed logic.
  • Been doing lots of Kubernetes. About "pods". Each pod has an agent, and that means they are actor instances. Called replication controllers in Kubernetes.
  • Configuration management is thus performed locally.
  • Most importantly it is embedded in the application, not in the orchestrator.
  • Multi agent systems.
  • Kubernetes about emergent behaviour from autonomous separate system.
Questions:
  • What about push/pull: push maybe overlaps synchronous, pull maybe overlaps asynchronous.
  • What about intelligence: it should be in the application layer.
  • Difficulty there is concept of application: a system is a set of interacting applications.
Postgres replication.
Introduction:
  • First no replication, then PITR, then logical.
  • Logical replication is outside the DBMS: Slony and Londiste.
  • All solutions based on triggers, tables as queues, scripting language implementation.
  • Outside DBMS are slow but flexible.
Bidirectional replication stages:
  • Native, multimaster. Started 2011, prototype 2012, many patches to 9.3 to 9.6, for example logical decoding.
  • Not fully integrated, but framework first.
  • Two components: BDR, which is a set of 9.4 extensions, multimaster, DDL transparent; and pglogical, which is an extension for 9.4 and does one way replication. It is being integrated.
Principles of streaming replication:
  • Physical: user to WAL+DB, then WAL sender sends to WAL receiver, writes to WAL, then startup that updates DB. Standby must then be readonly.
  • Logical: similar on the sender side, but WAL updates are made logical, as SQL statements, and sent directly to a DB updater.
  • All logical updates are ordered by row and commit, no DDL changes allowed, there is a C API, an SQL interface and a streaming interface.
  • Existing output plugins: bottledwater-pg into JSON AVRO, decoderbufs into protocol buffers, pglogical_output used by pglogical, native or JSON output.
The main topic is pglogical:
  • Receives streaming updates from WAL updater, applies them to DB.
  • Sends feedback to the WAL updater, so the sender can report when a commit is complete.
  • Advantages of logical: allows temp tables, different indexes, security, some data transformation, can be selective, can be across services.
  • Replicates in commit order, optionally selective.
  • Also optionally semi-synchronous: some transactions can be replicated synchronously, some not.
  • Enables live upgrades of PostgreSQL, by synchronizing DBs and switching, especially if you are using a pooler.
Replication sets:
  • What is replicated is groups of tables; they need to be defined on each target node.
  • Only tables in a set are replicated, they can be in multiple ones, only one update sent.
  • Predefined: default, default_insert_only, ddl_sql.
  • There must be a primary key in the target. If there is no primary key you can only replicate insert and delete but not update.
  • One can customize what gets replicated. For example do not replicate delete for a historical log.
  • Configuration is in the DB on both provider and subscriber, defines node name and connection string. Then add tables on the provider, and then subscribe to the provider from the subscriber.
  • A subscriber can subscribe to totally unrelated providers. There is also conflict detection and resolution.
Miscellaneous:
  • Enormously better: same up to 4, then 8,000 vs. 2,000 (londiste3), 3,000 (Slony).
  • A bit slower than physical: same up to 16, then 12,000 with physical.
  • DDL not replicated yet: either nothing or identical replication, including the schema of non-replicated tables, as it is being currently done with pgdump and pgrestore, but work has started on logical decoding of DDL.
Warnings:
  • Sequences: fixed mostly in 1.1, but there is always an issue.
  • Big transactions are always a problem as they are big.
  • Interplay with physical replication, fixed in 9.6; cannot follow the failover currently.
Future plans:
  • Probably in 9.7.
  • Data transformation hooks, filtering by expression.
  • Push as well as pull because of security requirements.
  • Integrate BDR.
Ansible VM bootstrapper
Introduction:
  • Toshaan Bharvani.
  • Ansible: CM tool, execution tool, orchestration.
  • Ansible: server based without a specific agent, but usually SSH and Python.
  • CM: flat files, database, scripts.
  • Run playbooks on node, but can delegate to another node.
  • Goals: define a VM with given real and virtual resources.
  • Current definition variables not quite documented, but there are examples.
Inventory management:
  • Target node assumptions: can access the host, can run libvirt.
  • Management node assumptions: Ansible, the VMINSTALLER role, Ansible CMDB, qemu-img.
  • 32/64 bit AMD and PPC architectures tested as both management and target.
  • Most distros and MS-Windows 2008 and 2012 (via a DOS floppy).
  • Don't fully work: OpenBSD, FreeBSD, Ubuntu, Fedora, MS-Windows 7 and 10. Preseed files for Debian do not work on Ubuntu.
Operation:
  • Connects to target host, creates VM.
  • Waits for creation to complete, gathers facts from VM, ends.
  • Demo: image creation is delegated to the hypervisor host.
  • Fact gathering is disabled initially, as the VM does not exist yet.
  • Pre-tasks: create the SSH key of the VM to be created.
  • Then the vminstaller role creates the VM, and other roles will configure it (see the sketch after this list).
  • There is a hack by which VMs defined with less than 1-2GiB of memory are temporarily given that much just to run the installer.
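An invocation might look roughly like this (a hypothetical sketch: the playbook name and extra variables are made up for illustration and are not the role's documented interface):
$ ansible-playbook -i inventory/hosts.ini vminstaller.yml --limit newvm.example.org \
    -e "vm_hypervisor=kvmhost1.example.org vm_memory=2048 vm_disk_size=20G"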
Configuration file for VM:
  • Various details (e.g. 8192 bit RSA keys).
  • Disks from specific images.
  • Separate list of disks and partitions.
  • Partitioning is preseed syntax on Debian, Kickstart syntax on EL.
Tuning for Databases
  • Focus on Linux; distributions are quite different: EL for example had ktune and then tuned.
  • DBMS tend to be hungry for IO, then memory, then CPU.
  • For clouds Amazon offers "provisioned IOPS": 30,000 IOPS for up to a TB. FusionIO can do 120,000 IOPS.
  • MariaDB cannot take full advantage of that, as it currently tops out at 8KiB.
  • MariaDB's default page size is 16KiB, and the maximum has been increased to 64KiB.
  • Provisioned IO is very useful.
Types of storage
  • SAS, SATA, Ethernet (NAS, DRBD), SSD, Fusion IO.
  • MariaDB was recommending DRBD before having builtin replication.
  • Elevators: CFQ (which he says is good), NOOP, DEADLINE; DBs benefit from DEADLINE.
  • It is a good idea to use DEADLINE, and not just for DBs (see the example after this list).
  • MariaDB/MySQL is no longer single threaded in recent versions.
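Checking and switching the elevator for one disk looks like this (the device name is an example; it can also be set with the elevator= kernel parameter or a udev rule):
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler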
Filesystems:
  • Recommend ext4, XFS. Riak recommends ZFS, others use BtrFS.
  • Recommends the nobarrier and data=ordered mount options (see the example after this list).
  • Use battery-backed cache to optimize fsync.
  • Separate tables from logs.
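For example an /etc/fstab entry along these lines (device and mount point are illustrative; nobarrier trades durability for speed and only makes sense with a battery- or flash-backed write cache):
/dev/sdb1  /var/lib/mysql  ext4  noatime,data=ordered,nobarrier  0  2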
Storage:
  • SSDs give more IOPS, but need a recent RAID host adapter.
  • Flash cards have even more IOPS, and very low latency; you may want to revisit the InnoDB doublewrite buffer on them.
  • RAID0 is faster, with no redundancy. RAID1 has slow writes, fast reads. RAID5 has slow random writes, fast sequential writes, fairly fast reads, and very slow recovery.
  • RAID10 fast reads and writes.
  • LVM good for snapshots, but 40-80% fall in speed.
Memory:
  • Use ECC.
  • Beware the NUMA "swap insanity"; run the server under numactl --interleave=all.
  • The numa_interleave option does this on recent versions.
  • RAM caching is good, especially for relay logs; so are per-client buffers like sort buffers.
  • Transparent huge pages benefit some workloads; MongoDB does not benefit and requires them to be off.
  • It also turns out that malloc libraries make a huge difference.
  • Swappiness: set it to 0 to work around a bug in some kernels; decreasing it favours reclaiming unmapped page cache, increasing it favours swapping mapped memory (see the example after this list).
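For instance (a sketch; the values and the mysqld path are illustrative, and on newer kernels a swappiness of 1 is often preferred to 0):
# sysctl vm.swappiness=1
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# numactl --interleave=all /usr/sbin/mysqld --user=mysql &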
Networking:
  • tc and dropwatch.
  • Jumbo frames do not benefit normal workloads, but do benefit data warehouse ones (see the example after this list).
  • Galera cluster is great on low-latency networks, because the cluster is only as fast as its slowest node; on high-latency links tune send_window and user_send_window. Galera 4 will remove the limitations on transaction size.
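Enabling jumbo frames on one interface (the interface name is illustrative; every NIC and switch on the path must support the larger MTU):
# ip link set dev eth0 mtu 9000
$ ip link show dev eth0 | grep mtu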
Tuning approach:
  • Use tools to log and analyze slow queries (see the example after this list).
  • Reduce locking if you can.
  • Many tuner scripts are for old versions.
  • Virtualization: some proprietary tools do not emulate fsync.
  • Virtualization: AIO over threads.
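For example with MariaDB/MySQL (the slow log path depends on datadir and hostname; mysqldumpslow ships with the server, pt-query-digest is a separate tool):
$ mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1;"
$ mysqldumpslow -s t /var/lib/mysql/*-slow.log | head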
Containers and clouds:
  • Local/remote benchmarks can be very different. Average loss is 15%.
  • EC2 has instance limitations, and EBS has unpredictable performance.
  • RDS has similar performance to EBS RAID10.
Versions:
  • MariaDB/Percona builds recommended (e.g. for the threadpool).
  • Test everything. Hardware matters: bad batches of disks and NICs with very bad performance have been seen.
  • Don't disable SELinux. We work hard to make it work with Galera.
Benchmarking:
  • sysbench, LinkBench, the Yahoo! Cloud Serving Benchmark (YCSB); see the example after this list.
  • Most performance books are out of date.
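A typical sysbench run might look like this (a sketch in the pre-1.0 sysbench syntax current at the time; database name, user and table size are made up):
$ sysbench --test=oltp --mysql-db=sbtest --mysql-user=sbtest --oltp-table-size=1000000 prepare
$ sysbench --test=oltp --mysql-db=sbtest --mysql-user=sbtest --num-threads=16 --max-time=60 --max-requests=0 run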
Questions:
  • BtrFS not recommended by SUSE for DB.
  • Any tuning for SSDs? Recent releases do not need it.
  • Any advantage in splitting workloads with different profiles into two DBs? Maybe, but usually not.
nftables
Introduction:
  • By Richard Melville.
  • Background and overview.
  • By a user, not a developer.
  • Basic usage.
  • It ought to include the functionality of 4 filtering systems, for IPv4, IPv6, ARP and bridge "tables".
  • One tool called nft and conversion tools.
  • In the kernel since 3.13.
Background:
  • Part of the Netfilter project.
  • The project is about all the current "tables" tools.
  • The single tool has a syntax similar to that of BPF/tcpdump, rather than being option-based.
  • But uses the same hooks into the kernel.
Changes:
  • iptables has predefined tables and chains.
  • New-style address families look a bit like the old tables.
  • The inet family covers both IPv4 and IPv6 and has been available since 3.14; the netdev family was introduced in 4.2 and allows looking at packets even before prerouting.
  • Chains are not predefined; they are either base or non-base. Base chains work the same as before.
  • Base chains can be registered with hooks.
  • Non-base chains just group rules.
  • Rules belong to chains, and contain a filter expression and one or more actions, and have a unique handle.
  • Packet and byte counters are by default off.
Why better:
  • Smaller code, better syntax.
  • Simplified dual stack thanks to inet.
  • Several actions can be performed in a single rule.
  • Sets of elements, and dictionaries/maps.
Usage:
  • Small number of kernel modules, one per family; a minimal ruleset sketch follows.
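A minimal dual-stack ruleset sketch using the inet family (table and chain names and the accepted port are arbitrary):
# nft add table inet filter
# nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'
# nft add rule inet filter input iif lo accept
# nft add rule inet filter input ct state established,related accept
# nft add rule inet filter input tcp dport 22 counter accept
# nft list table inet filter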
CLI Perl for fun and profit
The speaker is from Shadowcat and LAMM Hackspace. About Moo:
  • 2/3 of Moose. OO system.
  • Backwards compatible, mostly forwards with Moose.
  • Written by mst.
Example class:
  • RFID door control. A class is a package.
  • Attributes declared with has, methods with sub.
About MooX::Options:
  • Creates CLI scripts thanks to a new option keyword for attributes or methods.
  • Semi-automates man pages.
  • Example shows a strings option definition.
  • Used with new_with_options instead of new for object creation; this parses @ARGV.
  • So far it only sets attributes from options, but one can add actions by checking the next remaining element in @ARGV.
About App:FatPacker:
  • Condenses everything into one file (see the sketch after this list). It is "fat" so as not to miss modules that are dynamically loaded.
  • The whole of CPAN is 10MiB compressed, includes 10,000 modules.
  • Requires Perl 5.10 or newer (about 10 years old now); it also works with 5.8 together with MRO::Compat.
  • You cannot use XS modules, as they are machine dependent; it refuses to pack them.
  • Small code modifications may be required, replace loadnamespaces.
  • The fatpack command actually is a wrapper for the perl command.
  • You need a full Perl, that is perl plus the core modules, which is the default on Debian etc.
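A rough sketch of packing a script (the script name is made up; older App::FatPacker versions need the longer trace / packlists-for / tree / file sequence instead of the one-shot pack subcommand):
$ fatpack pack doorcontrol.pl > doorcontrol.packed.pl
$ perl doorcontrol.packed.pl --help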

160130 Sat: Why desktops are going to stay

While looking at the innumerable logs of a very nice new Dell XPS 13 (1, 2, 3) I noticed several entries where CPU speed had been throttled because of rising temperature, which did not surprise me because:

This indicates why desktops are not going to go away: for high CPU or GPU intensive workloads, like reading Google Mail or playing 3D games, desktops can keep a running CPU at a lower temperature with active cooling, and since they do not need to be in contact with the body of the user, can run it hotter if needed.

Avoiding getting hot is quite a fundamental issue with portable devices that are held in contact with parts of the user's body, and it is not going to be solved easily. I am indeed typing this on a laptop on a train and it is keeping my thighs rather uncomfortably hot because I have forgotten to run:

killall -STOP firefox chromium-browser

Conversely any workload that does not involve sustained high CPU or GPU work can be run very conveniently on portable or passively cooled devices, and therefore even many server workloads could be run on laptops or mini-servers.

Ironically it is web user interfaces that keep CPUs busy and thus impact portable computers.

160119 Sat: Storage with nearly crossbar-switching

During a chat about storage for big data someone ventured that Isilon storage clusters have a fairly wide and somewhat isotropic performance envelope, to which my comment was that they are expensive and still have moderately frequent corner cases where they do not work too well.

This was acknowledged, in particular that they have a cost per TB probably more than four times that of a baseline storage cluster with a narrow performance envelope based on large slow drives with low IOPS-per-TB (excluding from the comparison the Isilon archival product using large slow drives).

That factor of four does not buy a really clever design: it is a simple design based on a fast, very low latency network which allows both the use of parity RAID-like redundancy, and for every backend to be a frontend, via a distributed data directory.

Put another way, it has a very low latency quasi-broadcast data and metadata interconnect. That is not quite but somewhat close to a mostly-scalable non-blocking low-latency crossbar-switch (1, 2, 3) (an isotropic interconnect) and inasmuch it is not a proper one it still allows pathological cases where hotspots happen (because of residual anisotropy, especially in the storage elements it connects).

It is very hard to compete with something that approximates a crossbar-switch because:

Note: the realized performance envelope is the intersection of that of the workload and that of the machinery on which it runs. A machinery with an isotropic performance envelope can therefore run many different workloads well, and can thus be considered to be grossly overprovisioned in quality rather than capacity, because each workload only utilizes it partially.

However mostly low-latency, mostly isotropic interconnects like the one used by Isilon are rarely useful, because it rarely happens that what matters is median or maximum latency, as opposed to average latency, or that workloads with very different anisotropic profiles are run against the same machinery.

This is because it is usually possible to segment different workloads onto differently structured machinery, that is to match the anisotropic workload profile with a similarly anisotropic machinery performance envelope.

So there are three base choices that could be somewhat equivalent at least in cost terms:

The Isilon products implement a mix of the first and second choices, because they do not actually have an isotropic crossbar-switch, but a somewhat more anisotropic interconnect plus more capacity to expand the performance envelope.

That is an interesting choice that seems optimal if workloads are known to have several differently anisotropic envelopes, or if it is not known in advance whether they do. But my impression is that usually workload performance envelopes are known in advance, and therefore having multiple machineries to match them is cheaper, and some of the difference in cost can be spent on overcapacity for those. The advantage here is that scaling anisotropic machinery is much easier.

160116 Sat: Some good transfer rates of USB3 UASP devices

Having been impressed recently by the transfer rate of an M.2 flash SSD and that of a 32GB USB3 flash key, I have also been impressed by the functionality and transfer rates of some USB3 UASP storage devices, in particular an external USB3 UASP NAS disk device:

$ sudo sysctl vm/drop_caches=1; sudo dd bs=1M count=3000 if=/dev/sdc of=/dev/zero
vm.drop_caches = 1
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 12.9986 s, 242 MB/s

and a StarTech.com SDOCKU33BV dock that reaches higher with a flash SSD capable of 500MB/s on a native SATA 6Gb/s bus:

# sysctl vm/drop_caches=1; dd bs=1M count=3000 if=/dev/sda of=/dev/zero
vm.drop_caches = 1
3000+0 records in
3000+0 records out
3145728000 bytes (3.1 GB) copied, 6.60298 s, 476 MB/s

This latter, with a suitable motherboard, thanks to UASP is not only much faster (and more reliable and better specified) than USB3 with the traditional USB Mass Storage protocol, but also works with the smartctl and hdparm tools to offer the same storage maintenance options as native SATA/SAS. UASP is in that sense similar to the ancient ATAPI protocol, which allowed the same SCSI command set to be used over the ATA bus.
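
For example the docked drive can be queried as if it were a native SATA one (the device name is illustrative; with some bridges smartctl needs an explicit -d sat):

$ sudo smartctl -d sat -a /dev/sdc | head
$ sudo hdparm -I /dev/sdc | head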

160115 Fri: Some side benefits of my recent desktop CPU upgrade

The faster single-CPU speed resulted in a noticeable increase in reported disk transfer rates, for example from 170MB/s to 210MB/s. That means that disk transfer rates are somewhat CPU bound, which is not entirely news, but still surprises me a bit.

But mostly having 8 CPUs is also good for backups with pbzip2 (or pigz, lbzip2, pixz, ...) instead of lzop:

Tasks: 565 total,   1 running, 564 sleeping,   0 stopped,   0 zombie
%Cpu0  : 99.0 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  : 92.4 us,  6.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu2  : 97.7 us,  2.0 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 96.4 us,  3.0 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 95.7 us,  4.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 95.0 us,  5.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 86.4 us, 13.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  16395708 total, 14381516 used,  2014192 free,   100476 buffers
KiB Swap:        0 total,        0 used,        0 free. 11103208 cached Mem

  PID  PPID USER      PR  NI    VIRT    RES    DATA  %CPU %MEM     TIME+ TTY      COMMAND
 3287 11565 root      20   0  646392  36104  629832 754.0  0.2   6:37.91 pts/15   pbzip2 -2
 3288 11565 root      20   0    4412    788     348  15.9  0.0   0:08.16 pts/15   aespipe -e a+
 3286 11565 root      20   0   47868   8712    5248  10.9  0.1   0:08.07 pts/15   tar -c -b 64+
But it is still only around 50MB/s:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 8  0      0 170052  13780 13806248    0    0 53288     0 4070 4984 95  5  0  0  0
 9  1      0 163232  13780 13815528    0    0 54724 30788 4541 4894 95  6  0  0  0
 8  0      0 166176  13780 13815628    0    0 52708 18320 5333 5958 94  6  0  0  0
 8  0      0 181432  13780 13802544    0    0 46964 10008 6236 5549 95  5  0  0  0
 8  0      0 151932  13780 13833064    0    0 43600     0 3739 4572 96  4  0  0  0
 8  1      0 169064  13780 13828516    0    0 51300 44992 4960 5693 95  5  0  0  0
 8  0      0 175976  13780 13824516    0    0 47904 18960 4200 4673 95  5  0  0  0
13  0      0 150420  13780 13855808    0    0 50356     0 4069 4882 95  5  0  0  0
 8  0      0 156180  13780 13851672    0    0 50348     0 4277 5703 96  4  0  0  0
 8  0      0 153664  13780 13856964    0    0 51424     0 4318 5383 95  5  0  0  0

Actually pigz is a lot faster than that and in most cases does not even need 8 CPUs.
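
For reference, the pipeline behind the top listing above is roughly the following (a sketch: the tar and aespipe arguments are truncated in the listing, so the source path, key file and target shown here are illustrative):

# tar -c -b 64 /home | pbzip2 -2 | aespipe -e aes256 -P /root/backup.pass > /mnt/backup/home.tar.bz2.aes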

Having 8 CPUs is also good for running backups between two pairs of drives, especially encrypted ones, or when using RSYNC over checksumming filesystems like BtrFS.

160114 Fri: An amusing demand about 'fsync', and more about 'fsync'

The general topic of fsync and proper storage system design is fascinating and full of controversy, and I have been considering a post on it for many years, but in the meantime here is an illustration of its depth. The basic issue is that:

The issue is to define "temporarily". In a topical thread there is a discussion of this issue, and in it I see a typical big misunderstanding of the issues, worded as effectively a demand for O_PONIES:

The operation I want to do is: 1. Apply changes to the store 2. Wait until all of those writes hit disk 3. Delete the temporary file I do not care if step 2 takes 5 minutes, nor do I want the kernel to schedule my writes in any particular way. If you implement step 2 as a fsync() (or fdatasync()) you're having a potentially huge impact on I/O throughput. I've seen these frequent fsync()s cause 50x performance drops!

The request here is the desire that step 2 should last as long as possible, to minimize the frequency of the expensive fsync implementation, but also should happen just before any system or device issue that might cause data loss. Amazing insight: indeed fsync is not needed except before an actual data loss situation. The difficulty is knowing when a data loss situation might be about to happen.

Also the idea that fsync impacts performance instead of speed is based on the usual misunderstanding of what performance is.

This comment is far more interesting:

For example, because of this fsync() issue, and the fact that fsync() calls flush all outstanding writes for the entire filesystem the file(s) being fsync()'ed reside upon, I've set up my servers such that my $PGDATA/pg_xlog directory is a symlink from the volume mounted at (well, above) $PGDATA to a separate, much smaller filesystem.
(That is: transaction logs, which must be fsync()'ed often to guarantee consistency, and enable crash recovery, reside on a smaller, dedicated filesystem, separate from the rest of my database's disk footprint.)

If I didn't do that, at every checkpoint, my performance would measurably fall.

Note: here the conclusion performance would measurably fall is proper, because by separating files with different requirements better speed is obtained without an equivalent decrease in effective safety, that is the performance envelope has actually been expanded.

In part because it makes the excellent point of using storage systems with different profiles for data with different requirements.

In part because it is interesting to note that the statement that fsync() calls flush all outstanding writes for the entire filesystem the file(s) being fsync()'ed reside upon is not always right: for some filesystems that need not happen, such as most of those that do some form of copy-on-write, for example for Btrfs:

  • fsync(file) only writes metadata for that one file
  • fsync(file) does not trigger writeback of any other data blocks

Anyhow for fsync-intensive workloads Btrfs has a significant advantage in a common case, one that pushes the performance envelope wider: since it can easily make snapshots it can process multiple fsync operations in parallel as a single transaction rather than in series as a carefully ordered sequence of transactions:

There are also file system limitations to consider: btrfs is not quite stable enough for production, but it has the ability to journal and write data simultaneously, whereas XFS and ext4 do not.

Important Since Ceph has to write all data to the journal before it can send an ACK (for XFS and EXT4 at least), having the journal and OSD performance in balance is really important!

The reason is that while metadata updates must only happen after the relevant data updates, metadata updates by copy only take effect when higher level metadata are committed, up to the top of the tree.
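
As a footnote, the pg_xlog relocation described in the quote above is just a small dedicated filesystem plus a symlink; a hedged sketch, with the server stopped and the device and mount point made up:

# mkfs.ext4 /dev/sdd1 && mkdir /pg_xlog_fs && mount /dev/sdd1 /pg_xlog_fs
# mv $PGDATA/pg_xlog /pg_xlog_fs/pg_xlog && chown -R postgres:postgres /pg_xlog_fs
# ln -s /pg_xlog_fs/pg_xlog $PGDATA/pg_xlog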

160110 Sun: First OLED desktop monitor for retail sale

The first desktop monitor for retail sale with an OLED display is the Dell UP3017Q and the display is a 30in 3840×2160 one.

The monitor list price is $5,000 which is quite high, especially considering that a television with a 55in 3840×2160 display lists for less than half that at around $2,400 (1, 2); this may be the difference between mass market and professional pricing.

160109 Sat: Laptop market segmentation indicators

Having previously remarked that it is useful when buying something to understand the market segmentation tactics of vendors, I have been looking for a new laptop for myself, so I had a look at a popular online laptop shop that helpfully lists many laptops allowing selection by attribute and listing the number of laptops in their catalog with that attribute. My impressions are that:

This said, my idea of a good contemporary laptop is an average corporate laptop with 13.3in display, 8GiB of RAM and 256GB of flash SSD, typically used with an external display and keyboard and mouse because:

So my usual preference is to buy an average corporate laptop from a brand known to take care about them (usually Toshiba, but I also like Lenovo and Dell), and to separately buy 8GiB of RAM and a 256GB flash SSD and upgrade. In part because I do not really need something like an Ultrabook and I do not particularly like their non-upgradeability or non-repairability.

This is what I did last time, in 2010, as I bought a Toshiba Satellite Pro L630 with the cheapest configuration I could find, with 2GiB of RAM and a 250GB disk drive, and upgraded it to 8GiB of RAM and a 256GB flash SSD, reusing the disk drive for backups.

While that laptop is now 5 years old, it still works very well. I have also put in a bigger battery that gives an autonomy of 8 hours. The only limitations seem to be that the SATA chipset only supports 3Gb/s (300MB/s), and that the i3 370M 2.4GHz is a bit dated as it draws more power than recent CPUs, has a slower older graphics core, and lacks AES acceleration and VT-d IO acceleration. Conversely it is possible to upgrade memory and storage without fully opening up the laptop shell.

But then I do not usually run VMs or 3D graphics programs on that laptop, nor so much IO that software AES is that noticeable, and the flash SSD currently in it is not capable of higher bus speeds than 3Gb/s anyhow.

While Toshiba currently seems to have repositioned the Satellite Pro brand for what I call consumer oriented laptops, the Tecra brand models comparable to the Satellite Pro L630 are essentially equivalent to it, with much the same speed and features, except for the lower power, more advanced CPU, and much the same price. It looks to me that there has been very little progress in laptop products in 5 years, both as to features and as to price.

I also looked at ease of upgrading RAM, storage and battery, which are for me important concerns: Lenovo designed laptops tend to have 1 RAM slot and to require removing the bottom of the case, Toshiba designed laptops seem to have usually 2 RAM slots of which one is in use, Dell designed laptops (1, 2) seem to be more commonly old-style and have easily removed and replaced disks, often in a caddy, and to have 2 RAM slots, sometimes under a dedicated flap.

Overall I am fairly hesitant, as I do not yet see much point in an upgrade until the L630 breaks.

160103 Sun: Desktop CPU upgrade "thanks" to JavaScript

So I upgraded my current (quite average) desktop, replacing its AMD Phenom II X3 720 chip, which has 3× 2.8GHz CPUs, with a new FX 8370E chip, which has 8× 3.3GHz CPUs. Total power consumption is supposed to be the same and so are most features.

I did not replace it because of the increase in CPU clock frequency (faster CPUs could be bought in that price range, but I went for a lower-power-draw CPU), but solely because of the extra number of CPUs. That seems strange, because there are few applications that can usefully keep 8 threads busy, but the reason is instead JavaScript based web sites, as each consumes around one CPU.

With only 3 CPUs interactive work becomes sluggish. While I usually run my browser with JavaScript disabled, this cannot always happen, as some sites, including online shops, not just Google and Tumblr and Flickr and the like, are JavaScript based. AJAX-using sites that implement dynamically self-updating pages are particularly bad, and that describes many web applications.

Ideally I would be able to just suspend particular tabs in a window or even a whole window, but probably browser developers have systems with dozens of CPUs, so they do not feel the need to scratch that itch. I have found an awkward alternative, which is to use a separate Firefox instance with a distinct profile just for the worst sites; then I can freeze that with kill -STOP, but I have to find out its process number first.
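
Finding and freezing that second instance can be scripted along these lines (a sketch; the profile name is made up):

$ firefox --no-remote -P heavyjs &
$ kill -STOP $(pgrep -f 'firefox.*-P heavyjs')
$ kill -CONT $(pgrep -f 'firefox.*-P heavyjs')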

Update 170616: some advertising sites also inject ads that are very CPU heavy, not just JavaScript, but also movie players and animations on loop.