Computing notes 2017 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


2017 December

171201 Fri: Misconfiguration of SATA chip made flash SSD look slower

So I was disappointed previously that my laptop flash SSD's speed seemed to have halved, and I noticed recently that the unit was in SATA1 instead of SATA2 mode, and I could not change that even by disabling power saving with the PowerTOP tool or with the hdparm tool; I then suspected that the SATA chipset was resetting itself down to SATA1 mode on resuming from suspend-to-RAM, but a fresh boot did not change anything. Then I found out that I had at some point set in the laptop BIOS an option to save power on the SATA interface. Once I unset it the flash SSD resumed working at the top speed of the laptop's SATA2 interface of around 250MB/s. It looks like the BIOS somehow sets the SATA interface to SATA1 speeds in a way that the Linux based tools cannot change back.
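
For reference, the negotiated link speed can be checked from Linux like this (the device name is an assumption):

dmesg | grep -i 'SATA link up'                  # e.g. "1.5 Gbps" (SATA1) or "3.0 Gbps" (SATA2)
hdparm -I /dev/sda | grep -i 'signaling speed'  # the generations the drive itself supports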

Also I retested the SK Hynix flash SSD and it is back to almost full nominal speed at 480MB/s reading instead of 380MB/s, I guess because of garbage collection in the firmware layer.
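
A rough sequential read rate is easy to sample (again the device name is an assumption):

hdparm -t /dev/sda                                          # buffered sequential reads
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct   # or bypass the page cache with dd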

In most use, that is for regular-sized files, the change in transfer rate does not result in noticeable changes in responsiveness, which is still excellent, because the real advantage of flash SSDs for interactive use is in their much higher random IOPS compared to disks, more than in their somewhat higher transfer rates.
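
The random-IOPS gap is easy to measure with fio; a minimal sketch (the file name and sizes are arbitrary choices):

fio --name=randread --filename=/tmp/fio.test --size=1g \
  --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
  --direct=1 --runtime=30 --time_based
# a flash SSD typically reports tens of thousands of 4KiB read IOPS here,
# a rotating disk well under 200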

Overall I am still very happy with the three flash SSDs I have, which were chosen among those with a good reputation, as none of them has had any significant issues in several years and they all report having at least 95% of their lifetime writes available. This is one of the few cases where I have bought three different products from different manufacturers and all three have been quite good. The Samsung 850 Pro feels a bit more responsive than the others but they are all quite good, even the nearly 6 years old Micron M4.

2017 November

171118 Sat: Some recent news about "security"

First a list of recent and semi-recent successful hacks that were discovered:

These are rather different stories because not all hacks are equal: those that generate revenue or offer the opportunity to generate revenue are much worse than the others:

Overall the real import of the above issues is very varied. And they are the least important ones, because they were discovered: the really bad security issues are those that never get found. Knowing about a security breach is much better than not knowing about it.

As to this, phishing has evolved from banal e-mails to entire fake sites quite a while ago:

Phishing kits contain prepackaged fake login pages for popular and valuable sites, such as Gmail, Yahoo, Hotmail, and online banking. They're often uploaded to compromised websites, and automatically email captured credentials to the attacker's account.

Phishing kits enable a higher rate of account hijacking because they capture the same details that Google uses in its risk assessment when users login, such as victim's geolocation, secret questions, phone numbers, and device identifiers.

Security services use the same technique for entrapment, which is a form of phishing:

Australian police secretly operated one of the dark web’s largest child abuse sites for almost a year, posing as its founder in an undercover operation that has triggered arrests and rescues across the globe.

The sting has brought down a vast child exploitation forum, Childs Play, which acted as an underground meeting place for thousands of paedophiles.

Obviously this is just one case among very many that are as yet undisclosed. There may be few qualms about entrapment of paedophiles (except that it facilitates their activities for a time), but consider the case of a web site for North Korean political dissenters actually run by the North Korean secret police, or a system security discussion forum for bank network administrators run by a Russian hacker gang.

Never mind phone-home ever-listening devices like digital assistants or smart television monitors (1, 2), or smartwatches that act as remote listening devices (1, 2). The issue with these is not just that they phone home to the manufacturers, but that they may be easy to compromise by third parties, and then they would phone home to those third parties, with potentially vast consequences. For example it is not widely known that passwords can be easy to figure out from the sound of typing them, especially if they are typed repeatedly.

So as always there are two main types of security situations: those where one is part of a generic attack against a large number of mostly random people with a low expected rate of success, where what matters is not to be among the successes; and those targeted against specific individuals or groups of high value, where what matters is either not to be of high value, or not to mark oneself out as being, or appearing to be, of high value.

171117 Fri: Some interesting tools

I have belatedly become aware of some tools that seem interesting but I have not used them yet:

171113 Mon: How to switch to Firefox 57 Quantum

So I have finally switched to Mozilla's new Firefox 57 Quantum, a largely re-engineered jump away from the previous codebase.

The re-engineering was supposed to deliver more portable add-ons and faster rendering of web pages, as well as making multi-process operation common. In all it seems to deliver: the new-style add-ons are no longer based on rewriting parts of the Firefox user interface internal XUL code itself, but on published internal interfaces, mostly compatible with those of Google's Chrome; and indeed page loading and rendering are much snappier, and multi-process operation means that in cases like memory overflow only some tabs get killed. There are however a few very important issues in switching, which make it non-trivial:

The solutions are:

Having done the above, Firefox 57 works pretty well, and in large part thanks to Tab Suspender and Policy Control, and in part thanks to its improved implementation, it is far quicker and consumes much less memory, as the new add-ons are stricter than those they replaced. There are minor UI changes, and they seem to be broadly an improvement, even if a slight one.

171109 Thu: Btrfs status

I spend a fair bit of time helping distraught users on IRC channels, in part as community service, in part to keep in touch with the contemporary issues of ordinary users, and I was asked today for a summary of what Btrfs is currently good for and what its limitations are, so I might as well write it here for reference.

The prelude to a few lists is a general issue: Btrfs has a sophisticated design with a lot of features, and regardless of current status its main issue is that its many possible aspects and its complex tradeoffs are difficult both to explain and to understand. No filesystem design is maintenance-free or really simple, but Btrfs (and ZFS and XFS too) is really not fire-and-forget.

Because of that complexity I have decided to create a page of notes dedicated to Btrfs with part of its contents extracted from the Linux filesystems notes and with my summary of how best to use it.

2017 October

171014 Sat: Mobile phone network hacking and more on security

As non-technical people ask me about the metaphysical subject of computer security my usual story is that every possible electronic device is suspicious, and if programmable it must be assumed to have several backdoors put in by various interested parties, but they are not going to be used unless the target is known to be valuable.

It is not just software or firmware backdoors: as a recent article on spyware dolls and another one reporting WiFi chips in electric irons suggest, it is very cheap to put some extra electronics on the circuit boards of various household items, either officially or unofficially, and so probably these are quite pervasive.

As I wrote previously, being (or at least being known as) a target worth only a low budget is a very good policy, as otherwise the cost of preventive measures can be huge; for valuable targets, such as large Bitcoin wallets, even two-factor authentication via mobile phones can be nullified, because it is possible to subvert mobile phone message routing. The same article strongly makes the point that being known as a valuable target attracts unwanted attempts, and that in particular Bitcoin wallets are identified by IP address, and it is easy to associate addresses with people.

In the extreme case of Equifax, which held the full details of hundreds of millions of credit card users, the expected sale value is dozens of millions of dollars, which may justify investments of millions of dollars to acquire them, and it is very difficult to protect data against someone with a budget like that, which allows paying for various non-technical but very effective methods.

The computers of most non-technical people and even system administrators are not particularly valuable targets, as long as they keep the bulk of their savings, if any, in offline accounts, and absolutely never write down (at least not on a computer) the PINs and passwords to their online savings accounts. Even so a determined adversary can grab those passwords when they are used, indirectly, by installing various types of bugs in a house or a computer before it gets delivered, but so-called targeted operations have a significant cost which is not worth spending on average-income people.

So my usual recommendation is to be suspicious of any electrical, not just electronic, device, and use them only for low value activities not involving money and not involving saleable assets like lists of credit card numbers.

171004 Wed: C vs. C++ and the cost of system level features

I have been quite fond of using the Btrfs filesystem, but only for its simpler features and in a limited way. But on its IRC channel and mailing list I often come across many less restrained users who get into trouble of some sort or another by using arbitrary combinations of features, expecting them to just work, and fast too.

That trouble is often bugs, because there are many arbitrary combinations of many features, and designing so that all of them behave correctly and sensibly, never mind testing them all, is quite hard work.

But often that trouble relates to performance and speed, because the performance envelope of arbitrary combinations of features can be quite anisotropic indeed; for example in Btrfs creating snapshots is very quick, but deleting them can take a long time.
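
For example (the paths are hypothetical, and /data is assumed to be a subvolume), creating a snapshot returns almost instantly, while deleting one only queues the work, which can then run for a long time:

btrfs subvolume snapshot -r /data /data/.snapshots/snap-1   # near-instant
btrfs subvolume delete /data/.snapshots/snap-1              # returns quickly...
btrfs subvolume sync /data     # ...but this waits for the actual cleanup to finish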

This reminded me of the C++ programming language, which also has many features and makes possible many arbitrary combinations, many of them unwise, and in particular one issue it has compared to the C programming language: in C every language feature is designed to have a small, bounded, easily understood cost, but many features of C++ can involve very large space or time costs, often surprisingly, and even more surprising can be the cost of combinations of features. This property of C++ is common to several programming languages that try to provide really advanced features as if they were simple and cheap, which of course is something that naive programmers love.

Btrfs is very similar: many of its advanced features sound like the filesystem code just does it but involve potentially enormous costs at non-trivial scale, or are very risky in case of mishaps. Hiding great complexity inside seemingly trivial features seems to lead astray a lot of engineers who should know better, or who believe that they know better.

I wish I could say something better than this: it is very difficult to convey appropriately an impression of the real cost of features when they are presented elegantly. It seems to be a human failing to assume that things that look elegant and simple are also cheap and reliable; just like in movies the good people are also handsome and look nice, and the bad people are also ugly and look nasty. Engineers should be more cynical, but many are not, or sometimes do not want to be, as that may displease management, who usually love optimism.

2017 September

170918 Mon: Complexity and maintainability and virtualization

One of the reasons why I am skeptical about virtualization is that it raises the complexity of a system: a systems engineer has then to manage not just systems running one or more applications, but also systems running those systems, creating a cascade of dependencies that are quite difficult to investigate in case of failure.

The supreme example of this design philosophy is OpenStack which I have worked on for a while (lots of investigating and fixing), and I have wanted to create a small OpenStack setup at home on my test system to examine it a bit more closely. The system at work was installed using MAAS and Juju, which turned out to be quite unreliable too, so I was advised to try the now-official Ansible variant of the Kolla (1, 2, 3, 4) setup method.

Kolla itself does not set anything up: it is a tool to build Docker deployables for every OpenStack service. The two variants are then for deployment: using either Ansible or Kubernetes, the latter being another popular buzzword along with Docker and OpenStack. I chose the Ansible installer as I am familiar with Ansible, and also because I wanted to do a simple install with all relevant services on a single host system, without installing Kubernetes too.

It turns out that the documentation as usual is pretty awful: very long on fantastic promises of magnificence, while obfuscating the banal reality:

Note: one of the components and Dockerfiles is kolla-toolbox, a new fairly small component for internal use.

The main value of kolla and kolla-ansible is that someone has already written the 252 Dockerfiles for the 67 components, and Ansible roles for them, and keeps somewhat maintaining them, as OpenStack components change.
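
For reference, once the images are built the all-in-one deployment reduces to a handful of commands; this is a sketch under my assumptions about paths and options (they depend on the kolla-ansible version and on how it was installed):

pip install kolla-ansible
cp /usr/share/kolla-ansible/etc_examples/kolla/* /etc/kolla/
kolla-genpwd                          # fills in /etc/kolla/passwords.yml
# edit /etc/kolla/globals.yml: base distro, install type, network interfaces
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one bootstrap-servers
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one prechecks
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one deploy
kolla-ansible post-deploy             # writes an admin openrc file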

Eventually my minimal install took 2-3 half-days, and resulted in building 100 Docker images, taking up around 11GiB, and running 29 Docker instances:

soft# docker ps --format 'table {{.Names}}\t{{.Image}}' | grep -v NAMES | sort 
cron                        kolla/centos-binary-cron:5.0.0
fluentd                     kolla/centos-binary-fluentd:5.0.0
glance_api                  kolla/centos-binary-glance-api:5.0.0
glance_registry             kolla/centos-binary-glance-registry:5.0.0
heat_api_cfn                kolla/centos-binary-heat-api-cfn:5.0.0
heat_api                    kolla/centos-binary-heat-api:5.0.0
heat_engine                 kolla/centos-binary-heat-engine:5.0.0
horizon                     kolla/centos-binary-horizon:5.0.0
keystone                    kolla/centos-binary-keystone:5.0.0
kolla_toolbox               kolla/centos-binary-kolla-toolbox:5.0.0
mariadb                     kolla/centos-binary-mariadb:5.0.0
memcached                   kolla/centos-binary-memcached:5.0.0
neutron_dhcp_agent          kolla/centos-binary-neutron-dhcp-agent:5.0.0
neutron_l3_agent            kolla/centos-binary-neutron-l3-agent:5.0.0
neutron_metadata_agent      kolla/centos-binary-neutron-metadata-agent:5.0.0
neutron_openvswitch_agent   kolla/centos-binary-neutron-openvswitch-agent:5.0.0
neutron_server              kolla/centos-binary-neutron-server:5.0.0
nova_api                    kolla/centos-binary-nova-api:5.0.0
nova_compute                kolla/centos-binary-nova-compute:5.0.0
nova_conductor              kolla/centos-binary-nova-conductor:5.0.0
nova_consoleauth            kolla/centos-binary-nova-consoleauth:5.0.0
nova_libvirt                kolla/centos-binary-nova-libvirt:5.0.0
nova_novncproxy             kolla/centos-binary-nova-novncproxy:5.0.0
nova_scheduler              kolla/centos-binary-nova-scheduler:5.0.0
nova_ssh                    kolla/centos-binary-nova-ssh:5.0.0
openvswitch_db              kolla/centos-binary-openvswitch-db-server:5.0.0
openvswitch_vswitchd        kolla/centos-binary-openvswitch-vswitchd:5.0.0
placement_api               kolla/centos-binary-nova-placement-api:5.0.0
rabbitmq                    kolla/centos-binary-rabbitmq:5.0.0

Some notes on this:

Most of the installation time was taken up by figuring out the rather banal nature of the two components, by de-obfuscating the documentation, by realizing that the prebuilt images in the Kolla Docker hub repository were rather incomplete and old and would not do, so that I had to overbuild images, but mostly by working around the inevitable bugs in both the tools and OpenStack.

I found that the half a dozen blocker bugs that I investigated were mostly known from years ago, as often happens, and that was lucky, as most of the relevant error messages were utterly opaque if not misleading, but would eventually lead to a post by someone who had investigated the same issue.

Overall 1.5 days to setup a small OpenStack instance (without backing storage) is pretty good, considering how complicated it is, but the questions are whether it is necessary to have that level of complexity, and how fragile it is going to be.

170917 Sun: Some STREAM and HPL results from some not-so-new systems

On a boring evening I have run the STREAM and HPL benchmarks on some local systems, with interesting results, first for STREAM:

AMD Phenom X3 720 3 CPUs 2.8GHz
over$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5407.0     0.296645     0.295911     0.299323
Scale:           4751.1     0.338047     0.336766     0.344893
Add:             5457.5     0.439951     0.439759     0.440204
Triad:           5392.4     0.445349     0.445068     0.445903
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
intel i3-370M 2-4 CPUs 2.4GHz
tree$ ./stream.100M                                                            
-------------------------------------------------------------                   
STREAM version $Revision: 5.10 $                                                
...
Function    Best Rate MB/s  Avg time     Min time     Max time                  
Copy:            5668.9     0.299895     0.282241     0.327760                  
Scale:           5856.8     0.305240     0.273185     0.363663                  
Add:             6270.9     0.404547     0.382720     0.446088                  
Triad:           6207.4     0.408687     0.386636     0.456758                  
-------------------------------------------------------------                   
Solution Validates: avg error less than 1.000000e-13 on all three arrays        
-------------------------------------------------------------
AMD FX-6100 3-6 CPUs 3.3GHz
soft$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5984.0     0.268327     0.267381     0.270564
Scale:           5989.1     0.269746     0.267154     0.279534
Add:             6581.8     0.366100     0.364640     0.371339
Triad:           6520.0     0.374828     0.368098     0.419086
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
AMD FX-8370e 4-8 CPUs 3.3GHz
base$ ./stream.100M 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6649.8     0.242686     0.240608     0.244452
Scale:           6427.1     0.251241     0.248944     0.257000
Add:             7444.9     0.324618     0.322367     0.327456
Triad:           7522.1     0.322253     0.319058     0.324474
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Xeon E3-1200 4-8 CPUs at 2.4GHz (Update 170920)
virt$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9961.3     0.160857     0.160621     0.160998
Scale:           9938.1     0.161200     0.160997     0.161329
Add:            11416.7     0.210518     0.210218     0.210647
Triad:          11311.7     0.212472     0.212170     0.214260
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

The X3 720 system has DDR2, the others DDR3; obviously it is not making a huge difference. Note that the X3 720 is from 2009, the i3-370M is in a laptop from 2010, and the FX-8370e is from 2014, so a span of around 5 years.
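
Note: the stream.100M binary is STREAM version 5.10 built with a 100-million-element array (the timings above are consistent with that: a Copy moves 2 × 800MB), presumably along these lines (the exact compiler flags are my assumption):

gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M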

Bigger differences for HPL; run single-host with mpirun -np 4 xhpl; the last 8 tests are fairly representative, and the last column (I have modified the original printing format from %18.3e to %18.6f) is GFLOPS:

AMD Phenom X3 720 3 CPUs 2.8GHz
over$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.011727
WR00C2R4    35   4   4   1     0.00     0.012422
WR00R2L2    35   4   4   1     0.00     0.011343
WR00R2L4    35   4   4   1     0.00     0.012351
WR00R2C2    35   4   4   1     0.00     0.011562
WR00R2C4    35   4   4   1     0.00     0.012483
WR00R2R2    35   4   4   1     0.00     0.011615
WR00R2R4    35   4   4   1     0.00     0.012159
intel i3-370M 2-4 CPUs 2.4GHz
tree$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.084724
WR00C2R4    35   4   4   1     0.00     0.089982
WR00R2L2    35   4   4   1     0.00     0.085233
WR00R2L4    35   4   4   1     0.00     0.088178
WR00R2C2    35   4   4   1     0.00     0.085176
WR00R2C4    35   4   4   1     0.00     0.092192
WR00R2R2    35   4   4   1     0.00     0.086446
WR00R2R4    35   4   4   1     0.00     0.092728
AMD FX-6100 3-6 CPUs 3.3GHz
soft$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.074744
WR00C2R4    35   4   4   1     0.00     0.074744
WR00R2L2    35   4   4   1     0.00     0.073127
WR00R2L4    35   4   4   1     0.00     0.075299
WR00R2C2    35   4   4   1     0.00     0.072952
WR00R2C4    35   4   4   1     0.00     0.076627
WR00R2R2    35   4   4   1     0.00     0.076052
WR00R2R4    35   4   4   1     0.00     0.073127
AMD FX-8370e 4-8 CPUs 3.3GHz
base$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.152807
WR00C2R4    35   4   4   1     0.00     0.149934
WR00R2L2    35   4   4   1     0.00     0.150643
WR00R2L4    35   4   4   1     0.00     0.152807
WR00R2C2    35   4   4   1     0.00     0.132772
WR00R2C4    35   4   4   1     0.00     0.160093
WR00R2R2    35   4   4   1     0.00     0.163582
WR00R2R4    35   4   4   1     0.00     0.158305
Xeon E3-1200 4-8 CPUs at 2.4GHz (Update 170920)
virt$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00      0.3457
WR00C2R4    35   4   4   1     0.00      0.3343
WR00R2L2    35   4   4   1     0.00      0.3343
WR00R2L4    35   4   4   1     0.00      0.3104
WR00R2C2    35   4   4   1     0.00      0.3418
WR00R2C4    35   4   4   1     0.00      0.3622
WR00R2R2    35   4   4   1     0.00      0.3497
WR00R2R4    35   4   4   1     0.00      0.3756

Note: the X3 720 and the FX-6100 are running SL7 (64 bit), and the i3-370M and FX-8370e are running ULTS14 (64 bit). On the SL7 systems I got slightly better GFLOPS with OpenBLAS, but on the ULTS14 systems it was 100 times slower than ATLAS plus BLAS.

Obvious here is that the X3 720 was not very competitive in FLOPS even with a laptop CPU like the i3-370M, and that at least on floating point intel CPUs are quite competitive. The reason is that most compilers optimize/schedule floating point operations specifically for what works best on intel floating point internal architectures.

Note: the site cpubenchmark.net rates using PassMark® the X3 720 at 2,692, the i3-370M at 2,022, the FX-6100 at 5,412 and the FX-8370e at 7,782, which takes into account also non-floating point speed and the number of CPUs, and these ratings seem overall fair to me.

It is however fairly impressive that the FX-8370e is still twice as fast as the FX-6100 at the same GHz, and it is pretty good on an absolute level. However I mostly use the much slower i3-370M in the laptop, and for interactive work it does not feel much slower.

170904 Fri: A developer buys a very powerful laptop

In the blog of a software developer there is a report on the new laptop he has bought to do his work:

a new laptop, a Dell XPS 15 9560 with 4k display, 32 GiBs of RAM and 1 TiB M.2 SSD drive. Quite nice specs, aren't they :-)?

That is not surprising when a smart wristwatch has a dual-CPU 1GHz chip, 768MiB of RAM and 4GiB of flash SSD, but it has consequences for many other people: such a powerful development system means that improving the speed and memory use of the software written by that developer will not be a very high priority. It is difficult for me to see practical solutions for this unwanted consequence of hardware abundance.

2017 August

170831 Fri: SFTP speeds have improved a lot

Some years ago I had reported that the standard sftp (and sometimes scp) implementation then available for GNU/Linux and MS-Windows was extremely slow because of a limitation in the design of its protocol and implementation, which meant that it effectively behaved as if in half-duplex mode.

At the current time the speed of commonly available sftp implementations is instead pretty good, with speeds of around 70-80MB/s on 1Gbit links, because the limitation could be circumvented and the implementations have been (arguably) improved, both the standard one and an extended one.
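
On long fat links the protocol's request window still matters; OpenSSH's sftp exposes it directly (the values here are arbitrary examples):

sftp -B 262144 -R 256 user@host
# -B sets the size of each read/write request, -R how many requests are kept
# in flight; throughput is bounded by roughly (B × R) / round-trip time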

170825 Fri: M.2 storage with and without backing capacitors

Previously I mentioned the large capacitors on flash SSDs in the 2.5in format targeted at the enterprise market. In an article about recent flash SSDs in the M.2 format there are photographs and descriptions of some with mini capacitors with the same function, and a note that:

The result is a M.2 22110 drive with capacities up to 2TB, write endurance rated at 1DWPD, ... A consumer-oriented version of this drive — shortened to the M.2 2280 form factor by the removal of power loss protection capacitors and equipped with client-oriented firmware — has not been announced

The absence of the capacitors is likely to save relatively little in cost, and having to manufacture two models is likely to add some of that back, but the lack of capacitors makes the write IOPS a lot lower (because it requires write-through rather than write-back caching) and for the same reason also increases write amplification, thus creating enough differentiation to justify a much higher price for the enterprise version.
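
The write-back vs. write-through difference can be felt even on an ordinary SATA drive by toggling its volatile write cache with hdparm (the device name is an assumption); with the cache off, small synchronous writes get dramatically slower:

hdparm -W /dev/sda        # query the current write-cache setting
hdparm -W 0 /dev/sda      # write-through: writes complete only once on the medium
hdparm -W 1 /dev/sda      # write-back: writes are acknowledged from the cache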

170817 Thu: 50TB 3.5in flash SSD

Interesting news of a line of 3.5in form factor flash SSD products (instead of the more conventional M.2 and 2.5in), the Viking UHC-Silo (where UHC presumably stands for Ultra High Capacity), in 25TB and 50TB capacities, with a dual-port 6Gb/s SAS interface.

Interesting aspects: fairly conventional 500MB/s sequential read and 350MB/s sequential write rates, and not unexpectedly 16W power consumption. But note that at 350MB/s filling (or duplicating) this unit takes around 40 hours (50,000,000MB at 350MB/s is roughly 143,000 seconds). That big flash capacity could probably sustain a much higher transfer rate, but that might involve a much higher power draw and thus heat dissipation, so I suspect that the limited transfer rate does not depend just on the 6Gb/s channel, which after all allows for the slightly higher 500MB/s read rate.

Initial price seems to be US$37,000 for the 50TB version, or around US$1,300 per TB which is more or less the price for a 1TB flash SSD enterprise SAS product.

170816 Wed: Another economic reason to virtualize

The main economic reason to virtualize has allegedly been that in many cases the hardware servers that run the host systems of applications are mostly idle, as a typical data centre hosts many low-use applications that nevertheless are run on their own dedicated server (being effectively containerized on that hardware server), and therefore consolidating them onto a single hardware server can result in a huge saving; virtualization allows consolidating the hosting systems as-they-are, simply by copying them from real to virtual hardware, saving transition costs.

Note: that takes for granted that in many cases the applications are poorly written and cannot coexist in the same hosting system, which is indeed often the case. More disappointingly, it also takes for granted that increasing the impact of a hardware failure from one to many applications is acceptable, and that replication of virtual systems to avoid that is cost-free, which it isn't.

But there is another rationale for consolidation, if not virtualization, that does not depend on there being a number of mostly-idle dedicated hardware servers: the notion that the cost per capacity unit of midrange servers is significantly lower than for small servers, and that the best price/capacity ratio is with servers like:

  • 24 Cores (48 cores HT), 256GB RAM, 2* 10GBit and 12* 3TB HDD including a solid BBU. 10k EUR
  • Applications that actually fill that box are rare. We are cutting it up to sell off the parts.

That by itself is only a motivation for consolidation, that is for servers that run multiple applications; in the original presentation it becomes a case for virtualization because of the goal of having a multi-tenant service with independent administration domains.

The problem with virtualization to take advantage of lower marginal cost of capacity in mid-size hardware is that it is not at all free, because of direct and indirect costs:

Having looked at the numbers involved, my guess is that there is no definite advantage in overall cost to consolidation of multiple applications, and it all depends on the specific case; usually, but not always, virtualization has costs that outweigh its advantages, especially for IO intensive workloads, and this is reflected in the high costs of most virtualized hosting and storage services (1, 2).

Virtualization for consolidation has however an advantage as to the distribution of cost as it allows IT departments to redefine downwards their costs and accountabilities: from running application hosting systems to running virtual machine hosting systems, and if 10 applications in their virtual machines can be run on a single host system, that means that IT departments need to manage 10 times fewer systems, for example going from 500 small systems to 50.

But the number of actual systems has increased from 500 to 550, as the systems in the virtual machines have not disappeared, so the advantage for IT departments comes not so much from consolidation as from handing back the cost of managing the 500 virtual systems to the developers and users of the applications running within them, which is usually what DevOps means.

The further stage of IT department cost-shifting is to get rid entirely of the hosting systems, and outsource the hosting of the virtual machines to a more expensive cloud provider, where the higher costs are then charged directly to the developers and users of the applications, eliminating those costs from the budget of the IT department, which is then only left with the cost of managing the contract with the cloud provider on behalf of the users.

Shifting hardware and systems costs out of the IT department budget into that of their users can have the advantage of boosting the career and bonuses of IT department executives by shrinking their apparent costs, even if it does not reduce overall business costs. But it can reduce aggregate organization costs when it discourages IT users from using IT resources unless there is a large return, by substantially raising the direct cost of IT spending to them; so even at the aggregate level it might be, in specific cases, ultimately a good move.

That is, a business that consolidates systems, switches IT provision from application hosting to systems hosting, and then outsources system hosting, is in effect telling its component businesses that they are overusing IT and that they should scale it back, by effectively charging more for application hosting and supporting it less.

170810 Thu: Speedy writing to a BD-R DL disc

Today after upgrading (belatedly) the firmware of my BDR-2209 Pioneer drive to 1.33 I have seen for the first time a 50GB BD-R DL disc written at around 32MB/s average:

Current: BD-R sequential recording
Track 01: data  31707 MB        
Total size:    31707 MB (3607:41.41) = 16234457 sectors
Lout start:    31707 MB (3607:43/41) = 16234607 sectors
Starting to write CD/DVD at speed MAX in real TAO mode for single session.
Last chance to quit, starting real write in   0 seconds. Operation starts.
Waiting for reader process to fill input buffer ... input buffer ready.
Starting new track at sector: 0
Track 01: 9579 of 31707 MB written (fifo 100%) [buf 100%]   8.0x.
Track 01: Total bytes read/written: 33248166496/33248182272 (16234464 sectors).
Writing  time:  1052.512s

This was on a fairly cheap no-name disc. I sometimes also try to write to 50GB BD-RE DL discs, but that works only sometimes, and at best at 2x speed. I am tempted to try, just for the fun of it, to get a 100GB BD-RE XL disc (they have been theoretically available since 2011) but I suspect that would be wasted time.
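
The log above is cdrecord's; the invocation was presumably along these lines (the device name and image path are assumptions):

cdrecord -v dev=/dev/sr0 -tao image.iso   # no speed= option, hence "speed MAX" in the log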

2017 July

170724 Mon: Different types of M.2 slots

As another example that there are no generic products, I was looking at a PC card to hold an M.2 SSD device and interface it to the host PCIe bus, and the description carried a warning:

Note, this adapter is designed only for 'M' key M.2 PCIe x4 SSD's such as the Samsung XP941 or SM951. It will not work with a 'B' key M.2 PCIe x2 SSD or the 'B' key M.2 SATA SSD.

There are indeed several types of M.2 slots, with different widths and speeds, supporting different protocols, and this is one of the most recent and fastest variants. Indeed among the user reviews there is also a comment as to the speed achievable by an M.2 flash SSD attached to it:

I purchased LyCOM DT-120 to overcome the limit of my motherboard's M.2. slot. Installation was breeze. The SSD is immediately visible to the system, no drivers required. Now I enjoy 2500 MB/s reading and 1500 MB/s writing sequential speed. Be careful to install the device on a PCI x4 slot at least, or you will still be hindered.

Those are pretty remarkable speeds, much higher (in peak sequential transfer) than those for a memristor SSD.
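
Whether such a card actually negotiated four lanes can be checked from Linux (the PCI address is an assumption; run as root for the full capability dump):

lspci | grep -i 'non-volatile'           # find the NVMe device's PCI address
lspci -s 02:00.0 -vv | grep -i 'lnksta'  # e.g. "LnkSta: Speed 8GT/s, Width x4"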

170720 Thu: The O_PONIES summary and too much durability

So last night I was discussing the O_PONIES controversy and was asked to summarise it, which I did as follows:

There is the additional problem that available memory has grown at a much faster rate than IO speed, at least that of hard disks, and this has meant that users and application writers have been happy to let very large amounts of unwritten data accumulate in the Linux page cache, which then takes a long time to be written to persistent storage.
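
The amount of unwritten data the kernel will let accumulate is tunable; capping it is a common mitigation (the values here are arbitrary examples):

sysctl -w vm.dirty_background_bytes=$((64*1024*1024))   # start background writeback at 64MiB
sysctl -w vm.dirty_bytes=$((256*1024*1024))             # block writers beyond 256MiB of dirty data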

The comments I got on this story were entirely expected, and I was a bit disappointed, but in particular one line of comment offers me the opportunity to explain a particularly popular and delusional point of view:

the behavior I want should be indistinguishable from copying the filesystem into an infinitely large RAM, and atomically snapshotting that RAM copy back onto disk. Once that's done, we can talk about making fsync() do a subset of the snapshot faster or out of order.

right. I'm arguing that the cases where you really care are actually very rare, but there have been some "design choices" that have resulted in fsync() additions being argued for applications..and then things like dpkg get really f'n slow, which makes me want to kill somebody when all I'm doing is debootstrapping some test container

in the "echo hi > a.new; mv a.new a; echo bye > b.new; mv b.new a" case, writing a.new is only necessary if the mv b.new a doesn't complete. A filesystem could opportunistically start writing a.new if it's not already busy elsewhere. In no circumstances should the mv a.new operation block, unless the system is out of dirty buffers

you have way too much emphasis on durability

The latter point is the key and, as I understand them, the implicit arguments are:

As to that I pointed out that any decent implementation of O_PONIES indeed is based on the true and obvious point that it is pointless and wasteful to make a sequence of metadata and data updates persistent until just before a loss of non-persistent storage content, and that therefore all that is needed is two system-internal functions, returning respectively the time interval to the next loss of non-persistent storage content, and the time interval to the end of the current semantically-coherent sequence of operations.

Note: the same reasoning of course applies to backups of persistent storage content: it is pointless and wasteful to make them until just before a loss of the contents of that persistent storage.

Given those O_PONIES functions, it would almost never be necessary to explicitly fsync data, or implicitly fsync metadata, in a sequence of operations like:

echo hi > a.new; mv a.new a; echo bye > b.new; mv b.new a

Because the updates would be implicitly made persistent only once it became known that a loss of non-persistent storage content would happen before the sequence completed.

As simple as that!
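
Meanwhile, lacking those functions, an application that wants the sequence above to be durable has to force the issue explicitly; in shell terms, assuming a recent GNU coreutils whose sync accepts file arguments, roughly:

echo hi > a.new
sync a.new    # fsync() the new file's data before renaming over the target
mv a.new a
sync .        # fsync() the directory, making the rename itself durable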

Unfortunately until someone sends patches to Linus Torvalds implementing those two simple functions there should be way too much emphasis on durability because:

170716c Sun: Hardware specification of "converged" mobile phone

In the same issue of Computer Shopper as the review of a memristor SSD there is also a review of a mobile phone, also from Samsung, the Galaxy S8+, which has an 8-CPU 2.3GHz chip, 4GiB of memory, and 64GiB of flash SSD built in. That is the configuration of a pretty powerful desktop or a small server.

Most notably for me it has a 6.2in 2960×1440 AMOLED display, which is both physically large and with a higher pixel size than most desktop or laptop displays. There have been mobile phones with 2560×1440 displays for a few years, which is amazing in itself, but this is a further step. There are currently almost no desktop or laptop AMOLED displays, and very few laptops have pixel sizes larger than 1366×768, or even decent IPS displays. Some do have 1920×1080 IPS LCD displays, and only a very few have 3200×1800 pixel sizes.

The other notable characteristic of the S8+, given that its processing power is huge, is that it has an optional docking station that allows it to use an external monitor, keyboard and mouse (most likely the keyboard and mouse can be used anyhow, as long as they are Bluetooth ones).

This is particularly interesting as desktop-mobile convergence (1, 2, 3) was the primary goal of the Ubuntu strategy of Canonical:

Our strategic priority for Ubuntu is making the best converged operating system for phones, tablets, desktops and more.

Earlier this year that strategy was abandoned and I now suspect that a significant motivation for that was that Samsung was introducing convergence themselves with Dex, and for Android, and on a scale and sophistication that Canonical could not match, not being a mobile phone manufacturer itself.

170716b Sun: Hardware specification of smart watch and related

In the same Computer Shopper UK issue 255 with a review of a memristor SSD there is also a review of a series of fitness-oriented smartwatches, and some of them, like the Samsung Gear S3, have 4GB of flash storage, 768MiB of RAM and a two-CPU 1GHz chip. That can run a server for a significant workload.

170716 Sun: Miracles do happen: memristor product exists

Apparently memristor storage has arrived, as Computer Shopper UK, issue 255, has a practical review of the Intel 32GB Optane Memory in an M.2 form factor, based on the mythical 3D XPoint memristor memory brand. The price and specifications have some interesting aspects:

The specifications don't say it, but it is not a mass storage device with a SATA or SAS protocol: it is a sort of memory technology device as far as the Linux kernel is concerned, and for optimal access it requires special handling.

Overall this first memristor product is underwhelming: it is more expensive and slower than equivalent M.2 flash SSDs, even if the read and write access times are much better.

170712 Wed: Special case indentation for PSGML

To edit HTML and XML files like this one I use EMACS with the PSGML library, as it is driven by the relevant DTD, and this drives validation (which is fairly complete) and indentation. As to the latter, some HTML elements should not be indented further, because they indicate types of speech rather than parts of the document, and some do not need to be indented at all, as they indicate preformatted text.

Having looked at PSGML there is for the latter case a variable in psgml-html.el that seems relevant:

	sgml-inhibit-indent-tags     '("pre")

but it is otherwise unimplemented. So I have come up with a more complete scheme:

(defvar sgml-inhibit-indent-tags nil
  "*A list of tags whose content should not be indented at all")

(defvar sgml-same-indent-tags nil
  "*A list of tags whose content should not be indented further")

(defun sgml-mode-customise ()

  ;; Redefine PSGML's indentation function to consult the two lists above.
  (defun sgml-indent-according-to-level (element)
    (let ((name (symbol-name (sgml-element-name element))))
      (cond
	;; preformatted-text elements: no indentation at all
	((member name sgml-inhibit-indent-tags) 0)
	;; type-of-speech elements: keep the enclosing element's indentation
	((member name sgml-same-indent-tags)
	  (* sgml-indent-step
	    (- (sgml-element-level element) 1)))
	;; anything else: indent one step per nesting level, as PSGML does
	(t
	  (* sgml-indent-step
	    (sgml-element-level element)))
      )
    )
  )
)

(if (and (fboundp 'sgml-mode) (not noninteractive))
  (if (fboundp 'eval-after-load)
     (eval-after-load "psgml" '(sgml-mode-customise))
     (sgml-mode-customise)
  )
)

(defun html-sgml-mode ()
  "Simplified activation of HTML as an application of SGML mode."
  (interactive)

  (sgml-mode)
  (html-mode)

  (make-local-variable			'sgml-default-doctype-name)
  (make-local-variable			'sgml-inhibit-indent-tags)
  (make-local-variable			'sgml-same-indent-tags)

  (setq
    sgml-default-doctype-name		"html"
    ; tags must be listed in upper case
    sgml-inhibit-indent-tags		'("PRE")
    sgml-same-indent-tags		'("EM" "TT" "I" "B" "CITE" "VAR" "CODE" "DFN" "STRONG")
  )
)

It works well enough, except that I would prefer the elements with tags listed in sgml-inhibit-indent-tags to have their start and end tags also not indented, not just their content; but PSGML indents those as content of the enclosing element, so achieving that would require more invasive modifications of the indentation code.

170704 Tue: Fast route lookup in Linux with large numbers of routes

Fascinating report with graphs on how route lookup has improved in the Linux kernel, and the very low lookup times reached:

Two scenarios are tested:

  • 500,000 routes extracted from an Internet router (half of them are /24), and
  • 500,000 host routes (/32) tightly packed in 4 distinct subnets.

As of kernel version 3.6.11 the routing lookup cost was 140ns and 40ns for the two scenarios; as of 4.1.42 it is 35ns and 25ns. Dedicated "enterprise" routers with hardware routing are probably equivalent. In a previous post the amount of memory used is given: With only 256 MiB, about 2 million routes can be stored!

As previously mentioned, once upon a time IP routing was much more expensive than Ethernet forwarding and therefore there was a speed case both for Ethernet forwarding across multiple hubs or switches, and for routing based on prefixes; despite the big problems that arise from Ethernet forwarding across multiple switches and the limitations that follow from subnet routing.

But it has been a long time since subnet routing has been as cheap as Ethernet forwarding, and it is now pretty obvious that even per-address host routing is cheap enough at least on the datacentre and very likely on the campus level (500,000 addresses is huge!).
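
Nothing special is needed to try host routes: to the kernel a /32 is just another prefix (the addresses are examples):

ip route add 192.0.2.7/32 via 10.0.0.1 dev eth0
ip route get 192.0.2.7    # shows the route and next hop the kernel picks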

Therefore so-called host routing can result in a considerable change of design decisions, but that depends on realizing that host routing is improper terminology, and on the difference between IP addresses and Ethernet addresses:

Note: the XNS internetworking protocols used addresses formed of a 32 bit prefix as network identifier (possibly subnetted) plus a 48 bit Ethernet identifier, which to me seems a very good combination.

The popularity of Ethernet and in particular of VLAN broadcast domains spanning multiple switches depends critically on Ethernet addresses being actually identifiers: an interface can be attached to any network access point on any switch at any time and be reachable without further formality.

Now so-called host routes turn IP addresses into (within The Internet) endpoint identifiers, because the endpoint can change location, from one interface to another, or one host to another, and still identify the service it represents.

170701 Sat: Fixing an Ubuntu phone after a power loss crash

So I have this Aquaris E4.5 with Ubuntu Touch, whose battery is a bit fatigued, so it sometimes runs out before I can recharge it. Recently when it restarted the installed Ubuntu system seemed damaged: various things had bizarre behaviour or none at all.

For example when I clicked on System Settings>Wi-Fi I saw nothing listed, except that Previous networks listed a number of them, some 2-3 times, but when I clicked on any of them, System Settings exited; no SIM card was recognized (they work in my spare identical E4.5), and no non-local contacts appeared.

Also, System Settings just crashed when clicking on About, Phone, Security & Privacy, or Reset.

Previous "battery exhausted" crashes had no bad outcomes, as expected, as battery power usually runs out when the system is idle.

After some time looking into it I figured out that part of the issue was that:

$HOME/.config/com.ubuntu.address-book/AddressBookApp.conf.lock
$HOME/.config/connectivity-service/config.ini.lock

were blocking the startup of the relevant services, so removing them allowed the services to proceed.

As to System Settings crashing, it was dying with SIGSEGV, so strace let me figure out that this was from running out of memory just after accessing something under $HOME/.cache/, so I just emptied that directory, and then everything worked. Some cached setting had perhaps been corrupted. I suspect that the cache needs occasional cleaning out.
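
In shell terms the whole fix was roughly this (the .cache wipe loses only regenerable data):

rm -f $HOME/.config/com.ubuntu.address-book/AddressBookApp.conf.lock
rm -f $HOME/.config/connectivity-service/config.ini.lock
rm -rf $HOME/.cache/*    # cached data only; it is regenerated on next use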

Note: another small detail: /usr/bin/qdbus invokes /usr/lib/arm-linux-gnueabihf/qt5/bin/qdbus which is missing, but /usr/lib/arm-linux-gnueabihf/qt4/bin/qdbus works.