Computing notes 2017 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


2017 December

171201 Fri: Misconfiguration of SATA chip made flash SSD look slower

So I was disappointed previously that my laptop flash SSD's speed seemed to have halved, and I noticed recently that the unit was in SATA1 instead of SATA2 mode, and I could not change that even by disabling power saving with the PowerTOP tool or with the hdparm tool; I then suspected that the SATA chipset was resetting itself down to SATA1 mode on resuming from suspend-to-RAM, but a fresh boot did not change anything. Then I found out that I had at some point set in the laptop BIOS an option to save power on the SATA interface. Once I unset it the flash SSD resumed working at the top speed of the laptop's SATA2 interface of around 250MB/s. It looks like the BIOS somehow sets the SATA interface to SATA1 speeds in a way that the Linux based tools cannot change back.
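
For reference, the negotiated link speed can be checked from Linux like this (the device name is an assumption):

dmesg | grep -i 'SATA link up'                  # e.g. "1.5 Gbps" (SATA1) or "3.0 Gbps" (SATA2)
hdparm -I /dev/sda | grep -i 'signaling speed'  # the generations the drive itself supports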

Also I retested the SK Hynix flash SSD and it is back to almost full nominal speed at 480MB/s reading instead of 380MB/s, I guess because of garbage collection in the firmware layer.
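
A rough sequential read rate is easy to sample (again the device name is an assumption):

hdparm -t /dev/sda                                          # buffered sequential reads
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct   # or bypass the page cache with dd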

In most use, that is for regular-sized files, the change in transfer rate does not result in noticeable changes in responsiveness, which is still excellent, because the real advantage of flash SSDs for interactive use is in their much higher random IOPS compared to disks, more than in their somewhat higher transfer rates.
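
The random-IOPS gap is easy to measure with fio; a minimal sketch (the file name and sizes are arbitrary choices):

fio --name=randread --filename=/tmp/fio.test --size=1g \
  --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
  --direct=1 --runtime=30 --time_based
# a flash SSD typically reports tens of thousands of 4KiB read IOPS here,
# a rotating disk well under 200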

Overall I am still very happy with the three flash SSDs I have, which were chosen among those with a good reputation, as none of them has had any significant issues in several years and they all report having at least 95% of their lifetime writes available. This is one of the few cases where I have bought three different products from different manufacturers and all three have been quite good. The Samsung 850 Pro feels a bit more responsive than the others but they are all quite good, even the nearly 6 years old Micron M4.

2017 November

171118 Sat: Some recent news about "security"

First a list of recent and semi-recent successful hacks that were discovered:

These are rather different stories because not all hacks are equal: those that generate revenue or offer the opportunity to generate revenue are much worse than the others:

Overall the real import of the above issues is very varied. And they are the least important ones, because they were discovered: the really bad security issues are those that never get found. Knowing about a security breach is much better than not knowing about it.

As to this, phishing has evolved from banal e-mails to entire fake sites quite a while ago:

Phishing kits contain prepackaged fake login pages for popular and valuable sites, such as Gmail, Yahoo, Hotmail, and online banking. They're often uploaded to compromised websites, and automatically email captured credentials to the attacker's account.

Phishing kits enable a higher rate of account hijacking because they capture the same details that Google uses in its risk assessment when users login, such as victim's geolocation, secret questions, phone numbers, and device identifiers.

Security services use the same technique for entrapment, which is a form of phishing:

Australian police secretly operated one of the dark web’s largest child abuse sites for almost a year, posing as its founder in an undercover operation that has triggered arrests and rescues across the globe.

The sting has brought down a vast child exploitation forum, Childs Play, which acted as an underground meeting place for thousands of paedophiles.

Obviously this is just one case among very many that are as yet undisclosed. There may be few qualms about entrapment of paedophiles (except that it facilitates their activities for a time), but consider the case of a web site for North Korean political dissenters actually run by the North Korean secret police, or a system security discussion forum for bank network administrators run by a Russian hacker gang.

Never mind phone-home ever-listening devices like digital assistants or smart television monitors (1, 2), or smartwatches that act as remote listening devices (1, 2). The issue with these is not just that they phone home to the manufacturers, but that they may be easy to compromise by third parties, and then they would phone home to those third parties, with potentially vast consequences. For example it is not widely known that passwords can be easy to figure out from the sound of typing them, especially if they are typed repeatedly.

So as always there are two main types of security situations: those where one is part of a generic attack against a large number of mostly random people with a low expected rate of success, where what matters is not to be among the successes; and those targeted against specific individuals or groups of high value, where what matters is either not to be of high value, or not to mark oneself out as being, or appearing to be, of high value.

171117 Fri: Some interesting tools

I have belatedly become aware of some tools that seem interesting but I have not used them yet:

171113 Mon: How to switch to Firefox 57 Quantum

So I have finally switched to Mozilla's new Firefox 57 Quantum, a largely re-engineered jump away from the previous codebase.

The re-engineering was supposed to deliver more portable add-ons and faster rendering of web pages, as well as making multi-process operation common. In all it seems to deliver: the new-style add-ons are no longer based on rewriting parts of the Firefox user interface internal XUL code itself, but on published internal interfaces, mostly compatible with those of Google's Chrome; and indeed page loading and rendering are much snappier, and multi-process operation means that in cases like memory overflow only some tabs get killed. There are however a few very important issues in switching, which make it non-trivial:

The solutions are:

Having done the above, Firefox 57 works pretty well, and in large part thanks to Tab Suspender and Policy Control, and in part thanks to its improved implementation, it is far quicker and consumes much less memory, as the new add-ons are stricter than those they replaced. There are minor UI changes, and they seem to be broadly an improvement, even if a slight one.

171109 Thu: Btrfs status

I spend a fair bit of time helping distraught users on IRC channels, in part as community service, in part to keep in touch with the contemporary issues of ordinary users, and I was asked today for a summary of what Btrfs is currently good for and what its limitations are, so I might as well write it here for reference.

The prelude to a few lists is a general issue: Btrfs has a sophisticated design with a lot of features, and regardless of current status its main issue is that its many possible aspects and its complex tradeoffs are difficult both to explain and to understand. No filesystem design is maintenance-free or really simple, but Btrfs (and ZFS and XFS too) is really not fire-and-forget.

Because of that complexity I have decided to create a page of notes dedicated to Btrfs with part of its contents extracted from the Linux filesystems notes and with my summary of how best to use it.

2017 October

171014 Sat: Mobile phone network hacking and more on security

As non-technical people ask me about the metaphysical subject of computer security my usual story is that every possible electronic device is suspicious, and if programmable it must be assumed to have several backdoors put in by various interested parties, but they are not going to be used unless the target is known to be valuable.

It is not just software or firmware backdoors: as a recent article on spyware dolls and another one reporting WiFi chips in electric irons suggest, it is very cheap to put some extra electronics on the circuit boards of various household items, either officially or unofficially, and so probably these are quite pervasive.

As I wrote previously, being (or at least being known as) a target worth only a low budget is a very good policy, as otherwise the cost of preventive measures can be huge; for valuable targets, such as large Bitcoin wallets, even two-factor authentication via mobile phones can be nullified, because it is possible to subvert mobile phone message routing. The same article strongly makes the point that being known as a valuable target attracts unwanted attempts, and that in particular Bitcoin wallets are identified by IP address, and it is easy to associate addresses with people.

In the extreme case of Equifax, which held the full details of hundreds of millions of credit card users, the expected sale value is dozens of millions of dollars, which may justify investments of millions of dollars to acquire them, and it is very difficult to protect data against someone with a budget like that, which allows paying for various non-technical but very effective methods.

The computers of most non-technical people and even system administrators are not particularly valuable targets, as long as they keep the bulk of their savings, if any, in offline accounts, and absolutely never write down (at least not on a computer) the PINs and passwords to their online savings accounts. Even so a determined adversary can grab those passwords when they are used, indirectly, by installing various types of bugs in a house or a computer before it gets delivered, but so-called targeted operations have a significant cost which is not worth spending on average-income people.

So my usual recommendation is to be suspicious of any electrical, not just electronic, device, and use them only for low value activities not involving money and not involving saleable assets like lists of credit card numbers.

171004 Wed: C vs. C++ and the cost of system level features

I have been quite fond of using the Btrfs filesystem, but only for its simpler features and in a limited way. But on its IRC channel and mailing list I often come across many less restrained users who get into trouble of some sort or another by using arbitrary combinations of features, expecting them to just work, and fast too.

That trouble is often bugs, because there are many arbitrary combinations of many features, and designing so that all of them behave correctly and sensibly, never mind testing them all, is quite hard work.

But often that trouble relates to performance and speed, because the performance envelope of arbitrary combinations of features can be quite anisotropic indeed; for example in Btrfs creating snapshots is very quick, but deleting them can take a long time.
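
For example (the paths are hypothetical, and /data is assumed to be a subvolume), creating a snapshot returns almost instantly, while deleting one only queues the work, which can then run for a long time:

btrfs subvolume snapshot -r /data /data/.snapshots/snap-1   # near-instant
btrfs subvolume delete /data/.snapshots/snap-1              # returns quickly...
btrfs subvolume sync /data     # ...but this waits for the actual cleanup to finish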

This reminded me of the C++ programming language, which also has many features and makes possible many arbitrary combinations, many of them unwise, and in particular one issue it has compared to the C programming language: in C every language feature is designed to have a small, bounded, easily understood cost, but many features of C++ can involve very large space or time costs, often surprisingly, and even more surprising can be the cost of combinations of features. This property of C++ is common to several programming languages that try to provide really advanced features as if they were simple and cheap, which of course is something that naive programmers love.

Btrfs is very similar: many of its advanced features sound like the filesystem code just does it but involve potentially enormous costs at non-trivial scale, or are very risky in case of mishaps. Hiding great complexity inside seemingly trivial features seems to lead astray a lot of engineers who should know better, or who believe that they know better.

I wish I could say something better than this: it is very difficult to convey appropriately an impression of the real cost of features when they are presented elegantly. It seems to be a human failing to assume that things that look elegant and simple are also cheap and reliable; just like in movies the good people are also handsome and look nice, and the bad people are also ugly and look nasty. Engineers should be more cynical, but many are not, or sometimes do not want to be, as that may displease management, who usually love optimism.

2017 September

170918 Mon: Complexity and maintainability and virtualization

One of the reasons why I am skeptical about virtualization is that it raises the complexity of a system: a systems engineer has then to manage not just systems running one or more applications, but also systems running those systems, creating a cascade of dependencies that are quite difficult to investigate in case of failure.

The supreme example of this design philosophy is OpenStack which I have worked on for a while (lots of investigating and fixing), and I have wanted to create a small OpenStack setup at home on my test system to examine it a bit more closely. The system at work was installed using MAAS and Juju, which turned out to be quite unreliable too, so I was advised to try the now-official Ansible variant of the Kolla (1, 2, 3, 4) setup method.

Kolla itself does not set anything up: it is a tool to build Docker deployables for every OpenStack service. The two variants are then for deployment: using either Ansible or Kubernetes, the latter being another popular buzzword along with Docker and OpenStack. I chose the Ansible installer as I am familiar with Ansible, and also because I wanted to do a simple install with all relevant services on a single host system, without installing Kubernetes too.

It turns out that the documentation as usual is pretty awful: very long on fantastic promises of magnificence, while obfuscating the banal reality:

Note: one of the components and Dockerfiles is kolla-toolbox, a new fairly small component for internal use.

The main value of kolla and kolla-ansible is that someone has already written the 252 Dockerfiles for the 67 components, and Ansible roles for them, and keeps somewhat maintaining them, as OpenStack components change.
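
For reference, once the images are built the all-in-one deployment reduces to a handful of commands; this is a sketch under my assumptions about paths and options (they depend on the kolla-ansible version and on how it was installed):

pip install kolla-ansible
cp /usr/share/kolla-ansible/etc_examples/kolla/* /etc/kolla/
kolla-genpwd                          # fills in /etc/kolla/passwords.yml
# edit /etc/kolla/globals.yml: base distro, install type, network interfaces
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one bootstrap-servers
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one prechecks
kolla-ansible -i /usr/share/kolla-ansible/ansible/inventory/all-in-one deploy
kolla-ansible post-deploy             # writes an admin openrc file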

Eventually my minimal install took 2-3 half-days, and resulted in building 100 Docker images, taking up around 11GiB, and running 29 Docker instances:

soft# docker ps --format 'table {{.Names}}\t{{.Image}}' | grep -v NAMES | sort 
cron                        kolla/centos-binary-cron:5.0.0
fluentd                     kolla/centos-binary-fluentd:5.0.0
glance_api                  kolla/centos-binary-glance-api:5.0.0
glance_registry             kolla/centos-binary-glance-registry:5.0.0
heat_api_cfn                kolla/centos-binary-heat-api-cfn:5.0.0
heat_api                    kolla/centos-binary-heat-api:5.0.0
heat_engine                 kolla/centos-binary-heat-engine:5.0.0
horizon                     kolla/centos-binary-horizon:5.0.0
keystone                    kolla/centos-binary-keystone:5.0.0
kolla_toolbox               kolla/centos-binary-kolla-toolbox:5.0.0
mariadb                     kolla/centos-binary-mariadb:5.0.0
memcached                   kolla/centos-binary-memcached:5.0.0
neutron_dhcp_agent          kolla/centos-binary-neutron-dhcp-agent:5.0.0
neutron_l3_agent            kolla/centos-binary-neutron-l3-agent:5.0.0
neutron_metadata_agent      kolla/centos-binary-neutron-metadata-agent:5.0.0
neutron_openvswitch_agent   kolla/centos-binary-neutron-openvswitch-agent:5.0.0
neutron_server              kolla/centos-binary-neutron-server:5.0.0
nova_api                    kolla/centos-binary-nova-api:5.0.0
nova_compute                kolla/centos-binary-nova-compute:5.0.0
nova_conductor              kolla/centos-binary-nova-conductor:5.0.0
nova_consoleauth            kolla/centos-binary-nova-consoleauth:5.0.0
nova_libvirt                kolla/centos-binary-nova-libvirt:5.0.0
nova_novncproxy             kolla/centos-binary-nova-novncproxy:5.0.0
nova_scheduler              kolla/centos-binary-nova-scheduler:5.0.0
nova_ssh                    kolla/centos-binary-nova-ssh:5.0.0
openvswitch_db              kolla/centos-binary-openvswitch-db-server:5.0.0
openvswitch_vswitchd        kolla/centos-binary-openvswitch-vswitchd:5.0.0
placement_api               kolla/centos-binary-nova-placement-api:5.0.0
rabbitmq                    kolla/centos-binary-rabbitmq:5.0.0

Some notes on this:

Most of the installation time was taken up by figuring out the rather banal nature of the two components, by de-obfuscating the documentation, by realizing that the prebuilt images in the Kolla Docker hub repository were rather incomplete and old and would not do, so that I had to overbuild images, but mostly by working around the inevitable bugs in both the tools and OpenStack.

I found that the half a dozen blocker bugs that I investigated were mostly known from years ago, as often happens, and that was lucky, as most of the relevant error messages were utterly opaque if not misleading, but would eventually lead to a post by someone who had investigated the same issue.

Overall 1.5 days to setup a small OpenStack instance (without backing storage) is pretty good, considering how complicated it is, but the questions are whether it is necessary to have that level of complexity, and how fragile it is going to be.

170917 Sun: Some STREAM and HPL results from some not-so-new systems

On a boring evening I have run the STREAM and HPL benchmarks on some local systems, with interesting results, first for STREAM:

AMD Phenom X3 720 3 CPUs 2.8GHz
over$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5407.0     0.296645     0.295911     0.299323
Scale:           4751.1     0.338047     0.336766     0.344893
Add:             5457.5     0.439951     0.439759     0.440204
Triad:           5392.4     0.445349     0.445068     0.445903
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
intel i3-370M 2-4 CPUs 2.4GHz
tree$ ./stream.100M                                                            
-------------------------------------------------------------                   
STREAM version $Revision: 5.10 $                                                
...
Function    Best Rate MB/s  Avg time     Min time     Max time                  
Copy:            5668.9     0.299895     0.282241     0.327760                  
Scale:           5856.8     0.305240     0.273185     0.363663                  
Add:             6270.9     0.404547     0.382720     0.446088                  
Triad:           6207.4     0.408687     0.386636     0.456758                  
-------------------------------------------------------------                   
Solution Validates: avg error less than 1.000000e-13 on all three arrays        
-------------------------------------------------------------
AMD FX-6100 3-6 CPUs 3.3GHz
soft$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5984.0     0.268327     0.267381     0.270564
Scale:           5989.1     0.269746     0.267154     0.279534
Add:             6581.8     0.366100     0.364640     0.371339
Triad:           6520.0     0.374828     0.368098     0.419086
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
AMD FX-8370e 4-8 CPUs 3.3GHz
base$ ./stream.100M 
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            6649.8     0.242686     0.240608     0.244452
Scale:           6427.1     0.251241     0.248944     0.257000
Add:             7444.9     0.324618     0.322367     0.327456
Triad:           7522.1     0.322253     0.319058     0.324474
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Xeon E3-1200 4-8 CPUs at 2.4GHz (Update 170920)
virt$ ./stream.100M
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
...
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9961.3     0.160857     0.160621     0.160998
Scale:           9938.1     0.161200     0.160997     0.161329
Add:            11416.7     0.210518     0.210218     0.210647
Triad:          11311.7     0.212472     0.212170     0.214260
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

The X3 720 system has DDR2, the others DDR3; obviously it is not making a huge difference. Note that the X3 720 is from 2009, the i3-370M is in a laptop from 2010, and the FX-8370e is from 2014, so a span of around 5 years.
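
Note: the stream.100M binary is STREAM version 5.10 built with a 100-million-element array (the timings above are consistent with that: a Copy moves 2 × 800MB), presumably along these lines (the exact compiler flags are my assumption):

gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream.100M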

Bigger differences for HPL; run single-host with mpirun -np 4 xhpl; the last 8 tests are fairly representative, and the last column (I have modified the original printing format from %18.3e to %18.6f) is GFLOPS:

AMD Phenom X3 720 3 CPUs 2.8GHz
over$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.011727
WR00C2R4    35   4   4   1     0.00     0.012422
WR00R2L2    35   4   4   1     0.00     0.011343
WR00R2L4    35   4   4   1     0.00     0.012351
WR00R2C2    35   4   4   1     0.00     0.011562
WR00R2C4    35   4   4   1     0.00     0.012483
WR00R2R2    35   4   4   1     0.00     0.011615
WR00R2R4    35   4   4   1     0.00     0.012159
intel i3-370M 2-4 CPUs 2.4GHz
tree$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.084724
WR00C2R4    35   4   4   1     0.00     0.089982
WR00R2L2    35   4   4   1     0.00     0.085233
WR00R2L4    35   4   4   1     0.00     0.088178
WR00R2C2    35   4   4   1     0.00     0.085176
WR00R2C4    35   4   4   1     0.00     0.092192
WR00R2R2    35   4   4   1     0.00     0.086446
WR00R2R4    35   4   4   1     0.00     0.092728
AMD FX-6100 3-6 CPUs 3.3GHz
soft$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.074744
WR00C2R4    35   4   4   1     0.00     0.074744
WR00R2L2    35   4   4   1     0.00     0.073127
WR00R2L4    35   4   4   1     0.00     0.075299
WR00R2C2    35   4   4   1     0.00     0.072952
WR00R2C4    35   4   4   1     0.00     0.076627
WR00R2R2    35   4   4   1     0.00     0.076052
WR00R2R4    35   4   4   1     0.00     0.073127
AMD FX-8370e 4-8 CPUs 3.3GHz
base$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00     0.152807
WR00C2R4    35   4   4   1     0.00     0.149934
WR00R2L2    35   4   4   1     0.00     0.150643
WR00R2L4    35   4   4   1     0.00     0.152807
WR00R2C2    35   4   4   1     0.00     0.132772
WR00R2C4    35   4   4   1     0.00     0.160093
WR00R2R2    35   4   4   1     0.00     0.163582
WR00R2R4    35   4   4   1     0.00     0.158305
Xeon E3-1200 4-8 CPUs at 2.4GHz (Update 170920)
virt$ mpirun -np 4 xhpl | grep '^W' | tail -8 | sed 's/   / /g'
WR00C2R2    35   4   4   1     0.00      0.3457
WR00C2R4    35   4   4   1     0.00      0.3343
WR00R2L2    35   4   4   1     0.00      0.3343
WR00R2L4    35   4   4   1     0.00      0.3104
WR00R2C2    35   4   4   1     0.00      0.3418
WR00R2C4    35   4   4   1     0.00      0.3622
WR00R2R2    35   4   4   1     0.00      0.3497
WR00R2R4    35   4   4   1     0.00      0.3756

Note: the X3 720 and the FX-6100 are running SL7 (64 bit), and the i3-370M and FX-8370e are running ULTS14 (64 bit). On the SL7 systems I got slightly better GFLOPS with OpenBLAS, but on the ULTS14 systems it was 100 times slower than ATLAS plus BLAS.

Obvious here is that the X3 720 was not very competitive in FLOPS even with a laptop CPU like the i3-370M, and that at least on floating point intel CPUs are quite competitive. The reason is that most compilers optimize/schedule floating point operations specifically for what works best on intel floating point internal architectures.

Note: the site cpubenchmark.net rates using PassMark® the X3 720 at 2,692, the i3-370M at 2,022, the FX-6100 at 5,412 and the FX-8370e at 7,782, which takes into account also non-floating point speed and the number of CPUs, and these ratings seem overall fair to me.

It is however fairly impressive that the FX-8370e is still twice as fast as the FX-6100 at the same GHz, and it is pretty good on an absolute level. However I mostly use the much slower i3-370M in the laptop, and for interactive work it does not feel much slower.

170904 Fri: A developer buys a very powerful laptop

In the blog of a software developer there is a report on the new laptop he has bought to do his work:

a new laptop, a Dell XPS 15 9560 with 4k display, 32 GiBs of RAM and 1 TiB M.2 SSD drive. Quite nice specs, aren't they :-)?

That is not surprising when a smart wristwatch has a dual-CPU 1GHz chip, 768MiB of RAM and 4GiB of flash SSD, but it has consequences for many other people: such a powerful development system means that improving the speed and memory use of the software written by that developer will not be a very high priority. It is difficult for me to see practical solutions for this unwanted consequence of hardware abundance.

2017 August

170831 Fri: SFTP speeds have improved a lot

Some years ago I had reported that the standard sftp (and sometimes scp) implementation then available for GNU/Linux and MS-Windows was extremely slow because of a limitation in the design of its protocol and implementation, which meant that it effectively behaved as if in half-duplex mode.

At the current time the speed of commonly available sftp implementations is instead pretty good, with speeds of around 70-80MB/s on 1Gbit links, because the limitation could be circumvented and the implementations have been (arguably) improved, both the standard one and an extended one.
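
On long fat links the protocol's request window still matters; OpenSSH's sftp exposes it directly (the values here are arbitrary examples):

sftp -B 262144 -R 256 user@host
# -B sets the size of each read/write request, -R how many requests are kept
# in flight; throughput is bounded by roughly (B × R) / round-trip time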

170825 Fri: M.2 storage with and without backing capacitors

Previously I mentioned the large capacitors on flash SSDs in the 2.5in format targeted at the enterprise market. In an article about recent flash SSDs in the M.2 format there are photographs and descriptions of some with mini capacitors with the same function, and a note that:

The result is a M.2 22110 drive with capacities up to 2TB, write endurance rated at 1DWPD, ... A consumer-oriented version of this drive — shortened to the M.2 2280 form factor by the removal of power loss protection capacitors and equipped with client-oriented firmware — has not been announced

The absence of the capacitors is likely to save relatively little in cost, and having to manufacture two models is likely to add some of that back, but the lack of capacitors makes the write IOPS a lot lower (because it requires write-through rather than write-back caching) and for the same reason also increases write amplification, thus creating enough differentiation to justify a much higher price for the enterprise version.
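
The write-back vs. write-through difference can be felt even on an ordinary SATA drive by toggling its volatile write cache with hdparm (the device name is an assumption); with the cache off, small synchronous writes get dramatically slower:

hdparm -W /dev/sda        # query the current write-cache setting
hdparm -W 0 /dev/sda      # write-through: writes complete only once on the medium
hdparm -W 1 /dev/sda      # write-back: writes are acknowledged from the cache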

170817 Thu: 50TB 3.5in flash SSD

Interesting news of a line of 3.5in form factor flash SSD products (instead of the more conventional M.2 and 2.5in), the Viking UHC-Silo (where UHC presumably stands for Ultra High Capacity), in 25TB and 50TB capacities, with a dual-port 6Gb/s SAS interface.

Interesting aspects: fairly conventional 500MB/s sequential read and 350MB/s sequential write rates, and not unexpectedly 16W power consumption. But note that at 350MB/s filling (or duplicating) this unit takes around 40 hours (50,000,000MB at 350MB/s is roughly 143,000 seconds). That big flash capacity could probably sustain a much higher transfer rate, but that might involve a much higher power draw and thus heat dissipation, so I suspect that the limited transfer rate does not depend just on the 6Gb/s channel, which after all allows for the slightly higher 500MB/s read rate.

Initial price seems to be US$37,000 for the 50TB version, or around US$1,300 per TB which is more or less the price for a 1TB flash SSD enterprise SAS product.

170816 Wed: Another economic reason to virtualize

The main economic reason to virtualize has allegedly been that in many cases the hardware servers that run the host systems of applications are mostly idle, as a typical data centre hosts many low-use applications that nevertheless are run on their own dedicated server (being effectively containerized on that hardware server), and therefore consolidating them onto a single hardware server can result in a huge saving; virtualization allows consolidating the hosting systems as-they-are, simply by copying them from real to virtual hardware, saving transition costs.

Note: that takes for granted that in many cases the applications are poorly written and cannot coexist in the same hosting system, which is indeed often the case. More disappointingly, it also takes for granted that increasing the impact of a hardware failure from one to many applications is acceptable, and that replication of virtual systems to avoid that is cost-free, which it isn't.

But there is another rationale for consolidation, if not virtualization, that does not depend on there being a number of mostly-idle dedicated hardware servers: the notion that the cost per capacity unit of midrange servers is significantly lower than for small servers, and that the best price/capacity ratio is with servers like:

  • 24 Cores (48 cores HT), 256GB RAM, 2* 10GBit and 12* 3TB HDD including a solid BBU. 10k EUR
  • Applications that actually fill that box are rare. We are cutting it up to sell off the parts.

That by itself is only a motivation for consolidation, that is for servers that run multiple applications; in the original presentation it becomes a case for virtualization because of the goal of having a multi-tenant service with independent administration domains.

The problem with virtualization to take advantage of lower marginal cost of capacity in mid-size hardware is that it is not at all free, because of direct and indirect costs:

Having looked at the numbers involved, my guess is that there is no definite advantage in overall cost to consolidation of multiple applications, and it all depends on the specific case; usually, but not always, virtualization has costs that outweigh its advantages, especially for IO intensive workloads, and this is reflected in the high costs of most virtualized hosting and storage services (1, 2).

Virtualization for consolidation has however an advantage as to the distribution of cost as it allows IT departments to redefine downwards their costs and accountabilities: from running application hosting systems to running virtual machine hosting systems, and if 10 applications in their virtual machines can be run on a single host system, that means that IT departments need to manage 10 times fewer systems, for example going from 500 small systems to 50.

But the number of actual systems has increased from 500 to 550, as the systems in the virtual machines have not disappeared, so the advantage for IT departments comes not so much from consolidation as from handing back the cost of managing the 500 virtual systems to the developers and users of the applications running within them, which is usually what DevOps means.

The further stage of IT department cost-shifting is to get rid entirely of the hosting systems, and outsource the hosting of the virtual machines to a more expensive cloud provider, where the higher costs are then charged directly to the developers and users of the applications, eliminating those costs from the budget of the IT department, which is then only left with the cost of managing the contract with the cloud provider on behalf of the users.

Shifting hardware and systems costs out of the IT department budget into that of their users can have the advantage of boosting the career and bonuses of IT department executives by shrinking their apparent costs, even if it does not reduce overall business costs. But it can reduce aggregate organization costs when it discourages IT users from using IT resources unless there is a large return, by substantially raising the direct cost of IT spending to them; so even at the aggregate level it might be, in specific cases, ultimately a good move.

That is, a business that consolidates systems, switches IT provision from application hosting to systems hosting, and then outsources system hosting, is in effect telling its component businesses that they are overusing IT and that they should scale it back, by effectively charging more for application hosting and supporting it less.

170810 Thu: Speedy writing to a BD-R DL disc

Today after upgrading (belatedly) the firmware of my BDR-2209 Pioneer drive to 1.33 I have seen for the first time a 50GB BD-R DL disc written at around 32MB/s average:

Current: BD-R sequential recording
Track 01: data  31707 MB        
Total size:    31707 MB (3607:41.41) = 16234457 sectors
Lout start:    31707 MB (3607:43/41) = 16234607 sectors
Starting to write CD/DVD at speed MAX in real TAO mode for single session.
Last chance to quit, starting real write in   0 seconds. Operation starts.
Waiting for reader process to fill input buffer ... input buffer ready.
Starting new track at sector: 0
Track 01: 9579 of 31707 MB written (fifo 100%) [buf 100%]   8.0x.
Track 01: Total bytes read/written: 33248166496/33248182272 (16234464 sectors).
Writing  time:  1052.512s

This was on a fairly cheap no-name disc. I sometimes also try to write to 50GB BD-RE DL discs, but that works only sometimes, and at best at 2x speed. I am tempted to try, just for the fun of it, to get a 100GB BD-RE XL disc (they have been theoretically available since 2011) but I suspect that would be wasted time.
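
The log above is cdrecord's; the invocation was presumably along these lines (the device name and image path are assumptions):

cdrecord -v dev=/dev/sr0 -tao image.iso   # no speed= option, hence "speed MAX" in the log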

2017 July

170724 Mon: Different types of M.2 slots

As another example that there are no generic products, I was looking at a PC card to hold an M.2 SSD device and interface it to the host PCIe bus, and the description carried a warning:

Note, this adapter is designed only for 'M' key M.2 PCIe x4 SSD's such as the Samsung XP941 or SM951. It will not work with a 'B' key M.2 PCIe x2 SSD or the 'B' key M.2 SATA SSD.

There are indeed several types of M.2 slots, with different widths and speeds, supporting different protocols, and this is one of the most recent and fastest variants. Indeed among the user reviews there is also a comment as to the speed achievable by an M.2 flash SSD attached to it:

I purchased LyCOM DT-120 to overcome the limit of my motherboard's M.2. slot. Installation was breeze. The SSD is immediately visible to the system, no drivers required. Now I enjoy 2500 MB/s reading and 1500 MB/s writing sequential speed. Be careful to install the device on a PCI x4 slot at least, or you will still be hindered.

Those are pretty remarkable speeds, much higher (in peak sequential transfer) than those for a memristor SSD.
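
Whether such a card actually negotiated four lanes can be checked from Linux (the PCI address is an assumption; run as root for the full capability dump):

lspci | grep -i 'non-volatile'           # find the NVMe device's PCI address
lspci -s 02:00.0 -vv | grep -i 'lnksta'  # e.g. "LnkSta: Speed 8GT/s, Width x4"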

170720 Thu: The O_PONIES summary and too much durability

So last night I was discussing the O_PONIES controversy and was asked to summarise it, which I did as follows:

There is the additional problem that available memory has grown at a much faster rate than IO speed, at least that of hard disks, and this has meant that users and application writers have been happy to let very large amounts of unwritten data accumulate in the Linux page cache, which then takes a long time to be written to persistent storage.
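
The amount of unwritten data the kernel will let accumulate is tunable; capping it is a common mitigation (the values here are arbitrary examples):

sysctl -w vm.dirty_background_bytes=$((64*1024*1024))   # start background writeback at 64MiB
sysctl -w vm.dirty_bytes=$((256*1024*1024))             # block writers beyond 256MiB of dirty data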

The comments I got on this story were entirely expected, and I was a bit disappointed, but in particular one line of comment offers me the opportunity to explain a particularly popular and delusional point of view:

the behavior I want should be indistinguishable from copying the filesystem into an infinitely large RAM, and atomically snapshotting that RAM copy back onto disk. Once that's done, we can talk about making fsync() do a subset of the snapshot faster or out of order.

right. I'm arguing that the cases where you really care are actually very rare, but there have been some "design choices" that have resulted in fsync() additions being argued for applications..and then things like dpkg get really f'n slow, which makes me want to kill somebody when all I'm doing is debootstrapping some test container

in the "echo hi > a.new; mv a.new a; echo bye > b.new; mv b.new a" case, writing a.new is only necessary if the mv b.new a doesn't complete. A filesystem could opportunistically start writing a.new if it's not already busy elsewhere. In no circumstances should the mv a.new operation block, unless the system is out of dirty buffers

you have way too much emphasis on durability

The latter point is the key and, as I understand them, the implicit arguments are:

As to that I pointed out that any decent implementation of O_PONIES indeed is based on the true and obvious point that it is pointless and wasteful to make a sequence of metadata and data updates persistent until just before a loss of non-persistent storage content, and that therefore all that is needed is two system-internal functions, returning respectively the time interval to the next loss of non-persistent storage content, and the time interval to the end of the current semantically-coherent sequence of operations.

Note: the same reasoning of course applies to backups of persistent storage content: it is pointless and wasteful to make them until just before a loss of the contents of that persistent storage.

Given those O_PONIES functions, it would almost never be necessary to explicitly fsync data, or implicitly fsync metadata, in a sequence of operations like:

echo hi > a.new; mv a.new a; echo bye > b.new; mv b.new a

Because the updates would be implicitly made persistent only once it became known that a loss of non-persistent storage content would happen before the sequence completed.

As simple as that!
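
Meanwhile, lacking those functions, an application that wants the sequence above to be durable has to force the issue explicitly; in shell terms, assuming a recent GNU coreutils whose sync accepts file arguments, roughly:

echo hi > a.new
sync a.new    # fsync() the new file's data before renaming over the target
mv a.new a
sync .        # fsync() the directory, making the rename itself durable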

Unfortunately until someone sends patches to Linus Torvalds implementing those two simple functions there should be way too much emphasis on durability because:

170716c Sun: Hardware specification of "converged" mobile phone

In the same issue of Computer Shopper as the review of a memristor SSD there is also a review of a mobile phone, also from Samsung, the Galaxy S8+, which has an 8-CPU 2.3GHz chip, 4GiB of memory, and 64GiB of flash SSD built in. That is the configuration of a pretty powerful desktop or a small server.

Most notably for me it has a 6.2in 2960×1440 AMOLED display, which is both physically large and with a higher pixel size than most desktop or laptop displays. There have been mobile phones with 2560×1440 displays for a few years, which is amazing in itself, but this is a further step. There are currently almost no desktop or laptop AMOLED displays, and very few laptops have pixel sizes larger than 1366×768, or even decent IPS displays. Some do have 1920×1080 IPS LCD displays, and only a very few have 3200×1800 pixel sizes.

The other notable characteristic of the S8+, given that its processing power is huge, is that it has an optional docking station that allows it to use an external monitor, keyboard and mouse (most likely the keyboard and mouse can be used anyhow, as long as they are Bluetooth ones).

This is particularly interesting as desktop-mobile convergence (1, 2, 3) was the primary goal of the Ubuntu strategy of Canonical:

Our strategic priority for Ubuntu is making the best converged operating system for phones, tablets, desktops and more.

Earlier this year that strategy was abandoned and I now suspect that a significant motivation for that was that Samsung was introducing convergence themselves with Dex, and for Android, and on a scale and sophistication that Canonical could not match, not being a mobile phone manufacturer itself.

170716b Sun: Hardware specification of smart watch and related

In the same Computer Shopper UK issue 255 with a review of a memristor SSD there is also a review of a series of fitness-oriented smartwatches, and some of them, like the Samsung Gear S3, have 4GB of flash storage, 768MiB of RAM and a two-CPU 1GHz chip. That can run a server for a significant workload.

170716 Sun: Miracles do happen: memristor product exists

Apparently memristor storage has arrived, as Computer Shopper UK, issue 255, has a practical review of the Intel 32GB Optane Memory in an M.2 form factor, based on the mythical 3D XPoint memristor memory brand. The price and specifications have some interesting aspects:

The specifications don't say it, but it is not a mass storage device with a SATA or SAS protocol: it is a sort of memory technology device as far as the Linux kernel is concerned, and for optimal access it requires special handling.

Overall this first memristor product is underwhelming: it is more expensive and slower than equivalent M.2 flash SSDs, even if the read and write access times are much better.

170712 Wed: Special case indentation for PSGML

To edit HTML and XML files like this one I use EMACS with the PSGML library, as it is driven by the relevant DTD, and this drives validation (which is fairly complete) and indentation. As to the latter, some HTML elements should not be indented further, because they indicate types of speech rather than parts of the document, and some do not need to be indented at all, as they indicate preformatted text.

Having looked at PSGML there is for the latter case a variable in psgml-html.el that seems relevant:

	sgml-inhibit-indent-tags     '("pre")

but it is otherwise unimplemented. So I have come up with a more complete scheme:

(defvar sgml-inhibit-indent-tags nil
  "*A list of tags whose content should not be indented at all")

(defvar sgml-same-indent-tags nil
  "*A list of tags whose content should not be indented further")

(defun sgml-mode-customise ()

  ;; Redefine PSGML's indentation function to consult the two lists above.
  (defun sgml-indent-according-to-level (element)
    (let ((name (symbol-name (sgml-element-name element))))
      (cond
	;; preformatted-text elements: no indentation at all
	((member name sgml-inhibit-indent-tags) 0)
	;; type-of-speech elements: keep the enclosing element's indentation
	((member name sgml-same-indent-tags)
	  (* sgml-indent-step
	    (- (sgml-element-level element) 1)))
	;; anything else: indent one step per nesting level, as PSGML does
	(t
	  (* sgml-indent-step
	    (sgml-element-level element)))
      )
    )
  )
)

(if (and (fboundp 'sgml-mode) (not noninteractive))
  (if (fboundp 'eval-after-load)
     (eval-after-load "psgml" '(sgml-mode-customise))
     (sgml-mode-customise)
  )
)

(defun html-sgml-mode ()
  "Simplified activation of HTML as an application of SGML mode."
  (interactive)

  (sgml-mode)
  (html-mode)

  (make-local-variable			'sgml-default-doctype-name)
  (make-local-variable			'sgml-inhibit-indent-tags)
  (make-local-variable			'sgml-same-indent-tags)

  (setq
    sgml-default-doctype-name		"html"
    ; tags must be listed in upper case
    sgml-inhibit-indent-tags		'("PRE")
    sgml-same-indent-tags		'("EM" "TT" "I" "B" "CITE" "VAR" "CODE" "DFN" "STRONG")
  )
)

It works well enough, except that I would prefer the elements with tags listed in sgml-inhibit-indent-tags to have their start and end tags also not indented, not just their content; but PSGML indents those as content of the enclosing element, so achieving that would require more invasive modifications of the indentation code.

170704 Tue: Fast route lookup in Linux with large numbers of routes

Fascinating report with graphs on how route lookup has improved in the Linux kernel, and the very low lookup times reached:

Two scenarios are tested:

  • 500,000 routes extracted from an Internet router (half of them are /24), and
  • 500,000 host routes (/32) tightly packed in 4 distinct subnets.

As of kernel version 3.6.11 the routing lookup cost was 140ns and 40ns for the two scenarios; as of 4.1.42 it is 35ns and 25ns. Dedicated "enterprise" routers with hardware routing are probably equivalent. In a previous post the amount of memory used is given: With only 256 MiB, about 2 million routes can be stored!

As previously mentioned, once upon a time IP routing was much more expensive than Ethernet forwarding and therefore there was a speed case both for Ethernet forwarding across multiple hubs or switches, and for routing based on prefixes; despite the big problems that arise from Ethernet forwarding across multiple switches and the limitations that follow from subnet routing.

But it has been a long time since subnet routing has been as cheap as Ethernet forwarding, and it is now pretty obvious that even per-address host routing is cheap enough at least on the datacentre and very likely on the campus level (500,000 addresses is huge!).
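
Nothing special is needed to try host routes: to the kernel a /32 is just another prefix (the addresses are examples):

ip route add 192.0.2.7/32 via 10.0.0.1 dev eth0
ip route get 192.0.2.7    # shows the route and next hop the kernel picks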

Therefore so-called host routing can result in a considerable change of design decisions, but that depends on realizing that host routing is improper terminology, and on the difference between IP addresses and Ethernet addresses:

Note: the XNS internetworking protocols used addresses formed of a 32 bit prefix as network identifier (possibly subnetted) plus a 48 bit Ethernet identifier, which to me seems a very good combination.

The popularity of Ethernet and in particular of VLAN broadcast domains spanning multiple switches depends critically on Ethernet addresses being actually identifiers: an interface can be attached to any network access point on any switch at any time and be reachable without further formality.

Now so-called host routes turn IP addresses into (within The Internet) endpoint identifiers, because the endpoint can change location, from one interface to another, or one host to another, and still identify the service it represents.

170701 Sat: Fixing an Ubuntu phone after a power loss crash

So I have this Aquaris E4.5 with Ubuntu Touch, whose battery is a bit fatigued, so it sometimes runs out before I can recharge it. Recently when it restarted the installed Ubuntu system seemed damaged: various things had bizarre behaviour or none at all.

For example when I clicked on System Settings>Wi-Fi I saw nothing listed, except that Previous networks listed a number of them, some 2-3 times, but when I clicked on any of them, System Settings exited; no SIM card was recognized (they work in my spare identical E4.5), and no non-local contacts appeared.

Also, System Settings just crashed when clicking on About, Phone, Security & Privacy, or Reset.

Previous "battery exhausted" crashes had no bad outcomes, as expected, as battery power usually runs out when the system is idle.

After some time looking into it I figured out that part of the issue was that:

$HOME/.config/com.ubuntu.address-book/AddressBookApp.conf.lock
$HOME/.config/connectivity-service/config.ini.lock

were blocking the startup of the relevant services, so removing them allowed the services to proceed.

As to System Settings crashing, it was dying with SIGSEGV, so strace let me figure out that this was from running out of memory just after accessing something under $HOME/.cache/, so I just emptied that directory, and then everything worked. Some cached setting had perhaps been corrupted. I suspect that the cache needs occasional cleaning out.
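
In shell terms the whole fix was roughly this (the .cache wipe loses only regenerable data):

rm -f $HOME/.config/com.ubuntu.address-book/AddressBookApp.conf.lock
rm -f $HOME/.config/connectivity-service/config.ini.lock
rm -rf $HOME/.cache/*    # cached data only; it is regenerated on next use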

Note: another small detail: /usr/bin/qdbus invokes /usr/lib/arm-linux-gnueabihf/qt5/bin/qdbus which is missing, but /usr/lib/arm-linux-gnueabihf/qt4/bin/qdbus works.