Computing notes 2015 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

[file this blog page at: digg del.icio.us Technorati]

150418 Sat: PCIe flash SSDs and heat

Flash SSDs come also in a format similar to memory sticks, with a PCIe compatible interface, and they tend to be quite fast too.

In a related discussion I was pointed at this fairly dramatic video on how hot becomes the CPU chip on the stick and over 100C° seems quite hot to me, and other reviews confirm that even for relatively modest operations involving a few GiB of IO the CPUs on other sticks can become similarly hot (1, 2).

The heat of course has an influence on the durability of the adjacent flash chips, as flash technology is fairly sensitive to heat; more so than the CPU chips themselves that probably are rated for those temperatures.

I am quite astonished that these memory-stick like drives don't come with even a minimal heat spreader and dissipator, as DRAM memory stick often (and usually pointlessly do).

I have also briefly worried about temperatures inside 2.5in form factor boxed flash SSD drives, but I think that there are two important differences: usually since the base board is rather larger the CPU chip is far away from the flash chips, and usually it is connected to the metallic case via a terhmally conductive pad. Since after all the total power draw of a flash SSD is of the order of a few watts the metallic case is most likely amply sufficient. The problem with memory-stick like devices is that the power draw and thus the heat is very concentraced in the small CPU chip and builds up.

But there are exceptions to this, and I noticed recently that a flash SSD 2.5in PCIe drive that is rated for peak power draw of 25W has heavy cooling fins.

150406b Mon: Long running test of flash SSD endurance ends

A very long running test of flash SSD by the often excellent TechReport involved rewriting constantly to them to verify their endurance and its has recently finished after a starting a little more than 18 months ago.

The results are quite impressive: the shortest endurance was for 700TB of writes, and the longest was for 2400TB written over 18 months, at a rate of over 120TB per month, with breaks, as the products where reset with a SATA secure erase every 100TB written.

The speed of the units showed little sign of degration until the end. But it is interesting to note that the continuos write speed is not quite the same as that for somewhat limited duration tests, as shows in the graph at the bottom of the page as it can go from 75MB/s to 250MB/s, and one (particularly good) devices oscillated between 75MB/s and 120MB/s.

The authors also report on the numbers reported by the devices as to erase block failures and they follow for most devices a fairly reasonable, steady rise but only after hundred of TBs have been written, and the reserve of fresh erase blocks tends to last a long time. This implies that leaving a small part of a flash SSD unused can indeed give benefits; and probably not just higher endurance, which in most cases is pointless, but probably lower latency and jitter for writes.

As the authors point out for typical single user workloads endurance is not an issue as the SSDs in my main desktop have logged less than two terabytes of writes over the past couple years which is near the 4GiB/day (1.5TB/year) written on my laptop.

The endurance test by TechReport shows that ordinary consumer grade flash SSD drives have pretty in practice huge endurance that amounts to many decades of desktop or laptop use. These endurances are not warranted, and as previously reported flash SSDs graded as enterprise have warranted endurances of 3500TB or even 7000TB of writes (1, 2), both compare well or favourably to the specified endurances of magnetic disk drives, because as I pointed out previously magnetic disk drives also have duty cycles for example 180TB per year for 3 years (archival drive) or 550TB per year for 5 years (enterprise drive) of total write and reads. Every product has a duty cycle.

Flash SSDs have improved a lot in a few years as to the wear leveling done by their FTL and also as the latency and jitter caused by wear leveling, even if erase block sizes have increased a lot (I think currently most flash chips have 8MiB erase blocks), and I think that my previous conclusion that at those levels of endurance these enterprise flash SSD drives might well be suitable for replacing 2.5in SAS 10000PRM or 15000RPM disk drives.

What ordinary flash SSDs probably will never be suitable for is long term offline storage, per the little known detail that I mentioned previously that they lose data after as little as 3 months without power while enterprise flash SSDs have huge capacitors that presumably can extend that significantly.

150406 Sat: Small shell functions to set window and tab labels

For several years I have been using some shell functions which are part of my standard env shell startup script. These functions allow setting from the command line the label (title) of an xterm or konsole window or a screen/tmux or konsole tab.

I have recently updated them cosmetically because they contained embedded raw escape sequences that were annoying to print; the updated versions have now all those escape sequences encoded, and the printf command or shell builtin emits the raw sequences.

The first function sets the window title of the current xterm window. It also works with other graphical terminal programs that use the same escape sequence.

# Set window title:	\033]0;MESSAGE\007
xtitle() \
{
  case "$#" in '0')	set "$HOST"\!"$USER";; esac
  case "$*" in \!|\@)	echo 2>&1 'No arguments or default!'; exit 1;; esac 
  printf "\033]0;$*\007"
}

This function sets the title of the current tab/window in the screen command. It is often quicker than using the the relevant prefix command.

# Set SCREEN tab title:	\033kMESSAGE\033\134
stitle() \
{
  case "$#" in '0')	set "$PS1";; esac
  case "$*" in \!|\@)	echo 2>&1 'No arguments or default!'; exit 1;; esac 
  printf "\033k$*\033\\"
}

This function sets the title of the current konsole tab.

# Set tab title:	\033]30;MESSAGE\007
ktitle() \
{
  case "$#" in '0')	set "$PS1";; esac
  case "$*" in \!|\@)	echo 2>&1 'No arguments or default!'; exit 1;; esac 
  printf "\033]30;$*\007"
}

The last one sets the color of the current konsole tab, but it seems no longer effective in recent konsole versions.

# http://meta.ath0.com/2006/05/24/unix-shell-games-with-kde
# Set tab title color:	\033[28;RGBt
# where RGB is in *decimal*.
kcolor() \
{
  case "$#" in '0')	set 255 32 0;; esac
  case "$*" in \!|\@)	echo 2>&1 'No arguments or default!'; exit 1;; esac 
  RGB=`expr "$1" \* 65536 + "$2" \* 256 + "$3"`
  printf "\033[28;${RGB}t"
}

150329 Sun: CERN's old large disk discussion and IOPS-per-TB

Having recently mentioned the issues arising from increasing disk capacity without increased access rates and slicing disks to reduce the impact of that I have reread one of the first mentions of this issue in a very interesting presentation by the CERN IT department in 2007 on Status and Plans of Procurements at CERN where one of the pages says:

Large Disk Servers (1)

Motivation

Disks getting ever larger, 5 TB usable will be small soon

CPUs ever more powerful

Potential economies per TB

Constraints

Need a bandwidth of ~ 18 MByte/s per TB

Hence everything more than 6 TB means multiple Gbit connections or a 10 Gbit link

Networking group strongly advised against link aggregation

That presentation mentions the bandwidth of ~ 18 MByte/s per TB and when the presentation was made I asked the author what that meant and he told me a concurrent transfer rate of a read and a write thread of around 18MB/s per TB, and it was designed to prevent vendors from bidding storage systems based entirely on large disks that did not have enough IOPS to sustain a reasonable workload.

I asked the same author a few days ago whether CERN procurement continued to have the same requirement and I understood from his reply that CERN no longer have it in practice because for systems that merely archive data it does not matter, and they are requiring flash SSDs (that certainly satisfy the requirement) for systems that do data processing, in part because ever larger disks as anticipated no longer have enough IOPS-per-TB.

Overall the problem is to have balanced systems, that is where there is a ratio between CPU, memory, storage capacity and IOPS to match the requirement of the workload. One of the oldest rules was the VAX rule of one MiB of memory per MIPS of CPU speed (and at the time typically one instruction was completed per Hz), and somewhat amusingly many current systems still have that ratio, for example server systems with 16 CPUs at 2GHz having 32GiB of memory are fairly typical.

The original VAX-11/780 had that ratio, and also typically around 500MiB of disks (typically 2×300MB disks) capable of around 100 random IOPS together, with a memory/storage ratio of around 500 and an IOPS-per-TB ratio of around 200,000.

A typical 16 CPU server with 32GiB of memory and two 500GiB SSDs capable together of 80,000 IOPS has a memory/storage ratio of around 30 and an IOPS-per-TB ratio of around 80,000 which is a factor of 2-3 lower than a few decades ago.

That seems acceptable to me considering that for many applications CPU processing time has gone up. For CERN the greater CPU-memory ratio to everything else has enabled them to greatly improve the quality of their analysis and simulation.

150328 Sat: Perhaps Amazon slices large disks across to implement cold storage

I recently argued that large disks should be sliced and RAID sets built across such slices, so I was amused to read (can't remember where) an interesting guess as to how the Amazon's service Glacier is so much cheaper than online storage but has such long delays for first access and this relates to my other post about The low IOPS-per-TB of large disks.

The idea is that Amazon may be short-stroking large disks by using only their outer cylinders (like some vendors also apparently do) to improve the IOPS-per-TB of their online storage and therefore has the rest of the disk available for low, low priority accesses. Presumably if this is true the long delay for first access to Glacier is due to the data being copied off the inner cylinders to staging disks very slowly (and intermittently) to avoid interfering with the accesses to the outer cylinders.

150321 Sat: Two better IPC schemes

I was recently discussing IPC methods, in the usual terms of message passing and shared memory:

Message passing involves the source thread in an address space copying the data (the arguments) to transfer into a queue buffer and then sleeping (at least conceptually), and the target thread then being woken up and reading that data from the queue buffer, processing it, and then putting the result data (the return values) back into the queue buffer.
Shared memory involves the buffer queue being accessible to both source and target threads, so that the source thread locks a buffer queue element, puts the data to process in it, then unlocks it and goes to sleep. The target threads awakes on the unlocking, processes the data, and does the same.

The main advantages and disadvantages are:

Message passing by copying allows multiple concurrent processing of the same data, if desired, without the need to lock; but by creating multiple copies of the data it risks potential inconsistencies, as well paying the price of making the copies.
Shared memory does not copy the data, so inconsistencies cannot occur, and there is no cost of copying, but by locking the one data instance reduces potential degree of concurrent processing and requires greater care by programmers as to locking.

The two common techniques above can be used with or without dividing memory into independent address spaces using virtual address translation, but then I mentioned that if the latter is available there is a much better and long forgotten scheme which was designed as part of the MUSS operating system several decades ago, which relies on the availability of virtual memory:

Every process contains a thread and a collection of data segments.
A source thread can send to a target thread a message containing the identifier of a segment; when the target thread receives the message the segment gets detached from the source process and attached to the target process.

This combines the best of message passing and shared memory, because it involves neither data copying nor data sharing: the crucial property of this scheme is that a data segment is only ever attached to just one process at a time, so it is never shared, but since it gets detached and attached by updating the virtual page tables, there is no copying of data either.

Note: the above is a simplified version of the MUSS scheme, which had another important property with additional conspicuous advantages: segments are not implicitly mapped into the address space of the process they are attached to. This means that detaching and attaching a process is in many cases a very cheap operation, just involving updating a kernel table. Thanks to virtual memory segments could be either memory resident or disk resident, with disk resident ones being page-faulted into memory (if mapped) and modified pages saved back to disk using ordinary virtual memory mechanisms.

Note: the MUSS scheme was generalized elegantly to network communications: if the source and target processes are on different systems the data in the segment can be moved (by copying its to the target and deleting it from the source) over the network entirely transparently. This is thanks to the essential property that segments are only ever attached to one process.

In MUSS this method was used for everything: characters and lines read from a terminal by the terminal driver were sent as segments to the login manager and then to the newly spawned command processor; files were segments opened by sending a segment with their name to a file system process, which would then reply with a (disk mapped) segment being the file.

But even the much better MUSS scheme can be substantially improved, as it occured to me long ago, because:

The source and targets thread are in effect engaging into a remote procedure call, where the data queued by the source process is in effect arguments, and the data queued back by the target process is in effect return values, and the operation more than queueing is actually pushing arguments and then result values on a conceptual stack.
It is therefore in the general case pointless to have a sequence of states where the source thread in the the calling process pushes the arguments and then goes to sleep only for the target thread to stop sleeping, processing the arguments, pushing the results, going back to sleep, and then the source thread wakes from sleep and looks at the results.

Therefore my scheme moves both the stack segment carrying the arguments and the thread from the source process to the target process. The segment will be most conveniently the stack segment of the thread itself. Thus a system will have a number of address spaces some of which will have no threads, and some of which will have several threads.

Note: I discovered this scheme by looking at and generalizing the MUSS scheme. A somewhat similar scheme in the Elmwood operating system was inspired by the RIG operating system project, which is part of a different path of kernel design (1, 2) and that inspired the Accent operating system which in turn inspired the better known Mach kernel.

Each address space will have then a specific entry point where threads that are moved to it start executing, and a set of segments containing the code implementing the service and the state of that service. Creating an address is the same as creating a process, where an initial thread initializes the address space, but with the difference that the thread will end with an indication to keep the address space after its end.

The service could be for example a DBMS instance where the initial thread opens the database files, initializes the state of the DBMS, and then end. A thread wanting to run a query on the database would then push the name of the operation on its stack, the query body, and move to the address space with the DBMS instance, by requesting the kernel to switch page tables and to set its instruction pointer to the address space entry address, and the code there would look at the first argument being the name of the operation, invoke it with the query from the second argument, and push the query results onto the stack and move the thread back to the original address space.

Note: when a thread moves out of an address space into another one there must be a way to ask the kernel to create for it a return entry point and push its identity onto the thread's stack. For this and other details there can be several different mechanisms, omitted here for brevity.

Note: among the details this scheme is quite easy to extend to full network transparency, thus allowing threads to move to address spaces in other systems, and it is also easy to allow for mutual distrust between source and target address space.

There is no explicit locking involved in moving a thread from one address space to another, because it is the same thread that moves, thus no need for synchronization. Once the thread is in the address space of the target DBMS instance there could be several different threads so there must be locking, but that would happen also for concurrent execution of message passing requests, and obviously also in the shared memory case.

An address space without threads is in effect what in capability systems is called a type manager in that threads can access the data contained in it only by entering it at a predefined address of the code contained in it.

My IPC scheme is thus in effect a software capability system framework with a coarse degree of protection, but one that is quite cheap, unlike most software capability designs that rely on extensive operating system work.

Note: it is a framework for a system as it does not define how capabilities are implemented, it only provides a low cost mechanism for implementing type managers as address spaces. The code withing them can implement capabilities in various ways, for example as in-address space data structures and then hand out to type users just handles to them, or as encrypted data structures returned directly to type users.

150316 Tue: Contortions needed for effective use of some advising operations

As I wrote previously about data access pattern advising fadvise with POSIX_FADV_DONTNEED and POSIX_FADV_WILLNEED does not advise about future accesses being sequential, but is about already read or written data, and has an instantaneous effect, that is it has be repeated after each series of data has been read or written. That is quitquitee unwise.

That for example means that tools like nocache have to do extra work, and bizarrely may need to issue it twice for it to have effect.

Since those two bits of advice are in effect operations rather than advice they can have further interesting complications as described here in detail and can be summarized as:

Since modified data cannot be discarded, and POSIX_FADV_DONTNEED acts momentarily (that is without tagging the indicated pages), using it just after writing has likely no effect. Therefore the page mentioned above suggests issuing fdatasync first.
Since POSIX_FADV_DONTNEED results in unmodified data to be removed from the Linux page cache unconditionally, if it was in use by other processes it will have to be read into the page cache again.

The solution to the latter issue proposed by in the linked page is to ask the OS the list of pages of the file already in the buffer cache, before reading or writing it, and then issue POSIX_FADV_DONTNEED only on those that don't belong in that list.

The definition of the POSIX_FADV_DONTNEED operation in particular is rather ill-advised (pun intended!). In theory the access pattern advices POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM should obviate the need for both POSIX_FADV_DONTNEED and POSIX_FADV_WILLNEED, but in practice they don't seem to work at all or well enough for Linux, requiring the use of lower level and awkward operations.

As previously argued the stdio library and the kernel of a POSIX system should be able to infer which advice to use implicitly in the vast majority of case from the opening flags and the first few file operations, as was the case in the UNIX kernel on the PDP-11 just 40 years ago...

150315 Mon: Very high density flash SSD box has 500TB in 1U

In one of the blog posts mentioned as to the news about a new 10TB disk drive has some comments reporting (1, 2) that a USA startup has designed a 1U rackmounted custom flash SSD unit with a capacity of over 500TB.

That's for a completely different target market: it is all about reducing cost of operation, from power to rack size, while delivering vey high IOPS-per-TB. But for very rich customers it can be used as a archival too.

Note: To me it looks like a fairly banal product that could be slapped together by any Taiwanese or Chinese company, but in inimitable Silicon Valley style the company that developed the product raised around $50m in venture capital funding and was valued several times that when it was sold. Yes, as one of their people in the relevant YouTube presentation says high density flash is not entirely trivial to design and make reliable, but perhaps that valuation is a bit high for that.

150314 Sat: 10TB drives with shingled tracks

In interesting news the first shingled tracks disk drive has been released and has a huge 10TB capacity. There is no information on the price, but 8TB conventional drives currently cost around $724 (+ taxes) so hopefully the new 10TB drive will cost less or else it does not make a lot of sense.

Shingled tracks drives require special handling as they have large write blocks, larger than read ones, so updates require OS support for RMW. Given that an 8-10TB drive is bound to have terrible IOPS-per-TB, whether it uses shingled tracks or nor, it is going hopefully going to be used for archival, almost as a tape cartridge with fast random access, so the large RMW problem is going to be minimized.

150305 Thu: How many VMs per disk arm?

There have been two trends in computing that to me seem underappreciated, and they are that access times for both DRAM cells and rotating disk have been essentially constant (or at best improving very slowly) for a long time, while capacity and transfer rate for RAM and disks have been growing a lot.

Note: the access time of a DRAM cell has barely improved for a long time, but the latency of a DRAM module has become a bit worse (in absolute time, rather worse in cycles) over time as DRAM cells get organized in ever wider channels to improve sequential transfer rates.

The situation is particularly interesting for rotating disks because access times depend on both arm movement speed and rotation speed, and while arm speed may have perhaps doubled over a few decades rotation speed has remained pretty constant at the common rates of 7200, 10000 and 15000 RPM.

The result has been that rotating storage IOPS-per-disk have been essentially constant around the 100-150 level, and this has meant that IOPS-per-TB have been falling significantly with the increase in disk capacity.

This often is bad news for consolidation of computing infrastructures because often the VMs end up sharing the same physical disk, where the original physical systems had their own disks. On paper sharing a large capacity disk among many VMs, where the capacity of physical system disks was often underused, seems a significant saving, until IOPS-per-TB limitations hit, because of two reasons:

The mere sharing of the same number of IOPS-per-disk slows down each machine unless each has quite low IOPS rates.
Sharing a disk among VMs can randomize accesses even if they were not random before:
- IOPS usually are intended to be those for random access (typically with 4KiB transfers), and IOPS for sequential access can be almost 2 orders of magnitude higher.
- When a physical system running a mostly single threaded application has a single disk it can thus perform well even with a single large disk if accesses are mostly sequential.
- Sharing a disk among VMs creates random access patterns even if every VM has a sequential access pattern, by interleaving randomly their access patterns.

Things are also bad when wide parity-RAID sets are used in an attempt to gain IOPS, because when writing those parity-RAID sets get synchronized by RMW involving parity, which much reduces aggregate IOPS.

So the big question becomes how many VM virtual disks to store on each physical disks sharing IOPS, given their access patterns. Often the answer is not pleasant because it turns our that consolidating disk capacity is much easier and cheaper than consolidating IOPS.

150302 Mon: Some secondary properties of RAID14

I have been illustrating some secondary properties of the RAID14 scheme that I mentioned some time ago for people who object to RAID10 having the geometric property that it does not always continue to operate after the loss of 2 members and I found that some of its properties are not so obvious, especially when comparing to a RAID15 scheme that is also possible:

The parity member is never read when reading data, except in case of failure of a pair in the data members, thus greatly reducing the load on it in almost all cases. It gets read on partial stripe updates, and gets written on every stripe update, but ideally stripe updates don't involve RMW and therefore all stripe members are involved anyhow
The failure of the parity member returns the RAID set to be RAID10, with no need for reconstruction of missing data, as only parity has been lost.
When a failed parity member is replaced rebuilding parity can be particularly fast even under load because there are two drives in each pair that can supply the same block, and MD will choose usually the least loaded.

I think still think that RAID10 is much preferable to RAID14, given likely statistical distributions of member losses, but the arguments above may make RAID14 preferable to RAID15 in many cases.

150222 Sun: A script to print the status of a UMTS 3G connection

Some years ago I started using a UMTS 3G (also known as mobile broadband) USB modem for Internet access from various places, and with some reception issue in some places. To make sense of these I wanted to see signal strength and other status, but suitable status tools were mostly only available under MS-Windows. I found one GNU/Linux tool but not very reliable.

So I have investigated and found out that most UMTS 3G modems have a very similar interface to a serial line modem with an AT command set (once upon a time called an Hayes command set) with UMTS 3G extensions that are fairly well standardized.

The modem appears under Linux as three serial ports, usually ttyUSB0, ttyUSB1 and ttyUSB2. The first is used for data traffic, but usually ttyUSB2 reports status, and/or can be used to give commands to setup status reporting.

So I have written a small perl script (download here) to do that. It is not as polished as I would like, but it seems to work pretty well. However I eventually realized that its overall logic should be different. As it is currently it sets up the modem to report every second its status, and then parses the status messages. What it ought to do is to disable periodic reporting, and instead sleep for some interval, and then request the current status, and repeat.

Its single parameter ought to be the serial device to use, which defaults to /dev/ttyUSB2, and typical invocation and output are:

$  perl umtsstat.pl 
hh:mm:ss RX-BPS TX-BPS  ELAPSED   RX/TX-MIB ATTEN.DBM MODE        CELL
10:23:22     20      0 000:00:02     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:24      0      0 000:00:04     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:26      0      0 000:00:06     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:28      0      0 000:00:08     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:30      0      0 000:00:10     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:32      0      0 000:00:12     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:34      0      0 000:00:14     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:36     42     84 000:00:16     0/0    -67 (21)  WCDMA.      6C09C2 
10:23:38    126     84 000:00:18     0/0    -67 (21)  WCDMA.      6C09C2

The first column is the time the status was read.
RX-BPS and TX-BPS are current traffic rates in bytes per second.
The fourth column is how long the current connection has lasted.
RX/TX-MIB are MiB transferred so far incoming and outgoing.
The nest two values are the attenuation (thus negative values) in decibel plus an index of signal strength between 1 and 32 (32 is strongest signal).
The MODE column is the current cell connection type, with HSPA+ being the fastest 3G mode for my modem and WCDMA being the common and much slowed 2G mode. A following dot means that the mode is literally as reported as a string, else it as a code that was looked up in a table.
The last column is the mobile phone cell id.

150203 Tue: Expecting magic when writing large files slowly

Is surprises me sometimes still how high are the expectations by users of magical or even psychic behaviour by filesystems, and a recent example is the discovery by some users as neatly summarized here that some filesystem create rather fragmented files if applications don't give hints and then proceed to write large files in small increments and slowly.

Shock and horror, but what can the filesystem do without psychic powers enabling it to predict the final size of the file or that it will continue to be written slowly at intervals for a long time?

Some filesystem designs try to reserve some space the current end of an open file to accomodate further appending writes, and some do so even for closed files. Some if they have several files open for write at the same time will try to enable reserving space for growth by deliberately pushing those files far away from each other, which has its own problems.

But all these choices have severe downsides in not so uncommon cases. The results are familiar to those who don't expect magic: when downloading large files over slow links they often end up as many extents too, and the same happens with tablespaces of slowly growing databases, and similarly for data collected by instruments at intervals, and many other log-like data patterns.

This happens because of limitations are three different levels:

Storage devices and in particular those based on rotating discs have spectacularly anisotropic performance profiles, with very limited degrees of freedom. This means that it is exceptionally difficult to achieve close to optimal layouts except in very narrow cases.
Filesystem code in the absence of hints can try to predict future access patterns by using an anticipative logic making assumptions, but that will fail very badly in many common cases.
Many operating systems do not allow providing hints of future operation patterns, or these are insufficient.
Even when the operating system has some hint APIs most application developers prefer to avoid making the small effort of using them, or even learning about their existence.

In the instant case of a program like syslog or an imitation that writes relatively slow small increments over a long time it would be the application writer skill to use where available hints for access patterns and for file size.

I have illustrated access pattern hints previously and these ought to help a lot, along with preallocation hints (fallocate(2), truncate(2) with a value larger than the current size of a file) but then application programmers often don't use fsync(2) which is about data safety and similarly don't check the return value of close(2).

150131 Sat: The interfaces that are not system calls

There is a large ongoing issue with the Linux kernel, and it is only getting worse: while wisely the basic kernel system call interface is kept very stable, there are a significant number of kernel interfaces that are not system calls and change a lot more easily, for example the /proc interface and the /sys interface or the socket operations NETLINK interface.

Note: the common aspect is that they use core kernel data interfaces for metadata or even commands.

In other words just like in Microsoft's Windows subsystem designers design their own kernel APIs largely voiding the aim of having a unified and stable system call interface.

To some extent this is inevitable: after all at least some critical services may well be implemented in user space using whichever communication mechanism and API style to define special protocols. But it becomes rather different when it is kernel interfaces that are defined using arbitrary mechanisms, because the kernel system call interface then becomes largely irrelevant because most programs end up using many other interfaces besides.

Maintaining the system call API clean and stable migth have been of value when programs expected few and simple services, for example when networking and graphics were uncommon, and the kernel system call API was more or less coextensive with the stdio library.

What perhaps might have been better for the long term would have been to define a simple communication scheme, with a well established style of using it, and then used it for both process-to-kernel and process-to-process (client-to-daemon) requests.

But that course of action never proved popular, in part because most of its proponents only wanted a process-to-process communication scheme because they were fond of microkernel designs which turned out to be less optimally feasible on general purpose CPUs.

The result is that Linux is in effect not just a monolithic kernel with a low overhead system call API, it is an agglomeration of a number of specialized monolithic kernels with their own more or less low overhead APIs; what all these kernels share is some minimal low level infrastructure, mostly memory, interrupt and lock management.