Software and hardware annotations 2007 July

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


070705 Thu The Linux CPU governors don't look at system CPU time

My backups, which are disk-to-disk (thus my interest in elevators and smooth dirty page flushing), have been really slow recently. I would not normally notice because I run them during the evening, but I was surprised to see the disk lights on this morning, and that was because my nightly backup job was lasting too long. The reason was that it was running rather slower than usual because of very high CPU utilization. On a guess I switched the copy program to use O_DIRECT (which miraculously seems to actually work as advertised since some recent Linux version, instead of having negligible effect) and CPU time dropped considerably, as the Linux page cache is so slow that IO can become CPU bound, but there were two additional complications that made the system rather unresponsive:

The running frequency of my Athlon 64 3000+ CPU can be switched between 1,000 and 2,000 MHz, and it appears that the conservative governor completely ignores system CPU time when computing the speed at which the CPU should run, with the effect that it was running at 1,000MHz even when it was 95% busy; also, system CPU time tends to crowd out user processes.
The vm/drop_caches kernel parameter was still set to 1 after some recent experiments, which may have been having a continuing effect, keeping the cache constantly flushing.

Anyhow the fix was to switch the governor to the performance one, in part because with O_DIRECT it takes a largish block size, like 256KiB, to get good performance (like a 60MiB/s disk-to-disk copy). But on second thought, the Linux cache gets really swamped without O_DIRECT because file access pattern advising still seems unimplemented, and then O_DIRECT with a large block size is the lesser of two evils.
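For illustration, here is a minimal sketch of a large-block O_DIRECT copy, in Python rather than whatever the copy program above actually uses. The function name copy_file, the 4096-byte alignment figure and the way the final short block is handled are assumptions of this sketch, not details of the backup job described here; the real alignment requirement depends on the device and filesystem.

```python
import fcntl
import mmap
import os

BLOCK_SIZE = 256 * 1024   # the "largish" block size mentioned above
ALIGNMENT = 4096          # assumed; depends on device and filesystem


def copy_file(src, dst, block_size=BLOCK_SIZE, direct=True):
    """Copy src to dst in large blocks, optionally bypassing the page cache.

    Returns the number of bytes copied.
    """
    extra = os.O_DIRECT if direct else 0
    # An anonymous mmap is page-aligned, which O_DIRECT needs for its buffer.
    buf = mmap.mmap(-1, block_size)
    view = memoryview(buf)
    fin = os.open(src, os.O_RDONLY | extra)
    fout = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra, 0o644)
    copied = 0
    try:
        while True:
            n = os.readv(fin, [buf])
            if n == 0:
                break
            if direct and n % ALIGNMENT:
                # Short (normally final) block: clear O_DIRECT on the output
                # so that a write of unaligned length is accepted.
                flags = fcntl.fcntl(fout, fcntl.F_GETFL)
                fcntl.fcntl(fout, fcntl.F_SETFL, flags & ~os.O_DIRECT)
            done = 0
            while done < n:
                done += os.write(fout, view[done:n])
            copied += n
    finally:
        os.close(fin)
        os.close(fout)
        view.release()
        buf.close()
    return copied
```

Calling copy_file(src, dst, direct=False) gives the ordinary page-cache path with the same block size, which makes it easy to compare CPU time between the two modes. Some filesystems (tmpfs for example) reject O_DIRECT altogether.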

070704 Wed Flushing unmodified cached pages under Linux

I only recently discovered that since Linux version 2.6.16 (released over a year ago) there is a way to flush unmodified cached pages under Linux, by setting the kernel parameter vm/drop_caches to non-zero. This is very welcome as it helps in several cases, for example IO benchmarking, as the (ancient and traditional) alternative is to unmount a filesystem and remount it, which is not always convenient (note that -o remount does not have the same effect, as it really just changes options and does not effect a real remount, except in special cases).
In theory the BLKFLUSHBUFS ioctl(2) (issued by the blockdev --flushbufs or hdparm -z commands) should do the same, or at least it is misunderstood by many to be supposed to do so, but it does not, being instead a per-block-device sync. Anyhow BLKFLUSHBUFS would apply only to a block device, while one might want to flush the unmodified buffers of a filesystem that is not on a block device, for example an NFS one.
The umount and mount pair is less convenient than vm/drop_caches but more selective, as it applies to a single filesystem, while vm/drop_caches applies to all unmodified cached pages. Note that vm/drop_caches does not apply to modified (dirty) pages, being the exact complement to sync.
Finally, it feels very wrong that this action (or many others like SCSI bus rescanning) is invoked by setting a variable, instead of by issuing a command wrapping a system call; something like BLKFLUSHBUFS would have been a better approach.
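As a small sketch of how this is typically used before an IO benchmark: sync first, since dirty pages are not dropped, then write the value to the parameter. The function name and the path argument (there only so that the sketch can be pointed at another file) are my own choices for illustration; writing to the real file requires root.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_clean_caches(what=1, path=DROP_CACHES):
    """Write out dirty pages, then ask the kernel to drop clean cached data.

    what: 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if what not in (1, 2, 3):
        raise ValueError("what must be 1, 2 or 3")
    # drop_caches only discards unmodified pages, so flush dirty ones first.
    os.sync()
    with open(path, "w") as f:
        f.write(f"{what}\n")
```

This is equivalent to running sync followed by sysctl vm/drop_caches=1 from a root shell.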

070701b Sun Disappointing Linux NFSv3 writing misfeature and workaround

My previous entry and my interest in the Linux flusher parameters arose from an investigation of unsatisfactory write performance during backups, and more recently when using the Linux NFS client and server. The main issue is that on contemporary high end clients and servers with GNU/Linux (in this particular case RHEL4) one can only write over NFS at around 25MB/s where the server's disks can do around 70-80MB/s, and the underlying 1Gb/s LAN is otherwise quiet and can easily do 110MB/s transfer rates with less than 0.5ms packet latencies. The network traffic profile of one such write, captured with Wireshark, clearly shows a pause of a couple of seconds every couple of seconds, in other words some kind of stop-and-go behaviour, which usually indicates a half-duplex session, usually caused by some sort of congestion, though not network congestion as such.
The first analysis showed that writes with an async NFS export over UDP went at wire speed, indicating that either or both of TCP and sync export contributed to the measured slowdown. As to TCP, it is well known that until recent versions of the kernel its default tuning parameters were set to suit nodes with small memory and at most 100Mb/s connections. Changing the IP and TCP tuning parameters in /etc/sysctl.conf to some more suitable values like:

# Mostly for the benefit of NFS.
#   http://WWW-DIDC.LBL.gov/TCP-tuning/linux.html
#   http://datatag.web.CERN.CH/datatag/howto/tcp.html

net/ipv4/tcp_no_metrics_save	=1

# 2500 for 1Gb/s, 30000 for 10Gb/s.
net/core/netdev_max_backlog	=2500
#net/core/netdev_max_backlog	=30000  

# Higher CPU overhead but higher protocol efficiency.
net/ipv4/tcp_sack 		=1
net/ipv4/tcp_timestamps		=1
net/ipv4/tcp_window_scaling	=1

net/ipv4/tcp_moderate_rcvbuf	=1

# This server has got 8GiB of memory mostly unused.
net/core/rmem_default 		=1000000
net/core/wmem_default 		=1000000
net/core/rmem_max 		=40000000
net/core/wmem_max 		=40000000
net/ipv4/tcp_rmem		=40000 1000000 40000000
net/ipv4/tcp_wmem		=40000 1000000 40000000

# Probably not necessary, but may be useful for NFS
# over UDP.
net/ipv4/ipfrag_low_thresh	=500000
net/ipv4/ipfrag_high_thresh	=2000000

made transfers with async export over TCP work almost as fast as with UDP. So far I prefer NFS over UDP for reliable LANs, given that the Linux NFS client driver does not recover properly from session problems with the server, as imprecisely described in the Linux NFS-HOWTO:

The disadvantage of using TCP is that it is not a stateless protocol like UDP. If your server crashes in the middle of a packet transmission, the client will hang and any shares will need to be unmounted and remounted.

but the ability to tune TCP for NFS to give almost the same performance of UDP is nice.
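To put the buffer sizes above in perspective, here is a rough bandwidth-delay product calculation for the LAN described earlier. It treats the "less than 0.5ms" latency as a round-trip time, which is an assumption on my part, and the function name is just for illustration.

```python
def bdp_bytes(bits_per_second, rtt_microseconds):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bits_per_second * rtt_microseconds // 8_000_000


lan_bdp = bdp_bytes(1_000_000_000, 500)   # 1Gb/s, 0.5ms assumed round trip
default_buffer = 1_000_000                # rmem_default/wmem_default above

print(lan_bdp)                    # 62500
print(default_buffer // lan_bdp)  # 16
```

So on a quiet LAN even the default buffer set above is an order of magnitude larger than the pipe; the much larger maxima only matter for longer or faster paths.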
The bigger problem is using async exports. As is well known, async on the server side violates the semantics of NFS and of the UNIX/Linux-like filesystem API, as the client is told that data has been committed to disk when it has not, in order to prevent pauses while the data is being flushed out. To be sure, the filesystem is of course usually mounted with async; the issue here is whether it is exported from the server with sync or async, as programs running on the client can always explicitly request synchronous writing on the mounted filesystem, but cannot override the async option on the server.
But in theory, for the large files (80MB to 8GB) I was writing as a test, the sync export option should not give performance different from async, as I am using NFS version 3 and allegedly it allows doing delayed writes even when the server is in sync mode, as the client NFS driver (transparently to the application) can explicitly request flushing on the server when needed (and the server can refetch from the client data that could not be written):

Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.

NFS Version 3 asynchronous writes eliminate the synchronous write bottleneck in NFS Version 2. When a server receives an asynchronous WRITE request, it is permitted to reply to the client immediately. Later, the client sends a COMMIT request to verify that the data has reached stable storage; the server must not reply to the COMMIT until it safely stores the data.
Asynchronous writes as defined in NFS Version 3 are most effective for large files. A client can send many WRITE requests, and then send a single COMMIT to flush the entire file to disk when it closes the file.

But the write rates I observed with sync export were still half those with async export (instead of one third as before changing the IP and TCP parameters), indicating some remaining stop-go behaviour, so I did another network trace of an NFS session (printed with tcpdump -ttt to get inter-packet times) and I noticed some crucial moments:

The start of the writing, nothing special.
Just an in-progress REPLY from the server (to the WRITE UNSTABLE at the beginning), with no huge delays (117 microseconds). At that point the file size is 32KiB (sz 0x8000).
The client side sends a COMMIT for the first 512KiB, presumably as it wants to get rid of them from its page cache, since these have probably long since been written, and then starts a new WRITE UNSTABLE from 45MiB (32768 bytes @ 0x2b0c000), which therefore immediately gets a REPLY; then around 400 packets later there is a huge 1.2s delay, and not much thereafter. Evidently in that 1.2s period the server has executed the COMMIT, and for the whole 45MB outstanding.
Another COMMIT request from the client for 8MiB at 512KiB (8159232 bytes @ 0x80000), as evidently the client wants to free up the next 8MiB, and for 32KiB at 8.5MiB (32768 bytes @ 0x848000), which is the next block, while there is a new WRITE UNSTABLE at 80MiB (32768 bytes @ 0x4f14000), then around 300 packets later another huge 1.3s delay, which probably means that the server has actually done a COMMIT to 80MiB instead of the requested 8.5MiB.

So my speculation at this point was that:

The Linux NFS client issues COMMITs well before the end of the file because it needs to flush to reclaim page cache memory.
The client then stops writing to the server while waiting for the reply to the COMMIT, as without that it cannot reuse the existing unflushed cached pages.
The server ignores the specific region mentioned in the COMMIT requests, and flushes all the modified blocks received so far.

Indeed by looking at the source for Linux 2.6.9 (from RHEL4) and 2.6.21 (the latest released):

In 2.6.21 the NFS client code always sets the region for COMMIT to 0-0 (which means everything by convention).
In both 2.6.9 and 2.6.21 the NFS server code flushes the whole file on COMMITs irrespective of the region specified in it.

In other words COMMIT is equivalent to fsync and whenever any of the data cached by the NFS client must be flushed all the data received so far by the NFS server gets written to disk, which is hardly better than NFS version 2-style synchronous writing on the server, especially as the NFS client flushes whenever it needs to reclaim some unwritten file page, not just at the end, and it only keeps a few dozen megabytes unflushed at any time.
So, we should really use async on the server, and likely the COMMIT range has never been implemented because server side async is how people get write performance, and then use battery backed servers. Indeed in my situation the server has a RAID host adapter with a huge memory cache, so it must be battery backed anyhow. However some people still would rather use sync exports, so I wanted to see if sync exports could be improved.
In theory the NFS client could continue writing after sending a COMMIT, and just remember the outstanding commit and, when the NFS server responds, mark just the data sent before the COMMIT as flushed. However as the Linux NFS FAQ elliptically says:

The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect.

and the Linux NFS client does not do that, as illustrated above, and at some point synchronously waits for the response to the COMMIT, largely because it sends it when its cache is full (instead of for example periodically, or when it is half full), and it is these pauses that reduce write performance.
However if at the time the COMMIT is sent the NFS server has already flushed all the pages sent to it, then it can reply almost immediately to the client. One way to do this is to use the export option sync of course, but this involves too frequent waits for a reply, and the session becomes essentially half-duplex. So the server should be flushing the pages it receives from NFS clients asynchronously.
The reason why the server does not do it is that by default the Linux kernel flushing dæmon runs fairly rarely, and in particular rather less frequently than an NFS client sends a COMMIT. This is because its tuning parameters are set rather loosely, and allow, depending on circumstances, dozens of MBs to several GBs of modified pages to be cached in memory unflushed, to be written then all at once, something that causes further problems both to NFS clients and to interactive programs.
So I experimented a bit and I found that by changing the flusher tuning parameters I can ensure that the flushing dæmon writes out the pages it receives from the NFS server in a continuous and smooth way, and without greatly increasing the amount of CPU time it consumes:

vm/dirty_ratio                  =40
vm/dirty_background_ratio       =2
vm/dirty_expire_centisecs       =400
vm/dirty_writeback_centisecs    =200

These parameters should be tightened also on the NFS client system, as it helps to have the pages written by the application flushed and sent over the network to the NFS server in a continuous and smooth way.
The overall result is that on a server that can write to a local filesystem at 75MB/s one can write to the same over NFS at around 67MB/s with the async option and around 57MB/s with the sync option. The async option may be good enough anyhow, because thanks to the flushing dæmon parameters above the NFS server, instead of accumulating 300-600MB of modified pages and then writing them out at once (the disk is attached to a host adapter with a large RAM cache), writes a steady stream of modified pages at around 60-70MB/s, with a few seconds of delay with respect to the NFS client. This minimizes the window of vulnerability to crashes, giving almost the same safety as sync, nearly as if the NFS client were indeed doing incremental COMMITs (as it should...). The effect would probably be sufficient for the application I have been tuning this server for without having to resort to sync, even if the NFS server did not have battery backup.
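One simple way to check whether flushing really is steady rather than bursty is to sample the Dirty and Writeback lines of /proc/meminfo while a transfer runs. The sketch below is only an illustration of that idea, not the method used for the measurements above; the function name is hypothetical.

```python
def dirty_state(meminfo_text):
    """Return (dirty_kib, writeback_kib) parsed from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the value; "kB" may follow.
            fields[name.strip()] = int(parts[0])
    return fields["Dirty"], fields["Writeback"]


# Typical use, once per second during a large write:
#
#     with open("/proc/meminfo") as f:
#         print(dirty_state(f.read()))
```

With loose flusher settings the Dirty figure would be expected to climb to hundreds of MB and then collapse; with the tighter settings above it should stay small and roughly constant.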

070701 Sun Outrageous Linux memory management misfeatures

Well, I have written several draft entries during June, but because of frantic work activity I haven't been able to post them here yet. However I have just done some work-related tuning in which I have discovered some more outrageous aspects of Linux memory management (even if less abysmal than the vm/page-cluster story). The most recent one is this code from mm/page-writeback.c:get_dirty_limits():

        dirty_ratio = vm_dirty_ratio;
        if (dirty_ratio > unmapped_ratio / 2)
                dirty_ratio = unmapped_ratio / 2;

        if (dirty_ratio < 5)
                dirty_ratio = 5;

        background_ratio = dirty_background_ratio;
        if (background_ratio >= dirty_ratio)
                background_ratio = dirty_ratio / 2;

        background = (background_ratio * available_memory) / 100;
        dirty = (dirty_ratio * available_memory) / 100;
        tsk = current;
        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
                background += background / 4;
                dirty += dirty / 4;
        }
        *pbackground = background;
        *pdirty = dirty;

which embodies a large number of particularly objectionable misfeatures, both strategic and tactical. The most damning is that the amount of unflushed memory allowed outstanding is set as a function of memory available, which is laughable, as it should be a function of disk speed; in other words it should be set as a number of pages, not as a percentage of pages available (in other words background and dirty should be the parameters to be set directly).
Consider two systems, both with 50MiB/s disks, one with 8GiB of memory and the other with 512MiB of memory, and both with 10% of memory allowed to become dirty before it is flushed: on the first system that is about 800MiB, or 16s of disk IO, and on the second about 50MiB, or about 1s of IO. That's a very big difference in the window of vulnerability and the amount of data lost in case of a crash. But the details are also based on laughable misunderstandings, for example:

For some reason the dirty_background_ratio should not be larger than the dirty_ratio. Fine, but why halve it? And why do so if it is equal to it? This means that if dirty_ratio is 10 and dirty_background_ratio is 9 it is not changed, but if dirty_background_ratio is 11 it is reset to 5, not to 10. Indeed it is reset to 5 even if it is set to 10, the same as dirty_ratio.

Even worse, the code can change the value of both dirty_ratio and dirty_background_ratio, but does so invisibly. That is:

base# sysctl vm/dirty_ratio=8 vm/dirty_background_ratio=8
vm.dirty_ratio = 8
vm.dirty_background_ratio = 8
base# sysctl vm/dirty_ratio vm/dirty_background_ratio
vm.dirty_ratio = 8
vm.dirty_background_ratio = 8

even if the effective value of dirty_background_ratio is 4, as by being equal to dirty_ratio it has been set to half its value.
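The clamping in the quoted excerpt, and the memory-size example before it, can be reproduced with a small model. This is a sketch of the arithmetic only, written in Python; unmapped_ratio is passed in as a plain number (defaulting to 100, an assumption), and the function names are mine.

```python
def effective_ratios(dirty_ratio, background_ratio, unmapped_ratio=100):
    """Mirror the ratio clamping in the quoted get_dirty_limits() excerpt."""
    dirty = min(dirty_ratio, unmapped_ratio // 2)
    dirty = max(dirty, 5)
    background = background_ratio
    if background >= dirty:
        background = dirty // 2
    return dirty, background


def flush_window_seconds(memory_mib, dirty_percent, disk_mib_per_s):
    """Seconds of disk IO represented by the allowed amount of dirty memory."""
    return memory_mib * dirty_percent / 100 / disk_mib_per_s


print(effective_ratios(8, 8))     # (8, 4)
print(effective_ratios(10, 9))    # (10, 9)
print(effective_ratios(10, 11))   # (10, 5)
print(effective_ratios(10, 10))   # (10, 5)
print(round(flush_window_seconds(8192, 10, 50), 1))   # 16.4
print(round(flush_window_seconds(512, 10, 50), 1))    # 1.0
```

The first line is the sysctl case shown above: both values read back as 8, but the background threshold actually used is 4.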
Less shoddy code might look like:

/*
 * Work out the current dirty-memory clamping and background
 * writeout thresholds.
 *
 * If the numbers are greater than 100 they are taken to be
 * directly number of pages, else percentages of available
 * lowmem pages.
 *
 * We try to bound the resulting number of pages so that there
 * can be a minimum number of pages before the writing processes
 * or the flusher start writing out, and so that the flusher
 * activation threshold is not larger than the process
 * synchronous write one.
 */
static void
get_dirty_limits(long *const pbackground, long *const pdirty,
		 const struct address_space *const mapping)
{
#ifdef CONFIG_HIGHMEM
	/* Take only lowmem into account */
	const long unsigned available_pages = vm_total_pages - totalhigh_pages;
#else
	const long unsigned available_pages = vm_total_pages;
#endif
	
	const long unsigned unmapped_pages = vm_total_pages
		- global_page_state(NR_FILE_MAPPED)
		- global_page_state(NR_ANON_PAGES);

	/*
	 * If the value of the '/proc/sys' setting is higher
	 * than 100 it is not a percentage but a number of pages
	 * directly.
	 */
	const long unsigned vm_dirty_pages
		= (vm_dirty_ratio > 100L) ? vm_dirty_ratio
		: +(vm_dirty_ratio*available_pages)/100L;
	const long unsigned vm_background_pages
		= (dirty_background_ratio > 100L) ? dirty_background_ratio
		: +(dirty_background_ratio*available_pages)/100L;

	/*
	 * We leave at least 8 pages unflushed, with an upper
	 * limit of 50% of unmapped pages for the process
	 * synchronous writing threshold, or of that threshold
	 * for the flusher threshold.
	 */
	const long unsigned dirty_pages
		= min(max(8,vm_dirty_pages),unmapped_pages/2);
	const long unsigned background_pages
		= min(max(8,vm_background_pages),dirty_pages);

	/*
	 * Reset the '/proc/sys' variables to the actual values
	 * computed here.
	 */
#if 0
	vm_dirty_ratio = +(dirty_pages*100L)/available_pages;
	dirty_background_ratio = +(background_pages*100L)/available_pages;
#else
	vm_dirty_ratio = (dirty_pages > 100L) ? dirty_pages
		: +(dirty_pages*100)/available_pages;
	dirty_background_ratio = (background_pages > 100L) ? background_pages
		: +(background_pages*100L)/available_pages;
#endif

	*pdirty = (long) dirty_pages;
	*pbackground = (long) background_pages;
}
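The arithmetic of the proposal above can be tried out in user space with a small Python model. It follows the same rules (values above 100 are page counts, otherwise percentages; a floor of 8 pages; dirty capped at half the unmapped pages; background capped at dirty), but it is only a sketch: the function name and the page counts in the example are made up for illustration.

```python
def proposed_limits(vm_dirty, vm_background, available_pages, unmapped_pages):
    """Return (dirty_pages, background_pages) under the proposed rules."""

    def to_pages(setting):
        # Settings above 100 are page counts, otherwise percentages.
        if setting > 100:
            return setting
        return setting * available_pages // 100

    dirty = min(max(8, to_pages(vm_dirty)), unmapped_pages // 2)
    background = min(max(8, to_pages(vm_background)), dirty)
    return dirty, background


# Percentages: 10% and 5% of 100000 available pages.
print(proposed_limits(10, 5, 100000, 100000))          # (10000, 5000)
# Page counts: background is capped at dirty, not halved.
print(proposed_limits(20000, 30000, 100000, 100000))   # (20000, 20000)
```

Unlike the original code, a background setting equal to or above the dirty one is simply brought down to it, and setting both directly in pages makes the thresholds independent of memory size.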