This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
One obvious point that I should have mentioned in the previous entry about OCFS2 is that while it is pretty good as a standalone filesystem, it also offers the option of sharing a filetree among systems in the future, in particular for genuinely high-availability applications with redundant storage media rather than merely shared storage, for example with DRBD:
The Oracle Cluster File System, version 2 (OCFS2) is a concurrent access shared storage file system developed by Oracle Corporation. Unlike its predecessor OCFS, which was specifically designed and only suitable for Oracle database payloads, OCFS2 is a general-purpose filesystem that implements most POSIX semantics. The most common use case for OCFS2 is arguably Oracle Real Application Cluster (RAC), but OCFS2 may also be used for load-balanced NFS clusters, for example.
Although originally designed for use with conventional shared storage devices, OCFS2 is equally well suited to be deployed on dual-Primary DRBD. Applications reading from the filesystem may benefit from reduced read latency due to the fact that DRBD reads from and writes to local storage, as opposed to the SAN devices OCFS2 otherwise normally runs on. In addition, DRBD adds redundancy to OCFS2 by adding an additional copy to every filesystem image, as opposed to just a single filesystem image that is merely shared.
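On the practical side, once a dual-primary DRBD resource is up, putting OCFS2 on top reduces to a handful of commands. A minimal sketch, assuming a DRBD resource named r0 with allow-two-primaries set and an already configured and running O2CB cluster on both nodes (all names hypothetical):

# on both nodes: promote the resource to primary
drbdadm primary r0
# on one node only: create the shared filesystem with two node slots
mkfs.ocfs2 -N 2 -L shared /dev/drbd0
# on both nodes: mount the same filetree concurrently
mount -t ocfs2 /dev/drbd0 /fs/shared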
Another application is with shared, non-redundant storage, that is some kind of SAN layer, which can be purchased ready-made or built using a Linux host running some iSCSI dæmon (1, 2).
It is noticeable how much picture quality improves when a monitor fed a video signal via a traditional analog VGA link is well synchronized with it. All the more so, I guess, with LCDs, as their rigid pixel geometry means that a poorly synchronized signal maps onto the panel particularly badly.
Contemporary monitors tend to have both manual and automatic synchronization (position, phase and clock) adjustments, and since manual synchronization usually takes time, automatic is the much easier option. However it sometimes does not reach an optimal result.
In a very useful monitor testing suite by Lagom there is a page with a test specifically for good synchronization, which uses a high-contrast, high-frequency pattern to give a visual indication of how well synchronized the video signal is to the LCD geometry. The pattern suffers from obvious spatial aliasing if synchronization is poor.
I suspect that autosynchronization works by maximizing something like contrast in the signal: I noticed that autosync seemed to do better when I had terminal windows open covering the screen, and then wondered whether autosyncing with that test pattern on screen would help it converge. Indeed it does, and using the test pattern as a tile over a large part of the screen (such as the background) helps autosync considerably (both in the time needed and in the resulting quality) on several monitors that I have used.
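One easy way to do that under X is to tile the Lagom pattern over the root window before triggering the monitor's auto-adjust function. A small sketch, assuming the test image has been saved locally as lagom-pixel-clock.png (a hypothetical filename) and that feh is installed:

# tile the synchronization test pattern over the whole desktop background
feh --bg-tile lagom-pixel-clock.png
# now press the monitor's "auto adjust" button, then restore the usual background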
I have tried a number of filesystems over time and my favourites have always been XFS and JFS because of their blend of good features and scalable performance, but there are a few others that are interesting.
These do not include the currently popular favourites: not ext3 and ext4, because too many compromises have been made in their design for the sake of backwards compatibility, and not BTRFS, because it is still a bit immature and its copy-on-write design is a relative novelty.
Unfortunately interesting filesystems like ReiserFS and Reiser4 are less well maintained than others.
A filesystem with rather interesting qualities, and which is still well maintained with the support of a large sponsor like Oracle, is OCFS2: it has a sound traditional design and supports the recent kernel features that have been added to ext4 and XFS (but not, for example, to JFS), such as barriers, TRIM or FITRIM, checksums (metadata only), a form of snapshots, and space reservation and release.
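As a small aside, the FITRIM support means that an OCFS2 filetree on a flash SSD can be trimmed with the usual util-linux tool. A minimal sketch, assuming a kernel that supports it (the mount point is hypothetical):

# discard unused blocks on a mounted filetree that supports FITRIM
fstrim -v /fs/ocfs2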
Its main design goal was to be a cluster shared filesystem, but it also works pretty well in standalone mode. I have also found among my notes some very simple performance tests among OCFS2, ext4, XFS and JFS on an old spare test machine which is somewhat slow by contemporary standards. The top hard disk speed is:
# hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads:  160 MB in  3.01 seconds =  53.07 MB/sec
The test is a dump via tar of a GNU/Linux root filetree, that is one with a large number of small files and some bigger ones; the filetree is in a partition not at the beginning of the disk, and the results, in order of speed, are:
ext4 with 4KiB inodes:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35900+0 records in
35900+0 records out
9410969600 bytes (9.4 GB) copied, 677.7 seconds, 13.9 MB/s
real    11m18.762s
user    0m3.139s
sys     1m1.893s

ext4 with 1KiB inodes:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -cS -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35582+0 records in
35582+0 records out
9327607808 bytes (9.3 GB) copied, 620.853 seconds, 15.0 MB/s
real    10m20.895s
user    0m3.387s
sys     1m1.026s

ext4 with 256B inodes:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 579.691 seconds, 16.2 MB/s
real    9m40.770s
user    0m3.060s
sys     0m59.262s

XFS:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35907+0 records in
35907+0 records out
9412804608 bytes (9.4 GB) copied, 413.755 seconds, 22.7 MB/s
real    6m54.763s
user    0m3.014s
sys     1m3.816s

OCFS2:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 368.144 seconds, 25.6 MB/s
real    6m9.149s
user    0m3.204s
sys     1m27.334s

JFS:
# sysctl vm/drop_caches=3
vm.drop_caches = 3
# time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 301.641 seconds, 31.2 MB/s
real    5m2.679s
user    0m3.051s
sys     0m59.131s
A simple performance test like this is not that rich in information, in particular because the filetree is freshly loaded and that is not a representative situation, but I think that it is somewhat indicative.
I also think that OCFS2 is going to be supported by Oracle for a long time, as it is very popular among users of Oracle Database, and in particular for its Real Application Cluster (RAC) configurations.
It is a bit worrying that
LCD panels
have been manufactured at a loss for years
because it means that rather unsurprisingly the LCD panel
industry is similar to the
RAM industry,
with very costly factories having long build cycles, and thus
creating what economists call a
hog cycle
with periods of high profits when new factories are being
built and thus supply is scarce, and of large losses when new
factories start operating and supply is ample.
The worry is that when supply is overabundant the cycle can become extreme as suppliers completely exit the market, and eventually supply becomes too scarce as new factories stop being built. Also excessive investment in factories for one technology can discourage investment in factories for a better technology, and several are indeed possible, like OLED panels.
But for the time being extreme demand from high end mobile
telephones and tablet
buyers should help
LCD manufacturers decide to stay in the LCD business, and
in the short term one can have really
impressive LCD monitor bargains.
While reading the current issue of PC Pro magazine I found the review of the Boston Quattro 1332-T chassis quite interesting. The idea is to put 2 power supplies and 8 single-socket server nodes in a 3U chassis, each node with a Xeon E3 series chip; these sit at the lower-speed but also lower-power end of the full Xeon spectrum, yet are still quite fast. Indeed the power consumption figures are particularly interesting:
As a baseline, we measured the chassis with all nodes turned off consuming only 15W. With one node powered up we measured usage with Windows Server 2008 R2 in idle settling at 102W. Running the SiSoft Sandra benchmarking app saw consumption peak at 138W.
We then powered up more nodes and took idle and peak measurements along the way. In idle we saw two, four and eight nodes draw a total of 128W, 154W and 222W; under load these figures peaked at 182W, 289W and 512W. These are respectable figures, equating to an average power draw per node of only 28W in idle and 64W under heavy load.
Especially considering the cost of supplying power and cooling those are remarkable and attractive numbers.
Boston, like most other smaller server suppliers (such as the familiar Transtec and Viglen), often designs its products around SuperMicro or TYAN building blocks, and these have included in the recent past some similar solutions, such as 1U chassis with 2 dual-socket server nodes, which works out at 6 servers per 3U instead of 8 servers (but 12 sockets instead of 8), or 4U chassis with 18 single-socket server nodes. In all these the configuration of each server node is not quite the same, and the Boston 1332-T mentioned earlier seems to have a particularly rich configuration, even if it accepts only Xeon E3 series chips, while the 1U chassis with 2 server nodes can usually take more powerful Xeon chips.
The attraction of all these alternatives is that they are designed for a devirtualization strategy, in which sites that have found out how huge the overheads, bugs and administrative complications of virtual machines are for storage- and network-intensive workloads can undo that mistake, while still partitioning the workload into different domains, for example for higher resilience.
The alternative devirtualization strategy is to use servers with multiple processor sockets and chips with many CPUs, like the recent 12×CPU chips and 16×CPU chips from AMD, which are awesome for highly multithreaded applications that are uneconomical to partition, like web servers accessing a common storage backend, or for embarrassingly parallel HPC applications like Monte Carlo simulations, which are popular in finance and high-energy physics.
Continuing the discussion of the rather complicated subject of the performance envelope of flash based SSDs, there are two aspects of it that are rarely looked at in reviews, and they are related: concurrent reading and writing, and maximum operation latency. They are related because several FTL firmwares perform very badly when doing concurrent reads and writes (which really means closely interleaved ones, as the interface into a flash SSD is as a rule a single channel, such as USB or SATA), and this manifests itself as very high operation latencies.
As to latencies, there is a fascinating graph from a recent review of several contemporary flash SSD products which shows maximum latencies of over 60 milliseconds for several of them on random small writes (and negligible ones on random small reads).
There are also other tests that show that certain flash SSDs perform well under random and sequential access patterns as long as these are kept distinct and not interleaved, but then performance falls considerably if reads and writes are interleaved, as they would be in most realistic usage patterns.
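This kind of behaviour can be probed directly with fio using a mixed random read/write job; a small sketch, where the target path, file size and run time are just assumptions to be adapted:

# 70% random reads, 30% random writes, 4KiB blocks, 32 outstanding requests
fio --name=mixed --filename=/fs/ssd/fio.tmp --size=4g \
    --direct=1 --ioengine=libaio --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting

Comparing the latency percentiles it reports with those of pure randread and pure randwrite runs shows how much interleaving costs on a given FTL.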
My guess is that the two phenomena are related, and in particular because of cleaning operations that read flash pages belonging to separate, not-full flash blocks and concatenate them, in order to keep some flash blocks entirely empty and thus pre-erased.
The resulting performance levels are still usually very much better (2-3 orders of magnitude better), just as for random pure reads and pure writes, than for rotating storage devices, as for the latter positioning is so expensive that it overshadows the cost of dealing with erasing flash blocks.
Yet in both cases above there is a reflection of the cost of having the FTL perform housekeeping in the background, and of having to rely on the ability of the FTL authors to provide a less unbalanced performance envelope; some authors seem to aim for that, and others for higher peaks. Part of the reason why I chose the Crucial M4 is that various tests seem to show that the authors of its FTL aimed for a less unbalanced performance envelope, which means fewer surprises. Anyhow, since my laptop does not have a SATA3 interface, the possible performance peaks are not accessible.
Being a major business supplier, Dell tend to have somewhat better documentation than most, and I was happy to find on their site a fairly reliable and interesting introduction to flash SSD drives and the differences between high end and low end ones: Dell™ Solid State Disk (SSD) Drives – High Performance and Long Product Life, and an even more interesting Solid State Drive (SSD) FAQ which has some points about little known aspects of flash SSDs. The first point was news to me, even if in retrospect it is quite understandable (a lot of what suppliers do is about minimizing warranty returns):
5. Why I might notice a decrease in write performance when I compare a used drive to a new drive?
SSD drives are intended for use in environments that perform a majority of reads vs. writes. In order for drives to live up to a specific warranty period, MLC drives will often have an endurance management mechanism built into the drives. If the drive projects that the useful life is going to fall short of its warranty, the drive will use a throttling mechanism to slow down the speed of the writes.
But there are or were technical reasons why writes could become slower over the life of the drive: especially in the absence of TRIM style commands, all flash blocks could become partially used, requiring read-modify-erase-write cycles on every update. The second point surprised even me, as I was aware that flash memory weakens with time, but not that quickly:
6. I have unplugged my SSD drive and put it into storage. How long can I expect the drive to retain my data without needing to plug the drive back in?
It depends on the how much the flash has been used (P/E cycle used), type of flash, and storage temperature. In MLC and SLC, this can be as low as 3 months and best case can be more than 10 years. The retention is highly dependent on temperature and workload.
NAND Technology    Data Retention @ rated P/E cycle
SLC                6 months
eMLC               3 months
MLC                3 months
Data Retention:
Data retention is the timespan over which a ROM remains accurately readable. It is how long the cell would maintain its programmed state when the chip is not under power bias. Data retention is very sensitive to number of P/E cycle put on the flash cell and also dependent on external environment. High temperature tends to reduce retention duration. Number of read cycles performed can also degrade this retention.
This of course also applies to flash-card or stick pocket SSDs, but I suspect that they have some capacitor providing a trickle current to keep the data refreshed. Otherwise it would explain why many people have difficulty rereading photographs etc. from pocket SSDs left on a shelf for a long time. Still, most issues with them are about the limited number of erase cycles, as most pocket flash SSDs eventually fail because of erase cycle damage (they tend not to have overprovisioning unless in USB form factor and from a good supplier). But it would still be a good idea in general to keep flash memory devices plugged in so they have power to allow (if possible) refreshing of decaying memory charges.
Anyhow, magnetic recording storage like conventional hard drives is not going to lose its primacy as backup storage, and not just because I can still read without problems ATA disks I had many years ago, but also because they are much cheaper and perform equally well in writing as in reading, and pretty well in absolute terms in streaming mode (with limited seeking).
Another interesting note concerns queue depth, which was seen to have an important effect on efficiency (along of course with block sizes) in some previously mentioned SSD tests:
- Varying Queue Depths: Queue depth is an important factor for systems and storage devices. Efficiencies can be gained from increasing queue depth to the SSD devices which allow for more efficient handling of write operations and may also help reduce write amplification that can affect the endurance life of the SSD.
The reasoning seems to be that a higher queue depth helps provide larger amounts of data to buffer in the somewhat large onboard buffer cache the FTL is given (256MiB for my Crucial M4), and thus better opportunities to arrange the data to be written into large aggregates, thus reducing the number of erase cycles.
But precisely because the onboard buffer cache is fairly large, I am somewhat skeptical that writing is really going to be improved by higher queue depths rather than simply by larger write transactions or just closely spaced ones. It may instead be improving random small reads, and indeed in the previously mentioned test the read rate for 4KiB blocks with 32 threads reading is three times higher than with a single thread. Which however is a bit surprising, as there are no erase issues with reads, which leads me to think that there is some kind of latency (perhaps in the MS-Windows IO code) that impacts single threaded random reads.
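On Linux the relevant depths can at least be inspected (and to a degree tuned) via sysfs; a small sketch, with the device name being hypothetical:

# NCQ depth negotiated with the device (SATA NCQ maxes out at 32)
cat /sys/block/sda/device/queue_depth
# block layer request queue size above it
cat /sys/block/sda/queue/nr_requests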
As to flash SSD endurance, I have been running for a day a command to report every four hours the amount of writing performed to my laptop flash SSD:
# date; iostat -dk sda $((3600*4))
Thu Jan 12 10:32:41 GMT 2012
Linux 3.0.0-14-generic (tree.ty.sabi.co.UK)  12/01/12  _x86_64_  (2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               2.75        20.64        28.48    3005808    4147957

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.67         4.67        33.35      67180     480226

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              21.78       115.60        60.75    1664696     874750

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.53        28.51        63.55     410512     915110

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               5.74        91.91        65.65    1323452     945326

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.33         0.02         4.66        252      67152

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               2.30        13.54        22.49     194956     323885
^C
tree# uptime; iostat -dk sda
 13:05:17 up 2 days, 18:59,  9 users,  load average: 0.03, 0.04, 0.05
Linux 3.0.0-14-generic (tree.ty.sabi.co.UK)  13/01/12  _x86_64_  (2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.85        27.89        33.48    6727568    8075561
The first report is the total since boot up to that point, and each of the next six is for the preceding four hours. The average is well under 4GiB per day, and curiously the amount written is considerably higher than the amount read (except for a surge when I RSYNC'ed with my backup server). I have investigated disk activity on my laptop in the past, with a view to minimizing it to save power by letting the disk stay in standby as long as possible, and the write rates are due to some poorly written programs that keep updating or saving some files needlessly.
The Chinese government may be paranoid (USA designed chips may have backdoors), or may simply understand economics (chips are high value added), and they have been funding the development of Chinese designed CPU chips for a long time, focusing in particular on those based on MIPS architectures.
A relatively recent article notes that a large Chinese cluster uses Chinese CPU chips which is significant news. The CPU chips are indeed implementations of the MIPS architecture and their microarchitecture is rumoured to be heavily inspired by that of the DEC Alpha CPU chips.
Presumably the Chinese government does not yet have the very expensive chip factories needed to build contemporary-technology chips (but I hope that they will invest in or buy AMD and its fabrication spinoff), so the chip runs only at 1.1GHz, and therefore aims more at being low power and having lots of CPUs (each chip has 16 CPUs) to compensate for that. Which I often think is the right approach for the embarrassingly parallel sort of problems for which clusters are suitable.
As to being low power, some sources note that a 0.79PFLOPS cluster drawing only 1MWatt is quite remarkable, as in this list of power efficient clusters almost all of the more power efficient clusters are much smaller than it, and clusters of similar overall computing power draw a lot more electrical power; this is very interesting because cluster cooling is one of the biggest problems in HPC:
First, with a max draw of around one megawatt, it is incredibly power-efficient. Its contemporaries at the top of the supercomputing charts use at least two megawatts and the US’s fastest supercomputer, Jaguar, draws no less than seven megawatts.
Since I work a lot on storage issues I also think a lot about them, as this blog probably shows, and this requires good terminology to avoid long composite names and unproductive ambiguity.
One of the complications that I have been struggling with is that existing terminology engenders a confusion between a type of file system (for example JFS or XFS) and a specific instance with specific content. There has been a common and rather dangerous convention that I have followed so far where file system (note the space) means the type and filesystem means an instance.
So for example one could say that the block device /dev/sda6 contains an XFS filesystem, and that XFS is a file system particularly suitable for large filesystems.
I have decided that this convention is rather too dependent on context and error prone, and that always saying file system type is too awkward.
Therefore I will be using file-system to denote
the type, the structure, and file-tree to denote
an instance. Conceivably some file-system
allows non-tree like instances, but essentially all
contemporary ones only allow trees (with shared leaves, or at
most branches) so file-tree
is not a
misnomer.
While revising the specification sheet of my recently acquired flash SSD drive I noticed that under Endurance it says 72TB, or 40GB per day for 5 years. So I looked up some photos of the components making up the device, to check on their physical endurance properties, and there are sixteen 29F128G08 CFAAB units.
These are chip carriers containing two MLC, 25nm, synchronous, 128Gib/8GiB chips with 4KiB flash-pages, 1MiB flash-blocks and an endurance of 3000 erase cycles. If each chip were to be erased 3000 times and each time fully written, that would be 768TB, that is around 10 times more than in the specification sheet.
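As a quick check of that figure (total raw capacity in GiB times the rated erase cycles):

# 16 carriers, 2 chips each, 8GiB per chip, 3000 rated erase cycles
echo '16 * 2 * 8 * 3000' | bc
# prints 768000, that is 768,000GiB of total writes, the ~768TB figure above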
The difference is so large (a bit more than an order of magnitude) that it is unlikely that it is just a conservative estimate, the manufacturer also probably expects that in most cases flash-blocks will be erased without then writing in all the flash-pages they can hold.
There is also the problem that as soon as one flash-block reaches the maximum erase count (which is presumably slightly different for each chip or even block), it becomes unusable and thus the effective capacity of the device is reduced by that block.
This is why the FTL of flash SSDs aims for wear leveling by mapping the logical device onto the flash chips in a circular queue (similar to the lower level of a log structured file-system).
But in addition the firmware of rotating storage and flash SSD devices reserve a chunk of storage capacity for logically replacing failed parts of the capacity, and this is called sparing for rotating storage devices and overprovisioning for flash SSDs.
In the case of flash SSDs despite wear-leveling some flash-blocks will reach their maximum erase count before others, and those will then be ignored and a flash-block from the overprovisioned reserve will be used in their stead. The overprovisioning for my consumer-grade flash SSD is around 7% of total capacity (enterprise-grade flash SSDs tend to have a lot more). This is typical of consumer level flash SSDs, because they use chips with capacities measured in powers of 2, but offer visible capacities measured in powers of 10, so they have 256 gibibytes of physical capacity, but 256 gigabytes of visible capacity, and since 256 gigabytes are about 238 gibibytes, there are around 18 gibibytes of invisible capacity which gets turned into 7% overprovisioning.
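The arithmetic behind that 7% figure, as a rough check with bc:

# 256 gigabytes expressed in gibibytes
echo 'scale=1; 256 * 10^9 / 2^30' | bc
# prints 238.4
# invisible capacity as a percentage of the visible capacity
echo 'scale=1; (256 - 238.4) * 100 / 238.4' | bc
# prints 7.3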
Overprovisioning is also used to make writes faster, by keeping a reserve of empty flash-blocks ready to accept a stream of writes without having to be erased on the fly, and this can make a big difference to write latency as shown in this blog post by a manufacturer, but this does not need to consume a lot of the reserve unless there are very long surges of random writes.
But if erase endurance is a worry (not in the case of my
laptop storage unit) one can short-stroke
a
consumer flash SSD to greatly increase its erase endurance. This
means leaving some part of the visible capacity unused, for
example by only allocating half of the visible capacity to
partitions containing filetrees.
This means that half of all flash-blocks would not be written to directly, and that since the whole visible capacity is part of a single circular erase queue for wear leveling, writes on the logically allocated half of the capacity would spread on all flash blocks, reducing the average erase cycle count per block to half of what it would be otherwise.
This is quite similar to short-stroking, but the effect is not to reduce the average travel time of the head as with rotating storage devices, but to reduce the average number of erases per flash-block; it also significantly improves the chances of finding an empty flash-block during a random-write surge, just like the overprovisioning built into the FTL.
Simply not filling more than some percentage of each file-tree is probably going to have much the same effect as only allocating an equivalent percentage of visible capacity to file-trees. But I suspect that it is best to just reduce the size of the file-tree, both because not filling it is a crude way of doing an allocated-space restriction, and because leaving a part of the visible capacity wholly untouched may help the FTL to do its mappings better.
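For a new device this can be as simple as partitioning only part of the visible capacity. A sketch with parted, where the device name is hypothetical and the commands are of course destructive of any existing content:

# create a GPT label and allocate only the first half of the visible capacity
parted -s -a optimal /dev/sdX mklabel gpt
parted -s -a optimal /dev/sdX mkpart primary 1MiB 50%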
While doing some RSYNC backups on my new flash SSD in my laptop I half noticed a strong clicking or buzzing sound at the same time as heavy seeking on the filetree (also as evident from the storage activity light on the front of the laptop), and directly related and proportional to it.
Then I realized that it was impossible: I was sort of habituated to the noise because it was nearly identical to the noise made while seeking by a rotating storage device with a magnetic pickup arm going back and forth. But flash based SSDs don't have anything like that.
This greatly perplexed me and I did some web searches and
something similar seems to have been noticed by other flash
SSD users, which was indeed strange, so I decided to
investigate. First I wanted to verify the source of the noise,
and eventually I found that the noise was coming from the
loudspeakers of the laptop. This was very perplexing
itself because I had earphones plugged into its sound output
jack
and this should redirect sound to the
earphones, as indeed I was able to verify.
Then I looked for some possible background dæmon that would report disk activity with a buzzing sound, or perhaps something in the laptop hardware that was designed to have the same effects, and while looking I unplugged the earphones to plug in some external loudspeakers, and the seek noise stopped, and the laptop loudspeakers became silent while nothing was plugged into the sound output jack. Plugging in the earphones (or external loudspeakers) made the seek noise reappear.
My conclusion is that this is because of imperfect isolation
of parts of the sound circuits of the laptop from the
electrical noise of the computing circuits: when the earphones
are not plugged in, the loudspeakers are driven by the sound
chip
in the laptop, and when the earphones
are plugged in that chip is connected to them, and perhaps
then the loudspeakers are connected to ground or whatever
and pick up computing circuits
electrical noise, most likely bus
noise,
and that has some harmonics in the audible range. Amusingly
these are also nearly identical to rotating storage device
seek noise.
My impressions is supported by previous experience: when
using cheap sound cards
in some desktop PC I
noticed that there was a characteristic background noise that
was obviously related to CPU and memory activity.
Indeed it is somewhat daring to put a sound circuit in the same box and electrically connected to a lot of other electrically noisy electronics, and some people prefer to use external USB sound devices to minimize electrical couplings and reduce background noise (but poorly designed USB sound devices can pick up noise from the USB connection).
I have a long list of parity RAID perversities to report, usually about overly wide, complicated setups, but there is a very recent one that in several different ways is typical and it is about just a narrow 2+2 RAID6 set:
I got a SMART error email yesterday from my home server with a 4 x 1Tb RAID6.
Why not a 4 drive RAID10? In general there are vanishingly few cases in which RAID6 makes sense, and in the 4 drive case a RAID10 makes even more sense than usual. Especially with the really cool setup options that MD RAID10 offers.
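For example, a 4-drive MD RAID10 in the "far 2" layout, which gives RAID0-like streaming reads on top of mirroring, can be created with something like this (a sketch; device names are hypothetical):

# 4 drives, 2 copies of each block, 'far' layout
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1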
The main reason is the easy ability to grow the RAID6 to an extra drive when I need the space. I've just about allocated all of the array to various VMs and file storage. One thats full, its easier to add another 1Tb drive, grow the RAID, grow the PV and then either add more LVs or grow the ones that need it.
In this case, the raid6 can suffer the loss of any two drives and continue operating. Raid10 cannot, unless you give up more space for triple redundancy.
Basic trade-off: speed vs. safety.
The idea of a 2+2 RAID6 set seems crazy to me, not only because RAID6 is in almost every case a bad idea, based I think on a huge misconception about failure probabilities, but also because the alternatives are so much better, and I'll list them here:
This seems to be one of the few cases where RAID5 is appropriate, because it is doubtful whether one needs that much redundancy for a home server: it is most likely going to be almost entirely read-only (typical home servers are used as media archives).
It also allows saving one drive that can go towards making offline backups of half of the content (for example with a cheap external eSATA cradle), and a 2+1 RAID5 would have much better rebuild times and most likely better (write) performance than a RAID6.
This would have the same usable capacity, enormously better write performance independent of alignment requirements, and much shorter rebuild times.
There is also the advantage of being able to continue working after the failure of any number of non paired units, while RAID6 can continue working after the failure of up to 2 units. In this case it is not relevant though, as there are only 2 pairs of drives, so RAID6 seems to have the advantage, which arguably it does not in the general case.
RAID10 continues to have another large advantage though: that rebuild times are much shorter and less stressful on the hardware, because they involve just the remirroring of the affected pair(s), while rebuilding with RAID6 involves the rereading of all old units interleaved with the writing of the new units.
The latter is significant because the essential difference between RAID10 and RAID6 (or RAID5) is that in the latter all units are tangled together by the parity blocks. Having all drives active means that rebuilds are usually rather slower (for RAID6 also because parity computations are somewhat expensive), and as the drives are usually physically contiguous, a rebuild means a lot of extra load on the drive set and thus extra vibration, heat and power draw, which often leads to additional failures. Something that also applies in general to writing with RAID6 (with RAID5 one can use the shortcut of reading just the block to be written and the parity block of any stripe).
In this case I think that RAID10 is only slightly better than RAID6 overall, including as to the idea of resizing the volume, because resizing RAID sets or even just filetrees is a pretty desperate move because:
Even just 2 RAID1 pairs may be a good idea for a home server (or a production server) compared to a 2+2 RAID6, a 2+1 RAID5, or a 2×(1+1) RAID10, because it is a very simple layout with two fully independent areas.
Read performance may not be as high, because each area can only draw on one disk, but write rates are likely to be the same, and the advantage is not just simple setup, but also fully independent operation of the pairs. One of the two pairs can die entirely, and the other continues to work.
Among the benefits of simplicity is that by using just
RAID1 it can be rather easy to boot
fully from a pair, while it can be a bit more complicated
with a non-RAID1 pair.
All the previous setups have in my eyes a serious defect: they involve continuous redundancy, that is the members of a redundancy set are constantly in-use and subject to largely the same stress.
Often I prefer a hot+warm arrangement, where there are pairs of storage units, and one is online and hot, and gets periodically copied to the other member of the pair, which is online and warm.
This can happen with drives which are both in the same drive enclosure or with one outside, and by block-by-block copying or by temporary RAID1 remirroring.
The advantage of this arrangement is that, if internal, the backup drive can go into sleep mode most of the time, and in any case is subject to a lot less load and different stress than a RAIDed one. Also, previous versions of files can be recovered from the backup drive.
Also, the idea of reshaping a RAID6 into a larger one and using LVM2 seems very risky to me, as it is a particularly stressful rebuild, and one that leaves the contained filetree misaligned as to RMW.
For my home server I use the hot-warm arrangement with 2 levels of backup: for every hot drive a warm backup drive in the PC box, which is in sleep mode almost all day and gets a few hours of mirroring during each night by way of a CRON script, and one or more offline cold backup drives that get mirrored somewhat periodically in an eSATA external box.
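A sketch of what such a nightly CRON job could look like, with devices, paths and times all being hypothetical:

# /etc/cron.d/mirror-warm: start the nightly mirroring at 02:30
30 2 * * *  root  /usr/local/sbin/mirror-warm.sh

# /usr/local/sbin/mirror-warm.sh
#!/bin/sh
# keep the warm drive awake for the duration of the copy
hdparm -S 0 /dev/sdb
# block-by-block copy of the hot partition image onto the warm one
dd if=/dev/sda6 of=/dev/sdb6 bs=1M conv=fsync
# put the warm drive back into standby until tomorrow
hdparm -y /dev/sdb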
For work I tend to have RAID1 pairs or triples for data which does not require high write rates, and RAID10 sets for data that does; very rarely narrow (2+1 or 3+1) RAID5 sets in the few cases where they make sense, and even more rarely RAID0 sets for high transfer rate cases with volatile data. Never RAID6, unless upper management believes a storage salesman's facile promises and tells me to just put up with it.
As a rule I use MD instead of hardware RAID host adapters (most have amazingly buggy firmware or bizarre configuration limitations), and virtually never DM/LVM2, the exception being for snapshots (and I am very interested in file-systems that have built-in snapshotting, usually because they are based on COW implementations).
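For the record, the snapshot exception amounts to something like this (a sketch; volume group, volume and mount point names are hypothetical):

# create a 10GiB copy-on-write snapshot of an existing logical volume
lvcreate --snapshot --size 10G --name lv0snap /dev/vg0/lv0
# mount it read-only to take a consistent backup
# (an XFS filetree would also need the nouuid mount option)
mount -o ro /dev/vg0/lv0snap /fs/snap
# ... run the backup, then clean up ...
umount /fs/snap
lvremove /dev/vg0/lv0snap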
Having mentioned the speed of treewise filetree level copies across disks on my home server, these are the speeds of linear block device copies on the same server, first those between 1TB disks, and then between 2TB disks, both being nightly image backups:
'/dev/sda1' to '/dev/sdb1':
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 243.94 seconds, 110 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 243.943 seconds, 110 MB/s
real    4m3.944s
user    0m0.071s
sys     0m18.332s

'/dev/sda3' to '/dev/sdb3':
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 242.319 seconds, 111 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 242.322 seconds, 111 MB/s
real    4m2.405s
user    0m0.072s
sys     0m18.429s

'/dev/sda6' to '/dev/sdb6':
95385+1 records in
95385+1 records out
100018946048 bytes (100 GB) copied, 917.785 seconds, 109 MB/s
95385+1 records in
95385+1 records out
100018946048 bytes (100 GB) copied, 917.791 seconds, 109 MB/s
real    15m17.792s
user    0m0.245s
sys     1m8.049s

'/dev/sda7' to '/dev/sdb7':
85344+1 records in
85344+1 records out
89490522112 bytes (89 GB) copied, 845.122 seconds, 106 MB/s
85344+1 records in
85344+1 records out
89490522112 bytes (89 GB) copied, 845.126 seconds, 106 MB/s
real    14m5.798s
user    0m0.223s
sys     1m1.021s

'/dev/sda8' to '/dev/sdb8':
237962+1 records in
237962+1 records out
249521897472 bytes (250 GB) copied, 2532.82 seconds, 98.5 MB/s
237962+1 records in
237962+1 records out
249521897472 bytes (250 GB) copied, 2532.83 seconds, 98.5 MB/s
real    42m14.171s
user    0m0.707s
sys     2m49.092s

'/dev/sda9' to '/dev/sdb9':
475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 6585.2 seconds, 75.7 MB/s
475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 6585.2 seconds, 75.7 MB/s
real    109m47.678s
user    0m1.322s
sys     5m35.689s
'/dev/sdc1' to '/dev/sdd1':
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.092 seconds, 130 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.095 seconds, 130 MB/s
real    3m26.782s
user    0m0.055s
sys     0m15.635s

'/dev/sdc3' to '/dev/sdd3':
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.394 seconds, 130 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.396 seconds, 130 MB/s
real    3m26.397s
user    0m0.057s
sys     0m15.830s

'/dev/sdc6' to '/dev/sdd6':
419195+1 records in
419195+1 records out
439558766592 bytes (440 GB) copied, 3507.74 seconds, 125 MB/s
419195+1 records in
419195+1 records out
439558766592 bytes (440 GB) copied, 3507.74 seconds, 125 MB/s
real    58m29.626s
user    0m0.938s
sys     4m17.484s

'/dev/sdc7' to '/dev/sdd7':
475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 4321.96 seconds, 115 MB/s
475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 4321.96 seconds, 115 MB/s
real    72m3.533s
user    0m1.214s
sys     5m17.721s

'/dev/sdc8' to '/dev/sdd8':
950344+1 records in
950344+1 records out
996508860416 bytes (997 GB) copied, 11258.7 seconds, 88.5 MB/s
950344+1 records in
950344+1 records out
996508860416 bytes (997 GB) copied, 11258.7 seconds, 88.5 MB/s
real    187m44.220s
user    0m2.742s
sys     11m4.643s
The notable features are the satisfactory linear speeds, which are however not that much higher than the treewise ones, the expected decline towards the inner tracks, and the significant improvement over previous reports from April 2006, May 2007 and June 2009.
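Judging from the paired dd reports in the output above, each image copy was a two-stage dd pipeline, roughly of this form (a sketch; device names and the 1MiB block size are assumptions inferred from the record counts):

# copy one partition image onto its backup twin
dd bs=1M if=/dev/sda1 | dd bs=1M of=/dev/sdb1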
After regretfully converting my laptop partitions from JFS to XFS I am converting my server's partitions too, by copying them from a JFS partition on the backup disk to an XFS partition on the live disk; I have 2 live (and thus 2 backup) disks, and this is the aggregate transfer rate for 2 concurrent filetree copies (on somewhat outer cylinders):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so     bi     bo    in    cs us sy id wa st
 4  3      0 102440    328 3693320    0    0 153876 158236  4766 14750  0 29 20 50  0
 0  3      0 102012    320 3693716    0    0 148364 139740  5176 15396  0 28 32 39  0
 4  5      0 113916    332 3681632    0    0  99772 110472  4129 10544  0 17 38 45  0
 1  3      0 103692    356 3691424    0    0 208552 199824  5817 19128  0 40 14 46  0
 1  2      0  96316    364 3699624    0    0 108808 100068  4715 11346  0 22 42 36  0
 1  2      0 110456    372 3685308    0    0 171912 168132  5656 16903  0 37 21 42  0
 1  4      0 120624    352 3675564    0    0 133116 117740  5057 14079  0 26 39 35  0
 2  3      0 103516    332 3692568    0    0 208896 203128  8105 22806  0 51 15 33  0
 2  3      0 114960    304 3681052    0    0 197512 193936  7624 22460  1 51 15 33  0
 3  3      0 110404    268 3685924    0    0 208128 201532  8046 22699  1 52 13 34  0
 3  3      0 105964    256 3690176    0    0 206720 209336  7977 22331  0 51 16 33  0
 2  3      0 107744    244 3688060    0    0 199320 186580  7673 21648  0 49 17 34  0
 5  4      0 102328    236 3693952    0    0 209024 203032  8005 23010  0 53 15 32  0
 3  3      0 103052    220 3693340    0    0 219392 214136  8617 23030  0 54 14 32  0
 3  1      0 111684    212 3684580    0    0 200960 194764  7551 23520  0 51 12 37  0
 2  4      0  98528    212 3697812    0    0 201600 197308  7527 21233  0 48 17 35  0
 0  5      0  97140    212 3699264    0    0 204928 202348  7776 23740  1 51 14 34  0
 4  4      0 118084    160 3678132    0    0 208128 182308  7838 24767  0 51 13 36  0
 3  5      0 106352    144 3690336    0    0 203904 168648  7402 24999  0 51 13 36  0
 6  4      0  97688    140 3697916    0    0 211708 166412  7688 26822  0 53 11 35  0
 0  5      0  95012    128 3698996    0    0 217092 174740  8022 28231  0 56 11 33  0
This is done with a tar -c -f - .... | tar -x -f - ... style pipeline in each case. The transfer rates are pretty good (in the period above I think the transfer was of relatively large files like photographs). This is obviously because both JFS and XFS perform quite well, and I sometimes (too infrequently) copy the filetrees in the same way to improve locality.
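Spelled out, each of those pipelines is roughly of this form (a sketch; the source and destination mount points are hypothetical, and 512 is just the blocking factor used in the earlier tests):

# stream a whole filetree from the old JFS partition into the new XFS one
tar -c -b 512 -f - -C /fs/jfs-old . | tar -x -b 512 -f - -C /fs/xfs-new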
The fairly high system CPU time is due in part (around 10-15% of available CPU time) to the usual high CPU overheads of the Linux page cache, and in part (around 30-40% of available CPU time) to one of the filetrees involved being on two encrypted block devices. While I understand that encryption is expensive, the percentage of time devoted to just managing the page cache is fairly notable.
A single copy on an unencrypted filetree on inner cylinders is running like this a bit later (while copying music tracks of a few MiB each):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so     bi     bo    in    cs us sy id wa st
 1  2      0 101900    172 3713000    0    0  68616  59820  2620  7994  0  8 65 27  0
 1  2      0 103920    172 3710992    0    0  71148  73844  2792  8324  0 10 64 26  0
 1  3      0  96296    144 3718692    0    0  72692  65484  2717  7973  0  9 62 29  0
 0  1      0 107504    144 3707744    0    0  46888  52668  2361  4692  0  4 67 29  0
 0  1      0 105032    144 3710072    0    0  46712  50888  2340  4181  0  3 69 28  0
 0  2      0  98476    140 3716688    0    0  69372  69916  2707  7370  0  5 66 28  0
 1  2      0  98160    140 3716804    0    0  50496  48124  2340  5303  0  5 67 28  0
 2  1      0 104700    140 3710312    0    0  44828  45680  2230  4917  0  4 67 29  0
 1  1      0 105792    140 3709312    0    0  62788  55240  2550  6965  0  6 67 27  0
 0  2      0 107232    152 3708112    0    0  50300  56752  2445  5804  0  5 63 32  0
 0  1      0  96424    156 3718756    0    0  48420  46140  2303  4773  0  5 67 28  0
 1  1      0 100036    156 3715388    0    0  55696  56336  2456  5510  0  6 65 29  0
 0  3      0 102044    156 3713064    0    0  63980  53392  2557  7526  0  7 62 31  0
 0  2      0  94812    156 3720228    0    0  41696  47428  2253  4039  0  4 64 33  0
 0  2      0  97216    164 3717836    0    0  51868  58480  2544  5331  0  5 65 30  0
 0  1      0  98424    164 3716596    0    0  49960  50164  2329  5234  0  5 66 29  0
 0  1      0 103424    156 3711676    0    0  64468  67608  2698  7042  0  6 64 30  0
 0  3      0  96984    152 3718324    0    0  75164  75944  2857  8409  0  7 61 32  0
Fascinating article on Google getting gamed by SEO spammers, with a very important aspect that is mentioned in the article but otherwise underrated:
...
After all, hardly anyone links to anyone anymore, unless they’re spammers.
The really important aspect is that hyperlinks have been heavily discouraged by Google, and the article above has an infographic that lists some detailed reasons for that. Broadly speaking however the reasons are:
- that PageRank amounts to mining a free resource, in effect the willingness of site authors to act as Google's unpaid editors. There is a myth that Google used algorithms to replace human judgement, when they actually used PageRank to collect that human judgement, because finding good-quality outgoing hyperlinks takes time and human insight.
- that site visitors (eyeballs) can be monetized, and therefore that, regardless of their role in PageRanks, incoming hyperlinks are good, because they lead to more visits, and outgoing hyperlinks are bad, because they lead to shorter visits as web users follow them to migrate to other sites.
Note: these are expanded from an earlier simplified argument in a previous post.
The really big problem is the last one: increasing what
marketing experts call the stickyness
of a
site. The big social sites like Facebook
for example heavily discourage outgoing hyperlinks because they
would take away visitors, and thus provide a number of
features that result in more hyperlinks within the site, among
the many internal pages of their users.
Something similar applies to Wikipedia: they discourage outgoing hyperlinks and encourage internal cross references among Wikipedia pages, and I have just checked, and as the article above says, all external Wikipedia hyperlinks are tagged rel=nofollow (hopefully just to discourage hyperlink spammers from abusing Wikipedia pages).
In other words visitors are a desirable resource and their attention is hoarded by not publishing outgoing hyperlinks or by marking them rel=nofollow so that they don't contribute to another site's Google PageRank.
But note that there is a big difference between using rel=nofollow and not using hyperlinks at all: in the former case the source site visitor still benefits from the hyperlinks and the target site authors benefit from the incoming visitors, they just don't get the PageRank contribution. But this brings up some other aspect of the modern web:
- many web users have become so dependent on their search engine that they type full hyperlink URLs into Google's search field, and then click on the result containing it, instead of typing the URL directly into their browser's address field.
Now, before the conclusion, a vital historical aspect: in the early days of the web, or even before the web in the early days of the Internet, it was very difficult to locate the content one was interested in, and therefore several directories were made listing types of content and their locations. Some of these were printed, and initially YAHOO! itself was merely a directory (as in the backronym Yet Another Hierarchically Organized Oracle!), as a Wayback Machine snapshot from 1996 shows clearly. There are two problems with directories (which still exist):
Given those two issues, directories became unsustainable with the growth of the web, and some people started using standard textual database search methods, applying them to the web, that is building directories algorithmically using techniques well developed for content like books or news. This usually failed because web content is largely unedited (that is, most of it is rubbish) and therefore difficult to search for relevance.
Note: this happens also in internal web sites and wiki sites (for example and famously Microsoft SharePoint installations) which usually are not afforded any funding for a human editor and therefore usually degenerate into piles of rubbish without any structure.
The idea by Google (and several others before them) was to build directories by figuring out which sites were well edited web directories for some topic or another and using their structure as the basis for their algorithmically constructed directories. The final dialectic and irony is at this point finally clear:
Note: Put another way sites win by having incoming hyperlinks from other sites and principally Google, and lose by having well chosen outgoing hyperlinks, but at the same time Google relies on those well chosen outgoing hyperlinks to build the database of incoming hyperlinks to all sites.
The problem Google has is that they should be implicitly rewarding instead of punishing the web site authors who create well edited sites with good quality outgoing hyperlinks, and thus donate free editorship to Google, but they cannot do so because their business model is to steal outgoing hyperlinks from them. Perhaps they need to adapt their business model so that outgoing hyperlink revenue is either shared with the sites that contributed to their selection (nearly impossible), or so that sites with well chosen outgoing hyperlinks directly generate revenue from them, and that means hyperlinks embedded in the text of those sites, not separately as in Google AdSense.
For web visitors the big problem is that content that is not visible via web search is almost inaccessible, and content that is not reachable via a chain of hyperlinks (the more arduous alternative) is effectively inaccessible, and that the two are not independent, and because of that they are both becoming less useful.
The only bit of luck is that some authors still provide well edited sites with valuable outgoing hyperlinks, but those sites are difficult to find, and they tend not to last, or to be updated often, because providing well edited web sites is expensive.
Presumably the web will continue to fragment more into islands of highly connected sites with well edited content where there is some reason for such sites to be well edited other than personal dedication, or where personal dedication is motivated for the long term, performing the role of libraries, and web search like Google will become a lot less useful, as it devolves more and more into mindless advertising.
I was doing some nostalgia searching on old PC technology and found an entertaining USENET thread from July 1989 on how easy it is to have 30 concurrent users on a 16MiB PC with a 25MHz CPU, and I feel like reproducing here one of the more revealing messages:
Newsgroups: comp.unix.i386
From: j...@specialix.co.uk (John Pettitt)
Date: 30 Jun 89 07:51:32 GMT
Local: Fri, Jun 30 1989 3:51 am
Subject: Re: How many users _really_ ?

hjesper...@trillium.waterloo.edu (Hans Jespersen) writes:

>In article <1989Jun27.074031.11...@specialix.co.uk> j...@specialix.co.uk (John Pettitt) writes:
>>How many 30+ user 386 systems do you know that run `dumb' cards ?
>Perhaps a better question is "How many 30+ user 386 systems do you
>know that run (period)."

Quite a large number.

>Apart from this reference I have seen/heard many people talking about
>putting 32 users on a '386 PC system. In my experience (although limited
>to 20 MHz, non-cached machines) it seems as though 32 is a unbelievably
>high number of users to put on a 386 system. The bus seems to be the
>bottleneck in most cases, specifically due to disk I/O. I know I'd
>feel much better proposing a 32 user mini based solution that a '386 based
>one.

Firstly a 20 Mhz non-cached 386 is not the place to start building a 32 user system. Most of the big systems we see are based on 25 Mhz cached machines like the top end Olivetti machines.

Secondly a large number of these systems run DPT or SCSI disks, this gives a noticable improvment in performance. The AT bus is only used by these systems for I/O, memory has it's own bus so the throughput is not too bad.

Thirdly, most of the user of this type of system are running commercial accounting or office automation software. The system I am typing this from is a 25 Mhz Intel 302 and 4 engineers can kill it (well jeremyr kill it by running VP/ix), but the office automation is a very different application. On the 32 terminal system there will be between 8 and 24 active users most of the time and of those only about half will be doing much more than reading mail / enquiry access.

>How many CONCURRENT users ( of typical office automation software )
>can you respectfully support on a '386 PC ? I know this is vauge and
>will vary considerably depending on the configuration but anyone want
>to give it a shot ?

This is the real point - CONCURRENT users - my guess is with a good disk an a good I/O board (like ours plug plug :-) and a well tuned system you should be able to support about 16-20 wp users or about 8-10 spread sheet users (need a 387 tho). We have a number of customers with 32 and in one case 64 terminals on 386 systems and very usable perfomance levels.

Oh and I nearly forgot - you can _never_ have too much RAM.

>--
>Hans Jespersen
>hjesper...@trillium.waterloo.edu
>uunet!watmath!trillium!hjespersen

--
John Pettitt, Specialix, Giggs Hill Rd, Thames Ditton, Surrey, U.K., KT7 0TR
{backbone}!ukc!slxsys!jpp  jpp%slx...@uunet.uu.net  j...@specialix.co.uk
Tel: +44-1-941-2564  Fax: +44-1-941-4098  Telex: 918110 SPECIX G
Which reminds me that some years earlier I was able to have 3 users (barely) on a PDP-11/34 with 256KiB running the UNIX V7 derivative 2.9BSD.
The vital qualification is that these were all text-mode users, with no GUI like the X window system.
Some of my earlier and recent posts mention, as an important feature of a SATA storage unit, support for the SECURITY ERASE UNIT command, as accessible for example via the --security-erase and similar options of the ATA Security Feature Set support in the tool hdparm.
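For reference, the usual invocation goes something like this sketch (the device name is hypothetical, the drive must not be reported as frozen, and the operation destroys all data on it):

# set a temporary user password, then issue the erase with the same password
hdparm --user-master u --security-set-pass Eins /dev/sdX
hdparm --user-master u --security-erase Eins /dev/sdX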
A minor reason is that having an erase feature built into the drive is a good idea, even if the security of such a feature is somewhat dubious compared to physical destruction of the storage unit.
The major reason is that it offers an opportunity to sort-of
reformat
the storage unit, which may help
make it more reliable, or to solve minor damage issues.
Modern storage units cannot really be reformatted because nearly all of them have IDE, that is an embedded controller, which does not support reformatting. This is also because the physical recording layer is often formatted in a very particular way, with precise embedded servo signals and tracks, which can only be made by dedicated factory equipment, and not by the signal generators inside the IDE, which can only read them.
Ancient storage units did not have IDE, and thus received directly from a controller the signal to be recorded onto the medium. For IA (previously known as IBM compatible) PCs the first storage units had an ST506 signaling interface, which connected a controller card (usually a WD1003) in the PC to the rotating disk drive, which acted as a recorder and player of the analog data signal produced by the controller. It was even possible, by changing the controller, to change the recording scheme, for example from MFM to RLL 2,7 encoding, and I remember being lucky that my Miniscribe 3085s, nominally 80MB with a 400KB/s transfer rate (and 28ms average access time), would work with an ACB2372B controller, thus delivering instead 120MB at a 600KB/s transfer rate (thanks to being able to put 27 instead of 16 sectors in each track).
Currently controllers are integrated with the storage medium and recording hardware (on the PCB, usually on the bottom), and the controllers are rather sophisticated computers, usually with a 16-bit or 32-bit CPU, dozens of MiB of RAM, a small real-time operating system, and hundreds of thousands or millions of lines of code implementing a high level protocol like the SCSI command set (which is used by SATA and SAS equally), and they don't offer any direct access to the underlying physical device.
The controller usually offers to the host an abstract, indirect view of the device as a randomly addressable array of logical sectors that get mapped in some way or another onto the physical recording medium (and the mapping can be quite complicated in the case of flash SSD devices).
Part of this mapping is sparing
where the
physical device has spare recording medium that is kept in
reserve in case some of it fails, and in that case the logical
sectors assigned to the damaged part get reassigned to a
section of the spare recording medium.
This sparing is essential as contemporary devices have hundreds of millions of logical sectors and the damage rate for physical medium cannot be zero. Unfortunately it often does not work that well, for various reasons, for example:
- firmwares do not handle well the difficult case of a read failure, because in that case the data is lost, and then often try for too long to reread the recording medium hoping one read will succeed, sometimes apparently not doing any reassignment in the end.
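The state of the sparing machinery can at least be observed through SMART; a small sketch (the device name is hypothetical and attribute names vary somewhat between manufacturers):

# look at the reallocation and pending-sector counters
smartctl -A /dev/sda | egrep -i 'realloc|pending|uncorrect'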
For all these reasons it is often (but not always) possible
to refresh
a storage unit to look error
free and almost as-if reformatted by performing a
SECURITY ERASE UNIT style operation, as that usually
triggers some firmware logic that looks remarkably like the
initial setup of the drive, after the recording medium has
been primed at the factory.
The SECURITY ERASE UNIT command therefore usually does a bit more than simply erasing the content of all logical blocks: it effectively rescans and reanalyzes most or all of the physical recording medium, and rebuilds the internal tables that implement the logical device view on top of the physical medium.
This is fairly important for flash SSDs because it is in effect a large, whole-disk style TRIM operation that resets all the flash erase blocks to empty, and resets all logical-sector-to-flash-block assignments, and thus usually brings the whole unit back to nearly optimal performance, with minimum write amplification.
Considering how quick the process is on a flash based SSD, and that flash SSDs are much faster than a rotating storage device at restoring archives, a complete backup/secure-erase/reload cycle can be affordably fast, and completely refreshes both the logical and physical layouts of the storage unit (a defragmentation of sorts).
In my new XFS setup I have had to specify to XFS the logical sector size explicitly as 4096B, because my new SSD storage unit misreports its geometry, as can be seen from the hdparm -I output:
ATA device, with non-removable media
        Model Number:       M4-CT256M4SSD2
        Serial Number:      ____________________
        Firmware Revision:  0009
        Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
        Used: unknown (minor revision code 0x0028)
        Supported: 9 8 7 6 5
        Likely used: 9
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  500118192
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        Logical Sector-0 offset:                  0 bytes
        device size with M = 1024*1024:      244198 MBytes
        device size with M = 1000*1000:      256060 MBytes (256 GB)
        cache/buffer size  = unknown
        Form Factor: 2.5 inch
        Nominal Media Rotation Rate: Solid State Device
Some notable details stand out, the first being that the drive reports a 512B logical sector size and also a 512B physical sector size. I would rather it reported a 4KiB logical sector size and a 1MiB physical sector size, but probably the 512B report for both is there to avoid trouble with older operating system versions that cannot handle sector sizes other than 512B well or at all. But then usually those older operating systems don't query drives for sector sizes either. Continuing to look at the reported features:
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
           *    Advanced Power Management feature set
                SET_MAX security extension
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    IDLE_IMMEDIATE with UNLOAD
                Write-Read-Verify feature set
           *    WRITE_UNCORRECTABLE_EXT command
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    unknown 76[3]
           *    Native Command Queueing (NCQ)
           *    Host-initiated interface power management
           *    Phy event counters
           *    NCQ priority information
           *    DMA Setup Auto-Activate optimization
           *    Device-initiated interface power management
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT LBA Segment Access (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
           *    Data Set Management determinate TRIM supported
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
                supported: enhanced erase
        2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 500a075_________
        NAA             : 5
        IEEE OUI        : 00a075
        Unique ID       : _________
Checksum: correct
In the above it is nice to see that it supports SCT ERC and of course ATA TRIM. It is also amusing to see that SECURITY ERASE UNIT is supposed to take only 2 minutes: my 1TB 3.5" rotating storage devices estimated it at 152 and 324 minutes, and the 500GB 2.5" rotating storage device that was in my laptop estimated it at 104 minutes. There is clearly some advantage to bulk-erase times from large, simultaneously-erased flash blocks.
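The SCT ERC support is welcome because it allows capping the overlong re-read attempts mentioned earlier; a sketch of inspecting and setting it with smartmontools (the 7 second values are merely a commonly used example, and not every firmware accepts them):
# Show the current SCT Error Recovery Control read/write timeouts, if set.
smartctl -l scterc /dev/sda

# Cap read and write error recovery at 7.0 seconds; the values are given
# in tenths of a second.
smartctl -l scterc,70,70 /dev/sda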
While installing my new SSD storage unit in my laptop I have decided with much regret to switch from my favourite JFS to XFS as the main data file-system.
While I think that after JFS it is XFS that is the most convincing file-system for Linux, it has enough complications and pitfalls that I would really rather have stayed with JFS. But JFS is not being actively maintained, and various features that are good for SSDs have not been added to it.
Currently I may have some time to help maintain it, but because of the extreme quality-assurance demands on file-systems I cannot really help much: it is vital to spend quite a bit of hardware and time on testing, as many file-system problems only manifest under load on large setups.
Also XFS is now actively supported (in particular as to large scale quality assurance) by Red Hat for EL, as they have bought a significant subset of the development team, and EL and derivatives are important in my work activities.
I have used this non-standard set of parameters to set up the filetrees:
# mkfs.xfs -f -s size=4096 -b size=4096 -d su=8k,sw=512 -i size=2048,attr=2 -l size=64m,su=256k -L sozan /dev/sda6
meta-data=/dev/sda6              isize=2048   agcount=4, agsize=6104672 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=24418688, imaxpct=25
         =                       sunit=2      swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
The XFS developers insist that the defaults are usually fine, and indeed most of the above are defaults made explicit. The main differences are specifying 2048B inode records instead of the smaller default, and specifying alignment requirements, in particular marking logical sectors to be considered 4096B, but also requesting RAID-like alignments on a single storage unit. The hope is that this helps with aligning to flash page and flash erase block boundaries and does not otherwise cost much (the only cost I can see is a bit of space).
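The requested alignments can be checked again after mounting; a minimal sketch, assuming the filetree labeled sozan is mounted at the hypothetical mount point /fs/sozan:
# Re-display the geometry of the existing XFS filetree; sunit/swidth and
# sectsz should match what was requested at mkfs time.
xfs_info /fs/sozan
Note that su=8k with sw=512 shows up in the mkfs output above as sunit=2 and swidth=1024 filesystem blocks, that is 8KiB and 4MiB respectively with a 4096B block size.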
I have begrudgingly kept the root filetree as ext3 (under an EL5 derivative) and ext4 (under ULTS10), as initrd based boot sequences tend to be fragile with file-systems other than the default.
Since a new release of the very useful Grml live CD has happened, I have downloaded it and wanted to install it on some USB portable storage device that I have, in particular the combined 32-bit and 64-bit dual boot edition. That did not work smoothly, with various instances of the message:
Missing Operating System
because of some mistakes with setting up the SYSLINUX bootloader correctly:
a partition type of 0x06 (FAT16) instead of partition type 0x0e (MS-Windows 95 FAT16 with LBA).
not rerunning the compiler that generates information for the SYSLINUX bootloader, a bit like the lilo command compiles information for the LILO bootloader.
It actually turns out that with recent versions of SYSLINUX one can use VFAT types throughout, so the partition type can be 0x0e, and the filetree in the partition formatted and mounted as FAT32.
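For reference, a minimal sketch of preparing such a USB flash device by hand (device and partition names and the mbr.bin path are examples that vary between distributions; the partition type is set to 0x0e or 0x0c with fdisk beforehand):
# Create the FAT32 filetree in the first partition of the USB device.
mkfs.vfat -F 32 /dev/sdX1

# Install the SYSLINUX boot code into that partition.
syslinux /dev/sdX1

# Write a SYSLINUX-compatible MBR (440 bytes of boot code) to the device
# and mark the partition as bootable.
dd if=/usr/lib/syslinux/mbr.bin of=/dev/sdX bs=440 count=1
parted /dev/sdX set 1 boot on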
The other big problem was that the SYSLINUX configuration files for the 32-bit plus 64-bit dual boot did not seem to work from the USB storage device, and did not seem finished either, and anyhow I found them messy and hard to understand. So I have updated them to work more robustly and to be cleaner, and I have in the process removed the graphical menus that are hardly useful. I have put the updated version here.
Since the USB storage device uses flash memory its access times are very low, much lower than for a rotating storage device like a CD-ROM or DVD-ROM drive, and accordingly the running live system seems much snappier, even if a small flash drive has only a single chip or two and so the transfer rates are not high (around 20MB/s reading and 6MB/s writing on the one I got).
I still usually prefer bootable CDs and DVDs as they are read-only, cheap and easy to duplicate, but admittedly many systems currently have no CD or DVD drive yet do have USB ports (and while I have a USB DVD drive, it is somewhat bulky to carry around or to use on a stack of servers).
I have been reading for the past few months some reviews of the new AMD FX-8150 chips with 8 CPUs, which are reported to perform less well than their Intel counterparts even if they scale better, and it was quite interesting to see that they are reported to share units between CPUs.
This is a bit of cheating, as it has a bit of the flavour of Intel's hyperthreading; the difference is that in AMD's case only a few of the resources are shared between CPUs, while in Intel's case virtually all of them are (except the memory engines).
I have recently bought a 256GB Crucial M4 flash SSD (firmware 009) for my laptop, and I have been curious about its performance profile. The laptop is an older model that does not support 6Gb/s SATA, so read rates are limited by the 3Gb/s maximum.
As expected the SSD unit is far less sensitive to read-ahead and transaction size settings (at least for reading) than a rotating storage unit, for example:
# for N in 8 32 128 512 2048; do blockdev --setra "$N" /dev/sda; blockdev --flushbufs /dev/sda; dd bs=64k count=10000 if=/dev/sda of=/dev/zero; done
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 8.21186 s, 79.8 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 4.49033 s, 146 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.52406 s, 260 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.42923 s, 270 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.67782 s, 245 MB/s
and there are entirely equivalent results with O_DIRECT and varying transaction sizes:
# for N in 8 32 128 512 2048; do dd bs="$N"b count=$[1200000 / $N] iflag=direct if=/dev/sda of=/dev/zero; done
150000+0 records in
150000+0 records out
614400000 bytes (614 MB) copied, 11.946 s, 51.4 MB/s
37500+0 records in
37500+0 records out
614400000 bytes (614 MB) copied, 4.78695 s, 128 MB/s
9375+0 records in
9375+0 records out
614400000 bytes (614 MB) copied, 3.33459 s, 184 MB/s
2343+0 records in
2343+0 records out
614203392 bytes (614 MB) copied, 2.7135 s, 226 MB/s
585+0 records in
585+0 records out
613416960 bytes (613 MB) copied, 2.58665 s, 237 MB/s
In both cases a 64KiB read-ahead or a 64KiB read size delivers near the top transfer rate, thanks to the very low latency due to negligible access times. For the same reason fsck (of an undamaged filetree) is 10-20 times faster on the SSD than on the same filetree on rotating storage.
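Since the read-ahead set with blockdev --setra does not survive a reboot, the chosen value can be made persistent from a local boot script; a small sketch (the device name is an example, and 512 sectors of 512B correspond to 256KiB):
# Set a 256KiB read-ahead at boot, for example from a local boot script.
blockdev --setra 512 /dev/sda

# The equivalent value in KiB can be read or set through sysfs.
cat /sys/block/sda/queue/read_ahead_kb
echo 256 > /sys/block/sda/queue/read_ahead_kb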
Just discovered that someone has run several code analysis tools over the Linux file-system sources.
The main interest is the many ways of looking at inter-module references; I haven't spotted anything notable yet.
In my previous notes about how SSDs are structured I simplified a bit what is a complex picture. One of the simplifications is that I wrote that flash memory must be erased before it is written, and that's not quite the case: flash memory bits cannot be arbitrarily rewritten, they can only be set, that is one can only logical-or new data bitwise into a flash page; erasing simply resets all the bits in a flash block to unset.
In theory this property of flash memory can be used to minimize read-erase-write cycles, because if the new content of a flash page only adds set bits to those already set in the current content (so that no set bit needs to become unset) it can be written directly; but I don't know whether flash drive firmware checks for that.
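Purely as an illustration of the condition involved (hypothetical values, bash arithmetic): under the logical-or model a page can be reprogrammed without an erase only if old & ~new is zero, that is the new content does not require clearing any currently set bit.
# Illustrative check of the "program by OR only" condition on two small values.
old=0x0f0    # bits currently set in the flash page
new=0x0ff    # desired new content
if [ $(( old & ~new )) -eq 0 ]; then
    echo "new content can be programmed directly (bitwise OR)"
else
    echo "an erase of the containing flash block is needed first"
fi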