Notes about Linux storage
This section covers known hints and issues with
various aspects of common filesystems. They range from mere
inconveniences or limitations to severe performance
- Most consumer disks do not allow shortening their very long
error recovery period (often minutes) after disk errors,
which can cause unnecessary RAID and service failures in
- Disks that allow setting a shorter recovery period in
case of disk errors require a tool that handles
which is only part of
version 5.40 or newer.
- The Linux storage system has
its own error recovery period, which can (and often
should) be shortened by setting
- Mounting filetrees on flash SSD devices with option
discard can result in occasional relatively long
delays because the related
TRIM operation is synchronous.
- The sector size for CDs is 2KiB.
- The sector size for DVDs is 32KiB, but they are required
to simulate a 2KiB sector size, potentially triggering
- In Linux 2.6.37
the barrier logic in the kernel was replaced by
better FUA logic
which should increase the parallel scheduling of IO
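The DVD sector-size note above comes down to alignment arithmetic. The following is a minimal Python sketch, assuming that one 32KiB physical block holds sixteen 2KiB logical sectors and that a write covering only part of a physical block is the case where the 2KiB simulation has extra work to do. The function name and the sample numbers are illustrative only and do not come from any existing tool.

```python
LOGICAL_SECTOR = 2 * 1024      # sector size presented to the host
PHYSICAL_BLOCK = 32 * 1024     # DVD block size from the note above
SECTORS_PER_BLOCK = PHYSICAL_BLOCK // LOGICAL_SECTOR  # 16


def partial_blocks(start_sector: int, sector_count: int) -> int:
    """Count the physical blocks that a write covers only partially.

    start_sector and sector_count are in 2KiB logical sectors.
    """
    if sector_count <= 0:
        return 0
    end_sector = start_sector + sector_count  # exclusive end
    first_block = start_sector // SECTORS_PER_BLOCK
    last_block = (end_sector - 1) // SECTORS_PER_BLOCK
    head_partial = start_sector % SECTORS_PER_BLOCK != 0
    tail_partial = end_sector % SECTORS_PER_BLOCK != 0
    if first_block == last_block:
        # The whole write sits inside a single physical block.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)
```

A write of 16 sectors starting at sector 0 covers exactly one physical block, so the function returns 0. The same write starting at sector 8 straddles two physical blocks and returns 2.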
Version independent hints:
- Different versions of mdadm
write different member geometries under the same metadata version,
because the metadata version only indicates different
superblock offsets, and remains the same even with different
This is extremely dangerous when
- If an MD RAID superblock is at the end of a partition,
which is itself at the end of a partitioned block device, methods of
MD RAID member autodiscovery can wrongly guess
that the whole disk is an MD RAID set member, as the
superblock is at the end of the disk too.
- The type of superblock gets OR'ed with 1 when the MD
RAID set is being reshaped.
- The initial sync for RAID5 and RAID6 can be much faster if
the array is created with spares and missing drives:
raid6 resync (like raid5)
is optimised for an array that is already in sync. It
reads everything and checks the P and Q blocks. When it
finds P or Q that are wrong, it calculates the correct
value and goes back to write it out.
On a fresh drive, this will involve lots of writing
which means seeking back to write something. With a
larger stripe_cache, the writes can presumably be done
in larger slabs so there are fewer seeks.
You might get a better result by creating the array
with two missing devices and two spares. It will then
read the good devices completely linearly, and write the
spares completely linearly and so should get full
hardware speed with normal stripe_cache size.
For raid5, mdadm makes this arrangement automatically.
It doesn't for raid6.
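The autodiscovery hint earlier in this list (a superblock at the end of the last partition being mistaken for a whole-disk superblock) can be made concrete with offset arithmetic. The sketch below assumes the placement rules used by the Linux md driver for end-of-device superblocks: version 0.90 occupies the last 64KiB-aligned 64KiB slot, and version 1.0 sits at least 8KiB from the end, rounded down to a 4KiB boundary. The function names and the sample geometry are hypothetical.

```python
SECTOR = 512  # all sizes below are in 512-byte sectors


def sb_offset_0_90(dev_sectors: int) -> int:
    """Sector offset of a version 0.90 superblock (assumed md rule)."""
    reserved = 64 * 1024 // SECTOR  # 128 sectors
    return (dev_sectors & ~(reserved - 1)) - reserved


def sb_offset_1_0(dev_sectors: int) -> int:
    """Sector offset of a version 1.0 superblock (assumed md rule)."""
    start = dev_sectors - 8 * 2      # at least 8KiB from the end
    return start & ~(4 * 2 - 1)      # rounded down to 4KiB


def ambiguous(disk_sectors: int, part_start: int, part_sectors: int,
              locate) -> bool:
    """True if the partition's superblock lies exactly where a scan of
    the whole disk would also look for one."""
    return part_start + locate(part_sectors) == locate(disk_sectors)


# Hypothetical 1GiB disk with one partition starting at sector 2048.
DISK = 2_097_152
print(ambiguous(DISK, 2048, DISK - 2048, sb_offset_0_90))  # partition ends at disk end
print(ambiguous(DISK, 2048, DISK - 4096, sb_offset_0_90))  # 1MiB left free at the end
```

Under these assumptions, leaving some unpartitioned space at the end of the disk is enough to separate the two candidate locations.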
Version dependent hints for mdadm:
- Version 3.3 allows explicitly setting the data
- Version 3.2.5 automatically reduces the default 128MiB
data offset if this is required to fit data in the
- Version 3.2.4 introduces a default data offset of 128MiB
for version 1.1 and version 1.2 metadata.
- Version 3.1.2 changes the default data offset to 1MiB for
version 1.1 and 1.2 metadata.
- Version 3.1.2 changes the default metadata type to
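The data offset defaults listed above can be summarised as a small lookup. This is only a sketch of the list as written here, not a reading of the mdadm source: the function name is hypothetical, and versions older than 3.1.2 return None because the list does not state their default.

```python
MIB = 1024 * 1024


def default_data_offset(mdadm_version: str):
    """Default data offset in bytes for version 1.1 and 1.2 metadata.

    For 3.2.5 and later the 128MiB value is an upper bound, since the
    list says it is reduced automatically when needed.
    """
    parts = tuple(int(p) for p in mdadm_version.split("."))
    if parts >= (3, 2, 4):
        return 128 * MIB
    if parts >= (3, 1, 2):
        return 1 * MIB
    return None  # not covered by the list above


for version in ("3.1.1", "3.1.2", "3.2.4", "3.3"):
    print(version, default_data_offset(version))
```

Comparing version tuples rather than strings avoids ordering mistakes such as treating "3.10" as older than "3.2".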
Version dependent hints for the kernel module:
- Kernel 3.10.3 has a bug that can cause hangs on
- In corner cases with
kernel versions 3.2.1 and 3.3
a bug can cause the MD superblock to become corrupted on
reboot. The symptoms are that the MD set is inactive, all or
most members seem spares, and mdadm --examine
applied to the spares shows MD superblocks without a valid
RAID level and number of devices in the MD set.
- In kernels released before 2013
fsync on a read-only MD device is broken.
- In kernel 2.6.37
WRITE_FUA support was added to MD.
- In kernel 2.6.33
MD got barrier support for all types of RAID.
- In kernel 2.6.28 MD switched handling of MD devices with
respect to partitioning:
In Linux kernels prior to version 2.6.28 there were two
distinctly different types of md devices that could be
created: one that could be partitioned using standard
partitioning tools and one that could not. Since 2.6.28
that distinction is no longer relevant as both types of
devices can be partitioned.
- Kernels before 2.6.10 can only use version 0.90 metadata
- DM/LVM2 snapshot LVs can be very slow.
Version independent hints:
indexed disk-metadata the maximum
size of a DRBD block device is 4TiB or 4096GiB, not 4GiB as
the manual page says, given the 128MiB size of each indexed
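The 4TiB figure can be checked with a little arithmetic. The sketch below assumes that the DRBD on-disk bitmap tracks one bit per 4KiB of data and that the whole 128MiB slot is available to the bitmap. Both are simplifying assumptions made for illustration, so the result is an upper bound rather than an exact limit.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def max_tracked_bytes(slot_bytes: int = 128 * MIB,
                      bytes_per_bit: int = 4 * KIB) -> int:
    """Largest device a bitmap of slot_bytes can describe, assuming one
    bitmap bit per bytes_per_bit of data."""
    return slot_bytes * 8 * bytes_per_bit


limit = max_tracked_bytes()
print(limit // TIB, "TiB")
print(limit // GIB, "GiB")
```

Under these assumptions the limit is 2^42 bytes, which matches the 4TiB (4096GiB) stated above and is 1024 times larger than 4GiB.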
Version dependent hints: