Notes about Linux Btrfs
Updated: 2021-01-01
Created: 2016
As of 20171109 this page is still quite a draft: it is incomplete, but the written parts are likely to be mostly accurate.
This list of Btrfs terms in effect also describes the overall
external and internal structure of a Btrfs instance.
- volume: one or more block devices containing one or more subvolumes stored on one or more member devices; it is labeled by a 128-bit UUID and can also be labeled by a name.
- subvolume: a mountable root directory inode and all inodes reachable from it. It has an integer id and usually a default mount path within the root subvolume. Creating subvolumes is very quick; deleting them can take a long time, especially if the contents are heavily reflinked.
- root subvolume: the topmost subvolume associated with a volume; it has id 5.
- default subvolume: the subvolume that is mounted by default; unless changed it is the root subvolume.
- directory inode: an inode that contains the root of a B-tree of extents containing directory entries, which link to other inodes, for example file inodes.
- file inode: an inode which contains the root of a B-tree of reflinks to extents. The same extent can be shared by several inodes by way of them reflinking to it. Each extent has a backref to all the inodes reflinking to it.
- reflink: a B-tree entry linking from an inode to an extent. The term reflinking however is usually used in the special case of creating links after the first.
- backref: an entry in a B-tree linking from an extent to all the inodes that have a reflink to it. Extents with many backrefs can involve high CPU costs in several operations whose running costs are proportional to the product of the number of reflinks and backrefs.
- copy-on-write: also called redirect-on-write, it is the act of implementing a write to an existing extent as the creation of a new extent, the writing of the new data to that extent, and the updating of the reflink and backrefs from the inode to the new extent. Updating the reflink and backrefs can involve more copy-on-write, until the volume root directory inode is reached.
- generation: a new version of the volume root
directory inode.
- transaction id: a monotonically increasing number
that labels each generation of the volume root directory
inode.
- snapshot: a subvolume that starts as a reflinked copy of another subvolume; it can be read-write or entirely read-only, regardless of the permissions in its inodes. Read-only subvolumes cannot be deleted without turning them read-write first.
- extent: a set of nodes that are both logically and physically contiguous. The maximum size is 256MiB for ordinary extents and 128KiB for compressed extents.
- node: the unit of allocation of space, by default 16KiB; it can be as small as one virtual memory page. Nodes are grouped into chunks.
- chunk: an array of contiguous nodes of the same profile which is part of a single member device of a single volume. It can be mixed or not-mixed. Chunks can be (almost) any size, but by default data chunks are 1GiB long, metadata chunks are 256MiB long, system chunks are 32MiB long, and the last chunk on a device is whatever length fills the device.
- mixed chunk: a chunk that can contain nodes of both data and metadata kind. Since all nodes in a chunk must have the same profile, this means that data and metadata nodes must have the same profile.
- not-mixed chunk: a chunk that contains only data
or only metadata or only system nodes.
- member device: a block device that is part of a single Btrfs volume. It has a superblock, containing both the volume UUID and the member device UUID. Each block device is subdivided into chunks.
- free space: the amount of space in nodes that are not used.
- unallocated space: the space in chunks that
contain only unused nodes. It can happen that there is no
unallocated space yet plenty of free space in partially used
chunks.
- balancing: an operation that redistributes used nodes across chunks, to increase the amount of used nodes per chunk, potentially deallocating whole chunks (see the command sketch after this list). It operates by first allocating an unallocated chunk and moving used nodes into it from allocated chunks, so it will fail if there is no unallocated space. Balancing will also redistribute nodes more evenly across chunks on different member devices, and can be used to move nodes from chunks with a given profile to chunks with a different profile.
- defragmentation: an operation that copies extents
to contiguous (if possible) free nodes, making the extents
more contiguous. It relies critically on the availability of
contiguous free nodes, which are the result of balancing.
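The difference between free and unallocated space, and the effect of a filtered balance, can be seen with the standard btrfs tool; a minimal sketch, assuming the volume is mounted at /mnt/btr (the mount point and the 70% usage filters are just example values):

    # Space used within allocated chunks, per profile.
    btrfs filesystem df /mnt/btr
    # Per-device view, including space not yet allocated to any chunk.
    btrfs filesystem usage /mnt/btr
    # Filtered balance: rewrite only chunks that are at most 70% used,
    # which tends to return whole chunks to unallocated space.
    btrfs balance start -musage=70 -dusage=70 /mnt/btr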
- What currently works quite reliably:
- Basic POSIX filesystem features, and copy-on-write
updates, work quite well.
- Subvolumes, snapshots and reflinking work quite well, with some scalability limitations (see the command sketch after this list).
- Recoverability in case of issues is fairly good within the
limits above.
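The subvolume, snapshot and reflink operations mentioned above are each a single command; a minimal sketch, assuming the volume root subvolume is mounted at /mnt/btr and the subvolume names are just examples:

    # Create an ordinary (read-write) subvolume.
    btrfs subvolume create /mnt/btr/@home
    # Create a read-only snapshot of it (omit -r for a read-write snapshot).
    btrfs subvolume snapshot -r /mnt/btr/@home /mnt/btr/@home-20210101
    # Reflink a single file: a new inode sharing the same extents.
    cp --reflink=always /mnt/btr/@home/big.img /mnt/btr/@home/big.img.copy
    # Deleting a subvolume is quick to issue, but reclaiming its space
    # happens in the background and can take a long time.
    btrfs subvolume delete /mnt/btr/@home-20210101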
- What currently works with some limitations:
- Checksumming works well but it has several surprising
limitations and costs a lot of CPU time. The limitations are
that it leads to damage amplification unless metadata is
redundant, does not necessarily work with direct IO, and is
not very useful with NFS clients.
- The send/receive copy methods work, but they can be subtle to understand and don't copy some less used metadata like inode flags (a usage sketch follows this list).
- Automatic compression of files. The main limitation is
that there are many corner cases with surprising performance
behaviours related to compression. I personally think it is
not worth the complications.
- Updating-in-place of large files works well but usually
leads to greatly fragmented extents.
- Defragmentation works but since it works by making copies
of files it breaks reflinking and thus can greatly increase
space used.
- Full balance can be very slow.
- Very little testing and use is done on platforms other
than amd64.
- Like all current local filesystems, volumes larger than 4-8TB work well but have severe scalability problems in maintenance operations.
- Two-level allocation in chunks and nodes can lead to a situation where all chunks are allocated but there are plenty of free nodes, which can require simple but subtle workarounds.
- For some hard-to-imagine reasons non-privileged users can
create subvolumes and snapshots without limit.
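As a hedged sketch of how send/receive is typically used (the snapshot and backup paths are just examples): send operates on read-only snapshots, and incremental sends need the parent snapshot to exist on both sides:

    # Full copy of a read-only snapshot to another Btrfs filesystem.
    btrfs send /mnt/btr/@home-20210101 | btrfs receive /mnt/backup/
    # Incremental copy: send only the differences relative to a parent
    # snapshot that has already been received into /mnt/backup.
    btrfs send -p /mnt/btr/@home-20210101 /mnt/btr/@home-20210201 \
      | btrfs receive /mnt/backup/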
- Things that don't quite work:
- Multi-device volumes are fundamentally misdesigned, so even those implemented correctly behave strangely in some important situations. The single, raid0, raid1 and raid10 profiles mostly work. The raid1 and raid10 profiles in particular have unpleasant corner cases when operating degraded (a partial fix is available from kernel 4.14). There is no data loss, but the consequences can be time consuming.
- Quota groups have severe scalability problems, among other issues.
- Too many snapshots (more than 20-50 let's say) and other
forms of reflinking have severe scalability issues.
- The ssd space allocator behaves badly on almost every device and in almost every situation.
- In particular be careful about the list of
major issues.
So my recommended pattern of use is:
- Use of the latest possible kernel and tools, that is at
least version 4.0.
- Filesystems up to 4-8TB on a single block device with
profile single for data and dup for
metadata, unless the block device is already redundant
(for example an MD RAID1 one).
- Leave checksums enabled by default, as most recent-ish
systems can compute them fast enough.
- Subvolumes and snapshots in a flat arrangement, all under the volume root subvolume. Only the currently-in-use subvolume is mounted.
- No use of quota groups, and no more than 20-50 subvolumes or snapshots or highly reflinked large files.
- Use of the nocow inode flag on files that are
updated in-place, like DBMS data files and VM virtual disk
images and logs.
- Periodic balancing with filters like -musage=70
-dusage=70.
- Always use the nossd mount option (see the command sketch after this list).
- If you have a workload with a low rate of file creation, use the nospace_cache mount option; otherwise use a recent kernel that allows the space_cache=v2 option.
- No automatic volume-wide defragmentation, only periodic defragmentation of specific large, highly fragmented files.
- Backup via send/receive only when it is
well understood what it does and how, and inode flags do not
need to be preserved.
- Consider reverting to the earlier default of 4KiB nodes.
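A sketch of how some of these recommendations translate into commands; the device, mount point and file names, and the 70% usage values, are just examples, not an authoritative recipe:

    # Single-device filesystem: single data, dup metadata, 4KiB nodes.
    mkfs.btrfs -d single -m dup -n 4096 /dev/sdb1
    # Recommended mount options for a low file-creation workload
    # (use space_cache=v2 instead on a recent enough kernel).
    mount -o nossd,nospace_cache /dev/sdb1 /mnt/btr
    # Mark a directory as nocow so files created in it afterwards are
    # updated in place (the flag has no effect on existing file data).
    chattr +C /mnt/btr/vmimages
    # Periodic maintenance: filtered balance, plus targeted defragmentation
    # of specific badly fragmented files (note that it breaks reflinks).
    btrfs balance start -musage=70 -dusage=70 /mnt/btr
    btrfs filesystem defragment /mnt/btr/vmimages/somevm.img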
Another list of major issues.
- The raid1 profile considers a multi-device set with 1 member device as not-RAID1 and thus only allows mounting it degraded,ro. However, as a special case it can be mounted degraded,rw once. This means that if a single device is left in a raid1 set, proper recovery involves converting it to the single profile (see case 1 of the command sketch after this list).
If the remaining device of a raid1 profile set has
been remounted degraded,rw once already, it can
only be mounted degraded,ro, and another recovery
procedure must be used:
- Use mkfs.btrfs to format the new device as an
entirely new filetree with single profile and
mount it.
- Copy the contents of the remaining device mounted degraded,ro into the new Btrfs filetree on the new device.
- Unmount the remaining device, add it with btrfs device add -f to the new filetree, and convert the profile of the new filetree back to raid1.
- The raid5 profile allows 1 or more member
devices. With 1 or 2 member devices there is an implicit
all-zeroes member for XOR purposes.
When a raid5 set is reduced to 1 working member
it must be mounted in degraded mode, but it can be
mounted degraded,rw multiple times. While it is
possible to add new members to a raid5 set with 1
working member, it is not possible to remove the non-working
members from it, and if they are not removed the filetree will always have to be mounted as degraded. The only way
to remove them is to convert the set to single
(unless the new member device has the same major,minor as
the missing one).
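A hedged sketch of the two raid1 recovery paths just described, assuming /dev/sda is the surviving member and /dev/sdb is its replacement; these are the commonly documented commands, not an authoritative procedure:

    # Case 1: the survivor can still be mounted degraded,rw once:
    # convert to non-redundant profiles and drop the missing member.
    mount -o degraded /dev/sda /mnt/btr
    btrfs balance start -dconvert=single -mconvert=dup /mnt/btr
    btrfs device remove missing /mnt/btr
    # Later, add the replacement device and convert back to raid1.
    btrfs device add /dev/sdb /mnt/btr
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/btr

    # Case 2: the survivor only mounts degraded,ro: format the new device,
    # copy the contents over, then re-add the old device and convert.
    mkfs.btrfs -d single -m dup /dev/sdb
    mount /dev/sdb /mnt/new
    mount -o degraded,ro /dev/sda /mnt/btr
    cp -a /mnt/btr/. /mnt/new/
    umount /mnt/btr
    btrfs device add -f /dev/sda /mnt/new
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new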
Another list of kernel version dependent hints.
- Kernel 5.5 has 3-way and 4-way mirroring modes.
- Kernel 4.14 has better behavior on mounting degraded
filetrees, allowing mounts based on chunk availability
instead of device availability.
- Kernels before 4.10rc1 have very high CPU usage when deleting or balancing inodes with many reflinks, such as files shared in many snapshots.
- Kernels between 4.5.8 and 4.8.5 inclusive have various issues under load.
- Kernels 4.7.x have a kernel memory allocation issue with
order 2 blocks.
- Kernel 4.5 has the version 2 space cache.
- Kernels before 4.4 can have very slow writing when disk free
space is fragmented.
- Kernels before 4.3 did not have a fully working FSTRIM implementation.
- Kernel 4.0 has a Btrfs bug that prevents RAID reshaping
from succeeding.
- Kernel 3.19 to 3.19.4 can deadlock at mount time.
- Scrubbing or replacing devices in parity RAID profiles
only work from kernel 3.19.
- Kernel 3.15 to 3.16.1 can deadlock during heavy IO with
compression.
- Kernel 3.14 has improved representation for holes. This is
not backwards compatible (no-holes).
- Kernel 3.10 and later have skinny metadata. This is not backwards compatible.
- Kernel 3.10 has many bug fixes.
- Kernel 3.7 and later allows up to 64k hard-links
(extended-iref). This is not backwards
compatible.
- Kernel 3.5 and later have a large speedup for
fsync.
- Conversion of RAID profiles works only from kernel 3.3.
- Btrfs in kernel versions before 3.2 is not particularly
reliable.
- Btrfs mounts all available block devices with the same volume UUID as the one being mounted, therefore block-by-block copies of a Btrfs block device can be dangerous (see the command sketch after this list).
- Adding to a Btrfs filesystem a block device that was already part of it causes data loss. Versions of the btrfs tool newer than 0.19 check for that.
- Btrfs directories always have a link count of 1.
- Btrfs volumes always have inode counts (maximum, used) of
0.
- Files subject to many small scattered updates can become extremely fragmented when allocated with copy-on-write, as updates usually split an extent in three parts.
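Related to the duplicate-UUID danger above, a minimal sketch of how to see which devices carry a given volume UUID and how to give a block-level clone a fresh UUID before it is ever mounted (btrfstune -u needs a reasonably recent btrfs-progs and an unmounted, clean filesystem):

    # List Btrfs volumes and their member devices, grouped by volume UUID.
    btrfs filesystem show
    # Give a cloned, unmounted filesystem a new random UUID so the kernel
    # cannot confuse it with the original.
    btrfstune -u /dev/sdc1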
These are pointers to some of the entries in my technical blog where Btrfs is discussed:
- 171004: C vs. C++ and the cost of system level features
- 170424: Btrfs and NFS service and NFS daemon Ganesha
- 170407: Unusual filesystem properties
- 170302: Some coarse speed tests with Btrfs etc. and small files
- 170228: Some coarse speed tests with various Linux filesystems
- 161217: The most interesting filesystem types
- 160104: An amusing demand about 'fsync', and more about 'fsync'
- 131005: ZFS and 'fsck'
- 120303c: SUSE SLES 11 SP2 has massive updates and switch to Btrfs
- 120211: Another Btrfs presentation
- 120203: Interview with leader of Btrfs development