Notes about Linux Btrfs
Updated: 2021-01-01
Created: 2016
As of 20171109 this page is still quite a draft: it is incomplete, but the written parts are likely to be mostly accurate.
This list of Btrfs terms in effect also describes the overall
external and internal structure of a Btrfs instance.
- volume: one or more block devices containing one or more subvolumes stored on one or more member devices; it is labeled by a 128-bit UUID and can also be labeled by a name.
- subvolume: a mountable root directory inode and all inodes reachable from it. It has an integer id and usually a default mount path within the root subvolume. Creating subvolumes is very quick; deleting them can take a long time, especially if the contents are heavily reflinked.
- root subvolume: the topmost subvolume associated with a volume; it has id 5.
- default subvolume: the subvolume that is mounted by default; unless changed it is the root subvolume.
- directory inode: an inode that contains the root of a B-tree of extents containing directory entries, which link to other inodes, for example file inodes.
- file inode: an inode which contains the root of a B-tree of reflinks to extents. The same extent can be shared by several inodes by way of them reflinking to it. Each extent has a backref to all the inodes reflinking to it.
- reflink: a B-tree entry linking from an inode to an extent. The term reflinking however is usually used in the special case of creating links after the first.
- backref: an entry in a B-tree linking from an extent to all the inodes that have a reflink to it. Extents with many backrefs can involve high CPU costs in several operations whose running costs are proportional to the product of the number of reflinks and backrefs.
- copy-on-write: also called redirect-on-write, it is the act of implementing a write to an existing extent as the creation of a new extent, the writing of the new data to that extent, and the updating of the reflink and backrefs from the inode to the new extent. Updating the reflink and backrefs can involve more copy-on-write, until the volume root directory inode is reached.
- generation: a new version of the volume root
directory inode.
- transaction id: a monotonically increasing number
that labels each generation of the volume root directory
inode.
- snapshot: a subvolume that starts as a reflinked copy of another subvolume; it can be read-write or entirely read-only, regardless of the permissions in its inodes. Read-only subvolumes cannot be deleted without turning them read-write first.
- extent: a set of nodes that are both logically and physically contiguous. The maximum size is 256MiB for ordinary extents and 128KiB for compressed extents.
- node: the unit of allocation of space, by default 16KiB; it can be as small as one virtual memory page. Nodes are grouped into chunks.
- chunk: an array of contiguous nodes of the same profile which is part of a single member device of a single volume. It can be mixed or not-mixed. Chunks can be (almost) any size, but by default data chunks are 1GiB long, metadata chunks are 256MiB long, system chunks are 32MiB long, and the last chunk on a device is whatever length fills the device.
- mixed chunk: a chunk that can contain nodes of both data and metadata kind. Since all nodes in a chunk must have the same profile, this means that data and metadata nodes must have the same profile.
- not-mixed chunk: a chunk that contains only data
or only metadata or only system nodes.
- member device: a block device that is part of a single Btrfs volume. It has a superblock, containing both the volume UUID and the member device UUID. Each block device is subdivided into chunks.
- free space: the amount of space in nodes that are not used.
- unallocated space: the space in chunks that
contain only unused nodes. It can happen that there is no
unallocated space yet plenty of free space in partially used
chunks.
- balancing: an operation that redistributes used nodes across chunks, to increase the amount of used nodes per chunk, potentially deallocating whole chunks (see the command sketch after this list). It operates by first allocating an unallocated chunk and moving used nodes into it from allocated chunks, so it will fail if there is no unallocated space. Balancing will also redistribute nodes more evenly across chunks on different member devices, and can be used to move nodes from chunks with a given profile to chunks with a different profile.
- defragmentation: an operation that copies extents
to contiguous (if possible) free nodes, making the extents
more contiguous. It relies critically on the availability of
contiguous free nodes, which are the result of balancing.
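The difference between free and unallocated space, and the effect of a filtered balance, can be seen with the standard btrfs tool; a minimal sketch, assuming the volume is mounted at /mnt/btr (the mount point and the 70% usage filters are just example values):

    # Space used within allocated chunks, per profile.
    btrfs filesystem df /mnt/btr
    # Per-device view, including space not yet allocated to any chunk.
    btrfs filesystem usage /mnt/btr
    # Filtered balance: rewrite only chunks that are at most 70% used,
    # which tends to return whole chunks to unallocated space.
    btrfs balance start -musage=70 -dusage=70 /mnt/btr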
- What currently works quite reliably:
- Basic POSIX filesystem features, and copy-on-write
updates, work quite well.
- Subvolumes, snapshots and reflinking work quite well, with some scalability limitations (see the command sketch after this list).
- Recoverability in case of issues is fairly good within the
limits above.
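The subvolume, snapshot and reflink operations mentioned above are each a single command; a minimal sketch, assuming the volume root subvolume is mounted at /mnt/btr and the subvolume names are just examples:

    # Create an ordinary (read-write) subvolume.
    btrfs subvolume create /mnt/btr/@home
    # Create a read-only snapshot of it (omit -r for a read-write snapshot).
    btrfs subvolume snapshot -r /mnt/btr/@home /mnt/btr/@home-20210101
    # Reflink a single file: a new inode sharing the same extents.
    cp --reflink=always /mnt/btr/@home/big.img /mnt/btr/@home/big.img.copy
    # Deleting a subvolume is quick to issue, but reclaiming its space
    # happens in the background and can take a long time.
    btrfs subvolume delete /mnt/btr/@home-20210101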
- What currently works with some limitations:
- Checksumming works well but it has several surprising
limitations and costs a lot of CPU time. The limitations are
that it leads to damage amplification unless metadata is
redundant, does not necessarily work with direct IO, and is
not very useful with NFS clients.
- The send/receive copy methods work, but they can be subtle to understand and don't copy some less used metadata like inode flags (a usage sketch follows this list).
- Automatic compression of files. The main limitation is
that there are many corner cases with surprising performance
behaviours related to compression. I personally think it is
not worth the complications.
- Updating-in-place of large files works well but usually
leads to greatly fragmented extents.
- Defragmentation works but since it works by making copies
of files it breaks reflinking and thus can greatly increase
space used.
- Full balance can be very slow.
- Very little testing and use is done on platforms other
than amd64.
- Like all current local filesystems, volumes larger than 4-8TB work well but have severe scalability problems in maintenance operations.
- Two-level allocation in chunks and nodes can lead to a situation where all chunks are allocated but there are plenty of free nodes, which can require simple but subtle workarounds.
- For some hard-to-imagine reasons non-privileged users can
create subvolumes and snapshots without limit.
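As a hedged sketch of how send/receive is typically used (the snapshot and backup paths are just examples): send operates on read-only snapshots, and incremental sends need the parent snapshot to exist on both sides:

    # Full copy of a read-only snapshot to another Btrfs filesystem.
    btrfs send /mnt/btr/@home-20210101 | btrfs receive /mnt/backup/
    # Incremental copy: send only the differences relative to a parent
    # snapshot that has already been received into /mnt/backup.
    btrfs send -p /mnt/btr/@home-20210101 /mnt/btr/@home-20210201 \
      | btrfs receive /mnt/backup/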
- Things that don't quite work:
- Multi-device volumes are fundamentally misdesigned, so even those implemented correctly behave strangely in some important situations. The single, raid0, raid1 and raid10 profiles mostly work. The raid1 and raid10 profiles in particular have unpleasant corner cases when operating degraded (a partial fix is available from kernel 4.14). There is no data loss, but the consequences can be time consuming.
- Quota groups have severe scalability problems, among other issues.
- Too many snapshots (more than 20-50 let's say) and other
forms of reflinking have severe scalability issues.
- The ssd space allocator behaves badly on almost every device and in almost every situation.
- In particular be careful about the list of
major issues.
So my recommended pattern of use is:
- Use of the latest possible kernel and tools, that is at
least version 4.0.
- Filesystems up to 4-8TB on a single block device with
profile single for data and dup for
metadata, unless the block device is already redundant
(for example an MD RAID1 one).
- Leave checksums enabled by default, as most recent-ish
systems can compute them fast enough.
- Subvolumes and snapshots in a flat arrangement, all under the volume root subvolume. Only the currently-in-use subvolume is mounted.
- No use of quota groups, and no more than 20-50 subvolumes or snapshots or highly reflinked large files.
- Use of the nocow inode flag on files that are
updated in-place, like DBMS data files and VM virtual disk
images and logs.
- Periodic balancing with filters like -musage=70
-dusage=70.
- Always use the nossd mount option (see the command sketch after this list).
- If you have a workload with a low rate of file creation, use the nospace_cache mount option; otherwise use a recent kernel that allows the space_cache=v2 option.
- No automatic volume-wide defragmentation, only periodic defragmentation of specific large, highly fragmented files.
- Backup via send/receive only when it is
well understood what it does and how, and inode flags do not
need to be preserved.
- Consider reverting to the earlier default of 4KiB nodes.
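A sketch of how some of these recommendations translate into commands; the device, mount point and file names, and the 70% usage values, are just examples, not an authoritative recipe:

    # Single-device filesystem: single data, dup metadata, 4KiB nodes.
    mkfs.btrfs -d single -m dup -n 4096 /dev/sdb1
    # Recommended mount options for a low file-creation workload
    # (use space_cache=v2 instead on a recent enough kernel).
    mount -o nossd,nospace_cache /dev/sdb1 /mnt/btr
    # Mark a directory as nocow so files created in it afterwards are
    # updated in place (the flag has no effect on existing file data).
    chattr +C /mnt/btr/vmimages
    # Periodic maintenance: filtered balance, plus targeted defragmentation
    # of specific badly fragmented files (note that it breaks reflinks).
    btrfs balance start -musage=70 -dusage=70 /mnt/btr
    btrfs filesystem defragment /mnt/btr/vmimages/somevm.img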
Another list of major issues.
- The raid1 profile considers a multi-device set with 1 member device as not-RAID1 and thus only allows mounting it degraded,ro. However, as a special case it can be mounted degraded,rw once. This means that if a single device is left in a raid1 set, proper recovery involves converting it to the single profile (see case 1 of the command sketch after this list).
If the remaining device of a raid1 profile set has
been remounted degraded,rw once already, it can
only be mounted degraded,ro, and another recovery
procedure must be used:
- Use mkfs.btrfs to format the new device as an
entirely new filetree with single profile and
mount it.
- Copy the contents of the remaining device mounted degraded,ro into the new Btrfs filetree on the new device.
- Unmount the remaining device, add it with btrfs device add -f to the new filetree, and convert the profile of the new filetree back to raid1.
- The raid5 profile allows 1 or more member
devices. With 1 or 2 member devices there is an implicit
all-zeroes member for XOR purposes.
When a raid5 set is reduced to 1 working member
it must be mounted in degraded mode, but it can be
mounted degraded,rw multiple times. While it is
possible to add new members to a raid5 set with 1
working member, it is not possible to remove the non-working
members from it, and if they are not removed the filetree will always have to be mounted as degraded. The only way
to remove them is to convert the set to single
(unless the new member device has the same major,minor as
the missing one).
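A hedged sketch of the two raid1 recovery paths just described, assuming /dev/sda is the surviving member and /dev/sdb is its replacement; these are the commonly documented commands, not an authoritative procedure:

    # Case 1: the survivor can still be mounted degraded,rw once:
    # convert to non-redundant profiles and drop the missing member.
    mount -o degraded /dev/sda /mnt/btr
    btrfs balance start -dconvert=single -mconvert=dup /mnt/btr
    btrfs device remove missing /mnt/btr
    # Later, add the replacement device and convert back to raid1.
    btrfs device add /dev/sdb /mnt/btr
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/btr

    # Case 2: the survivor only mounts degraded,ro: format the new device,
    # copy the contents over, then re-add the old device and convert.
    mkfs.btrfs -d single -m dup /dev/sdb
    mount /dev/sdb /mnt/new
    mount -o degraded,ro /dev/sda /mnt/btr
    cp -a /mnt/btr/. /mnt/new/
    umount /mnt/btr
    btrfs device add -f /dev/sda /mnt/new
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new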
Another list of kernel version dependent hints.
- Kernel 5.5 has 3-way and 4-way mirroring modes.
- Kernel 4.14 has better behavior on mounting degraded
filetrees, allowing mounts based on chunk availability
instead of device availability.
- Kernels before 4.10rc1 have very high CPU usage when deleting or balancing inodes with many reflinks, such as files shared in many snapshots.
- Kernels between 4.5.8 and 4.8.5 inclusive have various issues under load.
- Kernels 4.7.x have a kernel memory allocation issue with
order 2 blocks.
- Kernel 4.5 has the version 2 space cache.
- Kernels before 4.4 can have very slow writing when disk free
space is fragmented.
- Kernels before 4.3 did not have a fully working FSTRIM implementation.
- Kernel 4.0 has a Btrfs bug that prevents RAID reshaping
from succeeding.
- Kernel 3.19 to 3.19.4 can deadlock at mount time.
- Scrubbing or replacing devices in parity RAID profiles
only work from kernel 3.19.
- Kernel 3.15 to 3.16.1 can deadlock during heavy IO with
compression.
- Kernel 3.14 has improved representation for holes. This is
not backwards compatible (no-holes).
- Kernel 3.10 and later have skinny metadata. This is not backwards compatible.
- Kernel 3.10 has many bug fixes.
- Kernel 3.7 and later allows up to 64k hard-links
(extended-iref). This is not backwards
compatible.
- Kernel 3.5 and later have a large speedup for
fsync.
- Conversion of RAID profiles works only from kernel 3.3.
- Btrfs in kernel versions before 3.2 is not particularly
reliable.
- Btrfs mounts all available block devices with the same volume UUID as the one being mounted, therefore block-by-block copies of a Btrfs block device can be dangerous (see the command sketch after this list).
- Adding to a Btrfs filesystem a block device that was already part of it causes data loss. Versions of the btrfs tool newer than 0.19 check for that.
- Btrfs directories always have a link count of 1.
- Btrfs volumes always have inode counts (maximum, used) of
0.
- Files subject to many small scattered updates can become extremely fragmented when allocated with copy-on-write, as updates usually split an extent in three parts.
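Related to the duplicate-UUID danger above, a minimal sketch of how to see which devices carry a given volume UUID and how to give a block-level clone a fresh UUID before it is ever mounted (btrfstune -u needs a reasonably recent btrfs-progs and an unmounted, clean filesystem):

    # List Btrfs volumes and their member devices, grouped by volume UUID.
    btrfs filesystem show
    # Give a cloned, unmounted filesystem a new random UUID so the kernel
    # cannot confuse it with the original.
    btrfstune -u /dev/sdc1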
These are pointers to some of the entries in my technical blog where Btrfs is discussed:
- 171004: C vs. C++ and the cost of system level features
- 170424: Btrfs and NFS service and NFS daemon Ganesha
- 170407: Unusual filesystem properties
- 170302: Some coarse speed tests with Btrfs etc. and small files
- 170228: Some coarse speed tests with various Linux filesystems
- 161217: The most interesting filesystem types
- 160104: An amusing demand about 'fsync', and more about 'fsync'
- 131005: ZFS and 'fsck'
- 120303c: SUSE SLES 11 SP2 has massive updates and switch to Btrfs
- 120211: Another Btrfs presentation
- 120203: Interview with leader of Btrfs development