Notes about Linux Btrfs
    Updated: 2021-01-01
Created: 2016
    
      
    
    As of 20171109 this page is still a draft and incomplete, but the
      written parts are likely to be mostly accurate.
    
      
      This list of Btrfs terms in effect also describes the overall
	external and internal structure of a Btrfs instance.
      
	- volume: one or more subvolumes stored on one or
	  more member devices (block devices); it is labeled by a
	  128-bit UUID and can also be labeled by a
	  name.
- subvolume: a mountable root
	  directory inode and all inodes reachable from it. It has
	  an integer id and usually a default mount path within the root subvolume.
 Creating subvolumes is very quick, deleting them can take
	  a long time especially if the contents are heavily reflinked.
- root subvolume: the topmost subvolume associated
	  with a volume, has id 5.
- default subvolume: the subvolume that is mounted
	  by default, and it is by default the root subvolume.
- directory inode: an inode that contains the root
	  of a B-tree of extents containing directory
	  entries, linking to other inodes, for example file
	  inodes.
- file inode: an inode which contains the root of a
	  B-tree of reflinks to extents. The same extent can be shared by
	  several inodes by way of them reflinking to it. Each extent
	  has a backref to all the inodes reflinking
	  to it.
- reflink: a B-tree entry linking from an inode to
	  an extent. The term reflinking, however, is
	  usually used for the special case of creating links after the
	  first.
- backref: an entry in a B-tree linking from an
	  extent to all the inodes that have a reflink to it.
	  
 Extents with many backrefs can involve high CPU costs in
	  several operations whose running costs are proportional to the
	  product of the number of reflinks and backrefs.
- copy-on-write: also called redirect-on-write, the act of implementing
	  a write to an existing extent as the creation of a new
	  extent, the writing of the new data to that extent, and
	  updating the reflink and backrefs from the inode to the new
	  extent. Updating the reflink and backrefs can involve more
	  copy-on-write, until the volume root directory inode is reached.
- generation: a new version of the volume root
	  directory inode.
- transaction id: a monotonically increasing number
	  that labels each generation of the volume root directory
	  inode.
- snapshot: a subvolume that starts as the
	  reflinked copy of another subvolume; it can be read-write, or
	  entirely read-only regardless of the permissions in the
	  inodes.
	  
 Read-only subvolumes cannot be deleted without turning
	  them read-write first.
- extent: a set of nodes that
	  are both logically and physically contiguous. The maximum size
	  is 256MiB for ordinary extents and 128KiB for compressed
	  extents.
- node: the unit of allocation of space, by default
	  16KiB, can be as small as one virtual memory page. Nodes are
	  grouped into chunks.
- chunk: an array of contiguous nodes of the same profile which is part of a single member device of a single volume. It can be mixed or
	  not-mixed.
 Chunks can be (almost) any size, but by default data
	  chunks are 1GiB long, metadata chunks are 256MiB long, system
	  chunks are 32MiB long, and the last chunk on a device is
	  whatever length fills the device.
- mixed chunk: a chunk that can contain nodes of
	  both data and metadata kind. Since all nodes in a chunk must
	  have the same profile this means that data
	  and metadata nodes must have the same profile.
- not-mixed chunk: a chunk that contains only data
	  or only metadata or only system nodes.
- member device: A block device that is part of a
	  single Btrfs volume. It has a superblock, containing both the
	  volume UUID and the member device UUID. Each block device is
	  subdivided into chunks.
- free space: the amount of space in nodes that
	  are not used.
- unallocated space: the space in chunks that
	  contain only unused nodes. It can happen that there is no
	  unallocated space yet plenty of free space in partially used
	  chunks.
- balancing: an operation that redistributes used
	  nodes across chunks, to increase the number of used nodes per
	  chunk, potentially deallocating whole chunks. It operates by
	  first allocating an unallocated chunk and moving used
	  nodes to it from allocated chunks, so it will fail if there is
	  no unallocated space left. Balancing will also redistribute nodes more equally
	  across chunks on different member devices, and can be used to
	  move nodes from chunks with a given profile to chunks with a
	  different profile.
- defragmentation: an operation that copies extents
	  to contiguous (if possible) free nodes, making the extents
	  more contiguous. It relies critically on the availability of
	  contiguous free nodes, which are the result of balancing.
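
The difference between free space and unallocated space, and why balancing needs some unallocated space to start, can be illustrated with a toy model. This is only a sketch under loud assumptions: the class names, the value of NODES_PER_CHUNK and the all-or-nothing balance() are hypothetical simplifications of mine, not how the kernel implements any of this.

```python
from dataclasses import dataclass, field
from typing import List

NODES_PER_CHUNK = 4  # hypothetical toy value; real chunks hold far more nodes


@dataclass
class Chunk:
    used: int = 0  # number of used nodes in this chunk


@dataclass
class Device:
    total_chunks: int  # how many chunks fit on the device
    chunks: List[Chunk] = field(default_factory=list)  # allocated chunks only

    def unallocated_nodes(self) -> int:
        """Space (counted in nodes) not yet assigned to any chunk."""
        return (self.total_chunks - len(self.chunks)) * NODES_PER_CHUNK

    def free_nodes_in_chunks(self) -> int:
        """Unused nodes that sit inside already allocated chunks."""
        return sum(NODES_PER_CHUNK - c.used for c in self.chunks)

    def balance(self) -> None:
        """Repack used nodes into as few chunks as possible (toy version)."""
        if self.unallocated_nodes() == 0:
            # Simplification: the toy refuses outright when nothing is unallocated.
            raise RuntimeError("no unallocated chunk to start balancing into")
        used = sum(c.used for c in self.chunks)
        full, rest = divmod(used, NODES_PER_CHUNK)
        self.chunks = [Chunk(NODES_PER_CHUNK) for _ in range(full)]
        if rest:
            self.chunks.append(Chunk(rest))


# Every chunk allocated, yet most nodes unused: plenty of free space, none unallocated.
stuck = Device(total_chunks=3, chunks=[Chunk(1), Chunk(1), Chunk(1)])

# Same usage, but one chunk still unallocated, so balancing can proceed.
roomy = Device(total_chunks=4, chunks=[Chunk(1), Chunk(1), Chunk(1)])
roomy.balance()
```

In this model the device called stuck has 9 free nodes but cannot be balanced, while the device called roomy ends up with a single partially used chunk and three unallocated ones.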
 
    
      
      
	- What currently works quite reliably:
	  
	    - Basic POSIX filesystem features, and copy-on-write
	      updates, work quite well.
- Subvolumes and snapshots
	      and reflinking work quite well, with some scalability
	      limitations.
- Recoverability in case of issues is fairly good within the
	      limits above.
 
- What currently works with some limitations:
	  
	    - Checksumming works well but it has several surprising
	      limitations and costs a lot of CPU time. The limitations are
	      that it leads to damage amplification unless metadata is
	      redundant, does not necessarily work with direct IO, and is
	      not very useful with NFS clients.
- The send/receive copy methods work but
	      they can be subtle to understand and don't copy some less
	      used metadata like inode flags.
- Automatic compression of files. The main limitation is
	      that there are many corner cases with surprising performance
	      behaviours related to compression. I personally think it is
	      not worth the complications.
- Updating-in-place of large files works well but usually
	      leads to greatly fragmented extents.
- Defragmentation works but since it works by making copies
	      of files it breaks reflinking and thus can greatly increase
	      space used.
- Full balance can be very slow.
- Very little testing and use is done on platforms other
	      than amd64.
- Like all current local filesystems, volumes larger than
	      4-8TB work well but have severe scalability problems in
	      maintenance operations.
- Two level allocation in chunks and nodes can lead to a
	      situation where all chunks are allocated but there is plenty
	      of free nodes, which can require simple but subtle
	      workarounds.
- For some hard-to-imagine reasons non-privileged users can
	      create subvolumes and snapshots without limit.
 
- Things that don't quite work:
	  
	    - Multi-device volumes are fundamentally misdesigned, so
	      even the profiles that are implemented correctly behave strangely in
	      some important situations. The profiles single,
	      raid0, raid1 and raid10
	      mostly work. The raid1 and raid10
	      profiles in particular have unpleasant corner cases when
	      operating degraded (a partial fix is
	      available from kernel 4.14). There is no data loss, but
	      the consequences can be time consuming.
- Quota groups have severe scalability problems and other
	      problems.
- Too many snapshots (more than 20-50 let's say) and other
	      forms of reflinking have severe scalability issues.
- The ssd space allocator behaves badly on almost
	      every device and situation.
- In particular be careful about the list of
	      major issues.
 
So my recommended pattern of use is:
      
	- Use of the latest possible kernel and tools, that is at
	  least version 4.0.
- Filesystems up to 4-8TB on a single block device with
	  profile single for data and dup for
	  metadata, unless the block device is already redundant
	  (for example an MD RAID1 one).
- Leave checksums enabled by default, as most recent-ish
	  systems can compute them fast enough.
- Subvolumes and snapshots in a flat arrangement, all under the volume root subvolume. Only the
	  currently-in-use subvolume is mounted.
- No use of quota groups, and no more than 20-50 subvolumes
	  or snapshots or highly reflinked large files.
- Use of the nocow inode flag on files that are
	  updated in-place, like DBMS data files and VM virtual disk
	  images and logs.
- Periodic balancing with filters like -musage=70
	    -dusage=70.
- Always use the nossd mount option.
- If you have a workload with a low rate of file creation,
	  use the nospace_cache mount option; otherwise use a
	  recent kernel that allows the space_cache=v2
	  option.
- No automatic volume-wide defragmentation, only periodic
	  defragmentation of specific large, highly fragmented files.
- Backup via send/receive only when it is
	  well understood what it does and how, and inode flags do not
	  need to be preserved.
- Consider reverting to the earlier default of 4KiB nodes.
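
A few of the recommendations above can be turned into a small helper that only builds strings and argument lists, without running anything. This is a sketch: the function names, the mount point /mnt/data and the threshold default are my own hypothetical choices, and the (4, 5) version check is taken from the kernel hint in these notes that kernel 4.5 has the version 2 space cache.

```python
from typing import List, Tuple


def mount_options(low_file_creation_rate: bool, kernel: Tuple[int, int]) -> str:
    """Compose mount options following the recommendations in these notes."""
    opts = ["nossd"]  # the notes recommend always using nossd
    if low_file_creation_rate:
        opts.append("nospace_cache")
    elif kernel >= (4, 5):
        # Assumption from the kernel hints: the v2 space cache needs kernel 4.5+.
        opts.append("space_cache=v2")
    return ",".join(opts)


def balance_command(mountpoint: str, usage: int = 70) -> List[str]:
    """Build (but do not run) a filtered balance of mostly empty chunks."""
    return [
        "btrfs", "balance", "start",
        f"-musage={usage}", f"-dusage={usage}",
        mountpoint,
    ]


def nocow_command(path: str) -> List[str]:
    """Build (but do not run) the command setting the nocow inode flag."""
    return ["chattr", "+C", path]


# Hypothetical mount point, for illustration only.
print(mount_options(low_file_creation_rate=False, kernel=(4, 14)))
print(" ".join(balance_command("/mnt/data")))
```

The nocow flag is best set on an empty file or on a directory so that new files inherit it, which is why the helper takes any path rather than assuming an existing data file.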
 
    
      
      
	
	Another list of major issues.
	  
	
	  - The raid1 profile considers a multi-device volume with 1
	    member device as not-RAID1 and thus only allows mounting it
	    degraded,ro. However as a special case it can be
	    mounted degraded,rw once. This
	    means that if a single device is left for a raid1
	    set proper recovery involves converting it to the
	    single profile as follows:
	    
	    If the remaining device of a raid1 profile set has
	    been remounted degraded,rw once already, it can
	    only be mounted degraded,ro, and another recovery
	    procedure must be used:
	    
	      - Use mkfs.btrfs to format the new device as an
		entirely new filetree with single profile and
		mount it.
- Copy the contents of the remaining device mounted
		degraded,ro into the new Btrfs filetree on the
		new device.
- Unmount the remaining device, add it to the new filetree with
		btrfs device add -f, and convert the profile
		of the new filetree back to raid1.
 
- The raid5 profile allows 1 or more member
	    devices. With 1 or 2 member devices there is an implicit
	    all-zeroes member for XOR purposes.
	    
 When a raid5 set is reduced to 1 working member
	    it must be mounted in degraded mode, but it can be
	    mounted degraded,rw multiple times. While it is
	    possible to add new members to a raid5 set with 1
	    working member, it is not possible to remove the non-working
	    members from it, and if they are not removed the filetree
	    will always have to be mounted as degraded. The only way
	    to remove them is to convert the set to single
	    (unless the new member device has the same major,minor as
	    the missing one).
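
The second raid1 recovery procedure in the list above (new filetree, copy, re-add, convert) can be written down as a command plan. This is a hedged sketch, not a tested recovery script: the device and mount point names are hypothetical, the use of cp -a for the copy step is my assumption (it does not preserve subvolume or snapshot structure), and the function only returns the commands so they can be reviewed before anything is run.

```python
from typing import List


def raid1_rebuild_plan(old_dev: str, new_dev: str,
                       old_mnt: str, new_mnt: str) -> List[List[str]]:
    """Return the commands, in order, for rebuilding a raid1 set
    whose surviving member can only be mounted degraded,ro."""
    return [
        # 1. Format the new device as a new filetree with the single profile.
        ["mkfs.btrfs", "-d", "single", "-m", "single", new_dev],
        ["mount", new_dev, new_mnt],
        # 2. Mount the survivor read-only and copy its contents across.
        ["mount", "-o", "degraded,ro", old_dev, old_mnt],
        ["cp", "-a", old_mnt + "/.", new_mnt],  # assumption: plain file copy
        # 3. Unmount the survivor, add it to the new filetree, convert to raid1.
        ["umount", old_mnt],
        ["btrfs", "device", "add", "-f", old_dev, new_mnt],
        ["btrfs", "balance", "start", "-dconvert=raid1", "-mconvert=raid1", new_mnt],
    ]


# Hypothetical device names and mount points, for illustration only.
for cmd in raid1_rebuild_plan("/dev/sdb1", "/dev/sdc1", "/mnt/old", "/mnt/new"):
    print(" ".join(cmd))
```

Printing the plan instead of executing it is deliberate, as the add step with -f overwrites whatever is on the old device.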
 
      
	
	Another list of kernel version dependent hints.
	
	  - Kernel 5.5 has 3-way and 4-way mirroring modes.
- Kernel 4.14 has better behavior on mounting degraded
	    filetrees, allowing mounts based on chunk availability
	    instead of device availability.
- Kernels before 4.10rc1 have very high CPU usage when
	    deleting or balancing
	    inodes
	    with many reflinks, such as files shared among many snapshots.
- Kernels between 4.5.8 and 4.8.5 inclusive have
	    various issues under load.
- Kernels 4.7.x have a kernel memory allocation issue with
	    order 2 blocks.
- Kernel 4.5 has the version 2 space cache.
- Kernels before 4.4 can have very slow writing when disk free
	    space is fragmented.
- Kernels before 4.3 did not have a fully working
	    FSTRIM implementation.
- Kernel 4.0 has a Btrfs bug that prevents RAID reshaping
	    from succeeding.
- Kernel 3.19 to 3.19.4 can deadlock at mount time.
- Scrubbing or replacing devices in parity RAID profiles
	    works only from kernel 3.19.
- Kernel 3.15 to 3.16.1 can deadlock during heavy IO with
	    compression.
- Kernel 3.14 has improved representation for holes. This is
	    not backwards compatible (no-holes).
- Kernel 3.10 and later have skinny metadata. This is not backwards compatible.
- Kernel 3.10 has many bug fixes.
- Kernel 3.7 and later allow up to 64k hard-links
	    (extended-iref). This is not backwards
	    compatible.
- Kernel 3.5 and later have a large speedup for
	    fsync.
- Conversion of RAID profiles is available only from kernel 3.3.
- Btrfs in kernel versions before 3.2 is not particularly
	    reliable.
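
The feature-introduction entries of the list above can be kept as a small lookup table to check what a given kernel offers. This is a sketch that covers only the entries stating when something became available (not the bug ranges); the table and function names are hypothetical, and versions are compared as (major, minor) tuples.

```python
from typing import List, Tuple

# (minimum kernel version, feature) pairs copied from the hints above.
FEATURES: List[Tuple[Tuple[int, int], str]] = [
    ((3, 3), "RAID profile conversion"),
    ((3, 5), "large fsync speedup"),
    ((3, 7), "up to 64k hard-links (extended-iref)"),
    ((3, 10), "skinny metadata"),
    ((3, 14), "improved representation for holes (no-holes)"),
    ((3, 19), "scrub/replace on parity RAID profiles"),
    ((4, 5), "version 2 space cache"),
    ((4, 14), "degraded mounts based on chunk availability"),
    ((5, 5), "3-way and 4-way mirroring"),
]


def features_available(kernel: Tuple[int, int]) -> List[str]:
    """List the features from the table that a kernel version includes."""
    return [name for minimum, name in FEATURES if kernel >= minimum]


for name in features_available((4, 4)):
    print(name)
```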
 
      
      
	
	
	  - Btrfs mounts all available block devices with the same
	    volume UUID as the one being mounted, therefore block-by-block
	    copies of a Btrfs block device can be dangerous.
- Adding to a Btrfs filesystem a block device that was
	    already part of it causes data loss. Versions of the
	    btrfs tool newer than 0.19 check for that.
- Btrfs directories always have a link count of 1.
- Btrfs volumes always have inode counts (maximum, used) of
	    0.
- Files subject to many small scattered updates can become
	    extremely fragmented when allocated with copy-on-write, as
	    updates usually split an extent in three
	    parts.
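
The last point, that a small copy-on-write update usually splits an extent in three parts, can be shown with a toy extent list. This is a minimal sketch under assumptions of mine: extents are modelled as hypothetical (file offset, length, extent id) tuples, the overwrite is assumed to fall inside the existing file, and nothing here reflects the real on-disk format.

```python
from typing import List, Tuple

Extent = Tuple[int, int, int]  # (file offset, length, extent id)


def cow_overwrite(extents: List[Extent], start: int, length: int) -> List[Extent]:
    """Return the file's extent list after a copy-on-write overwrite.

    The overwritten range is written to a new extent; the untouched head
    and tail of any extent it overlaps keep pointing at the old extent.
    """
    end = start + length
    new_id = max(eid for _, _, eid in extents) + 1
    result: List[Extent] = []
    for off, ln, eid in extents:
        ext_end = off + ln
        if ext_end <= start or off >= end:
            result.append((off, ln, eid))  # no overlap, keep as is
            continue
        if off < start:
            result.append((off, start - off, eid))  # surviving head
        if ext_end > end:
            result.append((end, ext_end - end, eid))  # surviving tail
    result.append((start, length, new_id))  # the newly written data
    return sorted(result)


one_extent = [(0, 100, 1)]
after_one = cow_overwrite(one_extent, 40, 10)
after_two = cow_overwrite(after_one, 10, 5)
print(after_one)
print(after_two)
```

One update in the middle turns one extent into three pieces, and each further scattered update adds two more, which is how a file with many small updates ends up extremely fragmented.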
 
     
    
    
      
     These are pointers to some of the entries in my
	technical blog
	where Btrfs is discussed:
      
	- 171004
	    C vs. C++ and the cost of system level features
- 170424
	    Btrfs and NFS service and NFS daemon Ganesha
- 170407
	    Unusual filesystem properties
- 170302
	    Some coarse speed tests with Btrfs etc. and small files
- 170228
	    Some coarse speed tests with various Linux filesystems
- 161217
	    The most interesting filesystem types
- 160104
	    An amusing demand about 'fsync', and more about 'fsync'
- 131005
	    ZFS and 'fsck'
- 120303c
	    SUSE SLES 11 SP2 has massive updates and switch to Btrfs
- 120211
	    Another Btrfs presentation
- 120203
	    Interview with leader of Btrfs development
 
    
    