Notes about Linux Ceph

Updated: 2024-02-06
Created: 2024-02-03

Some of my notes on Ceph (2024-02-06)

Ceph is a distributed shared filesystem based on a classic structure: MON daemons that hold the running state of the cluster, MGR daemons that add monitoring services, OSD daemons that store data on storage devices, and applications (CephFS, RBD, RGW) layered on top via RADOS.

Ceph pragmatics (2024-02-06)

Ceph: do not do these (2024-02-07)

Note: many of these things can be done in special cases where speed or reliability matter much less than cost, but that is very unlikely, even if you think you know better.

OSDs as buckets

If buckets are OSDs and a server with various OSDs fails, several PGs will become unusable for the duration, because those PGs may have several of their replicas on OSDs within that same server.

There are cases where availability of files matters less and data resilience matters more, and where there are few servers and many OSDs; there, having OSDs as buckets might be a lesser evil.
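The usual safer alternative is a CRUSH rule whose failure domain is the host rather than the OSD, so that no two replicas of a PG land on the same server; a minimal sketch, where the rule and pool names are just examples:

    # Replicated rule that puts each replica of a PG on a different host:
    ceph osd crush rule create-replicated by-host default host
    ceph osd pool set mypool crush_rule by-host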

Many OSDs on the same server

Many OSDs on the same server usually also means that there will be few servers, and therefore, if a server is a bucket and one fails, odds are that many PGs will become undersized and slower.

This is particularly bad for erasure-coded PGs, which normally replicate across many OSDs, and for OSDs with large capacity, which therefore will have PGs with lots of objects.
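Whether a server outage has left many PGs in those states can be checked with the standard commands (output formats vary by release):

    # List PGs stuck undersized or degraded, e.g. after a host failure:
    ceph pg dump_stuck undersized
    ceph pg dump_stuck degraded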

Use large storage devices as OSDs
  • If large (more than 1-2TB) storage devices are HDDs they have a very low IOPS-per-TB ratio, which makes maintenance operations like checking, scrubbing and rebalancing extremely slow (see the sketch after this list).
  • Regardless of speed, large storage devices by necessity hold the chunks (which usually have a limited size of up to a few MiB) of a large number of files, so if one fails a large number of files will become degraded.
  • Large storage devices can get very fragmented because they have so many chunks.
  • Since large storage devices store chunks from many files, it is much more likely that they will serve reads and writes from many threads, and many-threaded IO is expensive even if the storage device has lots of IOPS by virtue of being an SSD.
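A back-of-envelope sketch of the IOPS-per-TB point in the first bullet, with made-up but typical HDD figures:

    # A 7200rpm HDD does roughly 150-200 random IOPS regardless of capacity:
    echo "18TB HDD: $(( 180 / 18 )) IOPS/TB"   # ~10 IOPS per TB
    echo "2TB HDD: $(( 180 / 2 )) IOPS/TB"     # ~90 IOPS per TB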
Use SSDs without Power Loss Protection even for data

SSDs with PLP are really necessary for DB, WAL and metadata, which all have very (very!) high small write rates.

But Ceph, by spreading files across potentially many OSDs and by often being used for highly parallel workloads, generates IO requests that look very random from the point of view of a storage device. For write requests, PLP allows the device to cache them safely until they can be reordered in more sequential ways, which is really quite important for SSDs in particular.
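On Linux one way to see whether a device is exposing a volatile write cache (which, without PLP, risks losing buffered writes on power failure) is the block-layer sysfs attribute; the device name is just an example:

    # "write back" means writes are buffered in the device's (possibly
    # volatile) cache; "write through" means they are not:
    cat /sys/block/sda/queue/write_cache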

Use one WAL/DB fast SSD for too many HDDs or slow data SSDs

A WAL/DB area can see a very considerable amount of traffic, so having too many of them on just one SSD, even a fast one, can be a significant bottleneck, especially with large HDDs. For example, 12×18TB HDD OSDs saturate a single fast SSD, and each can then only operate at around 30% of its possible speed, suggesting that each WAL/DB SSD can only support around 4 such HDDs.
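A sketch of a layout following that 4-HDDs-per-SSD estimate, with example device names:

    # Create 4 HDD OSDs that share one NVMe for their WAL/DB areas:
    ceph-volume lvm batch /dev/sdb /dev/sdc /dev/sdd /dev/sde \
        --db-devices /dev/nvme0n1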

Use erasure coding
  • Erasure coding transforms all writes into multiple block writes, and write operations complete only when all blocks have been written to separate OSDs, which involves much latency.
  • If checksum checking is enabled for erasure-coded blocks, then reads can only complete when all blocks of an erasure-coded set have been read.
  • Because a second failure can happen while degraded, Ceph data should have at least double redundancy, which means that in k+m replicas m should be at least 2, which means that k should also be at least 2 (because 1+2 uses the same space as 3-way mirroring and is otherwise worse), and often k is 4 or 6 or even bigger. The larger k+m is, the worse the speed downsides of erasure coding.

    The classic 3-way mirroring is roughly equivalent in space to 1+2 erasure coding without compression, and with compression often equivalent to 2+2 erasure coding, but it is much, much faster. It is almost always better to use 3-way mirroring with somewhat slower or larger OSDs than erasure coding with rather faster and smaller OSDs, because of the latency of doing scatter-gather operations across the many chunks in an erasure-coded set.
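For comparison, sketches of both layouts, where the pool names, PG counts and the 4+2 profile are illustrative rather than recommendations:

    # 4+2 erasure-coded pool with per-host failure domain:
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec-4-2
    # Classic 3-way mirroring:
    ceph osd pool create reppool 128 128 replicated
    ceph osd pool set reppool size 3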

Use a degree of redundancy less than 2

Because a second failure can happen while the first is being repaired, Ceph data should have at least double redundancy, so mirroring with only 1 extra replica or erasure coding with only 1 erasure-code block may cause data unavailability or loss.

Data safety: Always have min_size at least +1 more than needed for minimal reachability

  • That means good combinations are, at least:
    • Replica: n>=3: size=n, min_size=2
    • Erasure code: n>=2, m>=2, i>=1: EC=n+m => size=n+m, min_size=n+i
  • See the current values with ceph osd pool ls detail.
  • Why? Every write should have at least one redundant OSD, even when you are down to min_size, because if another disk dies while you are at min_size without a redundant OSD, everything is lost. In other words, every write should be backed by at least one additional drive, even if you are already degraded.
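Applying the rule to the two combinations above, with example pool names:

    # Replica pool: size=3, min_size=2
    ceph osd pool set reppool size 3
    ceph osd pool set reppool min_size 2
    # 4+2 EC pool: size=6, min_size=n+i=5 with i=1
    ceph osd pool set ecpool min_size 5
    # Verify:
    ceph osd pool ls detail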
Fill your storage beyond 75-80%

Ceph and its BlueStore layer often slow down dramatically if they get full-ish, both because fullness increases fragmentation of BlueStore and because it makes rebalancing and backfilling more difficult.
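Fill levels and the warning thresholds can be watched with the standard commands; the 0.75 figure is just the guideline above:

    ceph df
    ceph osd df
    # Warn early rather than late:
    ceph osd set-nearfull-ratio 0.75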

Attempt to "optimize" an insufficient storage layer

Ceph works best when it has plenty of hardware capacity, and in particular spare capacity, because it can have high peaks of utilization from user workloads and from system workloads; providing just enough capacity for the expected workloads will often have poor results.

Attempts to optimize at the software level an insufficient storage layer will usually just result in lower speed and reliability. In particular, attempts to optimize a storage layer that lacks enough IOPS to sustain both user and system workloads will result in bitter experiences, especially if the "optimization" is to disable the system workloads to fit the user workload into the available hardware capacity.
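The canonical form of that last mistake is disabling scrubbing to free IOPS for the user workload; the flags are real, and the point is precisely not to leave them set:

    # Tempting when IOPS-starved, and a reliability trap:
    ceph osd set noscrub
    ceph osd set nodeep-scrub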

Ceph capacity planning (2024-02-06)

IOPS

Like all distributed filesystems, but more than most, Ceph requires a lot of IOPS, and in particular low-latency IOPS. The IOPS required are proportional to both metadata and data items, because Ceph is self-healing: it does a lot of background checking and optimization (scrubbing, rebalancing, backfilling), and these consume a lot of IOPS for both data and metadata accesses, as they involve whole-tree and whole-device scans.

There are two sources of demand for IOPS:

  • User workloads: this depends on how many processes are doing IO at the same time, with which block size, and whether it is reads or writes.
  • System workload: this is mostly metadata access all the time, plus extra metadata and data access during maintenance operations such as scrubbing, backup, checking, rebuilding and rebalancing. Since maintenance operations are often whole-tree or all-entities scans, they can be quite heavy and long-lasting, and there must be enough IOPS to run them in parallel with the user workload.
  • There are three types of IOPS load to consider:

    • Absolute IOPS: dependent solely on the characteristics of the workload and not on the amount of data stored or the number of files or objects stored. Usually mostly dependent on the user workload.
    • IOPS-per-TB: IOPS that are proportional to the amount of data stored. Usually dependent mostly on the maintenance workload.
    • IOPS-per-entity: IOPS that are proportional to the number of files or objects stored. Usually dependent on both the user and the system workload.

    Because of rebalancing and backfilling Ceph OSDs usually require a lot of IOPS-per-TB, and often lots of IOPS-per-entity are required too.
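As a back-of-envelope example of how heavy the maintenance workload is on big slow devices, with made-up but plausible numbers:

    # Time for one deep-scrub pass of an 18TB HDD OSD at an optimistic
    # 150MB/s sequential, assuming no competing user IO at all:
    echo "$(( 18000000 / 150 / 3600 )) hours"   # ~33 hours per full pass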

Ceph terms (2024-02-07)

Note: there is an official Ceph glossary.

Ceph instance
A set of inter-related daemons (MON, MGR, OSD) and storage devices. It is defined by the set of MON daemons that synchronize with each other and the MGR and OSD daemons registered with those MON daemons.
MON
A daemon that stores the running state of a cluster, both in memory and in a local RocksDB instance. The shape of the cluster is stored in a few configuration files in some kind of persistent local filesystem.
MGR
A daemon that adds some more monitoring services to a MON daemon.
RADOS
Library and command-line tools to access Ceph objects.
Ceph application
Some set of daemons that use RADOS to provide higher-level services, for example a POSIX filesystem (CephFS), block devices (RBD), or S3-compatible object access (RGW).
pool
Equivalently:
  • A replication layout profile.
  • A set of PGs with the same layout profile.
  • A set of objects with the same layout profile.
PG

A set of chunks belonging to objects with the same pool profile, where each chunk of the same object is stored in a different bucket.

bucket

A subset of all the OSDs in a Ceph instance, one or many, usually chosen for being in the same failure domain.

inconsistent (PG, object)
An object, or a PG containing an object, with differences among its replicas or with its checksum.
PG undersized
When some of its OSDs are unavailable but all objects are complete (not degraded).
PG degraded
When some of its OSDs are unavailable and some objects are degraded (incomplete).
PG inactive
When it cannot serve reads or writes because too few of its OSDs are up.
PG stale
When no OSD has reported its status to the MONs recently, so its state is unknown.
PG unclean
When it does not have the desired number of replicas of all its objects, for example while recovering or waiting to be backfilled.
PG peering
When its OSDs are agreeing among themselves on the state of the objects in it; IO waits until peering completes.
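The current states of PGs, and which PGs are in which state, can be inspected with the standard commands:

    ceph pg stat
    ceph health detail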