Updated: 2024-02-06
Created: 2024-02-03
Ceph is a distributed shared filesystem based on a classic structure: files are split into chunks and distributed across local filesystems on several data servers.
Chunks can be stored as whole replicas or as erasure code chunks, but the latter are extremely expensive in IOPS, especially because of read-modify-write (RMW) cycles resulting in write amplification.
Chunks are replicated across failure domains, each represented by a bucket, which can be a block device, a server, a rack, a data centre, etc.
Files are grouped into placement groups; all files in the same PG are replicated in the same way. PGs are not data containers, they are just lists of files.
How the files in a PG are replicated is specified by a pool definition. Pools (despite the name) are not data containers either, they are just replication profiles.
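To see how these pieces fit together on a running cluster (the pool and object names below are just hypothetical placeholders), the ceph CLI can report where a single object lands, i.e. which PG it hashes into and which OSDs that PG is currently mapped to:

    ceph osd map somepool someobject   # show the PG and the set of OSDs for this object
    ceph pg dump                       # list all PGs and their current OSD mappings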
Note: many of these things can be done in special cases where speed or reliability matter much less than cost, but that is very unlikely, even if you think that you do know better.
If buckets are OSDs and a server with several OSDs fails, several PGs will become unusable for the duration, because those PGs will have replicas on more than one OSD on that server.
There are cases where availability of files matters less and data resilience matters more, and there are few servers and many OSDs; in such cases having OSDs as buckets might be a lesser evil.
Many OSDs on the same server usually also means that there will be few servers, and therefore, if a server is a bucket, odds are that many PGs will become undersized and slower.
This is particularly bad for erasure-coded PGs, which normally replicate across many OSDs, and for OSDs with large capacity, which therefore will have PGs with lots of objects.
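For example (the rule and pool names below are arbitrary placeholders), the failure domain of a pool is determined by its CRUSH rule, so the choice between whole servers and individual OSDs as buckets looks roughly like this:

    ceph osd crush rule create-replicated by-host default host   # replicate across whole servers
    ceph osd crush rule create-replicated by-osd  default osd    # replicate across individual OSDs only
    ceph osd pool set somepool crush_rule by-host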
SSDs should have Power Loss Protection (PLP) even for data.
SSDs with PLP are really necessary for DB, WAL and metadata, which all have very (very!) high small write rates.
But Ceph, by spreading files across potentially many OSDs and by often being used for highly parallel workloads, generates IO requests that look very random from the point of view of a storage device; for write requests, PLP allows the device to cache them safely until they can be reordered into more sequential patterns, which is really quite important for SSDs in particular.
A WAL/DB area can see a very considerable amount of traffic, so putting too many of them on a single SSD, even a fast one, can be a significant bottleneck, especially with large HDDs. For example, 12×18TB HDD OSDs saturate a single fast SSD and can then only operate at around 30% of their possible speed each, suggesting that each WAL/DB SSD should serve only around 4 such HDDs.
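To make the arithmetic explicit: if 12 HDD OSDs sharing one WAL/DB SSD each run at only about 30% of their possible speed, that SSD is effectively delivering 12 × 0.3 ≈ 3.6 HDDs' worth of WAL/DB traffic at full speed, which is where the rule of thumb of roughly 4 such HDDs per WAL/DB SSD comes from.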
Because of reasons Ceph data should have at least double redundancy, which means that with k+m erasure coding m should be at least 2, and that k should also be at least 2 (because 1+2 uses the same space as 3-way mirroring and is otherwise worse); often k is 4 or 6 or even bigger. The larger k+m is, the worse the speed downsides of erasure coding.
The classic 3-way mirroring is roughly equivalent in space to 1+2 erasure coding without compression, and with compression often equivalent to 2+2 erasure coding, but it is much, much faster. It is almost always better to use 3-way mirroring with somewhat slower or larger OSDs than erasure coding with rather faster and smaller OSDs, because of the latency of doing scatter-gather operations across the many chunks in an erasure-coded set.
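As a worked example of the space overheads (ignoring compression): 3-way mirroring and 1+2 erasure coding both write 3 bytes of raw storage per byte of user data; 2+2 erasure coding writes (2+2)/2 = 2 bytes, and 4+2 writes (4+2)/4 = 1.5 bytes, which is why larger values of k look attractive on paper despite the much higher latency.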
Because of reasons Ceph data should have at least double redundancy, so mirroring with 1 replica or erasure coding with 1 erasure code block may cause data unavailability.
Data safety: always have min_size at least 1 more than needed for minimal reachability
- That means good combinations are, at least:
- Replica: n>=3: size=n, min_size=2
- Erasure code: k>=2, m>=2, i>=1: EC=k+m => size=k+m, min_size=k+i
- see current values in ceph osd pool ls detail
- Why? Every write should be backed by at least one redundant OSD, even when you are down to min_size, because if another disk dies while you are at min_size without a redundant OSD, everything is lost; see the example settings sketched below.
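A minimal sketch of applying these combinations (the pool names, PG counts and profile name below are arbitrary placeholders):

    ceph osd pool create rep-data 128
    ceph osd pool set rep-data size 3
    ceph osd pool set rep-data min_size 2

    ceph osd erasure-code-profile set ec42 k=4 m=2
    ceph osd pool create ec-data 128 128 erasure ec42   # size becomes k+m = 6
    ceph osd pool set ec-data min_size 5                # k+i with i=1

    ceph osd pool ls detail                             # verify the resulting size/min_size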
Ceph and its BlueStore layer often slow down dramatically if they get full-ish, both because this increases fragmentation of BlueStore and because it makes it more difficult to rebalance and backfill.
Ceph works best when it has plenty of hardware capacity, and in particular plenty of spare capacity, because it can have high peaks of utilization from both user workloads and system workloads; providing just enough capacity for the expected workloads will often give poor results.
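Per-pool and per-OSD utilization, and therefore how much spare capacity is left, can be checked with:

    ceph df       # cluster-wide and per-pool usage
    ceph osd df   # per-OSD usage and variance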
Attempts to optimize an insufficient storage layer at the software level will usually just result in lower speed and reliability. In particular, attempts to optimize a storage layer that lacks enough IOPS to sustain both user and system workloads will result in bitter experiences, especially if the optimization is to disable the system workloads to fit the user workload in the available hardware capacity.
Like all distributed filesystems, but more than most, Ceph requires a lot of IOPS, in particular low-latency IOPS, and the IOPS required are proportional to the number of both metadata and data items. This is because Ceph is self-healing: it does a lot of background checking and optimization (scrubbing, rebalancing and backfilling), and these consume a lot of IOPS for both data and metadata accesses, as they involve whole-tree and whole-device scans.
There are two sources of demand for IOPS: the user workload and maintenance operations such as scrubbing, backup, checking, rebuilding and rebalancing. Since maintenance operations are often whole-tree or all-entity scans they can be quite heavy and long lasting, and there must be enough IOPS to run them in parallel with the user workload.
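The maintenance activity currently in progress (scrubbing, recovery, backfilling) can be observed in the cluster status, for example:

    ceph -s        # overall status, including recovery/backfill progress
    ceph pg stat   # summary of PG states (scrubbing, backfilling, degraded, ...)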
There is more than one type of IOPS load to consider: because of rebalancing and backfilling, Ceph OSDs usually require a lot of IOPS-per-TB, and often lots of IOPS-per-entity are required too.
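As a rough illustration (the device figures here are typical assumptions, not taken from this document): a 7200rpm HDD delivers on the order of 100-200 random IOPS regardless of its capacity, so an 18TB HDD provides only around 5-10 IOPS per TB, while a datacentre SSD provides thousands of IOPS per TB; this is why large HDD-only OSDs struggle to sustain scrubbing, rebalancing and backfilling on top of the user workload.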
Note: there is an official Ceph glossary.
pool
A replication profile (not a data container) that specifies how the PGs assigned to it are replicated.
PG
A set of chunks belonging to objects with the same pool profile, where each chunk of the same object is stored in a different bucket.
bucket
A subset of all the OSDs in a Ceph instance, one or many, usually chosen for being in the same failure domain.