Data capacity is the first thing that comes to mind when talking about large ceph clusters, or data storage systems in general. The number of drives is another measure to think about. And sometimes maximum IOPS is something to look out for, especially when considering an all-flash / NVMe cluster. But heaviness? What does that even mean?
The first interpretation might be the total weight of the server hardware - and having seen the wheels of some racks bend under that load, that's definitely something to account for, but not in this post.
Evenly filling different size boxes
Ceph uses an algorithm called CRUSH to place the data all over the cluster. And to be able to distribute data evenly, this algorithm assigns a weight to each of the drives. This is a unitless quantity, meaning it only matters relative to the other weights. So a drive with weight 1 will hold x amount of data, while another drive in the same cluster with weight 2 will hold 2x. Simple enough.
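To put the same idea in code rather than words, here's a minimal sketch - not CRUSH itself, just the proportionality it aims for, with made-up weights:

# hypothetical weights for three drives
weights = {"osd.0": 1.0, "osd.1": 2.0, "osd.2": 1.0}

total = sum(weights.values())
for osd, weight in weights.items():
    print(f"{osd} expects ~{weight / total:.0%} of the data")

# osd.0 expects ~25% of the data
# osd.1 expects ~50% of the data
# osd.2 expects ~25% of the data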
Since we almost always want to distribute data according to capacity, this weight has been auto-assigned from the capacity of the underlying drives for a few years now. So a 6TB drive gets a default crush weight of 6, and a 10TB drive gets 10. In practice it would be 5.455 and 9.095, because … you know, sometimes we use 1000s and sometimes 1024s.
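If you want to double-check those numbers, the conversion is just TB (powers of 1000) to TiB (powers of 1024). A quick Python sketch - the exact value depends on what your drive actually reports, hence 5.455 vs 5.457:

TB = 10**12   # what the drive vendors sell
TiB = 2**40   # what the default crush weight counts in

for size_tb in (6, 10, 12):
    print(f"{size_tb}TB drive -> default weight ~{size_tb * TB / TiB:.3f}")

# 6TB drive -> default weight ~5.457
# 10TB drive -> default weight ~9.095
# 12TB drive -> default weight ~10.914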
Accounting for total size of boxes
The second part is that crush takes this weight concept and aggregates it up the hardware hierarchy. So a node with ten 6TB drives will have a weight of 60, while a similar node filled with 10TB drives instead would weigh 100. And this weight keeps being summed up at the rack, row, pod, etc. levels if you have them. Whether you have these levels or not, what doesn't change is that all ceph clusters have a total weight. That's the heaviness(!) of a ceph cluster, and by default it's the total raw capacity of the cluster, in tebibytes.
For example, when we configure a ceph cluster with 10 storage nodes, each containing 24 12TB drives, this cluster would have 2880TB of raw capacity, and if we deploy it using the defaults, its root weight would be 2619.36. For a hypothetical large cluster with 10,000 10TB drives you'd have a 100PB cluster with a default root weight of 90950. Right? Unfortunately no, you can't have that.
Limits to the scale
What would happen instead is that your deployment gets stuck at exactly 7205 OSDs, and every further OSD trying to start fails with a generic message like
insert_item unable to rebuild roots with classes:
(34) Numerical result out of range
This looks like some kind of overflow but where? There are definitely people out
there running clusters containing more than 7205
drives, right? Probably yes,
but I guess those clusters are anything but “by defaults”.
The thing is, weights in the current implementation of crush have a hardcoded limit of 65535. This also means a ceph cluster can weigh at most 65535 in total - beyond that it's too heavy. And with the default weights, a ceph cluster can be as large as 72PB and not a single PB more.
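As far as I can tell, the limit comes from how crush stores weights internally: as 32-bit fixed-point integers with 16 fractional bits (0x10000 is a weight of 1.0), so nothing above 65535-and-change fits. A rough sketch of the failure mode, with a made-up helper name:

def to_crush_fixed(weight):
    # crush-style 16.16 fixed point packed into a 32-bit unsigned int
    fixed = int(weight * 0x10000)
    if fixed > 0xFFFFFFFF:
        raise OverflowError("(34) Numerical result out of range")
    return fixed

print(to_crush_fixed(7205 * 9.095))    # root weight ~65529, still fits
to_crush_fixed(7206 * 9.095)           # root weight ~65539, boom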
Time to get off the defaults highway
Fortunately this purely hypothetical, rather obscure and probably even unheard-of heaviness limit has a simple remedy: setting the weights to something other than the capacity in TiBs - counting in units of 10TiB, for example. Or simply setting the initial weights to something more relatable, in the global section of ceph.conf:
[global]
osd crush initial weight = 1
Or if you’re hypothetical deployment got stuck at 7205th
same sized drive, you
can reweight your entire cluster with:
ceph osd crush reweight-subtree default 1
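And if your drives aren't all the same size, the same idea still works - scale every weight down by a common factor (say, count in units of 10TiB) instead of setting them all to 1. A hypothetical sketch that just prints the per-OSD reweight commands:

# hypothetical mix of drive sizes, keyed by osd id
osd_sizes_tb = {0: 6, 1: 6, 2: 10, 3: 12}

TB, TiB = 10**12, 2**40
UNIT = 10  # count weights in units of 10TiB instead of 1TiB

for osd_id, size_tb in osd_sizes_tb.items():
    weight = size_tb * TB / TiB / UNIT
    print(f"ceph osd crush reweight osd.{osd_id} {weight:.3f}")

# ceph osd crush reweight osd.0 0.546
# ceph osd crush reweight osd.1 0.546
# ceph osd crush reweight osd.2 0.909
# ceph osd crush reweight osd.3 1.091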
With every drive weighing 1, the next limit to hit would be your 65535th drive. So keep that in mind when trying to scale your cluster to 10x its original size! (Hint: Don't)