
Feature request: take bluestore_min_alloc_size into account #34

Open · patrakov opened this issue Mar 5, 2024 · 3 comments

patrakov commented Mar 5, 2024

I have access to a cluster that was created long ago and later expanded by adding new OSDs. I found that, in order to balance it properly, I had to run the balance command with --osdsize device --osdused delta. Otherwise, the tool's idea of how full an OSD is disagrees with what ceph osd df says, and it disagrees differently for different OSDs.

Today, with the help of my colleagues, we root-caused this: old OSDs have bluestore_min_alloc_size=65536, while new ones have bluestore_min_alloc_size=4096. It means that the average per-object overhead is different. This overhead is what makes the sum of PG sizes (i.e., the sum of all stored object sizes) different from the used space on the OSD.

Please assume by default that each stored object comes with an overhead of bluestore_min_alloc_size / 2, and take this into account when figuring out how much space would be used or freed by PG moves. On Ceph 17.2.7 and later, you can get this from ceph osd metadata.
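
For illustration, here is a minimal sketch of how this could be collected per OSD (assuming the value is exposed under a bluestore_min_alloc_size key in the JSON output; treat the key name and parsing as an assumption, not the tool's actual code):

    import json
    import subprocess

    def min_alloc_sizes():
        """Return {osd_id: bluestore_min_alloc_size in bytes} from `ceph osd metadata`."""
        out = subprocess.check_output(["ceph", "osd", "metadata", "--format", "json"])
        sizes = {}
        for entry in json.loads(out):
            # Key name assumed; metadata values are reported as strings.
            raw = entry.get("bluestore_min_alloc_size")
            if raw is not None:
                sizes[entry["id"]] = int(raw)
        return sizes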

For example, an OSD that has a total of 56613739 objects in all PGs would have 1.7 TB of overhead with bluestore_min_alloc_size=65536, but only 100 GB of overhead with bluestore_min_alloc_size=4096.
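
A quick back-of-the-envelope check of those numbers, using the object count above and the proposed per-object overhead estimate:

    # Estimated overhead = number of objects * (bluestore_min_alloc_size / 2)
    objects = 56_613_739
    for min_alloc in (65536, 4096):
        overhead = objects * (min_alloc // 2)
        print(f"min_alloc_size={min_alloc}: ~{overhead / 2**40:.2f} TiB of overhead")
    # prints roughly 1.69 TiB for 65536 and 0.11 TiB for 4096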

Here is ceph osd df (please ignore the first bunch of OSDs with only 0.75% utilization - they are outside of the CRUSH root, waiting for an "ok" to be placed in the proper hierarchy):
ceph-osd-df.txt

Here is ceph pg ls-by-osd 221 (this one was redeployed recently, so it has bluestore_min_alloc_size=4096):
ceph-pg-ls-by-osd-221.txt

Here is ceph pg ls-by-osd 223:
ceph-pg-ls-by-osd-223.txt

As you can see, these two OSDs have almost the same size and almost the same number of PGs (differing by only one), yet their utilization differs by 1.9 TB, which roughly matches the overhead calculation presented above.

Sorry, I am not allowed to post the full osdmap.

P.S. I am also going to file the same bug against the built-in Ceph balancer.

TheJJ (Owner) commented Mar 5, 2024

hi! thanks for the report - I can fix this best if you create a state dump file (placementoptimizer.py gather file.xz) and upload it or send it to me at jj -at- sft.lol. that would be great and i will be able to reproduce/fix this locally!

TheJJ (Owner) commented Mar 14, 2024

received - will have a look and see what I can fix :)

TheJJ (Owner) commented Apr 4, 2024

I'll leave this here for reference. My questions were:

  I wonder why the alloc size even matters here, since I just evaluate the reported PG and OSD sizes.
  The --osdsize flag can select between crushweight, devicesize and weighted devicesize, and all of them are independent of alloc sizes.
  --osdused determines how we calculate partial movement sizes, and those are not based on the alloc size either.
  Just to make sure, what exactly did you mean by "balancing it properly"? What is your essential requirement for a proper balance? 🙂

patrakov's reply:


Let me start by answering the main question: the definition of proper balance. I want all OSDs to have the same percentage of their size used, as reported by ceph osd df.

Regarding the --osdsize argument, I agree that it is not essential here. However, --osdused delta is important.

The problem is that if the calculation of partial movement sizes does not take the alloc size into account, the effect of a move is miscalculated. A 270 GB PG does not necessarily consume only 270 GB on the OSD; on the contrary, it consumes more, and the overhead depends on the number of objects in the PG and the per-object overhead, which can be estimated as bluestore_min_alloc_size / 2. So a PG expands when it is moved from an OSD with a 4K alloc size to an OSD with a 64K min alloc size.
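
As a rough sketch of that estimate (the PG size and object count below are made-up illustration values, and the half-alloc-unit factor is the heuristic proposed above):

    # Heuristic estimate of a PG's on-disk footprint on a given OSD.
    def pg_footprint_bytes(stored_bytes, num_objects, min_alloc_size):
        # Per-object overhead estimated as half the allocation unit.
        return stored_bytes + num_objects * (min_alloc_size // 2)

    # A hypothetical 270 GB PG containing 5 million objects:
    print(pg_footprint_bytes(270 * 10**9, 5_000_000, 4096) / 10**9)   # ~280 GB on a 4 KiB OSD
    print(pg_footprint_bytes(270 * 10**9, 5_000_000, 65536) / 10**9)  # ~434 GB on a 64 KiB OSD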

While this effect is insignificant when moving a small number of PGs, it is significant when evaluating the initial state. That's why on this cluster --osdused delta is the only thing that worked. At least in the past, without it, --osdfrom fullest made mistakes about the choice of the source OSDs - as bad as selecting a 75% full OSD when there was an 88% full OSD available.

Let me also explain it another way by quoting from the built-in help:

  --osdused {delta,shardsum}
                        how is the osd usage predicted during simulation?
                        default: shardsum.
                        delta: adjust the builtin osd usage report by in-move
                        pg deltas - more accurate but doesn't account pending
                        data deletion.
                        shardsum: estimate the usage by summing up all pg
                        shardsizes - doesn't account PG metadata overhead.

Let's have the best of both worlds by adjusting the sum of all PG shard sizes in shardsum mode by the known overhead, i.e., the number of objects in the PG multiplied by half of the min alloc size.
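
Concretely, something along these lines could work (a sketch only; the function and argument names are illustrative, not placementoptimizer.py's actual internals):

    # Sketch: shardsum-style usage estimate for one OSD, corrected by the
    # per-object allocation overhead. pg_shards is an iterable of
    # (shard_bytes, num_objects) tuples for all PG shards placed on the OSD.
    def predicted_osd_used(pg_shards, min_alloc_size):
        used = 0
        for shard_bytes, num_objects in pg_shards:
            used += shard_bytes + num_objects * (min_alloc_size // 2)
        return used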
