Feature request: take bluestore_min_alloc_size into account #34
hi! thanks for the report - I can fix this best if you create a state dump file (

received - will have a look and see what I can fix :)

I'll leave this here for reference:
I have access to a cluster created long ago and then expanded by adding new OSDs. I found that, in order to balance it properly, I had to add `--osdsize device --osdused delta` to the `balance` command. Otherwise, its idea of how full an OSD is disagrees with what `ceph osd df` says, and disagrees differently for different OSDs.

Today, with the help of my colleagues, we root-caused this: old OSDs have `bluestore_min_alloc_size=65536`, while new ones have `bluestore_min_alloc_size=4096`. This means that the average per-object overhead is different. This overhead is what makes the sum of PG sizes (i.e., the sum of all stored object sizes) different from the used space on the OSD.

Please assume by default that each stored object comes with an overhead of `bluestore_min_alloc_size / 2`, and take this into account when figuring out how much space would be used or freed by PG moves. On Ceph 17.2.7 and later, you can get this value from `ceph osd metadata`.
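To make the proposal concrete, here is a minimal sketch of the heuristic (my own illustration, not this project's code; the helper names are made up, and the metadata key name is assumed to match the option name):

```python
import json
import subprocess


def min_alloc_size(osd_id):
    """Fetch bluestore_min_alloc_size for one OSD from `ceph osd metadata`
    (available there on Ceph 17.2.7 and later, as noted above)."""
    meta = json.loads(subprocess.check_output(
        ["ceph", "osd", "metadata", str(osd_id), "--format=json"]))
    return int(meta["bluestore_min_alloc_size"])


def estimated_alloc_overhead(num_objects, alloc_size):
    """Proposed default: each stored object wastes alloc_size / 2 bytes on
    average, so predicted OSD usage = sum of PG sizes + this overhead."""
    return num_objects * alloc_size // 2
```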
For example, an OSD that has a total of 56613739 objects in all PGs would have 1.7 TB of overhead with `bluestore_min_alloc_size=65536`, but only 100 GB of overhead with `bluestore_min_alloc_size=4096`.
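(As a rough check of the arithmetic: 56613739 × 65536 / 2 ≈ 1.86 × 10^12 bytes ≈ 1.7 TiB, and 56613739 × 4096 / 2 ≈ 1.16 × 10^11 bytes ≈ 108 GiB.)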
Here is `ceph osd df` (please ignore the first bunch of OSDs with only 0.75% utilization - they are outside of the CRUSH root, waiting for an "ok" to be placed in the proper hierarchy): ceph-osd-df.txt
Here is `ceph pg ls-by-osd 221` (this one was redeployed recently, so it has `bluestore_min_alloc_size=4096`): ceph-pg-ls-by-osd-221.txt
Here is `ceph pg ls-by-osd 223`: ceph-pg-ls-by-osd-223.txt
As you can see, these two OSDs have almost the same size, almost the same (differing only by 1) number of PGs, but their utilization differs by 1.9 TB, which matches (although not perfectly) the overhead calculation presented above.
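(Assuming osd.223 still has `bluestore_min_alloc_size=65536` and both OSDs hold roughly the 56613739 objects mentioned above, the heuristic predicts a gap of about 56613739 × (65536 − 4096) / 2 ≈ 1.7 TB, which is in the same ballpark as the observed 1.9 TB.)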
Sorry, I am not allowed to post the full osdmap.
P.S. I am also going to file the same bug against the built-in Ceph balancer.