showremapped --by-osd crash #49

Closed

tatuylonen opened this issue Oct 18, 2024 · 1 comment

@tatuylonen

Trying to run "showremapped --by-osd" gives the following:

./placementoptimizer.py showremapped --by-osd
Traceback (most recent call last):
  File "/root/ceph-balancer/./placementoptimizer.py", line 5497, in <module>
    exit(main())
  File "/root/ceph-balancer/./placementoptimizer.py", line 5491, in main
    run()
  File "/root/ceph-balancer/./placementoptimizer.py", line 5461, in <lambda>
    run = lambda: showremapped(args, state)
  File "/root/ceph-balancer/./placementoptimizer.py", line 5347, in showremapped
    print(f"{osdname}: {cluster.osds[osdid]['host_name']} =>{sum_to} {sum_data_to_pp} <={sum_from} {sum_data_from_pp}"
KeyError: -1

I suspect this crash is caused by undersized and degraded PGs in the cluster that are being remapped, so that cluster.osds ends up indexed by -1 (presumably the placeholder for a missing source OSD). If so, this should be pretty easy to fix; see the sketch after the excerpt below.

Excerpt from output of "showremapped":
pg 18.38d toofull 223.0G: 4659450 of 4659450, 100.0%, 97->220;221->236;55->77;90->113;124->60;-1->76
pg 18.40a toofull 111.3G: 3874000 of 3874000, 100.0%, 216->72;-1->220;45->162;252->90;46->58;171->130;54->175;186->54;116->112;147->118
pg 18.432 backfill 111.6G: 3881870 of 3881870, 100.0%, -1->72;117->97;139->185;27->240;51->212;175->29;33->95;85->109;239->102;96->44
pg 18.45a backfill 111.4G: 1165623 of 1165623, 100.0%, 92->61;-1->99;114->104
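
A minimal sketch of the kind of guard that could avoid the KeyError, assuming cluster.osds is a plain dict keyed by OSD id (which the traceback suggests). The function name and the fallback strings here are hypothetical, not the project's actual code:

# Hypothetical guard, not placementoptimizer.py's actual code.
# Assumes cluster.osds is a plain dict keyed by OSD id.
def format_osd_line(cluster_osds, osdid, sum_to, sum_data_to_pp,
                    sum_from, sum_data_from_pp):
    # -1 marks a shard with no current source OSD (undersized/degraded
    # PG), so fall back to placeholders instead of raising KeyError.
    osdname = f"osd.{osdid}" if osdid >= 0 else "osd.<none>"
    host_name = cluster_osds.get(osdid, {}).get("host_name", "<no host>")
    return (f"{osdname}: {host_name} =>{sum_to} {sum_data_to_pp} "
            f"<={sum_from} {sum_data_from_pp}")

# Example: a missing source OSD (-1) no longer crashes the printout.
osds = {97: {"host_name": "sm1"}}
print(format_osd_line(osds, -1, 0, "0 B", 1, "111.4G"))
print(format_osd_line(osds, 97, 3, "334.9G", 0, "0 B"))

Whether to print such placeholder entries or to skip -1 entirely is up to the maintainers; the point is only that the lookup needs to tolerate the -1 key.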

ceph status
  cluster:
    id:     xxx
    health: HEALTH_WARN
            3 failed cephadm daemon(s)
            nodeep-scrub flag(s) set
            10 backfillfull osd(s)
            19 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 105 pgs backfill_toofull
            Degraded data redundancy: 318827/9214981270 objects degraded (0.003%), 1 pg degraded, 12 pgs undersized
            88 pgs not deep-scrubbed in time
            1220 pgs not scrubbed in time
            5 pool(s) backfillfull
            29 slow ops, oldest one blocked for 667 sec, daemons [osd.110,osd.42] have slow ops.

  services:
    mon: 3 daemons, quorum sm1,sm3,sm2 (age 29h)
    mgr: sm2.igewzl(active, since 29h), standbys: sm1.guvysx, sm3.hjkzda
    mds: 1/1 daemons up, 2 standby
    osd: 259 osds: 259 up (since 16m), 259 in (since 18m); 208 remapped pgs
         flags nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 3122 pgs
    objects: 1.22G objects, 1.8 PiB
    usage:   2.4 PiB used, 1.2 PiB / 3.6 PiB avail
    pgs:     318827/9214981270 objects degraded (0.003%)
             78100572/9214981270 objects misplaced (0.848%)
             2554 active+clean
             360  active+clean+scrubbing
             99   active+remapped+backfilling
             96   active+remapped+backfill_toofull
             8    active+undersized+remapped+backfill_toofull
             3    active+undersized+remapped+backfilling
             1    active+undersized+degraded+remapped+backfilling
             1    active+remapped+backfill_wait+backfill_toofull

  io:
    client:   1.6 MiB/s rd, 43 KiB/s wr, 1.15k op/s rd, 17 op/s wr
    recovery: 1.7 GiB/s, 1.62k objects/s

@tatuylonen (Author)

I just noticed there was already another issue about this same problem (#39).
