feature: add new compactor based revision count #16427

fuweid · 2023-08-16T15:40:54Z

Proposal: #16426


What would you like to be added?

Add a new compactor based on revision count instead of a fixed time interval.

To make this work, the mvcc store needs to export a `CompactNotify` function that notifies the compactor once the configured number of write transactions has occurred since the previous compaction. The new compactor can then pick up the revision change and delete out-of-date data promptly, instead of waiting for a fixed interval to elapse. The underlying bbolt db can reuse the freed pages as soon as possible.

Why is this needed?

In a Kubernetes cluster running, for instance, Argo Workflows, there are batch requests to create pods, followed by many PATCH requests for pod status, especially when a pod has more than 3 containers. If such a burst of requests grows the db in a short time, it can easily exceed the max quota size. The cluster admin then has to step in and defrag, which may cause long downtime. So we hope etcd can delete out-of-date data as soon as possible and slow the growth of the total db size.

Currently, both the revision and periodic compactors are driven by time. A fixed interval copes poorly with unexpected bursts of update requests. A compactor based on revision count can make the admin's life easier. For instance, say the average object size is 50 KiB and the new compactor compacts every 10,000 revisions. Then etcd compacts after roughly 500 MiB of new data has come in, no matter how long it takes to accumulate those 10,000 revisions. This handles bursts of update requests well.

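
To illustrate the trigger described above, here is a minimal sketch. It is not the code in this PR: `revisionCountCompactor`, `onRevision`, and `bytesBetweenCompactions` are hypothetical names, the `compact` callback stands in for the real compaction call, and the sketch assumes the compactor simply fires once the revision has advanced by the configured count.

```go
package main

import "fmt"

// revisionCountCompactor is a hypothetical, simplified model of the proposed
// compactor. The real one would be driven by the mvcc store's CompactNotify.
type revisionCountCompactor struct {
	threshold   int64           // number of new revisions that triggers a compaction
	lastCompact int64           // revision at which the previous compaction ran
	compact     func(rev int64) // stand-in for the real compaction call (assumption)
}

// onRevision is called with the store's current revision after a write.
// It reports whether a compaction was triggered.
func (c *revisionCountCompactor) onRevision(current int64) bool {
	if current-c.lastCompact < c.threshold {
		return false
	}
	c.compact(current)
	c.lastCompact = current
	return true
}

// bytesBetweenCompactions estimates how much new data is written between two
// compactions, given an average object size and the revision threshold.
func bytesBetweenCompactions(avgObjectBytes, threshold int64) int64 {
	return avgObjectBytes * threshold
}

func main() {
	var compacted []int64
	c := &revisionCountCompactor{
		threshold: 10000,
		compact:   func(rev int64) { compacted = append(compacted, rev) },
	}
	// Simulate 25,000 writes, each producing one new revision.
	for rev := int64(1); rev <= 25000; rev++ {
		c.onRevision(rev)
	}
	fmt.Println("compacted at revisions:", compacted)

	est := bytesBetweenCompactions(50*1024, 10000)
	fmt.Printf("approx. data between compactions: %.1f MiB\n", float64(est)/(1024*1024))
}
```

With a 50 KiB average object and a threshold of 10,000, the estimate is 512,000,000 bytes, about 488 MiB, which the description rounds to 500 MiB. The interval between compactions depends only on the write rate, not on wall-clock time.
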
There are some test results:

* Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000

```
benchmark put --rate=100 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240
```

| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 570 MiB | 208 MiB |
| Periodic(1m) | 232 MiB | 165 MiB |
| Periodic(30s) | 151 MiB | 127 MiB |
| NewRevision(retention:10000) | 195 MiB | 187 MiB |

* Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000

```
benchmark put --rate=150 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=1024
```

| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 718 MiB | 554 MiB |
| Periodic(1m) | 297 MiB | 246 MiB |
| Periodic(30s) | 185 MiB | 146 MiB |
| NewRevision(retention:10000) | 186 MiB | 178 MiB |

* Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000

```
benchmark put --rate=200 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=4096
```

| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 874 MiB | 221 MiB |
| Periodic(1m) | 357 MiB | 260 MiB |
| Periodic(30s) | 215 MiB | 151 MiB |
| NewRevision(retention:10000) | 182 MiB | 176 MiB |

To handle bursts of requests with the periodic compactor, we need to use a short interval; otherwise, the total size grows large. I think the new compactor can handle this well.

Additional Change:

Currently, the quota system only checks the DB total size. However, there could be a lot of free pages that can be reused for upcoming requests. Based on this proposal, I also want to extend the current quota system with the DB's InUse size. If the InUse size is less than the max quota size, we should allow update requests.

Since bbolt might still grow the file when no contiguous free pages are available, we should set a hard limit on the overflow, like 1 GiB.

```diff
 // Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
 		return true
 	}
 	// TODO: maybe optimize Backend.Size()
-	return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+	// Since the compact comes with allocatable pages, we should check the
+	// SizeInUse first. If there is no continuous pages for key/value and
+	// the boltdb continues to resize, it should not increase more than 1
+	// GiB. It's hard limitation.
+	//
+	// TODO: It should be enabled by flag.
+	if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+		return false
+	}
+	return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
 }
```

It is also likely that the NOSPACE alarm can be disabled if compaction frees up many more pages, which can reduce downtime.

Signed-off-by: Wei Fu <[email protected]>
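
The diff calls `maxAllowedOverflowBytes` without showing its body. As a rough sketch of how the two-step check behaves, the following models it with plain integers. Everything here is an assumption for illustration: `available` is a hypothetical stand-in for `BackendQuota.Available`, and `maxAllowedOverflowBytes` is assumed to return a flat 1 GiB cap, following the "like 1 GiB" in the description.

```go
package main

import "fmt"

const gib = int64(1) << 30

// maxAllowedOverflowBytes is assumed here to be a flat 1 GiB cap; the real
// helper in the PR may compute the limit differently.
func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	return gib
}

// available mirrors the check in the diff, with plain integers in place of
// the backend: size is the DB total size, sizeInUse is the DB InUse size.
func available(size, sizeInUse, cost, maxBackendBytes int64) bool {
	// Hard limit: the file may not grow past the quota by the overflow cap.
	if size+cost-maxBackendBytes >= maxAllowedOverflowBytes(maxBackendBytes) {
		return false
	}
	// Soft check: only the pages actually in use count against the quota.
	return sizeInUse+cost < maxBackendBytes
}

func main() {
	quota := 2 * gib
	cost := int64(1024)

	// File is 512 MiB over quota, but only 1 GiB is in use: allowed.
	fmt.Println(available(quota+gib/2, gib, cost, quota))
	// File is a full 1 GiB over quota: rejected by the hard limit.
	fmt.Println(available(quota+gib, gib, cost, quota))
	// In-use size already at the quota: rejected by the InUse check.
	fmt.Println(available(quota, quota, cost, quota))
}
```

The first check bounds how far the file can grow past the quota; the second lets writes continue as long as the in-use pages fit within it.
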

stale · 2024-03-17T12:56:51Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

fuweid mentioned this pull request Aug 16, 2023

feature: add new compactor based revision count #16426

Open

stale bot added the stale label Mar 17, 2024

fuweid removed the stale label Mar 21, 2024
