Summary
We propose a new compaction method that reduces the time required for etcdserver compaction and decreases I/O consumption during the compaction process.
Motivation
Thanks to performance optimizations in free-page allocation within BoltDB, etcdserver now supports database sizes of up to approximately 100 GB. However, large databases are highly unfavorable for compaction: each compaction operation must traverse all revisions in the entire database, regardless of how much data has actually been updated. This leads to two major issues:
- Traversing the entire database generates substantial I/O load: monitoring tools often show I/O utilization approaching 100%, which degrades concurrent read and write requests.
- After traversing every CompactionBatchLimit revisions, the compaction process sleeps for a duration of CompactionSleepInterval. This can result in significant unnecessary sleep time, as many of the traversed revision ranges may contain no historical data requiring cleanup.
We introduce a novel compaction implementation that avoids these problems while preserving data correctness.
Proposal
Overview of range-free compaction
The current compaction logic consists of two parts:
- treeIndex executes the Compact method, traverses all keyIndex objects, determines the set of revisions that need to be retained in the current state, and returns this set.
- store traverses all revisions in the database, checks each revision against the retained set, deletes unreferenced revisions, and commits the deletions in batches.
The proposed Range-free Compaction logic works as follows:
- The Index directly computes the set of revisions to be deleted and returns this set.
- The store receives this set of revisions, sorts them, deletes them sequentially, and commits the deletions in batches.
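The store-side flow above can be sketched in Go. Everything here is illustrative: `Revision` is a simplified stand-in for etcd's mvcc revision type, and `rangeFreeCompact`, `del`, and the batch size are hypothetical names rather than real etcd APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// Revision mirrors etcd's mvcc revision: Main is the transaction
// counter, Sub orders writes within one transaction. (Simplified
// stand-in for the real mvcc.Revision type.)
type Revision struct {
	Main, Sub int64
}

// rangeFreeCompact sketches the proposed flow: the index hands back the
// exact set of revisions to delete, which is sorted and removed in
// batches instead of scanning every revision in the backend.
func rangeFreeCompact(toDelete map[Revision]struct{}, batch int, del func(Revision)) {
	revs := make([]Revision, 0, len(toDelete))
	for r := range toDelete {
		revs = append(revs, r)
	}
	sort.Slice(revs, func(i, j int) bool {
		if revs[i].Main != revs[j].Main {
			return revs[i].Main < revs[j].Main
		}
		return revs[i].Sub < revs[j].Sub
	})
	for i, r := range revs {
		del(r)
		if (i+1)%batch == 0 {
			// The real store would ForceCommit here and optionally sleep.
		}
	}
}

func main() {
	toDelete := map[Revision]struct{}{
		{Main: 3, Sub: 0}: {}, {Main: 1, Sub: 1}: {}, {Main: 2, Sub: 0}: {},
	}
	var deleted []Revision
	rangeFreeCompact(toDelete, 2, func(r Revision) { deleted = append(deleted, r) })
	fmt.Println(deleted) // prints [{1 1} {2 0} {3 0}]
}
```

Sorting before deletion keeps the iteration deterministic and lets deletes hit the B-tree in key order, which is friendlier to sequential page access.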
Details
Changes of keyIndex
Modify the compact method signature to add a new output parameter, delete map[Revision]struct{}, which collects the set of revisions to be deleted for a single keyIndex after compaction.
-func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}) {
+func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}, delete map[Revision]struct{}) {
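A simplified, hypothetical model of the modified per-key compaction logic is sketched below: for revisions at or below atRev, the newest survivor goes into available and everything older goes into delete. The real keyIndex walks generations; this flat slice version only illustrates the two-set contract.

```go
package main

import "fmt"

// Revision is a simplified stand-in for etcd's mvcc revision type.
type Revision struct{ Main, Sub int64 }

// compactKeyIndex models the changed keyIndex.compact: revisions for one
// key are walked once; the newest revision at or below atRev is recorded
// in available, older compacted revisions are recorded in delete.
func compactKeyIndex(revs []Revision, atRev int64,
	available, delete map[Revision]struct{}) {
	var newest *Revision
	for i := range revs {
		r := revs[i]
		if r.Main > atRev {
			continue // still live, untouched by compaction
		}
		if newest == nil || r.Main > newest.Main {
			if newest != nil {
				delete[*newest] = struct{}{} // superseded survivor
			}
			cp := r
			newest = &cp
		} else {
			delete[r] = struct{}{}
		}
	}
	if newest != nil {
		available[*newest] = struct{}{}
	}
}

func main() {
	available := map[Revision]struct{}{}
	toDelete := map[Revision]struct{}{}
	// One key written at revisions 1, 3 and 5; compact at revision 4:
	// revision 3 is kept, revision 1 is deletable, revision 5 is live.
	compactKeyIndex([]Revision{{Main: 1}, {Main: 3}, {Main: 5}}, 4, available, toDelete)
	fmt.Println(len(available), len(toDelete)) // prints 1 1
}
```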
Changes of Index interface
Update the Index interface’s Compact method to include an additional return value that carries the set of revisions to be deleted after treeIndex compaction.
-func (ti *treeIndex) Compact(rev int64) map[Revision]struct{} {
+func (ti *treeIndex) Compact(rev int64) (map[Revision]struct{}, map[Revision]struct{}) {
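The aggregation across all keyIndex objects can be sketched as follows. The names here are illustrative (keyCompactFn stands in for the per-key compact call); the point is only the new contract: Compact returns both the retained set and the delete set, so the store never has to scan the backend to discover garbage.

```go
package main

import "fmt"

// Revision is a simplified stand-in for etcd's mvcc revision type.
type Revision struct{ Main, Sub int64 }

// keyCompactFn stands in for keyIndex.compact after the proposed change:
// it fills both the retained set and the delete set for one key.
type keyCompactFn func(atRev int64, available, del map[Revision]struct{})

// treeCompact models the updated treeIndex.Compact contract: walk every
// keyIndex once and return (retained, to-delete).
func treeCompact(keys []keyCompactFn, atRev int64) (map[Revision]struct{}, map[Revision]struct{}) {
	available := map[Revision]struct{}{}
	del := map[Revision]struct{}{}
	for _, compact := range keys {
		compact(atRev, available, del)
	}
	return available, del
}

func main() {
	// Two keys: key A keeps rev 4 and deletes rev 2; key B keeps rev 3.
	keys := []keyCompactFn{
		func(atRev int64, available, del map[Revision]struct{}) {
			available[Revision{Main: 4}] = struct{}{}
			del[Revision{Main: 2}] = struct{}{}
		},
		func(atRev int64, available, del map[Revision]struct{}) {
			available[Revision{Main: 3}] = struct{}{}
		},
	}
	kept, toDelete := treeCompact(keys, 5)
	fmt.Println(len(kept), len(toDelete)) // prints 2 1
}
```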
Changes of db compaction
Add a new method, *store.scheduleCompactionRangeFree, with the following abridged logic (helpers such as tx, keyBytes, deletedCount, batchNum, and hash are elided for brevity):
func (s *store) scheduleCompactionRangeFree(compactMainRev, prevCompactRev int64) (KeyValueHash, error) {
	// Compact the index and obtain the set of revisions to delete.
	_, toDelete := s.kvindex.Compact(compactMainRev)
	// Convert the toDelete map to a sorted slice for deterministic iteration.
	revs := toSlice(toDelete)
	sort(revs)
	// Delete the revisions sequentially, committing in batches.
	for _, rev := range revs {
		tx.UnsafeDelete(schema.Key, keyBytes)
		if deletedCount%batchNum == 0 {
			s.b.ForceCommit()
			// sleep for CompactionSleepInterval
		}
	}
	UnsafeSetFinishedCompact(tx, compactMainRev)
	return hash, nil
}
Backward Compatibility
Incompatible KeyValueHash
Since range-free compaction does not traverse revision data, the hash computation logic is incompatible with the previous approach. This incompatibility is a significant issue if the user has enabled the corruption check.
Correctness
We need to update certain unit tests to ensure the correctness of compaction in keyIndex, treeIndex, and store. We also need to verify that the results of the new compaction method are consistent with those of the current implementation. The test logic is as follows:
- Create a new store.
- Write a fixed set of data, randomly triggering compaction operations during the process.
- Perform a final compaction, iterate through all data, compute a hash, and then close the store.
- Create another new store that uses range-free compaction.
- Repeat steps 2 and 3 to obtain a new hash, and compare it with the previous hash to ensure data consistency.
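The test plan above can be modeled with a toy in-memory store: apply the same writes and interleaved compactions to two stores, one using scan-based compaction and one using range-free compaction, then compare a hash of the surviving revisions. All types and helpers below are illustrative, not etcd's real ones; in particular, the toy derives the delete set locally, whereas the real proposal receives it from the index.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// store is a toy model: one string value per main revision.
type store struct {
	data map[int64]string // main revision -> key it wrote
}

func newStore() *store { return &store{data: map[int64]string{}} }

func (s *store) put(rev int64, key string) { s.data[rev] = key }

// keepSet returns, per key, the newest revision <= atRev, plus every
// revision > atRev (which compaction must not touch).
func (s *store) keepSet(atRev int64) map[int64]struct{} {
	newest := map[string]int64{}
	keep := map[int64]struct{}{}
	for rev, key := range s.data {
		if rev > atRev {
			keep[rev] = struct{}{}
		} else if rev > newest[key] {
			newest[key] = rev
		}
	}
	for _, rev := range newest {
		keep[rev] = struct{}{}
	}
	return keep
}

// compactScan mimics the current approach: traverse every revision and
// delete the ones not in the retained set.
func (s *store) compactScan(atRev int64) {
	keep := s.keepSet(atRev)
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			delete(s.data, rev)
		}
	}
}

// compactRangeFree mimics the proposal: only the known-deletable
// revisions are sorted and removed; nothing else is visited.
func (s *store) compactRangeFree(atRev int64) {
	keep := s.keepSet(atRev)
	var toDelete []int64
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			toDelete = append(toDelete, rev)
		}
	}
	sort.Slice(toDelete, func(i, j int) bool { return toDelete[i] < toDelete[j] })
	for _, rev := range toDelete {
		delete(s.data, rev)
	}
}

// hash folds the surviving revisions, in order, into a SHA-256 digest.
func (s *store) hash() [32]byte {
	revs := make([]int64, 0, len(s.data))
	for rev := range s.data {
		revs = append(revs, rev)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	h := sha256.New()
	for _, rev := range revs {
		fmt.Fprintf(h, "%d=%s;", rev, s.data[rev])
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	a, b := newStore(), newStore()
	for rev := int64(1); rev <= 20; rev++ {
		key := fmt.Sprintf("k%d", rev%5)
		a.put(rev, key)
		b.put(rev, key)
		if rev%7 == 0 { // compactions interleaved with writes
			a.compactScan(rev - 3)
			b.compactRangeFree(rev - 3)
		}
	}
	a.compactScan(18) // final compaction on both stores
	b.compactRangeFree(18)
	fmt.Println(a.hash() == b.hash()) // prints true
}
```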
FeatureGate
This new feature can be enabled via a FeatureGate, allowing users to opt in or out at their discretion.
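Assuming the gate ships under etcd's standard --feature-gates mechanism, enabling it might look like the following; the gate name RangeFreeCompaction is illustrative, not a committed flag value.

```shell
# Opt in to the new compaction path (gate name hypothetical).
etcd --feature-gates=RangeFreeCompaction=true

# Opt out explicitly.
etcd --feature-gates=RangeFreeCompaction=false
```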
Performance
Conditions:
- 2M keys in total, each with a 4 KB value
- Total DB size: 40 GB
server flags:
--auto-compaction-retention=5m
--backend-batch-interval=100ms
--backend-batch-limit=1000
client write command:
while true; do _build/benchmark put --endpoints=http://127.0.0.1:3379,http://127.0.0.1:23379,http://127.0.0.1:33379 --clients 100 --conns 20 --key-size 128 --val-size 4000 --total 2000000 --key-space-size 2000000 --key-prefix aa --rate 7000 --sequential-keys ; done
| Put TPS | Original compaction time | Range-free compaction time | Improvement |
|---|---|---|---|
| 1K/s | 32.140923178s | 6.051189144s | 72% |
| 2K/s | 38.630751905s | 12.803279511s | 68% |
| 5K/s | 1m0.072742644s | 31.387194197s | 48% |
| 7K/s | 1m20.295984944s | 42.799126317s | 47% |