
Proposal: Range-free Compaction #21284

@silentred

Description


Summary

We propose a new compaction method that reduces the time required for etcdserver compaction and decreases I/O consumption during the compaction process.

Motivation

Thanks to performance optimizations in free-page (freelist) allocation within BoltDB, etcdserver now supports database sizes of up to approximately 100 GB. However, large databases are highly unfavorable for compaction: each compaction operation must traverse all revisions in the entire database, regardless of how much data has actually been updated. This leads to two major issues:

  1. Traversing the entire database generates substantial I/O load—monitoring tools often show I/O utilization approaching 100%, which negatively impacts concurrent read and write requests.
  2. After traversing every CompactionBatchLimit number of revisions, the compaction process sleeps for a duration of CompactionSleepInterval. This can result in significant unnecessary sleep periods, as many traversed revision ranges may contain no historical data requiring cleanup.

We introduce a novel compaction implementation that avoids these problems while preserving data correctness.

Proposal

Overview of range-free compaction

The current compaction logic consists of two parts:

  • treeIndex executes the Compact method, traverses all keyIndex objects, determines the set of revisions that need to be retained in the current state, and returns this set.
  • store traverses all revisions in the database, checks each revision against the retained set, deletes unreferenced revisions, and commits the deletions in batches.

The proposed Range-free Compaction logic works as follows:

  • The Index directly computes the set of revisions to be deleted and returns this set.
  • The store receives this set of revisions, sorts them, deletes them sequentially, and commits the deletions in batches.

Details

Changes of keyIndex

Modify the compact method signature to add a new output parameter, delete map[Revision]struct{}, which collects the set of revisions to be deleted for a single keyIndex during compaction.

-func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}) {

+func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}, delete map[Revision]struct{}) {

Changes of Index interface

Update the Index interface’s Compact method to include an additional return value that carries the set of revisions to be deleted after treeIndex compaction.

-func (ti *treeIndex) Compact(rev int64) map[Revision]struct{} {

+func (ti *treeIndex) Compact(rev int64) (map[Revision]struct{}, map[Revision]struct{}) {
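A simplified sketch of how the new second return value could be aggregated across all keyIndex objects follows. The structs here are toy stand-ins for etcd's internals (a real keyIndex tracks generations, not a flat slice); the shape of the flow is what matters: each per-key compact fills both the available and delete sets, and treeIndex.Compact returns both.

```go
package main

import "fmt"

// Revision is a simplified stand-in for etcd's mvcc Revision.
type Revision struct{ Main int64 }

// keyIndex holds one key's revision history (illustrative only).
type keyIndex struct {
	key  string
	revs []Revision
}

// compact mirrors the proposed signature: for this key, it adds the
// revisions to retain at atRev into available, and the rest into del.
func (ki *keyIndex) compact(atRev int64, available, del map[Revision]struct{}) {
	// Keep the newest revision at or below atRev, plus everything newer.
	keep := -1
	for i, r := range ki.revs {
		if r.Main <= atRev {
			keep = i
		}
	}
	for i, r := range ki.revs {
		if i == keep || r.Main > atRev {
			available[r] = struct{}{}
		} else {
			del[r] = struct{}{}
		}
	}
}

// treeIndex aggregates per-key results, as the proposed
// Compact(rev) (available, toDelete) would.
type treeIndex struct{ kis []*keyIndex }

func (ti *treeIndex) Compact(rev int64) (map[Revision]struct{}, map[Revision]struct{}) {
	available := map[Revision]struct{}{}
	toDelete := map[Revision]struct{}{}
	for _, ki := range ti.kis {
		ki.compact(rev, available, toDelete)
	}
	return available, toDelete
}

func main() {
	ti := &treeIndex{kis: []*keyIndex{
		{key: "a", revs: []Revision{{1}, {3}}},
		{key: "b", revs: []Revision{{2}, {5}}},
	}}
	avail, del := ti.Compact(4)
	fmt.Println(len(avail), len(del)) // 3 1
}
```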

Changes of db compaction

Add a new method *store.scheduleCompactionRangeFree with the following logic.

func (s *store) scheduleCompactionRangeFree(compactMainRev, prevCompactRev int64) (KeyValueHash, error) {
	// Compact the index and obtain the set of revisions to delete.
	_, toDelete := s.kvindex.Compact(compactMainRev)

	// Convert the toDelete map into a slice and sort it so deletions walk
	// the key bucket in order, deterministically.
	revs := toSlice(toDelete)
	sortRevisions(revs)

	// Delete the revisions directly; no range scan over the database.
	tx := s.b.BatchTx()
	deletedCount := 0
	for _, rev := range revs {
		keyBytes := ... // encode rev as its bucket key, as the current compaction does
		tx.UnsafeDelete(schema.Key, keyBytes)
		deletedCount++

		// Commit in batches, sleeping between batches to limit the
		// impact on concurrent requests.
		if deletedCount%batchNum == 0 {
			s.b.ForceCommit()
			// sleep for CompactionSleepInterval
		}
	}

	UnsafeSetFinishedCompact(tx, compactMainRev)
	return hash, nil
}

Backward Compatibility

Incompatible KeyValueHash

Since range-free compaction does not traverse revision data, the hash computation logic is incompatible with the previous approach. This incompatibility is a significant issue when the user has enabled the corruption check.

Correctness

We need to update certain unit tests to ensure the correctness of compaction in keyIndex, treeIndex, and store. We also need to verify that the results of the new compaction method are consistent with those of the current implementation. The test logic is as follows:

  1. Create a new store.
  2. Write a fixed set of data, randomly triggering compaction operations during the process.
  3. Perform a final compaction, iterate through all data, compute a hash, and then close the store.
  4. Create another new store that uses range-free compaction.
  5. Repeat steps 2 and 3 to obtain a new hash, and compare it with the previous hash to ensure data consistency.
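The test logic above can be sketched with a deliberately tiny in-memory model. Everything here is illustrative scaffolding (a single key, fixed compaction points instead of random ones, a CRC in place of the real KeyValueHash); the point is the shape of the check: run the same workload under both compaction styles and compare the resulting hashes.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// toyStore is a drastically simplified stand-in for the mvcc store: one
// key whose history is a map from revision to value.
type toyStore struct{ data map[int64]string }

func newToyStore() *toyStore { return &toyStore{data: map[int64]string{}} }

func (s *toyStore) put(rev int64, v string) { s.data[rev] = v }

// retained returns the revisions to keep at atRev: the newest revision at
// or below atRev, plus every revision above it.
func (s *toyStore) retained(atRev int64) map[int64]struct{} {
	keep := map[int64]struct{}{}
	newest := int64(-1)
	for rev := range s.data {
		if rev <= atRev && rev > newest {
			newest = rev
		}
		if rev > atRev {
			keep[rev] = struct{}{}
		}
	}
	if newest >= 0 {
		keep[newest] = struct{}{}
	}
	return keep
}

// compactRangeBased mimics the current logic: scan every revision and
// drop those not in the retained set.
func (s *toyStore) compactRangeBased(atRev int64) {
	keep := s.retained(atRev)
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			delete(s.data, rev)
		}
	}
}

// compactRangeFree mimics the proposal: compute the delete set up front,
// sort it, and delete exactly those revisions.
func (s *toyStore) compactRangeFree(atRev int64) {
	keep := s.retained(atRev)
	var toDelete []int64
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			toDelete = append(toDelete, rev)
		}
	}
	sort.Slice(toDelete, func(i, j int) bool { return toDelete[i] < toDelete[j] })
	for _, rev := range toDelete {
		delete(s.data, rev)
	}
}

// hash folds the surviving revisions, in order, into a CRC (step 3).
func (s *toyStore) hash() uint32 {
	revs := make([]int64, 0, len(s.data))
	for rev := range s.data {
		revs = append(revs, rev)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	h := crc32.NewIEEE()
	for _, rev := range revs {
		fmt.Fprintf(h, "%d=%s;", rev, s.data[rev])
	}
	return h.Sum32()
}

// run replays the same fixed workload under one compaction style (step 2).
func run(rangeFree bool) uint32 {
	s := newToyStore()
	for rev := int64(1); rev <= 20; rev++ {
		s.put(rev, fmt.Sprintf("v%d", rev))
		if rev == 7 || rev == 14 { // fixed points in place of random triggers
			if rangeFree {
				s.compactRangeFree(rev - 2)
			} else {
				s.compactRangeBased(rev - 2)
			}
		}
	}
	return s.hash()
}

func main() {
	fmt.Println(run(false) == run(true)) // true
}
```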

FeatureGate

This new feature can be enabled via a FeatureGate, allowing users to opt in or out at their discretion.
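Assuming the feature is registered with etcd's server feature gates, opting in could look like the following (the gate name RangeFreeCompaction is hypothetical; use whatever name the implementation registers):

```shell
# Hypothetical gate name for this proposal's feature.
etcd --feature-gates=RangeFreeCompaction=true
```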
Performance

Conditions:

  • 2M keys in total, each with a 4 KB value
  • Total database size: 40 GB

server flags:

--auto-compaction-retention=5m 
--backend-batch-interval=100ms
--backend-batch-limit=1000

client write command:

while true; do _build/benchmark put --endpoints=http://127.0.0.1:3379,http://127.0.0.1:23379,http://127.0.0.1:33379 --clients 100 --conns 20 --key-size 128 --val-size 4000 --total 2000000 --key-space-size 2000000 --key-prefix aa --rate 7000 --sequential-keys ; done
| TPS of Put | Cost time of original compact | Cost time of range-free compact | Improvement |
| --- | --- | --- | --- |
| 1K/s | 32.140923178s | 6.051189144s | 72% |
| 2K/s | 38.630751905s | 12.803279511s | 68% |
| 5K/s | 1m0.072742644s | 31.387194197s | 48% |
| 7K/s | 1m20.295984944s | 42.799126317s | 47% |
