
Proposal: Range-free Compaction #21284

@silentred

Description


Summary

We propose a new compaction method that reduces the time required for etcdserver compaction and decreases I/O consumption during the compaction process.

Motivation

Thanks to performance optimizations in free-page (freelist) allocation within BoltDB, etcdserver now supports database sizes of up to approximately 100 GB. However, large databases are highly unfavorable for compaction: each compaction operation must traverse all revisions in the entire database, regardless of how much data has actually been updated. This leads to two major issues:

  1. Traversing the entire database generates substantial I/O load—monitoring tools often show I/O utilization approaching 100%, which negatively impacts concurrent read and write requests.
  2. After traversing every CompactionBatchLimit number of revisions, the compaction process sleeps for a duration of CompactionSleepInterval. This can result in significant unnecessary sleep periods, as many traversed revision ranges may contain no historical data requiring cleanup.

We introduce a novel compaction implementation that avoids these problems while preserving data correctness.

Proposal

Overview of range-free compaction

The current compaction logic consists of two parts:

  • treeIndex executes the Compact method, traverses all keyIndex objects, determines the set of revisions that need to be retained in the current state, and returns this set.
  • store traverses all revisions in the database, checks each revision against the retained set, deletes unreferenced revisions, and commits the deletions in batches.

The proposed Range-free Compaction logic works as follows:

  • The Index directly computes the set of revisions to be deleted and returns this set.
  • The store receives this set of revisions, sorts them, deletes them sequentially, and commits the deletions in batches.

Details

Changes of keyIndex

Modify the compact method signature to add a new output parameter, delete map[Revision]struct{}, which collects the set of revisions to be deleted for a single keyIndex during compaction.

-func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}) {

+func (ki *keyIndex) compact(lg *zap.Logger, atRev int64, available map[Revision]struct{}, delete map[Revision]struct{}) {

Changes of Index interface

Update the Index interface’s Compact method to include an additional return value that carries the set of revisions to be deleted after treeIndex compaction.

-func (ti *treeIndex) Compact(rev int64) map[Revision]struct{} {

+func (ti *treeIndex) Compact(rev int64) (map[Revision]struct{}, map[Revision]struct{}) {
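A simplified sketch of how the new second return value could be aggregated across all keyIndex objects follows. The structs here are toy stand-ins for etcd's internals (a real keyIndex tracks generations, not a flat slice); the shape of the flow is what matters: each per-key compact fills both the available and delete sets, and treeIndex.Compact returns both.

```go
package main

import "fmt"

// Revision is a simplified stand-in for etcd's mvcc Revision.
type Revision struct{ Main int64 }

// keyIndex holds one key's revision history (illustrative only).
type keyIndex struct {
	key  string
	revs []Revision
}

// compact mirrors the proposed signature: for this key, it adds the
// revisions to retain at atRev into available, and the rest into del.
func (ki *keyIndex) compact(atRev int64, available, del map[Revision]struct{}) {
	// Keep the newest revision at or below atRev, plus everything newer.
	keep := -1
	for i, r := range ki.revs {
		if r.Main <= atRev {
			keep = i
		}
	}
	for i, r := range ki.revs {
		if i == keep || r.Main > atRev {
			available[r] = struct{}{}
		} else {
			del[r] = struct{}{}
		}
	}
}

// treeIndex aggregates per-key results, as the proposed
// Compact(rev) (available, toDelete) would.
type treeIndex struct{ kis []*keyIndex }

func (ti *treeIndex) Compact(rev int64) (map[Revision]struct{}, map[Revision]struct{}) {
	available := map[Revision]struct{}{}
	toDelete := map[Revision]struct{}{}
	for _, ki := range ti.kis {
		ki.compact(rev, available, toDelete)
	}
	return available, toDelete
}

func main() {
	ti := &treeIndex{kis: []*keyIndex{
		{key: "a", revs: []Revision{{1}, {3}}},
		{key: "b", revs: []Revision{{2}, {5}}},
	}}
	avail, del := ti.Compact(4)
	fmt.Println(len(avail), len(del)) // 3 1
}
```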

Changes of db compaction

Add a new method *store.scheduleCompactionRangeFree with the following logic.

func (s *store) scheduleCompactionRangeFree(compactMainRev, prevCompactRev int64) (KeyValueHash, error) {
	// Compact the index and obtain the set of revisions to delete.
	_, toDelete := s.kvindex.Compact(compactMainRev)

	// Convert the toDelete map into a slice and sort it so deletions walk
	// the key bucket in order, deterministically.
	revs := toSlice(toDelete)
	sortRevisions(revs)

	// Delete the revisions directly; no range scan over the database.
	tx := s.b.BatchTx()
	deletedCount := 0
	for _, rev := range revs {
		keyBytes := ... // encode rev as its bucket key, as the current compaction does
		tx.UnsafeDelete(schema.Key, keyBytes)
		deletedCount++

		// Commit in batches, sleeping between batches to limit the
		// impact on concurrent requests.
		if deletedCount%batchNum == 0 {
			s.b.ForceCommit()
			// sleep for CompactionSleepInterval
		}
	}

	UnsafeSetFinishedCompact(tx, compactMainRev)
	return hash, nil
}

Backward Compatibility

Incompatible KeyValueHash

Since range-free compaction does not traverse revision data, the hash computation logic is incompatible with the previous approach. This incompatibility is a significant issue when the user has enabled the corruption check.

Correctness

We need to update certain unit tests to ensure the correctness of compaction in keyIndex, treeIndex, and store. We also need to verify that the results of the new compaction method are consistent with those of the current implementation. The test logic is as follows:

  1. Create a new store.
  2. Write a fixed set of data, randomly triggering compaction operations during the process.
  3. Perform a final compaction, iterate through all data, compute a hash, and then close the store.
  4. Create another new store that uses range-free compaction.
  5. Repeat steps 2 and 3 to obtain a new hash, and compare it with the previous hash to ensure data consistency.
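The test logic above can be sketched with a deliberately tiny in-memory model. Everything here is illustrative scaffolding (a single key, fixed compaction points instead of random ones, a CRC in place of the real KeyValueHash); the point is the shape of the check: run the same workload under both compaction styles and compare the resulting hashes.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// toyStore is a drastically simplified stand-in for the mvcc store: one
// key whose history is a map from revision to value.
type toyStore struct{ data map[int64]string }

func newToyStore() *toyStore { return &toyStore{data: map[int64]string{}} }

func (s *toyStore) put(rev int64, v string) { s.data[rev] = v }

// retained returns the revisions to keep at atRev: the newest revision at
// or below atRev, plus every revision above it.
func (s *toyStore) retained(atRev int64) map[int64]struct{} {
	keep := map[int64]struct{}{}
	newest := int64(-1)
	for rev := range s.data {
		if rev <= atRev && rev > newest {
			newest = rev
		}
		if rev > atRev {
			keep[rev] = struct{}{}
		}
	}
	if newest >= 0 {
		keep[newest] = struct{}{}
	}
	return keep
}

// compactRangeBased mimics the current logic: scan every revision and
// drop those not in the retained set.
func (s *toyStore) compactRangeBased(atRev int64) {
	keep := s.retained(atRev)
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			delete(s.data, rev)
		}
	}
}

// compactRangeFree mimics the proposal: compute the delete set up front,
// sort it, and delete exactly those revisions.
func (s *toyStore) compactRangeFree(atRev int64) {
	keep := s.retained(atRev)
	var toDelete []int64
	for rev := range s.data {
		if _, ok := keep[rev]; !ok {
			toDelete = append(toDelete, rev)
		}
	}
	sort.Slice(toDelete, func(i, j int) bool { return toDelete[i] < toDelete[j] })
	for _, rev := range toDelete {
		delete(s.data, rev)
	}
}

// hash folds the surviving revisions, in order, into a CRC (step 3).
func (s *toyStore) hash() uint32 {
	revs := make([]int64, 0, len(s.data))
	for rev := range s.data {
		revs = append(revs, rev)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] < revs[j] })
	h := crc32.NewIEEE()
	for _, rev := range revs {
		fmt.Fprintf(h, "%d=%s;", rev, s.data[rev])
	}
	return h.Sum32()
}

// run replays the same fixed workload under one compaction style (step 2).
func run(rangeFree bool) uint32 {
	s := newToyStore()
	for rev := int64(1); rev <= 20; rev++ {
		s.put(rev, fmt.Sprintf("v%d", rev))
		if rev == 7 || rev == 14 { // fixed points in place of random triggers
			if rangeFree {
				s.compactRangeFree(rev - 2)
			} else {
				s.compactRangeBased(rev - 2)
			}
		}
	}
	return s.hash()
}

func main() {
	fmt.Println(run(false) == run(true)) // true
}
```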

FeatureGate

This new feature can be enabled via a FeatureGate, allowing users to opt in or out at their discretion.
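Assuming the feature is registered with etcd's server feature gates, opting in could look like the following (the gate name RangeFreeCompaction is hypothetical; use whatever name the implementation registers):

```shell
# Hypothetical gate name for this proposal's feature.
etcd --feature-gates=RangeFreeCompaction=true
```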
Performance

Conditions:

  • 2M keys in total, each with a 4 KB value
  • Total database size: 40 GB

server flags:

--auto-compaction-retention=5m 
--backend-batch-interval=100ms
--backend-batch-limit=1000

client write command:

while true; do _build/benchmark put --endpoints=http://127.0.0.1:3379,http://127.0.0.1:23379,http://127.0.0.1:33379 --clients 100 --conns 20 --key-size 128 --val-size 4000 --total 2000000 --key-space-size 2000000 --key-prefix aa --rate 7000 --sequential-keys ; done
| TPS of Put | Cost time of original compact | Cost time of range-free compact | Improvement |
| --- | --- | --- | --- |
| 1K/s | 32.140923178s | 6.051189144s | 72% |
| 2K/s | 38.630751905s | 12.803279511s | 68% |
| 5K/s | 1m0.072742644s | 31.387194197s | 48% |
| 7K/s | 1m20.295984944s | 42.799126317s | 47% |
