Skip to content

Commit b8beaa7

Browse files
authored
Merge pull request #365 from Altinity/backports/23.3.19/56030_fix_JBOD_incorrect_free_space_accounting
23.3 Backport of ClickHouse#56030 - Fix incorrect free space accounting for `least_used` JBOD policy - take 2
2 parents 643f2c9 + f5bb1c1 commit b8beaa7

6 files changed

Lines changed: 107 additions & 16 deletions

File tree

docs/en/engines/table-engines/mergetree-family/mergetree.md

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -797,14 +797,15 @@ Storage policies configuration markup:
797797

798798
Tags:
799799

800-
- `policy_name_N` — Policy name. Policy names must be unique.
801-
- `volume_name_N` — Volume name. Volume names must be unique.
802-
- `disk` — a disk within a volume.
803-
- `max_data_part_size_bytes` — the maximum size of a part that can be stored on any of the volume’s disks. If the a size of a merged part estimated to be bigger than `max_data_part_size_bytes` then this part will be written to a next volume. Basically this feature allows to keep new/small parts on a hot (SSD) volume and move them to a cold (HDD) volume when they reach large size. Do not use this setting if your policy has only one volume.
804-
- `move_factor` — when the amount of available space gets lower than this factor, data automatically starts to move on the next volume if any (by default, 0.1). ClickHouse sorts existing parts by size from largest to smallest (in descending order) and selects parts with the total size that is sufficient to meet the `move_factor` condition. If the total size of all parts is insufficient, all parts will be moved.
805-
- `prefer_not_to_merge` — Disables merging of data parts on this volume. When this setting is enabled, merging data on this volume is not allowed. This allows controlling how ClickHouse works with slow disks.
806-
- `perform_ttl_move_on_insert` — Disables TTL move on data part INSERT. By default if we insert a data part that already expired by the TTL move rule it immediately goes to a volume/disk declared in move rule. This can significantly slowdown insert in case if destination volume/disk is slow (e.g. S3).
807-
- `load_balancing` - Policy for disk balancing, `round_robin` or `least_used`.
800+
- `policy_name_N` — Policy name. Policy names must be unique.
801+
- `volume_name_N` — Volume name. Volume names must be unique.
802+
- `disk` — a disk within a volume.
803+
- `max_data_part_size_bytes` — the maximum size of a part that can be stored on any of the volume’s disks. If the a size of a merged part estimated to be bigger than `max_data_part_size_bytes` then this part will be written to a next volume. Basically this feature allows to keep new/small parts on a hot (SSD) volume and move them to a cold (HDD) volume when they reach large size. Do not use this setting if your policy has only one volume.
804+
- `move_factor` — when the amount of available space gets lower than this factor, data automatically starts to move on the next volume if any (by default, 0.1). ClickHouse sorts existing parts by size from largest to smallest (in descending order) and selects parts with the total size that is sufficient to meet the `move_factor` condition. If the total size of all parts is insufficient, all parts will be moved.
805+
- `prefer_not_to_merge` — Disables merging of data parts on this volume. When this setting is enabled, merging data on this volume is not allowed. This allows controlling how ClickHouse works with slow disks.
806+
- `perform_ttl_move_on_insert` — Disables TTL move on data part INSERT. By default (if enabled) if we insert a data part that already expired by the TTL move rule it immediately goes to a volume/disk declared in move rule. This can significantly slowdown insert in case if destination volume/disk is slow (e.g. S3). If disabled then already expired data part is written into a default volume and then right after moved to TTL volume.
807+
- `load_balancing` - Policy for disk balancing, `round_robin` or `least_used`.
808+
- `least_used_ttl_ms` - Configure timeout (in milliseconds) for the updating available space on all disks (`0` - update always, `-1` - never update, default is `60000`). Note, if the disk can be used by ClickHouse only and is not subject to a online filesystem resize/shrink you can use `-1`, in all other cases it is not recommended, since eventually it will lead to incorrect space distribution.
808809

809810
Cofiguration examples:
810811

src/Disks/StoragePolicy.cpp

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -70,7 +70,8 @@ StoragePolicy::StoragePolicy(
7070
/* max_data_part_size_= */ 0,
7171
/* are_merges_avoided_= */ false,
7272
/* perform_ttl_move_on_insert_= */ true,
73-
VolumeLoadBalancing::ROUND_ROBIN);
73+
VolumeLoadBalancing::ROUND_ROBIN,
74+
/* least_used_ttl_ms_= */ 60'000);
7475
volumes.emplace_back(std::move(default_volume));
7576
}
7677

src/Disks/VolumeJBOD.cpp

Lines changed: 22 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -68,6 +68,7 @@ VolumeJBOD::VolumeJBOD(
6868
perform_ttl_move_on_insert = config.getBool(config_prefix + ".perform_ttl_move_on_insert", true);
6969

7070
are_merges_avoided = config.getBool(config_prefix + ".prefer_not_to_merge", false);
71+
least_used_ttl_ms = config.getUInt64(config_prefix + ".least_used_ttl_ms", 60'000);
7172
}
7273

7374
VolumeJBOD::VolumeJBOD(const VolumeJBOD & volume_jbod,
@@ -93,6 +94,11 @@ DiskPtr VolumeJBOD::getDisk(size_t /* index */) const
9394
case VolumeLoadBalancing::LEAST_USED:
9495
{
9596
std::lock_guard lock(mutex);
97+
if (!least_used_ttl_ms || least_used_update_watch.elapsedMilliseconds() >= least_used_ttl_ms)
98+
{
99+
disks_by_size = LeastUsedDisksQueue(disks.begin(), disks.end());
100+
least_used_update_watch.restart();
101+
}
96102
return disks_by_size.top().disk;
97103
}
98104
}
@@ -127,11 +133,23 @@ ReservationPtr VolumeJBOD::reserve(UInt64 bytes)
127133
{
128134
std::lock_guard lock(mutex);
129135

130-
DiskWithSize disk = disks_by_size.top();
131-
disks_by_size.pop();
136+
ReservationPtr reservation;
137+
if (!least_used_ttl_ms || least_used_update_watch.elapsedMilliseconds() >= least_used_ttl_ms)
138+
{
139+
disks_by_size = LeastUsedDisksQueue(disks.begin(), disks.end());
140+
least_used_update_watch.restart();
132141

133-
ReservationPtr reservation = disk.reserve(bytes);
134-
disks_by_size.push(disk);
142+
DiskWithSize disk = disks_by_size.top();
143+
reservation = disk.reserve(bytes);
144+
}
145+
else
146+
{
147+
DiskWithSize disk = disks_by_size.top();
148+
disks_by_size.pop();
149+
150+
reservation = disk.reserve(bytes);
151+
disks_by_size.push(disk);
152+
}
135153

136154
return reservation;
137155
}

src/Disks/VolumeJBOD.h

Lines changed: 10 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,9 @@
44
#include <optional>
55

66
#include <Disks/IVolume.h>
7+
#include <base/defines.h>
8+
#include <base/types.h>
9+
#include <Common/Stopwatch.h>
710

811

912
namespace DB
@@ -22,9 +25,10 @@ using VolumesJBOD = std::vector<VolumeJBODPtr>;
2225
class VolumeJBOD : public IVolume
2326
{
2427
public:
25-
VolumeJBOD(String name_, Disks disks_, UInt64 max_data_part_size_, bool are_merges_avoided_, bool perform_ttl_move_on_insert_, VolumeLoadBalancing load_balancing_)
28+
VolumeJBOD(String name_, Disks disks_, UInt64 max_data_part_size_, bool are_merges_avoided_, bool perform_ttl_move_on_insert_, VolumeLoadBalancing load_balancing_, UInt64 least_used_ttl_ms_)
2629
: IVolume(name_, disks_, max_data_part_size_, perform_ttl_move_on_insert_, load_balancing_)
2730
, are_merges_avoided(are_merges_avoided_)
31+
, least_used_ttl_ms(least_used_ttl_ms_)
2832
{
2933
}
3034

@@ -69,7 +73,7 @@ class VolumeJBOD : public IVolume
6973
DiskPtr disk;
7074
uint64_t free_size = 0;
7175

72-
DiskWithSize(DiskPtr disk_)
76+
explicit DiskWithSize(DiskPtr disk_)
7377
: disk(disk_)
7478
, free_size(disk->getUnreservedSpace())
7579
{}
@@ -96,7 +100,10 @@ class VolumeJBOD : public IVolume
96100
/// Index of last used disk, for load_balancing=round_robin
97101
mutable std::atomic<size_t> last_used = 0;
98102
/// Priority queue of disks sorted by size, for load_balancing=least_used
99-
mutable std::priority_queue<DiskWithSize> disks_by_size;
103+
using LeastUsedDisksQueue = std::priority_queue<DiskWithSize>;
104+
mutable LeastUsedDisksQueue disks_by_size TSA_GUARDED_BY(mutex);
105+
mutable Stopwatch least_used_update_watch TSA_GUARDED_BY(mutex);
106+
UInt64 least_used_ttl_ms = 0;
100107

101108
/// True if parts on this volume participate in merges according to START/STOP MERGES ON VOLUME.
102109
std::atomic<std::optional<bool>> are_merges_avoided_user_override{std::nullopt};

tests/integration/test_jbod_load_balancing/configs/config.d/storage_configuration.xml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@
3131
<disk>jbod3</disk>
3232

3333
<load_balancing>least_used</load_balancing>
34+
<least_used_ttl_ms>0</least_used_ttl_ms>
3435
</disks>
3536
</volumes>
3637
</jbod_least_used>

tests/integration/test_jbod_load_balancing/test.py

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -134,3 +134,66 @@ def test_jbod_load_balancing_least_used_next_disk(start_cluster):
134134
]
135135
finally:
136136
node.query("DROP TABLE IF EXISTS data_least_used_next_disk SYNC")
137+
138+
139+
def test_jbod_load_balancing_least_used_detect_background_changes(start_cluster):
140+
def get_parts_on_disks():
141+
parts = node.query(
142+
"""
143+
SELECT count(), disk_name
144+
FROM system.parts
145+
WHERE table = 'data_least_used_detect_background_changes'
146+
GROUP BY disk_name
147+
ORDER BY disk_name
148+
"""
149+
)
150+
parts = [l.split("\t") for l in parts.strip().split("\n")]
151+
return parts
152+
153+
try:
154+
node.query(
155+
"""
156+
CREATE TABLE data_least_used_detect_background_changes (p UInt8)
157+
ENGINE = MergeTree
158+
ORDER BY tuple()
159+
SETTINGS storage_policy = 'jbod_least_used';
160+
161+
SYSTEM STOP MERGES data_least_used_detect_background_changes;
162+
"""
163+
)
164+
165+
node.exec_in_container(["fallocate", "-l200M", "/jbod3/.test"])
166+
node.query(
167+
"""
168+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
169+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
170+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
171+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
172+
"""
173+
)
174+
parts = get_parts_on_disks()
175+
assert parts == [
176+
["4", "jbod2"],
177+
]
178+
179+
node.exec_in_container(["rm", "/jbod3/.test"])
180+
node.query(
181+
"""
182+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
183+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
184+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
185+
INSERT INTO data_least_used_detect_background_changes SELECT * FROM numbers(10);
186+
"""
187+
)
188+
parts = get_parts_on_disks()
189+
assert parts == [
190+
# previous INSERT
191+
["4", "jbod2"],
192+
# this INSERT
193+
["4", "jbod3"],
194+
]
195+
finally:
196+
node.exec_in_container(["rm", "-f", "/jbod3/.test"])
197+
node.query(
198+
"DROP TABLE IF EXISTS data_least_used_detect_background_changes SYNC"
199+
)

0 commit comments

Comments
 (0)