Conversation

@arthurpassos
Collaborator

@arthurpassos arthurpassos commented Nov 5, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Export partition support for replicated MergeTree engines

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Nov 5, 2025

Workflow [PR], commit [6b9870a]

)", 0) \
DECLARE(String, export_merge_tree_part_file_already_exists_policy, "NO_OP", R"(
Possible values:
- NO_OP - No-op if the file already exists - Default.
Member

Maybe "skip" instead of no_op? Also, it looks like most of the enum or enum-like values are lowercase, so maybe it makes sense to use lowercase here too?
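A sketch of how the suggestion could look, reusing the DECLARE pattern from the diff above (the rename to "skip" is only the reviewer's proposal, not the merged behavior):

DECLARE(String, export_merge_tree_part_file_already_exists_policy, "skip", R"(
Possible values:
- skip - Do nothing if the file already exists - Default.
)", 0) \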

@arthurpassos
Collaborator Author

@codex review for proper ZooKeeper usage. Check whether the code is safe when ZooKeeper is not available, whether paths are properly cleaned up upon table shutdown, and other Keeper-related concerns.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 134 to 137
auto status_watch_callback = std::make_shared<Coordination::WatchCallback>([this, key](const Coordination::WatchResponse &) {
storage.export_merge_tree_partition_manifest_updater->addStatusChange(key);
storage.export_merge_tree_partition_status_handling_task->schedule();
});

P1 Badge Guard ZooKeeper status watcher against destroyed task

When a new export entry is seen, the code installs a status watcher whose callback captures this from ExportPartitionManifestUpdatingTask. These callbacks are owned by the global ZooKeeper session and there is no teardown when the table is shut down, so dropping the table or resetting the storage destroys the task while the watcher is still registered. If the watcher later fires (e.g. when another replica updates the status or when the export subtree is deleted), it will dereference the dangling pointer and call methods on an already–destroyed storage instance. This is a use‑after‑free that can crash the server. The callback should use a weak_ptr to the task or be explicitly cancelled during shutdown to ensure the storage is still alive before accessing it.
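A minimal sketch of the weak_ptr variant, assuming the task is owned by a std::shared_ptr and derives from std::enable_shared_from_this (neither is guaranteed by the current code, so this is only a shape, not the fix itself):

/// Assumption: ExportPartitionManifestUpdatingTask is held via std::shared_ptr,
/// so the watcher can keep a weak reference instead of capturing a raw `this`.
auto status_watch_callback = std::make_shared<Coordination::WatchCallback>(
    [weak_task = weak_from_this(), key](const Coordination::WatchResponse &)
    {
        /// If the table was dropped or shut down, the task is gone and the watch becomes a no-op.
        if (auto task = weak_task.lock())
        {
            task->storage.export_merge_tree_partition_manifest_updater->addStatusChange(key);
            task->storage.export_merge_tree_partition_status_handling_task->schedule();
        }
    });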

Collaborator Author

Ouch

@arthurpassos
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 548 to 551
case ASTAlterCommand::EXPORT_PARTITION:
{
required_access.emplace_back(AccessType::ALTER_EXPORT_PARTITION, command.to_database, command.to_table);
required_access.emplace_back(AccessType::INSERT, command.to_database, command.to_table);

P1 Badge Check ALTER EXPORT privilege on wrong table

When building the access list for ALTER … EXPORT PARTITION, the code requests AccessType::ALTER_EXPORT_PARTITION on command.to_database/to_table (the destination) instead of on the table that is being altered (database/table). As a result a user only needs INSERT plus ALTER_EXPORT_PARTITION on the destination table to export arbitrary partitions from any source table even if they have no privileges on that source, which is a privilege‑escalation issue. The source table is already stored in the database/table variables (same as the EXPORT_PART branch above) and that is the table whose ALTER privilege should be checked.
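A sketch of the suggested change, checking ALTER on the source table (already available in the database/table variables mentioned above) while keeping INSERT on the destination:

case ASTAlterCommand::EXPORT_PARTITION:
{
    /// ALTER privilege must be held on the table being altered (the export source)...
    required_access.emplace_back(AccessType::ALTER_EXPORT_PARTITION, database, table);
    /// ...while INSERT is still required on the destination table.
    required_access.emplace_back(AccessType::INSERT, command.to_database, command.to_table);
    break;
}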

Comment on lines +71 to +74
boost::multi_index::hashed_unique<
boost::multi_index::tag<ExportPartitionTaskEntryTagByTransactionId>,
boost::multi_index::const_mem_fun<ExportReplicatedMergeTreePartitionTaskEntry, std::string, &ExportReplicatedMergeTreePartitionTaskEntry::getTransactionId>
>

P2 Badge Unique index on transaction id drops parallel exports

The export manifest stores transaction_id = query_context->getCurrentQueryId() (StorageReplicatedMergeTree.cpp L8192‑8204), and the in‑memory task container defines a hashed_unique index on getTransactionId() (ExportReplicatedMergeTreePartitionTaskEntry.h L71‑74). A single ALTER statement can contain multiple EXPORT PARTITION commands executed in the same loop (MergeTreeData.cpp L6340‑6355), meaning all of them share the same query id. Because of the unique index, inserting the second manifest silently fails and that export is never scheduled even though its ZooKeeper entry exists, leaving the partition stuck in PENDING forever. Either the transaction id must be made unique per command (e.g. include partition/destination) or the unique constraint must be relaxed so multiple exports from the same query can progress.
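A hedged sketch of the first option: derive the transaction id from the query id plus the partition and destination, so each EXPORT PARTITION command in a multi-command ALTER gets a distinct key (the exact fields and separators here are assumptions):

/// Hypothetical composition; the real manifest may pick different fields or formatting.
manifest.transaction_id = fmt::format(
    "{}_{}_{}.{}",
    query_context->getCurrentQueryId(),
    partition_id,
    manifest.destination_database,
    manifest.destination_table);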

@arthurpassos
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 45 to 60
for (auto & entry : storage.export_merge_tree_partition_task_entries_by_create_time)
{
const auto & manifest = entry.manifest;
const auto key = entry.getCompositeKey();
const auto & database = storage.getContext()->resolveDatabase(manifest.destination_database);
const auto & table = manifest.destination_table;

/// No need to query zk for status if the local one is not PENDING
if (entry.status != ExportReplicatedMergeTreePartitionTaskEntry::Status::PENDING)
{
LOG_INFO(storage.log, "ExportPartition scheduler task: Skipping... Local status is {}", magic_enum::enum_name(entry.status).data());
continue;
}

const auto destination_storage_id = StorageID(QualifiedTableName {database, table});

P1 Badge Avoid dangling reference when resolving destination DB

In ExportPartitionTaskScheduler::run the destination database name is stored as const auto & database = storage.getContext()->resolveDatabase(manifest.destination_database); (lines 49‑60). Context::resolveDatabase() returns a String by value (src/Interpreters/Context.cpp:1232), so binding it to a reference leaves database pointing to destroyed storage once the temporary goes out of scope at the end of the full expression. Every subsequent use of database (e.g. when constructing StorageID and logging) operates on a dangling pointer, which is undefined behaviour and can randomly corrupt the export scheduler. The string needs to be stored by value instead of by reference.
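The fix is simply to take the resolved name by value; a sketch against the loop above:

/// resolveDatabase() returns a String by value, so copy it instead of binding a reference to the temporary.
const auto database = storage.getContext()->resolveDatabase(manifest.destination_database);
const auto & table = manifest.destination_table;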

Comment on lines +8133 to +8169
/// check if entry already exists
if (zookeeper->exists(partition_exports_path))
{
LOG_INFO(log, "Export with key {} is already exported or it is being exported. Checking if it has expired so that we can overwrite it", export_key);

bool has_expired = false;

if (zookeeper->exists(fs::path(partition_exports_path) / "metadata.json"))
{
std::string metadata_json;
if (zookeeper->tryGet(fs::path(partition_exports_path) / "metadata.json", metadata_json))
{
const auto manifest = ExportReplicatedMergeTreePartitionManifest::fromJsonString(metadata_json);

const auto now = time(nullptr);
const auto expiration_time = manifest.create_time + manifest.ttl_seconds;

LOG_INFO(log, "Export with key {} has expiration time {}, now is {}", export_key, expiration_time, now);

if (static_cast<time_t>(expiration_time) < now)
{
has_expired = true;
}
}
}

if (!has_expired && !query_context->getSettingsRef()[Setting::export_merge_tree_partition_force_export])
{
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Export with key {} already exported or it is being exported, and it has not expired. Set `export_merge_tree_partition_force_export` to overwrite it.", export_key);
}

LOG_INFO(log, "Overwriting export with key {}", export_key);

/// Not putting in ops (same transaction) because we can't construct a "tryRemoveRecursive" request.
/// It is possible that the zk being used does not support RemoveRecursive requests.
/// It is ok for this to be non transactional. Worst case scenario an on-going export is going to be killed and a new task won't be scheduled.
zookeeper->tryRemoveRecursive(partition_exports_path);

P1 Badge TTL cleanup kills in-progress partition exports

When ALTER TABLE … EXPORT PARTITION sees an existing export entry it unconditionally treats it as expired whenever manifest.create_time + manifest.ttl_seconds < now (lines 8133‑8169). There is no check of the task status before calling tryRemoveRecursive, so a slow export that is still in PENDING state is removed as soon as the TTL elapses (default 180 s) and a new export is scheduled. This contradicts the documented behavior of export_merge_tree_partition_manifest_ttl, which says it “does not affect / delete in progress tasks, it only cleans up the completed ones” (src/Core/Settings.cpp, setting description). The current implementation can therefore kill long-running exports and leave their background tasks writing into the destination while the manifest is removed. The expiry check should respect the status node and only allow overwriting completed/failed exports (or require explicit force_export).
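A sketch of the suggested guard, assuming a status node exists under the export path and that terminal states are named COMPLETED/FAILED (both are assumptions; the real node layout and state names may differ):

bool is_finished = false;
std::string status_string;
/// Hypothetical status node; only finished exports become eligible for TTL-based overwrite.
if (zookeeper->tryGet(fs::path(partition_exports_path) / "status", status_string))
    is_finished = (status_string == "COMPLETED" || status_string == "FAILED");

if (!(has_expired && is_finished) && !query_context->getSettingsRef()[Setting::export_merge_tree_partition_force_export])
    throw Exception(ErrorCodes::BAD_ARGUMENTS,
        "Export with key {} is still in progress or has not expired. Set `export_merge_tree_partition_force_export` to overwrite it.",
        export_key);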

Comment on lines 4498 to 4505
const auto exceptions_per_replica_path = export_partition_path / "exceptions_per_replica";

const auto exception_replicas = zk->getChildren(exceptions_per_replica_path);
for (const auto & replica : exception_replicas)
{
std::string exception_count_string;
if (!zk->tryGet(exceptions_per_replica_path / replica / "count", exception_count_string))
{

P2 Badge system.replicated_partition_exports can throw on concurrent deletion

getPartitionExportsInfo() fetches exception information with zk->getChildren(exceptions_per_replica_path) (line ~4500) without using the try* variants. If a partition export entry is deleted between enumerating /exports/<key> and reading /exceptions_per_replica, ZooKeeper responds with ZNONODE and getChildren throws, causing a query against system.replicated_partition_exports to fail even though the data was merely deleted concurrently. All other reads in this function already use tryGet/tryGetChildren to handle the race. The same pattern should be used here so that the system table degrades gracefully instead of throwing.
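A sketch of the tolerant variant, following the try* style used by the rest of the function:

Strings exception_replicas;
/// If the export entry was removed concurrently, report no exceptions instead of
/// failing the whole system.replicated_partition_exports query with ZNONODE.
if (zk->tryGetChildren(exceptions_per_replica_path, exception_replicas) != Coordination::Error::ZOK)
    exception_replicas.clear();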

@arthurpassos
Collaborator Author

@codex review the enable_experimental_export_merge_tree_partition_feature setting. Is it safe enough? Also look for easy ways to reduce ZooKeeper chattiness so we don't overload it.

@chatgpt-codex-connector

Summary

  • Added idle/active polling intervals for the export manifest updater so replicas only re-check ZooKeeper every three minutes when no partition exports are queued, while keeping the 30 s cadence whenever manifests are present; watch callbacks still re-schedule the task immediately when new work appears. src/Storages/StorageReplicatedMergeTree.cppL4411-L4439
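A rough sketch of that cadence, assuming the background-schedule task supports scheduleAfter and reusing the member names from the snippets earlier in this thread (the interval constants are illustrative):

static constexpr UInt64 active_poll_ms = 30'000;    /// exports are queued: keep the 30 s cadence
static constexpr UInt64 idle_poll_ms = 180'000;     /// nothing queued: back off to ~3 minutes

const bool has_pending_exports = !storage.export_merge_tree_partition_task_entries_by_create_time.empty();
storage.export_merge_tree_partition_status_handling_task->scheduleAfter(has_pending_exports ? active_poll_ms : idle_poll_ms);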

Testing

  • ⚠️ Not run (not requested)

View task →

@arthurpassos
Collaborator Author

There is a problem with the scheduling algorithm as it stands (several, actually).

1st problem: I lock the part (which is fairly expensive, requiring a few ZooKeeper calls) and only then try to schedule it, without knowing whether we have capacity to process it. If we fail to schedule because we don't have capacity, we'll still repeat this for all the other parts. Expensive and useless. The better approach would be to know in advance whether we have room for another export, and only then try to lock.

2nd problem: the scheduler calls exportPartToTable, which only puts the part into a wait list bounded by nothing but memory. That means it will probably be able to schedule as many exports as it wants, even though it does not have room for them.

@arthurpassos
Collaborator Author

To solve that, I think I'll do two things:

  1. Somehow check that we have space before running the scheduler (not a guarantee, though, because of TOCTOU)
  2. Delegate the part-locking process to the task itself, as sketched below
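A very rough sketch of that two-step shape; both helpers here are hypothetical and do not exist in the PR as-is:

/// 1. Cheap, best-effort capacity check before doing any ZooKeeper work (TOCTOU is accepted).
if (!hasFreeExportSlot())
    return;

/// 2. The scheduler only enqueues the entry; the background task itself acquires the
///    part locks (and pays the ZooKeeper round trips) when it actually runs.
scheduleExportEntry(entry);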

@Altinity Altinity deleted a comment from chatgpt-codex-connector bot Nov 18, 2025
@Altinity Altinity deleted a comment from chatgpt-codex-connector bot Nov 18, 2025
@Enmk Enmk merged commit d09fb0b into antalya-25.8 Nov 19, 2025
251 of 282 checks passed