Conversation

@arthurpassos
Collaborator

@arthurpassos arthurpassos commented Nov 5, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Export partition support for replicated MergeTree engines

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Nov 5, 2025

Workflow [PR], commit [6b9870a]

)", 0) \
DECLARE(String, export_merge_tree_part_file_already_exists_policy, "NO_OP", R"(
Possible values:
- NO_OP - No-op if the file already exists - Default.
Member

Maybe "skip" instead of no_op? Also, it looks like most of the enum or enum-like values are lowercase, so maybe it makes sense to use lowercase here too?
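A sketch of how the suggestion could look, reusing the DECLARE pattern from the diff above (the rename to "skip" is only the reviewer's proposal, not the merged behavior):

DECLARE(String, export_merge_tree_part_file_already_exists_policy, "skip", R"(
Possible values:
- skip - Do nothing if the file already exists - Default.
)", 0) \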

@arthurpassos
Collaborator Author

@codex review for proper ZooKeeper usage. Check whether the code is safe when ZooKeeper is not available, whether paths are properly cleaned up upon table shutdown, and other Keeper-related concerns.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 134 to 137
auto status_watch_callback = std::make_shared<Coordination::WatchCallback>([this, key](const Coordination::WatchResponse &) {
storage.export_merge_tree_partition_manifest_updater->addStatusChange(key);
storage.export_merge_tree_partition_status_handling_task->schedule();
});

P1 Badge Guard ZooKeeper status watcher against destroyed task

When a new export entry is seen, the code installs a status watcher whose callback captures this from ExportPartitionManifestUpdatingTask. These callbacks are owned by the global ZooKeeper session and there is no teardown when the table is shut down, so dropping the table or resetting the storage destroys the task while the watcher is still registered. If the watcher later fires (e.g. when another replica updates the status or when the export subtree is deleted), it will dereference the dangling pointer and call methods on an already–destroyed storage instance. This is a use‑after‑free that can crash the server. The callback should use a weak_ptr to the task or be explicitly cancelled during shutdown to ensure the storage is still alive before accessing it.
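A minimal sketch of the weak_ptr variant, assuming the task is owned by a std::shared_ptr and derives from std::enable_shared_from_this (neither is guaranteed by the current code, so this is only a shape, not the fix itself):

/// Assumption: ExportPartitionManifestUpdatingTask is held via std::shared_ptr,
/// so the watcher can keep a weak reference instead of capturing a raw `this`.
auto status_watch_callback = std::make_shared<Coordination::WatchCallback>(
    [weak_task = weak_from_this(), key](const Coordination::WatchResponse &)
    {
        /// If the table was dropped or shut down, the task is gone and the watch becomes a no-op.
        if (auto task = weak_task.lock())
        {
            task->storage.export_merge_tree_partition_manifest_updater->addStatusChange(key);
            task->storage.export_merge_tree_partition_status_handling_task->schedule();
        }
    });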

Collaborator Author

Ouch

@arthurpassos
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 548 to 551
case ASTAlterCommand::EXPORT_PARTITION:
{
required_access.emplace_back(AccessType::ALTER_EXPORT_PARTITION, command.to_database, command.to_table);
required_access.emplace_back(AccessType::INSERT, command.to_database, command.to_table);

P1 Badge Check ALTER EXPORT privilege on wrong table

When building the access list for ALTER … EXPORT PARTITION, the code requests AccessType::ALTER_EXPORT_PARTITION on command.to_database/to_table (the destination) instead of on the table that is being altered (database/table). As a result a user only needs INSERT plus ALTER_EXPORT_PARTITION on the destination table to export arbitrary partitions from any source table even if they have no privileges on that source, which is a privilege‑escalation issue. The source table is already stored in the database/table variables (same as the EXPORT_PART branch above) and that is the table whose ALTER privilege should be checked.
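A sketch of the suggested change, checking ALTER on the source table (already available in the database/table variables mentioned above) while keeping INSERT on the destination:

case ASTAlterCommand::EXPORT_PARTITION:
{
    /// ALTER privilege must be held on the table being altered (the export source)...
    required_access.emplace_back(AccessType::ALTER_EXPORT_PARTITION, database, table);
    /// ...while INSERT is still required on the destination table.
    required_access.emplace_back(AccessType::INSERT, command.to_database, command.to_table);
    break;
}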

Comment on lines +71 to +74
boost::multi_index::hashed_unique<
boost::multi_index::tag<ExportPartitionTaskEntryTagByTransactionId>,
boost::multi_index::const_mem_fun<ExportReplicatedMergeTreePartitionTaskEntry, std::string, &ExportReplicatedMergeTreePartitionTaskEntry::getTransactionId>
>

P2 Badge Unique index on transaction id drops parallel exports

The export manifest stores transaction_id = query_context->getCurrentQueryId() (StorageReplicatedMergeTree.cpp L8192‑8204), and the in‑memory task container defines a hashed_unique index on getTransactionId() (ExportReplicatedMergeTreePartitionTaskEntry.h L71‑74). A single ALTER statement can contain multiple EXPORT PARTITION commands executed in the same loop (MergeTreeData.cpp L6340‑6355), meaning all of them share the same query id. Because of the unique index, inserting the second manifest silently fails and that export is never scheduled even though its ZooKeeper entry exists, leaving the partition stuck in PENDING forever. Either the transaction id must be made unique per command (e.g. include partition/destination) or the unique constraint must be relaxed so multiple exports from the same query can progress.
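A hedged sketch of the first option: derive the transaction id from the query id plus the partition and destination, so each EXPORT PARTITION command in a multi-command ALTER gets a distinct key (the exact fields and separators here are assumptions):

/// Hypothetical composition; the real manifest may pick different fields or formatting.
manifest.transaction_id = fmt::format(
    "{}_{}_{}.{}",
    query_context->getCurrentQueryId(),
    partition_id,
    manifest.destination_database,
    manifest.destination_table);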

@arthurpassos
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 45 to 60
for (auto & entry : storage.export_merge_tree_partition_task_entries_by_create_time)
{
const auto & manifest = entry.manifest;
const auto key = entry.getCompositeKey();
const auto & database = storage.getContext()->resolveDatabase(manifest.destination_database);
const auto & table = manifest.destination_table;

/// No need to query zk for status if the local one is not PENDING
if (entry.status != ExportReplicatedMergeTreePartitionTaskEntry::Status::PENDING)
{
LOG_INFO(storage.log, "ExportPartition scheduler task: Skipping... Local status is {}", magic_enum::enum_name(entry.status).data());
continue;
}

const auto destination_storage_id = StorageID(QualifiedTableName {database, table});

P1 Badge Avoid dangling reference when resolving destination DB

In ExportPartitionTaskScheduler::run the destination database name is stored as const auto & database = storage.getContext()->resolveDatabase(manifest.destination_database); (lines 49‑60). Context::resolveDatabase() returns a String by value (src/Interpreters/Context.cpp:1232), so binding it to a reference leaves database pointing to destroyed storage once the temporary goes out of scope at the end of the full expression. Every subsequent use of database (e.g. when constructing StorageID and logging) operates on a dangling pointer, which is undefined behaviour and can randomly corrupt the export scheduler. The string needs to be stored by value instead of by reference.
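The fix is simply to take the resolved name by value; a sketch against the loop above:

/// resolveDatabase() returns a String by value, so copy it instead of binding a reference to the temporary.
const auto database = storage.getContext()->resolveDatabase(manifest.destination_database);
const auto & table = manifest.destination_table;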

Comment on lines +8133 to +8169
/// check if entry already exists
if (zookeeper->exists(partition_exports_path))
{
LOG_INFO(log, "Export with key {} is already exported or it is being exported. Checking if it has expired so that we can overwrite it", export_key);

bool has_expired = false;

if (zookeeper->exists(fs::path(partition_exports_path) / "metadata.json"))
{
std::string metadata_json;
if (zookeeper->tryGet(fs::path(partition_exports_path) / "metadata.json", metadata_json))
{
const auto manifest = ExportReplicatedMergeTreePartitionManifest::fromJsonString(metadata_json);

const auto now = time(nullptr);
const auto expiration_time = manifest.create_time + manifest.ttl_seconds;

LOG_INFO(log, "Export with key {} has expiration time {}, now is {}", export_key, expiration_time, now);

if (static_cast<time_t>(expiration_time) < now)
{
has_expired = true;
}
}
}

if (!has_expired && !query_context->getSettingsRef()[Setting::export_merge_tree_partition_force_export])
{
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Export with key {} already exported or it is being exported, and it has not expired. Set `export_merge_tree_partition_force_export` to overwrite it.", export_key);
}

LOG_INFO(log, "Overwriting export with key {}", export_key);

/// Not putting in ops (same transaction) because we can't construct a "tryRemoveRecursive" request.
/// It is possible that the zk being used does not support RemoveRecursive requests.
/// It is ok for this to be non transactional. Worst case scenario an on-going export is going to be killed and a new task won't be scheduled.
zookeeper->tryRemoveRecursive(partition_exports_path);

P1 Badge TTL cleanup kills in-progress partition exports

When ALTER TABLE … EXPORT PARTITION sees an existing export entry it unconditionally treats it as expired whenever manifest.create_time + manifest.ttl_seconds < now (lines 8133‑8169). There is no check of the task status before calling tryRemoveRecursive, so a slow export that is still in PENDING state is removed as soon as the TTL elapses (default 180 s) and a new export is scheduled. This contradicts the documented behavior of export_merge_tree_partition_manifest_ttl, which says it “does not affect / delete in progress tasks, it only cleans up the completed ones” (src/Core/Settings.cpp, setting description). The current implementation can therefore kill long-running exports and leave their background tasks writing into the destination while the manifest is removed. The expiry check should respect the status node and only allow overwriting completed/failed exports (or require explicit force_export).
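A sketch of the suggested guard, assuming a status node exists under the export path and that terminal states are named COMPLETED/FAILED (both are assumptions; the real node layout and state names may differ):

bool is_finished = false;
std::string status_string;
/// Hypothetical status node; only finished exports become eligible for TTL-based overwrite.
if (zookeeper->tryGet(fs::path(partition_exports_path) / "status", status_string))
    is_finished = (status_string == "COMPLETED" || status_string == "FAILED");

if (!(has_expired && is_finished) && !query_context->getSettingsRef()[Setting::export_merge_tree_partition_force_export])
    throw Exception(ErrorCodes::BAD_ARGUMENTS,
        "Export with key {} is still in progress or has not expired. Set `export_merge_tree_partition_force_export` to overwrite it.",
        export_key);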

Comment on lines 4498 to 4505
const auto exceptions_per_replica_path = export_partition_path / "exceptions_per_replica";

const auto exception_replicas = zk->getChildren(exceptions_per_replica_path);
for (const auto & replica : exception_replicas)
{
std::string exception_count_string;
if (!zk->tryGet(exceptions_per_replica_path / replica / "count", exception_count_string))
{

P2 Badge system.replicated_partition_exports can throw on concurrent deletion

getPartitionExportsInfo() fetches exception information with zk->getChildren(exceptions_per_replica_path) (line ~4500) without using the try* variants. If a partition export entry is deleted between enumerating /exports/<key> and reading /exceptions_per_replica, ZooKeeper responds with ZNONODE and getChildren throws, causing a query against system.replicated_partition_exports to fail even though the data was merely deleted concurrently. All other reads in this function already use tryGet/tryGetChildren to handle the race. The same pattern should be used here so that the system table degrades gracefully instead of throwing.
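A sketch of the tolerant variant, following the try* style used by the rest of the function:

Strings exception_replicas;
/// If the export entry was removed concurrently, report no exceptions instead of
/// failing the whole system.replicated_partition_exports query with ZNONODE.
if (zk->tryGetChildren(exceptions_per_replica_path, exception_replicas) != Coordination::Error::ZOK)
    exception_replicas.clear();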

@arthurpassos
Collaborator Author

@codex review the enable_experimental_export_merge_tree_partition_feature setting. Is it safe enough? Also look for easy ways to reduce ZooKeeper chattiness so we don't overload it.

@chatgpt-codex-connector

Summary

  • Added idle/active polling intervals for the export manifest updater so replicas only re-check ZooKeeper every three minutes when no partition exports are queued, while keeping the 30 s cadence whenever manifests are present; watch callbacks still re-schedule the task immediately when new work appears. src/Storages/StorageReplicatedMergeTree.cppL4411-L4439
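A rough sketch of that cadence, assuming the background-schedule task supports scheduleAfter and reusing the member names from the snippets earlier in this thread (the interval constants are illustrative):

static constexpr UInt64 active_poll_ms = 30'000;    /// exports are queued: keep the 30 s cadence
static constexpr UInt64 idle_poll_ms = 180'000;     /// nothing queued: back off to ~3 minutes

const bool has_pending_exports = !storage.export_merge_tree_partition_task_entries_by_create_time.empty();
storage.export_merge_tree_partition_status_handling_task->scheduleAfter(has_pending_exports ? active_poll_ms : idle_poll_ms);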

Testing

  • ⚠️ Not run (not requested)

View task →

@arthurpassos
Collaborator Author

There is a problem with the scheduling algorithm as it stands (several, actually).

1st problem: I lock the part (which is fairly expensive, requiring a few ZooKeeper calls) and only then try to schedule it, without knowing whether we have capacity to process it. If we fail to schedule because we don't have capacity, we'll still repeat this for all the other parts. Expensive and useless. The better approach would be to know in advance whether we have room for another export, and only then try to lock.

2nd problem: the scheduler calls exportPartToTable, which only puts the part into a wait list bounded by nothing but memory. That means it will probably be able to schedule as many exports as it wants, even though it does not have room for them.

@arthurpassos
Collaborator Author

To solve that, I think I'll do two things:

  1. Somehow check that we have space before running the scheduler (not a guarantee, though, because of TOCTOU)
  2. Delegate the part-locking process to the task itself, as sketched below
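A very rough sketch of that two-step shape; both helpers here are hypothetical and do not exist in the PR as-is:

/// 1. Cheap, best-effort capacity check before doing any ZooKeeper work (TOCTOU is accepted).
if (!hasFreeExportSlot())
    return;

/// 2. The scheduler only enqueues the entry; the background task itself acquires the
///    part locks (and pays the ZooKeeper round trips) when it actually runs.
scheduleExportEntry(entry);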

@Altinity Altinity deleted a comment from chatgpt-codex-connector bot Nov 18, 2025
@Altinity Altinity deleted a comment from chatgpt-codex-connector bot Nov 18, 2025
@Enmk Enmk merged commit d09fb0b into antalya-25.8 Nov 19, 2025
251 of 282 checks passed