feat: add multi-path support for lance data paths #4765
Conversation
Codecov Report

❌ Patch coverage is …

```
@@            Coverage Diff             @@
##             main    #4765      +/-   ##
==========================================
+ Coverage   81.64%   81.69%   +0.04%
==========================================
  Files         333      333
  Lines      131594   132497     +903
  Branches   131594   132497     +903
==========================================
+ Hits       107444   108243     +799
- Misses      20550    20619      +69
- Partials     3600     3635      +35
```
jackye1995 left a comment:
This is an exciting feature! Added some initial thoughts.
test_multi_bucket_logging.py (Outdated)
Let's turn this into proper tests in Python and Rust.
Thanks @jackye1995 for the review! I will incorporate your suggestions and try to add tests in all current flows too!
posting some offline discussions here:
python/python/lance/dataset.py (Outdated)

```
    If both `commit_message` and `properties` are provided, `commit_message` will
    override any "lance.commit.message" key in `properties`.
initial_data_paths: list of str, optional
    *Experimental*. Data file base URIs for registering in the manifest
```
nit: no need to mark as experimental
python/python/lance/dataset.py (Outdated)

```
    Only used in CREATE mode for manifest registration.
    Example: ["s3://storage1/data", "s3://storage2/data"]
target_data_paths: list of str, optional
    *Experimental*. Target URI for writing data files (array with exactly one element).
```
nit: no need to mark as experimental
rust/lance/src/dataset.rs (Outdated)

```rust
}

/// Get the ObjectStore for a specific path based on base_id
pub(crate) fn get_object_store_for_path(&self, base_id: Option<&u32>) -> Result<Arc<lance_io::object_store::ObjectStore>> {
```
I guess this would be necessary for the multi-cloud use case, but apart from that, would the configuration within the same cloud be different? We cannot really pass different configurations for the same cloud, so I think we should cache object stores by the path scheme instead of by the path. This will reduce the number of object stores we cache per dataset, especially if you need to set a lot of bases.
Yes, we need this for each bucket since the object store has the bucket info in its constructor, but that could be changed, I guess. Else we write to the same bucket as the primary. The cache key here is the bucket URI, so there will just be one entry per bucket.
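For illustration, a minimal Python sketch of that keying scheme (the real code is Rust; `ObjectStore` and the cache here are hypothetical stand-ins):

```python
from urllib.parse import urlparse

class ObjectStore:
    """Hypothetical stand-in for the real (Rust) object store."""
    def __init__(self, bucket_uri: str):
        self.bucket_uri = bucket_uri  # bucket info is fixed at construction time

_stores: dict[str, ObjectStore] = {}

def object_store_for(path: str) -> ObjectStore:
    # The cache key is the bucket URI, so there is one entry per bucket.
    parsed = urlparse(path)
    bucket_uri = f"{parsed.scheme}://{parsed.netloc}"  # e.g. "s3://storage1"
    if bucket_uri not in _stores:
        _stores[bucket_uri] = ObjectStore(bucket_uri)
    return _stores[bucket_uri]
```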
The key thing here, I think, is: do you expect to pass different object store configs per data path? Today these will just be initialized with whatever credentials and settings are in the environment or that you pass in when opening the dataset. They won't really be different object stores from a configuration perspective. If you need to customize at that level, we probably need to find a way to pass that info along here.
Not for the current use case, but I guess in general it could be nice to have in the future.
Sounds good. But in general, we already have a cache in ObjectStoreRegistry; we should use that instead of creating another cache here. That cache already uses the base path as the cache key as well.
@jackye1995 we need the object store cache at the dataset level (like the primary object store stored here). This is used by the reader to fetch the correct object store (via get_object_store_for_path). Are you suggesting storing the ObjectStoreRegistry here?
Sorry, let me be clear: I mean that in the dataset builder, we should not just initialize these extra object stores. They are not always going to be used, and an object store is not a cheap struct to initialize. We should initialize them on demand when we need to use them.
```rust
/// Additional path URIs for multi-path support (cached in ObjectStoreRegistry)
pub(crate) extra_path_uris: Arc<HashMap<u32, String>>,
/// Store parameters used to create object stores (needed for extra paths)
pub(crate) store_params: ObjectStoreParams,
```

I will need to store these in the dataset so that we can create the object stores on demand.
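A rough Python analogue of the on-demand pattern being discussed (hypothetical names, mirroring the Rust fields above and reusing the `ObjectStore` stand-in from the earlier sketch):

```python
class MultiPathStores:
    """Hold only the URIs and params; build each object store lazily."""
    def __init__(self, extra_path_uris: dict[int, str], store_params: dict):
        self.extra_path_uris = extra_path_uris  # base_id -> base URI
        self.store_params = store_params        # kept so stores can be built later
        self._cache: dict[int, ObjectStore] = {}

    def get_object_store_for_path(self, base_id: int) -> ObjectStore:
        # Nothing is constructed until a base is actually read from or written to.
        if base_id not in self._cache:
            self._cache[base_id] = ObjectStore(self.extra_path_uris[base_id])
        return self._cache[base_id]
```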
protos/transaction.proto (Outdated)

```proto
// Key-value pairs to merge with existing config.
map<string, string> config_upsert_values = 4;
// Additional path URIs for data file distribution in multi-path layouts
repeated string initial_data_paths = 5;
```
Regardless of what we do in the Python API, at the transaction level we should have a list of BasePath, not just string data paths. In BasePath we have important fields like is_dataset_root, as well as a user-provided name for the path.
OK, found the context! I think there is a difference between the two cases: reserving fragment IDs does not add content to the manifest, right? It's a field-update operation. The transactional issue here would leave behind unused paths (which might even already exist). Even if we have multi-statement transactions, I think we cannot guarantee people use them correctly, and that might result in a bloated manifest file for a long-lived dataset. Maybe we could add cleaning of unrelated base_paths to the cleanup procedure. Or make this a table config, like cleaning unrelated paths over the latest n snapshots. And of course this would be optional in case users want to manage it manually. This could also guard the file deletion/compaction cases. What do you think? @jackye1995 @dacort
Can you elaborate a bit more? I did not get the concern here. I think it is important for users to assign a logical name to the path for deduping purposes, so that if you try to add paths with the same name, it fails.
python/python/lance/dataset.py (Outdated)

```
    and can be retrieved using read_transaction().
    If both `commit_message` and `properties` are provided, `commit_message` will
    override any "lance.commit.message" key in `properties`.
initial_data_paths: list of str, optional
```
I was thinking about this last night; I feel there is a tension between a simple user experience and flexibility/correctness here:
In the most precise form, what a user should define is the full base path information, something like:
```python
write_dataset(data, dataset, mode="create", initial_bases=[
    {"name": "b1", "path": "s3://a/b1", "is_dataset_root": "true"},
    {"name": "b2", "path": "s3://a/b2", "is_dataset_root": "false"},
])

# target existing bases
write_dataset(data2, dataset, mode="append", target_bases=[
    "b1",
    "b2"
])

# target new bases
write_dataset(data2, dataset, mode="append", target_bases=[
    {"name": "b3", "path": "s3://a/b3", "is_dataset_root": "true"},
])
```

But I also understand this is quite cumbersome to write, compared to directly writing initial_data_paths and target_data_paths. One big concern I have around that is that people write paths slightly differently all the time (e.g. s3://a/b1 vs s3://a/b1/). It is hard to know whether they mean the same path or actually mean different ones.
We also have a similar problem about how expressive we should be in the Overwrite transaction, as I commented above.
So here is my latest thinking: at the user interface level, instead of having a dedicated field only for create mode, we fully separate the concept of the new bases to register from the target bases to use. Regardless of mode, users always register new bases in the field new_bases and specify which to use in target_bases. Here is the updated example:
```python
write_dataset(data, dataset, mode="create", new_bases=[
    {"name": "b1", "path": "s3://a/b1", "is_dataset_root": "true"},
    {"name": "b2", "path": "s3://a/b2", "is_dataset_root": "false"},
],
target_bases=["b1", "b2"])

# target existing bases
write_dataset(data2, dataset, mode="append", target_bases=[
    "b1",
    "b2"
])

# target new bases
write_dataset(data2, dataset, mode="append", new_bases=[
    {"name": "b3", "path": "s3://a/b3", "is_dataset_root": "true"},
], target_bases=["b3"])
```

And at the transaction level, the new_bases directly translate to the new bases we should add in the transaction model, for example in Overwrite:
```proto
// Create or overwrite the entire dataset.
message Overwrite {
  // The new fragments
  //
  // Fragment IDs are not yet assigned.
  repeated DataFragment fragments = 1;
  // The new schema
  repeated lance.file.Field schema = 2;
  // Schema metadata.
  map<string, bytes> schema_metadata = 3;
  // Key-value pairs to merge with existing config.
  map<string, string> config_upsert_values = 4;
  // base paths for the whole dataset
  repeated BasePath initial_bases = 5;
}
```

@majin1102 @jaystarshot what do we think about this proposal?
I think we can define the API for the overwrite behavior once we have concrete use cases. For us, just having a separate API is fine, but there could be other use cases.
Agree with the proposal of putting them together in one operation.
I was wondering whether it is really necessary to register unrelated base paths in the write interface. If we make adding a base path idempotent, I think we could just:

```python
write_dataset(data2, dataset, mode="append", bases=[{"path": "s3://a/b3"}])
```

This means name is none (if we don't force there to be one) and is_dataset_root takes its default. Or we could use:

```python
write_dataset(data2, dataset, mode="append", bases=[{"name": "b3", "path": "s3://a/b3", "is_dataset_root": "false"}])
```

At runtime we check whether the base path exists and make sure it is added before committing the data.
rust/lance/src/dataset.rs (Outdated)

```rust
).await?;

&bucket_uri,
&ObjectStoreParams::default(),
```
The default params likely won't work when storage params are specified (gcs/oci).
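For context, storage options today are supplied once when opening the dataset, e.g.:

```python
import lance

# Credentials and settings are given once for the dataset; the concern above
# is that extra bases built with ObjectStoreParams::default() would not see them.
ds = lance.dataset(
    "gs://bucket/dataset",
    storage_options={"service_account": "path/to/credentials.json"},
)
```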
Sorry for the late reply. I was thinking about the scenario of writing files to a base that has not been added yet; that should potentially be an ACID operation. But on second thought, it is reasonable to split this into two operations, one to add the base and one to write the data, with a small limitation for the scenario where people want to dynamically manage bases according to data. Two cases in my mind:

For the second scenario, I assumed we have to add the base before committing the data. Then we might encounter the case where the data commit fails but the base was added. For this scenario we could expect a commit retry, but without assurance, and the base could be left there forever (if we use a UUID, it is just a garbage hole). I think multi-base management could be a complex issue, and somehow I think it is related to the scenario of partitions (like this multi-bucket case). Or let's say it might provide a foundation for what partitions aim at. Nowadays when we talk about partitions we usually think of partition specs and so on in Iceberg, but if we look back to Hive, partitions are just managed paths. I was wondering if we could reuse this multi-base ability in a partition solution. What do you think of this? @dacort @jackye1995 To be clear, I'm on the side of not designing multi-base as heavy as partitions. But if we consider what they have in common, we could draw lessons from those APIs and consider how the upper layers could evolve towards a partition-oriented solution based on the multi-base architecture in the future.

I didn't get this. I think the path itself is unique and could be used for deduping purposes? If we force people to use a name as the identifier, I might just derive it from the path, which would be a little unnecessary.
jaystarshot left a comment:
Thanks a lot, looks great
The lance.write_dataset() function already supports writing to multiple storage buckets via the target_bases parameter (#4765). However, write_fragments() did not expose this capability, even though the underlying Rust implementations (_write_fragments, _write_fragments_transaction) already supported it.
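A sketch of what the write_fragments pass-through would presumably look like (the target_bases keyword here mirrors write_dataset and is an assumption; the exact signature may differ):

```python
import pyarrow as pa
from lance.fragment import write_fragments

data = pa.table({"id": [1, 2, 3]})

# Hypothetical: fragments land in the registered base named "b1"
# instead of the primary dataset path.
fragments = write_fragments(data, "s3://primary/dataset", target_bases=["b1"])
```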
Fixes #4702
Adds multi-path support for Lance datasets. A sample API is shown below.
How to Use It
Creating a multi-path dataset:
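The original snippet is not captured above; a sketch following the new_bases/target_bases shape from the review thread (parameter names may differ from the merged API):

```python
import lance
import pyarrow as pa

data = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Register two bases at creation time and spread the initial data across them.
lance.write_dataset(
    data,
    "s3://primary/dataset",
    mode="create",
    new_bases=[
        {"name": "b1", "path": "s3://storage1/data", "is_dataset_root": "false"},
        {"name": "b2", "path": "s3://storage2/data", "is_dataset_root": "false"},
    ],
    target_bases=["b1", "b2"],
)
```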
Appending to a different location:
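Again a sketch under the same assumptions:

```python
# Append new rows, directing the new data files at an already-registered base.
more = pa.table({"id": [4, 5], "value": ["d", "e"]})
lance.write_dataset(more, "s3://primary/dataset", mode="append", target_bases=["b2"])
```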
Overwriting with completely new paths:
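And, under the same assumptions:

```python
# Overwrite the dataset while registering and targeting a brand-new base.
fresh = pa.table({"id": [10], "value": ["z"]})
lance.write_dataset(
    fresh,
    "s3://primary/dataset",
    mode="overwrite",
    new_bases=[{"name": "b3", "path": "s3://storage3/data", "is_dataset_root": "false"}],
    target_bases=["b3"],
)
```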
Main Changes
WIP and unclear