
Conversation

@jaystarshot (Contributor) commented Sep 18, 2025

Fixes #4702

Adds multi-path support for Lance datasets. A sample API is shown below.

How to Use It

Creating a multi-path dataset:

# Set up multiple storage locations and pick where to write
dataset = lance.write_dataset(
    data,
    "s3://primary/dataset",
    mode="create",
    initial_data_paths=["s3://bucket1/data", "s3://bucket2/data"],
    target_data_paths=["s3://bucket1/data"] 
)

Appending to a different location:

# Later, append new data to bucket2 instead
lance.write_dataset(
    new_data,
    dataset,
    mode="append",
    target_data_paths=["s3://bucket2/data"] 
)

Overwriting with completely new paths:

# Replace everything - both the path registry and the data
lance.write_dataset(
    replacement_data,
    "s3://primary/dataset", 
    mode="overwrite",
    initial_data_paths=["s3://new-bucket1/data", "s3://new-bucket2/data"],
    target_data_paths=["s3://new-bucket1/data"]
)
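
Reading back is unchanged. A minimal sketch, assuming fragments resolve to whichever registered base path they were written to (per the object store fix in this PR):

# Open via the primary URI; the manifest records each fragment's base,
# so scans fetch data from bucket1 and bucket2 as needed.
import lance

dataset = lance.dataset("s3://primary/dataset")
table = dataset.to_table()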

Main Changes

  1. Modified the overwrite transaction to record the bases that will be written in the manifest.
  2. Fixed object store handling when using multiple bases; previously we were using the primary dataset's object store to read from and write to all bases.

WIP and unclear

  1. Compaction (and possibly other) operation handling in a multi-path scenario
  2. Conflict checking for git-like operations
  3. Adding new paths to an existing dataset, possibly via new operations
  4. Storage options can currently only be specified for one cloud provider, but we may need storage options per path, especially if we want to support multi-cloud setups

@github-actions bot added the enhancement (New feature or request) and python labels Sep 18, 2025
@jaystarshot force-pushed the jay-lance-multi-bucket branch from e76b7c0 to 0b1bddc on September 18, 2025
@codecov-commenter commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 88.14590% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.69%. Comparing base (a3ed68d) to head (e0f61b8).

Files with missing lines Patch % Lines
rust/lance/src/dataset/write.rs 90.34% 44 Missing and 31 partials ⚠️
rust/lance/src/dataset/transaction.rs 50.94% 25 Missing and 1 partial ⚠️
rust/lance/src/dataset.rs 61.53% 6 Missing and 4 partials ⚠️
rust/lance-io/src/object_store.rs 42.85% 2 Missing and 2 partials ⚠️
rust/lance/src/dataset/fragment.rs 95.00% 0 Missing and 1 partial ⚠️
rust/lance/src/io/commit.rs 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4765      +/-   ##
==========================================
+ Coverage   81.64%   81.69%   +0.04%     
==========================================
  Files         333      333              
  Lines      131594   132497     +903     
  Branches   131594   132497     +903     
==========================================
+ Hits       107444   108243     +799     
- Misses      20550    20619      +69     
- Partials     3600     3635      +35     
Flag Coverage Δ
unittests 81.69% <88.14%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.


@jackye1995 (Contributor) left a comment

This is an exciting feature! Added some initial thoughts.

Contributor

Let's turn this into proper tests in Python and Rust.

@jaystarshot (Contributor, Author)

Thanks @jackye1995 for the review! I will incorporate your suggestions and try to add tests for all current flows too!

@jackye1995 (Contributor)

posting some offline discussions here:

  1. we should allow directly adding new bases as part of the dataset creation process. Related fields should be added to the Overwrite transaction model. This should be a simple change since Overwrite conflicts with everything.
  2. after the dataset is created, we should allow new bases to be added through an API like dataset.add_base(...)
  3. for update/delete/insert, if the user wants to add new bases, we can do 2 commits internally: 1 to add_base, 1 to do the actual write (see the sketch after this list). This avoids the complexity of adding base information to the other write transaction models and handling their concurrency issues. It is similar to how we reserve fragment IDs as a separate commit during compaction.
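
A rough sketch of the two-commit flow in point 3, assuming the proposed dataset.add_base(...) API from point 2 (hypothetical; not implemented in this PR) combined with this PR's target_data_paths parameter:

import lance

dataset = lance.dataset("s3://primary/dataset")

# Commit 1: register the new base in the manifest (hypothetical API).
dataset.add_base("s3://bucket3/data")

# Commit 2: the actual write, targeting the newly registered base.
lance.write_dataset(
    new_data,
    dataset,
    mode="append",
    target_data_paths=["s3://bucket3/data"],
)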

@jaystarshot force-pushed the jay-lance-multi-bucket branch 5 times, most recently from a4fe19e to a0e27dd on October 3, 2025
@jaystarshot marked this pull request as ready for review on October 3, 2025
@jaystarshot force-pushed the jay-lance-multi-bucket branch 2 times, most recently from d2c3cfe to 60224bc on October 3, 2025
@jaystarshot changed the title from "feat: add multi-bucket support for lance dataset" to "feat: add multi-path support for lance data paths" on October 3, 2025
@jaystarshot force-pushed the jay-lance-multi-bucket branch from 60224bc to a4946f1 on October 4, 2025
If both `commit_message` and `properties` are provided, `commit_message` will
override any "lance.commit.message" key in `properties`.
initial_data_paths: list of str, optional
*Experimental*. Data file base URIs for registering in the manifest
Contributor

nit: no need to mark as experimental

Only used in CREATE mode for manifest registration.
Example: ["s3://storage1/data", "s3://storage2/data"]
target_data_paths: list of str, optional
*Experimental*. Target URI for writing data files (array with exactly one element).
Contributor

nit: no need to mark as experimental

}

/// Get the ObjectStore for a specific path based on base_id
pub(crate) fn get_object_store_for_path(&self, base_id: Option<&u32>) -> Result<Arc<lance_io::object_store::ObjectStore>> {
Contributor

I guess this would be necessary for the multi-cloud use case, but apart from that, would the configuration within the same cloud be different? We cannot really pass different configurations for the same cloud, so I think we should cache object stores by the path scheme instead of by the path. This will reduce the number of object stores we cache per dataset, especially if you need to set a lot of bases.

@jaystarshot (Contributor, Author) Oct 4, 2025

Yes, we need this for each bucket since the object store has the bucket info in the constructor, but that could be changed, I guess. Otherwise we write to the same bucket as the primary. The cache key here is the bucket URI, so there will just be one entry per bucket.

Contributor

The key thing here, I think, is: do you expect to pass in different object store configs per data path? Today these will just be initialized with whatever credentials and settings are in the environment or that you pass in when opening the dataset. They won't really be different object stores from a configuration perspective. If you need to customize at that level, we probably need to find a way to pass that info along the way here.

Contributor Author

Not for the current use case, but I guess in general it could be nice to have in the future.

Contributor

Sounds good. But in general, we already have a cache in ObjectStoreRegistry; we should use that instead of creating another cache here. That cache already uses the base path as the cache key as well.

Contributor Author

@jackye1995 we need the object store cache at the dataset level (like how the primary object store is stored here).

This is used by the reader to fetch the correct object store (using get_object_store_for_path). Are you suggesting storing the ObjectStoreRegistry here?

Contributor

Sorry, let me be clear: I mean that in the dataset builder, we should not just initialize these extra object stores. They are not always going to be used, and an object store is not a cheap struct to initialize. We should initialize them on demand when we need to use them.

Contributor Author

/// Additional path URIs for multi-path support (cached in ObjectStoreRegistry)
pub(crate) extra_path_uris: Arc<HashMap<u32, String>>,

/// Store parameters used to create object stores (needed for extra paths)
pub(crate) store_params: ObjectStoreParams,

I will need to store these in the dataset so that we can create the object stores on demand.

// Key-value pairs to merge with existing config.
map<string, string> config_upsert_values = 4;
// Additional path URIs for data file distribution in multi-path layouts
repeated string initial_data_paths = 5;
Contributor

Regardless of what we do in the Python API, at the transaction level we should have a list of BasePath, not just string data paths. In BasePath we have important fields like is_dataset_root, as well as a user-provided name for the path.

@majin1102 (Contributor) commented Oct 4, 2025

> 3. It is similar to how we reserve fragment IDs as a separate commit during compaction.

OK, found the context!

I think there is a difference between the two cases: reserving fragment IDs does not add content to the manifest, right? It's a field-update operation. The transactional issue here would leave unused paths behind (which might even still exist). Even if we have multi-statement transactions, I don't think we can guarantee people use it correctly, and it might result in a bloated manifest file for a long-lived dataset.

Maybe we could add cleaning of unreferenced base_paths to the cleanup procedure. Or make this a table config, like cleaning unreferenced paths over the latest n snapshots. And of course this would be optional in case users want to manage it manually. This could also guard the file deletion/compaction cases.

What do you think? @jackye1995 @dacort

@jackye1995 (Contributor)

> The transactional issue here would leave unused paths behind (which might even still exist). Even if we have multi-statement transactions, I don't think we can guarantee people use it correctly, and it might result in a bloated manifest file for a long-lived dataset.

Can you elaborate a bit more? I did not get the concern here.

I think it is important for users to assign a logical name to each path for deduping purposes, so that if you try to add paths with the same name, it fails.

and can be retrieved using read_transaction().
If both `commit_message` and `properties` are provided, `commit_message` will
override any "lance.commit.message" key in `properties`.
initial_data_paths: list of str, optional
Copy link
Contributor

@jackye1995 jackye1995 Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this last night; I feel there is a tension between simple user experience on one hand and flexibility and correctness on the other:

In the most precise form, what the user should define is the full base path information, something like:

write_dataset(data, dataset, mode="create", initial_bases=[
    {"name": "b1", "path": "s3://a/b1", "is_dataset_root": "true"},
    {"name": "b2", "path": "s3://a/b2", "is_dataset_root": "false"},
])

# target existing bases
write_dataset(data2, dataset, mode="append", target_bases=[
    "b1",
    "b2",
])

# target new bases
write_dataset(data2, dataset, mode="append", target_bases=[
    {"name": "b3", "path": "s3://a/b3", "is_dataset_root": "true"},
])

but I also understand this is quite cumbersome to write compared to directly writing initial_data_paths and target_data_paths. One big concern I have with the simpler form is that people write paths slightly differently all the time (e.g. s3://a/b1 vs s3://a/b1/). It is hard to know whether they mean the same path or actually mean different ones.

We also have a similar problem about how expressive we should be in the Overwrite transaction, as I commented above.

So here is my latest thinking: at the user interface level, instead of having a dedicated field only for create mode, we just fully separate the concept of the new bases to register from the target bases to use. Regardless of the mode, the user always registers new bases in the field new_bases and specifies what to use in target_bases. Here is the updated example:

write_dataset(data, dataset, mode="create", new_bases=[
    {"name": "b1", "path": "s3://a/b1", "is_dataset_root": "true"},
    {"name": "b2", "path": "s3://a/b2", "is_dataset_root": "false"},
], target_bases=["b1", "b2"])

# target existing bases
write_dataset(data2, dataset, mode="append", target_bases=[
    "b1",
    "b2",
])

# target new bases
write_dataset(data2, dataset, mode="append", new_bases=[
    {"name": "b3", "path": "s3://a/b3", "is_dataset_root": "true"},
], target_bases=["b3"])

And at the transaction level, new_bases translates directly to the new bases we should add in the transaction model, for example in Overwrite:

// Create or overwrite the entire dataset.
message Overwrite {
  // The new fragments
  //
  // Fragment IDs are not yet assigned.
  repeated DataFragment fragments = 1;
  // The new schema
  repeated lance.file.Field schema = 2;
  // Schema metadata.
  map<string, bytes> schema_metadata = 3;
  // Key-value pairs to merge with existing config.
  map<string, string> config_upsert_values = 4;
  // Base paths for the whole dataset
  repeated BasePath initial_bases = 5;
}

@majin1102 @jaystarshot what do we think about this proposal?

Contributor Author

I think we can define the API for the overwrite behavior once we have concrete use cases. For us, just having a separate API is fine, but there could be other use cases.

Contributor

Agree with the proposal of putting them together in one operation.

I was wondering whether it is really necessary to register unrelated base paths in the write interface. If we make adding a base path idempotent, I think we could just do:

write_dataset(data2, dataset, mode="append", bases=[{"path": "s3://a/b3"}])

This means name is None (if we don't force there to be one) and is_dataset_root takes its default.

Or we could use:

write_dataset(data2, dataset, mode="append", bases=[{"name": "b3", "path": "s3://a/b3", "is_dataset_root": "false"}])

At runtime we check whether the base path already exists and make sure it is added before committing the data.

@jaystarshot force-pushed the jay-lance-multi-bucket branch 3 times, most recently from 64be91c to 7a85bc2 on October 7, 2025
).await?;

&bucket_uri,
&ObjectStoreParams::default(),
@jaystarshot (Contributor, Author) Oct 8, 2025

The default params likely won't work when storage params are specified (GCS/OCI).

@jackye1995 force-pushed the jay-lance-multi-bucket branch from 8ae5bb8 to e53dc1c on October 8, 2025
@majin1102 (Contributor)

> The transactional issue here would leave unused paths behind (which might even still exist). Even if we have multi-statement transactions, I don't think we can guarantee people use it correctly, and it might result in a bloated manifest file for a long-lived dataset.
>
> Can you elaborate a bit more? I did not get the concern here.
>
> I think it is important for users to assign a logical name to each path for deduping purposes, so that if you try to add paths with the same name, it fails.

Sorry for the late reply.

I was thinking about the scenario of writing files to a base that has not been added yet; that should potentially be an ACID operation. But on second thought, it is reasonable to split this into two operations, one to add the base and one to write the data, with a small limitation for the scenario where people want to dynamically manage bases according to the data.

Two cases in my mind:

  1. The base paths are static, added once and for all.
  2. The base paths are added according to the written data, by some field value or even a UUID. This is somewhat like partition usage.

For the second scenario, I assumed we have to add the base before committing the data. Then we might encounter the case where the data commit failed but the base was added. In that scenario we could expect a commit retry, but without any guarantee, so the base could be left there forever (with UUIDs it just becomes a garbage hole).

I think multi-base management could be a complex issue, and it is somehow related to the partition scenario (like this multi-bucket case). Or, let's say, it might provide a foundation that partitions aim for. Nowadays when we talk about partitions we usually think of partition specs and the like in Iceberg, but if we look back at Hive, partitions are just managed paths. I was wondering if we could reuse this multi-base ability in a partition solution. What do you think of this? @dacort @jackye1995

To be clear, I'm on the side of not designing multi-base to be as heavy as partitions. But if we consider the common ground between them, we could draw lessons from those APIs and think about how the upper layers could evolve toward a partition-oriented solution on top of the multi-base architecture in the future.

@majin1102 (Contributor) commented Oct 9, 2025

> I think it is important for users to assign a logical name to each path for deduping purposes, so that if you try to add paths with the same name, it fails.

I didn't get this. I think the path itself is unique and could be used for deduping? If we force people to use a name as an identifier, I would just derive it from the path, which seems a little unnecessary.

@jackye1995 force-pushed the jay-lance-multi-bucket branch 2 times, most recently from b02e14e to 527759c on October 10, 2025
@github-actions bot added the java label Oct 10, 2025
@jaystarshot (Contributor, Author) left a comment

Thanks a lot, looks great

@jackye1995 force-pushed the jay-lance-multi-bucket branch from df59fb0 to 44ce9bf on October 10, 2025
@jackye1995 force-pushed the jay-lance-multi-bucket branch from f58919d to 27cef48 on October 10, 2025
@jackye1995 merged commit dba1e0f into lance-format:main Oct 10, 2025
38 of 39 checks passed
jackye1995 pushed a commit that referenced this pull request Nov 18, 2025
The lance.write_dataset() function already supports writing to multiple storage buckets via the target_bases parameter introduced in #4765.

However, write_fragments() did not expose this capability, even though the underlying Rust implementations (_write_fragments, _write_fragments_transaction) already supported it.
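
A minimal sketch of what that follow-up enables, assuming write_fragments() accepts the same target_bases parameter as write_dataset() (the exact placement of the parameter here is an assumption):

from lance.fragment import write_fragments

# Assumed: target_bases mirrors write_dataset's parameter and takes
# base names registered on the dataset.
fragments = write_fragments(
    new_data,
    "s3://primary/dataset",
    target_bases=["b2"],
)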

Labels

enhancement (New feature or request), java, python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multi bucket layout

4 participants