-
Notifications
You must be signed in to change notification settings - Fork 483
feat: strategized plan compaction #5233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: strategized plan compaction #5233
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5233 +/- ##
==========================================
+ Coverage 82.05% 82.23% +0.18%
==========================================
Files 342 344 +2
Lines 141516 144881 +3365
Branches 141516 144881 +3365
==========================================
+ Hits 116115 119149 +3034
- Misses 21561 21800 +239
- Partials 3840 3932 +92
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few (mostly minor) suggestions
rust/lance/src/dataset/optimize.rs
Outdated
| // get all fragments by default | ||
| fn get_fragments(&self, dataset: &Dataset, _options: &CompactionOptions) -> Vec<FileFragment> { | ||
| // get_fragments should be returning fragments in sorted order (by id) | ||
| // and fragment ids should be unique | ||
| dataset.get_fragments() | ||
| } | ||
|
|
||
| // no filter by default | ||
| async fn filter_fragments( | ||
| &self, | ||
| _dataset: &Dataset, | ||
| fragments: Vec<FileFragment>, | ||
| _options: &CompactionOptions, | ||
| ) -> Result<Vec<FileFragment>> { | ||
| Ok(fragments) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these really need to be trait methods? I think we can probably leave them out and just let individual implementations use them if they want to. It will keep the trait simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
rust/lance/src/dataset/optimize.rs
Outdated
| Ok(fragments) | ||
| } | ||
|
|
||
| async fn plan(&self, dataset: &Dataset, options: &CompactionOptions) -> Result<CompactionPlan>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's document this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added docs
rust/lance/src/dataset/optimize.rs
Outdated
| Ok(fragments) | ||
| } | ||
|
|
||
| async fn plan(&self, dataset: &Dataset, options: &CompactionOptions) -> Result<CompactionPlan>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will CompactionOptions be flexible enough for all possible strategies? Should we maybe accept options as a JSON string or a Map<String, String>? This way different strategies can expose their own custom options. That would leave the API a little less defined but it would be more flexible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we even need to take CompactionOptions? Maybe it should be a argument to the constructor of the individual structs. That way each could have their own arguments but also be strongly typed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will CompactionOptions be flexible enough for all possible strategies? Should we maybe accept options as a JSON string or a Map<String, String>? This way different strategies can expose their own custom options. That would leave the API a little less defined but it would be more flexible.
Hi @westonpace Thanks a lot for your review. Added Map<String, String> for flexible.
Do we even need to take CompactionOptions? Maybe it should be a argument to the constructor of the individual structs. That way each could have their own arguments but also be strongly typed.
Hi @wjones127 Thanks a lot for your review.
I have tried several ways to eliminate the CompactionOptions parameter in the plan method, but none of them are perfect :( The main contradiction is that during users start planning compaction based on the built planner, they may dynamically adjust the options parameters on certain conditions, such as
If options are passed in when building the planner, then after modifying the options subsequently, it must also be ensured that the options in the planner can be seen. Therefore, we need Arc + mutex and cannot use clone.
On the contrary, it might be simpler and more flexible to pass in the desired options each time the plan method is called here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If options are passed in when building the planner, then after modifying the options subsequently, it must also be ensured that the options in the planner can be seen. Therefore, we need Arc + mutex and cannot use clone.
I don't understand this. The logic of validate() can live in the planner and be internal.
If I were to rewrite compact_files, I would do:
pub async fn compact_files(
dataset: &mut Dataset,
mut options: CompactionOptions,
remap_options: Option<Arc<dyn IndexRemapperOptions>>, // These will be deprecated later
) -> Result<CompactionMetrics> {
info!(target: TRACE_DATASET_EVENTS, event=DATASET_COMPACTING_EVENT, uri = &dataset.uri);
// .validate() now happens inside of `from_options`
let planner = DefaultCompactionPlanner::from_options(options);
compact_files_with_planner(dataset, &planner, remap_options).await
}
rust/lance/src/dataset/optimize.rs
Outdated
| /// Compacts the files in the dataset without reordering them. | ||
| /// | ||
| /// This does a few things: | ||
| /// By default, his does a few things: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// By default, his does a few things: | |
| /// By default, this does a few things: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
| pub async fn compact_files( | ||
| dataset: &mut Dataset, | ||
| options: CompactionOptions, | ||
| remap_options: Option<Arc<dyn IndexRemapperOptions>>, // These will be deprecated later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Hi @westonpace and @wjones127 Thanks a lot for your review. All comments are addressed. PTAL :) |
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I accidentally left my comments pending. I'm still not sure about the design. It seems like it could be simplified further.
| // get all fragments by default | ||
| fn get_fragments(&self, dataset: &Dataset, _options: &CompactionOptions) -> Vec<FileFragment> { | ||
| // get_fragments should be returning fragments in sorted order (by id) | ||
| // and fragment ids should be unique | ||
| dataset.get_fragments() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was already commented on, but do we need this? It seems like individual implementations can just call dataset.get_fragments() and then do whatever filtering they would like.
rust/lance/src/dataset/optimize.rs
Outdated
| Ok(fragments) | ||
| } | ||
|
|
||
| async fn plan(&self, dataset: &Dataset, options: &CompactionOptions) -> Result<CompactionPlan>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If options are passed in when building the planner, then after modifying the options subsequently, it must also be ensured that the options in the planner can be seen. Therefore, we need Arc + mutex and cannot use clone.
I don't understand this. The logic of validate() can live in the planner and be internal.
If I were to rewrite compact_files, I would do:
pub async fn compact_files(
dataset: &mut Dataset,
mut options: CompactionOptions,
remap_options: Option<Arc<dyn IndexRemapperOptions>>, // These will be deprecated later
) -> Result<CompactionMetrics> {
info!(target: TRACE_DATASET_EVENTS, event=DATASET_COMPACTING_EVENT, uri = &dataset.uri);
// .validate() now happens inside of `from_options`
let planner = DefaultCompactionPlanner::from_options(options);
compact_files_with_planner(dataset, &planner, remap_options).await
}
Close #5186