Adds memory-bound DefaultListFilesCache #18855
Conversation
```rust
// TODO: config
512 * 1024,
Duration::new(600, 0),
```
This POC doesn't implement any of the user configuration. This seems like a good opportunity to divide the work on this effort! We could get the base DefaultListFilesCache approved for merge without user configuration, and leave it disabled, and user configuration could be added by anyone who wants to contribute.
I agree
Another thing we should do is add some way to introspect the cache as part of datafusion-cli
Yes, I think we could also do some of the more complex desires for this cache, like being "prefix aware" in some way, as follow-on issues once this initial scaffolding exists.
```rust
pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects
pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min
```
The actual default values here probably need to be discussed. These seemed relatively sane to me, but any input here on what values these should have to best accommodate a variety of workflows would be useful feedback.
My personal opinion is that we should set TTL much longer (maybe even infinite) by default. Users who know their files are changing are likely going to have to crank it down from a default anyways, so we might as well make the default behavior more deterministic
Hmmm, my initial thought here was that there are certainly going to be users who don't read the release notes, and their expected workflow of new objects being picked up on every query will be broken. With a relatively short TTL their data will at least appear eventually, which may be preferable to not-at-all. However, I really like the strategy of using the most deterministic solution as the default (infinite), especially as the long-term solution.
> their expected workflow of new objects being picked up on every query will be broken.
I agree with this
On the other hand, I am not sure what your patience level is but it is very unlikely I would wait 10 minutes... If it doesn't work within a few seconds, I would probably go start looking for problems.
```rust
pub struct DefaultListFilesCacheState {
    lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>,
    capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"?
```
I didn't feel like limiting this cache by "bytes" really made sense because the data stored in the cache is generally very uniform in size, perhaps aside from the path. I felt that it was probably small enough that simply limiting it by the number of entries should suffice, and "entries" seems like it would be easier for users to configure.
I think that bytes makes more sense, mostly because that is the resource that is actually used. A count of items is just a proxy for resource usage.
```rust
// TODO: driveby-cleanup
/// The cache accessor, users usually working on this interface while manipulating caches.
```
I noticed this doc comment could be edited for additional clarity, so I figured while we were in this area of code we could improve this!
```rust
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
    Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
```
I think one of the things I would like to do in the Cache Manager caches is to segregate the cache eviction policy. Personally I think the user should be given an option for which eviction behaviour they want. wdyt @alamb @BlakeOrth? I can work on getting a draft out this weekend on it.
I think this is a reasonable suggestion, and it looks like there may be some ongoing discussion around this topic here:
While that work would undoubtedly impact this (and other) default cache implementations, I think it should probably be discussed separately from this effort.
I am personally in favor of keeping what is in DataFusion as simple to use / understand as possible (which I think this and the metadata cache do)
In terms of customizable eviction strategy, as we have mentioned elsewhere that is already possible today, but it requires effectively copying/forking the entire cache implementation which adds to the maintenance burden of downstream projects
However, adding more APIs to DataFusion increases the maintenance burden on the core project
So I see customizable eviction strategies as a tradeoff. If there are multiple users who are likely to use a customizable eviction strategy, I agree it makes sense to put it in the core repo. If there are not that many, I think it would be better to keep DataFusion simpler and move the burden downstream for those users who need it.
@alamb After reading your thoughts on this I'm wondering if a more complex cache infrastructure would be a good project to start in https://github.com/datafusion-contrib and if it gains enough traction perhaps it could be promoted to the core repo?
Yes, that sounds like a great idea to me -- I think there is a wide wonderful world of complex caching strategies that people might want
A little bit of a side topic, @alamb: I was looking into other caches and didn't understand why DefaultFileStatisticsCache isn't memory bound yet. Do we want to do it incrementally?
I am not sure what you are asking -- do you mean metadata_cache_limit on https://datafusion.apache.org/user-guide/configs.html#runtime-configuration-settings ?
> A little bit side topic. @alamb was looking into other caches, didn't understand why `DefaultFileStatisticsCache` isn't memory bound yet. Do we want to do it incrementally?

I have filed a ticket for this item:
```rust
table_files_statistics_cache: Default::default(),
list_files_cache: Default::default(),
list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT,
list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL,
```
Some use cases don't need a TTL; we should provide a way to keep it disabled as well.
Well, thinking more, I understand why you have kept it. I feel it diverges from the metadata cache and could confuse the end users somewhat.
I actually agree with your first assessment that TTL can, and likely should, be optional if a user has a use case where they know the underlying objects are immutable once written (my personal use of DataFusion falls into this category). In either case, we may have to accept some differences between this cache and the Metadata cache. Unlike the Metadata cache, which can issue HEAD requests against objects to detect modification, there's no mechanism in an object store to inquire whether or not paths/sub-paths have been changed, so this cache will need to make some concessions around that limitation.
I do agree having an "infinite TTL" is an important use case
I don't have a strong opinion on how that is expressed in the config options
alamb
left a comment
This looks great to me, FWIW
Thank you @BlakeOrth and @alchemist51
```rust
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
    Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
```
I am personally in favor of keeping what is in DataFusion as simple to use / understand as possible (which I think this and the metadata cache do)
In terms of customizable eviction strategy, as we have mentioned elsewhere that is already possible today, but it requires effectively copying/forking the entire cache implementation which adds to the maintenance burden of downstream projects
However, adding more APIs to DataFusion increases the maintenance burden on the core project
So I see customizable eviction strategies as a tradeoff. If there are multiple users who are likely to use a customizable eviction strategy, I agree it makes sense to put it in the core repo. If there are not that many, I think it would be better to keep DataFusion simpler and move the burden downstream for those users who need it.
```rust
// Returns the cache's object ttl.
fn cache_ttl(&self) -> Duration;

// Updates the cache with a new boject limit.
```
```diff
- // Updates the cache with a new boject limit.
+ // Updates the cache with a new object limit.
```
```text
RequestCountingObjectStore()
Total Requests: 4
- LIST prefix=data
Total Requests: 3
```
it is pretty neat to see this in action
These tests have been backed out of the most recent commit because the cache is no longer enabled by default (since it doesn't quite fulfill those requirements yet). Integration level testing changes/enhancements will come once we get the ability to configure the cache from the user level.
```text
RequestCountingObjectStore()
Total Requests: 2
- LIST prefix=data
Total Requests: 1
```
Could we also add a test that shows that once the cache eviction happens, a subsequent query does actually make a new LIST request?
Yes, I think that's a good test case, and obviously any tests have more or less been ignored thus far. The only tests I actually focused on initially were these tests since they both show (and helped me validate) this code was actually performing internal caching as expected.
```rust
/// command on the local filesystem. This operation can be expensive,
/// especially when done over remote object stores.
///
/// See [`crate::runtime_env::RuntimeEnv`] for more details
```
This is technically a breaking API change (a good one in my mind). I am just pointing it out
```rust
table_files_statistics_cache: Default::default(),
list_files_cache: Default::default(),
list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT,
list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL,
```
I do agree having an "infinite TTL" is an important use case
I don't have a strong opinion on how that is expressed in the config options
```rust
pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects
pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min
```
My personal opinion is that we should set TTL much longer (maybe even infinite) by default. Users who know their files are changing are likely going to have to crank it down from a default anyways, so we might as well make the default behavior more deterministic
```rust
pub struct DefaultListFilesCacheState {
    lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>,
    capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"?
```
I think that bytes makes more sense, mostly because that is the resource that is actually used. A count of items is just a proxy for resource usage.
```rust
    }
}

fn get(&mut self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
```
I think we just changed some of these APIs to take `&self` rather than `&mut self` -- I am not sure if we want to do the same thing here.
Ah, thanks for pointing this out. I ran into and resolved one merge conflict related to that PR and didn't realize it may have had a subtly wider impact than just the method that caused a conflict on my rebase. I will make sure to revisit this.
I reviewed the changes from the above PR and I believe the implementation follows the same mutability pattern as the Metadata cache. The `CacheAccessor::remove` method takes `&self` and the methods for the `DefaultListFilesCacheState` still need to take `&mut self`.
```rust
    _k: &Path,
    _e: &Self::Extra,
) -> Option<Arc<Vec<ObjectMeta>>> {
    panic!("Not supported DefaultListFilesCache get_with_extra")
```
I can't remember what the rationale for this panic was. It seems to have come in via
Maybe @Ted-Jiang or @suremarc have some thoughts here
@alamb Thanks for the initial review and comments! I think there's one key decision that needs to be made before moving forward in earnest here, which is what action we should take with the cache for

Do you have any thoughts or preferences on this?
- Moves the default ListFilesCache implementation into a new list_files_cache module
- Refactors the ListFilesCache to mirror the MetadataCache by defining a new trait instead of a fixed type wrapping a trait
- Applies a memory based size limit on the cache
- Allows for an optional TTL on cache entries
- Implements unit tests for the new implementation
- New docs for new methods and some small cleanup on docs for existing cache doc strings
f2f4df8 to 6067aa2
@alamb I have done a decent amount of cleanup, implemented tests, added/edited doc strings etc. and I believe this PR is now out of a POC state, moving towards being merge ready. Given the constructive comments that have already been made here, I decided to just update and clean up this existing PR. I'm also happy to open a fresh PR if we want a clean slate!
```rust
// Invalidate cache entries for this table if they exist
if let Some(lfc) = state.runtime_env().cache_manager.get_list_files_cache() {
    let _ = lfc.remove(table_path.prefix());
}
```
For the initial implementation I chose to use the simple option of invalidating a table's cache entries on INSERT. Since the default TTL is infinite any DataSink that only supports writing new files (as opposed to appending) would never pick up inserts without some action being taken here.
It would be very nice to handle this more elegantly in the future.
makes sense. Thank you
```rust
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
```
I thought the cache_unit.rs file was already getting pretty large so I pulled this implementation into a new module to better organize the code and tests. I think the Metadata cache could probably also be isolated into its own module, but I didn't want to get too heavy handed in this PR.
Is this something you are willing to do, or shall I make one?
Yes, I'm happy to make an issue for this!
```rust
fn get_with_extra(&self, k: &Path, _e: &Self::Extra) -> Option<Arc<Vec<ObjectMeta>>> {
    self.get(k)
}
```
Looking forward, I'm wondering if these methods can end up supporting this cache's ability to be "prefix aware". Maybe it makes sense to have `type Extra = Path;` where `k: &Path` is the table path and `e: &Self::Extra` is the prefix. That way the cache can internally differentiate between entries that are just table list entries vs entries that have prefixes.
Thanks @BlakeOrth -- I'll try and give this one a look today
alamb
left a comment
I went through this PR carefully @BlakeOrth -- thank you so much.
I took the liberty of:

- adding a note to the upgrade guide
- merging up from main
- adding a test in object_store_access to show that the `LIST` requests are properly avoided (it seems not right)
The only thing I think we need to fix before merging is the test in object_store_access -- I don't understand why the cache isn't being used.
Otherwise I think we can merge this PR and handle other issues as follow ons.
Before we release DataFusion 52 I think we need to do these items:
```rust
// Invalidate cache entries for this table if they exist
if let Some(lfc) = state.runtime_env().cache_manager.get_list_files_cache() {
    let _ = lfc.remove(table_path.prefix());
}
```
makes sense. Thank you
```rust
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
    Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
```
> A little bit side topic. @alamb was looking into other caches, didn't understand why `DefaultFileStatisticsCache` isn't memory bound yet. Do we want to do it incrementally?

I have filed a ticket for this item:
```rust
/// Default is disabled.
pub list_files_cache: Option<Arc<dyn ListFilesCache>>,
/// Limit the number of objects to keep in the `list_files_cache`. Default: ~125k objects
pub list_files_cache_limit: usize,
```
I see this setup is used to support runtime configuration of the cache. See for example
datafusion/datafusion/core/src/execution/context/mod.rs
Lines 1160 to 1163 in 838e1de
```rust
"metadata_cache_limit" => {
    let limit = Self::parse_memory_limit(value)?;
    builder.with_metadata_cache_limit(limit)
}
```
I think what we should do (as a follow on PR) is to add runtime configuration settings for the max cache size and its ttl in https://datafusion.apache.org/user-guide/configs.html#runtime-configuration-settings
I filed a ticket to do so:
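For a rough idea of what parsing a human-readable limit like `512K` or `1M` involves, here is a std-only sketch; this is an illustration of the pattern, not DataFusion's actual `parse_memory_limit` implementation:

```rust
/// Sketch: parse a memory limit string such as "512K", "2M", or a
/// plain byte count. Suffix handling here is an assumption made for
/// illustration, not the real parser's exact grammar.
fn parse_memory_limit(value: &str) -> Result<usize, String> {
    let value = value.trim();
    let (digits, multiplier) = match value.chars().last() {
        Some('K') => (&value[..value.len() - 1], 1024),
        Some('M') => (&value[..value.len() - 1], 1024 * 1024),
        Some('G') => (&value[..value.len() - 1], 1024 * 1024 * 1024),
        _ => (value, 1),
    };
    digits
        .parse::<usize>()
        .map(|n| n * multiplier)
        .map_err(|e| format!("invalid memory limit {value:?}: {e}"))
}

fn main() {
    assert_eq!(parse_memory_limit("512K"), Ok(512 * 1024));
    assert_eq!(parse_memory_limit("2M"), Ok(2 * 1024 * 1024));
    assert_eq!(parse_memory_limit("42"), Ok(42));
    assert!(parse_memory_limit("abc").is_err());
    println!("ok");
}
```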
```rust
impl ListFilesEntry {
    fn try_new(metas: Arc<Vec<ObjectMeta>>, ttl: Option<Duration>) -> Option<Self> {
        let size_bytes = (metas.capacity() * size_of::<ObjectMeta>())
            + metas.iter().map(meta_heap_bytes).reduce(|acc, b| acc + b)?;
```
👌 I double checked and I believe this captures the heap size accurately
Thank you for double checking this!
```rust
    assert_eq!(cache.len(), 0); // path2 was removed by contains_key()
}

#[test]
```
these are very nice tests
alamb
left a comment
I went through this PR carefully @BlakeOrth -- thank you so much.
I took the liberty of:

- adding a note to the upgrade guide
- merging up from main
- adding a test in object_store_access to show that the `LIST` requests are properly avoided (it seems not right)
The only thing I think we need to fix before merging is the test in object_store_access -- I don't understand why the cache isn't being used.
Otherwise I think we can merge this PR and handle other issues as follow ons.
Before we release DataFusion 52 I think we need to do these items:
```text
+ let field = Arc::clone(df_schema.field("my_column"));

### ListingTableProvider now caches `LIST` commands
```
I added this section to communicate the new behavior
```text
+---------+-------+-------+
------- Object Store Request Summary -------
RequestCountingObjectStore()
Total Requests: 4
```
@BlakeOrth this seems wrong to me -- now that we have a ListFilesCache, I would expect this query to NOT need to do a LIST command 🤔
With the current state of this PR I think this result is actually expected. I chose to remove the code that enables the cache by default, as well as this test (they were in the POC for illustration purposes). I thought since the cache currently cannot be configured, and really specifically that cannot be disabled, it would be best to leave it disabled by default to avoid breaking users.
Additionally, quite a few of the sqllogic tests seem to rely on the copy functionality and the existing expectation that DataFusion will pick up the changes on query. The easiest way to handle those tests will be to disable this cache for the duration of the tests, but that again relies on the cache having some configuration.
All that being said, I'm more than happy to move forward on this effort in whatever way you think is best! I can add a very simple configuration option (and docs etc.) to enable/disable this cache so we can keep this test you've written and have it show a reduction in LIST operations, or we can leave that as a portion of the "configuration issue" associated with this cache.
```rust
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
```
Is this something you are willing to do, or shall I make one?
Which issue does this PR close?
Initial cache implementation for:

- `ListFilesCache` implementation for the `ListingTable` #18827

This does not fully close the issue, since this implementation currently lacks user configuration and does not enable the cache by default.
Rationale for this change
This work lays the groundwork and initial implementation for the default cache. It's a necessary initial step to allow collaboration around this issue
What changes are included in this PR?
Are these changes tested?
Yes. New unit tests have been introduced to test these changes. The cache is not yet enabled by default and cannot be enabled through configuration, so integration tests don't yet apply.
Are there any user-facing changes?
Yes, this work breaks the existing `ListFilesCache` public API.

cc @alamb @alchemist51