-
Notifications
You must be signed in to change notification settings - Fork 11
use parquet metadata cache for parquetmetadata format as well #636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| namespace DB | ||
| { | ||
|
|
||
| ParquetFileMetaDataCache::ParquetFileMetaDataCache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks too silly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this singleton here? Can't the cache be just a member somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used by two classes: ParquetBlockInputFormat and ParquetMetadataInputFormat
| } | ||
| } | ||
|
|
||
| static std::shared_ptr<parquet::FileMetaData> getFileMetadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two implementations of getFileMetadata. One for ParquetBlockInputFormat.cpp and one for ParquetMetadataInputFormat. I am still not sure if I should leave it duplicated or put it somewhere.. The logic is pretty much the same, but requires a couple of arguments to be passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not see a lot of duplication (as code is really a bit different). But it depends on whether we are planning to keep this here or move it to upstream later.
If we are keeping it here, I would avoid unnecessary refactoring in this case -- this code does not look like it will change a lot, but changing code in more places may cause additional conflicts
175a172 to
cb8d032
Compare
|
This is an automated comment for commit 8abf0b1 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
|
ParquetSchemaReader also calls One thing to note is that Therefore, we have the following cases: when reading a single file: it would be benefitial to add the caching logic there as well, because later on when reading multiple files: wouldn't make a big difference since that code is meant to read the schema, and the schema for a single file suffices for the whole bunch. |
zvonand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previous implementation #586 only applied caching to blockinput format. In order to re-use and share the cache, some refactoring had to be done. Cache initialization was moved to
Server.cpp.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Use ParquetMetadataCache for ParquetMetadata format as well.
Documentation entry for user-facing changes
Modify your CI run:
NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: