use parquet metadata cache for parquetmetadata format as well #636

arthurpassos · 2025-02-18T18:57:59Z

The previous implementation (#586) applied caching only to ParquetBlockInputFormat. To reuse and share the cache with ParquetMetadataInputFormat, some refactoring was needed: cache initialization was moved to Server.cpp.

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Use ParquetMetadataCache for ParquetMetadata format as well.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Modify your CI run:

NOTE: If you merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

Exclude tests:

Extra options:

do not test (only style check)
disable merge-commit (no merge from master before tests)
disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

1
2
3
4

arthurpassos · 2025-02-18T19:06:57Z

src/Processors/Formats/Impl/ParquetFileMetaDataCache.cpp

+namespace DB
+{
+
+ParquetFileMetaDataCache::ParquetFileMetaDataCache()


This looks too silly

Why do we need this singleton here? Can't the cache be just a member somewhere?

It is used by two classes: ParquetBlockInputFormat and ParquetMetadataInputFormat
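To make the trade-off in this thread concrete, here is a minimal sketch of a process-wide cache reached through a singleton accessor and shared by two otherwise unrelated consumers. It is not the code from this PR: `FileMetaData`, `MetadataCache`, `getOrSet`, and the two consumer structs are hypothetical stand-ins, the key is assumed to be an opaque string, and there is no eviction or size limit.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Stand-in for parquet::FileMetaData (assumption: the real type comes from Arrow).
struct FileMetaData
{
    std::string schema;
};

// Hypothetical process-wide cache keyed by an opaque string.
class MetadataCache
{
public:
    using Ptr = std::shared_ptr<FileMetaData>;
    using Loader = std::function<Ptr()>;

    // Function-local static: constructed on first use, thread-safe since C++11.
    static MetadataCache & instance()
    {
        static MetadataCache cache;
        return cache;
    }

    // Return the cached entry, or call `load` once and remember its result.
    // The lock is held while loading, which keeps the sketch simple.
    Ptr getOrSet(const std::string & key, const Loader & load)
    {
        std::lock_guard<std::mutex> lock(mutex);
        auto it = entries.find(key);
        if (it != entries.end())
            return it->second;
        Ptr value = load();
        assert(value != nullptr);
        entries.emplace(key, value);
        return value;
    }

private:
    MetadataCache() = default;

    std::mutex mutex;
    std::unordered_map<std::string, Ptr> entries;
};

// Two consumers with no common owner, standing in for the two input formats.
struct BlockInputFormat
{
    MetadataCache::Ptr read(const std::string & key, const MetadataCache::Loader & load)
    {
        return MetadataCache::instance().getOrSet(key, load);
    }
};

struct MetadataInputFormat
{
    MetadataCache::Ptr read(const std::string & key, const MetadataCache::Loader & load)
    {
        return MetadataCache::instance().getOrSet(key, load);
    }
};
```

Because neither consumer owns the other, a plain member would have to live in some third object that both can reach; the singleton accessor is one way to avoid threading that object through both constructors.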

arthurpassos · 2025-02-18T19:08:02Z

src/Processors/Formats/Impl/ParquetMetadataInputFormat.cpp

    }
 }

 static std::shared_ptr<parquet::FileMetaData> getFileMetadata(


There are two implementations of getFileMetadata: one in ParquetBlockInputFormat.cpp and one in ParquetMetadataInputFormat.cpp. I am still not sure whether I should leave it duplicated or move it somewhere shared. The logic is pretty much the same, but a shared version would require a couple of arguments to be passed.

I do not see much duplication (the code really is a bit different). But it depends on whether we are planning to keep this here or move it upstream later.

If we are keeping it here, I would avoid unnecessary refactoring in this case -- this code does not look like it will change a lot, but changing code in more places may cause additional conflicts
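For reference, a shared helper along the lines discussed above might look roughly like the sketch below. This is an illustration under assumptions, not the code in either file: `MetadataLookupArgs`, `MetadataStore`, and the `read_from_file` callback are hypothetical names for the "couple of arguments" each caller would need to pass, and the bypass conditions are guesses.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-in for parquet::FileMetaData.
struct FileMetaData
{
    int num_rows = 0;
};
using FileMetaDataPtr = std::shared_ptr<FileMetaData>;

// Hypothetical per-call arguments that both input formats would supply.
struct MetadataLookupArgs
{
    bool use_cache = false;  // assumed per-query switch
    std::string cache_key;   // assumed empty when the source has no stable identity
};

// Hypothetical storage; a real cache would also bound its size.
using MetadataStore = std::map<std::string, FileMetaDataPtr>;

// One helper for both callers: consult the store when allowed, otherwise read directly.
FileMetaDataPtr getFileMetadata(
    const MetadataLookupArgs & args,
    MetadataStore & store,
    const std::function<FileMetaDataPtr()> & read_from_file)
{
    // Bypass the cache when it is disabled or the file cannot be identified.
    if (!args.use_cache || args.cache_key.empty())
        return read_from_file();

    auto it = store.find(args.cache_key);
    if (it != store.end())
        return it->second;

    FileMetaDataPtr metadata = read_from_file();
    assert(metadata != nullptr);
    store[args.cache_key] = metadata;
    return metadata;
}
```

The extra parameters are the cost of sharing: each call site has to assemble the arguments and the reader callback, which is part of why leaving the two slightly different copies in place can be the lower-conflict option.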

altinity-robot · 2025-02-18T20:24:04Z

This is an automated comment for commit 8abf0b1 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name	Description	Status
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	❌ failure
Regression aarch64 S3 aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Regression aarch64 S3 azure	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Regression aarch64 S3 gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Regression release S3 aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Regression release S3 azure	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Regression release S3 gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Sign aarch64	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ error
Sign release	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ error
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	❌ failure
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	❌ failure

Successful checks

Check name	Description	Status
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker keeper image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docker server image	The check to build and optionally push the mentioned image to docker hub	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Ready for release	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Alter attach partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Alter move partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Alter replace partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Benchmark aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Benchmark gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Benchmark minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Clickhouse Keeper SSL	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 LDAP authentication	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 LDAP external_user_directory	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 LDAP role_mapping	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Parquet aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Parquet minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Parquet	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 S3 minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Tiered Storage minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Tiered Storage s3amazon	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 Tiered Storage s3gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 aes_encryption	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 aggregate_functions	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 atomic_insert	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 base_58	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 clickhouse_keeper	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 data_types	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 datetime64_extended_range	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 disk_level_encryption	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 dns	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 engines	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 example	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 extended_precision_data_types	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 kafka	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 kerberos	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 key_value	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 lightweight_delete	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 memory	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 part_moves_between_shards	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 rbac	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 selects	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 session_timezone	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 ssl_server	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 tiered_storage	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression aarch64 window_functions	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Alter attach partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Alter move partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Alter replace partition	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Benchmark aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Benchmark gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Benchmark minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Clickhouse Keeper SSL	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release LDAP authentication	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release LDAP external_user_directory	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release LDAP role_mapping	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Parquet aws_s3	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Parquet minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Parquet	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release S3 minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Tiered Storage minio	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Tiered Storage s3amazon	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release Tiered Storage s3gcs	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release aes_encryption	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release aggregate_functions	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release atomic_insert	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release base_58	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release clickhouse_keeper	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release data_types	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release datetime64_extended_range	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release disk_level_encryption	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release dns	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release engines	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release example	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release extended_precision_data_types	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release kafka	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release kerberos	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release key_value	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release lightweight_delete	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release memory	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release part_moves_between_shards	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release rbac	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release selects	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release session_timezone	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release ssl_server	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release tiered_storage	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Regression release window_functions	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success

arthurpassos · 2025-02-24T12:49:44Z

ParquetSchemaReader also calls parquet::ReadMetaData, but ClickHouse already has a schema cache in place, so I don't think the caching logic needs to be added there.

One thing to note is that ParquetSchemaReader reads the metadata before reaching ParquetBlockInputFormat.

Therefore, we have the following cases:

When reading a single file: it would be beneficial to add the caching logic there as well, because ParquetBlockInputFormat would then not need to make the call later. On the other hand, once both caches are warmed up, neither of the two calls will be made, so it may not be worth it.

When reading multiple files: it wouldn't make a big difference, since that code is meant to read the schema, and the schema of a single file suffices for the whole set.
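The single-file case can be checked with a toy model that only counts metadata reads. This is a simplification under assumptions, not ClickHouse code: `CacheState` and `countMetadataReads` are hypothetical, and the model assumes schema inference always runs before the block input format and is skipped entirely on a schema-cache hit.

```cpp
#include <cassert>

// Which caches already hold an entry for the file being read.
struct CacheState
{
    bool schema_cached = false;    // schema cache
    bool metadata_cached = false;  // Parquet metadata cache
};

// Returns how many times the file metadata is read for one query on one file.
// `schema_reader_uses_cache` models the optional change discussed above.
int countMetadataReads(CacheState & state, bool schema_reader_uses_cache)
{
    int reads = 0;

    // Step 1: schema inference runs first and is skipped on a schema-cache hit.
    if (!state.schema_cached)
    {
        const bool hit = schema_reader_uses_cache && state.metadata_cached;
        if (!hit)
        {
            ++reads;
            if (schema_reader_uses_cache)
                state.metadata_cached = true;
        }
        state.schema_cached = true;
    }

    // Step 2: the block input format consults the metadata cache.
    if (!state.metadata_cached)
    {
        ++reads;
        state.metadata_cached = true;
    }

    assert(state.schema_cached && state.metadata_cached);
    return reads;
}
```

In this model a cold query costs two reads without the change and one with it, while any later query on the same file costs zero either way, which matches the conclusion that the gain is limited to the first access.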

zvonand

LGTM

use parquet metadata cache for parquetmetadata format as well

85b268e

add test, see what CICD says

cb8d032

arthurpassos force-pushed the project-antalya-24.12.2_metadata_cache_fixes_1 branch from 175a172 to cb8d032 Compare February 18, 2025 19:18

arthurpassos added 2 commits February 18, 2025 17:35

simplify dependencies by forward declaring parquet structure

760bc30

fix tests

79da2cf

Enmk changed the base branch from project-antalya-24.12.2 to antalya February 19, 2025 21:29

zvonand approved these changes Feb 25, 2025

arthurpassos and others added 2 commits February 25, 2025 12:04

Merge branch 'antalya' into project-antalya-24.12.2_metadata_cache_fixes_1

a0f7b08

Merge branch 'antalya' into project-antalya-24.12.2_metadata_cache_fixes_1

8abf0b1

Enmk merged commit dc3ad7f into antalya Mar 1, 2025
289 of 338 checks passed

arthurpassos mentioned this pull request Apr 4, 2025

Forward port parquet metadata cache impl #715

Merged

hodgesrm mentioned this pull request May 23, 2025

Project Antalya Roadmap 2025 - Real-Time Data Lakes #804

Open

37 tasks

Enmk mentioned this pull request May 27, 2025

25.3 Antalya port - Parquet metadata caching #795

Merged

13 tasks
