Skip to content

Conversation

@arthurpassos
Copy link
Collaborator

@arthurpassos arthurpassos commented Feb 18, 2025

Previous implementation #586 only applied caching to blockinput format. In order to re-use and share the cache, some refactoring had to be done. Cache initialization was moved to Server.cpp.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Use ParquetMetadataCache for ParquetMetadata format as well.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/


Modify your CI run:

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

namespace DB
{

ParquetFileMetaDataCache::ParquetFileMetaDataCache()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks too silly

Copy link
Collaborator

@zvonand zvonand Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this singleton here? Can't the cache be just a member somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used by two classes: ParquetBlockInputFormat and ParquetMetadataInputFormat

}
}

static std::shared_ptr<parquet::FileMetaData> getFileMetadata(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two implementations of getFileMetadata. One for ParquetBlockInputFormat.cpp and one for ParquetMetadataInputFormat. I am still not sure if I should leave it duplicated or put it somewhere.. The logic is pretty much the same, but requires a couple of arguments to be passed

Copy link
Collaborator

@zvonand zvonand Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see a lot of duplication (as code is really a bit different). But it depends on whether we are planning to keep this here or move it to upstream later.

If we are keeping it here, I would avoid unnecessary refactoring in this case -- this code does not look like it will change a lot, but changing code in more places may cause additional conflicts

@arthurpassos arthurpassos force-pushed the project-antalya-24.12.2_metadata_cache_fixes_1 branch from 175a172 to cb8d032 Compare February 18, 2025 19:18
@altinity-robot
Copy link
Collaborator

altinity-robot commented Feb 18, 2025

This is an automated comment for commit 8abf0b1 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Regression aarch64 S3 aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression aarch64 S3 azureThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression aarch64 S3 gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression release S3 aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression release S3 azureThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Regression release S3 gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
Sign aarch64There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ error
Sign releaseThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ error
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure
Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Ready for releaseThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter attach partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter move partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Alter replace partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Benchmark minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Clickhouse Keeper SSLThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP authenticationThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP external_user_directoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 LDAP role_mappingThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Parquet aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Parquet minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 ParquetThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 S3 minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage s3amazonThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 Tiered Storage s3gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 aes_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 aggregate_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 atomic_insertThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 base_58There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 clickhouse_keeperThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 datetime64_extended_rangeThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 disk_level_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 dnsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 enginesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 exampleThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 extended_precision_data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 kafkaThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 kerberosThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 key_valueThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 lightweight_deleteThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 memoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 part_moves_between_shardsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 rbacThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 selectsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 session_timezoneThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 ssl_serverThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 tiered_storageThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression aarch64 window_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Alter attach partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Alter move partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Alter replace partitionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Benchmark aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Benchmark gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Benchmark minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Clickhouse Keeper SSLThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release LDAP authenticationThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release LDAP external_user_directoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release LDAP role_mappingThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Parquet aws_s3There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Parquet minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release ParquetThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release S3 minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Tiered Storage minioThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Tiered Storage s3amazonThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release Tiered Storage s3gcsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release aes_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release aggregate_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release atomic_insertThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release base_58There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release clickhouse_keeperThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release datetime64_extended_rangeThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release disk_level_encryptionThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release dnsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release enginesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release exampleThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release extended_precision_data_typesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release kafkaThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release kerberosThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release key_valueThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release lightweight_deleteThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release memoryThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release part_moves_between_shardsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release rbacThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release selectsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release session_timezoneThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release ssl_serverThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release tiered_storageThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Regression release window_functionsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success

@Enmk Enmk changed the base branch from project-antalya-24.12.2 to antalya February 19, 2025 21:29
@arthurpassos
Copy link
Collaborator Author

ParquetSchemaReader also calls parquet::ReadMetaData, but ClickHouse already has schema cache in place, so I don't think it is needed to add the caching logic there.

One thing to note is that ParquetSchemaReader reads the metadata before reaching ParquetBlockInputFormat.

Therefore, we have the following cases:

when reading a single file: it would be benefitial to add the caching logic there as well, because later on ParquetBlockInputFormat wouldn't need to make the call. On the other hand, once both caches are warmed up, none of these two calls will be made. So maybe it's not worth it.

when reading multiple files: wouldn't make a big difference since that code is meant to read the schema, and the schema for a single file suffices for the whole bunch.

Copy link
Collaborator

@zvonand zvonand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Enmk Enmk merged commit dc3ad7f into antalya Mar 1, 2025
289 of 338 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants