33 changes: 32 additions & 1 deletion docs/en/engines/table-engines/integrations/azureBlobStorage.md
@@ -14,7 +14,7 @@ This engine provides an integration with [Azure Blob Storage](https://azure.micr

```sql
CREATE TABLE azure_blob_storage_table (name String, value UInt32)
ENGINE = AzureBlobStorage(connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression])
ENGINE = AzureBlobStorage(connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression, partition_strategy, partition_columns_in_data_file])
[PARTITION BY expr]
[SETTINGS ...]
```
@@ -30,6 +30,8 @@ CREATE TABLE azure_blob_storage_table (name String, value UInt32)
- `account_key` - if storage_account_url is used, then account key can be specified here
- `format` — The [format](/interfaces/formats.md) of the file.
- `compression` — Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. By default, compression is auto-detected from the file extension (the same as setting it to `auto`).
- `partition_strategy` – Options: `WILDCARD` or `HIVE`. `WILDCARD` requires `{_partition_id}` in the path, which is replaced with the partition key. `HIVE` does not allow wildcards, assumes the path is the table root, and generates Hive-style partitioned directories with Snowflake IDs as filenames and the file format as the extension. Defaults to `WILDCARD`.
- `partition_columns_in_data_file` - Only used with the `HIVE` partition strategy. Tells ClickHouse whether to expect partition columns to be written in the data file. Defaults to `false`.

**Example**

@@ -96,6 +98,35 @@ SETTINGS filesystem_cache_name = 'cache_for_azure', enable_filesystem_cache = 1;

2. Reuse cache configuration (and therefore cache storage) from the ClickHouse `storage_configuration` section, [described here](/operations/storing-data.md/#using-local-cache)

### PARTITION BY {#partition-by}

`PARTITION BY` — Optional. In most cases you don't need a partition key, and when one is needed, it rarely needs to be more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use overly granular partitioning. Don't partition your data by client identifiers or names (instead, make the client identifier or name the first column in the ORDER BY expression).

For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.

#### Partition strategy {#partition-strategy}

`WILDCARD` (default): Replaces the `{_partition_id}` wildcard in the file path with the actual partition key. Reading is not supported.
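
A minimal sketch of the `WILDCARD` strategy, assuming the same local Azurite endpoint, container, and account key as the `HIVE` example below; the table name and blob path are hypothetical. The `{_partition_id}` token in `blob_path` is replaced with the partition key on insert:

```sql
CREATE TABLE azure_wildcard_table (event_date Date, value UInt32)
ENGINE = AzureBlobStorage(
    storage_account_url = 'http://localhost:30000/devstoreaccount1',
    container = 'cont',
    blob_path = 'wildcard_partitioned/{_partition_id}.parquet',
    account_name = 'devstoreaccount1',
    account_key = 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    format = 'Parquet',
    partition_strategy = 'wildcard')
PARTITION BY toYYYYMM(event_date);

-- Each partition is written to its own file, e.g. wildcard_partitioned/202401.parquet
INSERT INTO azure_wildcard_table VALUES ('2024-01-15', 1), ('2024-02-20', 2);
```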

`HIVE` implements Hive-style partitioning for reads and writes. Reading is implemented using a recursive glob pattern. Writing generates files in the following format: `<prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>`.

Note: When using the `HIVE` partition strategy, the `use_hive_partitioning` setting has no effect.

Example of `HIVE` partition strategy:

```sql
CREATE TABLE azure_table (year UInt16, country String, counter UInt8)
ENGINE = AzureBlobStorage(account_name='devstoreaccount1', account_key='Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==', storage_account_url = 'http://localhost:30000/devstoreaccount1', container='cont', blob_path='hive_partitioned', format='Parquet', compression='auto', partition_strategy='hive')
PARTITION BY (year, country);

INSERT INTO azure_table VALUES (2020, 'Russia', 1), (2021, 'Brazil', 2);

SELECT _path, * FROM azure_table;

┌─_path──────────────────────────────────────────────────────────────────────┬─year─┬─country─┬─counter─┐
1. │ cont/hive_partitioned/year=2020/country=Russia/7351305360873664512.parquet │ 2020 │ Russia │ 1 │
2. │ cont/hive_partitioned/year=2021/country=Brazil/7351305360894636032.parquet │ 2021 │ Brazil │ 2 │
└────────────────────────────────────────────────────────────────────────────┴──────┴─────────┴─────────┘
```
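
Because reads use a recursive glob over the table root, the same files can also be queried with the [azureBlobStorage](/sql-reference/table-functions/azureBlobStorage) table function. A sketch assuming the same Azurite endpoint and credentials as above:

```sql
-- Count rows across all Hive-style partition directories under the table root
SELECT count()
FROM azureBlobStorage(
    'http://localhost:30000/devstoreaccount1',
    'cont',
    'hive_partitioned/**.parquet',
    'devstoreaccount1',
    'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==',
    'Parquet');
```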

## See also {#see-also}

[Azure Blob Storage Table Function](/sql-reference/table-functions/azureBlobStorage)
50 changes: 49 additions & 1 deletion docs/en/engines/table-engines/integrations/s3.md
@@ -34,7 +34,7 @@ SELECT * FROM s3_engine_table LIMIT 2;

```sql
CREATE TABLE s3_engine_table (name String, value UInt32)
ENGINE = S3(path [, NOSIGN | aws_access_key_id, aws_secret_access_key,] format, [compression])
ENGINE = S3(path [, NOSIGN | aws_access_key_id, aws_secret_access_key,] format, [compression], [partition_strategy], [partition_columns_in_data_file])
[PARTITION BY expr]
[SETTINGS ...]
```
@@ -46,6 +46,8 @@ CREATE TABLE s3_engine_table (name String, value UInt32)
- `format` — The [format](/sql-reference/formats#formats-overview) of the file.
- `aws_access_key_id`, `aws_secret_access_key` - Long-term credentials for the [AWS](https://aws.amazon.com/) account user. You can use these to authenticate your requests. This parameter is optional. If credentials are not specified, they are taken from the configuration file. For more information see [Using S3 for Data Storage](../mergetree-family/mergetree.md#table_engine-mergetree-s3).
- `compression` — Compression type. Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. Parameter is optional. By default, it will auto-detect compression by file extension.
- `partition_strategy` – Options: `WILDCARD` or `HIVE`. `WILDCARD` requires `{_partition_id}` in the path, which is replaced with the partition key. `HIVE` does not allow wildcards, assumes the path is the table root, and generates Hive-style partitioned directories with Snowflake IDs as filenames and the file format as the extension. Defaults to `WILDCARD`.
- `partition_columns_in_data_file` - Only used with the `HIVE` partition strategy. Tells ClickHouse whether to expect partition columns to be written in the data file. Defaults to `false`.

### Data cache {#data-cache}

@@ -84,6 +86,52 @@ There are two ways to define cache in configuration file.

For partitioning by month, use the `toYYYYMM(date_column)` expression, where `date_column` is a column with a date of the type [Date](/sql-reference/data-types/date.md). The partition names here have the `"YYYYMM"` format.

#### Partition strategy {#partition-strategy}

`WILDCARD` (default): Replaces the `{_partition_id}` wildcard in the file path with the actual partition key. Reading is not supported.
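
A minimal sketch of the `WILDCARD` strategy, assuming a hypothetical MinIO endpoint and credentials; the `{_partition_id}` token in the path is substituted with the partition key on insert, and the resulting files are read back individually (for example via the `s3` table function), since the table itself cannot be read:

```sql
CREATE TABLE s3_wildcard_table (year UInt16, country String, counter UInt8)
ENGINE = S3('http://minio:10000/clickhouse/wildcard/year={_partition_id}/data.csv',
            'minioadmin', 'minioadminpassword', 'CSV')
PARTITION BY year;

-- Each distinct partition key produces its own file
INSERT INTO s3_wildcard_table VALUES (2022, 'USA', 1), (2023, 'Mexico', 4);

-- Reads go against a concrete partition file rather than the table
SELECT * FROM s3('http://minio:10000/clickhouse/wildcard/year=2022/data.csv',
                 'minioadmin', 'minioadminpassword', 'CSV');
```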

`HIVE` implements Hive-style partitioning for reads and writes. Reading is implemented using a recursive glob pattern; it is equivalent to `SELECT * FROM s3('table_root/**.parquet')`.
Writing generates files in the following format: `<prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>`.

Note: When using the `HIVE` partition strategy, the `use_hive_partitioning` setting has no effect.

Example of `HIVE` partition strategy:

```sql
CREATE TABLE t_03363_parquet (year UInt16, country String, counter UInt8)
ENGINE = S3(s3_conn, filename = 't_03363_parquet', format = Parquet, partition_strategy='hive')
PARTITION BY (year, country);

INSERT INTO t_03363_parquet VALUES
    (2022, 'USA', 1),
    (2022, 'Canada', 2),
    (2023, 'USA', 3),
    (2023, 'Mexico', 4),
    (2024, 'France', 5),
    (2024, 'Germany', 6),
    (2024, 'Germany', 7),
    (1999, 'Brazil', 8),
    (2100, 'Japan', 9),
    (2024, 'CN', 10),
    (2025, '', 11);

SELECT _path, * FROM t_03363_parquet;

┌─_path──────────────────────────────────────────────────────────────────────┬─year─┬─country─┬─counter─┐
1. │ test/t_03363_parquet/year=2100/country=Japan/7329604473272971264.parquet │ 2100 │ Japan │ 9 │
2. │ test/t_03363_parquet/year=2024/country=France/7329604473323302912.parquet │ 2024 │ France │ 5 │
3. │ test/t_03363_parquet/year=2022/country=Canada/7329604473314914304.parquet │ 2022 │ Canada │ 2 │
4. │ test/t_03363_parquet/year=1999/country=Brazil/7329604473289748480.parquet │ 1999 │ Brazil │ 8 │
5. │ test/t_03363_parquet/year=2023/country=Mexico/7329604473293942784.parquet │ 2023 │ Mexico │ 4 │
6. │ test/t_03363_parquet/year=2023/country=USA/7329604473319108608.parquet │ 2023 │ USA │ 3 │
7. │ test/t_03363_parquet/year=2025/country=/7329604473327497216.parquet │ 2025 │ │ 11 │
8. │ test/t_03363_parquet/year=2024/country=CN/7329604473310720000.parquet │ 2024 │ CN │ 10 │
9. │ test/t_03363_parquet/year=2022/country=USA/7329604473298137088.parquet │ 2022 │ USA │ 1 │
10. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │ 6 │
11. │ test/t_03363_parquet/year=2024/country=Germany/7329604473306525696.parquet │ 2024 │ Germany │ 7 │
└────────────────────────────────────────────────────────────────────────────┴──────┴─────────┴─────────┘
```
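
A sketch of `partition_columns_in_data_file`, assuming the same `s3_conn` named collection as in the example above; the table name is hypothetical. With the default of `false`, the partition values live only in the Hive-style path; setting it to `true` writes `year` and `country` into each Parquet file as well:

```sql
CREATE TABLE t_hive_with_partition_columns (year UInt16, country String, counter UInt8)
ENGINE = S3(s3_conn, filename = 't_hive_with_partition_columns', format = Parquet,
            partition_strategy = 'hive', partition_columns_in_data_file = true)
PARTITION BY (year, country);

-- Partition values appear twice: in the year=/country= path and inside the Parquet file
INSERT INTO t_hive_with_partition_columns VALUES (2024, 'Germany', 6);
```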

### Querying partitioned data {#querying-partitioned-data}

This example uses the [docker compose recipe](https://github.com/ClickHouse/examples/tree/5fdc6ff72f4e5137e23ea075c88d3f44b0202490/docker-compose-recipes/recipes/ch-and-minio-S3), which integrates ClickHouse and MinIO. You should be able to reproduce the same queries using S3 by replacing the endpoint and authentication values.