
Writing columns containing large binary blobs will result in huge, untruncated statistics headers & compression is ~100,000x worse than pyarrow #23498

@jonasdedden

Description

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars

# A single row containing a 16 MiB binary blob of repeated b"a" bytes.
df = polars.DataFrame({"foo": [b"a" * 1024 * 1024 * 16]})

# Write once with the native Polars writer, once with the pyarrow-backed writer.
df.write_parquet("test.parquet")
df.write_parquet("test_pyarrow.parquet", use_pyarrow=True)

Log output

Issue description

This snippet writes the same DataFrame, containing a single row with 16 MiB of binary data, to the file system twice: once with the native Polars writer and once via pyarrow.
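The metadata blow-up can also be quantified without parquet-viewer: per the Parquet format, every file ends with the Thrift-serialized FileMetaData, a 4-byte little-endian length of that metadata, and the magic bytes `PAR1`. A minimal stdlib sketch (demonstrated on a synthetic footer; on the repro above you would call it on `test.parquet` and `test_pyarrow.parquet`):

```python
import os
import struct
import tempfile

def parquet_footer_metadata_len(path):
    """Length in bytes of the Thrift-encoded file metadata.

    A Parquet file ends with: FileMetaData, a 4-byte little-endian
    length of that metadata, and the magic bytes b"PAR1".
    """
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        length_bytes, magic = f.read(4), f.read(4)
    if magic != b"PAR1":
        raise ValueError(f"{path} is not a Parquet file")
    return struct.unpack("<I", length_bytes)[0]

# Synthetic file with a 42-byte "metadata" section, just to show the layout.
fake_meta = b"\x00" * 42
with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as f:
    f.write(b"PAR1" + fake_meta + struct.pack("<I", len(fake_meta)) + b"PAR1")
print(parquet_footer_metadata_len(f.name))  # 42
os.unlink(f.name)
```

Run against the two files from the repro, this should reproduce the 342 Bytes vs. 32 MiB difference shown above.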

The two files analysed with https://github.com/XiangpengHao/parquet-viewer:

[screenshot: polars output]
[screenshot: pyarrow output]

Differences:

|          | File size   | Metadata size | Compression % | Uncompressed | Compressed row groups |
|----------|-------------|---------------|---------------|--------------|-----------------------|
| pyarrow  | 944 Bytes   | 342 Bytes     | 0.00%         | 16 MiB       | 598 Bytes             |
| polars   | 128 MiB     | 32 MiB        | 66.67%        | 48 MiB       | 32 MiB                |
| relative | 142180x (!) | 98112x (!)    | -             | 3x           | 56111x (!)            |

Expected behavior

  • The individual row group should compress roughly 50,000x better; it appears the row is not compressed at all (598 Bytes with pyarrow vs. 32 MiB with Polars).
  • pyarrow does not seem to write statistics for this column (even with statistics=True). Even if Polars behaves differently here, it should truncate the statistics to ~128 bytes or so instead of writing the entire row, uncompressed, into the header.
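To put the expected ratio in perspective: even stock DEFLATE (weaker than the Snappy/Zstd codecs Parquet writers typically default to) shrinks this payload by roughly three orders of magnitude, so a 32 MiB "compressed" row group means the value was effectively stored uncompressed. A minimal stdlib sketch:

```python
import zlib

# The same payload as the repro: 16 MiB of repeated b"a" bytes.
payload = b"a" * (16 * 1024 * 1024)

compressed = zlib.compress(payload)
ratio = len(payload) / len(compressed)
print(f"{len(compressed)} bytes, ~{ratio:.0f}x")  # roughly 1000x even with plain DEFLATE
```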

This is extremely similar to an issue that was already solved in the upstream arrow-rs crate:
apache/arrow-rs#7555
apache/arrow-rs#7489

Installed versions

Details
>>> polars.show_versions()
--------Version info---------
Polars:              1.31.0
Index type:          UInt32
Platform:            Linux-6.15.3-1-MANJARO-x86_64-with-glibc2.41
Python:              3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.5
openpyxl             <not installed>
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              20.0.0
pydantic             2.11.4
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


    Labels

    A-io-parquet (Area: reading/writing Parquet files), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)
