Checks
Reproducible example
import polars
df = polars.DataFrame({"foo": [b"a" * 1024*1024*16]})
df.write_parquet("test.parquet")
df.write_parquet("test_pyarrow.parquet", use_pyarrow=True)
Log output
Issue description
This snippet writes the same dataframe, a single row holding 16 MiB of binary data, to the file system twice: once with the native polars writer and once with the pyarrow writer.
The two files analysed with https://github.com/XiangpengHao/parquet-viewer:
[Screenshot: polars file]

[Screenshot: pyarrow file]

Differences:
| | File Size | Metadata size | Compression % | Uncompressed | Compressed row groups |
|---|---|---|---|---|---|
| pyarrow | 944 Bytes | 342 Bytes | 0.00% | 16 MiB | 598 Bytes |
| polars | 128 MiB | 32 MiB | 66.67% | 48 MiB | 32 MiB |
| relative | 142180x (!) | 98112x (!) | - | 3x | 56111x (!) |
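For reference, the "relative" row can be recomputed from the two rows above it. This is only a sanity check of the arithmetic. It assumes the MiB figures shown by the viewer are exact, which they may not be.

```python
# Recompute the "relative" row (polars / pyarrow) from the reported sizes.
# Assumption: the MiB values are exact multiples of 1024 * 1024 bytes.
MIB = 1024 * 1024

# (polars bytes, pyarrow bytes) per column of the table
sizes = {
    "file size": (128 * MIB, 944),
    "metadata size": (32 * MIB, 342),
    "uncompressed": (48 * MIB, 16 * MIB),
    "compressed row groups": (32 * MIB, 598),
}

# Ratio rounded to the nearest integer, as in the table
ratios = {name: round(p / a) for name, (p, a) in sizes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```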
Expected behavior
- The individual row group compression should be ~50,000x better; it seems like the row itself is not compressed at all (598 bytes vs. 32 MiB).
- pyarrow doesn't seem to write statistics (even with statistics=True). Even if polars behaves differently here, it should truncate the statistics after 128 bytes or so, instead of writing the entire row uncompressed into the header.
This is extremely similar to an issue that was already solved with the base arrow-rs crate:
apache/arrow-rs#7555
apache/arrow-rs#7489
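To illustrate what "truncate the statistics after 128 bytes or so" could mean, here is a minimal sketch in plain Python. It is not the polars or arrow-rs implementation. The function names and the 128-byte default are hypothetical. It assumes min/max values are compared as unsigned bytes in lexicographic order, so a shortened min must still sort at or below the real value and a shortened max must still sort at or above it.

```python
from typing import Optional


def truncate_min(value: bytes, limit: int = 128) -> bytes:
    """Shorten a min statistic. Any prefix sorts <= the full value,
    so the prefix is still a valid lower bound."""
    return value[:limit]


def truncate_max(value: bytes, limit: int = 128) -> Optional[bytes]:
    """Shorten a max statistic. A bare prefix would sort below the full
    value, so the last usable byte is incremented to keep an upper bound.
    Returns None when no bound fits (the prefix is all 0xFF bytes)."""
    if len(value) <= limit:
        return value
    prefix = bytearray(value[:limit])
    # 0xFF cannot be incremented; drop trailing 0xFF bytes first.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None  # caller would omit the statistic entirely
    prefix[-1] += 1
    return bytes(prefix)


# The 16 MiB row from the reproducible example
row = b"a" * 1024 * 1024 * 16
lower = truncate_min(row)
upper = truncate_max(row)
print(len(lower), len(upper))
```

With bounds like these, the statistics for the example row would take 256 bytes in total rather than two full copies of the 16 MiB value, while still being usable for filtering.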
Installed versions
Details
>>> polars.show_versions()
--------Version info---------
Polars: 1.31.0
Index type: UInt32
Platform: Linux-6.15.3-1-MANJARO-x86_64-with-glibc2.41
Python: 3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ]
LTS CPU: False
----Optional dependencies----
Azure CLI <not installed>
adbc_driver_manager <not installed>
altair <not installed>
azure.identity <not installed>
boto3 <not installed>
cloudpickle <not installed>
connectorx <not installed>
deltalake <not installed>
fastexcel <not installed>
fsspec <not installed>
gevent <not installed>
google.auth <not installed>
great_tables <not installed>
matplotlib <not installed>
numpy 2.2.5
openpyxl <not installed>
pandas 2.2.3
polars_cloud <not installed>
pyarrow 20.0.0
pydantic 2.11.4
pyiceberg <not installed>
sqlalchemy <not installed>
torch <not installed>
xlsx2csv <not installed>
xlsxwriter <not installed>