`read_delta` / `scan_delta` drop Enum dtype despite metadata being preserved end-to-end
Polars version
1.40.1 (with deltalake==1.5.1)
Summary
When a DataFrame with a `pl.Enum` column is written to a Delta table and read back via `pl.read_delta` / `pl.scan_delta`, the Enum dtype is downgraded to `String`. The interesting bit is that the underlying `_PL_ENUM_VALUES2` field metadata that Polars uses to encode Enum cats is preserved at every layer: it survives in the parquet files inside the Delta directory, and `DeltaTable(uri).to_pyarrow_table()` returns it on the schema. So the data and metadata round-trip cleanly through Delta storage; it's specifically the Polars-side Delta read path that doesn't surface it.
(Filing this against the read side. The known `write_delta` panic on Enum dtypes, `cannot downcast Utf8View dictionary value to byte array`, is a separate issue; the reproduction below works around it by casting Enum→String at the Arrow layer with the `_PL_ENUM_VALUES2` metadata kept on the field.)
Reproduction
```python
import polars as pl
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

cats = ["apple", "banana", "cherry"]
df = pl.DataFrame({"fruit": ["apple", "banana", "apple"]}).with_columns(
    pl.col("fruit").cast(pl.Enum(cats))
)
print("Source schema:", df.schema)
# => Schema({'fruit': Enum(categories=['apple', 'banana', 'cherry'])})

# Cast Enum -> string at the Arrow layer, preserving _PL_ENUM_VALUES2 metadata
# (works around the deltalake-rs writer panic on Enum-typed Arrow columns)
arr = df.to_arrow()
field_meta = arr.schema.field("fruit").metadata
arr_str = arr.cast(pa.schema([
    pa.field("fruit", pa.string(), metadata=field_meta),
]))

uri = "/tmp/fruit_delta"
write_deltalake(uri, arr_str)

# Metadata is preserved through Delta storage:
arr_back = DeltaTable(uri).to_pyarrow_table()
print(arr_back.schema.field("fruit").metadata)
# => {b'_PL_ENUM_VALUES2': b'5;apple6;banana6;cherry'}

# Polars from_arrow with a dictionary type would reconstruct the Enum, but
# DeltaTable returns plain `string` (delta-rs unwraps dictionary types on
# read), and Polars's read paths return String:
print(pl.read_delta(uri).schema)
# => Schema({'fruit': String})
print(pl.scan_delta(uri).collect_schema())
# => Schema({'fruit': String})
```
Expected
The Enum dtype should be reconstructed from the `_PL_ENUM_VALUES2` field metadata, returning `Schema({'fruit': Enum([...])})` from both `read_delta` and `scan_delta`.
Workarounds we're using
- Eager: `pl.from_arrow(DeltaTable(uri).to_pyarrow_table())` plus a manual `with_columns([pl.col(c).cast(pl.Enum(cats)) for ...])` step, where we carry our own JSON-serialized cats list in a custom field metadata key. Works but bypasses `read_delta`.
- Lazy: enumerate parquet files via `DeltaTable(uri).file_uris()`, peek at the field metadata of one file, then `pl.scan_parquet(files).with_columns([cast to Enum])`. Works for filter pushdown and the streaming engine but bypasses `scan_delta`.
Both work because the metadata is genuinely there on disk; the wrappers just plumb it back into a Polars Enum dtype.
Possible directions (very much suggestions from outside, please apply your judgement)
Two thoughts that seem tractable from the outside, but you'll have a much better view of the right fix:
- Maybe `pl.from_arrow` could reconstruct Enum from `string`/`large_string` + `_PL_ENUM_VALUES2` field metadata (it already handles `dictionary<...>` + `_PL_ENUM_VALUES2`). This would help anywhere an Arrow Table comes back as plain string with the metadata intact; Delta is one such case since delta-rs always unwraps dictionary types on read, but other Arrow-aware sources have similar shapes.
- Or `read_delta` / `scan_delta` could peek at parquet field metadata at schema-resolution time and surface Enum dtypes in the resolved schema.
Both directions feel backward-compatible (columns without `_PL_ENUM_VALUES2` would stay `String`). Happy to be told one or both is wrong; there are likely subtleties around schema unification across files / partition spec / etc. that we don't see from the wrapper side.
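For what it's worth, the encoding in the observed bytes above looks like a simple length-prefixed concatenation, `<byte length>;<utf8 value>` repeated. The decoder below is inferred from that single example value, not from the Polars source, so treat the format as an assumption:

```python
def parse_pl_enum_values2(raw: bytes) -> list[str]:
    """Decode e.g. b'5;apple6;banana6;cherry' -> ['apple', 'banana', 'cherry'].

    Format inferred from one observed value; the real parser lives in Polars.
    """
    cats: list[str] = []
    i = 0
    while i < len(raw):
        sep = raw.index(b";", i)  # end of the decimal length prefix
        n = int(raw[i:sep])  # byte length of the next category
        cats.append(raw[sep + 1 : sep + 1 + n].decode("utf-8"))
        i = sep + 1 + n
    return cats
```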
Why this matters for us
We're using Polars Enum to represent fixed-cat columns end-to-end (preprocessing → cache → training → inference) and the parquet preservation on the local cache is great. The Delta gap means we either keep cats in a sidecar (workable) or maintain the wrapper. Native support would be a small but meaningful ergonomic win, and given the metadata is already preserved on disk, it feels like the Delta read path is the only place it leaks.
Thanks for the work on the categoricals refactor in 1.32 — the new design is great to use.