read_delta / scan_delta drop Enum dtype despite _PL_ENUM_VALUES2 field metadata being preserved through Delta #27515

@uditrana

Description

Polars version

1.40.1 (with deltalake==1.5.1)

Summary

When a DataFrame with a pl.Enum column is written to a Delta table and read back via pl.read_delta / pl.scan_delta, the Enum dtype is downgraded to String. The interesting part is that the underlying _PL_ENUM_VALUES2 field metadata that Polars uses to encode Enum categories is preserved at every layer: it survives in the parquet files inside the Delta directory, and DeltaTable(uri).to_pyarrow_table() returns it on the schema. So the data and metadata round-trip cleanly through Delta storage; it's specifically the Polars-side Delta read path that fails to surface it.

(Filing this against the read side. The known write_delta panic on Enum dtypes — cannot downcast Utf8View dictionary value to byte array — is a separate issue; the reproduction below works around it by casting Enum→String at the Arrow layer with the _PL_ENUM_VALUES2 metadata kept on the field.)

Reproduction

import polars as pl
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

cats = ["apple", "banana", "cherry"]
df = pl.DataFrame({"fruit": ["apple", "banana", "apple"]}).with_columns(
    pl.col("fruit").cast(pl.Enum(cats))
)
print("Source schema:", df.schema)
# => Schema({'fruit': Enum(categories=['apple', 'banana', 'cherry'])})

# Cast Enum -> string at the Arrow layer, preserving _PL_ENUM_VALUES2 metadata
# (works around the deltalake-rs writer panic on Enum-typed Arrow columns)
arr = df.to_arrow()
field_meta = arr.schema.field("fruit").metadata
arr_str = arr.cast(pa.schema([
    pa.field("fruit", pa.string(), metadata=field_meta),
]))

uri = "/tmp/fruit_delta"
write_deltalake(uri, arr_str)

# Metadata is preserved through Delta storage:
arr_back = DeltaTable(uri).to_pyarrow_table()
print(arr_back.schema.field("fruit").metadata)
# => {b'_PL_ENUM_VALUES2': b'5;apple6;banana6;cherry'}

# Polars from_arrow with dictionary type would reconstruct, but DeltaTable
# returns plain `string` (delta-rs unwraps dictionary types on read), and
# Polars's read paths return String:
print(pl.read_delta(uri).schema)
# => Schema({'fruit': String})

print(pl.scan_delta(uri).collect_schema())
# => Schema({'fruit': String})

Expected

The Enum dtype to be reconstructed from the _PL_ENUM_VALUES2 field metadata, returning Schema({'fruit': Enum([...])}) from both read_delta and scan_delta.

Workarounds we're using

  • Eager: pl.from_arrow(DeltaTable(uri).to_pyarrow_table()) plus a manual with_columns([pl.col(c).cast(pl.Enum(cats)) for ...]) step where we carry our own JSON-serialized cats list in a custom field metadata key. Works but bypasses read_delta.
  • Lazy: enumerate parquet files via DeltaTable(uri).file_uris(), peek at field metadata of one file, then pl.scan_parquet(files).with_columns([cast to Enum]). Works for filter pushdown and the streaming engine but bypasses scan_delta.

Both work because the metadata is genuinely there on disk; the wrappers just plumb it back into a Polars Enum dtype.

Possible directions (very much suggestions from outside, please apply your judgement)

Two thoughts that seem tractable from the outside, but you'll have a much better view of the right fix:

  1. Maybe pl.from_arrow could reconstruct Enum from string|large_string + _PL_ENUM_VALUES2 field metadata (it already handles dictionary<...> + _PL_ENUM_VALUES2). This would help anywhere an Arrow Table comes back as plain string with the metadata intact — Delta is one such case since delta-rs always unwraps dictionary types on read, but other Arrow-aware sources have similar shapes.
  2. Or read_delta / scan_delta could peek at parquet field metadata at schema-resolution time and surface Enum dtypes in the resolved schema.

Both directions feel backward-compatible (columns without _PL_ENUM_VALUES2 would stay String). Happy to be told one or both is wrong — there are likely subtleties around schema unification across files / partition spec / etc. that we don't see from the wrapper side.
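For illustration, direction 1 only needs the categories recovered from the metadata bytes. Here is a hypothetical decoder for the length-prefixed layout observed in the repro output (b'5;apple6;banana6;cherry', i.e. "<byte-length>;<value>" repeated); the real encoding is a Polars internal and may differ in edge cases:

```python
def decode_pl_enum_values2(raw: bytes) -> list[str]:
    # Assumed layout: each category is encoded as "<byte-length>;<utf-8 value>",
    # concatenated with no separator between entries.
    cats: list[str] = []
    i = 0
    while i < len(raw):
        sep = raw.index(b";", i)
        n = int(raw[i:sep])  # byte length of the next category value
        cats.append(raw[sep + 1 : sep + 1 + n].decode("utf-8"))
        i = sep + 1 + n
    return cats
```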

Why this matters for us

We're using Polars Enum to represent fixed-category columns end-to-end (preprocessing → cache → training → inference), and the parquet preservation on the local cache works great. The Delta gap means we either keep category lists in a sidecar (workable) or maintain the wrapper. Native support would be a small but meaningful ergonomic win, and given that the metadata is already preserved on disk, the Delta read path appears to be the only place the dtype is lost.

Thanks for the work on the categoricals refactor in 1.32 — the new design is great to use.
