read_delta / scan_delta drop Enum dtype despite _PL_ENUM_VALUES2 field metadata being preserved through Delta #27515

@uditrana

Description

Polars version

1.40.1 (with deltalake==1.5.1)

Summary

When a DataFrame with a pl.Enum column is written to a Delta table and read back via pl.read_delta / pl.scan_delta, the Enum dtype is downgraded to String. The interesting part is that the underlying _PL_ENUM_VALUES2 field metadata that Polars uses to encode Enum categories is preserved at every layer: it survives in the parquet files inside the Delta directory, and DeltaTable(uri).to_pyarrow_table() returns it on the schema. So the data and metadata round-trip cleanly through Delta storage; it's specifically the Polars-side Delta read path that fails to surface it.

(Filing this against the read side. The known write_delta panic on Enum dtypes — cannot downcast Utf8View dictionary value to byte array — is a separate issue; the reproduction below works around it by casting Enum→String at the Arrow layer with the _PL_ENUM_VALUES2 metadata kept on the field.)

Reproduction

import polars as pl
import pyarrow as pa
from deltalake import write_deltalake, DeltaTable

cats = ["apple", "banana", "cherry"]
df = pl.DataFrame({"fruit": ["apple", "banana", "apple"]}).with_columns(
    pl.col("fruit").cast(pl.Enum(cats))
)
print("Source schema:", df.schema)
# => Schema({'fruit': Enum(categories=['apple', 'banana', 'cherry'])})

# Cast Enum -> string at the Arrow layer, preserving _PL_ENUM_VALUES2 metadata
# (works around the deltalake-rs writer panic on Enum-typed Arrow columns)
arr = df.to_arrow()
field_meta = arr.schema.field("fruit").metadata
arr_str = arr.cast(pa.schema([
    pa.field("fruit", pa.string(), metadata=field_meta),
]))

uri = "/tmp/fruit_delta"
write_deltalake(uri, arr_str)

# Metadata is preserved through Delta storage:
arr_back = DeltaTable(uri).to_pyarrow_table()
print(arr_back.schema.field("fruit").metadata)
# => {b'_PL_ENUM_VALUES2': b'5;apple6;banana6;cherry'}

# Polars from_arrow with dictionary type would reconstruct, but DeltaTable
# returns plain `string` (delta-rs unwraps dictionary types on read), and
# Polars's read paths return String:
print(pl.read_delta(uri).schema)
# => Schema({'fruit': String})

print(pl.scan_delta(uri).collect_schema())
# => Schema({'fruit': String})

Expected

The Enum dtype to be reconstructed from the _PL_ENUM_VALUES2 field metadata, returning Schema({'fruit': Enum([...])}) from both read_delta and scan_delta.

Workarounds we're using

  • Eager: pl.from_arrow(DeltaTable(uri).to_pyarrow_table()) plus a manual with_columns([pl.col(c).cast(pl.Enum(cats)) for ...]) step where we carry our own JSON-serialized cats list in a custom field metadata key. Works but bypasses read_delta.
  • Lazy: enumerate parquet files via DeltaTable(uri).file_uris(), peek at field metadata of one file, then pl.scan_parquet(files).with_columns([cast to Enum]). Works for filter pushdown and the streaming engine but bypasses scan_delta.

Both work because the metadata is genuinely there on disk; the wrappers just plumb it back into a Polars Enum dtype.

Possible directions (very much suggestions from outside, please apply your judgement)

Two thoughts that seem tractable from the outside, but you'll have a much better view of the right fix:

  1. Maybe pl.from_arrow could reconstruct Enum from string|large_string + _PL_ENUM_VALUES2 field metadata (it already handles dictionary<...> + _PL_ENUM_VALUES2). This would help anywhere an Arrow Table comes back as plain string with the metadata intact — Delta is one such case since delta-rs always unwraps dictionary types on read, but other Arrow-aware sources have similar shapes.
  2. Or read_delta / scan_delta could peek at parquet field metadata at schema-resolution time and surface Enum dtypes in the resolved schema.

Both directions feel backward-compatible (columns without _PL_ENUM_VALUES2 would stay String). Happy to be told one or both is wrong — there are likely subtleties around schema unification across files / partition spec / etc. that we don't see from the wrapper side.
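For illustration, direction 1 only needs the categories recovered from the metadata bytes. Here is a hypothetical decoder for the length-prefixed layout observed in the repro output (b'5;apple6;banana6;cherry', i.e. "<byte-length>;<value>" repeated); the real encoding is a Polars internal and may differ in edge cases:

```python
def decode_pl_enum_values2(raw: bytes) -> list[str]:
    # Assumed layout: each category is encoded as "<byte-length>;<utf-8 value>",
    # concatenated with no separator between entries.
    cats: list[str] = []
    i = 0
    while i < len(raw):
        sep = raw.index(b";", i)
        n = int(raw[i:sep])  # byte length of the next category value
        cats.append(raw[sep + 1 : sep + 1 + n].decode("utf-8"))
        i = sep + 1 + n
    return cats
```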

Why this matters for us

We're using Polars Enum to represent fixed-category columns end-to-end (preprocessing → cache → training → inference), and the parquet preservation on the local cache works great. The Delta gap means we either keep category lists in a sidecar (workable) or maintain the wrapper. Native support would be a small but meaningful ergonomic win, and given that the metadata is already preserved on disk, the Delta read path appears to be the only place the dtype is lost.

Thanks for the work on the categoricals refactor in 1.32 — the new design is great to use.
