[Lance] Vectorized reads of nested struct sub-fields fail with partial projection (lance-spark bug) #18681

@rahil-c

Summary

When using Lance as the Hudi base file format and reading a nested struct in a way that prunes some of its children (e.g. selecting only image_bytes.reference.offset and image_bytes.reference.length from a struct that also has external_path and managed), the Lance vectorized reader throws UnsupportedOperationException from ArrowVectorAccessor.getLong.

The bug is upstream, in LanceArrowColumnVector from lance-format/lance-spark; Hudi's LanceRecordIterator is only the caller. This issue tracks the impact on the Hudi side and whether we want a temporary mitigation until upstream ships a fix.

Upstream issue

lance-format/lance-spark#499

Repro from a Hudi context

Setup: Hudi 1.2.0-rc1 with lance-spark-bundle-3.5_2.12 0.4.0, on a Hudi table with 'hoodie.table.base.file.format' = 'lance' and a BLOB column. The descriptor read from BatchedBlobReader works because it projects the full struct; a user-written query that prunes nested children fails.

Failing:

SELECT image_bytes.reference.offset,
       image_bytes.reference.length
FROM hudi_lance_table

Working:

SELECT image_bytes.reference.external_path,
       image_bytes.reference.offset,
       image_bytes.reference.length,
       image_bytes.reference.managed
FROM hudi_lance_table
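
Until a fix is released, callers can avoid the failing path by always projecting every child of the struct, as the working query does. A minimal sketch of a query builder that does this, assuming the four child names listed in this issue; `REFERENCE_CHILDREN` and `full_reference_projection` are hypothetical names for illustration, not part of Hudi or lance-spark:

```python
# Hypothetical helper, not part of Hudi or lance-spark.
# The child names come from the struct described in this issue.
REFERENCE_CHILDREN = ("external_path", "offset", "length", "managed")


def full_reference_projection(table, struct_path="image_bytes.reference",
                              children=REFERENCE_CHILDREN):
    """Build a SELECT that projects every child of the struct, so the
    Lance reader is never handed a partially pruned struct."""
    columns = ",\n       ".join(f"{struct_path}.{child}" for child in children)
    return f"SELECT {columns}\nFROM {table}"


if __name__ == "__main__":
    # Prints a query equivalent to the "Working" example above.
    print(full_reference_projection("hudi_lance_table"))
```

Unneeded columns can then be ignored on the client side. Dropping them in a later Spark projection may let the optimizer prune the struct again.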

Stack trace (relevant frames)

java.lang.UnsupportedOperationException
    at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getLong(ArrowColumnVector.java:238)
    at org.apache.spark.sql.vectorized.ArrowColumnVector.getLong(ArrowColumnVector.java:90)
    at org.lance.spark.vectorized.LanceArrowColumnVector.getLong(LanceArrowColumnVector.java:310)
    at org.apache.spark.sql.vectorized.ColumnarRow.getLong(ColumnarRow.java:116)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.hudi.io.storage.LanceRecordIterator.next(LanceRecordIterator.java:162)

Possible Hudi-side actions (to discuss)

  1. Wait for upstream — bump lance-spark once lance-format/lance-spark#499 is fixed and released. Lowest effort, no Hudi changes.
  2. Documentation — add a "Known issues" note in the Lance integration docs so users hit it less.
  3. Workaround in LanceRecordIterator — force full nested-struct projection when binding Lance vectors so partial pruning never reaches LanceArrowColumnVector. Higher effort and may regress read perf on wide structs; only worth it if upstream stalls.
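
For discussion, option 3 amounts to widening the requested schema before it reaches the Lance reader. The sketch below shows the idea on plain Python dicts rather than Spark's StructType. It is not Hudi code: `widen_pruned_structs` is a hypothetical name, and the types of `external_path` and `managed` are assumptions (only `offset` and `length` are known to be read via getLong).

```python
import copy


def widen_pruned_structs(requested, full):
    """Return a copy of `requested` in which every struct-typed top-level
    column is replaced by its complete definition from `full`.

    Schemas are modeled as dicts: a leaf maps to a type name (str),
    a struct maps to a nested dict of its children.
    """
    widened = {}
    for name, dtype in requested.items():
        if isinstance(dtype, dict):
            # Struct column: ignore the pruned shape and take the full one.
            widened[name] = copy.deepcopy(full[name])
        else:
            # Leaf column: keep as requested.
            widened[name] = dtype
    return widened


# Illustrative schema; the string/boolean types are assumptions.
FULL_SCHEMA = {
    "id": "long",
    "image_bytes": {
        "reference": {
            "external_path": "string",
            "offset": "long",
            "length": "long",
            "managed": "boolean",
        }
    },
}

# Shape produced by nested-column pruning for the failing query.
PRUNED_REQUEST = {
    "image_bytes": {"reference": {"offset": "long", "length": "long"}}
}

WIDENED = widen_pruned_structs(PRUNED_REQUEST, FULL_SCHEMA)
```

The cost noted in option 3 is visible here: every child of the struct is read even when only two are needed, which is the possible read-performance regression on wide structs.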

Environment

  • Apache Hudi 1.2.0-rc1
  • Spark 3.5, Scala 2.12
  • lance-spark-bundle-3.5_2.12 0.4.0
  • macOS, JDK 11

Notes

Discovered while building demo assertions in hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py. The demo's assert_descriptors() step now projects all four reference children as a workaround.
