Summary
When using Lance as the Hudi base file format and reading a nested struct in a way that prunes some of its children (e.g. selecting only image_bytes.reference.offset and image_bytes.reference.length from a struct that also has external_path and managed), the Lance vectorized reader throws UnsupportedOperationException from ArrowVectorAccessor.getLong.
The bug is upstream, in lance-format/lance-spark's LanceArrowColumnVector; Hudi's LanceRecordIterator is just the caller. Filing this issue to track the impact on the Hudi side and to decide whether we want a temporary mitigation until upstream lands a fix.
Upstream issue
lance-format/lance-spark#499
Repro from a Hudi context
Hudi 1.2.0-rc1 + lance-spark-bundle-3.5_2.12 0.4.0, Hudi table with 'hoodie.table.base.file.format' = 'lance' and a BLOB column. The descriptor read from BatchedBlobReader works fine because it projects the full struct; a user-written query that prunes nested children fails.
Failing:
SELECT image_bytes.reference.offset,
image_bytes.reference.length
FROM hudi_lance_table
Working:
SELECT image_bytes.reference.external_path,
image_bytes.reference.offset,
image_bytes.reference.length,
image_bytes.reference.managed
FROM hudi_lance_table
Stack trace (relevant frames)
java.lang.UnsupportedOperationException
at org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getLong(ArrowColumnVector.java:238)
at org.apache.spark.sql.vectorized.ArrowColumnVector.getLong(ArrowColumnVector.java:90)
at org.lance.spark.vectorized.LanceArrowColumnVector.getLong(LanceArrowColumnVector.java:310)
at org.apache.spark.sql.vectorized.ColumnarRow.getLong(ColumnarRow.java:116)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.hudi.io.storage.LanceRecordIterator.next(LanceRecordIterator.java:162)
Possible Hudi-side actions (to discuss)
- Wait for upstream: bump lance-spark once lance-format/lance-spark#499 is fixed and released. Lowest effort, no Hudi changes.
- Documentation: add a "Known issues" note in the Lance integration docs so users hit it less.
- Workaround in LanceRecordIterator: force full nested-struct projection when binding Lance vectors so partial pruning never reaches LanceArrowColumnVector. Higher effort and may regress read perf on wide structs; only worth it if upstream stalls.
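To make the third option concrete, here is a minimal sketch of the widening logic, written in Python for brevity. It is not Hudi or lance-spark code: schemas are modelled as plain dicts, widen_pruned_structs is a hypothetical helper, and the child types in FULL_SCHEMA are assumptions (only offset and length are known to be read via getLong).

```python
# Hypothetical model: field name -> type string, or a nested dict for a struct.
# The child types below are assumed for illustration only.
FULL_SCHEMA = {
    "image_bytes": {
        "reference": {
            "external_path": "string",
            "offset": "long",
            "length": "long",
            "managed": "boolean",
        }
    }
}


def widen_pruned_structs(requested, full):
    """Return `requested` with every partially pruned struct replaced by its full definition."""
    widened = {}
    for name, req_type in requested.items():
        full_type = full[name]
        if isinstance(req_type, dict) and isinstance(full_type, dict):
            if set(req_type) != set(full_type):
                # Some children were pruned: fall back to the full struct.
                widened[name] = full_type
            else:
                # Same children at this level: check nested structs for pruning.
                widened[name] = widen_pruned_structs(req_type, full_type)
        else:
            # Primitive field: keep as requested.
            widened[name] = req_type
    return widened


# The pruned schema that the failing query would request.
pruned = {"image_bytes": {"reference": {"offset": "long", "length": "long"}}}
widened = widen_pruned_structs(pruned, FULL_SCHEMA)
```

The cost noted above is visible here: every child of a pruned struct is read, even when the query needs only two of them.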
Environment
- Apache Hudi 1.2.0-rc1
- Spark 3.5, Scala 2.12
- lance-spark-bundle-3.5_2.12 0.4.0
- macOS, JDK 11
Notes
Discovered while building demo assertions in hudi-examples/hudi-examples-spark/src/test/python/vector_blob_demo/hudi_blob_reader_demo.py. The demo's assert_descriptors() step now projects all four reference children as a workaround.
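The query-level workaround can be sketched as follows. This is not the demo's actual code: full_reference_select is a hypothetical helper, and the table and column names are taken from the repro above. It only builds the SQL that projects all four reference children.

```python
# The four children of the reference struct named in this issue.
REFERENCE_CHILDREN = ["external_path", "offset", "length", "managed"]


def full_reference_select(table, column="image_bytes", children=REFERENCE_CHILDREN):
    """Build a SELECT that projects every child of <column>.reference (hypothetical helper)."""
    cols = ",\n       ".join(f"{column}.reference.{child}" for child in children)
    return f"SELECT {cols}\nFROM {table}"


query = full_reference_select("hudi_lance_table")

# Usage with an existing SparkSession (not executed here):
#   rows = spark.sql(query).collect()
#   offsets = [(row["offset"], row["length"]) for row in rows]
```

Narrowing to offset and length is done on the collected rows. Adding a further DataFrame select on top of the query may let the optimizer prune the struct again, which would likely bring the failure back.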