Bug Description
What happened:
When hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true is set on a COW table (or on any table queried with READ_OPTIMIZED), querying a partition set that includes one or more partitions with no base files throws NullPointerException deep inside Spark query planning, originating from BaseHoodieTableFileIndex.getInputFileSlices.
What you expected:
The query should plan and execute successfully. Empty partitions should appear in the result map with an empty file-slice list, matching the contract already honored by the non-RO path (filterFiles), which iterates over partitions and naturally returns an entry per partition.
Steps to reproduce:
- Create a COW table with multiple partitions, where at least one partition contains no base files (e.g., produced by clustering, cleaning, or simply never written).
- Enable hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true.
- Issue a Spark SQL query that prunes down to a partition set including the empty one.
- The driver throws NullPointerException during query optimization (a minimal repro sketch in Java follows these steps).
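For reference, a minimal repro sketch in Java, assuming a COW table at basePath partitioned by a column dt, where partition dt=2024-01-02 exists but contains no base files; the path, column name, and partition values are illustrative, not taken from this report:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EmptyPartitionNpeRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-ro-path-filter-npe-repro")
        .master("local[2]")
        .getOrCreate();

    // Hypothetical location of an existing COW table partitioned by `dt`,
    // in which partition dt=2024-01-02 holds no base files.
    String basePath = "file:///tmp/hudi/cow_table";

    Dataset<Row> df = spark.read()
        .format("hudi")
        // The trigger described in this report.
        .option("hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter", "true")
        .load(basePath);

    // Pruning to a partition set that includes the empty partition makes
    // BaseHoodieTableFileIndex.getInputFileSlices throw the NullPointerException
    // while Spark plans the query.
    df.filter("dt in ('2024-01-01', '2024-01-02')").count();
  }
}
```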
Root cause:
BaseHoodieTableFileIndex#loadFileSlicesForPartitions has two code paths:
| Path | Method | Behavior for empty partitions |
|------|--------|-------------------------------|
| A | generatePartitionFileSlicesPostROTablePathFilter (added in #18136) | Iterates over files, so a partition with zero files gets no entry in the returned map |
| B | filterFiles | Iterates over partitions, so every partition gets an entry (possibly an empty list) |
When Path A is taken, cachedAllInputFileSlices is left without entries for empty partitions. The caller getInputFileSlices then does:
return Arrays.stream(partitions).collect(
Collectors.toMap(Function.identity(), partition -> cachedAllInputFileSlices.get(partition))
);
cachedAllInputFileSlices.get(emptyPartition) returns null, and Collectors.toMap's uniqKeysMapAccumulator calls Objects.requireNonNull(value) on that null, throwing the NullPointerException.
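The failure can be reproduced in isolation with plain java.util.stream code; this standalone sketch mirrors the shape of the getInputFileSlices call with a cache that is missing one partition (names and values are illustrative):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ToMapNullValueDemo {
  public static void main(String[] args) {
    // Stand-in for cachedAllInputFileSlices: only the non-empty partition has an entry.
    Map<String, List<String>> cache = new HashMap<>();
    cache.put("dt=2024-01-01", Collections.singletonList("file-slice-1"));

    String[] partitions = {"dt=2024-01-01", "dt=2024-01-02"};

    // Same shape as getInputFileSlices: the value mapper returns null for the
    // missing partition, and Collectors.toMap rejects null values with an NPE.
    Map<String, List<String>> result = Arrays.stream(partitions).collect(
        Collectors.toMap(Function.identity(), partition -> cache.get(partition)));

    // Never reached: the collect call above throws java.lang.NullPointerException.
    System.out.println(result);
  }
}
```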
A fix and regression test are in PR #<PR_NUMBER>.
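For context only, and not necessarily what the PR does: one way to restore the per-partition contract is to make the collection step null-safe so that partitions absent from cachedAllInputFileSlices map to an empty list (alternatively, Path A could pre-seed an entry for every queried partition). A sketch of the caller-side variant:

```java
// Sketch only; the actual fix in the PR may differ. Partitions missing from
// cachedAllInputFileSlices map to an empty list, matching the behavior of the
// non-RO filterFiles path.
return Arrays.stream(partitions).collect(
    Collectors.toMap(
        Function.identity(),
        partition -> cachedAllInputFileSlices.getOrDefault(partition, Collections.emptyList())));
```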
Environment
Hudi version: master (bug introduced by #18136 in 1.x branch; affects 1.2 and later releases that include the optimization)
Query engine: Spark 3.3.x (also affects other Spark versions; bug is in hudi-common)
Relevant configs:
- hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true (the trigger; default is false)
- Table type COPY_ON_WRITE or query type READ_OPTIMIZED (either condition activates Path A)
Logs and Stack Trace
Caused by: java.lang.NullPointerException
at java.util.Objects.requireNonNull(Objects.java:222)
at java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:178)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
at org.apache.hudi.BaseHoodieTableFileIndex.getInputFileSlices(BaseHoodieTableFileIndex.java:252)
at org.apache.hudi.HoodieFileIndex.prunePartitionsAndGetFileSlices(HoodieFileIndex.scala:351)
at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:227)
at org.apache.spark.sql.hudi.analysis.Spark33HoodiePruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(Spark33HoodiePruneFileSourcePartitions.scala:63)
... (Spark catalyst frames)