NPE in BaseHoodieTableFileIndex.getInputFileSlices when RO path filter is enabled and a queried partition has no base files #18638

@prashantwason

Description

Bug Description

What happened:

When hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true is set on a COW table (or on any table queried with READ_OPTIMIZED), querying a partition set that includes one or more partitions with no base files throws NullPointerException deep inside Spark query planning, originating from BaseHoodieTableFileIndex.getInputFileSlices.

What you expected:

The query should plan and execute successfully. Empty partitions should appear in the result map with an empty file-slice list, matching the contract already honored by the non-RO path (filterFiles), which iterates over partitions and naturally returns an entry per partition.

Steps to reproduce:

  1. Create a COW table with multiple partitions, where at least one partition contains no base files (e.g., produced by clustering, cleaning, or simply never written).
  2. Enable hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true.
  3. Issue a Spark SQL query that prunes down to a partition set including the empty one.
  4. The driver throws NullPointerException during query optimization.

Root cause:

BaseHoodieTableFileIndex#loadFileSlicesForPartitions has two code paths:

| Path | Method | Behavior for empty partitions |
|------|--------|-------------------------------|
| A | `generatePartitionFileSlicesPostROTablePathFilter` (added in #18136) | Iterates over files, so a partition with zero files gets no entry in the returned map |
| B | `filterFiles` | Iterates over partitions, so every partition gets an entry (possibly an empty list) |

When Path A is taken, cachedAllInputFileSlices is left without entries for empty partitions. The caller getInputFileSlices then does:

```java
return Arrays.stream(partitions).collect(
    Collectors.toMap(Function.identity(), partition -> cachedAllInputFileSlices.get(partition))
);
```

For an empty partition, `cachedAllInputFileSlices.get(partition)` returns null, and `Collectors.toMap`'s `uniqKeysMapAccumulator` calls `Objects.requireNonNull` on the value → NPE.
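The failure mode can be reproduced in isolation: `Collectors.toMap` rejects null values regardless of the key. Below is a minimal standalone sketch (the `cache` map and partition names are hypothetical stand-ins for `cachedAllInputFileSlices` and Hudi partition paths):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ToMapNpeDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for cachedAllInputFileSlices:
        // only the non-empty partition has an entry.
        Map<String, List<String>> cache = new HashMap<>();
        cache.put("2024-01-01", Arrays.asList("base-file-1.parquet"));

        // Pruned partition set includes an empty partition with no cache entry.
        String[] partitions = {"2024-01-01", "2024-01-02"};

        try {
            Map<String, List<String>> result = Arrays.stream(partitions)
                .collect(Collectors.toMap(Function.identity(), cache::get));
            System.out.println("no NPE: " + result);
        } catch (NullPointerException e) {
            // toMap's accumulator calls Objects.requireNonNull on the value,
            // so the null returned for the empty partition blows up here.
            System.out.println("NPE as described");
        }
    }
}
```

Running this prints `NPE as described`, matching the stack trace below.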

A fix and regression test are in PR #<PR_NUMBER>.

Environment

Hudi version: master (bug introduced by #18136 in 1.x branch; affects 1.2 and later releases that include the optimization)
Query engine: Spark 3.3.x (also affects other Spark versions; bug is in hudi-common)
Relevant configs:

  • hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true (the trigger; default is false)
  • Table type COPY_ON_WRITE or query type READ_OPTIMIZED (either condition activates Path A)

Logs and Stack Trace

```
Caused by: java.lang.NullPointerException
    at java.util.Objects.requireNonNull(Objects.java:222)
    at java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:178)
    at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at org.apache.hudi.BaseHoodieTableFileIndex.getInputFileSlices(BaseHoodieTableFileIndex.java:252)
    at org.apache.hudi.HoodieFileIndex.prunePartitionsAndGetFileSlices(HoodieFileIndex.scala:351)
    at org.apache.hudi.HoodieFileIndex.filterFileSlices(HoodieFileIndex.scala:227)
    at org.apache.spark.sql.hudi.analysis.Spark33HoodiePruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(Spark33HoodiePruneFileSourcePartitions.scala:63)
    ... (Spark catalyst frames)
```
