
[SUPPORT] Schema-less Hudi tables explode on spark.read.schema(...).load(path) — 3-arg DefaultSource.createRelation does not catch HoodieSchemaNotFoundException #18668

@prashantwason

Description


Tips before filing an issue

  • Have you gone through our FAQs?
  • Yes, this is a code-level bug.

Describe the problem you faced

org.apache.hudi.DefaultSource has two read-side overloads of createRelation:

  • 2-arg createRelation(sqlContext, parameters) — wraps the body in try { … } catch { case _: HoodieSchemaNotFoundException => new EmptyRelation(…) }. This catch was added in HUDI-7147 / PR #10689 precisely so that a schema-less Hudi table (no commits / commit metadata deleted / legacy schemaless layout) does not explode at query analysis time.
  • 3-arg createRelation(sqlContext, optParams, schema) — calls DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap) directly, without the same catch.
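
Schematically, the asymmetry looks like this (a simplified sketch, not the literal source on master — bodies are elided with `???`, and the exact `EmptyRelation` constructor arguments are abbreviated):

```scala
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // 2-arg overload (RelationProvider): schema-less tables are mapped to EmptyRelation
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    try {
      ??? // resolve metaClient, delegate to the shared DefaultSource.createRelation (elided)
    } catch {
      case _: HoodieSchemaNotFoundException => new EmptyRelation(sqlContext, new StructType())
    }

  // 3-arg overload (SchemaRelationProvider): no equivalent catch, so the
  // exception propagates up through Spark's DataSource.resolveRelation()
  override def createRelation(sqlContext: SQLContext,
                              optParams: Map[String, String],
                              schema: StructType): BaseRelation =
    ??? // resolve metaClient, then DefaultSource.createRelation(sqlContext, metaClient, schema, ...) with no try/catch
}
```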

Spark's DataSource.resolveRelation() chooses which overload to invoke based on whether a user-supplied schema is present:

```scala
case (dataSource: SchemaRelationProvider, Some(schema)) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
case (dataSource: RelationProvider, _) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
```

Any read path that supplies a schema (e.g. `spark.read.schema(s).format("hudi").load(path)`, or HMS-catalog resolution that already knows the schema) bypasses the 2-arg catch and surfaces `HoodieSchemaNotFoundException` directly.

To Reproduce

  1. Create a Hudi table with one insert commit, then delete the only completed `.commit` file in the timeline (or otherwise produce a layout where `TableSchemaResolver` cannot resolve a schema).
  2. `spark.read.format("hudi").load(basePath)` → returns empty DataFrame (works because of the 2-arg catch — see existing test `TestCOWDataSource.testReadOfAnEmptyTable`).
  3. `spark.read.schema(someSchema).format("hudi").load(basePath)` → throws `org.apache.hudi.exception.HoodieSchemaNotFoundException: No schema found for table at `.

The same scenario also reproduces when a Spark SQL query resolves a Hudi table via `HiveMetastoreCatalog` and the catalog already supplies the schema.
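
The repro can be sketched in spark-shell as follows (`basePath` and the column names in `someSchema` are illustrative placeholders, not from the original report):

```scala
// Illustrative spark-shell repro; assumes step 1 has already been done
// (table created, then its only completed commit file deleted from the timeline).
import org.apache.spark.sql.types._

val basePath = "/tmp/hudi_schema_less_table"

// Works: returns an empty DataFrame via the 2-arg createRelation catch
spark.read.format("hudi").load(basePath).count()

// Throws HoodieSchemaNotFoundException: the user-supplied schema routes Spark
// to the 3-arg SchemaRelationProvider overload, which has no catch
val someSchema = StructType(Seq(
  StructField("key", StringType),
  StructField("value", LongType)))
spark.read.schema(someSchema).format("hudi").load(basePath).count()
```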

Expected behavior

Both overloads should treat a schema-less Hudi table identically: return an `EmptyRelation` rather than propagating `HoodieSchemaNotFoundException`. Behavior should not depend on whether the caller supplied a schema.

Environment Description

  • Hudi version: master (also reproduces on 0.x and 1.x)
  • Spark version: 3.x
  • Hive version: any
  • Hadoop version: any
  • Storage (HDFS/S3/GCS..): any
  • Running on Docker? (yes/no): no

Additional context

The fix is small: mirror the existing 2-arg catch onto the 3-arg overload. PR to follow.
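
A sketch of the intended shape (not the final PR; the metaClient resolution and the exact `EmptyRelation` arguments are assumptions based on the existing 2-arg overload):

```scala
// Proposed 3-arg overload, with the same catch the 2-arg overload already has:
override def createRelation(sqlContext: SQLContext,
                            optParams: Map[String, String],
                            schema: StructType): BaseRelation = {
  // ... existing metaClient resolution unchanged ...
  try {
    DefaultSource.createRelation(sqlContext, metaClient, schema, optParams)
  } catch {
    case _: HoodieSchemaNotFoundException =>
      // Mirror of the 2-arg behavior (HUDI-7147 / PR #10689). Since the caller
      // supplied a schema here, it can serve as the empty relation's schema.
      new EmptyRelation(sqlContext, schema)
  }
}
```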

Stacktrace

```
org.apache.hudi.exception.HoodieSchemaNotFoundException: No schema found for table at
at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaInternal(TableSchemaResolver.java:...)
at org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchema(TableSchemaResolver.java:...)
at org.apache.hudi.HoodieBaseRelation.<init>(HoodieBaseRelation.scala:...)
at org.apache.hudi.DefaultSource$.createRelation(DefaultSource.scala:...)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:137) ← 3-arg overload, no catch
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:...)
...
```
