
fix(spark): catch HoodieSchemaNotFoundException in 3-arg DefaultSource.createRelation #18669

Open
prashantwason wants to merge 3 commits into apache:master from prashantwason:fix-schema-not-found-3arg-createrelation

Conversation

@prashantwason
Member

Describe the issue this Pull Request addresses

Closes #18668.

org.apache.hudi.DefaultSource has two read-side overloads of createRelation:

  • The 2-arg overload createRelation(sqlContext, parameters) wraps its body in a try { … } catch { case _: HoodieSchemaNotFoundException => new EmptyRelation(…) }. This catch was added in HUDI-7147 / #10689 so that schema-less Hudi tables (no commits / commit metadata deleted / legacy schema-less layout) do not explode at query analysis time.
  • The 3-arg overload createRelation(sqlContext, optParams, schema) calls DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap) directly, without the same catch.

Spark's DataSource.resolveRelation() chooses the overload based on whether a user-supplied schema is present:

case (dataSource: SchemaRelationProvider, Some(schema)) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
case (dataSource: RelationProvider, _) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

So any read path that supplies a schema (e.g. spark.read.schema(s).format("hudi").load(path), or HMS-catalog resolution that already knows the schema) bypasses the 2-arg catch and surfaces HoodieSchemaNotFoundException directly.
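
For concreteness, a hedged sketch of the two read forms and the overload each one resolves through; spark, basePath, and userSchema are illustrative names, not code from this PR.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical user-supplied schema for a Hudi table.
val userSchema = StructType(Seq(
  StructField("_row_key", StringType),
  StructField("datestr", StringType)))

// No user schema: Spark resolves via the 2-arg RelationProvider overload,
// which already catches HoodieSchemaNotFoundException.
spark.read.format("hudi").load(basePath)

// User-supplied schema: Spark resolves via the 3-arg SchemaRelationProvider
// overload, which before this PR had no catch and surfaced the exception.
spark.read.schema(userSchema).format("hudi").load(basePath)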

Summary and Changelog

  • DefaultSource.scala (3-arg createRelation): mirrors the existing 2-arg catch so HoodieSchemaNotFoundException resolves to EmptyRelation on this overload too (see the sketch after this list). Adds an inline comment explaining why both overloads need the same catch.
  • TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema: sibling of the existing testReadOfAnEmptyTable that asserts spark.read.schema(userSchema).format("hudi").load(basePath).count() == 0 instead of throwing on a schema-less table.
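
Condensed from the diff hunks quoted in the reviews below, this is the shape of the change on the 3-arg overload as of the latest commits (surrounding code elided):

val relation = try {
  DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
  // Mirror of the 2-arg overload's catch: a schema-less table resolves to an
  // empty relation instead of failing query analysis. The Option fallback covers
  // the internal call where the 2-arg overload re-enters this method with schema = null.
  case _: HoodieSchemaNotFoundException =>
    new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
}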

Impact

User-facing: a Hudi table whose schema is unresolvable will now return an empty relation when queried with a user-supplied schema, matching the existing no-schema-supplied behavior. No previously-successful path changes behavior; this only converts a previously-thrown exception into an empty result under the exact same failure condition.

Risk Level

Low: minimal scope (one try/catch mirroring existing logic), covered by a new unit test that mirrors an existing one.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable
  • CI passed

🤖 Generated with Claude Code

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less Hudi tables don't fail when the read path supplies a user schema. One question worth checking: should EmptyRelation reflect the user-supplied schema rather than an empty StructType? Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here.

val relation = try {
DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
case _: HoodieSchemaNotFoundException => new EmptyRelation(sqlContext, new StructType())
Contributor

🤖 Should the EmptyRelation here be constructed with the user-supplied schema rather than new StructType()? On this overload, Spark has been given an explicit schema and downstream operations like .select("col1") would presumably expect those columns to exist on the resolved relation. The test only checks .count(), which would pass with an empty schema, but a query that projects columns might fail with AnalysisException against an empty StructType. The 2-arg path uses an empty StructType because no schema is known there, but here we have one.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Contributor

🤖 I think this is already handled — the line reads new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType())), so when the 3-arg overload is invoked with a non-null user-supplied schema, that schema is passed through to EmptyRelation and only falls back to an empty StructType if schema happens to be null. So projections like .select("col1") against a user-supplied schema should resolve correctly.
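
For concreteness, a hedged illustration of the point discussed in this thread (the column name is hypothetical and the exact error text may differ):

// If the catch returned EmptyRelation over an empty StructType, the resolved
// relation would expose zero columns, so a projection against the user-supplied
// schema could not be resolved at analysis time:
spark.read.schema(userSchema).format("hudi").load(basePath).select("col1")
// With the user-supplied schema passed through instead, the same query resolves
// and simply returns zero rows.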

@github-actions bot added the size:S (PR with lines of changes in (10, 100]) label on Apr 30, 2026
fix(spark): catch HoodieSchemaNotFoundException in 3-arg DefaultSource.createRelation

The 2-arg `createRelation(sqlContext, parameters)` overload wraps its body
in a try/catch that converts `HoodieSchemaNotFoundException` to
`EmptyRelation` (added in HUDI-7147 / apache#10689). The 3-arg
`createRelation(sqlContext, optParams, schema)` overload — which Spark's
`DataSource.resolveRelation()` invokes directly via the
`SchemaRelationProvider` path whenever a user-supplied schema is present
(e.g. `spark.read.schema(s).format("hudi").load(path)`, or HMS-catalog
resolution that already knows the schema) — has no such catch, so the
exception propagates and breaks query analysis.

Mirror the 2-arg catch on the 3-arg overload so behavior is symmetric:
schema-less Hudi tables resolve to an empty relation regardless of which
overload Spark invokes.

Also adds `TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema`,
a sibling of the existing `testReadOfAnEmptyTable` that exercises the
3-arg path.

Closes apache#18668

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@prashantwason force-pushed the fix-schema-not-found-3arg-createrelation branch from 8c5b113 to 43e0dd4 on April 30, 2026 23:15
@prashantwason
Member Author

Pushed a follow-up to the same commit: when the catch fires, pass the caller-supplied schema through to EmptyRelation instead of new StructType(). Without this, downstream Spark analysis (e.g. WHERE datestr >= '...') fails with Column 'datestr' does not exist because Spark sees the relation has zero columns even though the HMS catalog supplied a schema. Surfaced via a downstream test where the on-disk Hudi table is schemaless but HMS knows the columns.

Test (testReadOfAnEmptyTableWithUserSuppliedSchema) now also asserts df.schema == userSchema.
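
A hedged sketch of what the updated test asserts, assuming the fixtures (spark, basePath) and the JUnit assertions used elsewhere in TestCOWDataSource; the actual field names and table setup may differ:

@Test
def testReadOfAnEmptyTableWithUserSuppliedSchema(): Unit = {
  // The table exists on disk but its schema is unresolvable (no commits / schema-less layout).
  val userSchema = StructType(Seq(
    StructField("_row_key", StringType),
    StructField("datestr", StringType)))
  val df = spark.read.schema(userSchema).format("hudi").load(basePath)
  // Previously this threw HoodieSchemaNotFoundException; now it resolves to an empty relation.
  assertEquals(0L, df.count())
  // Added in the follow-up commit: the relation reflects the caller-supplied schema.
  assertEquals(userSchema, df.schema)
}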

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less tables resolve to an EmptyRelation on the SchemaRelationProvider path too, and adds a corresponding test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small simplification opportunity in the catch block; otherwise the code is clean and well-commented for a non-obvious fix.

cc @yihua

DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
case _: HoodieSchemaNotFoundException =>
new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
Contributor


🤖 nit: Option(schema).getOrElse(new StructType()) implies schema could be null, but on the SchemaRelationProvider path Spark always provides it — that's the whole contract of this overload. Could you simplify to just schema? That way a null accidentally passed by a caller would fail fast instead of silently falling back to an empty schema.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Member Author


Done in bd4e4c2 - passed schema through directly. Agreed the Option wrapper was overly defensive given the SchemaRelationProvider contract.

Member Author


Update: had to revert the simplification in 5e25a57. Turns out the 2-arg createRelation overload (line 78) re-enters this 3-arg method with schema=null, so the SchemaRelationProvider non-null contract assumption doesn't hold for internal callers. The defensive Option(schema).getOrElse(new StructType()) was load-bearing - removing it broke TestCOWDataSource.testReadOfAnEmptyTable on spark3.3 / spark3.5 with NPE in BaseRelation.schema().isEmpty. Comment now documents the internal-recursion reason.
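
A simplified sketch of the recursion that makes the fallback load-bearing (not the exact Hudi source; signatures abbreviated):

// The 2-arg RelationProvider overload has no user schema, so it re-enters the
// 3-arg overload with schema = null before doing its own schema checks.
override def createRelation(sqlContext: SQLContext,
                            parameters: Map[String, String]): BaseRelation =
  createRelation(sqlContext, parameters, null)

// If the 3-arg catch passed `schema` straight through, that internal call would
// build EmptyRelation(sqlContext, null), and the 2-arg caller's
// relation.schema.isEmpty check would NPE, hence Option(schema).getOrElse(new StructType()).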


Collaborator

@lokeshj1703 lokeshj1703 left a comment


Thanks @prashantwason for working on this! The changes look good to me. There is a NIT posted by hudi-agent which might be good to look at. +1 o.w.

The SchemaRelationProvider contract guarantees schema is non-null
on this overload, so the Option(schema).getOrElse(new StructType())
wrapper was defensive code that obscured intent. Passing schema
through directly fails fast on a contract violation instead of
silently producing an empty relation.

Addresses review feedback on PR apache#18669.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing 2-arg HoodieSchemaNotFoundException catch on the 3-arg createRelation overload so schema-supplied reads of schemaless tables return an EmptyRelation instead of throwing, with a corresponding test. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

The previous simplification (commit bd4e4c2) assumed the
SchemaRelationProvider contract guaranteed `schema` is non-null, but
the 2-arg createRelation overload re-enters this 3-arg method with
`schema = null` (line 78). When the schema-not-found catch fires on
that internal call, passing `null` straight through to EmptyRelation
produces a relation whose `.schema` is null, and the 2-arg overload's
downstream `relation.schema.isEmpty` check NPEs.

Surfaced by TestCOWDataSource.testReadOfAnEmptyTable failing on
spark3.3 / spark3.5 with NPE on BaseRelation.schema(). Restore the
defensive Option(schema).getOrElse(new StructType()) and document
the internal-recursion reason so future readers don't strip it again.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch onto the 3-arg createRelation overload so schema-less tables don't blow up on the SchemaRelationProvider path, with a test that exercises spark.read.schema(...).load(...). No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.92%. Comparing base (38db5ed) to head (5e25a57).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18669      +/-   ##
============================================
- Coverage     68.06%   67.92%   -0.15%     
- Complexity    28922    28991      +69     
============================================
  Files          2518     2522       +4     
  Lines        140574   141167     +593     
  Branches      17419    17509      +90     
============================================
+ Hits          95682    95885     +203     
- Misses        37036    37412     +376     
- Partials       7856     7870      +14     
Flag Coverage Δ
common-and-other-modules 44.17% <100.00%> (-0.20%) ⬇️
hadoop-mr-java-client 44.99% <ø> (+0.10%) ⬆️
spark-client-hadoop-common 48.35% <ø> (-0.09%) ⬇️
spark-java-tests 48.99% <100.00%> (+0.35%) ⬆️
spark-scala-tests 44.90% <100.00%> (+0.20%) ⬆️
utilities 37.62% <100.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...src/main/scala/org/apache/hudi/DefaultSource.scala 76.47% <100.00%> (+1.32%) ⬆️

... and 34 files with indirect coverage changes


@hudi-bot
Collaborator

hudi-bot commented May 7, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@prashantwason
Member Author

@lokeshj1703 @yihua - this PR is green on the latest commit 5e25a57: 62/62 checks passing.

Quick history of the recent commits in case it helps the re-review:

  • bd4e4c2 - applied a hudi-agent nit to drop Option(schema).getOrElse(new StructType()) in favor of schema directly, on the assumption that SchemaRelationProvider guarantees non-null schema.
  • 5e25a57 - reverted bd4e4c2. The 2-arg createRelation overload (line 78 of DefaultSource.scala) re-enters the 3-arg overload with schema = null, so the defensive fallback was load-bearing. Without it, TestCOWDataSource.testReadOfAnEmptyTable NPE'd on spark3.3 / spark3.5. The comment above the catch now explicitly calls out the internal-recursion reason so future readers don't strip it again.

Net effect vs your prior approval: same as 43e0dd4 semantically + one extra clarifying sentence in the comment. Ready to merge whenever you have a moment.


Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

5 participants