
fix(spark): catch HoodieSchemaNotFoundException in 3-arg DefaultSource.createRelation #18669

Open
prashantwason wants to merge 3 commits into apache:master from prashantwason:fix-schema-not-found-3arg-createrelation

Conversation

@prashantwason
Member

Describe the issue this Pull Request addresses

Closes #18668.

org.apache.hudi.DefaultSource has two read-side overloads of createRelation:

  • The 2-arg overload createRelation(sqlContext, parameters) wraps its body in a try { … } catch { case _: HoodieSchemaNotFoundException => new EmptyRelation(…) }. This catch was added in HUDI-7147 / #10689 so that schema-less Hudi tables (no commits / commit metadata deleted / legacy schema-less layout) do not explode at query analysis time.
  • The 3-arg overload createRelation(sqlContext, optParams, schema) calls DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap) directly, without the same catch.

Spark's DataSource.resolveRelation() chooses the overload based on whether a user-supplied schema is present:

case (dataSource: SchemaRelationProvider, Some(schema)) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
case (dataSource: RelationProvider, _) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

So any read path that supplies a schema (e.g. spark.read.schema(s).format("hudi").load(path), or HMS-catalog resolution that already knows the schema) bypasses the 2-arg catch and surfaces HoodieSchemaNotFoundException directly.
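
For concreteness, a hedged sketch of the two read forms and the overload each one resolves through; spark, basePath, and userSchema are illustrative names, not code from this PR.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical user-supplied schema for a Hudi table.
val userSchema = StructType(Seq(
  StructField("_row_key", StringType),
  StructField("datestr", StringType)))

// No user schema: Spark resolves via the 2-arg RelationProvider overload,
// which already catches HoodieSchemaNotFoundException.
spark.read.format("hudi").load(basePath)

// User-supplied schema: Spark resolves via the 3-arg SchemaRelationProvider
// overload, which before this PR had no catch and surfaced the exception.
spark.read.schema(userSchema).format("hudi").load(basePath)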

Summary and Changelog

  • DefaultSource.scala (3-arg createRelation): mirrors the existing 2-arg catch so HoodieSchemaNotFoundException resolves to EmptyRelation on this overload too (see the sketch after this list). Adds an inline comment explaining why both overloads need the same catch.
  • TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema: sibling of the existing testReadOfAnEmptyTable that asserts spark.read.schema(userSchema).format("hudi").load(basePath).count() == 0 instead of throwing on a schema-less table.
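
Condensed from the diff hunks quoted in the reviews below, this is the shape of the change on the 3-arg overload as of the latest commits (surrounding code elided):

val relation = try {
  DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
  // Mirror of the 2-arg overload's catch: a schema-less table resolves to an
  // empty relation instead of failing query analysis. The Option fallback covers
  // the internal call where the 2-arg overload re-enters this method with schema = null.
  case _: HoodieSchemaNotFoundException =>
    new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
}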

Impact

User-facing: a Hudi table whose schema is unresolvable will now return an empty relation when queried with a user-supplied schema, matching the existing no-schema-supplied behavior. No previously-successful path changes behavior; this only converts a previously-thrown exception into an empty result under the exact same failure condition.

Risk Level

Low: minimal scope (one try/catch mirroring existing logic), covered by a new unit test that mirrors an existing one.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable
  • CI passed

🤖 Generated with Claude Code

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less Hudi tables don't fail when the read path supplies a user schema. One question worth checking: should EmptyRelation reflect the user-supplied schema rather than an empty StructType? Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here.

val relation = try {
DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
case _: HoodieSchemaNotFoundException => new EmptyRelation(sqlContext, new StructType())
Contributor

🤖 Should the EmptyRelation here be constructed with the user-supplied schema rather than new StructType()? On this overload, Spark has been given an explicit schema and downstream operations like .select("col1") would presumably expect those columns to exist on the resolved relation. The test only checks .count(), which would pass with an empty schema, but a query that projects columns might fail with AnalysisException against an empty StructType. The 2-arg path uses an empty StructType because no schema is known there, but here we have one.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Contributor

🤖 I think this is already handled — the line reads new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType())), so when the 3-arg overload is invoked with a non-null user-supplied schema, that schema is passed through to EmptyRelation and only falls back to an empty StructType if schema happens to be null. So projections like .select("col1") against a user-supplied schema should resolve correctly.
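
For concreteness, a hedged illustration of the point discussed in this thread (the column name is hypothetical and the exact error text may differ):

// If the catch returned EmptyRelation over an empty StructType, the resolved
// relation would expose zero columns, so a projection against the user-supplied
// schema could not be resolved at analysis time:
spark.read.schema(userSchema).format("hudi").load(basePath).select("col1")
// With the user-supplied schema passed through instead, the same query resolves
// and simply returns zero rows.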

@github-actions bot added the size:S (PR with lines of changes in (10, 100]) label on Apr 30, 2026
fix(spark): catch HoodieSchemaNotFoundException in 3-arg DefaultSource.createRelation

The 2-arg `createRelation(sqlContext, parameters)` overload wraps its body
in a try/catch that converts `HoodieSchemaNotFoundException` to
`EmptyRelation` (added in HUDI-7147 / apache#10689). The 3-arg
`createRelation(sqlContext, optParams, schema)` overload — which Spark's
`DataSource.resolveRelation()` invokes directly via the
`SchemaRelationProvider` path whenever a user-supplied schema is present
(e.g. `spark.read.schema(s).format("hudi").load(path)`, or HMS-catalog
resolution that already knows the schema) — has no such catch, so the
exception propagates and breaks query analysis.

Mirror the 2-arg catch on the 3-arg overload so behavior is symmetric:
schema-less Hudi tables resolve to an empty relation regardless of which
overload Spark invokes.

Also adds `TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema`,
a sibling of the existing `testReadOfAnEmptyTable` that exercises the
3-arg path.

Closes apache#18668

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@prashantwason force-pushed the fix-schema-not-found-3arg-createrelation branch from 8c5b113 to 43e0dd4 on April 30, 2026 23:15
@prashantwason
Member Author

Pushed a follow-up to the same commit: when the catch fires, pass the caller-supplied schema through to EmptyRelation instead of new StructType(). Without this, downstream Spark analysis (e.g. WHERE datestr >= '...') fails with Column 'datestr' does not exist because Spark sees the relation has zero columns even though the HMS catalog supplied a schema. Surfaced via a downstream test where the on-disk Hudi table is schemaless but HMS knows the columns.

Test (testReadOfAnEmptyTableWithUserSuppliedSchema) now also asserts df.schema == userSchema.
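
A hedged sketch of what the updated test asserts, assuming the fixtures (spark, basePath) and the JUnit assertions used elsewhere in TestCOWDataSource; the actual field names and table setup may differ:

@Test
def testReadOfAnEmptyTableWithUserSuppliedSchema(): Unit = {
  // The table exists on disk but its schema is unresolvable (no commits / schema-less layout).
  val userSchema = StructType(Seq(
    StructField("_row_key", StringType),
    StructField("datestr", StringType)))
  val df = spark.read.schema(userSchema).format("hudi").load(basePath)
  // Previously this threw HoodieSchemaNotFoundException; now it resolves to an empty relation.
  assertEquals(0L, df.count())
  // Added in the follow-up commit: the relation reflects the caller-supplied schema.
  assertEquals(userSchema, df.schema)
}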

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less tables resolve to an EmptyRelation on the SchemaRelationProvider path too, and adds a corresponding test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small simplification opportunity in the catch block; otherwise the code is clean and well-commented for a non-obvious fix.

cc @yihua

DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
case _: HoodieSchemaNotFoundException =>
new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
Contributor


🤖 nit: Option(schema).getOrElse(new StructType()) implies schema could be null, but on the SchemaRelationProvider path Spark always provides it — that's the whole contract of this overload. Could you simplify to just schema? That way a null accidentally passed by a caller would fail fast instead of silently falling back to an empty schema.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Member Author


Done in bd4e4c2 - passed schema through directly. Agreed the Option wrapper was overly defensive given the SchemaRelationProvider contract.

Member Author


Update: had to revert the simplification in 5e25a57. Turns out the 2-arg createRelation overload (line 78) re-enters this 3-arg method with schema=null, so the SchemaRelationProvider non-null contract assumption doesn't hold for internal callers. The defensive Option(schema).getOrElse(new StructType()) was load-bearing - removing it broke TestCOWDataSource.testReadOfAnEmptyTable on spark3.3 / spark3.5 with NPE in BaseRelation.schema().isEmpty. Comment now documents the internal-recursion reason.
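
A simplified sketch of the recursion that makes the fallback load-bearing (not the exact Hudi source; signatures abbreviated):

// The 2-arg RelationProvider overload has no user schema, so it re-enters the
// 3-arg overload with schema = null before doing its own schema checks.
override def createRelation(sqlContext: SQLContext,
                            parameters: Map[String, String]): BaseRelation =
  createRelation(sqlContext, parameters, null)

// If the 3-arg catch passed `schema` straight through, that internal call would
// build EmptyRelation(sqlContext, null), and the 2-arg caller's
// relation.schema.isEmpty check would NPE, hence Option(schema).getOrElse(new StructType()).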


Collaborator

@lokeshj1703 lokeshj1703 left a comment


Thanks @prashantwason for working on this! The changes look good to me. There is a NIT posted by hudi-agent which might be good to look at. +1 o.w.

The SchemaRelationProvider contract guarantees schema is non-null
on this overload, so the Option(schema).getOrElse(new StructType())
wrapper was defensive code that obscured intent. Passing schema
through directly fails fast on a contract violation instead of
silently producing an empty relation.

Addresses review feedback on PR apache#18669.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing 2-arg HoodieSchemaNotFoundException catch on the 3-arg createRelation overload so schema-supplied reads of schemaless tables return an EmptyRelation instead of throwing, with a corresponding test. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

The previous simplification (commit bd4e4c2) assumed the
SchemaRelationProvider contract guaranteed `schema` is non-null, but
the 2-arg createRelation overload re-enters this 3-arg method with
`schema = null` (line 78). When the schema-not-found catch fires on
that internal call, passing `null` straight through to EmptyRelation
produces a relation whose `.schema` is null, and the 2-arg overload's
downstream `relation.schema.isEmpty` check NPEs.

Surfaced by TestCOWDataSource.testReadOfAnEmptyTable failing on
spark3.3 / spark3.5 with NPE on BaseRelation.schema(). Restore the
defensive Option(schema).getOrElse(new StructType()) and document
the internal-recursion reason so future readers don't strip it again.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch onto the 3-arg createRelation overload so schema-less tables don't blow up on the SchemaRelationProvider path, with a test that exercises spark.read.schema(...).load(...). No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.92%. Comparing base (38db5ed) to head (5e25a57).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18669      +/-   ##
============================================
- Coverage     68.06%   67.92%   -0.15%     
- Complexity    28922    28991      +69     
============================================
  Files          2518     2522       +4     
  Lines        140574   141167     +593     
  Branches      17419    17509      +90     
============================================
+ Hits          95682    95885     +203     
- Misses        37036    37412     +376     
- Partials       7856     7870      +14     
Flag Coverage Δ
common-and-other-modules 44.17% <100.00%> (-0.20%) ⬇️
hadoop-mr-java-client 44.99% <ø> (+0.10%) ⬆️
spark-client-hadoop-common 48.35% <ø> (-0.09%) ⬇️
spark-java-tests 48.99% <100.00%> (+0.35%) ⬆️
spark-scala-tests 44.90% <100.00%> (+0.20%) ⬆️
utilities 37.62% <100.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...src/main/scala/org/apache/hudi/DefaultSource.scala 76.47% <100.00%> (+1.32%) ⬆️

... and 34 files with indirect coverage changes


@hudi-bot
Collaborator

hudi-bot commented May 7, 2026

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@prashantwason
Member Author

@lokeshj1703 @yihua - this PR is green on the latest commit 5e25a57: 62/62 checks passing.

Quick history of the recent commits in case it helps the re-review:

  • bd4e4c2 - applied a hudi-agent nit to drop Option(schema).getOrElse(new StructType()) in favor of schema directly, on the assumption that SchemaRelationProvider guarantees non-null schema.
  • 5e25a57 - reverted bd4e4c2. The 2-arg createRelation overload (line 78 of DefaultSource.scala) re-enters the 3-arg overload with schema = null, so the defensive fallback was load-bearing. Without it, TestCOWDataSource.testReadOfAnEmptyTable NPE'd on spark3.3 / spark3.5. The comment above the catch now explicitly calls out the internal-recursion reason so future readers don't strip it again.

Net effect vs your prior approval: same as 43e0dd4 semantically + one extra clarifying sentence in the comment. Ready to merge whenever you have a moment.


Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

5 participants