fix(spark): catch HoodieSchemaNotFoundException in 3-arg DefaultSource.createRelation#18669
Conversation
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less Hudi tables don't fail when the read path supplies a user schema. One question worth checking: whether EmptyRelation should reflect the user-supplied schema rather than an empty StructType. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here.
```scala
val relation = try {
  DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
  case _: HoodieSchemaNotFoundException => new EmptyRelation(sqlContext, new StructType())
}
```
🤖 Should the EmptyRelation here be constructed with the user-supplied schema rather than new StructType()? On this overload, Spark has been given an explicit schema and downstream operations like .select("col1") would presumably expect those columns to exist on the resolved relation. The test only checks .count(), which would pass with an empty schema, but a query that projects columns might fail with AnalysisException against an empty StructType. The 2-arg path uses an empty StructType because no schema is known there, but here we have one.
🤖 I think this is already handled — the line reads new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType())), so when the 3-arg overload is invoked with a non-null user-supplied schema, that schema is passed through to EmptyRelation and only falls back to an empty StructType if schema happens to be null. So projections like .select("col1") against a user-supplied schema should resolve correctly.
…e.createRelation

The 2-arg `createRelation(sqlContext, parameters)` overload wraps its body in a try/catch that converts `HoodieSchemaNotFoundException` to `EmptyRelation` (added in HUDI-7147 / apache#10689). The 3-arg `createRelation(sqlContext, optParams, schema)` overload — which Spark's `DataSource.resolveRelation()` invokes directly via the `SchemaRelationProvider` path whenever a user-supplied schema is present (e.g. `spark.read.schema(s).format("hudi").load(path)`, or HMS-catalog resolution that already knows the schema) — has no such catch, so the exception propagates and breaks query analysis.

Mirror the 2-arg catch on the 3-arg overload so behavior is symmetric: schema-less Hudi tables resolve to an empty relation regardless of which overload Spark invokes.

Also adds `TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema`, a sibling of the existing `testReadOfAnEmptyTable` that exercises the 3-arg path.

Closes apache#18668

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
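The overload dispatch described in the commit message can be sketched as a toy model. This is not Spark's actual `DataSource.resolveRelation` code and the names are simplified stand-ins; it only illustrates why a user-supplied schema routed around the 2-arg catch before this fix, and how the mirrored catch makes both paths symmetric.

```scala
// Toy model of the overload dispatch; NOT Spark's real resolveRelation code.
object DispatchSketch {
  final class HoodieSchemaNotFoundException extends RuntimeException
  sealed trait Relation
  case object EmptyRelation extends Relation

  // The table is schema-less, so schema resolution always throws.
  private def resolveFromTable(): Relation = throw new HoodieSchemaNotFoundException

  // 2-arg overload: catch added in HUDI-7147 / apache#10689.
  def createRelation(): Relation =
    try resolveFromTable()
    catch { case _: HoodieSchemaNotFoundException => EmptyRelation }

  // 3-arg overload after this PR: the same catch, mirrored.
  def createRelation(schema: String): Relation =
    try resolveFromTable()
    catch { case _: HoodieSchemaNotFoundException => EmptyRelation }

  // Spark-style dispatch: a user-supplied schema routes to the 3-arg overload.
  def resolveRelation(userSchema: Option[String]): Relation =
    userSchema match {
      case Some(s) => createRelation(s)
      case None    => createRelation()
    }

  def main(args: Array[String]): Unit = {
    // With the mirrored catch, both paths resolve the schema-less table to EmptyRelation.
    assert(resolveRelation(None) == EmptyRelation)
    assert(resolveRelation(Some("col1 STRING")) == EmptyRelation)
    println("ok")
  }
}
```

Before the fix, deleting the catch from the 3-arg `createRelation` in this model would make `resolveRelation(Some(...))` throw while `resolveRelation(None)` still returned `EmptyRelation` — exactly the asymmetry the PR removes.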
Force-pushed 8c5b113 to 43e0dd4
Pushed a follow-up to the same commit: when the catch fires, pass the caller-supplied schema through to EmptyRelation instead of an empty StructType.
hudi-agent
left a comment
Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch from the 2-arg createRelation overload onto the 3-arg overload so schema-less tables resolve to an EmptyRelation on the SchemaRelationProvider path too, and adds a corresponding test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small simplification opportunity in the catch block; otherwise the code is clean and well-commented for a non-obvious fix.
cc @yihua
```scala
  DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
} catch {
  case _: HoodieSchemaNotFoundException =>
    new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
```
🤖 nit: Option(schema).getOrElse(new StructType()) implies schema could be null, but on the SchemaRelationProvider path Spark always provides it — that's the whole contract of this overload. Could you simplify to just schema? That way a null accidentally passed by a caller would fail fast instead of silently falling back to an empty schema.
Done in bd4e4c2 - passed schema through directly. Agreed the Option wrapper was overly defensive given the SchemaRelationProvider contract.
Update: had to revert the simplification in 5e25a57. Turns out the 2-arg createRelation overload (line 78) re-enters this 3-arg method with schema=null, so the SchemaRelationProvider non-null contract assumption doesn't hold for internal callers. The defensive Option(schema).getOrElse(new StructType()) was load-bearing - removing it broke TestCOWDataSource.testReadOfAnEmptyTable on spark3.3 / spark3.5 with NPE in BaseRelation.schema().isEmpty. Comment now documents the internal-recursion reason.
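The load-bearing fallback can be illustrated without Spark. In this sketch, `StructType` is a local stand-in for `org.apache.spark.sql.types.StructType` (not the real class), just enough to show why `Option(schema).getOrElse(...)` protects the internal null re-entry path.

```scala
// Minimal demo of the defensive fallback; StructType is a local stand-in.
object NullSchemaFallback {
  final case class StructType(fields: List[String] = Nil) {
    def isEmpty: Boolean = fields.isEmpty
  }

  // The pattern from the PR: Option(null) is None, so a null schema
  // (the 2-arg overload re-enters the 3-arg method with schema = null)
  // falls back to an empty StructType instead of leaving the relation
  // with a null schema that NPEs on relation.schema.isEmpty downstream.
  def emptyRelationSchema(schema: StructType): StructType =
    Option(schema).getOrElse(StructType())

  def main(args: Array[String]): Unit = {
    val user = StructType(List("col1"))
    assert(emptyRelationSchema(user) == user) // external caller: schema passed through
    assert(emptyRelationSchema(null).isEmpty) // internal null re-entry: safe fallback
    println("ok")
  }
}
```

Passing `schema` through directly would fail fast on a true contract violation, but since an internal caller legitimately passes null here, the fallback is the correct shape.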
lokeshj1703
left a comment
Thanks @prashantwason for working on this! The changes look good to me. There is a NIT posted by hudi-agent which might be good to look at. +1 o.w.
The SchemaRelationProvider contract guarantees schema is non-null on this overload, so the Option(schema).getOrElse(new StructType()) wrapper was defensive code that obscured intent. Passing schema through directly fails fast on a contract violation instead of silently producing an empty relation. Addresses review feedback on PR apache#18669.
hudi-agent
left a comment
Thanks for the contribution! This PR mirrors the existing 2-arg HoodieSchemaNotFoundException catch on the 3-arg createRelation overload so schema-supplied reads of schemaless tables return an EmptyRelation instead of throwing, with a corresponding test. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
The previous simplification (commit bd4e4c2) assumed the SchemaRelationProvider contract guaranteed `schema` is non-null, but the 2-arg createRelation overload re-enters this 3-arg method with `schema = null` (line 78). When the schema-not-found catch fires on that internal call, passing `null` straight through to EmptyRelation produces a relation whose `.schema` is null, and the 2-arg overload's downstream `relation.schema.isEmpty` check NPEs. Surfaced by TestCOWDataSource.testReadOfAnEmptyTable failing on spark3.3 / spark3.5 with NPE on BaseRelation.schema(). Restore the defensive Option(schema).getOrElse(new StructType()) and document the internal-recursion reason so future readers don't strip it again.
hudi-agent
left a comment
Thanks for the contribution! This PR mirrors the existing HoodieSchemaNotFoundException catch onto the 3-arg createRelation overload so schema-less tables don't blow up on the SchemaRelationProvider path, with a test that exercises spark.read.schema(...).load(...). No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##             master   #18669      +/-   ##
============================================
- Coverage     68.06%   67.92%    -0.15%
- Complexity    28922    28991       +69
============================================
  Files          2518     2522        +4
  Lines        140574   141167      +593
  Branches      17419    17509       +90
============================================
+ Hits          95682    95885      +203
- Misses        37036    37412      +376
- Partials       7856     7870       +14
```

Flags with carried forward coverage won't be shown.
@lokeshj1703 @yihua - this PR is green on the latest commit 5e25a57: 62/62 checks passing. Quick history of the recent commits in case it helps the re-review:

- bd4e4c2: passed `schema` through directly to `EmptyRelation`, per the hudi-agent nit.
- 5e25a57: reverted that simplification (the 2-arg overload re-enters the 3-arg method with `schema = null`) and documented the reason in the inline comment.

Net effect vs your prior approval: same as 43e0dd4 semantically + one extra clarifying sentence in the comment. Ready to merge whenever you have a moment.
Describe the issue this Pull Request addresses
Closes #18668.
`org.apache.hudi.DefaultSource` has two read-side overloads of `createRelation`:

- `createRelation(sqlContext, parameters)` wraps its body in a `try { … } catch { case _: HoodieSchemaNotFoundException => new EmptyRelation(…) }`. This catch was added in HUDI-7147 / #10689 so that schema-less Hudi tables (no commits / commit metadata deleted / legacy schema-less layout) do not explode at query analysis time.
- `createRelation(sqlContext, optParams, schema)` calls `DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)` directly, without the same catch.

Spark's `DataSource.resolveRelation()` chooses the overload based on whether a user-supplied schema is present. So any read path that supplies a schema (e.g.
`spark.read.schema(s).format("hudi").load(path)`, or HMS-catalog resolution that already knows the schema) bypasses the 2-arg catch and surfaces `HoodieSchemaNotFoundException` directly.

Summary and Changelog
- `DefaultSource.scala` (3-arg `createRelation`): mirror the existing 2-arg catch so `HoodieSchemaNotFoundException` resolves to `EmptyRelation` on this overload too. Adds an inline comment explaining why both overloads need the same catch.
- `TestCOWDataSource.testReadOfAnEmptyTableWithUserSuppliedSchema`: sibling of the existing `testReadOfAnEmptyTable` that asserts `spark.read.schema(userSchema).format("hudi").load(basePath).count() == 0` instead of throwing on a schema-less table.

Impact
User-facing: a Hudi table whose schema is unresolvable will now return an empty relation when queried with a user-supplied schema, matching the existing no-schema-supplied behavior. No previously-successful path changes behavior — this only converts a previously-thrown exception into an empty result on the same exact failure condition.
Risk Level
low — minimal scope (one try/catch mirroring existing logic), covered by a new unit test that mirrors an existing one.
Documentation Update
none
Contributor's checklist
🤖 Generated with Claude Code