[SPARK-30151][SQL] Issue better error message when user-specified schema mismatched #26781
Changes from 5 commits
468660f
9847847
aab2391
a4f3122
dd45804
de036b6
8fac1fa
6700dee
f54dea9
ffdff16
c2b5eea
814821a
| Diff line change |
|---|
| @@ -339,11 +339,34 @@ case class DataSource( |
|   dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) |
|   case (_: SchemaRelationProvider, None) => |
|   throw new AnalysisException(s"A schema needs to be specified when using $className.") |
| - case (dataSource: RelationProvider, Some(schema)) => |
| + case (dataSource: RelationProvider, Some(specifiedSchema)) => |
|   val baseRelation = |
|   dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) |
| - if (baseRelation.schema != schema) { |
| - throw new AnalysisException(s"$className does not allow user-specified schemas.") |
| + val persistentSchema = baseRelation.schema |
| + val persistentSize = persistentSchema.size |
| + val specifiedSize = specifiedSchema.size |
| + if (persistentSize == specifiedSize) { |
| + val (persistentFields, specifiedFields) = persistentSchema.zip(specifiedSchema) |
| + .filter { case (existedField, userField) => existedField != userField } |
| + .unzip |
| + if (persistentFields.nonEmpty) { |
| + val errorMsg = |
| + s"Mismatched fields detected between persistent schema and user specified schema: " + |
| + s"persistentFields: ${persistentFields.map(_.toDDL).mkString(", ")}, " + |
| + s"specifiedFields: ${specifiedFields.map(_.toDDL).mkString(", ")}. " + |
| + s"This happens either you make a mistake in schema or type mapping between Spark " + |
| + s"and external data sources have been updated while your specified schema still " + |
| + s"using the old schema. Please either correct the schema or just do not specify " + |
| + s"the schema since a specified schema for $className is not necessary." |
| + throw new AnalysisException(errorMsg) |
| + } |
| + } else { |
| + val errorMsg = |
| + s"The number of fields between persistent schema and user specified schema " + |
| + s"mismatched: expect $persistentSize fields, but got $specifiedSize fields. " + |
| + s"Please either correct the schema or just do not specify the schema since " + |
| + s"a specified schema for $className is not necessary." |
| + throw new AnalysisException(errorMsg) |
|   } |
|   baseRelation |
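To make the behavior of the new branch concrete, here is a minimal, self-contained sketch of the same two steps: a field-count check, then a positional zip/filter/unzip to collect the differing fields. `MismatchSketch`, `Field`, and `describeMismatch` are hypothetical stand-ins for illustration only (they are not Spark classes), and the messages are abbreviated rather than copied from the diff.

```scala
object MismatchSketch {
  // Hypothetical stand-in for Spark's StructField; not the real class.
  final case class Field(name: String, dataType: String) {
    def toDDL: String = s"$name $dataType"
  }

  // Returns None when the schemas match, otherwise a short description.
  def describeMismatch(persistent: Seq[Field], specified: Seq[Field]): Option[String] = {
    if (persistent.size != specified.size) {
      Some(s"expect ${persistent.size} fields, but got ${specified.size} fields")
    } else {
      // Pair fields by position and keep only the pairs that differ.
      val (persistentFields, specifiedFields) = persistent.zip(specified)
        .filter { case (existing, user) => existing != user }
        .unzip
      if (persistentFields.isEmpty) None
      else Some(
        s"persistentFields: ${persistentFields.map(_.toDDL).mkString(", ")}, " +
          s"specifiedFields: ${specifiedFields.map(_.toDDL).mkString(", ")}")
    }
  }
}
```

Because fields are paired by position, this style of check only makes sense after the sizes are known to be equal, which is why the diff tests the size first.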
If we're going to improve error messages like this across the codebase, we might also think about having a common method (maybe something called assertEqualityinStructType?) that checks each type recursively and shows a better message. Can we at least have a private method here for this case in the future?
I'm not sure whether we'd need similar functionality elsewhere in the future, but we could still give it a try.
Yeah, I think it won't handle nested cases. There are other external data sources that support nested schemas, and the current code reports only root-level columns.
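The recursive check suggested in this thread could look roughly like the sketch below, which walks two schemas together and reports the path of each nested mismatch instead of only the root column. Everything here (`NestedDiffSketch`, `DType`, `Atomic`, `Struct`, `describe`, `diff`) is a hypothetical, simplified model written for illustration; Spark's real `DataType` hierarchy also has arrays, maps, nullability, and metadata, none of which are handled.

```scala
object NestedDiffSketch {
  // Hypothetical, simplified type model; Spark's DataType hierarchy is richer.
  sealed trait DType
  final case class Atomic(name: String) extends DType
  final case class Struct(fields: Seq[(String, DType)]) extends DType

  // Renders a type compactly for use in messages.
  def describe(t: DType): String = t match {
    case Atomic(name) => name
    case Struct(fields) =>
      fields.map { case (n, ft) => s"$n: ${describe(ft)}" }.mkString("struct<", ", ", ">")
  }

  // Collects one message per mismatch, each prefixed with a dotted field path.
  def diff(expected: DType, actual: DType, path: String = "<root>"): Seq[String] =
    (expected, actual) match {
      case (Struct(e), Struct(a)) if e.size != a.size =>
        Seq(s"${path}: expected ${e.size} fields, got ${a.size}")
      case (Struct(e), Struct(a)) =>
        e.zip(a).flatMap { case ((eName, eType), (aName, aType)) =>
          if (eName != aName) Seq(s"${path}: expected field '${eName}', got '${aName}'")
          else diff(eType, aType, if (path == "<root>") eName else s"${path}.${eName}")
        }
      case (e, a) if e != a =>
        Seq(s"${path}: expected ${describe(e)}, got ${describe(a)}")
      case _ => Nil
    }
}
```

With this shape, a type change inside a nested struct surfaces as a message on the nested path (for example `address.zip`) rather than on the enclosing root column.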
Also, there are many other places where error messages could be improved like this, e.g., StructType.merge or _merge_type in Python's schema inference (https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1097-L1111).
see #19792 or #18521 as an example.
Hi @HyukjinKwon, after discussing with wenchen offline, we decided not to make this too complicated. If the schemas do not match, we now simply show both whole schemas to the user rather than only the mismatched fields, as the previous version did. Please see de036b6.
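A rough sketch of that simpler approach, assuming a single equality check followed by a message that prints both schemas in full. `WholeSchemaSketch`, `Schema`, `toDDL`, and `validate` are hypothetical names for illustration, and the message wording is not copied from de036b6.

```scala
object WholeSchemaSketch {
  // Hypothetical stand-in for a schema: an ordered list of (name, type) pairs.
  type Schema = Seq[(String, String)]

  def toDDL(schema: Schema): String =
    schema.map { case (name, dataType) => s"$name $dataType" }.mkString(", ")

  // Left carries the error text; Right means the schemas are identical.
  def validate(specified: Schema, actual: Schema): Either[String, Unit] =
    if (specified == actual) Right(())
    else Left(
      "User-specified schema does not match the actual schema: " +
        s"user-specified: ${toDDL(specified)}, actual: ${toDDL(actual)}")
}
```

Printing both schemas avoids any field-by-field bookkeeping, at the cost of leaving the reader to spot the difference in wide schemas.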