[SPARK-9381][SQL]Migrate JSON data source to the new partitioning data source #7696
Conversation
Test build #38545 has finished for PR 7696 at commit
This pattern match can be refactored to:

```scala
(this.inputRDD, that.inputRDD) match {
  case (Some(thisRDD), Some(thatRDD)) => thisRDD eq thatRDD
  case (None, None) => true
  case _ => false
}
```
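For context, here is a self-contained sketch of the suggested shape (`ExampleRelation` and `sameInput` are hypothetical stand-ins for the relation under review, not actual Spark names): the match treats two relations as having the same input only when both cache the same RDD instance, or neither caches one.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical relation wrapper holding an optional cached input RDD.
class ExampleRelation(val inputRDD: Option[RDD[String]]) {
  // Same input iff both sides cache the same RDD instance (reference
  // equality via `eq`) or neither side caches one.
  def sameInput(that: ExampleRelation): Boolean =
    (this.inputRDD, that.inputRDD) match {
      case (Some(thisRDD), Some(thatRDD)) => thisRDD eq thatRDD
      case (None, None) => true
      case _ => false
    }
}
```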
@chenghao-intel Would you please add a
Also, since the PR description is written into the Git commit log, please put a short description there instead of cc-ing me 😝
Thank you @liancheng, it did find a bug after enabling more unit tests. :)
retest this please
Test build #180 has finished for PR 7696 at commit
Test build #39219 has finished for PR 7696 at commit
Test build #39233 has finished for PR 7696 at commit
Test build #39364 has finished for PR 7696 at commit
Test build #39555 has finished for PR 7696 at commit
This is the last chance to refresh the file status; otherwise, we may not be able to pick up the latest files under the specified path, since we will pass `t.paths` to create the RDD later on.
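As a rough illustration of this ordering constraint (the trait and helper below are hypothetical, since the real `HadoopFsRelation` internals are private to Spark SQL):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical stand-in for the relation's refresh/paths surface.
trait RefreshableRelation {
  def paths: Array[String]
  def refresh(): Unit // re-list files under `paths`
}

// Refresh must happen before the paths are handed off: once the RDD is
// built, it is bound to whatever files were visible at refresh time.
def scanWithFreshFiles(
    t: RefreshableRelation,
    buildRdd: Array[String] => RDD[Row]): RDD[Row] = {
  t.refresh()       // last chance to pick up newly written files
  buildRdd(t.paths)
}
```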
Test build #39659 has finished for PR 7696 at commit
Test build #39674 has finished for PR 7696 at commit
The failure seems unrelated.
retest this please
Test build #200 has finished for PR 7696 at commit
retest this please
Test build #202 has finished for PR 7696 at commit
Test build #39683 has finished for PR 7696 at commit
Test build #39691 has finished for PR 7696 at commit
retest this please
Test build #39791 has finished for PR 7696 at commit
Test build #219 has finished for PR 7696 at commit
Thanks for working on this! Merging to master.
[SPARK-9381][SQL] Migrate JSON data source to the new partitioning data source

Support partitioning for the JSON data source.

Still 2 open issues for `HadoopFsRelation`:

- `refresh()` will invoke `discoveryPartition()`, which will auto-infer the data types of the partition columns and may conflict with the given partition columns. (TODO: enable `HadoopFsRelationSuite."Partition column type casting"`)
- When inserting data into a cached `HadoopFsRelation`-based table, we need to invalidate the cache after the insertion. (TODO: enable `InsertSuite."Caching"`)

Author: Cheng Hao <[email protected]>

Closes #7696 from chenghao-intel/json and squashes the following commits:

d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply
307111d [Cheng Hao] fix bug in the unit test
8738c8a [Cheng Hao] fix bug in unit testing
35f2cde [Cheng Hao] support partition for json format

(cherry picked from commit 519cf6d)
Signed-off-by: Reynold Xin <[email protected]>
PR #7696 added two `HadoopFsRelation.refresh()` calls ([this][1] and [this][2]) in `DataSourceStrategy` to make the test case `InsertSuite."save directly to the path of a JSON table"` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to it, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.

This PR removes those two expensive refresh calls and moves the refresh into `JSONRelation` to fix this issue. We might want to update the `HadoopFsRelation` interface to provide better support for this use case.

[1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
[2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

Author: Cheng Lian <[email protected]>

Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:

ec1957d [Cheng Lian] Fixes JSONRelation refreshing

(cherry picked from commit e3fef0f)
Signed-off-by: Yin Huai <[email protected]>
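A minimal sketch of the direction described in this commit message, with hypothetical names (this is not the actual patch): move the refresh out of the generic strategy and into the JSON relation itself, so only JSON scans pay for re-listing files, and they do it right before building the base RDD.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical relation: keeps a cached leaf-file listing and refreshes
// it itself at scan time, instead of relying on DataSourceStrategy.
class SketchJsonRelation(paths: Array[String], hadoopConf: Configuration) {
  @volatile private var leafFiles: Array[FileStatus] = Array.empty

  // Re-list the files under each input path.
  def refresh(): Unit = {
    leafFiles = paths.flatMap { p =>
      val path = new Path(p)
      val fs = path.getFileSystem(hadoopConf)
      if (fs.exists(path)) fs.listStatus(path) else Array.empty[FileStatus]
    }
  }

  // Refresh just before the scan so newly written files are visible.
  def buildScan(): Array[FileStatus] = {
    refresh()
    leafFiles
  }
}
```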
Support partitioning for the JSON data source.

Still 2 open issues for `HadoopFsRelation`:

- `refresh()` will invoke `discoveryPartition()`, which will auto-infer the data types of the partition columns and may conflict with the given partition columns. (TODO: enable `HadoopFsRelationSuite."Partition column type casting"`)
- When inserting data into a cached `HadoopFsRelation`-based table, we need to invalidate the cache after the insertion. (TODO: enable `InsertSuite."Caching"`)
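To show what the migration enables end to end, here is a rough usage sketch (paths and data are made up; assumes a Spark 1.5-era `SQLContext`): writing JSON partitioned by a column lays the files out as `.../year=2015/part-*.json`, and reading the directory back discovers `year` as a partition column with an inferred type, which is exactly where the first open issue above can bite.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionedJsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("json-partitioning").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a", 2015), (2, "b", 2016)))
      .toDF("id", "name", "year")

    // Write JSON partitioned by `year`: layout is .../year=2015/part-*.json
    df.write.mode("overwrite").partitionBy("year").json("/tmp/events_json")

    // Partition discovery recovers `year` from the directory names and
    // infers its type, then it can be used as an ordinary filter column.
    val loaded = sqlContext.read.json("/tmp/events_json")
    loaded.filter($"year" === 2015).show()

    sc.stop()
  }
}
```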