[SPARK-9381][SQL]Migrate JSON data source to the new partitioning data source #7696
Conversation
Test build #38545 has finished for PR 7696 at commit
This pattern match can be refactored to:

```scala
(this.inputRDD, that.inputRDD) match {
  case (Some(thisRDD), Some(thatRDD)) => thisRDD eq thatRDD
  case (None, None) => true
  case _ => false
}
```
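For context, here is a self-contained sketch of the suggested shape (`ExampleRelation` and `sameInput` are hypothetical stand-ins for the relation under review, not actual Spark names): the match treats two relations as having the same input only when both cache the same RDD instance, or neither caches one.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical relation wrapper holding an optional cached input RDD.
class ExampleRelation(val inputRDD: Option[RDD[String]]) {
  // Same input iff both sides cache the same RDD instance (reference
  // equality via `eq`) or neither side caches one.
  def sameInput(that: ExampleRelation): Boolean =
    (this.inputRDD, that.inputRDD) match {
      case (Some(thisRDD), Some(thatRDD)) => thisRDD eq thatRDD
      case (None, None) => true
      case _ => false
    }
}
```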
@chenghao-intel Would you please add a
Also, since the PR description is written into the Git commit log, please put a short description there instead of cc-ing me 😝
Thank you @liancheng, it did find a bug after enabling more unit tests. :)
retest this please
Test build #180 has finished for PR 7696 at commit
Test build #39219 has finished for PR 7696 at commit
Test build #39233 has finished for PR 7696 at commit
Test build #39364 has finished for PR 7696 at commit
Test build #39555 has finished for PR 7696 at commit
This is the last chance to refresh the file status; otherwise, we may not be able to pick up the latest files under the specified path, since we will pass `t.paths` to create the RDD later on.
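As a rough illustration of this ordering constraint (the trait and helper below are hypothetical, since the real `HadoopFsRelation` internals are private to Spark SQL):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Hypothetical stand-in for the relation's refresh/paths surface.
trait RefreshableRelation {
  def paths: Array[String]
  def refresh(): Unit // re-list files under `paths`
}

// Refresh must happen before the paths are handed off: once the RDD is
// built, it is bound to whatever files were visible at refresh time.
def scanWithFreshFiles(
    t: RefreshableRelation,
    buildRdd: Array[String] => RDD[Row]): RDD[Row] = {
  t.refresh()       // last chance to pick up newly written files
  buildRdd(t.paths)
}
```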
Test build #39659 has finished for PR 7696 at commit
Test build #39674 has finished for PR 7696 at commit
The failure seems unrelated.
retest this please
Test build #200 has finished for PR 7696 at commit
retest this please
Test build #202 has finished for PR 7696 at commit
Test build #39683 has finished for PR 7696 at commit
Test build #39691 has finished for PR 7696 at commit
retest this please
Test build #39791 has finished for PR 7696 at commit
Test build #219 has finished for PR 7696 at commit
Thanks for working on this! Merging to master.
[SPARK-9381][SQL] Migrate JSON data source to the new partitioning data source

Support partitioning for the JSON data source.

Still 2 open issues for `HadoopFsRelation`:

- `refresh()` will invoke `discoveryPartition()`, which will auto-infer the data types of the partition columns and may conflict with the given partition columns. (TODO: enable `HadoopFsRelationSuite."Partition column type casting"`)
- When inserting data into a cached `HadoopFsRelation`-based table, we need to invalidate the cache after the insertion. (TODO: enable `InsertSuite."Caching"`)

Author: Cheng Hao <[email protected]>

Closes #7696 from chenghao-intel/json and squashes the following commits:

d90b104 [Cheng Hao] revert the change for JacksonGenerator.apply
307111d [Cheng Hao] fix bug in the unit test
8738c8a [Cheng Hao] fix bug in unit testing
35f2cde [Cheng Hao] support partition for json format

(cherry picked from commit 519cf6d)
Signed-off-by: Reynold Xin <[email protected]>
PR #7696 added two `HadoopFsRelation.refresh()` calls ([this][1] and [this][2]) in `DataSourceStrategy` to make the test case `InsertSuite."save directly to the path of a JSON table"` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.

The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to it, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.

This PR removes those two expensive refresh calls and moves the refresh into `JSONRelation` to fix this issue. We might want to update the `HadoopFsRelation` interface to provide better support for this use case.

[1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
[2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91

Author: Cheng Lian <[email protected]>

Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:

ec1957d [Cheng Lian] Fixes JSONRelation refreshing

(cherry picked from commit e3fef0f)
Signed-off-by: Yin Huai <[email protected]>
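A minimal sketch of the direction described in this commit message, with hypothetical names (this is not the actual patch): move the refresh out of the generic strategy and into the JSON relation itself, so only JSON scans pay for re-listing files, and they do it right before building the base RDD.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical relation: keeps a cached leaf-file listing and refreshes
// it itself at scan time, instead of relying on DataSourceStrategy.
class SketchJsonRelation(paths: Array[String], hadoopConf: Configuration) {
  @volatile private var leafFiles: Array[FileStatus] = Array.empty

  // Re-list the files under each input path.
  def refresh(): Unit = {
    leafFiles = paths.flatMap { p =>
      val path = new Path(p)
      val fs = path.getFileSystem(hadoopConf)
      if (fs.exists(path)) fs.listStatus(path) else Array.empty[FileStatus]
    }
  }

  // Refresh just before the scan so newly written files are visible.
  def buildScan(): Array[FileStatus] = {
    refresh()
    leafFiles
  }
}
```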
Support partitioning for the JSON data source.

Still 2 open issues for `HadoopFsRelation`:

- `refresh()` will invoke `discoveryPartition()`, which will auto-infer the data types of the partition columns and may conflict with the given partition columns. (TODO: enable `HadoopFsRelationSuite."Partition column type casting"`)
- When inserting data into a cached `HadoopFsRelation`-based table, we need to invalidate the cache after the insertion. (TODO: enable `InsertSuite."Caching"`)
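To show what the migration enables end to end, here is a rough usage sketch (paths and data are made up; assumes a Spark 1.5-era `SQLContext`): writing JSON partitioned by a column lays the files out as `.../year=2015/part-*.json`, and reading the directory back discovers `year` as a partition column with an inferred type, which is exactly where the first open issue above can bite.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartitionedJsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("json-partitioning").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, "a", 2015), (2, "b", 2016)))
      .toDF("id", "name", "year")

    // Write JSON partitioned by `year`: layout is .../year=2015/part-*.json
    df.write.mode("overwrite").partitionBy("year").json("/tmp/events_json")

    // Partition discovery recovers `year` from the directory names and
    // infers its type, then it can be used as an ordinary filter column.
    val loaded = sqlContext.read.json("/tmp/events_json")
    loaded.filter($"year" === 2015).show()

    sc.stop()
  }
}
```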