PHOENIX-7407 Remove deprecated datasource V1 code from spark2 and spark3 connector #145
stoty merged 1 commit into apache:master
Conversation
I added tests for datasource v1 while I was working on PHOENIX-6783, because there had been no tests before that.
I added steps in the GitHub Actions workflow to generate test reports. The steps failed with the error "Error: HttpError: Resource not accessible by integration".
💔 -1 overall
This message was automatically generated.
I have not forgotten about this, but I need to take care of some higher-priority stuff first.
I got some free time this weekend, so I wanted to test something. Please don't mind the upcoming pushes; I will post a message when I finish the updates. To make the review easier, I will create a new class with the kept and added tests for V1 plus the helper methods, and then remove the old class.
Yetus is supposed to be fixed, kicking off a CI run. |
Regarding the SaveMode changes:
Force-pushed d1ea8ba to fb0a8b1.
@stoty, done.
It should truncate the table and then insert the data.
Thanks.
Also there are still a few unused imports and similar Yetus warnings. |
Yes, we can log something at INFO level. In this PR, I kept the current behavior for Overwrite, which means it will not truncate before inserting data.
That's good. |
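The behavior discussed above can be sketched as follows (a minimal illustration, not the connector's code; the table name and ZooKeeper quorum are hypothetical, and running it requires a live Phoenix/HBase cluster):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("phoenix-overwrite-demo").getOrCreate()
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("ID", "COL1")

// With the compatibility source type, Overwrite does NOT truncate the
// target table first: rows are upserted just as with SaveMode.Append.
df.write
  .format("org.apache.phoenix.spark")
  .option("table", "MY_TABLE")          // hypothetical table name
  .option("zkUrl", "localhost:2181")    // hypothetical ZK quorum
  .mode(SaveMode.Overwrite)             // behaves the same as SaveMode.Append
  .save()
```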
Force-pushed b271f60 to a21759e.
stoty left a comment:
Please also fix the whitespace and codestyle issues reported by Yetus.
- JavaSparkContext jsc = new JavaSparkContext(sparkConf);
- SQLContext sqlContext = new SQLContext(jsc);
+ SparkSession spark = SparkUtil.getSparkSession();
Looks like spark.hadoopRDD.ignoreEmptySplits has been enabled by default since 3.2.0, so removing it should be OK.
I just replaced the old code with a call to the existing SparkUtil.getSparkSession() to avoid duplicate code.
That one doesn't set spark.hadoopRDD.ignoreEmptySplits for Spark 3 (which is fine).
- JavaSparkContext jsc = new JavaSparkContext(sparkConf);
- SQLContext sqlContext = new SQLContext(jsc);
+ SparkSession spark = SparkUtil.getSparkSession();
Since 3.2.0 / SPARK-34809, spark.hadoopRDD.ignoreEmptySplits is enabled by default.
However, this uses Spark 2. Shouldn't we keep that property for Spark 2?
I just replaced the old code with a call to the existing SparkUtil.getSparkSession() to avoid duplicate code, and the property is set there.
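A shared helper like the one referenced here lets every test reuse one session instead of building its own context. A hypothetical sketch (the real SparkUtil.getSparkSession() in the connector may differ; app name and master are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object SparkUtil {
  // Hypothetical sketch of a shared session factory for tests.
  // spark.hadoopRDD.ignoreEmptySplits must be set explicitly on Spark 2;
  // it is enabled by default only since Spark 3.2.0 (SPARK-34809).
  def getSparkSession(): SparkSession =
    SparkSession.builder()
      .appName("phoenix-spark-tests")
      .master("local[2]")
      .config("spark.hadoopRDD.ignoreEmptySplits", "true")
      .getOrCreate()
}
```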
@@ -1,3 +1,21 @@
/*
Thanks.
As you are the original author, we can handle this here instead of a separate ticket.
phoenix5-spark3/README.md (outdated)
- The `save` method also takes a `SaveMode` option, for which only `SaveMode.Append` is supported.
+ The `save` method also takes a `SaveMode` option, it is recommended to use `SaveMode.Append`.
+ For maintaining compatibility with source type `"org.apache.phoenix.spark"`,
+ `SaveMode.Overwrite` is accepted but it behave same way as `SaveMode.Append`.
grammar: "behaves the same way"
phoenix5-spark3/README.md (outdated)
- The `"org.apache.phoenix.spark"` datasource does not accept the `"jdbcUrl"` parameter,
- only `"zkUrl"`
+ - DataSourceV1 implementation was removed, source type `"org.apache.phoenix.spark"`
+ uses DatasourceV2 since connector 6.0.0 release.
* The class facilitates the creation of a batch write operation that is configured with the provided
* logical write information and options specific to the Phoenix data source.
*
* Note: Overwrite mode does not do truncate table and behave the same as Append mode.
This needs a rebase, as I merged PHOENIX-7561, @rejeb.
@stoty, how can I run the checks that Yetus does before pushing to the repository?
Not easily. In theory you can run Yetus locally, but setting that up is a big task; I would not recommend it (unless you plan to work on Yetus support).
Thanks, I'll wait for report instead. |
Force-pushed da4d8ec to 49eef1e.
I've been unable to fix the SpotBugs error. Seems like it is a false positive.
data
  .write
  .format("phoenix")
  .mode(SaveMode.Overwrite)
Since we don't actually implement Overwrite, wouldn't it make more sense to use Append here ?
This was in the spark2 connector; Append mode is available in spark3.
Would you like to add Append mode support for spark2?
If so, I think it is better to do it in a separate ticket.
Sure, it makes sense to handle that separately.
override def createRelation(sqlContext: SQLContext, mode: SaveMode,
                            parameters: Map[String, String], data: DataFrame): BaseRelation = {

  if (!mode.equals(SaveMode.Overwrite)) {
Is there an advantage to keeping this check?
Do we have tests that check for it?
The same check is done in the spark2 datasource V2 code, in org.apache.phoenix.spark.datasource.v2.writer.PhoenixDataSourceWriter.
IDK if there is a test for it.
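The check under discussion can be sketched in isolation (a simplified illustration; the exception type and message are assumptions, not the connector's actual code):

```scala
import org.apache.spark.sql.SaveMode

// Simplified guard mirroring the check in createRelation and in the
// spark2 V2 PhoenixDataSourceWriter: only SaveMode.Overwrite is accepted.
def validateSaveMode(mode: SaveMode): Unit = {
  if (!mode.equals(SaveMode.Overwrite)) {
    throw new IllegalArgumentException(
      s"SaveMode $mode is not supported; use SaveMode.Overwrite")
  }
}
```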
This mostly looks good; my last concern is whether we should force Overwrite in the compatibility layer (as noted above).
For spark3, SaveMode is forced to Append in the compatibility layer.
Removed datasource V1 classes.
Since "org.apache.phoenix.spark" uses the datasource V2 implementation, there is no need to keep all of the tests in PhoenixSparkDatasourceV1IT. I only kept the necessary tests, because the others duplicate the ones in PhoenixSparkIT. I also added steps in the GitHub Actions workflow to generate detailed test reports.