forked from apache/spark
-
Notifications
You must be signed in to change notification settings - Fork 1
Merge with master #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ordination for push-based shuffle ### What changes were proposed in this pull request? This is one of the patches for SPIP SPARK-30602 which is needed for push-based shuffle. Summary of changes: This PR introduces a new RPC to be called within Driver. When the expected shuffle push wait time reaches, Driver will call this RPC to facilitate coordination of shuffle map/reduce stages and notify external shuffle services to finalize shuffle block merge for a given shuffle. Shuffle services also respond back the metadata about a merged shuffle partition back to the caller. ### Why are the changes needed? Refer to the SPIP in SPARK-30602. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This code snippets won't be called by any existing code and will be tested after the coordinated driver changes gets merged in SPARK-32920. Lead-authored-by: Min Shen mshenlinkedin.com Closes #30163 from zhouyejoe/SPARK-32918. Lead-authored-by: Ye Zhou <[email protected]> Co-authored-by: Min Shen <[email protected]> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
…lts by version ### What changes were proposed in this pull request? In the discussion #30292 (comment), we planned to apply a new API key for each Spark release. However, it turns that DocSearch supports crawling multiple URLs from one website and filtering by fact key: https://docsearch.algolia.com/docs/config-file/#using-regular-expressions Thanks to the help from shortcuts, our Spark doc supports multiple version now: algolia/docsearch-configs#2868 This PR is to add the fact key in the search script and update the instruction in the comment. ### Why are the changes needed? To support filtering Spark documentation search results by the current document version. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test Closes #30469 from gengliangwang/apiKeyFollowUp. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request? This PR aims the followings. 1. Upgrade from Scala 2.13.3 to 2.13.4 for Apache Spark 3.1 2. Fix exhaustivity issues in both Scala 2.12/2.13 (Scala 2.13.4 requires this for compilation.) 3. Enforce the improved exhaustive check by using the existing Scala 2.13 GitHub Action compilation job. ### Why are the changes needed? Scala 2.13.4 is a maintenance release for 2.13 line and improves JDK 15 support. - https://github.com/scala/scala/releases/tag/v2.13.4 Also, it improves exhaustivity check. - scala/scala#9140 (Check exhaustivity of pattern matches with "if" guards and custom extractors) - scala/scala#9147 (Check all bindings exhaustively, e.g. tuples components) ### Does this PR introduce _any_ user-facing change? Yep. Although it's a maintenance version change, it's a Scala version change. ### How was this patch tested? Pass the CIs and do the manual testing. - Scala 2.12 CI jobs(GitHub Action/Jenkins UT/Jenkins K8s IT) to check the validity of code change. - Scala 2.13 Compilation job to check the compilation Closes #30455 from dongjoon-hyun/SCALA_3.13. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…hCode for `BucketTransform` ### What changes were proposed in this pull request? This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`. ### Why are the changes needed? SPARK-32168 made `InMemoryTable` to handle `BucketTransform` as a hash of `Tuple` which is dependents on Scala versions. - https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159 **Scala 2.12.10** ```scala $ bin/scala Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272). Type in expressions for evaluation. Or try :help. scala> (1, 1).hashCode res0: Int = -2074071657 ``` **Scala 2.13.3** ```scala Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272). Type in expressions for evaluation. Or try :help. scala> (1, 1).hashCode val res0: Int = -1669302457 ``` ### Does this PR introduce _any_ user-facing change? Yes. This is a correctness issue. ### How was this patch tested? Pass the UT with both Scala 2.12/2.13. Closes #30477 from dongjoon-hyun/SPARK-33524. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…iminationBenchmark ### What changes were proposed in this pull request? This patch adds predicate related benchmark to `SubExprEliminationBenchmark`. ### Why are the changes needed? We should have a benchmark for subexpression elimination of predicate. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Run benchmark locally. Closes #30476 from viirya/SPARK-33523. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…n specs ### What changes were proposed in this pull request? In the PR, I propose to changes the resolver of partition specs used in V2 `ALTER TABLE .. ADD/DROP PARTITION` (at the moment), and re-use `CAST` in conversion partition values to desired types according to the partition schema. ### Why are the changes needed? Currently, the resolver of V2 partition specs supports just a few types: https://github.com/apache/spark/blob/23e9920b3910e4f05269853429c7f18888cdc7b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolvePartitionSpec.scala#L72, and fails on other types like date/timestamp. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By running `AlterTablePartitionV2SQLSuite` Closes #30474 from MaxGekk/dsv2-partition-value-types. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ble to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `TRUNCATE TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `TRUNCATE TABLE` works only with v1 tables, and not supported for v2 tables. ### Why are the changes needed? The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior: ```scala sql("CREATE TEMPORARY VIEW t AS SELECT 1") sql("CREATE DATABASE db") sql("CREATE TABLE t using csv AS SELECT 1") sql("USE db") sql("TRUNCATE TABLE t") // Succeeds ``` With this PR, `TRUNCATE TABLE` above fails with the following: ``` org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$42(Analyzer.scala:866) ``` , which is expected since temporary view is resolved first and `TRUNCATE TABLE` doesn't support a temporary view. ### Does this PR introduce _any_ user-facing change? After this PR, `TRUNCATE TABLE` is resolved to a temp view `t` instead of table `db.t` in the above scenario. ### How was this patch tested? Updated existing tests. Closes #30457 from imback82/truncate_table. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…enkins script
### What changes were proposed in this pull request?
It seems that Jenkins tests tasks in many pr have test failed. The failed cases include:
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 get binary type`
- `org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 get binary type`
The error message as follows:
```
Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("Stacktracesbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
```
But they can pass the GitHub Action, maybe it's related to the `LANG` of the Jenkins build machine, this pr add `export LANG="en_US.UTF-8"` in `run-test-jenkins` script.
### Why are the changes needed?
Ensure LANG in Jenkins test process is `en_US.UTF-8` to pass `HIVE_CLI_SERVICE_PROTOCOL_VX` related tests
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Jenkins tests pass
Closes #30487 from LuciferYang/SPARK-33535.
Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Structured Streaming UI is not containing state custom metrics information. In this PR I've added it. ### Why are the changes needed? Missing state custom metrics information. ### Does this PR introduce _any_ user-facing change? Additional UI elements appear. ### How was this patch tested? Existing unit tests + manual test. ``` #Compile Spark echo "spark.sql.streaming.ui.enabledCustomMetricList stateOnCurrentVersionSizeBytes" >> conf/spark-defaults.conf sbin/start-master.sh sbin/start-worker.sh spark://gsomogyi-MBP16:7077 ./bin/spark-submit --master spark://gsomogyi-MBP16:7077 --deploy-mode client --class com.spark.Main ../spark-test/target/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar ``` <img width="1119" alt="Screenshot 2020-11-18 at 12 45 36" src="https://user-images.githubusercontent.com/18561820/99527506-2f979680-299d-11eb-9187-4ae7fbd2596a.png"> Closes #30336 from gaborgsomogyi/SPARK-33287. Authored-by: Gabor Somogyi <[email protected]> Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
### What changes were proposed in this pull request? This pull request: - Adds following flags to the main mypy configuration: - [`strict_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-strict_optional) - [`no_implicit_optional`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-no_implicit_optional) - [`disallow_untyped_defs`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-disallow_untyped_calls) These flags are enabled only for public API and disabled for tests and internal modules. Additionally, these PR fixes missing annotations. ### Why are the changes needed? Primary reason to propose this changes is to use standard configuration as used by typeshed project. This will allow us to be more strict, especially when interacting with JVM code. See for example #29122 (review) Additionally, it will allow us to detect cases where annotations have unintentionally omitted. ### Does this PR introduce _any_ user-facing change? Annotations only. ### How was this patch tested? `dev/lint-python`. Closes #30382 from zero323/SPARK-33457. Authored-by: zero323 <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
… MLlib (pyspark.mllib.*) ### What changes were proposed in this pull request? This PR proposes migration of `pyspark.mllib` to NumPy documentation style. ### Why are the changes needed? To improve documentation style. Before:  After:  ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? `dev/lint-python` and manual inspection. Closes #30413 from zero323/SPARK-33252. Authored-by: zero323 <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? This PR updates `ShuffleExchangeExec` to carry more information about how much we can change the partitioning. For `repartition(col)`, we should preserve the user-specified partitioning and don't apply the AQE local shuffle reader. ### Why are the changes needed? Similar to `repartition(number, col)`, we should respect the user-specified partitioning. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? a new test Closes #30432 from cloud-fan/aqe. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…eOrView to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `SHOW COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `SHOW COLUMNS` is not yet supported for v2 tables. ### Why are the changes needed? To use `UnresolvedTableOrView` for table/view resolution. Note that `ShowColumnsCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30490 from imback82/show_columns. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This PR proposes to add the watermark gap information in SS UI page. Please refer below screenshots to see what we'd like to show in UI.  Please note that this PR doesn't plot the watermark value - knowing the gap between actual wall clock and watermark looks more useful than the absolute value. ### Why are the changes needed? Watermark is the one of major metrics the end users need to track for stateful queries. Watermark defines "when" the output will be emitted for append mode, hence knowing how much gap between wall clock and watermark (input data) is very helpful to make expectation of the output. ### Does this PR introduce _any_ user-facing change? Yes, SS UI query page will contain the watermark gap information. ### How was this patch tested? Basic UT added. Manually tested with two queries: > simple case You'll see consistent watermark gap with (15 seconds + a) = 10 seconds are from delay in watermark definition, 5 seconds are trigger interval. ``` import org.apache.spark.sql.streaming.Trigger spark.conf.set("spark.sql.shuffle.partitions", "10") val query = spark .readStream .format("rate") .option("rowsPerSecond", 1000) .option("rampUpTime", "10s") .load() .selectExpr("timestamp", "mod(value, 100) as mod", "value") .withWatermark("timestamp", "10 seconds") .groupBy(window($"timestamp", "1 minute", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .writeStream .format("console") .trigger(Trigger.ProcessingTime("5 seconds")) .outputMode("append") .start() query.awaitTermination() ```  > complicated case This randomizes the timestamp, hence producing random watermark gap. This won't be smaller than 15 seconds as I described earlier. ``` import org.apache.spark.sql.streaming.Trigger spark.conf.set("spark.sql.shuffle.partitions", "10") val query = spark .readStream .format("rate") .option("rowsPerSecond", 1000) .option("rampUpTime", "10s") .load() .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod") .selectExpr("tsMod", "mod(value, 100) as mod", "value") .withWatermark("tsMod", "10 seconds") .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod") .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value")) .writeStream .format("console") .trigger(Trigger.ProcessingTime("5 seconds")) .outputMode("append") .start() query.awaitTermination() ```  Closes #30427 from HeartSaVioR/SPARK-33224. Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]> Signed-off-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
…n't consider case-sensitivity for properties ### What changes were proposed in this pull request? This PR fixes an issue that `BasicConnectionProvider` doesn't consider case-sensitivity for properties. For example, the property `oracle.jdbc.mapDateToTimestamp` should be considered case-sensitivity but it is not considered. ### Why are the changes needed? This is a bug introduced by #29024 . Caused by this issue, `OracleIntegrationSuite` doesn't pass. ``` [info] - SPARK-16625: General data types to be mapped to Oracle *** FAILED *** (32 seconds, 129 milliseconds) [info] types.apply(9).equals(org.apache.spark.sql.types.DateType) was false (OracleIntegrationSuite.scala:238) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.sql.jdbc.OracleIntegrationSuite.$anonfun$new$4(OracleIntegrationSuite.scala:238) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61) [info] at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) [info] at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) [info] at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:233) [info] at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) [info] at scala.collection.immutable.List.foreach(List.scala:392) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:233) [info] at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:232) [info] at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563) [info] at org.scalatest.Suite.run(Suite.scala:1112) [info] at org.scalatest.Suite.run$(Suite.scala:1094) [info] at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563) [info] at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:237) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:237) [info] at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:236) [info] at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61) [info] at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) [info] at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513) [info] at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With this change, I confirmed that `OracleIntegrationSuite` passes with the following command. ``` $ git clone https://github.com/oracle/docker-images.git $ cd docker-images/OracleDatabase/SingleInstance/dockerfiles $ ./buildDockerImage.sh -v 18.4.0 -x $ ORACLE_DOCKER_IMAGE_NAME=oracle/database:18.4.0-xe build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver "testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite" ``` Closes #30485 from sarutak/fix-oracle-integration-suite. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Hive Metastore supports strings and integral types in filters. It could also support dates. Please see [HIVE-5679](apache/hive@5106bf1) for more details. This pr add support it. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30408 from wangyum/SPARK-33477. Authored-by: Yuming Wang <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…ericToTimestamp ### What changes were proposed in this pull request? Remove SQL configuration spark.sql.legacy.allowCastNumericToTimestamp ### Why are the changes needed? In the current master branch, there is a new configuration `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether to cast Numeric types to Timestamp or not. The default value is true. After #30260, the type conversion between Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need to a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for disallowing the conversion. Users just need to set `spark.sql.ansi.enabled` for the behavior. As the configuration is not in any released yet, we should remove the configuration to make things simpler. ### Does this PR introduce _any_ user-facing change? No, since the configuration is not released yet. ### How was this patch tested? Existing test cases Closes #30493 from gengliangwang/LEGACY_ALLOW_CAST_NUMERIC_TO_TIMESTAMP. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…orts partition management ### What changes were proposed in this pull request? 1. Add new method `listPartitionByNames` to the `SupportsPartitionManagement` interface. It allows to list partitions by partition names and their values. 2. Implement new method in `InMemoryPartitionTable` which is used in DSv2 tests. ### Why are the changes needed? Currently, the `SupportsPartitionManagement` interface exposes only `listPartitionIdentifiers` which allows to list partitions by partition values. And it requires to specify all values for partition schema fields in the prefix. This restriction does not allow to list partitions by some of partition names (not all of them). For example, the table `tableA` is partitioned by two column `year` and `month` ``` CREATE TABLE tableA (price int, year int, month int) USING _ partitioned by (year, month) ``` and has the following partitions: ``` PARTITION(year = 2015, month = 1) PARTITION(year = 2015, month = 2) PARTITION(year = 2016, month = 2) PARTITION(year = 2016, month = 3) ``` If we want to list all partitions with `month = 2`, we have to specify `year` for **listPartitionIdentifiers()** which not always possible as we don't know all `year` values in advance. New method **listPartitionByNames()** allows to specify partition values only for `month`, and get two partitions: ``` PARTITION(year = 2015, month = 2) PARTITION(year = 2016, month = 2) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite `SupportsPartitionManagementSuite`. Closes #30452 from MaxGekk/column-names-listPartitionIdentifiers. Authored-by: Max Gekk <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…tion overwrite mode
### What changes were proposed in this pull request?
When using dynamic partition overwrite, each task has its working dir under staging dir like `stagingDir/.spark-staging-{jobId}`, each task commits to `outputPath/.spark-staging-{jobId}/{partitionId}/part-{taskId}-{jobId}{ext}`.
When speculation enable, multiple task attempts would be setup for one task, **they have same task id and they would commit to same file concurrently**. Due to host done or node preemption, the partly-committed files aren't cleaned up, a FileAlreadyExistsException would be raised in this situation, resulting in job failure.
I don't try to change task commit process for dynamic partition overwrite, like adding attempt id to task working dir for each attempts and committing to final output dir via a new outputCommitCoordinator, here is reason:
1. `FileOutputCommitter` already has commit coordinator for each task attempts, we can leverage it rather than build a new one.
2. To say the least, we implement a coordinator solving task attempts commit conflict, suppose a severe case, application master failover, tasks with same attempt id and same task id would commit to same files, the `FileAlreadyExistsException` risk still exists
In this pr, I leverage FileOutputCommitter to solve the problem:
1. when initing a write job description, set `outputPath/.spark-staging-{jobId}` as the output dir
2. each task attempt writes output to `outputPath/.spark-staging-{jobId}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/{partitionId}/part-{taskId}-{jobId}{ext}`
3. leverage `FileOutputCommitter` coordinator, write job firstly commits output to `outputPath/.spark-staging-{jobId}/{partitionId}`
4. for dynamic partition overwrite, write job finally move `outputPath/.spark-staging-{jobId}/{partitionId}` to `outputPath/{partitionId}`
### Why are the changes needed?
Without this pr, dynamic partition overwrite would fail
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
added UT.
Closes #29000 from WinkerDu/master-fix-dynamic-partition-multi-commit.
Authored-by: duripeng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? * Unify the create table syntax in the parser by merging Hive and DataSource clauses * Add `SerdeInfo` and `external` boolean to statement plans and update AstBuilder to produce them * Add conversion from create statement plan to v1 create plans in ResolveSessionCatalog * Support new statement clauses in ResolveCatalogs conversion to v2 create plans * Remove SparkSqlParser rules for Hive syntax * Add "option." namespace to distinguish SERDEPROPERTIES and OPTIONS in table properties ### Why are the changes needed? * Current behavior is confusing. * A way to pass the Hive create options to DSv2 is needed for a Hive source. ### Does this PR introduce any user-facing change? Not by default, but v2 sources will be able to handle STORED AS and other Hive clauses. ### How was this patch tested? Existing tests validate there are no behavior changes. Update unit tests for using a statement plan for Hive create syntax: * Move create tests from spark-sql DDLParserSuite into PlanResolutionSuite * Add parser tests to spark-catalyst DDLParserSuite Closes #28026 from rdblue/unify-create-table. Lead-authored-by: Ryan Blue <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? After #30260, there are some type conversions disallowed under ANSI mode. We should tell users what they can do if they have to use the disallowed casting. ### Why are the changes needed? Make it more user-friendly. ### Does this PR introduce _any_ user-facing change? Yes, the error message is improved on casting failure when ANSI mode is enabled ### How was this patch tested? Unit tests. Closes #30440 from gengliangwang/improveAnsiCastErrorMSG. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request? This patch proposes to support subexpression elimination for interpreted predicate. ### Why are the changes needed? Similar to interpreted projection, there are use cases when codegen predicate is not able to work, e.g. too complex schema, non-codegen expression, etc. When there are frequently occurring expressions (subexpressions) among predicate expression, the performance is quite bad as we need to re-compute same expressions. We should be able to support subexpression elimination for interpreted predicate like interpreted projection. ### Does this PR introduce _any_ user-facing change? No, this doesn't change user behavior. ### How was this patch tested? Unit test and benchmark. Closes #30497 from viirya/SPARK-33540. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR is a follow-up to fix Scala 2.13 compilation. ### Why are the changes needed? To support Scala 2.13 in Apache Spark 3.1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action Scala 2.13 compilation job. Closes #30502 from dongjoon-hyun/SPARK-31257. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? We supported Hive metastore are 0.12.0 through 3.1.2, but we supported hive-jdbc are 0.12.0 through 2.3.7. It will throw `TProtocolException` if we use hive-jdbc 3.x: ``` [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; Unexpected end of file when reading from HS2 server. The root cause might be too many concurrent connections. Please ask the administrator to check the number of active connections, and adjust hive.server2.thrift.max.worker.threads if applicable. Error: org.apache.thrift.transport.TTransportException (state=08S01,code=0) ``` ``` org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client? at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:234) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) at java.base/java.lang.Thread.run(Thread.java:832) ``` This pr upgrade hive-service-rpc to 3.1.2 to fix this issue. ### Why are the changes needed? To support hive-jdbc 3.x. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test: ``` [rootspark-3267648 apache-hive-3.1.2-bin]# bin/beeline -u jdbc:hive2://localhost:10000/default Connecting to jdbc:hive2://localhost:10000/default Connected to: Spark SQL (version 3.1.0-SNAPSHOT) Driver: Hive JDBC (version 3.1.2) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 3.1.2 by Apache Hive 0: jdbc:hive2://localhost:10000/default> create table t1(id int) using parquet; +---------+ | Result | +---------+ +---------+ No rows selected (1.051 seconds) 0: jdbc:hive2://localhost:10000/default> insert into t1 values(1); +---------+ | Result | +---------+ +---------+ No rows selected (2.08 seconds) 0: jdbc:hive2://localhost:10000/default> select * from t1; +-----+ | id | +-----+ | 1 | +-----+ 1 row selected (0.605 seconds) ``` Closes #30478 from wangyum/SPARK-33525. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? remove python 3.8 from python/run-tests.py and stop build breaks ### Why are the changes needed? the python tests are running against the bare-bones system install of python3, rather than an anaconda environment. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? via jenkins Closes #30506 from shaneknapp/remove-py38. Authored-by: shane knapp <[email protected]> Signed-off-by: shane knapp <[email protected]>
…EliminationBenchmark ### What changes were proposed in this pull request? Fix the wrong benchmark case name. ### Why are the changes needed? The last commit to refactor the benchmark code missed a change of case name. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30505 from viirya/SPARK-33523-followup. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? 1. Remove the fixed width style of class `container-fluid-div`. So that the UI looks clean when the text is long. 2. Add one space between a checkbox and the text on the right side, which is consistent with the stage page. ### Why are the changes needed? The width of class `container-fluid-div` is set as 200px after #21688 . This makes the checkbox in the executor page messy.  We should remove the width limit. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test. After the changes:  Closes #30500 from gengliangwang/reviseStyle. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…8 in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to keep the test coverage with Python 3.8 in GitHub Actions. It is not tested for now in Jenkins due to an env issue. **Before this change in GitHub Actions:** ``` ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.6', 'pypy3'] ... ``` **After this change in GitHub Actions:** ``` ======================================================================== Running PySpark tests ======================================================================== Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log Will test against the following Python executables: ['python3.6', 'python3.8', 'pypy3'] ``` ### Why are the changes needed? To keep the test coverage with Python 3.8 in GitHub Actions. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions in this build will test. Closes #30510 from HyukjinKwon/SPARK-33565. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request? This PR fixes an AQE issue where local shuffle reader, partition coalescing, or skew join optimization can be mistakenly applied to a shuffle introduced by repartition or a regular shuffle that logically replaces a repartition shuffle. The proposed solution checks for the presence of any repartition shuffle and filters out not applicable optimization rules for the final stage in an AQE plan. ### Why are the changes needed? Without the change, the output of a repartition query may not be correct. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added UT. Closes #30494 from maryannxue/csr-repartition. Authored-by: Maryann Xue <[email protected]> Signed-off-by: Xiao Li <[email protected]>
…s in PySpark and SparkR ### What changes were proposed in this pull request? This PR adds the following functions (introduced in Scala API with SPARK-33061): - `acosh` - `asinh` - `atanh` to Python and R. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? New functions. ### How was this patch tested? New unit tests. Closes #30501 from zero323/SPARK-33563. Authored-by: zero323 <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…on configurable when read CSV ### What changes were proposed in this pull request? There are some differences between Spark CSV, opencsv and commons-csv, the typical case are described in SPARK-33566, When there are both unescaped quotes and unescaped qualifier in value, the results of parsing are different. The reason for the difference is Spark use `STOP_AT_DELIMITER` as default `UnescapedQuoteHandling` to build `CsvParser` and it not configurable. On the other hand, opencsv and commons-csv use the parsing mechanism similar to `STOP_AT_CLOSING_QUOTE ` by default. So this pr make `unescapedQuoteHandling` option configurable to get the same parsing result as opencsv and commons-csv. ### Why are the changes needed? Make unescapedQuoteHandling option configurable when read CSV to make parsing more flexible。 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Add a new case similar to that described in SPARK-33566 Closes #30518 from LuciferYang/SPARK-33566. Authored-by: yangjie01 <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…R COLUMNS" on temporary views
### What changes were proposed in this pull request?
This PR proposes to fix the exception message for `ANALYZE TABLE ... FOR COLUMNS` on temporary views.
The current behavior throws `NoSuchTableException` even if the temporary view exists:
```
sql("CREATE TEMP VIEW t AS SELECT 1 AS id")
sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS id")
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 't' not found in database 'db';
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:76)
at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54)
```
After this PR, more reasonable exception is thrown:
```
org.apache.spark.sql.AnalysisException: Temporary view `testView` is not cached for analyzing columns.;
[info] at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.analyzeColumnInTempView(AnalyzeColumnCommand.scala:74)
[info] at org.apache.spark.sql.execution.command.AnalyzeColumnCommand.run(AnalyzeColumnCommand.scala:54)
```
### Why are the changes needed?
To fix a misleading exception.
### Does this PR introduce _any_ user-facing change?
Yes, the exception thrown is changed as shown above.
### How was this patch tested?
Updated existing test.
Closes #30519 from imback82/analyze_table_message.
Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…edTableOrView ### What changes were proposed in this pull request? This PR proposes to improve the exception messages while `UnresolvedTableOrView` is handled based on this suggestion: #30321 (comment). Currently, when an identifier is resolved to a temp view when a table/permanent view is expected, the following exception message is displayed (e.g., for `SHOW CREATE TABLE`): ``` t is a temp view not table or permanent view. ``` After this PR, the message will be: ``` t is a temp view. 'SHOW CREATE TABLE' expects a table or permanent view. ``` Also, if an identifier is not resolved, the following exception message is currently used: ``` Table or view not found: t ``` After this PR, the message will be: ``` Table or permanent view not found for 'SHOW CREATE TABLE': t ``` or ``` Table or view not found for 'ANALYZE TABLE ... FOR COLUMNS ...': t ``` ### Why are the changes needed? To improve the exception message. ### Does this PR introduce _any_ user-facing change? Yes, the exception message will be changed as described above. ### How was this patch tested? Updated existing tests. Closes #30475 from imback82/unresolved_table_or_view. Authored-by: Terry Kim <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Currently in Spark one could redefine a window. For instance:
`select count(*) OVER w FROM tenk1 WINDOW w AS (ORDER BY unique1), w AS (ORDER BY unique1);`
The window `w` is defined two times. In PgSQL, on the other hand, a thrown will happen:
`ERROR: window "w" is already defined`
### Why are the changes needed?
The current implement gives the following window definitions a higher priority. But it wasn't Spark's intention and users can't know from any document of Spark.
This PR fixes the bug.
### Does this PR introduce _any_ user-facing change?
Yes.
There is an example query output with/without this fix.
```
SELECT
employee_name,
salary,
first_value(employee_name) OVER w highest_salary,
nth_value(employee_name, 2) OVER w second_highest_salary
FROM
basic_pays
WINDOW
w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING),
w AS (ORDER BY salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 FOLLOWING)
ORDER BY salary DESC
```
The output before this fix:
```
Larry Bott 11798 Larry Bott Gerard Bondur
Gerard Bondur 11472 Larry Bott Gerard Bondur
Pamela Castillo 11303 Larry Bott Gerard Bondur
Barry Jones 10586 Larry Bott Gerard Bondur
George Vanauf 10563 Larry Bott Gerard Bondur
Loui Bondur 10449 Larry Bott Gerard Bondur
Mary Patterson 9998 Larry Bott Gerard Bondur
Steve Patterson 9441 Larry Bott Gerard Bondur
Julie Firrelli 9181 Larry Bott Gerard Bondur
Jeff Firrelli 8992 Larry Bott Gerard Bondur
William Patterson 8870 Larry Bott Gerard Bondur
Diane Murphy 8435 Larry Bott Gerard Bondur
Leslie Jennings 8113 Larry Bott Gerard Bondur
Gerard Hernandez 6949 Larry Bott Gerard Bondur
Foon Yue Tseng 6660 Larry Bott Gerard Bondur
Anthony Bow 6627 Larry Bott Gerard Bondur
Leslie Thompson 5186 Larry Bott Gerard Bondur
```
The output after this fix:
```
struct<>
-- !query output
org.apache.spark.sql.catalyst.parser.ParseException
The definition of window 'w' is repetitive(line 8, pos 0)
```
### How was this patch tested?
Jenkins test.
Closes #30512 from beliefer/SPARK-28645.
Lead-authored-by: gengjiaan <[email protected]>
Co-authored-by: beliefer <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…an't be parsed, or the pattern string is invalid ### What changes were proposed in this pull request? Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid, when ANSI mode is enable. This patch should update GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and Existing UT. Closes #30442 from leanken/leanken-SPARK-33498. Authored-by: xuewei.linxuewei <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This PR makes CreateViewCommand/AlterViewAsCommand capturing runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of the view resolution. Users can set `spark.sql.legacy.useCurrentConfigsForView` to `true` to restore the behavior before. ### Why are the changes needed? This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138) that proposes to unify temp view and permanent view behaviors. This PR makes permanent views mimicking the temp view behavior that "fixes" view semantic by directly storing resolved LogicalPlan. For example, if a user uses spark 2.4 to create a view that contains null values from division-by-zero expressions, she may not want that other users' queries which reference her view throw exceptions when running on spark 3.x with ansi mode on. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? added UT + existing UTs (improved) Closes #30289 from luluorta/SPARK-33141. Authored-by: luluorta <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…rs python ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `R` * `common` * `dev` * `mlib` * `external` * `project` * `streaming` * `resource-managers` * `python` Split per srowen #30323 (comment) NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356 ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python. Authored-by: Josh Soref <[email protected]> Signed-off-by: Sean Owen <[email protected]>
ayushi-agarwal
pushed a commit
that referenced
this pull request
Mar 6, 2021
…e are foldable boolean types ### What changes were proposed in this pull request? Improve `SimplifyConditionals`. Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`. Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`. The use case is: ```sql create table t1 using parquet as select id from range(10); select if (id > 2, false, true) from t1; ``` Before this pr: ``` == Physical Plan == *(1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == *(1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2] +- *(1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes apache#30849 from wangyum/SPARK-33798-2. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
ayushi-agarwal
pushed a commit
that referenced
this pull request
Mar 6, 2021
### What changes were proposed in this pull request? This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests): - https://github.com/apache/spark/runs/1592987501 - https://github.com/apache/spark/runs/1593196242 - https://github.com/apache/spark/runs/1595496305 - https://github.com/apache/spark/runs/1596309555 This is because the error message is different between test runs in GA (the error message seems to be truncated indeterministically) ,e.g., ``` # https://github.com/apache/spark/runs/1592987501 Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2 # https://github.com/apache/spark/runs/1593196242 Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2 ``` The root cause of this indeterministic behaviour happening only in GA is not clear though, this test throws SparkException consistently even in GA. So, this PR proposes to make the test just check if it will be thrown when running it. This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes apache#30896 from maropu/SPARK-32106-FOLLOWUP. Authored-by: Takeshi Yamamuro <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.