@dongjoon-hyun
What changes were proposed in this pull request?

This PR fixes the RAT exclusion rule that originated in SPARK-1144 (Apache Spark 1.0).

Why are the changes needed?

This prevents situations like #30415.

Currently, the check misses the `catalog` directory because of the `.log` exclusion rule.

$ dev/check-license
Could not find Apache license headers in the following files:
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
 !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
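To see why the old pattern over-matched, note that an unescaped dot in a RAT exclusion regex matches any character. A hedged illustration (the exact pattern in `dev/.rat-excludes` may differ):

```scala
// illustration only; the real exclusion pattern is an assumption
val loose  = ".*.log.*".r    // unescaped dot: "." matches the 'a' in "catalog"
val strict = ".*\\.log$".r   // escaped dot, anchored: matches only real *.log files

val path = "sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java"
println(loose.findFirstIn(path).isDefined)   // true  -- source file wrongly excluded
println(strict.findFirstIn(path).isDefined)  // false -- source file gets checked
```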

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CI with the new rule.

titsuki and others added 30 commits July 26, 2020 09:13
…istently print the metrics on driver's stdout

### What changes were proposed in this pull request?

Call `collect` on the RDD before calling `foreach` so that the result is sent to the driver node and printed on that node's stdout.

### Why are the changes needed?

Some RDDs in this example (e.g., precision, recall) call `println` without calling `collect`.
If the job runs in local mode, the data is sent to the driver node and the metrics are printed on the driver's stdout.
However, if the job runs in cluster mode, the metrics are printed on the executors' stdout.
This is inconsistent with the metrics that have nothing to do with RDDs (e.g., auPRC, auROC), since those always output their results on the driver's stdout.
All of the metrics should output their results on the driver's stdout.
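A minimal sketch of the fix (here `precision` stands in for one of the metric RDDs in the example):

```scala
// collect() brings the metric values to the driver, so println runs there,
// regardless of local or cluster mode
precision.collect().foreach { case (threshold, p) =>
  println(s"Threshold: $threshold, Precision: $p")
}
```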

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is example code. It doesn't have any tests.

Closes #29222 from titsuki/SPARK-32428.

Authored-by: Itsuki Toyota <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 86ead04)
Signed-off-by: Sean Owen <[email protected]>
…alatest-maven-plugin

### What changes were proposed in this pull request?

Two different versions are currently in use for each of the artifacts `exec-maven-plugin` and `scalatest-maven-plugin`. This PR aims to use a single version for each. In addition, this PR removes `scala-maven-plugin.version` from the `K8s` integration suite because it is unused.

### Why are the changes needed?

This prevents the mistake of upgrading the version in one place and forgetting the others.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins K8S IT.

Closes #29248 from dongjoon-hyun/SPARK-32448.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 13c64c2)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Update the SQL reference docs; the following keywords are added in this PR:

CASE/ELSE
WHEN/THEN
MAP KEYS TERMINATED BY
NULL DEFINED AS
LINES TERMINATED BY
ESCAPED BY
COLLECTION ITEMS TERMINATED BY
PIVOT
LATERAL VIEW OUTER?
ROW FORMAT SERDE
ROW FORMAT DELIMITED
FIELDS TERMINATED BY
IGNORE NULLS
FIRST
LAST

### Why are the changes needed?
Let more users know how these SQL keywords are used.

### Does this PR introduce _any_ user-facing change?
![image](https://user-images.githubusercontent.com/46367746/88148830-c6dc1f80-cc31-11ea-81ea-13bc9dc34550.png)
![image](https://user-images.githubusercontent.com/46367746/88148968-fb4fdb80-cc31-11ea-8649-e8297cf5813e.png)
![image](https://user-images.githubusercontent.com/46367746/88149000-073b9d80-cc32-11ea-9aa4-f914ecd72663.png)
![image](https://user-images.githubusercontent.com/46367746/88149021-0f93d880-cc32-11ea-86ed-7db8672b5aac.png)

### How was this patch tested?
No tests needed; this is a documentation-only change.

Closes #29056 from GuoPhilipse/add-missing-keywords.

Lead-authored-by: GuoPhilipse <[email protected]>
Co-authored-by: GuoPhilipse <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 8de4333)
Signed-off-by: Takeshi Yamamuro <[email protected]>
### What changes were proposed in this pull request?
Fixes spacing in an error message

### Why are the changes needed?
Makes error messages easier to read

### Does this PR introduce _any_ user-facing change?
Yes, it changes the error message

### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it

Closes #29264 from hauntsaninja/patch-1.

Authored-by: Shantanu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 77f2ca6)
Signed-off-by: HyukjinKwon <[email protected]>
…if overflow happens

This PR backports d315ebf to branch-3.0
### What changes were proposed in this pull request?

When using the `Seconds.toMicros` API to convert epoch seconds to microseconds, overflow is silently saturated, per the Javadoc:

```scala
 /**
  * Equivalent to
  * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}.
  * @param duration the duration
  * @return the converted duration,
  * or {@code Long.MIN_VALUE} if conversion would negatively
  * overflow, or {@code Long.MAX_VALUE} if it would positively overflow.
  */
```
This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)`.
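A small sketch of the behavioral difference (`MICROS_PER_SECOND` is defined locally here; in Spark it lives in `DateTimeConstants`):

```scala
import java.util.concurrent.TimeUnit

val MICROS_PER_SECOND = 1000000L
val epochSeconds = Long.MaxValue / 1000L  // large enough to overflow microseconds

TimeUnit.SECONDS.toMicros(epochSeconds)              // saturates to Long.MaxValue, silently wrong
Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)  // throws ArithmeticException: long overflow
```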

### Why are the changes needed?

Fix a silent data change between 3.x and 2.x:
```
 ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722   bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
+294247-01-10 12:00:54.775807
```
```
 kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S  -e "select to_timestamp('300000', 'y');"
284550-10-19 15:58:1010.448384
```

### Does this PR introduce _any_ user-facing change?

Yes, we will raise `ArithmeticException` instead of giving the wrong answer if overflow happens.

### How was this patch tested?

Added a unit test.

Closes #29267 from yaooqinn/SPARK-32424-30.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Rewrite a clearer and complete BLAS native acceleration enabling guide.

### Why are the changes needed?
The document of enabling BLAS native acceleration in ML guide (https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete and unclear to the user.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

Closes #29139 from xwu99/blas-doc.

Lead-authored-by: Xiaochang Wu <[email protected]>
Co-authored-by: Wu, Xiaochang <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
(cherry picked from commit 44c868b)
Signed-off-by: Huaxin Gao <[email protected]>
### What changes were proposed in this pull request?
`spark.kryo.registrator` in 3.0 has a regression. Since [SPARK-12080](https://issues.apache.org/jira/browse/SPARK-12080), it has supported multiple user registrators:
```scala
private val userRegistrators = conf.get("spark.kryo.registrator", "")
    .split(',').map(_.trim)
    .filter(!_.isEmpty)
```
But this doesn't work in 3.0. Fix it with `toSequence` in `Kryo.scala`.
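A hedged usage example with made-up registrator class names:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // a comma-separated list of registrators (class names are hypothetical)
  .set("spark.kryo.registrator",
    "com.example.RegistratorA,com.example.RegistratorB")
```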

### Why are the changes needed?
In previous Spark versions (2.x), multiple user registrators were supported:
```scala
private val userRegistrators = conf.get("spark.kryo.registrator", "")
    .split(',').map(_.trim)
    .filter(!_.isEmpty)
```
But this doesn't work in 3.0, so it is a regression.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #29123 from LantaoJin/SPARK-32283.

Authored-by: LantaoJin <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 26e6574)
Signed-off-by: Wenchen Fan <[email protected]>
…Plugin and starting heartbeat thread

### What changes were proposed in this pull request?

This PR changes the order between initialization for ExecutorPlugin and starting heartbeat thread in Executor.

### Why are the changes needed?

In the current master, the heartbeat thread in an executor starts after plugin initialization, so if the initialization takes a long time, no heartbeat is sent to the driver and the executor will be removed from the cluster.
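A conceptual sketch of the reordering (`initializePlugins` is a hypothetical helper, not the actual `Executor` code):

```scala
// start sending heartbeats before plugin initialization, so a slow
// plugin init can no longer get the executor marked as lost
heartbeater.start()
val plugins = initializePlugins()  // may now take a long time without harm
```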

### Does this PR introduce _any_ user-facing change?

Yes. Plugins for executors will be allowed to take a long time for initialization.

### How was this patch tested?

New test case.

Closes #29002 from sarutak/fix-heartbeat-issue.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit 9be0883)
Signed-off-by: Thomas Graves <[email protected]>
…consistent between modules

### What changes were proposed in this pull request?

Upgrade the Codehaus Maven build-helper plugin to allow specifying a timestamp during the build, to avoid snapshot artifacts with different version strings.

### Why are the changes needed?

During snapshot builds, Maven may assign different versions to different artifacts based on the time each individual sub-module starts building.

The timestamp is used as part of the version string when running `mvn deploy` on a snapshot build. This results in different sub-modules having different version strings.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual build while specifying the current time; ensured the time is consistent across the sub-components.

Open question: ideally I'd like to backport this as well since it's sort of a bug fix, and while it does change a dependency version, it's not one that is propagated. I'd like to hear folks' thoughts about this.

Closes #29274 from holdenk/SPARK-32397-snapshot-artifact-timestamp-differences.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
(cherry picked from commit 50911df)
Signed-off-by: DB Tsai <[email protected]>
…pply with Arrow vectorization

### What changes were proposed in this pull request?

This PR proposes to:

1. Fix the error message when the output schema is mismatched with the R DataFrame returned from the given function. For example,

    ```R
    df <- createDataFrame(list(list(a=1L, b="2")))
    count(gapply(df, "a", function(key, group) { group }, structType("a int, b int")))
    ```

    **Before:**

    ```
    Error in handleErrors(returnStatus, conn) :
      ...
      java.lang.UnsupportedOperationException
	    ...
    ```

    **After:**

    ```
    Error in handleErrors(returnStatus, conn) :
     ...
     java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType
        ...
    ```

2. Update documentation about the schema matching for `gapply` and `dapply`.

### Why are the changes needed?

To show which schema does not match, and let users know what's going on.

### Does this PR introduce _any_ user-facing change?

Yes, the error message is updated as above, and the documentation is updated.

### How was this patch tested?

Manually tested, and unit tests were added.

Closes #29283 from HyukjinKwon/r-vectorized-error.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
Fix regression bug in load-spark-env.cmd with Spark 3.0.0

cmd doesn't support setting an environment variable twice, so `set SPARK_ENV_CMD=%SPARK_CONF_DIR%\%SPARK_ENV_CMD%` doesn't take effect, which caused the regression.

No

Manually tested.
1. Create a spark-env.cmd under the conf folder containing `echo spark-env.cmd`.
2. Run the old load-spark-env.cmd: nothing is printed in the output.
3. Run the fixed load-spark-env.cmd: `spark-env.cmd` shows up in the output.

Closes #29044 from warrenzhu25/32227.

Lead-authored-by: Warren Zhu <[email protected]>
Co-authored-by: Warren Zhu <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 7437720)
Signed-off-by: HyukjinKwon <[email protected]>
### What changes were proposed in this pull request?

This PR removes a test added in SPARK-32175 (#29002).

### Why are the changes needed?

That test is flaky. It could be mitigated by increasing the timeout, but it's simpler to just remove the test.
See also the [discussion](#29002 (comment)).

### Does this PR introduce _any_ user-facing change?

No.

Closes #29314 from sarutak/remove-flaky-test.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Kousuke Saruta <[email protected]>
(cherry picked from commit 9d7b1d9)
Signed-off-by: Kousuke Saruta <[email protected]>
### What changes were proposed in this pull request?
Backports SPARK-32332 to 3.0 branch.

### Why are the changes needed?
Plugins cannot replace exchanges with columnar versions when AQE is enabled without this patch.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tests included.

Closes #29310 from andygrove/backport-SPARK-32332.

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
### What changes were proposed in this pull request?

When HTTPS is enabled for the Spark UI, an HTTP request will be redirected to an encoded HTTPS URL: https://github.com/apache/spark/pull/10238/files#diff-f79a5ead735b3d0b34b6b94486918e1cR312

When we create the redirect URL, we call `getRequestURI` and `getQueryString`. Both methods may return an encoded string. However, we pass them directly to the following URI constructor:
```
URI(String scheme, String authority, String path, String query, String fragment)
```
As this URI constructor assumes both the path and query parameters are decoded strings, it encodes them again. This makes the redirect URL encoded twice.

This problem shows up on the stage page with HTTPS enabled. The URL of "/taskTable" contains the query parameter `order%5B0%5D%5Bcolumn%5D`. After being encoded again it becomes `order%255B0%255D%255Bcolumn%255D`, which is then decoded as `order%5B0%5D%5Bcolumn%5D` instead of `order[0][column]`. When the parameter `order[0][dir]` is missing, there is an exception from:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
and the stage page fails to load.

To fix the problem, we decode the query parameters before encoding them again. This makes sure the URL ends up encoded exactly once.
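A sketch of the idea (simplified; the real code works with the path and query taken from the servlet request):

```scala
import java.net.{URI, URLDecoder}

val rawQuery = "order%5B0%5D%5Bcolumn%5D=0"          // already encoded by the client
val decoded  = URLDecoder.decode(rawQuery, "UTF-8")  // order[0][column]=0
// the multi-argument URI constructor encodes its arguments,
// so pass decoded strings to end up encoded exactly once
val redirect = new URI("https", "host:4040", "/stages/stage/taskTable", decoded, null)
println(redirect.toASCIIString)  // contains order%5B0%5D..., not order%255B...
```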

### Why are the changes needed?

Fix a UI issue when HTTPS is enabled

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new Unit test + manually test on a cluster

Closes #29271 from gengliangwang/urlEncode.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 71aea02)
Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request?
This PR fixes issues related to the canonicalization of `FileSourceScanExec` when it contains an unused DPP filter.

### Why are the changes needed?

As part of the `PlanDynamicPruningFilter` rule, unused DPP filters are simply replaced by `DynamicPruningExpression(TrueLiteral)` so that they can be avoided. But these unnecessary `DynamicPruningExpression(TrueLiteral)` partition filters inside `FileSourceScanExec` affect the canonicalization of the node, and in many cases this can prevent `ReuseExchange` from happening.

This PR fixes the issue by ignoring the unused DPP filters in the `doCanonicalize` method.
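A minimal sketch of the idea (not the exact `FileSourceScanExec` code):

```scala
import org.apache.spark.sql.catalyst.expressions.{DynamicPruningExpression, Literal}

// ignore no-op DPP filters when canonicalizing, so two otherwise-identical
// scans still compare equal and ReuseExchange can kick in
val filtersForCanonicalization = partitionFilters.filterNot {
  case DynamicPruningExpression(Literal.TrueLiteral) => true
  case _ => false
}
```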

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT.

Closes #29318 from prakharjain09/SPARK-32509_df_reuse.

Authored-by: Prakhar Jain <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 7a09e71)
Signed-off-by: Wenchen Fan <[email protected]>
…Partition

This is a partial backport of #29307

Most of the changes are not needed because #28226 is in master only.

This PR only backports the safeguard in `ShuffleExchangeExec.canChangeNumPartitions`.

Closes #29321 from cloud-fan/aqe.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…ow to create SparkContext in executors

### What changes were proposed in this pull request?

This is a backport of #29278, but it allows creating `SparkContext` in executors by default.

This PR adds a config to switch between allowing and disallowing the creation of `SparkContext` in executors:

- `spark.driver.allowSparkContextInExecutors`
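A hedged usage sketch of the config added here:

```scala
import org.apache.spark.sql.SparkSession

// opt back into the strict behavior that fails SparkContext creation in executors
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.driver.allowSparkContextInExecutors", "false")
  .getOrCreate()
```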

### Why are the changes needed?

Some users or libraries actually create `SparkContext` in executors.
We shouldn't break their workloads.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to disallow creating `SparkContext` in executors by disabling the config.

### How was this patch tested?

More tests are added.

Closes #29294 from ueshin/issues/SPARK-32160/3.0/add_configs.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
… switch allow/disallow SparkContext in executors

### What changes were proposed in this pull request?

This is a follow-up of #29294.
This PR changes the config name that switches between allowing and disallowing `SparkContext` in executors, as per the comment in #29278 (review).

### Why are the changes needed?

The config name `spark.executor.allowSparkContext` is more reasonable.

### Does this PR introduce _any_ user-facing change?

Yes, the config name is changed.

### How was this patch tested?

Updated tests.

Closes #29341 from ueshin/issues/SPARK-32160/3.0/change_config_name.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ister outputs for executor on fetch failure after executor is lost

### What changes were proposed in this pull request?

If an executor is lost, the `DAGScheduler` handles the executor loss by removing the executor but does not unregister its outputs if the external shuffle service is used. However, if the node on which the executor runs is lost, the shuffle service may not be able to serve the shuffle files.
In such a case, when fetches from the executor's outputs fail in the same stage, the `DAGScheduler` again removes the executor and, by rights, should unregister its outputs. It doesn't, because the epoch used to track the executor failure has not increased.

We track the epoch for failed executors that result in lost file output separately, so we can unregister the outputs in this scenario. The idea to track a second epoch is due to Attila Zsolt Piros.

### Why are the changes needed?

Without the changes, the loss of a node could require two stage attempts to recover instead of one.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test. This test fails without the change and passes with it.

Closes #29193 from wypoon/SPARK-32003-3.0.

Authored-by: Wing Yew Poon <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
… status change

### What changes were proposed in this pull request?
This PR adds a `FileNotFoundException` try-catch block while adding a new entry to the history server application listing, to skip non-existing paths.

### Why are the changes needed?
If there are a large number (>100k) of applications in the log dir, listing the log dir takes a few seconds. After getting the path list, some applications might have finished already, and their filenames change from `foo.inprogress` to `foo`.

This leads to a problem when adding an entry to the listing: querying the file status, e.g. `fileSizeForLastIndex`, throws a `FileNotFoundException` if the application has finished. The exception aborts the current loop, and in a busy cluster it can leave the history server unable to list or load any application logs.

```
20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates
 java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527)
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520)
 at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170)
```
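A minimal sketch of the safeguard (assumed shape; `processLogEntry` is a hypothetical helper, not the actual `FsHistoryProvider` code):

```scala
import java.io.FileNotFoundException

try {
  processLogEntry(reader)  // stats the file while adding it to the listing
} catch {
  case _: FileNotFoundException =>
    // the app finished and foo.inprogress was renamed; pick it up on the next scan
    ()
}
```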

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Set up a script that keeps changing the filenames of applications under the history log dir.
2. Launch the history server.
3. Check that the `File does not exist` error log is gone.

Closes #29350 from yanxiaole/SPARK-32529.

Authored-by: Yan Xiaole <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit c1d17df)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
Get table names directly from the sequence of Hive tables in `HiveClientImpl.listTablesByType()`, skipping the conversion of Hive tables to Catalog tables.

### Why are the changes needed?
A Hive metastore can be shared across many clients. A client can create tables using a SerDe that is not available on other clients, for instance `ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"`. In the current implementation, other clients get the following exception while getting views:
```
java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
```
when `com.ibm.spss.hive.serde2.xml.XmlSerDe` is not available.
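A rough sketch of the approach, with hypothetical helper names; the point is to read the table names directly instead of converting each table to a `CatalogTable`, which would instantiate its (possibly missing) SerDe:

```scala
// hiveTablesByType and HiveTableType are assumptions standing in for the
// Hive client shim; only the names are read, so no SerDe is loaded
val viewNames = hiveTablesByType(db, pattern, HiveTableType.VIRTUAL_VIEW)
  .map(_.getTableName)
```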

### Does this PR introduce _any_ user-facing change?
Yes. For example, `SHOW VIEWS` returns a list of views instead of throwing an exception.

### How was this patch tested?
- By existing test suites like:
```
$ build/sbt -Phive-2.3 "test:testOnly org.apache.spark.sql.hive.client.VersionsSuite"
```
- And manually:

1. Build Spark with Hive 1.2: `./build/sbt package -Phive-1.2 -Phive -Dhadoop.version=2.8.5`

2. Run spark-shell with a custom Hive SerDe, for instance download [json-serde-1.3.8-jar-with-dependencies.jar](https://github.com/cdamak/Twitter-Hive/blob/master/json-serde-1.3.8-jar-with-dependencies.jar) from https://github.com/cdamak/Twitter-Hive:
```
$ ./bin/spark-shell --jars ../Downloads/json-serde-1.3.8-jar-with-dependencies.jar
```

3. Create a Hive table using this SerDe:
```scala
scala> :paste
// Entering paste mode (ctrl-D to finish)

sql(s"""
  |CREATE TABLE json_table2(page_id INT NOT NULL)
  |ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
  |""".stripMargin)

// Exiting paste mode, now interpreting.
res0: org.apache.spark.sql.DataFrame = []

scala> sql("SHOW TABLES").show
+--------+-----------+-----------+
|database|  tableName|isTemporary|
+--------+-----------+-----------+
| default|json_table2|      false|
+--------+-----------+-----------+

scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

4. Quit from the current `spark-shell` and run it without jars:
```
$ ./bin/spark-shell
```

5. Show views. Without the fix, it throws the exception:
```scala
scala> sql("SHOW VIEWS").show
20/08/06 10:53:36 ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found
java.lang.ClassNotFoundException: Class org.openx.data.jsonserde.JsonSerDe not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
	at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:385)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
	at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
	at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
```

After the fix:
```scala
scala> sql("SHOW VIEWS").show
+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+
```

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc96f2f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes #29377 from MaxGekk/fix-listTablesByType-for-views-3.0.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
The test creates 10 batches of data to train the model and expects the error on test data to improve as the model is trained. If the difference between the 2nd error and the 10th error is smaller than 2, the assertion fails:
```
FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction
    eventually(condition, timeout=180.0)
  File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually
    lastValue = condition()
  File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2
```
I saw this quite a few times on Jenkins but was not able to reproduce it locally. These are the ten errors I got:
```
4.517395047937127
4.894265404350079
3.0392090466559876
1.8786361640757654
0.8973106042078115
0.3715780507684368
0.20815690742907672
0.17333033743125845
0.15686783249863873
0.12584413600569616
```
I am thinking of having 15 batches of data instead of 10, so the model can be trained for a longer time. Hopefully the difference between the 2nd and the 15th error will always be larger than 2 on Jenkins. These are the 15 errors I got locally:
```
4.517395047937127
4.894265404350079
3.0392090466559876
1.8786361640757658
0.8973106042078115
0.3715780507684368
0.20815690742907672
0.17333033743125845
0.15686783249863873
0.12584413600569616
0.11883853835108477
0.09400261862100823
0.08887491447353497
0.05984929624986607
0.07583948141520978
```

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested

Closes #29380 from huaxingao/flaky_test.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Huaxin Gao <[email protected]>
(cherry picked from commit 75c2c53)
Signed-off-by: Huaxin Gao <[email protected]>
…d in unit-tests.log

### What changes were proposed in this pull request?

This PR lets the logger log timestamps based on the local time zone during tests.
Currently, `SparkFunSuite` fixes the default time zone to America/Los_Angeles, so the timestamps logged in unit-tests.log are also based on that fixed time zone.

### Why are the changes needed?

It's confusing for developers whose time zone is not America/Los_Angeles.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran existing tests and confirmed unit-tests.log.
If your local time zone is America/Los_Angeles, you can test by setting the environment variable `TZ`, as follows:
```
$ TZ=Asia/Tokyo build/sbt "testOnly org.apache.spark.executor.ExecutorSuite"
$ tail core/target/unit-tests.log
```

Closes #29356 from sarutak/fix-unit-test-log-timezone.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 4e267f3)
Signed-off-by: Takeshi Yamamuro <[email protected]>
…rocessInsert

### What changes were proposed in this pull request?
Improve the exception message.

### Why are the changes needed?
The previous message lacked single quotes; we improve it to keep the style consistent.
![image](https://user-images.githubusercontent.com/46367746/89595808-15bbc300-d888-11ea-9914-b05ea7b66461.png)

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
No, it only improves the message.

Closes #29376 from GuoPhilipse/improve-exception-message.

Lead-authored-by: GuoPhilipse <[email protected]>
Co-authored-by: GuoPhilipse <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit aa4d3c1)
Signed-off-by: HyukjinKwon <[email protected]>
…where required

### What changes were proposed in this pull request?
1. URL-encode the `ASF_PASSWORD` of the release manager.
2. Update the image to install the `qpdf` and `jq` deps.
3. Increase the JVM heap memory for the Maven build.

### Why are the changes needed?
The release script takes hours to run, and if a single failure happens somewhere midway, one either has to finish things manually or re-run the entire script. (This is my understanding.) So I have fixed a few of the failures discovered so far.

1. If the release manager's password contains a character that is not allowed in a URL, the build fails at the clone-Spark step:
`git clone "https://$ASF_USERNAME:$ASF_PASSWORD@$ASF_SPARK_REPO" -b $GIT_BRANCH`

          ^^^ fails with a bad URL

`ASF_USERNAME` may not need URL encoding, but we need to encode `ASF_PASSWORD`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the release for branch-2.4 using both types of passwords, i.e., with and without special chars.

Closes #29373 from ScrapCodes/release-script-fix2.

Lead-authored-by: Prashant Sharma <[email protected]>
Co-authored-by: Prashant Sharma <[email protected]>
Signed-off-by: Prashant Sharma <[email protected]>
(cherry picked from commit 6c3d0a4)
Signed-off-by: Prashant Sharma <[email protected]>
… generation on actual TPCDS data

### What changes were proposed in this pull request?

`TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then checks if the plans can be generated correctly. But the generated plans can differ from the actual ones because the input tables are empty (e.g., the plans always use broadcast-hash joins, whereas the actual ones use sort-merge joins for larger tables). To mitigate the issue, this PR defines data statistics constants extracted from generated TPCDS data in `TPCDSTableStats`, then injects the statistics via `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in `TPCDSQuerySuite`.

Please see the link below for how to extract the table statistics:
 - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
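A sketch of the injection with made-up numbers (`spark` is the test's `SparkSession`; the real constants live in `TPCDSTableStats`):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// attach precomputed statistics to a TPCDS table so the planner
// picks realistic join strategies even though the table is empty
spark.sessionState.catalog.alterTableStats(
  TableIdentifier("store_sales"),
  Some(CatalogStatistics(sizeInBytes = BigInt(400000000L), rowCount = Some(BigInt(2879987L)))))
```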

For example, the generated plans of TPCDS `q2` are different with/without this fix:
```
==== w/ this fix: q2 ====
== Physical Plan ==
* Sort (43)
+- Exchange (42)
   +- * Project (41)
      +- * SortMergeJoin Inner (40)
         :- * Sort (28)
         :  +- Exchange (27)
         :     +- * Project (26)
         :        +- * BroadcastHashJoin Inner BuildRight (25)
         :           :- * HashAggregate (19)
         :           :  +- Exchange (18)
         :           :     +- * HashAggregate (17)
         :           :        +- * Project (16)
         :           :           +- * BroadcastHashJoin Inner BuildRight (15)
         :           :              :- Union (9)
         :           :              :  :- * Project (4)
         :           :              :  :  +- * Filter (3)
         :           :              :  :     +- * ColumnarToRow (2)
         :           :              :  :        +- Scan parquet default.web_sales (1)
         :           :              :  +- * Project (8)
         :           :              :     +- * Filter (7)
         :           :              :        +- * ColumnarToRow (6)
         :           :              :           +- Scan parquet default.catalog_sales (5)
         :           :              +- BroadcastExchange (14)
         :           :                 +- * Project (13)
         :           :                    +- * Filter (12)
         :           :                       +- * ColumnarToRow (11)
         :           :                          +- Scan parquet default.date_dim (10)
         :           +- BroadcastExchange (24)
         :              +- * Project (23)
         :                 +- * Filter (22)
         :                    +- * ColumnarToRow (21)
         :                       +- Scan parquet default.date_dim (20)
         +- * Sort (39)
            +- Exchange (38)
               +- * Project (37)
                  +- * BroadcastHashJoin Inner BuildRight (36)
                     :- * HashAggregate (30)
                     :  +- ReusedExchange (29)
                     +- BroadcastExchange (35)
                        +- * Project (34)
                           +- * Filter (33)
                              +- * ColumnarToRow (32)
                                 +- Scan parquet default.date_dim (31)

==== w/o this fix: q2 ====
== Physical Plan ==
* Sort (40)
+- Exchange (39)
   +- * Project (38)
      +- * BroadcastHashJoin Inner BuildRight (37)
         :- * Project (26)
         :  +- * BroadcastHashJoin Inner BuildRight (25)
         :     :- * HashAggregate (19)
         :     :  +- Exchange (18)
         :     :     +- * HashAggregate (17)
         :     :        +- * Project (16)
         :     :           +- * BroadcastHashJoin Inner BuildRight (15)
         :     :              :- Union (9)
         :     :              :  :- * Project (4)
         :     :              :  :  +- * Filter (3)
         :     :              :  :     +- * ColumnarToRow (2)
         :     :              :  :        +- Scan parquet default.web_sales (1)
         :     :              :  +- * Project (8)
         :     :              :     +- * Filter (7)
         :     :              :        +- * ColumnarToRow (6)
         :     :              :           +- Scan parquet default.catalog_sales (5)
         :     :              +- BroadcastExchange (14)
         :     :                 +- * Project (13)
         :     :                    +- * Filter (12)
         :     :                       +- * ColumnarToRow (11)
         :     :                          +- Scan parquet default.date_dim (10)
         :     +- BroadcastExchange (24)
         :        +- * Project (23)
         :           +- * Filter (22)
         :              +- * ColumnarToRow (21)
         :                 +- Scan parquet default.date_dim (20)
         +- BroadcastExchange (36)
            +- * Project (35)
               +- * BroadcastHashJoin Inner BuildRight (34)
                  :- * HashAggregate (28)
                  :  +- ReusedExchange (27)
                  +- BroadcastExchange (33)
                     +- * Project (32)
                        +- * Filter (31)
                           +- * ColumnarToRow (30)
                              +- Scan parquet default.date_dim (29)
```

This comes from cloud-fan's comment: #29270 (comment)

This is a backport of #29384.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29390 from maropu/SPARK-32564-BRANCH3.0.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR fixes some typos in the `core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala` file.

### Why are the changes needed?

`spark.dynamicAllocation.sustainedSchedulerBacklogTimeout` (N) is used only after `spark.dynamicAllocation.schedulerBacklogTimeout` (M) has been exceeded.
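An illustrative pairing of the two settings (values are arbitrary):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "5s")           // M: first request
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s")  // N: subsequent requests
```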

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No test needed.

Closes #29351 from JoeyValentine/master.

Authored-by: JoeyValentine <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit dc3fac8)
Signed-off-by: Dongjoon Hyun <[email protected]>
…ty tables

### What changes were proposed in this pull request?

This is a follow-up PR of #29384 to address cloud-fan's comment: #29384 (comment)
This PR re-enables `TPCDSQuerySuite` with empty tables for better test coverage.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29391 from maropu/SPARK-32564-FOLLOWUP.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1df855b)
Signed-off-by: Dongjoon Hyun <[email protected]>
… didn't handle non-ASCII characters correctly

### What changes were proposed in this pull request?

This is a backport of #29375.
The trim logic in the `Cast` expression introduced in #26622 trims non-ASCII characters unexpectedly.

Before this patch
![image](https://user-images.githubusercontent.com/1312321/89513154-caad9b80-d806-11ea-9ebe-17c9e7d1b5b3.png)

After this patch
![image](https://user-images.githubusercontent.com/1312321/89513196-d731f400-d806-11ea-959c-6a7dc29dcd49.png)

### Why are the changes needed?
The behavior described above doesn't make sense; it is also inconsistent with the behavior of casting a string to double/float, as well as with the behavior of Hive.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added more unit tests.

Closes #29393 from WangGuangxin/cast-bugfix-branch-3.0.

Authored-by: wangguangxin.cn <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…tgresIntegrationSuite

### What changes were proposed in this pull request?

This PR intends to add tests to check that all the character types in PostgreSQL are supported.

The document for character types in PostgreSQL: https://www.postgresql.org/docs/current/datatype-character.html

Closes #29192.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add tests.

Closes #29394 from maropu/pr29192.

Lead-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: kujon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit b2c45f7)
Signed-off-by: Dongjoon Hyun <[email protected]>
Southwest16 and others added 5 commits November 16, 2020 10:32
…itionThresholdInBytes' in documentation

### What changes were proposed in this pull request?

In the 'Optimizing Skew Join' section of the following two pages:
1. [https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html)
2. [https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html)

The configuration 'spark.sql.adaptive.skewedPartitionThresholdInBytes' should be changed to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'; the former is missing 'skewJoin'.
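For example, with the corrected name:

```scala
// the correctly named config; the value shown is illustrative
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```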

### Why are the changes needed?

To document the correct configuration name.

### Does this PR introduce _any_ user-facing change?

Yes, this is a user-facing doc change.

### How was this patch tested?

Jenkins / CI builds in this PR.

Closes #30376 from aof00/doc_change.

Authored-by: aof00 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 0933f1c)
Signed-off-by: HyukjinKwon <[email protected]>
…al, and examples

### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules graphx, external, and examples.
Split per holdenk's comment: #30323 (comment)

NOTE: The misspellings have been reported at jsoref@706a726#commitcomment-44064356

Backport of #30326

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No testing was performed

Closes #30342 from jsoref/branch-3.0-30326.

Authored-by: Josh Soref <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
…ure GitHub Actions yaml

### What changes were proposed in this pull request?

This PR backports #30391. Note that it's a partial backport.

This PR proposes:
- Add the `~/.sbt` directory into the build cache, see also sbt/sbt#3681
- ~Move `hadoop-2` below to put it together with `java-11` and `scala-213`, see #30391 (comment)~
- Remove the unnecessary `.m2` cache if you run SBT tests only.
- Remove `rm ~/.m2/repository/org/apache/spark`. If you don't run `sbt publishLocal` or `mvn install`, we don't need to care about it.
- ~Use Java 8 in the Scala 2.13 build. We can switch the Java version to 11 used for release later.~
- Add caches into the linters. The linter scripts use `sbt` (for example, `./dev/lint-scala`) and `mvn` (for example, `./dev/lint-java`). The Jekyll build also requires `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven, and the build tools.
- Use the same syntax for the Java version: 1.8 -> 8.

### Why are the changes needed?

- Remove unnecessary stuff
- Cache what we can in the build

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It will be tested in GitHub Actions build at the current PR

Closes #30416 from HyukjinKwon/SPARK-33464-3.0.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…g.String when pruning partition column

### What changes were proposed in this pull request?

This PR fixes the filter for an int column with a value of class java.lang.String when pruning partition columns.

How to reproduce this issue:
```scala
spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
spark.sql("SELECT * FROM test_view WHERE id = '0'").explain
```
```
20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
 at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
 at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
 at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
 at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
 at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
 at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
```

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30380 from wangyum/SPARK-27421.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
(cherry picked from commit 014e1fb)
Signed-off-by: Yuming Wang <[email protected]>