forked from apache/spark
merge in spark #1
Merged
Conversation
…r files
## What changes were proposed in this pull request?
This PR breaks down the large ml/tests.py file that contains all Python ML unit tests into several smaller test files to be easier to read and maintain. The tests are broken down as follows:
```
pyspark
├── __init__.py
...
├── ml
│   ├── __init__.py
...
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_algorithms.py
│   │   ├── test_base.py
│   │   ├── test_evaluation.py
│   │   ├── test_feature.py
│   │   ├── test_image.py
│   │   ├── test_linalg.py
│   │   ├── test_param.py
│   │   ├── test_persistence.py
│   │   ├── test_pipeline.py
│   │   ├── test_stat.py
│   │   ├── test_training_summary.py
│   │   ├── test_tuning.py
│   │   └── test_wrapper.py
...
├── testing
...
│   ├── mlutils.py
...
```
## How was this patch tested?
Ran tests manually by module to ensure the test count was the same, and ran `python/run-tests --modules=pyspark-ml` to verify all tests pass with Python 2.7 and Python 3.6.
Closes #23063 from BryanCutler/python-test-breakup-ml-SPARK-26033.
Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
…Python 2.6 before
## What changes were proposed in this pull request?
Currently, some PySpark tests still assume that they can be run with Python 2.6 by importing `unittest2`. For instance:
```python
if sys.version_info[:2] <= (2, 6):
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier')
        sys.exit(1)
else:
    import unittest
```
While I am here, I removed some unused imports and reordered imports per PEP 8.
We officially dropped Python 2.6 support a while ago and have started to discuss dropping Python 2. It's better to remove these checks.
## How was this patch tested?
Manual tests, and existing tests via Jenkins.
Closes #23077 from HyukjinKwon/SPARK-26105.
Lead-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: Bryan Cutler <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request? This restores scaladoc artifact generation, which got dropped with the Scala 2.12 update. The change looks large, but is almost all due to needing to make the InterfaceStability annotations top-level classes (i.e. `InterfaceStability.Stable` -> `Stable`), unfortunately. A few inner class references had to be qualified too. Lots of scaladoc warnings now reappear. We can choose to disable generation by default and enable for releases, later. ## How was this patch tested? N/A; build runs scaladoc now. Closes #23069 from srowen/SPARK-26026. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Make SparkHadoopUtil private to Spark ## How was this patch tested? Existing tests. Closes #23066 from srowen/SPARK-26043. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…chunks correctly ## What changes were proposed in this pull request? Empty chunk in ChunkedByteBuffer will truncate the ChunkedByteBufferInputStream. The detail reason is described in: https://issues.apache.org/jira/browse/SPARK-26068 ## How was this patch tested? Modified current UT to cover this case. Closes #23040 from LinhongLiu/fix-empty-chunked-byte-buffer. Lead-authored-by: Liu,Linhong <[email protected]> Co-authored-by: Xianjin YE <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
The following 5 functions were removed from branch-2.4:
- map_entries
- map_filter
- transform_values
- transform_keys
- map_zip_with
We should update the since version to 3.0.0.
## How was this patch tested?
Existing tests.
Closes #23082 from ueshin/issues/SPARK-26112/since.
Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed the number of elements in each partition after repartitioning using `df.repartitionByRange` can vary for the same setup:
```scala
// Shuffle numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")
// Repartition it using 3 partitions
// Sum up number of elements in each partition, and collect it.
// And do it several times
for (i <- 0 to 9) {
  var counts = df.repartitionByRange(3, col("val"))
    .mapPartitions{part => Iterator(part.size)}
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies
```
This is expected, as for performance reasons this method uses sampling to estimate the ranges (with a default sample size of 100). Hence, the output may not be consistent, since sampling can return different values. But the documentation did not mention it at all, leading to misunderstanding.
## What changes were proposed in this pull request?
Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.
Closes #23025 from JulienPeloton/SPARK-26024.
Authored-by: Julien <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
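For illustration, a minimal sketch of tuning the sampling behaviour described above, assuming `spark` is an active SparkSession and `df` is the DataFrame from the snippet; a larger sample size is expected to make the range estimates more stable, not guaranteed to make counts identical:
```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

// A larger sample per partition makes the estimated range boundaries more
// stable, at the cost of a slightly more expensive sampling step.
spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", "500")

val counts = df.repartitionByRange(3, col("val"))
  .mapPartitions(part => Iterator(part.size))
  .collect()
println(counts.toList)
```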
## What changes were proposed in this pull request?
Due to an implementation limitation, currently Spark can't compare or do equality checks between map types. As a result, map values can't appear in EQUAL or comparison expressions, can't be a grouping key, etc.
More importantly, map lookup needs to do an equality check on the map key, and thus Spark can't support a map as a map key when looking up values from a map. Thus it's not useful to have a map as a map key.
This PR proposes to stop users from creating maps that use a map type as the key. The list of expressions that are updated: `CreateMap`, `MapFromArrays`, `MapFromEntries`, `MapConcat`, `TransformKeys`. I manually checked all the places that create `MapType`, and came up with this list.
Note that maps with a map type key still exist, via reading from parquet files, converting from scala/java maps, etc. This PR is not to completely forbid maps as map keys, but to avoid creating them by Spark itself.
Motivation: when I was trying to fix the duplicate key problem, I found it's impossible to do it with a map type map key. I think it's reasonable to avoid map type map keys for builtin functions.
## How was this patch tested?
updated tests
Closes #23045 from cloud-fan/map-key.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
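A minimal sketch of the user-facing effect, assuming an active SparkSession named `spark`; the exact exception type and message are assumptions, not quoted from the change:
```scala
import org.apache.spark.sql.AnalysisException

// Building a map whose key type is itself a map is now rejected by the
// built-in constructors (CreateMap and friends) during analysis.
try {
  spark.sql("SELECT map(map(1, 2), 3)").collect()
} catch {
  case e: AnalysisException =>
    // Expected: a data type mismatch error mentioning the map-type key.
    println(e.getMessage)
}
```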
…nd build warnings for Spark 3
## What changes were proposed in this pull request?
The build has a lot of deprecation warnings. Some are new in Scala 2.12 and Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy miscellaneous ones here.
They're too numerous and small to list here; see the pull request. Some highlights:
- `BeanInfo` is deprecated in 2.12, and BeanInfo classes are pretty ancient in Java. Instead, case classes can explicitly declare getters
- Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
- Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
- finalize() is finally deprecated (just needs to be suppressed)
- StageInfo.attempId was deprecated and easiest to remove here
I'm not now going to touch some chunks of deprecation warnings:
- Parquet deprecations
- Hive deprecations (particularly serde2 classes)
- Deprecations in generated code (mostly Thriftserver CLI)
- ProcessingTime deprecations (we may need to revive this class as internal)
- many MLlib deprecations because they concern methods that may be removed anyway
- a few Kinesis deprecations I couldn't figure out
- Mesos get/setRole, which I don't know well
- Kafka/ZK deprecations (e.g. poll())
- Kinesis
- a few other ones that will probably resolve by deleting a deprecated method
## How was this patch tested?
Existing tests, including manual testing with the 2.11 build and Java 11.
Closes #23065 from srowen/SPARK-26090.
Authored-by: Sean Owen <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…higher-order functions: ArrayExists, ArrayFilter, MapFilter
## What changes were proposed in this pull request?
Extend the `ReplaceNullWithFalse` optimizer rule introduced in SPARK-25860 (#22857) to also support optimizing predicates in the higher-order functions `ArrayExists`, `ArrayFilter`, `MapFilter`.
Also rename the rule to `ReplaceNullWithFalseInPredicate` to better reflect its intent.
Example:
```sql
select filter(a, e -> if(e is null, null, true)) as b from (
  select array(null, 1, null, 3) as a)
```
The optimized logical plan:
**Before**:
```
== Optimized Logical Plan ==
Project [filter([null,1,null,3], lambdafunction(if (isnull(lambda e#13)) null else true, lambda e#13, false)) AS b#9]
+- OneRowRelation
```
**After**:
```
== Optimized Logical Plan ==
Project [filter([null,1,null,3], lambdafunction(if (isnull(lambda e#13)) false else true, lambda e#13, false)) AS b#9]
+- OneRowRelation
```
## How was this patch tested?
Added new unit test cases to the `ReplaceNullWithFalseInPredicateSuite` (renamed from `ReplaceNullWithFalseSuite`).
Closes #23079 from rednaxelafx/catalyst-master.
Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
Current `memLimitExceededLogMessage`:
<img src="https://user-images.githubusercontent.com/5399861/48467789-ec8e1000-e824-11e8-91fc-280d342e1bf3.png" width="360">
It's not very clear, because physical memory is exceeded but the suggestion refers to the virtual memory config. This PR makes the message clearer and replaces the deprecated config `spark.yarn.executor.memoryOverhead`.
## How was this patch tested?
manual tests
Closes #23030 from wangyum/EXECUTOR_MEMORY_OVERHEAD.
Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…spark-env.sh
## What changes were proposed in this pull request?
When I try to run scripts (e.g. `start-master.sh`/`start-history-server.sh`) in the latest master, I got this error:
```
Presence of build for multiple Scala versions detected.
Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.
```
The error message is quite confusing. Without reading `load-spark-env.sh`, I didn't know which directory to remove, or where to find and edit the `spark-env.sh`.
This PR makes the error message clearer. It also changes the script for less maintenance when we add or drop Scala versions in the future.
As now with #22967, we can revise the error message as follows (in my local setup):
```
Presence of build for multiple Scala versions detected (/Users/gengliangwang/IdeaProjects/spark/assembly/target/scala-2.12 and /Users/gengliangwang/IdeaProjects/spark/assembly/target/scala-2.11).
Remove one of them or, export SPARK_SCALA_VERSION=2.12 in /Users/gengliangwang/IdeaProjects/spark/conf/spark-env.sh.
Visit https://spark.apache.org/docs/latest/configuration.html#environment-variables for more details about setting environment variables in spark-env.sh.
```
## How was this patch tested?
Manual test
Closes #23049 from gengliangwang/reviseEnvScript.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…ing HTTP requestHeaderSize
## What changes were proposed in this pull request?
Introducing spark.ui.requestHeaderSize for configuring Jetty's HTTP requestHeaderSize.
This way long authorization field does not lead to HTTP 413.
## How was this patch tested?
Manually with curl (which version must be at least 7.55).
With the original default value (8k limit):
```bash
# Starting history server with default requestHeaderSize
$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out
# Creating huge header
$ echo -n "X-Custom-Header: " > cookie
$ printf 'A%.0s' {1..9500} >> cookie
# HTTP GET with huge header fails with 431
$ curl -H cookie http://458apiros-MBP.lan:18080/
<h1>Bad Message 431</h1><pre>reason: Request Header Fields Too Large</pre>
# The log contains the error
$ tail -1 /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out
18/11/19 21:24:28 WARN HttpParser: Header is too large 8193>8192
```
After:
```bash
# Creating the history properties file with the increased requestHeaderSize
$ echo spark.ui.requestHeaderSize=10000 > history.properties
# Starting Spark History Server with the settings
$ ./sbin/start-history-server.sh --properties-file history.properties
starting org.apache.spark.deploy.history.HistoryServer, logging to /Users/attilapiros/github/spark/logs/spark-attilapiros-org.apache.spark.deploy.history.HistoryServer-1-apiros-MBP.lan.out
# HTTP GET with huge header gives back HTML5 (I have added here only just a part of the response)
$ curl -H cookie http://458apiros-MBP.lan:18080/
<!DOCTYPE html><html>
<head>...
<link rel="shortcut icon" href="/static/spark-logo-77x50px-hd.png"></link>
<title>History Server</title>
</head>
<body>
...
```
Closes #23090 from attilapiros/JettyHeaderSize.
Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
…ception ## What changes were proposed in this pull request? This PR fixes an exception in `AggregateExpression.references` called on unresolved expressions. It implements the solution proposed in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084), a minor refactoring that removes the unnecessary dependence on `AttributeSet.toSeq`, which requires expression IDs and, therefore, can only execute successfully for resolved expressions. The refactored implementation is both simpler and faster, eliminating the conversion of a `Set` to a `Seq` and back to `Set`. ## How was this patch tested? Added a new test based on the failing case in [SPARK-26084](https://issues.apache.org/jira/browse/SPARK-26084). hvanhovell Closes #23075 from ssimeonov/ss_SPARK-26084. Authored-by: Simeon Simeonov <[email protected]> Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request?
`dev/lint-python` is a mess of nearly unreadable bash. I would like to fix that as best as I can.
## How was this patch tested?
The build system will test this.
Closes #22994 from shaneknapp/lint-python-refactor.
Authored-by: shane knapp <[email protected]>
Signed-off-by: shane knapp <[email protected]>
… work in 2.11 ## What changes were proposed in this pull request? Hotfix a change to SparkHadoopUtil that doesn't work in 2.11 ## How was this patch tested? Existing tests. Closes #23097 from srowen/SPARK-26043.2. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? Update many plugins we use to the latest version, especially MiMa, which entails excluding some new errors on old changes. ## How was this patch tested? N/A Closes #23087 from srowen/Plugins. Authored-by: Sean Owen <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
In the PR, I propose to pass the CSV option `encoding`/`charset` to the `uniVocity` parser to allow parsing CSV files in different encodings when `multiLine` is enabled. The value of the option is passed to the `beginParsing` method of `CSVParser`.
## How was this patch tested?
Added a new test to `CSVSuite` for different encodings and enabled/disabled header.
Closes #23091 from MaxGekk/csv-miltiline-encoding.
Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
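An illustrative sketch of the option combination described above; the path and charset are placeholders, and `spark` is assumed to be an active SparkSession:
```scala
// Read a multi-line CSV file that is stored in a non-UTF-8 encoding.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("encoding", "cp1251")   // alias: "charset"
  .csv("/path/to/file.csv")
df.show()
```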
…ured Streaming R tests ## What changes were proposed in this pull request? Stop the streaming query in `Specify a schema by using a DDL-formatted string when reading` to avoid outputting annoying logs. ## How was this patch tested? Jenkins Closes #23089 from zsxwing/SPARK-26120. Authored-by: Shixiong Zhu <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request? [Kafka 2.1.0 vote](https://lists.apache.org/thread.html/9f487094491e512b556a1c9c3c6034ac642b088e3f797e3d192ebc9d%3Cdev.kafka.apache.org%3E) passed. Since Kafka 2.1.0 includes official JDK 11 support [KAFKA-7264](https://issues.apache.org/jira/browse/KAFKA-7264), we had better use that. ## How was this patch tested? Pass the Jenkins. Closes #23099 from dongjoon-hyun/SPARK-25954. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…he task table are different ## What changes were proposed in this pull request? Task summary table displays the summary of the task table in the stage page. However, the 'Duration' metrics of 'task summary' table and 'task table' are not matching. The reason is because, in the 'task summary' we display 'executorRunTime' as the duration, and in the 'task table' the actual duration of the task. Except duration metrics, all other metrics are properly displaying in the task summary. In Spark2.2, used to show 'executorRunTime' as duration in the 'taskTable'. That is why, in summary metrics also the 'exeuctorRunTime' shows as the duration. So, we need to show 'executorRunTime' as the duration in the tasks table to follow the same behaviour as the previous versions of spark. ## How was this patch tested? Before patch:  After patch:  Closes #23081 from shahidki31/duratinSummary. Authored-by: Shahid <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? This change fixes a particular scenario where default spark SQL can't encode (thrift) types that are generated by twitter scrooge. These types are a trait that extends `scala.ProductX` with a constructor defined only in a companion object, rather than a actual case class. The actual case class used is child class, but that type is almost never referred to in code. The type has no corresponding constructor symbol and causes an exception. For all other purposes, these classes act just like case classes, so it is unfortunate that spark SQL can't serialize them nicely as it can actual case classes. For an full example of a scrooge codegen class, see https://gist.github.com/anonymous/ba13d4b612396ca72725eaa989900314. This change catches the case where the type has no constructor but does have an `apply` method on the type's companion object. This allows for thrift types to be serialized/deserialized with implicit encoders the same way as normal case classes. This fix had to be done in three places where the constructor is assumed to be an actual constructor: 1) In serializing, determining the schema for the dataframe relies on inspecting its constructor (`ScalaReflection.constructParams`). Here we fall back to using the companion constructor arguments. 2) In deserializing or evaluating, in the java codegen ( `NewInstance.doGenCode`), the type couldn't be constructed with the new keyword. If there is no constructor, we change the constructor call to try the companion constructor. 3) In deserializing or evaluating, without codegen, the constructor is directly invoked (`NewInstance.constructor`). This was fixed with scala reflection to get the actual companion apply method. The return type of `findConstructor` was changed because the companion apply method constructor can't be represented as a `java.lang.reflect.Constructor`. There might be situations in which this approach would also fail in a new way, but it does at a minimum work for the specific scrooge example and will not impact cases that were already succeeding prior to this change Note: this fix does not enable using scrooge thrift enums, additional work for this is necessary. With this patch, it seems like you could patch `com.twitter.scrooge.ThriftEnum` to extend `_root_.scala.Product1[Int]` with `def _1 = value` to get spark's implicit encoders to handle enums, but I've yet to use this method myself. Note: I previously opened a PR for this issue, but only was able to fix case 1) there: #18766 ## How was this patch tested? I've fixed all 3 cases and added two tests that use a case class that is similar to scrooge generated one. The test in ScalaReflectionSuite checks 1), and the additional asserting in ObjectExpressionsSuite checks 2) and 3). Closes #23062 from drewrobb/SPARK-8288. Authored-by: Drew Robb <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request? We currently don't have good visibility into query planning time (analysis vs optimization vs physical planning). This patch adds a simple utility to track the runtime of various rules and various planning phases. ## How was this patch tested? Added unit tests and end-to-end integration tests. Closes #23096 from rxin/SPARK-26129. Authored-by: Reynold Xin <[email protected]> Signed-off-by: Reynold Xin <[email protected]>
….sql.debug.maxToStringFields conf
## What changes were proposed in this pull request?
In the PR, I propose:
- a new SQL config `spark.sql.debug.maxToStringFields` to control the maximum number of fields up to which `truncatedString` cuts its input sequences.
- moving `truncatedString` out of `core` to `sql/catalyst`, because it is used only in the `sql/catalyst` packages for restricting the number of fields converted to strings from `TreeNode` and expressions of `StructType`.
## How was this patch tested?
Added a test to `QueryExecutionSuite` to check that `spark.sql.debug.maxToStringFields` impacts the behavior of `truncatedString`.
Closes #23039 from MaxGekk/truncated-string-catalyst.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
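A quick sketch of using the new config, assuming it can be set at runtime on an active SparkSession named `spark` (the value shown is arbitrary):
```scala
// Raise the threshold so plan/schema debug strings keep more fields
// before truncation.
spark.conf.set("spark.sql.debug.maxToStringFields", "100")
println(spark.range(1).queryExecution.toString)
```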
…classification models ## What changes were proposed in this pull request? The setter methods are deprecated since 2.1 for the models of regression and classification using trees. The deprecation was stating that the method would have been removed in 3.0. Hence the PR removes the deprecated method. ## How was this patch tested? NA Closes #23093 from mgaido91/SPARK-26127. Authored-by: Marco Gaido <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…ages optional
## What changes were proposed in this pull request?
bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark and SparkR) by default. But not all Spark distributions are built with SparkR, and hence this script will fail on such distros.
With this change, we make building the alternate language binding docker images (PySpark and SparkR) optional. The user has to specify the dockerfile for those language bindings using the -p and -R flags accordingly, to build the binding docker images.
## How was this patch tested?
Tested the following scenarios:
- `bin/docker-image-tool.sh -r <repo> -t <tag> build` --> builds only the JVM docker image (default behavior)
- `bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build` --> builds both JVM and PySpark docker images
- `bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build` --> builds JVM, PySpark and SparkR docker images.
Author: Nagaram Prasad Addepally <[email protected]>
Closes #23053 from ramaddepally/SPARK-25957.
…ySpark
## What changes were proposed in this pull request?
Arguably, unittests usually take longer than doctests. We had better prioritize unittests over doctests. Other modules already prioritize unittests over doctests; it looks like the ML module was missed in the first place.
## How was this patch tested?
Jenkins tests.
Closes #23078 from HyukjinKwon/SPARK-26106.
Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?
An input without valid JSON tokens at the root level will be treated as a bad record and handled according to `mode`. Previously such input was converted to `null`. After the changes, the input is converted to a row with `null`s in the `PERMISSIVE` mode, according to the schema. This allows removing code in the `from_json` function which could produce `null` as result rows.
## How was this patch tested?
It was tested by existing test suites. Some of them I had to modify (`JsonSuite` for example) because previously bad input was just silently ignored. Now such input is handled according to the specified `mode`.
Closes #22938 from MaxGekk/json-nulls.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
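A small sketch of the behaviour described above in the default `PERMISSIVE` mode, assuming an active SparkSession named `spark`; the column name and schema are made up for illustration:
```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType)
Seq("""{"a": 1}""", "not json").toDF("value")
  .select(from_json($"value", schema).as("parsed"))
  .show()
// The malformed record is expected to yield a struct of nulls rather than
// a plain null value.
```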
…gation should be named as "key" too ## What changes were proposed in this pull request? When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type. ## How was this patch tested? Added test. Closes #23054 from viirya/SPARK-26085. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…/Float
GROUP BY treats -0.0 and 0.0 as different values, which is unlike Hive's behavior. In addition, the current behavior with codegen is unpredictable (see the example in the JIRA ticket).
## What changes were proposed in this pull request?
In Platform.putDouble/Float(), check if the value is -0.0 and, if so, replace it with 0.0. This is used by UnsafeRow, so it won't have -0.0 values.
## How was this patch tested?
Added tests
Closes #23043 from adoron/adoron-spark-26021-replace-minus-zero-with-zero.
Authored-by: Alon Doron <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
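A minimal sketch of the normalization idea described above, in plain Scala rather than the actual Spark code path:
```scala
// Writing -0.0 and 0.0 through the same normalization makes their bit
// patterns identical, so hash-based grouping treats them as equal.
def normalizeDouble(v: Double): Double = if (v == -0.0d) 0.0d else v

// The raw bits of -0.0 and 0.0 differ, which is exactly why they matter
// for grouping keys stored in UnsafeRow.
assert(java.lang.Double.doubleToLongBits(-0.0d) != java.lang.Double.doubleToLongBits(0.0d))
assert(java.lang.Double.doubleToLongBits(normalizeDouble(-0.0d)) == java.lang.Double.doubleToLongBits(0.0d))
```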
…ndency to Kafka modules
## What changes were proposed in this pull request?
This is a followup of #23099. After upgrading to Kafka 2.1.0, the maven test fails due to a Zookeeper test dependency while the sbt test succeeds.
- [sbt test on master branch](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/5203/)
- [maven test on master branch](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5653/)
The root cause is that the embedded Kafka server is using the [Zookeeper 3.4.7 API](https://zookeeper.apache.org/doc/r3.4.7/api/org/apache/zookeeper/AsyncCallback.MultiCallback.html) while Apache Spark provides Zookeeper 3.4.6. This PR adds a test dependency.
```
KafkaMicroBatchV2SourceSuite:
*** RUN ABORTED ***
...
  org.apache.spark.sql.kafka010.KafkaTestUtils.setupEmbeddedKafkaServer(KafkaTestUtils.scala:123)
  ...
  Cause: java.lang.ClassNotFoundException: org.apache.zookeeper.AsyncCallback$MultiCallback
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at kafka.zk.KafkaZkClient$.apply(KafkaZkClient.scala:1693)
  at kafka.server.KafkaServer.createZkClient$1(KafkaServer.scala:348)
  at kafka.server.KafkaServer.initZkClient(KafkaServer.scala:372)
  at kafka.server.KafkaServer.startup(KafkaServer.scala:202)
  at org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setupEmbeddedKafkaServer$2(KafkaTestUtils.scala:120)
  at org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setupEmbeddedKafkaServer$2$adapted(KafkaTestUtils.scala:116)
  ...
```
## How was this patch tested?
Pass the maven Jenkins test.
Closes #23119 from dongjoon-hyun/SPARK-25954-2.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?
`org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures` is still flaky and here is the error message:
```
sbt.ForkMain$ForkError: java.lang.AssertionError: Got a non-empty set [Failed to send RPC RPC 8249697863992194475 to /172.17.0.2:41177: java.io.IOException: Broken pipe]
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:389)
  at org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
  at org.junit.runners.Suite.runChild(Suite.java:128)
  at org.junit.runners.Suite.runChild(Suite.java:27)
  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
  at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
  at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
  at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
  at sbt.ForkMain$Run$2.call(ForkMain.java:296)
  at sbt.ForkMain$Run$2.call(ForkMain.java:286)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
```
This happened when the second RPC message was being sent but the connection was closed at the same time.
## How was this patch tested?
Jenkins
Closes #23109 from zsxwing/SPARK-26069-2.
Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…moryError when catch exception
## What changes were proposed in this pull request?
PR #20014 introduced `SparkOutOfMemoryError` to avoid killing the entire executor when an `OutOfMemoryError` is thrown. So when applying for memory via `MemoryConsumer.allocatePage` and catching the exception, use `SparkOutOfMemoryError` instead of `OutOfMemoryError`.
## How was this patch tested?
N/A
Closes #23084 from heary-cao/SparkOutOfMemoryError.
Authored-by: caoxuewen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
In the PR, I propose a new option for the CSV datasource - `lineSep` - similar to the Text and JSON datasources. The option allows specifying a custom line separator of at most 2 characters (because of a restriction in the `uniVocity` parser). The new option can be used in reading and writing CSV files.
## How was this patch tested?
Added a few tests with a custom `lineSep` for enabled/disabled `multiLine` in read, as well as tests in write. Also I added roundtrip tests.
Closes #23080 from MaxGekk/csv-line-sep.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
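An illustrative round trip with the new option, assuming an active SparkSession named `spark`; the output path and the chosen separator are placeholders:
```scala
// Write and read back CSV records separated by a carriage return instead of
// the default newline.
val path = "/tmp/csv-custom-linesep"
spark.range(3).write
  .option("lineSep", "\r")
  .mode("overwrite")
  .csv(path)

val df = spark.read
  .option("lineSep", "\r")
  .csv(path)
df.show()
```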
… fitting in long
## What changes were proposed in this pull request?
Fix `Decimal.toScalaBigInt` and `toJavaBigInteger`, which used to work only for decimals fitting in a long.
## How was this patch tested?
Added a test to DecimalSuite.
Closes #23022 from juliuszsompolski/SPARK-26038.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
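A short sketch of the fixed conversions, using an arbitrary value that does not fit in a `Long`:
```scala
import org.apache.spark.sql.types.Decimal

// A 30-digit integer, well beyond Long.MaxValue.
val d = Decimal(BigDecimal("123456789012345678901234567890"))
val asScala: BigInt = d.toScalaBigInt
val asJava: java.math.BigInteger = d.toJavaBigInteger
println(asScala == BigInt(asJava))   // expected: true
```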
## What changes were proposed in this pull request? This patch defines an internal Spark interface for reporting shuffle metrics and uses that in shuffle reader. Before this patch, shuffle metrics is tied to a specific implementation (using a thread local temporary data structure and accumulators). After this patch, callers that define their own shuffle RDDs can create a custom metrics implementation. With this patch, we would be able to create a better metrics for the SQL layer, e.g. reporting shuffle metrics in the SQL UI, for each exchange operator. Note that I'm separating read side and write side implementations, as they are very different, to simplify code review. Write side change is at #23106 ## How was this patch tested? No behavior change expected, as it is a straightforward refactoring. Updated all existing test cases. Closes #23105 from rxin/SPARK-26140. Authored-by: Reynold Xin <[email protected]> Signed-off-by: gatorsmile <[email protected]>
…ow UnsupportedOperationException for Kryo
## What changes were proposed in this pull request?
In `deserialize` for Kryo, the input parameter is a `ByteBuffer`; if it is not backed by an accessible byte array, it will throw `UnsupportedOperationException`.
Exception Info:
```
java.lang.UnsupportedOperationException was thrown.
java.lang.UnsupportedOperationException
at java.nio.ByteBuffer.array(ByteBuffer.java:994)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:362)
```
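For illustration, a defensive way to read such a buffer without calling `ByteBuffer.array()` directly (a sketch only, not necessarily the approach taken in this PR):
```scala
import java.nio.ByteBuffer

// Copy the remaining bytes whether or not the buffer is array-backed.
def toBytes(buffer: ByteBuffer): Array[Byte] = {
  if (buffer.hasArray) {
    java.util.Arrays.copyOfRange(
      buffer.array(),
      buffer.arrayOffset() + buffer.position(),
      buffer.arrayOffset() + buffer.limit())
  } else {
    val bytes = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(bytes) // duplicate() keeps the original position untouched
    bytes
  }
}
```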
## How was this patch tested?
Added a unit test
Closes #22779 from 10110346/InputStreamKryo.
Authored-by: liuxian <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
In the summary section of the stage page:

1. The following metric names can be revised:
   - Output => Output Size / Records
   - Shuffle Read => Shuffle Read Size / Records
   - Shuffle Write => Shuffle Write Size / Records
   After the changes, the names are clearer and consistent with the other names on the same page.
2. The associated job id URL should not contain the 3 trailing spaces. Reduce the number of spaces to one, and exclude the space from the link. This is consistent with the SQL execution page.
## How was this patch tested?
Manual check:

Closes #23125 from gengliangwang/reviseStagePage.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…n docs.
## What changes were proposed in this pull request?
"Running on Kubernetes" references `spark.driver.pod.name` in a few places, and it should be `spark.kubernetes.driver.pod.name`.
## How was this patch tested?
See changes.
Closes #23133 from Leemoonsoo/fix-driver-pod-name-prop.
Authored-by: Lee moon soo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… tests to speed up ## What changes were proposed in this pull request? This PR proposes to increase parallelism in PySpark tests to speed up from 4 to 8. It decreases the elapsed time from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99163/consoleFull Tests passed in 1770 seconds to https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99186/testReport/ Tests passed in 1027 seconds ## How was this patch tested? Jenkins tests Closes #23111 from HyukjinKwon/parallelism. Authored-by: hyukjinkwon <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request? The DOI foundation recommends [this new resolver](https://www.doi.org/doi_handbook/3_Resolution.html#3.8). Accordingly, this PR re`sed`s all static DOI links ;-) ## How was this patch tested? It wasn't, since it seems as safe as a "[typo fix](https://spark.apache.org/contributing.html)". In case any of the files is included from other projects, and should be updated there, please let me know. Closes #23129 from katrinleinweber/resolve-DOIs-securely. Authored-by: Katrin Leinweber <[email protected]> Signed-off-by: Sean Owen <[email protected]>
## What changes were proposed in this pull request?
This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4.
## How was this patch tested?
Added a test case in DataFrameSuite
Closes #23131 from gatorsmile/addBackUnionAll.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
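A tiny sketch of the re-added API, assuming an active SparkSession named `spark`; like `union`, `unionAll` keeps duplicate rows:
```scala
import spark.implicits._

val df1 = Seq(1, 2, 2).toDF("id")
val df2 = Seq(2, 3).toDF("id")
// Keeps all rows, including duplicates: 1, 2, 2, 2, 3
df1.unionAll(df2).show()
```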
## What changes were proposed in this pull request?
Spark SQL currently doesn't support encoding `Option[Product]` as a top-level row, because in Spark SQL an entire top-level row can't be null.
However, for use cases like Aggregator, it is reasonable to use `Option[Product]` as buffer and output column types. Due to the above limitation, we don't do it for now.
This patch proposes to encode `Option[Product]` at the top level as a single struct column. So we can work around the issue that an entire top-level row can't be null.
To summarize the encoding of `Product` and `Option[Product]`:
For `Product`, 1. at the root level, all fields are flattened into multiple columns. The `Product` can't be null, otherwise it throws an exception.
```scala
val df = Seq((1 -> "a"), (2 -> "b")).toDF()
df.printSchema()
root
|-- _1: integer (nullable = false)
|-- _2: string (nullable = true)
```
2. At non-root level, `Product` is a struct type column.
```scala
val df = Seq((1, (1 -> "a")), (2, (2 -> "b")), (3, null)).toDF()
df.printSchema()
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
For `Option[Product]`, 1. it was not supported at root level. After this change, it is a struct type column.
```scala
val df = Seq(Some(1 -> "a"), Some(2 -> "b"), None).toDF()
df.printSchema
root
|-- value: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
2. At non-root level, it is also a struct type column.
```scala
val df = Seq((1, Some(1 -> "a")), (2, Some(2 -> "b")), (3, None)).toDF()
df.printSchema
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
```
3. For use cases like Aggregator, it was not supported either. After this change, we support using `Option[Product]` as the buffer/output column type.
```scala
val df = Seq(
OptionBooleanIntData("bob", Some((true, 1))),
OptionBooleanIntData("bob", Some((false, 2))),
OptionBooleanIntData("bob", None)).toDF()
val group = df
.groupBy("name")
.agg(OptionBooleanIntAggregator("isGood").toColumn.alias("isGood"))
group.printSchema
root
|-- name: string (nullable = true)
|-- isGood: struct (nullable = true)
| |-- _1: boolean (nullable = false)
| |-- _2: integer (nullable = false)
```
The buffer and output types of `OptionBooleanIntAggregator` are both `Option[(Boolean, Int)]`.
## How was this patch tested?
Added test.
Closes #21732 from viirya/SPARK-24762.
Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? Create a new suite DataFrameSetOperationsSuite for the test cases of DataFrame/Dataset's set operations. Also, add test cases of NULL handling for Array Except and Array Intersect. ## How was this patch tested? N/A Closes #23137 from gatorsmile/setOpsTest. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request? This PR is to improve the code comments to document some common traits and traps about the expression. ## How was this patch tested? N/A Closes #23135 from gatorsmile/addcomments. Authored-by: gatorsmile <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
… compute `numFeatures` ## What changes were proposed in this pull request? use base models' `numFeature` instead of `first` job ## How was this patch tested? existing tests Closes #23123 from zhengruifeng/avoid_first_job. Authored-by: zhengruifeng <[email protected]> Signed-off-by: Sean Owen <[email protected]>
…afka's consumer group (group.id)
## What changes were proposed in this pull request?
Allow the Spark Structured Streaming user to specify the prefix of the consumer group (group.id), instead of forcing consumer group ids of the form `spark-kafka-source-*`.
## How was this patch tested?
Unit tests provided by Spark (backwards compatible change, i.e., the user can optionally use the functionality).
`mvn test -pl external/kafka-0-10`
Closes #23103 from zouzias/SPARK-26121.
Authored-by: Anastasios Zouzias <[email protected]>
Signed-off-by: cody koeninger <[email protected]>
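An illustrative reader setup for this feature; the option name `groupIdPrefix` is assumed from the feature description, the broker address and topic are placeholders, and the spark-sql-kafka package is assumed to be on the classpath:
```scala
// Consumer groups created by this query are expected to start with the given
// prefix rather than the default spark-kafka-source-* naming.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")         // placeholder
  .option("subscribe", "events")                            // placeholder topic
  .option("groupIdPrefix", "my-team-structured-streaming")  // assumed option name
  .load()
```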
## What changes were proposed in this pull request? This PR adds configurations to use subpaths with Spark on k8s. Subpaths (https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath) allow the user to specify a path within a volume to use instead of the volume's root. ## How was this patch tested? Added unit tests. Ran SparkPi on a cluster with event logging pointed at a subpath-mount and verified the driver host created and used the subpath. Closes #23026 from NiharS/k8s_subpath. Authored-by: Nihar Sheth <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
…columns and searching
Support column sort, pagination and search for the Stage Page using jQuery DataTable and the REST API. Before this commit, the Stage page generated a hard-coded HTML table that could not support search. Supporting search and sort (over all applications rather than the 20 entries on the current page) in any case will greatly improve the user experience.
Created the stagespage-template.html for displaying application information in datatables. Added a REST api endpoint and javascript code to fetch data from the endpoint and display it in the data table.
Because of the above change, certain functionalities on the page had to be modified to support the addition of datatables. For example, the toggle checkbox 'Select All' previously would add the checked fields as columns in the Task table and as rows in the Summary Metrics table, but after the change, only columns are added in the Task Table as it got tricky to add rows dynamically in the datatables.
## How was this patch tested?
I have attached the screenshots of the Stage Page UI before and after the fix.
**Before:**
<img width="1419" alt="30564304-35991e1c-9c8a-11e7-850f-2ac7a347f600" src="https://user-images.githubusercontent.com/22228190/42137915-52054558-7d3a-11e8-8c85-433b2c94161d.png">
<img width="1435" alt="31360592-cbaa2bae-ad14-11e7-941d-95b4c7d14970" src="https://user-images.githubusercontent.com/22228190/42137928-79df500a-7d3a-11e8-9068-5630afe46ff3.png">
**After:**
<img width="1432" alt="31360591-c5650ee4-ad14-11e7-9665-5a08d8f21830" src="https://user-images.githubusercontent.com/22228190/42137936-a3fb9f42-7d3a-11e8-8502-22b3897cbf64.png">
<img width="1388" alt="31360604-d266b6b0-ad14-11e7-94b5-dcc4bb5443f4" src="https://user-images.githubusercontent.com/22228190/42137970-0fabc58c-7d3b-11e8-95ad-383b1bd1f106.png">
Closes #21688 from pgandhi999/SPARK-21809-2.3.
Authored-by: pgandhi <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
…w the right number of the total tasks
The total number of tasks in the aggregated table and the tasks table sometimes do not match in the Web UI.
We need to force-update the executor summary of a particular executorId whenever the last task of that executor has finished. Currently the force update happens based on the last task at stage end, so for a particular executorId some tasks might be missed at stage end.
Tests to reproduce:
```
bin/spark-shell --master yarn --conf spark.executor.instances=3
sc.parallelize(1 to 10000, 10).map{ x => throw new RuntimeException("Bad executor")}.collect()
```
Before patch:

After patch:

Closes #23038 from shahidki31/SPARK-25451.
Authored-by: Shahid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
Closes #23149 from vanzin/SPARK-25451.hotfix. Authored-by: Marcelo Vanzin <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
Closes #23107
## What changes were proposed in this pull request? In #23105, due to working on two parallel PRs at once, I made the mistake of committing the copy of the PR that used the name ShuffleMetricsReporter for the interface, rather than the appropriate one ShuffleReadMetricsReporter. This patch fixes that. ## How was this patch tested? This should be fine as long as compilation passes. Closes #23147 from rxin/ShuffleReadMetricsReporter. Authored-by: Reynold Xin <[email protected]> Signed-off-by: gatorsmile <[email protected]>
## What changes were proposed in this pull request? This is follow-up of #21732. This patch inlines `isOptionType` method. ## How was this patch tested? Existing tests. Closes #23143 from viirya/SPARK-24762-followup. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…icate
## What changes were proposed in this pull request?
Based on #22857 and #23079, this PR did a few updates:
- Limit the data types of NULL to Boolean.
- Limit the input data type of replaceNullWithFalse to Boolean; throw an exception in the testing mode.
- Create a new file for the rule ReplaceNullWithFalseInPredicate
- Update the description of this rule.
## How was this patch tested?
Added a test case
Closes #23139 from gatorsmile/followupSpark-25860.
Authored-by: gatorsmile <[email protected]>
Signed-off-by: DB Tsai <[email protected]>
## What changes were proposed in this pull request? This is the write side counterpart to #23105 ## How was this patch tested? No behavior change expected, as it is a straightforward refactoring. Updated all existing test cases. Closes #23106 from rxin/SPARK-26141. Authored-by: Reynold Xin <[email protected]> Signed-off-by: Reynold Xin <[email protected]>
…in SetOperationSuite
## What changes were proposed in this pull request?
The purpose of this PR is to supplement new test cases for `a + 1`, `a + b` and `Rand` in SetOperationSuite. It comes from a comment on the closed PR #20541, thanks.
## How was this patch tested?
Added new test cases.
Closes #23138 from heary-cao/UnionPushTestCases.
Authored-by: caoxuewen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
The "build context" for a docker image - basically the whole contents of the current directory where "docker" is invoked - can be huge in a dev build, easily breaking a couple of gigs. Doing that copy 3 times during the build of docker images severely slows down the process. This patch creates a smaller build context - basically mimicking what the make-distribution.sh script does, so that when building the docker images, only the necessary bits are in the current directory. For PySpark and R that is optimized further, since those images are built based on the previously built Spark main image. In my current local clone, the dir size is about 2G, but with this script the "context" sent to docker is about 250M for the main image, 1M for the pyspark image and 8M for the R image. That speeds up the image builds considerably. I also snuck in a fix to the k8s integration test dependencies in the sbt build, so that the examples are properly built (without having to do it manually). Closes #23019 from vanzin/SPARK-26025. Authored-by: Marcelo Vanzin <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>