Spark - August 1, 2017 #4
Author: Johan Grande <[email protected]>
Closes #18738 from nahoj/patch-1.
## What changes were proposed in this pull request?
This PR proposes `StructType.fieldNames`, which returns a copy of the field name list, rather than the (undocumented) `StructType.names`.
There are two points here:
- API consistency with Scala/Java
- Provide a safe way to get the field names; manipulating the list returned by `names` can cause unexpected behaviour, as shown below:
```python
from pyspark.sql.types import *
struct = StructType([StructField("f1", StringType(), True)])
names = struct.names
del names[0]
spark.createDataFrame([{"f1": 1}], struct).show()
```
```
...
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 1 fields are required while 0 values are provided.
at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:138)
at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
at org.apache.spark.sql.SparkSession$$anonfun$6.apply(SparkSession.scala:741)
...
```
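With the proposed API, the field names can be obtained safely. A minimal sketch of the intended usage (assuming this PR's `fieldNames` method):

```python
from pyspark.sql.types import StructType, StructField, StringType

struct = StructType([StructField("f1", StringType(), True)])

# fieldNames() returns a copy, so mutating the returned list does not
# affect the schema itself.
names = struct.fieldNames()
del names[0]
print(struct.fieldNames())  # ['f1'] -- the schema is unchanged
```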
## How was this patch tested?
Added tests in `python/pyspark/sql/tests.py`.
Author: hyukjinkwon <[email protected]>
Closes #18618 from HyukjinKwon/SPARK-20090.
…entation

## What changes were proposed in this pull request?
JIRA ticket: [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)

This corrects a mistake in the example code of the Spark Streaming Custom Receivers documentation (https://spark.apache.org/docs/latest/streaming-custom-receivers.html):

```scala
// Assuming ssc is the StreamingContext
val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
val words = lines.flatMap(_.split(" "))
...
```

Instead of `lines.flatMap(_.split(" "))` it should be `customReceiverStream.flatMap(_.split(" "))`.

## How was this patch tested?
The documentation change was tested manually with a Jekyll build, running the commands below:

```
jekyll build
jekyll serve --watch
```

(Screenshots were attached to the original pull request.)

Author: Remis Haroon <[email protected]>
Closes #18770 from remisharoon/master.
## What changes were proposed in this pull request?
In `class FileInputDStream`, around line 162:

```scala
protected[streaming] override def clearMetadata(time: Time) {
  batchTimeToSelectedFiles.synchronized {
    val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration))
    batchTimeToSelectedFiles --= oldFiles.keys
```
The above code does not remove the old `generatedRDDs`. To fix this, `super.clearMetadata(time)` was added at the beginning of `clearMetadata` so that the old `generatedRDDs` are removed as well, as sketched below.
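A sketch of the change, based on the description above (not the verbatim diff; the rest of the method body is elided):

```scala
protected[streaming] override def clearMetadata(time: Time) {
  super.clearMetadata(time)  // added: removes old generatedRDDs in the parent DStream
  batchTimeToSelectedFiles.synchronized {
    val oldFiles = batchTimeToSelectedFiles.filter(_._1 < (time - rememberDuration))
    batchTimeToSelectedFiles --= oldFiles.keys
    // ... rest of the existing method body ...
  }
}
```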
## How was this patch tested?
Testing code (printing the number of `generatedRDDs`) was temporarily added at the end of `clearMetadata` to verify manually that the old RDDs were removed.
Author: shaofei007 <[email protected]>
Author: Fei Shao <[email protected]>
Closes #18718 from shaofei007/master.
… by its canonicalized child

## What changes were proposed in this pull request?
When aliases (added for nested fields) appear as parameters of a `RuntimeReplaceable` expression, they are not part of its children, so the analyzer rule `CleanupAliases` cannot clean them up. As a result, an expression such as `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a GROUP BY query, because they contain different aliases.

Since those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we cannot trim them out by simply transforming the expressions in `CleanupAliases`. Replacing the non-children aliases inside `RuntimeReplaceable` would require adding more code to `RuntimeReplaceable` and modifying all of its expressions, which would make the interface ugly. Given that those aliases are replaced later during optimization and are therefore harmless, this patch simply overrides `canonicalized` in `RuntimeReplaceable`.

One remaining concern is `CleanupAliases`: it cannot actually clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.

## How was this patch tested?
Added a test.

Author: Liang-Chi Hsieh <[email protected]>
Closes #18761 from viirya/SPARK-21555.
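A minimal sketch of the kind of query affected, assuming a table `t` with a struct column `foo` containing a string field `foo1` (names are illustrative):

```scala
// Before this change, the nvl(...) in the select list and in GROUP BY could be
// canonicalized differently because of the extra alias added for the nested
// field, so the analyzer treated them as different expressions.
spark.sql(
  """
    |SELECT nvl(foo.foo1, 'value'), count(*)
    |FROM t
    |GROUP BY nvl(foo.foo1, 'value')
  """.stripMargin).show()
```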
…undary

## What changes were proposed in this pull request?
Long values can be passed to `rangeBetween` as range frame boundaries, but they are silently converted to Int values, which can produce wrong results, so we should fix this. Furthermore, we should accept any legal literal value as a range frame boundary. This PR makes Long values work correctly and makes it easy to add support for other DataTypes later.

This PR is mostly based on Herman's previous work: hvanhovell@596f53c. After this is merged, we can close #16818.

## How was this patch tested?
Added new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <[email protected]>
Closes #18540 from jiangxb1987/rangeFrame.
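A minimal sketch of a Long frame boundary for the change above, assuming a DataFrame `df` with a Long ordering column `ts` and a numeric column `amount` (names are illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// The lower boundary below does not fit in an Int; before this change it would
// have been silently truncated, producing a wrong frame.
val w = Window.orderBy(col("ts")).rangeBetween(-3000000000L, 0L)
val result = df.withColumn("runningSum", sum(col("amount")).over(w))
```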
## What changes were proposed in this pull request?
This PR adds parsing rules to support subquery column aliases in the FROM clause. It is a sub-task of #18079.

## How was this patch tested?
Added tests in `PlanParserSuite` and `SQLQueryTestSuite`.

Author: Takeshi Yamamuro <[email protected]>
Closes #18185 from maropu/SPARK-20962.
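A minimal example of the newly supported syntax (table and column aliases are illustrative):

```scala
// Column aliases on a subquery in the FROM clause: t(col1, col2).
spark.sql("SELECT col1 + col2 FROM (SELECT 1, 2) AS t(col1, col2)").show()
```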
## What changes were proposed in this pull request?
The format of none should be consistent with the other compression codecs (`snappy`, `lz4`), i.e. `none`.

## How was this patch tested?
This is a typo fix.

Author: GuoChenzhao <[email protected]>
Closes #18758 from gczsjdy/typo.
## What changes were proposed in this pull request?
Add a count to the 'Session Statistics' and 'SQL Statistics' sections of the 'JDBC/ODBC Server' page, so that the statistics can be read clearly.

(Before/after screenshots were attached to the original pull request.)

## How was this patch tested?
Manual tests.

Author: guoxiaolong <[email protected]>
Closes #18525 from guoxiaolongzte/SPARK-21297.
In the programming guide, `numTasks` is used in several places as the name of an argument to transformations, but in the code the parameter is called `numPartitions`. This fix replaces `numTasks` with `numPartitions` in the programming guide for consistency.

Author: Cheng Wang <[email protected]>
Closes #18774 from polarke/replace-numtasks-with-numpartitions-in-doc.
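For reference, the parameter as it appears in the API, e.g. on `reduceByKey` (a minimal sketch; `pairs` is an arbitrary `RDD[(String, Int)]`):

```scala
// The guide now calls this argument numPartitions, matching the code.
val counts = pairs.reduceByKey((a, b) => a + b, numPartitions = 4)
```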
## What changes were proposed in this pull request?
This PR fixes a few typos in `merge_spark_pr.py`:

- `# usage: ./apache-pr-merge.py (see config env vars below)` -> `# usage: ./merge_spark_pr.py (see config env vars below)`
- `... have local a Spark ...` -> `... have a local Spark ...`
- `... to Apache.` -> `... to Apache Spark.`

I skimmed this file and these look like all I could find.

## How was this patch tested?
pep8 check (`./dev/lint-python`).

Author: hyukjinkwon <[email protected]>
Closes #18776 from HyukjinKwon/minor-merge-script.
## What changes were proposed in this pull request?
When `BytesToBytesMap` spills, its `longArray` should be released. Otherwise, it may not be released until the task completes. This array can take a significant amount of memory that cannot be used by later operators, such as `UnsafeShuffleExternalSorter`, resulting in more frequent spills in the sorter. This patch releases the array once the destructive iterator no longer needs it.

## How was this patch tested?
Manual test in production.

Author: Zhan Zhang <[email protected]>
Closes #17180 from zhzhan/memory.
…rialization

## What changes were proposed in this pull request?
Remove surplus synchronized blocks.

## How was this patch tested?
Unit tests run OK.

Author: iurii.ant <[email protected]>
Closes #18775 from SereneAnt/eliminate_unnecessary_synchronization_in_java-R_serialization.
…ypes

## What changes were proposed in this pull request?
Currently, `StructType.merge()` only reports the data types of conflicting fields when merging two incompatible schemas. It would be nice to also report the field names for easier debugging.

## How was this patch tested?
Unit test in `DataTypeSuite`; the exception message is printed when a conflict is triggered.

Author: bravo-zhang <[email protected]>
Closes #16365 from bravo-zhang/spark-18950.
…tion algorithms

## What changes were proposed in this pull request?
SPARK-20307 added the `handleInvalid` option to `RFormula` for tree-based classification algorithms. We should add this parameter for the other classification algorithms in SparkR. This is a follow-up PR for SPARK-20307.

## How was this patch tested?
New unit tests are added.

Author: wangmiao1981 <[email protected]>
Closes #18605 from wangmiao1981/class.
## What changes were proposed in this pull request?
This PR adds documentation about unsupported functions in Hive UDF/UDTF/UDAF. It relates to #18768 and #18527.

## How was this patch tested?
N/A

Author: Takeshi Yamamuro <[email protected]>
Closes #18792 from maropu/HOTFIX-20170731.
…leOutputStream in some critical paths

## What changes were proposed in this pull request?
Java's `FileInputStream` and `FileOutputStream` override `finalize()`. Even when such a stream is closed correctly and promptly, it still leaves a memory footprint that is only cleaned up during a Full GC. This introduces two side effects:

1. Many Finalizer-related footprints are kept in memory, which increases memory overhead. In our use case of the external shuffle service, a busy shuffle service accumulates many of these objects and can potentially run into an OOM.
2. The finalizers are only run during Full GC, which increases the Full GC overhead and leads to long GC pauses.

https://bugs.openjdk.java.net/browse/JDK-8080225
https://www.cloudbees.com/blog/fileinputstream-fileoutputstream-considered-harmful

To fix this potential issue, this PR proposes using NIO's `Files#newInputStream`/`newOutputStream` instead in some critical paths such as shuffle.

The following uses of `FileInputStream` in core were left unchanged, as they are not so critical:

```
./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:467: val file = new DataInputStream(new FileInputStream(filename))
./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:942: val in = new FileInputStream(new File(path))
./core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala:76: val fileIn = new FileInputStream(file)
./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:248: val fis = new FileInputStream(file)
./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:910: input = new FileInputStream(new File(t))
./core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala:20:import java.io.{FileInputStream, InputStream}
./core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala:132: case Some(f) => new FileInputStream(f)
./core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala:20:import java.io.{FileInputStream, InputStream}
./core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala:77: val fis = new FileInputStream(f)
./core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala:27:import org.apache.spark.io.NioBufferedFileInputStream
./core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala:94: new DataInputStream(new NioBufferedFileInputStream(index))
./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:111: val channel = new FileInputStream(file).getChannel()
./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:219: val channel = new FileInputStream(file).getChannel()
./core/src/main/scala/org/apache/spark/TestUtils.scala:20:import java.io.{ByteArrayInputStream, File, FileInputStream, FileOutputStream}
./core/src/main/scala/org/apache/spark/TestUtils.scala:106: val in = new FileInputStream(file)
./core/src/main/scala/org/apache/spark/util/logging/RollingFileAppender.scala:89: inputStream = new FileInputStream(activeFile)
./core/src/main/scala/org/apache/spark/util/Utils.scala:329: if (in.isInstanceOf[FileInputStream] && out.isInstanceOf[FileOutputStream]
./core/src/main/scala/org/apache/spark/util/Utils.scala:332: val inChannel = in.asInstanceOf[FileInputStream].getChannel()
./core/src/main/scala/org/apache/spark/util/Utils.scala:1533: gzInputStream = new GZIPInputStream(new FileInputStream(file))
./core/src/main/scala/org/apache/spark/util/Utils.scala:1560: new GZIPInputStream(new FileInputStream(file))
./core/src/main/scala/org/apache/spark/util/Utils.scala:1562: new FileInputStream(file)
./core/src/main/scala/org/apache/spark/util/Utils.scala:2090: val inReader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
```

The following uses of `FileOutputStream` in core were left unchanged:

```
./core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:957: val out = new FileOutputStream(file)
./core/src/main/scala/org/apache/spark/api/r/RBackend.scala:20:import java.io.{DataOutputStream, File, FileOutputStream, IOException}
./core/src/main/scala/org/apache/spark/api/r/RBackend.scala:131: val dos = new DataOutputStream(new FileOutputStream(f))
./core/src/main/scala/org/apache/spark/deploy/master/FileSystemPersistenceEngine.scala:62: val fileOut = new FileOutputStream(file)
./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:160: val outStream = new FileOutputStream(outPath)
./core/src/main/scala/org/apache/spark/deploy/RPackageUtils.scala:239: val zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile, false))
./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:949: val out = new FileOutputStream(tempFile)
./core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala:20:import java.io.{File, FileOutputStream, InputStream, IOException}
./core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala:106: val out = new FileOutputStream(file, true)
./core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:109: * Therefore, for local files, use FileOutputStream instead. */
./core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:112: new FileOutputStream(uri.getPath)
./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:20:import java.io.{BufferedOutputStream, File, FileOutputStream, OutputStream}
./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:71: private var fos: FileOutputStream = null
./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:102: fos = new FileOutputStream(file, true)
./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:213: var truncateStream: FileOutputStream = null
./core/src/main/scala/org/apache/spark/storage/DiskBlockObjectWriter.scala:215: truncateStream = new FileOutputStream(file, true)
./core/src/main/scala/org/apache/spark/storage/DiskStore.scala:153: val out = new FileOutputStream(file).getChannel()
./core/src/main/scala/org/apache/spark/TestUtils.scala:20:import java.io.{ByteArrayInputStream, File, FileInputStream, FileOutputStream}
./core/src/main/scala/org/apache/spark/TestUtils.scala:81: val jarStream = new JarOutputStream(new FileOutputStream(jarFile))
./core/src/main/scala/org/apache/spark/TestUtils.scala:96: val jarFileStream = new FileOutputStream(jarFile)
./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:20:import java.io.{File, FileOutputStream, InputStream, IOException}
./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:31: volatile private var outputStream: FileOutputStream = null
./core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala:97: outputStream = new FileOutputStream(file, true)
./core/src/main/scala/org/apache/spark/util/logging/RollingFileAppender.scala:90: gzOutputStream = new GZIPOutputStream(new FileOutputStream(gzFile))
./core/src/main/scala/org/apache/spark/util/Utils.scala:329: if (in.isInstanceOf[FileInputStream] && out.isInstanceOf[FileOutputStream]
./core/src/main/scala/org/apache/spark/util/Utils.scala:333: val outChannel = out.asInstanceOf[FileOutputStream].getChannel()
./core/src/main/scala/org/apache/spark/util/Utils.scala:527: val out = new FileOutputStream(tempFile)
```

In `DiskBlockObjectWriter`, a `FileDescriptor` is used, so it is not easy to switch to the NIO Files API. All `FileInputStream` and `FileOutputStream` uses in common/shuffle* were changed.

## How was this patch tested?
Existing tests and manual verification.

Author: jerryshao <[email protected]>
Closes #18684 from jerryshao/SPARK-21475.
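A minimal sketch of the replacement pattern described above (the file path is illustrative, not from the patch):

```scala
import java.io.File
import java.nio.file.Files

val file = new File("/tmp/example.dat")
// Instead of new FileOutputStream(file) / new FileInputStream(file): the NIO
// streams do not override finalize(), so they avoid the Finalizer overhead.
val out = Files.newOutputStream(file.toPath)
out.write("example".getBytes("UTF-8"))
out.close()

val in = Files.newInputStream(file.toPath)
in.close()
```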
…from HasThreshold

## What changes were proposed in this pull request?
GBTs now inherit from `HasStepSize`, and `LinearSVC`/`Binarizer` from `HasThreshold`.

## How was this patch tested?
Existing tests.

Author: Zheng RuiFeng <[email protected]>
Author: Ruifeng Zheng <[email protected]>
Closes #18612 from zhengruifeng/override_HasXXX.
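For context, the shared params in question are exposed through the usual setters; a minimal sketch (the values are illustrative):

```scala
import org.apache.spark.ml.classification.{GBTClassifier, LinearSVC}
import org.apache.spark.ml.feature.Binarizer

val gbt = new GBTClassifier().setStepSize(0.1)  // stepSize shared via HasStepSize
val svc = new LinearSVC().setThreshold(0.0)     // threshold shared via HasThreshold
val bin = new Binarizer().setThreshold(0.5)     // threshold shared via HasThreshold
```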
… for Client Mode

The fix deployed for SPARK-21541 caused the Application Master to set the final status of a Spark application to Failed in client mode, because the flag `registered` was not being set to true for client mode. To fix the issue, the flag `registered` is now set to true in client mode once the Application Master registers successfully.

## How was this patch tested?
Tested the patch manually.

Before:
<img width="1275" alt="screen shot-before2" src="https://user-images.githubusercontent.com/22228190/28799641-02b5ed78-760f-11e7-9eb0-bf8407dad0ad.png">

After:
<img width="1221" alt="screen shot-after2" src="https://user-images.githubusercontent.com/22228190/28799646-0ac9ef14-760f-11e7-8bf5-9dfd743d0f2f.png">

Author: pgandhi <[email protected]>
Author: pgandhi999 <[email protected]>
Closes #18788 from pgandhi999/SPARK-21585.