@JamieZZZ
No description provided.

brkyvz and others added 30 commits December 4, 2015 12:09
Python tests require access to the `KinesisTestUtils` file. When this file exists under src/test, Python can't access it, since it is not available in the assembly jar.

However, if we move KinesisTestUtils to src/main, we need to add the KinesisProducerLibrary as a dependency. To avoid this, I moved KinesisTestUtils to src/main and extended it with ExtendedKinesisTestUtils, which lives under src/test and adds support for the KPL.

cc zsxwing tdas

Author: Burak Yavuz <[email protected]>

Closes #10050 from brkyvz/kinesis-py.
…s in SparkR.

Author: Sun Rui <[email protected]>

Closes #9804 from sun-rui/SPARK-11774.

(cherry picked from commit c8d0e16)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Need to match existing method signature

Author: felixcheung <[email protected]>

Closes #9680 from felixcheung/rcorr.

(cherry picked from commit 895b6c4)
Signed-off-by: Shivaram Venkataraman <[email protected]>
… be consistent with Scala/Python

Change ```numPartitions()``` to ```getNumPartitions()``` to be consistent with Scala/Python.
<del>Note: If we cannot make the 1.6 release, this will be a breaking change for 1.7 that we also need to explain in the release notes.</del>

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <[email protected]>

Closes #10123 from yanboliang/spark-12115.

(cherry picked from commit 6979edf)
Signed-off-by: Shivaram Venkataraman <[email protected]>
1. Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should have three related variable functions: ```isNaN, isNull, isNotNull```.
2. Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` on the SparkR side, because ```DataFrame.isNaN``` has been deprecated and will be removed in Spark 2.0.
<del>3. Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should have two related functions: ```isnan, isnull```.</del>

cc shivaram sun-rui felixcheung

Author: Yanbo Liang <[email protected]>

Closes #10037 from yanboliang/spark-12044.

(cherry picked from commit b6e8e63)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Author: gcc <[email protected]>

Closes #10101 from rh99/master.

(cherry picked from commit 04b6799)
Signed-off-by: Sean Owen <[email protected]>
When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.
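A minimal sketch of the escaping step, written in Python rather than the Scala used by the patch. The function names are hypothetical. Java translates `\u` escapes before it tokenizes the source, so a stray `\u` inside a generated comment can fail compilation; doubling the backslash avoids that.

```python
def escape_unicode_marker(comment_text):
    """Replace each backslash-u with double-backslash-u (hypothetical helper)."""
    return comment_text.replace("\\u", "\\\\u")


def to_block_comment(text):
    """Wrap text in a Java-style block comment after escaping it."""
    return "/* " + escape_unicode_marker(text) + " */"


# The raw string below contains a literal backslash followed by 'u'.
comment = to_block_comment(r"path is C:\users\data")
print(comment)
```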

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <[email protected]>

Closes #10155 from gatorsmile/escapeU.

(cherry picked from commit 49efd03)
Signed-off-by: Yin Huai <[email protected]>
…y when Jenkins load is high

We need to make sure that the last entry is indeed the last entry in the queue.

Author: Burak Yavuz <[email protected]>

Closes #10110 from brkyvz/batch-wal-test-fix.

(cherry picked from commit 6fd9e70)
Signed-off-by: Tathagata Das <[email protected]>
This PR:
1. Suppress all known warnings.
2. Cleanup test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0 which prefers that test cases be under 'tests/testthat'
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.

Author: Sun Rui <[email protected]>

Closes #10030 from sun-rui/SPARK-12034.

(cherry picked from commit 39d677c)
Signed-off-by: Shivaram Venkataraman <[email protected]>
Currently, the current line is not cleared by Ctrl-C.

After this patch
```
>>> asdfasdf^C
Traceback (most recent call last):
  File "~/spark/python/pyspark/context.py", line 225, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
```

It's still worse than 1.5 (and before).

Author: Davies Liu <[email protected]>

Closes #10134 from davies/fix_cltrc.

(cherry picked from commit ef3f047)
Signed-off-by: Davies Liu <[email protected]>
…ner not present

The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004).

While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there may be a non-zero chance that the saving and recovery fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.
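The fallback described above can be sketched in plain Python. This is a toy model under stated assumptions: `StateRDDSketch` and `ensure_partitioner` are hypothetical names, not Spark classes or methods, and the partitioner is modelled as a simple label.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StateRDDSketch:
    """Hypothetical stand-in for a state RDD; not a Spark class."""
    records: List[Tuple[str, int]]
    partitioner: Optional[str]  # None models an RDD recovered without a partitioner


def ensure_partitioner(prev_state, expected):
    """Return prev_state unchanged if it is already partitioned as expected."""
    if prev_state.partitioner == expected:
        return prev_state
    # Fallback: rebuild the state with the expected partitioner attached.
    return StateRDDSketch(list(prev_state.records), expected)


recovered = StateRDDSketch([("a", 1), ("b", 2)], partitioner=None)
fixed = ensure_partitioner(recovered, "hash(4)")
```

The repartition only happens on the recovery path; a state that already carries the expected partitioner is passed through untouched.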

Author: Tathagata Das <[email protected]>

Closes #9988 from tdas/SPARK-11932.

(cherry picked from commit 5d80d8c)
Signed-off-by: Tathagata Das <[email protected]>
https://issues.apache.org/jira/browse/SPARK-11963

Author: Xusen Yin <[email protected]>

Closes #9962 from yinxusen/SPARK-11963.

(cherry picked from commit 871e85d)
Signed-off-by: Joseph K. Bradley <[email protected]>
…cala doc

In SPARK-11946 the API for pivot was changed slightly and its doc was updated, but the doc changes were not made for the Python API. This PR updates the Python doc to be consistent.

Author: Andrew Ray <[email protected]>

Closes #10176 from aray/sql-pivot-python-doc.

(cherry picked from commit 36282f7)
Signed-off-by: Yin Huai <[email protected]>
Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib.  There were no uses of the constructor in spark.ml.

CC: mengxr yhuai

Author: Joseph K. Bradley <[email protected]>

Closes #10161 from jkbradley/mllib-sqlcontext-fix.

(cherry picked from commit 3e7e05f)
Signed-off-by: Xiangrui Meng <[email protected]>
…ing include_example

Made a new patch containing only the markdown examples moved to example/folder.
Only three Java classes were not shifted, since they contained compilation errors. These classes are:
1)StandardScale 2)NormalizerExample 3)VectorIndexer

Author: Xusen Yin <[email protected]>
Author: somideshmukh <[email protected]>

Closes #10002 from somideshmukh/SomilBranch1.33.

(cherry picked from commit 78209b0)
Signed-off-by: Xiangrui Meng <[email protected]>
Add since annotation to ml.classification

Author: Takahashi Hiroshi <[email protected]>

Closes #8534 from taishi-oss/issue10259.

(cherry picked from commit 7d05a62)
Signed-off-by: Xiangrui Meng <[email protected]>
…mple code

Add ```SQLTransformer``` user guide and example code, and make the Scala API doc clearer.

Author: Yanbo Liang <[email protected]>

Closes #10006 from yanboliang/spark-11958.

(cherry picked from commit 4a39b5a)
Signed-off-by: Xiangrui Meng <[email protected]>
…means Value

Author: cody koeninger <[email protected]>

Closes #10132 from koeninger/SPARK-12103.

(cherry picked from commit 48a9804)
Signed-off-by: Sean Owen <[email protected]>
Author: Jeff Zhang <[email protected]>

Closes #10172 from zjffdu/SPARK-12166.

(cherry picked from commit 7081291)
Signed-off-by: Sean Owen <[email protected]>
This reverts PR #10002, commit 78209b0.

The original PR wasn't tested on Jenkins before being merged.

Author: Cheng Lian <[email protected]>

Closes #10200 from liancheng/revert-pr-10002.

(cherry picked from commit da2012a)
Signed-off-by: Cheng Lian <[email protected]>
Fix commons-collection group ID to commons-collections for version 3.x

Patches earlier PR at #9731

Author: Sean Owen <[email protected]>

Closes #10198 from srowen/SPARK-11652.2.

(cherry picked from commit e3735ce)
Signed-off-by: Sean Owen <[email protected]>
Checked against Hive: greatest/least should cast their children to the tightest common type,
i.e. `(int, long) => long`, `(int, string) => error`, `(decimal(10,5), decimal(5, 10)) => error`
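A toy Python model of the rule, not Spark's implementation: types are plain strings, and the widening order in `NUMERIC_PRECEDENCE` is an assumption made for illustration. It reproduces the three cases listed above.

```python
# Hypothetical widening order, narrowest first.
NUMERIC_PRECEDENCE = ["int", "long", "float", "double"]


def tightest_common_type(left, right):
    """Return the narrowest type both inputs widen to, or raise TypeError."""
    if left == right:
        return left
    if left in NUMERIC_PRECEDENCE and right in NUMERIC_PRECEDENCE:
        # The wider of the two numeric types is the tightest common one.
        return max(left, right, key=NUMERIC_PRECEDENCE.index)
    raise TypeError("no tightest common type for (%s, %s)" % (left, right))


print(tightest_common_type("int", "long"))
for pair in [("int", "string"), ("decimal(10,5)", "decimal(5,10)")]:
    try:
        tightest_common_type(*pair)
    except TypeError as err:
        print("error:", err)
```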

Author: Wenchen Fan <[email protected]>

Closes #10196 from cloud-fan/type-coercion.

(cherry picked from commit 381f17b)
Signed-off-by: Michael Armbrust <[email protected]>
This PR is to add three more data types into Encoder, including `BigDecimal`, `Date` and `Timestamp`.

marmbrus cloud-fan rxin Could you take a quick look at these three types? Not sure if it can be merged to 1.6. Thank you very much!

Author: gatorsmile <[email protected]>

Closes #10188 from gatorsmile/dataTypesinEncoder.

(cherry picked from commit c0b13d5)
Signed-off-by: Michael Armbrust <[email protected]>
… APIs

This PR contains the following updates:

- Created a new private variable `boundTEncoder` that can be shared by multiple functions, `RDD`, `select` and `collect`.
- Replaced all the `queryExecution.analyzed` by the function call `logicalPlan`
- Fixed a few API comments that used wrong class names (e.g., `DataFrame`) or parameter names (e.g., `n`)
- Fixed a few wrong API descriptions (e.g., `mapPartitions`)

marmbrus rxin cloud-fan Could you take a look and check if they are appropriate? Thank you!

Author: gatorsmile <[email protected]>

Closes #10184 from gatorsmile/datasetClean.

(cherry picked from commit 5d96a71)
Signed-off-by: Michael Armbrust <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-10393

Since the logic of the text processing part has been moved to ML estimators/transformers, replace the related code in LDA Example with the ML pipeline.

Author: Yuhao Yang <[email protected]>
Author: yuhaoyang <[email protected]>

Closes #8551 from hhbyyh/ldaExUpdate.

(cherry picked from commit 872a2ee)
Signed-off-by: Joseph K. Bradley <[email protected]>
…unction

Delays application of ResolvePivot until all aggregates are resolved to prevent problems with UnresolvedFunction and adds unit test

Author: Andrew Ray <[email protected]>

Closes #10202 from aray/sql-pivot-unresolved-function.

(cherry picked from commit 4bcb894)
Signed-off-by: Yin Huai <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-11605
Check Java compatibility for MLlib for this release.

fix:

1. `StreamingTest.registerStream` needs java friendly interface.

2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `developerAPI`.

TBD:
[updated] no fix for now per discussion.
`org.apache.spark.mllib.classification.LogisticRegressionModel`
`public scala.Option<java.lang.Object> getThreshold();` has wrong return type for Java invocation.
`SVMModel` has a similar issue.

Yet adding a `scala.Option<java.lang.Double> getThreshold()` would result in an overloading error due to the same function signature, and adding a new function with a different name seems unnecessary.

cc jkbradley feynmanliang

Author: Yuhao Yang <[email protected]>

Closes #10102 from hhbyyh/javaAPI.

(cherry picked from commit 5cb4695)
Signed-off-by: Joseph K. Bradley <[email protected]>
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

Author: BenFradet <[email protected]>

Closes #10166 from BenFradet/SPARK-12159.

(cherry picked from commit 06746b3)
Signed-off-by: Joseph K. Bradley <[email protected]>
This patch tightens them to `private[memory]`.

Author: Andrew Or <[email protected]>

Closes #10182 from andrewor14/memory-visibility.

(cherry picked from commit 9494521)
Signed-off-by: Josh Rosen <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #10060 from marmbrus/docs.

(cherry picked from commit 3959489)
Signed-off-by: Michael Armbrust <[email protected]>
gatorsmile and others added 25 commits February 1, 2016 11:22
JIRA: https://issues.apache.org/jira/browse/SPARK-12989

In the rule `ExtractWindowExpressions`, we simply replace alias by the corresponding attribute. However, this will cause an issue exposed by the following case:

```scala
val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
  .withColumn("Data", struct("A", "B", "C"))
  .drop("A")
  .drop("B")
  .drop("C")

val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
data.select($"*", max("num").over(winSpec) as "max").explain(true)
```
In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions with their alias names, we can no longer tell what they refer to, since they are not added to `missingExpr` either.

Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>

Closes #10963 from gatorsmile/seletStarAfterColDrop.

(cherry picked from commit 33c8a49)
Signed-off-by: Michael Armbrust <[email protected]>
ISTM `lib` is better because `datanucleus` jars are located in `lib` for release builds.

Author: Takeshi YAMAMURO <[email protected]>

Closes #10901 from maropu/DocFix.

(cherry picked from commit da9146c)
Signed-off-by: Michael Armbrust <[email protected]>
Changed a target at branch-1.6 from #10635.

Author: Takeshi YAMAMURO <[email protected]>

Closes #10915 from maropu/pr9935-v3.
It is not valid to call `toAttribute` on a `NamedExpression` unless we know for sure that the child produced that `NamedExpression`.  The current code worked fine when the grouping expressions were simple, but when they were a derived value this blew up at execution time.

Author: Michael Armbrust <[email protected]>

Closes #11011 from marmbrus/groupByFunction.
Author: Michael Armbrust <[email protected]>

Closes #11014 from marmbrus/seqEncoders.

(cherry picked from commit 29d9218)
Signed-off-by: Michael Armbrust <[email protected]>
…ML python models' properties

Backport of [SPARK-12780] for branch-1.6

Original PR for master: #10724

This fixes StringIndexerModel.labels in pyspark.

Author: Xusen Yin <[email protected]>

Closes #10950 from jkbradley/yinxusen-spark-12780-backport.
I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
Please let me know what you think.
Thanks!

Author: Narine Kokhlikyan <[email protected]>

Closes #10580 from NarineK/sparkrSavaAsRable.

(cherry picked from commit 8a88e12)
Signed-off-by: Shivaram Venkataraman <[email protected]>
The Java `mapWithState` overload that takes a `Function3` converts between Java `Optional` and Scala `Option` incorrectly. The fixed code uses the same conversion as the `mapWithState` overload that takes a `Function4`. `Optional.fromNullable(v.get)` fails if `v` is `None`; it is better to use `JavaUtils.optionToOptional(v)` instead.

Author: Gabriele Nizzoli <[email protected]>

Closes #11007 from gabrielenizzoli/branch-1.6.
…lumn name duplication

Fixes the problem and verifies the fix with a test suite.
Also adds an optional parameter, nullable (Boolean), to SchemaUtils.appendColumn
and deduplicates the SchemaUtils.appendColumn functions.

Author: Grzegorz Chilkiewicz <[email protected]>

Closes #10741 from grzegorz-chilkiewicz/master.

(cherry picked from commit b1835d7)
Signed-off-by: Joseph K. Bradley <[email protected]>
Jira:
https://issues.apache.org/jira/browse/SPARK-13056

Create a map like
{ "a": "somestring", "b": null}
Query like
SELECT col["b"] FROM t1;
NPE would be thrown.

Author: Daoyuan Wang <[email protected]>

Closes #10964 from adrian-wang/npewriter.

(cherry picked from commit 358300c)
Signed-off-by: Michael Armbrust <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
The example throws an error like:
<console>:20: error: not found: value StructType

Need to add this line:
import org.apache.spark.sql.types._

Author: Kevin (Sangwoo) Kim <[email protected]>

Closes #10141 from swkimme/patch-1.

(cherry picked from commit b377b03)
Signed-off-by: Michael Armbrust <[email protected]>
https://issues.apache.org/jira/browse/SPARK-13122

A race condition can occur in MemoryStore's unrollSafely() method if two threads that
return the same value for currentTaskAttemptId() execute this method concurrently. This
change makes the operation of reading the initial amount of unroll memory used, performing
the unroll, and updating the associated memory maps atomic in order to avoid this race
condition.

The initial proposed fix wraps all of unrollSafely() in a memoryManager.synchronized { } block. A cleaner approach might be to introduce a mechanism that synchronizes based on task attempt ID. An alternative option might be to track unroll/pending unroll memory based on block ID rather than task attempt ID.
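As a rough illustration of the coarse-grained fix, here is a Python sketch of making the read, update and write of per-task unroll memory atomic with a single lock. `UnrollMemoryTracker` and its methods are hypothetical names, and the real code is Scala using `memoryManager.synchronized`.

```python
import threading


class UnrollMemoryTracker:
    """Hypothetical tracker keyed by task attempt ID; not the Spark class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._unroll_memory = {}

    def unroll(self, task_attempt_id, num_bytes):
        # Read the current amount, add to it, and write it back as one
        # atomic step, so threads sharing a task attempt ID cannot interleave.
        with self._lock:
            previous = self._unroll_memory.get(task_attempt_id, 0)
            self._unroll_memory[task_attempt_id] = previous + num_bytes

    def used(self, task_attempt_id):
        with self._lock:
            return self._unroll_memory.get(task_attempt_id, 0)


def worker(tracker, task_attempt_id, rounds):
    for _ in range(rounds):
        tracker.unroll(task_attempt_id, 1)


tracker = UnrollMemoryTracker()
threads = [threading.Thread(target=worker, args=(tracker, 7, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = tracker.used(7)
```

A per-task-attempt lock, as the message suggests, would reduce contention but needs extra bookkeeping for the locks themselves.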

Author: Adam Budde <[email protected]>

Closes #11012 from budde/master.

(cherry picked from commit ff71261)
Signed-off-by: Andrew Or <[email protected]>

Conflicts:
	core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
…uration columns

I have prefixed the two 'Duration' columns in 'Details of Batch' in the Streaming tab so that they read 'Output Op Duration' and 'Job Duration'.

Author: Mario Briggs <[email protected]>
Author: mariobriggs <[email protected]>

Closes #11022 from mariobriggs/spark-12739.

(cherry picked from commit e9eb248)
Signed-off-by: Shixiong Zhu <[email protected]>
…ld not fail analysis of encoder

Nullability should only be considered an optimization rather than part of the type system, so instead of failing analysis on mismatched nullability, we should pass analysis and add a runtime null check.
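The idea of deferring the check to run time can be sketched in Python. This is only an analogy with hypothetical names (`decode_row`, `non_nullable_fields`); it does not mirror the encoder code in the patch.

```python
from typing import Any, Dict, Iterable


def decode_row(row: Dict[str, Any], non_nullable_fields: Iterable[str]) -> Dict[str, Any]:
    """Accept any row up front and fail only if a null actually shows up."""
    for name in non_nullable_fields:
        if row.get(name) is None:
            raise ValueError("null value in non-nullable field: " + name)
    return row


ok = decode_row({"id": 1, "name": None}, non_nullable_fields=["id"])
```

A schema that merely permits nulls is accepted; the error is raised only for a row that really carries a null in a field the target type cannot hold.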

backport #11035 to 1.6

Author: Wenchen Fan <[email protected]>

Closes #11042 from cloud-fan/branch-1.6.
minor fix for api link in ml onevsrest

Author: Yuhao Yang <[email protected]>

Closes #11068 from hhbyyh/onevsrestDoc.

(cherry picked from commit c2c956b)
Signed-off-by: Xiangrui Meng <[email protected]>
…ot set but timeoutThreshold is defined

Check that the state exists before calling get.

Author: Shixiong Zhu <[email protected]>

Closes #11081 from zsxwing/SPARK-13195.

(cherry picked from commit 8e2f296)
Signed-off-by: Shixiong Zhu <[email protected]>
Author: Bill Chambers <[email protected]>

Closes #11094 from anabranch/dynamic-docs.

(cherry picked from commit 66e1383)
Signed-off-by: Andrew Or <[email protected]>
There is a bug when we try to grow the buffer: the OOM is wrongly ignored (the assert is also skipped by the JVM), then we try to grow the array again, which triggers spilling and frees the current page, so the record we just inserted becomes invalid.

The root cause is that the JVM has less free memory than MemoryManager thought, so it will OOM when allocating a page without triggering spilling. We should catch the OOM and acquire memory again to trigger spilling.

Also, we cannot grow the array in `insertRecord` of `InMemorySorter` (it was there just for easy testing).

Author: Davies Liu <[email protected]>

Closes #11095 from davies/fix_expand.
…ters with Jackson 2.2.3

Patch to

1. Shade jackson 2.x in spark-yarn-shuffle JAR: core, databind, annotation
2. Use maven antrun to verify the JAR has the renamed classes

Since the check is Maven-based, I don't know whether the verification phase kicks in on an SBT/Jenkins build. It will on a `mvn install`.

Author: Steve Loughran <[email protected]>

Closes #10780 from steveloughran/stevel/patches/SPARK-12807-master-shuffle.

(cherry picked from commit 34d0b70)
Signed-off-by: Marcelo Vanzin <[email protected]>
JIRA: https://issues.apache.org/jira/browse/SPARK-10524

Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
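To illustrate why the distinction matters, here is a small Python sketch for a binary label. The category counts are made-up numbers, and `hard_prediction` and `soft_prediction` are hypothetical helpers rather than the `ImpurityCalculator` API.

```python
# Made-up (count of label 0, count of label 1) per category.
category_counts = {"a": (9, 1), "b": (6, 4), "c": (2, 8), "d": (4, 6)}


def hard_prediction(counts):
    """Majority label: collapses every category to 0 or 1."""
    return 1 if counts[1] > counts[0] else 0


def soft_prediction(counts):
    """Fraction of label 1: keeps the differences between categories."""
    return counts[1] / float(sum(counts))


order_by_hard = sorted(category_counts, key=lambda c: hard_prediction(category_counts[c]))
order_by_soft = sorted(category_counts, key=lambda c: soft_prediction(category_counts[c]))
print(order_by_hard)
print(order_by_soft)
```

With the hard prediction, categories that share a majority label tie and keep their arbitrary input order; the soft prediction separates them, which changes where candidate splits fall.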

Author: Liang-Chi Hsieh <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes #8734 from viirya/dt-soft-centroids.

(cherry picked from commit 9267bc6)
Signed-off-by: Joseph K. Bradley <[email protected]>
… SpecificParquetRecordReaderBase

This is a minor followup to #10843 to fix one remaining place where we forgot to use reflective access of TaskAttemptContext methods.

Author: Josh Rosen <[email protected]>

Closes #11131 from JoshRosen/SPARK-12921-take-2.
Update Aggregator links to point to #org.apache.spark.sql.expressions.Aggregator

Author: raela <[email protected]>

Closes #11158 from raelawang/master.

(cherry picked from commit 719973b)
Signed-off-by: Reynold Xin <[email protected]>
…e system besides HDFS

jkbradley I tried to improve the function that exports a model. When I tried to export a model to S3 under Spark 1.6, it did not work. So it should support S3 in addition to HDFS. Can you review it when you have time? Thanks!

Author: Yu ISHIKAWA <[email protected]>

Closes #11151 from yu-iskw/SPARK-13265.

(cherry picked from commit efb65e0)
Signed-off-by: Xiangrui Meng <[email protected]>
…n error

PySpark's Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is currently no way to get a Boolean indicating whether a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print(nb.hasParam("smoothing"))
print(nb.hasParam("notAParam"))
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
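One way to get the Scala-like behavior can be sketched in plain Python. `Param` and `ParamsSketch` here are simplified, hypothetical stand-ins and not the `pyspark.ml.param` classes; the point is only to look the attribute up with a default instead of letting `AttributeError` escape.

```python
class Param(object):
    """Hypothetical stand-in for a parameter descriptor."""

    def __init__(self, name):
        self.name = name


class ParamsSketch(object):
    """Hypothetical stand-in for a class that carries params."""

    def __init__(self):
        self.smoothing = Param("smoothing")
        self.uid = "nb_0001"  # ordinary attribute, not a Param

    def has_param(self, param_name):
        # getattr with a default never raises for a missing name.
        candidate = getattr(self, param_name, None)
        return isinstance(candidate, Param)


nb = ParamsSketch()
print(nb.has_param("smoothing"))
print(nb.has_param("notAParam"))
```

Checking the type as well as the presence of the attribute means an unrelated attribute such as `uid` is not reported as a parameter.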

cc holdenk

Author: sethah <[email protected]>

Closes #10962 from sethah/SPARK-13047.

(cherry picked from commit b354673)
Signed-off-by: Xiangrui Meng <[email protected]>
…alue parameter

Fix this defect by checking whether a default value exists.

yanboliang Please help to review.

Author: Tommy YU <[email protected]>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.

(cherry picked from commit d3e2e20)
Signed-off-by: Xiangrui Meng <[email protected]>
@AmplabJenkins
Can one of the admins verify this patch?

@vanzin
Contributor

vanzin commented Feb 12, 2016

Hi, can you please close this PR?

markpavey and others added 2 commits February 13, 2016 08:39
… Windows

Due to being on a Windows platform I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK.

Is it worth considering also including this fix in any future 1.5.x releases (if any)?

I confirm this is my own original work and license it to the Spark project under its open source license.

Author: markpavey <[email protected]>

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.

(cherry picked from commit 374c4b2)
Signed-off-by: Sean Owen <[email protected]>
…ailed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are 0.0 rather than the correct values. Setting `TripletFields.All` in `mapTriplets` makes it work.

Author: Liang-Chi Hsieh <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in 610196f Feb 14, 2016