750 commits
58210f2
[SQL] Minor: Introduce SchemaRDD#aggregate() for simple aggregations
aarondav May 26, 2014
035b976
HOTFIX: Add no-arg SparkContext constructor in Java
pwendell May 26, 2014
88c9844
[SPARK-1914] [SQL] Simplify CountFunction not to traverse to evaluate…
ueshin May 26, 2014
e6c0550
Fix scalastyle warnings in yarn alpha
witgo May 26, 2014
c282a31
SPARK-1925: Replace '&' with '&&'
zsxwing May 26, 2014
294e5c2
[SPARK-1931] Reconstruct routing tables in Graph.partitionBy
ankurdave May 26, 2014
d56d894
SPARK-1929 DAGScheduler suspended by local task OOM
zhpengg May 27, 2014
66a244d
Fixed the error message for OutOfMemoryError in DAGScheduler.
rxin May 27, 2014
e3ca337
Updated dev Python scripts to make them PEP8 compliant.
rxin May 27, 2014
b3aa4be
SPARK-1933: Throw a more meaningful exception when a directory is pas…
rxin May 27, 2014
b2158a3
SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
zsxwing May 27, 2014
a0d07eb
bugfix worker DriverStateChanged state should match DriverState.FAILED
lianhuiwang May 27, 2014
f1c7811
[SPARK-1926] [SQL] Nullability of Max/Min/First should be true.
ueshin May 27, 2014
8c631a0
[SPARK-1915] [SQL] AverageFunction should not count if the evaluated …
ueshin May 27, 2014
0c62787
[SQL] SPARK-1922
May 27, 2014
cf00662
[SPARK-1938] [SQL] ApproxCountDistinctMergeFunction should return Int…
ueshin May 28, 2014
c01fd48
Fix doc about NetworkWordCount/JavaNetworkWordCount usage of spark st…
jmu May 28, 2014
bde5ff5
Organize configuration docs
pwendell May 28, 2014
369b6dc
Spark 1916
May 28, 2014
be53829
[SPARK-1712]: TaskDescription instance is too big causes Spark to hang
witgo May 28, 2014
8b15ccd
Added doctest and method description in context.py
jyotiska May 29, 2014
5f8c829
SPARK-1935: Explicitly add commons-codec 1.5 as a dependency.
yhuai May 29, 2014
e26687c
[SPARK-1368][SQL] Optimized HiveTableScan
liancheng May 29, 2014
ebcfe8e
initial version of LPA
ankurdave May 29, 2014
3fe7d72
[SPARK-1820] Make GenerateMimaIgnore @DeveloperApi annotation aware.
ScrapCodes May 30, 2014
39fd10a
[SPARK-1566] consolidate programming guide, and general doc updates
mateiz May 30, 2014
d770a90
[SPARK-1971] Update MIMA to compare against Spark 1.0.0
ScrapCodes May 30, 2014
fb94cfb
[SPARK-1901] worker should make sure executor has exited before updat…
zhpengg May 30, 2014
0d21059
Typo: and -> an
ash211 May 31, 2014
342cce6
updated link to mailing list
nchammas May 31, 2014
1b04476
SPARK-1976: fix the misleading part in streaming docs
CodingCat May 31, 2014
96fdfbb
[SPARK-1959] String "NULL" shouldn't be interpreted as null value
liancheng May 31, 2014
0325c8d
correct tiny comment error
CrazyJvm May 31, 2014
3ab4055
[SPARK-1947] [SQL] Child of SumDistinct or Average should be widened …
ueshin May 31, 2014
d4a04a5
Optionally include Hive as a dependency of the REPL.
marmbrus May 31, 2014
8696d9f
[SQL] SPARK-1964 Add timestamp to hive metastore type parser.
marmbrus May 31, 2014
08a7929
Super minor: Close inputStream in SparkSubmitArguments
aarondav May 31, 2014
ee975a8
SPARK-1839: PySpark RDD#take() shouldn't always read from driver
aarondav May 31, 2014
8b37805
Improve maven plugin configuration
witgo May 31, 2014
a36d925
SPARK-1917: fix PySpark import of scipy.special functions
laserson May 31, 2014
7b8d2c2
updated java code blocks in spark SQL guide such that ctx will refer …
Jun 1, 2014
ffc4cf1
Made spark_ec2.py PEP8 compliant.
rxin Jun 1, 2014
e5cd817
Better explanation for how to use MIMA excludes.
pwendell Jun 2, 2014
e84aa22
Add landmark-based Shortest Path algorithm to graphx.lib
ankurdave Jun 2, 2014
a5602da
[SPARK-1553] Alternating nonnegative least-squares
tmyklebu Jun 2, 2014
ce3bf45
[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCol…
liancheng Jun 2, 2014
e3b62b7
[SPARK-1995][SQL] system function upper and lower can be supported
egraldlo Jun 3, 2014
5219429
Avoid dynamic dispatching when unwrapping Hive data.
liancheng Jun 3, 2014
4a51dd8
[SPARK-1942] Stop clearing spark.driver.port in unit tests
Jun 3, 2014
fc8e738
SPARK-2001 : Remove docs/spark-debugger.md from master
hsaputra Jun 3, 2014
34af7ed
[SPARK-1912] fix compress memory issue during reduce
cloud-fan Jun 3, 2014
186879f
Add support for Pivotal HD in the Maven build: SPARK-1992
tzolov Jun 3, 2014
6544239
[SPARK-1468] Modify the partition function used by partitionBy.
Jun 3, 2014
4c42c45
fix java.lang.ClassCastException
baishuo Jun 3, 2014
b1ec902
Synthetic GraphX Benchmark
jegonzal Jun 3, 2014
c17bea0
[SPARK-1991] Support custom storage levels for vertices and edges
ankurdave Jun 3, 2014
05fa288
Fixed a typo
dbtsai Jun 4, 2014
d890b93
[SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python
kanzhang Jun 4, 2014
146665f
SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead…
rxin Jun 4, 2014
40ceb3d
use env default python in merge_spark_pr.py
mengxr Jun 4, 2014
e7d0f3c
Enable repartitioning of graph over different number of partitions
jegonzal Jun 4, 2014
19edca2
Update spark-ec2 scripts for 1.0.0 on master
aarondav Jun 4, 2014
177ef5d
SPARK-1806 (addendum) Use non-deprecated methods in Mesos 0.18
srowen Jun 4, 2014
ae55c2c
[SPARK-1817] RDD.zip() should verify partition sizes for each partition
kanzhang Jun 4, 2014
7c8a1d0
[MLLIB] set RDD names in ALS
nevillelyh Jun 4, 2014
5be3319
SPARK-1973. Add randomSplit to JavaRDD (with tests, and tidy Java tests)
srowen Jun 4, 2014
349560c
[SPARK-1752][MLLIB] Standardize text format for vectors and labeled p…
mengxr Jun 4, 2014
52cfc18
SPARK-1518: FileLogger: Fix compile against Hadoop trunk
Jun 4, 2014
2cc0bd3
SPARK-1790: Update EC2 scripts to support r3 instance types
sujeetv Jun 4, 2014
d8d2e93
Minor: Fix documentation error from apache/spark#946
ankurdave Jun 4, 2014
285cac4
Fix issue in ReplSuite with hadoop-provided profile.
Jun 5, 2014
4671126
[SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SN…
ueshin Jun 5, 2014
01f4c13
SPARK-1677: allow user to disable output dir existence checking
CodingCat Jun 5, 2014
aaee35f
[SPARK-2036] [SQL] CaseConversionExpression should check if the evalu…
ueshin Jun 5, 2014
654e63e
HOTFIX: Remove generated-mima-excludes file after runing MIMA.
pwendell Jun 5, 2014
15d8bf6
sbt 0.13.X should be using sbt-assembly 0.11.X
kalpit Jun 5, 2014
08e04ed
Remove compile-scoped junit dependency.
Jun 5, 2014
2ea7ce7
[SPARK-2041][SQL] Correctly analyze queries where columnName == table…
marmbrus Jun 6, 2014
f052dff
Use pluggable clock in DAGSheduler #SPARK-2031
CrazyJvm Jun 6, 2014
f9ab450
[SPARK-2025] Unpersist edges of previous graph in Pregel
ankurdave Jun 6, 2014
229c135
SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
mateiz Jun 6, 2014
4b1266c
[SPARK-2050][SQL] LIKE, RLIKE and IN in HQL should not be case sensit…
marmbrus Jun 6, 2014
ab928c9
[SPARK-1552] Fix type comparison bug in {map,outerJoin}Vertices
ankurdave Jun 6, 2014
8f65c03
[SPARK-2050 - 2][SQL] DIV and BETWEEN should not be case sensitive.
marmbrus Jun 6, 2014
a761f16
[SPARK-1841]: update scalatest to version 2.1.5
witgo Jun 6, 2014
684a102
[SPARK-1994][SQL] Weird data corruption bug when running Spark SQL on…
marmbrus Jun 7, 2014
e6bc51f
HOTFIX: Support empty body in merge script
pwendell Jun 7, 2014
4100154
SPARK-2056 Set RDD name to input path
nevillelyh Jun 7, 2014
da48c74
SPARK-2026: Maven Hadoop Profiles Should Set The Hadoop Version
berngp Jun 8, 2014
6c505e6
SPARK-1898: In deploy.yarn.Client, use YarnClient not YarnClientImpl
Jun 8, 2014
4ea7110
SPARK-1628: Add missing hashCode methods in Partitioner subclasses
zsxwing Jun 8, 2014
e6e2963
Update run-example
maji2014 Jun 8, 2014
3a73aa7
SPARK-1628 follow up: Improve RangePartitioner's documentation.
rxin Jun 9, 2014
7cb4cc0
[SPARK-2067] use relative path for Spark logo in UI
nevillelyh Jun 9, 2014
20d5f4b
Grammar: read -> reads
ash211 Jun 9, 2014
c018887
[SPARK-1308] Add getNumPartitions to pyspark RDD
Jun 9, 2014
fb64844
SPARK-1944 Document --verbose in spark-shell -h
ash211 Jun 9, 2014
b111815
[SPARK-1495][SQL]add support for left semi join
adrian-wang Jun 9, 2014
12c198d
Added a TaskSetManager unit test.
kayousterhout Jun 9, 2014
93ddcbc
[SPARK-1522] : YARN ClientBase throws a NPE if there is no YARN Appli…
berngp Jun 9, 2014
8045f49
[SQL] Simple framework for debugging query execution
marmbrus Jun 9, 2014
ed5dd10
[SPARK-1704][SQL] Fully support EXPLAIN commands as SchemaRDD.
concretevitamin Jun 9, 2014
5f9b236
Make sure that empty string is filtered out when we get the secondary…
dbtsai Jun 10, 2014
a034761
SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats
MLnick Jun 10, 2014
cfb22da
[SPARK-1508][SQL] Add SQLConf to SQLContext.
concretevitamin Jun 10, 2014
d00cae8
Moved hiveOperators.scala to the right package folder
liancheng Jun 10, 2014
34a0b15
[SPARK-1978] In some cases, spark-yarn does not automatically restart…
witgo Jun 10, 2014
f59bead
[SPARK-2076][SQL] Pushdown the join filter & predication for outer join
chenghao-intel Jun 10, 2014
a37e7a4
HOTFIX: Fix Python tests on Jenkins.
pwendell Jun 10, 2014
cfa8c4e
HOTFIX: Increase time limit for Bagel test
ankurdave Jun 10, 2014
93c0e0e
[SQL] Add average overflow test case from #978
egraldlo Jun 10, 2014
a32746f
[SPARK-1998] SparkFlumeEvent with body bigger than 1020 bytes are not…
joyyoj Jun 11, 2014
576e2d0
[SPARK-1940] Enabling rolling of executor logs, and automatic cleanup…
tdas Jun 11, 2014
9645119
Resolve scalatest warnings during build
witgo Jun 11, 2014
eaccf3b
[SPARK-2065] give launched instances names
nchammas Jun 11, 2014
d78cc81
HOTFIX: clear() configs in SQLConf-related unit tests.
concretevitamin Jun 11, 2014
be2ec07
[SPARK-2093] [SQL] NullPropagation should use exact type value.
ueshin Jun 11, 2014
487153b
[SPARK-1968][SQL] SQL/HiveQL command for caching/uncaching tables
liancheng Jun 11, 2014
f1fe720
[SPARK-2091][MLLIB] use numpy.dot instead of ndarray.dot
mengxr Jun 11, 2014
764ffcf
SPARK-2107: FilterPushdownSuite doesn't need Junit jar.
Qiuzhuang Jun 11, 2014
974bf33
SPARK-1639. Tidy up some Spark on YARN code
sryza Jun 11, 2014
1acf918
[SPARK-2069] MIMA false positives
ScrapCodes Jun 11, 2014
7e8809d
[SPARK-2108] Mark SparkContext methods that return block information …
ScrapCodes Jun 11, 2014
2dc0b65
SPARK-2113: awaitTermination() after stop() will hang in Spark Stremaing
Jun 11, 2014
ce0886f
[SPARK-2042] Prevent unnecessary shuffle triggered by take()
sameeragarwal Jun 11, 2014
e991148
[SQL] Code Cleanup: Left Semi Hash Join
adrian-wang Jun 11, 2014
5b4ac43
HOTFIX: A few PySpark tests were not actually run
andrewor14 Jun 11, 2014
6b02002
HOTFIX: PySpark tests should be order insensitive.
pwendell Jun 11, 2014
c90169a
HOTFIX: Forgot to remove false change in previous commit
pwendell Jun 11, 2014
375476c
[SPARK-2052] [SQL] Add optimization for CaseConversionExpression's.
ueshin Jun 12, 2014
9ec0b85
[SPARK-1672][MLLIB] Separate user and product partitioning in ALS
tmyklebu Jun 12, 2014
19dbc8c
[SPARK-2044] Pluggable interface for shuffles
mateiz Jun 12, 2014
ae3cf5c
'killFuture' is never used
watermen Jun 12, 2014
4f602e5
Cleanup on Connection and ConnectionManager
hsaputra Jun 12, 2014
aad4110
fixed typo in docstring for min()
jkthompson Jun 12, 2014
ca8a046
SPARK-554. Add aggregateByKey.
sryza Jun 12, 2014
fa21377
[SPARK-2088] fix NPE in toString
dorx Jun 12, 2014
35f81c1
[SPARK-2080] Yarn: report HS URL in client mode, correct user in clus…
Jun 12, 2014
cb96cd2
SPARK-1843: Replace assemble-deps with env variable.
pwendell Jun 12, 2014
b4dc414
SPARK-2085: [MLlib] Apply user-specific regularization instead of uni…
Jun 13, 2014
166ec29
document laziness of parallelize
Jun 13, 2014
319f578
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
dorx Jun 13, 2014
e031c2f
[Minor] Fix style, formatting and naming in BlockManager etc.
andrewor14 Jun 13, 2014
c875a03
[SPARK-1516]Throw exception in yarn client instead of run system.exit…
codeboyyong Jun 13, 2014
425f85c
[SPARK-2135][SQL] Use planner for in-memory scans
marmbrus Jun 13, 2014
912e9f6
[HOTFIX] add math3 version to pom
mengxr Jun 13, 2014
697b7ad
Workaround in Spark for ConcurrentModification issue (JIRA Hadoop-104…
nishkamravi2 Jun 13, 2014
fa9a017
[SPARK-1964][SQL] Add timestamp to HiveMetastoreTypes.toMetastoreType
marmbrus Jun 13, 2014
bb0ddc2
[SPARK-2094][SQL] "Exactly once" semantics for DDL and command statem…
liancheng Jun 13, 2014
2b9b907
Small correction in Streaming Programming Guide doc
akkomar Jun 13, 2014
51577ba
[Spark-2137][SQL] Timestamp UDFs broken
yhuai Jun 14, 2014
6e69569
[SPARK-2079] Support batching when serializing SchemaRDD to Python
kanzhang Jun 14, 2014
bd56556
[SPARK-2013] Documentation for saveAsPickleFile and pickleFile in Python
kanzhang Jun 14, 2014
79ef2d8
[SPARK-1837] NumericRange should be partitioned in the same way as ot…
kanzhang Jun 14, 2014
d1de0aa
[SQL] Support transforming TreeNodes with Option children.
marmbrus Jun 15, 2014
b19966d
[SPARK-937] adding EXITED executor state and not relaunching cleanly …
kanzhang Jun 15, 2014
7658285
SPARK-1999: StorageLevel in storage tab and RDD Storage Info never ch…
CrazyJvm Jun 16, 2014
9176e53
SPARK-2148 Add link to requirements for custom equals() and hashcode(…
ash211 Jun 16, 2014
5ce095b
Updating docs to include missing information about reducers and clari…
alig Jun 16, 2014
52b2ed5
SPARK-2039: apply output dir existence checking for all output formats
CodingCat Jun 16, 2014
7133ad9
[SPARK-2010] Support for nested data in PySpark SQL
kanzhang Jun 16, 2014
692a679
[SPARK-1930] The Container is running beyond physical memory limits, …
witgo Jun 16, 2014
4877e05
[SQL][SPARK-2094] Follow up of PR #1071 for Java API
liancheng Jun 16, 2014
3d795ee
Minor fix: made "EXPLAIN" output to play well with JDBC output format
liancheng Jun 16, 2014
1b2dc5a
MLlib documentation fix
afomenko Jun 17, 2014
8df44f1
[SPARK-2130] End-user friendly String repr for StorageLevel in Python
kanzhang Jun 17, 2014
8d65d44
SPARK-1990: added compatibility for python 2.6 for ssh_read command
AtlasPilotPuppy Jun 17, 2014
a2ef7f2
SPARK-2035: Store call stack for stages, display it on the UI.
darabos Jun 17, 2014
723bdf7
[SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks
andrewor14 Jun 17, 2014
f396726
[SPARK-2164][SQL] Allow Hive UDF on columns of type struct
conviva-zz Jun 17, 2014
8b77c36
[SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN.
concretevitamin Jun 17, 2014
b847fe5
SPARK-1063 Add .sortBy(f) method on RDD
ash211 Jun 17, 2014
5b7ab96
SPARK-2146. Fix takeOrdered doc
sryza Jun 17, 2014
c6ebbda
SPARK-2038: rename "conf" parameters in the saveAsHadoop functions
CodingCat Jun 17, 2014
f81fc0f
[SPARK-2147 / 2161] Show removed executors on the UI
andrewor14 Jun 17, 2014
1c3a0ab
HOTFIX: bug caused by #941
pwendell Jun 17, 2014
a8694be
[SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL
yhuai Jun 18, 2014
d5e008f
Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop func…
pwendell Jun 18, 2014
0126cea
[STREAMING] SPARK-2009 Key not found exception when slow receiver starts
vchekan Jun 18, 2014
d562dec
[SPARK-2176][SQL] Extra unnecessary exchange operator in the result o…
yhuai Jun 18, 2014
97f9315
[SPARK-2162] Double check in doGetLocal to avoid read on removed block.
colorant Jun 18, 2014
422e642
Updated the comment for SPARK-2162.
rxin Jun 18, 2014
9869bea
[SPARK-1466] Raise exception if pyspark Gateway process doesn't start.
kayousterhout Jun 18, 2014
b11c599
SPARK-2158 Clean up core/stdout file from FileAppenderSuite
markhamstra Jun 18, 2014
21522a3
Remove unicode operator from RDD.scala
dorx Jun 18, 2014
4a0c6b5
[SPARK-2184][SQL] AddExchange isn't idempotent
marmbrus Jun 19, 2014
85babcb
Squishing a typo bug before it causes real harm
dorx Jun 19, 2014
814827b
[SPARK-2187] Explain should not run the optimizer twice.
rxin Jun 19, 2014
b30b28c
Minor fix
WangTaoTheTonic Jun 19, 2014
60c613e
[SPARK-2051]In yarn.ClientBase spark.yarn.dist.* do not work
witgo Jun 19, 2014
cc40c06
[SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more …
marmbrus Jun 19, 2014
38df2e3
[SPARK-2151] Recognize memory format for spark-submit
nishkamravi2 Jun 20, 2014
10d23c4
A few minor Spark SQL Scaladoc fixes.
rxin Jun 20, 2014
64db752
HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
pwendell Jun 20, 2014
adeca4f
More minor scaladoc cleanup for Spark SQL.
rxin Jun 20, 2014
af22207
[SQL] Improve Speed of InsertIntoHiveTable
marmbrus Jun 20, 2014
05255de
[SPARK-2177][SQL] describe table result contains only one column
yhuai Jun 20, 2014
aac5872
SPARK-1293 [SQL] Parquet support for nested types
AndreSchumacher Jun 20, 2014
32c30ca
[SPARK-2210] cast to boolean on boolean value gets turned into NOT((b…
rxin Jun 20, 2014
39693e1
[SPARK-2209][SQL] Cast shouldn't do null check twice.
rxin Jun 20, 2014
a62c682
SPARK-2203: PySpark defaults to use same num reduce partitions as map…
aarondav Jun 20, 2014
47a6428
[SPARK-2196] [SQL] Fix nullability of CaseWhen.
ueshin Jun 20, 2014
8b9b33d
[SPARK-2218] rename Equals to EqualTo in Spark SQL expressions.
rxin Jun 20, 2014
38008be
[SPARK-2163] class LBFGS optimize with Double tolerance instead of Int
BaiGang Jun 20, 2014
3060c81
SPARK-1868: Users should be allowed to cogroup at least 4 RDDs
douglaz Jun 20, 2014
bb2a4e2
SPARK-2180: support HAVING clauses in Hive queries
willb Jun 20, 2014
9a56b58
[SPARK-2225] Turn HAVING without GROUP BY into WHERE.
rxin Jun 20, 2014
713c103
Clean up CacheManager et al.
andrewor14 Jun 21, 2014
8505024
Move ScriptTransformation into the appropriate place.
rxin Jun 21, 2014
50b9f6a
[SQL] Use hive.SessionState, not the thread local SessionState
aarondav Jun 21, 2014
1e6c770
SPARK-1902 Silence stacktrace from logs when doing port failover to p…
ash211 Jun 21, 2014
602d4f8
[SPARK-1970] Update unit test in XORShiftRandomSuite to use ChiSquare…
dorx Jun 21, 2014
1460b75
HOTFIX: Fixing style error introduced by 08d0ac
pwendell Jun 21, 2014
c829cab
[SPARK-2061] Made splits deprecated in JavaRDDLike
AtlasPilotPuppy Jun 21, 2014
11d8332
Fix some tests.
Jun 21, 2014
1d5cbf3
[SQL] Pass SQLContext instead of SparkContext into physical operators.
rxin Jun 21, 2014
bb36494
[SQL] Break hiveOperators.scala into multiple files.
rxin Jun 21, 2014
db255ae
HOTFIX: Fix missing MIMA ignore
pwendell Jun 21, 2014
f6defcd
HOTFIX: Add excludes for new MIMA files
pwendell Jun 21, 2014
8ffa264
SPARK-1996. Remove use of special Maven repo for Akka
srowen Jun 22, 2014
8c7c21b
SPARK-2231: dev/run-tests should include YARN and use a recent Hadoop…
pwendell Jun 22, 2014
50016ca
SPARK-2034. KafkaInputDStream doesn't close resources and may prevent…
srowen Jun 22, 2014
a50e69d
SPARK-1316. Remove use of Commons IO
srowen Jun 22, 2014
b622e01
SPARK-2229: FileAppender throw an llegalArgumentException in jdk6
witgo Jun 23, 2014
c433491
SPARK-2241: quote command line args in ec2 script
orikremer Jun 23, 2014
fb18938
SPARK-2166 - Listing of instances to be terminated before the prompt
Jun 23, 2014
8fc1b52
[SPARK-1395] Fix "local:" URI support in Yarn mode (again).
Jun 23, 2014
7f84c3c
Fixed small running on YARN docs typo
frol Jun 23, 2014
74c2a35
Fix mvn detection
Jun 23, 2014
153cf30
[SPARK-1669][SQL] Made cacheTable idempotent
liancheng Jun 23, 2014
27bec31
[SPARK-2118] spark class should complain if tools jar is missing.
ScrapCodes Jun 23, 2014
4cb80a0
[SPARK-1768] History server enhancements.
Jun 23, 2014
f41dd6a
Cleanup on Connection, ConnectionManagerId, ConnectionManager classes…
hsaputra Jun 24, 2014
bd2fa02
[SPARK-2227] Support dfs command in SQL.
rxin Jun 24, 2014
52ac8a4
[SPARK-2124] Move aggregation into shuffle implementations
jerryshao Jun 24, 2014
e8530db
[SPARK-2252] Fix MathJax for HTTPs.
rxin Jun 24, 2014
7385516
SPARK-1937: fix issue with task locality
Jun 24, 2014
a18f843
HOTFIX: Disabling tests per SPARK-2264
pwendell Jun 24, 2014
b36f603
Fix broken Json tests.
kayousterhout Jun 24, 2014
dfd4a7f
[SPARK-2264][SQL] Fix failing CachedTableSuite
marmbrus Jun 25, 2014
0b46168
[SPARK-1112, 2156] Bootstrap to fetch the driver's Spark properties.
mengxr Jun 25, 2014
386a0a2
[SQL]Add base row updating methods for JoinedRow
chenghao-intel Jun 25, 2014
e7de368
Autodetect JAVA_HOME on RPM-based systems
Jun 25, 2014
b6ac76a
Fix possible null pointer in acumulator toString
marmbrus Jun 25, 2014
fc9c55b
SPARK-2248: spark.default.parallelism does not apply in local mode
witgo Jun 25, 2014
cdff363
[SPARK-2263][SQL] Support inserting MAP<K, V> to Hive tables
liancheng Jun 25, 2014
84b641b
[BUGFIX][SQL] Should match java.math.BigDecimal when wnrapping Hive o…
liancheng Jun 25, 2014
02ab309
SPARK-2038: rename "conf" parameters in the saveAsHadoop functions wi…
CodingCat Jun 25, 2014
07ea156
Replace doc reference to Shark with Spark SQL.
rxin Jun 25, 2014
faa1743
added support for VPC and placement group to spar_ec2.py
pdeyhim Jun 25, 2014
@@ -0,0 +1,35 @@
package org.apache.spark.streaming.examples

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.StreamingContext._


object KinesisWordCount {
Contributor: Please provide full instructions on how to run this example. See other examples for more details.
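A sketch of the kind of usage header the other streaming examples carry; the run-example invocation and stream name below are illustrative, not part of this PR:

/**
 * Consumes records from an Amazon Kinesis stream and counts words.
 *
 * Usage: KinesisWordCount <master> <streamname> [accesskey] [accessSecretKey]
 *   <master> is the Spark master URL, <streamname> is the Kinesis stream name,
 *   and the access keys are optional (EC2 instance-profile credentials are
 *   used when they are omitted).
 *
 * Example:
 *   ./bin/run-example org.apache.spark.streaming.examples.KinesisWordCount \
 *     local[2] myKinesisStream
 */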

Contributor: Also, please add a Java Kinesis example. Many users ask for Java examples, and not having one can be a significant barrier to trying out the Kinesis support.

Contributor: @pdeyhim - per our offline conversation this weekend, please add a note about running this demo with a master that provides at least two threads (e.g. local[2]); otherwise the KinesisNetworkReceiver thread does not start up, breaking the demo.
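For instance, a minimal sketch (the app name and batch interval mirror the example below):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local" would give the receiver the only available thread and no batches
// would ever be processed; "local[2]" leaves one thread for processing.
val ssc = new StreamingContext("local[2]", "KinesisWordCount", Seconds(2))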


  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: KinesisWordCount <master> <streamname>" +
        " [accesskey] [accessSecretKey]")
      System.exit(1)
    }

    val master = args(0)
    val kinesisStream = args(1)
    // The access keys are optional; when omitted, EC2 instance-profile
    // credentials are used instead (see KinesisReceiver.credentialsProvider).
    val accesskey = if (args.length > 2) args(2) else ""
    val accessSecretKey = if (args.length > 3) args(3) else ""

    val ssc = new StreamingContext(master, "KinesisWordCount", Seconds(2),
      System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass).toSeq)

    val lines = KinesisUtils.createStream(ssc, accesskey, accessSecretKey, kinesisStream)

    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
73 changes: 73 additions & 0 deletions external/AmazonKinesis/pom.xml
@@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one or more
  ~ contributor license agreements. See the NOTICE file distributed with
  ~ this work for additional information regarding copyright ownership.
  ~ The ASF licenses this file to You under the Apache License, Version 2.0
  ~ (the "License"); you may not use this file except in compliance with
  ~ the License. You may obtain a copy of the License at
  ~
  ~    http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing, software
  ~ distributed under the License is distributed on an "AS IS" BASIS,
  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~ See the License for the specific language governing permissions and
  ~ limitations under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-parent</artifactId>
    <version>1.0.0-incubating-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-amazonkinesis</artifactId>
  <packaging>jar</packaging>
  <name>Spark Project External Amazon Kinesis</name>
  <url>http://spark.incubator.apache.org/</url>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.binary.version}</artifactId>
      <version>${project.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.binary.version}</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalacheck</groupId>
      <artifactId>scalacheck_${scala.binary.version}</artifactId>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.novocode</groupId>
      <artifactId>junit-interface</artifactId>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
    <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
    <plugins>
      <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
      </plugin>
    </plugins>
  </build>
</project>
@@ -0,0 +1,164 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.streaming.kinesis

import java.net.{InetAddress, UnknownHostException}
import java.nio.charset.Charset
import java.util.{List, UUID}

import scala.collection.JavaConversions._
import scala.reflect.ClassTag

import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, BasicAWSCredentials,
  InstanceProfileCredentialsProvider}
import com.amazonaws.services.kinesis.clientlibrary.exceptions.{InvalidStateException,
  ShutdownException, ThrottlingException}
import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor,
  IRecordProcessorCheckpointer, IRecordProcessorFactory}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.{KinesisClientLibConfiguration,
  Worker}
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownReason
import com.amazonaws.services.kinesis.model.Record

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.{NetworkInputDStream, NetworkReceiver}


private[streaming]
class KinesisInputDStream[T: ClassTag](
Contributor: Please provide docs on these classes, especially on the KinesisReceiver. The documentation must be sufficient for any other developer to look at the code, understand the control/data flow, and debug when required.

    @transient ssc_ : StreamingContext,
    accesskey: String,
    accessSecretKey: String,
    kinesisStream: String,
    kinesisEndpoint: String,
    storageLevel: StorageLevel
  ) extends NetworkInputDStream[String](ssc_) {

  override def getReceiver(): NetworkReceiver[String] = {
    new KinesisReceiver(accesskey, accessSecretKey, kinesisStream, kinesisEndpoint, storageLevel)
  }
}
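A sketch of the class-level doc the reviewer is requesting; the control/data-flow summary is inferred from this diff and is not authoritative:

/**
 * Input stream that pulls records from an Amazon Kinesis stream.
 *
 * getReceiver() returns a KinesisReceiver, which Spark runs on a worker node.
 * In onStart(), the receiver builds a Kinesis Client Library (KCL) Worker
 * wired to an IRecordProcessorFactory: each shard gets an IRecordProcessor
 * whose processRecords() decodes records as UTF-8 strings, appends them to
 * the BlockGenerator for storage at the configured StorageLevel, and
 * checkpoints progress to DynamoDB with bounded retries on throttling.
 */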


private[streaming]
class KinesisReceiver[T: ClassTag](
    accesskey: String,
    accessSecretKey: String,
    kinesisStream: String,
    kinesisEndpoint: String,
    storageLevel: StorageLevel
  ) extends NetworkReceiver[String] {

  val NUM_RETRIES = 5
  val BACKOFF_TIME_IN_MILLIS = 2000
  var workerId = UUID.randomUUID().toString()

  // Prefer the explicitly supplied access keys; fall back to EC2
  // instance-profile credentials when either key is empty.
  lazy val credentialsProvider = new AWSCredentialsProvider {

    def getCredentials(): AWSCredentials = {
      if (accesskey.isEmpty() || accessSecretKey.isEmpty()) {
        new InstanceProfileCredentialsProvider().getCredentials()
      } else {
        new BasicAWSCredentials(accesskey, accessSecretKey)
      }
    }

    def refresh() {}
  }

  try {
    workerId = InetAddress.getLocalHost().getCanonicalHostName() + ":" + UUID.randomUUID()
  } catch {
    case e: UnknownHostException => e.printStackTrace()
  }

  private lazy val decoder = Charset.forName("UTF-8").newDecoder()
  // The Kinesis stream name doubles as the KCL application name, which names
  // the DynamoDB table used for checkpointing.
  private lazy val kinesisClientLibConfiguration =
    new KinesisClientLibConfiguration(kinesisStream, kinesisStream, credentialsProvider, workerId)
      .withKinesisEndpoint(kinesisEndpoint)
  private lazy val blockGenerator = new BlockGenerator(storageLevel)

  protected override def onStart() {
    blockGenerator.start()

    // Each Kinesis shard is handled by an IRecordProcessor built by this factory.
    lazy val recordProcessorFactory: IRecordProcessorFactory = new IRecordProcessorFactory {
      def createProcessor(): IRecordProcessor = new IRecordProcessor {
Contributor: Please fix the indentation. Refer to the Spark Code Style Guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide


        def initialize(shardId: String) {
          logInfo("Starting with shardId: " + shardId)
        }

        def processRecords(records: List[Record], checkpointer: IRecordProcessorCheckpointer) {
          // Decode each record as UTF-8, push it to Spark's block generator
          // for storage, then checkpoint our position in the shard.
          records.toList.foreach(record => blockGenerator += decoder.decode(record.getData()).toString())
          checkpoint(checkpointer)
        }

        def shutdown(checkpointer: IRecordProcessorCheckpointer, reason: ShutdownReason) {
          logInfo("Shutting down Kinesis receiver: " + reason)
        }
      }
    }

    // Worker.run() blocks, pulling records from the stream's shards until stopped.
    val worker = new Worker(recordProcessorFactory, kinesisClientLibConfiguration)
    worker.run()
  }

  /**
   * Checkpoint this worker's progress to the KCL's DynamoDB table, backing
   * off and retrying up to NUM_RETRIES times on transient failures.
   */
  private def checkpoint(checkpointer: IRecordProcessorCheckpointer) {
Contributor: Please document what these functions are doing.


    for (i <- 1 to NUM_RETRIES) {
      try {
        checkpointer.checkpoint()
        return  // checkpoint succeeded; do not sleep or retry
      } catch {
        case se: ShutdownException =>
          // This processor has lost its shard lease; the new lease holder
          // will checkpoint instead.
          logInfo("Caught shutdown exception, skipping checkpoint.", se)
          return
        case e: ThrottlingException =>
          // Back off and re-attempt the checkpoint upon transient failures.
          if (i >= NUM_RETRIES) {
            logInfo("Checkpoint failed after " + i + " attempts.", e)
            return
          } else {
            logInfo("Transient issue when checkpointing - attempt " + i + " of " + NUM_RETRIES, e)
          }
        case e: InvalidStateException =>
          logInfo("Cannot save checkpoint to the DynamoDB table used by " +
            "the Amazon Kinesis Client Library.", e)
          return
      }
      try {
        Thread.sleep(BACKOFF_TIME_IN_MILLIS)
      } catch {
        case e: InterruptedException => logInfo("Interrupted sleep", e)
      }
    }
  }

  protected override def onStop() {
    blockGenerator.stop()
    logInfo("Amazon Kinesis receiver stopped")
  }
}
@@ -0,0 +1,48 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.streaming.kinesis

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.api.java.JavaDStream

object KinesisUtils {

  /**
   * Create an input stream that pulls records from an Amazon Kinesis stream.
   * When accesskey and accessSecretKey are left empty, EC2 instance-profile
   * credentials are used instead.
   */
  def createStream(
      ssc: StreamingContext,
      accesskey: String = "",
      accessSecretKey: String = "",
      kinesisStream: String,
      kinesisEndpoint: String = "https://kinesis.us-east-1.amazonaws.com",
      storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY_SER_2
    ): DStream[String] = {
    new KinesisInputDStream(ssc, accesskey, accessSecretKey, kinesisStream,
      kinesisEndpoint, storageLevel)
  }

  def createStream(
Contributor: Please provide complete Java documentation on how these functions are used. Refer to other XYZUtils to get an idea.

      jssc: JavaStreamingContext,
      accesskey: String,
      accessSecretKey: String,
      kinesisStream: String,
      kinesisEndpoint: String,
      storageLevel: StorageLevel
    ): JavaDStream[String] = {
    new KinesisInputDStream(jssc.ssc, accesskey, accessSecretKey, kinesisStream,
      kinesisEndpoint, storageLevel)
Contributor: Scala style issue.

  }
}
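A sketch of the fuller documentation the reviewer is asking for on the Java-friendly overload; the parameter descriptions are inferred from this diff rather than from any official docs:

/**
 * Create an input stream that pulls records from an Amazon Kinesis stream
 * (Java-friendly variant).
 *
 * @param jssc            JavaStreamingContext to attach the stream to
 * @param accesskey       AWS access key ID; if empty, EC2 instance-profile
 *                        credentials are used instead
 * @param accessSecretKey AWS secret access key
 * @param kinesisStream   Kinesis stream name, also used as the KCL application
 *                        name (which names the DynamoDB checkpoint table)
 * @param kinesisEndpoint Kinesis endpoint URL, e.g.
 *                        https://kinesis.us-east-1.amazonaws.com
 * @param storageLevel    storage level for received blocks
 */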
@@ -0,0 +1,34 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.streaming.kinesis;


import org.junit.Test;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.LocalJavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaDStream;

public class JavaKinesisStreamSuite extends LocalJavaStreamingContext {
  @Test
  public void testKinesisStream() {
    // Only verifies that the Java API is callable and a DStream is constructed;
    // no records are received (see the discussion on the Scala suite).
    JavaDStream<String> test1 = KinesisUtils.createStream(ssc,
      "x", "y", "z", "1", StorageLevel.MEMORY_AND_DISK_SER_2());
Contributor: Same comment as the Scala unit test.

  }
}

@@ -0,0 +1,30 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.streaming.kinesis

import org.apache.spark.streaming.{StreamingContext, TestSuiteBase}
import org.apache.spark.storage.StorageLevel

class KinesisStreamSuite extends TestSuiteBase {

  test("Kinesis input stream") {
    val ssc = new StreamingContext(master, framework, batchDuration)
    // Only verifies stream construction; receiving data would require AWS
    // credentials (see the discussion below).
    val test1 = KinesisUtils.createStream(ssc, accesskey = "x", accessSecretKey = "y",
      kinesisStream = "z")
Contributor: This unit test does not really test anything. Is it possible to add a unit test that actually tests receiving data? Without proper unit tests, we have a lot of trouble understanding and analyzing failures.

Author: Unless I include AWS credentials in this test, there's no way to receive data from Kinesis. I'll take another look and see if I can come up with something more comprehensive, but very little can be done without credentials.

    ssc.stop()
  }
}
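As a possible middle ground until credentials are available, a sketch of a check that exercises the wiring without contacting AWS (the test name and assertion are illustrative and would slot into KinesisStreamSuite):

test("Kinesis stream setup does not contact AWS") {
  val ssc = new StreamingContext(master, framework, batchDuration)
  // Building the DStream only constructs the graph; the receiver (and any
  // AWS call) would start only on ssc.start(), which is never invoked here.
  val stream = KinesisUtils.createStream(ssc, accesskey = "x",
    accessSecretKey = "y", kinesisStream = "z")
  assert(stream != null)
  ssc.stop()
}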