[SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery #16842

Gauravshah · 2017-02-07T20:53:33Z

What changes were proposed in this pull request?

Added a limit to the getRecords API call in KinesisBackedBlockRDD. This reduces the amount of data returned by each Kinesis API call, making recovery considerably faster.

Since we already store fromSeqNum and toSeqNum in the checkpoint metadata, we can also store the number of records, which can later be used as the limit for the API call.

How was this patch tested?

The patch was manually tested

Apologies for any silly mistakes; this is my first pull request.

…rd loaded on aws getRecords call

srowen · 2017-02-07T20:56:57Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

 case class SequenceNumberRange(
-    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
+    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
+    recordCount: Int)


Why is this a property of a range -- or when would it not equal (from - to + 1)?

Not sure of a better place to put it.
to - from + 1 is not the count: Kinesis sequence numbers are ordered but not consecutive.
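A minimal sketch of the point above, using made-up sequence numbers (real Kinesis sequence numbers are much longer decimal strings; the values and helper names here are hypothetical and only for illustration). It shows that the endpoints of a range do not determine how many records fall inside it:

```scala
object SeqNumberGapSketch {
  // Hypothetical sequence numbers for one shard: strictly increasing, but with gaps.
  val seqNumbers: Vector[BigInt] =
    Vector(BigInt("1000"), BigInt("1007"), BigInt("1042"), BigInt("1300"))

  // Naive count derived only from the endpoints of the range.
  def arithmeticCount(from: BigInt, to: BigInt): BigInt = to - from + 1

  // Actual number of records whose sequence number lies in [from, to].
  def actualCount(from: BigInt, to: BigInt): Int =
    seqNumbers.count(s => s >= from && s <= to)
}
```

With these values the endpoint arithmetic gives 301 while only 4 records exist in the range, which is why the count has to be stored alongside the range rather than derived from it.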

OK, but is it an 'input' or an 'output'? The usage below makes it look like the caller dictates how many records are in the range, but it doesn't know that ahead of time? I probably misunderstand this.

http://docs.aws.amazon.com/streams/latest/dev/key-concepts.html#sequence-number

It's an input to the Spark checkpoint metadata. While streaming, KinesisReceiver receives records, creates blocks, and knows the sequence numbers and record count for each block. When recovering from a checkpoint, we read this information back and make the AWS Kinesis getRecords call with fromSeqNumber and the limit.
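A rough sketch of that recovery flow, assuming an in-memory stand-in for a shard rather than the AWS SDK. The names `Rec`, `SeqRange`, `fetch`, and `recover` are hypothetical and not taken from the patch; the sketch only illustrates why passing the stored count as a limit reduces how much is fetched:

```scala
object CheckpointRecoverySketch {
  final case class Rec(seqNumber: BigInt, data: String)
  // Stand-in for the checkpointed range metadata, including the stored record count.
  final case class SeqRange(fromSeqNumber: BigInt, toSeqNumber: BigInt, recordCount: Int)

  // In-memory stand-in for a shard; this is not the Kinesis API.
  val shard: Vector[Rec] = Vector(
    Rec(BigInt(10), "a"), Rec(BigInt(14), "b"), Rec(BigInt(23), "c"),
    Rec(BigInt(40), "d"), Rec(BigInt(57), "e"))

  // Stand-in fetch: records at or after `from`, optionally capped at `limit`.
  def fetch(from: BigInt, limit: Option[Int]): Vector[Rec] = {
    val tail = shard.dropWhile(_.seqNumber < from)
    limit.fold(tail)(n => tail.take(n))
  }

  // Returns the records kept for the range and the number of records fetched to get them.
  def recover(range: SeqRange, useLimit: Boolean): (Vector[Rec], Int) = {
    val limit = if (useLimit) Some(range.recordCount) else None
    val fetched = fetch(range.fromSeqNumber, limit)
    (fetched.takeWhile(_.seqNumber <= range.toSeqNumber), fetched.size)
  }
}
```

For a range covering the first three records, both modes keep the same three records, but the unlimited fetch pulls all five while the limited fetch pulls only three.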

I'm worried this change will break checkpoint recovery, because we use Java serialization, and be a barrier to users upgrading.

I'm not sure about upgrading, since a code upgrade requires deleting the checkpoint directory and starting afresh. I did run this patch and was able to serialize the limit into the checkpoint (I'm not a Scala pro, though).

srowen · 2017-02-07T21:12:34Z

OK, I think I see. Might still be good for @brkyvz to review.

SparkQA · 2017-02-07T21:21:10Z

Test build #3561 has finished for PR 16842 at commit b5e544a.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

Gauravshah · 2017-02-07T21:22:58Z

Will work on test cases today.

Gauravshah · 2017-02-08T11:56:21Z

Jenkins, retest this please

Gauravshah · 2017-02-21T01:46:45Z

@brkyvz Is there anything I can do to move this forward?

brkyvz · 2017-02-21T23:11:55Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

    val getRecordsRequest = new GetRecordsRequest
    getRecordsRequest.setRequestCredentials(credentials)
    getRecordsRequest.setShardIterator(shardIterator)
+    getRecordsRequest.setLimit(recordCount)


If this value is greater than 10000, this will throw an error.

brkyvz · 2017-02-21T23:44:10Z

@srowen Do you know whether Java deserialization would still fail if we made the new field of the case class an Option defaulting to None? I feel like it would.

srowen · 2017-02-22T13:33:44Z

Yes it would certainly change the format of the default Java serialization. It wouldn't be compatible. The fields would have different types.

Gauravshah · 2017-02-22T18:11:56Z

@srowen I assumed that you cannot update code if you want to recover from checkpoint.

brkyvz · 2017-02-27T17:52:57Z

@Gauravshah Can you please comment on how much this PR improved your recovery time?

brkyvz · 2017-02-27T18:17:27Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

    getRecordsRequest.setRequestCredentials(credentials)
    getRecordsRequest.setShardIterator(shardIterator)
-    getRecordsRequest.setLimit(recordCount)
+    getRecordsRequest.setLimit(Math.max(recordCount, this.maxGetRecordsLimit))


This should be a min, not a max.
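A small sketch of the clamp being asked for, assuming the 10000 per-call cap mentioned earlier in this review. The object and function names are hypothetical, not the ones in the patch:

```scala
object GetRecordsLimitSketch {
  // Assumed per-call cap on the getRecords limit, per the review comment above.
  val MaxGetRecordsLimit = 10000

  // min keeps small counts as they are and caps large ones at the maximum.
  // max would do the opposite: raise small counts to 10000 and let large ones through.
  def effectiveLimit(recordCount: Int): Int =
    math.min(recordCount, MaxGetRecordsLimit)
}
```

With min, a stored count of 500 stays 500 and a stored count of 25000 becomes 10000, so the request never exceeds the cap.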

brkyvz

Talked with @tdas offline. We can't guarantee updatability across Spark versions for Spark Streaming, therefore this change is okay. Left two comments on style, then it LGTM.

brkyvz · 2017-02-28T20:44:44Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

 private[kinesis]
 case class SequenceNumberRange(
-    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
+    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,


one parameter per line:

streamName: String, shardId: String, ...
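For reference, a sketch of how the case class from the diff might look with one parameter per line. The fields are the ones shown in the diff; the `private[kinesis]` modifier is left out so the snippet stands alone:

```scala
// One parameter per line, using a four-space continuation indent.
case class SequenceNumberRange(
    streamName: String,
    shardId: String,
    fromSeqNumber: String,
    toSeqNumber: String,
    recordCount: Int)
```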

brkyvz · 2017-02-28T20:45:21Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

   */
  private def getRecordsAndNextKinesisIterator(
-      shardIterator: String): (Iterator[Record], String) = {
+      shardIterator: String, recordCount: Int): (Iterator[Record], String) = {


ditto, one param per line

Gauravshah · 2017-03-01T04:53:44Z

@brkyvz Thank you for taking this forward. We have a batch interval of 2 minutes, and each batch takes ~1.1 minutes to process. With the older code, recovery takes 10-12 minutes; with the limit fix, it takes 2.5-3 minutes.

brkyvz · 2017-03-03T18:12:25Z

...al/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala

   * Get records starting from or after the given sequence number.
   */
-  private def getRecords(iteratorType: ShardIteratorType, seqNum: String): Iterator[Record] = {
+  private def getRecords(iteratorType: ShardIteratorType, seqNum: String,


You forgot the one-parameter-per-line style here.

brkyvz · 2017-03-05T00:26:18Z

retest this please

brkyvz · 2017-03-05T22:25:32Z

okay to test

brkyvz · 2017-03-06T01:20:46Z

@srowen Do you know why this hasn't kicked off any tests?

SparkQA · 2017-03-06T09:23:32Z

Test build #3595 has finished for PR 16842 at commit c8efdcf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz · 2017-03-06T18:38:45Z

LGTM! Merging to master! Thanks.

Gauravshah · 2017-03-07T01:33:37Z

thanks @srowen & @brkyvz

added limit to kinesis checkpoint backed rdd to reduce number of reco…

b5e544a

…rd loaded on aws getRecords call

Gauravshah changed the title ~~SPARK-19304 fix kinesis slow checkpoint recovery~~ [SPARK-19304] fix kinesis slow checkpoint recovery Feb 7, 2017

Gauravshah changed the title ~~[SPARK-19304] fix kinesis slow checkpoint recovery~~ [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 7, 2017

srowen reviewed Feb 7, 2017


Gauravshah changed the title ~~[SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery~~ [WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 7, 2017

fixing test cases on b5e544a

d3b8c62

Gauravshah changed the title ~~[WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery~~ [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 12, 2017

brkyvz reviewed Feb 21, 2017


limiting max records limit to 10k for aws limitations

274ee27

brkyvz reviewed Feb 27, 2017


limiting max records limit to 10k for aws limitations

4f5edd3

brkyvz reviewed Feb 28, 2017


scala coding style fixes, one line per param

82499bc

brkyvz reviewed Mar 3, 2017


scala coding style fixes, one line per param

c8efdcf

asfgit closed this in 46a64d1 Mar 6, 2017
