Skip to content

Conversation

@Gauravshah
Copy link
Contributor

@Gauravshah Gauravshah commented Feb 7, 2017

What changes were proposed in this pull request?

added a limit to getRecords api call call in KinesisBackedBlockRdd. This helps reduce the amount of data returned by kinesis api call making the recovery considerably faster

As we are storing the fromSeqNum & toSeqNum in checkpoint metadata, we can also store the number of records. Which can later be used for api call.

How was this patch tested?

The patch was manually tested

Apologies for any silly mistakes, opening first pull request

@Gauravshah Gauravshah changed the title SPARK-19304 fix kinesis slow checkpoint recovery [SPARK-19304] fix kinesis slow checkpoint recovery Feb 7, 2017
@Gauravshah Gauravshah changed the title [SPARK-19304] fix kinesis slow checkpoint recovery [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 7, 2017
case class SequenceNumberRange(
streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
recordCount: Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a property of a range -- or when would it not equal (from - to + 1)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure of a better place to put.
from - to != count. Kinesis seqNumber are in order but are not sequential

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but is it an 'input' or 'output'? the usage below makes it look like the caller dictates how many records are in the range, but it doesn't know that ahead of time? I probably misunderstand this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its an input to spark checkpoint metadata. On streaming KinesisReceiver receives records creates blocks & knows about seqNumber, count. When recovering from checkpoint we read back this information from checkpoint and make aws kinesis getRecords call with fromSeqNumber & limit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried this change will break checkpoint recovery, because we use Java serialization, and be a barrier to users from upgrading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure on upgrading, since for code upgrade we need to delete the checkpoint directory and start afresh. I did run this patch and was able to serialize the limit into checkpoint, ( not a scala pro though)

@srowen
Copy link
Member

srowen commented Feb 7, 2017

OK I think I see. MIght still be good for @brkyvz to review.

@SparkQA
Copy link

SparkQA commented Feb 7, 2017

Test build #3561 has finished for PR 16842 at commit b5e544a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Gauravshah
Copy link
Contributor Author

will work on testcases today

@Gauravshah Gauravshah changed the title [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery [WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 7, 2017
@Gauravshah
Copy link
Contributor Author

Jenkins, retest this please

@Gauravshah Gauravshah changed the title [WIP] [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery [SPARK-19304] [Streaming] [Kinesis] fix kinesis slow checkpoint recovery Feb 12, 2017
@Gauravshah
Copy link
Contributor Author

@brkyvz can I do something to take it forward ?

val getRecordsRequest = new GetRecordsRequest
getRecordsRequest.setRequestCredentials(credentials)
getRecordsRequest.setShardIterator(shardIterator)
getRecordsRequest.setLimit(recordCount)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this value is greater than 10000, this will throw an error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@brkyvz
Copy link
Contributor

brkyvz commented Feb 21, 2017

@srowen Do you know if we make the field of a case class an Option and default it as None, would it still fail Java deserialization. I feel like it would

@srowen
Copy link
Member

srowen commented Feb 22, 2017

Yes it would certainly change the format of the default Java serialization. It wouldn't be compatible. The fields would have different types.

@Gauravshah
Copy link
Contributor Author

@srowen I assumed that you cannot update code if you want to recover from checkpoint.

@brkyvz
Copy link
Contributor

brkyvz commented Feb 27, 2017

@Gauravshah Can you please comment on how much faster this PR improved your recovery time?

getRecordsRequest.setRequestCredentials(credentials)
getRecordsRequest.setShardIterator(shardIterator)
getRecordsRequest.setLimit(recordCount)
getRecordsRequest.setLimit(Math.max(recordCount, this.maxGetRecordsLimit))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a min not a max

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked with @tdas offline. We can't guarantee updatability across Spark versions for Spark Streaming, therefore this change is okay. Left two comments on style, then it LGTM.

private[kinesis]
case class SequenceNumberRange(
streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String)
streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one parameter per line:

  streamName: String,
  shardId: String,
  ...

*/
private def getRecordsAndNextKinesisIterator(
shardIterator: String): (Iterator[Record], String) = {
shardIterator: String, recordCount: Int): (Iterator[Record], String) = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, one param per line

@Gauravshah
Copy link
Contributor Author

@brkyvz Thank you for taking this forward. We have batch interval of 2 minutes & takes ~1.1 minutes to process. With older code it takes 10-12 minutes to recover and with limit fix it recovers in 2.5-3 minutes.

* Get records starting from or after the given sequence number.
*/
private def getRecords(iteratorType: ShardIteratorType, seqNum: String): Iterator[Record] = {
private def getRecords(iteratorType: ShardIteratorType, seqNum: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you forgot here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@brkyvz
Copy link
Contributor

brkyvz commented Mar 5, 2017

retest this please

@brkyvz
Copy link
Contributor

brkyvz commented Mar 5, 2017

okay to test

@brkyvz
Copy link
Contributor

brkyvz commented Mar 6, 2017

@srowen Do you know why this hasn't kicked off any tests?

@SparkQA
Copy link

SparkQA commented Mar 6, 2017

Test build #3595 has finished for PR 16842 at commit c8efdcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz
Copy link
Contributor

brkyvz commented Mar 6, 2017

LGTM! Merging to master! Thanks.

@asfgit asfgit closed this in 46a64d1 Mar 6, 2017
@Gauravshah
Copy link
Contributor Author

thanks @srowen & @brkyvz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants