Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

minPartitions has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value.

/**
* Calculate the offset ranges that we are going to process this batch. If `minPartitions`
* is not set or is set less than or equal the number of `topicPartitions` that we're going to
* consume, then we fall back to a 1-1 mapping of Spark tasks to Kafka partitions. If
* `numPartitions` is set higher than the number of our `topicPartitions`, then we will split up
* the read tasks of the skewed partitions to multiple Spark tasks.
* The number of Spark tasks will be *approximately* `numPartitions`. It can be less or more
* depending on rounding errors or Kafka partitions that didn't receive any new data.
*
* Empty ranges (`KafkaOffsetRange.size <= 0`) will be dropped.
*/
def getRanges(
fromOffsets: PartitionOffsetMap,
untilOffsets: PartitionOffsetMap,
executorLocations: Seq[String] = Seq.empty): Seq[KafkaOffsetRange] = {

This patch makes clear the configuration is a hint, and actual partitions could be less or more.

How was this patch tested?

Just a documentation change.

@HeartSaVioR HeartSaVioR changed the title [MINOR][SS] Correct description of minPartitions in Kafka option [MINOR][DOC][SS] Correct description of minPartitions in Kafka option Aug 2, 2019
@HeartSaVioR
Copy link
Contributor Author

cc. @dongjoon-hyun @zsxwing

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108535 has finished for PR 25332 at commit ffe033c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you for this follow-up, @HeartSaVioR !
Merged to master/branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Aug 2, 2019
## What changes were proposed in this pull request?

`minPartitions` has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value.

https://github.com/apache/spark/blob/d67b98ea016e9b714bef68feaac108edd08159c9/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala#L32-L46

This patch makes clear the configuration is a hint, and actual partitions could be less or more.

## How was this patch tested?

Just a documentation change.

Closes #25332 from HeartSaVioR/MINOR-correct-kafka-structured-streaming-doc-minpartition.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 7ffc00c)
Signed-off-by: Dongjoon Hyun <[email protected]>
@HeartSaVioR
Copy link
Contributor Author

Thanks for the quick review and merge!

@HeartSaVioR HeartSaVioR deleted the MINOR-correct-kafka-structured-streaming-doc-minpartition branch August 2, 2019 20:47
rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
## What changes were proposed in this pull request?

`minPartitions` has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value.

https://github.com/apache/spark/blob/d67b98ea016e9b714bef68feaac108edd08159c9/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala#L32-L46

This patch makes clear the configuration is a hint, and actual partitions could be less or more.

## How was this patch tested?

Just a documentation change.

Closes apache#25332 from HeartSaVioR/MINOR-correct-kafka-structured-streaming-doc-minpartition.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 7ffc00c)
Signed-off-by: Dongjoon Hyun <[email protected]>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
## What changes were proposed in this pull request?

`minPartitions` has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value.

https://github.com/apache/spark/blob/d67b98ea016e9b714bef68feaac108edd08159c9/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala#L32-L46

This patch makes clear the configuration is a hint, and actual partitions could be less or more.

## How was this patch tested?

Just a documentation change.

Closes apache#25332 from HeartSaVioR/MINOR-correct-kafka-structured-streaming-doc-minpartition.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 7ffc00c)
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants