[SPARK-28464][Doc][SS] Document Kafka source minPartitions option #25219
Conversation
ok to test
<tr>
  <td>minPartitions</td>
  <td>int</td>
  <td>0 (disabled)</td>
Thank you for your first contribution, @arunpandianp.
However, this is wrong because it will mislead users into setting 0 and hitting an IllegalArgumentException.
Technically, the default value is None. Just leave this cell blank, like <td></td>.
@dongjoon-hyun thanks for checking, changed it.
Test build #107962 has finished for PR 25219 at commit
  <td></td>
  <td>streaming and batch</td>
  <td>Minimum number of partitions to read from Kafka.
  You can configure Spark to use an arbitrary minimum of partitions to read from Kafka using the minPartitions option.
Let's remove this line because we don't allow an arbitrary number.
  <td>streaming and batch</td>
  <td>Minimum number of partitions to read from Kafka.
  You can configure Spark to use an arbitrary minimum of partitions to read from Kafka using the minPartitions option.
  Normally Spark has a 1-1 mapping of Kafka TopicPartitions to Spark partitions consuming from Kafka.
Normally -> By default, ?
  If you set the minPartitions option to a value greater than your Kafka TopicPartitions,
  Spark will divvy up large Kafka partitions to smaller pieces.
  This option can be set at times of peak loads, data skew, and as your stream is falling behind to increase processing rate.
  It comes at a cost of initializing Kafka consumers at each trigger, which may impact performance if you use SSL when connecting to Kafka.</td>
Let's remove line 401~402, too.
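The documented behavior (splitting large Kafka TopicPartitions into smaller pieces when minPartitions exceeds their count) can be sketched as a simplified model. This is an illustrative sketch only, not Spark's actual implementation; the function name and the proportional-split strategy are assumptions for explanation:

```python
def split_offset_ranges(ranges, min_partitions):
    """Split per-TopicPartition offset ranges into roughly min_partitions
    pieces, proportionally to each range's size.

    ranges: list of (topic_partition, start_offset, end_offset) tuples.
    Returns a list of smaller (topic_partition, start, end) pieces that
    together cover exactly the same offsets.
    """
    total = sum(end - start for _, start, end in ranges)
    pieces = []
    for tp, start, end in ranges:
        size = end - start
        # Give each TopicPartition a share of pieces proportional to its size,
        # but never fewer than one piece (the original 1-1 mapping).
        n = max(1, round(min_partitions * size / total)) if total > 0 else 1
        base, rem = divmod(size, n)
        offset = start
        for i in range(n):
            length = base + (1 if i < rem else 0)
            pieces.append((tp, offset, offset + length))
            offset += length
    return pieces
```

With minPartitions no larger than the number of TopicPartitions, the mapping stays 1-1; with a larger value, each partition's offset range is divided into contiguous sub-ranges.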
Test build #107963 has finished for PR 25219 at commit

Test build #107964 has finished for PR 25219 at commit

@dongjoon-hyun pushed suggested changes.
dongjoon-hyun left a comment
+1, LGTM. Thank you, @arunpandianp .
Merged to master/branch-2.4.
  <td>Minimum number of partitions to read from Kafka.
  By default, Spark has a 1-1 mapping of Kafka TopicPartitions to Spark partitions consuming from Kafka.
  If you set the minPartitions option to a value greater than your Kafka TopicPartitions,
  Spark will divvy up large Kafka partitions to smaller pieces.
The closing </td> is missing. I'll fix that during merging.
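For reference, minPartitions is set like any other Kafka source option. A minimal sketch, assuming an existing SparkSession `spark` and a reachable Kafka broker; "host1:port1" and "topic1" are placeholder values:

```python
# Sketch only: requires a running SparkSession and Kafka broker.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1")  # placeholder broker
      .option("subscribe", "topic1")                     # placeholder topic
      .option("minPartitions", "10")  # ask for at least 10 Spark partitions
      .load())
```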
Adding doc for the kafka source minPartitions option to "Structured Streaming + Kafka Integration Guide". The text is based on the content in https://docs.databricks.com/spark/latest/structured-streaming/kafka.html#configuration

Closes #25219 from arunpandianp/SPARK-28464.

Authored-by: Arun Pandian <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a0a58cf)
Signed-off-by: Dongjoon Hyun <[email protected]>
Welcome to the Apache Spark community, @arunpandianp.
What changes were proposed in this pull request?
Adding doc for the kafka source minPartitions option to "Structured Streaming + Kafka Integration Guide"