[SPARK-28163][SS] Use CaseInsensitiveMap for KafkaOffsetReader #24967
Conversation
cc @HyukjinKwon and @HeartSaVioR since it's connected to SPARK-28142.

Test build #106887 has finished for PR 24967 at commit
```diff
-      .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))
-      .map { k => k.drop(6).toString -> parameters(k) }
-      .toMap
+    val specifiedKafkaParams = convertToSpecifiedParams(caseInsensitiveOptions)
```
This is not a required change but thought it would be good to simplify.
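The simplification being suggested can be captured in a small helper. Below is a hedged, standalone sketch of what such a helper does, based on the inlined code above: keep only the options whose key starts with "kafka." (case-insensitively) and strip that 6-character prefix. The helper name mirrors the one in the diff, but this rewrite is illustrative, not the connector's exact code.

```scala
import java.util.Locale

// Illustrative sketch: extract "kafka."-prefixed options and drop the prefix.
def convertToSpecifiedParams(parameters: Map[String, String]): Map[String, String] = {
  parameters
    .keySet
    .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))
    .map { k => k.drop(6) -> parameters(k) }  // "kafka." is 6 characters
    .toMap
}

// Only the "kafka."-prefixed entry survives, without its prefix.
val converted = convertToSpecifiedParams(
  Map("kafka.bootstrap.servers" -> "host:9092", "subscribe" -> "topic"))
```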
I feel the reason for such confusion is that we don't explicitly require the CaseInsensitiveMap type wherever it's needed. If we require lowercase keys, it shouldn't be referred to as "case-insensitive". It's case-sensitive, and if we can't expect the same from Kafka, we will get into trouble passing configs to Kafka.
@HeartSaVioR thanks for your comment! This case-sensitivity issue has disturbed me for at least half a year, so I'm pretty open about what less error-prone solution we can come up with. Since I've found another issue, I've invested more time and analyzed it thoroughly. Let me share my thinking, and please do the same so we can find out how to proceed. First, let's start with the actual implementation. The majority of this code uses case-insensitive maps in some way.
This is true.
I agree with your statement and we can discuss how to name it. A map with lowercase keys is referred to as "case-insensitive" because the user doesn't have to care about case when the put/get/contains operations are used. If the underlying conversion applied to the keys changes consistently, the mentioned logic is still true. To find out whether the user has provided a configuration, I think it's the most robust solution. The other use-case is to extract the entries from these maps which start with a given prefix. The majority of this code uses the mentioned case-insensitive maps, but there are some parts which break this (for example the part which we're now modifying). As a final conclusion from my perspective:

That said, everything can be discussed, so waiting on thoughts...
First of all, thanks for the detailed analysis, @gaborgsomogyi! I mostly agree with your analysis; just to say, we are mixing up
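The distinction under discussion, a map that merely stores lowercased keys versus one whose lookups also normalize the caller's key, can be shown with a minimal sketch. This is an illustration only, not Spark's actual CaseInsensitiveMap; the class name and option name below are made up.

```scala
import java.util.Locale

// Both the stored keys AND the lookup key are lowercased, so the caller
// never has to care about case when using get/contains.
class LowercaseKeyedMap[V](original: Map[String, V]) {
  private val normalized: Map[String, V] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[V] = normalized.get(key.toLowerCase(Locale.ROOT))
  def contains(key: String): Boolean = normalized.contains(key.toLowerCase(Locale.ROOT))
}

val m = new LowercaseKeyedMap(Map("startingOffsets" -> "earliest"))
```

A map that only lowercases its stored keys, without also normalizing the lookup key, would be case-sensitive from the caller's point of view, which is the confusion described above.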
gatorsmile
left a comment
@HeartSaVioR @gaborfeher @dongjoon-hyun We should not skip adding the unit test cases, especially when this PR claims we are fixing a bug. Please add one. Thanks!
@HeartSaVioR I see your point and agree the implementation can be made safer. Since DSv1 provides only a map as the input parameter, this can be converted to be more on the safe side and passed around internally. @gatorsmile Thanks for your review, adding a test...
@HeartSaVioR I've converted it. @gatorsmile I've added tests which cover not only the problematic parts.
Test build #107241 has finished for PR 24967 at commit
Test build #107273 has finished for PR 24967 at commit
Looks good overall. Just my 2 cents: I'm not sure we should prefer renaming fields without a specific reason along with other changes (I'm seeing multiple places doing it), or having individual tests for each JIRA issue which actually do pretty similar things.
The other option was to introduce 2 different methods in the test to get the same

Test build #107344 has finished for PR 24967 at commit
```diff
   */
 class KafkaContinuousStream(
-    offsetReader: KafkaOffsetReader,
+    private val offsetReader: KafkaOffsetReader,
```
Is `private val` required for "Use CaseInsensitiveMap for KafkaOffsetReader"?
Otherwise this code:

```scala
getField(stream, offsetReaderMethod)
```

throws an exception something like:

```
java.lang.IllegalArgumentException: Can't find a private method named: offsetReader
```

The other possibility is to add a getter explicitly, which I thought was overkill.
Another approach would be Scala reflection, but that's definitely more verbose than an explicit getter.
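As a point of comparison for the alternatives discussed here, a private field can also be read with plain Java reflection. The class and field names below are illustrative, not the actual Kafka connector classes.

```scala
// Hypothetical stand-in for a stream class holding a private field.
class FakeStream(private val offsetReader: String) {
  override def toString: String = s"FakeStream($offsetReader)" // keeps the field referenced
}

val stream = new FakeStream("reader-1")

// Read the private field via Java reflection.
val field = stream.getClass.getDeclaredField("offsetReader")
field.setAccessible(true) // allow access despite the private modifier
val value = field.get(stream).asInstanceOf[String]
```

This avoids changing the production class at all, at the cost of stringly-typed field names in the test.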
```diff
   */
 private[kafka010] class KafkaMicroBatchStream(
-    kafkaOffsetReader: KafkaOffsetReader,
+    private val offsetReader: KafkaOffsetReader,
```
- Is `private val` required for "Use CaseInsensitiveMap for KafkaOffsetReader"?
- Is the renaming from `kafkaOffsetReader` to `offsetReader` required for "Use CaseInsensitiveMap for KafkaOffsetReader"?
> Is `private val` required for "Use CaseInsensitiveMap for KafkaOffsetReader"?

Yeah, please see my previous comment. The same applies here.

> Is the renaming from `kafkaOffsetReader` to `offsetReader` required for "Use CaseInsensitiveMap for KafkaOffsetReader"?

In short, no. In a bit more detail: if not renamed, then the test requires something like this:

```scala
private val offsetReaderMicroBatchMethod = PrivateMethod[KafkaOffsetReader]('kafkaOffsetReader)
private val offsetReaderContinuousMethod = PrivateMethod[KafkaOffsetReader]('offsetReader)
...
verifyFieldsInMicroBatchStream(KafkaSourceProvider.FETCH_OFFSET_NUM_RETRY, expected, stream => {
  val kafkaOffsetReader = getField(stream, offsetReaderMicroBatchMethod)
  assert(expected.toInt === getField(kafkaOffsetReader, fetchOffsetNumRetriesMethod))
})
...
```
General considerations on why I've renamed a couple of variables:

- The exact same parameters are sometimes named 2-4 different ways, which is hard to track
- Not renaming them would cause a similar situation in the test code
- Sometimes a `SparkConf` config name and its internal Scala variable have different names
- I renamed only Spark-internal variables, which shouldn't matter for users

We can roll back the renamings and the code will work properly, but I think these changes increase readability.
The rename is rolled back. The `private val` I think is essential.
dongjoon-hyun
left a comment
@gaborgsomogyi. I understand the intention and want to help this PR because it has been under review for 2 weeks already.
However, this PR spreads its focus too eagerly. :) In general, we should not put refactoring and new behavior like CaseInsensitiveMap into a single PR. Could you revert all the irrelevant stuff? You can put it into another PR.
If PRs are kept as concise as possible, we can review and merge them as fast as we can. For simple PRs, as you know, one or two days are enough until merging. So, which one do you want? Do you want to keep all of this in this PR? Or do you think you can split this into simple ones?
HeartSaVioR
left a comment
LGTM, except I still have my 2 cents about renaming fields.

@dongjoon-hyun I think the clean, long-term solution is to use
Okay. If this PR remains in this state, I'll leave it to other committers, @gaborgsomogyi and @HeartSaVioR.

@dongjoon-hyun @HeartSaVioR I've rolled back all the renames.
HeartSaVioR
left a comment
LGTM.
Test build #107403 has finished for PR 24967 at commit
I think @dongjoon-hyun is busy with the 2.4 release, so @vanzin can you have a look please?

@gaborgsomogyi. I'm just leaving this PR to the other committers. I'm not busy on

Ah, ok 🙂
Looks fine.

@HyukjinKwon thanks for investing your time!

Test build #107855 has finished for PR 24967 at commit
@HyukjinKwon Could we revisit this? I guess it's just waiting for a last call to merge.
HyukjinKwon
left a comment
Looks good. cc @zsxwing and @cloud-fan, I will get this in if there are no more comments in a few days.
retest this please

Test build #108669 has finished for PR 24967 at commit
Test build #108718 has finished for PR 24967 at commit

Test build #108820 has finished for PR 24967 at commit

@HyukjinKwon may I ask for another round please?

Thanks, merging to master!

Many thanks, @cloud-fan!
What changes were proposed in this pull request?

There are "unsafe" conversions in the Kafka connector: a `CaseInsensitiveStringMap` comes in and is then converted further. The main problem with this is that in such a case it loses its case-insensitive nature (a case-insensitive map converts the key to lower case when get/contains is called).

In this PR I'm using `CaseInsensitiveMap` to solve this problem.

How was this patch tested?

Existing + additional unit tests.
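The case-insensitivity loss that this PR fixes can be sketched as follows. The option name is a real Kafka source option, but the code is illustrative: entries are stored with lowercased keys, and only a lookup that normalizes the caller's key finds them; a plain Map lookup with the user's original casing silently misses.

```scala
import java.util.Locale

// Options as a case-insensitive map would store them: keys lowercased.
val options = Map("failondataloss" -> "true")

// A case-insensitive lookup normalizes the caller's key before searching.
def caseInsensitiveGet(key: String): Option[String] =
  options.get(key.toLowerCase(Locale.ROOT))

val hit = caseInsensitiveGet("failOnDataLoss") // found: key is normalized
val miss = options.get("failOnDataLoss")       // not found: plain Map is case-sensitive
```

Once the map is flattened and passed around as a plain `Map[String, String]`, every downstream lookup behaves like the second one, which is the bug class this change guards against.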