-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18334] MinHash should use binary hash distance #15800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 4 commits
559c099
517a97b
b546dbd
a3cd928
c8243c7
6aac8b3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,13 +32,7 @@ import org.apache.spark.sql.types.StructType | |
| * :: Experimental :: | ||
| * | ||
| * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is | ||
| * a perfect hash function: | ||
| * `h_i(x) = (x * k_i mod prime) mod numEntries` | ||
| * where `k_i` is the i-th coefficient, and both `x` and `k_i` are from `Z_prime^*` | ||
| * | ||
| * Reference: | ||
| * [[https://en.wikipedia.org/wiki/Perfect_hash_function Wikipedia on Perfect Hash Function]] | ||
| * | ||
| * a perfect hash. | ||
| * @param numEntries The number of entries of the hash functions. | ||
| * @param randCoefficients An array of random coefficients, each used by one hash function. | ||
| */ | ||
|
|
@@ -76,7 +70,19 @@ class MinHashModel private[ml] ( | |
| @Since("2.1.0") | ||
| override protected[ml] def hashDistance(x: Vector, y: Vector): Double = { | ||
| // Since it's generated by hashing, it will be a pair of dense vectors. | ||
| x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min | ||
| if (x.toDense.values.zip(y.toDense.values).exists(pair => pair._1 == pair._2)) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why just 0 and 1? I think if more pairs of values are the same, more the two vectors are closer, right?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See discussion above :)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I do more agree on the comment from @jkbradley at #15800 (comment), if I understand correctly some terms here. Is the indicator meaning a matching hashing value between two vectors from one hashing function, i.e., h_i? |
||
| 0 | ||
| } else { | ||
| 1 | ||
| } | ||
| } | ||
|
|
||
| @Since("2.1.0") | ||
| override protected[this] def checkNearestNeighbor(singleProbe: Boolean) = { | ||
| if (!singleProbe) { | ||
| log.warn("Multi-probe for MinHash will run brute force nearest neighbor when there " + | ||
| "aren't enough candidates.") | ||
| } | ||
| } | ||
|
|
||
| @Since("2.1.0") | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbradley is it not more confusing to not have any reference or further explanation here? You mentioned in #15148 (comment) to remove this, but should we not have some doc here or a better reference instead of nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's technically a perfect hash function because it's being "perfect" depends on the number of buckets used, right?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(But I do like the idea of having more references.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Thanks for all of your suggestions!