-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-18450][ML] Scala API Change for LSH AND-amplification #17092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #73550 has finished for PR 17092 at commit
|
|
@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous one! @merlintang I am happy to continue our discussion here: https://issues.apache.org/jira/browse/SPARK-19771 as OR-AND amplification requires much more changes than SPARK-18450 |
|
@Yunni ok, let us discuss the further optimization step in other ticket. Let me manually check and test this patch, because I have one concern here. I will let you know later. |
|
@Yunni I test this patch locally, it can work, but I have one idea to improve it. We can discuss it in other ticket. |
|
@jkbradley @sethah Please take a review when you have time. Thanks! |
|
Ping. |
|
@MLnick @jkbradley @sethah Could you take a review? Thanks! |
|
@jkbradley @MLnick @sethah @Yunni @merlintang @akatz |
WeichenXu123
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposing we want to support OR-AND amplification in the future, how will the API be added or changed ? Add an boolean parameter to specify OR-AND / AND-OR ?
and maybe the names of numHashFunctions and numHashTables are a little confusing for users.
| @Since("2.1.0") | ||
| override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value) | ||
|
|
||
| @Since("2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Since("2.4.0")
| @Since("2.1.0") | ||
| override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value) | ||
|
|
||
| @Since("2.2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
|
ping @Yunni |
Closes apache#17422 Closes apache#17619 Closes apache#18034 Closes apache#18229 Closes apache#18268 Closes apache#17973 Closes apache#18125 Closes apache#18918 Closes apache#19274 Closes apache#19456 Closes apache#19510 Closes apache#19420 Closes apache#20090 Closes apache#20177 Closes apache#20304 Closes apache#20319 Closes apache#20543 Closes apache#20437 Closes apache#21261 Closes apache#21726 Closes apache#14653 Closes apache#13143 Closes apache#17894 Closes apache#19758 Closes apache#12951 Closes apache#17092 Closes apache#21240 Closes apache#16910 Closes apache#12904 Closes apache#21731 Closes apache#21095 Added: Closes apache#19233 Closes apache#20100 Closes apache#21453 Closes apache#21455 Closes apache#18477 Added: Closes apache#21812 Closes apache#21787 Author: hyukjinkwon <[email protected]> Closes apache#21781 from HyukjinKwon/closing-prs.
What changes were proposed in this pull request?
Implemented a new Param numHashFunctions as the dimension of AND-amplification for Locality Sensitive Hashing. Now the hash of each feature in LSH is an array of size numHashTables while each element in the array is a vector of size numHashFunctions.
Two features are in the same hash bucket iff ANY pair of the vectors are equal (OR-amplification). Two vectors are equal iff ALL pair of the vector entries are equal (AND-amplification).
Will create follow-up PRs for Python API and Doc/Examples.
How was this patch tested?
By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.