@Yunni
Contributor

@Yunni Yunni commented Feb 28, 2017

What changes were proposed in this pull request?

Implemented a new Param, numHashFunctions, which sets the dimension of AND-amplification for Locality Sensitive Hashing. The hash of each feature in LSH is now an array of size numHashTables, and each element of that array is a vector of size numHashFunctions.

Two features are in the same hash bucket iff ANY pair of their hash vectors is equal (OR-amplification). Two hash vectors are equal iff ALL pairs of corresponding entries are equal (AND-amplification).
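
The bucket rule described above can be sketched in plain Scala without any Spark dependency. This is an illustrative sketch, not the code in this patch: the object name `AmplificationSketch`, the `Signature` alias, and both helper functions are hypothetical, and it assumes hash vectors are compared table by table at the same index.

```scala
object AmplificationSketch {
  // Hypothetical stand-in for one feature's hash: numHashTables entries,
  // each holding numHashFunctions hash values.
  type Signature = Seq[Seq[Double]]

  // AND-amplification: two hash vectors match only if every entry matches.
  def vectorsEqual(a: Seq[Double], b: Seq[Double]): Boolean =
    a.length == b.length && a.zip(b).forall { case (x, y) => x == y }

  // OR-amplification: two features share a bucket if their hash vectors
  // match in at least one table (assumed: compared at the same table index).
  def sameBucket(a: Signature, b: Signature): Boolean =
    a.zip(b).exists { case (u, v) => vectorsEqual(u, v) }
}
```

With numHashTables = 2 and numHashFunctions = 2, two signatures that agree on every entry of one table share a bucket, even if the other table differs.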

Will create follow-up PRs for Python API and Doc/Examples.

How was this patch tested?

By running unit tests MinHashLSHSuite and BucketedRandomProjectionLSHSuite.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73550 has finished for PR 17092 at commit 9dd87ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yunni
Contributor Author

Yunni commented Feb 28, 2017

@jkbradley @MLnick Here is a clean PR. Sorry for messing up the previous one!

@merlintang I am happy to continue our discussion here: https://issues.apache.org/jira/browse/SPARK-19771, as OR-AND amplification requires many more changes than SPARK-18450.

@merlintang

merlintang commented Feb 28, 2017

@Yunni OK, let us discuss the further optimization step in the other ticket. Let me manually check and test this patch, because I have one concern here. I will let you know later.

@merlintang

@Yunni I tested this patch locally and it works, but I have one idea to improve it. We can discuss it in the other ticket.

@Yunni
Contributor Author

Yunni commented Mar 9, 2017

@jkbradley @sethah Please review when you have time. Thanks!

@Yunni
Contributor Author

Yunni commented Apr 6, 2017

Ping.

@Yunni
Contributor Author

Yunni commented May 6, 2017

@MLnick @jkbradley @sethah Could you take a look? Thanks!

@kturgut

kturgut commented Nov 2, 2017

@jkbradley @MLnick @sethah @Yunni @merlintang @akatz
It seems LSH would be a perfect fit for matching patient records, if only I could figure out how to assign different weights to each column of the patient records I am comparing. For instance, each record may have zero or more identifiers. If the identifiers match exactly, we consider it a solid match. However, if the IDs do not match strongly, we also look at an additional set of fields, such as name, birthdate, and address, at different weights.
For instance, an exact match on names is stronger than a match with small typos.
To give a different weight to each field we are comparing, do I have to write a custom distance calculator?
Or should I perhaps do MinHashing and then LSH as a second step, as described in this document: http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf?
It does not look like AND-OR amplification would help with that, as it takes the number of hash functions as input, and it does not seem like we have control over the sensitivity of the hash functions.
I would really appreciate your guidance.

Contributor

@WeichenXu123 WeichenXu123 left a comment


Supposing we want to support OR-AND amplification in the future, how would the API be added or changed? Would we add a boolean parameter to specify OR-AND / AND-OR?

Also, the names numHashFunctions and numHashTables may be a little confusing for users.
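
For context on the two orderings, here is a small sketch of the textbook collision probabilities, assuming each base hash function collides independently with probability p. The object and function names are hypothetical and are not part of this patch or of Spark's API; the parameter names only mirror the Params discussed in this PR.

```scala
object AmplificationCurves {
  // AND-OR: AND over numHashFunctions inside each table, then OR across
  // numHashTables tables. Probability: 1 - (1 - p^r)^b.
  def andThenOr(p: Double, numHashFunctions: Int, numHashTables: Int): Double =
    1.0 - math.pow(1.0 - math.pow(p, numHashFunctions), numHashTables)

  // OR-AND: OR across numHashTables first, then AND over numHashFunctions
  // such groups. Probability: (1 - (1 - p)^b)^r.
  def orThenAnd(p: Double, numHashFunctions: Int, numHashTables: Int): Double =
    math.pow(1.0 - math.pow(1.0 - p, numHashTables), numHashFunctions)
}
```

With p = 0.5 and both parameters set to 2, the AND-OR construction gives 0.4375 and the OR-AND construction gives 0.5625, so the two orderings are not interchangeable for the same pair of settings.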

@Since("2.1.0")
override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value)

@Since("2.2.0")
Contributor


@Since("2.4.0")

@Since("2.1.0")
override def setNumHashTables(value: Int): this.type = super.setNumHashTables(value)

@Since("2.2.0")
Contributor


Ditto.

@HyukjinKwon
Member

ping @Yunni

@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon <[email protected]>

Closes apache#21781 from HyukjinKwon/closing-prs.