[SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function #4254
Changes from 24 commits
a3c5fbe
d5aae20
3fd5bc8
0ef163f
32a90dc
0700335
e5df2b8
9294263
a2b1e57
b7dbcbe
f656c34
a112f38
ace9749
b29c0db
bea48ea
90e7fa4
be659e3
060e6bf
24f438e
88aacc8
43ab10b
218a49d
1c3a62e
121e4d5
7ebd149
92d4752
c12dfc8
24fbf52
4b78aaf
f292f31
4550850
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| --- | ||
| layout: global | ||
| title: Clustering - MLlib | ||
| displayTitle: <a href="mllib-guide.html">MLlib</a> - Power Iteration Clustering | ||
| --- | ||
|
|
||
| * Table of contents | ||
| {:toc} | ||
|
|
||
|
|
||
| ## Power Iteration Clustering | ||
|
|
||
| Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: | ||
|
|
||
| * computes the Gaussian similarity between all pairs of points and represents these similarities in an Affinity Matrix | ||
|
||
| * calculates a Normalized Affinity Matrix | ||
| * calculates the principal eigenvalue and eigenvector | ||
| * clusters each of the input points according to its component value in the principal eigenvector | ||
|
|
||
| Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf) | ||
|
|
||
| For a dataset inspired by the paper, but with five clusters instead of three, our implementation produces the following output: | ||
|
|
||
| <p style="text-align: center;"> | ||
| <img src="img/PIClusteringFiveCirclesInputsAndOutputs.png" | ||
| title="The Property Graph" | ||
| alt="The Property Graph" | ||
| width="50%" /> | ||
| <!-- Images are downsized intentionally to improve quality on retina displays --> | ||
| </p> | ||
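
The four steps listed in the documentation diff above can be sketched in plain Python. This is only an illustrative sketch, not the Scala/GraphX code in this PR. The function names (`gaussian_affinity`, `cluster_1d`, `power_iteration_clustering`), the `sigma` bandwidth, the degree-based starting vector, the fixed iteration count, and the simple 1-D k-means step are all assumptions made for the example.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian similarity exp(-||x - y||^2 / (2 sigma^2)), zero diagonal."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


def cluster_1d(values, k, rounds=20):
    """Tiny 1-D k-means with centers initialised evenly between min and max."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * c / (k - 1) for c in range(k)] if k > 1 else [lo]
    labels = [0] * len(values)
    for _ in range(rounds):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each non-empty center to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def power_iteration_clustering(points, k, sigma=1.0, iterations=20):
    # Step 1: affinity matrix from Gaussian similarities.
    A = gaussian_affinity(points, sigma)
    # Step 2: row-normalised affinity matrix W = D^-1 A
    # (assumes every point has a non-zero degree).
    degrees = [sum(row) for row in A]
    W = [[a / d for a in row] for row, d in zip(A, degrees)]
    # Step 3: truncated power iteration, started from the normalised degree
    # vector (an assumed choice of starting vector).
    volume = sum(degrees)
    v = [d / volume for d in degrees]
    for _ in range(iterations):
        v = [sum(w * x for w, x in zip(row, v)) for row in W]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    # Step 4: cluster the points by their component in the iterated vector.
    return cluster_1d(v, k)


# Two well-separated groups of different sizes (hypothetical data).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = power_iteration_clustering(points, k=2)
print(labels)  # the first three and last four points should get different labels
```

The iteration is deliberately stopped after a fixed number of steps. If run to convergence on a row-normalized matrix, the vector becomes constant and no longer separates the clusters. A degree-based start also cannot distinguish clusters that have identical degree structure, so a random starting vector may be preferable in that case.
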
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -50,6 +50,11 @@ | |
| <artifactId>spark-sql_${scala.binary.version}</artifactId> | ||
| <version>${project.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>org.apache.spark</groupId> | ||
| <artifactId>spark-graphx_${scala.binary.version}</artifactId> | ||
| <version>${project.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>org.jblas</groupId> | ||
| <artifactId>jblas</artifactId> | ||
|
|
@@ -103,6 +108,13 @@ | |
| <type>test-jar</type> | ||
| <scope>test</scope> | ||
| </dependency> | ||
| <!-- <dependency> | ||
|
||
| <groupId>org.apache.spark</groupId> | ||
| <artifactId>spark-graphx_${scala.binary.version}</artifactId> | ||
| <version>${project.version}</version> | ||
| <type>test-jar</type> | ||
| <scope>test</scope> | ||
| </dependency> --> | ||
| </dependencies> | ||
| <profiles> | ||
| <profile> | ||
|
|
||
Shall we make it a section in `mllib-clustering.md`?
OK