Closed
Changes from 24 commits
31 commits
a3c5fbe
Adding Power Iteration Clustering
fjiang6 Jan 22, 2015
d5aae20
Adding Power Iteration Clustering and Suite test
fjiang6 Jan 22, 2015
3fd5bc8
PIClustering is running in new branch (up to the pseudo-eigenvector c…
sboeschhuawei Jan 23, 2015
0ef163f
Added ConcentricCircles data generation and KMeans clustering
sboeschhuawei Jan 23, 2015
32a90dc
Update circles test data values
sboeschhuawei Jan 23, 2015
0700335
First end to end working version: but has bad performance issue
sboeschhuawei Jan 23, 2015
e5df2b8
First end to end working PIC
sboeschhuawei Jan 24, 2015
9294263
Added visualization/plotting of input/output data
sboeschhuawei Jan 25, 2015
a2b1e57
Revert inadvertent update to KMeans
sboeschhuawei Jan 25, 2015
b7dbcbe
Added axes and combined into single plot for matplotlib
sboeschhuawei Jan 26, 2015
f656c34
Added iris dataset
sboeschhuawei Jan 26, 2015
a112f38
Added graphx main and test jars as dependencies to mllib/pom.xml
sboeschhuawei Jan 26, 2015
ace9749
Update PIClustering.scala
fjiang6 Jan 26, 2015
b29c0db
Update PIClustering.scala
fjiang6 Jan 26, 2015
bea48ea
Converted custom Linear Algebra datatypes/routines to use Breeze.
sboeschhuawei Jan 27, 2015
90e7fa4
Converted from custom Linalg routines to Breeze: added JavaDoc commen…
sboeschhuawei Jan 28, 2015
be659e3
Added mllib specific log4j
sboeschhuawei Jan 28, 2015
060e6bf
Added link to PIC doc from the main clustering md doc
sboeschhuawei Jan 28, 2015
24f438e
fixed incorrect markdown in clustering doc
sboeschhuawei Jan 28, 2015
88aacc8
Add assert to testcase on cluster sizes
sboeschhuawei Jan 28, 2015
43ab10b
Change last two println's to log4j logger
sboeschhuawei Jan 28, 2015
218a49d
Applied Xiangrui's comments - especially removing RDD/PICLinalg class…
sboeschhuawei Jan 28, 2015
1c3a62e
removed matplot.py and reordered all private methods to bottom of PIC
sboeschhuawei Jan 28, 2015
121e4d5
Remove unused testing data files
sboeschhuawei Jan 28, 2015
7ebd149
Incorporate Xiangrui's first set of PR comments except restructure PI…
sboeschhuawei Jan 28, 2015
92d4752
Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the…
sboeschhuawei Jan 29, 2015
c12dfc8
Removed examples files and added pic_data.txt. Revamped testcases yet…
sboeschhuawei Jan 29, 2015
24fbf52
Updated API to be similar to KMeans plus other changes requested by X…
sboeschhuawei Jan 30, 2015
4b78aaf
refactor PIC
mengxr Jan 30, 2015
f292f31
Merge pull request #44 from mengxr/SPARK-4259
sboeschhuawei Jan 30, 2015
4550850
Removed pic test data
sboeschhuawei Jan 30, 2015
30 changes: 30 additions & 0 deletions docs/mllib-clustering-pic.md
@@ -0,0 +1,30 @@
---
Contributor: Shall we make it a section in mllib-clustering.md?

Contributor: OK

layout: global
title: Clustering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Power Iteration Clustering
---

* Table of contents
{:toc}


## Power Iteration Clustering

Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given their pairwise affinity values. Internally, the algorithm:

* computes the Gaussian distance between all pairs of points and represents these distances in an Affinity Matrix
Contributor: Doc needs update. We assume that the input is a graph with precomputed pairwise similarities/distances.

Contributor: OK

* calculates a normalized affinity matrix
* runs power iteration on that matrix to obtain a pseudo-eigenvector (an early-stopped approximation of the principal eigenvector)
* clusters the input points according to their component values in that vector

Details of this algorithm can be found in [Power Iteration Clustering, Lin and Cohen](http://www.icml2010.org/papers/387.pdf).
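
The steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the MLlib implementation: the function names (`gaussian_affinity`, `pseudo_eigenvector`, `kmeans_1d`), the `sigma` bandwidth, the degree-based start vector, and the stopping rule are all choices made for this sketch, and it works on small in-memory lists rather than RDDs.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Dense affinity matrix: A[i][j] = exp(-||xi - xj||^2 / (2 sigma^2)), zero diagonal."""
    n = len(points)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((p - q) ** 2 for p, q in zip(points[i], points[j]))
                a[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return a


def pseudo_eigenvector(a, max_iter=100, tol=1e-5):
    """Early-stopped power iteration on the row-normalized affinity matrix."""
    n = len(a)
    degrees = [sum(row) for row in a]  # assumes every point has a nonzero degree
    w = [[v / d for v in row] for row, d in zip(a, degrees)]  # W = D^-1 A
    total = sum(degrees)
    v = [d / total for d in degrees]  # assumed start vector: normalized degrees
    prev_delta = float("inf")
    for _ in range(max_iter):
        nxt = [sum(wij * vj for wij, vj in zip(row, v)) for row in w]
        norm = sum(abs(x) for x in nxt)  # L1 normalization keeps values bounded
        nxt = [x / norm for x in nxt]
        delta = max(abs(x - y) for x, y in zip(nxt, v))
        v = nxt
        # Stop once the change between iterations has stopped changing
        # (assumed threshold), well before v collapses to a constant vector.
        if abs(prev_delta - delta) < tol / n:
            break
        prev_delta = delta
    return v


def kmeans_1d(values, k, iters=20):
    """Plain Lloyd iterations on scalars, with deterministic quantile-based seeds."""
    srt = sorted(values)
    centers = [srt[(2 * c + 1) * len(srt) // (2 * k)] for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def power_iteration_clustering(points, k, sigma=1.0):
    """Affinity matrix -> pseudo-eigenvector -> k-means on its components."""
    a = gaussian_affinity(points, sigma)
    v = pseudo_eigenvector(a)
    return kmeans_1d(v, k)


# Two well-separated groups of different sizes (made-up data).
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
          (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2), (5.1, 5.1)]
labels = power_iteration_clustering(points, k=2)
```

One caveat of this sketch: with a degree-based start vector, clusters of identical size and shape can yield a nearly constant vector that k-means cannot split, which is why the sample data uses groups of different sizes.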

For a dataset inspired by the paper, but with five clusters instead of three, our implementation produces the following output:

<p style="text-align: center;">
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
title="The Property Graph"
alt="The Property Graph"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
3 changes: 3 additions & 0 deletions docs/mllib-clustering.md
@@ -34,6 +34,9 @@ a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.

[Power Iteration Clustering](mllib-clustering-pic.md) combines the power iteration method with k-means clustering to
cluster points based on a Gaussian measure of their pairwise similarity.

### Examples

<div class="codetabs">
12 changes: 12 additions & 0 deletions mllib/pom.xml
@@ -50,6 +50,11 @@
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.jblas</groupId>
<artifactId>jblas</artifactId>
@@ -103,6 +108,13 @@
<type>test-jar</type>
<scope>test</scope>
</dependency>
<!-- <dependency>
Contributor: remove this block

Contributor: Done.

<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<type>test-jar</type>
<scope>test</scope>
</dependency> -->
</dependencies>
<profiles>
<profile>