
Conversation

@zhengruifeng (Contributor) commented Jun 25, 2019

What changes were proposed in this pull request?

Make the transforms native in the ml framework to avoid extra conversion.
There are many TODOs in the current ml module, like // TODO: Make the transformer natively in ml framework to avoid extra conversion. in ChiSqSelector.
This PR makes these ml algorithms no longer need to convert ml vectors to mllib vectors in their transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
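The conversion being removed can be illustrated with a self-contained sketch. The MlDense/MllibDense case classes below are stand-ins for the real vector classes, not Spark's API, and the wrapping steps are simplified for illustration:

```scala
// Self-contained sketch (not Spark's real classes) of the pattern removed:
// an .ml transform that detours through the old .mllib vector type versus
// the same logic applied natively on the .ml vector.
object ConversionSketch {
  final case class MlDense(values: Array[Double])    // stand-in for ml.linalg.DenseVector
  final case class MllibDense(values: Array[Double]) // stand-in for mllib.linalg.DenseVector

  // Old path: ml -> mllib, transform in .mllib, then mllib -> ml,
  // allocating an extra wrapper object per row in each direction.
  def transformViaMllib(v: MlDense, factor: Double): MlDense = {
    val old = MllibDense(v.values)        // fromML-style wrap
    MlDense(old.values.map(_ * factor))   // asML-style wrap of the result
  }

  // New path after this PR: the same logic runs on the ml vector directly.
  def transformNatively(v: MlDense, factor: Double): MlDense =
    MlDense(v.values.map(_ * factor))
}
```

Both paths produce the same values; the native path simply skips the per-row round trip through the old vector type.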

How was this patch tested?

Existing test suites.

@zhengruifeng (Contributor, Author) commented

KMeans and BisectingKMeans are left alone, since many classes would need to be created on the ml side.

@SparkQA commented Jun 25, 2019

Test build #106872 has finished for PR 24963 at commit e9e9c65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 25, 2019

Test build #106879 has finished for PR 24963 at commit f34112c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jun 25, 2019

Before I review, can you update the JIRA and PR with detail about what you're trying to do here? There's no real info.

@SparkQA commented Jun 26, 2019

Test build #106912 has finished for PR 24963 at commit 51ee85c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

So, this copies the implementation of a lot of these algorithms? Hm, that seems bad from a maintenance standpoint. This is just to avoid the conversion of vector classes? I wonder if there are easier answers. For example, many .mllib implementations probably just use the vector as an array of doubles immediately. If so, could they expose that directly so that the .ml implementation can call the logic more directly? Or, is the vector conversion overhead so significant? It should be mostly re-wrapping the same values and indices.
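The refactoring suggested here can be sketched as follows. IdfSketch and its method names are hypothetical, not Spark's real IDFModel API; the point is exposing the .mllib logic at the indices/values level so either vector type can reach it without conversion:

```scala
// Hypothetical sketch: the core operates on raw arrays, so both .mllib
// and .ml vectors can call it by unwrapping their values, with no
// cross-package vector conversion.
object IdfSketch {
  // Dense case: every position is active.
  def transformDense(values: Array[Double], idf: Array[Double]): Array[Double] =
    Array.tabulate(values.length)(i => values(i) * idf(i))

  // Sparse case: only the stored (index, value) pairs are touched.
  def transformSparse(indices: Array[Int], values: Array[Double],
                      idf: Array[Double]): (Array[Int], Array[Double]) =
    (indices, Array.tabulate(values.length)(k => values(k) * idf(indices(k))))
}
```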

@zhengruifeng (Contributor, Author) commented

@srowen Your approach is more reasonable; it is better to maintain only one implementation. I will try to add a method that takes an array of doubles as input, on the .mllib side.

@zhengruifeng (Contributor, Author) commented

I mean: on the .mllib side, directly return a udf of type .ml.Vector => Double for the call on the .ml side.

@SparkQA commented Jun 27, 2019

Test build #106962 has finished for PR 24963 at commit 92d555c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 28, 2019

Test build #106997 has finished for PR 24963 at commit 10ba449.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private[spark] def compress(features: NewVector): NewVector = {
@srowen (Member) commented

These seem like general methods, not specific to chi-squared. Do we not already do some of this work in the Vector constructors or an existing utility method?
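For context, the kind of logic a generic compress typically holds can be sketched like this. The byte estimates below are illustrative only, not Spark's exact heuristic (ml's Vector already offers a compressed method with a similar size comparison):

```scala
// Illustrative sketch of choosing a vector representation by storage cost:
// keep sparse only when (index, value) pairs beat a flat dense array.
object CompressSketch {
  def shouldBeSparse(size: Int, numNonzeros: Int): Boolean = {
    // Rough per-entry costs: dense stores 8 bytes per slot; sparse stores
    // a 4-byte index plus an 8-byte value per nonzero, plus fixed overhead.
    val denseBytes  = 8L * size
    val sparseBytes = 12L * numNonzeros + 20L
    sparseBytes < denseBytes
  }
}
```

For a mostly-empty vector (say 10 nonzeros out of 1000 slots) the sparse form wins; for a nearly-full one the dense form does.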

@srowen (Member) commented

Likewise here, I don't think we want to handle .ml vectors in .mllib. I think the idea is to make this .mllib method more generic, perhaps just operating on indices and values?

import org.apache.spark.annotation.Since
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId}
import org.apache.spark.ml.linalg.{Vector => NewVector, Vectors => NewVectors}
@srowen (Member) commented

Ah OK, I think we don't want to import .ml vectors in .mllib here. But the method below is only used in .ml now. Just move it to .ml.clustering.LDAModel with your changes?

k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: NewDenseVector => ((0 until v.size).toList, v.values)
@srowen (Member) commented

I think we want to avoid materializing this list of indices. In the dense case it's redundant. If not passed, assume the dense case?
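One way to read this suggestion, sketched with stand-in classes rather than the real vector types: return indices only when they carry information, and treat their absence as the dense case.

```scala
// Sketch only: DenseLike/SparseLike stand in for the real vector classes.
// In the dense case the indices are implicitly 0 until values.length, so
// there is no need to materialize (0 until v.size).toList.
object TermCountsSketch {
  sealed trait Vec
  final case class DenseLike(values: Array[Double]) extends Vec
  final case class SparseLike(indices: Array[Int], values: Array[Double]) extends Vec

  // Returns the active values, plus indices only when they are non-trivial.
  def split(v: Vec): (Option[Array[Int]], Array[Double]) = v match {
    case DenseLike(values)           => (None, values)
    case SparseLike(indices, values) => (Some(indices), values)
  }
}
```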

@SparkQA commented Jul 1, 2019

Test build #107058 has finished for PR 24963 at commit bd813db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

OK yes I think that's the right way to make this change.

@zhengruifeng (Contributor, Author) commented

@srowen PCA is an exception, since it uses matrix multiplication.

@zhengruifeng (Contributor, Author) commented

@srowen In this PR, I found that it would be more convenient (e.g. for IDF/ElementwiseProduct/StandardScaler) if there were some methods in linalg like:

def mapActive(f: (Int, Double) => Double): Vector
returns a new vector whose values are computed from the original vector by function f, where f is applied only to active elements.

def updateActive(f: (Int, Double) => Double): Unit
like mapActive, but updates the values in place.

What do you think about this?
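A rough sketch of what the proposed helpers could look like, with the dense/sparse vector internals simplified to plain arrays (these are not existing Spark APIs):

```scala
// Hypothetical sketch of the proposed mapActive/updateActive helpers,
// operating on raw arrays instead of the real vector classes.
object ActiveOps {
  // mapActive, dense case: every position is active, index equals position.
  def mapActiveDense(values: Array[Double])(f: (Int, Double) => Double): Array[Double] =
    Array.tabulate(values.length)(i => f(i, values(i)))

  // mapActive, sparse case: f receives the logical index of each stored value.
  def mapActiveSparse(indices: Array[Int], values: Array[Double])
                     (f: (Int, Double) => Double): Array[Double] =
    Array.tabulate(values.length)(k => f(indices(k), values(k)))

  // updateActive, dense case: same as mapActive but mutates in place.
  def updateActiveDense(values: Array[Double])(f: (Int, Double) => Double): Unit = {
    var i = 0
    while (i < values.length) { values(i) = f(i, values(i)); i += 1 }
  }
}
```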

@SparkQA commented Jul 2, 2019

Test build #107092 has finished for PR 24963 at commit 5730ab7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 2, 2019

Test build #107094 has finished for PR 24963 at commit d54d073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looking good; one more question.

So, am I right that generally you have:

  • Broken out parts of the implementation in .mllib to expose indices/values methods
  • Called those methods from the .mllib and .ml implementations directly to avoid vector conversion?
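That summary pattern might be sketched like this, with hypothetical names; the point is a single array-level core with thin per-package wrappers:

```scala
// Illustrative only: ScalerCore is a hypothetical name for one shared
// array-level core; comments mark where the two thin wrappers would live.
object ScalerCore {
  // Logic usable from both .mllib and .ml without any vector conversion.
  def scale(values: Array[Double], std: Array[Double]): Array[Double] =
    Array.tabulate(values.length) { i =>
      if (std(i) != 0.0) values(i) / std(i) else 0.0
    }
  // .mllib wrapper: unwrap the mllib.linalg.Vector values and call scale.
  // .ml wrapper:    unwrap the ml.linalg.Vector values and call the same scale.
}
```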

k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: DenseVector => ((0 until v.size).toList, v.values)
@srowen (Member) commented

Here and elsewhere, as an optimization, can we avoid (0 until v.size).toList? Pass an empty list in this case or something, and then deduce that the indices are just the same length as the values?

You're generally solving this with separate sparse/dense methods, which could be fine too if it doesn't result in too much code duplication and improves performance in the dense case.

@srowen (Member) commented

Looks good then except we might be able to make one more optimization here?

@zhengruifeng (Contributor, Author) commented

I just looked into the usage of the indices ids, and found that they are used as slicing indices, like val expElogbetad = expElogbeta(indices, ::).toDenseMatrix.
I will have a try.

@zhengruifeng (Contributor, Author) commented

I am afraid that an empty list may not help to simplify the implementation, since in places like private[clustering] def submitMiniBatch(batch: RDD[(Long, Vector)]): OnlineLDAOptimizer, we still have to create a List for slicing.

@zhengruifeng (Contributor, Author) commented

@srowen Yes, I broke out the implementations in .mllib to expose methods for dense and sparse inputs (except PCA), and call them from .ml to avoid conversion.

@SparkQA commented Jul 3, 2019

Test build #107148 has finished for PR 24963 at commit f1314fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 3, 2019

Test build #107153 has finished for PR 24963 at commit 096d204.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 8, 2019

Merged to master
