
Conversation

@zhengruifeng (Contributor) commented Jun 25, 2019

What changes were proposed in this pull request?

Make the transforms native in the ml framework to avoid extra conversion.
There are many TODOs in the current ml module, like // TODO: Make the transformer natively in ml framework to avoid extra conversion. in ChiSqSelector.
This PR makes these ml algorithms no longer need to convert ml vectors to mllib vectors in their transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
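The conversion being removed can be illustrated with a self-contained sketch. The MlDense/MllibDense case classes below are stand-ins for the real vector classes, not Spark's API, and the wrapping steps are simplified for illustration:

```scala
// Self-contained sketch (not Spark's real classes) of the pattern removed:
// an .ml transform that detours through the old .mllib vector type versus
// the same logic applied natively on the .ml vector.
object ConversionSketch {
  final case class MlDense(values: Array[Double])    // stand-in for ml.linalg.DenseVector
  final case class MllibDense(values: Array[Double]) // stand-in for mllib.linalg.DenseVector

  // Old path: ml -> mllib, transform in .mllib, then mllib -> ml,
  // allocating an extra wrapper object per row in each direction.
  def transformViaMllib(v: MlDense, factor: Double): MlDense = {
    val old = MllibDense(v.values)        // fromML-style wrap
    MlDense(old.values.map(_ * factor))   // asML-style wrap of the result
  }

  // New path after this PR: the same logic runs on the ml vector directly.
  def transformNatively(v: MlDense, factor: Double): MlDense =
    MlDense(v.values.map(_ * factor))
}
```

Both paths produce the same values; the native path simply skips the per-row round trip through the old vector type.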

How was this patch tested?

Existing test suites.

@zhengruifeng (Contributor, Author) commented

KMeans and BisectingKMeans are left alone, since many classes would need to be created on the ml side.

@SparkQA commented Jun 25, 2019

Test build #106872 has finished for PR 24963 at commit e9e9c65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 25, 2019

Test build #106879 has finished for PR 24963 at commit f34112c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jun 25, 2019

Before I review, can you update the JIRA and PR with detail about what you're trying to do here? There's no real info.

@SparkQA commented Jun 26, 2019

Test build #106912 has finished for PR 24963 at commit 51ee85c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

So, this copies the implementation of a lot of these algorithms? Hm, that seems bad from a maintenance standpoint. This is just to avoid the conversion of vector classes? I wonder if there are easier answers. For example, many .mllib implementations probably just use the vector as an array of doubles immediately. If so, could they expose that directly so that the .ml implementation can call the logic more directly? Or, is the vector conversion overhead so significant? It should be mostly re-wrapping the same values and indices.
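The refactoring suggested here can be sketched as follows. IdfSketch and its method names are hypothetical, not Spark's real IDFModel API; the point is exposing the .mllib logic at the indices/values level so either vector type can reach it without conversion:

```scala
// Hypothetical sketch: the core operates on raw arrays, so both .mllib
// and .ml vectors can call it by unwrapping their values, with no
// cross-package vector conversion.
object IdfSketch {
  // Dense case: every position is active.
  def transformDense(values: Array[Double], idf: Array[Double]): Array[Double] =
    Array.tabulate(values.length)(i => values(i) * idf(i))

  // Sparse case: only the stored (index, value) pairs are touched.
  def transformSparse(indices: Array[Int], values: Array[Double],
                      idf: Array[Double]): (Array[Int], Array[Double]) =
    (indices, Array.tabulate(values.length)(k => values(k) * idf(indices(k))))
}
```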

@zhengruifeng (Contributor, Author) commented

@srowen Your approach is more reasonable; it is better to maintain only one implementation. I will try to add a method that takes an array of doubles as input, on the .mllib side.

@zhengruifeng (Contributor, Author) commented

I mean: on the .mllib side, directly return a udf of type .ml.Vector => Double for the call on the .ml side.

@SparkQA commented Jun 27, 2019

Test build #106962 has finished for PR 24963 at commit 92d555c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 28, 2019

Test build #106997 has finished for PR 24963 at commit 10ba449.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private[spark] def compress(features: NewVector): NewVector = {
@srowen (Member) commented

These seem like general methods, not specific to chi-squared. Do we not already do some of this work in the Vector constructors or an existing utility method?
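For context, the kind of logic a generic compress typically holds can be sketched like this. The byte estimates below are illustrative only, not Spark's exact heuristic (ml's Vector already offers a compressed method with a similar size comparison):

```scala
// Illustrative sketch of choosing a vector representation by storage cost:
// keep sparse only when (index, value) pairs beat a flat dense array.
object CompressSketch {
  def shouldBeSparse(size: Int, numNonzeros: Int): Boolean = {
    // Rough per-entry costs: dense stores 8 bytes per slot; sparse stores
    // a 4-byte index plus an 8-byte value per nonzero, plus fixed overhead.
    val denseBytes  = 8L * size
    val sparseBytes = 12L * numNonzeros + 20L
    sparseBytes < denseBytes
  }
}
```

For a mostly-empty vector (say 10 nonzeros out of 1000 slots) the sparse form wins; for a nearly-full one the dense form does.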

@srowen (Member) commented

Likewise here, I don't think we want to handle .ml vectors in .mllib. I think the idea is to make this .mllib method more generic, perhaps just operating on indices and values?

import org.apache.spark.annotation.Since
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId}
import org.apache.spark.ml.linalg.{Vector => NewVector, Vectors => NewVectors}
@srowen (Member) commented

Ah OK, I think we don't want to import .ml vectors in .mllib here. But the method below is only used in .ml now. Just move it to .ml.clustering.LDAModel with your changes?

k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: NewDenseVector => ((0 until v.size).toList, v.values)
@srowen (Member) commented

I think we want to avoid materializing this list of indices. In the dense case it's redundant. If not passed, assume the dense case?
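One way to read this suggestion, sketched with stand-in classes rather than the real vector types: return indices only when they carry information, and treat their absence as the dense case.

```scala
// Sketch only: DenseLike/SparseLike stand in for the real vector classes.
// In the dense case the indices are implicitly 0 until values.length, so
// there is no need to materialize (0 until v.size).toList.
object TermCountsSketch {
  sealed trait Vec
  final case class DenseLike(values: Array[Double]) extends Vec
  final case class SparseLike(indices: Array[Int], values: Array[Double]) extends Vec

  // Returns the active values, plus indices only when they are non-trivial.
  def split(v: Vec): (Option[Array[Int]], Array[Double]) = v match {
    case DenseLike(values)           => (None, values)
    case SparseLike(indices, values) => (Some(indices), values)
  }
}
```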

@SparkQA commented Jul 1, 2019

Test build #107058 has finished for PR 24963 at commit bd813db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

OK yes I think that's the right way to make this change.

@zhengruifeng (Contributor, Author) commented

@srowen PCA is an exception, since it uses matrix multiplication.

@zhengruifeng (Contributor, Author) commented

@srowen In this PR, I found that it would be more convenient (e.g. for IDF/ElementwiseProduct/StandardScaler) if there were some methods in linalg like:

def mapActive(f: (Int, Double) => Double): Vector
returns a new vector whose values are computed from the original vector by function f, where f is applied only to active elements.

def updateActive(f: (Int, Double) => Double): Unit
like mapActive, but updates the values in place.

What do you think about this?
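A rough sketch of what the proposed helpers could look like, with the dense/sparse vector internals simplified to plain arrays (these are not existing Spark APIs):

```scala
// Hypothetical sketch of the proposed mapActive/updateActive helpers,
// operating on raw arrays instead of the real vector classes.
object ActiveOps {
  // mapActive, dense case: every position is active, index equals position.
  def mapActiveDense(values: Array[Double])(f: (Int, Double) => Double): Array[Double] =
    Array.tabulate(values.length)(i => f(i, values(i)))

  // mapActive, sparse case: f receives the logical index of each stored value.
  def mapActiveSparse(indices: Array[Int], values: Array[Double])
                     (f: (Int, Double) => Double): Array[Double] =
    Array.tabulate(values.length)(k => f(indices(k), values(k)))

  // updateActive, dense case: same as mapActive but mutates in place.
  def updateActiveDense(values: Array[Double])(f: (Int, Double) => Double): Unit = {
    var i = 0
    while (i < values.length) { values(i) = f(i, values(i)); i += 1 }
  }
}
```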

@SparkQA commented Jul 2, 2019

Test build #107092 has finished for PR 24963 at commit 5730ab7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 2, 2019

Test build #107094 has finished for PR 24963 at commit d54d073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Looking good; one more question.

So, am I right that generally you have:

  • Broken out parts of the implementation in .mllib to expose indices/values methods
  • Called those methods from the .mllib and .ml implementations directly to avoid vector conversion?
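That summary pattern might be sketched like this, with hypothetical names; the point is a single array-level core with thin per-package wrappers:

```scala
// Illustrative only: ScalerCore is a hypothetical name for one shared
// array-level core; comments mark where the two thin wrappers would live.
object ScalerCore {
  // Logic usable from both .mllib and .ml without any vector conversion.
  def scale(values: Array[Double], std: Array[Double]): Array[Double] =
    Array.tabulate(values.length) { i =>
      if (std(i) != 0.0) values(i) / std(i) else 0.0
    }
  // .mllib wrapper: unwrap the mllib.linalg.Vector values and call scale.
  // .ml wrapper:    unwrap the ml.linalg.Vector values and call the same scale.
}
```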

k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: DenseVector => ((0 until v.size).toList, v.values)
@srowen (Member) commented

Here and elsewhere, as an optimization, can we avoid (0 until v.size).toList? Pass an empty list in this case or something, and then deduce that the indices are just the same length as the values?

You're generally solving this with separate sparse/dense methods, which could be fine too if it doesn't result in too much code duplication and improves performance in the dense case.

@srowen (Member) commented

Looks good then except we might be able to make one more optimization here?

@zhengruifeng (Contributor, Author) commented

I just looked into the usage of the indices ids, and found that they are used as slicing indices, like val expElogbetad = expElogbeta(indices, ::).toDenseMatrix.
I will have a try.

@zhengruifeng (Contributor, Author) commented

I am afraid that an empty list may not help to simplify the implementation, since in places like private[clustering] def submitMiniBatch(batch: RDD[(Long, Vector)]): OnlineLDAOptimizer, we still have to create a List for slicing.

@zhengruifeng (Contributor, Author) commented

@srowen Yes, I broke out the implementations in .mllib to expose methods for dense and sparse inputs (except PCA), and call them from .ml to avoid conversion.

@SparkQA commented Jul 3, 2019

Test build #107148 has finished for PR 24963 at commit f1314fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 3, 2019

Test build #107153 has finished for PR 24963 at commit 096d204.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Jul 8, 2019

Merged to master
