[SPARK-28159][ML] Make the transform natively in ml framework to avoid extra conversion #24963
Conversation
KMeans and BisectingKMeans are left alone, since many classes would need to be created on the .ml side.
Test build #106872 has finished for PR 24963 at commit
Test build #106879 has finished for PR 24963 at commit
Before I review, can you update the JIRA and PR with detail about what you're trying to do here? There's no real info.
Force-pushed f34112c to 51ee85c.
Test build #106912 has finished for PR 24963 at commit
srowen
left a comment
So, this copies the implementation of a lot of these algorithms? Hmm, that seems bad from a maintenance standpoint. Is this just to avoid the conversion of vector classes? I wonder if there are easier answers. For example, many .mllib implementations probably just use the vector as an array of doubles immediately. If so, could they expose that directly so that the .ml implementation can call the logic more directly? Or is the vector conversion overhead really that significant? It should be mostly re-wrapping the same values and indices.
@srowen Your approach is more reasonable; it is better to maintain only one implementation. I will try to add a method that takes an array of doubles as input on the .mllib side.
I mean that on the .mllib side, we can directly return a UDF (.ml Vector => Double) for the call on the .ml side.
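A minimal sketch of that UDF idea (OldStyleModel and predictML are illustrative names, not the PR's actual code): the .mllib-side model hands back a plain Scala function over .ml vectors, and the .ml side wraps it in a UDF, so no per-row vector conversion is needed.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.sql.functions.udf

// Hypothetical .mllib-style model: exposes a scoring function that
// consumes .ml vectors directly, so the caller never converts classes.
class OldStyleModel(weights: Array[Double]) {
  def predictML: MLVector => Double = (v: MLVector) => {
    var sum = 0.0
    v.foreachActive { (i, value) => sum += weights(i) * value }
    sum
  }
}

// .ml side: wrap the returned function in a UDF and apply it per row.
val model = new OldStyleModel(Array(0.5, 1.5, -2.0))
val predictUDF = udf(model.predictML)
// df.withColumn("prediction", predictUDF(col("features")))
```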
Force-pushed 51ee85c to 92d555c.
Test build #106962 has finished for PR 24963 at commit
Test build #106997 has finished for PR 24963 at commit
private[spark] def compress(features: NewVector): NewVector = {
These seem like general methods, not specific to chi-squared. Do we not already do some of this work in the Vector constructors or an existing utility method?
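For what it's worth, ml.linalg.Vector already exposes a compressed method that picks whichever representation is smaller; a sketch of roughly that heuristic (the 1.5 factor reflects a sparse entry storing an index alongside each value):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

// Roughly the logic behind Vector.compressed: sparse storage only wins
// when well under two-thirds of the entries are nonzero.
def compress(features: Vector): Vector = {
  val nnz = features.numNonzeros
  if (1.5 * (nnz + 1.0) < features.size) features.toSparse else features.toDense
}

val v = Vectors.dense(0.0, 0.0, 0.0, 4.0)
assert(compress(v).isInstanceOf[SparseVector]) // 1 of 4 entries nonzero
```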
Likewise here, I don't think we want to handle .ml vectors in .mllib. I think the idea is to make this .mllib method more generic, perhaps just operating on indices and values?
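A sketch of that refactoring shape (hypothetical names, with a simple elementwise scaling standing in for the real chi-squared logic): the shared core takes only primitive arrays, and a thin wrapper on each side unpacks its own vector type.

```scala
import org.apache.spark.ml.linalg.{DenseVector => NewDense, SparseVector => NewSparse, Vector => NewVector, Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Shared core: pure array arithmetic, agnostic to the vector class.
def scaleValues(values: Array[Double], factor: Double): Array[Double] =
  values.map(_ * factor)

// .mllib wrapper unpacks mllib vectors ...
def transform(v: Vector, factor: Double): Vector = v match {
  case dv: DenseVector => Vectors.dense(scaleValues(dv.values, factor))
  case sv: SparseVector =>
    Vectors.sparse(sv.size, sv.indices, scaleValues(sv.values, factor))
}

// ... and the .ml wrapper calls the same core, so neither side converts.
def transformML(v: NewVector, factor: Double): NewVector = v match {
  case dv: NewDense => NewVectors.dense(scaleValues(dv.values, factor))
  case sv: NewSparse =>
    NewVectors.sparse(sv.size, sv.indices, scaleValues(sv.values, factor))
}
```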
import org.apache.spark.annotation.Since
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId}
import org.apache.spark.ml.linalg.{Vector => NewVector, Vectors => NewVectors}
Ah OK, I think we don't want to import .ml vectors in .mllib here. But the method below is only used in .ml now. Just move it to .ml.clustering.LDAModel with your changes?
    k: Int,
    seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
  val (ids: List[Int], cts: Array[Double]) = termCounts match {
    case v: NewDenseVector => ((0 until v.size).toList, v.values)
I think we want to avoid materializing this list of indices. In the dense case it's redundant. If not passed, assume the dense case?
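A sketch of that suggestion (a hypothetical helper, not the merged code): match once on the input, use the topic matrix whole in the dense case, and only build an index slice for sparse input.

```scala
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

// Dense input: term ids are implicitly 0 until size, so no index list is
// materialized and the matrix is used as-is. Sparse input: slice rows by
// the explicit term indices.
def expElogbetaForDoc(termCounts: Vector, expElogbeta: BDM[Double]): BDM[Double] =
  termCounts match {
    case _: DenseVector => expElogbeta
    case v: SparseVector => expElogbeta(v.indices.toList, ::).toDenseMatrix
  }
```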
(resolved thread on mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala)
Test build #107058 has finished for PR 24963 at commit
srowen
left a comment
OK yes I think that's the right way to make this change.
@srowen
@srowen In this PR, I found that it will be more convenient (like ...). What do you think about this?
Test build #107092 has finished for PR 24963 at commit
Test build #107094 has finished for PR 24963 at commit
srowen
left a comment
Looking good, one more question.
So, am I right that generally you have:
- Broken out parts of the implementation in .mllib to expose indices/values methods
- Called those methods from the .mllib and .ml implementations directly to avoid vector conversion?
    k: Int,
    seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
  val (ids: List[Int], cts: Array[Double]) = termCounts match {
    case v: DenseVector => ((0 until v.size).toList, v.values)
Here and elsewhere, as an optimization, can we avoid (0 until v.size).toList? Pass an empty list in this case or something, and then deduce that the indices are just the same length as the values?
You're generally solving this with separate sparse/dense methods which could be fine too if it doesn't result in too much code duplication and improves performance in the dense case.
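A sketch of the separate-methods shape (hypothetical names, using IDF-style scaling as the example): the dense path never touches indices at all, and the sparse path reuses the input's index array unchanged.

```scala
// Dense path: the position in the array is the feature index.
def transformDense(idf: Array[Double], values: Array[Double]): Array[Double] = {
  val res = new Array[Double](values.length)
  var i = 0
  while (i < values.length) { res(i) = values(i) * idf(i); i += 1 }
  res
}

// Sparse path: scale only the stored entries; indices pass through.
def transformSparse(
    idf: Array[Double],
    indices: Array[Int],
    values: Array[Double]): (Array[Int], Array[Double]) = {
  val res = new Array[Double](values.length)
  var k = 0
  while (k < values.length) { res(k) = values(k) * idf(indices(k)); k += 1 }
  (indices, res)
}
```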
Looks good then, except we might be able to make one more optimization here?
I just looked into the usage of the indices ids, and found that they are used as slicing indices, like val expElogbetad = expElogbeta(indices, ::).toDenseMatrix.
I will give it a try.
I am afraid that an empty list may not help simplify the implementation, since in places like private[clustering] def submitMiniBatch(batch: RDD[(Long, Vector)]): OnlineLDAOptimizer, we still have to create a List for slicing.
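For context, that slicing is Breeze's row-slice syntax, which takes a Seq[Int] of row ids; a tiny standalone example:

```scala
import breeze.linalg.{DenseMatrix => BDM}

val expElogbeta = BDM((0.1, 0.2), (0.3, 0.4), (0.5, 0.6)) // 3 terms x 2 topics
val indices = List(0, 2)                                  // term ids in the doc
// Select only the rows for these terms; toDenseMatrix materializes the slice.
val expElogbetad = expElogbeta(indices, ::).toDenseMatrix
assert(expElogbetad.rows == 2)
```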
@srowen Yes, I broke out the implementations in .mllib to expose methods for dense and sparse inputs (except PCA), and call them from .ml to avoid conversion.
Test build #107148 has finished for PR 24963 at commit
Test build #107153 has finished for PR 24963 at commit
Merged to master.
What changes were proposed in this pull request?
Make the transform natively in ml framework to avoid extra conversion.
There are many TODOs in the current ml module, like
// TODO: Make the transformer natively in ml framework to avoid extra conversion.
in ChiSqSelector. This PR makes the ml algorithms no longer need to convert ml vectors to mllib vectors in their transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
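The per-row conversion those TODOs refer to looks roughly like this (a sketch using Normalizer, not the exact diff): before, each .ml transform round-tripped through .mllib vectors; after, the same math runs on the .ml vector directly.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

val normalizer = new Normalizer(p = 2.0)

// Before: each row is converted .ml -> .mllib, transformed, then back.
val viaConversion = (v: Vector) => normalizer.transform(OldVectors.fromML(v)).asML

// After (the direction of this PR): the same math on the .ml vector itself.
val native = (v: Vector) => {
  val norm = Vectors.norm(v, 2.0)
  Vectors.dense(v.toArray.map(_ / norm))
}

val x = Vectors.dense(3.0, 4.0)
assert(viaConversion(x) == native(x)) // both give [0.6, 0.8]
```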
How was this patch tested?
Existing test suites.