[SPARK-20797][MLLIB]fix LocalLDAModel.save() bug. #18034
Conversation
// (floatSize * vectorSize + 15) * numWords
val approxSize = (4L * k + 15) * topicsMatrix.numRows
val nPartitions = ((approxSize / bufferSize) + 1).toInt
spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
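For intuition, here is a rough, illustrative reading of what that calculation does. bufferSize is assumed (it is not shown in this hunk) to be the kryo serializer buffer maximum, as in the Word2Vec save path, and the concrete numbers below are only an example:

// Illustrative numbers only: 300 topics over a 500,000-word vocabulary.
val k = 300
val numRows = 500000L                          // topicsMatrix.numRows (vocab size)
val bufferSize = 64L * 1024 * 1024             // assumed 64 MB kryo buffer max
val approxSize = (4L * k + 15) * numRows       // ≈ 607.5 MB estimated payload
val nPartitions = ((approxSize / bufferSize) + 1).toInt   // = 10 for these numbers
// The topic rows are then spread over ~10 parquet part-files instead of being
// forced through a single partition, which per the JIRA is what fails for large models.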
The problem is that this writes multiple files. I don't think you can do that.
Why not? I think it does work. The multiple parquet files may come back in arbitrary order, but each row stores its topic index. When load is called, Parquet restores the DataFrame; you can check LocalLDAModel's load method, which scans every row of the DataFrame and uses the topic indices to rebuild the (topic x vocab) breeze matrix.
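To make that concrete, here is a simplified sketch of the index-based rebuild the load path performs (the names and row layout are my own simplification, not the exact MLlib source); the point is that row order on disk is irrelevant:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// rows: (topicIndex, topicColumn) pairs collected from the parquet files,
// possibly in arbitrary order across multiple part-files
def rebuildTopics(rows: Seq[(Int, Array[Double])], vocabSize: Int, k: Int): BDM[Double] = {
  val brzTopics = BDM.zeros[Double](vocabSize, k)
  rows.foreach { case (topicIndex, weights) =>
    brzTopics(::, topicIndex) := BDV(weights)   // each column lands at its own index
  }
  brzTopics
}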
I believe the point of repartition(1) is to export this to a single file for external consumption. Of course writing it succeeds, but I don't think this is what it is intended to do.
I've read MLlib's online LDA implementation and the paper.
I found a simple way to test it: you can try this code snippet, which simulates a matrix to save:
import scala.util.Random
import breeze.linalg.DenseMatrix   // breeze assumed, matching the fill(...) call below

// Simulate a (vocab x topic) matrix roughly the size that makes save() fail
val vocabSize = 500000
val k = 300
val random = new Random()
val topicsDenseMatrix = DenseMatrix.fill[Double](vocabSize, k)(random.nextDouble())
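To carry the simulation through, here is a rough sketch (mine, not part of the original comment) of mimicking the save path on that matrix and reading it back; `spark`, `path`, and the partition count are assumed, and the Data row layout mirrors what LocalLDAModel writes:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One row per topic: (column of the matrix, topic index), as in LocalLDAModel.save
case class Data(topic: Vector, index: Int)

val topics = Range(0, k).map { topicInd =>
  Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
}
spark.createDataFrame(topics).repartition(10).write.parquet(path)  // 10 is illustrative

// Reading back: every topic column is present, whatever order the part-files come in
val loaded = spark.read.parquet(path).collect()
assert(loaded.length == k)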
I think the PR is trying to do something similar to https://github.com/apache/spark/pull/9989/files.
@srowen what's your concern with saving multiple files instead of one?
We've encountered crippling errors saving large LDA models in the ml API as well. The problem there is even worse, since the entire matrix is treated as a single datum.
Can you add a test case that saves the model into multiple files, loads it back, and checks the correctness?
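A rough sketch of what such a test could look like; it assumes it lives inside Spark's own test suite (so the private[spark] constructor of LocalLDAModel is reachable) and that `sc` and a temporary `path` are available. All sizes and names here are illustrative:

import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

val k = 3
val vocabSize = 10
val topics = Matrices.dense(vocabSize, k, Array.fill(vocabSize * k)(scala.util.Random.nextDouble()))
val model = new LocalLDAModel(topics, Vectors.dense(Array.fill(k)(1.0 / k)), 1.0 / k, 100.0)

model.save(sc, path)                        // should tolerate writing multiple part-files
val loaded = LocalLDAModel.load(sc, path)

// The round trip must preserve the topics matrix exactly, regardless of file order
assert(loaded.topicsMatrix.toArray.sameElements(model.topicsMatrix.toArray))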
Can one of the admins verify this patch?
Any update? The author seems inactive. Please let me know if I am mistaken; otherwise let me leave this closed.
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon <[email protected]>

Closes apache#21781 from HyukjinKwon/closing-prs.
What changes were proposed in this pull request?
LocalLDAModel's save function has a bug:
please see https://issues.apache.org/jira/browse/SPARK-20797
This change adds code similar to Word2VecModel's save method (repartitioning by an estimated data size) to avoid this bug.
How was this patch tested?
It is hard to test; it needs a large dataset trained with the online LDA method.