[SPARK-20797][MLLIB]fix LocalLDAModel.save() bug. #18034
Conversation
// (floatSize * vectorSize + 15) * numWords
val approxSize = (4L * k + 15) * topicsMatrix.numRows
val nPartitions = ((approxSize / bufferSize) + 1).toInt
spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
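For intuition, here is a rough, illustrative reading of what that calculation does. bufferSize is assumed (it is not shown in this hunk) to be the kryo serializer buffer maximum, as in the Word2Vec save path, and the concrete numbers below are only an example:

// Illustrative numbers only: 300 topics over a 500,000-word vocabulary.
val k = 300
val numRows = 500000L                          // topicsMatrix.numRows (vocab size)
val bufferSize = 64L * 1024 * 1024             // assumed 64 MB kryo buffer max
val approxSize = (4L * k + 15) * numRows       // ≈ 607.5 MB estimated payload
val nPartitions = ((approxSize / bufferSize) + 1).toInt   // = 10 for these numbers
// The topic rows are then spread over ~10 parquet part-files instead of being
// forced through a single partition, which per the JIRA is what fails for large models.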
The problem is that this writes multiple files. I don't think you can do that.
Why not? I think it does work. The multiple parquet files may come back in arbitrary order, but each row stores its topic index. When load is called, Parquet restores the DataFrame; you can check LocalLDAModel's load method, which scans every row of the DataFrame and uses the topic indices to rebuild the (topic x vocab) breeze matrix.
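To make that concrete, here is a simplified sketch of the index-based rebuild the load path performs (the names and row layout are my own simplification, not the exact MLlib source); the point is that row order on disk is irrelevant:

import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// rows: (topicIndex, topicColumn) pairs collected from the parquet files,
// possibly in arbitrary order across multiple part-files
def rebuildTopics(rows: Seq[(Int, Array[Double])], vocabSize: Int, k: Int): BDM[Double] = {
  val brzTopics = BDM.zeros[Double](vocabSize, k)
  rows.foreach { case (topicIndex, weights) =>
    brzTopics(::, topicIndex) := BDV(weights)   // each column lands at its own index
  }
  brzTopics
}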
I believe the point of repartition(1) is to export this to a single file for external consumption. Of course writing it succeeds, but I don't think this is what it is intended to do.
I've read MLlib's online LDA implementation and the paper.
I found a simple way to test it: you can try this code snippet, which simulates a matrix to save:
import scala.util.Random
import breeze.linalg.DenseMatrix   // breeze assumed, matching the fill(...) call below

// Simulate a (vocab x topic) matrix roughly the size that makes save() fail
val vocabSize = 500000
val k = 300
val random = new Random()
val topicsDenseMatrix = DenseMatrix.fill[Double](vocabSize, k)(random.nextDouble())
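To carry the simulation through, here is a rough sketch (mine, not part of the original comment) of mimicking the save path on that matrix and reading it back; `spark`, `path`, and the partition count are assumed, and the Data row layout mirrors what LocalLDAModel writes:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One row per topic: (column of the matrix, topic index), as in LocalLDAModel.save
case class Data(topic: Vector, index: Int)

val topics = Range(0, k).map { topicInd =>
  Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
}
spark.createDataFrame(topics).repartition(10).write.parquet(path)  // 10 is illustrative

// Reading back: every topic column is present, whatever order the part-files come in
val loaded = spark.read.parquet(path).collect()
assert(loaded.length == k)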
I think the PR is trying to do something similar to https://github.com/apache/spark/pull/9989/files.
@srowen what's your concern with saving multiple files instead of one?
We've encountered crippling errors saving large LDA models in the ml API as well. The problem there is even worse, since the entire matrix is treated as a single datum.
Can you add a test case that saves the model into multiple files, loads it back, and checks the correctness?
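A rough sketch of what such a test could look like; it assumes it lives inside Spark's own test suite (so the private[spark] constructor of LocalLDAModel is reachable) and that `sc` and a temporary `path` are available. All sizes and names here are illustrative:

import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.{Matrices, Vectors}

val k = 3
val vocabSize = 10
val topics = Matrices.dense(vocabSize, k, Array.fill(vocabSize * k)(scala.util.Random.nextDouble()))
val model = new LocalLDAModel(topics, Vectors.dense(Array.fill(k)(1.0 / k)), 1.0 / k, 100.0)

model.save(sc, path)                        // should tolerate writing multiple part-files
val loaded = LocalLDAModel.load(sc, path)

// The round trip must preserve the topics matrix exactly, regardless of file order
assert(loaded.topicsMatrix.toArray.sameElements(model.topicsMatrix.toArray))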
Can one of the admins verify this patch?
Any update? The author seems inactive. Please let me know if I am mistaken; otherwise let me leave this closed.
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon <[email protected]>

Closes apache#21781 from HyukjinKwon/closing-prs.
What changes were proposed in this pull request?
LocalLDAModel's save function has a bug:
please see https://issues.apache.org/jira/browse/SPARK-20797
This change adds code similar to Word2VecModel's save method (repartitioning by an estimated data size) to avoid this bug.
How was this patch tested?
It is hard to test; it needs a large dataset trained with the online LDA method.