
Conversation

@d0evi1 commented May 19, 2017

What changes were proposed in this pull request?

LocalLDAModel's save function has a bug:

please see: https://issues.apache.org/jira/browse/SPARK-20797

Add repartitioning, similar to what word2vec's save method does, to avoid this bug.

How was this patch tested?

It is hard to test: it needs a large dataset trained with the online LDA method.

// (floatSize * vectorSize + 15) * numWords
val approxSize = (4L * k + 15) * topicsMatrix.numRows
val nPartitions = ((approxSize / bufferSize) + 1).toInt
spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
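For a sense of scale, the arithmetic in the diff above can be run standalone. This is a hedged sketch: `bufferSize` is assumed to be the serializer's 64 MB default buffer limit, and `k` / `numWords` are illustrative values, not taken from the PR.

```scala
// Sketch of the partition-count arithmetic from the diff above.
// Assumptions: bufferSize stands in for a 64 MB serializer buffer limit;
// k and numWords are illustrative sizes, not values from the PR.
object PartitionCountSketch {
  val bufferSize: Long = 64L * 1024 * 1024 // assumed 64 MB buffer limit
  val k = 300                              // number of topics (illustrative)
  val numWords = 500000L                   // vocabulary size (illustrative)

  // Same shape as the diff: ~4 bytes per matrix entry plus per-row overhead.
  val approxSize: Long = (4L * k + 15) * numWords
  val nPartitions: Int = (approxSize / bufferSize + 1).toInt
}
```

With these sizes the estimated payload is about 607 MB, so the data is spread over ten partitions instead of hitting the buffer limit in a single write.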
Member

The problem is that this writes multiple files. I don't think you can do that.

Author

Why not? I think it does work. The multiple Parquet files may be in a random order, but each row still stores its topic index. When you call load, Parquet restores the DataFrame; if you check LocalLDAModel's load method, it scans all of the DataFrame's rows and uses the topic indices to rebuild the (topic x vocab) breeze matrix.
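The order-independent rebuild described above can be sketched in plain Scala, without Spark or breeze; the row shape `(topicIndex, values)` is an assumption for illustration, not the exact schema used by LocalLDAModel.

```scala
// Sketch: rows tagged with their topic index can be reassembled into the
// matrix regardless of the order in which the part files return them.
// The (Int, Array[Double]) row shape is assumed for illustration.
object RebuildSketch {
  val k = 3
  val vocabSize = 4

  // Rows as they might come back from Parquet, deliberately shuffled.
  val rows: Seq[(Int, Array[Double])] = Seq(
    (2, Array(2.0, 2.1, 2.2, 2.3)),
    (0, Array(0.0, 0.1, 0.2, 0.3)),
    (1, Array(1.0, 1.1, 1.2, 1.3))
  )

  // Rebuild: place each row by its stored index, ignoring arrival order.
  val matrix: Array[Array[Double]] = {
    val m = Array.ofDim[Double](k, vocabSize)
    rows.foreach { case (topic, values) => m(topic) = values }
    m
  }
}
```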

Member

I believe the point of repartition(1) is to export this to a single file for external consumption. Of course the write succeeds, but I don't think this is what it is intended to do.

Author

@d0evi1 May 20, 2017

I've read mllib's online LDA implementation and the paper.

I found a simple way to test it. You can try this code snippet, which simulates a matrix to save:

import scala.util.Random
import breeze.linalg.DenseMatrix

val vocabSize = 500000
val k = 300
val random = new Random()
val topicsDenseMatrix = DenseMatrix.fill[Double](vocabSize, k)(random.nextDouble())

Member

@d0evi1 I'm not sure how that's related to my comment.
@hhbyyh what do you think -- am I wrong about the issue?

Contributor

I think the PR is trying to do something similar to https://github.com/apache/spark/pull/9989/files.

Contributor

@srowen what's your concern with saving multiple files instead of one?

We've encountered crippling errors saving large LDA models in the ml API as well. The problem there is even worse, since the entire matrix is treated as a single datum.

Contributor

Can you add a test case that saves the model into multiple files, loads it back, and checks correctness?
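A round-trip check along these lines can be sketched without a Spark cluster: split an index-tagged matrix into several groups (standing in for part files), reload them in a different order, and compare against the original. All names here are illustrative, not Spark APIs.

```scala
// Sketch of a save/load round-trip check in miniature. Groups of rows
// stand in for Parquet part files; names and shapes are illustrative.
object RoundTripSketch {
  val original: Array[Array[Double]] =
    Array.tabulate(5, 4)((r, c) => r * 10.0 + c)

  // "Save": tag each row with its index and split into groups of two.
  val partitions: Seq[Seq[(Int, Array[Double])]] =
    original.zipWithIndex
      .map { case (row, i) => (i, row) }
      .grouped(2).map(_.toSeq).toSeq

  // "Load": read the partitions back in a different order and rebuild
  // the matrix by stored index rather than by arrival order.
  val reloaded: Array[Array[Double]] = {
    val m = Array.ofDim[Double](5, 4)
    partitions.reverse.flatten.foreach { case (i, row) => m(i) = row }
    m
  }

  val matches: Boolean =
    original.zip(reloaded).forall { case (a, b) => a.sameElements(b) }
}
```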

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

Any update? The author seems to be inactive. Please let me know if I am mistaken; otherwise, let me leave this closed.

@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon <[email protected]>

Closes apache#21781 from HyukjinKwon/closing-prs.

7 participants