SArray Shuffle #3178

TobyRoseman · 2020-05-07T23:40:07Z

Implements: #3123.

nickjong

Aside: is there any prospect of controlling this behavior with a random seed? What random seed does it use?

TobyRoseman · 2020-05-08T21:23:44Z

> Aside: is there any prospect of controlling this behavior with a random seed? What random seed does it use?

To get an O(n) running time without random access, we had to rely on groupby, which doesn't have a seed option. It sounds like adding that functionality to groupby isn't really possible.

The idea of having a random seed for shuffle is related to #3122. There I was planning on using the SFrame shuffle only if the user did not pass a random seed to tc.object_detection.create. If the user did pass in a random seed, the current sort-based shuffle functionality would be used.

Something I've considered doing instead is moving that sort-based shuffle functionality from the object detector into SFrame shuffle. The end result would be that SArray/SFrame shuffle would have a seed option: shuffle would run in O(n) time if no seed is given and in O(n log n) time if a seed is given. This shouldn't add much additional work to #3122. Let me know if you think that is worth doing.
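The proposed behavior can be sketched in plain Python. This is only an illustration on in-memory lists, not the SArray/SFrame implementation: the function name `shuffle` and its `seed` parameter are assumptions, and `random.shuffle` stands in for the groupby-based O(n) path, which it does not reproduce.

```python
import random


def shuffle(values, seed=None):
    """Return a shuffled copy of `values` (hypothetical helper)."""
    items = list(values)
    if seed is None:
        # Unseeded path: a linear-time shuffle. random.shuffle is only a
        # stand-in here for the groupby-based approach discussed above.
        random.shuffle(items)
        return items
    # Seeded path: give every element a pseudo-random key drawn from a
    # generator seeded by the caller, then sort by that key: O(n log n).
    rng = random.Random(seed)
    keys = [rng.random() for _ in items]
    order = sorted(range(len(items)), key=keys.__getitem__)
    return [items[i] for i in order]
```

Calling `shuffle(data, seed=7)` twice returns the same ordering, while `shuffle(data)` is free to differ between calls.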

nickjong · 2020-05-08T22:22:08Z

> The idea of having a random seed for shuffle is related to #3122. There I was planning on using the SFrame shuffle only if the user did not pass a random seed to tc.object_detection.create. If the user did pass in a random seed, the current sort-based shuffle functionality would be used.

One wrinkle here is that even when the user does not pass in a random seed, we choose one (er, at random) and record it in the model attributes. The user can reproduce the model by passing that seed and the same data and parameters (on the same platform).

But all this means is that the Object Detection case would always use the seeded version, since there's always a seed by the time you get to that part of the implementation.
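A minimal sketch of that pattern, assuming hypothetical names throughout (`resolve_seed`, `train`, and the `random_seed` attribute key are illustrative, not the toolkit's actual code):

```python
import random


def resolve_seed(seed=None):
    """Return the caller's seed, or pick one at random if none was given."""
    if seed is None:
        # Draw the seed from the OS entropy source so it is not itself
        # determined by an earlier seeding of the global generator.
        seed = random.SystemRandom().randrange(2**32)
    return seed


def train(data, seed=None):
    """Toy 'training' step that only shuffles, to show seed recording."""
    seed = resolve_seed(seed)
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)
    # The seed is always recorded, so a later run can reproduce this one.
    return {"random_seed": seed, "order": order}
```

Because a seed always exists by the time the shuffle runs, passing the recorded `random_seed` back in reproduces the same order.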

guihao-liang · 2020-05-10T17:55:12Z

I suggested in #3060 that an irreproducible model is not really useful for model development. I have worked on optimizing groupby before (#2210), and I think the overall sort is a bucket sort rather than a merge sort over the entire set, since each hash bucket is sorted on its own (#2210). What we need is something similar to a stable sort: we can provide an extra index column that tracks the original order and feed it into groupby together with the data. After the groupby, we should be able to reconstruct the original relative order within each bucket. The big O can be n + B*log(n/B), where B is the number of buckets in the hash table used by groupby.
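One way to picture the index-column idea, as a rough sketch in plain Python rather than the real groupby: each row carries its original index, a seeded hash of that index picks the bucket, and each bucket is put into a fixed order afterwards, so the order in which rows arrive does not affect the result. The names `bucketed_shuffle`, `_key`, and `num_buckets` are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def _key(seed, index):
    """Deterministic pseudo-random key derived from the seed and row index."""
    digest = hashlib.blake2b(f"{seed}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bucketed_shuffle(rows, seed, num_buckets=8):
    """Shuffle (index, value) pairs that may arrive in any order."""
    buckets = defaultdict(list)
    for index, value in rows:
        k = _key(seed, index)
        buckets[k % num_buckets].append((k, index, value))
    out = []
    for b in sorted(buckets):
        # Sort each bucket by (key, index) only; the index column is what
        # makes the within-bucket order independent of arrival order.
        bucket = sorted(buckets[b], key=lambda t: (t[0], t[1]))
        out.extend(value for _, _, value in bucket)
    return out
```

Feeding the same rows in reversed order with the same seed yields the same output, which is the property a seeded shuffle needs.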

TobyRoseman · 2020-05-11T21:45:47Z

@guihao-liang - I've created #3187 to track adding a seed parameter to SArray/SFrame shuffle.

hoytak · 2020-05-11T21:54:52Z

A model that relies on the order of input isn't robust. For example, SGD in matrix factorization with multiple threads is non-deterministic, and generally most stuff on the GPU isn't super deterministic anyway. In small test cases, this may make sense, but a fast non-deterministic shuffle is definitely useful.

guihao-liang · 2020-05-11T22:14:46Z

SGD uses a seed too. With the same seed, SGD is also deterministic, because its randomness is pseudo-random. I don't know about matrix factorization, but for standard back-prop (RNN, CNN), everything is deterministic. The non-determinism comes from floating-point computation errors and partial-derivative approximation errors from singular functions such as ReLU.

Determinism is important for educational purposes. I took courses that teach DL, and it's important for students to be able to reproduce what's taught in the lectures. The most important use is randomly splitting the data set into training and testing sets. If the input is different every time, how can students check whether they are doing each step right, given that DL is a complicated process overall? I know turicreate is also used for educational purposes, and I think determinism is important at least for this case.
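To illustrate the train/test case, here is a small plain-Python sketch of a seeded split. The helper `random_split` and its parameters are written for this example and are assumptions, not a quote of any library's implementation.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to `train` with probability `fraction`, else to `test`."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One pseudo-random draw per row; the same seed replays the same draws.
        (train if rng.random() < fraction else test).append(row)
    return train, test
```

Two students who call `random_split(rows, 0.8, seed=0)` on the same rows get identical training and testing sets, so intermediate results can be compared step by step.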

hoytak · 2020-05-14T19:46:52Z

The non-randomness in our inner SGD routines (not the deep learning ones) comes from parallelism, so it's inherently non-deterministic based on . Determinism can be a trade-off with speed and accuracy in a lot of cases (such as this one with group-by), so it should be up to the user what the balance should be for their particular use case. I don't see a right answer here.
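One common way parallel code ends up with run-to-run differences, offered here as a general illustration and not as a description of the SGD routines above, is that floating-point addition is not associative: if threads combine partial results in a different order, the total can change. The sketch below also shows the trade-off side, since an order-independent sum (`math.fsum`) does extra work.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, combined in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The two results differ in the last bit.
order_matters = left_to_right != right_to_left

# math.fsum tracks exact partial sums, so its result does not depend on
# the order of the inputs, at the cost of more computation per element.
stable_forward = math.fsum(values)
stable_reversed = math.fsum(reversed(values))
```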

SArray Shuffle

8cf081c

TobyRoseman requested review from hoytak and nickjong May 7, 2020 23:40

nickjong approved these changes May 8, 2020

TobyRoseman merged commit bc89df1 into apple:master May 8, 2020

TobyRoseman mentioned this pull request May 11, 2020

Seed Parameter for SArray/SFrame Shuffle #3187

Open

TobyRoseman mentioned this pull request May 11, 2020

Object Detector should use new SFrame shuffle method if possible #3122

Open

TobyRoseman mentioned this pull request May 15, 2020

SFrame and SArray Should have Shuffle Method. #942

Closed
