This repository was archived by the owner on Dec 21, 2023. It is now read-only.

Conversation

@TobyRoseman
Collaborator

Implements: #3123.

@TobyRoseman TobyRoseman requested review from hoytak and nickjong May 7, 2020 23:40
Collaborator

@nickjong nickjong left a comment

Aside: is there any prospect of controlling this behavior with a random seed? What random seed does it use?

@TobyRoseman
Collaborator Author

Aside: is there any prospect of controlling this behavior with a random seed? What random seed does it use?

To get an O(n) running time without random access, we had to rely on groupby, which doesn't have a seed option. It sounds like adding that functionality to groupby isn't really possible.

The idea of having a random seed for shuffle is related to #3122. There I was planning on using the SFrame shuffle only if the user did not pass a random seed to tc.object_detection.create. If the user did pass in a random seed, the current sort-based shuffle functionality would be used.

Something I've considered doing instead is moving that sort-based shuffle functionality from the object detector into SFrame shuffle. The end result would be that SArray/SFrame shuffle would have a seed option: shuffle would run in O(n) time if no seed is given and in O(n log n) time if a seed is given. This shouldn't add too much additional work to #3122. Let me know if you think that is worth doing.
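To make the proposal concrete, here is a minimal sketch of a shuffle with an optional seed, written in plain Python over lists rather than against the real SFrame API. The name `shuffle_rows` is hypothetical, and the unseeded branch simply calls `random.shuffle` as a stand-in for the groupby-based O(n) path discussed here. The seeded branch assigns each row a pseudo-random key and sorts by it, which is where the O(n log n) cost comes from.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return a shuffled copy of rows (hypothetical helper, not the SFrame API)."""
    rows = list(rows)
    if seed is None:
        # Unseeded path: stand-in for the fast, non-reproducible shuffle.
        random.shuffle(rows)
        return rows

    # Seeded path: give every row a pseudo-random sort key drawn from a
    # generator initialised with the seed, then sort by that key.
    rng = random.Random(seed)
    keys = [rng.random() for _ in rows]
    order = sorted(range(len(rows)), key=keys.__getitem__)
    return [rows[i] for i in order]


first = shuffle_rows(range(10), seed=7)
second = shuffle_rows(range(10), seed=7)
```

With the same seed and the same input, `first` and `second` hold the rows in the same order; without a seed, the order can differ from call to call.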

@TobyRoseman TobyRoseman merged commit bc89df1 into apple:master May 8, 2020
@nickjong
Collaborator

nickjong commented May 8, 2020

The idea of having a random seed for shuffle is related to #3122. There I was planning on using the SFrame shuffle only if the user did not pass a random seed to tc.object_detection.create. If the user did pass in a random seed, the current sort-based shuffle functionality would be used.

One wrinkle here is that even when the user does not pass in a random seed, we choose one (er, at random) and record it in the model attributes. The user can reproduce the model by passing that seed and the same data and parameters (on the same platform).

But all this means is that the Object Detection case would always use the seeded version, since there's always a seed by the time you get to that part of the implementation.
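The pattern described above (pick a seed when the user gives none, then record it so the run can be repeated) might look roughly like the sketch below. The functions `resolve_seed` and `create_model` and the dictionary keys are hypothetical illustrations, not the toolkit's actual code.

```python
import random


def resolve_seed(user_seed=None):
    """Return the user's seed, or draw one at random if none was given."""
    if user_seed is not None:
        return user_seed
    # SystemRandom draws from the OS entropy source, so the chosen seed
    # does not depend on the state of the global generator.
    return random.SystemRandom().randrange(2**31)


def create_model(data, seed=None):
    """Toy stand-in for model creation that records the seed it used."""
    seed = resolve_seed(seed)
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)  # every random step is driven by the seed
    return {"random_seed": seed, "training_order": shuffled}


model = create_model(range(20))
replay = create_model(range(20), seed=model["random_seed"])
```

Because a seed always exists by the time the shuffle runs, `replay` reproduces the ordering of `model`, which is the reason the object detection case would always take the seeded path.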

@guihao-liang
Collaborator

guihao-liang commented May 10, 2020

I've suggested in #3060 that an irreproducible model is not really useful for model development. I have worked on optimizing groupby before (#2210), and I think the overall sort is a bucket sort rather than a merge sort on the entire set, since each hash bucket is relatively sorted (#2210). What we need is something similar to a stable sort: we can provide an extra index column that tracks the original order and feed it into groupby together with the data. After the groupby, we should be able to reconstruct the original relative order within each bucket. The big O can be n + B*log(n/B), where B is the number of buckets in the hash table used by groupby.
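As a rough illustration of the index-column idea, the sketch below simulates it in plain Python. It makes several assumptions: `bucketed_shuffle` is a hypothetical name, a dictionary of lists stands in for the groupby hash table, and an unseeded `random.shuffle` imitates rows arriving in each bucket in an arbitrary order. It only shows how the extra index restores a deterministic order within each bucket; it makes no claim about the real groupby internals or about the quality of the resulting permutation.

```python
import random
from collections import defaultdict


def bucketed_shuffle(rows, num_buckets, seed):
    """Deterministic bucket-based reordering (illustrative sketch only)."""
    rng = random.Random(seed)
    # Tag each row with a seeded bucket id and its original index.
    tagged = [(rng.randrange(num_buckets), i, row) for i, row in enumerate(rows)]

    # Imitate a groupby whose within-bucket arrival order is arbitrary.
    arrival = tagged[:]
    random.shuffle(arrival)  # deliberately unseeded
    buckets = defaultdict(list)
    for bucket_id, index, row in arrival:
        buckets[bucket_id].append((index, row))

    result = []
    for bucket_id in sorted(buckets):
        # The index column lets us restore the original relative order
        # inside the bucket, whatever order the rows arrived in.
        for _, row in sorted(buckets[bucket_id], key=lambda pair: pair[0]):
            result.append(row)
    return result


run_a = bucketed_shuffle(list("abcdefghij"), num_buckets=4, seed=3)
run_b = bucketed_shuffle(list("abcdefghij"), num_buckets=4, seed=3)
```

Both runs produce the same order even though the simulated arrival order differs between them. Sorting each bucket by a seeded key instead of the original index would be one way to avoid preserving the input order inside a bucket.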

@TobyRoseman
Collaborator Author

@guihao-liang - I've created #3187 to track adding a seed parameter to SArray/SFrame shuffle.

@hoytak

hoytak commented May 11, 2020

A model that relies on the order of input isn't robust. For example, SGD in matrix factorization with multiple threads is non-deterministic, and generally most stuff on the GPU isn't super deterministic anyway. In small test cases, this may make sense, but a fast non-deterministic shuffle is definitely useful.

@guihao-liang
Collaborator

SGD uses a seed too, so SGD run with the same seed is also deterministic, because it relies on pseudo-random numbers. I don't know about matrix factorization, but for standard back-prop (RNN, CNN), everything is deterministic. The non-determinism comes from floating-point computation errors and partial-derivative approximation errors from singular functions such as ReLU.

Determinism is important for educational purposes. I took courses that teach DL, and it's important for students to be able to reproduce what's taught in the lectures. The most important use is randomly splitting the data set into training and test sets. If the input is different every time, how can students check whether they are doing each step right, given that DL is a complicated process overall? I know turicreate is also used for educational purposes, and I think determinism is important at least for this case.
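For the train/test case, a seeded split can be sketched in a few lines of standard-library Python. The helper `seeded_split` is a hypothetical name used only for illustration; it is not the toolkit's own split routine.

```python
import random


def seeded_split(rows, fraction, seed):
    """Send each row to train with probability `fraction`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One pseudo-random draw per row, fully determined by the seed.
        (train if rng.random() < fraction else test).append(row)
    return train, test


train_a, test_a = seeded_split(range(100), 0.8, seed=1)
train_b, test_b = seeded_split(range(100), 0.8, seed=1)
```

Two students who use the same seed, data, and fraction get identical training and test sets, so their intermediate results can be compared step by step.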

@hoytak

hoytak commented May 14, 2020

The non-determinism in our inner SGD routines (not the deep learning ones) comes from parallelism, so it's inherently non-deterministic based on . Determinism can be a trade-off with speed and accuracy in a lot of cases (such as this one with group-by), so it should be up to the user what the balance should be for their particular use case. I don't see a right answer here.
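One common way parallel code produces run-to-run differences is that floating-point addition is not associative, so the order in which threads contribute partial results can change the final value. The sketch below demonstrates only that general effect, with a hypothetical helper `sum_in_order`; it is an assumption that this is relevant here, and it is not a description of how the SGD routines mentioned above are implemented.

```python
def sum_in_order(values, order):
    """Add up values following the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = sum_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1
```

The two totals differ in the last bits even though the inputs are the same. Forcing a fixed accumulation order removes the difference, usually at some cost in speed, which is the trade-off described in the comment above.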

4 participants