Limit train samples #2809
Conversation
stanjo74 commented on Sep 28, 2021
- Limit loaded samples to 2GB.
- Fixes zstd train-cover chokes if sample size is "too large" #2745
- Support for the --block-size option, which enables chunking: sample files are broken up into chunks, and each chunk is a training sample (see the sketch after this list).
- A sample is limited to 128KB.
- Rewrote the sample loading logic to be more expressive.
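A minimal sketch of how the chunking and the two caps could interact, under stated assumptions (the function name and the exact tail/truncation behavior are hypothetical; this is not the PR's actual dibio.c code):

```c
#include <stddef.h>

#define MAX_SAMPLE_SIZE ((size_t)128 * 1024)          /* per-sample cap: 128KB */
#define LOAD_BUDGET     (2ULL * 1024 * 1024 * 1024)   /* total load cap: 2GB */

/* Hypothetical helper: given one file's size and the --block-size value,
 * record each chunk's size as an individual training sample, stopping once
 * the running total would exceed the 2GB budget.
 * Returns the number of samples produced for this file. */
static size_t chunkFileIntoSamples(size_t fileSize, size_t blockSize,
                                   size_t* sampleSizes, size_t maxSamples,
                                   unsigned long long* totalLoaded)
{
    size_t nSamples = 0;
    size_t remaining = fileSize;
    while (remaining > 0 && nSamples < maxSamples) {
        size_t chunk = blockSize ? blockSize : remaining;      /* 0 = no chunking */
        if (chunk > remaining) chunk = remaining;
        if (chunk > MAX_SAMPLE_SIZE) chunk = MAX_SAMPLE_SIZE;  /* 128KB cap */
        if (*totalLoaded + chunk > LOAD_BUDGET) break;         /* 2GB budget spent */
        sampleSizes[nSamples++] = chunk;
        *totalLoaded += chunk;
        remaining -= chunk;
        if (!blockSize) break;   /* unchunked: at most one sample per file */
    }
    return nSamples;
}
```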
The first commit d758afe limits the total sample size to 2GB.
terrelln left a comment
Overall, I love the simplification to file loading; just some style nits.
programs/dibio.c (Outdated)
/* Shuffle input files before we start assessing how much sample data to load.
   The purpose of the shuffle is to pick random samples when the sample
   set is larger than what we can load in memory. */
nit: That isn't the only purpose. We also shuffle to improve training: some biases can be introduced when samples show up in a "sorted" order, and shuffling mitigates them.
This is interesting; I'd like to know more about these biases. I'm not sure what to add to the comment at this point.
Maybe: "The purpose of the shuffle is to avoid bias in the trainer due to sorted files, and to pick random samples ..."
I'd be happy to explain more about how the dictionary builder works and how sorting can introduce bias, but it is probably too verbose to put into a comment. It would be worth a meeting + writing up a doc, so others can read it too.
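For reference, the shuffle itself is just a uniform permutation of the file list before any loading happens. A minimal Fisher-Yates sketch (the function name and RNG choice are illustrative, not the actual dibio.c implementation):

```c
#include <stdlib.h>

/* Shuffle the input file names in place (Fisher-Yates) before loading, so
 * that (a) when the sample set exceeds the memory budget, the subset that
 * gets loaded is a random one, and (b) the trainer never sees files in a
 * "sorted" order that could bias the dictionary. */
static void shuffleFileNames(const char** fileNames, size_t nFiles)
{
    if (nFiles < 2) return;
    for (size_t i = nFiles - 1; i > 0; --i) {
        size_t const j = (size_t)rand() % (i + 1);   /* illustrative RNG */
        const char* const tmp = fileNames[i];
        fileNames[i] = fileNames[j];
        fileNames[j] = tmp;
    }
}
```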
One observation is …
Tested 32KB chunking with 40K files of 112KB and got this output.
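For scale, assuming KB here means KiB: 40,000 files × 112KiB ≈ 4.27GiB of input, more than twice the 2GB load budget. Each 112KiB file splits into three full 32KiB chunks plus a 16KiB tail, and at 32KiB per sample the budget admits at most 2GiB / 32KiB = 65,536 full-size samples, so the loader must stop partway through the (shuffled) file list.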
The dictionary builder API in … We could change the other functions, but then it would be inconsistent with …
I always use … But zstd does deal with sizes > 2 GB in many cases, so we have to be careful to use …
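To make the overflow concern concrete: on a 32-bit build, size_t is 32 bits, so a running total of sample sizes can silently wrap past 4GB. A hedged sketch of the careful pattern (the function name is hypothetical; not the PR's actual code): accumulate in a 64-bit type, clamp to the 2GB budget, and only then narrow to size_t.

```c
#include <stddef.h>

#define LOAD_BUDGET (2ULL * 1024 * 1024 * 1024)   /* 2GB */

/* Sum file sizes in unsigned long long (at least 64 bits) so the total
 * cannot wrap on 32-bit platforms, and clamp to the budget before the
 * value is ever narrowed to the size_t the trainer works with.
 * 2GB always fits in size_t, so the final cast is safe. */
static size_t totalBytesToLoad(const unsigned long long* fileSizes, size_t nFiles)
{
    unsigned long long total = 0;
    for (size_t i = 0; i < nFiles; ++i) {
        total += fileSizes[i];
        if (total >= LOAD_BUDGET) return (size_t)LOAD_BUDGET;   /* clamp early */
    }
    return (size_t)total;
}
```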
…aded number of samples. Changed unit conversion not to use bit-shifts.
Fixed the bug with 9eb56a3: the estimated number of samples was being passed rather than the loaded number of samples.
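The shape of that bug, sketched with hypothetical names (only ZDICT_trainFromBuffer() is the real API): training must be told how many samples were actually loaded, since loading can stop early at the 2GB budget and leave fewer samples than the pre-load estimate.

```c
#include <zdict.h>   /* ZDICT_trainFromBuffer() */

/* Hypothetical wrapper illustrating the fix: pass the count of samples
 * actually loaded, not the count estimated before loading (loading may
 * stop early once the 2GB budget is exhausted). */
static size_t trainOnLoadedSamples(void* dictBuffer, size_t dictCapacity,
                                   const void* samplesBuffer,
                                   const size_t* sampleSizes,
                                   size_t nbEstimatedSamples,
                                   size_t nbLoadedSamples)
{
    (void)nbEstimatedSamples;   /* the bug: this count was passed instead */
    return ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                 samplesBuffer, sampleSizes,
                                 (unsigned)nbLoadedSamples);
}
```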