Conversation

@stanjo74 (Contributor)

  • Limit loaded samples to 2GB.
  • Fixes #2745: zstd train-cover chokes if sample size is "too large".
  • Support for the --block-size option, which enables chunking: each sample file is broken up into chunks and each chunk becomes a training sample (see the sketch after this list).
  • A sample is limited to 128KB.
  • Rewrote the sample loading logic to be more expressive.
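
To illustrate the chunking described above, here is a minimal sketch of how a per-chunk sample size could be derived. It is not the actual dibio.c code; the function name, the blockSize == 0 convention, and the SAMPLE_SIZE_MAX constant are assumptions made for this example:

```c
#include <stddef.h>

#define SAMPLE_SIZE_MAX (128 * 1024)   /* per-sample cap described in this PR */

/* Illustrative helper: size of the n-th training sample (chunk) of a file.
 * blockSize == 0 is taken here to mean "no chunking": the whole file is one
 * sample, clamped to the 128KB cap. */
static size_t sampleSizeForChunk(size_t fileSize, size_t blockSize, size_t chunkIndex)
{
    size_t const chunkSize = blockSize ? blockSize : fileSize;
    size_t const offset    = chunkIndex * chunkSize;
    size_t remaining;
    if (offset >= fileSize) return 0;              /* past the end of the file */
    remaining = fileSize - offset;
    if (remaining > chunkSize) remaining = chunkSize;
    if (remaining > SAMPLE_SIZE_MAX) remaining = SAMPLE_SIZE_MAX;
    return remaining;
}
```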

@stanjo74 stanjo74 marked this pull request as draft September 28, 2021 04:32
@stanjo74 (Contributor Author)

The first commit d758afe limits the loaded sample size to 2GB.
The remaining commits are an optional rewrite of the sample-loading logic.

@terrelln (Contributor) left a comment

Overall, I love the simplification to file loading, just some style nits.

programs/dibio.c Outdated
Comment on lines 334 to 336
/* Shuffle input files before we start assessing how much sample data to load.
   The purpose of the shuffle is to pick random samples when the sample
   set is larger than what we can load in memory. */
@terrelln (Contributor)

nit: That isn't the only purpose. We also shuffle to improve training: some biases can be introduced when samples show up in a "sorted" order, and shuffling mitigates them.

@stanjo74 (Contributor Author)

This is interesting, I'd like to know more about these biases. I'm not sure what to add to the comment at this point.

@terrelln (Contributor)

Maybe: "The purpose of the shuffle is to avoid bias in the trainer due to sorted files, and to pick random samples ..."

I'd be happy to explain more about how the dictionary builder works and how sorting can introduce bias, but it is probably too verbose to put into a comment. It would be worth a meeting plus writing up a doc, so others can read it too.
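
As background for the shuffle being discussed, here is a minimal Fisher-Yates sketch over the input file list. The function name, the PRNG (rand/srand), and the seed parameter are illustrative assumptions, not the actual dibio.c implementation:

```c
#include <stdlib.h>

/* Illustrative Fisher-Yates shuffle of the input file name array.
 * A fixed seed keeps training runs reproducible; rand()/srand() stand in
 * for whatever PRNG the real code uses. */
static void shuffleFileNames(const char** fileNames, unsigned nbFiles, unsigned seed)
{
    unsigned i;
    srand(seed);
    for (i = nbFiles; i > 1; --i) {
        unsigned const j = (unsigned)(rand() % i);   /* pick among the first i entries */
        const char* const tmp = fileNames[i - 1];
        fileNames[i - 1] = fileNames[j];
        fileNames[j] = tmp;
    }
}
```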

@stanjo74 (Contributor Author)

One observation: the size_t* sampleSizes array - we don't need 64 bits to track a sample size that is limited to 128KB. I think we should change it to int and save some memory.
Which raises the philosophical question of whether to use a signed or unsigned int to store cardinal values. My personal preference is int - change my mind.

@stanjo74 (Contributor Author)

Tested 32KB chunking with 40K files of 112KB and got this output.
The number of samples reported along the chain is suspicious - will track.

Shuffling input files
Training files are 4458176KB, **160002 samples**
Training samples set too large (4353 MB); training on 2048 MB only...
loaded 2097127KB total data, **75265 nb samples**
Trying 82 different sets of parameters
d=6
Training on **120001 samples** of total size 2147458979

@terrelln (Contributor) commented Sep 29, 2021

> One observation: the size_t* sampleSizes array - we don't need 64 bits to track a sample size that is limited to 128KB. I think we should change it to int and save some memory.

The dictionary builder API in lib/dictBuilder/ requires a size_t const*. The function ZDICT_trainFromBuffer() is stable and its API cannot be changed. It can be deprecated, and maybe eventually deleted a few years later, but never changed.

We could change the other functions, but then they would be inconsistent with ZDICT_trainFromBuffer(), and it would break our users without a real gain. Also, the 128KB limit is a programs/ limitation, not a lib/dictBuilder limitation: the API can accept files of any size, although passing very large files doesn't really make sense.
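
For context, here is a minimal usage sketch of that stable entry point, showing why the per-sample sizes are passed as a size_t array. It assumes zdict.h is on the include path; the wrapper name trainDict is made up for this example:

```c
#include <stddef.h>
#include <zdict.h>   /* ZDICT_trainFromBuffer(), ZDICT_isError() */

/* Samples are laid out back-to-back in samplesBuffer, and samplesSizes[i]
 * is the size of sample i -- this is the size_t const* the stable API takes. */
static size_t trainDict(void* dictBuffer, size_t dictCapacity,
                        const void* samplesBuffer,
                        const size_t* samplesSizes, unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samplesBuffer, samplesSizes, nbSamples);
    return ZDICT_isError(dictSize) ? 0 : dictSize;   /* 0 here means training failed */
}
```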

> Which raises the philosophical question of whether to use a signed or unsigned int to store cardinal values. My personal preference is int - change my mind.

I always use size_t or int. unsigned never really makes sense, as it just makes the job harder for the compiler. In cases where I know overflow isn't an issue, I tend to prefer int as well, generally for loop variables.

But zstd does deal with sizes > 2 GB in many cases, so we have to be careful to use size_t there. We only use size_t for counts when we know they refer to contiguous memory, like buffers. When we're dealing with larger numbers, generally the frame content size (how large the data decompresses into), we make sure to use uint64_t, so 32-bit mode works with sizes > 4GB.
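
A small illustration of that rule of thumb; the struct and function names below are made up for this example and are not zstd code:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t   loadedBytes;   /* contiguous memory actually held: size_t */
    uint64_t contentSize;   /* decompressed size: may exceed 4GB on 32-bit builds */
} SampleSetInfo_example;

static size_t totalSampleBytes_example(const size_t* sampleSizes, int nbSamples)
{
    size_t total = 0;
    int i;                  /* plain int loop variable: overflow is not a concern here */
    for (i = 0; i < nbSamples; ++i)
        total += sampleSizes[i];
    return total;
}
```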

Commits added:
  • …aded number of samples.
  • Changed unit conversion not to use bit-shifts.
@stanjo74 (Contributor Author)

> Tested 32KB chunking with 40K files of 112KB and got this output. The number of samples reported along the chain is suspicious - will track.

Fixed the bug with 9eb56a3. The estimated number of samples was passed rather than the loaded number of samples.

@stanjo74 stanjo74 marked this pull request as ready for review September 30, 2021 00:26

Merging this pull request may close issue #2745: zstd train-cover chokes if sample size is "too large".