Conversation

@stanjo74 (Contributor)

  • Limit loaded samples to 2GB.
  • Fixes #2745: zstd train-cover chokes if sample size is "too large".
  • Support for the --block-size option, which enables chunking: each sample file is broken up into chunks and each chunk becomes a training sample (see the sketch after this list).
  • A sample is limited to 128KB.
  • Rewrote the sample loading logic to be more expressive.
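
To illustrate the chunking described above, here is a minimal sketch of how a per-chunk sample size could be derived. It is not the actual dibio.c code; the function name, the blockSize == 0 convention, and the SAMPLE_SIZE_MAX constant are assumptions made for this example:

```c
#include <stddef.h>

#define SAMPLE_SIZE_MAX (128 * 1024)   /* per-sample cap described in this PR */

/* Illustrative helper: size of the n-th training sample (chunk) of a file.
 * blockSize == 0 is taken here to mean "no chunking": the whole file is one
 * sample, clamped to the 128KB cap. */
static size_t sampleSizeForChunk(size_t fileSize, size_t blockSize, size_t chunkIndex)
{
    size_t const chunkSize = blockSize ? blockSize : fileSize;
    size_t const offset    = chunkIndex * chunkSize;
    size_t remaining;
    if (offset >= fileSize) return 0;              /* past the end of the file */
    remaining = fileSize - offset;
    if (remaining > chunkSize) remaining = chunkSize;
    if (remaining > SAMPLE_SIZE_MAX) remaining = SAMPLE_SIZE_MAX;
    return remaining;
}
```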

@stanjo74 stanjo74 marked this pull request as draft September 28, 2021 04:32
@stanjo74 (Contributor Author)

The first commit d758afe limits the loaded sample size to 2GB.
The remaining commits are an optional rewrite of the sample-loading logic.

@terrelln (Contributor) left a comment

Overall, I love the simplification to file loading, just some style nits.

programs/dibio.c Outdated
Comment on lines 334 to 336
/* Shuffle input files before we start assessing how much sample data to load.
   The purpose of the shuffle is to pick random samples when the sample
   set is larger than what we can load in memory. */
@terrelln (Contributor)

nit: That isn't the only purpose. We also shuffle to improve training: some biases can be introduced when samples show up in a "sorted" order, and shuffling mitigates them.

@stanjo74 (Contributor Author)

This is interesting, I'd like to know more about these biases. I'm not sure what to add to the comment at this point.

@terrelln (Contributor)

Maybe: "The purpose of the shuffle is to avoid bias in the trainer due to sorted files, and to pick random samples ..."

I'd be happy to explain more about how the dictionary builder works and how sorting can introduce bias, but it is probably too verbose to put into a comment. It would be worth a meeting plus writing up a doc, so others can read it too.
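
As background for the shuffle being discussed, here is a minimal Fisher-Yates sketch over the input file list. The function name, the PRNG (rand/srand), and the seed parameter are illustrative assumptions, not the actual dibio.c implementation:

```c
#include <stdlib.h>

/* Illustrative Fisher-Yates shuffle of the input file name array.
 * A fixed seed keeps training runs reproducible; rand()/srand() stand in
 * for whatever PRNG the real code uses. */
static void shuffleFileNames(const char** fileNames, unsigned nbFiles, unsigned seed)
{
    unsigned i;
    srand(seed);
    for (i = nbFiles; i > 1; --i) {
        unsigned const j = (unsigned)(rand() % i);   /* pick among the first i entries */
        const char* const tmp = fileNames[i - 1];
        fileNames[i - 1] = fileNames[j];
        fileNames[j] = tmp;
    }
}
```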

@stanjo74 (Contributor Author)

One observation: the size_t* sampleSizes array - we don't need 64 bits to track a sample size that is limited to 128KB. I think we should change it to int and save some memory.
Which raises the philosophical question of whether to use a signed or unsigned int to store cardinal values. My personal preference is int - change my mind.

@stanjo74 (Contributor Author)

Tested 32KB chunking with 40K files of 112KB and got this output.
The number of samples reported along the chain is suspicious - will track.

Shuffling input files
Training files are 4458176KB, **160002 samples**
Training samples set too large (4353 MB); training on 2048 MB only...
loaded 2097127KB total data, **75265 nb samples**
Trying 82 different sets of parameters
d=6
Training on **120001 samples** of total size 2147458979

@terrelln (Contributor) commented Sep 29, 2021

> One observation: the size_t* sampleSizes array - we don't need 64 bits to track a sample size that is limited to 128KB. I think we should change it to int and save some memory.

The dictionary builder API in lib/dictBuilder/ requires a size_t const*. The function ZDICT_trainFromBuffer() is stable and its API cannot be changed. It can be deprecated, and maybe eventually deleted a few years later, but never changed.

We could change the other functions, but then they would be inconsistent with ZDICT_trainFromBuffer(), and it would break our users without a real gain. Also, the 128KB limit is a programs/ limitation, not a lib/dictBuilder limitation: the API can accept files of any size, although passing very large files doesn't really make sense.
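
For context, here is a minimal usage sketch of that stable entry point, showing why the per-sample sizes are passed as a size_t array. It assumes zdict.h is on the include path; the wrapper name trainDict is made up for this example:

```c
#include <stddef.h>
#include <zdict.h>   /* ZDICT_trainFromBuffer(), ZDICT_isError() */

/* Samples are laid out back-to-back in samplesBuffer, and samplesSizes[i]
 * is the size of sample i -- this is the size_t const* the stable API takes. */
static size_t trainDict(void* dictBuffer, size_t dictCapacity,
                        const void* samplesBuffer,
                        const size_t* samplesSizes, unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samplesBuffer, samplesSizes, nbSamples);
    return ZDICT_isError(dictSize) ? 0 : dictSize;   /* 0 here means training failed */
}
```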

> Which raises the philosophical question of whether to use a signed or unsigned int to store cardinal values. My personal preference is int - change my mind.

I always use size_t or int. unsigned never really makes sense, as it just makes the job harder for the compiler. In cases where I know overflow isn't an issue, I tend to prefer int as well, generally for loop variables.

But zstd does deal with sizes > 2 GB in many cases, so we have to be careful to use size_t there. We only use size_t for counts when we know they refer to contiguous memory, like buffers. When we're dealing with larger numbers, generally the frame content size (how large the data decompresses into), we make sure to use uint64_t, so 32-bit mode works with sizes > 4GB.
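
A small illustration of that rule of thumb; the struct and function names below are made up for this example and are not zstd code:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t   loadedBytes;   /* contiguous memory actually held: size_t */
    uint64_t contentSize;   /* decompressed size: may exceed 4GB on 32-bit builds */
} SampleSetInfo_example;

static size_t totalSampleBytes_example(const size_t* sampleSizes, int nbSamples)
{
    size_t total = 0;
    int i;                  /* plain int loop variable: overflow is not a concern here */
    for (i = 0; i < nbSamples; ++i)
        total += sampleSizes[i];
    return total;
}
```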

Commits added:
  • …aded number of samples.
  • Changed unit conversion not to use bit-shifts.
@stanjo74 (Contributor Author)

> Tested 32KB chunking with 40K files of 112KB and got this output. The number of samples reported along the chain is suspicious - will track.

Fixed the bug with 9eb56a3. The estimated number of samples was passed rather than the loaded number of samples.

@stanjo74 stanjo74 marked this pull request as ready for review September 30, 2021 00:26

Merging this pull request may close issue #2745: zstd train-cover chokes if sample size is "too large".