
Conversation

@lhoestq (Member) commented May 27, 2021

As mentioned in #2399, the env var should be prefixed with HF_.

@albertvillanova (Member) commented May 27, 2021

I thought the renaming was suggested only for the env var, not for the config variable... Whatever you think is better! ;)

@lhoestq (Member, Author) commented May 27, 2021

I think it's better if they match, so that users immediately understand that they're connected.

@lhoestq lhoestq merged commit aba604e into master May 27, 2021
@lhoestq lhoestq deleted the add-HF-prefix-to-env-var branch May 27, 2021 09:33
@stas00 (Contributor) commented May 27, 2021

Well, if you're not concerned about back-compat here, perhaps it could be renamed and shortened too ;)

I'd suggest one of:

  • HF_DATASETS_IN_MEMORY_MAX_SIZE
  • HF_DATASETS_MAX_IN_MEMORY_SIZE

The intention is to:

  1. make it consistent with all the other datasets env vars, which all start with HF_DATASETS_, e.g.:
     HF_DATASETS_CACHE
     HF_DATASETS_OFFLINE
  2. allow recoding in the future to support 1M, 4K, 1T and not just bytes - bytes is not a great choice for this type of variable, since it will be at least X MB for most reasonable uses (see the sketch after this list).
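
For illustration, parsing such suffixes could look roughly like this (a minimal sketch; the `parse_size` helper and the exact set of suffixes are assumptions, not existing datasets code):

```python
import os
import re

# Hypothetical helper, not part of the datasets library: parse "250M", "4K",
# "1T" or a plain number of bytes into an integer byte count.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(value: str) -> int:
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"Unsupported size value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]

# Example: read the proposed env var, defaulting to plain bytes behaviour.
max_size = parse_size(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", "0"))
```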

And I agree with @albertvillanova that the config variable name shouldn't have the HF prefix - it's preaching to the choir - the user already knows it's a local variable.

The only reason we prefix env vars is that they are used outside of the software.

But I do see your good point about making things consistent too. How about this:

config.IN_MEMORY_MAX_SIZE (or whatever the final env var will be, minus the HF_DATASETS_ prefix).

This is of course just my opinion.

@lhoestq (Member, Author) commented May 28, 2021

Thanks for the comment :)
I like both proposals, and I agree this would be better in order to allow support for 1M, 1T, etc.
Regarding the prefix of the variable in config.py, I don't have a strong opinion. I just added it for consistency with the other variables that default to env variables, like HF_DATASETS_CACHE. However, I agree it would be nice to have shorter names, so I'm not against removing the prefix either. Since the feature is relatively new, I think we can still allow ourselves to rename it.

@stas00 (Contributor) commented May 28, 2021

Awesome,

Let's use then:

  • HF_DATASETS_IN_MEMORY_MAX_SIZE for the env var
  • config.IN_MEMORY_MAX_SIZE for config.

and for now bytes will be documented as the only option, with support for K/M/G added down the road.
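
Roughly, the wiring between the two could look like the following sketch (not the actual implementation; the default value shown is purely illustrative):

```python
# config.py - sketch of how the config variable could default to the env var
# (bytes only for now). The 250 MB default shown here is only illustrative.
import os

IN_MEMORY_MAX_SIZE = int(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", 250 * 2**20))
```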

@albertvillanova, does that sound good to you?

@albertvillanova (Member) commented:

Great!!! 🤗

@stas00 (Contributor) commented Jun 7, 2021

Did I miss a PR with this change?

I want to make sure to add it to the transformers tests to avoid the overhead of rebuilding the datasets.

Thank you!

@albertvillanova (Member) commented:

@stas00 I'm taking this on now that I have finally finished the collaborative training experiment. Sorry for the delay.

@stas00 (Contributor) commented Jun 7, 2021

Yes, of course! Thank you for taking care of it, @albertvillanova

@stas00 (Contributor) commented Jun 7, 2021

Actually, why is this feature on by default?

Users are very unlikely to understand what is going on or to know where to look. Should it at the very least emit a warning that this was done without the user asking for it, and explain how to turn it off?

IMHO, this feature should be enabled explicitly by those who want it and not be on by default. This is an optimization that benefits only select users and is a burden on the rest.

In my line of dev/debug work (multiple short runs that have to be very fast) I now have to remember to disable this feature explicitly on every machine I work on :(
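
For context, a per-environment opt-out could look something like this sketch (it assumes the renamed env var and that a value of 0 means "never keep datasets in memory"; check the final implementation for the exact semantics):

```python
import os

# Assumption: the (renamed) env var with a value of 0 disables keeping
# datasets in memory. It must be set before `datasets` reads its config.
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = "0"

from datasets import load_dataset

# Alternatively, opt out per call: keep_in_memory=False forces the on-disk path.
ds = load_dataset("glue", "mrpc", keep_in_memory=False)
```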

@lhoestq (Member, Author) commented Jun 7, 2021

Having the dataset in memory is nice for speed, but I agree that the lack of caching for in-memory datasets is an issue. By default we have always had caching on.
Here the issue is that in-memory datasets are still not able to use the cache - we should fix this ASAP IMO.

Here is the PR that fixes this: #2329

@stas00 (Contributor) commented Jun 7, 2021

But why do they have to be in-memory datasets in the first place? Why not make the default that all datasets are normal and cached, which seems to be working solidly, and only enable in-memory datasets explicitly if the user chooses to? Then it doesn't matter whether they are cached or not for the majority of users, who will not make this choice.

I mean, the definition of in-memory datasets is very arbitrary - why 250MB and not 5GB? It's very likely that the user will want to set this threshold based on their RAM availability, and while doing that they can enable in-memory datasets. Unless I'm missing something here.
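
For example, a user opting in could derive the threshold from their available RAM along these lines (a sketch only; psutil is not a datasets dependency and the one-quarter fraction is arbitrary):

```python
import os
import psutil  # third-party, used here only for illustration

# Opt in explicitly, using e.g. a quarter of the currently available RAM
# as the in-memory threshold.
available_bytes = psutil.virtual_memory().available
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = str(available_bytes // 4)
```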

The intention here is that things work well in general out of the box, and further performance optimizations are available to those who know what they are doing.

@lhoestq (Member, Author) commented Jun 7, 2021

This is just for speed improvements, especially for data exploration/experiments in notebooks. Ideally it shouldn't have changed anything regarding caching behavior in the first place (i.e. caching should have stayed enabled by default).

The 250MB limit was also chosen so as not to create unexpectedly high memory usage on small laptops.

@stas00 (Contributor) commented Jun 7, 2021

Wouldn't it be more straightforward to create a performance optimization doc and share all these optimizations there? That way the user will be in the know and will be able to get faster speeds if their RAM is large.

It is hard for me to tell the average size of a dataset an average user will have, but my gut feeling is that many NLP datasets are larger than 250MB. Please correct me if I'm wrong.

But at the same time, what you're saying is that once #2329 is completed and merged, in-memory datasets will be cached too. So if I wait long enough, the whole issue will go away altogether, correct?
