
Conversation

@lhoestq (Member) commented May 27, 2021

As mentioned in #2399, the env var should be prefixed with HF_.

@albertvillanova (Member) commented May 27, 2021

I thought the renaming was suggested only for the env var, not for the config variable... Whatever you think is better! ;)

@lhoestq (Member, Author) commented May 27, 2021

I think it's better if they match, so that users immediately understand that they're connected.

@lhoestq lhoestq merged commit aba604e into master May 27, 2021
@lhoestq lhoestq deleted the add-HF-prefix-to-env-var branch May 27, 2021 09:33
@stas00 (Contributor) commented May 27, 2021

Well, if you're not concerned about back-compat here, perhaps it could be renamed and shortened too ;)

I'd suggest one of:

  • HF_DATASETS_IN_MEMORY_MAX_SIZE
  • HF_DATASETS_MAX_IN_MEMORY_SIZE

The intention is to:

  1. make it consistent with all the other datasets env vars, which all start with HF_DATASETS_, e.g.:
     HF_DATASETS_CACHE
     HF_DATASETS_OFFLINE
  2. allow recoding in the future to support 1M, 4K, 1T and not just bytes - bytes is not a great choice for this type of variable, since it will be at least X MB for most reasonable uses (see the sketch after this list).
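
For illustration, parsing such suffixes could look roughly like this (a minimal sketch; the `parse_size` helper and the exact set of suffixes are assumptions, not existing datasets code):

```python
import os
import re

# Hypothetical helper, not part of the datasets library: parse "250M", "4K",
# "1T" or a plain number of bytes into an integer byte count.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def parse_size(value: str) -> int:
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)\s*", value.upper())
    if match is None:
        raise ValueError(f"Unsupported size value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]

# Example: read the proposed env var, defaulting to plain bytes behaviour.
max_size = parse_size(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", "0"))
```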

And I agree with @albertvillanova that the config variable name shouldn't have the HF prefix - it's preaching to the choir - the user already knows it's a local variable.

The only reason we prefix env vars is that they are used outside of the software.

But I do see your good point about making things consistent too. How about this:

config.IN_MEMORY_MAX_SIZE (or whatever the final env var will be, minus the HF_DATASETS_ prefix).

This is of course just my opinion.

@lhoestq (Member, Author) commented May 28, 2021

Thanks for the comment :)
I like both proposals, and I agree this would be better in order to allow support for 1M, 1T, etc.
Regarding the prefix of the variable in config.py, I don't have a strong opinion. I just added it for consistency with the other variables that default to env variables, like HF_DATASETS_CACHE. However, I agree it would be nice to have shorter names, so I'm not against removing the prefix either. Since the feature is relatively new, I think we can still allow ourselves to rename it.

@stas00 (Contributor) commented May 28, 2021

Awesome,

Let's use then:

  • HF_DATASETS_IN_MEMORY_MAX_SIZE for the env var
  • config.IN_MEMORY_MAX_SIZE for config.

and for now bytes will be documented as the only option, with support for K/M/G added down the road.
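
Roughly, the wiring between the two could look like the following sketch (not the actual implementation; the default value shown is purely illustrative):

```python
# config.py - sketch of how the config variable could default to the env var
# (bytes only for now). The 250 MB default shown here is only illustrative.
import os

IN_MEMORY_MAX_SIZE = int(os.environ.get("HF_DATASETS_IN_MEMORY_MAX_SIZE", 250 * 2**20))
```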

@albertvillanova, does that sound good to you?

@albertvillanova (Member) commented:

Great!!! 🤗

@stas00 (Contributor) commented Jun 7, 2021

Did I miss a PR with this change?

I want to make sure to add it to the transformers tests to avoid the overhead of rebuilding the datasets.

Thank you!

@albertvillanova (Member) commented:

@stas00 I'm taking this on now that I have finally finished the collaborative training experiment. Sorry for the delay.

@stas00 (Contributor) commented Jun 7, 2021

Yes, of course! Thank you for taking care of it, @albertvillanova

@stas00 (Contributor) commented Jun 7, 2021

Actually, why is this feature on by default?

Users are very unlikely to understand what is going on or to know where to look. Should it at the very least emit a warning that this was done without the user asking for it, and explain how to turn it off?

IMHO, this feature should be enabled explicitly by those who want it and not be on by default. This is an optimization that benefits only select users and is a burden on the rest.

In my line of dev/debug work (multiple short runs that have to be very fast) I now have to remember to disable this feature explicitly on every machine I work on :(
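
For context, a per-environment opt-out could look something like this sketch (it assumes the renamed env var and that a value of 0 means "never keep datasets in memory"; check the final implementation for the exact semantics):

```python
import os

# Assumption: the (renamed) env var with a value of 0 disables keeping
# datasets in memory. It must be set before `datasets` reads its config.
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = "0"

from datasets import load_dataset

# Alternatively, opt out per call: keep_in_memory=False forces the on-disk path.
ds = load_dataset("glue", "mrpc", keep_in_memory=False)
```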

@lhoestq (Member, Author) commented Jun 7, 2021

Having the dataset in memory is nice for speed, but I agree that the lack of caching for in-memory datasets is an issue. By default we have always had caching on.
Here the issue is that in-memory datasets are still not able to use the cache - we should fix this ASAP IMO.

Here is the PR that fixes this: #2329

@stas00 (Contributor) commented Jun 7, 2021

But why do they have to be in-memory datasets in the first place? Why not make the default that all datasets are normal and cached, which seems to be working solidly, and only enable in-memory datasets explicitly if the user chooses to? Then it doesn't matter whether they are cached or not for the majority of users, who will not make this choice.

I mean, the definition of in-memory datasets is very arbitrary - why 250MB and not 5GB? It's very likely that the user will want to set this threshold based on their RAM availability, and while doing that they can enable in-memory datasets. Unless I'm missing something here.
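
For example, a user opting in could derive the threshold from their available RAM along these lines (a sketch only; psutil is not a datasets dependency and the one-quarter fraction is arbitrary):

```python
import os
import psutil  # third-party, used here only for illustration

# Opt in explicitly, using e.g. a quarter of the currently available RAM
# as the in-memory threshold.
available_bytes = psutil.virtual_memory().available
os.environ["HF_DATASETS_IN_MEMORY_MAX_SIZE"] = str(available_bytes // 4)
```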

The intention here is that things work well in general out of the box, and further performance optimizations are available to those who know what they are doing.

@lhoestq (Member, Author) commented Jun 7, 2021

This is just for speed improvements, especially for data exploration/experiments in notebooks. Ideally it shouldn't have changed anything regarding caching behavior in the first place (i.e. caching should have stayed enabled by default).

The 250MB limit was also chosen so as not to create unexpectedly high memory usage on small laptops.

@stas00 (Contributor) commented Jun 7, 2021

Wouldn't it be more straightforward to create a performance optimization doc and share all these optimizations there? That way the user will be in the know and will be able to get faster speeds if their RAM is large.

It is hard for me to tell the average size of a dataset an average user will have, but my gut feeling is that many NLP datasets are larger than 250MB. Please correct me if I'm wrong.

But at the same time, what you're saying is that once #2329 is completed and merged, in-memory datasets will be cached too. So if I wait long enough, the whole issue will go away altogether, correct?
