Add HF_ prefix to env var MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2409
Conversation
|
I thought the renaming was suggested only for the env var, and not for the config variable... Whatever you think is better! ;) |
|
I think it's better if they match, so that users understand immediately that they are directly connected |
|
Well, if you're not concerned about back-compat here, perhaps it could be renamed and shortened too ;) I'd suggest one of:
the intention is to:
And I agree with @albertvillanova that the config variable name shouldn't have the HF prefix - it's preaching to the choir - the user already knows it's a local variable. The only reason we prefix env vars is that they are used outside of the software. But I do see a good point in your trying to make things consistent too. How about this:
This is of course just my opinion. |
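A minimal sketch of the pattern being discussed, for illustration only (the config constant name below is a placeholder, since the final naming was still being decided at this point in the thread; the env var name follows this PR's title): the HF_ prefix lives only on the shell-facing environment variable, which the library reads once into an unprefixed config constant.

```python
import os

# Placeholder constant name; only the env var carries the HF_ prefix,
# since it lives in the user's shell rather than inside the library.
IN_MEMORY_MAX_SIZE = int(
    os.environ.get("HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES", 250 * 2**20)  # 250 MiB default
)
```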
|
Thanks for the comment :) |
|
Awesome! Let's then use:
and for now bytes will be documented as the only option, with support for K/M/G added down the road. @albertvillanova, does that sound good to you? |
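Not part of this PR, but a hedged sketch of what the future K/M/G suffix support mentioned above could look like (`parse_size` is a hypothetical helper name):

```python
def parse_size(value: str) -> int:
    """Parse a size such as '1024', '500K', '250M', or '2G' into bytes."""
    units = {"K": 2**10, "M": 2**20, "G": 2**30}
    value = value.strip().upper()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)  # plain byte count, the only documented option for now


assert parse_size("250M") == 250 * 2**20
```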
|
Great!!! 🤗 |
|
Did I miss a PR with this change? I want to make sure to add it to the transformers tests to avoid the overhead of rebuilding the datasets. Thank you! |
|
@stas00 I'm taking on this now that I have finally finished the collaborative training experiment. Sorry for the delay. |
|
Yes, of course! Thank you for taking care of it, @albertvillanova |
|
Actually, why is this feature on by default? Users are very unlikely to understand what is going on or to know where to look. Should it at the very least emit a warning that this was done without asking the user, and explain how to turn it off? IMHO, this feature should be enabled explicitly by those who want it and not be on by default. This is an optimization that benefits only select users and is a burden on the rest. In my line of dev/debug work (multiple short runs that have to be very fast) now I have to remember to disable this feature explicitly on every machine I work on :( |
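For anyone reading along, a sketch of opting out explicitly, using the variable named in this PR's title (the rename discussed above may change it, and 0 meaning "never load datasets in memory" is an assumption to verify against the released docs):

```python
import datasets.config

# 0 assumed to disable in-memory loading entirely; set this before
# loading any dataset.
datasets.config.MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES = 0
```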
|
Having the dataset in memory is nice for speed, but I agree that the lack of caching for in-memory datasets is an issue. By default we have always had caching on. Here is the PR that fixes this: #2329 |
|
But why do they have to be in-memory datasets in the first place? Why not just make the default that all datasets are normal and cached, which seems to be working solidly, and only enable in-memory datasets explicitly if the user chooses to? Then it doesn't matter whether they're cached or not for the majority of users, who will not make this choice. I mean, the definition of in-memory datasets is very arbitrary - why 250MB and not 5GB? It's very likely that the user will want to set this threshold based on their RAM availability, and while doing that they can enable in-memory datasets. Unless I'm missing something here. The intention is that things work well in general out of the box, and further performance optimizations are available to those who know what they are doing. |
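A hypothetical recipe for the "size the threshold to your RAM" idea above - not library behavior. `psutil` is an outside assumption, and the env var (named per this PR's title) must be set before `datasets` is imported:

```python
import os

import psutil  # assumption: not a datasets dependency

# Allow in-memory datasets up to, say, 10% of currently available RAM.
os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = str(
    psutil.virtual_memory().available // 10
)

import datasets  # must come after the env var is set to pick it up
```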
|
This is just for speed improvements, especially for data exploration/experiments in notebooks. Ideally it shouldn't have changed anything regarding caching behavior in the first place (i.e. caching should have stayed enabled by default). The 250MB limit was also chosen to avoid unexpectedly high memory usage on small laptops. |
|
Won't it be more straightforward to create a performance-optimization doc and share all these optimizations there? That way the user will be in the know and will be able to get faster speeds if their RAM is large. It is hard for me to tell the average size of a dataset an average user will have, but my gut feeling is that many NLP datasets are larger than 250MB. Please correct me if I'm wrong. But at the same time, what you're saying is that once #2329 is completed and merged, in-memory datasets will be cached too. So if I wait long enough, the whole issue will go away altogether, correct? |
As mentioned in #2399, the env var should be prefixed with HF_