[GSK-1597] Push typo perturbation stochasticity #1362
Conversation
…bation-stochasticity
Push feature
---------
Co-authored-by: Mathieu Roques <[email protected]>
Co-authored-by: Andrey Avtomonov <[email protected]>
Co-authored-by: Kevin Messiaen <[email protected]>
Co-authored-by: Rabah Abdul Khalek <[email protected]>
Co-authored-by: Henrique Chaves <[email protected]>
    f"{', '.join(map(lambda x: repr(x), ds_slice_copy.df.values))}".encode("utf-8")
)
if _hash not in hashed_typo_transformations.keys():
    hashed_typo_transformations.update({_hash: ds_slice_copy.transform(t)})
this cache will never be emptied as long as the worker is running, can you make it a fixed-sized LRU cache?
why only caching typo transformations and not the rest?
this cache will never be emptied as long as the worker is running, can you make it a fixed-sized LRU cache?
I think it's better to keep the cache as long as possible. It would be confusing to use, for instance, @lru_cache(maxsize=16), as the user might lose a push notification about a typo transformation at some point without knowing why. Is there a problem if the cache is never emptied (for that specific case)?
why only caching typo transformations and not the rest?
Among these, typo transformations are the only ones that are random, so they are the only ones that need to be cached.
TextGenderTransformation,
TextLowercase,
TextPunctuationRemovalTransformation,
TextTitleCase,
TextTypoTransformation,
TextUppercase,
Is there a problem if the caches is never emptied (for that specific case)?
It'll lead to unpredictable RAM consumption: the longer a worker runs, the more RAM it consumes.
user might lose a push notification about typo transformation at some point
It shouldn't happen. You'd just call hashed_typo_transformations[_hash] = ds_slice_copy.transform(t) more often than you would with an unlimited cache size. Every time you get a cache miss with a limited cache, you just recompute the transformation and refresh the cache.
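The fixed-size LRU cache suggested above could be sketched like this (a minimal illustration, not the PR's code; the class name is hypothetical, and functools.lru_cache would also work if the lookup were wrapped in a function keyed on the hash):

```python
from collections import OrderedDict


class LRUCache:
    """Minimal fixed-size LRU cache: evicts the least recently used entry
    once maxsize is exceeded, so memory stays bounded however long the
    worker runs."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # cache miss: caller recomputes and calls put()
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```

On a miss the worker would recompute the transformation and re-insert it, exactly as described above: nothing is lost permanently, only recomputed.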
Co-authored-by: Andrey Avtomonov <[email protected]>
Kudos, SonarCloud Quality Gate passed!
kwargs = {}
if _is_typo_transformation:
    hashed_seed = hash(f"{', '.join(map(lambda x: repr(x), ds_slice_copy.df.values))}".encode("utf-8"))
    positive_hashed_seed = hashed_seed % ((sys.maxsize + 1) * 2)
@rabah-khalek could you add a comment explaining why we're doing this? In a week we won't remember it by heart.
done
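For context on the modulo in the snippet above: Python's hash() can return a negative number, and the modulo maps it into a non-negative range (on 64-bit builds, sys.maxsize + 1 == 2**63, so the modulus is 2**64). A standalone sketch (the function name is hypothetical):

```python
import sys


def positive_seed(text: str) -> int:
    """Map Python's (possibly negative) hash of the serialized dataframe
    values into [0, 2**64) so it can be used as a non-negative seed."""
    h = hash(text.encode("utf-8"))
    # On 64-bit builds, sys.maxsize + 1 == 2**63, so the modulus is 2**64:
    # this folds negative hashes into the non-negative range.
    return h % ((sys.maxsize + 1) * 2)
```

One caveat worth noting: hash() of str/bytes is salted per process (unless PYTHONHASHSEED is set), so the resulting seed is stable within a running worker but not across worker restarts.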
I had to hash the typo transformation input in order to retrieve the same random perturbation when the user goes back and forth in the debugger.
The problem is described by two screenshots on linear.