Skip to content

Conversation

@rabah-khalek
Copy link
Contributor

@rabah-khalek rabah-khalek commented Sep 1, 2023

I had to hash the typo transformation in order to be able to retrieve the same random perturbation when the user go back and forth in the debugger.

The problem is described by two screenshots on linear.

@linear
Copy link

linear bot commented Sep 1, 2023

@rabah-khalek rabah-khalek changed the title hashed TextTypoTransformation [GSK-1597] Test typo perturbation stochasticity Sep 1, 2023
@rabah-khalek rabah-khalek changed the title [GSK-1597] Test typo perturbation stochasticity [GSK-1597] Push typo perturbation stochasticity Sep 1, 2023
Push feature
---------

Co-authored-by: Mathieu Roques <[email protected]>
Co-authored-by: Andrey Avtomonov <[email protected]>
Co-authored-by: Kevin Messiaen <[email protected]>
Co-authored-by: Rabah Abdul Khalek <[email protected]>
Co-authored-by: Henrique Chaves <[email protected]>
Co-authored-by: Henrique Chaves <[email protected]>
@rabah-khalek rabah-khalek added Python Pull requests that update Python code push-feature labels Sep 1, 2023
@rabah-khalek rabah-khalek changed the base branch from feature/merging-push-feature to main September 1, 2023 14:58
@rabah-khalek rabah-khalek changed the base branch from main to feature/merging-push-feature September 1, 2023 15:32
f"{', '.join(map(lambda x: repr(x), ds_slice_copy.df.values))}".encode("utf-8")
)
if _hash not in hashed_typo_transformations.keys():
hashed_typo_transformations.update({_hash: ds_slice_copy.transform(t)})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cache will never be emptied as long as the worker is running, can you make it a fixed-sized LRU cache?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only caching typo transformations and not the rest?

Copy link
Contributor Author

@rabah-khalek rabah-khalek Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this cache will never be emptied as long as the worker is running, can you make it a fixed-sized LRU cache?

I think it's better to keep the cache as long as possible, it'll be confusing to use for instance @lru_cache(maxsize=16), as user might loose a push notification about typo transformation at some point without knowing why. Is there a problem if the caches is never emptied (for that specific case)?

why only caching typo transformations and not the rest?

Among these, typo transformations are the only ones that are random, thus need to be cached

    TextGenderTransformation,
    TextLowercase,
    TextPunctuationRemovalTransformation,
    TextTitleCase,
    TextTypoTransformation,
    TextUppercase,

Copy link
Contributor

@andreybavt andreybavt Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a problem if the caches is never emptied (for that specific case)?

it'll lead to an unpredictable RAM consumption, basically the longer you have a worker running the more RAM it'll consume.

user might loose a push notification about typo transformation at some point

It shouldn't happen, you'd just call hashed_typo_transformations[_hash] = ds_slice_copy.transform(t) more often as you could in case of unlimited cache size. Every time you'll have a cache miss in case of limited cache you'll just recompute a transformation and refresh the cache

@sonarqubecloud
Copy link

sonarqubecloud bot commented Sep 5, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

100.0% 100.0% Coverage
0.0% 0.0% Duplication

kwargs = {}
if _is_typo_transformation:
hashed_seed = hash(f"{', '.join(map(lambda x: repr(x), ds_slice_copy.df.values))}".encode("utf-8"))
positive_hashed_seed = hashed_seed % ((sys.maxsize + 1) * 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rabah-khalek could you add a commend why we're doing it? In a week we won't remember by heart

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@andreybavt andreybavt merged commit b4b1967 into feature/merging-push-feature Sep 6, 2023
rabah-khalek added a commit that referenced this pull request Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

push-feature Python Pull requests that update Python code

Development

Successfully merging this pull request may close these issues.

4 participants