[GSK-2346] Avoid copying whole dataset when doing slicing #1673
Conversation
Fixes the issue. However, we could get rid of the copy flag altogether, since making a copy changes nothing in the behaviour of the code except loading duplicated data into RAM.
PS: it is still updated in the case of an empty pipeline (but nothing breaking):
- Line 110: resets the pipeline (not a big deal)
- Line 114: recomputes the column_meta -> does nothing useful but takes time
```python
def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
    if copy:
        ds = dataset.copy()
    else:
        ds = dataset
```
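For context, a minimal sketch of how the new flag is presumably meant to be used. The `Dataset`/`DataProcessor` classes below are simplified stand-ins, not the actual Giskard internals, and `apply_only_last`/`get_mask` are omitted:

```python
# Minimal sketch, not the actual Giskard implementation: read-only operations
# like slicing can opt out of the defensive copy.
import pandas as pd


class Dataset:
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def copy(self) -> "Dataset":
        return Dataset(self.df.copy())


class DataProcessor:
    def __init__(self, steps):
        self.steps = steps  # callables taking and returning a Dataset

    def apply(self, dataset: "Dataset", copy: bool = True) -> "Dataset":
        # Slicing only reads the data, so callers can pass copy=False and
        # avoid keeping a duplicate of the whole dataframe in memory.
        ds = dataset.copy() if copy else dataset
        for step in self.steps:
            ds = step(ds)
        return ds


# Usage: a pure slicing step can skip the copy safely.
big = Dataset(pd.DataFrame({"x": range(10)}))
slice_step = lambda d: Dataset(d.df[d.df["x"] > 5])
sliced = DataProcessor([slice_step]).apply(big, copy=False)
print(len(sliced.df))  # 4
```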
After reading the code, it seems that copying the dataset was unnecessary, since the ds instance is only queried and never updated.
Suggested change:
```diff
-def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
-    if copy:
-        ds = dataset.copy()
-    else:
-        ds = dataset
+def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False):
+    ds = dataset
```
Not true, making a copy may still be necessary.
For example, in TextTransformation the code modifies the dataframe directly, so the copy is still needed.
Removing this copy would mean going through every slicing function and transformation, and I don't want this PR to take too long.
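For illustration, a minimal sketch of the problem with in-place transformations (this is a hypothetical uppercase transformation, not Giskard's actual TextTransformation):

```python
# Why skipping the copy is unsafe when a transformation mutates its input.
import pandas as pd


def uppercase_transformation(frame: pd.DataFrame) -> pd.DataFrame:
    frame["text"] = frame["text"].str.upper()  # in-place column assignment
    return frame


# Without a copy, the caller's original dataframe is mutated as a side effect:
df = pd.DataFrame({"text": ["hello", "world"]})
uppercase_transformation(df)
print(df["text"].tolist())  # ['HELLO', 'WORLD'] -- original data changed

# With a copy, the original is preserved:
df2 = pd.DataFrame({"text": ["hello", "world"]})
transformed = uppercase_transformation(df2.copy())
print(df2["text"].tolist())  # ['hello', 'world'] -- untouched
```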
Makes sense!
Ah my bad, you're right, I overlooked this part :/
At the moment TextTransformation modifies the dataframe directly, so I don't want to change this behaviour either (except when slicing).
mattbit left a comment:
At some point we'll need to improve this, but the current fix is ok for me.
Kudos, SonarCloud Quality Gate passed!








Description
DataProcessor was making a copy every time, even in the case of a slice.
The issue is that the slice was actually referencing the whole dataset, so the whole dataset was kept in memory.
Now, we only reference the existing one.
Tested on another large dataset (76020, 371): it was crashing before and works now.
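A rough illustration of the memory effect (the dataframe shape matches the test above; everything else is a simplified sketch, not the actual DataProcessor code):

```python
# Copying a large dataframe just to take a slice keeps two full dataframes
# alive at the slicing step, roughly doubling peak memory; slicing the
# existing dataframe directly only allocates the selected rows.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(76020, 371))  # ~225 MB of float64 data
print(f"full dataset: {df.memory_usage(deep=True).sum() / 1e6:.0f} MB")

mask = df[0] > 0.99  # keep roughly 1% of the rows

# Old behaviour (simplified): copy first, then slice.
sliced_from_copy = df.copy()[mask]

# New behaviour (simplified): slice the existing dataframe directly.
sliced = df[mask]
print(f"slice only: {sliced.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```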
Related Issue
Type of Change
Checklist
- CODE_OF_CONDUCT.md document
- CONTRIBUTING.md guide
- make codestyle