[GSK-2346] Avoid copying whole dataset when doing slicing #1673

Hartorn · 2023-12-07T16:06:21Z

Description

DataProcessor was making a copy every time, even when only slicing.
The problem is that the resulting slice was actually referencing the whole dataset, so the whole dataset was kept in memory.
Now, when slicing, we only reference the existing dataset.
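
As an analogy only (this is not the giskard code), the same effect can be reproduced with the standard library: a slice of a `memoryview` keeps its whole underlying buffer alive, so slicing a fresh copy retains a second full-size buffer. The function names below are hypothetical.

```python
def slice_after_copy(data: bytearray, n: int) -> memoryview:
    """Copy the whole buffer, then slice the copy (the old behaviour, by analogy)."""
    copied = bytearray(data)       # full duplicate of the data
    return memoryview(copied)[:n]  # the small slice keeps `copied` alive


def slice_without_copy(data: bytearray, n: int) -> memoryview:
    """Slice the existing buffer directly (the new behaviour, by analogy)."""
    return memoryview(data)[:n]    # references only the existing buffer


data = bytearray(1_000_000)
old_style = slice_after_copy(data, 10)
new_style = slice_without_copy(data, 10)

# The 10-byte slice still pins a separate 1 MB buffer in the old style.
print(len(old_style), len(old_style.obj), old_style.obj is data)
# In the new style the slice points at the buffer that already exists.
print(len(new_style), len(new_style.obj), new_style.obj is data)
```

Each slice produced the old way pins its own full copy, which is why memory grows with the number of slices rather than with their size.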

Tested on another large dataset of shape (76020, 371): it was crashing before and works now.
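
For a rough sense of scale, here is a back-of-the-envelope estimate of what one full copy of that dataset costs. It assumes 8 bytes per cell (for example float64), which is an assumption: the actual column types are not stated in this PR, and text columns would cost more.

```python
ROWS, COLS = 76020, 371
BYTES_PER_CELL = 8  # assumption: every cell stored as a 64-bit value

cells = ROWS * COLS
bytes_per_copy = cells * BYTES_PER_CELL
mib_per_copy = bytes_per_copy / 2**20

print(f"{cells:,} cells, about {mib_per_copy:.0f} MiB per full copy")


def retained_mib(n_slices: int) -> float:
    """Memory pinned if every slice keeps its own full copy alive."""
    return n_slices * mib_per_copy
```

Under that assumption, a scan that builds many slices multiplies this per-copy figure by the number of slices held at once.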

Related Issue

Type of Change

📚 Examples / docs / tutorials / dependencies update
🔧 Bug fix (non-breaking change which fixes an issue)
🥂 Improvement (non-breaking change which improves an existing feature)
🚀 New feature (non-breaking change which adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to change)
🔐 Security fix

Checklist

I've read the CODE_OF_CONDUCT.md document.
I've read the CONTRIBUTING.md guide.
I've updated the code style using make codestyle.
I've written tests for all new methods and classes that I created.
I've written the docstring in Google format for all the methods and classes that I used.

linear · 2023-12-07T16:06:25Z

GSK-2346 Tabular scan execution on wide dataset kills customer's VM

kevinmessiaen

Fixes the issue. However, we can get rid of the copy flag altogether, since making a copy changes nothing in the behaviour of the code except loading duplicated data into RAM.

PS: the dataset is updated when the pipeline is empty (but nothing breaking):

Line 110: Resets the pipeline (not a big deal)
Line 114: Recomputes the column_meta, which does nothing but takes time

kevinmessiaen · 2023-12-08T07:45:43Z

giskard/datasets/base/__init__.py

+    def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
+        if copy:
+            ds = dataset.copy()
+        else:
+            ds = dataset


After reading the code, it seems that doing a copy of dataset was unnecessary since ds instance is only queried and never updated.

Suggested change

-    def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
-        if copy:
-            ds = dataset.copy()
-        else:
-            ds = dataset
+    def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False):
+        ds = dataset

Not true, making a copy may still be necessary.
For example, TextTransformation modifies the dataframe directly, so the copy is still needed there.

To remove this copy entirely, we would need to go through every slicing function and transformation, and I don't want this PR to take too long.
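
A minimal sketch of the distinction, using plain Python lists of dicts rather than the real `Dataset` and `DataProcessor` classes. All names below are hypothetical and only illustrate why an in-place transformation needs the copy while a read-only slice does not.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


def apply_step(rows: Rows, step: Callable[[Rows], Rows], copy: bool = True) -> Rows:
    """Run one step, optionally on a copy of the input rows."""
    target = [dict(r) for r in rows] if copy else rows
    return step(target)


def uppercase_in_place(rows: Rows) -> Rows:
    """Transformation that mutates its input, so callers need copy=True."""
    for r in rows:
        r["text"] = r["text"].upper()
    return rows


def keep_short(rows: Rows) -> Rows:
    """Slice that only reads its input, so copy=False is safe."""
    return [r for r in rows if len(r["text"]) <= 5]


original = [{"text": "hello"}, {"text": "a longer sentence"}]

transformed = apply_step(original, uppercase_in_place, copy=True)
sliced = apply_step(original, keep_short, copy=False)

print(original[0]["text"], transformed[0]["text"], sliced[0] is original[0])
```

With `copy=False` the slice reuses the existing row objects, while the mutating step still runs on a copy and leaves `original` untouched.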

Makes sense!

Ah my bad, you're right, I overlooked this part :/

mattbit · 2023-12-08T08:01:23Z

@Hartorn I would be careful, because I remember having to introduce this copy operation because of some bug (b092277).
I can't find the PR associated with that commit right now, but I would check.

Hartorn · 2023-12-08T08:16:44Z

At the moment, TextTransformation modifies the dataframe directly, so I don't want to change this behaviour either (except when slicing).

mattbit

At some point we'll need to improve this, but the current fix is ok for me.

sonarqubecloud · 2023-12-08T14:26:57Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

100.0% Coverage
0.0% Duplication

Avoid copying whole dataset when doing slicing

bcd51c2

Hartorn added the bug Something isn't working label Dec 7, 2023

Hartorn requested review from andreybavt and mattbit December 7, 2023 16:06

Hartorn self-assigned this Dec 7, 2023

Hartorn marked this pull request as ready for review December 7, 2023 16:12

Hartorn requested a review from kevinmessiaen December 7, 2023 16:31

kevinmessiaen self-assigned this Dec 8, 2023

kevinmessiaen requested changes Dec 8, 2023

kevinmessiaen removed their assignment Dec 8, 2023

mattbit approved these changes Dec 8, 2023

kevinmessiaen approved these changes Dec 8, 2023

kevinmessiaen and others added 2 commits December 8, 2023 11:17

Merge branch 'main' into feature/gsk-2346-tabular-scan-execution-on-w…

29c498e

…ide-dataset-kills-customers-vm

Merge branch 'main' into feature/gsk-2346-tabular-scan-execution-on-w…

c4564cb

…ide-dataset-kills-customers-vm

Hartorn enabled auto-merge (squash) December 8, 2023 13:52

Merge branch 'main' into feature/gsk-2346-tabular-scan-execution-on-w…

7b486fb

…ide-dataset-kills-customers-vm

Hartorn merged commit 348233c into main Dec 8, 2023

Hartorn deleted the feature/gsk-2346-tabular-scan-execution-on-wide-dataset-kills-customers-vm branch December 8, 2023 14:27
