Skip to content

Conversation

@Hartorn
Copy link
Member

@Hartorn Hartorn commented Dec 7, 2023

Description

DataProcessor was doing a copy every time, event in the case of slice.
The issue is the slice was actually referencing the whole dataset, so the whole dataset was kept in memory.
Now, we only reference the existing one.

Testing on another large dataset (76020, 371), was crashing before, working now.

Related Issue

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've updated the code style using make codestyle.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.

@Hartorn Hartorn added the bug Something isn't working label Dec 7, 2023
@Hartorn Hartorn requested review from andreybavt and mattbit December 7, 2023 16:06
@linear
Copy link

linear bot commented Dec 7, 2023

@Hartorn Hartorn self-assigned this Dec 7, 2023
@Hartorn Hartorn marked this pull request as ready for review December 7, 2023 16:12
@Hartorn Hartorn requested a review from kevinmessiaen December 7, 2023 16:31
@kevinmessiaen kevinmessiaen self-assigned this Dec 8, 2023
Copy link
Member

@kevinmessiaen kevinmessiaen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixes the issue, However we can get ride of the copy flag altogether since doing a copy change nothing in the behaviour of the code expect loading duplicated data in the ram.

PS: it is updated in case of empty pipeline (but nothing breaking):

  • Line 110: Reset the pipeline (not a big deal)
  • Line 114: Recompute the column_meta -> Do nothing but takes time

Comment on lines +82 to +86
def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
if copy:
ds = dataset.copy()
else:
ds = dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the code, it seems that doing a copy of dataset was unnecessary since ds instance is only queried and never updated.

Suggested change
def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False, copy: bool = True):
if copy:
ds = dataset.copy()
else:
ds = dataset
def apply(self, dataset: "Dataset", apply_only_last=False, get_mask: bool = False):
ds = dataset

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not true, making a copy may still be necessary.
For example, in the TextTransformation, code is directly modifying the dataframe, and so copy is still needed.

To remove this copy, we need to go through every slicing function and transformation, and I don't want this pr to take too long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, you're right, I overlooked this part :/

@kevinmessiaen kevinmessiaen removed their assignment Dec 8, 2023
@mattbit
Copy link
Member

mattbit commented Dec 8, 2023

@Hartorn I would be careful because I remember having to introduce this copy operation because of some bug b092277.
I can't find the PR associated to that commit right now but I would check

@Hartorn
Copy link
Member Author

Hartorn commented Dec 8, 2023

@Hartorn I would be careful because I remember having to introduce this copy operation because of some bug b092277. I can't find the PR associated to that commit right now but I would check

atm TextTransformation modify dataframe directly, so I don't want to change this beahaviour either (except when slicing)

Copy link
Member

@mattbit mattbit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we'll need to improve this, but the current fix is ok for me.

@Hartorn Hartorn enabled auto-merge (squash) December 8, 2023 13:52
@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 8, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

100.0% 100.0% Coverage
0.0% 0.0% Duplication

@Hartorn Hartorn merged commit 348233c into main Dec 8, 2023
@Hartorn Hartorn deleted the feature/gsk-2346-tabular-scan-execution-on-wide-dataset-kills-customers-vm branch December 8, 2023 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

4 participants