
Conversation

@Kranium2002 (Contributor) commented Dec 1, 2023

Description

This pull request introduces a comprehensive suite of data quality tests to Giskard. While Giskard already excels at model quality testing, it also needs the ability to evaluate the quality of the training data itself.

List of tests to be added:

  • Data Completeness Test
  • Data Uniqueness Test
  • Data Range Test
  • Validity Test
  • Data Correlation Test
  • Data Anomaly Detection Test
  • Data Integrity Test
  • Label Consistency Test
  • Class Imbalance Test
  • Feature Importance Test
  • Label Noise Detection Test
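As a rough illustration of the direction (the function name and return shape below are placeholders, not the Giskard API), a Data Completeness Test can reduce to checking the share of non-null values in a column:

```python
import pandas as pd

def data_completeness(df: pd.DataFrame, column: str, threshold: float = 0.95):
    """Pass when the fraction of non-null values in `column` meets `threshold`."""
    completeness = df[column].notna().mean()
    return {"passed": bool(completeness >= threshold), "metric": completeness}

df = pd.DataFrame({'age': [20, 25, None, 40]})
print(data_completeness(df, 'age', threshold=0.9))  # metric 0.75 → fails
```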

Related Issue

This PR is related to issue #1601

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've updated the code style using make codestyle.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.

@Kranium2002 (Contributor, Author)

I think some of the tests can be removed from this list, like the Data Consistency Test, since Giskard already checks this when creating a Giskard dataset. What do you think @kevinmessiaen?

@kevinmessiaen kevinmessiaen self-requested a review December 1, 2023 12:42
@kevinmessiaen kevinmessiaen self-assigned this Dec 1, 2023
@kevinmessiaen (Member) left a comment

Thanks for the tests and your contribution @Kranium2002 !

I left a few comments, let me know if you need some help.

@kevinmessiaen (Member)

I think some of the tests can be removed from this list like Data Consistency Test as Giskard already tests this when creating a Giskard dataset. What do you think @kevinmessiaen ?

You're right about Data Consistency.

@luca-martial Do you have something specific in mind about it or can we remove this type of test?

@luca-martial (Contributor)

@kevinmessiaen I removed Data Consistency!

@Kranium2002 (Contributor, Author) commented Dec 3, 2023

For anomaly detection I was thinking about using DBSCAN or Isolation Forests from scikit-learn. What do you suggest, @kevinmessiaen? Can we use third-party packages for implementing anomaly detection?

I think it's OK to use an external library for complex tests like this one; just make sure to import it inside the test so that Giskard can run without installing this dependency, since it should remain optional.
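That lazy-import pattern might look something like the following sketch (the `require_optional` helper is illustrative, not part of Giskard's API):

```python
import importlib
import importlib.util

def require_optional(name: str):
    """Import an optional test dependency, failing with an actionable message."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"The optional dependency '{name}' is required for this test; "
            f"install it with `pip install {name}`."
        )
    return importlib.import_module(name)

# A module that is available resolves normally:
json = require_optional("json")

# A missing package raises a clear error at test run time,
# not when Giskard itself is imported:
try:
    require_optional("some_missing_package")
except ImportError as exc:
    print(exc)
```

The key point is that the import happens inside the test body, so Giskard can be installed and used without the extra dependency.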

@Kranium2002 (Contributor, Author)

I had some questions regarding some tests:

  • How do we implement test 6? Does Giskard also work like an RDBMS? Is this test really necessary for Giskard?
  • Aren't tests 5 and 10 the same, the only difference being the name of the column we pass to the anomaly detection function?
    @kevinmessiaen

@kevinmessiaen (Member) commented Dec 4, 2023

I had some questions regarding some tests:

  • How do we implement test 6? Does giskard also work like a RDBMS? Is this test really necessary for Giskard?
  • Aren't tests 5 and 10 same, the only difference being the name of the column we pass onto the anomaly detection function?
    @kevinmessiaen

Hello,

Regarding test 6 (ensures relationships between different data tables or datasets are maintained):

For now we do not work with RDBMSs; however, giskard.Dataset is an abstract interface and will be updated to support more than that. The idea is to ensure that all values in one dataset's column are present in a column of another dataset:

@giskard.test()
def ensure_all_exists(dataset: giskard.Dataset, column: str, target_dataset: giskard.Dataset, target_column: str, threshold: float = 0.0):
    # Ensure that all data in "column" of "dataset" are present in "target_column" of "target_dataset"
    source = dataset.df[column]
    referenced = target_dataset.df[target_column]
    not_included = source[~source.isin(referenced)]
    missing_ratio = len(not_included) / len(source)
    return giskard.TestResult(passed=missing_ratio <= threshold, metric=missing_ratio)

Regarding tests 5 and 10:

To be confirmed with @luca-martial, but my understanding is:

Test 5: Identifies outliers or anomalies in the dataset.

Take the following dataset:
dataset = giskard.Dataset(pd.DataFrame({'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60, 120]}))

In this example we have an outlier value of 120 in the age column; it might be a typo, so we want to identify it.

I would give a test signature of def outlier(dataset: giskard.Dataset, column: str)
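As a sketch only (not the PR's implementation), an IQR-based version of that signature could look like this, using Tukey's 1.5 × IQR fences:

```python
import pandas as pd

def outlier(df: pd.DataFrame, column: str) -> pd.Series:
    """Return values outside the 1.5 * IQR fences (Tukey's rule)."""
    s = df[column]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

df = pd.DataFrame({'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60, 120]})
print(outlier(df, 'age').tolist())  # → [120]
```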

Test 10: Label Noise Detection Test

dataset = giskard.Dataset(pd.DataFrame({
   'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60],
   'group': ["<30", "<30", "<30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30"],
}))

Now on this one we need to identify that the group >=30 might be mislabelled with respect to the age column for the value 17.

I would give a test signature of def mislabel(dataset: giskard.Dataset, labelled_column: str, reference_columns: Iterable[str])
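One statistical way to sketch that signature (simplified here to a single reference column, and not the implementation that landed in this PR) is a per-group modified z-score based on the median absolute deviation:

```python
import pandas as pd

def mislabel(df: pd.DataFrame, labelled_column: str, reference_column: str,
             threshold: float = 2.0) -> pd.DataFrame:
    """Flag rows whose reference value is an outlier within its label group,
    using the MAD-based modified z-score."""
    flagged = []
    for _, group in df.groupby(labelled_column):
        vals = group[reference_column]
        med = vals.median()
        mad = (vals - med).abs().median()
        if mad == 0:
            continue  # group has no spread; nothing to flag
        z = 0.6745 * (vals - med).abs() / mad
        flagged.extend(z[z > threshold].index)
    return df.loc[flagged]

df = pd.DataFrame({
    'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60],
    'group': ["<30", "<30", "<30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30"],
})
print(mislabel(df, 'group', 'age')['age'].tolist())  # → [17]
```

On Kevin's example data above, the age 17 in the >=30 group is the only value flagged.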

kevinmessiaen and others added 3 commits December 4, 2023 10:53
data for test fail = {
        'label': ['A', 'A', 'B', 'B'],
        'data1': [1, 1, 2, 2],
        'data2': ['x', 'x', 'y', 'y']
    }

data for test pass = {
        'label': ['A', 'A', 'B', 'B'],
        'data1': [1, 1, 2, 2],
        'data2': [1,2,3,4]
    }
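The fail/pass data in these commits looks like a check for a feature that leaks (perfectly correlates with) the label. A sketch of such a correlation test (names and return shape are illustrative, not the merged code):

```python
import pandas as pd

def feature_label_correlation(df, feature, label, threshold=0.95):
    """Fail when a feature correlates (almost) perfectly with the label."""
    f = pd.Series(pd.factorize(df[feature])[0], dtype=float)
    y = pd.Series(pd.factorize(df[label])[0], dtype=float)
    corr = abs(f.corr(y))
    return {"passed": bool(corr < threshold), "metric": corr}

fail_df = pd.DataFrame({'label': ['A', 'A', 'B', 'B'], 'data1': [1, 1, 2, 2]})
pass_df = pd.DataFrame({'label': ['A', 'A', 'B', 'B'], 'data2': [1, 2, 3, 4]})
print(feature_label_correlation(fail_df, 'data1', 'label'))  # metric 1.0 → fails
print(feature_label_correlation(pass_df, 'data2', 'label'))  # metric ≈ 0.89 → passes
```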
@Kranium2002 (Contributor, Author)

I am having some difficulty implementing the Label Noise Detection test.

I tried using Isolation Forests for this, and they work pretty well on the sample data you provided, but they start failing as soon as we have more than one inconsistent label, and they also give false positives when there are no errors in the labels. @kevinmessiaen

@Kranium2002 (Contributor, Author)

@kevinmessiaen I just added thresholds for all the tests and also added unit tests for all the data quality tests. Please take a look and let me know.

@kevinmessiaen (Member) left a comment

That's really good!

Just a small comment: I saw that you removed the @test annotation.

Currently the @test annotation generates a GiskardTest instance that needs to be initialized and then executed as follows:

@test
def example_test(will_pass: bool):
  return TestResult(passed=will_pass)

result = example_test(True).execute()
assert result.passed

@Kranium2002 (Contributor, Author) commented Dec 19, 2023

That's really good!

Just a small comment. I saw that you removed the @test annotation.

Currently the @test annotation generate a GiskardTest instance that need to be initialized and then executed as following:

@test
def example_test(will_pass: bool):
  return TestResult(passed=will_pass)

result = example_test(True).execute()
assert result.passed

@Kranium2002 (Contributor, Author)

Added the @test decorator back. @kevinmessiaen

@Hartorn Hartorn changed the title Feature/add data quality tests Add data quality tests Dec 20, 2023
@kevinmessiaen (Member) left a comment

Thank you so much for your contribution 🎉

I've done some modifications for consistency and to improve debugging.

@Kranium2002 Kranium2002 requested a review from a team December 21, 2023 07:45
@kevinmessiaen kevinmessiaen merged commit 0f7fa52 into Giskard-AI:main Dec 21, 2023
