
Conversation

@Kranium2002 (Contributor) commented Dec 1, 2023

Description

This pull request introduces a comprehensive suite of data quality tests to Giskard. While Giskard already excels at model quality testing, it also needs the ability to evaluate the quality of the training data itself.

List of tests to be added:

  • Data Completeness Test
  • Data Uniqueness Test
  • Data Range Test
  • Validity Test
  • Data Correlation Test
  • Data Anomaly Detection Test
  • Data Integrity Test
  • Label Consistency Test
  • Class Imbalance Test
  • Feature Importance Test
  • Label Noise Detection Test
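As a rough illustration of the direction (the function name and return shape below are placeholders, not the Giskard API), a Data Completeness Test can reduce to checking the share of non-null values in a column:

```python
import pandas as pd

def data_completeness(df: pd.DataFrame, column: str, threshold: float = 0.95):
    """Pass when the fraction of non-null values in `column` meets `threshold`."""
    completeness = df[column].notna().mean()
    return {"passed": bool(completeness >= threshold), "metric": completeness}

df = pd.DataFrame({'age': [20, 25, None, 40]})
print(data_completeness(df, 'age', threshold=0.9))  # metric 0.75 → fails
```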

Related Issue

This PR is related to issue #1601

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've updated the code style using make codestyle.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.

@Kranium2002 (Contributor, Author)

I think some of the tests can be removed from this list, like the Data Consistency Test, since Giskard already checks this when creating a Giskard dataset. What do you think @kevinmessiaen?

@kevinmessiaen kevinmessiaen self-requested a review December 1, 2023 12:42
@kevinmessiaen kevinmessiaen self-assigned this Dec 1, 2023
@kevinmessiaen (Member) left a comment

Thanks for the tests and your contribution @Kranium2002 !

I left a few comments, let me know if you need some help.

@kevinmessiaen (Member)

I think some of the tests can be removed from this list like Data Consistency Test as Giskard already tests this when creating a Giskard dataset. What do you think @kevinmessiaen ?

You're right about Data Consistency.

@luca-martial Do you have something specific in mind about it or can we remove this type of test?

@luca-martial (Contributor)

@kevinmessiaen I removed Data Consistency!

@Kranium2002 (Contributor, Author) commented Dec 3, 2023

For anomaly detection I was thinking about using DBSCAN or Isolation Forests from scikit-learn. What do you suggest, @kevinmessiaen? Can we use third-party packages for implementing anomaly detection?

I think it's OK to use an external library for complex tests like this one; just make sure to import it inside the test so that Giskard can run without installing this dependency, since it should remain optional.
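That lazy-import pattern might look something like the following sketch (the `require_optional` helper is illustrative, not part of Giskard's API):

```python
import importlib
import importlib.util

def require_optional(name: str):
    """Import an optional test dependency, failing with an actionable message."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"The optional dependency '{name}' is required for this test; "
            f"install it with `pip install {name}`."
        )
    return importlib.import_module(name)

# A module that is available resolves normally:
json = require_optional("json")

# A missing package raises a clear error at test run time,
# not when Giskard itself is imported:
try:
    require_optional("some_missing_package")
except ImportError as exc:
    print(exc)
```

The key point is that the import happens inside the test body, so Giskard can be installed and used without the extra dependency.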

@Kranium2002 (Contributor, Author)

I had some questions regarding some tests:

  • How do we implement test 6? Does Giskard also work like an RDBMS? Is this test really necessary for Giskard?
  • Aren't tests 5 and 10 the same, the only difference being the name of the column we pass to the anomaly detection function?
    @kevinmessiaen

@kevinmessiaen (Member) commented Dec 4, 2023

I had some questions regarding some tests:

  • How do we implement test 6? Does giskard also work like a RDBMS? Is this test really necessary for Giskard?
  • Aren't tests 5 and 10 same, the only difference being the name of the column we pass onto the anomaly detection function?
    @kevinmessiaen

Hello,

Regarding test 6 (ensures relationships between different data tables or datasets are maintained):

For now we do not work with RDBMSs; however, giskard.Dataset is an abstract interface and will be updated to support more than that. The idea is to ensure that all values in one dataset's column are present in a column of another dataset:

@giskard.test()
def ensure_all_exists(dataset: giskard.Dataset, column: str, target_dataset: giskard.Dataset, target_column: str, threshold: float = 0.0):
    # Ensure that all data in "column" of "dataset" are present in "target_column" of "target_dataset"
    source = dataset.df[column]
    referenced = target_dataset.df[target_column]
    not_included = source[~source.isin(referenced)]
    missing_ratio = len(not_included) / len(source)
    return giskard.TestResult(passed=missing_ratio <= threshold, metric=missing_ratio)

Regarding tests 5 and 10:

To be confirmed with @luca-martial, but my understanding is:

Test 5: Identifies outliers or anomalies in the dataset.

Take the following dataset:
dataset = giskard.Dataset(pd.DataFrame({'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60, 120]}))

In this example we have an outlier value of 120 in the age column; it might be a typo, so we want to identify it.

I would give a test signature of def outlier(dataset: giskard.Dataset, column: str)
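As a sketch only (not the PR's implementation), an IQR-based version of that signature could look like this, using Tukey's 1.5 × IQR fences:

```python
import pandas as pd

def outlier(df: pd.DataFrame, column: str) -> pd.Series:
    """Return values outside the 1.5 * IQR fences (Tukey's rule)."""
    s = df[column]
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

df = pd.DataFrame({'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60, 120]})
print(outlier(df, 'age').tolist())  # → [120]
```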

Test 10: Label Noise Detection Test

dataset = giskard.Dataset(pd.DataFrame({
   'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60],
   'group': ["<30", "<30", "<30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30"],
}))

Now on this one we need to identify that the group >=30 might be mislabelled with respect to the age column for the value 17.

I would give a test signature of def mislabel(dataset: giskard.Dataset, labelled_column: str, reference_columns: Iterable[str])
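One statistical way to sketch that signature (simplified here to a single reference column, and not the implementation that landed in this PR) is a per-group modified z-score based on the median absolute deviation:

```python
import pandas as pd

def mislabel(df: pd.DataFrame, labelled_column: str, reference_column: str,
             threshold: float = 2.0) -> pd.DataFrame:
    """Flag rows whose reference value is an outlier within its label group,
    using the MAD-based modified z-score."""
    flagged = []
    for _, group in df.groupby(labelled_column):
        vals = group[reference_column]
        med = vals.median()
        mad = (vals - med).abs().median()
        if mad == 0:
            continue  # group has no spread; nothing to flag
        z = 0.6745 * (vals - med).abs() / mad
        flagged.extend(z[z > threshold].index)
    return df.loc[flagged]

df = pd.DataFrame({
    'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60],
    'group': ["<30", "<30", "<30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30"],
})
print(mislabel(df, 'group', 'age')['age'].tolist())  # → [17]
```

On Kevin's example data above, the age 17 in the >=30 group is the only value flagged.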

kevinmessiaen and others added 3 commits December 4, 2023 10:53
data for test fail = {
        'label': ['A', 'A', 'B', 'B'],
        'data1': [1, 1, 2, 2],
        'data2': ['x', 'x', 'y', 'y']
    }

data for test pass = {
        'label': ['A', 'A', 'B', 'B'],
        'data1': [1, 1, 2, 2],
        'data2': [1,2,3,4]
    }
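The fail/pass data in these commits looks like a check for a feature that leaks (perfectly correlates with) the label. A sketch of such a correlation test (names and return shape are illustrative, not the merged code):

```python
import pandas as pd

def feature_label_correlation(df, feature, label, threshold=0.95):
    """Fail when a feature correlates (almost) perfectly with the label."""
    f = pd.Series(pd.factorize(df[feature])[0], dtype=float)
    y = pd.Series(pd.factorize(df[label])[0], dtype=float)
    corr = abs(f.corr(y))
    return {"passed": bool(corr < threshold), "metric": corr}

fail_df = pd.DataFrame({'label': ['A', 'A', 'B', 'B'], 'data1': [1, 1, 2, 2]})
pass_df = pd.DataFrame({'label': ['A', 'A', 'B', 'B'], 'data2': [1, 2, 3, 4]})
print(feature_label_correlation(fail_df, 'data1', 'label'))  # metric 1.0 → fails
print(feature_label_correlation(pass_df, 'data2', 'label'))  # metric ≈ 0.89 → passes
```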
@Kranium2002 (Contributor, Author)

I am having some difficulty implementing the Label Noise Detection test.

I tried using Isolation Forests for this, and they work pretty well on the sample data you provided, but they start failing as soon as we have more than one inconsistent label, and they also give false positives when there are no errors in the labels. @kevinmessiaen

@Kranium2002 (Contributor, Author)

@kevinmessiaen I just added thresholds for all the tests and also added unit tests for all the data quality tests. Please take a look and let me know.

@kevinmessiaen (Member) left a comment

That's really good!

Just a small comment: I saw that you removed the @test annotation.

Currently the @test annotation generates a GiskardTest instance that needs to be initialized and then executed as follows:

@test
def example_test(will_pass: bool):
  return TestResult(passed=will_pass)

result = example_test(True).execute()
assert result.passed

@Kranium2002 (Contributor, Author) commented Dec 19, 2023

That's really good!

Just a small comment. I saw that you removed the @test annotation.

Currently the @test annotation generate a GiskardTest instance that need to be initialized and then executed as following:

@test
def example_test(will_pass: bool):
  return TestResult(passed=will_pass)

result = example_test(True).execute()
assert result.passed

@Kranium2002 (Contributor, Author)

Added the @test decorator back. @kevinmessiaen

@Hartorn Hartorn changed the title Feature/add data quality tests Add data quality tests Dec 20, 2023
@kevinmessiaen (Member) left a comment

Thank you so much for your contribution 🎉

I've done some modifications for consistency and to improve debugging.

@Kranium2002 Kranium2002 requested a review from a team December 21, 2023 07:45
@kevinmessiaen kevinmessiaen merged commit 0f7fa52 into Giskard-AI:main Dec 21, 2023
