Add data quality tests #1651
Conversation
I think some of the tests can be removed from this list, like the Data Consistency Test, as Giskard already tests this when creating a Giskard dataset. What do you think @kevinmessiaen?
kevinmessiaen
left a comment
Thanks for the tests and your contribution @Kranium2002!
I left a few comments; let me know if you need some help.
You're right about Data Consistency. @luca-martial, do you have something specific in mind about it, or can we remove this type of test?
@kevinmessiaen I removed Data Consistency!
chore(fix): add metric=uniqueness_ration Co-authored-by: Kevin Messiaen <[email protected]>
chore(fix): add threshold to uniqueness test Co-authored-by: Kevin Messiaen <[email protected]>
For anomaly detection I was thinking about using DBSCAN or isolation forests from sklearn. What do you suggest @kevinmessiaen? Can we use third-party packages for implementing anomaly detection?

I think it's ok to use an external library for complex tests like this one; just make sure to import it inside the test so that we can run Giskard without installing this dependency, since it should be optional.
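A minimal standalone sketch of the optional-dependency pattern described above, using sklearn's `IsolationForest`. The function name, the `contamination` choice, and the sample data are illustrative, not Giskard's API:

```python
import pandas as pd

def outlier_ratio(df: pd.DataFrame, column: str, contamination: float = 0.1) -> float:
    """Return the fraction of rows in `column` flagged as outliers."""
    # Import inside the function so sklearn stays an optional dependency,
    # as suggested in the discussion above.
    from sklearn.ensemble import IsolationForest

    values = df[[column]].to_numpy()
    preds = IsolationForest(contamination=contamination, random_state=0).fit_predict(values)
    # fit_predict returns -1 for outliers and 1 for inliers.
    return float((preds == -1).mean())

df = pd.DataFrame({"age": [20, 25, 23, 40, 67, 55, 44, 17, 47, 1000]})
ratio = outlier_ratio(df, "age")  # a small fraction of rows is flagged
```

A real test would then compare this ratio against a threshold to decide pass/fail.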
I had some questions regarding some tests:
Hello,

Regarding test 6 (Ensures relationships between different data tables or datasets are maintained): for now we do not work with RDBMS, however the test could look like this:

@giskard.test()
def ensure_all_exists(dataset: giskard.Dataset, column: str, target_dataset: giskard.Dataset, target_column: str, threshold: float = 0.0):
    # Ensure that all data in "column" of "dataset" are present in "target_column" of "target_dataset"
    source = dataset.df[column]
    referenced = target_dataset.df[target_column]
    not_included = source[~source.isin(referenced)]
    missing_ratio = len(not_included) / len(source)
    return giskard.TestResult(passed=missing_ratio <= threshold, metric=missing_ratio)

Regarding tests 5 and 10, to confirm with @luca-martial, but my understanding is:

Test 5: Identifies outliers or anomalies in the dataset. Let's take the following dataset containing values: … In this example we have an outlier value of … I would give a test signature of …

Test 10: Label Noise Detection Test

dataset = giskard.Dataset(pd.DataFrame({
    'age': [20, 25, 23, 40, 67, 55, 44, 17, 47, 60],
    'group': ["<30", "<30", "<30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30", ">=30"],
}))

Now on this one we need to identify that the … I would give a test signature of …
data_for_test_fail = {
    'label': ['A', 'A', 'B', 'B'],
    'data1': [1, 1, 2, 2],
    'data2': ['x', 'x', 'y', 'y']
}
data_for_test_pass = {
    'label': ['A', 'A', 'B', 'B'],
    'data1': [1, 1, 2, 2],
    'data2': [1, 2, 3, 4]
}
I am having some difficulty implementing the Label Noise Detection test. I tried using isolation forests, and they work pretty well for the sample data you provided, but they start failing as soon as there is more than one inconsistent label, and they also produce false positives when there are no errors in the labels. @kevinmessiaen
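One alternative to isolation forests that avoids flagging clean data is a simple majority-vote check: a row counts as noisy when its label disagrees with the majority label among rows sharing the same feature values. A hedged sketch, with illustrative function and column names:

```python
import pandas as pd

def label_noise_ratio(df: pd.DataFrame, label_col: str) -> float:
    """Fraction of rows whose label disagrees with the majority label
    of rows that have identical feature values."""
    feature_cols = [c for c in df.columns if c != label_col]
    # Majority (mode) label within each group of identical feature rows.
    majority = df.groupby(feature_cols)[label_col].transform(lambda s: s.mode().iloc[0])
    return float((df[label_col] != majority).mean())

noisy = pd.DataFrame({"x": [1, 1, 1, 2, 2, 2],
                      "label": ["A", "A", "B", "B", "B", "B"]})
clean = pd.DataFrame({"x": [1, 1, 2, 2],
                      "label": ["A", "A", "B", "B"]})

noisy_ratio = label_noise_ratio(noisy, "label")  # one row of six disagrees
clean_ratio = label_noise_ratio(clean, "label")  # 0.0 on clean data
```

The limitation is that it only detects noise among duplicated feature rows; for continuous features one would first have to bin or cluster them.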
@kevinmessiaen I just added thresholds for all the tests and also added unit tests for all data quality tests. Please take a look and let me know.
That's really good!
Just a small comment: I saw that you removed the @test annotation.
Currently the @test annotation generates a GiskardTest instance that needs to be initialized and then executed as follows:
@test
def example_test(will_pass: bool):
return TestResult(passed=will_pass)
result = example_test(True).execute()
assert result.passed
…ranium2002/giskard into feature/add-data-quality-tests
Added the @test decorator back. @kevinmessiaen
kevinmessiaen
left a comment
Thank you so much for your contribution 🎉
I've made some modifications to ensure consistency and improve debugging.
Description
This pull request addresses a crucial aspect of machine learning model development by introducing a comprehensive suite of data quality tests to Giskard. While Giskard currently excels in model quality testing, enhancing its capabilities to evaluate the quality of training data is imperative.
List of tests to be added:
Related Issue
This PR is related to issue #1601
Type of Change
Checklist
- CODE_OF_CONDUCT.md document
- CONTRIBUTING.md guide
- make codestyle