added chi square independent test #1026
Conversation
Kudos, SonarCloud Quality Gate passed!
- The test name should not be `test_independence_chi_square` since it supports both classification and regression (t-test). My proposition is `test_independence_slice`; the name in the decorator should be changed accordingly.
- The tags in the decorator should be "heuristic".
- Lower priority: the output_df of the test could contain the examples inside the slice that illustrate this correlation:
  - (i) for classification: the examples in the slice that return the classification label contributing the most to the chi-square;
  - (ii) for regression: the examples in the slice that return the most extreme values (which explain why the means of the two slices differ).
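A minimal sketch of the proposed rename and tag change (hypothetical; the signature, types, and `...` body mirror the snippet under review below, with imports omitted as in that snippet):

```python
# Hypothetical sketch of the suggested rename; not the PR's actual code.
@test(name="test_independence_slice", tags=["heuristic"])
def test_independence_slice(model: BaseModel, dataset: Dataset,
                            slicing_function: SlicingFunction,
                            threshold: float = 0.1) -> TestResult:
    ...
```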
```python
@test(name="Right Label", tags=["heuristic", "classification"])
def test_independence_chi_square(model: BaseModel, dataset: Dataset,
                                 slicing_function: SlicingFunction, threshold: float = 0.1) -> TestResult:
```
`p_value < 0.1` is an implausibly high threshold! I would set it at most to 0.01. Generally Χ² is quite powerful, so we will not miss meaningful detections if the sample size is decent (for example, I use Χ² tests in TextSlicer to find deviant tokens with `p_value < 1e-3`).
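A quick illustration of the point about Χ² power (not from the PR; the sample size and agreement rate are made-up assumptions): even a mild association at a decent sample size yields a p-value far below 0.01.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
in_slice = rng.integers(0, 2, size=500)  # slice membership indicator
# Predictions agree with slice membership ~65% of the time: a mild dependence.
pred = np.where(rng.random(500) < 0.65, in_slice, 1 - in_slice)

# 2x2 contingency table of slice membership vs. prediction.
table = np.array([[np.sum((in_slice == i) & (pred == j)) for j in (0, 1)]
                  for i in (0, 1)])
print(chi2_contingency(table)[1])  # typically far below 1e-3
```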
```python
original_df = dataset.df.reset_index(drop=True)
sliced_df = sliced_dataset.df.reset_index(drop=True)
overlapping_idx = original_df[original_df.isin(sliced_df)].dropna().index.values
overlapping_array[overlapping_idx] = 1
```
Can’t we just find the complementary slice based on the original index?
You should also check that the complementary slice is not empty
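A hedged sketch of what both suggestions could look like together (hypothetical; it assumes `sliced_dataset.df` preserves the original index rather than resetting it):

```python
# Hypothetical sketch: derive the complementary slice from the original index
# and fail early if it is empty.
sliced_idx = sliced_dataset.df.index
complementary_idx = dataset.df.index.difference(sliced_idx)
if len(complementary_idx) == 0:
    raise ValueError("The complementary slice is empty")
complementary_df = dataset.df.loc[complementary_idx]
```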
```python
predictions = model.predict(dataset).prediction
p_value = ttest_ind(overlapping_array, predictions, alternative="two-sided", equal_var=True).pvalue

passed = p_value < threshold
```
I am a bit confused, what are we testing? Because if it's independence, it's the other way around: `p_value < threshold` indicates rejection of the independence hypothesis.
@rabah-khalek I think you got the test the other way around. You wrote that: [...] It should be the opposite: you reject the independence hypothesis if `p_value < threshold`.
I've written this at 11pm... But wait, the null hypothesis in these tests (see the docs) is that the samples are likely drawn from the same distribution (dependent, in our lingo). `p_value < alpha` ==> populations are different (independent in our lingo) ==> test passes. It's true that I wrongly named H0 to refer to independence in the comment above, when in these tests it rather refers to dependence. But otherwise the test logic is the right one. Let me know if you agree @mattbit. P.S. the [...]
Sorry, these tests are always a bit confusing; I think we are saying the same thing but with different terminology. Let me recap more precisely, and tell me if I'm correct and we agree: I'm looking at the Χ² test. Reading the code, the null hypothesis of the Χ² is that the observed predictions are independent of being in the slice or not. More rigorously, [...]. You are passing a contingency table to measure this effect, i.e. belonging to the data slice should not affect the expected value of the prediction. For [...]. If this is correct, then this is what I need for spurious correlation. Calling this test "independence" can be confusing; I think we need to be more precise, it could be [...]. Is this correct @rabah-khalek?
@rabah-khalek @jmsquare I think we need to be very careful with this kind of test. The model prediction will always be dependent on the feature value; otherwise it would mean that the feature is not informative at all. If we want to point out misbehaviour, we need to restrict this detection only to extreme cases. Thinking about it now, I'm not sure the [...]
Agreed @mattbit, it is a very basic test and has fundamental limitations... It'll be hard to robustly define a consistent notion of "spurious correlation".
Just read your previous comment. I think the confusion between us stems from the difference between what the chi-square and the t-test are testing:

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import ttest_ind

u1 = np.zeros(500)
u2 = np.zeros(500)
u1[:250] = 1
u2[:250] = 1

crosstab = pd.crosstab(u1, u2)
pvalue_chi2 = stats.chi2_contingency(crosstab)[1]  # 7.023536136418314e-110

pvalue_ttest = ttest_ind(u1, u2, alternative="two-sided", equal_var=True).pvalue  # 1.0
```
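(In other words, the Χ² on the crosstab measures the association between u1 and u2, which are identical here, hence the tiny p-value; the t-test only compares their means, which are equal, hence p = 1.)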
@rabah-khalek I think we can put this on hold since we will not be working on spurious correlation for now |
Force-pushed from 80c1113 to be66b69.
Ah okay, for the giskard tests. Yep, have a look and let me know if there's anything missing. Have we converged on our previous conversation?
Superseded by #1302








Calculates the independence test for both classification (Χ²) and regression (t-test).
Examples:
in this case pvalue < 0.1 --> we can reject the hypothesis that the slice and predictions are independent (H0)
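For reference, a minimal sketch of how such a test could dispatch between the two cases, assuming the semantics discussed above (a hypothetical helper, not the merged implementation; the slice-vs-complement t-test follows the output_df suggestion rather than the PR's indicator-vs-predictions call):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

def independence_pvalue(in_slice: np.ndarray, predictions: np.ndarray,
                        is_classification: bool) -> float:
    """Hypothetical helper: p-value under H0 'predictions are independent of the slice'."""
    if is_classification:
        # Χ²: are predicted labels distributed the same in and out of the slice?
        table = pd.crosstab(in_slice, predictions)
        return chi2_contingency(table)[1]
    # t-test: do predictions have the same mean inside and outside the slice?
    return ttest_ind(predictions[in_slice == 1], predictions[in_slice == 0],
                     alternative="two-sided", equal_var=True).pvalue
```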