
Conversation

@rabah-khalek
Contributor

rabah-khalek commented May 25, 2023

Calculates the independence test between a data slice and the model predictions, for both classification (chi-square) and regression (t-test).

Examples:

  • classification
from scipy import stats
import numpy as np
import pandas as pd

from giskard import Dataset, slicing_function

# `credit`, `column_types` and `model` are defined elsewhere
dataset = Dataset(credit, name="testing dataset", target="default", column_types=column_types)

@slicing_function(row_level=False)
def slice(df: pd.DataFrame):
    return df.head(10)

# binary indicator of slice membership over the full dataset
sliced_dataset = dataset.slice(slice)
overlapping_idx = dataset.df[dataset.df.isin(sliced_dataset.df)].dropna().index.values
overlapping_array = np.zeros(len(dataset.df))
overlapping_array[overlapping_idx] = 1

# contingency table: slice membership vs predicted label
predictions = model.predict(dataset).prediction
crosstab_overlap = pd.crosstab(list(overlapping_array), list(predictions))

pvalue = stats.chi2_contingency(crosstab_overlap)[1]  # 0.37906791538682305

in this case pvalue > 0.1 --> we can reject that the slice and predictions are independent (H0)

  • regression
import random

import numpy as np
from scipy.stats import ttest_ind

rand = np.array([random.randint(0, 1000) for i in range(len(dataset.df))])
ttest_ind(rand, np.arange(0, 1000), alternative="two-sided", equal_var=True).pvalue  # 0.42475623406765706

ttest_ind(np.arange(0, 1000), np.arange(0, 1000), alternative="two-sided", equal_var=True).pvalue  # 1.0

in this case pvalue > 0.1 --> we can reject that the slice and predictions are independent (H0)

rabah-khalek and others added 30 commits May 6, 2023 19:29
…000-robustness-numerical"

This reverts commit 0be71d0, reversing
changes made to 7968a6d.
andreybavt and others added 3 commits May 25, 2023 11:35
…ation_labels

task/Numpy encoder for classification_labels
refactored ml_worker module and improved `import giskard` speed
@rabah-khalek rabah-khalek requested review from jmsquare and mattbit May 25, 2023 15:25
@rabah-khalek rabah-khalek self-assigned this May 25, 2023
@rabah-khalek rabah-khalek marked this pull request as draft May 25, 2023 15:25
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 13.6%
Duplication: 0.0%

Member

@jmsquare left a comment

  • The test name should not be test_independence_chi_square, since it supports both classification and regression (t-test). My proposal is test_independence_slice.
  • The name of the test in the decorator should be changed accordingly.
  • The tags in the decorator should be "heuristic".
  • Lower priority: the output_df of the test could contain the examples inside the slice that illustrate the correlation (see the sketch after this list):
    - (i) for classification: the examples in the slice whose predicted label contributes most to the chi-square statistic;
    - (ii) for regression: the examples in the slice with the most extreme values (the ones explaining why the means of the two slices differ).
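A rough sketch of what that output_df selection could look like (the helpers are hypothetical; they assume access to the contingency table built in the test body and to the model predictions restricted to the slice):

import numpy as np
import pandas as pd
from scipy import stats


def top_chi2_examples(sliced_df: pd.DataFrame, slice_predictions: np.ndarray,
                      crosstab: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Classification: examples predicting the label that contributes most to the chi-square."""
    _, _, _, expected = stats.chi2_contingency(crosstab)
    contributions = (crosstab.values - expected) ** 2 / expected
    # Assumes the crosstab rows are ordered [outside slice (0), inside slice (1)]
    top_label = crosstab.columns[np.argmax(contributions[1])]
    return sliced_df[slice_predictions == top_label].head(top_n)


def top_extreme_examples(sliced_df: pd.DataFrame, slice_predictions: np.ndarray,
                         top_n: int = 10) -> pd.DataFrame:
    """Regression: examples whose predictions deviate most from the slice mean."""
    deviation = np.abs(slice_predictions - slice_predictions.mean())
    return sliced_df.iloc[np.argsort(deviation)[::-1][:top_n]]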


@test(name="Right Label", tags=["heuristic", "classification"])
def test_independence_chi_square(model: BaseModel, dataset: Dataset,
                                 slicing_function: SlicingFunction, threshold: float = 0.1) -> TestResult:
Member

p_value < 0.1 is an implausibly high threshold! I would set it at most to 0.01. Generally Χ² is quite powerful, so we will not miss meaningful detections if the sample size is decent (for example, I use Χ² tests in TextSlicer to find deviant tokens with p_value < 1e-3).

Comment on lines +240 to +243
original_df = dataset.df.reset_index(drop=True)
sliced_df = sliced_dataset.df.reset_index(drop=True)
overlapping_idx = original_df[original_df.isin(sliced_df)].dropna().index.values
overlapping_array[overlapping_idx] = 1
Member

Can’t we just find the complementary slice based on the original index?
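Something along these lines, perhaps (a sketch, assuming the sliced dataframe preserves the original index of dataset.df):

# Slice membership read directly from the original index,
# instead of the isin/dropna matching above
in_slice = dataset.df.index.isin(sliced_dataset.df.index)
overlapping_array = in_slice.astype(int)
complementary_df = dataset.df[~in_slice]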

Member

You should also check that the complementary slice is not empty
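For instance (a sketch reusing the names from the diff, with the same index-based membership assumption as above):

complementary_df = dataset.df[~dataset.df.index.isin(sliced_dataset.df.index)]
if len(complementary_df) == 0:
    raise ValueError("The complementary slice is empty, the test is not applicable")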

predictions = model.predict(dataset).prediction
p_value = ttest_ind(overlapping_array, predictions, alternative="two-sided", equal_var=True).pvalue

passed = p_value < threshold
Member

I am a bit confused: what are we testing? Because if it's independence, it's the other way around: p_value < threshold indicates rejection of the independence hypothesis.

@mattbit
Member

mattbit commented May 26, 2023

@rabah-khalek I think you got the test the other way around.

You wrote that:

in this case pvalue > 0.1 --> we can reject that the slice and predictions are independent (H0)

It should be the opposite: you reject the independence hypothesis if p_value < alpha. Otherwise, I may have misunderstood what we are testing.

@rabah-khalek
Contributor Author

rabah-khalek commented May 26, 2023

I wrote this at 11pm... But wait: the null hypothesis in these tests (see the docs) is that the samples are drawn from the same distribution (dependent, in our lingo).

A p-value larger than a chosen threshold (e.g. 5% or 1%) indicates that our observation is not so unlikely to have occurred by chance. Therefore, we do not reject the null hypothesis of equal population means. If the p-value is smaller than our threshold, then we have evidence against the null hypothesis of equal population means.

p_value < alpha ==> populations are different (independent in our lingo) ==> test passes.

It's true that I wrongly used H0 to refer to independence in the comment above, when in these tests it rather refers to dependence. But otherwise the test logic is the right one. Let me know if you agree @mattbit.

P.S. the _ind in ttest_ind refers to independent samples, not to independence (in terms of the hypothesis).

@mattbit
Member

mattbit commented May 26, 2023

Sorry, these tests are always a bit confusing; I think we are saying the same thing but with different terminology. Let me recap more precisely, and tell me if I'm correct and we agree:

I’m looking at the Χ² test. Reading the code, the null hypothesis of the Χ² is that the observed predictions are independent of being in the slice or not. More rigorously, $H_0$ is
$$P[\text{sample in data slice } \land \text{ predicted label}=y] = P[\text{sample in data slice}]\ P[\text{predicted label}=y]$$

You are passing a contingency table to measure this effect, i.e. belonging to the data slice should not affect the expected value of the prediction. For $p < \alpha$, you reject this hypothesis and say that the predicted label looks correlated with being in the data slice or not. In this last case, the Giskard test should not pass.
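In code, that reading would be something like this (a sketch; crosstab_overlap is the contingency table from the PR description, and alpha follows the 0.01 suggestion above):

from scipy import stats

# H0: the predicted label is independent of slice membership
p_value = stats.chi2_contingency(crosstab_overlap)[1]

alpha = 0.01
independence_rejected = p_value < alpha  # prediction looks correlated with the slice
passed = not independence_rejected       # the Giskard test fails when H0 is rejected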

If this is correct, then this is what I need for spurious correlation. Calling this test "independence" can be confusing; I think we need to be more precise. It could be test_slice_prediction_independence or something better.

Is this correct @rabah-khalek?

@mattbit
Member

mattbit commented May 26, 2023

@rabah-khalek @jmsquare I think we need to be very careful with this kind of test. The model prediction will always be dependent on the feature value; otherwise it would mean that the feature is not informative at all. If we want to point out misbehaviour, we need to restrict this detection to extreme cases only. Thinking about it now, I'm not sure the $\chi^2$ test is the correct approach here. I'm implementing this in scan, but I think it's better to spend some time testing whether this actually gives meaningful results.

@rabah-khalek
Contributor Author

Agreed @mattbit, it is a very basic test and has fundamental limitations... It'll be hard to robustly settle on a consistent definition of "spurious correlation".

@rabah-khalek
Contributor Author

Just read your previous comment. I think the confusion between us stems from the difference in what the chi-square and the t-test are testing:

from scipy import stats
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np

# Two identical binary samples: perfectly dependent, with equal means
u1 = np.zeros(500)
u2 = np.zeros(500)
u1[:250] = 1
u2[:250] = 1

crosstab = pd.crosstab(u1, u2)

pvalue_chi2 = stats.chi2_contingency(crosstab)[1]  # 7.023536136418314e-110
pvalue_ttest = ttest_ind(u1, u2, alternative="two-sided", equal_var=True).pvalue  # 1.0
  • pvalue_chi2 < alpha ==> rejection of the independence hypothesis
  • pvalue_ttest < alpha ==> rejection of the equal-means hypothesis

In which case, I should've reversed the inequality for the t-test (regression) to make the interpretation consistent; see the sketch below.
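A minimal sketch of the two rejections side by side (alpha is an assumed threshold; pvalue_chi2 and pvalue_ttest come from the snippet above):

alpha = 0.01

# chi-square: small p-value => reject H0 "slice membership and predictions are independent"
independence_rejected = pvalue_chi2 < alpha

# t-test: small p-value => reject H0 "the two samples have equal means"
equal_means_rejected = pvalue_ttest < alpha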

@mattbit
Member

mattbit commented May 26, 2023

@rabah-khalek I think we can put this on hold, since we will not be working on spurious correlation for now.

@andreybavt andreybavt force-pushed the feature/ai-test-v2-merged branch from 80c1113 to be66b69 on June 7, 2023 15:24
@mattbit mattbit changed the base branch from feature/ai-test-v2-merged to main June 14, 2023 13:17
@rabah-khalek
Contributor Author

@mattbit, is this now superseded by #1178?

@mattbit
Member

mattbit commented Jun 19, 2023

@mattbit, is this now superseded by #1178?

Yes, I think so. It could still be useful to merge, though; I can take care of that in GSK-1316 if you want.

@rabah-khalek
Contributor Author

Ah okay, for the Giskard tests. Yep, have a look and let me know if anything is missing. Have we converged on our previous conversation?

@rabah-khalek
Contributor Author

Superseded by #1302.

@Hartorn Hartorn deleted the test/chi_square_independence branch September 22, 2023 10:56