[GSK-1279] Fisher's exact test and permutation test for slice metrics significance testing #1671
Conversation
mattbit left a comment:
A few minor comments, but otherwise it looks good to me.
```python
perm_slice_dataset = Dataset(
    dataset.df.loc[slice_ids],
    target=dataset.target,
    column_types=dataset.column_types.copy(),
```
I don't think we need to make a copy here
```python
def _calculate_pvalue_from_permutation_test(
    slice_dataset, comp_dataset, dataset, model, metric, perm_test_resamples=1000
):
    logger.info("PerformanceBiasDetector: permutation test")
```
let’s keep this as debug
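(For context, a minimal sketch of the kind of resampling such a helper performs; the function and variable names below are illustrative, not the PR's actual implementation.)

```python
import numpy as np

def permutation_pvalue(slice_values, comp_values, n_resamples=1000, seed=None):
    """Two-sided permutation test on the difference of mean metric values."""
    rng = np.random.default_rng(seed)
    slice_values = np.asarray(slice_values, dtype=float)
    comp_values = np.asarray(comp_values, dtype=float)
    observed = slice_values.mean() - comp_values.mean()
    pooled = np.concatenate([slice_values, comp_values])
    n_slice = len(slice_values)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign samples to slice/complement
        diff = pooled[:n_slice].mean() - pooled[n_slice:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # +1 smoothing keeps the estimate away from an exact zero p-value
    return (hits + 1) / (n_resamples + 1)
```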
```python
# if the slice size is too small, use Fisher's exact test, otherwise use a chi-square test
if min(min(row) for row in ctable) <= max_size_fisher:
    logger.info("PerformanceBiasDetector: Fisher's exact test")
```
logger.debug
```python
ctable = [[slice_x_cnt, slice_y_cnt], [comp_x_cnt, comp_y_cnt]]

# if the slice size is too small, use Fisher's exact test, otherwise use a chi-square test
```
It's a G-test, not a chi-square test.
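(A sketch of the test-selection logic under discussion; the helper name and the `max_size_fisher` default are assumptions. With `lambda_="log-likelihood"`, `scipy.stats.chi2_contingency` computes the G-test rather than Pearson's chi-square, which is presumably what the reviewer is pointing at.)

```python
import scipy.stats

def contingency_pvalue(ctable, max_size_fisher=5):
    # Fisher's exact test is more reliable when any cell count is small.
    if min(min(row) for row in ctable) <= max_size_fisher:
        _, pvalue = scipy.stats.fisher_exact(ctable)
    else:
        # G-test: chi-square-family test using the log-likelihood ratio.
        _, pvalue, _, _ = scipy.stats.chi2_contingency(ctable, lambda_="log-likelihood")
    return pvalue
```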
```python
    _, pvalue = scipy.stats.ttest_ind(
        slice_metric.raw_values, comp_metric.raw_values, equal_var=False, alternative=alternative
    )
elif metric.name.lower() in ["accuracy", "precision", "recall"]:
```
We should do this check differently. Let’s use some attribute on the metric object, like is_binary_metric or similar.
Maybe it's even better to give the Metric class the responsibility of calculating the contingency table entries.
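(A sketch of the suggested attribute-based dispatch; `has_binary_counts` and `binary_counts` follow the naming floated later in this review and are not the PR's final API.)

```python
import scipy.stats

def _pvalue_for_metric(metric, slice_metric, comp_metric, alternative="two-sided"):
    if getattr(metric, "has_binary_counts", False):
        # Binary success/failure counts: test the 2x2 contingency table.
        ctable = [list(slice_metric.binary_counts), list(comp_metric.binary_counts)]
        _, pvalue = scipy.stats.fisher_exact(ctable, alternative=alternative)
    else:
        # Continuous per-sample values: Welch's t-test (unequal variances).
        _, pvalue = scipy.stats.ttest_ind(
            slice_metric.raw_values,
            comp_metric.raw_values,
            equal_var=False,
            alternative=alternative,
        )
    return pvalue
```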
```python
    )
except ValueError as err:
    pvalue = np.nan
    logger.info(f"PerformanceBiasDetector: p-value could not be calculated: {err}")
```
logger.debug
mattbit left a comment:
Sorry, some small changes
```python
value: float
affected_samples: int
raw_values: Optional[np.ndarray] = None
ctable_values: Optional[list[list[int]]] = None
```
We can't have a contingency table
We can be more general and maybe just get a categorical or binary representation?
Something like binary_counts or binary_representation?
Same as above
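(A sketch of the more general result shape being suggested; the class name and field ordering are assumptions.)

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np

@dataclass
class MetricResult:
    value: float
    affected_samples: int
    raw_values: Optional[np.ndarray] = None
    # Generic (success, failure) counts instead of a full contingency
    # table, e.g. [n_correct, n_incorrect] for accuracy-like metrics.
    binary_counts: Optional[List[int]] = None
```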
```python
class Accuracy(ClassificationPerformanceMetric):
    name = "Accuracy"
    greater_is_better = True
    has_contingency_table = True
```
Same as above, this could be more general. In any case, I’m not sure it's needed if we have it on the result object.
It's not needed but I added it for clarity and "just in case". Should I remove it?
Actually, it's needed because it's used to decide whether to calculate the binary counts (to avoid calculating them for classification metrics where they wouldn't make sense, e.g. the F1 score).
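(A hypothetical helper reflecting that explanation; the accuracy-style counting is illustrative only.)

```python
import numpy as np

def maybe_binary_counts(metric, y_true, y_pred):
    # Only compute binary counts for metrics where they are meaningful
    # (e.g. accuracy), not for aggregate scores like the F1 score.
    if not getattr(metric, "has_binary_counts", False):
        return None
    n_correct = int(np.sum(np.asarray(y_true) == np.asarray(y_pred)))
    return [n_correct, len(y_true) - n_correct]
```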
```python
# column_types=dataset.column_types.copy(),
# validation=False,
```
Why is this commented out?
Forgot to remove it
```python
# column_types=dataset.column_types.copy(),
# validation=False,
```
same
```python
def test_calculate_slice_metrics():
    SLICE_SIZE = 500
    np.random.seed(42)
```
We should have the seed somewhere in the detector instead of setting the global seed here.
Yes, not sure why I put it there
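(A sketch of keeping the seed on the detector; the constructor signature is an assumption.)

```python
import numpy as np

class PerformanceBiasDetector:
    def __init__(self, perm_test_resamples=1000, seed=42):
        self.perm_test_resamples = perm_test_resamples
        # A dedicated Generator on the detector avoids mutating NumPy's
        # global random state from the tests.
        self._rng = np.random.default_rng(seed)

    def _permutation_indices(self, n_total, n_slice):
        # Randomly reassign samples to the slice for one resample.
        return self._rng.permutation(n_total)[:n_slice]
```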
mattbit left a comment:
Looks good, I’ll do some local testing.
```python
class BalancedAccuracy(ClassificationPerformanceMetric):
    name = "Balanced Accuracy"
    greater_is_better = True
    has_binary_counts = False
```
not needed
```python
class F1Score(SklearnClassificationScoreMixin, ClassificationPerformanceMetric):
    name = "F1 Score"
    greater_is_better = True
    has_binary_counts = False
```
not needed
```python
class AUC(PerformanceMetric):
    name = "ROC AUC"
    greater_is_better = True
    has_binary_counts = False
```
not needed
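(The "not needed" comments presumably point at a base-class default; a minimal sketch of that pattern, with a simplified hierarchy:)

```python
class PerformanceMetric:
    # False by default on the base class, so only binary metrics
    # (e.g. Accuracy, Precision, Recall) need to override it.
    has_binary_counts = False

class Accuracy(PerformanceMetric):
    name = "Accuracy"
    greater_is_better = True
    has_binary_counts = True
```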
Added Fisher's exact test and a permutation test for slice metrics significance testing. TODO: unit tests.