
Conversation

@luca-rossi (Contributor)

Added Fisher's exact test and permutation test for slice metrics significance testing. Todo: unit tests.

@linear bot commented Dec 6, 2023

@luca-rossi requested a review from @mattbit December 6, 2023 18:45
@luca-rossi marked this pull request as ready for review December 8, 2023 10:35
@mattbit (Member) left a comment:

Few minor comments but otherwise looks good to me.

perm_slice_dataset = Dataset(
dataset.df.loc[slice_ids],
target=dataset.target,
column_types=dataset.column_types.copy(),
@mattbit (Member): I don't think we need to make a copy here.

def _calculate_pvalue_from_permutation_test(
slice_dataset, comp_dataset, dataset, model, metric, perm_test_resamples=1000
):
logger.info("PerformanceBiasDetector: permutation test")
@mattbit (Member): Let's keep this as debug.
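
For context, a minimal sketch of the permutation-test idea behind this helper, assuming the metric can be computed from per-sample values (identifiers and defaults here are illustrative, not the merged code):

import numpy as np

def permutation_test_pvalue(slice_values, comp_values, n_resamples=1000, seed=0):
    """Two-sided p-value for the metric difference between a slice and its complement."""
    slice_values = np.asarray(slice_values, dtype=float)
    comp_values = np.asarray(comp_values, dtype=float)

    # Observed difference between the slice metric and the complement metric.
    observed = slice_values.mean() - comp_values.mean()

    pooled = np.concatenate([slice_values, comp_values])
    n_slice = len(slice_values)
    rng = np.random.default_rng(seed)

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign samples to "slice" / "complement"
        resampled = pooled[:n_slice].mean() - pooled[n_slice:].mean()
        if abs(resampled) >= abs(observed):
            hits += 1

    # Fraction of resamples at least as extreme as the observed difference;
    # the +1 correction avoids reporting a p-value of exactly zero.
    return (hits + 1) / (n_resamples + 1)

In the detector, the per-sample values would come from evaluating the model on slice_dataset and comp_dataset; for binary metrics such as accuracy they are just 0/1 correctness indicators.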


# if the slice size is too small, use Fisher's exact test, otherwise use a chi-square test
if min(min(row) for row in ctable) <= max_size_fisher:
logger.info("PerformanceBiasDetector: Fisher's exact test")
@mattbit (Member): logger.debug


ctable = [[slice_x_cnt, slice_y_cnt], [comp_x_cnt, comp_y_cnt]]

# if the slice size is too small, use Fisher's exact test, otherwise use a chi-square test
@mattbit (Member): It's a G-test.
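
For context on this remark: scipy.stats.chi2_contingency with lambda_="log-likelihood" computes the G-test (log-likelihood ratio), so the comment string should say G-test rather than chi-square. A minimal sketch of the branch, with an illustrative max_size_fisher threshold:

import scipy.stats

def contingency_pvalue(ctable, max_size_fisher=25):  # threshold value is illustrative
    # ctable = [[slice_successes, slice_failures], [comp_successes, comp_failures]]
    if min(min(row) for row in ctable) <= max_size_fisher:
        # Small counts: Fisher's exact test on the 2x2 table.
        _, pvalue = scipy.stats.fisher_exact(ctable)
    else:
        # Larger counts: G-test (log-likelihood ratio), not Pearson's chi-square.
        _, pvalue, _, _ = scipy.stats.chi2_contingency(ctable, lambda_="log-likelihood")
    return pvalue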

_, pvalue = scipy.stats.ttest_ind(
slice_metric.raw_values, comp_metric.raw_values, equal_var=False, alternative=alternative
)
elif metric.name.lower() in ["accuracy", "precision", "recall"]:
@mattbit (Member): We should do this check differently. Let's use some attribute on the metric object, like is_binary_metric or similar.

@mattbit (Member): Maybe it's even better to give the Metric class the responsibility of calculating the contingency table entries.
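
A hedged sketch of what the two suggestions could look like together (class and method names are hypothetical, not the actual library API): a flag on the metric class replaces the name check, and the metric itself produces the success/failure counts that feed a contingency table row.

import numpy as np

class ClassificationPerformanceMetric:
    is_binary_metric = False  # hypothetical flag replacing the metric.name check

class Accuracy(ClassificationPerformanceMetric):
    name = "Accuracy"
    greater_is_better = True
    is_binary_metric = True

    @staticmethod
    def binary_counts(y_true, y_pred):
        # Success/failure counts from which one contingency table row is built.
        correct = int(np.sum(np.asarray(y_true) == np.asarray(y_pred)))
        return [correct, len(y_true) - correct]

# In the detector, the name-based branch above could then become something like:
#   if metric.is_binary_metric:
#       ctable = [metric.binary_counts(slice_y, slice_pred),
#                 metric.binary_counts(comp_y, comp_pred)]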

)
except ValueError as err:
pvalue = np.nan
logger.info(f"PerformanceBiasDetector: p-value could not be calculated: {err}")
@mattbit (Member): logger.debug

@luca-rossi requested a review from @mattbit December 8, 2023 18:01
@mattbit (Member) left a comment:

Sorry, some small changes

value: float
affected_samples: int
raw_values: Optional[np.ndarray] = None
ctable_values: Optional[list[list[int]]] = None
@mattbit (Member): We can't have a contingency table.

@mattbit (Member): We can be more general and maybe just get a categorical or binary representation? Something like binary_counts or binary_representation?

@luca-rossi (Contributor Author): Same as above.
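
Following the binary_counts suggestion above, the result object could look roughly like this (MetricResult is a placeholder name for the result object discussed here):

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MetricResult:  # placeholder name, not necessarily the real class
    value: float
    affected_samples: int
    raw_values: Optional[np.ndarray] = None
    # [num_successes, num_failures] for metrics with a binary per-sample outcome,
    # rather than a full contingency table attached to a single result.
    binary_counts: Optional[list[int]] = None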

class Accuracy(ClassificationPerformanceMetric):
name = "Accuracy"
greater_is_better = True
has_contingency_table = True
@mattbit (Member): Same as above, this could be more general. In any case, I'm not sure it's needed if we have it on the result object.

@luca-rossi (Contributor Author): It's not needed, but I added it for clarity and "just in case". Should I remove it?

@luca-rossi (Contributor Author): Actually, it's needed because it's used to decide whether to calculate the binary counts (to prevent calculating them for classification metrics where it wouldn't make sense, e.g. the F1 Score).
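
To illustrate that point (helper names are hypothetical): the class-level flag is what lets the caller skip the binary counts for metrics like F1 where a success/failure split of the samples is not meaningful.

def _compute_metric_result(metric, y_true, y_pred):
    result = metric(y_true, y_pred)  # assumed to return a result object like the one above
    if getattr(metric, "has_binary_counts", False):
        # Only ratio-of-successes metrics (accuracy, precision, recall) get counts;
        # F1, balanced accuracy, AUC leave binary_counts as None.
        result.binary_counts = metric.binary_counts(y_true, y_pred)
    return result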

Comment on lines 326 to 327
# column_types=dataset.column_types.copy(),
# validation=False,
@mattbit (Member): Why is this commented out?

@luca-rossi (Contributor Author): Forgot to remove it.

Comment on lines 332 to 333
# column_types=dataset.column_types.copy(),
# validation=False,
@mattbit (Member): Same.


def test_calculate_slice_metrics():
SLICE_SIZE = 500
np.random.seed(42)
@mattbit (Member): We should have the seed somewhere in the detector, not set the global seed here.

@luca-rossi (Contributor Author): Yes, not sure why I put it there.
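
A sketch of the reviewer's suggestion (the constructor signature is illustrative, only the seeding part is shown): keep a seeded generator on the detector so resampling is reproducible without tests mutating NumPy's global random state.

import numpy as np

class PerformanceBiasDetector:  # signature illustrative, not the merged code
    def __init__(self, perm_test_resamples=1000, seed=1729):
        self.perm_test_resamples = perm_test_resamples
        # Local Generator: reproducible permutation tests, no np.random.seed in tests.
        self._rng = np.random.default_rng(seed)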

@luca-rossi requested a review from @mattbit December 13, 2023 19:53
@mattbit (Member) left a comment:

Looks good, I’ll do some local testing.

class BalancedAccuracy(ClassificationPerformanceMetric):
name = "Balanced Accuracy"
greater_is_better = True
has_binary_counts = False
@mattbit (Member): Not needed.

class F1Score(SklearnClassificationScoreMixin, ClassificationPerformanceMetric):
name = "F1 Score"
greater_is_better = True
has_binary_counts = False
@mattbit (Member): Not needed.

class AUC(PerformanceMetric):
name = "ROC AUC"
greater_is_better = True
has_binary_counts = False
@mattbit (Member): Not needed.

@mattbit enabled auto-merge December 14, 2023 15:22

@mattbit merged commit b63c3d1 into main Dec 14, 2023
@mattbit deleted the task/GSK-1279-statistical-tests branch December 14, 2023 15:37