I'm reviewing whether our current methods for calculating and aggregating correctness and performance scores make sense. To help with this, I am comparing BackendBench with KernelBench. See the table below for an easy comparison:
| | BackendBench | KernelBench |
| --- | --- | --- |
| Correctness score per op | Numeric score (ratio of passed tests) | Binary score (whether all tests passed) |
| Correctness score aggregation | Mean | Mean |
| Number of performance tests per op | Many | 1 |
| Number of runs per performance test | Many | Many |
| Performance score per op | Geometric mean of speedup (incorrect = 1) | Amortized speedup (multiple runs) |
| Performance score aggregation | Geometric mean | Geometric mean (correct tests only) |
| Number of tests per op | Varies (e.g. opinfo) | Fixed |
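To make the first row concrete, here is a minimal sketch of the two correctness-scoring schemes. The function names are mine for illustration; this is not code from either benchmark:

```python
def ratio_score(results: list[bool]) -> float:
    """BackendBench-style: fraction of an op's correctness tests that passed."""
    return sum(results) / len(results)


def binary_score(results: list[bool]) -> float:
    """KernelBench-style: 1.0 only if every correctness test passed."""
    return 1.0 if all(results) else 0.0


results = [True, True, False, True]  # hypothetical outcomes for one op
print(ratio_score(results))   # 0.75
print(binary_score(results))  # 0.0
```

The ratio gives partial credit for partially correct implementations, while the binary score treats any failure as total failure.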
Based on this comparison, here are the questions I want to analyze:
- Which correctness scoring method is better? Should we use a simple binary correct/incorrect result or the ratio of passed tests?
- For performance, does running the same test multiple times give us more accurate speedup measurements?
- How should we treat incorrect tests, and how does that choice affect performance scores? (See the sketch after this list.)
- Since the number of tests per op varies, does this give some ops more weight in the final score?
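The third question is easiest to see with numbers. The sketch below applies both conventions from the table to the same hypothetical list of per-test speedups; the labels are illustrative and neither snippet is lifted from either codebase:

```python
import math


def geomean(xs: list[float]) -> float:
    """Geometric mean via log-space averaging."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical per-test speedups for one op; None marks an incorrect test.
speedups = [2.0, 1.5, None, 4.0]

# BackendBench-style: an incorrect test counts as a neutral speedup of 1.
with_penalty = geomean([s if s is not None else 1.0 for s in speedups])

# KernelBench-style: incorrect results are dropped before aggregating.
correct_only = geomean([s for s in speedups if s is not None])

print(round(with_penalty, 2))  # 1.86 -- pulled toward 1 by the failure
print(round(correct_only, 2))  # 2.29 -- the failure is invisible here
```

Substituting 1 for incorrect tests makes failures drag the score toward "no speedup", whereas dropping them can inflate the reported speedup of an implementation that is often wrong.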
Edit 1: BackendBench does multiple runs to measure performance as well.
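For context, the usual multiple-run timing pattern looks roughly like this. This is a generic sketch, not either benchmark's harness, and `measure`, `warmup`, and `runs` are hypothetical names:

```python
import statistics
import time
from typing import Callable


def measure(fn: Callable[[], None], warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock time of fn over several runs, in seconds.

    Note: timing GPU kernels this way would also require synchronizing
    the device (e.g. torch.cuda.synchronize()) before reading the timer.
    """
    for _ in range(warmup):  # warm caches / lazy initialization first
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Speedup of a candidate over a baseline would then be:
# speedup = measure(baseline_op) / measure(candidate_op)
```

Multiple runs mainly reduce the variance of the timing estimate; a robust statistic like the median also guards against outlier runs.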