I'm reviewing whether our current methods for calculating and aggregating correctness and performance scores make sense. To help with this, I am comparing BackendBench with KernelBench. See the table below for an easy comparison:
| | BackendBench | KernelBench |
| --- | --- | --- |
| Correctness score per op | Numeric score (ratio of passed tests) | Binary score (whether all tests passed) |
| Correctness score aggregation | Mean | Mean |
| Number of performance tests per op | Many | 1 |
| Number of runs per performance test | Many | Many |
| Performance score per op | Geometric mean of speedup (incorrect = 1) | Amortized speedup (multiple runs) |
| Performance score aggregation | Geometric mean | Geometric mean (correct tests only) |
| Number of tests per op | Varies (e.g. opinfo) | Fixed |
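To make the first row concrete, here is a minimal sketch of the two correctness-scoring schemes. The function names are mine for illustration; this is not code from either benchmark:

```python
def ratio_score(results: list[bool]) -> float:
    """BackendBench-style: fraction of an op's correctness tests that passed."""
    return sum(results) / len(results)


def binary_score(results: list[bool]) -> float:
    """KernelBench-style: 1.0 only if every correctness test passed."""
    return 1.0 if all(results) else 0.0


results = [True, True, False, True]  # hypothetical outcomes for one op
print(ratio_score(results))   # 0.75
print(binary_score(results))  # 0.0
```

The ratio gives partial credit for partially correct implementations, while the binary score treats any failure as total failure.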
Based on this comparison, here are the questions I want to analyze:
- Which correctness scoring method is better? Should we use a simple binary correct/incorrect result or the ratio of passed tests?
- For performance, does running the same test multiple times give us more accurate speedup measurements?
- How should we treat incorrect tests, and how does that choice affect performance scores? (See the sketch after this list.)
- Since the number of tests per op varies, does this give some ops more weight in the final score?
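The third question is easiest to see with numbers. The sketch below applies both conventions from the table to the same hypothetical list of per-test speedups; the labels are illustrative and neither snippet is lifted from either codebase:

```python
import math


def geomean(xs: list[float]) -> float:
    """Geometric mean via log-space averaging."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical per-test speedups for one op; None marks an incorrect test.
speedups = [2.0, 1.5, None, 4.0]

# BackendBench-style: an incorrect test counts as a neutral speedup of 1.
with_penalty = geomean([s if s is not None else 1.0 for s in speedups])

# KernelBench-style: incorrect results are dropped before aggregating.
correct_only = geomean([s for s in speedups if s is not None])

print(round(with_penalty, 2))  # 1.86 -- pulled toward 1 by the failure
print(round(correct_only, 2))  # 2.29 -- the failure is invisible here
```

Substituting 1 for incorrect tests makes failures drag the score toward "no speedup", whereas dropping them can inflate the reported speedup of an implementation that is often wrong.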
Edit 1: BackendBench does multiple runs to measure performance as well.
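For context, the usual multiple-run timing pattern looks roughly like this. This is a generic sketch, not either benchmark's harness, and `measure`, `warmup`, and `runs` are hypothetical names:

```python
import statistics
import time
from typing import Callable


def measure(fn: Callable[[], None], warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock time of fn over several runs, in seconds.

    Note: timing GPU kernels this way would also require synchronizing
    the device (e.g. torch.cuda.synchronize()) before reading the timer.
    """
    for _ in range(warmup):  # warm caches / lazy initialization first
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Speedup of a candidate over a baseline would then be:
# speedup = measure(baseline_op) / measure(candidate_op)
```

Multiple runs mainly reduce the variance of the timing estimate; a robust statistic like the median also guards against outlier runs.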