Conversation


@sky-2002 sky-2002 commented Oct 2, 2025

Description

Adds lm-evaluation-harness metrics to the metrics lineup.

Related Issue

Fixes #378

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added basic unit tests for two tasks from the eval harness.
  • Will add more tests as needed.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

  • EvalHarnessMetric inherits from BaseMetric since eval harness metrics are model-level metrics.
  • Open to discussing and updating the code and tests.

Note

Adds LMEvalMetric wrapping lm-evaluation-harness metrics and enables lm_eval:<task> requests in Task, with tests and optional dependency.

  • Evaluation Metrics:
    • Introduce LMEvalMetric (src/pruna/evaluation/metrics/metric_evalharness.py) wrapping lm-evaluation-harness metrics via registry/aggregation, with stateful accumulation and result reporting.
    • Register metric under MetricRegistry as lm_eval_metric.
  • Task Integration:
    • Support lm_eval:<task_name> requests in Task (src/pruna/evaluation/task.py), resolving metrics via lm_eval.tasks.get_task_dict and instantiating LMEvalMetric per task metric (see the usage sketch below).
  • Tests:
    • Add tests/evaluation/test_evalharness_metrics.py covering BLEU-like scoring, empty input behavior, and preds/refs length mismatch.
  • Dependencies:
    • Add optional group evalharness with lm-eval>=0.4.0 in pyproject.toml.
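
For orientation, a minimal usage sketch of the request format described above; the task name, the datamodule variable, and the device are illustrative placeholders, not taken from the PR:

from pruna.evaluation.task import Task

# "hellaswag" and `my_datamodule` are placeholders; any lm-eval task name routed
# through the new "lm_eval:<task>" request format would be used the same way.
task = Task(
    request="lm_eval:hellaswag",
    datamodule=my_datamodule,
    device="cuda",
)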

Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.

@sdiazlor sdiazlor requested a review from begumcig October 6, 2025 09:40
@simlang simlang requested review from simlang and removed request for simlang October 10, 2025 14:32
Member

@begumcig begumcig left a comment


Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase, we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite. To align with our structure, we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics together. I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface), or alternatively we could integrate some metrics from the lm-eval metrics registry as StatefulMetrics.

What do you think? Do you think either of these approaches would align with the lm-eval interface? Thank you again!

Author

sky-2002 commented Oct 19, 2025

@begumcig Yeah, I agree; while implementing it, I wasn't very comfortable with this approach either.
From what I see in the Pruna code, a task has a request, which is a list of metrics, exactly like an lm-eval task, which has its metrics defined in the YAML files.

alternatively, we could integrate some metrics from lm-eval metrics registry as StatefulMetrics.

Yes, this is a much cleaner way and aligns with the lm-eval interface (adding each lm-eval metric as a metric in Pruna); let's do this. Will update the code.
Also, you are right, I misunderstood StatefulMetric.

I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface)

Don't the above two need to be done together? We either need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics needed for it.

For example, given that we have all metrics implemented in pruna:

We can instantiate a task for an lm-eval task:

from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]

squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule, 
    device="cuda"
)

or have a task factory:

from lm_eval import tasks as lm_tasks  # lm-eval's task registry

def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    # model_args is unused in this sketch; depending on the lm-eval version,
    # get_task_dict may be needed instead of get_task.
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)

Wdyt, how should we go about this?

@begumcig
Member

@sky-2002 Yes, that's exactly how our task and stateful metric interface work!

I was suggesting that you could:

  • Either create something like an LMEvalTask inheriting from our Task, which would also integrate the attributes and functions required for an lm_eval task.
  • Or, exactly as you suggested, integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface, and it would be amazing!

Once we have the metrics as Pruna metrics, we can use them with our Task interface like you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆


cursor bot commented Oct 23, 2025

Bug: Mismatched Class and Method Interfaces

The test file attempts to import and use EvalHarnessMetric class, but the actual implementation defines LMEvalMetric class. Additionally, the test uses a completely different constructor signature (expecting tasks, model_args, device parameters) and calls compute(model=None, dataloader=None) method, while the actual LMEvalMetric constructor takes metric_name and optional call_type parameters, and the compute() method takes no parameters. The test expects result structure with task_scores in params, but the implementation returns different parameter structure. This complete interface mismatch will cause the tests to fail with ImportError and TypeError.
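
For reference, a minimal sketch of a test aligned with the implemented interface described above; the metric name and the assertion are illustrative assumptions, not taken from the PR:

from pruna.evaluation.metrics.metric_evalharness import LMEvalMetric

def test_lm_eval_metric_constructs_and_computes():
    # Constructor takes metric_name (and optionally call_type), per the report above.
    metric = LMEvalMetric(metric_name="exact_match")  # "exact_match" is an assumed metric name
    # compute() takes no parameters in the actual implementation.
    result = metric.compute()
    assert result is not None  # the exact result structure is not asserted here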



cursor bot commented Oct 23, 2025

Bug: Error Message Mismatch with New Request Types

The error message references AVAILABLE_REQUESTS constant which only contains "image_generation_quality", but the function now supports requests starting with "lm_eval:". This creates misleading error messages that don't inform users about the newly supported lm_eval request format, potentially causing confusion when users receive errors claiming only image_generation_quality is available.



cursor bot commented Oct 23, 2025

Bug: Outdated Constants Mislead on New Features

AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading since lm_eval requests (with "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.
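
A hedged sketch of one possible fix for this and the previous report; AVAILABLE_REQUESTS comes from the reports, while the helper name and the message wording are assumptions:

AVAILABLE_REQUESTS = ["image_generation_quality"]
LM_EVAL_PREFIX = "lm_eval:"

def _check_request_supported(request: str) -> None:
    # Accept both the existing named requests and the new lm_eval:<task> format,
    # and mention both in the error message so it stays accurate.
    if request in AVAILABLE_REQUESTS or request.startswith(LM_EVAL_PREFIX):
        return
    raise ValueError(
        f"Unsupported request '{request}'. Available requests: {AVAILABLE_REQUESTS}, "
        f"or use '{LM_EVAL_PREFIX}<task_name>' for lm-eval tasks."
    )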



cursor bot commented Oct 23, 2025

Bug: Function Lacks Error Handling for Missing Keys and Attributes

The _get_lm_eval_task_metrics function has no error handling for potential KeyError when accessing task_dict[task_name] or AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this will cause unhandled exceptions.
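
A hedged sketch of how the lookup could fail fast with clearer errors; the function name comes from the report, while the lm-eval call and the return shape are assumptions:

from lm_eval.tasks import get_task_dict  # available in lm-eval >= 0.4.0

def _get_lm_eval_task_metrics(task_name: str) -> list[str]:
    task_dict = get_task_dict([task_name])
    if task_name not in task_dict:
        raise ValueError(f"lm-eval task '{task_name}' was not found in the task registry.")
    task = task_dict[task_name]
    metric_list = getattr(getattr(task, "config", None), "metric_list", None)
    if not metric_list:
        raise ValueError(f"lm-eval task '{task_name}' does not define a metric_list.")
    return [entry["metric"] for entry in metric_list]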



cursor bot commented Oct 23, 2025

Bug: Empty Task Name Validation Missing

No validation for empty task name after splitting the request. If the request is exactly "lm_eval:" with no task name following, task_name will be an empty string, which could cause issues in subsequent function calls.
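
A hedged sketch of the missing validation; the helper name is an assumption:

def _parse_lm_eval_task_name(request: str) -> str:
    # Assumes `request` has already been identified as an lm_eval request.
    task_name = request[len("lm_eval:"):].strip()
    if not task_name:
        raise ValueError(
            "Expected a task name after 'lm_eval:', e.g. 'lm_eval:hellaswag'."
        )
    return task_name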


Author

sky-2002 commented Oct 23, 2025

@begumcig I tried implementing this, but I have a few concerns.

  • We can create wrappers around lm-eval metrics, but then we would have to wrap everything else too: for a task such as hellaswag, we need its dataset (specified in the YAML), its pre-processing and post-processing functions, etc., which might make our code brittle. It could also get confusing and lead to a lot of duplicate code, because all of this already exists in lm-eval.

Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.

I realise this would be much cleaner and would keep lm-eval-dependent code contained, so we wouldn't need to keep updating it. But I'm not sure whether it should inherit from the existing Task code; feel free to give suggestions, as I'm a little unsure about what a good way to do this would be.

Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?


github-actions bot commented Nov 3, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 3, 2025

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add lm-eval to our Metrics
