Conversation


@sky-2002 sky-2002 commented Oct 2, 2025

Description

Adds lm-evaluation-harness metrics to the metrics lineup.

Related Issue

Fixes #378

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Added basic unit tests for two tasks from the eval harness.
  • Will add more tests as needed.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

  • EvalHarnessMetric inherits from BaseMetric since eval harness metrics are model-level metrics.
  • Open to discussing and updating the code and tests.

Note

Adds LMEvalMetric wrapping lm-evaluation-harness metrics and enables lm_eval:<task> requests in Task, with tests and optional dependency.

  • Evaluation Metrics:
    • Introduce LMEvalMetric (src/pruna/evaluation/metrics/metric_evalharness.py) wrapping lm-evaluation-harness metrics via registry/aggregation, with stateful accumulation and result reporting.
    • Register metric under MetricRegistry as lm_eval_metric.
  • Task Integration:
    • Support lm_eval:<task_name> requests in Task (src/pruna/evaluation/task.py), resolving metrics via lm_eval.tasks.get_task_dict and instantiating LMEvalMetric per task metric (see the usage sketch below).
  • Tests:
    • Add tests/evaluation/test_evalharness_metrics.py covering BLEU-like scoring, empty input behavior, and preds/refs length mismatch.
  • Dependencies:
    • Add optional group evalharness with lm-eval>=0.4.0 in pyproject.toml.
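
For orientation, a minimal usage sketch of the request format described above; the task name, the datamodule variable, and the device are illustrative placeholders, not taken from the PR:

from pruna.evaluation.task import Task

# "hellaswag" and `my_datamodule` are placeholders; any lm-eval task name routed
# through the new "lm_eval:<task>" request format would be used the same way.
task = Task(
    request="lm_eval:hellaswag",
    datamodule=my_datamodule,
    device="cuda",
)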

Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.

@sdiazlor sdiazlor requested a review from begumcig October 6, 2025 09:40
@simlang simlang requested review from simlang and removed request for simlang October 10, 2025 14:32
Member

@begumcig begumcig left a comment


Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase, we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite. To align with our structure, we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics together. I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface), or alternatively we could integrate some metrics from the lm-eval metrics registry as StatefulMetrics.

What do you think? Do you think either of these approaches would align with the lm-eval interface? Thank you again!

Author

sky-2002 commented Oct 19, 2025

@begumcig Yeah, I agree; while implementing it, I wasn't very comfortable with this approach either.
From what I see in the Pruna code, a task has a request, which is a list of metrics, exactly like an lm-eval task, which has its metrics defined in the YAML files.

alternatively, we could integrate some metrics from lm-eval metrics registry as StatefulMetrics.

Yes, this is a much cleaner way and aligns with the lm-eval interface (adding each lm-eval metric as a metric in Pruna); let's do this. Will update the code.
Also, you are right, I misunderstood StatefulMetric.

I was thinking maybe we could integrate one task from lm-eval as one Task (inheriting from our Task interface)

Don't the above two need to be done together? We either need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics needed for it.

For example, given that we have all metrics implemented in pruna:

We can instantiate a task for an lm-eval task:

from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]

squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule, 
    device="cuda"
)

or have a task factory:

from lm_eval import tasks as lm_tasks  # lm-eval's task registry

def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    # model_args is unused in this sketch; depending on the lm-eval version,
    # get_task_dict may be needed instead of get_task.
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)

Wdyt, how should we go about this?

@begumcig
Member

@sky-2002 Yes, that's exactly how our task and stateful metric interface work!

I was suggesting that you could:

  • Either create something like an LMEvalTask inheriting from our Task, which would also integrate the attributes and functions required for an lm_eval task.
  • Or, exactly as you suggested, integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface, and it would be amazing!

Once we have the metrics as Pruna metrics, we can use them with our Task interface like you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆


cursor bot commented Oct 23, 2025

Bug: Mismatched Class and Method Interfaces

The test file attempts to import and use EvalHarnessMetric class, but the actual implementation defines LMEvalMetric class. Additionally, the test uses a completely different constructor signature (expecting tasks, model_args, device parameters) and calls compute(model=None, dataloader=None) method, while the actual LMEvalMetric constructor takes metric_name and optional call_type parameters, and the compute() method takes no parameters. The test expects result structure with task_scores in params, but the implementation returns different parameter structure. This complete interface mismatch will cause the tests to fail with ImportError and TypeError.
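
For reference, a minimal sketch of a test aligned with the implemented interface described above; the metric name and the assertion are illustrative assumptions, not taken from the PR:

from pruna.evaluation.metrics.metric_evalharness import LMEvalMetric

def test_lm_eval_metric_constructs_and_computes():
    # Constructor takes metric_name (and optionally call_type), per the report above.
    metric = LMEvalMetric(metric_name="exact_match")  # "exact_match" is an assumed metric name
    # compute() takes no parameters in the actual implementation.
    result = metric.compute()
    assert result is not None  # the exact result structure is not asserted here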



cursor bot commented Oct 23, 2025

Bug: Error Message Mismatch with New Request Types

The error message references AVAILABLE_REQUESTS constant which only contains "image_generation_quality", but the function now supports requests starting with "lm_eval:". This creates misleading error messages that don't inform users about the newly supported lm_eval request format, potentially causing confusion when users receive errors claiming only image_generation_quality is available.



cursor bot commented Oct 23, 2025

Bug: Outdated Constants Mislead on New Features

AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading since lm_eval requests (with "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.
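
A hedged sketch of one possible fix for this and the previous report; AVAILABLE_REQUESTS comes from the reports, while the helper name and the message wording are assumptions:

AVAILABLE_REQUESTS = ["image_generation_quality"]
LM_EVAL_PREFIX = "lm_eval:"

def _check_request_supported(request: str) -> None:
    # Accept both the existing named requests and the new lm_eval:<task> format,
    # and mention both in the error message so it stays accurate.
    if request in AVAILABLE_REQUESTS or request.startswith(LM_EVAL_PREFIX):
        return
    raise ValueError(
        f"Unsupported request '{request}'. Available requests: {AVAILABLE_REQUESTS}, "
        f"or use '{LM_EVAL_PREFIX}<task_name>' for lm-eval tasks."
    )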



cursor bot commented Oct 23, 2025

Bug: Function Lacks Error Handling for Missing Keys and Attributes

The _get_lm_eval_task_metrics function has no error handling for potential KeyError when accessing task_dict[task_name] or AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this will cause unhandled exceptions.
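
A hedged sketch of how the lookup could fail fast with clearer errors; the function name comes from the report, while the lm-eval call and the return shape are assumptions:

from lm_eval.tasks import get_task_dict  # available in lm-eval >= 0.4.0

def _get_lm_eval_task_metrics(task_name: str) -> list[str]:
    task_dict = get_task_dict([task_name])
    if task_name not in task_dict:
        raise ValueError(f"lm-eval task '{task_name}' was not found in the task registry.")
    task = task_dict[task_name]
    metric_list = getattr(getattr(task, "config", None), "metric_list", None)
    if not metric_list:
        raise ValueError(f"lm-eval task '{task_name}' does not define a metric_list.")
    return [entry["metric"] for entry in metric_list]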



cursor bot commented Oct 23, 2025

Bug: Empty Task Name Validation Missing

No validation for empty task name after splitting the request. If the request is exactly "lm_eval:" with no task name following, task_name will be an empty string, which could cause issues in subsequent function calls.
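
A hedged sketch of the missing validation; the helper name is an assumption:

def _parse_lm_eval_task_name(request: str) -> str:
    # Assumes `request` has already been identified as an lm_eval request.
    task_name = request[len("lm_eval:"):].strip()
    if not task_name:
        raise ValueError(
            "Expected a task name after 'lm_eval:', e.g. 'lm_eval:hellaswag'."
        )
    return task_name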


Author

sky-2002 commented Oct 23, 2025

@begumcig I tried implementing this, but I have a few concerns.

  • We can create wrappers around lm-eval metrics, but then we would have to wrap everything else too: for a task such as hellaswag, we need its dataset (specified in the YAML), its pre-processing and post-processing functions, etc., which might make our code brittle. It could also get confusing and lead to a lot of duplicate code, because all of this already exists in lm-eval.

Either create something like a LMEvalTask inheriting from our Task, that would also integrate the attributes and functions that are required for lm_eval task.

I realise this would be much cleaner and would keep lm-eval-dependent code contained, so we wouldn't need to keep updating it. But I'm not sure whether it should inherit from the existing Task code; feel free to give suggestions, as I'm a little unsure about what a good way to do this would be.

Or exactly like you suggested integrate the metrics as stateful metrics and initialize them using our own Task interface. I think the solution you suggested aligns perfectly with our interface and it would be amazing!

Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?


github-actions bot commented Nov 3, 2025

This PR has been inactive for 10 days and is now marked as stale.

@github-actions github-actions bot added the stale label Nov 3, 2025

Development

Successfully merging this pull request may close these issues.

[FEATURE] Add lm-eval to our Metrics
