feat(evaluation): Add lm-eval to Pruna Metrics #380
base: main
Conversation
Hi @sky-2002, thank you so much for your contribution! We'd love to bring lm-eval into Pruna. Integrating the entire evaluation suite in one go is a big task, but I think with some refactoring we can still make this work! In our codebase, we usually use the StatefulMetric interface for a single metric, whereas lm-eval is more of a full benchmarking suite. To align with our structure, we might need to break it into smaller components 🔨🔨. We also have a Task interface that coordinates multiple metrics. I was thinking we could either integrate one task from lm-eval as one Task (inheriting from our Task interface), or alternatively integrate some metrics from the lm-eval metrics registry as StatefulMetrics.
What do you think? Would either of these approaches align with the lm-eval interface? Thank you again!
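For illustration, a rough sketch of the second option (wrapping a single lm-eval-style metric as a stateful Pruna metric) could look like the following. The update/compute hooks are assumptions about Pruna's StatefulMetric interface, and the hard-coded exact-match scorer stands in for what would really be pulled from lm-eval's metric registry:

```python
# Sketch only: the update()/compute() hooks are assumptions about Pruna's
# StatefulMetric interface, and a real wrapper would fetch the per-sample
# scorer and aggregation from lm-eval's metric registry instead of the
# hard-coded exact-match used here for illustration.
class LMEvalExactMatch:  # in Pruna this would inherit from the StatefulMetric base
    def __init__(self):
        self.scores: list[float] = []

    def update(self, predictions: list[str], references: list[str]) -> None:
        # Accumulate one score per (prediction, reference) pair across batches.
        for pred, ref in zip(predictions, references):
            self.scores.append(1.0 if pred.strip() == ref.strip() else 0.0)

    def compute(self) -> float:
        # Aggregate the accumulated per-sample scores (mean, as lm-eval does
        # for exact_match) into a single value.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```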
@begumcig Yeah, I agree; while implementing it I also wasn't very comfortable with this approach.
Yes, this is a much cleaner way and aligns with
The above two need to be done together, no? We either need to implement a task and define its metrics, or we need a way to create a task on the fly given a task name and the metrics needed for it. For example, given that we have all the metrics implemented in Pruna, we can instantiate a task for an lm-eval task:

```python
from pruna.evaluation.task import Task

squad_metrics = ["lm_eval_exact_match", "lm_eval_f1"]
squad_task = Task(
    request=squad_metrics,
    datamodule=squad_datamodule,
    device="cuda",
)
```

or have a task factory:

```python
def make_pruna_task_from_lmeval(task_name, datamodule, model_args, device="cuda"):
    lm_task = lm_tasks.get_task(task_name)
    metrics = [f"lm_eval_{m['metric']}" for m in lm_task.config.metric_list]
    return Task(request=metrics, datamodule=datamodule, device=device)
```

Wdyt, how should we go about this?
@sky-2002 Yes, that's exactly how our Task and StatefulMetric interfaces work! I was suggesting that you could:
Once we have the metrics as Pruna metrics, we can use them with our Task interface as you have shown in the example! This all looks super good to me, and thank you so much for taking the time to look into this, you are a champ 🏆🏆🏆🏆
Bug: Mismatched Class and Method Interfaces
The test file attempts to import and use
Bug: Error Message Mismatch with New Request Types
The error message references
Bug: Outdated Constants Mislead on New Features
The AVAILABLE_REQUESTS constant is outdated and doesn't include the newly added lm_eval functionality. The error message on lines 272-273 references this constant, making it misleading, since lm_eval requests (with the "lm_eval:" prefix) are now supported but not listed in AVAILABLE_REQUESTS.
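One possible shape of the fix, with hypothetical names (the AVAILABLE_REQUESTS values and the validation helper below are placeholders, not the PR's actual code): treat lm_eval:-prefixed requests as their own family when validating, so the error message stays accurate.

```python
# Hypothetical sketch: placeholder AVAILABLE_REQUESTS values and a validation
# helper that acknowledges the new "lm_eval:" prefix; not the PR's actual code.
LM_EVAL_PREFIX = "lm_eval:"
AVAILABLE_REQUESTS = ["image_generation_quality", "text_generation_quality"]  # placeholders


def validate_request(request: str) -> None:
    if request.startswith(LM_EVAL_PREFIX) or request in AVAILABLE_REQUESTS:
        return
    raise ValueError(
        f"Unknown request '{request}'. Available requests: {AVAILABLE_REQUESTS}, "
        f"or '{LM_EVAL_PREFIX}<task_name>' for lm-eval tasks."
    )
```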
Bug: Function Lacks Error Handling for Missing Keys and Attributes
The _get_lm_eval_task_metrics function has no error handling for a potential KeyError when accessing task_dict[task_name] or an AttributeError when accessing task.config.metric_list. If get_task_dict returns a dict without the requested task_name key, or if the task object doesn't have a config attribute with metric_list, this causes unhandled exceptions.
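A sketch of the kind of defensive handling being asked for; the function body below is illustrative and may not match the PR's implementation:

```python
# Illustrative defensive version of _get_lm_eval_task_metrics; the PR's actual
# implementation may differ. Assumes lm-eval >= 0.4's get_task_dict.
from lm_eval.tasks import get_task_dict


def _get_lm_eval_task_metrics(task_name: str) -> list[str]:
    task_dict = get_task_dict([task_name])
    if task_name not in task_dict:
        raise ValueError(f"lm-eval task '{task_name}' was not found in the task registry.")
    task = task_dict[task_name]
    metric_list = getattr(getattr(task, "config", None), "metric_list", None)
    if not metric_list:
        raise ValueError(f"lm-eval task '{task_name}' does not define a metric_list.")
    return [entry["metric"] for entry in metric_list]
```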
@begumcig I tried implementing it, but I have a few concerns.
I realise that this would be much cleaner and would leave
Or if we continue this way, how do we handle the datasets for each task and other related attributes and functions?
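One direction for the dataset question, sketched with illustrative glue code that is not part of Pruna: lm-eval tasks already expose their documents and prompt formatting, so a datamodule could be derived from the task itself.

```python
# Illustrative only: yields (prompt, target) pairs from an lm-eval task so that
# a Pruna datamodule could be built around them; how Pruna consumes these pairs
# is not shown and would depend on its datamodule interface.
from lm_eval.tasks import get_task_dict


def iter_lm_eval_examples(task_name: str):
    task = get_task_dict([task_name])[task_name]
    docs = task.test_docs() if task.has_test_docs() else task.validation_docs()
    for doc in docs:
        yield task.doc_to_text(doc), task.doc_to_target(doc)
```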
This PR has been inactive for 10 days and is now marked as stale.
Description
Adds evaluation harness to the metrics lineup.
Related Issue
Fixes #378
Type of Change
How Has This Been Tested?
Checklist
Additional Notes
EvalHarnessMetric inherits from BaseMetric since eval harness metrics are model-level metrics.
Note
Adds LMEvalMetric wrapping lm-evaluation-harness metrics and enables lm_eval:<task> requests in Task, with tests and optional dependency.
- LMEvalMetric (src/pruna/evaluation/metrics/metric_evalharness.py) wraps lm-evaluation-harness metrics via registry/aggregation, with stateful accumulation and result reporting.
- Registered in MetricRegistry as lm_eval_metric.
- Supports lm_eval:<task_name> requests in Task (src/pruna/evaluation/task.py), resolving metrics via lm_eval.tasks.get_task_dict and instantiating LMEvalMetric per task metric.
- Tests in tests/evaluation/test_evalharness_metrics.py covering BLEU-like scoring, empty input behavior, and preds/refs length mismatch.
- Optional dependency extra evalharness with lm-eval>=0.4.0 in pyproject.toml.
Written by Cursor Bugbot for commit 3767da9. This will update automatically on new commits.
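Assuming the request syntax described in the note, usage might look roughly like the snippet below; the datamodule and device arguments mirror the earlier discussion, and the exact Task signature is not verified here.

```python
# Hypothetical usage of the lm_eval:<task_name> request syntax described above.
# squad_datamodule is assumed to be a datamodule prepared for the SQuAD task.
from pruna.evaluation.task import Task

squad_task = Task(request="lm_eval:squad", datamodule=squad_datamodule, device="cuda")
```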