reasoning model evaluation mmlu gpqa #13880
Merged
Commits (6):
- 3f005e1 reasoning model evaluation mmlu gpqa (ruchaa-apte)
- a92670c Apply isort and black reformatting (ruchaa-apte)
- bba3891 Addressing PR Comments (ruchaa-apte)
- 409d8d0 Apply isort and black reformatting (ruchaa-apte)
- a435625 Add license (ruchaa-apte)
- 61d87cd Merge branch 'NVIDIA:main' into reasoning_model_eval (ruchaa-apte)

# Evaluation Scripts

This directory contains scripts for deploying and evaluating NeMo models, as well as preparing the datasets required for model evaluation. Here, we focus only on the [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main), [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond), and [MMLU](https://huggingface.co/datasets/cais/mmlu) benchmarks. We use the datasets hosted on HuggingFace and prepare them with the `prepare_dataset.py` script in this folder. Once a dataset is prepared, we deploy our trained LoRA checkpoint using the `deploy_and_get_responses.py` script, which generates responses for the selected benchmark. Finally, the `evaluate_responses.py` script extracts the final answer from each model response and compares it against the ground truth.

## Prerequisites

- **Hardware Requirement:** At least 1 GPU is required to run the Llama 8B model. Ensure that your system meets the necessary specifications for GPU usage.
- **Environment Details:** This playbook has been tested on nvcr.io/nvidia/nemo:25.04 and is expected to work similarly in other environments. Launch the NeMo Framework container as follows:
```bash
docker run -it -p 8080:8080 --rm --gpus '"device=0"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.04
```

## Scripts Overview

### 1. `prepare_dataset.py`

**Purpose:** This script prepares a dataset for model evaluation. Based on the argument chosen by the user, it downloads one or all of the benchmark datasets from HuggingFace and rearranges each example into the question, its choices, and the correct answer expressed as one of the multiple-choice options ('A', 'B', 'C', 'D').

**How to Run:**
```bash
python prepare_dataset.py --datasets [mmlu, gpqa, gpqa_diamond, all]
```

**Arguments:**
- `--datasets`: Specify which datasets to process. Options are `mmlu`, `gpqa`, `gpqa_diamond`, or `all` (default).

**Step-by-Step:**
1. **Load the Dataset:** The script loads the dataset that you want to prepare.
2. **Process the Dataset:** It processes the dataset to ensure it is in the correct format.
3. **Save the Dataset:** The script saves the processed dataset as JSONL for later use (a minimal sketch of this conversion follows the note below).

**Note:** The [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main) and [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond) benchmarks are gated repositories. To access them, log in to the Hugging Face CLI and enter your token:
```bash
huggingface-cli login
```
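
To make the target format concrete, here is a minimal sketch (not the actual `prepare_dataset.py`) of the MMLU conversion. It assumes the HuggingFace `datasets` library and the `cais/mmlu` column names (`question`, `choices`, `answer`), and writes the `Question` / `Choice 1` through `Choice 4` / `Answer` records that `deploy_and_get_responses.py` expects.

```python
# Minimal sketch (not the actual prepare_dataset.py): download MMLU from
# HuggingFace and rewrite each row into the Question / Choice 1-4 / Answer
# JSONL layout that deploy_and_get_responses.py reads. The HuggingFace
# column names used here are assumptions.
import json

from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")

with open("mmlu_dataset.jsonl", "w") as f:
    for row in mmlu:
        record = {
            "Question": row["question"],
            "Choice 1": row["choices"][0],
            "Choice 2": row["choices"][1],
            "Choice 3": row["choices"][2],
            "Choice 4": row["choices"][3],
            # Map the integer label (0-3) to the letter options A-D.
            "Answer": "ABCD"[row["answer"]],
        }
        f.write(json.dumps(record) + "\n")
```

The GPQA splits use the same output layout; how the correct answer is placed among the four choice slots is left to the actual script.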

### 2. `deploy_and_get_responses.py`

**Purpose:** First, you need a NeMo 2 checkpoint of the model you would like to evaluate. Assuming you are using a NeMo 2 checkpoint trained in the previous step, make sure to mount the directory containing the checkpoint when starting the container. This script starts a server for the provided checkpoint in a separate process, deploying the model with the Triton Inference Server and setting up OpenAI-like endpoints for querying it. The server exposes three endpoints:

- `/v1/triton_health`
- `/v1/completions/`
- `/v1/chat/completions/`

The `/v1/triton_health` endpoint lets you check whether the underlying Triton server is ready. The `/v1/completions/` endpoint sends a prompt to the model as-is, without applying the chat template, and the model responds with a text completion. Finally, the `/v1/chat/completions/` endpoint allows multi-turn conversational interactions with the model. It accepts a structured list of messages with different roles (system, user, assistant) to maintain context and generates chat-like responses. Under the hood, a chat template is applied to turn the conversation into a single input string. An example request against the chat endpoint is sketched after the note below.

**Note:** The chat endpoint will not work correctly for base models, as they do not define a chat template. Deployment can take a couple of minutes, especially for larger models.
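
As an illustration, once the server reports ready you can query the chat endpoint with a few lines of Python. This is only a sketch; the port, model name, and system prompt are taken from the defaults used inside `deploy_and_get_responses.py`.

```python
# Sketch of a chat request against the deployed model; the URL, model name,
# and the "detailed thinking on" system prompt follow the defaults in
# deploy_and_get_responses.py.
import requests

chat_url = "http://0.0.0.0:8886/v1/chat/completions/"
payload = {
    "model": "triton_model",
    "messages": [
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    "max_tokens": 128,
}
response = requests.post(chat_url, json=payload)
print(response.content.decode())
```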

**How to Run:**
```bash
python deploy_and_get_responses.py --checkpoint_path <checkpoint_path> --dataset <dataset> --output_prefix <output_prefix> --max_tokens <max_tokens>
```

**Arguments:**
- `--checkpoint_path`: Path to the model checkpoint.
- `--dataset`: Dataset to evaluate on (choices: `gpqa_main`, `mmlu`, `gpqa_diamond`).
- `--output_prefix`: Prefix for the output file name (default: `evaluation_results`).
- `--max_tokens`: Maximum number of tokens to generate (default: 2048).

**Step-by-Step:**
1. **Load the NeMo Model:** The script loads the NeMo model that you want to deploy.
2. **Configure Deployment:** It configures the deployment settings. You can modify the chat template here to set 'detailed thinking on' or 'detailed thinking off' for your reasoning model (see the sketch below).
3. **Deploy the Model:** The script deploys the model and outputs the deployment status.
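
The chat-template tweak in step 2 boils down to the system message sent with each request. Below is a hedged sketch of what that change might look like, modeled on the `get_response()` helper in the script; the `detailed_thinking` parameter is hypothetical, and it assumes the off switch is the literal string 'detailed thinking off'.

```python
# Hypothetical variant of get_response() from deploy_and_get_responses.py:
# the reasoning mode is controlled entirely by the system message.
import requests

chat_url = "http://0.0.0.0:8886/v1/chat/completions/"  # deployment default in the script
model_name = "triton_model"  # deployment default in the script


def get_response(prompt, max_tokens, detailed_thinking=True):
    # Choose the system prompt that turns reasoning traces on or off.
    system_content = "detailed thinking on" if detailed_thinking else "detailed thinking off"
    chat_payload = {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
        "model": model_name,
        "max_tokens": max_tokens,
    }
    response = requests.post(chat_url, json=chat_payload)
    return response.content.decode()
```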

### 3. `evaluate_responses.py`

**Purpose:** This script ingests the model responses generated in the previous step, extracts the final answer from each response (the extraction rule is sketched after the argument list below), and compares it with the ground truth to evaluate the model's performance.

**How to Run:**
```bash
python evaluate_responses.py --input_file <input_file> --output_file <output_file> --model_name <model_name>
```

**Arguments:**
- `--input_file`: Path to the input JSONL file containing model responses.
- `--output_file`: Path to the output CSV file.
- `--model_name`: Name of the model for reporting results.
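
The extraction step relies on the answer format the benchmark prompt asks for ("The final answer is X"). A minimal illustration of that rule, using the same regex as `evaluate_responses.py`:

```python
# Illustration of the answer-extraction rule from evaluate_responses.py:
# the prompt instructs the model to end with "The final answer is X",
# and a regex pulls out that letter for comparison with the ground truth.
import re

sample_response = "Summing the two terms gives 4. The final answer is B"
match = re.search(r"The final answer is ([A-D])", sample_response)
extracted = match.group(1) if match else ""
print(extracted)  # -> B
```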

tutorials/llm/reasoning/evaluation/deploy_and_get_responses.py (151 additions, 0 deletions)

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import signal
import subprocess

import requests

from nemo.collections.llm import api
from nemo.collections.llm.evaluation.base import wait_for_fastapi_server
from nemo.utils import logging

logging.setLevel(logging.INFO)

deploy_process = None
base_url = None
chat_url = None
model_name = None


def parse_args():
    parser = argparse.ArgumentParser(description='Evaluate model on benchmark dataset')
    parser.add_argument('--checkpoint_path', type=str, required=True, help='Path to the model checkpoint')
    parser.add_argument(
        '--dataset',
        type=str,
        required=True,
        choices=['gpqa_main', 'mmlu', 'gpqa_diamond'],
        help='Dataset to evaluate on (gpqa_main, mmlu, gpqa_diamond)',
    )
    parser.add_argument(
        '--output_prefix', type=str, default='evaluation_results', help='Prefix for the output file name'
    )
    parser.add_argument(
        '--max_tokens', type=int, default=2048, help='Maximum number of tokens to generate in the response'
    )
    return parser.parse_args()


def create_benchmark_prompt(question, choice1, choice2, choice3, choice4):
    """Create benchmark prompt in the specified format"""
    prompt = f"""Given the following question and four candidate answers (A, B, C and D), choose the best answer.
Question: {question} A. {choice1} B. {choice2} C. {choice3} D. {choice4}
For simple problems, directly provide the answer with minimal explanation. For complex problems, use step-by-step format. Always conclude with: The final answer is [the_answer_letter], where the [the_answer_letter] is one of A, B, C or D."""
    return prompt


def load_model(checkpoint_path):
    """Initialize and load the model for inference"""
    global deploy_process, base_url, chat_url, model_name

    SCRIPTS_PATH = "/opt/NeMo/scripts"
    WORKSPACE = "."

    deploy_script = f"{SCRIPTS_PATH}/deploy/nlp/deploy_in_fw_oai_server_eval.py"
    deploy_process = subprocess.Popen(
        ['python', deploy_script, '--nemo_checkpoint', checkpoint_path],
    )

    base_url = "http://0.0.0.0:8886"
    model_name = "triton_model"
    chat_url = f"{base_url}/v1/chat/completions/"

    wait_for_fastapi_server(base_url=base_url, max_retries=600, retry_interval=10)
    logging.info("Model loaded and server is ready for inference")


def get_response(prompt, max_tokens):
    chat_payload = {
        "messages": [{"role": "system", "content": "detailed thinking on"}, {"role": "user", "content": prompt}],
        "model": model_name,
        "max_tokens": max_tokens,
    }
    response = requests.post(chat_url, json=chat_payload)
    return response.content.decode()


def main():
    args = parse_args()

    # Determine dataset file and output file based on dataset selection
    dataset_files = {
        'gpqa_main': 'gpqa_dataset.jsonl',
        'mmlu': 'mmlu_dataset.jsonl',
        'gpqa_diamond': 'gpqa_diamond_dataset.jsonl',
    }

    dataset_file = dataset_files[args.dataset]
    output_file = f"{args.output_prefix}_{args.dataset}_evaluation.jsonl"

    try:
        with open(dataset_file, "r") as f:
            problems = [json.loads(line) for line in f]

        load_model(args.checkpoint_path)

        # Open output file once before the loop
        with open(output_file, "w") as f:
            for i, problem in enumerate(problems):
                print(f"\n{'='*70}")
                print(f"Problem {i+1}/{len(problems)}")

                prompt = create_benchmark_prompt(
                    problem['Question'],
                    problem['Choice 1'],
                    problem['Choice 2'],
                    problem['Choice 3'],
                    problem['Choice 4'],
                )

                response = get_response(prompt, args.max_tokens)

                # Create result entry
                result = {
                    "question": problem['Question'],
                    "choices": {
                        "A": problem['Choice 1'],
                        "B": problem['Choice 2'],
                        "C": problem['Choice 3'],
                        "D": problem['Choice 4'],
                    },
                    "expected_answer": problem['Answer'],
                    "model_response": response,
                }

                # Write to JSONL file
                f.write(json.dumps(result) + "\n")

        print(f"All results written to {output_file}")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Killing the server...")
        # Guard against failures that occur before the server process was started
        if deploy_process is not None:
            deploy_process.send_signal(signal.SIGINT)


if __name__ == "__main__":
    main()

tutorials/llm/reasoning/evaluation/evaluate_responses.py (124 additions, 0 deletions)

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import csv
import json
import re

import pandas as pd


def extract_model_answer(response):
    if not response or "Internal Server Error" in response:
        return "Internal Server Error"

    # Look for the pattern "The final answer is <letter>"
    match = re.search(r"The final answer is ([A-D])", response)
    if match:
        return match.group(1)
    return ""


def process_answers(input_file, output_file):
    # Read the JSONL file
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            if line.strip():  # Skip empty lines
                data.append(json.loads(line))

    # Prepare CSV headers
    headers = [
        'Question',
        'Choice A',
        'Choice B',
        'Choice C',
        'Choice D',
        'Expected Answer',
        'Model Response',
        'Extracted Model Answer',
    ]

    # Write to CSV
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        # Process each question
        for question_data in data:
            question = question_data.get('question', '')
            choices = question_data.get('choices', {})
            expected_answer = question_data.get('expected_answer', '')
            model_response = question_data.get('model_response', '')

            # Extract model answer
            extracted_answer = extract_model_answer(model_response)

            # Write row
            row = [
                question,
                choices.get('A', ''),
                choices.get('B', ''),
                choices.get('C', ''),
                choices.get('D', ''),
                expected_answer,
                model_response,
                extracted_answer,
            ]
            writer.writerow(row)

    return output_file


def evaluate_results(csv_file, model_name):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Calculate metrics
    total = len(df)
    correct = len(df[df['Extracted Model Answer'] == df['Expected Answer']])
    refusals = len(df[df['Extracted Model Answer'].str.contains('Internal Server Error', case=False, na=False)])

    # Print results
    print(f"\nModel: {model_name}")
    print(f"Total problems: {total}")
    print(f"Correct answers: {correct}")
    print(f"Refusals: {refusals}")
    print(f"Accuracy: {correct/total*100:.1f}% ({correct}/{total})")


def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Process and evaluate model responses')
    parser.add_argument(
        '--input_file', type=str, required=True, help='Path to the input JSONL file containing model responses'
    )
    parser.add_argument('--output_file', type=str, required=True, help='Path to the output CSV file')
    parser.add_argument('--model_name', type=str, required=True, help='Name of the model for reporting results')

    args = parser.parse_args()

    # Process answers and generate CSV
    print(f"Processing answers from {args.input_file}...")
    csv_file = process_answers(args.input_file, args.output_file)
    print(f"CSV file has been generated: {csv_file}")

    # Evaluate results
    print("\nEvaluating results...")
    evaluate_results(csv_file, args.model_name)


if __name__ == "__main__":
    main()