
Commit 042bfc1

ruchaa-apte authored and nasretdinovr committed
reasoning model evaluation mmlu gpqa (NVIDIA-NeMo#13880)
* reasoning model evaluation mmlu gpqa
  Signed-off-by: Rucha Apte <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ruchaa-apte <[email protected]>
* Addressing PR Comments
  Signed-off-by: Rucha Apte <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ruchaa-apte <[email protected]>
* Add license
  Signed-off-by: Rucha Apte <[email protected]>

---------
Signed-off-by: Rucha Apte <[email protected]>
Signed-off-by: ruchaa-apte <[email protected]>
Co-authored-by: ruchaa-apte <[email protected]>
1 parent 9887451 commit 042bfc1

5 files changed: +601 -1 lines changed

tutorials/llm/reasoning/README.md

Lines changed: 27 additions & 1 deletion
@@ -32,11 +32,37 @@ For your reference here are the loss plots from our own experiments using 500,00
You might be wondering about the sudden loss drop at the end. This is expected!
The training dataset is arranged in the increasing order of sample difficulty (i.e. curriculum learning).
With 500,000 training samples, a batch size of 256 and 2000 steps, that's just slightly over 1 epoch of training.
Towards the end of that epoch, when the model sees the first few (easier samples) again, it can easily predict the right tokens for them so the loss ends up being much lower.

#### LoRA Training Loss Plots
![LoRA Training Loss Plots](images/loss-plot-lora.png)

#### Full Fine-tuning Loss Plots
![Fine-tuning Loss Plots](images/loss-plot-full-finetuning.png)

## Evaluation

This section describes how to evaluate your trained reasoning model on various benchmarks. The evaluation process consists of three main steps:

1. **Prepare the Dataset**: Use `prepare_dataset.py` to download and prepare benchmark datasets from HuggingFace. The script supports:
   - [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main)
   - [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond)
   - [MMLU](https://huggingface.co/datasets/cais/mmlu)

2. **Deploy and Get Responses**: Use `deploy_and_get_responses.py` to:
   - Deploy your trained model using Triton Inference Server
   - Set up OpenAI-like endpoints for querying
   - Generate responses for the selected benchmark

3. **Evaluate Responses**: Use `evaluate_responses.py` to:
   - Extract final answers from model responses
   - Compare with ground truth
   - Calculate model performance metrics

### Hardware Requirements for Evaluation
- At least 1 GPU is required to run the Llama 8B model
- The evaluation scripts have been tested on nvcr.io/nvidia/nemo:25.04

For detailed instructions on running each evaluation script, please refer to the [evaluation README](./evaluation/README.md).
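Taken together, the three steps above boil down to a short command sequence. The sketch below is illustrative only: it assumes you run it from the `evaluation/` directory inside the NeMo Framework container described in the evaluation README, and the checkpoint path, output prefix, and model name are placeholders to substitute with your own values.

```bash
# 1. Prepare one benchmark (choices: mmlu, gpqa, gpqa_diamond, all)
python prepare_dataset.py --datasets gpqa_diamond

# 2. Deploy the trained checkpoint and collect responses
#    (the checkpoint path below is a placeholder)
python deploy_and_get_responses.py \
    --checkpoint_path /workspace/results/llama-8b-reasoning/checkpoints/last \
    --dataset gpqa_diamond \
    --output_prefix evaluation_results \
    --max_tokens 2048

# 3. Score the responses; the input file name follows the
#    <output_prefix>_<dataset>_evaluation.jsonl convention used by the deploy script
python evaluate_responses.py \
    --input_file evaluation_results_gpqa_diamond_evaluation.jsonl \
    --output_file evaluation_results_gpqa_diamond.csv \
    --model_name llama-8b-reasoning-lora
```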
tutorials/llm/reasoning/evaluation/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Evaluation Scripts

This directory contains scripts for deploying and evaluating NeMo models, as well as preparing datasets required for model evaluation. Here, we focus only on the [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main), [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond), and [MMLU](https://huggingface.co/datasets/cais/mmlu) benchmarks. We will use the datasets hosted on HuggingFace and prepare them using the `prepare_dataset.py` script in this folder. Once the dataset is prepared, we will deploy our trained LoRA checkpoint using the `deploy_and_get_responses.py` script, which generates responses for the selected benchmark. Finally, the `evaluate_responses.py` script extracts the final answer from each model response and compares it with the ground truth.

## Prerequisites

- **Hardware Requirement:** At least 1 GPU is required to run the Llama 8B model. Ensure that your system meets the necessary specifications for GPU usage.
- **Environment Details:** This playbook has been tested on nvcr.io/nvidia/nemo:25.04 and is expected to work similarly in other environments. Launch the NeMo Framework container as follows:

```bash
docker run -it -p 8080:8080 --rm --gpus '"device=0"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.04
```

## Scripts Overview

### 1. `prepare_dataset.py`

**Purpose:** This script prepares a dataset for model evaluation. Based on the argument chosen by the user, it downloads one or all of the benchmark datasets from HuggingFace and rearranges each record into the question, its answer choices, and the correct answer expressed as one of the multiple-choice letters ('A', 'B', 'C', 'D').

**How to Run:**
```bash
python prepare_dataset.py --datasets [mmlu, gpqa, gpqa_diamond, all]
```

**Arguments:**
- `--datasets`: Specify which datasets to process. Options are `mmlu`, `gpqa`, `gpqa_diamond`, or `all` (default).

**Step-by-Step:**
1. **Load the Dataset:** The script loads the dataset that you want to prepare.
2. **Process the Dataset:** It processes the dataset to ensure it is in the correct format.
3. **Save the Dataset:** The script saves the processed dataset as JSONL for later use.

**Note:** The [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main) and [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond) benchmarks are gated repositories. To access them, log in with the Hugging Face CLI and enter your token:
```bash
huggingface-cli login
```

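As an optional sanity check, you can peek at the first line of a prepared file. The snippet below is a sketch: `mmlu_dataset.jsonl` is the file name the deploy script expects for MMLU, and each line is one JSON object whose keys (`Question`, `Choice 1` through `Choice 4`, `Answer`) are the fields `deploy_and_get_responses.py` reads.

```bash
# Inspect one prepared record (keys: Question, Choice 1-4, Answer)
head -n 1 mmlu_dataset.jsonl
```
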
### 2. `deploy_and_get_responses.py`

**Purpose:** First, you need a NeMo 2 checkpoint of the model you would like to evaluate. Assuming we are using the NeMo 2 checkpoint trained in the previous step, make sure to mount the directory containing the checkpoint when starting the container. The script starts a server for the provided checkpoint in a separate process: it deploys the model using the Triton Inference Server, sets up OpenAI-like endpoints for querying it, and generates responses for the selected benchmark. The server exposes three endpoints:

- `/v1/triton_health`
- `/v1/completions/`
- `/v1/chat/completions/`

The `/v1/triton_health` endpoint allows you to check whether the underlying Triton server is ready. The `/v1/completions/` endpoint allows you to send a prompt to the model as-is, without applying the chat template; the model responds with a text completion. Finally, the `/v1/chat/completions/` endpoint allows for multi-turn conversational interactions with the model. This endpoint accepts a structured list of messages with different roles (system, user, assistant) to maintain context and generates chat-like responses. Under the hood, a chat template is applied to turn the conversation into a single input string.

**Note:** The chat endpoint will not work correctly for base models, as they do not define a chat template. Deployment can take a couple of minutes, especially for larger models.

**How to Run:**
```bash
python deploy_and_get_responses.py --checkpoint_path <checkpoint_path> --dataset <dataset> --output_prefix <output_prefix> --max_tokens <max_tokens>
```

**Arguments:**
- `--checkpoint_path`: Path to the model checkpoint.
- `--dataset`: Dataset to evaluate on (choices: `gpqa_main`, `mmlu`, `gpqa_diamond`).
- `--output_prefix`: Prefix for the output file name (default: `evaluation_results`).
- `--max_tokens`: Maximum number of tokens to generate.

**Step-by-Step:**
1. **Load the NeMo Model:** The script loads the NeMo model that you want to deploy.
2. **Configure Deployment:** It configures the deployment settings. You can modify the chat template here to set 'detailed thinking on' or 'detailed thinking off' for your reasoning model.
3. **Deploy the Model:** The script deploys the model and outputs the deployment status.

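Once the server reports ready, you can also query the chat endpoint by hand. The request below is a minimal sketch based on the values hard-coded in `deploy_and_get_responses.py` (port 8886, model name `triton_model`, and a system message that turns detailed thinking on); the user question is only a placeholder.

```bash
# Send one chat request to the locally deployed model
curl -s http://0.0.0.0:8886/v1/chat/completions/ \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "detailed thinking on"},
          {"role": "user", "content": "What is 2 + 2?"}
        ],
        "model": "triton_model",
        "max_tokens": 256
      }'
```
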
### 3. `evaluate_responses.py`

**Purpose:** This script ingests the model responses generated in the previous step, extracts the final answer from each response, compares it with the ground truth, and reports the model's performance.

**How to Run:**
```bash
python evaluate_responses.py --input_file <input_file> --output_file <output_file> --model_name <model_name>
```

**Arguments:**
- `--input_file`: Path to the input JSONL file containing model responses.
- `--output_file`: Path to the output CSV file.
- `--model_name`: Name of the model for reporting results.
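
The answer extraction relies on the model following the prompt's closing instruction. The one-liner below is an illustrative sketch of the rule the script applies: it searches for the literal phrase "The final answer is" followed by a letter A-D (the script keeps only the trailing letter; responses containing "Internal Server Error" are counted separately).

```bash
# Mirrors the extraction pattern used by evaluate_responses.py
echo "...step-by-step reasoning... The final answer is C" | grep -oE "The final answer is [A-D]"
# prints: The final answer is C   (the script stores just the letter, here "C")
```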
tutorials/llm/reasoning/evaluation/deploy_and_get_responses.py

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import signal
import subprocess

import requests

from nemo.collections.llm import api
from nemo.collections.llm.evaluation.base import wait_for_fastapi_server
from nemo.utils import logging

logging.setLevel(logging.INFO)

deploy_process = None
base_url = None
chat_url = None
model_name = None


def parse_args():
    parser = argparse.ArgumentParser(description='Evaluate model on benchmark dataset')
    parser.add_argument('--checkpoint_path', type=str, required=True, help='Path to the model checkpoint')
    parser.add_argument(
        '--dataset',
        type=str,
        required=True,
        choices=['gpqa_main', 'mmlu', 'gpqa_diamond'],
        help='Dataset to evaluate on (gpqa, mmlu)',
    )
    parser.add_argument(
        '--output_prefix', type=str, default='evaluation_results', help='Prefix for the output file name'
    )
    parser.add_argument(
        '--max_tokens', type=int, default=2048, help='Maximum number of tokens to generate in the response'
    )
    return parser.parse_args()


def create_benchmark_prompt(question, choice1, choice2, choice3, choice4):
    """Create benchmark prompt in the specified format"""
    prompt = f"""Given the following question and four candidate answers (A, B, C and D), choose the best answer.
Question: {question} A. {choice1} B. {choice2} C. {choice3} D. {choice4}
For simple problems, directly provide the answer with minimal explanation. For complex problems, use step-by-step format. Always conclude with: The final answer is [the_answer_letter], where the [the_answer_letter] is one of A, B, C or D."""
    return prompt


def load_model(checkpoint_path):
    """Initialize and load the model for inference"""
    global deploy_process, base_url, chat_url, model_name

    SCRIPTS_PATH = "/opt/NeMo/scripts"
    WORKSPACE = "."

    deploy_script = f"{SCRIPTS_PATH}/deploy/nlp/deploy_in_fw_oai_server_eval.py"
    deploy_process = subprocess.Popen(
        ['python', deploy_script, '--nemo_checkpoint', checkpoint_path],
    )

    base_url = "http://0.0.0.0:8886"
    model_name = "triton_model"
    chat_url = f"{base_url}/v1/chat/completions/"

    wait_for_fastapi_server(base_url=base_url, max_retries=600, retry_interval=10)
    logging.info("Model loaded and server is ready for inference")


def get_response(prompt, max_tokens):
    chat_payload = {
        "messages": [{"role": "system", "content": "detailed thinking on"}, {"role": "user", "content": prompt}],
        "model": model_name,
        "max_tokens": max_tokens,
    }
    response = requests.post(chat_url, json=chat_payload)
    return response.content.decode()


def main():
    args = parse_args()

    # Determine dataset file and output file based on dataset selection
    dataset_files = {
        'gpqa_main': 'gpqa_dataset.jsonl',
        'mmlu': 'mmlu_dataset.jsonl',
        'gpqa_diamond': 'gpqa_diamond_dataset.jsonl',
    }

    dataset_file = dataset_files[args.dataset]
    output_file = f"{args.output_prefix}_{args.dataset}_evaluation.jsonl"

    try:
        with open(dataset_file, "r") as f:
            problems = [json.loads(line) for line in f]

        load_model(args.checkpoint_path)

        # Open output file once before the loop
        with open(output_file, "w") as f:
            for i, problem in enumerate(problems):
                print(f"\n{'='*70}")
                print(f"Problem {i+1}/{len(problems)}")

                prompt = create_benchmark_prompt(
                    problem['Question'],
                    problem['Choice 1'],
                    problem['Choice 2'],
                    problem['Choice 3'],
                    problem['Choice 4'],
                )

                response = get_response(prompt, args.max_tokens)

                # Create result entry
                result = {
                    "question": problem['Question'],
                    "choices": {
                        "A": problem['Choice 1'],
                        "B": problem['Choice 2'],
                        "C": problem['Choice 3'],
                        "D": problem['Choice 4'],
                    },
                    "expected_answer": problem['Answer'],
                    "model_response": response,
                }

                # Write to JSONL file
                f.write(json.dumps(result) + "\n")

        print(f"All results written to {output_file}")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Killing the server...")
        deploy_process.send_signal(signal.SIGINT)


if __name__ == "__main__":
    main()
```
tutorials/llm/reasoning/evaluation/evaluate_responses.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import csv
import json
import re

import pandas as pd


def extract_model_answer(response):
    if not response or "Internal Server Error" in response:
        return "Internal Server Error"

    # Look for the pattern "The final answer is <letter>"
    match = re.search(r"The final answer is ([A-D])", response)
    if match:
        return match.group(1)
    return ""


def process_answers(input_file, output_file):
    # Read the JSONL file
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            if line.strip():  # Skip empty lines
                data.append(json.loads(line))

    # Prepare CSV headers
    headers = [
        'Question',
        'Choice A',
        'Choice B',
        'Choice C',
        'Choice D',
        'Expected Answer',
        'Model Response',
        'Extracted Model Answer',
    ]

    # Write to CSV
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        # Process each question
        for question_data in data:
            question = question_data.get('question', '')
            choices = question_data.get('choices', {})
            expected_answer = question_data.get('expected_answer', '')
            model_response = question_data.get('model_response', '')

            # Extract model answer
            extracted_answer = extract_model_answer(model_response)

            # Write row
            row = [
                question,
                choices.get('A', ''),
                choices.get('B', ''),
                choices.get('C', ''),
                choices.get('D', ''),
                expected_answer,
                model_response,
                extracted_answer,
            ]
            writer.writerow(row)

    return output_file


def evaluate_results(csv_file, model_name):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Calculate metrics
    total = len(df)
    correct = len(df[df['Extracted Model Answer'] == df['Expected Answer']])
    refusals = len(df[df['Extracted Model Answer'].str.contains('Internal Server Error', case=False, na=False)])

    # Print results
    print(f"\nModel: {model_name}")
    print(f"Total problems: {total}")
    print(f"Correct answers: {correct}")
    print(f"Refusals: {refusals}")
    print(f"Accuracy: {correct/total*100:.1f}% ({correct}/{total})")


def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Process and evaluate model responses')
    parser.add_argument(
        '--input_file', type=str, required=True, help='Path to the input JSONL file containing model responses'
    )
    parser.add_argument('--output_file', type=str, required=True, help='Path to the output CSV file')
    parser.add_argument('--model_name', type=str, required=True, help='Name of the model for reporting results')

    args = parser.parse_args()

    # Process answers and generate CSV
    print(f"Processing answers from {args.input_file}...")
    csv_file = process_answers(args.input_file, args.output_file)
    print(f"CSV file has been generated: {csv_file}")

    # Evaluate results
    print("\nEvaluating results...")
    evaluate_results(csv_file, args.model_name)


if __name__ == "__main__":
    main()
```
