28 changes: 27 additions & 1 deletion tutorials/llm/reasoning/README.md
@@ -32,11 +32,37 @@ For your reference here are the loss plots from our own experiments using 500,00

You might be wondering about the sudden loss drop at the end. This is expected!
The training dataset is arranged in increasing order of sample difficulty (i.e., curriculum learning).
With 500,000 training samples, a batch size of 256, and 2000 steps, that's just slightly over 1 epoch of training.
Towards the end of that epoch, when the model sees the first few (easier) samples again, it can easily predict the right tokens for them, so the loss ends up being much lower.

#### LoRA Training Loss Plots
![LoRA Training Loss Plots](images/loss-plot-lora.png)

#### Full Fine-tuning Loss Plots
![Fine-tuning Loss Plots](images/loss-plot-full-finetuning.png)

## Evaluation

This section describes how to evaluate your trained reasoning model on various benchmarks. The evaluation process consists of three main steps:

1. **Prepare the Dataset**: Use `prepare_dataset.py` to download and prepare benchmark datasets from HuggingFace. The script supports:
- [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main)
- [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond)
- [MMLU](https://huggingface.co/datasets/cais/mmlu)

2. **Deploy and Get Responses**: Use `deploy_and_get_responses.py` to:
- Deploy your trained model using Triton Inference Server
- Set up OpenAI-like endpoints for querying
- Generate responses for the selected benchmark

3. **Evaluate Responses**: Use `evaluate_responses.py` to:
- Extract final answers from model responses
- Compare with ground truth
- Calculate model performance metrics

### Hardware Requirements for Evaluation
- At least 1 GPU is required to run the Llama 8B model
- The evaluation scripts have been tested on nvcr.io/nvidia/nemo:25.04

For detailed instructions on running each evaluation script, please refer to the [evaluation README](./evaluation/README.md).

77 changes: 77 additions & 0 deletions tutorials/llm/reasoning/evaluation/README.md
@@ -0,0 +1,77 @@
# Evaluation Scripts

This directory contains scripts for deploying and evaluating NeMo models, as well as for preparing the datasets required for evaluation. Here, we focus only on the [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main), [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond), and [MMLU](https://huggingface.co/datasets/cais/mmlu) benchmarks. We use the datasets hosted on HuggingFace and prepare them with the `prepare_dataset.py` script in this folder. Once a dataset is prepared, we deploy our trained LoRA checkpoint with the `deploy_and_get_responses.py` script, which generates responses for the selected benchmark. Once we have the model responses, we use the `evaluate_responses.py` script, which extracts the model's final answer and compares it with the ground truth.

## Prerequisites

- **Hardware Requirement:** At least 1 GPU is required to run the Llama 8B model. Ensure that your system meets the necessary specifications for GPU usage.
- **Environment Details:** This playbook has been tested on nvcr.io/nvidia/nemo:25.04 and is expected to work similarly in other environments. Launch the NeMo Framework container as follows:
```bash
docker run -it -p 8080:8080 --rm --gpus '"device=0"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.04
```

## Scripts Overview

### 1. `prepare_dataset.py`

**Purpose:** This script prepares a dataset for model evaluation. Based on the argument you choose, it downloads one or all of the supported benchmark datasets from HuggingFace and reformats each sample into a question, four choices, and the correct answer expressed as one of the multiple-choice letters ('A', 'B', 'C', 'D').

**How to Run:**
```bash
python prepare_dataset.py --datasets [mmlu, gpqa, gpqa_diamond, all]
```

**Arguments:**
- `--datasets`: Specify which datasets to process. Options are `mmlu`, `gpqa`, `gpqa_diamond`, or `all` (default).

**Step-by-Step:**
1. **Load the Dataset:** The script loads the dataset that you want to prepare.
2. **Process the Dataset:** It processes the dataset to ensure it is in the correct format.
3. **Save the Dataset:** The script saves the processed dataset as a JSONL file for later use (see the illustrative record below).
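
For reference, here is a minimal sketch of what one prepared record is expected to look like. The field names are taken from what `deploy_and_get_responses.py` reads; the question and choices themselves are invented for illustration.
```python
# Illustrative only: one prepared record using the field names consumed by
# deploy_and_get_responses.py; the question text and choices are made up.
import json

record = {
    "Question": "Which particle mediates the electromagnetic force?",
    "Choice 1": "Gluon",
    "Choice 2": "Photon",
    "Choice 3": "W boson",
    "Choice 4": "Higgs boson",
    "Answer": "B",
}
print(json.dumps(record))  # one such line per record in the .jsonl file
```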

**Note:** The [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main) and [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond) benchmarks are gated repositories. To access them, log in with the Hugging Face CLI and enter your token:
```bash
huggingface-cli login
```

### 2. `deploy_and_get_responses.py`

**Purpose:** First, you need a NeMo 2 checkpoint of the model you would like to evaluate. Assuming you are using the NeMo 2 checkpoint trained in the previous step, make sure to mount the directory containing it when starting the container. This script starts a server for the provided checkpoint in a separate process, deploying the model with the Triton Inference Server and setting up OpenAI-like endpoints for querying it. The server exposes three endpoints:

- `/v1/triton_health`
- `/v1/completions/`
- `/v1/chat/completions/`

The `/v1/triton_health` endpoint allows you to check if the underlying Triton server is ready. The `/v1/completions/` endpoint allows you to send a prompt to the model as-is, without applying the chat template; the model responds with a text completion. Finally, the `/v1/chat/completions/` endpoint allows for multi-turn conversational interactions with the model. This endpoint accepts a structured list of messages with different roles (system, user, assistant) to maintain context and generates chat-like responses. Under the hood, a chat template is applied to turn the conversation into a single input string.
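
As a quick sanity check, here is a minimal sketch of querying the chat endpoint directly with `requests` once the server reports ready. The port (8886), model name (`triton_model`), and system message mirror the defaults hard-coded in `deploy_and_get_responses.py`; the user prompt is only a placeholder.
```python
# Minimal sketch: query the chat endpoint started by deploy_and_get_responses.py.
# URL, model name, and system message mirror that script's defaults.
import requests

chat_url = "http://0.0.0.0:8886/v1/chat/completions/"
payload = {
    "model": "triton_model",
    "messages": [
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "What is 17 * 23? Explain briefly."},
    ],
    "max_tokens": 256,
}
response = requests.post(chat_url, json=payload)
print(response.content.decode())
```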

**Note:** The chat endpoint will not work correctly for base models, as they do not define a chat template. Deployment can take a couple of minutes, especially for larger models.

**How to Run:**
```bash
python deploy_and_get_responses.py --checkpoint_path <checkpoint_path> --dataset <dataset> --output_prefix <output_prefix> --max_tokens <max_tokens>
```

**Arguments:**
- `--checkpoint_path`: Path to the model checkpoint.
- `--dataset`: Dataset to evaluate on (choices: `gpqa_main`, `mmlu`, `gpqa_diamond`).
- `--output_prefix`: Prefix for the output file name (default: `evaluation_results`).
- `--max_tokens`: Maximum number of tokens to generate.

**Step-by-Step:**
1. **Load the NeMo Model:** The script loads the NeMo model that you want to deploy.
2. **Configure Deployment:** It configures the deployment settings. You can modify the chat template here to set 'detailed thinking on' or 'detailed thinking off' for your reasoning model.
3. **Deploy the Model:** The script deploys the model and outputs the deployment status.

### 3. `evaluate_responses.py`

**Purpose:** This script ingests the model responses generated in the previous step and extracts the final answer from each response. Once the answer is extracted, it is compared with the ground truth to evaluate the model's performance.

**How to Run:**
```bash
python evaluate_responses.py --input_file <input_file> --output_file <output_file> --model_name <model_name>
```

**Arguments:**
- `--input_file`: Path to the input JSONL file containing model responses.
- `--output_file`: Path to the output CSV file.
- `--model_name`: Name of the model for reporting results.
151 changes: 151 additions & 0 deletions tutorials/llm/reasoning/evaluation/deploy_and_get_responses.py
@@ -0,0 +1,151 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import signal
import subprocess

import requests

from nemo.collections.llm import api
from nemo.collections.llm.evaluation.base import wait_for_fastapi_server
from nemo.utils import logging

logging.setLevel(logging.INFO)

deploy_process = None
base_url = None
chat_url = None
model_name = None


def parse_args():
parser = argparse.ArgumentParser(description='Evaluate model on benchmark dataset')
parser.add_argument('--checkpoint_path', type=str, required=True, help='Path to the model checkpoint')
parser.add_argument(
'--dataset',
type=str,
required=True,
choices=['gpqa_main', 'mmlu', 'gpqa_diamond'],
        help='Dataset to evaluate on (gpqa_main, mmlu, gpqa_diamond)',
)
parser.add_argument(
'--output_prefix', type=str, default='evaluation_results', help='Prefix for the output file name'
)
parser.add_argument(
'--max_tokens', type=int, default=2048, help='Maximum number of tokens to generate in the response'
)
return parser.parse_args()


def create_benchmark_prompt(question, choice1, choice2, choice3, choice4):
"""Create benchmark prompt in the specified format"""
prompt = f"""Given the following question and four candidate answers (A, B, C and D), choose the best answer.
Question: {question} A. {choice1} B. {choice2} C. {choice3} D. {choice4}
For simple problems, directly provide the answer with minimal explanation. For complex problems, use step-by-step format. Always conclude with: The final answer is [the_answer_letter], where the [the_answer_letter] is one of A, B, C or D."""
return prompt


def load_model(checkpoint_path):
"""Initialize and load the model for inference"""
global deploy_process, base_url, chat_url, model_name

SCRIPTS_PATH = "/opt/NeMo/scripts"
WORKSPACE = "."

deploy_script = f"{SCRIPTS_PATH}/deploy/nlp/deploy_in_fw_oai_server_eval.py"
deploy_process = subprocess.Popen(
['python', deploy_script, '--nemo_checkpoint', checkpoint_path],
)

base_url = "http://0.0.0.0:8886"
model_name = "triton_model"
chat_url = f"{base_url}/v1/chat/completions/"

wait_for_fastapi_server(base_url=base_url, max_retries=600, retry_interval=10)
logging.info("Model loaded and server is ready for inference")


def get_response(prompt, max_tokens):
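    """Send one prompt to the chat endpoint; the 'detailed thinking on' system message enables reasoning mode."""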
chat_payload = {
"messages": [{"role": "system", "content": "detailed thinking on"}, {"role": "user", "content": prompt}],
"model": model_name,
"max_tokens": max_tokens,
}
response = requests.post(chat_url, json=chat_payload)
return response.content.decode()


def main():
args = parse_args()

# Determine dataset file and output file based on dataset selection
dataset_files = {
'gpqa_main': 'gpqa_dataset.jsonl',
'mmlu': 'mmlu_dataset.jsonl',
'gpqa_diamond': 'gpqa_diamond_dataset.jsonl',
}

dataset_file = dataset_files[args.dataset]
output_file = f"{args.output_prefix}_{args.dataset}_evaluation.jsonl"

try:
with open(dataset_file, "r") as f:
problems = [json.loads(line) for line in f]

load_model(args.checkpoint_path)

# Open output file once before the loop
with open(output_file, "w") as f:
for i, problem in enumerate(problems):
print(f"\n{'='*70}")
print(f"Problem {i+1}/{len(problems)}")

prompt = create_benchmark_prompt(
problem['Question'],
problem['Choice 1'],
problem['Choice 2'],
problem['Choice 3'],
problem['Choice 4'],
)

response = get_response(prompt, args.max_tokens)

# Create result entry
result = {
"question": problem['Question'],
"choices": {
"A": problem['Choice 1'],
"B": problem['Choice 2'],
"C": problem['Choice 3'],
"D": problem['Choice 4'],
},
"expected_answer": problem['Answer'],
"model_response": response,
}

# Write to JSONL file
f.write(json.dumps(result) + "\n")

print(f"All results written to {output_file}")
except Exception as e:
print(f"An error occurred: {e}")
    finally:
        # Only signal the deployment server if it was actually started
        if deploy_process is not None:
            print("Killing the server...")
            deploy_process.send_signal(signal.SIGINT)


if __name__ == "__main__":
main()
124 changes: 124 additions & 0 deletions tutorials/llm/reasoning/evaluation/evaluate_responses.py
@@ -0,0 +1,124 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import csv
import json
import re

import pandas as pd


def extract_model_answer(response):
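    """Extract the final answer letter (A-D) from a model response, flagging server errors."""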
if not response or "Internal Server Error" in response:
return "Internal Server Error"

# Look for the pattern "The final answer is <letter>"
match = re.search(r"The final answer is ([A-D])", response)
if match:
return match.group(1)
return ""


def process_answers(input_file, output_file):
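    """Convert the responses JSONL into a CSV with one row per question, including the extracted answer."""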
# Read the JSONL file
data = []
with open(input_file, 'r') as f:
for line in f:
if line.strip(): # Skip empty lines
data.append(json.loads(line))

# Prepare CSV headers
headers = [
'Question',
'Choice A',
'Choice B',
'Choice C',
'Choice D',
'Expected Answer',
'Model Response',
'Extracted Model Answer',
]

# Write to CSV
with open(output_file, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(headers)

# Process each question
for question_data in data:
question = question_data.get('question', '')
choices = question_data.get('choices', {})
expected_answer = question_data.get('expected_answer', '')
model_response = question_data.get('model_response', '')

# Extract model answer
extracted_answer = extract_model_answer(model_response)

# Write row
row = [
question,
choices.get('A', ''),
choices.get('B', ''),
choices.get('C', ''),
choices.get('D', ''),
expected_answer,
model_response,
extracted_answer,
]
writer.writerow(row)

return output_file


def evaluate_results(csv_file, model_name):
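    """Read the results CSV and print total problems, correct answers, refusals (server errors), and accuracy."""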
# Read the CSV file
df = pd.read_csv(csv_file)

# Calculate metrics
total = len(df)
correct = len(df[df['Extracted Model Answer'] == df['Expected Answer']])
refusals = len(df[df['Extracted Model Answer'].str.contains('Internal Server Error', case=False, na=False)])

# Print results
print(f"\nModel: {model_name}")
print(f"Total problems: {total}")
print(f"Correct answers: {correct}")
print(f"Refusals: {refusals}")
print(f"Accuracy: {correct/total*100:.1f}% ({correct}/{total})")


def main():
# Set up argument parser
parser = argparse.ArgumentParser(description='Process and evaluate model responses')
parser.add_argument(
'--input_file', type=str, required=True, help='Path to the input JSONL file containing model responses'
)
parser.add_argument('--output_file', type=str, required=True, help='Path to the output CSV file')
parser.add_argument('--model_name', type=str, required=True, help='Name of the model for reporting results')

args = parser.parse_args()

# Process answers and generate CSV
print(f"Processing answers from {args.input_file}...")
csv_file = process_answers(args.input_file, args.output_file)
print(f"CSV file has been generated: {csv_file}")

# Evaluate results
print("\nEvaluating results...")
evaluate_results(csv_file, args.model_name)


if __name__ == "__main__":
main()