
Commit 042bfc1

ruchaa-apte authored and nasretdinovr committed
reasoning model evaluation mmlu gpqa (NVIDIA-NeMo#13880)
* reasoning model evaluation mmlu gpqa
  Signed-off-by: Rucha Apte <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ruchaa-apte <[email protected]>
* Addressing PR Comments
  Signed-off-by: Rucha Apte <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: ruchaa-apte <[email protected]>
* Add license
  Signed-off-by: Rucha Apte <[email protected]>

---------
Signed-off-by: Rucha Apte <[email protected]>
Signed-off-by: ruchaa-apte <[email protected]>
Co-authored-by: ruchaa-apte <[email protected]>
1 parent 9887451 commit 042bfc1

5 files changed: +601 -1 lines changed

tutorials/llm/reasoning/README.md

Lines changed: 27 additions & 1 deletion
@@ -32,11 +32,37 @@ For your reference here are the loss plots from our own experiments using 500,00
You might be wondering about the sudden loss drop at the end. This is expected!
The training dataset is arranged in the increasing order of sample difficulty (i.e. curriculum learning).
With 500,000 training samples, a batch size of 256 and 2000 steps, that's just slightly over 1 epoch of training.
Towards the end of that epoch, when the model sees the first few (easier samples) again, it can easily predict the right tokens for them so the loss ends up being much lower.

#### LoRA Training Loss Plots
![LoRA Training Loss Plots](images/loss-plot-lora.png)

#### Full Fine-tuning Loss Plots
![Fine-tuning Loss Plots](images/loss-plot-full-finetuning.png)

## Evaluation

This section describes how to evaluate your trained reasoning model on various benchmarks. The evaluation process consists of three main steps:

1. **Prepare the Dataset**: Use `prepare_dataset.py` to download and prepare benchmark datasets from HuggingFace. The script supports:
   - [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main)
   - [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond)
   - [MMLU](https://huggingface.co/datasets/cais/mmlu)

2. **Deploy and Get Responses**: Use `deploy_and_get_responses.py` to:
   - Deploy your trained model using Triton Inference Server
   - Set up OpenAI-like endpoints for querying
   - Generate responses for the selected benchmark

3. **Evaluate Responses**: Use `evaluate_responses.py` to:
   - Extract final answers from model responses
   - Compare with ground truth
   - Calculate model performance metrics

### Hardware Requirements for Evaluation
- At least 1 GPU is required to run the Llama 8B model
- The evaluation scripts have been tested on nvcr.io/nvidia/nemo:25.04

For detailed instructions on running each evaluation script, please refer to the [evaluation README](./evaluation/README.md).
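Taken together, the three steps above boil down to a short command sequence. The sketch below is illustrative only: it assumes you run it from the `evaluation/` directory inside the NeMo Framework container described in the evaluation README, and the checkpoint path, output prefix, and model name are placeholders to substitute with your own values.

```bash
# 1. Prepare one benchmark (choices: mmlu, gpqa, gpqa_diamond, all)
python prepare_dataset.py --datasets gpqa_diamond

# 2. Deploy the trained checkpoint and collect responses
#    (the checkpoint path below is a placeholder)
python deploy_and_get_responses.py \
    --checkpoint_path /workspace/results/llama-8b-reasoning/checkpoints/last \
    --dataset gpqa_diamond \
    --output_prefix evaluation_results \
    --max_tokens 2048

# 3. Score the responses; the input file name follows the
#    <output_prefix>_<dataset>_evaluation.jsonl convention used by the deploy script
python evaluate_responses.py \
    --input_file evaluation_results_gpqa_diamond_evaluation.jsonl \
    --output_file evaluation_results_gpqa_diamond.csv \
    --model_name llama-8b-reasoning-lora
```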
tutorials/llm/reasoning/evaluation/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Evaluation Scripts

This directory contains scripts for deploying and evaluating NeMo models, as well as preparing datasets required for model evaluation. Here, we focus only on the [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main), [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond), and [MMLU](https://huggingface.co/datasets/cais/mmlu) benchmarks. We will use the datasets hosted on HuggingFace and prepare them using the `prepare_dataset.py` script in this folder. Once the dataset is prepared, we will deploy our trained LoRA checkpoint using the `deploy_and_get_responses.py` script, which generates responses for the selected benchmark. Finally, the `evaluate_responses.py` script extracts the final answer from each model response and compares it with the ground truth.

## Prerequisites

- **Hardware Requirement:** At least 1 GPU is required to run the Llama 8B model. Ensure that your system meets the necessary specifications for GPU usage.
- **Environment Details:** This playbook has been tested on nvcr.io/nvidia/nemo:25.04 and is expected to work similarly in other environments. Launch the NeMo Framework container as follows:

```bash
docker run -it -p 8080:8080 --rm --gpus '"device=0"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.04
```

## Scripts Overview

### 1. `prepare_dataset.py`

**Purpose:** This script prepares a dataset for model evaluation. Based on the argument chosen by the user, it downloads one or all of the benchmark datasets from HuggingFace and rearranges each record into the question, its answer choices, and the correct answer expressed as one of the multiple-choice letters ('A', 'B', 'C', 'D').

**How to Run:**
```bash
python prepare_dataset.py --datasets [mmlu, gpqa, gpqa_diamond, all]
```

**Arguments:**
- `--datasets`: Specify which datasets to process. Options are `mmlu`, `gpqa`, `gpqa_diamond`, or `all` (default).

**Step-by-Step:**
1. **Load the Dataset:** The script loads the dataset that you want to prepare.
2. **Process the Dataset:** It processes the dataset to ensure it is in the correct format.
3. **Save the Dataset:** The script saves the processed dataset as JSONL for later use.

**Note:** The [GPQA main](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_main?views%5B%5D=gpqa_main) and [GPQA diamond](https://huggingface.co/datasets/Idavidrein/gpqa/viewer/gpqa_diamond) benchmarks are gated repositories. To access them, log in with the Hugging Face CLI and enter your token:
```bash
huggingface-cli login
```

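As an optional sanity check, you can peek at the first line of a prepared file. The snippet below is a sketch: `mmlu_dataset.jsonl` is the file name the deploy script expects for MMLU, and each line is one JSON object whose keys (`Question`, `Choice 1` through `Choice 4`, `Answer`) are the fields `deploy_and_get_responses.py` reads.

```bash
# Inspect one prepared record (keys: Question, Choice 1-4, Answer)
head -n 1 mmlu_dataset.jsonl
```
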
### 2. `deploy_and_get_responses.py`

**Purpose:** First, you need a NeMo 2 checkpoint of the model you would like to evaluate. Assuming we are using the NeMo 2 checkpoint trained in the previous step, make sure to mount the directory containing the checkpoint when starting the container. The script starts a server for the provided checkpoint in a separate process: it deploys the model using the Triton Inference Server, sets up OpenAI-like endpoints for querying it, and generates responses for the selected benchmark. The server exposes three endpoints:

- `/v1/triton_health`
- `/v1/completions/`
- `/v1/chat/completions/`

The `/v1/triton_health` endpoint allows you to check whether the underlying Triton server is ready. The `/v1/completions/` endpoint allows you to send a prompt to the model as-is, without applying the chat template; the model responds with a text completion. Finally, the `/v1/chat/completions/` endpoint allows for multi-turn conversational interactions with the model. This endpoint accepts a structured list of messages with different roles (system, user, assistant) to maintain context and generates chat-like responses. Under the hood, a chat template is applied to turn the conversation into a single input string.

**Note:** The chat endpoint will not work correctly for base models, as they do not define a chat template. Deployment can take a couple of minutes, especially for larger models.

**How to Run:**
```bash
python deploy_and_get_responses.py --checkpoint_path <checkpoint_path> --dataset <dataset> --output_prefix <output_prefix> --max_tokens <max_tokens>
```

**Arguments:**
- `--checkpoint_path`: Path to the model checkpoint.
- `--dataset`: Dataset to evaluate on (choices: `gpqa_main`, `mmlu`, `gpqa_diamond`).
- `--output_prefix`: Prefix for the output file name (default: `evaluation_results`).
- `--max_tokens`: Maximum number of tokens to generate.

**Step-by-Step:**
1. **Load the NeMo Model:** The script loads the NeMo model that you want to deploy.
2. **Configure Deployment:** It configures the deployment settings. You can modify the chat template here to set 'detailed thinking on' or 'detailed thinking off' for your reasoning model.
3. **Deploy the Model:** The script deploys the model and outputs the deployment status.

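Once the server reports ready, you can also query the chat endpoint by hand. The request below is a minimal sketch based on the values hard-coded in `deploy_and_get_responses.py` (port 8886, model name `triton_model`, and a system message that turns detailed thinking on); the user question is only a placeholder.

```bash
# Send one chat request to the locally deployed model
curl -s http://0.0.0.0:8886/v1/chat/completions/ \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "detailed thinking on"},
          {"role": "user", "content": "What is 2 + 2?"}
        ],
        "model": "triton_model",
        "max_tokens": 256
      }'
```
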
### 3. `evaluate_responses.py`

**Purpose:** This script ingests the model responses generated in the previous step, extracts the final answer from each response, compares it with the ground truth, and reports the model's performance.

**How to Run:**
```bash
python evaluate_responses.py --input_file <input_file> --output_file <output_file> --model_name <model_name>
```

**Arguments:**
- `--input_file`: Path to the input JSONL file containing model responses.
- `--output_file`: Path to the output CSV file.
- `--model_name`: Name of the model for reporting results.
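
The answer extraction relies on the model following the prompt's closing instruction. The one-liner below is an illustrative sketch of the rule the script applies: it searches for the literal phrase "The final answer is" followed by a letter A-D (the script keeps only the trailing letter; responses containing "Internal Server Error" are counted separately).

```bash
# Mirrors the extraction pattern used by evaluate_responses.py
echo "...step-by-step reasoning... The final answer is C" | grep -oE "The final answer is [A-D]"
# prints: The final answer is C   (the script stores just the letter, here "C")
```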
tutorials/llm/reasoning/evaluation/deploy_and_get_responses.py

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
import signal
import subprocess

import requests

from nemo.collections.llm import api
from nemo.collections.llm.evaluation.base import wait_for_fastapi_server
from nemo.utils import logging

logging.setLevel(logging.INFO)

deploy_process = None
base_url = None
chat_url = None
model_name = None


def parse_args():
    parser = argparse.ArgumentParser(description='Evaluate model on benchmark dataset')
    parser.add_argument('--checkpoint_path', type=str, required=True, help='Path to the model checkpoint')
    parser.add_argument(
        '--dataset',
        type=str,
        required=True,
        choices=['gpqa_main', 'mmlu', 'gpqa_diamond'],
        help='Dataset to evaluate on (gpqa, mmlu)',
    )
    parser.add_argument(
        '--output_prefix', type=str, default='evaluation_results', help='Prefix for the output file name'
    )
    parser.add_argument(
        '--max_tokens', type=int, default=2048, help='Maximum number of tokens to generate in the response'
    )
    return parser.parse_args()


def create_benchmark_prompt(question, choice1, choice2, choice3, choice4):
    """Create benchmark prompt in the specified format"""
    prompt = f"""Given the following question and four candidate answers (A, B, C and D), choose the best answer.
Question: {question} A. {choice1} B. {choice2} C. {choice3} D. {choice4}
For simple problems, directly provide the answer with minimal explanation. For complex problems, use step-by-step format. Always conclude with: The final answer is [the_answer_letter], where the [the_answer_letter] is one of A, B, C or D."""
    return prompt


def load_model(checkpoint_path):
    """Initialize and load the model for inference"""
    global deploy_process, base_url, chat_url, model_name

    SCRIPTS_PATH = "/opt/NeMo/scripts"
    WORKSPACE = "."

    deploy_script = f"{SCRIPTS_PATH}/deploy/nlp/deploy_in_fw_oai_server_eval.py"
    deploy_process = subprocess.Popen(
        ['python', deploy_script, '--nemo_checkpoint', checkpoint_path],
    )

    base_url = "http://0.0.0.0:8886"
    model_name = "triton_model"
    chat_url = f"{base_url}/v1/chat/completions/"

    wait_for_fastapi_server(base_url=base_url, max_retries=600, retry_interval=10)
    logging.info("Model loaded and server is ready for inference")


def get_response(prompt, max_tokens):
    chat_payload = {
        "messages": [{"role": "system", "content": "detailed thinking on"}, {"role": "user", "content": prompt}],
        "model": model_name,
        "max_tokens": max_tokens,
    }
    response = requests.post(chat_url, json=chat_payload)
    return response.content.decode()


def main():
    args = parse_args()

    # Determine dataset file and output file based on dataset selection
    dataset_files = {
        'gpqa_main': 'gpqa_dataset.jsonl',
        'mmlu': 'mmlu_dataset.jsonl',
        'gpqa_diamond': 'gpqa_diamond_dataset.jsonl',
    }

    dataset_file = dataset_files[args.dataset]
    output_file = f"{args.output_prefix}_{args.dataset}_evaluation.jsonl"

    try:
        with open(dataset_file, "r") as f:
            problems = [json.loads(line) for line in f]

        load_model(args.checkpoint_path)

        # Open output file once before the loop
        with open(output_file, "w") as f:
            for i, problem in enumerate(problems):
                print(f"\n{'='*70}")
                print(f"Problem {i+1}/{len(problems)}")

                prompt = create_benchmark_prompt(
                    problem['Question'],
                    problem['Choice 1'],
                    problem['Choice 2'],
                    problem['Choice 3'],
                    problem['Choice 4'],
                )

                response = get_response(prompt, args.max_tokens)

                # Create result entry
                result = {
                    "question": problem['Question'],
                    "choices": {
                        "A": problem['Choice 1'],
                        "B": problem['Choice 2'],
                        "C": problem['Choice 3'],
                        "D": problem['Choice 4'],
                    },
                    "expected_answer": problem['Answer'],
                    "model_response": response,
                }

                # Write to JSONL file
                f.write(json.dumps(result) + "\n")

        print(f"All results written to {output_file}")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Killing the server...")
        deploy_process.send_signal(signal.SIGINT)


if __name__ == "__main__":
    main()
```
tutorials/llm/reasoning/evaluation/evaluate_responses.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import csv
import json
import re

import pandas as pd


def extract_model_answer(response):
    if not response or "Internal Server Error" in response:
        return "Internal Server Error"

    # Look for the pattern "The final answer is <letter>"
    match = re.search(r"The final answer is ([A-D])", response)
    if match:
        return match.group(1)
    return ""


def process_answers(input_file, output_file):
    # Read the JSONL file
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            if line.strip():  # Skip empty lines
                data.append(json.loads(line))

    # Prepare CSV headers
    headers = [
        'Question',
        'Choice A',
        'Choice B',
        'Choice C',
        'Choice D',
        'Expected Answer',
        'Model Response',
        'Extracted Model Answer',
    ]

    # Write to CSV
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        # Process each question
        for question_data in data:
            question = question_data.get('question', '')
            choices = question_data.get('choices', {})
            expected_answer = question_data.get('expected_answer', '')
            model_response = question_data.get('model_response', '')

            # Extract model answer
            extracted_answer = extract_model_answer(model_response)

            # Write row
            row = [
                question,
                choices.get('A', ''),
                choices.get('B', ''),
                choices.get('C', ''),
                choices.get('D', ''),
                expected_answer,
                model_response,
                extracted_answer,
            ]
            writer.writerow(row)

    return output_file


def evaluate_results(csv_file, model_name):
    # Read the CSV file
    df = pd.read_csv(csv_file)

    # Calculate metrics
    total = len(df)
    correct = len(df[df['Extracted Model Answer'] == df['Expected Answer']])
    refusals = len(df[df['Extracted Model Answer'].str.contains('Internal Server Error', case=False, na=False)])

    # Print results
    print(f"\nModel: {model_name}")
    print(f"Total problems: {total}")
    print(f"Correct answers: {correct}")
    print(f"Refusals: {refusals}")
    print(f"Accuracy: {correct/total*100:.1f}% ({correct}/{total})")


def main():
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Process and evaluate model responses')
    parser.add_argument(
        '--input_file', type=str, required=True, help='Path to the input JSONL file containing model responses'
    )
    parser.add_argument('--output_file', type=str, required=True, help='Path to the output CSV file')
    parser.add_argument('--model_name', type=str, required=True, help='Name of the model for reporting results')

    args = parser.parse_args()

    # Process answers and generate CSV
    print(f"Processing answers from {args.input_file}...")
    csv_file = process_answers(args.input_file, args.output_file)
    print(f"CSV file has been generated: {csv_file}")

    # Evaluate results
    print("\nEvaluating results...")
    evaluate_results(csv_file, args.model_name)


if __name__ == "__main__":
    main()
```
