Conversation

@iankur iankur commented Dec 9, 2024

Motivation

This PR adds support for eval on a long context benchmark, InfiniteBench. See #1273 for more context.

Modifications

Following the discussion in #1273, this PR currently adds code from the TensorRT-LLM repo (link) to load the data, create prompts, and compute scores. Below are sample outputs for both backends using gradientai/Llama-3-8B-Instruct-Gradient-1048k with a maximum input length of ~130K tokens. Please check the README for more details and instructions on how to run both benchmarks. Currently, the predictions differ (see below), which I will try to fix.

SGLang

{"question_id": 0, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 71432.", "ground_truth": ["71432"]}
{"question_id": 1, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 69079.", "ground_truth": ["69079"]}
{"question_id": 2, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 89415.", "ground_truth": ["89415"]}
{"question_id": 3, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 61734.", "ground_truth": ["61734"]}
{"question_id": 4, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 40204.", "ground_truth": ["40204"]}
{"question_id": 5, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 80723.", "ground_truth": ["80723"]}
{"question_id": 6, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 55058.", "ground_truth": ["55058"]}
{"question_id": 7, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 16783.", "ground_truth": ["16783"]}
{"question_id": 8, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 15951.", "ground_truth": ["15951"]}
{"question_id": 9, "model_id": "gradientai/Llama-3-8B-Instruct-Gradient-1048k", "prediction": " 52933.", "ground_truth": ["52933"]}

TensorRT-LLM

{"id": 0, "prediction": " 71432.", "ground_truth": ["71432"], "input_lengths": [125339]}
{"id": 1, "prediction": " 69079.", "ground_truth": ["69079"], "input_lengths": [125339]}
{"id": 2, "prediction": " 89415.", "ground_truth": ["89415"], "input_lengths": [125339]}
{"id": 3, "prediction": " 61734.", "ground_truth": ["61734"], "input_lengths": [125339]}
{"id": 4, "prediction": " 40204.", "ground_truth": ["40204"], "input_lengths": [125339]}
{"id": 5, "prediction": " 80723.", "ground_truth": ["80723"], "input_lengths": [125339]}
{"id": 6, "prediction": " 55058.", "ground_truth": ["55058"], "input_lengths": [125339]}
{"id": 7, "prediction": " 16783. Remember it", "ground_truth": ["16783"], "input_lengths": [125339]}
{"id": 8, "prediction": " 15951.", "ground_truth": ["15951"], "input_lengths": [125339]}
{"id": 9, "prediction": " 52933.", "ground_truth": ["52933"], "input_lengths": [125339]}
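The scoring for a retrieval-style task like this one reduces to checking whether a ground-truth string appears in the model's prediction (which is why " 16783. Remember it" still counts as correct). A minimal sketch of such a scorer over these JSONL outputs — a hypothetical helper, not the actual scoring code adapted from TensorRT-LLM:

```python
import json


def score_predictions(path: str) -> float:
    """Fraction of samples whose prediction contains a ground-truth answer."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            total += 1
            # A hit if any ground-truth string occurs as a substring
            # of the prediction, ignoring surrounding punctuation/text.
            if any(gt in sample["prediction"] for gt in sample["ground_truth"]):
                correct += 1
    return correct / total if total else 0.0
```

Under this metric, both output files above would score 1.0 despite the extra "Remember it" in the TensorRT-LLM output for id 7.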

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

Collaborator

@zhyncs zhyncs left a comment

Nice work! Could we combine these scripts into just one? Something like this

SHAREGPT_URL = "https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json"

Implement the file download inside the script itself to make it more convenient for users.
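A minimal sketch of the suggested download-inside-the-script pattern, in the style of the ShareGPT example above (the URL and filename here are placeholders, not the actual InfiniteBench paths):

```python
import os
import urllib.request

# Placeholder URL; the real InfiniteBench data files live on Hugging Face.
DATA_URL = "https://example.com/passkey.jsonl"


def download_if_missing(url: str, data_dir: str = "./data") -> str:
    """Download the dataset file into data_dir unless it is already cached."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

The eval script can then call `download_if_missing(DATA_URL)` at startup, so users never fetch the data by hand.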

@zhyncs
Collaborator

zhyncs commented Dec 9, 2024

Additionally, the section about TensorRT-LLM is very good! Would you be willing to help improve this custom task script to make it easier to test TensorRT-LLM?
https://github.com/sgl-project/sglang/blob/main/test/srt/experiment_runner.py
ref #2407
If considering doing it, it can be implemented in another PR. Thanks!

@zhyncs
Collaborator

zhyncs commented Dec 9, 2024

close #1273

@zhyncs zhyncs self-assigned this Dec 9, 2024
@zhyncs
Collaborator

zhyncs commented Dec 9, 2024

gradientai/Llama-3-8B-Instruct-Gradient-1048k GradientAI LOL your previous work. cc @michaelfeil

@iankur
Author

iankur commented Dec 9, 2024

@zhyncs

Implement the process of downloading files into a script to make it more convenient for users.

Sounds good. I will merge the download script for SGLang; we can keep the separate download script for TensorRT.

I will also work on the custom task script PR. I am traveling, so it may take some time, but I will try to do it as soon as possible.

)
parser.add_argument("--data-dir", type=str, default="./data")
parser.add_argument("--start-idx", type=int, default=0)
parser.add_argument("--end-idx", type=int, default=None)
Contributor


Can you add more description for the "--start-idx" and "--end-idx" arguments?

Author


I removed these arguments, which were borrowed from the TensorRT-LLM eval script, and added --num-samples with a description.
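The replacement argument described above might look like this (the exact help text is an assumption, not the wording from the PR):

```python
import argparse

parser = argparse.ArgumentParser(description="Run the InfiniteBench evaluation.")
parser.add_argument("--data-dir", type=str, default="./data")
parser.add_argument(
    "--num-samples",
    type=int,
    default=None,
    help="Number of samples to evaluate; defaults to the full dataset.",
)
```

A single count with a sensible default is easier to document and use than a start/end index pair.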

@merrymercy
Contributor

Is this ready to be merged?
We can have this first and then add this to CI in the next PR.

@merrymercy
Contributor

cc @iankur and @zhyncs . Ready to merge this first part?

python convert_checkpoint.py \
--model_dir ./Llama-3-8B-Instruct-Gradient-1048k/ \
--output_dir /tmp/llama-3-8B-1048k/trt_ckpts \
--dtype float16
Collaborator


I see that the dtype specified in the model's config.json is bfloat16. Could you please explain why float16 is being specified here?

@zhyncs zhyncs closed this May 11, 2025