This recipe contains information and scripts to produce performance results for the Nemotron 4 340b pre-training workloads. The scripts help perform environment setup and launch benchmark jobs.
340 billion parameter variant (FP8/BF16)
- At least 256 GPUs with at least 80GB memory each.
- The H100 recipes listed below progressively increase GPU count, with configurations weak-scaled to match.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | EP | DP | VP | MBS | GBS | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 8 | 1 | NA | 4 | 12 | 1 | 64 | 16 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 8 | 1 | NA | 8 | 12 | 1 | 128 | 16 |
| 340b | BF16/FP8 | 1024 | 4096 | 96 | 8 | 8 | 1 | NA | 16 | 12 | 1 | 256 | 16 |
| 340b | BF16/FP8 | 2048 | 4096 | 96 | 8 | 8 | 1 | NA | 32 | 12 | 1 | 512 | 16 |
340 billion parameter variant (FP8/BF16)
- At least 128 GPUs with at least 186GB memory each.
- The GB200 recipes listed below progressively increase GPU count, with configurations weak-scaled to match.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 32 | 4 | 8 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 64 | 8 | 8 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 128 | 16 | 8 |
- Strong Scaling, where GBS remains constant as GPU count increases and gradient accumulation decreases.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 4 | 128 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 8 | 64 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 16 | 32 |
340 billion parameter variant (FP8/BF16)
- At least 128 GPUs with at least 288GB memory each.
- The GB300 recipes listed below progressively increase GPU count, with configurations weak-scaled to match.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 32 | 4 | 8 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 64 | 8 | 8 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 128 | 16 | 8 |
- Strong Scaling, where GBS remains constant as GPU count increases and gradient accumulation decreases.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 4 | 128 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 8 | 64 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 16 | 32 |
340 billion parameter variant (FP8/BF16)
- At least 128 GPUs with at least 180GB memory each.
- The B200 recipes listed below progressively increase GPU count, with configurations weak-scaled to match.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 32 | 4 | 8 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 64 | 8 | 8 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 128 | 16 | 8 |
| 340b | BF16/FP8 | 1024 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 256 | 32 | 8 |
- Strong Scaling, where GBS remains constant as GPU count increases and gradient accumulation decreases.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | VP | MBS | GBS | DP | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 340b | BF16/FP8 | 128 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 4 | 128 |
| 340b | BF16/FP8 | 256 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 8 | 64 |
| 340b | BF16/FP8 | 512 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 16 | 32 |
| 340b | BF16/FP8 | 1024 | 4096 | 96 | 8 | 4 | 1 | 12 | 1 | 512 | 32 | 16 |
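The parallelism and batch-size columns in the tables above are tied together by two simple identities: DP is the GPU count divided by TP * PP * CP, and GBS is DP * MBS * GA. A minimal Python sketch of that check (the helper name is ours), using the 128-GPU rows as examples:

```python
# Sketch: how the parallelism columns in the tables above relate.
#   DP  = GPUs / (TP * PP * CP)
#   GBS = DP * MBS * GA
# Weak scaling grows GBS with GPU count; strong scaling fixes GBS and shrinks GA.

def derive(gpus: int, tp: int, pp: int, cp: int, mbs: int, ga: int) -> tuple[int, int]:
    """Return the (DP, GBS) implied by the other table columns."""
    dp = gpus // (tp * pp * cp)
    gbs = dp * mbs * ga
    return dp, gbs

print(derive(gpus=128, tp=8, pp=4, cp=1, mbs=1, ga=8))    # (4, 32)  - weak-scaling row
print(derive(gpus=128, tp=8, pp=4, cp=1, mbs=1, ga=128))  # (4, 512) - strong-scaling row
```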
Performance for Nemotron4 training is measured in seconds per iteration, in other words seconds per training step. This metric is logged for every training step in the main training log file (see Output Locations).
This recipe supports two cluster types: Slurm and Run:ai. We expect cluster type to have a negligible impact on performance; application performance on a Run:ai cluster should match performance on a Slurm cluster.
Since the early training steps typically take much longer (due to input prefetch, activation memory allocation, and JIT compilation), we use the parse_train_timing.sh script to analyze iterations 35-44 and calculate the mean and standard deviation for reliable performance metrics. We also report the achieved GPU FLOPS via the TFLOPS_per_GPU metric.
To analyze training timing from your experiment results, run the script from the workload directory. In an installed environment, recipe files are available under $LLMB_INSTALL/llmb_repo (a copy created by the installer).
# Basic usage - parses results in the directory named 'experiments' in the current folder
$LLMB_INSTALL/llmb_repo/common/parse_train_timing.sh
# Specify a different experiments directory
$LLMB_INSTALL/llmb_repo/common/parse_train_timing.sh /path/to/experiments
# Output in CSV format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing.sh --format=csv
# Output in JSON format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing.sh --format=json
# Show full filenames instead of shortened versions
$LLMB_INSTALL/llmb_repo/common/parse_train_timing.sh --full-names
Example output:
Train Step Timing and TFLOPS Analysis (iterations 35-44)
================================================================================
Experiment Status Time Mean (s) Time Std (s) TFLOPS_per_GPU Mean TFLOPS_per_GPU Std
------------------------------------------------------------------------------------------ -------- ------------- ------------ ------------------- ------------------
pretrain_nemotron4_340b_fp8_gpus128_tp8_pp4_cp1_vp12_mbs1_gbs512_389757 Success 21.542 0.010 1600.74 0.78
To obtain throughput as a tokens per second measurement, follow this formula:
(sequence length) * (global batch size) / (training_step_timing) = (throughput in tokens per second)
E.g. 4096 * 256 / 2.693 = 389371
To calculate time to train estimate:
(total tokens) / (throughput in tokens per second) / (number of seconds in a day) = (time to train in days)
E.g. 1e12 / 389371 / 86400 = 29.7 days
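The two formulas above translate directly into a few lines of Python; this is just a convenience sketch (function names are ours) that reproduces the worked examples:

```python
# Sketch of the throughput and time-to-train formulas above.

def tokens_per_second(seq_len: int, gbs: int, step_time_s: float) -> float:
    return seq_len * gbs / step_time_s

def days_to_train(total_tokens: float, tok_per_s: float) -> float:
    return total_tokens / tok_per_s / 86400  # 86400 seconds per day

tps = tokens_per_second(seq_len=4096, gbs=256, step_time_s=2.693)
print(round(tps))                          # ~389372 tokens/s
print(round(days_to_train(1e12, tps), 1))  # ~29.7 days for 1T tokens
```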
To calculate the model flops utilization (MFU):
MFU = (global batch size) * (model flops) / (training step time) / (number of GPUs) / (peak GPU FLOPS)
The model flops for Nemotron4 340b for GBS=1 is 8.62124e15. Calculation shown here.
E.g. Nemotron4 340b BF16 on 256x H100 GPUs (GBS=64)
peak FLOPS for H100 BF16 = 989 TFLOPS
training step time = 4.31 s
model flops = 8.62124e15
MFU = 64 * 8.62124e15 / 4.31 / 256 / 989e+12 = 50.5%
Peak theoretical throughput across GPUs and Data Types (in TFLOPS)
| Data Type | GB300 | GB200 | B200 | H100 |
|---|---|---|---|---|
| BF16 | 2450 | 2450 | 2250 | 989 |
| FP8 | 4900 | 4900 | 4500 | 1979 |
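Putting the MFU formula and the peak-throughput table together, here is a minimal Python sketch (the constant and function names are ours) that reproduces the H100 example above:

```python
# Sketch of the MFU formula above.
# Peak theoretical throughput per GPU (FLOPS), taken from the table above.
PEAK_FLOPS = {
    ("h100", "bf16"): 989e12,   ("h100", "fp8"): 1979e12,
    ("b200", "bf16"): 2250e12,  ("b200", "fp8"): 4500e12,
    ("gb200", "bf16"): 2450e12, ("gb200", "fp8"): 4900e12,
    ("gb300", "bf16"): 2450e12, ("gb300", "fp8"): 4900e12,
}

MODEL_FLOPS_340B = 8.62124e15  # model flops for GBS=1, derived at the end of this document

def mfu(gbs: int, step_time_s: float, num_gpus: int, gpu: str, dtype: str) -> float:
    return gbs * MODEL_FLOPS_340B / step_time_s / num_gpus / PEAK_FLOPS[(gpu, dtype)]

# Worked example: 340b BF16 on 256x H100, GBS=64, 4.31 s/step
print(mfu(64, 4.31, 256, "h100", "bf16"))  # ~0.506, i.e. roughly the 50.5% quoted above
```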
A HuggingFace account is required and you will need to create a HuggingFace access token. Add the generated token to your environment via export HF_TOKEN=<your token>.
Requires Python 3.12.x, or conda.
No special access is required to run this benchmark.
We reference a number of Slurm commands and parameters in this document. A brief summary is included below. It's important to note these are a guide and might not be applicable to all environments. Please consult with your system administrator for the parameters that are specific to your system.
Common parameters:
- `SBATCH_PARTITION` or `-p` - Partition (or queue) to use.
- `SBATCH_ACCOUNT` or `-A` - Slurm account to associate with your job, different from your user. Meant for accounting purposes.
- `SBATCH_GPUS_PER_NODE` or `--gres=gpu:<num gpus>` - If your cluster is configured with GRES, this should be set to all GPUs in a node. Ignore if not configured.
  - Encountering errors such as 'GPUs not found' or 'Cannot submit to this partition without GPU resources' means this setting is required.
These parameters can be set either by exporting the environment variable or using the corresponding sbatch flag.
Use the installer referenced in the main README to prepare the recipe environment:
The following directory layout and key variables are used in the recipe:
- `LLMB_INSTALL`: Top-level directory for all benchmarking artifacts (images, datasets, venvs, workloads, etc).
- `LLMB_WORKLOAD`: Workload-specific directory, e.g. `${LLMB_INSTALL}/workloads/pretrain_nemotron4-340b`.
- Results, logs, and checkpoints are stored under subfolders of `LLMB_WORKLOAD` (see below).
# Set the path where all artifacts will be downloaded
export LLMB_INSTALL=<path to your shared file system folder> (e.g. /lustre/myproject/nemo)
Important: the LLMB_INSTALL used in this step must be used when running the workload.
The workload relies on a Python virtual environment to ensure there are no conflicts between required dependencies and the user's packages. We require Python 3.12.x for the workload to work.
There are multiple choices available to set up virtual environment:
- conda
- python venv
To install and activate a conda virtual environment:
# pick INSTALL_PATH with sufficient disk space
INSTALL_PATH=~
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $INSTALL_PATH/miniconda.sh
bash $INSTALL_PATH/miniconda.sh -b -p $INSTALL_PATH/miniconda3
$INSTALL_PATH/miniconda3/bin/conda init
source ~/.bashrc
conda create -n nemo2-nemotron python=3.12
conda activate nemo2-nemotron
When you are finished running this benchmark, deactivate the environment with this command:
conda deactivate
To install and activate a python venv:
python3 -m venv $LLMB_INSTALL/venvs/<venv_name>
source $LLMB_INSTALL/venvs/<venv_name>/bin/activate
When you are finished running this benchmark, deactivate the environment with this command:
deactivate
Create an install directory by running the attached setup.sh.
# activate virtual python environment setup previously
export CLUSTER_TYPE=runai
./setup.sh
Note: the output log from running the setup.sh script may include an error about the tritonclient dependency. The error can be ignored as it doesn't affect benchmark functionality.
It would look like this:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tritonclient 2.51.0 requires urllib3>=2.0.7, but you have urllib3 1.26.20 which is incompatible.
- Open https://nvidia.run.ai/ web page and login to your Run:ai cluster
- Install Run:ai (CLI tool) by clicking top right "? | Researcher Command Line Interface" and following instructions
- Obtain latest kubeconfig file (e.g., my_runaicluster.kubeconfig) from cluster administrator
cp my_runaicluster.kubeconfig ~/.kube/config
- Login to your cluster via the Run:ai CLI:
  - `runai login` if browser login is available, OR `runai login remote-browser` otherwise, which lets you authenticate from any machine and copy/paste the authentication code
  - `runai kubeconfig set`
- Set your current cluster via the Run:ai CLI tool:
  - `runai cluster`
  - Pick your cluster and press Enter
Required environment variables:
Make sure that the environment variables below have been set before submitting recipe job on Run:AI cluster.
- `RUN_CONF_MOUNTS` - PVCs that will be mounted into your environment. This should be in the form `name:path:k8s-claimName`. Confirm that the path to the PVC exists in the Run:ai UI under "Assets | Data Sources". e.g., `export RUN_CONF_MOUNTS="workspace:/nemo2-workspace:nemo2-workspace-project-hprdj"`
- `BASE_URL` - NVIDIA Run:ai API url to use for the experiment. Should look like `https://<base-url>/api/v1`. e.g., `export BASE_URL=https://nvidia.run.ai/api/v1`
- `APP_ID` - name of the NVIDIA Run:ai Application. The APP_ID value should be listed in the Run:ai web UI under "Access | Applications". e.g., `export APP_ID=llmb-qa`
- `APP_SECRET` - NVIDIA Run:ai Application secret. Make sure the value is correct before launching the job. Contact your Run:ai cluster admin to provide you with the latest application secret.
- `PROJECT_NAME` - NVIDIA Run:ai Project to run the experiment in. e.g., `llmb-testing`
- `PVC_DIR` - path to the nemo-run dir in the mounted PVC. Should look like `/mount/nemo-run`. e.g., `export PVC_DIR="/nemo2-workspace/nemo-run"`
These parameters must be set to launch your job on Run:ai. They can be set by exporting the environment variables.
Since Nemotron4 training only uses synthetic datasets, this step is omitted.
Once the environment has been prepared, it is time to train a model. The training runs for the first 50 steps and then stops. Log files and results are stored under the ${LLMB_WORKLOAD}/experiments/ folder (see Output Locations for details).
The easiest way to run benchmarks is using the llmb-run launcher tool. This method handles configuration automatically and provides a streamlined interface.
# Navigate to your installation directory
cd $LLMB_INSTALL
# Run a benchmark with llmb-run
llmb-run submit -w pretrain_nemotron4-340b -s 340b --dtype bf16 --scale 256
For more details on llmb-run usage, see the llmb-run documentation.
Alternatively, you can run training directly using the launch script. This method provides more control over individual parameters and environment variables.
The training will run for the first 50 steps and will stop afterwards. Log files and results will be located under the ${LLMB_WORKLOAD}/experiments/ folder.
Important:
- Ensure your virtual environment is activated before running the training commands below. If you used the installer with conda, run `conda activate $LLMB_INSTALL/venvs/<env_name>`. If you used the installer with python venv, run `source $LLMB_INSTALL/venvs/<env_name>/bin/activate`.
- Run the launch script from the installed recipe directory:
cd $LLMB_INSTALL/llmb_repo/nemotron4-340b/
JOB_TOTAL_GPUS=<number> GPU_TYPE=<type> [DTYPE=<precision>] [MODEL_SIZE=<size>] [CLUSTER_TYPE=<type>] [STRONG_SCALING=<bool>] [ADDITIONAL_SLURM_PARAMS=<params>] ./launch.sh
Required:
- `JOB_TOTAL_GPUS`: Number of GPUs to use
- `GPU_TYPE`: Type of GPU hardware
  - `gb300` - NVIDIA GB300 GPUs
  - `gb200` - NVIDIA GB200 GPUs
  - `b200` - NVIDIA B200 GPUs
  - `h100` - NVIDIA H100 GPUs
Optional:
- `DTYPE`: Precision format (default: `fp8`)
  - `fp8` - FP8 precision
  - `bf16` - BFloat16 precision
- `MODEL_SIZE`: Model variant
  - `340b` - 340 billion parameter model
- `CLUSTER_TYPE`: Cluster management system (default: `slurm`)
  - `slurm` - Slurm workload manager
  - `runai` - Run:AI platform
- `STRONG_SCALING`: Scaling behavior (default: `false`)
  - `true` - Keep global batch size constant when scaling GPUs
  - `false` - Scale global batch size with GPU count
- `ADDITIONAL_SLURM_PARAMS`: Additional SLURM parameters (optional)
  - Format: Semicolon-separated key=value pairs (use semicolons when values contain commas)
  - Example: `"nodelist=node001,node002;constraint=gpu"`
Train Nemotron4 340B with FP8 precision on 256 GB200 GPUs:
JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
Train Nemotron4 340B, FP8 precision with strong scaling enabled on 256 B200 GPUs:
JOB_TOTAL_GPUS=256 GPU_TYPE=b200 STRONG_SCALING=true ./launch.sh
Train Nemotron4 340B on nodes with specific constraints:
ADDITIONAL_SLURM_PARAMS="constraint=gpu&memory;exclusive" JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
Train with reservation and additional constraints:
ADDITIONAL_SLURM_PARAMS="reservation=gpu_reservation;constraint=gpu" JOB_TOTAL_GPUS=128 GPU_TYPE=gb200 ./launch.sh
All benchmark results are saved under $LLMB_WORKLOAD/experiments/ with the following structure:
experiments/
├── <experiment_name>/
│ └── <experiment_name>_<timestamp>/
│ ├── <experiment_name>/
│ │ ├── log-<experiment_name>.out # Main training log with performance data
│ │ ├── sbatch_<experiment_name>.out # Batch script output
│ │ └── nsys_profile/ # Profiling output (when enabled)
│ │ └── *.nsys-rep files
│ └── [batch scripts and other files]
The <experiment_name> typically follows the pattern: pretrain_nemotron4_<size>_<dtype>_<scale>_<config>
Key files:
- `log-<experiment_name>.out` - Contains training step timing and performance metrics analyzed by `parse_train_timing.sh`
- `nsys_profile/` - Contains profiling traces when `ENABLE_PROFILE=true`
To enable profiling with Nsight Systems, set the variable ENABLE_PROFILE=true when submitting your job. The job will run for a total of 50 steps, and steps 45-50 will be profiled.
In order to view the resulting profiles, ensure you have the latest version of Nsight Systems installed. For more information visit: Nsight Systems
- MPI Ranks: 0-8
- Job Steps: 45-50
- Output Location: Profiling output saved alongside training results (see Output Locations)
- Filename format:
profile_${SLURM_JOB_ID}_nodeId_rankId.nsys-rep
Example command:
ENABLE_PROFILE=true JOB_TOTAL_GPUS=128 DTYPE=fp8 GPU_TYPE=gb200 ./launch.sh
- Specify job steps to profile:
  - `PROFILE_START_STEP`: start profiling on this job step.
    - Default: 45
  - `PROFILE_STOP_STEP`: stop profiling on this job step.
    - Default: 50
- Enable GPU metrics collection:
  - `ENABLE_GPU_METRICS`: Enable GPU metrics collection during Nsight profiling (default: false)
    - When set to `true` along with `ENABLE_PROFILE=true`, captures detailed GPU performance metrics
    - Provides additional GPU utilization, memory usage, and compute efficiency data
    - May require additional system configuration for GPU device metrics to work properly
Example command with GPU metrics:
ENABLE_PROFILE=true ENABLE_GPU_METRICS=true JOB_TOTAL_GPUS=128 GPU_TYPE=gb200 DTYPE=fp8 ./launch.sh
In order to view the profile traces (*.nsys-rep files) interactively:
- Install the latest Nsight Systems client on your preferred system
- Copy the generated .nsys-rep files to a folder on your preferred system. E.g., /home/nsight-traces/
- Open the Nsight Systems client, then click "File | Open" and select one or more .nsys-rep files from the /home/nsight-traces folder. For more details, see the Reading Your Report in GUI guide.
- Once loaded you can analyze the workload behavior to learn about any performance bottlenecks associated with the model or the job run.
Since most of the benchmarking jobs run on multiple GPUs, there will be multiple .nsys-rep files generated for each run. Multi-Report Analysis Guide will be very helpful to automate the analysis and get to results quicker by using Nsight recipes.
See these tutorials to get a quick start if you are new to Nsight profiling.
Checkpoint save and load can be enabled for this workload in order to measure the impact of storage on checkpointing operations. The additional collected metrics are: time to save a checkpoint and time to load a checkpoint.
The save checkpoint feature works for Nemotron4 340b with either FP8 or BF16 precision. Make sure your file system has sufficient disk space to accommodate the checkpoint sizes below:
| Model | Checkpoint Size | Minimum Tested Scale H100 | Minimum Tested Scale GB200 | Minimum Tested Scale B200 |
|---|---|---|---|---|
| 340b | ~4.4 TB | 512 | 128 | 128 |
To save checkpoints after pretraining the Nemotron4 model for max_steps, set the environment variable ENABLE_CHECKPOINT=true. At the end of pretraining, the checkpoints will be saved in the ${LLMB_WORKLOAD}/experiments folder.
experiment_name = pretrain_nemotron4_${MODEL_SIZE}_${DTYPE}_${JOB_TOTAL_GPUS}
timestamp = date '+%s'
Example directory where checkpoints are saved is ${LLMB_WORKLOAD}/experiments/$experiment_name/${experiment_name}_${timestamp}/$experiment_name/code/nemo_experiments/default/checkpoints/
Command to run nemotron4 with checkpoint save enabled:
ENABLE_CHECKPOINT=true DTYPE=<precision> MODEL_SIZE=<size> JOB_TOTAL_GPUS=<number> GPU_TYPE=<type> STRONG_SCALING=<bool> ./launch.sh
- Check that the ${LLMB_WORKLOAD}/experiments/$experiment_name/${experiment_name}_${timestamp}/$experiment_name/code/nemo_experiments/default/checkpoints/*/weights folder contains *.distcp files
- Check the job output log-*.out file (see Training section for reference) for entries like
[NeMo I 2025-04-11 14:48:45 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 50 : Start time: 1744408121.151s : Save duration: 4.389s
Load checkpoint feature works successfully at the following scales:
| Model | Minimum Tested Scale H100 | Minimum Tested Scale GB200 | Minimum Tested Scale B200 |
|---|---|---|---|
| 340b | 512 | 128 | 128 |
Note:
- Running the load checkpoint feature at other scales may run into CUDA OOM errors.
To resume training from saved checkpoints, set the LOAD_CHECKPOINT_PATH=<path_to_checkpoint_directory> environment variable. Make sure the checkpoint files are under the ${LLMB_WORKLOAD}/experiments directory and the LOAD_CHECKPOINT_PATH variable is set to the parent folder of the weights directory containing the distributed checkpoint files with extension *.distcp.
E.g., if the checkpoint was saved under ${LLMB_WORKLOAD}/experiments/pretrain_nemotron4_340b_fp8_128/pretrain_nemotron4_340b_fp8_128_<timestamp>/pretrain_nemotron4_340b_fp8_128/code/nemo_experiments/default/checkpoints/*/weights then set the environment variable to a directory one level higher:
LOAD_CHECKPOINT_PATH=${LLMB_WORKLOAD}/experiments/pretrain_nemotron4_340b_fp8_128/pretrain_nemotron4_340b_fp8_128_<timestamp>/pretrain_nemotron4_340b_fp8_128/code/nemo_experiments/default/checkpoints/default--None=0.0000-epoch=0-consumed_samples=12800.0
The scripts will restore configuration from the checkpoint and resume training process. Training will run for 1 step after checkpoint has been loaded.
LOAD_CHECKPOINT_PATH=<your_path_to_checkpoint_directory> JOB_TOTAL_GPUS=<number> GPU_TYPE=<type> DTYPE=<precision> MODEL_SIZE=<size> STRONG_SCALING=<true/false> ./launch.sh
To validate that the checkpoint was loaded successfully, look for an entry like the one below in the main job log-*.out file (see Training section for reference):
[NeMo I 2025-04-11 14:46:18 nemo_logging:393] Global Checkpoint Load : Rank : 0 : Start time : 1744407969.270s : Time spent in load_checkpoint: 9.712s
For GB200 you may see the following error message
[rank368]:[E808 04:21:41.160918398 ProcessGroupNCCL.cpp:655] [Rank 368] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600001 milliseconds before timing out.
[rank368]:[E808 04:21:41.161005534 ProcessGroupNCCL.cpp:2299] [PG ID 0 PG GUID 0(default_pg) Rank 368] failure detected by watchdog at work sequence id: 5 PG status: last enqueued work: 5, last completed work: 4
[rank368]:[E808 04:21:41.161011710 ProcessGroupNCCL.cpp:693] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank368]:[E808 04:21:41.161045406 ProcessGroupNCCL.cpp:2147] [PG ID 0 PG GUID 0(default_pg) Rank 368] First PG on this rank to signal dumping.
To fix, try running with TP_COMM_OVERLAP disabled like so:
TP_COMM_OVERLAP=False llmb-run submit -w pretrain_nemotron4-340b -s 340b --dtype bf16 --scale 256
model flops = (sequence length) * ((attention flops) + (mlp flops) + (embedding flops))
model flops breakdown:
attention flops = 12 * (number of layers) * (hidden size)^2 * (1 + (number of query groups) / (number of heads) + (sequence length) / (hidden size))
mlp flops = 12 * (number of layers) * (hidden size) * (ffn hidden size)
embedding flops = 6 * (vocab size) * (hidden size)
Nemotron4 340b calculation:
sequence length = 4096
number of layers = 96
hidden size = 18432
vocab size = 256000
ffn hidden size = 73728
number of heads = 96
number of query groups = 8
attention flops = 12 * 96 * 18432^2 + 12 * 96 * 18432^2 * 8 / 96 + 12 * 96 * 18432 * 4096 = 391,378,894,848 + 32,614,907,904 + 86,973,087,744 = 510,966,890,496
mlp flops = 12 * 96 * 18432 * 73728 = 1,565,515,579,392
embedding flops = 6 * 256000 * 18432 = 28,311,552,000
model flops = 4096 * (510,966,890,496 + 1,565,515,579,392 + 28,311,552,000) = 8.62124e15
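The same arithmetic as a short Python sketch that simply re-evaluates the formulas above:

```python
# Re-derive the Nemotron4 340b model flops (per sample, GBS=1) from the formulas above.
seq_len    = 4096
num_layers = 96
hidden     = 18432
vocab      = 256000
ffn_hidden = 73728
num_heads  = 96
num_groups = 8  # query groups

attention_flops = 12 * num_layers * hidden**2 * (1 + num_groups / num_heads + seq_len / hidden)
mlp_flops       = 12 * num_layers * hidden * ffn_hidden
embedding_flops = 6 * vocab * hidden

model_flops = seq_len * (attention_flops + mlp_flops + embedding_flops)
print(f"{model_flops:.5e}")  # 8.62124e+15
```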
Note: Per-tensor delayed scaling recipe is used for FP8 training here.