# Changelog

<!-- Next changelog -->
## NVIDIA Neural Modules 2.4.1

### Detailed Changelogs

#### Uncategorized

<details><summary>Changelog</summary>

- Update package_info.py by @ko3n1g :: PR: #14400
- Patch to address issue 14392 by @youngeunkwon0405 :: PR: #14398
- Cherry pick `Fix callbacks in DSV3 script (14350)` into `r2.4.0` by @chtruong814 :: PR: #14370
- Cherry pick `Change Llama Embedding Tutorial to use SFT by default (14231)` into `r2.4.0` by @chtruong814 :: PR: #14303
- Cherrypick `calculate_per_token_loss requirement for context parallel` (#14065) (#14282) into `r2.4.0` by @chtruong814 :: PR: #14448
- Pin nvidia-lm-eval to 25.6.1 by @chtruong814 :: PR: #14470

</details>

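The last entry in the list above pins `nvidia-lm-eval` to 25.6.1. As a minimal sketch (assuming the distribution is published under that name, e.g. installed with `pip install nvidia-lm-eval==25.6.1`), the pin can be verified at runtime:

```python
# Minimal sketch: verify the nvidia-lm-eval pin from PR #14470 at runtime.
# Assumes the distribution name "nvidia-lm-eval"; not taken verbatim from this repo.
from importlib.metadata import version

assert version("nvidia-lm-eval") == "25.6.1", "unexpected nvidia-lm-eval version"
```
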
## NVIDIA Neural Modules 2.4.0

### Highlights

- Collections:
  - Speech
    - Batched beam search for transducers (RNN-T and TDT); see the sketch after this list
    - RNNT/TDT buffered/streaming inference + batched decoding support in cache-aware
    - Add support for CTC batched beam search with GPU-LM
  - Key fixes
    - Punctuation Marks in Timestamps
    - Fix timestamps when CUDA graphs enabled
    - Fix masking of \<pad\> tokens in AED inference
    - TDT streaming inference fix
  - LLM
    - Qwen 3 235B-A22B Perf Optimized
    - DeepSeek V3 Perf Optimized
    - Gemma3 support from Google
    - Embedding and Reranker models
  - MM
    - Llama 4
    - AVLM
- Training performance (speed)
  - NVL SHARP + IB SHARP for DP/FSDP communications on H100 and B200
  - MXFP8 with TP communication overlap
  - MXFP8 with reduced memory allocation
  - FP8 sub-channel recipe (128x128 for weight and 1x128 for activation)
  - cuDNN fused attention for MLA (both Hopper and Blackwell)
  - Advanced custom asymmetric pipelining (for MTP, loss function, and embedding)
  - BF16 optimizer for model memory saving
  - CUDA graph fix for fine-tuning benchmarks
  - CUDA graph support for LLAMA4
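
As an illustration of the batched beam search highlight, the following hedged sketch switches a transducer model's decoding strategy through NeMo's `change_decoding_strategy` API. The checkpoint name, the strategy string `"malsd_batch"`, and the beam size are assumptions for illustration, not details taken from this changelog.

```python
# Hypothetical sketch: enable batched beam search for an RNN-T/TDT model.
# The strategy name "malsd_batch", checkpoint, and beam size are assumed.
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "malsd_batch"  # assumed name of the batched beam strategy
    decoding_cfg.beam.beam_size = 4        # small beam, illustration only
model.change_decoding_strategy(decoding_cfg)

transcripts = model.transcribe(["audio.wav"])  # path is a placeholder
```
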
### Detailed Changelogs

#### ASR