# Changelog

<!-- Next changelog -->
## NVIDIA Neural Modules 2.4.1

### Detailed Changelogs

#### Uncategorized

<details><summary>Changelog</summary>

- Update package_info.py by @ko3n1g :: PR: #14400
- Patch to address issue 14392 by @youngeunkwon0405 :: PR: #14398
- Cherry pick `Fix callbacks in DSV3 script (14350)` into `r2.4.0` by @chtruong814 :: PR: #14370
- Cherry pick `Change Llama Embedding Tutorial to use SFT by default (14231)` into `r2.4.0` by @chtruong814 :: PR: #14303
- Cherrypick `calculate_per_token_loss requirement for context parallel` (#14065) (#14282) into `r2.4.0` by @chtruong814 :: PR: #14448
- Pin nvidia-lm-eval to 25.6.1 by @chtruong814 :: PR: #14470

</details>

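The last entry in the list above pins `nvidia-lm-eval` to 25.6.1. As a minimal sketch (assuming the distribution is published under that name, e.g. installed with `pip install nvidia-lm-eval==25.6.1`), the pin can be verified at runtime:

```python
# Minimal sketch: verify the nvidia-lm-eval pin from PR #14470 at runtime.
# Assumes the distribution name "nvidia-lm-eval"; not taken verbatim from this repo.
from importlib.metadata import version

assert version("nvidia-lm-eval") == "25.6.1", "unexpected nvidia-lm-eval version"
```
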
## NVIDIA Neural Modules 2.4.0

### Highlights

- Collections:
  - Speech
    - Batched beam search for transducers (RNN-T and TDT); see the sketch after this list
    - RNNT/TDT buffered/streaming inference + batched decoding support in cache-aware
    - Add support for CTC batched beam search with GPU-LM
  - Key fixes
    - Punctuation Marks in Timestamps
    - Fix timestamps when CUDA graphs enabled
    - Fix masking of \<pad\> tokens in AED inference
    - TDT streaming inference fix
  - LLM
    - Qwen 3 235B-A22B Perf Optimized
    - DeepSeek V3 Perf Optimized
    - Gemma3 support from Google
    - Embedding and Reranker models
  - MM
    - Llama 4
    - AVLM
- Training performance (speed)
  - NVL SHARP + IB SHARP for DP/FSDP communications on H100 and B200
  - MXFP8 with TP communication overlap
  - MXFP8 with reduced memory allocation
  - FP8 sub-channel recipe (128x128 for weight and 1x128 for activation)
  - cuDNN fused attention for MLA (both Hopper and Blackwell)
  - Advanced custom asymmetric pipelining (for MTP, loss function, and embedding)
  - BF16 optimizer for model memory saving
  - CUDA graph fix for fine-tuning benchmarks
  - CUDA graph support for LLAMA4
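
As an illustration of the batched beam search highlight, the following hedged sketch switches a transducer model's decoding strategy through NeMo's `change_decoding_strategy` API. The checkpoint name, the strategy string `"malsd_batch"`, and the beam size are assumptions for illustration, not details taken from this changelog.

```python
# Hypothetical sketch: enable batched beam search for an RNN-T/TDT model.
# The strategy name "malsd_batch", checkpoint, and beam size are assumed.
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "malsd_batch"  # assumed name of the batched beam strategy
    decoding_cfg.beam.beam_size = 4        # small beam, illustration only
model.change_decoding_strategy(decoding_cfg)

transcripts = model.transcribe(["audio.wav"])  # path is a placeholder
```
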
### Detailed Changelogs

#### ASR