
add nemo_bridge#1050

Open
chai-xiaonan wants to merge 10 commits into flagos-ai:main from chai-xiaonan:add_nemo_bridge

Conversation

@chai-xiaonan

Reconstruct the Nemo-Bridge based on the restructured FlagScale version. FlagScale now supports part of nemo-bridge's functionality, enabling the FlagScale framework to load and save checkpoints in HF format during training. This version also adds a new feature: the number of iterations between HF weight saves can be configured via save_hf_interval. Accuracy has been verified to be correct for DeepSeek-V3 16_a3B, Qwen3-32B, and Qwen3-0.6B.

# Load the HF model from config
config_load = args.hf_config_path
config = safe_load_config_with_retry(config_load, trust_remote_code=False)
bridge = AutoBridge.from_hf_config(config)
Collaborator

Will this save-ckpt step allocate extra GPU memory when initializing an HF model?
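A quick way to check: if the bridge builds the HF model with accelerate's init_empty_weights (or an equivalent meta-device context), no parameter memory is allocated on GPU or CPU until the weights are materialized. A minimal sketch of the underlying pattern, assuming plain PyTorch (the PR's own helper names are not reproduced here):

```python
import torch

# Constructing a module under the "meta" device records only shapes and
# dtypes; no real parameter storage (GPU or CPU) is allocated until the
# tensors are explicitly materialized (e.g. via to_empty or
# set_module_tensor_to_device, as the PR does).
with torch.device("meta"):
    layer = torch.nn.Linear(8192, 8192)

assert layer.weight.device.type == "meta"
```

So an init_empty_weights-based path should not allocate extra GPU memory by itself; the cost appears only when tensors are materialized, and in this PR they are materialized on "cpu".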

bridge.load_hf_weights(ddp_model)
# no optimizer weight
iteration=0
num_floating_point_operations_so_far=0
Collaborator

please add print_rank_0 here

# use megatron bridge
from megatron.nemo_bridge.models import AutoBridge
bridge=AutoBridge.from_hf_pretrained(load_dir)
bridge.load_hf_weights(ddp_model)
Collaborator

Can nemo-bridge’s load_hf_model handle a ddp_model directly, where ddp_model is wrapped by DistributedDataParallel?
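If it cannot, unwrapping first is cheap. A hypothetical helper sketching the usual pattern — Megatron's own unwrap_model does this with an explicit list of wrapper classes; this sketch simply follows the `.module` convention that DistributedDataParallel (and similar wrappers) use:

```python
def unwrap_ddp(model):
    """Peel wrapper layers (DistributedDataParallel, Float16Module, ...)
    that expose the wrapped model as `.module`, until the bare module
    remains. Illustrative only; not the PR's actual helper."""
    while hasattr(model, "module"):
        model = model.module
    return model
```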

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

NeMo Megatron-Bridge supports installation via pip, ref https://pypi.org/project/megatron-bridge/
Please remove the vendored source code.

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

Rename flagscale/train/megatron/nemo_bridge to flagscale/train/megatron/bridge so that it matches the import pattern from megatron.bridge

Contributor

@tengqm tengqm left a comment


When copy-pasting source code from other repos, we are obliged to copy their copyright notice as well. We cannot claim copyright for this code.
The original code has the following copyright header, which must be preserved:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,110 @@
# Copyright (c) 2025, BAAI. All rights reserved.
#
# Copied from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
Contributor

If Megatron-Bridge has a copyright claim, we are supposed to paste their copyright statements here.


if not has_implementation:
raise ValueError(
f"\n�~\~W Model architecture '{architecture}' is not yet supported\n\n"
Contributor

What are these weird characters?
There are some other similar cases in this string.

@@ -0,0 +1,359 @@
# Copyright (c) 2025, BAAI. All rights reserved.
#
# Mainly adapted from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
Contributor

Please clarify what has been "borrowed".
Please also paste the original copyright claim here if the code was not originally written by us.

@@ -0,0 +1,202 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Contributor

It looks to me like this file was largely adapted from flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py. We copy-pasted the source and are claiming copyright for this code. This is not acceptable.

We can borrow code from other projects, provided the license terms grant us that right. Even then, we still have to credit the original authors; we are obliged to mention their copyrights.

There are also some weird characters in this file, obviously from a character-conversion problem during copy/paste. Please fix them as well.

@chai-xiaonan
Author

chai-xiaonan commented Feb 5, 2026 via email

Copilot AI left a comment

Pull request overview

This PR adds a megatron.nemo_bridge component and integrates it into Megatron training so FlagScale can (a) warm-start from Hugging Face checkpoints and (b) periodically save Hugging Face-format weights during training based on --save-hf-interval.

Changes:

  • Add new megatron.nemo_bridge bridge implementation with DeepSeek-V3 and Qwen3 support and HF<->Megatron conversion utilities.
  • Extend checkpoint save/load to optionally export/import Hugging Face weights (--save-hf, --load-hf) with a configurable save interval.
  • Add new CLI arguments and validation for Hugging Face checkpointing/export configuration.
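For reference, the new flags could be wired up roughly as in the sketch below. The flag names and the help strings for --load-hf, --save-hf, and --hf-config-path follow the diff in this PR; the argument group name, the --save-hf-interval help text, and the validation rule are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("huggingface checkpointing")
group.add_argument('--load-hf', action='store_true', default=None,
                   help='Use the HF format for warm start, and save it in the torch_dict '
                        'format while also saving it in the HF format.')
group.add_argument('--save-hf', action='store_true', default=None,
                   help='Save as Hugging Face format checkpoint.')
group.add_argument('--hf-config-path', type=str, default=None,
                   help='Path to the Hugging Face config used to build the model.')
group.add_argument('--save-hf-interval', type=int, default=None,
                   help='Number of iterations between Hugging Face weight saves (assumed).')

args = parser.parse_args(['--save-hf', '--hf-config-path', '/tmp/cfg',
                          '--save-hf-interval', '1000'])

# Validation mirroring the PR's described behavior (assumed semantics):
if args.save_hf and args.save_hf_interval is None:
    raise ValueError('--save-hf requires --save-hf-interval')
```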

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 18 comments.

File Description
flagscale/train/megatron/training/yaml_arguments.py Minor formatting tweak in transformer config construction.
flagscale/train/megatron/training/checkpointing.py Adds HF save-on-interval during checkpointing and HF warm-start load path.
flagscale/train/megatron/training/arguments.py Adds CLI flags for HF load/save and interval; adds validation for HF save settings.
flagscale/train/megatron/nemo_bridge/__init__.py Introduces new megatron.nemo_bridge package entry point.
flagscale/train/megatron/nemo_bridge/models/__init__.py Exposes bridge/public APIs (AutoBridge, bridges, mappings).
flagscale/train/megatron/nemo_bridge/models/conversion/__init__.py Exposes conversion layer API.
flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py Adds AutoBridge wrapper for config validation, dynamic HF model construction, and HF weight load/save.
flagscale/train/megatron/nemo_bridge/models/conversion/model_bridge.py Adds Megatron↔HF streaming conversion behavior and tied-embedding handling.
flagscale/train/megatron/nemo_bridge/models/conversion/param_mapping.py Adds padding-aware column-parallel mapping behavior.
flagscale/train/megatron/nemo_bridge/models/hf_pretrained/__init__.py Exposes custom HF wrapper.
flagscale/train/megatron/nemo_bridge/models/hf_pretrained/causal_lm.py Wraps upstream PreTrainedCausalLM and customizes artifact saving behavior.
flagscale/train/megatron/nemo_bridge/models/deepseek/__init__.py Exposes DeepSeek bridge.
flagscale/train/megatron/nemo_bridge/models/deepseek/common.py Common DeepSeek config/mapping helpers.
flagscale/train/megatron/nemo_bridge/models/deepseek/deepseek_v3_bridge.py DeepSeek-V3 bridge implementation and HF config emission.
flagscale/train/megatron/nemo_bridge/models/qwen/__init__.py Exposes Qwen bridge.
flagscale/train/megatron/nemo_bridge/models/qwen/qwen3_bridge.py Qwen3 bridge implementation and HF config emission.


Comment on lines +2588 to +2593
group.add_argument('--load-hf', action='store_true',default=None,
help='Use the HF format for warm start, and save it in the torch_dict'
'format while also saving it in the HF format.')
group.add_argument('--save-hf', action='store_true',default=None,
help='Save as Hugging Face format checkpoint.')
group.add_argument('--hf-config-path', type=str,default=None,
Copilot AI Mar 2, 2026

The new CLI args are missing whitespace after commas (e.g., action='store_true',default=None and type=str,default=None). This violates Ruff/pycodestyle (E231) and will fail lint; add spaces after commas to match the surrounding argument definitions.

Suggested change
- group.add_argument('--load-hf', action='store_true',default=None,
-                    help='Use the HF format for warm start, and save it in the torch_dict'
-                    'format while also saving it in the HF format.')
- group.add_argument('--save-hf', action='store_true',default=None,
-                    help='Save as Hugging Face format checkpoint.')
- group.add_argument('--hf-config-path', type=str,default=None,
+ group.add_argument('--load-hf', action='store_true', default=None,
+                    help='Use the HF format for warm start, and save it in the torch_dict'
+                    'format while also saving it in the HF format.')
+ group.add_argument('--save-hf', action='store_true', default=None,
+                    help='Save as Hugging Face format checkpoint.')
+ group.add_argument('--hf-config-path', type=str, default=None,

Comment on lines +139 to +174
if modeling_path :
    import os,sys
    import importlib.util
    abs_model_path = os.path.abspath(modeling_path)
    model_dir = os.path.dirname(abs_model_path)
    if model_dir not in sys.path:
        sys.path.insert(0, model_dir)
    module_name = "dynamic_model_module"
    try:
        spec = importlib.util.spec_from_file_location(module_name, abs_model_path)
        assert spec is not None
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)
        model_class = getattr(module, model_name)
        with init_empty_weights():
            hf_model = model_class._from_config(model.config)
        for name, param in hf_model.named_parameters():
            set_module_tensor_to_device(
                hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype)
            )
    except Exception as e:
        print(f"import module error: {e}")

elif not modeling_path and config :
    with init_empty_weights():
        hf_model = AutoModelForCausalLM.from_config(model.config)

    for name, param in hf_model.named_parameters():
        set_module_tensor_to_device(
            hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype)
        )
else:
    raise ValueError("Need one args, model_path or config, to build HF model.")
model.model = hf_model
return cls(model)
Copilot AI Mar 2, 2026

from_hf_config swallows exceptions when importing/building the dynamic HF model (print(f"import module error: {e}")) and then continues, potentially returning an AutoBridge with hf_model still None. This will cause failures later in a much less debuggable place. Raise an exception (or re-raise with context) if the model class can't be loaded, and ensure hf_model is non-None before returning.
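A sketch of the fail-loudly alternative: isolate the dynamic import in a hypothetical helper that re-raises with context instead of printing, so a missing or broken modeling file surfaces at the call site rather than later as a None hf_model:

```python
import importlib.util
import sys

def load_model_class(modeling_path, model_name):
    """Load a model class from a standalone modeling file, failing loudly.

    Hypothetical helper for illustration: every failure mode raises
    ImportError (with the original exception chained) instead of being
    swallowed by a print-and-continue.
    """
    spec = importlib.util.spec_from_file_location("dynamic_model_module", modeling_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot create an import spec for {modeling_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules["dynamic_model_module"] = module
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        raise ImportError(f"Failed to import modeling file {modeling_path}") from e
    try:
        return getattr(module, model_name)
    except AttributeError as e:
        raise ImportError(f"{model_name} not found in {modeling_path}") from e
```

The caller can then assert hf_model is not None before returning the bridge.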

Comment on lines +185 to +191
else:
    pre_trained = PreTrainedCausalLM.from_pretrained(hf_path)
    # Preserve trust_remote_code setting from the original bridge instance
    trust_remote_code = getattr(self.hf_pretrained, 'trust_remote_code', False)
    pre_trained = PreTrainedCausalLM.from_pretrained(
        hf_path, trust_remote_code=trust_remote_code
    )
Copilot AI Mar 2, 2026

load_hf_weights calls PreTrainedCausalLM.from_pretrained(hf_path) twice (the first result is immediately overwritten). This adds unnecessary IO/memory overhead; remove the redundant first call and load once with the correct trust_remote_code value.
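A sketch of the single-load fix. `loader` is injected here as a stand-in for PreTrainedCausalLM.from_pretrained (so the pattern can be shown without transformers installed); `bridge` is any object with an `hf_pretrained` attribute:

```python
def load_hf_once(hf_path, bridge, loader):
    # Read trust_remote_code from the existing bridge instance first
    # (defaulting to False), then load exactly once with that flag --
    # no throwaway first call.
    trust_remote_code = getattr(bridge.hf_pretrained, "trust_remote_code", False)
    return loader(hf_path, trust_remote_code=trust_remote_code)
```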

Comment on lines +1429 to +1434
if args.load_hf:
    # use megatron bridge
    from megatron.nemo_bridge.models import AutoBridge
    bridge=AutoBridge.from_hf_pretrained(load_dir)
    bridge.load_hf_weights(ddp_model)
    # no optimizer weight
Copilot AI Mar 2, 2026

HF load path is passing ddp_model directly into bridge.load_hf_weights(...), but later in this function you unwrap via model = unwrap_model(ddp_model). If the bridge expects the unwrapped Megatron model (common for checkpointing paths), this will load weights into the wrong wrapper type or fail. Unwrap the model before calling the bridge, and also fix spacing around = to satisfy Ruff (E225).

Comment on lines +4 to +11
import sys

from pathlib import Path
from typing import Dict, Generic, List, Optional, TypeVar, Union

import torch

from megatron.bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM as OriginalPreTrainedCausalLM
Copilot AI Mar 2, 2026

This module has several unused imports (sys, Dict, Generic, List, TypeVar, etc.). Ruff is configured to error on unused imports (F401), so this will break CI; please remove unused imports or use them.

Comment on lines +72 to +92
hidden_size=args.hidden_size,
intermediate_size=args.ffn_hidden_size,
moe_intermediate_size=args.moe_ffn_hidden_size,
num_hidden_layers=args.num_layers,
num_nextn_predict_layers= args.mtp_num_layers if args.mtp_num_layers else 0 ,
num_attention_heads=args.num_attention_heads,
num_key_value_heads=args.num_query_groups,
n_shared_experts=args.moe_shared_expert_intermediate_size // args.moe_ffn_hidden_size,
n_routed_experts=args.num_experts,
routed_scaling_factor=args.moe_router_topk_scaling_factor,
kv_lora_rank=args.kv_lora_rank,
q_lora_rank=args.q_lora_rank if args.q_lora_rank else 0,
qk_rope_head_dim=args.qk_pos_emb_head_dim,
v_head_dim=args.v_head_dim,
qk_nope_head_dim=args.qk_head_dim,
n_group=args.moe_router_num_groups,
topk_group=args.moe_router_group_topk,
num_experts_per_tok=args.moe_router_topk,
moe_layer_freq = 1,
topk_method = "noaux_tc",
first_k_dense_replace=first_k_dense_replace,
Copilot AI Mar 2, 2026

Several keyword arguments in the DeepseekV3Config(...) call have invalid spacing (e.g., num_nextn_predict_layers= args... and moe_layer_freq = 1). This violates pycodestyle/Ruff (E251) and will fail lint; remove spaces around = inside keyword arguments and stray spaces before commas.

hf_config = AutoBridge.convert_mg2hf_config(args, safe_save, model_name)
bridge = AutoBridge.from_hf_config(hf_config, model_path, model_name)
#Save the HF model weights in the corresponding iteration's safetensor folder.
bridge.save_hf_pretrained(model=model,path=safe_save)
Copilot AI Mar 2, 2026

In the HF save call, bridge.save_hf_pretrained(model=model,path=safe_save) is missing whitespace after the comma (model=model, path=...). With Ruff/pycodestyle enabled, this triggers E231 and will fail lint; adjust argument formatting to match the rest of the file.

Suggested change
- bridge.save_hf_pretrained(model=model,path=safe_save)
+ bridge.save_hf_pretrained(model=model, path=safe_save)

Comment on lines +3 to +19
from megatron.bridge import AutoBridge as OriginalAutoBridge
import transformers
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from transformers.configuration_utils import PretrainedConfig

from megatron.core.transformer.module import MegatronModule
from megatron.nemo_bridge.models.conversion import model_bridge
from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge

from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map

from typing import TypeVar, Union
from pathlib import Path
Copilot AI Mar 2, 2026

Import order in this module is likely to fail Ruff's isort rules (I001): standard-library imports (typing, pathlib) should be grouped at the top, followed by third-party (transformers, torch), then local (megatron...). Please run isort/ruff on this file to normalize the import blocks.

Suggested change
- from megatron.bridge import AutoBridge as OriginalAutoBridge
- import transformers
- import torch.distributed as dist
- from transformers import AutoModelForCausalLM
- from transformers.configuration_utils import PretrainedConfig
- from megatron.core.transformer.module import MegatronModule
- from megatron.nemo_bridge.models.conversion import model_bridge
- from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge
- from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
- from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
- from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
- from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map
- from typing import TypeVar, Union
- from pathlib import Path
+ from pathlib import Path
+ from typing import TypeVar, Union
+ import torch.distributed as dist
+ import transformers
+ from transformers import AutoModelForCausalLM
+ from transformers.configuration_utils import PretrainedConfig
+ from megatron.bridge import AutoBridge as OriginalAutoBridge
+ from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map
+ from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
+ from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
+ from megatron.core.transformer.module import MegatronModule
+ from megatron.nemo_bridge.models.conversion import model_bridge
+ from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge
+ from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM

Comment on lines +1 to +18
# Copyright (c) 2025, BAAI. All rights reserved.

import torch
import torch.nn as nn
from megatron.bridge.models.conversion.utils import get_module_and_param_from_name
from megatron.bridge.models.conversion.param_mapping import ColumnParallelMapping as OriginalColumnParallelMapping
from megatron.bridge.models.conversion.param_mapping import AutoMapping as OriginalAutoMapping
from megatron.bridge.models.conversion.param_mapping import QKVMapping as OriginalQKVMapping
from megatron.bridge.models.conversion.param_mapping import (
MegatronParamMapping,
RowParallelMapping,
ReplicatedMapping,
GatedMLPMapping,
)


import logging
logger = logging.getLogger(__name__)
Copilot AI Mar 2, 2026

import logging (stdlib) appears after torch/megatron imports, which will violate Ruff isort (I001). Please reorder imports into standard-library / third-party / first-party groups (and keep them contiguous) so lint passes.

Comment on lines +32 to +40

logger = logging.getLogger(__name__)

from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
from megatron.bridge.models.conversion.model_bridge import (
_megatron_local_name_to_global,
HFWeightTuple,
WeightConversionTask,
)
Copilot AI Mar 2, 2026

This file has imports placed after runtime statements (logger = ...) and mixes import groups, which is likely to trip Ruff isort (I001). Please move all imports to the top and let isort/ruff format them into consistent groups.

Suggested change
- logger = logging.getLogger(__name__)
- from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
- from megatron.bridge.models.conversion.model_bridge import (
-     _megatron_local_name_to_global,
-     HFWeightTuple,
-     WeightConversionTask,
- )
+ from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
+ from megatron.bridge.models.conversion.model_bridge import (
+     _megatron_local_name_to_global,
+     HFWeightTuple,
+     WeightConversionTask,
+ )
+ logger = logging.getLogger(__name__)

@chai-xiaonan
Author

chai-xiaonan commented Mar 2, 2026 via email

