
add nemo_bridge#1050

Open
chai-xiaonan wants to merge 10 commits into flagos-ai:main from chai-xiaonan:add_nemo_bridge

Conversation

@chai-xiaonan

Reconstruct the Nemo-Bridge based on the restructured FlagScale version. FlagScale now supports part of nemo-bridge's functionality, enabling the FlagScale framework to load and save checkpoints in HF format during training. This version also adds a new feature: the number of iterations between HF weight saves can be configured via save_hf_interval. Accuracy has been verified to be correct for DeepSeek-V3 16_a3B, Qwen3-32B, and Qwen3-0.6B.

# Load the HF model from config
config_load = args.hf_config_path
config = safe_load_config_with_retry(config_load, trust_remote_code=False)
bridge = AutoBridge.from_hf_config(config)
Collaborator

Will this save-ckpt step allocate extra GPU memory when initializing an HF model?
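A quick way to check: if the bridge builds the HF model with accelerate's init_empty_weights (or an equivalent meta-device context), no parameter memory is allocated on GPU or CPU until the weights are materialized. A minimal sketch of the underlying pattern, assuming plain PyTorch (the PR's own helper names are not reproduced here):

```python
import torch

# Constructing a module under the "meta" device records only shapes and
# dtypes; no real parameter storage (GPU or CPU) is allocated until the
# tensors are explicitly materialized (e.g. via to_empty or
# set_module_tensor_to_device, as the PR does).
with torch.device("meta"):
    layer = torch.nn.Linear(8192, 8192)

assert layer.weight.device.type == "meta"
```

So an init_empty_weights-based path should not allocate extra GPU memory by itself; the cost appears only when tensors are materialized, and in this PR they are materialized on "cpu".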

bridge.load_hf_weights(ddp_model)
# no optimizer weight
iteration=0
num_floating_point_operations_so_far=0
Collaborator

please add print_rank_0 here

# use megatron bridge
from megatron.nemo_bridge.models import AutoBridge
bridge=AutoBridge.from_hf_pretrained(load_dir)
bridge.load_hf_weights(ddp_model)
Collaborator

Can nemo-bridge’s load_hf_model handle a ddp_model directly, where ddp_model is wrapped by DistributedDataParallel?
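If it cannot, unwrapping first is cheap. A hypothetical helper sketching the usual pattern — Megatron's own unwrap_model does this with an explicit list of wrapper classes; this sketch simply follows the `.module` convention that DistributedDataParallel (and similar wrappers) use:

```python
def unwrap_ddp(model):
    """Peel wrapper layers (DistributedDataParallel, Float16Module, ...)
    that expose the wrapped model as `.module`, until the bare module
    remains. Illustrative only; not the PR's actual helper."""
    while hasattr(model, "module"):
        model = model.module
    return model
```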

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

NeMo Megatron-Bridge supports installation via pip, ref https://pypi.org/project/megatron-bridge/
Please remove the vendored source code.

@@ -0,0 +1,8 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Collaborator

Rename flagscale/train/megatron/nemo_bridge to flagscale/train/megatron/bridge so that it matches the import pattern from megatron.bridge

Contributor

@tengqm tengqm left a comment


When copy-pasting source code from other repos, we are obliged to copy their copyright notice as well. We cannot claim copyright for this code.
The original code has the following copyright header, which must be preserved:

# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,110 @@
# Copyright (c) 2025, BAAI. All rights reserved.
#
# Copied from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
Contributor

If Megatron-Bridge has a copyright claim, we are supposed to paste their copyright statements here.


if not has_implementation:
raise ValueError(
f"\n�~\~W Model architecture '{architecture}' is not yet supported\n\n"
Contributor

What are these weird characters?
There are some other similar cases in this string.

@@ -0,0 +1,359 @@
# Copyright (c) 2025, BAAI. All rights reserved.
#
# Mainly adapted from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
Contributor

Please clarify what has been "borrowed".
Please also paste the original copyright claim here if the code was not originally written by us.

@@ -0,0 +1,202 @@
# Copyright (c) 2025, BAAI. All rights reserved.
Contributor

It looks to me like this file was largely adapted from flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py. We copy-pasted the source and are claiming copyright for this code. This is not acceptable.

We can borrow code from other projects, provided the license terms grant us that right. Even then, we still have to credit the original authors; we are obliged to mention their copyrights.

There are also some weird characters in this file, obviously from a character-conversion problem during copy/paste. Please fix them as well.

@chai-xiaonan
Author

chai-xiaonan commented Feb 5, 2026 via email

Copilot AI left a comment

Pull request overview

This PR adds a megatron.nemo_bridge component and integrates it into Megatron training so FlagScale can (a) warm-start from Hugging Face checkpoints and (b) periodically save Hugging Face-format weights during training based on --save-hf-interval.

Changes:

  • Add new megatron.nemo_bridge bridge implementation with DeepSeek-V3 and Qwen3 support and HF<->Megatron conversion utilities.
  • Extend checkpoint save/load to optionally export/import Hugging Face weights (--save-hf, --load-hf) with a configurable save interval.
  • Add new CLI arguments and validation for Hugging Face checkpointing/export configuration.
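For reference, the new flags could be wired up roughly as in the sketch below. The flag names and the help strings for --load-hf, --save-hf, and --hf-config-path follow the diff in this PR; the argument group name, the --save-hf-interval help text, and the validation rule are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("huggingface checkpointing")
group.add_argument('--load-hf', action='store_true', default=None,
                   help='Use the HF format for warm start, and save it in the torch_dict '
                        'format while also saving it in the HF format.')
group.add_argument('--save-hf', action='store_true', default=None,
                   help='Save as Hugging Face format checkpoint.')
group.add_argument('--hf-config-path', type=str, default=None,
                   help='Path to the Hugging Face config used to build the model.')
group.add_argument('--save-hf-interval', type=int, default=None,
                   help='Number of iterations between Hugging Face weight saves (assumed).')

args = parser.parse_args(['--save-hf', '--hf-config-path', '/tmp/cfg',
                          '--save-hf-interval', '1000'])

# Validation mirroring the PR's described behavior (assumed semantics):
if args.save_hf and args.save_hf_interval is None:
    raise ValueError('--save-hf requires --save-hf-interval')
```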

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 18 comments.

File Description
flagscale/train/megatron/training/yaml_arguments.py Minor formatting tweak in transformer config construction.
flagscale/train/megatron/training/checkpointing.py Adds HF save-on-interval during checkpointing and HF warm-start load path.
flagscale/train/megatron/training/arguments.py Adds CLI flags for HF load/save and interval; adds validation for HF save settings.
flagscale/train/megatron/nemo_bridge/__init__.py Introduces new megatron.nemo_bridge package entry point.
flagscale/train/megatron/nemo_bridge/models/__init__.py Exposes bridge/public APIs (AutoBridge, bridges, mappings).
flagscale/train/megatron/nemo_bridge/models/conversion/__init__.py Exposes conversion layer API.
flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py Adds AutoBridge wrapper for config validation, dynamic HF model construction, and HF weight load/save.
flagscale/train/megatron/nemo_bridge/models/conversion/model_bridge.py Adds Megatron↔HF streaming conversion behavior and tied-embedding handling.
flagscale/train/megatron/nemo_bridge/models/conversion/param_mapping.py Adds padding-aware column-parallel mapping behavior.
flagscale/train/megatron/nemo_bridge/models/hf_pretrained/__init__.py Exposes custom HF wrapper.
flagscale/train/megatron/nemo_bridge/models/hf_pretrained/causal_lm.py Wraps upstream PreTrainedCausalLM and customizes artifact saving behavior.
flagscale/train/megatron/nemo_bridge/models/deepseek/__init__.py Exposes DeepSeek bridge.
flagscale/train/megatron/nemo_bridge/models/deepseek/common.py Common DeepSeek config/mapping helpers.
flagscale/train/megatron/nemo_bridge/models/deepseek/deepseek_v3_bridge.py DeepSeek-V3 bridge implementation and HF config emission.
flagscale/train/megatron/nemo_bridge/models/qwen/__init__.py Exposes Qwen bridge.
flagscale/train/megatron/nemo_bridge/models/qwen/qwen3_bridge.py Qwen3 bridge implementation and HF config emission.


Comment on lines +2588 to +2593
group.add_argument('--load-hf', action='store_true',default=None,
help='Use the HF format for warm start, and save it in the torch_dict'
'format while also saving it in the HF format.')
group.add_argument('--save-hf', action='store_true',default=None,
help='Save as Hugging Face format checkpoint.')
group.add_argument('--hf-config-path', type=str,default=None,
Copilot AI Mar 2, 2026

The new CLI args are missing whitespace after commas (e.g., action='store_true',default=None and type=str,default=None). This violates Ruff/pycodestyle (E231) and will fail lint; add spaces after commas to match the surrounding argument definitions.

Suggested change
- group.add_argument('--load-hf', action='store_true',default=None,
-                    help='Use the HF format for warm start, and save it in the torch_dict'
-                    'format while also saving it in the HF format.')
- group.add_argument('--save-hf', action='store_true',default=None,
-                    help='Save as Hugging Face format checkpoint.')
- group.add_argument('--hf-config-path', type=str,default=None,
+ group.add_argument('--load-hf', action='store_true', default=None,
+                    help='Use the HF format for warm start, and save it in the torch_dict'
+                    'format while also saving it in the HF format.')
+ group.add_argument('--save-hf', action='store_true', default=None,
+                    help='Save as Hugging Face format checkpoint.')
+ group.add_argument('--hf-config-path', type=str, default=None,

Comment on lines +139 to +174
if modeling_path :
    import os,sys
    import importlib.util
    abs_model_path = os.path.abspath(modeling_path)
    model_dir = os.path.dirname(abs_model_path)
    if model_dir not in sys.path:
        sys.path.insert(0, model_dir)
    module_name = "dynamic_model_module"
    try:
        spec = importlib.util.spec_from_file_location(module_name, abs_model_path)
        assert spec is not None
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module
        spec.loader.exec_module(module)
        model_class = getattr(module, model_name)
        with init_empty_weights():
            hf_model = model_class._from_config(model.config)
        for name, param in hf_model.named_parameters():
            set_module_tensor_to_device(
                hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype)
            )
    except Exception as e:
        print(f"import module error: {e}")

elif not modeling_path and config :
    with init_empty_weights():
        hf_model = AutoModelForCausalLM.from_config(model.config)

    for name, param in hf_model.named_parameters():
        set_module_tensor_to_device(
            hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype)
        )
else:
    raise ValueError("Need one args, model_path or config, to build HF model.")
model.model = hf_model
return cls(model)
Copilot AI Mar 2, 2026

from_hf_config swallows exceptions when importing/building the dynamic HF model (print(f"import module error: {e}")) and then continues, potentially returning an AutoBridge with hf_model still None. This will cause failures later in a much less debuggable place. Raise an exception (or re-raise with context) if the model class can't be loaded, and ensure hf_model is non-None before returning.
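A sketch of the fail-loudly alternative: isolate the dynamic import in a hypothetical helper that re-raises with context instead of printing, so a missing or broken modeling file surfaces at the call site rather than later as a None hf_model:

```python
import importlib.util
import sys

def load_model_class(modeling_path, model_name):
    """Load a model class from a standalone modeling file, failing loudly.

    Hypothetical helper for illustration: every failure mode raises
    ImportError (with the original exception chained) instead of being
    swallowed by a print-and-continue.
    """
    spec = importlib.util.spec_from_file_location("dynamic_model_module", modeling_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot create an import spec for {modeling_path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules["dynamic_model_module"] = module
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        raise ImportError(f"Failed to import modeling file {modeling_path}") from e
    try:
        return getattr(module, model_name)
    except AttributeError as e:
        raise ImportError(f"{model_name} not found in {modeling_path}") from e
```

The caller can then assert hf_model is not None before returning the bridge.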

Comment on lines +185 to +191
else:
    pre_trained = PreTrainedCausalLM.from_pretrained(hf_path)
    # Preserve trust_remote_code setting from the original bridge instance
    trust_remote_code = getattr(self.hf_pretrained, 'trust_remote_code', False)
    pre_trained = PreTrainedCausalLM.from_pretrained(
        hf_path, trust_remote_code=trust_remote_code
    )
Copilot AI Mar 2, 2026

load_hf_weights calls PreTrainedCausalLM.from_pretrained(hf_path) twice (the first result is immediately overwritten). This adds unnecessary IO/memory overhead; remove the redundant first call and load once with the correct trust_remote_code value.
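A sketch of the single-load fix. `loader` is injected here as a stand-in for PreTrainedCausalLM.from_pretrained (so the pattern can be shown without transformers installed); `bridge` is any object with an `hf_pretrained` attribute:

```python
def load_hf_once(hf_path, bridge, loader):
    # Read trust_remote_code from the existing bridge instance first
    # (defaulting to False), then load exactly once with that flag --
    # no throwaway first call.
    trust_remote_code = getattr(bridge.hf_pretrained, "trust_remote_code", False)
    return loader(hf_path, trust_remote_code=trust_remote_code)
```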

Comment on lines +1429 to +1434
if args.load_hf:
    # use megatron bridge
    from megatron.nemo_bridge.models import AutoBridge
    bridge=AutoBridge.from_hf_pretrained(load_dir)
    bridge.load_hf_weights(ddp_model)
    # no optimizer weight
Copilot AI Mar 2, 2026

HF load path is passing ddp_model directly into bridge.load_hf_weights(...), but later in this function you unwrap via model = unwrap_model(ddp_model). If the bridge expects the unwrapped Megatron model (common for checkpointing paths), this will load weights into the wrong wrapper type or fail. Unwrap the model before calling the bridge, and also fix spacing around = to satisfy Ruff (E225).

Comment on lines +4 to +11
import sys

from pathlib import Path
from typing import Dict, Generic, List, Optional, TypeVar, Union

import torch

from megatron.bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM as OriginalPreTrainedCausalLM
Copilot AI Mar 2, 2026

This module has several unused imports (sys, Dict, Generic, List, TypeVar, etc.). Ruff is configured to error on unused imports (F401), so this will break CI; please remove unused imports or use them.

Comment on lines +72 to +92
hidden_size=args.hidden_size,
intermediate_size=args.ffn_hidden_size,
moe_intermediate_size=args.moe_ffn_hidden_size,
num_hidden_layers=args.num_layers,
num_nextn_predict_layers= args.mtp_num_layers if args.mtp_num_layers else 0 ,
num_attention_heads=args.num_attention_heads,
num_key_value_heads=args.num_query_groups,
n_shared_experts=args.moe_shared_expert_intermediate_size // args.moe_ffn_hidden_size,
n_routed_experts=args.num_experts,
routed_scaling_factor=args.moe_router_topk_scaling_factor,
kv_lora_rank=args.kv_lora_rank,
q_lora_rank=args.q_lora_rank if args.q_lora_rank else 0,
qk_rope_head_dim=args.qk_pos_emb_head_dim,
v_head_dim=args.v_head_dim,
qk_nope_head_dim=args.qk_head_dim,
n_group=args.moe_router_num_groups,
topk_group=args.moe_router_group_topk,
num_experts_per_tok=args.moe_router_topk,
moe_layer_freq = 1,
topk_method = "noaux_tc",
first_k_dense_replace=first_k_dense_replace,
Copilot AI Mar 2, 2026

Several keyword arguments in the DeepseekV3Config(...) call have invalid spacing (e.g., num_nextn_predict_layers= args... and moe_layer_freq = 1). This violates pycodestyle/Ruff (E251) and will fail lint; remove spaces around = inside keyword arguments and stray spaces before commas.

hf_config = AutoBridge.convert_mg2hf_config(args, safe_save, model_name)
bridge = AutoBridge.from_hf_config(hf_config, model_path, model_name)
#Save the HF model weights in the corresponding iteration's safetensor folder.
bridge.save_hf_pretrained(model=model,path=safe_save)
Copilot AI Mar 2, 2026

In the HF save call, bridge.save_hf_pretrained(model=model,path=safe_save) is missing whitespace after the comma (model=model, path=...). With Ruff/pycodestyle enabled, this triggers E231 and will fail lint; adjust argument formatting to match the rest of the file.

Suggested change
- bridge.save_hf_pretrained(model=model,path=safe_save)
+ bridge.save_hf_pretrained(model=model, path=safe_save)

Comment on lines +3 to +19
from megatron.bridge import AutoBridge as OriginalAutoBridge
import transformers
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from transformers.configuration_utils import PretrainedConfig

from megatron.core.transformer.module import MegatronModule
from megatron.nemo_bridge.models.conversion import model_bridge
from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge

from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map

from typing import TypeVar, Union
from pathlib import Path
Copilot AI Mar 2, 2026

Import order in this module is likely to fail Ruff's isort rules (I001): standard-library imports (typing, pathlib) should be grouped at the top, followed by third-party (transformers, torch), then local (megatron...). Please run isort/ruff on this file to normalize the import blocks.

Suggested change
- from megatron.bridge import AutoBridge as OriginalAutoBridge
- import transformers
- import torch.distributed as dist
- from transformers import AutoModelForCausalLM
- from transformers.configuration_utils import PretrainedConfig
- from megatron.core.transformer.module import MegatronModule
- from megatron.nemo_bridge.models.conversion import model_bridge
- from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge
- from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM
- from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
- from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
- from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map
- from typing import TypeVar, Union
- from pathlib import Path
+ from pathlib import Path
+ from typing import TypeVar, Union
+ import torch.distributed as dist
+ import transformers
+ from transformers import AutoModelForCausalLM
+ from transformers.configuration_utils import PretrainedConfig
+ from megatron.bridge import AutoBridge as OriginalAutoBridge
+ from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map
+ from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry
+ from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource
+ from megatron.core.transformer.module import MegatronModule
+ from megatron.nemo_bridge.models.conversion import model_bridge
+ from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge
+ from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM

Comment on lines +1 to +18
# Copyright (c) 2025, BAAI. All rights reserved.

import torch
import torch.nn as nn
from megatron.bridge.models.conversion.utils import get_module_and_param_from_name
from megatron.bridge.models.conversion.param_mapping import ColumnParallelMapping as OriginalColumnParallelMapping
from megatron.bridge.models.conversion.param_mapping import AutoMapping as OriginalAutoMapping
from megatron.bridge.models.conversion.param_mapping import QKVMapping as OriginalQKVMapping
from megatron.bridge.models.conversion.param_mapping import (
MegatronParamMapping,
RowParallelMapping,
ReplicatedMapping,
GatedMLPMapping,
)


import logging
logger = logging.getLogger(__name__)
Copilot AI Mar 2, 2026

import logging (stdlib) appears after torch/megatron imports, which will violate Ruff isort (I001). Please reorder imports into standard-library / third-party / first-party groups (and keep them contiguous) so lint passes.

Comment on lines +32 to +40

logger = logging.getLogger(__name__)

from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
from megatron.bridge.models.conversion.model_bridge import (
_megatron_local_name_to_global,
HFWeightTuple,
WeightConversionTask,
)
Copilot AI Mar 2, 2026

This file has imports placed after runtime statements (logger = ...) and mixes import groups, which is likely to trip Ruff isort (I001). Please move all imports to the top and let isort/ruff format them into consistent groups.

Suggested change
- logger = logging.getLogger(__name__)
- from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
- from megatron.bridge.models.conversion.model_bridge import (
-     _megatron_local_name_to_global,
-     HFWeightTuple,
-     WeightConversionTask,
- )
+ from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge
+ from megatron.bridge.models.conversion.model_bridge import (
+     _megatron_local_name_to_global,
+     HFWeightTuple,
+     WeightConversionTask,
+ )
+ logger = logging.getLogger(__name__)

@chai-xiaonan
Author

chai-xiaonan commented Mar 2, 2026 via email

