Conversation
| #Load the HF model from config | ||
| config_load = args.hf_config_path | ||
| config = safe_load_config_with_retry(config_load, trust_remote_code=False) | ||
| bridge = AutoBridge.from_hf_config(config) |
Will this save-ckpt step allocate extra GPU memory when initializing an HF model?
| bridge.load_hf_weights(ddp_model) | ||
| # no optimizer weight | ||
| iteration=0 | ||
| num_floating_point_operations_so_far=0 |
please add print_rank_0 here
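The suggestion above can be sketched as follows. This is a hypothetical stand-in: the real helper is assumed to be Megatron's `print_rank_0` (which consults `torch.distributed`); here the rank check is approximated with the `RANK` environment variable so the sketch stays self-contained.

```python
import os

def print_rank_0(message):
    # Hypothetical stand-in for Megatron's print_rank_0: the real helper
    # checks torch.distributed.get_rank(); we approximate with $RANK.
    if int(os.environ.get("RANK", "0")) == 0:
        print(message, flush=True)

# At the warm-start site the reviewer points to, one might log:
# print_rank_0("Loaded HF weights; iteration=0, no optimizer state restored.")
```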
| # use megatron bridge | ||
| from megatron.nemo_bridge.models import AutoBridge | ||
| bridge=AutoBridge.from_hf_pretrained(load_dir) | ||
| bridge.load_hf_weights(ddp_model) |
Can nemo-bridge’s load_hf_model handle a ddp_model directly, where ddp_model is wrapped by DistributedDataParallel?
| @@ -0,0 +1,8 @@ | |||
| # Copyright (c) 2025, BAAI. All rights reserved. | |||
NeMo's megatron-bridge can be installed via pip (see https://pypi.org/project/megatron-bridge/);
please remove the vendored source code.
| @@ -0,0 +1,8 @@ | |||
| # Copyright (c) 2025, BAAI. All rights reserved. | |||
Rename flagscale/train/megatron/nemo_bridge to flagscale/train/megatron/bridge so that it matches the import pattern from megatron.bridge
tengqm left a comment
When copy-pasting source code from other repos, we are obliged to copy their copyright notice as well; we cannot claim copyright over this code.
The original code has the following copyright header, which must be preserved:
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
| @@ -0,0 +1,110 @@ | |||
| # Copyright (c) 2025, BAAI. All rights reserved. | |||
| # | |||
| # Copied from: https://github.com/NVIDIA-NeMo/Megatron-Bridge | |||
If Megatron-Bridge has a copyright claim, we are supposed to paste their copyright statements here.
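A combined header along these lines would satisfy both notices (this is a sketch; adapt the wording to the upstream file's actual license text):

```python
# Copyright (c) 2025, BAAI. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Copied and adapted from: https://github.com/NVIDIA-NeMo/Megatron-Bridge
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```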
|
|
||
| if not has_implementation: | ||
| raise ValueError( | ||
| f"\n�~\~W Model architecture '{architecture}' is not yet supported\n\n" |
What are these weird characters?
There are some other similar cases in this string.
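One plausible explanation (an assumption — the original character is not recoverable from the diff alone): the source contained a multi-byte UTF-8 character such as ❌ (U+274C), and at some point its bytes were decoded with a single-byte codec, producing sequences like the one above. A minimal reproduction of that failure mode:

```python
emoji = "\u274c"                    # "❌", a 3-byte character in UTF-8
raw = emoji.encode("utf-8")         # b'\xe2\x9d\x8c'
mojibake = raw.decode("latin-1")    # each byte becomes its own (garbled) character
assert mojibake != emoji and len(mojibake) == 3
# The damage is reversible only if no byte was dropped along the way:
assert mojibake.encode("latin-1").decode("utf-8") == emoji
```

The fix in the PR is simply to re-type the intended character (or drop it) in a UTF-8 source file.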
| @@ -0,0 +1,359 @@ | |||
| # Copyright (c) 2025, BAAI. All rights reserved. | |||
| # | |||
| # Mainly adapted from: https://github.com/NVIDIA-NeMo/Megatron-Bridge | |||
Please clarify what has been "borrowed".
Please also paste the original copyright claim here if the code was not originally written by us.
| @@ -0,0 +1,202 @@ | |||
| # Copyright (c) 2025, BAAI. All rights reserved. | |||
It looks to me like this file was largely adapted from flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py. We copy-pasted the source and are claiming copyright for this code, which is not acceptable.
We can borrow code from other projects, provided the license terms grant us that right. Even then, we must give credit to the original authors and retain their copyright notices.
There are also some weird characters in this file, obviously the result of a character-encoding problem during copy/paste. Please fix them as well.
|
Received, thank you.
|
Pull request overview
This PR adds a megatron.nemo_bridge component and integrates it into Megatron training so FlagScale can (a) warm-start from Hugging Face checkpoints and (b) periodically save Hugging Face-format weights during training based on --save-hf-interval.
Changes:
- Add a new `megatron.nemo_bridge` bridge implementation with DeepSeek-V3 and Qwen3 support and HF<->Megatron conversion utilities.
- Extend checkpoint save/load to optionally export/import Hugging Face weights (`--save-hf`, `--load-hf`) with a configurable save interval.
- Add new CLI arguments and validation for Hugging Face checkpointing/export configuration.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| flagscale/train/megatron/training/yaml_arguments.py | Minor formatting tweak in transformer config construction. |
| flagscale/train/megatron/training/checkpointing.py | Adds HF save-on-interval during checkpointing and HF warm-start load path. |
| flagscale/train/megatron/training/arguments.py | Adds CLI flags for HF load/save and interval; adds validation for HF save settings. |
| flagscale/train/megatron/nemo_bridge/__init__.py | Introduces new megatron.nemo_bridge package entry point. |
| flagscale/train/megatron/nemo_bridge/models/__init__.py | Exposes bridge/public APIs (AutoBridge, bridges, mappings). |
| flagscale/train/megatron/nemo_bridge/models/conversion/__init__.py | Exposes conversion layer API. |
| flagscale/train/megatron/nemo_bridge/models/conversion/auto_bridge.py | Adds AutoBridge wrapper for config validation, dynamic HF model construction, and HF weight load/save. |
| flagscale/train/megatron/nemo_bridge/models/conversion/model_bridge.py | Adds Megatron↔HF streaming conversion behavior and tied-embedding handling. |
| flagscale/train/megatron/nemo_bridge/models/conversion/param_mapping.py | Adds padding-aware column-parallel mapping behavior. |
| flagscale/train/megatron/nemo_bridge/models/hf_pretrained/__init__.py | Exposes custom HF wrapper. |
| flagscale/train/megatron/nemo_bridge/models/hf_pretrained/causal_lm.py | Wraps upstream PreTrainedCausalLM and customizes artifact saving behavior. |
| flagscale/train/megatron/nemo_bridge/models/deepseek/__init__.py | Exposes DeepSeek bridge. |
| flagscale/train/megatron/nemo_bridge/models/deepseek/common.py | Common DeepSeek config/mapping helpers. |
| flagscale/train/megatron/nemo_bridge/models/deepseek/deepseek_v3_bridge.py | DeepSeek-V3 bridge implementation and HF config emission. |
| flagscale/train/megatron/nemo_bridge/models/qwen/__init__.py | Exposes Qwen bridge. |
| flagscale/train/megatron/nemo_bridge/models/qwen/qwen3_bridge.py | Qwen3 bridge implementation and HF config emission. |
| group.add_argument('--load-hf', action='store_true',default=None, | ||
| help='Use the HF format for warm start, and save it in the torch_dict' | ||
| 'format while also saving it in the HF format.') | ||
| group.add_argument('--save-hf', action='store_true',default=None, | ||
| help='Save as Hugging Face format checkpoint.') | ||
| group.add_argument('--hf-config-path', type=str,default=None, |
The new CLI args are missing whitespace after commas (e.g., action='store_true',default=None and type=str,default=None). This violates Ruff/pycodestyle (E231) and will fail lint; add spaces after commas to match the surrounding argument definitions.
| group.add_argument('--load-hf', action='store_true',default=None, | |
| help='Use the HF format for warm start, and save it in the torch_dict' | |
| 'format while also saving it in the HF format.') | |
| group.add_argument('--save-hf', action='store_true',default=None, | |
| help='Save as Hugging Face format checkpoint.') | |
| group.add_argument('--hf-config-path', type=str,default=None, | |
| group.add_argument('--load-hf', action='store_true', default=None, | |
| help='Use the HF format for warm start, and save it in the torch_dict' | |
| 'format while also saving it in the HF format.') | |
| group.add_argument('--save-hf', action='store_true', default=None, | |
| help='Save as Hugging Face format checkpoint.') | |
| group.add_argument('--hf-config-path', type=str, default=None, |
| if modeling_path : | ||
| import os,sys | ||
| import importlib.util | ||
| abs_model_path = os.path.abspath(modeling_path) | ||
| model_dir = os.path.dirname(abs_model_path) | ||
| if model_dir not in sys.path: | ||
| sys.path.insert(0, model_dir) | ||
| module_name = "dynamic_model_module" | ||
| try: | ||
| spec = importlib.util.spec_from_file_location(module_name, abs_model_path) | ||
| assert spec is not None | ||
| module = importlib.util.module_from_spec(spec) | ||
| sys.modules[module_name] = module | ||
| spec.loader.exec_module(module) | ||
| model_class = getattr(module, model_name) | ||
| with init_empty_weights(): | ||
| hf_model = model_class._from_config(model.config) | ||
| for name, param in hf_model.named_parameters(): | ||
| set_module_tensor_to_device( | ||
| hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype) | ||
| ) | ||
| except Exception as e: | ||
| print(f"import module error: {e}") | ||
|
|
||
| elif not modeling_path and config : | ||
| with init_empty_weights(): | ||
| hf_model = AutoModelForCausalLM.from_config(model.config) | ||
|
|
||
| for name, param in hf_model.named_parameters(): | ||
| set_module_tensor_to_device( | ||
| hf_model, name, "cpu", torch.empty(*param.size(), dtype=model.config.torch_dtype) | ||
| ) | ||
| else: | ||
| raise ValueError("Need one args, model_path or config, to build HF model.") | ||
| model.model = hf_model | ||
| return cls(model) |
from_hf_config swallows exceptions when importing/building the dynamic HF model (print(f"import module error: {e}")) and then continues, potentially returning an AutoBridge with hf_model still None. This will cause failures later in a much less debuggable place. Raise an exception (or re-raise with context) if the model class can't be loaded, and ensure hf_model is non-None before returning.
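A sketch of the fail-loudly version this comment asks for (the helper name `load_model_class` is illustrative, not the PR's actual API):

```python
import importlib.util
import os

def load_model_class(modeling_path, model_name):
    # Illustrative helper: import a class from a standalone .py file and
    # raise with context instead of printing and returning None.
    abs_model_path = os.path.abspath(modeling_path)
    spec = importlib.util.spec_from_file_location("dynamic_model_module", abs_model_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot build an import spec for {abs_model_path}")
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception as exc:
        raise ImportError(f"Failed to execute module at {abs_model_path}") from exc
    model_class = getattr(module, model_name, None)
    if model_class is None:
        raise AttributeError(f"{model_name!r} not found in {abs_model_path}")
    return model_class
```

With this shape, `from_hf_config` can simply let the exception propagate, so a misconfigured `--hf-config-path` fails at load time rather than much later with `hf_model is None`.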
| else: | ||
| pre_trained = PreTrainedCausalLM.from_pretrained(hf_path) | ||
| # Preserve trust_remote_code setting from the original bridge instance | ||
| trust_remote_code = getattr(self.hf_pretrained, 'trust_remote_code', False) | ||
| pre_trained = PreTrainedCausalLM.from_pretrained( | ||
| hf_path, trust_remote_code=trust_remote_code | ||
| ) |
load_hf_weights calls PreTrainedCausalLM.from_pretrained(hf_path) twice (the first result is immediately overwritten). This adds unnecessary IO/memory overhead; remove the redundant first call and load once with the correct trust_remote_code value.
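The fix amounts to reading `trust_remote_code` first and loading once. A runnable sketch with a stub standing in for `PreTrainedCausalLM` (the stub counts calls so the single-load property is observable):

```python
class LoaderStub:
    # Stand-in for PreTrainedCausalLM: counts from_pretrained calls.
    load_calls = 0

    @classmethod
    def from_pretrained(cls, path, trust_remote_code=False):
        cls.load_calls += 1
        obj = cls()
        obj.path, obj.trust_remote_code = path, trust_remote_code
        return obj

def load_pretrained_once(hf_path, hf_pretrained):
    # Read the setting from the original bridge, then load exactly once.
    trust_remote_code = getattr(hf_pretrained, "trust_remote_code", False)
    return LoaderStub.from_pretrained(hf_path, trust_remote_code=trust_remote_code)
```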
| if args.load_hf: | ||
| # use megatron bridge | ||
| from megatron.nemo_bridge.models import AutoBridge | ||
| bridge=AutoBridge.from_hf_pretrained(load_dir) | ||
| bridge.load_hf_weights(ddp_model) | ||
| # no optimizer weight |
HF load path is passing ddp_model directly into bridge.load_hf_weights(...), but later in this function you unwrap via model = unwrap_model(ddp_model). If the bridge expects the unwrapped Megatron model (common for checkpointing paths), this will load weights into the wrong wrapper type or fail. Unwrap the model before calling the bridge, and also fix spacing around = to satisfy Ruff (E225).
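Unwrapping before the bridge call can be sketched as below; this mirrors the idea of Megatron's `unwrap_model` (the real helper also handles lists of model chunks and more wrapper types):

```python
def unwrap_model(model, wrapper_names=("DistributedDataParallel", "Float16Module")):
    # Peel known wrappers (each exposes the inner model as .module) until
    # the bare Megatron module is reached.
    while type(model).__name__ in wrapper_names:
        model = model.module
    return model

# Hypothetical call site:
# bridge.load_hf_weights(unwrap_model(ddp_model))
```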
| import sys | ||
|
|
||
| from pathlib import Path | ||
| from typing import Dict, Generic, List, Optional, TypeVar, Union | ||
|
|
||
| import torch | ||
|
|
||
| from megatron.bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM as OriginalPreTrainedCausalLM |
This module has several unused imports (sys, Dict, Generic, List, TypeVar, etc.). Ruff is configured to error on unused imports (F401), so this will break CI; please remove unused imports or use them.
| hidden_size=args.hidden_size, | ||
| intermediate_size=args.ffn_hidden_size, | ||
| moe_intermediate_size=args.moe_ffn_hidden_size, | ||
| num_hidden_layers=args.num_layers, | ||
| num_nextn_predict_layers= args.mtp_num_layers if args.mtp_num_layers else 0 , | ||
| num_attention_heads=args.num_attention_heads, | ||
| num_key_value_heads=args.num_query_groups, | ||
| n_shared_experts=args.moe_shared_expert_intermediate_size // args.moe_ffn_hidden_size, | ||
| n_routed_experts=args.num_experts, | ||
| routed_scaling_factor=args.moe_router_topk_scaling_factor, | ||
| kv_lora_rank=args.kv_lora_rank, | ||
| q_lora_rank=args.q_lora_rank if args.q_lora_rank else 0, | ||
| qk_rope_head_dim=args.qk_pos_emb_head_dim, | ||
| v_head_dim=args.v_head_dim, | ||
| qk_nope_head_dim=args.qk_head_dim, | ||
| n_group=args.moe_router_num_groups, | ||
| topk_group=args.moe_router_group_topk, | ||
| num_experts_per_tok=args.moe_router_topk, | ||
| moe_layer_freq = 1, | ||
| topk_method = "noaux_tc", | ||
| first_k_dense_replace=first_k_dense_replace, |
Several keyword arguments in the DeepseekV3Config(...) call have invalid spacing (e.g., num_nextn_predict_layers= args... and moe_layer_freq = 1). This violates pycodestyle/Ruff (E251) and will fail lint; remove spaces around = inside keyword arguments and stray spaces before commas.
| hf_config = AutoBridge.convert_mg2hf_config(args, safe_save, model_name) | ||
| bridge = AutoBridge.from_hf_config(hf_config, model_path, model_name) | ||
| #Save the HF model weights in the corresponding iteration's safetensor folder. | ||
| bridge.save_hf_pretrained(model=model,path=safe_save) |
In the HF save call, bridge.save_hf_pretrained(model=model,path=safe_save) is missing whitespace after the comma (model=model, path=...). With Ruff/pycodestyle enabled, this triggers E231 and will fail lint; adjust argument formatting to match the rest of the file.
| bridge.save_hf_pretrained(model=model,path=safe_save) | |
| bridge.save_hf_pretrained(model=model, path=safe_save) |
| from megatron.bridge import AutoBridge as OriginalAutoBridge | ||
| import transformers | ||
| import torch.distributed as dist | ||
| from transformers import AutoModelForCausalLM | ||
| from transformers.configuration_utils import PretrainedConfig | ||
|
|
||
| from megatron.core.transformer.module import MegatronModule | ||
| from megatron.nemo_bridge.models.conversion import model_bridge | ||
| from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge | ||
|
|
||
| from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM | ||
| from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource | ||
| from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry | ||
| from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map | ||
|
|
||
| from typing import TypeVar, Union | ||
| from pathlib import Path |
Import order in this module is likely to fail Ruff's isort rules (I001): standard-library imports (typing, pathlib) should be grouped at the top, followed by third-party (transformers, torch), then local (megatron...). Please run isort/ruff on this file to normalize the import blocks.
| from megatron.bridge import AutoBridge as OriginalAutoBridge | |
| import transformers | |
| import torch.distributed as dist | |
| from transformers import AutoModelForCausalLM | |
| from transformers.configuration_utils import PretrainedConfig | |
| from megatron.core.transformer.module import MegatronModule | |
| from megatron.nemo_bridge.models.conversion import model_bridge | |
| from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge | |
| from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM | |
| from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource | |
| from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry | |
| from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map | |
| from typing import TypeVar, Union | |
| from pathlib import Path | |
| from pathlib import Path | |
| from typing import TypeVar, Union | |
| import torch.distributed as dist | |
| import transformers | |
| from transformers import AutoModelForCausalLM | |
| from transformers.configuration_utils import PretrainedConfig | |
| from megatron.bridge import AutoBridge as OriginalAutoBridge | |
| from megatron.bridge.models.conversion.utils import get_causal_lm_class_via_auto_map | |
| from megatron.bridge.models.hf_pretrained.safe_config_loader import safe_load_config_with_retry | |
| from megatron.bridge.models.hf_pretrained.state import SafeTensorsStateSource | |
| from megatron.core.transformer.module import MegatronModule | |
| from megatron.nemo_bridge.models.conversion import model_bridge | |
| from megatron.nemo_bridge.models.conversion.model_bridge import MegatronModelBridge | |
| from megatron.nemo_bridge.models.hf_pretrained.causal_lm import PreTrainedCausalLM |
| # Copyright (c) 2025, BAAI. All rights reserved. | ||
|
|
||
| import torch | ||
| import torch.nn as nn | ||
| from megatron.bridge.models.conversion.utils import get_module_and_param_from_name | ||
| from megatron.bridge.models.conversion.param_mapping import ColumnParallelMapping as OriginalColumnParallelMapping | ||
| from megatron.bridge.models.conversion.param_mapping import AutoMapping as OriginalAutoMapping | ||
| from megatron.bridge.models.conversion.param_mapping import QKVMapping as OriginalQKVMapping | ||
| from megatron.bridge.models.conversion.param_mapping import ( | ||
| MegatronParamMapping, | ||
| RowParallelMapping, | ||
| ReplicatedMapping, | ||
| GatedMLPMapping, | ||
| ) | ||
|
|
||
|
|
||
| import logging | ||
| logger = logging.getLogger(__name__) |
import logging (stdlib) appears after torch/megatron imports, which will violate Ruff isort (I001). Please reorder imports into standard-library / third-party / first-party groups (and keep them contiguous) so lint passes.
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge | ||
| from megatron.bridge.models.conversion.model_bridge import ( | ||
| _megatron_local_name_to_global, | ||
| HFWeightTuple, | ||
| WeightConversionTask, | ||
| ) |
This file has imports placed after runtime statements (logger = ...) and mixes import groups, which is likely to trip Ruff isort (I001). Please move all imports to the top and let isort/ruff format them into consistent groups.
| logger = logging.getLogger(__name__) | |
| from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge | |
| from megatron.bridge.models.conversion.model_bridge import ( | |
| _megatron_local_name_to_global, | |
| HFWeightTuple, | |
| WeightConversionTask, | |
| ) | |
| from megatron.bridge.models.conversion.model_bridge import MegatronModelBridge as OriginalMegatronModelBridge | |
| from megatron.bridge.models.conversion.model_bridge import ( | |
| _megatron_local_name_to_global, | |
| HFWeightTuple, | |
| WeightConversionTask, | |
| ) | |
| logger = logging.getLogger(__name__) |
|
Received, thank you.
|
This PR reconstructs the Nemo-Bridge on top of the restructured FlagScale codebase. FlagScale now supports a subset of nemo-bridge functionality, enabling the framework to load and save checkpoints in HF format during training. This version also adds a new feature: the number of iterations between HF weight saves can be configured via save_hf_interval. Accuracy has been verified for Deepseek V3 16_a3B, Qwen3-32B, and Qwen3-0.6B.