Support for Phi-1.5 & Phi-2 models #7862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
@@ -30,7 +30,7 @@ in the GitHub search bar.
 | **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](https://docs.sglang.ai/references/llama4) |
 | **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
 | **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
-| **Phi** (Phi-3, Phi-4 series) | `microsoft/Phi-4-multimodal-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-mini is a high-accuracy text model and Phi-4-multimodal (5.6B) processes text, images, and speech in one compact model. |
+| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4 series) | `microsoft/Phi-4-multimodal-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-mini is a high-accuracy text model and Phi-4-multimodal (5.6B) processes text, images, and speech in one compact model. |
Contributor: Thanks for adding Phi-1.5 support! To make the documentation clearer for users, could you please update the example model and description for the Phi model family? The current example and description are specific to Phi-4. Since this PR adds support for Phi-1.5, it would be great to reflect that.

Suggested change
 | **MiniCPM** (v3, 4B) | `openbmb/MiniCPM3-4B` | OpenBMB’s series of compact LLMs for edge devices; MiniCPM 3 (4B) achieves GPT-3.5-level results in text tasks. |
 | **OLMoE** (Open MoE) | `allenai/OLMoE-1B-7B-0924` | Allen AI’s open Mixture-of-Experts model (7B total, 1B active parameters) delivering state-of-the-art results with sparse expert activation. |
 | **StableLM** (3B, 7B) | `stabilityai/stablelm-tuned-alpha-7b` | StabilityAI’s early open-source LLM (3B & 7B) for general text generation; a demonstration model with basic instruction-following ability. |
@@ -0,0 +1,281 @@
# Adapted from https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/phi.py
import math

from typing import Iterable, Optional, Tuple, Union

import torch
from torch import nn
from transformers import PhiConfig

from sglang.srt.distributed import get_pp_group, get_tensor_model_parallel_world_size
from sglang.srt.layers.activation import get_act_fn
from sglang.srt.layers.linear import (
    ColumnParallelLinear,
    QKVParallelLinear,
    RowParallelLinear,
)
from sglang.srt.layers.logits_processor import LogitsProcessor, LogitsProcessorOutput
from sglang.srt.layers.quantization.base_config import QuantizationConfig
from sglang.srt.layers.radix_attention import RadixAttention
from sglang.srt.layers.rotary_embedding import get_rope
from sglang.srt.layers.vocab_parallel_embedding import (
    ParallelLMHead,
    VocabParallelEmbedding,
)
from sglang.srt.model_executor.forward_batch_info import ForwardBatch
from sglang.srt.model_loader.weight_utils import default_weight_loader
from sglang.srt.utils import add_prefix, make_layers


class PhiAttention(nn.Module):

    def __init__(
        self,
        config: PhiConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ):
        super().__init__()
        self.total_num_heads = config.num_attention_heads
        self.hidden_size = config.hidden_size
        self.head_size = self.hidden_size // self.total_num_heads

        tensor_model_parallel_world_size = get_tensor_model_parallel_world_size()
        assert self.total_num_heads % tensor_model_parallel_world_size == 0
        self.num_heads = self.total_num_heads // tensor_model_parallel_world_size

        # pylint: disable=C0103
Collaborator: @ppraneth, do we need this?
        self.qkv_proj = QKVParallelLinear(
            self.hidden_size,
            self.head_size,
            self.total_num_heads,
            bias=True,
            quant_config=quant_config,
        )
        self.dense = RowParallelLinear(
            self.hidden_size,
            self.hidden_size,
            quant_config=quant_config,
        )

        scaling = self.head_size**-0.5
        rotary_dim = int(
            config.partial_rotary_factor
            * (config.hidden_size // config.num_attention_heads)
        )
        assert rotary_dim % 2 == 0

        # pylint: disable=C0301
        # Refer to:
        # https://huggingface.co/microsoft/phi-1_5/blob/d212a789620c380ff32ca1d1ee9943a777360987/modeling_phi.py#L518
        rope_theta = getattr(config, "rope_theta", 10000.0)
        max_position_embeddings = getattr(config, "max_position_embeddings", 2048)
        self.rotary_emb = get_rope(
            self.head_size,
            rotary_dim=rotary_dim,
            max_position=max_position_embeddings,
            base=rope_theta,
        )
        self.attn = RadixAttention(
            self.num_heads,
            self.head_size,
            scaling,
            quant_config=quant_config,
            prefix=add_prefix("attn", prefix),
        )

    def forward(
        self,
        position_ids: torch.Tensor,
        forward_batch: ForwardBatch,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor:
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.chunk(chunks=3, dim=-1)
        q, k = self.rotary_emb(position_ids, q, k)
        attn_output = self.attn(q, k, v, forward_batch=forward_batch)
        output, _ = self.dense(attn_output)
        return output
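Phi applies rotary embeddings to only a fraction of each head's dimensions, controlled by `partial_rotary_factor`. A minimal sketch of the `rotary_dim` arithmetic above, assuming config values matching `microsoft/phi-1_5` (hidden size 2048, 32 heads, factor 0.5 — verify against the HF config before relying on them):

```python
# Sketch of the rotary_dim computation in PhiAttention.__init__.
# The example config values are assumptions, not pulled from a live config.
def rotary_dim(hidden_size: int, num_heads: int, partial_rotary_factor: float) -> int:
    head_size = hidden_size // num_heads
    dim = int(partial_rotary_factor * head_size)
    # get_rope rotates dimensions in pairs, so the rotary dim must be even
    assert dim % 2 == 0, "rotary dim must be even"
    return dim

print(rotary_dim(2048, 32, 0.5))  # 32: only half of each 64-dim head is rotated
```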


class PhiMLP(nn.Module):

    def __init__(
        self, config: PhiConfig, quant_config: Optional[QuantizationConfig] = None
    ):
        super().__init__()

        n_inner = getattr(config, "n_inner", None)
        n_inner = n_inner if n_inner is not None else 4 * config.hidden_size

        self.fc1 = ColumnParallelLinear(
            config.hidden_size,
            n_inner,
            quant_config=quant_config,
        )
        self.fc2 = RowParallelLinear(
            n_inner,
            config.hidden_size,
            quant_config=quant_config,
        )
        self.act = get_act_fn(config.hidden_act)

    def forward(self, hidden_states):
        hidden_states, _ = self.fc1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states, _ = self.fc2(hidden_states)
        return hidden_states
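The `n_inner` fallback above follows the conventional 4× expansion of the hidden size when the config leaves it unset. A minimal sketch (the example widths are illustrative, not taken from any particular checkpoint):

```python
# Fallback logic for the MLP width in PhiMLP.__init__: an explicit n_inner
# wins; otherwise the width defaults to 4 * hidden_size.
def mlp_width(hidden_size: int, n_inner=None) -> int:
    return n_inner if n_inner is not None else 4 * hidden_size

print(mlp_width(2048))        # 8192 (default 4x expansion)
print(mlp_width(2048, 5120))  # 5120 (explicit value wins)
```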


class PhiLayer(nn.Module):

    def __init__(
        self,
        config: PhiConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(
            config.hidden_size, eps=config.layer_norm_eps
        )
        self.self_attn = PhiAttention(
            config, quant_config, prefix=add_prefix("self_attn", prefix)
        )
        self.mlp = PhiMLP(config, quant_config)

    def forward(
        self,
        position_ids: torch.Tensor,
        forward_batch: ForwardBatch,
        hidden_states: torch.Tensor,
    ) -> torch.Tensor:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(
            position_ids=position_ids,
            hidden_states=hidden_states,
            forward_batch=forward_batch,
        )
        feed_forward_hidden_states = self.mlp(hidden_states)
        hidden_states = attn_outputs + feed_forward_hidden_states + residual
        return hidden_states
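Note the parallel residual in `PhiLayer.forward`: attention and MLP both consume the same normalized input, and their outputs are summed with the residual, unlike the sequential pre-norm block of most GPT-style models. A toy scalar sketch, using hypothetical stand-ins for the real modules:

```python
# Toy illustration of the parallel residual above. ln, attn, and mlp are
# made-up scalar stand-ins, not the real layernorm/attention/MLP modules.
def ln(x):   return x          # pretend layernorm (identity for the sketch)
def attn(x): return 2 * x      # pretend attention
def mlp(x):  return 3 * x      # pretend MLP

def phi_block(x):
    h = ln(x)
    return attn(h) + mlp(h) + x   # parallel: both branches see the same h

def gpt_block(x):
    h = x + attn(ln(x))           # sequential: MLP sees the attention output
    return h + mlp(ln(h))

print(phi_block(1.0))  # 6.0
print(gpt_block(1.0))  # 12.0
```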


class PhiModel(nn.Module):

    def __init__(
        self,
        config: PhiConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ):
        super().__init__()
        self.config = config
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size, config.hidden_size
        )
        self.start_layer, self.end_layer, self.layers = make_layers(
            config.num_hidden_layers,
            lambda prefix: PhiLayer(config, quant_config, prefix=prefix),
            prefix=add_prefix("layers", prefix),
        )
        self.final_layernorm = nn.LayerNorm(
            config.hidden_size, eps=config.layer_norm_eps
        )

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
Collaborator: @ppraneth, can you apply this suggested change? Thanks!

Collaborator: Same here
        if inputs_embeds is not None:
            hidden_states = inputs_embeds
        else:
            hidden_states = self.get_input_embeddings(input_ids)
        for i in range(self.start_layer, self.end_layer):
            layer = self.layers[i]
            hidden_states = layer(
                position_ids=positions,
                forward_batch=forward_batch,
                hidden_states=hidden_states,
            )
        hidden_states = self.final_layernorm(hidden_states)
        return hidden_states


class PhiForCausalLM(nn.Module):
    packed_modules_mapping = {
        "qkv_proj": [
            "q_proj",
            "k_proj",
            "v_proj",
        ]
    }
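`packed_modules_mapping` declares that the checkpoint's separate `q_proj`/`k_proj`/`v_proj` tensors all target the single fused `qkv_proj` parameter. A simplified sketch of the renaming a stacked-weight loader performs (the real loader also passes a shard id to place each tensor in its slice of the fused weight):

```python
# Condensed sketch of stacked-module name remapping. Checkpoint names are
# assumed to contain the shard names (q_proj/k_proj/v_proj), never the
# fused name, as in HF Phi checkpoints.
packed = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}

def remap(name: str) -> str:
    for fused, shards in packed.items():
        for shard in shards:
            if shard in name:
                return name.replace(shard, fused)
    return name  # unstacked parameters keep their checkpoint name

print(remap("model.layers.0.self_attn.q_proj.weight"))
# -> model.layers.0.self_attn.qkv_proj.weight
```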


    def __init__(
        self,
        config: PhiConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ):
        super().__init__()
        self.config = config
        self.quant_config = quant_config
        self.model = PhiModel(
            config=config,
            quant_config=quant_config,
            prefix=add_prefix("model", prefix),
        )

        self.lm_head = ParallelLMHead(
            config.vocab_size,
            config.hidden_size,
            bias=True,
            quant_config=quant_config,
        )
        self.logits_processor = LogitsProcessor(config.vocab_size)

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.model.get_input_embeddings(input_ids)

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> LogitsProcessorOutput:
        hidden_states = self.model(
            input_ids=input_ids,
            positions=positions,
            forward_batch=forward_batch,
            inputs_embeds=inputs_embeds,
        )
        return self.logits_processor(
            input_ids, hidden_states, self.lm_head, forward_batch
        )

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        sampling_metadata,
    ) -> Optional[torch.Tensor]:
        logits = self.logits_processor(
            self.lm_head, hidden_states, sampling_metadata, self.lm_head.bias
        )
        return logits

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
        # q_proj/k_proj/v_proj checkpoint tensors are loaded into the fused
        # qkv_proj parameter, per packed_modules_mapping above.
        stacked_params_mapping = [
            # (param_name, weight_name, shard_id)
            ("qkv_proj", "q_proj", "q"),
            ("qkv_proj", "k_proj", "k"),
            ("qkv_proj", "v_proj", "v"),
        ]
        params_dict = dict(self.named_parameters())
        for name, loaded_weight in weights:
            if "rotary_emb.inv_freq" in name:
                continue
            if self.config.tie_word_embeddings and "lm_head.weight" in name:
                continue
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in name:
                    continue
                name = name.replace(weight_name, param_name)
                param = params_dict[name]
                weight_loader = param.weight_loader
                weight_loader(param, loaded_weight, shard_id)
                break
            else:
                if name.endswith(".bias") and name not in params_dict:
                    continue
                param = params_dict[name]
                weight_loader = getattr(param, "weight_loader", default_weight_loader)
                weight_loader(param, loaded_weight)


EntryClass = PhiForCausalLM