Add the conformer backbone (phi4mm audio) #1448
Merged
49 commits
9e06476 Deps (EricLBuehler)
7bb5a64 Add conformer (EricLBuehler)
ce5aff6 Nemo loading (EricLBuehler)
408aae1 Position embeds (EricLBuehler)
a9e46ed Load t5 attn bias (EricLBuehler)
0a36028 Attn and feed forward (EricLBuehler)
4cd9196 Add conv module and glu pointwise (EricLBuehler)
8244e07 Implement relative attn bias (EricLBuehler)
3853cfe Add the forward methods (EricLBuehler)
d94134e Add encoder embedding (EricLBuehler)
f645bf9 Fix oproj (EricLBuehler)
46d7ea4 Some loading (EricLBuehler)
c9ac339 Conformer loads! (EricLBuehler)
3907feb Fully loading speech stack (EricLBuehler)
8e10e52 Merger (EricLBuehler)
d7ce884 Dont need that (EricLBuehler)
06f2cfe First pass at audio processing (EricLBuehler)
6123f82 Read samples (EricLBuehler)
2ee2742 Optional (EricLBuehler)
2f4e8b3 Small loading fix (EricLBuehler)
d6f4e99 Runs but not correct yet (EricLBuehler)
8010d36 Improved audio processing? (EricLBuehler)
865f0c3 Works with this (EricLBuehler)
a0f9bfc Fix t5 attn bias (EricLBuehler)
34ca7d4 It works! (EricLBuehler)
2d84c8c Comment (EricLBuehler)
b8291e3 Use some other crates (EricLBuehler)
5083095 Clippy (EricLBuehler)
99bd576 Allow bf16 on metal (EricLBuehler)
04bbd4e Add prefix_audio (EricLBuehler)
4ea5b25 Remove unused (EricLBuehler)
8250b33 Typo (EricLBuehler)
558eb99 User specified (EricLBuehler)
c85b087 Add audio url parsing (EricLBuehler)
6fb358e AudioProjectionMode -> InputMode (EricLBuehler)
9acf1c4 Audio prefix caching (EricLBuehler)
f810079 Fix bug in audio prefix caching (EricLBuehler)
bc87555 Support both at the same time! (EricLBuehler)
e546c5b Tweak logging (EricLBuehler)
b81ca71 Support stereo (EricLBuehler)
ff67bf8 Add mistralrs-audio (EricLBuehler)
167f1b5 Support batching (EricLBuehler)
f8f2a31 Add server and rust api example (EricLBuehler)
6480ac1 Add python api (EricLBuehler)
8316dd6 Fix add_multimodal_message (EricLBuehler)
53f3e39 Fix unfold for conformer (EricLBuehler)
38f7a8e Streaming example (EricLBuehler)
29b146c Add web chat support (EricLBuehler)
dd9031b Add modalities registry (EricLBuehler)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,141 @@ | ||
| use serde::{Deserialize, Serialize}; | ||
| | ||
| use crate::{layers::Activation, serde_default_fn}; | ||
| | ||
| serde_default_fn!(usize, default_attention_dim, 256); | ||
| serde_default_fn!(usize, default_attention_heads, 4); | ||
| serde_default_fn!(usize, default_linear_units, 2048); | ||
| serde_default_fn!(usize, default_num_blocks, 6); | ||
| serde_default_fn!(String, default_input_layer, "nemo_conv".to_string()); | ||
| serde_default_fn!(bool, default_causal, true); | ||
| serde_default_fn!(bool, default_batch_norm, false); | ||
| serde_default_fn!(usize, default_ext_pw_out_channel, 0); | ||
| serde_default_fn!(usize, default_ext_pw_kernel_size, 1); | ||
| serde_default_fn!(usize, default_depthwise_seperable_out_channel, 256); | ||
| serde_default_fn!(usize, default_depthwise_multiplier, 1); | ||
| serde_default_fn!(usize, default_chunk_se, 0); | ||
| serde_default_fn!(usize, default_kernel_size, 3); | ||
| serde_default_fn!(Activation, default_activation, Activation::Relu); | ||
| serde_default_fn!(Activation, default_conv_activation, Activation::Relu); | ||
| serde_default_fn!(Activation, default_conv_glu_type, Activation::Sigmoid); | ||
| serde_default_fn!(bool, default_bias_in_glu, true); | ||
| serde_default_fn!(bool, default_linear_glu_in_convm, false); | ||
| serde_default_fn!(String, default_attention_glu_type, "swish".to_string()); | ||
| serde_default_fn!(bool, default_export, false); | ||
| serde_default_fn!(i32, default_extra_layer_output_idx, -1); | ||
| serde_default_fn!(usize, default_time_reduction, 4); | ||
| serde_default_fn!(bool, default_replication_pad_for_subsample_embedding, false); | ||
| serde_default_fn!(usize, default_attention_group_size, 1); | ||
| serde_default_fn!(String, default_subsampling, "dw_striding".to_string()); | ||
| serde_default_fn!(usize, default_conv_channels, 256); | ||
| serde_default_fn!(usize, default_subsampling_conv_chunking_factor, 1); | ||
| serde_default_fn!(Activation, default_nemo_activation, Activation::Relu); | ||
| serde_default_fn!(bool, default_nemo_is_causal, false); | ||
| serde_default_fn!(usize, fake_default_sentinel, usize::MAX); | ||
| | ||
| #[derive(Serialize, Deserialize, Debug, Clone)] | ||
| pub struct RelativeAttentionBiasArgs { | ||
| pub t5_bias_max_distance: Option<usize>, | ||
| pub t5_bias_symmetric: Option<bool>, | ||
| #[serde(rename = "type")] | ||
| pub tp: String, | ||
| } | ||
| | ||
| #[derive(Serialize, Deserialize, Debug, Clone)] | ||
| pub struct NemoConvConfig { | ||
| #[serde(default = "default_subsampling")] | ||
| pub subsampling: String, | ||
| #[serde(default = "fake_default_sentinel")] | ||
| pub subsampling_factor: usize, | ||
| #[serde(default = "fake_default_sentinel")] | ||
| pub feat_in: usize, | ||
| #[serde(default = "fake_default_sentinel")] | ||
| pub feat_out: usize, | ||
| #[serde(default = "default_conv_channels")] | ||
| pub conv_channels: usize, | ||
| #[serde(default = "default_subsampling_conv_chunking_factor")] | ||
| pub subsampling_conv_chunking_factor: usize, | ||
| #[serde(default = "default_nemo_activation")] | ||
| pub activation: Activation, | ||
| #[serde(default = "default_nemo_is_causal")] | ||
| pub is_causal: bool, | ||
| } | ||
| | ||
| #[derive(Serialize, Deserialize, Debug, Clone)] | ||
| pub struct EncoderEmbeddingConfig { | ||
| pub input_size: usize, | ||
| } | ||
| | ||
| #[derive(Serialize, Deserialize, Debug, Clone)] | ||
| pub struct ConformerEncoderConfig { | ||
| pub input_size: usize, | ||
| pub chunk_size: i32, | ||
| pub left_chunk: usize, | ||
| pub num_lang: Option<usize>, | ||
| #[serde(default = "default_attention_dim")] | ||
| pub attention_dim: usize, | ||
| #[serde(default = "default_attention_heads")] | ||
| pub attention_heads: usize, | ||
| #[serde(default = "default_linear_units")] | ||
| pub linear_units: usize, | ||
| #[serde(default = "default_num_blocks")] | ||
| pub num_blocks: usize, | ||
| #[serde(default = "default_input_layer")] | ||
| pub input_layer: String, | ||
| #[serde(default = "default_causal")] | ||
| pub causal: bool, | ||
| #[serde(default = "default_batch_norm")] | ||
| pub batch_norm: bool, | ||
| #[serde(default = "default_ext_pw_out_channel")] | ||
| pub ext_pw_out_channel: usize, | ||
| #[serde(default = "default_ext_pw_kernel_size")] | ||
| pub ext_pw_kernel_size: usize, | ||
| #[serde(default = "default_depthwise_seperable_out_channel")] | ||
| pub depthwise_seperable_out_channel: usize, | ||
| #[serde(default = "default_depthwise_multiplier")] | ||
| pub depthwise_multiplier: usize, | ||
| #[serde(default = "default_chunk_se")] | ||
| pub chunk_se: usize, | ||
| #[serde(default = "default_kernel_size")] | ||
| pub kernel_size: usize, | ||
| #[serde(default = "default_activation")] | ||
| pub activation: Activation, | ||
| #[serde(default = "default_conv_activation")] | ||
| pub conv_activation: Activation, | ||
| #[serde(default = "default_conv_glu_type")] | ||
| pub conv_glu_type: Activation, | ||
| #[serde(default = "default_bias_in_glu")] | ||
| pub bias_in_glu: bool, | ||
| #[serde(default = "default_linear_glu_in_convm")] | ||
| pub linear_glu_in_convm: bool, | ||
| #[serde(default = "default_attention_glu_type")] | ||
| pub attention_glu_type: String, | ||
| #[serde(default = "default_export")] | ||
| pub export: bool, | ||
| #[serde(default = "default_extra_layer_output_idx")] | ||
| pub extra_layer_output_idx: i32, | ||
| pub relative_attention_bias_args: Option<RelativeAttentionBiasArgs>, | ||
| #[serde(default = "default_time_reduction")] | ||
| pub time_reduction: usize, | ||
| pub nemo_conv_settings: NemoConvConfig, | ||
| #[serde(default = "default_replication_pad_for_subsample_embedding")] | ||
| pub replication_pad_for_subsample_embedding: bool, | ||
| #[serde(default = "default_attention_group_size")] | ||
| pub attention_group_size: usize, | ||
| pub encoder_embedding_config: Option<EncoderEmbeddingConfig>, | ||
| } | ||
| | ||
| impl ConformerEncoderConfig { | ||
| pub fn finish_nemo_config(&mut self) { | ||
| // Fill in any NeMo conv fields still at the `usize::MAX` sentinel (i.e. absent from the config) using the encoder-level settings | ||
| if self.nemo_conv_settings.subsampling_factor == usize::MAX { | ||
| self.nemo_conv_settings.subsampling_factor = self.time_reduction; | ||
| } | ||
| if self.nemo_conv_settings.feat_in == usize::MAX { | ||
| self.nemo_conv_settings.feat_in = self.input_size; | ||
| } | ||
| if self.nemo_conv_settings.feat_out == usize::MAX { | ||
| self.nemo_conv_settings.feat_out = self.attention_dim; | ||
| } | ||
| } | ||
| } | ||
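The sentinel pattern in the diff above can be reduced to a small, dependency-free sketch. Everything here is illustrative rather than the PR's actual code: the `serde_default_fn!` definition is an assumed shape (a macro that emits a zero-argument function returning the default), `NemoConv` and `Encoder` are hypothetical trimmed-down stand-ins for `NemoConvConfig` and `ConformerEncoderConfig`, and `all_unset` and `example_encoder` only simulate what deserialization would yield when the three keys are missing. The `input_size` of 80 is an arbitrary example value.

```rust
// Assumed shape of the crate's `serde_default_fn!` macro: it defines a
// zero-argument function that returns the given default value.
macro_rules! serde_default_fn {
    ($t:ty, $name:ident, $v:expr) => {
        fn $name() -> $t {
            $v
        }
    };
}

serde_default_fn!(usize, default_attention_dim, 256);
serde_default_fn!(usize, default_time_reduction, 4);
serde_default_fn!(usize, fake_default_sentinel, usize::MAX);

/// Hypothetical stand-in for `NemoConvConfig`, keeping only the three
/// sentinel-backed fields.
#[derive(Debug, Clone, PartialEq)]
struct NemoConv {
    subsampling_factor: usize,
    feat_in: usize,
    feat_out: usize,
}

impl NemoConv {
    /// Simulates deserializing a config in which none of the keys appear.
    fn all_unset() -> Self {
        Self {
            subsampling_factor: fake_default_sentinel(),
            feat_in: fake_default_sentinel(),
            feat_out: fake_default_sentinel(),
        }
    }
}

/// Hypothetical stand-in for `ConformerEncoderConfig`.
#[derive(Debug, Clone)]
struct Encoder {
    input_size: usize,
    attention_dim: usize,
    time_reduction: usize,
    nemo_conv_settings: NemoConv,
}

impl Encoder {
    /// Same logic as `finish_nemo_config` in the diff: only fields still
    /// holding the sentinel are filled from the encoder-level values.
    fn finish_nemo_config(&mut self) {
        if self.nemo_conv_settings.subsampling_factor == usize::MAX {
            self.nemo_conv_settings.subsampling_factor = self.time_reduction;
        }
        if self.nemo_conv_settings.feat_in == usize::MAX {
            self.nemo_conv_settings.feat_in = self.input_size;
        }
        if self.nemo_conv_settings.feat_out == usize::MAX {
            self.nemo_conv_settings.feat_out = self.attention_dim;
        }
    }
}

/// Builds an encoder with an arbitrary example `input_size` of 80 and the
/// defaults from the diff for the other two fields.
fn example_encoder(nemo: NemoConv) -> Encoder {
    Encoder {
        input_size: 80,
        attention_dim: default_attention_dim(),
        time_reduction: default_time_reduction(),
        nemo_conv_settings: nemo,
    }
}
```

Calling `finish_nemo_config` on `example_encoder(NemoConv::all_unset())` fills the three fields with 4, 80 and 256, while a field that was set explicitly (say `subsampling_factor: 8`) is left alone. One trade-off of the sentinel approach is that a config which genuinely sets a field to `usize::MAX` cannot be told apart from one that omits it; an `Option<usize>` field would make the "unset" state explicit at the cost of unwrapping later.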
Fix typo: "seperable" should be "separable"
Also applies to: 93-93
🧰 Tools
🪛 GitHub Check: Typos
[warning] 14-14:
"seperable" should be "separable".