44.1kHz acoustic tokenizer supports speech & sound & music #4
yongjie-lv merged 1 commit into inclusionAI:main
Conversation
Summary of Changes (from Gemini Code Assist): This pull request enhances the acoustic tokenizer by introducing a configurable sample rate and a patch-based processing mechanism. These changes improve the model's adaptability across diverse audio inputs, including speech, sound, and music, by allowing it to handle different sampling frequencies and to process audio more efficiently through patching and aggregation.
Code Review
The pull request introduces support for a sample_rate parameter in the AudioVAEconfig and integrates patch_size functionality into the encoder and decoder modules. This allows for more flexible audio processing, including downsampling and upsampling based on the patch_size. The test.py script has also been updated to reflect these changes, including sample rate resampling and using the new model. Overall, the changes enhance the model's capabilities for handling different audio resolutions and introduce a new aggregation mechanism in the encoder.
```python
class AudioVAEconfig(PretrainedConfig):
    def __init__(
        self,
        sample_rate: int = 16000,
```
The sample_rate parameter is introduced with a default value of 16000. While this is a common sample rate, it might be beneficial to explicitly document the expected range or common values for this parameter, especially if the model is intended to support various audio types (speech, sound, music) as suggested by the PR title.
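For illustration, such validation could look like the following standalone sketch (the class name and the set of supported rates are assumptions for the example, not taken from this PR):

```python
# Hypothetical validation sketch; the supported rates below are an
# assumption for illustration, not taken from the PR.
SUPPORTED_SAMPLE_RATES = (16000, 22050, 24000, 44100, 48000)

class AudioVAEConfigSketch:
    def __init__(self, sample_rate: int = 16000):
        if sample_rate not in SUPPORTED_SAMPLE_RATES:
            raise ValueError(
                f"sample_rate must be one of {SUPPORTED_SAMPLE_RATES}, "
                f"got {sample_rate}"
            )
        self.sample_rate = sample_rate
```

Rejecting unexpected rates at construction time surfaces misconfiguration immediately instead of producing silently wrong audio downstream.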
```python
input_dim=config.enc_kwargs['input_dim'],
hop_size=config.enc_kwargs.get('hop_size', 320),
latent_dim=config.enc_kwargs['latent_dim'],
patch_size=config.patch_size
```
The patch_size is directly passed from config.patch_size to the Encoder constructor. It's good that this parameter is now configurable. Ensure that the patch_size value is validated within the AudioVAEconfig or Encoder to prevent unexpected behavior, such as non-positive values or values that might lead to inefficient patching.
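A minimal check could look like this sketch (the helper name is hypothetical; `-1` is the "patching disabled" sentinel used in the diff):

```python
def validate_patch_size(patch_size: int) -> int:
    # In the diff, -1 is the sentinel meaning "patching disabled";
    # any other value should be a positive integer.
    if patch_size != -1 and patch_size < 1:
        raise ValueError(
            f"patch_size must be -1 (disabled) or a positive integer, "
            f"got {patch_size}"
        )
    return patch_size
```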
```python
output_dim=config.dec_kwargs['output_dim'],
latent_dim=config.dec_kwargs['latent_dim'],
semantic_model=semantic_model,
patch_size=config.patch_size
```
```python
if self.config.patch_size != -1:
    frame_num = torch.ceil(frame_num / self.config.patch_size)
```
The frame_num calculation is updated to account for patch_size. The use of torch.ceil is appropriate for ensuring all frames are covered. However, it might be clearer to cast the result to torch.int64 instead of torch.int32 to avoid potential overflow issues with very long audio inputs, although int32 is likely sufficient for typical audio lengths.
Suggested change:

```python
frame_num = torch.ceil(frame_num / self.config.patch_size).to(torch.int64)
```

```python
self.norm = nn.LayerNorm(config.hidden_size)
self.patch_size = patch_size
if patch_size != -1:
    config.num_hidden_layers = 4
```
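The ceil-based frame count from the diff can be sanity-checked with plain integer arithmetic (a sketch; `torch.ceil` on a float tensor behaves like `math.ceil` per element):

```python
import math

def patched_frame_num(frame_num: int, patch_size: int) -> int:
    # Mirrors the diff: with patching enabled, frames are grouped into
    # ceil(frame_num / patch_size) patches; patch_size == -1 disables patching.
    if patch_size == -1:
        return frame_num
    return math.ceil(frame_num / patch_size)
```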
```python
self.head = ISTFTHead(dim=config.hidden_size, n_fft=self.hop_length * 4, hop_length=self.hop_length, padding="same")
self.patch_size = patch_size
if self.patch_size != -1:
    self.upsampling = nn.Upsample(scale_factor=patch_size, mode='linear')
```
The nn.Upsample module is initialized with mode='linear'. For audio data, linear interpolation might not always be the best choice, especially if preserving specific audio characteristics is crucial. Consider whether nearest interpolation might be more suitable (bicubic applies only to 4D inputs, so it is not an option for this 3D audio tensor), or whether the interpolation mode should be a configurable parameter.
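To see why the mode matters, compare nearest and linear upsampling on a toy 1D signal (a simplified pure-Python sketch; it is not numerically identical to nn.Upsample's implementation):

```python
def upsample_nearest(x, factor):
    # Each input sample is repeated `factor` times; output length is len(x) * factor.
    return [v for v in x for _ in range(factor)]

def upsample_linear(x, factor):
    # Linearly interpolate between consecutive samples (simplified,
    # align_corners-style endpoints); output length is (len(x) - 1) * factor + 1.
    out = []
    for i in range(len(x) - 1):
        for j in range(factor):
            t = j / factor
            out.append(x[i] * (1 - t) + x[i + 1] * t)
    out.append(x[-1])
    return out
```

Nearest upsampling produces a staircase (which adds high-frequency artifacts to audio), while linear fills in intermediate values, which is why the mode choice is audible and worth exposing as a parameter.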
```diff
 from audio_tokenizer.modeling_audio_vae import AudioVAE

-model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
+model = AudioVAE.from_pretrained('inclusionAI/Ming-omni-tts-tokenizer-12Hz')
 model = model.cuda()
 model.eval()
 model = model.to(torch.bfloat16)
```
```python
if sr != sample_rate:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)
```
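As a sanity check on the resampling step, the output length should follow the ratio of the rates; a pure-Python sketch (an approximation only — exact lengths depend on torchaudio's resampling filter):

```python
def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    # A resampler produces roughly num_samples * new_freq / orig_freq output
    # samples; exact edge handling depends on the filter implementation.
    return round(num_samples * new_freq / orig_freq)
```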
```diff
 output_waveform = model.decode(latent)

-torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
+torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=sample_rate)
```
No description provided.