
44.1kHz acoustic tokenizer supports speech & sound & music#4

Merged
yongjie-lv merged 1 commit into inclusionAI:main from yongjie-lv:main
Feb 24, 2026

Conversation

@yongjie-lv
Collaborator

No description provided.

@gemini-code-assist

Summary of Changes

Hello @yongjie-lv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the acoustic tokenizer by introducing a configurable sample rate and a novel patch-based processing mechanism. These changes aim to improve the model's adaptability and performance across diverse audio inputs, including speech, sound, and music, by allowing it to handle different sampling frequencies and process audio more efficiently through patching and aggregation.

Highlights

  • Configurable Sample Rate: Introduced a sample_rate parameter to the AudioVAEconfig to allow for flexible configuration of the audio processing sample rate, defaulting to 16000 Hz.
  • Patch-based Encoding and Decoding: Implemented a new patch_size mechanism within the Encoder and Decoder modules. This involves downsampling input frames into patches, aggregating them using a Qwen2Model with cls_embed tokens, and subsequently upsampling during decoding, enabling more efficient processing for various audio types.
  • Test Script Update: Updated the test.py script to utilize a new pretrained model (Ming-omni-tts-tokenizer-12Hz), handle bfloat16 precision, and dynamically resample input audio to match the model's configured sample rate, ensuring compatibility and correct usage of the new features.
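The patch mechanism highlighted above boils down to some shape bookkeeping: frames are grouped into fixed-size patches (padding the tail patch), aggregated, and later upsampled by the same factor. A minimal sketch of that arithmetic, with a hypothetical helper name (`patch_shapes` is not in the PR; `-1` is the sentinel the code uses for "patching disabled"):

```python
import math

def patch_shapes(frame_num: int, patch_size: int):
    """Sketch of the shape bookkeeping for patch-based encoding.

    Pads the frame count up to a multiple of patch_size, then returns
    (number of patches, number of padded frames). With patch_size == -1,
    patching is disabled and the frame count passes through unchanged.
    """
    if patch_size == -1:
        return frame_num, 0
    num_patches = math.ceil(frame_num / patch_size)
    padded_frames = num_patches * patch_size
    return num_patches, padded_frames - frame_num

# 100 input frames with patch_size=8 -> 13 patches, 4 padded frames
print(patch_shapes(100, 8))  # (13, 4)
```

The aggregator then prepends a cls token per patch and keeps only the cls outputs, so the latent sequence length equals the patch count rather than the frame count.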


Changelog
  • audio_tokenizer/configuration_audio_vae.py
    • Added sample_rate parameter to the AudioVAEconfig constructor.
    • Stored the sample_rate as an instance attribute.
  • audio_tokenizer/modeling_audio_vae.py
    • Passed patch_size from the configuration to the encoder and decoder during initialization.
    • Modified encode_latent to adjust frame_num based on patch_size for downsampling.
  • audio_tokenizer/vae_modules.py
    • Imported the torch library.
    • Added patch_size to the Encoder and Decoder constructors.
    • Implemented patch-based aggregation in the Encoder using a Qwen2Model and cls_embed tokens when patch_size is active.
    • Added a pad_patch_insert_cls method to handle frame padding and cls token insertion.
    • Integrated an Upsample layer in the Decoder for reconstruction when patch_size is active.
    • Applied upsampling in both the forward and low_level_reconstruct methods of the Decoder.
  • test.py
    • Updated the pretrained model path to 'inclusionAI/Ming-omni-tts-tokenizer-12Hz'.
    • Enabled bfloat16 precision for the model.
    • Added logic to resample input waveforms to match the model's configured sample rate.
    • Updated the output torchaudio.save call to use the model's sample_rate.

@yongjie-lv yongjie-lv merged commit 22d0e4d into inclusionAI:main Feb 24, 2026
0 of 2 checks passed

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces support for a sample_rate parameter in the AudioVAEconfig and integrates patch_size functionality into the encoder and decoder modules. This allows for more flexible audio processing, including downsampling and upsampling based on the patch_size. The test.py script has also been updated to reflect these changes, including sample rate resampling and using the new model. Overall, the changes enhance the model's capabilities for handling different audio resolutions and introduce a new aggregation mechanism in the encoder.

class AudioVAEconfig(PretrainedConfig):
    def __init__(
        self,
        sample_rate: int = 16000,

Severity: medium

The sample_rate parameter is introduced with a default value of 16000. While this is a common sample rate, it might be beneficial to explicitly document the expected range or common values for this parameter, especially if the model is intended to support various audio types (speech, sound, music) as suggested by the PR title.
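One way to act on this suggestion is to validate the value in the config constructor and document the common rates there. A hedged sketch (the helper name, value set, and error wording are assumptions, not part of the PR):

```python
import warnings

# Common audio sample rates the tokenizer is likely to see.
COMMON_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}

def validate_sample_rate(sample_rate: int) -> int:
    """Reject impossible values; warn on unusual but plausible ones."""
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        raise ValueError(f"sample_rate must be a positive int, got {sample_rate!r}")
    if sample_rate not in COMMON_SAMPLE_RATES:
        warnings.warn(f"Unusual sample_rate {sample_rate}; "
                      f"common values are {sorted(COMMON_SAMPLE_RATES)}")
    return sample_rate

print(validate_sample_rate(44100))  # 44100
```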

    input_dim=config.enc_kwargs['input_dim'],
    hop_size=config.enc_kwargs.get('hop_size', 320),
    latent_dim=config.enc_kwargs['latent_dim'],
    patch_size=config.patch_size

Severity: medium

The patch_size is directly passed from config.patch_size to the Encoder constructor. It's good that this parameter is now configurable. Ensure that the patch_size value is validated within the AudioVAEconfig or Encoder to prevent unexpected behavior, such as non-positive values or values that might lead to inefficient patching.
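Since both the Encoder and the Decoder receive `config.patch_size`, a single shared check would cover both. A sketch of the validation the review asks for (the helper name is hypothetical; `-1` is the existing "disabled" sentinel):

```python
def validate_patch_size(patch_size: int) -> int:
    """Accept -1 (patching disabled) or a positive integer; reject the rest.

    Zero or negative patch sizes would break the ceil-division used for
    frame counting and the Upsample scale factor in the decoder.
    """
    if patch_size == -1:
        return patch_size
    if not isinstance(patch_size, int) or patch_size < 1:
        raise ValueError(
            f"patch_size must be -1 (disabled) or a positive int, got {patch_size!r}")
    return patch_size
```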

    output_dim=config.dec_kwargs['output_dim'],
    latent_dim=config.dec_kwargs['latent_dim'],
    semantic_model=semantic_model,
    patch_size=config.patch_size

Severity: medium

Similar to the encoder, patch_size is passed to the Decoder. Consider adding validation for patch_size to ensure it's a positive integer, as negative or zero values could lead to issues in the upsampling logic.

Comment on lines +71 to +72
if self.config.patch_size != -1:
    frame_num = torch.ceil(frame_num / self.config.patch_size)

Severity: medium

The frame_num calculation is updated to account for patch_size. The use of torch.ceil is appropriate for ensuring all frames are covered. However, it might be clearer to cast the result to torch.int64 instead of torch.int32 to avoid potential overflow issues with very long audio inputs, although int32 is likely sufficient for typical audio lengths.

            frame_num = torch.ceil(frame_num/self.config.patch_size).to(torch.int64)
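An alternative that sidesteps the dtype question entirely: for integer frame counts, ceiling division can be done exactly in integer arithmetic, avoiding the float round-trip of `torch.ceil` altogether. A sketch under that assumption (plain-Python here; the same `-(-a // b)` idiom works on integer tensors):

```python
def ceil_div(frame_num: int, patch_size: int) -> int:
    # Exact integer ceiling division: no float intermediate, so no
    # precision or overflow concern for very long audio inputs.
    return -(-frame_num // patch_size)

print(ceil_div(100, 8))  # 13
```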

self.norm = nn.LayerNorm(config.hidden_size)
self.patch_size = patch_size
if patch_size != -1:
    config.num_hidden_layers = 4

Severity: medium

The config.num_hidden_layers is hardcoded to 4 when patch_size is enabled. This might limit the flexibility of the aggregator model. Consider making this configurable through encoder_args or config to allow for more fine-grained control over the aggregator's architecture.
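The suggested fix is a one-liner: read the depth from the config with the current value as the fallback. A sketch, where `aggregator_num_layers` is a hypothetical config key (not in the PR), shown here with a plain dict standing in for the config object:

```python
def aggregator_layers(config: dict, default: int = 4) -> int:
    # Fall back to the currently hardcoded depth of 4 when the
    # hypothetical 'aggregator_num_layers' key is absent, so existing
    # checkpoints keep working unchanged.
    return config.get("aggregator_num_layers", default)

print(aggregator_layers({}))                             # 4
print(aggregator_layers({"aggregator_num_layers": 6}))   # 6
```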

self.head = ISTFTHead(dim=config.hidden_size, n_fft=self.hop_length * 4, hop_length=self.hop_length, padding="same")
self.patch_size = patch_size
if self.patch_size != -1:
    self.upsampling = nn.Upsample(scale_factor=patch_size, mode='linear')

Severity: medium

The nn.Upsample module is initialized with mode='linear'. For audio data, linear interpolation might not always be the best choice, especially if preserving specific audio characteristics is crucial. Consider if other interpolation modes like nearest or bicubic (if applicable for the data dimensions) might be more suitable, or if this should be a configurable parameter.
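To make the trade-off concrete, here is a toy illustration of the two modes on plain Python lists (this is a simplified sketch, not `nn.Upsample` itself; in particular the endpoint handling differs from PyTorch's `align_corners` semantics): nearest repeats each latent frame, which produces a staircase, while linear interpolates between neighbours, which smooths transitions but can blur sharp onsets.

```python
def upsample_nearest(x, factor):
    # Repeat each sample `factor` times (staircase output).
    return [v for v in x for _ in range(factor)]

def upsample_linear(x, factor):
    # Linearly interpolate between consecutive samples; the last
    # sample is simply repeated (simplified endpoint handling).
    out = []
    for i in range(len(x)):
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        for k in range(factor):
            t = k / factor
            out.append(x[i] * (1 - t) + nxt * t)
    return out

print(upsample_nearest([0, 4], 2))  # [0, 0, 4, 4]
print(upsample_linear([0, 4], 2))   # [0.0, 2.0, 4.0, 4.0]
```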

from audio_tokenizer.modeling_audio_vae import AudioVAE

- model = AudioVAE.from_pretrained('inclusionAI/MingTok-Audio')
+ model = AudioVAE.from_pretrained('inclusionAI/Ming-omni-tts-tokenizer-12Hz')

Severity: medium

The model name has been updated to inclusionAI/Ming-omni-tts-tokenizer-12Hz. It's good to see the test script reflecting the new model. Ensure that this model is publicly available or accessible in the environment where the tests will run.

model = AudioVAE.from_pretrained('inclusionAI/Ming-omni-tts-tokenizer-12Hz')
model = model.cuda()
model.eval()
model = model.to(torch.bfloat16)

Severity: medium

The model is cast to torch.bfloat16. This is a good practice for optimizing memory and computation with compatible hardware. Ensure that the target environment supports bfloat16 operations, as not all GPUs or PyTorch versions do.

Comment on lines +13 to +14
if sr != sample_rate:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)

Severity: medium

The torchaudio.transforms.Resample is used to match the waveform's sample rate with the model's expected sample_rate. This is a critical step for ensuring correct input to the model. It's good that this is handled explicitly.
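Resampling preserves duration while changing the sample count, which matters for the 12 Hz token-rate arithmetic above. A small sketch of the length relationship (the helper name is hypothetical, and the exact length `torchaudio.transforms.Resample` emits may differ by a sample from this ceiling convention):

```python
import math

def resampled_length(num_samples: int, orig_freq: int, new_freq: int) -> int:
    # Duration in seconds is preserved, so the sample count scales by
    # the ratio of the new rate to the old rate (rounded up here).
    return math.ceil(num_samples * new_freq / orig_freq)

# One second of 16 kHz audio resampled to 44.1 kHz:
print(resampled_length(16000, 16000, 44100))  # 44100
```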

output_waveform = model.decode(latent)

- torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=16000)
+ torchaudio.save('./1089-134686-0000_reconstruct.wav', output_waveform.cpu()[0], sample_rate=sample_rate)

Severity: medium

The output sample_rate for saving the reconstructed audio is now dynamically set from model.config.sample_rate. This is an improvement over the hardcoded 16000 as it ensures consistency with the model's configuration.
