mtmd: refactor audio preprocessing #17978
Hmm, I think I can also upstream some changes from #17694, which would make your PR a bit shorter. @tdakhran I will remove the pre-calculated filters and replace them with your version.

Edit: since my goal is to implement conformer, I think I will end up copying a lot of code and refactoring it along the way.
Co-authored-by: Tarek <[email protected]>
@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp.

Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors.

Test results:
tdakhran left a comment:
@ngxson, thanks for the refactor.
@ggerganov I've verified that the generated coefficients match the existing ones for n_fft = 400.
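For readers unfamiliar with what "generated coefficients" refers to, here is a minimal sketch of building a triangular mel filterbank at load time instead of shipping a hard-coded table. It is not the PR's `fill_mel_filterbank_matrix`: the helper names (`hz_to_mel`, `mel_to_hz`, `make_mel_filterbank`) are hypothetical, and it assumes the HTK mel formula with unnormalized triangles, so the PR's filters may use a different mel variant and normalization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// HTK-style mel scale (an assumption made for this sketch).
static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Returns n_mel rows of (n_fft / 2 + 1) triangular weights, stored row-major.
static std::vector<float> make_mel_filterbank(int n_mel, int n_fft, int sample_rate) {
    assert(n_mel > 0 && n_fft > 0 && sample_rate > 0);
    const int    n_bins  = n_fft / 2 + 1;
    const double mel_max = hz_to_mel(sample_rate / 2.0);

    // n_mel + 2 edge frequencies, equally spaced on the mel axis.
    std::vector<double> edges(n_mel + 2);
    for (int i = 0; i < n_mel + 2; ++i) {
        edges[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }

    std::vector<float> fb((size_t) n_mel * n_bins, 0.0f);
    for (int m = 0; m < n_mel; ++m) {
        const double lo = edges[m], mid = edges[m + 1], hi = edges[m + 2];
        for (int k = 0; k < n_bins; ++k) {
            const double f    = (double) k * sample_rate / n_fft; // frequency of FFT bin k
            const double up   = (f - lo) / (mid - lo);            // rising slope
            const double down = (hi - f) / (hi - mid);            // falling slope
            fb[(size_t) m * n_bins + k] = (float) std::fmax(0.0, std::fmin(up, down));
        }
    }
    return fb;
}
```

With n_fft = 400 each row has 201 bins, so an 80-band filterbank at 16 kHz is an 80 x 201 matrix. Comparing such a generated matrix element by element against a stored table is one way to check that a refactor did not change the filters.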
    params.sample_rate = hparams.audio_sample_rate;
    params.center_padding = false;
    params.preemph = 0.0f; // disabled
    params.use_natural_log = false;
LFM2-Audio-1.5B needs params.use_natural_log = true; and I'd like to avoid reimplementing the whole processor just because of it. Should all params members be defined in hparams?
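To make the flag concrete: a switch like `use_natural_log` would typically only change the final compression step applied to each mel energy. The sketch below assumes that behavior. The helper name `log_mel` and the floor value are illustrative and not taken from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Compress one mel energy value. The 1e-10 floor is an assumption that
// avoids log(0); the actual preprocessor may clamp differently.
static float log_mel(float energy, bool use_natural_log) {
    assert(energy >= 0.0f);
    const float floored = std::max(energy, 1e-10f);
    return use_natural_log ? std::log(floored) : std::log10(floored);
}
```

With the same floor, the two outputs differ only by the constant factor ln(10), but a model trained on one scale will see shifted features if it is fed the other.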
For other models, it's recommended to make a dedicated class that extends mtmd_audio_preprocessor:

    struct mtmd_audio_preprocessor_lfm2a : mtmd_audio_preprocessor {
        mtmd_audio_preprocessor_lfm2a(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}
        void initialize() override;
        bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
    };

This way, you can customize the initialization of the cache, while also defining custom filter params and handling custom padding.
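As a rough illustration of the subclassing pattern, here is a self-contained sketch that compiles without mtmd. Every type in it (`mel_output_sketch`, `preprocessor_sketch`, `preprocessor_lfm2a_sketch`) is a hypothetical stand-in and not the real mtmd interface, and the `preprocess` body is a placeholder that performs no STFT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the mel output type; the real one lives in mtmd.
struct mel_output_sketch {
    int n_len = 0;
    std::vector<float> data;
};

// Stand-in for the base interface: one load-time hook, one per-clip hook.
struct preprocessor_sketch {
    virtual ~preprocessor_sketch() = default;
    virtual void initialize() = 0; // intended to be called once at model load
    virtual bool preprocess(const float * samples, size_t n_samples,
                            std::vector<mel_output_sketch> & output) = 0;
};

// Model-specific subclass that owns its own parameters and cache state.
struct preprocessor_lfm2a_sketch : preprocessor_sketch {
    bool use_natural_log = false;
    bool initialized     = false;

    void initialize() override {
        use_natural_log = true; // the setting requested in the review thread
        initialized     = true; // a real implementation would build caches here
    }

    bool preprocess(const float * samples, size_t n_samples,
                    std::vector<mel_output_sketch> & output) override {
        if (!initialized || samples == nullptr || n_samples == 0) {
            return false;
        }
        mel_output_sketch mel;
        mel.n_len = (int) n_samples;                   // placeholder only
        mel.data.assign(samples, samples + n_samples); // copies input, no spectrogram
        output.push_back(mel);
        return true;
    }
};
```

The point of the pattern is that model-specific settings stay inside the subclass, so the shared parameter struct does not need a new field for every model.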
sounds good, will create a dedicated class.
* mtmd: refactor audio preprocessing
* refactor

Co-authored-by: Tarek <[email protected]>

* wip
* wip (2)
* improve constructor
* fix use_natural_log
* fix padding for short input
* clean up
* remove need_chunking

---------

Co-authored-by: Tarek <[email protected]>
The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd.
While the code is not very clean, this should already allow:
Key points
* mtmd_audio_preprocessor
* initialize() function, which will be called on model load to initialize global cache entries like sin/cos and the Hann window
* fill_mel_filterbank_matrix (the hard-coded value is now removed)
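The load-time cache described in the key points can be sketched as follows. This is only an illustration under stated assumptions: the struct name `audio_cache_sketch` and its fields are hypothetical, and it uses a periodic Hann window, which may differ from what the PR computes.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical cache holder: filled once at load time, reused for every clip.
struct audio_cache_sketch {
    std::vector<float> hann;
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;

    void fill(int n_fft) {
        assert(n_fft > 0);
        const double two_pi = 2.0 * 3.14159265358979323846;
        hann.resize(n_fft);
        sin_vals.resize(n_fft);
        cos_vals.resize(n_fft);
        for (int i = 0; i < n_fft; ++i) {
            const double theta = two_pi * i / n_fft;
            hann[i]     = (float) (0.5 * (1.0 - std::cos(theta))); // periodic Hann window
            sin_vals[i] = (float) std::sin(theta);                 // twiddle table for the DFT
            cos_vals[i] = (float) std::cos(theta);
        }
    }
};
```

Precomputing these tables once avoids repeated trigonometric calls in the per-frame loop, which is the usual motivation for caching them at model load.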