Skip to content

Conversation

@Mike-Bell
Copy link
Contributor

@Mike-Bell Mike-Bell commented Jan 6, 2023

I spent quite a while trying to figure out why AVX2 wasn't enabled when I built this on Windows in my build system. Adding /arch:AVX2 to the C-compiler seems to fix it. I'm not building through Visual Studio but via a more customized build system (leveraging MSBuild), so that's likely why this isn't required for other people building on Windows, but since ggml.c is a C file (not C++), it seems like this should be correct and necessary in some cicrumstances.

@RndyP
Copy link

RndyP commented Jan 6, 2023

Thanks Mike! I had tried AVX2 and it worked, but was extremely slow. The JFK.wav sample took 17 seconds, where with AVX it takes 4.5 seconds. Using Visual Studio with AVX2 set under C/C++ Code Generation -> Enable Enhanced Instruction Set -> Advanced Vector Extensions 2 (/arch:AVX2) time went from 4.5 to about 3.6 seconds. Nice improvement.

@Mike-Bell Mike-Bell force-pushed the SupportWindowsAvx2Better branch from 4faae25 to 8f4f7d9 Compare January 6, 2023 17:05
@ggerganov ggerganov merged commit 41e05c6 into ggml-org:master Jan 6, 2023
@Mike-Bell Mike-Bell deleted the SupportWindowsAvx2Better branch January 6, 2023 17:46
@RndyP
Copy link

RndyP commented Jan 6, 2023

I did some research into what's going on here. One of the most critical sections of code is in ggml_vec_dot_f16(). AVX2 has fused multiply add instructions and AVX does not. So with AVX2 we have:
image
Without (just AVX) we have:
image
Note in the AVX2 case the disassembly shows the use of the fused intrinsics vfmadd231ps

@ggerganov
Copy link
Member

ggerganov commented Jan 6, 2023

@RndyP
So recently, I was revisiting the SIMD code in ggml.c and I came to the realisation that we don't use AVX2 at all because all x86 intrinsics that we currently use are either AVX, FMA or F16C. This is based on lookup in the following site:

https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

For example, the case that you refer to I think actually depends on wether FMA is supported or not:

https://github.com/ggerganov/whisper.cpp/blob/87dd4a30811ee07700ee6fee267508e8935b32fc/ggml.c#L483-L487

So it is strange that AVX2 flags make any difference at all on Windows. My understanding is that AVX2 should not make a difference, but I might be missing something.

@RndyP
Copy link

RndyP commented Jan 6, 2023

I think what's happening is the Microsoft compiler is smart enough to convert the _mm256_add_ps(_mm256_mul_ps(b, c), a) to the fused assembly version, when the compiler option is set. The code above goes from 16 down to 12 vector instructions. 4.5 to 3.6 seconds is a nice improvement.

anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants