Version Insights Performance Analysis Summary

Scope: PR #683 adds ModernBERT classifier normalization support across 10 files, with 27 additions and 6 deletions.

Key findings:
- Performance-critical area impact: the changes affect the embedding generation pipeline by adding an optional layer normalization operation.
- Inference performance: the modifications do not impact token generation functions. For embedding-specific workloads, the 5000-10000 ns increase per embedding represents less than 1% of total embedding inference time on the reference hardware (12th Gen Intel Core i7-1255U).
- Function-level changes and power consumption: (content truncated in the mirrored page)
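The "optional layer normalization operation" mentioned in the summary is, presumably, a standard LayerNorm applied over the hidden dimension of the embedding output. As a rough illustration only (this is not the PR's actual code, and the weights here are hypothetical), the operation amounts to:

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5):
    """Standard layer normalization over the last (hidden) dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * weight + bias

# Hypothetical 4-dim hidden state; ModernBERT-large actually uses 1024.
h = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(h, weight=np.ones(4), bias=np.zeros(4))
print(out.mean(), out.var())  # ~0 mean, ~1 variance after normalization
```

The per-call cost is a mean, a variance, and an elementwise scale/shift over the hidden dimension, which is consistent with the sub-1% overhead reported above.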
Mirrored from ggml-org/llama.cpp#18330
Adds support for HF->GGUF conversion and execution in llama.cpp, following my recent [granite-embd-support](https://github.com/ggml-org/llama.cpp/pull/15641) PR for a ModernBERT-based model; this PR continues off of that with some tweaks. I have run cosine similarity tests with this script,
with the following results:
```
BASELINE
Embedding shape: (1, 1024)
Embedding vector: [[ 0.659244    0.39849958  0.2302168   0.6192862  -0.62407815  0.0042014
   0.14638135  0.2541136 ]]

llama.cpp
llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "hello world" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.679083  0.394328  0.235899  0.62807  -0.630304 -0.004468  0.141795
  0.248705]
COSINE SIMILARITY: [0.99971951]
```
```
BASELINE
Embedding shape: (1, 1024)
Embedding vector: [[ 0.23057191  0.12633912 -0.00238159 -0.08394846  0.19630949  0.03715154
   0.0040304   0.63173795]]

llama.cpp
llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "tell me a story about a developer and their dog" --temp 0 --embd-normalize -1 --pooling mean
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.230692  0.131872 -0.007958 -0.065828  0.282647  0.056364 -0.025206
  0.672672]
COSINE SIMILARITY: [0.9994365]
```
```
BASELINE
Embedding shape: (1, 1024)
Embedding vector: [[ 0.15972608  0.52267325 -0.05636618  0.40699816  0.6401572   0.49469572
  -0.4336093   0.3909793 ]]

llama.cpp
llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "123sfg this is a r@nd0m t35t" --temp 0 --embd-normalize -1 --pooling mean
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.177373  0.495097 -0.114586  0.46121  0.635596  0.548017 -0.400412
  0.430722]
COSINE SIMILARITY: [0.99780866]
```
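The comparison performed above boils down to mean pooling over token embeddings (matching `--pooling mean`) followed by cosine similarity against the baseline. A self-contained numpy sketch of those two steps (function names are my own for illustration, not the actual test script):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over non-padding positions (mirrors --pooling mean)."""
    mask = np.asarray(attention_mask, dtype=np.float64)[:, None]
    emb = np.asarray(token_embeddings, dtype=np.float64)
    return (emb * mask).sum(axis=0) / mask.sum()

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: 3 tokens (last one padding), hidden size 2.
tokens = np.array([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]])
pooled = mean_pool(tokens, attention_mask=[1, 1, 0])  # -> [2.0, 1.0]
print(cosine_similarity(pooled, pooled))  # identical vectors -> 1.0
```

Since `--embd-normalize -1` leaves the llama.cpp output unnormalized, cosine similarity is the right metric here: it is scale-invariant, so only the direction of the two vectors is compared.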
Running the same tests on granite-embd-small gives the same results as before.
@gabe-l-hart