
UPSTREAM PR #18330: full modern bert support#683

Open
loci-dev wants to merge 1 commit into main from
upstream-PR18330-branch_ryan-mangeno-full-modern-bert-support

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18330

Added support for conversion from HF to GGUF and execution on llama.cpp, following my recent [granite-embd-support](https://github.com/ggml-org/llama.cpp/pull/15641) PR for a ModernBERT-based model; this PR continues off of that work with some tweaks.
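For reference, the GGUF tested below came out of the usual conversion flow; a sketch of the invocation (assuming the standard convert_hf_to_gguf.py entry point that ships with llama.cpp, and my local paths):

python convert_hf_to_gguf.py \
    ~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/ \
    --outfile ~/models/modern-bert-large.gguf

I ran cosine similarity tests with this script: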

from sentence_transformers import SentenceTransformer
import numpy as np
import subprocess
import shlex
import os

model_path = os.path.expanduser(
    "~/models/models--answerdotai--ModernBERT-large/snapshots/45bb4654a4d5aaff24dd11d4781fa46d39bf8c13/"
)
lcpp_model = os.path.expanduser("~/models/modern-bert-large.gguf")
lcpp_exe = "/Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding"

model = SentenceTransformer(model_path)

input_queries = [
    "hello world",
    "tell me a story about a developer and their dog",
    "123sfg this is a r@nd0m t35t",
]


def cosine_similarity(vector_a: np.ndarray, vector_b: np.ndarray) -> float:
    # Flatten so the (1, d) sentence-transformers output and the (d,)
    # llama.cpp vector compare as plain 1-D vectors.
    vector_a = np.asarray(vector_a).ravel()
    vector_b = np.asarray(vector_b).ravel()
    numerator = np.dot(vector_a, vector_b)
    denominator_a = np.linalg.norm(vector_a)
    denominator_b = np.linalg.norm(vector_b)
    if denominator_a == 0 or denominator_b == 0:
        return 0.0
    return float(numerator / (denominator_a * denominator_b))


for query in input_queries:
    print("### BASELINE ###")
    embedding = model.encode([query])
    print("Embedding shape:", embedding.shape)
    print("Embedding vector:", embedding[:, :8])

    print("### llama.cpp ###")
    cmd = f"{lcpp_exe} -m {lcpp_model} -p \"{query}\" --temp 0 --embd-normalize -1 --pooling mean"
    print(f"llama.cpp command: {cmd}")
    proc = subprocess.Popen(
        shlex.split(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    # llama-embedding prints the vector after a final ':' separator;
    # take that tail and parse the whitespace-separated floats.
    vals = out.decode("utf-8").split(":")[-1]
    vals = [float(v) for v in vals.split() if v.strip()]
    lcpp_emb = np.array(vals)
    print("llama.cpp Embedding shape:", lcpp_emb.shape)
    print("llama.cpp Embedding vector:", lcpp_emb[:8])
    print()
    cos_sim = cosine_similarity(embedding, lcpp_emb)
    print(f"COSINE SIMILARITY: {cos_sim}")
    print("--------------------------------")
    print()
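One optional tweak: the HuggingFace tokenizers fork warning visible in the first run below can be silenced by setting, before the loop, the environment variable the warning itself recommends:

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # os is imported above; set before the first fork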

Running this gives the following results:

### BASELINE ###

Embedding shape: (1, 1024)
Embedding vector: [[ 0.659244 0.39849958 0.2302168 0.6192862 -0.62407815 0.0042014
0.14638135 0.2541136 ]]

### llama.cpp ###

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "hello world" --temp 0 --embd-normalize -1 --pooling mean
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.679083 0.394328 0.235899 0.62807 -0.630304 -0.004468 0.141795
0.248705]

COSINE SIMILARITY: 0.99971951

### BASELINE ###

Embedding shape: (1, 1024)
Embedding vector: [[ 0.23057191 0.12633912 -0.00238159 -0.08394846 0.19630949 0.03715154
0.0040304 0.63173795]]

### llama.cpp ###

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "tell me a story about a developer and their dog" --temp 0 --embd-normalize -1 --pooling mean
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.230692 0.131872 -0.007958 -0.065828 0.282647 0.056364 -0.025206
0.672672]

COSINE SIMILARITY: 0.9994365

### BASELINE ###

Embedding shape: (1, 1024)
Embedding vector: [[ 0.15972608 0.52267325 -0.05636618 0.40699816 0.6401572 0.49469572
-0.4336093 0.3909793 ]]

### llama.cpp ###

llama.cpp command: /Users/ryanmangeno/Projects/gits/llama-fix/llama.cpp/build/bin/llama-embedding -m /Users/ryanmangeno/models/modern-bert-large.gguf -p "123sfg this is a r@nd0m t35t" --temp 0 --embd-normalize -1 --pooling mean
llama.cpp Embedding shape: (1024,)
llama.cpp Embedding vector: [ 0.177373 0.495097 -0.114586 0.46121 0.635596 0.548017 -0.400412
0.430722]

COSINE SIMILARITY: 0.99780866

Running the same tests on granite-embd-small gives the same results as before.
@gabe-l-hart

@loci-review

loci-review bot commented Dec 24, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Scope: PR #683 adds ModernBERT classifier normalization support across 10 files with 27 additions and 6 deletions.

Key Findings

Performance-Critical Area Impact:

The changes affect the embedding generation pipeline by adding an optional layer-normalization step in llm_graph_context::build_pooling: when the cls_norm tensor is present, the pooled output is normalized. This adds approximately 5,000-10,000 ns per embedding computation for ModernBERT models, while models without cls_norm see no measurable overhead, since the null check is reliably branch-predicted.
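As a rough illustration of what the conditional step computes, here is a NumPy sketch of a layer norm applied to a pooled embedding (names are hypothetical; the real implementation builds ggml graph ops inside build_pooling, not NumPy):

import numpy as np

def apply_cls_norm(pooled, cls_norm_w, cls_norm_b=None, eps=1e-5):
    # Skipped entirely when the model ships no cls_norm tensor,
    # mirroring the null check in build_pooling.
    if cls_norm_w is None:
        return pooled
    mu = pooled.mean(axis=-1, keepdims=True)
    var = pooled.var(axis=-1, keepdims=True)
    normed = (pooled - mu) / np.sqrt(var + eps)
    normed = normed * cls_norm_w          # learned scale
    if cls_norm_b is not None:
        normed = normed + cls_norm_b      # learned shift, if present
    return normed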

Inference Performance:

The modifications do not impact token generation functions (llama_decode, llama_encode, llama_tokenize). The changes are isolated to the embedding pooling path, which executes during embedding extraction, not during autoregressive token generation. Therefore, tokens per second remains unchanged for text generation workloads.

For embedding-specific workloads, the 5000-10000 ns increase per embedding represents less than 1% of total embedding inference time on the reference hardware (12th Gen Intel Core i7-1255U).
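As a quick sanity check on that figure: assuming an embedding pass takes at least ~1 ms (1,000,000 ns) on this hardware, the 10,000 ns worst case works out to 10,000 / 1,000,000 = 1% or less, consistent with the sub-1% claim.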

Function-Level Changes:

llama_model_saver::add_kv shows improved performance: response time drops from 33 ns to 26 ns and throughput time from 25 ns to 18 ns (a 7 ns improvement in each). This results from code reorganization that enabled better compiler optimization when the cls_norm serialization logic was added.

Power Consumption:

The build.bin.libllama.so binary shows minimal power-consumption change. The added normalization executes conditionally, and only for models requiring classifier normalization, resulting in negligible impact on the binary's overall power profile.

loci-dev force-pushed the main branch 27 times, most recently from f14c301 to c7d40d0 on December 28, 2025 at 11:07.
loci-dev force-pushed the main branch 30 times, most recently from 1f52e52 to 59c4631 on January 2, 2026 at 22:08.