UPSTREAM PR #16574: mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter) #16

Closed
DajanaV wants to merge 1 commit into main from upstream-PR16574-branch_pockers21-feature/jinaclip-v2-projector

Conversation


DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16574

Update Notes (2025‑10‑22)

  • CLI Merge
    • Fold the standalone Jina CLI into mtmd-cli’s projector‑only flow; remove the extra binary.
  • Conversion Script (set_gguf_parameters)
    • Emit vision keys using the standard naming: clip.has_vision_encoder, clip.vision.image_size/patch_size/embedding_length/block_count/projection_dim/feed_forward_length/attention.head_count (see the sketch after this list).
    • Write only projector_type (set to 'jinaclip2'); do not introduce projector_version.
  • Inference (mtmd)
    • Use ggml_rope_ext to implement the 2D RoPE; reuse the existing bicubic resize for image preprocessing.
  • Minimal Validation
    • Conversion succeeds; gguf_dump shows clip.projector_type='jinaclip2'.
    • Minimal inference passes for both text and image; C++ vs Python cosine/RMSE are within the expected range.
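
For reference, a minimal sketch (not the converter's actual code) of emitting the vision keys listed above with gguf-py's GGUFWriter; the numeric values are illustrative placeholders, the real converter derives them from the model config:

```python
# Sketch only: write the JinaCLIP vision metadata keys named in the notes above.
# Values are placeholders; the converter reads the real ones from the HF config.
import gguf

writer = gguf.GGUFWriter("mmproj-jina-vision.gguf", arch="clip")

writer.add_bool("clip.has_vision_encoder", True)
writer.add_string("clip.projector_type", "jinaclip2")
writer.add_uint32("clip.vision.image_size", 512)
writer.add_uint32("clip.vision.patch_size", 14)
writer.add_uint32("clip.vision.embedding_length", 1024)
writer.add_uint32("clip.vision.block_count", 24)
writer.add_uint32("clip.vision.projection_dim", 512)
writer.add_uint32("clip.vision.feed_forward_length", 4096)
writer.add_uint32("clip.vision.attention.head_count", 16)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()
```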

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview

  • Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
  • Runtime: introduce PROJECTOR_TYPE_JINACLIP in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with a shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with common_embd_normalize(..., 2) (a numpy sketch of the equivalent normalization follows this list).
  • CLI (core): add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding correctness and performance checks; it depends only on common + mtmd + Threads, builds cross-platform, and has no third-party dependencies.
  • Compatibility: only activates when related GGUF metadata exists; doesn’t affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.
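
As referenced above, a numpy sketch of the math performed by common_embd_normalize(..., 2) on the single-token output, assuming the embedding has been dumped as raw float32 (file name illustrative):

```python
# L2-normalize a saved single-token embedding; this mirrors what the runtime
# does via common_embd_normalize(..., 2). "img_embd_cpp.bin" is an example name.
import numpy as np

emb = np.fromfile("img_embd_cpp.bin", dtype=np.float32)  # one 512-d vector
norm = np.linalg.norm(emb)
emb = emb / norm if norm > 0.0 else emb
```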

Scope of changes

  • convert_hf_to_gguf.py
    • Text: support both merged-LoRA single checkpoints and adapter-based export.
    • Vision (JinaCLIP v2): export clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for both fused and non-fused QKV (see the split sketch after this list).
  • tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
    • Add PROJECTOR_TYPE_JINACLIP: JinaCLIP v2 vision tower (2D RoPE with shared freq cache), attention internal LN, FFN sub-layer LN (enabled when both weight/bias present), single-token output (CLS-equivalent), unified L2 normalize.
    • clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
  • tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
    • Add llama-jinaclip-cli target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.
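
An illustrative sketch of the fused-QKV handling mentioned above: during conversion, a fused in-projection weight of shape (3 * hidden, hidden) is split into separate q/k/v tensors, while non-fused checkpoints pass through unchanged. Tensor names here are examples, not the converter's exact naming:

```python
# Sketch of fused vs non-fused QKV mapping during conversion (names illustrative).
import numpy as np

def map_qkv(name: str, data: np.ndarray, hidden: int):
    """Return (name, tensor) pairs, splitting a fused QKV projection if present."""
    if "qkv" in name and data.shape[0] == 3 * hidden:
        q, k, v = np.split(data, 3, axis=0)
        base = name.replace("qkv", "{}")
        return [(base.format("q"), q), (base.format("k"), k), (base.format("v"), v)]
    # non-fused checkpoints already ship separate q/k/v tensors
    return [(name, data)]
```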

Validation summary

  • CI: CPU-only ci/run.sh passes locally; no ggml op changes in this PR.
  • Correctness: embedding models have no perplexity metric, so correctness is verified via C++ vs Python parity.
    • TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
    • IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
  • Performance: checked with CLI encode_ms and thread scaling; no regression observed. More data can be added if requested.
  • Compatibility: activated only when GGUF metadata (projector_type=jinaclip, etc.) is present; other projectors unaffected.
  • Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).

Performance (absolute metrics, CPU-only minimal samples)

  • Environment
    • OS: Ubuntu 22.04.5 LTS
    • CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
    • Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
    • Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
    • Threads: primarily 8 threads for both text/image (with 1-thread comparison)
  • Metric definitions
    • Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
    • Image: use CLI line “image … done in … ms” (pure inference, excludes load)
  • Results (single sample, minimal)
    • Text (“hello world”, ≈5 tokens)
      • 1 thread: encode_ms ≈ 180.48 ms
      • 8 threads: encode_ms ≈ 34.08 ms
    • Image (512×512, single)
      • 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
  • Notes
    • The numbers above are CPU-only pure inference; end-to-end time (including model load) is higher and not reported.

GPU group (absolute metrics, minimal samples)

  • Environment
    • GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
    • Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
    • Threads: -t 8 (host-side preprocessing threads)
  • Results (pure inference, excludes load)
    • Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
    • Image (512×512, single): image done in ≈ 827 ms

Reproduction

Minimal commands & data (CPU)
  • Produce GGUF (with ST pooling metadata)
    • Text: jina-bert-v3.pooling_type = MEAN/CLS/LAST
    • Vision: clip.projector_type = jinaclip, clip.vision.rope_theta = 10000 (default)
  • Text parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0
    • Python: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE
  • Image parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-jinaclip-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0
    • Python: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE (see the sketch below)
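
A minimal parity check, assuming both the C++ CLI and the Python reference dump their 512-d embeddings as raw float32 files (file names are examples):

```python
# Compute cosine similarity and RMSE between the C++ and Python embeddings.
import numpy as np

a = np.fromfile("embd_cpp.bin", dtype=np.float32)
b = np.fromfile("embd_py.bin", dtype=np.float32)
assert a.shape == b.shape

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
rmse = float(np.sqrt(np.mean((a - b) ** 2)))
print(f"cosine={cosine:.6f} rmse={rmse:.6f}")
```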

Files in this PR

  • convert_hf_to_gguf.py
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
  • tools/mtmd/jinaclip-cli.cpp
  • tools/mtmd/CMakeLists.txt

Factor out the core FA loop into flash_atten_f16_one_chunk and add an outer loop on top that handles the chunks.