UPSTREAM PR #16906: model: add Janus Pro for image understanding (#32)
Conversation
This commit removes the `-dev` suffix from the version string in CMakeLists.txt and the release script. The version will now be formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)
* sync : whisper.cpp
* ggml: add spacemit backend
* add new line at end of file
* add riscv zba extension limit
* fixed for review comments, file renamed and format
* fixed for code format, after clang-format
* use _Float16 instead of __fp16
* add ci for riscv64-spacemit-ime-native
* update debian-13-riscv64-spacemit-ime-native ci label
* remove license comment for spacemit ime
* upgrade binutils for gcc ime
* add spacemit ime cross jobs
* remove native compile for riscv64-spacemit-ime
* ci : add caching for spacemit ime cross toolchain
* ci: bug fixed for cache path and env
* Update .github/workflows/build-linux-cross.yml for cache path
* bugfixed for build-linux-cross.yml, syntax error

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows
* ci : move AMD jobs to separate workflow
* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated
* chore: update webui build output
* feat: Improve thinking content parsing
* test: Adds ChatMessage component stories for different thinking blocks
* chore: update webui build output
* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences
* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It reads the etag from old .json files to prevent unnecessary re-downloads for existing users. This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels
* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs
* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently, regardless of whether CURL or SSL support is available, and now always points to `ggml.ai`. The previous timeout test has been removed, but it can be added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope
* Simplify inplace operation generation and combine mul/add generation
* Work on rope variants
* implement neox rope
* rope complete
* Add sub,div,glu operators
* implement scale op
* Update cpy shader to handle cont/more types
* formatting
* Update test vars printing for rope,rms_norm
* Avoid ROPE hardcoded constants
* Add TODO to change ROPE constants to enum
* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options
* chore: update webui build output
* common : disable progress bar without a tty
* Add missing headers

---------

Signed-off-by: Adrien Gallouët <[email protected]>
* fix ccache key for ubuntu-cpu-cmake
* set it for release as well [no ci]
…#16359)

* Make a few GLM tensors not required

layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6, resulting in the model not loading after conversion/quantization. This marks those tensors as not required, which makes loading work.

* Update llama-model.cpp

layer.nextn.shared_head_norm is also not required, in case of future models.
…(#16345)

* make ggml_vk_default_dispatcher support older vulkan headers
* simplify with `using`
* feat: Add a setting to include model name used to generate the message
* feat: UI improvements
* feat: Save model info along with the database message entry creation
* chore: Build webui static output
* feat: Improve code block theming
* chore: update webui build output
* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level
* chore: update webui build output
* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup
* chore: Update webui static build
* fix: Always truncate conversation names
* chore: Update webui static build
* vulkan: add mmq q2_k integer dot support
* Refactor mmq caching
* Reduce mmq register use
* Load 4 quant blocks into shared memory in one step
* Pack q2_k blocks into caches of 32
* Use 32-bit accumulators for integer dot matmul
* Add q4_k mmq
* Add q3_k mmq
* Add q5_k mmq
* Add q6_k mmq
* Add mxfp4 mmq, enable MMQ MUL_MAT_ID
* Fix mmv dm loads
* vulkan: Update topk_moe fusion to handle gpt's late softmax (based on #16649)
* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <[email protected]>
* llama: store mrope data in KV cell
* correct x,y ordering
* address review comments
* add consistency checks
* Update src/llama-kv-cache.cpp
* add TODO
* fix asan error
* kv-cells : improve ext handling
* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Status
All core inference and processing functions show no measurable performance changes between versions:
* Inference Functions
* Model Management Functions

Function Modification Status
All analyzed critical functions report …

Key Performance Indicator Impact Analysis
1. Tokens Per Second
   Status: No impact on inference throughput
   Reference Impact: Based on the provided benchmark (7% tokens/sec reduction for 2ms …
2. Power Consumption
   Status: Minimal impact on binary level
   Impacted Functions: The 0.24 nJ reduction in …
3. Quantization Efficiency
   Status: No impact
4. Memory Usage
   Status: No impact on memory management functions
5. Batch Processing
   Status: No impact on batch operations

Action Items for Performance Optimization
* Build System Optimizations
* Code-Level Optimizations

Performance Impact Assessment
The analysis reveals stable performance across all critical inference functions. The only measurable change is a 0.08 ns micro-regression in STL copy operations, representing less than 0.1% impact on any performance metric.

Key Findings: The performance stability indicates that the Janus Pro model additions in PR #32 successfully isolate new functionality without impacting existing inference performance.
Force-pushed from b655780 to 94ec54d.
Mirrored from ggml-org/llama.cpp#16906
This pull request introduces support for the Janus‑Pro 1B and Janus‑Pro 7B models within the llama.cpp framework.
The focus of this update is image understanding (i.e., visual input → textual or conceptual output).
Image generation is not covered by this PR.
Usage & Current Progress
Convert models to GGUF files:
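The exact commands are not spelled out in this description. Below is a minimal sketch assuming llama.cpp's standard multimodal (mtmd) conversion flow via `convert_hf_to_gguf.py`; the local checkpoint path, output file names, and the script handling Janus Pro directly are assumptions based on how other vision models are converted.

```bash
# Assumes the HF checkpoint was downloaded locally first, e.g.:
#   huggingface-cli download deepseek-community/Janus-Pro-1B --local-dir ./Janus-Pro-1B

# Convert the language model to GGUF (paths and file names are illustrative)
python convert_hf_to_gguf.py ./Janus-Pro-1B \
    --outfile janus-pro-1b-f16.gguf --outtype f16

# Export the vision encoder as a separate mmproj GGUF, as is done for
# other mtmd-supported models (assumed to apply to Janus Pro as well)
python convert_hf_to_gguf.py ./Janus-Pro-1B \
    --outfile mmproj-janus-pro-1b.gguf --mmproj
```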
Run the model:
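Again a hedged sketch: this uses llama.cpp's multimodal CLI (`llama-mtmd-cli`) with its standard flags; the GGUF file names come from the conversion step above, and the image and prompt are placeholders.

```bash
# Image understanding: describe a local image with the converted model
./llama-mtmd-cli \
    -m janus-pro-1b-f16.gguf \
    --mmproj mmproj-janus-pro-1b.gguf \
    --image ./example.jpg \
    -p "Describe this image in detail."
```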
References
Janus-Pro 1B model card (Hugging Face):
https://huggingface.co/deepseek-community/Janus-Pro-1B
Janus-Pro 7B model card (Hugging Face):
https://huggingface.co/deepseek-community/Janus-Pro-7B
Configurations:
https://huggingface.co/deepseek-community/Janus-Pro-1B/blob/main/config.json
https://huggingface.co/deepseek-community/Janus-Pro-7B/blob/main/config.json
HF Implementation:
https://github.com/huggingface/transformers/tree/main/src/transformers/models/janus