
UPSTREAM PR #17914: Restore clip's cb() to its rightful glory#516

Open
loci-dev wants to merge 5 commits into main from upstream-PR17914-branch_pwilkin-clip-cb

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17914

I reused the callback function from my Qwen3Next testing days; it works more cleanly than the previous one, which was causing some problems with the scheduler / buffers.

@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis in Version Insights.

Performance Analysis Summary - PR #516

PR Title: Restore clip's cb() to its rightful glory
Changes: Single file modified (tools/mtmd/clip.cpp), 63 additions, 20 deletions


Analysis Overview

This PR refactors the debug callback mechanism in the CLIP vision encoder. The changes replace the previous ggml_cpy and ggml_dup_tensor approach with a custom operation using ggml_custom_4d that executes a print_debug callback during graph computation.

Code Changes:

  • Added ggml_get_float_value helper function for tensor data extraction (25 lines)
  • Added print_debug static callback function with tensor statistics computation (30 lines)
  • Modified cb function to use ggml_custom_4d instead of ggml_cpy approach (8 lines)
  • Removed post-execution debug printing loop from clip_image_batch_encode (12 lines)

Performance Impact:

The cb function shows a 46% response time improvement (1,787,204 ns reduction, from 3,843,130 ns to 2,055,926 ns). However, this function is only active when MTMD_DEBUG_GRAPH environment variable is set, making it a debug-only code path with no impact on production inference.

Functions in the CLIP image processing pipeline show improvements:

  • clip_image_build_graph: 85,774,540 ns reduction (44% improvement)
  • clip_image_batch_encode: 260,638,040 ns reduction (42% improvement)
  • warmup: 171,566,130 ns reduction (42% improvement)

Tokens Per Second Impact:

No impact on tokenization or text inference performance. The modified functions (cb, clip_image_build_graph, clip_image_batch_encode) are part of the vision encoder preprocessing pipeline, not the LLM inference path. Functions responsible for token generation (llama_decode, llama_encode, llama_tokenize) remain unchanged.

Power Consumption:

The libmtmd.so binary shows a 0.121% increase (158.69 nJ), which is negligible. All other binaries show no measurable change in power consumption.

Key Findings:

The refactoring improves debug callback execution by eliminating redundant tensor copies and moving statistics computation into the graph execution phase. The changes are isolated to debug functionality and vision preprocessing, with no effect on text generation throughput.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from e70bc15 to ef96f85 Compare December 14, 2025 09:08
@loci-dev loci-dev force-pushed the main branch 25 times, most recently from 81e654d to c785ce2 Compare December 18, 2025 13:19
