@FIR-1507 - GGML Blob: Unify TMU Codebase Across POSIX and FPGA#96
Open
akapoor3518 wants to merge 1 commit into master
atrivedi-tsavoritesi approved these changes on Mar 18, 2026
tsisim is currently broken, so this change can't be tested on tsisim yet. Let's proceed with it as is; we will validate it once tsisim is working.
Tested on POSIX and compared the generated blob Python file.
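The blob comparison mentioned above can be scripted. This is a minimal sketch, assuming the generated blob Python files from the two builds are saved side by side; the helper name `compare_blob_files` is hypothetical, and the blob generator's real output paths are not given in this PR.

```python
import difflib
from pathlib import Path


def compare_blob_files(before: Path, after: Path) -> list[str]:
    """Return a unified diff between two generated blob Python files.

    An empty list means the two files are textually identical.
    """
    before_lines = before.read_text().splitlines(keepends=True)
    after_lines = after.read_text().splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            before_lines,
            after_lines,
            fromfile=str(before),
            tofile=str(after),
        )
    )
```

Usage: call `compare_blob_files(Path("<blob from master>"), Path("<blob from this branch>"))` and treat a non-empty result as a regression to inspect. A textual diff is the simplest check; it will also flag harmless differences such as reordered but equivalent statements.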
POSIX LOG
[akapoor@wspd0 llama.cpp]$ ./build-posix/bin/llama-cli-original -m /proj/rel/sw/ggml/models/smolVLM-256M.gguf --device tSavorite -p "my cat's name is" -c 2048 -b 128 --n-predict 4 --temp 0.0 --top-k 50 --top-p 0.9 --repeat-penalty 1.5 --repeat-last-n 5 --no-warmup --no-conversation
Failed to generate tool call example: Value is not callable: null at row 1, column 72:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 13:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 1:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
my cat's name is not mentioned in the
llama_perf_sampler_print: sampling time = 15.65 ms / 9 runs ( 1.74 ms per token, 575.12 tokens per second)
llama_perf_context_print: load time = 372657.41 ms
llama_perf_context_print: prompt eval time = 370205.55 ms / 5 tokens (74041.11 ms per token, 0.01 tokens per second)
llama_perf_context_print: eval time = 973.80 ms / 3 runs ( 324.60 ms per token, 3.08 tokens per second)
llama_perf_context_print: total time = 373648.27 ms / 8 tokens
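The derived columns in the `llama_perf_*` lines above are consistent with simple arithmetic on the logged totals (ms per token = total ms / count, tokens per second = 1000 × count / total ms). The sketch below recomputes them from the POSIX log values to make the prompt-eval versus decode gap explicit; `per_token_stats` is a hypothetical helper, not a llama.cpp function.

```python
def per_token_stats(total_ms: float, n: int) -> tuple[float, float]:
    """Return (ms per token, tokens per second) for n tokens in total_ms."""
    ms_per_token = total_ms / n
    tokens_per_second = 1000.0 * n / total_ms
    return ms_per_token, tokens_per_second


# Totals and counts copied from the POSIX log above.
prompt_ms, prompt_tps = per_token_stats(370205.55, 5)  # prompt eval: 5 tokens
decode_ms, decode_tps = per_token_stats(973.80, 3)     # eval: 3 runs

print(f"prompt eval: {prompt_ms:.2f} ms/token, {prompt_tps:.2f} tokens/s")
print(f"decode:      {decode_ms:.2f} ms/token, {decode_tps:.2f} tokens/s")
print(f"prompt eval is ~{prompt_ms / decode_ms:.0f}x slower per token than decode")
```

On these numbers, nearly all of the 373.6 s total is spent in prompt evaluation, which lines up with the MUL_MAT OPU time in the summary that follows.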
=== GGML Perf Summary ===
Op        Target    Runs  TSI_KERNEL-RUN    Total us       Avg us
ADD       OPU        420             652      539845      1285.35
MUL       OPU        427             663      465522      1090.22
RMS_NORM  OPU        427             427      369502       865.34
MUL_MAT   CPU       7020               0     1201116       171.10
MUL_MAT   OPU        117           11442   369371522   3157021.56
CONT      CPU       1579               0       67573        42.79
RESHAPE   CPU       2345               0         746         0.32
VIEW      CPU       3622               0         452         0.12
PERMUTE   CPU       2892               0         372         0.13
TRANSPOSE CPU        705               0         161         0.23
GET_ROWS  CPU         53               0        3503        66.09
SET_ROWS  CPU       1585               0        1263         0.80
SOFT_MAX  CPU        805               0       28588        35.51
ROPE      CPU       1650               0        6636         4.02
GLU       OPU        210             326      339890      1618.52
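Judging from the numbers alone, the `Avg us` column equals `Total us` divided by `Runs` (not by `TSI_KERNEL-RUN`): for example, 539845 / 420 ≈ 1285.35 for ADD. That reading is inferred, not documented. The sketch below recomputes the averages and the share of summed op time taken by each (op, target) pair; `avg_us` and `share_pct` are hypothetical helpers.

```python
# (op, target, runs, total_us) copied from the "GGML Perf Summary" table above.
ROWS = [
    ("ADD", "OPU", 420, 539845),
    ("MUL", "OPU", 427, 465522),
    ("RMS_NORM", "OPU", 427, 369502),
    ("MUL_MAT", "CPU", 7020, 1201116),
    ("MUL_MAT", "OPU", 117, 369371522),
    ("CONT", "CPU", 1579, 67573),
    ("RESHAPE", "CPU", 2345, 746),
    ("VIEW", "CPU", 3622, 452),
    ("PERMUTE", "CPU", 2892, 372),
    ("TRANSPOSE", "CPU", 705, 161),
    ("GET_ROWS", "CPU", 53, 3503),
    ("SET_ROWS", "CPU", 1585, 1263),
    ("SOFT_MAX", "CPU", 805, 28588),
    ("ROPE", "CPU", 1650, 6636),
    ("GLU", "OPU", 210, 339890),
]


def avg_us(runs: int, total_us: int) -> float:
    """Average microseconds per run (inferred meaning of the Avg us column)."""
    return total_us / runs


def share_pct(rows, op: str, target: str) -> float:
    """Percentage of the summed Total us column attributed to one (op, target)."""
    grand_total = sum(total for _, _, _, total in rows)
    picked = sum(total for o, t, _, total in rows if o == op and t == target)
    return 100.0 * picked / grand_total


for op, target, runs, total in ROWS:
    print(f"{op:<10}{target:<5}{avg_us(runs, total):>14.2f} us/run")
print(f"MUL_MAT on OPU: {share_pct(ROWS, 'MUL_MAT', 'OPU'):.2f}% of summed op time")
```

With these rows, MUL_MAT on the OPU accounts for about 99.2% of the summed `Total us`, so it dominates this run. Note that its 117 runs correspond to 11442 `TSI_KERNEL-RUN` entries, so one graph-level op appears to fan out into many kernel launches.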
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
12874 3.65e+05 28.3746 0.0000 [97.74%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
9780 3.34e+05 34.1148 3.34e+05 └─ [89.27%] [ txe_mul_mat_tile_f32_k128 ]
1662 28431.6348 17.1069 28431.6348 └─ [ 7.61%] [ txe_mul_mat_tile_f32_k64 ]
12874 10868.1452 0.8442 10868.1452 └─ [ 2.91%] TXE 0 Idle
236 78.7403 0.3336 78.7403 └─ [2.11e-02%] [ txe_swiglu ]
472 58.5994 0.1242 58.5994 └─ [1.57e-02%] [ txe_add ]
480 58.0449 0.1209 58.0449 └─ [1.55e-02%] [ txe_mult ]
244 45.5822 0.1868 45.5822 └─ [1.22e-02%] [ txe_rms_norm ]
[Thread] tsi::runtime::TsavRTPosix::initialize (cumulative over all threads)
1 36.4620 36.4620 32.7460 [9.76e-03%] [Thread] tsi::runtime::TsavRTPosix::initialize
1 3.5800 3.5800 3.0830 └─ [9.58e-04%] tsi::runtime::TsavRTPosix::initializeQueues
1 0.3640 0.3640 0.3640 └─ [9.74e-05%] tsi::runtime::TsavRT::awaitCommandListCompletion
1 0.1330 0.1330 0.0880 └─ [3.56e-05%] tsi::runtime::TsavRT::finalizeCommandList
1 0.0450 0.0450 0.0450 └─ [1.20e-05%] tsi::runtime::executeWithTimeout
1 0.1360 0.1360 0.1360 └─ [3.64e-05%] tsi::runtime::TsavRT::initialize
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
12874 1722.2390 0.1338 0.0000 [4.61e-01%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
12874 10867.9667 0.8442 10867.9667 └─ [ 2.91%] TXE 0 Idle
25748 1294.5480 0.0503 1294.5480 └─ [3.46e-01%] tsi::runtime::executeWithTimeout
12864 236.9801 0.0184 236.9801 └─ [6.34e-02%] Command{command=2 (LOAD_BLOB), blob_args=[140065255486080[...
12874 8.6650 6.73e-04 8.6650 └─ [2.32e-03%] LOAD_BLOB Command Execution
6 6.5256 1.0876 6.5256 └─ [1.75e-03%] Command{command=2 (LOAD_BLOB), blob_args=[140065255483008[...
4 3.9426 0.9856 3.9426 └─ [1.05e-03%] Command{command=2 (LOAD_BLOB), blob_args=[140065255484544[...
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
12874 2158.4290 0.1677 0.0000 [5.78e-01%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
12874 10868.3515 0.8442 10868.3515 └─ [ 2.91%] TXE 0 Idle
25748 1473.5760 0.0572 1473.5760 └─ [3.94e-01%] tsi::runtime::executeWithTimeout
12864 238.4388 0.0185 238.4388 └─ [6.38e-02%] Command{command=3 (UNLOAD_BLOB), blob_args=[14006525548608...
12874 9.5790 7.44e-04 9.5790 └─ [2.56e-03%] UNLOAD_BLOB Command Execution
4 0.0845 0.0211 0.0845 └─ [2.26e-05%] Command{command=3 (UNLOAD_BLOB), blob_args=[14006525548454...
6 0.0455 0.0076 0.0455 └─ [1.22e-05%] Command{command=3 (UNLOAD_BLOB), blob_args=[14006525548300...
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
1 8.9640 8.9640 0.0000 [2.40e-03%] [Thread] tsi::runtime::TsavRT::finalize
1 10891.8430 10891.8430 10891.8430 └─ [ 2.91%] TXE 0 Idle
2 0.2630 0.1315 0.2630 └─ [7.04e-05%] tsi::runtime::executeWithTimeout
1 0.0146 0.0146 0.0146 └─ [3.90e-06%] Command{command=4 (RELEASE), blob_args=[0[0], 3[0x3], 4[0x...
2 0.0110 0.0055 0.0110 └─ [2.94e-06%] tsi::runtime::TsavRT::deallocate
1 0.0020 0.0020 0.0020 └─ [5.35e-07%] RELEASE Command Execution
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
12875 3.64e+05 28.3058 236.5760 [97.51%] [Thread] tsi::runtime::TsavRT::processResponses
12875 3.64e+05 28.2874 3.64e+05 └─ [97.45%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
1 0.0850 0.0850 0.0570 [2.27e-05%] [Thread] OPU
1 0.0280 0.0280 0.0280 └─ [7.49e-06%] tsi::runtime::TsavRT::allocate
[Thread] txe_rms_norm (cumulative over all threads)
732 55.8470 0.0763 54.5820 [1.49e-02%] [Thread] txe_rms_norm
732 1.2650 0.0017 1.2650 └─ [3.38e-04%] tsi::runtime::executeWithTimeout
[Thread] txe_swiglu (cumulative over all threads)
708 89.2650 0.1261 88.0330 [2.39e-02%] [Thread] txe_swiglu
708 1.2320 0.0017 1.2320 └─ [3.30e-04%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
12874 248.7660 0.0193 227.3870 [6.66e-02%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
12874 21.3790 0.0017 21.3790 └─ [5.72e-03%] tsi::runtime::executeWithTimeout
[Thread] txe_add (cumulative over all threads)
1416 76.5070 0.0540 74.2400 [2.05e-02%] [Thread] txe_add
1416 2.2670 0.0016 2.2670 └─ [6.07e-04%] tsi::runtime::executeWithTimeout
[Thread] txe_mult (cumulative over all threads)
1440 75.9970 0.0528 73.7720 [2.03e-02%] [Thread] txe_mult
1440 2.2250 0.0015 2.2250 └─ [5.95e-04%] tsi::runtime::executeWithTimeout
[Thread] txe_mul_mat_tile_f32_k128 (cumulative over all threads)
29340 3.34e+05 11.3937 3.34e+05 [89.44%] [Thread] txe_mul_mat_tile_f32_k128
29340 167.0290 0.0057 167.0290 └─ [4.47e-02%] tsi::runtime::executeWithTimeout
[Thread] txe_mul_mat_tile_f32_k64 (cumulative over all threads)
4986 28529.1790 5.7219 28509.3730 [ 7.63%] [Thread] txe_mul_mat_tile_f32_k64
4986 19.8060 0.0040 19.8060 └─ [5.30e-03%] tsi::runtime::executeWithTimeout
[Thread] TXE 0 Idle Time (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
12874 30.5550 0.0024 30.5550 [8.18e-03%] [Thread] tsi::runtime::TsavRT::deallocate
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
12874 68.2900 0.0053 68.2900 [1.83e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
12889 43.0570 0.0033 43.0570 [1.15e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::executeWithTimeout (cumulative over all threads)
38626 9798.5210 0.2537 9798.5210 [ 2.62%] [Thread] tsi::runtime::executeWithTimeout
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.9991