@FIR-737: Added another endpoint llama-cli to invoke directly in URL #11
Merged
LewisLui777 merged 1 commit into master on Jun 13, 2025
This commit has two changes:
1. Added another endpoint, llama-cli, to invoke run_platform_test.sh directly.
2. Updated the reading of output to go byte by byte, so that the marker prompt can be identified and reading exits when the marker is seen.
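The byte-by-byte reading described in change 2 could look roughly like the sketch below. The PR's actual code is not shown in this conversation, so the function name `read_until_marker` and the marker string `@@DONE@@` are hypothetical; only the technique (read one byte at a time, stop as soon as the marker appears) comes from the description.

```python
import subprocess
import sys


def read_until_marker(cmd, marker: bytes) -> bytes:
    """Run cmd and read its output one byte at a time until marker is seen.

    Returns everything read so far, including the marker itself. If the
    process ends before the marker appears, returns all of its output.
    """
    # bufsize=0 gives an unbuffered pipe, so read(1) really is one byte.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, bufsize=0
    )
    buf = bytearray()
    try:
        while True:
            byte = proc.stdout.read(1)
            if not byte:  # EOF: the process closed its output
                break
            buf += byte
            if buf.endswith(marker):  # marker seen: stop reading
                break
    finally:
        if proc.poll() is None:  # still running after the marker
            proc.kill()
        proc.stdout.close()
        proc.wait()
    return bytes(buf)


if __name__ == "__main__":
    # Stand-in child process; the real endpoint would run run_platform_test.sh.
    child = [sys.executable, "-c", "print('some output @@DONE@@ ignored tail')"]
    print(read_until_marker(child, b"@@DONE@@").decode())
```

Reading a single byte per call is slow for large outputs, but it lets the reader stop exactly at the marker without waiting for a newline or a full buffer.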
reach2shaunak
approved these changes
Jun 13, 2025
mmankal
approved these changes
Jun 13, 2025
LewisLui777
approved these changes
Jun 13, 2025
LewisLui777
left a comment
Both of these look amazing. Thank you so much for your help.
dineshReddy6381
pushed a commit
that referenced
this pull request
Sep 17, 2025
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
  remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
  Co-authored-by: Georgi Gerganov <[email protected]>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <[email protected]>
---------
Co-authored-by: slaren <[email protected]>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  ggml : use e8m0 conversion instead of powf
  Co-authored-by: Diego Devesa <[email protected]>
  change kvalues_mxfp4 table to match e2m1 (#6)
  metal : remove quantization for now (not used)
  cuda : fix disabled CUDA graphs due to ffn moe bias
  vulkan : add support for mxfp4
  cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci
* cleanup ggml-ci
* sycl : fix supports_op for MXFP4 ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build ggml-ci
* fix hip build ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: slaren <[email protected]>
The test results below are from the browser on FPGA2, without any markers.
The URL used is http://10.50.0.112:5003/llama-cli?model=tiny-llama&backend=tSavorite&tokens=5&prompt=Hello+How+are+you
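As an illustration of how the query parameters in this URL line up with the command shown at the top of the log, here is a minimal sketch. The argument order (prompt, tokens, model file, backend) and the `tiny-llama` to `tinyllama-vo-5m-para.gguf` pairing are read off the log below; the function name `build_command`, the `MODEL_FILES` table, and the error handling are assumptions, not the server's actual code.

```python
from urllib.parse import parse_qs, urlsplit

# Script path as it appears in the test log.
SCRIPT = "/usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/run_platform_test.sh"

# Assumed lookup from the `model` query value to a GGUF file name;
# only this one pairing is visible in the log.
MODEL_FILES = {"tiny-llama": "tinyllama-vo-5m-para.gguf"}


def build_command(url: str) -> list[str]:
    """Turn a /llama-cli URL into an argument list for run_platform_test.sh."""
    query = parse_qs(urlsplit(url).query)  # "+" is decoded to a space

    def required(name: str) -> str:
        values = query.get(name)
        if not values:
            raise ValueError(f"missing query parameter: {name}")
        return values[0]

    model = required("model")
    if model not in MODEL_FILES:
        raise ValueError(f"unknown model: {model}")
    tokens = int(required("tokens"))  # rejects non-numeric input
    # Passing a list (not a shell string) avoids quoting problems in the prompt.
    return [SCRIPT, required("prompt"), str(tokens), MODEL_FILES[model], required("backend")]


if __name__ == "__main__":
    url = ("http://10.50.0.112:5003/llama-cli?model=tiny-llama"
           "&backend=tSavorite&tokens=5&prompt=Hello+How+are+you")
    print(build_command(url))
```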
/usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/run_platform_test.sh "Hello How are you" 5 tinyllama-vo-5m-para.gguf tSavorite
Check if tnApcMgr is running; if it is not, uncomment below line and execute the run_platform_test.sh script.
Running on v0.1.1.tsv31_06_06_2025
register_backend: registered backend Tsavorite (1 devices)
register_device: registered device Tsavorite (txe)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/tsi-ggml/libggml-tsavorite.so
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/tsi-ggml/libggml-cpu.so
build: 5473 (a7b7e46) with gcc (GCC) 13.3.0 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
TXE Device MEMORY Summary total 134217728 and free 134217728
llama_model_load_from_file_impl: using device Tsavorite (txe) - 128 MiB free
llama_model_loader: loaded meta data with 38 key-value pairs and 201 tensors from /tsi/akapoor/ggml/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Tiny Llama v0.3 FP32
llama_model_loader: - kv 3: general.size_label str = 1.1B
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: general.dataset.count u32 = 3
llama_model_loader: - kv 6: general.dataset.0.name str = SlimPajama 627B
llama_model_loader: - kv 7: general.dataset.0.organization str = Cerebras
llama_model_loader: - kv 8: general.dataset.0.repo_url str = https://huggingface.co/cerebras/SlimP...
llama_model_loader: - kv 9: general.dataset.1.name str = Starcoderdata
llama_model_loader: - kv 10: general.dataset.1.organization str = Bigcode
llama_model_loader: - kv 11: general.dataset.1.repo_url str = https://huggingface.co/bigcode/starco...
llama_model_loader: - kv 12: general.dataset.2.name str = Oasst_Top1_2023 08 25
llama_model_loader: - kv 13: general.dataset.2.version str = 08-25
llama_model_loader: - kv 14: general.dataset.2.organization str = OpenAssistant
llama_model_loader: - kv 15: general.dataset.2.repo_url str = https://huggingface.co/OpenAssistant/...
llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 17: llama.block_count u32 = 22
llama_model_loader: - kv 18: llama.context_length u32 = 2048
llama_model_loader: - kv 19: llama.embedding_length u32 = 2048
llama_model_loader: - kv 20: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 21: llama.attention.head_count u32 = 32
llama_model_loader: - kv 22: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 24: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 25: general.file_type u32 = 0
llama_model_loader: - kv 26: llama.vocab_size u32 = 32003
llama_model_loader: - kv 27: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 28: tokenizer.ggml.model str = llama
llama_model_loader: - kv 29: tokenizer.ggml.pre str = default
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,32003] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 31: tokenizer.ggml.scores arr[f32,32003] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,32003] = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = all F32
print_info: file size = 4.10 GiB (32.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6
load: token to piece cache size = 0.1684 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 2048
print_info: n_layer = 22
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5632
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 1.10 B
print_info: general.name = Tiny Llama v0.3 FP32
print_info: vocab type = SPM
print_info: n_vocab = 32003
print_info: n_merges = 0
print_info: BOS token = 1 ''
print_info: EOS token = 2 ''
print_info: EOT token = 32002 '<|im_end|>'
print_info: UNK token = 0 ''
print_info: PAD token = 32000 '[PAD]'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 ''
print_info: EOG token = 32002 '<|im_end|>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
TXE Device MEMORY Summary total 134217728 and free 134217728
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/23 layers to GPU
load_tensors: CPU_Mapped model buffer size = 4196.40 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 12288
llama_context: n_ctx_per_seq = 12288
llama_context: n_batch = 1024
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (12288) > n_ctx_train (2048) -- possible training context overflow
[2018-03-09 12:37:20.831859] 277:278 [info] :: TXE resource allocation request processed successfully.
llama_context: CPU output buffer size = 0.12 MiB
llama_kv_cache_unified: CPU KV buffer size = 264.00 MiB
llama_kv_cache_unified: size = 264.00 MiB ( 12288 cells, 22 layers, 1 seqs), K (f16): 132.00 MiB, V (f16): 132.00 MiB
ggml_backend_tsavorite_buffer_type_alloc_buffer is called from llama data Loader
ANoop Allocating memory from tsi_alloc with size 15732736
Allocating memory from tsi_alloc with size 15732736 starting memory 0xfffea3cb3080
Address of Newly Created BUffer 0xfffea3cb3080 and size 15732736
llama_context: tsavorite compute buffer size = 15.00 MiB
llama_context: CPU compute buffer size = 808.01 MiB
llama_context: graph nodes = 798
llama_context: graph splits = 223 (with bs=512), 137 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 12288
main: llama threadpool init, n_threads = 4
main: model was trained on only 2048 context tokens (12288 specified)
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 1594714430
sampler params:
repeat_last_n = 5, repeat_penalty = 1.500, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 12288
top_k = 50, top_p = 0.900, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 12288, n_batch = 1024, n_predict = 5, n_keep = 1
my cat’s name is Luna.
llama_perf_sampler_print: sampling time = 163.23 ms / 11 runs ( 14.84 ms per token, 67.39 tokens per second)
llama_perf_context_print: load time = 125697.32 ms
llama_perf_context_print: prompt eval time = 94708.24 ms / 6 tokens (15784.71 ms per token, 0.06 tokens per second)
llama_perf_context_print: eval time = 142173.74 ms / 4 runs (35543.44 ms per token, 0.03 tokens per second)
llama_perf_context_print: total time = 268116.82 ms / 10 tokens
TXE_ADD Operation, total tensor: 5 Number of Kernel Call: 160 Number of tensor got spilt: 5 Min Num of Elem 2048 Max Num of Elem 2048
TXE_SUB Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_MULT Operation, total tensor: 225 Number of Kernel Call: 14080 Number of tensor got spilt: 225 Min Num of Elem 2048 Max Num of Elem 12288
TXE_DIV Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SQRT Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_NEG Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_ABS Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIN Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIGMOID Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SILU Operation, total tensor: 110 Number of Kernel Call: 18920 Number of tensor got spilt: 110 Min Num of Elem 5632 Max Num of Elem 33792
[2018-03-09 12:41:22.453747] 277:278 [info] :: TXE resource release request processed successfully.
GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
   Calls    Total(ms)     T/call     Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
   33160    47905.000      1.445        0.000  [20%] RuntimeHostShim::awaitCommandListCompletion
   18920    29773.936      1.574    29773.936    └─ [12%] [ txe_silu ]
   14080    21871.588      1.553    21871.588    └─ [ 9%] [ txe_mult ]
     160      251.353      1.571      251.353    └─ [ 0%] [ txe_add ]
   33160        0.541      0.000        0.541    └─ [ 0%] TXE 0 Idle
       1      204.000    204.000       26.000  [ 0%] GGML Tsavorite
       1      178.000    178.000      178.000    └─ [ 0%] RuntimeHostShim::initialize
       1       87.000     87.000       87.000  [ 0%] RuntimeHostShim::finalize
   33160       50.000      0.002       50.000  [ 0%] RuntimeHostShim::loadBlob
   33160       40.000      0.001       40.000  [ 0%] RuntimeHostShim::finalizeCommandList
   33160        7.000      0.000        7.000  [ 0%] RuntimeHostShim::createCommandList
   33160        5.000      0.000        5.000  [ 0%] RuntimeHostShim::deallocate
   33161        4.000      0.000        4.000  [ 0%] RuntimeHostShim::allocate
   33160        4.000      0.000        4.000  [ 0%] RuntimeHostShim::addCommandToList
   33160        1.000      0.000        1.000  [ 0%] RuntimeHostShim::unloadBlob
  113720        0.000      0.000        0.000  [ 0%] RuntimeHostShim::getShmemManager
   33160        0.000      0.000        0.000  [ 0%] RuntimeHostShim::launchBlob
========================================================================================================================
  412163   241743.000      0.587   241743.000  [100%] TOTAL
========================================================================================================================
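The per-token figures in the `llama_perf_*` lines of this log can be re-derived from the totals they report. The sketch below does that arithmetic for one line. The regular expression and the name `per_unit` are illustrative assumptions, not part of llama.cpp or this PR.

```python
import re

# Matches e.g. "prompt eval time = 94708.24 ms / 6 tokens".
PERF_RE = re.compile(
    r"(?P<label>[a-z ]+ time)\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)


def per_unit(line: str):
    """Return (label, ms per token, tokens per second) for a perf line.

    Returns None when the line has no "<ms> ms / <count>" part,
    as is the case for the "load time" line.
    """
    match = PERF_RE.search(line)
    if match is None:
        return None
    total_ms = float(match["ms"])
    count = int(match["n"])
    return match["label"].strip(), total_ms / count, 1000.0 * count / total_ms


if __name__ == "__main__":
    line = ("llama_perf_context_print: prompt eval time = 94708.24 ms / 6 tokens "
            "(15784.71 ms per token, 0.06 tokens per second)")
    label, ms_per_token, tokens_per_s = per_unit(line)
    print(f"{label}: {ms_per_token:.2f} ms per token, {tokens_per_s:.2f} tokens per second")
```

For the prompt eval line this gives 94708.24 / 6 ≈ 15784.71 ms per token, which agrees with the value printed in the log.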