Llama-quantize: Partial requant feature #1313
Conversation
6e504b4 to f5b4bae Compare
Precisely what do you mean by "copyported"? "Copyported" from where? Considering that I wrote most of the llama-quantize code in llama.cpp myself.
ikawrakow
left a comment
Please merge and resolve conflicts so I don't need to be reviewing the --dry-run changes along with the actual changes of the PR.
- Inspired by the recently portcopied --dry-run feature.
- Allows partial requantization of a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for GGUFs split into groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.
f5b4bae to 3136e22 Compare
I apologize, I simply meant that I saw an equivalent PR on llama.cpp, and saw a recent conversation in which the terms "copying" and "porting" were contrasted. I merged both terms as I did in the initial thread, forgetting that this applied to what I would do, not to what you would do (rewriting it fully by yourself). Now, you know very well that I don't believe that you NEED to "copyport" somebody else's modifications, considering that you indeed wrote most of the llama-quantize code of llama.cpp, and I didn't imply that precise statement. But what goes without saying goes better by saying it. I apologize again for the misunderstanding born of my inadequate terminology; it bore no malice. The merge is now done, and I'm currently correcting my PR according to your review.
* Partial Requant feature for llama-quantize
  - Inspired by the recently portcopied --dry-run feature.
  - Allows partial requantization of a split quantized .gguf by requantizing only the missing splits in the destination directory.
  - Works both for GGUFs split tensor by tensor and for GGUFs split into groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
  - Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup
* Better estimate for max. number of compute nodes
* Just in case

server: fix crash from adaptive p (ikawrakow#1304)
Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)
* Fix tool call for Qwen3.5
  Loosely based on mainline changes from:
  * ggml-org/llama.cpp#19635
  * ggml-org/llama.cpp#19765
  Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit.
* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)
* WIP
* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)
* Make sure we pick the reduced tensor from the right GPU
* Minor

Minor delta-net tweak (ikawrakow#1308)
* Make sure we pick the reduced tensor from the right GPU
* Minor
* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)
When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0. Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding.
Generated with [Devin](https://cli.devin.ai/docs)
Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)
* Partial Requant feature for llama-quantize
  - Inspired by the recently portcopied --dry-run feature.
  - Allows partial requantization of a split quantized .gguf by requantizing only the missing splits in the destination directory.
  - Works both for GGUFs split tensor by tensor and for GGUFs split into groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
  - Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup

Display the size of the tensors overridden during tensor loading (ikawrakow#1318)
* Display the size of the tensors overridden during tensor loading
  Ex:
  `Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU`
  `Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`
  become
  `Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`
  `Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`
  And move to debug the later display of the size of the unnamed buffer overrides, ex:
  `llm_load_tensors: CPU buffer size = XXX.XX MiB`
  That double display clutters the screen without being very informative.
* Change bytes display to MiB.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
  It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)
* Fix KT quantization yet again
* Add same 1e-16f check for all quants in iqk_quantize.cpp
* Fixes for k-quants
* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)
* server: enable checkpoint for recurrent models
  - create checkpoint after cancel
  - fix ban string and rm context during rewind
  - add checkpoint interval
  - only save recurrent cache
* save checkpoint during pp
Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
  It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Don't re-apply L2 norm - it has already been done
* This seems quite a bit better
* More tweaks

Restore per context buffer size log
Not everybody uses models split into 2000 parts, and those who do actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
Follow-up of ikawrakow#1313, which implemented partial-requant. Preliminary steps:
- Add --force-requant / -frq argument to force regeneration of split files whose tensor ggml_types differ from the specified quantization type
- Add -prq shortened argument for --partial-requant
- Combined with --partial-requant / -prq: skips existing matching splits, deletes and regenerates splits with mismatched tensor types
…ferent from destination) feature

(WIP) Llama-quantize: Enhance force_requant with tensor type comparison

Implementation of the force_requant feature for partial_requant.

Features:
- Read existing split file headers using gguf_init_from_file
- Compare tensor names and ggml_types between existing and expected tensors
- Limited to splits containing only ONE tensor
- Log tensor type mismatches, showing old type -> new type
- Delete mismatched split files before requantization

Combined partial_requant + force_requant behavior:
- When both flags are set: existing splits with matching types are skipped
- Splits with different tensor types are deleted and requantized
- Missing splits are created as before

Error handling:
- Tensor name mismatch triggers a warning and exits
- File deletion errors are logged but don't abort the process

Logging improvements:
- Removed trailing comma from tensor logging
- Added type display for skipped tensors
- Displays old vs new ggml_type when requantizing

Approach:
- Universal logic (not whitelist/blacklist)
- Respects the priority order of quantization parameters
- Compares against ctx_outs, which has the expected types computed

TODO: Test with x64-Release-MMQ-TEST build
TODO: Verify with llama-sweep-bench test command

Llama-quantize: Fix force_requant to compare against expected type

Key fix: Moved the force_requant comparison to AFTER the quantization type is computed.

Changes:
1. Removed 'operator ():' clutter from log messages (removed __func__)
2. Added corrupted-file handling (detect and delete files with invalid magic)
3. Fixed force_requant logic:
   - OLD: Compared existing file vs ctx_outs (source model type) - WRONG
   - NEW: Compare existing file vs new_type (computed destination type) - CORRECT

The new_type is computed by llama_tensor_get_type(), which respects:
- LLAMA_FTYPE (IQ5_KS in your example)
- --custom-q rules (highest priority)
- --ffn-gate-inp-type, --token-embedding-type, etc.

Expected behavior:
- If an existing split has iq6_k and the expected type is iq6_k → skip
- If an existing split has iq4_ks and the expected type is iq6_k → delete & requantize
- Corrupted files (invalid magic) → automatically deleted

Limited to splits with ONE tensor (as per requirement F)

Llama-quantize: Fix scope issue for force_requant variables

Fixed compile errors C2065: undeclared identifiers.

Problem: fname and file_exists were declared inside the new_ofstream lambda, but the force_requant check is in the main loop, outside the lambda's scope.

Solution: Move the fname and file_exists tracking to the outer scope:
- Declare current_fname and current_file_exists before the lambda
- Update them inside the lambda (current_fname = fname, current_file_exists = true)
- Use current_fname and current_file_exists in the main loop's force_requant check

This ensures the force_requant logic has access to the correct file path and existence status when comparing tensor types after quantization.

Llama-quantize: Move variable declarations before lambda definition

Fixed the remaining C2065 compile errors. Moved the current_fname and current_file_exists declarations BEFORE the new_ofstream lambda definition, so they can be referenced inside it. Also removed the obsolete file_exists variable reference.
Llama-quantize: Implement two-phase force_requant with pre-sweep

NEW TWO-PHASE APPROACH:

Phase 1: Pre-sweep (before quantization)
- Scan ALL existing split files
- Compare tensor names and ggml_types
- Delete mismatched/corrupted files upfront
- No file-locking issues during quantization

Phase 2: Quantization
- Only quantize missing/deleted splits
- Skip matching splits with detailed logging

IMPROVED LOGGING:
- Shows actual vs expected ggml_type when skipping
  Example: 'split 00001 exists, tensor output.weight is iq6_k, expected iq6_k, skipping'

BENEFITS:
- Solves the Windows file-locking problem
- Cleaner separation of concerns
- Better user visibility into what will be processed
- Limited to 1-tensor splits (requirement F)

Fix: Compare destination vs expected type, skip BEFORE quantizing

CRITICAL FIX: Compare the destination split's tensor type, not the source's.

OLD (WRONG):
- Compared the source tensor type (f16)
- Showed the skip AFTER quantization

NEW (CORRECT):
- Read the destination split's tensor type
- Compare dest vs expected (from the command line)
- Skip BEFORE quantizing if they match
- Only quantize if the types differ

EXPECTED LOG:
split 00001: dest=iq6_k, expected=iq6_k → skip
split 00002: dest=q8_0, expected=iq6_k → requantize
split 00003: missing → quantize
split 00004: corrupted → delete & requantize

Fix: Add missing variable declarations for force_requant

Added declarations:
- split_needs_requant[]
- split_existing_type[]
- split_expected_type[]
- current_fname
- current_file_exists

Fix: Compare dest vs expected AFTER quantization, skip before write

CRITICAL FIX:
- Moved the destination file check to AFTER new_type is computed
- The comparison happens at QuantizationDone:
  - If dest_type == expected_type → skip writing, just continue
  - If dest_type != expected_type → delete dest, quantize and write
- This ensures we compare only AFTER knowing the expected type

EXPECTED LOG:
[1/733] output.weight [...] -> iq6_k, skip (dest already has iq6_k)
[2/733] token_embd.weight [...] -> q6_0, dest has q8_0, deleting for requant, converting to q6_0 .. size = ...

Restore Phase 1 pre-sweep to get destination tensor types

Phase 1 now scans ALL destination splits to record their tensor types before quantization.

LOG OUTPUT EXAMPLE:
=== Phase 1: Pre-sweep destination splits ===
split 0: tensor 'output.weight' is 'iq6_k'
split 1: tensor 'token_embd.weight' is 'q6_0'
split 2: missing (will be quantized)
split 3: corrupted (will be deleted)
deleted corrupted split 3
=== Phase 1 complete ===

Fix: Remove redundant file reading, use Phase 1 data

Phase 1 stores the tensor types from all destination splits. During quantization, the stored types are used instead of reading the files again. This fixes 'invalid magic' errors caused by the redundant file reads.

Optimize: Skip BEFORE quantizing to avoid wasteful computation

CRITICAL OPTIMIZATION: Moved the skip check to AFTER new_type is determined but BEFORE the quantization code runs.

OLD FLOW (wasteful):
1. Determine the expected type
2. Quantize the tensor (expensive)
3. Check if dest matches
4. If match, skip writing (work wasted)

NEW FLOW (efficient):
1. Determine the expected type
2. Check if dest matches (from Phase 1 data)
3. If match, skip entirely (no quantization)
4. If different, quantize and write

This avoids expensive quantization when it is not needed.

Clean code: Remove unused variables and simplify Phase 1

Optimizations:
- Removed the unused 'split_needs_requant' vector
- Removed the unused 'split_expected_type' vector (it was set but never read)
- Combined corrupted-file detection and deletion in Phase 1
- Simplified Phase 1 logging
- Removed redundant checks at QuantizationDone (the early skip replaces them)

Fix: Close file when skipping to prevent 1KB files

BUG: Skipped splits were rewritten as 1KB files.

ROOT CAUSE:
- new_ofstream() opens the file and writes a metadata placeholder
- The early skip sets split_skipped but doesn't close the file
- The file remains open with placeholder metadata
- When eventually closed, it contains incomplete metadata

FIX:
- Close the file with fout.close() before continue
- This prevents writing incomplete files

CRITICAL FIX: Restore tensor data copy for skipped splits

PROBLEM: The original code that copies tensor data when skipping had been removed.

ORIGINAL CODE (correct):
- When a split exists and matches, copy the tensor data from source to dest
- Uses gguf_set_tensor_data to preserve the existing quantization

MY CODE (wrong):
- Just closed the file and continued
- The destination file was empty (1KB of metadata only)

FIX: Restore the gguf_set_tensor_data call for skipped tensors.

Also fixed:
- Added goto QuantizationDone to ensure proper file closing
- Changed the log message to 'copy' instead of 'skip'

Fix: Write skipped tensors immediately, use continue instead of goto

Problem: goto QuantizationDone caused undefined behavior.

Solution: Write skipped tensors immediately and continue:
- Set split_skipped = true
- Call gguf_set_tensor_data to store the data in ctx_outs
- Write to the file immediately
- Use continue to go to the next tensor

This ensures correct write behavior for skipped tensors.
..if specified tensor quant is different from destination!

Follow-up of ikawrakow#1313, which implemented partial-requant. Feature to force requantization of split files when tensor ggml_types differ from the specified quantization type.

FILES CHANGED:
- examples/quantize/quantize.cpp: Added --force-requant / -frq argument, -prq for --partial-requant
- include/llama.h: Added a bool force_requant field to llama_model_quantize_params
- src/llama.cpp: Initialized force_requant to false in the default params
- src/llama-quantize.cpp: Implemented the two-phase force_requant logic

IMPLEMENTATION:

Phase 1 (Pre-sweep):
- Scans all destination split files using gguf_init_from_file
- Records tensor names and ggml_types from each split
- Detects corrupted files (invalid magic) and marks them for deletion
- Limited to splits containing ONE tensor

Phase 2 (Quantization):
- Compares the destination tensor type vs the expected type (from the quantization rules)
- If the types match: copy the tensor data from source to destination (preserves the existing quantization)
- If the types differ: delete the destination, quantize from source, write to destination
- Logs each decision with tensor name and type information

USAGE:
- --partial-requant / -prq: Quantize only missing splits
- --force-requant / -frq: Force requantization of splits with different tensor types
- Combined: Skips matching splits, requantizes mismatched splits

ERROR HANDLING:
- Tensor name mismatch: Warning + abort
- Corrupted files: Auto-delete
- File deletion failure: Log warning (continues anyway)

LOGGING:
- Shows split tensor types during Phase 1
- Logs skip/copy/requant decisions during Phase 2
- Removed __func__ clutter from logs

PRIORITY ORDER: Respects the standard quantization priority:
1. --custom-q rules (highest)
2. --output-tensor-type, --token-embedding-type
3. --ffn-gate-inp-type
4. --attn-q/k/v/output-type, --ffn-*-type
5. LLAMA_FTYPE defaults

LIMITATION: Only works with splits containing exactly ONE tensor

FOLLOW-UP: This is the foundation for further force_requant enhancements

Add size comparison to force_requant

Phase 1 now collects 4 parameters per split:
1. Split number
2. Tensor name
3. ggml_type
4. Size in bytes (calculated from dimensions and type)

Phase 2 compares ALL 4 parameters:
- Type mismatch: Requantize
- Size mismatch: Requantize
- Both match: Copy tensor data

This ensures tensors are requantized when:
- The type differs
- The size differs (even if the type is the same)
- The file is missing
- The file is corrupted

Log messages now show type AND size for better visibility.

BUG FIX: Calculate the tensor size from dimensions and type:
- Access ctx_dest->infos[tensor_idx].n_dims
- Calculate nelements from the ne[] dimensions
- Use ggml_row_size(type, nelements) for the total size
Inspired by the recently added --dry-run option for llama-quantize.
This PR allows partial requantization of a split quantized .gguf by requantizing only the missing splits in the destination directory. (Useful for anyone in the habit of changing certain tensors' quantization to improve an overall quant strategy: just delete the splits you want to requantize in the destination directory.)
It works both for GGUFs split tensor by tensor and for GGUFs split into groups of several tensors (though the latter is not much tested except with 2 tensors per split: I myself use directories of single-tensor GGUFs since @Thireus made his GGUF-Tool-Suite).
It also adds automatic directory creation for both llama-quantize and gguf-split in case the destination directory of the quantization/split doesn't exist (a long-missing feature ^^).