feat: Smart model management #6390
Conversation
Important
Looks good to me! 👍
Reviewed everything up to d8d147f in 1 minute and 8 seconds.
- Reviewed 357 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 6 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves, and don't forget to react with 👍 or 👎 to teach Ellipsis.
1. extensions/llamacpp-extension/settings.json:36
- Draft comment:
Good addition of the 'memory_util' dropdown. It cleanly adds the option for smart memory utilization.
- Reason this comment was not posted: confidence that changes are required (0%) did not meet the 50% threshold.
2. extensions/llamacpp-extension/src/index.ts:1869
- Draft comment:
getTotalSystemMemory properly sums VRAM and RAM (after converting to bytes). Ensure units stay consistent.
- Reason this comment was not posted: confidence that changes are required (0%) did not meet the 50% threshold.
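The unit-consistency point above can be illustrated with a minimal sketch. The interface shapes and the MB-reported GPU field are assumptions for illustration; the real `SystemMemory` interface and hardware query live in `index.ts`:

```typescript
// Hypothetical shapes for illustration only; the real extension queries
// hardware info through its own APIs.
interface GpuInfo {
  total_memory: number // assumed to be reported in MB
}

interface SystemMemory {
  totalVRAM: number // bytes
  totalRAM: number // bytes
  total: number // bytes, VRAM + RAM combined
}

const MB = 1024 * 1024

// Sum VRAM across all GPUs and convert to bytes *before* adding RAM,
// so the two figures are never mixed across units.
function getTotalSystemMemory(gpus: GpuInfo[], ramBytes: number): SystemMemory {
  const totalVRAM = gpus.reduce((sum, gpu) => sum + gpu.total_memory * MB, 0)
  return { totalVRAM, totalRAM: ramBytes, total: totalVRAM + ramBytes }
}
```

Converting at the boundary (here, inside the function) keeps every downstream comparison in bytes.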
3. extensions/llamacpp-extension/src/index.ts:1897
- Draft comment:
getKVCachePerToken validates metadata and computes KV cache per token; consider more detailed error messages when required metadata fields are missing.
- Reason this comment was not posted: confidence that changes are required (20%) did not meet the 50% threshold.
4. extensions/llamacpp-extension/src/index.ts:1935
- Draft comment:
planModelLoad logic is robust, calculating GPU layers and context length with fallbacks. Consider documenting the chosen heuristics (70% VRAM and memory percentages) for clarity.
- Reason this comment was not posted: confidence that changes are required (20%) did not meet the 50% threshold.
5. extensions/llamacpp-extension/src/index.ts:2089
- Draft comment:
isModelSupported returns GREEN/YELLOW/RED based on memory thresholds. Double-check that the condition (modelSize <= usableVRAM || totalRequired <= usableTotalMemory) meets the intended semantics.
- Reason this comment was not posted: confidence that changes are required (30%) did not meet the 50% threshold.
6. extensions/llamacpp-extension/src/index.ts:852
- Draft comment:
onSettingUpdate properly updates memoryMode when 'memory_util' changes. Ensure the UI description and backend percentages (high: 0.7, medium: 0.5, low: 0.4) are aligned with user expectations and documentation.
- Reason this comment was not posted: confidence that changes are required (20%) did not meet the 50% threshold.
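The percentages referenced in this draft comment can be sketched as a simple lookup. This is a sketch under the assumption that the fractions are applied as a flat multiplier on total memory; the real handling lives in the extension's `memoryMode` logic:

```typescript
type MemoryMode = 'high' | 'medium' | 'low'

// Fractions quoted in the review (high 0.7, medium 0.5, low 0.4).
// Assumption: applied as a flat multiplier on the memory total.
const MEMORY_BUDGET: Record<MemoryMode, number> = {
  high: 0.7,
  medium: 0.5,
  low: 0.4,
}

// Usable bytes under the selected utilization mode, rounded down
// so the planner never budgets more than is actually available.
function usableMemory(totalBytes: number, mode: MemoryMode): number {
  return Math.floor(totalBytes * MEMORY_BUDGET[mode])
}
```

Centralizing the fractions in one table is also what makes the "keep UI description and backend aligned" concern easy to satisfy: there is a single place to update.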
Workflow ID: wflow_vYlnOecj6cFDfo5J
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Caution
Changes requested ❌
Reviewed bbd272f in 2 minutes and 57 seconds.
- Reviewed 357 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments. View those below.
1. extensions/llamacpp-extension/src/index.ts:2151
- Draft comment:
Add an inline comment to clarify the YELLOW condition rationale, i.e. when model weights fit in VRAM but the overall KV cache doesn't, or when only total memory suffices.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The existing comment above the if statement is clear and comprehensive. Adding an inline comment would be redundant and could make the code more cluttered. The code is already well-documented with the multi-line comment explaining both conditions. The suggestion would just repeat what's already stated. Could there be value in having the explanation closer to the actual condition for easier reading? Some developers prefer inline comments for immediate context. While inline comments can be valuable, in this case the existing comment is only 1-2 lines above the condition and clearly explains both parts of the OR condition. Adding an inline comment would create unnecessary duplication. The comment should be deleted as it suggests adding redundant documentation. The code is already well-documented with a clear comment explaining the condition's purpose.
2. extensions/llamacpp-extension/src/index.ts:1884
- Draft comment:
Review the aggregation of VRAM across all GPUs in getTotalSystemMemory; ensure that summing all GPU memory is appropriate, as GPU VRAM may not be poolable in all scenarios.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). While the comment raises an interesting technical point about GPU memory pooling, the code's purpose appears to be getting a rough estimate of available memory for planning purposes, not precise allocation. The function is used by planModelLoad() and isModelSupported() which make conservative estimates with safety margins. The code already uses conservative percentages (70-80%) of the total memory. The comment identifies a real technical limitation that could theoretically cause issues in some multi-GPU scenarios. Not all GPU architectures support memory pooling. The code is making conservative estimates with safety margins, and the actual GPU memory allocation happens elsewhere. This is just for high-level planning. The comment should be deleted. While technically accurate, the current implementation is intentionally using rough estimates with conservative margins, making the concern less relevant for this planning-focused code.
3. extensions/llamacpp-extension/settings.json:41
- Draft comment:
Typo: The title 'Smart Memory utilization' might be intended to be 'Smart Memory Utilization' for consistency in title case. Please confirm the desired capitalization.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 0% vs. threshold = 50%). The comment is about pure styling/capitalization. There's no actual functional issue being raised. The existing titles in the file use mixed capitalization styles, so there isn't even a clear style violation. This seems like a very minor nitpick that doesn't warrant a comment. Maybe consistent title case would improve readability and professionalism of the UI? The comment does provide a concrete suggestion. While consistency could be nice, this is ultimately a UI styling choice. Per the rules, we should not comment on pure UI changes or minor style issues. This comment should be deleted as it only addresses UI capitalization without any functional impact, and the rules explicitly say not to comment on UI styling changes.
4. extensions/llamacpp-extension/src/index.ts:2129
- Draft comment:
Typographical consistency: The comment uses "kvcache" as one word, whereas elsewhere (e.g., in the Returns section) it's written as "KV cache". Consider updating this comment for consistency.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 0% vs. threshold = 50%). This is a purely stylistic comment about comment formatting. It doesn't affect functionality or code quality. The comment is pointing out an inconsistency in documentation/comments rather than suggesting any meaningful code changes. According to the rules, we should not make purely informative comments or comments that are obvious/unimportant. Could inconsistent terminology in comments make the code harder to understand for future maintainers? Maybe there's value in keeping terminology consistent. While consistent terminology is good, this is too minor of an issue to warrant a PR comment. It's just about capitalizing letters in a comment, not about any actual code changes or meaningful improvements. Delete this comment as it is purely about comment formatting and doesn't suggest any meaningful code improvements.
Workflow ID: wflow_xaBKtJWMa5Ryso3G
Important
Looks good to me! 👍
Reviewed 076b5fc in 2 minutes and 13 seconds.
- Reviewed 89 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
1. extensions/llamacpp-extension/settings.json:193
- Draft comment:
Removal of the 'no_kv_offload' UI option looks intentional since the new smart memory management (via 'memory_util') is now used. Ensure the backend logic fully relies on the new mode.
- Reason this comment was not posted: confidence that changes are required (0%) did not meet the 50% threshold.
2. web-app/src/hooks/useModelProvider.ts:270
- Draft comment:
In the migration code for 'no_kv_offload', using a falsy check (!model.settings.no_kv_offload) may overwrite a valid boolean value (e.g. an explicit false from a previous config). Consider checking explicitly for undefined (model.settings.no_kv_offload === undefined) to avoid unwanted overwrites.
- Reason this comment was not posted: the comment was not on a location in the diff, so it can't be submitted as a review comment.
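The distinction this draft comment draws matters because `!value` treats an explicit `false` the same as a missing key. A minimal sketch (the settings shape and helper names are hypothetical, for illustration only):

```typescript
// Hypothetical settings shape for illustration.
interface ModelSettings {
  no_kv_offload?: boolean
}

// Falsy check: clobbers an explicit `false` set by the user,
// because !false === true triggers the migration fallback.
function migrateFalsy(settings: ModelSettings, fallback: boolean): boolean {
  return !settings.no_kv_offload ? fallback : settings.no_kv_offload
}

// Explicit check: only applies the fallback when the key is truly absent.
function migrateExplicit(settings: ModelSettings, fallback: boolean): boolean {
  return settings.no_kv_offload === undefined ? fallback : settings.no_kv_offload
}
```

With `{ no_kv_offload: false }` and a fallback of `true`, the falsy variant silently flips the user's choice, while the explicit variant preserves it.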
Workflow ID: wflow_IDj11V5bRfWqLRpw
Important
Looks good to me! 👍
Reviewed e1847b4 in 2 minutes and 54 seconds.
- Reviewed 490 lines of code in 5 files
- Skipped 0 files when reviewing
- Skipped posting 5 draft comments. View those below.
1. extensions/llamacpp-extension/src/index.ts:2002
- Draft comment:
Consider adding a clarifying comment on the fallback logic (safeTokensPerLayer) in planModelLoad to explain the assumptions behind the formula.
- Reason this comment was not posted: the comment was not on a location in the diff, so it can't be submitted as a review comment.
2. web-app/src/containers/ModelSetting.tsx:42
- Draft comment:
Ensure the case where planModelLoad returns an 'Unsupported' mode is handled. Consider alerting the user if the model cannot be loaded due to memory constraints.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). The comment starts with "Ensure that..." which is a red flag per the rules. It's speculative about what errors might occur without evidence that these specific error cases exist. While better error handling could be good, the comment isn't specific or actionable enough. The code already has basic error handling with try/catch. The comment does point out a potential user experience issue - users should know if optimization fails. Maybe there's value in showing error states to users rather than just logging to console. While better error handling could be valuable, this comment is too speculative and starts with "Ensure that..." which violates our rules. It's asking for verification rather than pointing out a clear issue. Delete this comment. It's speculative, asks for verification, and doesn't point out a clear issue that needs fixing.
3. web-app/src/services/models/default.ts:510
- Draft comment:
Good dynamic import usage for obtaining the absolute model path. Consider logging additional details when falling back, to aid in debugging if planModelLoad isn't available.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). While the comment is technically correct about the dynamic imports, it's making two suggestions: 1) Praising existing code (which isn't actionable) and 2) Suggesting more logging in the fallback case. The fallback logging suggestion is minor and not clearly necessary - the existing console.warn already provides the key information needed. The comment could be identifying a real debugging need - what if something goes wrong in production and we need more details about why planModelLoad isn't available? The existing console.warn already captures the key information - that planModelLoad isn't available. Additional details would likely not help debug this specific issue since it's a simple feature availability check. The comment should be deleted as it combines praise for existing code with a minor logging suggestion that isn't clearly necessary given the existing error handling.
4. web-app/src/services/models/default.ts:292
- Draft comment:
The key mapping (e.g. mapping 'ctx_len' to 'ctx_size' and 'ngl' to 'n_gpu_layers') ensures consistency between provider settings and engine config. This is good; just ensure such mappings remain in sync with any future config changes.
- Reason this comment was not posted: confidence that changes are required (50%) did not exceed the 50% threshold.
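The kind of mapping this comment refers to can be sketched like this. The two keys shown are the ones quoted in the review; treating unmapped keys as pass-through is an assumption, and the real table in `default.ts` may cover more keys:

```typescript
// Provider-setting key -> llama.cpp engine-config key (illustrative subset).
const KEY_MAP: Record<string, string> = {
  ctx_len: 'ctx_size',
  ngl: 'n_gpu_layers',
}

// Rename mapped keys; pass unmapped keys through unchanged (assumption).
function toEngineConfig(settings: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {}
  for (const [key, value] of Object.entries(settings)) {
    out[KEY_MAP[key] ?? key] = value
  }
  return out
}
```

Keeping the rename table as data (rather than scattered if/else renames) is what makes "stay in sync with future config changes" a one-line edit.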
5. web-app/src/services/models/default.ts:448
- Draft comment:
Returning 'GREY' on error in isModelSupported clearly indicates an error state. Ensure the frontend handles the 'GREY' state gracefully in its UI logic.
- Reason this comment was not posted: the comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_fauVA3Hx2AE6tb9U
Caution
Changes requested ❌
Reviewed f606de8 in 3 minutes and 19 seconds.
- Reviewed 416 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
1. extensions/llamacpp-extension/src/index.ts:75
- Draft comment:
New optional property 'noOffloadMmproj' is added to ModelPlan. Please add a brief JSDoc comment to explain its purpose.
- Reason this comment was not posted: confidence that changes are required (33%) did not meet the 50% threshold.
2. extensions/llamacpp-extension/src/index.ts:2065
- Draft comment:
In the loop over contextSizes, bestConfig is overwritten with each valid candidate. Consider clarifying or refining the selection criteria to ensure the best balance between layers and context is chosen.
- Reason this comment was not posted: the comment looked like it was already resolved.
3. extensions/llamacpp-extension/src/index.ts:1917
- Draft comment:
There is an inconsistency in KV cache calculation: getKVCachePerToken uses nHeadKV whereas estimateKVCache uses nHead. Ensure both methods use the same logic for models (e.g. GQA) to avoid mis-estimation.
- Reason this comment was not posted: decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 30% vs. threshold = 50%). While the comment identifies a real inconsistency in the code, I notice that estimateKVCache() is an unused private method. The active method getKVCachePerToken() has the correct implementation using nHeadKV. Since estimateKVCache() isn't used, this inconsistency doesn't actually affect anything. The comment is technically correct but not practically relevant. I could be wrong about estimateKVCache() being completely unused - there could be test files or other parts of the codebase that use it that I can't see. Even if estimateKVCache() is used somewhere, the comment would be better as a PR suggestion to fix the unused/incorrect method rather than just pointing out the inconsistency. The comment identifies a real inconsistency but focuses on pointing it out rather than fixing it. Since one method appears unused, we should either fix or remove it rather than just noting the difference.
Workflow ID: wflow_YLI7ZICnvJHHzelT
Important
Looks good to me! 👍
Reviewed 41f83de in 1 minute and 4 seconds.
- Reviewed 12 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 1 draft comment. View it below.
1. extensions/llamacpp-extension/src/index.ts:1904
- Draft comment:
Usage of 'usableRAM' remains in the logger despite its removal; this variable is now undefined. Please update the logger message, or compute and define the free/usable RAM before logging.
- Reason this comment was not posted: the comment was not on a location in the diff, so it can't be submitted as a review comment.
Workflow ID: wflow_NeMnDbOvhCTNTwXq
* **New UI option** – `memory_util` added to `settings.json` with a dropdown (high / medium / low) to let users control how aggressively the engine uses system memory.
* **Configuration updates** – `LlamacppConfig` now includes `memory_util`; the extension class stores it in a new `memoryMode` property and handles updates through `updateConfig`.
* **System memory handling**
  * Introduced `SystemMemory` interface and `getTotalSystemMemory()` to report combined VRAM + RAM.
  * Added helper methods `getKVCachePerToken`, `getLayerSize`, and a new `ModelPlan` type.
* **Smart model-load planner** – `planModelLoad()` computes:
  * Number of GPU layers that can fit in usable VRAM.
  * Maximum context length based on KV-cache size and the selected memory utilization mode (high/medium/low).
  * Whether KV-cache must be off-loaded to CPU, and the overall loading mode (GPU, Hybrid, CPU, Unsupported).
  * Detailed logging of the planning decision.
* **Improved support check** – `isModelSupported()` now:
  * Uses the combined VRAM/RAM totals from `getTotalSystemMemory()`.
  * Applies an 80% usable-memory heuristic.
  * Returns **GREEN** only when both weights and KV-cache fit in VRAM, **YELLOW** when they fit only in total memory or require CPU off-load, and **RED** when the model cannot fit at all.
* **Cleanup** – Removed unused `GgufMetadata` import; updated imports and type definitions accordingly.
* **Documentation/comments** – Added explanatory JSDoc comments for the new methods and clarified the return semantics of `isModelSupported`.
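The GREEN/YELLOW/RED semantics described above can be sketched roughly as follows. This is a simplified model, assuming the 80% usable-memory heuristic applies uniformly to both VRAM and total memory; the real `isModelSupported` in `index.ts` also returns GREY on error and reads actual hardware state:

```typescript
type SupportStatus = 'GREEN' | 'YELLOW' | 'RED'

const USABLE_FRACTION = 0.8 // 80% usable-memory heuristic from the PR

function isModelSupported(
  modelSize: number,   // model weights, bytes
  kvCacheSize: number, // KV cache at the planned context length, bytes
  totalVRAM: number,   // bytes
  totalMemory: number  // VRAM + RAM, bytes
): SupportStatus {
  const usableVRAM = totalVRAM * USABLE_FRACTION
  const usableTotal = totalMemory * USABLE_FRACTION
  const totalRequired = modelSize + kvCacheSize

  // Everything (weights + KV cache) fits in VRAM.
  if (totalRequired <= usableVRAM) return 'GREEN'
  // Weights fit in VRAM but KV cache spills, or it all fits only in
  // combined memory: loadable, with CPU off-load.
  if (modelSize <= usableVRAM || totalRequired <= usableTotal) return 'YELLOW'
  // Cannot fit even in combined memory.
  return 'RED'
}
```

Note the YELLOW branch encodes exactly the OR-condition the draft reviewer asked to double-check: either weights alone fit in VRAM, or the whole footprint fits in total memory.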
…emory budgeting
* Extend `ModelPlan` with optional `noOffloadMmproj` flag to indicate when a multimodal projector can stay in VRAM.
* Add `mmprojPath` parameter to `planModelLoad` and calculate its size, attempting to keep it on GPU when possible.
* Refactor system memory detection:
  * Use `used_memory` (actual free RAM) instead of total RAM for budgeting.
  * Introduced `usableRAM` placeholder for future use.
* Rewrite KV-cache size calculation:
  * Properly handle GQA models via `attention.head_count_kv`.
  * Compute bytes per token as `nHeadKV * headDim * 2 * 2 * nLayer`.
* Replace the old 70% VRAM heuristic with a more flexible budget:
  * Reserve a fixed VRAM amount and apply an overhead factor.
  * Derive usable system RAM from total memory minus VRAM.
* Implement a robust allocation algorithm:
  * Prioritize placing the mmproj in VRAM.
  * Search for the best balance of GPU layers and context length.
  * Fallback strategies for hybrid and pure-CPU modes with detailed safety checks.
* Add extensive validation of model size, KV-cache size, layer size, and memory mode.
* Improve logging throughout the planning process for easier debugging.
* Adjust final plan return shape to include the new `noOffloadMmproj` field.
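The per-token KV-cache formula listed above (`nHeadKV * headDim * 2 * 2 * nLayer`: one factor of 2 for the K and V tensors, one for 2 bytes per f16 value) can be sketched with a worked example. The concrete head counts and layer count below are illustrative, not taken from any specific model file:

```typescript
// Bytes of KV cache needed per token, assuming f16 (2-byte) K and V tensors.
// For GQA models, nHeadKV (GGUF key attention.head_count_kv) is smaller than
// the query head count, which shrinks the cache proportionally.
function kvCachePerToken(nHeadKV: number, headDim: number, nLayer: number): number {
  return nHeadKV * headDim * 2 /* K and V */ * 2 /* bytes per f16 */ * nLayer
}

// Illustrative GQA config: 8 KV heads, head dim 128, 32 layers.
const perToken = kvCachePerToken(8, 128, 32) // 131072 bytes = 128 KiB per token

// Total KV cache for a given context length, e.g. 8192 tokens -> 1 GiB.
const kvFor8k = perToken * 8192
```

This is why GQA matters for the planner: with the full query head count (say 32 instead of 8 KV heads), the same 8k context would need 4 GiB instead of 1 GiB.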
Describe Your Changes
* `memory_util` added to `settings.json` with a dropdown (high / medium / low) to let users control how aggressively the engine uses system memory.
* `LlamacppConfig` now includes `memory_util`; the extension class stores it in a new `memoryMode` property and handles updates through `updateConfig`.
* `SystemMemory` interface and `getTotalSystemMemory()` to report combined VRAM + RAM.
* Helper methods `getKVCachePerToken`, `getLayerSize`, and a new `ModelPlan` type.
* `planModelLoad()` computes the load plan (GPU layers, context length, loading mode).
* `isModelSupported()` now uses the combined totals from `getTotalSystemMemory()`.
* Removed unused `GgufMetadata` import; updated imports and type definitions accordingly.
* Clarified the return semantics of `isModelSupported`.

Fixes Issues
Self Checklist
FE UPDATES REQUIRED
When the plan is `Unsupported` for a model, prevent loading and show a helpful error (part of "feat: Guadrail to not allow user to run model bigger than their memory allow" #6204) --> After discussing, we already have an error from llamacpp and an indicator on the chat input.
Important
Introduces smart model management with memory utilization control and optimized model loading for the `llamacpp` provider.
* `memory_util` option in `settings.json` for memory utilization control (high/medium/low).
* `planModelLoad()` in `index.ts` to optimize model loading based on system memory and VRAM.
* `isModelSupported()` in `index.ts` now uses combined VRAM/RAM and returns status based on memory fit.
* `LlamacppConfig` in `index.ts` updated to include `memory_util`.
* `SystemMemory` interface and helper methods for memory calculations in `index.ts`.
* `ModelSetting.tsx` updated for the `llamacpp` provider.
* `useModelProvider.ts` updated to handle new settings and migrations.
* Removed unused `GgufMetadata` import in `index.ts`.
* `index.ts`.

This description was created by
for 41f83de. You can customize this summary. It will automatically update as commits are pushed.