feat: Add support for overriding tensor buffer type #6062
Merged
Conversation
Important
Looks good to me! 👍
Reviewed everything up to 9745451 in 1 minute and 22 seconds.
- Reviewed 38 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 4 draft comments; view those below
1. extensions/llamacpp-extension/src/index.ts:42
- Draft comment: Consider providing a default value (e.g. an empty string) for `override_tensor_buffer_t` to avoid unintended falsy checks.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%
2. extensions/llamacpp-extension/src/index.ts:1269
- Draft comment: Validate that `override_tensor_buffer_t` is non-empty and, if applicable, a valid regex before pushing the `--override-tensors` argument.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%
3. extensions/llamacpp-extension/src/index.ts:1340
- Draft comment: Updated error logging now refers to 'model' instead of 'llama-server'. Ensure consistency with frontend error handling, if any.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
4. extensions/llamacpp-extension/src/index.ts:1269
- Draft comment: Typographical suggestion: Consider revising the comment wording on this line. Instead of "This is an expert level settings and should only be used by people who knows what they are doing.", you might change it to "This is an expert-level setting and should only be used by people who know what they are doing."
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). While the comment is about a changed line and points out real typos, typos in comments are generally not important enough to warrant a PR comment: the meaning is still clear, the issue doesn't affect functionality, and this kind of feedback is better handled through general code review or style guidelines.
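Draft comments 1 and 2 above suggest defaulting and validating the regex before pushing the `--override-tensors` argument. A minimal sketch of that guard, assuming a hypothetical helper name (the PR's actual code in `performLoad()` may differ):

```typescript
// Sketch of the suggested validation; the helper name is an
// assumption, not the extension's real API.
function overrideTensorArgs(value: string | undefined): string[] {
  const pattern = value?.trim()
  // Empty or missing value: emit no flag at all, avoiding the
  // unintended-falsy-check concern from draft comment 1.
  if (!pattern) return []
  // Reject syntactically invalid regexes early (draft comment 2)
  // rather than passing garbage through to llama-server.
  try {
    new RegExp(pattern)
  } catch {
    throw new Error(`Invalid override_tensor_buffer_t regex: ${pattern}`)
  }
  return ['--override-tensors', pattern]
}
```

With this shape, callers can unconditionally spread the result into the argument list, since an unset option contributes nothing.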
Barecheck code coverage report — Total: 33.19%; coverage diff: 0.00% ▴. No uncovered files and lines.
Minh141120 approved these changes on Aug 7, 2025.
Describe Your Changes
This commit introduces a new configuration option, `override_tensor_buffer_t`, which allows users to specify a regex for matching tensor names whose buffer type should be overridden. This is an advanced setting primarily useful for optimizing the performance of large models, particularly Mixture of Experts (MoE) models. By overriding the tensor buffer type, users can keep critical parts of the model, such as the attention layers, on the GPU while offloading other parts, such as the expert feed-forward networks, to the CPU. This can lead to significant speed improvements for massive models.

Additionally, this change refines the error message shown when a model fails to load: the previous message "Failed to load llama-server" has been updated to the more accurate "Failed to load model".
Fixes Issues
Self Checklist
Important

Adds `override_tensor_buffer_t` option to `index.ts` for tensor buffer type override and refines the model loading error message.
- Adds `override_tensor_buffer_t` option in `LlamacppConfig` in `index.ts` to specify a regex for tensor names whose buffer type should be overridden.
- Applies the option in `performLoad()` in `index.ts`.

This description was created by Ellipsis for 9745451.