Conversation
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
ikawrakow
approved these changes
Feb 10, 2025
Owner
ikawrakow
left a comment
LGTM, but it does nothing on the single-socket computers I currently have available, so I'm relying on the comments in the linked PR and issue that this really improves things on NUMA systems.
Collaborator
Author
The first commit should work on any system to speed up MoE loading (DeepSeek is the most noticeable case because of its large size and expert count, but it should help all MoE models). Only the second commit is specifically designed to benefit NUMA systems.
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request on Mar 3, 2025
This was referenced Jul 23, 2025
First commit is a port of: ggml-org/llama.cpp#11571
The second commit is based on what fairydreaming has reported here ggml-org/llama.cpp#11733, and it also unifies warmup to always use one token.
This allows warmup to actually warm up an MoE model, since all experts are exercised.
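The routing idea behind the warmup change can be illustrated with a minimal sketch (hypothetical helper names, not the actual llama.cpp code): during normal inference a token is routed to only the top-k experts, so a plain warmup token would touch just a few expert tensors; a warmup pass that selects every expert forces all expert weights to be read.

```python
def select_experts(scores, top_k, warmup=False):
    """Return the expert indices one token is routed to.

    scores: per-expert router scores for the token.
    top_k:  number of experts used during normal inference.
    warmup: when True, route to all experts so every expert's
            weights are touched (paged in) during warmup.
    """
    if warmup:
        return list(range(len(scores)))  # exercise every expert
    # Normal path: keep only the top_k highest-scoring experts.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


scores = [0.1, 0.9, 0.3, 0.7]  # router output for a single warmup token
print(select_experts(scores, top_k=2))               # normal: [1, 3]
print(select_experts(scores, top_k=2, warmup=True))  # warmup: [0, 1, 2, 3]
```

With top-k routing only experts 1 and 3 would be loaded; the warmup path reads all four, which is why a single warmup token is enough once every expert is selected.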