Replies: 4 comments 1 reply
-
The way you’re describing it (swap defaults not behaving consistently when mixing embedding models with larger chat models) is a classic failure mode we’ve seen before. In our notes it maps to ProblemMap No. 3 (indexing collapse when multiple models contend for resource state) and sometimes overlaps with No. 7 (semantic drift between embedding and chat session state). In practice, setting swap=false globally rarely solves it, because the allocator still evaluates load heuristics differently while embeddings are running. The more reliable fix is to introduce a semantic guard at the allocation stage: don’t group only by model type, but also by intended query role (embedding-only vs. mixed). One quick way to confirm: run your config through a TXTOS or WFGY core file, then ask your LLM “does this allocation fall into problem map one point zero or two point zero?” The model will usually expose whether it’s a pure config misalignment or a deeper semantic mismatch in how swap is applied.
-
@sammcj perhaps making embedding_models
It’s not well documented, but all models that aren’t explicitly grouped go into a default group with its own preset behaviour.
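To illustrate the explicit-grouping idea (keeping embedding models out of that default group and separating models by intended role, as suggested above), a minimal llama-swap config sketch might look like the following. The model names, paths, and the exact group fields used (swap, exclusive, members) are assumptions for illustration; check them against the llama-swap README for your version.

```yaml
# Hypothetical sketch: split models by role so embedding models stay in
# their own group instead of the implicit default group with chat models.
# Model names and paths are placeholders; verify the field names against
# the llama-swap documentation for your version.
models:
  "qwen-chat":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-32b-instruct-q4.gguf
  "nomic-embed":
    cmd: llama-server --port ${PORT} -m /models/nomic-embed-text.gguf --embedding

groups:
  # Chat models: swapped in and out, only one resident at a time.
  "chat":
    swap: true
    exclusive: true
    members:
      - "qwen-chat"
  # Embedding models: kept loaded so embedding requests don't force a swap
  # of the (much larger) chat model.
  "embeddings":
    swap: false
    exclusive: false
    members:
      - "nomic-embed"
```

The trade-off is that the embedding model permanently holds its share of VRAM, which ties into the point below about llama-swap having no view of VRAM usage: you have to budget the split yourself.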
-
llama-swap doesn’t have any data on current or expected VRAM usage, so it can’t do this automatically.
-
Thanks folks!
-
Example use cases:
Is the best option to have something like this?