Replies: 4 comments 1 reply
-
The way you’re describing it (swap defaults not behaving consistently when mixing embedding models with larger chat models) is a classic failure mode we’ve seen before. In our notes it maps to ProblemMap No. 3 (indexing collapse when multiple models contend for resource state) and sometimes overlaps with No. 7 (semantic drift between embedding and chat session state). In practice, setting swap=false globally rarely solves it, because the allocator still evaluates load heuristics differently while embeddings are running. The more reliable fix is to introduce a semantic guard at the allocation stage: don’t group only by model type, but also by intended query role (embedding-only vs. mixed). One quick way to confirm: run your config through a TXTOS or WFGY core file, then ask your LLM “does this allocation fall into problem map one point zero or two point zero?” The model will usually expose whether it’s a pure config misalignment or a deeper semantic mismatch in how swap is applied.
-
@sammcj perhaps making embedding_models
It’s not well documented, but all models that aren’t explicitly grouped go into a default group with its own preset behaviour.
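To illustrate the explicit-grouping idea (keeping embedding models out of that default group and separating models by intended role, as suggested above), a minimal llama-swap config sketch might look like the following. The model names, paths, and the exact group fields used (swap, exclusive, members) are assumptions for illustration; check them against the llama-swap README for your version.

```yaml
# Hypothetical sketch: split models by role so embedding models stay in
# their own group instead of the implicit default group with chat models.
# Model names and paths are placeholders; verify the field names against
# the llama-swap documentation for your version.
models:
  "qwen-chat":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-32b-instruct-q4.gguf
  "nomic-embed":
    cmd: llama-server --port ${PORT} -m /models/nomic-embed-text.gguf --embedding

groups:
  # Chat models: swapped in and out, only one resident at a time.
  "chat":
    swap: true
    exclusive: true
    members:
      - "qwen-chat"
  # Embedding models: kept loaded so embedding requests don't force a swap
  # of the (much larger) chat model.
  "embeddings":
    swap: false
    exclusive: false
    members:
      - "nomic-embed"
```

The trade-off is that the embedding model permanently holds its share of VRAM, which ties into the point below about llama-swap having no view of VRAM usage: you have to budget the split yourself.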
-
llama-swap doesn’t have any data on current or expected VRAM usage, so it can’t do this automatically.
-
Thanks folks!
-
Example use cases:
Is the best option to have something like this?