# use a custom command to stop the model when swapping. By default
# this is SIGTERM on POSIX systems, and taskkill on Windows systems
# the ${PID} variable can be used in cmdStop, it will be automatically replaced
# with the PID of the running model
cmdStop: docker stop dockertest

# Groups provide advanced controls over model swapping behaviour. Using groups,
# some models can be kept loaded indefinitely, while others are swapped out.
#
# Tips:
#
# - models must be defined above in the Models section
# - a model can only be a member of one group
# - group behaviour is controlled via the `swap`, `exclusive` and `persistent` fields
# - see issue #109 for details
#
# NOTE: the example below uses model names that are not defined above for demonstration purposes
groups:
  # group1 is the default behaviour of llama-swap where only one model is allowed
  # to run at a time across the whole llama-swap instance
  "group1":
    # swap controls the model swapping behaviour within the group
    # - true : only one model is allowed to run at a time
    # - false: all models can run together, no swapping
    swap: true

    # exclusive controls how the group affects other groups
    # - true: causes all other groups to unload their models when this group runs a model
    # - false: does not affect other groups
    exclusive: true

    # members references the models defined above
    members:
      - "llama"
      - "qwen-unlisted"

  # models in this group are never unloaded
  "group2":
    swap: false
    exclusive: false
    members:
      - "docker-llama"
      # (not defined above, here for example)
      - "modelA"
      - "modelB"

  "forever":
    # setting persistent to true causes the group to never be affected by the
    # swapping behaviour of other groups. It is a shortcut to keeping some
    # models always loaded.
    persistent: true

    # set swap/exclusive to false to prevent swapping inside the group and any
    # effect on other groups
    swap: false
    exclusive: false
    members:
      - "forever-modelA"
      - "forever-modelB"
      - "forever-modelc"
```
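The `swap`, `exclusive`, and `persistent` semantics described in the comments above can be sketched as a small decision function. This is a simplified model for illustration, not llama-swap's actual implementation:

```python
# Simplified sketch of llama-swap group semantics (illustrative, not the
# real implementation): given the loaded models and a newly requested model,
# decide which loaded models must unload first.
from dataclasses import dataclass, field

@dataclass
class Group:
    swap: bool = True          # True: only one model at a time within the group
    exclusive: bool = True     # True: running a model here unloads other groups
    persistent: bool = False   # True: never unloaded by other groups
    members: list = field(default_factory=list)

def models_to_unload(groups, loaded, requested):
    """Return the set of loaded models that must stop before `requested` runs."""
    req_group = next(g for g in groups.values() if requested in g.members)
    unload = set()
    for g in groups.values():
        for m in loaded:
            if m == requested or m not in g.members:
                continue
            if g is req_group:
                if g.swap:                  # swap=true: one model at a time in-group
                    unload.add(m)
            elif req_group.exclusive and not g.persistent:
                unload.add(m)               # exclusive group evicts non-persistent groups
    return unload

groups = {
    "group1": Group(swap=True, exclusive=True, members=["llama", "qwen-unlisted"]),
    "forever": Group(swap=False, exclusive=False, persistent=True,
                     members=["forever-modelA"]),
}
loaded = {"qwen-unlisted", "forever-modelA"}
print(models_to_unload(groups, loaded, "llama"))  # {'qwen-unlisted'}
```

Note how the persistent "forever" group survives even though "group1" is exclusive, matching the shortcut described in the config comments.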
- ⚡ `groups` to run multiple models at once
- ⚡ `macros` for reusable snippets
- ⚡ `ttl` to automatically unload models
- ⚡ `aliases` to use familiar model names (e.g., "gpt-4o-mini")
- ⚡ `env` variables to pass a custom environment to inference servers
- ⚡ `useModelName` to override model names sent to upstream servers
- ⚡ `healthCheckTimeout` to control model startup wait times
- ⚡ `${PORT}` automatic port variables for dynamic port assignment
- ⚡ Docker/podman compatible

### Use Case Examples
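Several of the features listed above can be combined in a single model entry. The sketch below is illustrative only (the model name and file path are made up); see config.example.yaml for the authoritative format:

```yaml
models:
  "my-model":
    # ${PORT} is replaced with an automatically assigned port
    cmd: llama-server --port ${PORT} -m /path/to/model.gguf
    # requests for "gpt-4o-mini" are routed to this model
    aliases:
      - "gpt-4o-mini"
    # unload automatically after 300 seconds of inactivity
    ttl: 300
```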
- [config.example.yaml](config.example.yaml) includes examples for supporting `v1/embeddings` and `v1/rerank` endpoints
- [Speculative Decoding](examples/speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
- [Optimizing Code Generation](examples/benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.
- [Restart on Config Change](examples/restart-on-config-change/README.md) - automatically restart llama-swap when trying out different configurations.
</details>
Check the [wiki](https://github.com/mostlygeek/llama-swap/wiki/Configuration) for the full documentation.