
Commit d05f2c9

chore: clean residual Ollama references (#527)
Replace Ollama references with MLX (vllm-mlx) across agent routing tables, permissions, scripts, and docs. Add deprecation notice to historical multi-model orchestration plan. Closes #526 (claude)
1 parent bca9bb4 commit d05f2c9

8 files changed

Lines changed: 46 additions & 41 deletions
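A cleanup like this one can be spot-checked with a recursive, case-insensitive search. The sketch below is an assumption about how one might verify it, not something the commit itself ships; `find_residual_refs` is a hypothetical helper name. Some hits are expected even after this commit, since the archived PLAN.md deliberately mentions Ollama in its deprecation note.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): list every line under a
# directory that still mentions a given term, ignoring case, binary files,
# and the .git directory. Exits 1 when nothing matches, as grep does.
find_residual_refs() {
    local term="$1" root="${2:-.}"
    grep -rniI --exclude-dir=.git -- "$term" "$root"
}
```

For example, `find_residual_refs ollama agentsmd scripts` prints `path:line:text` for each remaining mention in those directories.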


agentsmd/agents/planner.md

Lines changed: 2 additions & 4 deletions
@@ -15,8 +15,7 @@ Architecture and design specialist for system planning and task breakdown.
 | Mode | Model | Reasoning |
 | ---- | ----- | --------- |
 | Cloud | Claude Opus 4.6 | Extended thinking for complex architecture |
-| Local (MLX) | mlx-community/Qwen3-235B-A22B-4bit | Strong reasoning for offline planning (port 11436) |
-| Local (Ollama) | qwen3-next | Fallback when MLX unavailable (port 11434) |
+| Local (MLX) | mlx-community/Qwen3-235B-A22B-4bit | Strong reasoning for offline planning (port 11434) |

 ## Capabilities

@@ -101,7 +100,6 @@ The planner agent often works with:

 When `AI_ORCHESTRATION_LOCAL_ONLY=true`:

-- Try MLX first: mlx-community/Qwen3-235B-A22B-4bit (port 11436)
-- Fall back to Ollama: qwen3-next (port 11434)
+- Use MLX: mlx-community/Qwen3-235B-A22B-4bit (port 11434)
 - All planning done locally
 - No cloud API calls

agentsmd/agents/researcher.md

Lines changed: 2 additions & 5 deletions
@@ -19,8 +19,7 @@ actual research work to specialized models (Gemini 3 Pro or local models) via PA
 | Mode | Model | Use Case |
 | ---- | ----- | -------- |
 | Cloud | Gemini 3 Pro | Large context analysis, web research |
-| Local (MLX) | mlx-community/Qwen3-235B-A22B-4bit | Offline research, private data (port 11436) |
-| Local (Ollama) | qwen3-next | Fallback when MLX unavailable (port 11434) |
+| Local (MLX) | mlx-community/Qwen3-235B-A22B-4bit | Offline research, private data (port 11434) |

 ## Capabilities

@@ -53,10 +52,8 @@ pal clink "Research question here"

 When `AI_ORCHESTRATION_LOCAL_ONLY=true` or `--local` flag is passed:

-- Try MLX first: mlx-community/Qwen3-235B-A22B-4bit (port 11436)
-- Fall back to Ollama: qwen3-next (port 11434)
+- Use MLX: mlx-community/Qwen3-235B-A22B-4bit (port 11434)
 - No cloud API calls
-- OLLAMA_HOST environment variable is respected

 ## Output Format

agentsmd/agents/reviewer.md

Lines changed: 7 additions & 8 deletions
@@ -17,11 +17,11 @@ then synthesizes findings into a unified review.

 ## Models Used

-| Role | Cloud Model | Local (MLX preferred) | Local (Ollama fallback) |
-| ---- | ----------- | --------------------- | ----------------------- |
-| Primary | Gemini 3 Pro | mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit | deepseek-r1 |
-| Secondary | Claude Opus 4.6 | mlx-community/Qwen3-235B-A22B-4bit | qwen3-next |
-| Synthesis | Claude Sonnet 4.6 | mlx-community/Qwen3.5-27B-4bit | qwen3-next |
+| Role | Cloud Model | Local (MLX) |
+| ---- | ----------- | ----------- |
+| Primary | Gemini 3 Pro | mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit |
+| Secondary | Claude Opus 4.6 | mlx-community/Qwen3-235B-A22B-4bit |
+| Synthesis | Claude Sonnet 4.6 | mlx-community/Qwen3.5-27B-4bit |

 ## Review Process

@@ -71,9 +71,8 @@ Good patterns, well-written code, improvements over previous state.

 When `AI_ORCHESTRATION_LOCAL_ONLY=true`:

-- Try MLX first: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit (port 11436)
-- Fall back to Ollama: deepseek-r1 for primary analysis (port 11434)
-- Cross-validation: MLX Qwen3-235B or Ollama qwen3-next
+- Use MLX: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit for primary analysis (port 11434)
+- Cross-validation: mlx-community/Qwen3-235B-A22B-4bit
 - No cloud API calls

 ## Severity Guidelines

agentsmd/permissions/STRATEGY.md

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ agentsmd/permissions/
 │ ├── rust.json # Rust toolchain: cargo (rustc/rustup covered in core.json)
 │ ├── network.json # Network utilities: ping, dig, host, netstat, lsof, pgrep
 │ ├── system.json # System utilities: ln, readlink, htop, launchctl, plutil, etc.
-│ └── tools.json # Dev tools: rbenv, goenv, redis-cli, ollama, shellcheck, etc.
+│ └── tools.json # Dev tools: rbenv, goenv, redis-cli, shellcheck, etc.

 ├── ask/
 │ ├── git.json # Git: merge, reset, rebase, cherry-pick, restore, rm, gc/prune, commit --amend, push --force, clean

agentsmd/permissions/allow/tools.json

Lines changed: 0 additions & 1 deletion
@@ -43,7 +43,6 @@
 "orbctl info",
 "orbctl config get",
 "orbctl version",
-"ollama list",
 "shellcheck",
 "check-jsonschema",
 "claude doctor",

agentsmd/rules/infra/pre-integration-checklist.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
 # Pre-Integration Checklist for New Inference Backends

-Complete every item before merging a new inference backend (MLX, Ollama, vLLM, etc.).
+Complete every item before merging a new inference backend (MLX (vllm-mlx), vLLM, etc.).
 Based on the MLX arc retrospective where 5 of 14 PRs were reactive fixes that this checklist
 would have prevented.

@@ -9,7 +9,7 @@ would have prevented.
 - [ ] Document peak RAM usage for the largest model you plan to serve
 - [ ] Document sustained (idle-loaded) RAM usage with a model resident
 - [ ] Verify total system RAM can handle the backend plus normal workload (browser, IDE, Claude Code)
-- [ ] Set an explicit memory ceiling in the LaunchAgent or service config (e.g., `mlx_max_memory`, `OLLAMA_MAX_VRAM`)
+- [ ] Set an explicit memory ceiling in the LaunchAgent or service config (e.g., `mlx_max_memory`, `VLLM_MAX_MEMORY`)
 - [ ] Confirm OOM behavior: does the process get killed, crash gracefully, or hang?
 - [ ] Test with the largest model on the lowest-spec target machine

@@ -49,13 +49,13 @@ would have prevented.

 - [ ] List every new environment variable the backend introduces
 - [ ] Check for naming conflicts with existing variables (`env | grep -i <prefix>`)
-- [ ] Follow the existing naming convention (e.g., `OLLAMA_HOST`, `MLX_*`)
+- [ ] Follow the existing naming convention (e.g., `MLX_*`, `VLLM_*`)
 - [ ] Document which variables are required vs. optional, with defaults
 - [ ] Verify variables are set in the correct scope (LaunchAgent plist, shell profile, or Nix config)

 ## LaunchAgent / Service Management

-- [ ] Define startup order: does this service depend on another (e.g., network, Ollama)?
+- [ ] Define startup order: does this service depend on another (e.g., network, vllm-mlx)?
 - [ ] Add a health check endpoint or command (e.g., `curl http://localhost:<port>/v1/models`)
 - [ ] Set `KeepAlive` or restart policy so the service recovers from crashes
 - [ ] Set `ThrottleInterval` to prevent restart loops from consuming resources
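The health check item above can be turned into a small retrying probe. This is a minimal sketch under stated assumptions: `wait_until` and `check_backend` are hypothetical helper names, the retry count and delay are arbitrary, and port 11434 is taken from the agent docs changed in this commit. The `/v1/models` path is the one the checklist itself suggests.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        # Discard the probe's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical wrapper: probe the model-listing endpoint on the given port
# (default 11434, an assumption based on the docs in this commit).
check_backend() {
    local port="${1:-11434}"
    wait_until 5 2 curl -fsS --max-time 2 "http://localhost:${port}/v1/models"
}
```

A service wrapper could call `check_backend || exit 1` before starting anything that depends on the local backend.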

docs/projects/multi-model-orchestration/PLAN.md

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 # Multi-Model AI Orchestration System

+> **Note (2026-03-25)**: This plan was written when Ollama was part of the stack. Ollama has been fully removed
+> and replaced by MLX (vllm-mlx) on port 11434. Model references using Ollama-style tags (e.g., `qwen3-coder:30b`)
+> should be read as their HuggingFace equivalents.
+>
 > **ARCHIVED PLAN**: This document contains historical planning notes.
 > The Python scripts shown are **deprecated design artifacts** — do NOT use as templates.
 > All orchestration is handled via PAL MCP tools and direct CLI invocations.
@@ -792,4 +796,4 @@ ln -sf ~/git/ai-assistant-instructions/feature/multi-model-orchestration/agentsm
 - [PAL MCP Server](https://github.com/BeehiveInnovations/pal-mcp-server)
 - [Anthropic Skills](https://github.com/anthropics/skills)
 - [Claude Code Plugins](https://www.anthropic.com/news/claude-code-plugins)
-- [LLM Rankings Dec 2025](https://vertu.com/lifestyle/top-8-ai-models-ranked-gemini-3-chatgpt-5-1-grok-4-claude-4-5-more/)
+- LLM Rankings Dec 2025 (source link no longer available)

scripts/select-model.sh

Lines changed: 25 additions & 17 deletions
@@ -7,7 +7,7 @@
 # Usage: select-model.sh [options]
 # --task-type=<research|coding|review|decision|default>
 # --cost-sensitive (flag - prefer free local models)
-# --private (flag - sensitive data, must use local Ollama)
+# --private (flag - sensitive data, must use local MLX)
 # --large-context (flag - need 1M+ context window)
 # --analyze-complexity=<prompt|filepath> (optional complexity analysis)
 #
@@ -132,9 +132,10 @@ select_model() {

     # Step 1: Is the data sensitive or confidential?
     if [[ "$private" == "true" ]]; then
-        echo "Model: ollama"
-        echo "Selected: deepseek-r1:70b (local reasoning) or qwen3-next:80b (local general)"
-        echo "Command: ollama run deepseek-r1:70b"
+        echo "Model: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit"
+        echo "Selected: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit (local reasoning) or mlx-community/Qwen3-235B-A22B-4bit (local general)"
+        # Use PAL MCP chat tool: pal chat --model mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit "<prompt>"
+        echo "Command: pal chat --model mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit"
         echo "Rationale: Private/sensitive data must stay local. Never use cloud APIs."
         return 0
     fi
@@ -143,32 +144,37 @@ select_model() {
     if [[ "$cost_sensitive" == "true" ]]; then
         case "$task_type" in
             coding)
-                echo "Model: qwen3-coder:30b"
-                echo "Command: ollama run qwen3-coder:30b"
+                echo "Model: mlx-community/Qwen3-Coder-30B-A3B-Instruct"
+                # Use PAL MCP chat tool: pal chat --model mlx-community/Qwen3-Coder-30B-A3B-Instruct "<prompt>"
+                echo "Command: pal chat --model mlx-community/Qwen3-Coder-30B-A3B-Instruct"
                 echo "Rationale: Cost-sensitive coding task - using free local specialized model"
                 return 0
                 ;;
             review)
-                echo "Model: deepseek-r1:70b"
-                echo "Command: ollama run deepseek-r1:70b"
+                echo "Model: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit"
+                # Use PAL MCP chat tool: pal chat --model mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit "<prompt>"
+                echo "Command: pal chat --model mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit"
                 echo "Rationale: Cost-sensitive code review - using free local reasoning model"
                 return 0
                 ;;
             research)
-                echo "Model: qwen3-next:80b"
-                echo "Command: ollama run qwen3-next:80b"
+                echo "Model: mlx-community/Qwen3-235B-A22B-4bit"
+                # Use PAL MCP chat tool: pal chat --model mlx-community/Qwen3-235B-A22B-4bit "<prompt>"
+                echo "Command: pal chat --model mlx-community/Qwen3-235B-A22B-4bit"
                 echo "Rationale: Cost-sensitive research/analysis - using free local general model"
                 return 0
                 ;;
             decision)
-                echo "Model: deepseek-r1:70b + qwen3-next:80b"
-                echo "Command: bash -c 'echo \"Model 1 (DeepSeek R1):\" && ollama run deepseek-r1:70b && echo -e \"\\nModel 2 (Qwen):\" && ollama run qwen3-next:80b'"
+                echo "Model: mlx-community/DeepSeek-R1-Distill-Llama-70B-4bit + mlx-community/Qwen3-235B-A22B-4bit"
+                # Use PAL MCP clink tool for parallel multi-model: pal clink "<prompt>"
+                echo "Command: pal clink"
                 echo "Rationale: Cost-sensitive critical decision - using best-reasoning + general local models"
                 return 0
                 ;;
             default)
-                echo "Model: qwen3-next:80b"
-                echo "Command: ollama run qwen3-next:80b"
+                echo "Model: mlx-community/Qwen3-235B-A22B-4bit"
+                # Use PAL MCP chat tool: pal chat --model mlx-community/Qwen3-235B-A22B-4bit "<prompt>"
+                echo "Command: pal chat --model mlx-community/Qwen3-235B-A22B-4bit"
                 echo "Rationale: Cost-sensitive generic task - using free local model"
                 return 0
                 ;;
@@ -231,9 +237,11 @@ select_model() {
     fi

     # Default: Start local, fall back to cloud
-    echo "Model: ollama-with-fallback"
-    echo "Selected: qwen3-next:80b (local) → gemini-3-pro (cloud fallback)"
-    echo "Command: ollama run qwen3-next:80b || gemini chat --model gemini-3-pro"
+    echo "Model: mlx-with-fallback"
+    echo "Selected: mlx-community/Qwen3-235B-A22B-4bit (local) → gemini-3-pro (cloud fallback)"
+    # Use PAL MCP chat tool with local model first: pal chat --model mlx-community/Qwen3-235B-A22B-4bit "<prompt>"
+    # Fall back to cloud: pal chat --model gemini-3-pro "<prompt>"
+    echo "Command: pal chat --model mlx-community/Qwen3-235B-A22B-4bit"
     echo "Rationale: Default/general task - try local first for cost/privacy, fall back to cloud if needed"
     return 0
 }
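In the default branch, the removed `Command:` line chained the local and cloud commands with `||`, whereas the new one prints only the local command and leaves the cloud fallback in a comment. A caller that still wants the try-local-then-cloud behavior could wrap the two commands itself. The sketch below assumes a hypothetical helper, `run_with_fallback`, which is not part of select-model.sh. The `pal chat` invocations in the usage note are copied from the script's comments and are not verified CLI syntax.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of select-model.sh: run a primary command and,
# only if it exits non-zero, run a fallback command.
# Usage: run_with_fallback <primary...> -- <fallback...>
run_with_fallback() {
    local -a primary=() fallback=()
    local seen_sep=false arg
    for arg in "$@"; do
        if [[ "$seen_sep" == "false" && "$arg" == "--" ]]; then
            # The first bare "--" separates the primary command from the fallback.
            seen_sep=true
        elif [[ "$seen_sep" == "true" ]]; then
            fallback+=("$arg")
        else
            primary+=("$arg")
        fi
    done
    if [[ ${#primary[@]} -eq 0 || ${#fallback[@]} -eq 0 ]]; then
        echo "usage: run_with_fallback <primary...> -- <fallback...>" >&2
        return 2
    fi
    "${primary[@]}" || "${fallback[@]}"
}
```

For the default branch this might be called as `run_with_fallback pal chat --model mlx-community/Qwen3-235B-A22B-4bit "<prompt>" -- pal chat --model gemini-3-pro "<prompt>"`. It should not be used on the `--private` path, where the script's rationale is that data must never reach a cloud API.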
