Notes and observations from setting up a DGX Spark mini-PC.
Disable password login fallback
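A minimal sketch, assuming OpenSSH and that key-based login is already confirmed working (test that first, or you lock yourself out).
In /etc/ssh/sshd_config set:
PasswordAuthentication no
KbdInteractiveAuthentication no
then restart the daemon:
$ sudo systemctl restart ssh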
Open WebUI with models from Ollama
$ docker pull ghcr.io/open-webui/open-webui:ollama
$ docker run -d -p 8080:8080 --gpus=all \
-v open-webui:/app/backend/data \
-v open-webui-ollama:/root/.ollama \
--name open-webui \
ghcr.io/open-webui/open-webui:ollama
Pick the smartest model that fits in the 128GB of unified memory.
As of 2025-10, 4-bit quantization is fine.
Anything lower increasingly degrades perplexity.
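For example, to pull the two models from the benchmark below into the bundled Ollama (assuming the ollama binary is on the container's PATH):
$ docker exec -it open-webui ollama pull gpt-oss:20b
$ docker exec -it open-webui ollama pull qwen3-coder:30b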
Benchmark
model            size   context  response    prompt eval   in-mem size
-----------------------------------------------------------------------
gpt-oss:20b      14GB   128k     50 tok/s    1800 tok/s    30GB
qwen3-coder:30b  19GB   256k     60 tok/s    850 tok/s     70GB
VSCode extension: Continue.dev
To reuse the models already downloaded by the open-webui:ollama container, also expose port 11434 and set OLLAMA_HOST so Ollama listens on all interfaces.
$ docker run -d \
-p 8080:8080 \
-p 11434:11434 \
--gpus=all \
-e OLLAMA_HOST=0.0.0.0 \
-v open-webui:/app/backend/data \
-v open-webui-ollama:/root/.ollama \
--name open-webui \
ghcr.io/open-webui/open-webui:ollama
In Continue.dev's model settings, add:
{
"title": "Qwen 3 Coder",
"provider": "ollama",
"model": "qwen3-coder:30b",
"apiBase": "http://<IP_ADDRESS>:11434",
"systemMessage": "You are an expert software developer. You give helpful and concise responses. Whenever you write a code block you include the language after the opening ticks."
}
Fine-tuning options:
Unsloth
PyTorch
NeMo (Nvidia)
Dreambooth (images)
Unsloth
dep:
nvidia-cuda-toolkit
pytorch
transformers
peft
datasets
unsloth
unsloth_zoo
bitsandbytes
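A minimal LoRA setup following the usual Unsloth notebook pattern; the base model and hyperparameters below are assumptions, not something validated on the Spark.
e.g.
from unsloth import FastLanguageModel

# Load a base model in 4-bit (model name is an example; any Unsloth-supported model works).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)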
PyTorch
dep:
transformers
peft
datasets
trl
bitsandbytes
e.g.
import torch
from trl import SFTConfig, SFTTrainer
# Config, load dataset, then:
trainer = SFTTrainer(...)
trainer_stats = trainer.train()

============================================================
TRAINING COMPLETED
============================================================
Training runtime: 76.69 seconds
Samples per second: 6.52
Steps per second: 1.63
Train loss: 1.0082
============================================================
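A fuller, hedged version of the skeleton above, roughly following the TRL quickstart; the model and dataset names are placeholders, not what produced the numbers here.
e.g.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Small conversational dataset from the TRL docs; swap in your own data.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="qwen2.5-0.5b-sft",        # where checkpoints are written
    per_device_train_batch_size=4,
    max_steps=125,                        # keep the smoke test short
    logging_steps=25,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # TRL accepts a model id string here
    args=training_args,
    train_dataset=dataset,
)
trainer_stats = trainer.train()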
Serve with vLLM
https://github.com/vllm-project/vllm
nano-vLLM for understanding, like nanoGPT?
https://github.com/GeeeekExplorer/nano-vllm
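A minimal offline-inference sketch with vLLM's Python API; the model name is just an example, point it at whatever you fine-tuned or pulled.
e.g.
from vllm import LLM, SamplingParams

# vLLM manages the weights, KV cache and batching on the GPU.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV caching in one paragraph."], params)
print(outputs[0].outputs[0].text)
For an OpenAI-compatible HTTP server instead, vLLM ships a CLI: vllm serve <MODEL>.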
Huggingface.co gives 5TB of free public and 100GB of private model storage.
Specify <USERNAME_OR_ORG>/<MODEL_NAME> and a <WRITE_ACCESS_TOKEN>:
https://huggingface.co/settings/tokens/new?tokenType=write
e.g.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(...)
# Some fine-tuning steps, then:
model.push_to_hub_merged("hLuigi/gpt_oss_20B_RL_2048_Game", tokenizer, token = "hf_123ABC", save_method = "mxfp4")
TBA
fp16 is standard
because GPUs older than Hopper do not support fp8
some report fp16 scales better than bf16
mxfp4
Hopper and newer architectures support it (e.g. RTX 50xx, Hx00, Bx00)
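For reference, choosing the compute dtype when loading with transformers (a sketch; the model id is an example).
e.g.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bf16 on recent GPUs, fp16 on hardware without good bf16 support.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,           # or torch.float16
    device_map="auto",
)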
Chain-of-thought
Draft-critique-revise, repeat
Planning
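A toy draft-critique-revise loop against the Ollama endpoint exposed above, via its OpenAI-compatible API; endpoint, model name and prompts are assumptions, not a recipe from any docs.
e.g.
from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1; the API key is ignored.
client = OpenAI(base_url="http://<IP_ADDRESS>:11434/v1", api_key="ollama")
MODEL = "qwen3-coder:30b"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

task = "Write a Python function that merges two sorted lists."
draft = ask(task)
for _ in range(2):                        # two critique/revise rounds
    critique = ask(f"Critique this solution, list concrete problems:\n\n{draft}")
    draft = ask(f"Task: {task}\n\nCritique:\n{critique}\n\nRewrite the solution fixing the problems.")
print(draft)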
TBA
Container
"Nvidia Inference Microservice"
Flush the buffer cache using:
$ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'