This file replaces llm-context/01-machine_context_summary.md as the single source of task-level instructions for this repository.
You are continuing development of m40-llm—a Rust LLM runtime/server targeting NVIDIA Tesla M40 (sm_52). Work efficiently and keep changes focused.
- Review
CONTRIBUTING.mdand set up the git hooks (pre-commit and commit message hooks). - Ensure Cocogitto (
cog) is installed and working. - Ensure
nvccis available. If missing, install CUDA Toolkit compatible with Tesla M40 (up tocuda-nvcc=12.4.*,cuda-cudart=12.4.*, andcuda-cudart-dev=12.4.*) and moderngcc/g++.- Always install CUDA Toolkit 12.4 plus cuBLAS development headers before building (use the micromamba command in
README.md).
- Always install CUDA Toolkit 12.4 plus cuBLAS development headers before building (use the micromamba command in
- If you are running in OpenHands (not Codex): Use micromamba to install packages, not apt-get.
- Commit early and often.
- Use Conventional Commits.
- Every commit must include both a summary and a descriptive body.
llm-context/00-index.mdpoints to key design notes; consult it when resuming work.- CUDA and KV cache guidance:
docs/cuda_parity_and_kv_layout.md.
- Precision policy: FP16 storage (weights and KV), FP32 compute; Q in FP32, K/V stored as FP16.
- Model format: GGUF; integration planned via
gguf-rs-libandgguf-llms. - CUDA build:
build.rsdetects NVCC and embeds sm_52 SASS plus compute_52 PTX; cuBLAS toggled viaM40LLM_ENABLE_CUBLASwhen headers/libs are present. - KV cache API:
allocate_kv_cache_with_layoutinsrc/infer.rs;ffi_debug_read_kv_tokeninsrc/cuda.rshelps validate layout. - Key tests to keep green (with
--features=cuda):- Attention parity grid:
tests/attention_parity_cuda_grid.rs(setM40LLM_ENABLE_CUBLAS=0|1). - Projection/MLP wrappers:
tests/proj_wrappers.rs,tests/mlp_wrappers.rs. - Forward smoke:
tests/forward_with_layer_smoke.rs.
- Attention parity grid:
- Phase 1 (Functional, critical path): Goal is to load a real GGUF and print decoded tokens on Tesla M40 with no silent CPU fallback.
- Acceptance highlights: fast-fail invalid GGUFs, logits parity across CUDA/CPU, deterministic sampler, and a CLI decode loop that visibly exercises the GPU.
- Phase 2 (Fast, M40 optimization): Optimize cuBLAS wiring and attention while keeping parity grid green.
- Focus areas: cuBLAS GEMM wrappers (t23-cublas-gemm), full projection wiring (t23-5-integration continuation), attention parity guardrail (t23-5e-attn-green), microbenchmarks (t31b-microbench-attn), stream separation (t33-stream-sep), and persistent decode prototype (t32-persistent-kernel).
- Phase 3 (Infrastructure, QA, polish): Stabilize CI, logging, and server polish.
- Focus areas: clippy cleanups (t23-6-warnings), CUDA benchmark unification (t23-6-benches-ctx), expanded CI matrix (t35-ci-matrix), better errors/telemetry (t34-robustness),
/generateserver parity (t30-server), prefill-vs-decode validation plan (t26-tests-plan), and completingllm-contextindexing (t41-lc-index).
- Focus areas: clippy cleanups (t23-6-warnings), CUDA benchmark unification (t23-6-benches-ctx), expanded CI matrix (t35-ci-matrix), better errors/telemetry (t34-robustness),
- Task selection rule: when choosing work, pick the lowest-numbered unfinished task in the lowest-numbered phase.
- Update this file (
AGENTS.md) as you complete high-level tasks.
[
{
"phase": 1,
"name": "Functional (Critical Path)",
"goal": "Load a real GGUF model and print tokens using the Tesla M40 GPU (host sampling allowed).",
"success_definition": "CLI run prints decoded text while exercising the Tesla M40 GPU path (no silent CPU fallback).",
"tasks": [
{
"id": "t28-sampling",
"priority": 5,
"status": "done",
"title": "Host-side sampling (softmax, top-k, top-p)",
"rationale": "Host sampling keeps the GPU MVP simple and correct.",
"scope": [
"Softmax implementation.",
"Temperature scaling.",
"Top-k and top-p sampling.",
"Deterministic RNG."
],
"acceptance": [
"Greedy and top-k/top-p sampling paths work.",
"Given logits, sampler returns a token ID deterministically."
]
},
{
"id": "t26-min-forward",
"priority": 6,
"status": "in_progress",
"title": "Minimal forward pass (prefill + decode)",
"rationale": "This is the minimal computation needed to produce logits.",
"scope": [
"Single-layer forward correctness.",
"KV cache write/read path.",
"Uses real GGUF weights."
],
"acceptance": [
"Forward produces last-token logits.",
"Works for both prefill and decode paths."
]
},
{
"id": "t25-rope-rmsnorm",
"priority": 7,
"status": "todo",
"title": "RMSNorm and RoPE (GPU-first)",
"rationale": "Keeping activations on the GPU avoids catastrophic perf loss.",
"scope": [
"CUDA kernels for RMSNorm and RoPE.",
"Host fallback only behind feature or debug flag."
],
"acceptance": [
"Parity vs CPU on small shapes.",
"Integrated without host round-trips in hot path."
]
},
{
"id": "t22-parity-check",
"priority": 8,
"status": "in_progress",
"title": "CUDA ↔ CPU parity checks",
"rationale": "Protects correctness while wiring GPU math.",
"scope": [
"Last-token attention parity grid.",
"Run with M40LLM_ENABLE_CUBLAS=0 and =1."
],
"acceptance": [
"Parity grid stays green within ~1e-3 tolerance.",
"Coverage includes kv_head_dim variants."
]
},
{
"id": "t23-6-autoselect-m40",
"priority": 9,
"status": "in_progress",
"title": "Runtime auto-selection of Tesla M40",
"rationale": "Prevents accidental arch mismatches on Maxwell.",
"scope": [
"Compile for sm_52 at build time.",
"Prefer Tesla M40 at runtime.",
"Graceful CPU fallback if CUDA unavailable."
],
"acceptance": [
"Logs confirm selected device: Tesla M40 (sm_52).",
"Build fails fast on incompatible CUDA arch."
]
},
{
"id": "t23-5-integration",
"priority": 10,
"status": "in_progress",
"title": "GPU minimal forward using cuBLAS (FP32)",
"rationale": "M40 stability requires FP32 compute with cuBLAS.",
"scope": [
"FP16 used only for storage.",
"Upcast weights to FP32 on device.",
"Use cublasSgemm and StridedBatched GEMM.",
"Document and validate layout/transposition rules."
],
"acceptance": [
"Prefill and decode compute logits fully on GPU.",
"Numeric parity vs CPU for tiny configs.",
"No reliance on half arithmetic on sm_52."
]
},
{
"id": "t29-e2e-decode",
"priority": 11,
"status": "todo",
"title": "End-to-end decode loop (prints tokens)",
"rationale": "This is the visible success milestone.",
"scope": [
"Prefill initializes device KV cache.",
"Decode loop runs device forward, copies logits to host, samples next token.",
"CLI entry point."
],
"acceptance": [
"CLI run prints decoded text.",
"GPU path is exercised (confirmed via logs).",
"No silent CPU fallback."
]
},
{
"id": "t36-min-gguf-model",
"priority": 12,
"status": "todo",
"title": "Minimal GGUF test model",
"rationale": "Fast deterministic tests unblock iteration and CI.",
"scope": [
"1–2 layers.",
"Tiny dimensions.",
"Deterministic outputs."
],
"acceptance": [
"cargo test runs prefill + 1–2 decode steps deterministically.",
"Used by CUDA-gated tests."
]
}
]
},
{
"phase": 2,
"name": "Fast (M40 Optimization)",
"goal": "Improve performance on Tesla M40 while preserving numerical correctness and parity guarantees.",
"success_definition": "End-to-end decode runs significantly faster on M40 with cuBLAS enabled and all parity tests passing.",
"tasks": [
{
"id": "t23-cublas-gemm",
"priority": 1,
"status": "in_progress",
"title": "cuBLAS GEMM wrappers and validation",
"rationale": "All performance gains depend on correct and well-understood GEMM behavior on Maxwell.",
"scope": [
"Harden cuBLAS wrapper APIs.",
"Validate row-major Rust layouts vs column-major cuBLAS expectations.",
"Document lda/ldb/ldc and transposition contracts.",
"Basic performance sanity checks on Tesla M40."
],
"acceptance": [
"GEMM wrappers are documented and reused consistently.",
"No silent shape or stride mismatches.",
"Measured performance is reasonable for M40-class hardware."
]
},
{
"id": "t23-5-integration",
"priority": 2,
"status": "in_progress",
"title": "Full GEMM wiring for model layers",
"rationale": "Moving all major projections onto GPU is required for meaningful speedups.",
"scope": [
"Wire Q/K/V projections to device GEMMs.",
"Wire MLP gate/up/down projections.",
"Wire output projection.",
"Keep CPU fallback paths for debugging."
],
"acceptance": [
"All projections run on GPU when cuBLAS is enabled.",
"CPU fallback remains functional.",
"Parity tests pass within tolerance."
]
},
{
"id": "t23-5e-attn-green",
"priority": 3,
"status": "in_progress",
"title": "Maintain attention parity grid during optimization",
"rationale": "Attention bugs are subtle and catastrophic if unchecked.",
"scope": [
"Continuously run attention parity grid during integration.",
"Cover multiple head_dim and kv_head_dim combinations."
],
"acceptance": [
"Parity grid remains green throughout Phase 2 work.",
"Failures are caught immediately during refactors."
]
},
{
"id": "t31b-microbench-attn",
"priority": 4,
"status": "todo",
"title": "Attention microbenchmarks on M40",
"rationale": "Optimization requires measurement, not guesswork.",
"scope": [
"Microbenchmarks for attention forward pass.",
"Measure prefill vs decode costs.",
"Capture baseline vs optimized timings."
],
"acceptance": [
"Benchmarks run reproducibly on M40.",
"Results are documented and tracked over time."
]
},
{
"id": "t33-stream-sep",
"priority": 5,
"status": "todo",
"title": "Prefill and decode stream separation",
"rationale": "Separate CUDA streams allow better overlap and latency hiding.",
"scope": [
"Introduce distinct CUDA streams for prefill and decode.",
"Tune stream priorities where applicable."
],
"acceptance": [
"Prefill and decode run on separate streams.",
"No correctness regressions.",
"Measured latency improvement or neutrality."
]
},
{
"id": "t32-persistent-kernel",
"priority": 6,
"status": "todo",
"title": "Persistent decode kernel prototype",
"rationale": "Optional advanced optimization to reduce kernel launch overhead.",
"scope": [
"Prototype persistent kernel for decode loop.",
"Limit scope to experimentation and benchmarking."
],
"acceptance": [
"Prototype builds and runs behind a feature flag.",
"Performance impact is measured and documented."
]
},
{
"id": "t26-3-impl",
"priority": 7,
"status": "todo",
"title": "Remove remaining host fallbacks in forward path",
"rationale": "Host fallbacks in hot paths negate GPU gains.",
"scope": [
"CUDA RMSNorm and residual paths.",
"Remove unnecessary host round-trips."
],
"acceptance": [
"Forward path runs fully on GPU in normal operation.",
"Parity tests remain green."
]
}
]
},
{
"phase": 3,
"name": "Infrastructure, QA, and Product Polish",
"goal": "Stabilize the system, improve developer experience, and make the project production-ready.",
"success_definition": "Clean CI, clear diagnostics, documented behavior, and a usable server interface.",
"tasks": [
{
"id": "t23-6-warnings",
"priority": 1,
"status": "in_progress",
"title": "Eliminate warnings and keep clippy clean",
"rationale": "Warnings hide real problems and reduce signal quality.",
"scope": [
"Remove unused CudaContext imports.",
"Keep clippy -D warnings green across feature combinations."
],
"acceptance": [
"No warnings in default or CUDA builds.",
"CI treats warnings as errors."
]
},
{
"id": "t23-6-benches-ctx",
"priority": 2,
"status": "todo",
"title": "Unify benchmarks on ctx_m40",
"rationale": "Benchmarks must reflect real target hardware.",
"scope": [
"Update benches to use cuda_env::ctx_m40().",
"Remove ad-hoc device selection logic."
],
"acceptance": [
"All CUDA benchmarks use ctx_m40.",
"Bench results are comparable across runs."
]
},
{
"id": "t35-ci-matrix",
"priority": 3,
"status": "todo",
"title": "Expand CI build and test matrix",
"rationale": "Prevent regressions across feature combinations.",
"scope": [
"CUDA vs non-CUDA builds.",
"Different nvcc / feature flag combinations.",
"Document CI expectations."
],
"acceptance": [
"CI covers all supported build modes.",
"Failures clearly indicate configuration."
]
},
{
"id": "t34-robustness",
"priority": 4,
"status": "todo",
"title": "Robust error handling and telemetry",
"rationale": "Clear errors and logs reduce debugging time dramatically.",
"scope": [
"Improve error messages in inference and loading paths.",
"Add basic telemetry and timing logs."
],
"acceptance": [
"Common failure modes have actionable messages.",
"Logs show device, timing, and execution path."
]
},
{
"id": "t30-server",
"priority": 5,
"status": "todo",
"title": "HTTP server wired to real decode path",
"rationale": "Server interface is required for real-world usage.",
"scope": [
"Wire /generate endpoint to real decode loop.",
"Support streaming output and CORS."
],
"acceptance": [
"/generate produces valid streamed or buffered output.",
"Uses the same decode path as CLI."
]
},
{
"id": "t26-tests-plan",
"priority": 6,
"status": "todo",
"title": "Formal prefill vs decode validation plan",
"rationale": "Decode paths often diverge subtly from prefill.",
"scope": [
"Define expected invariants between prefill and decode.",
"Add CUDA-gated tests where appropriate."
],
"acceptance": [
"Tests catch prefill/decode divergence.",
"Documented validation strategy exists."
]
},
{
"id": "t41-lc-index",
"priority": 7,
"status": "in_progress",
"title": "Finalize llm-context index and cross-links",
"rationale": "Contextual documentation improves long-term maintainability.",
"scope": [
"Finalize relevance tags.",
"Cross-link related design notes."
],
"acceptance": [
"llm-context index is complete and consistent.",
"Key design decisions are discoverable."
]
}
]
}
]