Fix: Bad Gateway Error 502#61
Conversation
Made-with: Cursor
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/crusades/affinetes/runner.py (2)
977-986: Consider reusingAsyncClientacross retries.A new
httpx.AsyncClientis instantiated for each retry attempt (line 977). This loses connection pooling benefits and adds overhead. Moving the client outside the retry loop would be more efficient.♻️ Proposed refactor to reuse client
async def _post_with_retry( self, url: str, payload: dict, timeout: float, post_start: float, max_retries: int = 3, backoff_base: float = 10.0, ) -> httpx.Response: """POST with exponential backoff retry on 502/503/504 errors.""" last_response = None - for attempt in range(max_retries + 1): - if attempt > 0: - wait = backoff_base * (2 ** (attempt - 1)) - logger.warning( - f"[BASILICA] Retry {attempt}/{max_retries} after {wait:.0f}s backoff..." - ) - await asyncio.sleep(wait) - - async with httpx.AsyncClient(timeout=timeout) as client: - post_task = asyncio.create_task(client.post(url, json=payload)) - while not post_task.done(): - await asyncio.sleep(60) - if not post_task.done(): - elapsed = time.time() - post_start - logger.info( - f" [WAITING] Evaluation in progress... {elapsed:.0f}s elapsed" - ) - last_response = await post_task - - if last_response.status_code not in (502, 503, 504): - return last_response - - logger.warning(f"[BASILICA] Got {last_response.status_code} on attempt {attempt + 1}") - - return last_response + async with httpx.AsyncClient(timeout=timeout) as client: + for attempt in range(max_retries + 1): + if attempt > 0: + wait = backoff_base * (2 ** (attempt - 1)) + logger.warning( + f"[BASILICA] Retry {attempt}/{max_retries} after {wait:.0f}s backoff..." + ) + await asyncio.sleep(wait) + + post_task = asyncio.create_task(client.post(url, json=payload)) + while not post_task.done(): + await asyncio.sleep(60) + if not post_task.done(): + elapsed = time.time() - post_start + logger.info( + f" [WAITING] Evaluation in progress... {elapsed:.0f}s elapsed" + ) + last_response = await post_task + + if last_response.status_code not in (502, 503, 504): + return last_response + + logger.warning(f"[BASILICA] Got {last_response.status_code} on attempt {attempt + 1}") + + return last_response🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/crusades/affinetes/runner.py` around lines 977 - 986, The code creates a new httpx.AsyncClient inside the retry loop (AsyncClient, client.post, post_task), which prevents connection reuse; move the AsyncClient instantiation out of the retry loop so the same client is reused for each attempt, call client.post to create post_task on each retry, and ensure the client is properly closed (or used in a single async with/finally) after all retries; keep the existing timeout handling and the waiting loop that inspects post_task.done() and elapsed (post_start) unchanged.
892-907: Consider addingerror_codefor deployment failures.The new exception handlers return failure results without an
error_code. Per context snippet 2, the validator usesresult.is_fatal()to decide whether to skip remaining runs, which relies onerror_code. Deployment timeouts/failures might warrant specific error codes for categorization.♻️ Proposed addition of error codes
except DeploymentTimeout as e: elapsed = time.time() - start_time logger.error(f"[BASILICA] Deployment timed out after {elapsed:.0f}s!") logger.error(f" {e}") return EvaluationResult.failure( f"Basilica deployment timeout after {elapsed:.0f}s: {e}", task_id=task_id, + error_code="DEPLOYMENT_TIMEOUT", ) except DeploymentFailed as e: elapsed = time.time() - start_time logger.error(f"[BASILICA] Deployment failed after {elapsed:.0f}s!") logger.error(f" {e}") return EvaluationResult.failure( f"Basilica deployment failed: {e}", task_id=task_id, + error_code="DEPLOYMENT_FAILED", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/crusades/affinetes/runner.py` around lines 892 - 907, The failure returns for DeploymentTimeout and DeploymentFailed lack an error_code which the validator uses via result.is_fatal(); update the two exception handlers (the except DeploymentTimeout as e and except DeploymentFailed as e blocks in runner.py) to pass a descriptive error_code to EvaluationResult.failure (e.g. "BASILICA_DEPLOYMENT_TIMEOUT" for DeploymentTimeout and "BASILICA_DEPLOYMENT_FAILED" for DeploymentFailed) while preserving the existing message, task_id and logging; ensure the error_code strings are consistent and used where other failures set error codes so is_fatal() can detect them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/crusades/affinetes/runner.py`:
- Around line 977-986: The code creates a new httpx.AsyncClient inside the retry
loop (AsyncClient, client.post, post_task), which prevents connection reuse;
move the AsyncClient instantiation out of the retry loop so the same client is
reused for each attempt, call client.post to create post_task on each retry, and
ensure the client is properly closed (or used in a single async with/finally)
after all retries; keep the existing timeout handling and the waiting loop that
inspects post_task.done() and elapsed (post_start) unchanged.
- Around line 892-907: The failure returns for DeploymentTimeout and
DeploymentFailed lack an error_code which the validator uses via
result.is_fatal(); update the two exception handlers (the except
DeploymentTimeout as e and except DeploymentFailed as e blocks in runner.py) to
pass a descriptive error_code to EvaluationResult.failure (e.g.
"BASILICA_DEPLOYMENT_TIMEOUT" for DeploymentTimeout and
"BASILICA_DEPLOYMENT_FAILED" for DeploymentFailed) while preserving the existing
message, task_id and logging; ensure the error_code strings are consistent and
used where other failures set error codes so is_fatal() can detect them.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8c076eda-488a-4af2-aa4f-2364cf7771c8
📒 Files selected for processing (1)
src/crusades/affinetes/runner.py
When Basilica's proxy returns 502 mid-evaluation, the runner retries the POST to /evaluate. The previous code spawned a second torchrun on default port 29500, colliding with the still-running first process. Three defenses: - Track the subprocess; kill the stale process group before starting new - Randomize --master_port (29500-39500) like the local Docker runner - start_new_session=True so killpg targets only torchrun + workers Made-with: Cursor
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@environments/templar/env.py`:
- Around line 3015-3029: The race is caused by using the global
_current_torchrun directly per request; change the per-request logic in the
evaluate flow to assign the spawned process to a local variable (e.g., proc) and
use that local proc for all per-request exception/timeout cleanup (call
_kill_torchrun_group(proc) and await proc.wait()) so a failing request only
kills its own subprocess; when setting/clearing the global _current_torchrun
keep it as a best-effort shared pointer but only set it to None in the finally
block if _current_torchrun is still the same process instance (compare identity
with proc) to avoid clearing or killing unrelated jobs, and update the exception
handlers that currently reference _current_torchrun (lines with
_kill_torchrun_group and await ...wait()) to use the local proc instead.
- Line 3075: The master_port selection currently uses random.randint which can
be poisoned by untrusted in-process code; capture a cryptographically secure
randrange at module import time (e.g. assign random.SystemRandom().randrange to
a module-level name like _secure_randrange) and use that secure function where
master_port is computed (the site setting master_port = 29500 +
random.randint(0, 10000)); ensure the secure capture happens before any
untrusted execution and keep usage consistent (replace random.randint calls for
master_port selection with _secure_randrange).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bbc55bfa-c549-4cea-b1f3-9bb150aa6682
📒 Files selected for processing (1)
environments/templar/env.py
| global _current_torchrun | ||
| import asyncio as _aio | ||
| import json as _json | ||
| import tempfile | ||
|
|
||
| if _current_torchrun is not None and _current_torchrun.returncode is None: | ||
| logger.warning( | ||
| f"Killing stale torchrun (pid={_current_torchrun.pid}) before new evaluation" | ||
| ) | ||
| _kill_torchrun_group(_current_torchrun) | ||
| try: | ||
| await _aio.wait_for(_current_torchrun.wait(), timeout=5) | ||
| except TimeoutError: | ||
| pass | ||
| _current_torchrun = None |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, let's find the file and check its existence
find . -name "env.py" -path "*/templar/*" 2>/dev/null | head -20Repository: one-covenant/crusades
Length of output: 94
🏁 Script executed:
#!/bin/bash
# Get line count to understand the file structure
wc -l environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 97
🏁 Script executed:
#!/bin/bash
# Read the code at the specified lines (3015-3029)
sed -n '3010,3035p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 952
🏁 Script executed:
#!/bin/bash
# Look for the _evaluate_via_torchrun function definition
grep -n "_evaluate_via_torchrun" environments/templar/env.py | head -10Repository: one-covenant/crusades
Length of output: 197
🏁 Script executed:
#!/bin/bash
# Check the lines mentioned as "Also applies to: 3087-3087, 3118-3133"
sed -n '3082,3092p' environments/templar/env.py
sed -n '3113,3138p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1245
🏁 Script executed:
#!/bin/bash
# Search for any locking mechanisms or async patterns related to torchrun
grep -n "torchrun\|_current_torchrun\|asyncio\.Lock\|Lock\|mutex\|Semaphore" environments/templar/env.py | head -30Repository: one-covenant/crusades
Length of output: 1693
🏁 Script executed:
#!/antml/bash
# Look for the /evaluate endpoint definition to check if it's single-flight
grep -n "evaluate\|/evaluate" environments/templar/env.py | grep -i "def\|route\|endpoint" | head -20Repository: one-covenant/crusades
Length of output: 343
🏁 Script executed:
#!/bin/bash
# Read the /evaluate endpoint definition
sed -n '3154,3170p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 751
🏁 Script executed:
#!/bin/bash
# Check for any concurrency control like locks or semaphores
grep -n "Lock\|Semaphore\|single.*flight\|concurrent" environments/templar/env.py | head -20Repository: one-covenant/crusades
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Look for uvicorn or FastAPI configuration to understand concurrency
grep -n "uvicorn\|FastAPI\|app\|Depends" environments/templar/env.py | head -30Repository: one-covenant/crusades
Length of output: 2201
🏁 Script executed:
#!/bin/bash
# Check task_id usage to understand request tracking
sed -n '2939,2970p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1009
🏁 Script executed:
#!/antml/bash
# Look at the full exception handling in _evaluate_via_torchrun
sed -n '3100,3145p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1603
🏁 Script executed:
#!/bin/bash
# Check the finally block to see how cleanup is handled
sed -n '3140,3155p' environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 546
🏁 Script executed:
#!/bin/bash
# Check if there are any other endpoints or handlers that might serialize requests
grep -B5 -A5 "num_gpus.*>" environments/templar/env.py | head -40Repository: one-covenant/crusades
Length of output: 1143
Fix race condition in multi-GPU cleanup: track process identity per request.
The global _current_torchrun can be overwritten by concurrent requests, causing one request's exception handler to kill another request's subprocess. Since the /evaluate endpoint is async with no concurrency control, overlapping multi-GPU calls will interleave: Request A sets _current_torchrun = proc_A, Request B overwrites it with _current_torchrun = proc_B, and if Request A times out or errors, it kills proc_B instead. The unconditional stale-process kill at lines 3020–3024 also aborts unrelated jobs without task identity verification.
Use a local proc variable to track the process started by each request. Kill only the local proc in exception handlers (lines 3119–3120, 3132–3133), and only clear the global if it still points to the same process in the finally block:
Fix outline
async def _evaluate_via_torchrun(request: EvaluateRequest) -> dict:
global _current_torchrun
+ proc: asyncio.subprocess.Process | None = None
import asyncio as _aio
import json as _json
import tempfile
if _current_torchrun is not None and _current_torchrun.returncode is None:
logger.warning(
f"Killing stale torchrun (pid={_current_torchrun.pid}) before new evaluation"
)
_kill_torchrun_group(_current_torchrun)
try:
await _aio.wait_for(_current_torchrun.wait(), timeout=5)
except TimeoutError:
pass
_current_torchrun = None
# ... setup code ...
try:
proc = await _aio.create_subprocess_exec(...)
_current_torchrun = proc
# ... rest of execution ...
except TimeoutError:
- if _current_torchrun is not None:
- _kill_torchrun_group(_current_torchrun)
+ if proc is not None:
+ _kill_torchrun_group(proc)
return { ... }
except Exception as e:
- if _current_torchrun is not None:
- _kill_torchrun_group(_current_torchrun)
+ if proc is not None:
+ _kill_torchrun_group(proc)
return { ... }
finally:
+ if _current_torchrun is proc:
+ _current_torchrun = None
for p in (params_path, script_path):
if p:
try:
os.unlink(p)
except OSError:
pass🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@environments/templar/env.py` around lines 3015 - 3029, The race is caused by
using the global _current_torchrun directly per request; change the per-request
logic in the evaluate flow to assign the spawned process to a local variable
(e.g., proc) and use that local proc for all per-request exception/timeout
cleanup (call _kill_torchrun_group(proc) and await proc.wait()) so a failing
request only kills its own subprocess; when setting/clearing the global
_current_torchrun keep it as a best-effort shared pointer but only set it to
None in the finally block if _current_torchrun is still the same process
instance (compare identity with proc) to avoid clearing or killing unrelated
jobs, and update the exception handlers that currently reference
_current_torchrun (lines with _kill_torchrun_group and await ...wait()) to use
the local proc instead.
| f.write(eval_script) | ||
| script_path = f.name | ||
|
|
||
| master_port = 29500 + random.randint(0, 10000) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "env.py" -path "*/templar/*" | head -20Repository: one-covenant/crusades
Length of output: 94
🏁 Script executed:
sed -n '3070,3080p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 457
🏁 Script executed:
sed -n '3040,3090p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1904
🏁 Script executed:
sed -n '3000,3045p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1553
🏁 Script executed:
grep -n "_reset_torch_state" ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 170
🏁 Script executed:
sed -n '271,310p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1641
🏁 Script executed:
sed -n '310,330p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 716
🏁 Script executed:
sed -n '3000,3015p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 617
🏁 Script executed:
sed -n '2240,2260p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 821
🏁 Script executed:
grep -n "def _evaluate" ./environments/templar/env.py | head -10Repository: one-covenant/crusades
Length of output: 137
🏁 Script executed:
grep -n "def _evaluate" ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 137
🏁 Script executed:
grep -n "def evaluate" ./environments/templar/env.py | head -5Repository: one-covenant/crusades
Length of output: 164
🏁 Script executed:
sed -n '3154,3180p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1344
🏁 Script executed:
sed -n '1692,1750p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 2571
🏁 Script executed:
head -1 ./environments/templar/env.py && grep -n "import random" ./environments/templar/env.py | head -10Repository: one-covenant/crusades
Length of output: 85
🏁 Script executed:
sed -n '2240,2280p' ./environments/templar/env.pyRepository: one-covenant/crusades
Length of output: 1689
Use a cryptographically trusted entropy source for master_port selection.
The single-GPU evaluation path executes untrusted miner code in-process. If untrusted code patches random.randint via frame globals manipulation, and _reset_torch_state() does not restore the random module (confirmed—it only resets torch/time functions), then subsequent multi-GPU evaluations will use the poisoned random.randint to select master_port. This allows attackers to force repeated or predictable port values, defeating the retry mitigation.
Capture random.SystemRandom().randrange at module import time to secure the entropy source before untrusted code execution:
Fix
_secure_randrange = random.SystemRandom().randrange- master_port = 29500 + random.randint(0, 10000)
+ master_port = 29500 + _secure_randrange(10001)🧰 Tools
🪛 Ruff (0.15.6)
[error] 3075-3075: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@environments/templar/env.py` at line 3075, The master_port selection
currently uses random.randint which can be poisoned by untrusted in-process
code; capture a cryptographically secure randrange at module import time (e.g.
assign random.SystemRandom().randrange to a module-level name like
_secure_randrange) and use that secure function where master_port is computed
(the site setting master_port = 29500 + random.randint(0, 10000)); ensure the
secure capture happens before any untrusted execution and keep usage consistent
(replace random.randint calls for master_port selection with _secure_randrange).
os.killpg only kills the launcher process group, but torchrun's elastic agent spawns workers in a separate process group. Walk /proc children to collect every descendant PID and SIGKILL them individually, then sleep 3s for CUDA driver to reclaim GPU memory before next evaluation. Made-with: Cursor
Made-with: Cursor
- http_timeout = max(self.timeout + 600, 1800) so cold evals (15+ min) don't trigger premature 502 retries - backoff_base raised from 10s to 60s (60s, 120s, 240s) to avoid re-sending requests while eval is still running inside the container - skip retries that would exceed overall timeout budget Made-with: Cursor
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
environments/templar/env.py (1)
3182-3188:⚠️ Potential issue | 🟠 MajorMissing
_current_torchruncleanup in finally block.The finally block cleans up temp files but doesn't clear
_current_torchrun. As part of the race condition fix, the finally block should:finally: if _current_torchrun is proc: _current_torchrun = None for p in (params_path, script_path): ...This ensures the global is only cleared if it still references the current request's process.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` around lines 3182 - 3188, In the finally block of the function that creates/launches the torchrun process, add logic to clear the module-level _current_torchrun only when it still refers to the process for this request (proc) before unlinking temp files; i.e., check if _current_torchrun is proc and set _current_torchrun = None if true, then proceed to remove params_path and script_path as currently implemented to avoid clearing another request's torchrun.
♻️ Duplicate comments (2)
environments/templar/env.py (2)
3113-3113:⚠️ Potential issue | 🟠 MajorInsecure PRNG for
master_portremains unfixed.As flagged by static analysis (Ruff S311) and the previous review:
random.randintcan be poisoned by untrusted miner code executed in single-GPU evaluations. The suggested fix (capturingrandom.SystemRandom().randrangeat module import time) was not applied.
,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` at line 3113, The assignment to master_port uses insecure random.randint; capture a cryptographically secure RNG function at module import (e.g., secure_randrange = random.SystemRandom().randrange) and replace the use in the master_port assignment so master_port = 29500 + secure_randrange(0, 10001) (reference: master_port variable in this file) to avoid PRNG poisoning from untrusted code executed later.
3052-3067:⚠️ Potential issue | 🔴 CriticalRace condition persists: global
_current_torchruncan cause cross-request interference.The global
_current_torchrunis shared across concurrent/evaluaterequests. Since FastAPI/uvicorn handles requests concurrently:
- Request A spawns
proc_A, sets_current_torchrun = proc_A- Request B calls
_evaluate_via_torchrun, seesproc_A.returncode is None, kills it as "stale"- Request A fails unexpectedly
The fix from the previous review (using a local
procvariable for per-request tracking) was not applied.
,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` around lines 3052 - 3067, The code uses the global _current_torchrun for per-request process tracking which causes cross-request races; change the function to track the spawned subprocess with a local variable (e.g., proc) instead of assigning to global _current_torchrun, call _kill_torchrun_group(proc) and await proc.wait()/asyncio.wait_for(proc.wait(), ...) on that local proc, and stop reading or mutating _current_torchrun for per-request lifecycle (leave _current_torchrun unused or reserved only for a true singleton launcher if needed). Ensure all references in this block to _current_torchrun (checks, pid logging, kill, wait, sleep) are replaced to operate on the per-request local process variable and do not affect other concurrent requests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@environments/templar/env.py`:
- Around line 3182-3188: In the finally block of the function that
creates/launches the torchrun process, add logic to clear the module-level
_current_torchrun only when it still refers to the process for this request
(proc) before unlinking temp files; i.e., check if _current_torchrun is proc and
set _current_torchrun = None if true, then proceed to remove params_path and
script_path as currently implemented to avoid clearing another request's
torchrun.
---
Duplicate comments:
In `@environments/templar/env.py`:
- Line 3113: The assignment to master_port uses insecure random.randint; capture
a cryptographically secure RNG function at module import (e.g., secure_randrange
= random.SystemRandom().randrange) and replace the use in the master_port
assignment so master_port = 29500 + secure_randrange(0, 10001) (reference:
master_port variable in this file) to avoid PRNG poisoning from untrusted code
executed later.
- Around line 3052-3067: The code uses the global _current_torchrun for
per-request process tracking which causes cross-request races; change the
function to track the spawned subprocess with a local variable (e.g., proc)
instead of assigning to global _current_torchrun, call
_kill_torchrun_group(proc) and await proc.wait()/asyncio.wait_for(proc.wait(),
...) on that local proc, and stop reading or mutating _current_torchrun for
per-request lifecycle (leave _current_torchrun unused or reserved only for a
true singleton launcher if needed). Ensure all references in this block to
_current_torchrun (checks, pid logging, kill, wait, sleep) are replaced to
operate on the per-request local process variable and do not affect other
concurrent requests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a87bb0a-a123-4f67-b932-35b864ab2283
📒 Files selected for processing (2)
environments/templar/env.pysrc/crusades/affinetes/runner.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/crusades/affinetes/runner.py
POST /evaluate now returns 202 with a job_id immediately. The runner
polls GET /eval-status/{job_id} every 30s (short requests, immune to
proxy timeouts). Removes _post_with_retry since retries are no longer
needed — the evaluation runs in a background task on the container.
Made-with: Cursor
Made-with: Cursor
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@environments/templar/env.py`:
- Around line 3304-3306: The current cleanup removes finished jobs immediately
via _jobs.pop(job_id, None) inside async with _jobs_lock when handling GET
/eval-status/{job_id}, which can lose results if the 200 response is dropped;
instead, do not delete on first successful read—either mark the job as finished
and store a completed_at timestamp in the job record and let a background task
remove entries after a short TTL, or only pop the entry after an explicit client
ack endpoint (e.g., POST /eval-ack/{job_id}); update code around _jobs_lock,
_jobs.pop and the GET /eval-status/{job_id} handling to implement one of these
strategies so completed jobs remain available for retries for a short period.
- Around line 3205-3228: The call to actor.evaluate in _run_evaluation runs
synchronously on the FastAPI event loop (Actor.evaluate has no awaits), blocking
the loop for the 1-GPU path; offload the CPU/GPU-bound work to a background
thread or executor instead of calling actor.evaluate directly. Replace the
direct await actor.evaluate(...) in _run_evaluation with an offload using
asyncio.to_thread or loop.run_in_executor (e.g., create a task that runs
asyncio.to_thread(actor.evaluate, task_id=..., seed=..., model_url=..., ...)) so
the evaluation executes off the event loop; keep the same arguments and ensure
the created task is scheduled (asyncio.create_task) so the endpoint and
/eval-status polling remain responsive.
In `@src/crusades/affinetes/runner.py`:
- Line 760: AffinetesRunner currently instantiates the SDK client without the
programmatic key; change the client creation to pass the runner's key so the SDK
uses it (e.g., construct BasilicaClient with api_key=self.basilica_api_key),
ensuring you reference the AffinetesRunner.__init__-provided
self.basilica_api_key when creating the BasilicaClient instance so it falls back
to the environment only when that attribute is None.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: addd4fbe-8a93-46ae-89f8-f63493faeff0
📒 Files selected for processing (2)
environments/templar/env.pysrc/crusades/affinetes/runner.py
| actor = get_actor() | ||
| return await actor.evaluate( | ||
| task_id=request.task_id, | ||
| seed=request.seed, | ||
| model_url=request.model_url, | ||
| data_url=request.data_url, | ||
| steps=request.steps, | ||
| batch_size=request.batch_size, | ||
| timeout=request.timeout, | ||
| sequence_length=request.sequence_length, | ||
| data_samples=request.data_samples, | ||
| code=request.code, | ||
| max_loss_difference=request.max_loss_difference, | ||
| use_random_init=request.use_random_init, | ||
| min_trainable_params_ratio=request.min_trainable_params_ratio, | ||
| min_params_changed_ratio=request.min_params_changed_ratio, | ||
| weight_relative_error_max=request.weight_relative_error_max, | ||
| timer_divergence_threshold=request.timer_divergence_threshold, | ||
| gpu_peak_tflops=request.gpu_peak_tflops, | ||
| max_plausible_mfu=request.max_plausible_mfu, | ||
| min_mfu=request.min_mfu, | ||
| require_cuda_timing=True, | ||
| num_gpus=request.num_gpus, | ||
| ) |
There was a problem hiding this comment.
asyncio.create_task() does not offload the 1-GPU evaluation path.
_run_evaluation() still executes Actor.evaluate() on the FastAPI event-loop thread. Because Actor.evaluate() in this file has no suspension points, the default num_gpus == 1 path will block /eval-status until the run finishes, so the runner’s short poll requests can still hit timeouts and recreate the same 502 failure mode this PR is trying to remove.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@environments/templar/env.py` around lines 3205 - 3228, The call to
actor.evaluate in _run_evaluation runs synchronously on the FastAPI event loop
(Actor.evaluate has no awaits), blocking the loop for the 1-GPU path; offload
the CPU/GPU-bound work to a background thread or executor instead of calling
actor.evaluate directly. Replace the direct await actor.evaluate(...) in
_run_evaluation with an offload using asyncio.to_thread or loop.run_in_executor
(e.g., create a task that runs asyncio.to_thread(actor.evaluate, task_id=...,
seed=..., model_url=..., ...)) so the evaluation executes off the event loop;
keep the same arguments and ensure the created task is scheduled
(asyncio.create_task) so the endpoint and /eval-status polling remain
responsive.
| # Clean up to avoid unbounded memory growth | ||
| async with _jobs_lock: | ||
| _jobs.pop(job_id, None) |
There was a problem hiding this comment.
Don't delete a finished job on the first status read.
If the 200 carrying the result is lost after this pop(), the next poll gets 404 and a completed evaluation becomes unrecoverable. Keep finished jobs for a short TTL, or delete them only after an explicit ack, instead of removing them on the first successful GET /eval-status/{job_id}.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@environments/templar/env.py` around lines 3304 - 3306, The current cleanup
removes finished jobs immediately via _jobs.pop(job_id, None) inside async with
_jobs_lock when handling GET /eval-status/{job_id}, which can lose results if
the 200 response is dropped; instead, do not delete on first successful
read—either mark the job as finished and store a completed_at timestamp in the
job record and let a background task remove entries after a short TTL, or only
pop the entry after an explicit client ack endpoint (e.g., POST
/eval-ack/{job_id}); update code around _jobs_lock, _jobs.pop and the GET
/eval-status/{job_id} handling to implement one of these strategies so completed
jobs remain available for retries for a short period.
| "Failed to create Basilica deployment", | ||
| task_id=task_id, | ||
| ) | ||
| client = BasilicaClient() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n src/crusades/affinetes/runner.py | sed -n '750,770p'Repository: one-covenant/crusades
Length of output: 1315
🏁 Script executed:
# Find the __init__ method of AffinetesRunner
rg -A 30 "class AffinetesRunner" src/crusades/affinetes/runner.py | head -50Repository: one-covenant/crusades
Length of output: 1032
🏁 Script executed:
# Look for basilica_api_key usage in the file
rg "basilica_api_key" src/crusades/affinetes/runner.pyRepository: one-covenant/crusades
Length of output: 393
🏁 Script executed:
# Check if there's a requirements file or setup to find SDK version
fd -e txt -e txt -e lock -e cfg | xargs rg "basilica" 2>/dev/null | head -20Repository: one-covenant/crusades
Length of output: 1499
🌐 Web query:
basilica-sdk 0.24.0 BasilicaClient api_key parameter
💡 Result:
In basilica-sdk version 0.24.0, the BasilicaClient class accepts an optional 'api_key' parameter in its constructor. The signature is: class BasilicaClient: def init( self, base_url: str = None, # Default: https://api.basilica.ai api_key: str = None, # Default: BASILICA_API_TOKEN env var ): ... The 'api_key' parameter provides the authentication token directly. If not provided, it falls back to the BASILICA_API_TOKEN environment variable or stored CLI credentials (in that order of priority). Usage examples: - client = BasilicaClient(api_key="basilica_...") - client = BasilicaClient # Uses env var BASILICA_API_TOKEN To obtain an API token: basilica tokens create
Citations:
- 1: https://pypi.org/project/basilica-sdk/0.24.0/
- 2: https://docs.basilica.ai/introduction/authentication
- 3: https://docs.basilica.ai/sdk
🏁 Script executed:
# Check if there are other BasilicaClient instantiations in the codebase with api_key
rg "BasilicaClient\(" src/Repository: one-covenant/crusades
Length of output: 135
🏁 Script executed:
# Check imports to understand where BasilicaClient comes from
rg "from.*basilica|import.*basilica" src/crusades/affinetes/runner.pyRepository: one-covenant/crusades
Length of output: 141
Pass self.basilica_api_key into the SDK client.
AffinetesRunner.__init__ accepts a programmatic Basilica API key, but line 761 always instantiates BasilicaClient() with environment-only auth. If the caller supplied the key via basilica_api_key and did not export BASILICA_API_TOKEN, deployment creation will fail even though self.basilica_api_key is populated. The current SDK (v0.24.0) exposes BasilicaClient(api_key=...) and falls back to BASILICA_API_TOKEN only when that argument is omitted.
Suggested fix
- client = BasilicaClient()
+ client = BasilicaClient(api_key=self.basilica_api_key)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| client = BasilicaClient() | |
| client = BasilicaClient(api_key=self.basilica_api_key) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/crusades/affinetes/runner.py` at line 760, AffinetesRunner currently
instantiates the SDK client without the programmatic key; change the client
creation to pass the runner's key so the SDK uses it (e.g., construct
BasilicaClient with api_key=self.basilica_api_key), ensuring you reference the
AffinetesRunner.__init__-provided self.basilica_api_key when creating the
BasilicaClient instance so it falls back to the environment only when that
attribute is None.
Made-with: Cursor
Summary by CodeRabbit
New Features
Bug Fixes
Improvements