Fix: Bad Gateway Error 502 by shivam-MBZUAI · Pull Request #61 · one-covenant/crusades

shivam-MBZUAI · 2026-03-24T04:57:11Z

Made-with: Cursor

Summary by CodeRabbit

New Features
- Evaluation now returns immediately with a job ID (HTTP 202) and supports a polling status endpoint for long-running jobs.
Bug Fixes
- Added explicit, user-facing messages for deployment timeouts and failures.
- Removed misleading network error logs and capped repeated HTTP errors during polling.
Improvements
- Increased time budget for long-running requests.
- More reliable run isolation and improved subprocess lifecycle handling (retries, cleanup, randomized ports).

Made-with: Cursor

coderabbitai · 2026-03-24T04:57:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'Fix: Bad Gateway Error 502' is partially related to the changeset—it addresses a real technical issue (proxy timeouts causing 502 errors), but does not clearly describe the primary implementation strategy (asynchronous job polling with `/eval-status` endpoint). The title is accurate but lacks specificity about the main architectural change.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/runner

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

src/crusades/affinetes/runner.py (2)

977-986: Consider reusing AsyncClient across retries.

A new httpx.AsyncClient is instantiated for each retry attempt (line 977). This loses connection pooling benefits and adds overhead. Moving the client outside the retry loop would be more efficient.

♻️ Proposed refactor to reuse client

     async def _post_with_retry(
         self,
         url: str,
         payload: dict,
         timeout: float,
         post_start: float,
         max_retries: int = 3,
         backoff_base: float = 10.0,
     ) -> httpx.Response:
         """POST with exponential backoff retry on 502/503/504 errors."""
         last_response = None
-        for attempt in range(max_retries + 1):
-            if attempt > 0:
-                wait = backoff_base * (2 ** (attempt - 1))
-                logger.warning(
-                    f"[BASILICA] Retry {attempt}/{max_retries} after {wait:.0f}s backoff..."
-                )
-                await asyncio.sleep(wait)
-
-            async with httpx.AsyncClient(timeout=timeout) as client:
-                post_task = asyncio.create_task(client.post(url, json=payload))
-                while not post_task.done():
-                    await asyncio.sleep(60)
-                    if not post_task.done():
-                        elapsed = time.time() - post_start
-                        logger.info(
-                            f"   [WAITING] Evaluation in progress... {elapsed:.0f}s elapsed"
-                        )
-                last_response = await post_task
-
-            if last_response.status_code not in (502, 503, 504):
-                return last_response
-
-            logger.warning(f"[BASILICA] Got {last_response.status_code} on attempt {attempt + 1}")
-
-        return last_response
+        async with httpx.AsyncClient(timeout=timeout) as client:
+            for attempt in range(max_retries + 1):
+                if attempt > 0:
+                    wait = backoff_base * (2 ** (attempt - 1))
+                    logger.warning(
+                        f"[BASILICA] Retry {attempt}/{max_retries} after {wait:.0f}s backoff..."
+                    )
+                    await asyncio.sleep(wait)
+
+                post_task = asyncio.create_task(client.post(url, json=payload))
+                while not post_task.done():
+                    await asyncio.sleep(60)
+                    if not post_task.done():
+                        elapsed = time.time() - post_start
+                        logger.info(
+                            f"   [WAITING] Evaluation in progress... {elapsed:.0f}s elapsed"
+                        )
+                last_response = await post_task
+
+                if last_response.status_code not in (502, 503, 504):
+                    return last_response
+
+                logger.warning(f"[BASILICA] Got {last_response.status_code} on attempt {attempt + 1}")
+
+        return last_response

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/crusades/affinetes/runner.py` around lines 977 - 986, The code creates a
new httpx.AsyncClient inside the retry loop (AsyncClient, client.post,
post_task), which prevents connection reuse; move the AsyncClient instantiation
out of the retry loop so the same client is reused for each attempt, call
client.post to create post_task on each retry, and ensure the client is properly
closed (or used in a single async with/finally) after all retries; keep the
existing timeout handling and the waiting loop that inspects post_task.done()
and elapsed (post_start) unchanged.

892-907: Consider adding error_code for deployment failures.

The new exception handlers return failure results without an error_code. Per context snippet 2, the validator uses result.is_fatal() to decide whether to skip remaining runs, which relies on error_code. Deployment timeouts/failures might warrant specific error codes for categorization.

♻️ Proposed addition of error codes

         except DeploymentTimeout as e:
             elapsed = time.time() - start_time
             logger.error(f"[BASILICA] Deployment timed out after {elapsed:.0f}s!")
             logger.error(f"   {e}")
             return EvaluationResult.failure(
                 f"Basilica deployment timeout after {elapsed:.0f}s: {e}",
                 task_id=task_id,
+                error_code="DEPLOYMENT_TIMEOUT",
             )
         except DeploymentFailed as e:
             elapsed = time.time() - start_time
             logger.error(f"[BASILICA] Deployment failed after {elapsed:.0f}s!")
             logger.error(f"   {e}")
             return EvaluationResult.failure(
                 f"Basilica deployment failed: {e}",
                 task_id=task_id,
+                error_code="DEPLOYMENT_FAILED",
             )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/crusades/affinetes/runner.py` around lines 892 - 907, The failure returns
for DeploymentTimeout and DeploymentFailed lack an error_code which the
validator uses via result.is_fatal(); update the two exception handlers (the
except DeploymentTimeout as e and except DeploymentFailed as e blocks in
runner.py) to pass a descriptive error_code to EvaluationResult.failure (e.g.
"BASILICA_DEPLOYMENT_TIMEOUT" for DeploymentTimeout and
"BASILICA_DEPLOYMENT_FAILED" for DeploymentFailed) while preserving the existing
message, task_id and logging; ensure the error_code strings are consistent and
used where other failures set error codes so is_fatal() can detect them.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/crusades/affinetes/runner.py`:
- Around line 977-986: The code creates a new httpx.AsyncClient inside the retry
loop (AsyncClient, client.post, post_task), which prevents connection reuse;
move the AsyncClient instantiation out of the retry loop so the same client is
reused for each attempt, call client.post to create post_task on each retry, and
ensure the client is properly closed (or used in a single async with/finally)
after all retries; keep the existing timeout handling and the waiting loop that
inspects post_task.done() and elapsed (post_start) unchanged.
- Around line 892-907: The failure returns for DeploymentTimeout and
DeploymentFailed lack an error_code which the validator uses via
result.is_fatal(); update the two exception handlers (the except
DeploymentTimeout as e and except DeploymentFailed as e blocks in runner.py) to
pass a descriptive error_code to EvaluationResult.failure (e.g.
"BASILICA_DEPLOYMENT_TIMEOUT" for DeploymentTimeout and
"BASILICA_DEPLOYMENT_FAILED" for DeploymentFailed) while preserving the existing
message, task_id and logging; ensure the error_code strings are consistent and
used where other failures set error codes so is_fatal() can detect them.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c076eda-488a-4af2-aa4f-2364cf7771c8

📥 Commits

Reviewing files that changed from the base of the PR and between 42517fc and 2a33e4f.

📒 Files selected for processing (1)

src/crusades/affinetes/runner.py

When Basilica's proxy returns 502 mid-evaluation, the runner retries the POST to /evaluate. The previous code spawned a second torchrun on default port 29500, colliding with the still-running first process. Three defenses: - Track the subprocess; kill the stale process group before starting new - Randomize --master_port (29500-39500) like the local Docker runner - start_new_session=True so killpg targets only torchrun + workers Made-with: Cursor

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@environments/templar/env.py`:
- Around line 3015-3029: The race is caused by using the global
_current_torchrun directly per request; change the per-request logic in the
evaluate flow to assign the spawned process to a local variable (e.g., proc) and
use that local proc for all per-request exception/timeout cleanup (call
_kill_torchrun_group(proc) and await proc.wait()) so a failing request only
kills its own subprocess; when setting/clearing the global _current_torchrun
keep it as a best-effort shared pointer but only set it to None in the finally
block if _current_torchrun is still the same process instance (compare identity
with proc) to avoid clearing or killing unrelated jobs, and update the exception
handlers that currently reference _current_torchrun (lines with
_kill_torchrun_group and await ...wait()) to use the local proc instead.
- Line 3075: The master_port selection currently uses random.randint which can
be poisoned by untrusted in-process code; capture a cryptographically secure
randrange at module import time (e.g. assign random.SystemRandom().randrange to
a module-level name like _secure_randrange) and use that secure function where
master_port is computed (the site setting master_port = 29500 +
random.randint(0, 10000)); ensure the secure capture happens before any
untrusted execution and keep usage consistent (replace random.randint calls for
master_port selection with _secure_randrange).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bbc55bfa-c549-4cea-b1f3-9bb150aa6682

📥 Commits

Reviewing files that changed from the base of the PR and between 2a33e4f and 3e418d2.

📒 Files selected for processing (1)

environments/templar/env.py

coderabbitai · 2026-03-24T05:51:49Z

+    global _current_torchrun
    import asyncio as _aio
    import json as _json
    import tempfile

+    if _current_torchrun is not None and _current_torchrun.returncode is None:
+        logger.warning(
+            f"Killing stale torchrun (pid={_current_torchrun.pid}) before new evaluation"
+        )
+        _kill_torchrun_group(_current_torchrun)
+        try:
+            await _aio.wait_for(_current_torchrun.wait(), timeout=5)
+        except TimeoutError:
+            pass
+    _current_torchrun = None


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, let's find the file and check its existence find . -name "env.py" -path "*/templar/*" 2>/dev/null | head -20

Repository: one-covenant/crusades

Length of output: 94

🏁 Script executed:

#!/bin/bash # Get line count to understand the file structure wc -l environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 97

🏁 Script executed:

#!/bin/bash # Read the code at the specified lines (3015-3029) sed -n '3010,3035p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 952

🏁 Script executed:

#!/bin/bash # Look for the _evaluate_via_torchrun function definition grep -n "_evaluate_via_torchrun" environments/templar/env.py | head -10

Repository: one-covenant/crusades

Length of output: 197

🏁 Script executed:

#!/bin/bash # Check the lines mentioned as "Also applies to: 3087-3087, 3118-3133" sed -n '3082,3092p' environments/templar/env.py sed -n '3113,3138p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1245

🏁 Script executed:

#!/bin/bash # Search for any locking mechanisms or async patterns related to torchrun grep -n "torchrun\|_current_torchrun\|asyncio\.Lock\|Lock\|mutex\|Semaphore" environments/templar/env.py | head -30

Repository: one-covenant/crusades

Length of output: 1693

🏁 Script executed:

#!/antml/bash # Look for the /evaluate endpoint definition to check if it's single-flight grep -n "evaluate\|/evaluate" environments/templar/env.py | grep -i "def\|route\|endpoint" | head -20

Repository: one-covenant/crusades

Length of output: 343

🏁 Script executed:

#!/bin/bash # Read the /evaluate endpoint definition sed -n '3154,3170p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 751

🏁 Script executed:

#!/bin/bash # Check for any concurrency control like locks or semaphores grep -n "Lock\|Semaphore\|single.*flight\|concurrent" environments/templar/env.py | head -20

Repository: one-covenant/crusades

Length of output: 47

🏁 Script executed:

#!/bin/bash # Look for uvicorn or FastAPI configuration to understand concurrency grep -n "uvicorn\|FastAPI\|app\|Depends" environments/templar/env.py | head -30

Repository: one-covenant/crusades

Length of output: 2201

🏁 Script executed:

#!/bin/bash # Check task_id usage to understand request tracking sed -n '2939,2970p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1009

🏁 Script executed:

#!/antml/bash # Look at the full exception handling in _evaluate_via_torchrun sed -n '3100,3145p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1603

🏁 Script executed:

#!/bin/bash # Check the finally block to see how cleanup is handled sed -n '3140,3155p' environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 546

🏁 Script executed:

#!/bin/bash # Check if there are any other endpoints or handlers that might serialize requests grep -B5 -A5 "num_gpus.*>" environments/templar/env.py | head -40

Repository: one-covenant/crusades

Length of output: 1143

Fix race condition in multi-GPU cleanup: track process identity per request.

The global _current_torchrun can be overwritten by concurrent requests, causing one request's exception handler to kill another request's subprocess. Since the /evaluate endpoint is async with no concurrency control, overlapping multi-GPU calls will interleave: Request A sets _current_torchrun = proc_A, Request B overwrites it with _current_torchrun = proc_B, and if Request A times out or errors, it kills proc_B instead. The unconditional stale-process kill at lines 3020–3024 also aborts unrelated jobs without task identity verification.

Use a local proc variable to track the process started by each request. Kill only the local proc in exception handlers (lines 3119–3120, 3132–3133), and only clear the global if it still points to the same process in the finally block:

Fix outline

async def _evaluate_via_torchrun(request: EvaluateRequest) -> dict: global _current_torchrun + proc: asyncio.subprocess.Process | None = None import asyncio as _aio import json as _json import tempfile if _current_torchrun is not None and _current_torchrun.returncode is None: logger.warning( f"Killing stale torchrun (pid={_current_torchrun.pid}) before new evaluation" ) _kill_torchrun_group(_current_torchrun) try: await _aio.wait_for(_current_torchrun.wait(), timeout=5) except TimeoutError: pass _current_torchrun = None # ... setup code ... try: proc = await _aio.create_subprocess_exec(...) _current_torchrun = proc # ... rest of execution ... except TimeoutError: - if _current_torchrun is not None: - _kill_torchrun_group(_current_torchrun) + if proc is not None: + _kill_torchrun_group(proc) return { ... } except Exception as e: - if _current_torchrun is not None: - _kill_torchrun_group(_current_torchrun) + if proc is not None: + _kill_torchrun_group(proc) return { ... } finally: + if _current_torchrun is proc: + _current_torchrun = None for p in (params_path, script_path): if p: try: os.unlink(p) except OSError: pass

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` around lines 3015 - 3029, The race is caused by using the global _current_torchrun directly per request; change the per-request logic in the evaluate flow to assign the spawned process to a local variable (e.g., proc) and use that local proc for all per-request exception/timeout cleanup (call _kill_torchrun_group(proc) and await proc.wait()) so a failing request only kills its own subprocess; when setting/clearing the global _current_torchrun keep it as a best-effort shared pointer but only set it to None in the finally block if _current_torchrun is still the same process instance (compare identity with proc) to avoid clearing or killing unrelated jobs, and update the exception handlers that currently reference _current_torchrun (lines with _kill_torchrun_group and await ...wait()) to use the local proc instead.

coderabbitai · 2026-03-24T05:51:49Z

            f.write(eval_script)
            script_path = f.name

+        master_port = 29500 + random.randint(0, 10000)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "env.py" -path "*/templar/*" | head -20

Repository: one-covenant/crusades

Length of output: 94

🏁 Script executed:

sed -n '3070,3080p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 457

🏁 Script executed:

sed -n '3040,3090p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1904

🏁 Script executed:

sed -n '3000,3045p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1553

🏁 Script executed:

grep -n "_reset_torch_state" ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 170

🏁 Script executed:

sed -n '271,310p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1641

🏁 Script executed:

sed -n '310,330p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 716

🏁 Script executed:

sed -n '3000,3015p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 617

🏁 Script executed:

sed -n '2240,2260p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 821

🏁 Script executed:

grep -n "def _evaluate" ./environments/templar/env.py | head -10

Repository: one-covenant/crusades

Length of output: 137

🏁 Script executed:

grep -n "def _evaluate" ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 137

🏁 Script executed:

grep -n "def evaluate" ./environments/templar/env.py | head -5

Repository: one-covenant/crusades

Length of output: 164

🏁 Script executed:

sed -n '3154,3180p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1344

🏁 Script executed:

sed -n '1692,1750p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 2571

🏁 Script executed:

head -1 ./environments/templar/env.py && grep -n "import random" ./environments/templar/env.py | head -10

Repository: one-covenant/crusades

Length of output: 85

🏁 Script executed:

sed -n '2240,2280p' ./environments/templar/env.py

Repository: one-covenant/crusades

Length of output: 1689

Use a cryptographically trusted entropy source for master_port selection.

The single-GPU evaluation path executes untrusted miner code in-process. If untrusted code patches random.randint via frame globals manipulation, and _reset_torch_state() does not restore the random module (confirmed—it only resets torch/time functions), then subsequent multi-GPU evaluations will use the poisoned random.randint to select master_port. This allows attackers to force repeated or predictable port values, defeating the retry mitigation.

Capture random.SystemRandom().randrange at module import time to secure the entropy source before untrusted code execution:

Fix

_secure_randrange = random.SystemRandom().randrange

- master_port = 29500 + random.randint(0, 10000) + master_port = 29500 + _secure_randrange(10001)

🧰 Tools

🪛 Ruff (0.15.6)

[error] 3075-3075: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` at line 3075, The master_port selection currently uses random.randint which can be poisoned by untrusted in-process code; capture a cryptographically secure randrange at module import time (e.g. assign random.SystemRandom().randrange to a module-level name like _secure_randrange) and use that secure function where master_port is computed (the site setting master_port = 29500 + random.randint(0, 10000)); ensure the secure capture happens before any untrusted execution and keep usage consistent (replace random.randint calls for master_port selection with _secure_randrange).

os.killpg only kills the launcher process group, but torchrun's elastic agent spawns workers in a separate process group. Walk /proc children to collect every descendant PID and SIGKILL them individually, then sleep 3s for CUDA driver to reclaim GPU memory before next evaluation. Made-with: Cursor

Made-with: Cursor

- http_timeout = max(self.timeout + 600, 1800) so cold evals (15+ min) don't trigger premature 502 retries - backoff_base raised from 10s to 60s (60s, 120s, 240s) to avoid re-sending requests while eval is still running inside the container - skip retries that would exceed overall timeout budget Made-with: Cursor

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

environments/templar/env.py (1)
3182-3188: ⚠️ Potential issue | 🟠 Major

Missing _current_torchrun cleanup in finally block.

The finally block cleans up temp files but doesn't clear _current_torchrun. As part of the race condition fix, the finally block should:
finally:
    if _current_torchrun is proc:
        _current_torchrun = None
    for p in (params_path, script_path):
        ...
This ensures the global is only cleared if it still references the current request's process.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@environments/templar/env.py` around lines 3182 - 3188, In the finally block
of the function that creates/launches the torchrun process, add logic to clear
the module-level _current_torchrun only when it still refers to the process for
this request (proc) before unlinking temp files; i.e., check if
_current_torchrun is proc and set _current_torchrun = None if true, then proceed
to remove params_path and script_path as currently implemented to avoid clearing
another request's torchrun.

♻️ Duplicate comments (2)

environments/templar/env.py (2)
3113-3113: ⚠️ Potential issue | 🟠 Major

Insecure PRNG for master_port remains unfixed.

As flagged by static analysis (Ruff S311) and the previous review: random.randint can be poisoned by untrusted miner code executed in single-GPU evaluations. The suggested fix (capturing random.SystemRandom().randrange at module import time) was not applied.
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@environments/templar/env.py` at line 3113, The assignment to master_port uses
insecure random.randint; capture a cryptographically secure RNG function at
module import (e.g., secure_randrange = random.SystemRandom().randrange) and
replace the use in the master_port assignment so master_port = 29500 +
secure_randrange(0, 10001) (reference: master_port variable in this file) to
avoid PRNG poisoning from untrusted code executed later.
3052-3067: ⚠️ Potential issue | 🔴 Critical

Race condition persists: global _current_torchrun can cause cross-request interference.

The global _current_torchrun is shared across concurrent /evaluate requests. Since FastAPI/uvicorn handles requests concurrently:

Request A spawns proc_A, sets _current_torchrun = proc_A

Request B calls _evaluate_via_torchrun, sees proc_A.returncode is None, kills it as "stale"

Request A fails unexpectedly

The fix from the previous review (using a local proc variable for per-request tracking) was not applied.
,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@environments/templar/env.py` around lines 3052 - 3067, The code uses the
global _current_torchrun for per-request process tracking which causes
cross-request races; change the function to track the spawned subprocess with a
local variable (e.g., proc) instead of assigning to global _current_torchrun,
call _kill_torchrun_group(proc) and await
proc.wait()/asyncio.wait_for(proc.wait(), ...) on that local proc, and stop
reading or mutating _current_torchrun for per-request lifecycle (leave
_current_torchrun unused or reserved only for a true singleton launcher if
needed). Ensure all references in this block to _current_torchrun (checks, pid
logging, kill, wait, sleep) are replaced to operate on the per-request local
process variable and do not affect other concurrent requests.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@environments/templar/env.py`:
- Around line 3182-3188: In the finally block of the function that
creates/launches the torchrun process, add logic to clear the module-level
_current_torchrun only when it still refers to the process for this request
(proc) before unlinking temp files; i.e., check if _current_torchrun is proc and
set _current_torchrun = None if true, then proceed to remove params_path and
script_path as currently implemented to avoid clearing another request's
torchrun.

---

Duplicate comments:
In `@environments/templar/env.py`:
- Line 3113: The assignment to master_port uses insecure random.randint; capture
a cryptographically secure RNG function at module import (e.g., secure_randrange
= random.SystemRandom().randrange) and replace the use in the master_port
assignment so master_port = 29500 + secure_randrange(0, 10001) (reference:
master_port variable in this file) to avoid PRNG poisoning from untrusted code
executed later.
- Around line 3052-3067: The code uses the global _current_torchrun for
per-request process tracking which causes cross-request races; change the
function to track the spawned subprocess with a local variable (e.g., proc)
instead of assigning to global _current_torchrun, call
_kill_torchrun_group(proc) and await proc.wait()/asyncio.wait_for(proc.wait(),
...) on that local proc, and stop reading or mutating _current_torchrun for
per-request lifecycle (leave _current_torchrun unused or reserved only for a
true singleton launcher if needed). Ensure all references in this block to
_current_torchrun (checks, pid logging, kill, wait, sleep) are replaced to
operate on the per-request local process variable and do not affect other
concurrent requests.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a87bb0a-a123-4f67-b932-35b864ab2283

📥 Commits

Reviewing files that changed from the base of the PR and between 8edd59d and a946bba.

📒 Files selected for processing (2)

environments/templar/env.py
src/crusades/affinetes/runner.py

🚧 Files skipped from review as they are similar to previous changes (1)

src/crusades/affinetes/runner.py

POST /evaluate now returns 202 with a job_id immediately. The runner polls GET /eval-status/{job_id} every 30s (short requests, immune to proxy timeouts). Removes _post_with_retry since retries are no longer needed — the evaluation runs in a background task on the container. Made-with: Cursor

Made-with: Cursor

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@environments/templar/env.py`:
- Around line 3304-3306: The current cleanup removes finished jobs immediately
via _jobs.pop(job_id, None) inside async with _jobs_lock when handling GET
/eval-status/{job_id}, which can lose results if the 200 response is dropped;
instead, do not delete on first successful read—either mark the job as finished
and store a completed_at timestamp in the job record and let a background task
remove entries after a short TTL, or only pop the entry after an explicit client
ack endpoint (e.g., POST /eval-ack/{job_id}); update code around _jobs_lock,
_jobs.pop and the GET /eval-status/{job_id} handling to implement one of these
strategies so completed jobs remain available for retries for a short period.
- Around line 3205-3228: The call to actor.evaluate in _run_evaluation runs
synchronously on the FastAPI event loop (Actor.evaluate has no awaits), blocking
the loop for the 1-GPU path; offload the CPU/GPU-bound work to a background
thread or executor instead of calling actor.evaluate directly. Replace the
direct await actor.evaluate(...) in _run_evaluation with an offload using
asyncio.to_thread or loop.run_in_executor (e.g., create a task that runs
asyncio.to_thread(actor.evaluate, task_id=..., seed=..., model_url=..., ...)) so
the evaluation executes off the event loop; keep the same arguments and ensure
the created task is scheduled (asyncio.create_task) so the endpoint and
/eval-status polling remain responsive.

In `@src/crusades/affinetes/runner.py`:
- Line 760: AffinetesRunner currently instantiates the SDK client without the
programmatic key; change the client creation to pass the runner's key so the SDK
uses it (e.g., construct BasilicaClient with api_key=self.basilica_api_key),
ensuring you reference the AffinetesRunner.__init__-provided
self.basilica_api_key when creating the BasilicaClient instance so it falls back
to the environment only when that attribute is None.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: addd4fbe-8a93-46ae-89f8-f63493faeff0

📥 Commits

Reviewing files that changed from the base of the PR and between a946bba and c7d734c.

📒 Files selected for processing (2)

environments/templar/env.py
src/crusades/affinetes/runner.py

coderabbitai · 2026-03-24T09:10:48Z

+    actor = get_actor()
+    return await actor.evaluate(
+        task_id=request.task_id,
+        seed=request.seed,
+        model_url=request.model_url,
+        data_url=request.data_url,
+        steps=request.steps,
+        batch_size=request.batch_size,
+        timeout=request.timeout,
+        sequence_length=request.sequence_length,
+        data_samples=request.data_samples,
+        code=request.code,
+        max_loss_difference=request.max_loss_difference,
+        use_random_init=request.use_random_init,
+        min_trainable_params_ratio=request.min_trainable_params_ratio,
+        min_params_changed_ratio=request.min_params_changed_ratio,
+        weight_relative_error_max=request.weight_relative_error_max,
+        timer_divergence_threshold=request.timer_divergence_threshold,
+        gpu_peak_tflops=request.gpu_peak_tflops,
+        max_plausible_mfu=request.max_plausible_mfu,
+        min_mfu=request.min_mfu,
+        require_cuda_timing=True,
+        num_gpus=request.num_gpus,
+    )


⚠️ Potential issue | 🔴 Critical

asyncio.create_task() does not offload the 1-GPU evaluation path.

_run_evaluation() still executes Actor.evaluate() on the FastAPI event-loop thread. Because Actor.evaluate() in this file has no suspension points, the default num_gpus == 1 path will block /eval-status until the run finishes, so the runner’s short poll requests can still hit timeouts and recreate the same 502 failure mode this PR is trying to remove.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` around lines 3205 - 3228, The call to actor.evaluate in _run_evaluation runs synchronously on the FastAPI event loop (Actor.evaluate has no awaits), blocking the loop for the 1-GPU path; offload the CPU/GPU-bound work to a background thread or executor instead of calling actor.evaluate directly. Replace the direct await actor.evaluate(...) in _run_evaluation with an offload using asyncio.to_thread or loop.run_in_executor (e.g., create a task that runs asyncio.to_thread(actor.evaluate, task_id=..., seed=..., model_url=..., ...)) so the evaluation executes off the event loop; keep the same arguments and ensure the created task is scheduled (asyncio.create_task) so the endpoint and /eval-status polling remain responsive.

coderabbitai · 2026-03-24T09:10:48Z

+    # Clean up to avoid unbounded memory growth
+    async with _jobs_lock:
+        _jobs.pop(job_id, None)


⚠️ Potential issue | 🟠 Major

Don't delete a finished job on the first status read.

If the 200 carrying the result is lost after this pop(), the next poll gets 404 and a completed evaluation becomes unrecoverable. Keep finished jobs for a short TTL, or delete them only after an explicit ack, instead of removing them on the first successful GET /eval-status/{job_id}.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@environments/templar/env.py` around lines 3304 - 3306, The current cleanup removes finished jobs immediately via _jobs.pop(job_id, None) inside async with _jobs_lock when handling GET /eval-status/{job_id}, which can lose results if the 200 response is dropped; instead, do not delete on first successful read—either mark the job as finished and store a completed_at timestamp in the job record and let a background task remove entries after a short TTL, or only pop the entry after an explicit client ack endpoint (e.g., POST /eval-ack/{job_id}); update code around _jobs_lock, _jobs.pop and the GET /eval-status/{job_id} handling to implement one of these strategies so completed jobs remain available for retries for a short period.

coderabbitai · 2026-03-24T09:10:48Z

-                    "Failed to create Basilica deployment",
-                    task_id=task_id,
-                )
+            client = BasilicaClient()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/crusades/affinetes/runner.py | sed -n '750,770p'

Repository: one-covenant/crusades

Length of output: 1315

🏁 Script executed:

# Find the __init__ method of AffinetesRunner rg -A 30 "class AffinetesRunner" src/crusades/affinetes/runner.py | head -50

Repository: one-covenant/crusades

Length of output: 1032

🏁 Script executed:

# Look for basilica_api_key usage in the file rg "basilica_api_key" src/crusades/affinetes/runner.py

Repository: one-covenant/crusades

Length of output: 393

🏁 Script executed:

# Check if there's a requirements file or setup to find SDK version fd -e txt -e txt -e lock -e cfg | xargs rg "basilica" 2>/dev/null | head -20

Repository: one-covenant/crusades

Length of output: 1499

🌐 Web query:

basilica-sdk 0.24.0 BasilicaClient api_key parameter

💡 Result:

In basilica-sdk version 0.24.0, the BasilicaClient class accepts an optional 'api_key' parameter in its constructor. The signature is: class BasilicaClient: def init( self, base_url: str = None, # Default: https://api.basilica.ai api_key: str = None, # Default: BASILICA_API_TOKEN env var ): ... The 'api_key' parameter provides the authentication token directly. If not provided, it falls back to the BASILICA_API_TOKEN environment variable or stored CLI credentials (in that order of priority). Usage examples: - client = BasilicaClient(api_key="basilica_...") - client = BasilicaClient # Uses env var BASILICA_API_TOKEN To obtain an API token: basilica tokens create

Citations:

1: https://pypi.org/project/basilica-sdk/0.24.0/

2: https://docs.basilica.ai/introduction/authentication

3: https://docs.basilica.ai/sdk

🏁 Script executed:

# Check if there are other BasilicaClient instantiations in the codebase with api_key rg "BasilicaClient\(" src/

Repository: one-covenant/crusades

Length of output: 135

🏁 Script executed:

# Check imports to understand where BasilicaClient comes from rg "from.*basilica|import.*basilica" src/crusades/affinetes/runner.py

Repository: one-covenant/crusades

Length of output: 141

Pass self.basilica_api_key into the SDK client.

AffinetesRunner.__init__ accepts a programmatic Basilica API key, but line 761 always instantiates BasilicaClient() with environment-only auth. If the caller supplied the key via basilica_api_key and did not export BASILICA_API_TOKEN, deployment creation will fail even though self.basilica_api_key is populated. The current SDK (v0.24.0) exposes BasilicaClient(api_key=...) and falls back to BASILICA_API_TOKEN only when that argument is omitted.

Suggested fix

- client = BasilicaClient() + client = BasilicaClient(api_key=self.basilica_api_key)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

client = BasilicaClient()

client = BasilicaClient(api_key=self.basilica_api_key)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/crusades/affinetes/runner.py` at line 760, AffinetesRunner currently instantiates the SDK client without the programmatic key; change the client creation to pass the runner's key so the SDK uses it (e.g., construct BasilicaClient with api_key=self.basilica_api_key), ensuring you reference the AffinetesRunner.__init__-provided self.basilica_api_key when creating the BasilicaClient instance so it falls back to the environment only when that attribute is None.

Made-with: Cursor

chore: ruff format + lint fix on runner.py

2a33e4f

Made-with: Cursor

coderabbitai Bot reviewed Mar 24, 2026

View reviewed changes

shivam-MBZUAI changed the title ~~chore: ruff format + lint fix on runner.py~~ Fix: Bad Gateway Error 502 Mar 24, 2026

coderabbitai Bot reviewed Mar 24, 2026

View reviewed changes

shivam-MBZUAI added 3 commits March 24, 2026 06:06

increase post-kill sleep to 30s for GPU memory reclaim

79709ec

Made-with: Cursor

coderabbitai Bot reviewed Mar 24, 2026

View reviewed changes

shivam-MBZUAI added 2 commits March 24, 2026 08:57

docs: update _run_basilica docstring to reflect async poll flow

0a8a80e

Made-with: Cursor

coderabbitai Bot reviewed Mar 24, 2026

View reviewed changes

shivam-MBZUAI and others added 2 commits March 24, 2026 09:32

fix: increase deploy_async timeout to 1800s for cold image pulls

de89aa0

Made-with: Cursor

Update __init__.py

6c23b2d

shivam-MBZUAI merged commit 0a74b2b into main Mar 24, 2026
1 check passed

This was referenced Mar 27, 2026

Fix: Add per-deployment Bearer auth to block rogue evaluation requests #68

Merged

Reuse Basilica deployment across N eval runs per miner to reduce eval… #69

Merged

shivam-MBZUAI deleted the fix/runner branch March 28, 2026 06:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Bad Gateway Error 502#61

Fix: Bad Gateway Error 502#61
shivam-MBZUAI merged 9 commits into
mainfrom
fix/runner

shivam-MBZUAI commented Mar 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 24, 2026 •

edited

Loading

Reviews paused

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Uh oh!

coderabbitai Bot Mar 24, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Uh oh!

coderabbitai Bot Mar 24, 2026

Uh oh!

coderabbitai Bot Mar 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	client = BasilicaClient()
	client = BasilicaClient(api_key=self.basilica_api_key)

Conversation

shivam-MBZUAI commented Mar 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shivam-MBZUAI commented Mar 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 24, 2026 •

edited

Loading