
feat(orchestrator): hot runner protocol — extensible runner registration and persistent editor providers#791

Closed
frostebite wants to merge 5 commits into main from feature/hot-runner-protocol

Conversation


@frostebite frostebite commented Mar 5, 2026

Summary

Adds a fully implemented hot runner protocol to the Orchestrator module — persistent, process-based build/test providers that keep Unity editors warm between jobs for dramatically faster iteration.

Implementation

  • HotRunnerRegistry: In-memory runner registry with file-based persistence. Tracks registered runners by ID, labels, transport, and capabilities. Supports runner expiration and cleanup.
  • HotRunnerHealthMonitor: Periodic health checking via transport ping. Tracks runner health state transitions (healthy, degraded, offline). Configurable check intervals and thresholds.
  • HotRunnerDispatcher: Routes jobs to available runners by label matching. Timeout-based dispatch with configurable fallback to cold builds. Priority-based runner selection.
  • HotRunnerService: Orchestrates the full hot runner lifecycle: registration, health monitoring, job dispatch, and result collection. Integrates with the existing provider model.
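The component names above are from this PR; the sketch below is a hypothetical illustration of how a registry and dispatcher of this shape might compose. All types, fields, and method signatures are assumptions, not the PR's actual code.

```typescript
// Hypothetical sketch of registry + dispatcher composition. Names match
// the PR's components; every signature here is an assumption.
type RunnerState = 'idle' | 'busy' | 'offline';

interface RunnerInfo {
  id: string;
  labels: string[];
  state: RunnerState;
}

class HotRunnerRegistry {
  private runners = new Map<string, RunnerInfo>();

  register(runner: RunnerInfo): void {
    this.runners.set(runner.id, { ...runner });
  }

  // Label-based matching: first idle runner carrying the requested label.
  findIdleByLabel(label: string): RunnerInfo | undefined {
    return [...this.runners.values()].find(
      (r) => r.state === 'idle' && r.labels.includes(label),
    );
  }

  update(id: string, patch: Partial<RunnerInfo>): void {
    const current = this.runners.get(id);
    if (current) this.runners.set(id, { ...current, ...patch });
  }
}

class HotRunnerDispatcher {
  constructor(private registry: HotRunnerRegistry) {}

  // Routes a job to an idle matching runner, or signals that the caller
  // should fall back to a cold build.
  dispatch(label: string): { runnerId: string } | { fallback: true } {
    const runner = this.registry.findIdleByLabel(label);
    if (!runner) return { fallback: true };
    this.registry.update(runner.id, { state: 'busy' });
    return { runnerId: runner.id };
  }
}

const registry = new HotRunnerRegistry();
registry.register({ id: 'r1', labels: ['linux64'], state: 'idle' });
const dispatcher = new HotRunnerDispatcher(registry);
const decision = dispatcher.dispatch('linux64');
```

Once the only matching runner is marked busy, a second dispatch for the same label yields the fallback signal, which is the behavior the `hotRunnerFallbackToCold` input controls.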

New action.yml inputs (7)

  • hotRunnerEnabled: Use the persistent hot runner for builds
  • hotRunnerTransport: Transport protocol: websocket, grpc, or named-pipe
  • hotRunnerHost: Hot runner host address
  • hotRunnerPort: Hot runner port number
  • hotRunnerHealthInterval: Health check interval in seconds
  • hotRunnerMaxIdle: Maximum idle time before recycling a runner
  • hotRunnerFallbackToCold: Fall back to a cold build if no hot runner is available

Usage

- uses: game-ci/unity-builder@v4
  with:
    targetPlatform: StandaloneLinux64
    hotRunnerEnabled: true
    hotRunnerTransport: websocket
    hotRunnerHost: localhost
    hotRunnerPort: 9090
    hotRunnerFallbackToCold: true

Companion Package Required

This PR implements the orchestrator/client side of the hot runner protocol. The Unity editor-side (server/worker) is implemented in game-ci/game-ci-editor-tools as a companion Unity package (PR #1) that:

  • Implements 5 transport servers: Named Pipe, File-based, HTTP (always available), WebSocket (#if GAMECI_WEBSOCKETS), SignalR (#if GAMECI_SIGNALR)
  • Registry-based TransportFactory for user extensibility
  • Auto-registers editors as hot runner workers
  • Executes build/test jobs within the warm editor
  • Reports health status (memory usage, uptime, Library hash, job count) and handles recycling
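The companion package's registry-based TransportFactory is Unity-side C#; the following is a hypothetical TypeScript analogue of the same extensibility pattern, showing why a name-keyed registry lets users add transports without modifying the factory. The `Transport` interface and function names are illustrative assumptions.

```typescript
// Registry-based transport factory sketch: transports are looked up by
// name, so adding one is a registration call rather than a code change.
interface Transport {
  kind: string;
  connect(): Promise<void>;
}

type TransportCtor = (host: string, port: number) => Transport;

const transportRegistry = new Map<string, TransportCtor>();

function registerTransport(kind: string, ctor: TransportCtor): void {
  transportRegistry.set(kind, ctor);
}

function createTransport(kind: string, host: string, port: number): Transport {
  const ctor = transportRegistry.get(kind);
  if (!ctor) throw new Error(`Unknown transport: ${kind}`);
  return ctor(host, port);
}

// A built-in transport; a user could register e.g. 'signalr' the same way.
registerTransport('websocket', (_host, _port) => ({
  kind: 'websocket',
  connect: async () => {
    // would open a WebSocket connection to the runner here
  },
}));

const transport = createTransport('websocket', 'localhost', 9090);
```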

Test coverage

35 unit tests covering:

  • Runner registration and deregistration
  • Label-based runner matching
  • Health monitoring state transitions
  • Job dispatch with timeout and fallback
  • Registry persistence (file-based)
  • Full service orchestration workflow

Related Issues

Documentation

Test plan

  • All 35 new tests pass
  • All existing tests pass — no regressions
  • tsc --noEmit — no type errors
  • CI builds on push

Summary by CodeRabbit

  • New Features
    • Hot Runner support for persistent Unity-editor builds: configurable transport/host/port, health checks, idle recycling, job dispatching, and optional fallback to traditional builds; integrated into build flow and exposed via new inputs.
  • Documentation
    • Added Hot Runner protocol documentation and specification reference.
  • Tests
    • Comprehensive test suite covering registry, health monitoring, dispatching, service flows and persistence.
  • Chores
    • CI workflow adjusted to continue on error for a macOS build job.

Tracking:

Initial scaffold for the runner registration and hot editor provider module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@frostebite frostebite added the enhancement (New feature or request) and orchestrator (Orchestrator module) labels on Mar 5, 2026

coderabbitai bot commented Mar 5, 2026

Caution

Review failed

An error occurred during the review process. Please try again later.

📝 Walkthrough

Adds a Hot Runner subsystem: new README and action inputs, hot-runner fields on BuildParameters/Input, a hot-runner runtime path in top-level flow with optional fallback, and a new orchestrator hot-runner package (types, registry with persistence, dispatcher, health monitor, service, barrel) plus comprehensive tests and minor CI/workflow branch adjustments.

Changes

Cohort / File(s) Summary
Documentation
src/model/orchestrator/runners/README.md
New README describing the Hot Runner Protocol, registration, transports, and persistent Unity editor provider protocol.
Action inputs
action.yml
Adds hot-runner inputs: hotRunnerEnabled, hotRunnerTransport, hotRunnerHost, hotRunnerPort, hotRunnerHealthInterval, hotRunnerMaxIdle, hotRunnerFallbackToCold.
Top-level flow
src/index.ts
Introduces hot-runner code path (init, submit, real-time logging, shutdown), fallback-to-cold behavior, and runColdBuild helper; imports hot-runner service/types.
Config & inputs
src/model/build-parameters.ts, src/model/input.ts
Adds hot-runner properties to BuildParameters and corresponding Input getters parsing action inputs with defaults and type conversions.
Hot-runner core
src/model/orchestrator/services/hot-runner/...
src/model/orchestrator/services/hot-runner/index.ts
New subsystem: types, HotRunnerRegistry (in-memory + persistence), HotRunnerDispatcher (dispatch, timeouts, output streaming), HotRunnerHealthMonitor (periodic health checks, recycling), HotRunnerService (initialize/register/submit/shutdown), and barrel export.
Tests
src/model/orchestrator/services/hot-runner/hot-runner.test.ts
Extensive unit tests covering registry, health monitor, dispatcher, and service using mocked transports/filesystem.
Workflows / CI
.github/workflows/build-tests-mac.yml, .github/workflows/orchestrator-async-checks.yml, src/model/orchestrator/tests/e2e/..., src/model/orchestrator/workflows/...
Switched some orchestrator clone targets to main (removed orchestrator-develop fallback) and made macOS matrix job continue-on-error.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CI as CI Action
    participant Service as HotRunnerService
    participant Registry as HotRunnerRegistry
    participant Dispatcher as HotRunnerDispatcher
    participant Runner as HotRunnerTransport
    participant Health as HotRunnerHealthMonitor

    CI->>Service: submitBuild(params)
    Service->>Registry: findAvailableRunner(requirements)
    alt runner found
        Service->>Dispatcher: dispatchJob(jobRequest, onOutput)
        Dispatcher->>Registry: markRunnerBusy(id)
        Dispatcher->>Runner: sendJob(jobRequest)
        Runner-->>Dispatcher: streamOutput/chunk
        Dispatcher-->>CI: forwardOutput(chunk)
        Runner-->>Dispatcher: jobResult
        Dispatcher->>Registry: updateRunner(onComplete)
        Dispatcher-->>Service: jobResult
        Service-->>CI: return result
    else no runner available
        Service->>Registry: wait/poll (optional)
        Registry-->>Service: none / timeout
        Service->>CI: fall back to cold build (if enabled)
    end
    Health->>Registry: runPeriodicChecks()
    Registry->>Runner: transport.healthCheck()
    alt unhealthy
        Health->>Service: recycleUnhealthyRunner(id)
        Service->>Runner: disconnect()
        Service->>Registry: unregisterRunner(id)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #789 — Implements the Hot Runner Protocol (persistent process-based build/test providers); this PR adds the protocol implementation (types, registry, dispatcher, health monitor, service).
  • #795 — Next‑gen Hot Runner feature tracking; code-level overlap with registry/dispatcher/health-monitor additions.
  • #810 — Orchestrator endgame / integration; changes align with orchestrator hot-runner integration and tests.

Possibly related PRs

Suggested reviewers

  • webbertakken
  • GabLeRoux
  • davidmfinol

Poem

🐇 I hopped into code with a curious twitch,
I registered runners and tuned every switch,
Warm editors hum and accept jobs in line,
Logs glow like carrots, streaming output divine,
A rabbit cheers — hot runners now shine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check: ✅ Passed. The title accurately summarizes the main change: implementing a hot runner protocol with extensible runner registration and persistent editor providers, which matches the core objective.
  • Description check: ✅ Passed. The description comprehensively covers changes, implementation details, new inputs, usage examples, test coverage, and related issues/documentation, aligning well with the template structure.
  • Linked Issues check: ✅ Passed. All core objectives from issue #789 are implemented: persistent process-based providers, extensible transport registration, runner registry, health monitoring, job dispatch, and structured configuration inputs.
  • Out of Scope Changes check: ✅ Passed. All changes are directly related to implementing the hot runner protocol. The single workflow modification (continue-on-error in build-tests-mac.yml) is a minor adjustment supporting the overall feature without introducing unrelated changes.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/model/orchestrator/runners/README.md`:
- Line 5: Replace the vague "See GitHub Issue for full specification." line in
README.md with a direct link to the exact issue (e.g.,
https://github.com/.../issues/789) and add a minimal protocol outline listing
the surface methods/steps: Register, Heartbeat, JobAssignment, ResultPosting;
for each include key inputs (e.g., Register: runner_id, capabilities; Heartbeat:
runner_id, status, timestamp; JobAssignment: job_id, payload; ResultPosting:
job_id, status, result). Ensure the new text is short — one sentence linking the
issue followed by 3–5 bullet points naming the protocol actions and their core
inputs so the README is usable without visiting the issue.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 59088cd2-7210-42ca-bbd2-99fd5a33d342

📥 Commits

Reviewing files that changed from the base of the PR and between 9d47543 and 49b37f7.

📒 Files selected for processing (1)
  • src/model/orchestrator/runners/README.md


Extensible runner registration and persistent Unity editor provider protocol.

See GitHub Issue for full specification.

⚠️ Potential issue | 🟡 Minor

Replace the generic issue reference with a direct link and minimal protocol outline.

“See GitHub Issue for full specification.” is too vague in standalone docs. Please link the exact issue (e.g., #789 URL) and add a short bullet list of the protocol surface (Register, Heartbeat, job assignment, result posting) plus key inputs so this README is useful without extra lookup.
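The protocol surface the reviewer asks the README to outline could be sketched as message types like these. The four action names come from the review comment; all field names and shapes are illustrative assumptions, not the actual specification.

```typescript
// Hypothetical message shapes for the four protocol actions named in
// the review (Register, Heartbeat, JobAssignment, ResultPosting).
interface RegisterMessage {
  action: 'register';
  runnerId: string;
  capabilities: { unityVersion: string; platform: string };
}

interface HeartbeatMessage {
  action: 'heartbeat';
  runnerId: string;
  status: 'healthy' | 'degraded';
  timestamp: string; // ISO-8601
}

interface JobAssignmentMessage {
  action: 'job-assignment';
  jobId: string;
  payload: Record<string, unknown>;
}

interface ResultPostingMessage {
  action: 'result-posting';
  jobId: string;
  status: 'succeeded' | 'failed';
  result: Record<string, unknown>;
}

type ProtocolMessage =
  | RegisterMessage
  | HeartbeatMessage
  | JobAssignmentMessage
  | ResultPostingMessage;

// Discriminated-union narrowing on the `action` tag.
function isHeartbeat(m: ProtocolMessage): m is HeartbeatMessage {
  return m.action === 'heartbeat';
}

const heartbeat: HeartbeatMessage = {
  action: 'heartbeat',
  runnerId: 'r1',
  status: 'healthy',
  timestamp: new Date().toISOString(),
};
```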


… monitoring, and job dispatch (#791)

Adds persistent Unity editor instance support to reduce build iteration time
by eliminating cold-start overhead. Includes:

- HotRunnerTypes: interfaces for config, status, job request/result, transport
- HotRunnerRegistry: in-memory runner management with file-based persistence
- HotRunnerHealthMonitor: periodic health checks, idle recycling, job-count recycling
- HotRunnerDispatcher: job routing with wait-for-runner, timeout, and output streaming
- HotRunnerService: high-level API integrating registry, health, and dispatch
- 34 unit tests covering registration, filtering, health, dispatch, timeout, fallback
- action.yml inputs for hot runner configuration (7 new inputs)
- Input/BuildParameters integration for hot runner settings
- index.ts wiring with cold-build fallback when hot runner unavailable

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (3)
src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts (1)

23-27: Prevent overlapping health-check cycles.

This interval starts async work every tick without guarding reentrancy. If a cycle runs longer than the interval, multiple cycles can mutate/recycle the same runners concurrently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`
around lines 23 - 27, The health-check interval starts new async cycles
unguarded, causing overlapping runs; add a reentrancy guard (e.g., a private
boolean like isHealthCheckRunning) and update the setInterval callback (the
block that calls this.runHealthChecks()) to check and return immediately if
isHealthCheckRunning is true, set isHealthCheckRunning = true before awaiting
this.runHealthChecks(), and set it false in a finally block to ensure it always
resets; alternatively replace the setInterval with a self-scheduling async loop
that awaits runHealthChecks() before scheduling the next iteration. Ensure
references to this.intervalHandle and the runHealthChecks method remain intact
and that OrchestratorLogger.logWarning still reports errors from caught
exceptions.
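The reentrancy guard the prompt describes can be sketched as follows. The class and member names follow the review text (`isHealthCheckRunning`, `intervalHandle`, `runHealthChecks`); the PR's actual monitor may differ.

```typescript
// Guarded periodic runner: a boolean flag makes each tick skip entirely
// while the previous health-check cycle is still in flight.
class GuardedMonitor {
  private isHealthCheckRunning = false;
  private intervalHandle?: ReturnType<typeof setInterval>;

  constructor(private runHealthChecks: () => Promise<void>) {}

  // One guarded cycle; returns immediately if the previous cycle is running.
  async tick(): Promise<void> {
    if (this.isHealthCheckRunning) return;
    this.isHealthCheckRunning = true;
    try {
      await this.runHealthChecks();
    } finally {
      this.isHealthCheckRunning = false; // always reset, even when a check throws
    }
  }

  start(intervalMs: number): void {
    this.intervalHandle = setInterval(() => void this.tick(), intervalMs);
  }

  stop(): void {
    if (this.intervalHandle) clearInterval(this.intervalHandle);
  }
}

// Demo: a cycle that never finishes; the second tick is skipped.
let cycles = 0;
const monitor = new GuardedMonitor(
  () =>
    new Promise<void>(() => {
      cycles++; // counts cycle starts; this promise deliberately never resolves
    }),
);
void monitor.tick(); // starts cycle 1
void monitor.tick(); // skipped while cycle 1 is in flight
```

The alternative the prompt mentions, a self-scheduling async loop that awaits each cycle before arming the next timer, avoids the flag entirely at the cost of slight drift in the check interval.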
src/model/orchestrator/services/hot-runner/hot-runner-types.ts (1)

3-3: Transport kind typing is too rigid for a custom/extensible protocol.

Line 3 hard-codes three values, which makes adding new transports a code change instead of a configuration/protocol extension.

♻️ Proposed refactor
+export type HotRunnerTransportKind = 'websocket' | 'grpc' | 'named-pipe' | (string & {});
+
 export interface HotRunnerConfig {
   enabled: boolean;
-  transport: 'websocket' | 'grpc' | 'named-pipe';
+  transport: HotRunnerTransportKind;
   host: string;
   port: number;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-types.ts` at line 3,
The transport property is currently a closed union ('websocket' | 'grpc' |
'named-pipe') which prevents adding new transports without changing code; change
it to an open, extensible type by introducing a TransportKind type alias (e.g.
export type TransportKind = 'websocket' | 'grpc' | 'named-pipe' | string) and
update the transport property to use TransportKind (transport: TransportKind),
then adjust any usages of transport in HotRunner-related code to accept the
broader type; this keeps the known literals for type hints while allowing custom
protocols to be supplied without code changes.
src/model/orchestrator/services/hot-runner/hot-runner-registry.ts (1)

68-70: Registry read APIs leak mutable internal objects.

Callers can mutate returned HotRunnerStatus objects directly, bypassing updateRunner() and persistence.

Suggested fix
   getRunner(id: string): HotRunnerStatus | undefined {
-    return this.runners.get(id);
+    const runner = this.runners.get(id);
+    return runner ? { ...runner } : undefined;
   }

   listRunners(filter?: HotRunnerFilter): HotRunnerStatus[] {
     let results = [...this.runners.values()];
@@
-    return results;
+    return results.map((runner) => ({ ...runner }));
   }

   findAvailableRunner(requirements: { unityVersion: string; platform: string }): HotRunnerStatus | undefined {
-    return this.listRunners({
+    return this.listRunners({
       state: 'idle',
       unityVersion: requirements.unityVersion,
       platform: requirements.platform,
     })[0];
   }

Also applies to: 82-98, 103-109

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts` around
lines 68 - 70, The registry's read APIs (e.g., getRunner and the other methods
that return HotRunnerStatus objects) expose internal mutable objects allowing
callers to bypass updateRunner() and persistence; fix this by returning
defensive copies instead of original references—use a deep clone (e.g.,
structuredClone or equivalent) for single HotRunnerStatus returns in getRunner
and deep-clone each value when returning arrays/collections, or return
frozen/immutable copies so callers cannot mutate internal state directly; ensure
all methods that return HotRunnerStatus instances (the getRunner implementation
and the other read APIs referenced) are updated to return cloned copies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/index.ts`:
- Around line 42-62: The current try/catch treats any exception thrown inside
the try (including errors from hotRunnerService.shutdown()) as a hotRunnerError
and may trigger the cold fallback; change the control flow so only errors from
initialize/submitBuild trigger fallback: after calling submitBuild and setting
exitCode, call hotRunnerService.shutdown() inside its own try/catch that logs
shutdown failures but does NOT rethrow; keep the outer catch to handle failures
from initialize/submitBuild and only then, when
buildParameters.hotRunnerFallbackToCold is true, call runColdBuild; reference
hotRunnerService.initialize, hotRunnerService.submitBuild,
hotRunnerService.shutdown, and runColdBuild to locate where to add the inner
try/catch and separate error handling.

In `@src/model/input.ts`:
- Around line 291-293: The hotRunnerTransport getter currently asserts arbitrary
strings to the union type; change it to retrieve the raw value via
Input.getInput('hotRunnerTransport'), check it against the allowed set
['websocket','grpc','named-pipe'], return the default 'websocket' when no value
is provided, and throw a clear Error (or process.exit with message) if the
provided value is not one of those options so invalid inputs fail fast; update
the static getter hotRunnerTransport in the Input class to perform this
validation rather than using a type assertion.
- Around line 299-309: The three getters hotRunnerPort, hotRunnerHealthInterval,
and hotRunnerMaxIdle currently parse inputs that can yield NaN; update each to
validate the parsed value is a positive integer (Number.isInteger(parsed) &&
parsed > 0) after calling Input.getInput(...), and if validation fails throw a
clear Error describing the invalid input name and value (fail fast) instead of
returning NaN or silently using it; keep existing defaults only as the string
passed into parseInt but still validate the final numeric result so invalid
strings cause a thrown error.
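The fail-fast validation these two prompts describe might look like the helpers below. The allowed transport values and defaults follow the review text; `parseTransport` and `parsePositiveInt` are hypothetical helpers standing in for the `Input` getters, not the PR's API.

```typescript
// Hypothetical input validators: reject unknown transports and
// non-positive/NaN numeric inputs instead of silently accepting them.
const allowedTransports = ['websocket', 'grpc', 'named-pipe'] as const;
type TransportKind = (typeof allowedTransports)[number];

function parseTransport(raw: string | undefined): TransportKind {
  if (!raw) return 'websocket'; // documented default
  if ((allowedTransports as readonly string[]).includes(raw)) {
    return raw as TransportKind;
  }
  throw new Error(
    `Invalid hotRunnerTransport "${raw}"; expected one of: ${allowedTransports.join(', ')}`,
  );
}

function parsePositiveInt(
  name: string,
  raw: string | undefined,
  fallback: number,
): number {
  const parsed = raw === undefined ? fallback : Number.parseInt(raw, 10);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`Invalid ${name}: "${raw}" is not a positive integer`);
  }
  return parsed;
}
```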

In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 80-87: The catch block in the executeWithTimeout path marks the
runner idle even when the underlying HotRunnerTransport.sendJob is still
running, risking reassignment before the job actually ends; update the logic in
the catch around executeWithTimeout/sendJob to distinguish timeout-induced
rejection from real failures (inspect the thrown error or return a timeout
sentinel), and on timeout either keep the runner in a non-idle state (e.g.,
leave state='running' or mark 'timed_out') until sendJob completes or implement
an abort/cancellation flow: add cancellation/abort support to
HotRunnerTransport.sendJob (or pass an AbortSignal from executeWithTimeout),
call the abort in the timeout handler, and only call
registry.updateRunner(runner.id, { state: 'idle', currentJob: undefined }) after
the job has actually finished or after abort completes.
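The abort-aware timeout this prompt suggests could be sketched like this. `executeWithTimeout` is named after the review text; passing an `AbortSignal` into the job is an assumed design, not the PR's implementation.

```typescript
// Timeout that cancels the underlying job via AbortSignal rather than
// abandoning it: the runner can only be marked idle after `job` settles,
// because control does not return until the awaited promise resolves
// or rejects.
async function executeWithTimeout<T>(
  job: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await job(controller.signal);
  } finally {
    clearTimeout(timer); // no lingering timer on the fast path
  }
}
```

A transport's `sendJob` would then observe the signal and reject on abort, so the dispatcher's "mark idle" step runs only once the job has genuinely ended.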

In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`:
- Around line 49-57: When transport is missing (transport =
this.transports.get(runnerId) null check), don't mark the runner immediately as
'unhealthy' and let it be recycled; instead update the registry with a transient
state like 'missing-transport' (use this.registry.updateRunner(runnerId, {
state: 'missing-transport', lastHealthCheck: new Date().toISOString() })), log
with OrchestratorLogger.logWarning as before, and return without triggering the
unhealthy/recycle path (i.e., avoid returning the value that your caller treats
as "unhealthy" or add a short grace window flag) so restored runners can
reattach transports before eviction; ensure corresponding checks elsewhere that
look for 'unhealthy' (or the boolean returned) are adjusted to only recycle true
'unhealthy' states.
- Around line 63-67: The problem is that lastHealthCheck (set in
registry.updateRunner inside hot-runner-health-monitor) is updated on every
successful poll, which prevents idleAge (computed later) from ever reaching
maxIdleTime; fix by separating health polling from activity timestamps: stop
using the poll timestamp as the runner's activity marker and instead update
lastHealthCheck for health status only while using a distinct lastActiveAt (or
preserve runner.createdAt/lastTaskTimestamp) for idle calculations; update
registry.updateRunner to accept and persist a separate lastActiveAt (or only set
lastHealthCheck when health state changes) and change the idleAge computation to
read runner.lastActiveAt (or createdAt/lastTaskTimestamp) when comparing against
maxIdleTime.
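Separating poll timestamps from activity timestamps, as this prompt suggests, might look like the sketch below. The field names `lastHealthCheck` and `lastActiveAt` follow the review text; the record shape is an assumption.

```typescript
// Health polls move lastHealthCheck; only real job activity moves
// lastActiveAt, so idle age can actually reach maxIdleTime.
interface RunnerTimestamps {
  lastHealthCheck: number; // ms since epoch, updated by every poll
  lastActiveAt: number; // ms since epoch, updated only on job activity
}

function recordPoll(r: RunnerTimestamps, now: number): void {
  r.lastHealthCheck = now; // deliberately does NOT touch lastActiveAt
}

function recordJob(r: RunnerTimestamps, now: number): void {
  r.lastActiveAt = now;
  r.lastHealthCheck = now;
}

function shouldRecycleForIdle(
  r: RunnerTimestamps,
  now: number,
  maxIdleMs: number,
): boolean {
  return now - r.lastActiveAt > maxIdleMs;
}
```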

In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts`:
- Around line 144-164: The parseCustomParameters function currently tokenizes
raw by splitting on whitespace which breaks quoted values like -defineSymbols "A
B"; update parseCustomParameters to perform quote-aware tokenization (or use a
regex that matches flag tokens and quoted/unquoted values) so that values
wrapped in single or double quotes are kept as one token, strip surrounding
quotes and unescape inner characters, then proceed with the existing logic that
extracts key (from part.replace(/^-+/, '')) and assigns value (handling
key=value, next-token value, or boolean true); ensure the tokenizer feeds into
the same for-loop/parts handling so variables parts, part, key and the i
increment logic continue to work correctly.
- Around line 106-124: The shutdown() method stops monitoring and disconnects
transports but never unregisters runners from the registry, leaving stale
entries; update shutdown() to iterate over the registered runner IDs (use
this.transports.keys() or the same map used to track runners) and call the
registry removal method (e.g., this.registry.unregisterRunner(id) or the
appropriate unregister method on the orchestrator registry) for each runner
before clearing this.transports, ensuring all registry entries are removed even
if transport.disconnect() fails (handle errors the same way as current
disconnect error handling).
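The quote-aware tokenization asked for in the `parseCustomParameters` comment could be sketched as below. The overall shape (flag tokens, next-token values, boolean flags) follows the review text; this is a simplified stand-in, not the PR's function.

```typescript
// Quote-aware tokenizer: values in single or double quotes stay one
// token, with the surrounding quotes stripped.
function tokenize(raw: string): string[] {
  const tokens: string[] = [];
  const pattern = /"([^"]*)"|'([^']*)'|(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(raw)) !== null) {
    tokens.push(match[1] ?? match[2] ?? match[3] ?? '');
  }
  return tokens;
}

// Simplified flag parser fed by the tokenizer: `-key value` pairs and
// bare `-flag` booleans.
function parseCustomParameters(raw: string): Record<string, string | boolean> {
  const parsed: Record<string, string | boolean> = {};
  const parts = tokenize(raw);
  for (let i = 0; i < parts.length; i++) {
    const part = parts[i];
    if (!part.startsWith('-')) continue;
    const key = part.replace(/^-+/, '');
    const next = parts[i + 1];
    if (next !== undefined && !next.startsWith('-')) {
      parsed[key] = next;
      i++; // consume the value token
    } else {
      parsed[key] = true; // bare flag
    }
  }
  return parsed;
}
```

With this tokenizer, `-defineSymbols "A B"` yields a single `A B` value instead of two broken tokens.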

In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts`:
- Around line 488-491: The test’s mock transport uses sendJob mocked with
setTimeout(60000) which creates a long-lived timer; replace that with a truly
never-resolving promise or use Jest fake timers so no real timer is scheduled.
Concretely, update the mock created by createMockTransport in hot-runner.test.ts
so sendJob returns a Promise that never resolves (e.g., new Promise(() => {}) )
or switch the test to use jest.useFakeTimers() and advance timers as needed
instead of setTimeout, ensuring no 60s timer handle remains.
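The never-resolving mock this prompt suggests is a one-liner, sketched below together with a small `Promise.race` timeout that shows the test can still observe the hang without any 60-second timer. `jest.useFakeTimers()` is real Jest API; the transport shape is assumed.

```typescript
// A sendJob mock that hangs forever without scheduling any real timer.
const neverResolvingSendJob = (): Promise<never> => new Promise<never>(() => {});

// Dispatcher-side timeout still observes the hang via Promise.race.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | 'timeout'> {
  return Promise.race([
    p,
    new Promise<'timeout'>((resolve) => setTimeout(() => resolve('timeout'), ms)),
  ]);
}
```

The fake-timer alternative keeps `setTimeout(60000)` but wraps the test in `jest.useFakeTimers()` and advances the clock with `jest.advanceTimersByTime(...)`, so no real 60 s handle is ever scheduled.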


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ec0174bd-f54a-4289-a762-bf43b66c85a2

📥 Commits

Reviewing files that changed from the base of the PR and between 49b37f7 and 1bb31f3.

⛔ Files ignored due to path filters (2)
  • dist/index.js is excluded by !**/dist/**
  • dist/index.js.map is excluded by !**/dist/**, !**/*.map
📒 Files selected for processing (11)
  • action.yml
  • src/index.ts
  • src/model/build-parameters.ts
  • src/model/input.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-registry.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-service.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-types.ts
  • src/model/orchestrator/services/hot-runner/hot-runner.test.ts
  • src/model/orchestrator/services/hot-runner/index.ts

Comment on lines +42 to +62

      try {
        await hotRunnerService.initialize(hotRunnerConfig);
        const result = await hotRunnerService.submitBuild(buildParameters, (output) => {
          core.info(output);
        });

        exitCode = result.exitCode;
        core.info(`[HotRunner] Build completed with exit code ${exitCode}`);
        await hotRunnerService.shutdown();
      } catch (hotRunnerError) {
        await hotRunnerService.shutdown();

        if (buildParameters.hotRunnerFallbackToCold) {
          core.warning(
            `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`,
          );
          exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder);
        } else {
          throw hotRunnerError;
        }
      }

⚠️ Potential issue | 🟠 Major

Don’t let shutdown errors trigger a second build.

In this block, a successful hot build can still fall into fallback if hotRunnerService.shutdown() throws, which may run the same job again via cold build.

🔧 Proposed fix
-      try {
-        await hotRunnerService.initialize(hotRunnerConfig);
-        const result = await hotRunnerService.submitBuild(buildParameters, (output) => {
-          core.info(output);
-        });
-
-        exitCode = result.exitCode;
-        core.info(`[HotRunner] Build completed with exit code ${exitCode}`);
-        await hotRunnerService.shutdown();
-      } catch (hotRunnerError) {
-        await hotRunnerService.shutdown();
-
-        if (buildParameters.hotRunnerFallbackToCold) {
-          core.warning(
-            `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`,
-          );
-          exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder);
-        } else {
-          throw hotRunnerError;
-        }
-      }
+      let hotRunnerError: unknown;
+      try {
+        await hotRunnerService.initialize(hotRunnerConfig);
+        const result = await hotRunnerService.submitBuild(buildParameters, (output) => {
+          core.info(output);
+        });
+        exitCode = result.exitCode;
+        core.info(`[HotRunner] Build completed with exit code ${exitCode}`);
+      } catch (error) {
+        hotRunnerError = error;
+      } finally {
+        try {
+          await hotRunnerService.shutdown();
+        } catch (shutdownError) {
+          core.warning(`[HotRunner] Shutdown failed: ${(shutdownError as Error).message}`);
+        }
+      }
+
+      if (hotRunnerError) {
+        if (buildParameters.hotRunnerFallbackToCold) {
+          core.warning(
+            `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`,
+          );
+          exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder);
+        } else {
+          throw hotRunnerError;
+        }
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/index.ts` around lines 42 - 62, The current try/catch treats any
exception thrown inside the try (including errors from
hotRunnerService.shutdown()) as a hotRunnerError and may trigger the cold
fallback; change the control flow so only errors from initialize/submitBuild
trigger fallback: after calling submitBuild and setting exitCode, call
hotRunnerService.shutdown() inside its own try/catch that logs shutdown failures
but does NOT rethrow; keep the outer catch to handle failures from
initialize/submitBuild and only then, when
buildParameters.hotRunnerFallbackToCold is true, call runColdBuild; reference
hotRunnerService.initialize, hotRunnerService.submitBuild,
hotRunnerService.shutdown, and runColdBuild to locate where to add the inner
try/catch and separate error handling.

Comment on lines +291 to +293
static get hotRunnerTransport(): 'websocket' | 'grpc' | 'named-pipe' {
return (Input.getInput('hotRunnerTransport') ?? 'websocket') as 'websocket' | 'grpc' | 'named-pipe';
}

⚠️ Potential issue | 🟠 Major

Validate hotRunnerTransport instead of casting arbitrary strings.

Line 292 uses a type assertion, so invalid inputs are treated as valid types and fail later at runtime. Validate here and fail fast with a clear error.

🔧 Proposed fix
   static get hotRunnerTransport(): 'websocket' | 'grpc' | 'named-pipe' {
-    return (Input.getInput('hotRunnerTransport') ?? 'websocket') as 'websocket' | 'grpc' | 'named-pipe';
+    const value = (Input.getInput('hotRunnerTransport') ?? 'websocket').toLowerCase();
+    if (value === 'websocket' || value === 'grpc' || value === 'named-pipe') {
+      return value;
+    }
+    throw new Error(
+      `Invalid hotRunnerTransport "${value}". Expected one of: websocket, grpc, named-pipe.`,
+    );
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/input.ts` around lines 291 - 293, The hotRunnerTransport getter
currently asserts arbitrary strings to the union type; change it to retrieve the
raw value via Input.getInput('hotRunnerTransport'), check it against the allowed
set ['websocket','grpc','named-pipe'], return the default 'websocket' when no
value is provided, and throw a clear Error (or process.exit with message) if the
provided value is not one of those options so invalid inputs fail fast; update
the static getter hotRunnerTransport in the Input class to perform this
validation rather than using a type assertion.

Comment on lines +299 to +309
static get hotRunnerPort(): number {
return Number.parseInt(Input.getInput('hotRunnerPort') ?? '9090', 10);
}

static get hotRunnerHealthInterval(): number {
return Number.parseInt(Input.getInput('hotRunnerHealthInterval') ?? '30', 10);
}

static get hotRunnerMaxIdle(): number {
return Number.parseInt(Input.getInput('hotRunnerMaxIdle') ?? '3600', 10);
}

⚠️ Potential issue | 🟠 Major

Fail fast on invalid numeric hot-runner inputs.

Lines 300, 304, and 308 can return NaN. That can propagate into runtime timing/port logic and cause unstable behavior. Validate as positive integers before returning.

🔧 Proposed fix
+  private static parsePositiveIntInput(name: string, fallback: number): number {
+    const raw = Input.getInput(name) ?? `${fallback}`;
+    const value = Number.parseInt(raw, 10);
+    if (!Number.isFinite(value) || value <= 0) {
+      throw new Error(`Invalid ${name} value "${raw}". Expected a positive integer.`);
+    }
+    return value;
+  }
+
   static get hotRunnerPort(): number {
-    return Number.parseInt(Input.getInput('hotRunnerPort') ?? '9090', 10);
+    return Input.parsePositiveIntInput('hotRunnerPort', 9090);
   }
 
   static get hotRunnerHealthInterval(): number {
-    return Number.parseInt(Input.getInput('hotRunnerHealthInterval') ?? '30', 10);
+    return Input.parsePositiveIntInput('hotRunnerHealthInterval', 30);
   }
 
   static get hotRunnerMaxIdle(): number {
-    return Number.parseInt(Input.getInput('hotRunnerMaxIdle') ?? '3600', 10);
+    return Input.parsePositiveIntInput('hotRunnerMaxIdle', 3600);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/input.ts` around lines 299 - 309, The three getters hotRunnerPort,
hotRunnerHealthInterval, and hotRunnerMaxIdle currently parse inputs that can
yield NaN; update each to validate the parsed value is a positive integer
(Number.isInteger(parsed) && parsed > 0) after calling Input.getInput(...), and
if validation fails throw a clear Error describing the invalid input name and
value (fail fast) instead of returning NaN or silently using it; keep existing
defaults only as the string passed into parseInt but still validate the final
numeric result so invalid strings cause a thrown error.

Comment on lines +80 to +87
} catch (error: any) {
OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`);

// Mark runner as idle despite failure -- the health monitor will recycle if needed
registry.updateRunner(runner.id, {
state: 'idle',
currentJob: undefined,
});

⚠️ Potential issue | 🔴 Critical

Timeout path can free a runner that may still be executing.

executeWithTimeout rejects on timer, but the underlying sendJob keeps running. Then the catch block marks the runner idle, which allows reassignment before the original job truly ends.

🔧 Proposed mitigation
     } catch (error: any) {
       OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`);
 
-      // Mark runner as idle despite failure -- the health monitor will recycle if needed
+      // Do not immediately reuse runner on timeout; it may still be executing remotely
+      const timedOut = error instanceof Error && error.message.includes('timed out');
       registry.updateRunner(runner.id, {
-        state: 'idle',
+        state: timedOut ? 'unhealthy' : 'idle',
         currentJob: undefined,
       });
 
       throw error;
     }

Also consider extending HotRunnerTransport.sendJob(...) with cancellation/abort support so timed-out jobs can be actively stopped.

Also applies to: 123-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around
lines 80 - 87, The catch block in the executeWithTimeout path marks the runner
idle even when the underlying HotRunnerTransport.sendJob is still running,
risking reassignment before the job actually ends; update the logic in the catch
around executeWithTimeout/sendJob to distinguish timeout-induced rejection from
real failures (inspect the thrown error or return a timeout sentinel), and on
timeout either keep the runner in a non-idle state (e.g., leave state='running'
or mark 'timed_out') until sendJob completes or implement an abort/cancellation
flow: add cancellation/abort support to HotRunnerTransport.sendJob (or pass an
AbortSignal from executeWithTimeout), call the abort in the timeout handler, and
only call registry.updateRunner(runner.id, { state: 'idle', currentJob:
undefined }) after the job has actually finished or after abort completes.
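
The abort-based flow this comment suggests could look roughly like the sketch below. This is illustrative only: `HotRunnerTransport.sendJob` does not currently accept a signal, so the signal-taking `sendJob` parameter here is a hypothetical extension of the transport interface.

```typescript
// Sketch only: assumes a hypothetical sendJob variant that honors an AbortSignal.
async function sendJobWithTimeout<T>(
  sendJob: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  // Actively cancel the in-flight job when the deadline passes,
  // instead of racing past it and leaving it running on the runner.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await sendJob(controller.signal);
  } finally {
    clearTimeout(timer); // always clear the handle so no timer outlives the call
  }
}
```

With this shape, the dispatcher would only mark the runner idle after `sendJob` has actually settled, because the timeout cancels the remote call rather than abandoning it.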

Comment on lines +49 to +57
const transport = this.transports.get(runnerId);
if (!transport) {
OrchestratorLogger.logWarning(`[HotRunner] No transport for runner ${runnerId}`);
this.registry.updateRunner(runnerId, {
state: 'unhealthy',
lastHealthCheck: new Date().toISOString(),
});

return false;

⚠️ Potential issue | 🟠 Major

No-transport runners are aggressively recycled, which can purge restored state too early.

When a transport is missing, this marks the runner unhealthy; non-starting unhealthy runners are then recycled in the same cycle. During the restore flow at service initialization, this can evict persisted runners before their transports are reattached.

🔧 Proposed mitigation
   private async runHealthChecks(): Promise<void> {
     if (!this.registry) {
       return;
     }
 
     const runners = this.registry.listRunners();
 
     for (const runner of runners) {
+      if (!this.transports.has(runner.id)) {
+        OrchestratorLogger.log(`[HotRunner] Skipping ${runner.id}; transport not attached yet`);
+        continue;
+      }
+
       if (runner.state === 'stopping') {
         continue;
       }

Also applies to: 159-161

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`
around lines 49 - 57, When transport is missing (transport =
this.transports.get(runnerId) null check), don't mark the runner immediately as
'unhealthy' and let it be recycled; instead update the registry with a transient
state like 'missing-transport' (use this.registry.updateRunner(runnerId, {
state: 'missing-transport', lastHealthCheck: new Date().toISOString() })), log
with OrchestratorLogger.logWarning as before, and return without triggering the
unhealthy/recycle path (i.e., avoid returning the value that your caller treats
as "unhealthy" or add a short grace window flag) so restored runners can
reattach transports before eviction; ensure corresponding checks elsewhere that
look for 'unhealthy' (or the boolean returned) are adjusted to only recycle true
'unhealthy' states.

Comment on lines +63 to +67
const status = await transport.getStatus();
this.registry.updateRunner(runnerId, {
lastHealthCheck: new Date().toISOString(),
memoryUsageMB: status.memoryUsageMB,
uptime: status.uptime,

⚠️ Potential issue | 🟠 Major

Idle recycling is neutralized by the current timestamp source.

lastHealthCheck is refreshed on each healthy poll (Lines 65-67), but idle age is computed from that same field (Line 130). In steady state, idle runners never reach maxIdleTime.

Also applies to: 130-134, 164-168

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`
around lines 63 - 67, The problem is that lastHealthCheck (set in
registry.updateRunner inside hot-runner-health-monitor) is updated on every
successful poll, which prevents idleAge (computed later) from ever reaching
maxIdleTime; fix by separating health polling from activity timestamps: stop
using the poll timestamp as the runner's activity marker and instead update
lastHealthCheck for health status only while using a distinct lastActiveAt (or
preserve runner.createdAt/lastTaskTimestamp) for idle calculations; update
registry.updateRunner to accept and persist a separate lastActiveAt (or only set
lastHealthCheck when health state changes) and change the idleAge computation to
read runner.lastActiveAt (or createdAt/lastTaskTimestamp) when comparing against
maxIdleTime.
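
A minimal sketch of the separation this comment describes, with hypothetical field names (`lastActiveAt` is an assumption for illustration, not part of the current registry schema):

```typescript
// Hypothetical record shape; field names are assumptions, not the actual schema.
interface RunnerTimestamps {
  lastHealthCheck: string; // refreshed on every successful health poll
  lastActiveAt: string; // refreshed only when a job is dispatched or completes
}

function isIdleExpired(runner: RunnerTimestamps, maxIdleSeconds: number, now = new Date()): boolean {
  // Measure idle age from real job activity, not from the health poll,
  // so a healthy-but-idle runner can still age out and be recycled.
  const idleMs = now.getTime() - new Date(runner.lastActiveAt).getTime();
  return idleMs > maxIdleSeconds * 1000;
}
```

With this split, the health monitor can keep refreshing `lastHealthCheck` every cycle while `hotRunnerMaxIdle` is still enforced against the last real activity.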

Comment on lines +106 to +124
async shutdown(): Promise<void> {
OrchestratorLogger.log(`[HotRunner] Shutting down service`);

this.healthMonitor.stopMonitoring();

const disconnectPromises: Promise<void>[] = [];
for (const [id, transport] of this.transports.entries()) {
disconnectPromises.push(
transport.disconnect().catch((error: any) => {
OrchestratorLogger.logWarning(`[HotRunner] Error disconnecting runner ${id}: ${error.message}`);
}),
);
}
await Promise.all(disconnectPromises);

this.transports.clear();

OrchestratorLogger.log(`[HotRunner] Service shut down`);
}

⚠️ Potential issue | 🟠 Major

Shutdown does not unregister runners, leaving stale registry state.

Line 103 says shutdown should unregister all runners, but Line 121 only clears transports. Registry entries remain and can be restored later without matching transports.

Suggested fix
   async shutdown(): Promise<void> {
     OrchestratorLogger.log(`[HotRunner] Shutting down service`);

     this.healthMonitor.stopMonitoring();

     const disconnectPromises: Promise<void>[] = [];
     for (const [id, transport] of this.transports.entries()) {
       disconnectPromises.push(
         transport.disconnect().catch((error: any) => {
           OrchestratorLogger.logWarning(`[HotRunner] Error disconnecting runner ${id}: ${error.message}`);
         }),
       );
     }
     await Promise.all(disconnectPromises);

+    for (const runner of this.registry.listRunners()) {
+      this.registry.unregisterRunner(runner.id);
+    }
     this.transports.clear();

     OrchestratorLogger.log(`[HotRunner] Service shut down`);
   }
🧰 Tools
🪛 ESLint

[error] 114-114: Prefer async/await to Promise.catch()

(github/no-then)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts` around
lines 106 - 124, The shutdown() method stops monitoring and disconnects
transports but never unregisters runners from the registry, leaving stale
entries; update shutdown() to iterate over the registered runner IDs (use
this.transports.keys() or the same map used to track runners) and call the
registry removal method (e.g., this.registry.unregisterRunner(id) or the
appropriate unregister method on the orchestrator registry) for each runner
before clearing this.transports, ensuring all registry entries are removed even
if transport.disconnect() fails (handle errors the same way as current
disconnect error handling).

Comment on lines +144 to +164
private parseCustomParameters(raw: string): Record<string, string> {
const result: Record<string, string> = {};
const parts = raw.trim().split(/\s+/);

for (let i = 0; i < parts.length; i++) {
const part = parts[i];
if (part.startsWith('-')) {
const key = part.replace(/^-+/, '');
if (key.includes('=')) {
const [k, ...v] = key.split('=');
result[k] = v.join('=');
} else if (i + 1 < parts.length && !parts[i + 1].startsWith('-')) {
result[key] = parts[i + 1];
i++;
} else {
result[key] = 'true';
}
}
}

return result;

⚠️ Potential issue | 🟠 Major

Quoted custom parameter values are parsed incorrectly.

Line 146 splits on whitespace, so values like -defineSymbols "A B" are broken into separate tokens and misparsed.

Suggested fix (quote-aware tokenization)
   private parseCustomParameters(raw: string): Record<string, string> {
     const result: Record<string, string> = {};
-    const parts = raw.trim().split(/\s+/);
+    const parts = raw.match(/"[^"]*"|'[^']*'|[^\s]+/g) ?? [];
+    const unquote = (value: string): string => value.replace(/^['"]|['"]$/g, '');

-    for (let i = 0; i < parts.length; i++) {
-      const part = parts[i];
+    for (let i = 0; i < parts.length; i++) {
+      const part = parts[i];
       if (part.startsWith('-')) {
         const key = part.replace(/^-+/, '');
         if (key.includes('=')) {
           const [k, ...v] = key.split('=');
-          result[k] = v.join('=');
+          result[k] = unquote(v.join('='));
         } else if (i + 1 < parts.length && !parts[i + 1].startsWith('-')) {
-          result[key] = parts[i + 1];
+          result[key] = unquote(parts[i + 1]);
           i++;
         } else {
           result[key] = 'true';
         }
       }
     }

     return result;
   }
🧰 Tools
🪛 ESLint

[error] 148-148: The variable i should be named index. A more descriptive name will do too.

(unicorn/prevent-abbreviations)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts` around
lines 144 - 164, The parseCustomParameters function currently tokenizes raw by
splitting on whitespace which breaks quoted values like -defineSymbols "A B";
update parseCustomParameters to perform quote-aware tokenization (or use a regex
that matches flag tokens and quoted/unquoted values) so that values wrapped in
single or double quotes are kept as one token, strip surrounding quotes and
unescape inner characters, then proceed with the existing logic that extracts
key (from part.replace(/^-+/, '')) and assigns value (handling key=value,
next-token value, or boolean true); ensure the tokenizer feeds into the same
for-loop/parts handling so variables parts, part, key and the i increment logic
continue to work correctly.

Comment on lines +488 to +491
const transport = createMockTransport({
sendJob: jest.fn().mockImplementation(
() => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout
),

⚠️ Potential issue | 🟡 Minor

Timeout test creates a long-lived timer handle.

Line 490 schedules a 60s timer that outlives the assertion path and can slow/flake CI runs. Use a never-resolving promise (no timer handle) or fake timers.

🔧 Proposed fix
     const transport = createMockTransport({
       sendJob: jest.fn().mockImplementation(
-        () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout
+        () => new Promise<HotRunnerJobResult>(() => {}), // never resolves and no active timer handle
       ),
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts` around lines
488 - 491, The test’s mock transport uses sendJob mocked with setTimeout(60000)
which creates a long-lived timer; replace that with a truly never-resolving
promise or use Jest fake timers so no real timer is scheduled. Concretely,
update the mock created by createMockTransport in hot-runner.test.ts so sendJob
returns a Promise that never resolves (e.g., new Promise(() => {}) ) or switch
the test to use jest.useFakeTimers() and advance timers as needed instead of
setTimeout, ensuring no 60s timer handle remains.

@github-actions

github-actions bot commented Mar 5, 2026

Cat Gif

… safeguards

Validate runner entries when loading from hot-runners.json. Discard
corrupted entries with warnings. Add validateAndRepair() method for
runtime recovery. Validate data before persisting to prevent writing
corrupt state. Handle corrupt persistence files (invalid JSON)
gracefully. Rewrite executeWithTimeout using Promise.race to clean up
transport connections on timeout. Fix pre-existing ESLint violations
in dispatcher and test files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
src/model/orchestrator/services/hot-runner/hot-runner.test.ts (1)

621-624: ⚠️ Potential issue | 🟡 Minor

Timeout tests still create 60s real timer handles.

Both mocks schedule long-lived timers that survive past the assertion path; this can slow/flake CI.

🔧 Proposed fix
     const transport = createMockTransport({
       sendJob: jest.fn().mockImplementation(
-        () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout
+        () => new Promise<HotRunnerJobResult>(() => {}), // never resolves, no timer handle
       ),
     });
@@
     const transport = createMockTransport({
       sendJob: jest.fn().mockImplementation(
-        () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout
+        () => new Promise<HotRunnerJobResult>(() => {}), // never resolves, no timer handle
       ),
     });

Verification (read-only): expected 2 matches before fix, 0 after fix.

#!/bin/bash
rg -n "setTimeout\\(resolve, 60000\\)" src/model/orchestrator/services/hot-runner/hot-runner.test.ts

Also applies to: 638-640

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts` around lines
621 - 624, The tests create real 60s timers via setTimeout in the mocked sendJob
(and the other similar mock) which leaves long-lived timer handles; replace
those with never-resolving promises that don't schedule timers (e.g.,
mockImplementation(() => new Promise(() => {}))) so no real timer is created.
Update the createMockTransport mock for sendJob and the other matching mock to
return a promise that never resolves instead of using setTimeout, keeping the
rest of the test behavior intact.
src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts (1)

81-88: ⚠️ Potential issue | 🔴 Critical

Do not set timed-out runners back to idle immediately.

On timeout, the remote job may still be executing; immediately returning the runner to idle risks reassignment before the prior job is truly terminated.

🔧 Proposed fix
     } catch (error: any) {
       OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`);
 
-      // Mark runner as idle despite failure -- the health monitor will recycle if needed
+      const timedOut = error instanceof Error && error.message.includes('timed out');
       registry.updateRunner(runner.id, {
-        state: 'idle',
+        state: timedOut ? 'unhealthy' : 'idle',
         currentJob: undefined,
       });
+      if (timedOut) {
+        this.transports.delete(runner.id);
+      }
 
       throw error;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around
lines 81 - 88, The catch block in hot-runner-dispatcher.ts currently logs errors
and immediately calls registry.updateRunner(runner.id, { state: 'idle',
currentJob: undefined }) which wrongly returns timed-out runners to idle; change
the error handling to detect a timeout (e.g., check error.code, error.name, or
error.message for "timeout"/TimeoutError in the catch around the
dispatch/execute call) and only set the runner to 'idle' for non-timeout
failures via registry.updateRunner; for timeout cases, do not clear currentJob
or set state to 'idle'—instead update the runner to a terminal/clean-up state
like 'terminating' (or leave state unchanged) so the health monitor/recycler can
safely reclaim the runner; keep the OrchestratorLogger.logWarning call but
include the chosen timeout detection branch before calling registry.updateRunner
to apply the correct state transition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 132-136: The timeout Promise created with setTimeout
(timeoutPromise / TIMEOUT_SENTINEL using request.timeout) never clears its timer
when Promise.race resolves, causing dangling timers and leaks; modify the code
that races timeoutPromise against the send job (the Promise.race call in the
sendJob / dispatcher flow) to capture the timer id (e.g., const timeoutId =
setTimeout(...)), then after the race settles (both on success and error) call
clearTimeout(timeoutId) to cancel the pending timeout callback; ensure
clearTimeout is invoked in the same scope that awaits Promise.race so the timer
is always cleared regardless of which promise won.

In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts`:
- Around line 283-287: The loop in hot-runner-registry.ts that restores runners
accepts entries where the map key id can differ from the payload status.id;
update the restoration logic inside the for (const [id, status] of
Object.entries(data.runners)) block (used with isValidRunnerStatus) to verify
that status.id exists and equals the map key before calling this.runners.set(id,
status); if they don't match, skip the entry (and optionally log or count the
corrupt entry) so downstream code that relies on runner.id won't break.

---

Duplicate comments:
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 81-88: The catch block in hot-runner-dispatcher.ts currently logs
errors and immediately calls registry.updateRunner(runner.id, { state: 'idle',
currentJob: undefined }) which wrongly returns timed-out runners to idle; change
the error handling to detect a timeout (e.g., check error.code, error.name, or
error.message for "timeout"/TimeoutError in the catch around the
dispatch/execute call) and only set the runner to 'idle' for non-timeout
failures via registry.updateRunner; for timeout cases, do not clear currentJob
or set state to 'idle'—instead update the runner to a terminal/clean-up state
like 'terminating' (or leave state unchanged) so the health monitor/recycler can
safely reclaim the runner; keep the OrchestratorLogger.logWarning call but
include the chosen timeout detection branch before calling registry.updateRunner
to apply the correct state transition.

In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts`:
- Around line 621-624: The tests create real 60s timers via setTimeout in the
mocked sendJob (and the other similar mock) which leaves long-lived timer
handles; replace those with never-resolving promises that don't schedule timers
(e.g., mockImplementation(() => new Promise(() => {}))) so no real timer is
created. Update the createMockTransport mock for sendJob and the other matching
mock to return a promise that never resolves instead of using setTimeout,
keeping the rest of the test behavior intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f13e66b7-2ff8-48c7-9268-cbd45e630c99

📥 Commits

Reviewing files that changed from the base of the PR and between 1bb31f3 and 7c0c4c2.

⛔ Files ignored due to path filters (2)
  • dist/index.js is excluded by !**/dist/**
  • dist/index.js.map is excluded by !**/dist/**, !**/*.map
📒 Files selected for processing (3)
  • src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts
  • src/model/orchestrator/services/hot-runner/hot-runner-registry.ts
  • src/model/orchestrator/services/hot-runner/hot-runner.test.ts

Comment on lines +132 to +136
const timeoutPromise = new Promise<typeof TIMEOUT_SENTINEL>((resolve) => {
setTimeout(() => {
resolve(TIMEOUT_SENTINEL);
}, request.timeout);
});

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's read the file to see the current state
cat -n src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts | sed -n '120,160p'

Repository: game-ci/unity-builder

Length of output: 1623


🏁 Script executed:

# Also run the verification script mentioned in the review
python - <<'PY'
from pathlib import Path
p = Path("src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts")
txt = p.read_text()
s = txt.find("private async executeWithTimeout")
e = txt.find("private sleep", s)
if s == -1:
    print("executeWithTimeout method not found")
elif e == -1:
    # Look for the next method instead
    e = txt.find("private ", s + 1)
    if e == -1:
        e = len(txt)
block = txt[s:e] if s != -1 else ""
print("setTimeout found:", "setTimeout(" in block)
print("clearTimeout found:", "clearTimeout(" in block)
print("\n--- Method content preview (first 1500 chars) ---")
if block:
    print(block[:1500])
PY

Repository: game-ci/unity-builder

Length of output: 1123


Clear timeout handles after Promise.race settles.

The setTimeout is never cleared when sendJob wins the race, leaving a dangling timer for every successful dispatch. Even when the timeout fires, the callback only resolves an already-settled race, wasting resources. Over repeated dispatches, these orphaned handles accumulate into a memory leak.

🔧 Proposed fix
   private async executeWithTimeout(
     transport: HotRunnerTransport,
     request: HotRunnerJobRequest,
   ): Promise<HotRunnerJobResult> {
     const TIMEOUT_SENTINEL = Symbol('timeout');
+    let timeoutHandle: ReturnType<typeof setTimeout> | undefined;
 
     const timeoutPromise = new Promise<typeof TIMEOUT_SENTINEL>((resolve) => {
-      setTimeout(() => {
+      timeoutHandle = setTimeout(() => {
         resolve(TIMEOUT_SENTINEL);
       }, request.timeout);
     });
 
-    const result = await Promise.race([transport.sendJob(request), timeoutPromise]);
+    try {
+      const result = await Promise.race([transport.sendJob(request), timeoutPromise]);
 
-    if (result === TIMEOUT_SENTINEL) {
-      // Disconnect the transport to clean up the orphaned sendJob call
-      try {
-        await transport.disconnect();
-      } catch (disconnectError: any) {
-        OrchestratorLogger.logWarning(
-          `[HotRunner] Error disconnecting transport after timeout for job ${request.jobId}: ${disconnectError.message}`,
-        );
+      if (result === TIMEOUT_SENTINEL) {
+        // Disconnect the transport to clean up the orphaned sendJob call
+        try {
+          await transport.disconnect();
+        } catch (disconnectError: any) {
+          OrchestratorLogger.logWarning(
+            `[HotRunner] Error disconnecting transport after timeout for job ${request.jobId}: ${disconnectError.message}`,
+          );
+        }
+
+        throw new Error(`[HotRunner] Job ${request.jobId} timed out after ${request.timeout}ms`);
       }
 
-      throw new Error(`[HotRunner] Job ${request.jobId} timed out after ${request.timeout}ms`);
+      return result;
+    } finally {
+      if (timeoutHandle) {
+        clearTimeout(timeoutHandle);
+      }
     }
-
-    return result;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around
lines 132 - 136, The timeout Promise created with setTimeout (timeoutPromise /
TIMEOUT_SENTINEL using request.timeout) never clears its timer when Promise.race
resolves, causing dangling timers and leaks; modify the code that races
timeoutPromise against the send job (the Promise.race call in the sendJob /
dispatcher flow) to capture the timer id (e.g., const timeoutId =
setTimeout(...)), then after the race settles (both on success and error) call
clearTimeout(timeoutId) to cancel the pending timeout callback; ensure
clearTimeout is invoked in the same scope that awaits Promise.race so the timer
is always cleared regardless of which promise won.
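The pattern in the proposed fix can be distilled into a standalone sketch. The helper name `runWithTimeout` and the demo values are illustrative, not from the PR; the point is that `clearTimeout` sits in a `finally` so the timer is cancelled no matter which promise wins:

```typescript
// Minimal illustration of racing a task against a timeout while always
// clearing the timer once the race settles, so no handle lingers.
const TIMEOUT = Symbol('timeout');

async function runWithTimeout<T>(task: Promise<T>, ms: number): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMEOUT>((resolve) => {
    handle = setTimeout(() => resolve(TIMEOUT), ms);
  });
  try {
    const result = await Promise.race([task, timeout]);
    if (result === TIMEOUT) {
      throw new Error(`timed out after ${ms}ms`);
    }
    return result;
  } finally {
    // Runs on both paths (success and timeout), cancelling the pending timer.
    if (handle) clearTimeout(handle);
  }
}

async function demo() {
  // Fast task wins: the long timer is cleared, so nothing keeps the loop open.
  console.log(await runWithTimeout(Promise.resolve('done'), 10_000)); // prints "done"

  // Never-settling task loses: the sentinel wins and a timeout error surfaces.
  await runWithTimeout(new Promise<string>(() => {}), 50).catch((error) =>
    console.log(error.message), // prints "timed out after 50ms"
  );
}

demo();
```

Without the `finally`, a successful dispatch with a long `request.timeout` would leave the timer pending for its full duration, which is exactly the behavior the review flags.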

Comment on lines +283 to +287
if (data.runners && typeof data.runners === 'object') {
for (const [id, status] of Object.entries(data.runners)) {
if (isValidRunnerStatus(status)) {
this.runners.set(id, status);
} else {

⚠️ Potential issue | 🟠 Major

Validate restored runner ID against its persistence key.

A corrupt entry whose persistence key differs from its payload id (`id !== status.id`) is currently accepted. That can break later lookups, because downstream code uses `runner.id` from the payload rather than the map key.

🔧 Proposed fix
     if (data.runners && typeof data.runners === 'object') {
       for (const [id, status] of Object.entries(data.runners)) {
-        if (isValidRunnerStatus(status)) {
+        if (isValidRunnerStatus(status) && status.id === id) {
           this.runners.set(id, status);
         } else {
           OrchestratorLogger.logWarning(`[HotRunner] Discarding invalid runner entry '${id}' from persistence file`);
           discarded++;
         }
       }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts` around
lines 283 - 287, The loop in hot-runner-registry.ts that restores runners
accepts entries where the map key id can differ from the payload status.id;
update the restoration logic inside the for (const [id, status] of
Object.entries(data.runners)) block (used with isValidRunnerStatus) to verify
that status.id exists and equals the map key before calling this.runners.set(id,
status); if they don't match, skip the entry (and optionally log or count the
corrupt entry) so downstream code that relies on runner.id won't break.


codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 81.56425% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.91%. Comparing base (9d47543) to head (3cac184).

| Files with missing lines | Patch % | Lines missing |
| --- | --- | --- |
| ...estrator/services/hot-runner/hot-runner-service.ts | 50.74% | 33 ⚠️ |
| ...r/services/hot-runner/hot-runner-health-monitor.ts | 69.13% | 25 ⚠️ |
| ...strator/services/hot-runner/hot-runner-registry.ts | 95.00% | 7 ⚠️ |
| ...rator/services/hot-runner/hot-runner-dispatcher.ts | 98.14% | 1 ⚠️ |
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
+ Coverage   31.25%   34.91%   +3.66%     
==========================================
  Files          84       88       +4     
  Lines        4563     4921     +358     
  Branches     1103     1191      +88     
==========================================
+ Hits         1426     1718     +292     
- Misses       3137     3203      +66     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/model/build-parameters.ts | 90.00% <ø> (ø) |
| src/model/input.ts | 90.00% <100.00%> (+1.03%) ⬆️ |
| src/model/orchestrator/workflows/async-workflow.ts | 27.77% <ø> (ø) |
| ...rchestrator/workflows/build-automation-workflow.ts | 10.44% <ø> (ø) |
| ...rator/services/hot-runner/hot-runner-dispatcher.ts | 98.14% <98.14%> (ø) |
| ...strator/services/hot-runner/hot-runner-registry.ts | 95.00% <95.00%> (ø) |
| ...r/services/hot-runner/hot-runner-health-monitor.ts | 69.13% <69.13%> (ø) |
| ...estrator/services/hot-runner/hot-runner-service.ts | 50.74% <50.74%> (ø) |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-tests-mac.yml:
- Line 15: The job-level "continue-on-error: true" in the macOS workflow should
be removed or replaced with step-level retry logic to avoid masking failures;
either remove the job-level continue-on-error (leave "fail-fast: false" as-is)
or implement the same step-level retry pattern used in Ubuntu/Windows (duplicate
build steps like "build-1", "Sleep for Retry", "build-2" with step-level
"continue-on-error" where appropriate); if flakiness is intentional, add an
inline comment explaining why macOS is allowed to be flaky.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d7aaa034-6b2b-45c3-8fde-35c051603c70

📥 Commits

Reviewing files that changed from the base of the PR and between 7c0c4c2 and 41f00bd.

📒 Files selected for processing (1)
  • .github/workflows/build-tests-mac.yml

buildForAllPlatformsMacOS:
name: ${{ matrix.targetPlatform }} on ${{ matrix.unityVersion }}
runs-on: macos-latest
continue-on-error: true

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Checking macOS workflow around line 15 ==="
head -20 .github/workflows/build-tests-mac.yml | cat -n

echo ""
echo "=== Searching continue-on-error in macOS workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-mac.yml

echo ""
echo "=== Searching continue-on-error in Ubuntu workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-ubuntu.yml

echo ""
echo "=== Searching continue-on-error in Windows workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-windows.yml

echo ""
echo "=== Checking for retry logic in macOS workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-mac.yml

echo ""
echo "=== Checking for retry logic in Ubuntu workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-ubuntu.yml

echo ""
echo "=== Checking for retry logic in Windows workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-windows.yml

Repository: game-ci/unity-builder

Length of output: 1190


🏁 Script executed:

#!/bin/bash

echo "=== Ubuntu workflow around lines 139 and 162 ==="
sed -n '130,170p' .github/workflows/build-tests-ubuntu.yml | cat -n

echo ""
echo "=== Windows workflow around lines 75 and 101 ==="
sed -n '65,110p' .github/workflows/build-tests-windows.yml | cat -n

echo ""
echo "=== Full macOS workflow to see job structure ==="
cat .github/workflows/build-tests-mac.yml | wc -l

Repository: game-ci/unity-builder

Length of output: 4120


🏁 Script executed:

#!/bin/bash

echo "=== Full macOS workflow content ==="
cat .github/workflows/build-tests-mac.yml

Repository: game-ci/unity-builder

Length of output: 3133


Job-level continue-on-error masks failures and is inconsistent with other platform workflows.

The macOS workflow applies continue-on-error: true at the job level (line 15), which causes the entire job to report success even if the build fails. This differs from the Ubuntu and Windows workflows, which use step-level continue-on-error with duplicated build steps and retry delays between attempts—allowing the job to fail overall if all retries exhaust.

The macOS workflow contains no retry logic or step duplication, unlike Ubuntu (build-1, Sleep for Retry, build-2) and Windows (build-1, Sleep for Retry, build-2). The existing fail-fast: false (line 17) already prevents one matrix failure from canceling sibling matrix jobs; the job-level continue-on-error goes further by hiding failures from the workflow status entirely.

Consider one of these approaches:

  1. Implement step-level retry logic consistent with Ubuntu/Windows workflows
  2. Remove job-level continue-on-error if the intent is only to avoid blocking matrix siblings (already handled by fail-fast: false)
  3. Add a comment explaining why macOS builds are expected to be flaky if this is intentional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build-tests-mac.yml at line 15, The job-level
"continue-on-error: true" in the macOS workflow should be removed or replaced
with step-level retry logic to avoid masking failures; either remove the
job-level continue-on-error (leave "fail-fast: false" as-is) or implement the
same step-level retry pattern used in Ubuntu/Windows (duplicate build steps like
"build-1", "Sleep for Retry", "build-2" with step-level "continue-on-error"
where appropriate); if flakiness is intentional, add an inline comment
explaining why macOS is allowed to be flaky.
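The step-level retry pattern referenced for the Ubuntu/Windows workflows looks roughly like this. Step names, the sleep duration, and the `uses: ./` reference are illustrative, not copied from the actual workflows:

```yaml
# Sketch of the duplicated-step retry pattern: the first attempt may fail
# without failing the job; a second attempt after a delay is decisive.
- name: Build (attempt 1)
  id: build-1
  continue-on-error: true
  uses: ./
  with:
    targetPlatform: ${{ matrix.targetPlatform }}

- name: Sleep for Retry
  if: steps.build-1.outcome == 'failure'
  run: sleep 120

- name: Build (attempt 2)
  id: build-2
  if: steps.build-1.outcome == 'failure'
  uses: ./
  with:
    targetPlatform: ${{ matrix.targetPlatform }}
```

Because `continue-on-error` is scoped to `build-1` only, the job still fails if `build-2` also fails, so genuine breakage is not masked the way a job-level `continue-on-error: true` masks it.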

The orchestrator-develop branch no longer exists. Update all fallback
clone commands and test fixtures to use main instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@frostebite
Copy link
Member Author

Closing — all orchestrator code has been extracted to the standalone game-ci/orchestrator repository.

Content from this PR (hot runner protocol — extensible runner registration, persistent editor providers) is fully present in the orchestrator repo. See PR #819 for the extraction.


Labels

  • enhancement: New feature or request
  • Next-Gen Orchestrator: Next-Gen experimental features
  • orchestrator: Orchestrator module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Hot Runner Protocol — persistent process-based build/test providers

1 participant