
Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system #731

frostebite merged 234 commits into main from cloud-runner-develop on Mar 3, 2026

@frostebite frostebite commented Sep 8, 2025

Summary

Major improvements to Cloud Runner with LocalStack support, rclone storage provider, dynamic provider plugin system, and enhanced CI testing capabilities.

I have contacted LocalStack to regain access to its ECS mocking functionality. For now, I am mocking ECS myself with local-docker for AWS workflows.

Changes

New Features

  • LocalStack Support: Full support for LocalStack (local AWS emulator) with per-service endpoint configuration
  • Rclone Storage Provider: Experimental storage backend using rclone, enabling any rclone-supported remote (S3, GCS, Azure, SFTP, etc.)
  • Provider Plugin System: Dynamic provider loading from GitHub repositories, local paths, or NPM packages
  • aws-local CI Mode: AWS_FORCE_PROVIDER=aws-local validates AWS CloudFormation templates while executing via local-docker (no LocalStack Pro required)
  • Resource Tracking: New diagnostic feature for monitoring disk usage and container resource allocations
  • Windows Support: Windows execution path for local runs
  • Fully self-hosted K8s and AWS testing: via k3d and LocalStack

New Action Inputs

| Input | Description |
| --- | --- |
| resourceTracking | Enable disk usage and allocation logging |
| awsEndpoint | Base AWS endpoint (for LocalStack) |
| awsCloudFormationEndpoint | CloudFormation-specific endpoint override |
| awsEcsEndpoint | ECS-specific endpoint override |
| awsKinesisEndpoint | Kinesis-specific endpoint override |
| awsCloudWatchLogsEndpoint | CloudWatch Logs-specific endpoint override |
| awsS3Endpoint | S3-specific endpoint override |
| storageProvider | Storage backend: s3 (default) or rclone |
| rcloneRemote | Rclone remote path (e.g., myremote:bucket/path) |
| cloneDepth | Git clone depth for repository (default: 50, use 0 for full clone) |
| cloudRunnerRepoName | Unity builder repo name (default: game-ci/unity-builder, useful for forks) |
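
For readers wiring up LocalStack, the relationship between the base endpoint and the per-service overrides can be sketched as follows. This is a minimal illustration, not the PR's AWS client factory: the `AwsEndpointInputs` shape, the `resolveEndpoint` helper, and the precedence (service override first, then `awsEndpoint`, then the SDK default) are assumptions inferred from the input descriptions above.

```typescript
// Hypothetical shape: field names mirror the action inputs listed in the table.
interface AwsEndpointInputs {
  awsEndpoint?: string;
  awsCloudFormationEndpoint?: string;
  awsEcsEndpoint?: string;
  awsKinesisEndpoint?: string;
  awsCloudWatchLogsEndpoint?: string;
  awsS3Endpoint?: string;
}

type AwsService = 'cloudFormation' | 'ecs' | 'kinesis' | 'cloudWatchLogs' | 's3';

// Maps each service to the input that overrides its endpoint.
const overrideKey: Record<AwsService, keyof AwsEndpointInputs> = {
  cloudFormation: 'awsCloudFormationEndpoint',
  ecs: 'awsEcsEndpoint',
  kinesis: 'awsKinesisEndpoint',
  cloudWatchLogs: 'awsCloudWatchLogsEndpoint',
  s3: 'awsS3Endpoint',
};

// Assumed precedence: service-specific override, then the base endpoint,
// then undefined (meaning "let the SDK pick its default endpoint").
function resolveEndpoint(inputs: AwsEndpointInputs, service: AwsService): string | undefined {
  return inputs[overrideKey[service]] || inputs.awsEndpoint || undefined;
}
```

With `awsEndpoint` set to a LocalStack URL and only `awsS3Endpoint` overridden, every service except S3 would resolve to the base endpoint under this sketch.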

Improvements

  • AWS Client Factory: Centralized factory for AWS clients with endpoint configuration
  • Container Endpoint Transformation: Automatic localhost → host.docker.internal / container hostname conversion for AWS/K8s containers
  • Kubernetes Endpoint Normalization: Pods can reach LocalStack via shared Docker network
  • Retry/Backoff: More robust AWS interactions with retry logic
  • Logging & Caching: Enhanced logging and caching mechanisms
  • S3-based Workspace Locking: Shared workspace coordination via S3

Bug Fixes (includes #686)

  • AWS Secrets Fix: CloudFormation now works correctly when no secrets are specified (previously failed with empty Secrets: block)
  • Image Version Format: imageRollingVersion now supports dot versions (e.g., "3.1.0")

Testing

  • New integrity workflows for K8s (k3d) and AWS (LocalStack) validation
  • Rclone integration tests with LocalStack S3 backend
  • CI-specific Jest configuration (jest.ci.config.js)
  • Removed legacy Cloud Runner CI pipeline

Documentation

  • Provider loader guide (src/model/cloud-runner/providers/README.md)

CI Testing Modes

| Mode | Environment Variable | Behavior |
| --- | --- | --- |
| Full AWS | AWS_FORCE_PROVIDER=aws | Uses AWS provider with LocalStack (requires Pro for ECS) |
| AWS-Local | AWS_FORCE_PROVIDER=aws-local | Validates CloudFormation templates, executes via local-docker |
| Auto | (unset) | Auto-detects LocalStack and falls back to local-docker |
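
The three modes can be read as a small decision on the value of `AWS_FORCE_PROVIDER`. The sketch below is illustrative only: `resolveCiProviderMode` is a hypothetical helper, and treating unrecognized values the same as an unset variable is an assumption, not documented behavior.

```typescript
type CiProviderMode = 'aws' | 'aws-local' | 'auto';

// Hypothetical helper: maps AWS_FORCE_PROVIDER to one of the three documented modes.
// The environment is passed in explicitly so the function is easy to test.
function resolveCiProviderMode(environment: Record<string, string | undefined>): CiProviderMode {
  const forced = (environment['AWS_FORCE_PROVIDER'] ?? '').trim();
  if (forced === 'aws' || forced === 'aws-local') {
    return forced;
  }

  // Assumption: unset or unrecognized values fall through to auto-detection.
  return 'auto';
}
```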

Related PRs

Checklist

  • Read the contribution guide and accept the code of conduct
  • Docs (If new inputs or outputs have been added or changes to behavior that should be documented. Please make a PR in the documentation repo)
  • Readme (updated or not needed)
  • Tests (added, updated or not needed)

Summary by CodeRabbit

  • New Features

    • Resource-tracking/diagnostics, LocalStack testing support, rclone-backed storage, dynamic provider loading, configurable clone depth and builder repo.
  • Enhancements

    • Safer caching with disk-pressure guards, improved CI/test workflows and Jest CI script, stronger Kubernetes log/diagnostics, Windows command handling, more resilient remote/git operations.
  • Chores

    • Removed legacy CI pipeline and cleaned up related workflows.


- Implemented a primary attempt to pull LFS files using GIT_PRIVATE_TOKEN.
- Added a fallback mechanism to use GITHUB_TOKEN if the initial attempt fails.
- Configured git to replace SSH and HTTPS URLs with token-based authentication for the fallback.
- Improved error handling to log specific failure messages for both token attempts.

This change ensures more robust handling of LFS file retrieval in various authentication scenarios.
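
The primary-then-fallback token flow described in this commit could look roughly like the sketch below. It is not the RemoteClient code: `pullLfsWithFallback` and the injected `pull` callback are hypothetical, and the callback stands in for whatever configures git with the token and runs `git lfs pull`.

```typescript
interface TokenCandidate {
  name: string; // e.g. 'GIT_PRIVATE_TOKEN' or 'GITHUB_TOKEN'
  value: string | undefined;
}

// Tries each token in order and resolves with the name of the first one that works.
// `pull` is a hypothetical callback that performs the LFS pull with the given token.
async function pullLfsWithFallback(
  candidates: TokenCandidate[],
  pull: (token: string) => Promise<void>,
  log: (message: string) => void = () => {},
): Promise<string> {
  for (const { name, value } of candidates) {
    if (!value) {
      log(`${name} is not set, skipping`);
      continue;
    }
    try {
      await pull(value);

      return name;
    } catch (error) {
      // Log the specific failure for this token, then move on to the next one.
      const reason = error instanceof Error ? error.message : String(error);
      log(`LFS pull with ${name} failed: ${reason}`);
    }
  }

  throw new Error('LFS pull failed with every available token');
}
```
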
- Added permissions for packages, pull-requests, statuses, and id-token to enhance workflow capabilities.
- This change improves the CI pipeline's ability to manage pull requests and access necessary resources.
…ation

- Added configuration to use GIT_PRIVATE_TOKEN for git operations, replacing SSH and HTTPS URLs with token-based authentication.
- Improved error handling to ensure GIT_PRIVATE_TOKEN availability before attempting to pull LFS files.
- This change streamlines the process of pulling LFS files in environments requiring token authentication.
…entication

- Enhanced the process of configuring git to use GIT_PRIVATE_TOKEN and GITHUB_TOKEN by clearing existing URL configurations before setting new ones.
- Improved the clarity of the URL replacement commands for better readability and maintainability.
- This change ensures a more robust setup for pulling LFS files in environments requiring token authentication.
… pipeline

- Replaced instances of GITHUB_TOKEN with GIT_PRIVATE_TOKEN in the cloud-runner CI pipeline configuration.
- This change ensures consistent use of token-based authentication across various jobs in the workflow, enhancing security and functionality.
…L unsetting

- Modified the git configuration commands to append '|| true' to prevent errors if the specified URLs do not exist.
- This change enhances the reliability of the URL clearing process in the RemoteClient class, ensuring smoother execution during token-based authentication setups.
…tion

- Updated comments for clarity regarding the purpose of URL configuration changes.
- Simplified the git configuration commands by removing redundant lines while maintaining functionality for HTTPS token-based authentication.
- This change enhances the readability and maintainability of the RemoteClient class's git setup process.
# Conflicts:
#	dist/index.js
#	dist/index.js.map
#	jest.config.js
#	yarn.lock
…logs; tests: retained workspace AWS assertion (#381)
…nd log management; update builder path logic based on provider strategy
…sed on provider strategy and credentials; update binary files
…ained markers; hooks: include AWS S3 hooks on aws provider
…t:ci script; fix(windows): skip grep-based version regex tests; logs: echo CACHE_KEY/retained markers; hooks: include AWS hooks on aws provider
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/model/build-parameters.ts`:
- Line 213: The cloneDepth assignment should validate and sanitize
CloudRunnerOptions.cloneDepth before use: parse it with Number.parseInt, then if
Number.isNaN(parsed) or parsed < 0 (or not an integer) replace it with a
sensible default (e.g. a DEFAULT_CLONE_DEPTH constant or 1); update the
cloneDepth property assignment in build-parameters.ts (the cloneDepth field and
CloudRunnerOptions.cloneDepth reference) to use the validated/fallback value so
downstream git operations never receive NaN or a negative depth.
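
A minimal sketch of the validation described above, assuming a hypothetical `sanitizeCloneDepth` helper and a `DEFAULT_CLONE_DEPTH` of 50 (the default documented for the cloneDepth input). It uses a digits-only check rather than a bare `Number.parseInt`, because `parseInt` would silently accept inputs such as `1.5` or `12abc`.

```typescript
// Assumed default, taken from the cloneDepth input description (default: 50).
const DEFAULT_CLONE_DEPTH = 50;

// Returns a non-negative integer depth; 0 is kept because it means "full clone".
function sanitizeCloneDepth(raw: string | undefined): number {
  const text = (raw ?? '').trim();

  // Accept only plain non-negative integers, so '-3', '1.5', and 'abc' all fall back.
  if (!/^\d+$/.test(text)) {
    return DEFAULT_CLONE_DEPTH;
  }

  return Number.parseInt(text, 10);
}
```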

In `@src/model/cloud-runner/providers/aws/aws-cloud-formation-templates.ts`:
- Around line 22-26: The getSecretDefinitionTemplate function currently returns
a snippet starting with the top-level "Secrets:" key which causes duplicate YAML
keys when called per-secret; change getSecretDefinitionTemplate to return only
the list item block (the "- Name: '...'\n  ValueFrom: !Ref ...") without the
"Secrets:" header, and update the caller loop that inserts these snippets to
either (a) create a single "Secrets:" header once and concatenate all list-item
snippets under it, or (b) emit the header on the first insertion only and append
subsequent list items — locate getSecretDefinitionTemplate to modify its return
string and the code that calls insertAtTemplate('p3 - container def', ...) to
ensure the "Secrets:" header is produced exactly once.
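
Option (a) above can be sketched as follows. The names `SecretReference`, `secretListItem`, and `secretsBlock` are hypothetical stand-ins, and the indentation is illustrative rather than copied from the real CloudFormation template.

```typescript
interface SecretReference {
  name: string;
  parameterRef: string; // logical ID passed to !Ref
}

// Returns only the list item, with no "Secrets:" header.
function secretListItem(secret: SecretReference): string {
  return `  - Name: '${secret.name}'\n    ValueFrom: !Ref ${secret.parameterRef}`;
}

// Emits the header exactly once, or nothing at all when there are no secrets,
// which also avoids an empty "Secrets:" block.
function secretsBlock(secrets: SecretReference[]): string {
  if (secrets.length === 0) {
    return '';
  }

  return ['Secrets:', ...secrets.map(secretListItem)].join('\n');
}
```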

In `@src/model/image-tag.ts`:
- Line 41: Validate containerRegistryImageVersion before assigning to
this.imageRollingVersion by checking it matches a Docker tag-safe pattern (e.g.,
start with an alphanumeric/underscore and only contain alphanumerics, dots,
underscores or dashes, max length ~128) and reject values containing disallowed
characters like '/' or spaces; update the assignment site (the code where
this.imageRollingVersion = containerRegistryImageVersion) to perform this check
and throw or return a clear error when the value is invalid so failures are
explicit and early.
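
One way to express that check, as a sketch: the pattern below follows the Docker tag rules (a leading letter, digit, or underscore, then up to 127 letters, digits, underscores, dots, or dashes). `assertValidImageTag` is a hypothetical helper, not a function from image-tag.ts.

```typescript
// First character: letter, digit, or underscore; then up to 127 of [A-Za-z0-9_.-].
const DOCKER_TAG_PATTERN = /^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$/;

// Returns the tag unchanged when valid, otherwise fails early with a clear message.
function assertValidImageTag(tag: string): string {
  if (!DOCKER_TAG_PATTERN.test(tag)) {
    throw new Error(
      `Invalid image version "${tag}": expected 1-128 characters from [A-Za-z0-9_.-], not starting with "." or "-"`,
    );
  }

  return tag;
}
```
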
🧹 Nitpick comments (2)
src/model/cloud-runner/remote-client/index.ts (2)

213-213: Simplify async return statement.

The function is already async, so wrapping the return value in a Promise constructor is unnecessary. This can be simplified.

♻️ Suggested simplification
-    return new Promise((result) => result(``));
+    return ``;

309-311: Address static analysis hints for naming and formatting.

ESLint flags the variable name depthArg (should be depthArgument) and Prettier flags the line formatting.

♻️ Suggested fix
-      const depthArg = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
+      const depthArgument = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
       await CloudRunnerSystem.Run(
-        `git clone ${depthArg} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(CloudRunnerFolders.repoPathAbsolute)}`.trim(),
+        `git clone ${depthArgument} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(
+          CloudRunnerFolders.repoPathAbsolute,
+        )}`.trim(),
       );

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/model/cloud-runner/providers/aws/index.ts (1)

163-171: Unconditional wait for cleanup stack deletion when it may not exist.

The DeleteStackCommand for the cleanup stack (line 151) is only sent when CloudRunnerOptions.useCleanupCron is true, but this waitUntilStackDeleteComplete for the cleanup stack is always executed. If useCleanupCron is false, the cleanup stack was never created, so this wait will either fail or timeout unnecessarily.

🔧 Proposed fix
     await waitUntilStackDeleteComplete(
       {
         client: CF,
         maxWaitTime: stackWaitTimeSeconds,
       },
       {
         StackName: taskDef.taskDefStackName,
       },
     );
+    if (CloudRunnerOptions.useCleanupCron) {
       await waitUntilStackDeleteComplete(
         {
           client: CF,
           maxWaitTime: stackWaitTimeSeconds,
         },
         {
           StackName: `${taskDef.taskDefStackName}-cleanup`,
         },
       );
+    }
src/model/cloud-runner/remote-client/index.ts (1)

363-363: Fix contradictory assertion message.

Line 363 asserts fs.existsSync(path.join('.git', 'lfs')) but the message says "LFS folder should not exist before caching". Either the assertion or message is incorrect.

If LFS folder should NOT exist at this point:

-    assert(fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
+    assert(!fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/k8s/kubernetes-task-runner.ts`:
- Line 154: The checks inside KubernetesTaskRunner that inspect error?.message
should use the normalized errorMessage variable (set earlier) instead of
accessing error.message directly; update both occurrences (the condition near
the isRunning/continueStreaming branch and the later check around line 204) to
use a safe include against errorMessage (e.g. (errorMessage ||
'').includes('previous terminated container')) so string rejections from
CloudRunnerSystem.Run are handled consistently and won't silently fail.
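
The normalization this comment relies on might look like the sketch below. `toErrorMessage` and `isPreviousContainerError` are hypothetical names; the point is that a rejection can be a plain string, an Error, or something else, and the check should behave the same in each case.

```typescript
// Normalizes anything a promise might reject with into a string.
function toErrorMessage(error: unknown): string {
  if (typeof error === 'string') {
    return error;
  }
  if (error instanceof Error) {
    return error.message;
  }
  if (typeof error === 'object' && error !== null && 'message' in error) {
    return String((error as { message: unknown }).message);
  }

  return String(error ?? '');
}

// Safe check that works for string rejections as well as Error objects.
function isPreviousContainerError(error: unknown): boolean {
  return toErrorMessage(error).includes('previous terminated container');
}
```
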
🧹 Nitpick comments (12)
src/model/cloud-runner/tests/cloud-runner-s3-steps.test.ts (3)

22-24: Remove or replace trivial test with meaningful assertion.

This test asserts true === true, which provides no validation. If the intent is to verify the file loads without errors, the import/parse step already does that. Consider removing this test or replacing it with a meaningful assertion (e.g., verifying exported functions exist).


72-126: Consider extracting repeated credential setup in customJob YAML.

The AWS credential configuration block is duplicated three times across test-s3-pull-cache, test-s3-upload-cache, and test-s3-upload-build steps. This increases maintenance burden and risk of inconsistency.

Consider moving credential setup to a shared initialization step or relying on environment variables passed to containers rather than repeating aws configure in each step.


134-135: Remove redundant shouldRunS3 check.

This code path is only reachable when shouldRunS3 is true (guarded at line 43), making the inner check redundant.

-        // Only run S3 operations if environment supports it
-        if (shouldRunS3) {
+        // Run S3 verification operations
+        {
src/model/cloud-runner/providers/aws/aws-job-stack.ts (1)

21-30: Extract getStackWaitTime() to a shared module to avoid duplication.

This function and DEFAULT_STACK_WAIT_TIME_SECONDS are duplicated identically in three files: aws-job-stack.ts, aws-base-stack.ts, and index.ts. Consider extracting them to aws-client-factory.ts or a dedicated utility module.

♻️ Example consolidation

In aws-client-factory.ts:

const DEFAULT_STACK_WAIT_TIME_SECONDS = 600;

export function getStackWaitTime(): number {
  const overrideValue = Number(process.env.CLOUD_RUNNER_AWS_STACK_WAIT_TIME ?? '');
  if (!Number.isNaN(overrideValue) && overrideValue > 0) {
    return overrideValue;
  }

  return DEFAULT_STACK_WAIT_TIME_SECONDS;
}

Then import it in the other files:

import { getStackWaitTime } from './aws-client-factory';
src/model/cloud-runner/providers/aws/index.ts (1)

105-109: Consider removing or documenting the unused factory method calls.

getECS() and getKinesis() are called but their return values are discarded. If the intent is to pre-initialize clients for early failure detection, add a comment explaining this. Otherwise, remove these calls since the factory will lazily initialize clients when actually needed.

   ResourceTracking.logAllocationSummary('aws workflow');
   await ResourceTracking.logDiskUsageSnapshot('aws workflow (host)');
-  AwsClientFactory.getECS();
   const CF = AwsClientFactory.getCloudFormation();
-  AwsClientFactory.getKinesis();
src/model/cloud-runner/providers/k8s/index.ts (2)

162-200: Test cleanup block uses dynamic import unnecessarily.

The dynamic import of CloudRunnerSystem at line 168-170 is unnecessary since CloudRunnerSystem is already imported at the module level (line 19). While past reviews marked this as addressed, the dynamic import pattern persists. Use the existing module-level import directly.

♻️ Suggested fix
       if (process.env['cloudRunnerTests'] === 'true') {
         try {
           CloudRunnerLogger.log('Cleaning up old images in k3d node before pulling new image...');
-          const { CloudRunnerSystem: CloudRunnerSystemModule } = await import(
-            '../../services/core/cloud-runner-system'
-          );

           // Aggressive cleanup: remove stopped containers and non-Unity images
           // ... rest of code using CloudRunnerSystem instead of CloudRunnerSystemModule

And update line 188:

-              await CloudRunnerSystemModule.Run(cmd, true, true);
+              await CloudRunnerSystem.Run(cmd, true, true);

206-282: Redundant dynamic import in image cache validation block.

Similar to the cleanup block, line 208-210 dynamically imports CloudRunnerSystem when it's already available at module scope. This adds unnecessary overhead and complexity.

♻️ Suggested fix
         if (process.env['cloudRunnerTests'] === 'true' && image.includes('unityci/editor')) {
           try {
-            const { CloudRunnerSystem: CloudRunnerSystemModule2 } = await import(
-              '../../services/core/cloud-runner-system'
-            );
-
             // Check if image is cached on agent node (where pods run)
-            const agentImageCheck = await CloudRunnerSystemModule2.Run(
+            const agentImageCheck = await CloudRunnerSystem.Run(
               `docker exec k3d-unity-builder-agent-0 sh -c "crictl images | grep -q unityci/editor && echo 'cached' || echo 'not_cached'" || echo 'not_cached'`,

Apply similar changes to all CloudRunnerSystemModule2.Run calls in this block.

src/model/cloud-runner/tests/e2e/cloud-runner-end2end-retaining.test.ts (1)

141-170: Cleanup logic is thorough but duplicates earlier block.

The cache cleanup logic (lines 141-170) largely duplicates the workspace cleanup logic (lines 111-138). Consider extracting a shared helper function to reduce duplication.

♻️ Suggested refactor pattern
async function safeCleanupDirectory(dirPath: string, logger: typeof CloudRunnerLogger) {
  if (!fs.existsSync(dirPath)) return;
  
  try {
    logger.log(`Cleaning up: ${dirPath}`);
    await CloudRunnerSystem.Run(`chmod -R u+w ${dirPath} 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`rm -rf ${dirPath} 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`find ${dirPath} -type f -delete 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`find ${dirPath} -type d -empty -delete 2>/dev/null || true`);
  } catch (error: any) {
    logger.log(`Failed to cleanup: ${error.message}`);
  }
}
src/model/cloud-runner/cloud-runner.ts (1)

156-168: Dynamic provider loading with fallback is acceptable but consider logging level.

The dynamic provider loading approach with fallback to LocalCloudRunner is reasonable for extensibility. However, silently falling back could mask configuration errors in production.

Consider using CloudRunnerLogger.logWarning instead of CloudRunnerLogger.log for the fallback message to make it more visible.

src/model/cloud-runner/tests/e2e/cloud-runner-end2end-caching.test.ts (1)

102-136: Cleanup logic duplicates the retaining test pattern.

This cleanup block is nearly identical to the one in cloud-runner-end2end-retaining.test.ts. Consider extracting a shared test utility for cache cleanup to reduce duplication across e2e tests.

src/model/cloud-runner/remote-client/index.ts (1)

397-400: Remove unconventional void 0 statement.

The void 0; statement at line 399 is unconventional for a no-op. The comment alone suffices, or use an empty block.

     } catch {
       /* no-op: best-effort git lfs pull without tokens may fail */
-      void 0;
     }
src/model/cloud-runner/providers/k8s/kubernetes-storage.ts (1)

61-75: Avoid creating redundant KubeConfig instances; reuse the initialized client's kubeconfig.

While both kubeClient and the independently created KubeConfig at line 62 use loadFromDefault() and thus load the same default cluster config, creating separate KubeConfig and StorageV1Api instances each time this method is called is inefficient and violates DRY.

Instead, add StorageV1Api to the Kubernetes class initialization (in index.ts), or pass the KubeConfig instance as a parameter to this method to reuse it. The same pattern is repeated at line 187.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/model/docker.ts`:
- Around line 58-60: The image detection logic (isAlpineBasedImage /
commandPrefix) misses tagged and registry-qualified refs (e.g., alpine:3.19 or
ghcr.io/rclone/rclone:1.66); update detection to normalize or pattern-match the
image string instead of exact equality: extract the repository/name portion or
use a regex such as `^(.*/)?alpine(:|$)` or `^(.*/)?rclone/rclone(:|$)` so
images with tags or registry prefixes are detected as Alpine-based, then set
commandPrefix to /bin/sh accordingly; update the comment placement to include
the required blank line before it.
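
As an alternative to a regex, the normalization route could be sketched like this. `SH_ONLY_REPOSITORIES` and `repositoryOf` are hypothetical, and the real `isAlpineBasedImage` in docker.ts may differ; the sketch strips any digest and tag, then compares the repository name, so registry prefixes (including ones with a port) still match.

```typescript
// Hypothetical list of repositories assumed to ship /bin/sh but not /bin/bash.
const SH_ONLY_REPOSITORIES = ['alpine', 'rclone/rclone'];

// Strips an optional digest and tag, leaving the registry and repository.
function repositoryOf(image: string): string {
  const withoutDigest = image.replace(/@.*$/, '');
  const lastSlash = withoutDigest.lastIndexOf('/');
  const lastColon = withoutDigest.lastIndexOf(':');

  // A colon before the last slash belongs to a registry port, not a tag.
  return lastColon > lastSlash ? withoutDigest.slice(0, lastColon) : withoutDigest;
}

function isAlpineBasedImage(image: string): boolean {
  const repository = repositoryOf(image);

  return SH_ONLY_REPOSITORIES.some((name) => repository === name || repository.endsWith(`/${name}`));
}
```
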
🧹 Nitpick comments (1)
src/model/cloud-runner/providers/docker/index.ts (1)

140-141: Add blank line before the comment to satisfy ESLint.

Static analysis reports a missing blank line before the comment.

🔧 Proposed fix
     const entrypointFilePath = `start.sh`;
+
     // Use #!/bin/sh for POSIX compatibility (Alpine-based images like rclone/rclone don't have bash)
     const fileContents = `#!/bin/sh

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/docker/index.ts`:
- Around line 97-106: The set endpointEnvironmentNames in
src/model/cloud-runner/providers/docker/index.ts is missing the INPUT_* variants
for CloudFormation, ECS, Kinesis, and CloudWatch Logs; update the Set used in
the localhost rewriting logic (symbol: endpointEnvironmentNames) to include
'INPUT_AWSCLOUDFORMATIONENDPOINT', 'INPUT_AWSECSENDPOINT',
'INPUT_AWSKINESISENDPOINT', and 'INPUT_AWSCLOUDWATCHLOGSENDPOINT', and apply the
same additions to the equivalent sets/logic in
src/model/cloud-runner/providers/k8s/kubernetes-job-spec-factory.ts and
src/model/cloud-runner/providers/aws/aws-task-runner.ts so containers using
INPUT_* endpoint env vars are rewritten to host.docker.internal consistently.
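
A rough sketch of the rewrite this comment refers to. The set below lists only the four names from the comment (the real set contains further entries not shown here), and `rewriteLocalhost` and `rewriteEndpointEnvironment` are hypothetical helpers rather than the provider code.

```typescript
// Only the four names mentioned in the review comment; the real set has more entries.
const endpointEnvironmentNames = new Set<string>([
  'INPUT_AWSCLOUDFORMATIONENDPOINT',
  'INPUT_AWSECSENDPOINT',
  'INPUT_AWSKINESISENDPOINT',
  'INPUT_AWSCLOUDWATCHLOGSENDPOINT',
]);

// Rewrites a localhost URL so it is reachable from inside a container.
// The lookahead keeps hosts like localhost.example.com untouched.
function rewriteLocalhost(value: string): string {
  return value.replace(/^(https?:\/\/)localhost(?=[:/]|$)/, '$1host.docker.internal');
}

// Applies the rewrite only to variables known to hold AWS endpoints.
function rewriteEndpointEnvironment(environment: Record<string, string>): Record<string, string> {
  const result: Record<string, string> = {};
  for (const [name, value] of Object.entries(environment)) {
    result[name] = endpointEnvironmentNames.has(name) ? rewriteLocalhost(value) : value;
  }

  return result;
}
```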

frostebite and others added 6 commits January 29, 2026 16:47
Reverts cosmetic changes that renamed workflow_id to workflowId in GitHub
API calls. The GitHub REST API uses workflow_id, so we keep the eslint
camelcase suppression comments to match the official API convention.

Also restores the getCheckStatus() method that was removed.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…s, versioning.test.ts

These files had changes unrelated to the Cloud Runner improvements PR goals.
Reverting to main branch state.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ovider

The rclone/rclone image is Alpine-based and only has /bin/sh, not /bin/bash.
This fixes exit code 127 errors when running rclone commands in containers.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The previous implementation fetched ALL PR refs with:
  git fetch origin +refs/pull/*:refs/remotes/origin/pull/*

This is extremely slow for repos with many PRs (700+ PRs in unity-builder).
Now fetches only the specific PR ref needed, e.g., for pull/731/merge:
  git fetch origin +refs/pull/731/merge:... +refs/pull/731/head:...

This should significantly speed up the Cloud Runner integrity tests.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
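
To illustrate the narrower fetch, the sketch below builds refspecs for a single pull request. `pullRequestRefspecs` and `pullRequestFetchCommand` are hypothetical helpers, and the destination refs are an assumption that mirrors the earlier wildcard mapping (`refs/remotes/origin/pull/*`); the commit message elides the actual destinations.

```typescript
// Builds refspecs for one pull request instead of the refs/pull/* wildcard.
function pullRequestRefspecs(reference: string): string[] {
  const match = /^(?:refs\/)?pull\/(\d+)\/(?:merge|head)$/.exec(reference);
  if (!match) {
    return [];
  }
  const pullRequestNumber = match[1];

  // Assumption: destinations mirror the wildcard mapping used previously.
  return ['merge', 'head'].map(
    (kind) => `+refs/pull/${pullRequestNumber}/${kind}:refs/remotes/origin/pull/${pullRequestNumber}/${kind}`,
  );
}

// Returns the fetch command, or undefined when the ref is not a pull request ref.
function pullRequestFetchCommand(reference: string): string | undefined {
  const refspecs = pullRequestRefspecs(reference);

  return refspecs.length > 0 ? `git fetch origin ${refspecs.join(' ')}` : undefined;
}
```
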
Tests are already covered by cloud-runner-integrity.yml

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@frostebite frostebite changed the title from "Cloud Runner Improvements - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system" to "Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system" on Jan 30, 2026

3 participants