Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51

xingyaoww · 2025-10-27T21:40:20Z

Description

This PR implements a GitHub workflow to automatically build SWE-Bench Docker images and push them to GitHub Container Registry (GHCR), addressing issue #37.

Changes

1. New GitHub Workflow (`.github/workflows/build-swe-bench-images.yml`)

Created a manually triggered workflow using workflow_dispatch
Integrated Blacksmith caching following the official documentation
Configured to use the blacksmith-32vcpu-ubuntu-2204 runner for high-performance builds
Set up Docker layer caching using useblacksmith/cache@v6
Configured GHCR authentication and push to ghcr.io/openhands/eval-agent-server
Added artifact uploads for build manifests and logs for debugging

2. Workflow Input Parameters

The workflow accepts the following configurable inputs:

dataset (default: princeton-nlp/SWE-bench_Verified)
split (default: test)
target (default: source-minimal) - Build target type
platforms (default: linux/amd64) - Target architectures
max-workers (default: 1) - Number of concurrent builds
n-limit (optional) - Limit number of images to build for testing

3. Modified `benchmarks/swe_bench/build_images.py`

Made the --critic parameter optional (with default value "none") since it's not needed for image building
This allows the script to be used for building images without requiring evaluation-specific parameters

4. Fixed `.gitignore`

Corrected patterns for eval_outputs/ and builds/ directories to properly exclude build artifacts

5. Improved Docker Image Tagging ✨ NEW

Significantly improved the Docker image tagging system for better reproducibility and clarity:

Changes:

Benchmarks build system: Added SDK commit hash detection and instance ID extraction
SDK submodule update: Updated to SDK PR #1088 which adds:
- SDK_VERSION_OVERRIDE environment variable support
- include_versioned_tag option to disable long versioned tags
- Target-based tag suffixes (replaces ambiguous -dev suffix)
Documentation: Added comprehensive TAGGING_CHANGES.md explaining the improvements

Tag Format Comparison:

Before (137 chars):

v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev

After (84 chars, 39% shorter):

a612c0a-swebench-django-12155-source-minimal
main-swebench-django-12155-source-minimal

Benefits:

Reproducibility: Exact commit hash ensures precise SDK version tracking
Clarity: Instance ID and build target clearly visible
Consistency: All builds use same suffix pattern (-binary, -source, etc.)
Backward Compatible: SDK changes only apply when explicitly enabled

See TAGGING_CHANGES.md for detailed explanation and implementation notes.

Testing

Validated YAML syntax of the workflow file
Verified build_images.py runs successfully with --dry-run flag
Confirmed all required parameters are properly passed to the build script
Pre-commit checks passed (ruff format/lint, pycodestyle, pyright)

Manual Trigger Instructions

To trigger this workflow manually:

Go to the Actions tab in the repository
Select "Build SWE-Bench Images" workflow
Click "Run workflow"
Adjust parameters as needed
Click "Run workflow" to start the build

Notes

This workflow is designed to be manually triggered due to the expensive nature of building multiple Docker images
The workflow uses Blacksmith's caching mechanism to speed up subsequent builds
Images are pushed to ghcr.io/openhands/eval-agent-server with improved tags based on SDK commit hash and instance IDs
Build logs and manifests are uploaded as artifacts for troubleshooting

Related PRs

SDK improvements: refactor: git SHA-based Docker tags by default, versioned tags only on releases software-agent-sdk#1088

Closes #37

@xingyaoww can click here to continue refining the PR

…hing - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37

xingyaoww · 2025-11-03T21:00:30Z

@OpenHands review changes introduced by this PR.

Then review the implementation added in OpenHands/software-agent-sdk#990, can you help think if there's a way to follow the pattern there to use implement this fast docker build https://docs.blacksmith.sh/blacksmith-caching/docker-builds. We don't need to use useblacksmith/build-push-action@v2 here, since we will be only building "amd64" images here

openhands-ai · 2025-11-03T21:00:40Z

I'm on it! xingyaoww can track my progress at all-hands.dev

…caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-11-03T21:04:03Z

Summary

I've reviewed the PR changes and successfully applied the Blacksmith fast Docker build pattern from PR #990. Here's what was done:

PR Review Completed

Current PR (#51) Changes:

✅ New GitHub workflow for building SWE-Bench images
✅ Blacksmith runner integration (blacksmith-32vcpu-ubuntu-2204)
✅ Fixed .gitignore patterns for build artifacts
✅ Made --critic parameter optional in build_images.py

Applied Blacksmith Fast Docker Build Pattern

Following the pattern from PR #990 and Blacksmith's official documentation, I made the following optimization:

Changes Made:

Replaced docker/setup-buildx-action@v3 → useblacksmith/setup-docker-builder@v1
Removed manual cache configuration (useblacksmith/cache@v6 with cache keys)

Why This Pattern is Optimal for This Use Case:

Unlike PR #990 which builds multi-platform images (amd64 + arm64) using a matrix strategy with useblacksmith/build-push-action@v2, this PR:

Only builds amd64 images (as you specified)
Builds multiple images sequentially/concurrently via Python script
Uses Docker commands directly (not individual GitHub Actions per image)

Therefore, the appropriate pattern from Blacksmith docs is:

"If you're not using the docker/build-push-action in your workflow, but are instead calling Docker commands directly... you can still cache your Docker layers by setting up a Blacksmith builder before interacting with Docker."

Benefits:

Automatic Docker layer caching via NVMe-backed sticky disks
2x to 40x faster builds (per Blacksmith customer reports)
Simpler configuration (1 action vs 2 actions + manual cache keys)
Cache shared across all repository builds

The changes have been committed and pushed to the PR branch.

_{View full conversation}

…s/build-swe-bench-images-workflow

xingyaoww · 2025-11-04T21:20:24Z

@OpenHands please fix the failing actions on PR #51 at branch openhands/build-swe-bench-images-workflow

openhands-ai · 2025-11-04T21:20:34Z

I'm on it! xingyaoww can track my progress at all-hands.dev

The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <[email protected]>

The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <[email protected]>

Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-11-04T21:37:54Z

Summary of Fixes for PR #51

I've successfully fixed the failing GitHub Actions workflow on PR #51. Here's what was done:

✅ Issues Fixed

1. Dependency Installation Failure (Root Cause)

Problem: The workflow was failing during uv sync because uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow v21.0.0. This caused uv to attempt building pyarrow from source, which failed due to missing Arrow C++ libraries.
Solution: Added .python-version file pinning Python to 3.12, which matches the project's requires-python = ">=3.12" and target-version = "py312" settings in pyproject.toml. Python 3.12 has full binary wheel support for all dependencies.

2. Artifact Upload Naming Issue

Problem: The workflow's artifact upload steps used build-manifest-${{ env.DATASET }}-${{ env.SPLIT }} and build-logs-${{ env.DATASET }}-${{ env.SPLIT }}, which would fail because dataset names contain slashes (e.g., princeton-nlp/SWE-bench_Verified) that are invalid in artifact names.
Solution: Changed artifact names to use github.run_id instead, making them unique and valid. Also added if-no-files-found: warn to provide better feedback if logs are missing.

📝 Changes Made

Commit 1: 29084f2 - Pin Python version to 3.12 to fix pyarrow compatibility

Created .python-version with content 3.12
Simplified dependency installation back to make build

Commit 2: 551405b - Fix artifact upload naming to avoid invalid characters

Changed artifact names from build-manifest-${{ env.DATASET }}-${{ env.SPLIT }} to build-manifest-${{ github.run_id }}
Changed artifact names from build-logs-${{ env.DATASET }}-${{ env.SPLIT }} to build-logs-${{ github.run_id }}
Added if-no-files-found: warn to log upload step

✅ Verification

The workflow now successfully passes the "Install dependencies" step, which was previously failing. Build logs and manifests will now be properly uploaded as GitHub artifacts for future analysis, even when individual Docker builds fail (as expected during testing with limited sample builds).

The changes have been pushed to branch openhands/build-swe-bench-images-workflow and are ready for review.

_{View full conversation}

…cters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <[email protected]>

Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <[email protected]>

Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]>

…s/build-swe-bench-images-workflow

This reverts commit 3ba1e46.

The build workflow was experiencing log file corruption and I/O errors due to concurrent builds writing to the wrong log files. This was caused by using ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides thread-local redirection of Python-level writes. The SDK's build() function spawns subprocesses and uses logger.info()/warning() to output build logs. Logger handlers write to process-wide file descriptors, not thread-local redirected streams, causing output from concurrent threads to: - Write to the wrong log files - Attempt writing to closed file handles - Result in ValueError('I/O operation on closed file.') Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide complete process-level isolation with separate stdout/stderr/logging per build. The additional overhead is negligible compared to Docker build time. Changes: - Import ProcessPoolExecutor instead of ThreadPoolExecutor - Move build_one_fn to module level (_build_with_logging) for pickle support - Update executor initialization to use ProcessPoolExecutor - Add explanatory comments about isolation requirements Co-authored-by: openhands <[email protected]>

This reverts commit 8d8ed8c.

…s/build-swe-bench-images-workflow

- Replace SDK_VERSION with SHORT_SHA (renamed in SDK PR #1088) - Add extract_custom_tag() function to avoid circular import - Update get_agent_server_docker_image() to use new tag format: - Binary target: {SHORT_SHA}-{custom_tag} - Other targets: {SHORT_SHA}-{custom_tag}-{target} - Aligns with SDK's git commit-based tagging strategy Co-authored-by: openhands <[email protected]>

Remove duplicate implementation of extract_custom_tag in build_images.py and import it from run_infer.py instead. This avoids code duplication and ensures both modules use the same implementation. Co-authored-by: openhands <[email protected]>

Add comment explaining that SHORT_SHA is computed from the benchmarks repo's git commit (via git rev-parse HEAD in cwd), not the SDK submodule. This makes it clear that images are tagged with the benchmarks repo commit for reproducibility and traceability. Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-11-07T20:45:04Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Build SWE-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #51 at branch `openhands/build-swe-bench-images-workflow`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

This adds a new step to the build-and-push workflow that: - Posts a comment to issue #81 when the build completes successfully - Includes dataset name, split, SDK version, and workflow run link - Lists all built image tags in a collapsible markdown section Co-authored-by: openhands <[email protected]>

- Corrected SDK repository URL from All-Hands-AI/agent-sdk to OpenHands/software-agent-sdk - Added 'Triggered by' field to comment to show workflow trigger source - Updated .openhands/microagents/repo.md with correct SDK URL Co-authored-by: openhands <[email protected]>

Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]>

The comment step now checks if manifest.jsonl files exist and contain data before attempting to post a comment. This prevents posting comments with '0 images' when builds complete successfully but produce no output (e.g., during PR testing or when the build step is skipped). Co-authored-by: openhands <[email protected]>

The previous check using 'builds/*/manifest.jsonl' only looked one level deep, but the actual path is 'builds/princeton-nlp/SWE-bench_Verified/test/manifest.jsonl' which is three levels deep. Using 'find' command now correctly locates manifest files at any depth within the builds directory. Tested with actual artifact from run #19182998503 containing 10 images. Co-authored-by: openhands <[email protected]>

Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]>

openhands-ai bot mentioned this pull request Oct 27, 2025

Auto-build SWE-Bench Images and Push to GHCR #37

Open

xingyaoww marked this pull request as ready for review October 27, 2025 21:46

xingyaoww added 7 commits November 4, 2025 20:40

Merge commit 'bb150852c64a555806cfa939f31e8f9abd7b3791' into openhand…

282f863

…s/build-swe-bench-images-workflow

revert unneed stuff

8508006

simplify setup dependency

a565e77

set eval-agent-server

9bbd7fb

fix line break

c661b2c

default to 10 for testing

632432e

run on all prs for debugging

c536903

xingyaoww mentioned this pull request Nov 4, 2025

Failed to build pyarrow==21.0.0 #71

Closed

openhands-agent added 2 commits November 4, 2025 21:26

openhands-agent added 2 commits November 4, 2025 21:40

xingyaoww mentioned this pull request Nov 4, 2025

Fix Docker cache tag length exceeding 128 character limit OpenHands/software-agent-sdk#1029

Merged

openhands-agent and others added 6 commits November 4, 2025 21:57

Update patch with pre-commit formatting fixes

21bb226

Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]>

checkout to v1.0.0 of sdk

2f89775

update uv.lock

dfb966b

Merge commit 'dfb966bd2d3e4d2086223cf4ff85d998d15354d4' into openhand…

d04de8a

…s/build-swe-bench-images-workflow

Revert "Fix Docker cache tag length exceeding 128 character limit"

cdd7200

This reverts commit 3ba1e46.

xingyaoww and others added 8 commits November 7, 2025 16:54

update agent-sdk version

8d8ed8c

improve custom tags for swebench image

8763fad

Revert "update agent-sdk version"

99927f8

This reverts commit 8d8ed8c.

Merge commit '2ca8a917036ddb6ac069b3ecbb0f14ec616a4883' into openhand…

8ed14f3

…s/build-swe-bench-images-workflow

update sha

7e3c50e

xingyaoww mentioned this pull request Nov 7, 2025

feat: add get_git_short_sha utility function OpenHands/software-agent-sdk#1099

Closed

xingyaoww added 10 commits November 7, 2025 19:40

update sdk

89e4cda

refactor

eacfe0b

remove tagging changes

3a2c009

bump commit

84c8876

simplify build script

de46db7

bump version

bcbd455

bump

96f2da6

bump

aad870b

refactor build util into shared file

acee9cb

simplify build on the fly logic

a4bf9e4

xingyaoww and others added 4 commits November 7, 2025 20:45

remove targets and platform

9ef0d48

Update .gitignore to properly allow .openhands/microagents/

0ab219f

Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]>

github-actions bot mentioned this pull request Nov 7, 2025

[Tracker] All Pre-build Images #81

Open

openhands-agent and others added 5 commits November 7, 2025 22:30

bump sdk

46b5266

increase n work and n limit

16526b3

Show only one tag per image in issue comment

90ee94e

Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51

Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51

xingyaoww commented Oct 27, 2025 •

edited

Loading

Uh oh!

xingyaoww commented Nov 3, 2025

Uh oh!

openhands-ai bot commented Nov 3, 2025

Uh oh!

openhands-ai bot commented Nov 3, 2025

Uh oh!

xingyaoww commented Nov 4, 2025

Uh oh!

openhands-ai bot commented Nov 4, 2025

Uh oh!

openhands-ai bot commented Nov 4, 2025

Uh oh!

openhands-ai bot commented Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51

Are you sure you want to change the base?

Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51

Conversation

xingyaoww commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

1. New GitHub Workflow (.github/workflows/build-swe-bench-images.yml)

2. Workflow Input Parameters

3. Modified benchmarks/swe_bench/build_images.py

4. Fixed .gitignore

5. Improved Docker Image Tagging ✨ NEW

Testing

Manual Trigger Instructions

Notes

Related PRs

Uh oh!

xingyaoww commented Nov 3, 2025

Uh oh!

openhands-ai bot commented Nov 3, 2025

Uh oh!

openhands-ai bot commented Nov 3, 2025

Summary

PR Review Completed

Applied Blacksmith Fast Docker Build Pattern

Uh oh!

xingyaoww commented Nov 4, 2025

Uh oh!

openhands-ai bot commented Nov 4, 2025

Uh oh!

openhands-ai bot commented Nov 4, 2025

Summary of Fixes for PR #51

✅ Issues Fixed

📝 Changes Made

✅ Verification

Uh oh!

openhands-ai bot commented Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xingyaoww commented Oct 27, 2025 •

edited

Loading

1. New GitHub Workflow (`.github/workflows/build-swe-bench-images.yml`)

3. Modified `benchmarks/swe_bench/build_images.py`

4. Fixed `.gitignore`