-
Notifications
You must be signed in to change notification settings - Fork 6
Add GitHub workflow to auto-build SWE-Bench images and push to GHCR #51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…hing - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37
|
@OpenHands review changes introduced by this PR. Then review the implementation added in OpenHands/software-agent-sdk#990, can you help think if there's a way to follow the pattern there to use implement this fast docker build https://docs.blacksmith.sh/blacksmith-caching/docker-builds. We don't need to use useblacksmith/build-push-action@v2 here, since we will be only building "amd64" images here |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
…caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <[email protected]>
SummaryI've reviewed the PR changes and successfully applied the Blacksmith fast Docker build pattern from PR #990. Here's what was done: PR Review CompletedCurrent PR (#51) Changes:
Applied Blacksmith Fast Docker Build PatternFollowing the pattern from PR #990 and Blacksmith's official documentation, I made the following optimization: Changes Made:
Why This Pattern is Optimal for This Use Case: Unlike PR #990 which builds multi-platform images (amd64 + arm64) using a matrix strategy with
Therefore, the appropriate pattern from Blacksmith docs is:
Benefits:
The changes have been committed and pushed to the PR branch. |
|
@OpenHands please fix the failing actions on PR #51 at branch |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <[email protected]>
The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <[email protected]>
Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <[email protected]>
Summary of Fixes for PR #51I've successfully fixed the failing GitHub Actions workflow on PR #51. Here's what was done: ✅ Issues Fixed1. Dependency Installation Failure (Root Cause)
2. Artifact Upload Naming Issue
📝 Changes MadeCommit 1:
Commit 2:
✅ VerificationThe workflow now successfully passes the "Install dependencies" step, which was previously failing. Build logs and manifests will now be properly uploaded as GitHub artifacts for future analysis, even when individual Docker builds fail (as expected during testing with limited sample builds). The changes have been pushed to branch |
…cters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <[email protected]>
Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <[email protected]>
Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <[email protected]>
…s/build-swe-bench-images-workflow
This reverts commit 3ba1e46.
The build workflow was experiencing log file corruption and I/O errors due to
concurrent builds writing to the wrong log files. This was caused by using
ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides
thread-local redirection of Python-level writes.
The SDK's build() function spawns subprocesses and uses logger.info()/warning()
to output build logs. Logger handlers write to process-wide file descriptors,
not thread-local redirected streams, causing output from concurrent threads to:
- Write to the wrong log files
- Attempt writing to closed file handles
- Result in ValueError('I/O operation on closed file.')
Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide
complete process-level isolation with separate stdout/stderr/logging per
build. The additional overhead is negligible compared to Docker build time.
Changes:
- Import ProcessPoolExecutor instead of ThreadPoolExecutor
- Move build_one_fn to module level (_build_with_logging) for pickle support
- Update executor initialization to use ProcessPoolExecutor
- Add explanatory comments about isolation requirements
Co-authored-by: openhands <[email protected]>
This reverts commit 8d8ed8c.
…s/build-swe-bench-images-workflow
- Replace SDK_VERSION with SHORT_SHA (renamed in SDK PR #1088)
- Add extract_custom_tag() function to avoid circular import
- Update get_agent_server_docker_image() to use new tag format:
- Binary target: {SHORT_SHA}-{custom_tag}
- Other targets: {SHORT_SHA}-{custom_tag}-{target}
- Aligns with SDK's git commit-based tagging strategy
Co-authored-by: openhands <[email protected]>
Remove duplicate implementation of extract_custom_tag in build_images.py and import it from run_infer.py instead. This avoids code duplication and ensures both modules use the same implementation. Co-authored-by: openhands <[email protected]>
Add comment explaining that SHORT_SHA is computed from the benchmarks repo's git commit (via git rev-parse HEAD in cwd), not the SDK submodule. This makes it clear that images are tagged with the benchmarks repo commit for reproducibility and traceability. Co-authored-by: openhands <[email protected]>
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
This adds a new step to the build-and-push workflow that: - Posts a comment to issue #81 when the build completes successfully - Includes dataset name, split, SDK version, and workflow run link - Lists all built image tags in a collapsible markdown section Co-authored-by: openhands <[email protected]>
- Corrected SDK repository URL from All-Hands-AI/agent-sdk to OpenHands/software-agent-sdk - Added 'Triggered by' field to comment to show workflow trigger source - Updated .openhands/microagents/repo.md with correct SDK URL Co-authored-by: openhands <[email protected]>
Changed .openhands/ to .openhands/* so that negation patterns work correctly Co-authored-by: openhands <[email protected]>
The comment step now checks if manifest.jsonl files exist and contain data before attempting to post a comment. This prevents posting comments with '0 images' when builds complete successfully but produce no output (e.g., during PR testing or when the build step is skipped). Co-authored-by: openhands <[email protected]>
The previous check using 'builds/*/manifest.jsonl' only looked one level deep, but the actual path is 'builds/princeton-nlp/SWE-bench_Verified/test/manifest.jsonl' which is three levels deep. Using 'find' command now correctly locates manifest files at any depth within the builds directory. Tested with actual artifact from run #19182998503 containing 10 images. Co-authored-by: openhands <[email protected]>
Each image has multiple tags (base tag + detailed tag with hash). Now showing only the first (cleaner) tag per image to reduce clutter in the issue comment, making it easier to read. Co-authored-by: openhands <[email protected]>
Description
This PR implements a GitHub workflow to automatically build SWE-Bench Docker images and push them to GitHub Container Registry (GHCR), addressing issue #37.
Changes
1. New GitHub Workflow (
.github/workflows/build-swe-bench-images.yml)workflow_dispatchblacksmith-32vcpu-ubuntu-2204runner for high-performance buildsuseblacksmith/cache@v6ghcr.io/openhands/eval-agent-server2. Workflow Input Parameters
The workflow accepts the following configurable inputs:
dataset(default:princeton-nlp/SWE-bench_Verified)split(default:test)target(default:source-minimal) - Build target typeplatforms(default:linux/amd64) - Target architecturesmax-workers(default:1) - Number of concurrent buildsn-limit(optional) - Limit number of images to build for testing3. Modified
benchmarks/swe_bench/build_images.py--criticparameter optional (with default value"none") since it's not needed for image building4. Fixed
.gitignoreeval_outputs/andbuilds/directories to properly exclude build artifacts5. Improved Docker Image Tagging ✨ NEW
Significantly improved the Docker image tagging system for better reproducibility and clarity:
Changes:
SDK_VERSION_OVERRIDEenvironment variable supportinclude_versioned_tagoption to disable long versioned tags-devsuffix)TAGGING_CHANGES.mdexplaining the improvementsTag Format Comparison:
Before (137 chars):
After (84 chars, 39% shorter):
Benefits:
-binary,-source, etc.)See
TAGGING_CHANGES.mdfor detailed explanation and implementation notes.Testing
build_images.pyruns successfully with--dry-runflagManual Trigger Instructions
To trigger this workflow manually:
Notes
ghcr.io/openhands/eval-agent-serverwith improved tags based on SDK commit hash and instance IDsRelated PRs
Closes #37
@xingyaoww can click here to continue refining the PR