jmj/swebench eval infer #57

juanmichelini · 2025-10-28T19:16:11Z

Fixes #24

- Added missing swebench-eval script entry point that maps to benchmarks.swe_bench.eval_infer:main
- This resolves the 'No such file or directory' error when running 'uv run swebench-eval'
- Now swebench-eval works as an entry point to eval_infer.py, similar to how swebench-infer works for run_infer.py

Co-authored-by: openhands <[email protected]>
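For readers unfamiliar with script entry points: a target written as `module:function` names a module to import and a callable inside it. The sketch below shows roughly how such a string resolves. It is illustrative only, not the project's code; `resolve_entry_point` is a hypothetical helper, and `json:dumps` stands in for `benchmarks.swe_bench.eval_infer:main` so the example runs without the repository.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a 'module:attribute' string to the object it names."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target; the PR's real target is benchmarks.swe_bench.eval_infer:main.
dumps = resolve_entry_point("json:dumps")
print(dumps({"ok": True}))
```

If the script name is not registered, there is nothing on PATH to resolve, which is consistent with the 'No such file or directory' error described above.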

juanmichelini · 2025-10-30T19:53:09Z

@OpenHands /codereview-roasted

openhands-ai · 2025-10-30T19:53:17Z

I'm on it! juanmichelini can track my progress at all-hands.dev

openhands-ai · 2025-10-30T19:55:13Z

Code Review Summary for PR #57: "jmj/swebench eval infer"

I've completed a comprehensive code review of the changes in this PR, which adds SWE-Bench evaluation functionality to complement the existing inference capabilities.

Taste Rating: 🔴 Needs improvement - Violates fundamental principles

Critical Issues That Must Be Fixed:

[benchmarks/utils/patch_utils.py, Lines 42-47] - BROKEN FILE FILTERING: The logic any(f in diff for f in files) uses dangerous substring matching that will cause data corruption. A file named "my_setup.py" would incorrectly match "setup.py" and be filtered out.
[benchmarks/swe_bench/eval_infer.py, Line 158] - BRITTLE SYSTEM DEPENDENCY: Hardcoded /usr/bin/python3 path will break on systems where Python is installed elsewhere, violating the "never break userspace" principle.
[benchmarks/swe_bench/eval_infer.py, Lines 134-196 & 199-289] - EXCESSIVE COMPLEXITY: Functions are doing too much - the main function handles 5+ different responsibilities, and run_swebench_evaluation mixes path manipulation, subprocess execution, and error handling.
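The review does not show the call site at line 158, so the following is only a sketch under the assumption that the script launches Python through `subprocess`. One common way to avoid a hardcoded interpreter path is `sys.executable`, which points at the interpreter running the current process. The helper name `run_with_current_interpreter` is hypothetical.

```python
import subprocess
import sys


def run_with_current_interpreter(args: list[str]) -> subprocess.CompletedProcess:
    """Run `python <args>` with the interpreter that is running this script."""
    return subprocess.run(
        [sys.executable, *args], capture_output=True, text=True, check=True
    )


result = run_with_current_interpreter(["-c", "print('ok')"])
print(result.stdout.strip())
```

Because the child uses the same interpreter as the parent, it also sees the same virtual environment, which matters when the command is started through `uv run`.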

Key Technical Problems:

Security Risk: Substring-based file filtering could match partial paths and cause silent data loss
Fragile Design: Git patches treated as raw strings instead of structured data
Poor Error Handling: Subprocess calls don't properly capture or handle stderr
Hardcoded Values: Setup file list and Python path are not configurable
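The error-handling point above can be illustrated with a small wrapper that captures stderr and includes it in the raised error. This is a sketch, not the code in eval_infer.py; `CommandError` and `run_checked` are hypothetical names introduced for the example.

```python
import subprocess
import sys


class CommandError(RuntimeError):
    """Raised when a child process exits with a non-zero status."""


def run_checked(cmd: list[str]) -> str:
    """Run cmd, return its stdout, and surface stderr on failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise CommandError(
            f"{cmd[0]} exited with status {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.stdout


# A child process that writes to stderr and fails, to exercise the error path.
failing = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
try:
    run_checked(failing)
except CommandError as exc:
    message = str(exc)
    print(message)
```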

Verdict:

❌ Needs rework - The patch filtering logic is fundamentally broken and will cause data corruption in production. The subprocess execution is brittle and environment-dependent.

Key Architectural Insight:

This code treats git patches as unstructured text when they have well-defined formats that should be parsed properly. The file filtering especially needs complete rewriting to use proper path matching instead of dangerous substring operations.

The core functionality (converting OpenHands output to SWE-Bench format) solves a real problem, but the implementation has several critical flaws that must be addressed before this can be safely merged.


juanmichelini · 2025-10-30T20:13:21Z

@OpenHands you made some good points in the code review, can you update this branch with your suggestions and I'll test it? Do the necessary changes but let's keep the PR as minimal as possible.

openhands-ai · 2025-10-30T20:13:30Z

I'm on it! juanmichelini can track my progress at all-hands.dev

Replace dangerous substring matching with proper filename extraction and exact matching.

The previous implementation used any(f in diff for f in files), which would incorrectly filter files like 'my_setup.py' when trying to filter 'setup.py'. Now properly parses diff headers using regex to extract exact filenames and performs exact matching instead of substring matching.

Co-authored-by: openhands <[email protected]>

openhands-ai · 2025-10-30T20:25:24Z

✅ Code Review Suggestions Implemented

I have successfully updated the jmj/swebench-eval-infer branch with the critical fix identified in the code review, keeping the changes minimal as requested.

🔧 Changes Made

Fixed Critical File Filtering Bug in benchmarks/utils/patch_utils.py:

Problem: The original code used dangerous substring matching any(f in diff for f in files) which would incorrectly filter files like my_setup.py when trying to filter setup.py
Solution: Replaced with proper regex-based filename extraction and exact matching that parses diff headers correctly
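The committed change is not shown in this thread, so the following is only a sketch of the general technique the comment describes: extract each file's path from its `diff --git` header and compare paths exactly. The names `DIFF_HEADER`, `split_file_diffs`, and `remove_files` are hypothetical, and the sketch assumes paths relative to the repository root with no spaces or quoting.

```python
import re

# Header that git writes at the start of each per-file section of a patch.
DIFF_HEADER = re.compile(r"^diff --git a/(?P<old>.+?) b/(?P<new>.+)$", re.MULTILINE)


def split_file_diffs(patch: str) -> list[tuple[str, str]]:
    """Split a git patch into (new_path, section_text) pairs."""
    matches = list(DIFF_HEADER.finditer(patch))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(patch)
        sections.append((match.group("new"), patch[match.start():end]))
    return sections


def remove_files(patch: str, files: set[str]) -> str:
    """Drop the sections whose path is exactly one of `files`."""
    return "".join(
        text for path, text in split_file_diffs(patch) if path not in files
    )


patch = (
    "diff --git a/setup.py b/setup.py\n"
    "--- a/setup.py\n+++ b/setup.py\n@@ -1 +1 @@\n-old\n+new\n"
    "diff --git a/my_setup.py b/my_setup.py\n"
    "--- a/my_setup.py\n+++ b/my_setup.py\n@@ -1 +1 @@\n-x\n+y\n"
)
filtered = remove_files(patch, {"setup.py"})
print([path for path, _ in split_file_diffs(filtered)])
```

A substring test such as `"setup.py" in section` is true for both sections here, because `my_setup.py` contains `setup.py`; the exact comparison drops only the first one.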

📊 Impact

Files changed: 1 file
Lines changed: +14 insertions, -2 deletions
Scope: Minimal and targeted fix addressing the fundamental issue

✅ Verification

Created and ran comprehensive test confirming the fix works correctly
Verified exact filename matching now works properly
Confirmed no regression in functionality

🚀 Status

The changes have been committed and pushed to the PR branch. The fix addresses the "BROKEN FILE FILTERING" issue that was flagged as violating fundamental principles in the code review, ensuring that only exact filename matches are filtered instead of dangerous substring matching.

The PR is now ready for your testing!


xingyaoww · 2025-10-31T16:09:40Z

benchmarks/swe_bench/eval_infer.py

+def install_swebench() -> None:
+    """Install SWE-Bench if not already available."""
+    import importlib.util
+
+    if importlib.util.find_spec("swebench") is not None:
+        logger.info("SWE-Bench is already installed")
+        return
+
+    logger.info("Installing SWE-Bench...")
+    try:
+        # Install SWE-Bench using uv
+        subprocess.run(
+            ["uv", "add", "swebench"], check=True, capture_output=True, text=True
+        )
+        logger.info("SWE-Bench installed successfully")
+    except subprocess.CalledProcessError as e:
+        logger.error(f"Failed to install SWE-Bench: {e}")
+        logger.error(f"stdout: {e.stdout}")
+        logger.error(f"stderr: {e.stderr}")
+        raise


You should just uv add swebench and maintain its version in pyproject.toml and uv.lock -- so you can get rid of this function and all associated configs

@OpenHands please do it for me

I'm on it! xingyaoww can track my progress at all-hands.dev

Summary

I've successfully addressed the PR comment from @xingyaoww on PR #57. The requested changes have been completed and pushed to the jmj/swebench-eval-infer branch.

Changes Made:

✅ Added swebench as a proper dependency in pyproject.toml using uv add swebench

✅ Removed the install_swebench() function (22 lines) - This function previously attempted to dynamically install swebench at runtime

✅ Removed the --install-swebench CLI argument and its help text

✅ Removed the code that called install_swebench() in the main function

✅ Updated README.md to remove the --install-swebench example and simplified the workflow description

Result:

The SWE-Bench dependency is now properly managed through pyproject.toml and uv.lock like other dependencies, eliminating the need for runtime installation logic. This is cleaner, more maintainable, and follows standard Python dependency management practices.
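With the package declared as a regular dependency, runtime code no longer needs to install anything. If a clear failure message is still wanted when the environment is out of sync, a fail-fast check is one option. This is a sketch and an assumption, not something the PR adds; `require_module` is a hypothetical helper, and the example checks `json` instead of `swebench` so it runs anywhere.

```python
import importlib.util


def require_module(name: str) -> None:
    """Raise a descriptive error if a top-level module cannot be found."""
    if importlib.util.find_spec(name) is None:
        raise RuntimeError(
            f"'{name}' is not installed; run 'uv sync' to install the locked dependencies"
        )


require_module("json")  # Present in the standard library, so this passes.

try:
    require_module("module_that_does_not_exist_xyz")
except RuntimeError as exc:
    error_text = str(exc)
    print(error_text)
```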

Commit: ec01450 - "refactor: manage swebench dependency via uv add"
Branch: jmj/swebench-eval-infer

The changes are now live on the PR and ready for review.



openhands-ai · 2025-11-04T19:20:15Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #57 at branch `jmj/swebench-eval-infer`

Feel free to include any additional details that might help me get this PR into a better state.



- Added succinct repository documentation for OpenHands benchmarks
- Covers dev setup, code standards, testing, and benchmark-specific guidelines
- Updated .gitignore to allow .openhands/microagents/ directory

Co-authored-by: openhands <[email protected]>


juanmichelini and others added 9 commits October 15, 2025 14:43

doc: SWE-Bench eval infer example

a0886fa

feat: eval_infer, format ouput properly, run harness

f6ff20d

fix: report is now saved to same directory as predictions file

7f62c79

SWE-Bench eval: git patch postprocessing

7f9ff70

SWE-Bench eval: git patch postprocessing

9bcd968

Merge branch 'main' into jmj/swebench-eval-infer

1eacdf6

Fix patch apply errors

1584e00

Added param for workers to eval_infer

f38c74c

Merge branch 'main' into jmj/swebench-eval-infer

095c4bc

xingyaoww reviewed Oct 31, 2025


refactor: manage swebench dependency via uv add

ec01450

- Added swebench to pyproject.toml dependencies
- Removed install_swebench() function and --install-swebench flag
- Updated README to reflect simplified installation process
- SWE-Bench is now installed as a regular dependency via uv

openhands-agent added 3 commits November 4, 2025 19:32

chore: run ruff format and add .openhands/setup.sh

621b149

- Applied ruff formatting to benchmarks/utils/patch_utils.py
- Added .openhands/setup.sh from software-agent-sdk for environment setup
- Updated .gitignore to allow .openhands/setup.sh

Co-authored-by: openhands <[email protected]>

chore: update uv.lock with swebench dependency

6dd4da5

- Locked swebench v4.1.0 and its dependencies
- Added beautifulsoup4, gitpython, ghapi, and other transitive deps

Co-authored-by: openhands <[email protected]>

xingyaoww approved these changes Nov 4, 2025


xingyaoww merged commit d789b46 into main Nov 4, 2025
1 check passed
