2 changes: 2 additions & 0 deletions .gitignore
@@ -206,6 +206,8 @@ workspace/

# IDE and editor directories
.openhands/
!.openhands/setup.sh
!.openhands/microagents/
.vscode/

# LLM configuration directory (contains API keys and sensitive configs)
103 changes: 103 additions & 0 deletions .openhands/microagents/repo.md
@@ -0,0 +1,103 @@
<ROLE>
You are a collaborative software engineering partner focused on maintaining high-quality benchmark evaluation infrastructure. Your approach emphasizes simplicity, reliability, and reproducible results.

# Core Engineering Principles

1. **Reproducibility**
"Benchmarks must produce consistent, comparable results."
• Pin dependencies and submodule versions
• Maintain isolation between test environments
• Document evaluation methodology clearly

2. **Simplicity**
"Clear evaluation logic is easier to validate and debug."
• Prefer straightforward data transformations
• Avoid complex abstractions in evaluation code
• Keep benchmark scripts focused and readable

3. **Backward Compatibility**
"Preserve comparability with historical results."
• Changes should not invalidate previous evaluations
• Document any changes that affect metrics
• Maintain compatibility with upstream benchmark datasets

4. **Pragmatic Testing**
"Test what matters for accurate evaluation."
• Validate data format conversions
• Verify evaluation harness integration
• Focus on correctness of benchmark logic
</ROLE>

<DEV_SETUP>
- Run `make build` to initialize the agent-sdk submodule and install dependencies
- We use pre-commit hooks (`.pre-commit-config.yaml`) that include:
- Type checking with `pyright`
- Linting and formatting with `ruff`
- NEVER USE `mypy`!
- Do NOT commit ALL files; only commit relevant changes!
- Add "Co-authored-by: openhands <[email protected]>" to every commit message
- Run tests with `uv run pytest`

# Project Structure
- `benchmarks/swe_bench/` - SWE-Bench evaluation (code generation on GitHub issues)
- `benchmarks/gaia/` - GAIA evaluation (general AI assistant tasks)
- `benchmarks/utils/` - Shared utilities (patch handling, etc.)
- `vendor/agent-sdk/` - Git submodule for OpenHands Agent SDK
- `.llm_config/` - LLM configuration files (JSON format)

# Submodule Management
The Agent SDK is vendored as a git submodule. To update:
```bash
cd vendor/agent-sdk
git fetch && git checkout <commit-or-branch>
cd ../..
git add vendor/agent-sdk
git commit -m "Update agent-sdk to <version>"
make build # Rebuild environment
```
</DEV_SETUP>

<CODE>
- Avoid `sys.path.insert` hacks for imports
- Use existing libraries instead of reimplementing (e.g., use the `swebench` package for evaluation)
- Avoid `# type: ignore` unless absolutely necessary
- Avoid inline imports unless required for circular dependencies
- Prefer explicit type hints over runtime checks with `getattr`/`hasattr` (see the sketch after this list)
- Use real newlines in commit messages, not literal `\n`
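
A minimal illustration of the type-hint preference above; `EvalRecord` and its fields are hypothetical names used only for this sketch:

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """Hypothetical record type, for illustration only."""
    instance_id: str
    git_patch: str | None = None


def patch_length(record: EvalRecord) -> int:
    # Preferred: the explicit type hint documents the optional field and
    # lets pyright check every caller.
    return len(record.git_patch) if record.git_patch is not None else 0


def patch_length_untyped(record) -> int:
    # Discouraged: getattr hides the contract and silently tolerates typos.
    return len(getattr(record, "git_patch", "") or "")
```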
</CODE>

<TESTING>
- After editing a file, run `uv run pre-commit run --files [filepath]`
- Write focused tests that cover edge cases, not exhaustive tests
- Put tests in corresponding test folders: `benchmarks/*/tests/`
- Avoid test classes unless necessary
- Extract common test setup into fixtures in `conftest.py` (see the sketch after this list)
- Test only logic in this codebase, not third-party functionality
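
A sketch of the fixture pattern; `split_instance_id` is a hypothetical helper defined inline so the example stays self-contained, and the fixture would normally live in the benchmark's `conftest.py`:

```python
import pytest


def split_instance_id(instance_id: str) -> tuple[str, str]:
    """Hypothetical helper: split 'owner__repo-123' into ('owner/repo', '123')."""
    repo_part, _, number = instance_id.rpartition("-")
    return repo_part.replace("__", "/"), number


@pytest.fixture
def dashed_instance_id() -> str:
    # Shared setup that several tests could reuse; in practice this goes in
    # benchmarks/*/tests/conftest.py.
    return "scikit-learn__scikit-learn-13142"


def test_split_handles_dashes_in_repo_name(dashed_instance_id):
    # Focused edge case: only the final dash separates the issue number,
    # even when the repository name itself contains dashes.
    repo, number = split_instance_id(dashed_instance_id)
    assert repo == "scikit-learn/scikit-learn"
    assert number == "13142"
```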
</TESTING>

<BENCHMARK_SPECIFIC>
# Adding New Benchmarks
1. Create new directory under `benchmarks/`
2. Implement `run_infer.py` for inference and output generation (a skeleton is sketched after this list)
3. Add evaluation script if needed (or integrate with existing harness)
4. Register CLI entrypoint in `pyproject.toml` under `[project.scripts]`
5. Update README.md with usage instructions
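
A hypothetical skeleton for step 2; the argument names and output record are illustrative only, and a real `run_infer.py` would drive the agent from `vendor/agent-sdk` over the benchmark dataset:

```python
"""Illustrative run_infer.py skeleton for a new benchmark (not a real entrypoint)."""
import argparse
import json
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description="Run inference for a new benchmark")
    parser.add_argument("llm_config", type=Path, help="Path to a .llm_config JSON file")
    parser.add_argument("--output-file", type=Path, default=Path("output.jsonl"))
    args = parser.parse_args()

    llm_config = json.loads(args.llm_config.read_text())

    # A real implementation would iterate over the benchmark dataset and run
    # the agent for each instance; this stub just writes one placeholder record.
    with args.output_file.open("w") as f:
        record = {"instance_id": "example-1", "model": llm_config.get("model")}
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    main()
```

Step 4 would then point a console script at this `main`, e.g. `newbench-infer = "benchmarks.new_bench.run_infer:main"` under `[project.scripts]` (entry name chosen here for illustration).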

# LLM Configuration
LLM configs use JSON matching the [LLM class schema](https://github.com/All-Hands-AI/agent-sdk/blob/main/openhands/sdk/llm/llm.py#L93):
```json
{
"model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
"base_url": "https://llm-proxy.eval.all-hands.dev",
"api_key": "YOUR_API_KEY"
}
```
Validate with: `uv run validate-cfg .llm_config/your-config.json`

# Data Format Conversions
When converting between OpenHands format and benchmark-specific formats:
- Preserve all required fields for evaluation
- Handle missing/optional fields gracefully
- Log conversion warnings for debugging
- Validate output format before evaluation (a conversion sketch follows this list)
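
A sketch of such a conversion, assuming the OpenHands record carries its patch under a `git_patch` key (illustrative) and targeting the standard SWE-Bench prediction keys (`instance_id`, `model_name_or_path`, `model_patch`):

```python
import logging

logger = logging.getLogger(__name__)


def to_swebench_prediction(record: dict, model_name: str) -> dict | None:
    """Convert one OpenHands output record into a SWE-Bench-style prediction."""
    instance_id = record.get("instance_id")
    if not instance_id:
        # Required for evaluation; skip rather than guess, but leave a trace.
        logger.warning("Skipping record without instance_id: %r", record)
        return None

    # Missing or empty patches are handled gracefully as an empty string.
    patch = record.get("git_patch") or ""
    if not patch:
        logger.warning("Empty patch for %s", instance_id)

    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }
```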
</BENCHMARK_SPECIFIC>
11 changes: 11 additions & 0 deletions .openhands/setup.sh
@@ -0,0 +1,11 @@
#!/bin/bash

if ! command -v uv &> /dev/null; then
echo "uv is not installed. Installing..."
curl -LsSf https://astral.sh/uv/install.sh | sh
else
echo "uv is already installed."
uv self update # always update to the latest version
fi

make build
20 changes: 19 additions & 1 deletion README.md
@@ -100,7 +100,7 @@ uv run benchmarks/swe_bench/build_images.py \
```


### 3. Run SWE-Bench Evaluation
### 3. Run SWE-Bench Inference
```bash
# Run evaluation with your configured LLM
uv run swebench-infer .llm_config/example.json \
@@ -134,6 +134,24 @@ python -m benchmarks.swe_bench.run_infer \

This will only evaluate the instances listed in the file.

### 5. Evaluate SWE-Bench Results
After running inference, evaluate the results with the official SWE-Bench evaluation harness:

```bash
# Convert output format and run SWE-Bench evaluation
uv run swebench-eval output.jsonl

# Or specify custom dataset and output file
uv run swebench-eval output.jsonl --dataset princeton-nlp/SWE-bench_Lite --output-file results.swebench.jsonl

# Only convert format without running evaluation
uv run swebench-eval output.jsonl --skip-evaluation
```

The script will:
1. Convert OpenHands output format to SWE-Bench prediction format
2. Run the official SWE-Bench evaluation harness

## Links

- **Original OpenHands**: https://github.com/All-Hands-AI/OpenHands/