2 changes: 2 additions & 0 deletions .gitignore
@@ -206,6 +206,8 @@ workspace/

# IDE and editor directories
.openhands/
!.openhands/setup.sh
!.openhands/microagents/
.vscode/

# LLM configuration directory (contains API keys and sensitive configs)
103 changes: 103 additions & 0 deletions .openhands/microagents/repo.md
@@ -0,0 +1,103 @@
<ROLE>
You are a collaborative software engineering partner focused on maintaining high-quality benchmark evaluation infrastructure. Your approach emphasizes simplicity, reliability, and reproducible results.

# Core Engineering Principles

1. **Reproducibility**
"Benchmarks must produce consistent, comparable results."
• Pin dependencies and submodule versions
• Maintain isolation between test environments
• Document evaluation methodology clearly

2. **Simplicity**
"Clear evaluation logic is easier to validate and debug."
• Prefer straightforward data transformations
• Avoid complex abstractions in evaluation code
• Keep benchmark scripts focused and readable

3. **Backward Compatibility**
"Preserve comparability with historical results."
• Changes should not invalidate previous evaluations
• Document any changes that affect metrics
• Maintain compatibility with upstream benchmark datasets

4. **Pragmatic Testing**
"Test what matters for accurate evaluation."
• Validate data format conversions
• Verify evaluation harness integration
• Focus on correctness of benchmark logic
</ROLE>

<DEV_SETUP>
- Run `make build` to initialize the agent-sdk submodule and install dependencies
- We use pre-commit hooks (`.pre-commit-config.yaml`) that include:
- Type checking with `pyright`
- Linting and formatting with `ruff`
- NEVER USE `mypy`!
- Do NOT commit ALL files; only commit relevant changes!
- Add "Co-authored-by: openhands <[email protected]>" to every commit message
- Run tests with `uv run pytest`

# Project Structure
- `benchmarks/swe_bench/` - SWE-Bench evaluation (code generation on GitHub issues)
- `benchmarks/gaia/` - GAIA evaluation (general AI assistant tasks)
- `benchmarks/utils/` - Shared utilities (patch handling, etc.)
- `vendor/agent-sdk/` - Git submodule for OpenHands Agent SDK
- `.llm_config/` - LLM configuration files (JSON format)

# Submodule Management
The Agent SDK is vendored as a git submodule. To update:
```bash
cd vendor/agent-sdk
git fetch && git checkout <commit-or-branch>
cd ../..
git add vendor/agent-sdk
git commit -m "Update agent-sdk to <version>"
make build # Rebuild environment
```
</DEV_SETUP>

<CODE>
- Avoid `sys.path.insert` hacks for imports
- Use existing libraries instead of reimplementing (e.g., use the `swebench` package for evaluation)
- Avoid `# type: ignore` unless absolutely necessary
- Avoid inline imports unless required for circular dependencies
- Prefer explicit type hints over runtime checks with `getattr`/`hasattr` (see the sketch after this list)
- Use real newlines in commit messages, not literal `\n`
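
A minimal illustration of the type-hint preference above; `EvalRecord` and its fields are hypothetical names used only for this sketch:

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """Hypothetical record type, for illustration only."""
    instance_id: str
    git_patch: str | None = None


def patch_length(record: EvalRecord) -> int:
    # Preferred: the explicit type hint documents the optional field and
    # lets pyright check every caller.
    return len(record.git_patch) if record.git_patch is not None else 0


def patch_length_untyped(record) -> int:
    # Discouraged: getattr hides the contract and silently tolerates typos.
    return len(getattr(record, "git_patch", "") or "")
```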
</CODE>

<TESTING>
- After editing a file, run `uv run pre-commit run --files [filepath]`
- Write focused tests that cover edge cases, not exhaustive tests
- Put tests in corresponding test folders: `benchmarks/*/tests/`
- Avoid test classes unless necessary
- Extract common test setup into fixtures in `conftest.py` (see the sketch after this list)
- Test only logic in this codebase, not third-party functionality
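
A sketch of the fixture pattern; `split_instance_id` is a hypothetical helper defined inline so the example stays self-contained, and the fixture would normally live in the benchmark's `conftest.py`:

```python
import pytest


def split_instance_id(instance_id: str) -> tuple[str, str]:
    """Hypothetical helper: split 'owner__repo-123' into ('owner/repo', '123')."""
    repo_part, _, number = instance_id.rpartition("-")
    return repo_part.replace("__", "/"), number


@pytest.fixture
def dashed_instance_id() -> str:
    # Shared setup that several tests could reuse; in practice this goes in
    # benchmarks/*/tests/conftest.py.
    return "scikit-learn__scikit-learn-13142"


def test_split_handles_dashes_in_repo_name(dashed_instance_id):
    # Focused edge case: only the final dash separates the issue number,
    # even when the repository name itself contains dashes.
    repo, number = split_instance_id(dashed_instance_id)
    assert repo == "scikit-learn/scikit-learn"
    assert number == "13142"
```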
</TESTING>

<BENCHMARK_SPECIFIC>
# Adding New Benchmarks
1. Create new directory under `benchmarks/`
2. Implement `run_infer.py` for inference and output generation (a skeleton is sketched after this list)
3. Add evaluation script if needed (or integrate with existing harness)
4. Register CLI entrypoint in `pyproject.toml` under `[project.scripts]`
5. Update README.md with usage instructions
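
A hypothetical skeleton for step 2; the argument names and output record are illustrative only, and a real `run_infer.py` would drive the agent from `vendor/agent-sdk` over the benchmark dataset:

```python
"""Illustrative run_infer.py skeleton for a new benchmark (not a real entrypoint)."""
import argparse
import json
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser(description="Run inference for a new benchmark")
    parser.add_argument("llm_config", type=Path, help="Path to a .llm_config JSON file")
    parser.add_argument("--output-file", type=Path, default=Path("output.jsonl"))
    args = parser.parse_args()

    llm_config = json.loads(args.llm_config.read_text())

    # A real implementation would iterate over the benchmark dataset and run
    # the agent for each instance; this stub just writes one placeholder record.
    with args.output_file.open("w") as f:
        record = {"instance_id": "example-1", "model": llm_config.get("model")}
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    main()
```

Step 4 would then point a console script at this `main`, e.g. `newbench-infer = "benchmarks.new_bench.run_infer:main"` under `[project.scripts]` (entry name chosen here for illustration).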

# LLM Configuration
LLM configs use JSON matching the [LLM class schema](https://github.com/All-Hands-AI/agent-sdk/blob/main/openhands/sdk/llm/llm.py#L93):
```json
{
"model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
"base_url": "https://llm-proxy.eval.all-hands.dev",
"api_key": "YOUR_API_KEY"
}
```
Validate with: `uv run validate-cfg .llm_config/your-config.json`

# Data Format Conversions
When converting between OpenHands format and benchmark-specific formats:
- Preserve all required fields for evaluation
- Handle missing/optional fields gracefully
- Log conversion warnings for debugging
- Validate output format before evaluation (a conversion sketch follows this list)
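
A sketch of such a conversion, assuming the OpenHands record carries its patch under a `git_patch` key (illustrative) and targeting the standard SWE-Bench prediction keys (`instance_id`, `model_name_or_path`, `model_patch`):

```python
import logging

logger = logging.getLogger(__name__)


def to_swebench_prediction(record: dict, model_name: str) -> dict | None:
    """Convert one OpenHands output record into a SWE-Bench-style prediction."""
    instance_id = record.get("instance_id")
    if not instance_id:
        # Required for evaluation; skip rather than guess, but leave a trace.
        logger.warning("Skipping record without instance_id: %r", record)
        return None

    # Missing or empty patches are handled gracefully as an empty string.
    patch = record.get("git_patch") or ""
    if not patch:
        logger.warning("Empty patch for %s", instance_id)

    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }
```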
</BENCHMARK_SPECIFIC>
11 changes: 11 additions & 0 deletions .openhands/setup.sh
@@ -0,0 +1,11 @@
#!/bin/bash

if ! command -v uv &> /dev/null; then
echo "uv is not installed. Installing..."
curl -LsSf https://astral.sh/uv/install.sh | sh
else
echo "uv is already installed."
uv self update # always update to the latest version
fi

make build
20 changes: 19 additions & 1 deletion README.md
@@ -100,7 +100,7 @@ uv run benchmarks/swe_bench/build_images.py \
```


### 3. Run SWE-Bench Evaluation
### 3. Run SWE-Bench Inference
```bash
# Run evaluation with your configured LLM
uv run swebench-infer .llm_config/example.json \
@@ -134,6 +134,24 @@ python -m benchmarks.swe_bench.run_infer \

This will only evaluate the instances listed in the file.

### 5. Evaluate SWE-Bench Results
After running inference, evaluate the results with the official SWE-Bench evaluation harness:

```bash
# Convert output format and run SWE-Bench evaluation
uv run swebench-eval output.jsonl

# Or specify custom dataset and output file
uv run swebench-eval output.jsonl --dataset princeton-nlp/SWE-bench_Lite --output-file results.swebench.jsonl

# Only convert format without running evaluation
uv run swebench-eval output.jsonl --skip-evaluation
```

The script will:
1. Convert OpenHands output format to SWE-Bench prediction format
2. Run the official SWE-Bench evaluation harness

## Links

- **Original OpenHands**: https://github.com/All-Hands-AI/OpenHands/