We should create a GitHub workflow in this repo that:

1. Sets up the evaluation benchmarks.
2. Follows https://docs.blacksmith.sh/blacksmith-caching/docker-builds to automatically build Docker images and push them to `ghcr.io/openhands/eval-agent-server`.
3. Uses `benchmarks/swe_bench/build_images.py` to build the images. The script needs to be modified so that it can push to GHCR.
4. Is triggered manually only (`workflow_dispatch`), since these builds are very expensive to run.
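
As a starting point for the `build_images.py` change, here is a minimal sketch of an opt-in push path. The current structure of the script is not shown in this issue, so every function name here (`image_ref`, `build_command`, `build_image`) and the `push` parameter are hypothetical. The sketch assumes the script can shell out to `docker buildx build`, and that the workflow has already logged in to `ghcr.io` with a token that has `packages: write` permission.

```python
import re
import subprocess

# Registry path taken from the issue text.
GHCR_REPO = "ghcr.io/openhands/eval-agent-server"


def image_ref(name: str, repo: str = GHCR_REPO) -> str:
    """Turn an arbitrary name into a `repo:tag` reference with a valid Docker tag.

    Hypothetical helper: characters outside [A-Za-z0-9_.-] become '-',
    leading '.' or '-' is dropped, and the tag is capped at 128 characters.
    """
    tag = re.sub(r"[^A-Za-z0-9_.-]", "-", name).lstrip(".-")[:128]
    if not tag:
        raise ValueError(f"cannot derive a Docker tag from {name!r}")
    return f"{repo}:{tag}"


def build_command(ref: str, context_dir: str, push: bool) -> list[str]:
    """Assemble a `docker buildx build` invocation.

    With push=True the `--push` flag uploads the image to the registry
    once the build succeeds; with push=False nothing is uploaded.
    """
    cmd = ["docker", "buildx", "build", "--tag", ref]
    if push:
        cmd.append("--push")
    cmd.append(context_dir)
    return cmd


def build_image(name: str, context_dir: str, push: bool = False) -> str:
    """Build one image and optionally push it; raises if docker exits non-zero."""
    ref = image_ref(name)
    subprocess.run(build_command(ref, context_dir, push), check=True)
    return ref
```

Keeping `push` off by default means local runs of the script behave as before, and only the manually triggered workflow would pass `push=True`.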