Description
buildx-trace.txt
docker-trace.txt
Contributing guidelines and issue reporting guide
- I've read the contributing guidelines and wholeheartedly agree. I've also read the issue reporting guide.
Well-formed report checklist
- I have found a bug and the documentation does not mention anything about my problem
- I have found a bug and there are no open or closed issues related to my problem
- I have provided version information about my environment and done my best to provide a reproducer
Description of bug
In our CI environments, there have been occurrences of a docker build -t <...> . that appears to build and save successfully but never actually finishes. The command looks "stuck"; build timeouts eventually kill the process (typically after 30 minutes). When this occurs, all other builds for the same project behave the same way 100% of the time (they "build" just fine, but never finish). Builds from other projects appear unaffected.
This is very similar to #6131, and quite possibly a true duplicate. But because the tooling is different, I thought it better to create a new issue (and also so it can be moved to moby, if that's where this belongs).
Reproduction
Unfortunately, I have not been able to reliably reproduce this. We've seen four occurrences of this in the wild across two versions. There were no obvious commonalities among the affected projects, and none of them are overly complex (not intensive to build, not using advanced BuildKit features/flags/options, low line count, using the default builder, etc.). Within the last month or two we moved to the containerd image store and had never encountered the issue prior to that; I'm not sure if that's relevant or just a coincidence.
I have only managed to reproduce this in a controlled fashion once, and it took over 16k iterations over two days. My initial guess was that a previous build timeout had left docker/buildkit in a bad state, so I created a Dockerfile that takes ten seconds to build and a CI job that times out (thus killing the docker client) every third build; a rough sketch of that setup follows the sample output below. Sample output:
Timeout set to expire in 15 sec
+ docker buildx build --build-arg '_num=23833' -t junk --load .
#0 building with "default" instance using docker driver
#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 128B done
#1 DONE 0.1s
#2 [internal] load metadata for docker.io/library/alpine:latest
#2 DONE 0.1s
#3 [internal] load .dockerignore
#3 DONE 0.0s
#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.1s
#4 [1/3] FROM docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1
#4 resolve docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1 0.0s done
#4 CACHED
#5 [2/3] RUN echo "23833" > /tmp/num
#5 DONE 0.4s
#6 [3/3] RUN sleep 10
#6 DONE 10.4s
#7 exporting to image
#7 exporting layers
#7 exporting layers 0.3s done
#7 exporting manifest sha256:9e9676d99c89acd961d0d4ebef2372c0e5a8193301e406da83c665b869d8b366 0.0s done
#7 exporting config sha256:f30382e2a1fe78ec2ba8eb8d3d020612bfadbafd9ec27e83b50a9c879bc19742 0.0s done
#7 exporting attestation manifest sha256:3ef78c1a99d0c4a4224abe6a557c9bdc08bf4635c1099eab8acf499d2c775122 0.1s done
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4 0.0s done
#7 naming to docker.io/library/junk:latest done
#7 unpacking to docker.io/library/junk:latest 0.1s done
#7 DONE 0.6s
<--------- This is where the build hangs indefinitely until killed ----------->
Cancelling nested steps due to timeout
Sending interrupt signal to process
ERROR: failed to build: failed to solve: Canceled: context canceled
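For reference, here is a minimal sketch of the reproduction setup described above; the real Dockerfile and wrapper live in our CI, so treat the file contents, loop, and timeout value as approximations reconstructed from the description and sample output, not the literal scripts.

```shell
# Trivial ~10s build; the _num build-arg keeps the RUN layers uncacheable.
cat > Dockerfile <<'EOF'
FROM alpine:latest
ARG _num
RUN echo "${_num}" > /tmp/num
RUN sleep 10
EOF

# Drive builds in a loop, killing the docker client on every third iteration.
# The 15-second SIGINT timeout mirrors the "Timeout set to expire in 15 sec"
# line in the sample output; tune it so it reliably fires on the iterations
# you want killed.
i=0
while true; do
  i=$((i + 1))
  if [ $((i % 3)) -eq 0 ]; then
    timeout --signal=INT 15 docker buildx build --build-arg "_num=${i}" -t junk --load .
  else
    docker buildx build --build-arg "_num=${i}" -t junk --load .
  fi
done
```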
My other guess is that our (docker) cleanup routines are involved (i.e., concurrent adds/deletes).
All the observations below strongly suggest that the builds that "hang" until killed are still running (conceptually and literally).
Build history
As mentioned, it took 16k builds before I was able to trigger the issue, which happened on Saturday night. I didn't notice until Monday morning, after another 7k builds had run, which is why some of the figures below are so extreme.
On the host there is virtually no docker-related activity, yet there are over 5k "running" buildkit builds (which I assume are all related to this reproduction):
# docker buildx history ls --filter status=running | sed 1d | wc -l
5026
# docker buildx history ls --filter status=running | sed 1d | head -n5
qn92ozxmdey4l9er97jouw9vo Running 34 hours ago 2055m 31s+
yyld3tjznvz9fcodvnpj1ea8p Running 34 hours ago 2055m 52s+
zejsagmbp47kl4qzhgmpwta6c Running 34 hours ago 2056m 25s+
o6fwaq4rdhiuzro4eysdj0lbr Running 34 hours ago 2056m 46s+
y3a2tbfvbt38o8z524md6y40b Running 34 hours ago 2057m 18s+
# docker buildx history ls --filter status=running | sed 1d | tail -n5
rne18nqs4v0xk3o2di0u5n8ue Running 3 days ago 4327m 23s+
y86jnwnej9ml6auoan9iq9rgw Running 3 days ago 4327m 47s+
km7ac8r8awaf5xr3oojpapnzc Running 3 days ago 4328m 20s+
frwu4n2n0k1fr1goxy1o8k33d Running 3 days ago 4328m 45s+
wtg2hh8l4mlrtrdtnny5vy08a Running 3 days ago 4329m 44s+
<-- oldest entry -->
# docker buildx history inspect wtg2hh8l4mlrtrdtnny5vy08a
Dockerfile: Dockerfile
Started: 2025-09-06 19:25:28
Duration: 4331m 4s (running)
Build Steps: 0/0 (NaN% cached)
Image Resolve Mode: local
BUILD ARG VALUE
_num 16141
<-- newest entry -->
# docker buildx history inspect y3a2tbfvbt38o8z524md6y40b
Dockerfile: Dockerfile
Started: 2025-09-08 09:17:34
Duration: 2058m 26s (running)
Build Steps: 0/0 (NaN% cached)
Image Resolve Mode: local
BUILD ARG VALUE
_num 23828
Cache
The buildkit cache has an extremely high number of non-reclaimable entries that aren't in use, matching the behavior in the linked ticket:
# docker buildx du | awk '{if ($2 == "false" && NF == 3) { print $0 }}' | wc -l
20106
Performance
Despite the host being effectively idle (no actual build activity), the docker daemon is eating up about a third (4GB) of the host's memory and roughly 50% of the CPU (i.e., 400% on an eight-core machine).
I took a stack trace dump of dockerd (which appears to be the equivalent of the pprof-style format you guys like); a note on how such a dump can be produced follows the listing below. It's humongous: the docs say GitHub doesn't allow attachments that big, and I don't know whether you'd actually want it (but let me know):
# ls -hl /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
-rw-r--r-- 1 root root 43M Sep 8 16:53 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# wc -l /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
526979 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# grep '^goroutine ' /var/run/docker/goroutine-stacks-2025-09-08T165305-0600.log | wc -l
32646
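(For what it's worth, this kind of dump can be captured on demand: dockerd writes a goroutine-stacks-<timestamp>.log file under its exec root when it receives SIGUSR1, so reproducing the diagnostics on another host looks roughly like this.)

```shell
# Ask dockerd to dump all goroutine stacks, then locate the resulting file
# (written under the daemon's exec root, /run/docker on this host).
kill -USR1 "$(pidof dockerd)"
ls -hl /run/docker/goroutine-stacks-*.log
```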
This seemed pretty telling to me: counting the goroutines blocked in sync.Mutex.Lock inside the bbolt cache storage, split between its Begin and Rollback paths (examples below), gives 5026. That number matches up exactly with the number of so-called running buildkit builds; it seems as if they can't agree on what needs to be done. I haven't taken multiple snapshots to see if that distribution changes (i.e., whether they are constantly undoing each other's work, or whether it's closer to an actual deadlock), but I can if it'll help.
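If the split itself is useful, it can be re-derived from the dump with something along these lines (the grep patterns are my approximation, matching the frames in the example stacks below):

```shell
# Count goroutines parked on the bbolt cache-storage mutex, split by whether
# they are waiting to begin a read transaction or to finish rolling one back.
DUMP=/run/docker/goroutine-stacks-2025-09-08T165305-0600.log
grep -Fc 'bbolt.(*DB).beginTx'  "$DUMP"   # the 'Begin' side
grep -Fc 'bbolt.(*DB).removeTx' "$DUMP"   # the 'Rollback' side
```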
Here's an example from the 'Begin' side:
goroutine 2050617244 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576571a46c0?, 0xd0?, 0xc00087cf5c?)
/usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
/usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
/usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).beginTx(0xc00087cd88)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:772 +0x46
go.etcd.io/bbolt.(*DB).Begin(0xc00087cd88, 0x0)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:758 +0x1cc
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc08ed3aa20)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:923 +0x30
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc080653040, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee0505e1c?, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc0499bce40, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc0714d4690})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc085117b00, {0x557655b5c748, 0xc0714d4690})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc02d2e80e0?}, {0x557655b62a40, 0xc06f685360}, 0xc0acc7cf00)
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
/root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 2050616659
/root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76
and the 'Rollback' side:
goroutine 1964312991 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576558c7a40?, 0xaa?, 0x34?)
/usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
/usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
/usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).removeTx(0xc00087cd88, 0xc01b6621c0)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:857 +0x76
go.etcd.io/bbolt.(*Tx).close(0xc01b6621c0)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:368 +0x168
go.etcd.io/bbolt.(*Tx).nonPhysicalRollback(0xc01b6621c0?)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:319 +0x4d
go.etcd.io/bbolt.(*Tx).Rollback(0xc00087cd88?)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:307 +0x25
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc0e2f9c5d0)
/root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:942 +0x8c
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc027a73580, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee04f1135?, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc031e0d700, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc03ba7cde0})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc03ba7d5f0, {0x557655b5c748, 0xc03ba7cde0})
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc0001168a8?}, {0x557655b62a40, 0xc032181b80}, 0xc026b760c0)
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
/root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
/root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 1964312482
/root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76
Full stack dumps for the docker client itself and the buildx plugin (taken across two executions, by necessity) while they were "hung" are attached above (docker-trace.txt and buildx-trace.txt). I can probably keep this host in its current state for another day if there are diagnostics you'd find useful. If you want the entire trace dump, I can put it in a repo or something.
Version information
Two of the five occurrences were on this version, the rest on an earlier version (probably 28.3.2).
# docker version
Client: Docker Engine - Community
Version: 28.3.3
API version: 1.51
Go version: go1.24.5
Git commit: 980b856
Built: Fri Jul 25 11:37:02 2025
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 28.3.3
API version: 1.51 (minimum version 1.24)
Go version: go1.24.5
Git commit: bea959c
Built: Fri Jul 25 11:33:59 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.27
GitCommit: 05044ec0a9a75232cad458027ca83437aae3f4da
runc:
Version: 1.2.5
GitCommit: v1.2.5-0-g59923ef
docker-init:
Version: 0.19.0
GitCommit: de40ad0
# docker buildx version
github.com/docker/buildx v0.26.1 1a8287f
# docker buildx inspect
Name: default
Driver: docker
Last Activity: 2025-09-10 03:00:06 +0000 UTC
Nodes:
Name: default
Endpoint: default
Status: running
BuildKit version: v0.23.2
Platforms: linux/amd64, linux/amd64/v2, linux/amd64/v3
Labels:
org.mobyproject.buildkit.worker.containerd.namespace: moby
org.mobyproject.buildkit.worker.containerd.uuid: 3343a640-79a9-499a-ab40-db2d41946a35
org.mobyproject.buildkit.worker.executor: containerd
org.mobyproject.buildkit.worker.hostname: ####################
org.mobyproject.buildkit.worker.moby.host-gateway-ip: 172.17.0.1
org.mobyproject.buildkit.worker.network: host
org.mobyproject.buildkit.worker.selinux.enabled: false
org.mobyproject.buildkit.worker.snapshotter: overlayfs
GC Policy rule#0:
All: false
Filters: type==source.local,type==exec.cachemount,type==source.git.checkout
Keep Duration: 48h0m0s
Max Used Space: 1.029GiB
GC Policy rule#1:
All: false
Keep Duration: 1440h0m0s
Reserved Space: 7.451GiB
Max Used Space: 953.7MiB
Min Free Space: 3.725GiB
GC Policy rule#2:
All: false
Reserved Space: 7.451GiB
Max Used Space: 953.7MiB
Min Free Space: 3.725GiB
GC Policy rule#3:
All: true
Reserved Space: 7.451GiB
Max Used Space: 953.7MiB
Min Free Space: 3.725GiB
# docker info
Client: Docker Engine - Community
Version: 28.3.3
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.26.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.39.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 3
Running: 3
Paused: 0
Stopped: 0
Images: 7
Server Version: 28.3.3
Storage Driver: overlayfs
driver-type: io.containerd.snapshotter.v1
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/etc/cdi
/var/run/cdi
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
runc version: v1.2.5-0-g59923ef
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.14.0-570.32.1.el9_6.x86_64
Operating System: Rocky Linux 9.6 (Blue Onyx)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 11.43GiB
Name: #################
ID: 641d878b-c0e9-486a-b05a-d275591b7c29
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 20162
Goroutines: 32871
System Time: 2025-09-09T20:45:43.785297833-06:00
EventsListeners: 0
Experimental: false
Insecure Registries:
::1/128
127.0.0.0/8
Registry Mirrors:
#################
Live Restore Enabled: false