
Image export to local store via buildx hangs indefinitely #6209

@rrjjvv

Description


buildx-trace.txt
docker-trace.txt

Contributing guidelines and issue reporting guide

Well-formed report checklist

  • I have checked that the documentation does not mention anything about my problem
  • I have checked that there are no open or closed issues related to my problem
  • I have provided version/environment information and done my best to provide a reproducer

Description of bug

In our CI environments, there have been occurrences of a docker build -t <...> . appearing to build and save successfully, but never actually finishing. The command appears "stuck"; build timeouts eventually kill the process (typically after 30 minutes). When this occurs, all other builds for the same project behave the same way ("build" just fine, but never finish) 100% of the time. Builds from other projects appear unaffected.

This is very similar to #6131, and quite possibly a true duplicate. But because the tooling is different, I thought it better to create a new issue (and also so it can be moved to moby, if that's where this belongs).

Reproduction

Unfortunately, I have not been able to reliably reproduce this. We've seen four occurrences in the wild, across two versions. There were no obvious commonalities, and none of the affected projects is overly complex (not intensive to build, not using advanced buildkit features/flags/options, low line count, using the default builder, etc.). Within the last month or two we moved to the containerd image store, and we had never encountered the issue before that; I'm not sure whether that's relevant or just coincidence.

I have only managed to reproduce this in a controlled fashion once, and it took over 16k iterations across two days. My initial guess was that a previous build timeout had caused docker/buildkit to get into a bad state, so I created a Dockerfile that takes ten seconds to build, then created a CI job that would time out (thus killing the docker client) every third build. A minimal sketch of that setup is below.
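
The Dockerfile is reconstructed from the build steps visible in the output; the driver loop is only an illustrative approximation of the real CI job (the exact timeout mechanics differ):

# Dockerfile (reconstructed from the build steps below)
FROM alpine:latest
ARG _num
RUN echo "${_num}" > /tmp/num
RUN sleep 10

# driver loop (illustrative; the real job is a CI pipeline with a build timeout)
num=0
while :; do
  num=$((num + 1))
  t=60; [ $((num % 3)) -eq 0 ] && t=5   # every third build is killed mid-build
  timeout "$t" docker buildx build --build-arg "_num=$num" -t junk --load . || true
done

Sample output: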

Timeout set to expire in 15 sec
+ docker buildx build --build-arg '_num=23833' -t junk --load .
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 128B done
#1 DONE 0.1s

#2 [internal] load metadata for docker.io/library/alpine:latest
#2 DONE 0.1s

#3 [internal] load .dockerignore
#3 DONE 0.0s

#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.1s

#4 [1/3] FROM docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1
#4 resolve docker.io/library/alpine:latest@sha256:4bcff63911fcb4448bd4fdacec207030997caf25e9bea4045fa6c8c44de311d1 0.0s done
#4 CACHED

#5 [2/3] RUN echo "23833" > /tmp/num
#5 DONE 0.4s

#6 [3/3] RUN sleep 10
#6 DONE 10.4s

#7 exporting to image
#7 exporting layers
#7 exporting layers 0.3s done
#7 exporting manifest sha256:9e9676d99c89acd961d0d4ebef2372c0e5a8193301e406da83c665b869d8b366 0.0s done
#7 exporting config sha256:f30382e2a1fe78ec2ba8eb8d3d020612bfadbafd9ec27e83b50a9c879bc19742 0.0s done
#7 exporting attestation manifest sha256:3ef78c1a99d0c4a4224abe6a557c9bdc08bf4635c1099eab8acf499d2c775122 0.1s done
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4
#7 exporting manifest list sha256:f6acd982e9dd4354fa32522bc19cbd954106d87287e35d808cebcb7f3e15fda4 0.0s done
#7 naming to docker.io/library/junk:latest done
#7 unpacking to docker.io/library/junk:latest 0.1s done
#7 DONE 0.6s
<---------  This is where the build hangs indefinitely until killed  ----------->
Cancelling nested steps due to timeout
Sending interrupt signal to process
ERROR: failed to build: failed to solve: Canceled: context canceled

My other guess is that our (docker) cleanup routines are involved (i.e., concurrent adds/deletes); a sketch of what those look like is below.
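
Roughly, those routines amount to periodic pruning along these lines (simplified and illustrative; the exact commands and filters in our jobs differ):

docker image prune --force --filter "until=48h"
docker builder prune --force --filter "until=48h"

If one of those fires while an export is in flight, that would be the concurrent add/delete scenario I mean.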

All the observations below strongly suggest that the builds that "hang" until killed are still running (conceptually and literally).

Build history

As mentioned, it took 16k builds before I was able to trigger the issue, which happened on a Saturday night. I didn't notice until Monday morning, after another 7k builds had run, which is why some of the figures below are so extreme.

On the host there is virtually no docker-related activity, yet there are over 5k "running" buildkit builds (which I assume are all related to this reproduction):

# docker buildx history ls --filter status=running | sed 1d | wc -l
5026
# docker buildx history ls --filter status=running | sed 1d | head -n5
qn92ozxmdey4l9er97jouw9vo             Running   34 hours ago   2055m 31s+   
yyld3tjznvz9fcodvnpj1ea8p             Running   34 hours ago   2055m 52s+   
zejsagmbp47kl4qzhgmpwta6c             Running   34 hours ago   2056m 25s+   
o6fwaq4rdhiuzro4eysdj0lbr             Running   34 hours ago   2056m 46s+   
y3a2tbfvbt38o8z524md6y40b             Running   34 hours ago   2057m 18s+
# docker buildx history ls --filter status=running | sed 1d | tail -n5
rne18nqs4v0xk3o2di0u5n8ue             Running   3 days ago     4327m 23s+   
y86jnwnej9ml6auoan9iq9rgw             Running   3 days ago     4327m 47s+   
km7ac8r8awaf5xr3oojpapnzc             Running   3 days ago     4328m 20s+   
frwu4n2n0k1fr1goxy1o8k33d             Running   3 days ago     4328m 45s+   
wtg2hh8l4mlrtrdtnny5vy08a             Running   3 days ago     4329m 44s+

<-- oldest entry -->
# docker buildx history inspect wtg2hh8l4mlrtrdtnny5vy08a
Dockerfile:	Dockerfile

Started:	2025-09-06 19:25:28
Duration:	4331m  4s (running)
Build Steps:	0/0 (NaN% cached)

Image Resolve Mode:	local

BUILD ARG	VALUE
_num		16141

<-- newest entry -->
# docker buildx history inspect y3a2tbfvbt38o8z524md6y40b
Dockerfile:	Dockerfile

Started:	2025-09-08 09:17:34
Duration:	2058m 26s (running)
Build Steps:	0/0 (NaN% cached)

Image Resolve Mode:	local

BUILD ARG	VALUE
_num		23828

Cache

The buildkit cache has an extremely high number of non-reclaimable entries that aren't in use, matching the behavior in the linked ticket:

# docker buildx du | awk '{if ($2 == "false" && NF == 3) { print $0 }}' | wc -l
20106

Performance

Despite the host being effectively idle (no actual build activity), the docker daemon is eating up a third (4GB) of the host's memory and 50% of the CPU (i.e., 400% on an eight-core machine).

I took a stacktrace dump of dockerd (which appears to be the equivalent of the pprof format you normally ask for). It's humongous; the docs say GitHub doesn't allow attachments that big, and I didn't know whether you'd actually want it (let me know):

# ls -hl /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
-rw-r--r-- 1 root root 43M Sep  8 16:53 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# wc -l /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
526979 /run/docker/goroutine-stacks-2025-09-08T165305-0600.log
# grep '^goroutine ' /var/run/docker/goroutine-stacks-2025-09-08T165305-0600.log | wc -l
32646
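
For reference, the dump was produced by sending SIGUSR1 to the daemon, which (as far as I know) is the standard mechanism for making dockerd write its goroutine stacks into its exec root:

kill -USR1 "$(pidof dockerd)"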

This seemed pretty telling to me:

[screenshot: aggregated goroutine counts from the stack dump, showing 5026 goroutines blocked at the bbolt 'Begin'/'Rollback' call sites detailed below]

That number (5026) matches up exactly with the number of so-called running buildkit builds; it seems as if they can't agree on what needs to be done. I haven't taken multiple snapshots to see whether that distribution changes over time (i.e., whether they are constantly undoing each other's work, or whether it's closer to an actual deadlock), but I can if it would help (I'd aggregate each snapshot roughly as sketched below).
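
A quick-and-dirty way to count the goroutines parked at the two bbolt call sites, something like (the pattern may need adjusting to the exact dump format):

grep -oE 'bbolt\.\(\*DB\)\.(beginTx|removeTx)' /run/docker/goroutine-stacks-*.log | sort | uniq -c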

Here's an example from the 'Begin' side:

goroutine 2050617244 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576571a46c0?, 0xd0?, 0xc00087cf5c?)
        /usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
        /usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).beginTx(0xc00087cd88)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:772 +0x46
go.etcd.io/bbolt.(*DB).Begin(0xc00087cd88, 0x0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:758 +0x1cc
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc08ed3aa20)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:923 +0x30
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc080653040, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee0505e1c?, {0x557655b5c748, 0xc085117d10}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc0499bce40, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc085117c50}, {0x557655b55be0, 0xc0202dee00}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc0714d4690})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc085117b00, {0x557655b5c748, 0xc0714d4690})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc02d2e80e0?}, {0x557655b62a40, 0xc06f685360}, 0xc0acc7cf00)
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 2050616659
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76

and the 'Rollback' side:

goroutine 1964312991 [sync.Mutex.Lock]:
internal/sync.runtime_SemacquireMutex(0x5576558c7a40?, 0xaa?, 0x34?)
        /usr/local/go/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0xc00087cf58)
        /usr/local/go/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
        /usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:46
go.etcd.io/bbolt.(*DB).removeTx(0xc00087cd88, 0xc01b6621c0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:857 +0x76
go.etcd.io/bbolt.(*Tx).close(0xc01b6621c0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:368 +0x168
go.etcd.io/bbolt.(*Tx).nonPhysicalRollback(0xc01b6621c0?)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:319 +0x4d
go.etcd.io/bbolt.(*Tx).Rollback(0xc00087cd88?)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/tx.go:307 +0x25
go.etcd.io/bbolt.(*DB).View(0x557654026f50?, 0xc0e2f9c5d0)
        /root/rpmbuild/BUILD/src/engine/vendor/go.etcd.io/bbolt/db.go:942 +0x8c
github.com/moby/buildkit/solver/bboltcachestorage.(*Store).Load(0xc000b02710, {0xc011ce87d0, 0x47}, {0xc036e7a3c0, 0x34})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/bboltcachestorage/storage.go:124 +0xd3
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc01f319440, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:125 +0x58f
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc027a73580, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0xee04f1135?, {0x557655b5c748, 0xc03ba7d740}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver.(*exporter).ExportTo(0xc031e0d700, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:187 +0xef7
github.com/moby/buildkit/solver.(*mergedExporter).ExportTo(0x557655b62a40?, {0x557655b5c748, 0xc03ba7d680}, {0x557655b55be0, 0xc028e116a0}, {0x557655b1e158, 0x2, {0x0, 0x0}, 0x0, ...})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/exporter.go:275 +0x131
github.com/moby/buildkit/solver/llbsolver.NewProvenanceCreator.func1({0x557655b5c748, 0xc03ba7cde0})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:421 +0x1a4
github.com/moby/buildkit/solver/llbsolver.(*ProvenanceCreator).Predicate(0xc03ba7d5f0, {0x557655b5c748, 0xc03ba7cde0})
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/provenance.go:478 +0xb5
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.2({0x557654ec97c9?, 0xc0001168a8?}, {0x557655b62a40, 0xc032181b80}, 0xc026b760c0)
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:248 +0x20b
github.com/moby/buildkit/solver/llbsolver.(*Solver).recordBuildHistory.func1.3()
        /root/rpmbuild/BUILD/src/engine/vendor/github.com/moby/buildkit/solver/llbsolver/solver.go:286 +0x5d
golang.org/x/sync/errgroup.(*Group).add.func1()
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:130 +0x7e
created by golang.org/x/sync/errgroup.(*Group).add in goroutine 1964312482
        /root/rpmbuild/BUILD/src/engine/vendor/golang.org/x/sync/errgroup/errgroup.go:98 +0x76

Full dumps of the docker client itself and the buildx plugin (necessarily captured across two executions) while they are "hung" are attached.

I can probably keep this host in its current state for another day if there are diagnostics you'd find useful. If you want the entire stack dump, I can put it in a repo or something.

Version information

Two of the five occurrences were on this version, the rest on an earlier version (probably 28.3.2).

# docker version
Client: Docker Engine - Community
 Version:           28.3.3
 API version:       1.51
 Go version:        go1.24.5
 Git commit:        980b856
 Built:             Fri Jul 25 11:37:02 2025
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          28.3.3
  API version:      1.51 (minimum version 1.24)
  Go version:       go1.24.5
  Git commit:       bea959c
  Built:            Fri Jul 25 11:33:59 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:        05044ec0a9a75232cad458027ca83437aae3f4da
 runc:
  Version:          1.2.5
  GitCommit:        v1.2.5-0-g59923ef
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

# docker buildx version 
github.com/docker/buildx v0.26.1 1a8287f

# docker buildx inspect 
Name:          default
Driver:        docker
Last Activity: 2025-09-10 03:00:06 +0000 UTC

Nodes:
Name:             default
Endpoint:         default
Status:           running
BuildKit version: v0.23.2
Platforms:        linux/amd64, linux/amd64/v2, linux/amd64/v3
Labels:
 org.mobyproject.buildkit.worker.containerd.namespace: moby
 org.mobyproject.buildkit.worker.containerd.uuid:      3343a640-79a9-499a-ab40-db2d41946a35
 org.mobyproject.buildkit.worker.executor:             containerd
 org.mobyproject.buildkit.worker.hostname:             ####################
 org.mobyproject.buildkit.worker.moby.host-gateway-ip: 172.17.0.1
 org.mobyproject.buildkit.worker.network:              host
 org.mobyproject.buildkit.worker.selinux.enabled:      false
 org.mobyproject.buildkit.worker.snapshotter:          overlayfs
GC Policy rule#0:
 All:            false
 Filters:        type==source.local,type==exec.cachemount,type==source.git.checkout
 Keep Duration:  48h0m0s
 Max Used Space: 1.029GiB
GC Policy rule#1:
 All:            false
 Keep Duration:  1440h0m0s
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB
GC Policy rule#2:
 All:            false
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB
GC Policy rule#3:
 All:            true
 Reserved Space: 7.451GiB
 Max Used Space: 953.7MiB
 Min Free Space: 3.725GiB

# docker info 
Client: Docker Engine - Community
 Version:    28.3.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.39.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 7
 Server Version: 28.3.3
 Storage Driver: overlayfs
  driver-type: io.containerd.snapshotter.v1
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /etc/cdi
  /var/run/cdi
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05044ec0a9a75232cad458027ca83437aae3f4da
 runc version: v1.2.5-0-g59923ef
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.14.0-570.32.1.el9_6.x86_64
 Operating System: Rocky Linux 9.6 (Blue Onyx)
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 11.43GiB
 Name: #################
 ID: 641d878b-c0e9-486a-b05a-d275591b7c29
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 20162
  Goroutines: 32871
  System Time: 2025-09-09T20:45:43.785297833-06:00
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  ::1/128
  127.0.0.0/8
 Registry Mirrors:
  #################
 Live Restore Enabled: false
