[BUG] agent-stack-k8s does not work with ubuntu based agent images #808

@seemethere

Describe the bug

agent-stack-k8s Alpine Hardcoding Issue

Version: v0.37.0 (current latest)

Location: internal/controller/scheduler/scheduler.go

Problem

The controller hardcodes Alpine-specific shell and commands, making it impossible to use Ubuntu/Debian-based agent images:

  1. Shell hardcoded to ash (Alpine shell):

     // Line 987 - copy-agent init container
     Command: []string{"ash"}
     Args:    []string{"-cefx", containerArgs.String()}

     // Lines 1111, 1129, 1142 - checkout container
     Command: []string{"ash", "-c"}

  2. Alpine-specific user/group commands:

     // Lines 1112-1115
     checkoutContainer.Args = []string{fmt.Sprintf(`set -exufo pipefail
     addgroup -g %d buildkite-agent
     adduser -D -u %d -G buildkite-agent -h /workspace buildkite-agent
     su buildkite-agent -c "%s && buildkite-agent-entrypoint kubernetes-bootstrap"`,

     • addgroup / adduser -D are BusyBox/Alpine commands
     • Ubuntu/Debian use groupadd / useradd
Why This Matters

Some Kubernetes environments require the use-vc option in resolv.conf to force DNS queries over TCP. musl libc (Alpine) does not support use-vc, so DNS resolution fails there, while glibc-based images (Ubuntu, Rocky) work correctly. More generally, I think all images published by buildkite/agent should be compatible with this stack.
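To confirm whether a given pod is actually affected, it can help to check that use-vc is present in the resolver configuration the container sees. The following is a minimal sketch; `has_use_vc` is a hypothetical helper name, and it assumes the option appears on an `options` line as in a standard resolv.conf.

```shell
#!/bin/sh
# has_use_vc is a hypothetical helper name used only for this example.
# It succeeds when an "options" line in the given resolv.conf-style file
# lists use-vc as a separate word. Defaults to /etc/resolv.conf.
has_use_vc() {
  grep -Eq '^[[:space:]]*options([[:space:]]+[^[:space:]]+)*[[:space:]]+use-vc([[:space:]]|$)' \
    "${1:-/etc/resolv.conf}"
}
```

Inside a running container, `has_use_vc && echo "use-vc is set"` would show whether the pod's dnsConfig reached the resolver file.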

Requested Enhancement

Add a configuration option to specify the shell, and use portable user creation, or detect the image type and adapt accordingly. Example:

config:
  shell: "/bin/bash"  # or auto-detect
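As a rough illustration of the auto-detect idea, the sketch below picks the first shell from a list of candidates that exists in the image. `first_available_shell` is a hypothetical helper, not part of agent-stack-k8s. It also assumes the controller would start the container with /bin/sh, which both Alpine and Ubuntu images provide, because detection cannot run at all if the container command itself is a missing binary such as ash.

```shell
#!/bin/sh
# first_available_shell is a hypothetical helper used only for this example.
# It prints the location of the first candidate found on PATH and returns 0,
# or returns 1 when none of the candidates exist.
first_available_shell() {
  for candidate in "$@"; do
    if path="$(command -v "$candidate" 2>/dev/null)"; then
      echo "$path"
      return 0
    fi
  done
  return 1
}
```

For example, `first_available_shell bash ash sh` would prefer bash where it is installed and fall back to ash or sh otherwise.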

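Or use a more portable approach that skips creation when the group or user already exists. Note that groupadd/useradd come from shadow-utils rather than POSIX, and the default Alpine image does not ship them, so an addgroup/adduser fallback would still be needed there: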

# Instead of Alpine-specific adduser/addgroup
getent group buildkite-agent || groupadd -g $GID buildkite-agent
getent passwd buildkite-agent || useradd -u $UID -g buildkite-agent -d /workspace buildkite-agent
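A sketch of how both tool families could be handled from one script follows. It is not the controller's implementation: `AGENT_UID`, `AGENT_GID`, `DRY_RUN` and all function names are hypothetical, and it looks accounts up in /etc/group and /etc/passwd directly so that it does not depend on getent being installed.

```shell
#!/bin/sh
# Sketch: create the buildkite-agent group and user with whichever tools the image ships.
# AGENT_UID, AGENT_GID and DRY_RUN are hypothetical names used only in this example.
AGENT_UID="${AGENT_UID:-1000}"
AGENT_GID="${AGENT_GID:-1000}"

# have: succeeds if the named command is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Local account lookups that do not depend on getent.
group_exists() { grep -q "^$1:" /etc/group 2>/dev/null; }
user_exists() { grep -q "^$1:" /etc/passwd 2>/dev/null; }

create_agent_user() {
  if have groupadd && have useradd; then
    # shadow-utils style (Ubuntu, Debian, Rocky). Checked first because Debian-based
    # images may also ship adduser/addgroup wrappers that take different flags.
    group_exists buildkite-agent || run groupadd -g "$AGENT_GID" buildkite-agent || return 1
    user_exists buildkite-agent ||
      run useradd -u "$AGENT_UID" -g buildkite-agent -d /workspace buildkite-agent || return 1
  elif have addgroup && have adduser; then
    # BusyBox style (Alpine), using the same flags as the currently hardcoded commands.
    group_exists buildkite-agent || run addgroup -g "$AGENT_GID" buildkite-agent || return 1
    user_exists buildkite-agent ||
      run adduser -D -u "$AGENT_UID" -G buildkite-agent -h /workspace buildkite-agent || return 1
  else
    echo "no supported user-management tools found" >&2
    return 1
  fi
}
```

Setting `DRY_RUN=1` before calling `create_agent_user` prints the commands that would run, which makes it possible to compare the behavior across images without root access.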

To Reproduce

Steps to reproduce the behavior:

  1. Deploy with the following configuration:

     # Helm values for agent-stack-k8s
     config:
       # Custom agent image (Ubuntu-based instead of default Alpine)
       image: "ghcr.io/buildkite/agent:3.115.4-ubuntu-24.04"

       # Required for our environment - forces TCP DNS queries
       pod-spec-patch:
         dnsPolicy: "None"
         dnsConfig:
           options:
             - name: use-vc  # Force TCP for DNS (not supported by musl/Alpine)

  2. Run a pipeline on the agents
  3. See error

Expected behavior

All images published by buildkite/agent, including the Ubuntu-based ones, should be compatible with this stack.

Environment

  • agent-stack-k8s version: v0.37.0
  • Kubernetes version: v1.34.2
  • Deployment method: modified helm chart

Logs

The following init containers failed:

CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH
