fix: speed up provisioning even more by 10s with localdns enabled #8338

Merged
awesomenix merged 2 commits into main from nishp/speedup/localdns on Apr 17, 2026

@awesomenix

Contributor
  • Configure skipping the WAagent hold for the scriptless case.
  • Capture localdns logs for debugging.
  • Pre-warm coredns so it's faster by 6s.

Copilot AI left a comment

Contributor

Pull request overview

This PR aims to reduce Linux node provisioning time in scriptless scenarios (with LocalDNS) by skipping walinuxagent hold work, capturing LocalDNS logs for easier debugging, and pre-warming CoreDNS to reduce startup latency.

Changes:

  • Expose a new template helper (GetSkipWaAgentHold) and plumb SKIP_WAAGENT_HOLD into the CSE command environment.
  • Pre-warm the LocalDNS CoreDNS binary during basePrep when LocalDNS is enabled.
  • Extend e2e VM log collection to include localdns systemd unit logs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/agent/baker.go | Adds GetSkipWaAgentHold template func to drive SKIP_WAAGENT_HOLD in rendered scripts. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Adds CoreDNS version invocation to pre-warm when LocalDNS is enabled. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Exports SKIP_WAAGENT_HOLD into the CSE execution environment. |
| e2e/vmss.go | Collects journalctl -u localdns into test failure artifacts. |

Comment on lines 194 to 196
CSE_TIMEOUT="{{GetCSETimeout}}"
SKIP_WAAGENT_HOLD="{{GetSkipWaAgentHold}}"
/usr/bin/nohup /bin/bash -c "/bin/bash /opt/azure/containers/provision_start.sh"

Copilot AI Apr 17, 2026

Setting SKIP_WAAGENT_HOLD to true causes cse_main.sh to skip unholding walinuxagent at the end of provisioning. However, Ubuntu's installDeps (provision_installs_distro.sh) still unconditionally runs aptmarkWALinuxAgent hold, so in scenarios where FULL_INSTALL_REQUIRED=true this can leave walinuxagent permanently held. Consider also guarding the hold inside installDeps (or only skipping unhold when a hold was actually performed).
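
A minimal sketch of the suggestion above, assuming the hold and unhold steps run in the same shell process. It guards the hold with SKIP_WAAGENT_HOLD and releases the hold only when this run actually placed one. The names `WAAGENT_HELD`, `APT_MARK`, `hold_waagent_if_needed`, and `unhold_waagent_if_held` are hypothetical and do not come from the AgentBaker scripts; only `SKIP_WAAGENT_HOLD` and the `walinuxagent` package name appear in this PR.

```bash
#!/usr/bin/env bash
# Sketch only: function and variable names here are hypothetical and are
# not taken from cse_main.sh or provision_installs_distro.sh.

WAAGENT_HELD="false"   # records whether this run placed a hold

# Place a hold on walinuxagent unless SKIP_WAAGENT_HOLD is "true".
# APT_MARK can be overridden (for example in tests); it defaults to apt-mark.
hold_waagent_if_needed() {
    if [ "${SKIP_WAAGENT_HOLD:-false}" = "true" ]; then
        return 0
    fi
    ${APT_MARK:-apt-mark} hold walinuxagent && WAAGENT_HELD="true"
}

# Release the hold only if this run actually placed one, so a skipped
# hold never leaves the package held and a real hold is never leaked.
unhold_waagent_if_held() {
    if [ "${WAAGENT_HELD}" = "true" ]; then
        ${APT_MARK:-apt-mark} unhold walinuxagent && WAAGENT_HELD="false"
    fi
}
```

With this shape, the unhold decision depends on what happened during the run rather than on the flag alone, which is the property the comment asks for.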

awesomenix enabled auto-merge (squash) April 17, 2026 14:45

systemctl restart systemd-timesyncd
fi

# pre-warm coredns by checking its version.

Collaborator

I'm a bit puzzled. Do you know what this is doing in the background? How is this supposed to get rid of the 6-second delay you saw?

Contributor Author

It is just reading the binary from Azure storage into memory, or just paging it into memory. Considering the 6s slowness, my guess is that the cost is the hydration of the binary itself.
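
To illustrate the page-cache half of that guess, here is a minimal sketch that reads a file end to end in the background so that later reads can be served from memory. This is a different technique from the `coredns --version` call in the PR, which only touches the pages the binary needs in order to start. `prewarm_file` is a hypothetical helper, not part of the AgentBaker scripts.

```bash
#!/usr/bin/env bash
# Sketch only: prewarm_file is a hypothetical helper.

# Read a file end to end in the background so its blocks are fetched from
# the backing disk and cached before the first real use. The read is
# best-effort: a missing or unreadable file is silently ignored.
prewarm_file() {
    local path="$1"
    [ -r "${path}" ] || return 0
    nohup cat "${path}" >/dev/null 2>&1 &
}
```

A call such as `prewarm_file /opt/azure/containers/localdns/binary/coredns` would return immediately and leave the read running in the background, so it would not add to provisioning time on the critical path.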


# pre-warm coredns by checking its version.
if [ "${SHOULD_ENABLE_LOCALDNS}" = "true" ]; then
nohup /bin/sh -c '/opt/azure/containers/localdns/binary/coredns --version >/dev/null 2>&1' >/dev/null 2>&1 &

Collaborator

If there is an error with this command, should we care? Maybe log it to the local filesystem?

Contributor Author

We don't care, since we are only using it to warm up the binary. In this case it either warms the binary from Azure storage or just pages it into memory.
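
If a failure record were wanted anyway, one option would be a best-effort variant that appends a line to a local file when the warm-up command exits non-zero. This is a sketch under assumptions: `prewarm_with_log`, `PREWARM_LOG`, and the default log path are hypothetical and are not part of the PR.

```bash
#!/usr/bin/env bash
# Sketch only: the helper name, the variable, and the default path below
# are hypothetical.
PREWARM_LOG="${PREWARM_LOG:-/var/log/azure/localdns-prewarm.log}"

# Run "<binary> --version" in a background subshell. On a non-zero exit,
# append one timestamped line to PREWARM_LOG. Logging is best-effort:
# if the log file cannot be written, the error is discarded.
prewarm_with_log() {
    local bin="$1"
    (
        if ! "${bin}" --version >/dev/null 2>&1; then
            echo "$(date -u +%FT%TZ) pre-warm failed for ${bin}" >>"${PREWARM_LOG}"
        fi
    ) 2>/dev/null &
}
```

The caller still returns immediately and never sees the exit status, so provisioning behaves the same as with the fire-and-forget form; the only difference is the optional line left behind for debugging.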

LOCALDNS_GENERATED_COREFILE="{{GetGeneratedLocalDNSCoreFile}}"
PRE_PROVISION_ONLY="{{GetPreProvisionOnly}}"
CSE_TIMEOUT="{{GetCSETimeout}}"
SKIP_WAAGENT_HOLD="{{GetSkipWaAgentHold}}"

Collaborator

Should this be part of another PR? I'm confused about how GetSkipWaAgentHold is supposed to work. Isn't this risky?

Contributor Author

We already disable the waagent hold for aks-node-controller. Because of phase 2, I just reimplemented it in CSE, where skipping the hold will only be enabled in phase 2.

"SKIP_WAAGENT_HOLD": "true",

Collaborator

walinuxagent is already centralized in components.json. I remember that a few weeks ago there was a PR that handled this logic, and it should have had sufficient test coverage even on lower-end VMs (Nishchay can confirm). If that is the case, then it should be safe to set SKIP_WAAGENT_HOLD to true completely in Scriptless phase 2 (EnableScriptlessNBCCSECmd = true).

Contributor Author

I confirmed this on the lowest-end A and B series as well.

Devinwong left a comment

Collaborator

LGTM

awesomenix merged commit a848a1d into main Apr 17, 2026
44 of 45 checks passed
awesomenix deleted the nishp/speedup/localdns branch April 17, 2026 18:12