fix(topology): carry numa through old-format pool normalization (RUN-41037)#231

Open
eliranw wants to merge 1 commit into main from eliranw/RUN-41037-fix-old-format-numa-drop


eliranw commented Jul 2, 2026

What

A numa: block on an old-format (flat gpuCount/gpuProduct/gpuMemory) node pool was silently dropped at parse time — NodePoolTopology had no Numa field, so yaml.Unmarshal discarded the key, normalization produced Numa == nil, and the NRT publisher skipped the pool. No error or warning anywhere; no NodeResourceTopology was ever created.
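
The silent drop described above is what a decoder does by default when a key has no matching struct field. A minimal sketch of that behaviour follows. It uses the standard library's `encoding/json` as a stand-in for the YAML decoder (a deliberate swap so the example has no third-party imports); `poolWithoutNuma`, the input document, and the `nodes` key are hypothetical and are not taken from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// poolWithoutNuma is a hypothetical stand-in for the pre-fix NodePoolTopology:
// it has no field that the "numa" key could be decoded into.
type poolWithoutNuma struct {
	GpuCount   int    `json:"gpuCount"`
	GpuProduct string `json:"gpuProduct"`
}

// input is an illustrative document; the contents of "numa" are made up.
const input = `{"gpuCount": 8, "gpuProduct": "example-gpu", "numa": {"nodes": 2}}`

// decodeLenient uses the default behaviour: keys without a matching field
// are ignored and no error is returned.
func decodeLenient(data string) (poolWithoutNuma, error) {
	var p poolWithoutNuma
	err := json.Unmarshal([]byte(data), &p)
	return p, err
}

// decodeStrict rejects any key that has no matching struct field.
func decodeStrict(data string) (poolWithoutNuma, error) {
	var p poolWithoutNuma
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&p)
	return p, err
}

func main() {
	p, err := decodeLenient(input)
	fmt.Printf("lenient: %+v err=%v\n", p, err) // numa is gone, err is nil

	_, err = decodeStrict(input)
	fmt.Printf("strict:  err=%v\n", err) // non-nil: "numa" has no field
}
```

The lenient path is the one that matches the bug: the decode succeeds and nothing signals that `numa` was discarded. Which YAML library the project uses, and whether a strict mode would be acceptable there, is not stated in this PR.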

Fixes #229. Jira: RUN-41037. (Supersedes #230 — same change; the branch was renamed to carry the Jira ID for the ticket gate, which closed the original PR.)

Change

  • NodePoolTopology: add Numa *NumaConfig (yaml: numa,omitempty).
  • normalizeNodePool: copy Numa into the resulting NodePoolConfig.

Backward-compatible: pools without numa are unchanged (nil before and after).
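
The two changes above can be sketched roughly as follows. Only the `Numa *NumaConfig` field, its `numa,omitempty` tag, and the function name `normalizeNodePool` come from this PR; the other fields (`GpuCount`, `Devices`, `NumaConfig.Nodes`) and the function signature are assumptions made for illustration.

```go
package main

import "fmt"

// NumaConfig's real fields are not shown in this PR; Nodes is a placeholder.
type NumaConfig struct {
	Nodes int `yaml:"nodes"`
}

// NodePoolTopology is the old-format (flat) pool. The Numa field and its tag
// follow the PR description; GpuCount's type is assumed.
type NodePoolTopology struct {
	GpuCount int         `yaml:"gpuCount"`
	Numa     *NumaConfig `yaml:"numa,omitempty"` // added by this change
}

// NodePoolConfig is the normalized pool. Devices is a hypothetical stand-in
// for whatever the real normalization produces from the flat GPU fields.
type NodePoolConfig struct {
	Devices int
	Numa    *NumaConfig
}

// normalizeNodePool converts an old-format pool. The signature is assumed.
func normalizeNodePool(old NodePoolTopology) NodePoolConfig {
	return NodePoolConfig{
		Devices: old.GpuCount,
		Numa:    old.Numa, // copied through; nil stays nil for pools without numa
	}
}

func main() {
	withNuma := normalizeNodePool(NodePoolTopology{GpuCount: 4, Numa: &NumaConfig{Nodes: 2}})
	withoutNuma := normalizeNodePool(NodePoolTopology{GpuCount: 4})
	fmt.Println(withNuma.Numa != nil, withoutNuma.Numa == nil)
}
```

This sketch copies the pointer rather than the pointed-to value, so the normalized config shares the `NumaConfig` with the parsed input. Whether the real change copies by pointer or by value is not visible from the description.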

Tests

  • TestNormalizeOldFormatPreservesNuma — unit, struct-level normalization.
  • TestParseOldFormatYAMLPreservesNuma — end-to-end ParseAndNormalizeTopology with the exact YAML shape from the field report, asserting numa survives and GPU count still normalizes to 16 devices.
  • All topology/status-exporter/status-updater suites pass locally; golangci-lint clean.

Note: #230's CI had two unrelated flakes — the kwok-device-plugin Eventually timeout (passes locally, path untouched by this change) and setup-e2e racing kubectl wait right after helm install ("no matching resources found").

Found while debugging a live cluster where nodePools.<name>.numa was configured in old-format values and no NRT appeared.

…41037)

A numa block on an old-format (flat gpuCount/gpuProduct/gpuMemory) pool was
silently dropped: NodePoolTopology had no Numa field, so yaml.Unmarshal
discarded the key and the NRT publisher skipped the pool. Add the field and
copy it in normalizeNodePool.

Fixes #229

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>

Development

Successfully merging this pull request may close these issues.

numa block on an old-format node pool is silently dropped — no NRT published
