fix(topology): carry numa through old-format pool normalization (RUN-41037)#231
Open
eliranw wants to merge 1 commit into
Open
fix(topology): carry numa through old-format pool normalization (RUN-41037)#231eliranw wants to merge 1 commit into
eliranw wants to merge 1 commit into
Conversation
…41037) A numa block on an old-format (flat gpuCount/gpuProduct/gpuMemory) pool was silently dropped: NodePoolTopology had no Numa field, so yaml.Unmarshal discarded the key and the NRT publisher skipped the pool. Add the field and copy it in normalizeNodePool. Fixes #229 Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A
numa:block on an old-format (flatgpuCount/gpuProduct/gpuMemory) node pool was silently dropped at parse time —NodePoolTopologyhad noNumafield, soyaml.Unmarshaldiscarded the key, normalization producedNuma == nil, and the NRT publisher skipped the pool. No error or warning anywhere; noNodeResourceTopologywas ever created.Fixes #229. Jira: RUN-41037. (Supersedes #230 — same change; the branch was renamed to carry the Jira ID for the ticket gate, which closed the original PR.)
Change
NodePoolTopology: addNuma *NumaConfig(yaml: numa,omitempty).normalizeNodePool: copyNumainto the resultingNodePoolConfig.Backward-compatible: pools without
numaare unchanged (nilbefore and after).Tests
TestNormalizeOldFormatPreservesNuma— unit, struct-level normalization.TestParseOldFormatYAMLPreservesNuma— end-to-endParseAndNormalizeTopologywith the exact YAML shape from the field report, asserting numa survives and GPU count still normalizes to 16 devices.Note: #230's CI had two unrelated flakes — the kwok-device-plugin
Eventuallytimeout (passes locally, path untouched by this change) andsetup-e2eracingkubectl waitright afterhelm install("no matching resources found").Found debugging a live cluster where
nodePools.<name>.numawas configured in old-format values and no NRT appeared.