Closed

Patch 1 #3865

51 commits
bc94810
modify templates for doca ofed
VrindaMarwah Dec 31, 2025
2766c97
doca ofed installation changes for k8s
VrindaMarwah Dec 31, 2025
a71819d
add ansible builtin
VrindaMarwah Jan 2, 2026
07dbf6b
Merge pull request #3826 from VrindaMarwah/pub/ib_support
jagadeeshnv Jan 5, 2026
30d5135
Update ansible-lint.yml
VrindaMarwah Jan 5, 2026
b20eb56
Update pylint.yml
VrindaMarwah Jan 5, 2026
46a8a92
Merge pull request #3827 from VrindaMarwah/pub/ib_support
jagadeeshnv Jan 5, 2026
8f7dec7
Update image-build to use docker.io/dellhpcomniaaisolution/image-buil…
balajikumaran-c-s Jan 6, 2026
94e7e5e
Remove rpmdb rebuild commands from base_image_commands
balajikumaran-c-s Jan 6, 2026
f120e16
Add retry logic for image pull with pull_image_retries and pull_image…
balajikumaran-c-s Jan 6, 2026
942f0ce
Merge pull request #3829 from balajikumaran-c-s/pub/ib_support
abhishek-sa1 Jan 6, 2026
50c87bd
doca changes to build image
VrindaMarwah Jan 7, 2026
4261525
slurm user uid set to 6001
jagadeeshnv Jan 7, 2026
9d97814
Merge pull request #3834 from jagadeeshnv/pub/ib_support
snarthan Jan 8, 2026
27974c5
add static ip for ib interface
VrindaMarwah Jan 8, 2026
c07acc6
Merge branch 'dell:pub/ib_support' into pub/ib_support
VrindaMarwah Jan 8, 2026
40f36b9
Update openchami_image_cmd.yml
VrindaMarwah Jan 8, 2026
5096861
Update slurm_custom.json
VrindaMarwah Jan 8, 2026
43827ef
Update slurm_custom.json
VrindaMarwah Jan 8, 2026
6df6515
Update service_k8s.json
VrindaMarwah Jan 8, 2026
6630ec7
Update local_repo_config.yml
VrindaMarwah Jan 8, 2026
f03b32c
remove unused vars main.yml
balajikumaran-c-s Jan 10, 2026
322ccd0
Updated image tag in main.yml
balajikumaran-c-s Jan 10, 2026
0407906
Update image tag in default_packages.json
balajikumaran-c-s Jan 10, 2026
6f17b12
Merge pull request #3838 from balajikumaran-c-s/pub/ib_support
abhishek-sa1 Jan 10, 2026
951a5e2
Merge branch 'dell:pub/ib_support' into pub/ib_support
VrindaMarwah Jan 10, 2026
9974216
add package mounts for doca installation
VrindaMarwah Jan 11, 2026
2a86f1c
updating comments in network_spec
VrindaMarwah Jan 11, 2026
3abb36c
passwordless_ssh changes
sakshi-singla-1735 Jan 12, 2026
216a06c
ansible lint fixes
sakshi-singla-1735 Jan 12, 2026
cd729f5
input validation for ib network
sakshi-singla-1735 Jan 12, 2026
d38cf10
Merge pull request #3841 from VrindaMarwah/pub/ib_support
snarthan Jan 12, 2026
2cca244
Merge branch 'pub/ib_support' into pub/input_validation_ib
sakshi-singla-1735 Jan 12, 2026
a12179e
Merge pull request #3844 from sakshi-singla-1735/pub/input_validation_ib
snarthan Jan 12, 2026
f015e98
removing duplicate code
sakshi-singla-1735 Jan 13, 2026
e0b1fe5
Merge branch 'pub/v2.1_rc1' into pub/passwordlessssh
sakshi-singla-1735 Jan 13, 2026
e770d86
variablize filenames
sakshi-singla-1735 Jan 13, 2026
d3ac541
Merge branch 'pub/passwordlessssh' of github.com:sakshi-singla-1735/o…
sakshi-singla-1735 Jan 13, 2026
a700dd3
Merge pull request #3843 from sakshi-singla-1735/pub/passwordlessssh
snarthan Jan 13, 2026
078997e
extract cuda in nfs
Nagachandan-P Jan 14, 2026
ddc00f8
making path changes
sakshi-singla-1735 Jan 14, 2026
64d4b28
Update ci-group-login_compiler_node_aarch64.yaml.j2
Nagachandan-P Jan 14, 2026
34aea37
Update ci-group-login_compiler_node_x86_64.yaml.j2
Nagachandan-P Jan 14, 2026
53290e6
Merge pull request #3857 from Nagachandan-P/pub/v2.1_rc1
jagadeeshnv Jan 14, 2026
e3dc75a
adding the repo for apptainer
sakshi-singla-1735 Jan 14, 2026
66661de
add set pipefail to doca-ofed script
VrindaMarwah Jan 14, 2026
6670061
Update ansible-lint.yml
VrindaMarwah Jan 14, 2026
05c1146
Update pylint.yml
VrindaMarwah Jan 14, 2026
72e5971
Merge pull request #3858 from VrindaMarwah/pub/v2.1_rc1
snarthan Jan 14, 2026
a7c3a62
Merge pull request #3856 from sakshi-singla-1735/pub/passwordlessssh
jagadeeshnv Jan 14, 2026
a2589e8
Update README.md
expressmailin Jan 16, 2026
2 changes: 2 additions & 0 deletions .github/workflows/ansible-lint.yml
@@ -11,6 +11,8 @@ on:
 - pub/ochami
 - pub/ochami_aarch64
 - pub/k8s_telemetry
+- pub/ib_support
+- pub/v2.1_rc1
 
 jobs:
 build:
2 changes: 2 additions & 0 deletions .github/workflows/pylint.yml
@@ -10,6 +10,8 @@ on:
 - pub/ochami
 - pub/ochami_aarch64
 - pub/k8s_telemetry
+- pub/ib_support
+- pub/v2.1_rc1
 
 jobs:
 build:
2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
+
+
 <img src="docs/logos/omnia-logo-transparent.png" width="500px">
 <!-- ALL-CONTRIBUTORS-BADGE:START - Do not remove or modify this section -->
 <!-- DO NOT ADD A BADGE -->
2 changes: 1 addition & 1 deletion build_image_aarch64/roles/prepare_arm_node/tasks/main.yml
@@ -167,7 +167,7 @@
 
 - name: Build full Podman image path
 ansible.builtin.set_fact:
-pulp_aarch_image: "{{ hostvars['localhost']['oim_pxe_ip'] }}:2225/dellhpcomniaaisolution/image-build-aarch64:latest"
+pulp_aarch_image: "{{ hostvars['localhost']['oim_pxe_ip'] }}:2225/dellhpcomniaaisolution/image-build-aarch64:1.0"
 
 - name: Pull aarch64 image using Podman
 ansible.builtin.command:
1 change: 0 additions & 1 deletion build_image_aarch64/roles/prepare_arm_node/vars/main.yml
@@ -22,7 +22,6 @@ aarch64_regctl_url: "https://github.com/regclient/regclient/releases/latest/down
 pulp_repo_file_path: "/etc/yum.repos.d/pulp.repo"
 pulp_webserver_cert_path: "/opt/omnia/pulp/settings/certs/pulp_webserver.crt"
 anchors_path: "/etc/pki/ca-trust/source/anchors/pulp_webserver.crt"
-regctl_tar_path: "omnia/offline_repo/cluster/aarch64/rhel/10.0/tarball/regctl-linux-arm64/regctl-linux-arm64.tar.gz"
 regctl_bin_path: "/usr/local/bin/regctl"
 
 # Error messages
15 changes: 5 additions & 10 deletions build_image_x86_64/roles/image_creation/tasks/build_image_tag.yml
@@ -13,21 +13,16 @@
 # limitations under the License.
 ---
 
-- name: Pull specific OpenCHAMI image by version tag
+- name: Pull image-build image
 ansible.builtin.command:
-cmd: "podman pull {{ openchami_image_sha }}"
+cmd: "podman pull {{ image_build_el10 }}"
 register: pull_result
+retries: "{{ pull_image_retries }}"
+delay: "{{ pull_image_delay }}"
+until: pull_result.rc == 0
 changed_when: "'Image is up to date' not in pull_result.stdout"
 
 - name: Fail if image not pulled successfully
 ansible.builtin.fail:
 msg: "{{ pull_result.stdout }}"
 when: pull_result.rc != 0
-
-- name: Tagging OpenCHAMI image with stable name
-ansible.builtin.command:
-cmd: "{{ ochami_stable_image_tag }}"
-args:
-creates: "{{ ochami_stable_image_path }}"
-register: tag_result
-changed_when: "'Tagged' in tag_result.stdout"
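The new pull task in `build_image_tag.yml` wraps `podman pull` in Ansible's `retries`/`delay`/`until` loop instead of attempting it once. A minimal Python sketch of that control flow follows, assuming `retries` counts extra attempts after the first one (check the Ansible documentation for the exact counting in your version). `run_with_retries` and the fake runner are hypothetical and do not call Podman.

```python
import time
from typing import Callable, List, Tuple


def run_with_retries(
    run: Callable[[], int],
    retries: int,
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call `run` until it returns 0 or the attempts run out.

    Assumption: one initial attempt plus up to `retries` retries,
    so `run` is called at most retries + 1 times.
    Returns (last exit code, number of attempts made).
    """
    rc = -1
    attempts = 0
    for attempt in range(retries + 1):
        attempts += 1
        rc = run()
        if rc == 0:  # corresponds to `until: pull_result.rc == 0`
            break
        if attempt < retries:
            sleep(delay)  # corresponds to `delay: "{{ pull_image_delay }}"`
    return rc, attempts


# Hypothetical runner standing in for `podman pull`: fails twice, then succeeds.
exit_codes = iter([1, 1, 0])
waits: List[float] = []
rc, attempts = run_with_retries(
    lambda: next(exit_codes),
    retries=int("3"),    # pull_image_retries is the string "3" in vars/main.yml
    delay=float("10"),   # pull_image_delay is the string "10"
    sleep=waits.append,  # record the waits instead of sleeping
)
print(rc, attempts, waits)  # 0 3 [10.0, 10.0]
```

Injecting `sleep` keeps the sketch testable without real 10-second waits.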
10 changes: 4 additions & 6 deletions build_image_x86_64/roles/image_creation/vars/main.yml
@@ -12,7 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 ---
-openchami_image_sha: "ghcr.io/openchami/image-build@sha256:52dd9d546951ce4f2f6f9febd08a228cfcb5b9e8e204ca4f5ee232f6be65d3a4"
+image_build_el10: "docker.io/dellhpcomniaaisolution/image-build-el10:1.0"
+pull_image_retries: "3"
+pull_image_delay: "10"
 input_project_dir: "{{ hostvars['localhost']['input_project_dir'] }}"
 omnia_metadata_file: "/opt/omnia/.data/oim_metadata.yml"
 dir_permissions_644: "0644"
@@ -33,7 +35,7 @@ ochami_compute_mounts:
 
 ochami_x86_64_image:
 - --entrypoint /bin/bash
-- ghcr.io/openchami/image-build:stable
+- docker.io/dellhpcomniaaisolution/image-build-el10:1.0
 ochami_base_command:
 - -c 'update-ca-trust extract && image-build --config /home/builder/config.yaml --log-level DEBUG'
 
@@ -52,7 +54,3 @@ compute_image_failure_msg: |
 # build_compute_image.yml
 openchami_compute_image_vars_template: "{{ role_path }}/templates/compute_images_templates.j2"
 openchami_compute_image_vars_path: "/opt/omnia/openchami/compute_images_template.yaml"
-
-# build_image_tag.yml
-ochami_stable_image_tag: "podman tag {{ openchami_image_sha }} ghcr.io/openchami/image-build:stable"
-ochami_stable_image_path: "/var/lib/containers/storage/overlay-images/{{ openchami_image_sha }}"
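These changes move the x86_64 build from a digest-pinned OpenCHAMI reference (`...@sha256:...`) that was retagged locally as `:stable` to the tag-pinned `image-build-el10:1.0`, and move the aarch64 image from `:latest` to `:1.0`. A digest names exact image content, while a tag can be re-pushed to point at different content. As a hypothetical illustration (not part of Omnia), the sketch below splits such references into name, tag, and digest. It is simplified: it does no validation and does not fill in a default registry or tag.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    name: str              # [registry[:port]/]repository path, without tag or digest
    tag: Optional[str]     # e.g. "1.0" or "latest"
    digest: Optional[str]  # e.g. "sha256:52dd..."


def split_image_ref(ref: str) -> ImageRef:
    """Separate an image reference into name, tag, and digest (simplified)."""
    name, digest = ref, None
    if "@" in name:
        name, digest = name.split("@", 1)
    tag = None
    colon = name.rfind(":")
    # A colon before the last "/" belongs to a registry port such as ":2225",
    # so only a colon after the last "/" starts a tag.
    if colon > name.rfind("/"):
        name, tag = name[:colon], name[colon + 1:]
    return ImageRef(name, tag, digest)


old_ref = split_image_ref(
    "ghcr.io/openchami/image-build@sha256:"
    "52dd9d546951ce4f2f6f9febd08a228cfcb5b9e8e204ca4f5ee232f6be65d3a4"
)
new_ref = split_image_ref("docker.io/dellhpcomniaaisolution/image-build-el10:1.0")
# 192.0.2.10 is a placeholder for the rendered oim_pxe_ip value.
arm_ref = split_image_ref("192.0.2.10:2225/dellhpcomniaaisolution/image-build-aarch64:1.0")
print(old_ref.digest is not None, new_ref.tag, arm_ref.name)
```

The port check matters for the aarch64 reference, whose registry part (`{{ hostvars['localhost']['oim_pxe_ip'] }}:2225`) also contains a colon.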
Original file line number Diff line number Diff line change
@@ -326,6 +326,12 @@ def json_file_mandatory(file_path):
     "Please ensure the CSV file has the required headers."
 )
 NETWORK_SPEC_FILE_NOT_FOUND_MSG = "network_spec.yml file not found in input folder."
+IB_NETMASK_BITS_MISMATCH_MSG = (
+    "netmask_bits configured for ib_network must match admin_network netmask_bits in network_spec.yml."
+)
+IB_SUBNET_IN_ADMIN_RANGE_MSG = (
+    "ib_network subnet must be outside the admin network range derived from primary_oim_admin_ip/netmask_bits in network_spec.yml."
+)
 
 # telemetry
 MANDATORY_FIELD_FAIL_MSG = "must not be empty"
@@ -427,3 +433,4 @@ def get_logic_failed(input_file_path):
 def get_logic_success(input_file_path):
     """Returns a formatted message indicating logic validation success for a file."""
     return f"{'#' * 10} Logic validation successful for {input_file_path} {'#' * 10}"
+
Original file line number Diff line number Diff line change
@@ -100,9 +100,35 @@
 }
 },
 "additionalProperties": false
+},
+{
+"type": "object",
+"required": ["ib_network"],
+"properties": {
+"ib_network": {
+"type": "object",
+"required": [
+"subnet",
+"netmask_bits"
+],
+"properties": {
+"subnet": {
+"type": "string",
+"pattern": "^(?:(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})\\.){3}(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})$"
+},
+"netmask_bits": {
+"type": "string",
+"pattern": "^(1[0-9]|2[0-9]|[1-9])$|^3[0-2]$"
+}
+},
+"additionalProperties": false
+}
+},
+"additionalProperties": false
 }
 ]
 }
 }
 }
 }
+
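The new `ib_network` branch of the schema requires `subnet` and `netmask_bits`, types both as strings, rejects extra keys (`"additionalProperties": false`), and constrains the values with regular expressions: a dotted-quad IPv4 address and a prefix length from 1 to 32. A minimal standard-library sketch of the same rules, handy for checking a `network_spec.yml` entry by hand, is below. `check_ib_network` is hypothetical; the project itself validates through its JSON Schema. Because `netmask_bits` is typed as a string, an unquoted YAML value such as `netmask_bits: 24` loads as an integer and would not satisfy `"type": "string"` if the loaded data is validated as-is.

```python
import re
from typing import Any, Dict, List

# Patterns copied from the ib_network schema; JSON's "\\." becomes "\." in a raw string.
SUBNET_RE = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|1?[0-9]{1,2})$"
)
NETMASK_BITS_RE = re.compile(r"^(1[0-9]|2[0-9]|[1-9])$|^3[0-2]$")
PATTERNS = {"subnet": SUBNET_RE, "netmask_bits": NETMASK_BITS_RE}


def check_ib_network(ib_network: Dict[str, Any]) -> List[str]:
    """Apply the schema's required, type, pattern and additionalProperties rules."""
    errors = []
    for key, pattern in PATTERNS.items():
        if key not in ib_network:  # "required"
            errors.append(f"missing required key: {key}")
            continue
        value = ib_network[key]
        if not isinstance(value, str):  # "type": "string"
            errors.append(f"{key} must be a string, got {type(value).__name__}")
        elif not pattern.search(value):  # JSON Schema "pattern" is an unanchored search
            errors.append(f"{key} does not match the expected format: {value!r}")
    for key in sorted(set(ib_network) - set(PATTERNS)):  # "additionalProperties": false
        errors.append(f"unexpected key: {key}")
    return errors


print(check_ib_network({"subnet": "10.10.0.0", "netmask_bits": "24"}))  # []
print(check_ib_network({"subnet": "256.1.1.1", "netmask_bits": 24}))
```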
Original file line number Diff line number Diff line change
@@ -21,6 +21,7 @@
 import itertools
 import csv
 import yaml
+import ipaddress
 from ansible.module_utils.input_validation.common_utils import validation_utils
 from ansible.module_utils.input_validation.common_utils import config
 from ansible.module_utils.input_validation.common_utils import en_us_validation_msg
@@ -744,6 +745,54 @@ def validate_network_spec(
             )
         return errors
 
+    # Extract admin and IB parameters for cross-validation
+    admin_netmask_bits = None
+    admin_primary_ip = None
+    ib_netmask_bits = None
+    ib_subnet = None
+    ib_present = False
+
+    for network in data["Networks"]:
+        if "admin_network" in network and isinstance(network["admin_network"], dict):
+            admin_net = network["admin_network"]
+            admin_netmask_bits = admin_net.get("netmask_bits", admin_netmask_bits)
+            admin_primary_ip = admin_net.get("primary_oim_admin_ip", admin_primary_ip)
+
+        if "ib_network" in network and isinstance(network["ib_network"], dict):
+            ib_net = network["ib_network"]
+            # Consider IB network present only when config is non-empty
+            if ib_net:
+                ib_present = True
+                ib_netmask_bits = ib_net.get("netmask_bits", ib_netmask_bits)
+                ib_subnet = ib_net.get("subnet", ib_subnet)
+
+    # If IB network is configured and both netmask bits are available, they must match
+    if ib_present and ib_netmask_bits and admin_netmask_bits and ib_netmask_bits != admin_netmask_bits:
+        errors.append(
+            create_error_msg(
+                "ib_network.netmask_bits",
+                ib_netmask_bits,
+                en_us_validation_msg.IB_NETMASK_BITS_MISMATCH_MSG,
+            )
+        )
+
+    # If IB subnet and admin primary IP are available, ensure IB subnet is not in admin range
+    if ib_present and ib_subnet and admin_primary_ip and admin_netmask_bits:
+        try:
+            admin_network = ipaddress.IPv4Network(f"{admin_primary_ip}/{admin_netmask_bits}", strict=False)
+            ib_ip = ipaddress.IPv4Address(ib_subnet)
+            if ib_ip in admin_network:
+                errors.append(
+                    create_error_msg(
+                        "ib_network.subnet",
+                        ib_subnet,
+                        en_us_validation_msg.IB_SUBNET_IN_ADMIN_RANGE_MSG,
+                    )
+                )
+        except ValueError:
+            # If IPs/netmask are invalid, rely on existing validations to report issues
+            pass
+
     for network in data["Networks"]:
         errors.extend(_validate_admin_network(network))
 
@@ -941,3 +990,4 @@ def _validate_ip_ranges(dynamic_range, network_type, netmask_bits):
         )
 
     return errors
+
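The two cross-checks added to `validate_network_spec` can be exercised outside Ansible with only the standard `ipaddress` module. The sketch below is a hypothetical standalone version: `ib_cross_check` and `same_prefix_overlap` are not Omnia functions, and plain strings replace the `create_error_msg` calls. It also shows why testing only the IB subnet address against the admin network is enough once the netmask bits must match. Two IPv4 networks with the same prefix length are either identical or disjoint, so address membership and a full `overlaps()` test agree. When the prefix lengths differ, the mismatch error is reported anyway.

```python
import ipaddress
from typing import List


def ib_cross_check(admin_ip: str, admin_bits: str, ib_subnet: str, ib_bits: str) -> List[str]:
    """Mirror the netmask-match and admin-range checks on plain values."""
    problems = []
    if ib_bits != admin_bits:
        problems.append("ib_network.netmask_bits: must match admin_network netmask_bits")
    try:
        admin_net = ipaddress.IPv4Network(f"{admin_ip}/{admin_bits}", strict=False)
        if ipaddress.IPv4Address(ib_subnet) in admin_net:
            problems.append("ib_network.subnet: must be outside the admin network range")
    except ValueError:
        pass  # malformed values are left to the other validators, as in the diff
    return problems


def same_prefix_overlap(admin_ip: str, ib_subnet: str, bits: str) -> bool:
    """Full overlap test between the two networks, for comparison."""
    admin_net = ipaddress.IPv4Network(f"{admin_ip}/{bits}", strict=False)
    ib_net = ipaddress.IPv4Network(f"{ib_subnet}/{bits}", strict=False)
    return ib_net.overlaps(admin_net)


# Example values (hypothetical): admin OIM at 172.16.0.254 with netmask_bits "16".
print(ib_cross_check("172.16.0.254", "16", "192.168.0.0", "16"))  # []
print(ib_cross_check("172.16.0.254", "16", "172.16.200.0", "16"))
# ['ib_network.subnet: must be outside the admin network range']
```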
2 changes: 0 additions & 2 deletions common/vars/openchami_image_cmd.yml
@@ -20,8 +20,6 @@ rhel_aarch64_base_image_name: "rhel-aarch64_base"
 base_image_commands:
 - "dracut --add 'dmsquash-live livenet network-manager' --install '/usr/lib/systemd/systemd-sysroot-fstab-check' --kver $(basename /lib/modules/*) -N -f --logfile /tmp/dracut.log 2>/dev/null" # noqa: yaml[line-length]
 - "echo DRACUT LOG:; cat /tmp/dracut.log"
-- "rm -f /var/lib/rpm/__db*"
-- "rpmdb --rebuilddb"
 
 # x86_64 compute commands
 default_x86_64_compute_commands:
17 changes: 17 additions & 0 deletions discovery/discovery.yml
@@ -75,6 +75,18 @@
 name: discovery_validations
 tasks_from: validate_oim_timezone.yml
 
+- name: Build cluster host lists from PXE mapping
+hosts: localhost
+connection: local
+roles:
+- passwordless_ssh
+
+- name: Configure OIM SSH from cluster host lists
+hosts: oim
+connection: ssh
+roles:
+- passwordless_ssh
+
 - name: Validate discovery parameters
 hosts: oim
 connection: ssh
@@ -102,6 +114,11 @@
 ansible.builtin.include_role:
 name: configure_ochami
 tasks_from: discover_mapping_nodes.yml
+
+- name: Read nodes.yaml and derive Omnia node facts
+ansible.builtin.include_role:
+name: passwordless_ssh
+tasks_from: read_nodes_yaml.yml
 roles:
 - nfs_client
 - k8s_config
Original file line number Diff line number Diff line change
@@ -66,6 +66,12 @@
 register: read_ssh_key
 no_log: true
 
+- name: Read the ssh private key
+ansible.builtin.command: cat {{ ssh_private_key_path }}
+changed_when: false
+register: read_ssh_private_key
+no_log: true
+
 - name: Hash the password
 ansible.builtin.command: openssl passwd -6 "{{ hostvars['localhost']['provision_password'] }}"
 changed_when: false