diff --git a/third_party/Dell/openshift-4.20/EI/single-node/troubleshooting.md b/third_party/Dell/openshift-4.20/EI/single-node/troubleshooting.md
new file mode 100644
index 00000000..c8f56064
--- /dev/null
+++ b/third_party/Dell/openshift-4.20/EI/single-node/troubleshooting.md
@@ -0,0 +1,215 @@

# Troubleshooting Guide

This section covers common deployment and runtime issues observed during Intel® AI for Enterprise Inference setup, along with step-by-step resolutions.

**Issues:**
 1. [Deployment Fails During Bastion Setup](#1-deployment-fails-during-bastion-setup)
 2. [PVC Stuck in Pending State](#2-pvc-stuck-in-pending-state)
 3. [StatefulSet Pods Not Scheduling on SNO](#3-statefulset-pods-not-scheduling-on-sno)
 4. [Route Exists but API Is Not Reachable](#4-route-exists-but-api-is-not-reachable)
 5. [Keycloak Token Generation Fails](#5-keycloak-token-generation-fails)

---

### 1. Deployment Fails During Bastion Setup

Deployment stops early with errors such as:

```
Failed to update apt cache
Bastion node setup failed. Exiting.
```

**Root Cause**

The EI bastion setup assumes:
- A multi-node cluster
- A mutable OS
- Working `apt` repositories

**These assumptions do not hold for SNO brownfield deployments:**

- SNO does not require a bastion
- RHCOS is immutable
- The bastion playbooks are incompatible

**Fix**
```bash
export EI_SKIP_BASTION_SETUP=true
export SKIP_BASTION_SETUP=true
./inference-stack-deploy.sh
```

> **Note:**
> - A bastion is not required for SNO
> - Skipping the bastion setup is expected and correct
> - Do not attempt to fix `apt` just to satisfy the bastion logic

---

### 2. 
PVC Stuck in Pending State

**Symptoms**

The PVC shows:
> STATUS: Pending

And `oc describe pod` shows:
> waiting for first consumer
> 0/1 nodes didn't find available PVs

**Verify the Local Storage Operator**
```bash
oc get pods -n openshift-local-storage
```

Expected:
- local-storage-operator → Running
- diskmaker-manager → Running

**Verify the StorageClass**
```bash
oc get storageclass
```

Expected:
```
local-sc   kubernetes.io/no-provisioner   WaitForFirstConsumer
```

**Verify Disk Availability**
```bash
lsblk
```

Ensure:
- The disk is unused
- The disk is not mounted
- The disk has sufficient capacity

> **Note:**
> - The Local Storage Operator creates only ONE PV per disk
> - A disk already in use cannot back additional PVs
> - Additional PVs require additional disks or partitions

**Fix**

Add a disk by editing the existing LocalVolume. If a LocalVolume resource already exists (for example `local-storage`), you do not need to create a new one; edit it and add the new disk.

**Edit the Existing LocalVolume**
```bash
oc edit localvolume local-storage -n openshift-local-storage
```
**Add the New Disk Under devicePaths**

Locate the `storageClassDevices` section and add the new disk path.

Before (example):
```yaml
spec:
  storageClassDevices:
    - storageClassName: local-sc
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme4n1
```

After (one more disk added):
```yaml
spec:
  storageClassDevices:
    - storageClassName: local-sc
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme4n1
        - /dev/nvme5n1
```

Save and exit the editor.

**Verify PV Creation**

Within a few seconds, the Local Storage Operator creates a new PV:
```bash
oc get pv
```
Expected:
- One new PV per newly added disk

Pending PVCs should then bind to the new PVs:

```bash
oc get pvc -A
```
Expected:
- STATUS: Bound

---

### 3. 
StatefulSet Pods Not Scheduling on SNO + +StatefulSet pods (example: auth-apisix-etcd-0) remain Pending even though PV exists. + +**Root Cause** + +Local PVs are node-specific + +Scheduler needs explicit node placement on SNO + +**Fix** + +Patch the StatefulSet to pin it to the SNO node: +```bash +oc patch statefulset auth-apisix-etcd -n auth-apisix \ + --type='merge' \ + -p '{ + "spec": { + "template": { + "spec": { + "nodeSelector": { + "kubernetes.io/hostname": "" + } + } + } + } + }' +``` + +Restart the pod: +```bash +oc delete pod auth-apisix-etcd-0 -n auth-apisix +``` +--- + +### 4. Route Exists but API Is Not Reachable +Checks: + +```bash +oc get routes -A +oc describe route +``` +**Common Causes** + +- Missing /etc/hosts entry +- Incorrect BASE_URL / cluster_url +- TLS mode mismatch +- DNS resolution failure + +**Fix** + +- Ensure hostname resolves correctly +- Verify TLS mode matches EI configuration +- Use curl -k only for debugging + +--- + +### 5. Keycloak Token Generation Fails + +Verify: +- Client ID exists +- Client secret is correct +- Access type is **Confidential** +- `KEYCLOAK_REALM` matches configuration +- Token endpoint URL is correct \ No newline at end of file diff --git a/third_party/Dell/openshift-4.20/EI/single-node/user-guide-apisix.md b/third_party/Dell/openshift-4.20/EI/single-node/user-guide-apisix.md new file mode 100644 index 00000000..842a39d0 --- /dev/null +++ b/third_party/Dell/openshift-4.20/EI/single-node/user-guide-apisix.md @@ -0,0 +1,219 @@ +# Intel® AI for Enterprise Inference - OpenShift (APISIX) +## Red Hat OpenShift Brownfield Deployment (Single Node OpenShift – SNO) + +This guide is a **supplement** to the [OpenShift Brownfield Deployment Guide](../../../../../docs/brownfield/brownfield_deployment_openshift.md). + +Follow that guide end-to-end and refer to the sections below for SNO and APISIX-specific steps not covered there. 
+ +--- + +## SNO-Specific Requirements + +The brownfield guide targets a generic OpenShift cluster. For SNO, these additional requirements apply: + +| Component | Requirement | +|-----------|-------------| +| OpenShift Container Platform | v4.20.0 | +| Kubernetes Version | v1.33.6 | +| Node Type | Single Node OpenShift (SNO) | +| Operating System | Red Hat CoreOS (RHCOS) | +| StorageClass | local-sc | +| Local Storage Operator | installed and bound | +| TLS | Edge termination (Router-managed) | + +The inference stack must be deployed from a **separate machine**, not the SNO node itself. + +| Component | Requirement | +|-----------|-------------| +| Deployment Machine | Ubuntu 22.04 | +| Enterprise Inference Version | release-1.4.0 | +| Accelerator | CPU / Gaudi3 | +| Network | Full egress (Registry + Hugging Face) | + +--- + +## Additional Pre-Deployment Steps + +### Copy kubeconfig to the Deployment Machine + +The brownfield guide assumes kubeconfig is already on the deployment machine. For SNO, copy it from your local machine first: + +```bash +scp PATH_TO_YOUR_KUBECONFIG_FILE username@:/home/user/admin.kubeconfig +``` + +Then follow [Prepare Kubeconfig](../../../../../docs/brownfield/brownfield_deployment.md#prepare-kubeconfig) to complete the setup. + +### DNS Resolution (If No Corporate DNS) + +SNO exposes additional routes not mentioned in the brownfield guide. Add these entries to `/etc/hosts` on the deployment machine: + +```bash +sudo vi /etc/hosts +``` +``` + api.. + keycloak-okd.apps.. + okd.apps.. +``` + +> If enterprise DNS is configured correctly, this step is not required. + +### Activate Virtual Environment + +```bash +cd ~/Enterprise-Inference/core/kubespray +source venv/bin/activate +pip install kubernetes +``` + +### Create Certificate Files + +```bash +mkdir -p ~/certs && \ +openssl req -x509 -nodes -days 365 \ +-newkey rsa:2048 \ +-keyout ~/certs/ei.key \ +-out ~/certs/ei.crt \ +-subj "/CN=okd.apps.." 
+``` +--- + +### Update Inference Configuration + +When updating `core/inventory/inference-config.cfg` per the brownfield guide, apply these APISIX-specific values: + +> **Note:** +> - Replace `.` with your SNO cluster URL +> - Set `cpu_or_gpu` to `cpu` for Xeon models or `gaudi3` for Intel Gaudi 3 +> - Set Keycloak values: `keycloak_client_id`, `keycloak_admin_user`, `keycloak_admin_password` +> - Replace `hugging_face_token` with your Hugging Face token +> - `deploy_kubernetes_fresh=off` and `deploy_ingress_controller=off` are required for brownfield + +``` +cluster_url=okd.apps.. +cert_file=~/certs/ei.crt +key_file=~/certs/ei.key +keycloak_client_id=my-client-id +keycloak_admin_user=your-keycloak-admin-user +keycloak_admin_password=changeme +hugging_face_token=your_hugging_face_token +hugging_face_token_falcon3=your_hugging_face_token +cpu_or_gpu=gaudi3 +deploy_kubernetes_fresh=off +deploy_ingress_controller=off +deploy_keycloak_apisix=on +deploy_genai_gateway=off +``` + +### Update hosts.yaml + +```bash +cp -f docs/examples/single-node/hosts.yaml core/inventory/hosts.yaml +``` + +> The `ansible_user` field defaults to `ubuntu`. Change it to the actual username if different. + +--- + +## Running the Deployment (SNO-Specific) + +The brownfield guide runs `./inference-stack-deploy.sh` directly. On SNO, `sudo` is required and does not inherit environment variables — export `KUBECONFIG` explicitly before running: + +**Gaudi** +```bash +cd core +chmod +x inference-stack-deploy.sh +export KUBECONFIG=/home/user/.kube/config +sudo -E ./inference-stack-deploy.sh --models "1" +``` + +**CPU** +```bash +cd core +chmod +x inference-stack-deploy.sh +export KUBECONFIG=/home/user/.kube/config +sudo -E ./inference-stack-deploy.sh --models "21" --cpu-or-gpu "cpu" +``` + +When prompted, choose **4) Brownfield Deployment**, provide the kubeconfig path (e.g. `~/.kube/config`), then select **1) Initial deployment**. 
+ +> See the [full list of available model IDs](../../../ubuntu-22.04/iac/README.md#pre-integrated-models-list). + +--- + +## Verify the Deployment + +In addition to the route verification in the brownfield guide, run these APISIX-specific checks: + +**Verify Namespaces** +```bash +kubectl get ns | egrep "auth-apisix|default" +``` +Expected: `default`, `auth-apisix` + +**Verify Pods** +```bash +kubectl get pods -A +``` +Expected: all pods `Running`, no `CrashLoopBackOff` or `Pending`. + +**Health Check** +```bash +kubectl get pv +kubectl get pvc -A +kubectl get routes -A +``` +Expected: PV = `Bound`, PVC = `Bound`, Routes created. + +--- + +## Test the Inference + +### Obtain Access Token + +Ensure Keycloak values in `core/scripts/generate-token.sh` match those in `core/inventory/inference-config.cfg`. + +> Replace `BASE_URL` with `https://okd.apps.CLUSTER-NAME.DOMAIN-NAME` wherever required. + +```bash +cd Enterprise-Inference/core/scripts +chmod +x generate-token.sh +. generate-token.sh +``` + +Confirm the token is set: +```bash +echo $BASE_URL +echo $TOKEN +``` + +If a valid token is returned (long JWT string), the environment is ready for inference testing. + +### Run a Test Query + +**Gaudi:** +```bash +curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct/v1/completions \ +-X POST \ +-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 25, "temperature": 0}' \ +-H 'Content-Type: application/json' \ +-H "Authorization: Bearer $TOKEN" +``` + +**CPU:** +```bash +curl -k ${BASE_URL}/Llama-3.1-8B-Instruct-vllmcpu/v1/completions \ +-X POST \ +-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 25, "temperature": 0}' \ +-H 'Content-Type: application/json' \ +-H "Authorization: Bearer $TOKEN" +``` + +If successful, the model will return a completion response. 
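The raw JSON response can be condensed to just the generated text. The helper below is a sketch (the function name is made up), assuming the standard OpenAI-compatible completions schema (`choices[0].text`) that the endpoint above returns:

```bash
# Hypothetical helper: print only the generated text from a completions
# response (assumes the OpenAI-compatible schema: choices[0].text).
extract_completion() {
  python3 -c 'import sys, json; print(json.load(sys.stdin)["choices"][0]["text"])'
}
# usage: curl -k ... | extract_completion
```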
+ +--- + +## Troubleshooting + +- [SNO Troubleshooting Guide](./troubleshooting.md) diff --git a/third_party/Dell/openshift-4.20/iac/SNO-installation-guide.md b/third_party/Dell/openshift-4.20/iac/SNO-installation-guide.md new file mode 100644 index 00000000..667f0d10 --- /dev/null +++ b/third_party/Dell/openshift-4.20/iac/SNO-installation-guide.md @@ -0,0 +1,312 @@ +## Single Node OpenShift (SNO) Installation Using Assisted Installer + +This repository provides a **comprehensive guide** for installing **Single Node OpenShift (SNO)** using the **Red Hat Assisted Installer**. +It is intended for **edge deployments, lab environments, proof-of-concepts (PoCs), and development workloads**, and can be used on **bare metal servers**. + +--- + +## Table of Contents + +- [Overview](#overview) +- [Architecture](#architecture) +- [Prerequisites](#prerequisites) +- [Installation Procedure](#installation-procedure) +- [Accessing the SNO Node](#accessing-the-sno-node) +- [Post-Installation Storage Setup](#post-installation-storage-setup) +- [Validation & Health Checks](#validation--health-checks) + +--- + +## Overview + +**Single Node OpenShift (SNO)** is a deployment model where all OpenShift components—control plane and worker—run on a **single node**. + +The **Red Hat Assisted Installer** streamlines OpenShift installation by performing **preflight checks, hardware validation, network validation, and operator readiness checks** before cluster installation. 
**This is the official Red Hat reference document for SNO installation:** https://docs.redhat.com/en/documentation/openshift_container_platform/4.12/html/installing_on_a_single_node/install-sno-installing-sno

---

## Architecture

In a Single Node OpenShift deployment:

- One node acts as:
  - Control plane
  - Worker
  - etcd member
- High availability is **not provided**
- External load balancers are **not required**
- Ingress, API, and applications are exposed directly from the node

---

## Prerequisites

### Infrastructure Requirements

| Resource | Minimum | Recommended |
|--------|---------|-------------|
| CPU | 8 vCPU | 16+ vCPU |
| Memory | 16 GB | 32+ GB |
| Disk | 120 GB | 250+ GB |
| Network | Outbound Internet access | Stable, low-latency |
| DNS | Required | Highly reliable |

> **Note:** For workloads such as AI/ML inference, OpenShift Virtualization, or observability-heavy clusters, allocate additional CPU, memory, and disk.

### Network Requirements

- Outbound internet access is required during installation
- NTP must be enabled and system clocks synchronized
- Firewalls must allow the required OpenShift ports

---

## Installation Procedure

Open the OpenShift Assisted Installer UI and click **Create Cluster**:

```
https://console.redhat.com/openshift/assisted-installer/clusters
```
### Step 1: Configure cluster details

Provide the cluster details:

- **Cluster Name**: Provide a name for the cluster (example: api)
- **Base Domain**: Enter your domain (example: example.com)
> Note: The system automatically forms the full cluster URL, e.g. api.api.example.com. This value is permanent and cannot be changed later.
- **OpenShift Version**: Select OpenShift version 4.20.x (example: 4.20.17)
- **CPU architecture**: Select _x86_64_
- **Number of control plane nodes**: Select _1 (Single Node OpenShift)_

Next, under _Operators_, leave everything at the 
default and move to the next page.

### Step 2: Host Discovery (Generate and Boot the Discovery ISO)

- Click **Add host** and generate a Discovery ISO
- Choose **Provisioning type**: _Full Discovery ISO_
- Paste your SSH public key (required for accessing the node later)
- Click **Generate Discovery ISO**
- Once the ISO is generated, save the URL or download the ISO, and use it to boot the machine that will become the OpenShift cluster

> Note: The Full Discovery ISO reduces network dependencies and significantly improves installation reliability.

### Step 3: Boot your server using the ISO

**Boot the server using the Discovery ISO**
- Mount the ISO (via iDRAC / USB)
- Reboot the server and ensure it boots from the ISO
- The system starts a lightweight discovery agent

**Wait for host detection in the OpenShift UI**
- In the OpenShift UI, check the host inventory status
- Confirm the host is listed with:
  - Role: Control plane + Worker (SNO)
  - Status: Ready

### Step 4: Install the Cluster

**Validate the storage configuration**

- Ensure the correct installation disk is selected
- Verify additional disks are visible and not selected (for use later)

**Validate the networking configuration**
- Confirm the correct IP address is assigned (DHCP or static)
- Verify the active NIC is detected

> Note: Ensure connectivity to required endpoints (DNS / API if applicable)

**Install the cluster**

- Click **Install Cluster**; installation typically completes in 30–60 minutes
- Once the cluster is installed, download and save the kubeconfig file
- Also save the kubeadmin password

> Note: Do not reboot or shut down the node during installation.

---

## Accessing the SNO Node

### Copy kubeconfig to the Node

- Replace PATH_TO_YOUR_KUBECONFIG_FILE below with the actual path of your downloaded kubeconfig file; run this on your local machine.
- Replace NODE_IP with your node's actual IP
```bash
scp PATH_TO_YOUR_KUBECONFIG_FILE core@:/home/core/admin.kubeconfig
```
> **Note:** RHCOS restricts direct root access; `/home/core` is writable.

### SSH into the Node

```bash
ssh core@
```
### Switch to Root
```bash
sudo -i
```
> **Note:** RHCOS uses sudo-based privilege escalation. Direct root login is intentionally restricted.

Configure kubeconfig for root:
```bash
mv /home/core/admin.kubeconfig /root/admin.kubeconfig
export KUBECONFIG=/root/admin.kubeconfig
echo 'export KUBECONFIG=/root/admin.kubeconfig' >> ~/.bashrc
source ~/.bashrc
```
### Verify OpenShift Access (From Node)
```bash
oc whoami
oc get nodes
```
**Expected:**

- User: system:admin
- Node: Ready

---

## Post-Installation Storage Setup

SNO does not support dynamic storage provisioning by default. Local storage must be explicitly configured.

**Step 1: Install the Local Storage Operator**
```bash
oc create namespace openshift-local-storage
```

**Create the OperatorGroup**
```bash
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: local-operator-group
  namespace: openshift-local-storage
spec:
  targetNamespaces:
    - openshift-local-storage
EOF
```

**Create the Subscription**
```bash
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: openshift-local-storage
spec:
  channel: stable
  installPlanApproval: Automatic
  name: local-storage-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
```

**Verify**
```bash
oc get csv -n openshift-local-storage
```
**Expected**
```
local-storage-operator.vX.X.X   Succeeded
```

**Step 2: Create the LocalVolume**

Replace NODE_NAME with your node's hostname (from `oc get nodes`):
```bash
oc apply -f - <<EOF
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-storage
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - NODE_NAME
  storageClassDevices:
    - storageClassName: local-sc
      volumeMode: Filesystem
      fsType: xfs
      devicePaths:
        - /dev/nvme0n1
        - /dev/nvme5n1
        - /dev/nvme7n1
EOF
```
> **Note:** The Local Storage Operator creates only **one PV per disk**. Additional PVCs require additional disks or partitions.

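When deciding which `devicePaths` to add, it helps to list disks that have neither a filesystem nor a mountpoint. The filter below is a sketch (the function name is made up); on the node you would feed it real `lsblk` output:

```bash
# Hypothetical filter: given `lsblk -dn -o NAME,FSTYPE,MOUNTPOINT` output,
# print /dev paths for disks with neither a filesystem nor a mountpoint
# (such lines contain only the NAME field).
pick_unused_disks() {
  awk 'NF == 1 { print "/dev/" $1 }'
}
# usage (on the node): lsblk -dn -o NAME,FSTYPE,MOUNTPOINT | pick_unused_disks
```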
**Make local-sc the default StorageClass**
```bash
oc patch storageclass local-sc \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```

---

## Validation & Health Checks

### Validate Cluster Health
```bash
oc get nodes
```
```bash
oc get clusteroperators
```
**Ensure:**
- Available = True
- Progressing = False
- Degraded = False

### Verify StorageClass
```bash
oc get storageclass
```
**Expected:**

- local-sc

### Verify PVs

```bash
oc get pv
```
**Expected:**
```
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
local-pv-xxxx   3576Gi     RWO            Delete           Available           local-sc       5m
local-pv-yyyy   3576Gi     RWO            Delete           Available           local-sc       5m
local-pv-zzzz   3576Gi     RWO            Delete           Available           local-sc       5m
```
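The expected output above can be checked mechanically. The helper below is a sketch (the function name is made up); note that for unbound PVs the CLAIM column is empty, so with default `awk` whitespace splitting STATUS is field 5 and STORAGECLASS is field 6:

```bash
# Hypothetical check: count PVs that are Available in the local-sc class.
# Assumes `oc get pv --no-headers` output, where empty CLAIM columns are
# collapsed by awk's default field splitting.
count_available_local() {
  awk '$5 == "Available" && $6 == "local-sc"' | wc -l
}
# usage: oc get pv --no-headers | count_available_local
```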