From 05a91c4dca3aa0484fbd747bc4f7e6db192bc6d4 Mon Sep 17 00:00:00 2001 From: samzong Date: Fri, 25 Jul 2025 06:34:06 +0800 Subject: [PATCH] new docs for PVC storage and model management Signed-off-by: samzong --- README.md | 4 +- site/content/en/docs/architecture/_index.md | 7 + .../en/docs/architecture/pvc-storage-flow.md | 201 +++++++++++ .../en/docs/reference/storage-types.md | 212 +++++++++++ .../content/en/docs/troubleshooting/_index.md | 7 + .../en/docs/troubleshooting/pvc-storage.md | 341 ++++++++++++++++++ site/content/en/docs/user-guide/_index.md | 26 ++ .../en/docs/user-guide/storage/_index.md | 33 ++ .../en/docs/user-guide/storage/pvc-storage.md | 325 +++++++++++++++++ site/layouts/partials/scripts/mermaid.html | 68 ++++ 10 files changed, 1223 insertions(+), 1 deletion(-) create mode 100644 site/content/en/docs/architecture/_index.md create mode 100644 site/content/en/docs/architecture/pvc-storage-flow.md create mode 100644 site/content/en/docs/reference/storage-types.md create mode 100755 site/content/en/docs/troubleshooting/_index.md create mode 100644 site/content/en/docs/troubleshooting/pvc-storage.md create mode 100755 site/content/en/docs/user-guide/_index.md create mode 100644 site/content/en/docs/user-guide/storage/_index.md create mode 100644 site/content/en/docs/user-guide/storage/pvc-storage.md create mode 100644 site/layouts/partials/scripts/mermaid.html diff --git a/README.md b/README.md index 4880e078..1a6dcd52 100644 --- a/README.md +++ b/README.md @@ -19,6 +19,8 @@ Read the [documentation](https://sgl-project.github.io/ome/docs/) to learn more - **Model Management:** Models are first-class citizen custom resources in OME. Sophisticated model parsing extracts architecture, parameter count, and capabilities directly from model files. Supports distributed storage with automated repair, double encryption, namespace scoping, and multiple formats (SafeTensors, PyTorch, TensorRT, ONNX). 
+- **Flexible Storage Backends:** Serve models directly from Kubernetes Persistent Volume Claims (PVCs), HuggingFace Hub, cloud object storage (OCI, S3, Azure, GCS), or GitHub releases. PVC storage avoids data duplication and is documented in the [PVC user guide](https://sgl-project.github.io/ome/docs/user-guide/storage/pvc-storage/). + - **Intelligent Runtime Selection:** Automatic matching of models to optimal runtime configurations through weighted scoring based on architecture, format, quantization, parameter size, and framework compatibility. - **Optimized Deployments:** Supports multiple deployment patterns including prefill-decode disaggregation, multi-node inference, and traditional Kubernetes deployments with advanced scaling controls. @@ -123,4 +125,4 @@ High-level overview of the main priorities: ## License -OME is licensed under the [MIT License](LICENSE). \ No newline at end of file +OME is licensed under the [MIT License](LICENSE). diff --git a/site/content/en/docs/architecture/_index.md b/site/content/en/docs/architecture/_index.md new file mode 100644 index 00000000..942ee0ac --- /dev/null +++ b/site/content/en/docs/architecture/_index.md @@ -0,0 +1,7 @@ +--- +title: "Architecture" +linkTitle: "Architecture" +weight: 5 +description: > + Technical architecture and design details of OME components and systems. +--- diff --git a/site/content/en/docs/architecture/pvc-storage-flow.md b/site/content/en/docs/architecture/pvc-storage-flow.md new file mode 100644 index 00000000..e54db093 --- /dev/null +++ b/site/content/en/docs/architecture/pvc-storage-flow.md @@ -0,0 +1,201 @@ +--- +title: "PVC Storage" +date: 2025-07-25 +weight: 15 +description: > + Technical architecture and design decisions for PVC storage support in OME. +--- + +## Architecture Overview + +PVC storage in OME uses a **controller-only architecture** that bypasses the model +agent entirely. This design leverages Kubernetes native volume mounting and eliminates +model duplication. 
+ +## Key Design Decision: Skip Model Agent + +**Why?** DaemonSet pods cannot efficiently mount PVCs, especially ReadWriteOnce +volumes. + +| Traditional storage | PVC storage | +| ------------------------------------------------- | ------------------------------------------------------------------------ | +| Model Agent downloads artifacts to local hostPath | BaseModel Controller validates the PVC reference | +| Files live on each node’s disk | Controller spawns a metadata Job to read `config.json` from the PVC | +| Model Agent labels nodes so pods land correctly | BaseModel status flips to `Ready` and the PVC path becomes immediately usable | +| Scheduler targets labeled nodes | Model Agent skips PVC-backed models entirely | + +## Component Flow + +1. **User submits BaseModel/ClusterBaseModel** referencing a `pvc://` URI. +2. **BaseModel Controller validates** that the PVC exists, is bound, and is accessible. +3. **If PVC ready**, the controller creates a short-lived metadata extraction Job that mounts the PVC, reads `config.json`, and reports metadata back. If the PVC is missing or pending, the controller records a failure condition instead. +4. **BaseModel status updates** to `Ready` once metadata is populated; otherwise it stays in `MetadataPending` or `Failed`. +5. **User deploys an InferenceService** pointing to the BaseModel. The InferenceService controller wires the PVC into serving pods. +6. **Kubernetes scheduler** places pods on nodes that satisfy the PVC’s topology/access-mode constraints—no model-agent coordination needed. + +```mermaid +sequenceDiagram + autonumber + actor User + participant BMController as BaseModel Controller + participant MetadataJob as Metadata Job + participant K8s as Kubernetes + participant ISController as InferenceService Controller + + User->>BMController: Apply BaseModel (pvc://...) 
+ BMController->>K8s: Validate PVC (GET pvc) + BMController->>MetadataJob: Create Job w/ PVC mount + MetadataJob->>K8s: Mount PVC & read config.json + MetadataJob-->>BMController: Report metadata / status + BMController-->>User: Update BaseModel.status Ready + User->>ISController: Apply InferenceService referencing BaseModel + ISController->>K8s: Create Deployment + PVC volumeMount + K8s-->>User: Pods scheduled on nodes that satisfy PVC access mode +``` + +## Component Responsibilities + +| Component | Role | PVC Handling | +| ------------------------------- | ------------------------------ | -------------------------------------------------- | +| **Model Agent** | Downloads models, labels nodes | **Skips PVC** entirely | +| **BaseModel Controller** | Manages BaseModel lifecycle | **Primary owner** - validates, extracts metadata | +| **Metadata Job** | Extracts model config | **Temporary** - mounts PVC, reads config.json | +| **InferenceService Controller** | Manages serving pods | **Volume mounter** - creates pods with PVC volumes | + +## Core Design Decisions + +### 1. Why Skip Model Agent? + +**Problem**: DaemonSet + PVC incompatibility + +- DaemonSets run one agent pod on every node +- A ReadWriteOnce PVC can be mounted read-write by only one node at a time, so agent pods on other nodes cannot attach it +- Sharing an RWO volume across a DaemonSet would require complex coordination + +**Solution**: Controller-only approach + +```go +// Model agent explicitly skips PVC storage +switch storageType { +case storage.StorageTypePVC: + s.logger.Infof("Skipping PVC storage for model %s", modelInfo) + return nil +} +``` + +### 2. Why Use Jobs for Metadata? 
+ +**Problem**: Need to read model config from PVC +**Solution**: Ephemeral Jobs with PVC mount + +```yaml +# Metadata extraction job template +apiVersion: batch/v1 +kind: Job +metadata: + name: metadata-{model}-{hash} +spec: + template: + spec: + containers: + - name: extractor + image: ome/metadata-agent + volumeMounts: + - name: model-pvc + mountPath: /models + readOnly: true + volumes: + - name: model-pvc + persistentVolumeClaim: + claimName: { pvc-name } +``` + +### 3. Why No Node Labeling? + +**Traditional**: Model agent labels nodes with available models +**PVC**: Kubernetes scheduler handles PVC placement constraints + +**Traditional path**: the Model Agent labels nodes (e.g., `model-xyz=ready`), and InferenceService pods add node selectors so the scheduler picks one of those labeled nodes. + +**PVC path**: the scheduler already understands PVC topology and access modes, so pods simply declare the PVC volume. Kubernetes ensures they land on nodes that can mount it, avoiding extra labels or coordination. 
+ +## Storage Type Comparison + +| Aspect | PVC Storage | Object Storage | HuggingFace | +| ----------------- | ----------------- | ---------------- | ---------------- | +| **Model Agent** | Skipped | Downloads | Downloads | +| **Node Labels** | None | Creates labels | Creates labels | +| **Scheduling** | PVC constraints | Node selectors | Node selectors | +| **Data Transfer** | None | Network download | Network download | +| **Availability** | Storage dependent | Node replicated | Node replicated | + +## Security Model + +### RBAC Requirements + +```yaml +# BaseModel Controller permissions +- apiGroups: [""] + resources: ["persistentvolumeclaims"] + verbs: ["get", "list", "watch"] +- apiGroups: ["batch"] + resources: ["jobs"] + verbs: ["create", "get", "list", "watch"] + +# Metadata Job permissions +- apiGroups: [""] + resources: ["persistentvolumeclaims"] + verbs: ["get"] +- apiGroups: ["ome.io"] + resources: ["basemodels"] + verbs: ["update"] +``` + +### Security Boundaries + +- **Namespace isolation**: BaseModel → same namespace PVC only +- **Read-only mounts**: All PVC mounts are read-only +- **Minimal permissions**: Jobs have least-privilege access + +## Performance Profile + +| Operation | PVC Storage | Object Storage | +| ---------------------- | -------------- | ------------------- | +| **Model Loading** | Immediate | Minutes (download) | +| **Scaling Up** | Fast | Slow (re-download) | +| **Storage Efficiency** | No duplication | Replicated per node | + +**Performance depends on storage backend:** + +- **NFS**: Good for sharing, may bottleneck with many pods +- **Block storage**: Excellent single-pod, RWO limits concurrency +- **Distributed**: Scales well, varies by implementation + +## Common Issues & Solutions + +| Issue | Cause | Solution | +| ---------------------- | -------------------- | ------------------------------------- | +| MetadataPending | PVC not bound | Check PVC status, storage provisioner | +| Pod scheduling failure | PVC node constraints 
| Verify PVC accessible from nodes | +| Slow model loading | Storage performance | Use faster storage class | + +## Future Enhancements + +**Planned:** + +- Cross-namespace PVC access with RBAC +- Volume snapshot integration for versioning +- Multi-PVC model support +- Performance optimization hints + +**Integration:** + +- CSI driver advanced features +- Automatic storage class selection +- Volume expansion for growing repos + +## Related Documentation + +- [PVC Storage User Guide](/ome/docs/user-guide/storage/pvc-storage/) - How to use +- [Troubleshooting PVC Storage](/ome/docs/troubleshooting/pvc-storage/) - Common issues +- [Storage Types Reference](/ome/docs/reference/storage-types/) - Complete API spec diff --git a/site/content/en/docs/reference/storage-types.md b/site/content/en/docs/reference/storage-types.md new file mode 100644 index 00000000..39a90ead --- /dev/null +++ b/site/content/en/docs/reference/storage-types.md @@ -0,0 +1,212 @@ +--- +title: "Storage Types" +date: 2025-07-25 +weight: 20 +description: > + Complete API reference for all supported storage types in OME BaseModel and + ClusterBaseModel resources. 
+--- + +## Storage Type Overview + +| Type | URI Format | Agent Role | Metadata | Authentication | +| --------------- | ----------------------------- | ---------- | -------------- | -------------------------------- | +| **PVC** | `pvc://[ns:]name/path` | Skipped | Controller+Job | Kubernetes RBAC only | +| **OCI** | `oci://n/ns/b/bucket/o/path` | Downloads | Agent | Instance/User/Resource Principal | +| **HuggingFace** | `hf://model[@branch]` | Downloads | Agent | Token (optional) | +| **AWS S3** | `s3://bucket/path` | Downloads | Agent | Access key / IRSA | +| **Azure Blob** | `azure://account/container/path` | Downloads | Agent | Account key / MSI | +| **GCS** | `gs://bucket/path` | Downloads | Agent | Service account | +| **GitHub** | `gh://owner/repo@tag/path` | Downloads | Agent | Token (optional) | + +> Note: Additional cloud/object storage integrations will be documented once they graduate from preview support. + +## BaseModel Field Reference + +PVC and all other storage providers share the same `spec.storage` block of the +[BaseModel](https://sgl-project.github.io/ome/docs/reference/ome.v1beta1/#ome-io-v1beta1-BaseModel) +CRD: + +```yaml +spec: + storage: + storageUri: # Required for every storage type + path: /raid/models/optional-local # Optional node-local path for agent downloads + storageKey: my-secret # Secret containing credentials (non-PVC) + parameters: # Provider-specific hints + region: us-east-1 + annotations: + ome.io/skip-config-parsing: "true" # Optional PVC-specific override +``` + +Use `ClusterBaseModel.spec.storage.storageUri` to reference PVCs across namespaces (via +`pvc://{namespace}:{name}/{sub-path}`) while normal BaseModels must live beside the PVC. 
+ +## PVC Storage + +### URI Format + +``` +# BaseModel (same namespace) +pvc://{pvc-name}/{sub-path} + +# ClusterBaseModel (explicit namespace) +pvc://{namespace}:{pvc-name}/{sub-path} +``` + +### Parameters + +| Field | Required | Description | +| ----------- | --------------------- | ---------------------------------- | +| `pvc-name` | Yes | Name of PVC containing models | +| `namespace` | ClusterBaseModel only | Namespace containing PVC | +| `sub-path` | Yes | Path within PVC to model directory | + +### Examples + +```yaml +# BaseModel +storage: + storageUri: "pvc://models-pvc/llama/llama-3-70b" + +# ClusterBaseModel +storage: + storageUri: "pvc://ai-models:models-pvc/llama/llama-3-70b" +``` + +### Requirements + +- PVC must be `Bound` +- Model directory must contain `config.json` +- Files readable by metadata extraction job +- BaseModel service account must be able to `get` the PVC (namespace scoped) + +### Related CRD Fields + +- `spec.storage.storageUri` — `pvc://` URI including namespace prefix for ClusterBaseModel. +- `metadata.annotations["ome.io/skip-config-parsing"]` — opt out of metadata extraction if + `config.json` is absent. +- `spec.storage.parameters["subPath"]` (optional) — override auto-detected subpath inside the PVC. 
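
The URI rules above can be checked before a manifest is applied. The following is a minimal sketch, not OME's own parser: `parse_pvc_uri` is a hypothetical helper name, and the tab-separated output format is an assumption made for illustration. It accepts both the BaseModel form and the namespace-prefixed ClusterBaseModel form.

```bash
# Sketch only: parse_pvc_uri is a hypothetical helper, not part of OME.
# Splits pvc://[namespace:]pvc-name/sub-path and prints
# "namespace<TAB>pvc-name<TAB>sub-path"; namespace is empty for the BaseModel form.
parse_pvc_uri() {
  uri="$1"
  case "$uri" in
    pvc://*) rest="${uri#pvc://}" ;;
    *) echo "not a pvc:// URI: $uri" >&2; return 1 ;;
  esac
  case "$rest" in
    */*) ;;
    *) echo "missing sub-path: $uri" >&2; return 1 ;;
  esac
  claim="${rest%%/*}"    # text before the first slash: [namespace:]pvc-name
  subpath="${rest#*/}"   # everything after the first slash
  namespace=""
  name="$claim"
  case "$claim" in
    *:*)
      namespace="${claim%%:*}"
      name="${claim#*:}"
      # A colon with nothing in front of it is not a valid namespace prefix.
      [ -n "$namespace" ] || { echo "empty namespace: $uri" >&2; return 1; }
      ;;
  esac
  if [ -z "$name" ] || [ -z "$subpath" ]; then
    echo "missing PVC name or sub-path: $uri" >&2
    return 1
  fi
  printf '%s\t%s\t%s\n' "$namespace" "$name" "$subpath"
}
```

For example, `parse_pvc_uri "pvc://ai-models:models-pvc/llama/llama-3-70b"` prints `ai-models`, `models-pvc`, and `llama/llama-3-70b` separated by tabs, while a URI without a sub-path returns a non-zero status.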
+ +## OCI Object Storage + +**URI Format:** `oci://n/{namespace}/b/{bucket}/o/{object_path}` + +### URI Components + +| Component | Required | Description | +| ------------- | -------- | --------------------------------- | +| `namespace` | Yes | OCI compartment namespace | +| `bucket` | Yes | Object storage bucket name | +| `object_path` | Yes | Path to model files within bucket | + +### Examples + +```yaml +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + name: llama-oci +spec: + storage: + storageUri: "oci://n/ai-models/b/llm-store/o/meta/llama-3.1-70b-instruct/" + path: "/raid/models/llama-3.1-70b-instruct" + storageKey: "oci-credentials" + parameters: + region: "us-phoenix-1" + auth_type: "InstancePrincipal" +``` + +### Authentication Methods + +| Method | Description | Configuration | +| ------------------- | ----------------------------- | -------------------------------- | +| `InstancePrincipal` | Use compute instance identity | No credentials needed | +| `UserPrincipal` | User-based authentication | Requires API key in secret | +| `ResourcePrincipal` | OKE resource principal | Automatic in OKE clusters | +| `WorkloadIdentity` | Service account based | Requires workload identity setup | + +## Authentication Patterns + +### Credential Storage + +```yaml +# HuggingFace +apiVersion: v1 +kind: Secret +metadata: + name: hf-token +type: Opaque +stringData: + token: hf_xxx + +# OCI (User Principal) +apiVersion: v1 +kind: Secret +metadata: + name: oci-credentials +type: Opaque +stringData: + tenancy_ocid: ocid1.tenancy.oc1..example + user_ocid: ocid1.user.oc1..example + fingerprint: 1a:2b:3c:4d + private_key: |- + -----BEGIN PRIVATE KEY----- + ... 
+ -----END PRIVATE KEY----- +``` + +### BaseModel Template + +```yaml +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + name: example-model +spec: + storage: + storageUri: "" + path: "/local/path/to/model" + storageKey: "credential-secret-name" + parameters: + key: "value" # Storage-specific parameters +``` + +## Storage Selection Guide + +| Use Case | Recommended Type | Access Pattern | +| -------------------- | ---------------- | ---------------------- | +| **Development** | HuggingFace | Frequent model updates | +| **High Performance** | PVC (NVMe/SSD) | Low latency serving | +| **Shared Models** | PVC (NFS/RWX) | Multiple consumers | +| **Cloud Native** | OCI Object Store | Durable, versioned | +| **Hybrid** | HF + PVC | Sync public to private | + +## Common Configurations + +### Multi-Environment + +```yaml +# Dev +storageUri: "hf://model-name" + +# Prod +storageUri: "pvc://prod-models/model-name" +``` + +### Hybrid Sync (OCI to PVC) + +```yaml +# Authoritative copy in Object Storage +storageUri: "oci://n/ml/b/prod-models/o/llama-3.1-70b-instruct" + +# Cached copy inside the cluster +storageUri: "pvc://prod-models/llama-3-1-70b" +``` + +## Related Documentation + +- [PVC Storage Guide](/ome/docs/user-guide/storage/pvc-storage/) - PVC usage +- [BaseModel Reference](/ome/docs/concepts/base_model/) - Complete BaseModel spec +- [Troubleshooting](/ome/docs/troubleshooting/pvc-storage/) - Common issues +- [Architecture: PVC Flow](/ome/docs/architecture/pvc-storage-flow/) - Controller and job design diff --git a/site/content/en/docs/troubleshooting/_index.md b/site/content/en/docs/troubleshooting/_index.md new file mode 100755 index 00000000..160b722e --- /dev/null +++ b/site/content/en/docs/troubleshooting/_index.md @@ -0,0 +1,7 @@ +--- +title: "Troubleshooting" +linkTitle: "Troubleshooting" +weight: 10 +description: > + Troubleshooting guides for common OME issues. 
+--- diff --git a/site/content/en/docs/troubleshooting/pvc-storage.md b/site/content/en/docs/troubleshooting/pvc-storage.md new file mode 100644 index 00000000..27f1f752 --- /dev/null +++ b/site/content/en/docs/troubleshooting/pvc-storage.md @@ -0,0 +1,341 @@ +--- +title: "PVC Storage" +date: 2025-07-25 +weight: 10 +description: > + Common issues, diagnostics, and solutions for PVC storage in OME models. +--- + +## Quick Diagnostic Checklist + +Run these commands first to identify the issue: + +```bash +# Check BaseModel status +kubectl get basemodel -o wide + +# Check PVC status +kubectl get pvc -n <namespace> + +# Check recent events +kubectl get events --sort-by='.lastTimestamp' | grep -E 'basemodel|pvc' + +# Check metadata extraction jobs +kubectl get jobs -l "app.kubernetes.io/component=metadata-extraction" +``` + +## Common Issues Reference + +| Symptom | Likely Cause | Quick Check | Solution | +| ------------------- | ---------------------- | ----------------------- | ------------------------------------------------ | +| **MetadataPending** | PVC not found | `kubectl get pvc` | [Create PVC](#pvc-not-found) | +| **MetadataPending** | PVC not bound | `kubectl describe pvc` | [Fix storage](#pvc-not-bound) | +| **MetadataPending** | config.json missing | Check metadata job logs | [Fix model structure](#config-json-missing) | +| **Pod FailedMount** | Multi-attach error | RWO PVC + pods on multiple nodes | [Use RWX or single replica](#multi-attach-error) | +| **Pod Pending** | Node affinity conflict | PVC not accessible | [Check PV topology](#node-affinity-conflict) | +| **Permission denied** | RBAC restrictions | Job / controller logs | [Fix RBAC](#rbac-permissions) | +| **Slow loading** | Storage performance | Monitor I/O | [Optimize storage](#storage-performance) | + +## Detailed Solutions + +### PVC Not Found + +**Error:** `PVC 'model-storage-pvc' not found in namespace 'models'` + +**Diagnosis:** + +```bash +# Check if PVC exists +kubectl get pvc -n <namespace> + +# Check URI format in 
BaseModel +kubectl get basemodel <basemodel-name> -o jsonpath='{.spec.storage.storageUri}' +``` + +**Solutions:** + +1. **Create missing PVC:** + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: model-storage-pvc + namespace: models +spec: + accessModes: [ReadWriteMany] + resources: + requests: + storage: 200Gi + storageClassName: your-storage-class +``` + +2. **Fix URI format:** + +```yaml +# BaseModel - same namespace +storageUri: "pvc://model-storage-pvc/path/to/model" + +# ClusterBaseModel - explicit namespace +storageUri: "pvc://models:model-storage-pvc/path/to/model" +``` + +### PVC Not Bound + +**Error:** PVC status shows "Pending" instead of "Bound" + +**Diagnosis:** + +```bash +# Check PVC status and events +kubectl describe pvc <pvc-name> -n <namespace> + +# Check storage class and provisioner +kubectl get storageclass +kubectl get pods -n kube-system | grep provisioner +``` + +**Solutions:** + +1. **Check storage class exists:** + +```bash +kubectl get storageclass +``` + +2. **Verify provisioner is running:** + +```bash +kubectl logs -n kube-system <provisioner-pod> +``` + +3. **Check resource quotas:** + +```bash +kubectl describe quota -n <namespace> +``` + +### Config.json Missing + +**Error:** `config.json not found at path /models/model-name/config.json` + +**Diagnosis:** + +```bash +# Debug PVC contents +kubectl run pvc-debug --rm -i --tty --image=alpine \ + --overrides='{"spec":{"volumes":[{"name":"pvc","persistentVolumeClaim":{"claimName":"<pvc-name>"}}],"containers":[{"name":"debug","image":"alpine","volumeMounts":[{"mountPath":"/models","name":"pvc"}],"command":["sh"]}]}}' + +# Inside pod: check structure +find /models -name "config.json" +ls -la /models/<model-path>/ +``` + +**Solutions:** + +1. **Fix model path in URI:** + +```yaml +# If config.json is at /models/subdir/config.json +storageUri: "pvc://pvc-name/subdir" +``` + +2. 
**Skip automatic parsing:** + +```yaml +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + annotations: + ome.io/skip-config-parsing: "true" +spec: + modelType: "llama" + modelArchitecture: "LlamaForCausalLM" + # ... other metadata fields + storage: + storageUri: "pvc://pvc-name/path" +``` + +### Multi-Attach Error + +**Error:** `Volume is already exclusively attached to one node` + +**Cause:** A ReadWriteOnce PVC is requested by pods scheduled on different nodes. RWO allows read-write mounting from a single node only. + +**Solutions:** + +1. **Use a ReadWriteMany PVC** (access modes cannot be changed on an existing claim, so create a new one): + +```yaml +spec: + accessModes: [ReadWriteMany] # Changed from ReadWriteOnce +``` + +2. **Limit to single replica:** + +```yaml +apiVersion: ome.io/v1beta1 +kind: InferenceService +spec: + engine: + minReplicas: 1 + maxReplicas: 1 +``` + +### Node Affinity Conflict + +**Error:** `node(s) had volume node affinity conflict` + +**Cause:** The PersistentVolume backing the PVC is pinned to specific zones/nodes that do not +match where the scheduler is trying to place predictor pods. + +**Diagnosis:** + +```bash +# Identify the bound PV and inspect its node affinity requirements +kubectl get pvc <pvc-name> -o jsonpath='{.spec.volumeName}' +kubectl describe pv <pv-name> | sed -n '/Node Affinity/,+10p' + +# Confirm nodes satisfy the same topology keys +kubectl get nodes --show-labels | grep topology.kubernetes.io/zone +``` + +**Solutions:** + +1. Schedule the InferenceService using node selectors/affinity that match the PV's + `nodeAffinity` requirements. +2. If the PVC relies on zone-specific storage classes, create PVCs per zone or switch to an + RWX storage class that is accessible cluster-wide. +3. Verify the CSI driver exposes the correct topology labels and that the nodes running + predictor pods can attach to the volume class. + +> PVC-backed models never rely on model-agent labels, so focus solely on PV topology constraints. + +### RBAC Permissions + +**Error:** Metadata job or BaseModel controller logs show `"forbidden"` when reading PVCs or +updating BaseModel status. 
+ +**Diagnosis:** + +```bash +# Inspect controller logs +kubectl logs deploy/ome-controller-manager -n ome | grep -i pvc + +# Verify the service account has PVC + Job verbs +kubectl get clusterrole ome-controller-manager -o yaml | grep -A2 persistentvolumeclaims +``` + +**Solutions:** + +1. Ensure the controller ClusterRole includes `get`, `list`, `watch` on `persistentvolumeclaims` + and `create` on `batch/jobs` (see architecture doc for reference bindings). +2. For metadata jobs failing with permission issues, confirm the job's service account can read + the PVC namespace and update the BaseModel status (`patch` on `basemodels`). +3. Re-apply the Helm chart or manually patch the ClusterRole/RoleBinding if they drifted from the + release defaults. + +### Storage Performance + +**Symptoms:** Slow model loading, high I/O wait + +**Diagnosis:** + +```bash +# Monitor pod resource usage +kubectl top pods + +# Check I/O in pod (the container image must ship iostat) +kubectl exec <pod-name> -- iostat -x 1 +``` + +**Solutions:** + +1. **Use faster storage class:** + +```yaml +storageClassName: fast-ssd # or nvme, premium-ssd +``` + +2. 
**Optimize for your storage:** + +- **NFS:** Tune mount options (`rsize=1048576,wsize=1048576`) +- **Block:** Use high-IOPS storage tiers +- **Cloud:** Request higher IOPS/throughput + +## Diagnostic Scripts + +### Complete Status Check + +```bash +#!/bin/bash +# Usage: ./pvc-debug.sh <model-name> <namespace> + +MODEL_NAME=${1:-"my-model"} +NAMESPACE=${2:-"default"} + +echo "=== BaseModel Status ===" +kubectl get basemodel "$MODEL_NAME" -n "$NAMESPACE" -o wide + +echo -e "\n=== PVC Status ===" +PVC_URI=$(kubectl get basemodel "$MODEL_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.storage.storageUri}') +# Extract the PVC name: the text between "pvc://" and the next slash +PVC_NAME=$(echo "$PVC_URI" | sed 's|.*pvc://\([^/]*\).*|\1|') +kubectl get pvc "$PVC_NAME" -n "$NAMESPACE" -o wide + +echo -e "\n=== Recent Events ===" +kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | grep -E "$MODEL_NAME|$PVC_NAME" | tail -5 + +echo -e "\n=== Metadata Jobs ===" +kubectl get jobs -n "$NAMESPACE" -l "app.kubernetes.io/component=metadata-extraction" +``` + +### PVC Content Explorer + +```bash +# Interactive PVC debugging +kubectl run pvc-debug-$(date +%s) --rm -i --tty --image=alpine \ + --overrides='{ + "spec": { + "volumes": [{"name":"pvc","persistentVolumeClaim":{"claimName":"<pvc-name>"}}], + "containers": [{ + "name":"debug", + "image":"alpine", + "volumeMounts":[{"mountPath":"/models","name":"pvc"}], + "command":["sh","-c","apk add --no-cache file && sh"] + }] + } + }' \ + --namespace=<namespace> + +# Inside pod: +# ls -la /models/ +# find /models -name "config.json" +# file /models/*/config.json +``` + +## Error Quick Reference + +| Error | Component | Fix | +| ------------------------------- | ------------ | ------------------------------- | +| `PVC not found` | Controller | Create PVC or fix URI | +| `PVC not bound` | Storage | Check provisioner/storage class | +| `config.json not found` | Metadata Job | Fix path or skip parsing | +| `Multi-Attach error` | Kubernetes | Use RWX or single replica | +| `Volume node affinity conflict` | Scheduler | Check PV topology | +| `Permission denied` | Metadata Job | 
Fix file permissions | + +## Prevention Checklist + +Before creating BaseModel: + +- [ ] PVC exists and is bound +- [ ] Model files at correct path with config.json +- [ ] Appropriate access mode (RWX for sharing, RWO for performance) +- [ ] Storage class supports required performance +- [ ] RBAC permissions configured + +## Related Documentation + +- [PVC Storage User Guide](/ome/docs/user-guide/storage/pvc-storage/) - Usage instructions +- [PVC Storage Architecture](/ome/docs/architecture/pvc-storage-flow/) - Technical details +- [Storage Types Reference](/ome/docs/reference/storage-types/) - Complete API spec diff --git a/site/content/en/docs/user-guide/_index.md b/site/content/en/docs/user-guide/_index.md new file mode 100755 index 00000000..5b4a7807 --- /dev/null +++ b/site/content/en/docs/user-guide/_index.md @@ -0,0 +1,26 @@ +--- +title: "User Guide" +linkTitle: "User Guide" +weight: 6 +description: > + Step-by-step guides for common OME tasks and workflows. +--- + +# User Guide + +This section provides practical, step-by-step guides for common OME tasks and workflows. +Whether you're new to OME or looking to implement specific features, these guides will help you +get started quickly and effectively. + +## Getting Started + +- [Storage Configuration](/ome/docs/user-guide/storage/) - Configure different storage backends + for your models + +## Storage Guides + +- [PVC Storage](/ome/docs/user-guide/storage/pvc-storage/) - Use models from Kubernetes + Persistent Volume Claims + +For more comprehensive information about OME concepts, see the [Concepts](/ome/docs/concepts/) +section. 
diff --git a/site/content/en/docs/user-guide/storage/_index.md b/site/content/en/docs/user-guide/storage/_index.md new file mode 100644 index 00000000..bb1f7ff8 --- /dev/null +++ b/site/content/en/docs/user-guide/storage/_index.md @@ -0,0 +1,33 @@ +--- +title: "Storage Configuration" +linkTitle: "Storage Configuration" +date: 2025-07-25 +weight: 10 +description: > + Configure different storage backends for your OME models. +--- + +OME supports multiple storage backends for BaseModel and ClusterBaseModel resources. This +section provides detailed guides for configuring and using different storage types. + +## Available Storage Types + +- **[PVC Storage](/ome/docs/user-guide/storage/pvc-storage/)** - Use models stored in + Kubernetes Persistent Volume Claims +- **OCI Object Storage** - Oracle Cloud Infrastructure object storage +- **HuggingFace Hub** - Public and private models from HuggingFace +- **AWS S3** - Amazon S3 compatible storage +- **Azure Blob Storage** - Microsoft Azure blob storage +- **Google Cloud Storage** - Google Cloud Platform storage +- **GitHub Releases** - Models distributed via GitHub releases + +## Choosing a Storage Type + +The choice of storage type depends on your specific requirements: + +- **Use PVC Storage** when you have models already stored in Kubernetes persistent volumes +- **Use Object Storage** (OCI, S3, Azure, GCS) for cloud-native deployments +- **Use HuggingFace** for public models or when developing with transformer models +- **Use GitHub Releases** for open-source model projects with version control + +For a complete comparison of storage types, see the [Storage Types Reference](/ome/docs/reference/storage-types/). 
diff --git a/site/content/en/docs/user-guide/storage/pvc-storage.md b/site/content/en/docs/user-guide/storage/pvc-storage.md new file mode 100644 index 00000000..da51f56a --- /dev/null +++ b/site/content/en/docs/user-guide/storage/pvc-storage.md @@ -0,0 +1,325 @@ +--- +title: "PVC Storage for Models" +date: 2025-07-25 +weight: 10 +description: > + Use models stored in Kubernetes Persistent Volume Claims (PVCs) directly with OME, + eliminating the need to copy models to object storage. +--- + +## Quick Start + +### Prerequisites + +- Kubernetes cluster with PVC support +- Model files already in a PVC with `config.json` + +### Step 1: Create PVC (if needed) + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: models-pvc + namespace: default +spec: + accessModes: [ReadWriteMany] + resources: + requests: + storage: 100Gi +``` + +### Step 2: Create BaseModel + +```yaml +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + name: my-model +spec: + storage: + storageUri: "pvc://models-pvc/path/to/model" +``` + +### Step 3: Verify & Use + +```bash +# Check status +kubectl get basemodel my-model -o wide +``` + +```yaml +# Deploy an InferenceService that consumes the PVC model +apiVersion: ome.io/v1beta1 +kind: InferenceService +metadata: + name: my-service +spec: + model: + name: my-model + runtime: vllm-runtime + predictor: + containers: + - name: predictor + image: ghcr.io/sgl-project/vllm-runtime:latest + args: + - "--model" + - "$(MODEL_PATH)" + resources: + limits: + nvidia.com/gpu: "1" +``` + +`MODEL_PATH` is automatically injected by the controller and points at the PVC mount. The PVC +volume is also mounted into the predictor container—no additional `volumeMounts` are required. + +**Done!** Your PVC model is now serving. 
+ +--- + +## When to Use PVC Storage + +| Use PVC When | Don't Use When | +| ------------------------- | ----------------------------- | +| Models already in PVCs | Need models on specific nodes | +| Avoiding data duplication | Want model agent management | +| High-performance storage | Need node-specific labeling | +| Shared model repositories | Require local caching | + +## URI Format Reference + +### BaseModel (same namespace) + +``` +pvc://{pvc-name}/{sub-path} +``` + +### ClusterBaseModel (explicit namespace) + +``` +pvc://{namespace}:{pvc-name}/{sub-path} +``` + +**Examples:** + +```yaml +# BaseModel - PVC in same namespace +storageUri: "pvc://model-storage/llama/llama-3-70b" + +# ClusterBaseModel - specify namespace +storageUri: "pvc://ai-models:model-storage/llama/llama-3-70b" +``` + +## Common Use Cases + +### Shared NFS Models + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: nfs-models + namespace: ai-models # Must match the namespace in the storageUri below +spec: + accessModes: [ReadWriteMany] + storageClassName: nfs-csi + resources: + requests: + storage: 1Ti +--- +apiVersion: ome.io/v1beta1 +kind: ClusterBaseModel +metadata: + name: shared-llama +spec: + storage: + storageUri: "pvc://ai-models:nfs-models/models/llama-3-70b" +``` + +### High-Performance Block Storage + +```yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: fast-models +spec: + accessModes: [ReadWriteOnce] # Single node only + storageClassName: fast-ssd + resources: + requests: + storage: 500Gi +--- +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + name: fast-llama +spec: + storage: + storageUri: "pvc://fast-models/models/llama-3-70b" +``` + +### Manual Metadata (No config.json) + +```yaml +apiVersion: ome.io/v1beta1 +kind: BaseModel +metadata: + name: custom-model + annotations: + ome.io/skip-config-parsing: "true" +spec: + modelType: "llama" + modelArchitecture: "LlamaForCausalLM" + modelParameterSize: "70B" + maxTokens: 8192 + modelCapabilities: [text-to-text] + modelFormat: + name: "safetensors" +
 storage: + storageUri: "pvc://models-pvc/custom/my-model" +``` + +## PVC Access Modes + +| Mode | Use Case | Behavior | Storage Type | +| ----------------- | ----------------------------- | ------------------------------- | ------------------- | +| **ReadWriteMany** | Shared models, multiple pods | Multiple pods can mount | NFS, distributed | +| **ReadWriteOnce** | High-performance, single node | Mountable by one node at a time | Block storage, SSDs | +| **ReadOnlyMany** | Immutable model repos | Multiple read-only pods | Any storage | + +## Model Directory Structure + +Your PVC must contain models in this structure: + +``` +/models/ +├── llama-3-70b-instruct/ +│ ├── config.json # Required for auto-metadata +│ ├── model-*.safetensors # Model files +│ ├── tokenizer.json +│ └── tokenizer_config.json +``` + +## Monitoring & Status + +### Quick Status Check + +```bash +# Check BaseModel + metadata +kubectl get basemodel my-model -o yaml | yq '.status' + +# Check PVC +kubectl get pvc models-pvc + +# Check metadata extraction job (if any) +kubectl get jobs -l "app.kubernetes.io/component=metadata-extraction" -n default +``` + +### Fields to Inspect + +```yaml +status: + state: Ready # Overall lifecycle state + lifecycle: Ready # Detailed lifecycle marker + nodesReady: [] # PVC models skip node labeling, so this is normally empty +spec: + modelType: llama # Auto-populated from config.json when available + modelArchitecture: LlamaForCausalLM +``` + +If `config.json` is missing or metadata parsing fails, set the `ome.io/skip-config-parsing: "true"` annotation and fill in +`spec.modelType`, `spec.modelArchitecture`, and optionally `spec.modelParameterSize` manually. 
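In the common case, whether the annotation is needed comes down to whether `config.json` is present in the model directory. As a rough sketch, the check below could be run from any pod that mounts the PVC; the function name `metadata_mode` and the example mount path are hypothetical and not part of OME.

```bash
#!/usr/bin/env bash
# Sketch: report how metadata would likely be supplied for a model directory.
# Prints "auto" when config.json exists and "manual" when it does not.
# Returns 1 if the directory itself is missing.
metadata_mode() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "model directory not found: $dir" >&2
    return 1
  fi
  if [ -f "$dir/config.json" ]; then
    echo "auto"
  else
    echo "manual"
  fi
}
```

Inside a pod with the PVC mounted at `/models`, a result of `manual` from `metadata_mode /models/llama-3-70b-instruct` suggests setting the annotation and filling in the `spec` fields by hand. A present but malformed `config.json` would still print `auto`, so this check does not replace inspecting the BaseModel status.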
+ +### PVC Metrics to Monitor + +- PVC `ACCESS MODES` (must align with the number of replicas you intend to run) +- PVC capacity/usage via `kubectl describe pvc models-pvc` +- Metadata job logs for config parsing failures + +## Populate & Migrate Models + +### Example: Populating a PVC (HuggingFace) + +```yaml +apiVersion: batch/v1 +kind: Job +metadata: + name: download-model-to-pvc +spec: + template: + spec: + containers: + - name: downloader + image: ghcr.io/huggingface/transformers-pytorch-gpu:latest + command: + - /bin/bash + - -c + - | + git lfs install + cd /models + git clone https://huggingface.co/meta-llama/Llama-2-7b-hf + volumeMounts: + - mountPath: /models + name: model-storage + restartPolicy: OnFailure + volumes: + - name: model-storage + persistentVolumeClaim: + claimName: models-pvc +``` + +### Migrating from Object Storage + +```bash +# Copy from S3/Object storage into the PVC +kubectl apply -f - <<'MANIFEST' +apiVersion: batch/v1 +kind: Job +metadata: + name: migrate-to-pvc +spec: + template: + spec: + serviceAccountName: pvc-migrator + containers: + - name: migrator + image: amazon/aws-cli + command: ["aws", "s3", "sync", "s3://my-bucket/model/", "/models/"] + env: + - name: AWS_REGION + value: us-east-1 + volumeMounts: + - name: target-pvc + mountPath: /models + volumes: + - name: target-pvc + persistentVolumeClaim: + claimName: models-pvc + restartPolicy: OnFailure +MANIFEST + +# Point the BaseModel at the PVC once files exist +kubectl patch basemodel my-model --type='merge' \ + -p='{"spec":{"storage":{"storageUri":"pvc://models-pvc/model-path"}}}' +``` + +## Best Practices + +- **Storage Class**: Use appropriate performance tier (fast-ssd for inference, nfs for sharing) +- **Sizing**: Plan for 100GB+ per large model +- **Organization**: Clear directory structure with consistent naming +- **Monitoring**: Track PVC usage and model status +- **Security**: Implement RBAC and storage encryption + +## Limitations + +- No model agent involvement (no 
node labels) +- BaseModel can only access same-namespace PVCs +- Requires PVC to be bound and accessible +- Performance depends on storage backend + +## Related Documentation + +- [Troubleshooting PVC Storage](/ome/docs/troubleshooting/pvc-storage/) - Common issues +- [PVC Storage Architecture](/ome/docs/architecture/pvc-storage-flow/) - Technical details +- [Storage Types Reference](/ome/docs/reference/storage-types/) - Complete API spec diff --git a/site/layouts/partials/scripts/mermaid.html b/site/layouts/partials/scripts/mermaid.html new file mode 100644 index 00000000..e4d9fc32 --- /dev/null +++ b/site/layouts/partials/scripts/mermaid.html @@ -0,0 +1,68 @@ +{{ $version := .Site.Params.mermaid.version | default "latest" -}} +{{ $cdnurl := printf "https://cdn.jsdelivr.net/npm/mermaid@%s/dist/mermaid.esm.min.mjs" $version -}} + +{{ $remote := try (resources.GetRemote $cdnurl) -}} +{{ with $remote.Err -}} + {{ errorf "Could not retrieve mermaid script from CDN. Reason: %s." . -}} +{{ else -}} + {{ if not $remote.Value -}} + {{ errorf "Invalid Mermaid version %s, could not retrieve this version from CDN." $version -}} + {{ end -}} +{{ end -}} + +