diff --git a/skills/sap-rpt1-oss-predictor/SKILL.md b/skills/sap-rpt1-oss-predictor/SKILL.md
new file mode 100644
index 000000000..ac86311ff
--- /dev/null
+++ b/skills/sap-rpt1-oss-predictor/SKILL.md
@@ -0,0 +1,150 @@
+---
+name: sap-rpt1-oss-predictor
+description: Use the SAP-RPT-1-OSS open source tabular foundation model for predictive analytics on SAP business data. Handles classification and regression tasks including customer churn prediction, delivery delay forecasting, payment default risk, demand planning, and financial anomaly detection. Use when asked to predict, forecast, classify, or analyze patterns in SAP tabular data exports (CSV/DataFrame). Runs locally via a Hugging Face model.
+---
+
+# SAP-RPT-1-OSS Predictor
+
+SAP-RPT-1-OSS is SAP's open source tabular foundation model (Apache 2.0) for predictions on structured business data. Unlike LLMs that predict text, RPT-1 predicts field values in table rows using in-context learning—no model training required.
+
+**Repository**: https://github.com/SAP-samples/sap-rpt-1-oss
+**Model**: https://huggingface.co/SAP/sap-rpt-1-oss
+
+## Setup
+
+### 1. Install Package
+
+```bash
+pip install git+https://github.com/SAP-samples/sap-rpt-1-oss
+```
+
+### 2. Hugging Face Authentication
+
+Model weights require HF login and license acceptance:
+
+```bash
+# Install HF CLI
+pip install huggingface_hub
+
+# Login (stores an access token locally, under ~/.cache/huggingface/ in current versions)
+huggingface-cli login
+```
+
+Then accept the model terms at: https://huggingface.co/SAP/sap-rpt-1-oss
+
+### 3. 
Hardware Requirements + +| Config | GPU Memory | Context Size | Bagging | Use Case | +|--------|------------|--------------|---------|----------| +| Optimal | 80GB (A100) | 8192 | 8 | Production, best accuracy | +| Standard | 40GB (A6000) | 4096 | 4 | Good balance | +| Minimal | 24GB (RTX 4090) | 2048 | 2 | Development | +| CPU | N/A | 1024 | 1 | Testing only (slow) | + +## Quick Start + +### Classification (Customer Churn, Payment Default) + +```python +import pandas as pd +from sap_rpt_oss import SAP_RPT_OSS_Classifier + +# Load SAP data export +df = pd.read_csv("sap_customers.csv") +X = df.drop(columns=["CHURN_STATUS"]) +y = df["CHURN_STATUS"] + +# Split data +X_train, X_test = X[:400], X[400:] +y_train, y_test = y[:400], y[400:] + +# Initialize and predict +clf = SAP_RPT_OSS_Classifier(max_context_size=4096, bagging=4) +clf.fit(X_train, y_train) + +predictions = clf.predict(X_test) +probabilities = clf.predict_proba(X_test) +``` + +### Regression (Delivery Delay Days, Demand Quantity) + +```python +from sap_rpt_oss import SAP_RPT_OSS_Regressor + +reg = SAP_RPT_OSS_Regressor(max_context_size=4096, bagging=4) +reg.fit(X_train, y_train) +predictions = reg.predict(X_test) +``` + +## Core Workflow + +1. **Extract SAP data** → Export to CSV from relevant tables +2. **Prepare dataset** → Include 50-500 rows with known outcomes +3. **Rename fields** → Use semantic names (see Data Preparation) +4. **Run prediction** → Fit on training data, predict on new data +5. 
**Interpret results** → Probabilities for classification, values for regression + +## SAP Use Cases + +See `references/sap-use-cases.md` for detailed extraction queries: + +- **FI-AR**: Payment default probability (BSID, BSAD, KNA1) +- **FI-GL**: Journal entry anomaly detection (ACDOCA, BKPF) +- **SD**: Delivery delay prediction (VBAK, VBAP, LIKP) +- **SD**: Customer churn likelihood (VBRK, VBRP, KNA1) +- **MM**: Vendor performance scoring (EKKO, EKPO, EBAN) +- **PP**: Production delay risk (AFKO, AFPO) + +## Data Preparation + +### Semantic Column Names (Important!) + +RPT-1-OSS uses an LLM to embed column names and values. Descriptive names improve accuracy: + +```python +# Good: Model understands business context +CUSTOMER_CREDIT_LIMIT, DAYS_SINCE_LAST_ORDER, PAYMENT_DELAY_DAYS + +# Bad: Generic names lose semantic value +COL1, VALUE, FIELD_A +``` + +Use `scripts/prepare_sap_data.py` to rename SAP technical fields: + +```python +from scripts.prepare_sap_data import SAPDataPrep + +prep = SAPDataPrep() +df = prep.rename_sap_fields(df) # BUKRS → COMPANY_CODE, etc. +``` + +### Dataset Size +- Minimum: 50 training examples +- Recommended: 200-500 examples +- Maximum context: 8192 rows (GPU dependent) + +## Scripts + +- `scripts/rpt1_oss_predict.py` - Local model prediction wrapper +- `scripts/prepare_sap_data.py` - SAP field renaming and SQL templates +- `scripts/batch_predict.py` - Chunked processing for large datasets + +## Alternative: RPT Playground API + +For users with SAP access, the closed-source RPT-1 is available via API: + +```python +from scripts.rpt1_api import RPT1Client + +client = RPT1Client(token="YOUR_RPT_TOKEN") # Get from rpt-playground.sap.com +result = client.predict(data="data.csv", target_column="TARGET", task_type="classification") +``` + +See `references/api-reference.md` for RPT Playground API documentation. 
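
For step 5 of the workflow above (interpreting results), the raw `predict_proba` output is easiest to consume once each row is paired with its most likely label. Below is a minimal sketch in plain pandas/NumPy; it assumes the probability columns are ordered the same way as the model's class list (the usual scikit-learn-style convention), which is worth verifying against the installed version:

```python
import numpy as np
import pandas as pd

def label_predictions(probabilities, class_labels):
    """Attach the most likely class label and its probability to each row."""
    probs = np.asarray(probabilities, dtype=float)
    best = probs.argmax(axis=1)          # index of the highest-probability class per row
    return pd.DataFrame({
        "PREDICTED": [class_labels[i] for i in best],
        "CONFIDENCE": probs.max(axis=1),  # probability of the chosen class
    })

# Dummy probabilities for illustration; column order: ACTIVE, AT_RISK, CHURNED
probs = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.05, 0.15, 0.80],
]
result = label_predictions(probs, ["ACTIVE", "AT_RISK", "CHURNED"])
print(result)
```

In practice the dummy `probs` list would be replaced by the output of `clf.predict_proba(X_test)`, and the resulting frame can be joined back onto the original customer rows before export.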
+ +## Limitations + +- Tabular data only (no images, text documents) +- Requires labeled examples for in-context learning +- First prediction is slow (model loading) +- GPU strongly recommended for production use diff --git a/skills/sap-rpt1-oss-predictor/examples/customer_churn_sample.csv b/skills/sap-rpt1-oss-predictor/examples/customer_churn_sample.csv new file mode 100644 index 000000000..d609ab120 --- /dev/null +++ b/skills/sap-rpt1-oss-predictor/examples/customer_churn_sample.csv @@ -0,0 +1,24 @@ +CUSTOMER_NUMBER,CUSTOMER_NAME,COUNTRY,ACCOUNT_GROUP,CREDIT_LIMIT,ORDERS_LAST_12M,REVENUE_LAST_12M,DAYS_SINCE_LAST_ORDER,AVG_ORDER_VALUE,AVG_PAYMENT_DELAY,LATE_PAYMENTS_COUNT,CREDIT_UTILIZATION,CHURN_STATUS +100001,Acme Corporation,US,ZIND,75000,15,185000,12,12333,2,0,0.45,ACTIVE +100002,Beta Industries,DE,ZIND,50000,8,62000,45,7750,5,1,0.38,ACTIVE +100003,Gamma Solutions,UK,ZSME,25000,3,18000,120,6000,15,2,0.22,AT_RISK +100004,Delta Corp,FR,ZIND,100000,0,0,380,0,35,5,0.0,CHURNED +100005,Epsilon Ltd,US,ZSME,30000,12,95000,8,7917,1,0,0.52,ACTIVE +100006,Zeta GmbH,DE,ZIND,60000,2,15000,200,7500,22,3,0.18,AT_RISK +100007,Eta Partners,UK,ZKEY,150000,25,450000,5,18000,0,0,0.65,ACTIVE +100008,Theta Inc,US,ZSME,20000,0,0,450,0,45,4,0.0,CHURNED +100009,Iota Systems,JP,ZIND,80000,6,48000,90,8000,8,1,0.28,AT_RISK +100010,Kappa Tech,US,ZKEY,200000,30,620000,3,20667,1,0,0.72,ACTIVE +100011,Lambda Corp,DE,ZSME,35000,4,28000,150,7000,18,2,0.15,AT_RISK +100012,Mu Industries,UK,ZIND,45000,1,5000,280,5000,30,4,0.08,CHURNED +100013,Nu Solutions,FR,ZSME,25000,9,72000,25,8000,3,0,0.42,ACTIVE +100014,Xi Partners,US,ZIND,90000,0,0,400,0,40,6,0.0,CHURNED +100015,Omicron Ltd,DE,ZKEY,120000,18,215000,10,11944,2,0,0.58,ACTIVE +100016,Pi GmbH,AT,ZSME,28000,5,35000,100,7000,12,2,0.25,AT_RISK +100017,Rho Corp,US,ZIND,55000,11,88000,20,8000,4,1,0.48,ACTIVE +100018,Sigma Tech,UK,ZSME,32000,2,12000,220,6000,25,3,0.12,AT_RISK +100019,Tau Inc,JP,ZIND,70000,0,0,365,0,38,5,0.0,CHURNED +100020,Upsilon 
Ltd,US,ZKEY,180000,22,380000,7,17273,1,0,0.62,ACTIVE +100021,NewCustomer1,DE,ZSME,40000,3,24000,85,8000,10,1,0.20,[PREDICT] +100022,NewCustomer2,US,ZIND,65000,7,52000,60,7429,6,1,0.32,[PREDICT] +100023,NewCustomer3,UK,ZKEY,95000,1,8000,180,8000,20,2,0.10,[PREDICT] diff --git a/skills/sap-rpt1-oss-predictor/examples/payment_default_sample.csv b/skills/sap-rpt1-oss-predictor/examples/payment_default_sample.csv new file mode 100644 index 000000000..9fb89095d --- /dev/null +++ b/skills/sap-rpt1-oss-predictor/examples/payment_default_sample.csv @@ -0,0 +1,24 @@ +CUSTOMER_NUMBER,COMPANY_CODE,DOCUMENT_NUMBER,FISCAL_YEAR,INVOICE_AMOUNT,CURRENCY,PAYMENT_TERMS_DAYS,CREDIT_LIMIT,OUTSTANDING_BALANCE,HIST_AVG_DELAY,HIST_SEVERE_DELAYS,CUSTOMER_AGE_DAYS,INDUSTRY_CODE,PAYMENT_STATUS +100001,1000,5000001,2025,15000,USD,30,75000,25000,2,0,1825,MANUFACTURING,PAID +100002,1000,5000002,2025,8500,EUR,45,50000,18000,8,0,1460,RETAIL,PAID +100003,1000,5000003,2025,22000,GBP,30,25000,24000,25,2,730,SERVICES,DEFAULT +100004,1000,5000004,2025,5000,USD,60,100000,5000,1,0,2190,MANUFACTURING,PAID +100005,1000,5000005,2025,12000,USD,30,30000,28000,18,1,1095,RETAIL,DEFAULT +100006,2000,5000006,2025,45000,EUR,45,60000,52000,12,1,1825,WHOLESALE,DEFAULT +100007,1000,5000007,2025,3500,GBP,30,150000,15000,0,0,2555,TECHNOLOGY,PAID +100008,2000,5000008,2025,18000,USD,60,20000,19500,35,3,365,SERVICES,DEFAULT +100009,1000,5000009,2025,9000,JPY,45,80000,22000,5,0,1460,MANUFACTURING,PAID +100010,1000,5000010,2025,28000,USD,30,200000,45000,1,0,3285,TECHNOLOGY,PAID +100011,2000,5000011,2025,6500,EUR,45,35000,32000,22,2,730,RETAIL,DEFAULT +100012,1000,5000012,2025,11000,GBP,30,45000,8000,3,0,1825,WHOLESALE,PAID +100013,1000,5000013,2025,7500,USD,60,25000,12000,6,0,1095,SERVICES,PAID +100014,2000,5000014,2025,35000,EUR,30,90000,88000,42,4,1460,MANUFACTURING,DEFAULT +100015,1000,5000015,2025,4200,USD,45,120000,18000,2,0,2190,TECHNOLOGY,PAID 
+100016,1000,5000016,2025,16000,CHF,30,28000,26500,15,1,730,RETAIL,DEFAULT +100017,2000,5000017,2025,9800,USD,60,55000,14000,4,0,1825,WHOLESALE,PAID +100018,1000,5000018,2025,21000,EUR,45,32000,30000,28,2,365,SERVICES,DEFAULT +100019,1000,5000019,2025,5500,GBP,30,70000,8500,1,0,2555,MANUFACTURING,PAID +100020,2000,5000020,2025,13500,USD,30,180000,25000,2,0,1460,TECHNOLOGY,PAID +100021,1000,5000021,2025,19000,EUR,45,40000,38000,20,2,1095,RETAIL,[PREDICT] +100022,2000,5000022,2025,8000,USD,30,65000,15000,7,0,1825,SERVICES,[PREDICT] +100023,1000,5000023,2025,32000,GBP,60,95000,90000,30,3,730,WHOLESALE,[PREDICT] diff --git a/skills/sap-rpt1-oss-predictor/references/api-reference.md b/skills/sap-rpt1-oss-predictor/references/api-reference.md new file mode 100644 index 000000000..c8e6a56a8 --- /dev/null +++ b/skills/sap-rpt1-oss-predictor/references/api-reference.md @@ -0,0 +1,336 @@ +# SAP-RPT-1-OSS API Reference + +Complete documentation for the open source SAP-RPT-1-OSS model and optional RPT Playground API. + +## Table of Contents + +1. [OSS Model Setup](#oss-model-setup) +2. [OSS Model API](#oss-model-api) +3. [Hardware Configuration](#hardware-configuration) +4. [Alternative: RPT Playground API](#alternative-rpt-playground-api) +5. [Error Handling](#error-handling) + +--- + +## OSS Model Setup + +### Step 1: Install Package + +```bash +# From GitHub (recommended) +pip install git+https://github.com/SAP-samples/sap-rpt-1-oss + +# Or clone and install locally +git clone https://github.com/SAP-samples/sap-rpt-1-oss.git +cd sap-rpt-1-oss +pip install -e . 
+``` + +### Step 2: Hugging Face Authentication + +Model weights are hosted on Hugging Face and require authentication: + +```bash +# Install Hugging Face CLI +pip install huggingface_hub + +# Login interactively (recommended) +huggingface-cli login + +# Or set environment variable +export HF_TOKEN="hf_your_token_here" +``` + +**Important**: Accept the model license at https://huggingface.co/SAP/sap-rpt-1-oss + +### Step 3: Verify Installation + +```python +# Test import +from sap_rpt_oss import SAP_RPT_OSS_Classifier, SAP_RPT_OSS_Regressor + +# Test model loading (downloads weights on first run ~65MB) +clf = SAP_RPT_OSS_Classifier(max_context_size=1024, bagging=1) +print("✅ SAP-RPT-1-OSS ready!") +``` + +--- + +## OSS Model API + +### Classification + +```python +from sap_rpt_oss import SAP_RPT_OSS_Classifier +import pandas as pd + +# Load data +df = pd.read_csv("sap_customers.csv") +X = df.drop(columns=["CHURN_STATUS"]) +y = df["CHURN_STATUS"] + +# Split +X_train, X_test = X[:400], X[400:] +y_train, y_test = y[:400], y[400:] + +# Initialize classifier +clf = SAP_RPT_OSS_Classifier( + max_context_size=4096, # Context window (rows of context) + bagging=4 # Ensemble size for better accuracy +) + +# Fit on training data +clf.fit(X_train, y_train) + +# Predict labels +predictions = clf.predict(X_test) + +# Predict probabilities (for confidence scores) +probabilities = clf.predict_proba(X_test) +``` + +### Regression + +```python +from sap_rpt_oss import SAP_RPT_OSS_Regressor + +# Initialize regressor +reg = SAP_RPT_OSS_Regressor( + max_context_size=4096, + bagging=4 +) + +# Fit and predict +reg.fit(X_train, y_train) +predictions = reg.predict(X_test) +``` + +### Input Formats + +```python +import pandas as pd +import numpy as np + +# DataFrame input (recommended - preserves column names for semantics) +X_train = pd.DataFrame({ + "CUSTOMER_CREDIT_LIMIT": [50000, 75000, 30000], + "DAYS_SINCE_LAST_ORDER": [10, 45, 180], + "PAYMENT_DELAY_AVG": [2, 5, 25] +}) +y_train = 
pd.Series(["ACTIVE", "ACTIVE", "AT_RISK"]) + +# NumPy array input (loses column semantics) +X_train = np.array([[50000, 10, 2], [75000, 45, 5], [30000, 180, 25]]) +y_train = np.array(["ACTIVE", "ACTIVE", "AT_RISK"]) +``` + +### Parameters + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| max_context_size | int | 8192 | Maximum rows in context window | +| bagging | int | 8 | Ensemble size (higher = better accuracy, more memory) | + +### Using the Skill Scripts + +```python +from scripts.rpt1_oss_predict import predict_classification, predict_regression + +# Classification +result = predict_classification( + train_data="train.csv", + test_data="test.csv", + target_column="CHURN_STATUS", + max_context_size=4096, + bagging=4 +) +print(result["predictions"]) +print(result["probabilities"]) + +# Regression +result = predict_regression( + train_data="deliveries_train.csv", + test_data="deliveries_test.csv", + target_column="DELAY_DAYS" +) +print(result["predictions"]) +``` + +--- + +## Hardware Configuration + +### GPU Requirements + +| Config | GPU Memory | Context Size | Bagging | Accuracy | Speed | +|--------|------------|--------------|---------|----------|-------| +| Optimal | 80GB (A100) | 8192 | 8 | Best | Fast | +| Standard | 40GB (A6000) | 4096 | 4 | Good | Good | +| Minimal | 24GB (RTX 4090) | 2048 | 2 | OK | OK | +| CPU | N/A | 1024 | 1 | OK | Slow | + +### Auto-Detection + +```python +from scripts.rpt1_oss_predict import get_optimal_config + +config = get_optimal_config() +# Returns: {"max_context_size": 4096, "bagging": 4, "device": "cuda"} +``` + +### Memory Management for Large Datasets + +```python +# Reduce memory footprint +clf = SAP_RPT_OSS_Classifier( + max_context_size=2048, # Smaller context + bagging=1 # No ensemble +) + +# Process in chunks +chunk_size = 100 +all_preds = [] +for i in range(0, len(X_test), chunk_size): + chunk = X_test.iloc[i:i+chunk_size] + preds = clf.predict(chunk) + 
all_preds.extend(preds) +``` + +--- + +## Alternative: RPT Playground API + +For users with SAP access, the closed-source RPT-1 is available via hosted API. + +### Authentication + +1. Navigate to https://rpt-playground.sap.com +2. Log in with SAP credentials +3. Scroll to bottom of page, copy API token + +```bash +export RPT_TOKEN="your-token-here" +``` + +### Usage + +```python +from scripts.rpt1_api import RPT1Client + +client = RPT1Client() # Uses RPT_TOKEN env var +# Or: client = RPT1Client(token="your-token") + +result = client.predict( + data="sap_export.csv", + target_column="CHURN_STATUS", + task_type="classification", + model_version="accurate" # or "fast" +) + +print(result["predictions"]) +print(result["probabilities"]) +``` + +### API Endpoints + +| Endpoint | Method | Description | +|----------|--------|-------------| +| /predict | POST | Single prediction request | +| /batch-predict | POST | Separate train/predict sets | +| /health | GET | Health check | + +### Rate Limits + +| Tier | Requests/Hour | Max Rows | +|------|---------------|----------| +| Free | 100 | 1000 | +| Standard | 1000 | 5000 | + +--- + +## Error Handling + +### OSS Model Errors + +```python +from sap_rpt_oss import SAP_RPT_OSS_Classifier + +try: + clf = SAP_RPT_OSS_Classifier(max_context_size=4096) + clf.fit(X_train, y_train) + predictions = clf.predict(X_test) +except ImportError: + print("Install: pip install git+https://github.com/SAP-samples/sap-rpt-1-oss") +except OSError as e: + if "token" in str(e).lower(): + print("Run: huggingface-cli login") + else: + raise +except RuntimeError as e: + if "out of memory" in str(e).lower(): + print("Reduce max_context_size or bagging parameter") + else: + raise +``` + +### Data Validation + +```python +def validate_data(df, target_column): + errors = [] + + if target_column not in df.columns: + errors.append(f"Target column '{target_column}' not found") + + if len(df) < 50: + errors.append("Minimum 50 training examples 
recommended")
+
+    if df.isna().all().any():
+        errors.append("Some columns are entirely empty")
+
+    # Check for semantic column names
+    generic_names = [c for c in df.columns if c.upper() in ["COL1", "VALUE", "DATA", "FIELD"]]
+    if generic_names:
+        errors.append(f"Use descriptive column names instead of: {generic_names}")
+
+    return errors
+```
+
+---
+
+## Complete Workflow Example
+
+```python
+import pandas as pd
+from sap_rpt_oss import SAP_RPT_OSS_Classifier
+from scripts.prepare_sap_data import SAPDataPrep
+
+# 1. Load and prepare SAP data
+prep = SAPDataPrep()
+df = pd.read_csv("sap_export.csv")
+df = prep.rename_sap_fields(df)  # BUKRS → COMPANY_CODE, etc.
+
+# 2. Split data
+target = "PAYMENT_STATUS"
+train_df = df.iloc[:400]
+test_df = df.iloc[400:].copy()  # copy so new columns can be added without slice warnings
+
+X_train = train_df.drop(columns=[target])
+y_train = train_df[target]
+X_test = test_df.drop(columns=[target])
+
+# 3. Run prediction
+clf = SAP_RPT_OSS_Classifier(max_context_size=4096, bagging=4)
+clf.fit(X_train, y_train)
+
+predictions = clf.predict(X_test)
+probabilities = clf.predict_proba(X_test)
+
+# 4. Add results to test data
+test_df["PREDICTED"] = predictions
+test_df["CONFIDENCE"] = [max(p) for p in probabilities]
+
+# 5. Save output
+test_df.to_csv("predictions.csv", index=False)
+print(f"✅ Saved {len(predictions)} predictions")
+```
diff --git a/skills/sap-rpt1-oss-predictor/references/sap-use-cases.md b/skills/sap-rpt1-oss-predictor/references/sap-use-cases.md
new file mode 100644
index 000000000..5b7b27ba8
--- /dev/null
+++ b/skills/sap-rpt1-oss-predictor/references/sap-use-cases.md
@@ -0,0 +1,302 @@
+# SAP Use Cases for RPT-1 Predictions
+
+Detailed guide for common SAP prediction scenarios using RPT-1.
+
+## Table of Contents
+
+1. [Customer Churn Prediction (SD/CRM)](#customer-churn-prediction)
+2. [Payment Default Risk (FI-AR)](#payment-default-risk)
+3. [Delivery Delay Forecasting (SD/LE)](#delivery-delay-forecasting)
+4. 
[Journal Entry Anomaly Detection (FI-GL)](#journal-entry-anomaly-detection) +5. [Vendor Performance Scoring (MM)](#vendor-performance-scoring) +6. [Demand Forecasting (PP/MM)](#demand-forecasting) + +--- + +## Customer Churn Prediction + +**Module**: SD, CRM, FI-AR +**Task Type**: Classification +**Target Classes**: ACTIVE, AT_RISK, CHURNED + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| KNA1 | Customer Master (General) | KUNNR, NAME1, LAND1, KTOKD | +| KNB1 | Customer Master (Company) | KLIMK, SKFOR | +| VBAK | Sales Order Header | VBELN, KUNNR, NETWR, ERDAT | +| VBRK | Billing Document Header | Revenue history | +| BSID/BSAD | Open/Cleared Items | Payment behavior | + +### Feature Engineering + +``` +ORDERS_LAST_12M - Count of orders in last year +REVENUE_LAST_12M - Total revenue in last year +DAYS_SINCE_LAST_ORDER - Recency indicator +AVG_ORDER_VALUE - Revenue / Order count +PAYMENT_DELAY_AVG - Average days late on payments +CREDIT_UTILIZATION - Outstanding / Credit Limit +PRODUCT_DIVERSITY - Distinct materials ordered +``` + +### Sample Dataset Structure + +```csv +CUSTOMER_NUMBER,CUSTOMER_NAME,COUNTRY,CREDIT_LIMIT,ORDERS_LAST_12M,REVENUE_LAST_12M,DAYS_SINCE_LAST_ORDER,AVG_PAYMENT_DELAY,CHURN_STATUS +100001,Acme Corp,US,50000,12,125000,15,3,ACTIVE +100002,Beta Inc,DE,30000,2,8000,180,25,AT_RISK +100003,Gamma Ltd,UK,25000,0,0,400,45,CHURNED +``` + +### Prediction Example + +```python +from sap_rpt_oss import SAP_RPT_OSS_Classifier +import pandas as pd + +df = pd.read_csv("customer_churn_data.csv") +X = df.drop(columns=["CHURN_STATUS"]) +y = df["CHURN_STATUS"] + +# Use first 80% as training, last 20% for prediction +split = int(len(df) * 0.8) +X_train, X_test = X[:split], X[split:] +y_train = y[:split] + +clf = SAP_RPT_OSS_Classifier(max_context_size=4096, bagging=4) +clf.fit(X_train, y_train) +predictions = clf.predict(X_test) +# Returns: predictions for ACTIVE/AT_RISK/CHURNED +``` + +--- + +## Payment 
Default Risk + +**Module**: FI-AR +**Task Type**: Classification or Regression +**Target**: DEFAULT/PAID (classification) or DELAY_DAYS (regression) + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| BSID | Open AR Items | KUNNR, BELNR, WRBTR, ZFBDT, ZBD1T | +| BSAD | Cleared AR Items | Historical payment data | +| KNB1 | Customer Credit | KLIMK, SKFOR, CTLPC | +| KNKK | Credit Control | Credit exposure data | + +### Feature Engineering + +``` +INVOICE_AMOUNT - Document amount +PAYMENT_TERMS_DAYS - Agreed payment terms +CREDIT_LIMIT - Customer credit limit +CREDIT_UTILIZATION - Current exposure / limit +HIST_AVG_DELAY - Historical average delay +HIST_LATE_COUNT - Count of past late payments +CUSTOMER_AGE_DAYS - Days since first transaction +INDUSTRY_CODE - Customer industry segment +``` + +### Risk Scoring Approach + +For regression (predict delay days): +```python +from sap_rpt_oss import SAP_RPT_OSS_Regressor + +reg = SAP_RPT_OSS_Regressor(max_context_size=4096, bagging=4) +reg.fit(X_train, y_train) # y_train contains delay days +predictions = reg.predict(X_test) +``` + +For classification (predict default): +```python +from sap_rpt_oss import SAP_RPT_OSS_Classifier + +clf = SAP_RPT_OSS_Classifier(max_context_size=4096, bagging=4) +clf.fit(X_train, y_train) # y_train contains DEFAULT or PAID +predictions = clf.predict(X_test) +``` + +--- + +## Delivery Delay Forecasting + +**Module**: SD, LE (Logistics Execution) +**Task Type**: Classification or Regression +**Target**: DELAY_CATEGORY or DELAY_DAYS + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| LIKP | Delivery Header | VBELN, LFDAT, WADAT_IST, VSTEL, ROUTE | +| LIPS | Delivery Items | MATNR, LFIMG, VGBEL | +| VBAK | Sales Order | AUART, VKORG, KUNNR | +| VTTK | Shipment Header | Carrier, transport info | + +### Feature Engineering + +``` +SHIPPING_POINT - Origin warehouse +ROUTE - Delivery route 
+CARRIER - Transport provider +TOTAL_QUANTITY - Sum of delivery quantities +DISTINCT_MATERIALS - Number of different products +ORDER_TYPE - Sales order type +CUSTOMER_PRIORITY - Customer importance rating +HISTORICAL_ROUTE_DELAY - Avg delay for this route +SEASON - Month/quarter indicator +``` + +### Delay Categories + +``` +ON_TIME - Delivered on or before planned date +MINOR_DELAY - 1-3 days late +MODERATE_DELAY - 4-7 days late +SEVERE_DELAY - 8+ days late +``` + +--- + +## Journal Entry Anomaly Detection + +**Module**: FI-GL (S/4HANA) +**Task Type**: Classification +**Target**: NORMAL, SUSPICIOUS, ANOMALY + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| ACDOCA | Universal Journal | BELNR, RACCT, HSL, BUDAT, USNAM | +| BKPF | Document Header | BLART, CPUDT, CPUTM | +| SKA1 | GL Account Master | Account type, category | + +### Feature Engineering + +``` +AMOUNT_ABS - Absolute posting amount +GL_ACCOUNT - Account number +COST_CENTER - Cost center +DOCUMENT_TYPE - FI document type +ENTRY_HOUR - Hour of day posted +ENTRY_DAY_OF_WEEK - Day of week posted +USER_NAME - Posted by user +IS_MONTH_END - Posted in last 3 days of month +IS_ROUND_NUMBER - Amount ends in 000 +REVERSAL_FLAG - Is this a reversal entry +``` + +### Anomaly Indicators to Train On + +- Unusual posting times (nights, weekends) +- Round number amounts +- Unusual account combinations +- Deviations from historical patterns +- Manual entries vs automated +- User posting outside normal scope + +--- + +## Vendor Performance Scoring + +**Module**: MM +**Task Type**: Regression (score) or Classification (tier) +**Target**: PERFORMANCE_SCORE (0-100) or TIER (A/B/C/D) + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| LFA1 | Vendor Master | LIFNR, NAME1, LAND1 | +| EKKO | PO Header | EBELN, LIFNR, BEDAT | +| EKPO | PO Item | NETPR, MENGE | +| EBAN | Purchase Requisition | Lead time data | +| EKBE 
| PO History | GR/IR data for delivery performance | + +### Feature Engineering + +``` +ON_TIME_DELIVERY_RATE - % deliveries on time +QUALITY_REJECTION_RATE - % rejected for quality +PRICE_VARIANCE - Actual vs quoted price +RESPONSE_TIME_AVG - Days to confirm PO +INVOICE_ACCURACY - % invoices without errors +TOTAL_PO_VALUE - Total purchase volume +RELATIONSHIP_LENGTH - Days since first PO +COMPLAINT_COUNT - Number of vendor complaints +``` + +--- + +## Demand Forecasting + +**Module**: PP, MM +**Task Type**: Regression +**Target**: FORECAST_QUANTITY + +### SAP Tables Required + +| Table | Description | Key Fields | +|-------|-------------|------------| +| MSEG | Material Document | MATNR, MENGE, BUDAT | +| VBRP | Billing Items | Sales history | +| MARA | Material Master | Material attributes | +| MARC | Plant Data | MRP settings | + +### Feature Engineering + +``` +MATERIAL_NUMBER - Product identifier +PLANT - Location +HISTORICAL_QTY_M1 - Last month quantity +HISTORICAL_QTY_M2 - 2 months ago +HISTORICAL_QTY_M3 - 3 months ago +HISTORICAL_QTY_Y1 - Same month last year +SEASONALITY_INDEX - Seasonal adjustment factor +PROMOTION_FLAG - Marketing activity indicator +PRICE_CHANGE_FLAG - Recent price change +MATERIAL_GROUP - Product category +``` + +### Time Series Approach + +Structure data with lag features: +```csv +MATERIAL,PLANT,MONTH,QTY_M1,QTY_M2,QTY_M3,QTY_Y1,SEASON,FORECAST_QTY +MAT001,1000,2024-01,150,140,160,145,1.05,155 +MAT001,1000,2024-02,155,150,140,148,0.98,[PREDICT] +``` + +--- + +## Best Practices + +### Data Quality +- Remove duplicates before prediction +- Handle NULL values explicitly +- Standardize date formats +- Use consistent units + +### Feature Selection +- Include business-meaningful columns +- Use descriptive column names (RPT-1 uses semantics) +- Include historical context features +- Balance feature count (10-50 columns optimal) + +### Training Data +- Minimum 50 labeled examples +- Balanced class distribution for classification +- Recent 
data preferred (last 1-2 years) +- Include edge cases and exceptions + +### Validation +- Hold out 20% for testing +- Compare predictions vs actual outcomes +- Monitor prediction confidence scores +- Retrain periodically as patterns change diff --git a/skills/sap-rpt1-oss-predictor/scripts/batch_predict.py b/skills/sap-rpt1-oss-predictor/scripts/batch_predict.py new file mode 100644 index 000000000..2b5305528 --- /dev/null +++ b/skills/sap-rpt1-oss-predictor/scripts/batch_predict.py @@ -0,0 +1,203 @@ +#!/usr/bin/env python3 +""" +Batch Prediction Script for SAP-RPT-1-OSS + +Process large SAP datasets in batches using the local OSS model. +Handles chunking, memory management, and result aggregation. + +Usage: + python batch_predict.py train.csv test.csv target_column output.csv --task classification +""" + +import argparse +import time +import pandas as pd +import numpy as np +from pathlib import Path +from typing import Optional, Literal, Union + +# Check for sap_rpt_oss +try: + from sap_rpt_oss import SAP_RPT_OSS_Classifier, SAP_RPT_OSS_Regressor + RPT_OSS_AVAILABLE = True +except ImportError: + RPT_OSS_AVAILABLE = False + + +def get_optimal_config() -> dict: + """Detect GPU and return optimal configuration.""" + try: + import torch + if torch.cuda.is_available(): + gpu_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3) + if gpu_mem >= 80: + return {"max_context_size": 8192, "bagging": 8} + elif gpu_mem >= 40: + return {"max_context_size": 4096, "bagging": 4} + elif gpu_mem >= 24: + return {"max_context_size": 2048, "bagging": 2} + else: + return {"max_context_size": 1024, "bagging": 1} + except: + pass + return {"max_context_size": 1024, "bagging": 1} + + +def batch_predict_oss( + train_file: str, + test_file: str, + target_column: str, + output_file: str, + task_type: Literal["classification", "regression"] = "classification", + chunk_size: int = 100, + max_context_size: Optional[int] = None, + bagging: Optional[int] = None +) -> pd.DataFrame: + 
""" + Run batch predictions using SAP-RPT-1-OSS local model. + + Args: + train_file: Path to training CSV (with labels) + test_file: Path to test CSV (to predict) + target_column: Column to predict + output_file: Path for output CSV + task_type: "classification" or "regression" + chunk_size: Rows per prediction batch + max_context_size: Override auto-detected context size + bagging: Override auto-detected bagging + + Returns: + DataFrame with predictions + """ + if not RPT_OSS_AVAILABLE: + raise ImportError( + "sap_rpt_oss not installed.\n" + "Install with: pip install git+https://github.com/SAP-samples/sap-rpt-1-oss\n" + "Then login: huggingface-cli login" + ) + + # Load data + train_df = pd.read_csv(train_file) + test_df = pd.read_csv(test_file) + + print(f"📂 Loaded {len(train_df)} training rows, {len(test_df)} test rows") + + if target_column not in train_df.columns: + raise ValueError(f"Target '{target_column}' not in training data") + + # Prepare X, y + X_train = train_df.drop(columns=[target_column]) + y_train = train_df[target_column] + + # Remove target from test if present + if target_column in test_df.columns: + X_test = test_df.drop(columns=[target_column]) + else: + X_test = test_df.copy() + + # Get config + config = get_optimal_config() + max_context_size = max_context_size or config["max_context_size"] + bagging = bagging or config["bagging"] + + print(f"🔧 Config: context={max_context_size}, bagging={bagging}") + + # Initialize model + print("🚀 Loading SAP-RPT-1-OSS model...") + if task_type == "classification": + model = SAP_RPT_OSS_Classifier(max_context_size=max_context_size, bagging=bagging) + else: + model = SAP_RPT_OSS_Regressor(max_context_size=max_context_size, bagging=bagging) + + # Fit model + print("📈 Fitting model on training data...") + model.fit(X_train, y_train) + + # Predict in chunks + print(f"🔮 Predicting {len(X_test)} rows in chunks of {chunk_size}...") + all_predictions = [] + all_probabilities = [] + + n_chunks = 
(len(X_test) + chunk_size - 1) // chunk_size + + for i in range(0, len(X_test), chunk_size): + chunk_idx = i // chunk_size + 1 + chunk = X_test.iloc[i:i + chunk_size] + + print(f" Processing chunk {chunk_idx}/{n_chunks}...", end=" ") + start_time = time.time() + + preds = model.predict(chunk) + all_predictions.extend(preds.tolist() if hasattr(preds, 'tolist') else list(preds)) + + if task_type == "classification" and hasattr(model, 'predict_proba'): + probs = model.predict_proba(chunk) + all_probabilities.extend(probs.tolist() if hasattr(probs, 'tolist') else list(probs)) + + elapsed = time.time() - start_time + print(f"✓ ({elapsed:.1f}s)") + + # Add predictions to test data + result_df = test_df.copy() + result_df[f"{target_column}_PREDICTED"] = all_predictions + + if all_probabilities: + result_df[f"{target_column}_CONFIDENCE"] = [ + max(p) if isinstance(p, (list, np.ndarray)) else p + for p in all_probabilities + ] + + # Save results + result_df.to_csv(output_file, index=False) + print(f"\n✅ Saved {len(all_predictions)} predictions to: {output_file}") + + return result_df + + +def main(): + parser = argparse.ArgumentParser( + description="Batch prediction using SAP-RPT-1-OSS" + ) + parser.add_argument("train_file", help="Training CSV with labels") + parser.add_argument("test_file", help="Test CSV to predict") + parser.add_argument("target_column", help="Column to predict") + parser.add_argument("output_file", help="Output CSV path") + parser.add_argument( + "--task", + choices=["classification", "regression"], + default="classification", + help="Task type (default: classification)" + ) + parser.add_argument( + "--chunk-size", + type=int, + default=100, + help="Rows per batch (default: 100)" + ) + parser.add_argument( + "--context-size", + type=int, + help="Context window size (auto-detected if not set)" + ) + parser.add_argument( + "--bagging", + type=int, + help="Bagging ensemble size (auto-detected if not set)" + ) + + args = parser.parse_args() + + 
batch_predict_oss( + train_file=args.train_file, + test_file=args.test_file, + target_column=args.target_column, + output_file=args.output_file, + task_type=args.task, + chunk_size=args.chunk_size, + max_context_size=args.context_size, + bagging=args.bagging + ) + + +if __name__ == "__main__": + main() diff --git a/skills/sap-rpt1-oss-predictor/scripts/prepare_sap_data.py b/skills/sap-rpt1-oss-predictor/scripts/prepare_sap_data.py new file mode 100644 index 000000000..7f38b97a6 --- /dev/null +++ b/skills/sap-rpt1-oss-predictor/scripts/prepare_sap_data.py @@ -0,0 +1,406 @@ +#!/usr/bin/env python3 +""" +SAP Data Preparation Utilities for RPT-1 + +Utilities for extracting and preparing SAP data for RPT-1 predictions. +Includes SQL templates for common SAP tables and data transformation functions. + +Usage: + from prepare_sap_data import SAPDataPrep + + prep = SAPDataPrep() + df = prep.prepare_for_prediction( + data="sap_export.csv", + target_column="PAYMENT_DEFAULT", + prediction_rows=[100, 101, 102] # Row indices to predict + ) +""" + +import pandas as pd +import numpy as np +from typing import List, Optional, Union, Dict +from pathlib import Path +from datetime import datetime, timedelta + + +class SAPDataPrep: + """Utilities for preparing SAP data for RPT-1 predictions.""" + + # Common SAP date formats + SAP_DATE_FORMATS = ["%Y%m%d", "%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"] + + # SAP field semantic mappings for better RPT-1 understanding + FIELD_DESCRIPTIONS = { + # FI Fields + "BUKRS": "COMPANY_CODE", + "GJAHR": "FISCAL_YEAR", + "BELNR": "DOCUMENT_NUMBER", + "BUZEI": "LINE_ITEM", + "DMBTR": "AMOUNT_LOCAL_CURRENCY", + "WRBTR": "AMOUNT_DOC_CURRENCY", + "SHKZG": "DEBIT_CREDIT_INDICATOR", + "ZFBDT": "BASELINE_DATE", + "ZBD1T": "PAYMENT_TERMS_DAYS", + "SGTXT": "ITEM_TEXT", + + # SD Fields + "VBELN": "SALES_ORDER_NUMBER", + "POSNR": "ITEM_NUMBER", + "MATNR": "MATERIAL_NUMBER", + "KWMENG": "ORDER_QUANTITY", + "NETWR": "NET_VALUE", + "WAERK": "CURRENCY", + "ERDAT": 
"CREATED_DATE", + "LFDAT": "DELIVERY_DATE", + "KUNNR": "CUSTOMER_NUMBER", + + # MM Fields + "EBELN": "PURCHASE_ORDER", + "EBELP": "PO_ITEM", + "LIFNR": "VENDOR_NUMBER", + "MENGE": "QUANTITY", + "MEINS": "UNIT", + "NETPR": "NET_PRICE", + "EINDT": "DELIVERY_DATE_REQUESTED", + + # Master Data + "NAME1": "NAME", + "LAND1": "COUNTRY", + "ORT01": "CITY", + "PSTLZ": "POSTAL_CODE", + "KTOKD": "CUSTOMER_ACCOUNT_GROUP", + "KLIMK": "CREDIT_LIMIT", + } + + def __init__(self): + """Initialize SAP data preparation utilities.""" + pass + + def rename_sap_fields(self, df: pd.DataFrame, custom_mappings: Dict[str, str] = None) -> pd.DataFrame: + """ + Rename SAP technical field names to semantic descriptions. + RPT-1 performs better with descriptive column names. + + Args: + df: DataFrame with SAP field names + custom_mappings: Additional field mappings to apply + + Returns: + DataFrame with renamed columns + """ + mappings = self.FIELD_DESCRIPTIONS.copy() + if custom_mappings: + mappings.update(custom_mappings) + + # Only rename columns that exist in dataframe + rename_dict = {k: v for k, v in mappings.items() if k in df.columns} + return df.rename(columns=rename_dict) + + def parse_sap_dates(self, df: pd.DataFrame, date_columns: List[str]) -> pd.DataFrame: + """ + Parse SAP date formats to standard datetime. + + Args: + df: DataFrame with SAP dates + date_columns: List of column names containing dates + + Returns: + DataFrame with parsed dates + """ + df = df.copy() + for col in date_columns: + if col not in df.columns: + continue + + for fmt in self.SAP_DATE_FORMATS: + try: + df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce") + if df[col].notna().any(): + break + except Exception: + continue + + return df + + def calculate_derived_features( + self, + df: pd.DataFrame, + date_column: str, + reference_date: Optional[datetime] = None + ) -> pd.DataFrame: + """ + Calculate derived features useful for predictions. 
+
+        Args:
+            df: DataFrame with parsed dates
+            date_column: Name of date column for calculations
+            reference_date: Reference date for age calculations (default: today)
+
+        Returns:
+            DataFrame with additional derived columns
+        """
+        df = df.copy()
+        reference_date = reference_date or datetime.now()
+
+        if date_column in df.columns and pd.api.types.is_datetime64_any_dtype(df[date_column]):
+            df[f"DAYS_SINCE_{date_column}"] = (reference_date - df[date_column]).dt.days
+            df[f"MONTH_OF_{date_column}"] = df[date_column].dt.month
+            df[f"QUARTER_OF_{date_column}"] = df[date_column].dt.quarter
+            df[f"YEAR_OF_{date_column}"] = df[date_column].dt.year
+            df[f"DAY_OF_WEEK_{date_column}"] = df[date_column].dt.dayofweek
+
+        return df
+
+    def prepare_for_prediction(
+        self,
+        data: Union[str, Path, pd.DataFrame],
+        target_column: str,
+        prediction_rows: Optional[List[int]] = None,
+        mask_value: str = "[PREDICT]"
+    ) -> pd.DataFrame:
+        """
+        Prepare dataset for RPT-1 prediction by masking target values.
+
+        Args:
+            data: CSV path or DataFrame
+            target_column: Column to predict
+            prediction_rows: Row positions to predict (default: last 10%)
+            mask_value: Placeholder for prediction (default: [PREDICT])
+
+        Returns:
+            DataFrame ready for RPT-1 with masked values
+        """
+        if isinstance(data, (str, Path)):
+            df = pd.read_csv(data)
+        else:
+            df = data.copy()
+
+        if target_column not in df.columns:
+            raise ValueError(f"Target column '{target_column}' not found")
+
+        # Default: predict last 10% of rows
+        if prediction_rows is None:
+            n_predict = max(1, len(df) // 10)
+            prediction_rows = list(range(len(df) - n_predict, len(df)))
+
+        # Mask target values for prediction rows; use iloc because
+        # prediction_rows are positions, which may differ from index
+        # labels when a DataFrame with a non-default index is passed
+        df[target_column] = df[target_column].astype(str)
+        df.iloc[prediction_rows, df.columns.get_loc(target_column)] = mask_value
+
+        return df
+
+    def split_train_predict(
+        self,
+        data: Union[str, Path, pd.DataFrame],
+        target_column: str,
+        train_ratio: float = 0.8
+    ) -> tuple:
+        """
+        Split data into training and prediction sets.
+ + Args: + data: CSV path or DataFrame + target_column: Column containing target values + train_ratio: Fraction of data for training (default: 0.8) + + Returns: + Tuple of (train_df, predict_df) + """ + if isinstance(data, (str, Path)): + df = pd.read_csv(data) + else: + df = data.copy() + + n_train = int(len(df) * train_ratio) + train_df = df.iloc[:n_train].copy() + predict_df = df.iloc[n_train:].copy() + + # Remove target values from prediction set + predict_df[target_column] = "[PREDICT]" + + return train_df, predict_df + + +# SQL Templates for SAP Data Extraction +SQL_TEMPLATES = { + "customer_churn": """ +-- Customer Churn Analysis Dataset +-- Extract from SAP SD/FI for churn prediction + +SELECT + kna1.KUNNR AS CUSTOMER_NUMBER, + kna1.NAME1 AS CUSTOMER_NAME, + kna1.LAND1 AS COUNTRY, + kna1.KTOKD AS ACCOUNT_GROUP, + knb1.KLIMK AS CREDIT_LIMIT, + + -- Order metrics (last 12 months) + COUNT(DISTINCT vbak.VBELN) AS ORDERS_LAST_12M, + SUM(vbak.NETWR) AS REVENUE_LAST_12M, + MAX(vbak.ERDAT) AS LAST_ORDER_DATE, + DATEDIFF(DAY, MAX(vbak.ERDAT), CURRENT_DATE) AS DAYS_SINCE_LAST_ORDER, + + -- Payment behavior + AVG(bsid.VERZN) AS AVG_PAYMENT_DELAY_DAYS, + COUNT(CASE WHEN bsid.VERZN > 30 THEN 1 END) AS LATE_PAYMENTS_COUNT, + + -- Target: Define based on business rules + CASE + WHEN DATEDIFF(DAY, MAX(vbak.ERDAT), CURRENT_DATE) > 365 THEN 'CHURNED' + WHEN DATEDIFF(DAY, MAX(vbak.ERDAT), CURRENT_DATE) > 180 THEN 'AT_RISK' + ELSE 'ACTIVE' + END AS CHURN_STATUS + +FROM KNA1 kna1 +LEFT JOIN KNB1 knb1 ON kna1.KUNNR = knb1.KUNNR +LEFT JOIN VBAK vbak ON kna1.KUNNR = vbak.KUNNR + AND vbak.ERDAT >= ADD_MONTHS(CURRENT_DATE, -12) +LEFT JOIN BSID bsid ON kna1.KUNNR = bsid.KUNNR + +GROUP BY kna1.KUNNR, kna1.NAME1, kna1.LAND1, kna1.KTOKD, knb1.KLIMK +""", + + "payment_default": """ +-- Payment Default Prediction Dataset +-- Extract from SAP FI-AR + +SELECT + bsid.KUNNR AS CUSTOMER_NUMBER, + bsid.BUKRS AS COMPANY_CODE, + bsid.BELNR AS DOCUMENT_NUMBER, + bsid.GJAHR AS FISCAL_YEAR, + 
bsid.WRBTR AS INVOICE_AMOUNT, + bsid.WAERS AS CURRENCY, + bsid.ZFBDT AS BASELINE_DATE, + bsid.ZBD1T AS PAYMENT_TERMS_DAYS, + + -- Customer credit info + knb1.KLIMK AS CREDIT_LIMIT, + knb1.SKFOR AS OUTSTANDING_BALANCE, + + -- Historical payment behavior + (SELECT AVG(VERZN) FROM BSAD WHERE KUNNR = bsid.KUNNR) AS HIST_AVG_DELAY, + (SELECT COUNT(*) FROM BSAD WHERE KUNNR = bsid.KUNNR AND VERZN > 60) AS HIST_SEVERE_DELAYS, + + -- Target: Payment default indicator + CASE + WHEN bsad.AUGDT IS NULL AND DATEDIFF(DAY, bsid.ZFBDT + bsid.ZBD1T, CURRENT_DATE) > 90 + THEN 'DEFAULT' + ELSE 'PAID' + END AS PAYMENT_STATUS + +FROM BSID bsid +LEFT JOIN BSAD bsad ON bsid.BUKRS = bsad.BUKRS AND bsid.BELNR = bsad.BELNR +LEFT JOIN KNB1 knb1 ON bsid.KUNNR = knb1.KUNNR AND bsid.BUKRS = knb1.BUKRS + +WHERE bsid.KOART = 'D' -- Customer items only +""", + + "delivery_delay": """ +-- Delivery Delay Prediction Dataset +-- Extract from SAP SD + +SELECT + likp.VBELN AS DELIVERY_NUMBER, + likp.KUNNR AS CUSTOMER_NUMBER, + likp.LFDAT AS PLANNED_DELIVERY_DATE, + likp.WADAT_IST AS ACTUAL_DELIVERY_DATE, + likp.VSTEL AS SHIPPING_POINT, + likp.ROUTE AS ROUTE, + + -- Order details + vbak.AUART AS ORDER_TYPE, + vbak.VKORG AS SALES_ORG, + SUM(lips.LFIMG) AS TOTAL_QUANTITY, + COUNT(DISTINCT lips.MATNR) AS DISTINCT_MATERIALS, + + -- Carrier info + likp.TDLNR AS CARRIER, + + -- Target: Delay in days (for regression) or category (for classification) + DATEDIFF(DAY, likp.LFDAT, likp.WADAT_IST) AS DELAY_DAYS, + CASE + WHEN DATEDIFF(DAY, likp.LFDAT, likp.WADAT_IST) <= 0 THEN 'ON_TIME' + WHEN DATEDIFF(DAY, likp.LFDAT, likp.WADAT_IST) <= 3 THEN 'MINOR_DELAY' + WHEN DATEDIFF(DAY, likp.LFDAT, likp.WADAT_IST) <= 7 THEN 'MODERATE_DELAY' + ELSE 'SEVERE_DELAY' + END AS DELAY_CATEGORY + +FROM LIKP likp +JOIN LIPS lips ON likp.VBELN = lips.VBELN +JOIN VBAK vbak ON lips.VGBEL = vbak.VBELN + +WHERE likp.WADAT_IST IS NOT NULL -- Completed deliveries only + +GROUP BY likp.VBELN, likp.KUNNR, likp.LFDAT, likp.WADAT_IST, + 
likp.VSTEL, likp.ROUTE, vbak.AUART, vbak.VKORG, likp.TDLNR +""", + + "journal_anomaly": """ +-- Journal Entry Anomaly Detection Dataset +-- Extract from SAP FI-GL (S/4HANA ACDOCA) + +SELECT + acdoca.RCLNT AS CLIENT, + acdoca.RBUKRS AS COMPANY_CODE, + acdoca.GJAHR AS FISCAL_YEAR, + acdoca.BELNR AS DOCUMENT_NUMBER, + acdoca.DOCLN AS LINE_ITEM, + acdoca.RACCT AS GL_ACCOUNT, + acdoca.RCNTR AS COST_CENTER, + acdoca.HSL AS AMOUNT_LOCAL, + acdoca.RHCUR AS LOCAL_CURRENCY, + acdoca.BUDAT AS POSTING_DATE, + acdoca.CPUDT AS ENTRY_DATE, + acdoca.USNAM AS USER_NAME, + acdoca.BLART AS DOCUMENT_TYPE, + + -- Time-based features + EXTRACT(HOUR FROM acdoca.CPUTM) AS ENTRY_HOUR, + EXTRACT(DOW FROM acdoca.CPUDT) AS ENTRY_DAY_OF_WEEK, + + -- Amount analysis + ABS(acdoca.HSL) AS AMOUNT_ABS, + CASE WHEN acdoca.HSL < 0 THEN 'CREDIT' ELSE 'DEBIT' END AS DC_INDICATOR, + + -- Target: Anomaly flag (define based on business rules or historical labels) + -- This should be labeled by auditors for training data + anomaly_label AS ANOMALY_FLAG + +FROM ACDOCA acdoca +LEFT JOIN anomaly_labels ON acdoca.BELNR = anomaly_labels.BELNR -- Your labeled data + +WHERE acdoca.GJAHR >= YEAR(CURRENT_DATE) - 2 +""" +} + + +def get_sql_template(use_case: str) -> str: + """ + Get SQL extraction template for SAP use case. + + Args: + use_case: One of 'customer_churn', 'payment_default', + 'delivery_delay', 'journal_anomaly' + + Returns: + SQL template string + """ + if use_case not in SQL_TEMPLATES: + available = list(SQL_TEMPLATES.keys()) + raise ValueError(f"Unknown use case '{use_case}'. 
Available: {available}")
+
+    return SQL_TEMPLATES[use_case]
+
+
+if __name__ == "__main__":
+    import sys
+
+    if len(sys.argv) < 2:
+        print("Usage: python prepare_sap_data.py <use_case>")
+        print("Available use cases:", list(SQL_TEMPLATES.keys()))
+        sys.exit(1)
+
+    use_case = sys.argv[1]
+    print(f"\n--- SQL Template for {use_case} ---\n")
+    print(get_sql_template(use_case))
diff --git a/skills/sap-rpt1-oss-predictor/scripts/rpt1_api.py b/skills/sap-rpt1-oss-predictor/scripts/rpt1_api.py
new file mode 100644
index 000000000..550922bc2
--- /dev/null
+++ b/skills/sap-rpt1-oss-predictor/scripts/rpt1_api.py
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+"""
+SAP-RPT-1 Playground API Client
+
+A Python client for interacting with SAP's RPT-1 tabular prediction model
+through the RPT Playground API.
+
+Usage:
+    from rpt1_api import RPT1Client
+
+    client = RPT1Client(token="YOUR_TOKEN")
+    result = client.predict(data="sales_data.csv", target_column="CHURN", task_type="classification")
+"""
+
+import os
+import sys
+import json
+import pandas as pd
+from typing import Union, Optional, Literal
+from pathlib import Path
+
+try:
+    import httpx
+except ImportError:
+    import subprocess
+    # Install into the interpreter actually running this script,
+    # not whatever `pip` happens to be first on PATH
+    subprocess.run([sys.executable, "-m", "pip", "install", "httpx", "--quiet", "--break-system-packages"])
+    import httpx
+
+
+class RPT1Client:
+    """Client for SAP RPT-1 Playground API."""
+
+    BASE_URL = "https://rpt-playground.sap.com/api"
+
+    def __init__(self, token: Optional[str] = None):
+        """
+        Initialize RPT1 client.
+
+        Args:
+            token: RPT Playground API token. If not provided, reads from
+                RPT_TOKEN environment variable.
+        """
+        self.token = token or os.environ.get("RPT_TOKEN")
+        if not self.token:
+            raise ValueError(
+                "RPT token required. Get it from https://rpt-playground.sap.com "
+                "(bottom of page) and pass as token= or set RPT_TOKEN env var."
+ ) + self.client = httpx.Client( + base_url=self.BASE_URL, + headers={"Authorization": f"Bearer {self.token}"}, + timeout=120.0 + ) + + def predict( + self, + data: Union[str, Path, pd.DataFrame], + target_column: str, + task_type: Literal["classification", "regression"] = "classification", + model_version: Literal["fast", "accurate"] = "accurate" + ) -> dict: + """ + Run prediction on tabular data. + + Args: + data: CSV file path or pandas DataFrame with training examples + target_column: Column name to predict (will be masked with [PREDICT]) + task_type: "classification" for categories, "regression" for numbers + model_version: "fast" for low latency, "accurate" for best results + + Returns: + dict with predictions and confidence scores + + Example: + >>> client = RPT1Client(token="xxx") + >>> result = client.predict( + ... data="customers.csv", + ... target_column="CHURN_RISK", + ... task_type="classification" + ... ) + >>> print(result["predictions"]) + """ + # Load data + if isinstance(data, (str, Path)): + df = pd.read_csv(data) + else: + df = data.copy() + + # Validate target column exists + if target_column not in df.columns: + raise ValueError(f"Target column '{target_column}' not found. Available: {list(df.columns)}") + + # Prepare payload + payload = { + "data": df.to_dict(orient="records"), + "target_column": target_column, + "task_type": task_type, + "model_version": model_version + } + + response = self.client.post("/predict", json=payload) + response.raise_for_status() + return response.json() + + def predict_with_mask( + self, + data: Union[str, Path, pd.DataFrame], + model_version: Literal["fast", "accurate"] = "accurate" + ) -> dict: + """ + Run prediction on data where target values are already marked with [PREDICT]. 
+ + Args: + data: CSV/DataFrame with [PREDICT] placeholders in cells to predict + model_version: "fast" or "accurate" + + Returns: + dict with filled predictions + """ + if isinstance(data, (str, Path)): + df = pd.read_csv(data) + else: + df = data.copy() + + payload = { + "data": df.to_dict(orient="records"), + "model_version": model_version + } + + response = self.client.post("/predict-masked", json=payload) + response.raise_for_status() + return response.json() + + def batch_predict( + self, + train_data: Union[str, Path, pd.DataFrame], + predict_data: Union[str, Path, pd.DataFrame], + target_column: str, + task_type: Literal["classification", "regression"] = "classification", + model_version: Literal["fast", "accurate"] = "accurate" + ) -> dict: + """ + Batch prediction with separate training and prediction datasets. + + Args: + train_data: CSV/DataFrame with labeled examples (known outcomes) + predict_data: CSV/DataFrame with rows to predict (target column will be filled) + target_column: Column to predict + task_type: "classification" or "regression" + model_version: "fast" or "accurate" + + Returns: + dict with predictions for predict_data rows + """ + # Load data + if isinstance(train_data, (str, Path)): + train_df = pd.read_csv(train_data) + else: + train_df = train_data.copy() + + if isinstance(predict_data, (str, Path)): + predict_df = pd.read_csv(predict_data) + else: + predict_df = predict_data.copy() + + payload = { + "train_data": train_df.to_dict(orient="records"), + "predict_data": predict_df.to_dict(orient="records"), + "target_column": target_column, + "task_type": task_type, + "model_version": model_version + } + + response = self.client.post("/batch-predict", json=payload) + response.raise_for_status() + return response.json() + + def health_check(self) -> bool: + """Check if API is accessible.""" + try: + response = self.client.get("/health") + return response.status_code == 200 + except Exception: + return False + + +def predict_from_csv( + 
csv_path: str,
+    target_column: str,
+    task_type: str = "classification",
+    token: Optional[str] = None
+) -> pd.DataFrame:
+    """
+    Convenience function to predict from CSV file.
+
+    Args:
+        csv_path: Path to CSV file
+        target_column: Column to predict
+        task_type: "classification" or "regression"
+        token: RPT API token (or set RPT_TOKEN env var)
+
+    Returns:
+        DataFrame with predictions added
+    """
+    client = RPT1Client(token=token)
+    result = client.predict(csv_path, target_column, task_type)
+
+    df = pd.read_csv(csv_path)
+    df[f"{target_column}_PREDICTED"] = result.get("predictions", [])
+
+    if "probabilities" in result:
+        df[f"{target_column}_CONFIDENCE"] = [
+            max(p.values()) if isinstance(p, dict) else p
+            for p in result["probabilities"]
+        ]
+
+    return df
+
+
+if __name__ == "__main__":
+    # Example usage
+    import sys
+
+    if len(sys.argv) < 3:
+        print("Usage: python rpt1_api.py <csv_file> <target_column> [task_type]")
+        print("Example: python rpt1_api.py customers.csv CHURN_RISK classification")
+        sys.exit(1)
+
+    csv_file = sys.argv[1]
+    target_col = sys.argv[2]
+    task = sys.argv[3] if len(sys.argv) > 3 else "classification"
+
+    result_df = predict_from_csv(csv_file, target_col, task)
+
+    output_file = csv_file.replace(".csv", "_predictions.csv")
+    result_df.to_csv(output_file, index=False)
+    print(f"Predictions saved to: {output_file}")
diff --git a/skills/sap-rpt1-oss-predictor/scripts/rpt1_oss_predict.py b/skills/sap-rpt1-oss-predictor/scripts/rpt1_oss_predict.py
new file mode 100644
index 000000000..30d99d7d1
--- /dev/null
+++ b/skills/sap-rpt1-oss-predictor/scripts/rpt1_oss_predict.py
@@ -0,0 +1,356 @@
+#!/usr/bin/env python3
+"""
+SAP-RPT-1-OSS Local Model Prediction
+
+Wrapper for running predictions using the open source SAP-RPT-1-OSS model
+from Hugging Face.
+
+Requirements:
+    pip install git+https://github.com/SAP-samples/sap-rpt-1-oss
+    huggingface-cli login  # Accept model terms at HF
+
+Usage:
+    from rpt1_oss_predict import predict_classification, predict_regression
+
+    predictions = predict_classification(
+        train_data="train.csv",
+        test_data="test.csv",
+        target_column="CHURN_STATUS"
+    )
+"""
+
+import os
+import sys
+import warnings
+from typing import Union, Optional, Literal, Tuple
+from pathlib import Path
+
+import pandas as pd
+import numpy as np
+
+# Check for sap_rpt_oss installation
+try:
+    from sap_rpt_oss import SAP_RPT_OSS_Classifier, SAP_RPT_OSS_Regressor
+    RPT_OSS_AVAILABLE = True
+except ImportError:
+    RPT_OSS_AVAILABLE = False
+    warnings.warn(
+        "sap_rpt_oss not installed. Install with:\n"
+        "pip install git+https://github.com/SAP-samples/sap-rpt-1-oss\n"
+        "Then login to Hugging Face: huggingface-cli login"
+    )
+
+
+def check_hf_auth() -> bool:
+    """Check if Hugging Face authentication is configured."""
+    # Current huggingface_hub writes the token to ~/.cache/huggingface/token;
+    # ~/.huggingface/token is the legacy location, so check both
+    token_paths = [
+        Path.home() / ".cache" / "huggingface" / "token",
+        Path.home() / ".huggingface" / "token",
+    ]
+    hf_token_env = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
+
+    if any(p.exists() for p in token_paths) or hf_token_env:
+        return True
+
+    print("⚠️ Hugging Face authentication required!")
+    print("\nSetup steps:")
+    print("1. pip install huggingface_hub")
+    print("2. huggingface-cli login")
+    print("3. Accept model terms at: https://huggingface.co/SAP/sap-rpt-1-oss")
+    print("\nOr set HF_TOKEN environment variable")
+    return False
+
+
+def get_optimal_config() -> dict:
+    """
+    Detect available GPU and return optimal configuration.
+ + Returns: + dict with max_context_size and bagging parameters + """ + try: + import torch + if torch.cuda.is_available(): + gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3) + + if gpu_memory >= 80: + return {"max_context_size": 8192, "bagging": 8, "device": "cuda"} + elif gpu_memory >= 40: + return {"max_context_size": 4096, "bagging": 4, "device": "cuda"} + elif gpu_memory >= 24: + return {"max_context_size": 2048, "bagging": 2, "device": "cuda"} + else: + return {"max_context_size": 1024, "bagging": 1, "device": "cuda"} + else: + print("⚠️ No GPU detected. Using CPU (will be slow)") + return {"max_context_size": 1024, "bagging": 1, "device": "cpu"} + except ImportError: + print("⚠️ PyTorch not found. Install with: pip install torch") + return {"max_context_size": 1024, "bagging": 1, "device": "cpu"} + + +def load_data( + data: Union[str, Path, pd.DataFrame], + target_column: Optional[str] = None +) -> Tuple[pd.DataFrame, Optional[pd.Series]]: + """ + Load data from CSV or DataFrame. + + Args: + data: CSV path or DataFrame + target_column: Column to extract as target (optional) + + Returns: + Tuple of (features DataFrame, target Series or None) + """ + if isinstance(data, (str, Path)): + df = pd.read_csv(data) + else: + df = data.copy() + + if target_column and target_column in df.columns: + y = df[target_column] + X = df.drop(columns=[target_column]) + return X, y + + return df, None + + +def predict_classification( + train_data: Union[str, Path, pd.DataFrame], + test_data: Union[str, Path, pd.DataFrame], + target_column: str, + max_context_size: Optional[int] = None, + bagging: Optional[int] = None, + return_probabilities: bool = True +) -> dict: + """ + Run classification prediction using SAP-RPT-1-OSS. 
+ + Args: + train_data: Training data with known labels (CSV path or DataFrame) + test_data: Test data to predict (CSV path or DataFrame) + target_column: Column name containing class labels + max_context_size: Context window size (auto-detected if None) + bagging: Ensemble size for bagging (auto-detected if None) + return_probabilities: Whether to return class probabilities + + Returns: + dict with 'predictions', 'probabilities' (if requested), 'classes' + + Example: + >>> result = predict_classification( + ... train_data="train.csv", + ... test_data="test.csv", + ... target_column="CHURN_STATUS" + ... ) + >>> print(result["predictions"]) + """ + if not RPT_OSS_AVAILABLE: + raise ImportError("sap_rpt_oss not installed. See setup instructions.") + + if not check_hf_auth(): + raise EnvironmentError("Hugging Face authentication required") + + # Load data + X_train, y_train = load_data(train_data, target_column) + X_test, y_test = load_data(test_data, target_column) + + if y_train is None: + raise ValueError(f"Target column '{target_column}' not found in training data") + + # Get optimal config + config = get_optimal_config() + max_context_size = max_context_size or config["max_context_size"] + bagging = bagging or config["bagging"] + + print(f"🔧 Config: context_size={max_context_size}, bagging={bagging}, device={config['device']}") + print(f"📊 Training samples: {len(X_train)}, Test samples: {len(X_test)}") + + # Initialize and fit classifier + print("🚀 Loading SAP-RPT-1-OSS model...") + clf = SAP_RPT_OSS_Classifier( + max_context_size=max_context_size, + bagging=bagging + ) + + print("📈 Fitting model...") + clf.fit(X_train, y_train) + + # Predict + print("🔮 Running predictions...") + predictions = clf.predict(X_test) + + result = { + "predictions": predictions.tolist() if hasattr(predictions, 'tolist') else list(predictions), + "classes": clf.classes_.tolist() if hasattr(clf, 'classes_') else None, + "n_samples": len(X_test), + "config": {"max_context_size": 
max_context_size, "bagging": bagging} + } + + if return_probabilities: + print("📊 Computing probabilities...") + probabilities = clf.predict_proba(X_test) + result["probabilities"] = probabilities.tolist() if hasattr(probabilities, 'tolist') else probabilities + + print("✅ Prediction complete!") + return result + + +def predict_regression( + train_data: Union[str, Path, pd.DataFrame], + test_data: Union[str, Path, pd.DataFrame], + target_column: str, + max_context_size: Optional[int] = None, + bagging: Optional[int] = None +) -> dict: + """ + Run regression prediction using SAP-RPT-1-OSS. + + Args: + train_data: Training data with known values (CSV path or DataFrame) + test_data: Test data to predict (CSV path or DataFrame) + target_column: Column name containing target values + max_context_size: Context window size (auto-detected if None) + bagging: Ensemble size for bagging (auto-detected if None) + + Returns: + dict with 'predictions' and config info + + Example: + >>> result = predict_regression( + ... train_data="deliveries_train.csv", + ... test_data="deliveries_test.csv", + ... target_column="DELAY_DAYS" + ... ) + >>> print(result["predictions"]) + """ + if not RPT_OSS_AVAILABLE: + raise ImportError("sap_rpt_oss not installed. 
See setup instructions.") + + if not check_hf_auth(): + raise EnvironmentError("Hugging Face authentication required") + + # Load data + X_train, y_train = load_data(train_data, target_column) + X_test, y_test = load_data(test_data, target_column) + + if y_train is None: + raise ValueError(f"Target column '{target_column}' not found in training data") + + # Get optimal config + config = get_optimal_config() + max_context_size = max_context_size or config["max_context_size"] + bagging = bagging or config["bagging"] + + print(f"🔧 Config: context_size={max_context_size}, bagging={bagging}, device={config['device']}") + print(f"📊 Training samples: {len(X_train)}, Test samples: {len(X_test)}") + + # Initialize and fit regressor + print("🚀 Loading SAP-RPT-1-OSS model...") + reg = SAP_RPT_OSS_Regressor( + max_context_size=max_context_size, + bagging=bagging + ) + + print("📈 Fitting model...") + reg.fit(X_train, y_train) + + # Predict + print("🔮 Running predictions...") + predictions = reg.predict(X_test) + + print("✅ Prediction complete!") + return { + "predictions": predictions.tolist() if hasattr(predictions, 'tolist') else list(predictions), + "n_samples": len(X_test), + "config": {"max_context_size": max_context_size, "bagging": bagging} + } + + +def predict_from_single_file( + data: Union[str, Path, pd.DataFrame], + target_column: str, + task_type: Literal["classification", "regression"] = "classification", + train_ratio: float = 0.8, + **kwargs +) -> pd.DataFrame: + """ + Convenience function: split single file into train/test and predict. 
+ + Args: + data: CSV path or DataFrame with all data + target_column: Column to predict + task_type: "classification" or "regression" + train_ratio: Fraction for training (default: 0.8) + **kwargs: Additional args for predict functions + + Returns: + DataFrame with predictions added + """ + if isinstance(data, (str, Path)): + df = pd.read_csv(data) + else: + df = data.copy() + + # Split data + n_train = int(len(df) * train_ratio) + train_df = df.iloc[:n_train] + test_df = df.iloc[n_train:] + + # Run prediction + if task_type == "classification": + result = predict_classification(train_df, test_df, target_column, **kwargs) + else: + result = predict_regression(train_df, test_df, target_column, **kwargs) + + # Add predictions to test data + test_df = test_df.copy() + test_df[f"{target_column}_PREDICTED"] = result["predictions"] + + if "probabilities" in result and result["probabilities"]: + # Get max probability as confidence + test_df[f"{target_column}_CONFIDENCE"] = [ + max(p) if isinstance(p, (list, np.ndarray)) else p + for p in result["probabilities"] + ] + + return test_df + + +if __name__ == "__main__": + import argparse + + parser = argparse.ArgumentParser(description="SAP-RPT-1-OSS Prediction") + parser.add_argument("train_file", help="Training CSV file") + parser.add_argument("test_file", help="Test CSV file") + parser.add_argument("target_column", help="Column to predict") + parser.add_argument("--task", choices=["classification", "regression"], + default="classification", help="Task type") + parser.add_argument("--context-size", type=int, help="Context window size") + parser.add_argument("--bagging", type=int, help="Bagging ensemble size") + parser.add_argument("--output", "-o", help="Output CSV file") + + args = parser.parse_args() + + # Run prediction + if args.task == "classification": + result = predict_classification( + args.train_file, args.test_file, args.target_column, + max_context_size=args.context_size, bagging=args.bagging + ) + else: + 
result = predict_regression( + args.train_file, args.test_file, args.target_column, + max_context_size=args.context_size, bagging=args.bagging + ) + + # Save or print results + if args.output: + test_df = pd.read_csv(args.test_file) + test_df[f"{args.target_column}_PREDICTED"] = result["predictions"] + test_df.to_csv(args.output, index=False) + print(f"💾 Saved predictions to: {args.output}") + else: + print("\n📋 Predictions:") + for i, pred in enumerate(result["predictions"][:10]): + print(f" [{i}] {pred}") + if len(result["predictions"]) > 10: + print(f" ... and {len(result['predictions']) - 10} more")
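
The scripts above all follow the same shape: an 80/20 positional train/predict split, a fit on the labeled rows, and chunked prediction via ceil division (`(n + chunk_size - 1) // chunk_size`). The following is a minimal, self-contained sketch of that flow; `DummyModel` is a hypothetical stand-in (a majority-class predictor) used only so the example runs without the SAP-RPT-1-OSS weights — the real scripts use `SAP_RPT_OSS_Classifier` in its place.

```python
import pandas as pd

# Toy data standing in for an SAP CSV export (columns follow the churn template)
df = pd.DataFrame({
    "ORDERS_LAST_12M": [12, 0, 7, 9, 11, 0],
    "DAYS_SINCE_LAST_ORDER": [20, 400, 45, 30, 10, 500],
    "CHURN_STATUS": ["ACTIVE", "CHURNED", "ACTIVE", "ACTIVE", "ACTIVE", "CHURNED"],
})

# 80/20 positional split, as in split_train_predict() above
n_train = int(len(df) * 0.8)
train_df, test_df = df.iloc[:n_train], df.iloc[n_train:]

class DummyModel:
    """Hypothetical stand-in for SAP_RPT_OSS_Classifier: predicts the majority class."""
    def fit(self, X, y):
        self.majority_ = y.mode().iloc[0]  # most frequent training label
        return self
    def predict(self, X):
        return [self.majority_] * len(X)

model = DummyModel().fit(
    train_df.drop(columns=["CHURN_STATUS"]), train_df["CHURN_STATUS"]
)

# Chunked prediction, as in batch_predict_oss(): ceil-divide rows into batches
chunk_size = 2
n_chunks = (len(test_df) + chunk_size - 1) // chunk_size
preds = []
for i in range(0, len(test_df), chunk_size):
    preds.extend(model.predict(test_df.iloc[i:i + chunk_size]))

# Attach predictions next to the original test rows
result = test_df.copy()
result["CHURN_STATUS_PREDICTED"] = preds
```

Swapping `DummyModel` for the real classifier changes nothing else in the flow, which is why the same split/fit/chunk skeleton appears in both the API client and the local-model script.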