144 changes: 122 additions & 22 deletions docs/content/advanced/api_examples/oci_embed.md
@@ -10,14 +10,120 @@ Licensed under the Universal Permissive License v1.0 as shown at http://oss.orac

## Overview

Creating a vector store from documents stored in OCI Object Storage is a two-step API workflow:
There are two API workflows for creating a vector store from documents in OCI Object Storage:

1. **Download** objects from an OCI bucket to the server's temporary staging area.
2. **Embed** the downloaded files into a new vector store.
1. **Single-call** — `POST /v1/embed/oci/store` downloads and embeds in one request. Recommended when the only source is an OCI bucket.
2. **Two-step** — `POST /v1/oci/objects/download` followed by `POST /v1/embed/`. Use this when you need to combine OCI objects with other sources (local uploads, web URLs, SQL query results) before embedding.

This separation is intentional — you can accumulate files from multiple downloads (or mix in files from other sources like local uploads) before triggering the embed step.
## Single-call Workflow

## Step 1: Download Objects from OCI Object Storage
Download and embed in one request.

**Endpoint:** `POST /v1/embed/oci/store`

| Parameter | Location | Description |
|---|---|---|
| `rate_limit` | Query | Embedding API rate limit in requests per minute (default: `0` for unlimited) |
| `client` | Header | Client identifier for scoping temp storage (default: `server`) |
| Request body | Body | `OciEmbedRequest` JSON object (see below) |

### OciEmbedRequest Fields

| Field | Type | Description |
|---|---|---|
| `bucket_name` | string | Name of the OCI Object Storage bucket |
| `auth_profile` | string | OCI profile name (case-insensitive). Default: `DEFAULT` |
| `objects` | array of strings | Object keys to embed. Omit or pass an empty list to embed every supported object in the bucket |
| `alias` | string | Identifiable alias for the vector store |
| `description` | string | Human-readable description of the vector store contents |
| `embedding_model` | object | `{"provider": "...", "id": "..."}` — the embedding model to use |
| `chunk_size` | integer | Maximum chunk size in characters (0 for default) |
| `chunk_overlap` | integer | Overlap between chunks in characters (0 for default) |
| `distance_strategy` | string | One of: `COSINE`, `EUCLIDEAN_DISTANCE`, `DOT_PRODUCT` |
| `index_type` | string | Vector index type: `HNSW`, `IVF`, or `HYB` |
| `parsing_mode` | string | Document parsing mode: `fast` or `deep` |

**Response:** `202 Accepted` with an `EmbedJobAccepted` body — poll `GET /v1/embed/jobs/{job_id}` for the terminal `EmbedProcessingResult`.

| Field | Type | Description |
|---|---|---|
| `job_id` | string | Identifier of the scheduled embed job |
| `status` | string | Initial status (`queued` or `running`) |
| `location` | string | Path to the job-status endpoint |

### Example — embed specific objects

```bash
curl -X POST "http://localhost:8000/v1/embed/oci/store?rate_limit=60" \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "client: my-session" \
-d '{
"bucket_name": "rag-source-docs",
"auth_profile": "DEFAULT",
"objects": ["product-catalog.pdf", "release-notes/2026-q2.md"],
"alias": "product-docs",
"description": "Product documentation embedded for RAG",
"embedding_model": {
"provider": "oci",
"id": "cohere.embed-english-v3.0"
},
"chunk_size": 1000,
"chunk_overlap": 100,
"distance_strategy": "COSINE",
"index_type": "HNSW",
"parsing_mode": "fast"
}'
```

### Example — embed every supported object in the bucket

Omit `objects` (or pass `[]`) to embed every object whose extension is supported (`.pdf`, `.html`, `.md`, `.txt`, `.csv`, `.docx`, `.pptx`, `.xlsx`, `.png`, `.jpg`, `.jpeg`):

```bash
curl -X POST "http://localhost:8000/v1/embed/oci/store" \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "client: my-session" \
-d '{
"bucket_name": "rag-source-docs",
"auth_profile": "DEFAULT",
"alias": "all-docs",
"embedding_model": {
"provider": "oci",
"id": "cohere.embed-english-v3.0"
},
"chunk_size": 1000,
"chunk_overlap": 100,
"distance_strategy": "COSINE",
"index_type": "HNSW"
}'
```

### Polling for completion

The single-call endpoint is asynchronous — the 202 response carries the `job_id`. Poll the job-status endpoint until it reaches a terminal state:

```bash
curl "http://localhost:8000/v1/embed/jobs/$JOB_ID" \
-H "x-api-key: YOUR_API_KEY" \
-H "client: my-session"
```

A successful job's `result` field carries the `EmbedProcessingResult`:

| Field | Type | Description |
|---|---|---|
| `message` | string | Status message |
| `total_chunks` | integer | Number of chunks created |
| `processed_files` | array | List of successfully processed files |
| `skipped_files` | array | List of files that were skipped |
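The poll-until-terminal loop can be sketched in Python. The status fetch is injected as a callable so the sketch stays transport-agnostic; the terminal status names (`completed`, `failed`) are assumptions, since this document only names `queued` and `running`:

```python
import time

# Assumed terminal statuses; the document only names "queued" and "running"
# as initial states, so adjust this set to match the server's actual values.
TERMINAL_STATUSES = {"completed", "failed"}

def poll_embed_job(fetch_status, interval=2.0, max_wait=600.0):
    """Poll until the embed job reaches a terminal state.

    ``fetch_status`` is any zero-argument callable returning the job-status
    JSON, e.g. a thin wrapper around ``GET /v1/embed/jobs/{job_id}``.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError("embed job did not reach a terminal state in time")
```

On success, the returned job's `result` field carries the `EmbedProcessingResult` fields listed above.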

## Two-step Workflow

Use this flow when you need to combine OCI objects with other sources (local uploads, web URLs, SQL query results) before embedding. Files from each source endpoint accumulate in the same per-client staging area; the embed call consumes everything that has been staged.

### Step 1: Download Objects from OCI Object Storage

Download one or more objects from an OCI Object Storage bucket to the server's staging directory.

@@ -32,7 +138,7 @@ Download one or more objects from an OCI Object Storage bucket to the server's s

**Response:** JSON array of downloaded filenames.

### Example
#### Example

```bash
curl -X POST "http://localhost:8000/v1/oci/objects/download/my-documents/DEFAULT" \
@@ -44,7 +150,7 @@ curl -X POST "http://localhost:8000/v1/oci/objects/download/my-documents/DEFAULT

You can call this endpoint multiple times to accumulate files from the same or different buckets before proceeding to Step 2.

## Step 2: Create and Populate the Vector Store
### Step 2: Create and Populate the Vector Store

Process all staged files — splitting them into chunks, generating embeddings, and populating the vector store.

@@ -56,7 +162,7 @@ Process all staged files — splitting them into chunks, generating embeddings,
| `client` | Header | Must match the `client` value used in Step 1 |
| Request body | Body | `VectorStoreConfig` JSON object (see below) |

### VectorStoreConfig Fields
#### VectorStoreConfig Fields

| Field | Type | Description |
|---|---|---|
@@ -69,16 +175,9 @@ Process all staged files — splitting them into chunks, generating embeddings,
| `index_type` | string | Vector index type: `HNSW`, `IVF`, or `HYB` |
| `parsing_mode` | string | Document parsing mode: `fast` or `deep` |

**Response:** `EmbedProcessingResult` JSON object:

| Field | Type | Description |
|---|---|---|
| `message` | string | Status message |
| `total_chunks` | integer | Number of chunks created |
| `processed_files` | array | List of successfully processed files |
| `skipped_files` | array | List of files that were skipped |
**Response:** `202 Accepted` with an `EmbedJobAccepted` body — same polling contract as the single-call workflow above.

### Example
#### Example

```bash
curl -X POST "http://localhost:8000/v1/embed?rate_limit=60" \
@@ -89,7 +188,7 @@ curl -X POST "http://localhost:8000/v1/embed?rate_limit=60" \
"alias": "quarterly-reports",
"description": "Q4 quarterly review documents and metrics",
"embedding_model": {
"provider": "ocigenai",
"provider": "oci",
"id": "cohere.embed-english-v3.0"
},
"chunk_size": 1000,
@@ -100,7 +199,7 @@ curl -X POST "http://localhost:8000/v1/embed?rate_limit=60" \
}'
```

## Complete Example
### Complete Example

A full end-to-end workflow downloading from two buckets and embedding:

@@ -132,7 +231,7 @@ curl -X POST "$API_URL/v1/embed?rate_limit=60" \
"alias": "q4-knowledge-base",
"description": "Q4 2024 reports and supporting data",
"embedding_model": {
"provider": "ocigenai",
"provider": "oci",
"id": "cohere.embed-english-v3.0"
},
"chunk_size": 1000,
@@ -145,6 +244,7 @@

## Notes

- **File cleanup**: Staged files are automatically cleaned up after the embed endpoint completes, whether it succeeds or fails.
- **Mixing sources**: Files from multiple sources can be accumulated before embedding. In addition to OCI Object Storage downloads, you can upload local files via `POST /v1/embed/local/store` or scrape web content — all files are staged in the same directory scoped by the `client` header.
- **Single-call vs two-step**: The single-call endpoint downloads directly into a per-request work directory, so it only embeds the objects from the named bucket — files staged via `/v1/embed/local/store`, `/v1/embed/web/store`, or `/v1/embed/sql/store` are *not* pulled into a single-call job. The two-step flow embeds every file currently staged for the client.
- **File cleanup**: In both workflows, staged files are automatically cleaned up after the embed job completes, whether it succeeds or fails.
- **Mixing sources**: Files from multiple sources can be accumulated before embedding via the two-step flow. In addition to OCI Object Storage downloads, you can upload local files via `POST /v1/embed/local/store` or scrape web content — all files are staged in the same directory scoped by the `client` header.
- **Client scoping**: The `client` header isolates temporary storage between different sessions. Use a consistent value across your download and embed calls within a single workflow.
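As a sketch of that scoping rule, the stdlib-only Python below builds the Step 1 and Step 2 requests from one shared header set. The endpoint paths come from this document; the names and values are illustrative:

```python
import json
import urllib.request

def scoped_request(base_url, path, body, api_key, client_id):
    """Return an (unsent) POST request carrying the shared headers."""
    return urllib.request.Request(
        f"{base_url}{path}",
        data=json.dumps(body).encode(),
        headers={
            "x-api-key": api_key,
            "Content-Type": "application/json",
            "client": client_id,  # must match across download and embed
        },
        method="POST",
    )

# Both calls carry the same ``client`` header, so the embed consumes
# exactly what the download staged.
download = scoped_request(
    "http://localhost:8000", "/v1/oci/objects/download/my-documents/DEFAULT",
    ["report.pdf"], "YOUR_API_KEY", "my-session",
)
embed = scoped_request(
    "http://localhost:8000", "/v1/embed",
    {"alias": "quarterly-reports"}, "YOUR_API_KEY", "my-session",
)
```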
79 changes: 64 additions & 15 deletions src/client/app/content/tools/tabs/split_embed.py
@@ -160,6 +160,7 @@ class FileSourceData:
web_url: Optional[str] = None
oci_bucket: Optional[str] = None
oci_files_selected: Optional[pd.DataFrame] = None
oci_all_files: bool = False
sql_query: Optional[str] = None
sql_db_alias: Optional[str] = None

@@ -172,11 +173,18 @@ def is_valid(self) -> bool:
if self.file_source == "SQL":
return bool(self.sql_query and self.sql_query.strip() and self.sql_db_alias)
if self.file_source == "OCI":
return bool(self.oci_files_selected is not None and self.oci_files_selected["Process"].sum() > 0)
if not self.oci_bucket:
return False
return bool(
self.oci_all_files
or (self.oci_files_selected is not None and self.oci_files_selected["Process"].sum() > 0)
)
return False

def get_button_help(self) -> str:
"""Get help text for the populate button based on file source."""
if self.file_source == "OCI" and self.oci_all_files:
return "This button is disabled if no source bucket is selected."
help_map = {
"Local": "This button is disabled if no local files have been provided.",
"Web": "This button is disabled if the URL was unable to be validated. Please check the URL.",
@@ -206,6 +214,7 @@ def _get_buckets(compartment_ocid: str, auth_profile: str) -> list:
return ["No Access to Buckets in this Compartment"]


@st.cache_data(ttl=60, show_spinner="Listing bucket objects")
def _get_bucket_objects(bucket_name: str, auth_profile: str) -> list:
"""Get object names from an OCI bucket."""
return api_get(f"oci/objects/{bucket_name}/{auth_profile}")
@@ -456,9 +465,25 @@ def _render_load_kb_section(file_sources: list, oci_setup: dict | None) -> FileS
disabled=not bucket_compartment,
)

src_objects = _get_bucket_objects(data.oci_bucket, auth_profile) if data.oci_bucket else []
src_files = _files_data_frame(src_objects)
data.oci_files_selected = _files_data_editor(src_files, "source")
data.oci_all_files = st.toggle(
"Embed all supported files in bucket",
value=True,
key="runtime_oci_all_files",
disabled=not data.oci_bucket,
help=(
"When enabled, every supported file in the selected bucket is embedded "
"without per-file selection. Disable to pick individual files."
),
)

if data.oci_bucket:
st.caption(state.optimizer_help.get("embed_supported_file_types", ""))
if data.oci_all_files:
st.caption(f"All supported files in `{data.oci_bucket}` will be embedded.")
else:
src_objects = _get_bucket_objects(data.oci_bucket, auth_profile)
src_files = _files_data_frame(src_objects)
data.oci_files_selected = _files_data_editor(src_files, "source")

return data

@@ -652,6 +677,41 @@ def _process_populate_request(
client_header = {"client": state.optimizer_client}
auth_profile = state["settings"]["client_settings"].get("oci", {}).get("auth_profile", "")

if source_data.file_source == "OCI":
payload = _build_embed_payload(embed_config)
payload["bucket_name"] = source_data.oci_bucket or ""
payload["auth_profile"] = auth_profile or "DEFAULT"
if not source_data.oci_all_files:
oci_selected = source_data.oci_files_selected
if oci_selected is None:
return None, {}
process_list = oci_selected[oci_selected["Process"]].reset_index(drop=True)
object_names = process_list["File"].tolist()
# An empty ``objects`` list is server-equivalent to omitting
# it — i.e. "embed every supported file in the bucket".
# Reject zero-selection here so a TOCTOU race past the
# disabled-button gate cannot silently embed the whole bucket.
if not object_names:
return None, {}
payload["objects"] = object_names
# 7200s mirrors ``/embed/refresh`` (same synchronous-download
# shape); /embed/oci/store downloads bucket objects before the
# 202, so a ReadTimeout would lose the job_id mid-flight.
accepted = api_post(
"embed/oci/store",
json=payload,
params={"rate_limit": rate_limit or 0},
extra_headers=client_header,
timeout=7200,
)
job_id = accepted["job_id"]
mark_embed_job_started(job_id)
try:
return job_id, _poll_embed_job(job_id, client_header)
except httpx.HTTPStatusError as ex:
ex.job_id = job_id # type: ignore[attr-defined]
raise

# Step 1: Store source files on server
if source_data.file_source == "Local":
files = helpers.unique_file_payload(state.runtime_local_file_uploader)
@@ -664,17 +724,6 @@
json={"query": source_data.sql_query, "db_alias": source_data.sql_db_alias},
extra_headers=client_header,
)
else: # OCI
oci_selected = source_data.oci_files_selected
if oci_selected is None:
return None, {}
process_list = oci_selected[oci_selected["Process"]].reset_index(drop=True)
file_names = process_list["File"].tolist()
api_post(
f"oci/objects/download/{source_data.oci_bucket or ''}/{auth_profile}",
json=file_names,
extra_headers=client_header,
)

# Step 2: Split and embed — schedule the job and poll for terminal state.
# 300s acceptance timeout outlasts pre-202 latency (``_settings_lock``