feat: implement alias-first architecture for zero-downtime reindexing

## Summary

After thorough discussion, we need to move from an **index-first** to an **alias-first** architecture to support zero-downtime reindexing for shared cluster operators.

## Problem Statement

We have two usage modes:
1. **Shared cluster operator mode** (e.g., @simianhacker's cluster, new OBLT shared cluster with @alejandro.colomina)
2. **Personal cluster operator mode**

Shared cluster operators need zero-downtime reindexing. When they update indexer logic or mappings, users shouldn't have to wait for reindexing to complete. With the current index-first approach, reindexing causes downtime.

## Current Architecture (Index-First)

- Index name is primary: `kibana` (concrete index)
- Alias is secondary: `kibana-repo` → `kibana` (for MCP discovery)
- `npm run index -- repo:kibana` → creates/uses index `kibana`
- Index names are stable and user-facing

**Limitations:**
- No zero-downtime reindexing
- Users see actual index names in `list_indices`
- Can't hot-swap indices without downtime

## New Architecture (Alias-First)

- **Alias is primary**: `kibana` (public-facing identifier)
- **Index is ephemeral**: `kibana-{timestamp}` or `kibana-v{N}` (internal implementation detail)
- `npm run index -- repo:kibana` → creates `kibana-{timestamp}`, indexes to it, sets alias `kibana` → `kibana-{timestamp}`
- All references become alias references
- `list_indices` shows aliases, not indices

**Benefits:**
- ✅ Zero-downtime reindexing via alias swapping
- ✅ Clean public API (aliases only)
- ✅ Works for both shared and personal cluster operators
- ✅ Hot-swappable indices
- ✅ No `-repo` suffix requirement (users can pass any alias name)

## Operational Model (Stateless & Efficient)

To support **Stateless CronJobs** on immutable infrastructure (Kubernetes/GitOps) while ensuring **Resource Efficiency**, we separate the operational concerns:

1.  **Steady State (CronJob):**
    *   Command: `npm run index -- kibana`
    *   Role: Incremental updates.
    *   Resources: Lightweight (Low CPU/RAM).
    *   Behavior: Resolves `kibana` alias to current index, indexes diffs.

2.  **Maintenance State (One-Off Job):**
    *   Command: `npm run index -- kibana --clean`
    *   Role: Full Reindex & Alias Swap.
    *   Resources: Heavyweight (High CPU/RAM).
    *   Behavior: Creates a new index, runs a full index into it, atomically swaps the alias, then deletes the old index.

This separation avoids the "Super CronJob" anti-pattern (sizing a CronJob for the peak load of a 16h reindex) and keeps the scheduled job limited to lightweight incremental work.
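
As a rough illustration of how one entry point could serve both jobs, the sketch below maps command-line arguments to a run plan. It assumes the bare alias form used in the commands above; `planRun`, `RunPlan`, and the mode names are hypothetical, and only the `--clean` and `--keep-old-indices` flags come from this proposal.

```typescript
type RunMode = "incremental" | "full-reindex";

interface RunPlan {
  alias: string;
  mode: RunMode;
  keepOldIndices: boolean;
}

// Hypothetical helper: turns the arguments after `npm run index --` into a plan.
function planRun(args: readonly string[]): RunPlan {
  const flags = new Set(args.filter((arg) => arg.startsWith("--")));
  const positional = args.filter((arg) => !arg.startsWith("--"));
  const alias = positional[0];
  if (alias === undefined || positional.length !== 1) {
    throw new Error("expected exactly one alias name");
  }
  return {
    alias,
    // `--clean` selects the heavyweight maintenance path; the default is the
    // lightweight incremental path used by the CronJob.
    mode: flags.has("--clean") ? "full-reindex" : "incremental",
    keepOldIndices: flags.has("--keep-old-indices"),
  };
}
```

With this shape, the CronJob manifest and the one-off Job manifest differ only in their arguments and resource requests.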

## Concurrency Control (GitOps Compatible)

To avoid infrastructure mutation (e.g. `kubectl patch suspend` which causes GitOps drift) and overlapping runs:

*   **Repository-Scoped Locking:** The application uses a lock in the settings index (e.g. `reindex_in_progress_kibana`).
*   **Maintenance Job:** Acquires lock → Reindexes → Releases lock.
*   **CronJob:** Checks lock on startup. If locked, **skips run** (exits 0).

This allows the CronJob to remain "running" (according to K8s) but "paused" (logically) during maintenance, satisfying immutable infrastructure constraints.
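
A minimal sketch of the locking behavior, using an in-memory stand-in for the settings index. `SettingsStore`, `runMaintenance`, and `runCron` are hypothetical names, and the lock is modeled as the presence or absence of a key rather than a `true`/`false` value. Against a real cluster, the acquire step would need a create-only write (for example Elasticsearch's `op_type=create`) so that two maintenance jobs cannot both succeed.

```typescript
// Hypothetical in-memory stand-in for the settings index.
class SettingsStore {
  private readonly keys = new Set<string>();

  // Returns true only if this call created the key.
  createIfAbsent(key: string): boolean {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }

  has(key: string): boolean {
    return this.keys.has(key);
  }

  remove(key: string): void {
    this.keys.delete(key);
  }
}

const lockKey = (alias: string): string => `reindex_in_progress_${alias}`;

// Maintenance Job: acquire lock, reindex, release lock (even on failure).
function runMaintenance<T>(store: SettingsStore, alias: string, reindex: () => T): T {
  if (!store.createIfAbsent(lockKey(alias))) {
    throw new Error(`${lockKey(alias)} is already held`);
  }
  try {
    return reindex();
  } finally {
    store.remove(lockKey(alias));
  }
}

// CronJob: check the lock on startup and skip the run if it is held.
function runCron(store: SettingsStore, alias: string, indexDiffs: () => void): "ran" | "skipped" {
  if (store.has(lockKey(alias))) return "skipped"; // caller exits 0
  indexDiffs();
  return "ran";
}
```

Because the lock key includes the alias, maintenance on one repository does not pause incremental runs for another.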

## Required Changes

### 1. Index Name Generation
- Auto-generate versioned index names: `{aliasName}-{timestamp}` or `{aliasName}-v{N}`
- Index names become ephemeral/internal
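
A small sketch of the timestamp variant, assuming epoch milliseconds as in the examples in this issue. `generateIndexName` and `parseIndexName` are hypothetical helper names.

```typescript
// Hypothetical helper: builds an ephemeral index name for a given alias.
function generateIndexName(aliasName: string, now: Date = new Date()): string {
  return `${aliasName}-${now.getTime()}`;
}

// Hypothetical inverse: recovers the alias and timestamp from an index name,
// or returns null when the name is not of the form `{aliasName}-{13 digits}`.
function parseIndexName(indexName: string): { alias: string; timestamp: number } | null {
  const match = /^(.+)-(\d{13})$/.exec(indexName);
  if (match === null) return null;
  const alias = match[1];
  const digits = match[2];
  if (alias === undefined || digits === undefined) return null;
  return { alias, timestamp: Number(digits) };
}
```

Thirteen-digit millisecond timestamps also sort chronologically as plain strings, which makes it easy to find the newest version of an alias by name alone.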

### 2. Alias Management
- Create/update aliases as the primary operation
- All `index` parameters become alias references
- Alias becomes the stable identifier
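
Assuming the cluster is Elasticsearch, the swap can be expressed as a single `POST /_aliases` request whose `actions` list is applied atomically. The sketch below only builds that request body; `buildAliasSwap` is a hypothetical helper, and sending the request is left to whatever client the indexer already uses.

```typescript
type AliasAction =
  | { add: { index: string; alias: string } }
  | { remove: { index: string; alias: string } };

// Hypothetical helper: builds the body for one atomic alias update that
// detaches the alias from the old indices and attaches it to the new one.
function buildAliasSwap(
  alias: string,
  newIndex: string,
  oldIndices: readonly string[],
): { actions: AliasAction[] } {
  const removals: AliasAction[] = oldIndices
    .filter((index) => index !== newIndex)
    .map((index) => ({ remove: { index, alias } }));
  return { actions: [...removals, { add: { index: newIndex, alias } }] };
}
```

On the first run there are no old indices, so the body contains only the `add` action and the same code path creates the alias.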

### 3. Old Index Cleanup
- Track previous index versions
- By default, clean up old indices after reindexing
- Add `--keep-old-indices` flag for operators who want to keep old indices temporarily
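
One way to pick cleanup candidates is by name, assuming the `{aliasName}-{timestamp}` scheme. This is a sketch only: `selectIndicesToDelete` is a hypothetical helper, and tracking versions in the settings index would be an alternative to matching on names.

```typescript
interface CleanupOptions {
  keepOldIndices: boolean; // set by `--keep-old-indices`
}

// Hypothetical helper: returns the versioned indices of `alias` that are no
// longer current and can be deleted after a successful swap.
function selectIndicesToDelete(
  alias: string,
  allIndexNames: readonly string[],
  currentIndex: string,
  options: CleanupOptions,
): string[] {
  if (options.keepOldIndices) return [];
  // Escape regex metacharacters so alias names such as `my.repo` match literally.
  const escaped = alias.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const versioned = new RegExp(`^${escaped}-\\d+$`);
  return allIndexNames.filter((name) => name !== currentIndex && versioned.test(name));
}
```

Anchoring the pattern on the full name keeps the cleanup from touching indices that belong to a different alias sharing the same prefix, such as `kibana-docs-1733856000000`.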

### 4. API Changes
- `npm run index -- repo:kibana` → `kibana` is now an alias, not an index name
- All internal references switch from index names to aliases
- Settings index needs to track alias → version mapping

### 5. MCP Integration
- `list_indices` shows aliases, not indices
- MCP already uses aliases for discovery, so this part works

### 6. Interface Changes
- Remove `-repo` suffix requirement
- User can pass whatever alias name they want: `npm run index -- repo:my-alias`
- Alias name becomes the public-facing identifier

## Implementation Details

### Example Flow

```
User Command:
  npm run index -- kibana

Execution:
  1. Generate index name: `kibana-1733856000000` (timestamp)
  2. Create index: `kibana-1733856000000`
  3. Index all data to: `kibana-1733856000000`
  4. Create/update alias: `kibana` → `kibana-1733856000000`
  5. If old index exists: Delete old index (unless `--keep-old-indices` flag)
  6. MCP discovers via alias: `kibana`
  7. `list_indices` shows: `kibana` (alias), not `kibana-1733856000000` (index)
```

### Zero-Downtime Reindexing Flow (with Locking)

```
Scenario: Operator wants to reindex with new mappings via Maintenance Job

1. Maintenance Job Starts (`--clean`)
2. Acquires lock: `reindex_in_progress_kibana = true`
3. Current state: `kibana` (alias) → `kibana-1733856000000` (index)
4. Create new index: `kibana-1733856100000` (new timestamp)
5. Reindex all data to: `kibana-1733856100000`
6. (Meanwhile) CronJob wakes up → Sees lock → Skips run
7. When complete, atomically swap: `kibana` → `kibana-1733856100000`
8. Delete old index: `kibana-1733856000000`
9. Release lock: `reindex_in_progress_kibana = false`
10. Users/CronJob continue via `kibana` alias (zero downtime!)
```
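
The ordering of the steps above can be checked against a toy model. The sketch below uses an in-memory `FakeCluster` (a hypothetical stand-in, not a real client) to show that the alias keeps resolving to the old index for the whole duration of the reindex and that the lock is released even if indexing fails.

```typescript
// Hypothetical in-memory model of a cluster: index names, alias targets, locks.
class FakeCluster {
  readonly indices = new Set<string>();
  readonly aliases = new Map<string, string>();
  readonly locks = new Set<string>();

  resolve(alias: string): string | undefined {
    return this.aliases.get(alias);
  }
}

// Hypothetical `--clean` flow; comments refer to the numbered steps above.
function cleanReindex(
  cluster: FakeCluster,
  alias: string,
  now: number,
  fullIndex: (targetIndex: string) => void,
): string {
  const lock = `reindex_in_progress_${alias}`;
  if (cluster.locks.has(lock)) throw new Error(`${lock} is already held`);
  cluster.locks.add(lock); // step 2
  try {
    const oldIndex = cluster.resolve(alias); // step 3
    const newIndex = `${alias}-${now}`; // step 4
    cluster.indices.add(newIndex);
    fullIndex(newIndex); // step 5: readers still resolve to oldIndex
    cluster.aliases.set(alias, newIndex); // step 7: one assignment models the atomic swap
    if (oldIndex !== undefined && oldIndex !== newIndex) {
      cluster.indices.delete(oldIndex); // step 8
    }
    return newIndex;
  } finally {
    cluster.locks.delete(lock); // step 9
  }
}
```

In this model a failure during step 5 leaves the alias on the old index, so readers are unaffected, but the partially built new index is left behind and would need separate cleanup.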

## Trade-offs

- **Drawback**: After every full reindex, users will see a new index with a different timestamp in the Index Management dashboard
- **Verdict**: This is an acceptable price for the zero-downtime capability it provides
- **Benefit**: Users/agents only interact with clean alias names via `list_indices`

## Related

- Supersedes #130 and #131
- Enables zero-downtime reindexing for shared cluster operators
- Complements MCP server dual discovery strategy
