feat: implement alias-first architecture for zero-downtime reindexing #132

@kapral18

Summary

After thorough discussion, we need to move from an index-first to an alias-first architecture to support zero-downtime reindexing for shared cluster operators.

Problem Statement

We have two usage modes:

  1. Shared cluster operator mode (e.g., @simianhacker's cluster, new OBLT shared cluster with @alejandro.colomina)
  2. Personal cluster operator mode

Shared cluster operators need zero-downtime reindexing. When they update indexer logic or mappings, users shouldn't have to wait for reindexing to complete. Currently, with the index-first approach, reindexing causes downtime.

Current Architecture (Index-First)

  • Index name is primary: kibana (concrete index)
  • Alias is secondary: kibana-repo → kibana (for MCP discovery)
  • npm run index -- repo:kibana → creates/uses index kibana
  • Index names are stable and user-facing

Limitations:

  • No zero-downtime reindexing
  • Users see actual index names in list_indices
  • Can't hot-swap indices without downtime

New Architecture (Alias-First)

  • Alias is primary: kibana (public-facing identifier)
  • Index is ephemeral: kibana-{timestamp} or kibana-v{N} (internal implementation detail)
  • npm run index -- repo:kibana → creates kibana-{timestamp}, indexes to it, sets alias kibana → kibana-{timestamp}
  • All references become alias references
  • list_indices shows aliases, not indices

Benefits:

  • ✅ Zero-downtime reindexing via alias swapping
  • ✅ Clean public API (aliases only)
  • ✅ Works for both shared and personal cluster operators
  • ✅ Hot-swappable indices
  • ✅ No -repo suffix requirement (users can pass any alias name)

Operational Model (Stateless & Efficient)

To support Stateless CronJobs on immutable infrastructure (Kubernetes/GitOps) while ensuring Resource Efficiency, we separate the operational concerns:

  1. Steady State (CronJob):

    • Command: npm run index -- kibana
    • Role: Incremental updates.
    • Resources: Lightweight (Low CPU/RAM).
    • Behavior: Resolves kibana alias to current index, indexes diffs.
  2. Maintenance State (One-Off Job):

    • Command: npm run index -- kibana --clean
    • Role: Full Reindex & Alias Swap.
    • Resources: Heavyweight (High CPU/RAM).
    • Behavior: Creates a new index, runs a full index into it, atomically swaps the alias, then deletes the old index.

This separation prevents the "Super CronJob" anti-pattern (sizing a cronjob for peak 16h reindex load) and keeps the schedule pure.
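As a rough sketch of how one entry point could serve both jobs, the run mode could be derived from the command-line flags alone. `planRun`, `RunPlan`, and the mode names below are hypothetical and not taken from the codebase. The sketch only assumes the `--clean` and `--keep-old-indices` flags mentioned in this issue.

```typescript
type RunMode = "incremental" | "full-reindex";

interface RunPlan {
  alias: string;
  mode: RunMode;
  keepOldIndices: boolean;
}

// Hypothetical helper: turns the arguments after `npm run index --` into a plan.
function planRun(argv: string[]): RunPlan {
  const flags = new Set(argv.filter((arg) => arg.startsWith("--")));
  const positional = argv.filter((arg) => !arg.startsWith("--"));

  // Exactly one positional argument (the alias name) is expected in this sketch.
  const [alias, ...rest] = positional;
  if (alias === undefined || rest.length > 0) {
    throw new Error("Expected exactly one alias name");
  }

  return {
    alias,
    // --clean selects the heavyweight maintenance path; the default is the
    // lightweight incremental path used by the CronJob.
    mode: flags.has("--clean") ? "full-reindex" : "incremental",
    keepOldIndices: flags.has("--keep-old-indices"),
  };
}
```

With this shape, the CronJob and the one-off job can share an image and differ only in their arguments, so each can be sized for its own workload.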

Concurrency Control (GitOps Compatible)

To avoid both infrastructure mutation (e.g. kubectl patch suspend, which causes GitOps drift) and overlapping runs:

  • Repository-Scoped Locking: The application uses a lock in the settings index (e.g. reindex_in_progress_kibana).
  • Maintenance Job: Acquires lock → Reindexes → Releases lock.
  • CronJob: Checks lock on startup. If locked, skips run (exits 0).

This allows the CronJob to remain "running" (according to K8s) but "paused" (logically) during maintenance, satisfying immutable infrastructure constraints.
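The locking behaviour above could be sketched as follows. `SettingsStore` is a hypothetical in-memory stand-in for the settings index. The sketch also models the lock as the presence of a key rather than a true/false value, which is an assumption made to keep acquisition a single create-if-absent step. A real implementation would need the equivalent write to be atomic on the cluster side.

```typescript
// In-memory stand-in for the settings index (hypothetical, illustration only).
class SettingsStore {
  private readonly docs = new Map<string, boolean>();

  // Returns true only if the key did not exist yet (create-if-absent).
  createIfAbsent(key: string): boolean {
    if (this.docs.has(key)) return false;
    this.docs.set(key, true);
    return true;
  }

  exists(key: string): boolean {
    return this.docs.has(key);
  }

  remove(key: string): void {
    this.docs.delete(key);
  }
}

// Lock name follows the example in this issue: reindex_in_progress_kibana.
const lockKey = (alias: string): string => `reindex_in_progress_${alias}`;

// Maintenance job: acquire lock, reindex, always release the lock.
function runMaintenance(store: SettingsStore, alias: string, reindex: () => void): boolean {
  if (!store.createIfAbsent(lockKey(alias))) {
    return false; // another maintenance run already holds the lock
  }
  try {
    reindex();
  } finally {
    store.remove(lockKey(alias));
  }
  return true;
}

// CronJob: checks the lock on startup; a false result means "skip and exit 0".
function cronShouldRun(store: SettingsStore, alias: string): boolean {
  return !store.exists(lockKey(alias));
}
```

Releasing in `finally` keeps a failed reindex from leaving the CronJob paused. A crashed process would still leave the lock behind, so a real implementation may also want an expiry or a manual override.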

Required Changes

1. Index Name Generation

  • Auto-generate versioned index names: {aliasName}-{timestamp} or {aliasName}-v{N}
  • Index names become ephemeral/internal
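A minimal sketch of the timestamp variant is shown below. `generateIndexName` and `aliasFromIndexName` are hypothetical names. The alias-name check is a deliberately conservative assumption, not a full statement of the cluster's index naming rules.

```typescript
// Hypothetical helper: builds an internal index name from the public alias name.
function generateIndexName(aliasName: string, now: Date = new Date()): string {
  // Conservative assumption: lowercase letters, digits, '.', '_' and '-',
  // starting with a letter or digit.
  if (!/^[a-z0-9][a-z0-9_.-]*$/.test(aliasName)) {
    throw new Error(`Invalid alias name: ${aliasName}`);
  }
  // Millisecond epoch timestamp, matching the kibana-1733856000000 example.
  return `${aliasName}-${now.getTime()}`;
}

// Hypothetical inverse: recovers the alias from a generated index name,
// or returns null if the name does not end in "-<digits>".
function aliasFromIndexName(indexName: string): string | null {
  const match = /^(.+)-(\d+)$/.exec(indexName);
  return match?.[1] ?? null;
}
```

Passing the clock in as a parameter keeps the function deterministic in tests.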

2. Alias Management

  • Create/update aliases as the primary operation
  • All index parameters become alias references
  • Alias becomes the stable identifier
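For the swap itself, Elasticsearch's `_aliases` endpoint accepts a list of `add` and `remove` actions and applies them atomically in a single request. The sketch below only builds that request body. `buildAliasSwapBody` is a hypothetical helper name, and sending the request is left to whichever client the project already uses.

```typescript
type AliasAction =
  | { add: { index: string; alias: string } }
  | { remove: { index: string; alias: string } };

// Hypothetical helper: builds the body for an atomic alias swap.
function buildAliasSwapBody(
  alias: string,
  newIndex: string,
  oldIndices: string[],
): { actions: AliasAction[] } {
  // Detach the alias from every previous index (never from the new one).
  const removals = oldIndices
    .filter((index) => index !== newIndex)
    .map((index): AliasAction => ({ remove: { index, alias } }));

  // Attach the alias to the new index in the same request, so readers
  // never observe a moment where the alias resolves to nothing.
  return { actions: [...removals, { add: { index: newIndex, alias } }] };
}
```

The resulting object would be sent as the body of `POST /_aliases`. On the very first run `oldIndices` is empty, and the body reduces to a single `add` action.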

3. Old Index Cleanup

  • Track previous index versions
  • By default, clean up old indices after reindexing
  • Add --keep-old-indices flag for operators who want to keep old indices temporarily
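One possible way to pick cleanup candidates is sketched below, assuming the timestamp naming scheme ({aliasName}-{timestamp}) and a list of index names already fetched from the cluster. `selectIndicesToDelete` is a hypothetical helper. It only decides what to delete and performs no deletion itself.

```typescript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the old versions of `alias` that can be removed.
function selectIndicesToDelete(
  allIndices: string[],
  alias: string,
  currentIndex: string,
  keepOldIndices: boolean = false,
): string[] {
  // --keep-old-indices: leave every previous version in place.
  if (keepOldIndices) return [];

  // Only names of the exact form "<alias>-<digits>" count as versions of this
  // alias, so unrelated indices that merely share a prefix are never touched.
  const versionPattern = new RegExp(`^${escapeRegExp(alias)}-\\d+$`);
  return allIndices.filter(
    (indexName) => versionPattern.test(indexName) && indexName !== currentIndex,
  );
}
```

Matching on the exact version pattern, rather than a prefix, is a safety choice. On a shared cluster an alias such as `kibana` should not cause indices like `kibana-extra-1` to be deleted.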

4. API Changes

  • npm run index -- repo:kibana → kibana is now an alias, not an index name
  • All internal references switch from index names to aliases
  • Settings index needs to track alias → version mapping

5. MCP Integration

  • list_indices shows aliases, not indices
  • MCP already uses aliases for discovery, so this part works

6. Interface Changes

  • Remove -repo suffix requirement
  • User can pass whatever alias name they want: npm run index -- repo:my-alias
  • Alias name becomes the public-facing identifier
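This issue shows both `repo:kibana` and a bare `kibana` as the command target. A small normalizer could accept either form and yield the alias name. `normalizeAliasName` is a hypothetical helper, and treating the `repo:` prefix as optional is an assumption rather than a confirmed design decision.

```typescript
// Hypothetical helper: maps a CLI target to the public alias name.
function normalizeAliasName(target: string): string {
  const prefix = "repo:";
  // Accept "repo:my-alias" as well as a bare "my-alias".
  const aliasName = target.startsWith(prefix) ? target.slice(prefix.length) : target;
  if (aliasName.length === 0) {
    throw new Error("Alias name must not be empty");
  }
  // No "-repo" suffix is added or required; the name is used as given.
  return aliasName;
}
```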

Implementation Details

Example Flow

User Command:
  npm run index -- kibana

Execution:
  1. Generate index name: `kibana-1733856000000` (timestamp)
  2. Create index: `kibana-1733856000000`
  3. Index all data to: `kibana-1733856000000`
  4. Create/update alias: `kibana` → `kibana-1733856000000`
  5. If old index exists: Delete old index (unless `--keep-old-indices` flag)
  6. MCP discovers via alias: `kibana`
  7. `list_indices` shows: `kibana` (alias), not `kibana-1733856000000` (index)

Zero-Downtime Reindexing Flow (with Locking)

Scenario: Operator wants to reindex with new mappings via Maintenance Job

1. Maintenance Job Starts (`--clean`)
2. Acquires lock: `reindex_in_progress_kibana = true`
3. Current state: `kibana` (alias) → `kibana-1733856000000` (index)
4. Create new index: `kibana-1733856100000` (new timestamp)
5. Reindex all data to: `kibana-1733856100000`
6. (Meanwhile) CronJob wakes up → Sees lock → Skips run
7. When complete, atomically swap: `kibana` → `kibana-1733856100000`
8. Delete old index: `kibana-1733856000000`
9. Release lock: `reindex_in_progress_kibana = false`
10. Users/CronJob continue via `kibana` alias (zero downtime!)
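The sequence above can be sketched end to end against an in-memory stand-in for the cluster. `FakeCluster` and `runCleanReindex` are hypothetical, and every operation is synchronous purely to keep the sketch short. A real client would be asynchronous, and the swap would be a single atomic alias update.

```typescript
// In-memory stand-in for the cluster state (hypothetical, illustration only).
class FakeCluster {
  indices = new Set<string>();
  aliases = new Map<string, string>(); // alias -> concrete index
  locks = new Set<string>();
}

// Returns the new index name, or null if another maintenance run holds the lock.
function runCleanReindex(
  cluster: FakeCluster,
  alias: string,
  timestamp: number,
  indexAll: (indexName: string) => void,
  keepOldIndices: boolean = false,
): string | null {
  const lock = `reindex_in_progress_${alias}`;
  if (cluster.locks.has(lock)) return null;
  cluster.locks.add(lock); // step 2: acquire lock
  try {
    const oldIndex = cluster.aliases.get(alias); // step 3: current state
    const newIndex = `${alias}-${timestamp}`;
    cluster.indices.add(newIndex); // step 4: create new index
    indexAll(newIndex); // step 5: alias still serves the old index
    cluster.aliases.set(alias, newIndex); // step 7: swap
    if (oldIndex !== undefined && oldIndex !== newIndex && !keepOldIndices) {
      cluster.indices.delete(oldIndex); // step 8: delete old index
    }
    return newIndex;
  } finally {
    cluster.locks.delete(lock); // step 9: release lock, even on failure
  }
}
```

Because the alias is repointed only after `indexAll` returns, any read that goes through `kibana` during the reindex still resolves to the previous index.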

Trade-offs

  • Drawback: On every reindex, users will see a new index with a different timestamp in the index_management dashboard
  • Benefit: This is acceptable for the zero-downtime capability it provides
  • Benefit: Users/agents only interact with clean alias names via list_indices

Labels: enhancement (New feature or request)