Summary
After thorough discussion, we need to move from an index-first to an alias-first architecture to support zero-downtime reindexing for shared cluster operators.
Problem Statement
We have two usage modes:
- Shared cluster operator mode (e.g., @simianhacker's cluster, new OBLT shared cluster with @alejandro.colomina)
- Personal cluster operator mode
Shared cluster operators need zero-downtime reindexing. When they update indexer logic or mappings, users shouldn't have to wait for reindexing to complete. Currently, with the index-first approach, reindexing causes downtime.
Current Architecture (Index-First)
- Index name is primary: `kibana` (concrete index)
- Alias is secondary: `kibana-repo` → `kibana` (for MCP discovery)
- `npm run index -- repo:kibana` → creates/uses index `kibana`
- Index names are stable and user-facing
Limitations:
- No zero-downtime reindexing
- Users see actual index names in `list_indices`
- Can't hot-swap indices without downtime
New Architecture (Alias-First)
- Alias is primary: `kibana` (public-facing identifier)
- Index is ephemeral: `kibana-{timestamp}` or `kibana-v{N}` (internal implementation detail)
- `npm run index -- repo:kibana` → creates `kibana-{timestamp}`, indexes to it, sets alias `kibana` → `kibana-{timestamp}`
- All references become alias references
- `list_indices` shows aliases, not indices
Benefits:
- ✅ Zero-downtime reindexing via alias swapping
- ✅ Clean public API (aliases only)
- ✅ Works for both shared and personal cluster operators
- ✅ Hot-swappable indices
- ✅ No `-repo` suffix requirement (users can pass any alias name)
Operational Model (Stateless & Efficient)
To support Stateless CronJobs on immutable infrastructure (Kubernetes/GitOps) while ensuring Resource Efficiency, we separate the operational concerns:
- Steady State (CronJob):
  - Command: `npm run index -- kibana`
  - Role: Incremental updates.
  - Resources: Lightweight (Low CPU/RAM).
  - Behavior: Resolves `kibana` alias to current index, indexes diffs.
- Maintenance State (One-Off Job):
  - Command: `npm run index -- kibana --clean`
  - Role: Full Reindex & Alias Swap.
  - Resources: Heavyweight (High CPU/RAM).
  - Behavior: Creates new index, full index, atomic swap, delete old.
This separation prevents the "Super CronJob" anti-pattern (sizing a cronjob for peak 16h reindex load) and keeps the schedule pure.
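As a rough illustration of the two modes, the sketch below maps the command-line arguments to a run plan. The function `planRun`, the `RunPlan` type, and the mode names are hypothetical; the issue does not show how the real CLI parses its arguments, only that `--clean` selects the full reindex and `--keep-old-indices` retains old indices.

```typescript
// Hypothetical sketch: names and structure are assumptions, not the project's CLI code.
type RunMode = "incremental" | "full-reindex";

interface RunPlan {
  alias: string;           // public-facing alias, e.g. "kibana"
  mode: RunMode;           // steady-state CronJob vs. one-off maintenance job
  keepOldIndices: boolean; // only meaningful for a full reindex
}

function planRun(args: string[]): RunPlan {
  const flags = args.filter((arg) => arg.indexOf("--") === 0);
  const positional = args.filter((arg) => arg.indexOf("--") !== 0);
  if (positional.length !== 1) {
    throw new Error("Expected exactly one alias name");
  }
  const clean = flags.indexOf("--clean") !== -1;
  return {
    alias: positional[0],
    // Without --clean the run stays lightweight and only indexes diffs.
    mode: clean ? "full-reindex" : "incremental",
    keepOldIndices: flags.indexOf("--keep-old-indices") !== -1,
  };
}
```

Under these assumptions, `planRun(["kibana"])` yields an incremental plan, while `planRun(["kibana", "--clean"])` yields a full reindex that the heavier one-off job would execute.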
Concurrency Control (GitOps Compatible)
To avoid infrastructure mutation (e.g. kubectl patch suspend which causes GitOps drift) and overlapping runs:
- Repository-Scoped Locking: The application uses a lock in the settings index (e.g. `reindex_in_progress_kibana`).
- Maintenance Job: Acquires lock → Reindexes → Releases lock.
- CronJob: Checks lock on startup. If locked, skips run (exits 0).
This allows the CronJob to remain "running" (according to K8s) but "paused" (logically) during maintenance, satisfying immutable infrastructure constraints.
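The locking protocol can be sketched with an in-memory stand-in for the settings index. Only the key format `reindex_in_progress_{alias}` comes from this proposal; the `SettingsStore` class and the helper names are assumptions made for illustration.

```typescript
// In-memory stand-in for the settings index (assumption: the real store
// would read and write an Elasticsearch document instead).
class SettingsStore {
  private flags: Record<string, boolean> = {};
  get(key: string): boolean {
    return this.flags[key] === true;
  }
  set(key: string, value: boolean): void {
    this.flags[key] = value;
  }
}

// Lock keys are scoped per repository/alias, as described above.
function lockKey(alias: string): string {
  return `reindex_in_progress_${alias}`;
}

// Maintenance job: returns false if another reindex already holds the lock.
function tryAcquireLock(store: SettingsStore, alias: string): boolean {
  const key = lockKey(alias);
  if (store.get(key)) return false;
  store.set(key, true);
  return true;
}

function releaseLock(store: SettingsStore, alias: string): void {
  store.set(lockKey(alias), false);
}

// CronJob: checks the lock on startup and skips the run (exit code 0) if set.
function cronShouldRun(store: SettingsStore, alias: string): boolean {
  return !store.get(lockKey(alias));
}
```

The check-then-set in `tryAcquireLock` is only safe in this single-process sketch. Against a real cluster the acquire step would need to be atomic; in Elasticsearch, indexing the lock document with `op_type=create` is one way to get that, since the request fails if the document already exists.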
Required Changes
1. Index Name Generation
- Auto-generate versioned index names: `{aliasName}-{timestamp}` or `{aliasName}-v{N}`
- Index names become ephemeral/internal
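A minimal sketch of the timestamp variant follows. The `{aliasName}-{timestamp}` shape is taken from this proposal; the function names, the injectable clock, and the simplified name validation are assumptions.

```typescript
// Hypothetical helper: builds an ephemeral index name from an alias name.
// `now` is injectable so the result is deterministic in tests.
function generateIndexName(aliasName: string, now: number = Date.now()): string {
  // Simplified check (assumption): lowercase letters, digits, '.', '_' and '-',
  // starting with a letter or digit. Elasticsearch's real rules are broader.
  if (!/^[a-z0-9][a-z0-9._-]*$/.test(aliasName)) {
    throw new Error(`Invalid alias name: ${aliasName}`);
  }
  return `${aliasName}-${now}`;
}

// Recovers the alias and timestamp from a generated name, or returns null
// if the name does not end in a 13-digit millisecond timestamp.
function parseIndexName(
  indexName: string
): { alias: string; timestamp: number } | null {
  const match = /^(.+)-(\d{13})$/.exec(indexName);
  if (match === null) return null;
  return { alias: match[1], timestamp: Number(match[2]) };
}
```

For example, `generateIndexName("kibana", 1733856000000)` returns `kibana-1733856000000`, and `parseIndexName` maps that name back to the alias `kibana`.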
2. Alias Management
- Create/update aliases as the primary operation
- All `index` parameters become alias references
- Alias becomes the stable identifier
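Elasticsearch applies all actions in a single `POST /_aliases` request atomically, which is what makes the swap safe for readers. The sketch below only builds that request body; `buildAliasSwap` is a hypothetical helper, and sending the body to the cluster is left out.

```typescript
// Shape of the actions accepted by the Elasticsearch `_aliases` endpoint
// (only the two action types needed for a swap are modelled here).
type AliasAction =
  | { add: { index: string; alias: string } }
  | { remove: { index: string; alias: string } };

// Hypothetical helper: detaches the alias from every old index and attaches
// it to the new one in a single request body.
function buildAliasSwap(
  alias: string,
  newIndex: string,
  oldIndices: string[]
): { actions: AliasAction[] } {
  const actions: AliasAction[] = oldIndices
    .filter((index) => index !== newIndex)
    .map((index): AliasAction => ({ remove: { index, alias } }));
  actions.push({ add: { index: newIndex, alias } });
  return { actions };
}
```

On a first run `oldIndices` is empty, so the body contains a single `add` action and the same helper covers both alias creation and alias update.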
3. Old Index Cleanup
- Track previous index versions
- By default, clean up old indices after reindexing
- Add `--keep-old-indices` flag for operators who want to keep old indices temporarily
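One way to decide which indices to delete is sketched below, assuming the `{aliasName}-{timestamp}` naming scheme and that the list of existing index names is already available. `selectIndicesToDelete` and `escapeRegExp` are hypothetical helpers.

```typescript
// Escapes characters that have a special meaning in regular expressions,
// so alias names containing '.' are matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the previous versions of an alias's index
// that are safe to delete after the alias points at `currentIndex`.
function selectIndicesToDelete(
  alias: string,
  allIndices: string[],
  currentIndex: string,
  keepOldIndices: boolean
): string[] {
  if (keepOldIndices) return []; // corresponds to --keep-old-indices
  // Only names of the exact form `{alias}-{digits}` belong to this alias.
  const pattern = new RegExp(`^${escapeRegExp(alias)}-\\d+$`);
  return allIndices.filter(
    (index) => pattern.test(index) && index !== currentIndex
  );
}
```

Matching on the full name pattern keeps the cleanup from touching indices of other aliases that merely share a prefix, such as `kibana-docs-1733856000000` when cleaning up `kibana`.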
4. API Changes
- `npm run index -- repo:kibana` → `kibana` is now an alias, not an index name
- All internal references switch from index names to aliases
- Settings index needs to track alias → version mapping
5. MCP Integration
- `list_indices` shows aliases, not indices
- MCP already uses aliases for discovery, so this part works
6. Interface Changes
- Remove `-repo` suffix requirement
- User can pass whatever alias name they want: `npm run index -- repo:my-alias`
- Alias name becomes the public-facing identifier
Implementation Details
Example Flow
User Command:
npm run index -- kibana
Execution:
1. Generate index name: `kibana-1733856000000` (timestamp)
2. Create index: `kibana-1733856000000`
3. Index all data to: `kibana-1733856000000`
4. Create/update alias: `kibana` → `kibana-1733856000000`
5. If old index exists: Delete old index (unless the `--keep-old-indices` flag is set)
6. MCP discovers via alias: `kibana`
7. `list_indices` shows: `kibana` (alias), not `kibana-1733856000000` (index)
Zero-Downtime Reindexing Flow (with Locking)
Scenario: Operator wants to reindex with new mappings via Maintenance Job
1. Maintenance Job Starts (`--clean`)
2. Acquires lock: `reindex_in_progress_kibana = true`
3. Current state: `kibana` (alias) → `kibana-1733856000000` (index)
4. Create new index: `kibana-1733856100000` (new timestamp)
5. Reindex all data to: `kibana-1733856100000`
6. (Meanwhile) CronJob wakes up → Sees lock → Skips run
7. When complete, atomically swap: `kibana` → `kibana-1733856100000`
8. Delete old index: `kibana-1733856000000`
9. Release lock: `reindex_in_progress_kibana = false`
10. Users/CronJob continue via `kibana` alias (zero downtime!)
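The flow above can be simulated against an in-memory model to check that the alias resolves to a live index at every step. The `Cluster` shape and both functions are assumptions for illustration; the actual reindexing work is reduced to creating the new index.

```typescript
// In-memory model of the cluster state relevant to the flow (assumption).
interface Cluster {
  indices: string[];
  aliases: Record<string, string>; // alias name -> concrete index name
  locks: Record<string, boolean>;  // stand-in for the settings index
}

// Throws if the alias is missing or points at a deleted index.
function resolveAlias(cluster: Cluster, alias: string): string {
  const index = cluster.aliases[alias];
  if (index === undefined || cluster.indices.indexOf(index) === -1) {
    throw new Error(`Alias ${alias} does not resolve to a live index`);
  }
  return index;
}

// Runs the maintenance flow and records what a reader would see at each step.
function maintenanceReindex(cluster: Cluster, alias: string, now: number): string[] {
  const seen: string[] = [];
  const lock = `reindex_in_progress_${alias}`;
  if (cluster.locks[lock]) throw new Error("Reindex already in progress");
  cluster.locks[lock] = true;                       // step 2: acquire lock
  try {
    const oldIndex = resolveAlias(cluster, alias);  // step 3: current state
    const newIndex = `${alias}-${now}`;
    if (cluster.indices.indexOf(newIndex) !== -1) {
      throw new Error(`Index ${newIndex} already exists`);
    }
    cluster.indices.push(newIndex);                 // steps 4-5: create and fill
    seen.push(resolveAlias(cluster, alias));        // readers still hit the old index
    cluster.aliases[alias] = newIndex;              // step 7: swap
    seen.push(resolveAlias(cluster, alias));
    cluster.indices = cluster.indices.filter((i) => i !== oldIndex); // step 8
    seen.push(resolveAlias(cluster, alias));        // alias survives the delete
  } finally {
    cluster.locks[lock] = false;                    // step 9: release lock
  }
  return seen;
}
```

Because the old index is deleted only after the alias has moved, `resolveAlias` never throws during the run, which is the zero-downtime property the flow relies on.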
Trade-offs
- Drawback: On every reindex, users will see a new index with a different timestamp in the index_management dashboard
- Benefit: This is acceptable for the zero-downtime capability it provides
- Benefit: Users/agents only interact with clean alias names via `list_indices`
Related
- Supersedes #130 (Automatically create -repo alias when indexing repositories) and #131 (feat: automatically create -repo alias when indexing (with index name normalization))
- Enables zero-downtime reindexing for shared cluster operators
- Complements MCP server dual discovery strategy