# High-Impact Improvements Brainstorm (Effort Ignored)

**Date:** 2026-03-02
**Status:** Active — Expanded and In Review
**Author:** Architecture Review
**Scope:** Ideas prioritized purely by potential upside to product quality, data value, and long-term competitiveness.
**Constraint:** Implementation cost and complexity are intentionally ignored.

---

## Executive Summary

This document captures the highest-impact architectural and product improvements for the Market Data Collector platform, ordered by long-term strategic value rather than implementation cost. It serves as a north-star reference for investment decisions, roadmap planning, and architecture decisions.

**Key findings:**

- The repository has strong foundations in streaming ingestion, multi-provider support, storage tiering, and data quality monitoring, but operates primarily as a _collector_ rather than a trusted _system of record_.
- The five highest-impact investments — deterministic replay, multi-provider consensus, unified serving contract, automated self-healing, and explainable diagnostics — form a compounding stack where each layer multiplies the value of the next.
- Ten further improvements across storage intelligence, compliance, feature engineering, and provider routing would collectively shift the platform from infrastructure tooling to a strategic market-data intelligence platform.

**Document structure:**

- Sections 1–6: Categorized idea inventory
- Section 7: Impact-only prioritization
- Section 8: Deep expansion of the top five ideas
- Section 9: Strategic sequencing rationale
- Sections 10–12: Additional ideas and cross-cutting capabilities
- Section 13: Current state gap analysis and integration map _(new)_
- Section 14: Phased implementation roadmap _(new)_
- Section 15: Risk factors and mitigation strategies _(new)_
- Section 16: Open questions for architecture decisions _(new)_

---

## 1) Build a "truth-grade" market data platform (not just a collector)

### 1.1 Deterministic event ledger with full replay semantics
After the original top five, the next highest-upside additions are:
10. **Policy-driven compliance and provenance evidence framework**

Together, these additions would further shift the repository from a robust collector into a strategic market-data intelligence platform.

---

## 13) Current state gap analysis and integration map

This section maps each major proposal to what already exists in the codebase and what net-new work is required. References point to the primary implementation files.

### 13.1 Deterministic event ledger (idea 1.1 / 8.1)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Immutable append-only storage | `WriteAheadLog` provides durability; `JsonlStorageSink` + `ParquetStorageSink` write events. Sinks are not treated as an immutable ledger. | Need cryptographic hashing per event and a sealed ledger API that forbids mutation. |
| Canonical payload hash | Not computed at ingestion. | Add hash computation in `EventCanonicalizer` / `CanonicalizingPublisher`. |
| Normalization ruleset versioning | `ConditionCodeMapper` and `VenueMicMapper` have no version identifier exposed downstream. | Tag each event with the rule version IDs used during canonicalization. |
| Deterministic replay | `JsonlReplayer` replays files but does not guarantee ordering across providers or bit-for-bit reproducibility. | Add a replay contract enforcing sequence ordering and deterministic late-event handling. |

**Existing leverage points:** `WriteAheadLog`, `JsonlReplayer`, `EventCanonicalizer`, `CanonicalizingPublisher`, `MarketEvent` (extend with `CanonicalHash` and `RulesetVersion` fields).
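
The canonical-hash idea can be illustrated with a short sketch. The codebase is C#, so this Python version is language-neutral pseudillustration only; the function name, field set, and the idea of folding `RulesetVersion` into the hashed payload are assumptions, not the actual `EventCanonicalizer` API:

```python
import hashlib
import json

def canonical_hash(event: dict, ruleset_version: str) -> str:
    """Compute a deterministic hash over the canonicalized payload.

    Sorting keys and fixing separators makes the serialization
    byte-stable, so the same logical event always hashes the same
    regardless of field order. Including the ruleset version ties the
    hash to the normalization rules in force at ingestion time.
    """
    payload = dict(event, ruleset_version=ruleset_version)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the serialization is byte-stable, two providers reporting the same logical event after canonicalization produce identical hashes, which is what makes cross-provider deduplication and replay verification possible.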

---

### 13.2 Multi-provider consensus engine (idea 1.2 / 8.2)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Multi-provider ingestion | `FailoverAwareMarketDataClient` and `CompositeHistoricalDataProvider` support multiple providers. | These route/fallback; they do not fan-out and compare outputs from multiple providers simultaneously. |
| Disagreement detection | `CrossProviderComparisonService` exists in the data quality layer. | Needs to operate in real time, not just as a batch quality metric. |
| Confidence scoring | `DataQualityMonitoringService` tracks quality metrics but does not produce per-event confidence scores. | Extend to emit confidence annotations on the `MarketEvent` payload. |
| Consensus stream persistence | Not implemented. | Requires a new "consensus truth" sink alongside existing symbol sinks. |

**Existing leverage points:** `DataQualityMonitoringService`, `CrossProviderComparisonService`, `FailoverAwareMarketDataClient`, `EventPipeline`.
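
One plausible shape for per-event confidence scoring is agreement against the cross-provider median. This is a hypothetical sketch (in Python, though the codebase is C#); the tolerance model and the single-source convention are assumptions, not the `CrossProviderComparisonService` design:

```python
from statistics import median

def consensus(values_by_provider: dict[str, float], tolerance: float = 1e-4):
    """Return (consensus_value, confidence) from a simultaneous fan-out.

    Confidence is the fraction of providers whose value agrees with the
    cross-provider median within `tolerance`. A single source yields
    confidence 0.0 by convention: there is no corroboration.
    """
    values = list(values_by_provider.values())
    if len(values) < 2:
        return (values[0] if values else None), 0.0
    mid = median(values)
    agreeing = sum(1 for v in values if abs(v - mid) <= tolerance)
    return mid, agreeing / len(values)
```

The resulting `(value, confidence)` pair is what would be annotated onto the `MarketEvent` payload, while full disagreement detail flows to the quality layer.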

---

### 13.3 Unified historical/live serving contract (idea 3.3 / 8.3)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Live streaming | `IMarketDataClient` provides WebSocket-based live streams. | Contract (event types, field set) differs from historical output. |
| Historical serving | `HistoricalDataQueryService` + `JsonlReplayer` provide file-based replay. | No HTTP query layer with time-travel and composable filters. |
| Schema parity | `MarketEvent` is used for live events; `HistoricalBar` for historical. | Needs a single canonical event contract spanning both modes. |
| Session-boundary stitching | Not implemented. | Corporate actions and session transitions need deterministic handling in the serving layer. |

**Existing leverage points:** `IMarketDataClient`, `HistoricalDataQueryService`, `JsonlReplayer`, `MarketEvent`, `HistoricalBar`, `UiApiRoutes` (for new query endpoints).
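
The schema-parity gap can be made concrete with a sketch of a single contract spanning both modes. This is an illustrative Python dataclass, not the actual C# `MarketEvent`/`HistoricalBar` types; the field names, the `origin` discriminator, and the polymorphic `payload` (in the spirit of the ADR-006 pattern cited later) are all assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class Origin(Enum):
    LIVE = "live"
    HISTORICAL = "historical"

@dataclass(frozen=True)
class CanonicalEvent:
    """One contract for live and historical serving (sketch).

    The same field set is populated in both modes so consumers
    integrate once; `origin` records which path produced the event,
    and `schema_version` supports migration windows.
    """
    symbol: str
    timestamp_ns: int        # exchange timestamp, nanoseconds since epoch
    event_type: str          # e.g. "trade", "quote", "bar"
    payload: dict = field(default_factory=dict)  # polymorphic body
    origin: Origin = Origin.LIVE
    schema_version: int = 1
```

A consumer written against `CanonicalEvent` cannot tell replay from live streaming except by inspecting `origin`, which is exactly the parity property the unified contract is meant to guarantee.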

---

### 13.4 Automated correction and self-healing (idea 2.3 / 8.4)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Gap detection | `GapAnalyzer` in `DataQuality/` detects gaps. | Detection does not automatically trigger remediation. |
| Targeted backfill | `GapBackfillService` can backfill specific windows. | Not connected to quality alerts; no policy-driven auto-trigger. |
| Remediation policies | Not implemented. | Need a declarative policy engine specifying when to auto-remediate, with budget and blast-radius caps. |
| Post-remediation verification | Not implemented. | After backfill, quality score for the repaired window should be re-evaluated and recorded. |

**Existing leverage points:** `GapAnalyzer`, `GapBackfillService`, `HistoricalBackfillService`, `DataQualityMonitoringService`, `BackfillScheduleManager`.
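
The "policy-driven auto-trigger with blast-radius caps" row can be sketched as a small planning function. Everything here is hypothetical (Python sketch of a C# codebase): the policy fields, thresholds, and the worst-first ordering are illustrative choices, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemediationPolicy:
    """Declarative auto-remediation policy with blast-radius caps (sketch)."""
    min_quality_score: float     # backfill anything scoring below this
    max_symbols_per_cycle: int   # blast-radius cap per remediation cycle
    max_backfill_hours: int      # windows larger than this need approval

def plan_backfills(policy, quality_scores, gap_hours):
    """Select gaps to auto-backfill, respecting policy caps.

    Worst-scoring symbols are repaired first; any gap wider than the
    window cap is deferred to a human-approval queue instead of being
    remediated automatically.
    """
    candidates = sorted(
        (s for s, q in quality_scores.items() if q < policy.min_quality_score),
        key=lambda s: quality_scores[s],
    )
    auto, needs_approval = [], []
    for symbol in candidates:
        if gap_hours.get(symbol, 0) > policy.max_backfill_hours:
            needs_approval.append(symbol)
        elif len(auto) < policy.max_symbols_per_cycle:
            auto.append(symbol)
    return auto, needs_approval
```

In the proposed design, `auto` would feed `GapBackfillService` directly, while `needs_approval` would surface through the human-approval step discussed in section 15.4.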

---

### 13.5 Explainable diagnostics assistant (idea 5.1 / 8.5)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Diagnostic data collection | `DiagnosticBundleService` collects system state snapshots. | Outputs raw state rather than question-oriented root-cause analysis. |
| Error correlation | `ErrorTracker` and `ErrorRingBuffer` capture errors. | No causality graph linking provider state, queue pressure, and quality events. |
| Operator-facing diagnostics | `DetailedHealthCheck`, `StatusSnapshot`, `StatusWriter`. | Rich data exists but is not surfaced through a "why" query interface. |
| Remediation playbooks | Not implemented. | Need structured playbook storage tied to incident pattern matching. |

**Existing leverage points:** `DiagnosticBundleService`, `ErrorTracker`, `DetailedHealthCheck`, `PreflightChecker`, `SystemHealthChecker`, `DiagnosticsCommands` CLI.
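
The "structured playbook storage tied to incident pattern matching" row could start as simply as subset matching over observed signals. A minimal Python sketch, with the signal names and playbook shape invented for illustration:

```python
def match_playbook(incident: dict, playbooks: list[dict]):
    """Return the first playbook whose pattern is a subset of the incident.

    A pattern matches when every key/value it specifies appears among the
    incident's observed signals; playbooks are checked in priority order,
    so more specific patterns should be listed first.
    """
    for pb in playbooks:
        if all(incident.get(k) == v for k, v in pb["pattern"].items()):
            return pb
    return None
```

A real causality graph would replace flat signals with linked provider-state, queue-pressure, and quality events, but even this flat form turns `ErrorTracker` output into actionable "here is what to do" answers.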

---

### 13.6 Real-time market state intelligence (idea 10.1)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Market state tracking | `TradingCalendar` provides session/holiday awareness. | No per-symbol state machine for halt/reopen/auction transitions. |
| Regime detection | `ProviderDegradationScorer` scores providers. | No volatility/liquidity regime detection for market data itself. |
| Cross-symbol contagion | Not implemented. | Requires symbol-group models (ETF/constituent, sector) with correlation tracking. |

**Existing leverage points:** `TradingCalendar`, `SpreadMonitor`, `ProviderDegradationScorer`, `MarketDepthCollector`.

---

### 13.7 Policy-driven compliance and governance (idea 10.3)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Data retention | `LifecyclePolicyEngine` enforces retention and tier migration. | No jurisdiction-aware or instrument-level policy distinctions. |
| Lineage tracking | `DataLineageService` tracks file-level lineage. | No signed, exportable evidence bundles for audit. |
| Usage governance | Not implemented. | No usage-class tagging or entitlement-aware serving. |

**Existing leverage points:** `LifecyclePolicyEngine`, `DataLineageService`, `PortableDataPackager`, `MetadataTagService`.

---

## 14) Phased implementation roadmap

This roadmap groups proposals into three phases ordered by strategic dependency. Each phase builds the foundation that makes the next phase higher-leverage.

### Phase 1 — Foundation of trust (highest-leverage prerequisites)

**Goal:** Establish the platform as a reliable system of record before layering intelligence on top.

| ID | Proposal | Prerequisite for |
| --- | --- | --- |
| 1.1 / 8.1 | Deterministic event ledger + immutable storage | All quality scoring, audit, replay |
| 1.3 | Time-correctness service | Reliable sequence ordering in ledger |
| 2.2 | Quality contracts per provider | Consensus scoring; self-healing policies |
| 4.3 | End-to-end exactly-once delivery | Ledger integrity; reliable self-healing |

**Expected outcomes:** Any historical output is reproducible. Provider incidents are precisely measurable. Downstream trust in all data products increases.

---

### Phase 2 — Intelligence and quality governance

**Goal:** Add consensus, quality control, and automated remediation on top of the trusted foundation.

| ID | Proposal | Prerequisite for |
| --- | --- | --- |
| 1.2 / 8.2 | Multi-provider consensus engine | Quality control plane; self-healing |
| 2.1 | Data quality control plane | Self-healing policy triggers |
| 2.3 / 8.4 | Automated correction and self-healing | Continuous quality improvement |
| 3.3 / 8.3 | Unified historical/live serving contract | Strategy-facing feature store; consumers |
| 10.6 | Self-optimizing provider routing | Better consensus inputs and failover |

**Expected outcomes:** Most quality incidents self-heal. Strategy consumers integrate once against a stable contract. Routing decisions are evidence-based.

---

### Phase 3 — Operating system and strategic differentiation

**Goal:** Transform the platform from infrastructure into a strategic market-data intelligence surface.

| ID | Proposal | Depends on |
| --- | --- | --- |
| 3.1 | Strategy-facing feature store | Unified serving contract |
| 3.2 | Query engine + semantic API | Unified contract; Phase 1 ledger |
| 5.1 / 8.5 | Explainable diagnostics assistant | Rich telemetry from Phases 1 and 2 |
| 5.2 | Declarative orchestration DSL | Quality contracts; routing policies |
| 10.1 | Real-time market state intelligence | Time-correctness service |
| 10.2 | Unified order-book microstructure reconstruction | Ledger; unified contract |
| 10.3 | Policy-driven compliance and governance | Lineage + ledger from Phase 1 |
| 10.4 | Intelligent storage tiering and query acceleration | Feature store; semantic API |
| 10.7 | Data product packaging and contract marketplace | All prior phases |
| 6.1 | Provider scorecards and procurement analytics | Consensus + routing from Phase 2 |
| 6.3 | Verifiable data lineage and provenance exports | Compliance framework |

**Expected outcomes:** Platform becomes a strategic market-data intelligence system usable by research, strategy, operations, and external consumers.

---

## 15) Risk factors and mitigation strategies

### 15.1 Immutable ledger introduces storage cost growth

**Risk:** Persisting both raw and canonicalized events, plus hashes and lineage metadata, significantly increases storage volume per event.

**Mitigation:**

- Apply the existing tiered storage model: hot ledger in JSONL/Parquet, cold ledger in ZSTD-compressed archival.
- Implement ledger compaction for resolved periods where raw frames are no longer needed for debugging.
- Use the existing `CompressionProfileManager` and `LifecyclePolicyEngine` to manage ledger growth under policy.

---

### 15.2 Consensus engine creates latency overhead on the hot path

**Risk:** Fanning out to multiple providers and computing confidence scores adds latency on the ingestion critical path.

**Mitigation:**

- Run consensus scoring asynchronously off the main `EventPipeline` using a bounded channel (following `PipelinePolicyConstants` presets).
- Provide a configuration option to disable real-time consensus scoring and fall back to batch scoring during high-load sessions.
- Use the existing `BackpressureAlertService` to monitor queue depth and degrade gracefully.
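
The drop-rather-than-block behavior behind these mitigations can be sketched with a bounded queue. This Python sketch stands in for a bounded-channel policy in the C# pipeline; the class name and drop-newest choice are assumptions, not the `EventPipeline` implementation:

```python
import queue

class ConsensusOffloader:
    """Hand events to an async consensus scorer without blocking ingestion.

    When consensus scoring falls behind, new events are dropped (and
    counted) rather than stalling the hot path; the drop counter is the
    signal a backpressure monitor would alert on.
    """
    def __init__(self, capacity: int = 1024):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, event) -> bool:
        try:
            self._q.put_nowait(event)   # never blocks the ingestion thread
            return True
        except queue.Full:
            self.dropped += 1           # degrade gracefully under load
            return False
```

Whether to drop newest (as here), drop oldest, or switch to batch scoring under sustained pressure is exactly the kind of preset a channel-policy constant set would encode.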

---

### 15.3 Unified serving contract risks breaking existing downstream consumers

**Risk:** Introducing a new canonical event schema may break existing integrations built against `HistoricalBar` or the current live event structures.

**Mitigation:**

- Follow ADR-006 (polymorphic payload pattern) and ADR-014 (JSON source generators) for schema evolution.
- Version the contract from day one and maintain at least one prior version during migration windows.
- Use the existing `SchemaVersionManager` and `SchemaValidationService` to enforce compatibility gates.

---

### 15.4 Automated self-healing can cause over-correction

**Risk:** Aggressive automated backfill or provider substitution under noisy quality signals can amplify problems rather than resolve them.

**Mitigation:**

- Implement hard budget caps (max symbols per remediation cycle, max backfill window size, max API cost per hour) enforced in the policy engine.
- Require a human-approval step (section 11.4) for any remediation exceeding a configurable blast-radius threshold.
- Track every automated action in an audit trail, reviewing false-positive rates weekly during rollout.
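
The audit-trail mitigation implies a measurable false-positive rate. A hypothetical sketch (Python, illustrative names) of the minimal record that weekly review would need:

```python
from dataclasses import dataclass, field

@dataclass
class RemediationAudit:
    """Append-only record of automated actions for rollout review (sketch)."""
    entries: list = field(default_factory=list)

    def record(self, action: str, symbol: str, was_needed: bool) -> None:
        # `was_needed` is filled in after post-remediation verification:
        # False means the quality signal that triggered the action was noise.
        self.entries.append(
            {"action": action, "symbol": symbol, "was_needed": was_needed}
        )

    def false_positive_rate(self) -> float:
        if not self.entries:
            return 0.0
        false_positives = sum(1 for e in self.entries if not e["was_needed"])
        return false_positives / len(self.entries)
```

A rising false-positive rate is the early-warning sign of over-correction, and is what should gate widening the blast-radius threshold during rollout.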

---

### 15.5 Explainable diagnostics risks exposing sensitive operational data

**Risk:** Root-cause graphs and incident narratives may surface API key fragments, internal system topologies, or provider rate-limit details in logs or API responses.

**Mitigation:**

- Route all diagnostic output through the existing `SensitiveValueMasker` before serialization.
- Require the existing API key middleware (`ApiKeyMiddleware`) for all diagnostics endpoints.
- Separate internal diagnostic depth (full) from external-facing responses (sanitized).
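
The masking step can be illustrated with a small sketch. The patterns below are illustrative examples of the kinds of secrets diagnostics can leak, not the actual `SensitiveValueMasker` rules:

```python
import re

# Illustrative patterns for secrets that commonly leak into diagnostics:
# API keys in key=value form, and bearer tokens in headers.
SECRET_PATTERNS = [
    re.compile(r"(api[_-]?key\s*[=:]\s*)(\S+)", re.IGNORECASE),
    re.compile(r"(bearer\s+)(\S+)", re.IGNORECASE),
]

def mask_sensitive(text: str) -> str:
    """Replace secret values with a fixed placeholder before serialization."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(lambda m: m.group(1) + "****", text)
    return text
```

Routing every diagnostic payload through one such choke point before it reaches a log sink or API response is what makes the internal-vs-external depth split in the last bullet enforceable.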

---

### 15.6 Declarative DSL complexity exceeds its usability benefit

**Risk:** Building a purpose-specific orchestration language is a significant investment that may not be adopted if the learning curve is high.

**Mitigation:**

- Validate the DSL concept first through a structured YAML/JSON configuration schema rather than a custom language.
- Use the existing `ConfigurationPipeline` and `ValidatedConfig` as a foundation for incremental declarative enhancements.
- Adopt the DSL approach only after demonstrating that the configuration complexity it targets is a documented pain point for operators.
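
As a hedged illustration of the first bullet, a remediation policy expressed as structured configuration might look like the following. Every key name here is hypothetical, invented to show the shape, not an existing schema:

```yaml
# Hypothetical remediation policy as structured configuration rather than
# a custom DSL (all field names illustrative, not the actual schema).
remediation:
  triggers:
    - metric: gap_minutes
      threshold: 15
      action: backfill
  limits:
    max_symbols_per_cycle: 25
    max_backfill_window_hours: 24
    max_api_cost_per_hour_usd: 10
  approval:
    blast_radius_symbols: 100   # above this, require human approval
```

If operators can express their real policies in a schema like this, the custom DSL is unnecessary; if they cannot, the specific failures document the pain point the DSL must solve.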

---

## 16) Open questions for architecture decisions

These questions do not yet have agreed answers and should be resolved before detailed design begins on each area.

### 16.1 Ledger identity model

- Should each event have a globally unique ID assigned at ingestion, or should the canonical hash serve as the primary identifier?
- How should late/amended events reference and supersede earlier records without breaking immutability?
- What is the retention policy for raw provider frames once a configurable "forensic window" elapses?

### 16.2 Consensus engine scope

- Should the consensus engine operate on trades/quotes only, or also on aggregated bars and derived features?
- What is the minimum provider overlap (N of M) required before a consensus score is computed vs. marked as single-source?
- Should consensus state be part of the `MarketEvent` schema (inline) or a separate stream consumers opt into?

### 16.3 Unified contract versioning

- What is the schema evolution policy — forward-compatible only, or bidirectional with a migration layer?
- How should corporate action adjustments (splits, dividends) be represented — as event annotations, separate adjustment records, or as adjusted price fields?
- Should the unified contract be expressed as a protobuf schema (for binary efficiency) or kept as JSON (for tooling compatibility)?

### 16.4 Self-healing policy engine design

- Should policies be expressed as code (C# DSL), structured configuration (YAML), or rule-table formats?
- Who owns policy authorship — operators at runtime, or developers at build time with promotion through environments?
- How should conflicting policies (e.g., "backfill gap" vs. "pause due to API budget") be resolved?

### 16.5 Diagnostics assistant interface

- Should the diagnostics assistant be CLI-only, API-only, or accessible through the WPF and web UIs?
- How should the incident memory / playbook corpus be stored, versioned, and shared across deployments?
- Is an LLM-backed interpretation layer in scope, or should all root-cause logic be rule-based and deterministic?

### 16.6 Provider routing and blended sourcing

- What is the unit of routing — per symbol, per feed type, or per time window?
- How should routing decisions interact with the existing `FailoverAwareMarketDataClient` — replace it, extend it, or sit alongside it?
- Should routing scores be influenced by downstream consumer feedback (e.g., strategy performance attribution)?

### 16.7 Data product marketplace

- What is the access control model for data products — shared namespace, team namespaces, or per-consumer entitlement graphs?
- Should the first implementation target internal research consumers only, or should external redistribution be an explicit design goal from the start?
- What is the minimum viable "data product" definition: schema + SLO + coverage window, or does it also require embedded transform and serving logic?