# High-Impact Improvements Brainstorm (Effort Ignored)

**Date:** 2026-03-02
**Status:** Active — Expanded and In Review
**Author:** Architecture Review
**Scope:** Ideas prioritized purely by potential upside to product quality, data value, and long-term competitiveness.
**Constraint:** Implementation cost and complexity are intentionally ignored.

---

## Executive Summary

This document captures the highest-impact architectural and product improvements for the Market Data Collector platform, ordered by long-term strategic value rather than implementation cost. It serves as a north-star reference for investment decisions, roadmap planning, and architecture decisions.

**Key findings:**

- The repository has strong foundations in streaming ingestion, multi-provider support, storage tiering, and data quality monitoring, but operates primarily as a _collector_ rather than a trusted _system of record_.
- The five highest-impact investments — deterministic replay, multi-provider consensus, unified serving contract, automated self-healing, and explainable diagnostics — form a compounding stack where each layer multiplies the value of the next.
- Ten further improvements across storage intelligence, compliance, feature engineering, and provider routing would collectively shift the platform from infrastructure tooling to a strategic market-data intelligence platform.

**Document structure:**

- Sections 1–6: Categorized idea inventory
- Section 7: Impact-only prioritization
- Section 8: Deep expansion of the top five ideas
- Section 9: Strategic sequencing rationale
- Sections 10–12: Additional ideas and cross-cutting capabilities
- Section 13: Current state gap analysis and integration map _(new)_
- Section 14: Phased implementation roadmap _(new)_
- Section 15: Risk factors and mitigation strategies _(new)_
- Section 16: Open questions for architecture decisions _(new)_

---

## 1) Build a "truth-grade" market data platform (not just a collector)

### 1.1 Deterministic event ledger with full replay semantics
After the original top five, the next highest-upside additions are:
10. **Policy-driven compliance and provenance evidence framework**

Together, these additions would further shift the repository from a robust collector into a strategic market-data intelligence platform.

---

## 13) Current state gap analysis and integration map

This section maps each major proposal to what already exists in the codebase and what net-new work is required. References point to the primary implementation files.

### 13.1 Deterministic event ledger (idea 1.1 / 8.1)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Immutable append-only storage | `WriteAheadLog` provides durability; `JsonlStorageSink` + `ParquetStorageSink` write events. Sinks are not treated as an immutable ledger. | Need cryptographic hashing per event and a sealed ledger API that forbids mutation. |
| Canonical payload hash | Not computed at ingestion. | Add hash computation in `EventCanonicalizer` / `CanonicalizingPublisher`. |
| Normalization ruleset versioning | `ConditionCodeMapper` and `VenueMicMapper` have no version identifier exposed downstream. | Tag each event with the rule version IDs used during canonicalization. |
| Deterministic replay | `JsonlReplayer` replays files but does not guarantee ordering across providers or bit-for-bit reproducibility. | Add a replay contract enforcing sequence ordering and deterministic late-event handling. |

**Existing leverage points:** `WriteAheadLog`, `JsonlReplayer`, `EventCanonicalizer`, `CanonicalizingPublisher`, `MarketEvent` (extend with `CanonicalHash` and `RulesetVersion` fields).
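
The canonical-hash idea can be illustrated with a short sketch. The codebase is C#, so this Python version is language-neutral pseudillustration only; the function name, field set, and the idea of folding `RulesetVersion` into the hashed payload are assumptions, not the actual `EventCanonicalizer` API:

```python
import hashlib
import json

def canonical_hash(event: dict, ruleset_version: str) -> str:
    """Compute a deterministic hash over the canonicalized payload.

    Sorting keys and fixing separators makes the serialization
    byte-stable, so the same logical event always hashes the same
    regardless of field order. Including the ruleset version ties the
    hash to the normalization rules in force at ingestion time.
    """
    payload = dict(event, ruleset_version=ruleset_version)
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the serialization is byte-stable, two providers reporting the same logical event after canonicalization produce identical hashes, which is what makes cross-provider deduplication and replay verification possible.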

---

### 13.2 Multi-provider consensus engine (idea 1.2 / 8.2)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Multi-provider ingestion | `FailoverAwareMarketDataClient` and `CompositeHistoricalDataProvider` support multiple providers. | These route/fallback; they do not fan-out and compare outputs from multiple providers simultaneously. |
| Disagreement detection | `CrossProviderComparisonService` exists in the data quality layer. | Needs to operate in real time, not just as a batch quality metric. |
| Confidence scoring | `DataQualityMonitoringService` tracks quality metrics but does not produce per-event confidence scores. | Extend to emit confidence annotations on the `MarketEvent` payload. |
| Consensus stream persistence | Not implemented. | Requires a new "consensus truth" sink alongside existing symbol sinks. |

**Existing leverage points:** `DataQualityMonitoringService`, `CrossProviderComparisonService`, `FailoverAwareMarketDataClient`, `EventPipeline`.
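
One plausible shape for per-event confidence scoring is agreement against the cross-provider median. This is a hypothetical sketch (in Python, though the codebase is C#); the tolerance model and the single-source convention are assumptions, not the `CrossProviderComparisonService` design:

```python
from statistics import median

def consensus(values_by_provider: dict[str, float], tolerance: float = 1e-4):
    """Return (consensus_value, confidence) from a simultaneous fan-out.

    Confidence is the fraction of providers whose value agrees with the
    cross-provider median within `tolerance`. A single source yields
    confidence 0.0 by convention: there is no corroboration.
    """
    values = list(values_by_provider.values())
    if len(values) < 2:
        return (values[0] if values else None), 0.0
    mid = median(values)
    agreeing = sum(1 for v in values if abs(v - mid) <= tolerance)
    return mid, agreeing / len(values)
```

The resulting `(value, confidence)` pair is what would be annotated onto the `MarketEvent` payload, while full disagreement detail flows to the quality layer.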

---

### 13.3 Unified historical/live serving contract (idea 3.3 / 8.3)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Live streaming | `IMarketDataClient` provides WebSocket-based live streams. | Contract (event types, field set) differs from historical output. |
| Historical serving | `HistoricalDataQueryService` + `JsonlReplayer` provide file-based replay. | No HTTP query layer with time-travel and composable filters. |
| Schema parity | `MarketEvent` is used for live events; `HistoricalBar` for historical. | Needs a single canonical event contract spanning both modes. |
| Session-boundary stitching | Not implemented. | Corporate actions and session transitions need deterministic handling in the serving layer. |

**Existing leverage points:** `IMarketDataClient`, `HistoricalDataQueryService`, `JsonlReplayer`, `MarketEvent`, `HistoricalBar`, `UiApiRoutes` (for new query endpoints).
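
The schema-parity gap can be made concrete with a sketch of a single contract spanning both modes. This is an illustrative Python dataclass, not the actual C# `MarketEvent`/`HistoricalBar` types; the field names, the `origin` discriminator, and the polymorphic `payload` (in the spirit of the ADR-006 pattern cited later) are all assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class Origin(Enum):
    LIVE = "live"
    HISTORICAL = "historical"

@dataclass(frozen=True)
class CanonicalEvent:
    """One contract for live and historical serving (sketch).

    The same field set is populated in both modes so consumers
    integrate once; `origin` records which path produced the event,
    and `schema_version` supports migration windows.
    """
    symbol: str
    timestamp_ns: int        # exchange timestamp, nanoseconds since epoch
    event_type: str          # e.g. "trade", "quote", "bar"
    payload: dict = field(default_factory=dict)  # polymorphic body
    origin: Origin = Origin.LIVE
    schema_version: int = 1
```

A consumer written against `CanonicalEvent` cannot tell replay from live streaming except by inspecting `origin`, which is exactly the parity property the unified contract is meant to guarantee.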

---

### 13.4 Automated correction and self-healing (idea 2.3 / 8.4)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Gap detection | `GapAnalyzer` in `DataQuality/` detects gaps. | Detection does not automatically trigger remediation. |
| Targeted backfill | `GapBackfillService` can backfill specific windows. | Not connected to quality alerts; no policy-driven auto-trigger. |
| Remediation policies | Not implemented. | Need a declarative policy engine specifying when to auto-remediate, with budget and blast-radius caps. |
| Post-remediation verification | Not implemented. | After backfill, quality score for the repaired window should be re-evaluated and recorded. |

**Existing leverage points:** `GapAnalyzer`, `GapBackfillService`, `HistoricalBackfillService`, `DataQualityMonitoringService`, `BackfillScheduleManager`.
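
The "policy-driven auto-trigger with blast-radius caps" row can be sketched as a small planning function. Everything here is hypothetical (Python sketch of a C# codebase): the policy fields, thresholds, and the worst-first ordering are illustrative choices, not an existing API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemediationPolicy:
    """Declarative auto-remediation policy with blast-radius caps (sketch)."""
    min_quality_score: float     # backfill anything scoring below this
    max_symbols_per_cycle: int   # blast-radius cap per remediation cycle
    max_backfill_hours: int      # windows larger than this need approval

def plan_backfills(policy, quality_scores, gap_hours):
    """Select gaps to auto-backfill, respecting policy caps.

    Worst-scoring symbols are repaired first; any gap wider than the
    window cap is deferred to a human-approval queue instead of being
    remediated automatically.
    """
    candidates = sorted(
        (s for s, q in quality_scores.items() if q < policy.min_quality_score),
        key=lambda s: quality_scores[s],
    )
    auto, needs_approval = [], []
    for symbol in candidates:
        if gap_hours.get(symbol, 0) > policy.max_backfill_hours:
            needs_approval.append(symbol)
        elif len(auto) < policy.max_symbols_per_cycle:
            auto.append(symbol)
    return auto, needs_approval
```

In the proposed design, `auto` would feed `GapBackfillService` directly, while `needs_approval` would surface through the human-approval step discussed in section 15.4.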

---

### 13.5 Explainable diagnostics assistant (idea 5.1 / 8.5)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Diagnostic data collection | `DiagnosticBundleService` collects system state snapshots. | Outputs raw state rather than question-oriented root-cause analysis. |
| Error correlation | `ErrorTracker` and `ErrorRingBuffer` capture errors. | No causality graph linking provider state, queue pressure, and quality events. |
| Operator-facing diagnostics | `DetailedHealthCheck`, `StatusSnapshot`, `StatusWriter`. | Rich data exists but is not surfaced through a "why" query interface. |
| Remediation playbooks | Not implemented. | Need structured playbook storage tied to incident pattern matching. |

**Existing leverage points:** `DiagnosticBundleService`, `ErrorTracker`, `DetailedHealthCheck`, `PreflightChecker`, `SystemHealthChecker`, `DiagnosticsCommands` CLI.
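
The "structured playbook storage tied to incident pattern matching" row could start as simply as subset matching over observed signals. A minimal Python sketch, with the signal names and playbook shape invented for illustration:

```python
def match_playbook(incident: dict, playbooks: list[dict]):
    """Return the first playbook whose pattern is a subset of the incident.

    A pattern matches when every key/value it specifies appears among the
    incident's observed signals; playbooks are checked in priority order,
    so more specific patterns should be listed first.
    """
    for pb in playbooks:
        if all(incident.get(k) == v for k, v in pb["pattern"].items()):
            return pb
    return None
```

A real causality graph would replace flat signals with linked provider-state, queue-pressure, and quality events, but even this flat form turns `ErrorTracker` output into actionable "here is what to do" answers.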

---

### 13.6 Real-time market state intelligence (idea 10.1)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Market state tracking | `TradingCalendar` provides session/holiday awareness. | No per-symbol state machine for halt/reopen/auction transitions. |
| Regime detection | `ProviderDegradationScorer` scores providers. | No volatility/liquidity regime detection for market data itself. |
| Cross-symbol contagion | Not implemented. | Requires symbol-group models (ETF/constituent, sector) with correlation tracking. |

**Existing leverage points:** `TradingCalendar`, `SpreadMonitor`, `ProviderDegradationScorer`, `MarketDepthCollector`.

---

### 13.7 Policy-driven compliance and governance (idea 10.3)

| Dimension | Current state | Gap |
| --- | --- | --- |
| Data retention | `LifecyclePolicyEngine` enforces retention and tier migration. | No jurisdiction-aware or instrument-level policy distinctions. |
| Lineage tracking | `DataLineageService` tracks file-level lineage. | No signed, exportable evidence bundles for audit. |
| Usage governance | Not implemented. | No usage-class tagging or entitlement-aware serving. |

**Existing leverage points:** `LifecyclePolicyEngine`, `DataLineageService`, `PortableDataPackager`, `MetadataTagService`.

---

## 14) Phased implementation roadmap

This roadmap groups proposals into three phases ordered by strategic dependency. Each phase builds the foundation that makes the next phase higher-leverage.

### Phase 1 — Foundation of trust (highest-leverage prerequisites)

**Goal:** Establish the platform as a reliable system of record before layering intelligence on top.

| ID | Proposal | Prerequisite for |
| --- | --- | --- |
| 1.1 / 8.1 | Deterministic event ledger + immutable storage | All quality scoring, audit, replay |
| 1.3 | Time-correctness service | Reliable sequence ordering in ledger |
| 2.2 | Quality contracts per provider | Consensus scoring; self-healing policies |
| 4.3 | End-to-end exactly-once delivery | Ledger integrity; reliable self-healing |

**Expected outcomes:** Any historical output is reproducible. Provider incidents are precisely measurable. Downstream trust in all data products increases.

---

### Phase 2 — Intelligence and quality governance

**Goal:** Add consensus, quality control, and automated remediation on top of the trusted foundation.

| ID | Proposal | Prerequisite for |
| --- | --- | --- |
| 1.2 / 8.2 | Multi-provider consensus engine | Quality control plane; self-healing |
| 2.1 | Data quality control plane | Self-healing policy triggers |
| 2.3 / 8.4 | Automated correction and self-healing | Continuous quality improvement |
| 3.3 / 8.3 | Unified historical/live serving contract | Strategy-facing feature store; consumers |
| 10.6 | Self-optimizing provider routing | Better consensus inputs and failover |

**Expected outcomes:** Most quality incidents self-heal. Strategy consumers integrate once against a stable contract. Routing decisions are evidence-based.

---

### Phase 3 — Operating system and strategic differentiation

**Goal:** Transform the platform from infrastructure into a strategic market-data intelligence surface.

| ID | Proposal | Depends on |
| --- | --- | --- |
| 3.1 | Strategy-facing feature store | Unified serving contract |
| 3.2 | Query engine + semantic API | Unified contract; Phase 1 ledger |
| 5.1 / 8.5 | Explainable diagnostics assistant | Rich telemetry from Phases 1 and 2 |
| 5.2 | Declarative orchestration DSL | Quality contracts; routing policies |
| 10.1 | Real-time market state intelligence | Time-correctness service |
| 10.2 | Unified order-book microstructure reconstruction | Ledger; unified contract |
| 10.3 | Policy-driven compliance and governance | Lineage + ledger from Phase 1 |
| 10.4 | Intelligent storage tiering and query acceleration | Feature store; semantic API |
| 10.7 | Data product packaging and contract marketplace | All prior phases |
| 6.1 | Provider scorecards and procurement analytics | Consensus + routing from Phase 2 |
| 6.3 | Verifiable data lineage and provenance exports | Compliance framework |

**Expected outcomes:** Platform becomes a strategic market-data intelligence system usable by research, strategy, operations, and external consumers.

---

## 15) Risk factors and mitigation strategies

### 15.1 Immutable ledger introduces storage cost growth

**Risk:** Persisting both raw and canonicalized events, plus hashes and lineage metadata, significantly increases storage volume per event.

**Mitigation:**

- Apply the existing tiered storage model: hot ledger in JSONL/Parquet, cold ledger in ZSTD-compressed archival.
- Implement ledger compaction for resolved periods where raw frames are no longer needed for debugging.
- Use the existing `CompressionProfileManager` and `LifecyclePolicyEngine` to manage ledger growth under policy.

---

### 15.2 Consensus engine creates latency overhead on the hot path

**Risk:** Fanning out to multiple providers and computing confidence scores adds latency on the ingestion critical path.

**Mitigation:**

- Run consensus scoring asynchronously off the main `EventPipeline` using a bounded channel (following `PipelinePolicyConstants` presets).
- Provide a configuration option to disable real-time consensus scoring and fall back to batch scoring during high-load sessions.
- Use the existing `BackpressureAlertService` to monitor queue depth and degrade gracefully.
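
The drop-rather-than-block behavior behind these mitigations can be sketched with a bounded queue. This Python sketch stands in for a bounded-channel policy in the C# pipeline; the class name and drop-newest choice are assumptions, not the `EventPipeline` implementation:

```python
import queue

class ConsensusOffloader:
    """Hand events to an async consensus scorer without blocking ingestion.

    When consensus scoring falls behind, new events are dropped (and
    counted) rather than stalling the hot path; the drop counter is the
    signal a backpressure monitor would alert on.
    """
    def __init__(self, capacity: int = 1024):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def offer(self, event) -> bool:
        try:
            self._q.put_nowait(event)   # never blocks the ingestion thread
            return True
        except queue.Full:
            self.dropped += 1           # degrade gracefully under load
            return False
```

Whether to drop newest (as here), drop oldest, or switch to batch scoring under sustained pressure is exactly the kind of preset a channel-policy constant set would encode.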

---

### 15.3 Unified serving contract risks breaking existing downstream consumers

**Risk:** Introducing a new canonical event schema may break existing integrations built against `HistoricalBar` or the current live event structures.

**Mitigation:**

- Follow ADR-006 (polymorphic payload pattern) and ADR-014 (JSON source generators) for schema evolution.
- Version the contract from day one and maintain at least one prior version during migration windows.
- Use the existing `SchemaVersionManager` and `SchemaValidationService` to enforce compatibility gates.

---

### 15.4 Automated self-healing can cause over-correction

**Risk:** Aggressive automated backfill or provider substitution under noisy quality signals can amplify problems rather than resolve them.

**Mitigation:**

- Implement hard budget caps (max symbols per remediation cycle, max backfill window size, max API cost per hour) enforced in the policy engine.
- Require a human-approval step (section 11.4) for any remediation exceeding a configurable blast-radius threshold.
- Track every automated action in an audit trail, reviewing false-positive rates weekly during rollout.
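
The audit-trail mitigation implies a measurable false-positive rate. A hypothetical sketch (Python, illustrative names) of the minimal record that weekly review would need:

```python
from dataclasses import dataclass, field

@dataclass
class RemediationAudit:
    """Append-only record of automated actions for rollout review (sketch)."""
    entries: list = field(default_factory=list)

    def record(self, action: str, symbol: str, was_needed: bool) -> None:
        # `was_needed` is filled in after post-remediation verification:
        # False means the quality signal that triggered the action was noise.
        self.entries.append(
            {"action": action, "symbol": symbol, "was_needed": was_needed}
        )

    def false_positive_rate(self) -> float:
        if not self.entries:
            return 0.0
        false_positives = sum(1 for e in self.entries if not e["was_needed"])
        return false_positives / len(self.entries)
```

A rising false-positive rate is the early-warning sign of over-correction, and is what should gate widening the blast-radius threshold during rollout.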

---

### 15.5 Explainable diagnostics risks exposing sensitive operational data

**Risk:** Root-cause graphs and incident narratives may surface API key fragments, internal system topologies, or provider rate-limit details in logs or API responses.

**Mitigation:**

- Route all diagnostic output through the existing `SensitiveValueMasker` before serialization.
- Require the existing API key middleware (`ApiKeyMiddleware`) for all diagnostics endpoints.
- Separate internal diagnostic depth (full) from external-facing responses (sanitized).
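
The masking step can be illustrated with a small sketch. The patterns below are illustrative examples of the kinds of secrets diagnostics can leak, not the actual `SensitiveValueMasker` rules:

```python
import re

# Illustrative patterns for secrets that commonly leak into diagnostics:
# API keys in key=value form, and bearer tokens in headers.
SECRET_PATTERNS = [
    re.compile(r"(api[_-]?key\s*[=:]\s*)(\S+)", re.IGNORECASE),
    re.compile(r"(bearer\s+)(\S+)", re.IGNORECASE),
]

def mask_sensitive(text: str) -> str:
    """Replace secret values with a fixed placeholder before serialization."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(lambda m: m.group(1) + "****", text)
    return text
```

Routing every diagnostic payload through one such choke point before it reaches a log sink or API response is what makes the internal-vs-external depth split in the last bullet enforceable.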

---

### 15.6 Declarative DSL complexity exceeds its usability benefit

**Risk:** Building a purpose-specific orchestration language is a significant investment that may not be adopted if the learning curve is high.

**Mitigation:**

- Validate the DSL concept first through a structured YAML/JSON configuration schema rather than a custom language.
- Use the existing `ConfigurationPipeline` and `ValidatedConfig` as a foundation for incremental declarative enhancements.
- Adopt the DSL approach only after demonstrating that the configuration complexity it targets is a documented pain point for operators.
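
As a hedged illustration of the first bullet, a remediation policy expressed as structured configuration might look like the following. Every key name here is hypothetical, invented to show the shape, not an existing schema:

```yaml
# Hypothetical remediation policy as structured configuration rather than
# a custom DSL (all field names illustrative, not the actual schema).
remediation:
  triggers:
    - metric: gap_minutes
      threshold: 15
      action: backfill
  limits:
    max_symbols_per_cycle: 25
    max_backfill_window_hours: 24
    max_api_cost_per_hour_usd: 10
  approval:
    blast_radius_symbols: 100   # above this, require human approval
```

If operators can express their real policies in a schema like this, the custom DSL is unnecessary; if they cannot, the specific failures document the pain point the DSL must solve.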

---

## 16) Open questions for architecture decisions

These questions do not yet have agreed answers and should be resolved before detailed design begins on each area.

### 16.1 Ledger identity model

- Should each event have a globally unique ID assigned at ingestion, or should the canonical hash serve as the primary identifier?
- How should late/amended events reference and supersede earlier records without breaking immutability?
- What is the retention policy for raw provider frames once a configurable "forensic window" elapses?

### 16.2 Consensus engine scope

- Should the consensus engine operate on trades/quotes only, or also on aggregated bars and derived features?
- What is the minimum provider overlap (N of M) required before a consensus score is computed vs. marked as single-source?
- Should consensus state be part of the `MarketEvent` schema (inline) or a separate stream consumers opt into?

### 16.3 Unified contract versioning

- What is the schema evolution policy — forward-compatible only, or bidirectional with a migration layer?
- How should corporate action adjustments (splits, dividends) be represented — as event annotations, separate adjustment records, or as adjusted price fields?
- Should the unified contract be expressed as a protobuf schema (for binary efficiency) or kept as JSON (for tooling compatibility)?

### 16.4 Self-healing policy engine design

- Should policies be expressed as code (C# DSL), structured configuration (YAML), or rule-table formats?
- Who owns policy authorship — operators at runtime, or developers at build time with promotion through environments?
- How should conflicting policies (e.g., "backfill gap" vs. "pause due to API budget") be resolved?

### 16.5 Diagnostics assistant interface

- Should the diagnostics assistant be CLI-only, API-only, or accessible through the WPF and web UIs?
- How should the incident memory / playbook corpus be stored, versioned, and shared across deployments?
- Is an LLM-backed interpretation layer in scope, or should all root-cause logic be rule-based and deterministic?

### 16.6 Provider routing and blended sourcing

- What is the unit of routing — per symbol, per feed type, or per time window?
- How should routing decisions interact with the existing `FailoverAwareMarketDataClient` — replace it, extend it, or sit alongside it?
- Should routing scores be influenced by downstream consumer feedback (e.g., strategy performance attribution)?

### 16.7 Data product marketplace

- What is the access control model for data products — shared namespace, team namespaces, or per-consumer entitlement graphs?
- Should the first implementation target internal research consumers only, or should external redistribution be an explicit design goal from the start?
- What is the minimum viable "data product" definition: schema + SLO + coverage window, or does it also require embedded transform and serving logic?