[Feature] Add stain trace for data lineage and performance analysis#10434
Draft
davidzollo wants to merge 1 commit intoapache:devfrom
Draft
[Feature] Add stain trace for data lineage and performance analysis#10434davidzollo wants to merge 1 commit intoapache:devfrom
davidzollo wants to merge 1 commit intoapache:devfrom
Conversation
3 tasks
This PR introduces a comprehensive data tracing system for SeaTunnel Engine
that tracks sampled records through the entire pipeline with minimal overhead.
## Core Features
### Trace Infrastructure
- StainTraceEvent: Event system for trace points
- StainTracePayload: Compact binary payload (STTR protocol v1)
- StainTraceSampler: Deterministic sampling (seq % rate == 0)
- StainTraceBudget: Per-worker per-second budget control
- TaskMappingBuilder: Maps task IDs to human-readable names
### Trace Stages (6 stages)
1. SOURCE_EMIT: Source emits record
2. QUEUE_IN: Queue enqueue complete
3. QUEUE_OUT: Queue dequeue start
4. TRANSFORM_IN: Transform receives record
5. TRANSFORM_OUT: Transform outputs record
6. SINK_WRITE_DONE: Sink write complete
### Framework Integration (Zero connector changes)
- RecordSerializer: Extended for payload transmission (backward compatible)
- SeaTunnelSourceCollector: Create payload, append SOURCE_EMIT
- IntermediateQueue: Append QUEUE_IN/QUEUE_OUT
- TransformFlowLifeCycle: Append TRANSFORM_IN/OUT (handles 1-to-N)
- SinkFlowLifeCycle: Append SINK_WRITE_DONE, report event
### Trace Collector Service
Standalone HTTP service for collecting and querying traces:
- Multi-database support: PostgreSQL, MySQL, ClickHouse
- REST APIs: /ingest, /traces, /trace/{id}, /health, /metrics
- Task mapping cache: Auto-fetch task names from Engine
- Payload decoder: Parse binary payload to timing entries
- Built-in metrics: Ingestion rate, errors, query latency
### Web UI Integration
- New trace visualization page
- Query by trace_id or job_id
- Display timing breakdown per stage
- Visualize bottlenecks
### Configuration
```yaml
seatunnel:
engine:
stain-trace-enabled: true
stain-trace-sample-rate: 100000
stain-trace-max-traces-per-second-per-worker: 50
stain-trace-max-entries-per-trace: 32
```
### Production Safety
- Zero overhead when disabled (single boolean check)
- ~0.1-1% overhead with 1% sampling when enabled
- Strict event volume upper bound (per-worker budget)
- One event per sampled record (no event storm)
- Backward compatible serialization (new reads old)
## Implementation Details
### Binary Payload Protocol
```
MAGIC(4) = 0x53545452 // 'STTR'
VER(2) = 1
TRACE_ID(8)
START_TS_MS(8)
COUNT(2)
ENTRIES:
STAGE(1), TASK_ID(8), TS_MS(8)
```
### Performance Analysis
From trace entries, calculate:
- End-to-end latency: SINK_WRITE_DONE.ts - SOURCE_EMIT.ts
- Queue wait time: QUEUE_OUT.ts - QUEUE_IN.ts
- Transform processing: TRANSFORM_OUT.ts - TRANSFORM_IN.ts
- Sink write time: SINK_WRITE_DONE.ts - TRANSFORM_OUT.ts
Aggregate P95/P99 metrics to identify bottlenecks.
## Testing
- Unit tests: StainTracePayloadTest, StainTraceSamplerTest, RecordSerializerTest
- Integration tests: StainTraceFlowIT, TransformFlowLifeCycleStainTraceTest
- Backward compatibility: Old format → new reader
- Edge cases: 1-to-N (FlatMap), 1-to-0 (Filter), checkpoint recovery
## Documentation
- Quick start guide: seatunnel-trace/STAIN_TRACE_QUICKSTART.md
- Database init scripts: init-mysql.sql, etc.
- Config reference and deployment guide
## Files Changed
82 files changed, 6612 insertions(+), 42 deletions(-)
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR introduces Stain Trace, a comprehensive data lineage and performance tracking system for SeaTunnel. It enables end-to-end tracing of data records through the entire pipeline (Source → Transform → Sink), helping identify performance bottlenecks and analyze data flow.
Key Features
Core Tracing Infrastructure
Trace Stages
SOURCE_READ_DONE: Data read from sourceQUEUE_IN: Data enters intermediate queueTRANSFORM_DONE: Transform processing completeQUEUE_OUT: Data exits queueSINK_WRITE_START: Sink write beginsSINK_WRITE_DONE: Sink write completeTrace Collector Service
A standalone HTTP service for collecting and storing trace data:
Web UI Integration
Database Support
PostgresTraceRepositoryMySqlTraceRepositoryClickHouseTraceRepositoryConfiguration
Enable stain trace in
seatunnel.yaml:Quick Start
Comprehensive setup guide provided in:
seatunnel-trace/STAIN_TRACE_QUICKSTART.mdMigration Notes