
[Feature] Add stain trace for data lineage and performance analysis #10434

Draft

davidzollo wants to merge 1 commit into apache:dev from davidzollo:dev

Conversation

davidzollo (Contributor) commented Feb 1, 2026

Summary

This PR introduces Stain Trace, a comprehensive data lineage and performance tracking system for SeaTunnel. It enables end-to-end tracing of data records through the entire pipeline (Source → Transform → Sink), helping identify performance bottlenecks and analyze data flow.

Key Features

Core Tracing Infrastructure

  • StainTraceEvent: Event system for capturing trace points across pipeline stages
  • StainTraceSampler: Configurable sampling mechanism to control tracing overhead
  • StainTracePayload: Compact binary payload format for efficient transmission
  • TaskMappingBuilder: Maps tasks to readable names for trace visualization

Trace Stages

  • SOURCE_READ_DONE: Data read from source
  • QUEUE_IN: Data enters intermediate queue
  • TRANSFORM_DONE: Transform processing complete
  • QUEUE_OUT: Data exits queue
  • SINK_WRITE_START: Sink write begins
  • SINK_WRITE_DONE: Sink write complete

Trace Collector Service

A standalone HTTP service for collecting and storing trace data:

  • Multi-database support: PostgreSQL, MySQL, ClickHouse
  • REST API: Ingest events, query traces, health checks, metrics
  • Task mapping cache: Enriches traces with readable task names
  • Built-in metrics: Track ingestion rate, errors, and performance
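
As a rough sketch of the ingestion path, the snippet below posts an event to the collector's /ingest endpoint (the collector-url shown in the Configuration section below) using Java's built-in HttpClient. The JSON body is a placeholder assumption; this description does not spell out the ingest request schema.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IngestExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSON body; the real ingest schema is defined by the collector service.
        String body = """
                {"traceId": 123456789, "jobId": 42, "payload": "base64-encoded-STTR-bytes"}
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9090/ingest")) // collector-url from seatunnel.yaml
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```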

Web UI Integration

  • New trace visualization page in SeaTunnel Engine UI
  • Query traces by trace_id or job_id
  • Display detailed timing and stage information
  • Identify performance bottlenecks visually

Database Support

| Database   | Status    | Repository Class          |
| ---------- | --------- | ------------------------- |
| PostgreSQL | Supported | PostgresTraceRepository   |
| MySQL      | Supported | MySqlTraceRepository      |
| ClickHouse | Supported | ClickHouseTraceRepository |

Configuration

Enable stain trace in seatunnel.yaml:

```yaml
seatunnel:
  engine:
    server-config:
      stain-trace:
        enabled: true
        sampling-rate: 0.01
        collector-url: "http://localhost:9090/ingest"
```

Quick Start

Comprehensive setup guide provided in:

  • seatunnel-trace/STAIN_TRACE_QUICKSTART.md

Migration Notes

  • Fully backward compatible
  • Stain trace is disabled by default
  • No changes required for existing jobs

This PR introduces a comprehensive data tracing system for SeaTunnel Engine
that tracks sampled records through the entire pipeline with minimal overhead.

## Core Features

### Trace Infrastructure
- StainTraceEvent: Event system for trace points
- StainTracePayload: Compact binary payload (STTR protocol v1)
- StainTraceSampler: Deterministic sampling (seq % rate == 0)
- StainTraceBudget: Per-worker per-second budget control
- TaskMappingBuilder: Maps task IDs to human-readable names
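
As a minimal sketch of the sampling and budget behavior listed above, the class below combines deterministic `seq % rate == 0` sampling with a one-second per-worker budget. It is illustrative only, not the PR's actual StainTraceSampler/StainTraceBudget implementation.

```java
/** Illustrative sketch only; not the PR's actual StainTraceSampler/StainTraceBudget classes. */
public class SamplerSketch {
    private final long sampleRate;            // e.g. 100_000 -> roughly 1 in 100k records
    private final int maxTracesPerSecond;     // per-worker budget
    private long windowStartMs = System.currentTimeMillis();
    private int tracesInWindow = 0;

    public SamplerSketch(long sampleRate, int maxTracesPerSecond) {
        this.sampleRate = sampleRate;
        this.maxTracesPerSecond = maxTracesPerSecond;
    }

    /** Deterministic sampling: a record is a candidate when seq % rate == 0. */
    public synchronized boolean shouldSample(long seq) {
        if (seq % sampleRate != 0) {
            return false;
        }
        long now = System.currentTimeMillis();
        if (now - windowStartMs >= 1000) {    // roll the one-second budget window
            windowStartMs = now;
            tracesInWindow = 0;
        }
        if (tracesInWindow >= maxTracesPerSecond) {
            return false;                     // budget exhausted: hard upper bound on event volume
        }
        tracesInWindow++;
        return true;
    }
}
```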

### Trace Stages (6 stages)
1. SOURCE_EMIT: Source emits record
2. QUEUE_IN: Queue enqueue complete
3. QUEUE_OUT: Queue dequeue start
4. TRANSFORM_IN: Transform receives record
5. TRANSFORM_OUT: Transform outputs record
6. SINK_WRITE_DONE: Sink write complete
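
For illustration, these stages could be modeled as an enum whose one-byte codes feed the STAGE field of the payload protocol shown further down. Only the stage names come from this PR; the byte values are assumptions.

```java
/** Stage byte codes are illustrative assumptions; only the stage names come from this PR. */
public enum StainTraceStage {
    SOURCE_EMIT((byte) 1),
    QUEUE_IN((byte) 2),
    QUEUE_OUT((byte) 3),
    TRANSFORM_IN((byte) 4),
    TRANSFORM_OUT((byte) 5),
    SINK_WRITE_DONE((byte) 6);

    private final byte code;

    StainTraceStage(byte code) {
        this.code = code;
    }

    public byte code() {
        return code;
    }
}
```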

### Framework Integration (Zero connector changes)
- RecordSerializer: Extended for payload transmission (backward compatible)
- SeaTunnelSourceCollector: Create payload, append SOURCE_EMIT
- IntermediateQueue: Append QUEUE_IN/QUEUE_OUT
- TransformFlowLifeCycle: Append TRANSFORM_IN/OUT (handles 1-to-N)
- SinkFlowLifeCycle: Append SINK_WRITE_DONE, report event
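
The serializer change itself is not shown in this description, but one common way to get "new reads old" compatibility is an optional trailing section: the new writer appends the trace payload after the record bytes only for sampled records, and the new reader parses it only when extra bytes are present. The sketch below illustrates that pattern under those assumptions; it is not the actual RecordSerializer code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Illustrative pattern only: an optional trailing payload keeps the new reader able to read old bytes. */
public class OptionalPayloadSerializer {

    public static byte[] serialize(byte[] recordBytes, byte[] tracePayloadOrNull) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(recordBytes.length);
        out.write(recordBytes);
        if (tracePayloadOrNull != null) {       // only sampled records carry a payload
            out.writeInt(tracePayloadOrNull.length);
            out.write(tracePayloadOrNull);
        }
        return bos.toByteArray();
    }

    /** The old format simply stops after the record bytes; the payload section is optional. */
    public static byte[][] deserialize(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        byte[] payload = null;
        if (in.available() >= Integer.BYTES) {  // extra bytes present -> new format with payload
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }
        return new byte[][] {record, payload};
    }
}
```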

### Trace Collector Service
Standalone HTTP service for collecting and querying traces:
- Multi-database support: PostgreSQL, MySQL, ClickHouse
- REST APIs: /ingest, /traces, /trace/{id}, /health, /metrics
- Task mapping cache: Auto-fetch task names from Engine
- Payload decoder: Parse binary payload to timing entries
- Built-in metrics: Ingestion rate, errors, query latency
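
A corresponding query sketch against the /trace/{id} endpoint, again using Java's built-in HttpClient. The host/port and trace id are placeholders, and the response shape is whatever the collector returns.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryTraceExample {
    public static void main(String[] args) throws Exception {
        long traceId = 123456789L;   // placeholder id
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9090/trace/" + traceId)) // /trace/{id} endpoint
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // timing entries decoded by the collector
    }
}
```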

### Web UI Integration
- New trace visualization page
- Query by trace_id or job_id
- Display timing breakdown per stage
- Visualize bottlenecks

### Configuration
```yaml
seatunnel:
  engine:
    stain-trace-enabled: true
    stain-trace-sample-rate: 100000
    stain-trace-max-traces-per-second-per-worker: 50
    stain-trace-max-entries-per-trace: 32
```

### Production Safety
- Zero overhead when disabled (single boolean check)
- ~0.1-1% overhead with 1% sampling when enabled
- Strict event volume upper bound (per-worker budget)
- One event per sampled record (no event storm)
- Backward compatible serialization (new reads old)

## Implementation Details

### Binary Payload Protocol
```
MAGIC(4)  = 0x53545452  // 'STTR'
VER(2)    = 1
TRACE_ID(8)
START_TS_MS(8)
COUNT(2)
ENTRIES:
  STAGE(1), TASK_ID(8), TS_MS(8)
```
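
The layout above maps directly onto a ByteBuffer codec. The sketch below follows the stated field sizes (17 bytes per entry); the big-endian byte order and the Entry record are assumptions for illustration, not the PR's StainTracePayload code.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the STTR v1 layout above; byte order and the Entry holder type are assumptions. */
public class SttrPayloadCodec {

    public record Entry(byte stage, long taskId, long tsMs) {}

    private static final int MAGIC = 0x53545452; // 'STTR'
    private static final short VERSION = 1;

    public static byte[] encode(long traceId, long startTsMs, List<Entry> entries) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 2 + 8 + 8 + 2 + entries.size() * 17);
        buf.putInt(MAGIC);
        buf.putShort(VERSION);
        buf.putLong(traceId);
        buf.putLong(startTsMs);
        buf.putShort((short) entries.size());
        for (Entry e : entries) {
            buf.put(e.stage());
            buf.putLong(e.taskId());
            buf.putLong(e.tsMs());
        }
        return buf.array();
    }

    public static List<Entry> decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.getInt() != MAGIC || buf.getShort() != VERSION) {
            throw new IllegalArgumentException("not an STTR v1 payload");
        }
        buf.getLong();                       // TRACE_ID
        buf.getLong();                       // START_TS_MS
        int count = Short.toUnsignedInt(buf.getShort());
        List<Entry> entries = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            entries.add(new Entry(buf.get(), buf.getLong(), buf.getLong()));
        }
        return entries;
    }
}
```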

### Performance Analysis
From trace entries, calculate:
- End-to-end latency: SINK_WRITE_DONE.ts - SOURCE_EMIT.ts
- Queue wait time: QUEUE_OUT.ts - QUEUE_IN.ts
- Transform processing: TRANSFORM_OUT.ts - TRANSFORM_IN.ts
- Sink write time: SINK_WRITE_DONE.ts - TRANSFORM_OUT.ts

Aggregate P95/P99 metrics to identify bottlenecks.
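
A worked example of that arithmetic on one trace's per-stage timestamps; the Map-based input and the numbers are made up for illustration.

```java
import java.util.Map;

public class LatencyBreakdown {
    public static void main(String[] args) {
        // Timestamps (ms) for one sampled record, keyed by stage name; values are made up.
        Map<String, Long> ts = Map.of(
                "SOURCE_EMIT", 1_000L,
                "QUEUE_IN", 1_002L,
                "QUEUE_OUT", 1_010L,
                "TRANSFORM_IN", 1_010L,
                "TRANSFORM_OUT", 1_015L,
                "SINK_WRITE_DONE", 1_030L);

        System.out.println("end-to-end = " + (ts.get("SINK_WRITE_DONE") - ts.get("SOURCE_EMIT")) + " ms");
        System.out.println("queue wait = " + (ts.get("QUEUE_OUT") - ts.get("QUEUE_IN")) + " ms");
        System.out.println("transform  = " + (ts.get("TRANSFORM_OUT") - ts.get("TRANSFORM_IN")) + " ms");
        System.out.println("sink write = " + (ts.get("SINK_WRITE_DONE") - ts.get("TRANSFORM_OUT")) + " ms");
        // Aggregating these per-stage deltas across many traces and taking P95/P99 highlights bottlenecks.
    }
}
```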

## Testing
- Unit tests: StainTracePayloadTest, StainTraceSamplerTest, RecordSerializerTest
- Integration tests: StainTraceFlowIT, TransformFlowLifeCycleStainTraceTest
- Backward compatibility: Old format → new reader
- Edge cases: 1-to-N (FlatMap), 1-to-0 (Filter), checkpoint recovery

## Documentation
- Quick start guide: seatunnel-trace/STAIN_TRACE_QUICKSTART.md
- Database init scripts: init-mysql.sql, etc.
- Config reference and deployment guide

## Files Changed
82 files changed, 6612 insertions(+), 42 deletions(-)