Source Coordination in Data Prepper #2412

@graytaylor0

Is your feature request related to a problem? Please describe.
Data Prepper has many push-based sources, such as http and otel_trace_source. For these sources, distributing data across multiple Data Prepper instances is easily solved with a load balancer.

However, pull-based sources have no built-in Data Prepper mechanism for coordinating which instance does which work in a multi-node deployment. For example, pulling data from an OpenSearch cluster with 5 Data Prepper nodes would result in all 5 nodes pulling the entire data set, so the same data is processed 5 times in total.

Describe the solution you'd like
A core Data Prepper solution that lets pull-based sources distribute work across multiple Data Prepper instances, and that tracks the progress of pulled data so that duplicate data is not processed again.

This solution could use a distributed store to coordinate the work and track its progress. The store could be pluggable and configured in data-prepper-config.yaml. Candidate store types include a remote or local file-based DB, Apache ZooKeeper, MySQL, DynamoDB, and others.
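
To make the idea concrete, here is a minimal sketch of what a pluggable coordination store could look like. Everything in it is hypothetical: the interface name `SourceCoordinationStore`, its methods, and the in-memory implementation are illustrative assumptions, not an existing Data Prepper API. A real store (ZooKeeper, MySQL, DynamoDB, and so on) would implement the same contract with its own atomic operations, and would likely also need ownership timeouts so that work held by a failed node can be reclaimed.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical contract for the pluggable store; not an existing Data Prepper interface.
interface SourceCoordinationStore {
    // Atomically claims a partition (for example, one OpenSearch index) for a node.
    // Returns true only for the first node that claims it.
    boolean tryAcquire(String partitionKey, String ownerId);

    // Marks a partition as done. Returns false if the caller does not own it.
    boolean markCompleted(String partitionKey, String ownerId);

    // Reports whether a partition has already been fully processed.
    boolean isCompleted(String partitionKey);
}

// In-memory stand-in for a distributed store, useful only to show the contract.
// It has no ownership timeout, so a partition held by a crashed node stays claimed.
final class InMemoryCoordinationStore implements SourceCoordinationStore {
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();
    private final Set<String> completed = ConcurrentHashMap.newKeySet();

    @Override
    public boolean tryAcquire(String partitionKey, String ownerId) {
        // putIfAbsent is atomic: exactly one caller sees null and wins the claim.
        return owners.putIfAbsent(partitionKey, ownerId) == null;
    }

    @Override
    public boolean markCompleted(String partitionKey, String ownerId) {
        if (!ownerId.equals(owners.get(partitionKey))) {
            return false;
        }
        completed.add(partitionKey);
        return true;
    }

    @Override
    public boolean isCompleted(String partitionKey) {
        return completed.contains(partitionKey);
    }
}

final class SourceCoordinationSketch {
    public static void main(String[] args) {
        SourceCoordinationStore store = new InMemoryCoordinationStore();
        List<String> partitions = List.of("index-a", "index-b", "index-c");
        List<String> nodes = List.of("node-1", "node-2");

        // Every node attempts every partition, but each partition is processed once.
        for (String partition : partitions) {
            for (String node : nodes) {
                if (store.tryAcquire(partition, node)) {
                    System.out.println(node + " processes " + partition);
                    store.markCompleted(partition, node);
                }
            }
        }
    }
}
```

The key design point is that the claim is a single atomic operation in the store, so nodes do not need to talk to each other directly; they only need to agree on the store and on how partitions are named.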

Labels: enhancement (New feature or request)

Status: Done
