Is your feature request related to a problem? Please describe.
Data Prepper has many push-based sources, such as http, otel_trace_source, etc. Distributing data across multiple instances of Data Prepper is easily solved with a load balancer.
However, pull-based sources in Data Prepper have no internal way to coordinate how work is divided between instances in a multi-node scenario. For example, pulling data from something like an OpenSearch cluster with 5 Data Prepper nodes would result in all 5 nodes pulling the entire dataset and processing it 5 times in total.
Describe the solution you'd like
A core Data Prepper mechanism for pull-based sources that distributes work between multiple instances of Data Prepper, plus a way to track the progress of pulled data so that duplicate data is skipped rather than reprocessed.
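One way such a core mechanism could look is a small coordination interface that sources call to atomically claim units of work and mark them complete. The sketch below is only an illustration of the idea under assumed names (SourceCoordinator, acquirePartition, completePartition are hypothetical, not existing Data Prepper APIs), with an in-memory map standing in for the distributed store:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical coordination API; all names here are assumptions for
// illustration, not Data Prepper's actual interfaces.
interface SourceCoordinator {
    // Atomically claim a unit of work; empty if another node owns it
    // or it has already been completed.
    Optional<String> acquirePartition(String partitionKey, String ownerId);

    // Record progress so other (or restarted) nodes skip finished work.
    void completePartition(String partitionKey);

    boolean isCompleted(String partitionKey);
}

// In-memory stand-in for the pluggable distributed store; in a real
// deployment this would be backed by ZooKeeper, MySQL, DynamoDB, etc.
class InMemorySourceCoordinator implements SourceCoordinator {
    private final Map<String, String> owners = new ConcurrentHashMap<>();
    private final Map<String, Boolean> completed = new ConcurrentHashMap<>();

    @Override
    public Optional<String> acquirePartition(String partitionKey, String ownerId) {
        if (completed.containsKey(partitionKey)) {
            return Optional.empty(); // already processed; skip duplicate work
        }
        // putIfAbsent is the atomic "claim" step; only one node wins.
        String current = owners.putIfAbsent(partitionKey, ownerId);
        return current == null ? Optional.of(partitionKey) : Optional.empty();
    }

    @Override
    public void completePartition(String partitionKey) {
        completed.put(partitionKey, Boolean.TRUE);
        owners.remove(partitionKey);
    }

    @Override
    public boolean isCompleted(String partitionKey) {
        return completed.containsKey(partitionKey);
    }
}
```

With this shape, if 5 nodes each try to acquire the same partition, exactly one succeeds and the other 4 move on to other partitions, avoiding the 5x duplicate processing described above.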
This solution could use a distributed store to coordinate work and track progress. The store could be pluggable and configured in data-prepper-config.yaml; store types could include a remote or local file DB, Apache ZooKeeper, MySQL, DynamoDB, and more.
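As a rough illustration of what the pluggable configuration might look like in data-prepper-config.yaml (every key name below is an assumption, not an existing schema):

```yaml
# Hypothetical sketch only; keys and structure are not an existing
# Data Prepper configuration schema.
source_coordination:
  store:
    dynamodb:
      table_name: data-prepper-source-coordination
      region: us-east-1
```

Swapping the store would then only mean replacing the dynamodb block with, say, a zookeeper or mysql block, without changing any pipeline definitions.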
Describe alternatives you've considered (Optional)
A clear and concise description of any alternative solutions or features you've considered.
Additional context