* Removed reference to unused `queue.checkpoint.interval`.
* Changed 'unread' to 'unACKed'.
* Removed a confusing/contradictory statement about queue acknowledgement and outputs.
* Described garbage collection.
* Described some of the internals (pages, checkpoints).

Fixes #6408
WARNING: This functionality is in beta and is subject to change. Deployment in production is at your own risk.

By default, Logstash uses in-memory bounded queues between pipeline stages
(inputs → pipeline workers) to buffer events. The size of these in-memory
queues is fixed and not configurable. If Logstash experiences a temporary
machine failure, the contents of the in-memory queue will be lost. Temporary machine
failures are scenarios where Logstash or its host machine is terminated
abnormally but is capable of being restarted.

In order to protect against data loss during abnormal termination, Logstash has
a persistent queue feature which will store the message queue on disk.
Persistent queues provide durability of data within Logstash.

Persistent queues are also useful for Logstash deployments that need large buffers.
Instead of deploying and managing a message broker, such as Redis, RabbitMQ, or
Apache Kafka, to facilitate a buffered publish-subscribe model, you can enable
persistent queues to buffer events on disk and remove the message broker.

In summary, the two benefits of enabling persistent queues are as follows:

* Provides protection from in-flight message loss when the Logstash process is abnormally terminated.
* Absorbs bursts of events without needing an external buffering mechanism like Redis or Apache Kafka.
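
For example, the feature is switched on with a single setting in the Logstash settings file (`logstash.yml`); this is a minimal sketch, with every other queue option left at its default:

[source, yaml]
queue.type: persisted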

[[persistent-queues-limitations]]
==== Limitations of Persistent Queues

The following are problems not solved by the persistent queue feature:

* Input plugins that do not use a request-response protocol cannot be protected from data loss. For example: tcp, udp, zeromq push+pull, and many other inputs do not have a mechanism to acknowledge receipt to the sender. Plugins such as beats and http, which *do* have an acknowledgement capability, are well protected by this queue.
* It does not handle permanent machine failures such as disk corruption, disk failure, and machine loss. The data persisted to disk is not replicated.

[[persistent-queues-architecture]]
==== How Persistent Queues Work

The queue sits between the input and filter stages in the same
process:

input → queue → filter + output

When an input has events ready to process, it writes them to the queue. When
the write to the queue is successful, the input can send an acknowledgement to
its data source.

When processing events from the queue, Logstash acknowledges events as
completed, within the queue, only after filters and outputs have completed.
The queue keeps a record of events that have been processed by the pipeline.
An event is recorded as processed (in this document, called "acknowledged" or
"ACKed") if, and only if, the event has been processed completely by the
Logstash pipeline.

What does acknowledged mean? It means the event has been handled by all
configured filters and outputs. For example, if you have only one output,
Elasticsearch, an event is ACKed when the Elasticsearch output has successfully
sent this event to Elasticsearch.

During a normal shutdown (*CTRL+C* or SIGTERM), Logstash will stop reading
from the queue and will finish processing the in-flight events being processed
by the filters and outputs. Upon restart, Logstash will resume processing the
events in the persistent queue as well as accepting new events from inputs.

If Logstash is abnormally terminated, any in-flight events will not have been
ACKed and will be reprocessed by filters and outputs when Logstash is
restarted. Logstash processes events in batches, so it is possible
that for any given batch, some of that batch may have been successfully
completed, but not recorded as ACKed, when an abnormal termination occurs.

For more details on the specific behaviors of queue writes and acknowledgement, see
<<durability-persistent-queues>>.

[[configuring-persistent-queues]]
==== Configuring Persistent Queues

To configure persistent queues, you can specify the following options in the
Logstash settings file (`logstash.yml`):
* `queue.type`: Specify `persisted` to enable persistent queues. By default, persistent queues are disabled (`queue.type: memory`).
* `path.queue`: The directory path where the data files will be stored. By default, the files are stored in `path.data/queue`.
* `queue.page_capacity`: The maximum size of a queue page in bytes. The queue data consists of append-only files called "pages". The default size is 250mb. Changing this value is unlikely to have performance benefits.
// Technically, this isn't the maximum number of events; it's really the maximum number of events not yet read by the pipeline worker. We only use this for testing, and users generally shouldn't be setting this.
* `queue.max_events`: The maximum number of events that are allowed in the queue. The default is 0 (unlimited). This setting is used internally by the Logstash test suite.
* `queue.max_bytes`: The total capacity of the queue in number of bytes. The
default is 1024mb (1gb). Make sure the capacity of your disk drive is greater
than the value you specify here.

If both `queue.max_events` and
`queue.max_bytes` are specified, Logstash uses whichever criteria is reached
first. See <<backpressure-persistent-queues>> for behavior when these queue limits are reached.

You can also specify options that control when the checkpoint file gets updated (`queue.checkpoint.acks`, `queue.checkpoint.writes`). See <<durability-persistent-queues>>.

Example configuration:

[source, yaml]
queue.type: persisted
queue.max_bytes: 4gb
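
A slightly fuller sketch can also set the `path.queue` option described above to place the queue files on a specific disk; the directory shown here is purely illustrative:

[source, yaml]
queue.type: persisted
queue.max_bytes: 4gb
path.queue: /var/lib/logstash/queue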

[[backpressure-persistent-queues]]
==== Handling Back Pressure

When the queue is full, Logstash puts back pressure on the inputs to stall data
flowing into Logstash. This mechanism helps Logstash control the rate of data
flow at the input stage without overwhelming outputs like Elasticsearch.

Use the `queue.max_bytes` setting to configure the total capacity of the queue on
disk. The following example sets the total capacity of the queue to 8gb:

[source, yaml]
queue.type: persisted
queue.max_bytes: 8gb

With these settings specified, Logstash will buffer events on disk until the
size of the queue reaches 8gb. When the queue is full of unACKed events, and
the size limit has been reached, Logstash will no longer accept new events.

Each input handles back pressure independently. For example, when the
beats input encounters back pressure, it no longer accepts new connections. It
waits until the persistent queue has space to accept more events. After the
filter and output stages finish processing existing events in the queue and ACK
them, Logstash automatically starts accepting new events.

[[durability-persistent-queues]]
==== Controlling Durability

Durability is a property of storage writes that ensures data will be available after it's written.

When the persistent queue feature is enabled, Logstash will store events on
disk. Logstash commits to disk in a mechanism called checkpointing.

To discuss durability, we need to introduce a few details about how the persistent queue is implemented.

First, the queue itself is a set of pages. There are two kinds of pages: head pages and tail pages. The head page is where new events are written. There is only one head page. When the head page reaches a certain size (see `queue.page_capacity`), it becomes a tail page, and a new head page is created. Tail pages are immutable, and the head page is append-only.
Second, the queue records details about itself (pages, acknowledgements, etc.) in a separate file called a checkpoint file.

When recording a checkpoint, Logstash will:

* Call fsync on the head page.
* Atomically write to disk the current state of the queue.

The following settings are available to let you tune durability:

* `queue.checkpoint.writes`: Logstash will checkpoint after this many writes into the queue. Currently, one event counts as one write, but this may change in future releases.
* `queue.checkpoint.acks`: Logstash will checkpoint after this many events are acknowledged. This configuration controls durability for the processing (filter + output) part of Logstash.

Disk writes have a resource cost. Tuning the above values higher or lower will trade durability for performance. For instance, if you want the strongest durability for all input events, you can set `queue.checkpoint.writes: 1`.

The process of checkpointing is atomic, which means any update to the file is saved if successful.

If Logstash is terminated, or if there is a hardware level failure, any data
that is buffered in the persistent queue, but not yet checkpointed, is lost.
To avoid this possibility, you can set `queue.checkpoint.writes: 1`, but keep in
mind that this setting can severely impact performance.
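
For example, a maximum-durability sketch checkpoints after every single write, which is exactly the trade-off described above (expect a significant drop in throughput):

[source, yaml]
queue.type: persisted
queue.checkpoint.writes: 1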

[[garbage-collection]]
==== Disk Garbage Collection

On disk, the queue is stored as a set of pages where each page is one file. Each page can be at most `queue.page_capacity` in size. Pages are deleted (garbage collected) after all events in that page have been ACKed. If an older page has at least one event that is not yet ACKed, that entire page will remain on disk until all events in that page are successfully processed. Each page containing unprocessed events will count against the `queue.max_bytes` byte size.
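
As a rough worked example using the documented defaults (numbers are illustrative, not a recommendation):

[source, yaml]
# About four full 250mb pages fit under the 1024mb cap (4 x 250mb = 1000mb).
# A single unACKed event in an old page keeps that entire 250mb page on disk,
# and the page keeps counting against queue.max_bytes until the event is ACKed.
queue.page_capacity: 250mb
queue.max_bytes: 1024mb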