Conversation

@jordansissel (Contributor)

  • Removed reference to unused queue.checkpoint.interval.
  • Changed 'unread' to be 'unACKed'
  • Remove confusing/contradictory statement about queue acknowledgement
    and outputs

shutdown, it's possible to lose queued events.
(inputspipeline workers) to buffer events. The size of these in-memory
queues is fixed and not configurable. If Logstash terminates abnormally,
either as the result of a software crash of `kill -9`, it's
Contributor:

of -> or ?

sent to the <<plugins-inputs-beats,beats>> input. For other inputs, Logstash
only acknowledges delivery of messages in the filter and output stages, and not
all the way back to the input or source.
* Inputs which do not use a request-response protocol cannot be protected from data loss. For example: tcp, udp, zeromq push+pull, and many other inputs do not have a mechanism to acknowledge receipt to the sender. Plugins such as beats and http, which *do* have a acknowledgement capability, are well protected by this queue.
Contributor Author:

Not sure about my wording choice here. Intent was to highlight that protection is only viable for protocols that have a reply mechanism for data. Protocols with no reply mechanism (udp, tcp, etc) would still be queued but we have no way to report back to the sender.

Contributor:

Wording seems clear to me. I would say "Inputs that do not use..." (use "that" for restrictive clauses).

Contributor Author:

❤️

Contributor:

@jordansissel can we include a more comprehensive list of inputs that do ack, e.g. Kafka, RabbitMQ, etc.? I think it's important to note the scenarios where we are resilient end-to-end (excluding catastrophic failures, of course).

Contributor Author:

@acchen97 I can't tell if the Kafka input actually does this. https://github.com/logstash-plugins/logstash-input-kafka/blob/master/lib/logstash/inputs/kafka.rb#L245-L260 shows no indication that I can see that it does do this acking, but it may do it somewhere deeper inside the Kafka client.

Contributor:

@jordansissel I spoke to @talevy and Kafka input does ack.

In this context, I think we can just note Beats, Kafka, RabbitMQ, and HTTP inputs.

Contributor Author:

++ will update.

are processed, Logstash records a checkpoint to track which events are
successfully acknowledged (ACKed) as processed by Logstash. An event is
recorded as ACKed in the checkpoint if the event is successfully sent to the
last output stage in the pipeline;
Contributor:

This should be a period

Contributor Author:

But semicolons are my favorite!

Thanks :)

of these events, and then terminates the Logstash process. Upon restart,
Logstash uses the checkpoint file to pick up where it left off in the persistent
queue and processes the events in the backlog.
Logstash uses last checkpoint to resume processing the events in the backlog.
Contributor:

Uses the last checkpoint

Contributor Author:

Fixing. Thanks :)

persistent queue and sends them to the filter and output stages. As events
are processed, Logstash records a checkpoint to track which events are
successfully acknowledged (ACKed) as processed by Logstash. An event is
recorded as ACKed in the checkpoint if the event is successfully sent to the
Contributor:

«recorded as ACKed in the checkpoint» I would not phrase it like this because we actually don't record acking per event in the checkpoint - we keep track of the first unacked event in a page. I would just remove "in the checkpoint" - specific ACK tracking IS done in-memory during runtime but not in the checkpoint.

Contributor Author:

Good idea. ++

of these events, and then terminates the Logstash process. Upon restart,
Logstash uses the checkpoint file to pick up where it left off in the persistent
queue and processes the events in the backlog.
Logstash uses last checkpoint to resume processing the events in the backlog.
Contributor:

there is not really a "last" checkpoint - there is always one checkpoint per page. Do we need to talk about checkpoints here? Maybe just say that Logstash will resume processing where it left off?

Contributor Author:

Yeah, the doc had 'checkpoint file' in various places, and I mostly patched this to remove the word 'file' (because that's an implementation detail unimportant to most users)

I think in most cases we are trying to convey the concept that a checkpoint can be behind real-time, and that the consequence of this is that an abnormal termination will cause data loss for all data written after the most recent checkpoint ("checkpoint" as an action, not a file).

Contributor Author:

In places where we are describing order of operations, I am wondering if it would be more clear to have a diagram instead of a paragraph of text.

Contributor:

👍

Contributor:

Agreed. Adding a diagram was something I discussed with suyog but didn't have time to do for 5.1.

Contributor @colinsurprenant (Dec 13, 2016):

yes but maybe no. I am on the fence - I am wondering if we should talk about checkpoints at all; it is really an implementation detail and this (user doc) might not be the right place to discuss it. I'd personally prefer that we talk about the higher-level concept of "ensuring persistence", which encompasses both the checkpointing and the fsync'ing. It becomes tricky to start getting into these details. At the same time we do have to explain what the queue.checkpoint.acks and queue.checkpoint.writes options are. 🤔

Contributor Author:

@colinsurprenant I am also with you on maybe not needing to discuss checkpoints in the way this document does today -- I think we should mention it with respect to durability and maybe nowhere else. At present, it's spread around the document and maybe needs to be just in one section.

Contributor:

durability++

* `path.queue`: The directory path where the data files will be stored. By default, the files are stored in `path.data/queue`.
* `queue.page_capacity`: The size of the page data file. The queue data consists of append-only data files separated into pages. The default size is 250mb.
* `queue.max_events`: The maximum number of unread events that are allowed in the queue. The default is 0 (unlimited).
* `queue.max_events`: The maximum number of unACKed events that are allowed in the queue. The default is 0 (unlimited).
Contributor:

no, it really is the max number of unread events.

Contributor Author:

In this doc there is no other mention of the word "unread". If we're going to mention it now, it probably needs explanation. It sounds like there are multiple states (read, unread, acked, unacked?) for an event, and that is news to me; can you help me understand more?

Contributor @colinsurprenant (Dec 13, 2016):

sure, a queue event can be in the following states:

  • unread (which also implies unacked)
  • read & unacked (thus "in flight" in the worker)
  • read & acked

Now, read & acked events only exist because purging is done at the page level: only when all events within a page are acked is the page purged.
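
To anchor the terminology, here is a hedged logstash.yml sketch of the sizing settings quoted at the top of this thread (values are illustrative; the `queue.type` setting is assumed and is not part of this hunk):

```yaml
# logstash.yml -- illustrative sketch only
queue.type: persisted        # assumed setting that enables the on-disk persistent queue
queue.page_capacity: 250mb   # size at which a head page becomes an immutable tail page (default per this doc)
queue.max_events: 0          # cap on unread/unACKed events; 0 means unlimited (mostly useful for testing)
```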

@jordansissel (Contributor Author):

I'm still working on this. Based on feedback and other discussion, I'm going to do some more possibly-larger changes to this doc in this PR. Will update when it's ready for the next round of review.

Thank you all for the feedback so far!

@jordansissel (Contributor Author):

Ok I did another pass at the content in this document.

I've tried to do the following:

  • Focus durability discussion only on the durability section
  • Focus on 'ack' terminology
  • consistent use of phrase "abnormal termination" when discussing that topic
  • Remove "Advantages of Persistent Queue" section since it's just a bullet-point summary of the first section. Moved it into the first section.
  • Defined "acknowledge" for pipeline workers, sorta.
  • Added caveat on queue.max_events as being not intended for user benefit.
  • Added necessary background for durability discussion (pages, checkpoints)
  • Described garbage collection scenario of fully acked pages.

@jordansissel (Contributor Author):

@colin @dedemorton @acchen97 Would love your feedback on the current state.

shutdown, it's possible to lose queued events.
(inputspipeline workers) to buffer events. The size of these in-memory
queues is fixed and not configurable. If Logstash terminates abnormally, the
contents of the in-memory queue will be lost. Such abnormal terminations
Contributor:

The bits don't line up stylistically. Maybe say: "Such abnormal terminations include a software crash, kill -9, or the host machine losing power."

To prevent event loss in these scenarios, you can configure Logstash to use
persistent queues. With persistent queues enabled, Logstash persists buffered
events to disk instead of storing them in memory.
Temporary machine failures are scenarios where Logstash or its host machine are
Contributor:

I'm having a hard time finding the words to describe why I think this paragraph is a bit confusing. I guess it's because you don't get to the description of persistent queues until the end of the paragraph, so the topic sentence is a bit fuzzy, and readers won't know where you're headed until halfway through. How about the following instead? I think it hits all the sweet spots you've covered:

As an alternative to using in-memory queues, you can configure Logstash to use persistent queues. With persistent queues enabled, Logstash stores the queue on disk. This protects you against event loss in the case of temporary machine failures. Temporary machine failures are scenarios where Logstash or its host machine is terminated abnormally but is capable of being restarted.

Contributor Author:

I agree.

I've reorganized these two paragraphs to try to give them a better flow.
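
For readers following along, a minimal sketch of what "enabling persistent queues" looks like in logstash.yml (the `queue.type` setting and the path value are assumptions for illustration; neither appears in the quoted hunk):

```yaml
# logstash.yml -- minimal sketch, assumed settings
queue.type: persisted                 # switch from the fixed in-memory queue to the on-disk queue
path.queue: /var/lib/logstash/queue   # illustrative path; optional, defaults to path.data/queue per this doc
```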

the write to the queue is successful, the input can send an acknowledgement to
its data source.

When processing events from the queue, events are only acknowledged as
Contributor:

Sentence starts with a dangling modifier (a phrase that modifies a word that isn't clearly stated in the sentence). How about:

When processing events from the queue, Logstash acknowledges events as completed, within the queue, only after the filters and outputs have completed.

[changed to "after" because "once" is easily misread/mistranslated as "one time"]

Contributor Author:

++


When processing events from the queue, events are only acknowledged as
completed, within the queue, once the filters and outputs have completed.
The queue keeps a record of events which have been processed by the pipeline.
Contributor:

"keeps a record of events that have been processed..."

"ACKed") if, and only if, the event has been processed completely by the
Logstash pipeline.

What does acknowledged mean? For the common case, this means that it has been
Contributor:

I am unsure what you mean by "common case" here. Do you mean "in most cases"? Also the antecedent for "it" is unclear. Does "it" refer back to the event? If so, how about:

"What does "acknowledged" mean? In most cases, this means that the event has been handled by all configured filters and outputs."

Contributor Author:

++

ensure that all messages in Logstash's queue are durable, you can set
`queue.checkpoint.writes: 1`. However, this setting can severely impact
performance.
Disk writes are not free have a cost. Tuning these values higher or lower will trade durability for performance. For instance, if you want to ensure that all events from inputs queue are durable, you can set
Contributor:

Something is missing or wasn't deleted here: "Disk writes are not free have a cost"

Contributor Author:

++ yeah I didn't clean up my sentence after editing <3


The following settings are available to let you tune durability:

* `queue.checkpoint.writes`: The number of writes from inputs after which a checkpoint is written.
Contributor:

This description is a bit confusing because of the sentence structure. Does this say what you meant to say?

"The number of times an input writes to the queue before a checkpoint is written?"

Contributor Author:

It's confusing because the property of the queue is for all writes, not just from a single input. And internally, it's called a "write" not an "event" which makes this even harder to document.

I really want to say this, "The number of events written to the queue before a checkpoint is written"

But ...

  • It's not the number of events, it's the number of writes. It just happens that, today, one write is one event.
  • and, "before a checkpoint" makes it sound (incorrectly) like we require at least that number of writes before Logstash will write a checkpoint, when really a checkpoint is triggered by whichever of (number of writes, number of ACKs) is reached first.

The following settings are available to let you tune durability:

* `queue.checkpoint.writes`: The number of writes from inputs after which a checkpoint is written.
* `queue.checkpoint.acks`: The number of ACKs to the queue after which a checkpoint is written. This configuration controls the durability at the processing (filter + output) part of Logstash.
Contributor:

Does this work?

"The number of ACKs sent to the queue before a checkpoint is written."

Contributor:

or "the number or events acknowledged (ACK'ed) before a checkpoint is written" ?

ensure that all messages in Logstash's queue are durable, you can set
`queue.checkpoint.writes: 1`. However, this setting can severely impact
performance.
Disk writes are not free have a cost. Tuning these values higher or lower will trade durability for performance. For instance, if you want to ensure that all events from inputs queue are durable, you can set
Contributor:

I think there is a word missing here (or one that should have been deleted): "For instance, if you want to ensure that all events from inputs queue are durable"

`queue.checkpoint.writes: 1`.

* `queue.checkpoint.acks`: The number of ACKs to the queue to trigger an fsync to disk. This configuration controls the durability from the consumer side.
In Logstash 5.0 and 5.1, a `write` to the queue is counted for every event produced by an input.
Contributor:

Maybe would be better for info maintainability to say, "Starting with Logstash 5.0...." (Unless you know this behavior is changing after 5.1).

Contributor Author:

I don't know when this behavior will change, but it's not guaranteed to always be 1:1 and I'm trying (perhaps poorly) to solve this future problem:

  • We change the "one write is one event" in the code
  • but, we don't change the docs
  • and, then a user files a bug about the behavior
  • and, then I get annoyed we didn't update the docs.


To discuss durability, we need to introduce a few details about how the persistent queue is implemented.

First, the queue itself is a set of pages. There are two kinds of pages: head page and tail page. The head page is where new events are written. When this head page is of a certain size (see `queue.page_capacity), the becomes a tail page, and a new head page is created. Tail pages are immutable.
Contributor:

This detail I feel should also be part of a blog post where we talk about PQ implementation architecture and introduce this terminology.

Contributor:

We can probably leave it out of the docs.

Contributor Author:

Pages, I believe, are important, because they matter when Logstash considers which data can be deleted from disk, and I'm trying to document this in enough detail that user questions can be just answered by pointing at the docs.

Contributor:

Sounds good. Typo here "it becomes a tail page"

delivery succeeds at least once. In other words, messages stored in the
persistent queue may be duplicated, but not lost.
* Provides protection from in-flight message loss when the Logstash process is abnormally terminated.
* Absorb bursts of events without needing an external queueing mechanism like Redis or Apache Kafka.
Contributor @acchen97 (Dec 22, 2016):

I think it's valuable for us to still outline the "at-least-once" messaging here. Maybe another point like this:

"Guarantees at-least-once delivery with inputs that support acknowledgements."

Contributor Author @jordansissel (Dec 23, 2016):

In 5.1.1 it's not guaranteed because:

  1. checkpoint.writes: 1 is necessary for such a guarantee and it is not the default.
  2. and, we don't fsync the checkpoint file (a bug fixed by "add explit fsync on checkpoint write" #6430)

@jordansissel (Contributor Author)

jordansissel commented Dec 23, 2016 via email

@acchen97 (Contributor):

@jordansissel thanks for the update, I agree. Let's add back at-least-once after the bug fix and optimistic recovery.

Instead of deploying and managing a message broker, such as Redis, RabbitMQ, or
Apache Kafka, to facilitate a buffered publish-subscriber model, you can enable
persistent queues to buffer events on disk and remove the message broker. This
benefits you by removing one more piece of machinery from your deployment.
Contributor:

I'm not sure I like the wording "removing one more piece of machinery" - it feels like a message broker necessarily involves extra hardware (machinery?) but it might not ...
Anyhow it may just be a language thing so please disregard my comment if ya'll are comfortable with this :D

Contributor Author:

Noted. When I reread this paragraph, I think the last sentence is unnecessary anyway so I will remove it. Thanks <3

one output, Elasticsearch, an event ACKed when the Elasticsearch output has
successfully sent this event to Elasticsearch. In other cases, such as
with the `drop` filter, an event is ACKed when the `drop` filter cancels the
event.
Contributor:

«an event is ACKed when the drop filter cancels the event» this is not true. all events in a batch are acked once the batch is done being processed by the filters+outputs worker, thus a cancelled event will actually be acked at the same time as all the other events in the batch.

Contributor Author:

Agreed.

* `queue.page_capacity`: The size of the page data file. The queue data consists of append-only data files separated into pages. The default size is 250mb.
* `queue.max_events`: The maximum number of unread events that are allowed in the queue. The default is 0 (unlimited).
* `queue.page_capacity`: The maximum size of a queue page in bytes. The queue data consists of append-only files called a "pages". The default size is 250mb. Changing this value is unlikely to have performance benefits.
// Technically, I know, this isn't "maximum number of events" it's really maximum number of events not yet read by the pipeline worker. We only use this for testing and users generally shouldn't be setting this.
Contributor:

«append-only files called a "pages"» -> append-only files called "pages" ?

Contributor Author:

Oops. Good catch.

[[garbage-collection]]
==== Disk Garbage Collection

On disk, the queue is stored as a set of pages where each page is one ile. Each page can be at most `queue.page_capacity` in size. Pages are deleted (garbage collected) once all events in that page have been ACKed. If an older page has at least one event that is not yet ACKed, that entire page will remain on disk until all events in that page are successfully processed. Each page containing unprocessed events will count against the `queue.max_bytes` byte size.
Contributor:

one ile -> one file

Contributor Author:

++
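
As a back-of-the-envelope illustration of how pages count against `queue.max_bytes` (assumed example values):

```yaml
# logstash.yml -- assumed example: roughly 4 pages (1024mb / 250mb) can sit on disk before the queue is full
queue.page_capacity: 250mb
queue.max_bytes: 1024mb   # a page with even one unACKed event keeps counting against this limit until it is fully ACKed and deleted
```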


In order to protect against data loss during abnormal termination, Logstash has
a persistent queue feature which will store the message queue on disk.
Persistent queues provides durability of data within Logstash.
Contributor:

It's not uncommon for feature names to be singular (mainly because they sometimes get trademarked, and trademarks can't be plural), but in this text, I'd say "Persistent queues provide durability" because we use it as plural practically everywhere else.

Contributor Author:

++

a persistent queue feature which will store the message queue on disk.
Persistent queues provides durability of data within Logstash.

Persistent queues are also useful for Logstash deployments that need large buffers
Contributor:

Period is missing after buffers.

Contributor Author:

++

persistent queues to buffer events on disk and remove the message broker. This
benefits you by removing one more piece of machinery from your deployment.

In summary, the two benefits of enabling the persistent queue are as follows:
Contributor:

I'd say "enabling persistent queues" instead of "enabling the persistent queue". Or you could say "enabling the persistent queue feature" but that seems unnecessarily wordy.

Contributor Author:

++

In summary, the two benefits of enabling the persistent queue are as follows:

* Provides protection from in-flight message loss when the Logstash process is abnormally terminated.
* Absorb bursts of events without needing an external buffering mechanism like Redis or Apache Kafka.
Contributor:

"Absorbs bursts of events..." (use plural form to be parallel with first bullet).

Contributor Author:

++

"ACKed") if, and only if, the event has been processed completely by the
Logstash pipeline.

What does acknowledged mean? For most cases, this means that it has been
Contributor:

The antecedent for "it" is unclear here. I'd just say, "For most cases, this means that the event has been handled...."

Contributor Author:

Ahh, great point. Thank you :)


To discuss durability, we need to introduce a few details about how the persistent queue is implemented.

First, the queue itself is a set of pages. There are two kinds of pages: head page and tail page. The head page is where new events are written. When this head page is of a certain size (see `queue.page_capacity), the becomes a tail page, and a new head page is created. Tail pages are immutable, and the head page is append-only..
Contributor:

Change to say: "There are two kinds of pages: head pages and tail pages."

Contributor:

Change to say: "When this head page is of a certain size (see queue.page_capacity) [added backtick]

Contributor:

Also change "the becomes a tail page" to "it becomes a tail page".

Contributor:

Also remove extra period after "append-only."

Contributor Author:

++ to all.


First, the queue itself is a set of pages. There are two kinds of pages: head page and tail page. The head page is where new events are written. When this head page is of a certain size (see `queue.page_capacity), the becomes a tail page, and a new head page is created. Tail pages are immutable, and the head page is append-only..

Second, the queue tracks details about the queue (pages, acknowledgements, etc) in a separate file called a checkpoint file.
Contributor:

Sounds a bit odd to say that the queue tracks details about the queue. How about: "...the queue tracks details about itself..."

Contributor Author:

++


The process of checkpointing is atomic, which means any update to the file is
saved if successful.
* `queue.checkpoint.writes`: Logstash will checkpoint after this many writes into the queue. At this time, one event counts as one write, but this may change in future releases.
Contributor:

Change "At this time" to "Currently" (otherwise, readers get halfway through the sentence before they realize that "at this time" isn't referring back to the time when the checkpoints are taken).

Contributor Author:

Your explanations are delightfully informative. Thank you!

* `queue.checkpoint.acks`: Logstah will checkpoint after this many events are acknowledged. This configuration controls the durability at the processing (filter + output)
part of Logstash.

Disk writes are have a resource cost. Tuning the above values higher or lower will trade durability for performance. For instance, if you want to the strongest durability for all input events, you can set `queue.checkpoint.writes: 1`.
Contributor:

You have too many verbs. :-) "Disk writes have a resource cost..."

Contributor Author:

You're right, I should contract it to "are've", an obvious contraction of "are" and "have".

haha ;)

[[garbage-collection]]
==== Disk Garbage Collection

On disk, the queue is stored as a set of pages where each page is one ile. Each page can be at most `queue.page_capacity` in size. Pages are deleted (garbage collected) once all events in that page have been ACKed. If an older page has at least one event that is not yet ACKed, that entire page will remain on disk until all events in that page are successfully processed. Each page containing unprocessed events will count against the `queue.max_bytes` byte size.
Contributor:

It's easy for "once" to get misread and/or mistranslated as "one time" I'd say: "Pages are deleted (garbage collected) after all events in that page have been ACKed."

Contributor Author:

++

* Removed reference to unused `queue.checkpoint.interval`.
* Changed 'unread' to be 'unACKed'
* Remove confusing/contradictory statement about queue acknowledgement
  and outputs
* Describe garbage collection
* Describe some of the internals (pages, checkpoints).
@jordansissel (Contributor Author):

OK! I think I've resolved all the feedback now.

I've rebased this into a single commit and am going to merge.

Thank you for helping copy edit this important document <3 <3

@elasticsearch-bot

Jordan Sissel merged this into the following branches!

Branch Commits
5.2 7745408
5.x 19dfa51
master cb471f2

elasticsearch-bot pushed a commit that referenced this pull request Jan 4, 2017
* Removed reference to unused `queue.checkpoint.interval`.
* Changed 'unread' to be 'unACKed'
* Remove confusing/contradictory statement about queue acknowledgement
  and outputs
* Describe garbage collection
* Describe some of the internals (pages, checkpoints).

Fixes #6408
elasticsearch-bot pushed a commit that referenced this pull request Jan 4, 2017
* Removed reference to unused `queue.checkpoint.interval`.
* Changed 'unread' to be 'unACKed'
* Remove confusing/contradictory statement about queue acknowledgement
  and outputs
* Describe garbage collection
* Describe some of the internals (pages, checkpoints).

Fixes #6408
@elasticsearch-bot

Jordan Sissel merged this into the following branches!

Branch Commits
5.1 d1b5227

elasticsearch-bot pushed a commit that referenced this pull request Jan 11, 2017
* Removed reference to unused `queue.checkpoint.interval`.
* Changed 'unread' to be 'unACKed'
* Remove confusing/contradictory statement about queue acknowledgement
  and outputs
* Describe garbage collection
* Describe some of the internals (pages, checkpoints).

Fixes #6408
@suyograo suyograo deleted the refactor-pq-docs branch January 24, 2017 06:05