[rsyslog]: Use RELP instead of UDP for forwarding from container to host #18113

Open
saiarcot895 wants to merge 21 commits into sonic-net:master from saiarcot895:rsyslog-use-relp

@saiarcot895 (Contributor) commented Feb 16, 2024

Why I did it

When the host's rsyslog is restarted (for example, to regenerate the config after some changes, or as part of some automated script), there is a chance that some syslog messages from the containers are lost. Most of the time, this isn't an issue. However, test cases that expect all syslogs to be present (such as the advanced-reboot test case) can fail because of it. It can also make issues harder to debug when an rsyslog restart happens in the middle of them.

There are two options for reliable message transport in rsyslog: TCP and RELP. With TCP, the protocol itself knows whether the data has been delivered, but the application doesn't, because the remote application sends no feedback saying that the message was received. Messages can therefore still be lost when the connection breaks (if, for example, the host rsyslog gets restarted): the sender rsyslog (in the container) has no way of knowing which of the messages it already wrote to the connection were actually received.

RELP builds on top of TCP, and adds a feedback mechanism where the remote side notifies the sender whether the message has actually been received or not. This makes it much less likely to lose a message. There is one known possible case where a message (or messages) could be lost: the network is down, and rsyslog gets restarted. This at least requires both the network and rsyslog to have an issue, rather than just one. There is also a slim possibility where a message could get duplicated; this should be mostly fine (hopefully).

RELP does require that both sides are using a recent version of rsyslogd (at least 7.3.16, which looks like it was released more than 10 years ago), but since we use Debian on both the container and the host, it should be fine.

Therefore, switch to using RELP when sending syslog messages from the container to the host. Also, enable a linked-list queue for outgoing messages on both the container rsyslog and the host rsyslog. If sending a log message fails (either because there is no network/route to the destination or because, thanks to RELP, the message is never acknowledged as received), the message stays in that queue and is retried later.
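
As a rough illustration of the forwarding setup described above, a container-side action might look like the following sketch. This is not the exact configuration from this PR: the target address is a placeholder, the port (2514) and queue size (20000) are the values mentioned elsewhere in this conversation, and `action.resumeRetryCount="-1"` (retry indefinitely) is an assumption about how retries would be enabled.

```
# Sketch only; not the exact config in this PR.
# omrelp is the RELP output module (provided by the rsyslog-relp package).
module(load="omrelp")

# Forward everything to the host's rsyslog over RELP.
# - target is a placeholder for the host's address
# - queue.type="LinkedList" gives an in-memory queue allocated on demand
# - queue.size caps the number of queued messages
# - action.resumeRetryCount="-1" is an assumption: retry indefinitely
*.* action(type="omrelp"
           target="192.0.2.1"
           port="2514"
           queue.type="LinkedList"
           queue.size="20000"
           action.resumeRetryCount="-1")
```

With a queue attached to the action, messages that cannot be delivered wait in memory instead of being dropped, up to the configured limit.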

Additionally, since messages that were generated in the past could now be delivered later, change the timestamp recorded in /var/log/syslog to be the time at which the log message was generated (i.e. sent from the original application) rather than the time at which it was received by this rsyslogd instance. This more accurately reflects when an event happened, and with queueing and RELP now involved, the difference could be on the scale of seconds. This does mean that messages in /var/log/syslog may appear out of order at times.
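
The two timestamps correspond to two rsyslog message properties: `timereported` (the timestamp carried in the message itself) and `timegenerated` (when this rsyslogd instance received the message). A minimal sketch of a file template built on the former is shown below; the template name and the format string are illustrative and not taken from this PR.

```
# Illustrative template; the name and layout are assumptions.
# %timereported% is the timestamp from the message itself, while
# %timegenerated% would be the time this rsyslogd received it.
template(name="ExampleFileFormat" type="string"
         string="%timereported:::date-rfc3339% %HOSTNAME% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n")

*.* action(type="omfile" file="/var/log/syslog" template="ExampleFileFormat")
```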

Fixes #17792.

Work item tracking
  • Microsoft ADO (number only): 28314311

How I did it

Modify the rsyslog.conf file on the host and the container to use RELP instead of UDP.
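
On the receiving side, accepting RELP connections takes an input module and a listener. A minimal sketch, assuming the port 2514 mentioned in the review summary (the actual host template in this PR may set additional parameters):

```
# Sketch of a host-side RELP listener; parameters beyond the port are omitted.
module(load="imrelp")
input(type="imrelp" port="2514")
```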

In addition, update the syntax used for the config files to the (newer) RainierScript format, which, among other things, makes it easier to set settings for specific outputs.
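
For readers unfamiliar with the two syntaxes, here is the same UDP listener written both ways. This is a generic illustration and not a line-for-line excerpt from the files changed here.

```
# Legacy syntax ($-prefixed directives):
#   $ModLoad imudp
#   $UDPServerRun 514

# RainierScript equivalent (parameters belong to the object they configure):
module(load="imudp")
input(type="imudp" port="514")
```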

Finally, modify rsyslog.conf to write the timestamp at which the log message was generated, not the one at which it was received. This makes it a bit easier to correlate events, at the cost of making the logs look out of order.

How to verify it

Stop rsyslogd on the host, make sure that the containers generate some syslogs, restart rsyslogd on the host, and verify no logs were lost.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…urst not being defined

$SystemLogRateLimitInterval and $SystemLogRateLimitBurst both come from
the imuxsock module. Specify them as module parameters (and also remove
the legacy syntax).
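
A sketch of what the module-parameter form can look like; the numeric values are placeholders rather than the ones used in this change.

```
# Legacy form:
#   $SystemLogRateLimitInterval 300
#   $SystemLogRateLimitBurst 20000

# Module-parameter form (values are placeholders):
module(load="imuxsock"
       SysSock.RateLimit.Interval="300"
       SysSock.RateLimit.Burst="20000")
```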

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
By default, omrelp on its own doesn't hold on to log messages if the
server happens to be unavailable. This needs to be configured manually.

Configure an in-memory linked-list queue that will store up to 1000
messages (this appears to be the default limit, and it can be raised) if
the server is unavailable. I'm assuming this will be sufficient for most
cases.

Assuming each message is 512 bytes (many of our messages will be smaller
than this), this will take up about 512 kB of additional memory if 1000
messages are queued. If no messages are queued, then no additional
space is taken up.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
saiarcot895 added a commit to saiarcot895/sonic-mgmt that referenced this pull request Feb 26, 2024
If rsyslogd on the host goes down, and rsyslogd on the containers is
configured to use librelp to forward messages to the host rsyslogd
(instead of UDP), then there will be error messages from the container
rsyslogd about not being able to forward messages.

Ignore these error messages as they are expected when running tests
which may restart rsyslogd.

This is in preparation for sonic-net/sonic-buildimage#18113

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895 saiarcot895 marked this pull request as draft June 7, 2024 18:29
@saiarcot895 (Contributor Author)

/azpw run Azure.sonic-buildimage

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

yxieca pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Aug 1, 2024
* Ignore errors about rsyslogd w/ librelp not being able to send syslogs

@saiarcot895 (Contributor Author)

/azpw run Azure.sonic-buildimage

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895 saiarcot895 marked this pull request as ready for review August 7, 2024 16:42
@saiarcot895 saiarcot895 requested a review from prgeor August 7, 2024 16:42
module(load="imklog") # provides kernel logging support
#module(load="immark") # provides --MARK-- message capability

# provides UDP syslog reception
Contributor

@saiarcot895 Is this UDP syslog reception for a remote server?

Contributor Author

Yes, in the case of a remote syslog server sending over UDP.

@saiarcot895 (Contributor Author)

> what about platform/vs/docker-sonic-vs/etc/rsyslog.conf don't need this change?

Strictly speaking, it doesn't need this change, because the logs aren't actually being forwarded anywhere. That config forwards them to localhost port 514, but there likely won't be anything listening on this port. That container doesn't end up on the device.

It would be nice to update that file to the new syntax as well, but I'll keep that separate for now.

arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this pull request Oct 2, 2024
* Ignore errors about rsyslogd w/ librelp not being able to send syslogs

@saiarcot895 (Contributor Author)

/azpw run Azure.sonic-buildimage

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895 saiarcot895 requested a review from prgeor October 14, 2024 21:19
prgeor previously approved these changes Oct 23, 2024
vikshaw-Nokia pushed a commit to vikshaw-Nokia/sonic-mgmt that referenced this pull request Oct 23, 2024
* Ignore errors about rsyslogd w/ librelp not being able to send syslogs

In case rsyslog can't forward messages to the host's rsyslog server,
messages will be queued so that they can be sent out later. For this
queue, set a limit of 20000 messages so that rsyslog doesn't take up too
much memory. Assuming each message is 512 bytes, the maximum additional
memory usage is approximately 10 MB.
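
For reference, the estimate works out as follows (the 512-byte message size is the assumption stated above, and per-message bookkeeping overhead is ignored):

```math
20000 \times 512\ \text{B} = 10{,}240{,}000\ \text{B} \approx 10.2\ \text{MB} \approx 9.8\ \text{MiB}
```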

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@qiluo-msft (Collaborator)

This is kind of a new feature. Is it possible to configure either the new or the old behavior?

@liat-grozovik (Collaborator)

@saiarcot895 is there any plan to get this fix finalized and merged?

Copilot AI review requested due to automatic review settings February 23, 2026 03:36
@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI (Contributor) left a comment

Pull request overview

This PR switches from UDP to RELP (Reliable Event Logging Protocol) for syslog forwarding from containers to the host, addressing issue #17792 where logs could be lost during rsyslog restarts. RELP provides acknowledgment-based delivery, which makes it much less likely that messages are lost when the connection is interrupted. The changes also modernize the rsyslog configurations to use RainierScript syntax (module() and input() directives) instead of legacy $-prefixed directives.

Changes:

  • Migrated container-to-host syslog transport from UDP port 514 to RELP port 2514 with queue-based reliability
  • Updated rsyslog configurations to RainierScript format for better maintainability
  • Added rsyslog-relp package dependencies to host and all active container base images (bookworm, bullseye, trixie)

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| build_debian.sh | Added rsyslog-relp package to host system dependencies |
| dockers/docker-base-bookworm/Dockerfile.j2 | Added rsyslog-relp package installation |
| dockers/docker-base-bookworm/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-base-bullseye/Dockerfile.j2 | Added rsyslog-relp package to backports installation |
| dockers/docker-base-bullseye/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-base-trixie/Dockerfile.j2 | Added rsyslog-relp package installation |
| dockers/docker-base-trixie/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-platform-monitor/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| files/image_config/rsyslog/rsyslog-container.conf.j2 | Template updated to RainierScript format with RELP forwarding and queue configuration |
| files/image_config/rsyslog/rsyslog.conf.j2 | Host config template updated to RainierScript format with RELP reception on port 2514 |
| src/sonic-config-engine/tests/sample_output/py3/rsyslog.conf | Updated expected test output for new RainierScript format with RELP |
| src/sonic-config-engine/tests/sample_output/py3/rsyslog_with_docker0.conf | Updated expected test output for new RainierScript format with RELP |
| src/sonic-config-engine/tests/test_j2files.py | Added diff output to test assertions for better debugging |
| src/sonic-containercfgd/containercfgd/containercfgd.py | Updated regex patterns to parse RainierScript format rate limit settings |

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines failed to run 1 pipeline(s).

@saiarcot895 (Contributor Author)

> This is kind of a new feature. Is it possible to configure either the new or the old behavior?

@qiluo-msft While it could be added as a configurable feature, this would require an application (or daemon) running in each of the containers to determine which method to use (RELP or UDP). The last time such a feature was added (#12490), it had to be reworked to be disabled by default (#17458). Given that there are currently issues seen on cold bootup (#25382) and warm reboot (#17792), I'd like this to be enabled by default for simplicity.

@saiarcot895 (Contributor Author)

/azpw run Azure.sonic-buildimage

@mssonicbld (Collaborator)

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Development

Successfully merging this pull request may close these issues.

[systemd][teamsyncd] missing logs during restarting system logging service
