[rsyslog]: Use RELP instead of UDP for forwarding from container to host #18113
saiarcot895 wants to merge 21 commits into sonic-net:master
When the host's rsyslog is restarted (for example, to regenerate the config after some changes, or as part of some automated script), there is a chance that some syslog messages from the containers are lost. Most of the time, this isn't an issue. However, if there are test cases that expect all syslogs to be present (such as the advanced-reboot test case), then this can cause a problem. Additionally, this could affect debuggability of issues where an rsyslog restart happens in the middle.

There are two options for reliable message transport in rsyslog: TCP and RELP. With TCP, while the protocol knows whether a syslog message has been delivered or not, the application doesn't know, because there is no feedback from the remote side saying the message was received. This means that there is still a chance that messages could be lost when the connection is broken (if, for example, the host rsyslog gets restarted), because after the connection is established, the sender rsyslog (in the container) doesn't know if the message has been received or not.

RELP instead adds a feedback mechanism where the remote side notifies the sender whether the message has actually been received or not. This makes it much less likely to lose a message. There is one known possible case where a message (or messages) could be lost: the network is down, and rsyslog gets restarted. This at least requires both the network and rsyslog to have an issue, rather than just one. There is also a slim possibility where a message could get duplicated; this should be mostly fine (hopefully).

RELP does require that both sides are using a recent version of rsyslogd (at least 7.3.16, which looks like it was released more than 10 years ago), but since we use Debian on both the container and the host, it should be fine.

Therefore, switch to using RELP when sending syslog messages from the container to the host.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…urst not being defined
$SystemLogRateLimitInterval and $SystemLogRateLimitBurst both come from the imuxsock module. Specify them as module parameters (and also remove the legacy syntax). Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
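As a rough illustration of the commit above, the legacy rate-limit directives map onto imuxsock module parameters along these lines. This is a sketch, not the exact text of the PR: the interval and burst values are placeholders chosen for the example.

```
# Legacy syntax being replaced (values are hypothetical):
#   $SystemLogRateLimitInterval 300
#   $SystemLogRateLimitBurst 20000
#
# RainierScript equivalent, set as parameters of the module that owns them:
module(load="imuxsock"
       SysSock.RateLimit.Interval="300"
       SysSock.RateLimit.Burst="20000")
```

Setting the values on the module() statement ties them to imuxsock explicitly, which avoids the "not being defined" problem when the directive is evaluated before the module is loaded.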
By default, just using omrelp doesn't hold log messages if the server happens to be unavailable. This needs to be configured manually. Configure an in-memory linked-list queue that by default will store up to 1000 messages (this appears to be a default value that can be bumped up) if the server is unavailable. I'm assuming this will be sufficient for most cases. Assuming each message is 512 bytes (many of our messages will be smaller than this), this will take up an additional 512 kB of memory if 1000 messages are queued. If there are no messages queued, then no additional space is taken up. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
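A minimal sketch of what such a forwarding action could look like on the container side. The target address is a hypothetical placeholder, and the retry setting is an assumption added for illustration rather than something stated in the commit message.

```
# Container side: forward all messages to the host over RELP.
# target is a hypothetical host address; substitute the real one.
# queue.type="LinkedList" gives the action its own in-memory queue whose
# entries are allocated on demand, so an empty queue costs almost nothing.
# action.resumeRetryCount="-1" (assumption) retries indefinitely instead of
# discarding messages after a failed delivery attempt.
module(load="omrelp")

*.* action(type="omrelp"
           target="192.0.2.1"
           port="2514"
           queue.type="LinkedList"
           action.resumeRetryCount="-1")
```

With no queue.size given, the queue falls back to its default capacity, which is the 1000-message figure mentioned above.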
If rsyslogd on the host goes down, and rsyslogd on the containers is configured to use librelp to forward messages to the host rsyslogd (instead of UDP), then there will be error messages from the container rsyslogd about not being able to forward messages. Ignore these error messages as they are expected when running tests which may restart rsyslogd. This is in preparation for sonic-net/sonic-buildimage#18113 Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
module(load="imklog") # provides kernel logging support
#module(load="immark") # provides --MARK-- message capability

# provides UDP syslog reception
@saiarcot895 Is this UDP syslog reception for a remote server?
Yes, in the case of a remote syslog server sending over UDP.
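For context, UDP reception of this kind is typically expressed in RainierScript as shown below. This is a generic sketch rather than the exact lines from the diff; port 514 is the conventional syslog UDP port.

```
# Keep accepting syslog over UDP from remote senders.
module(load="imudp")
input(type="imudp" port="514")
```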
Strictly speaking, that container doesn't need this change, because its logs aren't actually being forwarded anywhere. It'll forward them to localhost port 514, but there likely won't be anything listening on this port. That container doesn't end up on the device. It would be nice to update the config there to use the new syntax, but I'll keep that separate for now.
In case rsyslog can't forward messages to the host's rsyslog server, messages will be queued so that they can be sent out later. For this queue, set a limit of 20000 messages so that rsyslog doesn't take too much memory. Assuming each message is 512 bytes, the approximate maximum additional memory usage is 10MB. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
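The cap described above could be expressed roughly as follows. The target address is a hypothetical placeholder; the queue size and the 512-byte message estimate come from the commit message.

```
# Upper bound on messages held while the host's rsyslog is unreachable.
# Rough worst case: 20000 messages * 512 bytes = 10,240,000 bytes (about 10 MB).
# target is a hypothetical host address.
*.* action(type="omrelp"
           target="192.0.2.1"
           port="2514"
           queue.type="LinkedList"
           queue.size="20000")
```

Because a linked-list queue allocates entries on demand, this is a ceiling on memory use rather than a fixed reservation.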
This is kind of a new feature. Is it possible to make it configurable, so that either the new or the old behavior can be selected?
@saiarcot895 Is there any plan to get this fix handled and merged?
Pull request overview
This PR switches from UDP to RELP (Reliable Event Logging Protocol) for syslog forwarding from containers to the host, addressing issue #17792 where logs could be lost during rsyslog restarts. RELP provides acknowledgment-based delivery, which makes it much less likely that messages are lost when the connection is interrupted. The changes also modernize rsyslog configurations to use the RainierScript syntax (module() and input() directives) instead of legacy $-prefixed directives.
Changes:
- Migrated container-to-host syslog transport from UDP port 514 to RELP port 2514 with queue-based reliability
- Updated rsyslog configurations to RainierScript format for better maintainability
- Added rsyslog-relp package dependencies to host and all active container base images (bookworm, bullseye, trixie)
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| build_debian.sh | Added rsyslog-relp package to host system dependencies |
| dockers/docker-base-bookworm/Dockerfile.j2 | Added rsyslog-relp package installation |
| dockers/docker-base-bookworm/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-base-bullseye/Dockerfile.j2 | Added rsyslog-relp package to backports installation |
| dockers/docker-base-bullseye/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-base-trixie/Dockerfile.j2 | Added rsyslog-relp package installation |
| dockers/docker-base-trixie/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| dockers/docker-platform-monitor/etc/rsyslog.conf | Converted to RainierScript format and switched from UDP to RELP for forwarding |
| files/image_config/rsyslog/rsyslog-container.conf.j2 | Template updated to RainierScript format with RELP forwarding and queue configuration |
| files/image_config/rsyslog/rsyslog.conf.j2 | Host config template updated to RainierScript format with RELP reception on port 2514 |
| src/sonic-config-engine/tests/sample_output/py3/rsyslog.conf | Updated expected test output for new RainierScript format with RELP |
| src/sonic-config-engine/tests/sample_output/py3/rsyslog_with_docker0.conf | Updated expected test output for new RainierScript format with RELP |
| src/sonic-config-engine/tests/test_j2files.py | Added diff output to test assertions for better debugging |
| src/sonic-containercfgd/containercfgd/containercfgd.py | Updated regex patterns to parse RainierScript format rate limit settings |
@qiluo-msft While it could be added as a configurable feature, this would require an application (or daemon) running in each of the containers to determine which method to use (RELP or UDP). The last time such a feature was added (#12490), it had to be reworked to be disabled by default (#17458). Given that there are currently issues seen on cold bootup (#25382) and warm reboot (#17792), I'd like this to be enabled by default for simplicity.
Why I did it
When the host's rsyslog is restarted (for example, to regenerate the config after some changes, or as part of some automated script), there is a chance that some syslog messages from the containers are lost. Most of the time, this isn't an issue. However, if there are test cases that expect all syslogs to be present (such as the advanced-reboot test case), then this can cause a problem. Additionally, this could affect debuggability of issues where an rsyslog restart happens in the middle.
There are two options for reliable message transport in rsyslog: TCP and RELP. With TCP, while the protocol knows whether a syslog message has been delivered or not, the application doesn't know, because there is no feedback from the remote side saying the message was received. This means that there is still a chance that messages could be lost when the connection is broken (if, for example, the host rsyslog gets restarted), because after the connection is established, the sender rsyslog (in the container) doesn't know if the message has been received or not.
RELP builds on top of TCP, and adds a feedback mechanism where the remote side notifies the sender whether the message has actually been received or not. This makes it much less likely to lose a message. There is one known possible case where a message (or messages) could be lost: the network is down, and rsyslog gets restarted. This at least requires both the network and rsyslog to have an issue, rather than just one. There is also a slim possibility where a message could get duplicated; this should be mostly fine (hopefully).
RELP does require that both sides are using a recent version of rsyslogd (at least 7.3.16, which looks like it was released more than 10 years ago), but since we use Debian on both the container and the host, it should be fine.
Therefore, switch to using RELP when sending syslog messages from the container to the host. Also, enable a linked list queue on the sending queues on both the container rsyslog and on the host rsyslog. This means that if the sending of a log message fails (either because there is no network/route to the destination, or, thanks to RELP, messages are not getting acknowledged as received), it will be queued in that linked list and retried later.
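On the receiving end, the host needs a RELP listener for the containers to connect to. A minimal sketch, assuming the port 2514 mentioned in the review summary:

```
# Host side: accept RELP connections from the containers.
module(load="imrelp")
input(type="imrelp" port="2514")
```

Each message is acknowledged back to the sender, so the container's rsyslog knows which messages still need to stay in its queue and be retried.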
Additionally, since messages that were generated in the past could now be delivered later, change the timestamp that is recorded into /var/log/syslog to be the time at which the log message was generated (i.e. sent from the original application) rather than the time at which it was received by this rsyslogd instance. This more accurately reflects when an event happened, and with queueing and RELP now involved, the difference could be on the scale of seconds. This does mean that messages in /var/log/syslog may appear out-of-order at times.
Fixes #17792.
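In rsyslog terms, the distinction is between the timereported property (the timestamp carried in the message by its sender) and timegenerated (when this rsyslogd instance received it). The template below is a hypothetical sketch of writing the former; the template name and exact field layout are illustrative and not taken from the PR.

```
# Hypothetical template: record the sender's timestamp (timereported)
# instead of the local receive time (timegenerated).
template(name="ReportedTimeFormat" type="string"
         string="%timereported:::date-rfc3339% %HOSTNAME% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n")

*.* action(type="omfile" file="/var/log/syslog" template="ReportedTimeFormat")
```

A message that sat in a container's queue for a few seconds is then written with its original time, which is why neighbouring lines in the file can appear out of order.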
Work item tracking
How I did it
Modify the rsyslog.conf file on the host and the container to use RELP instead of UDP.
In addition, update the syntax used for the config files to the (newer) RainierScript format, which, among other things, makes it easier to set settings for specific outputs.
Finally, modify rsyslog.conf to write the timestamp that the log message was generated, not when it was received. This makes it a bit easier to correlate events, at the cost of making the logs look out of order.
How to verify it
Stop rsyslogd on the host, make sure that the containers generate some syslogs, restart rsyslogd on the host, and verify no logs were lost.