Skip to content

[resolv-config] Improve container resolv.conf update mechanism#22439

Merged
qiluo-msft merged 1 commit intosonic-net:masterfrom
oleksandrivantsiv:resolv-conf-update
Apr 28, 2025
Merged

[resolv-config] Improve container resolv.conf update mechanism#22439
qiluo-msft merged 1 commit intosonic-net:masterfrom
oleksandrivantsiv:resolv-conf-update

Conversation

@oleksandrivantsiv
Copy link
Collaborator

@oleksandrivantsiv oleksandrivantsiv commented Apr 24, 2025

Why I did it

Fix a possible race condition during the config reload caused by the concurrent restart of the Docker containers and the resolv-config service. The update of the DNS configuration inside the container is triggered by the container's post-start action.

Work item tracking
  • Microsoft ADO (number only):

How I did it

  • Add support for single container updates with container name argument
  • Implement parallel updates for bulk operations
  • Add comprehensive logging with syslog integration
  • Add container state handling (start/wait for stopped containers)
  • Add proper error handling and status reporting
  • Remove temporary file usage during resolv.conf updates
  • Add networking service check for bulk updates only

Integration changes:

  • Add automatic DNS update call in docker_image_ctl post-start action

How to verify it

Run sonic-mgmt DNS tests

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft
Copy link
Collaborator

How does "waits up to 5 seconds for containers to start" help solving a possible race condition? Is it just reduce the possibility but not completely resolve it?

Can we add a testcase to prevent future regression?

@ganglyu
Copy link
Contributor

ganglyu commented Apr 25, 2025

Is it possible to add a hook to run the update_container script after Docker starts?
Can we use a timer to run the update_container script?

@oleksandrivantsiv
Copy link
Collaborator Author

How does "waits up to 5 seconds for containers to start" help solving a possible race condition? Is it just reduce the possibility but not completely resolve it?

Can we add a testcase to prevent future regression?

There is a small time window when the race condition can occur. It always happens only during the config reload when the DNS configuration changes.

  1. Dockerd starts the container with the current content of the "resolv.conf". It takes some time for the container to start and move into the "running" state.
  2. Meanwhile, concurrently, hostcfgd receives an event with the DNS configuration. It runs resolv-config.service, which writes the DNS configuration into the "/etc/resolv.conf" file and triggers the update of the containers.
  3. When the "update-containers" script starts, some Docker containers might not be in the "running" state yet, so the update for these containers can be missed.

The time window for this scenario is very small, which is why the issue has such a low reproducibility rate. In this scenario, waiting 5 seconds ensures that the Docker container has enough time to move into the running state.

- Add support for single container updates with container name argument
- Implement parallel updates for bulk operations
- Add comprehensive logging with syslog integration
- Add container state handling (start/wait for stopped containers)
- Add proper error handling and status reporting
- Remove temporary file usage during resolv.conf updates
- Add networking service check for bulk updates only

Integration changes:
- Add automatic DNS update call in docker_image_ctl post-start action
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@oleksandrivantsiv
Copy link
Collaborator Author

@qiluo-msft, @ganglyu I updated the implementation to remove 5seconds wait period. Please review

@qiluo-msft
Copy link
Collaborator

Did you ever try this easier solution https://unix.stackexchange.com/a/348406/161486?

exit $?
fi

# Check if networking service is active (only for bulk updates)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

networking service is active

If "networking service is active" is a dependency to fix the issue, we need to worry about some container start during "networking service is not active", but later on the service started.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully understand this comment

@oleksandrivantsiv
Copy link
Collaborator Author

The solution described in https://unix.stackexchange.com/a/348406/161486 is redundant. This is a default behavior Docker has. When the container starts dockerd copies "/etc/resolv.conf" file from the host OS. The issue with this behavior is described in my previous comment:

There is a small time window when the race condition can occur. It always happens only during the config reload when the DNS configuration changes.

1. Dockerd starts the container with the current content of the "resolv.conf" (copies it from the host). It takes some time for the container to start and move into the "running" state.
2. Meanwhile, concurrently, hostcfgd receives an event with the DNS configuration. It runs resolv-config.service, which writes the DNS configuration into the "/etc/resolv.conf" file and triggers the update of the containers.
3. When the "update-containers" script starts, some Docker containers might not be in the "running" state yet, so the update for these containers can be missed.

@qiluo-msft qiluo-msft merged commit 92e7d2f into sonic-net:master Apr 28, 2025
19 checks passed
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202411: #22462

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants