
Add design for migrating systemd-managed docker containers to Kubernetes with resource control#2021

Open
FengPan-Frank wants to merge 36 commits into sonic-net:master from FengPan-Frank:bmpk8s

Conversation

@FengPan-Frank
Contributor

@FengPan-Frank FengPan-Frank commented Jun 24, 2025

Add design for migrating systemd-managed docker containers to Kubernetes with resource control

| Description | Repo | PR link |
| --- | --- | --- |
| sidecar container sample (telemetry) | sonic-buildimage | sonic-net/sonic-buildimage#23936 |
| docker container change sample (telemetry) | sonic-buildimage | sonic-net/sonic-buildimage#23756 |
| watchdog container sample (telemetry) | sonic-buildimage | sonic-net/sonic-buildimage#23724 |

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

@FengPan-Frank changed the title from "Add design for migrating systemd-managed docker containers to Kuberne…" to "Add design for migrating systemd-managed docker containers to Kubernetes with resource control" on Jun 24, 2025

@make1980 make1980 self-requested a review July 1, 2025 18:17

@FengPan-Frank FengPan-Frank requested a review from make1980 July 3, 2025 23:11

let response = "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: 32\r\n\r\n{\"message\":\"telemetry disabled\"}";
stream.write_all(response.as_bytes()).ok();
} else {
// normal health check


we should describe what's monitored by the watchdog - maybe just a simple physical path probe to telemetry will be a good starting point.

Contributor Author


added real detection code for the gnmi service listening on 50051
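A minimal probe along those lines might look like this (a sketch; the function name and the use of bash's /dev/tcp pseudo-device are assumptions, not the actual PR code):

```shell
#!/bin/bash
# Hypothetical watchdog probe: report OK only if something is actually
# accepting TCP connections on the gNMI port (50051 by default).
check_gnmi_listening() {
    port="${1:-50051}"
    # bash's /dev/tcp pseudo-device: opening it fails when nothing listens
    if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
        echo "OK"
    else
        echo "FAIL"
        return 1
    fi
}
```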

kill_container() {
    POD_NAME=$(get_pod_name)
    if [[ -n "$POD_NAME" ]]; then
        kubectl exec "$POD_NAME" -n "$NAMESPACE" -- pkill -f telemetry


have you tried kubectl exec? this may not work in our setup - we don't have a kubemaster-to-device channel today. so maybe you should use a docker command to find the container with the pod name in it and restart it through docker instead.

this is again some complexity because we do not put telemetry into its own daemonset/pod.
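A docker-based restart along the lines suggested above might be sketched as follows (the function name is illustrative; it relies on the kubelet convention of naming containers k8s_&lt;container&gt;_&lt;pod&gt;_…):

```shell
#!/bin/bash
# Sketch: restart a feature's k8s-managed container through the local
# docker daemon, avoiding kubectl exec (no kubemaster-to-device channel).
restart_feature_container() {
    feature="$1"
    # kubelet-created containers are named k8s_<container>_<pod>_<ns>_...
    cid=$(docker ps --filter "name=k8s_${feature}_" --format '{{.ID}}' | head -n 1)
    if [ -n "$cid" ]; then
        docker restart "$cid"
    else
        echo "no running container found for feature ${feature}" >&2
        return 1
    fi
}
```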

Contributor Author


Can we add a label like "feature: telemetry" which has the same name as FEATURE table?

- Patch systemd service files.


patch_systemd.sh


i'd prefer to have the sidecar container written in golang so that we don't need a base image for it.

The desired state script will be based on top-of-tree from a specific code branch.

```mermaid
sequenceDiagram


one thing we should probably clarify in this diagram is that we don't intend to change the postupgrade script to use a desired state file (or the same desired state file that we use in the sidecar). if any postupgrade script is doing a diff-based update it will still work that way, but the sidecar will overwrite it.

Contributor Author


updated.

telemetry-watchdog->>k8s: skip detect and return OK
```

### 5.4 After KubeSonic rollout, rollback the container. (v2+ -> v2)


featured seems to use "systemctl show telemetry --property UnitFileState" to get the status of a service - in our current design how do we make sure it will reflect whether telemetry is running or not?


this doesn't seem to work as expected today so we might be fine. the logic is: 1. check the show --property result; if it's enabled, don't do anything and set the state table as enabled. 2. try enable and start; if no error, set the state table as enabled.

Contributor Author


we'll have no change in the current featured monitoring workflow.


syslog.syslog(syslog.LOG_ERR, ERROR_CGROUP_MEMORY_USAGE_NOT_FOUND.format(docker_memory_usage_file_path, container_id))
sys.exit(INTERNAL_ERROR)
```


we should probably just say that after k8s we simplify the restart logic to be OOM-based only - I'm not sure why we tolerate memory usage exceeding the limit several times, as implemented right now.

Contributor Author


I'm not sure about the previous intention for the retry; updated this part as well.

make1980
make1980 previously approved these changes Jul 27, 2025
#### FEATURE Table Snapshot
```json
"telemetry": {
"auto_restart": "enabled",


do we still support auto_restart with systemd stub? maybe delayed is still supported through the stub? not sure about that part...

same for check_up_status. we don't use set_owner anymore with this design, right?

        sleep 1000
    fi
done
exec /usr/local/bin/supervisord


what if the feature is enabled and running in a k8s pod, and then someone disables the feature? I suppose the systemd service may be stopped, but what's the behavior of the container? will it exit? We should make the behavior clear.

Contributor Author


systemd will kill and restart the container; after restart, the container will read the FEATURE table, see that it's disabled, and enter idle status.
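The stub entrypoint flow described above could be sketched as follows (the helper names are illustrative; `sonic-db-cli` reading CONFIG_DB is the standard accessor):

```shell
#!/bin/bash
# Sketch of the entrypoint: on every (re)start the container reads the
# FEATURE table; while disabled it idles instead of exiting, so systemd's
# kill/restart is enough to pick up a state change.
get_feature_state() {
    sonic-db-cli CONFIG_DB hget "FEATURE|${1}" state
}

feature_enabled() {
    [ "$(get_feature_state "$1")" = "enabled" ]
}

idle_until_enabled() {
    until feature_enabled "$1"; do
        sleep 1000   # idle while the feature stays disabled
    done
    exec /usr/local/bin/supervisord
}
```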

#### v1+ Behavior with Kubernetes DaemonSet
- The container is deployed via Kubernetes DaemonSet.
- The systemd service is retained as a **stub**, to avoid breaking automation or tools that query it.
- For start/stop/restart, it will simply kill the container so that Kubernetes relaunches it automatically. One limitation: since Kubernetes takes over the container, systemd stop will not actually STOP the container unless we taint the node, but that would require each container to use a dedicated DaemonSet, which is not preferable either.


for start action, we may return immediately, because kubelet will always start it if it's not running.

Contributor Author


To guarantee correctness we may simply kill/restart the container, so that the latest FEATURE table is read.

- The container is deployed via Kubernetes DaemonSet.
- The systemd service is retained as a **stub**, to avoid breaking automation or tools that query it.
- For start/stop/restart, it will simply kill the container so that Kubernetes relaunches it automatically. One limitation: since Kubernetes takes over the container, systemd stop will not actually STOP the container unless we taint the node, but that would require each container to use a dedicated DaemonSet, which is not preferable either.
- For status, it will return runtime state via kubectl.


how do you get the status for the container via kubectl? If you want to do that, you need to find the corresponding pod and extract the container status from it; there is no clear mapping between a pod and the feature container today, so we may have to return the status based on the feature status.

Contributor Author


Can we add a label like "feature: telemetry" which has the same name as FEATURE table now?

Contributor Author


(screenshot)

we can use a label to map the feature to the container status.
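Assuming such a `feature=<name>` label is added (it is a proposal in this thread, not something the DaemonSet carries yet), the status lookup could be sketched as:

```shell
#!/bin/bash
# Sketch: map a FEATURE table entry to its pod via a label selector and
# report the pod phase (Running, Pending, ...).
feature_pod_status() {
    kubectl get pods -n "${NAMESPACE:-default}" \
        -l "feature=${1}" \
        -o 'jsonpath={.items[0].status.phase}' 2>/dev/null
}
```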

    exit 1
}

get_pod_name() {


it assumes there is only one pod on the node, but it may not be true.

Contributor Author


we can update during implementation.

kill_container() {
    POD_NAME=$(get_pod_name)
    if [[ -n "$POD_NAME" ]]; then
        kubectl exec "$POD_NAME" -n "$NAMESPACE" -- pkill -f telemetry


in the pod, the service is still managed by systemd inside the container, so I think you may call supervisorctl to restart it


please note, we can deploy multiple containers in a pod; we usually use the feature name as the container name, so you will need to consider using '-c ' in kubectl exec.
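An exec wrapper reflecting both review points above might look like this (a sketch; it assumes the container name inside the pod equals the feature name):

```shell
#!/bin/bash
# Sketch: target the feature's container explicitly with -c, since a pod
# may bundle several containers; e.g. restart the service via supervisorctl
# instead of pkill.
exec_in_feature() {
    pod="$1"; feature="$2"; shift 2
    kubectl exec "$pod" -n "${NAMESPACE:-default}" -c "$feature" -- "$@"
}
# example: exec_in_feature "$POD_NAME" telemetry supervisorctl restart telemetry
```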

end
```

### 5.4 After KubeSonic rollout, rollback the container to the image-based version. (v2 -> v0)


what's the scenario for v2 -> v0? In a k8s rollout, if v2 fails, it will roll back to v1 instead of v0, right?

Contributor Author


v1 only has systemd stub files; here I mean rollback to the original SONiC image-based version.


#### v0 Behavior
- Container is fully managed by `systemd` (e.g., `telemetry.service`).
- FEATURE table controls startup (`state: Enabled/Disabled`).
- The actual container is started or stopped via `systemctl` by featured.service.
Contributor


featured also uses other systemctl commands like mask and unmask; how do we support them?


@FengPan-Frank - let's take a look at the featured implementation. It seems that telemetry systemd service doesn't even support mask/unmask/enable/disable stuff but for the containers that do support we should probably just treat them all as restart.

Contributor Author


(screenshot)

Verified on this: mask/unmask will flip the symlink to /dev/null, thus we should be fine with these commands

As the section "Enforce CPU and memory resource constraints natively" mentioned, this monit functionality can be covered by Kubernetes, so we need to remove it from the KubeSonic rolled-out container. However, if during the transition period any case declares that `monit` must be retained temporarily, we can also rewrite /usr/bin/memory_check to use Kubernetes data (e.g., via `kubectl top`) instead of reading Docker or cgroup files like above.
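If memory_check did have to be retained, a kubectl-top based variant could be sketched roughly like this (function name, threshold handling, and Mi-only parsing are all assumptions):

```shell
#!/bin/bash
# Sketch: returns success (0) when the pod's memory usage exceeds the
# given limit in MiB, using kubectl top instead of cgroup files.
memory_exceeds_limit() {
    pod="$1"; limit_mib="$2"
    # kubectl top pod prints: NAME CPU(cores) MEMORY(bytes)
    usage=$(kubectl top pod "$pod" --no-headers 2>/dev/null \
                | awk '{print $3}' | tr -d 'Mi')
    [ -n "$usage" ] && [ "$usage" -gt "$limit_mib" ]
}
```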

---

Contributor


How to support container_checker?

Contributor Author


updated container_checker in the doc as well.

#### v1+ Behavior with Kubernetes DaemonSet
- The container is deployed via Kubernetes DaemonSet.
- The systemd service is retained as a **stub**, to avoid breaking automation or tools that query it.
- For start/stop/restart, it will simply kill the container so that Kubernetes relaunches it automatically. One limitation: since Kubernetes takes over the container, systemd stop will not actually STOP the container unless we taint the node, but that would require each container to use a dedicated DaemonSet, which is not preferable either.
Contributor


We don't really support the container; if the container has a memory issue, this might not be enough.

Contributor Author

@FengPan-Frank FengPan-Frank Aug 1, 2025


do you mean an OOM issue? OOM should be handled via section 6 below

