api,adaptation: pass container exit status to plugins. #144
Conversation
mikebrow
left a comment
some questions..
pkg/api/api.pb.go (outdated)

```go
type ContainerStatus struct {
	state         protoimpl.MessageState
	sizeCache     protoimpl.SizeCache
	unknownFields protoimpl.UnknownFields
```
- add strings `reason` and `message` after exit code?
- should state be a func that runs on the status fields as in containerd/internal/cri/store/container/status.go/State()?
- unknownFields .. is that unknown bool? // not fully loaded
- is this Pid a dup of the above Pid?
- What is sizeCache?
> add strings `reason` and `message` after exit code?

We can add them. But would `message` really be useful as well? I thought it is the human-readable counterpart of `reason`, explaining why the container is in its current state, and is meant for consumption by higher-layer components like the kubelet, kubectl, etc.

> should state be a func that runs on the status fields as in containerd/internal/cri/store/container/status.go/State()?

Well, we have the stock protoc-autogenerated [Container.GetStatus()].GetState() getter in place for that.

> unknownFields .. is that unknown bool? // not fully loaded

It is generated by the protoc toolchain, AFAIK for use by the protobuf machinery itself.

> is this Pid a dup of the above Pid?

Yes, both Pid and State are duplicates of the Container-level Pid and State fields. I chose not to remove them to avoid breaking backwards compatibility, but if we're not too worried about that prior to a 1.0, they could be removed. And if we remove them, we can reimplement their eliminated autogenerated getters as GetStatus().Get{Pid,State}().

Another option to avoid duplicates would be to add all the new fields directly to Container, without introducing a new ContainerStatus message/struct. Then we'd only add {Created,Started,Finished}At and ExitCode.

@mikebrow WDYT? Would you prefer to go with either of those options above?

> What is sizeCache?

Same as unknownFields... protoc internals.
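As an aside, the nil-safe getter pattern that protoc generates, which is what makes a chained Container.GetStatus().GetState() usable in place of a hand-written State() func, can be sketched with hypothetical stand-in types (this is not the real generated code, which also carries the state/sizeCache/unknownFields bookkeeping fields discussed above):

```go
package main

import "fmt"

// Hypothetical stand-ins for protoc-generated message types.
type ContainerStatus struct {
	State    int32
	ExitCode int32
}

type Container struct {
	Status *ContainerStatus
}

// protoc-style getters check for a nil receiver, so chained calls are
// safe even when the intermediate message pointer is nil.
func (c *Container) GetStatus() *ContainerStatus {
	if c == nil {
		return nil
	}
	return c.Status
}

func (s *ContainerStatus) GetState() int32 {
	if s == nil {
		return 0
	}
	return s.State
}

func main() {
	var c *Container                      // nil container
	fmt.Println(c.GetStatus().GetState()) // prints 0, no panic
	c = &Container{Status: &ContainerStatus{State: 3}}
	fmt.Println(c.GetStatus().GetState()) // prints 3
}
```

This is why removing the duplicated fields and re-expressing the getters as GetStatus().Get{Pid,State}() would keep callers panic-free even for containers without a populated status.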
ah, used to seeing it look like:

```go
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_sizecache        int32    `json:"-"`
}
```

at the bottom of the struct vs. up in the middle.. err, top in this case.. saw "state" and figured wrong... should've been paying attention to the .proto
rather just add the new timestamps and exit fields, since we sort of started down that route already and there's no reason to deprecate/dup ..
> higher-layer components like the kubelet, kubectl,

not super clear if these plugins will all be just an isolated piece of the plumbing under the container runtime for mutations, or if some will expose metric/exit/error information upstream to user-level monitoring tools.. possibly even AI agentic solutions.

noting from CRI:

```go
	// Exit code of the container. Only required when finished_at != 0. Default: 0.
	ExitCode int32 `protobuf:"varint,7,opt,name=exit_code,json=exitCode,proto3" json:"exit_code,omitempty"`
	// Brief CamelCase string explaining why container is in its current state.
	// Must be set to "OOMKilled" for containers terminated by cgroup-based Out-of-Memory killer.
	Reason string `protobuf:"bytes,10,opt,name=reason,proto3" json:"reason,omitempty"`
	// Human-readable message indicating details about why container is in its
	// current state.
	Message string `protobuf:"bytes,11,opt,name=message,proto3" json:"message,omitempty"`
```
> not super clear if these plugins will all be just an isolated piece of the plumbing under the container runtime for mutations or if some will expose metric/exit/error information upstream to user level monitoring tools.. possibly even AI Agentic solutions.

I do think exposing the message and reason in the plugin would be helpful as well, although I'm not clear on which component is surfacing them (runc? the containerd shim?). We currently modify the pod's Status.Message based on several factors, one of which is the exit code, and another is whether the task died because of OOM or not. Having another source of information can be helpful.

cc: @jadams41, who can probably help clarify our use-case better here.
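The kind of status massaging described here, deriving a reason/message pair from the exit code plus an OOM indicator, might look roughly like the following. This is a hypothetical helper for illustration, not actual containerd or CRI code; the only CRI-mandated value is the "OOMKilled" reason for cgroup OOM kills:

```go
package main

import "fmt"

// deriveStatus maps a container's exit information to CRI-style
// Reason/Message strings. CRI requires the reason "OOMKilled" for
// containers terminated by the cgroup OOM killer; the other branches
// are illustrative conventions, not mandated values.
func deriveStatus(exitCode int32, oomKilled bool) (reason, message string) {
	switch {
	case oomKilled:
		return "OOMKilled", "container was killed by the out-of-memory killer"
	case exitCode == 0:
		return "Completed", "container exited normally"
	default:
		return "Error", fmt.Sprintf("container exited with non-zero code %d", exitCode)
	}
}

func main() {
	r, m := deriveStatus(137, true)
	fmt.Println(r+":", m)
	r, m = deriveStatus(127, false)
	fmt.Println(r+":", m)
}
```

With exit code, reason, and message exposed through the NRI container data, a plugin could compute something like this without subscribing to a separate events stream.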
Thanks for poking me on this @etungsten.

For the system we are currently migrating, we do have certain errors that we want to surface specifically when they occur (and we use message, reason, and reasonMessage to do this). To get this to work, we have been subscribing to the events stream to be able to hook in (as part of the NRI plugin that we have put together) and supply our own custom statuses via the k8s API to update these fields on the pod.

Some examples of what we're doing at present:

- Giving some additional color to certain container exit codes (i.e. ExitCode=127 -> reasonMessage = "container couldn't find the command <...> to run in the path (exit code 127)").
- Having certain container setup that we do in NRI hooks percolate up as something different than just that we killed the container w/ an exit code (i.e. reasonMessage = "Failed to start a user defined container" or "Unable to launch system services").
- And finally, doing a little bit of cleanup for certain things. One example is that our runtime transparently injects tini into our container entrypoints (to fix up signal handling and do some business-logic tweaking for our users). Since our entrypoints now look like <tini binary + options> + <what the user actually specified>, we want to make sure we only include the tini aspects if the failure actually pertains to tini, so as not to confuse our users.

Being able to do this in-band with containerd/NRI seems like a big improvement, since the flow is already established between containerd events (the CRI layer) and k8s to update the pod; what we've managed to get working feels like we've engineered something (which we don't particularly love) to swim upstream against NRI instead of working with it.

Maybe there is a better way to do this than what we came up with that doesn't require NRI having access to it, so I was generally interested in raising what we're doing while there's some relevant discussion.
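The entrypoint-cleanup use case above, hiding an injected init wrapper such as tini from user-facing error messages, could be sketched as a small helper. The wrapper path, its option syntax, and the stripping heuristic here are all assumptions for illustration, not how any particular runtime does it:

```go
package main

import (
	"fmt"
	"strings"
)

// stripWrapper removes an injected init wrapper (e.g. a tini binary
// plus its leading options) from an entrypoint before it is surfaced
// in an error message, so users only see what they actually specified.
// Naive heuristic: drop the wrapper binary and any immediately
// following "-"-prefixed arguments (including a "--" separator).
func stripWrapper(entrypoint []string, wrapper string) []string {
	if len(entrypoint) == 0 || entrypoint[0] != wrapper {
		return entrypoint // not wrapped, leave untouched
	}
	rest := entrypoint[1:]
	for len(rest) > 0 && strings.HasPrefix(rest[0], "-") {
		rest = rest[1:]
	}
	return rest
}

func main() {
	ep := []string{"/sbin/tini", "-g", "--", "myapp", "--port=8080"}
	fmt.Println(stripWrapper(ep, "/sbin/tini")) // prints [myapp --port=8080]
}
```

With exit status available in-band through NRI, this sort of rewriting could happen in the plugin itself instead of in a side channel built on the events stream.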
very interesting use cases... let's do a follow up issue/pr for the reason/message pass through and/or pipelining
> very interesting use cases... let's do a follow up issue/pr for the reason/message pass through and/or pipelining

Ok, let's do so if that's the preference. Once this is merged, adding reason and message corresponding to the same CRI status fields should be very straightforward; just adding them to the API .proto file and regenerating the golang bindings should be enough. Considering that, it might be easier to just update this PR with those two extra fields already included. I'm fine either way.
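For illustration, the two extra fields might look something like the following in the NRI Container message. The field numbers and placement are assumptions, not the actual contents of the project's .proto file:

```proto
message Container {
  // ...existing fields (id, pod_sandbox_id, name, state, pid, etc.)...

  // Brief CamelCase string explaining why the container is in its
  // current state, e.g. "OOMKilled" (mirrors the CRI Reason field).
  string status_reason = 14;
  // Human-readable details about why the container is in its current
  // state (mirrors the CRI Message field).
  string status_message = 15;
}
```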
Based on a chat with @mikebrow, I added the two other related fields Mike was mentioning earlier and @jadams41 was asking for above: reason and message. Since I couldn't come up with a better way of scoping them to status, I called them StatusReason and StatusMessage.

The other possibility would be to introduce a new ContainerStatus message/struct and add all of these there. If we did that, it would probably make sense to also add or move Pid and State there. That would either introduce duplicate data or be a backward-incompatible change, so I did not opt for it here. But if the consensus is that it's still doable pre-1.0 and preferred in this case, then chime in and I can update this accordingly.
Force-pushed 7e03468 to c353f16.
mikebrow
left a comment
LGTM
Force-pushed 84512ae to 4d240db.
mikebrow
left a comment
LGTM .. just a nit on the description in the readme
Add exit code, status reason and message, and timestamps for creation, start and exit to container data.

Co-authored-by: Mike Brown <[email protected]>
Signed-off-by: Krisztian Litkey <[email protected]>
Force-pushed f92de89 to db955a1.
jadams41
left a comment
This looks great, awesome that you could fold status in here as well!
Didn't follow the discussion entirely about how this could be fully pulled through - is it possible to test this now?
@jadams41 Currently you need to compile an updated containerd yourself to test this. I have a test branch here if you are interested: https://github.com/klihub/containerd/tree/devel/main/container-exit-status

@mikebrow Thanks! Squash-committed and updated the PR.
mikebrow
left a comment
LGTM
Add ContainerStatus to the information NRI passes about containers to plugins. ContainerStatus includes container state, {creation, start, exit} timestamps, container (init process) PID and exit code, status reason, and message.