Skip to content

Conversation

@klihub
Copy link
Member

@klihub klihub commented Oct 1, 2025

The primary motivation for this PR is to clean up some annoying bits in the NRI wire protocol.

The most prominent such bit is the lack of proper, dedicated wire RPC calls and messages for some pod and container lifecycle events. These are instead multiplexed over a single RPC call and message (StateChange, StateChangeEvent), lacking any event-specific returned data and making any future extension unnecessarily difficult. Other, smaller annoyances are the usage of a single Empty response type in multiple other RPC calls, with similar negative side effects.

Addressing these alone is a more intrusive protocol change than the usual addition of new NRI inputs or controls and the related protocol extension with new fields. Therefore this PR defines a new v1beta1 revision of the protocol, initially identical to v1alpha1, which is then updated address the aforementioned annoyances.

Some of the original protocol problems are also reflected in the stub and the plugin/stub API. This PR tries to clean some of those up as well. It defines a new v1beta1 stub, turning the original one into a v1alpha1 revision.

In an attempt to provide a painless transition path for plugins (and to experiment with it in practice) this PR also implements an internal transparent protocol conversion from v1alpha1 to v1beta1, for plugins which register themselves talking the earlier version of the protocol.

This PR is still work-in-progress, but I'd like to gather some initial thoughts and general feedback from anybody who has the time and interest to take a look at this, already in its current state.

Here is the list of things (I think) this PR addresses in v1beta1:

  • eliminate on wire StateChange in favor of dedicated event-specific requests
  • eliminate remaining usage of Empty (plugin registration, plugin shutdown, wasm plugin log bridging)
  • implement plugin shutdown, use it where it makes sense instead of just closing the connection
  • delay closing plugin connections for fatal errors, giving error propagation back to plugins a chance

Here is a list of things I either know are still missing or I'd like to additionally address. This is probably still incomplete. I'll try to update it as I can.

  • adaptation test suite tests for protocol conversion
  • documentation updates
  • additional stub/plugin API cleanups

@mikebrow @chrishenzie @marquiz If you have some BW, PTAL.

@klihub klihub force-pushed the devel/api/v1beta1 branch from e3187d6 to c3f5db9 Compare October 1, 2025 16:39
@mikebrow
Copy link
Member

mikebrow commented Oct 1, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@klihub klihub force-pushed the devel/api/v1beta1 branch from c3f5db9 to cd2e51f Compare October 1, 2025 18:32
@klihub
Copy link
Member Author

klihub commented Oct 1, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

Okay, let's have a chat about this next week.

@klihub klihub force-pushed the devel/api/v1beta1 branch 4 times, most recently from 9ca998a to fdb6329 Compare October 2, 2025 09:40
@klihub klihub force-pushed the devel/api/v1beta1 branch 4 times, most recently from f04f8ff to 837c25f Compare October 3, 2025 07:54
@klihub
Copy link
Member Author

klihub commented Oct 9, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like

&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}

while on the return path we always have

type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)

This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

@mikebrow
Copy link
Member

mikebrow commented Oct 9, 2025

Interesting changes.. let's do a call on this next week covering the motivations. I'm LGTM for a versioning practice run to manage proto version extensions/enhancements. Not sure about the need to split state change notifications into event type unique rpcs? Would like to hear some ideas on scenarios where that would be helpful. The other changes all sound like good tech debt cleanup.

@mikebrow @chrishenzie Some context about the motivation to ditch once and for all StateChange on the wire and replace it with proper dedicated services and related request/response messages. Our API both on the runtime and the client side has never exposed StateChange. Instead we try to pretend that there are dedicated calls and messages for each one multiplexed over StateChange. In practice what happens is that while both the runtime and the plugins see, for instance, something like RunPodSandox(pod), StopPodSandbox(pod), PostCreateContainer(pod, container), StartContainer(pod, container), in reality we pass this over on the wire with something like

&StateChangeEvent{
    Event: Event_{RUN, STOP, POST_UDPATE,REMOVE}_POD_SANDBOX,
    Pod: pod,
    Container: nil,
}
&StateChangeEvent{
    Event: Event_{POST_CREATE,START,POST_START,POST_UPDATE,REMOVE}_CONTAINER,
    Pod: pod,
    Container: container,
}

while on the return path we always have

type (
    {Run,Stop,PostUpdate,Remove}PodSandboxResponse = Empty
    {PostCreate,Start,PostStart,PostUpdate,Remove}ContainerResponse = Empty
    Empty struct {}
)

This was a futile and pretty misguided attempt, by me in momentary lapse of reason/worse than usual brain damage, to try and keep the protocol looking (slightly) small(er than otherwise). IMO, it did not help or make much sense at that time, and it does even less now. The only thing it causes is problems, and artificially created ones at that.

The initial collective property for all requests mutiplexed as StateChange was that they all had event-like semantics, IOW the plugin/client could not request altering anything in response. A secondary one was that, initially, all these requests had either (PodSandbox) or (PodSandbox, Container) inputs, so the 'union content' approach for StateChangeEvent did not look too ugly. But this already became a problem when @chrishenzie was adding the UpdatePodSandbox request. That, too, was event-like in this sense, but the inputs already did not fit into StateChangeEvent. So there was a dilemma whether to 1) go down the path of further uglification by adding Overhead *LinuxContainerResource, Resources *LinuxContainerResources to the 'union of inputs' or 2) handle it with a proper dedicate request and messages and blow the rest of StateChange apart in a similar fashion(... eventually, while living with the now inconsistent approach for the time being).

We decided to go with the second option, but haven't blown apart StateChange yet. I'd like to do that now, while we're rolling a next wire protocol version as part of this PR.

Thanks for the detailed explanation. SGTM!!

events is a very overloaded term.. Calling them something else might help :-)

klihub added 14 commits November 3, 2025 15:26
Neither assume that we have a single proto definition
nor hardcode proto file locations for post-processing
actions based on such an assumption.

Signed-off-by: Krisztian Litkey <[email protected]>
Move the current wire API revision from pkg/api to pkg/api/v1alpha1.
Make importing from the original location an effective alias of
importing from the new one.

Signed-off-by: Krisztian Litkey <[email protected]>
Clone v1alpha1 as a starting point of v1beta1. Add dedicated RPC calls
for requests we used to handle all in v1alpha1 with StateChange tagged
with Event's to tell them apart. Remove now unused StateChange.

Likewise, replace usage of Empty as a response type with dedicated per
RPC call reponses. Remove now unused Empty.

Update Shutdown to allow passing a string which describes the reason
for the shutdown of the plugin.

Signed-off-by: Krisztian Litkey <[email protected]>
Move the current stub revision to pkg/stub/v1alpha1. Make importing
from the original location an effective alias of importing from the
new one.

Signed-off-by: Krisztian Litkey <[email protected]>
Add protocol translation for plugins that talk v1alpha1.

Signed-off-by: Krisztian Litkey <[email protected]>
Implement plugin shutdown. On error paths where this seems the right
choice, shut down plugins instead of closing them. This should give
plugins a chance to gather more context (from the shutdown reason)
than just closing the connection. On other error code paths where we
close the connection, delay it to allow any errors to propagate back
to the plugin before the connection is closed.

Signed-off-by: Krisztian Litkey <[email protected]>
If our connection gets closed before we had a chance to get
Configure()'d by the runtime, cancel Start()'s wait for the
result by letting it know.

Signed-off-by: Krisztian Litkey <[email protected]>
Add and use Event.Mask().

Signed-off-by: Krisztian Litkey <[email protected]>
@klihub klihub force-pushed the devel/api/v1beta1 branch from 837c25f to 5701fcb Compare November 3, 2025 15:48
@klihub klihub force-pushed the devel/api/v1beta1 branch 4 times, most recently from 5dcfafd to 19fc1cc Compare November 4, 2025 16:47
Rework adaptation test suite. Stop using gingko/gomega. Add support
for v1beta1, v1alpha1, and builtin test plugins. Rewrite test cases
using the new suite.

Signed-off-by: Krisztian Litkey <[email protected]>
Rework mux tests without using ginkgo/gomega.

Signed-off-by: Krisztian Litkey <[email protected]>
Rework tests without using ginkgo/gomega.

Signed-off-by: Krisztian Litkey <[email protected]>
Start treating plugins which register with the same index and name
as multiple instances of the same plugin. When a plugin registers,
shut down the previous instance of the same plugin once the later
one gets synchronized.

Signed-off-by: Krisztian Litkey <[email protected]>
@klihub klihub force-pushed the devel/api/v1beta1 branch 2 times, most recently from 02545f6 to dde7d95 Compare November 4, 2025 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants