Conversation


@wk8 wk8 commented Oct 4, 2018

This patch adds a new manager/deallocator package.

The deallocator's job is to take over deletion of services and service-level
resources (as of right now, only networks, but e.g. volumes would also fall into
the same category).

The deallocator relies on the reaper correctly deleting tasks for services that
are shutting down, and only when no tasks are left within a service does it
then proceed to delete that service and its resources.

This patch does not yet actually put that new component to use; that will be
done in a future, separate patch, for the sake of easier reviews.

Includes unit tests.

@GordonTheTurtle

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "wk8/network_rm" [email protected]:wk8/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354530096
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

eventsChan, eventsChanCancel, err := store.ViewAndWatch(deallocator.store,
func(readTx store.ReadTx) error {
// look for services that are marked for deletion
// there's no index on the `PendingDelete` field,
Contributor

there's no index on the PendingDelete field

can you plz clarify what this means?

Contributor Author

I mean that there's no DB index on that field; so listing those which are marked for deletion requires iterating over them all.
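
For illustration, a minimal sketch of what that full scan could look like inside the ViewAndWatch callback shown above, assuming the usual store helpers (store.FindServices with store.All) and the new PendingDelete field; this is a sketch, not the exact code from the patch:

// sketch only: with no store index on PendingDelete, list every service
// and keep the ones marked for deletion
allServices, err := store.FindServices(readTx, store.All)
if err != nil {
	return err
}
for _, service := range allServices {
	if service.PendingDelete {
		// the deallocator would start tracking this service here
	}
}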

Contributor

cool. maybe clarify then:
there's no index on the PendingDelete field in the store

// if in doubt, let's proceed to clean up the service anyway
// better to clean up resources that shouldn't be cleaned up yet
// than ending up with a service and some resources lost in limbo forever
return deallocator.deallocateService(ctx, service)
Contributor

@anshulpundir anshulpundir Oct 5, 2018

this may cause problems. Maybe it's better to be safe and retry?

Contributor Author

I thought about it. Sure, I'm not opposed to re-trying, but that only delays the real problem: what happens after we've retried n times and keep failing? Do we never stop trying (in which case this could be dangerous at another level, namely resource consumption & performance)? Or do we just give up after a while?

ignoreServiceID = &service.ID
}

// then all of its service-level resources, provided no other service uses them
Contributor

provided no other service uses them

How do we check that?

Contributor Author

That's what the resourceDeallocator.servicesLocator callback is for

events[1] = api.EventUpdateService{Service: &api.Service{PendingDelete: true}}
i := 2
for _, resourceDeallocator := range resourceDeallocators {
events[i] = resourceDeallocator.event
Contributor

Can you please clarify why we need to do this?

Contributor Author

I'm not 100% sure I understand the question correctly, but basically this also boils down to separating the resource-specific code from the core of the deallocator: each resource-specific file defines the API event it cares about, and this is just aggregating all that to feed it to store.ViewAndWatch below.

Please let me know if I misunderstood the question!
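
For illustration, a hypothetical sketch of what such a resource-specific registration might look like; only the event field and the registerServiceLevelResource call are taken from this diff, everything else is assumed:

// hypothetical: a resource-specific file (networks here) declares which store
// event it cares about, and the deallocator aggregates these for ViewAndWatch
func init() {
	registerServiceLevelResource(resourceDeallocator{
		event: api.EventUpdateNetwork{},
		// resource-specific callbacks (e.g. the servicesLocator mentioned
		// above) would also be set here
	})
}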

)

func registerServiceLevelResource(resourceDeallocator resourceDeallocator) {
resourceDeallocators[reflect.TypeOf(resourceDeallocator.event)] = resourceDeallocator
Contributor

I don't think there's any major performance overhead, but I don't know if this is a good pattern. cc @dperny

Contributor Author

I also tend to shy away from using reflection when I can, but I couldn't find any other way to achieve what I wanted here, namely: have different callbacks for different event types without hard-coding in this file all the event types and their callbacks - for the sake of cleanly separating the deallocator itself from the deallocation code for each type of resource.

But then I'd love to be hinted at a better way of doing this :) !

Contributor

Based on the conversation above, interface{} might be the easiest and simplest way to do this for now. However, I'm not a golang expert, so I would love @dperny's and @stevvooe's opinions here

)

var (
resourceDeallocators = make(map[reflect.Type]resourceDeallocator)
Contributor

can we do map[interface{}]resourceDeallocator ? @wk8

Contributor Author

I'm not sure I understand the idea of not making it a bit more specific (reflect.Type rather than just an empty interface) if we can?

Contributor

Is reflect.Type more specific than interface{} ?

Contributor Author

Well yes, it's a well-defined type, as opposed to interface{} which matches anything, akin to C/C++'s void*?

Contributor

Fair, but do any of those properties of a reflect.Type apply to this use-case?

Here's something I found about generics in golang. https://appliedgo.net/generics/
TLDR the recommendation from this article is to avoid reflection. Maybe use interface{} and type assertions?

Contributor Author

@wk8 wk8 Oct 9, 2018

Thanks, interesting read.

The main argument against reflection here is performance (type safety is not a concern here, given the very controlled fashion in which it's used). And sure enough, it's slower than a case switch (see below). But at least that allows us to not hardcode in the package's main file the different resource types, and keep each resource cleanly separated from the deallocator's code.

If we wanted to keep that separation of concerns in the code, we could indeed use type assertion in a linked list of resource-specific implementations, but that might grow slow if we add a lot of resources (10+, according to the benchmark below).

A rough benchmark:

package foo

import (
	"reflect"
	"testing"
)

type Foo struct{}
type Bar struct{}

var a = []interface{}{Foo{}, Bar{}}

var foo = reflect.TypeOf(Foo{})
var bar = reflect.TypeOf(Bar{})

func BenchmarkTypeOf(b *testing.B) {
	x, y := 0, 0

	for n := 0; n < b.N; n++ {
		for _, i := range a {
			switch reflect.TypeOf(i) {
			case foo:
				x++
			case bar:
				y++
			}
		}
	}

	assertSuccess(b, x, y)
}

func BenchmarkSwitch(b *testing.B) {
	x, y := 0, 0

	for n := 0; n < b.N; n++ {
		for _, i := range a {
			switch i.(type) {
			case Foo:
				x++
			case Bar:
				y++
			}
		}
	}

	assertSuccess(b, x, y)
}

func BenchmarkTypeAssertion(b *testing.B) {
	x, y := 0, 0

	for n := 0; n < b.N; n++ {
		for _, i := range a {
			if _, isFoo := i.(Foo); isFoo {
				x++
			} else if _, isBar := i.(Bar); isBar {
				y++
			}
		}
	}

	assertSuccess(b, x, y)
}

func assertSuccess(b *testing.B, x, y int) {
	if x != y || x != b.N {
		b.Error("shouldn't happen")
	}
}

yields

BenchmarkTypeOf-8          	100000000	        20.9 ns/op
BenchmarkSwitch-8          	1000000000	         2.89 ns/op
BenchmarkTypeAssertion-8   	2000000000	         1.89 ns/op

so, happy to re-write as a list of type assertions if you want?

Contributor Author

(also, we only make a call to reflect.TypeOf when there's an update to a network (or another service-level resource down the line), so the 10 ns price might be okay?)

Collaborator

oh boy

Collaborator

oh lord this is over-engineered.

@wk8 wk8 changed the title from Wk8/network rm to Adding a new Deallocator component Oct 9, 2018
type: TYPE_STRING
json_name: "phpNamespace"
}
field {
Contributor Author

This is from the recent protoc upgrade from #2750

Contributor

OK, let's rebase?

Contributor Author

There's nothing to rebase against. The protoc upgrade change did not change this file, but it should have: regenerating the proto files after the upgrade will include these lines.

Contributor Author

(and sure, I could make it a separate PR, but is it really worth it?)

@codecov

codecov bot commented Oct 10, 2018

Codecov Report

Merging #2759 into master will increase coverage by 0.08%.
The diff coverage is 77.16%.

@@            Coverage Diff             @@
##           master    #2759      +/-   ##
==========================================
+ Coverage   61.87%   61.95%   +0.08%     
==========================================
  Files         136      137       +1     
  Lines       21926    22047     +121     
==========================================
+ Hits        13566    13660      +94     
- Misses       6896     6912      +16     
- Partials     1464     1475      +11

@dperny
Collaborator

dperny commented Oct 11, 2018

instead of using reflection and writing a more general API, can we write a less-general API that requires some objects be hard-coded? we have absolute control over what components perform resource deallocation, so we can just bake them in.
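
For context, a rough sketch of the kind of hard-coded dispatch being suggested here: a plain type switch over the known event types instead of a reflect.Type-keyed registry. The case bodies and the processDeletedTask helper name are placeholders, not the actual code:

func (deallocator *Deallocator) processNewEvent(ctx context.Context, event events.Event) (bool, error) {
	switch typedEvent := event.(type) {
	case api.EventDeleteTask:
		// hypothetical helper: decrement the task count for this task's
		// service, and deallocate the service once no tasks are left
		return deallocator.processDeletedTask(ctx, typedEvent.Task)
	case api.EventUpdateService:
		return deallocator.processService(ctx, typedEvent.Service)
	case api.EventUpdateNetwork:
		// networks are baked in; volumes etc. would get their own case later
		return deallocator.processNetwork(ctx, nil, typedEvent.Network, nil)
	default:
		return false, nil
	}
}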

Collaborator

@dperny dperny left a comment

I think it's over-engineered. This is what we'd build if we were providing this as a framework or something against arbitrary types, but the types we're actually acting on are constrained to a pretty small set.


// for services that are shutting down, we keep track of how many
// tasks still exist for them
services map[string]*serviceWithTaskCounts
Collaborator

why map service IDs to both the service object AND the count? what does keeping a pointer to the service object get us, as opposed to having a map[string]int?

Contributor Author

We do need the service itself at deallocation time. Saves a lookup. I guess we could do away with that, but seems like a pretty minor choice either way?

Contributor

Sorry I didn't pay attention to this comment before, but I agree with @dperny that it is cleaner to just store service id => count.

My suggestion: when designing data structures, it's better to have a reasonably strong reason to do it a certain way rather than it being fine either way. @wk8

Contributor Author

@anshulpundir : isn't saving a lookup a strong enough reason?

Contributor

@anshulpundir anshulpundir Oct 30, 2018

One extra lookup for each service delete is not that big an overhead. But I guess you could argue either way. I'll let you make the call.


// now we also need to look at all existing service-level resources
// that are marked for deletion
for key, resourceDeallocator := range resourceDeallocators {
Collaborator

yeah the more that i think about this... we ought to just do

allNetworks, err = store.FindNetworks(readTx, store.All)

and then later on if we add new resources (i.e. volumes), we'll just add more code:

allVolumes, err := store.FindVolumes(readTx, store.All)

This does lead to duplication, but it's much more clear. If we were ever going to have more than a handful of object types that needed to be managed by this, it would be different, but we almost certainly won't.

// FIXME (jrouge): see the comment on TaskReaper.Stop() and see when to properly stop this
// plus unit test on this!
func (deallocator *Deallocator) Stop() {
close(deallocator.stopChan)
Collaborator

you want to wrap this in a sync.Once to make calls to Stop idempotent.
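
A minimal sketch of that suggestion, assuming a stopOnce field (and a "sync" import) gets added alongside stopChan:

// sketch: make Stop idempotent so calling it twice doesn't panic on a
// double close of stopChan
type Deallocator struct {
	stopChan chan struct{}
	stopOnce sync.Once
	// other fields elided
}

func (deallocator *Deallocator) Stop() {
	deallocator.stopOnce.Do(func() {
		close(deallocator.stopChan)
	})
}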

Contributor Author

I'll do that when I get around to actually using it; right now I'm not clear on when it should be stopped, which is what the comment right before it hints at. But good catch, keeping it in mind :)

}

func (deallocator *Deallocator) processNewEvent(ctx context.Context, event events.Event) (bool, error) {
switch typedEvent := event.(type) {
Collaborator

lol typedEvent is a way better name than switch ev := event.(type) { (which is what we have in other places iirc)

ran <- returnValue
close(ran)
}()
waitForDeallocatorEvent(t, deallocator, expectedUpdates)
Collaborator

i was wondering how you fixed the bug here. i believe you issue an event on notifyEventChan after the setup phase?

a better pattern than doing that would be to add a variable to the deallocator:

type Deallocator struct {
    // other fields elided
    started chan struct{}
}

Then, after the deallocator has started, you do:

close(started)

Finally, instead of passing expectedUpdates, you should just check if the db state is in the state you want.

Contributor Author

Honestly not sold on this one. Adds yet another field for no great reason, and checking whether the deallocator actually performed any DB operations is an additional check on top of just checking that the DB is in the state we expect (which we also do).

@wk8 wk8 force-pushed the wk8/network_rm branch 2 times, most recently from 8a5e572 to b02a280 on October 17, 2018 19:44
@dperny
Collaborator

dperny commented Oct 22, 2018

LGTM now.

// services, as well as all service-level resources, can only be deleted
// after all the service's containers have properly shut down
// when a user requests a deletion, we just flip this flag
// the deallocator will take it from there
Contributor

It'll probably be helpful to briefly mention how the deallocator uses this flag.

Contributor Author

👍


// PendingDelete indicates that this service's deletion has been requested
// services, as well as all service-level resources, can only be deleted
// after all the service's containers have properly shut down
Contributor

nit: all the => all of the

Contributor Author

👍

// PendingDelete indicates that this service's deletion has been requested
// services, as well as all service-level resources, can only be deleted
// after all the service's containers have properly shut down
// when a user requests a deletion, we just flip this flag
Contributor

Is there a period before this sentence?

Contributor Author

👍

// services, as well as all service-level resources, can only be deleted
// after all the service's containers have properly shut down
// when a user requests a deletion, we just flip this flag
// the deallocator will take it from there
Contributor

same as above.

Contributor Author

👍

"github.com/docker/swarmkit/manager/state/store"
)

// A Deallocator waits for services to fully shutdown (ie no containers left)
Contributor

nit: remove 'A' at the beginning?

Contributor Author

👍

// user.
//
// NOTE: since networks are the only service-level resources as of now,
// it has been deemed over-engineered to have a clean generic way to
Contributor

it has been deemed over-engineered to have a clean generic way

yikes!

Contributor Author

Removed "clean" ;) didn't mean this comment that way

// handle other types of service-level resources; if we ever start
// having more of those and thus want to reconsider this choice, it
// might be worth having a look at this archived branch, that does
// implements a way of separating the code for the deallocator itself
Contributor

implements => implement

Contributor Author

👍

doneChan chan struct{}
}

type serviceWithTaskCounts struct {
Contributor

Please add a comment.

Contributor Author

👍

// Run starts the deallocator, which then starts cleaning up services
// and their resources when relevant (ie when no tasks still exist
// for a given service)
// This is a blocking function
Contributor

thx for calling this out!


api.EventUpdateNetwork{})

if err != nil {
// if we have an error here, we can't proceed any further, let the world burn
Contributor

roflol. Perhaps say something less dramatic?

Contributor Author

👍

anyUpdated := false
// now let's populate our internal taskCounts
for _, service := range allServices {
if updated, _ := deallocator.processService(ctx, service); updated {
Contributor

please add a comment on what processService() does

Contributor Author

👍

}
}

// now we just need to wait for events
Contributor

what events are we waiting for?

Contributor Author

The ones that were listed as args to store.ViewAndWatch just a few lines above? I'm not sure how to say that without being redundant?

<-deallocator.doneChan
}

func (deallocator *Deallocator) notifyEventChan(updated bool) {
Contributor

please add a comment.

Contributor Author

👍

for {
select {
case event := <-eventsChan:
if updated, err := deallocator.processNewEvent(ctx, event); err == nil {
Contributor

add a comment on what processNewEvent() returns and how it's used.

Contributor Author

👍

}

func (deallocator *Deallocator) notifyEventChan(updated bool) {
if deallocator.eventChan != nil {
Contributor

should this ever be called when deallocator.eventChan is nil?

Contributor Author

Yes. Added comments.

// if in doubt, let's proceed to clean up the service anyway
// better to clean up resources that shouldn't be cleaned up yet
// than ending up with a service and some resources lost in limbo forever
return true, deallocator.deallocateService(ctx, service)
Contributor

I don't think this is a good idea, especially because deletion can't be reversed. Also, I think it's better to have a leaked service object than corruption (tasks running without a service object). What do you think?

Contributor Author

The whole point of this patch is to start asynchronously deleting services and their resources. That means that if errors occur when actually deleting, we can't just surface those to the user who requested the deletion in the first place, since their request has long since returned.

That leaves us with just a few options in my opinion:

  1. not actually deleting anything when we encounter an error (in which case it would be nice to give the user say a "request ID" they can use to check on the status of a delete they've requested, and also the ability to force a delete after the normal flow has errored out)
  2. always do a best-effort deletion; in which case, sure, we can end up aggressively deleting services and resources that might actually still be in use; but then that's the current behaviour anyway, so it's not worse - and the hope then (as now) is that the system eventually self-recovers when tasks eventually do shut down anyway
  3. a refinement on 2 would be adding exponential retries instead, but that does strike me as over-engineered, given the low impact of a premature deletion, together with the very low likelihood of that happening in the first place (how often do store operations realistically fail, really? At leadership change?)

Contributor

I see your point with 2 that this behavior is the same as the previous behavior. Looks good as you have it currently.

Contributor Author

👍

return false, nil
}

func (deallocator *Deallocator) deallocateService(ctx context.Context, service *api.Service) (err error) {
Contributor

add a comment.

Contributor Author

👍

}
for _, networkConfig := range networkConfigs {
if network := store.GetNetwork(tx, networkConfig.Target); network != nil {
deallocator.processNetwork(ctx, tx, network, ignoreServiceID)
Contributor

How do you handle the return value from this function, failures etc?


Contributor

Should errors be at least logged?

Contributor Author

Both processNetwork and processService log errors


// and deallocate networks that may be marked for deletion and aren't used any more
for _, network := range allNetworks {
if updated, _ := deallocator.processNetwork(ctx, nil, network, nil); updated {
Contributor

do errors need to be handled or just ignored?


Contributor

Should errors be at least logged?


return
}

updateFunc := func(t store.Tx) error {
Contributor

add comments for updateFunc. Also do we really need a function object?

Contributor Author

We do. This method can be called either with or without an open transaction, depending on whether it's done as part of a service cleanup or not. See how it's used in the if tx == nil ... else just below.
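
In other words, something along these lines (a simplified sketch of the pattern described above, not the verbatim code):

// reuse the caller's open transaction if there is one, otherwise run
// updateFunc in a fresh update transaction
if tx == nil {
	err = deallocator.store.Update(updateFunc)
} else {
	err = updateFunc(tx)
}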

return
}

// proceeds to deallocating a network if it's pending deletion and there no
Contributor

nit: deallocating => deallocate

Contributor Author

👍


// proceeds to deallocating a network if it's pending deletion and there no
// longer are any services using it
func (deallocator *Deallocator) processNetwork(ctx context.Context, tx store.Tx, network *api.Network, ignoreServiceID *string) (updated bool, err error) {
Contributor

Please add comments for each of the arguments accepted by the function. e.g. it's not obvious what ignoreServiceID does.

Contributor Author

👍

return
}

func (deallocator *Deallocator) processNewEvent(ctx context.Context, event events.Event) (bool, error) {
Contributor

plz add a comment. All functions should have comments unless it's blatantly obvious.

Contributor Author

👍

// PendingDelete indicates that this network's deletion has been requested.
// Services, as well as all service-level resources, can only be deleted
// after all of the service's containers have properly shut down.
// When a user requests a deletion, we just flip this flag
Contributor

nit:

When a user requests a deletion

When a user requests a deletion of this network



Contributor

@anshulpundir anshulpundir left a comment

A few minor comments. Looks good otherwise!





// proceeds to deallocate a network if it's pending deletion and there no
// longer are any services using it
// actually deletes the network if it's marked for deletion and no services are
// using it any more (or the only one using it has ID `ignoreServiceID`, if not
Contributor

Thanks for adding this, but I don't think it's obvious to a first-time reader. Why is it OK to deallocate the network when the only one using it has ID ignoreServiceID?

@wk8 wk8 force-pushed the wk8/network_rm branch 2 times, most recently from f731bf7 to b828d81 on October 30, 2018 19:53
This patch adds a new manager/deallocator package.

The deallocator's job is to take over deletion of services and service-level
resources (as of right now, only networks, but e.g. volumes would also fall into
the same category).

The deallocator relies on the reaper correctly deleting tasks for services that
are shutting down, and only when no tasks are left within a service does it
then proceed to delete that service and its resources.

This patch does not yet actually put that new component to use; that will be
done in a future, separate patch, for the sake of easier reviews.

Special care has been taken to separate the deallocator itself from each of the
service-level resource implementations.

Includes unit tests.

Signed-off-by: Jean Rouge <[email protected]>
@anshulpundir anshulpundir merged commit 46cb7b3 into moby:master Oct 30, 2018
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Nov 1, 2018
Changes included;

- moby/swarmkit#2735 Assign secrets individually to each task
- moby/swarmkit#2759 Adding a new `Deallocator` component
- moby/swarmkit#2738 Add additional info for secret drivers
- moby/swarmkit#2775 Increase grpc max recv message size
  - addresses moby#37941
  - addresses moby#37997
  - follow-up to moby#38103

Signed-off-by: Sebastiaan van Stijn <[email protected]>
wk8 added a commit to wk8/swarmkit that referenced this pull request Nov 5, 2018
**- What I did**

This patch actually starts to use the deallocator to clean up services, as
well as service-level resources (as of now, only networks).

**- How I did it**

A previous patch (moby#2759) introduced a
new component, the deallocator, responsible for cleaning up services and
service-level resources. This patch is actually starting to make use of that
component.

Since the deallocator used to rely on the reaper deleting the tasks belonging
to services that had been marked for removal, a previous version of this patch
was modifying the task reaper quite heavily to also keep track of such services
(needed since otherwise the reaper would fail to clean up all of them, instead
keeping some for history tracking purposes). However, it soon appeared that
this was not the best approach:
 * this creates a hidden coupling between the reaper and the deallocator
 * it's also not the best user experience to suddenly remove all of a service's
   task history while shutting down, for no apparent reason to the user

For these reasons, this patch instead amends the deallocator to also look at task statuses when keeping track of how many live tasks remain for a service.

**- How to test it**

Updated tests.

**- Description for the changelog**

Services & networks are no longer deleted immediately when a user requests their
deletion; instead, they are deleted when all their tasks are actually shut down.

Signed-off-by: Jean Rouge <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Nov 12, 2018
Changes included;

- moby/swarmkit#2735 Assign secrets individually to each task
- moby/swarmkit#2759 Adding a new `Deallocator` component
- moby/swarmkit#2738 Add additional info for secret drivers
- moby/swarmkit#2775 Increase grpc max recv message size
  - addresses moby/moby#37941
  - addresses moby/moby#37997
  - follow-up to moby/moby#38103

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Upstream-commit: be3843c8c8fb30b4a604dae9d0dad3d393db717c
Component: engine
wk8 added a commit to wk8/moby that referenced this pull request Jan 12, 2019
Namely the `PendingDelete` fields added to both `Services`
and `NetworkResources` in
moby/swarmkit#2759

Also amended relevant tests to now wait for services to be
fully shut down after removing them.

Signed-off-by: Jean Rouge <[email protected]>
wk8 added a commit to wk8/swarmkit that referenced this pull request Feb 1, 2019
**- What I did**

This patch allows users to choose to make removals of services and networks
asynchronous, which means that they only get deleted once all the containers
belonging to them or using them are safely shut down.

**- How I did it**

A previous patch (moby#2759) introduced a
new component, the deallocator, responsible for cleaning up services and
service-level resources. This patch is actually starting to make use of that
component.

Since the deallocator used to rely on the reaper deleting the tasks belonging
to services that had been marked for removal, a previous version of this patch
was modifying the task reaper quite heavily to also keep track of such services
(needed since otherwise the reaper would fail to clean up all of them, instead
keeping some for history tracking purposes). However, it soon appeared that
this was not the best approach:
 * this creates a hidden coupling between the reaper and the deallocator
 * it's also not the best user experience to suddenly remove all of a service's
   task history while shutting down, for no apparent reason to the user

For these reasons, this patch instead amends the deallocator to also look at task statuses when keeping track of how many live tasks remain for a service.

Part of this patch was already reviewed in
moby#2778, which subsequently got reverted
following the discussion in moby/moby#38525.
The main difference between this PR and moby#2778 is that asynchronous removal is
an opt-in behaviour here instead of changing the existing behaviour.

**- How to test it**

Updated tests.

**- Description for the changelog**

Services & networks can be deleted asynchronously, in which case the request
for their removal simply marks them for removal, and they are then actually
deleted only when all their tasks are actually shut down.

Signed-off-by: Jean Rouge <[email protected]>