Begin building a common compliance test suite. #13

Merged
robshakir merged 51 commits into main from fluent-7 on Jun 10, 2021
Conversation

@robshakir (Member)

commit ed12d4b9d95f566374552e84122b5ac0a9b4072e
Author: Rob Shakir <robjs@google.com>
Date:   Sun Jun 6 15:35:00 2021 -0700

    Create compliance test library, add error check support
    
      * (M) chk/chk.go
        - Add support for checking whether the received errors from the
          server contain a particular error.
      * (A) compliance/compliance.go
      * (A) compliance/compliance_test.go
        - Create a compliance testing library for tests that are relevant
          to other implementations.
      * (M) fluent/fluent.go
      * (M) fluent/fluent_test.go
        - Remove compliance tests from fluent_test.go
        - Implement builder for Modify errors.
      * (A) testcommon/testcommon.go
        - Move testServer into a common package.

robshakir and others added 23 commits June 2, 2021 11:10
  * (M) client/client.go
    - TODOs for implementation details that are pending.
  * (M) client/client.go
    -  add handling for receiving messages from a client into
       restructured pending and result queues that include more
       info including the timestamp and result codes.
  * (M) client/client_test.go
    - add test coverage for all non-integration parts.
  * (M) fluent/fluent.go
    - Since the client connection parameters might be across
      different RPCs, restructure the fluent client (per the design
      doc) to encapsulate the connection parameters.
  * (M) fluent/fluent_test.go
    - Update existing tests to absorb the changes described above.
  * (M) client/client.go
  * (M) client/client_test.go
    - Restructure pending queues to be able to store the type of
       transaction that is pending, not just operations - such that
       it is possible to track latency of other operations and ensure
       that the client is aware that it has pending non-operation
       requests.
    - Add a converged method to check whether there are any pending
      requests from the server in the client.
  * (M) fluent/fluent.go
  * (M) fluent/fluent_test.go
    - Add initial Await() implementation.
  * (A) chk/chk.go
  * (A) chk/chk_test.go
    -  Add a check library that can be used to determine
        characteristics of the results that are returned by the client.
        This library is a fluent-style helper library for gRIBI client
        results to maintain readability of our test cases.
  * (M) client/client.go
    - Change the modifyCh to be unbuffered such that we have blocking
      writes, this is required to ensure that we do not lose messages
      and ensure that Await operates as expected.
    - Add String debugging output to OpResult.
  * (M) fluent/fluent.go
  * (M) fluent/fluent_test.go
    - Add Results method, and extend testing to results.
  * (M) go.mod
  * (M) go.sum
    - Housekeeping
  * (M) server/server.go
  * (M) server/server_test.go
    - Add support and testing for master election.
  * (M) client/client.go
    -  Avoid appending an empty session parameters request with no
       provided parameters.
  * (M) fluent/fluent_test.go
    - Handle test flake because of timing issues.
  * (M) client/client_test.go
    - Remove sending empty parameters message when no parameters are
      set.
  * (M) server/server.go
  * (M) server/server_test.go
    - Add handling of session parameters validation and checks against
      the supported capabilities of the fake.
  * (M) client/client.go
  * (M) client/client_test.go
    - Add new option to allow for persistence to be sent to the server.
    - Ensure that a client with errors is marked as converged.
  * (M) fluent/fluent.go
  * (M) fluent/fluent_test.go
    - Plumb persistence through fluent client.
    - Add test case to check that a client with errors does not hang
      indefinitely.
  * (M) server/server.go
  * (M) server/server_test.go
    - Fix error handling.
  * (M) chk/chk.go
    - Add support for checking whether the received errors from the
      server contain a particular error.
  * (A) compliance/compliance.go
  * (A) compliance/compliance_test.go
    - Create a compliance testing library for tests that are relevant
      to other implementations.
  * (M) fluent/fluent.go
  * (M) fluent/fluent_test.go
    - Remove compliance tests from fluent_test.go
    - Implement builder for Modify errors.
  * (A) testcommon/testcommon.go
    - Move testServer into a common package.
@robshakir robshakir changed the base branch from main to fluent-6 June 6, 2021 22:37
@robshakir robshakir requested a review from sthesayi June 6, 2021 22:38
c.Connection().WithTarget(addr)
c.Start(context.Background(), t)
c.StartSending(context.Background(), t)
time.Sleep(100 * time.Millisecond)
Contributor

May be good to explain the sleep. Do we need a loop or is 100ms practically good enough for de-flaking the test.

Member Author

Thanks - I'll add some explanation in here, but thoughts are welcome.

Without this sleep, the test is flaky because:

  • We start the server, which is listening. The client connects at c.Start().
  • We start the client sending at c.StartSending(); at this point it empties the queues that we created in the c.Connection() call.
  • Whilst we're doing that, we call Await - and because the sendq is empty and the messages haven't yet made it onto the wire (so they're not in the pendq), we consider the client converged.

100msec (and likely less) gives the client time to get the messages onto the wire and add them to the pending queues.

Writing this down helped me think about this a bit more, and helped me figure out where the synchronisation issue was -- I introduced a couple of new booleans that indicate that the client is sending and/or receiving - i.e., that we are in the window between sending or receiving a message and post-processing it to add it to the right queue. We can't do this the other way around, as otherwise we won't know whether the send/recv was actually successful.

We also don't want there to be something blocking here, since this could result in a deadlock if isConverged is called at a time when a goroutine wants to grab the lock to actually handle a message, so I've adopted an atomic boolean that is used to ensure that isConverged cannot return true during periods that we know that we have pending post-processing.

This means that tests are de-flaked without the sleep :-D

Member Author

Gosh, this was quite an adventure.

There were a number of really subtle races here of the type described above -- the one that was only triggering on GitHub Actions (and not on GCP, AWS, a private VM, any of my Macs, my embedded ARM64 machine... like, anywhere) was one whereby:

  • the client took a message out of the sendq whilst holding the mutex, and wrote it to modifyCh.
  • whilst the goroutine that handled modifyCh took the message off the channel and was in the process of adding it to the pendq (through handleModifyRequest), Await() was called. At this time, there was nothing pending (we didn't add anything there), nothing in the channel (we took the message off there), and nothing in the send queue (we took the message out of there!) - i.e., the message was either directly in the process of being sent to the socket, /or/ we were part way through processing.
  • Await looked at all of these things, and thought - boy, we're converged, so returned.
  • Results was called, and didn't have the result for the 2nd message. It was always the 2nd message, because, hey - we held the pending queue lock during receiving the first message, so we slowed down adding the 2nd message to the pending queue.

So, what did I do? Other than really struggle to debug this?

  • added a sync.RWMutex called awaiting. This mutex is read-locked every time a goroutine enters a state where it is doing something with a message that is sent or received. awaiting is write-locked by Await when it tries to read all the state - this means that if anyone is doing something where there's a message in flight, Await blocks and waits for things to come back; it also means that no-one can do any new work whilst we're reading the status, so the queues will remain consistent. awaiting having this form means that if we haven't yet sent the message to the socket, Await will block; if we're processing a message, we'll also block.
  • Pre-process messages when they are submitted with Q to add them to the pending queue -- we expect messages that we get to be valid /enough/ to store them as a pending transaction (in the future, for completely invalid messages, we'll give access to the stream itself); if they're not, we throw an error and the client returns. This removes the race whereby we're mid-way through sending a message to the channel/getting it from there when someone calls Await.

I switched bool types that were not protected by mutexes to being atomically updated. I'd like to discuss this a bit -- it seems like we might still want a mutex rather than sync/atomic; if we're not checking the value when we're writing (e.g., https://github.com/openconfig/gribigo/pull/13/files#diff-bd3a55a72186f59e2e63efb4951573b2f9e4a7cc98086e922b0859f8ccc1dd09R234), it seems like it is safe for us to not have a mutex and just update this value in place.

I also started to think about how to cover this in tests going forward - it's clear that this logic is complex enough that the integration tests that were being used are not sufficient.

Contributor

Wow, that is quite a thorough explanation. As a general rule, I think a mutex is better than atomic variables unless we have a single atomic variable - ultimately, atomic variables are also a sort of mutex built on a CAS instruction. I will have to admit that my review has been mostly superficial, looking for anything that smells :) I think I am following what you are saying about the various queues. I wonder if there is a need to use counters instead of simple booleans - I mean, push and pop from queues that are backed by a counter indicating length. Or, if you have a mutex, then you can simply use the size(). The key point is protecting with an appropriate guard that wraps a complex operation so that it is atomic.

Member Author

ACK - agreed.

robshakir added 4 commits June 7, 2021 11:13
  * (M) client/client.go
  * (M) client/client_test.go
    - Add atomic.Bool to indicate whether we are in the process of
      sending or receiving.
  * (M) compliance/compliance.go
  * (M) fluent/fluent_test.go
    - Remove time.Sleep!
  * (M) go.mod
  * (M) go.sum
    - New dependency on Uber's atomic package.
  * (M) compliance/compliance.go
    - Actually remove time.Sleep.
    - Add docstring comments.
  * (M) fluent/fluent_test.go
    - Actually remove time.Sleep.
@robshakir robshakir changed the base branch from fluent-6 to main June 7, 2021 21:07
@robshakir robshakir changed the title from "[Depends PR#12] Begin building a common compliance test suite." to "Begin building a common compliance test suite." Jun 7, 2021
@robshakir (Member Author)

@sthesayi PTAL.

It's clear I vastly underestimated the number of story points that this was going to be based on this debugging alone!

@sthesayi (Contributor) left a comment

The changes look good. I will probably need to go through the logic in more detail when I have some time - probably later today. I want to follow the use of the various atomic variables and the mutexes. At first glance it seems quite complicated, and probably for good reason. I wonder if there is a simpler way using channels, but I can't quite answer that without a deeper understanding of the queues.

// queue (enqued by Q) to the connection established by Connect.
func (c *Client) StartSending() {
-	c.qs.sending = true
+	c.qs.sending.Store(true)
Contributor

When will this bool be set to false?

Member Author

It is set to false in two cases:

  • by default, before the client has called StartSending - so to start with, the client doesn't actually put messages on the wire. This allows us to handle the case where we want to queue up a bunch of things to send when we connect.
  • by some future StopSending function, which will allow us to test some timing-related areas.

@robshakir (Member Author)

Suresh, thanks for the detailed comments here - and potential future review.

I do think that there might be some opportunity to simplify this using channels, I just couldn't find the way to do it. The issue breaks down to:

  • We have a goroutine doing sends to the Modify stream, one that does reads from that Modify stream, and then a set that are triggered by the public API that says "check that no work remains to be done, and if there is any, block" (Await).
  • There were races happening because even though we're protecting everything inside of the queues with fine-grained mutexes around those queues, Await would check whether the queues are empty during the period that one of the other goroutines was just handling the message and moving it between queues.
  • Thus we need something that can:
    • allow messages to be processed in parallel (i.e., a Send to the stream should not block a Read from the stream).
    • be used to block Await from returning whilst any moving of messages between queues is in progress.
  • The usual pattern I'd have used here would be a chan struct{} that is written to by the send and receive goroutines to say "done with all processing, you can return". However, the problem I was running into is that if you create such a channel and read from it in Await, you can get deadlocks if there's no message being sent or received at the time that Await is called.
  • Thus the solution here is to use a single RWMutex where:
    • the send and receive goroutines take a readlock whilst they're processing a message, such that they don't block each other
    • the Await function takes a write lock on the mutex, so it can only acquire it once the send+receive goroutines are complete, and it blocks new send and receive operations taking place until it is released.
  • The mutexes around the queues are not sufficient here, since we're always moving between two queues - send -> pending, or pending -> result. We can end up with deadlocks if we get a lock on only one of these things - where each goroutine is waiting for the other lock.

The other thing I spent a lot of time experimenting with was whether you could use sync.WaitGroup here, but this comes with its own set of complexity since we're not really waiting on goroutines, we're waiting on functions within long-lived goroutines.

I'm very open to there being a better way to do this :-) Maybe it was post-midnight thinking but I couldn't find one that worked.

@robshakir (Member Author)

Merging this PR, we can spend some time on this going forward - I don't think there's an immediate alternative here.

@robshakir robshakir merged commit d646a4f into main Jun 10, 2021
@robshakir robshakir deleted the fluent-7 branch June 10, 2021 03:42