Skip to content

[Tooling] Flaky tests - Issue #192#198

Merged
deblasis merged 29 commits intopokt-network:mainfrom
deblasis:issue/192-tooling-flaky-tests
Sep 29, 2022
Merged

[Tooling] Flaky tests - Issue #192#198
deblasis merged 29 commits intopokt-network:mainfrom
deblasis:issue/192-tooling-flaky-tests

Conversation

@deblasis
Copy link
Contributor

@deblasis deblasis commented Sep 8, 2022

Description

This PR addresses test flakiness due to:

  • race conditions
  • use of timers that causes non-determinism

Tests are still flaky

TestHotstuff4Nodes1BlockHappyPath

for i in {1..10}; do go test -timeout 30s -count=1 -run ^TestHotstuff4Nodes1BlockHappyPath$ github.com/pokt-network/pocket/consensus/consensus_tests; done | egrep  '^\--- FAIL:|ok '

image

TestTinyPacemakerTimeouts

for i in {1..10}; do go test -timeout 30s -count=1 -run ^TestTinyPacemakerTimeouts$ github.com/pokt-network/pocket/consensus/consensus_tests; done | egrep  '^\--- FAIL:|ok '

image

Time mocking

Following the input in this comment I proceeded implementing time mocking.

The idea is that the nodes will have a reference that can proxy calls to either the real time package or a mocked version.
I am using github.com/benbjohnson/clock and I am starting to build utility functions around it that could become a wrapper at some point but it's early to say.

Something like this:
image

I believe this functionality is going to be quite useful in general especially since I believe we'll have to deal with timeouts and time-bound stuff quite a bit.

Race conditions

Mostly due to the fact that structs are accessed by multiple goroutines at the same time.

Addressed using mutexes and wrapping the getters/setters into methods that also handle lock/unlock.

Issue

Fixes issue/192

Type of change

Please mark the options that are relevant.

  • New feature, functionality or library
  • Bug fix (non-breaking change which fixes an issue)
  • Code health or cleanup
  • Other: Investigation

List of changes

  • Introduced time mocking
  • Fixed race conditions in Consensus module

Testing

  • make test_all
  • LocalNet w/ all of the steps outlined in the README

Checklist

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have tested my changes using the available tooling
  • If applicable, I have made corresponding changes to related local or global README
  • If applicable, I have added new diagrams using mermaid.js
  • If applicable, I have added tests that prove my fix is effective or that my feature works

@deblasis deblasis added core Core infrastructure - protocol related consensus Consensus specific changes tooling tooling to support development, testing et al labels Sep 8, 2022
@deblasis deblasis self-assigned this Sep 8, 2022
@deblasis
Copy link
Contributor Author

deblasis commented Sep 8, 2022

getting somewhere
image

@deblasis
Copy link
Contributor Author

10 runs with and without the -race flag and no flakiness detected 🙌

image

@deblasis deblasis marked this pull request as ready for review September 14, 2022 12:44
@deblasis deblasis requested a review from a team September 14, 2022 12:44
@deblasis deblasis changed the title [WIP] [Tooling] Flaky tests - Issue #192 [Tooling] Flaky tests - Issue #192 Sep 14, 2022
@deblasis
Copy link
Contributor Author

Hey @okdas! Can you please review this specific commit?

6e0b00f

I made sure we skip the dockerhub steps that currently are simply failing because not fully setup.

This way we don't get false negatives in CI runs.

@Olshansk Olshansk self-requested a review September 15, 2022 22:13
Copy link
Collaborator

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General: I did a first pass at the review and think this is super interesting. It has the foundation for important infrastructure we'll really need in the future when we do more extensive testing. 🧑‍🔬 🤔 🧪

Will do a second pass after we tend to the first set of comments.

Random Ideas: Thoughts on creating PocketClock module. Not creating an issue for it (yet) since it's just an idea, but wanted to see if you think it'd be overkill.

Primary concern: I did some googling and it seems that Setters and Getters in go are idiomatic, but I'm concerned that:

  1. We need a way to enforce/encourage this because of the mutex requirements (you can imagine the issues as we have more contributors)
  2. Grep & update all existing sets/gets

@okdas
Copy link
Contributor

okdas commented Sep 16, 2022

I made sure we skip the dockerhub steps that currently are simply failing because not fully setup.

This should really not happen because this part is set up for sure and usually works, but then sometimes docker hub does not authenticate and throws an error. I'll check why that might happen. Thank you for calling this out.

@deblasis deblasis requested a review from Olshansk September 19, 2022 16:47
@deblasis deblasis requested a review from Olshansk September 21, 2022 16:47
Copy link
Collaborator

@Olshansk Olshansk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're almost on the home stretch!!

Also, for future reference, you should be able to create branches directly from the main repo; deblasis:issue/192-tooling-flaky-tests.

@deblasis deblasis merged commit 2ff6b24 into pokt-network:main Sep 29, 2022
Olshansk added a commit that referenced this pull request Oct 5, 2022
## Description

Consolidate & update techdebt in the consensus module (excluding the pacemaker module which is being modified by #198) in preparation for an end-to-end state hash flow explored in #237.

## Issue

Fixes #249

## Type of change

- [ ] New feature, functionality or library
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] Code health or cleanup
- [ ] Major breaking change
- [X] Documentation
- [ ] Other: NA

## List of changes

**Consensus logic**
- Pass in a list of messages to `findHighQC` instead of a hotstuff step
- Made `CreateProposeMessage` and `CreateVotemessage` accept explicit values, identifying some bugs along the way
- Made sure to call `applyBlock` when using `highQC` from previous round
- Moved business logic for `prepareAndApplyBlock` into `hotstuff_leader.go`
- Removed `MaxBlockBytes` and storing the consensus genesis type locally as is

**Consensus cleanup**
- Using appropriate getters for protocol types in the hotstuff lifecycle
- Replaced `proto.Marshal` with `codec.GetCodec().Marshal`
- Reorganized and cleaned up the code in `consensus/block.go`
- Consolidated & removed a few `TODO`s throughout the consensus module
- Added TECHDEBT and TODOs that will be require for a real block lifecycle
- Fixed typo in `hotstuff_types.proto`
- Moved the hotstuff handler interface to `consensus/hotstuff_handler.go` 

**Consensus testing**
- Improved mock module initialization in `consensus/consensus_tests/utils_test.go`

**General**
- Added a diagram for `AppHash` related `ContextInitialization`
- Added `Makefile` keywords for `TODO`

## Testing

- [x] `make test_all`
- [x] [LocalNet](https://github.com/pokt-network/pocket/blob/main/docs/development/README.md) w/ all of the steps outlined in the `README`

## Checklist

- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have tested my changes using the available tooling
- [ ] If applicable, I have made corresponding changes to related local or global README
- [x] If applicable, I have added new diagrams using [mermaid.js](https://mermaid-js.github.io)
- [ ] If applicable, I have added tests that prove my fix is effective or that my feature works
Olshansk added a commit that referenced this pull request Jan 11, 2023
## Description

The original motivation for this PR was to fix the `block is nil` on LocalNet.

While testing (using unit tests), debugging and investigating (using LocalNet), several issues were uncovered in the consensus testing framework introduced by #198. The goal of this PR changed to fix those issues and improve the quality of the consensus codebase.

## Issue

Tends to "quality of life" improvements, which are somewhat related to #361.

## Type of change

Please mark the relevant option(s):

- [ ] New feature, functionality or library
- [x] Bug fix
- [ ] Code health or cleanup
- [ ] Major breaking change
- [x] Documentation
- [ ] Other <!-- add details here if it a different type of change -->

## List of changes

### Consensus - Core

- Force consensus to use a "star-like" broadcast instead of "RainTree" broadcast
- Improve logging throughout through the use of emojis and rewording certain statements
- Slightly improve the block verification flow (renaming, minor fixes, etc…) to stabilize LocalNet

### Consensus - Tests

- Rename the `consensus_tests` package to `e2e_tests`
- Internalize configuration related to `fail_on_extra_msgs` from the `Makefile` to the `consensus` module
- Forced all tests to fail if we receive extra unexpected messages and modify tests appropriately
- After #198, we made tests deterministic but there was a hidden bug that modified how the test utility functions because the clock would not move while we were waiting for messages. This prevented logs from streaming, tests from failing, and other issues. Tend to all related changes.

### Consensus - Pacemaker

- Rename `ValidateMessage` to `ShouldHandleMessage` and return a boolean
- Pass a `reason` to `InterruptRoudn`
- Improve readability of some parts of the code

### P2P

- Add a lock to the mempool to avoid parallel messages which has caused the node to crash in the past

### Configs
- Reorder private keys so addresses (retrieved by transforming private keys) to reflect the numbering in LocalNet appropriately. The address for val1, based on config1, will have the lexicographically first address. This makes debugging easier.

## Testing

- [x] `make test_consensus`
- [x] `make develop_test`
- [x] [LocalNet](https://github.com/pokt-network/pocket/blob/main/docs/development/README.md) w/ all of the steps outlined in the `README`

## Required Checklist

- [x] I have performed a self-review of my own code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have tested my changes using the available tooling
- [ ] I have updated the corresponding CHANGELOG

### If Applicable Checklist
- [ ] I have updated the corresponding README(s); local and/or global
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added, or updated, [mermaid.js](https://mermaid-js.github.io) diagrams in the corresponding README(s)
- [ ] I have added, or updated, documentation and [mermaid.js](https://mermaid-js.github.io) diagrams in `shared/docs/*` if I updated `shared/*`README(s)


Co-authored-by: goku <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

consensus Consensus specific changes core Core infrastructure - protocol related tooling tooling to support development, testing et al

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Tooling] Flaky tests

3 participants