Skip to content

Conversation

@sbkok
Copy link
Collaborator

@sbkok sbkok commented Aug 31, 2022

Why?

When the Lambda service is experiencing issue, it could affect the execution of the Step Function. Most of the errors it would fail with could be recovered from.

Additionally, with large ADF installations the Lambda function could timeout when an update is performed. As it was not able to recover, the whole update would fail in a hard-to-fix state.

What?

  • Added retry logic around Lambda service issues.
  • Plus added a retry mechanism to recover from Lambda function executions that ran for too long. To retry those automatically too.
  • Improved log messages, so they output in the same structure within the bootstrap State Machine. The logs were using a mix of output styles, making it hard to debug.
  • Fixed indentation of State Machine definitions in our CloudFormation stacks, some were using a mixture of 4 spaces and 2 spaces, while we should use 2 spaces here.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sbkok added 4 commits August 31, 2022 12:16
**Why?**

The Lambda function in the account bootstrap times out when
on-boarding a large number of regions at once.

In some cases, this happens just after the CloudFormation Change Set
is created, leaving the stack in `REVIEW_IN_PROGRESS` state.

The next time the Step Function for account bootstrapping would
run, it detects that the stack exists and attempts to update it.
While in fact, it still needs to be created.

**What?**

This set of changes will improve the log messages, to ease
troubleshooting these issues in the future.

It will detect if the stack is in a clean state, if so it will update.
If not, it will remove the stack and recreate it instead.

The same holds for any Change Sets that were left on the Stack from a
prior execution. These will get deleted before a new one is created, as
these might contain outdated changes.

To ensure this error will not happen frequently, the Lambda function
will throw an error when there is not enough time to handle another
deployment. Such that the Step Function can retry with sufficient time
available to perform updates at the next attempt.
**Why?**

When the Lambda service is experiencing issue, it could affect the
execution of the Step Function. Most of the errors it would fail with
could be recovered from.

**What?**

Added retry logic around Lambda service issues. Plus added a retry
mechanism to recover from Lambda function executions that ran for too
long. To retry those automatically too.
@sbkok sbkok added this to the v3.2.0 milestone Aug 31, 2022
@sbkok sbkok requested review from StewartW and javydekoning August 31, 2022 13:33
REGION_DEFAULT = os.environ["AWS_REGION"]
PARTITION = get_partition(REGION_DEFAULT)
LOGGER = configure_logger(__name__)
DEPLOY_TIME_IN_MS = 5 * 60 * 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this magic number?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout for the Lambda function is set to 10 minutes. See https://github.com/awslabs/aws-deployment-framework/blob/24ba1e58b419796e7268d73b7691099f5a0509f2/src/template.yml

This lambda function is responsible for deploying the bootstrap stack inside different regions.
Deploying that takes about 2 minutes in ideal situations.
To make sure we don't fail to deploy because of a Lambda timeout midway, I put the limit to 5 minutes. So it will only continue to deploy if it has enough time to complete the execution. Otherwise it throws an error which will trigger it to run another time. The next run, it skips updating the regions it deployed to already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay doke! Thanks

@sbkok sbkok merged commit 062e101 into awslabs:master Aug 31, 2022
@sbkok sbkok deleted the fix/bootstrap-recovery branch August 31, 2022 14:47
sbkok added a commit to sbkok/aws-deployment-framework that referenced this pull request Aug 31, 2022
**Why?**

In PR awslabs#513, an error was introduced which broke the syntax of the State Machine
definition.

While investigating, I discovered a number of other improvements that might
help, addressing these in this change.

**What?**

* Fixing missing and redundant commas in retry logic.
* Moving roles and other properties of State Machines to the beginning, so they
  don't float around at the end of the state machine definition.
* Enabled tracing on state machines.
* Updated name of Account Bootstrapping state machine.
@sbkok sbkok mentioned this pull request Aug 31, 2022
sbkok added a commit that referenced this pull request Sep 1, 2022
* Fix ADF State Machines

**Why?**

In PR #513, an error was introduced which broke the syntax of the State Machine
definition.

While investigating, I discovered a number of other improvements that might
help, addressing these in this change.

**What?**

* Fixing missing and redundant commas in retry logic.
* Moving roles and other properties of State Machines to the beginning, so they
  don't float around at the end of the state machine definition.
* Enabled tracing on state machines.
* Updated name of Account Bootstrapping state machine.

* Include tenacity in pipeline management lambda reqs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants