Add retry logic to SF Lambda invocations #513

sbkok · 2022-08-31T13:33:24Z

Why?

When the Lambda service is experiencing issue, it could affect the execution of the Step Function. Most of the errors it would fail with could be recovered from.

Additionally, with large ADF installations the Lambda function could timeout when an update is performed. As it was not able to recover, the whole update would fail in a hard-to-fix state.

What?

Added retry logic around Lambda service issues.
Plus added a retry mechanism to recover from Lambda function executions that ran for too long. To retry those automatically too.
Improved log messages, so they output in the same structure within the bootstrap State Machine. The logs were using a mix of output styles, making it hard to debug.
Fixed indentation of State Machine definitions in our CloudFormation stacks, some were using a mixture of 4 spaces and 2 spaces, while we should use 2 spaces here.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

**Why?** The Lambda function in the account bootstrap times out when on-boarding a large number of regions at once. In some cases, this happens just after the CloudFormation Change Set is created, leaving the stack in `REVIEW_IN_PROGRESS` state. The next time the Step Function for account bootstrapping would run, it detects that the stack exists and attempts to update it. While in fact, it still needs to be created. **What?** This set of changes will improve the log messages, to ease troubleshooting these issues in the future. It will detect if the stack is in a clean state, if so it will update. If not, it will remove the stack and recreate it instead. The same holds for any Change Sets that were left on the Stack from a prior execution. These will get deleted before a new one is created, as these might contain outdated changes. To ensure this error will not happen frequently, the Lambda function will throw an error when there is not enough time to handle another deployment. Such that the Step Function can retry with sufficient time available to perform updates at the next attempt.

**Why?** When the Lambda service is experiencing issue, it could affect the execution of the Step Function. Most of the errors it would fail with could be recovered from. **What?** Added retry logic around Lambda service issues. Plus added a retry mechanism to recover from Lambda function executions that ran for too long. To retry those automatically too.

StewartW · 2022-08-31T14:31:46Z

src/lambda_codebase/account_bootstrap.py

 REGION_DEFAULT = os.environ["AWS_REGION"]
 PARTITION = get_partition(REGION_DEFAULT)
 LOGGER = configure_logger(__name__)
+DEPLOY_TIME_IN_MS = 5 * 60 * 1000


What is this magic number?

The timeout for the Lambda function is set to 10 minutes. See https://github.com/awslabs/aws-deployment-framework/blob/24ba1e58b419796e7268d73b7691099f5a0509f2/src/template.yml

This lambda function is responsible for deploying the bootstrap stack inside different regions.
Deploying that takes about 2 minutes in ideal situations.
To make sure we don't fail to deploy because of a Lambda timeout midway, I put the limit to 5 minutes. So it will only continue to deploy if it has enough time to complete the execution. Otherwise it throws an error which will trigger it to run another time. The next run, it skips updating the regions it deployed to already.

Okay doke! Thanks

**Why?** In PR awslabs#513, an error was introduced which broke the syntax of the State Machine definition. While investigating, I discovered a number of other improvements that might help, addressing these in this change. **What?** * Fixing missing and redundant commas in retry logic. * Moving roles and other properties of State Machines to the beginning, so they don't float around at the end of the state machine definition. * Enabled tracing on state machines. * Updated name of Account Bootstrapping state machine.

* Fix ADF State Machines **Why?** In PR #513, an error was introduced which broke the syntax of the State Machine definition. While investigating, I discovered a number of other improvements that might help, addressing these in this change. **What?** * Fixing missing and redundant commas in retry logic. * Moving roles and other properties of State Machines to the beginning, so they don't float around at the end of the state machine definition. * Enabled tracing on state machines. * Updated name of Account Bootstrapping state machine. * Include tenacity in pipeline management lambda reqs

sbkok added 4 commits August 31, 2022 12:16

Raise an error when the execution time is not sufficient

7d0308d

Fix indentation of State Machine definitions

55dd609

sbkok added this to the v3.2.0 milestone Aug 31, 2022

sbkok requested review from StewartW and javydekoning August 31, 2022 13:33

javydekoning approved these changes Aug 31, 2022

View reviewed changes

StewartW reviewed Aug 31, 2022

View reviewed changes

StewartW approved these changes Aug 31, 2022

View reviewed changes

sbkok merged commit 062e101 into awslabs:master Aug 31, 2022

sbkok deleted the fix/bootstrap-recovery branch August 31, 2022 14:47

sbkok mentioned this pull request Aug 31, 2022

Fix ADF State Machines #514

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add retry logic to SF Lambda invocations #513

Add retry logic to SF Lambda invocations #513

Uh oh!

sbkok commented Aug 31, 2022

Uh oh!

StewartW Aug 31, 2022

Uh oh!

sbkok Aug 31, 2022

Uh oh!

StewartW Aug 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add retry logic to SF Lambda invocations #513

Add retry logic to SF Lambda invocations #513

Uh oh!

Conversation

sbkok commented Aug 31, 2022

Why?

What?

Uh oh!

StewartW Aug 31, 2022

Choose a reason for hiding this comment

Uh oh!

sbkok Aug 31, 2022

Choose a reason for hiding this comment

Uh oh!

StewartW Aug 31, 2022

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants