-
Notifications
You must be signed in to change notification settings - Fork 236
Add retry logic to SF Lambda invocations #513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
**Why?** The Lambda function in the account bootstrap times out when on-boarding a large number of regions at once. In some cases, this happens just after the CloudFormation Change Set is created, leaving the stack in `REVIEW_IN_PROGRESS` state. The next time the Step Function for account bootstrapping would run, it detects that the stack exists and attempts to update it. While in fact, it still needs to be created. **What?** This set of changes will improve the log messages, to ease troubleshooting these issues in the future. It will detect if the stack is in a clean state, if so it will update. If not, it will remove the stack and recreate it instead. The same holds for any Change Sets that were left on the Stack from a prior execution. These will get deleted before a new one is created, as these might contain outdated changes. To ensure this error will not happen frequently, the Lambda function will throw an error when there is not enough time to handle another deployment. Such that the Step Function can retry with sufficient time available to perform updates at the next attempt.
**Why?** When the Lambda service is experiencing issue, it could affect the execution of the Step Function. Most of the errors it would fail with could be recovered from. **What?** Added retry logic around Lambda service issues. Plus added a retry mechanism to recover from Lambda function executions that ran for too long. To retry those automatically too.
| REGION_DEFAULT = os.environ["AWS_REGION"] | ||
| PARTITION = get_partition(REGION_DEFAULT) | ||
| LOGGER = configure_logger(__name__) | ||
| DEPLOY_TIME_IN_MS = 5 * 60 * 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout for the Lambda function is set to 10 minutes. See https://github.com/awslabs/aws-deployment-framework/blob/24ba1e58b419796e7268d73b7691099f5a0509f2/src/template.yml
This lambda function is responsible for deploying the bootstrap stack inside different regions.
Deploying that takes about 2 minutes in ideal situations.
To make sure we don't fail to deploy because of a Lambda timeout midway, I put the limit to 5 minutes. So it will only continue to deploy if it has enough time to complete the execution. Otherwise it throws an error which will trigger it to run another time. The next run, it skips updating the regions it deployed to already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay doke! Thanks
**Why?** In PR awslabs#513, an error was introduced which broke the syntax of the State Machine definition. While investigating, I discovered a number of other improvements that might help, addressing these in this change. **What?** * Fixing missing and redundant commas in retry logic. * Moving roles and other properties of State Machines to the beginning, so they don't float around at the end of the state machine definition. * Enabled tracing on state machines. * Updated name of Account Bootstrapping state machine.
* Fix ADF State Machines **Why?** In PR #513, an error was introduced which broke the syntax of the State Machine definition. While investigating, I discovered a number of other improvements that might help, addressing these in this change. **What?** * Fixing missing and redundant commas in retry logic. * Moving roles and other properties of State Machines to the beginning, so they don't float around at the end of the state machine definition. * Enabled tracing on state machines. * Updated name of Account Bootstrapping state machine. * Include tenacity in pipeline management lambda reqs
Why?
When the Lambda service is experiencing issue, it could affect the execution of the Step Function. Most of the errors it would fail with could be recovered from.
Additionally, with large ADF installations the Lambda function could timeout when an update is performed. As it was not able to recover, the whole update would fail in a hard-to-fix state.
What?
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.