Skip to content

Conversation

@khushboobhatia01
Copy link
Contributor

@khushboobhatia01 khushboobhatia01 commented Nov 29, 2021

This PR will ensure graceful shutdown of workflow engine.

Please ref #5373 for more details.

@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Nov 29, 2021
@khushboobhatia01
Copy link
Contributor Author

Marking the PR as WIP until dependent MR is merged #5396

@khushboobhatia01 khushboobhatia01 changed the title Workflow engine graceful shutdown WIP: Workflow engine graceful shutdown Nov 29, 2021
@khushboobhatia01 khushboobhatia01 changed the title WIP: Workflow engine graceful shutdown Workflow engine graceful shutdown Dec 3, 2021
Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some work is still needed so the mechanism is more robust. When a workflow engine starts up, does it know to resume workflows that have been paused due to a shutdown?

@khushboobhatia01
Copy link
Contributor Author

Some work is still needed so the mechanism is more robust. When a workflow engine starts up, does it know to resume workflows that have been paused due to a shutdown?

Added feature to resume.

Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khushboobhatia01 The implementation here needs some more rework.

Here're some background on the workflow engine that may help you

  • a workflow execution can be processed by any workflow engines
  • the process function to handle workflow execution or handle action execution messages is short lived
  • the workflow engine is multi-threaded and many process functions could be running at any given time
  • if the workflow engine needs to pause workflows because there are no other workflow engines running, then the workflow engine should query the database for the list of running workflows

Given the above,

  • For each workflow engine, we just need to track how many messages it is still processing. We don't need to track the specific workflow execution or task execution. We just need to increment a counter when invoking process and decrement the same counter when exiting process. On shutdown, if the counter is > 0, that means the workflow engine is processing message(s).
  • Incrementing or decrementing the counter should be thread safe so need to thread lock before updating the counter.
  • On shutdown, the workflow engine should initiate shutdown and stop receiving any more messages. Then it waits until the counter becomes zero.
  • After the workflow engine stopped processing, if there are no other workflow engine running, then acquire a distributed lock (use coordinator), query the database for any active/running workflow execution, and pause the workflows. The start up sequence of workflow engine should use the same distributed lock when trying to resume paused workflows. This is to prevent workflow engines stepping over each other.

Please review and let me know if need more clarification.

@khushboobhatia01 khushboobhatia01 force-pushed the workflow_engine_graceful_shutdown branch from cbc0e56 to d65f198 Compare March 14, 2022 07:37
@khushboobhatia01
Copy link
Contributor Author

@khushboobhatia01 The implementation here needs some more rework.

Here're some background on the workflow engine that may help you

  • a workflow execution can be processed by any workflow engines
  • the process function to handle workflow execution or handle action execution messages is short lived
  • the workflow engine is multi-threaded and many process functions could be running at any given time
  • if the workflow engine needs to pause workflows because there are no other workflow engines running, then the workflow engine should query the database for the list of running workflows

Given the above,

  • For each workflow engine, we just need to track how many messages it is still processing. We don't need to track the specific workflow execution or task execution. We just need to increment a counter when invoking process and decrement the same counter when exiting process. On shutdown, if the counter is > 0, that means the workflow engine is processing message(s).
  • Incrementing or decrementing the counter should be thread safe so need to thread lock before updating the counter.
  • On shutdown, the workflow engine should initiate shutdown and stop receiving any more messages. Then it waits until the counter becomes zero.
  • After the workflow engine stopped processing, if there are no other workflow engine running, then acquire a distributed lock (use coordinator), query the database for any active/running workflow execution, and pause the workflows. The start up sequence of workflow engine should use the same distributed lock when trying to resume paused workflows. This is to prevent workflow engines stepping over each other.

Please review and let me know if need more clarification.

@m4dcoder Updated the MR with new implementation. One issue with the above flow is during last engine shutdown, workflows which have active tasks could transition only to pausing state instead of paused state. To address this, I've added a delay (https://github.com/StackStorm/st2/pull/5463/files#diff-e4e1e16f3a5aefdf069b0af9bd510c1fde29a1b8df10796b7e74cb183820973cR79) in the engine startup resume sequence to ensure workflows stuck in pausing state transition to paused state before we start resuming.

Please review.

@arm4b arm4b requested a review from m4dcoder March 30, 2022 12:09
@m4dcoder m4dcoder added this to the 3.8.0 milestone Mar 30, 2022
Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going in the right direction. Thanks!

Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional follow up. We are almost there. Great job!

Copy link
Contributor

@m4dcoder m4dcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khushboobhatia01 Great Job! Thanks for contribution here. This is a significant improvement.

@CLAassistant
Copy link

CLAassistant commented Jul 5, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@cognifloyd cognifloyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! I'm looking forward to graceful shutdowns of the workflow engine!

@cognifloyd

This comment was marked as resolved.

@cognifloyd

This comment was marked as resolved.

@cognifloyd cognifloyd force-pushed the workflow_engine_graceful_shutdown branch from c198e79 to dea5f1e Compare July 5, 2022 17:18
@cognifloyd cognifloyd enabled auto-merge July 5, 2022 17:20
@cognifloyd cognifloyd merged commit d65555e into StackStorm:master Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement feature size/L PR that changes 100-499 lines. Requires some effort to review.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants