Conversation

@Hartorn (Member) commented Oct 10, 2023

Closes #GSK-1863

Description

Related Issue

Type of Change

  • 📚 Examples / docs / tutorials / dependencies update
  • 🔧 Bug fix (non-breaking change which fixes an issue)
  • 🥂 Improvement (non-breaking change which improves an existing feature)
  • 🚀 New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 🔐 Security fix

Checklist

  • I've read the CODE_OF_CONDUCT.md document.
  • I've read the CONTRIBUTING.md guide.
  • I've updated the code style using make codestyle.
  • I've written tests for all new methods and classes that I created.
  • I've written the docstring in Google format for all the methods and classes that I used.

@Hartorn Hartorn self-assigned this Oct 10, 2023
@Hartorn Hartorn requested a review from andreybavt October 10, 2023 18:14
@Hartorn Hartorn marked this pull request as ready for review October 10, 2023 18:15
@Hartorn Hartorn requested a review from Inokinoki October 11, 2023 07:42
@Hartorn Hartorn force-pushed the action-cancellable branch from f84b7e5 to ae22d25 on October 11, 2023 08:39
@Hartorn Hartorn force-pushed the feature/gsk-1863-using-get-similar-examples-on-enron-seems-to-stuck-the branch 2 times, most recently from c821df9 to 5dc0701 on October 11, 2023 10:20
@Hartorn Hartorn force-pushed the action-cancellable branch 2 times, most recently from 5211c9b to bcaf884 on October 11, 2023 11:30
return
LOGGER.info("Starting worker pool...")
self.pool = ProcessPoolExecutor(*args, **kwargs)
self.max_workers = min(max_workers, 2) if max_workers is not None else os.cpu_count()

Contributor

Small improvement: it'd be better to move the 2 into giskard.settings.Settings. That way we'd get the possibility to change it from an environment variable.

Member Author

Done. I also changed min to max, since we always want at least min_workers.
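
For illustration, a minimal sketch of what the settings-based version might look like, assuming a hypothetical min_workers field on giskard.settings.Settings that could be overridden from an environment variable (the field name and import are assumptions, not the PR's actual code):

import os

from giskard.settings import settings  # assumed to expose a Settings instance

def resolve_max_workers(max_workers=None):
    # Hypothetical `min_workers` setting; using max() (not min()) guarantees
    # we always keep at least `min_workers` worker processes.
    if max_workers is not None:
        return max(max_workers, settings.min_workers)
    return os.cpu_count()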

# Create an independent process pool
# If we kill a running process, it breaks the process pool, making it unusable
one_shot_executor = ProcessPoolExecutor(max_workers=1)
pid = one_shot_executor.submit(os.getpid).result(timeout=5)

Contributor

I don't get it, the comment above said:

.result(timeout=timeout) => Not working with WSL and python 3.10

Here it's expected to work?

Member Author

I should remove the timeout here anyway; there is no reason for a basic os.getpid call to fail or take a long time.

Basically, when I tested it, the timeout was not respected in the killer thread and it never stopped, so I had to do the loop.

By the way, the timeout was working on macOS with Python 3.11.
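
For illustration, a minimal sketch of the polling loop described above (an illustrative helper, not the PR's actual code), which checks future.done() against a deadline instead of relying on future.result(timeout=...):

import time
from concurrent.futures import Future

def wait_with_deadline(future: Future, timeout: float, poll_interval: float = 0.1):
    # Poll the future instead of trusting result(timeout=...), which was not
    # respected on WSL with Python 3.10 when called from the killer thread.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if future.done():
            return future.result()
        time.sleep(poll_interval)
    return None  # deadline expired: the caller decides what to do (e.g. kill the worker)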

Contributor

I think the overall killing mechanism could be simplified to avoid having a killer thread and a cancellable counter.

We could use an inter-process data structure to communicate PIDs between a spawned worker process and the main process: multiprocessing.Queue, for example.

In this case, the first thing the pool process would do is identify its PID and add it to the queue, then do the actual work.
In the main process we could use concurrent.futures.as_completed to wrap the future and add a timeout to it, and then call .add_done_callback(result_handler) as we do now.

When the timeout expires, we'd find the PID related to a given task and kill it from the main process.

WDYT?
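
For illustration, a rough sketch of this suggestion, using a Manager-backed queue so it can be passed to ProcessPoolExecutor.submit (all function names here are illustrative, not the PR's actual code):

import multiprocessing as mp
import os
import signal
from concurrent.futures import ProcessPoolExecutor, TimeoutError

def do_actual_work(payload):
    ...  # placeholder for the real task

def _worker(pid_queue, payload):
    pid_queue.put(os.getpid())      # publish our PID before doing the work
    return do_actual_work(payload)

def run_with_timeout(executor: ProcessPoolExecutor, payload, timeout: float):
    pid_queue = mp.Manager().Queue()  # a plain mp.Queue cannot be pickled for submit()
    future = executor.submit(_worker, pid_queue, payload)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        pid = pid_queue.get_nowait()
        os.kill(pid, signal.SIGKILL)  # POSIX only; note that this breaks the pool (see below)
        raise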

Contributor

@Hartorn, actually, it looks like https://pypi.org/project/Pebble/ does exactly what we need:

from pebble import ProcessPool

pool = ProcessPool(max_workers=5)

pool.schedule(long_fn, args=(a,b,c), timeout=1).add_done_callback(done_callback)

Member Author (Oct 11, 2023)

> I think the overall killing mechanism could be simplified to avoid having a killer thread and a cancellable counter. [...] WDYT?

The cancellable counter is only for logs; we can remove it if we want.
I just wanted to check that it was working, and possibly to avoid launching too many processes.
For the callback of a future, you need everything to be picklable, so you cannot have a lock, a ProcessPool, and so on.
So I don't know how to properly shut down the one-shot executor we ran.

Also, it cannot be the main executor, since killing a process breaks the pool, which only raises BrokenProcessPool after that.

I'm pretty sure we could avoid the thread if the whole code we were running was async, since a coroutine could do this job, but here I would not be confident.
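
For illustration, a standalone snippet (POSIX only, not part of the PR) showing that point: once one of its workers is killed, a ProcessPoolExecutor only raises BrokenProcessPool afterwards.

import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

if __name__ == "__main__":
    pool = ProcessPoolExecutor(max_workers=1)
    worker_pid = pool.submit(os.getpid).result()
    os.kill(worker_pid, signal.SIGKILL)  # simulate cancelling a stuck task
    time.sleep(1)                        # give the pool time to notice the dead worker
    try:
        pool.submit(os.getpid).result()
    except BrokenProcessPool:
        print("pool is unusable after one of its workers was killed")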

Member Author

> @Hartorn, actually, it looks like https://pypi.org/project/Pebble/ does exactly what we need: [...]

Looking at it!

Member Author

From what I saw, they are doing roughly the same thing: when using a separate process, they have a handler thread watching it and handling the timeout.

Should we get this merged and then change to use it? Or do you want me to switch it now?

Although I'm a bit concerned that it's not that widely used, the code looks clean.

Contributor

Actually, while it has a clean API and is easy to use, Pebble is an LGPL library, so we won't be able to use it (which might be the reason it's not that widely adopted).

I also read their code and found similarities with your implementation. I suggest we stick to your current code (and merge it, since we need these changes ASAP).

As an improvement, I think in a separate PR we could take inspiration from Pebble's API, encapsulate call_in_pool into our custom ProcessPool implementation, and have a schedule method (see the sketch below), WDYT?
In the end, Pebble is just a ~5k LOC library.
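
For illustration, a rough sketch of what a Pebble-style schedule method could look like on a custom pool wrapper (all names are illustrative; none of this is existing Giskard API):

from concurrent.futures import Future, ProcessPoolExecutor

class WorkerPool:
    """Thin wrapper exposing a Pebble-like schedule() API."""

    def __init__(self, max_workers=None):
        self._executor = ProcessPoolExecutor(max_workers=max_workers)

    def schedule(self, fn, args=(), kwargs=None, timeout=None) -> Future:
        # Submit the task; a watcher (like the killer thread in this PR) would
        # use `timeout` to kill the worker process once the deadline expires.
        future = self._executor.submit(fn, *args, **(kwargs or {}))
        future.timeout = timeout  # illustrative: stash the deadline for the watcher
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)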

Member Author

Shame, the code was working well with it too.
Sure, it seems like a good idea to have our own worker implementation. I'm sure it will also help us improve the performance of the library.

@Hartorn Hartorn force-pushed the action-cancellable branch 3 times, most recently from 59c29b9 to 0d23d4e on October 12, 2023 08:20
Base automatically changed from feature/gsk-1863-using-get-similar-examples-on-enron-seems-to-stuck-the to main on October 12, 2023 08:31
@Hartorn Hartorn force-pushed the action-cancellable branch from 0d23d4e to 3f808f8 on October 12, 2023 08:32

@Inokinoki Inokinoki (Member) left a comment

LGTM

@Hartorn Hartorn closed this Oct 12, 2023
@Hartorn Hartorn reopened this Oct 12, 2023
@Hartorn Hartorn merged commit 2c22432 into main Oct 12, 2023
@Hartorn Hartorn deleted the action-cancellable branch October 12, 2023 09:21
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (1 Code Smell)

Coverage: 0.0%
Duplication: 0.0%
