Skip to content

Conversation

@danieldk
Copy link
Contributor

@danieldk danieldk commented Jan 22, 2024

Description

Before this change, the workers of pipe call with n_process != 1 were
stopped by calling terminate on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

  • When the all documents are processed, the parent process puts a
    sentinel in the queue of each worker.
  • The parent process then calls join on each worker process to
    let them finish up gracefully.
  • Worker processes break from the queue processing loop when the
    sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

  • The parent puts the end-of-work sentinel in the queue of each worker.
  • The parent closes the reading-end of the channel of each worker.
  • Then:
    • If the worker was waiting for work, it will encounter the sentinel
      and break from the processing loop.
    • If the worker was processing a batch, it will attempt to write
      results to the channel. This will fail because the channel was
      closed by the parent and the worker will break from the processing
      loop.

Types of change

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@svlandeg svlandeg added tests New, missing or incorrect tests bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy labels Jan 22, 2024
@danieldk danieldk force-pushed the bugfix/ci-fail branch 5 times, most recently from 36d88b9 to e58fbdd Compare January 23, 2024 07:54
Before this change, the workers of pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When the all documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
@danieldk danieldk changed the title WIP: Properly clean up multiprocessing workers Properly clean up multiprocessing workers Jan 23, 2024
@danieldk danieldk marked this pull request as ready for review January 23, 2024 09:04
Copy link
Contributor

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - thanks for the detailed breakdown & documentation of the code! 🙏

@svlandeg svlandeg merged commit 128197a into explosion:master Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Bugs and behaviour differing from documentation scaling Scaling, serving and parallelizing spaCy tests New, missing or incorrect tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants