Skip to content

nightly: CLI Tests (macos-latest) — python-dataflow sender exits 143 on --stop-after #1882

@heyong4725

Description

@heyong4725

Summary

The CLI Tests (macos-latest) nightly job has been failing on the python-dataflow --stop-after 10s step for several nights. The Linux equivalent passes. Filed as a tracked regression so it doesn't get lost while the cross-check failures (#1881) take priority.

Symptom

In python-dataflow cli-test step (dora run dataflow.yml --uv --stop-after 10s):

sender: Sent message 90 (~10 seconds in, just before --stop-after fires)
dora_coordinator: Received destroy command
dora_daemon: received destroy command -> exiting
dora_coordinator: Coordinator took 206ms for handling event: Control
dora_coordinator: stopped

[ERROR]
Dataflow 019e40ae-d20f-755a-9796-993037f95238 failed:

Node `sender` failed: exited with code 143

Exit code 143 = 128 + SIGTERM. The sender was terminated by SIGTERM as part of the --stop-after-triggered destroy sequence, which is expected shutdown behavior, but dora reports the resulting exit code as a node failure.

What's going wrong (two candidate root causes)

Either:

  1. dora's --stop-after shutdown path treats SIGTERM as a node failure — the destroy command sends SIGTERM to nodes, then the dataflow result aggregator interprets the resulting 143 as Node failed: exited with code 143. This would mean the same pattern would surface on Linux too, but Linux passes... so probably not this.

  2. Python sender on macOS doesn't trap SIGTERM in time and gets force-killed — the dora node API on Python likely listens for the stop signal and exits cleanly via a KeyboardInterrupt-style mechanism on Linux, but the equivalent on macOS isn't catching SIGTERM fast enough (or at all), so the sender gets force-killed. Linux's signal delivery + Python interrupt handling tends to be faster than macOS's. This is more likely.

Examples to inspect:

  • examples/python-dataflow/sender.py:17 — the only line that's emitting messages
  • The shutdown handling in apis/python/node/src/lib.rs and apis/python/node/dora/__init__.pyi

Reproduction

Local (macos-arm64):

dora build examples/python-dataflow/dataflow.yml --uv
dora run examples/python-dataflow/dataflow.yml --uv --stop-after 10s

Expected: clean exit, zero error.
Actual: Node sender failed: exited with code 143.

Timeline

Date CLI Tests (macos-latest)
2026-05-12 ✅ passing
2026-05-13 onward mixed: some greens, but consistent failures of this exact step in failed nightly runs (mixed with the cross-check pyo3 regression that #1881 fixes)

Need a clean bisect on this specifically since the cross-check failures were masking signal during the same window.

Not blocking

Filed during investigation of nightly failures alongside #1881.

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions