
[Bug]: Background tasks via agents chat --background killed during agent reload #3275

@ekzhu


CoPaw Version

1.0.2

Description

Background tasks dispatched via copaw agents chat --background are spontaneously cancelled when the target agent undergoes a workspace reload. The reload's graceful shutdown logic has a blind spot: it only checks CoPaw's per-workspace TaskTracker for active tasks, but background tasks submitted through /api/agent/process/task are managed by agentscope_runtime's AgentApp and are invisible to that tracker. The old workspace is therefore stopped immediately, killing all in-flight background tasks.

All affected sessions end with _is_interrupted=True and "The tool call has been interrupted by the user" — but no user issued a stop command.

Related PR(s): N/A

Security considerations: N/A

Component(s) Affected

  • Core / Backend (app, agents, config, providers, utils, local_models)
  • Console (frontend web UI)
  • Channels (DingTalk, Feishu, QQ, Discord, iMessage, etc.)
  • Skills
  • CLI
  • Documentation (website)
  • Tests
  • CI/CD
  • Scripts / Deploy

Environment

  • CoPaw version: 1.0.2
  • OS: Linux
  • Install method: from source
  • Python version: 3.10+

Steps to Reproduce

  1. Start CoPaw with a configured agent
  2. Dispatch multiple background tasks to the same target agent:
    for i in $(seq 1 5); do
      copaw agents chat --background \
        --from-agent default --to-agent <target> \
        --text "Run a long task: sleep 120 && echo done"
    done
  3. Trigger a reload from another terminal or the UI:
    curl -X PUT http://localhost:8088/api/agent/running-config \
      -H "Content-Type: application/json" \
      -H "X-Agent-Id: <target>" \
      -d '{"temperature": 0.8}'
  4. Check task status:
    copaw agents chat --background --task-id <any-task-id>

Any config change endpoint that calls schedule_agent_reload() triggers this — including PUT /agent/running-config, PUT /agent/system-prompt-files, PUT /agents/{agentId}, and PUT /config/channels.

Actual vs Expected

  • Actual: All background tasks are cancelled within seconds of a reload. Sessions show _is_interrupted=True. Task status API shows "pending" or "cancelled".
  • Expected: Background tasks should survive agent reloads (or at least be given a grace period to complete). The graceful shutdown should be aware of all running tasks, not just those tracked by CoPaw's TaskTracker.

Logs / Screenshots

All interrupted sessions end with the same pattern:

{
  "metadata": {"_is_interrupted": true},
  "content": "I noticed that you have interrupted me. What can I do for you?"
}

Preceding tool result:

<system-info>The tool call has been interrupted by the user.</system-info>

Root Cause Analysis

Affected code path

CLI: copaw agents chat --background --to-agent kM8Z4E --text "..."
  → POST /api/agent/process/task          (agentscope_runtime AgentApp)
  → DynamicMultiAgentRunner.stream_query  (src/copaw/app/_app.py:104)
  → workspace_runner.stream_query         (agentscope_runtime Runner)
  → AgentRunner.query_handler             (src/copaw/app/runner/runner.py:349)
  → CoPawAgent.reply → tool execution     (src/copaw/agents/react_agent.py)

The task lifecycle is owned by agentscope_runtime, not by CoPaw's TaskTracker (src/copaw/app/runner/task_tracker.py).

Step-by-step

1. Tasks are dispatched and running. Each goes through agentscope_runtime's /api/agent/process/task endpoint, which creates an asyncio.Task wrapping DynamicMultiAgentRunner.stream_query(). At task start, the runner resolves to workspace A's AgentRunner and holds a reference to it.
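This stale-reference behavior can be sketched in a few lines (illustrative names; the `agents` dict stands in for MultiAgentManager's registry): because the runner is resolved once at task start, the later atomic swap never reroutes an in-flight task.

```python
import asyncio

# Sketch: an in-flight task keeps the workspace it resolved at start,
# so swapping the registry entry does not affect it.
agents = {"kM8Z4E": "workspace-A"}

async def stream_query(agent_id: str) -> str:
    runner = agents[agent_id]   # resolved once, at task start
    await asyncio.sleep(0)      # the reload happens while we run
    return runner               # still workspace-A

async def demo() -> tuple[str, str]:
    task = asyncio.create_task(stream_query("kM8Z4E"))
    await asyncio.sleep(0)              # let the task resolve its runner
    agents["kM8Z4E"] = "workspace-B"    # reload_agent's atomic swap
    return await task, agents["kM8Z4E"]
```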

2. A config change triggers schedule_agent_reload() (src/copaw/app/utils.py:15), which fires MultiAgentManager.reload_agent() in the background.

3. reload_agent() (src/copaw/app/multi_agent_manager.py:208-319) creates a new Workspace with a fresh TaskTracker, starts it, atomically swaps it in (self.agents[agent_id] = new_instance, line 312), then calls _graceful_stop_old_instance().

4. _graceful_stop_old_instance() — THE BUG (src/copaw/app/multi_agent_manager.py:91-186):

has_active = await old_instance.task_tracker.has_active_tasks()  # line 105
if has_active:
    # Wait up to 60s for tasks, then stop ...
else:
    # No active tasks — stop immediately       ← THIS PATH IS TAKEN
    await old_instance.stop(final=False)        # line 176

has_active returns False because TaskTracker only tracks tasks registered via attach_or_start() (console channel, messaging channels). Background tasks from /api/agent/process/task are managed by agentscope_runtime's AgentApp and are never registered in CoPaw's TaskTracker.
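A minimal stand-in illustrates the blind spot (TrackerStub is illustrative, not CoPaw's real TaskTracker): a tracker that only counts tasks registered through its own attach_or_start() reports no active tasks while an externally created asyncio.Task is still running.

```python
import asyncio

class TrackerStub:
    """Only sees tasks it started itself -- like CoPaw's per-workspace tracker."""
    def __init__(self) -> None:
        self._tasks: set = set()

    def attach_or_start(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def has_active_tasks(self) -> bool:
        return any(not t.done() for t in self._tasks)

async def demo() -> bool:
    tracker = TrackerStub()
    # Background task created by another component (e.g. AgentApp),
    # bypassing the tracker entirely:
    external = asyncio.create_task(asyncio.sleep(10))
    await asyncio.sleep(0)
    active = await tracker.has_active_tasks()  # False, despite the running task
    external.cancel()
    try:
        await external
    except asyncio.CancelledError:
        pass
    return active
```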

5. Old workspace is stopped immediately. stop(final=False) (workspace.py:363) calls ServiceManager.stop_all(final=False) which stops the runner, MCP clients, and channels. The in-flight tasks receive asyncio.CancelledError.

6. CancelledError propagates to agent interrupt:

# runner.py:541-545
except asyncio.CancelledError as exc:
    if agent is not None:
        await agent.interrupt()      # cancels agent's reply task
    raise AgentException("Task has been cancelled!") from exc

agent.interrupt() (react_agent.py:1031-1046) cancels the agent's _reply_task, producing the _is_interrupted=True metadata.
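The whole failure chain can be reproduced in miniature (stub classes with hypothetical names, not CoPaw's real ones): cancelling the in-flight task delivers CancelledError at the await point, the handler calls interrupt(), and the session ends with _is_interrupted=True even though no user issued a stop.

```python
import asyncio

class AgentStub:
    def __init__(self) -> None:
        self.metadata: dict = {}

    async def interrupt(self) -> None:
        # Stand-in for react_agent.py cancelling _reply_task and
        # closing the session with the interrupted marker.
        self.metadata["_is_interrupted"] = True

async def query_handler(agent: AgentStub) -> None:
    try:
        await asyncio.sleep(120)  # in-flight tool call
    except asyncio.CancelledError:
        await agent.interrupt()   # mirrors runner.py's except-branch
        raise

async def demo() -> dict:
    agent = AgentStub()
    task = asyncio.create_task(query_handler(agent))
    await asyncio.sleep(0)   # let the task reach its await point
    task.cancel()            # what stopping the old workspace does
    try:
        await task
    except asyncio.CancelledError:
        pass
    return agent.metadata
```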

Why the evidence matches

  • All 10 tasks interrupted — all used the same old workspace runner
  • Different durations before death (12s–134s) — tasks were dispatched at different times but killed by the same reload event
  • All interrupted during sleep commands — sleep yields to the event loop, where CancelledError is delivered
  • _is_interrupted=True in all sessions — the standard agentscope interrupt response to CancelledError
  • Task status API shows "pending/cancelled" — agentscope_runtime's tracker is separate; status doesn't update correctly after the workspace stops
  • No user-initiated /stop — the stop was triggered by the workspace reload, not a user

Suggested Fix

Option A (Preferred): Register AgentApp tasks with CoPaw's TaskTracker

In DynamicMultiAgentRunner.stream_query() (_app.py:104), register each background task with the resolved workspace's TaskTracker so _graceful_stop_old_instance waits for them.
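A sketch of Option A, assuming TaskTracker grows register/unregister methods (both TrackerStub and those method names are hypothetical): stream_query registers the current asyncio.Task for its whole lifetime, so _graceful_stop_old_instance would see it via has_active_tasks() and wait.

```python
import asyncio

class TrackerStub:
    """Hypothetical tracker surface for externally owned tasks."""
    def __init__(self) -> None:
        self.active: dict = {}

    def register(self, task_id: str, task: asyncio.Task) -> None:
        self.active[task_id] = task

    def unregister(self, task_id: str) -> None:
        self.active.pop(task_id, None)

    async def has_active_tasks(self) -> bool:
        return bool(self.active)

async def stream_query(tracker: TrackerStub, task_id: str):
    tracker.register(task_id, asyncio.current_task())
    try:
        for chunk in ("thinking", "answer"):
            await asyncio.sleep(0)
            yield chunk
    finally:
        tracker.unregister(task_id)  # always deregister, even on cancellation

async def demo():
    tracker = TrackerStub()
    chunks = [c async for c in stream_query(tracker, "task-1")]
    return chunks, await tracker.has_active_tasks()
```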

Option B: Delay old workspace stop unconditionally

Always schedule a delayed cleanup with a configurable grace period (e.g. 60–300s) before stopping the old workspace after a reload, instead of relying solely on has_active_tasks().
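A sketch of Option B (OldInstanceStub is a stand-in for the old Workspace; the polling helper is hypothetical, not existing CoPaw code): poll for activity for up to a configurable grace period before stopping, instead of stopping immediately when the tracker reports idle.

```python
import asyncio

class OldInstanceStub:
    """Stand-in for the old Workspace; reports busy for a few polls."""
    def __init__(self, polls_until_idle: int) -> None:
        self._polls = polls_until_idle
        self.stopped = False

    async def has_active_tasks(self) -> bool:
        self._polls -= 1
        return self._polls > 0

    async def stop(self, final: bool = False) -> None:
        self.stopped = True

async def stop_after_grace(old, grace_seconds: float, poll: float = 0.01) -> None:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + grace_seconds
    # Keep the old workspace alive while work may still be in flight.
    while loop.time() < deadline and await old.has_active_tasks():
        await asyncio.sleep(poll)
    await old.stop(final=False)
```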

Key files to modify

  • src/copaw/app/multi_agent_manager.py:91-186 — _graceful_stop_old_instance: add awareness of AgentApp-managed tasks
  • src/copaw/app/_app.py:104-126 — DynamicMultiAgentRunner.stream_query: register background tasks with the TaskTracker
  • src/copaw/app/runner/task_tracker.py — may need API additions for external task registration
