Skip to content

fix: resolve lost-wakeup race via SeqCst barriers and eliminate ambiguous nanosleep#80

Merged
agicy merged 2 commits intomainfrom
fix/lost-wakeup
Mar 6, 2026
Merged

fix: resolve lost-wakeup race via SeqCst barriers and eliminate ambiguous nanosleep#80
agicy merged 2 commits intomainfrom
fix/lost-wakeup

Conversation

@agicy
Copy link
Contributor

@agicy agicy commented Mar 4, 2026

This PR addresses a critical memory synchronization violation in the virtio notification path.

By implementing a correct Dekker's Algorithm using SeqCst fences, we eliminate the root cause of "lost wakeups" and remove the confusing nanosleep workaround.

The Ambiguity of nanosleep:
The legacy implementation used nanosleep(TIMEOUT)(TIMEOUT is 1ms) to "fix" intermittent queue stalls. This was misleading: it didn't fix the synchronization logic but merely limited the maximum stall duration to 1ms. This obscured the real bug and introduced a mandatory latency floor. With proper memory barriers, the system is now both correct and fast.

Detailed Dependency Analysis:
The system relies on the visibility of two specific variables:

  • Guest: Must ensure Store [req_rear] is visible to Host before Load [need_wakeup].
  • Host: Must ensure Store [need_wakeup] is visible to Guest before Load [req_rear].

Cross-Architecture Memory Reordering Analysis:
The "Lost Wakeup" occurs specifically due to Store-Load Reordering. As shown in the table below, while different architectures have varying degrees of relaxation, all of them allow Store-Load reordering. This makes a full memory barrier (SeqCst) mandatory for correctness across all supported platforms.

Architecture Load-Load Load-Store Store-Store Store-Load (The Root Cause)
x86-64 (TSO) No No No Allowed
ARMv8 / v9 Allowed Allowed Allowed Allowed
RISC-V (RVWMO) Allowed Allowed Allowed Allowed
LoongArch Allowed Allowed Allowed Allowed

Note: "Allowed" means the hardware/CPU may reorder these operations for optimization unless an explicit barrier is used.

Race Condition Timeline (Lost Wakeup Scenario):

Time Hypervisor (Producer) Virtio Daemon (Consumer) Visible Shared State
T1 Store [req_rear] = N (Stuck in Store Buffer) req_rear = 0 (Old)
T2 Store [need_wakeup] = 1 (Stuck in Store Buffer) need_wakeup = 0 (Old)
T3 Load r1, [need_wakeup] -> Reads 0 need_wakeup = 0
T4 Load r2, [req_rear] -> Reads 0 req_rear = 0
T5 Decision: Host is awake. Action: No eventfd kick. Decision: Queue empty. Action: Enter epoll_wait. STALL
T6 (Optional) Store Buffer Flushed (Optional) Store Buffer Flushed req_rear=N, need_wakeup=1

Why SeqCst Fixes This:
Ordering::SeqCst (Sequential Consistency) acts as a full memory barrier. It forces the CPU to flush the Store Buffer and wait for completion before proceeding to the Load.

  • It ensures that T1 must complete (become visible) before T3.
  • It ensures that T2 must complete (become visible) before T4.
  • Consequently, it is mathematically impossible for both sides to read the "old" state simultaneously. At least one side will see the other's update and prevent the stall.

@agicy agicy force-pushed the feat/eventfd-signalfd branch from a3adfaa to 2681f1b Compare March 4, 2026 09:43
@agicy agicy marked this pull request as ready for review March 4, 2026 11:04
agicy added 2 commits March 4, 2026 11:10
This commit significantly improves the VirtIO backend synchronization
mechanism by introducing atomic operations with memory barriers.

Key improvements:
- Removed is_queue_empty() function and replaced with direct index comparison
- Added static assertion to ensure MAX_REQ is a power of two
- Implemented consume_pending_requests() function with:
  - Hybrid busy-polling and event-based synchronization
  - Dekker-style memory barriers to prevent lost wake-up problem
  - Atomic operations for safe shared memory access
  - Efficient processing of all available requests
- Updated handle_virtio_requests() to use the new synchronization mechanism

The new implementation provides robust synchronization with virtio_bridge,
eliminating race conditions and improving performance through efficient use of
atomic operations and memory barriers. This resolves issues with the previous
nanosleep-based approach and ensures proper handling of concurrent requests.
@agicy agicy force-pushed the fix/lost-wakeup branch from 1815563 to d9dc3c9 Compare March 4, 2026 11:11
@agicy agicy requested review from KouweiLee and liulog March 4, 2026 11:13
@agicy agicy changed the base branch from feat/eventfd-signalfd to main March 4, 2026 11:13
@agicy agicy merged commit 713464e into main Mar 6, 2026
1 check passed
@agicy agicy deleted the fix/lost-wakeup branch March 7, 2026 11:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants