fix: resolve lost-wakeup race via SeqCst barriers and eliminate ambiguous nanosleep #80
Merged
a3adfaa to 2681f1b
This commit significantly improves the VirtIO backend synchronization mechanism by introducing atomic operations with memory barriers.

Key improvements:
- Removed is_queue_empty() function and replaced with direct index comparison
- Added static assertion to ensure MAX_REQ is a power of two
- Implemented consume_pending_requests() function with:
  - Hybrid busy-polling and event-based synchronization
  - Dekker-style memory barriers to prevent lost wake-up problem
  - Atomic operations for safe shared memory access
  - Efficient processing of all available requests
- Updated handle_virtio_requests() to use the new synchronization mechanism

The new implementation provides robust synchronization with virtio_bridge, eliminating race conditions and improving performance through efficient use of atomic operations and memory barriers. This resolves issues with the previous nanosleep-based approach and ensures proper handling of concurrent requests.
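A minimal Rust sketch of the first two items above (the static assertion and the direct index comparison). The value of `MAX_REQ`, the name `req_front`, the helper names, and the use of free-running `u32` indices are assumptions for illustration and are not taken from the actual patch.

```rust
/// Hypothetical ring capacity; the real value of MAX_REQ is not shown here.
pub const MAX_REQ: u32 = 64;

// Compile-time check: the build fails if MAX_REQ is not a power of two.
const _: () = assert!(MAX_REQ.is_power_of_two(), "MAX_REQ must be a power of two");

/// Map a free-running index onto a ring slot with a mask instead of `%`.
/// This stays consistent across u32 wrap-around only for power-of-two sizes.
pub fn slot(index: u32) -> usize {
    (index & (MAX_REQ - 1)) as usize
}

/// Direct index comparison: requests are pending when the indices differ.
/// `req_front` is a hypothetical name for the consumer-side index.
pub fn has_pending(req_front: u32, req_rear: u32) -> bool {
    req_front != req_rear
}

/// Number of pending requests, correct even after the indices wrap.
pub fn pending(req_front: u32, req_rear: u32) -> u32 {
    req_rear.wrapping_sub(req_front)
}
```

One plausible reason for the power-of-two requirement is that masking a free-running index only maps to consecutive slots across integer wrap-around when the capacity divides 2^32.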
liulog
approved these changes
Mar 5, 2026
This PR addresses a critical memory synchronization violation in the virtio notification path. By implementing Dekker's algorithm correctly with `SeqCst` fences, we eliminate the root cause of "lost wakeups" and remove the confusing `nanosleep` workaround.

The Ambiguity of `nanosleep`:

The legacy implementation used `nanosleep(TIMEOUT)` (`TIMEOUT` is 1 ms) to "fix" intermittent queue stalls. This was misleading: it did not fix the synchronization logic but merely capped the maximum stall duration at 1 ms. This obscured the real bug and introduced a mandatory latency floor. With proper memory barriers, the system is now both correct and fast.

Detailed Dependency Analysis:
The system relies on the visibility of two specific variables:

- `Store [req_rear]` is visible to Host before `Load [need_wakeup]`.
- `Store [need_wakeup]` is visible to Guest before `Load [req_rear]`.

Cross-Architecture Memory Reordering Analysis:
The "Lost Wakeup" occurs specifically due to Store-Load reordering. As shown in the table below, while different architectures have varying degrees of relaxation, all of them allow Store-Load reordering. This makes a full memory barrier (`SeqCst`) mandatory for correctness across all supported platforms.

Note: "Allowed" means the hardware/CPU may reorder these operations for optimization unless an explicit barrier is used.
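The two ordering requirements can be sketched in Rust as a store, a `SeqCst` fence, and then a load on each side. The `Shared` struct, the function names, and the `req_front` parameter are hypothetical; only `req_rear` and `need_wakeup` come from the description above, and the real code operates on guest/host shared memory rather than a plain Rust struct.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

/// Hypothetical stand-in for the fields shared between Guest and Host.
pub struct Shared {
    pub req_rear: AtomicU32,
    pub need_wakeup: AtomicBool,
}

/// Guest side: publish the new rear index, then check the wakeup flag.
/// Returns true when the Host must be notified.
pub fn guest_publish(s: &Shared, new_rear: u32) -> bool {
    s.req_rear.store(new_rear, Ordering::Release);
    // Full barrier: the store above may not be reordered after the load below.
    fence(Ordering::SeqCst);
    s.need_wakeup.load(Ordering::Relaxed)
}

/// Host side: announce the intent to sleep, then re-check the queue.
/// Returns true when no request is pending and blocking is safe.
pub fn host_may_sleep(s: &Shared, req_front: u32) -> bool {
    s.need_wakeup.store(true, Ordering::Relaxed);
    // Full barrier: the store above may not be reordered after the load below.
    fence(Ordering::SeqCst);
    s.req_rear.load(Ordering::Acquire) == req_front
}
```

With both fences in place, it cannot happen that `guest_publish` returns false while a concurrent `host_may_sleep` returns true for the same request; weakening either fence to `Release` or `Acquire` would permit exactly that outcome.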
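Race Condition Timeline (Lost Wakeup Scenario):

| Step | Guest | Host | Globally visible state |
|------|-------|------|------------------------|
| 1 | `Store [req_rear] = N` (stuck in store buffer) | | `req_rear = 0` (old) |
| 2 | | `Store [need_wakeup] = 1` (stuck in store buffer) | `need_wakeup = 0` (old) |
| 3 | `Load r1, [need_wakeup]` -> reads 0 | | `need_wakeup = 0` |
| 4 | | `Load r2, [req_rear]` -> reads 0 | `req_rear = 0` |
| 5 | Sees no wakeup request, so sends no notification | Sees an empty queue, so blocks in `epoll_wait` | `req_rear=N, need_wakeup=1` |

Why `SeqCst` Fixes This:

`Ordering::SeqCst` (Sequential Consistency) acts as a full memory barrier. It forces the CPU to flush the Store Buffer and wait for completion before proceeding to the Load. As a result, at least one side always observes the other's store: either the Guest reads `need_wakeup = 1` and notifies the Host, or the Host reads the new `req_rear` and processes the request instead of sleeping.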
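A minimal, self-contained sketch of the hybrid busy-poll and sleep pattern described above, under stated assumptions: `thread::park` and `unpark` stand in for `epoll_wait` and the guest's notification, "processing" a request is reduced to advancing a local index, and `Queue`, `SPIN_LIMIT`, `run_host`, `run_guest`, and `demo` are hypothetical names. It is not the actual `consume_pending_requests()` implementation.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical spin budget before the consumer falls back to sleeping.
const SPIN_LIMIT: u32 = 1000;

pub struct Queue {
    pub req_rear: AtomicU32,
    pub need_wakeup: AtomicBool,
}

/// Host: busy-poll first, then sleep. Returns the final consumer index.
pub fn run_host(q: &Queue, total: u32) -> u32 {
    let mut front = 0u32;
    let mut spins = 0u32;
    while front != total {
        let rear = q.req_rear.load(Ordering::Acquire);
        if rear != front {
            front = rear; // "process" every pending request at once
            spins = 0;
            continue;
        }
        if spins < SPIN_LIMIT {
            spins += 1;
            std::hint::spin_loop();
            continue;
        }
        // Announce the intent to sleep, then re-check behind a full barrier.
        q.need_wakeup.store(true, Ordering::Relaxed);
        fence(Ordering::SeqCst);
        if q.req_rear.load(Ordering::Acquire) == front {
            thread::park(); // stand-in for epoll_wait; may wake spuriously
        }
        q.need_wakeup.store(false, Ordering::Relaxed);
        spins = 0;
    }
    front
}

/// Guest: publish each request, then notify only if the Host asked for it.
pub fn run_guest(q: &Queue, host: &thread::Thread, total: u32) {
    for i in 1..=total {
        q.req_rear.store(i, Ordering::Release);
        fence(Ordering::SeqCst);
        if q.need_wakeup.load(Ordering::Relaxed) {
            host.unpark(); // stand-in for the guest's notification
        }
    }
}

/// Run one Host thread against a Guest on the calling thread.
pub fn demo(total: u32) -> u32 {
    let q = Arc::new(Queue {
        req_rear: AtomicU32::new(0),
        need_wakeup: AtomicBool::new(false),
    });
    let host_q = Arc::clone(&q);
    let host = thread::spawn(move || run_host(&host_q, total));
    run_guest(&q, host.thread(), total);
    host.join().unwrap()
}
```

Calling `demo(n)` returns `n` once every request has been observed. No timeout is needed on the sleep path because the re-check after the fence closes the window in which a notification could be missed.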