Introduce crash-aware data verification #2091
The first patch of this series is from #2086.
Please rewrite the commit message for the first patch according to this format:
Why is "backend: check whether to stop in |
*/
if (ddir_rw(acct_ddir(io_u))) {
	io_u->numberio = td->io_issues[acct_ddir(io_u)];
	numberio = td->io_issues[acct_ddir(io_u)];
What happens if you run a 90/10 read/write workload? You will end up comparing a read IO's numberio against a write IO's numberio and stop prematurely.
Consider moving this to the if (td_write(td)...) block below.
You're right. numberio should be passed to verify_state_should_stop() only for io_issues of WRITE commands. I will move the check into the block below and put the io_u when the given numberio indicates that the job should stop.
I think this change belongs in commit cd2c17d (the 1st), where verify_state_should_stop() was actually added. I will make the change below in that commit.
@@ -1895,8 +1883,11 @@ static uint64_t do_dry_run(struct thread_data *td)
td->o.do_verify &&
td->o.verify != VERIFY_NONE &&
!td->o.experimental_verify) {
- if (!verify_state_should_skip(td, io_u->numberio) &&
- !verify_state_should_stop(td, io_u->numberio))
+ if (verify_state_should_stop(td, io_u->numberio)) {
+ put_io_u(td, io_u);
+ break;
+ }
+ if (!verify_state_should_skip(td, io_u->numberio))
log_io_piece(td, io_u);
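The mixed read/write concern raised in the review can be sketched in isolation. This is a minimal, hypothetical model and not fio code: `sketch_thread`, `sketch_issue()` and `sketch_write_should_stop()` are illustrative names, and the sketch assumes the stop threshold is expressed purely in write numbers. Reads advance their own counter and are never compared against that threshold, so a 90/10 workload cannot stop early because of its read count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical data directions; fio's real enum is not reproduced here. */
enum sketch_ddir { SK_READ = 0, SK_WRITE = 1, SK_DDIR_CNT = 2 };

struct sketch_thread {
	uint64_t io_issues[SK_DDIR_CNT]; /* per-direction issue counters */
	uint64_t stop_numberio;          /* last write number allowed (assumption) */
};

/* Stand-in for verify_state_should_stop(): takes write numbers only. */
static bool sketch_write_should_stop(const struct sketch_thread *t,
				     uint64_t numberio)
{
	return numberio > t->stop_numberio;
}

/*
 * Account one I/O and report whether the job should stop.
 * Only writes are checked against the write-based threshold.
 */
static bool sketch_issue(struct sketch_thread *t, enum sketch_ddir ddir)
{
	assert(ddir == SK_READ || ddir == SK_WRITE);

	uint64_t numberio = t->io_issues[ddir]++;

	if (ddir != SK_WRITE)
		return false;

	return sketch_write_should_stop(t, numberio);
}
```

With `stop_numberio` set to 1, any number of reads can be issued without stopping, writes 0 and 1 proceed, and write 2 triggers the stop.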
I see an error when running the following: This stops because the The previous version of this patch avoided this problem because |
For crash consistency testing on NVMe devices via io_uring_cmd, only
writes issued before the last flush command are guaranteed to persist
after a power loss. Writes submitted after the most recent flush
submission may or may not survive a crash. Introduce
`--verify_policy=fsynced` to limit the verification target to writes
issued before the last fsync submission that has completed
successfully, reflecting what is actually guaranteed to be on stable
storage. Because it adds a new option, this patch also bumps
FIO_SERVER_VER to 121.
``on_fsync_submitted()`` captures @inflight_issued at submission time as
the safe threshold, stored as @safe_inflight_issued (threshold+1 to
distinguish "no fsync yet" from threshold=0). The threshold must be
captured at submission, not at completion, because an async engine may
complete a flush after subsequent writes have already been submitted.
This patch also bumps the verify state file struct version to 7.
``verify_state_should_stop()`` now uses `s->numberio` to stop verify
jobs under two conditions in offline verification:
(1) When trying to submit a write that was still inflight.
(2) When trying to queue a write whose numberio exceeds the numberio
recorded in the state file.
For `--verify_policy=fsynced`, `safe_inflight_issued - 1` is stored as
`s->numberio`, so the job terminates by condition (2) at the fsync
threshold.
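The two stop conditions can be sketched as a single predicate. This is an illustrative model only: `sketch_verify_state` is a hypothetical struct, and representing the in-flight set as an array of write numbers is an assumption, not the layout of fio's state file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_verify_state {
	uint64_t numberio;        /* last write number covered by the state */
	const uint64_t *inflight; /* write numbers in flight at save time */
	size_t nr_inflight;
};

static bool sketch_state_should_stop(const struct sketch_verify_state *s,
				     uint64_t numberio)
{
	assert(s->nr_inflight == 0 || s->inflight != NULL);

	/* Condition (2): the write lies beyond the recorded numberio. */
	if (numberio > s->numberio)
		return true;

	/* Condition (1): the write was still in flight when state was saved. */
	for (size_t i = 0; i < s->nr_inflight; i++) {
		if (s->inflight[i] == numberio)
			return true;
	}

	return false;
}
```

Under the fsynced policy, the caller would set `numberio` in this struct to the decoded fsync threshold, so condition (2) alone ends the replay there.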
``do_dry_run()`` is extended to call ``verify_state_should_stop()`` so
that io pieces beyond the threshold are not logged into @io_hist_tree,
since logging a beyond-threshold offset that overlaps an already-logged
valid entry would silently erase the valid entry from the rb-tree. For
online verification (--do_verify=1) without a state file,
``verify_state_should_stop()`` checks @safe_inflight_issued directly.
For io_uring_cmd, ``fio_ioring_queue()`` returns FIO_Q_BUSY when a flush
arrives with @cur_depth > 1 so that all in-flight writes are drained
before the flush SQE is submitted. This drain is required because NVMe
devices don't guarantee ordering between a flush SQE and concurrent write
SQEs in their internal queue.
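The drain-before-flush rule can be modelled with a toy queue function. Everything here is hypothetical: `sketch_queue`, `sketch_queue_io()` and the `SK_Q_*` statuses are stand-ins, and the sketch assumes the depth counter already includes the I/O being queued, which would explain the "> 1" comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the engine's queue statuses. */
enum sketch_q_status { SK_Q_QUEUED, SK_Q_BUSY };

struct sketch_queue {
	unsigned int cur_depth;       /* in flight, including the I/O being queued */
	unsigned int flushes_queued;  /* flushes accepted so far */
};

/*
 * Refuse a flush while other I/Os are still in flight, so the caller reaps
 * completions and retries; everything else is queued as usual.
 */
static enum sketch_q_status sketch_queue_io(struct sketch_queue *q,
					    bool is_flush)
{
	assert(q->cur_depth >= 1);

	if (is_flush && q->cur_depth > 1)
		return SK_Q_BUSY;

	if (is_flush)
		q->flushes_queued++;

	return SK_Q_QUEUED;
}
```

A flush arriving at depth 4 is rejected as busy; once the caller has reaped the three outstanding writes and the depth is 1, the same flush is accepted.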
For other async I/O engines (e.g. libaio, io_uring) on block devices,
the kernel block layer drain ensures that all preceding writes are
completed before the flush reaches the device, so no explicit drain is
needed. This drain is only required for io_uring_cmd since it bypasses
the block layer and submits NVMe flush SQEs directly to the controller.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
If there are no more verify candidates, i.e. ``do_dry_run()`` returns 0, mark the thread to terminate without calling ``do_verify()``, in the same way as is done for ``do_io()``. Otherwise, the outer loop might never exit.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
To give ``keep_running()`` a chance to stop time_based and looping jobs, check whether `--verify_only=1` was given together with a loaded verify state file. The job should stop here in that case, because the verify state file may represent fewer I/Os than the jobfile expects or than its number_ios specifies. To prevent an infinite loop with `--verify_only=` and `--verify_state_load=1`, stop the outer loop in this case.
Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
@vincentkfu, I've updated the branch with testing. Thanks for the review.
Background
When testing NVMe crash consistency via io_uring_cmd, the NVMe
controller bypasses the block layer and handles flush SQEs directly.
Unlike block-layer I/O where a flush drains all preceding writes
before reaching the device, concurrent write SQEs and a flush SQE
submitted to an NVMe controller carry no ordering guarantee. As a
result, after a power loss or intentional reset, only writes covered
by the last flush command are guaranteed to persist — writes submitted
after the most recent flush may or may not survive.
The existing verify flow has no awareness of this crash boundary.
When --verify_state_load is used for offline verification, fio
replays all recorded writes regardless of whether they were fsynced
before the crash. This causes spurious verification failures on
writes that were never guaranteed to persist.
This patch set fixes the following points:
io_hist entries beyond what was actually written during the
(possibly interrupted) write phase.
not marked to terminate, leaving the outer loop spinning.
can loop indefinitely because the byte limit derived from the
state file may be less than the jobfile limit.
Along with these fixes, this patchset introduces the
--verify_policy= option, with fsynced as its first value,
to limit verification to only the write offsets covered by the last fsync.
This patchset has been tested with a QEMU device modified to emit
errors at specific offsets.
Example scenario:
Write Phase of --verify_policy=fsynced
Verify Phase
offsets 0, 1, 3, 4, 5, 6, 7 will be verified.