Introduce crash-aware data verification by minwooim · Pull Request #2091 · axboe/fio

minwooim · 2026-05-19T04:00:12Z

Background

When testing NVMe crash consistency via io_uring_cmd, the NVMe
controller bypasses the block layer and handles flush SQEs directly.
Unlike block-layer I/O where a flush drains all preceding writes
before reaching the device, concurrent write SQEs and a flush SQE
submitted to an NVMe controller carry no ordering guarantee. As a
result, after a power loss or intentional reset, only writes covered
by the last flush command are guaranteed to persist — writes submitted
after the most recent flush may or may not survive.

The existing verify flow has no awareness of this crash boundary.
When --verify_state_load is used for offline verification, fio
replays all recorded writes regardless of whether they were fsynced
before the crash. This causes spurious verification failures on
writes that were never guaranteed to persist.

This patch sets fixed the following points:

do_dry_run() does not check the termination point, so it logs
io_hist entries beyond what was actually written during the
(possibly interrupted) write phase.
When do_dry_run() returns 0 (no verifiable I/Os), the thread is
not marked to terminate, leaving the outer loop spinning.
With --verify_only=1 and a loaded state file, keep_running()
can loop indefinitely because the byte limit derived from the
state file may be less than the jobfile limit.

Along with them, this patchset introduced --verify_policy= option
with a value named fsynced first to limit verification to only the writes
offsets covered by the last fsync.

This patchset has tested with the QEMU device with modifications to emit
errors in specific offsets.

Example scenario:
Write Phase of --verify_policy=fsynced

Ws0
Ws1
Ws2
        Wc0
        Wc1
Ws3
        Wc2 (errored)     <---- This will not be verified (skipped)
        Wc3
Fs0
Ws4
Ws5
        Wc4
Ws6
        Wc5
Ws7
        Wc6
        Wc7
        Fc0
Fs1                   <---- Last fsync submission, WRITEs after this submission will not be verified
Ws8
Ws9
        Wc8
        Wc9
        Fc1
<DONE>

Verify Phase
offsets 0, 1, 3, 4, 5, 6, 7 will be verified.

minwooim · 2026-05-19T04:00:27Z

The first patch of this series is from #2086.

vincentkfu · 2026-05-20T21:44:50Z

Please rewrite the commit message for the first patch according to this format:

Start with a paragraph explaining what the feature is
Explain how this patch modifies the verify state feature to accomplish this
Explain the change made to the io_uring_cmd ioengine to support this feature
Retain the current final paragraph noting that other ioengines do not require clearing the queue before a flush because the block layer does it

vincentkfu · 2026-05-21T17:17:13Z

Why is "backend: check whether to stop in do_dry_run()" necessary if fio is already checking verify_state_should_stop in do_verify?

minwooim · 2026-05-26T10:22:18Z

Why is "backend: check whether to stop in do_dry_run()" necessary if fio is already checking verify_state_should_stop in do_verify?

do_dry_run() runs before do_verify() where io_hist_tree is built up by replaying the write phase. Without this, offsets beyond the threshold get logged, but especially with overlap risks, io_hist_tree might be corrupted due to beyond offsetse.

vincentkfu · 2026-05-28T23:17:02Z

+		 */
 		if (ddir_rw(acct_ddir(io_u))) {
-			io_u->numberio = td->io_issues[acct_ddir(io_u)];
+			numberio = td->io_issues[acct_ddir(io_u)];


What happens if you run a 90/10 read/write workload? You will end up comparing a read IO's numberio against a write IOs numberio and stop prematurely.
Consider moving this to the if (td_write(td)...) block below.

You're right. numberio should have been given to verify_state_should_stop() only for io_issues of WRITE commands. I will move it to the below and make io_u put if the given numberio should be stopped.

I guess this change should have been in commit cd2c17d (1st) where the actual verify_state_should_stop has been added. Will make the change below to the corresponding commit.

@@ -1895,8 +1883,11 @@ static uint64_t do_dry_run(struct thread_data *td) td->o.do_verify && td->o.verify != VERIFY_NONE && !td->o.experimental_verify) { - if (!verify_state_should_skip(td, io_u->numberio) && - !verify_state_should_stop(td, io_u->numberio)) + if (verify_state_should_stop(td, io_u->numberio)) { + put_io_u(td, io_u); + break; + } + if (!verify_state_should_skip(td, io_u->numberio)) log_io_piece(td, io_u);

vincentkfu · 2026-06-03T16:35:17Z

I see an error when running the following:

vincent@fio-dev:~/fio-dev/2091/fio-verify-policy$ ./fio --name=test --filename=test --filesize=1M --verify_state_save=1 --verify=crc32c --rw=randwrite --verify_policy=fsynced --fsync=16
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.3-3118-ga1ce
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=6132: Wed Jun  3 12:13:32 2026
  read: IOPS=18.5k, BW=72.1MiB/s (75.6MB/s)(960KiB/13msec)
    clat (usec): min=33, max=120, avg=50.16, stdev=16.87
     lat (usec): min=33, max=120, avg=50.23, stdev=16.89
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   37], 10.00th=[   38], 20.00th=[   40],
     | 30.00th=[   41], 40.00th=[   42], 50.00th=[   44], 60.00th=[   45],
     | 70.00th=[   49], 80.00th=[   59], 90.00th=[   79], 95.00th=[   88],
     | 99.00th=[  111], 99.50th=[  112], 99.90th=[  121], 99.95th=[  121],
     | 99.99th=[  121]
  write: IOPS=14.2k, BW=55.6MiB/s (58.3MB/s)(1024KiB/18msec)
    clat (nsec): min=1723, max=31730, avg=3356.52, stdev=3479.53
     lat (nsec): min=2665, max=37911, avg=4921.44, stdev=4159.64
    clat percentiles (nsec):
     |  1.00th=[ 1736],  5.00th=[ 1832], 10.00th=[ 1928], 20.00th=[ 2040],
     | 30.00th=[ 2352], 40.00th=[ 2448], 50.00th=[ 2544], 60.00th=[ 2672],
     | 70.00th=[ 2800], 80.00th=[ 2992], 90.00th=[ 3728], 95.00th=[ 8768],
     | 99.00th=[21120], 99.50th=[28544], 99.90th=[31616], 99.95th=[31616],
     | 99.99th=[31616]
  lat (usec)   : 2=8.06%, 4=39.11%, 10=2.42%, 20=1.41%, 50=36.49%
  lat (usec)   : 100=11.69%, 250=0.81%
  fsync/fdatasync/sync_file_range/syncfs:
    sync (usec): min=675, max=2148, avg=1091.84, stdev=444.40
    sync percentiles (usec):
     |  1.00th=[  676],  5.00th=[  676], 10.00th=[  750], 20.00th=[  791],
     | 30.00th=[  807], 40.00th=[  807], 50.00th=[  848], 60.00th=[  848],
     | 70.00th=[ 1467], 80.00th=[ 1532], 90.00th=[ 1598], 95.00th=[ 2147],
     | 99.00th=[ 2147], 99.50th=[ 2147], 99.90th=[ 2147], 99.95th=[ 2147],
     | 99.99th=[ 2147]
  cpu          : usr=0.00%, sys=20.00%, ctx=279, majf=0, minf=23
  IO depths    : 1=103.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=240,256,0,15 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0.00ns, window=0.00ns, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=72.1MiB/s (75.6MB/s), 72.1MiB/s-72.1MiB/s (75.6MB/s-75.6MB/s), io=960KiB (983kB), run=13-13msec
  WRITE: bw=55.6MiB/s (58.3MB/s), 55.6MiB/s-55.6MiB/s (58.3MB/s-58.3MB/s), io=1024KiB (1049kB), run=18-18msec

Disk stats (read/write):
  vda: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
vincent@fio-dev:~/fio-dev/2091/fio-verify-policy$ ./fio --name=test --filename=test --filesize=1M --verify_state_load=1 --verify=crc32c --rw=randwrite --verify_policy=fsynced --verify_only=1
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.3-3118-ga1ce
Starting 1 process
Stop verify because seq 240 >= 240
fio: ioengines.c:343: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
fio: pid=6144, got signal=6

test: (groupid=0, jobs=1): err= 0: pid=6144: Wed Jun  3 12:13:51 2026
  lat (msec)   : >=2000=99.59%
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,241,0,0 short=0,0,0,0 dropped=0,0,0,0

Run status group 0 (all jobs):

Disk stats (read/write):
  vda: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
fio: file hash not empty on exit

This stops because the IO_U_F_FLIGHT flag is not cleared when verify_state_should_stop returns true in do_dry_run.

The previous version of this patch avoided this problem because verify_state_should_stop was checked before the flag was set and other accounting was carried out. Consider moving the verify_state_should_stop check before the flag is set and accounting is carried out but make sure that this stopping criteria is carried out only when appropriate. This seems easier than unwinding the flag and accounting when verify_state_should_stop returns true.

For crash consistency testing on NVMe devices via io_uring_cmd, only writes issued before the last flush command are guaranteed to persist after a power loss. Writes submitted after the most recent flush submission may or may not survive a crash. Introduce `--verify_policy=fsynced` to limit the verification target to only writes covered before the last fsync submission which has completed successfully, reflecting what is actually guaranteed to be on stable storage. By adding a new option, this patch also bumps FIO_SERVER_VER to 121. ``on_fsync_submitted()`` captures @inflight_issued at submission time as the safe threshold, stored as @safe_inflight_issued (threshold+1 to distinguish "no fsync yet" from threshold=0). The threshold must be captured at submission, not at completion, because an async engine may complete a flush after subsequent writes have already been submitted. This patch updated the verify state file struct by bumping it up to 7. Now we consider the `s->numberio` to stop the verify jobs by ``verify_state_should_stop()`` for two conditions in offline verification: (1) When trying to submit a write that was still inflight. (2) When trying to queue a write whose numberio exceeds the numberio recorded in the state file. For `--verify_policy=fsynced`, `safe_inflight_issued - 1` is stored as `s->numberio` terminating itself by condition 2 at the fsync threshold. ``do_dry_run()`` is extended to call ``verify_state_should_stop()`` so that io pieces beyond the threshold are not logged into @io_hist_tree, since logging a beyond-threshold offset that overlaps an already-logged valid entry would silently erase the valid entry from the rb-tree. For online verification (--do_verify=1) without a state file, ``verify_state_should_stop()`` checks @safe_inflight_issued directly. For io_uring_cmd, ``fio_ioring_queue()`` returns FIO_Q_BUSY when a flush arrives with @cur_depth > 1 so that all in-flight writes are drained before the flush SQE is submitted. This drain is required because NVMe devices don't guarantee ordering between a flush SQE and concurrent write SQEs in their internal queue. For other async I/O engines (e.g. libaio, io_uring) on block devices, the kernel block layer drain ensures that all preceding writes are completed before the flush reaches the device, so no explicit drain is needed. This drain is only required for io_uring_cmd since it bypasses the block layer and submits NVMe flush SQEs directly to the controller. Signed-off-by: Minwoo Im <minwoo.im@samsung.com>

If there's no more verify candidates where the return value of ``do_dry_run()`` is 0, we should mark thread to terminate without calling ``do_verify()`` just like how to call ``do_io()``. Otherwise, the outer big loop might not exit. Signed-off-by: Minwoo Im <minwoo.im@samsung.com>

Giving chances to check whether to stop or not for time_based and loops in ``keep_running()``, check if `--verify_only=1` was given along with verify state file loaded which should stop here since verify state file might represent the less number of I/Os expected or number_ios described by the given jobfile. To prevent infinite loop in `--verify_only=` && `--verify_state_load=1`, stop the outer loop for this case. Signed-off-by: Minwoo Im <minwoo.im@samsung.com>

minwooim · 2026-06-07T22:57:29Z

@vincentkfu , I've updated the branch with testing. Thanks for the review.

This comment was marked as resolved.

Sign in to view

vincentkfu reviewed May 20, 2026

View reviewed changes

Comment thread HOWTO.rst Outdated

vincentkfu reviewed May 20, 2026

View reviewed changes

Comment thread HOWTO.rst Outdated

vincentkfu reviewed May 20, 2026

View reviewed changes

Comment thread verify.h Outdated

This comment was marked as resolved.

Sign in to view

vincentkfu reviewed May 21, 2026

View reviewed changes

Comment thread backend.c

This comment was marked as resolved.

Sign in to view

minwooim force-pushed the verify-policy branch from 3e1444c to 366dfe9 Compare May 26, 2026 10:54