Skip to content

Introduce crash-aware data verification#2091

Merged
axboe merged 3 commits into
axboe:masterfrom
minwooim:verify-policy
Jun 8, 2026
Merged

Introduce crash-aware data verification#2091
axboe merged 3 commits into
axboe:masterfrom
minwooim:verify-policy

Conversation

@minwooim

@minwooim minwooim commented May 19, 2026

Copy link
Copy Markdown
Contributor

Background

When testing NVMe crash consistency via io_uring_cmd, the NVMe
controller bypasses the block layer and handles flush SQEs directly.
Unlike block-layer I/O where a flush drains all preceding writes
before reaching the device, concurrent write SQEs and a flush SQE
submitted to an NVMe controller carry no ordering guarantee. As a
result, after a power loss or intentional reset, only writes covered
by the last flush command are guaranteed to persist — writes submitted
after the most recent flush may or may not survive.

The existing verify flow has no awareness of this crash boundary.
When --verify_state_load is used for offline verification, fio
replays all recorded writes regardless of whether they were fsynced
before the crash. This causes spurious verification failures on
writes that were never guaranteed to persist.

This patch sets fixed the following points:

  1. do_dry_run() does not check the termination point, so it logs
    io_hist entries beyond what was actually written during the
    (possibly interrupted) write phase.
  2. When do_dry_run() returns 0 (no verifiable I/Os), the thread is
    not marked to terminate, leaving the outer loop spinning.
  3. With --verify_only=1 and a loaded state file, keep_running()
    can loop indefinitely because the byte limit derived from the
    state file may be less than the jobfile limit.

Along with them, this patchset introduced --verify_policy= option
with a value named fsynced first to limit verification to only the writes
offsets covered by the last fsync.


This patchset has tested with the QEMU device with modifications to emit
errors in specific offsets.

Example scenario:
Write Phase of --verify_policy=fsynced

Ws0
Ws1
Ws2
        Wc0
        Wc1
Ws3
        Wc2 (errored)     <---- This will not be verified (skipped)
        Wc3
Fs0
Ws4
Ws5
        Wc4
Ws6
        Wc5
Ws7
        Wc6
        Wc7
        Fc0
Fs1                   <---- Last fsync submission, WRITEs after this submission will not be verified
Ws8
Ws9
        Wc8
        Wc9
        Fc1
<DONE>

Verify Phase
offsets 0, 1, 3, 4, 5, 6, 7 will be verified.

@minwooim

Copy link
Copy Markdown
Contributor Author

The first patch of this series is from #2086.

@vincentkfu

This comment was marked as resolved.

Comment thread HOWTO.rst Outdated
Comment thread HOWTO.rst Outdated
Comment thread verify.h Outdated
@vincentkfu

Copy link
Copy Markdown
Collaborator

Please rewrite the commit message for the first patch according to this format:

  • Start with a paragraph explaining what the feature is
  • Explain how this patch modifies the verify state feature to accomplish this
  • Explain the change made to the io_uring_cmd ioengine to support this feature
  • Retain the current final paragraph noting that other ioengines do not require clearing the queue before a flush because the block layer does it

@vincentkfu

This comment was marked as resolved.

Comment thread backend.c
@vincentkfu

Copy link
Copy Markdown
Collaborator

Why is "backend: check whether to stop in do_dry_run()" necessary if fio is already checking verify_state_should_stop in do_verify?

@minwooim

This comment was marked as resolved.

@minwooim

Copy link
Copy Markdown
Contributor Author

Why is "backend: check whether to stop in do_dry_run()" necessary if fio is already checking verify_state_should_stop in do_verify?

do_dry_run() runs before do_verify() where io_hist_tree is built up by replaying the write phase. Without this, offsets beyond the threshold get logged, but especially with overlap risks, io_hist_tree might be corrupted due to beyond offsetse.

Comment thread engines/io_uring.c Outdated
Comment thread t/verify-state.c Outdated
Comment thread t/verify-state.c Outdated
Comment thread HOWTO.rst
Comment thread verify.c Outdated
Comment thread verify.c Outdated
Comment thread verify.c Outdated
Comment thread backend.c Outdated
*/
if (ddir_rw(acct_ddir(io_u))) {
io_u->numberio = td->io_issues[acct_ddir(io_u)];
numberio = td->io_issues[acct_ddir(io_u)];

@vincentkfu vincentkfu May 28, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you run a 90/10 read/write workload? You will end up comparing a read IO's numberio against a write IOs numberio and stop prematurely.
Consider moving this to the if (td_write(td)...) block below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. numberio should have been given to verify_state_should_stop() only for io_issues of WRITE commands. I will move it to the below and make io_u put if the given numberio should be stopped.

I guess this change should have been in commit cd2c17d (1st) where the actual verify_state_should_stop has been added. Will make the change below to the corresponding commit.

@@ -1895,8 +1883,11 @@ static uint64_t do_dry_run(struct thread_data *td)
                    td->o.do_verify &&
                    td->o.verify != VERIFY_NONE &&
                    !td->o.experimental_verify) {
-                       if (!verify_state_should_skip(td, io_u->numberio) &&
-                           !verify_state_should_stop(td, io_u->numberio))
+                       if (verify_state_should_stop(td, io_u->numberio)) {
+                               put_io_u(td, io_u);
+                               break;
+                       }
+                       if (!verify_state_should_skip(td, io_u->numberio))
                                log_io_piece(td, io_u);

Comment thread t/verify-state.c Outdated
Comment thread backend.c
Comment thread verify.c Outdated
@vincentkfu

Copy link
Copy Markdown
Collaborator

I see an error when running the following:

vincent@fio-dev:~/fio-dev/2091/fio-verify-policy$ ./fio --name=test --filename=test --filesize=1M --verify_state_save=1 --verify=crc32c --rw=randwrite --verify_policy=fsynced --fsync=16
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.3-3118-ga1ce
Starting 1 process

test: (groupid=0, jobs=1): err= 0: pid=6132: Wed Jun  3 12:13:32 2026
  read: IOPS=18.5k, BW=72.1MiB/s (75.6MB/s)(960KiB/13msec)
    clat (usec): min=33, max=120, avg=50.16, stdev=16.87
     lat (usec): min=33, max=120, avg=50.23, stdev=16.89
    clat percentiles (usec):
     |  1.00th=[   35],  5.00th=[   37], 10.00th=[   38], 20.00th=[   40],
     | 30.00th=[   41], 40.00th=[   42], 50.00th=[   44], 60.00th=[   45],
     | 70.00th=[   49], 80.00th=[   59], 90.00th=[   79], 95.00th=[   88],
     | 99.00th=[  111], 99.50th=[  112], 99.90th=[  121], 99.95th=[  121],
     | 99.99th=[  121]
  write: IOPS=14.2k, BW=55.6MiB/s (58.3MB/s)(1024KiB/18msec)
    clat (nsec): min=1723, max=31730, avg=3356.52, stdev=3479.53
     lat (nsec): min=2665, max=37911, avg=4921.44, stdev=4159.64
    clat percentiles (nsec):
     |  1.00th=[ 1736],  5.00th=[ 1832], 10.00th=[ 1928], 20.00th=[ 2040],
     | 30.00th=[ 2352], 40.00th=[ 2448], 50.00th=[ 2544], 60.00th=[ 2672],
     | 70.00th=[ 2800], 80.00th=[ 2992], 90.00th=[ 3728], 95.00th=[ 8768],
     | 99.00th=[21120], 99.50th=[28544], 99.90th=[31616], 99.95th=[31616],
     | 99.99th=[31616]
  lat (usec)   : 2=8.06%, 4=39.11%, 10=2.42%, 20=1.41%, 50=36.49%
  lat (usec)   : 100=11.69%, 250=0.81%
  fsync/fdatasync/sync_file_range/syncfs:
    sync (usec): min=675, max=2148, avg=1091.84, stdev=444.40
    sync percentiles (usec):
     |  1.00th=[  676],  5.00th=[  676], 10.00th=[  750], 20.00th=[  791],
     | 30.00th=[  807], 40.00th=[  807], 50.00th=[  848], 60.00th=[  848],
     | 70.00th=[ 1467], 80.00th=[ 1532], 90.00th=[ 1598], 95.00th=[ 2147],
     | 99.00th=[ 2147], 99.50th=[ 2147], 99.90th=[ 2147], 99.95th=[ 2147],
     | 99.99th=[ 2147]
  cpu          : usr=0.00%, sys=20.00%, ctx=279, majf=0, minf=23
  IO depths    : 1=103.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=240,256,0,15 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0.00ns, window=0.00ns, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=72.1MiB/s (75.6MB/s), 72.1MiB/s-72.1MiB/s (75.6MB/s-75.6MB/s), io=960KiB (983kB), run=13-13msec
  WRITE: bw=55.6MiB/s (58.3MB/s), 55.6MiB/s-55.6MiB/s (58.3MB/s-58.3MB/s), io=1024KiB (1049kB), run=18-18msec

Disk stats (read/write):
  vda: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
vincent@fio-dev:~/fio-dev/2091/fio-verify-policy$ ./fio --name=test --filename=test --filesize=1M --verify_state_load=1 --verify=crc32c --rw=randwrite --verify_policy=fsynced --verify_only=1
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.3-3118-ga1ce
Starting 1 process
Stop verify because seq 240 >= 240
fio: ioengines.c:343: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
fio: pid=6144, got signal=6

test: (groupid=0, jobs=1): err= 0: pid=6144: Wed Jun  3 12:13:51 2026
  lat (msec)   : >=2000=99.59%
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,241,0,0 short=0,0,0,0 dropped=0,0,0,0

Run status group 0 (all jobs):

Disk stats (read/write):
  vda: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
fio: file hash not empty on exit

This stops because the IO_U_F_FLIGHT flag is not cleared when verify_state_should_stop returns true in do_dry_run.

The previous version of this patch avoided this problem because verify_state_should_stop was checked before the flag was set and other accounting was carried out. Consider moving the verify_state_should_stop check before the flag is set and accounting is carried out but make sure that this stopping criteria is carried out only when appropriate. This seems easier than unwinding the flag and accounting when verify_state_should_stop returns true.

minwooim added 3 commits June 8, 2026 07:16
For crash consistency testing on NVMe devices via io_uring_cmd, only
writes issued before the last flush command are guaranteed to persist
after a power loss.  Writes submitted after the most recent flush
submission may or may not survive a crash.  Introduce
`--verify_policy=fsynced` to limit the verification target to only
writes covered before the last fsync submission which has completed
successfully, reflecting what is actually guaranteed to be on stable
storage.  By adding a new option, this patch also bumps FIO_SERVER_VER
to 121.

``on_fsync_submitted()`` captures @inflight_issued at submission time as
the safe threshold, stored as @safe_inflight_issued (threshold+1 to
distinguish "no fsync yet" from threshold=0).  The threshold must be
captured at submission, not at completion, because an async engine may
complete a flush after subsequent writes have already been submitted.

This patch updated the verify state file struct by bumping it up to 7.
Now we consider the `s->numberio` to stop the verify jobs by
``verify_state_should_stop()`` for two conditions in offline
verification:

    (1) When trying to submit a write that was still inflight.
    (2) When trying to queue a write whose numberio exceeds the numberio
        recorded in the state file.

For `--verify_policy=fsynced`, `safe_inflight_issued - 1` is stored as
`s->numberio` terminating itself by condition 2 at the fsync threshold.
``do_dry_run()`` is extended to call ``verify_state_should_stop()`` so
that io pieces beyond the threshold are not logged into @io_hist_tree,
since logging a beyond-threshold offset that overlaps an already-logged
valid entry would silently erase the valid entry from the rb-tree.  For
online verification (--do_verify=1) without a state file,
``verify_state_should_stop()`` checks @safe_inflight_issued directly.

For io_uring_cmd, ``fio_ioring_queue()`` returns FIO_Q_BUSY when a flush
arrives with @cur_depth > 1 so that all in-flight writes are drained
before the flush SQE is submitted.  This drain is required because NVMe
devices don't guarantee ordering between a flush SQE and concurrent write
SQEs in their internal queue.

For other async I/O engines (e.g. libaio, io_uring) on block devices,
the kernel block layer drain ensures that all preceding writes are
completed before the flush reaches the device, so no explicit drain is
needed.  This drain is only required for io_uring_cmd since it bypasses
the block layer and submits NVMe flush SQEs directly to the controller.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
If there's no more verify candidates where the return value of
``do_dry_run()`` is 0, we should mark thread to terminate without
calling ``do_verify()`` just like how to call ``do_io()``.  Otherwise,
the outer big loop might not exit.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
Giving chances to check whether to stop or not for time_based and loops
in ``keep_running()``, check if `--verify_only=1` was given along with
verify state file loaded which should stop here since verify state file
might represent the less number of I/Os expected or number_ios described
by the given jobfile.  To prevent infinite loop in `--verify_only=` &&
`--verify_state_load=1`, stop the outer loop for this case.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
@minwooim

minwooim commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

@vincentkfu , I've updated the branch with testing. Thanks for the review.

@axboe axboe merged commit 34fe398 into axboe:master Jun 8, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants