Fix ann-bench dataset blob integer overflow leading to incorrect data copy beyond 4B elems (#671)

achirkin · web-flow · commit f15c1ea93eb6 · 2025-02-07T14:24:29.000Z
ann-bench keeps data dimensions as `uint32_t`. We use `std::fread` to copy the data from a file to host memory and pass `n_rows * n_cols` as the element count. The product is computed in 32 bits and converted to `size_t` only after the multiplication, so it overflows for datasets larger than 4B (2^32) elements, and only part of the data is copied.

This PR fixes the bug by casting both dimensions to `size_t` before the multiplication.

The bug only affects the benchmark cases where the data is requested in host memory that is not backed by a file.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #671
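The overflow described above can be reproduced in isolation. The following is a minimal sketch, not code from ann-bench: the helper names `n_elems_narrow` and `n_elems_wide` and the example dimensions are hypothetical, and it assumes a platform where `unsigned int` is 32 bits wide and `size_t` is 64 bits wide.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring the pre-fix pattern: the product is computed
// in 32-bit unsigned arithmetic (wrapping modulo 2^32) and only then widened.
constexpr std::size_t n_elems_narrow(std::uint32_t n_rows, std::uint32_t n_cols) {
  return n_rows * n_cols;
}

// Hypothetical helper mirroring the fix: both operands are widened to size_t
// first, so the multiplication itself is done in 64 bits.
constexpr std::size_t n_elems_wide(std::uint32_t n_rows, std::uint32_t n_cols) {
  return static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(n_cols);
}

// Assumption: size_t is at least 64 bits, otherwise widening cannot help.
static_assert(sizeof(std::size_t) >= 8, "this sketch assumes a 64-bit size_t");

// Example dimensions (made up): 100M rows x 96 columns = 9.6B elements.
// The narrow product wraps: 9'600'000'000 mod 2^32 == 1'010'065'408.
static_assert(n_elems_narrow(100000000u, 96u) == 1010065408ull, "wrapped count");
static_assert(n_elems_wide(100000000u, 96u) == 9600000000ull, "full count");
```

With the wrapped count, `std::fread` is asked for fewer elements than the dataset holds, and the comparison of its return value against the same wrapped count still succeeds, which would explain why the truncated copy is not reported as an error.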
diff --git a/cpp/bench/ann/src/common/blob.hpp b/cpp/bench/ann/src/common/blob.hpp
@@ -453,7 +453,8 @@ struct blob_mmap {
           size_t size = data_end - data_start;
           mmap_owner owner{size, flags};
           std::fseek(file_.descriptor().value(), data_start, SEEK_SET);
-          size_t n_elems = file_.rows_limit() * file_.n_cols();
+          auto n_elems =
+            static_cast<size_t>(file_.rows_limit()) * static_cast<size_t>(file_.n_cols());
           if (std::fread(owner.data(), sizeof(T), n_elems, file_.descriptor().value()) != n_elems) {
             throw std::runtime_error{"cuvs::bench::blob_mmap() fread " + file_.path() + " failed"};
           }