Add asynchronous prefetch for DirectIO directory #15224
benwtrent wants to merge 4 commits into apache:main
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
    private final int prefetchBytesSize;
    private final Deque<Long> pendingPrefetches = new ArrayDeque<>();
    private final FileChannel channel;
    private final ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
Is the executor something you would want to share within a directory or potentially even across directories? I can't find any documentation that indicates that this pattern would be a problem.
Do you have some numbers on the throughput change?
This is neat -- if Lucene implements enough top-down prefetch hinting, it might eventually be that DirectIO, alone, is sufficient for good query latency/throughput? I.e. we could stop entirely relying on OS to do its prefetching/caching (buffer cache), maybe, in very cold indices? Isn't
Correct, it's only used in certain scenarios. We are experimenting with using it in more areas (e.g. vector rescoring, to keep from polluting the off-heap cache with rescoring vectors).
It's not quite there yet. I have seen this improve throughput by more than 2x, though, depending on the read patterns. MMAP still has TONS of advantages (direct memory segment access being a HUGE one for vectors). Virtual threads make this VERY easy, but I am sure there is a lot of headroom for improvements.
I also think that NIOFS could benefit from a prefetch implementation as well.
If you used direct IO for everything, you would want to introduce an explicit disk cache somewhere. Even with prefetching, I don't think performance would meet expectations for a lot of workloads if most reads resulted in a syscall.
100% agreed. I think we are a long way from making IO super cheap. Again, MMAP still has many benefits. But virtual threads do make this way easier than it would have been before!
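To make the "explicit disk cache" point concrete, here is a minimal sketch of an LRU block cache that would sit in front of the raw reads. It is an illustration only, not part of this PR: `BlockCache`, `get`, `loads`, and the `loader` callback are hypothetical names, and the cache is a plain `LinkedHashMap` in access order rather than anything Lucene provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical LRU cache of file blocks, keyed by block start offset. */
final class BlockCache {
  private final Map<Long, byte[]> blocks;
  private int loads; // number of times the loader (i.e. a real read) was invoked

  BlockCache(int maxBlocks) {
    // accessOrder=true keeps entries ordered from least to most recently used.
    this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks; // evict the least recently used block
      }
    };
  }

  /** Returns the cached block, calling the loader (e.g. a direct read) only on a miss. */
  synchronized byte[] get(long blockStart, LongFunction<byte[]> loader) {
    byte[] cached = blocks.get(blockStart);
    if (cached == null) {
      cached = loader.apply(blockStart);
      loads++;
      blocks.put(blockStart, cached);
    }
    return cached;
  }

  synchronized int loads() {
    return loads;
  }
}
```

With something like this in place, only cache misses would pay for a syscall; how large such a cache should be, and whether it belongs in the directory at all, is an open design question here.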
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
This adds prefetching to directIO. The idea is pretty simple,
When doing many prefetches and handling things in batches, this can significantly improve throughput.
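As a rough illustration of the batching pattern, the sketch below queues background reads for several blocks and then consumes them. It is not the code in this PR: `PrefetchingReader`, `prefetch`, `block`, and `demo` are hypothetical names, it uses ordinary positional `FileChannel` reads instead of direct IO with aligned buffers, and the demo uses a cached thread pool so it runs on older JDKs (on Java 21+, `Executors.newVirtualThreadPerTaskExecutor()` from the diff could be passed in instead). The executor is a constructor argument so that one executor could be shared by several readers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of asynchronous block prefetching; not the PR's implementation. */
final class PrefetchingReader implements AutoCloseable {
  private final FileChannel channel;
  private final int blockSize;
  private final ExecutorService executor; // supplied by the caller, so it can be shared
  private final Map<Long, CompletableFuture<ByteBuffer>> pending = new ConcurrentHashMap<>();

  PrefetchingReader(Path path, int blockSize, ExecutorService executor) throws IOException {
    this.channel = FileChannel.open(path, StandardOpenOption.READ);
    this.blockSize = blockSize;
    this.executor = executor;
  }

  /** Starts a background read of the block containing pos and returns immediately. */
  void prefetch(long pos) {
    long blockStart = pos - (pos % blockSize);
    pending.computeIfAbsent(
        blockStart, start -> CompletableFuture.supplyAsync(() -> readBlock(start), executor));
  }

  /** Returns the block containing pos, waiting on a pending prefetch if there is one. */
  ByteBuffer block(long pos) {
    long blockStart = pos - (pos % blockSize);
    CompletableFuture<ByteBuffer> prefetched = pending.remove(blockStart);
    return prefetched != null ? prefetched.join() : readBlock(blockStart);
  }

  /** Positional read; FileChannel allows these to run concurrently. */
  private ByteBuffer readBlock(long start) {
    ByteBuffer buf = ByteBuffer.allocate(blockSize);
    try {
      long filePos = start;
      while (buf.hasRemaining()) {
        int n = channel.read(buf, filePos);
        if (n < 0) break; // end of file: return a short block
        filePos += n;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    buf.flip();
    return buf;
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }

  private static String ascii(ByteBuffer buf) {
    return new String(buf.array(), 0, buf.limit(), StandardCharsets.US_ASCII);
  }

  /** Demo: prefetch two blocks as a batch, then read all three blocks of a temp file. */
  static String demo() {
    ExecutorService shared = Executors.newCachedThreadPool();
    try {
      Path file = Files.createTempFile("prefetch-sketch", ".bin");
      Files.write(file, "aaaabbbbcccc".getBytes(StandardCharsets.US_ASCII));
      try (PrefetchingReader reader = new PrefetchingReader(file, 4, shared)) {
        reader.prefetch(4); // both reads are queued before either result is needed
        reader.prefetch(8);
        String second = ascii(reader.block(4));
        String third = ascii(reader.block(8));
        String first = ascii(reader.block(0)); // never prefetched: synchronous read
        return first + second + third;
      } finally {
        Files.deleteIfExists(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      shared.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The throughput benefit comes from issuing the `prefetch` calls for a whole batch up front, so the reads overlap instead of each `block` call paying its own IO latency in sequence.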