Add support for the Linux PAGEMAP_SCAN ioctl #11372
Conversation
* Don't hold a raw pointer to the original data, plumb through an `Arc` to have a safe reference instead.
* Update pagemap bindings to latest version of abstraction written.
* Include pagemap-specific tests.
* Use a `PageMap` structure created once-per-pool instead of a static-with-a-file.
* Refactor to use the "pagemap path" unconditionally which blends in the keep_resident bits.
One thing I should also note here is that @tschneidereit has plans for revamping the
fitzgen left a comment
Very cool! Excited for this!
I remember we had some discussion in the Wasmtime meeting about the dual purpose of keep_resident and how it configures both an amount to memset and a limit on the pool's max RSS when idle.

This PR would remove that second bit: if the Wasm program touches different pages on each instantiation, but always touches less than keep_resident on each instantiation, then all pages in the slot will eventually become and stay resident. Probably this is the correct thing to do long term, and we should have two different knobs for these two different purposes, but some embedders might be relying on that functionality today and it isn't being replaced with a new/different knob right now.

It is not clear to me whether punting on a solution for that use case until some vague re-rationalization of options happens in the future is acceptable or not.
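For readers following along, the knob under discussion is the pooling allocator's `*_keep_resident` configuration. A rough sketch of how it is set today (values are arbitrary, not recommendations):

```rust
use wasmtime::{Config, InstanceAllocationStrategy, PoolingAllocationConfig};

fn pooling_config() -> Config {
    // The `*_keep_resident` options bound how many bytes of a slot are memset
    // back to their original contents on reset; today they also indirectly
    // bound how much of an idle slot stays resident.
    let mut pool = PoolingAllocationConfig::default();
    pool.linear_memory_keep_resident(1 << 20); // up to 1 MiB per memory slot
    pool.table_keep_resident(1 << 12);         // up to 4 KiB per table slot

    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pool));
    config
}
```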
```rust
const PFNZERO = 1 << 5;
/// The page is THP or Hugetlb backed.
const HUGE = 1 << 6;
const SOFT_DIRTY = 1 << 7;
```
Doc for this variant?
I'll be honest, I don't know what this is, and while I found sources for docs for the other variants I am unable to find a copy/paste snippet for this one.
Did some private benchmarking and, in my server application, this took a single-core wasi-http hello world from 25000 to 40000 rps, a 60% speedup in a benchmark that is mostly measuring instantiation and cleanup overhead. Awesome work!

This comment is based on a misunderstanding I had about the way this PR works and can be ignored. We will still only keep
Basically forward to the Linux kernel source itself.
pchickey left a comment
I've found a correctness issue in this implementation that I'm working to make a reproducing test case for.
Finished debugging with Pat. Culprit is that the
@pchickey mind reviewing the most recent commit to confirm it looks good?
I can confirm that pagemap is now working properly when wasmtime is used in a forked child process! Thank you for tracking this down with me, @alexcrichton.

Now that pagemap scan is actually returning a non-empty set of pages (because, before fixing the bug, it was scanning the parent process's pages), our throughput improvement is 50%, not 60% as before. 50% is still outstanding!
I missed this PR dropping but am reviewing now. If I am understanding correctly, one quirk of this approach is that with lazy page faulting of the

This behavior wouldn't show up in a microbenchmark but is something I would expect to see in the real world; please correct me if my understanding is off.

I also have a standalone test program / benchmarking suite where I've been trying to get some more concrete numbers on where different approaches perform worse/better (based off an earlier version of @tschneidereit's branch). This includes a half-hearted attempt to do runs in parallel, though my setup definitely doesn't get to the point of meaningfully capturing the IPI/TLB-shootdown behavior encountered at scale. I'm actively tinkering with getting the test to be more representative of real-world use and costs: https://github.com/posborne/pagemap-scan-benchmark.

So far, I'm seeing quite a few cases where a scan approach can be more expensive than other approaches. Hopefully I'll have that updated later today with a model I'm a bit more confident in as well as my analysis tools for processing and plotting the comparisons.
Before responding, if you're concerned about this PR though I'd be happy to add a

Otherwise though, one factor not necessarily captured in your benchmark is that it's not always the first N bytes of linear memory that are in the working set of the module. For example, all Rust programs start with ~1MB of a linear memory shadow stack of which only a very small fraction is used at the top. After the shadow stack is read-only data which can have variable size depending on the program at hand. After that is data and the heap which will likely have an expected amount of churn. More or less, I would predict that the old heuristic of "only memset the first N bytes" performs much worse in practice than what pagemaps can achieve by figuring out a module's working set over time.

Another factor perhaps is that one big benefit of pagemaps is that it's possible to skip the

In general though I'd love to be able to improve our set of benchmarks over time and how we're modeling all of this and measuring what to optimize, so I look forward to the results you get! I would definitely believe that further tuning of heuristics is needed here to, for example, completely madvise-away a memory if it's been reused too many times or the last known-keep-resident size was too large or something like that.
I don't have any acute concerns with this ATM, though having reasonable controls may be a good thing to look at with any follow-on PRs to clean up

Part of what I'm trying to piece together and potentially suss out with benchmarking is whether doing the scan in userspace to avoid the madvise call is actually cheaper than just doing the madvise. Both are going to have some awareness of the VMAs and be using much of the same code in the kernel, so it isn't clear that avoiding madvise for clean pages is a huge win in practice. The savings in the case where we can avoid doing memsets for a good chunk of pages is the only area I'm currently confident is providing potentially significant savings. I'll try to break some of these cases down better and provide that data here. I'll also spend a bit more time reading through the kernel code paths taken by each approach and whether there's any potential side-benefits (or risks) with the userspace pagemap scanning. I haven't carefully considered the COW side of things as of yet.

I agree that heuristics will probably be required at some point; right now I'm just trying to build a slightly better model to see what might make sense. There are probably some factors that may come into play with embeddings at scale in terms of the number of VMAs and associated cost of the scan op, etc., but things don't look terrible here at first blush.
(FWIW I've also added this as a discussion topic for tomorrow for further, synchronous, discussion.)

Oh this is surprising to me! I'll admit that I basically take it as an axiom that

These extra costs of

The main ways COW comes into the picture I know of are:

One thing I know as well is that @tschneidereit's original benchmarking found the implementation in the kernel relatively expensive, with temporary data structures and such. My own benchmarking shows the ioctl is still highest in the profile, so it's probably not the fastest thing in the world, but it has so far consistently beaten out pure-madvise. (I haven't been benchmarking madvise-plus-memset vs pagemaps since I've assumed madvise-plus-memset is way slower, but I should probably confirm that.)
Testing with thread parallelism, process parallelism (and no parallelism) yields interesting results. It definitely appears that the PAGEMAP_SCAN ioctl suffers from lock contention issues, especially in the multi-threaded case, but, at least as set up in this benchmark, doing a scan/memset may fare better than madvise for some multi-process cases.

Full report from a set of benchmark runs (select region size in dropdown) here: interactive_pagemap_benchmarks_by_size.html.zip

Here's the plot with a 1MB pool with 1 thread/process, 16 threads, and 16 processes running the benchmarks concurrently. My current suspicion is that this is capturing some IPIs and lock contention issues (there are some per-process mm locks which I think account for the thread/process perf differences).
@alexcrichton pointed out that the current benchmark results in a bunch of concurrent memory mapping for test setup, which might be skewing things a bit; I'll try to reduce that noise as it is probably compromising the results here a bit.
As suspected, changing up the benchmark to minimize mmap/munmap for the parallel tests greatly reduces the noise on the syscalls; here are the updated results @ posborne/pagemap-scan-benchmark@d347f38: interactive_pagemap_benchmarks_by_size.html.zip

General Summary:

[Plots: 1MB memory region size; 128K memory region size]
@posborne thank you for sharing this detailed analysis! I think the results are excellent validation of the viability of pagemap_scan as an approach to improve performance, and in case any of the

There's also one more benefit to the pagemap_scan strategy that isn't reflected in the benchmarks yet, relative to

Otherwise, I guess it might make sense to add an option to set a minimum heap size, under which we'd always just do a memset. I doubt anything much above 64KB would be useful there, though: doing needless memsets not only increases memory usage for CoW-backed pages, but also causes cache churn. I'd also expect most real-world modules/components not to be this small: anything using a shadow stack is probably going to reserve enough just for that to be too large. (Which also usually means that the overwhelming majority of the current
@tschneidereit Thank you for the feedback.
Yeah, that's a good call; I've had those in and out at different periods of time, but the page fault is probably best to measure in all cases. The point at which you pay that penalty is different but more impactful to actual execution (vs. refreshing the pool).

I'm still building up my full intuition on what values are likely to be configured in practice; for regions on the smaller side, my initial thought was that tables might be more likely to fall in that range. From out-of-band experience, the extra memsets definitely do cause undesirable cache behavior and cut into memory bandwidth, which we've seen become a problem at a certain scale (not strictly driven by wasmtime but across the board). I do think having the option could make sense in order to try to find a value that performs optimally.

The other thing I've been considering, which would require some state tracking, would be whether to mark a region as not worth scanning (regardless of size). If 100% of the pages are dirty (or maybe even 80% or some other number) then it may make more sense to just do the full memset and avoid the syscall (for regions where we are OK keeping resident).
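A tiny sketch of what that sort of state tracking could look like, purely hypothetical (none of these names exist in Wasmtime or this PR):

```rust
/// Hypothetical per-slot bookkeeping for deciding whether a PAGEMAP_SCAN is
/// worth issuing on the next reset, based on how dirty the slot has been on
/// previous resets.
struct SlotDirtyStats {
    /// Exponentially weighted average of the observed dirty-page ratio,
    /// in the range 0.0..=1.0.
    avg_dirty_ratio: f64,
}

impl SlotDirtyStats {
    /// Above this average dirty ratio, skip the ioctl and just memset.
    const SKIP_SCAN_THRESHOLD: f64 = 0.8;

    fn should_scan(&self) -> bool {
        self.avg_dirty_ratio < Self::SKIP_SCAN_THRESHOLD
    }

    /// Record the outcome of a reset: `dirty_pages` of `total_pages` needed
    /// manual resetting.
    fn record(&mut self, dirty_pages: usize, total_pages: usize) {
        let ratio = dirty_pages as f64 / total_pages.max(1) as f64;
        // Blend in the new observation so the decision adapts over time.
        self.avg_dirty_ratio = 0.75 * self.avg_dirty_ratio + 0.25 * ratio;
    }
}
```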
* WIP: use the pagemap_scan ioctl to selectively reset an instance's dirty pages
* Less hacky, and supporting tables, too
* Bugfixes
* WIP: memcpy instead of pread
* Refactor support for pagemap
* Don't hold a raw pointer to the original data, plumb through an `Arc` to have a safe reference instead.
* Update pagemap bindings to latest version of abstraction written.
* Include pagemap-specific tests.
* Use a `PageMap` structure created once-per-pool instead of a static-with-a-file.
* Refactor to use the "pagemap path" unconditionally which blends in the keep_resident bits.
* Improve safety documentation prtest:full
* Fix some lints
* Skip ioctl tests when it's not supported
* Fix a memory leak by moving impls around
* Fix no vm build
* Review comments
* Add more pagemap-specific documentation
* Add more docs, refactor implementation slightly
* Improve `category_*` docs (basically forward to the Linux kernel source itself)
* Fix compile
* Make pagemap integration resilient across forks
* Fix non-pooling-allocator-build
* Fix portability issues of new test
* Actually use config on macos

---------

Co-authored-by: Till Schneidereit <[email protected]>
This series of commits is the brainchild of @tschneidereit who, in his spare time, reads Linux kernel documentation and finds random ioctls. Specifically @tschneidereit discovered the `PAGEMAP_SCAN` ioctl, whose basic description lives in the kernel's pagemap documentation.

As a bit of background, one of the main scaling bottlenecks for Wasmtime-on-the-server is the implementation of resetting WebAssembly instances back to their original state, notably linear memories and tables. Wasmtime employs two separate strategies on Linux for doing this today:

* `PoolingAllocationConfig` has `*_keep_resident` options which indicate that this much memory will be memset back to the original contents. These options default to 0 bytes.
* Everything beyond `*_keep_resident` is handled with `madvise(MADV_DONTNEED)` to reset linear memories and tables back to their original state.

Both of these strategies have drawbacks. For `*_keep_resident` and memset this is a blind memset of the lower addresses in linear memory where lots of memory is set that was neither read nor written by the guest. As a result this can't be too large lest Wasmtime become a `memset` benchmark. For `madvise(MADV_DONTNEED)` this requires modifying page tables (removing resident pages) which results in IPIs to synchronize other cores. Additionally each invocation of a WebAssembly instance will always incur a page fault on memory accesses (albeit a minor fault, not a major fault), which can add up.

By using the `PAGEMAP_SCAN` ioctl we can take the best of both worlds here and combine these into a more efficient way to reset linear memories and tables back to their original contents. At a high level this works by:

* scanning a linear memory (or table) for dirty pages,
* memset-ing dirty pages back to their original contents, up to the `*_keep_resident` budget, and
* `madvise`-ing any pages above the `*_keep_resident` threshold.

In essence this is almost a tailor-made syscall for Wasmtime and perfectly fits our use case. We can quickly find dirty pages, up to a certain maximum, which segments memory into "manually reset these regions" followed by "decommit these regions". This enables Wasmtime to `memset` only memory written by the guest; memory that's only read by the guest, for example, remains paged in and unmodified.
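To make the mechanics a bit more concrete, here is a rough sketch of driving `PAGEMAP_SCAN` from Rust, based on my reading of the Linux UAPI definitions (`struct pm_scan_arg`, `struct page_region`, and the page category bits). It only enumerates present pages and is illustrative, not the module added in this PR:

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Mirrors `struct pm_scan_arg` from Linux's uapi/linux/fs.h (kernel 6.7+).
#[repr(C)]
#[derive(Default)]
struct PmScanArg {
    size: u64,               // must be set to size_of::<PmScanArg>()
    flags: u64,              // write-protect related flags; 0 for a plain scan
    start: u64,              // page-aligned start of the range to scan
    end: u64,                // page-aligned end of the range
    walk_end: u64,           // out: address where the walk stopped
    vec: u64,                // pointer to a `PageRegion` array
    vec_len: u64,            // capacity of that array
    max_pages: u64,          // 0 = no limit on matched pages
    category_inverted: u64,
    category_mask: u64,      // only report pages whose categories match
    category_anyof_mask: u64,
    return_mask: u64,        // which category bits to report per region
}

// Mirrors `struct page_region`: a contiguous run of pages sharing categories.
#[repr(C)]
#[derive(Clone, Copy, Default)]
struct PageRegion {
    start: u64,
    end: u64,
    categories: u64,
}

// Category bit for "page table entry is present" (PAGE_IS_PRESENT). Other
// bits (PAGE_IS_FILE, PAGE_IS_SOFT_DIRTY, ...) are what make finer-grained
// distinctions possible; which masks Wasmtime uses is in its pagemap module.
const PAGE_IS_PRESENT: u64 = 1 << 3;

// PAGEMAP_SCAN is `_IOWR('f', 16, struct pm_scan_arg)`; expand the macro for
// the common Linux ioctl encoding (direction | size | type | number).
const PAGEMAP_SCAN: libc::c_ulong = {
    let size = std::mem::size_of::<PmScanArg>() as libc::c_ulong;
    (3 << 30) | (size << 16) | (('f' as libc::c_ulong) << 8) | 16
};

/// Report the resident regions within the page-aligned range `[start, end)`
/// of this process's address space.
fn scan_present(start: usize, end: usize) -> io::Result<Vec<PageRegion>> {
    let pagemap = File::open("/proc/self/pagemap")?;
    let mut regions = vec![PageRegion::default(); 512];
    let mut arg = PmScanArg {
        size: std::mem::size_of::<PmScanArg>() as u64,
        start: start as u64,
        end: end as u64,
        vec: regions.as_mut_ptr() as u64,
        vec_len: regions.len() as u64,
        category_mask: PAGE_IS_PRESENT,
        return_mask: PAGE_IS_PRESENT,
        ..Default::default()
    };
    // SAFETY: `arg` and `regions` outlive the call and match the layouts the
    // kernel expects. On success the ioctl returns the number of regions
    // written into `vec`.
    let n = unsafe { libc::ioctl(pagemap.as_raw_fd(), PAGEMAP_SCAN, &mut arg as *mut PmScanArg) };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    regions.truncate(n as usize);
    Ok(regions)
}
```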
This is an improvement over `*_keep_resident` because only written pages are reset back to their original contents, not all pages in the low addresses of the linear memory address space. This is also an improvement over `madvise(MADV_DONTNEED)` because readonly pages are kept resident over time (no page faults on future invocations, all mapped to the same file-backed readonly page across multiple instances), written pages are kept resident over time (but reset back to original contents after instantiation), and IPIs/TLB shootdowns can be avoided (assuming the working set of written memory is less than `*_keep_resident` and memory isn't grown during execution). Overall this helps combine the best of both worlds and provides another tool in the toolbox of Wasmtime to efficiently reset linear memory back to zero.

In terms of practical impact this enables a 2.5x increase in RPS measured on a simple p2 hello-world server that uses `wasmtime serve`. The server in question only writes two host pages of memory, so resetting a linear memory requires a single syscall and 8192 bytes of memcpy. Tables are memset to zero and don't require an ioctl because their total size is so small (both are less than 1 page of memory).

This series of commits was originally written by @tschneidereit who did all the initial exploration and effort. I've rebased the commits and cleaned things up towards the end for integration in Wasmtime. The main technical changes here are to hook into linear memory and table deallocation.
The `MemoryImageSlot` has also been refactored to contain an `Arc` pointing to the original image itself in-memory, which is used to manually `memset` contents back to their original version. Resetting a `MemoryImageSlot` has been refactored to pretend that page maps are always available, which helps simplify the code and ensure thorough testing even when not on Linux.
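As a rough illustration of that `Arc`-backed reset path, a hypothetical helper (names and signatures are invented, not `MemoryImageSlot`'s actual internals):

```rust
use std::sync::Arc;

/// Hypothetical sketch: copy the original image bytes back over the regions
/// the pagemap scan reported as written. `base` points at the start of the
/// slot's linear memory and `dirty` holds (offset, len) pairs within it.
///
/// SAFETY: the caller must guarantee that the slot's memory is mapped
/// writable and that every dirty region falls inside `image`.
unsafe fn restore_from_image(base: *mut u8, image: &Arc<Vec<u8>>, dirty: &[(usize, usize)]) {
    for &(offset, len) in dirty {
        // Written pages get their original contents back; pages the guest
        // only read were never modified and are left alone entirely.
        unsafe {
            std::ptr::copy_nonoverlapping(image[offset..offset + len].as_ptr(), base.add(offset), len);
        }
    }
}
```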
Wasmtime only uses `PAGEMAP_SCAN` on Linux and this support is disabled on all other platforms. The `PAGEMAP_SCAN` ioctl is also relatively new, so Wasmtime additionally disables it at runtime on Linux if the kernel isn't new enough to support `PAGEMAP_SCAN`.

Support for the `PAGEMAP_SCAN` ioctl is in a Linux-specific module and is effectively a copy of this repository where I did initial testing of this ioctl. That repository isn't published on crates.io, nor do I plan on publishing it on crates.io. The test suite is included here as well in the module for the Linux pagemap bits.

A new test is additionally added to `cow.rs` which exercises some of the more interesting cases with `PAGEMAP_SCAN`.