Conversation

@alexcrichton
Member

This series of commits is the brainchild of @tschneidereit who, in his spare time, reads Linux kernel documentation and finds random ioctls. Specifically @tschneidereit discovered the PAGEMAP_SCAN ioctl, which has this basic description:

This ioctl(2) is used to get and optionally clear some specific
flags from page table entries. The information is returned with
PAGE_SIZE granularity.

As a bit of background, one of the main scaling bottlenecks for Wasmtime-on-the-server is the implementation of resetting WebAssembly instances back to their original state, notably linear memories and tables. Wasmtime employs two separate strategies on Linux for doing this today (a configuration sketch follows this list):

  • The PoolingAllocationConfig has *_keep_resident options which indicate how much memory will be memset back to its original contents. These options default to 0 bytes.
  • Resetting memory beyond *_keep_resident is done with madvise(MADV_DONTNEED) to reset linear memories and tables back to their original state.
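
For concreteness, here's a minimal sketch of those knobs through the `wasmtime` crate's pooling allocator API; the byte values are illustrative, not recommendations:

```rust
use wasmtime::{Config, InstanceAllocationStrategy, PoolingAllocationConfig};

fn configure() -> Config {
    let mut pool = PoolingAllocationConfig::default();
    // Reset up to 64 KiB of each linear memory with memset; anything
    // beyond this is decommitted with madvise(MADV_DONTNEED).
    pool.linear_memory_keep_resident(64 * 1024);
    // The analogous knob for tables.
    pool.table_keep_resident(4096);

    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pool));
    config
}
```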

Both of these strategies have drawbacks. For *_keep_resident and memset this is a blind memset of the lower addresses of linear memory, where lots of memory gets reset that the guest never read or wrote. As a result this can't be too large lest Wasmtime become a memset benchmark. For madvise(MADV_DONTNEED) this requires modifying page tables (removing resident pages), which results in IPIs to synchronize other cores. Additionally each invocation of a WebAssembly instance will always incur a page fault on memory accesses (albeit a minor fault, not a major fault), which can add up.

By using the PAGEMAP_SCAN ioctl we can take the best of both worlds and combine these strategies into a more efficient way to reset linear memories and tables back to their original contents. At a high level the ioctl works by:

  • A range of virtual memory is specified to "scan", along with flags indicating which page-table bits are of interest.
  • The ioctl writes to a user-supplied buffer with page-aligned regions of memory that match the flags specified.
  • Wasmtime uses this to detect, within an entire linear memory, which pages are "dirty" (written to), and it memsets these while madvise-ing any pages above the *_keep_resident threshold.

In essence this is almost a tailor-made syscall for Wasmtime and perfectly fits our use case. We can quickly find dirty pages, up to a certain maximum, which segments memory into "manually reset these regions" followed by "decommit these regions". This enables Wasmtime to memset only memory written by the guest and, for example, memory that's only read by the guest remains paged in and remains unmodified.
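
To make the shape of the ioctl concrete, here's a rough userspace sketch. This mirrors the kernel's UAPI (`struct pm_scan_arg` / `struct page_region` from include/uapi/linux/fs.h on kernels with PAGEMAP_SCAN), not Wasmtime's actual internal bindings:

```rust
use std::fs::File;
use std::os::fd::AsRawFd;

// Layout mirrors `struct pm_scan_arg` from the kernel UAPI.
#[repr(C)]
#[derive(Default)]
struct PmScanArg {
    size: u64,                // sizeof(struct pm_scan_arg), for versioning
    flags: u64,               // scan flags (0 = read-only scan)
    start: u64,               // start of the virtual range to scan
    end: u64,                 // end of the range
    walk_end: u64,            // out: address where the walk stopped
    vec: u64,                 // pointer to a buffer of `PageRegion`s
    vec_len: u64,             // capacity of that buffer, in entries
    max_pages: u64,           // optional cap on matched pages (0 = none)
    category_inverted: u64,   // categories whose sense is inverted
    category_mask: u64,       // categories that must all match
    category_anyof_mask: u64, // categories where any one match suffices
    return_mask: u64,         // categories reported back per region
}

// Layout mirrors `struct page_region` from the kernel UAPI.
#[repr(C)]
struct PageRegion {
    start: u64,
    end: u64,
    categories: u64,
}

const PAGE_IS_WRITTEN: u64 = 1 << 1; // category bit from the kernel UAPI
const PAGEMAP_SCAN: libc::c_ulong = 0xc060_6610; // _IOWR('f', 16, struct pm_scan_arg)

/// Report the page-aligned regions in `[base, base + len)` that have been
/// written to, according to `pagemap` (an open `/proc/self/pagemap`).
/// `base` and `len` are assumed page-aligned.
fn scan_written(pagemap: &File, base: usize, len: usize) -> std::io::Result<Vec<PageRegion>> {
    let mut regions: Vec<PageRegion> = Vec::with_capacity(128);
    let mut arg = PmScanArg {
        size: std::mem::size_of::<PmScanArg>() as u64,
        start: base as u64,
        end: (base + len) as u64,
        vec: regions.as_mut_ptr() as u64,
        vec_len: regions.capacity() as u64,
        category_mask: PAGE_IS_WRITTEN,
        return_mask: PAGE_IS_WRITTEN,
        ..Default::default()
    };
    // On success the ioctl returns the number of `PageRegion`s filled in.
    let filled = unsafe { libc::ioctl(pagemap.as_raw_fd(), PAGEMAP_SCAN, &mut arg) };
    if filled < 0 {
        return Err(std::io::Error::last_os_error());
    }
    unsafe { regions.set_len(filled as usize) };
    Ok(regions)
}
```

A caller would open /proc/self/pagemap once and feed the returned regions into a memset-or-madvise loop.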

This is an improvement over *_keep_resident because only written pages are reset back to their original contents, not all pages in the low addresses of the linear memory address space. This is also an improvement over madvise(MADV_DONTNEED) because readonly pages are kept resident over time (no page faults on future invocations, all mapped to the same file-backed readonly page across multiple instances), written pages are kept resident over time (but reset back to original contents after instantiation), and IPIs/TLB shootdowns can be avoided (assuming the working set of written memory is less than *_keep_resident and memory isn't grown during execution). Overall this combines the best of both worlds and provides another tool in Wasmtime's toolbox for efficiently resetting linear memory back to its original contents.

In terms of practical impact this enables a 2.5x increase in RPS measured for a simple p2 hello-world server that uses wasmtime serve. The server in question only writes two host pages of memory, so resetting a linear memory requires a single syscall and 8192 bytes of memcpy. Tables are memset to zero and don't require an ioctl because their total size is so small (both are less than 1 page of memory).


This series of commits was originally written by @tschneidereit who did all the initial exploration and effort. I've rebased the commits and cleaned things up towards the end for integration in Wasmtime. The main technical changes here are to hook into linear memory and table deallocation. The MemoryImageSlot has also been refactored to contain an Arc pointing to the original image itself in-memory, which is used to manually memset contents back to their original version. Resetting a MemoryImageSlot has been refactored to pretend that page maps are always available, which helps simplify the code and ensure thorough testing even when not on Linux.

Wasmtime only uses PAGEMAP_SCAN on Linux; this support is disabled on all other platforms. The PAGEMAP_SCAN ioctl is also relatively new, so Wasmtime additionally disables it at runtime on Linux if the kernel isn't new enough to support it.

Support for the PAGEMAP_SCAN ioctl lives in a Linux-specific module and is effectively a copy of this repository where I did initial testing of this ioctl. That repository isn't published on crates.io, nor do I plan to publish it there. Its test suite is included here as well in the Linux pagemap module.

A new test is additionally added to cow.rs which exercises some of the more interesting cases with PAGEMAP_SCAN.

tschneidereit and others added 5 commits July 30, 2025 08:29
* Don't hold a raw pointer to the original data, plumb through an `Arc`
  to have a safe reference instead.
* Update pagemap bindings to latest version of abstraction written.
* Include pagemap-specific tests.
* Use a `PageMap` structure created once-per-pool instead of a
  static-with-a-file.
* Refactor to use the "pagemap path" unconditionally which blends in the
  keep_resident bits.
@alexcrichton alexcrichton requested a review from a team as a code owner August 1, 2025 19:38
@alexcrichton alexcrichton requested review from fitzgen and removed request for a team August 1, 2025 19:38
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Aug 1, 2025
@alexcrichton
Member Author

One thing I should also note here is that @tschneidereit has plans for revamping the *_keep_resident options to rationalize them a bit more in light of this new syscall, but we've agreed that it's best to split that out into a separate change. That means that this PR doesn't change the meaning of the preexisting *_keep_resident options; it just makes them much more effective.

Member

@fitzgen fitzgen left a comment


Very cool! Excited for this!

I remember we had some discussion in the Wasmtime meeting about the dual purpose of keep_resident and how it configures both an amount to memset and a limit on the pool's max RSS when idle. This PR would remove that second bit: if the Wasm program touches different pages on each instantiation, but always touches less than keep_resident bytes on each instantiation, then all pages in the slot will eventually become and stay resident. Probably this is the correct thing to do long term, and we should have two different knobs for these two different purposes, but some embedders might be relying on that functionality today and it isn't being replaced with a new/different knob right now. It is not clear to me whether punting on a solution for that use case until some vague re-rationalization of options happens in the future is acceptable or not.

const PFNZERO = 1 << 5;
/// The page is THP or Hugetlb backed.
const HUGE = 1 << 6;
const SOFT_DIRTY = 1 << 7;
Member


Doc for this variant?

Member Author


I'll be honest, I don't know what this is, and while I found doc sources for the other variants I'm unable to find a copy/paste snippet for this one.

@pchickey
Contributor

pchickey commented Aug 6, 2025

Did some private benchmarking and, in my server application, this took a single-core wasi-http hello world from 25000 to 40000 rps - 60% speedup in a benchmark that is mostly measuring instantiation and cleanup overhead. Awesome work!

@fitzgen
Member

fitzgen commented Aug 6, 2025

I remember we had some discussion in the Wasmtime meeting about the dual purpose of keep_resident and how it configures both an amount to memset and a limit on the pool's max RSS when idle. This PR would remove that second bit: if the Wasm program touches different pages on each instantiation, but always touches less than keep_resident bytes on each instantiation, then all pages in the slot will eventually become and stay resident. Probably this is the correct thing to do long term, and we should have two different knobs for these two different purposes, but some embedders might be relying on that functionality today and it isn't being replaced with a new/different knob right now. It is not clear to me whether punting on a solution for that use case until some vague re-rationalization of options happens in the future is acceptable or not.

This comment is based on a misunderstanding I had about the way this PR works and can be ignored. We will still only keep keep_resident bytes resident with this change.

Basically forward to the Linux kernel source itself.
Contributor

@pchickey pchickey left a comment


I've found a correctness issue in this implementation that I'm working to make a reproducing test case for

@alexcrichton
Member Author

Finished debugging with Pat. The culprit is that the Engine was created in the parent process, but forked subprocesses created the stores. That definitely doesn't work given how I've set things up, because the stores have a page map for the parent process, not the child process. Will work on a fix.
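
For illustration, the shape of such a fix might look like the following sketch; the names here are hypothetical, not Wasmtime's actual types:

```rust
use std::fs::File;

// Hypothetical sketch: a pagemap handle that notices it has crossed a
// fork() boundary. A /proc/self/pagemap fd opened in the parent
// describes the parent's page tables, so scans in a forked child would
// silently look at the wrong address space.
struct PageMapHandle {
    file: File,
    opened_by: libc::pid_t, // pid of the process that opened this handle
}

impl PageMapHandle {
    fn open() -> std::io::Result<Self> {
        Ok(PageMapHandle {
            file: File::open("/proc/self/pagemap")?,
            opened_by: unsafe { libc::getpid() },
        })
    }

    /// Reopen the handle if this process isn't the one that created it.
    fn for_current_process(&mut self) -> std::io::Result<&File> {
        if unsafe { libc::getpid() } != self.opened_by {
            *self = PageMapHandle::open()?;
        }
        Ok(&self.file)
    }
}
```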

@alexcrichton
Member Author

@pchickey mind reviewing the most recent commit to confirm it looks good?

Contributor

@pchickey pchickey left a comment


I can confirm that pagemap is now working properly when wasmtime is used in a forked child process! Thank you for tracking this down with me @alexcrichton.

Now that pagemap scan is actually returning a non-empty set of pages (because, before fixing the bug, it was scanning the parent process's pages) our throughput improvement is 50%, not 60% as before. 50% is still outstanding!

@alexcrichton alexcrichton enabled auto-merge August 8, 2025 15:36
@alexcrichton alexcrichton added this pull request to the merge queue Aug 8, 2025
Merged via the queue into bytecodealliance:main with commit b500820 Aug 8, 2025
168 checks passed
@alexcrichton alexcrichton deleted the pagemap_scan branch August 8, 2025 16:39
@posborne
Collaborator

I missed this PR dropping but am reviewing now. If I am understanding correctly, one quirk of this approach is that, with lazy page faulting of the keep_resident_* portion of regions, we might expect to see zeroing get slower over time: it would take just a single full use of the keep_resident pages for the scan to always report all pages as dirty henceforth, making resets of this memory revert to a more expensive path from that point forward (as we're going to incur the full cost of the memset plus the syscall on these pages).

This behavior wouldn't show up in a microbenchmark but is something I would expect to see in the real world; please correct me if my understanding is off.


I also have a standalone test program / benchmarking suite where I've been trying to get some more concrete numbers on where different approaches perform worse/better (based off an earlier version of @tschneidereit's branch). This includes a half-hearted attempt to do runs in parallel, though my setup definitely doesn't get to the point of meaningfully capturing the IPI/TLB-shootdown behavior encountered at scale.

I'm actively tinkering with getting the test to be more representative of real-world use and costs: https://github.com/posborne/pagemap-scan-benchmark. So far, I'm seeing quite a few cases where doing a scan approach can be more expensive than other approaches. Hopefully I'll have that updated later today with a model I'm a bit more confident in as well as my analysis tools for processing and plotting the comparisons.

@alexcrichton
Member Author

Before responding: if you're concerned about this PR I'd be happy to add a Config option to forcibly disable (or enable) it, as that seems generally useful no matter what. Basically I want to make sure it's not a headache to upgrade, or that you don't feel forced to bottom all this out before the next upgrade.

Otherwise, though, one factor not necessarily captured in your benchmark is that it's not always the first N bytes of linear memory that are in the working set of the module. For example all Rust programs start with ~1MB of a linear memory shadow stack of which only a very small fraction is used at the top. After the shadow stack is read-only data which can have variable size depending on the program at hand. After that is data and the heap, which will likely have an expected amount of churn. More-or-less I would predict that the old heuristic of "only memset the first N bytes" performs much worse in practice than what pagemaps can achieve by figuring out a module's working set over time.

Another factor perhaps is that one big benefit of pagemaps is that it's possible to skip the madvise call altogether. If the working set is smaller than keep-resident options then the pagemap scan is sufficient to reset linear memory.
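
As a sketch of that combined strategy (illustrative logic, not Wasmtime's exact code): dirty ranges reported by the scan are restored in place until the keep-resident budget runs out, and only the overflow, if any, is madvise'd, so the madvise call disappears entirely whenever the working set fits:

```rust
/// `dirty` holds page-aligned `(offset, len)` pairs reported by the
/// pagemap scan; `base` is the start of the linear-memory slot.
unsafe fn reset_slot(base: *mut u8, dirty: &[(usize, usize)], keep_resident: usize) {
    let mut budget = keep_resident;
    for &(offset, len) in dirty {
        let ptr = base.add(offset);
        if len <= budget {
            budget -= len;
            // Within budget: restore contents in place. A real
            // implementation copies the original image's bytes; zeroing
            // is shown here for simplicity.
            std::ptr::write_bytes(ptr, 0, len);
        } else {
            // Over budget: decommit the pages instead.
            libc::madvise(ptr.cast(), len, libc::MADV_DONTNEED);
        }
    }
}
```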

In general though I'd love to be able to improve our set of benchmarks over time and how we're modeling all of this and measuring what to optimize, so I look forward to the results you get! I would definitely believe that further tuning of heuristics is needed here to, for example, completely madvise-away a memory if it's been reused too many times or the last known-keep-resident size was too large or something like that.

@posborne
Collaborator

Before responding: if you're concerned about this PR I'd be happy to add a Config option to forcibly disable (or enable) it, as that seems generally useful no matter what. Basically I want to make sure it's not a headache to upgrade, or that you don't feel forced to bottom all this out before the next upgrade.

I don't have any acute concerns with this ATM, though having reasonable controls may be a good thing to look at with any follow-on PRs to clean up keep_resident config clarity, etc.

Another factor perhaps is that one big benefit of pagemaps is that it's possible to skip the madvise call altogether. If the working set is smaller than keep-resident options then the pagemap scan is sufficient to reset linear memory.

Part of what I'm trying to piece together and potentially suss out with benchmarking is whether doing the scan in userspace to avoid the madvise call is actually cheaper than just doing the madvise. Both are going to have some awareness of the VMAs and be using much of the same code in the kernel, so it isn't clear that avoiding madvise for clean pages is a huge win in practice.

The savings in the case where we can avoid doing memsets for a good chunk of pages is the only area where I'm currently confident we're getting potentially significant savings. I'll try to break some of these cases down better and provide that data here. I'll also spend a bit more time reading through the kernel code paths taken, and look at whether there are any potential side-benefits (or risks) to the userspace pagemap scanning.

I haven't carefully considered the COW side of things as of yet.

In general though I'd love to be able to improve our set of benchmarks over time and how we're modeling all of this and measuring what to optimize, so I look forward to the results you get! I would definitely believe that further tuning of heuristics is needed here to, for example, completely madvise-away a memory if it's been reused too many times or the last known-keep-resident size was too large or something like that.

I agree that heuristics will probably be required at some point; right now I'm just trying to build a slightly better model to see what might make sense. There are probably some factors that may come into play with embeddings at scale in terms of the number of VMAs and the associated cost of the scan op, etc., but things don't look terrible here at first blush.

@alexcrichton
Member Author

(FWIW I've also added this as a discussion topic for tomorrow for further, synchronous, discussion)

Part of what I'm trying to piece together and potentially suss out with benchmarking is whether doing the scan in userspace to avoid the madvise call is actually cheaper than just doing the madvise.

Oh this is surprising to me! I'll admit that I basically take it as an axiom that madvise is slow, but that's also been the result of almost all historical benchmarking. You're right that both syscalls will iterate/synchronize on VMAs, but the major costs of madvise are:

  • IPIs and more synchronization are required to remove present pages from the page tables.
  • Reuse of the same pages in the future will incur a minor page fault that the kernel services. This has historically had contention in the kernel show up high in profiles too.

These extra costs of madvise become more problematic the higher the concurrency, too. The instantiation.rs benchmark in this repository measures time-to-instantiate with N threads in the background also doing work, and while we ideally want that to be constant (unaffected by other threads) it steadily gets worse the more threads you add due to all of the contention (this isn't a perfect benchmark, but it's a proxy at least). By doing a memset we (ideally) avoid all of that kernel synchronization at the cost of a pagemap scan, which is the main goal here.
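
To make those two costs tangible, here's a minimal standalone sketch (not code from this PR) that faults a mapping in, drops it with madvise(MADV_DONTNEED), and then re-touches it to pay the minor faults:

```rust
use std::time::Instant;

fn main() {
    const LEN: usize = 1 << 20; // 1 MiB
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            LEN,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);
    let ptr = ptr as *mut u8;

    // Touch every page so they're all resident.
    unsafe { std::ptr::write_bytes(ptr, 1, LEN) };

    // Drop them with madvise; this is where page tables get modified
    // (and, under concurrency, where IPIs/TLB shootdowns happen).
    let t = Instant::now();
    unsafe { libc::madvise(ptr.cast(), LEN, libc::MADV_DONTNEED) };
    println!("madvise: {:?}", t.elapsed());

    // Re-touching now takes a minor fault per page: the recurring cost
    // the second bullet above describes.
    let t = Instant::now();
    unsafe { std::ptr::write_bytes(ptr, 2, LEN) };
    println!("refault + memset: {:?}", t.elapsed());

    unsafe { libc::munmap(ptr.cast(), LEN) };
}
```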

The main ways COW comes into the picture I know of are:

  • If a page is only read (e.g. rodata for the wasm) using PAGEMAP_SCAN vs memset-or-madvise enables us to avoid touching the page entirely. That means that after the first instance faults in the read-only page it never changes from then on and it's always present with no future page faults necessary for future instances. For memset we'd unnecessarily copy the contents and for madvise we'd unnecessarily page it out.
  • If a page is read, then written, then PAGEMAP_SCAN and memset achieve the same goal of keeping the memory resident. The first instance will fault on the read, then fault on the write, and so that cost would be paid on all future instances with madvise but both PAGEMAP_SCAN and memset avoid the cost for future instances.
  • Written-first pages are similar to the above, but slightly cheaper to fault in the future with madvise since it's just one fault instead of 2.

One thing I know as well is that @tschneidereit's original benchmarking found the implementation in the kernel relatively expensive with temporary data structures and such. My own benchmarking shows the ioctl is still highest in the profile, so it's probably not the fastest thing in the world, but it has so far consistently beaten out pure madvise. (I haven't been benchmarking madvise-plus-memset against pagemaps since I've assumed madvise-plus-memset is way slower, but I should probably confirm that.)

@posborne
Collaborator

Testing with thread parallelism, process parallelism (and no parallelism) yields interesting results. It definitely appears that the PAGEMAP_SCAN ioctl suffers from lock contention issues, especially in the multi-threaded case, but, at least as set up in this benchmark, doing a scan/memset may fare better than madvise for some multi-process cases.

Full report from a set of benchmark runs (select region size in dropdown) here: interactive_pagemap_benchmarks_by_size.html.zip

Here's the plot with a 1MB pool with 1 thread/process, 16 threads, and 16 processes running the benchmarks concurrently. My current suspicion is that this is capturing some IPIs and lock contention issues (there are some per-process mm locks which I think account for the thread/process perf differences).

[Figure: 1024KB memory reset benchmarks]

@posborne
Collaborator

@alexcrichton pointed out that the current benchmark results in a bunch of concurrent memory mapping for test setup, which might be skewing things a bit; I'll try to reduce that noise as it is probably compromising the results here.

@posborne
Collaborator

As suspected, changing up the benchmark to minimize mmap/munmap for the parallel tests greatly reduces the noise on the syscalls; here are the updated results @ posborne/pagemap-scan-benchmark@d347f38

interactive_pagemap_benchmarks_by_size.html.zip

General Summary:

  • Just zeroing the memory takes less time than pagemap below 512KB; there are probably indirect costs to always zeroing (memory bandwidth, overall CPU pressure) not shown here to consider, but this is some data.
  • Many threads in a process seem to incur a penalty for the syscalls in this benchmark (relative to the same number of processes), but it also seems to impact memset in some way in this test (I don't have a full idea of why this is, or whether it's an issue with my test setup).
  • In most cases, it does look like doing the pagemap scan + zeroing outperforms madvising the full range. The end result of those ops is not identical in terms of resident memory, etc., so it's not an exact mapping to how we use these in-tree.

[Figure: benchmark results, 1MB memory region size]

[Figure: benchmark results, 128K memory region size]

@tschneidereit
Member

@posborne thank you for sharing this detailed analysis! I think the results are excellent validation of the viability of pagemap_scan as an approach to improve performance and, in cases where any of the keep-resident options are used, usually also to reduce memory usage.

There's also one more benefit to the pagemap_scan strategy that isn't reflected in the benchmarks yet, relative to madvise: memory reset manually doesn't incur pagefaults the next time it's written to. Because of that, it might make sense to include the region.make_dirty() call in the measurements.

Otherwise, I guess it might make sense to add an option to set a minimum heap size, under which we'd always just do a memset. I doubt anything much above 64KB would be useful there, though: doing needless memsets not only increases memory usage for CoW-backed pages, but also causes cache churn.

I'd also expect most real-world modules/components not to be this small: anything using a shadow stack is probably going to reserve enough just for that to be too large. (Which also usually means that the overwhelming majority of the current keep-resident budgets will be wasted on the shadow stack, making the real-world impact of moving to pagemap_scan even better if keep-resident is already in use.)

@posborne
Collaborator

@tschneidereit Thank you for the feedback.

There's also one more benefit to the pagemap_scan strategy that isn't reflected in the benchmarks yet, relative to madvise: memory reset manually doesn't incur pagefaults the next time it's written to. Because of that, it might make sense to include the region.make_dirty() call in the measurements.

Yeah, that's a good call; I've had those in and out at different points in time, but the page fault is probably best to measure in all cases. The point at which you pay that penalty is different but more impactful to actual execution (vs. refreshing the pool).

Otherwise, I guess it might make sense to add an option to set a minimum heap size, under which we'd always just do a memset. I doubt anything much above 64KB would be useful there, though: doing needless memsets not only increases memory usage for CoW-backed pages, but also causes cache churn.

I'm still building up my full intuition on what values are likely to be configured in practice; for regions on the smaller side, my initial thought was that tables might be more likely to fall in that range. From out-of-band experience, the extra memsets definitely do cause undesirable cache behavior and cut into memory bandwidth, which we've seen become a problem at a certain scale (not strictly driven by wasmtime, but across the board).

I do think having the option could make sense in order to try to find a value that performs optimally. The other thing I've been considering, which would require some state tracking, would be whether to mark a region as not worth scanning (regardless of size). If 100% of the pages are dirty (or maybe even 80% or some other number) then it may make more sense to just do the full memset and avoid the syscall (for regions where we are OK keeping resident).
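
A rough sketch of both of those ideas together, with entirely hypothetical names and illustrative thresholds: a minimum region size below which we always memset, plus per-slot tracking that stops scanning slots which keep coming back almost fully dirty:

```rust
const ALWAYS_MEMSET_BELOW: usize = 64 * 1024; // e.g. the 64KB floated above
const MOSTLY_DIRTY_PERCENT: usize = 80;

#[derive(Default)]
struct SlotStats {
    scans: u32,
    mostly_dirty_scans: u32,
}

impl SlotStats {
    /// Record the outcome of one scan of this slot.
    fn record(&mut self, dirty_pages: usize, total_pages: usize) {
        self.scans += 1;
        if dirty_pages * 100 >= total_pages * MOSTLY_DIRTY_PERCENT {
            self.mostly_dirty_scans += 1;
        }
    }

    /// Decide whether the next reset should bother with PAGEMAP_SCAN.
    fn worth_scanning(&self, region_len: usize) -> bool {
        if region_len <= ALWAYS_MEMSET_BELOW {
            return false; // small region: just memset, skip the syscall
        }
        // After a few samples, skip slots that are almost always fully
        // dirty; the scan would only add syscall overhead there.
        self.scans < 4
            || (self.mostly_dirty_scans as usize) * 100
                < (self.scans as usize) * MOSTLY_DIRTY_PERCENT
    }
}
```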

bongjunj pushed a commit to prosyslab/wasmtime that referenced this pull request Oct 20, 2025
* WIP: use the pagemap_scan ioctl to selectively reset an instance's dirty pages

* Less hacky, and supporting tables, too

* Bugfixes

* WIP: memcpy instead of pread

* Refactor support for pagemap

* Don't hold a raw pointer to the original data, plumb through an `Arc`
  to have a safe reference instead.
* Update pagemap bindings to latest version of abstraction written.
* Include pagemap-specific tests.
* Use a `PageMap` structure created once-per-pool instead of a
  static-with-a-file.
* Refactor to use the "pagemap path" unconditionally which blends in the
  keep_resident bits.

* Improve safety documentation

prtest:full

* Fix some lints

* Skip ioctl tests when it's not supported

* Fix a memory leak by moving impls around

* Fix no vm build

* Review comments

* Add more pagemap-specific documentation

* Add more docs, refactor implementation slightly

* Improve `category_*` docs

Basically forward to the Linux kernel source itself.

* Fix compile

* Make pagemap integration resilient across forks

* Fix non-pooling-allocator-build

* Fix portability issues of new test

* Actually use config on macos

---------

Co-authored-by: Till Schneidereit <[email protected]>