Cranelift: use SP-offset amodes for `stack_addr`+load/store. #11727

cfallin · 2025-09-21T05:12:26Z

We provide stack_load / stack_store / stack_addr instructions in Cranelift to operate on stack slots, and the first two are legalized to a stack_addr plus an ordinary load or store instruction.
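
As a rough illustration of that legalization step, here is a minimal sketch over a toy IR. The `Inst` enum, the integer `Value` numbering, and the `legalize` function are all hypothetical names made up for this example and are not Cranelift's actual data structures or API; the sketch also assumes the slot offset is carried on the stack_addr and the load/store gets offset 0.

```rust
// Toy IR for illustration only; NOT Cranelift's real types.
type Value = u32;

#[derive(Debug, Clone, PartialEq)]
enum Inst {
    StackLoad { dst: Value, slot: u32, offset: i32 },
    StackStore { val: Value, slot: u32, offset: i32 },
    StackAddr { dst: Value, slot: u32, offset: i32 },
    Load { dst: Value, addr: Value, offset: i32 },
    Store { val: Value, addr: Value, offset: i32 },
}

/// Rewrite every stack_load / stack_store into a stack_addr followed by an
/// ordinary load / store. `next_value` is the first unused value number and
/// is used to name the freshly created address values.
fn legalize(insts: &[Inst], mut next_value: Value) -> Vec<Inst> {
    let mut out = Vec::new();
    for inst in insts {
        match inst.clone() {
            Inst::StackLoad { dst, slot, offset } => {
                let addr = next_value;
                next_value += 1;
                // Assumption: the slot offset lives on the stack_addr.
                out.push(Inst::StackAddr { dst: addr, slot, offset });
                out.push(Inst::Load { dst, addr, offset: 0 });
            }
            Inst::StackStore { val, slot, offset } => {
                let addr = next_value;
                next_value += 1;
                out.push(Inst::StackAddr { dst: addr, slot, offset });
                out.push(Inst::Store { val, addr, offset: 0 });
            }
            // Everything else passes through unchanged.
            other => out.push(other),
        }
    }
    out
}
```

In this toy model, a single stack_store becomes a two-instruction pair, which is the shape the backends then see when lowering.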

We currently have lowerings for stack_addr that materialize an SP-relative address into a register: for example, leaq 8(%rsp), %rax on x86-64 or add x0, sp, #8 on aarch64.

Taken together, we see sequences like (aarch64 / x86-64)

    add x0, sp, #8       /   leaq 8(%rsp), %rax
    str x1, [x0]         /   movq %rdx, (%rax)

when using stack_stores. In particular, we do not use the direct SP-relative form, which would look like

    str x1, [sp, #8]     /   movq %rdx, 8(%rsp)

and which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores.

This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit).

This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
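
To make the idea concrete, here is a minimal sketch of the folding in a toy lowering pass. It is not the ISLE rules from this PR: `Inst`, `MachInst`, `AMode`, `lower`, and the `slot_base` table (a hypothetical SP offset for each slot) are invented for illustration. The sketch folds a stack_addr into an SP-relative address mode when the combined offset fits, and only materializes the address into a register when some use cannot be folded, such as storing the address itself.

```rust
use std::collections::{HashMap, HashSet};

// Toy IR and toy machine instructions; NOT Cranelift's real types.
type Value = u32;

#[derive(Debug, Clone, PartialEq)]
enum Inst {
    StackAddr { dst: Value, slot: u32, offset: i32 },
    Load { dst: Value, addr: Value, offset: i32 },
    Store { val: Value, addr: Value, offset: i32 },
}

#[derive(Debug, Clone, PartialEq)]
enum AMode {
    /// Address held in a register, plus a constant offset.
    RegOffset { base: Value, offset: i32 },
    /// Constant offset from the stack pointer.
    SpOffset { offset: i32 },
}

#[derive(Debug, Clone, PartialEq)]
enum MachInst {
    LoadAddr { dst: Value, amode: AMode }, // like `lea` / `add xN, sp, #imm`
    Load { dst: Value, amode: AMode },
    Store { val: Value, amode: AMode },
}

/// `slot_base[i]` is the (hypothetical) SP offset of stack slot `i`.
fn lower(insts: &[Inst], slot_base: &[i32]) -> Vec<MachInst> {
    // Record the SP-relative offset of every stack_addr result.
    let mut sp_off: HashMap<Value, i32> = HashMap::new();
    for inst in insts {
        if let Inst::StackAddr { dst, slot, offset } = *inst {
            sp_off.insert(dst, slot_base[slot as usize] + offset);
        }
    }
    // A load/store folds if its address is a stack_addr and the sum fits.
    let fold = |addr: Value, offset: i32| -> Option<i32> {
        sp_off.get(&addr).and_then(|base| base.checked_add(offset))
    };
    // Find values that must live in a register after all.
    let mut needs_reg: HashSet<Value> = HashSet::new();
    for inst in insts {
        match *inst {
            Inst::Load { addr, offset, .. } => {
                if fold(addr, offset).is_none() {
                    needs_reg.insert(addr);
                }
            }
            Inst::Store { val, addr, offset } => {
                if fold(addr, offset).is_none() {
                    needs_reg.insert(addr);
                }
                needs_reg.insert(val); // a stored value is always a register
            }
            Inst::StackAddr { .. } => {}
        }
    }
    // Emit machine instructions, skipping stack_addrs nobody needs.
    let mut out = Vec::new();
    for inst in insts {
        match *inst {
            Inst::StackAddr { dst, .. } => {
                if needs_reg.contains(&dst) {
                    let amode = AMode::SpOffset { offset: sp_off[&dst] };
                    out.push(MachInst::LoadAddr { dst, amode });
                }
            }
            Inst::Load { dst, addr, offset } => {
                let amode = match fold(addr, offset) {
                    Some(off) => AMode::SpOffset { offset: off },
                    None => AMode::RegOffset { base: addr, offset },
                };
                out.push(MachInst::Load { dst, amode });
            }
            Inst::Store { val, addr, offset } => {
                let amode = match fold(addr, offset) {
                    Some(off) => AMode::SpOffset { offset: off },
                    None => AMode::RegOffset { base: addr, offset },
                };
                out.push(MachInst::Store { val, amode });
            }
        }
    }
    out
}
```

With a slot at SP offset 8, a stack_addr followed by a store lowers to a single store with an SP-relative amode in this model, corresponding to the `str x1, [sp, #8]` / `movq %rdx, 8(%rsp)` form above.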

Fixes #1064.

github-actions · 2025-09-21T07:44:28Z

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

bjorn3 · 2025-09-22T13:07:31Z

This is a much cleaner implementation than what I did for https://bytecodealliance.zulipchat.com/#narrow/channel/217117-cranelift/topic/stack_addr.20.2B.20load.2Fstore.20merging/with/540466352, while still having the exact same performance on x86_64 (aka cg_clif produces faster executables than llvm -O0) and also working on arm64. This passes the full cg_clif test suite on x86_64.

On arm64 I'm getting a test failure with the jit mode however. There is a call to printf with 0x10000e73c18d0 as address, but the expected string can be found at 0xffffe73c18d0 on the stack (the stack is from 0xfffffffdf000 to 0x1000000000000). You can reproduce this by running ./test.sh after patching the Cargo.toml of cg_clif to use the Cranelift from this PR.
Edit: Never mind. The test failure is unrelated to this PR.
Edit2: #11734 has the fix.

cfallin · 2025-09-24T18:45:06Z

(In case others didn't see email updates from the edits to bjorn3's comment above: the failure came from another regression that cg_clif picked up when upgrading Cranelift; it is unrelated to this PR, which remains ready for review.)

abrown

Makes sense!

cfallin requested review from a team as code owners September 21, 2025 05:12

cfallin requested review from abrown and pchickey and removed request for a team September 21, 2025 05:12

abrown approved these changes Sep 25, 2025

cfallin added this pull request to the merge queue Sep 25, 2025

Merged via the queue into bytecodealliance:main with commit 62dfbd6 Sep 25, 2025
62 checks passed

cfallin deleted the direct-stack-loads-stores branch September 25, 2025 21:32

3 participants
