@cfallin
Member

@cfallin cfallin commented Sep 21, 2025

We provide stack_load / stack_store / stack_addr instructions in Cranelift to operate on stack slots, and the first two are legalized to a stack_addr plus an ordinary load or store instruction.

We currently have lowerings for stack_addr that materialize an SP-relative address into a register: for example, leaq 8(%rsp), %rax on x86-64 or add x0, sp, #8 on aarch64.

Taken together, we see sequences like (aarch64 / x86-64)

    add x0, sp, #8       /   leaq 8(%rsp), %rax
    str x1, [x0]         /   movq %rdx, (%rax)

when using stack_stores. In particular, we do not use the direct SP-relative form, which would look like

    str x1, [sp, #8]     /   movq %rdx, 8(%rsp)

and which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores.

This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit).

This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
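
To make the difference concrete, here is a minimal toy sketch of the idea in Rust. It is not the actual Cranelift code (the real lowerings are ISLE rules in the x86-64 and aarch64 backends); the `Op` and `MInst` types and the `lower` function are hypothetical stand-ins, and registers are modeled as plain value numbers. With `merge = false` it mimics the lea/add-then-store sequence described above; with `merge = true` it folds a `stack_addr` into the store when the address comes from a stack slot, and only materializes the address in a register if something else still needs it.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-ins for the CLIF operations named above (hypothetical types).
#[derive(Debug, Clone, PartialEq)]
enum Op {
    /// dst = address of SP + sp_offset (models `stack_addr`).
    StackAddr { dst: u32, sp_offset: i32 },
    /// Store value `src` to the address held in value `addr`, plus `offset`.
    Store { src: u32, addr: u32, offset: i32 },
}

/// Toy machine instructions (hypothetical, not real backend types).
#[derive(Debug, Clone, PartialEq)]
enum MInst {
    /// Materialize SP + sp_offset into a register (like `lea` / `add`).
    LoadSpAddr { dst: u32, sp_offset: i32 },
    /// Store through a register base (like `str x1, [x0]`).
    StoreReg { src: u32, base: u32, offset: i32 },
    /// Store directly relative to SP (like `str x1, [sp, #8]`).
    StoreSp { src: u32, sp_offset: i32 },
}

/// Lower a straight-line list of ops. When `merge` is true, a store whose
/// address is produced by a `StackAddr` becomes a single SP-relative store.
fn lower(ops: &[Op], merge: bool) -> Vec<MInst> {
    // Pass 1: record which values are stack addresses, and at what offset.
    let mut sp_offsets: HashMap<u32, i32> = HashMap::new();
    for op in ops {
        if let Op::StackAddr { dst, sp_offset } = op {
            sp_offsets.insert(*dst, *sp_offset);
        }
    }

    // Pass 2: find values that must actually live in a register. A stored
    // value always does; an address does unless it is folded into the store.
    let mut needs_reg: HashSet<u32> = HashSet::new();
    for op in ops {
        if let Op::Store { src, addr, .. } = op {
            needs_reg.insert(*src);
            if !merge || !sp_offsets.contains_key(addr) {
                needs_reg.insert(*addr);
            }
        }
    }

    // Pass 3: emit instructions in program order.
    let mut out = Vec::new();
    for op in ops {
        match op {
            Op::StackAddr { dst, sp_offset } => {
                // Skip the address computation if every use was folded.
                if needs_reg.contains(dst) {
                    out.push(MInst::LoadSpAddr { dst: *dst, sp_offset: *sp_offset });
                }
            }
            Op::Store { src, addr, offset } => match sp_offsets.get(addr) {
                Some(base) if merge => {
                    out.push(MInst::StoreSp { src: *src, sp_offset: base + offset })
                }
                _ => out.push(MInst::StoreReg { src: *src, base: *addr, offset: *offset }),
            },
        }
    }
    out
}
```

For the two-op input `StackAddr { dst: 0, sp_offset: 8 }` followed by `Store { src: 1, addr: 0, offset: 0 }`, the unmerged lowering yields an address computation plus a register-based store, while the merged lowering yields a single `StoreSp { src: 1, sp_offset: 8 }`. If the address value is also stored somewhere as data, the toy still emits the `LoadSpAddr` so that value exists in a register.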

Fixes #1064.

@cfallin cfallin requested review from a team as code owners September 21, 2025 05:12
@cfallin cfallin requested review from abrown and pchickey and removed request for a team September 21, 2025 05:12
@github-actions github-actions bot added the labels cranelift, cranelift:area:machinst, cranelift:area:aarch64, cranelift:area:x64, and isle Sep 21, 2025
@github-actions
Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

  • cfallin: isle
  • fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


@bjorn3
Contributor

bjorn3 commented Sep 22, 2025

This is a much cleaner implementation than what I did for https://bytecodealliance.zulipchat.com/#narrow/channel/217117-cranelift/topic/stack_addr.20.2B.20load.2Fstore.20merging/with/540466352, while still having the exact same performance on x86_64 (aka cg_clif produces faster executables than llvm -O0) and also working on arm64. This passes the full cg_clif test suite on x86_64.

On arm64 I'm getting a test failure with the jit mode however. There is a call to printf with 0x10000e73c18d0 as address, but the expected string can be found at 0xffffe73c18d0 on the stack (the stack is from 0xfffffffdf000 to 0x1000000000000). You can reproduce this by running ./test.sh after patching the Cargo.toml of cg_clif to use the Cranelift from this PR.
Edit: Never mind. The test failure is unrelated to this PR.
Edit2: #11734 has the fix.

@cfallin
Member Author

cfallin commented Sep 24, 2025

(In case others didn't see email updates from the edits to bjorn3's comment above: the failure came from a different regression that cg_clif picked up when upgrading Cranelift. It is unrelated to this PR, which remains ready for review.)

Member

@abrown abrown left a comment

Makes sense!

@cfallin cfallin added this pull request to the merge queue Sep 25, 2025
Merged via the queue into bytecodealliance:main with commit 62dfbd6 Sep 25, 2025
62 checks passed
@cfallin cfallin deleted the direct-stack-loads-stores branch September 25, 2025 21:32
bongjunj pushed a commit to prosyslab/wasmtime that referenced this pull request Oct 20, 2025
…ealliance#11727)
Linked issue: Optimize stack_store and stack_load