-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Cranelift: use SP-offset amodes for stack_addr+load/store.
#11727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift: use SP-offset amodes for stack_addr+load/store.
#11727
Conversation
We provide `stack_load`/ `stack_store` / `stack_addr` instructions in Cranelift to operate on stack slots, and the first two are legalized to a `stack_addr` plus an ordinary load or store instruction. We currently have lowerings for `stack_addr` that materialize an SP-relative address into a register: for example, `leaq 8(%rsp), %rax` on x86-64 or `add x0, sp, #8` on aarch64. Taken together, we see sequences like (aarch64 / x86-64) ``` add x0, sp, #8 / leaq 8(%rsp), %rax str x1, [x0] / movq %rdx, (%rax) ``` when using `stack_store`s. In particular, we do *not* use the direct SP-relative form, which would look like ``` str x1, [sp, #8] / movq %rdx, 8(%rsp) ``` and which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores. This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit). This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
Subscribe to Label Action
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
|
This is a much cleaner implementation than what I did for https://bytecodealliance.zulipchat.com/#narrow/channel/217117-cranelift/topic/stack_addr.20.2B.20load.2Fstore.20merging/with/540466352, while still having the exact same performance on x86_64 (aka cg_clif produces faster executables than llvm On arm64 I'm getting a test failure with the jit mode however. There is a call to |
|
(In case others didn't see email updates from edits in bjorn3's comment above: the issue was unrelated from a |
abrown
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
…ealliance#11727) We provide `stack_load`/ `stack_store` / `stack_addr` instructions in Cranelift to operate on stack slots, and the first two are legalized to a `stack_addr` plus an ordinary load or store instruction. We currently have lowerings for `stack_addr` that materialize an SP-relative address into a register: for example, `leaq 8(%rsp), %rax` on x86-64 or `add x0, sp, bytecodealliance#8` on aarch64. Taken together, we see sequences like (aarch64 / x86-64) ``` add x0, sp, bytecodealliance#8 / leaq 8(%rsp), %rax str x1, [x0] / movq %rdx, (%rax) ``` when using `stack_store`s. In particular, we do *not* use the direct SP-relative form, which would look like ``` str x1, [sp, bytecodealliance#8] / movq %rdx, 8(%rsp) ``` and which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores. This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit). This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
We provide
stack_load/stack_store/stack_addrinstructions in Cranelift to operate on stack slots, and the first two are legalized to astack_addrplus an ordinary load or store instruction.We currently have lowerings for
stack_addrthat materialize an SP-relative address into a register: for example,leaq 8(%rsp), %raxon x86-64 oradd x0, sp, #8on aarch64.Taken together, we see sequences like (aarch64 / x86-64)
when using
stack_stores. In particular, we do not use the direct SP-relative form, which would look likeand which we can already generate in other cases, e.g. spillslot moves (spills/reloads) and clobber saves/restores.
This inefficiency is undesirable whenever the embedder is using stackslots, but in particular when we expect to have high memory traffic to stack slots (e.g., I am seeing this now when implementing debug instrumentation in Wasmtime, and user stack map instrumentation for GC will also benefit).
This PR adds new lowerings that use the existing synthetic address mode we already use for spillslots to emit loads/stores to stackslots directly when possible. The PR does this for x86-64 and aarch64; others could be updated later.
Fixes #1064.