
Conversation

@EgorBo (Member) commented Jan 22, 2022

For context: #62302

The idea is to prefer

  movz    x0, #0x40c0
  movk    x0, #0x439 LSL #16
  movk    x0, #1 LSL #32
  blr     x0

over

  bl      MyCall()

for cases where bl MyCall() actually points to a jump-stub. E.g.:
(screenshot)

It seems that in most cases the heap with jitted code ends up quite far away from CoreCLR (the VM), so at the moment all internal calls and helper calls need jump stubs; managed-to-managed calls also seem to be affected.
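To make the distances concrete, here is a minimal C# sketch of the arithmetic (the 0x40c0 / 0x439 / 1 immediates are the ones from the movz/movk sequence above; the call-site address is only an illustrative value of the kind shown later in this thread; the 128MB limit is the reach of bl's signed 26-bit word offset):

using System;

class BranchReachSketch
{
    static void Main()
    {
        // Reassemble the absolute target that movz/movk build piecewise:
        //   movz x0, #0x40c0           -> bits [15:0]
        //   movk x0, #0x439, LSL #16   -> bits [31:16]
        //   movk x0, #1,     LSL #32   -> bits [47:32]
        ulong target = 0x40c0UL | (0x439UL << 16) | (1UL << 32); // 0x1_0439_40C0

        // Illustrative call-site address in jitted code (not a real measurement).
        ulong callSite = 0x2_80FF_B2F8UL;

        // bl encodes a signed 26-bit word offset, i.e. roughly +/-128MB from the call site.
        long delta = (long)target - (long)callSite;
        bool blReaches = Math.Abs(delta) < (128L << 20);

        Console.WriteLine($"target distance: {Math.Abs(delta) / (1024.0 * 1024 * 1024):F1} GB");
        Console.WriteLine($"bl can reach it: {blReaches}"); // false -> jump stub or movz/movk+blr needed
    }
}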

Am I heading in the right direction, @jkotas? I'm trying to address your 3rd point in #62302 (comment).

@ghost added the area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) label Jan 22, 2022
@ghost assigned EgorBo Jan 22, 2022
@ghost commented Jan 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

For context: #62302

The idea is to prefer

        movz    x0, #0x40c0
        movk    x0, #0x439 LSL #16
        movk    x0, #1 LSL #32
        blr     x0

over

        bl      MyCall()

for cases where bl MyCall() actually points to a jump-stub. E.g.:
(screenshot)

This PR tries to do that for:

  1. HELPER calls, though it turns out some of them were already emitted directly, e.g.:

        movz    x14, #0xde10
        movk    x14, #0x1c9d LSL #16
        movk    x14, #1 LSL #32
        bl      CORINFO_HELP_CHECKED_ASSIGN_REF

     but not all of them.
  2. InternalCall (FCall): I assume these can be considered as never rejitted/replaced.
  3. TODO: Tier1 methods

TBD: Should I be careful around delegates, pinvokes, and checks for ReJIT/EnC (I guess that's not supported yet for arm)?
TBD2: Should I introduce a new JIT-EE method instead?

Am I heading in the right direction, @jkotas? I'm trying to address your 3rd point in #62302 (comment).

Author: EgorBo
Assignees: -
Labels: area-CodeGen-coreclr
Milestone: -

@EgorBo (Member Author) commented Jan 22, 2022

Diff example

static float Test(float x) => x % 1 + MathF.Cos(x);

Diff: https://www.diffchecker.com/mNfaCdFj
(needs some tweaks in JitDisasm to display names)

@EgorBo (Member Author) commented Jan 22, 2022

Benefits are smaller than from #63842:

[Benchmark] 
[Arguments(3.14)] 
public double Test(double d) => Math.Cos(d) * Math.Sin(d) * Math.Tan(d);  // 3 InternalCalls
| Method |                                  Toolchain |    d |     Mean |
|------- |------------------------------------------- |----- |---------:|
|   Test |                /Core_Root_baseline/corerun | 3.14 | 28.71 ns |
|   Test |             /Core_Root_direct_call/corerun | 3.14 | 26.10 ns |
|   Test |       /Core_Root_smaller_reservmem/corerun | 3.14 |  9.89 ns |

Core_Root_baseline - Main
Core_Root_direct_call - This PR
Core_Root_smaller_reservmem - #63842

But the benefits from this PR could be higher if we allowed CSE/loop hoisting for method addresses (especially loop hoisting).
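For illustration only (this is a C# analogy, not what the JIT emits for MathF.Cos): the materialized-address form behaves like resolving a function pointer once outside the loop, which is what hoisting would buy, whereas a bl that lands on a jump stub has nothing loop-invariant to hoist.

using System;

static class HoistingAnalogy
{
    // Rough analogy: "materialize the target once, reuse it every iteration".
    public static unsafe float SumCos(ReadOnlySpan<float> xs)
    {
        delegate*<float, float> cos = &MathF.Cos; // resolved once, outside the loop

        float sum = 0;
        for (int i = 0; i < xs.Length; i++)
            sum += cos(xs[i]);                    // per-iteration cost: one indirect call
        return sum;
    }
}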

@jkotas (Member) commented Jan 23, 2022

check for ReJIT/EnC (I guess it's not supported yet for arm)

ReJIT is supported for arm64, EnC is not - but it is planned for .NET 7.

@jkotas (Member) commented Jan 23, 2022

What is the arm64 machine that you are running this benchmark on?

What does the code for COMDouble::Cos look like on arm64 exactly? I assume that it must be doing an indirect call or jump of some sort since COMDouble::Cos lives in libcoreclr.so and cos should live in libc.so.

@EgorBo (Member Author) commented Jan 23, 2022

What is the arm64 machine that you are running this benchmark on?

What does the code for COMDouble::Cos look like on arm64 exactly? I assume that it must be doing an indirect call or jump of some sort since COMDouble::Cos lives in libcoreclr.so and cos should live in libc.so.

I am using an Apple M1 Mac mini, so I assume cos lives in libSystem.dylib.
And yeah, COMDouble::Cos just jumps to cos.

@EgorBo (Member Author) commented Jan 23, 2022

Addresses (example):

VM's COMPlus::Single -  0x00000001034A7FC4
libSystem.dylib's cos - 0x00000001C5981D90
location =              0x0000000280FFB2F8
FitsIn24B(COMPlus::Single - location): false  --- distance between location and COMPlus::Single is ~6GB
jump-stub is needed: true

g_preferredRangeMin -   0x0000000280000000
g_preferredRangeMax -   0x00000002FFFF0000

If I apply #63842 then the numbers are:

VM's COMPlus::Single -  0x000000010189BFC4 (still a bit far from libSystem.dylib)
libSystem.dylib's cos - 0x00000001C5981D90 (exactly the same, obviously)
location =              0x00000001042DB0B8
FitsIn24B(COMPlus::Single - location): true  --- distance between location and COMPlus::Single is ~44MB
jump-stub is needed: false

Where location, from my understanding, is the address in memory of the native code for my MyCos method:

float MyCos(float x) => MathF.Cos(x);

i.e. when ^ is compiled, the JIT calls recordRelocation(location, target) where target is COMPlus::Single; recordRelocation triggers jump-stub creation if needed.
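As a sanity check on the numbers above (the addresses are the ones from the two runs; 128MB is the relative-branch reach; this is just arithmetic, not VM code):

using System;

class DistanceSanityCheck
{
    static void Main()
    {
        const long Reach = 128L << 20; // +/-128MB relative-branch range

        // Default layout (first run above): location vs COMPlus::Single.
        long d1 = 0x0000000280FFB2F8 - 0x00000001034A7FC4;
        Console.WriteLine($"{d1 / (1024.0 * 1024 * 1024):F1} GB apart, fits in rel28: {Math.Abs(d1) < Reach}");

        // With #63842 applied (second run above).
        long d2 = 0x00000001042DB0B8 - 0x000000010189BFC4;
        Console.WriteLine($"{d2 / (1024.0 * 1024):F1} MB apart, fits in rel28: {Math.Abs(d2) < Reach}");
    }
}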

// guess whether the native instruction (e.g. br) will be able to reach this target or not
// within 128Mb (1Mb is subtracted because we don't know the exact code size at this point)
// NOTE: it's still just a hint
if (FitsInRel28(abs((INT64)m_CodeHeader->GetCodeStartAddress() - (INT64)target) + 1024 * 1024))
@EgorBo (Member Author) commented on the diff:

Ok, it's not correct to use m_CodeHeader->GetCodeStartAddress() here.
What I need here is an approximate location where the currently jitted method will be allocated in memory, is it possible?

Or at the very least I need the location of coreclr in memory, so I can return IMAGE_REL_ARM64_BRANCH26 for everything except coreclr and all calls to coreclr will use direct calls.

Reply from a Member:

What I need here is an approximate location where currently jitted method will be allocated in memory, is it possible?

It is certainly possible, but it is not a one-line change with the current implementation of the code manager. I think we would need to refactor the implementation of the code manager and how the JIT interacts with it to make this possible.

@EgorBo (Member Author) replied Jan 23, 2022:

I assume x64 will also benefit from this; the current getRelocTypeHint impl is not accurate for it either, and most likely leads to jump stubs or to "rejit with relocs off" (where relocs are turned off globally for all future methods) in some cases.

@jkotas (Member) commented Jan 23, 2022

And yeah, COMDouble::Cos just jumps to cos

How does it "just jump"? It needs some sort of indirect jump. It would be useful to know the exact instructions used to perform the long jumps (similar to what we need). I assume this sequence is going to be the most efficient sequence and we should be using the same one.

@EgorBo (Member Author) commented Jan 23, 2022

And yeah, COMDouble::Cos just jumps to cos

How does it "just jump"? It needs some sort of indirect jump. It would be useful to know the exact instructions used to perform the long jumps (similar to what we need). I assume this sequence is going to be the most efficient sequence and we should be using the same one.

@jkotas here is the result of objdump -d libcoreclr.dylib:


00000000003baa40 <__ZN9COMSingle3CosEf>:
  3baa40: 6e 97 01 14  	b	0x4207f8 <__PWTB_ArgumentRegister_FirstArg+0x420708>

I am not sure I understand how to read it, I thought b could only use pc register (current instruction)

@EgorBo (Member Author) commented Jan 23, 2022

@jkotas I guess we can't do the same because the native code for managed methods is allocated on a heap subject to ASLR, and all the addresses are quite large.

Hm, or maybe I am wrong, checking..

@EgorBo (Member Author) commented Jan 23, 2022

So with the default settings we have, for example, these addresses:

 4 350 181 316 - address of a function in CoreCLR
10 737 418 240 - g_preferredRangeMin
10 754 175 736 - address of my C# function
12 884 836 352 - g_preferredRangeMax

as you can see, the heap with compiled managed code is quite far away from CoreCLR; the distance is ~6GB

but with the #63842 change it suddenly changes to

 4 320 772 036 - address of a function in CoreCLR
 4 365 070 520 - address of my C# function

I didn't save g_preferredRangeMin and g_preferredRangeMax for that session, but I assume they're ~4 352 000 000 and ~4 480 000 000. I don't understand how the distance between CoreCLR and my managed code decreased from 6GB to 44MB 🤔

@jkotas (Member) commented Jan 23, 2022

I don't understand how the distance between CoreCLR and my managed code decreased from 6GB to 44Mb

The address space randomization will give you addresses that are far from each other some of the time. You may be able to observe some patterns in ASLR behavior for a given program on a given OS version, but it is very fragile to generalize from the observed patterns.

@jkotas (Member) commented Jan 23, 2022

I am not sure I understand how to read it, I thought b could only use pc register (current instruction)

I think it means that this branch points to some sort of jump stub. It does not go directly to the target.
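Decoding the four bytes from the objdump output above supports that reading: it is a plain PC-relative b whose signed 26-bit word offset (so at most ±128MB of reach) lands on an entry in the __stubs section. A quick sketch of the decode (instruction bytes and addresses taken from the dump; the rest is illustrative):

using System;

class DecodeBranch
{
    static void Main()
    {
        uint insn = 0x1401976e; // bytes "6e 97 01 14" read as a little-endian word
        long pc = 0x3baa40;     // address of the instruction in COMSingle::Cos

        bool isB = (insn >> 26) == 0b000101;            // top 6 bits: unconditional B
        int imm26 = (int)(insn & 0x03FF_FFFF);          // signed 26-bit word offset
        long offset = (long)((imm26 << 6) >> 6) * 4;    // sign-extend, scale by 4

        Console.WriteLine($"is b:   {isB}");             // True
        Console.WriteLine($"target: 0x{pc + offset:x}"); // 0x4207f8 -> the __stubs entry
    }
}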

@EgorBo (Member Author) commented Jan 23, 2022

I don't understand how the distance between CoreCLR and my managed code decreased from 6GB to 44Mb

The address space randomization will give you addresses that are far from each other, some of the time. You may be able to observe some patterns in ASLR behavior for given program on given OS version, but it is very fragile to generalize the observed patterns.

That is my impression as well, but it reproduces quite stably 🙂

I am not sure I understand how to read it, I thought b could only use pc register (current instruction)

I think it means that this branch points to some sort of jump stub. It does not go directly to the target.

Yeah it seems it's pointing to:


Disassembly of section __TEXT,__stubs:

00000000004204d4 <__stubs>:
  ...
  4207e0: 1f 20 03 d5  	nop
  4207e4: 30 d1 63 58  	ldr	x16, 0x4e8208 <__PWTB_ArgumentRegister_FirstArg+0x4e8118>
  4207e8: 00 02 1f d6  	br	x16
  4207ec: 1f 20 03 d5  	nop
  4207f0: 10 d1 63 58  	ldr	x16, 0x4e8210 <__PWTB_ArgumentRegister_FirstArg+0x4e8120>
  4207f4: 00 02 1f d6  	br	x16
  4207f8: 1f 20 03 d5  	nop <--- here
  4207fc: f0 d0 63 58  	ldr	x16, 0x4e8218 <__PWTB_ArgumentRegister_FirstArg+0x4e8128>
  420800: 00 02 1f d6  	br	x16
  ....

@jkotas (Member) commented Jan 23, 2022

  4207f8: 1f 20 03 d5  	nop <--- here
  4207fc: f0 d0 63 58  	ldr	x16, 0x4e8218 <__PWTB_ArgumentRegister_FirstArg+0x4e8128>
  420800: 00 02 1f d6  	br	x16

These instructions look identical to the jump stubs that we are using. Why are the jump stubs that we are using so much slower?

@tannergooding (Member)

Why are the jump stubs that we are using so much slower?

My interpretation of the issue wasn't that our jump stubs are slower, but rather that we are doing jump stubs on top of jump stubs.

And so while in C/C++ you may have Method -> cos_jmpstub -> cos_impl

we have Method -> math.cos_jmpstub -> comdouble.cos_jmpstub -> cos_jmpstub -> cos_impl; and on Unix it may even be worse because we sometimes have special wrappers to fixup edge case handling, so you end up with ... -> comdouble.cos_jmpstub -> wrapper -> cos_jmpstub -> ...

We are basically turning what is normally a single or double branch into a 4-5 level branch (and possibly an actual call-stack increase).

@jkotas (Member) commented Jan 23, 2022

My interpretation of the issue wasn't that our jump stubs are slower; but rather we are doing jump stubs on top of jump stubs.

It would be certainly nice to have this more streamlined.

However, the numbers from #64148 (comment) are comparing:

Core_Root_baseline: Method -> math.cos_jmpstub -> comdouble.cos_jmpstub -> cos_jmpstub -> cos_impl
vs
Core_Root_smaller_reservmem: Method -> comdouble.cos_jmpstub -> cos_jmpstub -> cos_impl

The extra math.cos_jmpstub makes the whole thing almost 3x slower. Why is that?

@jkotas (Member) commented Jan 23, 2022

It would be certainly nice to have this more streamlined.

#39474 tried to do that for math functions, but ran into problems with getting them recognized as JIT intrinsics at the same time.

@tannergooding (Member)

The extra math.cos_jmpstub makes the whole thing almost 3x slower. Why is that?

I'd guess the combination of randomized addresses + multiple indirections isn't handled well by the CPU or instruction cache, particularly since it's not a "common" pattern in native code.

Agner's microarchitecture guide (https://www.agner.org/optimize/microarchitecture.pdf) and the Intel/AMD/ARM optimization manuals all touch on how decoding works and cover that code is fetched in 16- or 32-byte aligned windows. Unconditional branches can still impact the branch prediction tables, and the microcode cache only goes so far.

@EgorBo (Member Author) commented Jan 23, 2022

However, the numbers from #64148 (comment) are comparing:

Core_Root_baseline: Method -> math.cos_jmpstub -> comdouble.cos_jmpstub -> cos_jmpstub -> cos_impl vs Core_Root_smaller_reservmem: Method -> comdouble.cos_jmpstub -> cos_jmpstub -> cos_impl

The extra math.cos_jmpstub makes the whole thing almost 3x slower. Why is that?

I was also surprised; a 1-2ns difference, as in this PR, makes perfect sense, but a difference as large as with Core_Root_smaller_reservmem doesn't. I can only blame the branch predictor.

I'll rerun the benchmark and try to profile it

@jkotas (Member) commented Jan 23, 2022

You can also try some other cases, like regular managed method to managed method calls. There may be something special about Math.Cos.

@EgorBo (Member Author) commented Jan 23, 2022

Hm, CORINFO_HELP_CHECKED_ASSIGN_REF is always fine (close to managed code) - is it dynamically generated/compiled, or is it in clrgc.dll?

CORINFO_HELP_NEWSFAST uses a jump stub.

You can also try some other cases, like regular managed method to managed method calls.

Do they use the same emitBackToBackJump? It looks like they don't, and I can't find exactly which VM code is responsible for them.

@jkotas (Member) commented Jan 23, 2022

Do they use the same emitBackToBackJump?

Yes. To experiment with jump stubs, you can comment out the FitsInRel28 checks in CEEJitInfo::recordRelocation to force them to be created even when they are not required.

@EgorBo (Member Author) commented Jan 23, 2022

Benchmarks:

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Test
{
    public static void Main(string[] args) => 
        BenchmarkSwitcher.FromAssembly(typeof(Test).Assembly).Run(args);

    [Benchmark]
    public int CurrentManagedThreadId() => Environment.CurrentManagedThreadId;  // 1 InternalCall


    [Benchmark]
    [Arguments(3.14f)]
    public float MyCos(float x) => MathF.Cos(x); // 1 InternalCall


    [Benchmark]
    [Arguments(3.14f)]
    public float MyCosSinTan(float x) => MathF.Cos(x) + MathF.Sin(x) + MathF.Tan(x); // 3 InternalCalls


    [Benchmark]
    public object Allocate() =>
        new object[] { new(), new() }; // CORINFO_HELP_NEWARR_1_OBJ and two CORINFO_HELP_NEWSFAST


    [Benchmark]
    public void CallManagedMethod() // managed to managed
    {
        TestCallee();
        TestCallee();
        TestCallee();
        TestCallee();
    }


    [Benchmark]
    public void CallManagedMethodLoop() // managed to managed in a loop
    {
        for (int i = 0; i < 100; i++)
        {
            // NOTE: method address is not hoisted
            TestCallee();
            TestCallee();
            TestCallee();
            TestCallee();
        }
    }


    [MethodImpl(MethodImplOptions.NoInlining)]
    void TestCallee() {}
}

I've built 4 Core_Roots (Release-osx-arm64):

Core_Root_baseline -- Current Main
Core_Root_baseline_always_emit_jumpstub -- Current Main, but `recordRelocation` triggers jumpstub for all inputs
Core_Root_baseline_smaller_exec_mem -- Current Main, but with smaller MaxExecutableMemorySize (#63842)
Core_Root_this_pr -- This PR

Here are the results:


|                 Method |                                        Toolchain |    x | obj |        Mean |     Error |    StdDev |      Median | Ratio |
|----------------------- |------------------------------------------------- |----- |---- |------------:|----------:|----------:|------------:|------:|
| CurrentManagedThreadId |                      /Core_Root_baseline/corerun |    ? |   ? |   1.0808 ns | 0.0017 ns | 0.0016 ns |   1.0806 ns |  1.00 |
| CurrentManagedThreadId | /Core_Root_baseline_always_emit_jumpstub/corerun |    ? |   ? |   1.0786 ns | 0.0094 ns | 0.0084 ns |   1.0748 ns |  1.00 |
| CurrentManagedThreadId |     /Core_Root_baseline_smaller_exec_mem/corerun |    ? |   ? |   0.7609 ns | 0.0165 ns | 0.0154 ns |   0.7502 ns |  0.70 |
| CurrentManagedThreadId |                       /Core_Root_this_pr/corerun |    ? |   ? |   0.7758 ns | 0.0164 ns | 0.0154 ns |   0.7846 ns |  0.72 |
|                        |                                                  |      |     |             |           |           |             |       |
|               Allocate |                      /Core_Root_baseline/corerun |    ? |   ? |  12.7096 ns | 0.0196 ns | 0.0164 ns |  12.7135 ns |  1.00 |
|               Allocate | /Core_Root_baseline_always_emit_jumpstub/corerun |    ? |   ? |  13.3513 ns | 0.0526 ns | 0.0492 ns |  13.3618 ns |  1.05 |
|               Allocate |     /Core_Root_baseline_smaller_exec_mem/corerun |    ? |   ? |  11.7224 ns | 0.0606 ns | 0.0567 ns |  11.7423 ns |  0.92 |
|               Allocate |                       /Core_Root_this_pr/corerun |    ? |   ? |  27.8916 ns | 0.0751 ns | 0.0702 ns |  27.9253 ns |  2.20 |
|                        |                                                  |      |     |             |           |           |             |       |
|                  MyCos |                      /Core_Root_baseline/corerun | 3.14 |   ? |   2.0633 ns | 0.0277 ns | 0.0259 ns |   2.0761 ns |  1.00 |
|                  MyCos | /Core_Root_baseline_always_emit_jumpstub/corerun | 3.14 |   ? |   2.0545 ns | 0.0309 ns | 0.0289 ns |   2.0757 ns |  1.00 |
|                  MyCos |     /Core_Root_baseline_smaller_exec_mem/corerun | 3.14 |   ? |   1.7512 ns | 0.0032 ns | 0.0027 ns |   1.7502 ns |  0.85 |
|                  MyCos |                       /Core_Root_this_pr/corerun | 3.14 |   ? |   2.0026 ns | 0.0282 ns | 0.0264 ns |   1.9879 ns |  0.97 |
|                        |                                                  |      |     |             |           |           |             |       |
|            MyCosSinTan |                      /Core_Root_baseline/corerun | 3.14 |   ? |  26.2644 ns | 0.5449 ns | 1.2298 ns |  26.8260 ns |  1.00 |
|            MyCosSinTan | /Core_Root_baseline_always_emit_jumpstub/corerun | 3.14 |   ? |  26.1377 ns | 0.5128 ns | 0.4797 ns |  26.5059 ns |  1.01 |
|            MyCosSinTan |     /Core_Root_baseline_smaller_exec_mem/corerun | 3.14 |   ? |   7.3690 ns | 0.0812 ns | 0.0760 ns |   7.3193 ns |  0.29 |
|            MyCosSinTan |                       /Core_Root_this_pr/corerun | 3.14 |   ? |  25.2790 ns | 0.2429 ns | 0.2272 ns |  25.3527 ns |  0.98 |
|                        |                                                  |      |     |             |           |           |             |       |
|                        |                                                  |      |     |             |           |           |             |       |
|      CallManagedMethod |                      /Core_Root_baseline/corerun |    ? |   ? |   3.2936 ns | 0.0020 ns | 0.0019 ns |   3.2939 ns |  1.00 |
|      CallManagedMethod | /Core_Root_baseline_always_emit_jumpstub/corerun |    ? |   ? |   9.4414 ns | 0.0038 ns | 0.0036 ns |   9.4403 ns |  2.87 |
|                        |                                                  |      |     |             |           |           |             |       |
|  CallManagedMethodLoop |                      /Core_Root_baseline/corerun |    ? |   ? | 465.1709 ns | 0.1117 ns | 0.0932 ns | 465.1746 ns |  1.00 |
|  CallManagedMethodLoop | /Core_Root_baseline_always_emit_jumpstub/corerun |    ? |   ? | 549.8527 ns | 0.2164 ns | 0.2024 ns | 549.8142 ns |  1.18 |

Notes:

  1. Core_Root_this_pr regresses when the inaccurate getRelocHint tells the JIT that an address is not within reachable distance, so the JIT emits a direct address for something that could have been reached via a relative call. It also sometimes claims that something is reachable when it's not.
  2. Core_Root_this_pr shows a ~0.5-1ns improvement when getRelocHint is correct.
  3. Core_Root_baseline_smaller_exec_mem does indeed improve MyCosSinTan that much; I repeated the benchmark 3 times 😮. A single Cos call is within expected results, so I'll see which Math call causes that difference exactly. UPD: none of them individually; only a combination of them leads to that difference.
  4. The jump stub for managed-to-managed calls seems to be noticeable too.

Am I correct that the jump stub for managed-to-managed calls actually jumps to another "jump stub" that is used so the method can be replaced (PRECODE fixup)?

@jkotas (Member) commented Jan 23, 2022

Am I correct that the jump-stub for Managed-to-Managed also actually jumps to another "jump stub" that was used to replace method?

Yes, it is to support ReJIT and tiered compilation.

@EgorBo (Member Author) commented Jan 27, 2022

Related: it'd be nice if we could use the initial heap only for optimized code and allocate all unoptimized code in a separate one.

@ghost commented Feb 28, 2022

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost locked as resolved and limited the conversation to collaborators Mar 30, 2022
This pull request was closed.