src/coreclr/pal/src/include/pal/virtual.h (11 changes: 9 additions & 2 deletions)
@@ -180,17 +180,24 @@ class ExecutableMemoryAllocator
int32_t GenerateRandomStartOffset();

private:

// There does not seem to be an easy way to find the size of a library on Unix,
// so this constant is an approximation of the libcoreclr size (in a debug build)
// that can be used to calculate an approximate location of memory within 2GB of
// the coreclr library. A precise size of libcoreclr is not necessary for these
// calculations.
- static const int32_t CoreClrLibrarySize = 100 * 1024 * 1024;
+ static const int32_t CoreClrLibrarySize = 32 * 1024 * 1024;

+ #ifdef TARGET_XARCH
// This constant represents the max size of the virtual memory that this allocator
// will try to reserve during initialization. We want all JIT-ed code and the
// entire libcoreclr to be located within a 2GB range.
- static const int32_t MaxExecutableMemorySize = 0x7FFF0000;
+ static const int32_t MaxExecutableMemorySize = 0x7FFF0000; // 2GB - 64KB
+ #else
+ // Smaller size for ARM64, where relative calls/jumps only reach targets within +/-128MB
+ static const int32_t MaxExecutableMemorySize = 0x7FF0000; // 128MB - 64KB
Member:
How much of this range gets consumed in a real-world ASP.NET app?

Note that we map all sorts of stuff into this range, including R2R images. I would expect that 128MB gets exhausted fairly quickly given how things work today.

I agree that this fix works well for micro-benchmarks that are very unlikely to exhaust the 128MB range.

@EgorBo (Member Author), Jan 16, 2022:

Will try to investigate. I guess the idea is that hot (tier1) code is better off closer to the VM's FCalls by default?
For large apps I guess we still want to address your suggestion of emitting direct addresses without jump-stubs in tier1 #62302 (I have a draft).

For R2R-only, can it be improved by PGO + method sorting?

Member:

I do not think R2R code today benefits from being close to coreclr as all 'external' calls/branches have to go through indirection cells anyway. This may have been different in the days of fragile ngen?
Please correct me if I'm wrong @jkotas.
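
(Editorial aside: a minimal sketch of what "going through an indirection cell" means; the names here are made up and R2R's actual data structures are more involved. The call site loads the callee's address from a data slot instead of encoding it in the instruction, so the callee can live anywhere in the address space.)

```cpp
#include <cstdio>

static void Callee() { std::puts("callee"); }

// Hypothetical "indirection cell": a patchable data slot holding the callee
// address. The call site compiles to a load + indirect call through the slot,
// so there is no +/-2GB (or +/-128MB) distance constraint on the callee.
static void (*g_indirectionCell)() = &Callee;

int main()
{
    g_indirectionCell(); // call through the cell, not a direct rel32 branch
    return 0;
}
```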

Member:

> For large apps I guess we still want to address your suggestion of emitting direct addresses without jump-stubs

I do not think it is just for large apps. With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it will introduce a similar bottleneck to the one you have identified.

> For R2R-only, can it be improved by PGO + method sorting?

R2R images are generally smaller than 128MB, and you can only sort within an image, so sorting won't help with jump stubs. (Sorting within an image is still good for locality.)

Also, once we get this all fixed, we may want to look at retuning the inliner. My feeling is that the inliner expands the code too much these days. Some of it may be just compensating for the extra method call overhead that we are paying today.
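
(Editorial aside: a minimal sketch, assuming a 64-bit build, of the reachability check behind jump stubs; `FitsInRel32` is a made-up helper, not a runtime API. An x64 direct call/jump encodes a signed 32-bit displacement, so a target more than about 2GB away forces the runtime to route the branch through a jump stub, roughly `mov rax, target; jmp rax` on x64, which is the indirect-branch overhead discussed above.)

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: does `target` fit in the signed 32-bit displacement
// of an x64 direct call/jump, measured from the end of the instruction?
static bool FitsInRel32(uintptr_t instrEnd, uintptr_t target)
{
    intptr_t delta = static_cast<intptr_t>(target - instrEnd);
    return delta == static_cast<intptr_t>(static_cast<int32_t>(delta));
}

int main()
{
    assert(FitsInRel32(0x10000000, 0x10001000));   // nearby target: direct branch is fine
    assert(!FitsInRel32(0x10000000, 0x200000000)); // >2GB away: needs a jump stub
    return 0;
}
```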

Member:

> I do not think R2R code today benefits from being close to coreclr as all 'external' calls/branches have to go through indirection cells anyway.

Calls from runtime generated stubs and JITed code to R2R code still benefit from the two being close.

Member:

> JITed code to R2R code

Do these not go through an indirection when tiering is enabled?

Member:

> Do these not go through an indirection when tiering is enabled?

Yes - when tiering is enabled. No - when tiering is disabled.

@EgorBo (Member Author):

> With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as a jump stub, so it will introduce a similar bottleneck to the one you have identified.

I'll start with your suggestion to emit direct calls for "T1 caller calls T1 callee" (not as part of this PR).

+ #endif

static const int32_t MaxExecutableMemorySizeNearCoreClr = MaxExecutableMemorySize - CoreClrLibrarySize;

// Start address of the reserved virtual address space
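
(Editorial aside: a minimal sketch of the arithmetic behind these constants, using a made-up load address. Subtracting the library-size estimate guarantees that a reservation placed right next to libcoreclr, plus the library itself, still fits within branch reach; the ARM64 limit follows from the +/-128MB reach of B/BL, a signed 26-bit immediate scaled by 4 bytes.)

```cpp
#include <cassert>
#include <cstdint>

int main()
{
    // Constants from the diff above.
    const int64_t coreClrLibrarySize = 32LL * 1024 * 1024; // libcoreclr size estimate
    const int64_t maxExecMemX64      = 0x7FFF0000;         // 2GB - 64KB (x64 rel32 reach)
    const int64_t maxExecMemArm64    = 0x7FF0000;          // 128MB - 64KB (ARM64 B/BL reach)

    // ARM64 B/BL: signed 26-bit immediate * 4 bytes => +/-128MB reach.
    assert((1LL << 25) * 4 == 128LL * 1024 * 1024);
    assert(maxExecMemArm64 < (1LL << 25) * 4);

    // Hypothetical layout: the reservation sits immediately below libcoreclr.
    const int64_t nearCoreClr      = maxExecMemX64 - coreClrLibrarySize;
    const int64_t libBase          = 0x7f0000000000LL;     // made-up load address
    const int64_t reservationStart = libBase - nearCoreClr;
    const int64_t libEnd           = libBase + coreClrLibrarySize;

    // The lowest JIT-ed code and the far end of libcoreclr stay within direct
    // branch reach of each other; this is why CoreClrLibrarySize is subtracted
    // when computing MaxExecutableMemorySizeNearCoreClr.
    assert(libEnd - reservationStart <= maxExecMemX64);
    return 0;
}
```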