@gbalykov
Member

@gbalykov gbalykov commented Jun 4, 2020

With the --out-near-input and --single-file-compilation options, multiple files can be compiled in one invocation of crossgen2. This removes the per-process startup overhead when many files have to be compiled anyway.

x64, f95b2b2, release build, 100 measurements for each case (in single-file compilation mode, all 100 copies of the file are passed in one command):

./corerun `pwd`/crossgen2/crossgen2.dll /tmp/crossgen2.dll -O -r:`pwd`/*  --out-near-input --single-file-compilation

default number of threads

Basically, the result is a constant ~0.3s diff per dll:

before:
crossgen2.dll: 0.80555s
System.Private.CoreLib.dll: 3.15564s

after:
crossgen2.dll: 0.54847s (-31.9%, -0.26s)
System.Private.CoreLib.dll: 2.80951s (-11.0%, -0.35s) 

1 thread

before:
crossgen2.dll: 0.88498s
System.Private.CoreLib.dll: 8.32537s

after:
crossgen2.dll: 0.53211s (-39.9%, -0.35s)
System.Private.CoreLib.dll: 7.74492s (-7.0%, -0.58s) 
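
The percentages and absolute savings above can be re-derived from the raw timings. The following is a minimal sketch in Python; the `delta` helper and the `measurements` dictionary are illustrative names, not part of the PR, and the numbers are copied from the before/after lists above.

```python
# Timings in seconds, copied from the before/after lists above.
# Keys are (thread setting, dll); values are (before, after).
measurements = {
    ("default", "crossgen2.dll"): (0.80555, 0.54847),
    ("default", "System.Private.CoreLib.dll"): (3.15564, 2.80951),
    ("1 thread", "crossgen2.dll"): (0.88498, 0.53211),
    ("1 thread", "System.Private.CoreLib.dll"): (8.32537, 7.74492),
}


def delta(before, after):
    """Return (seconds saved, relative change in percent; negative means faster)."""
    saved = before - after
    return saved, -100.0 * saved / before


for (threads, dll), (before, after) in measurements.items():
    saved, pct = delta(before, after)
    print(f"{threads:>8}  {dll:<28} {pct:+.1f}%  (-{saved:.2f}s)")
```

The absolute saving stays in the 0.26s to 0.58s range while the relative saving shrinks as the dll gets larger, which is what one would expect if the change mainly removes a fixed startup cost.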

cc @alpencolt

@MichalStrehovsky
Member

Did you run crossgen on crossgen2 itself before doing the measurements? Without R2R code, a significant portion of time will be spent JITting the compiler.

Compilers (clang, Roslyn, VC++, you name it) don't offer multiple input/multiple output modes because they're not useful when integrating into build systems. Multiple-input/multiple-output is a test hook. We place test hooks in the src\coreclr\src\tools\r2rtest runner so that we don't have test hooks in the shipping compiler. r2rtest already has modes to compile all files in a directory.

We could potentially add a new launch option that doesn't create a new compilation process, but does an Assembly.Load to load the crossgen2 entrypoint assembly if this is really important. Once the assembly is loaded it can do asm.EntryPoint.Invoke(...) to invoke the entrypoint in the context of the current process.

Note that the crossgen2 compiler has known "memory leaks" when run in such a mode and will eventually run out of memory (there are static fields that cache state that is only relevant to a single compilation and is never released until the process dies).
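
To illustrate the kind of leak described above, here is a small Python analogy. It is not crossgen2 code: `LeakyCompiler`, `ScopedCompiler` and the cache layout are hypothetical, and a class-level dictionary stands in for a C# static field. It only shows why process-lifetime caches grow when several compilations run in the same process.

```python
class LeakyCompiler:
    # Class-level ("static") cache: shared by every compilation in the process
    # and never cleared, so it grows with each compiled assembly.
    _type_cache = {}

    def compile(self, assembly, type_names):
        for name in type_names:
            LeakyCompiler._type_cache[(assembly, name)] = object()
        return len(LeakyCompiler._type_cache)  # entries still alive afterwards


class ScopedCompiler:
    def compile(self, assembly, type_names):
        # Per-compilation cache: unreachable (and collectable) after return.
        cache = {(assembly, name): object() for name in type_names}
        return len(cache)


leaky, scoped = LeakyCompiler(), ScopedCompiler()
for i in range(3):
    asm = f"lib{i}.dll"
    print(asm, leaky.compile(asm, ["A", "B"]), scoped.compile(asm, ["A", "B"]))
```

With one process per input file the first pattern is harmless, because the process exit releases everything. It only becomes a problem once many inputs share a process.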

Cc @dotnet/crossgen-contrib

@gbalykov
Member Author

gbalykov commented Jun 4, 2020

Yes, I forgot to mention that: all system libs and crossgen2 libs are compiled with the first crossgen in r2r mode. However, the runtime still spends a lot of time on jitting, even with r2r images. The change in this PR mitigates the startup overhead of jitting and loading crossgen2.

Our concern is that on small dlls crossgen2 perf is much worse than the first crossgen (for example, by more than 1100% for crossgen2.dll on x64). For large dlls like SPC.dll crossgen2 is 25% faster on x64, but this is achieved with 16 threads; with 1 thread crossgen2 is 2x slower.

On arm devices with 2 cpus, however, we can't launch so many threads, so crossgen2 is slower than first crossgen even on SPC.dll.

Here's some data:

x64

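| dll | threads | crossgen, s | crossgen2, s | diff (%) |
|---|---|---|---|---|
| crossgen2.dll | default | 0.06508 | 0.80555 | +1138 |
| | 1 | 0.06508 | 0.88498 | +1260 |
| | 2 | | 0.85827 | +1219 |
| | 4 | | 0.81757 | +1172 |
| | 8 | | 0.8094 | +1144 |
| | 16 | | 0.80782 | +1141 |
| | 32 | | 0.80494 | +1137 |
| System.Private.CoreLib.dll | default | 4.18926 | 3.15564 | -24.7 |
| | 1 | 4.18926 | 8.32537 | +98.7 |
| | 2 | | 5.55454 | +32.6 |
| | 4 | | 4.14622 | -1.0 |
| | 8 | | 3.3941 | -19.0 |
| | 16 | | 3.12814 | -25.3 |
| | 32 | | 3.17038 | -24.3 |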

armel

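| dll | threads | crossgen, s | crossgen2, s | diff (%) |
|---|---|---|---|---|
| crossgen2.dll | default | 0.56884 | 7.10501 | +1149 |
| | 1 | 0.56884 | 7.21069 | +1167 |
| | 2 | | 6.94721 | +1121 |
| | 4 | | 6.88047 | +1110 |
| | 8 | | 6.80611 | +1096 |
| System.Private.CoreLib.dll | default | 43.1077 | | |
| | 1 | 43.1077 | 88.631 | +105.6 |
| | 2 | | 61.0679 | +41.7 |
| | 4 | | 61.4447 | +42.5 |
| | 8 | | 61.9229 | +43.6 |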
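
For readers checking the diff(%) column: it appears to be the crossgen2 time relative to the crossgen time, minus one, as a percentage. The sketch below recomputes a few rows under that assumption. `diff_pct` and `rows` are illustrative names. For the 16-thread row the crossgen cell is empty in the table, so the single crossgen baseline for that dll is assumed.

```python
def diff_pct(crossgen_s, crossgen2_s):
    """Slowdown (+) or speedup (-) of crossgen2 relative to crossgen, in percent."""
    return (crossgen2_s / crossgen_s - 1.0) * 100.0


# (platform, dll, threads) -> (crossgen seconds, crossgen2 seconds), from the tables.
rows = {
    ("x64", "crossgen2.dll", "default"): (0.06508, 0.80555),
    ("x64", "System.Private.CoreLib.dll", "1"): (4.18926, 8.32537),
    # Assumption: crossgen baseline reused for rows where its cell is blank.
    ("x64", "System.Private.CoreLib.dll", "16"): (4.18926, 3.12814),
    ("armel", "crossgen2.dll", "default"): (0.56884, 7.10501),
    ("armel", "System.Private.CoreLib.dll", "1"): (43.1077, 88.631),
}

for (platform, dll, threads), (old, new) in rows.items():
    print(f"{platform:<6} {dll:<28} threads={threads:<8} {diff_pct(old, new):+.1f}%")
```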

@davidwrighton
Member

@gbalykov, thank you for looking deeper into this. I agree that it is quite concerning how slow crossgen2 is, especially for the smaller binaries. There are several details to note.

  1. The compiler runs very slowly on the armel platform, and I assume that is a significant engineering cost for your team. Have you considered building a cross-targeting jit so that you can run the armel compiles on an x64 machine, or does that not fit with your engineering strategy? Based on the numbers from your most recent post here, that would be a better time improvement than making this change to the crossgen2 compilation model. We have done some experiments with an arm64-targeting cross compiler and found that we were easily able to produce binary-identical images while running on an x64 host.

  2. @MichalStrehovsky This change in compilation model does have precedent in the .NET ecosystem. In particular, Roslyn actually is typically run in a mode which uses a compiler server which amortizes the cost of jitting the Roslyn codebase across multiple compilations. In that mode, there is a compiler server which does the actual compilations, and csc just serves to pass command line arguments to that persistent server. That preserves the illusion that the compiler has a single output model to the build system, but also allows multiple processes to benefit from running the compiler without excess jitting. Of course, this sort of server model requires some careful engineering to make sure that repeated compilations do not interfere with each other, etc.

  3. We expect that the composite build mode of crossgen2 may provide benefits for this sort of situation, such that an R2R variant of crossgen2 may be closer in performance to crossgen. We don't have good numbers for that yet, but in the next couple of months we hope to get them.

  4. Given the above, and especially point 2, I'm tempted to suggest that if you are unable to use a cross compilation approach, then instead of building in new command line options, we would be more likely to accept a patch that moved much of this processing from crossgen2.dll into ILCompiler.ReadyToRun.dll, so that it could be run either by crossgen2 as an application or directly by some sort of wrapper or server process that would be able to achieve these sorts of performance wins. I think that's a bit more work, but long term, I believe the compiler server approach would be better than a series of new command line options as is proposed here.

@MichalStrehovsky
Member

> This change in compilation model does have precedent in the .NET ecosystem. In particular, Roslyn actually is typically run in a mode which uses a compiler server which amortizes the cost of jitting the Roslyn codebase across multiple compilations.

Yes, I'm aware of that mode. That's why I keep pushing for people to stop using statics to store per-compilation state, but people keep adding those when I'm not looking. But there's a difference between a compiler server and a command line argument to batch compile. The former can be integrated into build systems (but it does bring its own challenges, which is why I sometimes need to taskkill the dotnet process before I can do git clean on the runtime repo); the latter cannot be integrated into build systems and cannot be a shipping switch.

I know what sort of response I get for this, but I compiled crossgen2 with the CoreRT compiler and compared throughput with the CoreCLR/ReadyToRun based one:

| | Before | After | Improvement |
|---|---|---|---|
| Compile System.Private.CoreLib | 3446 ms | 2632 ms | 23% |
| Compile Hello World | 452 ms | 68 ms | 85% |

We'll probably need to build a compilation server to get throughput anywhere near this as long as the compiler is hosted on top of CoreCLR.

@MichalStrehovsky
Member

> We expect that the composite build mode of crossgen2 may provide benefits for this sort of situation

When we were discussing publishing options for crossgen2, we ruled out self-contained publishing because the size was prohibitively large - I don't think we'll be able to composite-compile crossgen2 itself and ship it that way to our customers. We can use it to speed up our inner loop, but I'm skeptical of our ability to pass the benefit to our customers.

@alpencolt

We've run into crossgen2 throughput problems when compiling tests. @davidwrighton, one way to solve this is cross compilation, which should already work using: https://github.com/dotnet/runtime/tree/master/src/coreclr/src/jit/armelnonjit

But there is another case: when a user installs an application from the market onto a device. In this scenario we cannot use cross compilation. This PR or a server mode would help. In the case of a server mode, it should be easy to start and shut down.

@davidwrighton
Member

Ah, as I understand it, this switch is intended for use outside of build system driven scenarios, and only for use within a bespoke application installer pipeline built by your company. This isn't a scenario that has been considered actively as part of crossgen2 development.

Could you share the amount of improvement that you are seeing as a result of this change to the end-to-end install time of a typical application?

@ViktorHofer
Member

// Auto-generated message

69e114c which was merged 12/7 removed the intermediate src/coreclr/src/ folder. This PR needs to be updated as it touches files in that directory which causes conflicts.

To update your commits you can use this bash script: https://gist.github.com/ViktorHofer/6d24f62abdcddb518b4966ead5ef3783. Feel free to use the comment section of the gist to improve the script for others.

@danmoseley
Member

@davidwrighton this PR seems to have been waiting on an update for 7 months - would it make sense to close it if it's not actively being worked on?

@davidwrighton
Member

Yes, I think that makes sense. @alpencolt if you are still interested in this, please re-activate this PR/provide some of the performance numbers we were looking for.

@alpencolt

@davidwrighton we're working on this right now. I hope we'll share results for the crossgen vs crossgen2 performance and memory comparison on armel this week.

@gbalykov
Member Author

> Could you share the amount of improvement that you are seeing as a result of this change to the end-to-end install time of a typical application?

Sorry for the late response. I measured the Calculator app as a typical Tizen Xamarin app. Its installation (without ni compilation) takes just 3.656 seconds.

Target arm device has just two cpus, so crossgen2 results for >=2 threads are pretty much the same.

| crossgen type | threads | Calculator app compilation time (7 dlls), seconds | diff with crossgen1 |
|---|---|---|---|
| crossgen1 | default=1 | 17.552 | x1.0 |
| crossgen2 with -O option | 1 | 59.828 | x3.41 |
| crossgen2 with -O option | 2 | 48.301 | x2.75 |
| crossgen2 with -O option, pipeline | 1 | 36.179 | x2.06 |
| crossgen2 with -O option, pipeline | 2 | 29.965 | x1.70 |

As you can see, pipeline mode saves 18.3 (38%) and 23.6 (39%) seconds for 2 and 1 threads respectively. Considering app installation time, pipeline mode saves 35-37% of end-to-end app install time. Without these changes crossgen2 is almost 3 times slower than crossgen1.
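
The savings quoted above can be reproduced from the table and the 3.656-second install time. The sketch below assumes that end-to-end install time means installation plus compilation, which is an interpretation and not something the comment states explicitly. `pipeline_saving` and the constant names are illustrative.

```python
INSTALL_S = 3.656  # app installation without ni compilation, from the comment above

# threads -> (crossgen2 -O separate invocations, crossgen2 -O pipeline), seconds
CROSSGEN2_S = {1: (59.828, 36.179), 2: (48.301, 29.965)}


def pipeline_saving(threads):
    """Return (seconds saved, % of compile time, % of install + compile time)."""
    separate, pipeline = CROSSGEN2_S[threads]
    saved = separate - pipeline
    compile_pct = 100.0 * saved / separate
    end_to_end_pct = 100.0 * saved / (INSTALL_S + separate)
    return saved, compile_pct, end_to_end_pct


for threads in (1, 2):
    saved, compile_pct, e2e_pct = pipeline_saving(threads)
    print(f"{threads} thread(s): saves {saved:.1f}s, "
          f"{compile_pct:.1f}% of compile time, {e2e_pct:.1f}% end to end")
```

Under that assumption the end-to-end figures land in the 35-37% range quoted above.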

Additionally, I've measured system libs recompilation from scratch using crossgen1 and crossgen2.

| crossgen type | threads | 261 dlls time, seconds | diff with crossgen1 |
|---|---|---|---|
| crossgen1 | default=1 | 316.08 | x1.0 |
| crossgen2 with -O option | 1 | 1979 | x6.26 |
| crossgen2 with -O option | 2 | 1791.72 | x5.67 |

Unfortunately, pipeline mode leaks memory, which results in the process getting killed by the OOM killer, so I wasn't able to measure compilation of all 261 system libs in one command. Anyway, it currently takes ~5 minutes for crossgen1 to compile all system libs and ~30 minutes for crossgen2.

cc @alpencolt

@davidwrighton davidwrighton reopened this Jan 20, 2021
@davidwrighton
Member

I've re-opened the request, as there is active work happening here. Could you clarify if the crossgen2 binaries in this test were themselves crossgenned? Also, could you describe what version of crossgen2 is in use here? Is it from the 5.0 release branch, or a recent build from the master branch?

@nattress, @mangod9 We need to come up with a solution here of some form. In my opinion a slowdown of 2.75X is really not acceptable. I dislike the approach taken here, but it is very expedient, and not particularly impacting to the scenarios we use here in the more general .NET community.

@alpencolt @gbalykov you mention that pipeline mode leaks memory. Do you know what it is leaking, and by how much?

@gbalykov
Member Author

This was measured on dotnet/runtime master (6.0), commit d266fdb. In the application-related measurements above, all system libs including crossgen2 are compiled in r2r. In the "system libs recompilation from scratch" scenario no libs are precompiled; everything is compiled from scratch in this order: System.Private.CoreLib.dll, all 5 crossgen2 dlls, then the others in some order.

I'm not yet sure what is leaking in pipeline mode, but it resulted in ~800 MB of physical memory occupied after approximately 127 compiled dlls. Then the OOM killer killed the process. This is the patch that I used: 1.patch.txt.

@gbalykov
Member Author

Regarding memory consumption, here's memory consumption of System.Private.CoreLib.dll compilation (when all system libs are compiled in r2r):

| crossgen type | threads | SIZE, kb | PSS, kb | RSS, kb |
|---|---|---|---|---|
| crossgen1 | default=1 | 76024 | 43442 | 44636 |
| crossgen2 -O | 1 | 354956 | 131119 | 137672 |
| crossgen2 -O | 2 | 342544 | 137495 | 144436 |

Crossgen2 RSS is more than 3 times higher.

For Tizen Xamarin app, mentioned above, crossgen2 RSS is ~2 times higher.

| dll, compiled with crossgen1 | SIZE, kb | PSS, kb | RSS, kb |
|---|---|---|---|
| Calculator.dll | 60176 | 23692 | 24888 |
| Tizen.Wearable.CircularUI.Forms.Renderer.dll | 51048 | 19760 | 20764 |
| Tizen.Wearable.CircularUI.Forms.dll | 53980 | 18199 | 19400 |
| Xamarin.Forms.Core.dll | 62432 | 30004 | 31200 |
| Xamarin.Forms.Platform.Tizen.dll | 58636 | 22076 | 23272 |
| Xamarin.Forms.Platform.dll | 48448 | 13652 | 14804 |
| Xamarin.Forms.Xaml.dll | 53156 | 17207 | 18408 |

| dll, compiled with crossgen2 -O 2 threads | SIZE, kb | PSS, kb | RSS, kb |
|---|---|---|---|
| Calculator.dll | 248232 | 35949 | 39152 |
| Tizen.Wearable.CircularUI.Forms.Renderer.dll | 246144 | 36590 | 39824 |
| Tizen.Wearable.CircularUI.Forms.dll | 249388 | 36217 | 39416 |
| Xamarin.Forms.Core.dll | 275920 | 65798 | 69312 |
| Xamarin.Forms.Platform.Tizen.dll | 250044 | 51770 | 55184 |
| Xamarin.Forms.Platform.dll | 245488 | 28175 | 30952 |
| Xamarin.Forms.Xaml.dll | 247356 | 36549 | 39800 |
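
The "more than 3 times" and "~2 times" RSS statements can be checked against the reported numbers. This is a minimal sketch; the dictionary names are illustrative, and the values are the RSS figures from the comment above.

```python
# RSS in kb for System.Private.CoreLib.dll compilation.
SPC_RSS = {"crossgen1": 44636, "crossgen2 -O, 1 thread": 137672,
           "crossgen2 -O, 2 threads": 144436}

# RSS in kb per app dll: (crossgen1, crossgen2 -O with 2 threads).
APP_RSS = {
    "Calculator.dll": (24888, 39152),
    "Tizen.Wearable.CircularUI.Forms.Renderer.dll": (20764, 39824),
    "Tizen.Wearable.CircularUI.Forms.dll": (19400, 39416),
    "Xamarin.Forms.Core.dll": (31200, 69312),
    "Xamarin.Forms.Platform.Tizen.dll": (23272, 55184),
    "Xamarin.Forms.Platform.dll": (14804, 30952),
    "Xamarin.Forms.Xaml.dll": (18408, 39800),
}

spc_ratio = SPC_RSS["crossgen2 -O, 1 thread"] / SPC_RSS["crossgen1"]
print(f"System.Private.CoreLib.dll: crossgen2 RSS is x{spc_ratio:.2f} of crossgen1")

app_ratios = {dll: new / old for dll, (old, new) in APP_RSS.items()}
for dll, ratio in app_ratios.items():
    print(f"{dll:<46} x{ratio:.2f}")

# Ratio of summed RSS across the 7 dlls (each dll is compiled separately,
# so this is an aggregate indicator, not a peak-memory figure).
total_ratio = (sum(new for _, new in APP_RSS.values())
               / sum(old for old, _ in APP_RSS.values()))
print(f"all 7 dlls: x{total_ratio:.2f}")
```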

@danmoseley
Member

@mangod9 could you please set an assignee on this PR? It helps for old PRs to have a "shepherd", and this is the oldest one without such a person.

@danmoseley
Member

@mangod9 thoughts about owner?

@mangod9
Member

mangod9 commented Feb 10, 2021

Sorry, must have missed the previous tag. Adding @trylek as well. We will discuss how to proceed this week.

Base automatically changed from master to main March 1, 2021 09:06
- Add --out-near-input option, which adds .ni. suffix to input filepath
  and stores resulting ni.dll near original dll. In this mode --out option can be skipped.
- Add --single-file-compilation mode, which allows to compile all input files separately.
@gbalykov gbalykov closed this Mar 29, 2021
@gbalykov gbalykov force-pushed the crossgen2-pipeline branch from a601571 to d266fdb Compare March 29, 2021 14:04
@gbalykov gbalykov reopened this Mar 29, 2021
@mangod9
Member

mangod9 commented Mar 29, 2021

Hey @gbalykov, I assume this PR is still relevant? If you could resolve the conflicts, we can work on getting it merged. Just a note that as part of the regular workflow we wouldn't be validating the multiple-file compilation mode.

@gbalykov
Member Author

gbalykov commented Apr 1, 2021

@mangod9 yes, this is still relevant, I'll rebase it

@mangod9
Member

mangod9 commented May 10, 2021

Hi @gbalykov, closing this for now; please reopen when it's ready for review. Reminder that preview7 (early July) is when new feature work should be done for .NET 6. Thanks.

@mangod9 mangod9 closed this May 10, 2021
@gbalykov
Member Author

@mangod9 thanks. This PR is obsolete; #51154 was created instead.

@karelz karelz added this to the 6.0.0 milestone May 20, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 19, 2021