
Conversation

@PeterSolMS
Contributor

We assume that we can use half the free list space in gen 0 for new allocation. If that is too optimistic, we may allocate into decommitted memory and crash in the allocator. That is because there is a race condition between the allocating thread and the decommitting thread - we decided to avoid that by making sure we would never decommit memory that we may allocate in gen 0.

There are two reasons why assuming we can use half the free list space for new allocations may be too optimistic:

  • if we allocate large objects in gen 0, we may not have free spaces of the necessary size available.
  • when background GC goes into background_ephemeral_sweep, it deletes and rebuilds the free list for gen 0. A thread trying to allocate during that time may see a completely empty free list.

We used to do this correctly by not taking the free list space in gen 0 into account at all. As we made changes for regions, that seemed unnecessarily pessimistic, but in changing the logic we failed to honor the invariants the allocator relies on.

The fix essentially goes back to the older logic where the free list space in gen 0 is not taken into account. We can do better than this, but it's more complicated.
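For illustration, here is a minimal sketch of the budget estimate the fix reverts to. The helper name and parameters are hypothetical; only the idea of ignoring gen 0 free list space comes from the change itself:

```cpp
#include <cstddef>

// Hypothetical, simplified sketch of the invariant the fix restores: when
// estimating how much gen 0 space new allocation can use (and therefore how
// much must stay committed), do not count free list space, because large
// objects or background_ephemeral_sweep can make it unavailable.
size_t estimate_gen0_allocatable(size_t space_beyond_alloc_ptr,
                                 size_t free_list_space)
{
    // Old, too-optimistic logic assumed half the free list was usable:
    //   return space_beyond_alloc_ptr + free_list_space / 2;

    // Restored logic: only the space past the allocation pointer is
    // guaranteed to be usable for new allocation.
    (void)free_list_space; // deliberately ignored
    return space_beyond_alloc_ptr;
}
```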

@ghost ghost added the area-GC-coreclr label Oct 5, 2021
@ghost

ghost commented Oct 5, 2021

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


@cshung
Contributor

cshung commented Oct 5, 2021

I am wondering if we have considered the NoGCRegion-related code paths. In particular, gc_heap::should_proceed_for_no_gc commits memory disregarding the allocation budget and increases the allocation budget through set_soh_allocations_for_no_gc and set_loh_allocations_for_no_gc. So I suspect we might run into an issue with this sequence of operations:

  1. We call decommit_ephemeral_segment_pages and compute the heap_segment_decommit_target, on the assumption that allocation will never reach it.
  2. Then we perform a StartNoGCRegion, which increases the allocation budget.
  3. The memory is given out to an allocation context.
  4. decommit_step kicks in and decommits memory that we will allocate in.
  5. Allocation happens (from the allocation context) and references decommitted memory.

This is just my suspicion, I haven't been able to reproduce that yet.
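To make the suspected interleaving concrete, here is a standalone toy model (not CoreCLR code; all names and numbers are illustrative). The assert at the end fires, modeling the crash in step 5:

```cpp
#include <cassert>
#include <cstddef>

// Toy model of an ephemeral segment: a stale decommit target computed
// before StartNoGCRegion inflates the budget lets decommit_step release
// pages an allocation context already owns.
struct toy_segment
{
    size_t committed       = 100; // pages currently committed
    size_t decommit_target = 0;   // pages we plan to keep committed
    size_t alloc_limit     = 0;   // pages handed to an allocation context
};

int main()
{
    toy_segment seg;

    // 1. decommit_ephemeral_segment_pages: target based on a small budget.
    size_t budget = 40;
    seg.decommit_target = budget;

    // 2. + 3. StartNoGCRegion raises the budget and hands memory to an
    // allocation context, without revisiting the decommit target.
    budget = 80;
    seg.alloc_limit = budget;

    // 4. decommit_step runs with the stale target.
    seg.committed = seg.decommit_target;

    // 5. The allocation context now extends past committed memory.
    assert(seg.alloc_limit <= seg.committed && "allocating into decommitted memory");
}
```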

@Maoni0
Member

Maoni0 commented Oct 6, 2021

@cshung that is indeed a problem. Since set_allocations_for_no_gc is only ever called either during a blocking GC or on the same thread that calls decommit_step, you can simply set gradual_decommit_in_progress_p to FALSE in set_allocations_for_no_gc.
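A minimal sketch of that suggestion, assuming a simplified stand-in for gc_heap (the real function's signature and surroundings may differ):

```cpp
#include <atomic>

// Stand-in for the real flag on gc_heap.
std::atomic<bool> gradual_decommit_in_progress_p{false};

// Hypothetical, simplified version of set_allocations_for_no_gc with the
// suggested addition. Because this runs either during a blocking GC or on
// the same thread that runs decommit_step, clearing the flag here cannot
// race with the decommitting thread.
void set_allocations_for_no_gc()
{
    // ... existing work: raise the soh/loh budgets for the NoGC region ...

    gradual_decommit_in_progress_p = false; // suggested addition
}
```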

@PeterSolMS
Contributor Author

I think we already hit this issue and fixed it (PR #41441) - the fix is to immediately turn off gradual_decommit_in_progress_p when the server GC threads start running. Code from gc_heap::gc_thread_function():

    proceed_with_gc_p = TRUE;
    gradual_decommit_in_progress_p = FALSE;

I think that on server GC, set_allocations_for_no_gc is always called on a GC thread, so we should be fine. Or am I overlooking something?

@Maoni0
Member

Maoni0 commented Oct 6, 2021

Here's the scenario I'm thinking about - when we are in a NoGC region, we could trigger a GC, and the special thing we do there is adjust the budget after we've already done the logic that decides whether we want to gradually decommit or not. In gc1 we decide whether to set gradual_decommit_in_progress_p to TRUE, and at that point we haven't adjusted the budget for NoGC yet. Then we return to garbage_collect and adjust the budget (to the big budget for NoGC). So we can come out of that GC with gradual_decommit_in_progress_p as TRUE, and decommit_step can run.

@Maoni0
Member

Maoni0 commented Oct 6, 2021

Oh, actually never mind... I was wrong - I forgot that we do not go into the code path that adjusts gradual_decommit_in_progress_p if we are in pause_no_gc... so I agree with you, @PeterSolMS.

@PeterSolMS
Contributor Author

I looked at the code some more and concluded that you are right, @Maoni0 - the code is in fact correct regarding the potential problem raised by @cshung.

To summarize, the reasons are as follows:

  • in gc_thread_function, we turn off gradual_decommit_in_progress_p as we are about to enter the main logic for GC.
  • if set_allocations_for_no_gc is called from should_proceed_for_no_gc, gradual_decommit_in_progress_p is still turned off.
  • if set_allocations_for_no_gc is called from allocate_for_no_gc_after_gc, decommit_ephemeral_segment_pages may have been called, but it immediately returned because of pause_no_gc. So gradual_decommit_in_progress_p must still be turned off, because decommit_ephemeral_segment_pages is the only place (in the segments case) where gradual_decommit_in_progress_p is turned on (see the sketch after this list).
  • for the regions case, gradual_decommit_in_progress_p is turned on in distribute_free_regions, but the regions to be decommitted are put on a global list that isn't accessed by the allocator.
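A minimal sketch of the guard the third bullet relies on, with hypothetical stand-ins for the GC's settings (the real condition and surrounding code may differ):

```cpp
// Simplified model: while a NoGC region is active, the only place (in the
// segments case) that can turn gradual decommit on returns early, so the
// flag stays off for the entire region.
enum gc_pause_mode { pause_interactive, pause_no_gc };
struct gc_settings { gc_pause_mode pause_mode; };

static gc_settings settings = { pause_no_gc };
static bool gradual_decommit_in_progress_p = false;

void decommit_ephemeral_segment_pages()
{
    if (settings.pause_mode == pause_no_gc)
        return; // never start gradual decommit inside a NoGC region

    // ... compute heap_segment_decommit_target; only if there are pages
    // to release does this path turn the flag on ...
    gradual_decommit_in_progress_p = true;
}
```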

@Maoni0 Maoni0 left a comment
Member

LGTM and we'll need to port this back to 6.0

@cshung cshung left a comment
Contributor

Looks good to me!

@PeterSolMS PeterSolMS merged commit 971479e into dotnet:main Oct 8, 2021
@PeterSolMS
Contributor Author

/backport to release/6.0

@github-actions
Contributor

github-actions bot commented Oct 8, 2021

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/1320244166
