colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277
Benchmark results from my laptop are here: https://gist.github.com/michae2/4406203dbafc5749ad6a02f8b0ec268e
DrewKimball
left a comment
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go line 74 at r4 (raw file):
whence
Nice :)
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
	return
}
retErr = err
Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?
petermattis
left a comment
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go line 111 at r4 (raw file):
	))
}
if panicEmittedFrom == "" {
Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertionFailedf is instead. Do we even have test coverage of this code?
// stack trace. This method should be called to propagate errors that resulted
// in the vectorized engine being in an *unexpected* state.
func InternalError(err error) {
	panic(err)
super nit: the comment should be updated to mention the error wrapping instead of "simply panicking."
rafiss
left a comment
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
pkg/sql/colexecerror/error.go line 113 at r4 (raw file):
if panicEmittedFrom == "" {
	stackTrace := string(debug.Stack())
	panic(errors.AssertionFailedf(
i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35
(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)
rafiss
left a comment
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)
-- commits line 2 at r4:
could you include before/after results of this benchmark in the PR description?
helpful incantation:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'
yuzefovich
left a comment
Nice work and speed up! I have some comments.
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 109 at r1 (raw file):
sqlRowPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/row"
sqlSemPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/sem"
testSqlColPackagesPrefix = "pkg/sql/col"
Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.
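For readers following along, here is a minimal sketch of how a prefix list like this could drive the catch decision. It is not the actual code in error.go: the helper name shouldCatch and the slice catchablePrefixes are hypothetical, and only the three prefixes quoted above are included. It assumes, as discussed in this thread, that function names seen under bazel tests lack the github.com/cockroachdb/cockroach/ prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// Prefixes copied from the snippet quoted in the review. The real list in
// error.go is longer; this subset is only for illustration.
const (
	sqlRowPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/row"
	sqlSemPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/sem"
	testSqlColPackagesPrefix = "pkg/sql/col"
)

// catchablePrefixes is a hypothetical name for the set of prefixes to match.
var catchablePrefixes = []string{
	sqlRowPackagesPrefix,
	sqlSemPackagesPrefix,
	testSqlColPackagesPrefix,
}

// shouldCatch is a hypothetical simplification: it reports whether the
// function that emitted the panic lives in one of the listed packages.
// An empty function name never matches, so the panic would be re-thrown.
func shouldCatch(panicEmittedFrom string) bool {
	for _, prefix := range catchablePrefixes {
		if strings.HasPrefix(panicEmittedFrom, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldCatch("pkg/sql/colexec.someOperator"))
	fmt.Println(shouldCatch(""))
}
```

The short test-only prefix matches stripped names such as pkg/sql/colexec, which the fully qualified prefixes would miss.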
pkg/sql/colexecerror/error.go line 78 at r3 (raw file):
// engine. We treat a panic from lower in the stack as unrecoverable.
//Find where the panic came from and only proceed if it
nit: missing spaces after the slashes in the third commit.
pkg/sql/colexecerror/error.go line 243 at r3 (raw file):
func init() {
	errors.RegisterWrapperDecoder(errors.GetTypeKey((*internalError)(nil)), decodeInternalError)
I think we need to register the decoder for both internalError and notInternalError.
pkg/sql/colexecerror/error.go line 246 at r3 (raw file):
}
// InternalError simply panics with the provided object. It will always be
nit: this comment needs a minor adjustment.
pkg/sql/colexecerror/error.go line 111 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertionFailedf is instead. Do we even have test coverage of this code?
These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
pkg/sql/colexecerror/error.go line 113 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35
(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)
Yes, good point. AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?
+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.
pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):
),
"crdb_internal.force_vectorized_assertion_error": makeBuiltin(
nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?
pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):
var (
	// testAllocator is an Allocator with an unlimited budget for use in tests.
	testAllocator *colmem.Allocator
nit: we don't need most of the initialization in this file. I think it can be as short as
func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)
	os.Exit(m.Run())
}
petermattis
left a comment
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @rafiss, and @yuzefovich)
pkg/sql/colexecerror/error.go line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
Ah, got it. I'd suggest a small refactor to this code. Pull the extraction of panicEmittedFrom into a function, and call that from a test and assert that it always finds the location. Right now you essentially have the testing in the regular code path which feels a bit strange.
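To make the suggested refactor concrete, here is a rough sketch of what a standalone helper could look like, written against the Go standard library only. The names panicEmittedFrom (as a function), emitterOf, and explode are hypothetical and not taken from the PR. The helper walks the stack from inside a deferred function and returns the frame directly below runtime.gopanic, so a test can call it and assert that a location is always found.

```go
package main

import (
	"fmt"
	"runtime"
)

// panicEmittedFrom must be called from a deferred function while a panic is
// being handled. It returns the name of the function in the frame directly
// below runtime.gopanic, or "" if no such frame is found.
func panicEmittedFrom() string {
	var pcs [32]uintptr
	// Skip runtime.Callers and panicEmittedFrom itself.
	n := runtime.Callers(2, pcs[:])
	frames := runtime.CallersFrames(pcs[:n])
	panicSeen := false
	for {
		frame, more := frames.Next()
		if panicSeen {
			return frame.Function
		}
		panicSeen = frame.Function == "runtime.gopanic"
		if !more {
			return ""
		}
	}
}

// emitterOf runs fn, recovers any panic, and reports which function the
// panic was emitted from. It returns "" when fn does not panic.
func emitterOf(fn func()) (from string) {
	defer func() {
		if r := recover(); r != nil {
			from = panicEmittedFrom()
		}
	}()
	fn()
	return ""
}

//go:noinline
func explode() { panic("boom") }

func main() {
	fmt.Println(emitterOf(explode))
}
```

With the lookup isolated like this, the "location not found" case becomes a plain test assertion instead of a check on the regular code path.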
yuzefovich
left a comment
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.
Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.
mgartner
left a comment
Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 48 at r2 (raw file):
// without a stacktrace, sentry report, or "internal error" designation.
var nie *notInternalError
if errors.As(err, &se) || errors.As(err, &nie) {
nit: Why use errors.As here instead of errors.Is or errors.IsAny?
pkg/sql/colexecerror/error_test.go
Outdated
b.Run(tc.name, func(b *testing.B) {
	// Create as many warm connections as we will need for the benchmark.
	conns := make(chan *gosql.DB, numConns)
	for range numConns {
Note to self: this will make backport difficult.
rafiss
left a comment
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @petermattis)
pkg/sql/colexecerror/error.go line 48 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: Why use errors.As here instead of errors.Is or errors.IsAny?
errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.
errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.
in this case, since there is no "singleton" notInternalError, we need to use As.
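A small illustration of the distinction, as a sketch using the Go standard library errors package rather than cockroachdb/errors (whose Is has additional matching rules). The type notInternal is a hypothetical stand-in for notInternalError, and errSentinel and isNotInternal are names made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errSentinel is a singleton reference error, so errors.Is can match it by
// comparing against this exact value.
var errSentinel = errors.New("sentinel")

// notInternal is a hypothetical stand-in for a wrapper type such as
// notInternalError. Every wrap creates a fresh value, so there is no
// singleton to compare against.
type notInternal struct{ cause error }

func (e *notInternal) Error() string { return e.cause.Error() }
func (e *notInternal) Unwrap() error { return e.cause }

// isNotInternal reports whether any error in the chain has type *notInternal.
func isNotInternal(err error) bool {
	var nie *notInternal
	return errors.As(err, &nie)
}

func main() {
	err := fmt.Errorf("outer: %w", &notInternal{cause: errSentinel})
	// Matches by equality with the reference value.
	fmt.Println(errors.Is(err, errSentinel))
	// Matches by type, whichever instance is in the chain.
	fmt.Println(isNotInternal(err))
	// A freshly built wrapper is a different pointer, so equality fails.
	fmt.Println(errors.Is(err, &notInternal{}))
}
```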
Previously, rafiss (Rafi Shamim) wrote…
Thanks for the explanation, Rafi!
michae2
left a comment
Thanks for all the comments! I will push an update tonight.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.
Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?
yuzefovich
left a comment
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?
Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should be propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.
michae2
left a comment
TFTRs!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @mgartner, @petermattis, @rafiss, and @yuzefovich)
Previously, rafiss (Rafi Shamim) wrote…
could you include before/after results of this benchmark in the PR description?
helpful incantation:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'
I'll kick off a run on a gceworker and share tomorrow.
pkg/sql/colexecerror/error.go line 109 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.
Yes, exactly. Added a comment and I asked about it in slack.
pkg/sql/colexecerror/error.go line 78 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: missing spaces after the slashes in the third commit.
Done.
pkg/sql/colexecerror/error.go line 243 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think we need to register the decoder for both internalError and notInternalError.
Oh, good catch! Done.
pkg/sql/colexecerror/error.go line 246 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this comment needs a minor adjustment.
Done.
pkg/sql/colexecerror/error.go line 0 at r4 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
super nit: the comment should be updated to mention the error wrapping instead of "simply panicking."
Done.
pkg/sql/colexecerror/error.go line 111 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.
I removed these two checks.
pkg/sql/colexecerror/error.go line 113 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Yes, good point. AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)
Removed these two checks.
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should be propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.
I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:
cockroach/pkg/sql/colexec/columnarizer.go
Line 243 in af3173a
colexecerror.ExpectedError(meta.Err)
pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?
Good call. I added an overload with a couple more options.
pkg/sql/colexecerror/error_test.go line 203 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Note to self: this will make backport difficult.
Done.
pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we don't need most of the initialization in this file. I think it can be as short as
func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)
	os.Exit(m.Run())
}
Thank you! Done.
yuzefovich
left a comment
Reviewed 6 of 6 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)
pkg/sql/colexecerror/error.go line 130 at r4 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:
cockroach/pkg/sql/colexec/columnarizer.go
Line 243 in af3173a
colexecerror.ExpectedError(meta.Err)
Hm, we might be thinking about this differently. The idea is that in the fallback case, when we had to look at the stack via runtime.CallersFrames (because the panic should be caught by vec engine but wasn't produced via one of colexecerror.*Error calls), we will wrap the error with a special marker alreadyCaughtError so that the next catcher up the stack doesn't have to inspect the stack (i.e. we would add another special error type to the hot path at the top of the method). This shouldn't have any influence on the columnarizer-materializer pair since they already use colexecerror methods that wrap errors with different markers. Does this match your thinking?
That said, this would be an improvement to an edge case, so I'd be ok with leaving a TODO for it.
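For the record, a minimal sketch of the marker idea, using only the standard library. Everything here is hypothetical: the type alreadyCaughtError is only proposed in this thread, and catch, inspectStack, stackInspections, and errBoom are made-up names. inspectStack stands in for the expensive stack walk and always decides to catch.

```go
package main

import (
	"errors"
	"fmt"
)

// alreadyCaughtError is a hypothetical marker wrapper: it records that a
// catcher lower in the stack already inspected the stack for this error.
type alreadyCaughtError struct{ cause error }

func (e *alreadyCaughtError) Error() string { return e.cause.Error() }
func (e *alreadyCaughtError) Unwrap() error { return e.cause }

var errBoom = errors.New("boom")

// stackInspections counts how often the slow path runs.
var stackInspections int

// inspectStack stands in for the expensive stack walk. In this sketch it
// always decides that the panic should be caught.
func inspectStack() bool {
	stackInspections++
	return true
}

// catch runs op and converts a catchable panic into a returned error.
func catch(op func()) (retErr error) {
	defer func() {
		panicObj := recover()
		if panicObj == nil {
			return
		}
		err, ok := panicObj.(error)
		if !ok {
			panic(panicObj)
		}
		var marked *alreadyCaughtError
		if errors.As(err, &marked) {
			// Fast path: a nested catcher already did the inspection.
			retErr = err
			return
		}
		if !inspectStack() {
			panic(panicObj)
		}
		retErr = &alreadyCaughtError{cause: err}
	}()
	op()
	return nil
}

func main() {
	err := catch(func() {
		if innerErr := catch(func() { panic(errBoom) }); innerErr != nil {
			panic(innerErr) // re-thrown towards the outer catcher
		}
	})
	fmt.Println(err, stackInspections)
}
```

Only the innermost catcher pays for the stack walk; the outer one recognizes the marker by type and returns immediately.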
Earlier this year we made the vectorized panic-catcher much more efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is a slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be a more deeply nested panic there, we'll still crash the CRDB server. Release note: None
133620: colexecerror: improve the catcher due to a recent regression r=yuzefovich a=yuzefovich
Earlier this year we made the vectorized panic-catcher much more efficient (in #123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is a slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be a more deeply nested panic there, we'll still crash the CRDB server.
Fixes: #133617.
Release note: None
Co-authored-by: Yahor Yuzefovich <[email protected]>
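The frame skipping described in that commit message can be sketched roughly as follows. This is a standalone approximation and not the code from the follow-up PR: the names maxRuntimeFrames, panicEmittedFrom, emitterOf, and badAssert are hypothetical, and matching runtime frames by the "runtime." function-name prefix is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// maxRuntimeFrames mirrors the limit of 5 mentioned in the commit message.
const maxRuntimeFrames = 5

// panicEmittedFrom must be called from a deferred function while a panic is
// being handled. After locating runtime.gopanic it skips up to
// maxRuntimeFrames further runtime frames and returns the next function.
// If the limit is exhausted it returns a runtime function name, which a
// prefix-based check would then refuse to catch.
func panicEmittedFrom() string {
	var pcs [32]uintptr
	// Skip runtime.Callers and panicEmittedFrom itself.
	n := runtime.Callers(2, pcs[:])
	frames := runtime.CallersFrames(pcs[:n])
	panicSeen, skipped := false, 0
	for {
		frame, more := frames.Next()
		switch {
		case !panicSeen:
			panicSeen = frame.Function == "runtime.gopanic"
		case strings.HasPrefix(frame.Function, "runtime.") && skipped < maxRuntimeFrames:
			skipped++
		default:
			return frame.Function
		}
		if !more {
			return ""
		}
	}
}

// emitterOf runs fn, recovers any panic, and reports its origin.
func emitterOf(fn func()) (from string) {
	defer func() {
		if r := recover(); r != nil {
			from = panicEmittedFrom()
		}
	}()
	fn()
	return ""
}

// badAssert triggers a runtime-internal panic via an invalid interface
// assertion, which puts an extra runtime frame below runtime.gopanic.
//
//go:noinline
func badAssert(v interface{}) int { return v.(int) }

func main() {
	fmt.Println(emitterOf(func() { badAssert("not an int") }))
}
```

Without the skipping case, the function reported for this panic would be the runtime helper that raised the assertion failure, which is the regression the commit describes.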
See individual commits for details.
Benchmarks before and after the change:
Fixes: #123235
Release note (performance improvement): Make error handling in the vectorized execution engine much cheaper. This should help avoid bad metastable regimes perpetuated by statement timeout handling consuming all CPU time, leading to more statement timeouts.
Co-authored-by: Drew Kimball [email protected]