colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277

michae2 · 2024-04-30T06:59:57Z

See individual commits for details.

Benchmarks before and after the change:

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch     │
                                                                    │              sec/op               │    sec/op      vs base                 │
CatchVectorizedRuntimeError/noError-12                                                    0.4033n ± ∞ ¹   0.4224n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                 40831.00n ± ∞ ¹    98.55n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/storage-12                                                  41302.00n ± ∞ ¹    99.54n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/contextCanceled-12                                          40947.00n ± ∞ ¹    99.78n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/internalWithCode-12                                          41088.0n ± ∞ ¹    474.6n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/internal-12                                                  41032.0n ± ∞ ¹    943.0n ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/runtime-12                                                    37.094µ ± ∞ ¹    1.091µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                         43.83µ ± ∞ ¹    43.63µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                              1234.24µ ± ∞ ¹    48.27µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                             1182.11µ ± ∞ ¹    88.12µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                             1211.83µ ± ∞ ¹    77.58µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                       1227.76µ ± ∞ ¹    82.40µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                          1224.45µ ± ∞ ¹    81.01µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                               1265.30µ ± ∞ ¹    44.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                        43.78µ ± ∞ ¹    42.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                             1210.98µ ± ∞ ¹    48.41µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                            1187.21µ ± ∞ ¹    84.70µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                            1224.62µ ± ∞ ¹    73.85µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                      1224.56µ ± ∞ ¹    81.39µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                         1259.05µ ± ∞ ¹    80.63µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                              1262.44µ ± ∞ ¹    47.13µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                        44.98µ ± ∞ ¹    43.53µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                             1231.37µ ± ∞ ¹    48.65µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                            1201.74µ ± ∞ ¹    97.23µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                            1249.63µ ± ∞ ¹    80.46µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                      1257.86µ ± ∞ ¹    77.83µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                         1251.79µ ± ∞ ¹    82.22µ ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                              1271.82µ ± ∞ ¹    51.19µ ± ∞ ¹        ~ (p=1.000 n=1) ²
geomean                                                                                    243.2µ          13.09µ        -94.62%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch     │
                                                                    │               B/op                │     B/op       vs base                 │
CatchVectorizedRuntimeError/noError-12                                                      0.000 ± ∞ ¹     0.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                   9834.00 ± ∞ ¹     32.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/storage-12                                                    8880.00 ± ∞ ¹     32.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/contextCanceled-12                                            8840.00 ± ∞ ¹     40.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internalWithCode-12                                           9.586Ki ± ∞ ¹   1.511Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internal-12                                                  10.479Ki ± ∞ ¹   2.961Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/runtime-12                                                   10.964Ki ± ∞ ¹   2.656Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                        81.75Ki ± ∞ ¹   81.71Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                              143.04Ki ± ∞ ¹   85.28Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                              362.7Ki ± ∞ ¹   304.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                              304.2Ki ± ∞ ¹   245.3Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                        313.0Ki ± ∞ ¹   267.2Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                           308.3Ki ± ∞ ¹   260.7Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                               128.96Ki ± ∞ ¹   80.60Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                       81.21Ki ± ∞ ¹   81.25Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                             136.02Ki ± ∞ ¹   84.99Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                             362.0Ki ± ∞ ¹   302.9Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                             291.4Ki ± ∞ ¹   244.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                       317.1Ki ± ∞ ¹   267.0Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                          308.4Ki ± ∞ ¹   260.4Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                              145.34Ki ± ∞ ¹   81.73Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                       81.33Ki ± ∞ ¹   81.24Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                             137.49Ki ± ∞ ¹   85.05Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                             364.7Ki ± ∞ ¹   305.8Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                             318.4Ki ± ∞ ¹   244.5Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                       315.8Ki ± ∞ ¹   265.4Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                          310.9Ki ± ∞ ¹   260.2Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                              133.20Ki ± ∞ ¹   80.53Ki ± ∞ ¹        ~ (p=1.000 n=1) ³
geomean                                                                                               ⁴                  -61.19%               ⁴
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
⁴ summaries must be >0 to compute geomean

                                                                    │ /tmp/tmp.PrzUglo6GH/bench.catch~3 │    /tmp/tmp.PrzUglo6GH/bench.catch    │
                                                                    │             allocs/op             │  allocs/op    vs base                 │
CatchVectorizedRuntimeError/noError-12                                                      0.000 ± ∞ ¹    0.000 ± ∞ ¹        ~ (p=1.000 n=1) ²
CatchVectorizedRuntimeError/expected-12                                                    36.000 ± ∞ ¹    3.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/storage-12                                                     11.000 ± ∞ ¹    3.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/contextCanceled-12                                             11.000 ± ∞ ¹    4.000 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internalWithCode-12                                             35.00 ± ∞ ¹    37.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/internal-12                                                     46.00 ± ∞ ¹    59.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
CatchVectorizedRuntimeError/runtime-12                                                      58.00 ± ∞ ¹    51.00 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/noError-12                                          838.0 ± ∞ ¹    838.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=12/expectedWithCode-12                                1110.0 ± ∞ ¹    950.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/expectedAssertion-12                               1.736k ± ∞ ¹   1.581k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalAssertion-12                               1.285k ± ∞ ¹   1.228k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalIndexOutOfRange-12                         1.418k ± ∞ ¹   1.425k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/internalDivideByZero-12                            1.413k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=12/contextCanceled-12                                  880.0 ± ∞ ¹    865.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/noError-12                                         836.0 ± ∞ ¹    836.0 ± ∞ ¹        ~ (p=1.000 n=1) ²
SQLCatchVectorizedRuntimeError/conns=240/expectedWithCode-12                               1012.0 ± ∞ ¹    949.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/expectedAssertion-12                              1.744k ± ∞ ¹   1.572k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalAssertion-12                              1.201k ± ∞ ¹   1.228k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalIndexOutOfRange-12                        1.518k ± ∞ ¹   1.424k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/internalDivideByZero-12                           1.429k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=240/contextCanceled-12                                 993.0 ± ∞ ¹    865.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/noError-12                                         838.0 ± ∞ ¹    837.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedWithCode-12                               1038.0 ± ∞ ¹    951.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/expectedAssertion-12                              1.782k ± ∞ ¹   1.594k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalAssertion-12                              1.471k ± ∞ ¹   1.231k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalIndexOutOfRange-12                        1.476k ± ∞ ¹   1.417k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/internalDivideByZero-12                           1.471k ± ∞ ¹   1.411k ± ∞ ¹        ~ (p=1.000 n=1) ³
SQLCatchVectorizedRuntimeError/conns=600/contextCanceled-12                                 960.0 ± ∞ ¹    867.0 ± ∞ ¹        ~ (p=1.000 n=1) ³
geomean                                                                                               ⁴                 -18.86%               ⁴
¹ need >= 6 samples for confidence interval at level 0.95
² all samples are equal
³ need >= 4 samples to detect a difference at alpha level 0.05
⁴ summaries must be >0 to compute geomean

Fixes: #123235

Release note (performance improvement): Make error handling in the vectorized execution engine much cheaper. This should help avoid bad metastable regimes perpetuated by statement timeout handling consuming all CPU time, leading to more statement timeouts.

Co-authored-by: Drew Kimball [email protected]

cockroach-teamcity · 2024-04-30T07:00:13Z

This change is

michae2 · 2024-04-30T07:32:24Z

Benchmark results from my laptop are here: https://gist.github.com/michae2/4406203dbafc5749ad6a02f8b0ec268e

DrewKimball

Nice work!

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)

pkg/sql/colexecerror/error.go line 74 at r4 (raw file):

whence

Nice :)

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

			return
		}
		retErr = err

Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?

petermattis

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2, @rafiss, and @yuzefovich)

pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

			))
		}
		if panicEmittedFrom == "" {

Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertFailedf is instead. Do we even have test coverage of this code?

rafiss · 2024-04-30T13:41:49Z

pkg/sql/colexecerror/error.go

 // stack trace. This method should be called to propagate errors that resulted
 // in the vectorized engine being in an *unexpected* state.
 func InternalError(err error) {
-	panic(err)


super nit: the comment should be updated so mention the error wrapping instead of "simply panicking."

rafiss

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)

pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

		if panicEmittedFrom == "" {
			stackTrace := string(debug.Stack())
			panic(errors.AssertionFailedf(

i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35

(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)

rafiss

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)

-- commits line 2 at r4:
could you include before/after results of this benchmark in the PR description?

helpful incantation:

N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'

yuzefovich

Nice work and speed up! I have some comments.

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 109 at r1 (raw file):

	sqlRowPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/row"
	sqlSemPackagesPrefix     = "github.com/cockroachdb/cockroach/pkg/sql/sem"
	testSqlColPackagesPrefix = "pkg/sql/col"

Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.

pkg/sql/colexecerror/error.go line 78 at r3 (raw file):

		// engine. We treat a panic from lower in the stack as unrecoverable.

		//Find where the panic came from and only proceed if it

nit: missing spaces after the slashes in the third commit.

pkg/sql/colexecerror/error.go line 243 at r3 (raw file):

func init() {
	errors.RegisterWrapperDecoder(errors.GetTypeKey((*internalError)(nil)), decodeInternalError)

I think we need to register the decoder for both internalError and notInternalError.

pkg/sql/colexecerror/error.go line 246 at r3 (raw file):

}

// InternalError simply panics with the provided object. It will always be

nit: this comment needs a minor adjustment.

pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is this check and the one above for !panicLineFound necessary? If they were omitted we'd call shouldCatchPanic("") which would return false and we'd re-throw panicObj which should ultimately print the stack anyways. Just wondering what the value of emitting errors.AssertFailedf is instead. Do we even have test coverage of this code?

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.

pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

i thought errors.AssertionFailedf already would include the stack trace: https://github.com/cockroachdb/errors/blob/c1cc1919cf999fb018fcd038852e969e3d5631cc/errutil/assertions.go#L33-L35

(though i see this was the behavior from before your PR. we could check if we have been seeing duplicated stack traces in any error reports.)

Yes, good point.AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Do you think it would be worth it to wrap the error here in an alreadyCaughtErr struct or something, so that we only have to inspect the stack once in a set of nested catchers?

+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.

pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):

	),

	"crdb_internal.force_vectorized_assertion_error": makeBuiltin(

nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?

pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):

var (
	// testAllocator is an Allocator with an unlimited budget for use in tests.
	testAllocator     *colmem.Allocator

nit: we don't need most of the initialization in this file. I think it can be as short as

func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)

	os.Exit(m.Run())
}

petermattis

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @rafiss, and @yuzefovich)

pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.

Ah, got it. I'd suggest a small refactor to this code. Pull the extraction of panicEmittedFrom into a function, and call that from a test and assert that it always finds the location. Right now you essentially have the testing in the regular code path which feels a bit strange.

yuzefovich

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

These two checks were added in case Go runtime ever changes so that panics are emitted from a different location than runtime/panic.go. We do have some sanity checks for this code in TestCatchVectorizedRuntimeError, but I don't think it's possible to come up with a test in which one of these two checks doesn't pass.

Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.

mgartner

Reviewed 6 of 6 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

			// without a stacktrace, sentry report, or "internal error" designation.
			var nie *notInternalError
			if errors.As(err, &se) || errors.As(err, &nie) {

nit: Why use errors.As here instead of errors.Is or errors.IsAny?

michae2 · 2024-04-30T17:08:12Z

pkg/sql/colexecerror/error_test.go

+				b.Run(tc.name, func(b *testing.B) {
+					// Create as many warm connections as we will need for the benchmark.
+					conns := make(chan *gosql.DB, numConns)
+					for range numConns {


Note to self: this will make backport difficult.

rafiss

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, @michae2, and @petermattis)

pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Why use errors.As here instead of errors.Is or errors.IsAny?

errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.

errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.

in this case, since there is no "singleton" notInternalError, we need to use As.

mgartner · 2024-04-30T23:11:42Z

pkg/sql/colexecerror/error.go line 48 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

errors.Is[Any] requires the error (or any error in the cause chain) to exactly equal a reference error.

errors.As checks if the error (or any error in the cause chain) is assignable to the value pointed at by the target.

in this case, since there is no "singleton" notInternalError, we need to use As.

Thanks for the explanation Rafi!

michae2

Thanks for all the comments! I will push an update tonight.

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

+1 - this seems like an easy extension of the current improvement. IIUC multiple nested catchers significantly exacerbated the problem we saw in the customer environment, and although we now have fast-paths for majority of errors, it'd be great to only inspect the stack once, regardless of the number of catches in it.

Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?

yuzefovich

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Nice idea! @yuzefovich I think I might reuse notInternalError for this, do you see any problems with that?

Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.

michae2

TFTRs!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @mgartner, @petermattis, @rafiss, and @yuzefovich)

-- commits line 2 at r4:

Previously, rafiss (Rafi Shamim) wrote…

could you include before/after results of this benchmark in the PR description?

helpful incantation:
N=10 BENCHTIMEOUT=24h PKG=./pkg/sql/colexecerror BENCHES=BenchmarkCatchVectorizedRuntimeError ./scripts/bench 'old-sha' 'new-sha'

I'll kick off a run on a gceworker and share tomorrow.

pkg/sql/colexecerror/error.go line 109 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why do we need this addition (it kinda duplicates sqlColPackagesPrefix)? Because we strip the prefix when running tests via bazel? Consider leaving a comment.

Yes, exactly. Added a comment and I asked about it in slack.

pkg/sql/colexecerror/error.go line 78 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: missing spaces after the slashes in the third commit.

Done.

pkg/sql/colexecerror/error.go line 243 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we need to register the decoder for both internalError and notInternalError.

Oh, good catch! Done.

pkg/sql/colexecerror/error.go line 246 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this comment needs a minor adjustment.

Done.

pkg/sql/colexecerror/error.go line 0 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

(Reviewable was unable to map this GitHub inline comment thread to the right spot — sorry!)

super nit: the comment should be updated so mention the error wrapping instead of "simply panicking."

Done.

pkg/sql/colexecerror/error.go line 111 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Thinking a bit more about this, I agree that these two checks don't add that much value, so we could remove them. If callsite for panics in Go runtime ever changes, we'll easily catch the change via CI. So I'd be in favor of removing these two ifs and simply re-panicking whenever we don't find the panic line in the stack trace.

I removed these two checks.

pkg/sql/colexecerror/error.go line 113 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yes, good point.AssertionFailedf includes the stack trace (that's why below we only call errors.NewAssertionErrorWithWrappedErrf when we do want to create an error that would include the stack trace), so I think we can remove two calls to debug.Stack() and rely on the assertion's behavior. (Given my comment above, I don't think it's actually possible to hit this code path right now, so there is no way to check for stack trace duplication.)

Removed these two checks.

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Reusing notInternalError would lead to a behavior change. Namely, we now won't be able to tell the difference between an expected error within vectorized engine (that should propagated as an error, without stack trace) and an unexpected error outside of the vectorized engine (which shouldn't be caught and should be propagated via panic further up). I'd introduce a new error type like Drew suggested.

I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:

cockroach/pkg/sql/colexec/columnarizer.go

Line 243 in af3173a

colexecerror.ExpectedError(meta.Err)

pkg/sql/sem/builtins/builtins.go line 5645 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: rather than introducing a new builtin, should we introduce an overload to existing crdb_internal.force_panic builtin where an optional second boolean argument would indicate whether the panic should be catchable by vectorized engine or not?

Good call. I added an override with a couple more options.

pkg/sql/colexecerror/error_test.go line 203 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Note to self: this will make backport difficult.

Done.

pkg/sql/colexecerror/main_test.go line 42 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we don't need most of the initialization in this file. I think it can be as short as
func TestMain(m *testing.M) {
	securityassets.SetLoader(securitytest.EmbeddedAssets)
	randutil.SeedForTests()
	serverutils.InitTestServerFactory(server.TestServerFactory)

	os.Exit(m.Run())
}

Thank you! Done.

yuzefovich

Reviewed 6 of 6 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, @petermattis, and @rafiss)

pkg/sql/colexecerror/error.go line 130 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I tried this out, but strangely it seemed to make things slower. My guess is that we're mostly re-wrapping with materializers and columnarizers, and it looks like we already wrap with notInternalError in columnarizer:

cockroach/pkg/sql/colexec/columnarizer.go

Line 243 in af3173a

colexecerror.ExpectedError(meta.Err)

Hm, we might be thinking about this differently. The idea is that in the fall back case, when we had to look at the stack via runtime.CallersFrames (because the panic should be caught by vec engine but wasn't produced via one of colexecerror.*Error calls), we will wrap the error with a special marker alreadyCaughtError so that the next catcher up the stack didn't have to inspect the stack (i.e. we would add another special error type to the hot path at the top of the method). This shouldn't have any influence for columnarizer-materializer pair since they already use colexecerror methods that wrap errors with different markers. Does this match your thinking?

That said, this would be an improvement to an edge case, so I'd be ok with leaving a TODO for it.

Earlier this year we made the vectorized panic-catcher much more efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server. Release note: None

133620: colexecerror: improve the catcher due to a recent regression r=yuzefovich a=yuzefovich Earlier this year we made the vectorized panic-catcher much more efficient (in #123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server. Fixes: #133617. Release note: None Co-authored-by: Yahor Yuzefovich <[email protected]>

Earlier this year we made the vectorized panic-catcher much more efficient (in #123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server. Release note: None

Earlier this year we made the vectorized panic-catcher much more efficient (in cockroachdb#123277) by switching from using `debug.Stack()` to `runtime.CallersFrames`. It appears that there is slight difference in the behavior between the two: the former omits frames from within the runtime (only a single frame for the panic itself is included) whereas the latter keeps the additional runtime frames. As a result, if a panic occurs due to a Go runtime internal violation (e.g. invalid interface assertion) it is no longer caught to be converted into an internal CRDB error and now crashes the server. This commit fixes this regression by skipping over the frames that belong to the Go runtime. Note that we will do so only for up to 5 frames within the runtime, so if there happens to be more deeply-nested panic there, we'll still crash the CRDB server. Release note: None

michae2 requested review from DrewKimball, rafiss and yuzefovich April 30, 2024 06:59

michae2 requested review from a team as code owners April 30, 2024 06:59

DrewKimball approved these changes Apr 30, 2024

View reviewed changes

petermattis reviewed Apr 30, 2024

View reviewed changes

rafiss reviewed Apr 30, 2024

View reviewed changes

yuzefovich requested review from DrewKimball, petermattis and rafiss April 30, 2024 16:36

yuzefovich reviewed Apr 30, 2024

View reviewed changes

petermattis requested a review from yuzefovich April 30, 2024 16:42

petermattis reviewed Apr 30, 2024

View reviewed changes

yuzefovich reviewed Apr 30, 2024

View reviewed changes

mgartner requested a review from petermattis April 30, 2024 16:49

mgartner approved these changes Apr 30, 2024

View reviewed changes

michae2 commented Apr 30, 2024

View reviewed changes

rafiss reviewed Apr 30, 2024

View reviewed changes

michae2 commented May 1, 2024

View reviewed changes

yuzefovich requested a review from rafiss May 1, 2024 00:28

yuzefovich reviewed May 1, 2024

View reviewed changes

michae2 force-pushed the catch branch from 17b9864 to fd5f898 Compare May 1, 2024 09:19

michae2 commented May 1, 2024

View reviewed changes

yuzefovich reviewed May 1, 2024

View reviewed changes

blathers-crl bot mentioned this pull request May 2, 2024

release-24.1.0-rc: colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123501

Merged

This was referenced May 2, 2024

release-23.2: colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123502

Merged

release-23.1: colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123503

Merged

michae2 deleted the catch branch May 2, 2024 19:21

michae2 mentioned this pull request May 21, 2024

pkg/sql/colexecerror: BenchmarkSQLCatchVectorizedRuntimeError fails with error #124021

Closed

yuzefovich mentioned this pull request Oct 28, 2024

colexecerror: some runtime panics are no longer recovered from #133617

Closed

yuzefovich mentioned this pull request Oct 28, 2024

colexecerror: improve the catcher due to a recent regression #133620

Merged

blathers-crl bot mentioned this pull request Oct 29, 2024

release-23.1: colexecerror: improve the catcher due to a recent regression #133634

Merged

blathers-crl bot mentioned this pull request Oct 29, 2024

release-23.2: colexecerror: improve the catcher due to a recent regression #133635

Merged

blathers-crl bot mentioned this pull request Oct 29, 2024

release-24.1: colexecerror: improve the catcher due to a recent regression #133636

Merged

blathers-crl bot mentioned this pull request Oct 29, 2024

release-24.2: colexecerror: improve the catcher due to a recent regression #133637

Merged

blathers-crl bot mentioned this pull request Oct 29, 2024

release-24.3: colexecerror: improve the catcher due to a recent regression #133638

Merged

mohini-crl mentioned this pull request Dec 19, 2024

[Replicated] release-23.1: colexecerror: improve the catcher due to a recent regression mohini-crl/cockroach#35

Merged

mohini-crl mentioned this pull request Jan 5, 2025

[Replicated] release-23.1: colexecerror: improve the catcher due to a recent regression mohini-crl/cockroach#129

Merged

mohini-crl mentioned this pull request Jan 5, 2025

[Replicated] release-23.1: colexecerror: improve the catcher due to a recent regression mohini-crl/cockroach#139

Merged

colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277

colexecerror: avoid debug.Stack in CatchVectorizedRuntimeError #123277

Uh oh!

Conversation

michae2 commented Apr 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Apr 30, 2024

Uh oh!

michae2 commented Apr 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

DrewKimball left a comment

Choose a reason for hiding this comment

Uh oh!

petermattis left a comment

Choose a reason for hiding this comment

Uh oh!

rafiss Apr 30, 2024

Choose a reason for hiding this comment

Uh oh!

rafiss left a comment

Choose a reason for hiding this comment

Uh oh!

rafiss left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

petermattis left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 Apr 30, 2024

Choose a reason for hiding this comment

Uh oh!

rafiss left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner commented Apr 30, 2024

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

michae2 left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

michae2 commented Apr 30, 2024 •

edited

Loading

michae2 commented Apr 30, 2024 •

edited

Loading