I happened to notice that extrapolation is quite a lot slower than interpolation:
This slowness doesn't seem inevitable, because the amount of "work" (computation) performed by extrapolation is quite minor compared to the work done by interpolation:
Interpolation also involves a whole second round of `clamp`s and throws in a bunch of `floor`/`round` and arithmetic operations to boot (16 multiplies and 16 adds for linear interpolation in 3 dimensions). So one might suspect this could be fixed through a careful look at the generated code.

The first commit here is almost trivial: it adds forced inlining to circumvent the splatting penalty. This leads to an approximately 20ns reduction in this test case.

The second commit is more complicated. `size` and `indices`, when called with a dimension argument, involve a branch that looks something like this:

Our generated code used these dimension-specific constructs and triggered two branches per dimension (one for `lbound` and one for `ubound`), resulting in 6 branches total for this example. Note that such branches are not present if you start from `sz = size(A)`, so I changed our generated code to use this style (with `indices` rather than `size`). This gave another approximate 20ns boost, so that the result with this PR is

which is a much more reasonable overhead compared to interpolation.
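To make the difference between the two styles concrete, here is a minimal sketch. It is not the code from this PR: `size_dim`, `bounds_per_dim`, and `bounds_from_tuple` are hypothetical names made up for illustration, and the sketch uses `size` rather than `indices` (which is called `axes` in current Julia). The `d <= N` check in `size_dim` is assumed to resemble the branch discussed above; whether it survives in the compiled code depends on the Julia version and on constant propagation.

```julia
# Hypothetical stand-in for a dimension-specific lookup such as size(A, d).
# The `d <= N` check is the branch paid on every call (an assumption about
# the shape of the real method, not a copy of it).
size_dim(A::AbstractArray{T,N}, d::Integer) where {T,N} =
    d <= N ? size(A)[d] : 1

# Style 1: one dimension-specific call per dimension, so one check each.
function bounds_per_dim(A::AbstractArray{T,3}) where {T}
    return (size_dim(A, 1), size_dim(A, 2), size_dim(A, 3))
end

# Style 2: fetch the whole size tuple once, then index it with constants.
# No dimension check is needed because the tuple length is known from N.
function bounds_from_tuple(A::AbstractArray{T,3}) where {T}
    sz = size(A)
    return (sz[1], sz[2], sz[3])
end

A = rand(3, 4, 5)
```

Both functions return the same tuple for a 3-dimensional array; the second simply avoids asking "is `d` a valid dimension?" three separate times. Comparing `@code_llvm bounds_per_dim(A)` with `@code_llvm bounds_from_tuple(A)` is one way to check whether the branches are actually present on a given Julia version.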