Skip to content

Conversation

@willtebbutt
Copy link
Member

@willtebbutt willtebbutt commented Feb 1, 2022

Summary

I found a lovely pathological Zygote-related performance problem.

Proposed changes

Optimisations for kernelmatrix for the ConstantKernel

What alternatives have you considered?

The only other option is to improve Zygote's handling of map. Unfortunately I can't see Zygote improving here in the immediate future (I certainly don't have time to look into it right now).

Breaking changes

None

Performance

Here's a demo of how awful the performance is at the minute vs this PR:

using BenchmarkTools
using KernelFunctions
using Zygote

function foo(c, x, y)
    return kernelmatrix(ConstantKernel(c=c), x, y)
end

c = 1.0
x = randn(1_000)
y = randn(1_500)

@benchmark foo($c, $x, $y)
@benchmark Zygote.pullback(foo, $c, $x, $y)

out, pb = Zygote.pullback(foo, c, x, y)
@benchmark $pb($out)

# Without PR

# Primal:
BenchmarkTools.Trial: 956 samples with 1 evaluation.
 Range (min  max):  4.362 ms    7.830 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     5.087 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.222 ms ± 515.433 μs  ┊ GC (mean ± σ):  6.09% ± 9.28%

       ▂           █▆▆▁▁
  ▂▂▃▄▆██▅▅▅▃▃▃▃▄▄▄█████▅▅▄▄▄▄▃▁▂▂▂▃▃▃▄▅▇▇▆▆▅▄▄▃▃▃▃▂▂▃▃▂▁▂▁▂▂ ▃
  4.36 ms         Histogram: frequency by time        6.56 ms <

 Memory estimate: 12.89 MiB, allocs estimate: 14.

# Forwards-pass:
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.314 s (51.96% GC) to evaluate,
 with a memory estimate of 4.51 GiB, over 97500049 allocations.

# Reverse-pass:
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 27.904 s (11.51% GC) to evaluate,
 with a memory estimate of 2.50 GiB, over 96000081 allocations.


# With PR:

# Primal
BenchmarkTools.Trial: 1449 samples with 1 evaluation.
 Range (min  max):  1.396 ms  11.263 ms  ┊ GC (min  max):  0.00%  67.42%
 Time  (median):     2.212 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.437 ms ±  2.456 ms  ┊ GC (mean ± σ):  43.52% ± 33.98%

       ▅█
  ██▅▃▃███▄▃▂▂▂▂▂▁▁▁▂▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▆▅▅▃▃▃▂▂▂▂▂ ▃
  1.4 ms         Histogram: frequency by time        8.73 ms <

 Memory estimate: 11.44 MiB, allocs estimate: 6.

# Forwards-pass
BenchmarkTools.Trial: 1285 samples with 1 evaluation.
 Range (min  max):  1.732 ms  10.228 ms  ┊ GC (min  max):  0.00%  79.74%
 Time  (median):     2.147 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.880 ms ±  3.025 ms  ┊ GC (mean ± σ):  44.58% ± 32.92%

    █▆▃▁                                            ▂▃▃▂▁▁
  ▇█████▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▄▄▆███████▇ █
  1.73 ms      Histogram: log(frequency) by time     9.79 ms <

 Memory estimate: 11.45 MiB, allocs estimate: 98.

# Reverse-pass
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  23.841 μs  164.260 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     24.554 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.567 μs ±   7.206 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▃▂▂▂▃▂▂ ▂▁▁ ▁ ▁▂ ▁▁▃▄▂▁                                   ▂
  ██████████████████████████▇▇▆▅▅▅▄▅▅▅▅▅▄▄▄▁▄▅▃▁▄▄▄▁▄▅▃▁▃▄▄▅▃▅ █
  23.8 μs       Histogram: log(frequency) by time      48.6 μs <

 Memory estimate: 2.61 KiB, allocs estimate: 107.

You get similar results for the two-argument method of kernelmatrix.

@willtebbutt willtebbutt changed the title Wct/constant kernel performance ConstantKernel Performance Feb 1, 2022
Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency I think it would be good to change kappa to only(k.c) (IMO it is more reasonable to not depend on the type of the input). Can we test the performance in some reliable way to avoid regressions?

I'm happy with the PR otherwise, so I'll approve regardless of these two points.

@theogf
Copy link
Member

theogf commented Feb 1, 2022

Can we test the performance in some reliable way to avoid regressions?

You would need to add tests on the benchmark.jl in master first.

@willtebbutt
Copy link
Member Author

You would need to add tests on the benchmark.jl in master first.

Do I literally just need to add the ConstantKernel to line 19 of benchmarks.jl? Also, how is it that I trigger the benchmark suite?

@willtebbutt
Copy link
Member Author

willtebbutt commented Feb 1, 2022

Or are the AD-related benchmarking tests not on master yet? They don't appear to use Zygote.

edit: sorry, I thought we had AD-related performance checks somewhere.

@devmotion @theogf would either of you mind if I add them in this PR? I guess we might as well start tracking this stuff.

@codecov
Copy link

codecov bot commented Feb 1, 2022

Codecov Report

Merging #432 (0de2c43) into master (cbbf1d4) will decrease coverage by 0.14%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #432      +/-   ##
==========================================
- Coverage   93.14%   92.99%   -0.15%     
==========================================
  Files          52       52              
  Lines        1210     1214       +4     
==========================================
+ Hits         1127     1129       +2     
- Misses         83       85       +2     
Impacted Files Coverage Δ
src/basekernels/constant.jl 100.00% <100.00%> (ø)
src/chainrules.jl 85.18% <0.00%> (-2.47%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbbf1d4...0de2c43. Read the comment docs.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@theogf
Copy link
Member

theogf commented Feb 1, 2022

would either of you mind if I add them in this PR? I guess we might as well start tracking this stuff.

Sure but I am not sure if it works if you change the benchmark file in the PR before in the doing it in master

@willtebbutt
Copy link
Member Author

Ahh okay. I'll merge as-is then, and address benchmarking later.

@theogf
Copy link
Member

theogf commented Feb 1, 2022

Docs fail because of

It looks one cannot add I to a FillMatrix

[EDIT:] Using Eye instead would work but I don't think that we want that...

@willtebbutt
Copy link
Member Author

Ahh okay. I'll just go with fill for now then -- it's still a big win.

@theogf
Copy link
Member

theogf commented Feb 1, 2022

I opened an issue in FillArrays JuliaArrays/FillArrays.jl#170

@devmotion
Copy link
Member

Ah, that's an unfortunate upstream bug it seems (with an easy fix probably?). I guess we could just use a full matrix or Eye in the example.

@devmotion
Copy link
Member

It seems a bit unfortunate to not exploit the performance benefits, only because it breaks this example.

@willtebbutt
Copy link
Member Author

I agree that it's a shame, but I'm still inclined to merge when CI passes, unless there are strong objections. We can always improve once stuff is fixed in FillArrays.

@theogf
Copy link
Member

theogf commented Feb 1, 2022

I made an issue to remember.

@devmotion
Copy link
Member

No strong objections, just a bit sad 😢 I'm looking forward to changing it to Fill (maybe open an issue?), it seems it would be a drastic improvement (2ms vs 37ns in the foo benchmark, and 2ms vs 5μs in Zygote.pullback) 🙂

@willtebbutt willtebbutt merged commit 1a3fa33 into master Feb 1, 2022
@willtebbutt willtebbutt deleted the wct/constant-kernel-performance branch February 1, 2022 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants