[ENH] GPU Rocket 2.0: CuPy Backend, Exact CPU Parity (<1e-7), 100x Speedup, Full Multivariate Support #3211
Conversation
Thanks for flagging issue #3212 @MatthewMiddlehurst. I've analyzed this against my CuPy implementation, and it won't negatively impact the GPU variant. Since ROCKETGPU delegates kernel generation to the CPU Rocket class to ensure parity, fixing the RNG handling here will automatically propagate the correct behavior to the GPU version without requiring kernel changes. Just wanted to bring it to your notice.
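The delegation described above could look roughly like this (class and method names are hypothetical stand-ins, not aeon's actual API): because the GPU class reuses the CPU generator, any RNG fix on the CPU side propagates automatically.

```python
import numpy as np

class CPURocket:
    """Stand-in for aeon's CPU Rocket kernel generator (hypothetical names)."""

    def __init__(self, n_kernels, random_state=None):
        self.n_kernels = n_kernels
        self.random_state = random_state

    def generate_kernels(self):
        # Any fix to the RNG handling here is the single source of truth.
        rng = np.random.default_rng(self.random_state)
        lengths = rng.choice([7, 9, 11], size=self.n_kernels)
        return [rng.standard_normal(length) for length in lengths]

class GPURocket:
    """Delegates kernel generation to the CPU class for parity; only the
    convolution itself would run on the GPU."""

    def __init__(self, n_kernels=84, random_state=None):
        self._cpu = CPURocket(n_kernels, random_state)

    def generate_kernels(self):
        return self._cpu.generate_kernels()
```

With the same seed, both classes produce identical kernels, which is the parity guarantee being relied on.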
MatthewMiddlehurst left a comment:
Leaning towards rejecting this. Don't think it is what we want in terms of an implementation. We have lost a lot of parameters. The whole kernel thing is a bit weird. Testing is not like anything we have in the codebase.
If it was cleaner I would not be against adding the dependency, but this is not it IMO. Not a review, if any other developer feels like it is worth taking this on go ahead otherwise it will likely be closed.
Hey @MatthewMiddlehurst, I understand the concerns you've raised about the implementation. This still seems like significant progress, so if you can provide guidance on what the 'right' approach looks like, we can definitely produce a better, more standard implementation. Thanks for considering this.
In the meantime I was trying to optimise the other members of the rocket family with their GPU implementations and see how CuPy fares there. I have some benchmarking results from the MiniRocket GPU implementation as well, which I finished about 10 minutes ago. I still think that CuPy is the way forward if we want a GPU implementation in aeon. The effort of adding the entire structure is expected, and I'm not seeing this as a one-off PR either; I understand that this will need major work, but that too seems essential for our progress.
Ignoring the testing and the changes unrelated to the issue, which I'm not sure why they have been implemented like this: I do not want to start maintaining CUDA code and add dependencies for one second of speed-up. Your images seem to plot the same things multiple times?
Each graph set plots the speedup between the CPU version and the GPU version for different channel counts (1, 2, ..., 6), so ranging from univariate to multivariate. The reason I thought this was of immense importance is that in a typical research workflow, where someone runs ROCKET on 100 different parameter configurations or datasets, it would make the difference between 2 hours and 5 minutes of run time. I understand the concern about CUDA maintenance and dependencies, though. If the maintenance burden outweighs the performance benefit for aeon's use cases, I completely get that too.
I agree with Matthew on maintaining CUDA kernels; I don't think we want to get into that level of complexity. One alternative, if we want a GPU implementation (which I think is a good thing), would be to use CuPy's native operations (i.e. cupy.convolve etc.) so that the code stays close to a NumPy implementation. That is the whole point of CuPy: to be a drop-in replacement for NumPy/SciPy.
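To sketch what this suggestion could look like (the helper name, dilation handling, and feature choice are mine, not from this PR): code written against the NumPy API runs unchanged on CuPy arrays, since cupy.convolve mirrors numpy.convolve.

```python
import numpy as np

try:  # CuPy mirrors the NumPy API; fall back to NumPy on CPU-only machines
    import cupy as xp
    xp.zeros(1)  # fail fast if CuPy is installed but no GPU is usable
except Exception:
    xp = np

def rocket_kernel_features(x, weights, bias, dilation):
    """Apply one ROCKET-style kernel and return (ppv, max) features.

    Hypothetical helper: dilation is emulated by spacing the weights with
    zeros, and the kernel is reversed because convolve flips its second
    argument (ROCKET uses cross-correlation, not convolution).
    """
    weights = xp.asarray(weights)
    n = len(weights)
    dilated = xp.zeros(n + (n - 1) * (dilation - 1))
    dilated[::dilation] = weights
    conv = xp.convolve(xp.asarray(x), dilated[::-1], mode="valid") + bias
    return float((conv > 0).mean()), float(conv.max())
```

Because only `xp` changes between backends, the same function serves as both the CPU reference and the GPU path, which is the maintainability argument for avoiding hand-written CUDA kernels.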
Putting my comment here as well: the only change I am happy to take is to use the same kernel generation function as the CPU implementation, but keep using TensorFlow for the convolution.
Sorry, the labels for the latest plot all say number of cases up to 10,000, so it is a little confusing if it is altering other metrics. Generally, yes, there will be multiple runs, and speed-ups are good, but this as it stands is a bit of a maintainability liability, to be honest. In real terms ROCKET is a quick algorithm already, and the speed-up is not worth the potential issues.
Understandable hiccups. I'll try to list all the options available to us as of our current standing:
1) a CuPy implementation using convolve (without a custom kernel), as suggested by @baraline;
2) the previous TF implementation with the kernel-parity changes (#3177, which I can reopen for review), as favoured by @hadifawaz1999;
3) the CuPy custom kernel from this PR.
With a final weigh-in of opinions and discussion we can decide whether to abandon the CuPy approach entirely, try option 1, or revert to the TF implementation.
As much as I love CuPy, I agree with Ali too, if we can do it with tensorflow we should, it keeps the GPU framework we use consistent across aeon. |
"Parity: supports univariate parity (MAE < 1e-5), does not support multivariate parity (MAE 1.61-648.18 due to different convolution algorithms at the transform stage)" Not really sure how this works, why is it different? Either way this does not seem to be the approach we want, so closing.
TF conv1d uses matrix ops that accumulate in parallel. For univariate input there is only one multiplication per output, so the accumulation order does not matter; with multivariate input the per-channel products are summed in a different order, and since floating-point addition is not associative the results diverge. Totally able to understand the concerns though, I'll get on with improving the TF implementation :)
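The non-associativity point can be shown in a couple of lines (illustration only, not aeon or TF code):

```python
import numpy as np

# float32 addition is not associative: summing the same terms in a
# different order gives a different result. A univariate kernel has no
# cross-channel sum, so backends agree; a multivariate kernel accumulates
# channel products in backend-specific orders and can diverge.
terms = np.array([1e8, 1.0, -1e8], dtype=np.float32)

left_to_right = (terms[0] + terms[1]) + terms[2]  # 1e8 + 1 rounds to 1e8, result 0.0
reordered = (terms[0] + terms[2]) + terms[1]      # cancellation first, result 1.0
print(left_to_right, reordered)
```

This is why an MAE gap can appear at the transform stage even when both backends use exactly the same kernels.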



Reference Issues/PRs
Supersedes #3177
Relates to #313
Relates to #1248
What does this implement/fix? Explain your changes.
This pull request implements a high-performance, GPU-accelerated backend for the Rocket transformer using CuPy. The implementation achieves strict numerical parity with the CPU baseline and introduces full support for multivariate time series, resolving the architectural limitations and accuracy issues identified in the previous TensorFlow-based approach (PR #3177).
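Concretely, the parity criterion in the title (MAE < 1e-7 against the CPU baseline) amounts to a check like the following (helper name is mine, not the PR's code):

```python
import numpy as np

def parity_mae(cpu_features, gpu_features):
    """Mean absolute error between CPU and GPU transform outputs.

    Hypothetical helper illustrating the parity criterion quoted in this
    PR; GPU results would be copied back to host memory before comparing.
    """
    cpu = np.asarray(cpu_features, dtype=np.float64)
    gpu = np.asarray(gpu_features, dtype=np.float64)
    return float(np.mean(np.abs(cpu - gpu)))

# parity holds when parity_mae(cpu_out, gpu_out) < 1e-7
```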
Key Improvements
Adds cupy as a soft, optional dependency, avoiding the substantial package overhead associated with TensorFlow and reducing bloat for users.
Benchmarks & Validation
The following benchmarks demonstrate the performance gains and correctness verification.
(GPU: NVIDIA GTX 1650 Ti, CPU: Intel i7)

(Speed-ups and break-even points will be better on faster GPUs such as an RTX 4090.)
(Note: legacy TF refers to the original TF implementation from #1199)
1. Performance: CuPy vs. CPU
Detailed speedup analysis across varying dataset sizes.
Analysis: The CuPy implementation demonstrates a non-linear performance gain as dataset dimensions increase. For standard time-series lengths (5,000 - 10,000), the GPU backend achieves a speedup factor of approximately 14.7x compared to the CPU implementation. The break-even point occurs at approximately 1,000 timepoints, making this backend highly efficient for medium-to-large scale datasets.
2. Performance: CuPy vs. TensorFlow (Legacy)

Comparison against the previous GPU prototype.
Analysis: The architectural shift to CuPy yields massive performance gains over the TensorFlow legacy code. At a time series length of 10,000, the CuPy implementation is 101.5x faster than the TensorFlow version. This also confirms that the overhead of the TensorFlow graph execution was the primary bottleneck in the previous iteration.
3. Deviation Analysis & Numerical Parity
Verification of numerical stability and equivalence with CPU kernels.
Analysis: I conducted an extensive audit across 15 datasets (6 Univariate, 9 Multivariate) to compare the Mean Absolute Error (MAE) against the CPU baseline. The results below demonstrate that the CuPy implementation is statistically indistinguishable from the CPU version, whereas the Legacy TF version failed to maintain scientific precision.
Detailed Dataset Breakdown
Univariate Datasets
Multivariate Datasets
Architectural Justification: Transitioning from TensorFlow to CuPy
Following the development of the TensorFlow-based implementation (PR #3177), extensive testing identified two critical blockers that necessitated a pivot to CuPy:
The CuPy Advantage
Switching to CuPy allowed for direct, CUDA-like kernel definitions within Python. This enabled a direct replication of the
aeon CPU logic, successfully solving the multivariate issue while maintaining a lightweight dependency profile.
Future Development
This architecture establishes a robust foundation for a "GPU-First" convolution module. This framework will facilitate the rapid porting of:
and subsequently other extremely important algorithms like HIVE-COTE and Shapelets.
Does your contribution introduce a new dependency? If yes, which one?
Yes,
cupy. It is implemented strictly as an optional/soft dependency. The GPU backend is only initialized if explicitly requested by the user, ensuring no additional overhead for standard CPU-only workflows.
Any other comments?
Hardware version
NVIDIA GTX 1650 Ti / Intel i7 / 16 GB RAM
Reviewers
cc: @hadifawaz1999 @TonyBagnall @MatthewMiddlehurst
PR checklist
For all contributions
For new estimators and functions
__maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access