[ENH] GPU Rocket 2.0: CuPy Backend ,Exact CPU Parity (<1e-7), 100x Speedup, Full Multivariate Support by Adityakushwaha2006 · Pull Request #3211 · aeon-toolkit/aeon

Adityakushwaha2006 · 2025-12-29T09:47:55Z

Reference Issues/PRs

Supersedes #3177
Relates to #313
Relates to #1248

What does this implement/fix? Explain your changes.

This pull request implements a high-performance, GPU-accelerated backend for the Rocket transformer utilizing CuPy.

This implementation achieves strict numerical parity with the CPU baseline and introduces full support for multivariate time series, resolving the architectural limitations and accuracy issues identified in the previous TensorFlow-based approach (PR #3177).

Key Improvements

Performance: Delivers significant acceleration over both the standard CPU implementation (up to 14x) and the Legacy TensorFlow prototype (up to 100x).
Numerical Accuracy: Guarantees numerical equivalence to the CPU implementation. Benchmarks confirm that while the Legacy TF version suffered from massive deviations (MAE > 600), the CuPy implementation maintains strict parity (MAE < 1e-7).
Multivariate Support: Natively handles multivariate input data, addressing a critical functional gap in the previous TensorFlow prototype.
Minimal Footprint: Integrates cupy as a soft, optional dependency, avoiding the substantial package overhead associated with TensorFlow, reducing excessive bloat for the users.

Benchmarks & Validation

The following benchmarks demonstrate the performance gains and correctness verification.

(GPU: 1650 ti, CPU: i7)
(Speed and breakeven points will be better on better GPUs like RTX 4090)
(Note: legacy TF refers to the original TF implementation from #1199)
1. Performance: CuPy vs. CPU
Detailed speedup analysis across varying dataset sizes.

Analysis: The CuPy implementation demonstrates a non-linear performance gain as dataset dimensions increase. For standard time-series lengths (5,000 - 10,000), the GPU backend achieves a speedup factor of approximately 14.7x compared to the CPU implementation. The break-even point occurs at approximately 1,000 timepoints, making this backend highly efficient for medium-to-large scale datasets.

2. Performance: CuPy vs. TensorFlow (Legacy)
Comparison against the previous GPU prototype.

Analysis: The architectural shift to CuPy yields massive performance gains over the TensorFlow legacy code. At a time series length of 10,000, the CuPy implementation is 101.5x faster than the TensorFlow version. This also confirms that the overhead of the TensorFlow graph execution was the primary bottleneck in the previous iteration.

3. Deviation Analysis & Numerical Parity
Verification of numerical stability and equivalence with CPU kernels.

Analysis: I conducted an extensive audit across 15 datasets (6 Univariate, 9 Multivariate) to compare the Mean Absolute Error (MAE) against the CPU baseline. The results below demonstrate that the CuPy implementation is statistically indistinguishable from the CPU version, whereas the Legacy TF version failed to maintain scientific precision.

Metric	Legacy TF (GPU)	CuPy (New GPU)
Best MAE	1.33	6.28e-08
Worst MAE	648.18	3.02e-05
Datasets Passing Parity (< 1e-5)	0 / 15 (0%)	15 / 15 (100%)

Detailed Dataset Breakdown

Univariate Datasets

Dataset	Legacy TF MAE	CuPy MAE
GunPoint	1.40	6.30e-08
ItalyPowerDemand	1.34	6.32e-08
Chinatown	648.18	3.02e-05
TwoLeadECG	1.53	8.54e-08
ECG200	1.42	8.31e-08
Coffee	1.33	6.28e-08

Multivariate Datasets

Dataset	Legacy TF MAE	CuPy MAE
BasicMotions	2.29	1.41e-07
Epilepsy	1.89	1.12e-07
NATOPS	3.61	1.50e-07
ERing	2.11	1.44e-07
Handwriting	3.30	2.67e-07
ArticularyWord	2.42	1.67e-07
RacketSports	3.20	1.97e-06
StandWalkJump	3.64	1.77e-07
UWaveGesture	1.61	1.12e-07

Architectural Justification: Transitioning from TensorFlow to CuPy

Following the development of the TensorFlow-based implementation (PR #3177), extensive testing identified two critical blockers that necessitated a pivot to CuPy:

Multivariate Complexity: TensorFlow lacked the flexibility required for raw kernel manipulations in multivariate scenarios, leading to inefficient workarounds and the high error rates observed in the table above and the general functioning.
Dependency Overhead: The inclusion of TensorFlow as a dependency introduced disproportionate bloat relative to the performance gains for smaller datasets.

The CuPy Advantage
Switching to CuPy allowed for direct, CUDA-like kernel definitions within Python. This enabled a direct replication of the aeon CPU logic, successfully solving the multivariate issue while maintaining a lightweight dependency profile.

Future Development

This architecture establishes a robust foundation for a "GPU-First" convolution module. This framework will facilitate the rapid porting of:

MiniRocket
MultiRocket
Hydra
and subsequently other extremely important algorithms like HIVE-COTE and Shapelets.

Does your contribution introduce a new dependency? If yes, which one?

Yes, cupy.

It is implemented strictly as an optional/soft dependency. The GPU backend is only initialized if explicitly requested by the user, ensuring no additional overhead for standard CPU-only workflows.

Any other comments?

Hardware version
NVIDIA Gtx 1650 ti / i7 / 16GB RAM

Reviewers
cc: @hadifawaz1999 @TonyBagnall @MatthewMiddlehurst

PR checklist

For all contributions

I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you after the PR has been merged.
The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.

For new estimators and functions

I've added the estimator/function to the online API documentation.
(OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.

For developers with write access

(OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

aeon-actions-bot · 2025-12-29T09:48:14Z

Thank you for contributing to `aeon`

I have added the following labels to this PR based on the title: [ enhancement ].
This PR changes too many different packages (>3) for automatic addition of labels, please manually add package labels if relevant.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

Run pre-commit checks for all files
Run mypy typecheck tests
Run all pytest tests and configurations
Run all notebook example tests
Run numba-disabled codecov tests
Stop automatic pre-commit fixes (always disabled for drafts)
Disable numba cache loading
Regenerate expected results for testing
Push an empty commit to re-run CI checks

Adityakushwaha2006 · 2025-12-29T23:25:18Z

Thanks for flagging issue #3212 @MatthewMiddlehurst . I've analyzed this against my CuPy implementation, and it won't negatively impact the GPU variant. Since ROCKETGPU delegates kernel generation to the CPU Rocket class to ensure parity, fixing the RNG handling here will automatically propagate the correct behavior to the GPU version without requiring kernel changes.

Just wanted to bring to your notice.

MatthewMiddlehurst

Leaning towards rejecting this. Don't think it is what we want in terms of an implementation. We have lost a lot of parameters. The whole kernel thing is a bit weird. Testing is not like anything we have in the codebase.

If it was cleaner I would not be against adding the dependency, but this is not it IMO. Not a review, if any other developer feels like it is worth taking this on go ahead otherwise it will likely be closed.

Adityakushwaha2006 · 2026-01-02T17:15:25Z

Hey @MatthewMiddlehurst, I understand the concerns youve raised about the implementation.
I want to make sure Im understanding the direction correctly before doing a major rework. I'm willing to put in the effort to bring this in line with aeons standards.

This seems like significant progress...so if you can provide guidance on what the 'right' approach looks like, we can definitely have a better ,more standard implementation.
If the fundamental direction is incompatible with what aeon needs, please let me know that too so I don't waste your time with extensive revisions that arent needed

Thanks for considering this.

Adityakushwaha2006 · 2026-01-02T17:21:13Z

In the meanwhile i was trying to optimise the other rocket family , with their GPU implementations, and see how CuPY fairs out there, i have some benchmarking results from the minirocket GPU implementation aswell, which i just finished with about 10 minutes back.

Attaching them here:

I do still think that CuPy is the way to go forward if we want a GPU implementation in aeon...the effort in adding the entire structure is expected and im not seeing this as a one off PR either, i do understand that this will need major work, but that too seems essential for our progress.
Let me know : )

MatthewMiddlehurst · 2026-01-02T17:34:04Z

Ignoring the testing and unrelated changes to the issue which I'm not sure why it has been implemented like this, I do not want to start maintaining CUDA code and add in dependencies for 1 second of speed up. Your images seem to plot the same things multiple times?

Adityakushwaha2006 · 2026-01-02T17:49:03Z

Each graph set plots the speedup between the CPU version and the GPU version on different channels (1,2...6) (so ranging from univariate to multivariate)
Just to clarify the speedup, its not about saving 1 second on a single run. Its about 24x faster execution on multivariate data.

As per my understanding why i thought this was of immense importance was because in a typical research workflow where someone would run rocket on 100 different parameter configurations or datasets, that would make the difference between 2 hours vs 5 minutes of run time.

I understand the concern about CUDA maintenance and dependencies though. If the maintenance burden outweighs the performance benefit for aeons use cases, I completely get that too.

baraline · 2026-01-03T19:50:00Z

I agree with Matthew on maintain CUDA kernels, I don't think we want to get into that level of complexity.

One alternative if we want to use GPU implementation (which I think is a good thing) would be to use CuPy native operation (i.e. cupy.convolve etc...) so that the code is close to a numpy implementation, which is the whole point of CuPy, to be a drop in replacement for scipy/numpy.

hadifawaz1999 · 2026-01-04T05:58:41Z

putting my comment here as well

the only change am happy to get is to use the same kermel generation function of the cpu implementation, but keep using tensorflow for the convolution

MatthewMiddlehurst · 2026-01-04T20:37:59Z

Sorry the labels for the latest plot all say number of cases up to 10,000 so it is a little confusing if it is altering other metrics. Generally yes there will be multiple runs, and speed ups are good, but this as it stands is a bit of a maintainability liability to be honest. In real terms ROCKET is a quick algorithm already, and the speed-up is not worth the potential issues.

Adityakushwaha2006 · 2026-01-04T22:40:11Z

Understandable hiccups , i'll try and list all the options available to us as of our current standing.

1) CuPy implementation with convolve (without custom kernel):
Pros: easy to maintain, code similar to CPU, removes TF bloat,easy to understand
Cons: Adds soft dependency
For this approach , i cant say a lot about the speed and parity without testing it first, my understanding would place the speedup to be slightly lesser than the custom kernel method.

2) Previous TF implementation with Kernel Parity changes:
Pros: Less codebase change, already exists in PR #3177
Cons: Incredibly slow (~2-3x speedup only), hard to map CPU to GPU, dependency bloat, equivalence point at larger sizes (compared to CuPy)
Speed: ~2-3x vs CPU
Parity: supports Univariate parity (MAE < 1e-5) , does not support Multivariate parity (MAE 1.61-648.18 due to different convolution algorithms at transform stage)

3) CuPy custom kernel :
Pros: extremely fast (14.7x peak speedup), code similar to CPU except kernel, easy to understand, removes TF bloat , equivalence around ~1k datapoints
Cons: Hard to maintain kernel, adds soft dependency, understandable concerns about mixing with codebase ...all in all does not seem favourable here
Speed: 14.7x peak vs CPU, average 8-12x depending on dataset size
Parity: Supports Univariate parity (MAE < 1e-5, 6/6 datasets pass), supports Multivariate parity (MAE < 1e-5, 9/9 datasets pass)

With a final weigh in of opinions and discussion we can decide if we wish to abandon the CuPy approach in its entirety , or try out the 1st option as suggested by @baraline , or if we wanna revert to the TF implementation opened in #3177 (opening it again for review) as seen favourable by @hadifawaz1999

baraline · 2026-01-05T07:10:58Z

As much as I love CuPy, I agree with Ali too, if we can do it with tensorflow we should, it keeps the GPU framework we use consistent across aeon.
Even a 2x or 3x speedup is already really nice to have, especially if the maintenance cost is low. But yeah we might want to work on the other issues it has in terms of parity

MatthewMiddlehurst · 2026-01-05T16:06:53Z

"Parity: supports Univariate parity (MAE < 1e-5) , does not support Multivariate parity (MAE 1.61-648.18 due to different convolution algorithms at transform stage)"

Not really sure how this works, why is it different?

Either way this does not seem to be the approach we want, so closing.

Adityakushwaha2006 · 2026-01-05T17:00:17Z

TF conv1d uses matrix ops that accumulate in parallel. For univariate , theres only one multiplication so order
doesn't matter. For multivariate, the different accumulation order causes float rounding differences to compound.
Also , TFs convolution algorithm isnt fixed either, so its completely unpredictable

Totally able to understand the concerns tho, ill get on with the TF implementation betterment : )

Adityakushwaha2006 requested review from MatthewMiddlehurst, TonyBagnall, dguijo and hadifawaz1999 as code owners December 29, 2025 09:47

aeon-actions-bot bot added the enhancement New feature, improvement request or other non-bug code enhancement label Dec 29, 2025

Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch from 0026dd6 to e2879cf Compare December 29, 2025 09:54

Adityakushwaha2006 requested a review from a team as a code owner December 29, 2025 09:54

Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch 2 times, most recently from f86d389 to a0411f8 Compare December 29, 2025 10:34

Adityakushwaha2006 requested review from SebastianSchmidl, baraline and chrisholder as code owners December 29, 2025 10:34

SebastianSchmidl removed request for SebastianSchmidl, baraline and chrisholder December 29, 2025 11:45

SebastianSchmidl added classification Classification package deep learning Deep learning related labels Dec 29, 2025

Adityakushwaha2006 mentioned this pull request Dec 29, 2025

[MNT] enforce strict CPU-GPU numerical parity for ROCKET (< 1e-7 divergence in Kernel creation) #3177

Closed

5 tasks

This was referenced Dec 29, 2025

[BUG] ROCKET transform random number generation #3212

Open

[ENH] Add GPU support to RocketClassifier/Regressor #3205

Draft

ENH: GPU Rocket 2.0: CuPy Backend, Exact CPU Parity, 100x Speedup

55770aa

Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch from e8bac71 to 55770aa Compare December 30, 2025 13:40

corrected cupy import and dll specifications

c648d15

Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch from 1215b7b to c648d15 Compare January 1, 2026 12:58

Merge branch 'main' into feat/cupy-rocket-acceleration

d4586f8

MatthewMiddlehurst reviewed Jan 2, 2026

View reviewed changes

Adityakushwaha2006 mentioned this pull request Jan 2, 2026

[MNT] Test consistency between CPU & GPU version of ROCKET #1248

Open

MatthewMiddlehurst closed this Jan 5, 2026

Conversation

Adityakushwaha2006 commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Key Improvements

Benchmarks & Validation

Detailed Dataset Breakdown

Architectural Justification: Transitioning from TensorFlow to CuPy

Future Development

Does your contribution introduce a new dependency? If yes, which one?

Any other comments?

PR checklist

For all contributions

For new estimators and functions

For developers with write access

Uh oh!

aeon-actions-bot bot commented Dec 29, 2025

Thank you for contributing to aeon

Uh oh!

Adityakushwaha2006 commented Dec 29, 2025

Uh oh!

MatthewMiddlehurst left a comment

Choose a reason for hiding this comment

Uh oh!

Adityakushwaha2006 commented Jan 2, 2026

Uh oh!

Adityakushwaha2006 commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MatthewMiddlehurst commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Adityakushwaha2006 commented Jan 2, 2026

Uh oh!

baraline commented Jan 3, 2026

Uh oh!

hadifawaz1999 commented Jan 4, 2026

Uh oh!

MatthewMiddlehurst commented Jan 4, 2026

Uh oh!

Adityakushwaha2006 commented Jan 4, 2026

Uh oh!

baraline commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MatthewMiddlehurst commented Jan 5, 2026

Uh oh!

Adityakushwaha2006 commented Jan 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Adityakushwaha2006 commented Dec 29, 2025 •

edited

Loading

Thank you for contributing to `aeon`

Adityakushwaha2006 commented Jan 2, 2026 •

edited

Loading

MatthewMiddlehurst commented Jan 2, 2026 •

edited

Loading

baraline commented Jan 5, 2026 •

edited

Loading