Skip to content

[ENH] GPU Rocket 2.0: CuPy Backend ,Exact CPU Parity (<1e-7), 100x Speedup, Full Multivariate Support#3211

Closed
Adityakushwaha2006 wants to merge 3 commits intoaeon-toolkit:mainfrom
Adityakushwaha2006:feat/cupy-rocket-acceleration
Closed

[ENH] GPU Rocket 2.0: CuPy Backend ,Exact CPU Parity (<1e-7), 100x Speedup, Full Multivariate Support#3211
Adityakushwaha2006 wants to merge 3 commits intoaeon-toolkit:mainfrom
Adityakushwaha2006:feat/cupy-rocket-acceleration

Conversation

@Adityakushwaha2006
Copy link
Contributor

@Adityakushwaha2006 Adityakushwaha2006 commented Dec 29, 2025

Reference Issues/PRs

Supersedes #3177
Relates to #313
Relates to #1248

What does this implement/fix? Explain your changes.

This pull request implements a high-performance, GPU-accelerated backend for the Rocket transformer utilizing CuPy.

This implementation achieves strict numerical parity with the CPU baseline and introduces full support for multivariate time series, resolving the architectural limitations and accuracy issues identified in the previous TensorFlow-based approach (PR #3177).

Key Improvements

  • Performance: Delivers significant acceleration over both the standard CPU implementation (up to 14x) and the Legacy TensorFlow prototype (up to 100x).
  • Numerical Accuracy: Guarantees numerical equivalence to the CPU implementation. Benchmarks confirm that while the Legacy TF version suffered from massive deviations (MAE > 600), the CuPy implementation maintains strict parity (MAE < 1e-7).
  • Multivariate Support: Natively handles multivariate input data, addressing a critical functional gap in the previous TensorFlow prototype.
  • Minimal Footprint: Integrates cupy as a soft, optional dependency, avoiding the substantial package overhead associated with TensorFlow, reducing excessive bloat for the users.

Benchmarks & Validation

The following benchmarks demonstrate the performance gains and correctness verification.

(GPU: 1650 ti, CPU: i7)
(Speed and breakeven points will be better on better GPUs like RTX 4090)
(Note: legacy TF refers to the original TF implementation from #1199)
1. Performance: CuPy vs. CPU
Detailed speedup analysis across varying dataset sizes.
image

Analysis: The CuPy implementation demonstrates a non-linear performance gain as dataset dimensions increase. For standard time-series lengths (5,000 - 10,000), the GPU backend achieves a speedup factor of approximately 14.7x compared to the CPU implementation. The break-even point occurs at approximately 1,000 timepoints, making this backend highly efficient for medium-to-large scale datasets.

2. Performance: CuPy vs. TensorFlow (Legacy)
Comparison against the previous GPU prototype.
image

Analysis: The architectural shift to CuPy yields massive performance gains over the TensorFlow legacy code. At a time series length of 10,000, the CuPy implementation is 101.5x faster than the TensorFlow version. This also confirms that the overhead of the TensorFlow graph execution was the primary bottleneck in the previous iteration.

3. Deviation Analysis & Numerical Parity
Verification of numerical stability and equivalence with CPU kernels.

image

Analysis: I conducted an extensive audit across 15 datasets (6 Univariate, 9 Multivariate) to compare the Mean Absolute Error (MAE) against the CPU baseline. The results below demonstrate that the CuPy implementation is statistically indistinguishable from the CPU version, whereas the Legacy TF version failed to maintain scientific precision.

Metric Legacy TF (GPU) CuPy (New GPU)
Best MAE 1.33 6.28e-08
Worst MAE 648.18 3.02e-05
Datasets Passing Parity (< 1e-5) 0 / 15 (0%) 15 / 15 (100%)

Detailed Dataset Breakdown

Univariate Datasets

Dataset Legacy TF MAE CuPy MAE
GunPoint 1.40 6.30e-08
ItalyPowerDemand 1.34 6.32e-08
Chinatown 648.18 3.02e-05
TwoLeadECG 1.53 8.54e-08
ECG200 1.42 8.31e-08
Coffee 1.33 6.28e-08

Multivariate Datasets

Dataset Legacy TF MAE CuPy MAE
BasicMotions 2.29 1.41e-07
Epilepsy 1.89 1.12e-07
NATOPS 3.61 1.50e-07
ERing 2.11 1.44e-07
Handwriting 3.30 2.67e-07
ArticularyWord 2.42 1.67e-07
RacketSports 3.20 1.97e-06
StandWalkJump 3.64 1.77e-07
UWaveGesture 1.61 1.12e-07

Architectural Justification: Transitioning from TensorFlow to CuPy

Following the development of the TensorFlow-based implementation (PR #3177), extensive testing identified two critical blockers that necessitated a pivot to CuPy:

  1. Multivariate Complexity: TensorFlow lacked the flexibility required for raw kernel manipulations in multivariate scenarios, leading to inefficient workarounds and the high error rates observed in the table above and the general functioning.
  2. Dependency Overhead: The inclusion of TensorFlow as a dependency introduced disproportionate bloat relative to the performance gains for smaller datasets.

The CuPy Advantage
Switching to CuPy allowed for direct, CUDA-like kernel definitions within Python. This enabled a direct replication of the aeon CPU logic, successfully solving the multivariate issue while maintaining a lightweight dependency profile.

Future Development

This architecture establishes a robust foundation for a "GPU-First" convolution module. This framework will facilitate the rapid porting of:

  • MiniRocket
  • MultiRocket
  • Hydra
    and subsequently other extremely important algorithms like HIVE-COTE and Shapelets.

Does your contribution introduce a new dependency? If yes, which one?

Yes, cupy.

It is implemented strictly as an optional/soft dependency. The GPU backend is only initialized if explicitly requested by the user, ensuring no additional overhead for standard CPU-only workflows.

Any other comments?

Hardware version
NVIDIA Gtx 1650 ti / i7 / 16GB RAM

Reviewers
cc: @hadifawaz1999 @TonyBagnall @MatthewMiddlehurst

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you after the PR has been merged.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator/function to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

@aeon-actions-bot aeon-actions-bot bot added the enhancement New feature, improvement request or other non-bug code enhancement label Dec 29, 2025
@aeon-actions-bot
Copy link
Contributor

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ enhancement ].
This PR changes too many different packages (>3) for automatic addition of labels, please manually add package labels if relevant.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Regenerate expected results for testing
  • Push an empty commit to re-run CI checks

@Adityakushwaha2006
Copy link
Contributor Author

Thanks for flagging issue #3212 @MatthewMiddlehurst . I've analyzed this against my CuPy implementation, and it won't negatively impact the GPU variant. Since ROCKETGPU delegates kernel generation to the CPU Rocket class to ensure parity, fixing the RNG handling here will automatically propagate the correct behavior to the GPU version without requiring kernel changes.

Just wanted to bring to your notice.

@Adityakushwaha2006 Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch from e8bac71 to 55770aa Compare December 30, 2025 13:40
@Adityakushwaha2006 Adityakushwaha2006 force-pushed the feat/cupy-rocket-acceleration branch from 1215b7b to c648d15 Compare January 1, 2026 12:58
Copy link
Member

@MatthewMiddlehurst MatthewMiddlehurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaning towards rejecting this. Don't think it is what we want in terms of an implementation. We have lost a lot of parameters. The whole kernel thing is a bit weird. Testing is not like anything we have in the codebase.

If it was cleaner I would not be against adding the dependency, but this is not it IMO. Not a review, if any other developer feels like it is worth taking this on go ahead otherwise it will likely be closed.

@Adityakushwaha2006
Copy link
Contributor Author

Hey @MatthewMiddlehurst, I understand the concerns youve raised about the implementation.
I want to make sure Im understanding the direction correctly before doing a major rework. I'm willing to put in the effort to bring this in line with aeons standards.

This seems like significant progress...so if you can provide guidance on what the 'right' approach looks like, we can definitely have a better ,more standard implementation.
If the fundamental direction is incompatible with what aeon needs, please let me know that too so I don't waste your time with extensive revisions that arent needed

Thanks for considering this.

@Adityakushwaha2006
Copy link
Contributor Author

Adityakushwaha2006 commented Jan 2, 2026

In the meanwhile i was trying to optimise the other rocket family , with their GPU implementations, and see how CuPY fairs out there, i have some benchmarking results from the minirocket GPU implementation aswell, which i just finished with about 10 minutes back.

Attaching them here:
image
image
image

I do still think that CuPy is the way to go forward if we want a GPU implementation in aeon...the effort in adding the entire structure is expected and im not seeing this as a one off PR either, i do understand that this will need major work, but that too seems essential for our progress.
Let me know : )

@MatthewMiddlehurst
Copy link
Member

MatthewMiddlehurst commented Jan 2, 2026

Ignoring the testing and unrelated changes to the issue which I'm not sure why it has been implemented like this, I do not want to start maintaining CUDA code and add in dependencies for 1 second of speed up. Your images seem to plot the same things multiple times?

@Adityakushwaha2006
Copy link
Contributor Author

Each graph set plots the speedup between the CPU version and the GPU version on different channels (1,2...6) (so ranging from univariate to multivariate)
Just to clarify the speedup, its not about saving 1 second on a single run. Its about 24x faster execution on multivariate data.

As per my understanding why i thought this was of immense importance was because in a typical research workflow where someone would run rocket on 100 different parameter configurations or datasets, that would make the difference between 2 hours vs 5 minutes of run time.

I understand the concern about CUDA maintenance and dependencies though. If the maintenance burden outweighs the performance benefit for aeons use cases, I completely get that too.

@baraline
Copy link
Member

baraline commented Jan 3, 2026

I agree with Matthew on maintain CUDA kernels, I don't think we want to get into that level of complexity.

One alternative if we want to use GPU implementation (which I think is a good thing) would be to use CuPy native operation (i.e. cupy.convolve etc...) so that the code is close to a numpy implementation, which is the whole point of CuPy, to be a drop in replacement for scipy/numpy.

@hadifawaz1999
Copy link
Member

putting my comment here as well

the only change am happy to get is to use the same kermel generation function of the cpu implementation, but keep using tensorflow for the convolution

@MatthewMiddlehurst
Copy link
Member

Sorry the labels for the latest plot all say number of cases up to 10,000 so it is a little confusing if it is altering other metrics. Generally yes there will be multiple runs, and speed ups are good, but this as it stands is a bit of a maintainability liability to be honest. In real terms ROCKET is a quick algorithm already, and the speed-up is not worth the potential issues.

@Adityakushwaha2006
Copy link
Contributor Author

Understandable hiccups , i'll try and list all the options available to us as of our current standing.

1) CuPy implementation with convolve (without custom kernel):
Pros: easy to maintain, code similar to CPU, removes TF bloat,easy to understand
Cons: Adds soft dependency
For this approach , i cant say a lot about the speed and parity without testing it first, my understanding would place the speedup to be slightly lesser than the custom kernel method.

2) Previous TF implementation with Kernel Parity changes:
Pros: Less codebase change, already exists in PR #3177
Cons: Incredibly slow (~2-3x speedup only), hard to map CPU to GPU, dependency bloat, equivalence point at larger sizes (compared to CuPy)
Speed: ~2-3x vs CPU
Parity: supports Univariate parity (MAE < 1e-5) , does not support Multivariate parity (MAE 1.61-648.18 due to different convolution algorithms at transform stage)

3) CuPy custom kernel :
Pros: extremely fast (14.7x peak speedup), code similar to CPU except kernel, easy to understand, removes TF bloat , equivalence around ~1k datapoints
Cons: Hard to maintain kernel, adds soft dependency, understandable concerns about mixing with codebase ...all in all does not seem favourable here
Speed: 14.7x peak vs CPU, average 8-12x depending on dataset size
Parity: Supports Univariate parity (MAE < 1e-5, 6/6 datasets pass), supports Multivariate parity (MAE < 1e-5, 9/9 datasets pass)

With a final weigh in of opinions and discussion we can decide if we wish to abandon the CuPy approach in its entirety , or try out the 1st option as suggested by @baraline , or if we wanna revert to the TF implementation opened in #3177 (opening it again for review) as seen favourable by @hadifawaz1999

@baraline
Copy link
Member

baraline commented Jan 5, 2026

As much as I love CuPy, I agree with Ali too, if we can do it with tensorflow we should, it keeps the GPU framework we use consistent across aeon.
Even a 2x or 3x speedup is already really nice to have, especially if the maintenance cost is low. But yeah we might want to work on the other issues it has in terms of parity

@MatthewMiddlehurst
Copy link
Member

"Parity: supports Univariate parity (MAE < 1e-5) , does not support Multivariate parity (MAE 1.61-648.18 due to different convolution algorithms at transform stage)"

Not really sure how this works, why is it different?

Either way this does not seem to be the approach we want, so closing.

@Adityakushwaha2006
Copy link
Contributor Author

TF conv1d uses matrix ops that accumulate in parallel. For univariate , theres only one multiplication so order
doesn't matter. For multivariate, the different accumulation order causes float rounding differences to compound.
Also , TFs convolution algorithm isnt fixed either, so its completely unpredictable

Totally able to understand the concerns tho, ill get on with the TF implementation betterment : )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

classification Classification package deep learning Deep learning related enhancement New feature, improvement request or other non-bug code enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants