GPU Porting Backend Selection #3233
Replies: 2 comments
Sharing detailed data here about the new implementation (for ROCKET): I've been working on a pure CuPy native implementation to attain speedup while maintaining parity.
Please note: the CuPy script included in PR #3211 (now closed) also contained custom kernels, which were the major concern in terms of maintainability, porting, and future ease. These have been removed in the pure CuPy implementation; the code is easier to read than even the TF implementation, since the CPU implementation in NumPy sets the base. Sharing benchmarking results here. **Please note: for larger L, the bottleneck is caused by the 1650 Ti's memory bandwidth saturating, not a software inefficiency; on higher-end GPUs this shouldn't pose a problem.**
- Throughput Test (Scaling N). Configuration: C=1 channel, L=500 timepoints
- Length Test (Scaling L). Configuration: N=200 samples, C=1 channel
- Multivariate Test (Scaling C). Configuration: N=200 samples, L=1,000 timepoints
The TensorFlow GPU implementation shows a ~6 hr runtime and is incredibly slow compared to the CuPy variant; I did not have the time to benchmark it for 6 hours. We're well aware of the poor performance of the TF implementation from the previous benchmarks shared across multiple PRs. Both implementations maintain parity now; however, CuPy maintains parity with a threshold of <1e-7, whereas TF, due to float32 error accumulation, has a threshold of <1e-4, though this is not a very big issue. I am ready to push this pure CuPy implementation to a new PR if the maintainers agree this is the preferred direction over the TF backend. Please note: these benchmarks were done on a 1650 Ti, which is an entry-level GPU. I'll run the same benchmarks on an RTX 3050 and get back with the results, though I will refrain from putting a number on the exact extrapolated speedup; my estimates place it around 50-70x the CPU variant, with an obviously smaller breakpoint than in the current benchmark.
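To make the parity claim concrete, here is a minimal sketch of how such a check might look. The function names and the single-kernel convolution are hypothetical, not aeon code; the harness falls back to NumPy when CuPy or a GPU is unavailable, so the same check runs anywhere.

```python
import numpy as np

# Hypothetical parity harness: use CuPy when present, otherwise NumPy,
# so the check is runnable on CPU-only machines.
try:
    import cupy as xp
except ImportError:
    xp = np

def rocket_convolve_np(x, w):
    # Reference CPU convolution for one ROCKET-style kernel (illustrative).
    return np.convolve(x, w, mode="valid")

def rocket_convolve_xp(x, w):
    # Same operation via the array module `xp`; with CuPy this runs on the
    # GPU, with the NumPy fallback it matches the reference exactly.
    return xp.convolve(xp.asarray(x), xp.asarray(w), mode="valid")

rng = np.random.default_rng(0)
x = rng.standard_normal(500)   # L=500 timepoints, as in the benchmark config
w = rng.standard_normal(9)

ref = rocket_convolve_np(x, w)
out = rocket_convolve_xp(x, w)
out = out.get() if hasattr(out, "get") else out  # device -> host if on GPU

# The <1e-7 parity threshold quoted above.
assert np.max(np.abs(ref - out)) < 1e-7
```

With float64 arrays the two paths agree to well under 1e-7; the looser 1e-4 threshold only becomes necessary once float32 accumulation (as in the TF backend) enters the picture.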
Again, it is difficult to evaluate this without seeing the code. The last implementation was not really maintainable, as mentioned by multiple maintainers. While details and experiments are appreciated, please try to make your proposals less verbose if possible. Have we discussed distances for these changes in the past? I may have forgotten. Claims such as "40-50 algorithms could be improved" seem a bit hyperbolic.
Proposal: use CuPy for porting (similar to the numba API, so easy to maintain, plus extreme speedups); NO custom CUDA kernels
I was taking a look at the GPU optimisations we could have in aeon, and critical implementations in shapelets and distances are becoming very hard to port and comparatively losing a lot of speed because we're sticking to TF (same with PyTorch as well). Some simply can't be ported with TF, as far as my knowledge tells me (more details below).
With CuPy we could enhance the shapelet transform's speed by a lot, due to its massively parallel computation, and since ROCKET CuPy had a 100x speedup as well (and that on a 1650 Ti), it would really affect HIVE-COTE's speed.
I just thought that we were leaving this huge opportunity on the table, and wanted to understand our current standing and needs.
Since CuPy is easier to port to from numba and faster as well, provided we don't introduce custom CUDA kernels, we should see a large speedup but might sacrifice parity with the CPU variant (which is extremely hard to obtain even with TF implementations, if a TF implementation of the algorithm is even possible).
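To illustrate why a numba-to-CuPy port tends to be mechanical: a typical numba CPU kernel is explicit loops, and the vectorized array form of the same computation runs unchanged under either NumPy or CuPy. The pairwise squared-Euclidean distance below is a hypothetical stand-in for the kind of distance kernel aeon writes with numba; it is not aeon code.

```python
import numpy as np

def pairwise_sq_dist_loops(X):
    # CPU-style explicit loops, the shape a numba @njit kernel usually takes.
    n = X.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            D[i, j] = np.dot(diff, diff)
    return D

def pairwise_sq_dist_xp(X, xp=np):
    # Array-module version: pass xp=cupy and the identical body runs on the
    # GPU, with no custom CUDA kernels (the proposal's constraint).
    sq = xp.sum(X * X, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

X = np.random.default_rng(1).standard_normal((50, 8))
assert np.allclose(pairwise_sq_dist_loops(X), pairwise_sq_dist_xp(X), atol=1e-8)
```

The loop version is what the CPU code already looks like; the `xp` version is the whole port, which is the maintainability argument in a nutshell.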
From my analysis of the codebase over the past few weeks:
What TF can do (just 8 files): the ones currently feasible
What only CuPy can do (20 files):
Obviously, a CuPy implementation of the 8 files that TF can implement is possible and comparatively very easy. A TF implementation of some of the others would also be 'possible', but the reason they're separated from the rest of the 8 is that there would either be a loss in accuracy or extremely slow speeds, in which case it does not make sense to have them.
Since we're at a stage where the majority of GPU coverage is still left, I feel it is important to make a decision that scales better over time; I don't see how continuing with TensorFlow and possibly hitting a roadblock after the standard 8 implementations is going to be good for us.
And since Matthew had said that we aren't against adding a soft dependency, I thought I should still pursue this, since I could get rid of the complex CUDA kernel.
I also wanted to add that this isn't just about the 20 files I've identified; more than 40-50 algorithms would benefit from the speedups. IMO it's very essential for aeon's growth.
And understanding the maintenance concerns, I'm going to refrain from using CUDA kernels anywhere; I think this was Matthew's major concern with this approach.
Just a little more info dump for added context:
Currently tested approaches (with kernels, so the speedups can't be maintained; included just for reference):
On higher-end GPUs these should extrapolate to around ~180-190x and 25-30x speedups respectively (estimates).
CuPy has a 1:1 API with NumPy, so pure CuPy ports are even easier to maintain and understand than TF ports.
Also summarising both approaches here:
-> Going forward with TF:
-> Going forward with CuPy:
Note: CuPy is, in effect, a clone of NumPy for the GPU.
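A small sketch of what that 1:1 API means in practice: one function body can serve both backends by resolving the array module at runtime. CuPy's `get_array_module` dispatches on the array's type; the fallback shim below is only there so the example also runs where CuPy is not installed.

```python
import numpy as np

# Backend-agnostic sketch (illustrative, not aeon code). With CuPy installed,
# cupy.get_array_module returns cupy for device arrays and numpy for host
# arrays; without it, we always fall back to NumPy.
try:
    import cupy

    def get_array_module(a):
        return cupy.get_array_module(a)
except ImportError:
    def get_array_module(a):
        return np  # CPU-only fallback when CuPy is absent

def zscore(a):
    xp = get_array_module(a)  # numpy or cupy, decided at runtime
    return (a - xp.mean(a)) / xp.std(a)

a = np.arange(10.0)
z = zscore(a)
assert abs(float(z.mean())) < 1e-12
assert abs(float(z.std()) - 1.0) < 1e-12
```

The same `zscore` body would accept a `cupy.ndarray` and run on the GPU with no edits, which is the core of the maintainability argument for a pure CuPy backend.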