GPU Porting Backend Selection #3233
Replies: 2 comments
Sharing detailed data here about the new implementation (for ROCKET): I've been working on a pure CuPy native implementation to attain speedup while maintaining parity.
Please note: the CuPy script included in PR #3211 (now closed) also contained custom kernels, which were the major concern in terms of maintainability, porting, and future ease. These have been removed in the pure CuPy implementation; the code is easier to read than even the TF implementation, since the CPU implementation in NumPy sets the base. Sharing benchmarking results here. **Please note: for larger L, the bottleneck is caused by the 1650 Ti's memory bandwidth saturating, not a software inefficiency; on higher-end GPUs this shouldn't pose a problem.**
- Throughput Test (Scaling N). Configuration: C=1 channel, L=500 timepoints
- Length Test (Scaling L). Configuration: N=200 samples, C=1 channel
- Multivariate Test (Scaling C). Configuration: N=200 samples, L=1,000 timepoints
The TensorFlow GPU implementation shows a ~6 hr runtime and is incredibly slow compared to the CuPy variant; I did not have the time to benchmark it for 6 hours. We're well aware of the poor performance of the TF implementation from the previous benchmarks shared across multiple PRs. Both implementations maintain parity now; however, CuPy maintains parity with a threshold of <1e-7, whereas TF, due to float32 error accumulation, has a threshold of <1e-4, though this is not a very big issue. I am ready to push this pure CuPy implementation to a new PR if the maintainers agree this is the preferred direction over the TF backend. Please note: these benchmarks were done on a 1650 Ti, which is an entry-level GPU. I'll run the same benchmarks on an RTX 3050 and get back with the results, though I will refrain from putting a number on the exact extrapolated speedup; my estimates place it around 50-70x the CPU variant, with an obviously smaller breakpoint than in the current benchmark.
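To make the parity claim concrete, here is a minimal sketch of how such a check might look. The function names and the single-kernel convolution are hypothetical, not aeon code; the harness falls back to NumPy when CuPy or a GPU is unavailable, so the same check runs anywhere.

```python
import numpy as np

# Hypothetical parity harness: use CuPy when present, otherwise NumPy,
# so the check is runnable on CPU-only machines.
try:
    import cupy as xp
except ImportError:
    xp = np

def rocket_convolve_np(x, w):
    # Reference CPU convolution for one ROCKET-style kernel (illustrative).
    return np.convolve(x, w, mode="valid")

def rocket_convolve_xp(x, w):
    # Same operation via the array module `xp`; with CuPy this runs on the
    # GPU, with the NumPy fallback it matches the reference exactly.
    return xp.convolve(xp.asarray(x), xp.asarray(w), mode="valid")

rng = np.random.default_rng(0)
x = rng.standard_normal(500)   # L=500 timepoints, as in the benchmark config
w = rng.standard_normal(9)

ref = rocket_convolve_np(x, w)
out = rocket_convolve_xp(x, w)
out = out.get() if hasattr(out, "get") else out  # device -> host if on GPU

# The <1e-7 parity threshold quoted above.
assert np.max(np.abs(ref - out)) < 1e-7
```

With float64 arrays the two paths agree to well under 1e-7; the looser 1e-4 threshold only becomes necessary once float32 accumulation (as in the TF backend) enters the picture.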
Again, it is difficult to evaluate this without seeing the code. The last implementation was not really maintainable, as mentioned by multiple maintainers. While details and experiments are appreciated, please try to make your proposals less verbose if possible. Have we discussed distances for these changes in the past? I may have forgotten. Claims such as "40-50 algorithms could be improved" seem a bit hyperbolic.
Proposal: use CuPy for porting (similar to the numba API, so easy to maintain, plus extreme speedups); NO custom CUDA kernels
I was taking a look at the GPU optimisations we could have in aeon, and critical implementations in shapelets and distances are becoming very hard to port and comparatively losing a lot of speed because we're sticking to TF (same with PyTorch as well). Some simply can't be ported with TF, as far as my knowledge tells me (more details below).
With CuPy we could enhance the shapelet transform's speed by a lot, due to its massively parallel computation, and since ROCKET CuPy had a 100x speedup as well (and that on a 1650 Ti), it would really affect HIVE-COTE's speed.
I just thought that we were leaving this huge opportunity on the table, and wanted to understand our current standing and needs.
Since CuPy is easier to port to from numba and faster as well, provided we don't introduce custom CUDA kernels, we should see a large speedup but might sacrifice parity with the CPU variant (which is extremely hard to obtain even with TF implementations, if a TF implementation of the algorithm is even possible).
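To illustrate why a numba-to-CuPy port tends to be mechanical: a typical numba CPU kernel is explicit loops, and the vectorized array form of the same computation runs unchanged under either NumPy or CuPy. The pairwise squared-Euclidean distance below is a hypothetical stand-in for the kind of distance kernel aeon writes with numba; it is not aeon code.

```python
import numpy as np

def pairwise_sq_dist_loops(X):
    # CPU-style explicit loops, the shape a numba @njit kernel usually takes.
    n = X.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            diff = X[i] - X[j]
            D[i, j] = np.dot(diff, diff)
    return D

def pairwise_sq_dist_xp(X, xp=np):
    # Array-module version: pass xp=cupy and the identical body runs on the
    # GPU, with no custom CUDA kernels (the proposal's constraint).
    sq = xp.sum(X * X, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

X = np.random.default_rng(1).standard_normal((50, 8))
assert np.allclose(pairwise_sq_dist_loops(X), pairwise_sq_dist_xp(X), atol=1e-8)
```

The loop version is what the CPU code already looks like; the `xp` version is the whole port, which is the maintainability argument in a nutshell.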
From my analysis of the codebase over the past few weeks:
What TF can do (just 8 files): the ones currently feasible
What only CuPy can do (20 files):
Obviously, a CuPy implementation of the 8 files that TF can implement is possible and comparatively very easy. A TF implementation of some of the others would also be 'possible', but the reason they're separated from the rest of the 8 is that there would either be a loss in accuracy or extremely slow speeds, in which case it does not make sense to have them.
Since we're at a stage where the majority of GPU coverage is still left, I feel it is important to make a decision that scales better over time; I don't see how continuing with TensorFlow and possibly hitting a roadblock after the standard 8 implementations is going to be good for us.
And since Matthew had said that we aren't against adding a soft dependency, I thought I should still pursue this, since I could get rid of the complex CUDA kernel.
I also wanted to add that this isn't just about the 20 files I've identified; more than 40-50 algorithms would benefit from the speedups. IMO it's very essential for aeon's growth.
And understanding the maintenance concerns, I'm going to refrain from using CUDA kernels anywhere; I think this was Matthew's major concern with this approach.
Just a little more info dump for added context:
Currently tested approaches (with kernels, so the speedups can't be maintained; included just for reference):
On higher-end GPUs these should extrapolate to around ~180-190x and 25-30x speedups respectively (estimates).
CuPy has a 1:1 API with NumPy, so pure CuPy ports are even easier to maintain and understand than TF ports.
Also summarising both approaches here:
-> Going forward with TF:
-> Going forward with CuPy:
Note: CuPy is, in effect, a clone of NumPy for the GPU.
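A small sketch of what that 1:1 API means in practice: one function body can serve both backends by resolving the array module at runtime. CuPy's `get_array_module` dispatches on the array's type; the fallback shim below is only there so the example also runs where CuPy is not installed.

```python
import numpy as np

# Backend-agnostic sketch (illustrative, not aeon code). With CuPy installed,
# cupy.get_array_module returns cupy for device arrays and numpy for host
# arrays; without it, we always fall back to NumPy.
try:
    import cupy

    def get_array_module(a):
        return cupy.get_array_module(a)
except ImportError:
    def get_array_module(a):
        return np  # CPU-only fallback when CuPy is absent

def zscore(a):
    xp = get_array_module(a)  # numpy or cupy, decided at runtime
    return (a - xp.mean(a)) / xp.std(a)

a = np.arange(10.0)
z = zscore(a)
assert abs(float(z.mean())) < 1e-12
assert abs(float(z.std()) - 1.0) < 1e-12
```

The same `zscore` body would accept a `cupy.ndarray` and run on the GPU with no edits, which is the core of the maintainability argument for a pure CuPy backend.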