GPU Acceleration for Clustering #195
CONFIRMED: GPU (CUDA) is currently only used for Analysis, not Clustering.

RESEARCH FINDINGS:
- GPU acceleration can provide 10-30x speedups for clustering tasks
- KMeans: 10-50x faster; DBSCAN: 5-100x faster; PCA: 10-40x faster
- Example: 5000 clustering runs that take 2-4 hours on CPU can complete in 5-15 minutes on GPU

IMPLEMENTATION: Implemented GPU-accelerated clustering using RAPIDS cuML as an optional feature.

New features:
- GPU-accelerated KMeans, DBSCAN, and PCA using RAPIDS cuML
- Automatic fallback to CPU if the GPU is unavailable or encounters errors
- New environment variable `USE_GPU_CLUSTERING` (default: `false`)
- Maintains all existing clustering options and settings
- Compatible with CUDA 12.2+ and NVIDIA GPUs

Files modified:
1. `config.py`:
   - Added `USE_GPU_CLUSTERING` configuration flag
2. `tasks/clustering_gpu.py` (NEW):
   - Created GPU clustering module with RAPIDS cuML implementations
   - GPU-accelerated classes: `GPUKMeans`, `GPUDBSCAN`, `GPUPCA`
   - CPU-only wrappers: `GPUGaussianMixture`, `GPUSpectralClustering`
   - Factory functions for model creation with GPU/CPU selection
   - Automatic GPU availability detection and graceful fallback
3. `tasks/clustering_helper.py`:
   - Imported the GPU clustering module with fallback handling
   - Updated `_perform_single_clustering_iteration` to use GPU PCA when enabled
   - Modified `_apply_clustering_model` to support GPU clustering
   - Maintains full backward compatibility with CPU-only mode
4. `Dockerfile`:
   - Added `cupy-cuda12x` and `cuml-cu12` installation for NVIDIA builds
   - Only installs GPU packages when `BASE_IMAGE` is `nvidia/cuda`
   - CPU builds remain unchanged and lightweight
5. `deployment/.env.example`:
   - Added `USE_GPU_CLUSTERING` configuration with documentation
   - Default: `false` (CPU only, backward compatible)
6. `README.md`:
   - Added "GPU Acceleration for Clustering" section
   - Documented performance improvements and usage instructions
   - Listed supported algorithms and compatibility requirements
   - Noted that GaussianMixture and SpectralClustering use CPU (no GPU version)

Notes:
- GPU clustering is OPTIONAL and disabled by default
- CPU clustering remains the default for backward compatibility
- All existing clustering parameters and settings are preserved
- GaussianMixture and SpectralClustering always use CPU (no cuML implementation)
- GPU usage: Analysis (ONNX inference) + Clustering (RAPIDS cuML, optional)
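The factory-function pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `make_kmeans` is hypothetical, and it assumes cuML's scikit-learn-compatible `KMeans` constructor.

```python
import os

def make_kmeans(n_clusters, **kwargs):
    """Return a cuML KMeans when GPU clustering is enabled and available,
    otherwise fall back to scikit-learn's CPU implementation."""
    use_gpu = os.environ.get("USE_GPU_CLUSTERING", "false").lower() == "true"
    if use_gpu:
        try:
            from cuml.cluster import KMeans as CuKMeans  # needs CUDA + cuML
            return CuKMeans(n_clusters=n_clusters, **kwargs)
        except Exception:
            pass  # GPU path unavailable: fall through to the CPU model
    from sklearn.cluster import KMeans
    return KMeans(n_clusters=n_clusters, **kwargs)
```

Because the GPU import happens inside the function and inside a `try`/`except`, a machine without cuML (or with a broken CUDA setup) transparently gets the CPU model, which matches the "automatic fallback" behavior listed above.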
The `bash -lc` subshell prevented access to the `BASE_IMAGE` ARG, causing the GPU packages (cupy, cuml) to never be installed. Changed to the `set -ux;` pattern (matching the base stage), which properly accesses Docker ARG variables in the current shell. This ensures cuML and cupy are installed when building with nvidia/cuda base images, enabling GPU-accelerated clustering.
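A hypothetical Dockerfile fragment illustrating the kind of change described (the package names come from the commit; the surrounding structure is assumed):

```dockerfile
# ARG must be declared in the stage where it is used
ARG BASE_IMAGE=python:3.11-slim

# Broken variant per the commit: the `bash -lc` login subshell
# did not see the BASE_IMAGE build argument.
# RUN bash -lc 'if [[ "$BASE_IMAGE" == nvidia/cuda* ]]; then pip install cupy-cuda12x cuml-cu12; fi'

# Fixed variant: run in the current shell, matching the base stage's
# `set -ux;` pattern, so $BASE_IMAGE expands as expected.
RUN set -ux; \
    case "$BASE_IMAGE" in \
      nvidia/cuda*) pip install cupy-cuda12x cuml-cu12 ;; \
    esac
```

Note that ARG scoping in multi-stage builds has its own rules (an ARG declared before the first `FROM` must be re-declared inside a stage to be visible there), which is a separate pitfall from the subshell issue the commit describes.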
The new Docker image needed packages that were quite large, resulting in a very slow Docker build.
Feat/gpu clustering
I just did a clean install and found that this breaks the Analysis; I previously only tested it on Clustering. Please hold while I check.
Please let me know when you're finished with that fix. And yes, thanks, please do a full round of tests of ALL the functionality to be sure that even functionality that isn't "directly connected" continues working. Also, I'm a bit skeptical about having a
The mistake wasn't in the code; I built the image with the wrong base image.
Thanks for your effort, but I think that having a totally different algorithm for clustering is not OK, because we duplicate the code we have to maintain.
Please check whether the same library that supports GPU can also support CPU. In that case, put everything in one file (clustering.py) that calls the same function.
While doing this, please run some tests to confirm it works well on both CPU and GPU. Also check (maybe in the documentation) that the library works on ARM and on old Intel CPUs as well (some users run very old CPUs; you can have a look at past closed issues to read more).
Meanwhile, I'm doing some tests to check how it performs as-is on GPU.
I'll try to look into it. I separated it on purpose because I didn't want to break CPU compatibility, and the package already in use didn't support GPU acceleration. In my case clustering took over 50 hours, which made testing very hard.
I understand your concern about having to maintain two versions, though.
OK, but please pay attention: in other issues related to old-CPU support (generally related to AVX), even including the library/code without running the functionality itself prevented the entire application from starting.
In Artist Similarity, some days ago I introduced a Python library that uses a C library under the hood requiring AVX, and it prevented the entire application from starting.
If this library requires AVX, the only alternative is to look for "an alternative version" or the possibility of recompiling without it.
Another thing I think makes sense for those who have a big library is to run clustering with only 1000 runs instead of 5000, which was super conservative.
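One way to avoid crashing at import time on old CPUs is to check the CPU's capabilities before importing an AVX-dependent library. The sketch below is a hypothetical, Linux-only best-effort check (reading `/proc/cpuinfo`); it is not from the PR, and a robust solution would use a dedicated package or per-platform logic.

```python
def cpu_has_avx() -> bool:
    """Best-effort AVX detection by scanning /proc/cpuinfo (Linux only).

    Returns False when the file is unreadable (e.g. non-Linux platforms),
    so callers default to the safe CPU-only code path.
    """
    try:
        with open("/proc/cpuinfo") as f:
            return any(
                line.startswith("flags") and "avx" in line.split()
                for line in f
            )
    except OSError:
        return False
```

A caller would gate the risky import on this check, e.g. `if cpu_has_avx(): import the_avx_library`, falling back to the portable implementation otherwise.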
That's good input; I hadn't considered that. I've only been looking at optimizing for CUDA.
That also makes it a lot more complex. I appreciate you taking this into account; I can imagine a lot of people in this hobby run their music library on an old server.
The cache doesn't seem to be working correctly; each rebuild takes several minutes at these steps even if I don't change a line of code. Please review:
```
=> [audiomuse-ai-flask] exporting to docker image format 22.7s
=> => exporting layers 0.0s
=> => exporting manifest sha256:3c5fe6420f4c612992b198d7fdd8fdb0e082ac2690526d71e1f1763a97b0a5a3 0.0s
=> => exporting config sha256:b00acf6664a8f1a6366be39010fac462b48063e156874a6f08b7794349ebfb03 0.0s
=> => sending tarball 22.7s
=> [audiomuse-ai-worker] exporting to docker image format 22.7s
=> => exporting layers 0.0s
=> => exporting manifest sha256:1332fa40105ad14378572056f87954783640142115eea4b503b9779abbfed53f 0.0s
=> => exporting config sha256:a6a0206adf191b8af770cf1601ca97fd4f5b7ed41140826c1a484ed3a27bca45 0.0s
=> => sending tarball 22.7s
```
I've been experiencing slow builds as well, especially with the new packages for the clustering. I've been trying to optimize it, but it's not yet where I'd like it to be.
Regarding the log: the sequence (exporting to docker image format -> sending tarball) indicates that the build execution itself is fast (cached), but the export phase is slow. I think the delay is because Docker is moving the final image data from the BuildKit engine to the Docker daemon.
I'll keep looking into this too.
See issue #189: three users are running AudioMuse-AI on older CPUs in that issue alone. While I'd love to leverage modern GPUs, I don't want to exclude these users. Without GPU support they can still run it; with incompatible dependencies, they can't run it at all.
From some quick searching I also really doubt that cuML can support old CPUs. So OK, keep the two implementations, BUT you need to check the import of the new one at runtime.
So use a try/except: if the import of cuML fails, fall back to the old one in the except block. But please be very careful with this, or those users will knock on your door if it stops working :D
Edit: anyway, really, thanks for your effort and your patience with my review :)
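The try/except import guard suggested above could look like the following module-level sketch (illustrative, not the PR's code; it assumes the cuML and scikit-learn class names shown):

```python
# Prefer the GPU implementations, but fall back to the CPU ones if cuML
# (or its native dependencies) cannot be imported on this machine.
try:
    from cuml.cluster import KMeans, DBSCAN       # GPU path: requires CUDA
    from cuml.decomposition import PCA
    GPU_AVAILABLE = True
except Exception:                                 # ImportError or a loader error
    from sklearn.cluster import KMeans, DBSCAN    # CPU path: always available
    from sklearn.decomposition import PCA
    GPU_AVAILABLE = False
```

One caveat relevant to the old-CPU discussion above: `except` can only catch Python-level exceptions. If a native extension executes an unsupported instruction (e.g. AVX on a CPU without it) during import, the process can die with an illegal-instruction signal before the `except` runs, which is exactly the failure mode the maintainer describes.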
Dealing with empty responses, SQL handling, and how it retries. The pre-requirements would also fail silently; now that's explicit and it asks for a retry.
Meanwhile, while checking your code I noticed this. So, to avoid breaking changes, have a look at whether any other "breaking" library is imported in that way; that should already be sufficient. Also let me know if you find a way to improve the image build time. When you're finished, just let me know and I'll do another round of tests. Thanks again for your effort, it's really appreciated!
I've made a first attempt at looking into it, I just haven't had the time to really test yet. Still trying to optimize the image build time, because testing takes forever now :(
You're right.
As you noticed, we can't afford an image that takes minutes to rebuild without any change to the code. Better to have a bigger image for now.
Just wanted to let you know that I should have some updates this weekend! :)
Clustering was very slow for me; this PR is my attempt to add GPU support for Clustering. In my case it speeds it up by 11-12x.
I had to include a new Dockerfile because of the new dependencies. This made the Docker build incredibly slow, so I changed it to a more optimal version.
I haven't done a regression test on CPU clustering yet.
edit: