Benchmarking-SD is a comprehensive toolkit for benchmarking Stable Diffusion’s image-generation performance across various GPU platforms (with a focus on AWS instance types). Stable Diffusion is a text-to-image latent diffusion model that enables generative AI art, but running it efficiently requires significant GPU resources. This repository was created to measure and compare performance – helping researchers and developers answer questions like “Which GPU gives the fastest image generation?”, “What’s the cost per image on different cloud instances?”, and “How can I optimize my setup for Stable Diffusion?”
Example Stable Diffusion output (fantasy castle scene) generated by the model. This benchmark suite helps quantify how different GPUs perform when generating such high-quality images. Faster GPUs can produce the same image in a fraction of the time compared to slower ones.
Using Benchmarking-SD, you can easily benchmark latency, throughput, memory usage (VRAM), and even estimate the cloud cost for generating images with Stable Diffusion on a variety of hardware. High GPU utilization is key for efficient AI image generation, and this tool shows you how different instances stack up. Whether you’re choosing an AWS instance for a project or just curious about your own GPU’s performance, Benchmarking-SD provides the data to make informed decisions.
- Automated Benchmark Runs: Runs Stable Diffusion inference for a given number of images and records throughput (images/second) and average latency per image (see the measurement sketch after this list).
 - Multi-Instance Support: Easily benchmark multiple GPU instance types in one go (e.g., compare a Tesla T4 vs. an A100 vs. an H100 instance).
 - Detailed Metrics: Reports latency, throughput, memory (VRAM) usage, and estimated cost per image (using AWS pricing) for each instance.
 - Results Visualization: Generates a comparative table of results and can plot charts for easy analysis of performance vs. cost.
 - Customization: Supports different Stable Diffusion models or image sizes/steps, so you can test Stable Diffusion 1.4, 1.5, 2.1 or even SDXL with configurable parameters.
 - Easy to Extend: Modular design lets you add new hardware or custom inference pipelines for benchmarking.
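
As a rough illustration of what an automated benchmark run measures, here is a minimal sketch of the kind of timing loop involved, using Hugging Face Diffusers. The prompt and step count are placeholders, and the actual `benchmark.py` may differ in detail:

```python
import time
import torch
from diffusers import StableDiffusionPipeline

# Assumes a CUDA GPU is available and the SD v1.4 license has been accepted.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

num_images = 10
prompt = "a fantasy castle on a hill at sunset"  # placeholder prompt

# Warm-up run so model load / CUDA initialization doesn't skew the timings.
pipe(prompt, num_inference_steps=50)

start = time.perf_counter()
for _ in range(num_images):
    pipe(prompt, num_inference_steps=50)
elapsed = time.perf_counter() - start

print(f"throughput: {num_images / elapsed:.2f} img/s")
print(f"latency:    {elapsed / num_images:.2f} s/img")
print(f"peak VRAM:  {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```
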
 
- Clone the repo: `git clone https://github.com/yashjani/benchmarking-sd.git`
- Install dependencies: The benchmarking tool relies on PyTorch and Hugging Face Diffusers. You can install all requirements with `pip install -r requirements.txt`. Make sure you have a compatible GPU environment set up with CUDA drivers.
- Get the model weights: The default benchmark uses the open-source Stable Diffusion v1.4 model. The script will attempt to download it via Hugging Face’s API. Ensure you have accepted the model license on Hugging Face and have your `HUGGINGFACE_TOKEN` configured if necessary. (See the FAQ if you encounter authentication issues.)
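
If the automatic download needs authentication, a minimal way to pass your token to the Hugging Face libraries looks roughly like this (assuming `HUGGINGFACE_TOKEN` is set in your environment; the exact mechanism in `benchmark.py` may differ):

```python
import os
from huggingface_hub import login

# Log the current process in to the Hugging Face Hub so gated/licensed
# model weights (e.g., Stable Diffusion v1.4) can be downloaded.
login(token=os.environ["HUGGINGFACE_TOKEN"])
```
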
Once installed, you can run the benchmarking script on your local GPU or on cloud instances. Below are a few common usage examples.
- Benchmark your default GPU: run `python benchmark.py --num-images 10`. This runs Stable Diffusion inference 10 times on your machine’s GPU and reports the average latency.
- Compare multiple AWS instances (requires AWS CLI/SDK permissions): run `python benchmark.py --instance-types g4dn.xlarge g5.xlarge p3.2xlarge --num-images 50 --aws`. In this example, the tool remotely launches each specified EC2 instance, runs 50 image generations on each, and retrieves the performance metrics. The `--aws` flag indicates that the instance types are AWS EC2 instance types; you can specify any number of them to test sequentially. (A simplified sketch of this remote-launch flow appears after these examples.)
- Use a different Stable Diffusion model or custom pipeline: run `python benchmark.py --model StabilityAI/stable-diffusion-2-1 --num-images 20`. You can point `--model` to any Hugging Face model ID or local checkpoint. This helps evaluate performance for larger models like Stable Diffusion 2.x or SDXL (note: larger models may require GPUs with more VRAM).
- Save results to a file: run `python benchmark.py --instance-types g5.xlarge p3.2xlarge --num-images 30 --output results.csv`. This also dumps the raw metrics to a CSV file for further analysis.
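
As referenced above, the multi-instance mode follows a simple flow: launch an EC2 instance of each requested type, run the benchmark on it, collect the metrics, and terminate the instance. The snippet below is a simplified sketch of that flow using boto3; the region, AMI ID, key pair, and remote-execution details are placeholders, not the repository’s actual implementation:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def launch_instance(instance_type: str) -> str:
    """Launch one EC2 instance of the given type and return its instance ID.

    The benchmark itself would then be run on the instance (e.g., over SSH or
    via an EC2 user-data script) and the resulting metrics copied back.
    """
    resp = ec2.run_instances(
        ImageId="ami-xxxxxxxx",   # placeholder: a Deep Learning AMI with drivers
        InstanceType=instance_type,
        KeyName="my-keypair",     # placeholder key pair
        MinCount=1,
        MaxCount=1,
    )
    return resp["Instances"][0]["InstanceId"]

for itype in ["g4dn.xlarge", "g5.xlarge", "p3.2xlarge"]:
    instance_id = launch_instance(itype)
    print(f"Launched {itype}: {instance_id}")
    # ... run the benchmark remotely, fetch the metrics, then clean up:
    ec2.terminate_instances(InstanceIds=[instance_id])
```
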
Example terminal output: After running a benchmark, you’ll see an output table in the console. For example:
    Benchmarking Stable Diffusion on 3 instance types for 50 images each...

    Instance         GPU            Throughput   Latency/image   VRAM    Cost/100 images
    --------------   -------------  ----------   -------------   ------  ---------------
    g4dn.xlarge      NVIDIA T4      0.52 img/s   1.92 s          16 GB   $0.40
    g5.xlarge        NVIDIA A10G    0.85 img/s   1.18 s          24 GB   $0.44
    p3.2xlarge       NVIDIA V100    0.78 img/s   1.28 s          16 GB   $1.21

    Test completed! Detailed report saved to results.csv
In the above snippet, each row corresponds to one GPU instance type tested. For example, on a g5.xlarge (with an NVIDIA A10G GPU), throughput was 0.85 images/sec (each image ~1.18 sec), and the estimated cost to generate 100 images was ~$0.44 on that instance. The p3.2xlarge (V100 GPU) was slightly slower and costlier in this run, consistent with expectations that newer GPU generations can offer better price-performance.
Figure: Terminal recording of a benchmarking session, testing multiple AWS GPU instances in sequence (g4dn, g5, p3). The script launches each instance, runs Stable Diffusion inference, and prints a summary of throughput, latency, and cost metrics for each.
After running benchmarks, the tool generates a summary table of results. Here is an example aggregated results table comparing several AWS GPU instance types (for Stable Diffusion v1.4, 50 inference steps, 512×512 images):
| Instance | GPU (per instance) | Throughput (img/s) | Latency (s/img) | VRAM (GB) | Cost/100 images ($) |
|---|---|---|---|---|---|
| g3s.xlarge | Tesla M60 (1×) | 0.20 | 5.00 | 8 | 0.37 |
| g4dn.xlarge | T4 (1×) | 0.50 | 2.00 | 16 | 0.40 |
| g5.xlarge | A10G (1×) | 0.85 | 1.18 | 24 | 0.43 |
| g6.xlarge | L4 (1×) | 0.70 | 1.43 | 24 | 1.10 |
| p3.2xlarge | V100 (1×) | 0.78 | 1.28 | 16 | 1.20 |
| p4d.24xlarge | A100 (8×) | 8.9 | 0.90 | 40×8 | 5.27 |
| p5.48xlarge | H100 (8×) | 16.0 | 0.50 | 80×8 | 10.93 |
Table Notes: Throughput and latency are measured per image. Multi-GPU instances (P4d, P5) can process multiple images in parallel – hence the very high throughput numbers. Cost/100 images is estimated using on-demand AWS pricing for each instance type. For example, the G5.xlarge (with a single NVIDIA A10G) shows ~0.85 images/sec and $0.43 per 100 images, making it one of the best in terms of cost-efficiency. In contrast, the P5.48xlarge (8× NVIDIA H100) can generate images extremely fast, but the cost is much higher ($10.93 per 100 images) – suitable only when top performance is worth the price. These results highlight the GPU generation gap: newer GPUs like the H100 drastically reduce inference time, but older GPUs or smaller instances may give more cost-effective throughput for less intensive needs.
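
For readers who want to reproduce the cost column, here is a minimal sketch of how such an estimate can be computed from an hourly on-demand price and a measured run. The prices and times below are placeholders, and the repository’s exact accounting (for example, whether instance setup time is included in the billed window) may differ:

```python
def cost_per_100_images(hourly_price_usd: float,
                        total_wall_clock_s: float,
                        images_generated: int) -> float:
    """Estimate the cost of generating 100 images on a given instance.

    total_wall_clock_s should cover the whole billed window used for the run
    (model load and any setup overhead included), which is why the result can
    be higher than a naive latency-times-price calculation.
    """
    cost_of_run = hourly_price_usd * (total_wall_clock_s / 3600.0)
    return cost_of_run / images_generated * 100.0

# Placeholder example: a $1.00/hour instance that took 600 s to produce 50 images.
print(round(cost_per_100_images(1.00, 600.0, 50), 2))  # -> 0.33
```
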
Figure: Cost per 100 images (lower is better) for various AWS instances. Entry-level GPUs (G3/G4/G5 instances – green bars) have the lowest cost per image, while the high-end multi-GPU servers (P4, P5 – red bars) cost much more per image. This can guide decisions: e.g., use G4/G5 instances for budget-friendly inference, and P5 only if you need ultra-fast generation regardless of cost.
(The above chart corresponds to the table values: e.g., ~$0.4 for g4dn.xlarge vs. ~$10.9 for p5.48xlarge per 100 images.)
Q1: What is Stable Diffusion, and why benchmark it?
Answer: Stable Diffusion is a text-to-image generative model released in 2022 that uses latent diffusion to create images from text prompts. It’s computationally heavy, typically requiring a GPU for reasonable performance. Benchmarking Stable Diffusion across different hardware is important because performance varies widely – older GPUs might take tens of seconds per image, while newer ones can generate images in under a second. By benchmarking, we identify which GPUs or cloud instances offer the best speed and cost-efficiency for image generation. This helps practitioners choose the right infrastructure for their needs (for example, a researcher on a budget might accept a slightly slower GPU if it’s much cheaper to run).
Q2: How are throughput and latency defined in this context?
Answer: In our benchmark, throughput is the number of images generated per second (images/second), averaged over the test run. Latency per image is essentially the inverse: how many seconds on average it takes to generate a single image from a prompt. For example, a throughput of 1.0 img/s equals 1.0 second per image latency. High throughput (and low latency) is better. On multi-GPU instances, we measure as if all GPUs are utilized in parallel – e.g., 8 GPUs might generate 8 images in the time 1 GPU would take to generate 1 image, yielding a higher aggregate throughput.
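
In code terms the relationship is simply the following, with example numbers chosen to match the g5.xlarge row above and the multi-GPU aggregation treated as idealized parallel scaling:

```python
num_images, total_seconds, num_gpus = 50, 58.8, 1  # e.g., 50 images in ~58.8 s

throughput = num_images / total_seconds          # ~0.85 images per second
latency_per_image = total_seconds / num_images   # ~1.18 seconds per image (the inverse)
aggregate_throughput = num_gpus * throughput     # idealized multi-GPU scaling

print(throughput, latency_per_image, aggregate_throughput)
```
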
Q3: Why does a newer GPU like **NVIDIA L4 (G6.xlarge)** appear slower or less cost-effective than A10G (G5.xlarge)?
Answer: The performance of a GPU depends on its architecture and how the Stable Diffusion model utilizes it. In our tests, the A10G (NVIDIA Ampere architecture) on G5.xlarge outperformed the newer L4 (Ada Lovelace architecture) on G6.xlarge in Stable Diffusion inference. This can happen due to differences in GPU clock speeds, memory bandwidth, or inference optimizations. The L4 is optimized for certain INT8 workloads and energy efficiency, but the A10G’s FP16 throughput might be higher for this model. Additionally, AWS pricing for G6 is higher, impacting cost per image. In short, newer doesn’t always mean faster for every model – which is exactly why benchmarking is valuable!
Q4: Which instance type offers the best **bang for the buck**?
Answer: From our results, g4dn.xlarge and g5.xlarge instances offer excellent price-performance for Stable Diffusion. They have low cost per image (~$0.40-$0.45 per 100 images in our tests) while still generating images in ~1-2 seconds. The G5 (A10G GPU) in particular often strikes the best balance, as it’s both faster and only slightly more expensive than G4dn. If cost is your primary concern and generation time of a few seconds per image is acceptable, G4/G5 instances are ideal. On the other hand, if you need maximum speed and have the budget, p5.48xlarge (8× H100 GPUs) is the champion – extremely high throughput and ~0.5 s per image latency – but at ~25× the cost per image of a G4 instance. Intermediate options like p3 (V100) or p4 (A100) instances can be considered if you need faster than G5 but cheaper than P5. Ultimately, “best bang for buck” depends on your specific speed requirements and budget.
Q5: How can I improve Stable Diffusion inference speed on a given GPU?
Answer: There are several ways to optimize Stable Diffusion inference on any GPU:
- Enable half-precision or xFormers: Our benchmark by default uses fp16 precision. Installing xFormers can accelerate the attention mechanism significantly (often a 2x speed boost). If you see a log warning about “No module ‘xformers’... proceeding without it,” consider installing xformers to improve throughput.
 - Optimize batch size: Generating images one-by-one is usually memory efficient, but if you have extra VRAM, you can modify the script to generate multiple images in parallel (batch inference) which increases throughput (at the cost of higher memory use).
 - Lower the number of diffusion steps or use faster samplers: The default 50 steps sampler (PNDM or DDIM) can be substituted with optimized samplers or fewer steps (e.g., 30) to speed up generation, albeit with some trade-off in image quality.
 - Ensure the GPU is being fully utilized: Monitor GPU utilization and clock speeds. Sometimes the GPU may be underutilized due to CPU bottlenecks (try increasing `--num-images` to amortize initialization overhead, or ensure data loading is efficient). Also, make sure you’re not inadvertently running on CPU – check that PyTorch is using CUDA (`torch.cuda.is_available()`). A minimal sketch combining several of these optimizations follows this list.
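
As referenced above, here is a minimal sketch that combines several of these optimizations with Hugging Face Diffusers (fp16, xFormers attention, fewer steps, and a small batch). Treat it as an illustration rather than the exact `benchmark.py` configuration; the prompt is a placeholder and xFormers must be installed for the attention call to succeed:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16  # half precision
).to("cuda")

# Memory-efficient attention via xFormers (skip this line if it isn't installed).
pipe.enable_xformers_memory_efficient_attention()

images = pipe(
    "a fantasy castle on a hill at sunset",  # placeholder prompt
    num_inference_steps=30,      # fewer steps than the default 50
    num_images_per_prompt=4,     # batch inference; needs the extra VRAM
).images
```
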
Even with correct setup, you might encounter some common issues. Here are a few and how to address them:
- `CUDA out of memory` errors: This means the GPU ran out of VRAM while generating an image. Stable Diffusion requires ~10 GB for 512×512 images at fp16. Solution: Try reducing `--num-images` (if you were generating images in parallel), using a smaller image size or fewer steps, or switching to a GPU with more memory. You can also enable CPU offloading or memory optimizations if supported (not enabled by default in our script).
- `No module 'xformers'. Proceeding without it.` message: This isn’t a fatal error – it’s a warning that the xFormers library (which can speed up inference) isn’t installed. The benchmark will still run, but possibly ~2× slower. Solution: If you want maximum speed, install xFormers (`pip install xformers`) and rerun. With xFormers, you should notice improved throughput (and the message will disappear).
- Hugging Face model download issues (e.g., 403 Forbidden or authentication required): The Stable Diffusion model weights are large (over 4 GB) and require acceptance of the license terms. Solution: Make sure you (1) have a Hugging Face account that has accepted the model license, and (2) have configured an API token. You can log in via `huggingface-cli login` or set the `HUGGINGFACE_TOKEN` environment variable. Alternatively, manually download the `sd-v1-4.ckpt` or model files and point `--model` to the local path.
- PyTorch not using the GPU (running on CPU): If inference is extremely slow (e.g., each image taking minutes) and GPU utilization is 0%, the script might be running on the CPU due to an environment issue. Solution: Check that PyTorch is installed with CUDA support (`torch.cuda.is_available()` returns True) – see the quick check after this list. Ensure the CUDA toolkit and drivers are properly installed. If on an AWS instance, make sure you’ve installed the NVIDIA drivers or used a Deep Learning AMI that includes them; restarting the environment after driver installation may be needed. Also verify that the `--device cuda` flag (if available) is set, or that the script picks up the GPU by default.
- Outdated drivers or CUDA version errors: You might see errors if your NVIDIA driver is too old for the version of PyTorch/CUDA being used (e.g., “CUDA driver version is insufficient for CUDA runtime version”). Solution: Update the NVIDIA driver on the system (on AWS, using a Deep Learning AMI that ships with current drivers is often the easiest route). Be sure to match the CUDA toolkit version required by the PyTorch binary.
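
As mentioned in the GPU troubleshooting item above, a quick way to confirm that PyTorch actually sees a CUDA device is:

```python
import torch

# Prints False if PyTorch was installed without CUDA support
# or the NVIDIA driver isn't visible to the process.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g., "NVIDIA A10G"
```
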
If you run into an issue not listed here, please check the GitHub Issues for similar reports or open a new issue. We’re continuously improving the tool and the documentation.
Stable Diffusion has opened up new possibilities in generative art and AI applications, but its hardware demands can be a barrier. GPU cloud instances vary wildly in cost and capability – from older NVIDIA Tesla M60 GPUs (from 2015) to the latest NVIDIA H100 datacenter GPUs. This project emerged from research that systematically benchmarked Stable Diffusion across such instances. Our findings (published in NAJER 2023) highlighted that instance selection can make or break application feasibility – for example, a job that costs $0.40 on one instance might cost $5 on another for the same outcome, or take 10× longer if using outdated hardware. By open-sourcing this benchmarking suite, we hope to help the community in optimizing AI workflows. Whether you’re a researcher trying to minimize cloud compute bills or an enthusiast curious about your GPU, Benchmarking-SD provides clarity on performance and cost through rigorous testing.
This project is open-source licensed under the MIT License – see the LICENSE file for details. This means you can use, modify, and distribute this code freely. If you find this project useful, a citation or a star ⭐ on GitHub is appreciated.
Happy benchmarking, and may your diffusion be stable and speedy! 🚀✨