What is the GPU requirement for Stable Diffusion?

You'll need a PC with a modern AMD or Intel processor, 16 gigabytes of RAM, an NVIDIA RTX GPU with 8 gigabytes of memory, and at least 10 gigabytes of free storage space. A GPU with more memory can generate larger images without requiring upscaling.

Can you run Stable Diffusion without a GPU?

As the technology advances, the hardware requirements for running these AI models are becoming less demanding, which makes it possible to run the tool without a GPU.

What are the best GPU benchmarks?

The best GPU benchmark software includes Heaven UNIGINE, Novabench, PassMark, 3DMark, Geekbench, MSI Afterburner, Basemark, and Cinebench.

What is the speed of Stable Diffusion in A100?

Notably, on an A100 SXM 80GB, OneFlow Stable Diffusion reaches an inference speed of 50 it/s, which means that the 50 sampling steps required to generate an image can be completed in exactly 1 second.

Is 2gb VRAM enough for Stable Diffusion?

The larger your GPU's video RAM (VRAM), the higher the resolution of the images the model can generate. Stability AI states that you need at least 6.9 gigabytes (GB) of VRAM on your GPU to download and use Stable Diffusion.

Why is my GPU usage not stable?

GPU usage can be low or unstable because you're using integrated graphics, there's a driver issue, you have a CPU bottleneck, or the game you're playing isn't optimized. Possible fixes are reinstalling drivers, upgrading or overclocking your CPU, and adjusting certain game settings.

All You Need Is One GPU: Inference Benchmark for Stable Diffusion (2024)

Lambda presents an inference benchmark of Stable Diffusion model with different GPUs and CPUs.

UPDATE 2022-Oct-13 (Turning off autocast for FP16 speeds up inference by 25%)

What do I need for running the state-of-the-art text to image model? Can a gaming card do the job, or should I get a fancy A100? What if I only have a CPU?

To shed light on these questions, we present an inference benchmark of Stable Diffusion on different GPUs and CPUs. These are our findings:

Many consumer-grade GPUs can do a fine job, since Stable Diffusion needs only about 5 seconds and 5 GB of VRAM to run.
For single-image speed, the most powerful Ampere GPU (A100) is only 33% (or 1.85 seconds) faster than the RTX 3080.
By pushing the batch size to the maximum, the A100 can deliver 2.5x the inference throughput of the RTX 3080.

Our benchmark takes a text prompt as input and outputs a 512x512 image. We use the model implementation from Hugging Face's diffusers library and analyze inference performance in terms of speed, memory consumption, throughput, and quality of the output images. We look at how different choices in hardware (GPU model, GPU vs CPU) and software (single vs half precision, PyTorch vs ONNX Runtime) affect inference performance.

For reference, we provide benchmark results for the following GPU devices: A100 80GB PCIe, RTX 3090, RTX A5500, RTX A6000, RTX 3080, RTX 8000. Please refer to the "Reproducing the experiments" section for details on running these experiments in your own environment.

Last but not least, we are excited to see how quickly the community is moving things forward. For example, the "sliced attention" trick can further reduce the VRAM cost to "as little as 3.2 GB", at a small penalty of about 10% slower inference. We also look forward to testing ONNX Runtime with CUDA devices once it becomes more stable in the near future.

Speed

The figure below shows the inference speed when using different hardware and precision for generating a single image using the (arbitrary) text prompt: "a photo of an astronaut riding a horse on mars".
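Single-image latency of this kind can be measured with a small timing harness. The following is a minimal sketch, not the benchmark script itself: `time_inference` and `fake_pipeline` are hypothetical names, and the stand-in function only sleeps instead of running a model. On a GPU you would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import statistics
import time


def time_inference(generate, warmup=1, repeats=5):
    """Return the median wall-clock seconds of one call to `generate`."""
    # Warm-up runs absorb one-off costs such as model loading.
    for _ in range(warmup):
        generate()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def fake_pipeline():
    # Hypothetical stand-in for `pipe(prompt).images[0]`.
    time.sleep(0.01)


latency = time_inference(fake_pipeline)
print(f"median latency: {latency:.3f} s")
```

Reporting the median rather than the mean keeps a single slow run from skewing the result.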

Memory

We also measure the memory consumption of running stable diffusion inference.

Memory usage is observed to be consistent across all tested GPUs:

It takes about 7.7 GB GPU memory to run single-precision inference with batch size one.
It takes about 4.5 GB GPU memory to run half-precision inference with batch size one.
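As a rough illustration of what these numbers mean for card selection, the sketch below checks whether a given amount of VRAM covers the reported usage. The function name `fits` and the 0.5 GB headroom are assumptions made for this example, not measured values.

```python
# Approximate GPU memory (GB) at batch size one, as reported in this article.
MEMORY_GB = {
    "fp32": 7.7,  # single precision
    "fp16": 4.5,  # half precision
}


def fits(vram_gb, precision, headroom_gb=0.5):
    """Return True if `vram_gb` covers the reported usage plus some headroom."""
    # headroom_gb is an assumed safety margin, not a measured value.
    return vram_gb >= MEMORY_GB[precision] + headroom_gb


for vram in (6, 8, 10):
    usable = [p for p in MEMORY_GB if fits(vram, p)]
    print(f"{vram} GB card: {usable or 'neither precision fits'}")
```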

Throughput

So far we have measured how quickly a single input can be processed, which is critical to online applications that don't tolerate even the slightest delay. However, some (offline) applications may focus on "throughput", which measures the total volume of data processed in a fixed amount of time.

Our throughput benchmark pushes the batch size to the maximum for each GPU and measures the number of images it can process per minute. Maximizing the batch size keeps the tensor cores busy so that computation dominates the workload, which avoids non-computational bottlenecks and maximizes throughput.
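The idea can be sketched as follows, assuming a callable that runs one batch and raises an error when the device runs out of memory. The names `largest_batch_size`, `images_per_minute`, and `fake_run_batch` are hypothetical, and `MemoryError` stands in for whatever out-of-memory exception the framework raises. Doubling only finds the largest power of two that fits, so a real search might refine the result further.

```python
def images_per_minute(batch_size, seconds_per_batch):
    """Throughput for a batch that takes `seconds_per_batch` to generate."""
    return batch_size * 60.0 / seconds_per_batch


def largest_batch_size(run_batch, limit=1024):
    """Double the batch size until `run_batch` raises MemoryError."""
    best = 0
    size = 1
    while size <= limit:
        try:
            run_batch(size)
        except MemoryError:
            break
        best = size
        size *= 2
    return best


def fake_run_batch(size):
    # Hypothetical stand-in: pretend the device runs out of memory above 20 images.
    if size > 20:
        raise MemoryError("out of memory")


best = largest_batch_size(fake_run_batch)
print(best, images_per_minute(best, seconds_per_batch=30.0))
```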

Autocast

An update by the Hugging Face team to their diffusers code reported that removing autocast speeds up half-precision inference with PyTorch by ~25%.

With autocast:

with autocast("cuda"):
    image = pipe(prompt).images[0]

Without autocast:

image = pipe(prompt).images[0]

We reproduced the experiment on an NVIDIA RTX A6000 and verified gains in both speed and memory usage. We expect similar improvements on other devices that support half-precision.

In conclusion: DO NOT use autocast in conjunction with FP16.

Precision

We were curious whether half-precision degrades the quality of the output images. To test this, we fixed the text prompt as well as the "latents" input and fed them to both the single-precision model and the half-precision model. We ran the inference 100 times with an increasing number of steps (see note 1 at the end). Both models' outputs, as well as their difference map, were saved for each run.
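A difference map of this kind can be computed as the per-pixel absolute difference between the two outputs. The sketch below is illustrative only: `difference_map` is a hypothetical helper, and the tiny grayscale "images" are made-up values rather than benchmark data.

```python
def difference_map(image_a, image_b):
    """Per-pixel absolute difference of two equally sized grayscale images."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    diff = []
    for row_a, row_b in zip(image_a, image_b):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        diff.append([abs(a - b) for a, b in zip(row_a, row_b)])
    return diff


# Toy 2x3 "images" with 8-bit intensities (illustrative values only).
single_precision = [[10, 200, 30], [40, 50, 60]]
half_precision = [[12, 200, 25], [40, 55, 60]]

diff = difference_map(single_precision, half_precision)
print(diff)
print("max difference:", max(max(row) for row in diff))
```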

Our observation is that there are indeed visible differences between the single-precision output and the half-precision output, especially in the early steps. The differences often decrease with the number of steps, but might not vanish.

Interestingly, such a difference may not imply artifacts in half-precision's outputs. For example, in step 70, the picture below shows half-precision didn't produce the artifact in the single-precision output (an extra front leg):

[Figure: single-precision and half-precision outputs at step 70]

Reproducing the experiments

You can use the Lambda Diffusers repository to reproduce the results presented in this article.

Setup

Before running the benchmark, make sure you have completed the repository installation steps.

You will then need to set the Hugging Face access token:

Create a user account on Hugging Face and generate an access token.
Set your Hugging Face access token as the ACCESS_TOKEN environment variable:

export ACCESS_TOKEN=<hf_...>
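For illustration, a script could read this variable as sketched below. This is an assumption about how such a lookup might be written, not the repository's actual code, and `get_access_token` is a hypothetical helper.

```python
import os


def get_access_token(env=os.environ):
    """Return the Hugging Face token from ACCESS_TOKEN, or fail with a hint."""
    token = env.get("ACCESS_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "ACCESS_TOKEN is not set; export your Hugging Face access token first."
        )
    return token
```

Failing early with a clear message avoids a less obvious authentication error later, when the model weights are downloaded.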

Usage

Launch the benchmark.py script to append benchmark results to the existing benchmark.csv results file:

python ./scripts/benchmark.py

Launch the benchmark_quality.py script to compare the output of single-precision and half-precision models:

python ./scripts/benchmark_quality.py

1. Since both the text prompt and the "latents" input are fixed for each run, this is equivalent to running the inference for 100 steps and saving the intermediate result of each step.