Stable Diffusion Image Generation
State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.
Skill metadata
| Field | Value |
|---|---|
| Source | Optional — install with `hermes skills install official/mlops/stable-diffusion` |
| Path | optional-skills/mlops/stable-diffusion |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0 |
| Tags | Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision |
Reference: full SKILL.md
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
Stable Diffusion Image Generation
Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
When to use Stable Diffusion
Use Stable Diffusion when:
- Generating images from text descriptions
- Performing image-to-image translation (style transfer, enhancement)
- Inpainting (filling in masked regions)
- Outpainting (extending images beyond boundaries)
- Creating variations of existing images
- Building custom image generation workflows
Key features:
- Text-to-Image: Generate images from natural language prompts
- Image-to-Image: Transform existing images with text guidance
- Inpainting: Fill masked regions with context-aware content
- ControlNet: Add spatial conditioning (edges, poses, depth)
- LoRA Support: Efficient fine-tuning and style adaptation
- Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support
Use alternatives instead:
- DALL-E 3: For API-based generation without GPU
- Midjourney: For artistic, stylized outputs
- Imagen: For Google Cloud integration
- Leonardo.ai: For web-based creative workflows
Quick start
Installation
```bash
pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention
```
Basic text-to-image
```python
from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

image.save("output.png")
```
Using SDXL (higher quality)
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Enable memory optimization (this handles device placement itself,
# so do not also call pipe.to("cuda"))
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
```
Architecture overview
Three-pillar design
Diffusers is built around three core components:
```text
Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)
```
Pipeline inference flow
```text
Text Prompt → Text Encoder → Text Embeddings
                                   ↓
Random Noise → [Denoising Loop] ← Scheduler
                     ↓
              Predicted Noise
                     ↓
              VAE Decoder → Final Image
```
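To make this flow concrete, the sketch below drives the three pillars by hand, following the standard Diffusers component API; classifier-free guidance and image post-processing are omitted for brevity, and the prompt and step count are illustrative:

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, EulerDiscreteScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
device = "cuda"

# Load each pillar separately
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

# Text Prompt -> Text Encoder -> Text Embeddings
tokens = tokenizer(
    "A serene mountain landscape",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to(device))[0]

# Random noise in latent space (64x64 latents decode to a 512x512 image)
latents = torch.randn(1, unet.config.in_channels, 64, 64, device=device)
scheduler.set_timesteps(25)
latents = latents * scheduler.init_noise_sigma

# Denoising loop: the UNet predicts noise, the scheduler removes it
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# VAE decoder turns latents into the final image tensor
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample
```

Every high-level pipeline is essentially this loop plus prompt handling, guidance, and post-processing.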
Core concepts
Pipelines
Pipelines orchestrate complete workflows:
| Pipeline | Purpose |
|---|---|
| `StableDiffusionPipeline` | Text-to-image (SD 1.x/2.x) |
| `StableDiffusionXLPipeline` | Text-to-image (SDXL) |
| `StableDiffusion3Pipeline` | Text-to-image (SD 3.0) |
| `FluxPipeline` | Text-to-image (Flux models) |
| `StableDiffusionImg2ImgPipeline` | Image-to-image |
| `StableDiffusionInpaintPipeline` | Inpainting |
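If you would rather not memorize class names, the `AutoPipelineFor*` helpers resolve the right pipeline for a given checkpoint and task; a minimal sketch (the model ID is illustrative):

```python
import torch
from diffusers import (
    AutoPipelineForText2Image,
    AutoPipelineForImage2Image,
    AutoPipelineForInpainting,
)

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"

# Resolves to the matching text-to-image pipeline for this checkpoint
t2i = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16)

# from_pipe reuses the already-loaded components instead of reloading them
i2i = AutoPipelineForImage2Image.from_pipe(t2i)
inpaint = AutoPipelineForInpainting.from_pipe(t2i)
```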
Schedulers
Schedulers control the denoising process:
| Scheduler | Steps | Quality | Use Case |
|---|---|---|---|
| `EulerDiscreteScheduler` | 20-50 | Good | Default choice |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast, high quality |
| `DDIMScheduler` | 50-100 | Good | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very fast |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast convergence |
Swapping schedulers
```python
from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
image = pipe(prompt, num_inference_steps=20).images[0]
```
Generation parameters
Key parameters
| Parameter | Default | Description |
|---|---|---|
| `prompt` | Required | Text description of desired image |
| `negative_prompt` | None | What to avoid in the image |
| `num_inference_steps` | 50 | Denoising steps (more = better quality, slower) |
| `guidance_scale` | 7.5 | Prompt adherence (7-12 typical) |
| `height`, `width` | 512 (SD 1.x) / 1024 (SDXL) | Output dimensions (multiples of 8) |
| `generator` | None | Torch generator for reproducibility |
| `num_images_per_prompt` | 1 | Images generated per prompt (batch size) |
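Putting the table together, a single call can combine all of these parameters; the prompt text and seed below are illustrative:

```python
import torch

generator = torch.Generator(device="cuda").manual_seed(0)

images = pipe(
    prompt="Professional photo of a lighthouse at dawn",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,     # more steps = higher quality, slower
    guidance_scale=7.5,         # how strongly to follow the prompt
    height=512,                 # multiples of 8
    width=512,
    generator=generator,        # fixed seed for reproducibility
    num_images_per_prompt=2,    # two variations per prompt
).images
```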
Reproducible generation
```python
import torch

# A fixed seed produces the same image for the same model and settings
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50,
).images[0]
```
Negative prompts
```python
image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5,
).images[0]
```
Image-to-image
Transform existing images with text guidance:
```python
import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0 = keep input, 1 = replace entirely)
    num_inference_steps=50,
).images[0]
```
Inpainting
Fill masked regions:
```python
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("photo.jpg")
mask = Image.open("mask.png")  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
```
ControlNet
Add spatial conditioning for precise control:
```python
import cv2  # opencv-python; any edge detector works here
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def get_canny_image(image: Image.Image) -> Image.Image:
    """Extract Canny edges and replicate them to a 3-channel control image."""
    edges = cv2.Canny(np.array(image), 100, 200)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Use Canny edge image as control
input_image = Image.open("input.jpg")
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30,
).images[0]
```
Available ControlNets
| ControlNet | Input Type | Use Case |
|---|---|---|
| canny | Edge maps | Preserve structure |
| openpose | Pose skeletons | Human poses |
| depth | Depth maps | 3D-aware generation |
| normal | Normal maps | Surface details |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketches | Sketch-to-image |
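Other conditioning types follow the same pattern as the Canny example above; only the ControlNet checkpoint and the preprocessing step change. A sketch for pose conditioning, assuming the `controlnet_aux` package and the `lllyasviel/control_v11p_sd15_openpose` checkpoint:

```python
import torch
from controlnet_aux import OpenposeDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Pose-conditioned ControlNet instead of Canny
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Extract a pose skeleton from a reference photo to use as the control image
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_image = openpose(Image.open("reference.jpg"))

image = pipe(
    prompt="An astronaut dancing on the moon",
    image=pose_image,
    num_inference_steps=30,
).images[0]
```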
LoRA adapters
Load fine-tuned style adapters:
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Fuse the LoRA into the base weights at a chosen strength
pipe.fuse_lora(lora_scale=0.8)

# Unload LoRA
pipe.unload_lora_weights()
```
Multiple LoRAs
Section titled “Multiple LoRAs”# Load multiple LoRAspipe.load_lora_weights("lora1", adapter_name="style")pipe.load_lora_weights("lora2", adapter_name="character")
# Set weights for eachpipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
image = pipe("A portrait").images[0]Memory optimization
Enable CPU offloading
```python
# Model CPU offload - moves whole models to CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload - offloads layer by layer; saves more memory but is slower
pipe.enable_sequential_cpu_offload()
```
Attention slicing
```python
# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or "max" for one slice at a time (lowest memory, slowest)
pipe.enable_attention_slicing("max")
```
xFormers memory-efficient attention
```python
# Requires the xformers package
pipe.enable_xformers_memory_efficient_attention()
```
VAE slicing for large images
```python
# Decode batched latents one image at a time
pipe.enable_vae_slicing()
# Decode large images in overlapping tiles
pipe.enable_vae_tiling()
```
Model variants
Loading different precisions
Section titled “Loading different precisions”# FP16 (recommended for GPU)pipe = DiffusionPipeline.from_pretrained( "model-id", torch_dtype=torch.float16, variant="fp16")
# BF16 (better precision, requires Ampere+ GPU)pipe = DiffusionPipeline.from_pretrained( "model-id", torch_dtype=torch.bfloat16)Loading specific components
Section titled “Loading specific components”from diffusers import UNet2DConditionModel, AutoencoderKL
# Load custom VAEvae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
# Use with pipelinepipe = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16)Batch generation
Generate multiple images efficiently:
```python
# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
]
images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30,
).images
```
Common workflows
Workflow 1: High-quality generation
Section titled “Workflow 1: High-quality generation”from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepSchedulerimport torch
# 1. Load SDXL with optimizationspipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16")pipe.to("cuda")pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)pipe.enable_model_cpu_offload()
# 2. Generate with quality settingsimage = pipe( prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur", negative_prompt="blurry, low quality, cartoon, anime, sketch", num_inference_steps=30, guidance_scale=7.5, height=1024, width=1024).images[0]Workflow 2: Fast prototyping
```python
from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in ~1 second
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
```
Common issues
CUDA out of memory:
```python
# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
```
Black/noise images:
```python
# Black images are often the safety checker replacing flagged outputs;
# bypass it if it misfires (use responsibly)
pipe.safety_checker = None

# Noise images often come from a mismatched VAE or inconsistent dtypes;
# check the VAE configuration and ensure dtype consistency
pipe = pipe.to(dtype=torch.float16)
```
Slow generation:
```python
# Use a faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
```
References
- Advanced Usage - Custom pipelines, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
Resources
- Documentation: https://huggingface.co/docs/diffusers
- Repository: https://github.com/huggingface/diffusers
- Model Hub: https://huggingface.co/models?library=diffusers
- Discord: https://discord.gg/diffusers