Stable Diffusion Image Generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

Source: Optional (install with hermes skills install official/mlops/stable-diffusion)
Path: optional-skills/mlops/stable-diffusion
Version: 1.0.0
Author: Orchestra Research
License: MIT
Dependencies: diffusers>=0.30.0, transformers>=4.41.0, accelerate>=0.31.0, torch>=2.0.0
Tags: Image Generation, Stable Diffusion, Diffusers, Text-to-Image, Multimodal, Computer Vision

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

Use Stable Diffusion when:

  • Generating images from text descriptions
  • Performing image-to-image translation (style transfer, enhancement)
  • Inpainting (filling in masked regions)
  • Outpainting (extending images beyond boundaries)
  • Creating variations of existing images
  • Building custom image generation workflows

Key features:

  • Text-to-Image: Generate images from natural language prompts
  • Image-to-Image: Transform existing images with text guidance
  • Inpainting: Fill masked regions with context-aware content
  • ControlNet: Add spatial conditioning (edges, poses, depth)
  • LoRA Support: Efficient fine-tuning and style adaptation
  • Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

  • DALL-E 3: For API-based generation without GPU
  • Midjourney: For artistic, stylized outputs
  • Imagen: For Google Cloud integration
  • Leonardo.ai: For web-based creative workflows

Install the dependencies (terminal):

pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention

Basic text-to-image generation:

from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]
image.save("output.png")

Text-to-image with SDXL:

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)

# Enable memory optimization. Model CPU offload manages device placement
# itself, so do not also call pipe.to("cuda").
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30
).images[0]

Diffusers is built around three core components:

Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)

Generation flow:

Text Prompt → Text Encoder → Text Embeddings
                                   ↓
Random Noise → [Denoising Loop] ← Scheduler
                       ↓
                Predicted Noise
                       ↓
                VAE Decoder → Final Image
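
The flow above can be pictured with a deliberately simplified loop. This is a toy sketch, not the Diffusers implementation: `toy_noise_predictor` and `toy_scheduler_step` are hypothetical stand-ins for the UNet and the scheduler, the "latent" is a short list of floats rather than a tensor, and the `target` list plays the role that the text embeddings play in a real pipeline.

```python
import random

def toy_noise_predictor(latent, target):
    """Stand-in for the UNet/Transformer: estimate the noise left in the latent."""
    return [x - t for x, t in zip(latent, target)]

def toy_scheduler_step(latent, predicted_noise, steps_remaining):
    """Stand-in for the scheduler: remove an equal share of the remaining noise."""
    return [x - n / steps_remaining for x, n in zip(latent, predicted_noise)]

def toy_denoise(target, num_inference_steps, seed=0):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from random noise
    for step in range(num_inference_steps):
        predicted_noise = toy_noise_predictor(latent, target)
        latent = toy_scheduler_step(
            latent, predicted_noise, num_inference_steps - step
        )
    return latent  # a real pipeline would now pass the latent to the VAE decoder

clean = [0.5, -0.25, 1.0]
result = toy_denoise(clean, num_inference_steps=20)
```

The point is only the control flow: the model predicts noise, the scheduler decides how much of it to remove at each step, and the loop repeats for `num_inference_steps` iterations.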

Pipelines orchestrate complete workflows:

Pipeline                          Purpose
StableDiffusionPipeline           Text-to-image (SD 1.x/2.x)
StableDiffusionXLPipeline         Text-to-image (SDXL)
StableDiffusion3Pipeline          Text-to-image (SD 3.0)
FluxPipeline                      Text-to-image (Flux models)
StableDiffusionImg2ImgPipeline    Image-to-image
StableDiffusionInpaintPipeline    Inpainting

Schedulers control the denoising process:

Scheduler                          Steps     Quality     Use Case
EulerDiscreteScheduler             20-50     Good        Default choice
EulerAncestralDiscreteScheduler    20-50     Good        More variation
DPMSolverMultistepScheduler        15-25     Excellent   Fast, high quality
DDIMScheduler                      50-100    Good        Deterministic
LCMScheduler                       4-8       Good        Very fast
UniPCMultistepScheduler            15-25     Excellent   Fast convergence
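
Swap the scheduler on an already loaded pipeline:

from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation (reuses the existing scheduler's config)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

# Now generate with fewer steps
prompt = "A serene mountain landscape at sunset, highly detailed"
image = pipe(prompt, num_inference_steps=20).images[0]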
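
As a small convenience, the step ranges from the scheduler table can be kept in a lookup so a step count follows the scheduler in use. This is a sketch only: `RECOMMENDED_STEPS` and `suggest_steps` are hypothetical names, and the ranges are copied from the table rather than taken from the library.

```python
# Step ranges copied from the scheduler table in this guide.
RECOMMENDED_STEPS = {
    "EulerDiscreteScheduler": (20, 50),
    "EulerAncestralDiscreteScheduler": (20, 50),
    "DPMSolverMultistepScheduler": (15, 25),
    "DDIMScheduler": (50, 100),
    "LCMScheduler": (4, 8),
    "UniPCMultistepScheduler": (15, 25),
}

def suggest_steps(scheduler_name, prefer="balanced"):
    """Pick a step count from the recommended range for a scheduler class name."""
    low, high = RECOMMENDED_STEPS[scheduler_name]
    if prefer == "fast":
        return low
    if prefer == "quality":
        return high
    return (low + high) // 2  # midpoint of the range

steps = suggest_steps("DPMSolverMultistepScheduler")
```

With a loaded pipeline, the class name could be read with `type(pipe.scheduler).__name__` and the result passed as `num_inference_steps`.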

Key generation parameters:

Parameter                Default     Description
prompt                   Required    Text description of the desired image
negative_prompt          None        What to avoid in the image
num_inference_steps      50          Denoising steps (more steps usually improve quality but take longer)
guidance_scale           7.5         Prompt adherence (7-12 typical)
height, width            512/1024    Output dimensions in pixels (multiples of 8; the default depends on the model)
generator                None        Torch generator for reproducibility
num_images_per_prompt    1           Images generated per prompt (batch size)
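
Because height and width should be multiples of 8, it can help to snap arbitrary sizes before calling the pipeline. The helper below is a hypothetical sketch (`snap_to_multiple` and `fit_dimensions` are not Diffusers functions); it rounds down so the result never exceeds the requested size.

```python
def snap_to_multiple(value, multiple=8):
    """Round down to the nearest multiple, but never below one multiple."""
    return max(multiple, (int(value) // multiple) * multiple)

def fit_dimensions(width, height, multiple=8):
    """Return a (width, height) pair that satisfies the multiple-of-8 rule."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)

size = fit_dimensions(770, 515)
```

The resulting pair could then be passed as `width=size[0], height=size[1]`.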

Reproducible generation with a fixed seed:

import torch

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50
).images[0]

Negative prompts:

image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5
).images[0]
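
To make `guidance_scale` and `negative_prompt` less abstract, here is the standard classifier-free guidance combination written on plain Python lists. It is an illustrative sketch: `apply_guidance` is a hypothetical helper, and real pipelines apply the same formula to tensors inside the denoising loop.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional (or negative-prompt) branch toward the prompt branch."""
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]

noise_uncond = [0.0, 1.0]  # prediction for the empty or negative prompt
noise_text = [1.0, 3.0]    # prediction for the actual prompt
guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
```

With `guidance_scale=1.0` the result equals the prompt-conditioned prediction, which is why very low values effectively switch guidance off. When a negative prompt is supplied, its prediction takes the place of the unconditional branch.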

Transform existing images with text guidance:

import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0 = keep input, 1 = ignore input)
    num_inference_steps=50
).images[0]
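
The `strength` value also affects how many denoising steps actually run: the input image is noised part of the way, and only the remaining portion of the schedule is executed. The sketch below mirrors that arithmetic as I understand it from the Diffusers image-to-image pipelines; `effective_img2img_steps` is a hypothetical helper, so treat the exact count as an approximation.

```python
def effective_img2img_steps(num_inference_steps, strength):
    """Approximate number of denoising steps run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)

steps_run = effective_img2img_steps(50, 0.75)
```

So the example above, with `strength=0.75` and `num_inference_steps=50`, runs roughly 37 steps rather than 50.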

Fill masked regions:

import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16
).to("cuda")

# Image and mask must have the same size
image = Image.open("photo.jpg").resize((512, 512))
mask = Image.open("mask.png").resize((512, 512))  # White = inpaint region

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]
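
If no mask file exists yet, a simple rectangular mask can be drawn with Pillow. This is a minimal sketch: `make_box_mask` is a hypothetical helper, and the box coordinates are arbitrary example values.

```python
from PIL import Image, ImageDraw

def make_box_mask(size, box):
    """Return a grayscale mask: white (255) inside box, black (0) elsewhere.

    size is (width, height); box is (left, top, right, bottom) in pixels.
    """
    mask = Image.new("L", size, 0)                 # start fully black (keep)
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # white = region to inpaint
    return mask

mask = make_box_mask((512, 512), (128, 128, 384, 384))
```

The returned image can be passed as `mask_image`, provided it matches the size of the input image.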

Add spatial conditioning for precise control:

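import cv2  # pip install opencv-python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

def get_canny_image(image, low_threshold=100, high_threshold=200):
    # Helper (not part of Diffusers): build a 3-channel Canny edge map
    edges = cv2.Canny(np.array(image.convert("RGB")), low_threshold, high_threshold)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Use Canny edge image as control
input_image = Image.open("input.jpg").resize((512, 512))
control_image = get_canny_image(input_image)

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30
).images[0]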

ControlNet    Input Type        Use Case
canny         Edge maps         Preserve structure
openpose      Pose skeletons    Human poses
depth         Depth maps        3D-aware generation
normal        Normal maps       Surface details
mlsd          Line segments     Architectural lines
scribble      Rough sketches    Sketch-to-image

Load fine-tuned style adapters:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Optionally fuse the LoRA into the base weights at a chosen strength
pipe.fuse_lora(lora_scale=0.8)

# Undo the fusion, then unload the LoRA
pipe.unfuse_lora()
pipe.unload_lora_weights()

Combine multiple LoRAs (requires the peft package):

# Load multiple LoRAs
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])
image = pipe("A portrait").images[0]
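
Conceptually, a LoRA adds a scaled low-rank update to a base weight matrix, and the scale (such as `lora_scale` or an adapter weight) controls how strongly the style applies. The toy below shows that arithmetic on nested lists. It is a simplified sketch with hypothetical helper names: it ignores the alpha/rank factor that real LoRA implementations also apply, and it handles a single matrix only.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(base_weight, lora_down, lora_up, scale):
    """Return base_weight + scale * (lora_up @ lora_down)."""
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, delta)
    ]

base = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]    # rank-1 "down" projection, shape (1, 2)
up = [[0.5], [1.0]]    # rank-1 "up" projection, shape (2, 1)

full = apply_lora(base, down, up, scale=1.0)
half = apply_lora(base, down, up, scale=0.5)
```

A scale of 0 leaves the base weights unchanged, which is the intuition behind lowering an adapter weight to soften a style.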

Memory optimizations (applied to an already loaded pipe):

CPU offloading (choose one):

# Model CPU offload - moves models to CPU when not in use
pipe.enable_model_cpu_offload()

# Sequential CPU offload - more aggressive, slower
pipe.enable_sequential_cpu_offload()

Attention slicing:

# Reduce memory by computing attention in chunks
pipe.enable_attention_slicing()

# Or use "max" to process one slice at a time for the largest savings
pipe.enable_attention_slicing("max")

xFormers attention:

# Requires xformers package
pipe.enable_xformers_memory_efficient_attention()

VAE slicing and tiling:

# Decode batched latents one image at a time
pipe.enable_vae_slicing()
# Decode latents in tiles for large images
pipe.enable_vae_tiling()

Reduced precision:

# FP16 (recommended for GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16"
)

# BF16 (wider dynamic range than FP16, so fewer overflows; requires Ampere+ GPU)
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16
)
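
A rough way to see why reduced precision helps is to estimate the memory taken by the weights alone. The sketch below uses the standard byte sizes of each dtype; `weight_memory_gib` is a hypothetical helper, and the estimate excludes activations, latents, and framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gib(num_parameters, dtype="float16"):
    """Estimate the memory used by model weights only, in GiB."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 2**30

fp32_gib = weight_memory_gib(2**30, "float32")
fp16_gib = weight_memory_gib(2**30, "float16")
```

For a loaded UNet-based pipeline, the parameter count could be obtained with `sum(p.numel() for p in pipe.unet.parameters())`.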

Use a custom VAE:

import torch
from diffusers import AutoencoderKL, DiffusionPipeline

# Load custom VAE in the same dtype as the pipeline to avoid dtype mismatches
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse",
    torch_dtype=torch.float16
)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16
)

Generate multiple images efficiently:

# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture"
]
images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30
).images
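
Large prompt lists may not fit in GPU memory as a single batch, so one option is to split them into smaller batches first. This is a sketch under that assumption; `chunked` is a hypothetical helper, and a suitable batch size depends on the model and GPU.

```python
def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
    "A fox writing a letter",
    "An owl solving a puzzle",
]
batches = chunked(prompts, batch_size=2)

# With a loaded pipeline, each batch would be generated in turn, e.g.:
#   images = []
#   for batch in batches:
#       images.extend(pipe(batch, num_inference_steps=30).images)
```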

High-quality generation with SDXL:

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Model CPU offload handles device placement, so pipe.to("cuda") is not needed
pipe.enable_model_cpu_offload()

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024
).images[0]

Fast generation with LCM-LoRA:

from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in 4 steps; LCM works best with little or no guidance.
# Actual speed depends on the GPU.
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0
).images[0]

CUDA out of memory:

# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

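# Check VAE configuration (a VAE that overflows in FP16 can produce black images)
# The safety checker replaces flagged outputs with black images;
# bypass it only if that is appropriate for your use case
pipe.safety_checker = None
# Ensure proper dtype consistency across all components
pipe = pipe.to(dtype=torch.float16)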

Slow generation:

# Use faster scheduler
from diffusers import DPMSolverMultistepScheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]