An Introduction to Stable Diffusion

  • 1. What is Stable Diffusion?
    • 1.1. Variational Auto-Encoder (VAE)
    • 1.2. U-Net
    • 1.3. Text-Encoder
    • 1.4. Why is latent diffusion fast and efficient?
    • 1.5. Stable Diffusion during inference
  • 2. Stable Diffusion Deep Dive
    • 2.1. token_embedding and position_embedding
  • References

stable_diffusion.ipynb
https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb

1. What is Stable Diffusion?

Now, let’s go into the theoretical part of Stable Diffusion.

Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models https://arxiv.org/abs/2112.10752.

General diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step in order to arrive at a sample of interest, such as an image. For a more detailed overview of how they work, check this colab https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb.

Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging to train these models and also to use them for inference.

Latent diffusion can reduce the memory and compute complexity by applying the diffusion process over a lower dimensional latent space, instead of using the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion the model is trained to generate latent (compressed) representations of the images.

There are three main components in latent diffusion.

  1. A Variational Auto-Encoder (VAE).
  2. A U-Net https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb.
  3. A text-encoder, e.g. CLIP’s Text Encoder https://huggingface.co/docs/transformers/model_doc/clip.
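
The three components can be loaded individually with the diffusers and transformers libraries. The snippet below is a minimal sketch (not part of the original notebook); the "CompVis/stable-diffusion-v1-4" checkpoint is one common choice, and any Stable Diffusion v1.x checkpoint with the same repository layout should work.

from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"

# Each component lives in its own subfolder of the checkpoint repository
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")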

1.1. Variational Auto-Encoder (VAE)

The VAE model has two parts, an encoder and a decoder.

  • The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model.
  • The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, during inference we only need the VAE decoder.
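
As a minimal sketch of the encoder side (assuming the AutoencoderKL loaded above as `vae`, and an `image` tensor of shape (1, 3, 512, 512) scaled to [-1, 1]; 0.18215 is the scaling factor used by the Stable Diffusion v1 VAE):

import torch

with torch.no_grad():
    # image -> latents; the scaling factor keeps the latents roughly unit-variance
    latents = vae.encode(image).latent_dist.sample() * 0.18215

print(latents.shape)  # torch.Size([1, 4, 64, 64])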

1.2. U-Net

The U-Net has an encoder part and a decoder part, both consisting of ResNet blocks.

The encoder compresses an image representation into a lower-resolution representation, and the decoder decodes that lower-resolution representation back into the original, higher-resolution image representation, which is supposedly less noisy.

More specifically, the U-Net output predicts the noise residual which can be used to compute the predicted denoised image representation.

To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder.

Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
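
In code, the conditioning enters through a single keyword argument. The call below is a sketch; it assumes the `unet` loaded above, a latent tensor `latents` of shape (1, 4, 64, 64), a timestep `t`, and CLIP text embeddings `text_embeddings` of shape (1, 77, 768):

import torch

with torch.no_grad():
    # The text embeddings are injected via the cross-attention layers
    noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # same shape as `latents`: the predicted noise residual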

1.3. Text-Encoder

The text-encoder is responsible for transforming the input prompt, e.g. “An astronaut riding a horse”, into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.

Inspired by Imagen https://imagen.research.google/, Stable Diffusion does not train the text-encoder during training and simply uses CLIP’s already-trained text encoder, CLIPTextModel https://huggingface.co/docs/transformers/model_doc/clip.

1.4. Why is latent diffusion fast and efficient?

Since the U-Net of latent diffusion models operates on a low-dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space, which requires roughly 8 × 8 = 64 times less memory.
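
The arithmetic behind that factor is simple (a rough check; channel counts are ignored here, as in the text above):

image_side, reduction_factor = 512, 8
latent_side = image_side // reduction_factor   # 64

# Ratio of spatial positions: 8 x 8 = 64 times fewer elements per channel
print((image_side ** 2) // (latent_side ** 2))  # 64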

This is why it’s possible to generate 512 × 512 images so quickly, even on 16GB Colab GPUs!

1.5. Stable Diffusion during inference

Putting it all together, let’s now take a closer look at how the model works in inference by illustrating the logical flow.

[Figure: the logical flow of Stable Diffusion during inference]

The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate a random latent image representation of size 64 × 64, whereas the text prompt is transformed into text embeddings of size 77 × 768 via CLIP’s text encoder.
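
A sketch of preparing these two inputs, assuming the `unet`, `tokenizer`, and `text_encoder` from the earlier loading sketch:

import torch

prompt = "An astronaut riding a horse"

generator = torch.manual_seed(0)  # the "latent seed"
latents = torch.randn((1, unet.config.in_channels, 64, 64), generator=generator)

text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]  # shape (1, 77, 768)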

Next the U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons.

For Stable Diffusion, we recommend using one of:

  • PNDM scheduler https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py (used by default).
  • K-LMS scheduler https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lms_discrete.py.
  • Heun Discrete scheduler https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_heun_discrete.py.
  • DPM Solver Multistep scheduler https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_dpmsolver_multistep.py. This scheduler is able to achieve great quality in fewer steps. You can try 25 instead of the default 50, as sketched just after this list!
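
Swapping schedulers is a one-liner when working with a full pipeline. The sketch below assumes a StableDiffusionPipeline already loaded as `pipe` (such an object is not constructed elsewhere in this post):

from diffusers import DPMSolverMultistepScheduler

# Reuse the pipeline's scheduler config so the noise schedule stays consistent
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("An astronaut riding a horse", num_inference_steps=25).images[0]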

Theory on how the scheduler algorithms function is out of scope for this notebook, but in short, one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual.

For more information, we recommend looking into Elucidating the Design Space of Diffusion-Based Generative Models https://arxiv.org/abs/2206.00364

The denoising process is repeated ca. 50 times to step-by-step retrieve better latent image representations.
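
A schematic version of that loop, with classifier-free guidance omitted for brevity; it assumes the `unet`, `latents`, and `text_embeddings` from the sketches above and the `model_id` from the loading sketch:

import torch
from diffusers import PNDMScheduler

scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")
scheduler.set_timesteps(50)                      # ~50 denoising steps
latents = latents * scheduler.init_noise_sigma   # scale the initial noise for this scheduler

for t in scheduler.timesteps:
    latent_model_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # The scheduler combines the current latents and the predicted noise residual
    # into the (slightly more) denoised latents for the next step
    latents = scheduler.step(noise_pred, t, latents).prev_sample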

Once complete, the latent image representation is decoded by the decoder part of the variational auto-encoder.
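
A sketch of that final decoding step, assuming the `vae` and `latents` from above (0.18215 is again the Stable Diffusion v1 scaling factor):

import torch
from PIL import Image

with torch.no_grad():
    decoded = vae.decode(latents / 0.18215).sample   # latents -> image, shape (1, 3, 512, 512)

decoded = (decoded / 2 + 0.5).clamp(0, 1)            # [-1, 1] -> [0, 1]
array = (decoded[0].permute(1, 2, 0).cpu().numpy() * 255).round().astype("uint8")
Image.fromarray(array)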

After this brief introduction to Latent and Stable Diffusion, let’s see how to make advanced use of Hugging Face Diffusers!

2. Stable Diffusion Deep Dive

Hugging Face Diffusers
https://github.com/huggingface/diffusers

Stable Diffusion Deep Dive.ipynb
https://github.com/fastai/diffusion-nbs/blob/master/Stable%20Diffusion%20Deep%20Dive.ipynb

Textual Inversion
https://huggingface.co/docs/diffusers/en/training/text_inversion

Before running the script, make sure you install the library from source:

(base) yongqiang@yongqiang:~/stable_diffusion_work$ git clone https://github.com/huggingface/diffusers
(base) yongqiang@yongqiang:~/stable_diffusion_work$ cd diffusers/
(base) yongqiang@yongqiang:~/stable_diffusion_work/diffusers$ pip install .
(base) yongqiang@yongqiang:~/stable_diffusion_work/diffusers$
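
A quick way to confirm that the from-source install is the one being picked up (the printed version depends on the commit you cloned):

(base) yongqiang@yongqiang:~/stable_diffusion_work/diffusers$ python -c "import diffusers; print(diffusers.__version__)"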

2.1. token_embedding and position_embedding

We use a text encoder model to turn our text into a set of embeddings which are fed to the diffusion model as conditioning.

We begin with tokenization:

# Our text prompt
prompt = 'A picture of a puppy'

# Turn the text into a sequence of tokens
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True,
                       return_tensors="pt")
print("\ntokenizer.model_max_length:", tokenizer.model_max_length)
print("text_input['input_ids'].shape:", text_input['input_ids'].shape)
print("text_input['input_ids']:\n", text_input['input_ids'])
print("text_input['attention_mask'].shape:", text_input['attention_mask'].shape)
print("text_input['attention_mask']:\n", text_input['attention_mask'])
tokenizer.model_max_length: 77
text_input['input_ids'].shape: torch.Size([1, 77])
text_input['input_ids']:
 tensor([[49406,   320,  1674,   539,   320,  6829, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407, 49407,
         49407, 49407, 49407, 49407, 49407, 49407, 49407]])
text_input['attention_mask'].shape: torch.Size([1, 77])
text_input['attention_mask']:
 tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0]])
# See the individual tokens
print("")
for t in text_input['input_ids'][0][:10]:  # We'll just look at the first 10 to save you from a wall of '<|endoftext|>'
    print(t, tokenizer.decoder.get(int(t)))
print("")

tensor(49406) <|startoftext|>
tensor(320) a</w>
tensor(1674) picture</w>
tensor(539) of</w>
tensor(320) a</w>
tensor(6829) puppy</w>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>
tensor(49407) <|endoftext|>

We can get the final (output) embeddings like so:

# Grab the output embeddings
output_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
print('output_embeddings.shape:', output_embeddings.shape)
print('output_embeddings:\n', output_embeddings)

print('\ntext_encoder.text_model.embeddings:\n', text_encoder.text_model.embeddings)

The tokens are transformed into a set of input embeddings, which are then fed through the transformer model to get the final output embeddings.

output_embeddings.shape: torch.Size([1, 77, 768])
output_embeddings:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<NativeLayerNormBackward0>)

text_encoder.text_model.embeddings:
 CLIPTextEmbeddings(
  (token_embedding): Embedding(49408, 768)
  (position_embedding): Embedding(77, 768)
)
  • Token embeddings

The token is fed to the token_embedding to transform it into a vector. The function name get_input_embeddings here is misleading since these token embeddings need to be combined with the position embeddings before they are actually used as inputs to the model!

#
# token embeddings
#

# Access the embedding layer
token_emb_layer = text_encoder.text_model.embeddings.token_embedding
# Vocab size 49408, emb_dim 768
print('\ntoken_emb_layer:', token_emb_layer)

# Embed a token - in this case the one for 'puppy'
one_token_embedding = token_emb_layer(torch.tensor(6829, device=torch_device))
# 768-dim representation
print('\none_token_embedding.shape:', one_token_embedding.shape)
# print('one_token_embedding:\n', one_token_embedding)

token_embeddings = token_emb_layer(text_input.input_ids.to(torch_device))
# batch size 1, 77 tokens, 768 values for each
print('\ntoken_embeddings.shape:', token_embeddings.shape)
print('token_embeddings:\n', token_embeddings)

This single token has been mapped to a 768-dimensional vector - the token embedding.

token_emb_layer: Embedding(49408, 768)

one_token_embedding.shape: torch.Size([768])

token_embeddings.shape: torch.Size([1, 77, 768])
token_embeddings:
 tensor([[[ 0.0011,  0.0032,  0.0003,  ..., -0.0018,  0.0003,  0.0019],
         [ 0.0013, -0.0011, -0.0126,  ..., -0.0124,  0.0120,  0.0080],
         [ 0.0235, -0.0118,  0.0110,  ...,  0.0049,  0.0078,  0.0160],
         ...,
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052],
         [ 0.0012,  0.0077, -0.0011,  ..., -0.0015,  0.0009,  0.0052]]],
       grad_fn=<EmbeddingBackward0>)
  • Positional Embeddings

Positional embeddings tell the model where in a sequence a token is. Much like the token embedding, this is a set of (optionally learnable) parameters. But now instead of dealing with ~50k tokens we just need one for each position (77 total).

#
# position embeddings
#

pos_emb_layer = text_encoder.text_model.embeddings.position_embedding
print('\npos_emb_layer:', pos_emb_layer)

position_ids = text_encoder.text_model.embeddings.position_ids[:, :77]
position_embeddings = pos_emb_layer(position_ids)
print('\nposition_embeddings.shape:', position_embeddings.shape)
print('position_embeddings:\n', position_embeddings)

We can get the positional embedding for each position:

pos_emb_layer: Embedding(77, 768)

position_embeddings.shape: torch.Size([1, 77, 768])
position_embeddings:
 tensor([[[ 0.0016,  0.0020,  0.0002,  ..., -0.0013,  0.0008,  0.0015],
         [ 0.0042,  0.0029,  0.0002,  ...,  0.0010,  0.0015, -0.0012],
         [ 0.0018,  0.0007, -0.0012,  ..., -0.0029, -0.0009,  0.0026],
         ...,
         [ 0.0216,  0.0055, -0.0101,  ..., -0.0065, -0.0029,  0.0037],
         [ 0.0188,  0.0073, -0.0077,  ..., -0.0025, -0.0009,  0.0057],
         [ 0.0330,  0.0281,  0.0289,  ...,  0.0160,  0.0102, -0.0310]]],
       grad_fn=<EmbeddingBackward0>)
  • Combining token and position embeddings

Combining them in this way gives us the final input embeddings ready to feed through the transformer model:

#
# token embeddings + position embeddings
#

# And combining them we get the final input embeddings
input_embeddings = token_embeddings + position_embeddings
print('\ninput_embeddings.shape:', input_embeddings.shape)
print('input_embeddings:\n', input_embeddings)

# The following combines all the above steps
input_embeddings_alias = text_encoder.text_model.embeddings(text_input.input_ids.to(torch_device))
print('\ninput_embeddings_alias.shape:', input_embeddings_alias.shape)
print('input_embeddings_alias:\n', input_embeddings_alias)
input_embeddings.shape: torch.Size([1, 77, 768])
input_embeddings:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,
           1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,
           1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,
           6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03,
          -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,
           1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,
           1.1110e-02, -2.5745e-02]]], grad_fn=<AddBackward0>)

input_embeddings_alias.shape: torch.Size([1, 77, 768])
input_embeddings_alias:
 tensor([[[ 2.6770e-03,  5.2133e-03,  4.9323e-04,  ..., -3.1321e-03,
           1.0659e-03,  3.4316e-03],
         [ 5.5371e-03,  1.7510e-03, -1.2381e-02,  ..., -1.1410e-02,
           1.3508e-02,  6.8378e-03],
         [ 2.5356e-02, -1.1019e-02,  9.7663e-03,  ...,  1.9460e-03,
           6.8375e-03,  1.8573e-02],
         ...,
         [ 2.2781e-02,  1.3262e-02, -1.1241e-02,  ..., -8.0054e-03,
          -2.0560e-03,  8.9366e-03],
         [ 2.0026e-02,  1.5015e-02, -8.7638e-03,  ..., -4.0313e-03,
           1.8487e-05,  1.0885e-02],
         [ 3.4206e-02,  3.5826e-02,  2.7768e-02,  ...,  1.4465e-02,
           1.1110e-02, -2.5745e-02]]], grad_fn=<AddBackward0>)
  • Feeding token_embedding + position_embedding through the transformer model

def build_causal_attention_mask(bsz, seq_len, dtype):
    mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype)
    mask.fill_(torch.tensor(torch.finfo(dtype).min))  # fill with large negative number (acts like -inf)
    mask = mask.triu_(1)  # zero out the lower triangle so each position only attends to itself and earlier positions
    print('\nmask.shape:', mask.shape)
    print('mask:\n', mask)
    return mask.unsqueeze(1)  # add a dimension to broadcast over attention heads


def get_output_embeds(input_embeddings):
    # CLIP's text model uses causal mask, so we prepare it here
    print('\ninput_embeddings.shape:', input_embeddings.shape)
    bsz, seq_len = input_embeddings.shape[:2]
    causal_attention_mask = build_causal_attention_mask(bsz, seq_len, dtype=input_embeddings.dtype)

    # Getting the output embeddings involves calling the model with output_hidden_states=True
    # so that it doesn't just return the pooled final predictions
    encoder_outputs = text_encoder.text_model.encoder(
        inputs_embeds=input_embeddings,
        attention_mask=None,  # We aren't using an attention mask so that can be None
        causal_attention_mask=causal_attention_mask.to(torch_device),
        output_attentions=None,
        output_hidden_states=True,  # We want the output embs not the final output
        return_dict=None,
    )

    # We're interested in the output hidden state only
    output = encoder_outputs[0]

    # There is a final layer norm we need to pass these through
    output = text_encoder.text_model.final_layer_norm(output)

    # And now they're ready!
    return output


out_embs_test = get_output_embeds(input_embeddings)  # Feed through the model with our new function
print('\nout_embs_test.shape:', out_embs_test.shape)
print('out_embs_test:\n', out_embs_test)
input_embeddings.shape: torch.Size([1, 77, 768])

mask.shape: torch.Size([1, 77, 77])
mask:
 tensor([[[ 0.0000e+00, -3.4028e+38, -3.4028e+38,  ..., -3.4028e+38,
          -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00, -3.4028e+38,  ..., -3.4028e+38,
          -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -3.4028e+38,
          -3.4028e+38, -3.4028e+38],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          -3.4028e+38, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00, -3.4028e+38],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]]])

out_embs_test.shape: torch.Size([1, 77, 768])
out_embs_test:
 tensor([[[-0.3884,  0.0229, -0.0522,  ..., -0.4899, -0.3066,  0.0675],
         [ 0.0290, -1.3258,  0.3085,  ..., -0.5257,  0.9768,  0.6652],
         [ 0.6942,  0.3538,  1.0991,  ..., -1.5716, -1.2643, -0.0121],
         ...,
         [-0.0221, -0.0053, -0.0089,  ..., -0.7303, -1.3830, -0.3011],
         [-0.0062, -0.0246,  0.0065,  ..., -0.7326, -1.3745, -0.2953],
         [-0.0536,  0.0269,  0.0444,  ..., -0.7159, -1.3634, -0.3075]]],
       grad_fn=<NativeLayerNormBackward0>)
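
As a quick sanity check (a sketch), the manually computed embeddings should match the ones returned by calling the text encoder directly at the start of this section:

# Compare against `output_embeddings` computed earlier with text_encoder(...)
print(torch.allclose(out_embs_test, output_embeddings, atol=1e-4))  # expected: True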

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
