Welcome Stable Diffusion 3 to 🧨 Diffusers

Published by HuggingFace on 2024-06-17

Source URL: https://www.cnblogs.com/huggingface/p/18253377

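Stable Diffusion 3 (SD3), the latest model in Stability AI's Stable Diffusion family, is now available on the Hugging Face Hub and can be used with 🧨 Diffusers.

The model released today is Stable Diffusion 3 Medium, with 2 billion (2B) parameters.

As part of this release, we provide:

Models on the Hub, available for download
Diffusers integration
SD3 DreamBooth and LoRA training scripts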

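What's New in SD3

Model

SD3 is a latent diffusion model that consists of three different text encoders (CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL), a newly proposed Multimodal Diffusion Transformer (MMDiT) model, and a 16-channel AutoEncoder model (similar to the one used in Stable Diffusion XL).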

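SD3 processes text inputs and visual latents as sequences of embeddings. Positional encodings are added to 2x2 patches of the latents, which are then flattened into a sequence of patch embeddings. This sequence, together with the text embedding sequence, is fed into the MMDiT blocks. There, the two sequences are projected to a common feature dimension, concatenated, and passed through a series of attention and multilayer perceptron (MLP) layers.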

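To account for the differences between the two modalities, the MMDiT blocks use two separate sets of weights to project the text and image sequences to a common feature dimension. The two sequences are then joined before the attention operation. This design lets each representation work in its own feature space while still drawing useful information from the other through attention [1]. This two-way flow of information between text and image differs from earlier text-to-image models, where text information is fed into the model through cross-attention and the text features given to every layer are the same text encoder output, regardless of depth.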

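In addition, SD3 adds pooled text embeddings to the timestep conditioning. These pooled embeddings come from the two CLIP models. They are concatenated, added to the timestep embedding, and then passed to each MMDiT block.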
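To make the sequence-based view concrete, the sketch below counts how many tokens the MMDiT blocks would see for one image. It is only back-of-the-envelope arithmetic. The 8x VAE downsampling factor, the 77 CLIP tokens, and the 256 T5 tokens are assumptions that are not stated in this post; only the 2x2 patch size comes from the text above. The function name is hypothetical.

```python
def sd3_sequence_lengths(height=1024, width=1024, vae_scale=8, patch=2,
                         clip_tokens=77, t5_tokens=256):
    """Rough token counts for the joint text + image sequence.

    vae_scale, clip_tokens and t5_tokens are assumed values, not taken
    from the post; patch=2 matches the 2x2 patches described above.
    """
    # The VAE compresses the image spatially before the transformer sees it.
    latent_h, latent_w = height // vae_scale, width // vae_scale
    # Each 2x2 latent patch becomes one token in the image sequence.
    image_tokens = (latent_h // patch) * (latent_w // patch)
    # Text tokens from the encoders are joined along the sequence axis.
    text_tokens = clip_tokens + t5_tokens
    return {
        "latent_hw": (latent_h, latent_w),
        "image_tokens": image_tokens,
        "text_tokens": text_tokens,
        "joint_tokens": image_tokens + text_tokens,
    }


lengths = sd3_sequence_lengths()
print(lengths)
```

Under these assumptions the image tokens dominate the joint sequence, which is one reason attention cost grows quickly with resolution.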

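Training with Rectified Flow Matching

In addition to the architectural changes, SD3 is trained with a conditional flow-matching objective. In this approach, the forward noising process is defined as a rectified flow that connects the data distribution and the noise distribution along a straight line.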

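Sampling also becomes simpler, and model quality stays stable when the number of sampling steps is reduced. To support this, we introduce a new scheduler (FlowMatchEulerDiscreteScheduler) that combines the rectified flow-matching formulation with Euler method sampling steps. It also introduces a resolution-dependent shift parameter. For higher resolutions, increasing the shift value handles noise scaling better. For the 2B model, we recommend shift=3.0.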
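The ideas in this section can be sketched in a few lines of plain Python. This is a toy illustration, not the Diffusers scheduler itself: the helper names are hypothetical, the linear base schedule is an assumption, and the "oracle" velocity stands in for the trained model. The shift formula used here, sigma' = shift * sigma / (1 + (shift - 1) * sigma), is the commonly used form of resolution-dependent timestep shifting.

```python
import random


def shift_sigma(sigma, shift=3.0):
    """Push noise levels toward the noisy end of the schedule (shift > 1)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def make_sigmas(num_steps, shift=3.0):
    """Assumed schedule: linear from 1 down toward 0, shifted, ending at 0."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base] + [0.0]


def add_noise(x0, noise, sigma):
    """Rectified-flow forward process: a straight line between data and noise."""
    return [(1.0 - sigma) * a + sigma * n for a, n in zip(x0, noise)]


def euler_sample(noise, velocity_fn, sigmas):
    """Integrate dx/dsigma = v from sigma = 1 (noise) down to sigma = 0 (data)."""
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # toy "data" sample
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # toy noise sample


def oracle_velocity(x, sigma):
    # Along a straight path the velocity is constant: noise - data.
    # A trained model would predict this; here it is given for illustration.
    return [n - a for a, n in zip(x0, noise)]


recovered = euler_sample(noise, oracle_velocity, make_sigmas(num_steps=4))
print(recovered)
```

With a perfect velocity prediction, the straight path is integrated exactly by Euler steps no matter how few steps are used, which gives some intuition for why quality degrades gracefully at low step counts.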

To try SD3 quickly, you can use the Gradio-based demo below:

stabilityai/stable-diffusion-3-medium

Using SD3 with Diffusers

To use SD3 with Diffusers, first make sure you have the latest Diffusers release installed:

pip install --upgrade diffusers

Before using the model, go to the Stable Diffusion 3 Medium page on Hugging Face, fill in the form, and accept the terms. Once that is done, log in to your Hugging Face account:

huggingface-cli login

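The following snippets download the 2B-parameter SD3 model in fp16 precision. This is the precision of the original checkpoint released by Stability AI and the recommended precision for inference.

Text-to-Image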

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image

[Image: hello_world_cat]

Image-to-Image

import torch
from diffusers import StableDiffusion3Img2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt, image=init_image).images[0]
image

[Image: wizard_cat]

See the SD3 documentation in Diffusers for more details.

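Memory Optimizations for SD3

SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes running the model on GPUs with less than 24GB of VRAM difficult, even in fp16 precision.

To address this, Diffusers integrates several memory optimizations that allow SD3 to run on a wider range of GPUs.

Running Inference with Model Offloading

The most commonly used memory optimization in Diffusers is model offloading. During inference, it moves pipeline components that are not currently needed to the CPU, which saves GPU VRAM at the cost of a small increase in inference latency. Only the component that currently takes part in the computation is kept on the GPU, and the rest stay on the CPU.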

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe(prompt).images[0]

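Dropping the T5 Text Encoder during Inference

Removing the 4.7B-parameter T5-XXL text encoder during inference significantly reduces the memory requirements, with only a small loss in quality.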

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", text_encoder_3=None, tokenizer_3=None, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. “This is fine,” the dog assures himself."
image = pipe(prompt).images[0]

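Using a Quantized Version of the T5-XXL Model

With the bitsandbytes library, you can also load the T5-XXL model in 8-bit quantized form to reduce the VRAM requirements further.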

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

# Make sure you have `bitsandbytes` installed.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

The full code is linked from the original post.
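As a rough sanity check on why dropping or quantizing T5-XXL helps, the sketch below estimates the memory taken by its weights alone. The 4.7B parameter count comes from this post; the helper name is hypothetical, and the estimate deliberately ignores activations, the other pipeline components, and any overhead added by the quantization library.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the raw weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


T5_XXL_PARAMS = 4.7e9  # parameter count quoted in this post

fp16_gb = weight_memory_gb(T5_XXL_PARAMS, 2)  # 2 bytes per fp16 weight
int8_gb = weight_memory_gb(T5_XXL_PARAMS, 1)  # 1 byte per 8-bit weight

print(f"fp16 weights: ~{fp16_gb:.1f} GB, 8-bit weights: ~{int8_gb:.1f} GB")
```

Actual savings should be measured on the target hardware, since real peak usage depends on much more than weight storage.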

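Summary of Memory Optimizations

All benchmarks used the 2B-parameter SD3 model and were run on a single A100-80G GPU with fp16 inference and PyTorch 2.3.

We ran each inference call 10 times and recorded the average peak VRAM usage and the average time taken for 20 sampling steps.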
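The measurement procedure described above can be approximated with a small timing helper. This is a generic sketch, not the script used for the benchmarks in this post: the `benchmark` function and its `warmup` argument are hypothetical, and the dummy workload stands in for a pipeline call. On a GPU you would also synchronize before reading the clock and read peak memory with `torch.cuda.max_memory_allocated()`.

```python
import statistics
import time


def benchmark(fn, runs=10, warmup=1):
    """Call fn `warmup` times untimed, then return the mean time of `runs` calls."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


# Dummy workload for illustration; replace with a call such as
# lambda: pipe(prompt, num_inference_steps=20) when a pipeline is loaded.
average_seconds = benchmark(lambda: sum(range(100_000)), runs=10)
print(f"average time per call: {average_seconds:.6f} s")
```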

Performance Optimizations for SD3

To speed up inference, we can use torch.compile() to obtain optimized compute graphs for the vae and transformer components.

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm up: the first calls trigger compilation and are slow
prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

# Run Inference
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")

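See the original post for the full script.

We measured the inference speed of SD3 with torch.compile() on an A100-80G GPU, using fp16 inference and PyTorch 2.3. Each generation was run 10 times with 20 sampling steps per run. The average inference time was 0.585 seconds, about four times faster than eager execution.

Fine-Tuning with DreamBooth and LoRA

Finally, we provide a DreamBooth fine-tuning script for SD3 that uses LoRA. Besides fine-tuning the model, the script can also serve as a reference if you want to train models with rectified flow. Another popular rectified flow implementation is minRF.

To use the script, first make sure your setup is complete and that you have a dataset ready (such as this one). You need to install peft and bitsandbytes before launching training: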

export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="dreambooth-sd3-lora"

accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path=${MODEL_NAME} \
  --instance_data_dir=${INSTANCE_DIR} \
  --output_dir=/raid/.cache/${OUTPUT_DIR} \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks dog" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --weighting_scheme="logit_normal" \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=25 \
  --seed="0" \
  --push_to_hub
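The `--weighting_scheme="logit_normal"` option above refers to sampling training timesteps from a logit-normal distribution rather than uniformly. The sketch below shows the idea in plain Python. The function name and the default mean and standard deviation are assumptions for illustration, not values read from the training script.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw u in (0, 1) by passing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(10_000)]

# Fraction of samples in the middle of the noise range. A uniform
# sampler would put about half of its samples in this interval.
middle_fraction = sum(0.25 < u < 0.75 for u in samples) / len(samples)
print(f"fraction in (0.25, 0.75): {middle_fraction:.3f}")
```

Compared with uniform sampling, this concentrates training on intermediate noise levels and spends less time on the nearly clean and nearly pure-noise extremes.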

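Acknowledgements

Thanks to the Stability AI team for developing and open-sourcing Stable Diffusion 3 and for giving us early access, and thanks to Linoy for help with writing this post.

Original post: https://hf.co/blog/sd3

Original authors: Dhruv Nair, YiYi Xu, Sayak Paul, Alvaro Somoza, Kashif Rasul, Apolinário from multimodal AI art

Translator: hugging-hoi2022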
