
UM EECS 542: Advanced Topics in Computer Vision

Homework #2: Denoising Diffusion on Two-Pixel Images

Due: 14 October 2024, 11:59pm

The field of image synthesis has evolved significantly in recent years. From auto-regressive models and Variational Autoencoders (VAEs) to Generative Adversarial Networks (GANs), we have now entered a new era of diffusion models. A key advantage of diffusion models over other generative approaches is their ability to avoid mode collapse, allowing them to produce a diverse range of images. Given the high dimensionality of real images, it is impractical to sample and observe all possible modes directly. Our objective is to study denoising diffusion on two-pixel images to better understand how modes are generated and to visualize the dynamics and distribution within a 2D space.

1 Introduction

Diffusion models operate through a two-step process (Fig. 1): forward and reverse diffusion.

Figure 1: Diffusion models have a forward process to successively add noise to a clear image x0 and a backward process to successively denoise an almost pure noise image xT [2].

During the forward diffusion process, noise εt is incrementally added to the original data at time step t, degrading it over successive time steps until it resembles pure Gaussian noise. Letting εt denote standard Gaussian noise, we can parameterize the forward process as

q(xt | xt−1) = N(xt | √(1 − βt) xt−1, βt I). (1)

The reverse diffusion process, in contrast, involves the model learning to reconstruct the original data from a noisy version. This requires training a neural network to iteratively remove the noise, thereby recovering the original data. By mastering this denoising process, the model can generate new data samples that closely resemble the training data. We model each step of the reverse process as a Gaussian distribution

pθ(xt−1 | xt) = N(xt−1 | µθ(xt, t), Σθ(xt, t)). (6)

It is noteworthy that when conditioned on x0, the reverse conditional probability is tractable:

q(xt−1 | xt, x0) = N(xt−1 | µ̃t(xt, x0), β̃t I),

where, using Bayes' rule and skipping many steps (see [8] for reader-friendly derivations), we have

µ̃t(xt, x0) = (√ᾱt−1 βt / (1 − ᾱt)) x0 + (√αt (1 − ᾱt−1) / (1 − ᾱt)) xt,  β̃t = ((1 − ᾱt−1) / (1 − ᾱt)) βt.

We follow the VAE [3] to optimize the negative log-likelihood with its variational lower bound with respect to µ̃t and µθ(xt, t) (see [6] for derivations). We obtain the following objective function:

Lsimple = Et,x0,ε[ ‖ε − εθ(xt, t)‖² ],  with xt = √ᾱt x0 + √(1 − ᾱt) ε,

where εθ(xt, t) predicts the noise added to x0 from xt at timestep t.
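To make the forward process concrete in our two-pixel setting, here is a minimal sketch (assuming PyTorch and an illustrative linear beta schedule chosen only for this example; the homework itself uses the schedule of Sec. 2.3) of noising a single two-pixel image with the closed form xt = √ᾱt x0 + √(1 − ᾱt) ε:

```python
import torch

T = 50                                               # max diffusion timestep used in the homework
betas = torch.linspace(1e-4, 0.2, T)                 # illustrative linear schedule (Sec. 2.3 uses a cosine one)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative products ᾱ_t

x0 = torch.tensor([-0.35, 0.65])                     # a clean two-pixel image (left, right)
t = 25                                               # an intermediate timestep
eps = torch.randn(2)                                 # standard Gaussian noise ε

# Closed-form forward sample: x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε
xt = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * eps
print(xt)                                            # noised image; approaches pure noise as t -> T
```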

(a) many-pixel images  (b) two-pixel images

Figure 2: The distribution of images becomes difficult to estimate and distorted to visualize for many-pixel images, but simple to collect and straightforward to visualize for two-pixel images. The former requires dimensionality reduction by embedding the values of many pixels into, e.g., 3 dimensions, whereas the latter can be directly plotted in 2D, one dimension for each of the two pixels. Illustrated is a Gaussian mixture with two density peaks, at [-0.35, 0.65] and [0.75, -0.45], with sigma 0.1 and weights [0.35, 0.65] respectively. In our two-pixel world, about twice as many images have a lighter pixel on the right.

In this homework, we study denoising diffusion on two-pixel images, where we can fully visualize the diffusion dynamics over learned image distributions in 2D (Fig. 2). Sec. 2 describes our model step by step, and the code you need to write to finish the model. Sec. 3 describes the starter code. Sec. 4 lists what results and answers you need to submit.

2 Denoising Diffusion Probabilistic Models (DDPM) on 2-Pixel Images

Diffusion models not only generate realistic images but also capture the underlying distribution of the training data. However, this probability density function (PDF) can be hard to collect for many-pixel images, and its visualization is highly distorted, whereas it is simple and direct for two-pixel images (Fig. 2).

Consider an image with only two pixels, a left pixel and a right pixel. Our two-pixel world contains two kinds of images: those where the left pixel is lighter than the right pixel, and vice versa. The entire image distribution can be modeled by a Gaussian mixture with two peaks in 2D, each dimension corresponding to a pixel.

Let us develop DDPM [2] for our special two-pixel image collection.
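For concreteness, here is a minimal sketch (assuming NumPy; the provided gmm.py may differ) of sampling a two-pixel training set from the Gaussian mixture of Fig. 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-pixel image distribution from Fig. 2: two isotropic Gaussian peaks.
means = np.array([[-0.35, 0.65],    # type 1: left pixel darker than right
                  [ 0.75, -0.45]])  # type 2: left pixel lighter than right
weights = np.array([0.35, 0.65])
sigma = 0.1

def sample_two_pixel_images(n):
    """Draw n two-pixel images (shape (n, 2)) and their mode labels."""
    labels = rng.choice(2, size=n, p=weights)                  # which peak each image comes from
    images = means[labels] + sigma * rng.standard_normal((n, 2))
    return images, labels

x0, y = sample_two_pixel_images(10000)                         # training set
```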

2.1 Diffusion Step and Class Embedding

We use a Gaussian Fourier feature embedding for diffusion step t:

xemb = [sin(2πw0 x), cos(2πw0 x), . . . , sin(2πwn x), cos(2πwn x)],  wi ∼ N(0, 1), i = 0, . . . , n. (10)

For the class embedding, we simply need some linear layers to project the one-hot encoding of the class labels to a latent space. You do not need to implement anything for this part; it is provided in the code.
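As a rough, non-authoritative sketch of what such a Gaussian Fourier step embedding looks like in PyTorch (the embedding dimension, scale, and module names are illustrative assumptions, not the values used in the provided code):

```python
import torch
import torch.nn as nn

class GaussianFourierEmbedding(nn.Module):
    """Embed a scalar diffusion step t as [sin(2*pi*w_i*t), cos(2*pi*w_i*t)], per Eq. (10)."""
    def __init__(self, embed_dim=256, scale=1.0):
        super().__init__()
        # Random frequencies w_i ~ N(0, 1), fixed (not trained).
        self.register_buffer("w", torch.randn(embed_dim // 2) * scale)

    def forward(self, t):
        # t: (B,) tensor of timesteps -> (B, embed_dim) embedding
        proj = 2 * torch.pi * t[:, None].float() * self.w[None, :]
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# The class embedding (provided in the code) is conceptually just linear layers
# projecting a one-hot label into the same latent dimension, e.g.:
class_embed = nn.Sequential(nn.Linear(2, 256), nn.SiLU(), nn.Linear(256, 256))
```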

2.2 Conditional UNet

We use a UNet (Fig. 3) that takes as input both the time step t and the noised image xt, along with the class label y if it is provided, and outputs the predicted noise. The network consists of only two blocks for each of the encoding and decoding pathways. To incorporate the step into the UNet features, we apply a dense linear layer to transform the step embedding to match the image feature dimension. A similar embedding approach can be used for class label conditioning.

Figure 3: Sample conditional UNet architecture. Note how the diffusion step and the class/text conditional embeddings are fused with the conv blocks of the image feature maps. For simplicity, we will not add the attention module for our 2-pixel use case.

The detailed architecture is as follows.

  1. Encoding block 1: Conv1d with kernel size 2 + Dense + GroupNorm with 4 groups
  2. Encoding block 2: Conv1d with kernel size 1 + Dense + GroupNorm with 32 groups
  3. Decoding block 1: ConvTranspose1d with kernel size 1 + Dense + GroupNorm with 4 groups
  4. Decoding block 2: ConvTranspose1d with kernel size 1

We use SiLU [1] as our activation function. When adding class conditioning, we handle it similarly to the diffusion step.

Your to-do: Finish the model architecture and forward function in ddpm.py
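As a non-authoritative sketch of how one encoding block might fuse the step (and class) embedding with the conv features under the architecture listed above (channel sizes, embedding dimension, and names such as `DenseEmbed` are illustrative assumptions, not the required ddpm.py interface):

```python
import torch
import torch.nn as nn

class DenseEmbed(nn.Module):
    """Project an embedding vector to per-channel offsets (illustrative helper)."""
    def __init__(self, emb_dim, out_channels):
        super().__init__()
        self.dense = nn.Linear(emb_dim, out_channels)

    def forward(self, emb):
        # (B, emb_dim) -> (B, out_channels, 1), so it broadcasts over the length dimension
        return self.dense(emb)[..., None]

class EncodeBlock1(nn.Module):
    """Encoding block 1: Conv1d(k=2) + Dense(step/class emb) + GroupNorm(4 groups) + SiLU."""
    def __init__(self, in_ch=1, out_ch=32, emb_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=2)
        self.t_dense = DenseEmbed(emb_dim, out_ch)
        self.y_dense = DenseEmbed(emb_dim, out_ch)
        self.norm = nn.GroupNorm(4, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb, y_emb=None):
        h = self.conv(x)                  # x: (B, 1, 2) two-pixel image -> (B, out_ch, 1)
        h = h + self.t_dense(t_emb)       # fuse diffusion-step embedding
        if y_emb is not None:
            h = h + self.y_dense(y_emb)   # fuse class embedding when conditioning is used
        return self.act(self.norm(h))
```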

2.3 Beta Scheduling and Variance Estimation

We adopt the sinusoidal beta schedule [4] for better results than the original DDPM [2]:

ᾱt = f(t) / f(0), (11)

f(t) = cos²( ((t/T + s) / (1 + s)) · (π/2) ). (12)

However, we follow the simpler posterior variance estimation of [2], without using [4]'s learnt posterior variance method, for estimating Σθ(xt, t).

For simplicity, we declare some global variables that can be handy during sampling and training. Here is the definition of these global variables in the code.

  1. betas: βt
  2. alphas: αt = 1 − βt
  3. alphas_cumprod: ᾱt = Π_{i=0}^{t} αi
  4. posterior_variance: Σθ(xt, t) = σt² = β̃t = ((1 − ᾱt−1) / (1 − ᾱt)) βt

Your to-do: Code up all these variables in utils.py. Feel free to add more variables you need.
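A minimal sketch of these globals, assuming PyTorch and an offset s = 0.008 inside f(t) (the handout does not fix s; this value is an assumption, commonly used with [4]):

```python
import torch

T = 50   # max diffusion timestep (example value from the assignment)

def cosine_alpha_bar(t, T, s=0.008):
    """ᾱ_t = f(t)/f(0) from Eqs. (11)-(12); t may be a tensor of timesteps."""
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

# Global schedule tensors (a sketch of what utils.py might hold).
steps = torch.arange(T + 1, dtype=torch.float32)              # t = 0, 1, ..., T
alphas_cumprod = cosine_alpha_bar(steps, T)                   # ᾱ_t, indices 0..T
alphas = alphas_cumprod[1:] / alphas_cumprod[:-1]             # α_t = ᾱ_t / ᾱ_{t-1}, for t = 1..T
betas = (1.0 - alphas).clamp(1e-8, 0.999)                     # β_t = 1 - α_t
posterior_variance = betas * (1.0 - alphas_cumprod[:-1]) / (1.0 - alphas_cumprod[1:])  # β̃_t
```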

2.4 Training with and without Guidance

For each DDPM iteration, we randomly select the diffusion step t and add random noise ε to the original image x0, using the β we defined for the forward process, to get a noisy image xt. Then we pass xt and t to our model to output the estimated noise εθ, and calculate the loss between the actual noise ε and the estimated noise εθ. We can choose different losses: L1, L2, Huber, etc. To sample images, we simply follow the reverse process as described in [2]:

xt−1 = (1/√αt) ( xt − ((1 − αt) / √(1 − ᾱt)) εθ(xt, t) ) + σt z,  z ∼ N(0, I), (13)

where z is fresh standard Gaussian noise at each step (set to 0 at the final step).
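A rough sketch of one training iteration under these definitions (assuming PyTorch, the schedule tensors from Sec. 2.3, and a `model(xt, t)` that returns the predicted noise; the names are illustrative, not the required utils.py interface):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod, T):
    """One DDPM training iteration: predict the noise added at a random step t."""
    B = x0.shape[0]
    t = torch.randint(1, T + 1, (B,), device=x0.device)         # random diffusion steps
    eps = torch.randn_like(x0)                                   # actual noise ε
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))   # ᾱ_t, broadcastable to x0
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps            # forward-process sample x_t
    eps_theta = model(xt, t)                                     # estimated noise ε_θ(x_t, t)
    return F.mse_loss(eps_theta, eps)                            # L2 loss (L1/Huber also fine)
```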

We consider both classifier and classifier-free guidance. Classifier guidance requires training a separate classifier and using it to provide the gradient that guides the generation of the diffusion model. Classifier-free guidance, on the other hand, is much simpler in that it does not need a separate model. To sample from p(x|y), we need an estimate of

∇xt log p(xt | y) = ∇xt log p(y | xt) + ∇xt log p(xt),

where ∇xt log p(y|xt) is the classifier gradient and ∇xt log p(xt) the model likelihood (also called the score function [7]). For classifier guidance, we could train a classifier fϕ on noisy images at different steps and estimate p(y|xt) with fϕ(y|xt).

Figure 4: Sample trajectories from the same start point (a 2-pixel image) with different guidance. Setting y = 0 generates a diffusion trajectory towards images of type 1, where the left pixel is darker than the right pixel, whereas setting y = 1 leads to a diffusion trajectory towards images of type 2, where the left pixel is lighter than the right pixel.

Classifier-free guidance in DDPM is a technique used to generate more controlled and realistic samples without the need for an explicit classifier. It enhances the flexibility and quality of the generated images by conditioning the diffusion model on auxiliary information, such as class labels, while allowing the model to work both conditionally and unconditionally. For classifier-free guidance, we make a small modification by parameterizing the model with an additional input y, resulting in εθ(xt, t, y). This allows the model to represent ∇xt log p(xt | y). For unconditional generation, we simply set y = ∅. We have

ε̃θ(xt, t, y) = (w + 1) εθ(xt, t, y) − w εθ(xt, t, ∅),

where w controls the strength of the conditional influence; w > 0 increases the strength of the guidance, pushing the generated samples more toward the desired class or conditional distribution. During training, we randomly drop the class label to train the unconditional model. At sampling time, we replace the original εθ(xt, t) with (w + 1) εθ(xt, t, y) − w εθ(xt, t, ∅) to sample with specific class labels (Fig. 4). Classifier-free guidance thus mixes the model's predictions with and without conditioning to produce samples with stronger or weaker guidance.

Your to-do: Finish up all the training and sampling functions in utils.py for classifier-free guidance.
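As a hedged sketch of classifier-free guided sampling (assuming PyTorch, the schedule tensors and indexing from the Sec. 2.3 sketch, and a conditional `model(xt, t, y)` where `y=None` stands for the null label ∅; these names and the null-label mechanism are assumptions, not the required interface):

```python
import torch

@torch.no_grad()
def sample_cfg(model, T, w, y, alphas, alphas_cumprod, posterior_variance, shape=(1, 1, 2)):
    """Reverse process with classifier-free guidance: Eq. (13) with the guided noise estimate."""
    xt = torch.randn(shape)                                  # start from pure noise x_T
    for t in reversed(range(1, T + 1)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps_cond = model(xt, tt, y)                          # ε_θ(x_t, t, y)
        eps_uncond = model(xt, tt, None)                     # ε_θ(x_t, t, ∅)
        eps = (w + 1) * eps_cond - w * eps_uncond            # guided noise estimate
        coef = (1 - alphas[t - 1]) / (1 - alphas_cumprod[t]).sqrt()
        mean = (xt - coef * eps) / alphas[t - 1].sqrt()      # μ_θ(x_t, t) from Eq. (13)
        z = torch.randn_like(xt) if t > 1 else torch.zeros_like(xt)
        xt = mean + posterior_variance[t - 1].sqrt() * z
    return xt
```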

3 Starter Code

  1. gmm.py defines the Gaussian mixture model for the ground-truth 2-pixel image distribution. Your training set will be sampled from this distribution. You can leave this file untouched.
  2. ddpm.py defines the model itself. You will need to follow the guidelines to build your model there.
  3. utils.py defines all the other utility functions, including beta scheduling and the training loop module.
  4. train.py defines the main loop for training.

4 Problem Set

  1. (40 points) Finish the starter code following the above guidelines. Further changes are also welcome! Please make sure your training and visualization results are reproducible. In your report, state any changes that you make and any obstacles you encounter during coding and training.
  2. (20 points) Visualize a particular diffusion trajectory overlaid on the estimated image distribution pθ(xt | t) at time-steps t = 10, 20, 30, 40, 50, given max time-step T = 50. We estimate the PDF by sampling a large number of starting points and seeing where they end up at time t, using either 2D histogram binning or Gaussian kernel density estimation (see the sketch after this list). Fig. 5 plots the de-noising trajectory for a specific starting point overlaid on the ground-truth and estimated PDF. Visualize such a sample trajectory overlaid on the 5 estimated PDFs at t = 10, 20, 30, 40, 50 respectively and over the ground-truth PDF. Briefly describe what you observe. (Figure 5: Sample de-noising trajectory overlaid on the estimated PDF for different steps.)
  3. (20 points) Train multiple models with different maximum timesteps T = 5, 10, 25, 50. Sample and de-noise 5000 random noises. Visualize and describe how the de-noised results differ from each other. Simply do a scatter plot to see how the final distribution of the 5000 de-noised samples compares with the ground-truth distribution for each T. Note that there are many existing ways [5, 9] to make smaller timesteps work well even for realistic images. One plot with 5 subplots is expected here.
  4. (20 points) Visualize different trajectories from the same starting noise xT that lead to different modes with different guidance. Describe what you find. One plot as illustrated by Fig. 4 is expected here.
  5. Bonus (30 points): Extend this model to MNIST images. Actions: add more conv blocks for encoding/decoding; add residual layers and attention in each block; increase the max timestep to 200 or more. Four blocks for each pathway should be enough for MNIST. Show 64 generated images with any random digits you want to guide (see Figure 6). Show one trajectory of the generation from noise to a clear digit. Answer the question: throughout the generation, is the shape of the digit generated part by part, or all at once? (Figure 6: Sample MNIST images generated by denoising diffusion with classifier-free guidance. The tensor() below shows the random digits (class labels) input to the sampling steps.)
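For Problem 2, one way to produce the estimated PDF is Gaussian kernel density estimation. A minimal sketch, assuming NumPy, SciPy, and matplotlib (the placeholder `samples_t` stands in for your de-noised samples at step t):

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# samples_t: (N, 2) array of N partially de-noised two-pixel images at step t,
# obtained by running the reverse process on many random starting noises.
samples_t = np.random.randn(5000, 2)                  # placeholder; use your model's outputs

kde = gaussian_kde(samples_t.T)                       # fit a 2D kernel density estimate
xs = ys = np.linspace(-2, 2, 200)
X, Y = np.meshgrid(xs, ys)
Z = kde(np.vstack([X.ravel(), Y.ravel()])).reshape(X.shape)

plt.contourf(X, Y, Z, levels=50)                      # estimated p_theta(x_t | t)
plt.xlabel("left pixel"); plt.ylabel("right pixel")
plt.title("Estimated PDF at step t")
plt.show()
```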

5 Submission Instructions

  1. This assignment is to be completed individually.
  2. Submissions should be made through Gradescope and Canvas. Please upload:
     (a) A PDF file of the graphs and explanations. This file should be submitted on Gradescope. Include your name, student ID, and the date of submission at the top of the first page. Write each problem on a different page.
     (b) A folder containing all code files. This folder will be submitted under the folder of your uniqname on our class server. Please leave all your visualization code inside as well, so that we can reproduce your results if we find any graphs strange.
     (c) If you believe there may be an error in your code, please provide a written statement in the PDF describing what you think may be wrong and how it affected your results. If necessary, provide pseudocode and/or expected results for any functions you were unable to write.
  3. You may refactor the code as desired, including adding new files. However, if you make substantial changes, please leave detailed comments and use reasonable file names. You are not required to create separate files for every model training/testing run: commenting out parts of the code for different runs, as in the scaffold, is all right (just add some explanation).
