Haoyue Xiao

Part A: Inference with Pretrained DDPM

In part A, we will play around with the pretrained DeepFloyd IF model, a text-to-image model that takes text prompts as input and outputs images aligned with the text.

Model Setup

This section checks that the model is correctly downloaded and deployed.

After inputting the text prompts an oil painting of a snowy mountain village, a man wearing a hat, and a rocket ship, the model generates the following images:

Generated Images

Note that the smaller images are the 64×64 output from the stage-1 model, and the larger images on the second row are 256×256 images from stage 2.

Also, to make all results reproducible, I set the random seed to 3036702335, which is my SID.


A1. Sampling Loops

Starting from a clean image $x_0$, we can iteratively add a small noise $\epsilon_t \sim \mathcal{N}(0, I)$ at each time $t$; after sufficiently many timesteps $T$, we obtain a pure-noise image $x_T$. A diffusion model tries to reverse this process by predicting the noise added at each $t$ and obtaining $x_{t-1}$ by subtracting that noise from $x_t$.

In the DeepFloyd IF model, we have $T = 1000$.

Diffusion Model Primer

In this section, we will explore ways to sample from the model. We will use the following test images:

The Sather Tower:

Sather Tower

My roommate's cat, Nunu:

Nunu the Cat

A watermelon wine I made myself:

Watermelon Wine

These images will be resized to 64×64 for standard model input.


A1.1 Implementing the Forward Process

In the forward process, given timestep $t$, we can iteratively add noise to the image $t$ times. Usually, this noise-adding behavior is defined as

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}$$

where $\epsilon_t$ is standard normal, $\{\alpha_t\}_{t=1}^{T} = \{1 - \beta_t\}_{t=1}^{T}$, and $\beta_t$ is the variance schedule that controls the variance of the noise being added. However, it can be shown that this formula simplifies to

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ is the cumulative product of the $\alpha_i$. Thus, we have

$$q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\big)$$

In the DeepFloyd model, the $\bar{\alpha}_t$ are precomputed and stored in stage_1.scheduler.alphas_cumprod, so we can implement the forward process easily. Here are the results of adding noise to the campanile image at timesteps [250, 500, 750], respectively:

Adding Noise Results
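For reference, here is a minimal sketch of the forward function used above, assuming im is a tensor in [0, 1] and alphas_cumprod is stage_1.scheduler.alphas_cumprod (variable names are mine):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) I)."""
    abar_t = alphas_cumprod[t]                 # scalar \bar{alpha}_t
    eps = torch.randn_like(im)                 # epsilon ~ N(0, I)
    return torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
```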

A1.2 Classical Denoising

One of the most classical ways of denoising is the Gaussian blur filter. Here, we use a Gaussian blur filter with kernel size 5, and here are the results of trying to denoise the noisy images above:

Classical Denoising Results

It's obvious that the result is not desirable.
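For reference, this classical baseline is essentially a single library call (a sketch; when sigma is omitted, torchvision infers it from the kernel size):

```python
import torchvision.transforms.functional as TF

def classical_denoise(noisy_im):
    # Classical denoising baseline: Gaussian blur with a 5x5 kernel.
    return TF.gaussian_blur(noisy_im, kernel_size=5)
```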


A1.3 Implementing One-Step Denoising

Given the formula above that

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

we can try to get from $x_t$ to $x_0$ in one step using the formula

$$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon}{\sqrt{\bar{\alpha}_t}}$$

where $\epsilon$ is the model's estimate of the noise at timestep $t$.
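A minimal sketch of this estimate, assuming eps is the UNet's noise prediction for x_t at timestep t (names are mine):

```python
import torch

def one_step_denoise(x_t, eps, t, alphas_cumprod):
    """Estimate the clean image x_0 from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
```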

Here are the results of the one-step denoising on the images above:

One-Step Denoising Results

A1.4 Implementing Iterative Denoising

From the output of the last section, we see that when $t$ is large, the denoised image is very vague and blurred. Intuitively, this is because we predict the noise only once and use the cumulative coefficients to jump back to the original image, so it is hard to recover all the details in one step. The diffusion model, on the other hand, is designed to remove noise iteratively.

In theory, we should run from $x_{999}$ step by step all the way back to $x_0$, but this would be very inefficient. Instead, we take a strided list of timesteps that is a subset of $\{0, 1, \ldots, 998, 999\}$; here, I use a stride of 30 and go from 900 all the way back to 0.

Now, if t = strided_timesteps[i] and t' = strided_timesteps[i+1], then to get $x_{t'}$ from $x_t$ we have

$$x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}\, x_t + v_\sigma$$

where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$, $\beta_t = 1 - \alpha_t$, $x_0$ is the current estimate of the clean image (from one-step denoising), and $v_\sigma$ is random noise, which in DeepFloyd is also predicted by the model.
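Putting this together, one iteration of the denoising loop looks roughly like the sketch below (names are mine; the predicted variance noise $v_\sigma$ is omitted for brevity):

```python
import torch

def iterative_denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod):
    """Move from x_t (noisier) to x_{t'} (cleaner), where t' < t are consecutive
    entries of strided_timesteps and x0_est is the current one-step estimate
    of the clean image."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp       # effective alpha between the two strided steps
    beta_t = 1 - alpha_t

    x_tp = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x0_est \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t
    return x_tp   # v_sigma would be added here in the full implementation
```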

Starting from i_start=10, here are the images after every 5 denoising loops:

Iterative Denoising Step 1
Iterative Denoising Step 2

And here is a comparison of the different denoising methods:

Denoising Methods Comparison

A1.5 Diffusion Model Sampling

Now, with the iterative denoising loop, we can use the diffusion model to generate images by first creating an image of pure random noise and then feeding it to the denoising loop with i_start=0. The model then denoises the pure noise, and in the process a new image is sampled.

Here are some results of sampling from the current model:

Diffusion Model Sampling Results

A1.6 Classifier Free Guidance

In this section, we implement classifier-free guidance (CFG). In CFG, at each $t$, we compute both a noise estimate $\epsilon_c$ conditioned on a text prompt and an unconditional noise estimate $\epsilon_u$. Then, we compute the final noise estimate as

$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$

where $\gamma$ controls the intensity of the CFG. When $\gamma > 1$, we get higher-quality images.
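In code, the CFG combination is a one-liner (a sketch; eps_cond and eps_uncond stand for the two UNet noise estimates):

```python
# Classifier-free guidance: push the estimate away from the unconditional one.
gamma = 7.0
eps = eps_uncond + gamma * (eps_cond - eps_uncond)
```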

Here are some results of sampling with $\gamma = 7$:

Classifier Free Guidance Results

It's notable that the images are now much higher in quality and resemble realistic photos, matching the prompt a high quality photo.


A1.7 Image-to-image Translation

In part A1.4, we take a real image, add noise to it, and then denoise. This effectively allows us to make edits to existing images: the more noise we add, the larger the edit will be. We can therefore create an image-to-image translation series by adding noise at different levels and then denoising. Intuitively, we create a series of noisy pictures, from pure noise to moderately noisy to slightly noisy; the diffusion model then produces images ranging from completely new, to moderately modified, to only slightly modified, which gives the image-to-image translation.

Here are the results of this process, using the prompt a high quality photo at noise levels [1, 3, 5, 7, 10, 20] of strided_timesteps, on the three test images above:

Image-to-image Translation Results 1
Image-to-image Translation Results 2

The results are quite interesting: at the lowest noise-level indices the output is completely unrelated to the original image, but it gradually comes to resemble the original as the index increases.

A1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well if we start with a non-realistic image (e.g., a painting, a sketch, some scribbles) and project it onto the natural image manifold. In this section, we experiment with hand-drawn images. I show the results for one web-downloaded image and two images that I drew myself. Here are the original images:

The image downloaded from this site:

Web Downloaded Image

The images I drew using Procreate:

Hand-Drawn House
Hand-Drawn Flower

And here are the results of image-to-image transition:

Web Image Translation Result
Hand-Drawn Image Translation Result

A1.7.2 Inpainting

We can use the same procedure to implement inpainting: given an image $x_{orig}$ and a binary mask $m$, we can create a new image that has the same content where $m$ is 0, but new content wherever $m$ is 1.

To do this, after each denoising step that produces $x_t$, we force every pixel outside the editing mask $m$ to match the (appropriately noised) original $x_{orig}$. Mathematically, that is

$$x_t \leftarrow m\, x_t + (1 - m)\, \text{forward}(x_{orig}, t)$$

By doing so, we only make edits on the masked region and keep everything else as original.
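A sketch of this masking step inside the denoising loop, assuming mask is 1 on the region to edit and forward is the noising function from A1.1:

```python
# After each denoising step, overwrite everything outside the mask with a
# re-noised copy of the original image at the current timestep.
x_t = mask * x_t + (1 - mask) * forward(x_orig, t, alphas_cumprod)
```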

I used the following masks on the three test images:

Mask 1
Mask 2
Mask 3

And here are the results, respectively:

Inpainting Result 1
Inpainting Result 2
Inpainting Result 3

A1.7.3 Text-Conditioned Image-to-image Translation

Now, we will do the same thing as in the previous section, but guide the projection with a text prompt. This is no longer a pure "projection to the natural image manifold"; it also adds control using language. Implementing this is simply a matter of changing the prompt from a high quality photo to the desired text:

This is the result of test_im_1 using the prompt a rocket ship:

Text-Conditioned Translation Result 1

The result of test_im_2 using the prompt a sitting tiger:

Text-Conditioned Translation Result 2

The result of test_im_3 using the prompt a volcano:

Text-Conditioned Translation Result 3

A1.8 Visual Anagrams

In this section, we implement Visual Anagrams: we create an image that looks like prompt 1, but when flipped upside down reveals prompt 2.

To achieve this, at each step $t$, we compute the noise estimate using this algorithm:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big)$$
$$\epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$$

where UNet is the diffusion model as before, and flip is the operation that vertically flips the image. In theory, I could use other operations such as $\text{rotate}(\text{img}, \theta)$ to create anagrams beyond the vertical flip, but for simplicity I only attempted vertically flipped anagrams.
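A sketch of the anagram noise estimate, with unet(x, t, prompt) standing in for the (CFG-combined) DeepFloyd call:

```python
import torch

def anagram_noise_estimate(unet, x_t, t, prompt_1, prompt_2):
    """Average the estimate for prompt 1 with the flipped estimate for prompt 2."""
    eps_1 = unet(x_t, t, prompt_1)
    # Flip along the height (second-to-last) dimension, denoise, and flip back.
    eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```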

Here are some results of creating vertically flipped visual anagrams:

Normal: an oil painting of an old man; flipped: an oil painting of people around a campfire

Visual Anagram 1

Normal: an oil painting of a red panda; flipped: an oil painting of a kitchenware

Visual Anagram 2

Normal: an oil painting of an old man; flipped: an oil painting of a horse

Visual Anagram 3

A1.9 Hybrid Images

In this part, we'll implement Factorized Diffusion and create hybrid images that look like prompt 1 from far away and like prompt 2 up close.

To achieve this, we use this algorithm:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1)$$
$$\epsilon_2 = \text{UNet}(x_t, t, p_2)$$
$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2)$$
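A sketch of this hybrid noise estimate; here a Gaussian blur serves as the low-pass filter (the kernel size is my assumption, not a value from the writeup):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, prompt_far, prompt_close, kernel_size=33):
    """Low frequencies follow prompt_far, high frequencies follow prompt_close."""
    eps_1 = unet(x_t, t, prompt_far)
    eps_2 = unet(x_t, t, prompt_close)
    low = TF.gaussian_blur(eps_1, kernel_size)            # f_lowpass
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size)   # f_highpass = identity - lowpass
    return low + high
```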

Here are some results of running this algorithm:

Far: a lithograph of a skull; close: a lithograph of waterfalls

Hybrid Image 1

Far: an oil painting of a dog; close: an oil painting of landscape

Hybrid Image 2

Far: an oil painting with frame of a panda; close: an oil painting with frame of houseplant

Hybrid Image 3

Bells & Whistles Part A

Using the diffusion model, I created two course logos that I think look kind of cool:

Prompt: A futuristic logo with a computer in the middle, and on its screen there's a camera lens in the middle to feature computer vision

Course Logo 1

Prompt: A logo about a robot with computer vision feature

Course Logo 2

Upsample Test Images

I also tried stage 2 of the DeepFloyd IF model, which upsamples images; here are the results of running it on the test images:

Upsampled Test Image 1
Upsampled Test Image 2

Text-conditioned Translation on Hand-drawn Images with Upsampling

I also ran a text-conditioned translation on the house sketch I drew, conditioned on the prompt a high quality photo of a house, and then upsampled the result using the same prompt. Here are the results:

Text-conditioned Upsampling Result 1
Text-conditioned Upsampling Result 2

Cool Image Creation

I also attempted to create some cool fictional images with the model and then upsample them. Here's the result for the prompt a gigantic robot with a skull face destroying the city:

Cool Image Creation Result



Part B: DDPM with Customized UNet

B1 Unconditioned UNet

In this section, I implement the unconditioned UNet following this flow:

UNet Architecture

And the elementary blocks are implemented according to:

Elementary Blocks

Once we have the UNet, given a noisy image $z = x + \sigma\epsilon$, we can train the UNet to be a denoiser by solving

$$\arg\min_\theta\ \mathbb{E}\left[\,\|\epsilon_\theta(z) - \epsilon\|^2\,\right]$$
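A sketch of one training step under this objective (names are mine; unet is the unconditional UNet above and sigma is the training noise level):

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x, sigma):
    """x: a clean batch of MNIST images. Train the UNet to predict the added noise."""
    eps = torch.randn_like(x)
    z = x + sigma * eps                      # noisy input z = x + sigma * eps
    loss = F.mse_loss(unet(z), eps)          # || eps_theta(z) - eps ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```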

In this project, we play around with the MNIST dataset of handwritten digits. Here are some examples of adding noise at various levels to MNIST images:

Noisy MNIST Examples

For training, we use noise level $\sigma = 0.5$, hidden_dim=128, and lr=1e-4 with Adam. Here are some visualizations from training:

Training Data 1
Training Data 2
Training Data 3
Training Data 4

And finally, here's the result of using the model trained at $\sigma = 0.5$ to denoise images at various noise levels:

Denoising Results

B2 Diffusion Models

B2.1 Time-conditioned DDPM

Following the DDPM paper, we implement a method similar to the math introduced in A1.1, and we make a slight modification to the UNet above to allow time conditioning when predicting the noise:

Time-conditioned UNet

Specifically, we will add the embedded time vector to the layers circled in the architecture plot.
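A hedged sketch of how the time signal can be injected (the FCBlock design here is my reading of the elementary-blocks figure; shapes and names are illustrative):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """A small MLP that maps the normalized timestep t to a feature vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch), nn.GELU(),
                                 nn.Linear(out_ch, out_ch))

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass, the embedded time vector is added to the
# circled layers, broadcast over the spatial dimensions, e.g.:
#   unflat = self.unflatten(flat) + self.t_embed1(t)[:, :, None, None]
#   up1    = self.up1(...)        + self.t_embed2(t)[:, :, None, None]
```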

Training of the model follows:

Training Algorithm

And sampling follows this algorithm:

Sampling Algorithm

Here are some samples after epoch=5 and epoch=20 respectively:

After epoch 5:

Samples after Epoch 5

After epoch 20:

Samples after Epoch 20

And the training curve for the time-conditioned DDPM is:

Training Curve

B2.2 Class-conditioned DDPM

The performance of solely time-conditioned sampling is not great because the model doesn't know which digit it is supposed to generate. Now, we add class conditioning to the architecture by multiplying certain layers by an embedding of the class vector. The pseudocode is:

Class-conditioned Pseudocode
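In code, the conditioning at one of those layers roughly looks like the sketch below (my own naming; the one-hot class vector is zeroed with some probability during training so the model also learns the unconditional estimate needed for CFG):

```python
import torch

def condition_layer(feat, c_onehot, class_fc, t_emb, p_uncond=0.1, training=True):
    """feat:     [B, C, H, W] feature map at a conditioned layer
       c_onehot: [B, num_classes] one-hot class vectors
       class_fc: an FCBlock mapping the one-hot vector to C channels
       t_emb:    [B, C, 1, 1] time embedding for the same layer."""
    if training:
        # Randomly drop the class signal (classifier-free guidance training).
        drop = torch.rand(c_onehot.shape[0], device=c_onehot.device) < p_uncond
        c_onehot = c_onehot.clone()
        c_onehot[drop] = 0.0
    c_emb = class_fc(c_onehot)[:, :, None, None]   # [B, C, 1, 1]
    return c_emb * feat + t_emb                    # multiply by class, add time
```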

We follow this new algorithm to train the model:

Class-conditioned Training Algorithm

And follow this algorithm to sample:

Class-conditioned Sampling Algorithm

Here are some samples, with CFG $\gamma = 5$, after epoch=5 and epoch=20 respectively:

After epoch 5:

Class-conditioned Samples after Epoch 5

After epoch 20:

Class-conditioned Samples after Epoch 20

And the training curve for the class-conditioned DDPM is:

Training Curve

Bells & Whistles, Part B

GIFs for Time-conditioned and Class-conditioned DDPMs

I created some GIFs on the denoising process of tc_ddpm and cc_ddpm at different epochs. Here are the results:

Time-conditioned DDPM after epoch 1, 10, and 20

Time-conditioned DDPM Epoch 1
Time-conditioned DDPM Epoch 10
Time-conditioned DDPM Epoch 20

Class-conditioned DDPM after epoch 1, 10, and 20

Class-conditioned DDPM Epoch 1
Class-conditioned DDPM Epoch 10
Class-conditioned DDPM Epoch 20

Rectified Flow

The problem rectified flow is interested in is the following: given two distributions $\pi_0, \pi_1$, we have two observations $X_0, X_1 \in \mathbb{R}^d$. We want to find a transition map $T: \mathbb{R}^d \to \mathbb{R}^d$ such that $T(X_0) \sim \pi_1$ when $X_0 \sim \pi_0$.

This problem can be reformulated as finding a drift force $v(X_t, t)$ such that

$$\frac{dX_t}{dt} = v(X_t, t)$$

for $t \in [0, 1]$. This drift force can be thought of as an instruction for how the given $X_t$ should move at time $t$ in order to reach $X_1$.

Rectified flow observes that the linear-interpolation drift, $X_1 - X_0$, effectively translates $X_0$ towards $X_1$. However, it cannot be used directly as $v(X_t, t)$ because (1) it peeks at $X_1$, which should not be known at intermediate timesteps, and (2) it is not deterministic even when $X_t$ and $t$ are fixed, meaning it is not fully determined by $X_t$ and $t$. This guide provides a visual explanation of why.

Therefore, we cannot use the linear-interpolation drift directly, but we can approximate it with a neural net that depends only on $t$ and $X_t$, by minimizing

$$\min_\theta \int_0^1 \mathbb{E}\left[\,\|(X_1 - X_0) - v_\theta(X_t, t)\|^2\,\right] dt$$

The authors of rectified flow show that this approximated trajectory is guaranteed to have the same marginal distributions at the two ends, and is also guaranteed to have a transition cost that is no higher under any convex cost function.

I implemented the rectified flow following the code in this repo. Specifically, I used the same class-conditioned UNet as in the DDPM above as the neural net that estimates the drift force $v_\theta$. Letting $X_0$ be the clean images and $X_1$ be pure noise, we approximate the added noise $X_1 - X_0$ with the neural net, conditioned on both time and class.
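A sketch of one rectified-flow training step under this setup (names are mine; unet(x, t, c) is the class-conditioned UNet used as $v_\theta$):

```python
import torch
import torch.nn.functional as F

def rf_train_step(unet, optimizer, x0, c):
    """x0: clean images, c: class labels. X1 is pure noise; the target drift is X1 - X0."""
    x1 = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)   # t ~ U[0, 1]
    tb = t[:, None, None, None]
    xt = (1 - tb) * x0 + tb * x1                    # linear interpolation X_t
    loss = F.mse_loss(unet(xt, t, c), x1 - x0)      # match the drift X_1 - X_0
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```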

At inference time, I use the Euler method with N = 300 total steps, moving backwards from $t = 1$ to $t = 0$, so $\Delta t = \frac{1}{N}$. At each $t$, we compute the new estimate

$$X_{t - \Delta t} = X_t - \Delta t\,\big(v_{c,t} + \gamma\,(v_{c,t} - v_{u,t})\big)$$

where $v_{c,t} = \text{UNet}(X_t, t, \text{cond})$, $v_{u,t} = \text{UNet}(X_t, t, \text{null cond})$, and $\gamma$ is the CFG constant.
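A sketch of this sampling loop with CFG (names are mine; null_cond stands for the unconditional class input):

```python
import torch

@torch.no_grad()
def rf_sample(unet, cond, null_cond, shape, gamma, N=300, device="cuda"):
    """Integrate dX/dt = v backwards from t = 1 (noise) to t = 0 (image) in N Euler steps."""
    x = torch.randn(shape, device=device)          # X_1 ~ N(0, I)
    dt = 1.0 / N
    for i in range(N, 0, -1):
        t = torch.full((shape[0],), i / N, device=device)
        v_c = unet(x, t, cond)                     # conditional drift
        v_u = unet(x, t, null_cond)                # unconditional drift
        x = x - dt * (v_c + gamma * (v_c - v_u))   # Euler step towards t = 0
    return x
```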

Here are some results of rectified flow after respectively 1 and 20 epochs:

After 1 epoch

Rectified Flow Sample after 1 Epoch
Rectified Flow Last Image after 1 Epoch

After 20 epochs

Rectified Flow Sample after 20 Epochs
Rectified Flow Last Image after 20 Epochs

Rectified Flow: Reflow

Another amazing property, as introduced above, is that the rectified flow guarantees a transition cost no higher than before. Therefore, we can repeatedly apply the rectification, a procedure called Reflow:

$$Z^{k+1} = \text{Reflow}(Z_0^k, Z_1^k)$$

with $(Z_0^0, Z_1^0) = (X_0, X_1)$. The transition map then gets progressively straightened, so that the flow trajectories look like straight lines. This property allows us to solve the ODE with the Euler method in one or very few steps, namely

$$Z_t^k = Z_0^k + t\, v^k(Z_0^k, t)$$

Here's also a picture from this site that helps explain this:

Reflow Illustration

In this project, I repeated Reflow 3 times and sampled with a small $N = 3$; here are the results:

Reflow 1 with N = 3

Recflow 1 Result

Reflow 2 with N = 3

Recflow 2 Result

Reflow 3 with N = 3

Recflow 3 Result