Diffusion Models
CS 180 project 5

Part A: Inference with Pretrained DDPM
In Part A, we play around with the pretrained DeepFloyd IF model, a text-to-image model that takes text prompts as input and outputs images aligned with the text.
Model Setup
This section is to check that the model is correctly downloaded and deployed.
After inputting the text prompts an oil painting of a snowy mountain village, a man wearing a hat, and a rocket ship, the model generates the following images:

Note that the smaller images are the 64×64 outputs of stage 1, while the larger ones are the 256×256 outputs of the stage 2 upsampler.
Also, to make all results reproducible, I set the random seed to 3036702335, which is my SID.
A1. Sampling Loops
A diffusion model defines a forward process that, starting from a clean image $x_0$, produces progressively noisier images $x_t$ for timesteps $t = 0, 1, \dots, T$, where $x_T$ is essentially pure noise. In the DeepFloyd IF model, we have $T = 1000$.

In this section, we will explore ways to sample from the model. We will use the following test images:
The Sather Tower:

My roommate's cat, Nunu:

A watermelon wine I made myself:

These images will be resized to 64×64 to match the model's input resolution.
A1.1 Implementing the Forward Process
In the forward process, given a timestep $t$, we compute a noisy version $x_t$ of the clean image $x_0$ as
$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$,
where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise and $\bar\alpha_t$ is the cumulative product of the noise-schedule coefficients. In the DeepFloyd model, stage_1.scheduler.alphas_cumprod stores exactly these $\bar\alpha_t$ values, so we can implement the forward pass easily.
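A minimal sketch of this forward process, assuming alphas_cumprod is the tensor of $\bar\alpha_t$ values read from the scheduler:

    import torch

    def forward(im, t, alphas_cumprod):
        """Noise a clean image im to timestep t: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps."""
        a_bar = alphas_cumprod[t]            # \bar{alpha}_t for this timestep
        eps = torch.randn_like(im)           # eps ~ N(0, I)
        x_t = torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps
        return x_t, eps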
Here are the results of adding noise to the campanile image at timesteps [250, 500, 750], respectively:

A1.2 Classical Denoising
One of the most classical ways of denoising is Gaussian blur filtering. Here, we use a filter with kernel size 5; below are the results of trying to denoise the noisy images above:

Clearly, the results are not desirable: the blur suppresses noise and image detail alike.
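For reference, this classical denoising can be done with torchvision's Gaussian blur; the sigma value below is just an illustrative choice:

    import torchvision.transforms.functional as TF

    def classical_denoise(x_t, kernel_size=5, sigma=2.0):
        """Try to suppress the noise by simply blurring the noisy image."""
        return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)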
A1.3 Implementing One-Step Denoising
Given the formula above, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, we can try to recover an estimate of the clean image directly from $x_t$: we let the UNet predict the noise $\hat\epsilon$ and compute
$\hat{x}_0 = \dfrac{x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}}$,
where $\hat\epsilon$ is the noise estimated by stage_1.unet given $x_t$, the timestep $t$, and the prompt embedding.
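A hedged sketch of this one-step estimate, where estimate_noise is a placeholder for a wrapper around the stage_1.unet call that returns the predicted noise $\hat\epsilon$:

    import torch

    def one_step_denoise(x_t, t, alphas_cumprod, estimate_noise):
        """Estimate the clean image x_0 from x_t in a single step."""
        a_bar = alphas_cumprod[t]
        eps_hat = estimate_noise(x_t, t)                       # UNet's noise prediction
        return (x_t - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)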
Here are the results of the one-step denoising on the images above:

A1.4 Implementing Iterative Denoising
From the output of the last section, we see that when $t$ is large, the one-step denoised estimate loses much of the detail of the original image. Diffusion models are therefore designed to denoise iteratively, a little at a time.
In theory, we should run the denoiser at every one of the 1000 timesteps, but that is slow. Instead, we can skip steps and only denoise at a strided list of timesteps, strided_timesteps, which starts at the noisiest timestep and ends at 0.
Now, let t = strided_timesteps[i] and t' = strided_timesteps[i+1] (so $t'$ is less noisy than $t$). To get $x_{t'}$ from $x_t$, we use
$x_{t'} = \dfrac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1-\bar\alpha_t}\,\hat{x}_0 + \dfrac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t'})}{1-\bar\alpha_t}\,x_t + v_\sigma$,
where $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1-\alpha_t$, $\hat{x}_0$ is the current one-step estimate of the clean image, and $v_\sigma$ is predicted noise (variance) added back at each step.
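A hedged sketch of one such update from t to t' (estimate_noise is the same placeholder as before, and add_variance stands in for adding the $v_\sigma$ term, which I leave abstract here):

    import torch

    def iterative_denoise_step(x_t, t, t_prime, alphas_cumprod, estimate_noise, add_variance):
        """One strided denoising update from timestep t to the less-noisy timestep t'."""
        a_bar_t = alphas_cumprod[t]
        a_bar_tp = alphas_cumprod[t_prime]
        alpha = a_bar_t / a_bar_tp                             # alpha_t
        beta = 1 - alpha                                       # beta_t

        eps_hat = estimate_noise(x_t, t)
        x0_hat = (x_t - torch.sqrt(1 - a_bar_t) * eps_hat) / torch.sqrt(a_bar_t)

        x_tp = (torch.sqrt(a_bar_tp) * beta / (1 - a_bar_t)) * x0_hat \
             + (torch.sqrt(alpha) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
        return add_variance(x_tp, t)                           # add the v_sigma noise term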
Starting from i_start=10, the images at every 5 loops of denoising are:


And here is a comparison of the different denoising methods:

A1.5 Diffusion Model Sampling
Now, with the iterative denoising loop, we can use the diffusion model to generate images from scratch: we create an image of pure random noise, pass it to the model, and denoise from i_start=0. As the model denoises the pure noise, a new image is sampled.
Here are some results of sampling from the current model:

A1.6 Classifier Free Guidance
In this section, we implement classifier-free guidance (CFG). In CFG, at each denoising step we compute both a conditional noise estimate $\epsilon_c$ (using the text prompt) and an unconditional noise estimate $\epsilon_u$ (using an empty prompt), and combine them as
$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$,
where $\gamma$ controls the strength of the guidance; with $\gamma > 1$ we get the CFG effect of higher-quality (though less diverse) images.
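A minimal sketch of the CFG combination (estimate_noise is again a placeholder for the UNet wrapper; cond_emb and uncond_emb are the embeddings of the prompt and of the empty prompt):

    def cfg_noise(x_t, t, cond_emb, uncond_emb, gamma, estimate_noise):
        """Classifier-free guidance: push the conditional estimate away from the unconditional one."""
        eps_cond = estimate_noise(x_t, t, cond_emb)
        eps_uncond = estimate_noise(x_t, t, uncond_emb)
        return eps_uncond + gamma * (eps_cond - eps_uncond)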
Here are some results of sampling with CFG enabled:

It is notable that the images are now much better in quality and resemble realistic photos under the prompt a high quality photo.
A1.7 Image-to-image Translation
In part 1.4, we took a real image, added noise to it, and then denoised it. This effectively allows us to make edits to existing images: the more noise we add, the larger the edit will be. We can therefore create an image-to-image translation by adding noise at different levels and then denoising. Intuitively, we create a series of noisy images, from almost pure noise, to moderately noisy, to only slightly noisy; the diffusion model then produces images that range from completely new, to moderately modified, to very slightly modified, which gives the image-to-image translation.
Here are the results of this process with the prompt a high quality photo, at noise levels [1, 3, 5, 7, 10, 20] (indices into strided_timesteps), on the three test images above:


The results are very interesting: at first the generated image is completely unrelated to the original, but as the noise level decreases it gradually comes to resemble the original one.
A1.7.1 Editing Hand-Drawn and Web Images
This procedure works particularly well if we start with a non-realistic image (e.g., a painting, a sketch, some scribbles) and project it onto the natural image manifold. In this section, we will experiment with hand-drawn images. I will show the result of one web-downloaded image and two images that I drew myself. Here are the original images:
The image downloaded from this site:

The images I drew using Procreate:


And here are the results of the image-to-image translation:


A1.7.2 Inpainting
We can use the same procedure to implement inpainting: given an image $x_{orig}$ and a binary mask $\mathbf{m}$, we create a new image that has new content wherever the mask is 1 and keeps the original content wherever the mask is 0.
To do this, after each denoising step that produces $x_t$, we force the pixels outside the mask back to the (appropriately noised) original: $x_t \leftarrow \mathbf{m}\,x_t + (1-\mathbf{m})\,\text{forward}(x_{orig}, t)$.
By doing so, we only make edits in the masked region and keep everything else as in the original.
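A hedged sketch of this mask-forcing step (mask is 1 where new content is allowed, and forward is the noising function from A1.1, returning the noised image and the noise):

    def force_original(x_t, t, x_orig, mask, forward, alphas_cumprod):
        """Keep everything outside the mask equal to the original image noised to level t."""
        x_orig_t, _ = forward(x_orig, t, alphas_cumprod)
        return mask * x_t + (1 - mask) * x_orig_t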
I used the following masks on the three test images:



And here are the results, respectively:



A1.7.3 Text-Conditioned Image-to-image Translation
Now, we will do the same thing as in the previous section, but guide the projection with a text prompt. This is no longer a pure "projection to the natural image manifold"; it also adds control through language. This is simply a matter of changing the prompt from a high quality photo to the desired text prompt:
This is the result of test_im_1 using the prompt a rocket ship:

The result of test_im_2 using the prompt a sitting tiger:

The result of test_im_3 using the prompt a volcano:

A1.8 Visual Anagrams
In this section, we implement Visual Anagrams: we create an image that looks like prompt 1, but when flipped upside down reveals prompt 2.
To achieve this, at each step $t$ we compute two noise estimates: $\epsilon_1 = \text{UNet}(x_t, t, p_1)$ on the image with the first prompt, and $\epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2))$, i.e. the estimate obtained on the vertically flipped image with the second prompt, flipped back. We then denoise with the average $\epsilon = (\epsilon_1 + \epsilon_2)/2$, where $p_1$ and $p_2$ are the embeddings of the two prompts and flip denotes an upside-down flip.
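A hedged sketch of this averaging step (estimate_noise is the usual placeholder for the prompt-conditioned UNet call; dims=[-2] flips along the height axis):

    import torch

    def anagram_noise(x_t, t, emb1, emb2, estimate_noise):
        """Average the estimate for prompt 1 with the flipped-back estimate for prompt 2."""
        eps1 = estimate_noise(x_t, t, emb1)
        eps2 = torch.flip(estimate_noise(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
        return (eps1 + eps2) / 2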
Here are some results of creating vertically flipped visual anagrams:
Normal: an oil painting of an old man; flipped: an oil painting of people around a campfire

Normal: an oil painting of a red panda; flipped: an oil painting of a kitchenware

Normal: an oil painting of an old man; flipped: an oil painting of a horse

A1.9 Hybrid Images
In this part we'll implement Factorized Diffusion and create hybrid images that look like prompt 1 from far away and like prompt 2 up close.
To achieve this, we use the following algorithm: we compute a noise estimate for each of the two prompts, then combine the low frequencies of the first estimate with the high frequencies of the second.
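A hedged sketch of that combination, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are illustrative choices, and estimate_noise is the usual placeholder):

    import torchvision.transforms.functional as TF

    def hybrid_noise(x_t, t, emb_far, emb_close, estimate_noise):
        """Low frequencies follow the 'far' prompt; high frequencies follow the 'close' prompt."""
        eps_far = estimate_noise(x_t, t, emb_far)
        eps_close = estimate_noise(x_t, t, emb_close)
        lowpass = lambda e: TF.gaussian_blur(e, kernel_size=33, sigma=2.0)
        return lowpass(eps_far) + (eps_close - lowpass(eps_close))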
Here are some results of running this algorithm:
Far: a lithograph of a skull; close: a lithograph of waterfalls

Far: an oil painting of a dog; close: an oil painting of landscape

Far: an oil painting with frame of a panda; close: an oil painting with frame of houseplant

Bells & Whistles, Part A
Design a Course Logo
Using the diffusion model, I created two course logos that I think look kind of cool:
Prompt: A futuristic logo with a computer in the middle, and on its screen there's a camera lens in the middle to feature computer vision

Prompt: A logo about a robot with computer vision feature

Upsample Test Images
I also tried stage 2 of the DeepFloyd IF model, which upsamples images; here are the results of running it on the test images:


Text-conditioned Translation on Hand-drawn Images with Up Sampling
I also did a text-conditioned translation on the house sketch I drew, conditioned on the prompt a high quality photo of a house, and then upsampled the result using the same prompt. Here are the results:


Cool Image Creation
I also attempted to create some cool fictional images with the model and then upsample them. Here is the result of the prompt a gigantic robot with a skull face destroying the city:

Part B: DDPM with Customized UNet
B1 Unconditioned UNet
In this section, I implement the unconditioned UNet following this flow:

And the elementary blocks are implemented according to:

Once we have the UNet $D_\theta$, we train it as a denoiser: given a noisy image $z = x + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, the network is trained to map $z$ back to the clean image $x$ by minimizing the L2 loss $\mathbb{E}\,\|D_\theta(z) - x\|^2$.
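A hedged sketch of one training step of this objective (unet, optimizer, x, and sigma are placeholders for the actual model, optimizer, clean batch, and noise level used below):

    import torch
    import torch.nn.functional as F

    def train_step(unet, optimizer, x, sigma):
        """One denoiser training step on a clean batch x."""
        z = x + sigma * torch.randn_like(x)        # noisy input z = x + sigma * eps
        loss = F.mse_loss(unet(z), x)              # L2 loss between denoised output and clean image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()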
In this project, we play around with the MNIST dataset of handwritten digits. Here are some examples of adding noise of various levels to images in MNIST:

For training, we use a fixed noise level, hidden_dim=128, and lr=1e-4 with the Adam optimizer. Here are some training results:




And finally, here is the result of using the model, which was trained at a single noise level, on images noised at levels it was not trained on:

B2 Diffusion Models
B2.1 Time-conditioned DDPM
Following the DDPM paper, we implement a method similar to the math introduced in A1.1, and we make a slight modification to the UNet above to allow time-conditioning when predicting the noise:

Specifically, we will add the embedded time vector to the layers circled in the architecture plot.
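A hedged sketch of how such a time-conditioning block might look (the block and variable names are my own, not necessarily those used in the project code):

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Small MLP that maps a normalized timestep to a per-channel conditioning vector."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, t):
            return self.net(t)

    # Inside the UNet forward pass (sketch): with t a (B, 1) tensor of timesteps
    # normalized to [0, 1], broadcast-add the embedded time to the circled feature maps:
    #   feat = feat + fc_block(t).view(feat.shape[0], -1, 1, 1)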
Training of the model follows:

And sampling follows this algorithm:

Here are some samples after epoch=5 and epoch=20, respectively:
After epoch 5:

After epoch 20:

And the training curve for the time-conditioned DDPM is:

B2.2 Class-conditioned DDPM
The performance of solely time-conditioned sampling is not good, because the model does not know which digit it is supposed to move towards. We now add class conditioning to the architecture by multiplying certain layers by an embedding of the class vector; a hedged code sketch of this idea follows, and then the pseudo code:
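This sketch assumes a one-hot class vector c, two small FC blocks producing per-channel vectors, and a class-dropout probability so the model also learns an unconditional estimate for CFG; the names and the exact dropout value are my assumptions:

    import torch

    def condition(feat, t, c, fc_time, fc_class, p_uncond=0.1):
        """Multiply a feature map by the class embedding and add the time embedding."""
        B = c.shape[0]
        drop = (torch.rand(B, 1, device=c.device) < p_uncond).float()
        c = c * (1 - drop)                               # classifier-free dropout of the class
        scale = fc_class(c).view(B, -1, 1, 1)            # multiplicative class conditioning
        shift = fc_time(t).view(B, -1, 1, 1)             # additive time conditioning
        return scale * feat + shift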

We follow this new algorithm to train the model:

And follow this algorithm to sample:

Here are some samples, generated with CFG, after epoch=5 and epoch=20, respectively:
After epoch 5:

After epoch 20:

And the training curve for the class-conditioned DDPM is:

Bells & Whistles, Part B
GIFs for Time-conditioned and Class-conditioned DDPMs
I created some GIFs of the denoising process of tc_ddpm and cc_ddpm at different epochs. Here are the results:
Time-conditioned DDPM after epoch 1, 10, and 20



Class-conditioned DDPM after epoch 1, 10, and 20



Rectified Flow
The problem that rectified flow is interested in is the following: given two distributions $\pi_0$ and $\pi_1$, find a transport that carries samples of $\pi_0$ to samples of $\pi_1$.
This problem can be reformulated as finding a drift force $v(x_t, t)$ for the ODE $dx_t = v(x_t, t)\,dt$, such that trajectories starting at $x_0 \sim \pi_0$ end at $x_1 \sim \pi_1$.
The rectified flow observes that the linear interpolation $x_t = (1-t)\,x_0 + t\,x_1$ satisfies $dx_t = (x_1 - x_0)\,dt$; however, this drift is non-causal, since it depends on the endpoint $x_1$, which is unknown at sampling time.
Therefore, we cannot use the linear-interpolation drift directly, but we can train a neural net that depends only on $x_t$ and $t$ to approximate it, by minimizing $\mathbb{E}\,\|(x_1 - x_0) - v_\theta(x_t, t)\|^2$ over random pairs $(x_0, x_1)$ and times $t$.
The authors of rectified flow show that this approximated trajectory is guaranteed to have the same marginal distributions at the two ends, and to have a transport cost no higher than the original coupling for any convex cost function.
I implemented the rectified flow following the code in this repo. Specifically, I used the same Class-conditioned UNet as in the DDPM above as the neural net that estimates the drift force $v_\theta(x_t, t)$.
At inference time, I use a backward Euler method with N = 300 total steps. Specifically, starting from pure noise, we move along the learned trajectory in steps of size $1/N$, updating $x \leftarrow x + v_\theta(x, t)\,\Delta t$ at each step until we reach the data end.
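A hedged sketch of the training step and the Euler sampler, assuming the noise end of the interpolation sits at t=0 and the data end at t=1 (the project's exact convention and stepping direction may differ), with v_net standing in for the class-conditioned UNet:

    import torch
    import torch.nn.functional as F

    def rf_train_step(v_net, optimizer, x1, c):
        """Regress the drift network onto the straight-line velocity x1 - x0."""
        x0 = torch.randn_like(x1)                            # noise end of the interpolation
        t = torch.rand(x1.shape[0], 1, device=x1.device)
        x_t = (1 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * x1
        loss = F.mse_loss(v_net(x_t, t, c), x1 - x0)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()

    @torch.no_grad()
    def rf_sample(v_net, c, shape, N=300, device="cpu"):
        """Integrate the learned ODE with N Euler steps from noise to data."""
        x = torch.randn(shape, device=device)
        for i in range(N):
            t = torch.full((shape[0], 1), i / N, device=device)
            x = x + v_net(x, t, c) / N
        return x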
Here are some results of rectified flow after 1 and 20 epochs, respectively:
After 1 epoch


After 20 epochs


Rectified Flow: Reflow
Another amazing property is that, as introduced above, the rectified flow guarantees a transport cost no higher than before. We can therefore repeatedly apply the rectification procedure, which is called Reflow: namely $Z^{k+1} = \text{Rectify}(Z^k)$, with $Z^0$ the initial coupling of noise and data. Each round of Reflow makes the trajectories straighter, so after a few rounds only a handful of Euler steps are needed for sampling.
Here's also a picture from this site that helps explain this:

In this project, I applied Reflow 3 times and sampled using a small N=3; here are the results:
Reflow 1 with N = 3

Reflow 2 with N = 3

Reflow 3 with N = 3
