Diffusion and Score-Based Generative Models
Date Posted:
December 16, 2022
Date Recorded:
December 12, 2022
Speaker(s):
Yang Song, Stanford University
Description:
Generating data with complex patterns, such as images, audio, and molecular structures, requires fitting very flexible statistical models to the data distribution. Even in the age of deep neural networks, building such models is difficult because they typically require an intractable normalization procedure to represent a probability distribution. To address this challenge, we consider modeling the vector field of gradients of the data distribution (known as the score function), which does not require normalization and therefore can take full advantage of the flexibility of deep neural networks. I will show how to (1) estimate the score function from data with flexible deep neural networks and efficient statistical methods, (2) generate new data using stochastic differential equations and Markov chain Monte Carlo, and even (3) evaluate probability values accurately as in a traditional statistical model. The resulting method, called score-based generative modeling or diffusion modeling, achieves record performance in applications including image synthesis, text-to-speech generation, time series prediction, and point cloud generation, challenging the long-time dominance of generative adversarial networks (GANs) on many of these tasks. Furthermore, score-based generative models are particularly suitable for Bayesian reasoning tasks such as solving ill-posed inverse problems, yielding superior performance on several tasks in medical image reconstruction.
GRETA TUCKUTE: So first of all, thank you so much for agreeing to give us this tutorial. There's been a lot of interest in our department and, of course, elsewhere. And Dr. Yang Song is currently at OpenAI and joining Caltech in '24, you said, as an assistant professor. So we are very excited to hear about diffusion and score-based generative models. So please take it away.
YANG SONG: Thank you for the introduction and thank you for having me here today. So I will be giving you a tutorial on diffusion models, but this tutorial is a little bit different, because we approach diffusion from the perspective of score matching and score-based generative models. So in this talk, you can view score-based generative models as an interchangeable term with diffusion models.
So I'm using the term score-based generative models because I hope to emphasize their connection to score functions. Score functions are defined as gradients of log probability densities. By modeling and estimating this quantity, we can build very flexible and powerful probabilistic generative models that can give us very high-quality samples and also predict accurate probability values. In this talk, I will mostly cover the research I did during my PhD at Stanford. And this line of research would have been impossible without the help of many collaborators, mentors, and friends whose names are listed below.
So let's start by briefly reviewing the recent progress of deep generative models in various applications. Nowadays we are able to build very powerful image generative models that can create realistic pictures from text descriptions. Here is an example produced by DALL-E 2, which is a model developed by OpenAI. And you can find similar pictures generated by Imagen or Stable Diffusion. Similar success has been extended to video generation as well, and this is an example generated from Imagen Video, which is a model developed by Google Brain very recently.
And deep generative models are also very useful for many scientific applications. This video shows you an example of using deep generative models to predict weather maps for weather nowcasting. And DeepMind has demonstrated that this approach can even outperform human experts on weather nowcasting.
We can also use deep generative models to help us automatically complete code in order to maximize the productivity of computer programmers. And this is an example of using a language model to generate code given the comments of the computer program. So this technology has already been deployed in products, and you might know it as GitHub Copilot.
So we have many important applications of deep generative models. And you may ask, how can I build such powerful generative models? It turns out that almost all generative models follow the same pipeline. And the basic idea is to estimate the probability distribution of data.
So in order to build a deep generative model, the first thing we need to do is to collect a large data set. And as a running example, let's suppose the data set contains many images of dogs. A typical assumption in statistics and machine learning is that all those data points in our training data set come from some underlying data distribution. In other words, those data points are basically i.i.d. samples from this data distribution, but we don't have the analytical form of the data distribution, and we have to estimate it.
And to estimate this data distribution, we have to create a model. This model represents a parameterized probability distribution, which we call the model distribution. And we hope to tune the model parameters to make sure this model distribution is close to the data distribution in a certain sense. So if this model distribution is very close to the data distribution, then we can use the model for many important applications. And one example is, of course, we can generate an unlimited number of novel data points just by sampling from this model distribution.
Another application is we can use this model distribution to compute the probability value of any potential data point. As an example, for a data point like a picture of a chihuahua, because it is a picture of a dog, it is actually within our data distribution. And therefore, this model distribution usually assigns high probability values to such data points. For some irrelevant data point, like a picture of a muffin, because it is not a picture of a dog, a good model distribution will assign lower probability values to such images. So because this model distribution provides a way to generate novel data points, we also refer to it as a generative model.
So how can we train those generative models? Now that we have a large data set, we may formalize the problem a little bit further. We can use a symbol x i to represent each data point in the data set, and we have a total of N data points. Our model provides a family of probability distributions, and we hope to find a single probability distribution inside this huge family by minimizing the distance from P theta to P data. And afterwards, we can just generate samples from P theta.
However, there is one key challenge associated with this framework. That is, our data distribution can be extremely complicated, especially for data in high dimensions. Consider how complicated the distributions of images, videos, or audio might be--they might have millions of dimensions. And as a result, we have to use a very powerful model distribution in order to estimate our data distribution.
So how can we build a powerful model distribution? Let's recall that in statistics we often work with simple distributions, such as Gaussian distribution. Of course, a Gaussian distribution is too simple. It won't be able to approximate our complicated data distribution. But it serves as a good starting point.
So a Gaussian distribution is basically a computational graph that has two layers. The first layer corresponds to the input data point. The second layer is a single unit that basically gives you the probability density function of this Gaussian distribution. So this computation is very simple, and the mu in this slide denotes the mean parameter of this Gaussian distribution. By changing the value of mu, you are basically changing the mean of this Gaussian.
But as we said, Gaussian models are too simple. How can we make a more complicated model? Well, a very natural idea is to leverage a bigger and deeper computational graph, which we also call a deep neural network. So we hope to use that deep neural network to represent a complicated probability distribution P theta, where theta denotes the weights in this deep neural network.
And when we use deep neural networks to build those powerful generative models, we obtain deep generative models. But it is actually nontrivial to use a deep neural network to directly represent a probability distribution, because we typically view a deep neural network as a black box that converts a high-dimensional input x to a typically one-dimensional output f theta of x. This output value f theta does not directly define a probability distribution, because it may not be positive everywhere.
So our first step to convert this into a probability density is to take the exponential of the output, so that the output becomes positive. And then we can normalize by dividing by a constant Z theta in order to construct a probability distribution that has positive values everywhere and is also properly normalized. The denominator here is called the normalizing constant. And by definition, this normalizing constant is computed by evaluating the high-dimensional integral of the exponential of f theta over all possible values of x in the space.
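Written out, the construction just described is the standard energy-based parameterization (reconstructed here for reference):

$$
p_\theta(\mathbf{x}) = \frac{e^{f_\theta(\mathbf{x})}}{Z_\theta}, \qquad Z_\theta = \int e^{f_\theta(\mathbf{x})}\,\mathrm{d}\mathbf{x}.
$$

The exponential guarantees positivity, and dividing by Z theta guarantees that the density integrates to 1.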
In the special case of Gaussian models, this normalizing constant is very simple to compute, because f theta in Gaussian models has a very simple form, so we can directly compute the integral in closed form. But when we are trying to handle more powerful deep neural network models, this normalizing constant becomes intractable to compute. As a quick example, even if we consider a simplified case where x is discrete, in which case the integral becomes a summation, computing this normalizing constant is still a #P-complete problem, which is at least as hard as NP-complete.
And this difficulty is by no means unique to deep generative modeling. You can find many similar challenges in adjacent fields, such as thermodynamics and statistical mechanics, and people have been studying this problem for quite a while. In the current literature on deep generative models, there are mostly three approaches to address this intractable normalizing constant difficulty. And as a result, we can actually categorize deep generative models into three different families.
So the first category is based on approximating this normalizing constant using approaches such as Markov chain Monte Carlo. One typical example inside this family is energy-based models trained by contrastive divergence. The disadvantage of this direction is that, because we have to approximate the normalizing constant, we cannot compute probability values accurately, since the probability value requires dividing by this approximate normalizing constant.
The second major approach is based on using restricted neural network models, such that the normalizing constant is tractable by construction. There are a few examples inside this family, but the challenge is that once we restrict our neural network models, we also limit the flexibility of the deep generative models we can potentially build along this direction. The last category is based on modeling the data generating process directly instead of modeling the probability density function. The most prominent example in this family is generative adversarial networks. However, because those approaches do not model the underlying data distribution, they cannot give us accurate probability values.
So these are a few challenges associated with previous generative modeling frameworks. And if we want to address those difficulties by proposing a better framework of generative modeling, then we require this better framework to satisfy certain desiderata. The first is we hope this better framework allows us to use very flexible neural network models to parameterize the distribution. This not only addresses the second challenge on the left side, but also allows us to take full advantage of the deep learning revolution and leverage very powerful deep neural networks to build our deep generative models.
The second desideratum is we hope to evaluate probability values accurately using this new framework of generative modeling. If we can evaluate probability values accurately, we can address the rest of the challenges on the left side. Moreover, those accurate probability values are very important for applications such as outlier detection, model comparison, or lossless compression.
And finally, because we are aiming to build a more powerful framework of generative models, we of course want to generate samples with better quality. Not only do we want to generate samples with better quality, we also want to control the generation process in a principled way, so that we may use this generative model for numerous downstream applications. One example is medical image reconstruction, which I will discuss briefly later in the tutorial.
So now, in today's talk, I will show you one such framework that satisfies all three desiderata listed here. The key to this framework is to work with score functions to represent our probability distribution. So what is the score function? Well, suppose we have a continuous probability distribution, where we use px to represent the probability density function.
We define the score function as the gradient of log px. This quantity has multiple names. It can be called the Stein score function, to differentiate it from the Fisher score functions that typically appear in statistics. It can also simply be called the score function, or the score. Be careful: this gradient is taken with respect to the random variable x--it is not taken with respect to any model parameter like theta.
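As a concrete worked example (a standard one, not from the slides), the score of a Gaussian N(mu, sigma^2 I) can be computed in closed form:

$$
\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \left[ -\frac{\|\mathbf{x} - \boldsymbol{\mu}\|_2^2}{2\sigma^2} + \text{const} \right] = \frac{\boldsymbol{\mu} - \mathbf{x}}{\sigma^2},
$$

which is a vector field pointing from x toward the mean, that is, toward higher density.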
So what does our score function look like? Let's consider a simple example, which is a mixture of two Gaussians. This figure shows the density function and the score function for this mixture of Gaussians. The density function is color coded, where darker color indicates higher density. The score function is a vector field that gives the direction in which the density function grows most quickly.
So given the density function, we can compute the score function very easily, because we can just take the derivative. Conversely, given the score function, we can also recover the density function in principle by computing integrals. So mathematically, the score function preserves all the information in the density function; they are equivalent to each other. But computationally, the score function is much easier to work with compared to the density function.
So when we work with the score function for representing probability distributions, we get score-based generative models. And I will show you that score-based generative models have multiple advantages. First, they allow very flexible models, because score functions actually do not need to be normalized at all. This means we can use very flexible neural network models to represent the score function, and we can learn such models of score functions from data using principled statistical approaches.
The second advantage is we can directly generate samples from those models of score functions, and those samples can have surprisingly good quality--even better than GANs in many situations. Moreover, we can control the sample generation process in a principled way for many important applications. And finally, even if we only have a model of the score function, we can still compute probability values accurately. Empirically, we can even obtain better probability values compared to those models that directly work with probability density functions.
So in the rest of the tutorial, I will first focus on how score-based generative modeling allows very flexible models. Recall that one major difficulty in deep generative modeling is the intractable normalizing constant problem when we try to model the probability density function. Indeed, if we want to model this probability distribution using a normalized probability model, then no matter how we change our model parameters, the architecture, or other configurations, we always have to ensure that the distribution represented is properly normalized. In other words, the area below this curve has to be 1.
And due to this constraint, when we use deep neural networks to model those density functions, we always have to deal with this intractable normalizing constant difficulty. But in contrast, if we model the same distribution through the score function, then, as the animation shows, there is no such normalization restriction. And, in fact, if we compute the score function for the neural network on the left side, we notice that the score function is the difference of two terms.
Only the second term involves the intractable normalizing constant. But the second term is always 0, because the gradient of any constant is always 0. As a result, the score function equals the gradient of the deep neural network. And as you might know, gradients of deep neural networks can be easily computed with automatic differentiation, or backpropagation, so this is a very efficient operation. From now on, we use a simple s theta to denote such a deep neural network model for the score function, and we call it our score model.
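Here is a minimal PyTorch sketch of this observation (my own illustration; the small MLP f_theta is a placeholder, not an architecture from the talk). The score of p_theta(x) = exp(f_theta(x)) / Z_theta is just the gradient of f_theta, which automatic differentiation gives us without ever touching Z_theta:

```python
import torch
import torch.nn as nn

# Placeholder unnormalized log-density ("energy") network f_theta: R^d -> R.
f_theta = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def score(x):
    """s_theta(x) = grad_x f_theta(x). The grad_x log Z_theta term is zero,
    so the intractable normalizing constant never appears."""
    x = x.detach().requires_grad_(True)
    fx = f_theta(x).sum()  # summing over the batch lets one backward pass handle all samples
    return torch.autograd.grad(fx, x)[0]

print(score(torch.randn(8, 2)).shape)  # torch.Size([8, 2]): one score vector per input
```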
Suppose we have collected a large training data set, and again we use x1, x2, to xN to denote each point in this data set. We assume the underlying data density is given by P data. With our knowledge in statistics, we know that we can train a properly normalized statistical model to estimate the underlying data density using methods such as maximum likelihood. But because we want to work with score functions, we want to develop a similar approach that allows us to train our score model to estimate the underlying score function from a limited set of training data points.
And we have formulated this problem as score estimation. Mathematically, we are given a bunch of data points which are assumed to be i.i.d. samples from the data distribution P data, and our goal is to estimate the score function of the data density. We are given a score model. This is assumed to be a deep neural network that maps a D-dimensional input to a D-dimensional output, and we hope to train this score model such that it approximates the ground truth score function of the data distribution.
So how can we train this score model to be close to our ground truth data score function? Well, we need to minimize a certain objective. This objective has to compare two vector fields of score functions.
Here, one vector field is the ground truth data score function. The other vector field is predicted by our score model. How can we compare the difference?
Let's recall that those two vector fields actually lie in the same space. So we might be able to compute the difference vectors between corresponding pairs of vectors from the original vector fields. And then we can average over the magnitudes of those difference vectors to form a single scalar-valued objective. Mathematically, we can capture this intuition with the Fisher divergence objective.
So Fisher divergence is essentially an expected squared Euclidean distance between the data score and the model score averaged over samples from the data distribution. However, Fisher divergence cannot be directly computed because we don't know the ground truth value of the data score function. But luckily there is a way to address this challenge, and the method is called score matching.
So score matching uses integration by parts, or Gauss's theorem, to convert the Fisher divergence into the following equivalent objective. The objective at the bottom is equivalent to the Fisher divergence up to a constant. But since constants do not affect optimization, the score matching objective defines the same optimum as the Fisher divergence.
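In symbols, this is Hyvärinen's classical score matching identity (reconstructed here from the description above):

$$
\frac{1}{2}\,\mathbb{E}_{p_\text{data}}\!\left[\big\|\nabla_\mathbf{x}\log p_\text{data}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\big\|_2^2\right]
=
\mathbb{E}_{p_\text{data}}\!\left[\frac{1}{2}\big\|\mathbf{s}_\theta(\mathbf{x})\big\|_2^2 + \operatorname{tr}\!\big(\nabla_\mathbf{x}\mathbf{s}_\theta(\mathbf{x})\big)\right] + \text{const},
$$

where the constant does not depend on theta, and the trace term involves the Jacobian of the score model.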
So in the score matching objective, there is no dependency on the score function of the data distribution anymore. Moreover, the expectation in score matching can be efficiently approximated using the empirical mean over the training data set. So far, so good. However, the score matching objective is not scalable to compute, especially when you want to use deep neural networks to model high-dimensional data points. So let's suppose our score function is parameterized by a deep neural network, which we call a deep score model.
In order to use score matching, we have to compute two terms, where one term is the squared Euclidean norm of the score model output, and the second term is the trace of the Jacobian of the score model. The first term is super simple to compute and very efficient, because we just need one forward propagation to get the output, and then we can compute the squared L2 norm very efficiently.
For the second term, things become a little bit more complicated, because we need one forward propagation to compute the score function output, and then we need a backpropagation to compute the first element on the diagonal of this Jacobian. This procedure has to be repeated multiple times until we have recovered all the diagonal elements of the Jacobian. Then we can sum over the diagonal elements to get the trace.
So this whole procedure requires a lot of backpropagations, and the number of backpropagations is proportional to the dimensionality of our data point. For modeling high-dimensional data like images, we might need to deal with millions of dimensions. And this means score matching in its naive form is not scalable.
So to address this challenge, we proposed a more efficient variant of score matching, which we term sliced score matching. The basic intuition is that one-dimensional problems should be much easier to solve than high-dimensional problems. And how can we convert a high-dimensional problem to a one-dimensional problem? Well, we can leverage random projections.
We project the high-dimensional vector fields onto random directions, and then we get one-dimensional scalar fields. So suppose those two high-dimensional vector fields are close to each other. If we project them along random one-dimensional directions, the resulting one-dimensional scalar fields will also be close to each other.
So we can capture this intuition with the concept of sliced Fisher divergence. Here v denotes the projection direction--it is a vector--and pv denotes the distribution of those projection directions. So we compute the inner products between v and those two score functions and measure the resulting difference between them. And we can again leverage integration by parts to eliminate the dependency on the ground truth data score. This gives us the sliced score matching objective.
And in sliced score matching, there is no trace of a Jacobian anymore. Instead, we have a vector-Jacobian-vector product, and this term is much more scalable to compute. This is actually not hard to see, because we can rewrite the vector-Jacobian-vector product in an alternative form on the right-hand side. This just requires us to move v inside the gradient operator.
So now I will show you how to compute this vector-Jacobian-vector product very efficiently. First, we just need one forward propagation to get the output of s theta, and then we can directly compute the inner product between v and s theta. This amounts to adding one additional neuron to the computational graph. Next, we can compute the gradient of this inner product by doing one backpropagation. And as the last step, we just need to compute the inner product between v and the resulting gradient.
So the whole procedure only requires one backpropagation, which is much more efficient compared to the vanilla form of score matching. This is how sliced score matching works in practice. We just sample a minibatch of data points from our data set, and for each data point, we sample one single projection direction from the distribution pv. Then we form the empirical estimate of the sliced score matching training objective using the empirical mean over our sampled data points and those projection directions.
So the projection distribution pv is typically a simple standard Gaussian distribution, or sometimes, even better, you can use the Rademacher distribution, which gives uniformly distributed random sign vectors. Then we can use stochastic gradient descent to minimize our empirical objective for sliced score matching. And if you want better performance, or equivalently lower variance of the training objective, you could potentially use more projections per data point.
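A minimal PyTorch sketch of one sliced score matching loss evaluation, under the simplifications just described (one Gaussian projection per data point; score_model is a placeholder name for any network from R^d to R^d):

```python
import torch

def ssm_loss(score_model, x):
    """Single-projection sliced score matching:
    E[ v^T (ds_theta/dx) v + 0.5 * (v^T s_theta(x))^2 ],  v ~ N(0, I).
    Needs only one extra backpropagation per minibatch."""
    x = x.detach().requires_grad_(True)
    v = torch.randn_like(x)                 # one random projection direction per sample
    s = score_model(x)                      # (batch, d) predicted scores
    sv = (s * v).sum()                      # the "extra neuron": sum_i v_i . s_theta(x_i)
    grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]  # one backpropagation
    loss = (v * grad_sv).sum(dim=1) + 0.5 * (s * v).sum(dim=1) ** 2
    return loss.mean()
```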
So that concludes the discussion of sliced score matching. There exists another approach, called denoising score matching, that can also bypass the computational challenge of vanilla score matching. The idea of denoising score matching is to add noise to the data points to help us compute the trace of the Jacobian term. To perform denoising score matching, we need to design a perturbation kernel, which we denote as q sigma. Here x tilde denotes the perturbed data point, and x denotes the original noise-free data point.
So q sigma can typically be a Gaussian distribution with mean x and standard deviation sigma. After convolving this perturbation kernel with our original data distribution, we get a noisy data distribution q sigma of x tilde. The key idea of denoising score matching is to estimate the score function of this noisy data density instead of the score function of the original data density. Of course, when sigma is very small, you can approximately view the score function of the noisy density as equivalent to the score function of the noise-free density.
So the magic happens when you estimate this score function of the noisy distribution. You can use some algebraic derivation to write down an equivalent form of the denoising score matching objective, which I give at the bottom of this slide. In this new form, what we need to compute is just the gradient of the perturbation kernel. Because we design the perturbation kernel by hand, it is usually a fully tractable distribution, so computing this gradient is very efficient, and it can be done analytically.
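For the Gaussian kernel q_sigma(x_tilde | x) = N(x_tilde; x, sigma^2 I), this gradient is indeed analytic:

$$
\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}} \mid \mathbf{x})
= \nabla_{\tilde{\mathbf{x}}} \left[-\frac{\|\tilde{\mathbf{x}} - \mathbf{x}\|_2^2}{2\sigma^2}\right]
= \frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma^2},
$$

so the denoising score matching objective reduces to regressing s_theta(x_tilde) onto (x - x_tilde) / sigma^2, in other words, onto the direction that removes the noise.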
But the downside of denoising score matching is that, since it requires adding noise to data points, it cannot estimate scores of noise-free distributions. And what's worse, when you try to lower the magnitude of the noise, the variance of the denoising score matching objective becomes bigger and bigger and eventually explodes. So there is no easy way to use denoising score matching for noise-free score estimation.
So we can actually derive the formula of denoising score matching very easily, but due to time reasons, we will skip this part. It's not hard to find this derivation in the original paper on denoising score matching. As a conclusion, when you want to apply denoising score matching, you follow a similar procedure as sliced score matching.
First of all, you sample a minibatch of data points from the data density. And then you sample a minibatch of perturbed data points. Usually, for each data point, you sample a single perturbed data point by adding noise to the chosen data point. Then you can form the empirical estimate of the denoising score matching loss by approximating the expectation using empirical means.
So in the special case of Gaussian perturbations, you can further simplify the denoising score matching loss function, and then you can just apply stochastic gradient descent to minimize this objective and train your score model. In practice, if you want this to work well for estimating score functions of noise-free data densities, you need to choose a very small sigma. But as I said, when sigma is very small, the variance of this objective will explode. So there is a tradeoff, and you need to find the sweet spot.
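Putting the Gaussian special case into code (a sketch of mine; the value of sigma is purely illustrative):

```python
import torch

def dsm_loss(score_model, x, sigma=0.1):
    """Denoising score matching with a Gaussian perturbation kernel:
    regress s_theta(x_tilde) onto (x - x_tilde) / sigma^2."""
    x_tilde = x + sigma * torch.randn_like(x)   # one perturbed copy per data point
    target = (x - x_tilde) / sigma ** 2
    return 0.5 * ((score_model(x_tilde) - target) ** 2).sum(dim=1).mean()
```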
So here are some experimental results. We first want to compare the computational efficiency of sliced score matching and denoising score matching versus the vanilla version of score matching. We consider the problem of training energy-based models, or, equivalently, the problem of training score functions on noise-free data. The first figure shows how much time is needed to perform each iteration of the various algorithms as a function of data dimensionality. When data dimensionality increases, all those algorithms take more time to perform one training iteration. But clearly, score matching, shown in brown, scales the worst. In contrast, sliced score matching and denoising score matching, abbreviated as SSM and DSM in the figure, scale much more favorably compared to score matching.
And in terms of the actual performance of score estimation, we report the results in the left figure, where performance is better when the number is lower. Comparing sliced score matching and score matching, you can see that even though sliced score matching takes much less time to compute, it can still obtain more or less comparable performance to score matching in terms of score estimation.
So really, we gain a lot of computational speedup at a small cost in the accuracy of score estimation. For denoising score matching--DSM--because you have to inject noise into the data points, the performance in score estimation is not as good as sliced score matching when you want to estimate the score function of clean data points. So everything is as we expected. Now I have discussed how working with score functions allows very flexible models, because score functions bypass the challenge of the normalizing constant, and we can use principled statistical methods like score matching, sliced score matching, or denoising score matching to train those score models from data. In the next part, I will show you how we can generate samples from these models of score functions, and how we can control the sample generation process in a principled way.
So as a quick recap, we know that, given a large training data set, we can use principled statistical methods like score matching to train our score model to estimate the underlying score function. In order to build a generative model, we have to find some approach to create new data points from a given vector field of score functions. So how can we do this?
Well, suppose we are already given the score function, and imagine there are many random points scattered across the space. Can we move those random points to form samples from the score function?
Well, one idea is we can potentially move those points by following the directions predicted by the score function. However, this will not give us valid samples, because all of those points will eventually collapse into each other. But this problem can be addressed if we follow a noisy version of the score function. Equivalently, we want to inject Gaussian noise into our score function and follow those noise-perturbed score functions.
So this method is the well-known approach of Langevin dynamics. And it is also well known that if we run this sampling procedure long enough to reach convergence, and if we set the step size to be very, very small, then Langevin dynamics is guaranteed to give you correct samples from the score function. So here are the details of Langevin sampling. The goal is to sample from some density px using only the score function--the gradient of log px.
And the procedure of Langevin dynamics is as follows-- first, we initialize our sample from some prior distribution. This prior distribution can be very simple. It can be a Gaussian distribution. It can be a uniform distribution. And we repeat the following procedure multiple times.
So in each of the sampling steps, we first generate a random Gaussian vector from the standard Gaussian distribution, and then we modify x according to the following recurrence equation. We basically update the previous sample using the score function, plus a scaled version of the Gaussian noise vector. And if you set epsilon to something very close to 0, and if you set the total number of iterations, capital T, to be large enough, then we are guaranteed to obtain a valid sample from the underlying density of the score function.
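The loop in code (a sketch of mine, using one common convention for the step size; eps and T are illustrative values):

```python
import torch

def langevin_sample(score, x0, eps=1e-4, T=1000):
    """Langevin dynamics: x <- x + eps * score(x) + sqrt(2 * eps) * z,  z ~ N(0, I).
    As eps -> 0 and T -> infinity, x approaches a sample from the target density."""
    x = x0.clone()
    for _ in range(T):
        z = torch.randn_like(x)
        x = x + eps * score(x) + (2 * eps) ** 0.5 * z
    return x
```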
So now we know score matching can estimate the score function from data, and Langevin dynamics can generate samples from the score function. So it becomes very, very natural to just replace the score function in Langevin dynamics with our score model, and then we can generate new data samples--we have defined a new generative model.
So this approach sounds very nice from the theoretical perspective, but it does not work well in practice. So here are the results of combining score matching and Langevin dynamics naively. The left figure shows some images from the CIFAR-10 data set. So CIFAR-10 is a data set that contains many images of size 32 by 32, and the right figure shows you newly generated samples by combining score matching and Langevin dynamics naively.
So clearly you can see that the newly generated samples do not look realistic at all. So there has to be something very wrong with this simple naive approach. And in our research, we identified several challenges. One interesting challenge is it is hard to estimate score functions accurately in low data density regions.
So to illustrate this challenge, let's consider the prior example of a mixture of Gaussians again. The left figure shows you the ground truth density function. The middle figure shows you the ground truth score function. The rightmost figure gives the estimated score function from score matching.
If you compare those two vector fields, it's clear that the estimated scores are accurate in high data density regions, which are marked by those green boxes. But in low data density regions, the estimated scores are not accurate at all. This is not totally unexpected, because we use score matching to train our score model, and score matching compares the difference between the ground truth and the model only at samples from the data distribution. So in low data density regions, we don't have many samples, and therefore we don't have enough information to infer the true score functions in those regions.
And this is a huge obstacle for Langevin dynamics to produce high-quality samples, because Langevin dynamics will have a lot of trouble exploring and navigating those low data density regions. So how can we address this challenge? One idea is to inject Gaussian noise to perturb our data points. After adding enough Gaussian noise, the perturbed data points spread everywhere in the space, which means the size of the low data density regions becomes smaller.
So in the context of image generation, adding Gaussian noise means we perturb each pixel of the image. In this toy example, you can see that, after injecting the right amount of Gaussian noise, the estimated scores now become accurate almost everywhere. This phenomenon is very promising, because it at least says that the score functions of noisy data densities are much easier to estimate accurately, and those score functions of noisy densities can provide valuable directional information to guide Langevin dynamics from low data density regions to high data density regions.
But simply injecting Gaussian noise will not solve all the problems, because after perturbation, those noisy data densities are no longer good approximations to the original true data density. To solve this problem, we propose to use a sequence of multiple noise levels. As a toy example, we consider three noise levels, from sigma 1 to sigma 3. We use Gaussian noise of mean 0 and standard deviations from sigma 1 to sigma 3 to perturb our training data set, and this gives us three noisy training data sets.
For each noisy data set, there will be a corresponding noisy data density, which we denote as p sigma 1 to p sigma 3. In the context of images, perturbation with multiple levels of noise gives you a sequence of images, as demonstrated here. After obtaining those noisy data sets, we want to estimate the score functions of the corresponding noisy data densities.
So how can we estimate three noisy score functions? Well, the most naive approach is to train three networks, where each network is responsible for estimating the score function of a single noise level. But this is not a scalable solution, because in practice we might require many more noise levels. For example, image generation typically requires hundreds to thousands of noise levels.
A more scalable solution is to consider a conditional score model, which we call a noise-conditional score model. A noise-conditional score model is a simple modification of our score model: it takes the noise level sigma as one additional input to the model, and the output corresponds to the score function of the data density perturbed with noise level sigma.
So how can we train this noise-conditional score model? Well, again, we can leverage the idea of score matching, with an important modification so that we jointly train the score model across all noise levels. In this modification, we have a summation of score matching losses: one score matching loss for each noise level sigma i, and a positive weighting function, lambda sigma i.
So the value of this weighting function is typically chosen heuristically, but it can also be derived using a principled analysis of the problem. We have this positive weighting function to balance the scales of the score matching losses across all noise levels, which is helpful for optimization. By minimizing this modified score matching loss, if our optimizer is powerful enough, and if our model is expressive enough, then we will obtain accurate score estimation for all noise levels.
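Written out, the joint objective is (with each expectation estimated by denoising or sliced score matching; the choice lambda(sigma) = sigma^2 is one heuristic used in the original noise-conditional score network paper):

$$
\sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\!\left[\big\|\mathbf{s}_\theta(\mathbf{x}, \sigma_i) - \nabla_\mathbf{x}\log p_{\sigma_i}(\mathbf{x})\big\|_2^2\right].
$$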
So after training this noise-conditional score model, how do we generate samples? Well, one additional note is that this mixture of score matching losses is actually a generalization of the training objective of the first version of diffusion probabilistic models, proposed in 2015. And this connects score-based generative models to diffusion models.
The connection between score-based models and diffusion models was first unveiled by the DDPM paper, published in 2020. So now, let's return to the question of how to sample from the noise-conditional score model after training with the score matching loss. Well, we can still use Langevin dynamics.
We can first apply Langevin dynamics to sample from the score model with the biggest perturbation noise, and the samples will be used as the initialization for sampling from the score model of the next noise level. We continue in this fashion until finally we generate samples from the score function with the smallest noise level.
We call this sampling procedure annealed Langevin dynamics, because the rough intuition is that we gradually anneal down the temperature of our data density by gradually reducing the noise level. And this is what it looks like when we apply this approach to modeling real images. It's quite remarkable that we can start from pure random noise, modify those images according to the score model, and eventually obtain nice-looking samples.
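A compact sketch of annealed Langevin dynamics (my illustration; the per-level step size alpha proportional to sigma^2 follows the heuristic from the noise-conditional score network paper, and eps and T are illustrative):

```python
import torch

def annealed_langevin(score_model, sigmas, shape, eps=2e-5, T=100):
    """Run Langevin dynamics at each noise level, from the largest sigma down
    to the smallest, reusing the final sample at one level to initialize the next."""
    x = torch.rand(shape)                            # uninformative initialization
    for sigma in sorted(sigmas, reverse=True):
        alpha = eps * (sigma / min(sigmas)) ** 2     # heuristic step size per level
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```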
And this is the result of this simple noise-conditional score model approach in 2019. We reported inception scores and FID scores on the CIFAR-10 data set. Those inception scores and FIDs are important quantities for comparing the performance of different generative models in terms of sample quality. This was the first time that a different method could outperform GANs in terms of achieving a higher inception score.
So of course the FID score was still lagging behind GANs, but at that time it was quite surprising that a simple approach could already outperform GANs on one important metric, the inception score. So why is it important to outperform GANs? Because GANs were the best generative models for sample generation, especially for images, for quite a while. And as Turing Award winner Yann LeCun has said, GANs are the most interesting idea in the last 10 years in machine learning.
And, indeed, GANs have attracted a lot of research effort from big corporations and universities, and people have improved GANs so much--spent so much engineering effort on them--that it is amazing to see GANs generate very nice-looking images. But it is quite remarkable that we can actually outperform GANs with score-based generative models. With the resources available in academia, we do not have much capability to tune those kinds of score models well enough. So, especially considering the imbalance between the compute resources and engineering effort spent on GANs versus score-based models, I consider this a very surprising achievement.
So, of course, noise-conditional score models can be applied to other types of image generation tasks, including images of different objects and different resolutions. With some later developments in score matching techniques and neural network architectures, we can further improve the sample quality on CIFAR-10. And, of course, nowadays diffusion models have captured so much attention, and people are working on improving them from various perspectives, so it's not unexpected that people are achieving better and better quality using diffusion models or score-based models.
So in this work, we again used the CIFAR-10 data set. The left figure shows some existing training images from the CIFAR-10 data set. The right figure shows newly generated samples from this improved approach. Now you can see the newly generated samples look very realistic and very diverse. They are also different from existing training images--you cannot generate such images by simply memorizing the training data set.
And again, we compare with the best approaches in terms of FID scores and inception scores. Now we are able to outperform the best GAN approach in terms of both FID scores and inception scores. And this means score-based models can challenge the long-time dominance of GANs in image generation.
The same approach can be extended to generate images at very high resolution. Here are two samples generated from a score-based model, each with a resolution of 1,024 by 1,024. And here are more such samples from the same model at the same resolution. You can see the samples are very high quality--quite comparable to the best GAN approaches at that time.
So one remarkable property of score-based generative modeling is the capability to control the generation process in a principled way. Suppose we are given an unconditional score-based generative model that generates images of both dogs and cats, but we want to generate images of dogs only. How can we do that?
Let's suppose we are given a forward model. This forward model is basically an image classifier that gives us a label y from an image x. We want to specify a control signal, which is a target label y--here, we specify the target label to be dog--and then we hope to sample from the conditional distribution of x given y.
This conditional distribution will provide images of dogs only. It is called the inverse distribution, because we can view it as a probabilistic inversion of the forward model. So how can we obtain this inverse distribution? The standard approach is to leverage Bayes' rule.
So in Bayes' rule, we have access to the unconditional distribution px, and we are given the forward model, but we don't know the denominator. This denominator is exactly the normalizing constant of the inverse distribution. This means we can use score functions to again bypass this challenge in Bayes' rule, and we can derive a Bayes' rule for score functions very easily.
So the derivation is very simple. We just take the logarithm on both sides of Bayes' rule and then take the gradient. Again, we find that the only term that depends on the denominator goes away. The score function of the inverse distribution now becomes a simple summation of two terms, where the first term is the unconditional score function, which can be estimated by training an unconditional score model.
The second term is the gradient of the log forward model. In this particular application of conditional image generation, the forward model is a classifier, and the gradient is very easy to compute using backpropagation. In some other applications, the forward model might be manually specified, and the gradient is analytically tractable in most cases.
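The one-line derivation, for reference:

$$
\nabla_\mathbf{x} \log p(\mathbf{x} \mid \mathbf{y})
= \nabla_\mathbf{x} \log \frac{p(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x})}{p(\mathbf{y})}
= \underbrace{\nabla_\mathbf{x} \log p(\mathbf{x})}_{\text{unconditional score}} + \underbrace{\nabla_\mathbf{x} \log p(\mathbf{y} \mid \mathbf{x})}_{\text{forward model gradient}},
$$

since p(y) does not depend on x and its gradient vanishes.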
So the nice thing about this decomposition is that we can now plug different forward models into exactly the same score model. This means we only need to train an unconditional score model once; then we can repurpose this unconditional score model for various conditional generation applications just by switching the forward model.
This is one example. We can train one unconditional score model on CIFAR-10 images, then couple it with a classifier to generate class-conditional samples. Here, the forward model is a time-dependent classifier. It is time dependent because we consider a sequence of score functions, and this means we need to have a sequence of classifiers as well.
So the figure demonstrates the conditional generation results on CIFAR-10 classes, all from an unconditional score-based model. This approach has been further developed into classifier guidance and classifier-free guidance in subsequent works, and nowadays it is the standard technique used in all the text-to-image generation approaches, like DALL-E 2 or Imagen.
We can also use an unconditional score-based model for image inpainting. Here the control signal is the masked image--we only know some partial regions of the image--and we want to sample from the inverse distribution, which gives us completed images from a partially observed image. In this case, the forward model can be directly specified using our domain expertise, so there is no need to train a separate model for this task.
Similarly, we can apply unconditional score-based models to image colorization. Again, in this figure you can see that our control signal is the gray image, and we infer the colorized images from the gray images. The forward model can again be specified manually. For the image inpainting and image colorization tasks, we were actually using the same unconditional score model.
So this means that one score-based generative model can be used for both image inpainting and colorization, demonstrating the flexibility of this decomposition of the inverse score function. And again, we can apply this approach to larger-scale examples, such as colorization of images at resolution 1,024 by 1,024. The same approach can be applied to convert stroke paintings to realistic images, and here is one example.
Now, the stroke paintings become the control signal, and we use an unconditional score-based model trained on realistic images only. It has no idea what a stroke painting looks like. We develop the forward model by manual specification, using our domain expertise.
And this is another example: language-guided image generation. In this case, we start from an unconditional score-based model, and the control signal becomes a language description--tree house in the style of Studio Ghibli animation. The forward model is given by an image captioning neural network. So in this example, the score model has no knowledge of language at all, but it is capable of generating images that conform with the language description.
And this is another example: we can apply conditional score-based generation to medical image reconstruction. We consider the specific problem of computed tomography. In this case, we use X-rays to shoot through a human body, and those X-rays hit the detector to form observations called a sparse-view sinogram.
We can invert this physical procedure to obtain those cross-sectional medical images. Here the control signal is the sinogram, and the inverse distribution gives you the conditional distribution of medical images given the sinogram. We want to consider the problem of sparse-view computed tomography, meaning that we want to use as few X-rays as possible to reduce radiation. And this is a very suitable task for generative models, because by training an unconditional generative model on large-scale medical images, our generative model can actually learn what a typical medical image looks like. It can learn a very useful image prior, and this can subsequently be used to reduce the number of X-ray projections.
In this case, the forward model is given by physical simulation, so there is no need to train any separate conditional model to capture this forward model. And we have some results on real-world CT data sets.
We consider the task of using 23 projections, while in contrast typical traditional approaches require hundreds to thousands of projections. So this is the result of a traditional approach based on compressed sensing. So using only 23 projections, you can see the medical image is quite blurry. Quantitatively, we compared the performance of different algorithms using PSNR and SSIM.
Here are the results of two deep neural network-based approaches. So those methods are based on mapping projections directly to images. So they are kind of limited to a particular training setting. In this case, since they are trained on 23 projections, it is hard to adapt them to a different number of projections later.
And this is our fully unsupervised approach. Because we only train one unconditional score-based model, we do not train any particular model tied to these 23 projections. That means we can adapt the same model to different settings, like changing the number of projections, later after training. And this is the ground truth. So both qualitatively and quantitatively, we can see that this score-based medical image reconstruction approach can actually outperform other deep learning methods.
Because this generative approach is fully unsupervised, it is not bound to a particular experimental setting, while in contrast, existing deep learning methods tend to be limited to a specific experimental setting. Similar success has also been observed on accelerated magnetic resonance imaging as well.
And there have been numerous developments of score-based models, or diffusion models, which have obtained state-of-the-art performance on many other tasks. This list is already kind of outdated at this point, but I think it's worth mentioning anyway. We can generate high-quality images for much more complicated data sets like ImageNet. We can obtain outstanding performance on audio synthesis, text-to-speech generation, material design--this is actually a paper by MIT researchers--and also shape generation. We can also use score-based approaches for molecular conformation prediction and time series prediction. And there is a website, scorebasedgenerativemodeling.github.io, that includes a list of relevant works building on the technology of diffusion and trying to improve the methodology of score-based generative models.
So far, I have talked about how score-based generative modeling allows flexible model architectures and improved sample quality, with a controllable generation procedure. In the last part of the tutorial, I will talk about how we can compute probability values accurately, and how we can outperform existing likelihood-based generative models in terms of density estimation.
So in order to compute accurate probability values, we have to generalize the previous framework from using a finite number of noise levels to using an infinite number of noise levels. Let's get some intuition first by assuming our data distribution is a one-dimensional mixture of two Gaussians. We start with three noise levels, sigma 1 to sigma 3, and we use Gaussian noise of standard deviations from sigma 1 to sigma 3 to perturb our data distribution.
So if the noise level is large enough, we can convert any data distribution into a simple Gaussian distribution. We may use a one-dimensional heat map to represent each of those noisy data densities. With more noise levels, we have more heat maps. In the limit of infinitely many noise levels, we have a continuous, two-dimensional heat map that represents an infinite number of noisy data densities.
We use pt to represent each of those noisy data densities, where t is a continuous parameter ranging between 0 and capital T, and capital T is a fixed constant. When t is 0, p0 is the same as the data density, because we do not inject any Gaussian noise at this time instant. When t is capital T, pT contains a lot of Gaussian noise, and it will be close to a simple Gaussian distribution, which we denote as pi of x.
So suppose we are given this sequence of an infinite number of noise levels. How do we generate noisy data sets for training our score models? Well, we need to leverage the notion of stochastic processes. Starting from clean training data points, we progressively inject Gaussian noise to perturb our training data sets. After enough perturbation, we will eventually obtain very noisy images, which are close to samples from a simple Gaussian distribution.
So the trajectories of those noisy data sets form the trajectories of a stochastic process. A stochastic process is basically a collection of an infinite number of random variables; here those random variables are indexed by the continuous parameter t. For each random variable, there is a corresponding probability density, so one stochastic process corresponds to an infinite number of probability densities.
So how do we choose the right stochastic process such that it represents an infinite number of noisy data densities? Well, we use the tool of stochastic differential equations. A stochastic differential equation is very similar to an ordinary differential equation, but it has one additional stochastic term.
In the general form of SDEs, we have one deterministic drift term that controls the deterministic properties of the stochastic process, and one stochastic term, which involves dwt, the infinitesimal Gaussian noise. Without loss of generality, we consider the following toy formulation of the SDE, which does not have the deterministic drift term.
And it has a very simple stochastic term, sigma t. We can view sigma t as a continuous-time generalization of the noise levels sigma i, which we introduced before. So now we have an infinite number of noisy data sets. How do we generate samples, supposing we can estimate their score functions? Well, the sample generation process amounts to a time reversal of the perturbation process.
So by reversing the perturbation process, we can start from Gaussian noise, then progressively denoise to generate noise-free samples. How do we obtain this reverse stochastic process? Recall that our forward stochastic process is given by a stochastic differential equation. It turns out that any stochastic differential equation of that form can be reversed in analytical form, and this gives us the reverse stochastic differential equation.
The reverse stochastic differential equation depends on an infinitesimal noise term, dwt bar. This is very similar to dwt, but it makes sense only when time flows backwards. It also depends on the score function of the noisy data density, pt. So now, with the forward and backward SDEs, we can generalize the previous score-based generative modeling approach to use infinitely many noise levels.
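For reference, the forward and reverse SDEs just described (the reversal formula is the classical result of Anderson, 1982):

$$
\text{forward:}\quad \mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$
$$
\text{reverse:}\quad \mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\, \nabla_\mathbf{x} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}_t.
$$

In the toy formulation from before, the drift is f = 0 and the diffusion coefficient is g(t) = sigma(t).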
In this formulation, the key is to estimate the score function, which we accomplish by parameterizing a time-conditional score model. We hope to train this time-conditional model to approximate the score function of the noisy data density at each time instant t. And again, the training procedure depends on score matching.
We have one score matching loss for each time instant t. We have a positive weighting function, lambda(t), to balance the losses across noise levels. And we generalize the summation over noise levels to an expectation over t. So after training with this score matching objective, we obtain a good time-conditional score model that approximates the score functions of the noisy data densities.
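In symbols, the objective being described is a weighted expectation of per-time score matching losses, with s_theta the time-conditional score model:

```latex
\mathbb{E}_{t \sim \mathcal{U}(0, T)}\,\mathbb{E}_{\mathbf{x} \sim p_t}
\Big[\lambda(t)\,\big\|\, s_\theta(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \,\big\|_2^2\Big]
```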
And of course training involves minimizing this score matching objective, and we can do that efficiently using denoising score matching or sliced score matching. After training, we can plug our time-conditional score model into the reverse-time SDE, and then we can use any numerical SDE solver to solve this reverse-time SDE for sample generation.
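As a rough illustration, here is a minimal PyTorch-style sketch of a denoising score matching loss for the drift-free toy SDE above. The names `score_model` and `marginal_std` are hypothetical stand-ins, not the speaker's actual implementation; `marginal_std(t)` is assumed to return the standard deviation of the perturbation kernel p_t(x_t | x_0):

```python
import torch

def dsm_loss(score_model, x0, marginal_std):
    """Denoising score matching loss (sketch). For the toy SDE dx = sigma(t) dw,
    the perturbation kernel is p_t(x_t | x_0) = N(x_0, sigma_t^2 I), where
    sigma_t = marginal_std(t) = sqrt(integral of sigma(s)^2 from 0 to t)."""
    batch = x0.shape[0]
    # Sample a random time instant per example, avoiding t = 0.
    t = torch.rand(batch, device=x0.device) * (1.0 - 1e-5) + 1e-5
    std = marginal_std(t).view(batch, *([1] * (x0.dim() - 1)))
    z = torch.randn_like(x0)
    xt = x0 + std * z                  # sample x_t ~ p_t(x_t | x_0)
    score = score_model(xt, t)         # s_theta(x_t, t)
    # The kernel's score is -(x_t - x_0) / sigma_t^2 = -z / sigma_t. With the
    # weighting lambda(t) = sigma_t^2, the loss simplifies to the form below.
    return ((score * std + z) ** 2).sum(dim=tuple(range(1, x0.dim()))).mean()
```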
And one simple approach is the Euler-Maruyama method, which is a stochastic generalization of the classical Euler solver for ordinary differential equations. So with this continuous SDE approach, we not only improve empirical performance; we can also finally discuss how to compute accurate probability values. And this requires converting the stochastic differential equation to an ordinary differential equation.
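Here is a minimal Euler-Maruyama sketch for sampling under the drift-free toy SDE, assuming a trained time-conditional `score_model(x, t)` and a noise schedule `sigma(t)` (both hypothetical names). `prior_std` should approximate the standard deviation of p_T:

```python
import torch

def euler_maruyama_sample(score_model, sigma, prior_std, shape, T=1.0, n_steps=500):
    """Integrate the reverse-time SDE of dx = sigma(t) dw from t = T down to
    t = 0. Each step adds the score-driven drift plus fresh Gaussian noise."""
    dt = T / n_steps
    x = prior_std * torch.randn(shape)       # start from the Gaussian prior pi
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i * dt)
        g = sigma(i * dt)                    # diffusion coefficient g(t) = sigma(t)
        score = score_model(x, t)            # s_theta(x, t) ~ grad_x log p_t(x)
        # Reverse-time Euler-Maruyama update (time flows backwards):
        x = x + (g ** 2) * score * dt + g * (dt ** 0.5) * torch.randn_like(x)
    return x
```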
So with the right SDE, we can convert any data distribution to the Gaussian distribution. It turns out we can do the same thing using ordinary differential equations. The trajectories of the ODE and the SDE look quite different from each other, but they actually share the same set of marginal densities, p_t.
So for any SDE of this form, we show that the corresponding ordinary differential equation, named the probability flow ODE, has the form on the right side. Again, this ODE only relies on the score function. And since we have the time-conditional score model, we can plug it into the ODE, and then we can solve the ODE in various ways.
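Explicitly, the probability flow ODE corresponding to the general forward SDE is the following; note the factor of one half relative to the reverse-time SDE drift, and the absence of any noise term:

```latex
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = \mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})
```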
So after substituting this model into the ODE, we can solve the probability flow ODE backwards in time, starting from samples from the Gaussian distribution. These ODE trajectories gradually convert our Gaussian vectors into high-quality image samples. And we denote the resulting distribution of samples from this ODE solver as p_theta. So now, I will show you how to compute the exact value of p_theta.
So recall that we define this probability distribution in this way: we have the prior Gaussian distribution, we have the time-conditional score model, and by solving the ODE, we get the distribution p_theta, so this is actually a continuous normalizing flow. Why do we need to compute the exact likelihood? Because it has many useful applications, which we also mentioned briefly before. These include lossless compression, unsupervised anomaly detection, generative classification, density estimation, and so on.
And the formula for this likelihood is given by the following equation. This equation connects log p_theta at any data point x_0 with the log prior density, log pi, plus a one-dimensional integral that involves the trace of the Jacobian of the score model. The trace can be computed using an unbiased estimator, and the integral can be computed using an ODE solver. This integral is simple to evaluate, because it is a one-dimensional integral.
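This is the instantaneous change-of-variables formula: writing f_tilde_theta for the probability flow ODE drift with the score model plugged in, log p_theta(x_0) = log pi(x_T) + integral from 0 to T of tr(d f_tilde_theta(x_t, t) / d x_t) dt. The unbiased estimator mentioned here is typically the Skilling-Hutchinson trace estimator; below is a minimal PyTorch-style sketch, where `fn` is a hypothetical stand-in for the ODE drift at a fixed time:

```python
import torch

def hutchinson_trace(fn, x, n_probes=1):
    """Unbiased Skilling-Hutchinson estimate of tr(d fn(x) / d x), using
    tr(J) = E_eps[eps^T J eps] with Rademacher probes (entries +/- 1)."""
    estimate = 0.0
    for _ in range(n_probes):
        eps = (torch.randint(0, 2, x.shape) * 2 - 1).to(x.dtype)
        x_req = x.detach().requires_grad_(True)
        fx = fn(x_req)
        # eps^T J via one vector-Jacobian product, then a dot product with eps.
        vjp = torch.autograd.grad(fx, x_req, grad_outputs=eps)[0]
        estimate = estimate + (vjp * eps).sum()
    return estimate / n_probes
```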
So here are some results of computing the density with this ODE approach. Our results are highlighted with the green box, and we report results in negative log likelihood, which is better when lower. In this table, you can see that we achieve lower negative log likelihoods than almost all previous approaches, even though our methods are not explicitly trained for maximum likelihood.
I mentioned that the weighting function lambda(t) can be chosen in a theoretically principled way, and indeed we can do that specifically for maximum likelihood. There is an important connection between the KL divergence and the score matching objective, and this connection looks like the following. Here the second term, the KL divergence from p_T to pi, is approximately 0 if capital T is large enough.
And this term does not affect optimization, because it does not depend on the model parameters theta. The first term is exactly our score matching objective, but with a particular weighting function. This weighting function is sigma(t) squared, which we call the likelihood weighting because the KL divergence is directly related to maximum likelihood training.
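The connection can be sketched as the following inequality from the speaker's work on maximum likelihood training of score-based diffusion models, where J_SM denotes the score matching objective with the likelihood weighting lambda(t) = g(t)^2 (which equals sigma(t)^2 for the toy SDE):

```latex
D_{\mathrm{KL}}\!\left(p_0 \,\|\, p_\theta\right)
\;\le\; J_{\mathrm{SM}}\!\left(\theta;\; \lambda(t) = g(t)^2\right)
\;+\; D_{\mathrm{KL}}\!\left(p_T \,\|\, \pi\right)
```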
By minimizing the score matching loss with this particular likelihood weighting function, we are actually implicitly maximizing likelihood. Because this score matching loss is very efficient to optimize, this also gives us a way to do efficient maximum likelihood training of score-based diffusion models.
And with this approach, we can further improve the density values on several tested data sets. Again, we report results in negative log likelihood, which is better when lower. Here are some existing results: these are state-of-the-art likelihood-based generative models that achieve very good likelihood values on image data, CIFAR-10 and ImageNet.
And here is our result, which achieves a very good likelihood of 2.83 bits per dimension. This is second only to the state of the art. We also achieved a new state-of-the-art likelihood on ImageNet 32x32. This demonstrates that score-based generative models, or diffusion models, can not only challenge the dominance of GANs on image generation quality, but can also challenge the dominance of autoregressive models and VAEs in obtaining high likelihood values.
So aside from probability evaluation, there are also a few nice properties of the probability flow ODE. One example is that we can perform latent space manipulation, because this ODE actually connects score-based models to normalizing flows, or latent-space generative models. We can manipulate the latent space for applications such as image interpolation, which we show on the left side, or temperature scaling, which we show on the right side.
And there is one unique property associated with the probability flow ODE: it recovers an encoding that is uniquely identifiable. So what does that mean? For traditional latent-space generative models, such as VAEs, GANs, or normalizing flows, if you train two models with different architectures, or if you train them with different optimizers, then they will map the same image, the same data point x, to different latent codes z.
But in the case of the probability flow ODE, things are a bit different. Even if you have different model architectures or different optimizers, as long as the architectures and optimizers are good enough, they will map the same data point to the same latent code z. This is because the probability flow ODE itself does not depend on the model parameters at all. So once we have fixed the forward process, the probability flow ODE is also fixed, and the mapping between the data point x and the latent code z is also fixed.
So here are some experimental results. We trained two model architectures on the same CIFAR-10 data set, and we plot the first 100 dimensions of the latent code for a fixed CIFAR-10 image input. You can see that the latent codes are almost the same, even though we were using two different model architectures trained separately.
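To make the experiment concrete, here is a minimal sketch of how one could compute these latent codes, using a simple fixed-step Euler integration of the probability flow ODE for the drift-free toy SDE (the names `score_model` and `sigma` are hypothetical, as before):

```python
import torch

def ode_encode(score_model, x, sigma, T=1.0, n_steps=500):
    """Map a data point x to its latent code z by integrating the probability
    flow ODE dx/dt = -0.5 * sigma(t)^2 * s_theta(x, t) from t = 0 to t = T."""
    dt = T / n_steps
    for i in range(n_steps):
        t_val = (i + 0.5) * dt               # midpoint of the i-th step
        t = torch.full((x.shape[0],), t_val)
        x = x - 0.5 * sigma(t_val) ** 2 * score_model(x, t) * dt
    return x

# Two separately trained, sufficiently good models should produce nearly
# identical codes for the same input:
#   z_a = ode_encode(model_a, x, sigma)
#   z_b = ode_encode(model_b, x, sigma)   # expect z_a ~ z_b
```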
So as a summary, we have talked about score-based generative models, which have multiple desirable properties. First, they allow very flexible neural network models, because score functions bypass the normalizing constant, and we can train those flexible models from data with principled statistical methods.
And second, we can generate samples with very high quality that can even surpass GANs in many challenging image generation tasks. Moreover, we can control the generation process for important applications in conditional image generation, and also inverse problem solving.
And finally, we can compute probability values accurately, even though we only have models of the score function. And empirically, we can even obtain better density estimation performance than existing likelihood-based generative models. And with that, I'd like to thank you for your attention, and I'm happy to take any questions.
AUDIENCE: I was wondering whether you could comment a little bit more on the link between the forward models and the score-based function? Because for instance, you're saying that it could be like a physical simulation, or an image-to-text network. What are the requirements, or what could the forward model be? What are the limits to what that model can be?
YANG SONG: Yeah, so ideally the forward model has to be conditioned on the noise level t as well, just like the score function. If you have such a forward model, then you can generate from the conditional distribution exactly, just like class-conditional generation or text-to-image synthesis. In all those cases, we have a forward model conditioned on the noise level.
If we don't have such a noise-level-conditioned forward model, which is the case in medical image reconstruction, we need to develop approximation approaches. If the forward model is linear, then it is typically easy to approximate, and there is already a sequence of follow-up works that demonstrate different methods for approximating the forward model.
If the forward model is nonlinear, then approximation is a little bit more complicated, but it is not impossible; there are also relevant works on that. It is still an open research question whether there exists a principled way to convert a single forward model into a sequence of forward models conditioned on the noise level t. So I hope that answers your question.
AUDIENCE: Yeah, thanks.
AUDIENCE: OK, so there's a question in the chat. Do you have any intuition on why diffusion models outperform GANs? Are there cases in which diffusion models underperform GANs or VAEs?
YANG SONG: Yes, so this is still an open question, because people don't even have an understanding of why GANs have better sample quality than VAEs, autoregressive models, and so on. But I can offer some intuition and some guesses about why diffusion models generate samples with better quality. My first hypothesis is that diffusion models allow very flexible neural network architectures.
So architectures are critical to the performance of deep generative models. Even for GANs, the importance of architecture cannot be overstated. For example, GANs did not really have good performance until people invented the DCGAN architecture, and more recently StyleGAN architectures. And it is the same case for diffusion models.
Because score functions do not have to be normalized, you can use almost arbitrary neural network architectures, and this offers a lot of advantages. For example, nowadays all the diffusion models are based on U-Net-type architectures; if you change them to some other type of architecture, the performance usually drops a lot. But because diffusion models allow us to use flexible architectures, we have more degrees of freedom to try different types of architectures.
And some of those neural network architectures are extremely powerful for other computer vision tasks. For example, U-Nets were initially used in image segmentation and some other related tasks, and that's also the reason why we tried U-Nets in score-based generative models. So because of this extra degree of freedom, we can use very good deep neural network architectures, and this helps in terms of sample quality.
So another hypothesis on why diffusion models outperform GANs is based on the observation that diffusion models generate samples with better diversity. Why the better diversity? Because there is a strong connection between the score matching loss and maximum likelihood, and maximum likelihood is an objective that promotes covering the whole data distribution. So when we are training diffusion models, we are also implicitly maximizing likelihood to some degree, and that means we get better diversity.
So another reason why diffusion models outperform GANs is that we can decompose the conditional score function into two terms. As discussed earlier in the tutorial, the score function decomposes into the sum of the unconditional score function and the gradient of the forward model. We can actually use the gradient of the forward model to trade off diversity against image fidelity.
And this technique, called classifier guidance, or classifier-free guidance-- which was later developed further by people at OpenAI and Google-- has become absolutely important for text-to-image generation approaches. Without this kind of guidance, the samples have much worse quality. For GANs, it is much less natural to enforce this kind of guidance, so that's also one reason why diffusion models outperform GANs.
So in cases where diffusion models underperform GANs or VAEs, I think one example is sampling speed. Because diffusion models require an iterative sample generation procedure, we have to evaluate the score model multiple times, while in GANs or VAEs you only need to evaluate the image generator with one network evaluation, which is much faster. But there has already been a lot of progress on accelerating the sampling speed of diffusion models, so in the future I think this drawback will become less and less important.
Another thing is related to the latent space. For diffusion models, we can define the latent space through the probability flow ODE, but this latent space is less disentangled compared to the latent space in GANs or VAEs. This is one drawback, because people can use the semantic latent space of GANs or VAEs for a lot of creative image editing tasks, but in the case of diffusion models, this is much harder to do.
However, this is not a big issue from my perspective, because all of those text-to-image generation models are actually mostly based on diffusion models. And in those cases, when you have a text-to-image generative model, you can basically view the text description as the latent code. So now you convert the latent space into the space of language descriptions.
And I believe language descriptions are the best latent space for any type of image editing: they are interpretable, and they are very easy to manipulate. So in that regard, I think it is less harmful for diffusion models to not have a latent space similar to the one in GANs and VAEs. So yes, I know that was a long answer. I hope that addresses your question.
AUDIENCE: And you briefly touched upon this in your answer, but what is the main architecture type that is being used in current diffusion models? And how large are they compared to other state-of-the-art neural networks?
YANG SONG: Yeah, so right now we still use U-Net architectures. U-Nets are very similar to ResNets, but they have some additional skip connections connecting feature maps with the same resolution. And in terms of size, diffusion models can be very large--as large as the largest image generative models from other categories. But compared to language models, they are still relatively small.
So all the recent text-to-image generative models have parameter counts of around a few billion, but the biggest language models can have up to hundreds of billions of parameters. So yeah, I think there's still a big difference here. People are still on the way to scaling diffusion models to sizes as large as language models, and I don't think there is any fundamental difficulty in scaling them to larger and larger models.
AUDIENCE: Could you explain what you said about the latent code being less disentangled? Is there a reason why there would be a difference between diffusion models and-- I guess VAEs kind of implicitly have a term to encourage that semantic disentangling, but I don't understand why GANs would do that better, either.
YANG SONG: Yeah, so the disentanglement in GANs and VAEs is mostly an empirical observation. Somehow these disentangled codes are much easier to work with, and are very useful for improving sample quality in those kinds of generative models. For diffusion models, because the latent code is not related to the neural network architecture or the optimization procedure, the latent code is entirely defined by the forward perturbation process.
So you can view the latent code as analogous to some type of fixed parameterization. And this manually constructed latent code is, unfortunately, not as disentangled as the latent codes in GANs or VAEs, in that if you change one dimension of the latent code and observe how the image changes in a diffusion model, the image might change in some seemingly random way.
It's not like GANs or VAEs, where when you modify a particular latent code dimension, the image changes in a very predictable way. So I think it's just because of how we design those diffusion models that the latent code really doesn't have much semantics. There might be ways to design a forward process such that the latent code is fully disentangled, so that could be a good future research direction.
AUDIENCE: And then, we have another question from the chat, which is: can diffusion models be used for transfer learning, for instance fine-tuning? A recent self-supervised method, the masked autoencoder, learns image representations by recovering masked pixels. In terms of learning representations, which strategy do you think would be preferred: denoising images step by step, or recovering masked pixels at once?
YANG SONG: So from my understanding, there are a lot of similarities between masked self-supervised learning and diffusion models. The biggest difference might be that in diffusion models we have multiple noise levels, so the objective actually consists of a summation of denoising losses over all noise levels. There might also be some difference in terms of what kind of perturbation and denoising process we use.
In masked autoencoders, the perturbation is masking out part of the image, and the denoising process amounts to recovering the missing pixels in that particular region. In Gaussian diffusion models, we perturb images by adding Gaussian noise, and denoise by removing that Gaussian noise. You can also construct diffusion models using the perturbation process of masked autoencoders, which means you progressively remove parts of the image, and then progressively recover those missing pixels.
So you can use a similar procedure for self-supervised representation learning as well, and there have been some recent works on that. In that case, you obtain representations as a function of the noise level, and this continuous family of representations might be useful for some downstream tasks, just like masked autoencoders. So which strategy should be preferred? This is hard to say; I think only experiments can give you the definite answer. But both approaches are reasonable, and they are both worth trying.
In terms of transfer learning or fine-tuning, it really depends on what kind of diffusion model you want to use. There is some recent work on fine-tuning text-to-image diffusion models: you can fine-tune DALL-E 2 or Imagen for tasks like [INAUDIBLE] or textual inversion. I don't think that's much different from other generative models in terms of capabilities for transfer learning or fine-tuning.
Yeah, there is another question: do diffusion models also have overfitting and mode collapse, and if so, how do we deal with these problems? So that's a good question. Diffusion models can also overfit and can also exhibit mode collapse. In terms of overfitting, it is quite easy to detect in diffusion models, because you can just compare the loss curves for training and test: if the test loss curve is going up, that indicates clear overfitting.
So for some small image data sets, like CIFAR-10, it is very important to suppress overfitting in order to obtain samples of high quality, and we typically use [INAUDIBLE]. For larger data sets like ImageNet, because there are so many images, it becomes harder and harder to overfit. However, it is still possible for diffusion models to memorize certain images in the training data set.
This is called image regurgitation, I think. This was observed in DALL-E 2 training as well, and it is typically caused by repetition of certain images in the training data set. If there are a lot of similar images that are more or less the same as each other in the training data set, then it is very likely for the diffusion model to completely memorize such images.
And the way to mitigate this problem is to deduplicate the images, effectively removing all the near-duplicate images. Another way is to increase the size of the training data set: if you have a lot of diverse images, so that repeated images make up a smaller fraction of the data, that also helps the diffusion model avoid memorizing such images.
So diffusion models do not have mode collapse in quite the same sense. They have some kind of mode collapse, but not in the same way as GANs, because in GANs, mode collapse usually happens when the training goes poorly, while for diffusion models, mode collapse usually happens because of the data set. So yeah, that's a very interesting difference.
AUDIENCE: So in the score-based models, there's usually a sigma term, I think, that you introduced. And you said that one of the reasons why you want this is for, kind of-- or sorry, tractability reasons. Do you think that the weight sharing that this also introduces is important for the models to do well, or is it kind of just an artifact of reusing the same model?
YANG SONG: Yeah, weight sharing is also important. On some small-scale problems-- I have never tried this on large-scale problems-- it is possible to train a separate score network for each individual noise level, and it performs fine. But it is definitely not a scalable solution, so nobody has tried this on very large-scale problems.
But there is an indication that this might also work for large-scale problems, because some of the recent text-to-image diffusion models, like [INAUDIBLE], actually have multiple separate score models: some of the score models are responsible for high noise levels, some for medium noise levels, and some for small noise levels.
And this can, indeed, greatly improve performance. So from this perspective, I don't think weight sharing is the most important factor explaining the high sample quality. And if you have the computing resources to train separate models, then maybe that can work even better.