Josherich's Blog


Deep Learning Day: Generative Modeling

06 Mar 2025


Good to see you everyone. In this second talk, I will talk about generative modeling.

So first question: who has used a chatbot or anything like that? Maybe everyone, right? But who had heard the term generative model before you got to know ChatGPT? Well, still quite a lot, yeah. So in this talk, I will give a very high-level overview of what generative modeling is and how it’s impacting our life and our future research.

There is no doubt that we are in the so-called generative AI era. For the general public, perhaps this moment arrived when chatbots and many other chat systems were introduced, so you can communicate with the computer in natural language. You can talk, you can ask the computer about whatever problems you have, and it’s just like an agent that can help you solve many issues.

But chatbots are not the only generative AI models. Another very popular and very powerful tool is so-called text-to-image generation. For example, users can give the computer some text, usually called a prompt, and then the computer can generate an image.

So for example, in this case, the prompt would be a teddy bear teaching a course with “generative model” written on a blackboard. It is very likely that the computer algorithm has never seen this exact image before, but this is what it can produce from the given text prompt.

We can even go one step further; we can ask the computer algorithm to generate a video. This is what was generated by Sora one year ago. This is just really impressive. I believe perhaps no producers have ever filmed a video in this way, having so many paper planes flying over trees or forests. This is completely generated by the imagination of the computer algorithms.

Actually, generative models can be very powerful productivity tools in our daily life. For example, this is still kind of a chatbot, but it is a tool that can help us write code. It is a kind of AI assistant; it can read your code, it can try to fix the issues in your code, and you can directly communicate with the agent using natural language.

The agent will turn it into code. In some sense, just as the previous generations of programming languages were C++, Python, or Java, the next level of programming language may just be English, or human language in general. And it’s more than that; it’s more than just computer science. Actually, generative models have been used in many scientific problems.

This is an application called protein design and generation. The ultimate goal is to design or generate proteins that can solve problems we care about, let’s say some very dangerous diseases. This work is called RFdiffusion. It is actually part of the work of this year’s Nobel Prize winner, and there are many other scientific problems that can benefit from generative modeling.

This is a work from DeepMind a few years ago. They can use this model to predict the weather change over the next several hours or next several days. This would be a very difficult problem for classical algorithms because, as we may know, the change of weather or the change of climate is chaotic, so it is very difficult to predict it precisely.

We may not want to have the exact physical state of that moment; what we want is some qualitative behavior, let’s say whether it’s raining or whether it is windy at that moment. In this sense, generative models of deep learning could provide a very good solution to this problem.

Actually, before generative models emerged in our daily life, they had already been used and developed for decades. This is a tool called PatchMatch, or Content-Aware Fill, in software like Photoshop. It was a very impressive tool when I was a PhD student, and at that time, I worked on exactly the same problem.

The problem here is that you are given a photo, and the user can specify some area or some structure in the photo. The computer algorithm then tries to fix or edit the photo based on the user’s instructions. At that time, there was no deep learning, and to be exact, for this application and this algorithm, there was not even machine learning.

It is a very classical computer vision algorithm, but conceptually this is also a kind of generative modeling. The technique behind this generative model dates back another ten years. This is an algorithm called texture synthesis. The goal is that you are given an example texture and you want to extend the texture into a bigger image, or you want to paste the texture onto some 3D object that you care about.

The idea here is very simple: you try to synthesize the texture pixel by pixel, based on what has already been synthesized. In today’s terminology, this is actually an autoregressive model. So this is basically what I’m going to talk about. In this talk, I will very quickly go through the concept of what a generative model is, and then I will introduce some modern approaches to how we can build generative models using today’s deep neural networks.

Then I will also talk about how we can formulate real-world problems into generative modeling.

Okay, so first, what are generative models? It turns out this is a very difficult question because when generative models become more powerful, the scope of generative models keeps changing. Even though I will talk about some classical definitions of generative models, I just want to say perhaps today every single problem could be formulated as a kind of generative model.

So now let’s look at the applications or scenarios we have just introduced. What do these scenarios have in common? For example, in image generation, video generation, and text generation, there are multiple, or conceptually infinitely many, plausible predictions from just one input.

Let’s say if you want the computer to generate an image of a cat, you will tell the computer, “This is a cat; I want a cat.” Conceptually, there is an infinite number of possible cats. Another property of generative models is that some predictions are more plausible than others.

For example, if you want a cat, the computer may generate a lion or it can also generate a dog. Perhaps in common sense, a lion is more plausible than a dog in this scenario. Of course, a cat is more plausible than a lion.

Another very intriguing property of generative modeling is that your training data may not contain the exact solution. As we have seen, I believe the computer has never seen a teddy bear standing in front of a blackboard and teaching generative models. Similarly, the computer may not have seen these paper planes flying over a forest, so this is a kind of out-of-distribution generation.

The computer algorithms were trained on some data, but what they are generating could lie outside the distribution of the training data. Additionally, most of the time, the predictions of generative models are more complex and more informative; conceptually, they are higher dimensional than their input.

For example, in text-to-image generation, if you want a computer to generate a cat, which is just a very short word, the output image would have millions of pixels or maybe even more.

All these properties make generative models much more difficult than classical deep learning or recognition problems. In textbooks, there is a kind of formal definition of what a generative model is. Usually, when generative models are introduced, people compare them with so-called discriminative models.

So what is a discriminative model? Typically, as you have seen in Philip’s talk, if we care about image classification problems, you will be given an image, and then you are going to train a model, for example, a neural network. You want the neural network to output a label, let’s say a dog. In this very simple scenario, we can just imagine a generative model as reversing this process.

In this case, you would be given the label “dog,” and then you would like to train a model, again perhaps a neural network, and you want it to output the image, which is X. In this case, there would be many possible outputs, many possible dogs. The output is higher dimensional, and it could be a dog that you haven’t seen before.

Then conceptually, this is kind of a probabilistic visualization of what a discriminative model would be and what a generative model would be.

So on the left-hand side is a discriminative model. You have some green dots, which is one class, and some orange dots, which is another class. The goal of a discriminative model is just to find a boundary that can separate these two classes.

Conceptually, the task is to find the conditional probability distribution p(y | x): you are given X, such as an image, and you want to estimate the probability of Y, such as whether it is label zero or label one. As a comparison, in the context of a generative model, you would still be given the same data, the same dots, but the goal here is to estimate the probability distribution of these dots.

Let’s say in the case of this class that corresponds to y equals 1, you want to estimate what the conditional probability distribution of this class is. Conceptually, in a generative model, we care about probabilistic modeling, so that is the key problem generative models want to address, and that is also the key challenge.
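
To make this contrast concrete, here is a minimal NumPy sketch (an illustrative toy example of mine, not anything from the talk): two classes of 2D dots, a generative fit of p(x | y) with one Gaussian per class, the decision rule that falls out of it via Bayes’ rule, and, what only the generative view offers, new samples drawn from a class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes of 2-D points: "green" dots (y = 0) and "orange" dots (y = 1).
x0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
x1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(200, 2))

# Generative view: estimate p(x | y) for each class, here by fitting a
# Gaussian with maximum likelihood (sample mean and covariance).
def fit_gaussian(x):
    return x.mean(axis=0), np.cov(x, rowvar=False)

def log_density(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.inv(cov) @ d + logdet + 2 * np.log(2 * np.pi))

params = {0: fit_gaussian(x0), 1: fit_gaussian(x1)}

# With equal class priors, Bayes' rule turns the class-conditional densities
# into a classifier, i.e. the boundary a discriminative model would learn.
def predict(x):
    return int(log_density(x, *params[1]) > log_density(x, *params[0]))

print(predict(np.array([1.5, 0.3])))        # likely class 1

# What only the generative view gives us: new samples from p(x | y = 1).
mu1, cov1 = params[1]
print(rng.multivariate_normal(mu1, cov1, size=3))
```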

You may wonder why there is probability and why we care about probabilistic modeling. Actually, in many of the real-world problems, we can assume there are some underlying distributions, and you can also assume your data is actually generated by some very complicated world models.

For example, if we care about human face images, we can formulate the problem such that there would be some latent factors, such as the pose, the lighting, the scale, and actually the identity of the face. This would be the latent factors, and then you assume there will be some distributions about these latent factors.

These latent factors would be rendered by a world model; this is, for example, how a 3D object is projected onto a 2D grid of pixels. What you can actually observe is just the 2D grid, so that is the observation X. This 2D grid follows some very complicated distribution that cannot simply be described by a few simple underlying distributions.

This is why we care about probabilistic modeling, and a generative model is trying to uncover these underlying factors or to reverse this process.

Now, for example, let’s say we have some data; let’s say I have a dataset of dogs, which means I have many data points, and every single data point corresponds to one image of a dog. Conceptually, we imagine there is some underlying distribution that can model the distribution of all dogs.

It’s worth noticing that this is already part of your modeling because you can model the underlying world generator in many different ways. Even though we often assume there is this underlying distribution, this distribution is a part of the modeling.

Then the goal of generative modeling is to learn a neural network or perhaps another model to approximate this distribution. Let’s say this red distribution is what we can learn from a neural network, and the goal here is to minimize the distance between the data distribution and the distribution you estimate.

This is still a very difficult problem. There are many solutions to this problem, but conceptually, almost all existing generative models could be formulated in this way, and they are just trying to address the challenges posed by this problem.

Then conceptually, assuming your model has done a good job on this, you can start to sample from the distribution you estimated. If your model is doing good work, that means when you sample from this distribution, you would be doing something that is conceptually similar to sampling from the original data distribution.

In this case, hopefully, it will produce another dog that your algorithm hasn’t seen. It is also possible to do density estimation: your model is given another image, let’s say a cat, and then you can ask the model how likely this image is under the original data distribution.

In this case, if the original data distribution is about dogs and the input image is a cat, then hopefully it will produce a low estimation of the probability density. This is kind of how we can use probability modeling to formulate the generative modeling problem.
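
As a tiny worked version of this formulation (an assumed toy setup, not the talk’s), the sketch below fits a diagonal Gaussian to stand-in “dog” features by minimizing the negative log-likelihood, which amounts to minimizing a distance (a KL divergence) between the data distribution and the model; it then samples new points and evaluates the density of an out-of-distribution “cat” point.

```python
import torch

torch.manual_seed(0)
# Stand-in for a dataset of "dog" feature vectors in 2-D.
data = torch.randn(1000, 2) * 0.5 + torch.tensor([3.0, -1.0])

# Model p_theta(x): a diagonal Gaussian with a learnable mean and log-std.
mean = torch.zeros(2, requires_grad=True)
log_std = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mean, log_std], lr=0.05)

def log_prob(x):
    # Log-density of a diagonal Gaussian, summed over dimensions.
    var = (2 * log_std).exp()
    return (-0.5 * ((x - mean) ** 2 / var + 2 * log_std
                    + torch.log(torch.tensor(2 * torch.pi)))).sum(dim=-1)

for step in range(500):
    loss = -log_prob(data).mean()   # negative log-likelihood of the data
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    # Sampling: draw new "dogs" from the fitted distribution.
    print(mean + log_std.exp() * torch.randn(5, 2))
    # Density estimation: a "cat" far from the dog data gets a low log-density.
    print(log_prob(torch.tensor([[-4.0, 4.0]])))
```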

As you can imagine, the most powerful tool today for us to address generative modeling is deep learning. Philip has given a very excellent and very quick introduction to what deep learning is, so conceptually, in a nutshell, deep learning is representation learning.

What Philip has introduced is the process of learning to represent data instances: you are given the data, let’s say images, and then you want to map the images to labels. This is one way of using deep neural networks for representation learning.

Then in the case of generative modeling, there is another way of using deep learning, but still for the goal of representation learning. We don’t just want to learn the representation of a single data instance; we want to learn the representation of a probability distribution. That is a more complicated problem, and conceptually it can be viewed as learning the mapping the other way: what used to be the output, let’s say the label of a cat or the label of a dog, becomes the starting point, and you want to map it back to the pixel space.

Then as you can imagine, deep learning or deep networks is a very powerful tool for generative modeling. Conceptually, when you use this tool for this problem, the models are actually simultaneously playing these two roles: first learning to represent data instances and second learning to represent probability distributions.

Then this is conceptually what a model would look like. Your model will be given a very simple distribution, for example, it could be a Gaussian distribution or it could be a uniform distribution; it doesn’t matter. So in the case of an image, this would look like just a completely noisy image. Then the goal is to learn a neural network such that it can map a noisy image to just another image in the output space.

Then conceptually, if your model can do a good job, hopefully, the output would be a visually reasonable image, such as a dog in this case. Then you can just keep sampling noise from the input distribution, and hopefully, the neural network will turn everything into some meaningful images in the output. Then conceptually, when you do this, actually your neural network is trying to map a simple distribution, let’s say a Gaussian distribution, to another distribution, which conceptually is to approximate the underlying data distribution.
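
As a concrete picture of this mapping, here is a minimal sketch (an illustrative setup of mine; the latent size, image size, and architecture are arbitrary assumptions): an untrained network that takes Gaussian noise and outputs something image-shaped. Training, by any of the methods discussed next, is what makes this output distribution approximate the data distribution.

```python
import torch
import torch.nn as nn

latent_dim = 64            # dimensionality of the simple input distribution
image_shape = (1, 28, 28)  # e.g. small grayscale images (an assumption)

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1 * 28 * 28), nn.Tanh(),  # pixel values in [-1, 1]
)

z = torch.randn(16, latent_dim)              # 16 samples of pure noise
images = generator(z).view(16, *image_shape)
print(images.shape)                          # torch.Size([16, 1, 28, 28])
```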

Then in this sense, a generative model is a mapping between distributions. It is not just a mapping between a pair of data points and a label; it goes from one distribution to another distribution. The next slide would be a little bit technical, perhaps, and I can go very quickly. So these are some of the fundamental elements of a generative model.

First of all, you may need to formulate a real-world problem as a probabilistic model. This is one of the most critical parts for us to design an algorithm. After you can do that, you need some representations, and today usually this is a neural network. You want to represent the data and their distribution, and then you need to introduce some objective functions to measure the difference between two distributions. Then you need an optimizer that can solve the very difficult optimization problem, and then you also need an inference algorithm, which is conceptually a sampler that can sample from the underlying distribution.

So today, many of the mathematical or theoretical research would be about one or many elements in this list. I’m not going to delve into the details, but next, I’m going to give a very high-level and very quick overview of what are some of the modern approaches and popular approaches to generative models, and I’m also going to explain why a generative model is a hard problem.

This is the figure you have just seen. The problem here is that your model is given a noise input, and you want it to map that noise to an image. So why is this hard? Recall that in Philip’s talk he discussed the problem of supervised learning. In that case, you are given an image and also a label for that image, so you have a pair of input and output. That is a very well-formulated supervised learning problem, and that problem is easy for modern neural networks to solve.

But in the case of generative modeling, conceptually it is an unsupervised learning problem. You will be given an image, but then conceptually you have no idea what the input noise corresponds to that image. This correspondence or this pairing problem is also what your underlying algorithm should try to figure out.

So then in this sense, conceptually it is not just about mapping pairs of images or pairs of data; it is about mapping two distributions. You want to map a simple Gaussian distribution to a very complicated data distribution, and this is why generative modeling is hard. There are many effective and very smart algorithms to address this problem.

I will start from some very fundamental and elegant algorithms, and then I will move on to some of the state-of-the-art algorithms of today. So first, I will talk about variational autoencoders, or VAEs. In generative modeling, as we have introduced, you want to map an input distribution to an output distribution. We can formulate this as an autoencoding problem: if you have the data distribution, you can train another neural network, an encoder, to map the data distribution to a distribution you like, let’s say a Gaussian distribution.

After you have this simple distribution, you can learn a generator to transform it back. Then, conceptually, you compute the distance between the input and the output. This is a very classical idea of autoencoding in deep learning, but in classical algorithms it is usually applied to individual data instances; you apply it to every single image.

In the case of variational autoencoders, the concept of autoencoding is applied to the distribution itself. You can imagine this distribution as one object, one entity that you want to process: you transform this object into a simpler object, and then you transform it back. This is the autoencoding idea, now applied to a whole distribution rather than a single image.
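
Here is a minimal VAE sketch in PyTorch (an assumed toy setup for illustration, not the talk’s code): an encoder maps each input to a Gaussian over latent variables, the decoder maps a sampled latent back to data space, and the loss combines a reconstruction term with a KL term that pulls the latent distribution toward a standard Gaussian.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, data_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    rec = F.mse_loss(x_rec, x, reduction="sum")                # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()  # KL to N(0, I)
    return rec + kl

model = TinyVAE()
x = torch.rand(32, 784)                  # stand-in for a batch of images
x_rec, mu, logvar = model(x)
print(vae_loss(x, x_rec, mu, logvar))
```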

Another very popular solution, which marked the beginning of the modern wave of research in generative modeling about 10 years ago, is the generative adversarial network, or GAN for short. Again, conceptually, it also just wants to learn a generator that goes from a simple distribution to the data distribution.

But instead of introducing another network before the simple distribution, as VAEs do, a GAN introduces an extra neural network, called the discriminator, after the estimated distribution. The goal of the discriminator is to tell whether a sample comes from the predicted distribution or from the real distribution. If the discriminator cannot tell which distribution a sample came from, then the two distributions must be very similar.
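
A minimal GAN training loop might look like the following sketch (an illustrative toy setup on 2D data, not the talk’s code): the discriminator is trained to score real samples high and generated samples low, and the generator is trained to fool it.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(128, data_dim) + 3.0   # stand-in for real data

for step in range(200):
    # 1) Discriminator step: real samples should score 1, fakes should score 0.
    fake = G(torch.randn(128, latent_dim)).detach()
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake), torch.zeros(128, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step: fool the discriminator into scoring fakes as real.
    fake = G(torch.randn(128, latent_dim))
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(5, latent_dim)))      # samples should drift toward "real"
```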

GANs were the most popular and most powerful generative models over the last decade, until some even more powerful tools came out over the last three or four years. Another very powerful generative modeling tool is the autoregressive model. In the context of natural language processing, this is usually known as next-token prediction, but the idea of autoregression is more than just predicting the next token.

Basically, if we care about probabilities that involve many elements or many variables, then following the very basic principles of probability theory, we can always decompose this joint probability into a chain of many conditional probabilities. So the key idea of autoregressive modeling is to model every single conditional probability individually, rather than modeling the entire joint probability.

If you do this decomposition following the order of the sequence, let’s say you predict X1 first, then predict X2 conditioned on X1, and so on, then you can turn your problem into next-token prediction. The idea of autoregressive models is to break a very complicated problem into a bunch of simpler and smaller problems.

For example, in this case, in the first output, you will estimate a very simple and lower-dimensional distribution; in this illustration, for example, it would be a one-dimensional distribution. Then in the second output, it will predict the next dimension of the variable, which will be a two-dimensional distribution, and so forth. It will be difficult to visualize a higher-dimensional distribution, but conceptually when you do this, it would be a distribution in a high-dimensional space.
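
Here is a tiny worked example of that factorization (a toy case with three binary variables, my own illustration): the joint p(x1, x2, x3) is decomposed into p(x1) p(x2 | x1) p(x3 | x1, x2), and sampling one variable at a time from these conditionals is exactly next-token-style sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()                                  # an arbitrary joint distribution

p_x1 = joint.sum(axis=(1, 2))                         # p(x1)
p_x2_given_x1 = joint.sum(axis=2) / p_x1[:, None]     # p(x2 | x1)
p_x3_given_x1x2 = joint / joint.sum(axis=2, keepdims=True)  # p(x3 | x1, x2)

# Reassemble the joint from the chain of conditionals; it matches exactly.
rebuilt = (p_x1[:, None, None]
           * p_x2_given_x1[:, :, None]
           * p_x3_given_x1x2)
print(np.allclose(rebuilt, joint))                    # True

# "Next-token" sampling: draw x1, then x2 given x1, then x3 given x1 and x2.
x1 = rng.choice(2, p=p_x1)
x2 = rng.choice(2, p=p_x2_given_x1[x1])
x3 = rng.choice(2, p=p_x3_given_x1x2[x1, x2])
print(x1, x2, x3)
```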

This is the key idea of autoregressive modeling. Over the last three or four years, a very powerful family of models has emerged, especially in the context of image generation and computer vision. These models were motivated by thermodynamics in physics. The idea is that you can formulate the problem as repeatedly corrupting the clean data, the input image, by adding Gaussian noise, progressively turning it into a fully noisy image.

Then the goal of learning is to reverse this process, and if you can do that, you can progressively go from a noisy input back to a clean image. This idea is called diffusion, or often denoising diffusion. Conceptually, using the terminology of probability distributions, this means you have an input data distribution, which is over clean images, and then you repeatedly add noise on top of it.

Conceptually, this is just like running a convolution kernel over the distribution space, and by doing it many times, you will ultimately turn the data distribution into a Gaussian distribution. Then your model is just trying to learn to reverse this process.
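
The sketch below illustrates this in a DDPM-style form (an assumed toy setup, not the talk’s code): the forward process has a closed form that mixes clean data with Gaussian noise according to a schedule, and a small network is trained to predict the added noise from the noisy sample and the timestep.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (an assumption)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal retention

def add_noise(x0, t):
    # Closed form of the forward process:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

# A toy "denoiser" that sees the noisy sample plus the timestep and predicts eps.
denoiser = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x0 = torch.randn(256, 2) * 0.3 + 2.0             # stand-in for clean data
for step in range(200):
    t = torch.randint(0, T, (256,))
    xt, eps = add_noise(x0, t)
    pred = denoiser(torch.cat([xt, t.float().view(-1, 1) / T], dim=1))
    loss = ((pred - eps) ** 2).mean()             # predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```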

This is what a diffusion model may look like at inference time. It will start from a very simple distribution, say a Gaussian, and then it will progressively reverse the process and go back to the data distribution. Actually, this visualization is very similar to many of the concepts that are popular in graphics.

For example, you can imagine the starting point of this process as some canonical shape, let’s say a sphere or a cylinder. Then you want to progressively morph or warp this object into another shape that you like; this could be, for example, a mountain or a bunny. You want to progressively warp the input sphere into a bunny, and this is a very well-studied problem.

So in the case of distribution modeling, we can imagine this distribution literally as a geometric entity, and then you can formulate a process to do this transformation. What I have just described is an emerging idea which is called flow matching. You want to flow from a very simple object or very simple shape, such as a sphere, to another more complicated shape, such as a bunny.

If you have this algorithm, and if you formulate your underlying shape as a probability distribution, you can use this idea to do probabilistic modeling, that is, generative modeling. Conceptually, this is just another visualization of the same thing: you start from some simple distribution, let’s say a Gaussian, and the target is the data distribution that you want to model.

The goal here is to progressively change your input distribution to the output distribution. There are many excellent solutions in computer graphics to this problem. One idea here is to learn a flow field. You can imagine if this is literally a 3D object, then you will have some 3D vertices or 3D surfaces. You want to gradually move these 3D surfaces from the sphere to some 3D surfaces in your bunny.

Then if you do that, there will be a flow field that can be constructed via this process. There will be a lot of mathematical details behind flow matching, and of course, I’m not going to delve into it, but this is kind of the high-level idea of the latest progress in generative modeling that is flow matching.
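
Here is a minimal flow-matching-style sketch (an illustrative toy setup, not the talk’s code): pair a noise sample with a data sample, move along the straight line between them, and regress a velocity network onto the constant velocity of that line; at inference time, integrating the learned velocity field flows noise toward the data.

```python
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-3)

data = torch.randn(512, 2) * 0.3 + torch.tensor([2.0, -1.0])  # stand-in "bunny"

for step in range(300):
    x1 = data[torch.randint(0, 512, (256,))]
    x0 = torch.randn(256, 2)                     # the simple "sphere": noise
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                   # point on the straight path
    target = x1 - x0                             # velocity along that path
    pred = velocity(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: start from noise and take Euler steps along the learned flow.
x = torch.randn(5, 2)
with torch.no_grad():
    for i in range(50):
        t = torch.full((5, 1), i / 50)
        x = x + velocity(torch.cat([x, t], dim=1)) / 50
print(x)                                          # should land near the data
```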

Conceptually, these are some of the popular approaches today to generative models. I haven’t covered any of the mathematical details, but it is kind of fun to walk through all these methods. The point I’m going to make is that in all of these generative models, there would be some deep neural networks as the building blocks. Conceptually, this is just like in deep neural networks where there would be some layers as the building blocks.

The layers are those modules that Philip has just introduced. They could be a linear layer, they could be a ReLU, they could be a normalization layer, or a softmax layer. Neural networks are entities that are built by so-called layers, and today these generative models are some entities that are built by deep neural networks.

In this sense, generative models are the next level of abstraction. Okay, next I will talk about how we can use these mathematical or theoretical models of generative modeling to solve real-world problems. As we have introduced, the key problem in generative modeling is this conditional distribution p(x | y): you are given the condition y, and you want to model the distribution of your data x. But in reality, what is y and what is x? In common terminology, Y is called the condition.

Let’s say you want to generate a cat. It could also be some constraints; let’s say there are certain types of output images you don’t want to generate. It could be class labels or text labels, or maybe some other kinds of labels. It could also be attributes; let’s say you want to generate a big object or a small object. So in most cases, the condition Y is more abstract and less informative.

As a comparison, the output X is usually called the data or it would be the observations or measurements of the samples that you can see in real-world problems. So in the case of image generation, usually X is just the image. So usually X would be more concrete than the condition Y, and it would be more informative. It would be higher dimensional.

Now let’s go through the applications we have just introduced and discuss what X and Y would be. In the context of a natural language conversation or chatbot, the condition Y would be the so-called prompt given by the user, and the output X would be the response of the chatbot. Usually the output is higher dimensional, and there are many plausible outputs that correspond to the same prompt.

Similarly, in the context of text to image generation or text to video generation, the condition would be the text prompt. It could be a sentence, it could be a class label, it could be some attribute, and the output would be the generated visual content such as an image or video. The output is higher dimensional; it is more complicated.

So these are kind of typical use cases, and of course, this is also the case for 3D generation. Here the condition would still be a text prompt, and the output would be the generated 3D structure. In this computer vision or graphics application, the 3D structure would include the shapes, the textures, or maybe even the illumination of the underlying object.

Then we can move one step further. We can generalize the scenario to the problem of, let’s say, protein generation. In this case, the input condition could still be some prompt; it could still be some text. Let’s say you can try to tell the computer that you want to generate a protein that can cure cancer. That is valid, but the problem here is that there’s no way for the computer to understand what it means to cure cancer or what it can do to cure cancer.

So in this case, there would be a lot of research into how you can represent the underlying condition that you care about. You want your output protein to have some properties, and you hope that those properties would be related to curing cancer or curing some special diseases. In this case, the condition would be more abstract. It could also be higher dimensional because it is the abstraction of some behaviors, let’s say, curing cancer.

The output would be another representation that is also higher dimensional. Let’s say the protein structure in 3D; it would just be like another kind of 3D object.

Now let’s talk about some other scenarios that people typically won’t think of as generative models, including a very classical case that people regard as discriminative modeling. We have already introduced the typical case of image generation: you are given a class label, and your algorithm is asked to generate the output image. This is the so-called class-conditional case, which means your Y is one specific label.

But then there is another scenario where you can imagine you won’t be given any conditions. That means you want to generate the data output that will follow the entire distribution of the data. In this case, you can imagine the underlying condition as an implicit condition, which means you want the image to follow the distribution of your underlying data sets.

If your model can do a good job in this regard, it will try to distinguish the distribution of this dataset from the distribution of any other data set. This is the idea that we can apply generative modeling to the scenario of discriminative modeling. So here is a very typical case of supervised learning or discriminative learning that is image classification.

You will be given an image, and then you want to estimate the label of that image. If we want to formulate this as a generative model, then in this case, actually, Y, which was a label in almost all our previous examples, would be the image. In this case, the image is your condition, and then the class label X would be the predicted output.

You want to model the probability distribution of your output. Because this problem is so simple, people usually don’t think of it as a generative model, but it can be one. So then, what is the point here? If you can model image classification as a generative model, then you can extend the scenario from closed-vocabulary classification, where you are given a predefined set of class labels, to open-vocabulary recognition.

That means you won’t be given a predefined set of class labels, so there could be many plausible answers to the same image. In this case, you will still be given one image, but then your output is no longer one unique correct answer; there could be many different possible answers that can describe this image. For example, in this case, these are all reasonable answers to say this is a bird or a flamingo, or that this is a red color or perhaps orange color.

As you can see, even for this very classical image classification problem, if we try to formulate it as a generative model, it could also open up new opportunities and will enable new applications that are non-typical for classical discriminative models.

We can even move one step forward. You can imagine the input condition Y is still an image, and you want the output not just to be a label or a short description; it can be an entire sentence or it can even be some paragraphs that can describe this image. So actually, this is also a classical problem in computer vision that is known as image captioning. You want the computer to write a caption about this image.

With this context, we can even move one step forward. So then this image could just be part of the input in your natural language conversation with your chatbot. In this scenario, the condition would be the input image and some other text that is a prompt given by the user. The output would be the response of the chatbot based on this image and the text prompt.

Let’s say in this scenario, given this image, the user could ask what is unusual about this image, and the chatbot can try to come up with an answer regarding this question; it might say that it is unusual to have an ironing board attached to the roof of a moving taxi.

In many other real-world problems such as robotics, we can also formulate the problem of policy learning as a generative model. For example, in robotics control, there could be many plausible trajectories, many plausible policies that can fulfill the same task. In this case, for example, you want the robot to move this T-shaped object into the target location.

The robot could either move from the right-hand side, or it could move from the left-hand side. So both trajectories are plausible; there is no single unique answer. This is also where we can use generative models to model this policy learning problem.

So in general, this is what we have just seen. A generative model conceptually just cares about this conditional distribution. In my opinion, there are no constraints or requirements about what can be X or what can be Y. Conceptually, they can be anything.

This means we can use generative models to solve many kinds of real-world problems. We can just try to formulate all these real-world problems as kind of conditional distribution problems, and then we can try to apply the latest advances in generative models as a tool for this problem. This is also partially why generative models are becoming more and more common today for people to solve real-world problems.

This will be the last slide of this talk, but I just want to give some high-level ideas and convey some of the most important messages in my mind. As we have seen, generative models have deep neural networks as their building blocks, just as deep neural networks have layers as their building blocks.

Ten years ago, the research in deep learning was mainly about these layers such as convolutions, activation functions, normalizations, self-attention layers, etc. So that was the research about one decade ago.

Then we have generative models, which become the next level of abstraction. All previous research on deep neural networks still applies, but there is a new level of research built around generative models. Moving forward, as people use these generative models to do more amazing things, such as large language models, reasoning, and agentic machine learning, which will be covered in the remaining talks, these generative models will become yet another level of building blocks.

As we can see, and as you have seen from Philip’s introduction slides, we are building a stack of many different levels of models. These are different levels of abstractions. The abstractions could be layers, deep neural networks, generative models, and they could be reasoning agents. This is just like how computer science has progressed over the last century.

People are building different levels of abstraction, and then we can unlock different levels of new opportunities. In this sense, I would say that generative models represent the next level of deep learning, and they are also the next level of abstraction and building blocks.

With that, that is the end of my talk.

[Applause]

[Audience question] Mapping distributions is a much harder problem; does that mean generative models are inferior at the simple task of supervised learning?

I think there is no certain answer yet because, in some sense, it is not yet a common understanding that you can address a discriminative problem using generative models. If it is a very easy, let’s say closed-vocabulary, classification task and you clearly know that you have 10 or 1,000 possible labels, usually a simple solution is sufficient. But even in the case of so-called open-vocabulary recognition, let’s say you are given one image and you still want one label, say a hashtag, you can still have a vocabulary, but that vocabulary is just the English vocabulary, the human vocabulary; it could be very long.

Even in that case, I think a generative model is a good idea. If you want to move one step further, if you want to have a sentence as a description, or if you want to start having some conversations based on this image, then a generative model is perhaps the only solution that you should use.

[Audience question] Great presentation; two questions.

Is it possible to go the other way? I think it depends on what the method is. I think recently the answer is yes. The flow matching algorithm can enable us to do that. As you can imagine, in my analogy of flow matching as moving from a sphere to a rabbit, conceptually it doesn’t need to be a sphere; it could be a cat. You can move from a cat to a rabbit.

In this scenario, that means you can transform from one arbitrary distribution to another arbitrary distribution, and their positions are just symmetric, so conceptually you can swap them.

The second question is about the robotics scenario: is there a clear objective function? That’s a good question. I think it is more about the distinction between reinforcement learning and imitation learning, or basically just supervised learning. Conceptually, we can always formulate the problem as reinforcement learning, which means you just want to reach the goal.

Let’s say the goal is to move the T-shaped object to the target location: if you can do that, you receive a reward; if you can’t, your reward is zero. That’s possible. Then imitation learning, or supervised learning, is the other way: you give some examples of possible trajectories, and the model tries to imitate that behavior.

Yeah, I think I can take questions offline because I’m over time, and let’s move on to the next talk.

[Applause]