π0: A Foundation Model for Robotics with Sergey Levine - 719
In robotics, every time you want to tackle a new application, you have to start an entire company around it, or an entire research lab.
So each robotic application is just an enormous amount of work. If we can have these general-purpose models that can serve as the foundation for a huge range of applications, that would actually allow us to get robots to the next level. It would get us the kind of generalist robots that we see in science fiction, basically.
All right, everyone, welcome to another episode of the TWIML AI podcast. I'm your host, Sam Charrington. Today, I'm joined by Sergey Levine. Sergey is an associate professor at UC Berkeley and co-founder of Physical Intelligence. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show.
Sergey, welcome back to the podcast. It’s been a little bit.
Yeah, it’s been a while. Thank you for having me. I’m looking forward to digging into our conversation.
We will be talking about all things Physical Intelligence, and in particular, we'll be digging into the π0 model that you launched last fall and recently open-sourced, as well as some other work that you've been doing at the company. I think the last time you were on the show was in January of 2023, a couple of years ago, talking about reinforcement learning and robotics as part of our AI trend series.
What have you been up to since then? I guess a lot of that is going to be founding a company.
Yeah, it’s been a bit of an adventure. So, you know, I probably wouldn’t have imagined myself founding a company, honestly, a few years back. But as I and a number of my colleagues saw how things were advancing in 2023, we decided that the pieces were really falling into place to make a very serious and larger scale push towards true robotic foundation models.
We felt like to really get robotic learning to advance to the next level, it needed to be something bigger than what we could do individually in a more academic context. So that’s why we really pulled the trigger on this thing. Yeah, I’m excited to tell you more about it.
So at Physical Intelligence, we are very committed to actually building general-purpose robotic foundation models. There’s a lot that you could do with robots, like automate warehouses, have self-driving cars, all that stuff. But we’re much more interested in what it would take to get things to the next level where you could actually have general-purpose models, in the same way that ChatGPT is general-purpose.
So it used to be that we would have fairly specialized systems for natural language processing or fairly specialized systems for computer vision. But with general-purpose foundation models, you could have a single system that could be adapted to a wide range of different behaviors, and it’s tremendously powerful.
Especially in robotics, because every time you want to tackle a new robotics application, you have to start an entire company around it or an entire research lab. If we can have these general-purpose models that can serve as the foundation for a huge range of applications, that would actually allow us to get robots to the next level. Now is the time to really ramp it up and make a very serious push for it.
And what are some of those key pieces in robotic learning?
If we step back and look at the challenges that have made robotic learning hard to use in the real world: the idea that a robot can just go into some environment and figure out on its own how to solve a problem holds enormous promise. But in practice, that dream has never quite panned out, because a few things have been missing.
One big one is that machine learning works best when there's a lot of data, and in robotics that has always been a very deep tension: you have to create that data yourself, so everybody who wants to get their robot to do something is put in the position of having to create big datasets.
In practice, that's always been kind of a showstopper in robotic learning. The community has gotten much better at creating transferable, general-purpose models, so now we actually have some hope that we can create models that can control a wide variety of different robots, maybe even prompted zero-shot to perform new tasks. That drastically lowers the barrier to entry.
Other challenges that have been really big in robotic learning include generalization and common sense. For a robot, unlike a chatbot, when the robot goes into some physical environment, it has to deal with whatever is going to happen there. If it’s driving along the floor and there’s a sign that says, “Slippery floor, do not enter,” it has to react intelligently.
This is where things like vision language models come into play, where previously we wouldn’t have had any idea how to handle this. The third really big one has to do with robustness, reliability, performance, and this is where advances in reinforcement learning are really making it much more feasible to get highly precise, highly performant, and highly reliable systems.
I was going to ask specifically about reinforcement learning. You've been on the podcast three times previously, and each of those times, RL has been a big focus. Looking through the π0 paper, RL isn't mentioned a lot. In what ways does RL come into play regarding this idea of robotic foundation models?
You can think of it as really kind of a first step towards robotic foundation models. We’re pretty proud of the work; we think we have a really great demo. To use an analogy, we’ve seen in the world of language models that there are a number of steps we have to advance through.
A lot of those pieces have to fall into place, and you can think of π0 as one of those early steps. What we're doing at Physical Intelligence is developing multiple ingredients, many of them in parallel, and they will connect more and more in the future. In the same way that reinforcement learning came into the world of language models later on, once the basic foundation was already pretty solid, I expect reinforcement learning will make a really big difference for our foundation models at Physical Intelligence.
Let's dig into π0. Talk a little bit about π0 from a model perspective.
The first thing I would say there is that it's very important, when we talk about foundation models, to remember that a foundation model is not just about the model itself.
A lot of the most impactful research on foundation models does nothing to change the architecture whatsoever and actually refines the recipe. So there’s a very particular challenge that you have to address if you want to adapt vision language models to robotic control.
Robotic behaviors, especially dexterous and sophisticated behaviors, require precisely representing spatial movements, and that is not something that is very easy to express in text. Fortunately, we actually have a very good methodology for representing continuous spatial information.
The best protein design systems in machine learning today are based on diffusion models that will actually generate positions of atoms in a molecule. So we use the same kind of techniques, and the challenge was to figure out how to connect these to vision language models.
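To make that concrete: π0 pairs a vision-language backbone with a continuous "action expert" trained with flow matching, a close relative of diffusion. The sketch below is a minimal, illustrative version of such an action head; the module names, sizes, and the pooled-conditioning design are assumptions, not Physical Intelligence's actual code.

```python
# A minimal sketch (not Physical Intelligence's code) of a flow-matching
# action expert conditioned on a VLM embedding. Names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    def __init__(self, cond_dim=2048, action_dim=32, horizon=50, hidden=1024):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(cond_dim + horizon * action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, cond, noisy_actions, t):
        # cond: (B, cond_dim) pooled VLM features; noisy_actions: (B, H*A); t: (B, 1)
        return self.net(torch.cat([cond, noisy_actions, t], dim=-1))

def flow_matching_loss(model, cond, actions):
    # Linear interpolation path between Gaussian noise and the action chunk.
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1)
    x_t = (1 - t) * noise + t * actions   # point along the noise-to-data path
    target_velocity = actions - noise     # constant velocity of that path
    return ((model(cond, x_t, t) - target_velocity) ** 2).mean()

@torch.no_grad()
def sample_actions(model, cond, steps=10):
    # Integrate the learned velocity field from noise to an action chunk.
    x = torch.randn(cond.shape[0], model.horizon * model.action_dim)
    for i in range(steps):
        t = torch.full((cond.shape[0], 1), i / steps)
        x = x + model(cond, x, t) / steps  # Euler step
    return x.view(-1, model.horizon, model.action_dim)
```

The key point is that the head generates a whole chunk of continuous actions at once, rather than emitting them as text tokens.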
In π0, actions are represented as continuous vectors. Beyond the model, the challenge is getting enough data, though it's important to not just take everything you can get your hands on. The challenge in robotic foundation models is that people haven't really thought about it that way, even though it might seem kind of obvious.
People would take whatever data they had for their robot and essentially overfit to it. We collected a much larger and much more heterogeneous dataset, which allowed us to start thinking about a pre-training and post-training separation, and to actually design for that from the ground up.
In the paper we have this narrative that this is the right way to do it, as if clearly we thought it all through. What we actually did is add more and more data, and then we found that model training was taking a really long time, so it was better not to retrain from scratch each time but to have a pre-trained model that we fine-tune.
So it grew organically; it was a bit of convergent evolution that we ended up on a recipe similar to the one used for language models, and we found a lot of really interesting things that maybe in hindsight are kind of obvious, but to me were pretty cool.
We also found that using only very high quality data is actually bad, for pre-training in particular, as opposed to task-specific post-training. This might seem kind of bizarre: using only good data is a bad idea, and it's kind of interesting why.
The reason it's a bad idea is that the robot is never perfect. The robot will make mistakes, and when it makes a mistake, it finds itself in some situation that doesn't happen in the good data. With lower quality, mediocre data, yes, you see some mistakes, but you also see the recoveries from those mistakes, and that ends up making the robot much more intelligent.
So you really need both. If you have just the mediocre data, the robot doesn't know how to do the task well. If you have only the high quality data, it won't know what to do when it messes up.
And is the pre-training data all unsupervised, just hours and hours of robots performing tasks on video, or is there some degree of task supervision?
All of our data came from people controlling the robots. We're actually studying autonomous data collection as well, but in this prototype, all the data was collected by our robot operators across a very heterogeneous range of tasks.
Now, people naturally have a lot of variability, both in how they perform a task and in how good they are at it; some people are good at some tasks, others at other tasks. What we found to work pretty well is to take all of the data we've ever collected, across all the tasks and all the different robot types, use that for pre-training, and then put a nontrivial amount of effort into curating the data for post-training.
For post-training, it's very important to get data that is obviously of high quality, because it shows the robot how it should do the task, but also data that has a degree of consistency: you want the robot to perform the task with a consistent strategy that reliably works well.
So that's where it takes a bit more care to pick out the right kinds of behaviors. The amount of post-training data is typically not all that large; it depends a lot on the task (a harder task needs more data), but it's between 2 and 20 hours. The pre-training data is about 10,000 hours, so it's much larger.
And so you mentioned the pre-training data is collected from humans operating the robots. What does a typical frame or sample of data contain?
So the way the data is collected, and this is something we put quite a lot of thought into, is with a teleoperation rig. You can think of it as a leader-follower setup: there are leader arms, which look a lot like robotic arms, that a person holds. They're lightweight arms meant to track the person's movement.
Operators use them to demonstrate behaviors, and we typically try to make the demonstrated behaviors as realistic as possible, reflective of an actual job: maybe the job is to replace the paper towel roll on the paper towel holder, fold all the laundry, or assemble cardboard boxes. These episodes range from a few minutes to tens of minutes in length.
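For a sense of what one sample of that data might contain, here is a plausible, hypothetical schema for a single timestep and episode. The actual field names and shapes in the Physical Intelligence pipeline aren't public, so treat this purely as an illustration.

```python
# Hypothetical structure for one timestep of a teleoperated episode; the real
# schema is not public, so every field here is an assumption.
from dataclasses import dataclass
import numpy as np

@dataclass
class TimeStep:
    images: dict[str, np.ndarray]   # e.g. {"base": (H, W, 3), "left_wrist": ...}
    joint_positions: np.ndarray     # proprioceptive state of the follower arms
    gripper_state: np.ndarray       # gripper open/close per arm
    actions: np.ndarray             # leader-arm targets recorded as the label
    timestamp: float                # wall-clock time for resampling/alignment

@dataclass
class Episode:
    steps: list[TimeStep]
    task_prompt: str                # e.g. "fold all the laundry"
    segments: list[tuple[int, int, str]]  # (start, end, language annotation)
```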
The episodes can be segmented and tagged with language, which allows us to give the robots a degree of language and instruction understanding, and then they're processed into training data and used to train the model.
You also used the Open X-Embodiment dataset as part of your pre-training. My understanding from the paper is that it was less than 10% of the data. Was this just your source of "bad" data, or can you talk about where that comes in and why?
Yeah, this is an interesting question. One thing I'll preface this by saying is that π0 is partly a demo, a technology demonstration: we wanted to show the kinds of cool tasks a robot could do. But partly it is genuinely intended as a step towards general-purpose foundation models, which means it needs to be able to understand all sorts of different robot types and all sorts of different skills, not just the ones we demonstrated in our videos.
And we've been using the model in all sorts of ways, for example, fine-tuning to new robot types. We released a demo with another company called Astrobot, where we fine-tuned our robotic foundation model to their robot with a small amount of data. It's a humanoid robot, very different from the robots we have. So we really wanted to make sure that, as much as we could, π0 could understand a variety of robot embodiments.
Which means that we basically took all the data we could get our hands on, from as many different robots as we could find, and tried to fold it into the pre-training dataset.
Got it. So the big contribution there is the number of robot types: you built your pre-training dataset on eight or nine, I believe, whereas that one had more, on the order of 20.
Yeah, exactly, and the number is growing regularly.
We are adding additional robot embodiments all the time, both from our own experiments and from data we get from researchers, partners, all sorts of sources.
And so you've got this pre-training dataset, you pre-train the model, and you do the fine-tuning based on specific tasks. Is that simply additional iterations with higher quality data, or is there more that goes into that part of the recipe?
Yeah, so in the current prototype, the actual training part is pretty straightforward: it's just supervised learning. Take the pre-trained checkpoint and fine-tune. Most of the care goes into selecting, curating, or collecting the data in the right way. For example, some of the more difficult tasks might be hard even for humans to do, so there might be a particular person who's really good at that task, and maybe all of the post-training data comes from that one expert person, whereas the pre-training might contain data for the same task but from less expert humans.
So this provides an interesting sketch for how we could see robotic learning working in the future: as tasks get more complex, there might be a lot of value in getting the best possible data from experts, which might actually be a lot better than what that expert could produce if they were doing the job routinely.
Someone can probably fold a T-shirt much more effectively if they just have to do it really well for three minutes than if they were doing it for 100 hours. So you could ask somebody to really go all out and do it really, really well, and then we'll teach that behavior to the robot in the post-training phase.
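The post-training recipe as described earlier is plain supervised fine-tuning from the pre-trained checkpoint on a small, curated expert set. Here is a schematic sketch, with a stand-in model and synthetic tensors in place of the real checkpoint and demonstrations:

```python
# Schematic of the described post-training recipe: supervised fine-tuning of
# a pre-trained policy on a small curated expert set. The model and data are
# stand-ins for illustration; in reality you would load the π0 checkpoint.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(512, 32)  # stand-in for the pre-trained policy network

# Stand-in curated expert set: observation features and demonstrated actions.
obs, actions = torch.randn(1000, 512), torch.randn(1000, 32)
loader = DataLoader(TensorDataset(obs, actions), batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for epoch in range(10):
    for o, a in loader:
        loss = ((model(o) - a) ** 2).mean()  # imitate the expert's actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```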
Let's talk a little bit about the laundry folding demo, because I think for a lot of folks who saw the initial π0 release, that was super compelling. I can't even remember which year of CES it was, but Samsung or one of the big companies showed a laundry folding robot at least five years ago, probably more. First of all, they had to set up the clothes just so, and even then the folding wasn't all that great, and it was a purpose-built laundry folding robot. So: a lot of care and curation of the scenario, and poor results. Whereas in your example, the robot takes the laundry out of the dryer into a bin, pulls it out, and starts folding it.
Talk a little bit more about that specific example. To what degree is it zero-shot, in the sense that beyond the post-training you didn't do anything specific to get the robots to perform well on that task relative to other tasks? Or was there care that went into that scenario, and others, to make the robots perform well on them?
Yeah, this is a good question. There was a lot of care, but no care went into the code, in the sense that the code running on the robot for laundry folding is completely, 100% identical to what's running when it's building the cardboard box or cleaning the table.
The care went into working carefully with the robot operators to come up with a good strategy that the robot could execute well. This comes back to that point about post-training: it really matters what kind of data you get for the post-training phase.
Now, fortunately, you don't need an enormous amount of that. You need an enormous amount of pre-training data, and there you don't have to be nearly as careful. But for post-training, the strategy matters: there are five different ways you could fold a shirt, and that particular way of doing it is better. It's more reliable, and the robot can more easily get the shirt into that position.
But at the same time, one of the things that makes laundry folding a really great illustration of this pre-training/post-training principle is that even with a really smart strategy you've designed, where you say, OK, if you do this for 20 hours the robot will be pretty good at it, you can still get into lots of unpredictable situations. So you're still relying very heavily on that pre-training data to get the robot out of the weird situations and back into the ones from which those nice strategies will work.
And in the videos we released, there are a few things that even kind of surprised us. Obviously, when you build a robotic system, the first thing you want to do is mess with it and see how you can get it to fail. Michael, one of the researchers working on this, you can see him in the videos: he puts a shirt in the basket, the robot takes it out and starts folding, then he takes another shirt and just drops it on the table, and the robot picks it up and drops it back in the basket. It's like, yeah, get this out of here, I'm working.
Exactly. And the post-training data obviously doesn't have that, because it's a small amount of very carefully controlled data where everything is just right. But probably somewhere in the pre-training data, somebody messed up a little bit, took out two shirts by accident, and put one back, and maybe they didn't put it back in the same way, so the robot has to generalize. That diversity of lower quality data illustrates a lot more of these situations, and as long as you have a good, generalizable model on top that can extrapolate from those patterns, you'll actually get those kinds of emergent behaviors. That even surprised us: we thought it would generalize in some cool way, but we didn't expect quite that kind of nice extrapolation.
And along the lines of curating that training data: if you think about folding a shirt in terms of stages, one stage is getting the shirt into position, the next stage is executing your folding steps, and then there's putting it onto the stack.
Does the training data isolate those individual steps, or is it all an end-to-end run-through of the process?
It's end to end; we basically instruct the operators to perform the entire task. Afterwards, for the language-conditioned stuff, we do have human labelers segment the data and annotate it with text, but the collection itself is just the full task, end to end. I think this is important because you can imagine that, for now, we're doing this in the laboratory, but in the future you could have a robot that is actually out there in the world doing real work, maybe initially under teleoperation.
But that teleoperation is creating the data that would later help it become more autonomous.
Yeah, I think the question came from the idea that once you've got the table laid out, your scenarios may be more well defined, but when the shirts are in the basket, there are, thinking of it combinatorially, a lot more positions the shirts could be in. So you might need more data of getting the shirt out of the basket and flat onto the table, but it sounds like you didn't do that. And maybe a follow-on question: are there areas where you worked with or introduced synthetic data into the process?
Yeah, I think this question is getting at something really important to think about when we're collecting robotic data, something people sometimes don't put as much thought into. Very broadly speaking, in machine learning, the one thing we know works is when training matches test. So getting real, authentic data is really, really important, and we try pretty hard to make the data collection process for our systems as realistic as possible and as representative of what would actually happen in real tasks. Even to the point where, right now, we're running some experiments collecting data for tidying tasks with the robot literally in our kitchen: our building has a little office kitchen, and the robot is cleaning up the actual office kitchen when we're doing the table cleaning task.
We would eat our lunch, put our lunch there, and the robot would go and clean up the actual lunch. The more realistic we can make it, the more training will match test, and the more the model will actually generalize to real-world situations.
Now, at the same time, to your question about synthetic data: something I'm pretty excited to study in the future is the degree to which having this initial foundational understanding might actually make it easier to incorporate synthetic data.
It's like playing a video game, right? We understand how the real world works, so the somewhat abstract, cartoony environment in a video game or an animated film makes sense to us, but we come to that image with a lot of the physical priors we learned from the real world.
And it may very well be that by having this really nice foundation of real world experience it might actually be easier to incorporate less realistic data in the future because the robot will represent its experience in this way that abstracts away those differences.
The π0 model, the full model, is I think 3.3 billion parameters. Can you talk a little bit about the application of scaling laws and how you see that evolving?
Yeah, that's a great question. We wanted to start with a pretty lightweight model, because we didn't want to wait a long time for things to train, and we also have to run this thing in real time for inference.
So we weren't particularly deliberate in saying this is the optimal size; we just picked about the smallest model that seemed to have enough of that internet-scale knowledge baked into it. But one of the technical challenges that I think is really exciting to study in the future is how we can both have the benefit of large models for the more elaborate semantic reasoning, all the fancy chain-of-thought stuff for solving complex problems, and still connect it up to a little "motor cortex" model that can run fast enough to control the robot.
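One plausible shape for that split is two loops running at different rates: a large model that deliberates slowly, and a small model that acts at control rate. The sketch below is an assumed illustration of that architecture, not a description of any released system; all component names are placeholders.

```python
# Rough sketch of the two-rate "big reasoner / little motor cortex" split
# described above. `vlm`, `action_expert`, `get_obs`, and `send_action` are
# placeholder components, not a real API.
import queue
import time

latest_plan = queue.Queue(maxsize=1)

def slow_reasoner(vlm, get_obs, prompt):
    # Runs at ~1 Hz or slower: deliberates over the scene, emits subgoals.
    while True:
        subgoal = vlm.reason(get_obs(), prompt)  # e.g. "grasp the shirt collar"
        if latest_plan.full():
            latest_plan.get_nowait()             # keep only the newest subgoal
        latest_plan.put(subgoal)

def fast_controller(action_expert, get_obs, send_action, hz=50):
    # Runs at control rate: maps (observation, current subgoal) to commands.
    subgoal = None
    while True:
        try:
            subgoal = latest_plan.get_nowait()
        except queue.Empty:
            pass                                 # keep executing the old subgoal
        send_action(action_expert.act(get_obs(), subgoal))
        time.sleep(1.0 / hz)
```

In practice the two loops would run in separate threads or processes, with the queue decoupling the slow deliberation from the real-time control.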
And there's been a bit of academic work, including in my group and other research groups, showing that with these vision-language-action models, you can actually do the same kind of reasoning, the same test-time compute tricks that people have used for LLMs.
For example, we had a paper called Embodied Chain-of-Thought from my lab at UC Berkeley, where we trained a vision-language-action model (a different one predating π0; this was based on OpenVLA) that would actually reason through the task. It would say: OK, you're asking me to put the banana on the plate. To do that, I should first find the banana in the image, then find the plate. I know bananas are yellow, so let's find the yellow thing. I need to find where my hand is. OK, my hand is over the banana, which means the right thing is to move down, and move down means such-and-such coordinates. And then it executes the task.
And this helps a lot in getting good results, especially in unfamiliar settings, because while the robot might not instinctually know what to do there, if it reasons through the task, it can succeed more reliably.
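Illustratively, an embodied chain-of-thought training target might look something like the following. This is paraphrased from the description above; the actual ECoT format differs in its details, and the values here are invented for illustration.

```python
# Invented example of an embodied chain-of-thought target, paraphrasing the
# reasoning chain described above; not the paper's actual data format.
example = {
    "instruction": "put the banana on the plate",
    "reasoning": [
        "TASK: move the banana onto the plate",
        "PLAN: locate banana -> locate plate -> grasp -> move -> release",
        "banana (yellow object) at bounding box [212, 310, 268, 355]",
        "plate at bounding box [402, 290, 520, 380]",
        "gripper is above the banana, so move down to grasp",
    ],
    "action": [0.01, -0.02, -0.05, 0.0, 0.0, 0.0, 1.0],  # e.g. delta-pose + gripper
}
```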
What was the performance like? It sounds very slow.
It was slow, yeah. This was also on an academic compute budget, let's say. So it kind of moves, pauses, moves, pauses. But it's 50% better, so you do get a big improvement.
Obviously, the systems challenge now needs to be overcome.
I'm curious: one of the things that's clear, both in our conversation and in the paper, is that the team has been really good at taking tidbits, pieces, innovations from lots of different places and pulling them together into this work. Was there anything that came out of the recent DeepSeek R1 work that was inspiring, or that you're looking forward to playing with in terms of improving this model?
Yeah, that's a really good question. It's something we've been thinking about a lot. I don't think there's anything concrete I can really talk about now, because when you have vague ideas, you want to at least test them out before you risk saying something stupid. But a lot of us found that work really inspiring, especially the way they have this fairly simple recipe where it's just pre-train and then use RL. Obviously the fancier, more conventional recipe is a little bit better, but just the fact that pre-train-then-RL works so well is pretty cool. I really wonder how some of that could be adapted to things like these VLA reasoning models, where the robot could actually use RL to train itself how to think more carefully. But this is just speculation at this point.
So, you've mentioned FAST a couple of times. Talk us through the motivation there and how it builds on what you're doing with π0.
Yeah, so we used the diffusion-based model for π0, which we had to adapt to work with VLMs.
The action expert in particular?
Yeah, the action expert exactly. But prior work on VLAs used discretization, and the reason is that VLMs naturally output discrete tokens. So if you want to adapt the VLM most directly to control robots, you take your actions and basically turn them into text. In fact, in the first vision-language-action model ever developed, the RT-2 model, they were literally numbers: you represent the action as the number 132, and that is converted into floating point and then run on the robot. The trouble with this, though, is that we know from language models that it really matters how you represent those tokens. Much as we'd like to imagine that language models are so brilliant they can handle whatever you throw at them, the way you represent your tokens makes a huge difference for performance.
Not necessarily because there's any particular knowledge baked into it, but because it makes the model train a lot more effectively. Basically, different letters in English occur with different frequencies, and therefore they carry different amounts of information. And it's much easier for a neural network to learn when all the outputs are roughly equalized in terms of how much information they carry. So it really helps to represent text in a way where every single token has about the same amount of information; then the model learns very quickly and generalizes effectively. Of course, there's no reason to think that wouldn't be true for actions as well, except that's not how actions were being tokenized before. If you're just representing them as numbers, the occurrence of those numbers on the internet has absolutely no bearing on how frequently those actions occur in robotic data. And the foundation of modern tokenizers is basically compression: it turns out that the way you get every token to carry about the same amount of information is to compress your text, because an optimal compression scheme packs roughly the same amount of information into every bit.
So we can take inspiration from compression methods for continuous data to develop a tokenization for actions. One place you see compression of continuous data is images, with JPEG compression. JPEG basically represents an image in terms of different frequencies, and different frequencies occur to different extents in different images. You put your image in the frequency domain, roughly speaking, and then compress the weights on those frequencies. So we decided to try that: essentially, run JPEG-style compression on your action chunks, and that actually gets you a much better tokenization. It gives you the ability to represent the same actions with the same fidelity using a much smaller number of tokens. That by itself is not actually the important part.
What's important is that the tokens now contain about the same amount of information, and when you train on this, your model trains about four times faster than it would with the naive tokenization. That's a really big difference, and it's not just that you spend less time waiting: if your model trains four times faster and you train for the same amount of time, it trains four times better, right? So that's a really big deal. It allowed us to train much better models, especially for tasks that required a lot of generalization in language following. We're still trying to understand the connection between this and language following; maybe it's something like, if you fit the data better, you get to understanding the language better. But one of the things this allows you to do is take, for example, the open-source DROID dataset, which is a big dataset of Franka robotic arm manipulation collected across a number of different universities with language annotations, and actually get a policy that will generalize to a new Franka arm in a new location.
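The frequency-space idea described here can be sketched in a few lines: transform each action chunk with a discrete cosine transform (the same transform family JPEG uses), quantize the coefficients, and treat the result as tokens. The released FAST tokenizer additionally applies byte-pair encoding and other refinements; the quantization scale below is an arbitrary assumption.

```python
# Simplified sketch of frequency-space action tokenization as described
# above: DCT the action chunk, quantize the coefficients, treat them as
# tokens. The real FAST tokenizer adds byte-pair encoding and more.
import numpy as np
from scipy.fft import dct, idct

def tokenize(chunk, scale=1000.0):
    # chunk: (horizon, action_dim) array of continuous actions.
    coeffs = dct(chunk, axis=0, norm="ortho")      # frequency domain per dimension
    return np.round(coeffs * scale).astype(int)    # quantize -> integer tokens

def detokenize(tokens, scale=1000.0):
    return idct(tokens.astype(float) / scale, axis=0, norm="ortho")

# Low frequencies dominate smooth motions, so after quantization most
# high-frequency entries are zero and compress away, which is what evens
# out the information carried per token.
chunk = np.cumsum(0.01 * np.random.randn(50, 7), axis=0)  # smooth fake trajectory
reconstruction = detokenize(tokenize(chunk))
assert np.max(np.abs(reconstruction - chunk)) < 1e-2      # near-lossless round trip
```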
We took the π0-FAST model and sent it to our friends at all sorts of different universities. There was a video I posted on Twitter from some folks at UPenn who ran this, and literally the first time they loaded it up on their Franka robot, they told it something like "pick up the pineapple and put it in the basket," and it actually went and did it. And the folks running this are not used to models working out of the box; usually robotic models don't work out of the box. So it's pretty cool that we could get that by just figuring out a better way to compress actions.
And when you say you train the model using this idea, what is the model in this case? The entire model, just the VLM, just the action expert?
Yeah, so the model at this point, for the π0-FAST experiment, is just the VLM component. And I don't know if that's necessarily the best choice; honestly, we didn't even experiment with it very carefully.
You could apply the same FAST tokenizer to other VLAs. We used it with the OpenVLA model, for instance, and the tokenizer is available on Hugging Face, so anyone could grab it and run it with their own VLA model.
And did the action expert change at all, or is it the same with and without the FAST tokenizer?
Basically nothing else changed; the only thing that changes is the tokenization.
OK, and so the mapping from the FAST tokens to the actions, that is still learned, but it's just learning a different thing?
Yeah, yeah. That's part of the tokenizer, part of the loop, right?
OK, interesting. So talk a little bit about the open-sourcing of all this. That's the most recent thing you guys did.
Yeah, yeah.
So we figured that now that we actually have a model that can perform some pretty interesting tasks, and also a model that other people could actually run (we had a little closed beta where we sent it to a few other folks, a few universities, a few companies), this would be a really great thing to share with the community. Now, obviously we have our own reasons for doing that: we want to see how people use robotic foundation models, because that will help us learn how to make them better in the future. But we also think this is a great way to galvanize a lot more interest in this stuff, because we saw with language models how just getting the pre-trained checkpoints out there and letting people fine-tune their own models created this wave of creativity, where people came up with all sorts of new things to do with them. So we're really looking forward to seeing what people will do with pre-trained robotic foundation models. We really want to get that out there. We have a few demos, you know; it'll run on some robots, it will run on DROID.
The truth is that it's a very early prototype in the grand trajectory of robotic foundation models, so probably most things people try won't work. But just from seeing the kinds of experiments people do, I think we'll all learn a lot, and we'll figure out a lot of new ideas for how to make robotic foundation models more applicable in the future.
And what specifically did you provide in open source? You provided the weights for the models; are you providing any of the details around the training recipes and/or datasets, those kinds of things? Does it allow you to fully replicate π0, or does it allow you to use what's already been done?
So the open-source repo includes the code for fine-tuning, the base checkpoint, another base checkpoint for the FAST version of the model with the tokenizer, and a few example fine-tuned models: a model fine-tuned for DROID, a model fine-tuned for ALOHA. Those are really intended as demos, if you just want to try it out and see. The main use case we anticipate is for folks to take the base model and fine-tune it to their own robots, because the robotics community is still very fragmented.
Everyone's setup is different. So if you are lucky enough to have a setup very similar to DROID, or very similar to a setup we had, you might be able to run it zero-shot, but really the intended use case is to collect some of your own data, fine-tune the model, and then try to use it to solve your task.
And in order to do that, would you need a setup that offered the same kind of operator-guided data collection methodology?
That's a really good question, and this is actually one of the things we hope to understand better. We explored a few different ways to collect data, and we have our own intuition for what works and what doesn't, but if somebody tries to fine-tune the π0 model, they'll collect data, maybe the same way we do, maybe differently. I think it'll be really interesting to see what kinds of recipes work and what kinds don't. We know a recipe that worked for us, and we described it in the paper, so hopefully if someone tries it, it will work for them. But maybe people will try other things, and we'll find out that you can get away with a lot less data if it's more consistent, or that something else we didn't know is actually true. We really just want to see how people experiment.
Most of the robots depicted in the paper and in the demo videos are arm-based. You mentioned some work on humanoid robots. Are there other form factors you see folks experimenting with?
Yeah, so we ourselves, or our collaborators and partners, have been able to run the model well on robots that include humanoids, single-arm, dual-arm, and mobile platforms. In principle, the model supports very flexible action representations, as long as your action is at most 32-dimensional; we had to pick a maximum, so we picked 32. I've gotten videos of people who have successfully run this model with five-finger hands. I'm not entirely sure how they did it, because no one has actually told me; they just sent me videos, like, look, it's working. In principle it should be possible to run it for things like navigation. We haven't done that ourselves, but that's very much within the constraints of the model. So as long as it fits the dimensionality, somebody could try it. I'm really curious to see what works and what doesn't. I'm sure there will be some limitations; if you train on an octopus arm, that's probably a little too different.
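On that fixed 32-dimensional action interface: a natural way to fit different embodiments into it, shown below as an assumed illustration rather than the actual implementation, is to zero-pad each robot's action vector and track which dimensions are real.

```python
# Assumed illustration of sharing one fixed-size action interface across
# embodiments: zero-pad to the maximum dimensionality and keep a validity
# mask. Not the actual π0 implementation.
import numpy as np

MAX_ACTION_DIM = 32

def pad_action(action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad an embodiment-specific action and return a validity mask."""
    padded = np.zeros(MAX_ACTION_DIM, dtype=np.float32)
    mask = np.zeros(MAX_ACTION_DIM, dtype=bool)
    padded[: action.shape[0]] = action
    mask[: action.shape[0]] = True
    return padded, mask

# A 7-DoF single arm plus gripper uses 8 dims; a dual-arm mobile platform
# might use 16 or more; both map into the same 32-dim interface.
arm_action = np.random.randn(8).astype(np.float32)
padded, mask = pad_action(arm_action)
assert padded.shape == (MAX_ACTION_DIM,) and mask.sum() == 8
```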
When you think about the robot platform landscape, are there accessible hobbyist or enthusiast types of arms you could try it out on?
Yeah, that's a really good question. We were actually very lucky to get some help from Hugging Face, who helped us with a PyTorch port of the model. They also have a very low-cost arm. I can't vouch for its quality, I've never actually used it, but they seem to have been able to do some pretty cool things with it. So if anyone is interested in a really low-cost robot, checking out LeRobot and their PyTorch port of π0 (I think they've actually tried it out on their arms) could be a good way to go. But honestly, even the nicer arms we've been using, the ones from the ALOHA experiments, are not all that expensive. I think ALOHA was on the order of $20k, so the whole system is on the order of $20k; the arms are somewhere in the $6,000 range per arm. If you just want the follower arms, the minimum you could probably get away with is about $15k. But the cost seems to be going down every year, so I wouldn't be surprised if by this time next year these things are even cheaper.
Okay, awesome.
By the way, one of the things I'm really excited about is that if we get these robotic foundation models into people's hands, and the cost of hardware keeps dropping, maybe it will be very practical for anybody to just play around with their own robot. I mean, $15k is a bit expensive, but if it drops another factor of two or four, maybe that will actually be pretty practical.
Awesome. What's next? Where do you see it all going?
Yeah, so there are a number of next steps I'm pretty excited about. One of the things I really want us to do better, and something I think we'll be able to talk about more in a few weeks, is a much better job of following complex instructions. One of the things that's really cool about ChatGPT is that you can give it a prompt that describes in detail almost a whole job you want it to do. You don't just say, please write me an email to my boss; you actually describe it: write me an email to my boss about how I want a raise, and so on. It's a complete description of a task, and it would be really cool if we could do that with robots too.
Instead of just telling it to fold the shirt or clean the table, you could tell it: hey, I'm throwing a party. I already put the plates on the table, but there's some trash, so put away the trash but leave the plates where they are. Make sure the fork and the knife are next to the plate in a nice tidy way. That kind of prompt describes what you actually want, and the robot would do the task. Maybe it might even ask you: hey, I didn't get that part. Are you sure you want me to leave that plate there? It doesn't look like it belongs. You can have a much more intricate interaction, and the interesting part is not just the interaction; it's the ability to instruct the robot to really do the job. I think that can also be a really interesting mechanism, not just for getting robots to do sophisticated tasks, but for getting robots to repurpose their skills. If the robot can take this more intricate task and think, I've learned these particular behaviors; how do I adapt them to solve this new problem I've been presented with? That's something you could do with a lot of the semantic knowledge inside LLMs, but it requires a bit more processing, maybe a bit of that test-time compute. And that's something we've been working on that we hope to tell people more about in the near future.
Do you make a distinction between complex instructions that tell the robot to do a complex multi-step task and complex instructions that tell it how to do a complex task?
Yeah, that's a really interesting question. I would like to not have to make that distinction. The distinction is actually important in terms of capability, but I think you can have a system that does both. In particular, people do this in a very flexible way: they bring to bear their own knowledge and their own skills, but they also benefit from the information they're offered. If you're a professional laundry folder, it's enough for someone to tell you, hey, go do your job, and you already know what to do. But if you're not experienced at a task, maybe you'll still be able to do it if somebody provides you with more instruction. And if you have a model trained end to end that can perform this kind of reasoning, put together the steps it knows, and flexibly decide what to do, then it can do either of those, depending on the situation and its prior knowledge.
Got it. Was there anything else on that list?
Yeah, other things we're doing that I'm pretty excited about: we're obviously trying to push the boundaries on generalization for these systems.
I'm very happy with how we've been able to demonstrate pretty sophisticated tasks, but it's still a challenge if you want those tasks to work with any object and in any environment.
Generalization means a lot in this context. It's generalization to task, generalization to environment, generalization to object, generalization to platform, and to instruction as well.
Yeah, instruction too. It's a very big space, and because it's so big, it's also kind of hard to say anything particularly definitive about it. But it's something we're studying quite a lot: we're trying to see how different ways of collecting data, different ways of training models, and different ways of transferring knowledge from internet-scale pre-training can facilitate generalization. And something really exciting there is those moments like the one I mentioned, when we had that DROID model and somebody was able to run it at a different location.
Students were able to run it at another university and see that spark of life, where they give it some task and it just goes and does it. Maybe it does it poorly, maybe it's slow, but I think that's really special. The more we can enable that kind of aha moment, where you just load up the model on your robot, tell it to do something, and it actually kind of gets it, the better; I think that's really, really important. And I think that if we're careful with transferring knowledge from the web, creating data in the right way, and setting up our model in the right way, we can get a lot more of that.
When I think about an example like this, where someone is using the model on presumably the same type of robot but in another environment, you think, well, that should work: it's a software program, you put it someplace else. And I think back to what I believe was a Pieter Abbeel demo of a robot trying to do knot tying, where just different colors of rope, stripes, and things like that totally confounded the robot.
Yeah, my very first robotics project, which was actually with Pieter Abbeel, we had to use the same background in every trial, because that was the background that worked for the robot. So I'm really glad we've gotten past that at this point.
That's awesome. Well, I'm looking forward to keeping in touch and keeping up with the updates coming out of the work. Very cool stuff.