Windsurf CEO: Betting On AI Agents, Pivoting In 48 Hours, And The Future of Coding
One of the things that I think is true for any startup is you have to keep proving yourself. Every single insight that we have is a depreciating insight. You look at a company like Nvidia. If Nvidia doesn’t innovate in the next two years, AMD will be on their case. That’s why I’m completely okay with a lot of our insights being wrong. If we don’t continually have insights that we are executing on, we are just slowly dying. This notion of just a developer is probably going to broaden out to what’s called a builder. I think everyone is going to be a builder. I think software is going to be this very democratized thing.
Welcome back to another episode of the Light Cone. Today we’ve got a real treat. We have the co-founder and CEO of Windsurf, one of the people who literally brought vibe coding into existence, Varun. Thanks for joining us.
Thanks for having me, guys. Where is Windsurf now? Like all of us intuitively know, we use it, but you know, how big is it now? Where is it? Yeah. So, the product has had well over a million developers use it. It has hundreds of thousands of daily active users right now. It’s being used for all sorts of things, from modifying large code bases to building apps extremely quickly, zero to one. We’re super excited to see where the technology is going. Let’s get to the brass tacks. How did you get started? The company actually started four years ago. We didn’t start as Windsurf; we actually started as a company called Exafunction. We were a GPU virtualization company at the time. Previously, my co-founder and I had worked on autonomous vehicles and AR/VR, and we believed deep learning was going to transform many industries, from financial services to defense to healthcare.
We might have timed it wrong, though. Ultimately, we built a system to make it easier to run these deep learning workloads, similar to what VMware does for computers and CPUs. The middle of 2022 rolled around, though, and what happened at the time was we were managing upwards of 10,000 GPUs for a handful of companies. We had made it to a couple million in revenue, but the transformer became very popular with models like text-davinci from OpenAI, and we felt that was going to fundamentally disrupt the small business we had at that time, because we felt that everyone was going to run these transformer-type models.
In a world in which everyone was going to run one type of model architecture, transformers, we thought if we were a GPU infrastructure provider, we would get commoditized, right? If everyone’s going to do the same thing, what is our alpha going to be? So, at the time, we basically said, hey, could we take our technology and wholesale pivot the company to do something else? And that was a bet-the-company moment. We did it within a weekend. My co-founder and I had a conversation along the lines of, I don’t think this is going to work; we don’t know how to scale this company. At the time, we were early adopters of GitHub Copilot. We told the rest of the company, and everyone started working on Codeium, the extension product, the Monday immediately after.
I’m just curious to dig into this pivot story because it’s pretty rare to hear the details of a pivot, especially a late-stage pivot. At the time you decided to pivot to Codeium, how far along was the company? One of the things about the company was, I guess, we tried to embrace a lot of the YC analogies here of ramen profitability and these other key insights. We were only a team of eight people at the time, even though we were making a couple million in revenue. We were kind of free cash flow positive. It was the peak of zero interest rates at that time. The company was a year and a half old. We had raised somehow magically $28 million of cash at the time.
I think the big point here in our minds was it doesn’t matter if we’re doing kind of well now; if we didn’t know how to scale it, we needed to change things really fast. I guess the thing that’s remarkable is when you started the company, you had this thesis where you were betting that a lot of companies were going to build their own custom deep learning pipelines to train BERT-style models, right? That was the thing that was working. But in 2022, you saw the hockey-stick shift that suddenly there would be one model that would rule them all. You were foreshadowing a lot of the future, and a lot of it came from that conviction. So, I’m curious, what were those signs?
You had to be really embedded in it. You were already making seven figures. You could have raised a Series A, and you were like, we’re going to throw all that away and burn it. So, actually, crazily enough, we had raised our Series A at that time, but whether or not we should have been able to is a different question. No, I think you’re totally right. I think one of the things that was happening at the time was we were working with, largely speaking, these autonomous vehicle companies because they had the largest deep learning workloads, and we were seeing, hey, that workload is growing and it’s large. But we were betting fundamentally that the other workloads, these natural language workloads in other industries like financial services and healthcare, would take off.
But I think once we saw these generative models handle so many of the use cases, right? Maybe an example is in the past you would train a BERT model to actually go ahead and do sentiment classification. But very quickly when we tried even a bad version of GPT-3, like the very old version, we were like, this is going to kill sentiment classification. There’s no reason why anyone is going to train a very custom model anymore for this task. I think we saw the writing on the wall that our hypothesis was just wrong. You go in with some thesis on where you believe the space is going, but if your hypothesis is wrong and information on the ground changes, you have to change really fast.
So then what did you decide to do? It’s like you decided, okay, we’re going to pivot, and when we work with founders, that’s kind of stage one. You’re not half-foot in, half-foot out. So you had that conviction; we need to try something out. How do you figure out what was going to be the next step? I think we needed to pick something that actually everyone at the company was going to be excited about. I think if we had picked something that we thought could be valuable but people were not excited about, we ultimately would fail immediately.
We came with an opinionated stance because we were early adopters of a product called GitHub Copilot. We thought that was the tip of the iceberg of where the technology could go. Obviously, everyone at the company was a developer. Devtool companies, generally speaking, have not done that well in the past. But hey, when you have no other options, it’s a very easy decision, right? When you’re going to be a zero with high probability anyway, you might as well pick something that you think could be valuable and everyone’s going to be motivated to work on. Everyone’s forgotten this now, it feels like, because GitHub Copilot’s in the background. But at that particular moment, it felt inevitable that Copilot was going to win. It just had everything: the GitHub connection, Microsoft distribution, OpenAI.
It seemed like no one could compete. So how did you have the bravery to be like, yeah, we can totally crush Copilot? This is where the irrational optimism piece comes in; I’ve said this before to the company. I think startups require two distinct beliefs, and they actually run counter to each other. You need irrational optimism, because if you don’t have the optimism, you just won’t do anything. You’re just a pessimist and a skeptic, and those people don’t really accomplish anything in life. But you also need uncompromising realism, which is that when the facts change, you actually change your mind. That’s a very hard thing to do, because the thing that makes you succeed through irrational optimism is the exact opposite of the thing that allows you to be a very realistic company.
So, irrational optimism: we basically said, hey, we know how to run and train models ourselves. We actually trained the first autocomplete models ourselves and ran it on our product, gave it out for free. I don’t think we had the exact roadmap on where this was going, but we just felt there was a lot more to do here. If we couldn’t do it, then I guess we’d die, but we might as well bet that we could do it.
Were your early versions better than GitHub Copilot at the time? So our earliest version that we shipped out was materially worse than GitHub Copilot. The only difference was it was free. We built a VS Code extension after pivoting, and within I think two months, we had shipped the product and given it out to Hacker News, like posted something on Hacker News. We built that out; it was missing a lot of key features. The model that we were running was like an open-source model that was not nearly as good as the model that GitHub Copilot was running. Very quickly then, our training infrastructure got better, so we actually went out and trained our own models based on the task, and then suddenly it actually got capabilities that even GitHub Copilot didn’t.
Within two months we had the basic capabilities. We’d find it hilarious now that this was ever state-of-the-art, but our model could actually fill in the middle of code. So when you’re writing code, you’re not only just adding code at the end of your cursor, but you’re filling it in between two parts of a line, right? And that code is very incomplete and looks nothing like the training data of these original models. So we trained our models to make them actually capable of that use case, and that allowed us to pull ahead in terms of quality and latency. We were able to control a lot of details within a couple of months.
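As a concrete illustration of fill-in-the-middle, here is a minimal, hypothetical sketch of how such training examples are often constructed: a file is split at two points, the surrounding code becomes the prompt, and the removed span becomes the target. The sentinel token names below are made up for illustration and are not Windsurf's actual format.

```python
import random

# Hypothetical sentinel tokens; FIM-trained models each define their own vocabulary.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(source: str, rng: random.Random) -> dict:
    """Turn one source file into a fill-in-the-middle training example.

    The model is trained to emit `target` given `prompt`, which teaches it to
    complete code between the text before the cursor and the text after it.
    """
    # Pick two cut points; the span between them is what the model must predict.
    i, j = sorted(rng.sample(range(len(source) + 1), 2))
    prefix, middle, suffix = source[:i], source[i:j], source[j:]
    prompt = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    return {"prompt": prompt, "target": middle}

rng = random.Random(0)
example = make_fim_example("def add(a, b):\n    return a + b\n", rng)
print(example["prompt"])
print(example["target"])
```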
At the beginning of 2023, I’d say the autocomplete capabilities were much better than what Copilot had. Was that a totally new capability for you guys? Because you guys have been building GPU infrastructure. It sounds like you basically hacked together the first version by taking an off-the-shelf open-source model, sticking it into a VS Code extension, and just kind of wiring the two to talk to each other. But then right after that, you had to train your own coding model from scratch. And you guys had been following the transformer stuff, but like you hadn’t built it. In order to do that, I assume you had to download all of GitHub and train a whole model from scratch. How did you all figure out how to do that in only two months?
Yeah, it’s a great question. So first of all, when we ran the model ourselves, the reason we were able to run it and provide it for free is that we actually had our own inference runtime at the time, and that obviously came from the fact that we were originally a GPU virtualization company. That enabled us to ship the v0 with an open-source model quite quickly. Immediately after that, you’re totally right. We had never trained a model like this in the past. But I think we hired people that were smart, capable, and excited to win. So we needed to figure it out. There was no other option, right? Otherwise, you die.
It made the decision really, really simple. So, yeah, we had to figure out how to get a lot of data. How do you do this at scale? How do you clean this data? How do you make it so that it’s actually capable of handling this case where code is very incomplete? We shipped a model very, very quickly after that.
Wow. And you did all of that with eightish people in two months. Yeah, that’s right. And then right after that, because you were running your own models, you started getting interesting customers, right?
Yeah. So basically what happened was the product was free at the time, so we ended up getting a lot of developers using the product across all the IDEs. So VS Code, JetBrains, Eclipse, Vim; companies started reaching out because they not only wanted to run the product in a secure way, but they also wanted to personalize it to all the private data inside the company. So very quickly afterwards, in the next coming months, companies like Dell and JP Morgan Chase started to become customers of our product.
Now these companies have tens of thousands of developers on the product internally. But we put a lot of focus at the company on making sure that the product works on these very large code bases. Some of these companies have code bases that are well over a hundred million lines of code. Making sure that the suggestions are fast is one thing, but making sure the product is actually personalized to the codebase and the environment that they have was almost a requirement at the time.
You did that pivot, you built it in two months, then shipped it, and within a couple of months, you got these big logos. Yeah. So, I mean, obviously these companies take some time to close, but pilots were starting within a couple months or a quarter after that. Obviously, we had no salespeople at the company, so the founding team was just trying to run as many pilots as possible to see what would ultimately work. At what point did you expand beyond just the VS Code extension into supporting all these other IDEs?
That was actually very, very soon afterwards. How did you think about that? There’s like one argument that you could make which is like there’s lots of VS Code developers. You had a tiny team. You could have made the argument that I just focus on building a great experience for VS Code. You’d only captured a tiny percentage of the market of all possible VS Code developers. And that’s not what you did. You expanded horizontally very quickly and built extensions for all those IDEs.
Why? I think maybe the fundamental reason that we thought was quite critical is if we were going to work with companies, companies have developers that write in many languages. For instance, a company like JP Morgan Chase might have over half of their developers writing in Java. Those developers are going to use JetBrains and IntelliJ; over 70 to 80% of all Java developers in the world currently use IntelliJ. Otherwise we would have needed to turn away a lot of companies, or a lot of companies would not have been able to use us as the de facto solution; we’d be one of many solutions inside the company.
So because of that, we made the decision, and luckily, because we made it early enough, it shaped the architecture of how we built the product out, which is to say we’re not building a separate version of the product for every single IDE. We have a lot of shared infrastructure, and very little actually lives on a per-editor basis. So it’s a very small amount of code that needs to get written to support as many IDEs as possible. This is one of those early decisions that ended up making this transition much easier.
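One way to picture that split, as a hypothetical sketch rather than Windsurf's actual architecture: the editor-independent logic lives in one shared engine, and each IDE integration only has to implement a thin adapter interface.

```python
from typing import Protocol

class EditorAdapter(Protocol):
    """The small per-IDE surface; everything else lives in the shared engine."""
    def read_open_buffers(self) -> dict[str, str]: ...
    def show_completion(self, path: str, offset: int, text: str) -> None: ...

class CompletionEngine:
    """Shared, editor-independent core: one implementation serves every IDE."""

    def __init__(self, model_endpoint: str):
        self.model_endpoint = model_endpoint  # placeholder for the inference backend

    def complete(self, adapter: EditorAdapter, path: str, offset: int) -> None:
        buffers = adapter.read_open_buffers()      # supplied by the thin adapter
        prefix = buffers[path][:offset]
        suggestion = self._call_model(prefix)      # shared inference path
        adapter.show_completion(path, offset, suggestion)

    def _call_model(self, prefix: str) -> str:
        # Stand-in for the real model call (local or remote).
        return "# suggested code\n"
```

Under this kind of layout, adding a new IDE means writing one small adapter, not a new product.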
How about the transition from Codeium to Windsurf? At the time, we’re now probably in the middle of 2023, we started working with some very large enterprises. Within the next year, the business had gotten to well over eight figures in revenue just from these enterprises using the product. We have this free individual product, but I think one of the things about this industry that we all know is the space moves really fast, and we are always making bets on things that are not working yet. Actually, most of the bets we make in the company don’t work. I’m excited when, let’s say, only 50% of the things we’re doing are actually working, because I think if 100% of the things we’re doing are working, it’s a very bad sign for us.
Because it’s probably one of maybe three things. The first is, hey, we’re not trying hard enough. That’s probably what it means. The second is we somehow have a lot of hubris, and the hubris is that we believe everything we do is right despite the facts on the ground. The third is we’re not actually testing our hypotheses in a way that tells us where the future is going; we’re not actually at the frontier of what the capabilities and technology are. We believed at the beginning of last year that agents were going to be extremely huge, and we had prototypes of this at the beginning of last year, and they just didn’t work. But there were different pieces we were building that we felt were going to be important to making agents work, which is understanding large code bases, understanding the intent of developers, and making edits on the codebase really quickly.
We had all these pieces. The thing we didn’t have was a model that was capable of calling these tools efficiently enough. Then obviously in the middle of last year, that completely changed with the advent of ChatGPT and with that we basically said, okay, we now have these agent capabilities, but the ceiling of what we can show to our developers on VS Code was limited. We were not able to provide a good enough experience, and we thought what was going to happen is developers would spend way more time not writing software but reviewing software that the AI was going to put out.
I think we are a technology company at heart, and I think we are a product company, but I think the product serves the technology, which is to say we want to make the product as good as possible to make it so that people can experience the technology. We felt that we were not able to do that with VS Code, so in the middle of last year we decided, hey, we need to actually go out and create our own IDE.
That’s what triggered actually creating Windsurf. The way you did that was you forked VS Code to build this new set of capabilities. You guys had to learn how to develop on this new VS Code base, which I’m sure was super complicated. Yeah, we needed to figure that out. That was once again another thing where we ended up shipping Windsurf within less than three months of starting the project. That’s when we shipped it out across all operating systems.
Wow. And what happened? Did it take off immediately or was it unnoticed for a long time? It took off pretty quickly, I would say. I think the speed at which it took off among early adopters was quite high. There were obviously some very rough edges, and this is one of those things where, because of the rough edges, obviously people started coming and leaving the platform fairly quickly. But what we saw was, as we improved the capabilities of the agent and as we improved the capabilities of the passive experience, even the passive tab experience has made massive leaps in the last couple of months. We started realizing that not only were people talking about the product more and more, but people were also staying on the product more and more at a higher rate.
How many people worked on shipping Windsurf? It was done in a period of one or two months. A couple months. So, yeah, like less than three months. I wouldn’t say it’s a bet-the-company moment because it’s not a fundamentally different paradigm compared to moving from a GPU virtualization product to an AI code product. But yeah, it was anyone that could work on it needed to drop what they were working on in the past and work on it immediately.
And at that time, how big were you guys? The engineering team was probably still less than 25 people. Wow. This is crazy. Interestingly, our company, from an employee standpoint, didn’t have that few people. One thing that’s a little bit weird about our company compared to most other companies in the AI space is we have a fairly large go-to-market team. We were selling our product to the largest Fortune 500 companies. It’s very hard to do that purely by letting them swipe a credit card. They need a lot of support. You need to make sure that the technology is getting adopted properly, which is very different than just giving people the product and seeing it grow effectively.
So from an engineering standpoint, we’ve always run fairly lean. But because of the market interest, we’ve always had a lot of people in go-to-market. Who are the ideal people to go into that function? Are they really good engineers who want to be forward deployed?
Yeah, we have two components of it. We have account executives. For these, we generally try to find people that are very curious and excited about the capabilities; in fact, people that would use Windsurf in their free time, because they’re providing the product to leaders who also love software and technology. So, if they’re just completely unaware of the technology, they’re not going to be helpful.
And then we also have these deployed engineer roles, similar to what you said, that get their hands really dirty with the technology and make sure that our customers get the most value from it. I mean, the wild thing is, because everyone uses Windsurf, it sounds like you’re having even these AEs who are non-technical become vibe coding champions.
One of our biggest users of Windsurf at the company is a non-technical person who leads partnerships. He has actually replaced buying a bunch of sales tools inside the company. This is one of those things where I think Windsurf is giving power back to the domain experts. In the past, what would happen in an organization is he would need to talk to a product manager, who would talk to an engineer, and the engineer would have a large backlog, because this clearly doesn’t immediately make the product better, so it has to be a lower priority.
But now he is actually kind of empowered to actually go and build these apps. Does he have any programming background at all? Interesting, because that’s definitely one of the controversies on Twitter at the moment: can you actually vibe code unless you know some amount of coding already?
Yeah. One of the things we do have is, if we need to deploy one of these apps, we have a person who focuses on making sure that these apps are secure and deployed. But the amount of leverage that that person has is ridiculous. Instead of him going out and building all of these apps, the v0 can actually get built by people who are domain experts but non-technical inside the company.
With the Codeium launch, you went head-on against Microsoft and GitHub and these huge incumbents. With the IDE launch, you went sort of head-on against Cursor, the hot startup of the moment. How did you all think about that internally?
This might be a weird thing about our company, but our morale just isn’t really affected by what other companies do. That’s probably because our company has gone through a lot of turbulent times. The fact that we needed to pivot at 10 employees and just completely kill our idea is a normal thing for the company.
And then second of all, the companies that are relevant in our space have always been a fluctuating set. I really respect all the companies in our space, but if you were to go back to the beginning of 2023, everyone would have thought GitHub Copilot was the product that everyone would use, and there was no point in building anything else. And then in the middle, Devin came out, and everyone was like, “Hey, Devin is going to solve everything.” I’m sure they’re doing good work now.
But then after that, obviously, Cursor is doing a really great job. So, I think what really matters to us most is do we have a good long-term strategy, and are we executing in a way where we’re getting towards that long-term strategy while being flexible with the details? As long as we’re doing that, I think we have a fighter’s chance, right?
Do you educate yourself at all on the competitors’ products, though? Yeah. We don’t want to put our heads in the sand and tell ourselves our product is awesome. It’s very easy to do that, especially given that before we worked on Windsurf, the company was already growing very quickly from a revenue standpoint.
What sort of opinions did you have on the full IDE that was maybe different from Cursor? I’m actually asking, as Cursor is a very well-liked product obviously, so at a product level, why are you like, “Oh yeah, we want to build it this way”?
I think it’s a great question. The first point is at the time actually when we started working on windsurf, all the products were basically chat and autocomplete capabilities. I think that’s what GitHub Copilot was, what Cursor was at the time.
We took a very opinionated stance that we thought agents were where the technology was actually going. We were the first agentic editor that was out there. I think the biggest takeaway was we didn’t believe in this paradigm where everyone would be @-mentioning everything. It almost reminded us of the anti-pattern of what search engines were before Google improved things a lot: these landing pages that had every distinct bucket of things you could search for.
But Google came out with this very clean search box. Even with Google at the time, you would get better answers if you wrote AND operators or site: queries. Now it’s gotten way better, and I guess we had a belief that the software would get easier to build, and we would build from that starting point.
We saw all the other players in the space making their products configurable in ways that, while maybe good for users given what the technology was then, we thought would become unnecessary down the line. So we invested in capabilities like deeply understanding the codebase to understand the intent of the developer, and making changes quickly to the codebase.
We took the approach of, instead of having a read-only system where you tag everything, what happens if you could make changes very quickly? That’s why at the time we were kind of the first to do that.
Now, if you were to ask, was that a very obvious decision? I think it looks very obvious today. This is where one of the things that I think is true for any startup comes in: you have to keep proving yourself. Every single insight that we have is a depreciating insight.
The reason why companies win at any given point is not like they had a tech insight one year ago. Actually, if a company wins, other than the fact that they have a monopoly, it’s a compounding tech advantage that keeps existing over and over again.
I think the example that I find most exciting is you look at a company like Nvidia. If Nvidia doesn’t innovate in the next two years, AMD will be on their case, and Nvidia will not be able to make 60 to 70% gross margins at that point, right? Even though it’s one of the largest companies in existence right now, by having good insights to start with, you’re able to learn from the market and maybe compound that advantage with time.
And that’s the only thing that can be persistent. It sounds like a moat is something we think of as a noun, but it’s actually a verb, something that changes with time. I also tell the company this, and that’s why I’m completely okay with a lot of our insights being wrong.
If we don’t continually have insights that we are executing on, we are just slowly dying. That’s what’s actually happening. I think the interesting thing is that it is easier now looking back and connecting the dots on your journey, how a lot of these technology bets you took actually did end up compounding what windsurf ended up becoming.
It was happenstance that being really good at GPU deployment and virtualization ended up being the thing that made you really good at blazingly fast autocomplete, faster than other products. So that kind of compounded there. There’s also the aspect of you building all these plugins for enterprises and being so good at reading large codebases.
You did something that was contrarian. We work with a lot of YC companies, and many codegen tools use vector databases; that was the standard approach for many folks. But you guys did something very different, right?
Yeah. One of the things that got really popular is this term RAG, retrieval-augmented generation. You weren’t anti-RAG. I don’t know that we’re anti-RAG. RAG obviously makes sense. You do want to retrieve some stuff, and based on the retrieval, you want to generate some stuff.
So, I guess the idea is correct that everything is retrieval-augmented generation. But I think what people got maybe a little too opinionated about was the way RAG is implemented: it has to be a vector database that you go out and search. I think a vector database is a tool in the toolkit.
If you think about what users ultimately want, they want great answers and they want great agents. That’s what they actually want. And how do you end up doing that? You need to make sure that what’s in the context is as relevant as possible.
What we ended up doing is having a series of systems that enable us to pack the context with the most relevant snippets of code. The way we ultimately did that was a combination of keyword search, embedding search, abstract syntax tree parsing, and then on top of that using, as you mentioned, all the GPU infrastructure we have to take large chunks of the codebase and rank them in real time as the query comes in.
We found that this was the best way for us to find the best context for the user. The motivation is that people ask kind of weird questions. They might ask, across a large codebase, to upgrade every use of this API to that API.
If embedding search only finds five of them out of ten, it’s not a very useful feature at that point. So we needed to make sure the precision and recall were as high as possible, which meant we used a series of technologies to get to the best solution.
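As a rough illustration of what a “series of systems” can look like, here is a hypothetical hybrid retriever: it pools keyword hits and embedding hits over AST-derived snippets and then re-ranks the pooled candidates against the query. The scoring functions are placeholders, not Windsurf's implementation.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    # In practice snippets would come from AST-aware chunking of the codebase.
    path: str
    text: str

def keyword_search(query: str, snippets: list[Snippet], k: int) -> list[Snippet]:
    """Cheap lexical pass: keep snippets sharing the most terms with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(s.text.lower().split())), s) for s in snippets]
    return [s for score, s in sorted(scored, key=lambda p: -p[0])[:k] if score > 0]

def embedding_search(query: str, snippets: list[Snippet], k: int) -> list[Snippet]:
    """Stand-in for a vector-similarity lookup (embedding model omitted here)."""
    return snippets[:k]

def rerank(query: str, candidates: list[Snippet], k: int) -> list[Snippet]:
    """Stand-in for the expensive step: score every candidate against the query
    with a larger model, the part the interview describes running on GPUs."""
    return candidates[:k]

def retrieve_context(query: str, snippets: list[Snippet], k: int = 10) -> list[Snippet]:
    """Pool candidates from both retrievers, then re-rank the union."""
    pool = {id(s): s for s in keyword_search(query, snippets, 50)}
    pool.update({id(s): s for s in embedding_search(query, snippets, 50)})
    return rerank(query, list(pool.values()), k)
```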
There’s a bit of a pattern of AI startups taking shortcuts relative to what the problem domain actually requires, but you took it from first principles, right? So, you built a way more complex system that did parsing and all this stuff, which is cool.
Yeah, I think maybe one of the things that is potentially interesting to discuss is that a lot of the company started off working on autonomous vehicles. The reason that’s important is those are systems you can’t just YOLO, which is to say build the software and then let it run. You need really good evaluation.
I think at the company we don’t strive for complexity. We strive for what works. So why is the system so much more complex now? It’s because we built really good evaluation systems.
The evals for code are actually really cool. Basically, the idea is you can leverage a property of code, which is that it can be run. We not only have real-time user data, which we can put aside for now, but we can also take a lot of open-source projects and find commits in those projects with tests attached to them.
You can imagine a lot of cool things we can do based on that. You can take the intent of a commit, delete all the code that is not the unit test, and then you can see, hey, are you able to retrieve the parts where the change needs to get made? Do you have good intent to make those changes? And then after making the changes, does the test pass?
You can do that task, and you can also mask the task. By masking the task, it becomes more like the Google-style task of trying to predict your intent, which is to say, let’s say you only put in a third of the change and you don’t give it the intent. Can it fill out the rest to make the test pass?
So, there are so many ways you can slice this. Each of them you can break down into so much granularity. You can be like, what is my retrieval accuracy? What is my intent accuracy? What is my passing accuracy? You can do that.
And then now you have a hill to climb. I think that’s actually important before you add a lot of complexity for any of these AI apps. I think you need to make a rigorous hill that you can actually climb. Otherwise, you’re just shooting in the dark, right?
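A toy version of that commit-based eval loop, under the assumption that the repo has already been reset so only the commit's tests remain and the agent has produced a candidate patch; the helper names and commands here are illustrative, not Windsurf's harness.

```python
import subprocess
from pathlib import Path

def run_tests(repo: Path, test_cmd: list[str]) -> bool:
    """Run the project's test command and report pass/fail."""
    result = subprocess.run(test_cmd, cwd=repo, capture_output=True)
    return result.returncode == 0

def evaluate_patch(repo: Path, agent_patch: str, test_cmd: list[str]) -> dict:
    """Apply the agent's proposed change, then check whether the commit's tests pass."""
    applied = subprocess.run(
        ["git", "apply", "-"], cwd=repo, input=agent_patch.encode(), capture_output=True
    )
    if applied.returncode != 0:
        return {"applied": False, "tests_pass": False}
    passed = run_tests(repo, test_cmd)
    # Restore the working tree so the next example starts from a clean state.
    subprocess.run(["git", "checkout", "--", "."], cwd=repo)
    return {"applied": True, "tests_pass": passed}

# Example usage (paths and command are placeholders):
# result = evaluate_patch(Path("/tmp/project"), patch_text, ["pytest", "-q"])
```

Retrieval and intent accuracy can be scored the same way, for example by comparing the files the agent touched against the files the original commit touched.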
Why would we add the AST parsing if it’s unnecessary? Actually, it’s awesome if it is unnecessary. I don’t want to add a lot of complex stuff to our code. In fact, I want the simplest code that ends up having the most impact.
So, the eval is really critical for us to make a lot of these investments at the company. How much of the development that you do is driven by improving the scores on the eval versus vibes-based? You guys are all using windsurf yourself. You’re getting feedback from users all the time, and then you have just a sense that this thing is going to work better.
Then the evals are just a sort of check that you didn’t screw up something else. It’s a little bit of both, but for some kinds of systems, evals are more important than vibes. For the system that takes a large chunk of code, chunks it up, passes it to hundreds of GPUs in parallel, and gives you a result in one second, it’s very hard to have an intuition about whether a change is actually better, because that’s a very complex retrieval question.
On the other hand, there are much easier things that from a vibe perspective are valuable. What if we looked at the open files in a codebase? This is actually harder to eval because when you’re evaluating you don’t know what the user is doing in real time.
This is one of those cases where having a product and market helps us a lot, and we’re able to take a lot of user data on how people use the product to actively make it much better. So, that maybe starts with vibes, and then after that you can build evals.
It’s a little bit of both basically. I think there’s been a lot of chatter on the internet that vibe coding is only for toy apps, but Windsurf is actually being used for real, large production codebases. Can you tell us about how the power users use it for hardcore engineering?
This is an interesting thing where a lot of us at the company didn’t get tremendous value from ChatGPT in the way that probably a lot of the rest of the world did. And that’s not because ChatGPT is not a useful product. I think ChatGPT is an incredibly useful product. It’s actually because a lot of them had already used things like Stack Overflow at the time.
Stack Overflow is a worse version of ChatGPT for the kinds of questions you want to ask, but that was just a thing that they already knew how to use.
Basically what happened is very recently with agents, the agent is making larger and larger scale changes with time. I think what developers now at our company do is they have felt the hills and valleys of this product, which is to say if you don’t provide enough intent, it actually goes out and changes way more of the code than you actually need.
This is a real problem with the tool right now. But they understand the hills and valleys, and now the very first time they have a task, they put it into Windsurf. Their first instinct is not to go type in the editor; it’s to state the intent and make those changes.
They’re doing very interesting things now, like deploying our software to our servers, which now gets done with workflows that are entirely built inside Windsurf. A lot of boilerplate and repetitive tasks have been completely eliminated inside our company.
But the reason this is possible is because we’re able to operate over a codebase that has many millions of lines of code really, really effectively. If you were to give some tips to the audience, how should a user of Windsurf provide this intent so that the changes are more surgical?
Because what you’re saying with the agents creating all these broad changes, I’ve seen that happen. But how do you get those precise changes? What do you do? How do you feed the system? Do you end up shouting at it in all caps?
No, I think this is one of those things where you need to have a little bit of faith in the system and let it mess up a little bit. Which is kind of scary because I think a lot of people, for the most part, will write off these tools really quickly. Obviously, no one at our company would write off the tool because they’re building the tools themselves.
I think people’s expectations are very high, and maybe that’s the main piece of feedback I’d give, which is that our product, for these larger and larger changes, it might make 90% of the changes correctly, but if 10% is wrong, people will just write off the entire tool.
I think at that point, probably the right thing to do is either revert the change, which we have the ability to do, or just keep going and see where it ultimately can go. Maybe the most important aspect is to commit your code as frequently as possible.
I think that maybe that’s the big tip there, which is that you don’t want to get in a situation where you’ve made 20 changes and on top of that made some changes yourselves and you can’t revert it. Then you get very frustrated at the end of it.
One thing I’ve wondered, in that vein, is whether we need to change the way git works with this AI coding paradigm. Have you thought at all about whether doing git commit all the time is the right move or whether there needs to be a deeper infrastructure change?
Yeah, I think we have. One of the things that we always think of is in the future you’re going to have many agents running in parallel on your codebase. That has some trade-offs, right? If you have two agents that modify the same piece of code at the same time, it’s hard to actually know what’s going on.
Another thing is that it’s hard to have multiple branches checked out at the same time with different agents working on them independently. All the merge conflicts. Oh god. Yeah, there’s a lot of that. But hey, that’s how real software development works too.
When you have a lot of engineers that operate on a codebase, they’re all kind of mucking around with the codebase at the same time. So that’s not a very unique thing. I think git is a great tool. I think it’s maybe a question of how can you skin git to work in a way that works for this product surface.
An example is git has these things called worktrees, where you can have many checkouts of the repository, each in its own directory. Perhaps you can have many of these agents working on different worktrees. Or instead of exposing the branch concept to the user, you can maintain a branch yourself that you repeatedly apply to the user’s main branch.
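For the worktree idea specifically, the mechanics are plain git; here is a minimal sketch of giving each agent its own isolated checkout on its own branch (the directory layout and branch naming are made up for illustration).

```python
import subprocess
from pathlib import Path

def create_agent_worktree(repo: Path, agent_name: str, base_branch: str = "main") -> Path:
    """Give one agent an isolated checkout on its own branch via `git worktree`."""
    worktree_dir = repo.parent / f"{repo.name}-{agent_name}"
    branch = f"agent/{agent_name}"
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, str(worktree_dir), base_branch],
        cwd=repo, check=True,
    )
    return worktree_dir

# Each agent edits only its own directory; merging back into `main` happens
# explicitly, so two agents never write to the same working copy at once.
# wt = create_agent_worktree(Path("/tmp/project"), "refactor-bot")
```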
One of the things that we think about at the company, in terms of why we think our agent is really good, is we try to have a unified timeline on everything that happened. The unified timeline is not just what the developer did, but what the developer did in addition to what the agent did, and what happened in the terminal; all of those things are captured, and the intent is tracked in such a way that when you use the AI, the AI knows what has happened in that situation. So, in some ways, we’d like the agent not to be operating on a completely different timeline, but on something that gets merged in at a fairly high cadence.
So, I think this is an open problem. I don’t think we have the exact right answer for this. What are other things that you envision changing about Windsurf in the future? How is it going to evolve? There are probably a lot of people that think vibe coding is kind of a fad, but I think it’s going to get more and more capable with time.
Whenever I hear someone saying, “Hey, this is not going to work for this complex use case,” it feels like a Luddite saying something, right? If you look at the way these AIs have gotten better year over year, it’s actually astonishing.
Like, I’ll give you an example of something that I held kind of near and dear to my heart, which is this math olympiad qualifier called the AIME that I used to do in high school. I was very excited about how well I would do. My high score was somewhere close to 14. And that’s a very high score. Yeah. But the crazy thing is that was one of those things where I thought, oh wow, the AI systems are not going to get anywhere near as good.
At the beginning of last year, scores were probably well under five, and now the average that OpenAI is getting with o4-mini is something like 14.5 to 15, so you almost have to keep projecting this out. It’s going to get crazy in basically every part of the software development life cycle, whether it be writing code, reviewing code, testing code, debugging code, designing code. AI is going to be adding 10 times the amount of leverage very shortly.
It’s going to happen much more quickly than people imagine. Going back to your current engineering team, I’m curious if they have all this time freed up from not having to deal with version upgrades and boring boilerplate stuff, like what do they spend the extra time on? One of the things about our company and probably every startup that is building in the space is the ceiling of where the technology can go is so high. It’s so high.
So it’s actually that if developers can spend less time doing boilerplate, they can spend more time testing hypotheses for things that they’re not sure work. In some ways, engineering becomes a lot more of a research kind of culture, right, where you’re testing hypotheses really quickly.
That has some high cycle time attached to it, right? You need to go out and implement things, you need to build the evaluation, you need to test it out with our users. But those are the things that actually make the product way better. Does that mean you’re going to hire a different type of engineer going forward, like you’re looking for different things?
Yeah, I think for engineers that we hire, we want to look for people with really high agency that are willing to be wrong and bold. But, weirdly, I don’t know if that has changed for a startup, right? Startups should never be hiring people where the reason why they’re joining a company is to very quickly write boilerplate code, right? Because in some sense, and I don’t want this to be the goal, but a startup can succeed even if they have extremely kind of ugly code, right?
That’s not usually the reason why a startup fails. Sounds like my startup. Yeah. Exactly. That’s not usually the reason why startups fail. The reason why a startup fails is they didn’t build a product that was differentially good for their users. That’s why they ultimately failed.
This is all true, but also, in reality, you always need some sort of workhorses to just kind of get certain things done. I feel like in the old days this was like building Android apps. You hired someone to do it because there were very few people who would just be willing to do it.
Yeah. Like maybe in your vision for engineering, you don’t need those skills because the AI is just like your infinite workhorse. Is that fair? Yeah. Maybe the sort of aspects of software that are really niche that are undesirable for a lot of people to do except for a handful of people, those things kind of get democratized a lot more unless that has a lot of depth attached to it, right?
At least for the time being. If something is like, hey, we need to change a system to use a new version, and there used to be someone who always got deep in the weeds with version changes, I don’t think you’ll have people who are just focused on that inside companies. How about how you interview people?
Yeah, I think we have a fairly rigorous and high technical bar, and that is a combination of we give interviews that actually allow people to use the AI to kind of solve a problem because we want to validate if people kind of hate these tools or not. There are still some developers that do, and obviously, if you do, we’re probably the wrong company to kind of work at. But also at the same time, we do have in-person interviews where we don’t give them the AI, and we want to see them think.
It would be a bad thing if ultimately, when someone needs to write a nested for loop, they need to go to ChatGPT, right? That’s fundamentally because it feels like a good proxy for problem-solving skills, and I think problem-solving skills, at a high level, should still go at a premium. That is the valuable skill that humans have.
Yeah. A challenge that a lot of companies we’ve talked to have had, that we’ve even had ourselves, is that Windsurf has gotten so good that if you give people Windsurf, it’s difficult to even come up with an interview question that Windsurf can’t just one-shot, where anyone can do it because you literally just copy and paste the question into Windsurf and hit enter. So, you’re not really evaluating anything at that point.
So I actually think that’s true, and you’re totally right. There are very few problems now that something like o4-mini is not able to solve. If you look at competitive programming, it’s just in a league of its own already at this point. The crazy thing is interviews by nature are going to be isolated problems, right? Because if the problem actually required that much understanding to do, you wouldn’t be able to explain the problem.
So, that’s like perfect for the LLMs, where you give them an isolated problem where you can test and run code extremely quickly. So, yeah, you’re totally right. I think if you only have algorithmic interviews and you let people use the AI, you’re not really testing anything at that point.
Does that mean that you’ve gone away from just algorithmic questions and you ask different, much harder questions that are actually well-suited to being able to use an AI? Yeah, we have questions that are both system designy plus algorithms related. But these are questions that are fairly open-ended, right?
There may not be a correct answer. There are trade-offs that you can ultimately make, and I think what we want to do is just see how people think given different trade-offs and different constraints, and we’re trying to validate for intellectual curiosity. If someone ultimately says, “I don’t know why,” that’s totally fine as long as they’ve gone to a depth that we feel shows their interest and good problem-solving skills.
If that makes sense, you can tell when someone is curious and wants to learn things; it’s very obvious. The next question, which might be counterintuitive: you’re at the forefront of building all these AI coding tools, yet it hasn’t affected your hiring plans at all. On the contrary, you actually need way more engineers to execute. Tell us more about that.
So, I think that just boils down to the problem having a very high ceiling, right? There’s so many more things that we really want to do. The mission of the company is to reduce the time it takes to build technology and apps by 99%. It’s going to take a lot of work to go out and do that. Now granted, each person in our company is way more productive than they were a year ago.
But I think for us to go out and accomplish that, it’s a herculean task. We need to start targeting more of the development experience. Right now, we’ve helped a lot with the code writing process and maybe the navigation of code, but we have not touched much of the design process or the deployment process, and the debugging process is fairly rudimentary right now. There are just so many different pieces.
If I were to look at it, if you say you have 100 units of time, we’ve cut off maybe 40 or 50 of those units, but there’s just a lot more that we need to cut out. Basically, at this point, it does feel like when I’m using Windsurf, I am often the extremely slow bridge between different pieces of technology, copying and pasting data back and forth.
That’s probably actually still a large chunk of your time. All the pieces have gotten so fast that now it’s like the glue between them, but I’m the glue and I’m much slower. Can I go off the reservation and ask a weird question? Go for it.
Okay. I think Pete on our team just released a great essay about prompting and why you should let users have access to system prompts. The other thing that he came up with, that we’ve been using at YC internally, is a new agent infrastructure that has direct read access to our system of record, our Postgres database.
In the process of using this, we’re starting to realize that if codegen gets a lot better — and based on this conversation, I think we can count on that getting like 10x or 100x better from here — what if instead of building package software, there’s just in-time software that the agent basically just builds for you as you need it? Does that change the nature of software and SaaS? And you know what happens to all of us in Windsurf?
I don’t know. I think this notion of just a developer is probably going to broaden out to what’s called a builder, and I think everyone is going to be a builder and they can decide how deep they want to go and build things. Maybe our current version of developers can go deep enough that they can build more complex things, right?
In the shorter term, yeah, I think software is going to be this very democratized thing, right? I imagine a future in which you ask an AI assistant, “Hey, build me something that tracks the amount of calories I have.” Why would you have a very custom app that goes out and does this? It’s probably something that takes all the inputs from your AR glasses and everything and has a custom piece of software that kind of comes out like an app that is there and has tasks that go and tell you, you know, are you on track with all the calories you’re sort of consuming here.
I think that’s a very custom piece of software that you have for yourself that you can keep tweaking. I can imagine a future like that where effectively everyone is building but people don’t know what they’re building as software. They’re kind of just building capabilities and technology that they have for themselves.
Do many people use Windsurf who don’t know how to write code at all? It’s actually a large number of our users. Yeah. Interesting. How did they end up getting into Windsurf? Did they work at some company where a programmer showed them how to use it?
I tend to think of Windsurf as targeting more of the professional developer market that’s using this as a new superpower, versus the non-technical user market that’s doing what Gary was talking about. We were shocked by this too, because we were like, “Hey, our product is an IDE, but there’s actually a non-trivial chunk of our developers that have never opened the editor up.” Our agent is called Cascade, right? And they just live in Cascade.
We have browser previews, so they just open up the browser preview. They can click on things and make changes. The benefit is, because we understand the code, when they come back to the repository and the code has actually gotten quite gnarly, we’re able to pick up from where the developer, or the builder, left off and keep going.
I will say we have not optimized tremendously for that use case. But it’s actually kind of crazy how much is actually happening there. Do you think in the long term that this ends up being one product that targets both of these audiences, or do you think actually there are different products for different audiences?
There’s Windsurf which is focused on serious developers who want to see the code and be in the details, and then there may be other products for folks who are totally non-technical who don’t even want to see the code. I don’t know what the long term is going to look like. Something tells me it’s going to become more unified.
But one of the things that I will say is, as a startup, even though we do have a good number of people, there’s a limit to what we can focus on internally. So we can’t simultaneously focus on building the best possible experience for the developer and building an experience with so many things for the non-developer.
But I have to imagine that this idea of building technology, if you get better at understanding code, you’re going to be able to deliver a great experience for non-developers as well. But I don’t know what the path dependence is. I assume a bunch of companies in the space will go from non-developers to then supporting an ability to edit the code.
I think we’re starting to see this already, where the lines are getting blurred right now. You probably care about it for your evaluations at least. Yeah. No, you need to care about it for your evaluations. Maybe that’s the hard part for me to imagine for the pure non-developer product.
What is the hill you’re climbing if you’re not kind of understanding the code? How do you know your product is getting better and better? That’s an open question. Are you completely relying on the base models getting better, which is fine, but then you should imagine then your product is an extremely light layer on top of the base model, which is a scary place to be, right?
That means you’re going to get competed with across all different axes. How do you think about that in general? I guess something we’ve talked about a lot on this podcast is that the GPT wrapper meme has completely gone away.
I feel though every big release from one of the labs sort of brings it back a little bit, and everyone’s a little bit scared that, you know, OpenAI is just going to eat everything. How do you think about that? I think the way we think about this is like, yeah, as I mentioned before, it’s a moving goalpost.
Which is to say, today, if we’re generating 80 to 90% of all committed code, yeah, I think when the new model comes out, we’re going to need to up our game. We can’t just be at the same stage. Maybe we need to be generating 95% of all committed code, and I think our opportunity is the gap between where the foundation model is and what 100% is.
As long as we can continue to deliver an experience where there is a gap between the two, which I think there is, as long as there’s any human in the loop at all in the experience, there’s a gap we’ll be able to go out and build things. But that is a constantly transforming sort of goalpost for us, right?
So you can imagine when a new model comes out, maybe the baseline on what the foundation model by itself provides has doubled. The alpha we provide on top of what the base model provides needs to double as well. For me, the reason why this is not the most concerning is let’s suppose that you were to take the foundation model, and it’s providing 90%. It’s reducing the time it takes by 90%.
That actually means if we can deliver a couple of percentage points more, if 90 becomes 92 or 93, that’s a 20 or 30% gain on top of what the new baseline is, because effectively the 90 becomes the new baseline for everyone. That’s still very valuable.
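The arithmetic behind that point, as a quick sanity check: if the foundation model alone removes 90% of the time and the product gets you to 92%, the remaining human work drops from 10 units to 8, a further 20% savings on what is left.

```python
def remaining_time(total_hours: float, reduction_pct: float) -> float:
    """Hours of human work left after an AI-driven time reduction."""
    return total_hours * (1 - reduction_pct / 100)

baseline = remaining_time(100, 90)        # 10.0 hours left with the foundation model alone
with_product = remaining_time(100, 92)    # ~8.0 hours left with two extra points
print((baseline - with_product) / baseline)  # ~0.2 -> a 20% cut in the remaining work
```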
So I think basically the way we sort of operate is how can we provide as much additional value as possible? As long as we have our eye on that, I think we’re going to do fine. What advice would you have for our startups that are working in the AI coding space? You have a ton of them. What are the opportunities that you think are going to be open to new startups?
I’ve seen a lot of things that I think could be particularly interesting. I don’t think any of these technologies we’ve really adopted, but there are so many different pieces of how people build software. I’m not going to say niche, but there are so many different types of workloads out there. I’ve not really seen a lot of startups in the space that are just like we do this one thing really, really well.
I’ll give you an example: “We do these kinds of Java migrations really well.” Crazy enough, if you look at this category, the amount people spend on these migrations is probably billions, if not tens of billions, of dollars every year. It’s a massive category.
That’s an example. Migrations from what to what? For example, Java 7 to 8 or something, or Rails versions, or even more than that. Actually, a lot of companies write COBOL, and crazy enough, most of the IRS software is written in COBOL.
Apparently, in the early 2000s, they tried to migrate from COBOL to Java. I think it was a five plus billion-dollar project. Surprise, surprise, it didn’t happen. You think they could one-shot it now? I don’t know if they can one-shot it, but I’m just kidding. Imagine if you could do those tasks very well. It’s such an economically valuable sort of task.
I think we obviously don’t have the ability to focus on these kinds of things inside the company. That’s a very exciting space if you could do a really good job there. The second key piece is there are so many things that developers do that are also not making the product better but are important, like the automatic resolution of alerts and bugs in software.
That’s also a huge amount of spend out there, and I’d be curious to see what a best-in-class product in that category actually looks like. I’m sure if someone got truly in the weeds on that, they could build an awesome product. But I’ve not heard of one that has tremendously taken off.
I think those are actually both really great insights. And one thing I like about them is that it’s not just an opportunity for like two startups. Each one of those is like a bucket that could have like a hundred large companies in it. We actually do have a company from S21 called Bloop that does these COBOL to Java migrations with agents.
That’s awesome. It’s a gnarly problem. It’s a very gnarly problem, but if you were to talk to any company that has existed for over 30 years, this is probably something that is costing them hundreds of millions a year.
So, reflecting on this journey, I mean, we’re all really thankful for you creating Windsurf. It’s supercharging all of society right now. What would you say to the person who, you know, basically the you from five years ago before you started this whole thing?
The biggest thing I would say is change your mind much, much faster than you believe is reasonable. It’s very easy to kind of fall in love with your ideas over and over again, and you do need to; otherwise, you won’t really do anything. But pivot as quickly as possible and treat pivots as a badge of honor. Most people don’t have the courage to change their mind on things, and they would rather kind of fail doing the thing that they told everyone they were doing than change their mind, take a bold step, and succeed.
Varun, thank you so much for joining us today. We’ll catch you guys next time.