Josherich's Blog


Paper walkthrough: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

02 Mar 2025


Hello and welcome back to Data Science Castnet. In this video, I'm going to be doing a paper reading / walkthrough of rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking.

If you're new to this idea of test-time compute, tree search, and process reward models versus outcome reward models versus policy models, this is a nice way to dig into that and also to explore this new technique. I'm going to go through the paper, but I'm not going to read it linearly, section by section. Instead, we're going to talk about the problem it's trying to solve and the steps that they present, and then we'll go back and look at some of the prior work, the past ideas, and the problems with those that this is trying to address.

I haven't rehearsed this much, so it's going to be very freeform. Let's dive in. This paper presents rStar-Math, demonstrating that small language models—all of their models are 7 billion parameters or below—can rival or even surpass the math reasoning capabilities of OpenAI's o1, without distillation.

They achieve this by exercising deep thinking through Monte Carlo tree search, where a math policy small language model performs test-time search guided by a process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: first, a code-augmented data synthesis method, which produces step-by-step verified reasoning trajectories.

There are a lot of words here, so don't worry about that; we're going to dig into what all of these mean. Second, a process reward model training method that avoids naive step-level score annotation—we'll talk about the issues with process reward models in the past and how they get around them. And third, a self-evolution recipe in which you build from scratch and iteratively improve the different components of the system, until you end up with these fairly small models jumping to 90% on the fairly challenging MATH benchmark, jumping from around 40% to over 80% with a different model, and surpassing o1-preview—doing shockingly well for such small models.

So we're going to see how they do that, and we're going to talk about some caveats that I think are worth being aware of in this paper. There are a few things that differ from some of the other approaches being compared, so it's worth keeping those in mind. However, I don't think they're deal breakers, and this is still a very impressive paper.

Okay, so that's that. Like I said, we're not going to go directly through the paper. We'll come back to the overview figure, but only after we've looked at some of the pieces first. I highlighted this sentence: in the test-time compute paradigm, the key is to train a powerful policy model that generates promising solution steps and a reliable reward model that accurately evaluates them, both of which depend on high-quality training data.

This sentence is the crux of this whole field. Instead of just asking the model a question and having it do some steps and give you an answer, we want to be able to expend more test-time compute. Some models like o1 are able to linearly produce a bunch of thinking steps, including, within that process, jumping back to earlier ideas and trying out different things.

This is going to be different: we want to search multiple possible solution trajectories. So we start with the question, and we try a few different options for what our first step might be. Based on each of those, we might try a few different options for what the next step might be, and so on. The hope is that we can go step by step, but in this branching, exploratory tree-search fashion, and eventually get to the correct answer. And instead of just brute forcing and hoping that we stumble on good-looking answers, we want to be able to steer the process as we go.

This process reward model is going to tell us, "This looks like a promising step to have taken," and we want our step predictions to be good, so that at the end of the day we can explore the promising branches in this tree—the promising solution trajectories—and hopefully find the best of the best: actually correct solutions.
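To make that concrete, here's a minimal sketch of what reward-guided, step-by-step search could look like. This is not the paper's actual MCTS; `propose_steps`, `score_step`, and `is_final` are hypothetical placeholders standing in for the policy SLM, the process reward model, and an end-of-solution check.

```python
# Minimal sketch of reward-guided step search (a beam-style simplification,
# not the paper's MCTS). All callables passed in are hypothetical stand-ins.

def guided_search(question, propose_steps, score_step, is_final,
                  beam_width=4, max_depth=8):
    """Expand partial solutions step by step, keeping the highest-scoring ones."""
    beams = [([], 0.0)]  # (steps so far, cumulative step score)
    for _ in range(max_depth):
        candidates = []
        for steps, total in beams:
            if steps and is_final(steps[-1]):            # already reached an answer
                candidates.append((steps, total))
                continue
            for step in propose_steps(question, steps):  # policy proposes next steps
                s = score_step(question, steps, step)    # reward model scores the step
                candidates.append((steps + [step], total + s))
        # keep only the most promising partial trajectories
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # best-scoring trajectory found
```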

Okay, so that’s the goal: to have some sort of system like this. The problem is training data. We don’t have lots of existing data that captures this kind of branching searching exploring space. That’s going to be like the first big challenge: how do we even start? Where do we get some data?

There have been previous works that have done something like this. I'll go now to a fairly famous paper from some researchers at OpenAI, Let's Verify Step by Step. This is the paper that popularized the idea of so-called process supervision: a process reward model that can give you a score for every intermediate reasoning step, which you can then use to do search and to perform well.

In this paper, what they do is manually collect feedback for each step in a solution, via human annotation. So we have a question, we have step-by-step little pieces of reasoning, and then a human has to come in and carefully validate the different steps, labeling which steps are fine and which steps are wrong. Human data labelers did a lot of work to produce a lot of training data for this, but with that, they were able to do fairly well. Specifically, they also compared outcome reward models, where the only signal is whether the answer at the end of the chain is correct or not, versus process reward models that look at each step.

Here you can see the problem: you've got a number of steps, some of them are more promising than others, and a few of them are just plain bad. They're going to throw you off the path, and you might end up with an incorrect answer. The process reward model needs to learn to spot this, but to train it you need very detailed, fine-grained data. They show in this paper that the process-supervised reward model, used to guide your search, gets you a much higher percentage of problems solved and much better performance than the outcome-supervised reward model.
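To illustrate the difference in supervision, here's a tiny made-up example of how the training targets differ; the solution text and labels are invented for illustration only.

```python
# Sketch: training targets for an outcome reward model (ORM) vs a process
# reward model (PRM). The example solution and its labels are made up.

solution_steps = [
    "Let x be the number of apples.",   # reasonable setup
    "Then 2x + 3 = 11, so x = 4.",      # correct algebra
    "Therefore the answer is 5.",       # slip at the final step
]

# ORM: one label for the whole trajectory, based only on the final answer.
orm_example = {"steps": solution_steps, "label": 0}          # wrong answer -> 0

# PRM: a label for every intermediate step, which is what the costly human
# annotation in Let's Verify Step by Step provided.
prm_example = {"steps": solution_steps, "labels": [1, 1, 0]}
```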

The holy grail, then, is: how do we train such a model for different types of data and different problems without this super costly annotation process? That's what this paper is going to tackle.

Let's start from the beginning. We want to train this system; we want to train rStar-Math. What do we do? Step one is to get some data. As I said, we don't have tons of math data with step-by-step solutions, with those steps graded, including some incorrect solutions. What we do have is lots of math questions with just the answer.

Sorry, I'm going to come back to some of these pieces. Here we start out with some math datasets and then boost the number of questions using GPT-4. How do we collect our math questions? They gather a bunch of word problems, including ones from the training splits of some of the evaluations that we're going to be using, and then use GPT-4 to synthesize new problems based on these seed problems.

Remember, the claim was that we're using small language models to match the big ones without any distillation from larger models. This isn't exactly distillation, but it is using a larger model for synthetic data generation. That's just a little flag worth keeping in mind: we are starting with a bunch of math problems, but especially in the harder categories like MATH and the Olympiad-level sets, we're generating some extra problems using a large model.

We filter those generated questions to keep only the ones where the big model produces somewhat consistent solutions. We use this big model, but it might come up with questions that are unsolvable or incorrect, so we generate a bunch of candidate extra questions and then filter them.
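Here's a rough sketch of that generate-then-filter idea. `ask_llm` is a hypothetical helper standing in for a GPT-4 call; the paper's actual prompts and filtering rules differ.

```python
# Sketch of "synthesize new questions with a big model, then keep only the
# ones it answers consistently". `ask_llm(prompt) -> str` is a hypothetical stub.

def synthesize_questions(seed_problems, ask_llm, n_answers=4):
    kept = []
    for seed in seed_problems:
        new_q = ask_llm(f"Write a new math word problem in the style of: {seed}")
        # Have the big model solve its own question several times; keep the
        # question only if the final answers agree (a cheap consistency filter).
        answers = [ask_llm(f"Solve this and give only the final answer: {new_q}")
                   for _ in range(n_answers)]
        if len(set(answers)) == 1:
            kept.append({"question": new_q, "answer": answers[0]})
    return kept
```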

Now we have a bunch of questions with answers, but that still doesn't give us the step-by-step pieces that we need. What they do initially is say, "Let's also bootstrap the fine-tuning data that we need to train a model to do one step at a time." They do that with DeepSeek-Coder-V2-Instruct.

This is a 236 billion parameter model, so the claim of using only small models of 7 billion parameters or less is sort of true: the results are really impressive, and the final models are these small models, but you're beginning your whole process with a 200+ billion parameter model. Which totally makes sense—they're using it for synthetic data to get the ball rolling.

Just wanted to point that out as a reference. We're starting with really high-quality questions generated with GPT-4, and we're starting our training on demonstrations generated by DeepSeek-Coder-V2-Instruct.

Those are the highest quality data we can get, but importantly, this data still isn't everything that we need for our system. Here's the actual process: we have some math questions, and we use DeepSeek-Coder-V2-Instruct to collect some step-by-step solutions. They say they are running Monte Carlo tree search, but they don't have any process reward model yet, so they're just rolling out: generate a few different solution steps, and then from each of those steps, generate a few possible next steps. They do, I think, eight rollouts for each question. We know the correct answers, so even if we don't know whether the individual steps are correct, we can find which of those trajectories lead to correct final solutions, pick the top two, and use those as our starting point.

So when we eventually train a model—the round-one policy SLM—we want it to predict the next step. The data we use to train it comes from picking the best-looking trajectories from these rollouts, and the Q values here are a measure of how good a trajectory is.

A Q value is just based on how many trajectories starting from a given point end up at correct solutions. You can imagine I had two possible first steps, and one of them always led to correct solutions while the other always led to incorrect ones: the trajectories that start with the good step are going to have much higher Q values, and so on down the chain. So when I pick the trajectories that ended at correct solutions, and whose intermediate steps also tended to lead to other correct solutions, those are going to be my good candidates, and those are what I'll start my fine-tuning process on.
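Here's a small, runnable sketch of that Q-value intuition: a step's value is estimated from the fraction of rollouts passing through it that end at a correct answer. The paper's MCTS back-propagation is more involved; this is just the basic idea.

```python
# Estimate per-step Q values from rollouts: value = correct rollouts / visits.
from collections import defaultdict

def estimate_q_values(rollouts):
    """rollouts: list of (steps, is_correct) pairs from repeated sampling."""
    visits, wins = defaultdict(int), defaultdict(int)
    for steps, is_correct in rollouts:
        for depth in range(len(steps)):
            prefix = tuple(steps[:depth + 1])   # identify a node by its step prefix
            visits[prefix] += 1
            wins[prefix] += int(is_correct)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}

# Toy example: first step "A" tends to lead to correct answers, "B" does not.
rollouts = [(["A", "A then simplify"], True),
            (["A", "A then substitute"], True),
            (["B", "B then guess"], False)]
print(estimate_q_values(rollouts))   # ('A',) -> 1.0, ('B',) -> 0.0, etc.
```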

Okay, so at this stage we don't have any model that gives a score for each step, but we do now have a model that can predict what is hopefully a decent next step in a solution. At the same time, they try to train a model that can score each step, but it's not very good yet. The reason it's not good is the same reason a lot of these attempts to generate this kind of annotation struggle: you're only doing eight rollouts, and your initial model might not be very good—even though it's a really big model, it might make lots of mistakes. So you're trying to assign these scores, these Q values, but they're based on very limited amounts of data; eight correct-or-not outcomes is maybe not enough to tell you whether an initial step was promising. So you train a reward model, but it's not great. We'll talk more about how exactly they train these reward models, but this is just the overview flow.

Okay, so at the end of round one I have a small model that can predict the next step, and that's about it. Now, in round two, I take this 7 billion parameter small model—my policy model—and do more rollouts. Now I'm using the actual model that I'm going to keep using; we're out of the leaning-on-the-bigger-model regime. I'm using my small model and doing lots and lots of different trajectories to try to solve each problem—16 rollouts, and later on they bump that number up even more. Because I started this out on good trajectories, it's already a pretty decent model, so our step-by-step reasoning trajectories are decent, and I'm also starting to get better estimates of the Q values: if a trajectory leads to lots of correct final solutions it's going to get a decent score, and if it leads to bad ones it's going to get a bad score—and the bad score is probably because it really was a bad step, given that I now have an okay model to start with.

So I'm collecting these trajectories, and I can keep track of the absolute best ones, just like we did in the first round, and train my policy model on those. But I can also start to pull out pairs of trajectories, where I take one with a very high score and one with a low score, and use those to start training my process preference model (we'll look at that model in more detail shortly). At the end of this round, I'm starting to be able to predict a score for each step directly—without rolling out all the way to the end multiple times and counting how many are correct.

This is very useful, because we can then do another round. Now, instead of just doing random rollouts, where I'm picking steps and then picking sub-steps at random, I have a model that can predict which steps look promising. So I can skew my tree search using this process preference model—the process reward model—to search more promising branches more thoroughly. That gets me more high-quality trajectories, which means more really high-quality fine-tuning data for the next version of my policy model and the next version of my process preference model.

So it's this very virtuous cycle: once you have an okay reward model, you can use it to improve your search, which gets you higher quality trajectories, which improve your policy model; and those higher quality trajectories also give you better Q values, which you can use to improve your reward model. You can see now why they call this self-evolution—this cyclic improvement.

Now we're in gravy, because we can keep doing this. In round four they actually say: look, our models are getting good enough that we should increase the proportion of hard problems. In the early rounds, with these Olympiad-level problems, pretty much all the trajectories are going to lead to incorrect results, so they're not very useful. But now we've got a good enough policy model and a good enough process preference model that we can start to include more of these tricky ones, and for the ones we don't solve after 16 rollouts, we just do 64 rollouts, or 128 rollouts. We can search more and more, using our process preference model to steer toward the promising-looking trajectories, and hopefully eventually find some solutions to these hard problems. Then we can train on those and boost our success on these extremely tricky problems.
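Putting the rounds together, here's a high-level sketch of the self-evolution loop as I've described it. Every function here is a placeholder for a whole pipeline stage, not the paper's actual code, and the rollout counts are just the ones mentioned above.

```python
# High-level sketch of the four-round self-evolution recipe. All callables are
# hypothetical stand-ins for entire pipeline stages (MCTS data generation,
# trajectory selection, preference-pair construction, fine-tuning).

def self_evolve(questions, policy, run_mcts, select_best, build_pairs,
                finetune_policy, train_ppm, rounds=4):
    ppm = None                                   # no step scorer exists in round one
    for r in range(1, rounds + 1):
        # (in the paper, round one's rollouts actually come from the big bootstrap model)
        n_rollouts = 16 if r < 4 else 64         # later rounds search hard problems harder
        trajectories = run_mcts(questions, policy, ppm, n_rollouts)
        policy = finetune_policy(policy, select_best(trajectories))   # better next-step model
        ppm = train_ppm(build_pairs(trajectories))                    # better step scorer
    return policy, ppm
```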

So this self-evolution step gradually improves your policy model, which predicts the next step, and your preference model, which judges which step looks more promising; those two build on each other and you just get this improvement over time. You can see it if you look at the scores for round one versus round two versus round three versus round four, and likewise for just using the policy model on its own: there's an initial 0% accuracy on the hardest competition-level problems, then roughly 3%, 10%, 16%, 26%—this continual improvement, because at each stage we're finding higher quality trajectories and training on those. It's a very impressive way to bootstrap up, and the final results are astonishing. They have these models doing incredibly well—competitive with frontier LLMs—but they're tiny models of a few billion parameters, plus a 7 billion parameter process preference model, able to search maybe 64 different trajectories, pick the ones that look most promising, and then take the best-looking answer. And the accuracies on these benchmarks are striking: some of them are grade-school math, very easy questions, but some of the higher-level math problems are really challenging even for human mathematicians, at least those who haven't trained for decades.

So, very impressive results. We should talk about some other nuances of exactly how they're doing this, and specifically a few extra tricks that they use. You'll remember that in the intro they had this phrase "step-by-step verified reasoning trajectories." What is that, and is it just the same as something like Let's Verify Step by Step, where the steps are text-based and the model reasons purely in tokens? The answer is no: they have code-augmented chain-of-thought (CoT) data synthesis, and what that means is that the way they phrase their answers is to use Python. When they say "apply verifiers," we can use our reward model, but we can also use code. When they're generating these steps, they generate a piece of explanatory text and they generate Python code, and one way you can tell a node is particularly bad is if that Python code doesn't execute—then you can throw it away and say, okay, that step is a dead end.

If we look at some examples of the data, our step-by-step reasoning looks like: step one, calculate the distance walked south, total_south = 0.5 + 0.5. If this were instead an expression that didn't evaluate, you'd know there's no point going down that solution track, so we can execute that code and check that it runs. Step two is another line of Python code as well: we execute step one and then step two and check that it all runs. This is the code-execution-augmented chain of thought—having a little bit of Python code that runs at each step, and using whether or not it runs as part of the verification. That's one way they improve the data, both initially when they're filtering things out, and at search time when they're running their solutions: if they ever get a node in the tree—or a step in the process, however you like to think about it—that doesn't execute, they don't follow on from it; they stop there and search other trajectories.
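As a concrete illustration, here's a minimal sketch of that "does the step's code run?" check. The variable names are from the walking example above; the paper's actual execution setup will differ.

```python
# Minimal sketch: execute the trajectory's code snippets cumulatively in one
# namespace, and treat any step whose code raises as a dead-end node.

def step_code_runs(previous_snippets, new_snippet):
    namespace = {}
    try:
        for snippet in previous_snippets + [new_snippet]:
            exec(snippet, namespace)   # re-run the trajectory so far, then the new step
        return True
    except Exception:
        return False                   # prune this node and search other branches

# Example: the second candidate step references an undefined name and is pruned.
prev = ["total_south = 0.5 + 0.5"]
print(step_code_runs(prev, "total_distance = 2 * total_south"))   # True
print(step_code_runs(prev, "total_distance = 2 * total_norths"))  # False (NameError)
```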

Okay, and so that is phrased just as an additional way to verify what's going on, but it also brings up the fact that this model is able to answer using code. You might be comparing it to all these other models, but none of those are actually able to run Python code, whereas this one is. You'll see in some of the examples they give that it's using SymPy to set up systems of equations and then solve them with that library, so you get the correct solution and it prints out the answer. You're getting there with the use of code, which is great—this is fantastic—but it's surprisingly subtle that that's what's happening when they say these models are able to do deep thinking and come up with solutions that beat the state-of-the-art models. The state-of-the-art models aren't running code as part of their process; they're reasoning in tokens, whereas these models are reasoning in tokens and code, and they're using the code execution results as part of it.

Okay, so that's one cool extra thing, very neat to think about. If you're trying to imagine extending this idea to different domains, you could imagine something similar where your verifier is something that's easy to check. Does the code that you've just written compile? Does it have any syntax errors? Or maybe you can imagine a robot policy that's doing multiple actions: if any of those actions involve moving through the table or something like that, you could eliminate them, as a sort of filtering and quality-improving mechanism. So you could have all sorts of rules for what that verification might be when you're filtering out these trajectories.

But also, yeah, it's worth noting that this is a code-augmented model that's able to use Python as part of its process. Okay, another thing I promised we would come back to: what is this PPM, this process preference model? Other methods talk about a process reward model. What's the difference here, and why is it important?

Why have they even given it a slightly different name rather than just calling it the PRM? The answer has to do with what this PPM predicts at each stage. So here, I predict two possible first steps, and they both get a fairly reasonable score. From this one, I predict two next steps, and one of them gets a somewhat negative score.

You could imagine this score, and this is what past works have done: let's look at the final outcome—is it correct or not—and then trace back and try to assign Q values to these intermediate points. But like I said, they're only doing a limited number of rollouts, so these values are not going to be ultra precise, and a step might well be perfectly valid even though the reasoning just happened to go off the rails right before the answer.

So that credit assignment is tricky. Past models have often tried to directly predict those Q values. Instead, the approach here is to take the trajectories you've sampled and say: look, these Q values are not precise enough to score the steps, but they can tell you correct steps from incorrect ones.

So I can't necessarily tell you that this step is worth exactly 0.3, but if I get one step with value 0.3 and another with value -0.8, I can say with somewhat higher confidence that the first step is better than the second. So instead of training on those Q values directly, we just pick out preference pairs and use a pairwise ranking loss. We say, "Give me a score," and the loss I enforce is not that the score exactly matches the possibly noisy Q value; instead, the loss just says, "Make sure that the score you give the good sample in the pair is higher than the score for the bad sample."

That's why they call this a preference model rather than just a process reward model. They give it a special name to emphasize that it's trained on this kind of paired data—one good and one bad example—with a pairwise ranking loss, as opposed to trying to predict the Q scores directly.
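For reference, here's what such a pairwise ranking loss looks like in its standard Bradley-Terry-style form. The paper's exact formulation may differ in details; this is just the basic shape of the objective.

```python
# Pairwise ranking loss: penalize the model unless the preferred (good) step's
# score is higher than the rejected (bad) step's score.
import math

def pairwise_ranking_loss(score_good, score_bad):
    # -log(sigmoid(score_good - score_bad)): small when the good step already
    # outranks the bad one, large when the ordering is wrong.
    return -math.log(1.0 / (1.0 + math.exp(-(score_good - score_bad))))

print(pairwise_ranking_loss(0.9, -0.5))  # ~0.22: ranked correctly, small loss
print(pairwise_ranking_loss(-0.2, 0.7))  # ~1.24: ranked wrongly, larger loss
```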

And this gives you a reliable labeling signal. Initially, you start out with "here's one that was good and gave a correct answer, and here's one that was bad and gave an incorrect answer." But as the refinement and evolution process goes on, you're exploring more and more promising-looking trajectories, and so you're able to say, "okay, now I can really tease out that this one has a higher Q value, even though they're quite close together."

That's because this intermediate step led to more successful solutions. So there's still that nuance: we still want to enforce that the really good ones are given a higher score than the slightly worse ones. And it has this nice property that you're constructing the pair data in a way where you can reliably say A is better than B, even if you can't say exactly what scores they should get.

Then you learn with that pairwise ranking loss: you learn to give scores that maintain that ordering, without enforcing exactly what the values should be. So that's a nice technique—a nice way to get around needing thousands of rollouts for every question to get accurate Q scores to train on—and it produces a very useful final result.

We should dig into how important that is. You might think: okay, this PPM is making these big claims, but they generate hundreds of thousands of examples using GPT-4, and they start with hundreds of thousands of trajectories from a roughly 236 billion parameter model, DeepSeek-Coder-V2. Isn't most of the success just from learning from that data? Isn't this just distillation, with the process reward model as maybe a little cherry on top?

They, I think, quite compellingly make the case that no, that isn't the case. If you look at the figure down here, the light green bar is the policy model by itself—the model that's been trained on the high-quality trajectories that we found.

On some datasets it's able to do pretty well by itself, but on others, like AIME, you'll see that predicting one step at a time on its own, it's actually still pretty terrible. It's only when you add the search—using this process preference model to steer the trajectories, which is the other green bar here—that you get this huge boost in performance.

So they show how useful this is, and then they also show some comparisons where a bunch of trajectories are sampled and a big 72 billion parameter outcome reward model picks the best. That gives you some boost, even when applied to a 72 billion parameter policy model, but still not nearly as big a boost as their policy model paired with their process preference model.

So yeah, I think it's a compelling demonstration that the key here is being able to do this effective search: getting a good score at every step and using that to inform which trajectories you sample. Their big thrust is that the preference model is almost as important an outcome of this work as the policy model, and you need both together to get the full performance.

You can see this in another place too: if you only sample one solution, you get impressive accuracy, especially compared to previous attempts at these benchmarks—starting out at around 80% on MATH is highly impressive—but the real benefit comes when you're able to sample multiple solutions.

On some of these, the more solutions you sample, the higher the score—the trend just continues. On others it sort of saturates; we're getting to the limit of the kinds of problems these specific models can solve on those specific tasks. But in general you can see very clearly that as we try more different solutions, we get better and better results.

And importantly, this grows higher and stays higher than if we were just doing best-of-N. If we just sampled, say, 10 trajectories and then picked the best with an outcome reward model, that only gets you so far. Whereas if you're able to pick at each step using this process model, you're able to find much higher quality trajectories and reach a much higher percentage of correct answers at the end of the day.
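For contrast, here's roughly what that best-of-N baseline amounts to: sample complete solutions independently and let an outcome reward model pick one, with no step-level steering. `sample_solution` and `orm_score` are hypothetical placeholders.

```python
# Sketch of the best-of-N baseline: whole solutions are sampled blindly, and an
# outcome reward model only judges them after the fact.

def best_of_n(question, sample_solution, orm_score, n=10):
    candidates = [sample_solution(question) for _ in range(n)]
    return max(candidates, key=lambda sol: orm_score(question, sol))
```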

That continues up to 64 sampled solutions. Okay, so I think those are the main key ideas in the paper. Really cool to see it starting out so badly and getting better and better. What does this mean for other things? They have a little discussion in the appendix: this is a good general methodology for improving LLMs on various domains, as long as you can get a bunch of data that tells you whether the final outcome is correct or not.

So you could imagine something like a coding task, where I have some issue or feature request on a repository. There are going to be some steps, like adding different bits of code, and then some tests that pass if we've added that code correctly. You could imagine doing something similar here: we don't have any model for what good intermediate steps look like, but we could use some sort of verification, like "at the very least, the intermediate steps should have no syntax errors."

Then we could start to bootstrap up. Try a few different attempts at the task; see which ones are correct solutions; use that to start assigning Q scores; pick the best trajectories and train on those; start building a reward model that can look at each step and judge whether it's useful; and bootstrap up to this kind of impressive increase in performance.
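As a tiny example of what a cheap intermediate verifier could look like in that coding domain, here's a syntax-only check; this is my own illustration, not anything from the paper.

```python
# Cheap intermediate verifier for a coding domain: reject candidate steps whose
# code doesn't even parse, before any tests are run.
import ast

def step_parses(code_step: str) -> bool:
    try:
        ast.parse(code_step)
        return True
    except SyntaxError:
        return False

print(step_parses("def add(a, b):\n    return a + b"))  # True
print(step_parses("def add(a, b)\n    return a + b"))   # False: missing colon
```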

Then when someone says, "Hey, can you make a change to this repository?", if you're able to try 64 different times, with each attempt trying something slightly different, and then pick the promising ones, you can imagine getting a boost in performance over something that just tries to do it all in one go.

So it is something that's potentially extendable, but it does require these kinds of verifiable tasks, thinking about what the intermediate verification might look like, and thinking carefully about how you might get the data. Still, it's very interesting early work, and very impressive within its constrained domain. I found it very interesting to look at, and I hope you did too.

Yeah, now we can check out the diagram, and it should all be familiar from what we've just discussed. If I've left anything unclear, do leave a comment down below, and if you've got other papers you'd like me to look at that might be fun, do tell me that as well. Otherwise, I hope you've enjoyed this and we'll see you in the next one. Thank you for watching.