#09 - Search Parallelization: Bottom-up (CMU Optimize!)
Optimize! This course is filmed on occasion at Carnegie Mellon University in front of a live studio audience. We're going back and forth between bottom-up and top-down, looking at the core aspects of the search implementation. So today will be a discussion of how to parallelize the search for a bottom-up optimizer, and then next week we'll look at how to apply it to the top-down approach. We already started seeing inklings of that in the last class when we talked about the pruning methods, whether accumulated or predicted cost: the accumulated one required you to go down and compute something first, and that blocked other threads from exploring other parts of the tree.
The bottom-up approach isn't going to have that problem. We'll address that issue again next Monday for top-down. All right, so as I said on Piazza, I have no office hours today. Send me an email if you want to meet, and I'll follow up with the people who have already emailed me, probably for tomorrow. I may be on campus tomorrow depending on what the flu test says. Project one, again, is due Friday, February 28th. For project two topics, next class we'll start going over some potential directions you may want to pursue. Again, you don't have to decide right away, not until after spring break, what you want your project two to be based on.
But at least you can start talking amongst yourselves to figure out what groups you want to form, and then we'll post on the spreadsheet what groups you want to be in. If you're not in a group, we'll have a side buffer where you can put yourself in as a free agent, and then we'll figure out how to assign you to a group. Again, project two is supposed to be open-ended. It should ideally touch on the things we talk about in this class, but if there are other areas of database optimization that you're curious about, or that you're working on in your own research or for some other class, that's fine too.
I'm all for combining ideas, right? So don't think, "Oh, whatever I talk about next week, those are the only topics you can pick." If you have a good idea, then I'm all ears. Okay, all right. Any questions about project one? Okay, all right. So, last class, as I said, we started looking at how to do top-down join enumeration. One of the techniques we saw was this idea of partitioning the query graph into subgraphs, or smaller connected graphs.
That was done in such a way not necessarily to parallelize the search, but at least to be more targeted in how we're doing the search going down. But it doesn't take much of a stretch of the imagination to say, well, if I'm already partitioning the graph, and those partitions or subgraphs are independent, I could start exploring them in parallel across threads. Last class we saw that partitioning technique as a way of zeroing in on the parts of the graph you think are going to provide the most benefit, so you try to optimize those first before you run out of time.
We'll see basically the same technique today and in the next class used to parallelize things, and as I said, the challenge is going to be that there are dependencies between different parts of the solution space within the join graph we're trying to enumerate, where we have to do some part first before we can start exploring other parts of the tree. Meaning, we have to know the answer for some part of the join graph before we can say, okay, let's look further down the tree and start looking at other stuff.
Again, we saw this with accumulated versus predicted cost bounding. In top-down this is more of an issue, because if you're doing accumulated cost, you need a physical plan to at least get that first initial cost to figure out what your upper bound is going to be. So you kind of need to have one thread go down to the bottom and generate some physical plan. Even though it's not ideal, just have something, and once you have that, it can unlock other work.
We saw this in other cases too, where we talked about doing transformations from an outer join to an inner join. Well, you can't start doing join enumeration on the inner join until you apply the outer join transformation, right? So that's another example where you have to do certain things before you're allowed to make a wider search over the different options you could have for a query plan.
So, as I showed before, what we're trying to figure out now is: we know the general framework of how to design a query optimizer, whether it's top-down or bottom-up, and now we're trying to say, okay, what can we do to ensure that our query optimizer actually produces good query plans for whatever SQL query shows up? So we're at this point here, trying to understand how to make the search algorithm as efficient as possible, and obviously accurate as well. You'll see in all the algorithms we talk about today that there are always these explicit checks: if I try to partition the graph in a certain way that leaves it disconnected, that's going to introduce either correctness issues or Cartesian products, which is an efficiency issue.
There are a bunch of checks we have to do to make sure that we're producing accurate and correct results. But now we're also really focused on how we can make this run as fast as possible. So that's what this class is about, and then next week we'll start talking about the cost model, which is this big albatross hanging on the side that we haven't really addressed.
Right, all right. Up until now, I also want to point out that all the methods we've talked about, whether it's the implementation of the framework or the algorithm to do join enumeration, have assumed that you're running in a single-threaded environment. And so the goal, again, of this class and the next class is: how can we make the query optimizer run the search in parallel?
The challenge here is that most query optimizers are not parallel, even in most modern systems. The optimizer itself is going to be a single-threaded component, of which the database system will allow multiple instances to run at the same time. So it could be optimizing multiple independent queries at the same time, but it doesn't, for one single query, spawn a bunch of worker threads to do that search in parallel.
And this is for two reasons. One is historical: most database systems, up until really Snowflake made it big in 2013, or that sort of architecture came along, were single-node monolithic architectures where everything you needed to run your database system was on a single box, in a single program if you will. I don't want to say process, because it could be a multi-process system, but you're running everything on a single box.
That means when a query shows up, it goes through the connection handler, then the parser, then the binder, and then it goes to the optimizer. All of that is running on the same machine where you're actually executing queries. This is what MySQL does, what PostgreSQL does, SQLite, Oracle, SQL Server. This is how people have built these systems since the '70s. And so the challenge is that if I want a parallel query optimizer, now I'm taking threads, workers, away from other parts of the system just to do query planning.
And again, if I'm running on a single box, I'm not elastic. I can't magically conjure up more threads; they have to come from somewhere. So I'm taking away threads that I could be using for background tasks or query execution and using them for query planning. So that's a hard trade-off: how much better will my query plan actually be if the optimizer is multi-threaded, versus just throwing more hardware at the problem? Again, the multicore stuff really only arrived in the early 2000s; from the '80s through the '90s, most machines were a single CPU with a single thread.
I mean, the enterprise stuff was a little bit different, but most of the time you didn't have all the threads or cores we have today. All right, so that's the historical reason for monolithic database system architectures. But even now, in a cloud environment, something like Snowflake is still going to have a single-threaded optimizer even though it's running as a service.
So again, multiple queries can get optimized at the same time, but within the context of a single query, the optimization runs in a single-threaded context. The goal of parallelization is that even though the search problem is NP-hard and exponential, we want the search time for finding optimal plans to improve linearly with the number of cores we give it. So if I go from one core to two cores, then my time should in theory be cut in half.
Again, as we talked about before, it depends on what the stopping mechanism for the optimizer is. If it's wall-clock time, then you have to account for that in your implementation or your setup. So even though you have more cores and a parallel optimizer, if you set it to run for, say, 20 seconds, it'll still run for 20 seconds. Maybe it exhausts the search space, gets through all the possible choices, and just quits early.
But you'd have to account for that. If instead you do it the Microsoft way, where you cap how many transformations you're allowed to apply, then that budget automatically scales as you add more cores, because things are just running faster.
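To make that difference concrete, here is a minimal sketch (my own illustration, not from any particular system) of the two budget policies inside a search loop: a wall-clock budget burns the same elapsed time regardless of core count, while a step/transformation budget finishes sooner in wall-clock terms as you add cores.

```python
import time

def search(enumerate_next_plan, wall_clock_budget_s=None, max_steps=None):
    """Toy driver loop: `enumerate_next_plan` is a hypothetical callback that
    returns the next candidate plan (with a .cost attribute) or None when the
    search space is exhausted."""
    start, steps, best = time.monotonic(), 0, None
    while True:
        if wall_clock_budget_s and time.monotonic() - start > wall_clock_budget_s:
            break   # same elapsed time no matter how many cores you add
        if max_steps and steps >= max_steps:
            break   # a step budget: more cores chew through it sooner
        plan = enumerate_next_plan()
        if plan is None:
            break   # exhausted all choices before hitting either budget
        if best is None or plan.cost < best.cost:
            best = plan
        steps += 1
    return best
```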
All right, so why is this hard to do? For other parts of a database system, like parallel query execution, we know how to do this really, really well. We know how to do parallel scans, parallel joins, parallel sorting, all those tricks we've been doing since really the '80s. But this is a bit harder. It is somewhat similar to parallel query execution in that you have pipeline breakers: you need to wait for one pipeline to finish before another pipeline can start running, because it depends on the output of the first. That's the sort of dependency challenge we're facing here. And so for join enumeration, the class of dynamic programming algorithms we've been looking at is called non-serial polyadic.
Non-serial just means that the computation we do at one phase or one level of the search tree depends on results from multiple levels below it. A lot of the time in dynamic programming, you only care about the results from the previous level. But in this world, we care about everything computed since the beginning, because in order to know how I want to join things up above, I'm relying on computation I did at lower parts of the tree, assuming I'm going bottom-up, to figure out the optimal join ordering for the tables down below.
Then the polyadic part just means that the recursive calls we're making in our search take two parameters. We'll see this in some of the examples: you've got to know the left side of the tree and the right side of the tree, and that complicates how you're allowed to parallelize things. So it's a more constrained environment; it's not like we're magically going to be able to throw more cores at the problem and it all just gets magically faster. There is some intelligence we have to put into our decisions about how we split things up.
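To make "non-serial polyadic" concrete, here is the generic bottom-up join-ordering recurrence in my own notation (not taken from the paper). The two arguments are the "polyadic" part; the fact that the subsets can come from any smaller level, not just the previous one, is the "non-serial" part:

```latex
\mathrm{best}(S) \;=\; \min_{\substack{S = S_1 \cup S_2,\;\; S_1 \cap S_2 = \emptyset \\ S_1,\,S_2 \text{ connected},\;\; \exists\,\text{join edge between } S_1 \text{ and } S_2}}
\Big[\, \mathrm{best}(S_1) \;+\; \mathrm{best}(S_2) \;+\; \mathrm{joincost}(S_1, S_2) \,\Big]
```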
The basic idea of how we're going to do this, at a high level, is the same thing we do when we want to parallelize query execution. We're going to partition the search space in such a way that we can divide the work evenly among the workers, so that at any time during execution, every worker always has something to do.
There's not a bunch of threads sitting idle polling for work. That's hard to always achieve, but in general we should be okay with this. The other technique we use is that once we split things up into partitions, it'd be really nice if each task could execute without being dependent on what other workers are doing.
So the idea is that you want to localize, almost like a pipeline, a bunch of the work in our search, so that one worker can do a bunch of things and just roll through the computation without waiting for other workers to produce a result.
Like we talked about in previous classes, none of this should be new. We discussed this in 15-721 last year.
All right, so I'm going to briefly talk about three algorithms today. The first is sort of the straw man that comes up in the paper you read: dynamic programming over subsets (DPsub), which is basically the same thing as System R, with slight differences in how they go about generating the subsets. Then we'll talk about the massively parallel DP (MPDP) algorithm from the paper you read.
What I like about it is that it covers the background of what the algorithm is based upon. I also like it because it's one of the only ones that runs on a GPU. There's another paper from other people that runs on a GPU and predates this one, but this one covers everything, and they also fold in a piece of the technique from the HyPer guys as well.
So a lot of ideas are thrown into one paper, which I like. The writing's okay; we can take that offline. Then I'm going to finish up talking about another approach, multi-plan join DP, from one of the guys at IBM. It was done by IBM Research; it was not part of Starburst or DB2, it was sort of a side project.
All right, so the DPsub algorithm is going to serve as the basis for what the MPDP algorithm does. The basic idea, again, is that this is more or less the same thing as System R, except how they structure the subsets and do the exploration is slightly different. We split the join graph into subsets of vertices, organized by the number of vertices, and that subset size slowly gets bigger and bigger, meaning we look at larger and larger subsets.
For each of the subsets we have, we're allowed to split it as long as there is a way to connect the left side and the right side, the two sub-subsets we're looking at. Then we just do the standard thing: okay, what's the best way to evaluate the joins on these two subsets? Is it better than the best plan we've seen so far? If yes, then it becomes the new best plan; otherwise, we throw it away.
Again, just like before, we need to do a bunch of checks: if we start splitting the query graph into subgraphs, then to avoid Cartesian products we've got to make sure we don't have any disconnected nodes. So, again, I don't normally like to show code, but I just want to visualize what's going on here, and the algorithm is pretty basic. You start at the beginning and say, okay, I'm going to start looking at subsets of size two.
So you just generate all possible enumerations for all possible subsets. In this one, when they generate the subsets, they're not actually checking whether they're connected; they do that later. Remember we talked about the HyPer guys, the hypergraph one, where they would only generate subsets that were valid, rather than generating everything and checking validity afterwards. This one just generates all of them.
All right, so then we get down here and we go through all the different subsets, again starting with size two. For each one, you take whatever is on the left side here. This part of the loop can now be done in parallel. So within one subset size, in this case size two, I could have a different thread evaluate the best join order for each subset in parallel, and they don't need to coordinate with each other, because there's no information I need from anyone else to find the best join plan for the subset I'm looking at.
There's nothing I need to know about your subset. So again, they do that System R approach where you go through and expand out. For this one here, do I use a hash join or a merge join? Same thing, I keep fanning out, relying on what came below me to tell me the path I'm going to generate. I'm not showing all of it because it's PowerPoint, but think of all possible combinations for subsets of size two, and then all the possible combinations for larger subsets.
This keeps going over and over again. But the point here is that they check whether, if they split a subgraph off of the main graph they're looking at, it would be disconnected. If yes, they discard it; if no, they're allowed to proceed. And this is just the System R stuff we said before, composed slightly differently. Someone asked why the third subset is that size: it's because I looped back around; I didn't show that step. Instead of expanding it that way, I went through another iteration of the loop. Now I'm looking at subsets of size two, and again it looks a lot like the top-down approach: I don't care about how I'm joining A and B, because the level below me has already told me how to do that.
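Here's a minimal single-threaded Python sketch of that DPsub-style loop, with made-up relation names, cardinalities, and a toy additive cost; the connectivity and cross-product checks are exactly the branches discussed above, and the middle loop over subsets of a given size is the part you could farm out to worker threads.

```python
from itertools import combinations

RELS = ["A", "B", "C", "D"]
EDGES = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D")]}  # chain query graph (assumed)
CARD = {"A": 1000, "B": 100, "C": 500, "D": 10}                        # made-up base cardinalities

def connected(sub):
    """True if the relations in `sub` form a connected subgraph of EDGES."""
    sub = set(sub)
    seen = {next(iter(sub))}
    frontier = list(seen)
    while frontier:
        r = frontier.pop()
        for e in EDGES:
            if r in e:
                (other,) = e - {r}
                if other in sub and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen == sub

def dpsub():
    # best[S] = (estimated cost, plan tree) for each connected relation set S
    best = {frozenset([r]): (CARD[r], r) for r in RELS}
    for size in range(2, len(RELS) + 1):
        # Subsets of the same size are independent of one another, so this
        # middle loop is the one a parallel DPsub hands out to threads.
        for sub in (frozenset(c) for c in combinations(RELS, size)):
            if not connected(sub):
                continue                      # skip disconnected subsets entirely
            for k in range(1, size):
                for left in (frozenset(c) for c in combinations(sorted(sub), k)):
                    right = sub - left
                    if left not in best or right not in best:
                        continue              # one side was never planned (it was disconnected)
                    if not any(frozenset((a, b)) in EDGES for a in left for b in right):
                        continue              # no join edge between the halves: Cartesian product
                    cost = best[left][0] + best[right][0]   # toy additive cost stand-in
                    if sub not in best or cost < best[sub][0]:
                        best[sub] = (cost, (best[left][1], best[right][1]))
    return best[frozenset(RELS)]

print(dpsub())   # (toy cost, nested plan tuple) for joining A, B, C, D
```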
Okay, so the MPDP algorithm is going to expand upon this, and it's basically the same thing. It's a little confusing: there's MPDP, the overall technique they talk about, and within that there's the MPDP algorithm that's based on DPsub; but then collectively they're also saying that if you tack on this adaptivity piece, where they use this UnionDP or IDP stuff, that's also part of what they call MPDP. For simplicity, we'll say MPDP is just the parallel extension of
DPsub, right? So again, they're going to have that same check that HyPer/Umbra does on the complexity of the query to decide which algorithm to use. For simple queries they'll use DPsub to do the search exhaustively, but for larger queries they use a combination of their own heuristic technique and DP. The difference here is that, instead of just picking vertices and deciding how to split based on that, they do a combination of vertex-based and edge-based enumeration. The goal is to pre-compute the valid cuts you can make in the query graph, so you avoid disconnection up front instead of having additional checks later on inside that for loop to see whether something is connected. That matters when they put this on a GPU, because on GPUs you don't want branching: you don't want the different cores or threads within a grouping (a warp, as NVIDIA calls it) to start going down different code paths. We'll see that in a second.
So what's the difference between vertex-based enumeration and edge-based enumeration? With vertex-based enumeration, you explore the search space by deciding how to pull subgraphs off the join graph based on vertices, and then you build the relations back up as you add the vertices back in. So you split things up, add things back going up, and you again have to make sure you're not generating invalid subgraphs, because you're trying to avoid Cartesian products. With edge-based enumeration, you look at the join pairs, the edges themselves, and that's how you decide what you actually want to cut. This is better in terms of reducing wasted work, because you don't have to validate a bunch of join pairs that you're going to throw away anyway. But it's more difficult to parallelize, because now you're blocking the generation of the different partitions, the subsets you could be exploring, until you finish slicing up the subgraph.
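As a CPU-side illustration of the payoff (this is my sketch of the idea, not their GPU kernel): if you precompute the valid connected splits once, the hot loop that actually costs candidate plans becomes a straight run over a flat list with no connectivity checks in it, which is the branch-free shape a GPU warp wants.

```python
from itertools import combinations

RELS = ["A", "B", "C", "D"]
EDGES = {frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D")]}   # chain query graph (assumed)

def connected(sub):
    sub = set(sub)
    seen, frontier = {next(iter(sub))}, [next(iter(sub))]
    while frontier:
        r = frontier.pop()
        for e in EDGES:
            if r in e:
                (other,) = e - {r}
                if other in sub and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return seen == sub

def precompute_valid_splits():
    """All (subset, left, right) triples where both halves are connected and at
    least one join edge crosses between them. Every branch lives here, outside
    the cost-evaluation loop."""
    splits = []
    for size in range(2, len(RELS) + 1):
        for sub in (frozenset(c) for c in combinations(RELS, size)):
            if not connected(sub):
                continue
            for k in range(1, size):
                for left in (frozenset(c) for c in combinations(sorted(sub), k)):
                    right = sub - left
                    if connected(left) and connected(right) and any(
                            frozenset((a, b)) in EDGES for a in left for b in right):
                        splits.append((sub, left, right))
    return splits

# The evaluation pass would then walk this flat list one subset-size level at a
# time, doing the same arithmetic for every entry with no `if` checks, e.g.:
#   for sub, left, right in level_splits:
#       best[sub] = min(best.get(sub, INF), best[left] + best[right] + join_cost(left, right))
for sub, left, right in precompute_valid_splits():
    print(sorted(sub), "=", sorted(left), "x", sorted(right))
```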
All right, so I don't have an example of MPDP, but again it's basically DPsub; how they slice up the edges is slightly different. But then they have this heuristic that kicks in when they recognize that the query has become too complex. They have something that looks very similar to the GOO approach we talked about either last class or two classes ago, where they look at the join graph, calculate some cost for each join using heuristics, and then decide which relations they want to combine together.
Assume that's the order you're going to apply the joins, right? So it's basically the same thing we saw before with GOO. You have all these edges; they don't talk about how they're calculating selectivities for intermediate results, they just say there is a size of the result, the cardinality you would get from the sizes of the relations and the selectivity of a predicate. So now you have a way to say, okay, here's how much data is going to come out of a particular join, and then they pick the largest one and merge those relations together. That's saying, I want to apply this join first, as early as possible. And they keep doing this: they merge relations into composites up to some size K, and then they keep combining the larger pieces until you have the full, complete join graph again.
In the case of GOO, we saw them merging the two relations being joined into what they call a composite node, a composite vertex. It's basically the same thing here. It's telling me I want to join B and D first, and this gives me the ordering in which I want to apply the joins. A student asks: this will work really well for a snowflake or star schema, but will it work well for cliques or trees? So his statement is that this would be terrible for cliques; my understanding is that they handle it as follows.
Look at the K; the K matters here. You keep building this up until you have composite subsets of size K, and then you take those subsets of size K and run the MPDP algorithm on them, as in the paper. Yes, and this is what I'm showing; it's a small graph, but you could use a K of four or something like that. So I would apply this heuristic, end up with composite vertices containing four relations each, and then run MPDP on those. The reason they order it by size, with the larger ones going first, is that they want to optimize those clusters first, and those will naturally be the obvious, important parts to optimize if you have a snowflake or star schema. But if you don't have that, then it won't help. His point is that it doesn't make sense for cliques: with a snowflake or star schema, the part to optimize first is obvious, but with a clique you wouldn't have that. Still, based on their algorithm, something is going to have the biggest cost; just go optimize that first. Yeah, but they didn't show it in the paper. There's a lot of things they didn't show.
Yes. Yep, there's a tech report; it wasn't in there either. Okay, and for small and large graphs they give results, but, how do I say this: in addition to learning how to build an optimizer, which you should all be getting from reading the papers, one thing you should learn is how to read a scientific paper, and it's also healthy to have some skepticism: yeah, they're saying this, but it doesn't really make sense, right? And it's hard, because you've got to go back to the paper multiple times to figure out what the hell they're actually talking about. The German papers are great, but understand that this one is missing a bunch of things, and it has a bunch of stuff that doesn't really matter. And then they cite the longer tech report on arXiv, but that doesn't have the things we were missing either.
There are not a lot of papers on parallelizing query optimization; this is one of the newest ones. Is it a great paper? It's a good paper. Easy to read, yeah, sure. Yes, okay. So again, the main thing I want to point out: it's hard to show this with a small graph, but think of a really big join graph, where you're trying to figure out which joins are going to have the largest intermediate results. You start putting them, I won't use the term cluster, into composite groups, and you keep going until you have a group of size K. Once you have that, you run the MPDP algorithm on that cluster and figure out the optimal join order for it, and then you find the next cluster and do the same thing. So again, you can't do the exhaustive search across the entire join graph; it would be too difficult and run forever. Instead, you're zoning in on portions of it, optimizing those pieces, and hopefully that at least puts you in a good direction. A student points out that GOO actually chooses the smallest one.
Yeah, he's correct: GOO, and I can bring that slide back up, chooses the smallest one when it's linearizing. That was a way to simplify the problem of generating the join ordering. So let me bring that up; I should include the slide. The objective there is slightly different.
Yeah, so again, to put this back in context: this is for large queries in the German approach. The high-level goal is the same; you want to simplify the search problem because you can't do an exhaustive search. In the case of GOO, what they want to do is linearize the search space so that you at least start with a left-deep tree, and then do the more expensive search, looking at bushy plans and other things. They say, how big of a subproblem do you want to look at? They're using K = 100 here. All right, so again, you have this cost, the number of tuples that come out of a relation, and you have the selectivity, so you do the simple math and get the cardinality. The idea here is that they choose the smallest one and merge those relations into an updated graph. By choosing the smallest, you're basically saying that this join doesn't really matter in terms of the entire query plan, because it isn't joining that much data. It's not my big problem, so you just combine the relations together so that your exhaustive search doesn't have to consider B and C explicitly. You're using the composite as a placeholder and saying, yeah, there's a join happening here, but it doesn't really matter in the grand scheme of things. So that's what GOO is doing: it's a way to hide things that you don't think are going to matter, to simplify the search process. You keep doing this recursively.
So now, going back to what UnionDP is doing: it does the opposite. It tries to find the things you think are going to matter by taking the joins with the highest cost and combining them into composite nodes. So I could use heuristics to decide how I want to join A with BD and C with BD, but I really want to know the optimal choice for joining B with D, because that dataset size is so huge. Is that clear? Again, it's a different approach to solving this really hard problem of picking the join order. The idea in the German paper was: simplify it through GOO, and then use that as the starting point, the initialization, so I'm not spending time just trying to find a quick upper bound. In the case of UnionDP, it's a way to highlight the part of the query plan you know you want to spend most of your time on, trying to find the best join order.
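Here's a small sketch contrasting the two greedy policies, with made-up cardinalities and selectivities and the usual |R| x |S| x selectivity estimate; `pick_largest=False` behaves like GOO (merge the cheapest join first), while `pick_largest=True` with a cluster cap `max_size` mimics the UnionDP-style clustering whose resulting clusters the exact MPDP search would then optimize.

```python
from itertools import combinations
from math import prod

def greedy_merge(base_cards, edge_sel, pick_largest=False, max_size=None):
    """base_cards: {'A': rows, ...}; edge_sel: {frozenset({'A','B'}): selectivity, ...}"""
    nodes = {frozenset([r]): c for r, c in base_cards.items()}   # composite vertex -> est. rows
    merges = []
    while len(nodes) > 1:
        best = None
        for x, y in combinations(list(nodes), 2):
            if max_size is not None and len(x | y) > max_size:
                continue                                  # would exceed the cluster budget K
            crossing = [frozenset((a, b)) for a in x for b in y
                        if frozenset((a, b)) in edge_sel]
            if not crossing:
                continue                                  # no join edge: that's a Cartesian product
            out = nodes[x] * nodes[y] * prod(edge_sel[e] for e in crossing)
            key = out if pick_largest else -out           # largest-first vs smallest-first
            if best is None or key > best[0]:
                best = (key, x, y, out)
        if best is None:
            break                                         # nothing left to merge under the size cap
        _, x, y, out = best
        merges.append((sorted(x), sorted(y), out))
        nodes[x | y] = out
        del nodes[x], nodes[y]
    return merges, nodes                                  # merge order + remaining composite vertices

cards = {"A": 10_000, "B": 1_000_000, "C": 500, "D": 2_000_000}               # made up
sels = {frozenset(p): s for p, s in [(("A", "B"), 0.001), (("B", "D"), 0.0001),
                                     (("C", "D"), 0.01)]}
print(greedy_merge(cards, sels, pick_largest=False))            # GOO-style: smallest output first
print(greedy_merge(cards, sels, pick_largest=True, max_size=2)) # UnionDP-style clusters of size <= 2
```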
Okay, so I am not a GPU expert, but what they're saying seems reasonable. The basic idea is that if you can get rid of those if-clauses where you're checking whether the graph is connected as you build up the subgraphs, and instead decide which enumerations to look at based on edges so that you never generate invalid join graphs, then you basically have straight-line code you can run on the GPU, and you don't have to worry about what's called branch divergence. That's where different threads within the same warp, a grouping of threads, go down different code paths in the same code, which, to my understanding, forces the GPU to execute them in serial order. So you lose all the benefit of having a GPU with thousands and thousands of cores, because things end up getting executed single file.
So they do the same trick we talked about last class, where you represent the join graph as an array of fixed-width bitmaps. That's great because GPUs don't like variable-length things. If you know exactly the offset of everything, you can jump to it very quickly.
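For example, a toy version of that representation (my illustration, not the paper's exact layout): each relation is one bit, a subset of relations is one machine word, and adjacency is a flat array of words, so neighborhood lookups and "is there an edge between these two subsets" checks are just bitwise operations on fixed-width values.

```python
# Relations as single bits in a fixed-width bitmap.
A, B, C, D = 1 << 0, 1 << 1, 1 << 2, 1 << 3

# ADJ[i] = bitmap of relations sharing a join edge with relation i (chain A-B-C-D, assumed)
ADJ = [B, A | C, B | D, C]

def neighbors(subset: int) -> int:
    """Bitmap of relations adjacent to `subset` but not inside it."""
    n, i, s = 0, 0, subset
    while s:
        if s & 1:
            n |= ADJ[i]
        s >>= 1
        i += 1
    return n & ~subset

def joined_by_an_edge(s1: int, s2: int) -> bool:
    """True if at least one join edge crosses between the two subsets."""
    return bool(neighbors(s1) & s2)

print(joined_by_an_edge(A | B, C | D))   # True:  edge B-C crosses between the halves
print(joined_by_an_edge(A, C | D))       # False: this split would be a Cartesian product
```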
And again, they remove all the branches, getting rid of the conditionals and the branch divergence. Another thing they don't talk about, but that's in the paper they cite, is that you can't have the GPU threads call back to the CPU for any additional information while they're doing this join enumeration. A student raises the concern that you still have to dynamically evaluate the cost of plans, and that this matters for performance.
Yes. So to repeat what he's saying: we keep pointing to this cost model on the side that's going to magically tell us the selectivity of a predicate or the cardinality of an operator. When we look at the algorithms in the papers, there's always a step like "compute this cost function," and that's making a call into some cost model to estimate what the cost of your operator is. Where does that live? Well, that's in the catalog; there's some other piece of code that generates those estimates for you. But now, if you're running your join enumeration, which looks at the costs of query plans to decide which one is better, the algorithm is structured to rip through the possible orderings very quickly. If all of a sudden you have to say, okay, here's a valid query plan, let me compute its cost to check whether it's better than the best cost I've seen so far, you can't make a call to the catalog from the GPU, because now you're going back up to the CPU, you're blocking your threads on the GPU, and you lose all the benefits of the parallelism.
So this paper doesn't talk about it, but the other one does: you basically have to package up all the estimates you're going to need for any possible join order you might consider and send that down to the GPU with the request. That way, all the computation and analysis is localized on the GPU, and you never go back to the CPU until you say, here's my result, or give me more work to do, I didn't find anything yet. GPUs are great; they're really good at doing parallel computation, which is what they're trying to do here, but you have to have everything local on the GPU. Because as soon as you go talk to somebody else, you're blocking, and then you're screwed. Is that clear?
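A rough sketch of what "package the estimates up front" could look like (purely illustrative; real systems would pass flat arrays, and the names here are mine): base cardinalities and per-edge selectivities get shipped once, and the cardinality of any candidate subset is then computable from that bundle alone, with no catalog round trip.

```python
from math import prod

# Shipped to the device once, before enumeration starts (values are made up).
STATS = {
    "cards": {1 << 0: 10_000, 1 << 1: 500, 1 << 2: 2_000_000},     # relation bit -> row count
    "sels":  {(1 << 0, 1 << 1): 0.01, (1 << 1, 1 << 2): 0.001},    # join edge -> selectivity
}

def est_rows(subset_bits: int, stats=STATS) -> float:
    """Cardinality estimate using only the pre-shipped stats: product of base
    cardinalities times the selectivity of every join edge inside the subset."""
    card = prod(c for bit, c in stats["cards"].items() if subset_bits & bit)
    for (a, b), sel in stats["sels"].items():
        if (subset_bits & a) and (subset_bits & b):
            card *= sel
    return card

print(est_rows(0b011))   # first two relations joined: 10_000 * 500 * 0.01
print(est_rows(0b111))   # all three: also applies the second edge's selectivity
```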
Okay, so this paper has results, which is another reason I like it. I'm going to cherry-pick three graphs. The first one shows the search time for these different methods. DPccp is the graph-based one, the precursor to the HyPer one. I forget what DPE is, but DPsize is System R, and DPsub is the one we talked about before. They have a bunch of these running on the CPU and a bunch running on the GPU as well; it's actually quite impressive. The two workloads they're considering are the Join Order Benchmark, which is based on the IMDb database that we're going to see a lot in the newer papers (we treat it as the common workload everyone uses), and MusicBrainz, an online music database, for which I think they generate synthetic queries. If you look at the MusicBrainz results, where they're scaling up the number of relations being joined, the GPU is the clear winner: everything else falls off a cliff, whereas the GPU version is doing okay up to around 20 or 25 relations. Again, it's log scale, so it's not great in absolute terms, but it's doing quite well here.
But then when you look at the Join Order Benchmark results, it's surprising: the GPU lines are basically flat for both DPsize and the red one, which is the GPU one, and now the performance gap between what the CPU can do and what the GPU can do is not as great. If I remember correctly, that's because of the complexity of the Join Order Benchmark queries: the predicates were more restrictive, and there were fewer options to consider in a parallel search. So even though the GPU has thousands of cores (they're running on a 2016-era GPU; I forget, it's like eight gigs of RAM, and I forget how many cores), there just isn't as much opportunity for parallel search as there was in the other workload, so they don't get as big a win. And the reason it's sort of flatlined is the fixed cost of packaging up the search request, sending it down to the GPU, letting the GPU crunch on it, and then coming back. You pay that penalty no matter what the size of the query is.
Yes, yes. So his point is: if you ignore everything on this side after 18 relations, the two look very similar, so what is the benefit you would actually get from using a GPU for the parallel search?
So again, the HyPer paper didn't show a crazy number of tables in a join. They mention the 5,000-relation one from SAP, but I forget what the highest number they went up to in their experiments was; it was definitely a bit higher than this, like 30. I mean, in the next slide they go up to 100. I think there are only about 30 tables in MusicBrainz, so that's why they couldn't scale beyond that, and the Join Order Benchmark is just 17 or 18 tables; the bottom one was around 100. Yeah, they're going to go up to a thousand in the next one, but these are real datasets with synthetic workloads, synthetic queries, so they're maxing out the number of tables they have in the database, and that determines how many relations you can join.
How do I say this: they didn't do the same sort of scalability experiment. Actually, that's the next slide, but I don't think they show the search time for going up to a thousand relations. They show what the plan cost is, and we'll discuss that on the next slide.
Which is also sketchy, but okay, we'll get to that. So again, this is search time: for these given queries with different numbers of relations, how well does the search algorithm scale?
Right, the next thing they did is measure the quality of the plans for these different methods relative to each other. They didn't actually run the queries; they're just reporting what the cost model in the query optimizer thinks the quality of each query plan is. And I don't think they even try to validate that, like running some random orderings to see whether the relative ordering is actually correct. Is that because the dataset is that big? No. I'm saying I don't think you'd want to do an exhaustive search to prove the ranking is correct; on a reasonably sized dataset maybe you could. But if you did the statistical method, where you randomly sample and show that for any random sample the ordering is correct, then with some kind of guarantee or confidence you could say that the global ordering is correct. They didn't do that.
Right, this paper is from 2022, so that means they wrote it in 2021, so DataFusion wasn't really a thing back then, and DuckDB was still pretty early.
Yeah, but that would take forever to run on Postgres, because it's a row store. All right, so for this one they're throwing everything in, which is kind of nice. GEQO is the genetic algorithm from Postgres; LinDP is the adaptive one from the Germans; GOO we've covered; IKKBZ is the approximation used to generate optimal left-deep trees, which we saw in the HyPer/Umbra paper; and I didn't talk about the iterative DP one, but it's mentioned in the HyPer paper.
It's just a different formulation; again, for a really large number of joins it's a way to approximate things and make them easier. So they talk about how they run the MPDP algorithm with different sizes of K, and they only report K = 15, which I think is the median, but they do scale up the size of K to show that different cluster sizes behave differently. This graph is all relative to what their best implementation can do, the UnionDP MPDP with K = 15, so all these numbers are relative to that. And again, it's based on the plan cost estimate from the query optimizer itself: the optimizer is saying, I think this query plan is going to cost five and this other query plan is going to cost six. That's what they're reporting here; it's not wall-clock time.
What do they say about the missing entries? They largely summarize those results in the interest of space, and they say it times out after a certain point.
Okay, whereas the German one doesn't. And then the X's here just mean that the optimizer, the algorithm, failed to finish, and these here are up in the 40-to-50 range of relative difference. The interesting thing is what they say, in general, for clique graphs. Someone asks whether this is the snowflake graph or the star graph; the star schema one looks basically the same as well.
They don't have numbers on that. It might be in the tech report; again, I don't remember seeing it, though. Okay, so again, that was the one from EPFL.
I want to briefly talk about another approach, from IBM, that predates that; it came out in 2008. The basic idea is that you represent the possible join orderings as rows in a relational table, and then when you want to do the enumeration, you're just doing self-joins on that same table within your database. Then there's a little trick they do to decide how to allocate the exploration tasks to your workers, and they do it in such a way as to minimize the amount of time workers sit idle because they don't have any work to do.
Blindly handing work out, saying you do this and you do that, is not always going to work, because they're not figuring out which join orders are valid before they start evaluating them; you could end up with a thread that has a bunch of work it doesn't actually need to do, everything gets thrown away, and then it becomes idle. In the paper they talk about splitting things up based on quantifier sets. Does anyone remember what a quantifier is, from when we talked about relational calculus? Nobody?
Good, excellent. Again, beyond this class you don't really need to know this, but the quantifiers are just the things that produce tuples that get fed into other blocks. This is from, I think, lecture three or four, on Starburst. So the quantifiers are the things producing tables; they don't want to call them tables because they could be nested queries and other things. All right, so say we have a simple query joining A, B, C, and D, and the query graph looks like this. In the first step they generate the set of all the plan entries you'd have in the memo table, but again, the memo table is just going to be a table in your database.
Right. And then they do a logical partitioning to divide this work into groupings where the size of the subgraph, the number of quantifiers in the subgraph, is the same. What I mean by that: the first partition, P1, is where the subgraphs are of size one; P2 is all the subgraphs of size two, then three, four, and so forth. The ordering within them doesn't matter; I say "logically partitioning" because you're just recording some mapping that says these rows belong to this partition. Then you explore each partition starting from the bottom, because we're doing bottom-up. First we figure out the subquery plans for that first partition, and that's just running the DP algorithm to find the best join ordering, or the best access method, for the subgraph I'm examining. The reason there could be multiple query plans is that the quantifiers aren't just table accesses; they could be subqueries, but we can ignore that.

Then I go to the next one. If I want to evaluate the different combinations I could have in P2, that's just a join of everything I generated in partition P1 with itself, because that gives me all combinations of subsets of size two. I do that, do the same calculation again, and so on. To generate plans for partition three, I can only combine entries from partition one with entries from partition two, so I join the first partition with the second, and that gives me all possible permutations. Again, you're doing this within the database itself. Then I get down to P4, and now I have two choices: it could be a combination of subgraphs of size one with subgraphs of size three, or subgraphs of size two joined with themselves. But, like DPsub, they're not checking whether the joins are actually valid as they generate these enumerations; through these self-joins you're just producing all the possible tasks, all the join orders you could consider. So if you do the naive thing and say, okay, I have two worker threads, I'll give this half to the first worker and this half to the second, the problem is that when we look down here, we know a bunch of these are actually useless and invalid, because we just took a Cartesian product of P2 with itself. A join of AB with AB is useless because it doesn't contain C or D, which I need. So all of that work gets thrown away, and the only thing we actually end up computing is what's in this task list here. So we need to be a bit smarter about deciding which workers do which part of the enumeration, because we don't want to assign work to one thread that throws everything away and then sits idle.

So this paper is really focused on how you allocate the work to the available workers, and on doing it in a way that's mindful of what the join graph actually looks like, to make sure you evenly distribute the work among the different threads. In the case of the MPDP paper on GPUs, they throw everything in, and with so many cores, who cares; but in 2008 there were basically four-core, six-core machines, not the behemoths we have now. All right, so the first, easy thing to do is what I said at the beginning: take all the possible plan joins I have, divide by the number of workers, and just hand them out. That's the easiest thing to do, but again, you get imbalance in the workload, because one worker could have a bunch of stuff it just throws away and then has nothing to do. So instead they do what they call stratified allocation, where you divide the work into multiple subsets and hand them out in an intelligent way to try to evenly distribute the work among the workers. There are three ways to do this. If you think of the join enumeration as a self-join, like a nested-loop join, then you can think in terms of an outer loop and an inner loop. In this case, the outer loop is this line here, for i in 1 to floor(size/2): it grabs one half of a join graph, and then the inner loop looks at all the possible combinations, the different join orders, for the rest of the subgraph. The question is how you assign that evaluation. Equal-range allocation says you take the outer loop and divide it into contiguous ranges among the threads you have, so within one iteration of the outer loop, one thread takes all the work of the inner loop. Another option is that every time you iterate through the outer loop, you hand that iteration off to a worker in round-robin fashion, and it takes all the work within the inner loop. The alternative is that you do the round-robin within the inner loop itself. Distributing the outer loop round-robin actually turns out to be the best: it works really well for star schemas and other queries that have skewed joins, because you're handing out the easy problems evenly among all the different workers.
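Here is a toy Python rendering of those three allocation choices (my naming, not the paper's): the work item "combine partition P_i with P_{size-i}" is an outer-loop iteration, and each pair of entries drawn from those two partitions is an inner-loop iteration.

```python
def equal_range_outer(outer_iters, n_workers):
    """Chop the outer loop into contiguous ranges, one range per worker;
    each worker then does all the inner-loop work under its range."""
    chunk = -(-len(outer_iters) // n_workers)            # ceiling division
    return [outer_iters[w * chunk:(w + 1) * chunk] for w in range(n_workers)]

def round_robin_outer(outer_iters, n_workers):
    """Deal outer-loop iterations to workers in turn; again each worker
    takes all the inner-loop work under its outer iterations."""
    return [outer_iters[w::n_workers] for w in range(n_workers)]

def round_robin_inner(outer_iters, inner_items_of, n_workers):
    """Share the outer loop, but deal the individual (outer, inner) pairs out,
    so the work under one huge outer item is spread across workers."""
    buckets = [[] for _ in range(n_workers)]
    i = 0
    for o in outer_iters:
        for inner in inner_items_of(o):
            buckets[i % n_workers].append((o, inner))
            i += 1
    return buckets

# Toy usage: 5 outer iterations, where outer iteration k has k inner pairs.
outer = list(range(1, 6))
print(equal_range_outer(outer, 2))                       # [[1, 2, 3], [4, 5]]
print(round_robin_outer(outer, 2))                       # [[1, 3, 5], [2, 4]]
print(round_robin_inner(outer, lambda k: range(k), 2))   # pairs dealt one at a time
```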
Right, the round-robin-within-the-inner-loop one works well too, but it's more complicated to figure out in the calculation. It also handles the skewed workloads, the star and snowflake schemas. It's really great when you have a central table that's really large, because now you don't have one thread trying to figure out all the joins for that really large fact table: as it iterates through the dimension tables in the inner loop, that work is spread across multiple workers. Okay, so let's finish
up. So hopefully the main takeaway from this is that dividing up the join-ordering search is complicated to parallelize, because there are so many things you have to worry about: not just how you divide the work so that all the workers are active and have stuff to do, but also how you can do the enumeration ahead of time so that you don't generate a lot of work that's actually useless. That part is super tricky. As for whether it makes sense to run the optimizer on a GPU: given the way NVIDIA keeps jacking up prices and the scarcity of discrete GPUs, and given that they were running on a GTX 1080, which was high-end six years ago, nobody wants to say, I'm going to run my database system but I also have to put a $10,000 GPU in the box just to do join optimization. I don't think that's feasible or practical. But there are CPUs now that come with what AMD calls APUs; they're basically integrated graphics chips. So you can now get, this is from two years ago, Ryzen CPUs that have some kind of Radeon thing in there. Is it going to be as powerful as a 5090? No. But does it have a lot of cores? Yeah. And that might be something interesting now, since it looks like the number of cores they can cram into a general-purpose CPU isn't increasing as much as it used to. That's another opportunity: this is something the DBMS could take advantage of if there's a cheap GPU sitting right there. It's not cache-coherent, but that's not a problem for join search. That might be something the database system could use to run the MPDP algorithm we talked about today. But I don't think on Amazon EC2 you can get an AMD processor with an APU.
(A student asks an inaudible question about the numbers.)
All right, so his statement is that the paper shows the GPU really only makes sense when you get to a large number of relations, and it's not clear whether, for those large numbers of relations, you actually need a discrete GPU or whether an integrated one would be sufficient.
It depends. One thing is, again, if every CPU is coming with this now and it's just sitting there, you could take advantage of it. The other thing you could think about is: if I already have to bundle up the join enumeration request to push it down to the GPU so that it doesn't talk to my CPU, there's no reason it has to run on the same box I'm on. So if I'm Snowflake, or I'm running database-as-a-service as a business, I could have a bunch of GPU machines sitting on the side that I send these discrete requests to for join enumeration, and that way it's a shared resource across multiple things and the cost is amortized across multiple customers. Yes, that could be part of it.
But I think, for better or worse, from a business perspective it would be very hard to justify, hey, we've got to buy a bunch of H100s so we can do join enumeration, versus training an LLM to do some chat thing. Do I think join enumeration is a more important thing? Yes, but it's not the hot thing; it's not going to get people excited. Nobody is going to give NVIDIA $100,000 to make their join optimization a little bit faster. So again, these integrated chips might be a cheap way to achieve some of those benefits, but to your point, yes, it's not clear whether the integrated ones can handle it; you'd have to look at the hardware specs for what they are now versus what they were six years ago.
Okay, all right, so that's it for today. We'll pick up again on Monday next week, where I'll start on parallel search for the top-down stuff, and then we'll also present some of the potential topics for project two.