QCon London ‘23 — A New Era for Database Design with TigerBeetle
Why design a new database?
There was a time when you could have any database as long as it was MySQL or Postgres. These systems took 30 years to develop; they were tried and tested, and people thought twice before replacing them. And then something happened: a growing wave of discoveries in the research around durability, efficiency, and testing, which became more and more difficult for existing database designs to retrofit. At least, this was our experience of the impact of this research on our design decisions for TigerBeetle, and why we decided to start fresh.
TigerBeetle is a new open-source distributed database. Some databases are designed for analytics, others for streaming, and still others for time series. TigerBeetle is designed from the ground up for tracking balances—for tracking the movement of value from one person or place to another: for example, to track financial transactions, in-app purchases, or economies, to switch payments, record trades, or arbitrage commodities like energy—and to do this with mission-critical safety and performance. You can use balance tracking to model any kind of business event.
That's because the way you track balances is really double-entry accounting, which is the schema TigerBeetle provides as a first-class primitive out of the box. You can spin up the replicas of a TigerBeetle cluster with a single binary, and then use a TigerBeetle client to connect to the cluster, create accounts, and execute double-entry transactions between accounts with strict serializability.
TigerBeetle is designed for high availability, with automated failover if the leader of the cluster fails, so that everything just works. We wanted to make it easy for others to build and operate the next generation of financial services and applications without having to cobble together a ledger database from scratch or execute a manual database failover at 2 AM. With a tightly scoped domain, we've gone deep on the technology to do new things across the whole design of TigerBeetle: our global consensus protocol, our local storage engine, the way we work with the network, disk, and memory, the testing techniques we use, and the guarantees that TigerBeetle gives the operator—first and foremost of which is durability.
What is durability? Durability means that once a database transaction has been acknowledged as committed to the user, it will remain committed even in the event of a crash. It's fine to lose a transaction if it has not yet been acknowledged to the user, but once it's been committed, it can't be lost. To achieve durability, a database must first write the transaction to stable storage—the disk—before acknowledging it, so that after a crash, the transaction is still there.
However, writing data to disk is as much an art as a science, and more than one blog post has been written about how hard it is to get right. It's tempting to just use the file system and not a database, but writing to a file from an application when the plug can be pulled at any time is simply too hard—at least if you want consistency. So we might think to ourselves, surely we know by now how to use rename to do atomic file updates, and then you hear a fresh report from three weeks ago that rename is not atomic on Windows Subsystem for Linux's ext4.
This is why our applications trust the database to get durability right, at least in the face of sudden power loss—not to mention gradual disk corruption. There are at least three ways that a database can be designed to write data to disk. The first of these is mmap, where you map the contents of a file into your program's address space to give the illusion that you're working with pure memory instead of disk.
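As a rough illustration of the idea (this sketch is mine, not from the talk, and not how any particular database does it), here is what mapping a data file and updating it through memory looks like in C:

```c
/* Hypothetical sketch: mapping a data file into memory. Updates through the
 * pointer look like plain memory writes; the kernel decides when (and whether)
 * the dirty pages actually reach the disk. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.db", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t size = 4096;
    char *data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    data[0] = 42;               /* looks like a memory write, is disk-backed */
    msync(data, size, MS_SYNC); /* even with msync, error handling is limited */

    munmap(data, size);
    close(fd);
    return 0;
}
```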
However, Andy Pavlo and colleagues at Carnegie Mellon wrote an excellent warning on the pitfalls of mmap, to motivate why mmap is not acceptable for database durability. Some databases do still use mmap, but since we're going to cover many of the same pitfalls as we look at other designs, we won't go into mmap further, except to shine a spotlight on the paper.
After mmap, there's direct I/O, where the database takes responsibility for working with the disk directly, so that writes and reads go directly from user memory to disk and back again, bypassing the kernel's page cache. It's much more work, but the database has full control over durability as well as caching. However, this is what Linus has to say: "The right way is to just not use direct I/O. There is no valid reason for ever using direct I/O. You need a buffer for whatever I/O you do, and it might as well be the page cache." Pretty clear, right?
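To make the contrast concrete, here is a minimal, hypothetical sketch in C of opening a file with O_DIRECT and writing a sector-aligned buffer; the 4096-byte sector size and the file name are assumptions for illustration:

```c
/* Hypothetical sketch: direct I/O, where reads and writes bypass the kernel
 * page cache. Buffers, offsets, and lengths must be aligned to the sector
 * size (assumed here to be 4096 bytes). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.db", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Allocate a sector-aligned buffer: O_DIRECT rejects unaligned memory. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { return 1; }
    memset(buf, 0, 4096);
    memcpy(buf, "hello", 5);

    /* The write goes from user memory straight to the device, no page cache. */
    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }
    if (fsync(fd) != 0) { perror("fsync"); return 1; } /* flush the device cache too */

    free(buf);
    close(fd);
    return 0;
}
```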
So if we go with Andy Pavlo's advice and steer clear of mmap, and if we don't rebel against Linus and buffered file I/O, what else is left? Here we come to what many databases do, such as Postgres, and this is to outsource durability to the kernel with buffered I/O. You write from the database's memory to the kernel's page cache, and then you rely on the kernel to flush or sync the page cache to disk. Whenever the database wants to write to disk, it issues a write system call to the kernel, passing the buffer to be written and the offset at which it should be written. The interesting thing is that this write system call does nothing—at least nothing durable—at this point. When the kernel receives the write system call from the database, it simply copies the buffer across to some pages in the page cache and marks the pages as dirty—in other words, they're en route to disk, but they're not there yet. Because there's no real disk I/O involved, these write system calls don't typically fail. The kernel usually returns success almost immediately, and then at some point the kernel starts writing the data out to disk. As each page is written out, it's marked clean. If the database does nothing more than issue writes to the kernel in this way, then it can't know when these writes are safely on disk. But that's not a problem so far: if the machine were to crash right now and the data were lost, it wouldn't matter, because the database would not yet have acknowledged the transaction to the user.
So when the database does want to commit the transaction, when it wants all the dirty pages relating to the transaction to be flushed to disk again for durability, it issues another system call: this time fsync. Fsync tells the kernel to finish all writes to disk before returning. So this ensures that all the data can be retrieved even if the system crashes or restarts.
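As a minimal sketch of this buffered-I/O commit path (illustrative only, not any particular database's code):

```c
/* Hypothetical sketch of the buffered-I/O commit path described above:
 * write() only copies the buffer into the kernel page cache; fsync() is what
 * actually forces the dirty pages down to the disk. */
#include <unistd.h>

int append_and_commit(int fd, const void *buf, size_t len) {
    /* Step 1: copy into the page cache. This usually "succeeds" immediately,
     * even though nothing is durable yet. */
    ssize_t written = write(fd, buf, len);
    if (written < 0 || (size_t)written != len) return -1;

    /* Step 2: ask the kernel to finish all writes for this file before
     * returning. Only after this may the transaction be acknowledged. */
    if (fsync(fd) != 0) return -1;

    return 0;
}
```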
In other words, where a database does make the big design decision to outsource durability to the kernel, fsync is crucial. This is because most database update protocols, such as write-ahead logging or copy-on-write designs, rely on forcing data to disk in the right order for correctness. However, fsync is not easy to get right, because under the buffered I/O design, fsync is where the rubber hits the road—and the data actually hits the disk.
Because disks are physical, they can fail in all kinds of ways, either permanently or temporarily. Some sectors might fail; others might not. If you're lucky, the disk will tell you when an I/O fails; if you're unlucky, it won't. We'll come back to this, but the takeaway for now is that since some of the buffered writes might hit I/O errors on their way to the physical disk, when the database does eventually call fsync, it might receive an error back from the kernel indicating that some of the writes in the batch didn't make it.
If fsync does return an I/O error, there are three choices the database can make. Option one is to ignore the error from fsync and pretend the writes didn't fail, which is what some databases used to do in the past. Option two is to retry the fsync, in the hope that the kernel will retry all the buffered writes to disk, and to keep retrying until you're durable—until you no longer get an error back from fsync. Option three is to crash the database, then restart and recover from a checkpoint.
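A minimal sketch of option three, crashing on fsync failure (illustrative only, not any particular database's code):

```c
/* Hypothetical sketch of option three: treat any fsync error as fatal, crash
 * immediately, and recover from the last checkpoint on restart, rather than
 * retrying or ignoring the error. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void fsync_or_die(int fd) {
    if (fsync(fd) != 0) {
        /* We no longer know which buffered writes reached the disk, so the
         * only safe move under this design is to stop. */
        perror("fsync failed; aborting to recover from checkpoint");
        abort();
    }
}
```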
So if you were MySQL or Postgres, what would you choose? How confident would you be that your choice guarantees durability? If you've heard the story before and you know the answer, then I promise there's a twist in the tale—something new that I think you will find surprising. But before we get to the answer, I want to warm up with a look at some of the cracks in buffered I/O.
None of these cracks on their own is enough to end an era of database design or begin another, but I think they point to something shaky in the foundations. First, writing to cache instead of disk means that you lose congestion control over the disk, with no mechanism for back pressure. This can result in significant latency spikes when the system has to write out gigabytes of dirty pages per second, and it's difficult to prioritize foreground and background I/O—you can't schedule your fsyncs apart from other applications' data in the page cache. Everything is just sharing this one page cache. There are ways around this, like sync_file_range, but the man page for that has this friendly warning: "This system call is extremely dangerous and should not be used in portable programs." Now, what does that mean? Who knows?
Buffered I/O is also all-or-nothing, so you can't handle errors related to specific writes; if something goes wrong, you don't know what it was or where. Finally, disks have now become so fast—on the order of 3 gigabytes per second—that they're starting to approach per-core memory bandwidth, which is on the order of 6 to 20 gigabytes per second, maybe, if you've got an M1. This means that if you're still using buffered I/O and doing memory copies to the kernel page cache for every write, then you're not only thrashing your L1 through L3 CPU caches and burning CPU to do the copies, but you're also potentially halving your memory bandwidth.
This is assuming that the copy to the page cache is the only copy in your data plane. If you need a second copy, maybe for networking or deserialization, then that can be almost all your memory bandwidth gone. So when it comes to buffered I/O, there's a crack in everything—but that's how the light gets in.
Let’s return to our foundational fsync question, and again, the question is: if fsync fails with an I/O error, what should the database do? How do you handle fsync failure? Do you ignore the error and pretend that your buffered writes are durable? Do you keep retrying the fsync in the hope that the kernel will retry all the buffered writes that failed? Do you crash and restart to recover from a checkpoint?
Of course, ignoring the error is not correct; it’s not an option. What about retrying? Indeed, for many years, most databases would retry. For example, Postgres would keep retrying a failed fsync until it succeeded, under the assumption that the kernel would retry any failed buffered writes. At least, this was the assumption for 20 years. Then something happened.
Five years ago, Craig Ringer posted to the Postgres mailing list to report that he had run into real data loss with Postgres. The critical database guarantee of durability, the "D" in ACID, had been violated. What was more stunning, I think, and perhaps the reason the incident became known as "fsyncgate," is that this wasn't due to a bug in Postgres per se.
Postgres had followed Linus's advice and relied on the design of the page cache. Yet even when Postgres was careful to retry after an I/O error, it was still working with stale data. This was because, in the kernel, when a buffered write failed due to a disk error, the kernel was in fact simply marking the dirty pages as clean, even though they had not been written properly to disk. This means that Postgres might not even see the error from the first fsync if another process had consumed it first. But either way, the second time around, when Postgres retried the fsync, the fsync would succeed, and Postgres would proceed as if the data were committed to disk, despite the relevant pages still not having been made durable. So again, the dirty pages would simply be marked clean in the kernel, and the kernel developers maintained that this marking-clean behavior was necessary, for example, to avoid out-of-memory situations if a USB stick was pulled out,
so that the page cache wouldn't fill up with dirty pages that could never be flushed. But it rocked the database world. After the discovery of fsyncgate, Postgres, MySQL, and other affected databases decided to fix the issue by changing their answer to the foundational fsync question: instead of retrying fsync upon failure as before, they would crash and then recover from the data file on disk.
Jonathan Corbet wrote about PostgreSQL's fsync surprise, and then a year later, in 2019, with the fix in place, Tomas Vondra gave a fascinating talk about this, asking how it is that Postgres used fsync incorrectly for 20 years, and what we'll do about it. Up to now, if I've told the story properly, I think you'll agree that this was the correct fix for fsyncgate.
And this is where the story ends, right? Perhaps you're still wondering where all the seismic shifts in database design that I promised at the beginning are, because it looks like the fix for fsyncgate was not a major design change—not even a minor design change—but just a simple panic, to restart and recover. A clever fix for sure, but a simple fix.
Well, here we come to the part of the story that I find is less often told. The story picks up two years later, in 2020, when the University of Wisconsin-Madison asked the question: can applications recover from fsync failures? You can probably guess what the answer is when some of the leading storage researchers, Remzi and Andrea Arpaci-Dusseau, are the ones asking it.
So if what fsyncgate first found was stunning, then I think what this paper found was even more so, because while databases such as SQLite and Postgres would now crash after an fsync EIO error, after restarting, their recovery would still read from the page cache—not the actual on-disk state. So they were potentially making recovery decisions based on non-durable pages that had been marked clean despite the fsync failure.
The paper also raised other ways in which fsync failure was not being handled correctly by these databases. For example, with ext4 in data mode, an EIO error could be suppressed and only returned on the next fsync call—so if you want to see the fsync error, you have to call fsync twice. Users were still at risk of data loss and corruption. But the most important finding, at least for me, was a little sentence at the end of the abstract.
Sometimes in a paper there are these little things that you almost gloss over, and then you read them again, and there's a sentence that just changes your understanding of things. For me, this sentence at the end of the abstract was one of those: "Our findings have strong implications for the design of file systems and applications that intend to provide strong durability guarantees."
So UW-Madison had shown that the classic buffered I/O design for databases of the past 30 years was now fundamentally broken. There was, in fact, no correct answer to the question of how to handle fsync failure under the buffered I/O design: all the answers were wrong, and none was sufficient for a database to guarantee durability. Instead, database design would need to evolve to guarantee actual durability rather than relying on these fragile foundations—to move from buffered I/O to direct I/O. Databases would need to take direct responsibility for durability. They would need to read and write the disk directly, to be sure that they always made decisions and acknowledgments based on durable data, instead of basing decisions on the contents of the kernel page cache.
I think the implications of this design change are enormous. It's one thing to design a database from scratch, like we did with TigerBeetle, to take advantage of direct I/O; it's something else entirely to try to retrofit an existing design for direct I/O. The reason is that you need to align all your memory allocations and I/O operations to the Advanced Format 4-kilobyte sector size. You need a new buffer pool, a new user-space page cache. And because the latencies of your writes to disk are now realistic—disk speed rather than memory speed—you can't afford to block on synchronous I/O anymore. You could have taken those shortcuts in your design before; now you actually need to implement proper asynchronous I/O in your write path.
So if you can do this, there are some incredible performance gains, but it's a complete overhaul of your whole database design. Or, as Andres Freund on the Postgres team said back in 2018, when this happened five years ago, efficient direct I/O usage is "a metric ton of work." Andres has since done a phenomenal amount of work on direct I/O for Postgres. We were in touch last week, and Andres shared that Thomas Munro is planning to merge some actual direct I/O support soon. At the same time, it's exactly five years tomorrow since the 28th of March 2018 and the first discovery of fsyncgate. So for all these reasons and all this history, I believe that this event in 2018 drew the first line in the sand; this is what marked the end of an era for database design. This was the first line, I think, and it almost passed us by.
But the design of new databases that intend to provide strong durability guarantees would have to change—and not only because of fsyncgate. Something else was about to happen in 2018: researchers at UW-Madison were about to discover something just as momentous, this time not in the buffered I/O design but in the write-ahead log design of almost every database. We started out by saying that for a database to guarantee durability, it must protect committed transactions through power loss.
One aspect of this is the write path, which we've looked at to compare buffered I/O with direct I/O: how a database writes transactions to disk and then reads them again at recovery. But how does a database actually recover these transactions after a crash? The idea is pretty simple and common to most databases; it's called the write-ahead log. I first learned the technique when I was diving into Redis a decade ago. Redis has what is called the AOF, which stands for Append Only File. This is a log on disk. When Redis wants to commit a transaction, it first appends the transaction to the end of the AOF and then calls fsync.
For example, if you want to execute a transaction in Redis to set the qcon key to "London," then Redis would append this to the log. I've simplified it a little here, but it's an elegant format. First, you've got the number of arguments—in this case, three. Then you've got the number of bytes in each argument: three bytes for the command, which is "set," four bytes for "qcon," and finally six bytes for "London."
So if we then want to update qcon to "New York," we append another transaction to the log in the same way. The trick is that Redis never updates anything in place in the log; it always appends, so data is never overwritten. This means that if the power goes, existing data is not destroyed. There might be some garbage data at the end of the log if a partial transaction was being appended when the power went. For example, if we try to set qcon to "San Francisco" and then the power goes, we might end up with a partial transaction like this.
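As a rough sketch of what these appended entries might look like on disk, using the argument-count and byte-count prefixes just described (the annotations and exact layout are illustrative, simplified in the same way as on the slides):

```
*3            <- three arguments follow
$3            <- 3 bytes: the command
set
$4            <- 4 bytes: the key
qcon
$6            <- 6 bytes: the value
London

*3
$3
set
$4
qcon
$8
New York

*3
$3
set
$4
qcon
$13
San Fr        <- power lost mid-append: a partial transaction at the tail
```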
But at startup, Redis figures this out and discards the partial transaction. This is safe to do because Redis would not yet have acknowledged the "San Francisco" transaction until it was durable in the log. So this write-ahead log design is the foundation for most databases. It's the crucial building block for ensuring atomic changes to database state. It's also the foundation for distributed databases.
So this is where each node in the cluster keeps its own write-ahead log, which is then appended to by the global consensus protocol, such as Viewstamped Replication, Raft, or Paxos. However, for distributed databases, the write-ahead log is doubly critical, because if anything goes wrong—if you have a bug that undermines durability so that you lose a transaction you've acknowledged to the rest of the cluster—then you've not only lost a copy of user data, you've also undermined the quorum voting that's going on in the consensus protocol.
So you mess with the quorum votes, and this can easily lead to split brain, which can in turn cause global cluster data loss. The write-ahead log is foundational to the design of a database, single-node or distributed, and there are many variants on this design. For example, I've shown you a simplified version of Redis's text format to give you the big idea, but most databases use a binary format.
They prefix each transaction in the log with a transaction header. There's also a checksum, so if at startup the database sees that the checksum doesn't match, it truncates the log from that point to discard the partial transaction. However, it's critical that only the last partial transaction—the transaction that was being appended as the power went out—is truncated.
The database must never truncate any other transactions in the write-ahead log, because those would have been acknowledged to the user as committed, and to truncate them would violate durability. At a high level, this is how most write-ahead log designs work today. But if we apply what we've learned from fsync, how does this design interact with real storage hardware?
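As a minimal, hypothetical sketch in C of this kind of recovery scan—not any particular database's actual format—each entry carries a length and checksum in its header, and the scan truncates at the first entry that doesn't check out:

```c
/* Hypothetical sketch of write-ahead-log recovery with per-entry headers. */
#include <stddef.h>
#include <stdint.h>

struct wal_header {
    uint64_t checksum; /* checksum over the transaction body */
    uint32_t length;   /* length of the transaction body in bytes */
};

/* Scan the log in `buf` of `size` bytes; return the offset at which to
 * truncate, i.e. the end of the last entry that validates. */
size_t recover(const uint8_t *buf, size_t size,
               uint64_t (*checksum)(const uint8_t *, size_t)) {
    size_t offset = 0;
    while (offset + sizeof(struct wal_header) <= size) {
        const struct wal_header *h = (const void *)(buf + offset);
        size_t body = offset + sizeof(*h);
        if (body + h->length > size) break;                        /* partial entry   */
        if (checksum(buf + body, h->length) != h->checksum) break; /* checksum fails  */
        offset = body + h->length;
    }
    return offset; /* any mismatch, whatever its cause, truncates from here */
}
```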
So can you spot the problem? We've seen that qcon "London" has already committed as a transaction through the write-ahead log. But what if, while the machine is off, the disk sector containing our qcon "London" transaction is corrupted and experiences just a single bit flip, so that the length prefix for "London" changes from six to four?
When the database starts up and reads the log, the qcon key now looks like it was set to "Lond." But the "New York" transaction after it, which was committed, now looks like it was being appended when the power went out, because the log no longer has the proper format—it appears to have garbage data at the end, from the leftover "on" onwards.
As far as I'm aware, in this situation most databases would truncate the log before "New York," and the committed "New York" transaction, as well as every transaction after it, would be lost, violating durability. For databases that checksum the transactions in their write-ahead log, the "London" transaction would also be truncated because of the checksum mismatch, even though it was in fact committed.
So in other words, a single disk sector failure is enough to break the write-ahead log design of most databases: they incorrectly truncate the committed log, potentially losing tens to hundreds of committed transactions. You can see the big problem. They're all trying to solve the problem of power loss and torn writes after a crash, and they're being fooled: a little bit of bit rot in the middle of the committed log gets conflated with a crash, and they rewind everything, which is not correct.
As far as I know, that's every database out there with a write-ahead log. If yours handles this, please let me know. But with these designs, users will experience silent data loss. For a single-node database, the correct behavior would be to fail fast and notify the user of the corruption, because at least then they've got a chance to restore from backup. But for distributed databases, it's worse.
This single disk sector fault also has the potential to do more damage. As we said, it can mess with the quorum votes and cause split brain, and that can cascade into global cluster data loss. Then you're not just losing one transaction's worth of user data; you're actually corrupting everything—the whole data file of every replica gets messed up.
So split brain cascades into cluster data loss. At least this was the finding of the research team that discovered this as they analyzed the write-ahead log designs of distributed systems using consensus protocols such as Raft.
They wanted to see what just one single disk sector fault on one node—on a single machine—could do to the cluster. And basically, it could do the worst. So can you guess who the team was?
Again, these were Remzi and Andrea Arpaci-Dusseau's students at UW-Madison. This time, the students were Ramnatthan Alagappan and Aishwarya Ganesan, who won Best Paper at FAST for their research, called "Protocol-Aware Recovery for Consensus-Based Storage." Can you guess the year? Again, 2018. And this was, in fact, a month before fsyncgate, and it totally passed us by.
We heard about fsyncgate; I don't think many of us have heard about this. But like fsyncgate, it's something we still haven't dealt with fully as an industry. Some cloud providers, I know, have patched their proprietary databases for these findings, but I'm not aware of other open-source databases that handle this correctly.
In the case of fsyncgate, it took users experiencing data loss—and being dedicated enough to report it and to figure out what had happened—to discover that this was not just corruption but a design flaw: the database could have handled the fsync failure, but it didn't.
Instead, it escalated a storage fault into unnecessary data loss. In the case of the write-ahead log design flaw, the one addressed by protocol-aware recovery, the research community found the failure proactively, but it has yet to trickle down to inform new designs.
How long until the write-ahead log powering a major Kubernetes deployment is truncated in a way that leads to split brain, loss of the control plane, and a public outage? I think this is the kind of paper you need to read three times to let the impact sink in.
So how do you fix a design flaw like this? How do you distinguish between power loss and bit rot, to know whether to truncate a torn write in the case of power loss, or to raise an error or do distributed recovery in the case of bit rot?
The strategy given by the paper, and what we do in TigerBeetle, is to supplement the write-ahead log with a second write-ahead log. So we've actually got two write-ahead logs in TigerBeetle—we love them so much. The second write-ahead log is very small and lightweight; it just contains a copy of the headers from the first.
The first write-ahead log has your transactions and transaction headers; the second write-ahead log has only the transaction headers. This turns the write-ahead log append into a two-step process.
First, you append to the main log as you normally would; then, after you append the transaction, you also append the small transaction header to the second log, so you've got another copy of it. This enables you to distinguish corruption in the middle of the committed log, caused by bit rot, from corruption at the end of the log, caused by power loss. But again, like the fix for fsyncgate, it's a major design change for a database to have two write-ahead logs rather than one.
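A minimal sketch in C of this two-step append, assuming plain file descriptors and the same hypothetical header struct as before; this is not TigerBeetle's actual code (TigerBeetle is written in Zig):

```c
/* Hypothetical sketch of the two-step append: write the full transaction to
 * the main log, then a copy of just its header to a second, much smaller log.
 * At recovery, a checksum mismatch in the main log can be cross-checked
 * against the header log: if the header log shows the entry was committed,
 * this is bit rot in the middle, not a torn write at the end from power loss. */
#include <stdint.h>
#include <unistd.h>

struct wal_header {
    uint64_t checksum;
    uint32_t length;
};

int append(int main_fd, off_t main_offset,
           int header_fd, off_t header_offset,
           const struct wal_header *header, const void *body) {
    if (pwrite(main_fd, header, sizeof(*header), main_offset) < 0) return -1;
    if (pwrite(main_fd, body, header->length,
               main_offset + (off_t)sizeof(*header)) < 0) return -1;
    if (fsync(main_fd) != 0) return -1;

    /* Second copy of the header only: small and cheap, but enough to tell
     * "corruption in the middle" apart from "garbage at the end". */
    if (pwrite(header_fd, header, sizeof(*header), header_offset) < 0) return -1;
    if (fsync(header_fd) != 0) return -1;

    return 0; /* only now acknowledge the transaction as committed */
}
```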
So it's not always easy to retrofit. There were also other findings in the protocol-aware recovery paper from UW-Madison that impact how we design distributed databases, showing that with protocols like Raft, even if the operator is lucky and a single disk fault doesn't lead to global cluster data loss, the fault can still cause the cluster to become prematurely unavailable, because of how storage faults are handled in the write-ahead log.
So you're paying for durability, but you're not getting the availability. Fixing this is also going to require design changes. For example, in the past, the global consensus protocol and the local storage engine would be separate modules or components; we used to think of them as completely decoupled, not integrated.
But if you want your distributed database to maximize availability—to tolerate up to the theoretical limit of faults that your consensus protocol should be able to tolerate—then how your local storage engine recovers from storage faults in the write-ahead log needs to be properly integrated with the global consensus protocol. It's no longer enough to just cobble together off-the-shelf Raft with an off-the-shelf storage engine like RocksDB or LevelDB.
Instead, according to the paper, you need to take both of these and make them storage-fault-aware, so they can talk to each other and both recover. It sounds complicated, but it's really simple—it's an obvious idea: use your replicated redundancy to heal your local write-ahead log. But it's a design change that's fundamental enough, I think, to signal a new era for how we design distributed databases.
You have to use a consensus protocol that implements this and a storage engine that implements it, and both must talk to each other through these new interfaces. So it's a big design change. To summarize, what both fsyncgate and protocol-aware recovery from UW-Madison have in common is that storage faults force us to reconsider how we design our databases to write to disk—or even just to append to a write-ahead log—and how they recover at startup.
If we can extract a principle from this, it is that if we want to guarantee durability, we need to move beyond a crash-safety model—the model we've had up until now, durability through power loss, the idea that the plug can be pulled at any second. Sure, we need to support that, but we need more. We need to adopt an explicit storage fault model, so that we can test our designs against the full range of storage faults we expect: not only crash safety but storage fault safety.
In the security world, this would be the equivalent of a threat model. Just as being upfront about your threat model can make for better security design, I think being upfront about your storage fault model can make for better durability—or for durability at all. The formal proofs for Paxos or Raft assume perfect, pristine storage. Instead, we think of disks as "near-Byzantine"—that's our own made-up term at TigerBeetle—somewhere between non-Byzantine fault tolerance and Byzantine fault tolerance. Disks are somewhere in between, and you do well if you expect the disk to be almost an active adversary.
So for TigerBeetle, we saw both these events of 2018 as an opportunity. We could take the growing body of storage fault research—there's so much out there, including all the other papers coming out of UW-Madison—and design TigerBeetle to be not only crash-tolerant but also storage-fault-tolerant. We started to see a single disk as a distributed system in its own right: a whole faulty microcosm, much like the network fault model with its partitions, where you can get cut off from disk sectors. You can have bugs in the disk firmware, in the device drivers, even in the file systems. You can have latent sector errors, where at least you get an explicit EIO from the kernel, as we've seen. But you can also get silent corruption where you don't—caused by bit rot, or even lost or misdirected reads and writes, where the disk sends the I/O to the wrong sector for some reason.
So, for example, here is how we designed TigerBeetle to read our two write-ahead logs at startup, even in the presence of storage faults. What we really liked about this approach was that we enumerated all the possible faults we expected, according to the fault model, while recovering from both write-ahead logs—the transaction log and the header log—and then we enumerated all the combinations in a matrix.
My co-founder DJ and I worked these out together by hand, and then we generated all the failure-handling branches in the code. This way we could ensure that we wouldn't miss a fault combination: we enumerated them all, worked out what the proper response should be for each, and then generated the code so that all the branches are there.
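A hypothetical sketch in C of what such an enumerated matrix could look like; the states and decisions below are invented for illustration and are not TigerBeetle's actual recovery table:

```c
/* Hypothetical sketch: enumerate what the main (transaction) log and the
 * header log can each say about an entry, and make the recovery decision for
 * every combination explicit, so no case is left to an ad-hoc branch. */
enum state { VALID, CORRUPT, TORN, MISSING };
enum decision { USE_ENTRY, REPAIR_FROM_REPLICAS, TRUNCATE_TORN_TAIL, FATAL };

/* recovery[main log state][header log state] — illustrative values only. */
static const enum decision recovery[4][4] = {
    [VALID]   = { [VALID] = USE_ENTRY, [CORRUPT] = USE_ENTRY,
                  [TORN] = USE_ENTRY, [MISSING] = USE_ENTRY },
    [CORRUPT] = { [VALID] = REPAIR_FROM_REPLICAS, [CORRUPT] = REPAIR_FROM_REPLICAS,
                  [TORN] = REPAIR_FROM_REPLICAS, [MISSING] = REPAIR_FROM_REPLICAS },
    [TORN]    = { [VALID] = REPAIR_FROM_REPLICAS, [CORRUPT] = REPAIR_FROM_REPLICAS,
                  [TORN] = TRUNCATE_TORN_TAIL, [MISSING] = TRUNCATE_TORN_TAIL },
    [MISSING] = { [VALID] = REPAIR_FROM_REPLICAS, [CORRUPT] = FATAL,
                  [TORN] = TRUNCATE_TORN_TAIL, [MISSING] = TRUNCATE_TORN_TAIL },
};
```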
Of course, these storage faults are rare compared to whole-machine failures, but in large-scale deployments even rare failures become prevalent. At the same time, while we had the opportunity to focus on safety—to design for an explicit storage fault model and for direct I/O from scratch—we also took the opportunity to ask: how could we take advantage of direct I/O from a performance perspective?
In the past, when working with direct I/O, you would use asynchronous I/O so that, as we've said, your system calls don't block your main process. However, historically on Linux, the kernel API for async I/O wasn't great, so you'd have to implement the illusion of asynchronous I/O by means of a user-space thread pool—think of libuv. Your database control plane would submit I/O to a user-space thread pool, which would then make a blocking syscall to the kernel to do the I/O. The problem with this is that your control plane first context-switches to your thread pool, then your thread pool pays the cost of the blocking syscall to the kernel, and then you context-switch back to your control plane.
So in July 2020, as we were designing TigerBeetle, we were asking the performance questions: what if we could have first-class asynchronous I/O, but without the complexity of a user-space thread pool?
With the cost of a context switch in the range of a microsecond, context switches are approaching the same order of magnitude as doing the I/O itself. So what if you also didn't need a context switch? We loved asking these questions: let's get rid of the thread pool, get rid of the context switch. Even better, with the Spectre and Meltdown mitigations making syscalls slower, we asked: what if you didn't even need to syscall into the kernel? We wanted to embrace direct I/O almost to the point of bypassing the kernel completely—bypassing the page cache, the context switches, the user-space thread pool, and the syscalls. But we didn't want the complexity of actually bypassing the kernel completely.
For example, we didn't want to use kernel-bypass techniques like SPDK or DPDK, because that's a lot more work. That's what Redpanda does—which is amazing, one of my favorite databases—and it's high-performance, but it's also more complex, I think.
So finally, we wanted a unified asynchronous I/O interface with this simple API for networking and storage.
Here again, we were lucky with the timing, because just as we were thinking about these things, Jens Axboe came along with his new io_uring API for Linux, which was arriving on the scene and starting to make waves. The reason is that it single-handedly fixed Linux's asynchronous I/O problem. He landed a series of magnificent patches to the kernel: he started at the end of 2018, touching AIO a little, and then the first io_uring patches came in early 2019.
With io_uring, user space gets a ring buffer to submit I/O to the kernel without blocking, and the kernel gets another ring buffer to send I/O completions back to user space, again without blocking. Syscalls can be amortized or avoided, and there's no need for a user-space thread pool. You can now have a single-threaded event-loop control plane—and, get this, you get to use the kernel's own thread pool as your data plane. No more thread pool of your own, and it's much more efficient.
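As a minimal sketch of the io_uring flow, using the liburing helper library (illustrative only, not TigerBeetle's code—TigerBeetle works with io_uring from Zig): submit a read on the submission queue and then harvest its completion from the completion queue.

```c
/* Hypothetical sketch: one read submitted through io_uring via liburing. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) != 0) return 1;

    int fd = open("data.db", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); /* submission queue entry */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);   /* describe the read      */
    io_uring_submit(&ring);                             /* hand it to the kernel  */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                     /* completion queue entry */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```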
It's also significantly simpler, which I really love. If you've enjoyed Martin Thompson's brilliant talks at QCon over the years as much as I have, then you'll recognize that the design of io_uring has what Martin calls mechanical sympathy. And if you don't know that phrase, then you haven't watched Martin's talks—please go watch them; they're fantastic, and they've made a huge impact on TigerBeetle's design. Again, io_uring is a beautiful design. What I love most of all is that it gives the perfect interface not only for file and disk I/O but also for network I/O. Historically, storage and networking had different APIs; now, with io_uring, you've got a unified interface for both. We really couldn't ask for a better I/O API. In fact, it's so good that I think io_uring alone is a perfect excuse to redesign the way our databases do I/O in 2023.
So a round of applause, please!
Given these major design changes already, at what point do we start to consider the next steps for databases? If we're going to write new databases for the future—and we've made that decision, we're going to do this—and if we're going to write them with new designs, what languages are we going to write them in? Are we going to use the systems languages of the last 30 years, C and C++? Or are we going to use the systems languages of the next 30 years?
These questions were on our mind when we were deciding whether to write TigerBeetle in C or in a new systems language called Zig, which we had been following for two years by that point. Barbara Liskov said that if you want to teach programmers new ideas, you need to give them new languages to think those ideas in.
What we came to appreciate with Zig is that it fixed the issues we had with C and made it easier to write correct code. It also resonated with our thinking on how to work with memory efficiently. Memory is probably the most important aspect of systems programming: how good are you at working with memory, and how sharp are your tools for doing so?
For example, it was refreshing how easily we could implement direct I/O in Zig, where all the memory you pass to the disk is sector-aligned and you can enforce this in the type system. Even the allocators support aligned allocation without special calls. With Zig, there's also no second-class macro language. Instead, you simply program in Zig at compile time, as your binary is being compiled, so your meta-language is also Zig.
We considered Rust, but io_uring and the ability to use the kernel's thread pool without context switches meant we had less need or desire for fearless multithreading in user space—we were trying to get away from that. The borrow checker would still have been useful for logical concurrency bugs in a single-threaded event loop, but we didn't want to do multithreading in the first place.
We also wanted TigerBeetle to follow NASA's Power of Ten rules for safety-critical code. You have to handle memory allocation failure if you're a database. We also wanted to enforce explicit limits on all resource usage in the design: even for loops, there's no "while true"—every loop must terminate, and there's an expected bound on how long it can possibly run.
Another big example of this is TigerBeetle's use of static memory allocation. This means that all the memory TigerBeetle needs is calculated and allocated at startup; after that, there's no dynamic allocation at runtime. It's almost a lost art, but we wanted to bring it back, to decouple database performance from memory allocation, for extreme memory efficiency and a predictable operating experience. After you start TigerBeetle, there's no more malloc or free, no risk of fragmentation—just pure, predictable performance.
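A minimal, hypothetical sketch of the pattern in C; the struct names and limits are invented for illustration and are not TigerBeetle's actual configuration:

```c
/* Hypothetical sketch of static allocation: compute every buffer size up front
 * from configured limits, allocate once at startup, and never call malloc or
 * free again on the hot path. */
#include <stdlib.h>

struct config { size_t max_clients; size_t max_message_size; size_t journal_size; };

struct database {
    unsigned char *journal;      /* fixed-size write-ahead log buffer   */
    unsigned char *message_pool; /* one slot per client, sized up front */
};

int database_init(struct database *db, const struct config *cfg) {
    db->journal = malloc(cfg->journal_size);
    db->message_pool = malloc(cfg->max_clients * cfg->max_message_size);
    if (!db->journal || !db->message_pool) return -1; /* handle allocation failure */
    return 0; /* from here on: no dynamic allocation, no fragmentation */
}
```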
We've made resource usage explicit like this throughout the design, with memory and with other resources as well—it's like cgroups in your database. This is striking, I think, compared to designs where the limits have not been thought through or made explicit; we really enjoy this model of building a database. And because TigerBeetle makes everything explicit, from the storage fault model to resource usage, we can test everything—and actually test it to the limit. We've got all the limits; now we can test them.
For example, we inject faults to simulate corruption on the read or write path, in the double-digit percentages, on every node in the cluster, even the primary—up to the theoretical limit—and then we check that TigerBeetle doesn't truncate committed transactions and that the write-ahead log design is working. Can it heal? Can the write-ahead log use replicated redundancy to heal itself? Can we preserve durability? Can we remain available?
What's also powerful about this testing is that everyone on our team can run this simulator on their local laptop, and every bug the simulator finds can be reproduced deterministically and replayed again and again, for incredible developer velocity. When you're building a new feature, you just run the simulator.
And this is the final major design decision in TigerBeetle: the whole database has been designed from the ground up as a deterministic distributed database. This means that all the abstractions are deterministic.
For example, the control plane is single-threaded, so there's no non-determinism from the operating system's thread scheduler, which could otherwise be a source of randomness. Even the source of time—the clock in each node of the database—is deterministic: it's an abstraction that you can tick; it's got a tick method, just like you would tick the second hand of a clock. We can shim the source of time, or the disk, or the message bus—we can give the message bus a network that's actually backed by a packet simulator—and then we can run a whole cluster of TigerBeetle replicas in a single process.
This has a second-order effect: because we control the whole simulation, we can literally speed up time—I kid you not. Instead of ticking time every 10 milliseconds, as we would normally do if we wanted 10-millisecond granularity in our timeouts, we can tick time in a tight loop, so that every iteration of the loop simulates 10 milliseconds of real-world time in a fraction of the time.
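A minimal, hypothetical sketch of the pattern in C; the types and functions are invented for illustration and are not the simulator's actual code:

```c
/* Hypothetical sketch of time dilation: the clock is an abstraction that only
 * advances when tick() is called, so the simulator can call it in a tight loop
 * instead of waiting 10 ms of wall-clock time per tick. */
#include <stdint.h>

struct sim_clock { uint64_t ticks; }; /* each tick = 10 ms of simulated time */

static void tick(struct sim_clock *clock) { clock->ticks += 1; }

void run_simulation(struct sim_clock *clock, uint64_t target_ticks,
                    void (*step)(uint64_t ticks)) {
    while (clock->ticks < target_ticks) {
        tick(clock);        /* advance simulated time, no sleeping            */
        step(clock->ticks); /* drive replicas, timeouts, network, disk faults */
    }
    /* e.g. 235,000 ticks = ~39 minutes of simulated time in a few real seconds */
}
```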
We actually worked out the time-dilation numbers for this last week: an average run of the TigerBeetle simulator takes just 3.3 seconds to run a pretty interesting simulation, with around 64 committed operations and all kinds of faults—just 3.3 seconds. That run executes 235,000 clock ticks, each representing 10 milliseconds of real-world time—in other words, 39 minutes in total. So you can run the simulator for 3.3 seconds on your laptop and achieve the equivalent of 39 minutes of simulated test time, full of all kinds of network latencies, packet drops, partitions, crashes, disruptions, disk slowdowns—every possible fault. And if you want to debug something that would normally take 39 minutes to manifest, you can now do that in just 3.3 seconds.
That's a speed-up factor of 712 times. Existing test harnesses like Jepsen are fantastic, but they're not deterministic—if you find a bug, you might not find it again.
They also run in real time, so if you want 39 minutes of test time, you have to give up 39 minutes of real-world time on your laptop; you don't get the same developer velocity. Being able to speed up time like this feels like magic, like a silver bullet.
What I'm most excited about is that because everything in TigerBeetle is abstracted—even the state machine logic—you can take this accounting state machine out, put another state machine in, and you've got a whole new distributed database that benefits from all the fault-tolerance testing of TigerBeetle.
For example, we've done this internally to create our own control-plane database in a week—something that might have taken years before. And that's why I believe we're really in a new era: the rate at which new databases can be created is going to accelerate, and they're going to operate and be tested at much tighter tolerances than anything we've seen before.
Each of these advances—direct I/O, protocol-aware recovery for consensus-based storage, an explicit storage fault model, io_uring, an efficient replacement for C in Zig, deterministic simulation testing—each of these on its own, I think, makes for a whole new dimension in database design, hard to retrofit. But taken together, these advances are going to unlock an abundance of new, correct, and high-performance open-source database management systems tailored to the domain. It's a new era for database design, and I think it's a good day to be doing this.