Architecture: Relational Databases
Databases store the state of our applications, yet plenty of developers only have a fuzzy picture of how they work underneath. This post digs into two of the pieces that matter most when working with an RDBMS: indexes and transactions.
What is an index?
An index is a data structure that makes finding specific rows in a table faster, the same way the index at the back of a textbook saves you from reading every page to find one topic.
It works by keeping a separate, smaller structure that holds part of the table’s data, ordered so it can be searched faster than scanning the whole table. That speed isn’t free. An index needs extra storage and memory, and it slows writes down, since every insert or update has to keep the index current. The textbook comparison is a fine place to start, but real database indexes get more involved than that.
Why do we need indexes?
Indexes exist to get the data you want as quickly as possible.
The more data you have, the harder that gets. Think of the difference between keeping a class attendance list and keeping a birth registry for a whole city. A short list is easy to scan; at city scale, finding one record by reading everything stops being practical.
Databases have a few tricks for this, but past a certain size you need something more deliberate, and that’s where indexes come in.
How do indices actually work?
💡 Using an index makes it faster to find the data you are looking for, but it also makes it slower to add new data. This is because the index needs to be updated every time you add new information to the database.
A simple idea would be ordering data in the way we want to query it. Unfortunately, in relational databases, it usually happens that we want to query data in multiple ways. Therefore, we need a different place to keep our data ordered.
An index creates a separate data structure that stores a subset of the data as well as a pointer to the corresponding rows on disk, organized in a specific way to optimize search performance for specific queries. The most common type of index is the B-tree index, which organizes data in a hierarchical tree structure of index leaf nodes, allowing for efficient searching and sorting of data. Since this structure requires index leaf nodes to be sorted logically, we need to find a way to add data quickly without having to move data or edit other entries.
This is done via a doubly linked list. Every node has links to two neighboring entries, very much like a chain. New nodes are inserted between two existing nodes by updating their links to refer to the new node. The physical location of the new node doesn’t matter because the doubly linked list maintains the logical order. The data structure is called a doubly linked list because each node refers to the preceding and the following node. It enables the database to read the index forwards or backwards as needed. It is thus possible to insert new entries without moving large amounts of data - it just needs to change some pointers.
Indexes can be created on one or more columns in a table, and a single table can have multiple indexes.