Architecture: Relational Databases

Databases are crucial for storing the states of our applications, but many people have a limited understanding of how they operate. Today, I will discuss two essential topics when working with RDBMSs: indexes and transactions. These are foundational concepts that are crucial for the overall success of most systems.

What is an index?

Indexes are a crucial aspect of data management in Relational Database Management Systems (RDBMS). They are data structures that facilitate faster data retrieval by minimizing the time required to search for specific records within a table, similar to indexes in a textbook.

This is accomplished by creating a separate data structure that stores a subset of the data in the table, which can be searched more quickly than the entire table. However, it should be noted that the use of indexes comes with additional costs, including increased storage and memory requirements, as well as slower write performance due to the overhead of maintaining the index. Although the analogy of an index in a textbook may serve as a useful introduction to the concept, it is important to note that the intricacies of database indexes are far more complex.

Why are indices necesarry?

We need indexes to help us get the relevant data we need as quickly as possible.

As the volume of data increases, the management of such data becomes increasingly challenging. For example, consider the difference in managing a small attendance list for a class versus a birth registry for a large city. While small amounts of data may be easily manageable, as the scale increases, so does the difficulty in efficiently retrieving and manipulating that data.

To combat this issue, various strategies have been implemented in database management systems. However, as data continues to grow, the need for more sophisticated and optimized methods of data management like inddices becomes imperative.

How do indices actually work?

💡

Using an index makes it faster to find the data you are looking for, but it also makes it slower to add new data. This is because the index needs to be updated every time you add new information to the database.

A simple idea would be ordering data in the way we want to query it. Unfortunately, in relational databases, it usually happens that we want to query data in multiple ways. Therefore, we need a different place to keep our data ordered.

An index creates a separate data structure that stores a subset of the data as well as a pointor to the corresponding rows on disk, organized in a specific way to optimize search performance for specific queries. The most common type of index is the B-tree index, which organizes data in a hierarchical tree structure of index leaf nodes, allowing for efficient searching and sorting of data. Since this structure requires index leaf nodes to be sorted logically, we need to find a way to add data quickly without having to move data or edit other entries.

This is done via a doubly linked list. Every node has links to two neighboring entries, very much like a chain. New nodes are inserted between two existing nodes by updating their links to refer to the new node. The physical location of the new node doesn’t matter because the doubly linked list maintains the logical order. The data structure is called a doubly linked list because each node refers to the preceding and the following node. It enables the database to read the index forwards or backwards as needed. It is thus possible to insert new entries without moving large amounts of data—it just needs to change some pointers.

Indices can be created on one or more columns in a table and multiple indexes can be created on a single table,