Awesome Database Development

Database development is interesting and challenging. You can always find interesting things to learn and challenging problems to solve. You need to get a lot of things right to build a reliable and high-performance database. And it takes time, a lot of time, to think and practice. I have been working on databases for ten years. However, as the proverb goes, the more I know, the more I realize I don't know. So, I collect the database development materials I have read here to review them from time to time. I think it will be helpful to those who share the same interests as me.

Storage Device

Media

Hard Disk Drive (HDD)
How HDD Works (Video)
The Development of HDD Technique (Video)
Solid-State Drive (SSD)
Coding for SSDs
How Flash Memory Works (Video)

Interface

A History of PC Buses - From ISA to PCI Express (Video)
Small Computer System Interface (SCSI)
Serial Attached SCSI (SAS)
Understanding SCSI (Video)
AT Atachment (ATA)
Serial AT Attachment (SATA)
Advanced Host Controller Interface (AHCI)
Peripheral Component Interconnect (PCI)
PCI Express (PCIe)
NVM Express (NVMe)

Operating System

Kernel

Understanding the Linux Kernel
Understanding the Linux Virtual Memory Manager
Linux Device Drivers
The Slab Allocator: An Object-Caching Kernel Memory Allocator (1994)

This paper presents a comprehensive design overview of the SunOS 5.4 kernel memory allocator. This allocator is based on a set of object-caching primitives that reduce the cost of allocating complex objects by retaining their state between uses.

File system

ext4 Data Structures and Algorithms
The Design and Implementation of a Log-Structured File System (1991)

This paper presents a new technique for disk storage management called a log-structured file system. A log- structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery.
SFS: Random Write Considered Harmful in Solid State Drives (FAST, 2012)

In this paper, we propose a new file system for SSDs, SFS. First, SFS exploits the maximum write bandwidth of SSD by taking a log-structured approach. SFS transforms all random writes at file system level to sequential ones at SSD level. Second, SFS takes a new data grouping strategy on writing, instead of the existing data separation strategy on segment cleaning. It puts the data blocks with similar update likelihood into the same segment. This minimizes the inevitable segment cleaning overhead in any log-structured file system by allowing the segments to form a sharp bimodal distribution of segment utilization.

Modern hardware

What Every Programmer Should Know About Memory (2007)

This paper explains the structure of memory subsystems in use on modern commodity hardware, illustrating why CPU caches were developed, how they work, and what programs should do to achieve optimal performance by utilizing them.
What Every Systems Programmer Should Know About Concurrency (2018)

Seasoned programmers are familiar with tools like mutexes, semaphores, and condition variables. But what makes them work? How do we write concurrent code when we can’t use them, like when we’re working below the operating system in an embedded environment, or when we can’t block due to hard time constraints? And since your system transforms your code into things you didn’t write, running in orders you never asked for, how do multithreaded programs work at all? Concurrency — especially on modern hardware — is a complicated and unintuitive topic, but let’s try to cover some fundamentals.
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask (2013)

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non- uniform – to multi-socket – directory and broadcast-based – many-cores.

Storage virtualization

Logical volume management
Logical Volume Management (LVM) - Linux (Video)
Redundant Array of Independent Disks (RAID)
What is RAID 0, 1, 2, 3, 4, 5, 6 and 10 (1+0)? (Video)

Storage Engine

The Five-Minute Rule for Trading Memory for Disc Accesses (SIGMOD, 1987)
The Five-Minute Rule 10 Years Later, and Other Computer Storage Rules of Thumb (SIGMOD, 1997)
The Five-Minute Rule 20 Years Later, and How Flash Memory Changes the Rules (2007)
The Five-Minute Rule 30 Years Later, and its Impact on the Storage Hierarchy (2017)
The Log-Structured Merge-Tree (LSM-Tree) (1996)

This paper presents the Log-Structured Merge-tree (LSM-tree), a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort.
Weaving Relations for Cache Performance (VLDB, 2001)

This paper presents a new data organization model, Partition Attributes Across (PAX), that significantly improves cache performance by grouping together all values of each attribute within each page.
Cache-Oblivious Streaming B-trees (SPAA, 2007)

This paper presents two cache-oblivious streaming B-trees, the shuttle tree, and the cache-oblivious lookahead array (COLA).
Bitcask: A Log-Structured Hash Table for Fast Key/Value Data (Basho, 2010)
bLSM: A General Purpose Log Structured Merge Tree (SIGMOD, 2012)

This paper presents bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches. bLSM uses Bloom filters to improve index performance and uses spring and gear scheduler to avoid long write pauses.
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases (2013)

This paper presents ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes.
The Bw-Tree: A B-tree for New Hardware Platforms (ICDE, 2013)

This paper presents Bw-Tree, a new form of B-Tree that achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips.
LLAMA: A Cache/Storage Subsystem for Modern Hardware (VLDB, 2013)

LLAMA is a subsystem designed for new hardware environments that supports an API for page-oriented access methods, providing both cache and storage management.
Hekaton: SQL Server’s Memory-Optimized OLTP Engine (SIGMOD, 2013)

This paper presents Hekaton, a new database engine optimized for memory resident data and OLTP workloads. Hekaton uses only latch-free data structures and a new optimistic, multiversion concurrency control technique.
WiscKey: Separating Keys from Values in SSD-Conscious Storage (USENIX, 2016)

This paper presents WiscKey, a persistent LSM-Tree-based key-value store with a performance-oriented data layout that separates keys from values to minimize I/O amplification.
PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees (SOSP, 2017)

This paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level.
TinyLFU: A Highly Efficient Cache Admission Policy (2017)

This article proposes to use a frequency-based cache admission policy in order to boost the effectiveness of caches subject to skewed access distributions. Given a newly accessed item and an eviction candidate from the cache, our scheme decides, based on the recent access history, whether it is worth admitting the new item into the cache at the expense of the eviction candidate.
Monkey: Optimal Navigable Key-Value Store (SIGMOD, 2017)

This paper presents Monkey, an LSM-based key-value store that strikes the optimal balance between the costs of updates and lookups with any given main memory budget. The insight is that worst-case lookup cost is proportional to the sum of the false positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits-per-element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize this sum.
Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging (SIGMOD, 2018)

We introduce Lazy Leveling, a new design that removes merge operations from all levels of LSM-tree but the largest. Lazy Leveling improves the worst-case complexity of update cost while maintaining the same bounds on point lookup cost, long range lookup cost, and storage space. We further introduce Fluid LSM-tree, a generalization of the entire LSM-tree design space that can be parameterized to assume any existing design.
The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models (2018)
Faster: A Concurrent Key-Value Store with In-Place Updates (2018)
X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing (2019)
MyRocks: LSM-Tree Database Storage Engine Serving Facebook's Social Graph (2020)
The Case for Learned Index Structures (2018)
The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds (2020)

SQL

Access Path Selection in a Relational Database Management System (SIGMOD, 1979)

This paper presents a cost-based SQL optimizer in System R. The optimizer estimates the cost of access paths in terms of I/O and CPU costs, using statistics about the contents of each relation.
The Volcano Optimizer Generator: Extensibility and Efficient Search (ICDE, 1993)

This paper presents an optimizer generator that translates a model specification into an optimizer source code. It also provides a search engine to be used in all generated optimizers. The search engine is goal-oriented using directed dynamic programming search algorithms.
The Cascades Framework for Query Optimization (IEEE, 1995)
How We Built a Cost-Based SQL Optimizer (Cockroach Labs, 2018)
How We Built a Vectorized SQL Engine (Cockroach Labs, 2019)

Transaction

Granularity of Locks and Degrees of Consistency in a Shared Data Base (IBM, 1975)

The first part of this paper introduces a locking protocol that allows simultaneous locking at various granularities in a database with a hierarchical structure. The second part of this paper introduces four degrees of consistency and the relationships of the four degrees to the locking protocol.
The Notion of Consistency and Predicate Locks in a Database System (IBM, 1976)

This paper proofs that two-phase locking (2PL) guarantees serializability and introduces predicate locks to address the problem of phantom reads.
A Critique of ANSI SQL Isolation Levels (SIGMOD, 1995)

This paper analyzes the ambiguities of ANSI isolation levels and provides clearer phenomena definitions. It also presents a new MVCC isolation level called snapshot isolation. A transaction in snapshot isolation reads data from a snapshot of the committed data as of the time the transaction started, and checks for write-write conflicts.
Generalized Isolation Level Definitions (ICDE, 2000)

This paper proposes a graph-based approach to define existing ANSI isolation levels.
Serializable Isolation for Snapshot Databases (SIGMOD, 2008)

This paper presents an algorithm to achieve serializable snapshot isolation based on anti-dependencies detection.
Large-scale Incremental Processing Using Distributed Transactions and Notifications (OSDI, 2010)

This paper presents Percolator, an incremental update processing system built on top of Bigtable. Percolator provides snapshot isolation transactions using a two-phase commit protocol.
A Critique of Snapshot Isolation (EuroSys, 2012)

This paper presents a new MVCC isolation level called write-snapshot isolation. A transaction in write-snapshot isolation checks for read-write conflicts instead of write-write conflicts in snapshot isolation.
Calvin: Fast Distributed Transactions for Partitioned Database Systems (SIGMOD, 2012)

This paper presents Calvin, a practical transaction scheduling and data replication layer that uses a deterministic ordering guarantee to significantly reduce the normally prohibitive contention costs associated with distributed transactions.
Highly Available Transactions: Virtues and Limitations (VLDB, 2013)

This paper introduces a taxonomy of highly available systems and analyze existing ACID isolation and distributed data consistency guarantees to identify which can and cannot be achieved in HAT systems.
Consistency in Non-Transactional Distributed Storage Systems (CSUR, 2016)

This paper provides a structured and comprehensive overview of different consistency notions and provide a partial order among different consistency predicates.
SLOG: Serializable, Low-latency, Geo-replicated Transactions (VLDB, 2019)

This paper presents SLOG, a system that provides strictly serializable ACID transactions at geo-replicated distance. SLOG achieves high-throughtput and low latency for transactions that initiate from a location close to the home region for data they access.
Transactional Causal Consistency for Serverless Computing (SIGMOD, 2020)
Distributed consistency at scale: Spanner vs. Calvin (Daniel Abadi, 2017)
NewSQL database systems are failing to guarantee consistency, and I blame Spanner (Daniel Abadi, 2018)
Consistency without Clocks: The FaunaDB Distributed Transaction Protocol (Fauna, 2018)
Demystifying Database Systems, Part 1: An Introduction to Transaction Isolation Levels (Fauna, 2019)
Demystifying Database Systems, Part 2: Correctness Anomalies Under Serializable Isolation (Fauna, 2019)
Demystifying Database Systems, Part 3: Introduction to Consistency Levels (Fauna, 2019)
Demystifying Database Systems, Part 4: Isolation levels vs. Consistency levels (Fauna, 2019)
Concurrency Control and Recovery in Database Systems (1987) (Book)

Distributed Algorithm

Theorem

Papers

Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (SIGACT, 2002)

This paper proves the CAP conjecture in the asynchronous network model, and then discusses solutions to this dilemma in the partially synchronous model.

Links

CAP Twelve Years Later: How the "Rules" Have Changed (Eric Brewer, 2012)

Consensus

Papers

Paxos Made Simple (Lamport, 2001)

The Paxos algorithm, when presented in plain English, is very simple.
Consensus on Transaction Commit (2004)

This paper presents the Paxos Commit algorithm. Paxos Commit runs a Paxos consensus algorithm on the commit/abort decision of each participant to obtain a transaction commit protocol that uses 2F + 1 coordinators and makes progress if at least F + 1 of them are working properly.
Paxos Made Live - An Engineering Perspective (PODC, 2007)

This paper presents the experience of building Chubby, a fault-tolerant storage system using the Paxos consensus algorithm.
There Is More Consensus in Egalitarian Parliaments (SOSP, 2013)

This paper presents the design and implementation of Egalitarian Paxos (EPaxos), a new distributed consensus algorithm based on Paxos that achieves uniform load balancing across all replicas.
Paxos Quorum Leases: Fast Reads Without Sacrificing Writes (SOCC, 2014)

This paper presents quorum leases, a technique that allows Paxos-based systems to perform consistent local reads on multiple replicas.
In Search of an Understandable Consensus Algorithm (USENIX, 2014)

This paper presents Raft, a consensus algorithm for managing a replicated log. Raft produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos. Raft is more understandable than Paxos and also provides a better foundation for building practical systems.

Consistency

Papers

Time, Clocks, and the Ordering of Events in a Distributed System (Lamport, 1978)

This paper discusses the partial ordering defined by the "happened before" relation, and gives a distributed algorithm for extending it to a consistent total ordering of all the events.
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Progranm (Lamport, 1979)

This paper defines the condition of sequential consistency and describes a method to ensure the sequential consistency of interconnecting sequential processors with memory modules.
Linearizability: A Correctness Condition for Concurrent Objects (1990)

This paper defines the condition of linearizability and discusses the differences between it and other correctness conditions.
Session Guarantees for Weakly Consistent Replicated Data (PDIS, 1994)

This paper proposes four per-session guarantees to aid users and applications of weakly consistent replicated data: Read Your Writes, Monotonic Reads, Writes Follow Reads, and Monotonic Writes.
Causal Memory: Definitions, Implementation and Programming (1995)

This paper defines the condition of causal consistency based on Lamport's "happened before" relation.
Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases (OPODIS, 2014)

This paper proposes a hybrid logical clock, HLC, that combines the best of logical clocks and physical clocks.

Links

Linearizability versus Serializability (Peter Bailis, 2014)

Replication

Papers

Chain Replication for Supporting High Throughput and Availability (OSDI, 2004)

Chain replication is a new approach to coordinating clusters of fail-stop storage servers. The approach is intended for supporting large-scale storage services that exhibit high throughput and availability with-out sacrificing strong consistency guarantees.
Conflict-free Replicated Data Types (SSS, 2011)

Distributed System

Papers

An Opportunity Cost Approach for Job Assignment in a Scalable Computing Cluster (2000)
The Google File System (SOSP, 2003)

This paper presents the Google File System, a scalable distributed file system for large distributed data-intensive applications.
Bigtable: A Distributed Storage System for Structured Data (OSDI, 2006)

This paper presents Bigtable, a distributed storage system for managing structured data that is designed to scale to a very large size.
Dynamo: Amazon’s Highly Available Key-value Store (SOSP, 2007)

This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon's core services use to provide an “always-on” experience.
Finding a needle in Haystack: Facebook’s photo storage (OSDI, 2010)
f4: Facebook’s Warm BLOB Storage System (OSDI, 2014)
Large-scale cluster management at Google with Borg (EuroSys, 2015)
Sharding the Shards: Managing Datastore Locality at Scale with Akkio (OSDI, 2018)

In this paper we present Akkio, a locality management service for distributed datastore systems whose aim is to improve data access response times and to reduce cross-datacenter bandwidth usage as well as the total amount of storage capacity needed.
Anna: A KVS for Any Scale (ICDE, 2018)
Autoscaling Tiered Cloud Storage in Anna (VLDB, 2019)
DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching (FAST, 2019)
Virtual Consensus in Delos (OSDI, 2020)

OLTP Database

Papers

Megastore: Providing Scalable, Highly Available Storage for Interactive Services (CIDR, 2011)

This paper presents Megastore, a storage system that blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.
Spanner: Google’s Globally-Distributed Database (OSDI, 2012)

This paper presents Spanner, a scalable, multi-version, globally-distributed, and synchronously-replicated database.
F1: A Distributed SQL Database That Scales (VLDB, 2013)

This paper presents F1, a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency.
Online, Asynchronous Schema Change in F1 (VLDB, 2013)

This paper describes a protocol for schema evolution in a globally distributed database management system with shared data, stateless servers, and no global membership.
Spanner: Becoming a SQL System (SIGMOD, 2017)

This paper highlights the database DNA of Spanner. It describes distributed query execution in the presence of resharding, query restarts upon transient failures, range extraction that drives query routing and index seeks, and the improved blockwise-columnar storage format.
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases (SIGMOD, 2017)
Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes (SIGMOD, 2018)
Socrates: The New SQL Server in the Cloud (SIGMOD, 2019)
CockroachDB: The Resilient Geo-Distributed SQL Database (SIGMOD, 2020)

This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distrib- uted transactions on commodity hardware.
TiDB: A Raft-based HTAP Database (VLDB, 2020)

Links

How CockroachDB Does Distributed, Atomic Transactions (Cockroach Labs, 2015)
How online schema changes are possible in CockroachDB (Cockroach Labs, 2016)
Living Without Atomic Clocks (Cockroach Labs, 2016)
Serializable, Lockless, Distributed: Isolation in CockroachDB (Cockroach Labs, 2016)
CockroachDB’s Consistency Model (Cockroach Labs, 2019)

OLAP Database

Papers

C-Store: A Column-oriented DBMS (VLDB, 2005)

This paper presents the design of a read-optimized relational DBMS that stores data by column.
MonetDB/X100: Hyper-Pipelining Query Execution (CIDR, 2005)

This paper presents a CPU efficient query processor that employs a vectorized query processing model. The processor uses loop-pipelined and cache-conscious operations to take advantage of modern CPUs.
Dremel: Interactive Analysis of Web-Scale Datasets (VLDB, 2010)
The Snowflake Elastic Data Warehouse (SIGMOD, 2016)
AnalyticDB: Real-time OLAP Database System at Alibaba Cloud (VLDB, 2019)
Procella: Unifying serving and analytical data at YouTube (VLDB, 2019)
Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing (VLDB, 2020)

Books

The Design and Implementation of Modern Column-Oriented Database Systems (2013)

This book discusses modern column-stores, their architecture and evolution as well the benefits they can bring in data analytics. There is a specific focus on three influential research prototypes, MonetDB, MonetDB/X100, and C-Store.

Miscellaneous

Papers

What Goes Around Comes Around (2005)
"One Size Fits All": An Idea Whose Time Has Come and Gone (ICDE, 2005)
The Seattle Report on Database Research (SIGMOD, 2019)

Books

Readings in Database Systems
Designing Data-Intensive Applications

Links

Notes on Distributed Systems for Young Bloods (2013)
How does a relational database work (2015)
Elements Of Scale: Composing And Scaling Data Platforms (2015)
Apache Kafka, Samza, and the Unix Philosophy of Distributed Data (2015)
What I Learned From Programming Databases (2016)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Awesome Database Development

Storage Device

Media

Interface

Operating System

Kernel

File system

Modern hardware

Storage virtualization

Storage Engine

SQL

Transaction

Distributed Algorithm

Theorem

Papers

Links

Consensus

Papers

Consistency

Papers

Links

Replication

Papers

Distributed System

Papers

OLTP Database

Papers

Links

OLAP Database

Papers

Books

Miscellaneous

Papers

Books

Links

Files

README.md

Latest commit

History

README.md

File metadata and controls

Awesome Database Development

Storage Device

Media

Interface

Operating System

Kernel

File system

Modern hardware

Storage virtualization

Storage Engine

SQL

Transaction

Distributed Algorithm

Theorem

Papers

Links

Consensus

Papers

Consistency

Papers

Links

Replication

Papers

Distributed System

Papers

OLTP Database

Papers

Links

OLAP Database

Papers

Books

Miscellaneous

Papers

Books

Links