Databases — SQL Fundamentals: Relational Models, ACID, Indexes, and Joins
Master relational database fundamentals: tables, normalization, ACID properties, B-tree indexes, and JOIN operations.
Why Databases First?
Before you can design a system that scales, you need to understand how data is stored and retrieved. Databases are the foundation of almost every system you'll design. If you get the database layer wrong, no amount of caching or load balancing will save you.
The Relational Model
A relational database organizes data into tables (relations), where each table has rows (tuples) and columns (attributes).
Core Concepts
- Table — A collection of related data (e.g.,
users,orders) - Row — A single record in a table
- Column — An attribute of the table (e.g.,
id,name,email) - Primary Key (PK) — A column (or set of columns) that uniquely identifies each row
- Foreign Key (FK) — A column that references the primary key of another table
Example: Users and Orders
CREATE TABLE users (
id INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE orders (
id INT PRIMARY KEY,
user_id INT NOT NULL,
total DECIMAL(10, 2),
status VARCHAR(20) DEFAULT 'pending',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(id)
);
The orders.user_id is a foreign key pointing to users.id. This establishes the relationship: each order belongs to one user.
Normalization
Normalization is the process of organizing data to minimize redundancy and dependency.
Normal Forms
| Form | Rule | Example Violation |
|---|---|---|
| 1NF | Atomic values (no repeating groups) | A phone_numbers column containing "555-0101,555-0102" |
| 2NF | 1NF + no partial dependency on composite PK | order_item table storing product_name (depends on product_id, not the full PK) |
| 3NF | 2NF + no transitive dependency | users table storing city and state when state depends on city |
When to Denormalize
Normalization is great for write-heavy systems (reduces update anomalies). But for read-heavy systems, denormalization can improve performance by reducing JOINs.
Rule of thumb: Normalize by default. Denormalize only when you can prove that specific queries are slow due to excessive JOINs. Always denormalize with a clear understanding of the trade-offs: faster reads, slower writes, risk of data inconsistency.
ACID Properties
ACID guarantees that database transactions are processed reliably. Every production database you use should be ACID-compliant.
| Property | What It Means | Example |
|---|---|---|
| Atomicity | All or nothing — the entire transaction succeeds or fails together | Transferring $100: debit account A AND credit account B. If one fails, both roll back. |
| Consistency | Data remains valid before and after the transaction | A CHECK constraint that balance >= 0 is never violated |
| Isolation | Concurrent transactions don't interfere with each other | Two users booking the last seat — only one succeeds |
| Durability | Once committed, data survives crashes | After COMMIT, data is on disk and survives power loss |
The Bank Transfer Example
BEGIN TRANSACTION;
-- Atomicity: both must succeed or both fail
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- Debit
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- Credit
COMMIT;
If the second UPDATE fails (e.g., account 2 doesn't exist), the first UPDATE is rolled back. No money is lost.
Interview question: "How does a database ensure durability?" — The answer involves the Write-Ahead Log (WAL). Before modifying data pages, the database writes the change to a sequential log on disk. On crash recovery, it replays the WAL.
Indexes — The Performance Multiplier
Without an index, the database must scan every row to find matching data (full table scan). With an index, it jumps directly to the relevant rows.
How B-Tree Indexes Work
Most relational databases use B-tree (balanced tree) indexes. Here's how they work:
Searching for "Smith":
- Root → go right (T-Z? No, N-S)
- N-S node → go right (Q-S)
- Leaf node → find "Smith"
This is O(log n) instead of O(n).
When to Create Indexes
- Columns used in
WHEREclauses frequently - Columns used in
JOINconditions - Columns used in
ORDER BYorGROUP BY - Columns with high selectivity (many distinct values)
When NOT to Create Indexes
- Small tables (full scan is fast enough)
- Columns with low selectivity (e.g.,
genderwith only 2 values) - Write-heavy tables (every INSERT/UPDATE/DELETE must update indexes)
Common mistake: Adding indexes to every column. Each index slows down writes (the database must maintain the index on every change). Start with indexes on foreign keys and frequently queried columns. Add more based on slow query analysis.
JOINs — Connecting Related Data
JOINs combine rows from two or more tables based on a related column.
Types of JOINs
-- INNER JOIN: Only matching rows from both tables
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- LEFT JOIN: All rows from left table, matching rows from right (NULL if no match)
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- RIGHT JOIN: All rows from right table, matching rows from left
SELECT u.name, o.total
FROM users u
RIGHT JOIN orders o ON u.id = o.user_id;
-- FULL OUTER JOIN: All rows from both tables
SELECT u.name, o.total
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id;
Visual Guide
INNER JOIN: [A ∩ B] Only rows where both tables match
LEFT JOIN: [A ∪ (A ∩ B)] All left rows + matching right rows
RIGHT JOIN: [B ∪ (A ∩ B)] All right rows + matching left rows
FULL OUTER: [A ∪ B] All rows from both tables
JOIN Performance
- Always index the JOIN columns (usually foreign keys)
- JOIN order matters — start with the most selective table
- Avoid JOINing huge tables without proper indexes
- Consider denormalization if a specific JOIN is a persistent bottleneck
Transaction Isolation Levels
Isolation levels control how visible uncommitted changes are to other transactions. Each level trades off consistency for performance.
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Performance |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Fastest |
| Read Committed | ❌ No | Possible | Possible | Fast |
| Repeatable Read | ❌ No | ❌ No | Possible | Moderate |
| Serializable | ❌ No | ❌ No | ❌ No | Slowest |
- Dirty Read: Reading uncommitted data from another transaction
- Non-Repeatable Read: Getting different results when reading the same row twice
- Phantom Read: New rows appearing between two identical range queries
Default levels: PostgreSQL defaults to Read Committed. MySQL InnoDB defaults to Repeatable Read. Most applications are fine with the defaults.
What to Remember for Interviews
- ACID — Atomicity, Consistency, Isolation, Durability — know each with an example
- Normalization — 1NF, 2NF, 3NF — know when to denormalize
- B-Tree indexes — O(log n) lookup, when to create and when not to
- JOINs — INNER, LEFT, RIGHT, FULL OUTER — know the difference
- Isolation levels — Read Committed vs Repeatable Read vs Serializable