NewSQL and Modern Trends
NewSQL and Modern Trends¶
Previous: 14. Distributed Databases | Next: 16. Database Design Case Study
The database landscape has evolved dramatically since the NoSQL revolution. A new generation of systems aims to combine the best of both worlds: the horizontal scalability and availability of NoSQL with the ACID guarantees and SQL interface of relational databases. Beyond NewSQL, entirely new categories of databases have emerged to serve specialized workloads -- vector similarity search for AI, time-series storage for IoT, and graph analytics at scale. This lesson surveys the cutting edge of database technology and the architectural innovations that make these systems possible.
Difficulty: βββ
Learning Objectives: - Explain the NewSQL design philosophy and how it reconciles ACID with horizontal scalability - Describe Google Spanner's TrueTime mechanism and external consistency - Compare CockroachDB, TiDB, and Spanner architectures - Understand vector databases and similarity search algorithms (HNSW, IVF) - Explain time-series database optimizations - Evaluate graph analytics platforms for large-scale analysis - Assess Database-as-a-Service offerings - Understand the data lakehouse paradigm and its relationship to databases
Table of Contents¶
- NewSQL: Motivation and Design Philosophy
- Google Spanner
- CockroachDB
- TiDB
- Vector Databases
- Time-Series Databases
- Graph Analytics at Scale
- Database-as-a-Service
- Data Lakehouse
- Exercises
- References
1. NewSQL: Motivation and Design Philosophy¶
1.1 The Gap Between SQL and NoSQL¶
By the early 2010s, the database world had split into two camps:
Traditional SQL NoSQL
ββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββ
β β ACID transactions β β β Horizontal scalability β
β β SQL interface β β β High availability β
β β Rich query capability β β β Flexible schemas β
β β Strong consistency β β β Low latency β
β β Horizontal scaling β β β No ACID (usually) β
β β Geographic distributionβ β β No SQL (or limited) β
β β Auto-failover β β β Weak consistency β
ββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββ
GAP: Can we have both?
NewSQL
ββββββββββββββββββββββββββββ
β β ACID transactions β
β β SQL interface β
β β Horizontal scalability β
β β High availability β
β β Strong consistency β
β β Geographic distributionβ
β β Auto-failover β
ββββββββββββββββββββββββββββ
1.2 NewSQL Definition¶
The term "NewSQL" was coined by Matthew Aslett of the 451 Group in 2011 to describe a new class of relational database management systems that:
- Provide ACID guarantees for read-write transactions
- Use SQL as the primary interface
- Scale horizontally across commodity hardware using shared-nothing architecture
- Achieve throughput comparable to or exceeding NoSQL systems for OLTP workloads
1.3 Architectural Innovations¶
NewSQL systems typically employ three key innovations:
1. Sharding with distributed transactions: Data is automatically partitioned (sharded) across nodes, but the system provides transparent distributed transactions via consensus protocols (Paxos/Raft), making the sharding invisible to the application.
2. Multi-version concurrency control (MVCC): Instead of locking, NewSQL systems use MVCC with globally ordered timestamps, allowing reads to proceed without blocking writes.
3. Consensus-based replication: Each partition (shard) is replicated across multiple nodes using Raft or Paxos, providing both high availability and strong consistency.
NewSQL Architecture (Generic):
Client βββΆ SQL Layer (Parse, Plan, Optimize)
β
βΌ
Transaction Layer (Distributed MVCC, 2PC/Raft)
β
βΌ
Storage Layer (Sharded, Replicated)
ββββββββ ββββββββ ββββββββ
βShard 1β βShard 2β βShard 3β
βR1,R2,R3β βR1,R2,R3β βR1,R2,R3β β Each shard has
ββββββββ ββββββββ ββββββββ 3 replicas (Raft group)
1.4 NewSQL vs Sharded PostgreSQL¶
A common question is: "Why not just shard PostgreSQL?" The answer reveals what makes NewSQL special:
| Feature | Sharded PostgreSQL (Citus) | NewSQL (CockroachDB) |
|---|---|---|
| Cross-shard transactions | Limited (2PC across shards) | Full ACID (Raft + MVCC) |
| Automatic resharding | Manual or semi-automatic | Automatic range splitting |
| Schema changes | Rolling DDL required | Online schema changes |
| Replication | Async/sync (per shard) | Raft (per range) |
| Failover | Manual or scripted | Automatic (Raft leader election) |
| Global distribution | Possible but complex | Built-in (geo-partitioning) |
| Compatibility | Full PostgreSQL | PostgreSQL wire protocol (subset) |
2. Google Spanner¶
2.1 Overview¶
Google Spanner, introduced in a 2012 paper, is the first system to distribute data at global scale while providing externally consistent (linearizable) transactions. It is the original NewSQL system that inspired all others.
Google Spanner Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Universe β
β β
β Zone (US-East) Zone (EU-West) Zone (Asia) β
β βββββββββββββββββ βββββββββββββββββ βββββββββββββββ β
β β Spanserver 1 β β Spanserver 4 β β Spanserver 7β β
β β Spanserver 2 β β Spanserver 5 β β Spanserver 8β β
β β Spanserver 3 β β Spanserver 6 β β Spanserver 9β β
β β β β β β β β
β β ββTrueTimeββ β β ββTrueTimeββ β β ββTrueTimeββ β
β β βGPS+Atomicβ β β βGPS+Atomicβ β β βGPS+Atom.ββ β
β β β Clocks β β β β Clocks β β β β Clocks ββ β
β β ββββββββββββ β β ββββββββββββ β β ββββββββββββ β
β βββββββββββββββββ βββββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
2.2 TrueTime¶
The key innovation in Spanner is TrueTime, a globally synchronized clock API that returns a time interval rather than a single timestamp.
Traditional clock: now() β T (single point, unknown error)
TrueTime: TT.now() β [T-Ξ΅, T+Ξ΅] (interval with bounded error)
Where Ξ΅ is the clock uncertainty, typically 1-7 milliseconds.
TrueTime guarantees:
For any invocation of TT.now() returning [earliest, latest]:
earliest β€ actual_time β€ latest
TrueTime Implementation:
- GPS receivers: provide time accurate to nanoseconds
- Atomic clocks: drift slowly, used when GPS signal is lost
- Multiple time sources per datacenter for redundancy
- Marzullo's algorithm to compute tight bounds
2.3 External Consistency via Commit Wait¶
Spanner uses TrueTime to achieve external consistency (also called strict serializability or linearizability): if transaction T1 commits before transaction T2 starts (in real time), then T1's commit timestamp is less than T2's commit timestamp.
Commit Wait Protocol:
1. Transaction T acquires all locks
2. T picks commit timestamp s = TT.now().latest
3. COMMIT WAIT: T waits until TT.after(s) is true
i.e., waits until the clock uncertainty has passed
4. T commits with timestamp s and releases locks
Why this works:
- T picked s = TT.now().latest at some real time t_pick
- After commit wait, real time t_commit satisfies t_commit > s
- Any future transaction T' starting after t_commit will get
s' = TT.now().latest > t_commit > s
- Therefore s < s' β T serializes before T' (external consistency)
Timeline:
ββββββββ€Ξ΅ββββββββ€
| |s| |
| | |
| TT.now() |
| [T-Ξ΅, T+Ξ΅] |
| |
| Commit |
| Wait |
| |
| βββ TT.after(s) = true β COMMIT
Cost of commit wait: Average ~7ms latency added to every write transaction (the average clock uncertainty).
2.4 Spanner SQL and Features¶
Spanner supports a dialect of SQL with extensions for distributed semantics:
-- Create a table with interleaving (co-locate child rows with parent)
CREATE TABLE Customers (
CustomerId INT64 NOT NULL,
Name STRING(100),
Email STRING(256)
) PRIMARY KEY (CustomerId);
CREATE TABLE Orders (
CustomerId INT64 NOT NULL,
OrderId INT64 NOT NULL,
Total NUMERIC,
CreatedAt TIMESTAMP
) PRIMARY KEY (CustomerId, OrderId),
INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
-- Interleaving physically co-locates orders with their customer
-- This eliminates network round-trips for customer-order joins!
-- Read-write transaction (serializable)
BEGIN TRANSACTION;
UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
UPDATE Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
COMMIT;
-- Read-only transaction (no locks, no commit wait)
-- Uses a snapshot timestamp for consistent reads
SET TRANSACTION READ ONLY;
SELECT * FROM Orders WHERE CustomerId = 42;
-- Stale reads (bounded staleness, for lower latency)
SELECT * FROM Orders
WHERE CustomerId = 42
WITH (MAX_STALENESS = 10s);
2.5 Spanner Summary¶
| Feature | Detail |
|---|---|
| Consistency | External consistency (linearizability) |
| Clock | TrueTime (GPS + atomic clocks) |
| Replication | Multi-Paxos per split |
| Partitioning | Range-based, automatic split/merge |
| Transactions | 2PC across Paxos groups |
| SQL | ANSI SQL with extensions |
| Availability | 99.999% SLA (5 nines) |
| Latency | ~7ms write (commit wait), low ms reads |
3. CockroachDB¶
3.1 Overview¶
CockroachDB (CRDB) is an open-source NewSQL database inspired by Spanner but designed to run without specialized hardware (no GPS/atomic clocks). It is named after the cockroach for its resilience -- it is designed to survive failures at every level.
3.2 Architecture¶
CockroachDB Architecture:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SQL Layer β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β Gateway β β Gateway β β Gateway β (any node) β
β β Node β β Node β β Node β β
β ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β
β β β β β
β ββββββββββββββββΌβββββββββββββββ β
β β β
β βββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ β
β β Distribution Layer β β
β β Key space divided into Ranges (~512MB each) β β
β β Each Range is a Raft group (3+ replicas) β β
β β Range 1: [a, f) Range 2: [f, m) Range 3: ... β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ β
β β Storage Layer (Pebble) β β
β β LSM-tree based key-value store β β
β β MVCC timestamps on every key β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3.3 Multi-Active Availability¶
Unlike traditional primary-standby setups, CockroachDB has no single leader for the entire database. Each Range has its own Raft leader, and leaders are distributed across all nodes:
Node 1 Node 2 Node 3
ββββββββββββββββββ ββββββββββββββββββ ββββββββββββββββββ
β Range A: LEADERβ β Range A: Followerβ β Range A: Followerβ
β Range B: Followerβ β Range B: LEADERβ β Range B: Followerβ
β Range C: Followerβ β Range C: Followerβ β Range C: LEADERβ
ββββββββββββββββββ ββββββββββββββββββ ββββββββββββββββββ
Every node serves both reads and writes (for ranges it leads).
No single point of failure!
3.4 Serializable Isolation Without TrueTime¶
Since CRDB runs on commodity hardware without TrueTime, it uses a different mechanism for transaction ordering:
Hybrid Logical Clocks (HLC): Combine a physical clock component with a logical counter.
HLC = (physical_time, logical_counter, node_id)
Rules:
1. Local event: HLC.physical = max(HLC.physical, wall_clock)
HLC.logical += 1
2. Send message: attach current HLC
3. Receive message with HLC_msg:
HLC.physical = max(HLC.physical, HLC_msg.physical, wall_clock)
if HLC.physical == HLC_msg.physical:
HLC.logical = max(HLC.logical, HLC_msg.logical) + 1
else:
HLC.logical = 0
Clock skew handling: CRDB enforces a maximum clock offset (default 500ms). If clocks drift beyond this, nodes are taken out of the cluster. For transactions that might be affected by clock skew, CRDB uses uncertainty intervals and read refreshes:
Transaction T1 reads key K at timestamp t:
- T1 observes value V written at timestamp t_w
- If t_w is within T1's uncertainty interval [t, t + max_offset]:
T1 cannot determine if V was written before or after T1 started
β T1 pushes its timestamp forward past t_w and retries the read
This is called a "read restart" and ensures serializability.
3.5 Key Features¶
-- Geo-partitioning: pin data to specific regions
ALTER TABLE users PARTITION BY LIST (country) (
PARTITION us VALUES IN ('US'),
PARTITION eu VALUES IN ('DE', 'FR', 'UK'),
PARTITION asia VALUES IN ('JP', 'KR', 'SG')
);
ALTER PARTITION us OF TABLE users
CONFIGURE ZONE USING constraints='[+region=us-east1]';
ALTER PARTITION eu OF TABLE users
CONFIGURE ZONE USING constraints='[+region=europe-west1]';
-- Follows leader (low-latency reads from nearest replica)
ALTER TABLE users CONFIGURE ZONE USING
lease_preferences='[[+region=us-east1],[+region=europe-west1]]';
-- Online schema changes (no downtime)
ALTER TABLE orders ADD COLUMN discount DECIMAL DEFAULT 0;
-- Runs as a background job, no table locks!
-- Change data capture (CDC)
CREATE CHANGEFEED FOR TABLE orders INTO 'kafka://broker:9092';
3.6 CockroachDB Summary¶
| Feature | Detail |
|---|---|
| Consistency | Serializable isolation |
| Clock | HLC (Hybrid Logical Clock) |
| Replication | Raft per Range |
| Partitioning | Automatic range splitting (~512MB) |
| SQL | PostgreSQL wire protocol compatible |
| Geo-distribution | Geo-partitioning, follower reads |
| Schema changes | Online, non-blocking |
| License | BSL (free for most uses) |
4. TiDB¶
4.1 Overview¶
TiDB (Ti stands for Titanium) is an open-source NewSQL database created by PingCAP. Its distinguishing feature is HTAP (Hybrid Transactional/Analytical Processing): it can handle both OLTP and OLAP workloads in a single system.
4.2 Architecture¶
TiDB Architecture:
βββββββββββ
β Client β
ββββββ¬βββββ
β
ββββββββββββΌβββββββββββ
βΌ βΌ βΌ
ββββββββββ ββββββββββ ββββββββββ
β TiDB β β TiDB β β TiDB β SQL Layer
β Server β β Server β β Server β (stateless)
ββββββ¬ββββ ββββββ¬ββββ ββββββ¬ββββ
β β β
ββββββββββββΌβββββββββββ
β
ββββββββββββΌβββββββββββ
βΌ βΌ βΌ
ββββββββββ ββββββββββ ββββββββββ
β TiKV β β TiKV β β TiKV β Row Store (OLTP)
β(Raft) β β(Raft) β β(Raft) β (transactional)
ββββββββββ ββββββββββ ββββββββββ
β β β
ββββββββββββΌβββββββββββ
β
ββββββββββββΌβββββββββββ
βΌ βΌ βΌ
ββββββββββ ββββββββββ ββββββββββ
βTiFlash β βTiFlash β βTiFlash β Column Store (OLAP)
β(Raft β β(Raft β β(Raft β (analytical)
βLearner)β βLearner)β βLearner)β
ββββββββββ ββββββββββ ββββββββββ
Placement Driver (PD): Cluster metadata, timestamp oracle, scheduling
4.3 HTAP: Bridging OLTP and OLAP¶
Traditional architectures separate OLTP (write-heavy, row-oriented) and OLAP (read-heavy, column-oriented) into different systems, with ETL pipelines between them. TiDB unifies both:
Traditional Architecture:
OLTP (MySQL) ββETLβββΆ OLAP (Data Warehouse)
(hours of delay)
TiDB HTAP Architecture:
ββββββββββββββββββββββββββββββββββββββββββββ
β TiDB β
β β
β OLTP query βββΆ TiKV (row store) β
β OLAP query βββΆ TiFlash (column store) β
β β
β Raft replication: TiKV βββΆ TiFlash β
β (real-time, asynchronous via Raft Learner)β
ββββββββββββββββββββββββββββββββββββββββββββ
TiKV (row store): Stores data in key-value pairs ordered by primary key. Optimized for point reads and writes (OLTP). Uses Raft for replication and RocksDB for storage.
TiFlash (column store): Stores the same data in columnar format. Optimized for scans, aggregations, and analytical queries (OLAP). Receives data via Raft Learner protocol (asynchronous, does not affect TiKV write latency).
Query routing: The TiDB SQL optimizer automatically decides whether to use TiKV or TiFlash for each part of a query:
-- OLTP query β routed to TiKV
SELECT * FROM orders WHERE order_id = 12345;
-- OLAP query β routed to TiFlash
SELECT product_id, SUM(quantity), AVG(price)
FROM order_items
GROUP BY product_id
ORDER BY SUM(quantity) DESC
LIMIT 100;
-- Hybrid query β TiKV for point lookup, TiFlash for aggregation
SELECT c.name, stats.total_spent
FROM customers c
JOIN (
SELECT customer_id, SUM(total) as total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(total) > 10000
) stats ON c.id = stats.customer_id
WHERE c.id = 42;
4.4 TiDB Summary¶
| Feature | Detail |
|---|---|
| Consistency | Snapshot isolation (default), serializable (optional) |
| Clock | Timestamp Oracle (centralized, via PD) |
| Replication | Raft per Region |
| HTAP | TiKV (row) + TiFlash (column) |
| SQL | MySQL wire protocol compatible |
| Storage engine | RocksDB (TiKV), ClickHouse-derived (TiFlash) |
| License | Apache 2.0 (open source) |
5. Vector Databases¶
5.1 Motivation¶
The rise of deep learning has created a new data type: embeddings -- high-dimensional vectors that represent the semantic meaning of text, images, audio, or any object.
Traditional DB query: Vector DB query:
"Find user with id = 42" "Find items similar to this image"
Exact match on a key Nearest neighbors in vector space
ββ Vector Space βββββββββββββββββββ
β β
β β(cat image) β
β β (kitten image) β
β β (query: cat photo) β
β β
β β(car image) β
β β (truck image) β
β β
β β(sunset photo) β
β β (beach photo) β
β β
βββββββββββββββββββββββββββββββββββββ
5.2 Embeddings¶
An embedding is a learned representation that maps an object to a fixed-dimensional vector such that semantically similar objects have nearby vectors.
Text embeddings (e.g., OpenAI text-embedding-3-small, 1536 dimensions):
"The cat sat on the mat" β [0.023, -0.041, 0.108, ..., 0.055] (1536 dims)
"A kitten rested on a rug" β [0.021, -0.039, 0.112, ..., 0.051] (similar!)
"Stock prices fell today" β [-0.087, 0.032, -0.005, ..., 0.019] (different)
Image embeddings (e.g., CLIP, 512 dimensions):
photo_of_cat.jpg β [0.15, -0.23, ..., 0.08] (512 dims)
photo_of_dog.jpg β [0.12, -0.19, ..., 0.11] (nearby in vector space)
5.3 Similarity Metrics¶
| Metric | Formula | Range | Use Case |
|---|---|---|---|
| Cosine similarity | cos(A,B) = (A . B) / (|A| * |B|) |
[-1, 1] | Text similarity (direction matters) |
| Euclidean distance (L2) | d = sqrt(sum((a_i - b_i)^2)) |
[0, inf) | Image similarity, spatial data |
| Dot product | A . B = sum(a_i * b_i) |
(-inf, inf) | Recommendation (magnitude matters) |
| Manhattan distance (L1) | d = sum(|a_i - b_i|) |
[0, inf) | Sparse vectors |
5.4 Approximate Nearest Neighbor (ANN) Search¶
Exact nearest neighbor search in high dimensions is computationally prohibitive (curse of dimensionality). Vector databases use approximate algorithms that trade a small accuracy loss for massive speed gains.
Inverted File Index (IVF):
IVF Algorithm:
1. TRAINING: Cluster vectors into K centroids (K-means)
ββββββββββββββββββββββββββββββββββββββ
β β β C1 β β β β
β β β C2 β β
β β
β β C3 β
β β β β
β β β β β
β β C4 β β
ββββββββββββββββββββββββββββββββββββββ
C1, C2, C3, C4 are cluster centroids
2. INDEXING: Assign each vector to its nearest centroid
Inverted lists:
C1 β [v1, v5, v12, v33, ...]
C2 β [v2, v7, v15, v41, ...]
C3 β [v3, v9, v22, v37, ...]
C4 β [v4, v11, v19, v45, ...]
3. SEARCH: For query q, find nprobe nearest centroids,
then search only those clusters' vectors
nprobe=1: search 1 cluster (fast, less accurate)
nprobe=K: search all clusters (slow, exact)
HNSW (Hierarchical Navigable Small World):
HNSW Algorithm:
Build a multi-layer graph where:
- Layer 0 (bottom): contains ALL vectors, densely connected
- Layer 1: contains a SUBSET of vectors, sparser connections
- Layer 2: even fewer vectors, even sparser
- ...
- Top layer: very few vectors, long-range connections
Search traversal:
Layer 3: A βββββββββββββββββββ B (few nodes, long jumps)
β β
Layer 2: A ββββ C ββββ D ββββ B (more nodes)
β β β β
Layer 1: A β E β C β F β D β G β B (even more nodes)
β β β β β β β
Layer 0: A E H C I F J D K G L B M (all nodes, short connections)
β
query point
Search: Start at top layer, greedily move to nearest neighbor,
descend to next layer, repeat.
At Layer 0, explore local neighborhood for final result.
Time complexity: O(log N) per query (vs O(N) for brute force)
5.5 Vector Database Systems¶
Pinecone (fully managed):
import pinecone
# Initialize
pinecone.init(api_key="...", environment="us-east-1-aws")
index = pinecone.Index("product-search")
# Upsert vectors
index.upsert(vectors=[
("prod_1", [0.1, 0.2, ..., 0.5], {"category": "electronics", "price": 99.99}),
("prod_2", [0.3, 0.1, ..., 0.8], {"category": "clothing", "price": 49.99}),
])
# Query: find 5 most similar products
results = index.query(
vector=[0.15, 0.22, ..., 0.48], # query embedding
top_k=5,
filter={"category": {"$eq": "electronics"}}, # metadata filter
include_metadata=True
)
Milvus (open source):
from pymilvus import Collection, FieldSchema, CollectionSchema, DataType
# Define schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1000),
]
schema = CollectionSchema(fields)
collection = Collection("documents", schema)
# Create HNSW index
index_params = {
"metric_type": "COSINE",
"index_type": "HNSW",
"params": {"M": 16, "efConstruction": 200}
}
collection.create_index("embedding", index_params)
# Search
results = collection.search(
data=[query_embedding],
anns_field="embedding",
param={"metric_type": "COSINE", "params": {"ef": 100}},
limit=10,
output_fields=["text"]
)
pgvector (PostgreSQL extension):
-- Enable extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- 1536-dimensional vector
);
-- Create HNSW index
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Insert
INSERT INTO documents (content, embedding)
VALUES ('The cat sat on the mat', '[0.023, -0.041, ...]');
-- Similarity search (cosine distance)
SELECT content, 1 - (embedding <=> '[0.025, -0.038, ...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.025, -0.038, ...]'
LIMIT 5;
-- Hybrid search: vector similarity + metadata filter
SELECT content, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE category = 'science'
ORDER BY embedding <=> $1
LIMIT 10;
5.6 Use Cases for Vector Databases¶
| Use Case | Description |
|---|---|
| Semantic search | Find documents by meaning, not just keywords |
| RAG (Retrieval-Augmented Generation) | Store knowledge base embeddings; retrieve context for LLM prompts |
| Recommendation systems | Find items similar to user preferences (collaborative filtering via embeddings) |
| Image search | Find visually similar images |
| Anomaly detection | Identify vectors far from normal clusters |
| Deduplication | Find near-duplicate documents or images |
| Multimodal search | Search across text, images, and audio using shared embedding spaces (CLIP) |
5.7 Choosing a Vector Database¶
| System | Type | Hosting | Strengths | Limitations |
|---|---|---|---|---|
| Pinecone | Managed | Cloud only | Simple API, auto-scaling | Vendor lock-in, cost |
| Milvus | Open source | Self-hosted / Zilliz Cloud | Feature-rich, GPU support | Operational complexity |
| Weaviate | Open source | Self-hosted / Cloud | GraphQL API, modules | Newer, smaller ecosystem |
| Qdrant | Open source | Self-hosted / Cloud | Rust-based (fast), filtering | Smaller community |
| pgvector | Extension | Any PostgreSQL | No new infra, full SQL | Limited scale, no GPU |
| Chroma | Open source | Embedded / Server | Simple, Python-native | Limited production features |
6. Time-Series Databases¶
6.1 What is Time-Series Data?¶
Time-series data consists of measurements or events indexed by time. It has unique characteristics that general-purpose databases handle poorly.
Time-series data examples:
IoT sensor readings:
timestamp | device_id | temperature | humidity
2024-11-15T10:00:00 | sensor_42 | 23.5 | 65.2
2024-11-15T10:00:01 | sensor_42 | 23.6 | 65.1
2024-11-15T10:00:02 | sensor_42 | 23.5 | 65.3
...
Application metrics:
timestamp | service | metric | value
2024-11-15T10:00:00 | api-gw | latency_ms | 42
2024-11-15T10:00:00 | api-gw | req_count | 1547
2024-11-15T10:00:00 | api-gw | error_rate | 0.02
...
6.2 Characteristics of Time-Series Data¶
| Characteristic | Implication |
|---|---|
| Write-heavy | Continuous ingestion, rarely updated after write |
| Time-ordered | Data naturally ordered by timestamp; most queries involve time ranges |
| Append-only | New data is always newer; old data is rarely modified |
| High cardinality | Millions of unique series (device_id x metric combinations) |
| Downsampling | Old data can be aggregated (1-second β 1-minute β 1-hour) |
| TTL (expiry) | Old data has diminishing value; auto-delete after retention period |
| Compression | Temporal locality enables high compression ratios (10-20x) |
6.3 TSDB Optimizations¶
Time-based partitioning: Data is automatically partitioned by time, enabling efficient time-range queries and data lifecycle management.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Time-based partitioning β
β β
β Chunk 1 Chunk 2 Chunk 3 ... β
β [Nov 1-7] [Nov 8-14] [Nov 15-21] β
β ββββββββ ββββββββ ββββββββ β
β β Data β β Data β β Data β β
β β Indexβ β Indexβ β Indexβ β
β ββββββββ ββββββββ ββββββββ β
β β
β Query: WHERE time > 'Nov 10' AND time < 'Nov 20' β
β β Only scans Chunk 2 and Chunk 3! β
β β
β Retention: DROP Chunk 1 (instant, no row-by-row) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Delta-of-delta compression: Consecutive timestamps are similar (e.g., every 10 seconds). Store the delta between deltas.
Raw timestamps: 1700000000, 1700000010, 1700000020, 1700000030
Deltas: 10, 10, 10
Delta-of-deltas: 0, 0
Compression: store base + first delta + delta-of-deltas
1700000000, 10, 0, 0 (4 values instead of 4 full timestamps)
For values: Gorilla compression (XOR-based, from Facebook's Gorilla paper)
If consecutive values are similar, XOR is mostly zeros β compress well
Downsampling: Automatically aggregate old data to reduce storage.
Raw data (1-second resolution): β 86,400 points/day/series
After 7 days: downsample to 1-min β 1,440 points/day/series
After 30 days: downsample to 1-hr β 24 points/day/series
After 1 year: downsample to 1-day β 1 point/day/series
Storage reduction: 86,400x over 1 year!
6.4 TimescaleDB¶
TimescaleDB is a time-series extension for PostgreSQL, preserving full SQL capability.
-- Create a hypertable (auto-partitioned by time)
CREATE TABLE metrics (
time TIMESTAMPTZ NOT NULL,
device_id TEXT NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'time');
-- Automatic partitioning by time (chunks)
-- Each chunk is a separate PostgreSQL table
-- Continuous aggregate (materialized, auto-refreshing)
CREATE MATERIALIZED VIEW hourly_stats
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS bucket,
device_id,
AVG(temperature) AS avg_temp,
MAX(temperature) AS max_temp,
MIN(temperature) AS min_temp
FROM metrics
GROUP BY bucket, device_id;
-- Retention policy (auto-delete old data)
SELECT add_retention_policy('metrics', INTERVAL '90 days');
-- Compression policy
ALTER TABLE metrics SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'device_id',
timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Query: last 24 hours, 5-minute averages
SELECT
time_bucket('5 minutes', time) AS bucket,
device_id,
AVG(temperature) AS avg_temp
FROM metrics
WHERE time > NOW() - INTERVAL '24 hours'
AND device_id = 'sensor_42'
GROUP BY bucket, device_id
ORDER BY bucket;
6.5 InfluxDB¶
InfluxDB is a purpose-built time-series database with its own query language (Flux) and line protocol.
# InfluxDB Line Protocol (write)
cpu,host=server01,region=us-east usage=0.72,system=0.15 1700000000000000000
cpu,host=server02,region=eu-west usage=0.45,system=0.08 1700000000000000000
# Structure: measurement,tag_key=tag_value field_key=field_value timestamp
# Flux query: average CPU usage by host, last hour, 5-minute windows
from(bucket: "monitoring")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage")
|> aggregateWindow(every: 5m, fn: mean)
|> group(columns: ["host"])
6.6 TSDB Comparison¶
| Feature | TimescaleDB | InfluxDB | Prometheus |
|---|---|---|---|
| Base | PostgreSQL extension | Purpose-built | Purpose-built |
| Query language | Full SQL | Flux / InfluxQL | PromQL |
| Compression | Columnar compression | TSM engine | Gorilla + delta |
| Clustering | Self-managed or cloud | Enterprise only | Thanos/Cortex |
| Cardinality | Good (PostgreSQL indexes) | Limited (high cardinality = slow) | Limited |
| Joins | Full SQL joins | Limited | No |
| Best for | General time-series + SQL | Metrics, IoT | Infrastructure monitoring |
7. Graph Analytics at Scale¶
7.1 Beyond Graph Databases¶
While graph databases like Neo4j excel at transactional graph queries (OLTP), graph analytics requires processing the entire graph (OLAP): PageRank, community detection, shortest paths across billions of edges.
7.2 Apache Spark GraphX¶
GraphX is Spark's API for graph-parallel computation. It extends the RDD (Resilient Distributed Dataset) abstraction with a Resilient Distributed Property Graph.
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.appName("GraphAnalytics").getOrCreate()
# Vertices DataFrame
vertices = spark.createDataFrame([
("1", "Alice", 30),
("2", "Bob", 25),
("3", "Charlie", 35),
("4", "Diana", 28),
], ["id", "name", "age"])
# Edges DataFrame
edges = spark.createDataFrame([
("1", "2", "follows"),
("2", "3", "follows"),
("3", "1", "follows"),
("1", "4", "follows"),
("4", "3", "follows"),
], ["src", "dst", "relationship"])
# Create graph
g = GraphFrame(vertices, edges)
# PageRank
pagerank = g.pageRank(resetProbability=0.15, maxIter=10)
pagerank.vertices.select("id", "name", "pagerank").show()
# Connected components
cc = g.connectedComponents()
cc.show()
# Shortest paths
sp = g.shortestPaths(landmarks=["3"])
sp.show()
# Triangle count
tc = g.triangleCount()
tc.show()
7.3 TigerGraph¶
TigerGraph is a distributed graph analytics platform designed for deep link analytics (multi-hop queries at scale).
// GSQL: TigerGraph's query language
// Define schema
CREATE VERTEX Person (PRIMARY_ID id STRING, name STRING, age INT)
CREATE VERTEX Company (PRIMARY_ID id STRING, name STRING)
CREATE DIRECTED EDGE works_at (FROM Person, TO Company, since DATETIME)
CREATE UNDIRECTED EDGE knows (FROM Person, TO Person, strength FLOAT)
// Install a query: 2-hop fraud detection
CREATE QUERY fraud_ring_detection(VERTEX<Person> seed) FOR GRAPH social {
// Find all people within 2 hops who share financial connections
Start = {seed};
// 1-hop: direct connections
hop1 = SELECT t FROM Start:s -(knows:e)- Person:t
WHERE e.strength > 0.8;
// 2-hop: connections of connections
hop2 = SELECT t FROM hop1:s -(knows:e)- Person:t
WHERE e.strength > 0.8 AND t != seed;
// Find shared financial patterns
suspicious = SELECT t FROM hop2:t -(works_at)- Company:c
WHERE c.name == "shell_company_pattern";
PRINT suspicious;
}
7.4 Graph Analytics Comparison¶
| Feature | Neo4j | Spark GraphX | TigerGraph |
|---|---|---|---|
| Type | Graph DB + Analytics | Distributed compute | Distributed graph DB |
| Scale | Billions of nodes | Billions of edges | Trillions of edges |
| Query | Cypher | Scala/Python API | GSQL |
| Real-time | Yes (OLTP + OLAP) | No (batch) | Yes |
| Best for | Mixed OLTP/OLAP graph | Batch analytics on Spark | Deep link analytics |
8. Database-as-a-Service¶
8.1 The Serverless Database Trend¶
Database-as-a-Service (DBaaS) abstracts away server provisioning, scaling, patching, and backup management. The latest wave of DBaaS offerings goes further with serverless architectures that scale to zero when not in use.
8.2 Neon¶
Neon is a serverless PostgreSQL with a separation of storage and compute, enabling instant branching and scale-to-zero.
Neon Architecture:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Compute Layer (stateless, scale-to-zero) β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β Postgres β β Postgres β β Postgres β (auto) β
β β Instance β β Instance β β (idle) β β
β ββββββ¬ββββββ ββββββ¬ββββββ ββββββββββββ β
β β β β
β ββββββββββββββββ β
β β β
β βββββββββββββββ΄βββββββββββββββββββββββββββββββββββ β
β β Pageserver (page cache + WAL processing) β β
β βββββββββββββββ¬βββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββ΄βββββββββββββββββββββββββββββββββββ β
β β SafeKeeper (WAL durability, consensus) β β
β βββββββββββββββ¬βββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββ΄βββββββββββββββββββββββββββββββββββ β
β β Object Storage (S3) - bottomless storage β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Key features: - Scale to zero: When no queries are running, compute scales down to zero (no cost) - Instant branching: Create a copy-on-write branch of the entire database (like git branch for data) - Bottomless storage: Data stored in S3, no upper limit
# Create a branch (instant, copy-on-write)
neonctl branches create --name staging --parent main
# Each branch gets its own compute endpoint
# Useful for: development, testing, previews
8.3 PlanetScale¶
PlanetScale is a serverless MySQL platform built on Vitess (YouTube's MySQL sharding middleware).
Key features: - Non-blocking schema changes: Schema changes run as background operations, no table locks - Database branching: Create branches for schema changes, merge via deploy requests - Sharding: Built on Vitess, automatic horizontal sharding
# Git-like workflow for schema changes
pscale branch create main add-column
pscale shell main add-column
# On the branch:
mysql> ALTER TABLE users ADD COLUMN avatar_url VARCHAR(500);
# Create a deploy request (like a pull request)
pscale deploy-request create main add-column
# Review and merge
pscale deploy-request deploy main 1
# Schema change applied to production with zero downtime!
8.4 Supabase¶
Supabase is an open-source Firebase alternative built on PostgreSQL.
Supabase Stack:
ββββββββββββββββββββββββββββββββββββββββββββββββ
β Dashboard / Studio (Admin UI) β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β Auth (GoTrue) β β
β β Realtime (WebSocket subscriptions) β β
β β Storage (S3-compatible) β β
β β Edge Functions (Deno runtime) β β
β β PostgREST (auto-generated REST API) β β
β β pgvector (vector search) β β
β βββββββββββββββββββββββββββββββββββββββββββ β
β β β
β ββββββββββ΄βββββββββ β
β β PostgreSQL β β
β β (the source β β
β β of truth) β β
β βββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββ
// Supabase client (auto-generated API from PostgreSQL schema)
import { createClient } from '@supabase/supabase-js'
const supabase = createClient('https://xxx.supabase.co', 'anon-key')
// Query (translates to SQL automatically)
const { data, error } = await supabase
.from('products')
.select('name, price, categories(name)')
.gte('price', 10)
.order('price', { ascending: true })
.limit(20)
// Real-time subscription
const subscription = supabase
.channel('orders')
.on('postgres_changes',
{ event: 'INSERT', schema: 'public', table: 'orders' },
(payload) => console.log('New order:', payload.new)
)
.subscribe()
// Row Level Security (RLS) - security at the database level
// Users can only see their own orders
// CREATE POLICY "Users see own orders" ON orders
// FOR SELECT USING (auth.uid() = user_id);
8.5 DBaaS Comparison¶
| Feature | Neon | PlanetScale | Supabase |
|---|---|---|---|
| Engine | PostgreSQL | MySQL (Vitess) | PostgreSQL |
| Serverless | Yes (scale to zero) | Yes | Partial |
| Branching | Database branching | Schema branching | No |
| Sharding | No (single node) | Yes (Vitess) | No |
| Auto-generated API | No | No | Yes (PostgREST) |
| Realtime | No | No | Yes (WebSocket) |
| Vector search | pgvector | No | pgvector |
| Open source | Yes (storage) | No (Vitess is OSS) | Yes |
| Best for | Dev/test, branching | High-scale MySQL | Full-stack apps |
9. Data Lakehouse¶
9.1 The Evolution of Data Architecture¶
Era 1: Data Warehouse (1990s-2010s)
Structured data β ETL β Data Warehouse (Redshift, Snowflake)
+ Strong schema, ACID, SQL
- Expensive, limited to structured data
Era 2: Data Lake (2010s)
All data β Dump into HDFS/S3 β Process with Spark
+ Cheap storage, all data types
- No ACID, no schema enforcement, "data swamp"
Era 3: Data Lakehouse (2020s)
All data β Lake storage (S3) + Table format (Delta/Iceberg)
+ Cheap storage + ACID + schema + SQL
Best of both worlds!
9.2 What is a Lakehouse?¶
A data lakehouse combines the low-cost storage of a data lake (object storage like S3) with the data management features of a data warehouse (ACID transactions, schema enforcement, indexing).
Data Lakehouse Architecture:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Query Engines β
β βββββββββββ βββββββββββ βββββββββββ β
β β Spark β β Trino β β DuckDB β ... β
β ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ β
β β β β β
β ββββββββββββββΌβββββββββββββ β
β β β
β βββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ β
β β Table Format (metadata layer) β β
β β ββββββββββββββ ββββββββββββββ β β
β β β Delta Lake β β Apache β β β
β β β β β Iceberg β β β
β β ββββββββββββββ ββββββββββββββ β β
β β Provides: ACID, schema evolution, time travel, β β
β β partition pruning, file-level statistics β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββ΄βββββββββββββββββββββββββββββββββ β
β β Object Storage β β
β β ββββββββ ββββββββ ββββββββ β β
β β β S3 β β GCS β β ADLS β β β
β β ββββββββ ββββββββ ββββββββ β β
β β Data stored as Parquet/ORC files β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
9.3 Delta Lake¶
Delta Lake, created by Databricks, adds ACID transactions to Apache Spark on data lakes.
Transaction log: Delta Lake maintains a transaction log (_delta_log/) in JSON that records every change to the table. This provides:
my_table/
βββ _delta_log/
β βββ 00000000000000000000.json β initial table creation
β βββ 00000000000000000001.json β INSERT 1000 rows
β βββ 00000000000000000002.json β UPDATE some rows
β βββ 00000000000000000003.json β DELETE old rows
βββ part-00000-xxxx.parquet
βββ part-00001-xxxx.parquet
βββ part-00002-xxxx.parquet
from delta import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0") \
.getOrCreate()
# Write data (ACID)
df.write.format("delta").mode("overwrite").save("/data/events")
# Time travel: read data as of a specific version
df_v1 = spark.read.format("delta") \
.option("versionAsOf", 1) \
.load("/data/events")
# Time travel: read data as of a specific timestamp
df_yesterday = spark.read.format("delta") \
.option("timestampAsOf", "2024-11-14") \
.load("/data/events")
# MERGE (upsert): update existing rows, insert new ones
deltaTable = DeltaTable.forPath(spark, "/data/events")
deltaTable.alias("target").merge(
new_data.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# Schema evolution
df_new_columns.write.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/data/events")
9.4 Apache Iceberg¶
Apache Iceberg is an open table format for large analytic tables, designed to solve the limitations of Hive tables.
Key advantages over Delta Lake:
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| Engine lock-in | Optimized for Spark | Engine-agnostic (Spark, Trino, Flink, Dremio) |
| Hidden partitioning | Explicit partition columns | Automatic (partition evolution without rewriting) |
| Metadata | JSON log + checkpoints | Manifest files + snapshot metadata |
| Schema evolution | Add/rename/drop columns | Full evolution including nested types |
| Governance | Databricks-led | Apache Foundation (vendor-neutral) |
-- Iceberg in Spark SQL
CREATE TABLE catalog.db.events (
event_id BIGINT,
user_id BIGINT,
event_type STRING,
event_time TIMESTAMP,
properties MAP<STRING, STRING>
) USING iceberg
PARTITIONED BY (days(event_time)); -- hidden partitioning by day
-- Time travel
SELECT * FROM catalog.db.events VERSION AS OF 12345;
SELECT * FROM catalog.db.events TIMESTAMP AS OF '2024-11-14 00:00:00';
-- Snapshot management
SELECT * FROM catalog.db.events.snapshots;
-- Expire old snapshots (reclaim storage)
CALL catalog.system.expire_snapshots('db.events', TIMESTAMP '2024-10-01');
9.5 Lakehouse vs Traditional Architectures¶
| Feature | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage cost | High | Low | Low |
| Data types | Structured only | All types | All types |
| ACID | Yes | No | Yes |
| Schema | Schema-on-write | Schema-on-read | Schema-on-write (flexible) |
| Query performance | Excellent | Variable | Good (with indexing) |
| Time travel | Limited | No | Yes |
| Governance | Strong | Weak | Strong |
| ML support | Limited | Good (raw data) | Good (raw + structured) |
10. Exercises¶
Exercise 1: NewSQL Comparison¶
Compare Google Spanner, CockroachDB, and TiDB across the following dimensions. Fill in the table:
| Dimension | Spanner | CockroachDB | TiDB |
|---|---|---|---|
| Clock mechanism | |||
| Default isolation level | |||
| Replication protocol | |||
| SQL compatibility | |||
| HTAP support | |||
| Open source? | |||
| Typical deployment |
Exercise 2: TrueTime and Commit Wait¶
Google Spanner's TrueTime reports a clock uncertainty of [T-Ξ΅, T+Ξ΅] where Ξ΅ is typically 1-7ms.
- Explain why Spanner must "wait out" the uncertainty before committing a transaction.
- If Ξ΅ = 5ms, what is the minimum commit latency for a write transaction?
- If Ξ΅ were 0 (perfect clock), how would this simplify the protocol?
- Why can't CockroachDB use the same approach? What does it do instead?
- Explain a scenario where CockroachDB's HLC approach might result in a "read restart." What is the user-visible impact?
Exercise 3: Vector Database Design¶
You are building a RAG (Retrieval-Augmented Generation) system for a customer support chatbot. The knowledge base contains: - 100,000 support articles (average 500 words each) - Articles are updated weekly - Expected query load: 1,000 queries per minute - Latency requirement: < 100ms per query
Design the vector database component: 1. Choose a vector database system and justify your choice. 2. Choose an embedding model and dimension size. 3. Describe your indexing strategy (IVF, HNSW, or hybrid). 4. How would you handle article updates? (re-embed entire article? chunk-level updates?) 5. How would you implement hybrid search (vector similarity + keyword matching)?
Exercise 4: Time-Series Schema Design¶
Design a TimescaleDB schema for a smart building monitoring system: - 500 sensors across 50 floors - Each sensor reports temperature, humidity, CO2, and occupancy every 10 seconds - Common queries: last 24 hours for a specific floor, average per floor per hour, anomaly detection - Data retention: raw data for 90 days, hourly aggregates for 2 years
Write the SQL for: 1. Table creation with hypertable 2. Continuous aggregate for hourly averages 3. Retention policy 4. Compression policy 5. A query for "floors where average temperature exceeded 28C in the last 6 hours"
Exercise 5: HTAP Scenario Analysis¶
A financial trading company currently runs: - An OLTP database (PostgreSQL) for trade execution - An OLAP warehouse (Snowflake) with 2-hour ETL lag for risk analysis
They are considering migrating to TiDB for HTAP.
- What are the benefits of eliminating the ETL pipeline?
- What risks does running analytics on the same system as transactions introduce?
- How does TiDB's architecture (TiKV + TiFlash) mitigate the resource contention risk?
- In what scenarios would you still recommend keeping separate OLTP and OLAP systems?
Exercise 6: Database Selection¶
For each of the following applications, select the most appropriate database technology from the options covered in this lesson. Justify your choice.
- A genomics research platform storing DNA sequence embeddings for similarity search across 10 billion sequences.
- A factory monitoring system with 10,000 machines, each reporting 50 metrics every second.
- A global banking application requiring ACID transactions with operations in 30 countries.
- A startup building a collaborative document editor (like Google Docs) with 10,000 users.
- A data analytics platform that needs to run SQL queries over 100 TB of event data stored in S3.
Exercise 7: Lakehouse Design¶
You are designing a data platform for a ride-sharing company. Data sources include: - Ride events (10 million rides/day) - GPS traces (100 points per ride) - Payment transactions - Driver/rider profiles - Surge pricing calculations
Design a lakehouse architecture: 1. Choose between Delta Lake and Apache Iceberg. Justify. 2. Define the table schemas for the three most important tables. 3. Describe the partitioning strategy for each table. 4. How would you implement time travel for auditing payment transactions? 5. What query engines would you use for (a) real-time dashboards, (b) monthly reports, (c) ML feature engineering?
Exercise 8: Serverless Database Evaluation¶
Your team is building a SaaS application with these characteristics: - Multi-tenant (each tenant has isolated data) - Variable load: peaks at 10x the average - Development team needs frequent schema changes - Budget is limited (startup)
Evaluate Neon, PlanetScale, and Supabase for this use case: 1. How does each handle multi-tenancy? 2. How does each handle variable load? 3. How does each handle schema changes in production? 4. What would be your recommendation and why?
Exercise 9: Consistent Hashing in CockroachDB¶
CockroachDB divides the key space into Ranges (approximately 512 MB each). Each Range is a Raft group.
- When a Range grows beyond 512 MB, how does CockroachDB split it? What happens to the Raft group?
- When a node joins the cluster, how are Ranges rebalanced?
- What is the difference between a Range split and a Range rebalance?
- How does CockroachDB decide which node should be the Raft leader for a Range? How does this relate to geo-partitioning?
Exercise 10: Essay Question¶
Write a 600-word essay on the following topic:
"The relational database is dead" has been proclaimed many times, yet PostgreSQL usage continues to grow year over year. Based on the material in this lesson (NewSQL, vector databases, time-series databases, lakehouse), argue whether the relational model is becoming more or less relevant. Support your argument with specific examples of how relational databases are adapting (e.g., pgvector, TimescaleDB as extensions) versus being replaced by specialized systems.
11. References¶
- Corbett, J. et al. (2013). "Spanner: Google's Globally-Distributed Database." ACM TODS.
- Taft, R. et al. (2020). "CockroachDB: The Resilient Geo-Distributed SQL Database." ACM SIGMOD.
- Huang, D. et al. (2020). "TiDB: A Raft-based HTAP Database." VLDB.
- Aslett, M. (2011). "How Will the Database Incumbents Respond to NoSQL and NewSQL?" The 451 Group.
- Johnson, J., Douze, M., Jegou, H. (2019). "Billion-Scale Similarity Search with GPUs." IEEE Trans. on Big Data.
- Malkov, Y. & Yashunin, D. (2018). "Efficient and Robust Approximate Nearest Neighbor using Hierarchical Navigable Small World Graphs." IEEE TPAMI.
- Armbrust, M. et al. (2021). "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics." CIDR.
- Armbrust, M. et al. (2020). "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores." VLDB.
- Apache Iceberg Documentation. https://iceberg.apache.org/
- Pelkonen, T. et al. (2015). "Gorilla: A Fast, Scalable, In-Memory Time Series Database." VLDB.
- Neon Documentation. https://neon.tech/docs
- CockroachDB Architecture Documentation. https://www.cockroachlabs.com/docs/stable/architecture/overview.html
Previous: 14. Distributed Databases | Next: 16. Database Design Case Study