Database Cheatsheet: SQL vs NoSQL, Indexing, Sharding, and Replication

Choosing the right database is one of the most consequential decisions in system design. This cheatsheet gives you quick-reference comparisons, decision frameworks, and practical guidance for every database topic you will encounter in system design interviews. Pair this with our Scalability Cheatsheet and General Cheat Sheets for complete preparation.

SQL vs NoSQL Comparison

Feature	SQL (Relational)	NoSQL
Schema	Fixed schema, predefined tables	Flexible/dynamic schema
Data Model	Tables with rows and columns	Documents, key-value, wide-column, graph
Transactions	Full ACID compliance	Varies (some support ACID per-document)
Joins	Native, efficient multi-table joins	No joins (denormalize or application-level)
Scaling	Primarily vertical; horizontal is complex	Designed for horizontal scaling
Consistency	Strong by default	Configurable (eventual to strong)
Query Language	SQL (standardized, powerful)	Database-specific APIs
Best For	Complex queries, transactions, structured data	High throughput, flexible schema, horizontal scale

When to Use Which Database

Database	Type	Strengths	Best Use Cases
PostgreSQL	Relational	ACID, complex queries, JSON support, extensions	Default choice for most applications, geospatial (PostGIS)
MySQL	Relational	Mature ecosystem, read performance, replication	Web applications, read-heavy workloads, CMS platforms
MongoDB	Document	Flexible schema, developer-friendly, aggregation pipeline	Rapid prototyping, content management, catalogs
Cassandra	Wide Column	Extreme write throughput, linear scalability, multi-DC	Time-series, messaging, IoT, event logging
Redis	Key-Value	Sub-millisecond latency, rich data structures	Caching, sessions, leaderboards, rate limiting, pub/sub
DynamoDB	Key-Value/Document	Managed, auto-scaling, single-digit ms latency	Serverless apps, gaming, ad-tech, session management
Neo4j	Graph	Relationship traversal, pattern matching	Social networks, fraud detection, recommendations
InfluxDB	Time Series	Time-based queries, downsampling, retention policies	Monitoring, metrics, IoT sensor data
Elasticsearch	Search Engine	Full-text search, aggregations, inverted index	Search functionality, log analysis, analytics

Database Decision Tree

Start Here:
|
|-- Need ACID transactions across multiple tables?
|   YES --> PostgreSQL / MySQL
|
|-- Need flexible/evolving schema?
|   YES --> MongoDB (document model)
|
|-- Need extreme write throughput (>100K writes/sec)?
|   YES --> Cassandra (wide-column)
|
|-- Need sub-millisecond reads/caching?
|   YES --> Redis (in-memory)
|
|-- Need full-text search?
|   YES --> Elasticsearch
|
|-- Need to traverse relationships (social graph)?
|   YES --> Neo4j (graph)
|
|-- Need time-series data with retention policies?
|   YES --> InfluxDB / TimescaleDB
|
|-- Need managed, serverless, auto-scaling?
|   YES --> DynamoDB
|
|-- Default/Uncertain?
|   --> PostgreSQL (most versatile, rarely wrong)

Indexing Rules and Types

Index Type	Structure	Best For	Limitation
B-Tree (default)	Balanced tree, sorted data	Range queries, equality, sorting	Write overhead per insert/update
Hash	Hash table	Exact equality lookups only	No range queries, no sorting
Composite (Multi-column)	B-Tree on multiple columns	Queries filtering on multiple columns	Column order matters (leftmost prefix)
Covering	Index includes all query columns	Index-only scans (no table access)	Larger index size
GIN (Generalized Inverted)	Inverted index	Full-text search, JSON, arrays	Slower writes
GiST (Generalized Search Tree)	Generalized tree	Geospatial, range types, nearest neighbor	Less precise than B-Tree for simple queries
Partial	Index with WHERE clause	Indexing subset of rows (e.g., active users only)	Only used when query matches condition

Indexing Best Practices

-- 1. Index columns in WHERE, JOIN, and ORDER BY clauses
CREATE INDEX idx_users_email ON users(email);

-- 2. Composite index: put high-cardinality columns first
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- 3. Covering index: include all selected columns
CREATE INDEX idx_orders_covering ON orders(user_id, status)
  INCLUDE (total_amount, created_at);

-- 4. Partial index: index only relevant rows
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;

-- 5. Avoid indexing:
--    - Low-cardinality columns (boolean, status with 2-3 values)
--    - Frequently updated columns (high write overhead)
--    - Small tables (full scan is faster)

-- 6. Monitor unused indexes
SELECT * FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND idx_tup_read = 0;

Sharding Strategies

Strategy	How It Works	Pros	Cons
Hash-Based	hash(shard_key) % N	Even data distribution	Resharding requires data migration; no range queries
Range-Based	Key ranges per shard (1-1M, 1M-2M)	Efficient range queries	Hot spots if ranges are uneven
Geographic	Shard by user region	Data locality, low latency	Cross-region queries are expensive
Directory-Based	Lookup table maps key to shard	Most flexible, easy resharding	Lookup table is single point of failure
Consistent Hashing	Hash ring with virtual nodes	Minimal data movement on shard changes	More complex to implement

Choosing a Shard Key

Good Shard Keys:
- user_id    (evenly distributed, queries scoped to user)
- tenant_id  (multi-tenant SaaS, natural isolation)
- order_id   (if no cross-user queries needed)

Bad Shard Keys:
- timestamp  (all writes go to latest shard - hot spot)
- country    (uneven distribution: US shard overloaded)
- status     (only 3-5 values - terrible distribution)

Rule: Pick a key that is:
1. High cardinality (many distinct values)
2. Evenly distributed across shards
3. Used in most queries (avoid cross-shard queries)
4. Immutable (changing shard key requires data migration)

Replication Modes

Mode	How It Works	Consistency	Availability
Single Leader (Primary-Replica)	One primary handles writes; replicas serve reads	Strong on primary; eventual on replicas	Primary failure requires failover
Multi-Leader (Primary-Primary)	Multiple nodes accept writes; sync asynchronously	Eventual (conflict resolution needed)	High (no single point of failure for writes)
Leaderless (Quorum)	Any node accepts reads/writes; quorum decides	Tunable (R + W > N for strong)	Very high (tolerates multiple failures)

Quorum Configuration

// N = total replicas, W = write quorum, R = read quorum
// Strong consistency: R + W > N

N=3, W=2, R=2  --> Strong consistency (most common)
N=3, W=1, R=3  --> Fast writes, slow reads
N=3, W=3, R=1  --> Slow writes, fast reads
N=3, W=1, R=1  --> Eventual consistency (fast but may read stale)

// Fault tolerance
Can tolerate (N - W) write failures
Can tolerate (N - R) read failures
Example: N=3, W=2 --> survive 1 node failure for writes

ACID vs BASE

Property	ACID (SQL)	BASE (NoSQL)
A	Atomicity — all or nothing	Basically Available
C/S	Consistency — valid state transitions only	Soft state — data may change over time
I/E	Isolation — concurrent txns don't interfere	Eventually consistent
D	Durability — committed data survives crashes	(Durability still applies)
Best for	Financial transactions, inventory, user accounts	Social feeds, analytics, logging, high-throughput writes

Normalization vs Denormalization

Aspect	Normalized	Denormalized
Data duplication	No duplication	Intentional duplication
Write performance	Fast (update one place)	Slower (update multiple copies)
Read performance	Slower (needs JOINs)	Fast (pre-joined data)
Storage	Efficient	More storage used
Data integrity	High (single source of truth)	Risk of inconsistency
When to use	Write-heavy, data integrity critical	Read-heavy, performance critical

-- Normalized: separate tables, JOIN to query
SELECT u.name, o.total, p.name AS product
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 123;

-- Denormalized: single document (MongoDB)
{
  "user_id": 123,
  "user_name": "Jane",
  "orders": [{
    "total": 99.99,
    "items": [{ "product_name": "Widget", "price": 49.99 }]
  }]
}
-- Single read, no joins, but updating user_name means updating all orders

Query Optimization Tips

-- 1. Use EXPLAIN ANALYZE to find slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;

-- 2. Avoid SELECT * (fetch only needed columns)
SELECT id, name, email FROM users WHERE status = 'active';

-- 3. Use LIMIT for pagination
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 40;

-- Better: cursor-based pagination (avoids OFFSET scan)
SELECT * FROM posts WHERE created_at < '2024-01-15T00:00:00'
ORDER BY created_at DESC LIMIT 20;

-- 4. Avoid N+1 queries (use JOINs or batch loading)
-- BAD: 1 query for users + N queries for orders
-- GOOD: SELECT * FROM orders WHERE user_id IN (1, 2, 3, ...);

-- 5. Batch inserts
INSERT INTO events (type, data, created_at) VALUES
  ('click', '{}', NOW()),
  ('view', '{}', NOW()),
  ('purchase', '{}', NOW());
-- 10x faster than individual inserts

Database Migration Strategies

Strategy	Downtime	Risk	Complexity
Blue-Green	Seconds (DNS switch)	Low (easy rollback)	Medium
Dual-Write	Zero	Medium (consistency)	High
Shadow Read	Zero	Low (compare results)	Medium
Strangler Fig	Zero	Low (incremental)	Medium
Big Bang	Hours	High	Low

Secure your databases with proper authentication and authorization, encryption at rest, and API security for database access layers. Use our Security Crypto Tools and API and Network Tools for testing. Visit swehelper.com/tools for all resources.

Frequently Asked Questions

Should I start with SQL or NoSQL for a new project?

Start with PostgreSQL unless you have a specific requirement that demands otherwise. PostgreSQL handles 90% of use cases well: it supports JSON (flexible schemas when needed), has excellent indexing, strong ACID compliance, and scales reasonably with read replicas. You can always add specialized databases later (Redis for caching, Elasticsearch for search) as needed.

When should I denormalize my data?

Denormalize when: (1) read performance is critical and reads far outnumber writes, (2) you are hitting JOIN limits in high-throughput queries, (3) you are moving to NoSQL where JOINs are not available, or (4) specific read patterns are well-defined and stable. Start normalized and denormalize only when you have evidence of performance issues. Premature denormalization leads to data inconsistency nightmares.

How do I handle cross-shard queries after sharding?

Cross-shard queries are expensive — they require scatter-gather across all shards. Minimize them by: (1) choosing a shard key that co-locates frequently joined data, (2) maintaining denormalized views for common cross-shard queries, (3) using a search index (Elasticsearch) as a secondary read path, (4) accepting eventual consistency for analytics queries. If most of your queries are cross-shard, your shard key choice may be wrong.

What is the best indexing strategy for a new application?

Start with indexes on: (1) primary keys (automatic), (2) foreign keys used in JOINs, (3) columns in WHERE clauses of frequent queries, (4) columns used for sorting (ORDER BY). Do not add indexes preemptively — add them based on slow query logs and EXPLAIN ANALYZE results. Over-indexing slows writes and wastes storage. Review indexes quarterly and drop unused ones.

How do I choose between Redis and Memcached for caching?

Use Redis if you need: data structures (lists, sets, sorted sets), persistence, pub/sub, Lua scripting, or cluster mode for horizontal scaling. Use Memcached if you need: simple key-value caching with multi-threaded performance, or if your caching needs are straightforward and you want simplicity. Redis is the default choice for most modern applications due to its versatility. See our Scalability Cheatsheet for cache sizing formulas.

Database Cheatsheet: SQL vs NoSQL, Indexing, Sharding, and Replication