Skip to main content
🎯Interview Prep

Database Cheatsheet: SQL vs NoSQL, Indexing, Sharding, and Replication

Choosing the right database is one of the most consequential decisions in system design. This cheatsheet gives you quick-reference comparisons, decision fr...

📖 9 min read

Database Cheatsheet: SQL vs NoSQL, Indexing, Sharding, and Replication

Choosing the right database is one of the most consequential decisions in system design. This cheatsheet gives you quick-reference comparisons, decision frameworks, and practical guidance for every database topic you will encounter in system design interviews. Pair this with our Scalability Cheatsheet and General Cheat Sheets for complete preparation.

SQL vs NoSQL Comparison

Feature SQL (Relational) NoSQL
Schema Fixed schema, predefined tables Flexible/dynamic schema
Data Model Tables with rows and columns Documents, key-value, wide-column, graph
Transactions Full ACID compliance Varies (some support ACID per-document)
Joins Native, efficient multi-table joins No joins (denormalize or application-level)
Scaling Primarily vertical; horizontal is complex Designed for horizontal scaling
Consistency Strong by default Configurable (eventual to strong)
Query Language SQL (standardized, powerful) Database-specific APIs
Best For Complex queries, transactions, structured data High throughput, flexible schema, horizontal scale

When to Use Which Database

Database Type Strengths Best Use Cases
PostgreSQL Relational ACID, complex queries, JSON support, extensions Default choice for most applications, geospatial (PostGIS)
MySQL Relational Mature ecosystem, read performance, replication Web applications, read-heavy workloads, CMS platforms
MongoDB Document Flexible schema, developer-friendly, aggregation pipeline Rapid prototyping, content management, catalogs
Cassandra Wide Column Extreme write throughput, linear scalability, multi-DC Time-series, messaging, IoT, event logging
Redis Key-Value Sub-millisecond latency, rich data structures Caching, sessions, leaderboards, rate limiting, pub/sub
DynamoDB Key-Value/Document Managed, auto-scaling, single-digit ms latency Serverless apps, gaming, ad-tech, session management
Neo4j Graph Relationship traversal, pattern matching Social networks, fraud detection, recommendations
InfluxDB Time Series Time-based queries, downsampling, retention policies Monitoring, metrics, IoT sensor data
Elasticsearch Search Engine Full-text search, aggregations, inverted index Search functionality, log analysis, analytics

Database Decision Tree

Start Here:
|
|-- Need ACID transactions across multiple tables?
|   YES --> PostgreSQL / MySQL
|
|-- Need flexible/evolving schema?
|   YES --> MongoDB (document model)
|
|-- Need extreme write throughput (>100K writes/sec)?
|   YES --> Cassandra (wide-column)
|
|-- Need sub-millisecond reads/caching?
|   YES --> Redis (in-memory)
|
|-- Need full-text search?
|   YES --> Elasticsearch
|
|-- Need to traverse relationships (social graph)?
|   YES --> Neo4j (graph)
|
|-- Need time-series data with retention policies?
|   YES --> InfluxDB / TimescaleDB
|
|-- Need managed, serverless, auto-scaling?
|   YES --> DynamoDB
|
|-- Default/Uncertain?
|   --> PostgreSQL (most versatile, rarely wrong)

Indexing Rules and Types

Index Type Structure Best For Limitation
B-Tree (default) Balanced tree, sorted data Range queries, equality, sorting Write overhead per insert/update
Hash Hash table Exact equality lookups only No range queries, no sorting
Composite (Multi-column) B-Tree on multiple columns Queries filtering on multiple columns Column order matters (leftmost prefix)
Covering Index includes all query columns Index-only scans (no table access) Larger index size
GIN (Generalized Inverted) Inverted index Full-text search, JSON, arrays Slower writes
GiST (Generalized Search Tree) Generalized tree Geospatial, range types, nearest neighbor Less precise than B-Tree for simple queries
Partial Index with WHERE clause Indexing subset of rows (e.g., active users only) Only used when query matches condition

Indexing Best Practices

-- 1. Index columns in WHERE, JOIN, and ORDER BY clauses
CREATE INDEX idx_users_email ON users(email);

-- 2. Composite index: put high-cardinality columns first
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);

-- 3. Covering index: include all selected columns
CREATE INDEX idx_orders_covering ON orders(user_id, status)
  INCLUDE (total_amount, created_at);

-- 4. Partial index: index only relevant rows
CREATE INDEX idx_active_users ON users(email) WHERE is_active = true;

-- 5. Avoid indexing:
--    - Low-cardinality columns (boolean, status with 2-3 values)
--    - Frequently updated columns (high write overhead)
--    - Small tables (full scan is faster)

-- 6. Monitor unused indexes
SELECT * FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND idx_tup_read = 0;

Sharding Strategies

Strategy How It Works Pros Cons
Hash-Based hash(shard_key) % N Even data distribution Resharding requires data migration; no range queries
Range-Based Key ranges per shard (1-1M, 1M-2M) Efficient range queries Hot spots if ranges are uneven
Geographic Shard by user region Data locality, low latency Cross-region queries are expensive
Directory-Based Lookup table maps key to shard Most flexible, easy resharding Lookup table is single point of failure
Consistent Hashing Hash ring with virtual nodes Minimal data movement on shard changes More complex to implement

Choosing a Shard Key

Good Shard Keys:
- user_id    (evenly distributed, queries scoped to user)
- tenant_id  (multi-tenant SaaS, natural isolation)
- order_id   (if no cross-user queries needed)

Bad Shard Keys:
- timestamp  (all writes go to latest shard - hot spot)
- country    (uneven distribution: US shard overloaded)
- status     (only 3-5 values - terrible distribution)

Rule: Pick a key that is:
1. High cardinality (many distinct values)
2. Evenly distributed across shards
3. Used in most queries (avoid cross-shard queries)
4. Immutable (changing shard key requires data migration)

Replication Modes

Mode How It Works Consistency Availability
Single Leader (Primary-Replica) One primary handles writes; replicas serve reads Strong on primary; eventual on replicas Primary failure requires failover
Multi-Leader (Primary-Primary) Multiple nodes accept writes; sync asynchronously Eventual (conflict resolution needed) High (no single point of failure for writes)
Leaderless (Quorum) Any node accepts reads/writes; quorum decides Tunable (R + W > N for strong) Very high (tolerates multiple failures)

Quorum Configuration

// N = total replicas, W = write quorum, R = read quorum
// Strong consistency: R + W > N

N=3, W=2, R=2  --> Strong consistency (most common)
N=3, W=1, R=3  --> Fast writes, slow reads
N=3, W=3, R=1  --> Slow writes, fast reads
N=3, W=1, R=1  --> Eventual consistency (fast but may read stale)

// Fault tolerance
Can tolerate (N - W) write failures
Can tolerate (N - R) read failures
Example: N=3, W=2 --> survive 1 node failure for writes

ACID vs BASE

Property ACID (SQL) BASE (NoSQL)
A Atomicity — all or nothing Basically Available
C/S Consistency — valid state transitions only Soft state — data may change over time
I/E Isolation — concurrent txns don't interfere Eventually consistent
D Durability — committed data survives crashes (Durability still applies)
Best for Financial transactions, inventory, user accounts Social feeds, analytics, logging, high-throughput writes

Normalization vs Denormalization

Aspect Normalized Denormalized
Data duplication No duplication Intentional duplication
Write performance Fast (update one place) Slower (update multiple copies)
Read performance Slower (needs JOINs) Fast (pre-joined data)
Storage Efficient More storage used
Data integrity High (single source of truth) Risk of inconsistency
When to use Write-heavy, data integrity critical Read-heavy, performance critical
-- Normalized: separate tables, JOIN to query
SELECT u.name, o.total, p.name AS product
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE u.id = 123;

-- Denormalized: single document (MongoDB)
{
  "user_id": 123,
  "user_name": "Jane",
  "orders": [{
    "total": 99.99,
    "items": [{ "product_name": "Widget", "price": 49.99 }]
  }]
}
-- Single read, no joins, but updating user_name means updating all orders

Query Optimization Tips

-- 1. Use EXPLAIN ANALYZE to find slow queries
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;

-- 2. Avoid SELECT * (fetch only needed columns)
SELECT id, name, email FROM users WHERE status = 'active';

-- 3. Use LIMIT for pagination
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 40;

-- Better: cursor-based pagination (avoids OFFSET scan)
SELECT * FROM posts WHERE created_at < '2024-01-15T00:00:00'
ORDER BY created_at DESC LIMIT 20;

-- 4. Avoid N+1 queries (use JOINs or batch loading)
-- BAD: 1 query for users + N queries for orders
-- GOOD: SELECT * FROM orders WHERE user_id IN (1, 2, 3, ...);

-- 5. Batch inserts
INSERT INTO events (type, data, created_at) VALUES
  ('click', '{}', NOW()),
  ('view', '{}', NOW()),
  ('purchase', '{}', NOW());
-- 10x faster than individual inserts

Database Migration Strategies

Strategy Downtime Risk Complexity
Blue-Green Seconds (DNS switch) Low (easy rollback) Medium
Dual-Write Zero Medium (consistency) High
Shadow Read Zero Low (compare results) Medium
Strangler Fig Zero Low (incremental) Medium
Big Bang Hours High Low

Secure your databases with proper authentication and authorization, encryption at rest, and API security for database access layers. Use our Security Crypto Tools and API and Network Tools for testing. Visit swehelper.com/tools for all resources.

Frequently Asked Questions

Should I start with SQL or NoSQL for a new project?

Start with PostgreSQL unless you have a specific requirement that demands otherwise. PostgreSQL handles 90% of use cases well: it supports JSON (flexible schemas when needed), has excellent indexing, strong ACID compliance, and scales reasonably with read replicas. You can always add specialized databases later (Redis for caching, Elasticsearch for search) as needed.

When should I denormalize my data?

Denormalize when: (1) read performance is critical and reads far outnumber writes, (2) you are hitting JOIN limits in high-throughput queries, (3) you are moving to NoSQL where JOINs are not available, or (4) specific read patterns are well-defined and stable. Start normalized and denormalize only when you have evidence of performance issues. Premature denormalization leads to data inconsistency nightmares.

How do I handle cross-shard queries after sharding?

Cross-shard queries are expensive — they require scatter-gather across all shards. Minimize them by: (1) choosing a shard key that co-locates frequently joined data, (2) maintaining denormalized views for common cross-shard queries, (3) using a search index (Elasticsearch) as a secondary read path, (4) accepting eventual consistency for analytics queries. If most of your queries are cross-shard, your shard key choice may be wrong.

What is the best indexing strategy for a new application?

Start with indexes on: (1) primary keys (automatic), (2) foreign keys used in JOINs, (3) columns in WHERE clauses of frequent queries, (4) columns used for sorting (ORDER BY). Do not add indexes preemptively — add them based on slow query logs and EXPLAIN ANALYZE results. Over-indexing slows writes and wastes storage. Review indexes quarterly and drop unused ones.

How do I choose between Redis and Memcached for caching?

Use Redis if you need: data structures (lists, sets, sorted sets), persistence, pub/sub, Lua scripting, or cluster mode for horizontal scaling. Use Memcached if you need: simple key-value caching with multi-threaded performance, or if your caching needs are straightforward and you want simplicity. Redis is the default choice for most modern applications due to its versatility. See our Scalability Cheatsheet for cache sizing formulas.

Related Articles