# Vector Databases
## When to Use This Skill
Use this skill when:
- Choosing between vector database options
- Designing semantic/similarity search systems
- Optimizing vector search performance
- Understanding ANN algorithm trade-offs
- Scaling vector search infrastructure
- Implementing hybrid search (vectors + filters)
Keywords: vector database, embeddings, vector search, similarity search, ANN, approximate nearest neighbor, HNSW, IVF, FAISS, Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, cosine similarity, semantic search
## Vector Database Comparison

### Managed Services

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Pinecone | Fully managed, easy scaling, enterprise features | Vendor lock-in, cost at scale | Enterprise production |
| Weaviate Cloud | GraphQL, hybrid search, modules | Complexity | Knowledge graphs |
| Zilliz Cloud | Milvus-based, high performance | Learning curve | High-scale production |
| MongoDB Atlas Vector | Vector search alongside existing MongoDB data | Newer feature | MongoDB shops |
| Elastic Vector | Vector search within an existing Elastic stack | Resource heavy | Search platforms |
### Self-Hosted Options

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Milvus | Feature-rich, scalable, GPU support | Operational complexity | Large-scale production |
| Qdrant | Rust performance, rich filtering, easy setup | Smaller ecosystem | Performance-focused deployments |
| Weaviate | Modules, semantic and hybrid search | Memory usage | Knowledge applications |
| Chroma | Simple, Python-native | Limited scale | Development, prototyping |
| pgvector | PostgreSQL extension | Performance limits | Postgres shops |
| FAISS | Library (not a DB), fastest raw search | No server, persistence, or filtering layer | Research, embedded use |
### Selection Decision Tree

```
Need managed, don't want operations?
├── Yes → Pinecone (simplest) or Weaviate Cloud
└── No (self-hosted)
    └── Already using PostgreSQL?
        ├── Yes, <1M vectors → pgvector
        └── No
            └── Need maximum performance at scale?
                ├── Yes → Milvus or Qdrant
                └── No
                    └── Prototyping/development?
                        ├── Yes → Chroma
                        └── No → Qdrant (balanced choice)
```
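If the tree lands on prototyping, a few lines of Chroma are enough to validate the idea. This is a hedged sketch using the chromadb Python client's in-memory mode; the collection name and documents are illustrative.

```python
# Hypothetical quick-start sketch with the chromadb client (names illustrative).
import chromadb

client = chromadb.Client()                    # in-memory client, no server needed
collection = client.create_collection("docs")
collection.add(
    ids=["a", "b"],
    documents=["wireless headphones", "trail running shoes"],  # embedded by Chroma's default model
)
results = collection.query(query_texts=["bluetooth audio"], n_results=1)
print(results["ids"])                         # expected: [["a"]]
```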
## ANN Algorithms

### Algorithm Overview

Exact KNN:
- Searches ALL vectors
- O(n) time complexity per query
- Perfect accuracy
- Impractical at scale

Approximate NN (ANN):
- Searches a SUBSET of vectors
- Sub-linear complexity (typically around O(log n))
- High, tunable recall
- Practical at large scale
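As a baseline, exact KNN is a few lines of NumPy. This sketch uses dot-product similarity over unit-normalized vectors and scans every row, which is exactly the O(n) cost that ANN indexes avoid.

```python
# Minimal exact (brute-force) KNN: every query scans all n vectors, O(n) per query.
import numpy as np

def exact_knn(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    scores = vectors @ query                 # similarity score against all n vectors
    top_k = np.argsort(-scores)[:k]          # indices of the k highest scores
    return top_k, scores[top_k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize rows
idx, scores = exact_knn(vectors[42], vectors, k=3)
print(idx[0], scores[0])                     # 42, ~1.0: a vector is its own nearest neighbor
```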
### HNSW (Hierarchical Navigable Small World)

```
Layer 3: ○───────────────────────○  (sparse, long connections)
         │                       │
Layer 2: ○───○───────○───────○───○  (medium density)
         │   │       │       │   │
Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○  (denser)
         │││││││││││││││││││││││
Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○  (all vectors)
```

Search: start at the top layer, greedily descend toward the query.
- Fast: O(log n) search time
- High recall: >95% typically
- Memory: extra graph storage
HNSW Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| M | Connections per node | Memory vs. recall |
| ef_construction | Build-time search width | Build time vs. recall |
| ef_search | Query-time search width | Latency vs. recall |
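The three parameters map directly onto FAISS's HNSW index. The values below are illustrative starting points rather than tuned recommendations; treat this as a sketch, not a canonical configuration.

```python
# Hedged FAISS HNSW sketch: M, efConstruction, efSearch as in the table above.
import numpy as np
import faiss

d, M = 128, 32                         # vector dimensions, connections per node
index = faiss.IndexHNSWFlat(d, M)      # HNSW over raw (uncompressed) vectors
index.hnsw.efConstruction = 200        # build-time search width: slower build, better graph

xb = np.random.rand(100_000, d).astype(np.float32)
index.add(xb)                          # HNSW builds incrementally as vectors are added

index.hnsw.efSearch = 64               # query-time search width: latency vs. recall knob
D, I = index.search(xb[:5], 10)        # distances and ids of the 10 nearest neighbors
```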
### IVF (Inverted File Index)

Clustering phase:

```
┌─────────────────────────────────────────┐
│  Cluster vectors into K centroids       │
│                                         │
│    ●        ●        ●        ●         │
│   /│\      /│\      /│\      /│\        │
│  ○○○○○    ○○○○○    ○○○○○    ○○○○○       │
│  Cluster 1 Cluster 2 Cluster 3 Cluster 4│
└─────────────────────────────────────────┘
```

Search phase:
1. Find the nprobe nearest centroids
2. Search only those clusters
3. Much faster than exhaustive search
IVF Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| nlist | Number of clusters | Build time vs. search quality |
| nprobe | Clusters to search | Latency vs. recall |
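In FAISS, the two phases correspond to train() (learn the nlist centroids) and a search with nprobe set. A sketch with illustrative sizes:

```python
# Hedged FAISS IVF sketch: train centroids, then probe only nprobe clusters per query.
import numpy as np
import faiss

d, nlist = 128, 256                             # dimensions, number of clusters
xb = np.random.rand(100_000, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                # assigns vectors/queries to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                 # clustering phase: learn nlist centroids
index.add(xb)                                   # route each vector to its inverted list

index.nprobe = 8                                # search phase: scan 8 nearest clusters only
D, I = index.search(xb[:5], 10)
```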
### IVF-PQ (Product Quantization)

```
Original vector (128 dims):
[0.1, 0.2, ..., 0.9]                 (128 × 4 bytes = 512 bytes)

PQ compressed (8 subvectors, 8-bit codes):
[23, 45, 12, 89, 56, 34, 78, 90]     (8 bytes)
```

Memory reduction: 64x
Accuracy trade-off: ~5% recall drop
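FAISS combines both ideas in IndexIVFPQ. The sketch below mirrors the example above with m = 8 subquantizers at 8 bits each, so each stored vector costs 8 bytes instead of 512; sizes are illustrative.

```python
# Hedged FAISS IVF-PQ sketch: 8 subquantizers × 8-bit codes = 8 bytes per vector.
import numpy as np
import faiss

d, nlist, m, nbits = 128, 256, 8, 8             # m subvectors, 2^nbits codes per subvector
xb = np.random.rand(100_000, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                 # learns coarse centroids and PQ codebooks
index.add(xb)                                   # stores 8-byte codes, not raw floats

index.nprobe = 8
D, I = index.search(xb[:5], 10)                 # distances are PQ approximations
```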
### Algorithm Comparison

| Algorithm | Search Speed | Memory | Build Time | Recall |
|---|---|---|---|---|
| Flat/Brute-force | Slow (O(n)) | Low | None | 100% |
| IVF | Fast | Low | Medium | 90-95% |
| IVF-PQ | Very fast | Very low | Medium | 85-92% |
| HNSW | Very fast | High | Slow | 95-99% |
| HNSW+PQ | Very fast | Medium | Slow | 90-95% |
### When to Use Which

```
< 100K vectors:
└── Flat index (exact search is fast enough)

100K - 1M vectors:
└── HNSW (best recall/speed trade-off)

1M - 100M vectors:
├── Memory available → HNSW
└── Memory constrained → IVF-PQ or HNSW+PQ

> 100M vectors:
└── Sharded IVF-PQ or distributed HNSW
```
## Distance Metrics

### Common Metrics

| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine Similarity | A·B / (‖A‖ ‖B‖) | [-1, 1] | Normalized embeddings |
| Dot Product | A·B | (-∞, ∞) | When magnitude matters |
| Euclidean (L2) | √Σ(Aᵢ-Bᵢ)² | [0, ∞) | Absolute distances |
| Manhattan (L1) | Σ\|Aᵢ-Bᵢ\| | [0, ∞) | High-dimensional sparse data |
### Metric Selection

```
Embeddings pre-normalized (unit vectors)?
├── Yes → Cosine = Dot Product (use Dot, faster)
└── No
    └── Magnitude meaningful?
        ├── Yes → Dot Product
        └── No → Cosine Similarity
```

Note: most embedding models output normalized vectors, so dot product is usually the best choice.
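A quick NumPy sanity check of that note: once vectors are unit-normalized, cosine similarity and dot product give the same number, so the extra division can be skipped.

```python
# For unit vectors, cosine similarity equals the dot product.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=128), rng.normal(size=128)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)     # normalize to unit length

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(cosine, a @ b)    # identical: use the cheaper dot product
```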
## Filtering and Hybrid Search

### Pre-filtering vs Post-filtering

Pre-filtering (Filter → Search):

```
┌─────────────────────────────────────────┐
│ 1. Apply metadata filter                │
│    (category = "electronics")           │
│    Result: 10K of 1M vectors            │
│                                         │
│ 2. Vector search on 10K vectors         │
│    Much faster, guaranteed filter match │
└─────────────────────────────────────────┘
```

Post-filtering (Search → Filter):

```
┌─────────────────────────────────────────┐
│ 1. Vector search on 1M vectors          │
│    Return top-1000                      │
│                                         │
│ 2. Apply metadata filter                │
│    May return < K results!              │
└─────────────────────────────────────────┘
```
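As a concrete example of pre-filtering, here is a hedged sketch using the qdrant-client Python API, where the engine applies the payload filter during traversal so every returned hit satisfies it. The collection name "products", the payload field "category", and the 384-dim query vector are all hypothetical.

```python
# Hypothetical pre-filtered search with qdrant-client (names and sizes illustrative).
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="products",                   # hypothetical collection
    query_vector=[0.1] * 384,                     # embedding of the query text
    query_filter=models.Filter(                   # applied during search, not after
        must=[models.FieldCondition(
            key="category",
            match=models.MatchValue(value="electronics"),
        )]
    ),
    limit=10,                                     # all 10 hits satisfy the filter
)
```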
### Hybrid Search Architecture

```
Query: "wireless headphones under $100"
                │
          ┌─────┴─────┐
          ▼           ▼
      ┌───────┐   ┌───────┐
      │Vector │   │Filter │
      │Search │   │ Build │
      │"wire- │   │price  │
      │less   │   │< 100  │
      │head-  │   │       │
      │phones"│   │       │
      └───────┘   └───────┘
          │           │
          └─────┬─────┘
                ▼
          ┌───────────┐
          │  Combine  │
          │  Results  │
          └───────────┘
```
### Metadata Index Design

| Metadata Type | Index Strategy | Query Example |
|---|---|---|
| Categorical | Bitmap/hash index | category = "books" |
| Numeric range | B-tree | price BETWEEN 10 AND 50 |
| Keyword search | Inverted index | tags CONTAINS "sale" |
| Geospatial | R-tree/geohash | location NEAR (lat, lng) |
## Scaling Strategies

### Sharding Approaches

Naive sharding (by ID):

```
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Shard 1 │  │ Shard 2 │  │ Shard 3 │
│ IDs 0-N │  │IDs N-2N │  │IDs 2N-3N│
└─────────┘  └─────────┘  └─────────┘

Query → Search ALL shards → Merge results
```

Semantic sharding (by cluster):

```
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Shard 1 │  │ Shard 2 │  │ Shard 3 │
│  Tech   │  │ Health  │  │ Finance │
│  docs   │  │  docs   │  │  docs   │
└─────────┘  └─────────┘  └─────────┘

Query → Route to relevant shard(s) → Faster!
```
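The merge step in naive sharding is a simple scatter-gather. This sketch assumes each shard object exposes a search() returning (score, id) pairs for its local top-k; that call is a stand-in for whatever per-shard query your deployment actually uses.

```python
# Scatter-gather over naive shards: query every shard, merge local top-k lists.
import heapq

def search_all_shards(query, shards, k: int = 10):
    partial = []
    for shard in shards:
        # hypothetical per-shard call returning [(score, doc_id), ...]
        partial.extend(shard.search(query, k))
    return heapq.nlargest(k, partial)    # global top-k by similarity score
```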
### Replication

```
┌─────────────────────────────────────────┐
│              Load Balancer              │
└─────────────────────────────────────────┘
      │           │           │
      ▼           ▼           ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│Replica 1│  │Replica 2│  │Replica 3│
│ (Read)  │  │ (Read)  │  │ (Read)  │
└─────────┘  └─────────┘  └─────────┘
      │           │           │
      └───────────┼───────────┘
                  │
             ┌─────────┐
             │ Primary │
             │ (Write) │
             └─────────┘
```
### Scaling Decision Matrix

| Scale (vectors) | Architecture | Replication |
|---|---|---|
| < 1M | Single node | Optional |
| 1-10M | Single node, more RAM | For HA |
| 10-100M | Sharded, few nodes | Required |
| 100M-1B | Sharded, many nodes | Required |
| > 1B | Sharded + tiered | Required |
## Performance Optimization

### Index Build Optimization

| Optimization | Description | Impact |
|---|---|---|
| Batch insertion | Insert in batches of 1K-10K | 10x faster |
| Parallel build | Multi-threaded index construction | 2-4x faster |
| Incremental index | Add to existing index | Avoids full rebuild |
| GPU acceleration | Use GPU for training (IVF) | 10-100x faster |
### Query Optimization

| Optimization | Description | Impact |
|---|---|---|
| Warm cache | Keep index in memory | 10x latency reduction |
| Query batching | Batch similar queries | Higher throughput |
| Reduce dimensions | PCA, random projection | 2-4x faster |
| Early termination | Stop when "good enough" | Lower latency |
### Memory Optimization

```
Memory per vector:
┌────────────────────────────────────────┐
│ 1536 dims × 4 bytes ≈ 6KB per vector   │
│                                        │
│ 1M vectors:                            │
│   Raw:          6GB                    │
│   + HNSW graph: +2-4GB (M-dependent)   │
│   = 8-10GB total                       │
│                                        │
│ With PQ (64 subquantizers, 1 byte each │
│ = 64 bytes/vector):                    │
│   1M vectors: ~64MB (~100x reduction)  │
└────────────────────────────────────────┘
```
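The same arithmetic as a small helper. The HNSW graph overhead is passed in explicitly since it depends on M and the implementation; these figures are estimates, not vendor-measured numbers.

```python
# Back-of-the-envelope serving-memory estimate (float32 vectors by default).
def serving_memory_gb(n_vectors: int, dims: int, bytes_per_dim: float = 4,
                      graph_overhead_gb: float = 0.0) -> float:
    raw_gb = n_vectors * dims * bytes_per_dim / 1e9
    return raw_gb + graph_overhead_gb

print(serving_memory_gb(1_000_000, 1536, graph_overhead_gb=3.0))  # ≈ 9.1 GB (raw + HNSW graph)
print(serving_memory_gb(1_000_000, 64, bytes_per_dim=1))          # PQ at 64 B/vector ≈ 0.06 GB
```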
## Operational Considerations

### Backup and Recovery

| Strategy | Description | RPO/RTO |
|---|---|---|
| Snapshots | Periodic full backup | Hours |
| WAL replication | Write-ahead log streaming | Minutes |
| Real-time sync | Synchronous replication | Seconds |
### Monitoring Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Query latency p99 | 99th percentile latency | > 100ms |
| Recall | Search accuracy | < 90% |
| QPS | Queries per second | Capacity dependent |
| Memory usage | Index memory | > 80% |
| Index freshness | Time since last update | Domain dependent |
### Index Maintenance

```
┌─────────────────────────────────────────┐
│ Index Maintenance Tasks                 │
├─────────────────────────────────────────┤
│ • Compaction: merge small segments      │
│ • Reindex: rebuild a degraded index     │
│ • Vacuum: remove deleted vectors        │
│ • Optimize: tune parameters             │
│                                         │
│ Schedule during low-traffic periods     │
└─────────────────────────────────────────┘
```
## Common Patterns

### Multi-Tenant Vector Search

Option 1: Namespace/collection per tenant

```
┌─────────────────────────────────────────┐
│ tenant_1_collection                     │
│ tenant_2_collection                     │
│ tenant_3_collection                     │
└─────────────────────────────────────────┘
```

Pro: complete isolation
Con: many indexes, operational overhead

Option 2: Single collection + tenant filter

```
┌─────────────────────────────────────────┐
│ shared_collection                       │
│   metadata: { tenant_id: "..." }        │
│   Pre-filter by tenant_id               │
└─────────────────────────────────────────┘
```

Pro: simpler operations
Con: requires efficient filtering
### Real-Time Updates

Write path:

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Write     │    │   Buffer    │    │   Merge     │
│   Request   │───▶│  (Memory)   │───▶│  to Index   │
└─────────────┘    └─────────────┘    └─────────────┘
```

Strategy (sketched in code below):
1. Buffer writes in memory
2. Periodically merge to the main index
3. Search: main index + buffer
4. Compact periodically
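A minimal sketch of that strategy, assuming a FAISS-style main index whose search() returns (distances, ids). The buffer is scanned exactly because it stays small, and buffered items get hypothetical negative placeholder ids.

```python
# Hedged buffer-and-merge sketch; assumes a FAISS-style .add()/.search() main index.
import numpy as np

class BufferedIndex:
    def __init__(self, main_index, merge_threshold: int = 10_000):
        self.main = main_index
        self.buffer = []                          # step 1: in-memory write buffer
        self.threshold = merge_threshold

    def add(self, vec: np.ndarray):
        self.buffer.append(vec)
        if len(self.buffer) >= self.threshold:    # step 2: periodic merge
            self.main.add(np.stack(self.buffer))
            self.buffer.clear()

    def search(self, query: np.ndarray, k: int):
        D, I = self.main.search(query[None, :], k)         # step 3a: ANN on main index
        hits = list(zip(D[0].tolist(), I[0].tolist()))
        for j, vec in enumerate(self.buffer):              # step 3b: exact scan of buffer
            hits.append((float(np.sum((vec - query) ** 2)), -(j + 1)))  # placeholder ids
        hits.sort(key=lambda h: h[0])                      # L2: smaller distance is closer
        return hits[:k]
```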
### Embedding Versioning

```
Version 1 embeddings ──┐
                       │
Version 2 embeddings ──┼──▶ Parallel indexes during migration
                       │
                       │    ┌─────────────────────┐
                       └───▶│ Gradual reindexing  │
                            │ Blue-green switch   │
                            └─────────────────────┘
```
## Cost Estimation

### Storage Costs

```
Monthly storage cost = (vectors × dims × bytes/dim × replicas) / bytes-per-GB × $/GB/month

Example:
  10M vectors × 1536 dims × 4 bytes × 3 replicas ≈ 184 GB
  At $0.10/GB/month ≈ $18.40/month storage
```

Note: memory (for serving) costs more than disk storage.
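The same formula as a helper reproducing the worked example; the $0.10/GB/month price is just the example's assumption.

```python
# Storage-cost helper for the example above ($/GB/month is an assumed price).
def monthly_storage_cost(n_vectors: int, dims: int, replicas: int,
                         price_per_gb: float = 0.10, bytes_per_dim: int = 4):
    gb = n_vectors * dims * bytes_per_dim * replicas / 1e9
    return gb, gb * price_per_gb

gb, cost = monthly_storage_cost(10_000_000, 1536, replicas=3)
print(f"{gb:.0f} GB -> ${cost:.2f}/month")    # ≈ 184 GB -> ≈ $18.4/month
```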
### Compute Costs

Factors:
- QPS (queries per second)
- Latency requirements
- Index type (HNSW needs more RAM)
- Filtering complexity

Rules of thumb:
- 1M vectors, HNSW, <50ms latency: 16GB RAM
- 10M vectors, HNSW, <50ms latency: 64-128GB RAM
- 100M vectors: distributed system required
## Related Skills

- rag-architecture - Using vector databases in RAG systems
- llm-serving-patterns - LLM inference with vector retrieval
- ml-system-design - End-to-end ML pipeline design
- estimation-techniques - Capacity planning for vector systems
## Version History

- v1.0.0 (2025-12-26): Initial release - Vector database patterns for systems design

## Last Updated

Date: 2025-12-26