20. Search Systems
20. Search Systems¶
Difficulty: ββββ
Overview¶
Search is a fundamental component of most applications. This lesson covers how search engines work internally β from inverted indexes and text analysis to distributed search architecture and ranking algorithms. We focus primarily on Elasticsearch as the most widely adopted solution.
Table of Contents¶
- Search Engine Fundamentals
- Inverted Index
- Elasticsearch Architecture
- Indexing and Mapping
- Search Queries
- Ranking and Relevance
- Scaling Search Systems
- Practice Problems
1. Search Engine Fundamentals¶
1.1 Search System Components¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Search System Architecture β
β β
β ββββββββββββββ ββββββββββββββ ββββββββββββββββββββββ β
β β Crawling / β β Indexing β β Query Processing β β
β β Ingestion ββββββΆβ Pipeline ββββββΆβ & Retrieval β β
β ββββββββββββββ ββββββββββββββ ββββββββββββββββββββββ β
β β β β β
β βΌ βΌ βΌ β
β ββββββββββββββ ββββββββββββββ ββββββββββββββββββββββ β
β β Fetch docs β β Tokenize β β Parse query β β
β β Extract textβ β Analyze β β Match documents β β
β β Normalize β β Build indexβ β Rank results β β
β β Detect lang β β Store β β Highlight β β
β ββββββββββββββ ββββββββββββββ ββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1.2 Full-Text Search vs Database Search¶
ββββββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββββ
β β Database (SQL) β Search Engine (ES) β
ββββββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββ€
β Primary use β Transactional β Search & analytics β
β Data model β Structured rows β Semi-structured docs β
β Query language β SQL β DSL (JSON) β
β Text matching β LIKE / FTS β Full-text + facets β
β Relevance scoring β Limited β Advanced (BM25) β
β Horizontal scaling β Complex (sharding) β Native (shards) β
β Real-time updates β ACID guarantees β Near real-time β
β Aggregations β GROUP BY β Rich aggregations β
β Consistency β Strong β Eventual β
ββββββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββββ
2. Inverted Index¶
2.1 How It Works¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Inverted Index β
β β
β Documents: β
β Doc1: "the quick brown fox" β
β Doc2: "the lazy brown dog" β
β Doc3: "the quick dog jumps" β
β β
β Forward Index: Inverted Index: β
β Doc1 β [the, quick, brown, fox] brown β [Doc1, Doc2] β
β Doc2 β [the, lazy, brown, dog] dog β [Doc2, Doc3] β
β Doc3 β [the, quick, dog, jumps] fox β [Doc1] β
β jumps β [Doc3] β
β lazy β [Doc2] β
β quick β [Doc1, Doc3] β
β the β [Doc1, Doc2, Doc3] β
β β
β Query: "quick brown" β
β quick β {Doc1, Doc3} β
β brown β {Doc1, Doc2} β
β Intersection: {Doc1} β AND query result β
β Union: {Doc1, Doc2, Doc3} β OR query result β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
2.2 Text Analysis Pipeline¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Text Analysis Pipeline β
β β
β Input: "The Quick Brown Foxes are JUMPING!" β
β β
β 1. Character Filters β
β βββ HTML strip: remove <tags> β
β βββ Pattern replace: normalize chars β
β Result: "The Quick Brown Foxes are JUMPING!" β
β β
β 2. Tokenizer β
β βββ Standard: split on word boundaries β
β βββ Result: ["The", "Quick", "Brown", "Foxes", "are", β
β "JUMPING"] β
β β
β 3. Token Filters β
β βββ Lowercase: "the", "quick", "brown", "foxes", ... β
β βββ Stop words: remove "the", "are" β
β βββ Stemmer: "foxes" β "fox", "jumping" β "jump" β
β βββ Result: ["quick", "brown", "fox", "jump"] β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
2.3 Posting List with Positions¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Posting List with Term Frequency & Positions β
β β
β Term: "database" β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Doc1: tf=3, positions=[5, 12, 45] β β
β β Doc3: tf=1, positions=[8] β β
β β Doc7: tf=5, positions=[2, 10, 15, 30, 42] β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Uses: β
β β’ tf (term frequency) β relevance scoring β
β β’ positions β phrase queries, proximity queries β
β β’ doc frequency β IDF calculation β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3. Elasticsearch Architecture¶
3.1 Cluster Architecture¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Elasticsearch Cluster β
β β
β Cluster: "production" β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Node 1 (Master + Data) β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β Shard 0 β β Shard 2 β βReplica 1 β β β
β β β (Primary)β β (Primary)β β β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Node 2 (Data) β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β Shard 1 β βReplica 0 β βReplica 2 β β β
β β β (Primary)β β β β β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Node 3 (Data) β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β β βReplica 0 β βReplica 1 β βReplica 2 β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Index: "products" β 3 primary shards, 1 replica each β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3.2 Index, Shard, and Segment¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Index β Shard β Segment β
β β
β Index (logical) β
β βββ Shard 0 (Lucene index) β
β β βββ Segment 1 (immutable) β
β β β βββ Inverted index β
β β β βββ Stored fields β
β β β βββ Doc values (columnar) β
β β βββ Segment 2 (immutable) β
β β βββ Segment 3 (immutable) β
β βββ Shard 1 β
β β βββ ... β
β βββ Shard 2 β
β βββ ... β
β β
β New docs β in-memory buffer β refresh (1s) β new segment β
β Segments merge periodically (background) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
3.3 Write and Read Paths¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Write Path: β
β β
β Client βββΆ Coordinating Node βββΆ Primary Shard β
β β β
β ββββΆ In-memory buffer β
β ββββΆ Translog (WAL) β
β ββββΆ Replica Shards β
β β
β Refresh (every 1s): buffer β searchable segment β
β Flush (every 30m): translog β disk, clear translog β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Read Path: β
β β
β Client βββΆ Coordinating Node β
β β β
β βββββββββΌββββββββ β
β βΌ βΌ βΌ β
β Shard 0 Shard 1 Shard 2 (scatter) β
β β β β β
β βββββββββΌββββββββ β
β βΌ β
β Merge & rank results (gather) β
β β β
β βΌ β
β Client β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
4. Indexing and Mapping¶
4.1 Index Creation¶
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"product_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "english_stemmer", "english_stop"]
}
},
"filter": {
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "product_analyzer",
"fields": {
"keyword": { "type": "keyword" },
"suggest": {
"type": "completion"
}
}
},
"description": {
"type": "text",
"analyzer": "product_analyzer"
},
"category": { "type": "keyword" },
"price": { "type": "float" },
"rating": { "type": "float" },
"created_at": { "type": "date" },
"in_stock": { "type": "boolean" },
"tags": { "type": "keyword" }
}
}
}
4.2 Field Types¶
ββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Type β Use Case β
ββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββ€
β text β Full-text search (analyzed, tokenized) β
β keyword β Exact match, sorting, aggregations β
β integer/long β Numeric values, range queries β
β float/double β Decimal numbers β
β date β Timestamps, date math β
β boolean β true/false filters β
β object β Nested JSON (flattened internally) β
β nested β Array of objects (maintains relationships) β
β geo_point β Latitude/longitude β
β completion β Auto-complete suggestions β
β dense_vector β ML embeddings, similarity search β
ββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββ
5. Search Queries¶
5.1 Match Queries¶
// Full-text match (analyzed)
GET /products/_search
{
"query": {
"match": {
"description": {
"query": "wireless bluetooth headphones",
"operator": "and"
}
}
}
}
// Multi-match across fields
GET /products/_search
{
"query": {
"multi_match": {
"query": "sony headphones",
"fields": ["name^3", "description", "tags^2"],
"type": "best_fields"
}
}
}
5.2 Boolean Queries¶
// Compound boolean query
GET /products/_search
{
"query": {
"bool": {
"must": [
{ "match": { "description": "wireless headphones" } }
],
"filter": [
{ "term": { "category": "electronics" } },
{ "range": { "price": { "gte": 50, "lte": 200 } } },
{ "term": { "in_stock": true } }
],
"should": [
{ "term": { "tags": "premium" } },
{ "range": { "rating": { "gte": 4.5 } } }
],
"must_not": [
{ "term": { "tags": "refurbished" } }
],
"minimum_should_match": 1
}
}
}
5.3 Aggregations¶
// Faceted search with aggregations
GET /products/_search
{
"size": 0,
"aggs": {
"categories": {
"terms": { "field": "category", "size": 20 }
},
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "to": 50, "key": "budget" },
{ "from": 50, "to": 200, "key": "mid-range" },
{ "from": 200, "key": "premium" }
]
}
},
"avg_rating": {
"avg": { "field": "rating" }
},
"top_rated_by_category": {
"terms": { "field": "category" },
"aggs": {
"avg_rating": { "avg": { "field": "rating" } },
"top_products": {
"top_hits": {
"size": 3,
"sort": [{ "rating": "desc" }],
"_source": ["name", "rating", "price"]
}
}
}
}
}
}
5.4 Autocomplete / Suggestions¶
// Completion suggester
POST /products/_search
{
"suggest": {
"product-suggest": {
"prefix": "wire",
"completion": {
"field": "name.suggest",
"size": 5,
"fuzzy": {
"fuzziness": "AUTO"
}
}
}
}
}
// Search-as-you-type with match_phrase_prefix
GET /products/_search
{
"query": {
"match_phrase_prefix": {
"name": {
"query": "wireless blue",
"max_expansions": 10
}
}
}
}
6. Ranking and Relevance¶
6.1 BM25 Scoring¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BM25 Scoring (Elasticsearch default) β
β β
β score(q, d) = Ξ£ IDF(qi) Γ [f(qi,d) Γ (k1+1)] β
β ββββββββββββββββββββββββ β
β f(qi,d) + k1Γ(1-b+bΓ|d|/avgdl) β
β β
β Where: β
β β’ IDF(qi) = inverse document frequency of term qi β
β (rare terms β higher score) β
β β’ f(qi,d) = term frequency of qi in document d β
β (more occurrences β higher, but with diminishing returns) β
β β’ |d| = document length β
β β’ avgdl = average document length β
β β’ k1 = term saturation parameter (default 1.2) β
β β’ b = length normalization (default 0.75) β
β β
β Key insight: BM25 has diminishing returns for term frequency β
β (unlike TF-IDF which grows linearly) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
6.2 Boosting and Custom Scoring¶
// Field boosting
GET /products/_search
{
"query": {
"multi_match": {
"query": "headphones",
"fields": ["name^5", "description^2", "tags^3"]
}
}
}
// Function score for custom ranking
GET /products/_search
{
"query": {
"function_score": {
"query": { "match": { "description": "headphones" } },
"functions": [
{
"field_value_factor": {
"field": "rating",
"modifier": "log1p",
"factor": 2
}
},
{
"gauss": {
"created_at": {
"origin": "now",
"scale": "30d",
"decay": 0.5
}
}
}
],
"score_mode": "multiply",
"boost_mode": "multiply"
}
}
}
6.3 Explain API¶
// Understand why a document scored the way it did
GET /products/_explain/doc_id
{
"query": {
"match": { "description": "wireless headphones" }
}
}
7. Scaling Search Systems¶
7.1 Shard Sizing Strategy¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Shard Sizing Guidelines β
β β
β Rules of thumb: β
β β’ Target shard size: 10-50 GB β
β β’ Max shards per node: ~20 per GB of heap β
β β’ Typical node heap: 32 GB β ~640 shards max β
β β
β Example calculation: β
β β’ Data: 500 GB, 1 replica β
β β’ Total: 1 TB (with replicas) β
β β’ Shard size target: 25 GB β
β β’ Primary shards: 500/25 = 20 β
β β’ Total shards: 20 Γ 2 = 40 β
β β’ Nodes needed: 40/640 β 1, but 3 for HA β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7.2 Index Lifecycle Management (ILM)¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Index Lifecycle (Hot-Warm-Cold-Delete) β
β β
β Hot Phase (0-7 days) β
β βββ Fast SSDs, high write throughput β
β βββ Rollover at 50GB or 7 days β
β βββ Primary for indexing + search β
β β
β Warm Phase (7-30 days) β
β βββ Standard disks, read-only β
β βββ Force merge to 1 segment per shard β
β βββ Shrink replica count β
β β
β Cold Phase (30-90 days) β
β βββ Object storage (S3) via searchable snapshots β
β βββ Minimal resources β
β βββ Slower queries acceptable β
β β
β Delete Phase (> 90 days) β
β βββ Automatically delete expired indices β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
7.3 Search Performance Optimization¶
Performance checklist:
1. Mapping optimization
- Use keyword for exact match (not text)
- Disable _source for write-heavy indices
- Use doc_values for sorting/aggregations
2. Query optimization
- Use filter context (cacheable, no scoring)
- Avoid wildcard queries on large fields
- Use routing for tenant-specific data
3. Index optimization
- Appropriate shard count (avoid over-sharding)
- Force merge read-only indices
- Use index sorting for common sort patterns
4. Hardware
- 50% memory for OS file cache
- SSDs for hot data
- Dedicated master nodes in production
8. Practice Problems¶
Problem 1: E-Commerce Search¶
Design a search system for an e-commerce platform with 10M products.
Requirements:
- Full-text search across name, description, brand
- Category faceting
- Price range filtering
- Sort by relevance, price, rating, newest
- Autocomplete suggestions
- Typo tolerance
Key decisions:
- Index design: 5 primary shards (10M docs β 20GB)
- Custom analyzer with synonyms (e.g., "phone" = "smartphone")
- Completion suggester for autocomplete
- Function score boosting popular products
- Fuzziness for typo tolerance
Problem 2: Log Search Platform¶
Design a centralized log search system for 50TB/day of logs.
Key considerations:
- Write-heavy workload (50TB/day = ~580MB/s)
- Time-based indices (daily or hourly rollover)
- Hot-warm-cold architecture
- 7-day search window (hot), 30-day archive (cold)
- Structured fields: timestamp, level, service, message
Architecture:
- Kafka as ingestion buffer
- Logstash/Vector for parsing
- Hot nodes: 10 Γ (32GB RAM, 2TB NVMe SSD)
- Warm nodes: 6 Γ (32GB RAM, 8TB HDD)
- ILM policy: rollover at 50GB, move to warm at 24h
- Estimated cost vs ELK Cloud vs Loki comparison
Problem 3: Autocomplete System¶
Design a search autocomplete that returns results in < 50ms.
Approach:
1. Completion suggester for prefix matching
2. Edge n-gram analyzer for partial word matching
3. Separate lightweight index for suggestions
4. Cache popular queries in Redis (TTL: 5min)
5. Client-side debouncing (300ms)
Index design:
{
"suggest_text": {
"type": "text",
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "standard",
"fields": {
"suggest": { "type": "completion" }
}
},
"popularity": { "type": "integer" }
}
Query pipeline:
User types β debounce β edge-ngram match + completion suggest
β boost by popularity β return top 10