
Design a Web Crawler

A comprehensive system design for a web-scale crawler that discovers, downloads, and indexes billions of pages. This design addresses URL frontier management with politeness constraints, distributed crawling at scale, duplicate detection, and freshness maintenance across petabytes of web content.

[Architecture diagram] Seed URLs feed a URL frontier (priority queues by importance, politeness queues by host, URL scheduler). A fetcher cluster (DNS resolver + cache, robots.txt cache, HTTP fetchers with a connection pool, JS renderer running headless Chrome) downloads pages. A parser pipeline (content extractor, link extractor, duplicate detector using SimHash + Bloom filters) feeds the storage layer (URL database of seen URLs, content store / page archive, search index, web graph of link structure). Kafka carries messages between stages and ZooKeeper handles coordination.

High-level web crawler architecture: Seeds flow through a prioritized, politeness-aware frontier to distributed fetchers. Parsed content feeds storage while extracted links cycle back through duplicate detection.

Web crawlers solve three interconnected problems: frontier management (deciding which URLs to fetch next), politeness (avoiding overwhelming target servers), and deduplication (avoiding redundant downloads of seen URLs and near-duplicate content).

The URL frontier is a two-tier queue system. Front queues implement prioritization—URLs scoring higher for PageRank, historical change frequency, or domain authority rise to the top. Back queues enforce politeness—one queue per host, with timestamps tracking when each host can next be contacted. The Mercator architecture recommends 3× as many back queues as crawler threads to avoid contention.

Duplicate detection operates at two levels. URL-level: Bloom filters provide O(1) membership testing with configurable false-positive rates (1% at 10 bits/element). URLs are normalized before hashing to collapse equivalent forms. Content-level: SimHash generates 64-bit fingerprints where near-duplicate documents have fingerprints differing by ≤3 bits—Google uses this threshold for 8 billion pages.

Distributed crawling partitions URLs by host hash using consistent hashing. This co-locates all URLs for a host on one crawler node, enabling local politeness enforcement without coordination. Adding/removing nodes requires re-assigning only 1/N of the URL space. Common Crawl processes 2.3 billion pages/month across ~47 million hosts using this architecture.
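
To make the host-to-node assignment concrete, here is a minimal consistent-hashing sketch in TypeScript. The ring structure, the virtual-node count, and the FNV-1a hash are illustrative assumptions rather than part of any specific crawler.

// Minimal consistent-hash ring: maps each host to one crawler node so that
// adding or removing a node only remaps roughly 1/N of the hosts.
function fnv1a(s: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h >>> 0
}

class ConsistentHashRing {
  private ring: Array<{ point: number; node: string }> = []

  constructor(nodes: string[], private virtualNodes = 100) {
    nodes.forEach((n) => this.addNode(n))
  }

  addNode(node: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node })
    }
    this.ring.sort((a, b) => a.point - b.point)
  }

  removeNode(node: string): void {
    this.ring = this.ring.filter((e) => e.node !== node)
  }

  // All URLs for the same host hash to the same node → local politeness.
  // (Linear scan for brevity; use binary search on the sorted ring in practice.)
  getNode(host: string): string {
    const h = fnv1a(host)
    for (const entry of this.ring) {
      if (entry.point >= h) return entry.node
    }
    return this.ring[0].node // wrap around the ring
  }
}

// Usage: route by host so one node owns all of example.com
const ring = new ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
console.log(ring.getNode("example.com"))

Because routing depends only on the host, every URL for a given host lands on the same node, which is what makes local politeness enforcement possible.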

Feature                           | Priority | Scope
URL discovery from seed list      | Core     | Full
HTTP/HTTPS content fetching       | Core     | Full
robots.txt compliance             | Core     | Full
HTML parsing and link extraction  | Core     | Full
URL deduplication                 | Core     | Full
Content deduplication             | Core     | Full
Distributed crawling coordination | Core     | Full
Recrawl scheduling for freshness  | Core     | Full
JavaScript rendering              | High     | Full
Sitemap processing                | High     | Full
Content storage and indexing      | High     | Overview
Image/media crawling              | Medium   | Brief
Deep web / form submission        | Low      | Out of scope

Requirement             | Target                        | Rationale
Crawl throughput        | 1,000+ pages/second per node  | Match search engine scale
Politeness              | ≤1 request/second per host    | Avoid overloading targets
URL dedup accuracy      | 0% false negatives            | Never miss seen URLs
Content dedup threshold | 3-bit SimHash distance        | Catch near-duplicates
robots.txt freshness    | Cache ≤24 hours               | Per RFC 9309 recommendation
Fault tolerance         | No data loss on node failure  | Critical for multi-day crawls
Horizontal scalability  | Linear throughput scaling     | Add nodes to increase capacity

Web Scale:

  • Total websites: ~1.4 billion (206 million active)
  • Google’s index: ~400 billion documents
  • Common Crawl monthly: ~2.3 billion pages, ~400 TiB uncompressed

Target Crawl:

  • Goal: 1 billion pages in 30 days
  • Daily pages: 33.3 million = ~385 pages/second sustained
  • With peak headroom (3×): ~1,200 pages/second

Traffic per node:

  • Target: 100 pages/second per node
  • Required nodes: 12 (sustained), 15 (with buffer)
  • Network per node: 100 pages/sec × 500KB avg = 50 MB/s = 400 Mbps

Storage:

  • Raw HTML: 1B pages × 500KB avg = 500 TB
  • Compressed (gzip ~10×): 50 TB
  • URL database: 1B URLs × 200 bytes = 200 GB
  • SimHash fingerprints: 1B × 8 bytes = 8 GB
  • Bloom filter (1% FP): 1B URLs × 10 bits = 1.25 GB

DNS:

  • Unique hosts: ~50 million (Common Crawl average)
  • DNS queries: 50M × 1.2 (retries) = 60M
  • With caching (1-hour TTL): ~6M active lookups
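
A quick sanity check of the arithmetic above, under the same assumptions (500 KB average page, gzip at roughly 10×, 3× peak headroom, 100 pages/second per node):

// Back-of-envelope sizing for 1B pages in 30 days
const pages = 1e9
const days = 30
const avgPageKB = 500

const sustainedPps = pages / (days * 86_400)      // ≈ 385 pages/second
const peakPps = sustainedPps * 3                  // ≈ 1,157 → plan for ~1,200
const nodes = Math.ceil(peakPps / 100)            // 12 nodes at 100 pages/second each
const perNodeMbps = (100 * avgPageKB * 8) / 1000  // 400 Mbps per node

const rawTB = (pages * avgPageKB) / 1e9           // 500 TB raw HTML
const compressedTB = rawTB / 10                   // ~50 TB gzipped
const bloomGB = (pages * 10) / 8 / 1e9            // 1.25 GB at 10 bits per URL

console.log({ sustainedPps, peakPps, nodes, perNodeMbps, rawTB, compressedTB, bloomGB })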

Path A: Centralized Coordinator

Best when:

  • Small to medium scale (< 100M pages)
  • Single datacenter deployment
  • Strong consistency requirements

Architecture: A central coordinator maintains the entire URL frontier in memory/database. Crawler workers pull batches of URLs from the coordinator, fetch content, and report results back.

Key characteristics:

  • Single point of truth for URL state
  • Global priority ordering
  • Centralized politeness enforcement

Trade-offs:

  • Simple architecture, easy to reason about
  • Global priority optimization
  • Consistent deduplication
  • Coordinator becomes bottleneck
  • Single point of failure
  • Does not scale beyond ~10 nodes

Real-world example: Scrapy Cluster uses Redis as a centralized frontier for coordinated distributed crawling up to millions of pages.

Path B: Fully Distributed

Best when:

  • Web-scale crawling (billions of pages)
  • Multi-datacenter deployment
  • Cannot tolerate a single point of failure

Architecture: No central coordinator. URLs are assigned to nodes via consistent hashing on host. Each node maintains its own frontier for assigned hosts. Nodes communicate only for handoff during rebalancing.

Key characteristics:

  • Hash-based URL partitioning
  • Local frontier per node
  • Peer-to-peer coordination

Trade-offs:

  • Linear horizontal scaling
  • No single point of failure
  • Local politeness (no coordination needed)
  • Global priority ordering is approximate
  • Complex rebalancing on node join/leave
  • Harder to debug and monitor

Real-world example: UbiCrawler pioneered fully distributed crawling based on consistent hashing; Common Crawl uses similar host-based partitioning at scale.

Path C: Hybrid (Kafka-Coordinated)

Best when:

  • Large scale with operational simplicity
  • Need both global coordination and distributed execution
  • Flexible deployment (cloud or on-premise)

Architecture: URL frontier is distributed across nodes, but coordination happens through a message queue (Kafka). Each node owns a partition of the URL space (by host hash). New URLs are published to Kafka topics; consumers (crawler nodes) pull URLs for their assigned partitions.

Key characteristics:

  • Message queue for durable URL distribution
  • Hash-partitioned topics (one partition per host range)
  • Local frontier within each node for politeness
  • Separate fetcher and parser processes

Trade-offs:

  • Kafka provides durability and replayability
  • Easy to add/remove nodes (Kafka rebalancing)
  • Clear separation of concerns
  • Built-in backpressure handling
  • Additional operational complexity (Kafka cluster)
  • Some latency from message queue
  • Requires careful partition design

Factor               | Path A (Centralized) | Path B (Distributed) | Path C (Hybrid)
Scale limit          | ~10 nodes            | Unlimited            | ~1000 nodes
Complexity           | Low                  | High                 | Medium
Fault tolerance      | Low                  | High                 | High
Global optimization  | Perfect              | Approximate          | Good
Operational overhead | Low                  | High                 | Medium
Best for             | Prototype/small      | Web-scale            | Production

This article implements Path C (Hybrid) because it balances operational simplicity with scale. Kafka’s consumer groups handle node failures and rebalancing automatically, while hash-based partitioning enables local politeness enforcement without coordination overhead.
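
A sketch of what a crawler node's consumption loop might look like with kafkajs. The topic name, consumer group id, broker address, and message shape are assumptions for illustration; keying messages by host lets Kafka's keyed partitioner do the host-hash routing described above.

import { Kafka } from "kafkajs"

// Assumed local frontier instance (the dual-queue frontier described below)
declare const localFrontier: { addURL(entry: unknown): Promise<void> }

// Each crawler node joins the same consumer group; Kafka assigns it a subset of
// the host-keyed partitions, so the node effectively "owns" those hosts.
const kafka = new Kafka({ clientId: "crawler-node-1", brokers: ["kafka-1:9092"] })
const consumer = kafka.consumer({ groupId: "crawler-nodes" })
const producer = kafka.producer()

async function run(): Promise<void> {
  await consumer.connect()
  await producer.connect()
  await consumer.subscribe({ topics: ["frontier-urls"], fromBeginning: false })

  await consumer.run({
    eachMessage: async ({ message }) => {
      // { url, priority, host, depth, ... } produced by the frontier service
      const entry = JSON.parse(message.value!.toString())
      await localFrontier.addURL(entry) // politeness is enforced locally per host
    },
  })
}

// Newly discovered URLs are published back keyed by host, so they route to
// whichever node owns that host's partition.
async function publishDiscovered(urls: Array<{ url: string; host: string; priority: number }>): Promise<void> {
  await producer.send({
    topic: "frontier-urls",
    messages: urls.map((u) => ({ key: u.host, value: JSON.stringify(u) })),
  })
}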

Manages the queue of URLs to crawl with dual-queue architecture:

Front Queues (Priority):

  • F queues ranked by priority (typically F=10-20)
  • Priority factors: PageRank, domain authority, change frequency, depth from seed
  • URLs are assigned to front queues based on computed priority score

Back Queues (Politeness):

  • B queues, one per host (approximately)
  • Each queue tracks next_allowed_time for rate limiting
  • Mercator recommendation: B = 3 × number of crawler threads

Queue Selection:

  1. Select a non-empty front queue (weighted by priority)
  2. From that queue, select a URL whose back queue is ready (next_allowed_time ≤ now)
  3. After fetch, update back queue’s next_allowed_time

High-performance DNS resolution with aggressive caching:

  • Local cache: LRU cache with TTL-aware expiration
  • Batch resolution: Prefetch DNS for queued URLs
  • Multiple resolvers: Parallel queries to multiple DNS servers
  • Negative caching: Cache NXDOMAIN responses (shorter TTL)

Performance target: DNS should not be on the critical path. Pre-resolve while URL is queued.
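
A minimal TTL-aware DNS cache sketch using Node's dns.promises. It omits LRU eviction and batch prefetch scheduling; the TTL values shown are assumptions.

import { promises as dns } from "dns"

interface CacheEntry {
  addresses: string[]
  expiresAt: number
}

// TTL-aware DNS cache with negative caching for failed lookups.
class DNSCache {
  private cache = new Map<string, CacheEntry>()

  constructor(
    private positiveTtlMs = 60 * 60 * 1000, // 1 hour
    private negativeTtlMs = 5 * 60 * 1000,  // shorter TTL for failures
  ) {}

  async resolve(host: string): Promise<string[]> {
    const cached = this.cache.get(host)
    if (cached && cached.expiresAt > Date.now()) return cached.addresses

    try {
      const addresses = await dns.resolve4(host)
      this.cache.set(host, { addresses, expiresAt: Date.now() + this.positiveTtlMs })
      return addresses
    } catch (err) {
      // Negative caching: remember the failure briefly so we don't hammer DNS
      this.cache.set(host, { addresses: [], expiresAt: Date.now() + this.negativeTtlMs })
      throw err
    }
  }

  // Pre-resolve hosts for queued URLs so DNS is off the critical fetch path
  prefetch(hosts: string[]): void {
    for (const host of hosts) void this.resolve(host).catch(() => {})
  }
}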

HTTP client optimized for crawling:

  • Connection pooling: Reuse connections per host
  • HTTP/2: Single connection with multiplexed streams (where supported)
  • Compression: Accept gzip/brotli, decompress on receipt
  • Timeout handling: Connect timeout (10s), read timeout (30s), total timeout (60s)
  • Redirect following: Up to 5 redirects (per RFC 9309 for robots.txt)
  • User-Agent: Identify as crawler with contact info
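
A sketch of these fetch settings using the fetch API built into Node 18+. The user-agent string is a placeholder, and the per-phase connect/read timeouts are collapsed into a single total timeout for brevity.

// Simplified crawl fetch: single total timeout, transparent gzip/brotli
// decompression (negotiated automatically by fetch), capped redirects, and a
// crawler User-Agent that identifies us and gives a contact address.
const CRAWLER_UA = "ExampleCrawler/1.0 (+https://example.com/bot; crawler@example.com)"

async function fetchPage(
  url: string,
  totalTimeoutMs = 60_000,
): Promise<{ status: number; body: string; finalUrl: string }> {
  const response = await fetch(url, {
    redirect: "follow", // the fetch spec caps this at 20 hops; enforce a 5-hop policy in the caller if needed
    signal: AbortSignal.timeout(totalTimeoutMs),
    headers: {
      "User-Agent": CRAWLER_UA,
      Accept: "text/html,application/xhtml+xml",
    },
  })
  const body = await response.text()
  return { status: response.status, body, finalUrl: response.url }
}

// Usage: const { status, body } = await fetchPage("https://example.com/")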

Caches and enforces robots.txt rules:

  • Cache duration: Up to 24 hours (RFC 9309 recommendation)
  • Parsing: Handle malformed files gracefully
  • Rules: Allow/Disallow with longest-match precedence
  • Size limit: Parse at least first 500 KiB (RFC 9309 minimum)
  • Error handling: 4xx = unrestricted, 5xx = assume complete disallow
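
The status-code handling above might look like the following sketch. parseRobotsTxt is the parser shown later in this article; the cache store interface here is an assumption.

interface RobotsDecision {
  rules: Array<{ path: string; allow: boolean }>
  crawlDelay?: number
  sitemaps: string[]
}

// Assumed helpers: the RFC 9309 parser shown later in this article, and a
// TTL cache (e.g. Redis) keyed by host.
declare function parseRobotsTxt(content: string, userAgent: string): RobotsDecision
declare const robotsCache: {
  get(host: string): Promise<{ rules: RobotsDecision; expiresAt: number } | null>
  set(host: string, entry: { rules: RobotsDecision; expiresAt: number }): Promise<void>
}

const ROBOTS_TTL_MS = 24 * 60 * 60 * 1000 // cache up to 24 hours (RFC 9309)

async function getRobotsRules(host: string, userAgent: string): Promise<RobotsDecision> {
  const cached = await robotsCache.get(host)
  if (cached && cached.expiresAt > Date.now()) return cached.rules

  let rules: RobotsDecision
  try {
    const res = await fetch(`https://${host}/robots.txt`, { signal: AbortSignal.timeout(10_000) })
    if (res.ok) {
      const text = (await res.text()).slice(0, 500 * 1024) // parse at least the first 500 KiB
      rules = parseRobotsTxt(text, userAgent)
    } else if (res.status >= 400 && res.status < 500) {
      rules = { rules: [], sitemaps: [] } // 4xx: no robots.txt → unrestricted
    } else {
      rules = { rules: [{ path: "/", allow: false }], sitemaps: [] } // 5xx → assume complete disallow
    }
  } catch {
    rules = { rules: [{ path: "/", allow: false }], sitemaps: [] } // unreachable → assume disallow
  }

  await robotsCache.set(host, { rules, expiresAt: Date.now() + ROBOTS_TTL_MS })
  return rules
}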

Extracts content and links from fetched pages:

  • HTML parsing: DOM-based extraction (jsoup, lxml)
  • Link extraction: <a href>, <link>, <script src>, <img src>
  • URL normalization: Resolve relative URLs, canonicalize
  • Content extraction: Title, meta description, body text
  • Metadata: Content-Type, charset, Last-Modified
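
A sketch of link extraction with relative-URL resolution. A regex over href attributes keeps the example short; a production extractor would walk the parsed DOM (jsoup, lxml, cheerio) and also collect <link>, <script src>, and <img src>. normalizeURL is the function shown later in this article.

// Extract <a href> links from fetched HTML and resolve them against the page URL.
declare function normalizeURL(url: string): string // shown later in this article

function extractLinks(html: string, baseUrl: string): string[] {
  const links = new Set<string>()
  const hrefPattern = /<a\s[^>]*href\s*=\s*["']([^"'#]+)["']/gi

  let match: RegExpExecArray | null
  while ((match = hrefPattern.exec(html)) !== null) {
    const raw = match[1].trim()
    if (raw.startsWith("javascript:") || raw.startsWith("mailto:")) continue
    try {
      const absolute = new URL(raw, baseUrl) // resolves relative URLs against the page
      if (absolute.protocol === "http:" || absolute.protocol === "https:") {
        links.add(normalizeURL(absolute.toString()))
      }
    } catch {
      // Ignore malformed hrefs
    }
  }
  return [...links]
}

// Example: extractLinks('<a href="/about">About</a>', "https://example.com/")
// resolves the relative href to https://example.com/about before normalization.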

Renders JavaScript-heavy pages (separate from main fetcher):

  • Headless Chrome/Puppeteer: Full browser rendering
  • Render queue: Separate queue for JS-required pages
  • Resource limits: Memory/CPU limits per render
  • Timeout: 30-second render timeout
  • Detection: Heuristics to identify JS-dependent pages
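
A Puppeteer-based sketch of the render step. The networkidle wait and 30-second timeout mirror the limits above; blocking images, fonts, and media is an added assumption to keep renders cheap.

import puppeteer from "puppeteer"

// Render a JS-dependent page and return the post-render HTML.
async function renderPage(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()

    // Skip heavy resources we don't need for text/link extraction
    await page.setRequestInterception(true)
    page.on("request", (req) => {
      const type = req.resourceType()
      if (type === "image" || type === "font" || type === "media") req.abort()
      else req.continue()
    })

    await page.goto(url, { waitUntil: "networkidle2", timeout: 30_000 })
    return await page.content() // serialized DOM after JS execution
  } finally {
    await browser.close()
  }
}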

Prevents redundant crawling:

URL Deduplication:

  • Bloom filter for fast negative checks
  • Persistent URL store for seen URLs
  • URL normalization before hashing

Content Deduplication:

  • SimHash fingerprinting (64-bit)
  • Threshold: Hamming distance ≤ 3 bits = duplicate
  • Index fingerprints for fast lookup

[Sequence diagram: crawl loop] The Frontier selects a URL (priority + politeness) and the DNS Cache resolves the hostname (cached or fresh). The robots.txt Cache is consulted: disallowed URLs are marked blocked; allowed URLs go to the HTTP Fetcher. On success (2xx) the Parser extracts content and links; on redirect (3xx) the redirect target is checked and added to the frontier; on error (4xx/5xx) the failure is logged and a retry is scheduled with backoff. The Dedup Service checks the content fingerprint (new content and its fingerprint are stored; duplicates are stored as a reference only) and checks each extracted link against seen URLs, adding new links to the frontier with a priority and marking them seen. Finally, the politeness timer for the host is updated.

[Diagram: Kafka-based URL distribution] Seed URLs and newly discovered URLs are hashed by host into Kafka topic partitions (url-partition-0 through url-partition-N). Crawler nodes form a consumer group, each owning a range of host partitions (Crawler 1: hosts 0-999, Crawler 2: hosts 1000-1999, Crawler 3: hosts 2000-2999, and so on), and each node publishes the URLs it discovers back to the topic.

Endpoint: POST /api/v1/frontier/urls

// Headers
Authorization: Bearer {internal_token}
Content-Type: application/json

// Request body
{
  "urls": [
    {
      "url": "https://example.com/page1",
      "priority": 0.85,
      "source_url": "https://example.com/",
      "depth": 1,
      "discovered_at": "2024-01-15T10:30:00Z"
    },
    {
      "url": "https://example.com/page2",
      "priority": 0.72,
      "source_url": "https://example.com/",
      "depth": 1,
      "discovered_at": "2024-01-15T10:30:00Z"
    }
  ],
  "crawl_id": "crawl-2024-01-15"
}

Response (202 Accepted):

{
  "accepted": 2,
  "rejected": 0,
  "duplicates": 0
}

Design Decision: Batch Submission

URLs are submitted in batches (100-1000) rather than individually to amortize Kafka producer overhead and reduce network round-trips. The frontier service handles deduplication and prioritization asynchronously.

Endpoint: GET /api/v1/frontier/next

Query Parameters:

Parameter  | Type    | Description
worker_id  | string  | Unique identifier for the crawler worker
batch_size | integer | Number of URLs to fetch (default: 100, max: 1000)
timeout_ms | integer | Long-poll timeout if no URLs ready (default: 5000)

Response (200 OK):

{
  "urls": [
    {
      "url": "https://example.com/page1",
      "priority": 0.85,
      "host": "example.com",
      "ip": "93.184.216.34",
      "robots_rules": {
        "allowed": true,
        "crawl_delay": null
      }
    },
    {
      "url": "https://other.com/page2",
      "priority": 0.72,
      "host": "other.com",
      "ip": "203.0.113.50",
      "robots_rules": {
        "allowed": true,
        "crawl_delay": 2
      }
    }
  ],
  "lease_id": "lease-abc123",
  "lease_expires_at": "2024-01-15T10:35:00Z"
}

Design Decision: URL Leasing

URLs are leased to workers with a timeout (5 minutes default). If the worker doesn’t report completion (success or failure) before the lease expires, the URL is returned to the frontier for reassignment. This handles worker crashes without losing URLs.
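
A sketch of lease tracking on the frontier side. It is in-memory for illustration; a production frontier would persist leases (or rely on Kafka consumer offsets) so they survive a frontier restart.

import { randomUUID } from "crypto"

interface Lease {
  leaseId: string
  urls: string[]
  expiresAt: number
}

// Track outstanding URL leases and return URLs from expired leases to the frontier.
class LeaseManager {
  private leases = new Map<string, Lease>()

  constructor(
    private requeue: (urls: string[]) => void, // called with URLs from expired leases
    private leaseTtlMs = 5 * 60 * 1000,        // 5-minute default, as above
  ) {
    setInterval(() => this.reclaimExpired(), 30_000) // sweep every 30 seconds
  }

  // Hand a batch of URLs to a worker under a new lease
  grant(urls: string[]): Lease {
    const lease: Lease = {
      leaseId: `lease-${randomUUID()}`,
      urls,
      expiresAt: Date.now() + this.leaseTtlMs,
    }
    this.leases.set(lease.leaseId, lease)
    return lease
  }

  // Worker reported results before expiry: the lease is fulfilled
  complete(leaseId: string): void {
    this.leases.delete(leaseId)
  }

  // Worker crashed or stalled: put its URLs back into the frontier
  private reclaimExpired(): void {
    const now = Date.now()
    for (const [id, lease] of this.leases) {
      if (lease.expiresAt <= now) {
        this.requeue(lease.urls)
        this.leases.delete(id)
      }
    }
  }
}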

Endpoint: POST /api/v1/frontier/results

// Headers
Authorization: Bearer {internal_token}
Content-Type: application/json

// Request body
{
  "lease_id": "lease-abc123",
  "results": [
    {
      "url": "https://example.com/page1",
      "status": "success",
      "http_status": 200,
      "content_type": "text/html",
      "content_length": 45230,
      "content_hash": "a1b2c3d4e5f6...",
      "simhash": "0x1234567890abcdef",
      "fetch_time_ms": 234,
      "links_found": 47,
      "is_duplicate": false
    },
    {
      "url": "https://other.com/page2",
      "status": "error",
      "http_status": 503,
      "error": "Service Unavailable",
      "retry_after": 300
    }
  ]
}

Response (200 OK):

{
  "processed": 2,
  "retries_scheduled": 1
}

Endpoint: GET /api/v1/stats

{
  "crawl_id": "crawl-2024-01-15",
  "started_at": "2024-01-15T00:00:00Z",
  "runtime_hours": 10.5,
  "urls": {
    "discovered": 15234567,
    "queued": 8234123,
    "fetched": 7000444,
    "successful": 6543210,
    "failed": 457234,
    "duplicate_content": 234567,
    "blocked_robots": 123456
  },
  "throughput": {
    "current_pages_per_second": 185.3,
    "avg_pages_per_second": 173.2,
    "peak_pages_per_second": 312.5
  },
  "hosts": {
    "total_discovered": 1234567,
    "active_in_queue": 456789,
    "blocked_by_robots": 12345
  },
  "storage": {
    "content_stored_gb": 2345.6,
    "urls_stored_millions": 15.2
  },
  "workers": {
    "active": 12,
    "idle": 3,
    "failed": 0
  }
}

Primary Store: PostgreSQL for URL metadata, Kafka for frontier queue

-- URL state tracking
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    url_hash BYTEA NOT NULL,            -- SHA-256 of normalized URL (32 bytes)
    url TEXT NOT NULL,
    normalized_url TEXT NOT NULL,

    -- Discovery metadata
    host VARCHAR(255) NOT NULL,
    host_hash INT NOT NULL,             -- For partition routing
    depth SMALLINT DEFAULT 0,
    source_url_id BIGINT REFERENCES urls(id),
    discovered_at TIMESTAMPTZ DEFAULT NOW(),

    -- Priority and scheduling
    priority REAL DEFAULT 0.5,
    last_fetched_at TIMESTAMPTZ,
    next_fetch_at TIMESTAMPTZ,
    fetch_count INT DEFAULT 0,

    -- Fetch results
    last_status_code SMALLINT,
    last_content_hash BYTEA,            -- For change detection
    last_simhash BIGINT,                -- Content fingerprint
    content_length INT,
    content_type VARCHAR(100),

    -- State
    state VARCHAR(20) DEFAULT 'pending',
    -- pending, queued, fetching, completed, failed, blocked

    CONSTRAINT url_hash_unique UNIQUE (url_hash)
);

-- Indexes for common access patterns
CREATE INDEX idx_urls_host ON urls(host_hash, host);
CREATE INDEX idx_urls_state ON urls(state, next_fetch_at)
    WHERE state IN ('pending', 'queued');
CREATE INDEX idx_urls_refetch ON urls(next_fetch_at)
    WHERE state = 'completed' AND next_fetch_at IS NOT NULL;

Design Decision: URL Hash as Primary Key for Dedup

Using SHA-256 hash of the normalized URL as the unique constraint instead of the raw URL:

  • Fixed size (32 bytes) regardless of URL length
  • Fast equality comparisons
  • Bloom filter uses same hash
  • URLs up to 2000+ characters don’t bloat indexes
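
Computing that key is straightforward (a sketch; normalizeURL is the normalization function shown later in this article):

import { createHash } from "crypto"

declare function normalizeURL(url: string): string // shown later in this article

// SHA-256 of the normalized URL: a fixed 32-byte key used for the UNIQUE
// constraint above and for Bloom-filter membership checks.
function urlHash(url: string): Buffer {
  return createHash("sha256").update(normalizeURL(url)).digest()
}

// Example: "HTTPS://Example.com/index.html" and "https://example.com/"
// normalize to the same string, so they produce the same 32-byte key.
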
-- robots.txt cache
CREATE TABLE robots_cache (
    host VARCHAR(255) PRIMARY KEY,
    fetched_at TIMESTAMPTZ NOT NULL,
    expires_at TIMESTAMPTZ NOT NULL,
    status_code SMALLINT NOT NULL,
    content TEXT,           -- Raw robots.txt content
    parsed_rules JSONB,     -- Pre-parsed rules for fast lookup
    crawl_delay INT,        -- Crawl-delay directive (if present)
    sitemaps TEXT[],        -- Sitemap URLs discovered
    error_message TEXT
);

-- Example parsed_rules structure:
-- {
--   "user_agents": {
--     "*": {
--       "allow": ["/public/", "/api/"],
--       "disallow": ["/admin/", "/private/"]
--     },
--     "googlebot": {
--       "allow": ["/"],
--       "disallow": ["/no-google/"]
--     }
--   }
-- }

-- Expiry scans use a plain index on expires_at (a partial index with NOW()
-- in its predicate is not allowed because the predicate must be immutable).
CREATE INDEX idx_robots_expires ON robots_cache(expires_at);
-- Page content archive (separate from URL metadata)
CREATE TABLE page_content (
    id BIGSERIAL PRIMARY KEY,
    url_id BIGINT NOT NULL REFERENCES urls(id),
    fetch_id BIGINT NOT NULL,           -- Links to fetch attempt

    -- Content
    raw_content BYTEA,                  -- Compressed HTML/content
    compression VARCHAR(10) DEFAULT 'gzip',
    content_length INT NOT NULL,
    content_type VARCHAR(100),
    charset VARCHAR(50),

    -- Extracted data
    title TEXT,
    meta_description TEXT,
    canonical_url TEXT,
    language VARCHAR(10),

    -- Fingerprints
    content_hash BYTEA NOT NULL,        -- SHA-256 for exact match
    simhash BIGINT NOT NULL,            -- SimHash for near-duplicate

    -- Timestamps
    fetched_at TIMESTAMPTZ NOT NULL,
    http_date TIMESTAMPTZ,              -- From HTTP Date header
    last_modified TIMESTAMPTZ           -- From Last-Modified header
);

-- Content deduplication index
CREATE INDEX idx_content_simhash ON page_content(simhash);
CREATE INDEX idx_content_hash ON page_content(content_hash);

-- Partition by fetch date for efficient archival
-- CREATE TABLE page_content_2024_01 PARTITION OF page_content
--     FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

Data Type            | Store                | Rationale
URL frontier queue   | Kafka                | Durable, partitioned, consumer groups
URL metadata         | PostgreSQL           | Complex queries, ACID, indexes
Seen URL filter      | Redis Bloom          | Fast membership test, O(1)
robots.txt cache     | Redis                | TTL support, fast lookup
Page content         | Object Storage (S3)  | Large blobs, cheap, archival
Content fingerprints | PostgreSQL + Redis   | Persistent + fast lookup
Crawl metrics        | InfluxDB/TimescaleDB | Time-series queries
Web graph            | Neo4j / PostgreSQL   | Link structure analysis

URL Database Sharding:

  • Shard key: host_hash (hash of hostname)
  • Rationale: Co-locates all URLs for a host; queries filter by host

Kafka Topic Partitioning:

  • Partition key: host_hash
  • Number of partitions: 256 (allows scaling to 256 crawler nodes)
  • Rationale: Each crawler consumes specific partitions → owns specific hosts

Content Storage:

  • Partition by date (monthly tables)
  • Rationale: Old content can be archived/deleted without affecting recent data
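
A sketch of the routing rule that ties these choices together. The 256-partition count matches the Kafka configuration above; the MD5-based hash is an illustrative stand-in for whatever keyed partitioner the cluster actually uses (Kafka's default partitioner hashes the message key itself).

import { createHash } from "crypto"

const NUM_PARTITIONS = 256 // matches the Kafka topic configuration above

// host_hash drives co-location: the Kafka partition, the database shard, and
// (via the consumer group) which crawler node owns the host.
function hostHash(host: string): number {
  return createHash("md5").update(host.toLowerCase()).digest().readUInt32BE(0)
}

function kafkaPartitionFor(host: string): number {
  return hostHash(host) % NUM_PARTITIONS
}

// Every URL on example.com maps to the same partition, so a single crawler
// node enforces that host's politeness delay locally.
console.log(kafkaPartitionFor("example.com")) // deterministic value in [0, 255]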

The frontier is the heart of the crawler. It must balance priority, politeness, and throughput.

// Frontier configuration
interface FrontierConfig {
  numFrontQueues: number          // F = number of priority levels (10-20)
  numBackQueues: number           // B = 3 × crawler threads
  defaultPolitenessDelay: number  // milliseconds between requests to same host
  maxQueueSize: number            // per back queue
}

interface URLEntry {
  url: string
  normalizedUrl: string
  host: string
  priority: number                // 0.0 to 1.0
  depth: number
  discoveredAt: Date
}

interface BackQueueEntry {
  host: string
  nextAllowedTime: number         // epoch ms when this host may be contacted again
}

interface RobotsCache {
  getCrawlDelay(host: string): number | null
}

// Assumed utility classes (implementations not shown)
declare class PriorityQueue<T> {
  constructor(compare: (a: T, b: T) => number)
  enqueue(item: T): void
  findAndRemove(predicate: (item: T) => boolean): T | null
}
declare class MinHeap<T> {
  constructor(compare: (a: T, b: T) => number)
  size(): number
  peek(): T
  insert(item: T): void
  extractMin(): T
}
declare class HostQueue {
  size(): number
  enqueue(item: URLEntry): void
  dequeue(): URLEntry
}

class URLFrontier {
  private frontQueues: PriorityQueue<URLEntry>[]    // F queues
  private backQueues: Map<string, HostQueue>        // Host → queue
  private hostToBackQueue: Map<string, number>      // Host → back queue index (unused in this sketch)
  private backQueueHeap: MinHeap<BackQueueEntry>    // Sorted by next_allowed_time

  constructor(
    private config: FrontierConfig,
    private robotsCache: RobotsCache,
  ) {
    this.frontQueues = Array(config.numFrontQueues)
      .fill(null)
      .map(() => new PriorityQueue<URLEntry>((a, b) => b.priority - a.priority))
    this.backQueues = new Map()
    this.hostToBackQueue = new Map()
    this.backQueueHeap = new MinHeap<BackQueueEntry>((a, b) => a.nextAllowedTime - b.nextAllowedTime)
  }

  // Add URL to frontier
  async addURL(entry: URLEntry): Promise<void> {
    // Assign to front queue based on priority
    const frontQueueIndex = Math.floor(entry.priority * (this.config.numFrontQueues - 1))
    this.frontQueues[frontQueueIndex].enqueue(entry)
    // Ensure back queue exists for this host and keep it topped up
    this.ensureBackQueue(entry.host)
    this.refillBackQueue(entry.host)
  }

  // Create the per-host back queue and heap entry the first time a host is seen
  private ensureBackQueue(host: string): void {
    if (!this.backQueues.has(host)) {
      this.backQueues.set(host, new HostQueue())
      this.backQueueHeap.insert({ host, nextAllowedTime: Date.now() })
    }
  }

  // Get next URL to fetch (respects politeness)
  async getNext(): Promise<URLEntry | null> {
    // Find a back queue that's ready (next_allowed_time <= now)
    const now = Date.now()
    while (this.backQueueHeap.size() > 0) {
      const topBackQueue = this.backQueueHeap.peek()
      if (topBackQueue.nextAllowedTime > now) {
        // No queues ready yet; caller can back off and retry
        return null
      }
      const backQueue = this.backQueues.get(topBackQueue.host)
      if (backQueue && backQueue.size() > 0) {
        // Pop URL from this back queue
        const url = backQueue.dequeue()
        // Update next_allowed_time
        this.backQueueHeap.extractMin()
        topBackQueue.nextAllowedTime = now + this.getPolitenessDelay(topBackQueue.host)
        this.backQueueHeap.insert(topBackQueue)
        // Refill back queue from front queues if needed
        this.refillBackQueue(topBackQueue.host)
        return url
      }
      // Empty back queue: remove it; it is re-created when the host gets new URLs
      this.backQueueHeap.extractMin()
      this.backQueues.delete(topBackQueue.host)
    }
    return null
  }

  // Refill back queue from front queues (maintains priority)
  private refillBackQueue(host: string): void {
    const backQueue = this.backQueues.get(host)
    if (!backQueue || backQueue.size() >= 10) return // Keep 10 URLs buffered
    // Scan front queues for URLs matching this host
    for (const frontQueue of this.frontQueues) {
      const url = frontQueue.findAndRemove((e) => e.host === host)
      if (url) {
        backQueue.enqueue(url)
        if (backQueue.size() >= 10) break
      }
    }
  }

  private getPolitenessDelay(host: string): number {
    // Check robots.txt crawl-delay, fall back to default
    const robotsDelay = this.robotsCache.getCrawlDelay(host)
    return robotsDelay ?? this.config.defaultPolitenessDelay
  }
}
interface PriorityFactors {
pageRank: number // 0.0 to 1.0, from link analysis
domainAuthority: number // 0.0 to 1.0, domain-level quality
changeFrequency: number // 0.0 to 1.0, how often content changes
depth: number // Distance from seed URL
freshness: number // 0.0 to 1.0, based on last fetch age
}
function calculatePriority(factors: PriorityFactors): number {
// Weighted combination of factors
const weights = {
pageRank: 0.3,
domainAuthority: 0.25,
changeFrequency: 0.2,
depth: 0.15,
freshness: 0.1,
}
// Depth penalty: deeper pages get lower priority
const depthScore = Math.max(0, 1 - factors.depth * 0.1) // -10% per level
const priority =
weights.pageRank * factors.pageRank +
weights.domainAuthority * factors.domainAuthority +
weights.changeFrequency * factors.changeFrequency +
weights.depth * depthScore +
weights.freshness * factors.freshness
return Math.max(0, Math.min(1, priority)) // Clamp to [0, 1]
}
// Example: High-value news site homepage
const priority = calculatePriority({
pageRank: 0.9,
domainAuthority: 0.85,
changeFrequency: 0.95, // Changes frequently
depth: 0, // Seed URL
freshness: 0.1, // Hasn't been fetched recently
})
// priority ≈ 0.83 → lands in a high-priority front queue

URL normalization is critical—different URL strings can reference the same resource.

import { URL } from "url"
interface NormalizationOptions {
removeFragment: boolean // Remove #hash
removeDefaultPorts: boolean // Remove :80, :443
removeTrailingSlash: boolean // Remove trailing /
removeIndexFiles: boolean // Remove /index.html
sortQueryParams: boolean // Alphabetize query params
lowercaseHost: boolean // Lowercase hostname
removeWWW: boolean // Remove www. prefix (controversial)
decodeUnreserved: boolean // Decode safe percent-encoded chars
}
const DEFAULT_OPTIONS: NormalizationOptions = {
removeFragment: true,
removeDefaultPorts: true,
removeTrailingSlash: true,
removeIndexFiles: true,
sortQueryParams: true,
lowercaseHost: true,
removeWWW: false, // www.example.com and example.com may be different
decodeUnreserved: true,
}
function normalizeURL(urlString: string, options = DEFAULT_OPTIONS): string {
const url = new URL(urlString)
// Lowercase scheme and host
url.protocol = url.protocol.toLowerCase()
if (options.lowercaseHost) {
url.hostname = url.hostname.toLowerCase()
}
// Remove default ports
if (options.removeDefaultPorts) {
if ((url.protocol === "http:" && url.port === "80") || (url.protocol === "https:" && url.port === "443")) {
url.port = ""
}
}
// Remove fragment
if (options.removeFragment) {
url.hash = ""
}
// Sort query parameters
if (options.sortQueryParams && url.search) {
const params = new URLSearchParams(url.search)
const sorted = new URLSearchParams([...params.entries()].sort())
url.search = sorted.toString()
}
// Remove trailing slash (except for root)
if (options.removeTrailingSlash && url.pathname !== "/") {
url.pathname = url.pathname.replace(/\/+$/, "")
}
// Remove index files
if (options.removeIndexFiles) {
url.pathname = url.pathname.replace(/\/(index|default)\.(html?|php|asp)$/i, "/")
}
// Decode unreserved characters
if (options.decodeUnreserved) {
url.pathname = decodeURIComponent(url.pathname)
.split("/")
.map((segment) => encodeURIComponent(segment))
.join("/")
}
return url.toString()
}
// Examples:
// normalizeURL('HTTP://Example.COM:80/Page/')
// → 'http://example.com/Page' (host lowercased, default port and trailing slash removed; path case is preserved)
// normalizeURL('https://example.com/search?b=2&a=1')
// → 'https://example.com/search?a=1&b=2'
// normalizeURL('https://example.com/index.html')
// → 'https://example.com/'
import { createHash } from "crypto"
class BloomFilter {
private bitArray: Uint8Array
private numHashFunctions: number
private size: number
constructor(expectedElements: number, falsePositiveRate: number = 0.01) {
// Calculate optimal size and hash functions
// m = -n * ln(p) / (ln(2)^2)
// k = (m/n) * ln(2)
this.size = Math.ceil((-expectedElements * Math.log(falsePositiveRate)) / Math.log(2) ** 2)
this.numHashFunctions = Math.ceil((this.size / expectedElements) * Math.log(2))
this.bitArray = new Uint8Array(Math.ceil(this.size / 8))
console.log(`Bloom filter: ${this.size} bits, ${this.numHashFunctions} hash functions`)
// For 1B URLs at 1% FP: ~1.2 GB, 7 hash functions
}
private getHashes(item: string): number[] {
// Use double hashing: h(i) = h1 + i * h2
const h1 = this.hash(item, 0)
const h2 = this.hash(item, 1)
return Array(this.numHashFunctions)
.fill(0)
.map((_, i) => Math.abs((h1 + i * h2) % this.size))
}
private hash(item: string, seed: number): number {
const hash = createHash("md5").update(`${seed}:${item}`).digest()
return hash.readUInt32LE(0)
}
add(item: string): void {
for (const hash of this.getHashes(item)) {
const byteIndex = Math.floor(hash / 8)
const bitIndex = hash % 8
this.bitArray[byteIndex] |= 1 << bitIndex
}
}
mightContain(item: string): boolean {
for (const hash of this.getHashes(item)) {
const byteIndex = Math.floor(hash / 8)
const bitIndex = hash % 8
if ((this.bitArray[byteIndex] & (1 << bitIndex)) === 0) {
return false // Definitely not in set
}
}
return true // Probably in set (may be false positive)
}
}
// Usage
const seenURLs = new BloomFilter(1_000_000_000, 0.01)
function isURLSeen(url: string): boolean {
const normalized = normalizeURL(url)
if (!seenURLs.mightContain(normalized)) {
return false // Definitely not seen
}
// Bloom filter says "maybe"—check persistent store
return urlDatabase.exists(normalized)
}
function markURLSeen(url: string): void {
const normalized = normalizeURL(url)
seenURLs.add(normalized)
urlDatabase.insert(normalized)
}
import { createHash } from "crypto"
// SimHash generates a fingerprint where similar documents have similar hashes
// Hamming distance between fingerprints indicates similarity
function simhash(text: string, hashBits: number = 64): bigint {
// Tokenize into shingles (word n-grams)
const tokens = tokenize(text)
const shingles = generateShingles(tokens, 3) // 3-word shingles
// Initialize bit vector
const v = new Array(hashBits).fill(0)
for (const shingle of shingles) {
// Hash each shingle to get bit positions
const hash = hashShingle(shingle, hashBits)
// Update vector: +1 if bit is 1, -1 if bit is 0
for (let i = 0; i < hashBits; i++) {
if ((hash >> BigInt(i)) & 1n) {
v[i] += 1
} else {
v[i] -= 1
}
}
}
// Convert vector to fingerprint: 1 if positive, 0 if negative
let fingerprint = 0n
for (let i = 0; i < hashBits; i++) {
if (v[i] > 0) {
fingerprint |= 1n << BigInt(i)
}
}
return fingerprint
}
function tokenize(text: string): string[] {
return text
.toLowerCase()
.replace(/[^\w\s]/g, " ")
.split(/\s+/)
.filter((word) => word.length > 2)
}
function generateShingles(tokens: string[], n: number): string[] {
const shingles: string[] = []
for (let i = 0; i <= tokens.length - n; i++) {
shingles.push(tokens.slice(i, i + n).join(" "))
}
return shingles
}
function hashShingle(shingle: string, bits: number): bigint {
const hash = createHash("sha256").update(shingle).digest("hex")
return BigInt("0x" + hash.slice(0, bits / 4))
}
function hammingDistance(a: bigint, b: bigint): number {
let xor = a ^ b
let distance = 0
while (xor > 0n) {
distance += Number(xor & 1n)
xor >>= 1n
}
return distance
}
// Usage
const DUPLICATE_THRESHOLD = 3 // Max 3 bits different
function isNearDuplicate(newFingerprint: bigint, existingFingerprints: bigint[]): boolean {
for (const existing of existingFingerprints) {
if (hammingDistance(newFingerprint, existing) <= DUPLICATE_THRESHOLD) {
return true
}
}
return false
}
// Example
const doc1 = "The quick brown fox jumps over the lazy dog"
const doc2 = "The quick brown fox leaps over the lazy dog" // One word changed
const doc3 = "Completely different content about web crawlers"
const hash1 = simhash(doc1)
const hash2 = simhash(doc2)
const hash3 = simhash(doc3)
console.log(hammingDistance(hash1, hash2)) // Small (< 3), near-duplicate
console.log(hammingDistance(hash1, hash3)) // Large (> 20), different content
// RFC 9309 compliant robots.txt parser
interface RobotsRule {
  path: string
  allow: boolean
}

interface RobotsRules {
  rules: RobotsRule[]
  crawlDelay?: number
  sitemaps: string[]
}

function parseRobotsTxt(content: string, userAgent: string): RobotsRules {
  const lines = content.split("\n").map((l) => l.trim())
  const rules: RobotsRule[] = []
  let crawlDelay: number | undefined
  const sitemaps: string[] = []

  let inRelevantBlock = false   // Inside a group that matches our user agent?
  let groupHasRules = false     // Current group has already seen rule lines
  let relevantGroupDone = false // A matching group was captured (use the first match)

  for (const line of lines) {
    // Skip comments and empty lines
    if (line.startsWith("#") || line === "") continue
    const [directive, ...valueParts] = line.split(":")
    const value = valueParts.join(":").trim()

    switch (directive.toLowerCase()) {
      case "user-agent":
        // A user-agent line following rules starts a new group (RFC 9309);
        // consecutive user-agent lines accumulate into the same group.
        if (groupHasRules) {
          if (inRelevantBlock) relevantGroupDone = true
          inRelevantBlock = false
          groupHasRules = false
        }
        if (!relevantGroupDone && matchesUserAgent(value, userAgent)) {
          inRelevantBlock = true
        }
        break
      case "allow":
        groupHasRules = true
        if (inRelevantBlock) {
          rules.push({ path: value, allow: true })
        }
        break
      case "disallow":
        groupHasRules = true
        if (inRelevantBlock) {
          rules.push({ path: value, allow: false })
        }
        break
      case "crawl-delay":
        // Note: crawl-delay is NOT in RFC 9309, but commonly used
        groupHasRules = true
        if (inRelevantBlock) {
          crawlDelay = parseInt(value, 10)
        }
        break
      case "sitemap":
        // Sitemaps apply globally
        sitemaps.push(value)
        break
    }
  }

  // Sort rules by path length (longest match wins per RFC 9309)
  rules.sort((a, b) => b.path.length - a.path.length)
  return { rules, crawlDelay, sitemaps }
}

function matchesUserAgent(pattern: string, userAgent: string): boolean {
  const patternLower = pattern.toLowerCase()
  const uaLower = userAgent.toLowerCase()
  if (patternLower === "*") return true
  return uaLower.includes(patternLower)
}

function isAllowed(rules: RobotsRules, path: string): boolean {
  // RFC 9309: Longest matching path wins
  for (const rule of rules.rules) {
    if (pathMatches(rule.path, path)) {
      return rule.allow
    }
  }
  // Default: allowed if no rule matches
  return true
}

function pathMatches(pattern: string, path: string): boolean {
  // Handle wildcards (*) and end-of-path ($)
  if (pattern === "") return false // an empty Disallow value disallows nothing
  const anchored = pattern.endsWith("$")
  const body = anchored ? pattern.slice(0, -1) : pattern
  const regex =
    "^" +
    body
      .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // Escape special chars
      .replace(/\*/g, ".*") +                // * → .*
    (anchored ? "$" : "")
  return new RegExp(regex).test(path)
}

// Usage
const robotsTxt = `
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml
`

const rules = parseRobotsTxt(robotsTxt, "MyCrawler/1.0")
console.log(isAllowed(rules, "/admin/dashboard")) // false
console.log(isAllowed(rules, "/admin/public/")) // true (longest match wins)
console.log(isAllowed(rules, "/public/page")) // true (no matching rule)
// Detect and avoid spider traps (infinite URL spaces)
interface TrapDetectionConfig {
maxURLLength: number // URLs longer than this are suspicious
maxPathDepth: number // Path segments deeper than this
maxSimilarURLsPerHost: number // Too many similar patterns
patternThreshold: number // Repetition count before flagging
}
const DEFAULT_TRAP_CONFIG: TrapDetectionConfig = {
maxURLLength: 2000,
maxPathDepth: 15,
maxSimilarURLsPerHost: 10000,
patternThreshold: 3,
}
interface URLPattern {
host: string
pathSegments: string[]
queryParams: string[]
}
class SpiderTrapDetector {
private hostPatterns: Map<string, Map<string, number>> = new Map()
constructor(private config = DEFAULT_TRAP_CONFIG) {}
isTrap(url: string): { isTrap: boolean; reason?: string } {
try {
const parsed = new URL(url)
// Check 1: URL length
if (url.length > this.config.maxURLLength) {
return { isTrap: true, reason: `URL too long (${url.length} chars)` }
}
// Check 2: Path depth
const pathSegments = parsed.pathname.split("/").filter(Boolean)
if (pathSegments.length > this.config.maxPathDepth) {
return { isTrap: true, reason: `Path too deep (${pathSegments.length} segments)` }
}
// Check 3: Repeating patterns in path
const pattern = this.detectRepeatingPattern(pathSegments)
if (pattern) {
return { isTrap: true, reason: `Repeating pattern: ${pattern}` }
}
// Check 4: Calendar trap (dates extending into far future)
if (this.isCalendarTrap(parsed.pathname)) {
return { isTrap: true, reason: "Calendar trap (far future dates)" }
}
// Check 5: Session ID in URL
if (this.hasSessionID(parsed.search)) {
return { isTrap: true, reason: "Session ID in URL" }
}
return { isTrap: false }
} catch {
return { isTrap: true, reason: "Invalid URL" }
}
}
private detectRepeatingPattern(segments: string[]): string | null {
// Look for patterns like /a/b/a/b/a/b
for (let patternLength = 1; patternLength <= segments.length / 3; patternLength++) {
const pattern = segments.slice(0, patternLength).join("/")
let repetitions = 0
for (let i = 0; i <= segments.length - patternLength; i += patternLength) {
if (segments.slice(i, i + patternLength).join("/") === pattern) {
repetitions++
}
}
if (repetitions >= this.config.patternThreshold) {
return pattern
}
}
return null
}
private isCalendarTrap(pathname: string): boolean {
// Look for date patterns like /2024/01/15 extending far into future
const datePattern = /\/(\d{4})\/(\d{1,2})\/(\d{1,2})/
const match = pathname.match(datePattern)
if (match) {
const year = parseInt(match[1], 10)
const currentYear = new Date().getFullYear()
// Flag if date is more than 2 years in the future
if (year > currentYear + 2) {
return true
}
}
return false
}
private hasSessionID(search: string): boolean {
const sessionPatterns = [
/[?&](session_?id|sid|phpsessid|jsessionid|aspsessionid)=/i,
/[?&][a-z]+=[a-f0-9]{32,}/i, // Long hex strings often session IDs
]
return sessionPatterns.some((pattern) => pattern.test(search))
}
}

Problem: Operators need real-time visibility into crawl progress, errors, and throughput across distributed nodes.

Solution: Real-Time Metrics Dashboard

// Dashboard data structure
interface CrawlDashboardState {
summary: {
totalURLs: number
fetchedURLs: number
queuedURLs: number
errorCount: number
duplicateCount: number
}
throughput: {
current: number
average: number
history: Array<{ timestamp: number; value: number }>
}
workers: Array<{
id: string
status: "active" | "idle" | "error"
urlsProcessed: number
currentHost: string
lastActivity: Date
}>
topHosts: Array<{
host: string
urlsFetched: number
urlsQueued: number
errorRate: number
}>
errors: Array<{
timestamp: Date
url: string
error: string
count: number
}>
}
// Real-time updates via WebSocket
function useCrawlDashboard(): CrawlDashboardState {
const [state, setState] = useState<CrawlDashboardState>(initialState)
useEffect(() => {
const ws = new WebSocket("wss://crawler-api/dashboard/stream")
ws.onmessage = (event) => {
const update = JSON.parse(event.data)
switch (update.type) {
case "summary":
setState((prev) => ({ ...prev, summary: update.data }))
break
case "throughput":
setState((prev) => ({
...prev,
throughput: {
...prev.throughput,
current: update.data.current,
history: [...prev.throughput.history.slice(-59), update.data],
},
}))
break
case "worker_status":
setState((prev) => ({
...prev,
workers: updateWorker(prev.workers, update.data),
}))
break
}
}
return () => ws.close()
}, [])
return state
}

Key metrics to display:

  • URLs/second (current, average, peak)
  • Queue depth by priority level
  • Error rates by category (network, HTTP 4xx, 5xx, timeout)
  • Worker status and distribution
  • Top hosts by activity
  • Bandwidth consumption
// Interface for analyzing specific URLs or hosts
interface URLAnalysis {
url: string;
normalizedUrl: string;
status: 'pending' | 'fetched' | 'error' | 'blocked';
fetchHistory: Array<{
timestamp: Date;
statusCode: number;
responseTime: number;
contentHash: string;
}>;
robotsStatus: {
allowed: boolean;
rule: string;
crawlDelay?: number;
};
linkAnalysis: {
inboundLinks: number;
outboundLinks: number;
pageRank: number;
};
}
// Host-level analysis
interface HostAnalysis {
host: string;
urlsDiscovered: number;
urlsFetched: number;
urlsBlocked: number;
avgResponseTime: number;
errorRate: number;
robots: {
lastFetched: Date;
crawlDelay?: number;
blockedPaths: string[];
};
throughputHistory: Array<{ timestamp: Date; urlsPerMinute: number }>;
}
// Component for searching and analyzing URLs
function URLAnalyzer() {
const [searchQuery, setSearchQuery] = useState('');
const [analysis, setAnalysis] = useState<URLAnalysis | null>(null);
const analyzeURL = async (url: string) => {
const result = await fetch(`/api/v1/analysis/url?url=${encodeURIComponent(url)}`);
setAnalysis(await result.json());
};
return (
<div className="url-analyzer">
<input
type="text"
value={searchQuery}
onChange={e => setSearchQuery(e.target.value)}
placeholder="Enter URL or host..."
/>
<button onClick={() => analyzeURL(searchQuery)}>Analyze</button>
{analysis && (
<div className="analysis-results">
<h3>URL: {analysis.url}</h3>
<p>Status: {analysis.status}</p>
<p>Robots: {analysis.robotsStatus.allowed ? 'Allowed' : 'Blocked'}</p>
<p>PageRank: {analysis.linkAnalysis.pageRank.toFixed(4)}</p>
{/* Fetch history table, etc. */}
</div>
)}
</div>
);
}

Component       | Requirement                          | Options
Message Queue   | Durable, partitioned, high throughput | Kafka, Apache Pulsar, Redpanda
URL Database    | High write throughput, range queries | PostgreSQL, CockroachDB, Cassandra
Seen URL Filter | Fast membership test                 | Redis Bloom, Custom in-memory
Content Storage | Cheap, durable, archival             | S3-compatible (MinIO, SeaweedFS)
Coordination    | Leader election, config              | ZooKeeper, etcd, Consul
DNS Cache       | Low latency, high throughput         | PowerDNS, Unbound, Custom
Metrics         | Time-series, real-time               | InfluxDB, Prometheus, TimescaleDB

[Deployment diagram: AWS reference deployment] A VPC compute layer runs three Auto Scaling Groups: crawler EC2 instances (c6i.2xlarge), parser EC2 instances (c6i.xlarge), and renderer EC2 instances (c6i.4xlarge, headless Chrome). Amazon MSK provides the coordination layer (Kafka plus ZooKeeper). The data layer sits in private subnets: RDS PostgreSQL for URL metadata, ElastiCache Redis for the Bloom filter and caches, and S3 for the content archive. CloudWatch and a Grafana dashboard handle monitoring.

Component       | AWS Service        | Configuration
Message Queue   | Amazon MSK         | kafka.m5.2xlarge, 6 brokers, 256 partitions
Crawler Nodes   | EC2 c6i.2xlarge    | 8 vCPU, 16GB RAM, Auto Scaling 5-50
Parser Nodes    | EC2 c6i.xlarge     | 4 vCPU, 8GB RAM, Auto Scaling 2-20
JS Renderer     | EC2 c6i.4xlarge    | 16 vCPU, 32GB RAM, Spot instances
URL Database    | RDS PostgreSQL     | db.r6g.2xlarge, Multi-AZ, 2TB gp3
Bloom Filter    | ElastiCache Redis  | cache.r6g.xlarge, 4-node cluster
Content Storage | S3 Standard-IA     | Intelligent-Tiering after 30 days
Coordination    | MSK ZooKeeper      | Built into MSK

Managed Service | Self-Hosted          | When to Self-Host
Amazon MSK      | Apache Kafka         | Cost at scale, specific Kafka versions
RDS PostgreSQL  | PostgreSQL on EC2    | Cost, specific extensions
ElastiCache     | Redis Cluster on EC2 | Redis modules, cost
S3              | MinIO / SeaweedFS    | On-premise, data sovereignty

Resource                            | Monthly Cost
MSK (6 × m5.2xlarge)                | ~$3,600
Crawler EC2 (15 × c6i.2xlarge)      | ~$5,400
Parser EC2 (10 × c6i.xlarge)        | ~$1,800
Renderer EC2 Spot (5 × c6i.4xlarge) | ~$1,500
RDS PostgreSQL                      | ~$1,200
ElastiCache Redis                   | ~$800
S3 Storage (50TB compressed)        | ~$1,150
Data Transfer (egress ~10TB)        | ~$900
Total                               | ~$16,350/month

Cost per million pages: ~$16.35

This design prioritizes the hybrid architecture with Kafka for URL distribution—combining the operational simplicity of a message queue with the scalability of distributed processing. Host-based partitioning enables local politeness enforcement without coordination overhead.

Key architectural decisions:

  1. Dual-queue frontier (Mercator style): Front queues for priority, back queues for politeness. This separates concerns and enables efficient URL selection.

  2. Bloom filter + persistent store: Bloom filter provides O(1) “definitely not seen” checks. False positives fall through to the database—but these are rare (1% at 10 bits/element).

  3. SimHash for content deduplication: 64-bit fingerprints with 3-bit threshold catches near-duplicates without storing full content hashes for comparison.

  4. Kafka partitions by host hash: Each crawler node owns specific hosts, enabling local politeness without cross-node coordination.

Limitations and future improvements:

  • JavaScript rendering bottleneck: Headless Chrome is resource-intensive. Could implement speculative rendering (pre-render pages likely to need JS) or use faster alternatives (Playwright, Puppeteer-cluster).
  • Recrawl intelligence: Current design uses simple time-based recrawling. Could add ML-based change prediction to prioritize likely-changed pages.
  • Deep web crawling: Forms, authentication, and session handling are out of scope. Supporting them would require browser automation and raises ethical considerations.
  • Internationalization: URL normalization and content extraction assume primarily ASCII/UTF-8. Would need language detection and encoding handling for global crawls.

Prerequisites:

  • Distributed systems fundamentals (consistent hashing, message queues)
  • Database design (indexing, sharding, partitioning)
  • HTTP protocol basics (status codes, headers, compression)
  • Basic understanding of probabilistic data structures (Bloom filters)

Key terms:

  • URL Frontier: The data structure managing which URLs to crawl next, balancing priority and politeness
  • Politeness: Rate-limiting crawls to avoid overwhelming target servers
  • robots.txt: File at website root specifying crawl rules (RFC 9309)
  • SimHash: Locality-sensitive hashing algorithm producing fingerprints where similar documents have similar hashes
  • Bloom Filter: Probabilistic data structure for membership testing with no false negatives but possible false positives
  • Spider Trap: URL patterns that generate infinite unique URLs (calendars, session IDs, repetitive paths)
  • Crawl-delay: Non-standard robots.txt directive specifying seconds between requests
  • Back Queue: Per-host queue in the frontier that enforces politeness timing
  • Front Queue: Priority-ordered queue in the frontier that determines which URLs are important

Key takeaways:

  • Web crawlers use a dual-queue frontier: front queues for priority ranking, back queues for per-host politeness enforcement
  • URL deduplication combines Bloom filters (O(1) negative checks, 1% false positive) with persistent storage for definitive answers
  • Content deduplication uses SimHash—64-bit fingerprints where similar documents differ by ≤3 bits
  • Distributed crawling partitions URLs by host hash using consistent hashing; each node owns specific hosts for local politeness
  • robots.txt compliance is mandatory: cache 24 hours max, respect Allow/Disallow with longest-match precedence (RFC 9309)
  • Scale: Common Crawl processes 2.3B pages/month (~400 TiB) across ~47M hosts; Google indexes ~400B documents
