Design a Web Crawler
A comprehensive system design for a web-scale crawler that discovers, downloads, and indexes billions of pages. This design addresses URL frontier management with politeness constraints, distributed crawling at scale, duplicate detection, and freshness maintenance across petabytes of web content.
Abstract
Web crawlers solve three interconnected problems: frontier management (deciding which URLs to fetch next), politeness (avoiding overwhelming target servers), and deduplication (avoiding redundant downloads of seen URLs and near-duplicate content).
The URL frontier is a two-tier queue system. Front queues implement prioritization—URLs scoring higher on PageRank, historical change frequency, or domain authority rise to the top. Back queues enforce politeness—one queue per host, with timestamps tracking when each host can next be contacted. The Mercator architecture recommends 3× as many back queues as crawler threads to avoid contention.
Duplicate detection operates at two levels. URL-level: Bloom filters provide O(1) membership testing with configurable false-positive rates (1% at 10 bits/element). URLs are normalized before hashing to collapse equivalent forms. Content-level: SimHash generates 64-bit fingerprints where near-duplicate documents have fingerprints differing by ≤3 bits—Google uses this threshold for 8 billion pages.
Distributed crawling partitions URLs by host hash using consistent hashing. This co-locates all URLs for a host on one crawler node, enabling local politeness enforcement without coordination. Adding/removing nodes requires re-assigning only 1/N of the URL space. Common Crawl processes 2.3 billion pages/month across ~47 million hosts using this architecture.
Requirements
Functional Requirements
| Feature | Priority | Scope |
|---|---|---|
| URL discovery from seed list | Core | Full |
| HTTP/HTTPS content fetching | Core | Full |
| robots.txt compliance | Core | Full |
| HTML parsing and link extraction | Core | Full |
| URL deduplication | Core | Full |
| Content deduplication | Core | Full |
| Distributed crawling coordination | Core | Full |
| Recrawl scheduling for freshness | Core | Full |
| JavaScript rendering | High | Full |
| Sitemap processing | High | Full |
| Content storage and indexing | High | Overview |
| Image/media crawling | Medium | Brief |
| Deep web / form submission | Low | Out of scope |
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Crawl throughput | 1,000+ pages/second per node | Match search engine scale |
| Politeness | ≤1 request/second per host | Avoid overloading targets |
| URL dedup accuracy | 0% false negatives | Never miss seen URLs |
| Content dedup threshold | 3-bit SimHash distance | Catch near-duplicates |
| robots.txt freshness | Cache ≤24 hours | Per RFC 9309 recommendation |
| Fault tolerance | No data loss on node failure | Critical for multi-day crawls |
| Horizontal scalability | Linear throughput scaling | Add nodes to increase capacity |
Scale Estimation
Web Scale:
- Total websites: ~1.4 billion (206 million active)
- Google’s index: ~400 billion documents
- Common Crawl monthly: ~2.3 billion pages, ~400 TiB uncompressed
Target Crawl:
- Goal: 1 billion pages in 30 days
- Daily pages: 33.3 million = ~385 pages/second sustained
- With peak headroom (3×): ~1,200 pages/second
Traffic per node:
- Target: 100 pages/second per node
- Required nodes: 12 (covers the ~1,200 pages/second peak at 100 pages/second each), 15 (with failure buffer)
- Network per node: 100 pages/sec × 500KB avg = 50 MB/s = 400 Mbps
Storage:
- Raw HTML: 1B pages × 500KB avg = 500 TB
- Compressed (gzip ~10×): 50 TB
- URL database: 1B URLs × 200 bytes = 200 GB
- SimHash fingerprints: 1B × 8 bytes = 8 GB
- Bloom filter (1% FP): 1B URLs × 10 bits = 1.25 GB
DNS:
- Unique hosts: ~50 million (Common Crawl average)
- DNS queries: 50M × 1.2 (retries) = 60M
- With caching (1-hour TTL): ~6M active lookups
Design Paths
Path A: Centralized Coordinator
Best when:
- Small to medium scale (< 100M pages)
- Single datacenter deployment
- Strong consistency requirements
Architecture: A central coordinator maintains the entire URL frontier in memory/database. Crawler workers pull batches of URLs from the coordinator, fetch content, and report results back.
Key characteristics:
- Single point of truth for URL state
- Global priority ordering
- Centralized politeness enforcement
Trade-offs:
- ✅ Simple architecture, easy to reason about
- ✅ Global priority optimization
- ✅ Consistent deduplication
- ❌ Coordinator becomes bottleneck
- ❌ Single point of failure
- ❌ Does not scale beyond ~10 nodes
Real-world example: Scrapy Cluster uses Redis as a centralized frontier for coordinated distributed crawling up to millions of pages.
Path B: Fully Distributed (UbiCrawler)
Best when:
- Web-scale crawling (billions of pages)
- Multi-datacenter deployment
- No single point of failure tolerance
Architecture: No central coordinator. URLs are assigned to nodes via consistent hashing on host. Each node maintains its own frontier for assigned hosts. Nodes communicate only for handoff during rebalancing.
Key characteristics:
- Hash-based URL partitioning
- Local frontier per node
- Peer-to-peer coordination
Trade-offs:
- ✅ Linear horizontal scaling
- ✅ No single point of failure
- ✅ Local politeness (no coordination needed)
- ❌ Global priority ordering is approximate
- ❌ Complex rebalancing on node join/leave
- ❌ Harder to debug and monitor
Real-world example: UbiCrawler pioneered this fully distributed, consistent-hashing design in the research community. Common Crawl uses similar host-based partitioning at scale.
Path C: Hybrid with Message Queue (Chosen)
Best when:
- Large scale with operational simplicity
- Need both global coordination and distributed execution
- Flexible deployment (cloud or on-premise)
Architecture: URL frontier is distributed across nodes, but coordination happens through a message queue (Kafka). Each node owns a partition of the URL space (by host hash). New URLs are published to Kafka topics; consumers (crawler nodes) pull URLs for their assigned partitions.
Key characteristics:
- Message queue for durable URL distribution
- Hash-partitioned topics (one partition per host range)
- Local frontier within each node for politeness
- Separate fetcher and parser processes
Trade-offs:
- ✅ Kafka provides durability and replayability
- ✅ Easy to add/remove nodes (Kafka rebalancing)
- ✅ Clear separation of concerns
- ✅ Built-in backpressure handling
- ❌ Additional operational complexity (Kafka cluster)
- ❌ Some latency from message queue
- ❌ Requires careful partition design
Path Comparison
| Factor | Path A (Centralized) | Path B (Distributed) | Path C (Hybrid) |
|---|---|---|---|
| Scale limit | ~10 nodes | Effectively unlimited | ~1000 nodes |
| Complexity | Low | High | Medium |
| Fault tolerance | Low | High | High |
| Global optimization | Perfect | Approximate | Good |
| Operational overhead | Low | High | Medium |
| Best for | Prototype/small | Web-scale | Production |
This Article’s Focus
This article implements Path C (Hybrid) because it balances operational simplicity with scale. Kafka’s consumer groups handle node failures and rebalancing automatically, while hash-based partitioning enables local politeness enforcement without coordination overhead.
High-Level Design
Component Architecture
URL Frontier Service
Manages the queue of URLs to crawl with dual-queue architecture:
Front Queues (Priority):
- F queues ranked by priority (typically F=10-20)
- Priority factors: PageRank, domain authority, change frequency, depth from seed
- URLs are assigned to front queues based on computed priority score
Back Queues (Politeness):
- B queues, one per host (approximately)
- Each queue tracks `next_allowed_time` for rate limiting
- Mercator recommendation: B = 3 × number of crawler threads
Queue Selection:
- Select a non-empty front queue (weighted by priority)
- From that queue, select a URL whose back queue is ready (next_allowed_time ≤ now)
- After fetch, update back queue’s next_allowed_time
DNS Resolver Service
High-performance DNS resolution with aggressive caching:
- Local cache: LRU cache with TTL-aware expiration
- Batch resolution: Prefetch DNS for queued URLs
- Multiple resolvers: Parallel queries to multiple DNS servers
- Negative caching: Cache NXDOMAIN responses (shorter TTL)
Performance target: DNS should not be on the critical path. Pre-resolve while URL is queued.
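A minimal sketch of such a cache, assuming Node's built-in `dns/promises` resolver; the negative-TTL and TTL-cap values are illustrative:

```typescript
import { resolve4 } from "node:dns/promises"

interface CacheEntry {
  addresses: string[]
  failed: boolean
  expiresAt: number // epoch ms
}

// TTL-aware DNS cache with negative caching (sketch; eviction omitted)
class DNSCache {
  private cache = new Map<string, CacheEntry>()

  constructor(private negativeTtlMs = 60_000, private maxTtlMs = 3_600_000) {}

  async lookup(host: string): Promise<string[]> {
    const now = Date.now()
    const hit = this.cache.get(host)
    if (hit && hit.expiresAt > now) {
      if (hit.failed) throw new Error(`negative-cached DNS failure for ${host}`)
      return hit.addresses
    }

    try {
      // { ttl: true } returns per-record TTLs from the resolver
      const records = await resolve4(host, { ttl: true })
      const ttlMs = Math.min(this.maxTtlMs, Math.min(...records.map((r) => r.ttl)) * 1000)
      const addresses = records.map((r) => r.address)
      this.cache.set(host, { addresses, failed: false, expiresAt: now + ttlMs })
      return addresses
    } catch (err) {
      // Negative caching: remember failures (e.g., NXDOMAIN) briefly
      this.cache.set(host, { addresses: [], failed: true, expiresAt: now + this.negativeTtlMs })
      throw err
    }
  }
}
```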
Fetcher Service
HTTP client optimized for crawling:
- Connection pooling: Reuse connections per host
- HTTP/2: Single connection with multiplexed streams (where supported)
- Compression: Accept gzip/brotli, decompress on receipt
- Timeout handling: Connect timeout (10s), read timeout (30s), total timeout (60s)
- Redirect following: Up to 5 redirects (per RFC 9309 for robots.txt)
- User-Agent: Identify as crawler with contact info
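A fetch sketch under these assumptions: Node 18+ global `fetch`, a total-timeout budget via `AbortSignal.timeout`, and a placeholder User-Agent. Separate connect/read timeouts would need an HTTP client that exposes them (undici's Agent does):

```typescript
// Fetch one page with a total-timeout budget (sketch)
async function fetchPage(
  url: string,
  totalTimeoutMs = 60_000,
): Promise<{ status: number; body: string }> {
  const res = await fetch(url, {
    signal: AbortSignal.timeout(totalTimeoutMs), // aborts the whole request after the budget
    redirect: "follow", // Node caps redirects internally; enforce your own limit if needed
    headers: {
      // Identify the crawler and give site operators a way to reach you
      "User-Agent": "MyCrawler/1.0 (+https://example.com/bot; bot@example.com)",
    },
  })
  // gzip/brotli are negotiated and decompressed transparently by fetch
  return { status: res.status, body: await res.text() }
}
```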
robots.txt Service
Caches and enforces robots.txt rules:
- Cache duration: Up to 24 hours (RFC 9309 recommendation)
- Parsing: Handle malformed files gracefully
- Rules: Allow/Disallow with longest-match precedence
- Size limit: Parse at least first 500 KiB (RFC 9309 minimum)
- Error handling: 4xx = unrestricted, 5xx = assume complete disallow
Content Parser Service
Extracts content and links from fetched pages:
- HTML parsing: DOM-based extraction (jsoup, lxml)
- Link extraction: `<a href>`, `<link>`, `<script src>`, `<img src>`
- URL normalization: Resolve relative URLs, canonicalize
- Content extraction: Title, meta description, body text
- Metadata: Content-Type, charset, Last-Modified
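A link-extraction sketch, assuming the `cheerio` parser (the Node analogue of the jsoup/lxml options above):

```typescript
import * as cheerio from "cheerio" // assumed HTML parser dependency

// Extract and resolve outgoing links from a fetched page (sketch)
function extractLinks(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html)
  const links = new Set<string>()

  $("a[href], link[href]").each((_, el) => {
    const href = $(el).attr("href")
    if (!href) return
    try {
      // Resolve relative URLs against the page URL; keep only http(s)
      const abs = new URL(href, baseUrl)
      if (abs.protocol === "http:" || abs.protocol === "https:") {
        links.add(abs.toString())
      }
    } catch {
      // Ignore malformed hrefs (javascript:, broken encodings, etc.)
    }
  })
  return [...links]
}
```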
JavaScript Renderer Service
Renders JavaScript-heavy pages (separate from main fetcher):
- Headless Chrome/Puppeteer: Full browser rendering
- Render queue: Separate queue for JS-required pages
- Resource limits: Memory/CPU limits per render
- Timeout: 30-second render timeout
- Detection: Heuristics to identify JS-dependent pages
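A rendering sketch assuming Puppeteer, with the 30-second budget from the list above:

```typescript
import puppeteer from "puppeteer" // assumed renderer dependency

// Render a JS-dependent page and return the post-execution DOM (sketch)
async function renderPage(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true })
  try {
    const page = await browser.newPage()
    // "networkidle0" waits until no requests are in flight, capped at 30s
    await page.goto(url, { waitUntil: "networkidle0", timeout: 30_000 })
    return await page.content() // serialized DOM after script execution
  } finally {
    await browser.close()
  }
}
```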
Duplicate Detection Service
Prevents redundant crawling:
URL Deduplication:
- Bloom filter for fast negative checks
- Persistent URL store for seen URLs
- URL normalization before hashing
Content Deduplication:
- SimHash fingerprinting (64-bit)
- Threshold: Hamming distance ≤ 3 bits = duplicate
- Index fingerprints for fast lookup
Data Flow: Crawling a URL
Data Flow: Distributed URL Distribution
API Design
Internal APIs (Service-to-Service)
Submit URLs to Frontier
Endpoint: POST /api/v1/frontier/urls
```
// Headers
Authorization: Bearer {internal_token}
Content-Type: application/json
```

```json
// Request body
{
  "urls": [
    {
      "url": "https://example.com/page1",
      "priority": 0.85,
      "source_url": "https://example.com/",
      "depth": 1,
      "discovered_at": "2024-01-15T10:30:00Z"
    },
    {
      "url": "https://example.com/page2",
      "priority": 0.72,
      "source_url": "https://example.com/",
      "depth": 1,
      "discovered_at": "2024-01-15T10:30:00Z"
    }
  ],
  "crawl_id": "crawl-2024-01-15"
}
```

Response (202 Accepted):

```json
{ "accepted": 2, "rejected": 0, "duplicates": 0 }
```

Design Decision: Batch Submission
URLs are submitted in batches (100-1000) rather than individually to amortize Kafka producer overhead and reduce network round-trips. The frontier service handles deduplication and prioritization asynchronously.
Fetch Next URLs (Pull Model)
Endpoint: GET /api/v1/frontier/next
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| `worker_id` | string | Unique identifier for the crawler worker |
| `batch_size` | integer | Number of URLs to fetch (default: 100, max: 1000) |
| `timeout_ms` | integer | Long-poll timeout if no URLs ready (default: 5000) |
Response (200 OK):
```json
{
  "urls": [
    {
      "url": "https://example.com/page1",
      "priority": 0.85,
      "host": "example.com",
      "ip": "93.184.216.34",
      "robots_rules": { "allowed": true, "crawl_delay": null }
    },
    {
      "url": "https://other.com/page2",
      "priority": 0.72,
      "host": "other.com",
      "ip": "203.0.113.50",
      "robots_rules": { "allowed": true, "crawl_delay": 2 }
    }
  ],
  "lease_id": "lease-abc123",
  "lease_expires_at": "2024-01-15T10:35:00Z"
}
```

Design Decision: URL Leasing
URLs are leased to workers with a timeout (5 minutes default). If the worker doesn’t report completion (success or failure) before the lease expires, the URL is returned to the frontier for reassignment. This handles worker crashes without losing URLs.
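A sketch of the reclamation sweep, assuming a `pg` client and a hypothetical `leases` table (`url_id`, `expires_at`) alongside the `urls` table defined under Data Modeling:

```typescript
import { Pool } from "pg" // assumed driver; the leases table is hypothetical

const pool = new Pool()

// Periodic sweeper: URLs whose lease expired go back to the frontier
async function reclaimExpiredLeases(): Promise<number> {
  const result = await pool.query(
    `UPDATE urls
     SET state = 'pending'
     WHERE state = 'fetching'
       AND id IN (SELECT url_id FROM leases WHERE expires_at < NOW())`,
  )
  await pool.query(`DELETE FROM leases WHERE expires_at < NOW()`)
  return result.rowCount ?? 0
}
```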
Report Fetch Results
Endpoint: POST /api/v1/frontier/results
```
// Headers
Authorization: Bearer {internal_token}
Content-Type: application/json
```

```json
// Request body
{
  "lease_id": "lease-abc123",
  "results": [
    {
      "url": "https://example.com/page1",
      "status": "success",
      "http_status": 200,
      "content_type": "text/html",
      "content_length": 45230,
      "content_hash": "a1b2c3d4e5f6...",
      "simhash": "0x1234567890abcdef",
      "fetch_time_ms": 234,
      "links_found": 47,
      "is_duplicate": false
    },
    {
      "url": "https://other.com/page2",
      "status": "error",
      "http_status": 503,
      "error": "Service Unavailable",
      "retry_after": 300
    }
  ]
}
```

Response (200 OK):

```json
{ "processed": 2, "retries_scheduled": 1 }
```

Monitoring API
Crawl Statistics
Endpoint: GET /api/v1/stats
```json
{
  "crawl_id": "crawl-2024-01-15",
  "started_at": "2024-01-15T00:00:00Z",
  "runtime_hours": 10.5,
  "urls": {
    "discovered": 15234567,
    "queued": 8234123,
    "fetched": 7000444,
    "successful": 6543210,
    "failed": 457234,
    "duplicate_content": 234567,
    "blocked_robots": 123456
  },
  "throughput": {
    "current_pages_per_second": 185.3,
    "avg_pages_per_second": 173.2,
    "peak_pages_per_second": 312.5
  },
  "hosts": {
    "total_discovered": 1234567,
    "active_in_queue": 456789,
    "blocked_by_robots": 12345
  },
  "storage": {
    "content_stored_gb": 2345.6,
    "urls_stored_millions": 15.2
  },
  "workers": {
    "active": 12,
    "idle": 3,
    "failed": 0
  }
}
```

Data Modeling
URL Schema
Primary Store: PostgreSQL for URL metadata, Kafka for frontier queue
```sql
-- URL state tracking
CREATE TABLE urls (
  id BIGSERIAL PRIMARY KEY,
  url_hash BYTEA NOT NULL,       -- SHA-256 of normalized URL (32 bytes)
  url TEXT NOT NULL,
  normalized_url TEXT NOT NULL,

  -- Discovery metadata
  host VARCHAR(255) NOT NULL,
  host_hash INT NOT NULL,        -- For partition routing
  depth SMALLINT DEFAULT 0,
  source_url_id BIGINT REFERENCES urls(id),
  discovered_at TIMESTAMPTZ DEFAULT NOW(),

  -- Priority and scheduling
  priority REAL DEFAULT 0.5,
  last_fetched_at TIMESTAMPTZ,
  next_fetch_at TIMESTAMPTZ,
  fetch_count INT DEFAULT 0,

  -- Fetch results
  last_status_code SMALLINT,
  last_content_hash BYTEA,       -- For change detection
  last_simhash BIGINT,           -- Content fingerprint
  content_length INT,
  content_type VARCHAR(100),

  -- State
  state VARCHAR(20) DEFAULT 'pending', -- pending, queued, fetching, completed, failed, blocked

  CONSTRAINT url_hash_unique UNIQUE (url_hash)
);

-- Indexes for common access patterns
CREATE INDEX idx_urls_host ON urls(host_hash, host);
CREATE INDEX idx_urls_state ON urls(state, next_fetch_at)
  WHERE state IN ('pending', 'queued');
CREATE INDEX idx_urls_refetch ON urls(next_fetch_at)
  WHERE state = 'completed' AND next_fetch_at IS NOT NULL;
```

Design Decision: URL Hash as Primary Key for Dedup
Using SHA-256 hash of the normalized URL as the unique constraint instead of the raw URL:
- Fixed size (32 bytes) regardless of URL length
- Fast equality comparisons
- Bloom filter uses same hash
- URLs up to 2000+ characters don’t bloat indexes
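For illustration, the dedup key is just the SHA-256 digest of the normalized URL (`normalizeURL` is implemented under Low-Level Design below):

```typescript
import { createHash } from "crypto"

// 32-byte dedup key; the same digest feeds the Bloom filter and url_hash column
function urlHash(normalizedUrl: string): Buffer {
  return createHash("sha256").update(normalizedUrl).digest()
}
```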
Robots.txt Cache Schema
```sql
-- robots.txt cache
CREATE TABLE robots_cache (
  host VARCHAR(255) PRIMARY KEY,
  fetched_at TIMESTAMPTZ NOT NULL,
  expires_at TIMESTAMPTZ NOT NULL,
  status_code SMALLINT NOT NULL,
  content TEXT,         -- Raw robots.txt content
  parsed_rules JSONB,   -- Pre-parsed rules for fast lookup
  crawl_delay INT,      -- Crawl-delay directive (if present)
  sitemaps TEXT[],      -- Sitemap URLs discovered
  error_message TEXT
);

-- Example parsed_rules structure:
-- {
--   "user_agents": {
--     "*": {
--       "allow": ["/public/", "/api/"],
--       "disallow": ["/admin/", "/private/"]
--     },
--     "googlebot": {
--       "allow": ["/"],
--       "disallow": ["/no-google/"]
--     }
--   }
-- }

-- Expiry scan index (a partial index on NOW() is not allowed in PostgreSQL,
-- since index predicates must be immutable)
CREATE INDEX idx_robots_expires ON robots_cache(expires_at);
```

Content Storage Schema
```sql
-- Page content archive (separate from URL metadata)
CREATE TABLE page_content (
  id BIGSERIAL PRIMARY KEY,
  url_id BIGINT NOT NULL REFERENCES urls(id),
  fetch_id BIGINT NOT NULL,      -- Links to fetch attempt

  -- Content
  raw_content BYTEA,             -- Compressed HTML/content
  compression VARCHAR(10) DEFAULT 'gzip',
  content_length INT NOT NULL,
  content_type VARCHAR(100),
  charset VARCHAR(50),

  -- Extracted data
  title TEXT,
  meta_description TEXT,
  canonical_url TEXT,
  language VARCHAR(10),

  -- Fingerprints
  content_hash BYTEA NOT NULL,   -- SHA-256 for exact match
  simhash BIGINT NOT NULL,       -- SimHash for near-duplicate

  -- Timestamps
  fetched_at TIMESTAMPTZ NOT NULL,
  http_date TIMESTAMPTZ,         -- From HTTP Date header
  last_modified TIMESTAMPTZ      -- From Last-Modified header
);

-- Content deduplication indexes
CREATE INDEX idx_content_simhash ON page_content(simhash);
CREATE INDEX idx_content_hash ON page_content(content_hash);

-- Partition by fetch date for efficient archival:
-- CREATE TABLE page_content_2024_01 PARTITION OF page_content
--   FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```

Database Selection Matrix
| Data Type | Store | Rationale |
|---|---|---|
| URL frontier queue | Kafka | Durable, partitioned, consumer groups |
| URL metadata | PostgreSQL | Complex queries, ACID, indexes |
| Seen URL filter | Redis Bloom | Fast membership test, O(1) |
| robots.txt cache | Redis | TTL support, fast lookup |
| Page content | Object Storage (S3) | Large blobs, cheap, archival |
| Content fingerprints | PostgreSQL + Redis | Persistent + fast lookup |
| Crawl metrics | InfluxDB/TimescaleDB | Time-series queries |
| Web graph | Neo4j / PostgreSQL | Link structure analysis |
Partitioning Strategy
URL Database Sharding:
- Shard key: `host_hash` (hash of hostname)
- Rationale: Co-locates all URLs for a host; queries filter by host
Kafka Topic Partitioning:
- Partition key: `host_hash`
- Number of partitions: 256 (allows scaling to 256 crawler nodes)
- Rationale: Each crawler consumes specific partitions → owns specific hosts
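A producer sketch assuming the `kafkajs` client (topic name illustrative). Keying messages by hostname makes the default partitioner route every URL for a host to the same partition, and therefore to the same consuming crawler node:

```typescript
import { Kafka } from "kafkajs" // assumed client library

const kafka = new Kafka({ clientId: "frontier", brokers: ["kafka-1:9092"] })
const producer = kafka.producer()

// Publish a discovered URL, partitioned by host
async function publishURL(url: string, priority: number): Promise<void> {
  await producer.connect() // safe to call repeatedly
  const host = new URL(url).hostname
  await producer.send({
    topic: "frontier.urls",
    messages: [{ key: host, value: JSON.stringify({ url, priority }) }],
  })
}
```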
Content Storage:
- Partition by date (monthly tables)
- Rationale: Old content can be archived/deleted without affecting recent data
Low-Level Design
URL Frontier Implementation
The frontier is the heart of the crawler. It must balance priority, politeness, and throughput.
Dual-Queue Architecture (Mercator Style)
```typescript
// PriorityQueue, MinHeap, HostQueue, BackQueueEntry, and RobotsCache are
// assumed helper structures; only the frontier logic is sketched here.

// Frontier configuration
interface FrontierConfig {
  numFrontQueues: number         // F = number of priority levels (10-20)
  numBackQueues: number          // B = 3 × crawler threads
  defaultPolitenessDelay: number // milliseconds between requests to same host
  maxQueueSize: number           // per back queue
}

interface URLEntry {
  url: string
  normalizedUrl: string
  host: string
  priority: number // 0.0 to 1.0
  depth: number
  discoveredAt: Date
}

class URLFrontier {
  private frontQueues: PriorityQueue<URLEntry>[]  // F queues
  private backQueues: Map<string, HostQueue>      // Host → queue
  private hostToBackQueue: Map<string, number>    // Host → back queue index
  private backQueueHeap: MinHeap<BackQueueEntry>  // Sorted by next_allowed_time

  constructor(
    private config: FrontierConfig,
    private robotsCache: RobotsCache,
  ) {
    this.frontQueues = Array(config.numFrontQueues)
      .fill(null)
      .map(() => new PriorityQueue<URLEntry>((a, b) => b.priority - a.priority))

    this.backQueues = new Map()
    this.hostToBackQueue = new Map()
    this.backQueueHeap = new MinHeap((a, b) => a.nextAllowedTime - b.nextAllowedTime)
  }

  // Add URL to frontier
  async addURL(entry: URLEntry): Promise<void> {
    // Assign to front queue based on priority
    const frontQueueIndex = Math.floor(entry.priority * (this.config.numFrontQueues - 1))
    this.frontQueues[frontQueueIndex].enqueue(entry)

    // Ensure back queue exists for this host
    this.ensureBackQueue(entry.host)
  }

  // Get next URL to fetch (respects politeness)
  async getNext(): Promise<URLEntry | null> {
    // Find a back queue that's ready (next_allowed_time <= now)
    const now = Date.now()

    while (this.backQueueHeap.size() > 0) {
      const topBackQueue = this.backQueueHeap.peek()

      if (topBackQueue.nextAllowedTime > now) {
        // No queues ready yet, return null or wait
        return null
      }

      const backQueue = this.backQueues.get(topBackQueue.host)
      if (backQueue && backQueue.size() > 0) {
        // Pop URL from this back queue
        const url = backQueue.dequeue()

        // Update next_allowed_time
        this.backQueueHeap.extractMin()
        topBackQueue.nextAllowedTime = now + this.getPolitenessDelay(topBackQueue.host)
        this.backQueueHeap.insert(topBackQueue)

        // Refill back queue from front queues if needed
        this.refillBackQueue(topBackQueue.host)

        return url
      }

      // Empty back queue, remove from heap
      this.backQueueHeap.extractMin()
    }

    return null
  }

  // Create the per-host queue and heap entry on first sight of a host
  private ensureBackQueue(host: string): void {
    if (!this.backQueues.has(host)) {
      this.backQueues.set(host, new HostQueue())
      this.backQueueHeap.insert({ host, nextAllowedTime: Date.now() })
    }
  }

  // Refill back queue from front queues (maintains priority)
  private refillBackQueue(host: string): void {
    const backQueue = this.backQueues.get(host)
    if (!backQueue || backQueue.size() >= 10) return // Keep 10 URLs buffered

    // Scan front queues for URLs matching this host
    for (const frontQueue of this.frontQueues) {
      const url = frontQueue.findAndRemove((e) => e.host === host)
      if (url) {
        backQueue.enqueue(url)
        if (backQueue.size() >= 10) break
      }
    }
  }

  private getPolitenessDelay(host: string): number {
    // Check robots.txt crawl-delay, fall back to default
    const robotsDelay = this.robotsCache.getCrawlDelay(host)
    return robotsDelay ?? this.config.defaultPolitenessDelay
  }
}
```

Priority Calculation
```typescript
interface PriorityFactors {
  pageRank: number        // 0.0 to 1.0, from link analysis
  domainAuthority: number // 0.0 to 1.0, domain-level quality
  changeFrequency: number // 0.0 to 1.0, how often content changes
  depth: number           // Distance from seed URL
  freshness: number       // 0.0 to 1.0, based on last fetch age
}

function calculatePriority(factors: PriorityFactors): number {
  // Weighted combination of factors
  const weights = {
    pageRank: 0.3,
    domainAuthority: 0.25,
    changeFrequency: 0.2,
    depth: 0.15,
    freshness: 0.1,
  }

  // Depth penalty: deeper pages get lower priority
  const depthScore = Math.max(0, 1 - factors.depth * 0.1) // -10% per level

  const priority =
    weights.pageRank * factors.pageRank +
    weights.domainAuthority * factors.domainAuthority +
    weights.changeFrequency * factors.changeFrequency +
    weights.depth * depthScore +
    weights.freshness * factors.freshness

  return Math.max(0, Math.min(1, priority)) // Clamp to [0, 1]
}

// Example: High-value news site homepage
const priority = calculatePriority({
  pageRank: 0.9,
  domainAuthority: 0.85,
  changeFrequency: 0.95, // Changes frequently
  depth: 0,              // Seed URL
  freshness: 0.1,        // Hasn't been fetched recently
})
// priority ≈ 0.83 → high priority front queue
```

Duplicate Detection
URL Normalization
URL normalization is critical—different URL strings can reference the same resource.
```typescript
import { URL } from "url"

interface NormalizationOptions {
  removeFragment: boolean      // Remove #hash
  removeDefaultPorts: boolean  // Remove :80, :443
  removeTrailingSlash: boolean // Remove trailing /
  removeIndexFiles: boolean    // Remove /index.html
  sortQueryParams: boolean     // Alphabetize query params
  lowercaseHost: boolean       // Lowercase hostname
  removeWWW: boolean           // Remove www. prefix (controversial)
  decodeUnreserved: boolean    // Decode safe percent-encoded chars
}

const DEFAULT_OPTIONS: NormalizationOptions = {
  removeFragment: true,
  removeDefaultPorts: true,
  removeTrailingSlash: true,
  removeIndexFiles: true,
  sortQueryParams: true,
  lowercaseHost: true,
  removeWWW: false, // www.example.com and example.com may be different
  decodeUnreserved: true,
}

function normalizeURL(urlString: string, options = DEFAULT_OPTIONS): string {
  const url = new URL(urlString)

  // Lowercase scheme and host
  url.protocol = url.protocol.toLowerCase()
  if (options.lowercaseHost) {
    url.hostname = url.hostname.toLowerCase()
  }

  // Remove default ports
  if (options.removeDefaultPorts) {
    if (
      (url.protocol === "http:" && url.port === "80") ||
      (url.protocol === "https:" && url.port === "443")
    ) {
      url.port = ""
    }
  }

  // Remove fragment
  if (options.removeFragment) {
    url.hash = ""
  }

  // Sort query parameters
  if (options.sortQueryParams && url.search) {
    const params = new URLSearchParams(url.search)
    const sorted = new URLSearchParams([...params.entries()].sort())
    url.search = sorted.toString()
  }

  // Remove trailing slash (except for root)
  if (options.removeTrailingSlash && url.pathname !== "/") {
    url.pathname = url.pathname.replace(/\/+$/, "")
  }

  // Remove index files
  if (options.removeIndexFiles) {
    url.pathname = url.pathname.replace(/\/(index|default)\.(html?|php|asp)$/i, "/")
  }

  // Decode unreserved characters
  if (options.decodeUnreserved) {
    url.pathname = decodeURIComponent(url.pathname)
      .split("/")
      .map((segment) => encodeURIComponent(segment))
      .join("/")
  }

  return url.toString()
}

// Examples:
// normalizeURL('HTTP://Example.COM:80/Page/')
// → 'http://example.com/Page' (host is lowercased; path case is preserved)
// normalizeURL('https://example.com/search?b=2&a=1')
// → 'https://example.com/search?a=1&b=2'
// normalizeURL('https://example.com/index.html')
// → 'https://example.com/'
```

Bloom Filter for URL Deduplication
```typescript
import { createHash } from "crypto"

class BloomFilter {
  private bitArray: Uint8Array
  private numHashFunctions: number
  private size: number

  constructor(expectedElements: number, falsePositiveRate: number = 0.01) {
    // Calculate optimal size and hash functions
    // m = -n * ln(p) / (ln(2)^2)
    // k = (m/n) * ln(2)
    this.size = Math.ceil((-expectedElements * Math.log(falsePositiveRate)) / Math.log(2) ** 2)
    this.numHashFunctions = Math.ceil((this.size / expectedElements) * Math.log(2))
    this.bitArray = new Uint8Array(Math.ceil(this.size / 8))

    console.log(`Bloom filter: ${this.size} bits, ${this.numHashFunctions} hash functions`)
    // For 1B URLs at 1% FP: ~1.2 GB, 7 hash functions
  }

  private getHashes(item: string): number[] {
    // Use double hashing: h(i) = h1 + i * h2
    const h1 = this.hash(item, 0)
    const h2 = this.hash(item, 1)

    return Array(this.numHashFunctions)
      .fill(0)
      .map((_, i) => Math.abs((h1 + i * h2) % this.size))
  }

  private hash(item: string, seed: number): number {
    const hash = createHash("md5").update(`${seed}:${item}`).digest()
    return hash.readUInt32LE(0)
  }

  add(item: string): void {
    for (const hash of this.getHashes(item)) {
      const byteIndex = Math.floor(hash / 8)
      const bitIndex = hash % 8
      this.bitArray[byteIndex] |= 1 << bitIndex
    }
  }

  mightContain(item: string): boolean {
    for (const hash of this.getHashes(item)) {
      const byteIndex = Math.floor(hash / 8)
      const bitIndex = hash % 8
      if ((this.bitArray[byteIndex] & (1 << bitIndex)) === 0) {
        return false // Definitely not in set
      }
    }
    return true // Probably in set (may be false positive)
  }
}

// Usage
const seenURLs = new BloomFilter(1_000_000_000, 0.01)

function isURLSeen(url: string): boolean {
  const normalized = normalizeURL(url)

  if (!seenURLs.mightContain(normalized)) {
    return false // Definitely not seen
  }

  // Bloom filter says "maybe"—check persistent store
  return urlDatabase.exists(normalized)
}

function markURLSeen(url: string): void {
  const normalized = normalizeURL(url)
  seenURLs.add(normalized)
  urlDatabase.insert(normalized)
}
```

SimHash for Content Deduplication
```typescript
import { createHash } from "crypto"

// SimHash generates a fingerprint where similar documents have similar hashes
// Hamming distance between fingerprints indicates similarity

function simhash(text: string, hashBits: number = 64): bigint {
  // Tokenize into shingles (word n-grams)
  const tokens = tokenize(text)
  const shingles = generateShingles(tokens, 3) // 3-word shingles

  // Initialize bit vector
  const v = new Array(hashBits).fill(0)

  for (const shingle of shingles) {
    // Hash each shingle to get bit positions
    const hash = hashShingle(shingle, hashBits)

    // Update vector: +1 if bit is 1, -1 if bit is 0
    for (let i = 0; i < hashBits; i++) {
      if ((hash >> BigInt(i)) & 1n) {
        v[i] += 1
      } else {
        v[i] -= 1
      }
    }
  }

  // Convert vector to fingerprint: 1 if positive, 0 if negative
  let fingerprint = 0n
  for (let i = 0; i < hashBits; i++) {
    if (v[i] > 0) {
      fingerprint |= 1n << BigInt(i)
    }
  }

  return fingerprint
}

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, " ")
    .split(/\s+/)
    .filter((word) => word.length > 2)
}

function generateShingles(tokens: string[], n: number): string[] {
  const shingles: string[] = []
  for (let i = 0; i <= tokens.length - n; i++) {
    shingles.push(tokens.slice(i, i + n).join(" "))
  }
  return shingles
}

function hashShingle(shingle: string, bits: number): bigint {
  const hash = createHash("sha256").update(shingle).digest("hex")
  return BigInt("0x" + hash.slice(0, bits / 4))
}

function hammingDistance(a: bigint, b: bigint): number {
  let xor = a ^ b
  let distance = 0
  while (xor > 0n) {
    distance += Number(xor & 1n)
    xor >>= 1n
  }
  return distance
}

// Usage
const DUPLICATE_THRESHOLD = 3 // Max 3 bits different

function isNearDuplicate(newFingerprint: bigint, existingFingerprints: bigint[]): boolean {
  for (const existing of existingFingerprints) {
    if (hammingDistance(newFingerprint, existing) <= DUPLICATE_THRESHOLD) {
      return true
    }
  }
  return false
}

// Example (toy-sized documents, so the exact distances are noisy)
const doc1 = "The quick brown fox jumps over the lazy dog"
const doc2 = "The quick brown fox leaps over the lazy dog" // One word changed
const doc3 = "Completely different content about web crawlers"

const hash1 = simhash(doc1)
const hash2 = simhash(doc2)
const hash3 = simhash(doc3)

console.log(hammingDistance(hash1, hash2)) // Small distance → near-duplicate
console.log(hammingDistance(hash1, hash3)) // Large distance → different content
```

robots.txt Parsing
```typescript
// RFC 9309 compliant robots.txt parser

interface RobotsRule {
  path: string
  allow: boolean
}

interface RobotsRules {
  rules: RobotsRule[]
  crawlDelay?: number
  sitemaps: string[]
}

function parseRobotsTxt(content: string, userAgent: string): RobotsRules {
  const lines = content.split("\n").map((l) => l.trim())
  const rules: RobotsRule[] = []
  let crawlDelay: number | undefined
  const sitemaps: string[] = []

  let inRelevantGroup = false
  let lastLineWasUserAgent = false

  for (const line of lines) {
    // Skip comments and empty lines
    if (line.startsWith("#") || line === "") continue

    const [directive, ...valueParts] = line.split(":")
    const value = valueParts.join(":").trim()

    switch (directive.toLowerCase()) {
      case "user-agent":
        // Consecutive user-agent lines extend the current group; a
        // user-agent line after rules starts a new group (RFC 9309)
        if (!lastLineWasUserAgent) {
          inRelevantGroup = false
        }
        if (matchesUserAgent(value, userAgent)) {
          inRelevantGroup = true
        }
        lastLineWasUserAgent = true
        break

      case "allow":
        if (inRelevantGroup && value) {
          rules.push({ path: value, allow: true })
        }
        lastLineWasUserAgent = false
        break

      case "disallow":
        // An empty Disallow means "allow everything"; record no rule
        if (inRelevantGroup && value) {
          rules.push({ path: value, allow: false })
        }
        lastLineWasUserAgent = false
        break

      case "crawl-delay":
        // Note: crawl-delay is NOT in RFC 9309, but commonly used
        if (inRelevantGroup) {
          crawlDelay = parseInt(value, 10)
        }
        lastLineWasUserAgent = false
        break

      case "sitemap":
        // Sitemaps apply globally
        sitemaps.push(value)
        lastLineWasUserAgent = false
        break
    }
  }

  // Sort rules by path length (longest match wins per RFC 9309)
  rules.sort((a, b) => b.path.length - a.path.length)

  return { rules, crawlDelay, sitemaps }
}

function matchesUserAgent(pattern: string, userAgent: string): boolean {
  const patternLower = pattern.toLowerCase()
  const uaLower = userAgent.toLowerCase()

  if (patternLower === "*") return true
  return uaLower.includes(patternLower)
}

function isAllowed(rules: RobotsRules, path: string): boolean {
  // RFC 9309: Longest matching path wins (rules are pre-sorted by length)
  for (const rule of rules.rules) {
    if (pathMatches(rule.path, path)) {
      return rule.allow
    }
  }
  // Default: allowed if no rule matches
  return true
}

function pathMatches(pattern: string, path: string): boolean {
  // Handle wildcards (*) and the end-of-path anchor ($)
  const endAnchored = pattern.endsWith("$")
  const body = endAnchored ? pattern.slice(0, -1) : pattern

  const regexBody = body
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // Escape regex special chars
    .replace(/\*/g, ".*") // * matches any character sequence

  // Rule paths always match from the start of the URL path
  return new RegExp(`^${regexBody}${endAnchored ? "$" : ""}`).test(path)
}

// Usage
const robotsTxt = `User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /no-google/

Sitemap: https://example.com/sitemap.xml`

const rules = parseRobotsTxt(robotsTxt, "MyCrawler/1.0")
console.log(isAllowed(rules, "/admin/dashboard")) // false
console.log(isAllowed(rules, "/admin/public/"))   // true (longest match wins)
console.log(isAllowed(rules, "/public/page"))     // true (no matching rule)
```

Spider Trap Detection
```typescript
// Detect and avoid spider traps (infinite URL spaces)

interface TrapDetectionConfig {
  maxURLLength: number          // URLs longer than this are suspicious
  maxPathDepth: number          // Path segments deeper than this
  maxSimilarURLsPerHost: number // Too many similar patterns
  patternThreshold: number      // Repetition count before flagging
}

const DEFAULT_TRAP_CONFIG: TrapDetectionConfig = {
  maxURLLength: 2000,
  maxPathDepth: 15,
  maxSimilarURLsPerHost: 10000,
  patternThreshold: 3,
}

interface URLPattern {
  host: string
  pathSegments: string[]
  queryParams: string[]
}

class SpiderTrapDetector {
  private hostPatterns: Map<string, Map<string, number>> = new Map()

  constructor(private config = DEFAULT_TRAP_CONFIG) {}

  isTrap(url: string): { isTrap: boolean; reason?: string } {
    try {
      const parsed = new URL(url)

      // Check 1: URL length
      if (url.length > this.config.maxURLLength) {
        return { isTrap: true, reason: `URL too long (${url.length} chars)` }
      }

      // Check 2: Path depth
      const pathSegments = parsed.pathname.split("/").filter(Boolean)
      if (pathSegments.length > this.config.maxPathDepth) {
        return { isTrap: true, reason: `Path too deep (${pathSegments.length} segments)` }
      }

      // Check 3: Repeating patterns in path
      const pattern = this.detectRepeatingPattern(pathSegments)
      if (pattern) {
        return { isTrap: true, reason: `Repeating pattern: ${pattern}` }
      }

      // Check 4: Calendar trap (dates extending into far future)
      if (this.isCalendarTrap(parsed.pathname)) {
        return { isTrap: true, reason: "Calendar trap (far future dates)" }
      }

      // Check 5: Session ID in URL
      if (this.hasSessionID(parsed.search)) {
        return { isTrap: true, reason: "Session ID in URL" }
      }

      return { isTrap: false }
    } catch {
      return { isTrap: true, reason: "Invalid URL" }
    }
  }

  private detectRepeatingPattern(segments: string[]): string | null {
    // Look for patterns like /a/b/a/b/a/b
    for (let patternLength = 1; patternLength <= segments.length / 3; patternLength++) {
      const pattern = segments.slice(0, patternLength).join("/")
      let repetitions = 0

      for (let i = 0; i <= segments.length - patternLength; i += patternLength) {
        if (segments.slice(i, i + patternLength).join("/") === pattern) {
          repetitions++
        }
      }

      if (repetitions >= this.config.patternThreshold) {
        return pattern
      }
    }
    return null
  }

  private isCalendarTrap(pathname: string): boolean {
    // Look for date patterns like /2024/01/15 extending far into future
    const datePattern = /\/(\d{4})\/(\d{1,2})\/(\d{1,2})/
    const match = pathname.match(datePattern)

    if (match) {
      const year = parseInt(match[1], 10)
      const currentYear = new Date().getFullYear()

      // Flag if date is more than 2 years in the future
      if (year > currentYear + 2) {
        return true
      }
    }
    return false
  }

  private hasSessionID(search: string): boolean {
    const sessionPatterns = [
      /[?&](session_?id|sid|phpsessid|jsessionid|aspsessionid)=/i,
      /[?&][a-z]+=[a-f0-9]{32,}/i, // Long hex strings often session IDs
    ]

    return sessionPatterns.some((pattern) => pattern.test(search))
  }
}
```

Frontend Considerations
Crawl Monitoring Dashboard
Problem: Operators need real-time visibility into crawl progress, errors, and throughput across distributed nodes.
Solution: Real-Time Metrics Dashboard
```typescript
import { useEffect, useState } from "react"

// Dashboard data structure
interface CrawlDashboardState {
  summary: {
    totalURLs: number
    fetchedURLs: number
    queuedURLs: number
    errorCount: number
    duplicateCount: number
  }
  throughput: {
    current: number
    average: number
    history: Array<{ timestamp: number; value: number }>
  }
  workers: Array<{
    id: string
    status: "active" | "idle" | "error"
    urlsProcessed: number
    currentHost: string
    lastActivity: Date
  }>
  topHosts: Array<{
    host: string
    urlsFetched: number
    urlsQueued: number
    errorRate: number
  }>
  errors: Array<{
    timestamp: Date
    url: string
    error: string
    count: number
  }>
}

// Real-time updates via WebSocket
// (initialState and updateWorker are assumed helpers)
function useCrawlDashboard(): CrawlDashboardState {
  const [state, setState] = useState<CrawlDashboardState>(initialState)

  useEffect(() => {
    const ws = new WebSocket("wss://crawler-api/dashboard/stream")

    ws.onmessage = (event) => {
      const update = JSON.parse(event.data)

      switch (update.type) {
        case "summary":
          setState((prev) => ({ ...prev, summary: update.data }))
          break
        case "throughput":
          setState((prev) => ({
            ...prev,
            throughput: {
              ...prev.throughput,
              current: update.data.current,
              history: [...prev.throughput.history.slice(-59), update.data],
            },
          }))
          break
        case "worker_status":
          setState((prev) => ({
            ...prev,
            workers: updateWorker(prev.workers, update.data),
          }))
          break
      }
    }

    return () => ws.close()
  }, [])

  return state
}
```

Key metrics to display:
- URLs/second (current, average, peak)
- Queue depth by priority level
- Error rates by category (network, HTTP 4xx, 5xx, timeout)
- Worker status and distribution
- Top hosts by activity
- Bandwidth consumption
URL Analysis Interface
```typescript
import { useState } from 'react';

// Interface for analyzing specific URLs or hosts
interface URLAnalysis {
  url: string;
  normalizedUrl: string;
  status: 'pending' | 'fetched' | 'error' | 'blocked';
  fetchHistory: Array<{
    timestamp: Date;
    statusCode: number;
    responseTime: number;
    contentHash: string;
  }>;
  robotsStatus: {
    allowed: boolean;
    rule: string;
    crawlDelay?: number;
  };
  linkAnalysis: {
    inboundLinks: number;
    outboundLinks: number;
    pageRank: number;
  };
}

// Host-level analysis
interface HostAnalysis {
  host: string;
  urlsDiscovered: number;
  urlsFetched: number;
  urlsBlocked: number;
  avgResponseTime: number;
  errorRate: number;
  robots: {
    lastFetched: Date;
    crawlDelay?: number;
    blockedPaths: string[];
  };
  throughputHistory: Array<{ timestamp: Date; urlsPerMinute: number }>;
}

// Component for searching and analyzing URLs
function URLAnalyzer() {
  const [searchQuery, setSearchQuery] = useState('');
  const [analysis, setAnalysis] = useState<URLAnalysis | null>(null);

  const analyzeURL = async (url: string) => {
    const result = await fetch(`/api/v1/analysis/url?url=${encodeURIComponent(url)}`);
    setAnalysis(await result.json());
  };

  return (
    <div className="url-analyzer">
      <input
        type="text"
        value={searchQuery}
        onChange={e => setSearchQuery(e.target.value)}
        placeholder="Enter URL or host..."
      />
      <button onClick={() => analyzeURL(searchQuery)}>Analyze</button>

      {analysis && (
        <div className="analysis-results">
          <h3>URL: {analysis.url}</h3>
          <p>Status: {analysis.status}</p>
          <p>Robots: {analysis.robotsStatus.allowed ? 'Allowed' : 'Blocked'}</p>
          <p>PageRank: {analysis.linkAnalysis.pageRank.toFixed(4)}</p>
          {/* Fetch history table, etc. */}
        </div>
      )}
    </div>
  );
}
```

Infrastructure Design
Cloud-Agnostic Concepts
| Component | Requirement | Options |
|---|---|---|
| Message Queue | Durable, partitioned, high throughput | Kafka, Apache Pulsar, Redpanda |
| URL Database | High write throughput, range queries | PostgreSQL, CockroachDB, Cassandra |
| Seen URL Filter | Fast membership test | Redis Bloom, Custom in-memory |
| Content Storage | Cheap, durable, archival | S3-compatible (MinIO, SeaweedFS) |
| Coordination | Leader election, config | ZooKeeper, etcd, Consul |
| DNS Cache | Low latency, high throughput | PowerDNS, Unbound, Custom |
| Metrics | Time-series, real-time | InfluxDB, Prometheus, TimescaleDB |
AWS Reference Architecture
| Component | AWS Service | Configuration |
|---|---|---|
| Message Queue | Amazon MSK | kafka.m5.2xlarge, 6 brokers, 256 partitions |
| Crawler Nodes | EC2 c6i.2xlarge | 8 vCPU, 16GB RAM, Auto Scaling 5-50 |
| Parser Nodes | EC2 c6i.xlarge | 4 vCPU, 8GB RAM, Auto Scaling 2-20 |
| JS Renderer | EC2 c6i.4xlarge | 16 vCPU, 32GB RAM, Spot instances |
| URL Database | RDS PostgreSQL | db.r6g.2xlarge, Multi-AZ, 2TB gp3 |
| Bloom Filter | ElastiCache Redis | cache.r6g.xlarge, 4-node cluster |
| Content Storage | S3 Standard-IA | Intelligent-Tiering after 30 days |
| Coordination | MSK ZooKeeper | Built into MSK |
Self-Hosted Alternatives
| Managed Service | Self-Hosted | When to Self-Host |
|---|---|---|
| Amazon MSK | Apache Kafka | Cost at scale, specific Kafka versions |
| RDS PostgreSQL | PostgreSQL on EC2 | Cost, specific extensions |
| ElastiCache | Redis Cluster on EC2 | Redis modules, cost |
| S3 | MinIO / SeaweedFS | On-premise, data sovereignty |
Cost Estimation (AWS, 1B pages/month)
| Resource | Monthly Cost |
|---|---|
| MSK (6 × m5.2xlarge) | ~$3,600 |
| Crawler EC2 (15 × c6i.2xlarge) | ~$5,400 |
| Parser EC2 (10 × c6i.xlarge) | ~$1,800 |
| Renderer EC2 Spot (5 × c6i.4xlarge) | ~$1,500 |
| RDS PostgreSQL | ~$1,200 |
| ElastiCache Redis | ~$800 |
| S3 Storage (50TB compressed) | ~$1,150 |
| Data Transfer (egress ~10TB) | ~$900 |
| Total | ~$16,350/month |
Cost per million pages: ~$16.35
Conclusion
This design prioritizes the hybrid architecture with Kafka for URL distribution—combining the operational simplicity of a message queue with the scalability of distributed processing. Host-based partitioning enables local politeness enforcement without coordination overhead.
Key architectural decisions:
- Dual-queue frontier (Mercator style): Front queues for priority, back queues for politeness. This separates concerns and enables efficient URL selection.
- Bloom filter + persistent store: Bloom filter provides O(1) “definitely not seen” checks. False positives fall through to the database—but these are rare (1% at 10 bits/element).
- SimHash for content deduplication: 64-bit fingerprints with a 3-bit threshold catch near-duplicates without storing full content hashes for comparison.
- Kafka partitions by host hash: Each crawler node owns specific hosts, enabling local politeness without cross-node coordination.
Limitations and future improvements:
- JavaScript rendering bottleneck: Headless Chrome is resource-intensive. Could implement speculative rendering (pre-render pages likely to need JS) or use faster alternatives (Playwright, Puppeteer-cluster).
- Recrawl intelligence: Current design uses simple time-based recrawling. Could add ML-based change prediction to prioritize likely-changed pages.
- Deep web crawling: Forms, authentication, and session handling are out of scope. Would require browser automation and likely ethical considerations.
- Internationalization: URL normalization and content extraction assume primarily ASCII/UTF-8. Would need language detection and encoding handling for global crawls.
Appendix
Prerequisites
- Distributed systems fundamentals (consistent hashing, message queues)
- Database design (indexing, sharding, partitioning)
- HTTP protocol basics (status codes, headers, compression)
- Basic understanding of probabilistic data structures (Bloom filters)
Terminology
- URL Frontier: The data structure managing which URLs to crawl next, balancing priority and politeness
- Politeness: Rate-limiting crawls to avoid overwhelming target servers
- robots.txt: File at website root specifying crawl rules (RFC 9309)
- SimHash: Locality-sensitive hashing algorithm producing fingerprints where similar documents have similar hashes
- Bloom Filter: Probabilistic data structure for membership testing with no false negatives but possible false positives
- Spider Trap: URL patterns that generate infinite unique URLs (calendars, session IDs, repetitive paths)
- Crawl-delay: Non-standard robots.txt directive specifying seconds between requests
- Back Queue: Per-host queue in the frontier that enforces politeness timing
- Front Queue: Priority-ordered queue in the frontier that determines which URLs are important
Summary
- Web crawlers use a dual-queue frontier: front queues for priority ranking, back queues for per-host politeness enforcement
- URL deduplication combines Bloom filters (O(1) negative checks, 1% false positive) with persistent storage for definitive answers
- Content deduplication uses SimHash—64-bit fingerprints where similar documents differ by ≤3 bits
- Distributed crawling partitions URLs by host hash using consistent hashing; each node owns specific hosts for local politeness
- robots.txt compliance is mandatory: cache 24 hours max, respect Allow/Disallow with longest-match precedence (RFC 9309)
- Scale: Common Crawl processes 2.3B pages/month (~400 TiB) across ~47M hosts; Google indexes ~400B documents
References
- RFC 9309 - Robots Exclusion Protocol - Authoritative robots.txt specification
- Sitemaps Protocol - XML sitemap specification
- Mercator: A Scalable, Extensible Web Crawler - Foundational paper on URL frontier design
- UbiCrawler: A Scalable Fully Distributed Web Crawler - Fully distributed architecture
- Detecting Near-Duplicates for Web Crawling (SimHash) - Content fingerprinting algorithm
- Stanford IR Book - Web Crawling Chapter - Comprehensive crawling fundamentals
- Common Crawl Statistics - Real-world crawl scale and methodology
- Google Search Central - JavaScript SEO - Googlebot rendering approach
- ByteByteGo - URL Deduplication with Bloom Filters - Practical deduplication strategies