Design a Web Crawler

A web-scale crawler is a distributed system whose hot path is dominated by three concerns: deciding what to fetch next without overloading any single host, recognising URLs and content that have already been seen, and partitioning work so that politeness can be enforced locally. This article walks through a production-shaped design — Kafka-backed coordination, a Mercator-style dual-queue frontier, Bloom + SimHash dedup, and an RFC 9309-compliant robots layer — anchored on the original Mercator and UbiCrawler papers and on what Common Crawl actually publishes month over month.

Crawler architecture: seeds flow through a priority-and-politeness frontier into distributed fetchers; parsed content reaches storage while extracted links cycle back through dedup. — Component view of the crawler. Seeds enter the frontier; the scheduler hands URLs to fetchers; the parser extracts content and links; new links re-enter the frontier through dedup.

Mental model

A crawler is a closed-loop pipeline whose state lives in two places: the URL frontier (what to do next) and the dedup substrate (what has already been done). Everything else — fetcher, parser, renderer, storage — is replaceable plumbing.

The frontier is two queues stacked: a front layer that ranks URLs by importance, and a back layer that keeps each host on its own FIFO so politeness can be enforced without cross-node coordination. This shape was introduced in Mercator and remains the canonical reference design¹.
Dedup is split by cost: a Bloom filter cheaply rejects already-seen URLs with a tunable false-positive rate, and a SimHash index recognises near-duplicate content in 64 bits per page².
Distribution is a single decision: shard URLs by host (consistent hashing on hostname). Each node owns a host slice and therefore owns its own politeness clock, which is the property that makes the system scale linearly. This is the UbiCrawler insight³.

Three numbers ground the rest of the article. Common Crawl’s recent monthly archives publish roughly 2 billion pages from 44–46 million unique hosts at ~340–360 TiB uncompressed per crawl⁴⁵, so any design that cannot trivially carry those numbers is undersized; Pandu Nayak’s 2020 DOJ testimony put Google’s index at roughly 400 billion documents⁶. The point is not to build Google — it’s to know which order of magnitude you are sized for.

Requirements

Functional

Feature	Priority	Scope
Discover URLs from seed list	Core	Full
Fetch HTTP/HTTPS content	Core	Full
Comply with robots.txt (RFC 9309)	Core	Full
Parse HTML and extract links	Core	Full
URL deduplication	Core	Full
Near-duplicate content detection	Core	Full
Distributed crawling coordination	Core	Full
Recrawl scheduling for freshness	Core	Full
JavaScript rendering	High	Full
Sitemap processing	High	Full
Content storage and indexing	High	Overview
Image/media crawling	Medium	Brief
Deep web / form submission	Low	Out of scope

Non-functional

Requirement	Target	Notes
Throughput	~1,000 pages/sec per node	Order-of-magnitude; gated by network and parser CPU
Politeness	≥1 s between requests / host	A common floor; adapt to response latency, see below
URL dedup false-neg	0%	Bloom filter never says “not seen” for a known URL
Content dedup threshold	Hamming ≤ 3 on 64-bit SimHash	Manku/Jain/Das Sarma threshold at Google scale²
robots.txt cache TTL	≤ 24 h	RFC 9309 §2.4 explicit upper bound⁷
Fault tolerance	No URL loss on node failure	Lease + Kafka durable log handles this
Horizontal scalability	Linear with crawler nodes	Achieved by host-keyed partitioning

Scale estimation

The public web is fuzzy: counts of total websites range from ~1.34 B to ~1.98 B depending on how you count parked domains, with around 200 M actively maintained⁸⁹. For sizing, anchor on Common Crawl’s monthly footprint instead — a single observable, methodologically consistent denominator.

A reasonable working target — 1 billion pages in 30 days — implies:

Sustained throughput: 33.3 M pages/day ≈ 385 pages/sec; with 3× peak headroom → ~1,200 pages/sec.
12–15 crawler nodes at 100 pages/sec each (network-bound at ~50 MB/s ≈ 400 Mbps assuming 500 KB average response).
Raw HTML: 500 TB; gzip-compressed ~50 TB; URL metadata ~200 GB; SimHash fingerprints ~8 GB; URL Bloom filter (1% FP at ~10 bits/element) ~9.6 Gbits ≈ 1.2 GB¹⁰.
DNS: ~50 M unique hosts in scope; with 1 h TTL caching the active query rate stays in the low millions.

Note

The 500 KB average response size is the long-pole assumption: it can shift by an order of magnitude depending on whether you count rendered HTML, embedded JSON, or just the document. Re-derive throughput whenever it changes.

Design paths

Path A — centralised coordinator

Single node owns the entire frontier; workers pull batches.

✅ Simple; perfect global priority; consistent dedup against one truth.
❌ Coordinator is the bottleneck and SPOF; ceiling around ~10 worker nodes.

Use this for prototypes or sub-100M-page crawls. Scrapy’s redis-backed cluster is a real-world example.

Path B — fully distributed (UbiCrawler)

No coordinator. Hosts are assigned to nodes by identifier-seeded consistent hashing, each agent owns its own frontier, and reassignment on join/leave is a local re-derivation rather than a network protocol³.

✅ Linear horizontal scaling; no SPOF; politeness is purely local.
❌ Global priority is approximate; rebalancing on churn is non-trivial; harder to observe.

Use this when you need maximum throughput and can afford to give up perfect global ordering. UbiCrawler’s lineage continued into BUbiNG.

Path C — hybrid with Kafka (chosen)

URLs ride a Kafka topic partitioned by host hash; each crawler node belongs to one consumer group and owns the partitions assigned to it. The frontier inside a node is local; coordination across nodes is reduced to “Kafka rebalancing”.

✅ Durable replayable log; consumer-group rebalance handles node churn; each node still enforces politeness locally.
❌ Operational cost of running Kafka; some message-queue latency; partition count is a hard upper bound on parallelism.

Comparison

Factor	Path A (Centralised)	Path B (Distributed)	Path C (Hybrid)
Practical scale	~10 nodes	Unbounded	Bounded by partitions (e.g. 256)
Complexity	Low	High	Medium
Fault tolerance	Low	High	High
Global optimisation	Perfect	Approximate	Good
Operational overhead	Low	High	Medium
Best for	Prototype / small	Web-scale	Production

The rest of this article implements Path C because it preserves the locality argument that makes UbiCrawler scale (each crawler still owns specific hosts) while replacing the bespoke coordination layer with a managed Kafka cluster. For a clean-room web-scale system, Path B is also defensible — the choice is operational, not algorithmic.

High-level design

Component responsibilities

URL frontier

Mercator dual-queue: F front queues ranked by priority (PageRank, domain authority, change frequency, depth from seed) and B back queues keyed by host. The classical heuristic is B ≈ 3 × crawler threads so most threads find a non-blocked back queue at any moment¹. Selection has three steps:

Pick a non-empty front queue, biased by priority.
Within that queue, pop the first URL whose back queue’s next_allowed_time ≤ now.
After fetch, set the host’s next_allowed_time = now + politeness_delay. Refill the back queue from the front queues if it is empty.

DNS resolver

DNS must not be on the critical path. Use a process-local LRU cache with TTL-aware expiry, prefetch DNS while a URL is queued, parallelise across multiple recursive resolvers, and negative-cache NXDOMAIN with a shorter TTL. Mercator’s original architecture explicitly called out DNS as a bottleneck and shipped its own multithreaded resolver¹.

Fetcher

Plain HTTP client tuned for crawl: per-host connection pooling, HTTP/2 where the server supports it (one connection, multiplexed streams), Accept-Encoding: gzip, br, conservative timeouts (10 s connect / 30 s read / 60 s total), max 5 redirects, and a User-Agent that includes contact info. Honour Retry-After on 429/503.

robots.txt service

RFC 9309 is the authoritative source⁷. Implementation must:

Parse at least the first 500 KiB of the response (RFC 9309 §2.5).
Cache for at most 24 h, except when the file is unreachable, in which case the previously cached version may be reused (§2.4).
Apply longest-match precedence between Allow and Disallow rules — the rule with the most matching octets wins; on a tie, Allow wins (§2.2.2).
On 2xx, parse and apply. On 4xx (“unavailable”, §2.3.1.3), treat as no rules and access freely. On 5xx (“unreachable”, §2.3.1.4) the crawler MUST assume complete disallow immediately; only after a “reasonably long period (for example, 30 days)” of unreachability MAY the crawler relax to treating the file as unavailable (full access) or keep using a cached copy. Many production crawlers (Google included) lean on the cached-copy escape and never fully relax.
Crawl-delay is not in RFC 9309. Bing and Yandex honour it with different semantics; Google explicitly ignores it. Treat it as advisory¹¹.

Content parser

DOM-based HTML parsing (jsoup, lxml, parse5). Extract <a href>, <link href>, <script src>, <img src>, plus Content-Type, charset, Last-Modified, and <meta name="robots"> directives. Resolve relative URLs against the document base and canonicalise per RFC 3986 §6 before handing them back to the frontier¹².

JavaScript renderer

A separate, expensive tier — headless Chromium / Playwright — fed by a separate render queue, so the cheap HTML pipeline is not blocked by 30-second renders. Google’s own pipeline does this: crawl fetches HTML synchronously, the Web Rendering Service renders later, and the indexer reconciles the two¹³. Heuristics for routing into the render queue: <noscript>-only fallbacks, low text/markup ratio, presence of common SPA framework markers, or an explicit per-host opt-in.

JavaScript rendering pool: a triage step routes JS-heavy URLs into a separate Kafka topic feeding an auto-scaled headless Chromium pool with per-render timeouts and concurrency caps. — Render tier as its own fault domain. The cheap HTML pipeline triages first; only JS-heavy URLs cross into the bounded render queue, where a headless browser pool runs with hard timeouts, concurrency caps, and a per-host render budget.

Duplicate detection

Two layers, two costs:

URL-level: Bloom filter as the front line for “definitely not seen”, backed by a persistent URL store for the maybe-cases. URL must be normalised first (see URL normalisation).
Content-level: 64-bit SimHash fingerprint per page; documents whose fingerprints differ by ≤ 3 bits are treated as near-duplicates. This is the threshold Manku, Jain, and Das Sarma validated empirically on an 8 B-page Google corpus².

Crawling one URL — sequence

Sequence diagram of fetching one URL: frontier resolves DNS, checks robots.txt, fetches, parses, deduplicates, stores, and re-enqueues new links. — Per-URL flow. The frontier owns scheduling and politeness; dedup gates both content storage and link re-injection.

The same flow as a stages pipeline makes the two-layer dedup explicit — URL-level Bloom in front of the persistent URL store, content-level SimHash in front of object storage:

Pipeline view: fetch then optional render then DOM parse then two-layer dedup (URL Bloom plus persistent store, content SimHash) then archive and re-injection. — Fetch → render? → parse → dedup → store. URL Bloom rejects most discovered links cheaply; the persistent URL store confirms positives. Content SimHash gates body archival; near-duplicates are stored as references only.

Distributed URL distribution

Kafka partitions URLs by host hash; each crawler in the consumer group owns specific partitions and therefore specific hosts. — Producers hash by host into a single partitioned topic; the consumer group rebalance assigns whole partitions to one crawler each, so all URLs for a host land on one node.

The number of partitions is a hard ceiling on parallelism — pick it once, generously (256 is a common starting point), and never reduce it.

API design

These are internal service APIs, not public ones. Authentication is an internal service token; everything is JSON for diff-friendly ops.

Submit URLs to frontier

POST /api/v1/frontier/urls

1{2  "crawl_id": "crawl-2026-04-21",3  "urls": [4    { "url": "https://example.com/page1", "priority": 0.85, "source_url": "https://example.com/", "depth": 1, "discovered_at": "2026-04-21T10:30:00Z" },5    { "url": "https://example.com/page2", "priority": 0.72, "source_url": "https://example.com/", "depth": 1, "discovered_at": "2026-04-21T10:30:00Z" }6  ]7}

1{ "accepted": 2, "rejected": 0, "duplicates": 0 }

URLs are submitted in batches of 100–1,000 to amortise Kafka producer overhead. The frontier deduplicates and prioritises asynchronously.

Pull next URLs (long-poll)

GET /api/v1/frontier/next?worker_id=...&batch_size=100&timeout_ms=5000

1{2  "urls": [3    {4      "url": "https://example.com/page1",5      "priority": 0.85,6      "host": "example.com",7      "ip": "93.184.216.34",8      "robots_rules": { "allowed": true, "crawl_delay": null }9    }10  ],11  "lease_id": "lease-abc123",12  "lease_expires_at": "2026-04-21T10:35:00Z"13}

Each batch is leased; a worker that fails to report results before the lease expires loses the lease and the URLs are reassigned. This is the standard pattern for at-least-once delivery in worker-pool systems.

Report results

POST /api/v1/frontier/results

1{2  "lease_id": "lease-abc123",3  "results": [4    { "url": "https://example.com/page1", "status": "success", "http_status": 200, "content_type": "text/html", "content_length": 45230, "content_hash": "a1b2c3d4...", "simhash": "0x1234567890abcdef", "fetch_time_ms": 234, "links_found": 47, "is_duplicate": false },5    { "url": "https://other.com/page2", "status": "error", "http_status": 503, "error": "Service Unavailable", "retry_after": 300 }6  ]7}

Crawl statistics

GET /api/v1/stats returns an envelope with discovered/queued/fetched/failed counts, current/average/peak throughput, host counts, storage usage, and worker liveness. Nothing exotic; ship it to a time-series store (Prometheus, InfluxDB, TimescaleDB) and you have a dashboard.

Data modelling

URL metadata

PostgreSQL holds the slow-changing metadata; Kafka holds the in-flight queue.

1CREATE TABLE urls (2    id BIGSERIAL PRIMARY KEY,3    url_hash BYTEA NOT NULL,         -- SHA-256 of normalised URL (32 B)4    url TEXT NOT NULL,5    normalized_url TEXT NOT NULL,67    host VARCHAR(255) NOT NULL,8    host_hash INT NOT NULL,          -- partition key9    depth SMALLINT DEFAULT 0,10    source_url_id BIGINT REFERENCES urls(id),11    discovered_at TIMESTAMPTZ DEFAULT NOW(),1213    priority REAL DEFAULT 0.5,14    last_fetched_at TIMESTAMPTZ,15    next_fetch_at TIMESTAMPTZ,16    fetch_count INT DEFAULT 0,1718    last_status_code SMALLINT,19    last_content_hash BYTEA,20    last_simhash BIGINT,21    content_length INT,22    content_type VARCHAR(100),2324    state VARCHAR(20) DEFAULT 'pending',25        -- pending | queued | fetching | completed | failed | blocked2627    CONSTRAINT url_hash_unique UNIQUE (url_hash)28);2930CREATE INDEX idx_urls_host ON urls(host_hash, host);31CREATE INDEX idx_urls_state ON urls(state, next_fetch_at)32    WHERE state IN ('pending', 'queued');33CREATE INDEX idx_urls_refetch ON urls(next_fetch_at)34    WHERE state = 'completed' AND next_fetch_at IS NOT NULL;

The unique constraint is on url_hash, not url: a SHA-256 fits any URL into a fixed 32 bytes, indexes stay tight, and the same hash is reused by the Bloom filter.

robots.txt cache

1CREATE TABLE robots_cache (2    host VARCHAR(255) PRIMARY KEY,3    fetched_at TIMESTAMPTZ NOT NULL,4    expires_at TIMESTAMPTZ NOT NULL, -- ≤ fetched_at + 24h per RFC 93095    status_code SMALLINT NOT NULL,6    content TEXT,                    -- raw, capped at 500 KiB7    parsed_rules JSONB,              -- pre-parsed for fast lookup8    crawl_delay INT,                 -- non-standard, advisory9    sitemaps TEXT[],10    error_message TEXT11);1213CREATE INDEX idx_robots_expires ON robots_cache(expires_at);

Content storage

Page bodies belong in object storage (S3 / MinIO / SeaweedFS); Postgres holds extracted metadata and fingerprints. Partition the metadata table by fetch month so old data can be detached cheaply.

1CREATE TABLE page_content (2    id BIGSERIAL PRIMARY KEY,3    url_id BIGINT NOT NULL REFERENCES urls(id),4    fetch_id BIGINT NOT NULL,56    raw_content_key TEXT NOT NULL,   -- S3 object key, gzip-compressed7    content_length INT NOT NULL,8    content_type VARCHAR(100),9    charset VARCHAR(50),1011    title TEXT,12    meta_description TEXT,13    canonical_url TEXT,14    language VARCHAR(10),1516    content_hash BYTEA NOT NULL,17    simhash BIGINT NOT NULL,1819    fetched_at TIMESTAMPTZ NOT NULL,20    http_date TIMESTAMPTZ,21    last_modified TIMESTAMPTZ22);2324CREATE INDEX idx_content_simhash ON page_content(simhash);25CREATE INDEX idx_content_hash ON page_content(content_hash);

Store selection

Data	Store	Why
In-flight URL queue	Kafka	Durable, partitioned, replayable, rebalances on consumer churn
URL metadata	PostgreSQL	Indexed range queries, ACID, partition pruning
URL Bloom filter	Redis (`BF.*`) or in-process	O(1) lookup; small enough to fit in memory at billions
robots.txt cache	Redis	TTL-native, hot path lookup
Page bodies	Object storage (S3)	Cheap, durable, bandwidth-priced for archival
Content fingerprints	PostgreSQL + Redis	Persistent + warm-set lookup
Crawl metrics	TimescaleDB / InfluxDB / Prometheus	Time-series
Web graph	Postgres / Neo4j	Link analysis (PageRank, hub/authority)

Partitioning

URL DB shard key: host_hash. Co-locates a host’s URLs and matches the Kafka partitioning so each crawler’s hot keys live on one shard.
Kafka topic: 256 partitions keyed by host_hash — enough to scale to 256 crawlers without re-partitioning.
Content table: monthly partitions on fetched_at so cold data can be detached and archived without a vacuum storm.

Low-level design

URL frontier — dual-queue mechanics

Mercator-style dual-queue frontier. The front layer ranks by priority; the back layer enforces one host per worker via a min-heap keyed on the host's next-allowed timestamp.

The selection algorithm in code (TypeScript shape, JVM/Python equivalents are mechanical translations):

1interface FrontierConfig {2  numFrontQueues: number      // F, typically 10–203  numBackQueues: number       // B ≈ 3 × crawler threads4  defaultPolitenessDelay: number // ms between requests to same host5  maxBackQueueSize: number6}78interface URLEntry {9  url: string10  normalizedUrl: string11  host: string12  priority: number            // 0..113  depth: number14  discoveredAt: Date15}1617class URLFrontier {18  private frontQueues: PriorityQueue<URLEntry>[]19  private backQueues: Map<string, HostQueue>20  private hostToBackQueue: Map<string, number>21  private backQueueHeap: MinHeap<BackQueueEntry>2223  constructor(private config: FrontierConfig, private robotsCache: RobotsCache) {24    this.frontQueues = Array(config.numFrontQueues)25      .fill(null)26      .map(() => new PriorityQueue<URLEntry>((a, b) => b.priority - a.priority))27    this.backQueues = new Map()28    this.hostToBackQueue = new Map()29    this.backQueueHeap = new MinHeap((a, b) => a.nextAllowedTime - b.nextAllowedTime)30  }3132  async addURL(entry: URLEntry): Promise<void> {33    const idx = Math.floor(entry.priority * (this.config.numFrontQueues - 1))34    this.frontQueues[idx].enqueue(entry)35    this.ensureBackQueue(entry.host)36  }3738  async getNext(): Promise<URLEntry | null> {39    const now = Date.now()40    while (this.backQueueHeap.size() > 0) {41      const top = this.backQueueHeap.peek()42      if (top.nextAllowedTime > now) return null  // nothing ready4344      const back = this.backQueues.get(top.host)45      if (back && back.size() > 0) {46        const url = back.dequeue()47        this.backQueueHeap.extractMin()48        top.nextAllowedTime = now + this.getPolitenessDelay(top.host)49        this.backQueueHeap.insert(top)50        this.refillBackQueue(top.host)51        return url52      }53      this.backQueueHeap.extractMin()54    }55    return null56  }5758  private refillBackQueue(host: string): void {59    const back = this.backQueues.get(host)60    if (!back || back.size() >= 10) return61    for (const front of this.frontQueues) {62      const url = front.findAndRemove((e) => e.host === host)63      if (url) {64        back.enqueue(url)65        if (back.size() >= 10) break66      }67    }68  }6970  private getPolitenessDelay(host: string): number {71    return this.robotsCache.getCrawlDelay(host) ?? this.config.defaultPolitenessDelay72  }73}

Tip

The classical Mercator policy is adaptive: wait 10 × t_last_fetch before contacting the same host again, where t_last_fetch is the wall-clock duration of the last fetch¹¹. This bakes server health into the politeness delay automatically — slow hosts naturally back off — and is far better than a fixed 1-second floor.

Priority calculation

1interface PriorityFactors {2  pageRank: number          // 0..1, link-graph score3  domainAuthority: number   // 0..1, host-level quality4  changeFrequency: number   // 0..1, how often content changes5  depth: number             // distance from seed6  freshness: number         // 0..1, based on staleness of last fetch7}89function calculatePriority(f: PriorityFactors): number {10  const w = { pageRank: 0.3, domainAuthority: 0.25, changeFrequency: 0.2, depth: 0.15, freshness: 0.1 }11  const depthScore = Math.max(0, 1 - f.depth * 0.1)  // -10% per hop12  const p =13    w.pageRank * f.pageRank +14    w.domainAuthority * f.domainAuthority +15    w.changeFrequency * f.changeFrequency +16    w.depth * depthScore +17    w.freshness * f.freshness18  return Math.max(0, Math.min(1, p))19}

The exact weights are application-specific; what matters is that priority is monotone and bounded to [0, 1] so the front-queue index calculation is deterministic.

URL normalisation

Different URL strings can name the same resource. RFC 3986 §6 defines a syntax-based normalisation that is the conservative baseline¹²:

Lowercase the scheme and host.
Remove the default port (:80 for http, :443 for https).
Apply remove_dot_segments to the path (collapse ./ and ../).
Decode percent-encoded triplets that map to unreserved characters; normalise hex digits to uppercase.

Crawler-specific additions go beyond §6 and are lossy — apply them only when they are correct for your corpus:

Strip the #fragment (no impact on the resource the server returns).
Sort query parameters alphabetically.
Remove a trailing / on non-root paths.
Drop index.html / index.php / default.aspx from path tails.
Lowercasing or stripping www. is not safe — many sites serve different content on the bare apex.

1import { URL } from "url"23interface NormalizationOptions {4  removeFragment: boolean5  removeDefaultPorts: boolean6  removeTrailingSlash: boolean7  removeIndexFiles: boolean8  sortQueryParams: boolean9  lowercaseHost: boolean10  removeWWW: boolean         // off by default; lossy11  decodeUnreserved: boolean12}1314const DEFAULT_OPTIONS: NormalizationOptions = {15  removeFragment: true,16  removeDefaultPorts: true,17  removeTrailingSlash: true,18  removeIndexFiles: true,19  sortQueryParams: true,20  lowercaseHost: true,21  removeWWW: false,22  decodeUnreserved: true,23}2425function normalizeURL(urlString: string, options = DEFAULT_OPTIONS): string {26  const url = new URL(urlString)27  url.protocol = url.protocol.toLowerCase()28  if (options.lowercaseHost) url.hostname = url.hostname.toLowerCase()29  if (options.removeDefaultPorts) {30    if ((url.protocol === "http:" && url.port === "80") || (url.protocol === "https:" && url.port === "443")) {31      url.port = ""32    }33  }34  if (options.removeFragment) url.hash = ""35  if (options.sortQueryParams && url.search) {36    const params = new URLSearchParams(url.search)37    url.search = new URLSearchParams([...params.entries()].sort()).toString()38  }39  if (options.removeTrailingSlash && url.pathname !== "/") {40    url.pathname = url.pathname.replace(/\/+$/, "")41  }42  if (options.removeIndexFiles) {43    url.pathname = url.pathname.replace(/\/(index|default)\.(html?|php|asp)$/i, "/")44  }45  if (options.decodeUnreserved) {46    url.pathname = decodeURIComponent(url.pathname)47      .split("/")48      .map((segment) => encodeURIComponent(segment))49      .join("/")50  }51  return url.toString()52}

URL deduplication — Bloom filter

A Bloom filter answers “have I seen this URL?” in O(1) with no false negatives — the property a crawler needs to avoid double-fetching. The optimal sizing¹⁰ for n items at false-positive rate p:

For 1 B URLs at p = 0.01: m ≈ 9.585 × 10⁹ bits ≈ 1.2 GB, k ≈ 7. Use Kirsch–Mitzenmacher double-hashing — g_i(x) = h_1(x) + i·h_2(x) mod m — to derive k positions from two hash functions, not k independent ones.

1import { createHash } from "crypto"23class BloomFilter {4  private bitArray: Uint8Array5  private numHashFunctions: number6  private size: number78  constructor(expectedElements: number, falsePositiveRate = 0.01) {9    this.size = Math.ceil((-expectedElements * Math.log(falsePositiveRate)) / Math.log(2) ** 2)10    this.numHashFunctions = Math.ceil((this.size / expectedElements) * Math.log(2))11    this.bitArray = new Uint8Array(Math.ceil(this.size / 8))12  }1314  private getHashes(item: string): number[] {15    const h1 = this.hash(item, 0)16    const h2 = this.hash(item, 1)17    return Array(this.numHashFunctions)18      .fill(0)19      .map((_, i) => Math.abs((h1 + i * h2) % this.size))20  }2122  private hash(item: string, seed: number): number {23    return createHash("md5").update(`${seed}:${item}`).digest().readUInt32LE(0)24  }2526  add(item: string): void {27    for (const h of this.getHashes(item)) {28      this.bitArray[Math.floor(h / 8)] |= 1 << (h % 8)29    }30  }3132  mightContain(item: string): boolean {33    for (const h of this.getHashes(item)) {34      if ((this.bitArray[Math.floor(h / 8)] & (1 << (h % 8))) === 0) return false35    }36    return true   // probably; check persistent store on positives37  }38}

Important

A Bloom filter has no false negatives but does have false positives. Treat positives as “maybe seen — confirm against the URL store”. If you skip the confirmation step you will lose URLs and have no way of knowing.

Content deduplication — SimHash

SimHash, introduced by Charikar (2002) and applied to web crawling at Google by Manku, Jain, and Das Sarma (2007), reduces a document to a fixed-width fingerprint where small bit-Hamming distances correspond to high textual similarity². The 2007 paper validates that 64-bit fingerprints with a 3-bit Hamming-distance threshold are a good operating point on a corpus of 8 billion pages.

1import { createHash } from "crypto"23function simhash(text: string, hashBits: number = 64): bigint {4  const tokens = tokenize(text)5  const shingles = generateShingles(tokens, 3) // 3-word shingles67  const v = new Array(hashBits).fill(0)8  for (const shingle of shingles) {9    const hash = hashShingle(shingle, hashBits)10    for (let i = 0; i < hashBits; i++) {11      v[i] += ((hash >> BigInt(i)) & 1n) ? 1 : -112    }13  }1415  let fingerprint = 0n16  for (let i = 0; i < hashBits; i++) {17    if (v[i] > 0) fingerprint |= 1n << BigInt(i)18  }19  return fingerprint20}2122function tokenize(text: string): string[] {23  return text.toLowerCase().replace(/[^\w\s]/g, " ").split(/\s+/).filter((w) => w.length > 2)24}2526function generateShingles(tokens: string[], n: number): string[] {27  const out: string[] = []28  for (let i = 0; i <= tokens.length - n; i++) out.push(tokens.slice(i, i + n).join(" "))29  return out30}3132function hashShingle(shingle: string, bits: number): bigint {33  const hash = createHash("sha256").update(shingle).digest("hex")34  return BigInt("0x" + hash.slice(0, bits / 4))35}3637function hammingDistance(a: bigint, b: bigint): number {38  let xor = a ^ b39  let d = 040  while (xor > 0n) { d += Number(xor & 1n); xor >>= 1n }41  return d42}4344const DUPLICATE_THRESHOLD = 34546function isNearDuplicate(newFp: bigint, existing: bigint[]): boolean {47  for (const e of existing) if (hammingDistance(newFp, e) <= DUPLICATE_THRESHOLD) return true48  return false49}

A naive per-fetch scan over a billion fingerprints is hopeless. The 2007 paper’s contribution is the permuted-table indexing scheme — store multiple permutations of each fingerprint sorted, and a candidate set lookup becomes O(log n) bounded by the number of tables². Locality-sensitive hashing variants (e.g. MinHash with banding) achieve similar properties for set-similarity rather than bit-Hamming similarity.

robots.txt parsing

1interface RobotsRule { path: string; allow: boolean }2interface RobotsRules { rules: RobotsRule[]; crawlDelay?: number; sitemaps: string[] }34function parseRobotsTxt(content: string, userAgent: string): RobotsRules {5  const lines = content.split("\n").map((l) => l.trim())6  const rules: RobotsRule[] = []7  let crawlDelay: number | undefined8  const sitemaps: string[] = []910  let currentUserAgents: string[] = []11  let inRelevantBlock = false1213  for (const line of lines) {14    if (line.startsWith("#") || line === "") continue15    const [directive, ...valueParts] = line.split(":")16    const value = valueParts.join(":").trim()17    switch (directive.toLowerCase()) {18      case "user-agent":19        if (currentUserAgents.length > 0 && inRelevantBlock) break20        currentUserAgents.push(value.toLowerCase())21        inRelevantBlock = matchesUserAgent(value, userAgent)22        break23      case "allow":      if (inRelevantBlock) rules.push({ path: value, allow: true });  break24      case "disallow":   if (inRelevantBlock) rules.push({ path: value, allow: false }); break25      case "crawl-delay": if (inRelevantBlock) crawlDelay = parseInt(value, 10);          break26      case "sitemap":    sitemaps.push(value); break27    }28  }2930  // RFC 9309 §2.2.2: longest matching path wins31  rules.sort((a, b) => b.path.length - a.path.length)32  return { rules, crawlDelay, sitemaps }33}3435function matchesUserAgent(pattern: string, userAgent: string): boolean {36  const p = pattern.toLowerCase(), u = userAgent.toLowerCase()37  return p === "*" || u.includes(p)38}3940function isAllowed(rules: RobotsRules, path: string): boolean {41  for (const r of rules.rules) if (pathMatches(r.path, path)) return r.allow42  return true   // §2.2.2: implicit allow if no rule matches43}4445function pathMatches(pattern: string, path: string): boolean {46  let regex = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*").replace(/\$$/, "$")47  if (!pattern.endsWith("$")) regex = `^${regex}`48  return new RegExp(regex).test(path)49}

Warning

Crawl-delay is not part of RFC 9309. Bing treats it as “one request per N-second window”, Yandex as a fixed delay between requests, and Google ignores it entirely¹¹. Honour it for compatibility but do not assume any particular semantics.

Spider-trap defences

Crawler-hostile URL spaces — calendars extending into the year 9999, session IDs minted on every request, repeating path segments — can generate unbounded unique URLs. Defences are heuristic and conservative:

1interface TrapConfig {2  maxURLLength: number3  maxPathDepth: number4  patternThreshold: number   // path segment repetitions5  futureDateYears: number    // years beyond now to flag as calendar trap6}78const DEFAULT: TrapConfig = { maxURLLength: 2000, maxPathDepth: 15, patternThreshold: 3, futureDateYears: 2 }910class SpiderTrapDetector {11  constructor(private config = DEFAULT) {}1213  isTrap(url: string): { isTrap: boolean; reason?: string } {14    try {15      const u = new URL(url)16      if (url.length > this.config.maxURLLength) return { isTrap: true, reason: "url too long" }17      const segs = u.pathname.split("/").filter(Boolean)18      if (segs.length > this.config.maxPathDepth) return { isTrap: true, reason: "path too deep" }19      const repeating = this.detectRepeatingPattern(segs)20      if (repeating) return { isTrap: true, reason: `repeating segment: ${repeating}` }21      if (this.isCalendarTrap(u.pathname)) return { isTrap: true, reason: "far-future date" }22      if (this.hasSessionID(u.search)) return { isTrap: true, reason: "session id in url" }23      return { isTrap: false }24    } catch {25      return { isTrap: true, reason: "invalid url" }26    }27  }2829  private detectRepeatingPattern(segs: string[]): string | null {30    for (let len = 1; len <= segs.length / 3; len++) {31      const pat = segs.slice(0, len).join("/")32      let reps = 033      for (let i = 0; i <= segs.length - len; i += len) {34        if (segs.slice(i, i + len).join("/") === pat) reps++35      }36      if (reps >= this.config.patternThreshold) return pat37    }38    return null39  }4041  private isCalendarTrap(pathname: string): boolean {42    const m = pathname.match(/\/(\d{4})\/(\d{1,2})\/(\d{1,2})/)43    return !!m && parseInt(m[1], 10) > new Date().getFullYear() + this.config.futureDateYears44  }4546  private hasSessionID(search: string): boolean {47    return [/[?&](session_?id|sid|phpsessid|jsessionid|aspsessionid)=/i, /[?&][a-z]+=[a-f0-9]{32,}/i]48      .some((re) => re.test(search))49  }50}

Stronger production defences add a per-host URL-discovery rate limit (no more than N new URLs per host per window) and a diminishing-returns budget (cap pages per host at, say, the 99th percentile of comparable hosts).

Sitemap discovery

Per the Sitemaps protocol, each sitemap file is capped at 50,000 URLs and 50 MB uncompressed; larger sites use a sitemap index that itself can list up to 50,000 children¹⁴. The crawler should discover sitemaps via the Sitemap: directive in robots.txt (RFC 9309 explicitly allows it) and prioritise URLs found in sitemaps because they are usually the site’s own canonical inventory.

Operations

Monitoring surface

Operators care about a small set of signals; everything else is forensic. Ship these to a time-series backend with per-host slicing:

urls_per_second{state=fetched|failed|duplicate|blocked} — the headline.
frontier_depth{queue=front|back} — leading indicator of scheduler stall.
host_concurrency{host} — proves politeness is working.
bloom_false_positive_observed_rate — early warning that the filter is full.
render_queue_depth and render_p95_latency_ms — JS rendering is the most common bottleneck.
robots_cache_age_p95_seconds — must stay below 24 h.

A small operator UI with a per-host search and a per-URL inspect view (“show me what I last fetched, when, and what robots said”) pays for itself the first time you triage a publisher complaint.

Failure modes

Failure	Detection	Recovery
Crawler node crash	Kafka heartbeat / lease expiry	Consumer-group rebalance reassigns partitions; leases time out and URLs are re-pulled
Kafka partition leader loss	MSK metric / consumer error	Replica election; producer retries with idempotency
Bloom filter saturated	Observed FP rate > target	Roll a new filter (double-buffered), drain old, swap
Hot-host saturation	`host_concurrency` rail-pinned	Apply backoff; widen back-queue refill rate
robots.txt 5xx (unreachable)	`robots_status_code` per host	Treat as full Disallow immediately (§2.3.1.4); after ~30 days MAY relax to “unavailable” or keep cached copy
Render queue runaway	`render_queue_depth` slope	Drop to HTML-only for low-priority URLs; cap render rate

Recrawl scheduling

A simple time-based schedule is the default — next_fetch_at = last_fetched_at + interval(host_class) — but the better signal is observed change frequency. If content_hash changed between the last two fetches, halve the interval; if it did not, double it (capped). This is essentially exponential adaptation of the recrawl cadence to actual page churn, and avoids spending bandwidth on static archives.

Infrastructure

Cloud-agnostic targets

Component	Requirement	Options
Message queue	Durable, partitioned, high throughput	Kafka, Apache Pulsar, Redpanda
URL DB	High write throughput, range queries	PostgreSQL, CockroachDB, Cassandra
Bloom store	Low-latency membership test	Redis Bloom, in-process per node
Content store	Cheap, durable, object semantics	S3 / MinIO / SeaweedFS
Coordination	Leader election, config	ZooKeeper, etcd, Consul
DNS cache	Low-latency recursive resolver	Unbound, PowerDNS, Knot Resolver
Metrics	Time-series at high cardinality	Prometheus, InfluxDB, TimescaleDB

AWS reference

Component	AWS Service	Configuration
Message queue	Amazon MSK	kafka.m5.2xlarge × 6, 256 partitions
Crawler nodes	EC2 c6i.2xlarge	8 vCPU / 16 GB, ASG 5–50
Parser nodes	EC2 c6i.xlarge	4 vCPU / 8 GB, ASG 2–20
JS renderer	EC2 c6i.4xlarge	16 vCPU / 32 GB, spot
URL DB	RDS PostgreSQL	db.r6g.2xlarge, Multi-AZ, 2 TB gp3
Bloom + cache	ElastiCache Redis	cache.r6g.xlarge × 4 nodes
Content store	S3 Standard-IA	Intelligent-Tiering after 30 days

Cost order-of-magnitude (1 B pages/month, AWS list price)

Resource	Monthly Cost
MSK (6 × m5.2xlarge)	~$3,600
Crawler EC2 (15 × c6i.2xlarge)	~$5,400
Parser EC2 (10 × c6i.xlarge)	~$1,800
Renderer EC2 spot (5 × c6i.4xlarge)	~$1,500
RDS PostgreSQL	~$1,200
ElastiCache Redis	~$800
S3 storage (50 TB compressed)	~$1,150
Data egress (~10 TB)	~$900
Total	~$16,350

Roughly $16 per million pages, dominated by compute (crawler ASG) and Kafka. These are list-price, single-region, list-pricing numbers — a serious operator with reserved instances and a self-hosted Kafka tier will cut this in half.

Practical takeaways

Pick host-keyed partitioning early. Everything good (local politeness, deterministic dedup, predictable failure recovery) follows from it. Everything bad (cross-node coordination, global locks) follows from not picking it.
Measure politeness in t_last_fetch units, not seconds. The Mercator 10 × t policy adapts to server health for free; a fixed 1 s/host floor either crawls too fast for tiny servers or too slow for cloud-scale CDNs.
Deduplicate twice, at different costs. A Bloom filter on URLs (fits in memory, no false negatives) and SimHash on content (catches boilerplate variants) cover the two failure modes — re-fetching the same page, and storing the same page under different URLs.
Render queue is a separate fault domain. Headless Chromium will eat the whole budget if you let it. Run it as its own ASG, on spot, behind a separate queue with a hard concurrency cap.
Treat RFC 9309 as a hard contract. 500 KiB minimum parse, ≤ 24 h cache, longest-match precedence, no special semantics for Crawl-delay. Anything else is folklore.
Anchor sizing to Common Crawl, not “the web”. “How many websites exist” has answers spread across an order of magnitude; “how many pages does Common Crawl publish per month” is an unambiguous operational denominator.

Appendix

Prerequisites

Distributed-systems fundamentals (consistent hashing, message queues, leases), basic database design (indexing, sharding, partitioning), HTTP wire-level details (status codes, redirects, conditional requests), and probabilistic data structures (Bloom filters, MinHash / SimHash).

Terminology

URL frontier — the data structure the crawler uses to choose the next URL, balancing priority against politeness.
Politeness — rate-limiting per host so the crawler does not act like a denial-of-service attack.
robots.txt — the file at /robots.txt that declares crawl rules; standardised in RFC 9309.
SimHash — locality-sensitive hash whose output bit-distance correlates with input similarity; used for near-duplicate detection.
Bloom filter — probabilistic set with no false negatives but tunable false positives; used for “have I seen this URL?”.
Spider trap — a URL pattern that yields infinite unique URLs (calendars, session IDs, repeating paths).
Crawl-delay — non-standard robots.txt directive specifying a delay between fetches; Google ignores it.
Back queue / front queue — Mercator’s two-tier frontier: front = priority bucket, back = per-host FIFO.

Summary

The frontier is two queues — front for priority, back for politeness — with a min-heap on next_allowed_time driving worker selection. Mercator’s heuristic of B ≈ 3 × threads is still a sensible default¹.
URL dedup is a Bloom filter with a persistent backing store on positives. 1 B URLs at 1% FP fits in ~1.2 GB at 7 hash positions¹⁰.
Content dedup is SimHash with a 3-bit Hamming threshold on 64-bit fingerprints, the operating point validated by Manku, Jain, and Das Sarma on an 8 B-page corpus².
Sharding by host hash is the choice that makes everything else work: politeness is local, dedup is deterministic, recovery is a Kafka rebalance. UbiCrawler’s identifier-seeded consistent hashing remains the cleanest formulation³.
Common Crawl’s recent monthly archives are ~2 B pages from ~45 M hosts at ~350 TiB uncompressed⁴⁵; Google’s index sits around 400 B documents per Pandu Nayak’s 2020 testimony⁶.

References

Heydon & Najork, Mercator: A Scalable, Extensible Web Crawler, World Wide Web 2(4), 1999. The dual-queue frontier and the B ≈ 3 × threads heuristic are summarised in the Stanford IR Book §20.2.3. ↩ ↩² ↩³ ↩⁴
Manku, Jain, Das Sarma, Detecting Near-Duplicates for Web Crawling, WWW 2007. 64-bit fingerprints with k = 3 Hamming threshold validated on an 8 B-page Google corpus. ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Boldi, Codenotti, Santini, Vigna, UbiCrawler: A Scalable Fully Distributed Web Crawler, Software: Practice and Experience 34(8), 2004. ↩ ↩² ↩³
Common Crawl Foundation, March 2026 Crawl Archive Now Available — 1.97 B pages, 44 M hosts, 344.64 TiB uncompressed. ↩ ↩²
Common Crawl Foundation, February 2026 Crawl Archive Now Available — 2.1 B pages, 45.5 M hosts, 363 TiB uncompressed. ↩ ↩²
Cyrus Shepard, Google’s Index Size Revealed: 400 Billion Docs — analysis of Pandu Nayak’s 2020 testimony in US v. Google. ↩ ↩²
RFC 9309 — Robots Exclusion Protocol, §§2.2.2, 2.3.1.4, 2.4, 2.5. ↩ ↩²
Hostinger Tutorials, How many websites are there in 2026 — ~1.34 B total, ~201 M actively maintained. ↩
DigitalApplied, Website Statistics 2026: 180+ Facts, Trends, and Data — ~1.98 B total. ↩
Bloom filter — Wikipedia. Optimal sizing formulas and the Kirsch–Mitzenmacher double-hashing optimisation. ↩ ↩² ↩³
robots.txt — Wikipedia: Crawl-delay directive; Google explicitly does not honour Crawl-delay. Mercator’s adaptive 10 × t politeness policy is summarised in Najork, “High-Performance Web Crawling” (2002). ↩ ↩² ↩³
RFC 3986 — Uniform Resource Identifier (URI): Generic Syntax, §6 Normalization and Comparison. ↩ ↩²
Google Search Central, Understand JavaScript SEO Basics — separate render queue using the Web Rendering Service (headless Chromium). ↩
Sitemaps Protocol 0.90 — ≤ 50,000 URLs and ≤ 50 MB uncompressed per file; sitemap index files cap at 50,000 child sitemaps. ↩