DNS Troubleshooting Playbook
DNS incidents look chaotic until you isolate the layer they live in. This playbook gives you a symptom-first triage tree, a small but precise tool kit (dig, delv, kdig, drill), and concrete recovery patterns for the failure modes that drive most production outages — DNSSEC validation breakage, lame delegation, stale caches, and CDN/GeoDNS surprises.
It is the operations companion to DNS Resolution Path, DNS Records, TTL, and Cache Behavior, and DNS Security: DNSSEC, DoH, and DoT. Read those for the protocol mechanics; read this when something is on fire.
Mental Model
DNS resolution traverses three failure domains in series. A bad answer at the top hides everything below it, so always isolate top-down:
- Authoritative infrastructure: the zone’s own nameservers and the data they publish. Symptoms surface as wrong answers, missing records, or no AA flag.
- The delegation chain: root → TLD → zone, plus DNSSEC trust from `.` down. Symptoms surface as `+trace` stalls, lame delegation, or `SERVFAIL` with EDE 6/9/10 (RFC 8914 §4).
- Recursive resolvers and client caches: public/ISP resolvers, OS stubs, browser caches. Symptoms surface as one resolver disagreeing with another, “propagation” lag after a TTL expires, or stale records served per RFC 8767.
Four reflexes drive the rest of the playbook:
- `dig +cd` first when you see `SERVFAIL`. If `+cd` succeeds, the failure is DNSSEC validation, not the zone. CD is the RFC 4035 Checking Disabled bit.
- Distinguish NXDOMAIN from NODATA. NXDOMAIN means the name does not exist; NODATA means the name exists but has no records of the requested type: `NOERROR` with an empty answer section (RFC 2308 §2).
- “Propagation” is cache expiry, not active distribution. Old answers persist for the record’s remaining TTL; negative answers persist for `min(SOA.MINIMUM, SOA TTL)` (RFC 2308 §5).
- Lame delegation is silent. A nameserver that does not consider itself authoritative for the zone returns answers without the `aa` flag, or `REFUSED` (RFC 1034 §4.2.2). Always check the AA flag, never assume.
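The NXDOMAIN/NODATA distinction can be checked mechanically from dig's header. The following is a minimal sketch, assuming the default dig output layout shown later in this playbook; the function name `classify_answer` is hypothetical. Note that `NOERROR` with an empty answer section can also be a referral when you query a non-authoritative server with `+norecurse`, so treat the NODATA label as a hint.

```shell
# classify_answer: read full dig output on stdin and print one of
# NXDOMAIN, NODATA, ANSWER, or the raw status for anything else.
classify_answer() {
  awk '
    /->>HEADER<<-/ {
      # Header line: ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54321"
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { status = $(i + 1); sub(/,$/, "", status) }
    }
    /^;; flags:/ {
      # Counts line: ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, ..."
      for (i = 1; i <= NF; i++)
        if ($i == "ANSWER:") { answers = $(i + 1); sub(/,$/, "", answers) }
    }
    END {
      if (status == "NXDOMAIN") print "NXDOMAIN"
      else if (status == "NOERROR" && answers + 0 == 0) print "NODATA"
      else if (status == "NOERROR") print "ANSWER"
      else print status
    }'
}
```

Typical use would be `dig example.com AAAA | classify_answer`.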
Diagnostic Tool Kit
Different tools expose different layers. Keep all four in muscle memory; they are not interchangeable.
| Tool | Source | Best for |
|---|---|---|
| `dig` | BIND 9 | General-purpose query inspection, header flags, RCODEs, EDE. |
| `delv` | BIND 9 | Local DNSSEC validation with chain traces (`+rtrace`, `+vtrace`). |
| `kdig` | Knot DNS | DoT, DoH, and DoQ end-to-end testing. |
| `drill` | NLnet Labs ldns | DNSSEC chain visualization (`-T -D -S`). |
dig: the Primary Tool
Understanding dig’s flags and header is the single highest-leverage skill in DNS triage.
Essential flags:
| Flag | Purpose | When to Use |
|---|---|---|
| `+trace` | Iterate from root downward | Identify which NS in the chain fails |
| `+norecurse` | Skip recursion, query directly | Test an authoritative server’s own answer |
| `+cd` | Checking Disabled (bypass DNSSEC) | Confirm a SERVFAIL is DNSSEC-related |
| `+dnssec` | Set the DO bit, request RRSIGs | Verify signatures and DNSKEY records exist |
| `+nsid` | Request the Name Server ID | Identify which anycast instance answered |
| `+subnet` | Send EDNS Client Subnet | Reproduce GeoDNS answers from a target subnet |
| `+short` | Concise output | Quick answer verification, scripting |
| `+tcp` | Force TCP transport | Test when UDP responses are truncated or dropped |
| `+bufsize=N` | Set advertised EDNS UDP bufsize | Force truncation (`+bufsize=512`) or test Flag Day default (1232) |
| `+cookie` | Send DNS Cookies (RFC 7873) | Off-path-spoofing test; on by default in modern dig |
| `+noedns` | Strip the EDNS OPT RR | Detect EDNS-stripping middleboxes |
| `-4` / `-6` | Force IPv4/IPv6 | Isolate address-family-specific issues |
These are documented in the BIND 9 ARM under dig and delv (BIND 9 manpages).
Interpreting dig output:

    $ dig example.com
    ; <<>> DiG 9.18.18 <<>> example.com
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54321
    ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    ;; ANSWER SECTION:
    example.com. 86400 IN A 93.184.216.34
    ;; Query time: 23 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)

Header flags decoded (RFC 1035 §4.1.1 and RFC 4035 §3.2):
| Flag | Meaning |
|---|---|
| `qr` | Query Response: message is a response, not a query |
| `rd` | Recursion Desired: the client asked for full resolution |
| `ra` | Recursion Available: the responder offers recursion |
| `aa` | Authoritative Answer: answer comes from the zone’s nameserver |
| `ad` | Authenticated Data: DNSSEC validation succeeded at this resolver |
| `cd` | Checking Disabled: client requested validation be skipped |
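For scripting, the flag list can be pulled out of the header line rather than read by eye. This is a small sketch that assumes dig's `;; flags: ...;` header format as shown above; the helper names `dig_flags` and `has_flag` are illustrative, not part of dig.

```shell
# dig_flags: print the header flags from dig output on stdin, one per line.
dig_flags() {
  sed -n 's/^;; flags: \([^;]*\);.*/\1/p' | tr ' ' '\n' | sed '/^$/d'
}

# has_flag FLAG: succeed if FLAG (for example aa or ad) is set in the
# dig output supplied on stdin.
has_flag() {
  dig_flags | grep -x "$1" >/dev/null
}
```

For example, `dig @ns1.example.com example.com SOA +norecurse | has_flag aa && echo authoritative` would confirm the AA bit before you trust the answer.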
RCODE values (full list in the IANA DNS Parameters registry):
| Status | Meaning | Common Causes |
|---|---|---|
| `NOERROR` | Success | Query succeeded; may be NODATA if the answer section is empty |
| `SERVFAIL` | Server failure | DNSSEC validation, upstream timeout, lame delegation, RRL |
| `NXDOMAIN` | Name does not exist | Domain not registered, typo, deleted record |
| `REFUSED` | Query refused | ACL, rate limiting, server not authoritative |
| `FORMERR` | Format error | Malformed query (rare; usually a buggy stub or middlebox) |
delv: Local DNSSEC Validation
delv ships with BIND and reuses named’s validator, so it can prove validation locally — independent of the resolver you queried. Use it whenever dig +cd succeeds but plain dig returns SERVFAIL.

    $ delv example.com
    ; fully validated
    example.com. 86400 IN A 93.184.216.34

Key delv flags (BIND 9 ARM, delv):
| Flag | Purpose |
|---|---|
| `+rtrace` | Resolver fetch logging: every query delv issues to build the chain |
| `+vtrace` | Validation trace: every signature it checks against the trust anchor |
| `+mtrace` | Message trace: full responses received |
| `-i` | Insecure mode (disable validation; for plain lookups, prefer `dig +cd`) |
| `+multiline` | Wrap RRSIG, DNSKEY, and SOA records into a readable form |
When validation fails, delv prints the broken hop:

    $ delv dnssec-failed.org
    ;; resolution failed: SERVFAIL
    ;; DNSSEC validation failure

Note
delv -i does not set CD on upstream queries. If your forwarder is itself validating, it will withhold bogus data and delv will time out. To examine bogus data, use dig +cd — never delv -i. (BIND 9 ARM — delv)
kdig: Encrypted DNS Testing
kdig from Knot DNS speaks DoT (RFC 7858), DoH (RFC 8484), and DoQ (RFC 9250) out of the box. It is the right tool when you suspect TLS handshake failures, ALPN mismatches, or DoH path/method differences:

    kdig @1.1.1.1 example.com +tls
    kdig @1.1.1.1 example.com +https
    kdig @1.1.1.1 example.com +quic
    kdig @8.8.8.8 +https +tls-hostname=dns.google +fastopen example.com

The full option set is in the kdig manpage; particularly useful for triage are +tls-hostname, +tls-pin, +keepopen, and +padding.
drill: DNSSEC Chain Tracing
drill from NLnet Labs ships with ldns and is the cleanest way to walk the DNSSEC chain by hand:

    drill -TDS example.com

Per the drill(1) manpage, the flags are independent:
| Flag | Purpose |
|---|---|
| `-T` | Trace from root to the queried name |
| `-D` | Set the DNSSEC OK (DO) bit, requesting DNSSEC records in responses |
| `-S` | Chase signatures up to a known trust anchor |
-TDS combined produces a per-hop trace that includes RRSIG and DS records, surfacing chain breaks visually before you reach for DNSViz.
Symptom-Driven Triage
Pick the entry point matching what users (or your synthetic monitor) reported, then drill down. None of the steps below mutate state — they are safe to run in production at any time.
Complete Resolution Failure
Symptom: All queries to a domain fail across multiple resolvers — no responses, or every response is an error.

    dig example.com NS +short
    # Returns: ns1.example.com, ns2.example.com
    dig @ns1.example.com example.com A +norecurse
    dig @ns2.example.com example.com A +norecurse
    dig ns1.example.com A +short
    # Returns: 192.0.2.1
    nc -zv 192.0.2.1 53

Failure patterns:
| Pattern | Likely Cause |
|---|---|
| No response from any NS | Authoritative servers down or unreachable |
| Response but no AA flag | Lame delegation — NS does not serve this zone |
| Response but `REFUSED` | ACL blocking your source IP, or zone not loaded |
| Timeout to NS but ICMP works | Firewall blocking UDP/TCP 53, or DNS-over-something only |
Lame delegation check. A nameserver that is listed in the parent’s NS records but does not consider itself authoritative for the zone is a lame delegation:

    dig @ns1.example.com example.com SOA +norecurse
    # Healthy: status: NOERROR, flags include 'aa'
    # Lame: REFUSED, SERVFAIL, or NOERROR without the 'aa' flag

This is one of the oldest and most common DNS failure modes. The classic reference is RFC 1912 §2.8; modern resolvers downrank lame nameservers automatically, which makes intermittent breakage hard to spot.
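The healthy/lame rule above is easy to encode so it can run against every listed nameserver. This is a sketch under the assumption that the input is the default output of `dig @NS ZONE SOA +norecurse`; the function name `delegation_state` and its output labels are made up for illustration.

```shell
# delegation_state: read dig output on stdin and print
#   healthy        NOERROR with the aa flag set
#   lame:STATUS    any other response (REFUSED, SERVFAIL, NOERROR without aa)
#   no-response    no DNS header seen at all (timeout or unreachable)
delegation_state() {
  awk '
    /status:/ {
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { s = $(i + 1); sub(/,$/, "", s) }
    }
    /^;; flags:/ {
      line = $0
      sub(/^;; flags: /, "", line)   # drop the prefix
      sub(/;.*/, "", line)           # drop the section counts
      flags = " " line " "
    }
    END {
      if (s == "") print "no-response"
      else if (s == "NOERROR" && index(flags, " aa ") > 0) print "healthy"
      else print "lame:" s
    }'
}
```

A loop such as `for ns in $(dig example.com NS +short); do echo "$ns $(dig @"$ns" example.com SOA +norecurse | delegation_state)"; done` would then give one line per nameserver.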
SERVFAIL Responses
Symptom: A resolver returns SERVFAIL for a domain that should resolve.
SERVFAIL is a catch-all the resolver returns whenever it cannot produce a trustworthy answer. The dominant causes today:
- DNSSEC validation failure (a leading cause now that the major public resolvers validate by default).
- All authoritative servers unreachable from the resolver.
- Lame delegation.
- Resolver-side timeout, RRL throttling (RFC 7873 cookies), or upstream loop detection.
The RFC 9520 cache rules require resolvers to negatively cache resolution failures themselves, which is why a single bad event can persist for the full negative-cache TTL even after you fix the source.
Decision tree. Always start by ruling DNSSEC in or out:

    dig example.com +cd
    # If +cd succeeds and plain dig fails → DNSSEC problem
    # If both fail → authoritative or network problem

Extended DNS Errors (EDE). RFC 8914 defines an EDNS0 OPT field carrying an INFO-CODE that explains the underlying reason for a SERVFAIL (or any other RCODE). Most major recursors emit it; Cloudflare’s “Unwrap the SERVFAIL” post documents their adoption.

    dig @1.1.1.1 example.com
    ;; OPT PSEUDOSECTION:
    ; EDE: 6 (DNSSEC Bogus)

The codes most often seen in the wild (IANA Extended DNS Errors registry):
| Code | Name | Meaning |
|---|---|---|
| 6 | DNSSEC Bogus | Validation failed — chain or signature inconsistency |
| 7 | Signature Expired | RRSIG inception/expiration window has passed; re-sign the zone |
| 8 | Signature Not Yet Valid | RRSIG inception is in the future; usually clock skew on the signer |
| 9 | DNSKEY Missing | RRSIG references a key not in the published DNSKEY RRset |
| 10 | RRSIGs Missing | The zone is signed but a queried RRset has no RRSIG |
| 11 | No Zone Key Bit Set | DNSKEY used to sign does not have the Zone Key flag |
| 12 | NSEC Missing | Negative answer cannot be authenticated — denial-of-existence chain broken |
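When triaging many zones, it helps to turn the EDE line straight into a next step. The sketch below assumes dig prints the option as `; EDE: N (Name)`, as in the sample output above; the function name `ede_hint` is hypothetical and the hints simply restate the table.

```shell
# ede_hint: read dig output on stdin; for each "; EDE: N" line, print the
# code, its registry name, and the next step suggested by the table above.
ede_hint() {
  sed -n 's/^; EDE: \([0-9][0-9]*\).*/\1/p' | while read -r code; do
    case "$code" in
      6)  echo "6 DNSSEC Bogus: check the chain and signatures for inconsistency" ;;
      7)  echo "7 Signature Expired: re-sign the zone" ;;
      8)  echo "8 Signature Not Yet Valid: check the signer clock" ;;
      9)  echo "9 DNSKEY Missing: publish the key the RRSIG references" ;;
      10) echo "10 RRSIGs Missing: sign the unsigned RRset" ;;
      11) echo "11 No Zone Key Bit Set: fix the DNSKEY flags" ;;
      12) echo "12 NSEC Missing: repair the denial-of-existence chain" ;;
      *)  echo "$code: see the IANA Extended DNS Errors registry" ;;
    esac
  done
}
```

Usage would look like `dig @1.1.1.1 example.com | ede_hint`; no output means the resolver attached no EDE option.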
Trace to find the failing hop:

    dig +trace example.com

The last successful referral identifies the layer immediately above the failure.
Intermittent Failures
Symptom: Queries succeed sometimes and fail other times, often correlated with location or time of day.
| Cause | How to detect |
|---|---|
| Inconsistent authoritative servers | Different answers / SOA serials per NS |
| Anycast routing instability | Same NS IP, different +nsid instances, different latencies |
| Partial outage | Some NS instances respond, others time out |
| Network path issues | Packet loss to specific NS IPs; UDP fragmentation past the EDNS bufsize |
Compare authoritative responses:

    for ns in $(dig example.com NS +short); do
      echo "=== $ns ==="
      dig @$ns example.com A +norecurse +short
    done
    for ns in $(dig example.com NS +short); do
      echo "$ns: $(dig @$ns example.com SOA +short | awk '{print $3}')"
    done

Different SOA serials across nameservers indicate a zone-transfer problem (AXFR/IXFR ACLs, NOTIFY drops, signer not pushing) rather than an answer problem.
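For monitoring, the serial comparison can be reduced to an exit status. This is a minimal sketch; it assumes input lines of the form `nameserver serial` (the output of the SOA loop above fits), and the function name `serials_consistent` is illustrative.

```shell
# serials_consistent: read "nameserver serial" pairs on stdin.
# Exit 0 when every nameserver reports the same serial, 1 otherwise
# (including when no pairs were read at all).
serials_consistent() {
  awk '
    NF >= 2 { seen[$2]++; n++ }
    END {
      distinct = 0
      for (s in seen) distinct++
      if (n > 0 && distinct == 1) exit 0
      exit 1
    }'
}
```

Piping the SOA loop into `serials_consistent || echo "serial mismatch"` would flag replication drift. A nameserver that returns no serial produces a one-field line and is skipped, so pair this check with a reachability check.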
Anycast instance identification:

    dig +nsid @1.1.1.1 example.com
    ;; OPT PSEUDOSECTION:
    ; NSID: 4c 41 58 ("LAX" = Los Angeles instance)

NSID is defined by RFC 5001 and is widely supported by the major public resolvers; vendors document their site code conventions in their resolver docs.
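Some tools print only the NSID hex bytes, so a decoder is handy. The following sketch converts space-separated hex bytes to text; the function name `nsid_ascii` is hypothetical, and it assumes the NSID payload is printable ASCII, which is common but not required by RFC 5001.

```shell
# nsid_ascii HEX...: turn hex bytes (as dig prints them) into text.
nsid_ascii() {
  for byte in "$@"; do
    # Convert each hex byte to a three-digit octal escape, which printf
    # understands portably, then emit the corresponding character.
    printf "\\$(printf '%03o' "0x$byte")"
  done
  printf '\n'
}
```

Running `nsid_ascii 4c 41 58` on the bytes from the sample output above prints `LAX`.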
Slow Resolution
Symptom: Queries take seconds when they should take milliseconds.
| Cause | Detection signal |
|---|---|
| Cache miss on a long delegation chain | Normal on the first query; subsequent queries should be fast |
| Per-NS timeout before failover | Latency clusters near the resolver’s query-timeout |
| Lame delegation requiring retries | +trace shows retries, several seconds spent at one hop |
| DNSSEC adds DNSKEY/DS fetches | Chain queries visible in delv +rtrace |

    dig example.com | grep "Query time"
    dig +trace +stats example.com
    dig @8.8.8.8 example.com | grep -E "^example.com.*IN"
    # example.com. 142 IN A ...
    # TTL 142 means the record was fetched ~158 seconds ago (original 300)

The remaining TTL is your free side-channel for “how cached is this answer?”, which is useful when comparing resolvers without flushing them.
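The cache-age arithmetic in the comment above is simply the original TTL minus the remaining TTL. A small sketch, with a hypothetical function name `cache_age`; it assumes you know the original TTL from the authoritative server and that the resolver has not clamped it:

```shell
# cache_age ORIGINAL_TTL REMAINING_TTL
# Print how many seconds ago the resolver fetched the record.
cache_age() {
  original=$1 remaining=$2
  if [ "$remaining" -gt "$original" ]; then
    # Usually means the resolver rewrote the TTL or the zone changed.
    echo "remaining TTL exceeds original TTL" >&2
    return 1
  fi
  echo $(( original - remaining ))
}
```

`cache_age 300 142` prints `158`, matching the example. Large anycast resolvers keep several independent caches, so repeated queries may report different ages.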
Unexpected NXDOMAIN
Symptom: A domain you know exists returns NXDOMAIN.
| Cause | Verification |
|---|---|
| Record actually deleted | Authoritative NS also returns NXDOMAIN |
| Negative cache | Authoritative answers correctly; resolver returns NXDOMAIN until TTL expires |
| Split-horizon DNS | Public resolvers see NXDOMAIN; internal resolvers see the record |
| Registry / registrar removal | Parent zone has no NS for the domain |

    dig @ns1.example.com api.example.com A +norecurse
    dig example.com NS @$(dig com NS +short | head -1)
    whois example.com

Negative cache duration is governed by RFC 2308 §5:

    dig example.com SOA +short
    # ns1.example.com. hostmaster.example.com. 2024011501 7200 3600 1209600 3600
    #                                                                       ^^^^
    # Last value (MINIMUM, here 3600) bounds negative cache TTL.
    # Effective negative TTL = min(SOA.MINIMUM, SOA RR TTL).

Resolvers may also enforce their own ceilings on negative TTL (RFC 8767 for serve-stale, RFC 9520 for failure caching), so an aggressive MINIMUM is not a safety net.
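The `min(SOA.MINIMUM, SOA RR TTL)` rule needs the SOA record's own TTL, which `+short` hides. This sketch assumes a full answer line as printed by `dig +noall +answer`; the function name `negative_ttl` is illustrative.

```shell
# negative_ttl: read SOA answer lines on stdin in the form
#   name TTL class SOA mname rname serial refresh retry expire minimum
# and print min(SOA RR TTL, SOA.MINIMUM) for each.
negative_ttl() {
  awk '$4 == "SOA" {
    ttl = $2 + 0        # TTL of the SOA record itself
    minimum = $NF + 0   # last RDATA field is MINIMUM
    print (ttl < minimum ? ttl : minimum)
  }'
}
```

Query an authoritative server so the TTL is not already counting down, for example `dig @ns1.example.com example.com SOA +noall +answer +norecurse | negative_ttl`.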
Resolver vs Authoritative Isolation
Testing Authoritative Servers
Always verify authoritative servers before blaming a resolver. Resolvers are usually right; zones often are not.

    dig example.com NS +short
    dig @ns1.example.com example.com A +norecurse
    # Healthy:
    # - status: NOERROR
    # - flags include 'aa'
    # - Answer section contains the record

Red flags in the authoritative response:
| Issue | Meaning |
|---|---|
| No `aa` flag | Server does not consider itself authoritative: lame delegation |
| `REFUSED` | ACL blocking the query, or zone not loaded |
| `SERVFAIL` | Zone load failure (syntax error, missing file, signing pipeline crash) |
| Different answers from different NS | Replication broken — AXFR/IXFR failure or out-of-band edits to one server only |
Glue Records and In-Bailiwick NS
When a zone’s nameservers live inside the zone (ns1.example.com for example.com), the parent zone must publish glue A/AAAA records in the delegation, otherwise resolvers face a chicken-and-egg lookup (RFC 1034 §4.2.1). Missing or stale glue is silent — the resolver simply gives up and the zone “intermittently disappears” for cold caches.

    dig com. NS @a.root-servers.net +norecurse
    dig example.com. NS @a.gtld-servers.net +norecurse
    # Look in ADDITIONAL section for A/AAAA glue.
    # Empty ADDITIONAL + in-bailiwick NS = broken delegation.

| Symptom | Likely cause |
|---|---|
| Cold-cache resolvers SERVFAIL | Glue missing or stale at the registry; primed caches still work |
| Glue IPs differ from in-zone A records | Operator updated the in-zone record but not the registrar’s glue |
| IPv6-only resolvers fail | AAAA glue missing while A glue is present (or vice versa) |
Update glue at the registrar in the same change set as any NS IP change, and verify with dig @<TLD-NS> <zone> NS +norecurse rather than trusting your own resolver.
Comparing Public Resolvers
Different resolvers cache differently and apply different policies. Querying several in parallel triangulates whether the issue is global or local:

    echo "Google: $(dig @8.8.8.8 example.com +short)"
    echo "Cloudflare: $(dig @1.1.1.1 example.com +short)"
    echo "Quad9: $(dig @9.9.9.9 example.com +short)"
    echo "OpenDNS: $(dig @208.67.222.222 example.com +short)"

| Result | Meaning |
|---|---|
| All match | Likely correct; check authoritative if the answer is unexpected |
| One differs | That resolver has stale cache or a different policy |
| All differ | Authoritative inconsistency — check zone replication |
| Some return SERVFAIL | DNSSEC issue, EDE often present, or resolver-specific problem |
Resolver-specific behaviors:
| Resolver | DNSSEC validation | EDE | EDNS Client Subnet | Notes |
|---|---|---|---|---|
| Google (8.8.8.8) | Yes | Yes | Yes, /24 default | Largest anycast footprint |
| Cloudflare (1.1.1.1) | Yes | Yes | Privacy-focused, off by default | Sends ECS only to the Akamai debug domain |
| Quad9 (9.9.9.9) | Yes | Yes | No | Threat-block list; may NXDOMAIN bad reputation hosts |
| OpenDNS | Yes | Partial | Partial | Content filtering available; aliases NXDOMAIN to a landing page in some tiers |
Tracing the Resolution Path
dig +trace performs iterative resolution from your machine, showing each referral. After fetching the root NS set from your configured resolver (the first ";; Received" line in the sample), it queries the root, TLD, and zone nameservers directly, so it catches local resolver bugs that other tools miss:

    $ dig +trace api.example.com
    . 518400 IN NS a.root-servers.net.
    . 518400 IN NS b.root-servers.net.
    ;; Received 239 bytes from 192.168.1.1#53(192.168.1.1) in 12 ms
    com. 172800 IN NS a.gtld-servers.net.
    com. 172800 IN NS b.gtld-servers.net.
    ;; Received 772 bytes from 198.41.0.4#53(a.root-servers.net) in 24 ms
    example.com. 172800 IN NS ns1.example.com.
    example.com. 172800 IN NS ns2.example.com.
    ;; Received 112 bytes from 192.5.6.30#53(a.gtld-servers.net) in 32 ms
    api.example.com. 300 IN A 93.184.216.50
    ;; Received 56 bytes from 192.0.2.1#53(ns1.example.com) in 45 ms

Read the trace as a sequence of referrals; the final section should answer with the aa flag set. If the trace stalls, the layer above the stall is where to look first.
| Pattern | Cause |
|---|---|
| Stops at TLD | Delegation not registered or NS unreachable |
| `SERVFAIL` at zone NS | Zone not loaded or DNSSEC bogus |
| Timeout at one NS | That instance is down; failover should mask it |
| Loop in referrals | Misconfigured delegation (sibling NS records pointing at each other) |
Tip
dig +trace does not perform DNSSEC validation. To trace and validate, combine delv +rtrace (validates) with drill -TDS (visualizes the chain).
DNSSEC Troubleshooting
Validation Failure Workflow
When validation fails, resolvers return SERVFAIL to the client and (if EDE-aware) attach an INFO-CODE. The decision diagram above maps the common path. The command sequence:

    dig example.com +cd           # Should succeed (validation disabled)
    dig example.com               # Fails with SERVFAIL → DNSSEC
    dig @1.1.1.1 example.com      # Look for "EDE: N (...)" in OPT pseudo-section
    delv example.com +rtrace      # Detailed validation trace from local validator
    # Visualize the chain:
    # https://dnsviz.net/d/example.com/analyze/

DNSViz is the standard chain visualizer; it surfaces missing DS records, algorithm rollovers, and partial signing in one view. For a shell-only workflow, dnsviz probe emits the analysis as JSON. The Verisign DNSSEC Debugger is a faster second opinion when DNSViz reports a lattice of warnings: it focuses on the chain-of-trust pass/fail at each tier.
Common DNSSEC Failures
Expired signatures (EDE 7).

    dig example.com RRSIG +dnssec +multiline
    example.com. 300 IN RRSIG A 13 2 300 (
                    20240215000000 20240115000000 12345 example.com.
                    abc123...signature... )
    #               ^^^^^^^^^^^^^^
    # Signature expires 2024-02-15

Re-sign the zone. Check that automatic resigning (BIND inline-signing, PowerDNS LIVE-signed zones, Knot DNS automatic-policy) is running and that the signer’s clock is correct.
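The validity window can be checked without eyeballing timestamps. The sketch below compares the two RRSIG timestamps against the current UTC time; it assumes the `YYYYMMDDHHMMSS` form dig prints, and the function name `rrsig_state` and its optional third argument are illustrative.

```shell
# rrsig_state EXPIRATION INCEPTION [NOW]
# Timestamps are YYYYMMDDHHMMSS in UTC. NOW defaults to the current time.
# Prints expired, not-yet-valid, or valid.
rrsig_state() {
  expiration=$1 inception=$2
  now=${3:-$(date -u +%Y%m%d%H%M%S)}
  if [ "$now" -gt "$expiration" ]; then
    echo "expired"          # corresponds to EDE 7
  elif [ "$now" -lt "$inception" ]; then
    echo "not-yet-valid"    # corresponds to EDE 8
  else
    echo "valid"
  fi
}
```

With the sample record, `rrsig_state 20240215000000 20240115000000` reports `expired` on any date after 2024-02-15. Because the fixed-width timestamps sort numerically, a plain integer comparison is enough.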
DS / DNSKEY mismatch (EDE 6 or 9). The DS record in the parent zone must match the active KSK in your DNSKEY RRset:

    dig example.com DS @$(dig com NS +short | head -1)
    dig @ns1.example.com example.com DNSKEY +dnssec

The DS is a hash of one of your DNSKEYs (typically the KSK). After a key rollover the parent must publish a DS that matches the new key before the old one is removed.
Algorithm mismatch. Per RFC 8624 §3.1 and the IANA DNS Security Algorithm Numbers registry, the implementation requirements for the actively-deployed algorithms are:
| ID | Name | RFC 8624 status |
|---|---|---|
| 8 | RSASHA256 | MUST sign, MUST validate |
| 13 | ECDSAP256SHA256 | MUST sign, MUST validate |
| 14 | ECDSAP384SHA384 | MAY sign, RECOMMENDED validate |
| 15 | Ed25519 | RECOMMENDED sign, RECOMMENDED validate |
Per RFC 4035 §5.2, a validator that does not implement the algorithm at the apex of a chain treats the zone as Insecure rather than Bogus, so well-behaved resolvers degrade gracefully. Middleboxes and old validators that misimplement this rule still return SERVFAIL — surface that with dig +cd (which should succeed if the resolver itself is the cause).
Chain of trust broken. Use DNSViz for visual analysis. Most chain breaks come from either a DS record at the parent that does not match any current DNSKEY, or a DNSKEY RRset that is not signed by the KSK referenced in the DS.
Key Rollover Issues
Key rollovers cause more DNSSEC outages than any other category. The two safe patterns are documented in RFC 6781 §4: the pre-publication scheme for ZSKs and the double-DS scheme for KSKs (also called double-KSK in some references).
ZSK pre-publication rollover:
- Generate `ZSK_new`.
- Publish DNSKEY RRset with both `ZSK_old` and `ZSK_new` (still signing with `ZSK_old`).
- Wait at least DNSKEY TTL so resolvers cache both keys.
- Re-sign the zone with `ZSK_new`.
- Wait at least the longest RRSIG TTL so old signatures expire from caches.
- Remove `ZSK_old`.
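The two waits in the steps above translate into simple timestamp arithmetic. This is a rough sketch only: the function name `zsk_rollover_times` is hypothetical, it assumes re-signing happens at the first allowed moment, and it ignores zone propagation delay to secondaries, which RFC 6781 also asks you to account for.

```shell
# zsk_rollover_times PUBLISH_EPOCH DNSKEY_TTL MAX_RRSIG_TTL
# PUBLISH_EPOCH is when the DNSKEY RRset with both ZSKs went live.
# Prints the earliest epoch to start signing with ZSK_new, then the
# earliest epoch to remove ZSK_old.
zsk_rollover_times() {
  publish=$1 dnskey_ttl=$2 max_rrsig_ttl=$3
  resign=$(( publish + dnskey_ttl ))      # resolvers now hold both keys
  remove=$(( resign + max_rrsig_ttl ))    # old signatures have left caches
  echo "resign_after=$resign"
  echo "remove_old_after=$remove"
}
```

For example, `zsk_rollover_times "$(date +%s)" 3600 300` gives both deadlines for a zone with a one-hour DNSKEY TTL and five-minute record TTLs.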
KSK double-DS rollover:
| Failure symptom | Cause | Recovery |
|---|---|---|
| `SERVFAIL` immediately after DS swap | Old DS removed before resolvers cached the new DS | Restore old DS; wait at least the parent DS TTL |
| `SERVFAIL` after publishing new keys | Zone signed with a key not yet in the DNSKEY RRset | Ensure DNSKEY publication precedes any new RRSIG |
| Intermittent `SERVFAIL` (EDE 9) | Some resolver caches still hold the old DNSKEY set | Wait full DNSKEY TTL; do not roll the zone again |
| `SERVFAIL` after algorithm change | Validators that do not implement the new algorithm fail closed | Roll algorithms via RFC 6781 §4.1.4 (algorithm rollover) |
Caution
Never run a KSK rollover and a DS replacement in the same operational window. Wait the parent’s DS TTL between every state change. The fastest safe recovery from a botched rollover is usually to restore the previous DS at the registrar and let caches drain.
Transport-Layer Issues: EDNS0, Cookies, and TCP Fallback
DNSSEC is not the only common SERVFAIL source. The transport layer — EDNS0 buffer-size negotiation, UDP fragmentation, TCP fallback, and DNS Cookies — produces the second-most confusing class of failures because errors usually surface as plain timeouts.
EDNS0 Buffer Size and DNS Flag Day 2020
EDNS0 (RFC 6891) lets a stub or resolver advertise the largest UDP response it is willing to accept. In 2020, the DNS community standardized on a default of 1232 bytes (DNS Flag Day 2020), derived from the IPv6 minimum MTU (1280) minus IPv6 + UDP headers. The change closes a long-standing class of UDP-fragmentation attacks and middlebox failures, and shifts large responses (DNSSEC RRSIGs, DNSKEY RRsets, ANY queries) onto TCP.
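The 1232-byte figure follows directly from the header sizes. A worked version of the arithmetic, using the standard fixed header lengths (40 bytes for IPv6, 8 bytes for UDP); the variable names are illustrative:

```shell
# Sizes in bytes. The IPv6 minimum MTU is the value cited in the text;
# 40 and 8 are the fixed IPv6 and UDP header lengths.
ipv6_min_mtu=1280
ipv6_header=40
udp_header=8

# Largest DNS payload that fits in one unfragmented minimum-MTU packet.
flag_day_bufsize=$(( ipv6_min_mtu - ipv6_header - udp_header ))
echo "$flag_day_bufsize"
```

The result is the value to pass as `dig +bufsize=1232` when reproducing what a Flag Day compliant resolver advertises.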
Two failure modes dominate:
- TCP/53 (or TCP/853 for DoT) firewalled. Truncated responses cannot be retried, and large RRsets fail intermittently, usually for DNSSEC-signed zones, the DNSKEY RRset, or `ANY` queries. Symptoms: timeouts only on certain query types; small queries succeed.
- Middlebox strips EDNS or rewrites bufsize. Some legacy CPE devices and load balancers either drop EDNS OPT RRs or claim a bufsize the path cannot carry. Symptoms: `FORMERR`, persistent UDP timeouts despite TCP working.

    dig +bufsize=512 example.com ANY       # Force a TC=1 response on signed zones
    dig +tcp example.com DNSKEY            # Verify TCP/53 reachability end-to-end
    dig +bufsize=1232 example.com DNSKEY   # The Flag Day default
    dig +noedns example.com                # EDNS-stripping middlebox check

ISC’s DNS Flag Day 2020 announcement and the APNIC measurement study cover the operational rationale and the long tail of broken networks.
DNS Cookies and BADCOOKIE
DNS Cookies (RFC 7873, updated by RFC 9018) provide a lightweight off-path-spoofing defense and a way for servers to soft-rate-limit unknown clients. dig sends the COOKIE option by default; the relevant flags:
| Flag | Effect |
|---|---|
| `+cookie` | Send the COOKIE option (default) |
| `+nocookie` | Suppress the COOKIE option; useful when testing legacy resolvers |
| `+nobadcookie` | Disable the automatic retry when the server returns RCODE 23 BADCOOKIE |
A BADCOOKIE (RCODE 23) from a server you have not previously contacted is not an error — it is the server asking the client to retry with the issued server cookie. Resolvers handle this transparently; bare dig shows the retry. If you see BADCOOKIE not followed by a successful retry, suspect a stateful middlebox swallowing the second packet. ISC’s DNS Cookies in BIND 9 describes the production interaction with Response-Rate Limiting.
Wireshark and Server-Side Captures
When dig cannot reproduce the failure, drop to packet capture. Wireshark’s DNS dissector handles plain DNS, EDNS0 OPT records, DoT, and DoQ; the standard capture filters are port 53 (Do53), port 853 (DoT), and port 443 for DoH (with a TLS keylog file to decrypt). On the server side:

    sudo tcpdump -i any -nn -s0 -w dns.pcap '(port 53 or port 853)'

Then open dns.pcap in Wireshark and use display filters like dns.flags.rcode != 0, dns.qry.name contains "example", or dns.opt.code == 15 (the EDNS0 option code for Extended DNS Errors). Wireshark’s DNS dissector documentation lists every available field reference.
dnstap and Resolver Query Logs
Text-format query-log in BIND or unbound-control log dumps are I/O-heavy and lose detail. dnstap (dnstap.info) — a Frame Streams + protobuf binary format — is supported by BIND, Unbound, Knot Resolver, CoreDNS, and dnsdist. It captures every authoritative or resolver-side message asynchronously, and dnstap-read -y decodes a .tap file to YAML for grep-friendly inspection.

    options {
        dnstap { all; };
        dnstap-output unix "/var/run/named/dnstap.sock";
    };

    dnstap-read -y /var/log/named/dnstap.log | less

dnstap is the right tool when the question is “what did this resolver actually do?”: it shows upstream queries, cache hits, validation outcomes, and EDE codes that never reach the client. Pair it with unbound-control dump_cache when you need to inspect a resolver’s view of a name without restarting it.
Cache and Propagation Debugging
“DNS Propagation” Is Cache Expiry
DNS does not actively push updates. Changes take effect as cached records expire. Four caches matter:
- Record TTL: how long any resolver may cache a successful answer (RFC 1035 §3.2.1).
- Negative cache TTL: how long NXDOMAIN/NODATA persists, bounded by `min(SOA.MINIMUM, SOA TTL)` (RFC 2308 §5).
- Resolver minimum / maximum TTL: many resolvers cap both ends. Cloudflare and Google publish their floors and ceilings; assume a 30s floor and a 24h–48h ceiling.
- Browser and OS caches: Chrome, Firefox, and the OS stub all cache independently, with their own (often shorter) TTLs.
Propagation verification:

    dig @ns1.example.com example.com A +norecurse
    dig @8.8.8.8 example.com +norecurse
    # Empty answer → not cached; the next query will fetch fresh.
    # Answer with TTL → cached; remaining TTL shows time-to-live.
    dig @8.8.8.8 example.com

Flushing Caches
Public resolver cache flush (only flushes that resolver, not the internet):
| Resolver | Method |
|---|---|
| Google | Google Public DNS cache flush |
| Cloudflare | Cloudflare 1.1.1.1 purge cache |
| OpenDNS | OpenDNS cache check / flush |
Browser cache flush:
| Browser | Method |
|---|---|
| Chrome | chrome://net-internals/#dns → Clear host cache |
| Firefox | about:networking#dns → Clear DNS Cache |
| Edge | edge://net-internals/#dns → Clear host cache |
| Safari | Clear via macOS system cache |
Operating system cache flush:

    sudo dscacheutil -flushcache       # macOS: legacy DirectoryService cache
    sudo killall -HUP mDNSResponder    # macOS: active stub cache
    sudo resolvectl flush-caches       # Linux with systemd-resolved
    ipconfig /flushdns                 # Windows

The systemd-resolved command is documented in resolvectl(1); macOS requires both commands because dscacheutil only flushes the legacy DirectoryService cache and mDNSResponder holds the active stub cache.
Pre-Migration TTL Strategy
Before changing any record that will be queried during a migration, lower its TTL to the migration window you can tolerate, then wait for the old TTL to expire before making the actual change:

    dig example.com +noall +answer   # Note the current TTL (second column)
    # 1. Lower TTL to 300 (or your migration TTL) at the provider
    # 2. Wait at least the OLD TTL; if it was 86400, that is 24 hours
    # 3. Verify the new TTL is in effect everywhere:
    dig @8.8.8.8 example.com         # TTL should be ≤ 300
    # 4. Make the actual change
    # 5. After verification, restore the higher steady-state TTL

Warning

The most common DNS migration mistake is lowering TTL and immediately making the change. Resolvers still hold the old record at the old TTL; the new TTL only applies after the next refresh. Skipping the wait means stale answers can persist for up to the old TTL (24 hours in the example) rather than the lowered one.
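The wait is easy to compute if you record when the TTL was lowered. A minimal sketch, assuming epoch-second timestamps; the function name `seconds_to_wait` is hypothetical:

```shell
# seconds_to_wait LOWERED_AT_EPOCH OLD_TTL NOW_EPOCH
# Print how many seconds remain before every cache that held the record
# at the old TTL must have refreshed. Prints 0 once the change is safe.
seconds_to_wait() {
  remaining=$(( $1 + $2 - $3 ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}
```

For example, `seconds_to_wait 1700000000 86400 "$(date +%s)"` tells you how long to hold the migration for a record whose TTL was lowered from 24 hours at that timestamp. Resolvers that enforce a TTL ceiling below the old TTL refresh sooner, so this is an upper bound.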
CDN and GeoDNS Pitfalls
Resolver Location vs Client Location
GeoDNS uses the resolver’s IP — not the end-client’s IP — to pick a region. When users on the same continent route through a public resolver in another region (or through corporate DNS in a far-away office), routing degrades silently.
EDNS Client Subnet (ECS). RFC 7871 lets a resolver forward a truncated client subnet so authoritative GeoDNS can localize the answer:

    dig +subnet=203.0.113.0/24 example.com @8.8.8.8
    dig +subnet=203.0.113.0/24 example.com @ns1.example-cdn.net

ECS privacy stance varies:
- Google Public DNS sends ECS by default (Google Public DNS — ECS docs).
- Cloudflare 1.1.1.1 does not send ECS to authoritative servers other than a single Akamai debug zone, by design (Cloudflare 1.1.1.1 FAQ).
- Quad9 and most privacy-focused resolvers strip ECS entirely.
If your CDN relies on ECS for steering and your users go through a privacy resolver, you will see a population concentrate at whichever region is closest to the resolver’s anycast PoP.
Health Check and Failover Delays
CDN and load balancer DNS can serve stale records when:
- The health check has not yet detected failure.
- The DNS TTL has not yet expired in downstream caches.
- A resolver is intentionally serving stale data per RFC 8767 because the authoritative is unreachable.

    dig @authoritative-ns.example.com www.example.com +short
    dig @8.8.8.8 www.example.com +short

Mitigation:
- Use a low TTL (60–300s) for health-checked records.
- Tighten health-check intervals and failure thresholds.
- Prefer anycast at the IP layer — failover happens in BGP, not DNS — when sub-second cutover matters.
CNAME Flattening Complications
CNAME at a zone apex is forbidden by RFC 1034 §3.6.2: a CNAME at a node may not coexist with the SOA and NS records that the apex requires. Providers work around this with CNAME flattening (Cloudflare) or ALIAS records (Route 53, NS1) — the authoritative server resolves the target itself and serves the resulting A/AAAA at the apex.
Complications:
- Domain verification fails: a TXT record at the apex coexists fine, but tooling that walks CNAMEs may give up.
- Certificate renewal issues: ACME `dns-01` challenges target the apex but resolve through the flattened indirection; latency or stale upstream caches break the challenge window.
- GeoDNS accuracy: flattening happens at the authoritative server, so the upstream sees the authoritative’s source IP, not the end client’s.

    dig example.com CNAME               # No answer when flattened
    dig example.com A                   # Returns the resolved IP
    dig _underlying.example.com CNAME   # Provider-specific debug name, where exposed

Incident Playbook
Initial Triage Script
A 30-second triage script you can paste into any production shell:

    #!/bin/bash
    DOMAIN=${1:?Usage: $0 domain.com}
    echo "=== DNS Triage for $DOMAIN ==="
    echo ""
    echo "--- Authoritative Nameservers ---"
    dig "$DOMAIN" NS +short
    echo ""
    echo "--- Direct Query to Each NS ---"
    for ns in $(dig "$DOMAIN" NS +short 2>/dev/null); do
      echo "$ns:"
      dig @"$ns" "$DOMAIN" A +norecurse +short 2>/dev/null || echo "  FAILED"
    done
    echo ""
    echo "--- SOA Serial Consistency ---"
    for ns in $(dig "$DOMAIN" NS +short 2>/dev/null); do
      serial=$(dig @"$ns" "$DOMAIN" SOA +short 2>/dev/null | awk '{print $3}')
      echo "$ns: $serial"
    done
    echo ""
    echo "--- Public Resolver Comparison ---"
    echo "Google 8.8.8.8: $(dig @8.8.8.8 "$DOMAIN" A +short 2>/dev/null)"
    echo "Cloudflare 1.1.1.1: $(dig @1.1.1.1 "$DOMAIN" A +short 2>/dev/null)"
    echo "Quad9 9.9.9.9: $(dig @9.9.9.9 "$DOMAIN" A +short 2>/dev/null)"
    echo ""
    echo "--- DNSSEC Status ---"
    dig "$DOMAIN" +dnssec +short 2>/dev/null
    echo ""
    echo "With +cd (validation disabled):"
    dig "$DOMAIN" +cd +short 2>/dev/null
    echo ""
    echo "--- TTL Check ---"
    dig "$DOMAIN" | grep -E "^$DOMAIN.*IN" | head -1

Escalation Decision Tree
| Finding | Escalation Path |
|---|---|
| All NS unreachable | Infrastructure team / managed DNS provider |
| Lame delegation | DNS administrator (zone not loaded on the NS) |
| DNSSEC validation failure | DNSSEC key management team / registrar (DS) |
| Resolver-specific issue | Affected ISP / public resolver operator (rare) |
| Inconsistent NS responses | Zone-transfer / replication owner |
| Registry delegation missing | Registrar account or domain status (e.g. clientHold) |
Rollback Strategies
Record change rollback:
If you lowered TTL before the change, just revert the record — propagation is bounded by the new low TTL. If you did not lower TTL first, revert and either wait for the old TTL or flush major resolver caches (incomplete, but covers most user impact).
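The wait described above can be read off a resolver directly: the TTL column of a cached answer counts down to the moment that cache must refetch. The following is a minimal sketch, assuming input in the layout printed by dig +noall +answer; the helper names answer_ttl and remaining_wait are hypothetical, not part of any tool.

```bash
# Print the TTL (second column) of the first record read from stdin.
# Expects "name TTL class type rdata" lines as printed by
# dig +noall +answer; comment lines starting with ";" are skipped.
answer_ttl() {
  awk 'NF >= 5 && $1 !~ /^;/ { print $2; exit }'
}

# Seconds left until a cache that stored a record at epoch $1 with TTL $2
# has to refetch it, evaluated at epoch $3. Never negative.
remaining_wait() {
  local cached_at=$1 ttl=$2 now=$3
  local left=$(( cached_at + ttl - now ))
  [ "$left" -gt 0 ] || left=0
  echo "$left"
}
```

For example, dig @1.1.1.1 example.com A +noall +answer | answer_ttl prints how many seconds that resolver instance may keep serving its cached copy, and remaining_wait "$revert_epoch" "$old_ttl" "$(date +%s)" gives the worst-case wait after a revert made without lowering the TTL first (revert_epoch and old_ttl are placeholders you supply).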
Nameserver change rollback:
NS changes propagate slowly because TLD-served NS RRs typically have 48-hour TTLs. Three options, in increasing impact:
- Revert at the registrar — easiest if no real traffic shifted yet.
- Keep new NS, fix the zone there — usually faster than waiting for NS rollback to propagate.
- Run both old and new NS with consistent data — gives the safest soft-cutover; leaves you free to revert at the registrar later.
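Running both nameserver sets side by side only works if they really do publish the same zone version. One way to check is to compare SOA serials across the old and new servers. The sketch below assumes "nameserver serial" pairs on stdin, one per line; serials_consistent is a hypothetical helper name.

```bash
# Read "nameserver serial" pairs on stdin.
# Exit 0 if every server reports the same serial; otherwise print the
# mismatches (or servers that returned nothing) and exit 1.
serials_consistent() {
  awk '
    NF == 0 { next }                                  # skip blank lines
    NF < 2  { print $1 ": no serial returned"; bad = 1; next }
    !have_ref { ref = $2; have_ref = 1 }              # first serial is the reference
    $2 != ref { print $1 ": serial " $2 " differs from " ref; bad = 1 }
    END { exit bad + 0 }
  '
}

# Possible usage (server names are placeholders):
# for ns in ns1.old.example. ns1.new.example.; do
#   echo "$ns $(dig @"$ns" example.com SOA +short | awk '{print $3}')"
# done | serials_consistent
```

Matching serials are a necessary signal, not a sufficient one: two providers can carry the same serial with different record data, so spot-check the records you care about as well.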
DNSSEC rollback:
If DNSSEC is breaking resolution and you need traffic restored immediately:
- Emergency DS removal at the registrar — with no DS at the parent, validators treat the zone as Insecure and stop validating it. Resolution recovers as cached DS RRsets expire, so the bound is the TTL the parent publishes on the DS record (often between an hour and a day, depending on the TLD), not the negative TTL.
- Wait out that DS TTL so any cached DS records expire.
- Re-enable DNSSEC only after the signing pipeline is fixed and verified end-to-end.
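To confirm the emergency removal has taken effect, two observations are useful: the parent no longer returns a DS RRset, and validating resolvers answer without the AD flag. The sketch below parses dig output supplied on stdin; ds_is_gone and has_ad_flag are hypothetical helper names.

```bash
# Exit 0 if stdin (output of: dig <zone> DS +short) holds no records.
ds_is_gone() {
  ! grep -q '[^[:space:]]'
}

# Exit 0 if the dig response header on stdin has the AD flag set.
# Looks only at the flag list between ";; flags:" and the next ";".
has_ad_flag() {
  awk '
    /^;; flags:/ {
      flags = $0
      sub(/^;; flags:/, "", flags)   # drop the label
      sub(/;.*/, "", flags)          # drop the section counters
      n = split(flags, f, " ")
      for (i = 1; i <= n; i++) if (f[i] == "ad") found = 1
      exit
    }
    END { exit !found }
  '
}
```

A possible check after the registrar change: dig example.com DS +short | ds_is_gone should succeed, and dig @1.1.1.1 example.com A | has_ad_flag should fail once that resolver's cached DS has expired.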
Postmortem Checklist
- Timeline. When did the issue start? When was it detected? When was it resolved?
- Symptoms. What queries failed and from where? Which RCODE/EDE were users seeing?
- Root cause. Which specific misconfiguration, expired key, or replication failure?
- Resolution. Which change fixed it? Was a rollback necessary?
- TTL impact. How long were stale or NXDOMAIN-cached records served beyond the fix?
- Detection gap. Could synthetic monitoring or RUM have caught this earlier?
- Prevention. Which process change (rollover automation, DS-monitoring, signing observability) prevents recurrence?
Conclusion
DNS troubleshooting is layer isolation discipline. Start with symptoms, verify the authoritative serves the right data, compare resolvers to spot stale or filtered responses, and trace the resolution path to find the failing component. The three high-leverage commands — dig +norecurse, dig +trace, and dig +cd — isolate the authoritative, the chain, and DNSSEC respectively.
SERVFAIL in 2026 is overwhelmingly DNSSEC. Always test with +cd first, read the EDE INFO-CODE if present, and fall back to delv +rtrace plus DNSViz when the chain itself looks broken.
“Propagation” is cache expiry. Lower TTL before a change, wait the old TTL, change, verify, restore. Flushing public resolvers shortens your own test loop, not the internet’s.
For DNSSEC, almost every outage traces to a key rollover that skipped a wait gate. Use the double-DS pattern from RFC 6781 §4.1.2, never collapse two state changes into one window, and keep an emergency DS-removal runbook at the registrar.
Appendix
Prerequisites
- Familiarity with DNS resolution flow (DNS Resolution Path)
- Understanding of DNS record types and TTL (DNS Records, TTL, and Cache Behavior)
- DNSSEC, DoH, and DoT mechanics (DNS Security: DNSSEC, DoH, and DoT)
- Command-line access with dig (BIND tools) installed; delv, kdig, drill recommended
Terminology
| Term | Definition |
|---|---|
| RCODE | Response Code; 4-bit field in the DNS header indicating query result |
| EDE | Extended DNS Error (RFC 8914); detailed error information via EDNS option |
| SERVFAIL | Server failure response; catch-all for resolution errors |
| NXDOMAIN | Non-Existent Domain; name does not exist |
| NODATA | Name exists but no records of the requested type; NOERROR with empty answer |
| Lame delegation | NS records point to a server that does not serve the zone |
| AA flag | Authoritative Answer; set when response comes from the zone’s nameserver |
| AD flag | Authenticated Data; set when DNSSEC validation succeeded |
| CD flag | Checking Disabled; client requests validation be skipped |
| +trace | dig flag to perform iterative resolution from root |
| KSK | Key Signing Key; signs DNSKEY RRset, referenced by DS at parent |
| ZSK | Zone Signing Key; signs zone data |
| ECS | EDNS Client Subnet (RFC 7871); resolver-forwarded client subnet for GeoDNS |
Cheat Sheet
- SERVFAIL + +cd succeeds → DNSSEC validation failure; read EDE, check signatures and DS.
- SERVFAIL + +cd fails → authoritative or network issue; query NS directly with +norecurse.
- Intermittent failures → compare NS responses, check SOA serial consistency, identify anycast instance with +nsid.
- Slow resolution → use +trace +stats to find the slow hop; check for lame delegation and DNSSEC fetch overhead.
- Unexpected NXDOMAIN → verify authoritative servers, then check negative cache (SOA MINIMUM).
- Propagation delay → verify authoritative has new data, wait for old TTL, flush downstream caches.
- Key rollover failure → ensure DS at parent matches active DNSKEY; never remove old DS until new is fully cached; recover by restoring old DS.
- Large-response timeouts (DNSKEY/ANY) → suspect TCP/53 firewalled or EDNS-stripping middlebox; verify with dig +tcp and dig +bufsize=512 ANY.
- Cold-cache zone disappears → check parent glue with dig @<TLD-NS> <zone> NS +norecurse and inspect the ADDITIONAL section.
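The first two cheat-sheet entries amount to a two-query decision. The sketch below encodes it, assuming you pass the RCODE seen without +cd and the RCODE seen with +cd; dig_status and classify_servfail are hypothetical helper names, and treating NOERROR or NXDOMAIN as "+cd succeeds" is an assumption of this sketch.

```bash
# Print the RCODE name from the dig header line on stdin, e.g.
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4242".
# Prints nothing if no header is present (for example on a timeout).
dig_status() {
  awk -F'status: ' '/->>HEADER<<-/ { split($2, a, ","); print a[1]; exit }'
}

# $1 = RCODE without +cd, $2 = RCODE with +cd.
classify_servfail() {
  local plain=$1 with_cd=$2
  if [ "$plain" != "SERVFAIL" ]; then
    echo "not-servfail"
  elif [ "$with_cd" = "NOERROR" ] || [ "$with_cd" = "NXDOMAIN" ]; then
    echo "dnssec-validation"          # answer exists once checking is disabled
  else
    echo "authoritative-or-network"   # still failing without validation
  fi
}
```

A possible invocation: classify_servfail "$(dig example.com A | dig_status)" "$(dig example.com A +cd | dig_status)".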
References
- RFC 1034 — Domain Names: Concepts and Facilities
- RFC 1035 — Domain Names: Implementation and Specification
- RFC 1912 — Common DNS Operational and Configuration Errors
- RFC 2308 — Negative Caching of DNS Queries (DNS NCACHE)
- RFC 4033/4034/4035 — DNSSEC Introduction, Records, and Protocol
- RFC 5001 — DNS Name Server Identifier (NSID) Option
- RFC 6781 — DNSSEC Operational Practices, Version 2
- RFC 6891 — Extension Mechanisms for DNS (EDNS(0))
- RFC 7858 — DNS over TLS
- RFC 7871 — Client Subnet in DNS Queries
- RFC 7873 — DNS Cookies
- RFC 8484 — DNS over HTTPS
- RFC 8767 — Serving Stale Data to Improve DNS Resiliency
- RFC 8914 — Extended DNS Errors
- RFC 9018 — Interoperable DNS Server Cookies
- RFC 9250 — DNS over Dedicated QUIC Connections
- RFC 9520 — Negative Caching of DNS Resolution Failures
- IANA DNS Parameters Registry
- IANA DNS Security Algorithm Numbers
- BIND 9 Administrator Reference Manual — dig, delv, resolver configuration
- Knot DNS — kdig manpage
- NLnet Labs — ldns / drill and drill(1)
- DNSViz — DNSSEC chain visualization
- Verisign DNSSEC Debugger — alternative chain analyzer
- DNS Flag Day 2020 — EDNS bufsize 1232 and the end of UDP fragmentation
- ISC — DNS Flag Day 2020 — operator guidance
- ISC — DNS Cookies in BIND 9 — RFC 7873 in production
- dnstap.info — high-rate structured DNS logging
- Wireshark DNS dissector reference — display-filter fields
- Cloudflare — Unwrap the SERVFAIL — Extended DNS Errors in production
- Cloudflare 1.1.1.1 FAQ — privacy stance and ECS handling
- Google Public DNS — EDNS Client Subnet
- Julia Evans — How to use dig — practical dig usage guide