Accessibility Testing and Tooling Workflow

Most accessibility regressions ship not because tools are missing, but because teams treat one tool as the whole stack. Static lint rules catch the wrong half of issues to gate a deploy on; runtime engines like axe-core do the heavy lifting on rendered DOM but ignore questions of meaning; manual passes find what nothing else can but are too expensive to run on every PR. This article lays out a layered workflow — IDE → component → end-to-end → URL audit → manual — and is opinionated about which engine to put at each layer, how to wire it into CI without flakes, and which residue must always be tested by hand.

The reader profile is a senior engineer or accessibility lead deciding what to instrument, not a beginner learning what an aria-label is. The success bar is that after one read you can: pick the right runner per layer, justify the choice with rule-coverage data, write a CI gate that fails only on real violations, and explain what manual testing is still buying you.

Layered accessibility testing workflow — static lint at dev time, axe / Pa11y / Lighthouse in the CI gate, and a manual keyboard + screen-reader pass — Layered accessibility testing workflow: each layer catches a different class of issue, and the layers downstream of CI are the ones that close the irreducible ~42% of WCAG 2.2 AA criteria automation cannot reach.

Mental model: three coverage axes that do not collapse

Before picking tools, separate three axes that conversations routinely conflate.

Criteria coverage — what fraction of WCAG 2.2’s 86 success criteria any tool can give a definite pass/fail on. This number is small. Accessible.org’s analysis of WCAG 2.2 AA puts it at roughly 13% reliably testable, 45% partially testable, and 42% not automatable at all.
Issue volume — what fraction of the bugs that actually exist on real pages a tool catches. This is what the Deque coverage report measures: across 13,000+ pages and ~300,000 issues, axe-core flagged 57.38% of them.¹
Engineering pipeline phase — IDE, component test, E2E test, full-page audit, manual pass. Each phase has different blast radius and different latency budgets.

Treating “57%” and “13%” as contradictory is the most common confusion: they answer different questions. A small number of high-frequency rules (contrast, missing labels, duplicate IDs) generate most of the volume in the wild — so a low fraction of criteria can still cover a high fraction of bugs.

Important

The 57% figure measures issue volume on a real corpus, not WCAG criteria coverage. Quoting it as “57% of WCAG” is wrong, and it is the misquote that gets product owners to over-trust automation.

Layer of coverage	Catches	Notes
Static analysis (eslint-plugin-jsx-a11y)	Missing alt attributes, invalid ARIA roles/props, JSX semantic slips	36 active rules in the v6.10 line (3 more deprecated), `recommended` config; zero runtime cost.
Runtime automation (axe-core, Pa11y)	Contrast ratios, duplicate IDs, missing labels, ARIA state validity	Engines test the rendered DOM. axe-core is built around the zero false positives manifesto, so it is safe to gate CI on. Pa11y can run axe-core, HTML CodeSniffer, or both side by side.
Manual testing (keyboard, screen readers)	Focus order logic, content meaning, navigation consistency	Roughly 42% of WCAG 2.2 AA criteria cannot be automated²; focus order and alt-text quality are the canonical examples.

What automation actually catches

The Deque corpus measures what tools flag, not what tools understand. When Accessible.org slices the same problem by criterion, the picture sharpens:²

Detectability	Share of WCAG 2.2 AA	Examples	What “detection” means
High	~13%	Color contrast ratios, duplicate IDs, missing form labels	Objective technical thresholds; deterministic pass/fail. Safe for CI gates.
Partial	~45%	Heading hierarchy, link purpose, error identification	Tools detect presence but not semantic correctness. Useful as a triage signal.
None	~42%	Alt-text accuracy, focus-order logic, caption timing	Require human judgement about purpose and context. No algorithm can substitute.

Two concrete examples make this tangible:

WCAG 1.4.3 Contrast (Minimum) requires 4.5:1 for normal text and 3:1 for large text (≥18pt, or ≥14pt bold). The math is closed-form, so axe-core gives a reliable answer.
WCAG 2.4.3 Focus Order requires the tab sequence to “preserve meaning and operability.” That demands knowing what the page means, which no engine can answer.

Why axe-core’s design makes it CI-safe

axe-core is opinionated about its error budget. From the project’s manifesto:

Returns zero false positives (bugs notwithstanding). ³

Combined with the engineering team’s stated mantra that “we will treat false positives as bugs”⁴, this is what makes the engine fit-for-purpose as a deploy gate: builds will not fail for phantom issues. The flip side is that anything axe-core is unsure about lands in an incomplete bucket — and incomplete is routinely ignored in CI pipelines, which is where most teams leave coverage on the table.

A defensible WCAG 2.2 AA configuration looks like this:

1const axeConfig = {2  runOnly: {3    type: "tag",4    values: ["wcag2a", "wcag2aa", "wcag21a", "wcag21aa", "wcag22aa"],5  },6  rules: {7    "color-contrast": { enabled: true },8    region: { enabled: true },9  },10}

The runOnly tag list filters by union, not by inheritance — wcag22aa only enables the rules tagged wcag22aa (the criteria added in 2.2 AA). Explicitly listing every prior level you care about (wcag2a, wcag2aa, wcag21a, wcag21aa) is required to get full WCAG 2.2 AA coverage.

Tool deep-dive: axe-core

axe-core (v4.11.x as of early 2026) ships around one hundred rules across three categories: WCAG-mapped rules, best-practice rules, and experimental rules. Rule set is queryable with axe.getRules() because it changes per release.

Integration surface

Playwright uses @axe-core/playwright and a chainable builder:

1import { test, expect } from "@playwright/test"2import AxeBuilder from "@axe-core/playwright"34test("checkout flow accessibility", async ({ page }) => {5  await page.goto("/checkout")67  const results = await new AxeBuilder({ page })8    .withTags(["wcag2a", "wcag2aa", "wcag22aa"])9    .exclude(".third-party-widget")10    .analyze()1112  expect(results.violations).toEqual([])1314  if (results.incomplete.length > 0) {15    console.log("Manual review needed:", results.incomplete)16  }17})

Cypress has two distinct paths today, and confusing them is a frequent source of “why are my a11y reports different in Cloud?” tickets:

cypress-axe — community plugin, in-test, exposes cy.injectAxe() / cy.checkA11y(). Free, runs on every test, adds latency.
Cypress Accessibility — first-party Cypress Cloud product, out-of-test, runs axe against captured DOM snapshots after the fact. Paid, no in-test cost, separate dashboards.

1describe("Contact form a11y", () => {2  beforeEach(() => {3    cy.visit("/contact")4    cy.injectAxe()5  })67  it("meets WCAG 2.2 AA", () => {8    cy.checkA11y(null, {9      runOnly: { type: "tag", values: ["wcag22aa"] },10    })11  })1213  it("error states stay accessible", () => {14    cy.get("#email").type("invalid")15    cy.get("form").submit()16    cy.checkA11y()17  })18})

React can run axe in development with @axe-core/react. Keep it gated on process.env.NODE_ENV !== "production" so it never ships:

1import React from "react"2import ReactDOM from "react-dom/client"3import App from "./App"45if (process.env.NODE_ENV !== "production") {6  import("@axe-core/react").then((axe) => {7    axe.default(React, ReactDOM, 1000)8  })9}1011ReactDOM.createRoot(document.getElementById("root")).render(<App />)

Result categories — and the trap

axe-core returns four buckets, and a CI integration that only consumes violations is leaving signal on the floor.

axe-core result categories — violations fail the build, incomplete is the highest-value manual triage queue — axe-core returns four buckets per run; CI pipelines that only consume `violations` discard the `incomplete` heuristics that are the highest-value queue for manual review.

Category	Meaning	Recommended CI action
violations	Definite failures	Fail build.
passes	Definite passes	None (useful as a coverage telemetry).
incomplete	Heuristics fired, needs human review	Annotate PR, queue for next manual pass.
inapplicable	Rule did not match anything on page	None (telemetry that the rule was wired).

Tip

Pipe incomplete results into a tracker (Linear / Jira label, or a Slack digest) instead of dropping them. These are exactly the items axe-core thinks might be wrong; they are the highest-value queue for manual triage.

Tool deep-dive: Pa11y and HTML CodeSniffer

Pa11y 9 and Pa11y-CI 4 (current minor lines as of early 2026) sit beside axe-core rather than replacing it. Pa11y’s value comes from being able to run two independent rule engines side by side: HTML CodeSniffer (its historical default) and axe-core. Pa11y-CI 4 requires Node.js ≥ 20 and embeds Pa11y 9, which itself ships axe-core 4.10+.⁵

Why run two engines

Aspect	Pa11y default (HTMLCS)	axe-core
Result model	Violations only	Violations + passes + incomplete + inapplicable
Philosophy	Conservative, definite issues	Zero-false-positive doctrine, surfaces uncertainty
Rule overlap	~70 checks, partial overlap with axe	~100 rules, partial overlap with HTMLCS
False positives	Moderate	Very low (tracked as bugs)

The two engines disagree on real pages — different rule implementations, different heuristics for ARIA state, different opinions on landmarks. Running runners: ['axe', 'htmlcs'] recovers the union, and is the configuration the Pa11y maintainers explicitly support.

Pa11y-CI as a deploy-time gate

1{2  "defaults": {3    "runners": ["axe", "htmlcs"],4    "standard": "WCAG2AA",5    "timeout": 30000,6    "wait": 10007  },8  "urls": [9    "http://localhost:3000/",10    "http://localhost:3000/contact",11    "http://localhost:3000/checkout"12  ]13}

A minimal GitHub Actions wiring:

1name: Accessibility2on: [push, pull_request]34jobs:5  a11y:6    runs-on: ubuntu-latest7    steps:8      - uses: actions/checkout@v49      - uses: actions/setup-node@v410        with:11          node-version: "22"12      - run: npm ci13      - run: npm run build14      - run: npm run preview &15      - run: npx wait-on http://localhost:43211617      - name: Pa11y CI18        run: npx pa11y-ci1920      - name: Upload results21        if: failure()22        uses: actions/upload-artifact@v423        with:24          name: pa11y-report25          path: pa11y-ci-results.json

Pa11y scans rendered DOM, so the page must be ready before the first assertion. Use wait for static delays or actions to drive the page to a meaningful state:

1{2  "urls": [3    {4      "url": "http://localhost:3000/login",5      "actions": [6        "set field #email to test@example.com",7        "set field #password to password123",8        "click element #submit",9        "wait for element #dashboard to be visible"10      ]11    }12  ]13}

eslint-plugin-jsx-a11y (36 active rules in the current 6.10 line, plus three deprecated ones — accessible-emoji, label-has-for, no-onchange) walks the JSX AST at lint time and catches the obvious mistakes before they reach the test runner. Zero runtime cost, sub-second feedback in the editor. It is the cheapest part of the pipeline and easy to under-configure.

The recommended config buckets the rules:

Bucket	Representative rules
Alternative text	`alt-text`, `img-redundant-alt`
ARIA validity	`aria-props`, `aria-proptypes`, `aria-role`, `aria-unsupported-elements`, `role-has-required-aria-props`, `role-supports-aria-props`
Semantic HTML	`anchor-has-content`, `anchor-is-valid`, `heading-has-content`
Interaction	`click-events-have-key-events`, `no-static-element-interactions`, `interactive-supports-focus`
Labels	`label-has-associated-control`, `control-has-associated-label`

A working config that handles the two most common framework footguns (router Link components, controlled div role="button" widgets):

1{2  "extends": ["eslint:recommended", "plugin:react/recommended", "plugin:jsx-a11y/recommended"],3  "plugins": ["jsx-a11y"],4  "rules": {5    "jsx-a11y/anchor-is-valid": [6      "error",7      { "components": ["Link"], "specialLink": ["to"] }8    ],9    "jsx-a11y/no-static-element-interactions": [10      "error",11      { "allowExpressionValues": true, "handlers": ["onClick"] }12    ]13  }14}

What lint cannot see

Static analysis works on what is written, not on what renders. The systematic blind spots are:

Runtime focus management — focus moved by an effect, returned on close, trapped in modals.
Live-region behaviour — aria-live polite-vs-assertive at runtime.
Dynamic content quality — alt={user.bio} is correct lint-wise but semantically empty if user.bio === "".
Cross-component label associations — <Field label={...}/> where the for/id wiring lives in another file.
Third-party components — anything that renders outside JSX you control.

Treat lint as the floor, not the ceiling. Its job is to keep developers from spending CI cycles on issues caught at keystroke.

Browser extensions: WAVE and axe DevTools

Browser extensions are interactive tools for humans, not pipeline components.

WAVE

Runs entirely client-side — safe for intranets, behind-auth pages, and offline assets.
Visual overlay annotates issues directly on the page, useful for non-engineering stakeholders.
Higher false-positive rate than axe-core; not appropriate for CI.
Best use: developer education and walking PMs / designers through real issues on their own pages.

axe DevTools (browser extension)

Same engine as @axe-core/playwright, so results match CI exactly.
“Intelligent Guided Tests” (IGT) walk you through semi-automated checks for incomplete items — colour-meaning use, link-purpose, focus-order. This is how Deque demonstrates ~80% issue coverage when you combine the engine with guided manual review.
Best use: triaging an incomplete from CI, debugging a specific violation in context.

Lighthouse: useful, not authoritative

Lighthouse uses axe-core under the hood but only ships a subset of its rules — somewhere around 50–60 audits as of early 2026⁶ — and weights them with axe’s user-impact ratings. That makes Lighthouse a fast health check, not a compliance signal.

Lighthouse reports	Full axe-core also catches
Common contrast issues	All contrast permutations
Missing form labels	Label association edge cases
Alt text presence	Alt text in SVGs, custom components
Basic ARIA validity	Complex ARIA widget patterns
Document language	Per-region language overrides

Use Lighthouse for the dev-loop “is this page in the ballpark” question and as a Lighthouse-CI assertion bundle (see the Quality Gates section). Use axe-core directly when you need to trust the result.

Manual testing: the irreducible ~42%

Roughly 42% of WCAG 2.2 AA criteria cannot be automated²; the partially detectable bucket (~45%) still needs human judgement to interpret. Two passes carry most of the weight: keyboard navigation and screen readers.

Engines cannot tell you whether a tab order makes sense. The protocol is mechanical and short.

Keys to exercise:

Tab / Shift+Tab — forward and backward through interactive elements.
Enter / Space — activate buttons and links.
Arrow keys — within widgets (menus, tabs, listbox, autocomplete).
Escape — close modals, dropdowns, dismiss popovers.

What to verify on every page:

Coverage — every action you can complete with a mouse you can also complete with a keyboard.
Order — tab sequence follows visual layout (top-to-bottom, left-to-right in LTR).
Visible indicator — every focusable element has a visible focus style. The presence requirement is WCAG 2.4.7 Focus Visible (Level AA); shape and contrast thresholds (≥2 CSS-px equivalent perimeter, ≥3:1 contrast) live in WCAG 2.4.13 Focus Appearance (Level AAA).
Not entirely obscured — sticky headers, cookie banners, and bottom sheets must not completely cover the focused element. This is WCAG 2.4.11 Focus Not Obscured (Minimum), new in WCAG 2.2 at Level AA.
No traps — Tab always escapes; the WCAG 2.1.2 No Keyboard Trap bar.
Modal focus — focus moves into the modal on open, is trapped inside until close, returns to the trigger on close.

Caution

The most common single bug in SPAs: route changes leave focus on the link that was clicked. Users hear nothing announced and must Tab through the entire previous page to reach new content. Move focus to the new view’s main heading or main element after every navigation.

1function handleRouteChange() {2  const main = document.querySelector("main")3  if (!main) return4  main.setAttribute("tabindex", "-1")5  main.focus()6  main.addEventListener("blur", () => main.removeAttribute("tabindex"), { once: true })7}

Screen readers reveal everything that the accessibility tree exposes — mislabelled fields, broken associations, missing live-region announcements. The matrix is platform-pinned.

Surface	Screen reader	Primary share (WebAIM #10, 2024)	When to test
Windows	JAWS	40.5%	Required — largest desktop share, especially in regulated enterprises.
Windows	NVDA	37.7%	Required — strict DOM follower, exposes structural defects.
macOS	VoiceOver	9.7%	Required for Apple-skewed audiences; default on macOS.
iOS	VoiceOver	70.6% (mobile)	Required for any consumer mobile flow.
Android	TalkBack	34.7% (mobile)	Required for any consumer mobile flow.

Source: WebAIM Screen Reader User Survey #10 (January 2024). Note the desktop trio (JAWS / NVDA / VoiceOver) and mobile pair (VoiceOver iOS / TalkBack) are reported on different bases — respondents could pick a primary on each surface.

Note

NVDA strictly follows the DOM and accessibility tree. JAWS uses heuristics to infer missing information — which often masks defects in production but improves real-world usability. Test on both: NVDA reveals the structural problem, JAWS tells you whether your users are noticing it yet.

Per-page protocol:

Read the entire page in browse / reading mode (not just interactive elements).
Walk the primary user flow with the screen reader on (forms, checkout, search).
Verify dynamic content announces — live regions, error states, “loading” / “loaded” transitions.
Force errors and verify recovery: can the user understand the error and fix it without sighted help?

Content quality assessment

These are the criteria where automation gives you nothing usable.

Alternative text (1.1.1) — does the alt convey the image’s purpose in context? “Graph” passes lint. “Sales rose 25% from Q1 to Q3” actually communicates.
Meaningful sequence (1.3.2) — when CSS positioning is ignored, is the DOM order still coherent? Screen readers follow DOM, not visual layout.
Link purpose (2.4.4) — can a user understand each link out of context? “Click here” fails; “Download annual report (PDF, 2.4 MB)” succeeds.
Error suggestions (3.3.3) — does the error tell the user how to recover? “Invalid input” fails; “Email must contain @” succeeds.

Triage and prioritisation

Not all violations are equal. The two axes that matter operationally are user impact and WCAG conformance level.

Severity framework

Severity	Definition	Examples	Response
Critical	Complete barrier — task cannot be completed	No keyboard access to submit, missing form labels, keyboard trap	Fix immediately
Serious	Major difficulty — task very hard	Poor contrast, confusing focus order, missing error identification	Fix this sprint
Moderate	Inconvenience — task is harder	Redundant alt, minor contrast issues, verbose labels	Fix this quarter
Minor	Best practice — not a barrier	Missing landmark roles, suboptimal heading levels	Backlog

axe-core attaches its own impact field (minor → critical); it is a defensible default seed for triage. Map it to the table above; do not invent a parallel severity scheme.

WCAG level mapping

WCAG level	User impact	Legal exposure (typical)
Level A	Complete barriers — assistive tech fails	High; baseline conformance.
Level AA	Significant barriers — tasks very hard	High; the regulatory target most jurisdictions adopt (EAA, ADA case law, Section 508 by reference).
Level AAA	Maximum accessibility — specialised	Low; not universally required.

Issue documentation

Track issues with enough context that a developer can fix without re-running the audit:

1## Issue: Missing label on email input23**Severity**: Critical (axe impact: critical)4**WCAG**: 1.3.1 Info and Relationships (Level A)5**Page**: /checkout6**Tool**: axe-core (violations[0])78### Description910Email input field has no programmatic label. Screen-reader users cannot identify the field's purpose.1112### Current markup1314`<input type="email" name="email" placeholder="Email">`1516### Recommended fix1718`<label for="checkout-email">Email address</label>`19`<input type="email" id="checkout-email" name="email">`2021### Verification2223- [ ] axe-core passes24- [ ] NVDA announces "Email address, edit"25- [ ] Label visible and associated

CI/CD pipeline architecture

The shape that holds up in practice is a three-stage funnel. Each stage runs only what makes sense for its latency budget.

Three-stage accessibility CI/CD funnel — pre-commit lint, PR-time component and E2E axe runs, pre-deploy Pa11y and Lighthouse audits — Three-stage accessibility funnel: each stage's tool is matched to its latency budget; outputs split into a build-failing `violations` lane and a manual-triage lane.

Stage 1 — pre-commit (development time)

1{2  "lint-staged": {3    "*.{js,jsx,ts,tsx}": ["eslint --fix"]4  },5  "husky": {6    "hooks": {7      "pre-commit": "lint-staged"8    }9  }10}

eslint-plugin-jsx-a11y catches static violations before code enters the repo. Pre-commit must finish in seconds, so this is the only a11y check that fits.

Stage 2 — pull request (automated tests)

1name: PR Checks2on: pull_request34jobs:5  test:6    runs-on: ubuntu-latest7    steps:8      - uses: actions/checkout@v49      - uses: actions/setup-node@v410        with:11          node-version: "22"12      - run: npm ci1314      - name: Lint (includes a11y rules)15        run: npm run lint1617      - name: Component + unit tests (jest-axe)18        run: npm test1920      - name: E2E tests (Playwright + axe-core)21        run: npx playwright test

PR-time runs axe-core on the rendered DOM that your tests already exercise. This is where most regressions get caught. Keep the suite small enough that every PR runs it; expand only when you can hold it under the team’s “PR feels fast” threshold.

Stage 3 — pre-deploy (full audit)

1name: Deploy2on:3  push:4    branches: [main]56jobs:7  audit:8    runs-on: ubuntu-latest9    steps:10      - uses: actions/checkout@v411      - uses: actions/setup-node@v412        with:13          node-version: "22"14      - run: npm ci15      - run: npm run build16      - run: npm run preview &17      - run: npx wait-on http://localhost:43211819      - name: Pa11y-CI (dual runner)20        run: npx pa11y-ci2122      - name: Lighthouse CI23        run: npx lhci autorun2425  deploy:26    needs: audit27    runs-on: ubuntu-latest28    steps:29      - name: Deploy to production30        run: ./deploy.sh

Pa11y-CI’s dual-runner pass and Lighthouse-CI’s accessibility category catch what unit-level axe-core invocations missed because their fixtures didn’t exercise that path. This is also the natural place to archive the JSON reports for audit trail.

Quality gates

Set thresholds that prevent regressions without forcing a perfection-or-block stance.

1module.exports = {2  ci: {3    assert: {4      assertions: {5        "categories:accessibility": ["error", { minScore: 0.9 }],6        "color-contrast": "error",7        "document-title": "error",8        "html-has-lang": "error",9        "meta-viewport": "error",10      },11    },12  },13}

Pa11y-CI’s threshold works the same way — the issue count it tolerates per URL.

1{2  "defaults": {3    "threshold": 04  }5}

Zero is the right target for new codebases. For legacy ones, snapshot today’s count and ratchet down:

1{2  "defaults": {3    "threshold": 154  }5}

Drop the threshold by one every time you fix a violation; never raise it.

Practical takeaways

Treat the four numbers — Deque’s 57% issue volume, Accessible.org’s 13/45/42% criterion split, axe-core’s ~100 rules, Lighthouse’s ~57 audits — as a vocabulary, not interchangeable claims.
Use eslint-plugin-jsx-a11y as the floor. It is free in latency and catches the obvious mistakes that should never burn CI cycles.
Use axe-core as the spine. Gate PR builds on violations, but pipe incomplete somewhere a human will read it.
Use Pa11y-CI’s dual-runner mode pre-deploy to recover the union of axe + HTMLCS coverage.
Use Lighthouse for fast feedback and as a Lighthouse-CI score floor; do not treat it as compliance evidence.
Manual keyboard and screen-reader passes are non-negotiable for every release. Define them as a checklist, time-box them, and assign them.
Test JAWS, NVDA, and VoiceOver on desktop; VoiceOver and TalkBack on mobile. The market splits roughly 4:4:1 on desktop and 7:3 on mobile. ⁷

Appendix

Prerequisites

Familiarity with the WCAG 2.2 success-criterion structure (Level A / AA / AAA).
Comfort with at least one JS test runner (Jest, Vitest, Playwright, Cypress).
Working understanding of assistive-technology categories (screen reader, switch device, voice control).
Basic CI/CD pipeline literacy (GitHub Actions or equivalent).

Terminology

axe-core — open-source accessibility engine maintained by Deque; powers most modern integrations.
HTML CodeSniffer (HTMLCS) — alternative rule engine, default runner inside Pa11y.
AST (Abstract Syntax Tree) — code representation that eslint-plugin-jsx-a11y walks.
incomplete (axe-core) — verdict for cases where heuristics fired but cannot be sure; not a definite violation.
shift-left — moving a quality check earlier in the pipeline (lint catches it before runtime does).
quality gate — automated check that blocks progression (merge, deploy) when a threshold is breached.

Summary

57% is volume, not coverage. The Deque corpus measures which bugs tools flag, not which criteria they cover.
~13% of WCAG 2.2 AA is reliably automatable, ~45% partial, ~42% manual. That is the ceiling, not a number you can engineer your way past.
axe-core’s “zero false positives” doctrine is what makes it CI-safe. Its incomplete bucket is the highest-value queue for manual triage.
Pa11y-CI 4 + dual runners recovers axe / HTMLCS overlap.
eslint-plugin-jsx-a11y 6.10 ships 36 active rules (+ 3 deprecated). Cheapest layer; do not skip it.
Manual testing is irreducible. Keyboard pass + screen-reader pass on every release.
Screen-reader matrix — JAWS and NVDA are roughly tied on desktop (40.5% / 37.7%); VoiceOver dominates mobile (70.6%) with TalkBack a sizeable second (34.7%).
Triage by user impact and WCAG level, not by tool category.

References

Specifications

WCAG 2.2 W3C Recommendation — normative success criteria.
Understanding WCAG 2.2 — techniques and intent for each criterion.
W3C ACT Rules — Accessibility Conformance Testing rule format.
What’s New in WCAG 2.2 — the 9 new criteria, the 4.1.1 removal.

Official documentation

axe-core GitHub repository — source, manifesto, release notes.
axe-core rule descriptions — current rule list.
axe-core API documentation — axe.run, getRules, configuration shape.
Pa11y documentation — runners, configuration, actions.
Pa11y-CI repository — CI integration and migration notes.
eslint-plugin-jsx-a11y repository — rule list and configs.
Playwright accessibility testing — @axe-core/playwright.
Cypress Accessibility documentation — Cloud product.
Lighthouse accessibility scoring — weighting model.

Research

Deque: Automated Testing Identifies 57% of Issues — methodology and headline.
The Automated Accessibility Coverage Report (Deque) — full report.
Accessible.org: What Percentage of WCAG Can Be Automated? — criteria-level analysis.
WebAIM Screen Reader User Survey #10 — 2024 market share data.
WebAIM Million — annual analysis of homepage accessibility.

Tools

axe DevTools browser extension — interactive testing, IGT.
WAVE Evaluation Tool — visual overlay extension.
NVDA — free Windows screen reader.
JAWS — commercial Windows screen reader.

The full report — The Automated Accessibility Coverage Report — analyzes anonymized audit data from 2,000+ audits across 13,000+ pages and ~300,000 issues, with axe-core as the engine. The headline figure is 57.38% of issues by volume. Deque (2021). ↩
Accessible.org’s analysis of WCAG 2.2 AA’s 55 success criteria splits 7 (~13%) reliably automatable, 25 (~45%) partially detectable, and 23 (~42%) not detectable by automation. ↩ ↩² ↩³
axe-core README — About Axe (manifesto). The manifesto states the engine “returns zero false positives (bugs notwithstanding)”. ↩
Deque: axe-core 4.1 release notes. The team’s stated mantra is that “we will treat false positives as bugs.” ↩
Pa11y-CI v4 migration notes. v4 requires Node ≥ 20 and embeds Pa11y 9, which itself ships axe-core 4.10+. ↩
DebugBear — Understanding Lighthouse accessibility audit reports and Chrome for Developers — Lighthouse accessibility scoring show Lighthouse uses axe-core with a curated subset (around 50–60 audits, version-dependent) and weights them by axe’s user-impact ratings. ↩
WebAIM Screen Reader User Survey #10 (January 2024). Desktop primary screen reader: JAWS 40.5%, NVDA 37.7%, VoiceOver 9.7%. Mobile screen reader: VoiceOver 70.6%, TalkBack 34.7%. ↩