Design a Notification System

A notification platform sits between every product surface that needs to interrupt a user — security alerts, transactional confirmations, social signals, marketing — and three classes of opinionated downstream: device push providers (APNs, FCM, Web Push), email transports, and SMS carriers. Each downstream has its own rate limits, error semantics, and reputation rules. The platform’s job is to absorb bursty producer traffic, respect per-user preferences and quiet hours, deduplicate, retry, and converge on a single coherent delivery — without becoming the reason a user uninstalls the app.

This article is a deep design pass for a senior engineer who needs to either build that platform from scratch or reason about an existing one. It assumes you are comfortable with Kafka partitioning, Cassandra time-series modeling, Redis primitives, and at-least-once semantics; it spends its weight where the non-obvious failure modes live: fan-out at the producer boundary, priority inversion, deduplication windows, channel fallback, aggregation, and the operational reality of FCM/APNs at scale.

High-level architecture (ingress): producers publish to the notification API; validation and enrichment write into Kafka and per-priority queues.

High-level architecture (routing and delivery): the router applies preferences and throttling; channel processors hand off to APNs, FCM, SMTP, and Twilio.

Mental model

Notification systems solve three interlocking problems:

Reliable delivery — a notification accepted at the API edge must eventually reach the device, or end up explicitly dropped with a recorded reason. Exactly-once is unattainable across heterogeneous downstreams; the practical contract is at-least-once with idempotent consumers (Twilio Segment, “Delivering billions of messages exactly once”).
User respect — preferences, quiet hours, frequency caps, and aggregation. Brands that manage frequency see materially longer customer lifetimes (Braze on frequency capping); the platform owns the cross-channel cap.
Channel optimization — pick the right channel for the message at the right time. APNs, FCM, SMTP, and SMS each have distinct latency, cost, deliverability, and rate-limit profiles (FCM throttling and quotas, APNs provider API).

Core architectural decisions:

Decision	Choice	Rationale
Delivery guarantee	At-least-once + idempotent consumers	Exactly-once is impractical across APNs/FCM/SMTP/SMS; deduplicate at the consumer.
Queue partitioning	By `user_id`	Co-locates a user’s notifications for rate limiting, aggregation, and ordering inside the partition.
Priority handling	Separate topic per priority	Critical notifications bypass the bulk-send backlog and avoid head-of-line blocking.
Channel selection	User preference, then fallback chain	Respect explicit choice; ensure delivery for `critical` regardless of channel preference.
Rate limiting	Token bucket per user per channel	Allows controlled bursts without exceeding long-term cap; matches provider 429 semantics.
Template rendering	At ingestion	Freezes content at send so deduplication, retries, and audit logs reference the same payload.

Trade-offs you accept by adopting this shape:

Higher per-event latency from preference and dedup lookups in exchange for user control.
Storage overhead for a deduplication window (hours to days) plus delivery-status fan-out.
Multiple channel processors instead of one delivery loop — more code, more isolation.
At-least-once means clients must tolerate occasional duplicates.

Requirements

Functional requirements

Requirement	Priority	Notes
Multi-channel delivery	Core	Push (iOS/Android/Web), email, SMS, in-app.
User preferences	Core	Opt-in/out per category and per channel.
Template management	Core	Variable substitution, locale, version history.
Scheduling	Core	Immediate, scheduled, timezone-aware delivery.
Delivery tracking	Core	`accepted` → `sent` → `delivered` → `opened` / `clicked`.
Rate limiting	Core	User-level and channel-level throttling.
Retry and fallback	Core	Bounded retries with exponential backoff; channel fallback.
Notification history	Extended	Queryable per-user log for product surface and support.
Batching/aggregation	Extended	Collapse similar notifications (“5 new likes”).
Quiet hours	Extended	Per-user do-not-disturb windows in user-local timezone.

Non-functional requirements

Requirement	Target	Rationale
Availability	99.99% (4 nines)	Notifications are critical for engagement and security workflows.
Delivery latency (critical)	p99 < 500 ms (server-side)	Time-sensitive alerts (security, transactions) must feel synchronous.
Delivery latency (normal)	p99 < 5 s (server-side)	Acceptable for social and promotional traffic.
Throughput	1M notifications/sec peak	Consumer-scale enterprise (Uber, LinkedIn, Slack tier).
Deduplication window	24–48 hours (per producer SLA)	Balances storage vs. duplicate prevention; longer is fine if storage allows.
Delivery rate	> 99.9% (after retries)	After bounded retries and channel fallback.

Note

Server-side delivery latency only measures up to the provider acknowledgement. Actual on-device delivery depends on the carrier (SMS), the OS power state (push), and the user’s mail client (email) and is outside our control.

Scale estimation

Users:

Monthly active users: 100M.
Daily active users: 40M (40% of MAU).
Devices per user: 2 (mobile + web).
Push tokens to manage: ~200M.

Traffic:

Notifications per active user per day: 25 (mix of transactional and engagement).
Daily volume: 40M × 25 = 1B notifications/day.
Average rate: 1B / 86 400 ≈ 12K notifications/sec.
Peak (3× average): ~36K notifications/sec.
Burst events (flash sales, breaking news): 100K+ notifications/sec.

Storage:

Notification record: ~500 B (metadata, status, timestamps).
Daily storage: 1B × 500 B = 500 GB/day.
90-day retention: ~45 TB.
Deduplication cache: 1B keys/day × 2 days (48-hour window) × 32 B per key ≈ 64 GB hot working set.
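The arithmetic above can be reproduced with a small capacity model, which makes it easy to re-run when an input changes. This is a sketch: the function and field names are illustrative, and sizes are decimal (1 GB = 1e9 bytes).

```typescript
// Back-of-envelope capacity model. Inputs mirror the figures in the lists above.
interface ScaleInputs {
  dailyActiveUsers: number
  notificationsPerUserPerDay: number
  peakMultiplier: number
  recordBytes: number
  retentionDays: number
  dedupWindowHours: number
  dedupKeyBytes: number
}

interface ScaleEstimate {
  dailyVolume: number
  avgPerSec: number
  peakPerSec: number
  dailyStorageBytes: number
  retentionBytes: number
  dedupBytes: number
}

function estimateScale(i: ScaleInputs): ScaleEstimate {
  const dailyVolume = i.dailyActiveUsers * i.notificationsPerUserPerDay
  const avgPerSec = dailyVolume / 86400
  const dailyStorageBytes = dailyVolume * i.recordBytes
  return {
    dailyVolume,
    avgPerSec,
    peakPerSec: avgPerSec * i.peakMultiplier,
    dailyStorageBytes,
    retentionBytes: dailyStorageBytes * i.retentionDays,
    // Keys alive at once = daily volume x window length in days.
    dedupBytes: dailyVolume * (i.dedupWindowHours / 24) * i.dedupKeyBytes,
  }
}

const scaleEstimate = estimateScale({
  dailyActiveUsers: 40e6,
  notificationsPerUserPerDay: 25,
  peakMultiplier: 3,
  recordBytes: 500,
  retentionDays: 90,
  dedupWindowHours: 48,
  dedupKeyBytes: 32,
})
```

The unrounded average is about 11.6K/sec, so the computed peak lands slightly under the ~36K/sec quoted above, which multiplies the rounded 12K figure.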

External provider capacity:

FCM HTTP v1: default quota 600 000 messages per minute per Firebase project (roughly 10K/sec sustained), enforced by a one-minute token bucket; overflow returns HTTP 429 RESOURCE_EXHAUSTED (Firebase: throttling and quotas).
APNs: no published numeric rate limit; Apple throttles or sends GOAWAY on connections that exhibit abusive patterns, and recommends keeping HTTP/2 connections open and reusing them rather than opening and closing them repeatedly (APNs provider API).
Amazon SES: account-level send rate is adjustable through Service Quotas and ramps with reputation; sandbox accounts are capped at 1 message/sec and 200/24h. Dedicated IPs auto-warm over a 45-day schedule (SES sending quotas, Dedicated IP warming).
Twilio SMS: short codes default to 100 MPS; A2P 10DLC long-code throughput varies by Brand Trust Score, from ~12 SMS MPS (low trust) to ~225 SMS MPS (high trust) across major US carriers (Twilio: A2P 10DLC throughput).

Important

Provider quotas are not symmetric across carriers, regions, or product tiers. Treat them as policy variables loaded at runtime, not constants in code.

Design paths

There are three defensible base architectures. Real systems usually end up as a hybrid; understanding the pure forms makes the trade-offs explicit.

Path A: Push-based (real-time first)

Best when sub-second in-app latency is the primary constraint and your platform already maintains persistent connections to clients.

Path A (push-based flow): producer to API to priority queue to router to gateway to user device. Critical notifications skip the bulk path and go straight from the router to the persistent gateway connection.

Key characteristics:

Persistent connections (WebSocket / SSE / gRPC bidi) terminate at a stateful gateway.
The gateway maintains a connection → user_id mapping so the router can address a user without going through APNs/FCM.
Direct delivery for in-app traffic; APNs/FCM still required for background/closed-app delivery.

Trade-offs:

Lowest in-app latency (often < 100 ms).
No external provider cost for in-app traffic.
Bidirectional channel for read/clear acknowledgements.
Connection management is non-trivial — load-balancing sticky-session traffic, recovering after gateway restarts, and coalescing reconnect storms (Uber RAMEN gRPC migration).
Higher infrastructure cost from carrying persistent connections.

Reference implementation: Uber’s RAMEN (Real-time Asynchronous MEssaging Network) reached 1.5M+ concurrent connections and 250K+ messages per second on its SSE-based architecture before migrating to gRPC bidirectional streaming for real-time acknowledgments and head-of-line-blocking-free heartbeats (Uber: real-time push platform, Uber: next-gen push platform on gRPC).

Path B: Queue-based (reliability first)

Best when delivery guarantee dominates latency and you need a strong audit trail.

Path B (queue-based flow): every notification flows through a durable Kafka topic, consumed by a worker pool with a retry service and a dead-letter queue.

Key characteristics:

All notifications flow through a durable log (Kafka, Pulsar, or NATS JetStream).
Workers consume at their own pace, with built-in retry and a dead-letter queue.
Kafka retention is the audit trail.

Trade-offs:

Strong delivery guarantee — no message lost on a worker crash.
Excellent burst absorption — the queue is the buffer.
Higher per-event latency (queue hop overhead, batching).
Ordering is only guaranteed inside a partition (Apache Kafka docs); cross-partition ordering requires application-level sequencing.
Risk of notification storms after recovery — long backlogs can replay all at once and overwhelm downstream providers.

Reference implementation: Slack runs notification delivery on Kafka-backed pipelines with 100% trace coverage per notification, treating each notification_id as a trace_id and using span links to connect the originating message to all downstream notifications (Slack engineering: tracing notifications).

Path C: Hybrid (tiered by priority)

Best when notification mix is heterogeneous — some traffic is time-critical, most is bulk.

Path C (hybrid priority routing): notifications are classified at ingress and routed to one of four priority paths. Critical traffic uses synchronous push; high, normal, and low traffic flows through tiered queues with matching SLAs.

Key characteristics:

Priority classified at ingestion based on category.
Each priority has its own topic, partition count, worker pool, and SLA.
Bulk traffic batches and is allowed to defer to off-peak windows.

Trade-offs:

Optimal latency for critical traffic at acceptable cost for bulk.
Predictable per-tier SLAs — operators can scale per-priority workers independently.
More code paths and configuration to maintain.
Risk of priority inversion under contention if the priority classifier is wrong or shared resources (Redis, dedup store) are saturated.

Reference implementation: Netflix’s RENO (Rapid Event Notification System) uses priority-segmented Amazon SQS queues with dedicated compute clusters per priority and a hybrid push (Zuul Push) plus pull (Cassandra-backed history) delivery model — the segmentation contains failures so a slow path does not block a fast one.
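A minimal sketch of classification at ingestion, assuming four tiers. The topic names, the category-to-priority ceilings, and the `high` latency target are illustrative assumptions; only the critical (500 ms) and normal (5 s) targets come from the requirements above.

```typescript
type Priority = "critical" | "high" | "normal" | "low"

interface TierPolicy {
  topic: string
  targetP99Ms: number | null // null = no latency target; may defer to off-peak
}

// Hypothetical topic names; the "high" target is an assumption.
const TIER_POLICY: Record<Priority, TierPolicy> = {
  critical: { topic: "notifications.critical", targetP99Ms: 500 },
  high: { topic: "notifications.high", targetP99Ms: 2000 },
  normal: { topic: "notifications.normal", targetP99Ms: 5000 },
  low: { topic: "notifications.low", targetP99Ms: null },
}

// Hypothetical category ceilings: a producer may request a lower priority
// than its category allows, never a higher one.
const CATEGORY_CEILING = new Map<string, Priority>([
  ["security", "critical"],
  ["order_updates", "high"],
  ["social", "normal"],
  ["marketing", "low"],
])

const PRIORITY_ORDER: Priority[] = ["critical", "high", "normal", "low"]

function classify(
  category: string,
  requested?: Priority,
): { priority: Priority; policy: TierPolicy } {
  const ceiling = CATEGORY_CEILING.get(category) ?? "normal"
  const wanted = requested ?? ceiling
  // A smaller index means a higher priority; clamp requests to the ceiling.
  const priority =
    PRIORITY_ORDER.indexOf(wanted) < PRIORITY_ORDER.indexOf(ceiling) ? ceiling : wanted
  return { priority, policy: TIER_POLICY[priority] }
}
```

Clamping at the ceiling is one way to limit the misclassification risk noted in the trade-offs: a marketing producer that asks for `critical` still lands on the low-priority topic.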

Path comparison

Factor	Push-based	Queue-based	Hybrid
Latency (critical)	< 100 ms	500 ms–2 s	< 100 ms
Latency (bulk)	Same as critical	Same as critical	Flexible (off-peak)
Reliability	Good	Excellent	Excellent
Burst absorption	Limited	Excellent	Excellent
Infrastructure cost	High	Medium	Medium-high
Operational complexity	High	Medium	Highest
Production reference	Uber RAMEN	Slack	Netflix RENO

What this article designs

The rest of the article designs Path C (Hybrid) end-to-end, because:

It reflects what production systems at scale converge to (Netflix, LinkedIn, Pinterest).
It forces you to make the priority and trade-off thinking explicit.
It handles the real notification mix — security alerts to weekly digests — without two separate systems.
The pure-push and pure-queue paths fall out as degenerate cases.

High-level design

Component overview

Ingress and queueing:

Routing, delivery, and storage:

Component overview (routing and delivery): priority topics feed the router, which applies dedup, throttling, and aggregation and dispatches to per-channel processors that wrap APNs, FCM, SES, and Twilio, with the storage backends behind them.

Notification API

The producer-facing surface. Validates, enriches, and routes to the correct priority topic.

Responsibilities:

Authenticate the producer (mTLS or signed JWT).
Validate the request against the template’s variable schema.
Resolve the template to its current version and render at ingestion.
Look up the user’s preferences and current device tokens.
Classify priority and route to the matching topic, keyed by user_id.

Design decisions:

Decision	Choice	Rationale
API style	REST with `202 Accepted`	Producers fire-and-forget; status is queryable / webhook-pushed.
Idempotency	Producer-supplied `notificationId`	Enables safe producer retries; drives downstream dedup.
Batching	Up to 1 000 recipients per request	Reduces API overhead for bulk sends without losing per-recipient addressability.
Template rendering	At ingestion (not at send)	Freezes content; downstream can replay, dedup, and audit identical payloads.
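The batching and keying decisions can be sketched as a function that expands one API request into per-recipient queue records. The record shape, the topic naming, and the `dedupKey` format (modelled on the `dedup:{user_id}:{notification_id}` key used by the router) are assumptions for illustration.

```typescript
type Priority = "critical" | "high" | "normal" | "low"

interface Recipient {
  userId: string
  variables: Record<string, string>
}

interface SendRequest {
  notificationId: string
  templateId: string
  priority: Priority
  recipients: Recipient[]
}

interface QueueRecord {
  topic: string
  key: string // message key; records with the same key share a partition
  value: {
    dedupKey: string
    templateId: string
    userId: string
    variables: Record<string, string>
  }
}

const MAX_RECIPIENTS_PER_REQUEST = 1000

function expandRequest(req: SendRequest): QueueRecord[] {
  const n = req.recipients.length
  if (n === 0 || n > MAX_RECIPIENTS_PER_REQUEST) {
    throw new RangeError(
      `recipients must contain 1..${MAX_RECIPIENTS_PER_REQUEST} entries, got ${n}`,
    )
  }
  // One record per recipient keeps each recipient individually addressable.
  return req.recipients.map((r) => ({
    topic: `notifications.${req.priority}`, // hypothetical topic naming
    key: r.userId,
    value: {
      dedupKey: `${r.userId}:${req.notificationId}`,
      templateId: req.templateId,
      userId: r.userId,
      variables: r.variables,
    },
  }))
}
```

Keying every record by `userId` is what lets a single partition consumer see all of a user's traffic for rate limiting and aggregation.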

Template service

Manages multi-channel templates with variable substitution, locale, and version history.

```typescript
interface NotificationTemplate {
  templateId: string
  name: string
  category: "transactional" | "marketing" | "system"
  channels: {
    push?: {
      title: string // "Your order {{orderId}} has shipped"
      body: string // "Track your package: {{trackingUrl}}"
      data?: Record<string, string>
    }
    email?: {
      subject: string
      htmlBody: string
      textBody: string
    }
    sms?: {
      body: string // Max 160 chars for single GSM-7 segment
    }
  }
  variables: VariableDefinition[]
  defaultLocale: string
  translations: Record<string, ChannelContent>
}
```

Design decisions:

Templates stored in PostgreSQL; render-path Redis cache with a 5-minute TTL.
Variable schema validated at template creation so runtime substitution cannot fail silently.
Versioned table for rollback; the rendered payload records the version it used.
Variants registered for A/B tests; the variant assignment lives in the rendered payload.
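A minimal sketch of the two validation points: a creation-time check that every `{{placeholder}}` is declared, and a render-time substitution that throws rather than emitting a half-filled message. The `{{name}}` syntax follows the template examples above; the function names are hypothetical.

```typescript
const PLACEHOLDER_SOURCE = "\\{\\{\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*\\}\\}"

// Lists the distinct variable names a template body references.
function extractVariables(template: string): string[] {
  const names: string[] = []
  const re = new RegExp(PLACEHOLDER_SOURCE, "g")
  let m: RegExpExecArray | null = re.exec(template)
  while (m !== null) {
    const name = m[1]
    if (name !== undefined && names.indexOf(name) === -1) names.push(name)
    m = re.exec(template)
  }
  return names
}

// Creation-time check: placeholders that the variable schema does not declare.
function undeclaredVariables(template: string, declared: string[]): string[] {
  return extractVariables(template).filter((n) => declared.indexOf(n) === -1)
}

// Render-time substitution; fails loudly when a value is missing.
function renderTemplate(template: string, variables: Record<string, string>): string {
  const missing = extractVariables(template).filter(
    (n) => !Object.prototype.hasOwnProperty.call(variables, n),
  )
  if (missing.length > 0) {
    throw new Error(`missing template variables: ${missing.join(", ")}`)
  }
  return template.replace(
    new RegExp(PLACEHOLDER_SOURCE, "g"),
    (_match: string, name: string) => variables[name] ?? "",
  )
}
```

For example, `renderTemplate("Your order {{orderId}} has shipped", { orderId: "ORD-456" })` yields `Your order ORD-456 has shipped`, while omitting `orderId` throws.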

Preference service

Per-user notification preferences with channel-level and category-level granularity.

```typescript
interface UserPreferences {
  userId: string
  globalEnabled: boolean
  quietHours?: {
    enabled: boolean
    start: string // "22:00"
    end: string // "07:00"
    timezone: string // IANA TZ identifier, e.g. "America/New_York"
  }
  channels: {
    push: ChannelPreference
    email: ChannelPreference
    sms: ChannelPreference
    inApp: ChannelPreference
  }
  categories: {
    [category: string]: {
      enabled: boolean
      channels: string[] // overrides global channel prefs for this category
      frequency?: "immediate" | "daily_digest" | "weekly_digest"
    }
  }
}

interface ChannelPreference {
  enabled: boolean
  frequency?: FrequencyLimit // e.g. { maxPerHour: 5, maxPerDay: 20 }
}
```

Storage strategy:

Hot path: Redis hash keyed on prefs:{user_id}, 1-hour TTL, refreshed write-through.
Canonical: PostgreSQL with append-only audit history (compliance and debugging).
Cache invalidation is write-through; explicit purges happen on PATCH.

Resolution cascade (router-side):

Preference resolution cascade: a notification is checked against global → category → channel-override → channel → frequency cap → quiet hours, with critical traffic exempt from the quiet-hours gate.

The cascade is short-circuit: the first disabled decision wins, and per-channel decisions are independent — opting out of email for the marketing category does not affect the same category’s push. Drops record a structured reason (global_off, category_off, channel_off, frequency_capped) so the analytics pipeline can attribute “not delivered” to user choice rather than infrastructure failure.
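The cascade can be sketched as a pure function over a trimmed-down preference shape. Two points are interpretations rather than statements from the design: a category-level channel list, when present, replaces the global channel toggle; and quiet hours produce a `defer` rather than a drop. Field names are simplified from `UserPreferences`.

```typescript
type Channel = "push" | "email" | "sms" | "inApp"

type Decision =
  | { action: "deliver" }
  | { action: "drop"; reason: "global_off" | "category_off" | "channel_off" | "frequency_capped" }
  | { action: "defer"; reason: "quiet_hours" }

interface QuietHours { enabled: boolean; start: string; end: string; timezone: string }

interface Prefs {
  globalEnabled: boolean
  quietHours?: QuietHours
  channels: Partial<Record<Channel, { enabled: boolean; maxPerDay?: number }>>
  categories: Partial<Record<string, { enabled: boolean; channels?: Channel[] }>>
}

function toMinutes(hhmm: string): number {
  const parts = hhmm.split(":").map(Number)
  return ((parts[0] ?? 0) % 24) * 60 + (parts[1] ?? 0)
}

// Wall-clock minutes since midnight in the user's IANA timezone.
function localMinutes(now: Date, timeZone: string): number {
  const text = new Intl.DateTimeFormat("en-GB", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).format(now)
  return toMinutes(text)
}

function inQuietHours(now: Date, q: QuietHours): boolean {
  const t = localMinutes(now, q.timezone)
  const start = toMinutes(q.start)
  const end = toMinutes(q.end)
  // A window such as 22:00-07:00 wraps past midnight.
  return start <= end ? t >= start && t < end : t >= start || t < end
}

function resolve(
  prefs: Prefs,
  category: string,
  channel: Channel,
  critical: boolean,
  sentToday: number,
  now: Date,
): Decision {
  if (!prefs.globalEnabled) return { action: "drop", reason: "global_off" }
  const cat = prefs.categories[category]
  if (cat !== undefined && !cat.enabled) return { action: "drop", reason: "category_off" }
  const override = cat !== undefined ? cat.channels : undefined
  const ch = prefs.channels[channel]
  if (override !== undefined) {
    if (override.indexOf(channel) === -1) return { action: "drop", reason: "channel_off" }
  } else if (ch !== undefined && !ch.enabled) {
    return { action: "drop", reason: "channel_off" }
  }
  if (ch !== undefined && ch.maxPerDay !== undefined && sentToday >= ch.maxPerDay) {
    return { action: "drop", reason: "frequency_capped" }
  }
  const q = prefs.quietHours
  if (!critical && q !== undefined && q.enabled && inQuietHours(now, q)) {
    return { action: "defer", reason: "quiet_hours" }
  }
  return { action: "deliver" }
}
```

The router would call `resolve` once per candidate channel, so a drop on email leaves the push decision untouched.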

Device registry

Maintains push tokens per user, per device.

```typescript
interface DeviceToken {
  userId: string
  deviceId: string
  platform: "ios" | "android" | "web"
  token: string
  tokenType: "apns" | "fcm" | "web_push"
  appVersion: string
  lastSeen: Date
  createdAt: Date
  updatedAt: Date
  status: "active" | "stale" | "invalid"
}
```

Token lifecycle:

Event	Action
App install	Register new token.
App launch	Refresh token if older than 7 days.
Token refresh callback	Update token; mark previous invalid.
Delivery returns `UNREGISTERED` (FCM, HTTP 404) or `Unregistered` (APNs, HTTP 410)	Mark token invalid immediately.
30 days inactive	Mark token stale (deprioritize).
270 days inactive (Android FCM)	Token is automatically expired by FCM and subsequent sends return `UNREGISTERED` (FCM: manage tokens).
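The time-based rows of the lifecycle table reduce to a small classifier that a periodic sweep job could run. A sketch, with the thresholds taken from the table and the function names assumed:

```typescript
type TokenStatus = "active" | "stale" | "invalid"
type Platform = "ios" | "android" | "web"

const DAY_MS = 24 * 60 * 60 * 1000
const REFRESH_AFTER_DAYS = 7
const STALE_AFTER_DAYS = 30
const FCM_EXPIRY_DAYS = 270

// Status implied purely by inactivity; provider errors are handled elsewhere.
function classifyToken(platform: Platform, lastSeen: Date, now: Date): TokenStatus {
  const idleDays = (now.getTime() - lastSeen.getTime()) / DAY_MS
  // The 270-day expiry in the table applies to Android FCM tokens.
  if (platform === "android" && idleDays >= FCM_EXPIRY_DAYS) return "invalid"
  if (idleDays >= STALE_AFTER_DAYS) return "stale"
  return "active"
}

// On app launch: should the client re-fetch and re-register its token?
function needsRefresh(tokenUpdatedAt: Date, now: Date): boolean {
  return (now.getTime() - tokenUpdatedAt.getTime()) / DAY_MS > REFRESH_AFTER_DAYS
}
```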

Tip

Track FCM’s droppedDeviceInactive and droppedTooManyPendingMessages metrics — exposed via the Firebase BigQuery export — to detect token-base rot and aggressive collapse-key replacement before they erode delivery rate.

Router service

The orchestration layer that turns a “ready-to-send” notification into one or more provider calls.

Router decision flow: a single notification passes through dedup, preference, quiet hours, rate limit, and aggregation gates before channel selection and dispatch.

The gates are ordered cheapest-rejection-first so we waste the least work on traffic that will be dropped:

Deduplication — SETNX dedup:{user_id}:{notification_id} against Redis.
Preference — is the user opted in for this category and any channel?
Quiet hours — is the user in their DND window? Critical bypasses.
Rate limit — does the user have tokens left for this channel?
Aggregation — does this match an open digest window?
Channel selection — apply preference + fallback rules.
Dispatch — push to the per-channel processor topics.
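The gate ordering can be sketched as a short-circuiting pipeline. The gates here are synchronous, in-memory stand-ins so the example runs on its own; in the real router each check would be an asynchronous Redis or service call. All names are illustrative.

```typescript
interface Candidate {
  userId: string
  notificationId: string
  category: string
}

type GateResult = { pass: true } | { pass: false; reason: string }

interface Gate {
  name: string
  check: (c: Candidate) => GateResult
}

interface Outcome {
  dispatched: boolean
  evaluated: string[] // gates that actually ran, in order
  droppedBy?: string
  reason?: string
}

// Runs gates in order and stops at the first rejection, so later (more
// expensive) gates never run for traffic that an earlier gate already dropped.
function runGates(c: Candidate, gates: Gate[]): Outcome {
  const evaluated: string[] = []
  for (const gate of gates) {
    evaluated.push(gate.name)
    const result = gate.check(c)
    if (!result.pass) {
      return { dispatched: false, evaluated, droppedBy: gate.name, reason: result.reason }
    }
  }
  return { dispatched: true, evaluated }
}

// In-memory stand-ins for the dedup store and the preference service.
const seenKeys = new Set<string>()
const optedOutCategories = new Set<string>(["marketing"])

const routerGates: Gate[] = [
  {
    name: "dedup",
    check: (c) => {
      const key = `${c.userId}:${c.notificationId}`
      if (seenKeys.has(key)) return { pass: false, reason: "duplicate" }
      seenKeys.add(key)
      return { pass: true }
    },
  },
  {
    name: "preference",
    check: (c) =>
      optedOutCategories.has(c.category)
        ? { pass: false, reason: "category_off" }
        : { pass: true },
  },
]
```

Recording `evaluated` alongside the drop reason makes it cheap to verify in tests and traces that a rejected notification never reached the later gates.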

Channel processors

One independent worker pool per channel. Isolation matters: an SES outage must not block APNs throughput.

Push processor:

Maintains long-lived HTTP/2 connections to APNs and FCM.
Token-based auth (JWT) for APNs; service-account auth for FCM.
Respects FCM’s 600K-messages-per-minute quota with a local token bucket and exponential backoff on 429 (FCM throttling and quotas).
Maps provider error codes to retry / drop / mark-invalid actions.
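One way to express the error mapping is a pair of lookup tables. The code strings are FCM HTTP v1 error codes and APNs `reason` values; the action assigned to each, and the choice to retry unknown codes, are policy assumptions rather than provider guidance.

```typescript
type Provider = "fcm" | "apns"
type ErrorAction = "retry" | "drop" | "mark_invalid"

const FCM_ERRORS = new Map<string, ErrorAction>([
  ["UNREGISTERED", "mark_invalid"], // token no longer valid
  ["INVALID_ARGUMENT", "drop"], // malformed request; retrying will not help
  ["SENDER_ID_MISMATCH", "drop"],
  ["THIRD_PARTY_AUTH_ERROR", "drop"], // credential problem; alert an operator
  ["QUOTA_EXCEEDED", "retry"], // HTTP 429; back off before retrying
  ["UNAVAILABLE", "retry"],
  ["INTERNAL", "retry"],
])

const APNS_ERRORS = new Map<string, ErrorAction>([
  ["Unregistered", "mark_invalid"], // HTTP 410
  ["BadDeviceToken", "mark_invalid"],
  ["DeviceTokenNotForTopic", "drop"],
  ["PayloadTooLarge", "drop"],
  ["ExpiredProviderToken", "retry"], // retry only after refreshing the JWT
  ["TooManyRequests", "retry"],
  ["InternalServerError", "retry"],
  ["ServiceUnavailable", "retry"],
  ["Shutdown", "retry"],
])

// Unknown codes fall back to a retry, which the retry budget still bounds.
function actionFor(provider: Provider, code: string): ErrorAction {
  const table = provider === "fcm" ? FCM_ERRORS : APNS_ERRORS
  return table.get(code) ?? "retry"
}
```

Keeping the tables as data rather than branching logic makes it straightforward to load them from configuration alongside the provider quotas.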

Email processor:

Manages sender reputation and IP warm-up (45-day curve for SES dedicated IPs).
Handles bounces (hard / soft) and complaints; auto-suppresses repeat offenders.
Implements RFC 8058 one-click unsubscribe with List-Unsubscribe and List-Unsubscribe-Post: List-Unsubscribe=One-Click headers, required for senders >5 000 messages/day to Gmail and Yahoo since Feb 2024 (Gmail: email sender guidelines, RFC 8058).
Tracks open and click events via a tracking pixel and signed redirect URLs.

SMS processor:

Routes to the appropriate sender type (short code, long code, or toll-free).
Splits messages longer than 160 GSM-7 characters into concatenated segments of up to 153 characters each, with a User Data Header (UDH) on every part.
Honors STOP keyword opt-outs (carrier-mandated in the US).
Throttles to the carrier’s MPS — 100 MPS per short code, variable for A2P 10DLC.
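Segment counting drives both cost and throttling, so it is worth sketching. This version assumes the standard limits (160 or 153 septets for GSM-7, 70 or 67 UTF-16 code units for UCS-2) and uses a simplified GSM-7 alphabet check; it ignores the rule that an extension character cannot straddle a segment boundary.

```typescript
// GSM 03.38 basic alphabet (escape character omitted) and common extension characters.
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
const GSM7_EXTENDED = "^{}\\[~]|€" // each costs two septets

interface SegmentInfo {
  encoding: "GSM-7" | "UCS-2"
  units: number // septets for GSM-7, UTF-16 code units for UCS-2
  segments: number
}

function smsSegments(body: string): SegmentInfo {
  let septets = 0
  let isGsm7 = true
  for (const ch of body) {
    if (GSM7_BASIC.indexOf(ch) !== -1) septets += 1
    else if (GSM7_EXTENDED.indexOf(ch) !== -1) septets += 2
    else {
      isGsm7 = false
      break
    }
  }
  if (isGsm7) {
    // A single message holds 160 septets; concatenated parts hold 153 each
    // because the UDH takes the remaining space.
    return {
      encoding: "GSM-7",
      units: septets,
      segments: septets <= 160 ? 1 : Math.ceil(septets / 153),
    }
  }
  // Any non-GSM character forces the whole message into UCS-2.
  const units = body.length
  return { encoding: "UCS-2", units, segments: units <= 70 ? 1 : Math.ceil(units / 67) }
}
```

A single emoji or non-Latin character switches the whole message to UCS-2 and more than halves the per-segment capacity, which is why templates intended for SMS are usually linted for it.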

In-app processor:

Delivers via WebSocket for connected clients.
Falls back to a /notifications polling endpoint plus lastSeen cursor for disconnected clients.
Aggregates badge counts and read/unread state.

Delivery receipts

Provider acknowledgement is a two-phase contract: a synchronous ack on the send call (“we accepted the message”) and an asynchronous receipt (“we delivered / the user opened it / it bounced”). Both phases must reconcile back into the same delivery_status row keyed by notification_id.

Channel	Sync ack source	Async receipt source
FCM	HTTP v1 send response (`messageId` or error code)	Firebase BigQuery delivery export (`delivery_attempted`, `delivered`, `dropped_*`) (FCM message delivery).
APNs	HTTP/2 response status + `apns-id`	No per-message delivery receipt — Apple does not expose device-side ack to providers; rely on app-side analytics for “opened”.
SES	SendEmail API response (`MessageId`)	SNS topic events: `Delivery`, `Bounce`, `Complaint`, `Open`, `Click` (SES event publishing).
Twilio	REST API response with `sid` and initial status	Status callback URL: `queued` → `sent` → `delivered` / `failed` / `undelivered` (Twilio status callbacks).

The callback ingestor is its own service so a callback storm (e.g., a bounce surge) cannot back-pressure the channel processors. Callbacks are idempotent on (notification_id, channel, device_id, status) because providers retry their webhooks freely.
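A sketch of the reconciliation logic, using in-memory maps in place of the `delivery_status` table. The status ranking (for example, that a late `sent` must not overwrite `delivered`, and that `delivered` may follow `failed` after a successful retry) is an assumption about the desired state machine.

```typescript
type DeliveryStatus = "queued" | "sent" | "failed" | "delivered" | "opened" | "clicked"

interface CallbackEvent {
  notificationId: string
  channel: string
  deviceId: string
  status: DeliveryStatus
}

// Higher rank wins; a row only ever moves forward.
const STATUS_RANK: Record<DeliveryStatus, number> = {
  queued: 0,
  sent: 1,
  failed: 2,
  delivered: 3,
  opened: 4,
  clicked: 5,
}

class ReceiptReconciler {
  private readonly seenEvents = new Set<string>()
  private readonly rows = new Map<string, DeliveryStatus>()

  apply(e: CallbackEvent): "applied" | "duplicate" | "stale" {
    const rowKey = `${e.notificationId}|${e.channel}|${e.deviceId}`
    const eventKey = `${rowKey}|${e.status}`
    // Provider webhook retries carry the same tuple: acknowledge and ignore.
    if (this.seenEvents.has(eventKey)) return "duplicate"
    this.seenEvents.add(eventKey)

    const current = this.rows.get(rowKey)
    // Out-of-order arrival: never move a row backwards.
    if (current !== undefined && STATUS_RANK[e.status] <= STATUS_RANK[current]) {
      return "stale"
    }
    this.rows.set(rowKey, e.status)
    return "applied"
  }

  statusOf(notificationId: string, channel: string, deviceId: string): DeliveryStatus | undefined {
    return this.rows.get(`${notificationId}|${channel}|${deviceId}`)
  }
}
```

Returning a distinct result for duplicates and stale events lets the ingestor answer the provider with a success code in every case while still counting anomalies.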

API design

Send notification

POST /api/v1/notifications

```json
{
  "notificationId": "uuid-client-generated",
  "templateId": "order_shipped",
  "recipients": [
    {
      "userId": "user_123",
      "variables": {
        "orderId": "ORD-456",
        "trackingUrl": "https://track.example.com/ORD-456"
      }
    }
  ],
  "priority": "high",
  "channels": ["push", "email"],
  "options": {
    "ttl": 86400,
    "collapseKey": "order_update_ORD-456",
    "scheduledAt": null
  }
}
```

```json
{
  "requestId": "req_abc123",
  "notificationId": "uuid-client-generated",
  "status": "accepted",
  "recipientCount": 1,
  "estimatedDelivery": "2026-04-21T10:00:05Z"
}
```

Code	Error	When
400	`INVALID_TEMPLATE`	Template missing or variables fail schema validation.
400	`INVALID_RECIPIENT`	User ID not found in identity service.
409	`DUPLICATE_NOTIFICATION`	`notificationId` already processed inside dedup window.
429	`RATE_LIMITED`	Producer-level rate limit exceeded.

Bulk send

POST /api/v1/notifications/bulk

```json
{
  "notificationId": "bulk_uuid",
  "templateId": "weekly_digest",
  "recipientQuery": {
    "segment": "active_users_7d",
    "excludeOptedOut": true
  },
  "priority": "low",
  "channels": ["email"],
  "options": {
    "spreadOverMinutes": 60,
    "respectQuietHours": true
  }
}
```

spreadOverMinutes is the platform’s protection against burst-send anti-patterns: large segments are scheduled to deliver evenly across the window so SES, FCM, and downstream queues don’t see a vertical wall of traffic.
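A minimal sketch of how a spread could be planned, assuming one-minute send slots by default. The function name and the slot granularity are illustrative; the platform's actual scheduler is not specified here.

```typescript
interface SendSlot {
  sendAtMs: number // epoch milliseconds at which this slice is released
  count: number // recipients released in this slice
}

function spreadSchedule(
  recipientCount: number,
  spreadOverMinutes: number,
  startMs: number,
  slotSeconds: number = 60,
): SendSlot[] {
  const slots = Math.max(1, Math.floor((spreadOverMinutes * 60) / slotSeconds))
  const plan: SendSlot[] = []
  for (let s = 0; s < slots; s++) {
    // Recipient i lands in slot floor(i * slots / n); count how many land in slot s.
    const from = Math.ceil((s * recipientCount) / slots)
    const to = Math.ceil(((s + 1) * recipientCount) / slots)
    const count = to - from
    if (count > 0) plan.push({ sendAtMs: startMs + s * slotSeconds * 1000, count })
  }
  return plan
}
```

With 1 000 recipients over 60 minutes this yields 60 slices of 16 or 17 recipients; with only 3 recipients the sends land at minutes 0, 20, and 40 rather than bunching at the start.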

Get notification status

GET /api/v1/notifications/{notificationId}/status

```json
{
  "notificationId": "uuid",
  "status": "delivered",
  "recipients": [
    {
      "userId": "user_123",
      "channels": {
        "push": {
          "status": "delivered",
          "deliveredAt": "2026-04-21T10:00:02Z",
          "openedAt": "2026-04-21T10:05:00Z"
        },
        "email": {
          "status": "sent",
          "sentAt": "2026-04-21T10:00:03Z",
          "openedAt": null
        }
      }
    }
  ]
}
```

User preferences

GET /api/v1/users/{userId}/preferences

```json
{
  "userId": "user_123",
  "globalEnabled": true,
  "quietHours": {
    "enabled": true,
    "start": "22:00",
    "end": "07:00",
    "timezone": "America/New_York"
  },
  "channels": {
    "push": { "enabled": true },
    "email": { "enabled": true, "frequency": { "maxPerDay": 10 } },
    "sms": { "enabled": false }
  },
  "categories": {
    "marketing": { "enabled": false },
    "order_updates": { "enabled": true, "channels": ["push", "email"] },
    "security": { "enabled": true, "channels": ["push", "sms", "email"] }
  }
}
```

PATCH /api/v1/users/{userId}/preferences — partial update; the patch is applied with optimistic locking and an audit row is appended on every change.

Device registration

POST /api/v1/devices

```json
{
  "userId": "user_123",
  "deviceId": "device_abc",
  "platform": "ios",
  "token": "apns_token_xyz",
  "appVersion": "3.2.1"
}
```

Notification history

GET /api/v1/users/{userId}/notifications?limit=50&cursor=xxx — cursor-paginated listing for the user’s notification surface.

Data modeling

Notification record (Cassandra)

Cassandra is a strong fit for the notification log: high write volume, time-series access pattern, and per-user TTL. The partition key includes a time bucket so partitions stay below the recommended ~100 MB ceiling (DataStax: data modeling best practices).

```sql
-- The UDT must exist before any table references it.
CREATE TYPE notification_content (
    title TEXT,
    body TEXT,
    data MAP<TEXT, TEXT>,
    image_url TEXT
);

CREATE TABLE notifications (
    user_id UUID,
    bucket DATE,           -- daily bucket; bound partition size for power users
    created_at TIMESTAMP,
    notification_id UUID,
    template_id TEXT,
    priority TEXT,
    content FROZEN<notification_content>,
    channels SET<TEXT>,
    status TEXT,
    delivery_attempts INT,
    PRIMARY KEY ((user_id, bucket), created_at, notification_id)
) WITH CLUSTERING ORDER BY (created_at DESC, notification_id ASC)
  AND default_time_to_live = 7776000  -- 90 days
  AND compaction = { 'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1 };

-- Direct lookup by notification_id (for status endpoint)
CREATE TABLE notifications_by_id (
    notification_id UUID PRIMARY KEY,
    user_id UUID,
    bucket DATE,
    created_at TIMESTAMP,
    template_id TEXT,
    priority TEXT,
    content FROZEN<notification_content>,
    channels SET<TEXT>,
    status TEXT
);
```

Important

Without the bucket partition component, a power-user partition grows unbounded and TWCS compaction stops being effective. Pick the bucket size (hour vs day) so the largest expected partition stays under ~100 MB.
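The bucket choice has two practical consequences: writers must derive the bucket from the event time, and readers must walk buckets newest-first. A sketch, assuming UTC day buckets and the ~500 B record size from the scale estimate (helper names are hypothetical):

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Partition bucket for a timestamp: the UTC calendar day, e.g. "2026-04-21".
function dayBucket(at: Date): string {
  return at.toISOString().slice(0, 10)
}

// Buckets a history read should query, newest first, until its page is full.
function bucketsForLookback(now: Date, days: number): string[] {
  const buckets: string[] = []
  for (let i = 0; i < days; i++) {
    buckets.push(dayBucket(new Date(now.getTime() - i * MS_PER_DAY)))
  }
  return buckets
}

// Rows that fit in one partition before it crosses the size ceiling.
function maxRowsPerPartition(limitBytes: number, recordBytes: number): number {
  return Math.floor(limitBytes / recordBytes)
}
```

At 500 B per row, a 100 MB ceiling allows roughly 200 000 rows per partition, so a daily bucket is comfortable for any user receiving fewer notifications than that per day; beyond it, an hourly bucket is the safer choice.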

Delivery status (Cassandra)

```sql
CREATE TABLE delivery_status (
    notification_id UUID,
    channel TEXT,
    user_id UUID,
    device_id TEXT,     -- clustering column, cannot be null; use a sentinel for email/SMS rows
    status TEXT,        -- queued, sent, delivered, failed, opened, clicked
    provider_id TEXT,   -- APNs message ID, SES message ID, etc.
    error_code TEXT,
    error_message TEXT,
    timestamp TIMESTAMP,
    PRIMARY KEY ((notification_id), channel, device_id)
);

-- Time-bucketed retry index
CREATE TABLE failed_deliveries (
    retry_bucket INT,   -- hour bucket; bound the per-bucket scan size
    notification_id UUID,
    channel TEXT,
    user_id UUID,
    attempt_count INT,
    last_error TEXT,
    next_retry_at TIMESTAMP,
    PRIMARY KEY ((retry_bucket), next_retry_at, notification_id)
) WITH CLUSTERING ORDER BY (next_retry_at ASC);
```
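A sketch of how a retry worker might fill `next_retry_at` and `retry_bucket`, using exponential backoff with full jitter. The base delay, cap, and attempt limit are illustrative defaults, and `retry_bucket` is assumed to be hours since the Unix epoch.

```typescript
interface RetryPolicy {
  baseMs: number
  capMs: number
  maxAttempts: number
}

type RetryPlan =
  | { giveUp: true }
  | { giveUp: false; delayMs: number; nextRetryAtMs: number; retryBucket: number }

const HOUR_MS = 60 * 60 * 1000
const DEFAULT_RETRY_POLICY: RetryPolicy = { baseMs: 1000, capMs: HOUR_MS, maxAttempts: 5 }

function planRetry(
  attemptCount: number, // attempts already made
  nowMs: number,
  policy: RetryPolicy = DEFAULT_RETRY_POLICY,
  random: () => number = Math.random, // injectable for deterministic tests
): RetryPlan {
  // Bounded retries: past the limit the delivery goes to the dead-letter path.
  if (attemptCount >= policy.maxAttempts) return { giveUp: true }

  // Full jitter: pick uniformly in [0, min(cap, base * 2^attempt)).
  const ceiling = Math.min(policy.capMs, policy.baseMs * Math.pow(2, attemptCount))
  const delayMs = Math.floor(random() * ceiling)
  const nextRetryAtMs = nowMs + delayMs
  return {
    giveUp: false,
    delayMs,
    nextRetryAtMs,
    retryBucket: Math.floor(nextRetryAtMs / HOUR_MS),
  }
}
```

Jitter matters most after an outage: without it, every failed delivery from the same minute retries in the same instant and recreates the burst that caused the failures.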

User preferences (PostgreSQL)

```sql
CREATE TABLE user_preferences (
    user_id UUID PRIMARY KEY,
    global_enabled BOOLEAN DEFAULT true,
    quiet_hours JSONB,   -- {"enabled":true,"start":"22:00","end":"07:00","timezone":"America/New_York"}
    channel_prefs JSONB,
    category_prefs JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE preference_history (
    id BIGSERIAL PRIMARY KEY,
    user_id UUID NOT NULL,
    changed_at TIMESTAMPTZ DEFAULT NOW(),
    change_type TEXT,    -- 'opt_in', 'opt_out', 'update'
    old_value JSONB,
    new_value JSONB,
    source TEXT          -- 'user', 'system', 'compliance'
);

CREATE INDEX idx_pref_history_user ON preference_history(user_id, changed_at DESC);
```

Device tokens (PostgreSQL + Redis)

```sql
CREATE TABLE device_tokens (
    device_id TEXT PRIMARY KEY,
    user_id UUID NOT NULL,
    platform TEXT NOT NULL,
    token TEXT NOT NULL,
    token_type TEXT NOT NULL,
    app_version TEXT,
    last_seen TIMESTAMPTZ,
    status TEXT DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_tokens_user ON device_tokens(user_id);
CREATE INDEX idx_tokens_status ON device_tokens(status) WHERE status = 'active';
```

```redis
# User's active tokens (set)
SADD user:tokens:{user_id} {device_id_1} {device_id_2}

# Token details (hash)
HSET token:{device_id} user_id "user_123" platform "ios" token "apns_xyz" token_type "apns" status "active"

# Liveness marker (TTL drives stale-detection job); 2592000 s = 30 days
SETEX token:active:{device_id} 2592000 "1"
```

Templates (PostgreSQL)

```sql
CREATE TABLE notification_templates (
    template_id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    category TEXT NOT NULL,
    channels JSONB NOT NULL,
    variables JSONB,
    default_locale TEXT DEFAULT 'en',
    is_active BOOLEAN DEFAULT true,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    version INT DEFAULT 1
);

CREATE TABLE template_translations (
    template_id TEXT REFERENCES notification_templates(template_id),
    locale TEXT,
    channels JSONB NOT NULL,
    PRIMARY KEY (template_id, locale)
);

CREATE TABLE template_versions (
    template_id TEXT,
    version INT,
    channels JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    created_by TEXT,
    PRIMARY KEY (template_id, version)
);
```

Database selection matrix

Data type	Store	Rationale
Notifications	Cassandra	Time-series, high write volume, native TTL.
Delivery status	Cassandra	High write volume, time-bucketed scan for retry.
User preferences	PostgreSQL + Redis	ACID for changes, cached for hot reads.
Device tokens	PostgreSQL + Redis	Relational queries; cached for delivery hot path.
Templates	PostgreSQL	Low volume; needs version history and constraints.
Deduplication	Redis	TTL semantics, atomic SETNX, sub-millisecond lookups.
Rate limits	Redis	Atomic INCR, sliding windows via Lua scripts.
Analytics	ClickHouse	Columnar aggregations across billions of records.

Low-level design

Deduplication service

The dedup window is the most expensive Redis working set after rate limits. Use Bloom filters as a fast “definitely not duplicate” check before falling back to authoritative SETNX.

```typescript
class DeduplicationService {
  private readonly redis: RedisCluster
  private readonly DEDUP_TTL = 172800 // 48 hours in seconds

  async isDuplicate(userId: string, notificationId: string): Promise<boolean> {
    const key = `dedup:${userId}:${notificationId}`

    // SET with NX and EX replies "OK" when the key was created (first sighting)
    // and null when the key already existed (duplicate).
    const result = await this.redis.set(key, "1", {
      NX: true,
      EX: this.DEDUP_TTL,
    })

    return result === null // null means key existed (duplicate)
  }

  // Bloom pre-check: false means "definitely not seen before"; true means
  // "possibly seen", which must be confirmed with isDuplicate().
  async checkBloomFilter(userId: string, notificationId: string): Promise<boolean> {
    const key = `bloom:dedup:${userId}`
    return await this.redis.bf.exists(key, notificationId)
  }
}
```

Note

Twilio Segment runs an analogous design at much larger scale — 60 billion keys in 1.5 TB of RocksDB with a 4-week dedup window after processing 200 billion messages (Twilio Segment: exactly-once delivery). If your dedup working set exceeds Redis economics, that’s the migration path.
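To make the Bloom-then-authoritative flow concrete, here is a self-contained, in-process sketch. A `Map` with expiry times stands in for the Redis `SET NX EX` store, and the Bloom filter is a deliberately small hand-rolled one (FNV-1a with double hashing); both are assumptions for illustration, not the production data structures.

```typescript
function fnv1a(text: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i)
    h = Math.imul(h, 0x01000193) >>> 0
  }
  return h
}

class BloomFilter {
  private readonly bits: Uint8Array
  private readonly size: number
  private readonly hashes: number

  constructor(sizeBits: number, hashes: number) {
    this.size = sizeBits
    this.hashes = hashes
    this.bits = new Uint8Array(Math.ceil(sizeBits / 8))
  }

  private positions(value: string): number[] {
    const h1 = fnv1a(value, 0)
    const h2 = (fnv1a(value, 0x9e3779b9) | 1) >>> 0 // odd stride for double hashing
    const out: number[] = []
    for (let i = 0; i < this.hashes; i++) out.push((h1 + i * h2) % this.size)
    return out
  }

  add(value: string): void {
    for (const p of this.positions(value)) {
      this.bits[p >> 3] = (this.bits[p >> 3] ?? 0) | (1 << (p & 7))
    }
  }

  mightContain(value: string): boolean {
    return this.positions(value).every((p) => ((this.bits[p >> 3] ?? 0) & (1 << (p & 7))) !== 0)
  }
}

class TwoTierDedup {
  private readonly bloom = new BloomFilter(1 << 16, 4)
  private readonly store = new Map<string, number>() // key -> expiry (ms)
  private readonly ttlMs: number
  authoritativeLookups = 0

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs
  }

  isDuplicate(userId: string, notificationId: string, nowMs: number): boolean {
    const key = `${userId}:${notificationId}`
    if (!this.bloom.mightContain(key)) {
      // Definitely new: record it and skip the authoritative read.
      this.bloom.add(key)
      this.store.set(key, nowMs + this.ttlMs)
      return false
    }
    // Possibly seen (or a false positive): the authoritative store decides.
    this.authoritativeLookups++
    const expiresAt = this.store.get(key)
    if (expiresAt !== undefined && expiresAt > nowMs) return true
    this.store.set(key, nowMs + this.ttlMs)
    return false
  }
}
```

In a distributed deployment the write on a Bloom miss still has to go through the atomic `SET NX` so two routers cannot both claim the same key; the filter only saves the read on the common, non-duplicate path. A plain Bloom filter also never forgets, so it needs periodic rotation to track the dedup window.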

Rate limiter

Token bucket per (user_id, channel) is the right default for notifications because legitimate user activity is bursty (a checkout flow can fire 3–5 notifications back-to-back) but the long-term cap must hold (Stripe: scaling your API with rate limiters).

```typescript
interface RateLimitConfig {
  channel: string
  maxPerHour: number
  maxPerDay: number
}

// Enforces the per-user cap with fixed-window counters (one per hour, one per day).
class RateLimiter {
  private readonly redis: RedisCluster

  async checkAndConsume(
    userId: string,
    channel: string,
    config: RateLimitConfig,
  ): Promise<{ allowed: boolean; retryAfter?: number }> {
    // The {...} hash tag keeps both keys in one cluster slot, which a
    // multi-key Lua script requires on Redis Cluster.
    const hourKey = `ratelimit:{${userId}:${channel}}:hour:${this.getCurrentHour()}`
    const dayKey = `ratelimit:{${userId}:${channel}}:day:${this.getCurrentDay()}`

    // Lua for atomic check-and-increment with rollback on overflow
    const result = await this.redis.eval(
      `
      local hourCount = redis.call('INCR', KEYS[1])
      if hourCount == 1 then
        redis.call('EXPIRE', KEYS[1], 3600)
      end

      local dayCount = redis.call('INCR', KEYS[2])
      if dayCount == 1 then
        redis.call('EXPIRE', KEYS[2], 86400)
      end

      if hourCount > tonumber(ARGV[1]) then
        -- Roll back both counters: a rejected send must not consume daily budget
        redis.call('DECR', KEYS[1])
        redis.call('DECR', KEYS[2])
        return {0, redis.call('TTL', KEYS[1])}
      end

      if dayCount > tonumber(ARGV[2]) then
        redis.call('DECR', KEYS[1])
        redis.call('DECR', KEYS[2])
        return {0, redis.call('TTL', KEYS[2])}
      end

      return {1, 0}
    `,
      [hourKey, dayKey],
      [config.maxPerHour, config.maxPerDay],
    )

    return {
      allowed: result[0] === 1,
      // Seconds until the exhausted window expires
      retryAfter: result[1] > 0 ? result[1] : undefined,
    }
  }

  private getCurrentHour(): number {
    return Math.floor(Date.now() / 3600000) // hours since epoch
  }

  private getCurrentDay(): number {
    return Math.floor(Date.now() / 86400000) // days since epoch
  }
}
```
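For comparison, a literal token bucket per (user_id, channel) can be sketched in memory. The clock is passed in explicitly so the behaviour is deterministic; the class names and the in-process `Map` are assumptions, and a production version would keep the bucket state in Redis and update it atomically.

```typescript
class TokenBucket {
  private tokens: number
  private lastRefillMs: number
  private readonly capacity: number
  private readonly msPerToken: number

  constructor(capacity: number, msPerToken: number, nowMs: number) {
    this.capacity = capacity
    this.msPerToken = msPerToken
    this.tokens = capacity // start full so a legitimate burst is allowed
    this.lastRefillMs = nowMs
  }

  tryConsume(nowMs: number): { allowed: boolean; retryAfterMs: number } {
    // Refill continuously for the time elapsed since the last call.
    const elapsed = Math.max(0, nowMs - this.lastRefillMs)
    this.tokens = Math.min(this.capacity, this.tokens + elapsed / this.msPerToken)
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs)

    if (this.tokens >= 1) {
      this.tokens -= 1
      return { allowed: true, retryAfterMs: 0 }
    }
    // Time until the bucket holds one whole token again.
    return { allowed: false, retryAfterMs: Math.ceil((1 - this.tokens) * this.msPerToken) }
  }
}

class PerUserChannelLimiter {
  private readonly buckets = new Map<string, TokenBucket>()
  private readonly capacity: number
  private readonly msPerToken: number

  constructor(capacity: number, msPerToken: number) {
    this.capacity = capacity
    this.msPerToken = msPerToken
  }

  check(userId: string, channel: string, nowMs: number): { allowed: boolean; retryAfterMs: number } {
    const key = `${userId}:${channel}`
    let bucket = this.buckets.get(key)
    if (bucket === undefined) {
      bucket = new TokenBucket(this.capacity, this.msPerToken, nowMs)
      this.buckets.set(key, bucket)
    }
    return bucket.tryConsume(nowMs)
  }
}
```

With a capacity of 5 and one token every 12 minutes (5 per hour), a checkout flow can fire five notifications back-to-back, while the sixth is told to wait 12 minutes. Unlike fixed windows, the bucket has no boundary at which a user can send a double burst.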

Channel-specific provider caps to track separately from per-user caps:

Channel	Provider cap	Enforcement
FCM	600K messages/min per project (throttling docs)	Local token bucket; exponential backoff on 429.
APNs	No published numeric cap; throttles abuse	Watch for 429, `GOAWAY`, and `SHUTDOWN` reasons; back off per stream.
SES	Account-specific, adjustable; sandbox 1/sec	Read current quota from Service Quotas API on startup.
SMS (Twilio)	100 MPS short code; 12–225 SMS MPS for 10DLC	Per-sender queue with rate-limited consumer.

Notification aggregator

Collapses similar notifications into a digest within a configurable window. Knock describes two implementation patterns — batch-on-write (open a buffer keyed by recipient + collapse key when the first event arrives, flush at the end of the window) and batch-on-read (periodic cron scans for unsent notifications and groups them) (Knock: building a batched notification engine). The implementation below is batch-on-write, which scales better with sustained load.

```typescript
interface AggregationRule {
  category: string
  collapseKey: string // template, e.g., "likes_{postId}"
  windowSeconds: number
  minCount: number
  maxCount: number
  digestTemplate: string // "{{count}} people liked your post"
}

class NotificationAggregator {
  private readonly redis: RedisCluster

  async shouldAggregate(
    userId: string,
    notification: Notification,
    rule: AggregationRule,
  ): Promise<{ aggregate: boolean; pending: Notification[] }> {
    const collapseKey = this.renderCollapseKey(rule.collapseKey, notification)
    const bufferKey = `agg:${userId}:${collapseKey}`

    // These three calls are not atomic; wrap them in a Lua script or MULTI
    // if concurrent writers for the same buffer are possible.
    await this.redis.rpush(bufferKey, JSON.stringify(notification))
    // Refreshing the TTL on every push makes the window slide with the last event.
    await this.redis.expire(bufferKey, rule.windowSeconds)

    const count = await this.redis.llen(bufferKey)

    // Buffer is full: flush immediately instead of waiting for the window to end.
    if (count >= rule.maxCount) {
      const pending = await this.flushBuffer(bufferKey)
      return { aggregate: true, pending }
    }

    // Enough events to justify a digest: make sure a flush is scheduled.
    // scheduleFlush must be idempotent per buffer, since this branch runs on every later push.
    if (count >= rule.minCount) {
      await this.scheduleFlush(userId, collapseKey, rule.windowSeconds)
    }

    return { aggregate: false, pending: [] }
  }

  async createDigest(notifications: Notification[], rule: AggregationRule): Promise<Notification> {
    const count = notifications.length
    // Show at most three distinct actors in the digest.
    const actors = [...new Set(notifications.map((n) => n.actorId))].slice(0, 3)

    return {
      ...notifications[0],
      content: {
        title: this.renderTemplate(rule.digestTemplate, { count, actors }),
        // Avoid "X and 0 others" when the digest holds a single notification.
        body: count > 1 ? `${actors[0]} and ${count - 1} others` : `${actors[0]}`,
      },
      metadata: {
        aggregatedCount: count,
        originalIds: notifications.map((n) => n.notificationId),
      },
    }
  }
}
```

Common aggregation patterns:

Notification type	Collapse key	Window	Digest format
Post likes	`likes_{postId}`	5 min	“John and 5 others liked your post”
New followers	`followers_{userId}`	1 hour	“6 new followers today”
Comment replies	`replies_{commentId}`	10 min	“3 new replies to your comment”
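The collapse keys and digest strings in the table can be produced by two small pure functions. The sketch below is one possible shape, assuming `{field}` placeholders in the key template as shown in the table; the helper names `renderCollapseKey` and `formatDigest` and their signatures are illustrative, not taken from a real library.

```typescript
// Illustrative helpers; names and signatures are assumptions.
type TemplateFields = Record<string, string | number>

// Replaces {field} placeholders. A missing field throws so a bad rule fails loudly
// instead of silently collapsing unrelated notifications under one key.
function renderCollapseKey(template: string, fields: TemplateFields): string {
  return template.replace(/\{(\w+)\}/g, (_match: string, field: string) => {
    const value = fields[field]
    if (value === undefined) throw new Error(`missing collapse-key field: ${field}`)
    return String(value)
  })
}

// Builds "John and 5 others liked your post" style text from the distinct actors.
function formatDigest(actorNames: string[], action: string): string {
  const distinct = Array.from(new Set(actorNames))
  if (distinct.length === 0) throw new Error("no actors to summarize")
  const first = distinct[0]
  const others = distinct.length - 1
  if (others === 0) return `${first} ${action}`
  return `${first} and ${others} ${others === 1 ? "other" : "others"} ${action}`
}

const likesKey = renderCollapseKey("likes_{postId}", { postId: 42 })
const likesDigest = formatDigest(["John", "Ana", "Bo", "Cy", "Di", "Eve", "John"], "liked your post")
```

Counting distinct actors rather than raw events keeps a user who likes, unlikes, and likes again from inflating the digest.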

Tip

Provider-side collapse is a separate mechanism from server-side aggregation. FCM’s collapse_key (max 4 distinct keys per device, older messages replaced when the device is offline (FCM collapsible messages)) and APNs’ apns-collapse-id (max 64 bytes (APNs docs)) replace banners on the device, not in your queue. Use both: server-side aggregation reduces volume, provider collapse cleans up the device.
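Because apns-collapse-id is capped at 64 bytes, a server-side collapse key that embeds long identifiers may not fit. One way to handle that, sketched below under the assumption of a Node.js runtime, is to pass short keys through and replace long ones with their SHA-256 hex digest, which is exactly 64 ASCII characters. The function name `toApnsCollapseId` is hypothetical.

```typescript
import { createHash } from "node:crypto"

const APNS_COLLAPSE_ID_MAX_BYTES = 64

// Returns a value that is safe to use as apns-collapse-id.
// Keys within the limit pass through unchanged; longer keys are hashed so that
// equal inputs still map to equal collapse ids.
function toApnsCollapseId(collapseKey: string): string {
  if (Buffer.byteLength(collapseKey, "utf8") <= APNS_COLLAPSE_ID_MAX_BYTES) {
    return collapseKey
  }
  return createHash("sha256").update(collapseKey, "utf8").digest("hex")
}
```

Hashing is preferable to truncation here because truncating can merge two different keys that share a long prefix, and can split a multi-byte character.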

Priority router

```typescript
enum NotificationPriority {
  CRITICAL = "critical", // Security alerts, transaction confirmations
  HIGH = "high", // Direct messages, mentions
  NORMAL = "normal", // Social notifications, updates
  LOW = "low", // Marketing, digests
}

class PriorityRouter {
  private readonly queues: Map<NotificationPriority, KafkaProducer>

  async route(notification: EnrichedNotification): Promise<void> {
    const priority = this.determinePriority(notification)
    const queue = this.queues.get(priority)
    if (!queue) {
      // Fail loudly rather than dropping the notification on a misconfigured map.
      throw new Error(`no producer configured for priority ${priority}`)
    }

    // Partition by user_id so rate limiting and aggregation co-locate
    await queue.send({
      topic: `notifications.${priority}`,
      messages: [
        {
          key: notification.userId,
          value: JSON.stringify(notification),
          headers: {
            "notification-id": notification.notificationId,
            "created-at": Date.now().toString(),
          },
        },
      ],
    })
  }

  // Maps category to priority; anything unrecognized falls through to NORMAL.
  private determinePriority(notification: EnrichedNotification): NotificationPriority {
    if (notification.category === "security") return NotificationPriority.CRITICAL
    if (notification.category === "transaction") return NotificationPriority.CRITICAL

    if (notification.category === "message") return NotificationPriority.HIGH
    if (notification.category === "mention") return NotificationPriority.HIGH

    if (notification.category === "marketing") return NotificationPriority.LOW
    if (notification.category === "digest") return NotificationPriority.LOW

    return NotificationPriority.NORMAL
  }
}
```

Note

Kafka guarantees ordering only within a partition (Apache Kafka docs), so keying by user_id gives you per-user ordering across all of a user’s notifications inside a single priority topic. Cross-priority ordering is not guaranteed and should not be relied on by clients.
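The per-user ordering guarantee rests on a stable key-to-partition mapping, and that mapping depends on the partition count. The sketch below illustrates both points. FNV-1a is used only as a stand-in hash: Kafka's default Java partitioner uses murmur2, so real partition numbers will differ. The function names are illustrative.

```typescript
// FNV-1a (32-bit) as a stand-in hash for illustration only.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Same key and same partition count always yield the same partition.
function partitionFor(userId: string, partitionCount: number): number {
  return fnv1a32(userId) % partitionCount
}

// Count how many of 1,000 sample users change partition when a topic grows from 100 to 150.
const sampleUsers = Array.from({ length: 1000 }, (_unused, i) => `user-${i}`)
const movedUsers = sampleUsers.filter((id) => partitionFor(id, 100) !== partitionFor(id, 150)).length
```

For a uniformly distributed hash, a key keeps its partition across a 100 to 150 resize only when its hash modulo 300 is below 100, so roughly two thirds of users move. Each moved user has old messages on one partition and new messages on another, which is where ordering is lost.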

Per-priority queue configuration:

Priority	Partitions	Consumer parallelism	Target max latency
Critical	50	50 workers	500 ms
High	100	100 workers	2 s
Normal	200	200 workers	10 s
Low	50	50 workers (off-peak)	Best effort

Push delivery with retry

```typescript
interface PushDeliveryResult {
  success: boolean
  messageId?: string
  errorCode?: string
  shouldRetry: boolean
  invalidToken: boolean
}

class PushProcessor {
  private readonly fcm: FirebaseMessaging
  private readonly apns: ApnsClient
  private readonly deviceRegistry: DeviceRegistry

  async deliver(notification: Notification, device: DeviceToken): Promise<PushDeliveryResult> {
    try {
      if (device.tokenType === "fcm") {
        return await this.deliverFcm(notification, device)
      }
      if (device.tokenType === "apns") {
        // deliverApns is not shown; it mirrors deliverFcm against the APNs client.
        return await this.deliverApns(notification, device)
      }
      // Unknown token types are a permanent failure, so every code path returns a result.
      // "unsupported-token-type" is an internal code, not a provider error.
      return { success: false, errorCode: "unsupported-token-type", shouldRetry: false, invalidToken: false }
    } catch (error) {
      return this.handleError(error, device)
    }
  }

  private async deliverFcm(notification: Notification, device: DeviceToken): Promise<PushDeliveryResult> {
    const message = {
      token: device.token,
      notification: {
        title: notification.content.title,
        body: notification.content.body,
      },
      data: notification.content.data,
      android: {
        priority: notification.priority === "critical" ? "high" : "normal",
        // notification.ttl is assumed to be in seconds; the Android ttl field is in milliseconds.
        ttl: notification.ttl * 1000,
        collapseKey: notification.collapseKey,
      },
    }

    const response = await this.fcm.send(message)
    return { success: true, messageId: response, shouldRetry: false, invalidToken: false }
  }

  private async handleError(error: any, device: DeviceToken): Promise<PushDeliveryResult> {
    const errorCode = error.code

    // Invalid token: remove immediately
    if (["messaging/invalid-registration-token", "messaging/registration-token-not-registered"].includes(errorCode)) {
      // Await the write so a failed cleanup surfaces instead of being silently dropped.
      await this.deviceRegistry.markInvalid(device.deviceId)
      return { success: false, errorCode, shouldRetry: false, invalidToken: true }
    }

    // Rate limited: retry with backoff
    if (errorCode === "messaging/too-many-requests") {
      return { success: false, errorCode, shouldRetry: true, invalidToken: false }
    }

    // Server error: retry with backoff
    if (errorCode === "messaging/internal-error") {
      return { success: false, errorCode, shouldRetry: true, invalidToken: false }
    }

    // Anything else is treated as permanent to avoid retrying malformed payloads.
    return { success: false, errorCode, shouldRetry: false, invalidToken: false }
  }
}
```

Retry service with exponential backoff

Figure: retry with exponential backoff and dead-letter handoff. A worker re-attempts on transient failure with capped exponential backoff plus jitter, then hands the message to the DLQ once max_attempts is exceeded.

```typescript
interface RetryConfig {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  jitterFactor: number
}

class RetryService {
  // Injected clients; the type names are placeholders for whatever drivers are in use.
  private readonly cassandra: CassandraClient
  private readonly kafka: KafkaProducer
  private readonly metrics: MetricsClient

  private readonly defaultConfig: RetryConfig = {
    maxAttempts: 5,
    baseDelayMs: 1000,
    maxDelayMs: 300000, // 5 minutes
    jitterFactor: 0.2,
  }

  async scheduleRetry(
    notification: Notification,
    channel: string,
    attemptCount: number,
    config: RetryConfig = this.defaultConfig,
  ): Promise<void> {
    // Out of attempts: hand off to the dead-letter topic instead of rescheduling.
    if (attemptCount >= config.maxAttempts) {
      await this.moveToDlq(notification, channel)
      return
    }

    const delay = this.calculateDelay(attemptCount, config)
    // Compute the due time once so the bucket and next_retry_at always agree.
    const nextRetryAtMs = Date.now() + delay
    const retryBucket = Math.floor(nextRetryAtMs / 3600000) // hour bucket

    await this.cassandra.execute(
      `
      INSERT INTO failed_deliveries (
        retry_bucket, notification_id, channel, user_id,
        attempt_count, last_error, next_retry_at
      ) VALUES (?, ?, ?, ?, ?, ?, ?)
    `,
      [
        retryBucket,
        notification.notificationId,
        channel,
        notification.userId,
        attemptCount + 1,
        notification.lastError,
        new Date(nextRetryAtMs),
      ],
    )
  }

  private calculateDelay(attempt: number, config: RetryConfig): number {
    // Exponential backoff with jitter to avoid retry-storm thundering herds
    const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt)
    const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs)
    // Jitter is additive, so the result can exceed maxDelayMs by up to jitterFactor.
    const jitter = cappedDelay * config.jitterFactor * Math.random()

    return Math.floor(cappedDelay + jitter)
  }

  private async moveToDlq(notification: Notification, channel: string): Promise<void> {
    await this.kafka.send({
      topic: "notifications.dlq",
      messages: [
        {
          key: notification.userId,
          value: JSON.stringify({
            notification,
            channel,
            reason: "max_retries_exceeded",
            timestamp: Date.now(),
          }),
        },
      ],
    })

    this.metrics.increment("notifications.dlq.count", {
      channel,
      category: notification.category,
    })
  }
}
```
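To see what the default configuration above produces, the delay formula can be pulled out as a pure function with an injectable random source. This is a restatement of `calculateDelay` for illustration; the names `backoffDelayMs` and `BackoffConfig` are assumptions introduced here.

```typescript
interface BackoffConfig {
  baseDelayMs: number
  maxDelayMs: number
  jitterFactor: number
}

// Same formula as calculateDelay, with the random source injected so the schedule is reproducible.
function backoffDelayMs(
  attempt: number,
  config: BackoffConfig,
  random: () => number = Math.random,
): number {
  const exponential = config.baseDelayMs * Math.pow(2, attempt)
  const capped = Math.min(exponential, config.maxDelayMs)
  return Math.floor(capped + capped * config.jitterFactor * random())
}

const retryDefaults: BackoffConfig = { baseDelayMs: 1000, maxDelayMs: 300000, jitterFactor: 0.2 }
const attempts = [0, 1, 2, 3, 4]

// Lower bound of the schedule (jitter = 0): 1 s, 2 s, 4 s, 8 s, 16 s.
const noJitterSchedule = attempts.map((a) => backoffDelayMs(a, retryDefaults, () => 0))

// Upper bound on total wait across all five attempts (jitter at its maximum).
const worstCaseTotalMs = attempts
  .map((a) => backoffDelayMs(a, retryDefaults, () => 1))
  .reduce((sum, delay) => sum + delay, 0)
```

With these defaults a notification spends between 31 s and about 37 s in retry before reaching the DLQ, and the 5-minute cap never applies because the uncapped delay only exceeds it from attempt 9 onward. Raise `maxAttempts` or `baseDelayMs` if provider incidents are expected to outlast that window.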

Quiet hours handler

```typescript
class QuietHoursHandler {
  async shouldDefer(
    userId: string,
    notification: Notification,
    preferences: UserPreferences,
  ): Promise<{ defer: boolean; deliverAt?: Date }> {
    // Critical notifications bypass quiet hours
    if (notification.priority === "critical") {
      return { defer: false }
    }

    if (!preferences.quietHours?.enabled) {
      return { defer: false }
    }

    // getUserLocalTime and getQuietHoursEnd are helpers that are not shown here.
    const userNow = this.getUserLocalTime(preferences.quietHours.timezone)
    const isInQuietHours = this.isTimeInRange(userNow, preferences.quietHours.start, preferences.quietHours.end)

    if (!isInQuietHours) {
      return { defer: false }
    }

    const deliverAt = this.getQuietHoursEnd(preferences.quietHours.end, preferences.quietHours.timezone)

    return { defer: true, deliverAt }
  }

  // start and end are "HH:mm" strings in the user's local time; the end is exclusive.
  private isTimeInRange(current: Date, start: string, end: string): boolean {
    const currentMinutes = current.getHours() * 60 + current.getMinutes()
    const [startHour, startMin] = start.split(":").map(Number)
    const [endHour, endMin] = end.split(":").map(Number)

    const startMinutes = startHour * 60 + startMin
    const endMinutes = endHour * 60 + endMin

    // Handle overnight ranges (e.g., 22:00 – 07:00)
    if (startMinutes > endMinutes) {
      return currentMinutes >= startMinutes || currentMinutes < endMinutes
    }

    return currentMinutes >= startMinutes && currentMinutes < endMinutes
  }
}
```
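The handler above relies on a `getUserLocalTime` helper that is not shown. One way to obtain the user's local time of day without a date library is `Intl.DateTimeFormat` with an IANA zone, sketched below. The function name `minutesOfDayInZone` is hypothetical, and the sketch assumes a runtime with full ICU time-zone data (the default in current Node.js and browsers).

```typescript
// Returns minutes since local midnight for `instant` in the given IANA time zone.
// Intl.DateTimeFormat throws a RangeError for unknown zone names.
function minutesOfDayInZone(instant: Date, timeZone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hourCycle: "h23", // 00 to 23, so midnight is never rendered as 24
  }).formatToParts(instant)

  const read = (type: string): number => {
    const part = parts.find((p) => p.type === type)
    if (!part) throw new Error(`missing ${type} in formatted time`)
    return Number(part.value)
  }

  return read("hour") * 60 + read("minute")
}

// 12:00 UTC on a January day is 17:30 in Kolkata and 07:00 in New York.
const noonUtc = new Date("2024-01-15T12:00:00Z")
const kolkataMinutes = minutesOfDayInZone(noonUtc, "Asia/Kolkata")
const newYorkMinutes = minutesOfDayInZone(noonUtc, "America/New_York")
```

Because the constructor rejects unknown zone names, the same call can double as validation when a preference is written, which is cheaper than discovering a bad zone at delivery time.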

Frontend considerations

Real-time in-app notifications

```typescript
class NotificationClient {
  private ws: WebSocket | null = null
  private reconnectAttempt = 0
  private readonly MAX_RECONNECT_DELAY = 30000

  connect(authToken: string): void {
    // A token in the query string can end up in proxy and server logs;
    // prefer a short-lived, single-use token for this handshake.
    this.ws = new WebSocket(`wss://notifications.example.com/ws?token=${encodeURIComponent(authToken)}`)

    this.ws.onopen = () => {
      this.reconnectAttempt = 0
      // Catch up on anything delivered while the socket was down.
      this.syncMissedNotifications()
    }

    this.ws.onmessage = (event) => {
      const notification = JSON.parse(event.data)
      this.handleNotification(notification)
    }

    this.ws.onclose = () => {
      // scheduleReconnect (not shown) backs off up to MAX_RECONNECT_DELAY.
      this.scheduleReconnect()
    }
  }

  private handleNotification(notification: Notification): void {
    this.incrementBadge()
    this.store.dispatch(addNotification(notification))

    if (notification.priority === "high" && !document.hasFocus()) {
      this.showToast(notification)
    }

    if (notification.showBrowserNotification) {
      this.showBrowserNotification(notification)
    }
  }

  private async syncMissedNotifications(): Promise<void> {
    const lastSeen = localStorage.getItem("lastNotificationTimestamp")

    // On first run there is no stored timestamp; omit `since` rather than sending "since=null".
    const params = new URLSearchParams({ limit: "50" })
    if (lastSeen) params.set("since", lastSeen)

    const response = await fetch(`/api/v1/notifications?${params.toString()}`)
    const { notifications } = await response.json()

    notifications.forEach((n) => this.handleNotification(n))
  }
}
```
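The client above calls `scheduleReconnect` without showing how the delay is chosen. A common choice is capped exponential backoff with full jitter, sketched here as a pure function. The 30-second cap comes from `MAX_RECONNECT_DELAY` above; the 1-second base, the full-jitter variant, and the function name `reconnectDelayMs` are assumptions.

```typescript
// Capped exponential backoff with full jitter: the delay is drawn uniformly
// from [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function reconnectDelayMs(
  attempt: number,
  maxDelayMs: number = 30000,
  baseDelayMs: number = 1000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt))
  return Math.floor(random() * ceiling)
}
```

Inside the client this would be used roughly as `setTimeout(() => this.connect(token), reconnectDelayMs(this.reconnectAttempt++))`. Full jitter spreads clients across the whole interval, which matters most after a gateway restart when every client disconnects at the same instant.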

Notification list with virtualization

```tsx
interface NotificationListProps {
  userId: string
  pageSize: number
}

const NotificationList: React.FC<NotificationListProps> = ({ userId, pageSize }) => {
  // Cursor-based pagination: each page returns the cursor for the next one.
  const {
    data,
    fetchNextPage,
    hasNextPage,
    isFetchingNextPage
  } = useInfiniteQuery({
    queryKey: ['notifications', userId],
    queryFn: ({ pageParam }) =>
      fetchNotifications(userId, { cursor: pageParam, limit: pageSize }),
    getNextPageParam: (lastPage) => lastPage.nextCursor
  })

  // Flatten the loaded pages into one list for the virtualized renderer.
  const notifications = data?.pages.flatMap(p => p.notifications) ?? []

  // markAsRead is assumed to be defined elsewhere (for example, a mutation hook).
  return (
    <VirtualList
      items={notifications}
      estimatedItemSize={80}
      onEndReached={() => hasNextPage && fetchNextPage()}
      renderItem={(notification) => (
        <NotificationItem
          key={notification.id}
          notification={notification}
          onRead={markAsRead}
        />
      )}
    />
  )
}
```

Push permission flow

```typescript
class PushPermissionManager {
  async requestPermission(): Promise<"granted" | "denied" | "default"> {
    // Already granted: make sure the service worker is registered and return.
    if (Notification.permission === "granted") {
      await this.registerServiceWorker()
      return "granted"
    }

    // Already denied: browsers will not show the prompt again.
    if (Notification.permission === "denied") {
      return "denied"
    }

    const permission = await Notification.requestPermission()

    if (permission === "granted") {
      await this.registerServiceWorker()
      const token = await this.getFcmToken()
      await this.registerDevice(token)
    }

    return permission
  }

  private async registerServiceWorker(): Promise<void> {
    await navigator.serviceWorker.register("/sw.js")

    // `pushsubscriptionchange` fires inside the service worker's global scope,
    // not on the registration object in the page. The handler that fetches a new
    // token and calls the device-update endpoint therefore belongs in /sw.js.
  }
}
```

Note

Web Push is the W3C/IETF standard underpinning browser notifications: the protocol is defined by RFC 8030, payload encryption by RFC 8291, and application-server identification (VAPID) by RFC 8292. FCM Web Push is a transport over the same protocol.

Infrastructure

Cloud-agnostic component map

Component	Purpose	Options
Message queue	Event ingestion, priority routing	Kafka, Pulsar, NATS JetStream
KV store	Preferences, tokens, dedup, rate limits	Redis, KeyDB, Dragonfly
Primary DB	Templates, preferences, audit	PostgreSQL, CockroachDB
Time-series DB	Notification history, delivery status	Cassandra, ScyllaDB, DynamoDB
Push gateway	APNs/FCM delivery	Self-hosted, Firebase Admin
Email gateway	SMTP delivery	Postfix, SendGrid, SES API
SMS gateway	Carrier delivery	Twilio, Vonage, MessageBird

AWS reference architecture

Service configurations:

Service	Configuration	Rationale
Notification API (Fargate)	2 vCPU, 4 GB, 20 tasks	Stateless, scales with traffic.
Router workers (Fargate)	2 vCPU, 4 GB, 50 tasks	CPU-bound preference lookups.
Push workers (Fargate)	2 vCPU, 4 GB, 30 tasks	I/O-bound provider calls.
WebSocket gateways (Fargate)	4 vCPU, 8 GB, 20 tasks	Memory budget for connections.
ElastiCache Redis	r6g.xlarge cluster	Sub-ms reads for hot path.
RDS PostgreSQL	db.r6g.large Multi-AZ	Templates, preferences.
Amazon Keyspaces	On-demand	Serverless Cassandra.
MSK	kafka.m5.large × 3	Priority topic separation.

Self-hosted alternatives

Managed service	Self-hosted option	When to self-host
Amazon MSK	Apache Kafka on EC2	Cost at scale, specific configs.
ElastiCache	Redis Cluster on EC2	Specific modules (RediSearch, RedisBloom).
Amazon Keyspaces	Apache Cassandra / ScyllaDB	Cost, tuning flexibility.
SNS Mobile Push	Direct APNs/FCM integration	Full control, cost savings.
Amazon SES	Postfix + DKIM/SPF	Volume discounts, deliverability control.

Monitoring and observability

Key SLIs and alert thresholds:

Metric	Alert threshold	Action
Delivery rate	< 99%	Investigate provider; check error mix.
p99 latency (critical)	> 500 ms	Scale workers; inspect topic lag.
DLQ depth	> 1 000	Manual triage; replay or drop.
Rate limit hits	> 10% of traffic	Review per-user/category caps.
Invalid tokens	> 5% per day	Token cleanup job is failing or behind.
Bounce rate (email)	> 5% (hard)	Review list hygiene; audit producer.
Spam complaint rate	> 0.3%	Pause sender; audit content. (Gmail bulk-sender guidelines)

Distributed tracing pattern (Slack-style):

Each notification gets its own trace, with notification_id as the trace_id.
Span links connect the originating message trace to the resulting notification trace, so the originating event remains discoverable without inflating its trace.
Spans cover the full path: accept → enqueue → route → dispatch → provider-ack → device-ack.
Sampling is 100% for notifications (vs. ~1% for general traffic), because notification debugging is high-value and per-event payloads are small (Slack engineering: tracing notifications).
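Using notification_id as the trace_id works directly when ids are UUIDs, because a UUID without hyphens is 32 hex characters, the same shape as a W3C Trace Context trace-id. The sketch below assumes UUID notification ids, which this article does not state; the function names are illustrative.

```typescript
// Converts a UUID-shaped notification id into a W3C trace-id (32 lowercase hex characters).
// The all-zero value is rejected because Trace Context treats it as invalid.
function notificationIdToTraceId(notificationId: string): string {
  const hex = notificationId.replace(/-/g, "").toLowerCase()
  if (!/^[0-9a-f]{32}$/.test(hex) || /^0+$/.test(hex)) {
    throw new Error(`notification id is not usable as a trace id: ${notificationId}`)
  }
  return hex
}

// Builds a version-00 traceparent header value: version, trace-id, parent span id, flags.
function toTraceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`
}

const exampleTraceId = notificationIdToTraceId("123E4567-e89b-12d3-a456-426614174000")
const exampleTraceparent = toTraceparent(exampleTraceId, "00f067aa0ba902b7", true)
```

With this mapping, anyone holding a notification id from a support ticket can look up the trace without an intermediate index.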

Operational reality

Failure modes

Failure	Detection	Mitigation
FCM 5xx / `RESOURCE_EXHAUSTED`	Provider error rate spike, 429s	Per-project token bucket; exponential backoff with jitter; bulk traffic to low priority.
APNs `GOAWAY` / connection reset	Connection error metric	Reconnect, halve concurrency, monitor reason field.
Cassandra wide partition	Read latency p99 spike	Shrink bucket size; backfill into a re-partitioned table.
Redis dedup eviction	Increase in duplicate downstream events	Right-size memory; consider RocksDB tier (Segment-style) for long windows.
Producer floods bulk topic	Topic lag on bulk priority	Apply per-producer rate limits at API gateway; reject or shed bulk on contention.
Time skew on quiet hours	Off-hours delivery complaints	Source TZ from user profile, not request; validate IANA TZ at write time.
WebSocket reconnect storm after gateway crash	Connection-rate spike	Exponential backoff on the client; coalesce reconnects per shard.

Scaling levers

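Size each priority topic’s partition count at creation, with headroom for the consumer parallelism you expect to need. Adding partitions later changes hash(user_id) % partition_count, so a user’s new messages can land on a different partition than their in-flight ones and per-user ordering breaks during the transition (Kafka partition design).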
Shard Redis dedup by user prefix when the working set exceeds a single cluster’s memory budget.
Move dedup to disk-tier KV (RocksDB, ScyllaDB) when 4+ week dedup windows are needed; Segment chose RocksDB at 1.5 TB (Twilio Segment).
Spread bulk sends in time (spreadOverMinutes) so SES, FCM, and downstream queues never see a vertical wall of traffic.
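The last lever, spreading bulk sends in time, reduces to assigning each recipient a send time inside the window. A minimal sketch follows; `spreadOverMinutes` is the parameter named above, but its exact semantics here (an even spread across the window) and the function name `spreadSendTimes` are assumptions.

```typescript
// Assigns evenly spaced send times across the window so downstream providers
// see a steady rate instead of a single spike at startMs.
function spreadSendTimes(recipientCount: number, startMs: number, spreadOverMinutes: number): number[] {
  const windowMs = spreadOverMinutes * 60_000
  const stepMs = recipientCount > 0 ? windowMs / recipientCount : 0
  return Array.from({ length: recipientCount }, (_unused, index) => startMs + Math.floor(index * stepMs))
}

// Four recipients over one minute: one send every 15 seconds.
const campaignSchedule = spreadSendTimes(4, 0, 1)
```

The resulting steady rate is `recipientCount / (spreadOverMinutes * 60)` sends per second, which can be checked against the tightest provider cap before the campaign is accepted.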

Conclusion

This design gives you:

At-least-once delivery through Kafka durability, retry with exponential backoff and jitter, and a dead-letter queue with manual replay.
Sub-500 ms server-side delivery for critical traffic via priority-segmented topics and dedicated worker pools.
User-centric throttling with preference-, channel-, and category-level caps, plus quiet-hours deferral that bypasses for critical only.
Multi-channel coverage with isolated processors so a SES, FCM, or APNs incident degrades only its own channel.
Horizontal scale to 1M+ notifications/sec via partitioned topics and Cassandra time-series storage.

Architectural decisions worth defending in a design review:

Priority-based queue separation prevents bulk traffic from monopolizing the path that carries security alerts.
User-partitioned Kafka enables co-located rate limiting and aggregation without a distributed lock.
Independent channel processors mean an SES outage cannot starve push throughput, and an FCM 429 cannot back-pressure email.
Template rendering at ingestion freezes the payload so dedup, retries, and audit logs all reference identical bytes.

Known limitations:

At-least-once means clients must handle duplicates; provide an idempotency hint in the payload (notificationId).
Cross-channel ordering is not guaranteed (push may arrive before email).
Aggregation windows add latency for batch-eligible notifications by design.
External provider rate limits and reputation systems are the ultimate bound on burst capacity.
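The first limitation puts a small obligation on clients: drop any notification whose notificationId has already been handled. A bounded in-memory set is usually enough on a device, as in the sketch below. The class name `SeenNotificationIds` and the eviction policy are illustrative assumptions.

```typescript
// Remembers recently handled notification ids, evicting the oldest once full.
class SeenNotificationIds {
  private readonly seen = new Set<string>()

  constructor(private readonly maxEntries: number) {}

  // Returns true the first time an id is seen and false for duplicates.
  markIfNew(notificationId: string): boolean {
    if (this.seen.has(notificationId)) return false
    this.seen.add(notificationId)
    if (this.seen.size > this.maxEntries) {
      // Sets iterate in insertion order, so the first value is the oldest.
      const oldest = this.seen.values().next().value
      if (oldest !== undefined) this.seen.delete(oldest)
    }
    return true
  }
}
```

A client would call `markIfNew(notification.notificationId)` before rendering and skip the notification when it returns false. The bound should comfortably exceed the number of notifications that can arrive within the server's retry window, otherwise a late redelivery can slip past an evicted id.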

Future enhancements:

ML-based send-time optimization — Pinterest and Airship report meaningful CTR uplift from per-user predicted send times (Pinterest NEP, Airship STO model).
Rich media notifications (images, action buttons, reply-from-notification).
Cross-device read-state sync (mark read on phone → clear on web).
Webhook delivery for B2B integrations as an additional channel.

Appendix

Prerequisites

Distributed systems fundamentals (durable logs, partitioning, idempotency).
Push notification protocols (APNs HTTP/2, FCM HTTP v1, Web Push).
Rate-limiting algorithms (token bucket, sliding window, leaky bucket).
Database selection trade-offs (relational, time-series, KV).

Terminology

Term	Definition
APNs	Apple Push Notification service — Apple’s push delivery infrastructure.
FCM	Firebase Cloud Messaging — Google’s cross-platform push service.
DLQ	Dead-letter queue — store for messages that exhausted retries.
TTL	Time-to-live — duration after which a notification or token expires.
Collapse key	Identifier for grouping related notifications (newer replaces older).
Token bucket	Rate-limiting algorithm allowing bursts up to bucket capacity.
Idempotent	Operation that produces the same observable result on repeat execution.
VAPID	Voluntary Application Server Identification — RFC 8292; Web Push auth.

Summary

Multi-channel delivery (push, email, SMS, in-app) with at-least-once guarantees using durable Kafka topics, bounded retries, and a DLQ for terminal failures.
Priority-based routing separates critical notifications (< 500 ms) from bulk traffic (best-effort, off-peak friendly).
Preference service with Redis-cached hot path enables per-user, per-category, per-channel control plus quiet hours.
Token-bucket rate limits at user and channel scope prevent fatigue and respect provider caps.
Aggregation collapses similar events (“5 new likes”) to reduce interruption count.
Cassandra time-series with (user_id, bucket) partition keys keeps history queryable at billions/day with native TTL.
Provider-aware error handling for FCM (UNREGISTERED, RESOURCE_EXHAUSTED), APNs (GOAWAY, BadCollapseId), and SES (bounce/complaint feedback) decides retry vs. permanent removal correctly.

References

Real-world implementations:

Uber’s Real-Time Push Platform — original SSE design — 600K connections, ~250K msg/s.
Uber’s Next-Gen Push Platform on gRPC — current 1.5M+ connection scale, gRPC bidi streaming.
LinkedIn Concourse — Apache Samza-based personalized content notifications in near real-time.
Netflix RENO — Rapid Event Notification System; SQS priority queues plus hybrid push-pull.
Slack — Tracing Notifications — 100% sampling, span links, notification_id = trace_id.
Pinterest NEP — Notification System and Relevance — ML-driven candidate ranker plus PID-controlled volume policy.

Standards and provider documentation:

Patterns and best practices:

Twilio Segment — Delivering billions of messages exactly once — 60B keys, 1.5 TB RocksDB dedup.
Stripe — Scaling your API with rate limiters — token bucket + sliding window in production.
Knock — Building a batched notification engine — batch-on-write vs batch-on-read patterns.
Braze — Frequency Capping — empirical impact of caps on retention.
Airship — ML model for predictive send-time optimization.

Related articles:

Design Real-Time Chat and Messaging — WebSocket connections, presence systems.
Design an API Rate Limiter — token bucket and sliding window algorithms in depth.
Design an Email System — SMTP, deliverability, and bounce handling.
Slack’s Distributed Architecture — for context on how Slack runs notifications at scale.