Web Video Playback Architecture: HLS, DASH, and Low Latency
The complete video delivery pipeline from codecs and compression to adaptive streaming protocols, DRM systems, and ultra-low latency technologies. Covers protocol internals, design trade-offs, and production failure modes for building resilient video applications.
Abstract
Video streaming solves a fundamental tension: high-quality video requires high bitrates, but network conditions are unpredictable and heterogeneous. The solution is Adaptive Bitrate (ABR) streaming—pre-encode multiple quality variants, segment them into small chunks, and let the client dynamically select based on real-time conditions.
The core mental model:
- Codecs compress by removing spatial/temporal redundancy. The trade-off is compression efficiency vs. decode complexity and hardware support. H.264 is universal but inefficient; AV1 achieves 30% better compression than HEVC (High Efficiency Video Coding) but requires modern hardware.
- Containers package compressed streams into addressable segments. CMAF (Common Media Application Format) unifies HLS and DASH delivery by standardizing fragmented MP4 (fMP4) structure, eliminating the need for duplicate storage.
- Manifests map available segments and quality levels. HLS uses text-based .m3u8 playlists; DASH uses XML-based MPD (Media Presentation Description) files. Both describe the same underlying content differently.
- The client drives quality selection. The player monitors buffer levels, network throughput, and device capabilities, then requests appropriate segments. The server is stateless—it serves whatever is requested.
- Latency is a spectrum. Traditional HLS/DASH: 6-15 seconds (segment duration × buffer depth). Low-Latency HLS/DASH: 2-4 seconds (partial segments + preload hints). WebRTC: <500ms (UDP + no buffering). Each step down in latency trades scalability and error resilience.
- DRM (Digital Rights Management) is ecosystem-fragmented. Widevine (Chrome/Android), FairPlay (Apple), PlayReady (Windows). CENC (Common Encryption) allows a single encrypted file to work with multiple DRM systems, but license acquisition remains platform-specific.
Latency trade-off summary:
| Approach | Latency | Why |
|---|---|---|
| Traditional HLS/DASH | 6-15s | Segment duration (4-6s) × buffer depth (2-3 segments) |
| LL-HLS/LL-DASH | 2-4s | Partial segments (~200ms) + HTTP long-polling |
| WebRTC | <500ms | UDP transport + minimal buffering + no HTTP overhead |
Introduction
Initial attempts at web video playback were straightforward but deeply flawed. The most basic method involved serving a complete video file, such as an MP4, directly from a server. While modern browsers can begin playback before the entire file is downloaded (progressive download), this approach is brittle. It offers no robust mechanism for seeking to un-downloaded portions of the video, fails completely upon network interruption, and locks the user into a single, fixed quality.
A slightly more advanced method, employing HTTP Range Requests, addressed the issues of seekability and resumability by allowing the client to request specific byte ranges of the file. This enabled a player to jump to a specific timestamp or resume a download after an interruption.
However, both of these early models shared a fatal flaw: they were built around a single, monolithic file with a fixed bitrate. This “one-size-fits-all” paradigm was economically and experientially unsustainable. Serving a high-quality, high-bitrate file to a user on a low-speed mobile network resulted in constant buffering and a poor experience, while simultaneously incurring high bandwidth costs for the provider.
This pressure gave rise to ABR streaming, the foundational technology of all modern video platforms. ABR inverted the delivery model. Instead of the server pushing a single file, the video is pre-processed into multiple versions at different quality levels. Each version is then broken into small, discrete segments. The client player is given a manifest file—a map to all available segments—and is empowered to dynamically request the most appropriate segment based on its real-time assessment of network conditions, screen size, and CPU capabilities.
Design rationale for segmentation: Small segments (typically 2-10 seconds) enable several critical capabilities:
- Fast quality switching: Client can change bitrate at segment boundaries without seeking
- Parallel CDN caching: Each segment is independently cacheable with unique URLs
- Error recovery: A corrupted segment affects only a few seconds of playback
- HTTP compatibility: Standard web servers and CDNs handle segments as regular files
The trade-off is manifest overhead and increased request count. For a 2-hour movie with 6-second segments, the player must fetch ~1,200 segments plus periodic manifest updates.
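The client-side selection this segmentation enables can be sketched in a few lines. Below is a minimal throughput-based rung picker, assuming the player already has a bandwidth estimate; the function name, safety factor, and ladder values are illustrative, and production players such as hls.js and dash.js also weigh buffer occupancy:

```javascript
// Hypothetical sketch: pick the highest rendition whose peak bitrate fits
// within a safety fraction of measured throughput.
function selectRung(rungs, throughputBps, safetyFactor = 0.8) {
  // rungs: [{ bandwidth: peak bits/s, height: px }], sorted ascending
  const budget = throughputBps * safetyFactor
  let chosen = rungs[0] // always fall back to the lowest rung
  for (const rung of rungs) {
    if (rung.bandwidth <= budget) chosen = rung
  }
  return chosen
}

const ladder = [
  { bandwidth: 1498000, height: 480 },
  { bandwidth: 2996000, height: 720 },
  { bandwidth: 5350000, height: 1080 },
]
// 4 Mbps measured throughput → 3.2 Mbps budget → the 720p rung
console.log(selectRung(ladder, 4000000).height) // 720
```

The safety factor exists because throughput estimates lag reality; selecting a rung that exactly matches measured bandwidth leaves no headroom for variance and causes oscillation.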
The Foundation - Codecs and Compression
At the most fundamental layer of the video stack lies the codec (coder-decoder), the compression algorithm that makes transmission of high-resolution video over bandwidth-constrained networks possible. Codecs work by removing spatial and temporal redundancy from video data, dramatically reducing file size.
How codecs achieve compression:
- Intra-frame (I-frames): Compress individual frames independently using spatial redundancy (similar adjacent pixels)
- Inter-frame (P/B-frames): Encode only the differences between frames using temporal redundancy (motion vectors)
- Transform coding: Convert pixel blocks to frequency domain (DCT/Discrete Cosine Transform), quantize, and entropy-code
A typical 1080p raw video at 30fps requires ~1.5 Gbps. H.264 compresses this to 5-10 Mbps (150-300x reduction). HEVC achieves similar quality at 3-5 Mbps; AV1 at 2-3 Mbps.
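The raw-bitrate figure is easy to verify. A quick arithmetic check, assuming 8 bits per channel with no chroma subsampling (24 bits/pixel; 4:2:0 capture would halve this):

```javascript
// Back-of-envelope check of the "~1.5 Gbps raw" figure for 1080p30
const width = 1920, height = 1080, fps = 30, bitsPerPixel = 24
const rawBps = width * height * fps * bitsPerPixel
console.log((rawBps / 1e9).toFixed(2) + " Gbps") // "1.49 Gbps"

// Compression ratio against a 5 Mbps H.264 encode
console.log(Math.round(rawBps / 5e6) + "x") // "299x"
```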
Video Codecs: A Comparative Analysis
H.264 (AVC - Advanced Video Coding)
Standardized in 2003 by ITU-T/ISO (ITU-T H.264 | ISO/IEC 14496-10), H.264 remains the most widely deployed video codec. Its dominance is not due to superior compression but to unparalleled compatibility. For two decades, hardware manufacturers have built dedicated H.264 decoding chips into virtually every device.
Key Characteristics:
| Attribute | Value |
|---|---|
| Compression Efficiency | Baseline (reference point) |
| Ideal Use Case | Universal compatibility, live streaming, ads |
| Licensing | MPEG LA patent pool (reasonable rates) |
| Hardware Support | Ubiquitous (100% of devices) |
| Typical Bitrate (1080p30) | 5-10 Mbps |
Design trade-off: H.264 prioritized decode simplicity over compression efficiency. The spec defines multiple profiles (Baseline, Main, High) with increasing complexity. Baseline profile can be decoded on the most constrained hardware; High profile enables better compression but requires more capable decoders.
Gotcha: The avc1 codec string in manifests encodes the profile and level (e.g., avc1.640028 = High profile, level 4.0). Mismatched codec strings cause playback failures even when the container is valid.
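As a sketch of what that string encodes: the three hex pairs after avc1. are profile_idc, constraint flags, and level_idc. The parseAvc1 helper below is hypothetical; in practice players pass the full string to MediaSource.isTypeSupported() rather than decode it themselves:

```javascript
// Minimal decoder for RFC 6381 avc1 codec strings (illustrative only)
const AVC_PROFILES = { 66: "Baseline", 77: "Main", 100: "High" }

function parseAvc1(codec) {
  const m = /^avc1\.([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(codec)
  if (!m) throw new Error("not an avc1 codec string: " + codec)
  const profileIdc = parseInt(m[1], 16) // first hex pair
  const levelIdc = parseInt(m[3], 16)   // third hex pair
  return {
    profile: AVC_PROFILES[profileIdc] || "profile_idc " + profileIdc,
    level: (levelIdc / 10).toFixed(1),  // level_idc 40 → "4.0"
  }
}

console.log(parseAvc1("avc1.640028")) // { profile: 'High', level: '4.0' }
console.log(parseAvc1("avc1.42E01E")) // { profile: 'Baseline', level: '3.0' }
```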
H.265 (HEVC - High Efficiency Video Coding)
Standardized in 2013 (ITU-T H.265 | ISO/IEC 23008-2), HEVC was designed for 4K and HDR (High Dynamic Range) content. Browser support is functional in approximately 92% of installed browsers per caniuse, but in every case it depends on the underlying OS and a hardware HEVC decoder; there is no widely-shipped software HEVC fallback in the major browsers, mainly because of patent-licensing exposure.
Key Characteristics:
| Attribute | Value |
|---|---|
| Compression Efficiency | ~50% better than H.264 |
| Ideal Use Case | 4K/UHD & HDR streaming |
| Licensing | Multiple patent pools (MPEG LA, HEVC Advance, Velos Media) |
| Hardware Support | Widespread (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm) |
| Typical Bitrate (1080p30) | 3-5 Mbps |
Design trade-off: HEVC achieves better compression through larger coding tree units (CTU, up to 64×64 vs. H.264’s 16×16 macroblocks) and more sophisticated prediction modes. This increases encoder complexity by 5-10x, making real-time encoding expensive.
Licensing complexity: Three separate patent pools with inconsistent terms drove the industry toward AV1. A single 4K stream may owe royalties to all three pools, making cost calculation non-trivial.
AV1 (AOMedia Video 1)
Released in 2018 by the Alliance for Open Media (AOM)—Google, Netflix, Amazon, Microsoft, Meta, and others—AV1 was a direct response to HEVC’s licensing fragmentation.
Key Characteristics:
| Attribute | Value |
|---|---|
| Compression Efficiency | ~30% better than HEVC |
| Ideal Use Case | High-volume VOD, bandwidth savings |
| Licensing | Royalty-free (AOM patent commitment) |
| Hardware Support | Mobile: still partial; large screens: ~88% of Netflix-certified devices since 2021 |
| Typical Bitrate (1080p30) | 2-3 Mbps |
Adoption status (late 2025):
- YouTube serves 75%+ of its catalog (weighted by watch time) in AV1; software AV1 decode was rolled out to mobile devices without hardware support starting in 2024.1
- AV1 powers ~30% of Netflix streaming; 88% of large-screen devices submitted for Netflix certification between 2021 and 2025 ship AV1 hardware decode.2
- Apple shipped AV1 hardware decode in the A17 Pro (iPhone 15 Pro) and M3 in 2023; current Snapdragon flagships, Intel Arc / 12th-gen+ iGPUs, and NVIDIA RTX 40-series also decode AV1 in hardware.2
Design trade-off: AV1’s superior compression comes from computationally expensive encoding. Software encoding is 10-20x slower than H.264. Hardware encoders (NVIDIA NVENC, Intel QuickSync) are now available but still slower than HEVC hardware encoding.
When to use AV1: High-volume VOD where encoding cost amortizes across many views. Not yet practical for low-latency live streaming without hardware acceleration.
Audio Codecs
AAC (Advanced Audio Coding)
AAC is the de facto standard for audio in video streaming. Standardized in MPEG-2 Part 7 and MPEG-4 Part 3, it’s the default audio codec for MP4 containers and supported by nearly every device.
| Attribute | Value |
|---|---|
| Primary Use Case | VOD, music streaming |
| Low Bitrate (<96kbps) | Fair; quality degrades noticeably |
| High Bitrate (>128kbps) | Excellent; industry standard |
| Latency | ~100-200ms (not ideal for real-time) |
| Compatibility | Near-universal |
| Licensing | MPEG LA patent pool |
Profile variants: AAC-LC (Low Complexity) is most common. HE-AAC (High Efficiency) adds spectral band replication for better low-bitrate performance. HE-AACv2 adds parametric stereo.
Opus
Opus (IETF RFC 6716) is a royalty-free codec developed for real-time communication. Its standout feature is exceptional performance at low bitrates while maintaining sub-20ms algorithmic latency.
| Attribute | Value |
|---|---|
| Primary Use Case | WebRTC, VoIP, low-latency streaming |
| Low Bitrate (<96kbps) | Excellent; maintains intelligibility |
| High Bitrate (>128kbps) | Competitive with AAC |
| Latency | 2.5-60ms (configurable) |
| Compatibility | All modern browsers, limited hardware support |
| Licensing | Royalty-free, BSD-licensed |
Design rationale: Opus combines two compression technologies—SILK (speech-optimized, from Skype) and CELT (music-optimized). The encoder dynamically switches based on content characteristics, achieving good quality across voice, music, and mixed content.
Packaging and Segmentation
Once the audio and video have been compressed by their respective codecs, they must be packaged into a container format and segmented into small, deliverable chunks. This intermediate stage is critical for enabling adaptive bitrate streaming.
Container Formats
MPEG Transport Stream (.ts)
MPEG-TS is the traditional container for HLS. Its origins lie in digital broadcast (DVB), where its structure of small, fixed-size 188-byte packets was designed for resilience against transmission errors over unreliable networks.
Design characteristics:
- Fixed packet size enables recovery from partial data loss
- Self-synchronizing (sync byte 0x47 every 188 bytes)
- Designed for continuous streams, not random access
Limitation: MPEG-TS has significant overhead (~8-15%) compared to fMP4, and each segment requires redundant metadata. This is why modern deployments prefer fMP4/CMAF.
Fragmented MP4 (fMP4)
Fragmented MP4 is the modern, preferred container for both HLS (since 2016) and DASH. It’s a variant of the ISO Base Media File Format (ISOBMFF, ISO/IEC 14496-12).
Box structure for streaming:
```
fMP4 File Structure:
├── ftyp (file type)
├── moov (movie metadata)
│   ├── mvhd (movie header)
│   └── trak (track info, codec config)
├── moof (movie fragment header) ─┐
│   └── traf (track fragment)     │  Repeats for
└── mdat (media data)            ─┘  each segment
```

The moov box must appear before any mdat for playback to begin without full download (“fast start”). For live streaming, the moov contains initialization data, and each segment is a moof + mdat pair.
Initialization segment: Contains ftyp + moov with codec configuration but no media samples. Must be fetched before any media segments. Typically 1-5 KB.
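The box layout above can be inspected with a few lines of code. A minimal sketch of a top-level ISOBMFF box walker follows; it ignores 64-bit largesize boxes (size = 1) and to-end-of-file boxes (size = 0), which a real parser must handle:

```javascript
// Each ISOBMFF box starts with a 4-byte big-endian size, then a 4-byte type
function listBoxes(buf) {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength)
  const boxes = []
  let offset = 0
  while (offset + 8 <= buf.byteLength) {
    const size = view.getUint32(offset)
    const type = String.fromCharCode(...buf.subarray(offset + 4, offset + 8))
    boxes.push({ type, size })
    if (size < 8) break // malformed, or an unsupported largesize/EOF box
    offset += size
  }
  return boxes
}

// A fabricated 24-byte buffer with two boxes: an empty moof, then an mdat
const seg = new Uint8Array([
  0, 0, 0, 8, 0x6d, 0x6f, 0x6f, 0x66,  // size=8,  type "moof"
  0, 0, 0, 16, 0x6d, 0x64, 0x61, 0x74, // size=16, type "mdat"
  1, 2, 3, 4, 5, 6, 7, 8,              // 8 bytes of media payload
])
console.log(listBoxes(seg).map((b) => b.type)) // [ 'moof', 'mdat' ]
```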
CMAF (Common Media Application Format)
CMAF (ISO/IEC 23000-19:2024) is not a new container but a standardization of fMP4 for streaming. Its introduction was a watershed moment for the industry.
The problem CMAF solves: Before CMAF, supporting both Apple devices (HLS with .ts) and other devices (DASH with .mp4) required encoding, packaging, and storing two complete sets of video files. This doubled storage costs and halved CDN cache efficiency, because the same content was hashed under two different URL spaces.
A provider creates one set of CMAF segments and serves them with two different manifest files. Storage cost: 1× instead of 2×; cache fill cost (the bandwidth between origin and CDN edges) drops in lock-step.
CMAF chunks for low-latency: CMAF defines “chunks”—the smallest addressable unit containing a moof + mdat pair. For low-latency streaming, each chunk can be independently transferred via HTTP chunked encoding as soon as it’s encoded, without waiting for the full segment.
Example: A 4-second segment at 30fps contains 120 frames. With one frame per CMAF chunk, the first chunk is available ~33ms after encoding starts, versus ~4 seconds for the full segment (a head start of roughly 3.97 seconds).
The Segmentation Process
ffmpeg is the workhorse of video processing. Here is a minimal three-rendition HLS encoding pipeline that produces a master playlist plus per-variant media playlists and segments:
```shell
ffmpeg -i source.mp4 \
  -filter_complex "[0:v]split=3[v1][v2][v3]; \
    [v1]scale=w=1920:h=1080[v1080]; \
    [v2]scale=w=1280:h=720[v720]; \
    [v3]scale=w=854:h=480[v480]" \
  -map "[v1080]" -c:v:0 libx264 -b:v:0 5000k -maxrate:v:0 5350k -bufsize:v:0 7500k \
  -map "[v720]"  -c:v:1 libx264 -b:v:1 2800k -maxrate:v:1 2996k -bufsize:v:1 4200k \
  -map "[v480]"  -c:v:2 libx264 -b:v:2 1400k -maxrate:v:2 1498k -bufsize:v:2 2100k \
  -map a:0 -c:a aac -b:a:0 128k -ac 2 \
  -map a:0 -b:a:1 128k -ac 2 \
  -map a:0 -b:a:2 96k -ac 2 \
  -g 60 -keyint_min 60 -sc_threshold 0 \
  -hls_time 6 -hls_playlist_type vod \
  -hls_segment_type fmp4 -hls_flags independent_segments \
  -master_pl_name master.m3u8 \
  -var_stream_map "v:0,a:0 v:1,a:1 v:2,a:2" \
  out/stream_%v/playlist.m3u8
```

Key parameters explained:
| Parameter | Purpose |
|---|---|
| `split=3` | Splits the decoded video into N parallel filter chains, one per rendition |
| `scale=w=W:h=H` | Resizes to the target rendition resolution |
| `-c:v:N libx264 -b:v:N Xk` | Sets the codec and target bitrate for output stream N |
| `-maxrate` `-bufsize` | Caps the leaky-bucket peak bitrate so ABR rungs do not overshoot the rung above |
| `-g 60 -keyint_min 60 -sc_threshold 0` | Forces a fixed 2-second GOP at 30 fps so every segment can begin on a keyframe and aligns across renditions |
| `-hls_time 6` | Target segment duration; actual length snaps to the next keyframe inside ±0.5s of the target |
| `-hls_segment_type fmp4` | Emits CMAF-style fMP4 init + media segments instead of MPEG-TS |
| `-hls_flags independent_segments` | Each segment is independently decodable (no cross-segment frame dependencies) |
| `-var_stream_map` | Groups video and audio outputs into ABR variants in the master playlist |
| `-master_pl_name` | Filename for the generated master playlist that references all variants |
Gotcha: Keyframe alignment. Segments can only split at keyframes (I-frames). If your source has keyframes every 10 seconds but you request 6-second segments, actual segment duration will be ~10 seconds. Always set the keyframe interval at encode time (-g, -keyint_min, -sc_threshold 0). For low-latency, use a 1-2 second GOP so partial segments (LL-HLS parts) line up with EXT-X-PART-INF:PART-TARGET.3
Gotcha: Segment duration consistency. The HLS spec requires every segment’s EXTINF duration to be no greater than EXT-X-TARGETDURATION, with the target itself rounded from the actual maximum to the nearest integer.4 Variable segment durations (common with scene-change keyframes when -sc_threshold 0 is omitted) confuse player buffer models and can cause underruns near the live edge.
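The duration rule is straightforward to lint for. A minimal sketch that flags EXTINF values exceeding EXT-X-TARGETDURATION; parsing here is deliberately naive, and a real validator would use a full m3u8 parser:

```javascript
// Flag every EXTINF duration greater than the declared target duration
function checkTargetDuration(playlist) {
  let target = null
  const violations = []
  for (const line of playlist.split("\n")) {
    if (line.startsWith("#EXT-X-TARGETDURATION:")) {
      target = parseInt(line.split(":")[1], 10)
    } else if (line.startsWith("#EXTINF:")) {
      const duration = parseFloat(line.slice("#EXTINF:".length))
      if (target !== null && duration > target) violations.push(duration)
    }
  }
  return violations
}

const playlist = [
  "#EXTM3U",
  "#EXT-X-TARGETDURATION:6",
  "#EXTINF:6.000,", "segment_00000.m4s",
  "#EXTINF:6.500,", "segment_00001.m4s", // violates the spec rule
  "#EXT-X-ENDLIST",
].join("\n")
console.log(checkTargetDuration(playlist)) // [ 6.5 ]
```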
The Protocols of Power - HLS and MPEG-DASH
The protocols for adaptive bitrate streaming define the rules of communication between the client and server. They specify the manifest format and segment addressing scheme.
HLS (HTTP Live Streaming)
Created by Apple and documented in RFC 8216, HLS is the most widely deployed streaming protocol. The spec is being revised in draft-pantos-hls-rfc8216bis (draft-21 as of March 2026), which still describes HLS protocol version 13 but folds in everything that has been added since 2017 — fMP4, LL-HLS, content steering, interstitials. Its dominance stems from mandatory support on Apple’s ecosystem — Safari will only play HLS natively.
Design philosophy: HLS was designed to work with standard HTTP infrastructure. Segments are regular files; playlists are text files. Any HTTP server or CDN can serve HLS content without modification.
Playlist Hierarchy
HLS uses a two-level playlist structure:
Master Playlist (entry point, lists all variants):
```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac",NAME="English",DEFAULT=YES,LANGUAGE="en",URI="audio/en/playlist.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=5350000,AVERAGE-BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2",AUDIO="aac"
video/1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2996000,AVERAGE-BANDWIDTH=2800000,RESOLUTION=1280x720,CODECS="avc1.64001f,mp4a.40.2",AUDIO="aac"
video/720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1498000,AVERAGE-BANDWIDTH=1400000,RESOLUTION=854x480,CODECS="avc1.64001e,mp4a.40.2",AUDIO="aac"
video/480p/playlist.m3u8
```

Key tags explained:
| Tag | Purpose |
|---|---|
| `EXT-X-VERSION` | HLS protocol version (3 is widely compatible; 7+ for fMP4) |
| `EXT-X-STREAM-INF` | Describes a variant stream |
| `BANDWIDTH` | Peak bitrate in bits/second (used for ABR selection) |
| `AVERAGE-BANDWIDTH` | Average bitrate (more accurate for buffer estimation) |
| `CODECS` | RFC 6381 codec string (critical for playback capability check) |
Media Playlist (lists segments for one variant):
```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-MAP:URI="init.mp4"
#EXTINF:6.000,
segment_00000.m4s
#EXTINF:6.000,
segment_00001.m4s
#EXTINF:5.840,
segment_00002.m4s
#EXTINF:6.000,
segment_00003.m4s
#EXT-X-ENDLIST
```

Critical tags for playback:
| Tag | Purpose |
|---|---|
| `EXT-X-TARGETDURATION` | Maximum segment duration (player uses for buffer calculations) |
| `EXT-X-MEDIA-SEQUENCE` | Sequence number of first segment (critical for live edge tracking) |
| `EXTINF` | Actual segment duration |
| `EXT-X-ENDLIST` | Indicates VOD content (no more segments will be added) |
Live streaming behavior: For live streams, EXT-X-ENDLIST is absent. The player periodically re-fetches the playlist to discover new segments. The refresh interval is typically target duration / 2 per the spec.
Gotcha: Sequence number discontinuities. If the origin server restarts and resets EXT-X-MEDIA-SEQUENCE to 0, players may incorrectly seek to the wrong position or refuse to play. Production systems must persist sequence numbers across restarts.
MPEG-DASH
Dynamic Adaptive Streaming over HTTP (DASH) is standardized as ISO/IEC 23009-1 (5th edition: 2022). Unlike HLS, which was created by Apple, DASH was developed through an open, international standardization process.
Key differentiator: DASH is codec-agnostic. The spec defines manifest structure and segment addressing but makes no assumptions about codecs. This enables delivery of any format: H.264, HEVC, AV1, VP9, and future codecs.
Design philosophy: DASH prioritizes flexibility and expressiveness. The XML-based MPD can describe complex presentations (multiple periods, dynamic ad insertion, multiple audio languages, accessibility tracks) more precisely than HLS’s text-based format.
MPD (Media Presentation Description) Structure
```xml
<MPD type="static" mediaPresentationDuration="PT600S"
     profiles="urn:mpeg:dash:profile:isoff-on-demand:2011">
  <Period duration="PT600S">
    <AdaptationSet contentType="video" mimeType="video/mp4" codecs="avc1.640028">
      <Representation id="video-1080p" bandwidth="5000000" width="1920" height="1080">
        <BaseURL>video/1080p/</BaseURL>
        <SegmentTemplate media="segment-$Number$.m4s" initialization="init.mp4" startNumber="1" />
      </Representation>
      <Representation id="video-720p" bandwidth="2800000" width="1280" height="720">
        <BaseURL>video/720p/</BaseURL>
        <SegmentTemplate media="segment-$Number$.m4s" initialization="init.mp4" startNumber="1" />
      </Representation>
    </AdaptationSet>
    <AdaptationSet contentType="audio" mimeType="audio/mp4" codecs="mp4a.40.2" lang="en">
      <Representation id="audio-en" bandwidth="128000">
        <BaseURL>audio/en/</BaseURL>
        <SegmentTemplate media="segment-$Number$.m4s" initialization="init.mp4" startNumber="1" />
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>
```

Hierarchy: MPD → Period → AdaptationSet → Representation
- Period: Content divisions (pre-roll ad, main content, mid-roll ad). Enables seamless ad insertion.
- AdaptationSet: Groups switchable representations (all 1080p variants, or all audio languages).
- Representation: One specific variant (1080p at 5Mbps H.264).
Segment addressing methods:
- SegmentTemplate: Constructs URLs from templates with variable substitution ($Number$, $Time$). Most common.
- SegmentList: Explicit enumeration of segment URLs. Useful for variable-duration segments.
- SegmentBase: Single segment with byte-range addressing via the sidx box. Used for VOD with an in-file index.
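SegmentTemplate substitution is simple string expansion. A sketch handling $Number$ addressing, including the printf-style padded form $Number%05d$ defined by the spec; $Time$ and $Bandwidth$ substitution would follow the same pattern:

```javascript
// Expand a DASH SegmentTemplate media attribute for a given segment number
function expandTemplate(template, number) {
  return template.replace(/\$Number(%0(\d+)d)?\$/g, (_, fmt, width) =>
    width ? String(number).padStart(Number(width), "0") : String(number)
  )
}

console.log(expandTemplate("segment-$Number$.m4s", 42))
// "segment-42.m4s"
console.log(expandTemplate("seg_$Number%05d$.m4s", 42))
// "seg_00042.m4s"
```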
Gotcha: Period transitions. Playback can stutter at period boundaries if the transition isn’t handled correctly. The periodContinuity signaling in DASH-IF guidelines enables seamless appends, but short periods (<5 seconds) with network latency can still cause stalls.
HLS vs. DASH: Technical Comparison
| Feature | HLS | MPEG-DASH |
|---|---|---|
| Spec Owner | Apple (RFC 8216) | ISO/IEC 23009-1 |
| Manifest Format | Text-based (.m3u8) | XML-based (.mpd) |
| Manifest Size | Smaller (10-50 KB live) | Larger (20-100 KB complex) |
| Codec Support | H.264, HEVC, limited others | Any codec (agnostic) |
| Container Support | MPEG-TS, fMP4/CMAF | fMP4/CMAF, WebM |
| Native DRM | FairPlay | Widevine, PlayReady |
| Safari/iOS | Native | Not supported |
| Manifest Expressiveness | Limited (extensions for advanced features) | Rich (periods, segment timelines, etc.) |
| Low-Latency Extension | LL-HLS | LL-DASH |
| Industry Adoption | Dominant (Apple ecosystem requirement) | Strong (Android, smart TVs, web) |
Practical implication: Most production systems support both. CMAF enables shared media segments; only the manifest differs. The choice often comes down to DRM requirements and target platform mix.
Securing the Stream: Digital Rights Management
For premium content, preventing unauthorized copying and distribution is a business necessity. DRM provides content protection through encryption and controlled license issuance.
The Multi-DRM Landscape
Three major DRM systems dominate, each tied to a specific ecosystem:
| DRM System | Ecosystem | Browser | Mobile | TV/STB |
|---|---|---|---|---|
| Widevine (Google) | Chrome, Android | Chrome, Firefox, Edge | Android | Android TV, Chromecast |
| FairPlay (Apple) | Apple | Safari | iOS | Apple TV |
| PlayReady (Microsoft) | Windows | Edge | — | Xbox, Smart TVs |
Why three systems? Each platform vendor controls the secure execution environment (TEE/Trusted Execution Environment) on their hardware. DRM requires hardware-backed security for premium content (4K, HDR). No vendor will trust another vendor’s TEE implementation.
Security Levels
DRM systems define security levels based on where decryption occurs:
Widevine Levels:
- L1 (Hardware): Decryption in TEE; keys never exposed to main CPU. Required for HD/4K on most services.
- L2 (Hybrid): Partial hardware protection. Uncommon in practice.
- L3 (Software): Decryption in browser process. No hardware protection. Limited to SD resolution on premium services.
FairPlay: Leverages Apple’s Secure Enclave for hardware-backed security on all modern devices.
PlayReady Levels:
- SL3000: TEE-based, hardware protection (introduced with PlayReady 3.0 in 2015)
- SL2000: Software-based protection
Content policy implication: Netflix, Disney+, and other premium services enforce L1/SL3000 for 4K content. A Chrome user on Linux typically gets L3 Widevine and is limited to 720p.
Common Encryption (CENC)
CENC (ISO/IEC 23001-7:2023) enables a single encrypted file to work with multiple DRM systems. This is the key technology making multi-DRM practical.
How CENC works:
- Content is encrypted once with AES-128 (same encrypted bytes for all DRM systems)
- Each DRM system’s metadata (PSSH box) is embedded in the init segment
- At playback, the player detects available DRM systems, requests a license from the appropriate server, and decrypts using the returned key
Encryption modes:
- cenc (CTR mode): Default. Parallelizable, suitable for streaming. No padding required.
- cbcs (CBC mode with subsample patterns): Required by FairPlay. Encrypts only a subset of bytes, leaving NAL headers in plaintext for codec inspection.
Subsample encryption: For NAL-structured video (H.264, HEVC), only the coded slice data is encrypted. NAL headers remain in plaintext, allowing the player to parse codec configuration without decryption.
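The PSSH embedding described above can be located with the same box-walking approach used for any ISOBMFF file. A sketch that extracts the 16-byte SystemID from each top-level pssh box; the test buffer is fabricated, while the Widevine SystemID shown is the registered constant:

```javascript
const WIDEVINE = "edef8ba979d64acea3c827dcd51d21ed" // registered SystemID

// Walk top-level boxes; for each "pssh" box, read the SystemID that
// follows the 4-byte version/flags word
function findPsshSystemIds(buf) {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength)
  const ids = []
  let offset = 0
  while (offset + 8 <= buf.byteLength) {
    const size = view.getUint32(offset)
    const type = String.fromCharCode(...buf.subarray(offset + 4, offset + 8))
    if (type === "pssh" && size >= 28) {
      const id = [...buf.subarray(offset + 12, offset + 28)]
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("")
      ids.push(id)
    }
    if (size < 8) break
    offset += size
  }
  return ids
}

// Fabricated fragment containing one minimal Widevine pssh box (v0, no KIDs)
const systemIdBytes = WIDEVINE.match(/../g).map((h) => parseInt(h, 16))
const pssh = new Uint8Array([
  0, 0, 0, 32, 0x70, 0x73, 0x73, 0x68, // size=32, type "pssh"
  0, 0, 0, 0,                          // version 0, flags 0
  ...systemIdBytes,                    // 16-byte SystemID
  0, 0, 0, 0,                          // DataSize = 0
])
console.log(findPsshSystemIds(pssh)[0] === WIDEVINE) // true
```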
EME: The Browser API
Encrypted Media Extensions (EME) is the W3C Recommendation that connects JavaScript to the platform’s Content Decryption Module (CDM). EME standardizes the handshake; the CDM and the underlying DRM (Widevine, FairPlay, PlayReady) standardize the trust.
Flow:
- Player detects encrypted content (via the encrypted event on HTMLMediaElement).
- Calls navigator.requestMediaKeySystemAccess() to check DRM availability and required capabilities (codec, robustness level).
- Creates MediaKeys and a MediaKeySession.
- Sends the license challenge produced by the CDM to the operator’s license server (typically over an authenticated HTTPS endpoint).
- Passes the signed license response back into the session via session.update().
- CDM decrypts content; HTMLMediaElement plays.
Important
EME is not DRM. EME is the JavaScript API; Widevine, FairPlay, and PlayReady are the DRM systems behind it. The same EME call sequence can resolve to a software CDM with no robustness guarantees (e.g. Widevine L3) or to a hardware-backed CDM (Widevine L1, FairPlay on Secure Enclave, PlayReady SL3000). License servers must consult the negotiated robustness level before issuing keys for premium tiers.
The New Frontier: Ultra-Low Latency
Traditional HLS/DASH has 6-15+ seconds of latency (segment duration × buffer depth). For live sports, auctions, and interactive content, this is unacceptable. Two approaches address this: Low-Latency HLS/DASH (HTTP-based, 2-4 seconds) and WebRTC (UDP-based, <500ms).
Understanding Latency Sources
| Component | Traditional | Low-Latency | WebRTC |
|---|---|---|---|
| Encoding | 1-2s (segment duration) | 200-500ms (partial segment) | 20-100ms (per-frame) |
| Packaging | ~1s | <100ms | N/A |
| CDN Edge | 1-2s | 500ms-1s | N/A (SFU direct) |
| Manifest Update | 2-3s | 200-500ms (blocking reload) | N/A |
| Player Buffer | 6-12s (2-3 segments) | 1-3s (parts + safety) | 50-200ms (jitter buffer) |
| Total | 10-20s | 2-5s | 100-500ms |
Low-Latency HLS (LL-HLS)
Apple introduced LL-HLS at WWDC 2019 to reduce latency while preserving HTTP scalability. It achieves 2-4 second latency through three mechanisms working together:
1. Partial Segments (Parts)
Instead of waiting for a full 4-6 second segment, LL-HLS publishes smaller “parts” (Apple recommends PART-TARGET=1.0 for stability; deployments typically use 200 ms-1 s) as soon as they are encoded.5
```
#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-PART-INF:PART-TARGET=0.5
#EXT-X-MEDIA-SEQUENCE:100
#EXT-X-PART:DURATION=0.5,URI="segment100_part0.m4s"
#EXT-X-PART:DURATION=0.5,URI="segment100_part1.m4s"
#EXT-X-PART:DURATION=0.5,URI="segment100_part2.m4s"
#EXTINF:2.0,
segment100.m4s
```

Design rationale: Parts are published incrementally but roll up into full segments. Legacy players ignore EXT-X-PART tags and fetch full segments. LL-HLS-aware players fetch parts for lower latency.
2. Blocking Playlist Reload
Traditional HLS requires the player to poll for playlist updates (every target duration / 2). This polling introduces latency—the player might check just after an update and wait until the next poll.
LL-HLS uses HTTP long-polling: the player requests the playlist with a query parameter indicating the last seen media sequence. The server holds the connection until new content is available, then responds immediately.
```
GET /playlist.m3u8?_HLS_msn=100&_HLS_part=3
```

The server blocks until segment 100, part 3 (or later) exists, then responds with the updated playlist.
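Constructing the blocking request is trivial on the client side. A sketch, assuming the player tracks the last seen media sequence number and part index (blockingReloadUrl is a hypothetical helper name):

```javascript
// Build a blocking-reload URL: _HLS_msn names the media sequence number
// and _HLS_part the part index the client is waiting for
function blockingReloadUrl(playlistUrl, msn, part) {
  const url = new URL(playlistUrl)
  url.searchParams.set("_HLS_msn", String(msn))
  if (part !== undefined) url.searchParams.set("_HLS_part", String(part))
  return url.toString()
}

console.log(blockingReloadUrl("https://cdn.example.com/playlist.m3u8", 100, 3))
// "https://cdn.example.com/playlist.m3u8?_HLS_msn=100&_HLS_part=3"
```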
3. Preload Hints
The server tells the player the URI of the next part before it exists:
```
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="segment100_part4.m4s"
```

The player can issue a request immediately. The server holds the request open until the part is ready. This eliminates the round-trip between playlist update and segment request.
Latency budget:
- Part generation: ~500ms
- CDN propagation: ~200ms
- Blocking playlist: ~0ms (pre-positioned)
- Player buffer: ~1-2 seconds (safety margin)
- Total: 2-3 seconds
Low-Latency DASH (LL-DASH)
LL-DASH achieves similar latency through different mechanisms:
- CMAF Chunked Transfer: Each CMAF chunk is transferred via HTTP chunked encoding as it’s generated. The client receives data before the segment is complete.
- Service Description: The MPD includes ServiceDescription with target latency and playback rate adjustments:

```xml
<ServiceDescription>
  <Latency target="3500" min="2000" max="6000" />
  <PlaybackRate min="0.96" max="1.04" />
</ServiceDescription>
```

- CMSD (Common Media Server Data): Server-to-client signaling of real-time latency targets, enabling dynamic adjustment.
ISO/IEC 23009-1 is currently being revised — the 6th edition has reached FDIS as of 2025 and is expected to tighten low-latency interoperability with the LL-HLS partial-segment model and align more closely with DASH-IF Live Media Ingest practice.6
WebRTC (Web Real-Time Communication)
WebRTC is fundamentally different from HTTP streaming. It’s designed for true real-time, bidirectional communication with sub-second latency.
Key architectural differences:
| Aspect | HLS/DASH | WebRTC |
|---|---|---|
| Transport | TCP (HTTP) | UDP (SRTP) |
| Connection Model | Stateless request/response | Stateful peer connections |
| Error Handling | Retransmit on loss | Skip or interpolate lost data |
| Buffering | Seconds of buffer | Milliseconds (jitter buffer) |
| Scaling | CDN (millions of viewers) | SFU/MCU (hundreds per server) |
Why UDP for low latency? TCP’s reliability (retransmission, ordering) introduces head-of-line blocking. A lost packet blocks all subsequent packets until retransmitted. For live video, it’s better to skip a frame than delay the entire stream.
SFU (Selective Forwarding Unit) architecture: For group calls and broadcasting, WebRTC uses SFUs. Each participant sends once to the SFU; the SFU forwards streams to all receivers without transcoding. This scales better than mesh (N² connections) or MCU (transcoding bottleneck), but it is fundamentally bounded by per-server fan-out — typically hundreds to low thousands of viewers per SFU instance versus the millions a single CDN PoP can serve.
Latency Technology Selection
| Use Case | Technology | Latency | Trade-off |
|---|---|---|---|
| VOD | Traditional HLS/DASH | 6-15s | Maximum compatibility, CDN caching |
| Live Sports | LL-HLS/LL-DASH | 2-5s | Scalable, some latency |
| Live Auctions | LL-HLS/LL-DASH | 2-3s | HTTP infrastructure |
| Video Conferencing | WebRTC | <500ms | Requires SFU infrastructure |
| Gaming/Interactive | WebRTC | <200ms | Limited scale per server |
Hybrid approaches are emerging: Some systems use WebRTC for the first hop (ingest) and LL-HLS for distribution. This combines low-latency capture with CDN scalability.
Client-Side Playback: Media Source Extensions
The browser’s native <video> element can only handle progressive download or, where the browser supports it natively (chiefly Safari for HLS), a single stream with no page-level control over quality selection. For adaptive streaming with quality switching, players use Media Source Extensions (MSE).
MSE Architecture
Media Source Extensions (W3C Recommendation, MSE 2 in active maintenance) provides a JavaScript API to feed media data to <video>:
```javascript
const mediaSource = new MediaSource()
video.src = URL.createObjectURL(mediaSource)

mediaSource.addEventListener("sourceopen", () => {
  const sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.640028"')

  // Append segments as they're fetched
  fetch("segment.m4s")
    .then((r) => r.arrayBuffer())
    .then((data) => {
      sourceBuffer.appendBuffer(data)
    })
})
```

Key components:
- MediaSource: Container that connects to HTMLMediaElement
- SourceBuffer: Buffer for a single track (video, audio, or text). Handles segment parsing and decode.
- SourceBufferList: Collection of active buffers
Buffer Management
The player must balance competing concerns:
- Sufficient buffer for smooth playback (2-30 seconds depending on use case)
- Responsive quality switching (smaller buffer = faster adaptation)
- Memory constraints (5-20 MB typical; mobile is more constrained)
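A minimal sketch of player-managed eviction under these constraints. The `[start, end]` range model and the `backBufferSeconds` knob are illustrative; a real player reads `sourceBuffer.buffered` and passes the result to `sourceBuffer.remove()`:

```javascript
// Sketch: compute what to evict behind the playhead before appending more
// data. `buffered` is modeled as [start, end] pairs (like TimeRanges);
// backBufferSeconds is an illustrative tuning knob, not a spec value.
function evictionRange(buffered, currentTime, backBufferSeconds = 10) {
  const keepFrom = currentTime - backBufferSeconds
  if (buffered.length === 0 || buffered[0][0] >= keepFrom) return null
  // In a real player this pair feeds sourceBuffer.remove(start, end).
  return [buffered[0][0], keepFrom]
}
```

Running this before each append, and retrying the append after a `QuotaExceededError`, is the usual shape of the eviction loop.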
ManagedMediaSource (Safari 17, 2023): Apple shipped ManagedMediaSource in Safari 17 (macOS) and Safari 17.1 (iOS); it is currently the only browser implementation.7 The browser owns buffer eviction; the player listens for bufferedchange to react. ManagedMediaSource only enters the open state if the page either provides an HLS AirPlay alternative or sets video.disableRemotePlayback = true.
Gotcha: Buffer eviction timing. If the player doesn’t track eviction, it may request already-evicted segments, causing playback stalls. Production players must handle the bufferedchange event or implement their own eviction tracking.
Codec Switching
MSE’s changeType() method enables mid-stream codec changes without replacing the SourceBuffer:
```javascript
sourceBuffer.changeType('video/mp4; codecs="hev1.1.6.L93.B0"')
```

Use case: Start with H.264 for immediate playback (universal decode), then switch to HEVC for bandwidth savings once the player confirms hardware support.
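That upgrade decision can be sketched as a pure selection function. Here `isSupported` stands in for `MediaSource.isTypeSupported` and is injected so the policy is testable outside a browser; the preference order and codec strings are illustrative:

```javascript
const H264 = 'video/mp4; codecs="avc1.640028"'
const HEVC = 'video/mp4; codecs="hev1.1.6.L93.B0"'
const AV1 = 'video/mp4; codecs="av01.0.08M.08"'

// `isSupported` stands in for MediaSource.isTypeSupported; injecting it
// keeps the policy testable outside a browser. Preference order is
// illustrative: best compression first, H.264 as the universal fallback.
function pickUpgradeCodec(isSupported) {
  for (const mime of [AV1, HEVC]) {
    if (isSupported(mime)) return mime
  }
  return H264
}
```

In a browser the result would feed `sourceBuffer.changeType(pickUpgradeCodec(MediaSource.isTypeSupported))` once startup playback is stable.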
Common Failure Modes
| Failure | Symptom | Cause |
|---|---|---|
| QuotaExceededError | appendBuffer fails | Buffer full; implement eviction |
| Decode error | Video freezes | Corrupted segment or codec mismatch |
| Gap in timeline | Audio/video desync | Discontinuity not handled |
| Infinite buffering | Never starts | Init segment missing or codec string wrong |
Player Libraries: hls.js, dash.js, shaka-player
In practice, almost no production team writes the MSE / EME glue from scratch. Three open-source players cover the field:
| Library | Maintainer | Protocols | Engine model |
|---|---|---|---|
| hls.js | video-dev community | HLS only (RFC 8216 + LL-HLS) | Dedicated, optimized for the HLS playlist tree; piped through MSE on every browser except Safari, which uses native HLS. |
| dash.js | DASH Industry Forum | MPEG-DASH only (incl. LL-DASH) | Reference player for the DASH-IF Implementation Guidelines; tracks the spec edits closely. |
| shaka-player | Google | DASH, HLS (incl. LL variants), MSS | Single engine that normalizes manifests into an internal model and re-emits MSE segment requests per protocol. |
All three share the same architectural shape behind their public surface:
- Manifest parser — turns the playlist or MPD into an in-memory representation with periods, representations, and segment timelines.
- ABR controller — owns rung selection, throughput estimation, and switch-down recovery (see the ABR section below).
- Stream controller / scheduler — decides what to fetch next based on the current buffer goal and ABR rung.
- Loader / fetch layer — issues the actual network requests, with retry, backoff, and CMSD parsing.
- Buffer controller — owns one `SourceBuffer` per track, schedules `appendBuffer` and eviction, and reacts to `QuotaExceededError`.
- DRM controller — wraps the EME handshake described above and routes license challenges to the configured server.
- Text / caption renderer — for WebVTT, IMSC, or CEA-608/708 tracks the platform does not render natively.
Selection heuristic. If the catalog is HLS-only, hls.js is the smallest and fastest dependency. If it is DASH-only and the team needs to track new spec features as they land, dash.js. If both protocols are in scope, or if Cast / smart-TV reach matters, shaka-player; its single-engine model avoids two parallel ABR implementations diverging in production.
Architecting a Resilient Video Pipeline
Building a production-grade video streaming service requires robust system design. A modern video pipeline should be viewed as a high-throughput, real-time data pipeline with strict latency and availability requirements.
CDN: The Critical Infrastructure
A CDN is non-negotiable for any streaming service at scale:
Origin offload: Without a CDN, a live stream to 1 million concurrent viewers at 5 Mbps requires 5 Tbps of egress from origin. With a CDN, each edge PoP (Point of Presence) fetches each segment once, caches it, and serves thousands of viewers locally. Origin load drops to roughly PoPs × bitrate variants — orders of magnitude less than serving viewers directly.
Latency reduction: Physical distance determines minimum latency (~1ms per 200km for fiber). A global CDN with 100+ PoPs ensures most users are within 50ms of an edge server.
Manifest caching considerations: Live manifests change every few seconds. CDN TTL must be shorter than update frequency, or viewers see stale playlists. Common pattern: TTL = 1 second for live manifests, origin-controlled cache-control headers.
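As a sketch, an origin can express this split in its cache-control headers; here in illustrative nginx form (the location patterns and TTL values are assumptions, not a standard):

```nginx
# Illustrative origin config: short TTL for live manifests, long TTL for
# published segments, which are immutable once written.
location ~ \.m3u8$ {
  add_header Cache-Control "public, max-age=1";
}
location ~ \.(m4s|mp4)$ {
  add_header Cache-Control "public, max-age=31536000, immutable";
}
```

The asymmetry is the point: manifests must revalidate faster than they change, while segment URLs never change content and can be cached indefinitely.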
ABR Ladder Design
The bitrate ladder determines which quality levels are available, and the player’s ABR algorithm picks one ladder rung per segment based on a throughput estimate and the buffer level.
Poor ladder design causes:
- Wasted bandwidth: Steps too small for perceptible quality difference
- Unnecessary buffering: Steps too large, causing oscillation
- Poor mobile experience: No low-bitrate option for constrained networks
Example production ladder (1080p max):
| Resolution | Bitrate | Use Case |
|---|---|---|
| 1920×1080 | 8 Mbps | High-quality fixed connections |
| 1920×1080 | 5 Mbps | Good broadband |
| 1280×720 | 3 Mbps | Average broadband |
| 1280×720 | 1.8 Mbps | Mobile on good LTE |
| 854×480 | 1.1 Mbps | Mobile on average LTE |
| 640×360 | 600 kbps | Constrained mobile |
| 426×240 | 300 kbps | Edge case fallback |
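A minimal throughput-based selection over this ladder might look like the following sketch; the 0.7 safety factor is a common conservative choice, not a standard value:

```javascript
// Bitrates from the ladder above, sorted high to low.
const ladder = [8_000_000, 5_000_000, 3_000_000, 1_800_000, 1_100_000, 600_000, 300_000]

// Pick the highest rung that fits under the throughput estimate with a
// safety margin. Production ABR also weighs buffer level and switch history.
function pickRung(throughputBps, rungs = ladder, safety = 0.7) {
  const budget = throughputBps * safety
  for (const bitrate of rungs) {
    if (bitrate <= budget) return bitrate
  }
  return rungs[rungs.length - 1] // below the ladder: take the floor rung
}
```

The safety margin is what damps oscillation: without it, a rung chosen at exactly the measured throughput stalls on the first throughput dip.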
Per-title encoding: Advanced platforms analyze each video’s complexity and generate custom ladders. A static talking-head video needs lower bitrates than an action scene. Netflix’s per-title encoding reduced average bandwidth by ~20% at the same perceived quality compared with a fixed ladder, and shot-based / Dynamic Optimizer follow-ups improved that further (~28% for x264, ~34% for HEVC).8
Monitoring and Observability
Player-side metrics (client telemetry):
- Startup time (time to first frame)
- Rebuffering ratio (time buffering / time playing)
- Average bitrate
- Quality switches per minute
- Error rate by type
Server-side metrics:
- Origin request rate and latency
- CDN cache hit ratio
- Manifest generation latency
- Segment availability latency (time from encode to CDN edge)
Alerting thresholds (example):
- Rebuffering ratio > 1%: Investigate
- Startup time p95 > 3 seconds: Investigate
- CDN cache hit ratio < 90%: Check TTL configuration
- Origin 5xx rate > 0.1%: Incident
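A sketch of how the rebuffering-ratio threshold above could be evaluated over a batch of client session reports (the session field names are illustrative):

```javascript
// Sketch: aggregate client telemetry and apply the 1% rebuffering-ratio
// threshold from the list above.
function rebufferingRatio(bufferingMs, playingMs) {
  return playingMs === 0 ? 0 : bufferingMs / playingMs
}

function shouldInvestigate(sessions, threshold = 0.01) {
  let buffering = 0
  let playing = 0
  for (const s of sessions) {
    buffering += s.bufferingMs
    playing += s.playingMs
  }
  return rebufferingRatio(buffering, playing) > threshold
}
```

Aggregating before dividing (rather than averaging per-session ratios) keeps short sessions from dominating the metric.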
Common Production Failure Modes
Subtitle Synchronization Drift
Problem: WebVTT captions drift out of sync during playback.
Cause: The X-TIMESTAMP-MAP header maps WebVTT cue times to media timestamps. Different calculation bases for the offset (PTS vs. wall clock) cause drift. This works correctly with MPEG-TS but is problematic with fMP4.
Mitigation: Use EXT-X-PROGRAM-DATE-TIME for wall-clock synchronization instead of direct PTS mapping.
Manifest Update Race Conditions
Problem: Player requests a segment that doesn’t exist yet.
Cause: Aggressive manifest caching (CDN TTL too long) or clock skew between origin and player.
Mitigation:
- Set manifest TTL shorter than segment duration
- Include `EXT-X-PROGRAM-DATE-TIME` for clock synchronization
- Implement retry with exponential backoff for 404s on expected segments
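The retry-with-backoff mitigation can be sketched as a deterministic delay schedule; production code would add jitter and a total deadline, and all constants here are illustrative:

```javascript
// Sketch: exponential backoff schedule for a 404 on a segment the
// manifest says should exist. Doubles the base delay per attempt,
// capped at maxMs.
function backoffDelaysMs(attempts = 4, baseMs = 250, maxMs = 4000) {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, maxMs))
}
```

Keeping the first retry short matters here: a 404 caused by clock skew usually resolves within one segment duration.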
DRM License Renewal Failures
Problem: Playback stops mid-stream when license expires.
Cause: License expiration not handled, or renewal request blocked by ad blocker.
Mitigation:
- Request license renewal before expiration (typically at 80% of license duration)
- Implement graceful degradation (continue playing cached content while renewing)
- Monitor license acquisition failures as a key metric
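The 80% renewal rule reduces to a one-line schedule computation; a sketch, with illustrative names:

```javascript
// Sketch: schedule renewal at 80% of the license lifetime, per the
// mitigation above. Timestamps are milliseconds since epoch.
function renewalTime(issuedAt, expiresAt, fraction = 0.8) {
  return issuedAt + (expiresAt - issuedAt) * fraction
}
```

A player would then arm a timer for `renewalTime(issuedAt, expiresAt) - Date.now()` and fall into the backoff path on renewal failure while cached content keeps playing.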
Multi-Device Synchronization Drift
Problem: Multiple devices watching the same “live” stream diverge over time.
Cause: Clock drift, variable network latency, different buffer depths.
Mitigation:
- Use `EXT-X-PROGRAM-DATE-TIME` as sync anchor
- Implement periodic re-sync (every 30 seconds) with allowed drift tolerance
- Consider server-side time signaling (CMSD in DASH)
Platform Integration: Picture-in-Picture and Remote Playback
The pipeline above stops at the <video> element. Two W3C APIs extend playback off that surface — to a floating window and to an external display — and both are part of the playback contract a production player owns.
Picture-in-Picture
The W3C Picture-in-Picture API (Working Draft, Media Working Group) extends HTMLVideoElement with requestPictureInPicture(), enterpictureinpicture / leavepictureinpicture events, and the boolean disablePictureInPicture attribute. The site asks for PiP; the user agent decides whether and how to grant a floating, always-on-top window.
```javascript
video.addEventListener("enterpictureinpicture", (event) => {
  // event.pictureInPictureWindow exposes width/height for resizing
})

if (document.pictureInPictureEnabled) {
  await video.requestPictureInPicture()
}
```

Rules that bite in production:
- The call must originate from a user gesture — a click handler, not a background timer.
- Only one PiP window per document. Calling `requestPictureInPicture()` again returns the same window.
- Setting `disablePictureInPicture` is the right way to suppress the system PiP affordance for live ad creatives or DRM content where PiP would break the policy.
Remote Playback API
The W3C Remote Playback API (Candidate Recommendation Draft, Second Screen Working Group) extends HTMLMediaElement with a remote attribute (a RemotePlayback object) and a disableRemotePlayback attribute. It abstracts AirPlay, Chromecast (when surfaced through this API), and other vendor-specific cast affordances under a single observable surface:
```javascript
video.remote.watchAvailability((available) => {
  castButton.hidden = !available
})

castButton.addEventListener("click", () => {
  video.remote.prompt() // user agent picks the device
})
```

Two interactions matter for the rest of the pipeline:

- `ManagedMediaSource` (the eviction-aware MSE variant Safari shipped in 2023) only enters the `open` state if the page either provides an HLS AirPlay alternative or sets `video.disableRemotePlayback = true`. Players that rely on `ManagedMediaSource` must make this choice explicitly.7
- DRM policy travels with the stream, not the surface. A license issued for the local CDM is not automatically valid on the remote receiver; production players either re-acquire on remote start or disable remote playback for protected content via `disableRemotePlayback`.
Conclusion
Video streaming architecture is a study in layered abstractions, each solving a specific problem:
- Codecs solve compression (H.264 for compatibility, AV1 for efficiency)
- Containers solve addressability (CMAF unifies HLS/DASH delivery)
- Manifests solve discovery (client-driven quality selection)
- DRM solves content protection (CENC enables multi-platform encryption)
- Low-latency extensions solve real-time delivery (LL-HLS/LL-DASH for HTTP, WebRTC for UDP)
The key architectural insight: the client drives everything. The server provides options (quality variants, segments, licenses); the client makes decisions based on local conditions (bandwidth, buffer, capability). This stateless server model enables CDN scaling—every request is independent.
Production systems must support the full matrix: multiple codecs × multiple protocols × multiple DRM systems × multiple latency profiles. CMAF and CENC make this tractable by sharing the encoded content and encryption. Only manifests and license servers differ per platform.
The future is hybrid: AV1 for bandwidth efficiency where hardware supports it, H.264 as universal fallback; LL-HLS for scalable low-latency, WebRTC for interactive applications; hardware DRM for premium content, software for broader reach. The winning architecture is the one that dynamically selects the right tool for each user’s context.
Appendix
Prerequisites
- Understanding of HTTP request/response model
- Familiarity with video concepts (resolution, bitrate, frame rate)
- Basic knowledge of encryption (symmetric encryption, key exchange)
- Understanding of CDN caching principles
Terminology
| Term | Definition |
|---|---|
| ABR | Adaptive Bitrate Streaming—dynamically selecting quality based on conditions |
| CDM | Content Decryption Module—browser component that handles DRM decryption |
| CMAF | Common Media Application Format—standardized fMP4 for HLS/DASH |
| CENC | Common Encryption—standard for multi-DRM file encryption |
| CTU | Coding Tree Unit—basic processing unit in HEVC (up to 64×64 pixels) |
| EME | Encrypted Media Extensions—W3C API connecting JavaScript to CDM |
| fMP4 | Fragmented MP4—streaming-optimized MP4 with separate init and media segments |
| GOP | Group of Pictures—sequence of frames from one I-frame to the next |
| I-frame | Intra-frame—independently decodable frame (keyframe) |
| ISOBMFF | ISO Base Media File Format—foundation for MP4/fMP4 containers |
| LL-HLS | Low-Latency HLS—Apple’s extension for 2-4 second latency |
| MPD | Media Presentation Description—DASH’s XML manifest format |
| MSE | Media Source Extensions—W3C API for feeding media to HTMLMediaElement |
| NAL | Network Abstraction Layer—framing structure in H.264/HEVC bitstreams |
| PoP | Point of Presence—CDN edge location |
| PSSH | Protection System Specific Header—DRM metadata in MP4 files |
| PTS | Presentation Timestamp—when a frame should be displayed |
| SFU | Selective Forwarding Unit—WebRTC server that routes without transcoding |
| TEE | Trusted Execution Environment—hardware-secured processing area |
Summary
- Codecs trade compression efficiency vs. hardware support. H.264 is universal; AV1 offers ~30% better compression than HEVC but historically depended on hardware decode (Apple A17 Pro / M3 in 2023; ~88% of large-screen Netflix-certified devices since 2021), with software AV1 decode now rolled out to mobile devices that lack it.
- CMAF unifies HLS and DASH delivery. Single encoded segments, different manifests. Reduces storage and improves CDN cache efficiency.
- HLS dominates due to Apple ecosystem. Safari/iOS require HLS; most services support both HLS and DASH.
- DRM is platform-fragmented. Widevine (Google), FairPlay (Apple), PlayReady (Microsoft). CENC enables single encryption for all three.
- Low-latency is achievable with HTTP. LL-HLS/LL-DASH achieve 2-4 seconds via partial segments and blocking playlist reload.
- WebRTC is for true real-time. Sub-500ms latency via UDP transport, but requires SFU infrastructure and doesn’t scale like CDNs.
References
Specifications:
- RFC 8216 - HTTP Live Streaming - HLS protocol specification
- draft-pantos-hls-rfc8216bis - HLS protocol version 13 (in progress)
- ISO/IEC 23009-1:2022 - MPEG-DASH - DASH specification (5th edition)
- ISO/IEC 23000-19:2024 - CMAF - Common Media Application Format
- ISO/IEC 23001-7:2023 - CENC - Common Encryption standard
- ITU-T H.264 - Advanced Video Coding specification
- ITU-T H.265 - High Efficiency Video Coding specification
- RFC 6716 - Opus Audio Codec - Opus codec specification
- W3C Media Source Extensions - MSE API specification
- W3C Encrypted Media Extensions - EME API specification
- W3C WebRTC - WebRTC API specification
- W3C Picture-in-Picture - Picture-in-Picture API for `HTMLVideoElement`
- W3C Remote Playback API - Remote Playback API for `HTMLMediaElement`
- WHATWG HTML — The `video` element - HTML media element specification
Official Documentation:
- Apple HLS Authoring Specification - Apple’s HLS requirements
- Apple Low-Latency HLS - LL-HLS implementation guide
- DASH-IF Implementation Guidelines - DASH interoperability guidelines
- DASH-IF Timing Model - DASH timing and synchronization
- Widevine DRM - Google’s DRM documentation
- FairPlay Streaming - Apple’s DRM documentation
- MDN Media Source Extensions API - MSE developer reference
Tools:
- FFmpeg Documentation - Video processing toolkit
- Shaka Player - Open-source DASH/HLS player
- hls.js - Open-source HLS player for MSE
- dash.js - Reference DASH player
Footnotes
- Video Streaming with the AV1 Video Codec in Mobile Devices (Meta + YouTube white paper, Sept 2025).
- AV1 — Now Powering 30% of Netflix Streaming (Netflix Tech Blog, Dec 2025).
- HLS Authoring Specification for Apple Devices — Apple Developer.
- Enabling Low-Latency HTTP Live Streaming (HLS) — Apple Developer.
- ISO/IEC FDIS 23009-1 — Information technology — Dynamic adaptive streaming over HTTP (DASH) — Part 1.
- Per-Title Encode Optimization — Netflix Tech Blog (Dec 2015). Later iterations are summarized by Netflix in Optimized Shot-Based Encodes (2018).