Image Processing Service Design: CDN, Transforms, and APIs

This document presents the architectural design for a cloud-agnostic, multi-tenant image processing platform that provides on-the-fly transformations with enterprise-grade security, performance, and cost optimization. The platform supports hierarchical multi-tenancy (Organization → Tenant → Space), public and private image delivery, and deployment across AWS, GCP, Azure, or on-premise infrastructure. Key capabilities include deterministic transformation caching for sub-second delivery, HMAC-SHA256 signed URLs for secure private access, CDN (Content Delivery Network) integration for global edge caching, and a “transform-once-serve-forever” approach that minimizes processing costs while returning HTTP 200 on first-time transformation requests that fit the synchronous path (images under 5 MB).

Architecture overview: clients hit the CDN; misses fall through to the Image Gateway, which orchestrates the Transform Engine, Redis cache, registry, and object storage.

Abstract

Image processing at scale requires balancing three competing concerns: latency (users expect sub-second delivery), cost (processing and storage grow with traffic), and correctness (transformations must be deterministic and secure). This architecture resolves these tensions through a layered caching strategy with content-addressed storage.

Cache funnel: a typical multi-layer hierarchy converts ~95% CDN hits → ~80% Redis hits on misses → ~90% DB-index hits on the remainder. Multiplying the miss rates (5% × 20% × 10%) leaves roughly 0.1% of requests to actually transform.
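To make the funnel arithmetic concrete, the sketch below multiplies the miss rate of each layer. The hit rates are the illustrative figures from the funnel, not measurements, and `transformFraction` is a hypothetical helper name.

```javascript
// Fraction of requests that miss every cache layer and reach the
// Transform Engine. Each layer only sees the misses of the layer before it.
function transformFraction(hitRates) {
  return hitRates.reduce((remaining, hitRate) => remaining * (1 - hitRate), 1)
}

// Illustrative hit rates: CDN, Redis, database index.
const fraction = transformFraction([0.95, 0.8, 0.9])
console.log(`${(fraction * 100).toFixed(2)}% of requests need a transform`)
```

Small changes at the edge dominate the result: dropping the CDN hit rate from 95% to 90% doubles the load on every layer behind it.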

Core mental model:

Content-addressed storage: Hash(original + operations) → unique derived asset. Same inputs always produce the same output, enabling infinite caching.
Synchronous-first with async fallback: Transform inline for < 5MB images (< 800ms). Queue larger images but return 202 with polling URL.
Efficiency locks, not safety locks: Redlock prevents duplicate processing but doesn’t guarantee mutual exclusion. If two transforms race, both succeed—we just store one.
Hierarchical policies: Organization → Tenant → Space inheritance. Override at any level. Enforce at every layer (API, database, CDN).
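As a minimal sketch of the content-addressed idea, the snippet below derives a key from the original's content hash, the canonical operations string, and the output format. The function name `derivedAssetKey` and the newline-joined input layout are assumptions made for illustration; the document only specifies Hash(original + operations).

```javascript
const crypto = require("node:crypto")

// Derives a deterministic key for a derived asset. Identical inputs always
// yield the same key, so a repeated request can be served from cache.
function derivedAssetKey(originalContentHash, canonicalOps, outputFormat) {
  return crypto
    .createHash("sha256")
    .update(`${originalContentHash}\n${canonicalOps}\n${outputFormat}`)
    .digest("hex")
}

const keyA = derivedAssetKey("abc123", "w_800-h_600-f_cover-q_85", "webp")
const keyB = derivedAssetKey("abc123", "w_800-h_600-f_cover-q_85", "webp")
const keyC = derivedAssetKey("abc123", "w_400-h_400-f_cover-q_85", "webp")
console.log(keyA === keyB, keyA === keyC)
```

Because the key includes the original's content hash, uploading a new version of an asset naturally produces new keys without touching the old derived assets.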

Technology selection rationale:

Component	Choice	Why
Image processor	Sharp 0.34.5 (libvips 8.17)	~26× faster than jimp, ~4-5× faster than ImageMagick, ~50 MB memory per worker
Distributed lock	Redlock	Sufficient for efficiency (not correctness); simpler than etcd/ZooKeeper
Formats	AVIF → WebP → JPEG (with auto-negotiation)	AVIF ~94.9% global support, ~50% smaller than JPEG; WebP ~97% support, 25-34% smaller than JPEG
Database	PostgreSQL + JSONB	Row-level security, flexible policy storage, proven at scale

System Overview

Core Capabilities

Multi-Tenancy Hierarchy
- Organization: Top-level tenant boundary
- Tenant: Logical partition within organization (brands, environments)
- Space: Project workspace containing assets
Image Access Models
- Public Images: Direct URL access with CDN caching
- Private Images: Cryptographically signed URLs with expiration
On-the-Fly Processing
- Real-time transformations (resize, crop, format, quality, effects)
- Named presets for common transformation patterns
- Automatic format optimization (WebP, AVIF)
- Guaranteed 200 response on the first transform request for images on the synchronous path (< 5MB)
Cloud-Agnostic Design
- Deployment to AWS, GCP, Azure, or on-premise
- Storage abstraction layer for portability
- Kubernetes-based orchestration
Performance & Cost Optimization
- Multi-layer caching (CDN → Redis → Database → Storage)
- Transform deduplication with content-addressed storage
- Lazy preset generation
- Storage lifecycle management

Component Naming

Core Services

Component	Name	Purpose
Entry point	Image Gateway	API gateway, routing, authentication
Transform service	Transform Engine	On-demand image processing
Upload handler	Asset Ingestion Service	Image upload and validation
Admin API	Control Plane API	Tenant management, configuration
Background jobs	Transform Workers	Async preset generation
Metadata store	Registry Service	Asset and transformation metadata
Storage layer	Object Store Adapter	Cloud-agnostic storage interface
CDN layer	Edge Cache	Global content delivery
URL signing	Signature Service	Private URL cryptographic signing

Data Entities

Entity	Name	Description
Uploaded file	Asset	Original uploaded image
Processed variant	Derived Asset	Transformed image
Named transform	Preset	Reusable transformation template
Transform result	Variant	Cached transformation output

Architecture Principles

1. Cloud Portability First

Storage Abstraction: Unified interface for S3, GCS, Azure Blob, MinIO
Queue Abstraction: Support for SQS, Pub/Sub, Service Bus, RabbitMQ
Kubernetes Native: Deploy consistently across clouds
No Vendor Lock-in: Use open standards where possible
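One way to picture the storage abstraction is a small adapter contract with interchangeable backends. The sketch below is an assumption about its shape: the method names `put`, `get`, and `delete`, the `provider` field, and the in-memory backend are hypothetical and stand in for real S3, GCS, Azure Blob, or MinIO clients.

```javascript
// In-memory backend implementing the assumed adapter contract.
// A real backend would wrap a cloud SDK behind the same three methods.
class InMemoryObjectStore {
  constructor() {
    this.provider = "memory"
    this.objects = new Map()
  }

  async put(key, buffer, contentType) {
    this.objects.set(key, { buffer: Buffer.from(buffer), contentType })
  }

  async get(key) {
    const entry = this.objects.get(key)
    if (!entry) throw new Error(`Object not found: ${key}`)
    return entry.buffer
  }

  async delete(key) {
    this.objects.delete(key)
  }
}

// Registry of backend factories; callers select one by provider name.
const adapters = new Map([["memory", () => new InMemoryObjectStore()]])

function createObjectStore(provider) {
  const factory = adapters.get(provider)
  if (!factory) throw new Error(`Unsupported storage provider: ${provider}`)
  return factory()
}
```

Services that depend only on this contract can move between providers by changing configuration rather than code.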

2. Performance SLA

Edge Hit: < 50ms (CDN cache)
Origin Hit: < 200ms (application cache)
First Transform: < 800ms (sync processing for images < 5MB)
Always Return 200 on the synchronous path: no redirect, and no 202 unless the image exceeds the 5MB limit and falls back to the async queue

3. Transform Once, Serve Forever

Content-addressed transformation storage
Idempotent processing with distributed locking
Permanent caching with invalidation API
Deduplication across requests

4. Security by Default

Signed URLs for private content
Row-level tenancy isolation
Encryption at rest and in transit
Comprehensive audit logging

5. Cost Optimization

Multi-layer caching to reduce processing
Storage lifecycle automation
Format optimization (WebP/AVIF)
Rate limiting and resource quotas

Technology Stack

Core Technologies

Image Processing Library

Technology	Pros	Cons	Recommendation
Sharp (libvips)	26x faster than jimp, low memory (~50MB), modern formats	Linux-focused build	✅ Recommended
ImageMagick	Feature-rich, mature	4-5x slower than Sharp	Use for complex operations
Jimp	Pure JavaScript, portable	Very slow, limited formats	Development only

Choice: Sharp 0.34.5 (released November 2025; bundles libvips 8.17.3) for primary processing.

Why Sharp over alternatives (numbers from Sharp’s published benchmark, 2725×2225 → 720×588 JPEG, libvips caching disabled):

Throughput: 64.42 ops/sec on AMD64, 49.20 ops/sec on ARM64 — about 26× jimp and several times faster than ImageMagick on the same hardware. Production throughput is typically higher because libvips caching is enabled by default.
Memory efficiency: Streaming + memory-mapped I/O. Sharp’s install docs recommend setting MALLOC_ARENA_MAX="2" on Linux/glibc to keep RSS bounded under heavy concurrency.
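Modern format support: JPEG (mozjpeg), PNG, WebP, AVIF, GIF, and TIFF out of the box. HEIC (HEVC-encoded HEIF) is not covered by Sharp’s prebuilt binaries and requires a custom libvips build.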

Note

libvips 8.18 (December 2025) added UltraHDR output (via Google’s libultrahdr), Camera RAW input via libraw, an Oklab colorspace, and BigTIFF (>4 GB) output. Those features ship through Sharp starting with the 0.35.x line (Sharp 0.35.0-rc.0 was published 2 January 2026). If you need them today, pin to a 0.35 prerelease and validate before promoting.

npm install sharp

Caching Layer

Technology	Use Case	Pros	Cons	Recommendation
Redis	Application cache, locks	Fast, pub/sub, clustering	Memory cost	✅ Primary cache
Memcached	Simple KV cache	Faster for simple gets	No persistence, limited data types	Skip
Hazelcast	Distributed cache	Java ecosystem, compute	Complexity	Skip for Node.js

Choice: Redis (6+ with Redis Cluster for HA)

npm install ioredis

Storage Clients

Provider	Library	Notes
AWS S3	`@aws-sdk/client-s3`	Official v3 SDK
Google Cloud Storage	`@google-cloud/storage`	Official SDK
Azure Blob	`@azure/storage-blob`	Official SDK
MinIO (on-prem)	`minio` or S3 SDK	S3-compatible

npm install @aws-sdk/client-s3 @google-cloud/storage @azure/storage-blob minio

Message Queue

Provider	Library	Use Case
AWS SQS	`@aws-sdk/client-sqs`	AWS deployments
GCP Pub/Sub	`@google-cloud/pubsub`	GCP deployments
Azure Service Bus	`@azure/service-bus`	Azure deployments
RabbitMQ	`amqplib`	On-premise, multi-cloud

Choice: Provider-specific for cloud, RabbitMQ for on-premise

npm install amqplib

Web Framework

Framework	Pros	Cons	Recommendation
Fastify	Fast, low overhead, TypeScript support	Less mature ecosystem	✅ Recommended
Express	Mature, large ecosystem	Slower, callback-based	Acceptable
Koa	Modern, async/await	Smaller ecosystem	Acceptable

Choice: Fastify for performance

npm install fastify @fastify/multipart @fastify/cors

Database

Technology	Pros	Cons	Recommendation
PostgreSQL	JSONB, full-text search, reliability	Complex clustering	✅ Recommended
MySQL	Mature, simple	Limited JSON support	Acceptable
MongoDB	Flexible schema	Tenancy complexity	Not recommended

Choice: PostgreSQL 15+ with JSONB for policies

npm install pg

URL Signing

Library	Algorithm	Recommendation
Node crypto (built-in)	HMAC-SHA256	✅ Recommended
`jsonwebtoken`	JWT (HMAC/RSA)	Use for JWT tokens
`tweetnacl`	Ed25519	Use for EdDSA

Choice: Built-in crypto module for HMAC-SHA256 signatures

import crypto from "crypto"

Distributed Locking

Technology	Pros	Cons	Recommendation
Redlock (Redis)	Simple, Redis-based	No fencing tokens, clock skew risk	✅ For efficiency only
etcd	Linearizable, fencing tokens	Separate service, higher latency	Safety-critical use
ZooKeeper	Strong consistency, mature	Complex operations, JVM dependency	Safety-critical use
Database locks	Simple, transactional	Contention, less scalable	Development only

Choice: Redlock with Redis for transform deduplication (efficiency), not for safety-critical mutual exclusion.

Why Redlock is sufficient here:

The image service uses locks to prevent duplicate work, not to prevent data corruption. If two workers race past the lock:

Both fetch the original image
Both apply the same transformation (deterministic)
Both attempt to store the result
One wins (upsert semantics), the other’s write is a no-op

This is inefficient (wasted compute) but not incorrect. The content-addressed storage ensures idempotency.
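The race described above can be simulated with an in-memory stand-in for the registry. This is a sketch under stated assumptions: `storeIfAbsent` and `transformAndStore` are hypothetical names, and the uppercase conversion stands in for a deterministic image transform.

```javascript
// In-memory stand-in for the registry: the first write for a key wins,
// later writes for the same key are no-ops (upsert semantics).
const registry = new Map()

function storeIfAbsent(key, value) {
  if (registry.has(key)) return false
  registry.set(key, value)
  return true
}

// Deterministic "transform": the same input always yields the same output.
async function transformAndStore(worker, key, input) {
  await new Promise((resolve) => setTimeout(resolve, 5)) // simulate work
  const output = input.toUpperCase()
  return { worker, stored: storeIfAbsent(key, output) }
}

// Two workers race past the lock and process the same request.
const race = Promise.all([
  transformAndStore("worker-a", "asset1:opsHash", "pixels"),
  transformAndStore("worker-b", "asset1:opsHash", "pixels"),
])

race.then((results) => console.log(results, registry.size))
```

Both workers complete without error, one write is kept, and the stored value is the same regardless of which worker wins.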

Why Redlock is insufficient for safety-critical scenarios (per Martin Kleppmann’s analysis):

No fencing tokens: Cannot generate monotonically increasing tokens to detect stale lock holders after process pauses/GC stops
Timing assumptions: Depends on bounded network delays and clock accuracy that frequently break in practice
Clock vulnerabilities: Uses gettimeofday() (not monotonic); NTP adjustments can cause time jumps

Redis’s current recommendation (from official docs): Use N=5 Redis masters with majority voting, implement fencing tokens separately if correctness matters, monitor clock drift.
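For contrast, the sketch below shows the fencing-token idea that Redlock lacks. It assumes tokens are issued in increasing order by some linearizable lock service, which is outside this example; `FencedStore` is a hypothetical class, not part of any library named in this document.

```javascript
// Storage that remembers the highest fencing token seen per resource and
// rejects writes that carry an older token.
class FencedStore {
  constructor() {
    this.highestToken = new Map()
    this.values = new Map()
  }

  write(resource, token, value) {
    const seen = this.highestToken.get(resource) ?? -1
    if (token < seen) return false // stale lock holder, reject the write
    this.highestToken.set(resource, token)
    this.values.set(resource, value)
    return true
  }
}

const fencedStore = new FencedStore()
// The holder of token 34 writes first; the paused holder of token 33
// wakes up later and its write is rejected.
const acceptedWrite = fencedStore.write("asset1", 34, "from token 34")
const staleWrite = fencedStore.write("asset1", 33, "from token 33")
console.log(acceptedWrite, staleWrite)
```

The image service does not need this because its writes are idempotent, but any workflow where a stale write would corrupt data does.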

1npm install redlock

High-Level Architecture

System Diagram

Service-level system architecture: Image Gateway, Transform Engine, Asset Ingestion, Control Plane, Workers, registry, cache, queue, and storage adapter, fronted by a CDN.

Request Flow: Public Image

Public image request flow showing CDN hit, Redis hit, registry lookup, and first-time transform branches with their latency budgets.

Request Flow: Private Image

Private image flow: client requests a signed URL, then fetches via the CDN. Signature verification happens at the edge when the CDN supports edge compute, otherwise the gateway verifies on origin.

Edge Authentication Patterns

Modern CDNs support signature validation at the edge, eliminating origin round-trips for private content. This section covers three deployment patterns with different security/complexity tradeoffs.

Pattern 1: Origin-based validation (simplest)

All requests hit the origin, which validates signatures. The CDN caches responses keyed by the full URL including signature parameters.

Pros: Simple deployment, no edge configuration
Cons: Every unique signed URL generates a cache miss, origin must handle all validation
When to use: Low traffic, simple deployments, or when CDN doesn’t support edge compute

Pattern 2: Edge signature validation with normalized cache keys

The edge validates the signature, then strips signature parameters before checking the cache. This allows multiple signed URLs for the same content to share a single cache entry.

// Cloudflare Worker for edge signature validation.
// Workers run on V8 isolates rather than per-request containers. Cloudflare
// reports ~5 ms cold-start and per-isolate memory in the low single-digit MB
// (vs. ~35 MB for a typical Node.js Lambda). See:
// https://blog.cloudflare.com/cloud-computing-without-containers/
//
// createCanonicalString, computeHmac, and timingSafeEqual are helpers
// assumed to be defined elsewhere in the Worker.

export default {
  async fetch(request, env) {
    const url = new URL(request.url)

    // Extract and validate signature
    const sig = url.searchParams.get("sig")
    const exp = url.searchParams.get("exp")
    const kid = url.searchParams.get("kid")

    if (!sig || !exp || !kid) {
      return new Response("Missing signature", { status: 401 })
    }

    // Check expiration (a non-numeric exp is rejected as well)
    const expiresAt = Number.parseInt(exp, 10)
    if (Number.isNaN(expiresAt) || Date.now() / 1000 > expiresAt) {
      return new Response("Signature expired", { status: 401 })
    }

    // Validate HMAC (key fetched from Workers KV or secrets)
    const key = await env.SIGNING_KEYS.get(kid)
    if (!key) {
      return new Response("Invalid key", { status: 401 })
    }

    // Reconstruct canonical string and verify
    const canonical = createCanonicalString(url.pathname, exp, url.hostname)
    const expected = await computeHmac(key, canonical)

    if (!timingSafeEqual(sig, expected)) {
      return new Response("Invalid signature", { status: 401 })
    }

    // Strip signature params for cache key normalization
    url.searchParams.delete("sig")
    url.searchParams.delete("exp")
    url.searchParams.delete("kid")

    // Fetch from origin/cache with normalized URL
    return fetch(url.toString(), {
      cf: { cacheKey: url.toString() }, // Normalized cache key
    })
  },
}

Pros: High cache efficiency, reduced origin load, sub-5ms auth latency
Cons: Requires edge compute (Cloudflare Workers, CloudFront Functions, Fastly Compute)
When to use: High-traffic private content, latency-sensitive applications
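The Worker in Pattern 2 calls `createCanonicalString`, `computeHmac`, and `timingSafeEqual` without defining them. The sketch below is one possible shape for those helpers using the Web Crypto API. The canonical string layout (path, expiry, and host joined by newlines) is an assumption, and the `require` line exists only so the sketch runs under Node.js; a Worker would use the global `crypto` object instead.

```javascript
const { webcrypto } = require("node:crypto") // in a Worker: use global `crypto`

const encoder = new TextEncoder()

// Assumed canonical form: path, expiry, and host joined by newlines.
function createCanonicalString(pathname, exp, hostname) {
  return [pathname, exp, hostname].join("\n")
}

// HMAC-SHA256 over the canonical string, encoded as unpadded Base64URL.
async function computeHmac(secret, canonical) {
  const key = await webcrypto.subtle.importKey(
    "raw",
    encoder.encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  )
  const mac = await webcrypto.subtle.sign("HMAC", key, encoder.encode(canonical))
  let binary = ""
  for (const byte of new Uint8Array(mac)) binary += String.fromCharCode(byte)
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "")
}

// Compares two strings without exiting early on the first differing byte.
function timingSafeEqual(a, b) {
  const left = encoder.encode(a)
  const right = encoder.encode(b)
  if (left.length !== right.length) return false
  let diff = 0
  for (let i = 0; i < left.length; i++) diff |= left[i] ^ right[i]
  return diff === 0
}
```

The signing side must build the canonical string in exactly the same way, otherwise every signature fails verification at the edge.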

Pattern 3: JWT tokens with edge validation

Use JWT (JSON Web Token) instead of HMAC signatures. The edge can decode and validate JWTs without origin contact, and the token can carry claims (user ID, tenant ID, allowed operations).

Pros: Self-contained tokens with embedded claims, standard format
Cons: Larger URLs, no revocation without short expiry or edge-stored blocklist
When to use: When tokens need to carry user context, or when integrating with existing JWT infrastructure
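As a rough illustration of Pattern 3, the sketch below signs and verifies a compact HS256 JWT with Node's built-in crypto module. It assumes a Node.js runtime (for example Lambda@Edge or the origin), and the claim names `tenant` and `ops` are hypothetical. A production deployment would more likely rely on a maintained JWT library.

```javascript
const crypto = require("node:crypto")

const b64url = (input) => Buffer.from(input).toString("base64url")

// Signs a compact HS256 JWT; used here only to produce a token for the demo.
function signJwt(claims, secret) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }))
  const payload = b64url(JSON.stringify(claims))
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${payload}`)
    .digest("base64url")
  return `${header}.${payload}.${signature}`
}

// Verifies signature, algorithm, and expiry. Returns the claims or null.
function verifyJwt(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split(".")
  if (parts.length !== 3) return null
  const [header, payload, signature] = parts

  const expected = crypto.createHmac("sha256", secret).update(`${header}.${payload}`).digest()
  const actual = Buffer.from(signature, "base64url")
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    return null
  }

  try {
    const { alg } = JSON.parse(Buffer.from(header, "base64url").toString("utf8"))
    if (alg !== "HS256") return null
    const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"))
    if (typeof claims.exp !== "number" || claims.exp <= nowSeconds) return null
    return claims
  } catch {
    return null
  }
}

const demoToken = signJwt({ tenant: "acme", ops: ["resize"], exp: 2000 }, "demo-secret")
console.log(verifyJwt(demoToken, "demo-secret", 1000))
```

The expiry check is what bounds the revocation problem noted in the cons: a leaked token stays valid until `exp` unless the edge also consults a blocklist.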

CDN-specific implementation notes:

CDN	Edge Auth Capability	Cache Key Normalization
Cloudflare	Workers (full JS), Rules (limited)	`cf.cacheKey` in Workers
CloudFront	Functions (limited JS), Lambda@Edge (full Node.js)	`cache-policy` with query keys
Fastly	Compute@Edge (Rust/JS/Go), VCL	`req.hash` manipulation in VCL
Akamai	EdgeWorkers (JS), Property Manager	Cache ID modification

Data Models

Database Schema

-- Organizations (Top-level tenants)
CREATE TABLE organizations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    slug VARCHAR(100) UNIQUE NOT NULL,
    name VARCHAR(255) NOT NULL,
    status VARCHAR(20) DEFAULT 'active',

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ NULL
);

-- Tenants (Optional subdivision within org)
CREATE TABLE tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    name VARCHAR(255) NOT NULL,
    status VARCHAR(20) DEFAULT 'active',

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ NULL,

    UNIQUE(organization_id, slug)
);

-- Spaces (Projects within tenant)
CREATE TABLE spaces (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    slug VARCHAR(100) NOT NULL,
    name VARCHAR(255) NOT NULL,

    -- Default policies (inherit from tenant/org if NULL)
    default_access VARCHAR(20) DEFAULT 'private', -- 'public' or 'private'

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ NULL,

    UNIQUE(tenant_id, slug),
    CONSTRAINT valid_access CHECK (default_access IN ('public', 'private'))
);

-- Policies (Hierarchical configuration)
CREATE TABLE policies (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

    -- Scope (org, tenant, or space)
    scope_type VARCHAR(20) NOT NULL, -- 'organization', 'tenant', 'space'
    scope_id UUID NOT NULL,

    -- Policy data
    key VARCHAR(100) NOT NULL,
    value JSONB NOT NULL,

    -- Metadata
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(scope_type, scope_id, key),
    CONSTRAINT valid_scope_type CHECK (scope_type IN ('organization', 'tenant', 'space'))
);

-- API Keys for authentication
CREATE TABLE api_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    tenant_id UUID REFERENCES tenants(id) ON DELETE CASCADE,

    -- Key identity
    key_id VARCHAR(50) UNIQUE NOT NULL, -- kid for rotation
    name VARCHAR(255) NOT NULL,
    secret_hash VARCHAR(255) NOT NULL, -- bcrypt/argon2

    -- Permissions
    scopes TEXT[] DEFAULT ARRAY['image:read']::TEXT[],

    -- Status
    status VARCHAR(20) DEFAULT 'active',
    expires_at TIMESTAMPTZ NULL,
    last_used_at TIMESTAMPTZ NULL,

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    rotated_at TIMESTAMPTZ NULL
);

-- Assets (Original uploaded images)
CREATE TABLE assets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    tenant_id UUID NOT NULL REFERENCES tenants(id) ON DELETE CASCADE,
    space_id UUID NOT NULL REFERENCES spaces(id) ON DELETE CASCADE,

    -- Versioning
    version INTEGER NOT NULL DEFAULT 1,

    -- File info
    filename VARCHAR(500) NOT NULL,
    original_filename VARCHAR(500) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,

    -- Storage
    storage_provider VARCHAR(50) NOT NULL, -- 'aws', 'gcp', 'azure', 'minio'
    storage_key VARCHAR(1000) NOT NULL UNIQUE,

    -- Content
    size_bytes BIGINT NOT NULL,
    content_hash VARCHAR(64) NOT NULL, -- SHA-256 for deduplication

    -- Image metadata
    width INTEGER,
    height INTEGER,
    format VARCHAR(10),
    color_space VARCHAR(20),
    has_alpha BOOLEAN,

    -- Organization
    tags TEXT[] DEFAULT ARRAY[]::TEXT[],
    folder VARCHAR(1000) DEFAULT '/',

    -- Access control
    access_policy VARCHAR(20) NOT NULL DEFAULT 'private',

    -- EXIF and metadata
    exif JSONB,

    -- Upload info
    uploaded_by UUID, -- Reference to user
    uploaded_at TIMESTAMPTZ DEFAULT NOW(),

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ NULL,

    CONSTRAINT valid_access_policy CHECK (access_policy IN ('public', 'private'))
);

-- Transformation Presets (Named transformation templates)
CREATE TABLE presets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    organization_id UUID NOT NULL REFERENCES organizations(id) ON DELETE CASCADE,
    tenant_id UUID REFERENCES tenants(id) ON DELETE CASCADE,
    space_id UUID REFERENCES spaces(id) ON DELETE CASCADE,

    -- Preset identity
    name VARCHAR(100) NOT NULL,
    slug VARCHAR(100) NOT NULL,
    description TEXT,

    -- Transformation definition
    operations JSONB NOT NULL,
    /*
    Example:
    {
        "resize": {"width": 800, "height": 600, "fit": "cover"},
        "format": "webp",
        "quality": 85,
        "sharpen": 1
    }
    */

    -- Auto-generation rules
    auto_generate BOOLEAN DEFAULT false,
    match_tags TEXT[] DEFAULT NULL,
    match_folders TEXT[] DEFAULT NULL,

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    -- tenant_id and space_id are nullable; NULLS NOT DISTINCT (PostgreSQL 15+)
    -- keeps slugs unique for org-level and tenant-level presets too
    UNIQUE NULLS NOT DISTINCT (organization_id, tenant_id, space_id, slug)
);

-- Derived Assets (Transformed images)
CREATE TABLE derived_assets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id UUID NOT NULL REFERENCES assets(id) ON DELETE CASCADE,

    -- Transformation identity
    operations_canonical VARCHAR(500) NOT NULL, -- Canonical string representation
    operations_hash VARCHAR(64) NOT NULL, -- SHA-256 of (canonical_ops + asset.content_hash)

    -- Output
    output_format VARCHAR(10) NOT NULL,

    -- Storage
    storage_provider VARCHAR(50) NOT NULL,
    storage_key VARCHAR(1000) NOT NULL UNIQUE,

    -- Content
    size_bytes BIGINT NOT NULL,
    content_hash VARCHAR(64) NOT NULL,

    -- Image metadata
    width INTEGER,
    height INTEGER,

    -- Performance tracking
    processing_time_ms INTEGER,
    access_count BIGINT DEFAULT 0,
    last_accessed_at TIMESTAMPTZ,

    -- Cache tier for lifecycle
    cache_tier VARCHAR(20) DEFAULT 'hot', -- 'hot', 'warm', 'cold'

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(asset_id, operations_hash)
);

-- Transform Cache (Fast lookup for existing transforms)
CREATE TABLE transform_cache (
    asset_id UUID NOT NULL REFERENCES assets(id) ON DELETE CASCADE,
    operations_hash VARCHAR(64) NOT NULL,
    derived_asset_id UUID NOT NULL REFERENCES derived_assets(id) ON DELETE CASCADE,

    -- Metadata
    created_at TIMESTAMPTZ DEFAULT NOW(),

    PRIMARY KEY(asset_id, operations_hash)
);

-- Usage tracking (for cost and analytics)
CREATE TABLE usage_metrics (
    id BIGSERIAL PRIMARY KEY,
    date DATE NOT NULL,
    organization_id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    space_id UUID NOT NULL,

    -- Metrics
    request_count BIGINT DEFAULT 0,
    bandwidth_bytes BIGINT DEFAULT 0,
    storage_bytes BIGINT DEFAULT 0,
    transform_count BIGINT DEFAULT 0,
    transform_cpu_ms BIGINT DEFAULT 0,

    UNIQUE(date, organization_id, tenant_id, space_id)
);

-- Audit logs
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    organization_id UUID NOT NULL,
    tenant_id UUID,

    -- Actor
    actor_type VARCHAR(20) NOT NULL, -- 'user', 'api_key', 'system'
    actor_id UUID NOT NULL,

    -- Action
    action VARCHAR(100) NOT NULL, -- 'asset.upload', 'asset.delete', etc.
    resource_type VARCHAR(50) NOT NULL,
    resource_id UUID,

    -- Context
    metadata JSONB,
    ip_address INET,
    user_agent TEXT,

    -- Timestamp
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for performance
CREATE INDEX idx_tenants_org ON tenants(organization_id);
CREATE INDEX idx_spaces_tenant ON spaces(tenant_id);
CREATE INDEX idx_spaces_org ON spaces(organization_id);
CREATE INDEX idx_policies_scope ON policies(scope_type, scope_id);

CREATE INDEX idx_assets_space ON assets(space_id) WHERE deleted_at IS NULL;
CREATE INDEX idx_assets_org ON assets(organization_id) WHERE deleted_at IS NULL;
CREATE INDEX idx_assets_hash ON assets(content_hash);
CREATE INDEX idx_assets_tags ON assets USING GIN(tags);
CREATE INDEX idx_assets_folder ON assets(folder);

CREATE INDEX idx_derived_asset ON derived_assets(asset_id);
CREATE INDEX idx_derived_hash ON derived_assets(operations_hash);
CREATE INDEX idx_derived_tier ON derived_assets(cache_tier);
CREATE INDEX idx_derived_access ON derived_assets(last_accessed_at);

CREATE INDEX idx_usage_date_org ON usage_metrics(date, organization_id);
CREATE INDEX idx_audit_org_time ON audit_logs(organization_id, created_at);
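The `policies` table stores one row per scope and key, so inheritance has to be resolved at read time. The sketch below shows one way to do that in application code, walking from the most specific scope to the least specific. `resolvePolicy`, the sample rows, and the `max_width` key are hypothetical; a real implementation might resolve this in SQL instead.

```javascript
// Resolves a policy key by walking space -> tenant -> organization and
// returning the most specific value that is set.
function resolvePolicy(policies, scope, key) {
  const chain = [
    ["space", scope.spaceId],
    ["tenant", scope.tenantId],
    ["organization", scope.organizationId],
  ]
  for (const [scopeType, scopeId] of chain) {
    const match = policies.find(
      (p) => p.scope_type === scopeType && p.scope_id === scopeId && p.key === key,
    )
    if (match) return match.value
  }
  return undefined // no level defines this key
}

// Sample rows shaped like the policies table.
const samplePolicies = [
  { scope_type: "organization", scope_id: "org-1", key: "max_width", value: 4096 },
  { scope_type: "tenant", scope_id: "ten-1", key: "max_width", value: 2048 },
]
const sampleScope = { organizationId: "org-1", tenantId: "ten-1", spaceId: "spc-1" }

// The tenant-level row overrides the organization default.
console.log(resolvePolicy(samplePolicies, sampleScope, "max_width"))
```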

URL Design

URL Structure Philosophy

URLs should be:

Self-describing: Clearly indicate access mode (public vs private)
Cacheable: CDN-friendly with stable cache keys
Deterministic: Same transformation = same URL
Human-readable: Easy to understand and debug

URL Patterns

Public Images

Format:
https://{cdn-domain}/v1/pub/{org}/{tenant}/{space}/img/{asset-id}/v{version}/{operations}.{ext}

Examples:
- Original:
  https://img.example.com/v1/pub/acme/website/marketing/img/01JBXYZ.../v1/original.jpg

- Resized:
  https://img.example.com/v1/pub/acme/website/marketing/img/01JBXYZ.../v1/w_800-h_600-f_cover.webp

- With preset:
  https://img.example.com/v1/pub/acme/website/marketing/img/01JBXYZ.../v1/preset_thumbnail.webp

- Format auto-negotiation:
  https://img.example.com/v1/pub/acme/website/marketing/img/01JBXYZ.../v1/w_1200-fmt_auto-q_auto.jpg

Private Images (Base URL)

Format:
https://{cdn-domain}/v1/priv/{org}/{tenant}/{space}/img/{asset-id}/v{version}/{operations}.{ext}

Example:
https://img.example.com/v1/priv/acme/internal/confidential/img/01JBXYZ.../v1/w_800-h_600.jpg
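A gateway has to split these paths back into their components. The sketch below does so with a single regular expression covering both the `pub` and `priv` prefixes. `parseImagePath` and the sample asset ID `asset123` are hypothetical, and the character rules for slugs and IDs are assumptions because the document does not define them.

```javascript
// Matches /v1/{pub|priv}/{org}/{tenant}/{space}/img/{asset-id}/v{version}/{operations}.{ext}
const IMAGE_PATH =
  /^\/v1\/(pub|priv)\/([^/]+)\/([^/]+)\/([^/]+)\/img\/([^/]+)\/v(\d+)\/([^/]+)\.([a-z0-9]+)$/

// Splits an image path into its components, or returns null if it does not match.
function parseImagePath(pathname) {
  const match = IMAGE_PATH.exec(pathname)
  if (!match) return null
  const [, access, org, tenant, space, assetId, version, operations, ext] = match
  return {
    access: access === "pub" ? "public" : "private",
    org,
    tenant,
    space,
    assetId,
    version: Number(version),
    operations,
    ext,
  }
}

console.log(parseImagePath("/v1/pub/acme/website/marketing/img/asset123/v1/w_800-h_600-f_cover.webp"))
```

Rejecting anything that does not match the pattern keeps malformed paths from reaching the Transform Engine.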

Private Images (Signed URL)

Format:
{base-url}?sig={signature}&exp={unix-timestamp}&kid={key-id}

Example:
https://img.example.com/v1/priv/acme/internal/confidential/img/01JBXYZ.../v1/w_800-h_600.jpg?sig=dGVzdHNpZ25hdHVyZQ&exp=1731427200&kid=key_123

Components:
- sig: Base64URL-encoded HMAC-SHA256 signature
- exp: Unix timestamp (seconds) when URL expires
- kid: Key ID for signature rotation support
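The sketch below shows how a signing service might produce and check these three parameters with Node's built-in crypto module. The canonical string (path, expiry, and host joined by newlines), the function names, and the `Map` of secrets keyed by `kid` are assumptions for illustration; the document does not fix those details at this point.

```javascript
const crypto = require("node:crypto")

// Assumed canonical form: path, expiry, and host joined by newlines.
function canonicalFor(url, exp) {
  return [url.pathname, exp, url.hostname].join("\n")
}

// Appends sig, exp, and kid to a private base URL.
function signUrl(baseUrl, secret, keyId, ttlSeconds, nowSeconds = Math.floor(Date.now() / 1000)) {
  const url = new URL(baseUrl)
  const exp = nowSeconds + ttlSeconds
  const sig = crypto.createHmac("sha256", secret).update(canonicalFor(url, exp)).digest("base64url")
  url.searchParams.set("sig", sig)
  url.searchParams.set("exp", String(exp))
  url.searchParams.set("kid", keyId)
  return url.toString()
}

// Returns true only for an unexpired URL signed with the secret named by kid.
function verifyUrl(signedUrl, secretsByKeyId, nowSeconds = Math.floor(Date.now() / 1000)) {
  const url = new URL(signedUrl)
  const sig = url.searchParams.get("sig")
  const exp = Number(url.searchParams.get("exp"))
  const secret = secretsByKeyId.get(url.searchParams.get("kid"))
  if (!sig || !secret || !Number.isInteger(exp) || exp <= nowSeconds) return false

  const expected = crypto.createHmac("sha256", secret).update(canonicalFor(url, exp)).digest()
  const actual = Buffer.from(sig, "base64url")
  return actual.length === expected.length && crypto.timingSafeEqual(actual, expected)
}

const secrets = new Map([["key_123", "demo-secret"]])
const signed = signUrl("https://img.example.com/v1/priv/acme/internal/confidential/img/asset123/v1/w_800-h_600.jpg", "demo-secret", "key_123", 60, 1000)
console.log(verifyUrl(signed, secrets, 1010))
```

Keeping several entries in the secrets map is what makes rotation possible: new URLs are signed with the newest `kid` while older URLs remain valid until they expire.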

Transformation Parameters

Operations are encoded as hyphen-separated key-value pairs:

Parameter Format: {key}_{value}

Supported Parameters:
- w_{pixels}         : Width (e.g., w_800)
- h_{pixels}         : Height (e.g., h_600)
- f_{mode}           : Fit mode - cover, contain, fill, inside, outside, pad
- q_{quality}        : Quality 1-100 or 'auto' (e.g., q_85)
- fmt_{format}       : Format - jpg, png, webp, avif, gif, 'auto'
- r_{degrees}        : Rotation - 90, 180, 270
- g_{gravity}        : Crop gravity - center, north, south, east, west, etc.
- b_{color}          : Background color for pad (e.g., b_ffffff)
- blur_{radius}      : Blur radius 0.3-1000 (e.g., blur_5)
- sharpen_{amount}   : Sharpen amount 0-10 (e.g., sharpen_2)
- bw                 : Convert to black & white (grayscale)
- flip               : Flip horizontal
- flop               : Flip vertical
- preset_{name}      : Apply named preset

Examples:
- w_800-h_600-f_cover-q_85
- w_400-h_400-f_contain-fmt_webp
- preset_thumbnail
- w_1200-sharpen_2-fmt_webp-q_90
- w_800-h_600-f_pad-b_ffffff
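A parser for this grammar could look like the sketch below. The long operation names (`width`, `grayscale`, and so on) are chosen to match the property names the Transform Engine code reads, but the function itself is a hypothetical illustration. It assumes preset names contain no hyphens and treats the literal `original` as "no operations".

```javascript
// URL prefixes that carry a value, mapped to long operation names.
const VALUE_PARAMS = new Map([
  ["w", "width"], ["h", "height"], ["f", "fit"], ["q", "quality"],
  ["fmt", "format"], ["r", "rotation"], ["g", "gravity"], ["b", "background"],
  ["blur", "blur"], ["sharpen", "sharpen"], ["preset", "preset"],
])
// Bare flags without a value.
const FLAG_PARAMS = new Map([["bw", "grayscale"], ["flip", "flip"], ["flop", "flop"]])
const NUMERIC = new Set(["width", "height", "quality", "rotation", "blur", "sharpen"])

function parseOperations(opsString) {
  const ops = {}
  if (opsString === "original") return ops

  for (const token of opsString.split("-")) {
    if (FLAG_PARAMS.has(token)) {
      ops[FLAG_PARAMS.get(token)] = true
      continue
    }
    const separator = token.indexOf("_")
    const name = separator > 0 ? VALUE_PARAMS.get(token.slice(0, separator)) : undefined
    const value = token.slice(separator + 1)
    if (!name || value === "") throw new Error(`Unsupported parameter: ${token}`)

    if (NUMERIC.has(name) && value !== "auto") {
      const number = Number(value)
      if (!Number.isFinite(number)) throw new Error(`Invalid number in: ${token}`)
      ops[name] = number
    } else {
      ops[name] = value
    }
  }
  return ops
}

console.log(parseOperations("w_800-h_600-f_cover-q_85"))
```

Range checks (for example quality 1-100 or rotation in 90-degree steps) are left out here and would belong in a validation step after parsing.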

Operation Canonicalization

To ensure cache hit consistency, operations must be canonicalized:

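/**
 * Canonicalizes transformation operations to ensure consistent cache keys.
 * Assumes parseOperations() returns long-form keys (width, height, format, ...).
 */
// Canonical order as [URL key, operation name]:
// fmt, w, h, f, g, b, q, r, sharpen, blur, bw, flip, flop
const CANONICAL_ORDER = [
  ["fmt", "format"],
  ["w", "width"],
  ["h", "height"],
  ["f", "fit"],
  ["g", "gravity"],
  ["b", "background"],
  ["q", "quality"],
  ["r", "rotation"],
  ["sharpen", "sharpen"],
  ["blur", "blur"],
  ["bw", "grayscale"],
  ["flip", "flip"],
  ["flop", "flop"],
]

function canonicalizeOperations(opsString) {
  const ops = parseOperations(opsString)

  // Apply defaults
  if (ops.quality === undefined && ops.format !== "png") ops.quality = 85
  if (!ops.fit && (ops.width || ops.height)) ops.fit = "cover"

  // Normalize values ("auto" quality passes through unchanged)
  if (typeof ops.quality === "number") ops.quality = Math.max(1, Math.min(100, ops.quality))
  if (ops.width) ops.width = Math.floor(ops.width)
  if (ops.height) ops.height = Math.floor(ops.height)

  // Emit set operations in canonical order; boolean flags have no value part
  return CANONICAL_ORDER
    .filter(([, name]) => ops[name] !== undefined && ops[name] !== false)
    .map(([key, name]) => (ops[name] === true ? key : `${key}_${ops[name]}`))
    .join("-")
}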

Core Request Flows

Upload Flow with Auto-Presets

Upload sequence: validate → SHA-256 → dedupe check → store original → enqueue auto-preset jobs. Workers transform asynchronously and populate cache.

Synchronous Transform Flow (Guaranteed 200)

Synchronous transform path: parse + validate + acquire lock, transform inline if needed, then return 200 from origin via the CDN within an 800 ms budget.

Image Processing Pipeline

Processing Implementation

import sharp from "sharp"
import crypto from "crypto"

// Maps an output format to its MIME type ("image/jpg" is not a registered type).
const mimeType = (format) => `image/${format === "jpg" ? "jpeg" : format}`

/**
 * Transform Engine - Core image processing service
 */
class TransformEngine {
  constructor(storage, registry, cache, lockManager) {
    this.storage = storage
    this.registry = registry
    this.cache = cache
    this.lockManager = lockManager
  }

  /**
   * Process image transformation with deduplication
   */
  async transform(assetId, operations, acceptHeader) {
    // 1. Canonicalize operations
    const canonicalOps = this.canonicalizeOps(operations)
    const outputFormat = this.determineFormat(operations.format, acceptHeader)

    // 2. Generate transformation hash (content-addressed)
    const asset = await this.registry.getAsset(assetId)
    const opsHash = this.generateOpsHash(canonicalOps, asset.contentHash, outputFormat)

    // 3. Check multi-layer cache
    const cacheKey = `transform:${assetId}:${opsHash}`

    // Layer 1: Redis cache
    const cached = await this.cache.get(cacheKey)
    if (cached) {
      return {
        buffer: Buffer.from(cached.buffer, "base64"),
        contentType: cached.contentType,
        fromCache: "redis",
      }
    }

    // Layer 2: Database + Storage
    const derived = await this.registry.getDerivedAsset(assetId, opsHash)
    if (derived) {
      const buffer = await this.storage.get(derived.storageKey)

      // Populate Redis cache
      await this.cache.set(
        cacheKey,
        {
          buffer: buffer.toString("base64"),
          contentType: mimeType(derived.outputFormat),
        },
        3600,
      ) // 1 hour TTL

      // Update access metrics
      await this.registry.incrementAccessCount(derived.id)

      return {
        buffer,
        contentType: mimeType(derived.outputFormat),
        fromCache: "storage",
      }
    }

    // Layer 3: Process new transformation (with distributed locking)
    const lockKey = `lock:transform:${assetId}:${opsHash}`
    const lock = await this.lockManager.acquire(lockKey, 60000) // 60s TTL

    try {
      // Double-check after acquiring lock
      const doubleCheck = await this.registry.getDerivedAsset(assetId, opsHash)
      if (doubleCheck) {
        const buffer = await this.storage.get(doubleCheck.storageKey)
        return {
          buffer,
          contentType: mimeType(doubleCheck.outputFormat),
          fromCache: "concurrent",
        }
      }

      // Process transformation
      const startTime = Date.now()

      // Fetch original
      const originalBuffer = await this.storage.get(asset.storageKey)

      // Apply transformations
      const processedBuffer = await this.applyTransformations(originalBuffer, canonicalOps, outputFormat)

      const processingTime = Date.now() - startTime

      // Get metadata of processed image
      const metadata = await sharp(processedBuffer).metadata()

      // Generate storage key
      const storageKey = `derived/${asset.organizationId}/${asset.tenantId}/${asset.spaceId}/${assetId}/v${asset.version}/${opsHash}.${outputFormat}`

      // Store processed image
      await this.storage.put(storageKey, processedBuffer, mimeType(outputFormat))

      // Compute content hash
      const contentHash = crypto.createHash("sha256").update(processedBuffer).digest("hex")

      // Save to database
      const derivedAsset = await this.registry.createDerivedAsset({
        assetId,
        operationsCanonical: canonicalOps,
        operationsHash: opsHash,
        outputFormat,
        storageProvider: this.storage.provider,
        storageKey,
        sizeBytes: processedBuffer.length,
        contentHash,
        width: metadata.width,
        height: metadata.height,
        processingTimeMs: processingTime,
      })

      // Update transform cache index
      await this.registry.cacheTransform(assetId, opsHash, derivedAsset.id)

      // Populate Redis cache
      await this.cache.set(
        cacheKey,
        {
          buffer: processedBuffer.toString("base64"),
          contentType: mimeType(outputFormat),
        },
        3600,
      )

      return {
        buffer: processedBuffer,
        contentType: mimeType(outputFormat),
        fromCache: "none",
        processingTime,
      }
    } finally {
      await lock.release()
    }
  }

  /**
   * Apply transformations using Sharp
   */
  async applyTransformations(inputBuffer, operations, outputFormat) {
    let pipeline = sharp(inputBuffer)

    // Rotation
    if (operations.rotation) {
      pipeline = pipeline.rotate(operations.rotation)
    }

    // Flip/Flop
    if (operations.flip) {
      pipeline = pipeline.flip()
    }
    if (operations.flop) {
      pipeline = pipeline.flop()
    }

    // Resize
    if (operations.width || operations.height) {
      const resizeOptions = {
        width: operations.width,
        height: operations.height,
        // Sharp has no "pad" fit; padding is "contain" plus a background color
        fit: operations.fit === "pad" ? "contain" : operations.fit || "cover",
        position: operations.gravity || "centre",
        withoutEnlargement: true,
      }

      // Background for 'pad' fit
      if (operations.fit === "pad" && operations.background) {
        resizeOptions.background = this.parseColor(operations.background)
      }

      pipeline = pipeline.resize(resizeOptions)
    }

    // Effects
    if (operations.blur) {
      pipeline = pipeline.blur(operations.blur)
    }

    if (operations.sharpen) {
      pipeline = pipeline.sharpen({ sigma: operations.sharpen })
    }

    if (operations.grayscale) {
      pipeline = pipeline.grayscale()
    }

    // Format conversion and quality
    const quality = operations.quality === "auto" ? this.getAutoQuality(outputFormat) : operations.quality || 85

    switch (outputFormat) {
      case "jpg":
      case "jpeg":
        pipeline = pipeline.jpeg({
          quality,
          mozjpeg: true, // Better compression
        })
        break

      case "png":
        // Note: setting quality switches Sharp to palette-based (lossy) PNG output
        pipeline = pipeline.png({
          quality,
          compressionLevel: 9,
          adaptiveFiltering: true,
        })
        break

      case "webp":
        pipeline = pipeline.webp({
          quality,
          effort: 6, // Compression effort (0-6)
        })
        break

      case "avif":
        pipeline = pipeline.avif({
          quality,
          effort: 6, // Compression effort (0-9)
        })
        break

      case "gif":
        pipeline = pipeline.gif()
        break
    }

    return await pipeline.toBuffer()
  }

  /**
   * Determine output format based on operations and Accept header
   *
   * Format selection priority (as of 2026-Q2):
   * - AVIF: ~94.9% caniuse global support, ~50% smaller than JPEG, ~20-25% smaller than WebP
   * - WebP: ~97% caniuse global support, 25-34% smaller than JPEG (Google's WebP study)
   * - JPEG: Universal fallback
   *
   * Note: JPEG XL decoder shipped in Chrome 145 (Feb 2026) but is gated behind
   * the `enable-jxl-image-format` flag, so it is not yet a viable production target.
   * Re-evaluate once it is enabled by default and caniuse coverage exceeds ~80%.
   */
  determineFormat(requestedFormat, acceptHeader) {
    if (requestedFormat && requestedFormat !== "auto") {
      return requestedFormat
    }

    // Format negotiation based on Accept header
    const accept = (acceptHeader || "").toLowerCase()

    if (accept.includes("image/avif")) {
      return "avif" // Best compression: ~50% smaller than JPEG
    }

    if (accept.includes("image/webp")) {
      return "webp" // Good compression: 25-34% smaller than JPEG, 
slightly wider support261    }262263    return "jpg" // Fallback264  }265266  /**267   * Get automatic quality based on format268   *269   * Quality values are calibrated to produce visually similar output across formats.270   * AVIF and WebP compress more efficiently, so they need lower quality values271   * to achieve similar file sizes with equivalent visual quality.272   *273   * Real-world example (2000×2000 product photo):274   * - JPEG q=80: ~540 KB275   * - WebP q=85: ~350 KB (35% smaller)276   * - AVIF q=75 (CQ 28): ~210 KB (61% smaller)277   */278  getAutoQuality(format) {279    const qualityMap = {280      avif: 75, // AVIF compresses very well; q=75 ≈ JPEG q=85 visually281      webp: 80, // WebP compresses well; q=80 ≈ JPEG q=85 visually282      jpg: 85, // JPEG baseline quality283      jpeg: 85,284      png: 90, // PNG quality affects compression, not visual fidelity (lossless)285    }286287    return qualityMap[format] || 85288  }289290  /**291   * Generate deterministic hash for transformation292   */293  generateOpsHash(canonicalOps, assetContentHash, outputFormat) {294    const payload = `${canonicalOps};${assetContentHash};fmt=${outputFormat}`295    return crypto.createHash("sha256").update(payload).digest("hex")296  }297298  /**299   * Parse color hex string to RGB object300   */301  parseColor(hex) {302    hex = hex.replace("#", "")303    return {304      r: parseInt(hex.substr(0, 2), 16),305      g: parseInt(hex.substr(2, 2), 16),306      b: parseInt(hex.substr(4, 2), 16),307    }308  }309310  /**311   * Canonicalize operations312   */313  canonicalizeOps(ops) {314    // Implementation details...315    // Return canonical string like "w_800-h_600-f_cover-q_85-fmt_webp"316  }317}318319export default TransformEngine

Format Selection: AVIF Encoding Cost vs. WebP

AVIF’s compression win comes at a real CPU cost. Independent benchmarks place AVIF encoding at ~5-10× JPEG and several times slower than WebP at default settings, with some measurements at 1920×1080 reporting JPEG ≈ 45 ms, WebP ≈ 127 ms, AVIF ≈ 892 ms¹. The web.dev team reports that libaom optimizations have cut AVIF encoder CPU by ~6.5× since first deployment, but it remains the most expensive of the three².

Operational implication for an on-the-fly transform service:

Synchronous path → WebP first. WebP is 25-34% smaller than JPEG, which captures roughly half to two-thirds of AVIF’s ~50% saving for a fraction of the CPU. Honor the Accept header but bias toward WebP inside the < 800 ms budget.
AVIF behind a queue. Serve AVIF either as a pre-baked preset (generated asynchronously after upload) or as a background “upgrade”: render WebP synchronously, then asynchronously generate AVIF and let the next request hit it. With Sharp, use effort: 4 (Sharp’s AVIF default on its 0-9 scale) for live AVIF rather than the effort: 6 set in the engine above; the size delta is small but the CPU delta is meaningful.
Cache hit dominates the math. Once a derived asset is cached, encoder cost is amortized to zero. The decision is really about the cost of the first miss.
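
The policy above can be sketched as a small decision function. This is a minimal illustration, not the service's actual negotiation code: `planFormats` and its `avifAlreadyCached` flag are hypothetical names, and how the background AVIF job is enqueued is left to the caller.

```javascript
// Hypothetical helper: decide which format to encode synchronously and
// whether a background AVIF "upgrade" should be queued for later requests.
function planFormats(acceptHeader, avifAlreadyCached) {
  const accept = (acceptHeader || "").toLowerCase()
  const wantsAvif = accept.includes("image/avif")
  const wantsWebp = accept.includes("image/webp")

  // A cached AVIF costs nothing to serve, so prefer it when the client accepts it.
  if (wantsAvif && avifAlreadyCached) {
    return { serve: "avif", enqueueAvif: false }
  }

  // Otherwise encode WebP inline and let the queue produce AVIF afterwards.
  if (wantsWebp) {
    return { serve: "webp", enqueueAvif: wantsAvif }
  }

  // Clients without WebP support get the universal JPEG fallback.
  return { serve: "jpg", enqueueAvif: wantsAvif }
}
```

The first miss pays only for a WebP encode; a later request from an AVIF-capable client is served the cached AVIF once the background job has stored it.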

Content-Aware Cropping

Hard center-crop loses the subject whenever the subject is not in the center — product photos with off-center hero items, group shots, billboards, screenshots. Two approaches dominate, with very different complexity:

Approach	What it picks	Where it lives	Cost model
Entropy / attention (libvips `attention`/`entropy`, smartcrop.js)	Region with highest variance / edge density	In the transform path, on the downscaled image	Cheap (a few ms on a 256 px thumbnail)
Saliency model (DeepGaze, custom CNN)	Region a human eye is most likely to fixate	Offline at upload time; persisted as a focal point	Expensive once (tens to hundreds of ms on CPU); free at request time
Manual focal point	Editor-supplied point	Stored on the asset	Free

The classic engineering write-up is X’s (formerly Twitter’s) “Speedy Neural Networks for Smart Auto-Cropping of Images”, which describes a DeepGaze-II-derived saliency network compressed via knowledge distillation and Fisher pruning to run in real time. That post is also a cautionary tale: in 2021, X retired automated saliency cropping after an internal study found demographic bias in the cropping decisions. The lesson generalizes: ship saliency cropping with an explicit override (manual focal point or full-aspect fallback) and audit the crop distribution per protected attribute before relying on it.

Content-aware crop pipeline: a saliency model runs once at upload time and persists a focal point + score map; at request time the gateway solves for the highest-scoring window of the requested aspect ratio and hands extract+resize to libvips.

In practice for a Sharp-based service:

Start with sharp.resize({ position: sharp.strategy.attention }) — libvips’ built-in attention strategy is free, deterministic, and good enough for most product imagery.
Promote to a persisted focal point when you need stable crops across multiple aspect ratios (so the same subject anchors square, 16:9, and 4:5 crops). Compute it once on upload, store it as (x_norm, y_norm) on the asset row, and treat it as part of the canonical operations string for cache-key purposes.
Reach for a trained saliency model only when entropy/attention measurably hurts a specific content type. The X experience is the bar: if you ship it, ship the audit alongside.
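
To make the persisted-focal-point option concrete, here is a minimal sketch of turning a normalized focal point into a crop rectangle for a requested aspect ratio. `cropWindow` is a hypothetical helper. It centers the largest window that fits on the focal point and clamps it to the image bounds, a simpler stand-in for solving over a full score map.

```javascript
// Hypothetical helper: largest crop of `targetAspect` (width / height) that fits
// in the image, centered on a normalized focal point and clamped to the edges.
function cropWindow(imgW, imgH, focalX, focalY, targetAspect) {
  // Start from full width; shrink to full height if the window would overflow.
  let width = imgW
  let height = Math.round(imgW / targetAspect)
  if (height > imgH) {
    height = imgH
    width = Math.round(imgH * targetAspect)
  }

  const clamp = (v, lo, hi) => Math.min(Math.max(v, lo), hi)

  // Center on the focal point, then keep the window inside the image.
  const left = clamp(Math.round(focalX * imgW - width / 2), 0, imgW - width)
  const top = clamp(Math.round(focalY * imgH - height / 2), 0, imgH - height)

  return { left, top, width, height }
}
```

The returned object has the same shape as the region Sharp's `extract()` accepts, so it can be applied before the final resize. Because the focal point is stored once per asset, square, 16:9, and 4:5 crops all anchor on the same subject.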

Metadata, Orientation, and Animated Images

Three categories of input quirks routinely break naive pipelines:

EXIF orientation. Phone cameras store rotation in the EXIF Orientation tag instead of rotating pixels. Sharp does not auto-orient by default — you must call .rotate() (no args) or .autoOrient() to apply the orientation tag and then strip it. Skipping this is the #1 source of “my image is sideways” tickets. Sharp also strips all EXIF metadata by default; this is a privacy feature (geotags, camera serial numbers), and you only re-attach via .keepExif() if downstream consumers truly need it. Persist a sanitized subset (orientation, capture date, camera, GPS-stripped) in the assets.exif JSONB column so you do not have to re-decode the original to answer metadata queries.

Animated images. Animated GIF, WebP, and AVIF have multiple frames; resize/encode operations that ignore frames silently flatten to the first frame. Sharp exposes this via sharp(buf, { animated: true, pages: -1 }) plus .gif({ loop: 0 }) / .webp({ loop: 0 }). For an on-the-fly service, decide per format whether to:

Preserve animation — slower and bigger, but expected for memes/UI animations.
Take a poster frame — fast, defaults to frame 0; useful for thumbnails and previews.
Refuse animation — explicit 415 with a clear error; reasonable for e-commerce hero imagery.

Encode the choice as a transformation parameter (e.g., anim_keep / anim_first / anim_strip) so it participates in the cache key. Bound the maximum frame count and total decoded pixel area, otherwise a 200 KB animated AVIF can decompose into hundreds of megabytes of decoded frames.
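
A guard for the frame and pixel bounds might look like the following sketch. It assumes a metadata object shaped like the result of Sharp's `metadata()` for an input opened with `animated: true` (`width`, `height`, `pages`, `pageHeight`). The two limits and the helper name are hypothetical values chosen for illustration.

```javascript
// Hypothetical limits; tune them to the memory available per worker.
const MAX_FRAMES = 200
const MAX_DECODED_PIXELS = 50_000_000

// `meta` is assumed to look like Sharp's metadata() output for an animated input:
// `pages` = frame count, `pageHeight` = height of a single frame.
function checkAnimationLimits(meta) {
  const frames = meta.pages || 1
  const frameHeight = meta.pageHeight || meta.height
  const decodedPixels = meta.width * frameHeight * frames

  if (frames > MAX_FRAMES) {
    return { ok: false, reason: "too_many_frames" }
  }
  if (decodedPixels > MAX_DECODED_PIXELS) {
    return { ok: false, reason: "too_many_pixels" }
  }
  return { ok: true, frames, decodedPixels }
}
```

Running the check on metadata alone, before any frame is decoded, keeps a small but hostile file from exhausting worker memory.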

Color space. Strip embedded ICC profiles only after you have applied them — sharp(buf).withMetadata({ icc: 'srgb' }) converts in the working color space, then tags sRGB on output. Skipping this turns Adobe-RGB DSLR shots dull on the web.

Distributed Locking

import Redlock from "redlock"
import Redis from "ioredis"

/**
 * Distributed lock manager using Redlock algorithm
 *
 * IMPORTANT: This lock manager is designed for EFFICIENCY optimization, not
 * CORRECTNESS guarantees. Redlock cannot provide fencing tokens, so:
 *
 * - SAFE: Preventing duplicate transforms (if lock fails, we waste compute but don't corrupt data)
 * - UNSAFE: Protecting financial transactions, inventory updates, or any operation where
 *   concurrent execution could cause data inconsistency
 *
 * For safety-critical mutual exclusion, use etcd (Raft consensus) or ZooKeeper (ZAB protocol).
 * See: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
 */
class LockManager {
  constructor(redisClients) {
    // Initialize Redlock with multiple Redis instances (N=5 recommended for production)
    this.redlock = new Redlock(redisClients, {
      driftFactor: 0.01,
      retryCount: 10,
      retryDelay: 200,
      retryJitter: 200,
      automaticExtensionThreshold: 500,
    })
  }

  /**
   * Acquire distributed lock (retries according to the settings above)
   */
  async acquire(key, ttl = 30000) {
    try {
      const lock = await this.redlock.acquire([`lock:${key}`], ttl)
      return lock
    } catch (error) {
      throw new Error(`Failed to acquire lock for ${key}: ${error.message}`)
    }
  }

  /**
   * Try to acquire lock without waiting
   */
  async tryAcquire(key, ttl = 30000) {
    try {
      // Override retryCount so a held lock fails immediately instead of retrying
      return await this.redlock.acquire([`lock:${key}`], ttl, { retryCount: 0 })
    } catch (error) {
      return null // Lock not acquired
    }
  }
}

// Usage
const redis1 = new Redis({ host: "redis-1" })
const redis2 = new Redis({ host: "redis-2" })
const redis3 = new Redis({ host: "redis-3" })

const lockManager = new LockManager([redis1, redis2, redis3])

export default LockManager

Security & Access Control

Signing Flow

URL signing is the keystone of the private-content path: it lets the CDN cache content publicly while keeping authorization on the request, not on the byte stream. The canonical-string + signature pattern below is the shape that Cloudflare Images, AWS CloudFront, and Imgix all converge on, although the primitive differs by provider (CloudFront signs with an asymmetric key pair rather than HMAC). Cloudflare Images, for example, signs the path + query string with SHA-256 HMAC and an exp query parameter.

Three properties matter:

Canonical string includes everything the recipient must trust. Method, path, expiry, host, and tenant ID — anything the verifier must enforce belongs in the bytes that get HMAC’d. Anything not in the canonical string can be tampered with.
Signature parameters must be stripped before the cache key is computed, otherwise every signed URL becomes its own cache entry. This is the single most common reason private-content cache hit rates collapse.
kid (key ID) enables rotation without breaking outstanding URLs. Issue under the new key, leave the old key valid for the longest signed TTL you allow, then retire it.
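
The second property is easy to get wrong, so here is a minimal sketch of deriving a cache key that ignores the signature parameters. The parameter names `sig`, `exp`, and `kid` match the implementation in this section; the helper name `cacheKeyFor` and the key layout are assumptions for illustration, and real CDNs expose this as cache-key configuration rather than application code.

```javascript
// Signature-related query parameters that must not influence the cache key.
const SIGNATURE_PARAMS = ["sig", "exp", "kid"]

// Hypothetical helper: normalize a signed URL into a cache key.
function cacheKeyFor(signedUrl) {
  const url = new URL(signedUrl)

  for (const param of SIGNATURE_PARAMS) {
    url.searchParams.delete(param)
  }

  // Sort the remaining parameters so that ordering does not fragment the cache.
  url.searchParams.sort()

  const query = url.searchParams.toString()
  return url.hostname + url.pathname + (query ? `?${query}` : "")
}
```

Two URLs that differ only in signature or expiry map to the same entry. The signature still has to be verified on every request before the cached bytes are served.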

Signed URL Implementation

import crypto from "crypto"

/**
 * Signature Service - Generate and verify signed URLs
 */
class SignatureService {
  constructor(registry) {
    this.registry = registry
  }

  /**
   * Generate signed URL for private images
   */
  async generateSignedUrl(baseUrl, orgId, tenantId, ttl = null) {
    // Get signing key for tenant/org
    const apiKey = await this.registry.getSigningKey(orgId, tenantId)

    // Get effective policy for TTL
    const policy = await this.registry.getEffectivePolicy(orgId, tenantId)
    const defaultTtl = policy.signed_url_ttl_default_seconds || 3600
    const maxTtl = policy.signed_url_ttl_max_seconds || 86400

    // Calculate expiry
    const requestedTtl = ttl || defaultTtl
    const effectiveTtl = Math.min(requestedTtl, maxTtl)
    const expiresAt = Math.floor(Date.now() / 1000) + effectiveTtl

    // Create canonical string for signing
    const url = new URL(baseUrl)
    const canonicalString = this.createCanonicalString(url.pathname, expiresAt, url.hostname, tenantId)

    // Generate HMAC-SHA256 signature (URL-safe base64)
    const signature = crypto.createHmac("sha256", apiKey.secret).update(canonicalString).digest("base64url")

    // Append signature, expiry, and key ID to URL
    url.searchParams.set("sig", signature)
    url.searchParams.set("exp", expiresAt.toString())
    url.searchParams.set("kid", apiKey.keyId)

    return {
      url: url.toString(),
      expiresAt: new Date(expiresAt * 1000),
      expiresIn: effectiveTtl,
    }
  }

  /**
   * Verify signed URL
   */
  async verifySignedUrl(signedUrl, orgId, tenantId) {
    const url = new URL(signedUrl)

    // Extract signature components
    const signature = url.searchParams.get("sig")
    const expiresAt = parseInt(url.searchParams.get("exp"), 10)
    const keyId = url.searchParams.get("kid")

    if (!signature || !expiresAt || !keyId) {
      return {
        valid: false,
        error: "Missing signature components",
      }
    }

    // Check expiration
    const now = Math.floor(Date.now() / 1000)
    if (now > expiresAt) {
      return {
        valid: false,
        expired: true,
        error: "Signature expired",
      }
    }

    // Get signing key
    const apiKey = await this.registry.getApiKeyById(keyId)
    if (!apiKey || apiKey.status !== "active") {
      return {
        valid: false,
        error: "Invalid key ID",
      }
    }

    // Verify tenant/org ownership
    if (apiKey.organizationId !== orgId || apiKey.tenantId !== tenantId) {
      return {
        valid: false,
        error: "Key does not match tenant",
      }
    }

    // Reconstruct canonical string
    url.searchParams.delete("sig")
    url.searchParams.delete("exp")
    url.searchParams.delete("kid")

    const canonicalString = this.createCanonicalString(url.pathname, expiresAt, url.hostname, tenantId)

    // Compute expected signature
    const expectedSignature = crypto.createHmac("sha256", apiKey.secret).update(canonicalString).digest("base64url")

    // Constant-time comparison to prevent timing attacks.
    // timingSafeEqual throws if the buffers differ in length, so check that first.
    const provided = Buffer.from(signature)
    const expected = Buffer.from(expectedSignature)
    const valid = provided.length === expected.length && crypto.timingSafeEqual(provided, expected)

    return {
      valid,
      error: valid ? null : "Invalid signature",
    }
  }

  /**
   * Create canonical string for signing
   */
  createCanonicalString(pathname, expiresAt, hostname, tenantId) {
    return ["GET", pathname, expiresAt, hostname, tenantId].join("\n")
  }

  /**
   * Rotate signing keys
   */
  async rotateSigningKey(orgId, tenantId) {
    // Generate new secret
    const newSecret = crypto.randomBytes(32).toString("hex")
    const newKeyId = `key_${Date.now()}_${crypto.randomBytes(8).toString("hex")}`

    // Create new key
    const newKey = await this.registry.createApiKey({
      organizationId: orgId,
      tenantId,
      keyId: newKeyId,
      name: `Signing Key (rotated ${new Date().toISOString()})`,
      secret: newSecret,
      scopes: ["signing"],
    })

    // Mark old keys for deprecation (keep valid for grace period)
    await this.registry.deprecateOldSigningKeys(orgId, tenantId, newKey.id)

    return newKey
  }
}

export default SignatureService

Authentication Middleware

import crypto from "crypto"

/**
 * Authentication middleware for Fastify
 */
class AuthMiddleware {
  constructor(registry) {
    this.registry = registry
  }

  /**
   * API Key authentication
   */
  async authenticateApiKey(request, reply) {
    const apiKey = request.headers["x-api-key"]

    if (!apiKey) {
      return reply.code(401).send({
        error: "Unauthorized",
        message: "API key required",
      })
    }

    // Hash the API key
    const keyHash = crypto.createHash("sha256").update(apiKey).digest("hex")

    // Look up in database
    const keyRecord = await this.registry.getApiKeyByHash(keyHash)

    if (!keyRecord) {
      return reply.code(401).send({
        error: "Unauthorized",
        message: "Invalid API key",
      })
    }

    // Check status and expiration
    if (keyRecord.status !== "active") {
      return reply.code(401).send({
        error: "Unauthorized",
        message: "API key is inactive",
      })
    }

    if (keyRecord.expiresAt && new Date(keyRecord.expiresAt) < new Date()) {
      return reply.code(401).send({
        error: "Unauthorized",
        message: "API key has expired",
      })
    }

    // Update last used timestamp (async, don't wait)
    this.registry.updateApiKeyLastUsed(keyRecord.id).catch(console.error)

    // Attach to request context
    request.auth = {
      organizationId: keyRecord.organizationId,
      tenantId: keyRecord.tenantId,
      scopes: keyRecord.scopes,
      keyId: keyRecord.id,
    }
  }

  /**
   * Scope-based authorization
   */
  requireScope(scope) {
    return async (request, reply) => {
      if (!request.auth) {
        return reply.code(401).send({
          error: "Unauthorized",
          message: "Authentication required",
        })
      }

      if (!request.auth.scopes.includes(scope)) {
        return reply.code(403).send({
          error: "Forbidden",
          message: `Required scope: ${scope}`,
        })
      }
    }
  }

  /**
   * Tenant boundary check
   */
  async checkTenantAccess(request, reply, orgId, tenantId, spaceId) {
    if (!request.auth) {
      return reply.code(401).send({
        error: "Unauthorized",
      })
    }

    // Check organization match
    if (request.auth.organizationId !== orgId) {
      return reply.code(403).send({
        error: "Forbidden",
        message: "Access denied to this organization",
      })
    }

    // Check tenant match (if key is tenant-scoped)
    if (request.auth.tenantId && request.auth.tenantId !== tenantId) {
      return reply.code(403).send({
        error: "Forbidden",
        message: "Access denied to this tenant",
      })
    }

    return true
  }
}

export default AuthMiddleware

Rate Limiting

import Redis from "ioredis"

/**
 * Rate limiter using sliding window algorithm
 */
class RateLimiter {
  constructor(redis) {
    this.redis = redis
  }

  /**
   * Check and enforce rate limit
   */
  async checkLimit(identifier, limit, windowSeconds) {
    const key = `ratelimit:${identifier}`
    const now = Date.now()
    const windowStart = now - windowSeconds * 1000

    // Use a MULTI/EXEC transaction: a plain pipeline only batches commands
    // and does not make them atomic.
    const tx = this.redis.multi()

    // Remove old entries outside the window
    tx.zremrangebyscore(key, "-inf", windowStart)

    // Count requests in current window
    tx.zcard(key)

    // Add current request (rejected requests are recorded too)
    const requestId = `${now}:${Math.random()}`
    tx.zadd(key, now, requestId)

    // Set expiry on key
    tx.expire(key, windowSeconds)

    const results = await tx.exec()
    const count = results[1][1] // Result of ZCARD

    const allowed = count < limit
    const remaining = Math.max(0, limit - count - 1)

    // Calculate reset time
    const oldestEntry = await this.redis.zrange(key, 0, 0, "WITHSCORES")
    const resetAt =
      oldestEntry.length > 0
        ? new Date(parseInt(oldestEntry[1], 10) + windowSeconds * 1000)
        : new Date(now + windowSeconds * 1000)

    return {
      allowed,
      limit,
      remaining,
      resetAt,
    }
  }

  /**
   * Rate limiting middleware for Fastify
   */
  middleware(getLimitConfig) {
    return async (request, reply) => {
      // Get limit configuration based on request context
      const { identifier, limit, window } = getLimitConfig(request)

      const result = await this.checkLimit(identifier, limit, window)

      // Set rate limit headers
      reply.header("X-RateLimit-Limit", result.limit)
      reply.header("X-RateLimit-Remaining", result.remaining)
      reply.header("X-RateLimit-Reset", result.resetAt.toISOString())

      if (!result.allowed) {
        const retryAfter = Math.ceil((result.resetAt.getTime() - Date.now()) / 1000)
        reply.header("Retry-After", retryAfter)
        return reply.code(429).send({
          error: "Too Many Requests",
          message: `Rate limit exceeded. Try again after ${result.resetAt.toISOString()}`,
          retryAfter,
        })
      }
    }
  }
}

// Usage example (`app` is a Fastify instance and `handler` the route handler)
const redis = new Redis()
const rateLimiter = new RateLimiter(redis)

// Apply to route
app.get(
  "/v1/pub/*",
  {
    preHandler: rateLimiter.middleware((request) => ({
      identifier: `org:${request.params.org}`,
      limit: 1000, // requests
      window: 60, // seconds
    })),
  },
  handler,
)

export default RateLimiter

Deployment Architecture

Kubernetes Deployment

Kubernetes topology: Nginx ingress fronts the Gateway, Asset Ingestion, and Control Plane services. Stateful dependencies (Postgres, Redis, RabbitMQ, S3-compatible storage) sit alongside the data plane.

Storage Abstraction Layer

/**
 * Abstract storage interface
 */
class StorageAdapter {
  async put(key, buffer, contentType, metadata = {}) {
    throw new Error("Not implemented")
  }

  async get(key) {
    throw new Error("Not implemented")
  }

  async delete(key) {
    throw new Error("Not implemented")
  }

  async exists(key) {
    throw new Error("Not implemented")
  }

  async getSignedUrl(key, ttl) {
    throw new Error("Not implemented")
  }

  get provider() {
    throw new Error("Not implemented")
  }
}

// AWS S3 Implementation (imports collapsed)
// import { S3Client, PutObjectCommand, GetObjectCommand, ... } from "@aws-sdk/client-s3"
// import { getSignedUrl } from "@aws-sdk/s3-request-presigner"

class S3StorageAdapter extends StorageAdapter {
  constructor(config) {
    super()
    this.client = new S3Client({
      region: config.region,
      credentials: config.credentials,
    })
    this.bucket = config.bucket
  }

  async put(key, buffer, contentType, metadata = {}) {
    const command = new PutObjectCommand({
      Bucket: this.bucket,
      Key: key,
      Body: buffer,
      ContentType: contentType,
      Metadata: metadata,
      ServerSideEncryption: "AES256",
    })

    await this.client.send(command)
  }

  async get(key) {
    const command = new GetObjectCommand({
      Bucket: this.bucket,
      Key: key,
    })

    const response = await this.client.send(command)
    const chunks = []

    for await (const chunk of response.Body) {
      chunks.push(chunk)
    }

    return Buffer.concat(chunks)
  }

  async delete(key) {
    const command = new DeleteObjectCommand({
      Bucket: this.bucket,
      Key: key,
    })

    await this.client.send(command)
  }

  async exists(key) {
    try {
      const command = new HeadObjectCommand({
        Bucket: this.bucket,
        Key: key,
      })

      await this.client.send(command)
      return true
    } catch (error) {
      if (error.name === "NotFound") {
        return false
      }
      throw error
    }
  }

  async getSignedUrl(key, ttl = 3600) {
    const command = new GetObjectCommand({
      Bucket: this.bucket,
      Key: key,
    })

    return await getSignedUrl(this.client, command, { expiresIn: ttl })
  }

  get provider() {
    return "aws"
  }
}

// Google Cloud Storage Implementation (imports collapsed)
// import { Storage } from "@google-cloud/storage"

class GCSStorageAdapter extends StorageAdapter {
  constructor(config) {
    super()
    this.storage = new Storage({
      projectId: config.projectId,
      credentials: config.credentials,
    })
    this.bucket = this.storage.bucket(config.bucket)
  }

  async put(key, buffer, contentType, metadata = {}) {
    const file = this.bucket.file(key)
    await file.save(buffer, {
      contentType,
      metadata,
      resumable: false,
    })
  }

  async get(key) {
    const file = this.bucket.file(key)
    const [contents] = await file.download()
    return contents
  }

  async delete(key) {
    const file = this.bucket.file(key)
    await file.delete()
  }

  async exists(key) {
    const file = this.bucket.file(key)
    const [exists] = await file.exists()
    return exists
  }

  async getSignedUrl(key, ttl = 3600) {
    const file = this.bucket.file(key)
    const [url] = await file.getSignedUrl({
      action: "read",
      expires: Date.now() + ttl * 1000,
    })
    return url
  }

  get provider() {
    return "gcp"
  }
}

// Azure Blob Storage Implementation (imports collapsed)
// import { BlobServiceClient, BlobSASPermissions } from "@azure/storage-blob"

class AzureBlobStorageAdapter extends StorageAdapter {
  constructor(config) {
    super()
    this.blobServiceClient = BlobServiceClient.fromConnectionString(config.connectionString)
    this.containerClient = this.blobServiceClient.getContainerClient(config.containerName)
  }

  async put(key, buffer, contentType, metadata = {}) {
    const blockBlobClient = this.containerClient.getBlockBlobClient(key)
    await blockBlobClient.upload(buffer, buffer.length, {
      blobHTTPHeaders: { blobContentType: contentType },
      metadata,
    })
  }

  async get(key) {
    const blobClient = this.containerClient.getBlobClient(key)
    const downloadResponse = await blobClient.download()

    return await this.streamToBuffer(downloadResponse.readableStreamBody)
  }

  async delete(key) {
    const blobClient = this.containerClient.getBlobClient(key)
    await blobClient.delete()
  }

  async exists(key) {
    const blobClient = this.containerClient.getBlobClient(key)
    return await blobClient.exists()
  }

  async getSignedUrl(key, ttl = 3600) {
    const blobClient = this.containerClient.getBlobClient(key)
    const expiresOn = new Date(Date.now() + ttl * 1000)

    // The SDK expects a BlobSASPermissions object, not a raw permission string
    return await blobClient.generateSasUrl({
      permissions: BlobSASPermissions.parse("r"),
      expiresOn,
    })
  }

  async streamToBuffer(readableStream) {
    return new Promise((resolve, reject) => {
      const chunks = []
      readableStream.on("data", (chunk) => chunks.push(chunk))
      readableStream.on("end", () => resolve(Buffer.concat(chunks)))
      readableStream.on("error", reject)
    })
  }

  get provider() {
    return "azure"
  }
}

// MinIO Implementation (S3-compatible for on-premise, imports collapsed)
// import * as Minio from "minio"

class MinIOStorageAdapter extends StorageAdapter {
  constructor(config) {
    super()
    this.client = new Minio.Client({
      endPoint: config.endPoint,
      port: config.port || 9000,
      useSSL: config.useSSL !== false,
      accessKey: config.accessKey,
      secretKey: config.secretKey,
    })
    this.bucket = config.bucket
  }

  async put(key, buffer, contentType, metadata = {}) {
    await this.client.putObject(this.bucket, key, buffer, buffer.length, {
      "Content-Type": contentType,
      ...metadata,
    })
  }

  async get(key) {
    const stream = await this.client.getObject(this.bucket, key)

    return new Promise((resolve, reject) => {
      const chunks = []
      stream.on("data", (chunk) => chunks.push(chunk))
      stream.on("end", () => resolve(Buffer.concat(chunks)))
      stream.on("error", reject)
    })
  }

  async delete(key) {
    await this.client.removeObject(this.bucket, key)
  }

  async exists(key) {
    try {
      await this.client.statObject(this.bucket, key)
      return true
    } catch (error) {
      if (error.code === "NotFound") {
        return false
      }
      throw error
    }
  }

  async getSignedUrl(key, ttl = 3600) {
    return await this.client.presignedGetObject(this.bucket, key, ttl)
  }

  get provider() {
    return "minio"
  }
}

/**
 * Storage Factory
 */
class StorageFactory {
  static create(config) {
    switch (config.provider) {
      case "aws":
      case "s3":
        return new S3StorageAdapter(config)

      case "gcp":
      case "gcs":
        return new GCSStorageAdapter(config)

      case "azure":
        return new AzureBlobStorageAdapter(config)

      case "minio":
      case "onprem":
        return new MinIOStorageAdapter(config)

      default:
        throw new Error(`Unsupported storage provider: ${config.provider}`)
    }
  }
}

export { StorageAdapter, StorageFactory }

Deployment Configuration

# docker-compose.yml for local development
version: "3.8"

services:
  # API Gateway
  gateway:
    build: ./services/gateway
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: development
      DATABASE_URL: postgresql://postgres:password@postgres:5432/imageservice
      REDIS_URL: redis://redis:6379
      STORAGE_PROVIDER: minio
      MINIO_ENDPOINT: minio
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    depends_on:
      - postgres
      - redis
      - minio

  # Transform Engine
  transform:
    build: ./services/transform
    deploy:
      replicas: 3
    environment:
      DATABASE_URL: postgresql://postgres:password@postgres:5432/imageservice
      REDIS_URL: redis://redis:6379
      STORAGE_PROVIDER: minio
      MINIO_ENDPOINT: minio
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    depends_on:
      - postgres
      - redis
      - minio

  # Transform Workers
  worker:
    build: ./services/worker
    deploy:
      replicas: 3
    environment:
      DATABASE_URL: postgresql://postgres:password@postgres:5432/imageservice
      # Credentials must match RABBITMQ_DEFAULT_USER / RABBITMQ_DEFAULT_PASS below
      RABBITMQ_URL: amqp://admin:password@rabbitmq:5672
      STORAGE_PROVIDER: minio
      MINIO_ENDPOINT: minio
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    depends_on:
      - postgres
      - rabbitmq
      - minio

  # PostgreSQL
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: imageservice
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  # Redis
  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-data:/data
    ports:
      - "6379:6379"

  # RabbitMQ
  rabbitmq:
    image: rabbitmq:3-management-alpine
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: password
    ports:
      - "5672:5672"
      - "15672:15672"
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq

  # MinIO (S3-compatible storage)
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio-data:/data

volumes:
  postgres-data:
  redis-data:
  rabbitmq-data:
  minio-data:

Cost Optimization

Multi-Layer Caching Strategy

The cost story is the same as the latency story — see the cache funnel up top. Each layer eliminates a fraction of the work the next layer would have to do, and only the residual (typically < 5%) ever pays for compute. The cost projection later in this section turns those residuals into dollars.
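
To make the residual concrete, the sketch below multiplies out the per-layer miss rates from the cache funnel (roughly 95% CDN, 80% Redis, 90% database index). The function name and the layer objects are illustrative and not part of the service; the hit rates are the reference figures quoted at the top of this document, and real traffic will differ.

```javascript
// Fraction of requests that miss every cache layer and reach the transform engine.
// Each layer only sees the traffic that the previous layer missed.
function residualFraction(layers) {
  return layers.reduce((remaining, layer) => remaining * (1 - layer.hitRate), 1)
}

// Reference hit rates from the cache funnel (assumed, not measured).
const funnel = [
  { name: "cdn", hitRate: 0.95 },
  { name: "redis", hitRate: 0.8 },
  { name: "database", hitRate: 0.9 },
]

const residual = residualFraction(funnel)
console.log(`${(residual * 100).toFixed(2)}% of requests reach the transform engine`)
```

With these reference rates the product is about 0.1% of requests, comfortably inside the < 5% budget. A weaker CDN hit rate dominates the result, because every other layer only ever sees CDN misses.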

Storage Lifecycle Management

/**
 * Storage lifecycle manager
 */
class LifecycleManager {
  constructor(registry, storage) {
    this.registry = registry
    this.storage = storage
  }

  /**
   * Move derived assets to colder tiers based on access patterns
   */
  async moveToColdTier() {
    const coldThresholdDays = 30
    const warmThresholdDays = 7

    // Find candidates for cold tiering. Warm assets are included so they can
    // continue on to cold once they pass the cold threshold.
    const candidates = await this.registry.query(`
      SELECT id, storage_key, cache_tier, last_accessed_at, size_bytes
      FROM derived_assets
      WHERE cache_tier IN ('hot', 'warm')
        AND last_accessed_at < NOW() - INTERVAL '${coldThresholdDays} days'
        AND deleted_at IS NULL
      ORDER BY last_accessed_at ASC
      LIMIT 1000
    `)

    for (const asset of candidates.rows) {
      try {
        // Move to cold storage tier (Glacier Instant Retrieval, Coldline, etc.)
        // Rows carry the snake_case column names from the SELECT above
        await this.storage.moveToTier(asset.storage_key, "cold")

        // Update database
        await this.registry.updateCacheTier(asset.id, "cold")

        console.log(`Moved asset ${asset.id} to cold tier`)
      } catch (error) {
        console.error(`Failed to move asset ${asset.id}:`, error)
      }
    }

    // Similar logic for warm tier
    const warmCandidates = await this.registry.query(`
      SELECT id, storage_key, cache_tier
      FROM derived_assets
      WHERE cache_tier = 'hot'
        AND last_accessed_at < NOW() - INTERVAL '${warmThresholdDays} days'
        AND last_accessed_at >= NOW() - INTERVAL '${coldThresholdDays} days'
        AND deleted_at IS NULL
      LIMIT 1000
    `)

    for (const asset of warmCandidates.rows) {
      try {
        await this.storage.moveToTier(asset.storage_key, "warm")
        await this.registry.updateCacheTier(asset.id, "warm")
      } catch (error) {
        console.error(`Failed to move asset ${asset.id}:`, error)
      }
    }
  }

  /**
   * Delete unused derived assets
   */
  async pruneUnused() {
    const pruneThresholdDays = 90

    const unused = await this.registry.query(`
      SELECT id, storage_key
      FROM derived_assets
      WHERE access_count = 0
        AND created_at < NOW() - INTERVAL '${pruneThresholdDays} days'
      LIMIT 1000
    `)

    for (const asset of unused.rows) {
      try {
        await this.storage.delete(asset.storage_key)
        await this.registry.deleteDerivedAsset(asset.id)

        console.log(`Pruned unused asset ${asset.id}`)
      } catch (error) {
        console.error(`Failed to prune asset ${asset.id}:`, error)
      }
    }
  }
}

Cost Projection

For a service serving 10 million requests/month:

Component	Without Optimization	With Optimization	Savings
Processing	1M transforms × $0.001/transform	50K transforms × $0.001/transform	95%
Storage	100 TB × $0.023/GB-month	100 TB × $0.013/GB-month (tiered)	43%
Bandwidth	100 TB × $0.09/GB (origin)	100 TB × $0.02/GB (CDN)	78%
CDN	—	100TB × $0.02	—
Total	$12,300/month	$5,400/month	56%

Key optimizations:

95% CDN hit rate reduces origin bandwidth
Transform deduplication prevents reprocessing
Storage tiering moves cold data to cheaper tiers
Smart caching minimizes processing costs
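
The totals in the cost projection can be reproduced with a few lines of arithmetic. The sketch below assumes the storage and bandwidth prices are per GB and that 1 TB = 1,000 GB; both are readings of the table rather than stated facts, and the helper name `monthlyCost` is illustrative.

```javascript
const GB_PER_TB = 1000 // decimal terabytes (assumption)

// Sum quantity × unit price over a list of line items.
function monthlyCost(lineItems) {
  return lineItems.reduce((sum, item) => sum + item.quantity * item.unitPrice, 0)
}

const withoutOptimization = [
  { name: "processing", quantity: 1_000_000, unitPrice: 0.001 }, // per transform
  { name: "storage", quantity: 100 * GB_PER_TB, unitPrice: 0.023 }, // per GB-month
  { name: "bandwidth", quantity: 100 * GB_PER_TB, unitPrice: 0.09 }, // per GB from origin
]

const withOptimization = [
  { name: "processing", quantity: 50_000, unitPrice: 0.001 },
  { name: "storage", quantity: 100 * GB_PER_TB, unitPrice: 0.013 }, // blended tiered rate
  { name: "bandwidth", quantity: 100 * GB_PER_TB, unitPrice: 0.02 },
  { name: "cdn", quantity: 100 * GB_PER_TB, unitPrice: 0.02 },
]

const before = monthlyCost(withoutOptimization)
const after = monthlyCost(withOptimization)
const savings = 1 - after / before

console.log(`before: $${before.toFixed(0)}, after: $${after.toFixed(0)}`)
console.log(`savings: ${(savings * 100).toFixed(1)}%`)
```

Under these assumptions the unoptimized total is $12,300 and the optimized total is about $5,350, which the table rounds to $5,400; the savings come out at roughly 56%. Bandwidth is the largest line item in both columns, which is why the CDN hit rate matters more than any other single lever.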

Monitoring & Operations

Metrics Collection

import prometheus from "prom-client"

/**
 * Metrics registry
 */
class MetricsRegistry {
  constructor() {
    this.register = new prometheus.Registry()

    // Default metrics (CPU, memory, etc.)
    prometheus.collectDefaultMetrics({ register: this.register })

    // HTTP metrics
    this.httpRequestDuration = new prometheus.Histogram({
      name: "http_request_duration_seconds",
      help: "HTTP request duration in seconds",
      labelNames: ["method", "route", "status"],
      buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
    })

    this.httpRequestTotal = new prometheus.Counter({
      name: "http_requests_total",
      help: "Total HTTP requests",
      labelNames: ["method", "route", "status"],
    })

    // Transform metrics
    this.transformDuration = new prometheus.Histogram({
      name: "transform_duration_seconds",
      help: "Image transformation duration in seconds",
      labelNames: ["org", "format", "cached"],
      buckets: [0.1, 0.2, 0.5, 1, 2, 5, 10],
    })

    this.transformTotal = new prometheus.Counter({
      name: "transforms_total",
      help: "Total image transformations",
      labelNames: ["org", "format", "cached"],
    })

    this.transformErrors = new prometheus.Counter({
      name: "transform_errors_total",
      help: "Total transformation errors",
      labelNames: ["org", "error_type"],
    })

    // Cache metrics
    this.cacheHits = new prometheus.Counter({
      name: "cache_hits_total",
      help: "Total cache hits",
      labelNames: ["layer"], // cdn, redis, database
    })

    this.cacheMisses = new prometheus.Counter({
      name: "cache_misses_total",
      help: "Total cache misses",
      labelNames: ["layer"],
    })

    // Storage metrics
    this.storageOperations = new prometheus.Counter({
      name: "storage_operations_total",
      help: "Total storage operations",
      labelNames: ["provider", "operation"], // put, get, delete
    })

    this.storageBytesTransferred = new prometheus.Counter({
      name: "storage_bytes_transferred_total",
      help: "Total bytes transferred to/from storage",
      labelNames: ["provider", "direction"], // upload, download
    })

    // Business metrics
    this.assetsUploaded = new prometheus.Counter({
      name: "assets_uploaded_total",
      help: "Total assets uploaded",
      labelNames: ["org", "format"],
    })

    this.bandwidthServed = new prometheus.Counter({
      name: "bandwidth_served_bytes_total",
      help: "Total bandwidth served",
      labelNames: ["org", "space"],
    })

    // Register all metrics
    this.register.registerMetric(this.httpRequestDuration)
    this.register.registerMetric(this.httpRequestTotal)
    this.register.registerMetric(this.transformDuration)
    this.register.registerMetric(this.transformTotal)
    this.register.registerMetric(this.transformErrors)
    this.register.registerMetric(this.cacheHits)
    this.register.registerMetric(this.cacheMisses)
    this.register.registerMetric(this.storageOperations)
    this.register.registerMetric(this.storageBytesTransferred)
    this.register.registerMetric(this.assetsUploaded)
    this.register.registerMetric(this.bandwidthServed)
  }

  /**
   * Get metrics in Prometheus format
   */
  async getMetrics() {
    return await this.register.metrics()
  }
}

// Singleton instance
const metricsRegistry = new MetricsRegistry()

export default metricsRegistry

Alerting Configuration

groups:
  - name: image_service_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Low cache hit rate
      - alert: LowCacheHitRate
        expr: |
          (
            sum(rate(cache_hits_total{layer="redis"}[10m]))
            /
            (sum(rate(cache_hits_total{layer="redis"}[10m])) + sum(rate(cache_misses_total{layer="redis"}[10m])))
          ) < 0.70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Cache hit rate is {{ $value | humanizePercentage }}, expected > 70%"

      # Slow transformations
      - alert: SlowTransformations
        expr: |
          histogram_quantile(0.95,
            sum(rate(transform_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow image transformations"
          description: "P95 transform time is {{ $value }}s, expected < 2s"

      # Queue backup
      - alert: QueueBacklog
        expr: rabbitmq_queue_messages{queue="transforms"} > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Transform queue has backlog"
          description: "Queue depth is {{ $value }}, workers may be overwhelmed"

      # Storage quota warning
      - alert: StorageQuotaWarning
        expr: |
          (
            sum(storage_bytes_used) by (organization_id)
            /
            sum(storage_bytes_quota) by (organization_id)
          ) > 0.80
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Organization {{ $labels.organization_id }} approaching storage quota"
          description: "Usage is {{ $value | humanizePercentage }} of quota"

Health Checks

/**
 * Health check service
 */
class HealthCheckService {
  constructor(dependencies) {
    this.db = dependencies.db
    this.redis = dependencies.redis
    this.storage = dependencies.storage
    this.queue = dependencies.queue
  }

  /**
   * Liveness probe - is the service running?
   */
  async liveness() {
    return {
      status: "ok",
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
    }
  }

  /**
   * Readiness probe - is the service ready to accept traffic?
   */
  async readiness() {
    const checks = {
      database: false,
      redis: false,
      storage: false,
      queue: false,
    }

    // Check database
    try {
      await this.db.query("SELECT 1")
      checks.database = true
    } catch (error) {
      console.error("Database health check failed:", error)
    }

    // Check Redis
    try {
      await this.redis.ping()
      checks.redis = true
    } catch (error) {
      console.error("Redis health check failed:", error)
    }

    // Check storage
    try {
      const testKey = ".health-check"
      const testData = Buffer.from("health")
      await this.storage.put(testKey, testData, "text/plain")
      await this.storage.get(testKey)
      await this.storage.delete(testKey)
      checks.storage = true
    } catch (error) {
      console.error("Storage health check failed:", error)
    }

    // Check queue
    try {
      // Implement queue-specific health check
      checks.queue = true
    } catch (error) {
      console.error("Queue health check failed:", error)
    }

    const allHealthy = Object.values(checks).every((v) => v === true)

    return {
      status: allHealthy ? "ready" : "not ready",
      checks,
      timestamp: new Date().toISOString(),
    }
  }
}

export default HealthCheckService
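
One thing a readiness probe has to guard against is a dependency that hangs rather than fails, since an orchestrator usually treats a slow probe as a failed one. The sketch below is one possible way to bound each check with a timeout and run the checks concurrently. `withTimeout` and `runChecks` are hypothetical helpers, not part of the service, and the 1-second default is an arbitrary assumption.

```javascript
// Reject if `promise` does not settle within `ms`, so a hung connection
// cannot stall the readiness probe itself.
function withTimeout(promise, ms, label) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} check timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Run all checks concurrently. `checks` maps a name to a function returning a
// promise; each entry resolves to true or false instead of throwing.
async function runChecks(checks, ms = 1000) {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await withTimeout(check(), ms, name)
        return [name, true]
      } catch (error) {
        console.error(`${name} health check failed:`, error.message)
        return [name, false]
      }
    }),
  )
  return Object.fromEntries(entries)
}
```

Inside `readiness()`, a call such as `runChecks({ database: () => this.db.query("SELECT 1"), redis: () => this.redis.ping() })` would return the same `checks` shape as above, with the total probe time bounded by the slowest single check rather than the sum of all of them.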

Failure Modes and Edge Cases

This section documents failure scenarios, their detection, and recovery strategies. Understanding these modes is critical for production operations.

Transform Timeout (> 800ms SLA breach)

Cause: Large images (> 5MB), complex operations (multiple resize + effects), cold storage retrieval, or resource contention.

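Detection: The p95 of the transform_duration_seconds histogram exceeds its alert threshold. Note that the SlowTransformations rule in the alerting configuration fires at 2 s, so catching breaches of the 800 ms SLA requires a tighter threshold on the same histogram.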

Mitigation strategies:

Size-based routing: Queue images > 5MB to async workers, return 202 with polling URL
Operation limits: Cap maximum output dimensions (e.g., 4096×4096), reject excessive blur/sharpen values
Timeout with fallback: Return lower-quality transform or original if timeout approaches
Pre-warm cold storage: Move frequently accessed cold-tier assets back to hot tier proactively

async function transformWithTimeout(assetId, operations, timeoutMs = 750) {
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), timeoutMs)

  try {
    return await transform(assetId, operations, { signal: controller.signal })
  } catch (error) {
    if (error.name === "AbortError") {
      // Return degraded response or queue for async processing
      metrics.transformTimeouts.inc({ org: assetId.split("/")[0] })

      // Option 1: Return original (fastest fallback)
      return { fallback: "original", reason: "timeout" }

      // Option 2: Queue and return 202
      // await queue.publish('transforms', { assetId, operations })
      // return { status: 202, pollUrl: `/v1/transforms/${jobId}` }
    }
    throw error
  } finally {
    clearTimeout(timeout)
  }
}
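
The first two mitigations, size-based routing and operation limits, are simple enough to sketch as pure functions. The names `chooseTransformPath` and `clampDimensions` are hypothetical, and clamping is only one reading of "cap"; a stricter service could reject oversized requests with a 400 instead.

```javascript
const MAX_SYNC_BYTES = 5 * 1024 * 1024 // 5 MB threshold from the mitigation list
const MAX_OUTPUT_DIMENSION = 4096 // example cap from the mitigation list

// Decide whether to transform inline or hand the job to the async workers
// (the async path is the one that answers 202 with a polling URL).
function chooseTransformPath(sizeBytes) {
  return sizeBytes > MAX_SYNC_BYTES ? "async" : "sync"
}

// Scale requested output dimensions down to fit within the cap,
// preserving aspect ratio. Requests already inside the cap pass through.
function clampDimensions(width, height, max = MAX_OUTPUT_DIMENSION) {
  const scale = Math.min(1, max / width, max / height)
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  }
}

console.log(chooseTransformPath(6 * 1024 * 1024))
console.log(clampDimensions(8192, 4096))
```

Running both checks at the gateway, before any pixel data is decoded, keeps the expensive path from ever seeing requests that were always going to breach the time budget.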

Storage Tier Restoration Latency

Cold storage retrieval can range from milliseconds (S3 Glacier Instant Retrieval / GCS Nearline) to minutes (Flexible Retrieval expedited tier) to hours (Bulk / Deep Archive). Anything beyond Instant Retrieval breaks the synchronous transform guarantee — the request must either degrade to the original or queue and return 202.

Mitigation:

Tier tracking in database: derived_assets.cache_tier column indicates current storage tier
Proactive restoration: Cron job restores cold assets with recent last_accessed_at updates
Graceful degradation: For cold original assets, return 202 and trigger async restoration

-- Find cold assets accessed recently that should be restored
SELECT id, storage_key, cache_tier
FROM derived_assets
WHERE cache_tier = 'cold'
  AND last_accessed_at > NOW() - INTERVAL '24 hours'
ORDER BY access_count DESC
LIMIT 100;
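
The graceful-degradation rule boils down to comparing expected retrieval latency against the synchronous budget. The sketch below illustrates that decision; the storage class labels and latency figures are rough order-of-magnitude assumptions chosen for illustration, not provider guarantees, and `planRequest` is a hypothetical helper.

```javascript
const SYNC_BUDGET_MS = 800 // synchronous transform SLA used in this document

// Assumed order-of-magnitude retrieval latencies per storage class.
const RETRIEVAL_LATENCY_MS = {
  hot: 50,
  "cold-instant": 100, // instant-retrieval classes respond in milliseconds
  "cold-flexible": 5 * 60 * 1000, // minutes
  "cold-bulk": 12 * 60 * 60 * 1000, // hours
}

// Serve inline only when retrieval plus the estimated transform time fits the
// budget; otherwise trigger restoration and answer 202 with a polling URL.
function planRequest(storageClass, estimatedTransformMs) {
  const retrievalMs = RETRIEVAL_LATENCY_MS[storageClass]
  if (retrievalMs === undefined) {
    throw new Error(`Unknown storage class: ${storageClass}`)
  }
  return retrievalMs + estimatedTransformMs <= SYNC_BUDGET_MS
    ? { status: 200, mode: "sync" }
    : { status: 202, mode: "restore-and-queue" }
}

console.log(planRequest("cold-instant", 300))
console.log(planRequest("cold-flexible", 300))
```

Keeping the decision in one function makes it easy to tighten the budget later without touching the gateway's request handling.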

CDN Cache Invalidation Failures

Scenario: Asset updated, but stale version persists in CDN edge caches.

Root causes:

Invalidation API rate limits exceeded
Propagation delays (CDNs quote 0-60 seconds, but outliers exist)
Wildcard invalidation missed specific paths

Mitigation:

Version in URL: Include asset version (/v{version}/) so updates get new cache keys automatically
Soft purge with fallback: Use CDN’s stale-while-revalidate to serve stale during revalidation
Invalidation monitoring: Track invalidation success rates and propagation times
Origin-bypass period: For critical updates, serve from origin for 60 seconds before relying on the CDN
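
The first two mitigations can be sketched as a URL builder and a Cache-Control header. The `/assets/{id}/v{version}/{operations}` layout and both function names are assumptions made for illustration; only the idea of embedding the version in the path and the `stale-while-revalidate` directive (RFC 5861) come from the text above.

```javascript
// Build a delivery path that embeds the asset version, so an update
// produces a new cache key instead of requiring a purge.
// The path layout is hypothetical.
function versionedPath(assetId, version, operations) {
  return `/assets/${encodeURIComponent(assetId)}/v${version}/${operations}`
}

// Cache-Control for versioned URLs: a long max-age is safe because the URL
// changes whenever the content does; stale-while-revalidate lets the edge
// serve the old copy while it refreshes in the background.
function cacheControl({ maxAgeSeconds = 31536000, staleWhileRevalidateSeconds = 60 } = {}) {
  return `public, max-age=${maxAgeSeconds}, stale-while-revalidate=${staleWhileRevalidateSeconds}`
}

console.log(versionedPath("abc123", 3, "w_400,f_webp"))
console.log(cacheControl())
```

With versioned paths, explicit invalidation becomes a cleanup task rather than a correctness requirement, which is what makes invalidation API rate limits tolerable.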

Lock Contention Under Load

Scenario: Multiple workers compete for the same transform lock, causing lock acquisition timeouts.

Detection: redlock_acquisition_failures metric spikes, lock_wait_time increases.

Mitigation:

Lock-free fast path: Check if transform exists before acquiring lock (optimistic check)
Retry with jitter: Exponential backoff with randomized jitter to prevent thundering herd
Lock timeout tuning: Set lock TTL to 2x expected transform time, not a fixed value
Shard by hash prefix: Distribute lock contention across multiple Redis masters
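
The lock-free fast path and the jittered retry combine naturally into one acquisition loop. The sketch below uses exponential backoff with full jitter; `findExisting` and `tryAcquire` are hypothetical callbacks standing in for the registry lookup and the Redlock call, and the base and cap values are arbitrary assumptions.

```javascript
// Exponential backoff with full jitter: the delay is drawn uniformly from
// [0, min(capMs, baseMs * 2^attempt)). `random` is injectable for testing.
function backoffDelayMs(attempt, { baseMs = 50, capMs = 2000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.floor(random() * ceiling)
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Try to obtain either an already-stored transform or the lock to create it.
async function acquireOrReuse({ findExisting, tryAcquire, maxAttempts = 5 }) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Lock-free fast path: another worker may already have stored the result.
    const existing = await findExisting()
    if (existing) return { existing }

    const lock = await tryAcquire()
    if (lock) return { lock }

    // Lost the race: wait a randomized interval so competing workers spread out.
    await sleep(backoffDelayMs(attempt))
  }
  return { gaveUp: true }
}
```

Re-checking for an existing result on every iteration matters here: because these are efficiency locks, the most common reason a lock is unavailable is that another worker is about to finish the same transform.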

Malformed Input, Decompression Bombs, and Image-Library CVEs

Scenarios: Truncated upload, deliberate decompression bomb (a 50 KB PNG that decodes to 50 GB of pixels), exploit aimed at the decoder (the ImageTragick family of CVEs is the canonical example for ImageMagick), or simply a file that looks like a JPEG but is something else.

Detection: Sharp throws VipsError on invalid input; content hash doesn’t match expected; decoded pixel area exceeds a sane bound.

Mitigation — defense in depth, in this order:

Magic-byte verification before passing to the decoder. Check the first 8-16 bytes against the expected format signature (e.g., FF D8 FF for JPEG, 89 50 4E 47 for PNG, 52 49 46 46 ?? ?? ?? ?? 57 45 42 50 for WebP). Reject mismatches at the gateway, not inside libvips.
Pixel-area cap, not just dimension cap. Limit width × height × frames to something like 100 megapixels per request. This catches both extreme aspect ratios and animated bombs that pass per-axis checks.
Disable untrusted loaders. libvips honors VIPS_BLOCK_UNTRUSTED=1, which disables loaders flagged as risky for untrusted input (CSV, matrix, OpenSlide, …). The libvips developer checklist is the authoritative reference. Sharp inherits this; set the env var on the worker pods.
Hash verification on upload. Compute SHA-256 during upload, verify before marking complete; refuse partial uploads instead of leaving “ghost” assets.
Sandbox the decoder. Run transform workers in a separate pod from the gateway, drop Linux capabilities (securityContext: { capabilities: { drop: ["ALL"] } }), set a tight seccomp profile, and cap memory + CPU. A compromised decoder should not be able to reach the registry or the storage credentials directly.
Prefer libvips over ImageMagick on untrusted input. libvips’ security discussion lays out the historical CVE differential clearly; ImageMagick’s policy.xml can be hardened, but the attack surface is structurally larger.
Graceful error messages. Map VipsError and decoder errors to user-friendly responses. Do not echo the raw libvips error string back to the client — it can leak file paths or library versions useful to an attacker.

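import sharp from "sharp"

const MAX_PIXELS = 100_000_000 // 100-megapixel budget across all frames

async function validateImage(buffer) {
  // Cheap check first: anything this small is almost certainly truncated or corrupt
  if (buffer.length < 100) {
    return { valid: false, error: "Image file too small, possibly corrupt" }
  }

  try {
    // metadata() reads the header without decoding compressed pixel data
    const metadata = await sharp(buffer).metadata()

    // Per-axis cap
    if (metadata.width > 50000 || metadata.height > 50000) {
      return { valid: false, error: "Image dimensions exceed maximum (50000×50000)" }
    }

    // Pixel-area cap: width × height × frames also catches animated bombs
    const frames = metadata.pages ?? 1
    if (metadata.width * metadata.height * frames > MAX_PIXELS) {
      return { valid: false, error: "Image exceeds maximum pixel area" }
    }

    return { valid: true, metadata }
  } catch (error) {
    // Log the raw decoder error server-side; never echo it to the client,
    // since it can leak file paths or library versions
    console.error("Image validation failed:", error)
    return { valid: false, error: "Invalid or unsupported image" }
  }
}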
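
Magic-byte verification, the first item in the mitigation list, needs no image library at all and can run at the gateway before the decoder is involved. The sketch below checks the three signatures quoted above; `sniffFormat` and the `SIGNATURES` table are illustrative names, and a production check would cover every format the service accepts.

```javascript
// Format signatures from the mitigation list; `null` marks a wildcard byte
// (the four RIFF length bytes in a WebP header).
const SIGNATURES = {
  jpeg: [0xff, 0xd8, 0xff],
  png: [0x89, 0x50, 0x4e, 0x47],
  webp: [0x52, 0x49, 0x46, 0x46, null, null, null, null, 0x57, 0x45, 0x42, 0x50],
}

// Return the detected format name, or null if no signature matches.
// Accepts a Buffer or any Uint8Array.
function sniffFormat(bytes) {
  for (const [format, signature] of Object.entries(SIGNATURES)) {
    if (bytes.length < signature.length) continue
    const matches = signature.every((expected, i) => expected === null || bytes[i] === expected)
    if (matches) return format
  }
  return null
}

const jpegHeader = Uint8Array.from([0xff, 0xd8, 0xff, 0xe0])
console.log(sniffFormat(jpegHeader))
```

Comparing the sniffed format against the declared Content-Type, and rejecting on mismatch, is what turns this from a format detector into the gateway-level check described above.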

Rate Limit Exhaustion

Scenario: Burst traffic exhausts rate limits, legitimate requests rejected.

Mitigation:

Tiered limits: Higher limits for authenticated requests vs. anonymous
Burst allowance: Sliding window with small burst buffer (e.g., 110% of limit for 10 seconds)
Priority queuing: VIP tenants get separate, higher limits
Graceful 429 responses: Include Retry-After header with exact reset time
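
A minimal in-memory sketch of the burst allowance and the 429 response follows. It is a sliding-window log with a flat burst multiplier, which simplifies the time-bounded burst buffer described above; the class name, option names, and return shape are all assumptions, and a multi-instance deployment would keep the counters in Redis rather than in process memory.

```javascript
class SlidingWindowLimiter {
  constructor({ limit, windowMs, burstFactor = 1.1 }) {
    this.capacity = Math.floor(limit * burstFactor) // e.g. 110% of the limit
    this.windowMs = windowMs
    this.hits = new Map() // key -> ascending request timestamps (ms)
  }

  // Record a request for `key` and report whether it is allowed.
  // `now` is injectable so the limiter can be tested deterministically.
  check(key, now = Date.now()) {
    const cutoff = now - this.windowMs
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff)

    if (recent.length >= this.capacity) {
      this.hits.set(key, recent)
      // The oldest request in the window determines when a slot frees up.
      const retryAfterSeconds = Math.ceil((recent[0] + this.windowMs - now) / 1000)
      return { allowed: false, status: 429, headers: { "Retry-After": String(retryAfterSeconds) } }
    }

    recent.push(now)
    this.hits.set(key, recent)
    return { allowed: true }
  }
}

const limiter = new SlidingWindowLimiter({ limit: 10, windowMs: 60_000 })
console.log(limiter.check("tenant-a"))
```

Tiered limits then reduce to choosing the `limit` per caller class, for example one limiter instance for anonymous traffic and another with a higher limit for authenticated tenants.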

Conclusion

This architecture provides a production-ready foundation for building a cloud-agnostic image processing platform. The key insight is that image transformation is an ideal candidate for aggressive caching: transformations are pure functions (same inputs → same outputs), making content-addressed storage highly effective.

Critical tradeoffs made in this design:

Synchronous-first over queue-first: We accept higher p99 latency for small images in exchange for simpler client integration (no polling). For large images, we fall back to async.
Efficiency locks over safety locks: Redlock prevents duplicate work but doesn’t guarantee mutual exclusion. This is acceptable because content-addressed storage ensures idempotency—duplicate transforms are wasteful, not dangerous.
Edge authentication over origin-only: Moving signature validation to the edge adds complexity but dramatically improves private content latency and reduces origin load.
Storage tiering over uniform hot storage: Cold storage introduces retrieval latency but reduces costs by 40-60% for infrequently accessed content.

What this architecture does not cover:

Video transcoding (different latency characteristics, requires different chunking strategies)
Real-time image editing (collaborative features, operational transforms)
GPU-accelerated transformations beyond saliency cropping (background removal, upscaling, generative inpainting — require dedicated GPU pools and a different cost model)
Geographic data residency requirements (beyond standard CDN region configuration)

Appendix

Prerequisites

Familiarity with distributed systems concepts (caching, consistency, partitioning)
Understanding of HTTP caching semantics (Cache-Control, ETags, CDN behavior)
Basic knowledge of image formats and compression (JPEG, WebP, AVIF characteristics)
Experience with at least one cloud provider’s storage and CDN offerings

Terminology

Term	Definition
Asset	An original uploaded image, stored with its content hash
Derived Asset	A transformed version of an asset, identified by the hash of (original + operations)
Content-Addressed	Storage keyed by content hash rather than arbitrary ID; same content → same key
Fencing Token	Monotonically increasing token used to detect stale lock holders
Operations Hash	SHA-256 of (canonical operation string + original content hash + output format)
Signed URL	URL with cryptographic signature proving authorization; includes expiration timestamp
Storage Tier	Access latency class: hot (ms), warm (seconds), cold (milliseconds for instant-retrieval classes, otherwise minutes to hours)
Transform Canonicalization	Normalizing operation parameters to ensure equivalent transforms produce identical cache keys
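
The last two terms are easier to see in code. The sketch below is one possible canonicalization (lowercase, drop empty values, sort keys) followed by the SHA-256 described under Operations Hash. The separator, field order, and both function names are assumptions; the normalization rules actually used by the service may differ.

```javascript
// Normalize an operations object so that equivalent requests serialize
// identically: drop unset values, lowercase keys and values, sort by key.
function canonicalizeOperations(operations) {
  return Object.entries(operations)
    .filter(([, value]) => value !== undefined && value !== null)
    .map(([key, value]) => [key.toLowerCase(), String(value).toLowerCase()])
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, value]) => `${key}=${value}`)
    .join(",")
}

// SHA-256 over (canonical operation string + original content hash + output format).
// The dynamic import keeps the snippet usable from both CommonJS and ES modules.
async function operationsHash(operations, originalContentHash, outputFormat) {
  const { createHash } = await import("node:crypto")
  return createHash("sha256")
    .update(canonicalizeOperations(operations))
    .update("\n") // assumed field separator
    .update(originalContentHash)
    .update("\n")
    .update(outputFormat.toLowerCase())
    .digest("hex")
}

console.log(canonicalizeOperations({ w: 400, H: 300, fit: "Cover" }))
```

Two requests that differ only in parameter order or letter case, such as `{ w: 400, h: 300 }` and `{ H: 300, W: 400 }`, produce the same canonical string and therefore the same derived-asset key, which is what lets the cache layers treat them as one transform.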

Summary

Multi-layer caching (CDN → Redis → Database → Storage) keeps the transform path on < 5% of requests in the reference scenario
Content-addressed storage with deterministic hashing ensures transform idempotency
Sharp 0.34 / libvips 8.17 provides ~26× throughput over jimp at ~50 MB per worker (libvips 8.18 features such as UltraHDR ride along in Sharp 0.35+)
AVIF (~94.9% global support, ~50% smaller than JPEG) is the preferred lossy target but ~5-10× the JPEG encode cost — bake asynchronously or behind a queue; WebP (~97% support, 25-34% smaller than JPEG) is the synchronous default
Redlock is appropriate for efficiency optimization but not safety-critical mutual exclusion
Edge authentication with normalized cache keys maximizes CDN hit rates for private content
Hierarchical policies (Organization → Tenant → Space) enable flexible multi-tenant isolation

References

Specifications and Official Documentation

RFC 2104 - HMAC - HMAC-SHA256 specification for signed URLs
AV1 Image File Format (AVIF) - AOM AVIF specification
WebP Container Specification - Google WebP format spec
HTTP Caching (RFC 9111) - HTTP caching semantics

Core Library Documentation

Sharp Documentation - High-performance Node.js image processing
Sharp Performance Benchmarks - 64.42 ops/sec AMD64 JPEG, ~26× faster than jimp
Sharp Changelog v0.34.5 - 6 November 2025 release notes (libvips 8.17.3)
Sharp releases - 0.35.0-rc.0 (2 January 2026) tracks libvips 8.18
libvips 8.18 release notes - UltraHDR, Camera RAW, Oklab, BigTIFF
Redis Distributed Locks - Official Redlock documentation

Design Rationale and Analysis

How to do Distributed Locking - Martin Kleppmann’s Redlock analysis (fencing tokens, timing assumptions)
Is Redlock Safe? - Salvatore Sanfilippo’s (antirez) response

Browser Support and Format Comparison

Can I Use: AVIF - 94.9% global support (March 2026 StatCounter snapshot)
Can I Use: WebP - 96.96% global support (February 2026 StatCounter snapshot)
JPEG XL decoding (image/jxl) in Blink - decoder merged 13 January 2026, shipped in Chrome 145 (10 February 2026) behind enable-jxl-image-format

Cloud Provider SDKs

AWS SDK for JavaScript v3 - S3 client
Google Cloud Storage Node.js - GCS client
Azure Blob Storage SDK - Azure Storage client

Edge Computing

Cloudflare Workers Documentation - Edge compute platform
CloudFront Functions - AWS edge compute
Fastly Compute@Edge - Fastly edge platform

Frameworks and Tools

Fastify - Low-overhead Node.js web framework
PostgreSQL JSONB - JSON support documentation
Prometheus - Monitoring and alerting toolkit

Benchmark data from ImageRobo’s 2025 image-format comparison and openaviffile’s encoder comparison; both show AVIF encoding latency dominating WebP/JPEG by roughly an order of magnitude at default effort settings.
Deploying AVIF for more responsive websites (web.dev): Google reports a 6.5× reduction in libaom encoder CPU since AVIF launch.