Networking Foundations — Compute Reference
A Tier-1 reference covering the protocol stack that underpins every distributed system, web service, and edge platform: from the physical link up through TCP/IP, QUIC, HTTP/3, DNS, TLS, load balancing, gRPC, BGP, CDNs, service meshes, and modern overlay networks. The framing follows the layered model engineers actually use day-to-day, then drills into the protocols, performance properties, and pitfalls that determine whether a system is reliable at scale.
1. At a Glance — Layered Models
The textbook OSI seven-layer model (Physical, Data Link, Network, Transport, Session, Presentation, Application) is taxonomically tidy but rarely maps cleanly onto real software. Practitioners use the TCP/IP four-layer model (Link, Internet, Transport, Application) or its informal five-layer expansion. What matters is the responsibilities each layer carries and how the layers compose.
Real-world stack as it actually runs:
- L1 — Physical. Ethernet over twisted-pair copper (Cat 5e/6/6a/8), single-mode and multi-mode fiber, coaxial, Wi-Fi 6/6E/7 (IEEE 802.11ax/be), 5G NR, satellite (Starlink, OneWeb). Encodes bits as voltages, light pulses, or radio modulation. Bandwidth, signal-to-noise, and bit-error rate live here.
- L2 — Data Link. Ethernet frames (IEEE 802.3), MAC addresses (48-bit), VLAN tagging (802.1Q), MPLS labels, ARP for IP↔MAC mapping. Switches operate here. Spanning Tree Protocol (STP) prevents L2 loops; modern fabrics use TRILL, SPB, or VXLAN overlays.
- L3 — Network. IP routing — IPv4 and IPv6. Routers operate here, making forwarding decisions per packet via longest-prefix match against the routing table. ICMP rides on top for diagnostics (ping, traceroute, MTU discovery).
- L4 — Transport. TCP for reliable byte-streams, UDP for datagrams, QUIC as the modern UDP-based replacement that absorbs TLS and multiplexing. SCTP exists for telecom but is niche on the open internet.
- L5–L6 — Session / Presentation. Usually folded into the application layer in practice. TLS sits awkwardly here in OSI terms but is functionally a transport-security wrapper.
- L7 — Application. HTTP/1.1, HTTP/2, HTTP/3, DNS, MQTT, AMQP, SMTP, IMAP, SSH, gRPC, WebSocket, SIP/RTP, NTP, DHCP, SNMP.
The mental model: each layer treats the layer below as a pipe and adds its own header. A browser sending an HTTP/3 request emits HTTP/3 frames on QUIC streams, carried in QUIC packets inside UDP datagrams inside IP packets inside Ethernet frames, which finally leave as light pulses down a fiber.
2. IP — The Internet Layer
IPv4
- 32-bit addresses, dotted-quad notation: 192.0.2.42.
- Total address space: 2³² ≈ 4.29 billion. IANA exhausted its free pool in 2011; the regional registries (ARIN, RIPE NCC, APNIC, LACNIC, AFRINIC) have since exhausted or started rationing their own pools. Address scarcity is permanent.
- NAT (Network Address Translation, RFC 3022, 2001) stretches the space by sharing one public address across many private hosts. Private ranges (RFC 1918): 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. Carrier-grade NAT (CGNAT) layers NAT inside ISPs; the 100.64.0.0/10 shared space (RFC 6598) is reserved for it.
- CIDR (Classless Inter-Domain Routing, RFC 4632, 2006) notation: 203.0.113.0/24 denotes a 24-bit prefix → 256 addresses. /16 → 65,536; /30 → 4 (typical point-to-point link). Replaces the older Class A/B/C scheme, which was wasteful.
- Routing tables match destination IPs against prefixes by longest prefix match. A /32 host route beats a /24 subnet route, which beats a /0 default route.
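The CIDR arithmetic and longest-prefix match described above can be sketched with Python's standard `ipaddress` module. The routing table and next-hop names below are hypothetical, and a real router uses a trie or TCAM rather than a linear scan; this is only an illustration of the selection rule.

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop). Names are illustrative only.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "default-gateway"),
    (ipaddress.ip_network("203.0.113.0/24"), "subnet-router"),
    (ipaddress.ip_network("203.0.113.7/32"), "host-route"),
]


def lookup(destination: str) -> str:
    """Return the next hop of the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The longest prefix (largest prefixlen) wins.
    _, hop = max(matches, key=lambda match: match[0].prefixlen)
    return hop


print(ipaddress.ip_network("203.0.113.0/24").num_addresses)  # 256
print(lookup("203.0.113.7"))   # host-route
print(lookup("203.0.113.99"))  # subnet-router
print(lookup("198.51.100.1"))  # default-gateway
```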
IPv6
- 128-bit addresses written as eight colon-separated hex groups: 2001:db8::1. The :: compresses one run of zero groups. Total space: 2¹²⁸ ≈ 3.4 × 10³⁸, vastly larger than IPv4.
- Standardized in RFC 8200 (2017; supersedes RFC 2460 from 1998). Adoption has been slow but steady: as of 2025, ~45% of Google traffic is IPv6 globally, higher in mobile networks.
- Stateless Address Autoconfiguration (SLAAC, RFC 4862) lets hosts derive addresses from router advertisements without DHCP. DHCPv6 still exists for stateful management.
- NAT is not required in IPv6: every host can have a globally routable address, although prefix translation (NPTv6, RFC 6296) exists for special cases. Privacy extensions (RFC 8981, 2021) randomize the interface identifier to prevent tracking.
- Dual-stack deployments run IPv4 and IPv6 in parallel; Happy Eyeballs (RFC 8305, 2017) picks whichever connects first.
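As a small illustration of the address notation, the standard `ipaddress` module can expand and compress IPv6 addresses; the address used here is from the documentation prefix 2001:db8::/32.

```python
import ipaddress

full = "2001:0db8:0000:0000:0000:0000:0000:0001"
addr = ipaddress.ip_address(full)

print(addr.compressed)     # 2001:db8::1  (leading zeros dropped, one zero run replaced by ::)
print(addr.exploded)       # 2001:0db8:0000:0000:0000:0000:0000:0001
print(addr.max_prefixlen)  # 128 (address width in bits)
```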
ICMP
ICMP (RFC 792 for v4, RFC 4443 for v6) carries control messages: echo request/reply (ping), destination unreachable, time exceeded (traceroute uses this by sending packets with increasing TTL), redirect, parameter problem, packet too big (essential for PMTUD — Path MTU Discovery). Overzealous firewalls that block all ICMP break PMTUD and cause silent connection failures with large packets.
3. TCP — Transmission Control Protocol
TCP (RFC 793, 1981; modernized in RFC 9293, 2022) is the connection-oriented, reliable, in-order, byte-stream transport that carries most of the legacy internet. Its design assumptions — that loss implies congestion, that connections are pinned to IP/port tuples, that ordering matters — defined the shape of HTTP/1 and HTTP/2 and motivated QUIC’s existence.
Connection Setup
The three-way handshake:
- Client → Server: SYN with initial sequence number (ISN_c).
- Server → Client: SYN-ACK with ISN_s and ACK = ISN_c + 1.
- Client → Server: ACK = ISN_s + 1.
One round-trip before any application data flows. TCP Fast Open (TFO, RFC 7413, 2014) allows data in the SYN packet using a cookie negotiated on a prior connection; deployment has been limited due to middlebox interference.
Reliability and Ordering
Every byte is sequence-numbered. The receiver ACKs the highest contiguous byte received. Lost or out-of-order segments are buffered until the gap fills. Retransmissions are triggered by timeout (RTO) or by duplicate ACKs (fast retransmit after 3 dup-ACKs). Selective Acknowledgment (SACK, RFC 2018, 1996) lets the receiver report non-contiguous received ranges, dramatically improving recovery from multi-segment loss.
Flow Control
Each side advertises a receive window in every ACK — how many bytes it can buffer. The sender never has more than that many unacknowledged bytes in flight. Window scaling (RFC 7323) extends the 16-bit window field with a shift count, supporting windows up to ~1 GB needed for long fat networks.
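To make the "long fat network" point concrete, the following sketch computes the largest window each scale factor allows and the bandwidth-delay product a path needs to stay full. The function names are illustrative, and the 10 Gbit/s, 100 ms path is an assumed example rather than a figure from any measurement.

```python
MAX_UNSCALED_WINDOW = 0xFFFF  # the 16-bit window field
MAX_SHIFT = 14                # largest shift count permitted by RFC 7323


def max_window(shift: int) -> int:
    """Largest advertisable receive window, in bytes, for a given shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << shift


def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the path fully utilized."""
    return bits_per_second * rtt_seconds / 8


print(max_window(0))   # 65535 bytes without scaling
print(max_window(14))  # 1073725440 bytes, roughly 1 GB
# Assumed path: 10 Gbit/s with 100 ms RTT needs about 125 MB in flight.
print(bandwidth_delay_product(10e9, 0.100))
```

Without scaling, a single connection on that assumed path would be limited to about 64 KB per round trip, far below what the link can carry.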
Congestion Control
TCP guesses network capacity. Several algorithms have dominated over the decades:
- Reno (RFC 5681): AIMD (additive increase, multiplicative decrease). In congestion avoidance, cwnd grows by about 1 MSS per RTT (linear). On loss, cwnd is halved. Slow start ramps cwnd exponentially at connection start until the first loss or until cwnd reaches ssthresh.
- CUBIC (RFC 8312, 2018): default in Linux since kernel 2.6.19 (2006). Window growth follows a cubic function of time-since-last-loss, making it more aggressive on long fat networks than Reno while remaining fair.
- BBR (Bottleneck Bandwidth and Round-trip propagation time, Google 2016): model-based instead of loss-based. Probes the path’s bandwidth and minimum RTT and paces packets accordingly. Vastly better on lossy or buffered paths (mobile, transoceanic). BBRv2/v3 added fairness improvements with CUBIC flows. Published in Cardwell et al., ACM Queue 2016.
- Compound TCP (Microsoft), Vegas (delay-based), DCTCP (data-center, uses ECN aggressively).
Slow-start, congestion-avoidance, fast-retransmit, and fast-recovery form the classic Reno state machine. Modern stacks treat loss + ECN (Explicit Congestion Notification, RFC 3168) as separate signals.
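The Reno behaviour described above can be sketched as a per-round-trip simulation. This is a deliberately simplified model: cwnd is counted in MSS units, every loss is treated as a fast-recovery halving (no timeout path), and the function name and parameters are made up for illustration.

```python
def reno_trace(rounds: int, ssthresh: float, loss_rounds: set) -> list:
    """Return cwnd (in MSS) observed at the start of each round trip."""
    cwnd = 1.0
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            # Multiplicative decrease: halve the window and remember the new threshold.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double per round trip, capped at ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MSS per round trip.
            cwnd += 1.0
    return trace


print(reno_trace(rounds=10, ssthresh=16, loss_rounds={6}))
# [1.0, 2.0, 4.0, 8.0, 16.0, 17.0, 18.0, 9.0, 10.0, 11.0]
```

The trace shows the familiar sawtooth: exponential ramp, linear growth, then a halving at the loss in round 6.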
RTT, RTO, and Retransmission
RTT is sampled continuously from ACKs; Karn’s algorithm excludes samples from retransmitted segments, whose ACKs are ambiguous. RTO is computed from smoothed RTT and RTT variance (RFC 6298): RTO = SRTT + max(G, 4·RTTVAR), where G is the clock granularity. Minimum RTO is typically 200 ms on Linux, which dominates loss recovery on short fast paths.
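A minimal sketch of the RFC 6298 estimator follows, using the RFC's smoothing constants (alpha = 1/8, beta = 1/4, K = 4). The class name, the default granularity, and the default floor are assumptions for illustration; the RFC itself recommends a 1-second minimum, and the 200 ms floor used in the demo mirrors the typical Linux value mentioned above.

```python
class RtoEstimator:
    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4

    def __init__(self, granularity: float = 0.001, min_rto: float = 1.0):
        self.granularity = granularity  # clock granularity G, in seconds
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample, per RFC 6298

    def on_sample(self, rtt: float) -> float:
        """Feed one RTT sample (never from a retransmitted segment) and return the RTO."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        candidate = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = max(self.min_rto, candidate)
        return self.rto


wan = RtoEstimator(min_rto=0.2)
print(wan.on_sample(0.100))  # about 0.3 s: SRTT 0.1 + 4 * RTTVAR 0.05
print(wan.on_sample(0.100))  # about 0.25 s: variance decays when samples are stable

lan = RtoEstimator(min_rto=0.2)
print(lan.on_sample(0.001))  # 0.2 s: the floor dominates on a 1 ms path
```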
Head-of-Line Blocking
TCP’s strict ordering means a single lost segment stalls the entire byte-stream until retransmission arrives — even if later bytes belong to an independent logical stream (e.g., a different HTTP/2 stream multiplexed over the same connection). This is TCP-level HoL blocking, the single most consequential limitation of HTTP/2 over TCP and the primary motivation for QUIC.
Nagle + Delayed ACK Pitfall
Nagle’s algorithm (RFC 896, 1984) coalesces small sends to reduce overhead. Delayed ACK (RFC 1122) holds ACKs briefly, hoping for a payload to piggyback on; the RFC allows up to 500 ms, and common stacks use roughly 40 to 200 ms. The interaction can produce stalls of that length when an application sends a small write and waits for a response: Nagle holds the next small segment until the previous one is ACKed, while the receiver delays that ACK. Setting TCP_NODELAY (disables Nagle) is standard for interactive protocols; setting TCP_QUICKACK (disables delayed ACK on Linux) is sometimes needed.
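In Python these options are set with `setsockopt`. The sketch below assumes a plain IPv4 TCP socket; the helper names are illustrative. `TCP_QUICKACK` is Linux-specific, so the code checks for it before use, and on Linux the flag is not permanent, so latency-sensitive code usually re-applies it after reads.

```python
import socket


def make_interactive_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def request_quick_ack(sock: socket.socket) -> bool:
    """Ask the kernel to ACK immediately, where the platform supports it."""
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
        return True
    return False


sock = make_interactive_socket()
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)  # True
sock.close()
```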
TIME_WAIT
After a connection closes, the side that sent the final ACK enters TIME_WAIT for 2·MSL (Maximum Segment Lifetime, typically 60–120 seconds total). This prevents stale duplicates from poisoning a new connection on the same 4-tuple. High-throughput servers exhaust ephemeral ports under TIME_WAIT pressure; mitigations include SO_REUSEADDR, SO_REUSEPORT, tcp_tw_reuse (Linux), or simply using connection pooling.
4. UDP — User Datagram Protocol
UDP (RFC 768, 1980) is the connectionless, unreliable, datagram-oriented sibling. No handshake, no retransmission, no congestion control, no ordering — just a thin envelope (8-byte header: source port, dest port, length, checksum) around an IP payload.
UDP’s job is to get out of the way. Applications layer their own reliability, ordering, and congestion control on top when needed — or skip those entirely when latency matters more than completeness.
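The 8-byte header is small enough to build by hand. The sketch below packs and parses it with `struct`; the helper names are illustrative, and the checksum is left at zero (which IPv4 permits to mean "not computed") rather than calculated over the pseudo-header as a real stack would.

```python
import struct

# Four 16-bit big-endian fields: source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    length = UDP_HEADER.size + len(payload)  # the length field covers header + payload
    checksum = 0  # simplification: real stacks compute this over a pseudo-header
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_header(datagram: bytes) -> tuple:
    return UDP_HEADER.unpack_from(datagram)


datagram = build_udp_datagram(5353, 53, b"hello")
print(UDP_HEADER.size)             # 8
print(parse_udp_header(datagram))  # (5353, 53, 13, 0)
```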
Common UDP citizens:
- DNS — tiny request/response, sub-millisecond budget, fits in one packet.
- NTP (Network Time Protocol, RFC 5905) — tight latency requirements; statistical filtering handles loss.
- DHCP — bootstrap before the host has an IP at all.
- RTP/RTCP (Real-time Transport, RFC 3550) — voice and video; late packets are useless, retransmission would only add jitter. FEC and PLC compensate for loss.
- QUIC — modern reliable transport built on UDP.
- WireGuard — modern VPN, all-UDP.
- Game protocols — most action games use UDP with custom sequence-number schemes.
- mDNS and SSDP for local-network service discovery.
5. QUIC — The Modern Transport
QUIC (Quick UDP Internet Connections) was prototyped by Google starting in 2013, deployed at scale across Google services, then standardized at the IETF as RFC 9000 (transport), RFC 9001 (TLS integration), and RFC 9002 (loss detection and congestion control) in 2021.
QUIC runs over UDP and replaces the TCP + TLS + HTTP/2-framing stack with a unified design that solves the structural limits of TCP:
- Built-in TLS 1.3. Encryption is mandatory, not bolted on. The transport and crypto handshakes are interleaved — first flight carries TLS ClientHello, server responds with TLS ServerHello + transport parameters. 1-RTT handshakes are standard; 0-RTT (resumption) lets clients send application data in the very first packet using cached keys.
- Streams as first-class objects. A single QUIC connection multiplexes many independent streams. Loss on one stream does not block others — QUIC sequence numbers (packet numbers) are separate from stream offsets, so a missing packet only stalls the streams whose data it carried. This eliminates TCP-level HoL blocking.
- Connection migration. The connection is identified by a Connection ID, not the 4-tuple. A client moving from Wi-Fi to cellular keeps its connection alive; packets just start arriving from a new IP. Hugely valuable on mobile.
- Per-packet encryption + authentication. Every packet payload is authenticated and encrypted, and header protection masks the packet number and several flag bits; only fields such as the version and connection IDs stay visible. This makes middlebox ossification much harder. The lessons of TCP, where extensions died because middleboxes dropped unknown options, informed QUIC’s design from day one.
- Pluggable congestion control. NewReno is the default in the spec; BBR is widely deployed in production.
- Forward error correction was explored and removed before standardization; modern QUIC relies on retransmission.
QUIC is the foundation of HTTP/3 and is increasingly used as a generic transport. MASQUE (RFC 9298) proxies UDP over HTTP, typically HTTP/3. WebTransport over HTTP/3 (IETF protocol, W3C browser API) gives browsers multiplexed reliable streams plus unreliable datagrams, complementing WebSocket.
6. DNS — The Domain Name System
DNS (RFC 1034/1035, 1987; updated by dozens of subsequent RFCs) is the distributed, hierarchical name-to-data lookup system that makes the internet usable.
Architecture
- Root servers: 13 logical IP addresses (a.root-servers.net through m.root-servers.net), operated by 12 organizations, anycast-replicated across hundreds of physical sites. They delegate to TLD servers.
- TLD servers: .com, .org, country-code TLDs (ccTLDs like .uk, .de), new gTLDs. Operated by registries (Verisign for .com/.net, PIR for .org).
- Authoritative servers: hold the actual records for a zone (e.g., example.com). Operated by the domain owner or DNS hosting provider (Cloudflare, Route 53, NS1, DNSimple).
- Recursive resolvers: the client-facing servers that walk the chain on behalf of stub resolvers in the OS. Run by ISPs, enterprises, or public providers (1.1.1.1, 8.8.8.8, 9.9.9.9).
- Stub resolvers: the thin library in the OS (getaddrinfo, systemd-resolved, nss) that talks to recursive resolvers.
A typical resolution: stub → recursive → root → .com TLD → example.com authoritative → answer. Each step caches the result; subsequent queries skip the upper levels.
Record Types
- A — IPv4 address.
- AAAA — IPv6 address.
- CNAME — alias to another name. Must not coexist with other records at the same name.
- MX — mail exchange host + priority.
- TXT — arbitrary text; used for SPF, DKIM, DMARC, domain ownership verification.
- NS — name server delegation.
- SOA — start of authority; zone metadata including serial, refresh, retry, expire, minimum TTL.
- CAA — Certificate Authority Authorization (RFC 8659); restricts which CAs may issue certs for the domain.
- SRV — service location with priority + weight; used by SIP, XMPP, LDAP, Kubernetes headless services.
- PTR — reverse lookup (IP → name).
- DS / DNSKEY / RRSIG / NSEC / NSEC3 — DNSSEC signing chain.
- HTTPS / SVCB (RFC 9460, 2023) — modern records that advertise HTTP/3 support, ALPN, ECH, and IP hints, replacing piles of CNAME tricks.
TTL and Caching
Every record has a TTL (seconds). Resolvers and stubs cache up to TTL. Low TTLs (60s) enable fast failover and migrations at the cost of more queries; high TTLs (24h) reduce load but slow change propagation. Lower TTLs a few days ahead of a planned migration and restore them afterward.
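The caching rule can be sketched as a small TTL-bounded cache of the kind a resolver keeps. The class and method names are hypothetical, and the clock is injectable so the expiry behaviour can be demonstrated without waiting; a real resolver also handles negative caching and per-RRset TTLs, which this sketch omits.

```python
import time


class TtlCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, record type) -> (expiry time, value)

    def put(self, name: str, rtype: str, value: str, ttl: float) -> None:
        self._entries[(name, rtype)] = (self._clock() + ttl, value)

    def get(self, name: str, rtype: str):
        """Return the cached value, or None if it is absent or its TTL has elapsed."""
        key = (name, rtype)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: the next lookup must go upstream
            return None
        return value


now = [0.0]  # fake clock for the demonstration
cache = TtlCache(clock=lambda: now[0])
cache.put("example.com", "A", "192.0.2.1", ttl=60)
print(cache.get("example.com", "A"))  # 192.0.2.1
now[0] = 60.0
print(cache.get("example.com", "A"))  # None
```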
Encrypted DNS
Plain DNS is unauthenticated and unencrypted — any on-path observer sees every domain you visit, and any attacker can forge responses. Encryption arrived late:
- DNS-over-TLS (DoT, RFC 7858, 2016): TCP/853, TLS-wrapped.
- DNS-over-HTTPS (DoH, RFC 8484, 2018): HTTPS POST/GET to /dns-query. Adopted by Firefox, Chrome, iOS, macOS, Windows 11.
- DNS-over-QUIC (DoQ, RFC 9250, 2022): UDP/853, QUIC-wrapped.
- Oblivious DoH (ODoH, RFC 9230, 2022): adds a proxy that hides the client IP from the resolver.
Public encrypted resolvers: Cloudflare 1.1.1.1, Quad9 9.9.9.9, Google 8.8.8.8, NextDNS, AdGuard DNS.
Extensions
- EDNS(0) (RFC 6891) — extension mechanism; carries larger UDP responses, client-subnet hints, cookies. Without EDNS, UDP responses are capped at 512 bytes.
- DNSSEC (RFC 4033-4035): cryptographic signatures over RRsets, anchored at the root signing key. Deployment is partial: the root and most TLDs are signed, but only a small minority of second-level domains are. DNSSEC provides origin authenticity and integrity but not confidentiality; DoT/DoH/DoQ cover the latter.
7. HTTP/1.1
HTTP/1.1 (RFC 2068 in 1997, RFC 2616 in 1999, completely restructured into RFC 7230–7235 in 2014, then merged + refined into RFC 9110–9112 in 2022) is the text-based request-response protocol that defined the web for two decades.
Key properties:
- Text on the wire: GET /path HTTP/1.1\r\nHost: example.com\r\n\r\n. Human-readable but verbose and slow to parse.
- Persistent connections (the v1.1 headline feature): a single TCP connection carries many request-response pairs. Connection: keep-alive is the default; Connection: close ends the connection.
- Pipelining: client may send multiple requests without waiting for responses. The server must reply in order. In practice, pipelining is disabled in browsers because buggy proxies and HoL blocking made it worse than parallel connections.
- Chunked transfer encoding: streams responses of unknown length using Transfer-Encoding: chunked.
- Six-connections-per-origin convention in browsers, used to parallelize despite HoL.
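Chunked transfer encoding is simple enough to sketch directly: each chunk is a hexadecimal length line, the data, and a CRLF, with a zero-length chunk marking the end. The helper names below are illustrative, and the decoder is a minimal one that ignores chunk extensions and does not handle trailer fields.

```python
def encode_chunked(parts) -> bytes:
    """Encode an iterable of byte strings as a chunked message body."""
    out = bytearray()
    for part in parts:
        if part:  # a zero-length chunk would terminate the body early
            out += f"{len(part):X}\r\n".encode("ascii") + part + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk, no trailers
    return bytes(out)


def decode_chunked(body: bytes) -> bytes:
    """Decode a complete chunked body (no trailer support)."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].split(b";")[0], 16)  # drop chunk extensions
        pos = line_end + 2
        if size == 0:
            return bytes(out)
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF


wire = encode_chunked([b"Hello, ", b"world"])
print(wire)                  # b'7\r\nHello, \r\n5\r\nworld\r\n0\r\n\r\n'
print(decode_chunked(wire))  # b'Hello, world'
```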
HTTP/1.1 is still ubiquitous — every reverse proxy, every API gateway, every curl one-liner. It is the lowest common denominator.
8. HTTP/2
HTTP/2 (RFC 7540, 2015; restructured into RFC 9113 in 2022) replaced HTTP/1.1’s text framing with a binary protocol designed for the modern web.
- Binary framing: frames have a fixed 9-byte header (length, type, flags, stream ID) and a payload. Types include DATA, HEADERS, PRIORITY, RST_STREAM, SETTINGS, PUSH_PROMISE, PING, GOAWAY, WINDOW_UPDATE, CONTINUATION.
- Multiplexed streams: many concurrent request/response pairs share one TCP connection, each identified by a stream ID. Eliminates HTTP-level HoL — but TCP-level HoL still applies (a lost segment stalls all streams).
- HPACK header compression (RFC 7541): a static table of common headers, a per-connection dynamic table, and Huffman coding of string literals together reduce header overhead dramatically. Particularly impactful for cookies and repeated headers.
- Server push (PUSH_PROMISE): server can preemptively send resources the client will request next. Deprecated in Chrome in 2022 due to negligible benefit and frequent over-push waste.
- Flow control per stream, in addition to connection-level flow control.
- ALPN-negotiated upgrade: HTTP/2 is negotiated during the TLS handshake via ALPN (h2). Plaintext HTTP/2 (h2c) exists but is rarely deployed on the public web; browsers only speak HTTP/2 over TLS.
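The fixed 9-byte frame header from the binary framing bullet can be decoded as follows. This is a sketch of header parsing only (24-bit length, 8-bit type, 8-bit flags, one reserved bit, 31-bit stream ID); the function name is illustrative and no payload handling is attempted.

```python
FRAME_TYPES = {
    0x0: "DATA", 0x1: "HEADERS", 0x2: "PRIORITY", 0x3: "RST_STREAM",
    0x4: "SETTINGS", 0x5: "PUSH_PROMISE", 0x6: "PING", 0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE", 0x9: "CONTINUATION",
}


def parse_frame_header(header: bytes) -> tuple:
    """Return (payload length, type name, flags, stream ID) for a 9-byte header."""
    if len(header) != 9:
        raise ValueError("an HTTP/2 frame header is exactly 9 bytes")
    length = int.from_bytes(header[0:3], "big")
    frame_type = FRAME_TYPES.get(header[3], "UNKNOWN")
    flags = header[4]
    # The top bit of the last four bytes is reserved and must be ignored.
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF
    return length, frame_type, flags, stream_id


# An empty SETTINGS frame on stream 0 (connection-level).
print(parse_frame_header(bytes.fromhex("000000040000000000")))  # (0, 'SETTINGS', 0, 0)
# A 16-byte HEADERS frame on stream 1 with END_STREAM | END_HEADERS (0x05).
print(parse_frame_header(bytes.fromhex("000010010500000001")))  # (16, 'HEADERS', 5, 1)
```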
HTTP/2’s main limitation is the TCP coupling. A single dropped packet stalls every stream — exactly what multiplexing was supposed to fix. This drove the design of HTTP/3.
9. HTTP/3
HTTP/3 (RFC 9114, 2022) is HTTP semantics — methods, status codes, headers, URLs — re-mapped onto QUIC instead of TCP+TLS.
- No TCP-level HoL: independent QUIC streams; a lost UDP packet only blocks the streams whose data was in that packet.
- Faster connection establishment: 1-RTT for fresh, 0-RTT for resumed. TLS 1.3 is mandatory.
- QPACK header compression (RFC 9204): HPACK’s design depended on a synchronized dynamic table that didn’t tolerate out-of-order delivery. QPACK redesigns the dynamic table sync with a separate encoder stream, allowing headers to decode in any order.
- Connection migration: cross-network handoff (Wi-Fi → 5G) without breaking the session.
- Server push: still defined in RFC 9114, but given HTTP/2’s experience it is rarely implemented or enabled.
Adoption is widespread: Cloudflare, Google, Facebook/Meta, Akamai, Fastly, Microsoft serve HTTP/3 on their edges. Chrome, Firefox, Safari, and Edge support it. Roughly 30% of Cloudflare traffic was HTTP/3 by late 2024; the share is highest on mobile (where lossy networks benefit most).
Performance gains are most pronounced on lossy or high-latency paths — mobile, satellite (Starlink), congested last-mile links. On clean wired connections, HTTP/2 and HTTP/3 are nearly indistinguishable.
10. TLS Termination and Offload
TLS (Transport Layer Security; current version TLS 1.3, RFC 8446, 2018) provides confidentiality, integrity, and authentication for application traffic. Where TLS terminates in your architecture has major consequences.
Termination Patterns
- At the edge / CDN: Cloudflare, Fastly, Akamai, CloudFront terminate TLS for your domain using certs they hold (Cloudflare’s Universal SSL, ACM with CloudFront). The CDN re-encrypts to your origin, or talks plaintext over a private network, or uses Cloudflare Tunnel.
- At the load balancer: AWS ALB/NLB-TLS, GCP HTTPS LB, nginx, HAProxy, Envoy terminate inbound TLS. Backend traffic may be plaintext (inside a trusted VPC) or re-encrypted (defense in depth, regulated workloads).
- At the ingress controller: Kubernetes ingress (nginx-ingress, Traefik, Contour, Istio Gateway) terminates TLS at the cluster edge. Cert management via cert-manager + Let’s Encrypt.
- Mesh / mTLS end-to-end: Istio, Linkerd, Cilium, Consul Connect issue per-workload SPIFFE identities and run mTLS between every pod. Application code stays plaintext-aware; the sidecar handles crypto.
SNI
Server Name Indication (RFC 6066) lets a client tell the server which hostname it wants during the ClientHello, before the cert is selected. Without SNI, you need one IP per cert. With SNI, one IP can serve thousands of cert+vhost combinations. ESNI (Encrypted SNI) was deprecated in favor of ECH (Encrypted ClientHello, draft as of 2025), which encrypts the entire ClientHello including SNI, hiding the hostname from on-path observers.
ALPN
Application-Layer Protocol Negotiation (RFC 7301) is the TLS extension that lets client and server agree on the application protocol during the handshake. The client sends a list of protocol IDs (for example h2 and http/1.1 over TCP, or h3 over QUIC) and the server picks one. Without ALPN, protocol selection would cost an extra round-trip after the handshake.
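On the client side, offering ALPN protocols with Python's standard `ssl` module looks roughly like this. The helper name is illustrative, and the sketch only builds the context; it does not open a connection.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a verifying client context that offers h2, then http/1.1, via ALPN."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])  # listed in order of client preference
    return ctx


ctx = make_client_context()
print(ctx.check_hostname)  # True
```

When a socket is wrapped with `ctx.wrap_socket(sock, server_hostname="example.com")`, the `server_hostname` argument is what populates SNI, and after the handshake `selected_alpn_protocol()` on the wrapped socket reports the server's choice (or `None` if ALPN was not negotiated).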
Cert Lifecycle
- Let’s Encrypt (ISRG, free, ACME protocol RFC 8555): 90-day certs, automated renewal.
- Commercial CAs: DigiCert, GlobalSign, Sectigo, Google Trust Services.
- CT (Certificate Transparency, RFC 6962; RFC 9162 defines v2): append-only logs of every issued cert; browsers require CT proofs.
- OCSP (RFC 6960) / OCSP stapling (RFC 6066): revocation checking; OCSP must-staple (RFC 7633) closes the soft-fail loophole.
- HSTS (RFC 6797): the Strict-Transport-Security header pins HTTPS for the domain. HSTS preload lists are baked into browsers.
11. Load Balancing
Load balancing distributes incoming work across a pool of backends. The choice of layer, algorithm, and stickiness has direct impact on tail latency, failure isolation, and cost.
L4 (Transport-Layer) Balancers
Inspect only IP and port — not payload. Fast, simple, protocol-agnostic.
- Linux LVS (IP Virtual Server, kernel 2.4+) — kernel-level NAT/DR/Tunnel modes.
- AWS NLB — managed L4 with static IPs and ultra-high throughput.
- GCP Network Load Balancer.
- Cloudflare Spectrum — L4 anycast for arbitrary TCP/UDP.
- HAProxy in TCP mode, Envoy in TCP proxy mode.
L4 balancers cannot do path-based routing, header manipulation, mTLS termination, or content-aware retries. They can do connection-level health checks and weighted distribution.
L7 (Application-Layer) Balancers
Parse HTTP (or other L7 protocols). Slower per-request but vastly more capable.
- nginx — battle-tested reverse proxy + LB; declarative config.
- HAProxy — also widely deployed; powerful ACL system.
- Envoy — modern proxy from Lyft (2016), now CNCF; programmable via xDS APIs; foundation of Istio.
- Traefik — Kubernetes-native ingress with auto-discovery.
- AWS ALB, GCP HTTPS LB, Azure Application Gateway.
L7 features: path/header/cookie routing, redirects, rewrites, request mirroring, WAF, mTLS termination, retries with backoff, circuit breaking, rate limiting, observability.
Algorithms
- Round-robin: simple rotation; fair when backends are uniform.
- Least-connections: picks the backend with the fewest active connections; good when request durations vary.
- Weighted: round-robin or least-conn with per-backend weights for heterogeneous capacity.
- IP-hash: hash of client IP picks the backend; coarse stickiness without cookies.
- Consistent hashing (Ketama): places backends and requests on a ring; adding or removing one of N backends remaps only about K/N of the K keys. Used by Memcached clients, Cassandra, Riak, Envoy ring-hash. Maglev (Google, 2016) is a table-based consistent-hash variant that favors even load distribution and fast lookups.
- Power of two choices: pick two random backends, route to the less-loaded one. Surprisingly close to optimal.
- EWMA / latency-based: Envoy’s LEAST_REQUEST with active-request count, or true latency-aware routing.
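The ring-based consistent hashing entry can be sketched in a few lines. This is an illustrative implementation, not Ketama's exact point-placement scheme: the class name, the SHA-256-based hash, and the default of 100 virtual nodes per backend are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, backends, vnodes: int = 100):
        self._vnodes = vnodes
        self._ring = []    # sorted (point, backend) pairs
        self._points = []  # the points alone, for bisect
        for backend in backends:
            self.add(backend)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _rebuild(self) -> None:
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def add(self, backend: str) -> None:
        # Each backend owns many points so that load spreads evenly.
        self._ring.extend(
            (self._hash(f"{backend}#{i}"), backend) for i in range(self._vnodes)
        )
        self._rebuild()

    def remove(self, backend: str) -> None:
        self._ring = [(p, b) for p, b in self._ring if b != backend]
        self._rebuild()

    def lookup(self, key: str) -> str:
        """Return the backend owning the first point at or after the key's hash."""
        if not self._ring:
            raise LookupError("no backends on the ring")
        index = bisect.bisect_left(self._points, self._hash(key)) % len(self._points)
        return self._ring[index][1]


ring = HashRing(["backend-a", "backend-b", "backend-c"])
keys = [f"user-{i}" for i in range(1000)]
before = {key: ring.lookup(key) for key in keys}
ring.remove("backend-c")
moved = sum(1 for key in keys if ring.lookup(key) != before[key])
print(moved == sum(1 for owner in before.values() if owner == "backend-c"))  # True
```

Only the keys that lived on the removed backend move; every other key keeps its owner, which is the property that distinguishes this from a plain `hash(key) % N` scheme.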
Health Checks
- Active: balancer periodically probes each backend (HTTP GET /health, TCP connect, gRPC health check).
- Passive: balancer observes real traffic and marks backends unhealthy after N consecutive failures (outlier detection).
Combine both. Tune intervals carefully: too aggressive and you flap during partial outages; too lax and you keep sending traffic to dead backends.
Stickiness vs Statelessness
- Sticky sessions: tie a client to a backend via cookie or IP-hash. Easy to retrofit onto stateful applications, but creates uneven load and complicates failover.
- Stateless: any backend can serve any request; session state lives in Redis/Memcached/DB. The right answer for cloud-native services.
Global Load Distribution
- Anycast: many physical sites announce the same IP via BGP; routing delivers each client to the topologically nearest site. Used by Cloudflare, Fastly, Google Public DNS, AWS Global Accelerator, CloudFront.
- GeoDNS: resolvers return different IPs based on client location (Route 53 geolocation routing, NS1).
- Latency-based routing: returns the IP of the lowest-latency region (Route 53 latency routing).
- GSLB (Global Server Load Balancing): cross-region active-active with health-aware DNS.
12. gRPC
gRPC (Google, 2015) is an RPC framework built on HTTP/2 + Protocol Buffers, with HTTP/3 support emerging in some implementations. It is the dominant RPC system in modern microservice fleets.
- Transport: HTTP/2 framing carries gRPC frames. gRPC over HTTP/3 is supported in newer implementations.
- Serialization: Protocol Buffers (proto3), with typed schemas in .proto files and code-generated stubs across languages. Compact binary wire format; fast parse + serialize.
- Streaming: unary (request → response), server-streaming, client-streaming, bidirectional-streaming. The streaming modes are what differentiate gRPC most clearly from REST.
- Deadlines + cancellation: every call carries a deadline propagated through the call graph; downstream services know how much time remains. Cancellation cascades back if the client disconnects.
- Status codes: 17 well-defined codes (OK, CANCELLED, UNKNOWN, INVALID_ARGUMENT, DEADLINE_EXCEEDED, NOT_FOUND, ALREADY_EXISTS, PERMISSION_DENIED, RESOURCE_EXHAUSTED, FAILED_PRECONDITION, ABORTED, OUT_OF_RANGE, UNIMPLEMENTED, INTERNAL, UNAVAILABLE, DATA_LOSS, UNAUTHENTICATED). Cleaner than HTTP’s overloaded status codes for RPC semantics.
- Metadata: like HTTP headers, but typed and bi-directional via initial + trailing metadata.
- Auth: TLS + per-call credentials; mTLS for service-to-service.
gRPC-Web
Browsers can’t speak raw gRPC (no access to HTTP/2 frames at the JS API level). gRPC-Web is a closely related wire format that a proxy (Envoy or grpc-web’s own proxy) translates to native gRPC. It loses client-streaming and bidirectional streaming but gains browser reach.
gRPC vs REST
- gRPC wins: tight contract, low latency, streaming, polyglot codegen, deadlines, internal service-to-service.
- REST wins: zero-tool curl debugging, universal HTTP cache + CDN behavior, browser-native, no codegen, JSON ubiquity, OpenAPI tooling.
- Pragmatic split: gRPC for internal east-west, REST/GraphQL/HTTP+JSON for public/north-south, with a gateway (Envoy, grpc-gateway) bridging.
Mesh-Native
Service meshes (Istio, Linkerd, Cilium) understand gRPC natively, surfacing per-method metrics, retries on idempotent calls, automatic mTLS, and traffic-splitting at the method level.
13. WebSocket + Server-Sent Events
WebSocket
WebSocket (RFC 6455, 2011) provides full-duplex, message-oriented communication over a single TCP connection, after an HTTP upgrade handshake:
GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ...
Sec-WebSocket-Version: 13
Response 101 Switching Protocols flips the connection into WebSocket framing. Each frame carries a 4-bit opcode (0x0–0xF): continuation 0x0, text 0x1, binary 0x2, close 0x8, ping 0x9, pong 0xA. Messages may be split into frames, and clients must mask every payload they send, which protects intermediaries from cache-poisoning attacks.
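Two mechanical pieces of the protocol are easy to sketch: the `Sec-WebSocket-Accept` value the server returns in its 101 response (derived from the client's `Sec-WebSocket-Key` and a fixed GUID defined in RFC 6455), and the XOR masking applied to client payloads. The function names are illustrative; the sample key, accept value, and masked bytes are the examples given in RFC 6455.

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed by RFC 6455


def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a client key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def mask_payload(payload: bytes, masking_key: bytes) -> bytes:
    """XOR the payload with the 4-byte masking key; applying it twice unmasks."""
    return bytes(byte ^ masking_key[i % 4] for i, byte in enumerate(payload))


print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
masked = mask_payload(b"Hello", bytes.fromhex("37fa213d"))
print(masked.hex())                                        # 7f9f4d5158
print(mask_payload(masked, bytes.fromhex("37fa213d")))     # b'Hello'
```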
Use cases: chat, multiplayer games, collaborative editors (Figma, Linear, Notion live cursors), live dashboards, trading platforms, IoT telemetry.
Pitfalls:
- Idle-timeout death behind load balancers. AWS ELB defaults to 60 s; nginx to 60 s. Always send WebSocket ping/pong at a shorter interval.
- No native multiplexing. One WebSocket = one stream. Some applications layer their own multiplexing protocol (e.g., subprotocols, JSON-RPC).
- No automatic reconnection. Application must implement backoff and resubscription.
- HTTP/2 + WebSocket: RFC 8441 (2018) defines bootstrapping WebSocket over HTTP/2 via the extended CONNECT method, but support is limited.
- HTTP/3 + WebSocket: WebTransport is the modern alternative on HTTP/3.
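Because reconnection is left to the application, clients usually pair it with exponential backoff and jitter so that a server restart does not trigger a synchronized reconnect storm. The generator below is a sketch of the "full jitter" variant; the function name and the default base, cap, and attempt count are assumptions chosen for illustration.

```python
import random


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 8, rng=None):
    """Yield one randomized delay (seconds) per reconnection attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)  # exponential growth, capped
        yield rng.uniform(0, ceiling)            # full jitter: anywhere in [0, ceiling]


for attempt, delay in enumerate(backoff_delays(rng=random.Random(0))):
    print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

A real client would sleep for each delay, attempt the connection, resubscribe on success, and reset the attempt counter once the connection has been stable for a while.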
Server-Sent Events (SSE)
SSE (HTML5, served as text/event-stream) is a one-way server-to-client streaming primitive over HTTP. Each event is data: ...\n\n. Built-in auto-reconnection with Last-Event-ID. Simpler than WebSocket when you don’t need client-to-server messages — and works through any HTTP-aware proxy.
14. BGP — Border Gateway Protocol
BGP (RFC 4271, 2006; many extensions since) is the internet’s exterior routing protocol — how Autonomous Systems (ASes) tell each other which prefixes they originate and how to reach them.
- AS numbers: 16-bit (now extended to 32-bit, RFC 6793). Each ISP, big enterprise, or cloud provider is an AS. AWS is AS16509; Google is AS15169; Cloudflare is AS13335.
- Peering: two ASes establish a TCP session and exchange UPDATE messages announcing or withdrawing prefixes. Internet exchanges (IXPs) like DE-CIX, LINX, AMS-IX, Equinix host thousands of cross-connects.
- Path selection: BGP compares routes by local preference first, then AS-path length, origin, MED, and many further tiebreakers. Local policy dominates: operators prefer customer routes over peer routes over provider routes (the “valley-free” model).
- eBGP (between ASes) vs iBGP (within an AS, full mesh or route reflectors).
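The first few steps of best-path selection can be sketched as a sort key. This is a heavily simplified model: it covers only local preference, AS-path length, origin, and MED, compares MED across all routes (real BGP normally compares it only between routes from the same neighboring AS), and omits the later tiebreakers. The `Route` type and the AS numbers (taken from the documentation range) are hypothetical.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"IGP": 0, "EGP": 1, "INCOMPLETE": 2}  # lower is preferred


@dataclass(frozen=True)
class Route:
    next_hop: str
    local_pref: int
    as_path: tuple
    origin: str
    med: int


def best_path(routes):
    """Pick the preferred route: highest local-pref, then shortest AS path, origin, lowest MED."""
    return min(
        routes,
        key=lambda r: (-r.local_pref, len(r.as_path), ORIGIN_RANK[r.origin], r.med),
    )


via_peer = Route("peer", 100, (64500, 64501), "IGP", 0)
via_customer = Route("customer", 200, (64502, 64503, 64504), "IGP", 0)
via_short = Route("short", 100, (64500,), "IGP", 0)

print(best_path([via_peer, via_customer]).next_hop)  # customer: local-pref beats path length
print(best_path([via_peer, via_short]).next_hop)     # short: equal local-pref, shorter AS path
```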
Route Leaks and Hijacks
BGP has no built-in cryptographic authentication. Misconfigurations and malicious announcements have repeatedly broken portions of the internet:
- Pakistan / YouTube (2008): Pakistan Telecom announced a more-specific route for YouTube’s prefix as part of a national block, and the announcement leaked to PCCW, then globally. YouTube went offline worldwide for ~2 hours.
- AS7007 (1997): a small ISP accidentally re-announced the entire internet routing table as its own; large swaths of the internet became unreachable.
- Rostelecom (2017): the Russian ISP briefly announced prefixes belonging to Visa, Mastercard, and several banks.
- Facebook outage (2021): not strictly a hijack — Facebook withdrew its own BGP routes during a config push and locked itself out for ~6 hours.
RPKI
Resource Public Key Infrastructure (RFC 6480, 2012) cryptographically attests which AS is authorized to originate each prefix through signed Route Origin Authorizations (ROAs). Routers that perform route origin validation reject announcements that conflict with a ROA. RPKI adoption has accelerated post-2020: as of 2024, ~50% of prefixes have valid ROAs, and ~60% of relationships filter invalids. BGPsec (RFC 8205, 2017) signs the full path, but has near-zero deployment.
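Route origin validation classifies each announcement as valid, invalid, or not-found by comparing it against the ROAs that cover it. The sketch below follows that three-state logic; the `Roa` type, function name, prefixes, and AS numbers (from documentation ranges) are hypothetical, and no cryptographic verification of the ROAs themselves is modeled.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    prefix: str      # authorized prefix, e.g. "203.0.113.0/24"
    max_length: int  # longest prefix length the origin may announce
    asn: int         # AS authorized to originate the prefix


def validate(announced_prefix: str, origin_asn: int, roas) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    net = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True
            if net.prefixlen <= roa.max_length and origin_asn == roa.asn:
                return "valid"
    # Covered by at least one ROA but matching none of them means invalid.
    return "invalid" if covered else "not-found"


roas = [Roa("203.0.113.0/24", 24, 64500)]
print(validate("203.0.113.0/24", 64500, roas))   # valid
print(validate("203.0.113.0/25", 64500, roas))   # invalid: more specific than max_length
print(validate("203.0.113.0/24", 64501, roas))   # invalid: wrong origin AS
print(validate("198.51.100.0/24", 64500, roas))  # not-found: no covering ROA
```

The second case is the shape of a more-specific hijack such as the 2008 YouTube incident: with a ROA whose maximum length equals the legitimate prefix length, a validating router would classify the more-specific announcement as invalid.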
MANRS (Mutually Agreed Norms for Routing Security) is an industry initiative promoting filtering, anti-spoofing, coordination, and RPKI validation.
15. CDN and Edge
Content Delivery Networks cache content close to users and offload origins. Modern CDNs are increasingly compute platforms — code runs at the edge, not just static assets.
- Cloudflare — anycast network in 300+ cities; Workers (V8 isolates), R2 (S3-compatible storage), KV, D1 (SQLite), Durable Objects, Pages.
- Fastly — Varnish-based; compute@edge (Wasm via Lucet/Wasmtime).
- Akamai — the oldest and largest by physical footprint; EdgeWorkers.
- AWS CloudFront — global edges; Lambda@Edge (full Node.js/Python at viewer/origin response/request); CloudFront Functions (lightweight JS).
- Google Cloud CDN — backed by Google’s edge.
- Bunny CDN, KeyCDN, jsDelivr — smaller players, often used for static assets and open-source libs.
Core mechanics:
- Cache at the edge: HTTP cache semantics (Cache-Control, Vary, ETag). Cache keys typically include URL + a curated set of headers (e.g., Accept-Encoding).
- Origin shielding: a designated regional cache fronts the origin so cache misses don’t hammer it.
- Image optimization: on-the-fly resize, format negotiation (WebP, AVIF), DPR-aware variants.
- Edge compute: small functions run at the edge for auth, A/B testing, request shaping, personalization without round-trips to origin.
- DDoS protection: anycast soaks volumetric attacks; rate limiting and WAF rules at the edge.
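The cache-key idea from the list above can be illustrated with a small sketch. It assumes a key built from the method, the URL, and an allowlist of headers; the allowlist and function name are hypothetical, and real CDNs each define their own key composition.

```python
import hashlib

# Assumed allowlist of request headers that may vary the cached entry.
KEY_HEADERS = ("accept-encoding",)

def cache_key(method: str, url: str, headers: dict[str, str]) -> str:
    """Build a deterministic cache key from method, URL, and selected headers."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v.strip().lower() for k, v in headers.items()}
    parts = [method.upper(), url]
    for name in KEY_HEADERS:
        parts.append(f"{name}={lowered.get(name, '')}")
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()
```

Headers outside the allowlist (User-Agent, cookies, and so on) do not change the key, which keeps the hit rate high; adding a header to the allowlist fragments the cache by that header's values.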
16. Service Mesh
A service mesh is a dedicated infrastructure layer for inter-service communication. The two architectures:
- Sidecar proxy (Istio with Envoy, Linkerd with linkerd2-proxy, Consul Connect): every pod gets a per-pod proxy. Application traffic is transparently routed through it via iptables redirection.
- Sidecarless / eBPF (Cilium service mesh, Istio Ambient): a node-level proxy (or eBPF programs in the kernel) handles mesh duties without per-pod overhead.
Capabilities:
- mTLS automation: SPIFFE identities + auto-rotated certs.
- Traffic management: weighted routing, request mirroring, fault injection, retries, timeouts, circuit breakers.
- Observability: per-call metrics (Prometheus), distributed tracing (OpenTelemetry), access logs.
- Authorization policies: cross-service AuthZ at L7 (Istio AuthorizationPolicy).
- Multi-cluster: federated meshes spanning clusters and clouds.
Trade-offs: sidecars add latency (~1–2 ms per hop) and resource overhead (CPU + memory per pod). eBPF approaches reduce this but require kernel support and constrain the programming model.
17. Modern Transport and Mesh Networking
WireGuard
WireGuard (Jason Donenfeld, 2017; in mainline Linux kernel since 5.6, 2020) reset the VPN landscape with an aggressively simple design:
- ~4,000 lines of kernel C code (OpenVPN is hundreds of thousands).
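- One cipher suite: Curve25519 for key exchange, ChaCha20-Poly1305 for AEAD, BLAKE2s for hashing, HKDF for key derivation, SipHash24 for hashtable keys. No negotiation, no downgrade.
- UDP transport with a “Cryptokey Routing” model: each peer is identified by its static public key, which is bound to a list of allowed IPs. Outbound packets are encrypted to the peer whose allowed IPs cover the destination; inbound packets are accepted only if their source address falls within the sending peer’s allowed IPs.
- Appears stateless to the administrator (no connections to bring up or tear down) and stays silent toward unauthenticated packets; minimal attack surface.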
Performance is excellent: in the project’s published benchmarks, throughput is ahead of IPsec and several times that of OpenVPN on the same hardware.
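The key-to-address binding can be sketched as a lookup table. This is an illustration only: the peer keys are placeholder strings, the table contents are hypothetical, and the real implementation lives in the kernel rather than in anything resembling this code.

```python
from ipaddress import ip_address, ip_network

# Hypothetical peer table: static public key (placeholder) -> allowed IPs.
PEERS = {
    "peerA-pubkey": [ip_network("10.0.0.2/32")],
    "peerB-pubkey": [ip_network("10.0.1.0/24"), ip_network("192.168.50.0/24")],
}

def peer_for_destination(dst: str, peers=PEERS):
    """Outbound: choose the peer whose allowed-IP prefix most specifically covers dst."""
    addr = ip_address(dst)
    best = None  # (public key, prefix length) of the best match so far
    for key, nets in peers.items():
        for net in nets:
            if addr.version == net.version and addr in net:
                if best is None or net.prefixlen > best[1]:
                    best = (key, net.prefixlen)
    return best[0] if best else None

def accept_inbound(src: str, peer_key: str, peers=PEERS) -> bool:
    """Inbound: after decryption, accept only if src is within that peer's allowed IPs."""
    addr = ip_address(src)
    return any(addr in net for net in peers.get(peer_key, []))
```

The same table acts as a routing table on send and an access-control list on receive, so a peer cannot source packets from an address it was not assigned.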
Tailscale
Tailscale (founded 2019) is WireGuard + a control plane:
- Coordination server (proprietary) distributes public keys, ACLs, MagicDNS records, and short-lived auth tokens. No traffic passes through it.
- DERP relays (Designated Encrypted Relay for Packets) — open-source Go-based relays that bounce traffic when direct UDP between peers is blocked. Tailscale operates a global DERP fleet.
- MagicDNS: short hostnames within the tailnet resolve to peer IPs.
- ACLs: declarative policy in JSON/HuJSON controlling which devices can talk to which.
- Subnet routers + exit nodes: bridge into traditional networks or egress through a chosen node.
Headscale is an open-source re-implementation of the coordination server, used by self-hosters.
Nebula
Nebula (Slack, open-sourced 2019) — Noise-protocol-based mesh VPN with a built-in CA, host certs, and lighthouse nodes that act as rendezvous points. Used internally by Slack for secure host-to-host comms.
ZeroTier
ZeroTier (2014) — layer-2 virtual ethernet over the internet. Peers discover each other via root servers and establish direct UDP paths. Predates Tailscale’s similar approach.
18. Performance Engineering
Latency Budget
The speed of light in fiber is ~200,000 km/s (about 2/3 c due to refractive index). Practical implications:
- ~5 ms per 1,000 km in fiber, one-way.
- New York ↔ Los Angeles (~4,500 km): ~22 ms one-way, ~45 ms RTT minimum. Real-world: 60–75 ms RTT after switching/routing/queuing.
- New York ↔ London (~5,500 km transatlantic cable): ~28 ms one-way, ~56 ms RTT minimum. Real-world: 65–80 ms.
- Anywhere ↔ geosynchronous satellite: ~250 ms one-way per hop; ~500 ms RTT — why geostationary satellite internet feels laggy.
- LEO satellites (Starlink, ~550 km): ~3 ms one-way per hop; ~20–40 ms RTT typical including ground links.
You cannot beat the speed of light. Architectures must respect it: latency-critical work goes to the edge, not the origin.
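The arithmetic above is simple enough to sketch directly. The snippet assumes the ~200,000 km/s fiber figure and the path lengths quoted in this section; the function names are illustrative.

```python
FIBER_KM_PER_MS = 200.0  # ~200,000 km/s in fiber, i.e. 200 km per millisecond

def one_way_ms(path_km: float) -> float:
    """Propagation delay over fiber, ignoring switching, routing, and queuing."""
    return path_km / FIBER_KM_PER_MS

def min_rtt_ms(path_km: float) -> float:
    """Theoretical round-trip floor for a fiber path of the given length."""
    return 2 * one_way_ms(path_km)

def round_trips_within(budget_ms: float, path_km: float) -> int:
    """How many sequential round trips fit in a latency budget at the floor."""
    return int(budget_ms // min_rtt_ms(path_km))

if __name__ == "__main__":
    for name, km in [("New York-Los Angeles", 4500), ("New York-London", 5500)]:
        print(f"{name}: {one_way_ms(km):.1f} ms one-way, {min_rtt_ms(km):.1f} ms RTT floor")
```

Under these assumptions a 200 ms budget leaves room for at most four sequential New York to Los Angeles round trips before any processing time is counted, which is why chatty protocols suffer on long paths.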
Bandwidth-Delay Product
The amount of data “in flight” on a pipe is bandwidth × RTT. A 1 Gbps link at 100 ms RTT holds 12.5 MB of unacknowledged data at saturation, so the TCP window must be at least this large to fully utilize the path. Without window scaling (RFC 7323) the window tops out at 64 KiB; even with scaling and auto-tuning, default kernel buffer ceilings (~4 MB on modern Linux) fall short of the bandwidth-delay product on trans-Atlantic gigabit paths unless they are raised.
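A short sketch of the calculation, in both directions: the window needed to fill a path, and the throughput ceiling a given window imposes. Function names are illustrative; units are bits per second, seconds, and bytes.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    return bandwidth_bps * rtt_s / 8

def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling when at most one window can be sent per round trip."""
    return window_bytes * 8 / rtt_s

if __name__ == "__main__":
    print(f"1 Gbps x 100 ms: {bdp_bytes(1e9, 0.100) / 1e6:.1f} MB in flight")
    print(f"64 KiB window at 100 ms: {max_throughput_bps(65535, 0.100) / 1e6:.1f} Mbps")
```

With an unscaled 64 KiB window at 100 ms RTT, the ceiling works out to roughly 5 Mbps regardless of link speed, which is why window scaling matters on long paths.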
Bufferbloat
Excessive buffering in routers, modems, and home gear inflates queueing delay under load. A bulk download can push ping times from 20 ms to several seconds on a residential link. Mitigations:
- Active Queue Management (AQM): CoDel (Controlled Delay) drops packets to keep queues short.
- fq_codel (in Linux since 3.5, 2012; the default qdisc on most systemd-based distributions since ~2014): per-flow fair queueing on top of CoDel.
- CAKE: more sophisticated; common on OpenWrt routers.
- ECN: marks instead of drops, so endpoints react earlier.
eBPF
eBPF (extended Berkeley Packet Filter, kernel 3.18+ with explosive growth post-4.x) lets sandboxed programs run in the kernel at attach points (XDP, tc, kprobes, tracepoints). Networking projects:
- Cilium: CNI + service mesh + load balancing entirely in eBPF, bypassing kube-proxy.
- Katran (Facebook): L4 load balancer using XDP; serves Facebook’s edge.
- Calico eBPF dataplane: CNI alternative to iptables.
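- bpfilter, nftables-with-bpf: experimental efforts to back packet filtering with eBPF; not yet a mainstream replacement for iptables/nftables.
eBPF programs are verifier-checked for safety before loading and run in kernel context, so packets are handled without a round trip to user space; XDP programs in particular can process packets at or near line rate.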
High-Performance NICs
- DPDK (Data Plane Development Kit): user-space packet processing; bypasses the kernel network stack. Used in NFV, telecom, high-frequency trading.
- io_uring (kernel 5.1+, 2019): asynchronous I/O syscall family; used for high-throughput networking servers and storage. Pairs with SQPOLL and zero-copy sendmsg.
- AF_XDP: zero-copy socket type that hands packets directly to user-space programs.
- SR-IOV / VFIO: hardware-level NIC virtualization for VMs.
- RDMA / RoCE / iWARP: kernel-bypass remote-memory access, common in HPC and storage fabrics.
19. Pitfalls
A grab bag of failure modes that recur across teams.
DNS Caching Staleness
A 24-hour TTL means a migration can take a day to propagate. Lower TTLs to 60 s in the days before a planned cutover; raise them again afterward. Some resolvers ignore very short TTLs and impose minimums of their own (60 s is a common floor).
MSS Clamping and Broken PMTUD
Path MTU Discovery uses ICMP “Fragmentation Needed” messages. Firewalls that drop ICMP break PMTUD, causing connections that work for small requests but hang on large ones. MSS clamping at the router (typically to 1452 on PPPoE, which is the 1492-byte MTU minus 40 bytes of IPv4 and TCP headers, or to a conservative 1400) is the standard workaround on PPPoE, GRE, IPsec, or any tunneled link.
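The clamp value follows from subtracting encapsulation and header overhead from the link MTU. A minimal sketch, assuming option-free IP and TCP headers; the function name and the overhead figures passed in are illustrative.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, without options

def clamped_mss(link_mtu: int, tunnel_overhead: int = 0, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet after encapsulation overhead."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return link_mtu - tunnel_overhead - ip_header - TCP_HEADER

if __name__ == "__main__":
    print("Plain Ethernet:", clamped_mss(1500))
    print("PPPoE (8 bytes of overhead):", clamped_mss(1500, tunnel_overhead=8))
    print("GRE over IPv4 (20 + 4 bytes):", clamped_mss(1500, tunnel_overhead=24))
```

Encrypted tunnels such as IPsec add variable overhead depending on cipher and mode, which is one reason operators often pick a round conservative value like 1400 instead of computing an exact figure.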
HTTPS Without HSTS
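Without Strict-Transport-Security, any visit that starts over plaintext HTTP relies on a redirect to HTTPS, and an active MITM can strip that redirect and keep the user on HTTP. With HSTS (RFC 6797) the browser upgrades every later request itself, leaving only the first visit exposed; HSTS preloading (hstspreload.org) closes that gap by hardcoding the HTTPS requirement into browsers.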
CORS Surprises
Browsers enforce same-origin by default. Cross-origin XHR/fetch needs Access-Control-Allow-Origin and, for non-simple requests, a preflight OPTIONS. Forgetting CORS headers on the API breaks the frontend the moment it’s served from a different origin. Wildcard * is incompatible with credentials.
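One common server-side pattern is to echo the request origin back only when it appears on an allowlist. The sketch below assumes that pattern; the origins and function name are hypothetical, and preflight handling (Access-Control-Allow-Methods, Access-Control-Allow-Headers) is left out for brevity.

```python
# Hypothetical allowlist of front-end origins permitted to call the API.
ALLOWED_ORIGINS = {"https://app.example.com", "https://admin.example.com"}

def cors_headers(origin, allow_credentials: bool = False, allowed=ALLOWED_ORIGINS) -> dict:
    """Return the CORS response headers for a request's Origin, or {} to deny."""
    if origin is None or origin not in allowed:
        return {}  # no CORS headers: the browser blocks the cross-origin read
    headers = {
        # Echo the exact origin instead of "*", which cannot be combined with credentials.
        "Access-Control-Allow-Origin": origin,
        # The response differs per Origin, so caches must key on it.
        "Vary": "Origin",
    }
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers
```

Echoing a vetted origin sidesteps the wildcard restriction while still limiting which sites can read responses.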
Idle WebSockets Behind LBs
ALB, CloudFront, nginx, and most ingress controllers idle-timeout connections (typically 60 s). WebSockets need application-level ping/pong (RFC 6455 frames or app heartbeats) at a shorter interval to keep the connection alive. Otherwise the LB silently kills it.
Retry Storms
When a backend slows down, naive clients retry, doubling load, pushing the backend further into overload — a feedback loop that crashes the service. Mitigations:
- Capped exponential backoff with jitter.
- Retry budgets: cap retries at a percentage of base traffic (e.g., 10%).
- Circuit breakers: fail fast once error rate crosses a threshold.
- Hedged requests: send the second attempt early (after p95) and use whichever wins; Google’s “Tail at Scale” paper (Dean + Barroso, CACM 2013) popularized this.
- Idempotency keys + deduplication at the receiver.
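Two of the mitigations above, capped backoff with jitter and a retry budget, can be sketched as follows. The jitter shown is the "full jitter" variant, one of several in use; the class and parameter names are illustrative, and the budget uses lifetime counters where a production version would typically use a sliding window.

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

class RetryBudget:
    """Permit retries only while they stay within `ratio` of first-attempt requests."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a first-attempt request toward the base traffic."""
        self.requests += 1

    def can_retry(self) -> bool:
        """Spend one unit of budget if available; otherwise tell the caller to give up."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

A caller would record each new request, and on failure sleep for backoff_delay(attempt) only if can_retry() returns True. When the backend degrades, the budget caps the added load at roughly 10% instead of letting every client double its traffic.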
Connection-Pool Exhaustion
A pool sized for steady-state can’t absorb a latency spike — every connection becomes blocked, callers queue up, requests time out. Fix by sizing for peak, adding overflow + reject, and shedding load earlier.
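Why a latency spike exhausts a pool follows from Little's law: connections in use are roughly request rate times latency. The sketch below shows that sizing arithmetic plus a fail-fast pool that rejects instead of queueing; the names and numbers are illustrative assumptions.

```python
import math
import threading

def required_connections(requests_per_s: float, latency_s: float, headroom: float = 1.0) -> int:
    """Little's law: average connections in use = arrival rate x time each is held."""
    return math.ceil(round(requests_per_s * latency_s * headroom, 9))

class FailFastPool:
    """Fixed-size slot pool that rejects immediately when exhausted."""

    def __init__(self, size: int):
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self) -> bool:
        """Take a slot without blocking; False means shed the request now."""
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        """Return a slot after the request completes."""
        self._slots.release()
```

At 500 requests per second, a backend answering in 20 ms needs about 10 connections, but the same traffic at 200 ms needs about 100. A pool sized for the first case is exhausted tenfold by the second, and rejecting early keeps callers from piling up behind it.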
Slowloris and Slow-Body Attacks
An attacker keeps a connection open by trickling bytes. Each connection is cheap for the attacker, expensive for the server. Mitigation: request/header/body timeouts at the proxy layer (nginx client_body_timeout, Envoy request_timeout).
TLS Cert Expiry
A staggering number of major outages have come from expired certs — Microsoft Teams, LinkedIn, Spotify, Equifax, Pokémon Go. Automate renewal (cert-manager, certbot, ACM). Monitor expiry. Stagger cert lifetimes so they don’t all expire on the same day.
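Expiry monitoring reduces to comparing a certificate's notAfter field against the clock. A minimal sketch, assuming the timestamp string is in the format Python's ssl module reports from getpeercert(); the 30-day threshold and function names are assumptions for illustration.

```python
import ssl
import time
from typing import Optional

def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a cert expires; `not_after` looks like 'Jan 31 00:00:00 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400

def needs_renewal(not_after: str, threshold_days: float = 30, now: Optional[float] = None) -> bool:
    """True when the certificate is inside the renewal window (or already expired)."""
    return days_until_expiry(not_after, now) < threshold_days
```

Feeding this from a periodic job that fetches each endpoint's certificate, and alerting when needs_renewal returns True, gives a safety net behind automated renewal.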
Kubernetes Default-Allow Network
Pods talk to anything by default. NetworkPolicies must be explicitly added, and they only take effect when the cluster’s CNI plugin enforces them. Cilium and Calico provide L4/L7 policies.
20. Cross-References
- _index — Compute domain root.
- distributed-systems-fundamentals — consensus, replication, partitioning; relies on the network model described here.
- cryptography-fundamentals — TLS 1.3, AEAD, key exchange, certificate ecosystem.
- kubernetes-deep — Services, Ingress, Gateway API, CNI, NetworkPolicies; the orchestrator’s view of networking.
- consensus-protocols — how distributed agreement layers on the transport substrate.
21. Citations
Textbooks
- Andrew S. Tanenbaum + Nick Feamster + David J. Wetherall, Computer Networks, 6th edition, Pearson, 2021.
- James F. Kurose + Keith W. Ross, Computer Networking: A Top-Down Approach, 8th edition, Pearson, 2020.
- W. Richard Stevens, TCP/IP Illustrated, Volumes 1–3, Addison-Wesley, 1994–1996 (still the canonical TCP reference).
- Daniel P. Bovet + Marco Cesati, Understanding the Linux Kernel, O’Reilly, 3rd ed 2005 (legacy but foundational for stack internals).
- Christian Benvenuti, Understanding Linux Network Internals, O’Reilly, 2005.
Core RFCs
- HTTP: RFC 9110 (Semantics, 2022), RFC 9111 (Caching, 2022), RFC 9112 (HTTP/1.1, 2022), RFC 9113 (HTTP/2, 2022), RFC 9114 (HTTP/3, 2022).
- QUIC: RFC 9000 (Transport, 2021), RFC 9001 (Using TLS to Secure QUIC, 2021), RFC 9002 (Loss Detection + Congestion Control, 2021), RFC 9221 (Unreliable Datagram Extension, 2022).
- TLS: RFC 8446 (TLS 1.3, 2018), RFC 7301 (ALPN, 2014), RFC 6066 (SNI + extensions, 2011), RFC 6797 (HSTS, 2012), RFC 9162 (Certificate Transparency v2, 2021), RFC 8555 (ACME, 2019).
- TCP: RFC 9293 (TCP roll-up, 2022), RFC 5681 (Congestion Control, 2009), RFC 8312 (CUBIC, 2018), RFC 7323 (Window Scaling + Timestamps, 2014), RFC 6298 (RTO, 2011), RFC 7413 (TCP Fast Open, 2014), RFC 2018 (SACK, 1996), RFC 3168 (ECN, 2001).
- IP: RFC 791 (IPv4, 1981), RFC 8200 (IPv6, 2017), RFC 4632 (CIDR, 2006), RFC 1918 (Private Address Space, 1996), RFC 6598 (CGNAT Shared Address Space, 2012), RFC 4862 (SLAAC, 2007), RFC 8981 (Privacy Extensions, 2021), RFC 8305 (Happy Eyeballs v2, 2017).
- ICMP: RFC 792 (ICMPv4, 1981), RFC 4443 (ICMPv6, 2006).
- DNS: RFC 1034/1035 (Concepts + Implementation, 1987), RFC 6891 (EDNS(0), 2013), RFC 4033–4035 (DNSSEC, 2005), RFC 7858 (DoT, 2016), RFC 8484 (DoH, 2018), RFC 9250 (DoQ, 2022), RFC 9230 (Oblivious DoH, 2022), RFC 9460 (SVCB/HTTPS records, 2023), RFC 8659 (CAA, 2019).
- WebSocket: RFC 6455 (2011), RFC 8441 (Bootstrapping WebSockets with HTTP/2, 2018).
- BGP: RFC 4271 (BGP-4, 2006), RFC 6793 (4-octet AS, 2012), RFC 6480 (RPKI, 2012), RFC 8205 (BGPsec, 2017).
- Other: RFC 768 (UDP, 1980), RFC 5905 (NTPv4, 2010), RFC 3550 (RTP, 2003), RFC 896 (Nagle, 1984), RFC 1122 (Host Requirements, 1989).
Papers
- Cardwell, Cheng, Gunn, Yeganeh, Jacobson, “BBR: Congestion-Based Congestion Control”, ACM Queue 2016 / CACM 2017.
- Cardwell et al., “BBR v2: A Model-Based Congestion Control”, IETF drafts 2019+.
- Dean + Barroso, “The Tail at Scale”, Communications of the ACM, February 2013.
- Eisenbud, Yi, Contavalli, Smith, Kononov, et al., “Maglev: A Fast and Reliable Software Network Load Balancer”, NSDI 2016 (Google).
- Mittal et al., “TIMELY: RTT-based Congestion Control for the Datacenter”, SIGCOMM 2015.
- Iyengar + Swett (eds.), QUIC working-group drafts and the resulting RFC 9000 family.
Industry References
- Cloudflare blog — HTTP/3, QUIC adoption posts, BGP hijack postmortems.
- Google CDN + QUIC papers + Chromium QUIC implementation notes.
- Facebook engineering blog — Katran, mvfst (QUIC).
- Netflix Open Connect — TLS + transport engineering.
- IETF QUIC + HTTP working group archives.
- NLnet Labs + ISC + PowerDNS documentation for authoritative DNS internals.
End of reference. This document is a living snapshot; verify protocol versions, RFC supersessions, and CVE landscapes against current sources before depending on any specific behavior.