Containers & Service Mesh — Compute Reference
A consolidated reference for container architecture (Docker, containerd, runc, OCI image and runtime specs, image layers and supply-chain signing), service-mesh data and control planes (Istio, Linkerd, Cilium, Consul), and zero-trust networking for cloud-native systems.
1. At a glance
A container is not a virtual machine. It is a process (or process tree) on the host kernel that has been placed inside a set of Linux kernel isolation primitives — namespaces for naming, cgroups for resource accounting, seccomp and capabilities for syscall restriction, and a layered union filesystem for its root directory. The host kernel is shared with every container on the box; there is one kernel, many isolated userlands.
A container image is a content-addressable bundle of tar.gz filesystem layers plus a JSON manifest and a JSON config blob. Images are pulled from a registry, unpacked into a layered rootfs, and handed to a low-level runtime that translates an OCI config.json plus the rootfs into a running process. The OCI (Open Container Initiative, 2015) image-spec, runtime-spec, and distribution-spec are the three standards that make any combination of registry, image-builder, and runtime interoperable.
A service mesh is a data plane of L7 proxies (typically Envoy, occasionally a Rust micro-proxy or an eBPF datapath) plus a control plane that programs the proxies with policy. The proxies sit in front of every workload — either as a sidecar in the same pod, as a per-node shared proxy, or as in-kernel eBPF — and intercept all inbound and outbound traffic. This lets the platform impose mTLS, retries, timeouts, circuit-breaking, rate limits, and observability uniformly without changing application code. Zero-trust networking extends the same principle to all access: never trust the network, always authenticate the workload identity, authorize per-request.
2. Container layers
A container is the cumulative effect of a handful of independent Linux kernel features stacked together. Each feature can be used alone; together they form what we call “a container.”
Linux namespaces (CLONE_NEW* flags to clone(2) or unshare(2)) isolate a global resource so that processes inside see their own view. Eight namespaces exist:
- mnt: mount points; a private filesystem tree.
- pid: process IDs; init inside the container is PID 1.
- net: network interfaces, routing table, iptables rules, sockets.
- ipc: System V IPC and POSIX message queues.
- uts: hostname and NIS domain name.
- user: UID/GID mapping; lets a non-root host user be UID 0 inside.
- cgroup: view of the cgroup hierarchy.
- time: boot and monotonic clocks (Linux 5.6+, used for CRIU checkpoint/restore).
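As a rough illustration of how a runtime selects namespaces, the sketch below ORs together the CLONE_NEW* flag values from the kernel's sched.h header. The helper name namespace_flags is hypothetical; a real runtime would pass the resulting mask to clone(2) or unshare(2).

```python
# Flag values as defined in <linux/sched.h>; one CLONE_NEW* flag per namespace.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
    "time": 0x00000080,    # CLONE_NEWTIME
}


def namespace_flags(names):
    """OR together the flags for the requested namespaces (hypothetical helper)."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


# A typical container asks for at least private mounts, PIDs, and networking.
print(hex(namespace_flags(["mnt", "pid", "net"])))
```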
Control groups (cgroups) v2 account and limit resources per process tree. cgroups v1 had a separate hierarchy per controller; v2 (Linux 4.5+, default in most modern distros) uses a single unified hierarchy. Controllers include cpu (weights and quotas), memory (limit, swap, OOM), io (block-device weights and BPS limits), pids (max processes), and rdma. Kubernetes pods map directly to cgroup v2 slices.
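To make the cpu quota concrete: in cgroup v2 the cpu.max file holds a quota and a period in microseconds, or the word max for no limit. The following is a minimal sketch of reading that value; the function name parse_cpu_max is an assumption for illustration.

```python
def parse_cpu_max(text):
    """Return the CPU limit in cores from a cgroup v2 cpu.max value.

    The file contains "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set, in which case None is returned.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(parse_cpu_max("50000 100000"))   # half a core
print(parse_cpu_max("max 100000"))     # unlimited
```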
seccomp-bpf filters which syscalls a process may make. The default Docker profile blocks around 44 of the 300+ syscalls, including reboot, kexec_load, bpf, init_module, and most keyring operations. Custom profiles in JSON list permitted syscalls and per-argument predicates.
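A custom profile can be sketched as a deny-by-default allowlist. The snippet below builds one in the JSON layout Docker accepts (defaultAction plus a syscalls list); the helper name allowlist_profile and the tiny syscall set are assumptions for illustration, and a real workload needs a far longer list.

```python
import json


def allowlist_profile(syscalls):
    """Build a seccomp profile that returns an error for every syscall not listed."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Illustrative subset only; far too small for a real process.
profile = allowlist_profile({"read", "write", "exit_group", "futex"})
print(json.dumps(profile, indent=2))
```

The resulting file would typically be passed with --security-opt seccomp=<path>; per-argument predicates go in an optional args list on each syscalls entry.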
Linux capabilities split historical root power into ~40 separate bits (CAP_NET_ADMIN, CAP_SYS_ADMIN, CAP_NET_BIND_SERVICE, etc.). Containers usually drop all and add back only what is needed (--cap-drop ALL --cap-add NET_BIND_SERVICE to bind low ports).
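One way to see which capabilities a process actually holds is to decode the CapEff hex mask from /proc/<pid>/status. The sketch below covers only capability bits 0 through 21 (numbering from linux/capability.h); the helper name decode_caps is hypothetical, and higher bits are ignored.

```python
# Capability names indexed by bit number (first 22 of roughly 40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN",
]


def decode_caps(hex_mask):
    """Return the names of the capability bits set in a hex mask string."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


# A container started with --cap-drop ALL --cap-add NET_BIND_SERVICE
# would be expected to show only bit 10 set.
print(decode_caps("0000000000000400"))
```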
AppArmor (Ubuntu, SUSE) and SELinux (Red Hat, Fedora) provide mandatory access control on top of the discretionary model. AppArmor uses path-based policy; SELinux labels every file and process with a context and enforces type-enforcement rules. Both Kubernetes and Docker can attach a profile per container.
OverlayFS is the union filesystem that makes layered images cheap. Each image layer is a directory; OverlayFS merges them into a single virtual rootfs with a writable upper layer. Copy-on-write means an unchanged file in a lower layer is not duplicated. Other drivers exist (btrfs, zfs, devicemapper) but OverlayFS has been the default for years.
Pivoting the rootfs. After mounting the layered filesystem, the runtime calls pivot_root(2) (or unshare(CLONE_NEWNS) + chroot() for some legacy paths) to make the container’s rootfs the only filesystem visible inside the mount namespace. The host’s rootfs is then unmounted from the container’s view. This is what makes ls / inside a container show only the image’s files even though the host’s / is unchanged.
User namespaces and rootless containers. A user namespace remaps UIDs and GIDs between host and container. A non-root host user can be UID 0 inside the container, with capabilities scoped to that namespace. Rootless Podman and rootless Docker rely on this to run the whole stack — daemon, builder, runtime — without ever needing host root. The cost is a more constrained set of mount options and some performance hit from slirp4netns user-space networking (or pasta as a faster alternative).
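The remapping can be illustrated with the format of /proc/<pid>/uid_map, where each line reads "inside-start outside-start length". This is a minimal sketch; the function name to_host_uid and the sample map (host user 1000 plus a subordinate range starting at 100000) are assumptions typical of a rootless setup, not values from any particular system.

```python
def to_host_uid(container_uid, uid_map):
    """Translate a UID seen inside the namespace to the host UID, or None if unmapped."""
    for line in uid_map.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


# Hypothetical rootless mapping: container root is host user 1000,
# container UIDs 1..65536 come from a subordinate range.
UID_MAP = """
0 1000 1
1 100000 65536
"""

print(to_host_uid(0, UID_MAP))      # container root maps to the unprivileged host user
print(to_host_uid(1000, UID_MAP))   # a regular container user maps into the subordinate range
```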
Container init systems. PID 1 inside a container has special responsibilities: it must reap zombies and forward signals. A plain node server.js as PID 1 will leak zombies. Common solutions: tini (small C init, default in docker run --init), dumb-init (Yelp), or letting the language runtime handle it (Node 16+, Go’s runtime, JVM all reap reasonably). Without an init, a PID 1 that installs no SIGTERM handler ignores the signal, so docker stop waits out the full 10-second grace period before sending SIGKILL.
3. OCI specs
The Open Container Initiative was formed in 2015 to standardize what was previously Docker’s de-facto format. Three specs cover the lifecycle.
Image spec. An image is identified by a content-addressable digest (sha256:abcd…). It consists of:
- A manifest JSON listing the config blob and an ordered list of layer blobs, each with a media type, digest, and size.
- A config JSON describing the image’s environment (entrypoint, cmd, env vars, exposed ports, working dir, history of layer creation).
- One or more layer blobs, typically tar.gz, applied in order to build the rootfs.
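The content-addressing scheme can be sketched in a few lines: every blob is named by the SHA-256 of its bytes, and the manifest refers to the config and layers through descriptors carrying media type, digest, and size. The blob contents below are placeholders, and the helper names digest and descriptor are assumptions for illustration.

```python
import hashlib
import json


def digest(blob):
    """Content address of a blob in the sha256:<hex> form used by registries."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def descriptor(media_type, blob):
    """Descriptor entry that a manifest uses to reference a blob."""
    return {"mediaType": media_type, "digest": digest(blob), "size": len(blob)}


config = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
layer = b"placeholder bytes standing in for a real tar.gz layer"

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": descriptor("application/vnd.oci.image.config.v1+json", config),
    "layers": [descriptor("application/vnd.oci.image.layer.v1.tar+gzip", layer)],
}

# The image digest is the hash of the serialized manifest bytes themselves.
manifest_bytes = json.dumps(manifest, sort_keys=True).encode()
print(digest(manifest_bytes))
```

Because the manifest embeds the digests of the config and every layer, changing a single byte anywhere changes the image digest.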
A manifest list (also called an image index) is a manifest of manifests — used for multi-arch images that bundle linux/amd64 + linux/arm64 + linux/arm/v7 + linux/s390x variants under one tag.
Runtime spec. Defines config.json and a rootfs/ directory. config.json enumerates the OCI runtime configuration: process to run (path, args, env, uid/gid), mounts, namespaces to use, cgroups to apply, capabilities, seccomp profile, hooks, and platform-specific knobs. A runtime takes these two inputs and produces a running container.
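For orientation, here is a sketch of the general shape of a config.json, built as a Python dict. It is a fragment under assumptions: the helper name minimal_config and the chosen values are illustrative, and a bundle that a real runtime accepts normally needs more fields (mounts, capabilities, and so on), which tools such as runc spec generate.

```python
import json


def minimal_config(args, uid=1000, gid=1000):
    """Return an illustrative subset of an OCI runtime config.json."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "env": ["PATH=/usr/local/bin:/usr/bin:/bin"],
            "cwd": "/",
            "user": {"uid": uid, "gid": gid},
        },
        # Path is relative to the bundle directory that holds config.json.
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }


print(json.dumps(minimal_config(["/bin/server", "--port=8080"]), indent=2))
```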
Distribution spec. The registry HTTP API. Authenticated HEAD/GET/PUT/POST on /v2/<name>/manifests/<tag> and /v2/<name>/blobs/<digest> with content-type negotiation, ranged uploads, and a token-based auth flow. Compatible across Docker Hub, GHCR, ECR, Artifact Registry, ACR, Quay, and Harbor.
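The endpoint layout can be sketched as plain URL construction. The registry host registry.example.com and the helper names below are hypothetical; authentication and the actual HTTP calls are deliberately left out.

```python
# Media types a client might offer when negotiating a manifest format.
MANIFEST_ACCEPT = ", ".join([
    "application/vnd.oci.image.manifest.v1+json",
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.v2+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
])


def manifest_url(registry, name, reference):
    """URL for a manifest; reference is a tag or a sha256 digest."""
    return f"https://{registry}/v2/{name}/manifests/{reference}"


def blob_url(registry, name, blob_digest):
    """URL for a config or layer blob, addressed by digest."""
    return f"https://{registry}/v2/{name}/blobs/{blob_digest}"


print(manifest_url("registry.example.com", "team/web", "1.0"))
print(blob_url("registry.example.com", "team/web", "sha256:" + "0" * 64))
```

A pull is then roughly: GET the manifest with the Accept header above, then GET each blob it references by digest.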
4. Container runtimes
Low-level runtimes consume an OCI runtime bundle and produce a running container. They are interchangeable.
runc is the OCI reference implementation, written in Go, donated by Docker. It calls clone() with the namespace flags, sets up cgroups, applies seccomp and AppArmor, then execves the target binary. Default for Docker, containerd, and most managed Kubernetes services.
crun is Red Hat’s C reimplementation. Smaller, faster startup (3–5× quicker on cold start for short-lived containers), no Go runtime to pay for. Default in Podman and OpenShift on recent versions.
youki is a Rust reimplementation by a community effort. Memory-safety and performance benefits comparable to crun; production-deployable but less widely shipped.
gVisor (Google) runs containers inside a userspace kernel called “Sentry” that intercepts syscalls and emulates a Linux ABI. The host kernel sees only a small set of safe syscalls. Strong sandboxing for untrusted workloads at the cost of ~10–30% syscall overhead. Used in Google App Engine and Cloud Run.
Kata Containers runs each container (or pod) inside its own lightweight virtual machine, bringing hypervisor-level isolation while preserving the OCI interface. Backed by QEMU, Cloud Hypervisor, or AWS Firecracker.
Firecracker (AWS, 2018) is a KVM-based micro-VM monitor written in Rust. Strips out devices and BIOS to boot in under 200 ms with about 5 MB of overhead per VM. Powers AWS Lambda and AWS Fargate. Used by Kata, Fly.io, and others for fast, isolated workloads.
WasmEdge / runwasi / containerd-shim-wasm. WebAssembly runtimes packaged as OCI containers. containerd’s runwasi shim accepts a .wasm artifact, hands it to a Wasm runtime (WasmEdge, Wasmtime, Wasmer), and exposes it through the standard CRI. Workloads are sandboxed by the Wasm VM rather than by Linux namespaces; cold start is sub-millisecond.
Runtime classes. Kubernetes’ RuntimeClass resource lets pods opt into alternate runtimes per workload. A cluster might run runc for trusted internal workloads, kata or gvisor for tenant workloads, and wasm for short-lived functions — all on the same nodes. The Kubelet maps the RuntimeClass to a containerd handler (io.containerd.kata.v2, io.containerd.runsc.v1, etc.).
Shims and the OCI runtime ABI. Above each low-level runtime sits a shim process that the high-level runtime (containerd, CRI-O) keeps alive to own the container’s stdio and exit status independent of the daemon. containerd-shim-runc-v2 is the modern shim; one shim handles many containers in the same pod via task groups. Crash of containerd itself does not kill the workloads — the shim keeps them running.
5. Higher-level runtimes and tooling
Above the low-level runtime sits the container daemon or higher-level runtime that pulls images, manages lifecycle, and exposes an API.
containerd is a CNCF graduated project, born from Docker, designed as a daemon embeddable into Kubernetes (via the CRI plugin) or used standalone via the ctr CLI. Since Kubernetes 1.24 (2022) dropped dockershim, containerd has been the default runtime in most managed K8s offerings (EKS, GKE, AKS).
CRI-O is a Kubernetes-focused runtime built by Red Hat with the explicit goal of being only as much as Kubernetes needs. No daemon-level CLI for end users; everything goes through the Kubelet via CRI. Default on OpenShift.
Podman is Red Hat’s daemon-less drop-in for Docker. It uses fork/exec to launch containers via crun (or runc) directly — no long-running daemon to crash or run as root. Supports rootless mode out of the box (user namespaces map host user to container root). Compatible Dockerfile and Docker CLI semantics.
nerdctl is the user-facing CLI for containerd that mirrors Docker’s UX. Same flags, same workflows, but talks directly to containerd. Useful when running containerd without Docker on the side.
Docker itself remains the most common developer-side UX: Docker Desktop on macOS and Windows, Docker Engine on Linux. Docker Engine sits on top of containerd; the CLI/daemon split is mostly historical. In Kubernetes, “Docker as runtime” was deprecated in 1.20 and removed in 1.24 (2022) — Kubernetes now talks directly to containerd or CRI-O.
6. Image building
Building an image means turning source code plus a build recipe into a layered OCI image. Many tools exist; the right choice depends on whether you need a daemon, rootless mode, in-cluster builds, language-native ergonomics, or reproducibility.
Dockerfile + BuildKit. The default for the last several years. BuildKit (shipped as an opt-in builder in Docker 18.09 in 2018, default builder on Linux since Docker Engine 23.0) replaces the legacy builder with a DAG-based engine, parallel stage execution, content-addressed cache, ssh and secret mounts, and multi-platform output. Most images in the wild are still built with docker build.
Buildah is Red Hat’s daemon-less image builder. Pairs with Podman; rootless; can build either via Dockerfile or via a scriptable API (buildah from, buildah run, buildah copy, buildah commit).
Kaniko runs entirely inside a Kubernetes pod with no privileged daemon. Reads a Dockerfile and writes layers to the registry directly. Used in CI systems that run on Kubernetes and cannot mount the host Docker socket.
Cloud Native Buildpacks (CNB), originally from Heroku and Pivotal (now CNCF). Detect the application’s language, choose a builder image, and produce a runnable image without a Dockerfile. pack build myapp — opinionated, layered for rebase, reproducible. Used by Heroku, Google Cloud Run buildpacks, and Paketo.
Bazel with rules_oci (and the older rules_docker) produces fully reproducible images. Every layer is content-hashed from its inputs; no shell-out to apt-get. Suited to large monorepos where reproducibility and remote caching matter.
ko is Google’s tool for building Go binaries directly into images, no Dockerfile needed. ko build ./cmd/server cross-compiles, packages, and pushes — used heavily in the Knative and Tekton communities. Sub-second builds for small services.
Nix-based images via Nixery (build images on demand from a tag like nixery.dev/python/numpy) or nix2container and dockerTools.streamLayeredImage for fine-grained layering. Maximum reproducibility, bit-identical builds, with a steep Nix learning curve.
Distroless images, published by Google at gcr.io/distroless/*, contain only the application and its direct runtime (libc, ca-certificates, tzdata, sometimes a language runtime). No shell, no package manager, no ps, no cat. Drastically reduces CVE surface; debugging requires kubectl debug ephemeral containers or shipping a parallel debug image.
Multi-stage builds split a Dockerfile into stages where intermediate stages contain build tools and only the final stage is the runtime image. FROM golang:1.22 AS build followed by FROM gcr.io/distroless/static AS run with COPY --from=build keeps the final image small.
Multi-arch builds with docker buildx produce a manifest list of platform-specific images. docker buildx build --platform linux/amd64,linux/arm64 -t myorg/myapp:1.0 --push . produces both arches and a manifest pointing at them. QEMU under the hood for cross-emulation, or native nodes for real cross-builds.
Layer caching strategy. Order Dockerfile instructions from least to most frequently changing. COPY package.json + RUN npm ci before COPY . . lets the dependency-install layer be cached across source changes. BuildKit’s --mount=type=cache,target=/root/.npm further persists cache directories across builds without baking them into the image. For monorepos, use --mount=type=bind to pull only the subtree being built.
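Why instruction order matters can be shown with a deliberately simplified model in which each layer's cache key is a hash of its parent key, the instruction text, and a fingerprint of the files it reads. This is not BuildKit's actual algorithm; the function names and fingerprints are assumptions for illustration.

```python
import hashlib


def cache_keys(steps):
    """steps: list of (instruction, input_fingerprint); each key chains on its parent."""
    keys, parent = [], ""
    for instruction, inputs in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(old_steps, new_steps):
    """Count leading layers whose cache key is unchanged between two builds."""
    count = 0
    for old_key, new_key in zip(cache_keys(old_steps), cache_keys(new_steps)):
        if old_key != new_key:
            break
        count += 1
    return count


def good_order(src):
    return [("COPY package.json", "lock-v1"), ("RUN npm ci", ""), ("COPY . .", src)]


def bad_order(src):
    return [("COPY . .", src), ("RUN npm ci", "")]


# A source-only change keeps the dependency layers cached in the good ordering.
print(reused_layers(good_order("src-v1"), good_order("src-v2")))
print(reused_layers(bad_order("src-v1"), bad_order("src-v2")))
```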
Reproducibility knobs. Reproducible builds require: pinned base image digests (not floating tags), fixed timestamps and deterministic file ordering in layer tarballs (BuildKit takes the timestamp from SOURCE_DATE_EPOCH), no embedded build paths or build IDs in compiled binaries (-trimpath for Go, -Wl,--build-id=none for C), and pinned package-manager lockfiles. The reward is identical image digests across rebuilds, enabling cosign signatures to verify provenance over time.
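The tarball part of this can be demonstrated with the standard library: sort the entries, pin the timestamp, and normalize ownership, and the same inputs always hash to the same bytes. This is a sketch of the idea, not how any particular builder implements it; the function name deterministic_tar is an assumption.

```python
import hashlib
import io
import tarfile


def deterministic_tar(files, epoch=0):
    """Pack {path: bytes} into a tar archive with sorted entries and fixed metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(files):
            info = tarfile.TarInfo(name=path)
            info.size = len(files[path])
            info.mtime = epoch          # same role as SOURCE_DATE_EPOCH
            info.mode = 0o644
            info.uid = info.gid = 0     # do not leak the builder's user
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[path]))
    return buf.getvalue()


first = deterministic_tar({"b.txt": b"two", "a.txt": b"one"})
second = deterministic_tar({"a.txt": b"one", "b.txt": b"two"})
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

Compression adds its own nondeterminism (gzip headers carry a timestamp), so a real pipeline also has to pin that.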
Image-size hygiene. Aim for application images well under 200 MB. Common bloat sources: package manager caches (/var/cache/apt, /var/lib/apt/lists — clean in the same RUN as install or use --no-install-recommends), language-specific caches (pip wheels, npm modules used only for build), test fixtures left in, debug symbols (strip binaries), and unused locale data (apt-get install -y --no-install-recommends, set LANG=C.UTF-8). dive (Wagoodman) is the standard tool for layer-by-layer size analysis.
Build-time vs. runtime secrets. Never ENV PASSWORD=... or COPY id_rsa / — every layer’s contents are preserved in the registry forever and visible to anyone with pull rights. Use BuildKit’s --mount=type=secret,id=... to mount secrets only during a single RUN step; the secret is not part of any layer. For runtime secrets, mount them via Kubernetes Secrets (or, better, projected SPIFFE/Vault-Agent-injected secrets) at pod startup.
7. Image security and supply chain
A container image is software you are pulling from the internet and executing. The supply-chain attack surface is significant.
Vulnerability scanning maps an image’s installed packages and language libraries to known CVEs by reading package databases (/var/lib/dpkg/status, /var/lib/rpm/Packages, /lib/apk/db/installed) and language-specific manifests (package-lock.json, Pipfile.lock, go.sum, Gemfile.lock, Cargo.lock). Cross-references against vulnerability sources (NVD, GitHub Advisory Database, distro security trackers).
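The first step a scanner performs, building a package inventory, can be sketched for the dpkg case. The status file is a series of blank-line-separated stanzas of "Field: value" lines. The parser below is a simplified illustration (the name installed_packages and the sample data are assumptions); matching versions against advisories is a separate and much harder step.

```python
def installed_packages(status_text):
    """Return {package: version} for fully installed packages in a dpkg status file."""
    packages = {}
    for stanza in status_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Lines starting with whitespace continue the previous field.
            if line[:1] in (" ", "\t") or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        if "Package" in fields and fields.get("Status", "").endswith(" installed"):
            packages[fields["Package"]] = fields.get("Version", "")
    return packages


SAMPLE = """Package: libc6
Status: install ok installed
Version: 2.36-9
Description: GNU C Library
 Shared libraries

Package: oldpkg
Status: deinstall ok config-files
Version: 1:1.0-2
"""

print(installed_packages(SAMPLE))
```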
Common scanners:
- Trivy (Aqua Security, open source) — scans images, filesystems, git repos, IaC; the most-deployed open scanner.
- Grype + Syft (Anchore, open source) — Grype scans for CVEs, Syft generates SBOMs.
- Snyk — commercial; scans + remediation suggestions; ships container, code, IaC, and OSS scanning.
- Docker Scout — built into Docker Desktop; CVE scanning + base image recommendations.
SBOM (Software Bill of Materials). A machine-readable manifest of every dependency in an image. Two competing formats:
- SPDX (Linux Foundation, ISO standard) — tag/value or JSON.
- CycloneDX (OWASP) — XML or JSON, more component-relationship-focused.
syft packages alpine:latest -o spdx-json produces an SPDX SBOM in JSON; cosign attest --predicate sbom.json --type spdxjson <image> attaches it to the image as a signed attestation.
Image signing.
- cosign (Sigstore project, Linux Foundation, 2021): keyless signing via OIDC + Fulcio (short-lived cert authority) + Rekor (immutable transparency log). cosign sign <image> and cosign verify <image> --certificate-identity ... --certificate-oidc-issuer .... The dominant choice for OSS projects.
- Notation v2 (CNCF Notary project): X.509-based signing, supports plug-in signing providers including KMS.
Supply chain frameworks.
- SLSA (Supply-chain Levels for Software Artifacts, 2021): the original v0.1 framework defined four levels (L1: documented provenance, L2: hosted build service, L3: hardened builder, L4: hermetic and reproducible); SLSA v1.0 (2023) reorganized these into a Build track with levels L1 to L3. Goal is to attest the build process, not just the binary.
- in-toto attestations: generalized signed-claim format that SLSA provenance is expressed in, for example cosign attest --predicate slsa.json --type slsaprovenance.
Image hardening in the Dockerfile itself:
- USER 1000:1000 to drop root.
- Read-only root filesystem (readOnlyRootFilesystem: true in the K8s SecurityContext, mount tmpfs for /tmp).
- Drop all Linux capabilities, add back only what is needed.
- Apply the default seccomp profile (or a stricter custom one).
- No SUID binaries in the image (find / -perm /4000 should be empty).
- Minimal base (distroless or scratch) to shrink CVE surface.
- Pin base images by digest, not floating tag: FROM gcr.io/distroless/static@sha256:abcd... rather than :latest.
- Avoid ADD <url> (unverified fetch); use RUN curl ... && sha256sum -c instead.
- Strip debug symbols and remove build tooling before the final stage.
Admission-time enforcement. Image-policy admission controllers reject pods whose images do not meet policy. Kyverno and OPA Gatekeeper let platforms require: signed images from cosign verify, scanned-and-clean from a Trivy attestation, SBOM present, originating from an allowed registry, running as non-root, with read-only root FS. Sigstore Policy Controller specifically validates cosign signatures at admission. The pattern is “configuration that cannot exist” — invalid pods are rejected before they run.
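The kind of check an admission policy performs can be sketched as a plain function over a pod spec. Kyverno and Gatekeeper express such rules declaratively, and signature verification is out of scope here; the function name violations and the registry name are hypothetical.

```python
def violations(pod, allowed_registries):
    """Return human-readable policy violations for each container in a pod spec."""
    problems = []
    for container in pod["spec"]["containers"]:
        name, image = container["name"], container["image"]
        ctx = container.get("securityContext", {})
        if not any(image.startswith(reg + "/") for reg in allowed_registries):
            problems.append(f"{name}: registry not allowed")
        if "@sha256:" not in image:
            problems.append(f"{name}: image not pinned by digest")
        if not ctx.get("runAsNonRoot", False):
            problems.append(f"{name}: must set runAsNonRoot")
        if not ctx.get("readOnlyRootFilesystem", False):
            problems.append(f"{name}: must set readOnlyRootFilesystem")
    return problems


pod = {"spec": {"containers": [
    {"name": "web", "image": "docker.io/library/nginx:latest"},
]}}
for problem in violations(pod, ["registry.example.com"]):
    print(problem)
```

An admission webhook would reject the pod whenever the returned list is non-empty.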
8. Registries
The registry is the durable store and distribution layer. All major clouds host one; several open implementations exist.
- Docker Hub: the original, still default for docker pull alpine and similar; aggressive pull-rate limits for anonymous and free-tier users.
- GitHub Container Registry (GHCR): at ghcr.io/<org>/<image>; tight GitHub Actions integration, free for public images.
- Amazon ECR: private; also ECR Public at public.ecr.aws/.... IAM-based auth; lifecycle policies; replication.
- Google Artifact Registry (pkg.dev): successor to Container Registry (gcr.io); supports container, Maven, npm, Python, and OS-package formats in one product.
- Azure Container Registry (ACR): Microsoft’s offering; geo-replication; Helm and OCI artifact support.
- Quay (Red Hat) — hosted at quay.io; on-prem via Project Quay; image security scanning built in.
- Harbor (CNCF graduated) — open-source, on-prem; CVE scanning via Trivy, signing via cosign/Notary, multi-tenant projects, replication.
Image promotion is the practice of moving the same digest from a dev registry to staging to prod via signed promotion — never rebuilding for prod. The digest pinning guarantees binary identity; signatures attest who promoted it and when. Tools: cosign copy, crane copy, ArgoCD Image Updater.
Pull-through caches and air-gapped registries. Production K8s deployments rarely pull directly from Docker Hub. Common patterns: a Harbor or ECR Pull-Through Cache mirrors upstream, applying scanning and signing on first pull; clusters in regulated or disconnected environments use Zarf, Hauler, or skopeo sync to bundle images for offline transport. Image-mutating admission webhooks (Kyverno, OPA Gatekeeper) rewrite registry hostnames at pod admission so workloads can keep their original tags while pulling from the local mirror.
OCI artifacts — the OCI distribution spec is generalizing beyond container images. Helm charts, signatures (cosign), SBOMs, Wasm modules, policy bundles (OPA), and even general blobs can now be pushed to OCI registries as first-class artifacts with their own media types. This is consolidating storage of all build outputs into one durable, signed, replicated system.
Registry garbage collection. Untagged manifests and orphaned blobs accumulate. Registries either GC continuously (Harbor’s gc job, ECR lifecycle policies) or by scheduled sweep. Bad GC strategies have caused production outages — Docker Hub’s 2020 retention-policy reversal, several self-hosted Quay incidents — so consider retention rules carefully: keep last N tags per repo, keep all tags matching ^v\d+\.\d+\.\d+$, expire untagged manifests after 30 days.
Geo-replication and latency. Pulling a multi-GB image from a US registry to an EU cluster on rollout is slow and expensive. Cross-region replication (Harbor’s replication policies, ECR cross-region replication, Artifact Registry’s standard repos in each region) keeps pull latency low. Image-pull-secret rotation across regions is the operational pain — sealed-secrets or External Secrets Operator + cloud KMS is the modern pattern.
9. Service mesh — why
Distributed systems built on microservices share a set of recurring cross-cutting concerns:
- mTLS between services for authentication and encryption-in-transit.
- Retries and timeouts for transient failures, with budgets to prevent retry storms.
- Circuit-breaking to fail fast when a downstream is unhealthy.
- Rate-limiting to protect callees.
- Load balancing with locality-aware and weighted strategies.
- Observability — metrics (RED: rate, errors, duration), distributed traces, access logs.
- Authorization policy at the request level (which workload may call which method on which service).
- Traffic shifting for canary and blue-green deployments.
You can implement these in each service via libraries (Spring Cloud, Polly, grpc-go retries) — but that requires every team in every language to keep the implementation aligned. A service mesh lifts these out of application code into a sidecar or platform-managed proxy. The platform team owns the cross-cutting concerns; application teams write business logic.
10. Service mesh architectures
Three architectural patterns dominate.
Sidecar per pod. Each application container shares a pod (or VM) with a proxy container. All inbound traffic enters the proxy first; all outbound traffic exits via the proxy. Examples: classic Istio + Envoy, Linkerd + linkerd-proxy, Consul Connect + Envoy, AWS App Mesh (sidecar mode, now deprecating). Pros: per-workload policy isolation, mature, fits the K8s pod model. Cons: 1 proxy per pod → memory + CPU multiplied across thousands of workloads; init-container ordering races; harder to debug; not free in latency (1–3 ms added per hop).
Per-node shared proxy. One proxy per node handles traffic for all workloads on that node. Examples: Cilium service mesh (eBPF-based, no userspace proxy in the fast path), Istio Ambient (introduced 2022, GA 2024) with a per-node ztunnel for L4 + mTLS and a per-namespace waypoint proxy for L7 policy. Pros: fewer proxy instances, lower aggregate overhead, simpler pod lifecycle. Cons: blast radius — a node-proxy crash affects every workload on the node; per-workload isolation has to be reasserted via identity.
Library / SDK mesh. Each service links a library that speaks the mesh control-plane protocols directly. Examples: gRPC’s native xDS support (it can consume Istio’s config without a sidecar), OpenTelemetry’s per-language SDK. Pros: zero data-path overhead; deepest application visibility. Cons: language-by-language fragmentation; harder to keep versions aligned; impossible to add to unmodifiable workloads.
11. Istio
Istio (Google, IBM, Lyft, 2017; now CNCF) is the most feature-complete mesh. Control plane = istiod, which since Istio 1.5 (2020) consolidates the former Pilot, Citadel, and Galley components into one binary; Mixer was deprecated and later removed rather than merged. Data plane = Envoy sidecars (classic mode) or ztunnel + waypoint proxies (Ambient mode).
Core CRDs:
- VirtualService: routing rules (match by host, path, header, weight).
- DestinationRule: per-destination policy (subsets, load balancing, TLS, connection pool, outlier detection).
- Gateway: north-south traffic configuration for the ingress proxy.
- ServiceEntry: adds external services into the mesh registry.
- AuthorizationPolicy: request-level authn/authz.
- PeerAuthentication: mTLS mode per namespace (DISABLE, PERMISSIVE, STRICT).
- RequestAuthentication: JWT validation.
- Sidecar: scope which services a sidecar tracks (essential at scale).
- Telemetry: fine-grained logging, metrics, tracing knobs.
mTLS is automatic when both sides of a connection are Istio-managed; certs are rotated approximately every 24 hours (configurable via pilot-agent flags). Workload identity is SPIFFE-formatted (spiffe://<trust-domain>/ns/<ns>/sa/<sa>).
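The identity format is regular enough to build and parse mechanically. Below is a minimal sketch; the helper names are assumptions, and the parser accepts only the Kubernetes-style /ns/<ns>/sa/<sa> path shown above.

```python
from urllib.parse import urlparse


def spiffe_id(trust_domain, namespace, service_account):
    """Build a workload identity in the spiffe://<trust-domain>/ns/<ns>/sa/<sa> form."""
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


def parse_spiffe_id(uri):
    """Split an identity into its parts; raises ValueError on any other shape."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    if (parsed.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"not a Kubernetes-style SPIFFE ID: {uri}")
    return {
        "trust_domain": parsed.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }


identity = spiffe_id("cluster.local", "payments", "checkout")
print(identity)
print(parse_spiffe_id(identity))
```

An authorization rule that allows a caller by namespace and service account is, in effect, matching on these parsed fields of the peer certificate's URI SAN.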
Multi-cluster topologies: primary-remote (one control plane manages many clusters), multi-primary (control plane per cluster, federated trust), or external-control-plane (control plane in a different cluster from the data plane). Spans VMs via WorkloadEntry + WorkloadGroup.
Istio revisions and canary control-plane upgrades. The istio.io/rev label on a namespace pins it to a specific istiod revision. Operators install a new revision alongside the old, migrate namespaces one at a time, observe, and finally remove the old. This pattern survives major version upgrades that would otherwise require full-cluster downtime.
Ambient mode (announced 2022, beta and GA in 2024) splits the data plane: ztunnel (Rust) per-node handles L4 + mTLS; waypoint (Envoy) per-namespace handles L7. Eliminates per-pod sidecars and the associated init-container ordering issues, at the cost of a more complex architecture and L7-policy waypoint hops.
Trade-off: power vs. operational complexity. Istio is the right answer for large platforms with dedicated platform engineering; it is overkill for a 10-service shop.
Traffic policy primitives:
- Retry budgets. retries.attempts and retries.perTryTimeout cap per-request retries; an overall retry budget as percentage of active requests prevents retry storms from amplifying outages.
- Outlier detection. consecutive5xxErrors and interval eject unhealthy upstream hosts from the load balancer for baseEjectionTime, with the ejection time growing in proportion to the number of times the host has been ejected.
- Connection pool tuning. maxConnections, http1MaxPendingRequests, http2MaxRequests, maxRequestsPerConnection. Misconfigured pools are the most common cause of “the mesh made things slower.”
- Locality-weighted load balancing. Prefer endpoints in the same zone, fall back to the same region, then cross-region. Reduces inter-AZ traffic costs on cloud providers.
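The retry-budget idea from the list above can be sketched as a single admission check: a retry is allowed only while concurrent retries stay under a percentage of active requests, with a small floor so low-traffic services can still retry. The function name, parameter names, and default values here are illustrative assumptions, not a specific proxy's configuration.

```python
def retry_allowed(active_requests, active_retries,
                  budget_percent=20.0, min_retry_concurrency=3):
    """Return True if one more concurrent retry fits within the budget."""
    limit = max(min_retry_concurrency, active_requests * budget_percent / 100.0)
    return active_retries < limit


# With 100 requests in flight and a 20% budget, the 20th concurrent retry is the last.
print(retry_allowed(active_requests=100, active_retries=19))
print(retry_allowed(active_requests=100, active_retries=20))
```

Unlike a fixed per-request attempt count, this bound scales with load, which is what keeps a degraded backend from being hit with a multiple of its normal traffic.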
Canary and progressive delivery. Istio’s VirtualService weight lets you shift 1% / 5% / 25% / 100% over a release window. Argo Rollouts and Flagger automate this with metric-driven gates: deploy a new revision, shift 5% of traffic, watch p99 latency and error rate via Prometheus, abort and rollback if SLOs breach. Mesh-driven progressive delivery is one of the most operationally valuable mesh features.
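The metric-gated loop reduces to a small decision function: advance to the next weight while the SLOs hold, otherwise drop the canary to zero. The thresholds, step list, and function name below are assumptions for illustration; Argo Rollouts and Flagger implement this with their own CRDs and analysis templates.

```python
STEPS = [1, 5, 25, 100]  # canary traffic percentages, as in the text above


def next_weight(current, p99_ms, error_rate,
                max_p99_ms=500.0, max_error_rate=0.01):
    """Return the next canary weight, or 0 to roll back when an SLO is breached."""
    if p99_ms > max_p99_ms or error_rate > max_error_rate:
        return 0
    later = [weight for weight in STEPS if weight > current]
    return later[0] if later else current


print(next_weight(5, p99_ms=180.0, error_rate=0.002))   # healthy: promote
print(next_weight(5, p99_ms=900.0, error_rate=0.002))   # latency breach: roll back
```

In practice the returned value would be written into the VirtualService route weights, with the stable revision receiving the remainder.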
Fault injection for chaos engineering. Istio’s HTTPFaultInjection injects HTTP-status errors or latency on a configurable fraction of requests. Useful for verifying retry/timeout behavior, validating client-side resilience, and continuous chaos in staging. Linkerd, Consul, and Envoy Gateway all support equivalents.
12. Linkerd
Linkerd (Buoyant, 2016; CNCF graduated 2021) takes the opposite posture: the smallest mesh that does the job well. Data plane is linkerd-proxy, a purpose-built Rust micro-proxy (originally based on the Hyper and Tower crates) — drastically lower memory (~10 MB per sidecar) and CPU footprint than Envoy.
Features: mTLS by default (no opt-in), automatic per-route metrics (success rate, latency p50/p95/p99, RPS), retries and timeouts via ServiceProfile, traffic split for canaries via SMI’s TrafficSplit, multi-cluster gateways, optional Linkerd Viz for built-in dashboards and Linkerd Jaeger for tracing.
Control plane can run in HA mode (three replicas of destination, identity, proxy-injector); HA is opt-in at install time. Configuration is via CRDs but the surface area is intentionally smaller than Istio’s.
The Rust proxy is written specifically to be small and bounded. Memory per proxy is typically 10–20 MB resident (vs. Envoy’s 50–150 MB in steady state with realistic config), and CPU at idle is near-zero because the runtime is async (Tokio) and the proxy holds no work when no requests are in flight. This matters at scale: across a 5000-pod cluster those per-proxy figures add up to roughly 50–100 GB of sidecar memory for linkerd-proxy versus 250–750 GB for Envoy.
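Working the per-proxy figures through for a 5000-pod cluster makes the aggregate cost explicit. The sketch below uses the ranges quoted above and decimal gigabytes; the function name aggregate_gb is an assumption.

```python
def aggregate_gb(pods, mb_per_proxy):
    """Total sidecar memory in decimal GB for one proxy per pod."""
    return pods * mb_per_proxy / 1000


PODS = 5000
for name, low_mb, high_mb in [("linkerd-proxy", 10, 20), ("Envoy", 50, 150)]:
    low, high = aggregate_gb(PODS, low_mb), aggregate_gb(PODS, high_mb)
    print(f"{name}: {low:.0f} to {high:.0f} GB")
```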
Trade-off: simpler and lighter than Istio, but fewer L7 routing knobs. The pitch is “the mesh that does not become a project of its own.”
13. Cilium
Cilium (Isovalent; Cisco acquired Isovalent in 2024; CNCF graduated 2023) is built on eBPF rather than userspace proxies. It is simultaneously:
- A CNI plugin (pod networking).
- A NetworkPolicy engine, including the rich CiliumNetworkPolicy (L7-aware: HTTP method/path, DNS, Kafka, gRPC).
- A service mesh (L7 proxy steps via Envoy + cilium-envoy, but datapath in eBPF, no sidecar).
- An observability platform via Hubble (flow logs, service maps, metrics).
- A runtime security platform via Tetragon (kernel-event-based, eBPF-driven policy enforcement and forensics).
eBPF runs inside the kernel; programs are verified safe and attached to hooks (XDP, tc, kprobes, tracepoints, LSM, socket). Cilium uses this to bypass kube-proxy iptables, do native socket-level load balancing, enforce policy at packet time, and observe traffic without sidecars.
Default CNI on EKS Anywhere and GKE Dataplane V2 (Google’s variant). Recommended when network performance, eBPF observability, and a sidecarless mesh matter together.
Cluster Mesh — Cilium’s multi-cluster mode joins multiple clusters into a single service-discovery namespace with cross-cluster service routing, shared identity (via SPIFFE), and global NetworkPolicy. Backed by KVStoreMesh (etcd-replicated) or now Direct Server Return paths via the kernel.
Identity-based policy. A CiliumNetworkPolicy can reference workloads by endpointSelector (Kubernetes labels) or, for cross-cluster, by serviceAccount and cluster. eBPF programs in the kernel check identity at packet time, dropping disallowed traffic before it leaves the source node. This is materially faster than iptables-based policy at scale.
Performance characteristics. Because Cilium can replace kube-proxy entirely (kube-proxy free mode), service-VIP resolution happens in the socket layer via sock_ops eBPF programs — direct backend IP rewriting at connect time, no DNAT, no conntrack entries for in-cluster traffic. For high-throughput services this removes a per-packet hop and cuts CPU on busy nodes noticeably.
WireGuard transparent encryption. Cilium can encrypt all pod-to-pod traffic transparently with WireGuard (kernel-native, modern, fast) — no app changes, no certificates per workload, no proxy hop. This is a different security posture than mesh mTLS: WireGuard authenticates nodes, mTLS authenticates workloads. For machine-to-machine confidentiality without workload identity, WireGuard is enough; for workload identity and per-source authorization, mesh mTLS remains needed.
L7 visibility without sidecar. Hubble’s L7 visibility comes from Envoy proxies that Cilium injects at the node level only for traffic that L7 policy applies to — pay only for what you use. For pure L4 needs, no Envoy is ever in the path.
14. Consul
Consul (HashiCorp) has the broadest scope: service discovery (DNS + HTTP), KV configuration store, service mesh (Consul Connect), and now identity. It runs natively on Kubernetes, on VMs, on bare metal, and on Nomad, and can mix all of them in a single mesh.
Connect uses Envoy sidecars (or Consul’s own built-in proxy for simple cases). Configuration is via Consul’s HTTP API or HCL config-entries — service-defaults, service-router, service-splitter, service-resolver, service-intentions (the authorization policy).
Trade-off: best when the deployment is heterogeneous (VMs + K8s + bare metal); not the strongest pure-Kubernetes option.
Consul as service registry. Even shops not using Connect lean on Consul as the source of truth for service location across data centers. The consul-template daemon renders config files (HAProxy backends, application configs) from Consul’s KV and service catalog, restarting or signaling the app on change. Vault often sits alongside for secret material — Consul + Vault is a common pairing.
15. Envoy
Envoy (Lyft, 2016; CNCF graduated 2018) is the high-performance L7 proxy that underpins much of the modern mesh ecosystem. C++; built around an event loop per worker thread; zero-allocation hot path; HTTP/1.1, HTTP/2, HTTP/3, gRPC; observability baked in.
The defining feature is xDS, a set of gRPC streaming APIs that let an external control plane push configuration dynamically: LDS (listeners), RDS (routes), CDS (clusters), EDS (endpoints), SDS (secrets), ADS (aggregated). Istio’s istiod, AWS App Mesh’s controller, Solo Gloo, Apigee, Tetrate Service Bridge — all are xDS control planes pushing config into Envoy.
Envoy filter chains are pluggable in C++ or, via Proxy-Wasm, in any language that compiles to WebAssembly. Common filters: ext_authz (external authz), JWT, rate-limit, fault-injection, CSRF, lua.
Listener / filter-chain / cluster model. A listener binds a socket. Inbound bytes flow through a filter chain of network filters (TCP-level) and then, for HTTP listeners, an HTTP filter chain (router, jwt_authn, ext_authz, lua, rbac, cors, rate_limit, compressor, etc.). The terminal router filter maps the request to a cluster — a logical upstream group with endpoints, load-balancing policy, TLS config, and circuit-breaker settings. Each layer is independently configurable via the corresponding xDS resource.
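The HTTP part of that flow can be modelled in a few lines. This is a toy sketch of the idea only, not Envoy code or its configuration API: the Request and Cluster types, the two placeholder filters (require_bearer, read_only), and the handle function are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    method: str
    path: str
    headers: dict = field(default_factory=dict)


@dataclass
class Cluster:
    name: str        # logical upstream group
    endpoints: list  # "host:port" strings


# An HTTP filter returns None to pass the request on, or a status code to
# stop the chain and answer locally.
HttpFilter = Callable[[Request], Optional[int]]


def require_bearer(req: Request) -> Optional[int]:
    # Placeholder for jwt_authn: only checks that a bearer token is present.
    token = req.headers.get("authorization", "")
    return None if token.startswith("Bearer ") else 401


def read_only(req: Request) -> Optional[int]:
    # Placeholder for rbac: reject anything that is not a GET or HEAD.
    return None if req.method in ("GET", "HEAD") else 403


def handle(req: Request, filters: list, routes: list):
    """Run the HTTP filter chain, then the terminal router step."""
    for http_filter in filters:
        status = http_filter(req)
        if status is not None:
            return status, None
    # Router: the first matching prefix wins, so list specific prefixes first.
    for prefix, cluster in routes:
        if req.path.startswith(prefix):
            return 200, cluster.name
    return 404, None


if __name__ == "__main__":
    routes = [("/api/", Cluster("api", ["10.0.0.1:8080"])),
              ("/", Cluster("web", ["10.0.0.2:8080"]))]
    request = Request("GET", "/api/users", {"authorization": "Bearer t"})
    print(handle(request, [require_bearer, read_only], routes))
```

The ordering is the point: a filter that rejects the request short-circuits everything after it, and cluster selection happens only once every filter has passed.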
The Envoy admin interface (port 15000 in Istio sidecars) exposes /clusters, /listeners, /config_dump, /stats, /server_info. For debugging “why is my mesh doing X” this is the ground truth; istioctl proxy-config cluster <pod> is a wrapper around it. Linkerd does not use Envoy; linkerd diagnostics proxy-metrics plays the analogous role by reading the metrics endpoint of Linkerd’s own proxy.
16. mTLS automation
mTLS in a mesh is only practical if certificate issuance and rotation are automatic. The pieces that make that real:
- SPIFFE (CNCF) defines a workload identity URI: spiffe://<trust-domain>/<path>. Identity is platform-issued, not bearer-secret-based.
- SPIRE is the reference SPIFFE implementation. SPIRE Server is the authority; SPIRE Agents run on each node and attest workloads (via the Kubernetes API, AWS instance identity docs, GCP metadata, etc.). Workloads request an SVID (SPIFFE Verifiable Identity Document) — X.509 cert or JWT — via the Workload API socket.
- cert-manager (CNCF graduated 2024) automates X.509 issuance from many backends (ACME/Let’s Encrypt, HashiCorp Vault, Venafi, internal CAs, Smallstep). trust-manager distributes CA bundles into namespaces.
- Service-mesh-native rotation — Istio’s istiod, Linkerd’s identity controller, Consul’s CA all issue and rotate workload certs without operator involvement.
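To make the spiffe://<trust-domain>/<path> shape concrete, here is a minimal parsing sketch. The character rules it applies (lowercase trust domain, no port, query, empty or dot segments) are a simplified reading of the SPIFFE ID format and should be treated as assumptions; parse_spiffe_id and SpiffeID are hypothetical names, and production code would more likely rely on an official SPIFFE library.

```python
import re
from typing import NamedTuple

# Simplified character sets; an assumption of this sketch.
_TRUST_DOMAIN = re.compile(r"[a-z0-9._-]+")
_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


class SpiffeID(NamedTuple):
    trust_domain: str
    path: str  # "" or a string starting with "/"


def parse_spiffe_id(uri: str) -> SpiffeID:
    prefix = "spiffe://"
    if not uri.startswith(prefix):
        raise ValueError("scheme must be spiffe")
    trust_domain, slash, path = uri[len(prefix):].partition("/")
    # A port, userinfo, or uppercase letter fails this match.
    if not _TRUST_DOMAIN.fullmatch(trust_domain):
        raise ValueError(f"invalid trust domain: {trust_domain!r}")
    if not slash:
        return SpiffeID(trust_domain, "")
    for segment in path.split("/"):
        # Empty segments (trailing or doubled slash), dot segments, and
        # query or fragment characters are all rejected here.
        if segment in (".", "..") or not _SEGMENT.fullmatch(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return SpiffeID(trust_domain, "/" + path)


if __name__ == "__main__":
    print(parse_spiffe_id("spiffe://example.org/ns/default/sa/web"))
```

Parsing strictly and rejecting anything ambiguous matters because the identity string is what authorization policy later matches on.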
Rotation cadence and blast radius. Short certificate lifetimes (24 hours by default in Istio, 24 hours in Linkerd) limit compromise window. The trade-off is that a control-plane outage longer than the lifetime can fail-open or fail-closed depending on data-plane behavior. Envoy by default fails closed when the SDS-delivered cert expires; some operators stretch the lifetime to 7 days to widen the recovery window for outages.
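The trade-off can be put into numbers with a small sketch. It assumes a proxy that attempts renewal once a fixed fraction of the certificate lifetime has elapsed; that renew_fraction parameter is hypothetical and is not a documented default of any mesh.

```python
from datetime import timedelta


def outage_window(lifetime: timedelta, renew_fraction: float):
    """Return (worst, best) time a workload keeps a valid cert without the CA.

    Worst case: the CA fails just before the renewal attempt, so only the
    unrenewed tail of the lifetime remains. Best case: the cert was renewed
    just before the outage, so the full lifetime remains.
    """
    if not 0.0 < renew_fraction < 1.0:
        raise ValueError("renew_fraction must be between 0 and 1")
    worst = lifetime * (1.0 - renew_fraction)
    best = lifetime
    return worst, best


if __name__ == "__main__":
    for lifetime in (timedelta(hours=24), timedelta(days=7)):
        worst, best = outage_window(lifetime, renew_fraction=0.5)
        print(f"lifetime={lifetime}  worst case={worst}  best case={best}")
```

Under these assumptions, stretching the lifetime from 24 hours to 7 days scales the guaranteed recovery window by the same factor of seven, at the price of a compromise window that is seven times longer.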
Trust-domain federation. Multi-cluster meshes need cross-cluster trust. SPIFFE supports federation via the trust bundle — each trust domain exports its root CAs, peer domains import them. SPIRE Server federation, Istio’s meshConfig.trustDomainAliases, and Linkerd’s linkerd multicluster link all implement variations of this pattern.
Workload-identity-to-cloud-IAM bridging. AWS IAM Roles for Service Accounts (IRSA), GKE Workload Identity, and Azure Workload Identity project a JWT signed by the cluster’s OIDC issuer into the pod; the cloud IAM trusts that issuer and exchanges the JWT for short-lived cloud credentials. The same SPIFFE-style identity model, applied to cloud IAM.
17. Zero-trust networking
The premise of BeyondCorp (Google, articulated in a 2014 USENIX ;login: article and subsequent papers): the network perimeter is meaningless. Treat every request as coming from the open internet; authenticate the user and the device and the workload; authorize per-request based on identity and posture; encrypt everything end-to-end.
The model is now standard in the industry:
- Cloudflare Access — identity-aware proxy in front of internal apps; integrates with any IdP.
- Tailscale — WireGuard mesh with SSO-based identity; works over the open internet without opening firewalls; ACLs in JSON.
- Twingate — split-tunnel zero-trust gateway, agent on each device, resource-level access.
- Banyan / SonicWall CSE — enterprise ZTNA.
- AWS Verified Access, GCP Identity-Aware Proxy (IAP), Azure Entra Private Access — cloud-managed equivalents.
Inside the cluster, the mesh’s mTLS plus an AuthorizationPolicy is the zero-trust pattern — every workload-to-workload call carries a verifiable identity and is authorized per-request, regardless of network reachability.
Device posture and context. True zero-trust extends past identity to context: device compliance (managed, encrypted, patched), location (impossible-travel detection), time-of-day, user-behavior anomaly score. Tools like CrowdStrike Zero Trust Assessment, Microsoft Entra Conditional Access, and Okta FastPass feed posture signals into the access decision. The policy evaluator (CASB, IdP, or IAP) combines identity + posture + resource sensitivity into a per-request allow/deny.
Continuous verification, not just at login. Traditional VPN authorizes at connect; zero-trust re-evaluates at every request. Token lifetimes shrink to minutes; sessions are bound to device-attested public keys (DPoP, mTLS); revocation is immediate. This is why JWT signing keys are rotated frequently and why every gateway re-validates the token rather than trusting an earlier check.
Workload-to-workload zero-trust. Beyond user-facing access, the mesh’s mTLS + AuthorizationPolicy applies the same model to service-to-service calls. SPIFFE-issued workload identity, short-lived certs, deny-by-default policy with explicit allow lists per source/destination/method. The Google “BeyondProd” paper (2019) describes Google’s internal application of these principles, of which Istio and SPIFFE are the open-source distillation.
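Deny-by-default with explicit allow lists reduces to a very small decision function. The sketch below is a conceptual model rather than the evaluation logic of any particular mesh; AllowRule, is_allowed, and the example SPIFFE IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    source: str         # SPIFFE ID of the calling workload
    destination: str    # SPIFFE ID of the called workload
    methods: frozenset  # HTTP methods this source may use


def is_allowed(rules, source: str, destination: str, method: str) -> bool:
    """Deny by default: a call passes only if some rule matches it exactly."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        for rule in rules
    )


if __name__ == "__main__":
    web = "spiffe://example.org/ns/shop/sa/web"
    cart = "spiffe://example.org/ns/shop/sa/cart"
    rules = [AllowRule(web, cart, frozenset({"GET", "POST"}))]
    print(is_allowed(rules, web, cart, "GET"))     # matches the rule
    print(is_allowed(rules, cart, web, "GET"))     # reverse direction, no rule
```

An empty rule list denies everything, which is the property that makes a missing policy fail safe.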
OIDC at the edge. External-facing services accept OIDC ID tokens from an IdP (Okta, Auth0, Entra, Keycloak, Dex). The gateway validates signature, expiry, audience, and issuer against the JWKS, extracts claims, and passes them to the upstream as headers. The mesh then applies AuthorizationPolicy based on those claims. Workloads never see raw credentials.
18. Gateway API
Gateway API (Kubernetes SIG-Network; v1 GA October 2023) is the successor to the legacy Ingress resource. It is role-oriented and multi-tenant:
- GatewayClass (cluster-scoped, written by infra provider) — describes a class of load balancer.
- Gateway (cluster operator) — provisions a specific instance, with listeners, ports, TLS settings.
- HTTPRoute, GRPCRoute, TCPRoute, TLSRoute, UDPRoute (application teams) — attach routes to a Gateway with parentRefs.
- ReferenceGrant — explicit cross-namespace permission.
Implementations: Istio, Cilium, Contour, Envoy Gateway, NGINX Gateway Fabric, HAProxy, Kong, Traefik, Google’s GKE Gateway, AWS Gateway API Controller. Most projects are converging on Gateway API; legacy Ingress will likely fade for new deployments.
Why Gateway API improves on Ingress. The original Ingress resource crammed routing, TLS, and provider-specific behavior into one object owned by application teams. Annotations diverged per controller (nginx.ingress.kubernetes.io/..., traefik.ingress.kubernetes.io/...), portability suffered, and cross-namespace fan-in needed permission hacks. Gateway API separates roles cleanly — infra owns GatewayClass and Gateway, app teams own Routes — and bakes the common knobs (header matching, traffic splitting, request mirroring, redirect rules) into the spec rather than per-controller annotations.
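As an illustration of traffic splitting living in the spec rather than in annotations, the sketch below builds an HTTPRoute manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The helper http_route and every resource name in it (checkout, shop, public-gw, the two Services) are hypothetical.

```python
import json


def http_route(name, namespace, gateway, hostname, backends):
    """Build an HTTPRoute that splits traffic across weighted backends.

    backends is a list of (service_name, port, weight) tuples.
    """
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The Gateway this route attaches to (owned by the infra team).
            "parentRefs": [{"name": gateway}],
            "hostnames": [hostname],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": "/"}}],
                    # Relative weights; 90 and 10 gives a 90/10 split.
                    "backendRefs": [
                        {"name": service, "port": port, "weight": weight}
                        for service, port, weight in backends
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    route = http_route(
        "checkout", "shop", "public-gw", "shop.example.com",
        [("checkout-v1", 8080, 90), ("checkout-v2", 8080, 10)],
    )
    print(json.dumps(route, indent=2))
```

The application team owns only this object; the Gateway named in parentRefs stays with whoever operates the load balancer.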
East-west Gateway API. Originally a north-south spec, Gateway API now also covers service-to-service routing via the GAMMA (Gateway API for Mesh Management and Administration) initiative: an HTTPRoute with a Service as parentRef becomes mesh routing. Istio, Linkerd, and Kuma all support this; it is the path toward a single routing API across ingress and mesh.
19. Edge proxies and API gateways
Outside the mesh, at the edge, classic proxies and full API gateways still rule.
L7 proxies:
- Nginx — battle-tested, configuration via nginx.conf or templated; OSS and Plus.
- HAProxy — TCP and HTTP, very fast, ALPN/SNI-aware.
- Envoy — same proxy used inside meshes, also fine at the edge.
- Caddy — Go-native; automatic HTTPS via ACME by default.
- Traefik — auto-configuration from Docker / K8s labels; built-in Let’s Encrypt.
API gateways add API-product concerns — keys, plans, quotas, transformations, dev-portal:
- Kong Gateway — Lua + Nginx core (Kong Konnect commercial control plane).
- Tyk — Go-based, open source + commercial.
- Ambassador / Emissary-ingress — Envoy-based; Datawire.
- Apigee (Google) — full lifecycle, dev portals, monetization.
- MuleSoft Anypoint (Salesforce) — heavy enterprise, integrates with SaaS connectors.
- Cloud-native managed: AWS API Gateway (REST and HTTP APIs), Google Cloud API Gateway + Apigee, Azure API Management.
Gateway vs. mesh. A north-south gateway terminates client connections, handles auth, applies API-product policies, and routes to the cluster. The mesh handles east-west traffic between workloads inside the cluster. Modern deployments increasingly converge — Istio’s ingress Gateway is an Envoy that the same control plane manages; Cilium Gateway API uses the same datapath as in-cluster traffic. The split is operational rather than architectural.
WAF integration. Production gateways usually sit behind a Web Application Firewall — ModSecurity + Coraza (open WAF rule engine), AWS WAF, Cloudflare WAF, Akamai App & API Protector, Imperva. The WAF handles OWASP-top-10-style filtering; the gateway handles auth and routing; the mesh handles workload identity. Three rings, each with a narrower scope.
20. Observability for meshes
The mesh is in the unique position of seeing every request and being able to instrument it without app cooperation.
Distributed tracing. Envoy and other mesh proxies generate trace spans automatically, but the spans join into a single end-to-end trace only if the application propagates the trace headers. Standard header sets:
- W3C Trace Context — traceparent and tracestate. The standard since 2020.
- B3 (Zipkin) — X-B3-TraceId, X-B3-SpanId, X-B3-Sampled. Pre-W3C; still widely supported.
- Jaeger — uber-trace-id (Uber’s original).
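For the W3C header set, propagation in an uninstrumented application amounts to validating the inbound traceparent and copying it, together with tracestate, onto outbound calls. The sketch below handles only version 00 of the header and assumes lowercase header names; parse_traceparent and outbound_headers are hypothetical helpers, not part of any SDK.

```python
import re
from typing import Optional, Tuple

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outbound_headers(inbound: dict) -> dict:
    """Headers to attach to outbound calls made while serving this request."""
    value = inbound.get("traceparent", "")
    if parse_traceparent(value) is None:
        return {}  # nothing valid to continue; the proxy starts a new trace
    headers = {"traceparent": value.strip()}
    if "tracestate" in inbound:
        headers["tracestate"] = inbound["tracestate"]
    return headers


if __name__ == "__main__":
    inbound = {
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        "tracestate": "vendor=value",
    }
    print(outbound_headers(inbound))
```

An application that records its own spans would instead substitute a fresh parent-id for each outbound call, which is the part OpenTelemetry instrumentation normally takes care of.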
Backends: Jaeger, Zipkin, Tempo (Grafana), Honeycomb, Lightstep / ServiceNow Cloud Observability, AWS X-Ray, GCP Cloud Trace, Azure Monitor.
Metrics. Envoy ships RED metrics on /stats/prometheus. Linkerd-proxy ships its own Prometheus endpoint. Cilium and Hubble expose flow-level metrics. Aggregated via Prometheus + Grafana + Alertmanager, or shipped to Datadog, New Relic, Dynatrace, Splunk Observability, or GCP Cloud Monitoring.
Access logs. Envoy supports JSON-formatted access logs and ALS (Access Log Service) gRPC streaming. Linkerd’s access log support arrived later.
Mesh-specific dashboards.
- Kiali — Istio’s topology, traffic, and config dashboard.
- Hubble UI — Cilium’s service map and flow log viewer.
- Linkerd Viz — built-in dashboard with per-route metrics.
- Grafana Tempo / Loki / Mimir stack for tracing + logs + metrics.
Sampling and cost. Tracing every request at full fidelity is prohibitively expensive at scale. Three sampling strategies:
- Head-based — at the entry hop, flip a biased coin (e.g. 1% sample rate). Cheap but loses errors that are rare.
- Tail-based — buffer spans for the request’s full duration, then decide to keep based on outcome (drop fast successes, keep slow or errored requests). Requires a collector with memory headroom — OpenTelemetry Collector, Refinery (Honeycomb), Tempo’s tail-sampling.
- Probabilistic with overrides — sample at 0.1% baseline, 100% for error responses and known-slow routes.
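The three strategies differ mainly in when the decision is made and what it can see. A rough sketch of each decision function follows; the 500 ms slow threshold, the treatment of 5xx as "errored", and all function names are assumptions made for illustration, not the behavior of any particular collector.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the entry hop from the trace ID alone.

    Hashing the ID instead of rolling a die makes the decision repeatable,
    so every hop that sees the same trace ID agrees.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_keep(status_code: int, duration_ms: float, slow_ms: float = 500.0) -> bool:
    """Tail-based: decide after the request finishes, based on its outcome."""
    return status_code >= 500 or duration_ms >= slow_ms


def keep_with_overrides(trace_id, route, status_code,
                        baseline=0.001, slow_routes=frozenset()):
    """Probabilistic baseline, forced to 100% for errors and listed routes."""
    if status_code >= 500 or route in slow_routes:
        return True
    return head_sample(trace_id, baseline)


if __name__ == "__main__":
    kept = sum(head_sample(f"trace-{i}", 0.01) for i in range(100_000))
    print(f"head-based at 1% kept {kept} of 100000 traces")
    print(tail_keep(200, 12.0), tail_keep(503, 12.0), tail_keep(200, 900.0))
```

Only the tail-based function needs the finished request, which is why it requires buffering in a collector while the other two can run inline in the proxy.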
eBPF-based observability. Cilium Hubble, Pixie (Pixie Labs, acquired by New Relic and later contributed to the CNCF), Coroot, and Groundcover hook into the kernel via eBPF to observe HTTP and gRPC traffic without any sidecar or app instrumentation. Cost: kernel-version requirements (5.4+ generally, 5.10+ for full features), reduced visibility into encrypted traffic that uses libssl rather than the kernel’s TLS offload.
21. WebAssembly and filter chains
Envoy supports Proxy-Wasm, an ABI that lets you write filters in any language that targets WebAssembly (Rust, Go via TinyGo, AssemblyScript, C++). At config time you LoadWasmCode from a registry (a .wasm artifact pushed as an OCI artifact, increasingly common), and Envoy runs it in a per-worker Wasm VM (V8 or Wasmtime).
Istio surfaces this as WasmPlugin CRDs. Solo.io’s WasmEdge integration and wasm-image-spec allow Wasm modules to be distributed and signed like container images. Use cases: custom authn (e.g. SPIFFE+OIDC), header rewriting, transformation, audit-log enrichment, custom rate-limit logic.
The tradeoff: Wasm filters are isolated and language-flexible, but slower than native C++ filters (10–30%), and the ecosystem of pre-built filters is still maturing.
Distribution of Wasm modules. Modules are pushed to OCI registries using the application/vnd.module.wasm.content.layer.v1+wasm media type. wasm-to-oci or oras push builds the artifact; the mesh’s WasmPlugin resource references it by image URL. Signing follows the same cosign path as container images.
Where this is going. Per-tenant policy hooks (multi-tenant SaaS), specialized request inspection (PII scrubbing before logs leave the cluster), in-flight transformation (legacy XML to JSON), and per-tenant rate limiting are the use cases where Wasm filters shine — they let platform code be safely customized per workload without rebuilding the proxy.
22. Common pitfalls
- Latency tax. Each sidecar adds roughly 1–3 ms per hop; a service path with three hops crosses six sidecars (out-in-out-in-out-in). Tune keepalive and connection pools.
- Resource overhead. 1000 pods × 50–150 MB per Envoy = 50–150 GB of memory just for sidecars; non-trivial on small clusters. Linkerd-proxy or Ambient/Cilium materially reduce this.
- Configuration explosion. Istio’s CRD surface is large; VirtualService + DestinationRule + Sidecar + AuthorizationPolicy + PeerAuthentication interactions are subtle. Many teams maintain a thin internal abstraction over Istio CRDs.
- mTLS not actually on. PeerAuthentication defaults to PERMISSIVE, which accepts plaintext; STRICT is opt-in. Teams discover after a year that half their traffic is still cleartext because some workload was outside the mesh. Verify by running kubectl exec -- curl from a pod outside the mesh and confirming that the plaintext request is refused.
- Header propagation. Distributed tracing breaks unless the application copies trace headers from inbound requests to outbound calls. The mesh adds them at the entry hop but cannot read the app’s outbound semantics. Most language SDKs (OpenTelemetry instrumentation) handle this automatically; bare HTTP clients need manual propagation.
- Hairpin traffic. When service A calls service A, requests can bounce through the load balancer back to the calling pod, doubling cost and breaking some sticky-session models. Address via internalTrafficPolicy: Local or topology-aware routing.
- Sidecar startup race. The app starts before the sidecar is ready; outbound calls fail until the sidecar listens. Solutions: holdApplicationUntilProxyStarts: true in Istio, or use native sidecar containers (Kubernetes 1.29+ feature).
- Egress is harder than ingress. External (non-mesh) destinations need ServiceEntry plus DNS plus, often, TLS-origination at the egress proxy. Many teams give up and bypass the mesh for egress.
- Migration cost. Adopting a mesh on an existing application is an iterative project: start with one namespace in PERMISSIVE, prove observability and SLOs, move to STRICT, expand.
- Multi-cluster correctness. Multi-cluster mesh promises a single global service registry. Reality: cluster names leak into hostnames, locality-aware routing requires per-cluster topology labels, and trust-domain federation is operationally heavy.
- Upgrade complexity. Istio’s istiod and the data-plane Envoys must stay roughly in sync; usually one minor version of skew is supported. Canary upgrade via istio.io/rev revision labels is the safe pattern; in-place upgrades on a busy cluster have caused outages.
- Debugging is harder. A request now traverses two extra hops; failures can be in the app, the local sidecar, the destination sidecar, or the destination app. istioctl proxy-config, linkerd diagnostics, and cilium monitor exist precisely because reading raw Envoy admin output is unkind.
- No silver bullet for slow services. A mesh adds resilience features (retries, timeouts, circuit breakers) but cannot fix an application that is genuinely slow or buggy. Teams sometimes over-rotate on retry config when the actual fix is a database index or a goroutine leak.
- Pod-to-pod direct paths bypass the mesh. A workload that connects via pod IP rather than service DNS skips sidecar interception (or hits it on a different listener). NetworkPolicy or outboundTrafficPolicy: REGISTRY_ONLY is needed to enforce the mesh as the only path.
- Service mesh is not a security boundary alone. mTLS encrypts and authenticates traffic but does not prevent a compromised workload from calling everything its identity is authorized for. Pair the mesh with NetworkPolicy, with least-privilege AuthorizationPolicy, and with runtime security (Falco, Tetragon).
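The latency and memory figures in the first two pitfalls are simple multiplications, and writing them out makes it easy to substitute a cluster's own numbers. The defaults below are the ranges quoted above; the function names are hypothetical, and 1 GB is taken as 1000 MB to match the figures in the text.

```python
def sidecar_latency_ms(hops: int, per_sidecar_ms=(1.0, 3.0)):
    """Added latency range for a request path with the given number of hops.

    Each hop crosses two sidecars: the caller's outbound proxy and the
    callee's inbound proxy.
    """
    sidecars = 2 * hops
    low, high = per_sidecar_ms
    return sidecars * low, sidecars * high


def sidecar_memory_gb(pods: int, per_proxy_mb=(50.0, 150.0)):
    """Total sidecar memory range for a cluster, in GB (1 GB = 1000 MB)."""
    low, high = per_proxy_mb
    return pods * low / 1000, pods * high / 1000


if __name__ == "__main__":
    print("3 hops add (ms):", sidecar_latency_ms(3))
    print("1000 pods need (GB):", sidecar_memory_gb(1000))
```

Three hops therefore add somewhere between 6 and 18 ms under the quoted per-sidecar range, which is why deep call chains feel the mesh far more than shallow ones.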
Appendix A — Runtime security and detection
Beyond namespace and seccomp isolation, runtime security tooling watches containers for anomalous behavior.
- Falco (Sysdig, CNCF graduated) — kernel-event-based rule engine. Default ruleset detects shell-in-container, write to /etc, sensitive file reads, mount of host paths, privilege escalation. eBPF or kernel-module data sources.
- Tetragon (Isovalent/Cisco, part of Cilium project) — eBPF-based; can both observe and enforce, killing offending processes from the kernel before syscalls complete. Policy via TracingPolicy CRDs.
- Tracee (Aqua) — eBPF-based forensics; correlates events to detect known attack patterns.
- Sysdig Secure, Aqua CSP, Prisma Cloud (Palo Alto), Wiz Runtime — commercial runtime + posture platforms.
Detection rules typically cover: unexpected execve in a container (e.g. sh in an image that shouldn’t have one), outbound connections to unknown destinations, ptrace attaches, kernel-module loads, mounting of /proc/*/root, modification of binaries on disk inside a running container.
Appendix B — Image-distribution efficiency
Pulling large images is a major source of cold-start latency. Three techniques materially help:
- Lazy pulling / streaming. eStargz (containerd, Google), SOCI (AWS), and nydus (Alibaba) reorder layer contents so the runtime can stream individual files on first access rather than waiting for the whole layer. Cold start of large ML images drops from minutes to seconds.
- Layer deduplication at the registry. Content-addressable storage means two images sharing a base layer store it once. Cross-image layer reuse depends on consistent base-image hashes; pinning base images by digest matters.
- Peer-to-peer image distribution. Dragonfly (CNCF) and Kraken (Uber) build a P2P overlay across the cluster so the registry serves each layer once and nodes share among themselves. Saves egress and reduces pull times on large rollouts.
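Layer deduplication follows directly from addressing blobs by digest, as the toy store below shows. BlobStore is a hypothetical in-memory stand-in for a registry backend, and the byte strings stand in for layer tarballs.

```python
import hashlib


class BlobStore:
    """In-memory content-addressable store keyed by sha256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(digest, data)
        return digest

    def push_image(self, layers):
        """Store each layer and return the list of digests (a toy manifest)."""
        return [self.put(layer) for layer in layers]

    @property
    def blob_count(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    base = b"base image layer bytes"
    store = BlobStore()
    image_a = store.push_image([base, b"app A layer"])
    image_b = store.push_image([base, b"app B layer"])
    print("shared base digest:", image_a[0] == image_b[0])
    print("blobs stored:", store.blob_count)
```

If the two images had been built on base layers that differ by even one byte, the digests would differ and nothing would be shared, which is why pinning base images by digest matters for reuse.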
Appendix C — Pod-level networking inside Kubernetes
The Container Network Interface (CNI) spec defines how pods get network. Common CNI plugins shape what the mesh sees:
- Calico — BGP-based or VXLAN; rich NetworkPolicy; eBPF dataplane option.
- Cilium — eBPF-native; described above; also a CNI.
- Flannel — simplest VXLAN-based plugin; common in small clusters.
- AWS VPC CNI — assigns real ENI-backed IPs to pods; integrates with VPC security groups.
- Azure CNI, GKE Dataplane V2 (Cilium-based), Antrea (VMware, OVS-based).
When a mesh runs over a CNI like Cilium that also offers mesh features, operators choose whether to let Cilium handle L7 or to layer Istio/Linkerd on top. Layering works but means two policy planes; running mesh and CNI as one (Cilium service mesh) reduces moving parts.
Appendix D — Real-world adoption patterns
How organizations actually roll out containers and service mesh, distilled from the case studies and trade-press of the last few years.
Container adoption curve. Most teams start with docker build + a single VM. Phase two is a managed Kubernetes (EKS, GKE, AKS, or DigitalOcean) running stateless services, with stateful data still in managed cloud services (RDS, Cloud SQL, MemoryStore). Phase three is GitOps (Argo CD or Flux) plus image scanning + signing in CI. Phase four is multi-cluster (DR + region affinity) and platform-engineering investment — internal developer platforms like Backstage exposing self-service.
Mesh adoption curve. Few teams need a mesh on day one. The reasonable triggers are: (a) the org has more than ~30 services and observability is a real problem, (b) compliance requires encryption-in-transit across services and the team does not want to do mTLS by hand in every language, (c) cross-cluster service-to-service traffic is starting and needs a coherent identity model. Premature mesh adoption is a common pattern — a 5-service shop adopting Istio spends more time on Istio than on its product.
The “platform team” pattern. A small (3–10-person) platform team owns the cluster, the mesh, the registry, the CI/CD pipeline, and the developer experience. Application teams consume a stable, opinionated surface — Deployment, Service, HTTPRoute, maybe a custom AppDefinition CRD — and never touch raw mesh CRDs. The platform team handles upgrades, version pinning, and incident response on the substrate.
Migration patterns. Strangler-fig is most common — new services land in K8s with the mesh enabled, old monolith stays on VMs. The mesh extends to VMs via Istio WorkloadEntry or Consul agents so cross-environment mTLS is unbroken. Over months to years the monolith’s traffic share drops and it is retired or factored.
Failure modes to plan for. Cluster control-plane outage (managed K8s offerings have one nine fewer than the data plane), registry unavailability during rollout (pull-through caches help), control-plane-version skew during mesh upgrade, certificate-issuer outage (CA HA matters), and “everything is fine but slow” — invariably a misconfigured connection pool or retry budget.
Appendix E — Glossary
- CRI — Container Runtime Interface. gRPC API between Kubelet and the runtime (containerd, CRI-O).
- OCI — Open Container Initiative. The standards body and the specs.
- Sidecar — a container that runs alongside an application container in the same pod, sharing network and (optionally) storage.
- xDS — Envoy’s dynamic configuration APIs (LDS/RDS/CDS/EDS/SDS/ADS).
- SPIFFE — Secure Production Identity Framework For Everyone.
- SVID — SPIFFE Verifiable Identity Document.
- SBOM — Software Bill of Materials.
- SLSA — Supply-chain Levels for Software Artifacts.
- mTLS — mutual TLS; both client and server present certificates.
- Trust domain — SPIFFE namespace, typically one per organization or per cluster.
- Ambient mesh — Istio’s sidecarless architecture (ztunnel + waypoint).
- Waypoint proxy — Istio Ambient’s per-namespace L7 proxy.
- ztunnel — Istio Ambient’s per-node L4 + mTLS proxy.
- Hubble — Cilium’s observability layer.
- Tetragon — Cilium’s runtime-security layer.
- CNI — Container Network Interface.
- CSI — Container Storage Interface.
- Distroless — minimal base images with no shell or package manager.
- Rootless — running the container stack without host root.
- OCI artifact — non-image content (Helm chart, SBOM, Wasm module) pushed to an OCI registry with a custom media type.
- GAMMA — Gateway API for Mesh Management and Administration; service-mesh extensions to Gateway API.
- DPoP — Demonstration of Proof of Possession; sender-constrained tokens (RFC 9449).
- JWKS — JSON Web Key Set; the public keys an IdP publishes for JWT validation.
- eStargz / SOCI / nydus — lazy-pulling image formats.
- Manifest list / image index — multi-arch image bundling.
- Filter chain — Envoy’s ordered list of filters applied to a connection.
- Trust bundle — set of root CAs constituting a SPIFFE trust domain.
- Locality — region/zone topology metadata on K8s nodes.
- Outlier detection — Envoy’s mechanism for ejecting unhealthy upstream hosts.
Appendix F — When to skip the mesh
A useful counterpoint: not every cluster needs a mesh. Reasonable skip criteria:
- Fewer than ~20 services, all in one team’s hands; cross-cutting concerns can be solved with a shared library.
- Hard latency requirements (sub-millisecond p99 between services) that cannot tolerate a sidecar hop.
- Mesh certificate lifecycle would duplicate an existing PKI you cannot retire.
- Operations bandwidth is the binding constraint; the platform team is two people doing K8s + CI + on-call.
- Stack is mostly serverless functions (Lambda, Cloud Run) that the cloud already mTLSes.
Alternative paths to the same goals: cert-manager + a library for mTLS, kube-proxy + NetworkPolicy + Calico for identity-based policy, OpenTelemetry SDK in every service for traces and metrics, and a thin reverse proxy (Envoy, Caddy, Traefik) at the ingress. Less coherent than a mesh, sometimes the right call given org constraints.
23. Cross-references
- _index
- kubernetes-deep
- networking-foundations
- cryptography-fundamentals — mTLS, certificate rotation, SPIFFE SVIDs, Sigstore Rekor transparency log.
- observability-stack — Prometheus, OpenTelemetry, tracing backends.
24. Citations
- Open Container Initiative — Image Spec, Runtime Spec, Distribution Spec (github.com/opencontainers/image-spec, runtime-spec, distribution-spec).
- containerd documentation — containerd.io.
- CRI-O documentation — cri-o.io.
- Podman + Buildah — podman.io, buildah.io.
- Kubernetes 1.24 release notes (May 2022): dockershim removal.
- BuildKit documentation — docs.docker.com/build/buildkit.
- Cloud Native Buildpacks — buildpacks.io.
- ko — ko.build.
- Distroless — github.com/GoogleContainerTools/distroless.
- Trivy — aquasecurity.github.io/trivy.
- Syft + Grype — github.com/anchore/syft, anchore/grype.
- SBOM standards — spdx.dev, cyclonedx.org.
- Sigstore — sigstore.dev; cosign documentation.
- SLSA framework — slsa.dev.
- in-toto attestation framework — in-toto.io.
- SPIFFE + SPIRE — spiffe.io.
- cert-manager — cert-manager.io.
- Istio — istio.io; Ambient mesh announcement (2022) and GA (2024).
- Linkerd — linkerd.io; CNCF graduation announcement (July 2021).
- Cilium — cilium.io; CNCF graduation (October 2023); Hubble + Tetragon docs.
- Consul — developer.hashicorp.com/consul.
- Envoy — envoyproxy.io; xDS API documentation.
- Kubernetes Gateway API — gateway-api.sigs.k8s.io.
- BeyondCorp: A New Approach to Enterprise Security — Ward, Beyer; USENIX ;login: Dec 2014.
- W3C Trace Context recommendation — w3.org/TR/trace-context.
- B3 Propagation — github.com/openzipkin/b3-propagation.
- OpenTelemetry — opentelemetry.io.
- Proxy-Wasm ABI — github.com/proxy-wasm/spec.
- Firecracker — firecracker-microvm.github.io.
- gVisor — gvisor.dev.
- Kata Containers — katacontainers.io.