Kubernetes Deep — Compute Reference
1. At a glance
Kubernetes (K8s) is an open-source container-orchestration platform that automates deployment, scaling, and lifecycle management of containerized workloads across a cluster of machines. It grew out of the lessons Google learned operating its internal Borg and Omega cluster-management systems for over a decade, rebuilt as a new system written in Go. The project was open-sourced in 2014 by Joe Beda, Brendan Burns, and Craig McLuckie at Google; donated as the seed project to the Cloud Native Computing Foundation (CNCF) at its formation in 2015; reached 1.0 in July 2015; and was the first CNCF project to graduate in 2018.
It won the orchestration wars (against Docker Swarm, Apache Mesos / Marathon, Nomad, and earlier PaaS systems) for a small set of architectural reasons:
- Declarative API. Users submit YAML/JSON manifests describing desired state; controllers continuously reconcile observed state toward it. This is the opposite of imperative “run this command on this host” tooling.
- Controller / control-loop pattern. Every behavior in the system — Deployments rolling out, Services routing traffic, Pods being scheduled, volumes being mounted — is implemented by a controller that watches the API server and converges state. The pattern is uniform, extensible, and survivable: kill a controller, restart it, it picks up where it left off.
- Self-healing. Failed pods are rescheduled, unhealthy nodes are cordoned and drained, replica counts are maintained automatically. The system trends toward desired state rather than away from it.
- Pluggable everything. Container runtime (CRI), networking (CNI), storage (CSI), and authentication / authorization / admission all use well-defined interfaces. Vendors compete on implementations without forking the core.
- Extension surface. Custom Resource Definitions (CRDs) plus the operator pattern let you turn Kubernetes itself into the platform for managing arbitrary stateful systems — databases, message queues, ML pipelines.
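The declarative API and control-loop points above can be illustrated with a toy reconciler. This is a minimal sketch, not Kubernetes code: the `State` type, the action strings, and `run_until_converged` are hypothetical names chosen for the example, and a real controller would watch the API server instead of mutating a local object.

```python
from dataclasses import dataclass


@dataclass
class State:
    replicas: int


def reconcile(desired: State, observed: State) -> list[str]:
    """Return the actions needed to move observed state toward desired state."""
    diff = desired.replicas - observed.replicas
    if diff > 0:
        return ["create-pod"] * diff
    if diff < 0:
        return ["delete-pod"] * (-diff)
    return []  # already converged, nothing to do


def run_until_converged(desired: State, observed: State, max_iterations: int = 10) -> State:
    """Call reconcile repeatedly and apply each action to the observed state."""
    for _ in range(max_iterations):
        actions = reconcile(desired, observed)
        if not actions:
            break
        for action in actions:
            observed.replicas += 1 if action == "create-pod" else -1
    return observed
```

Because `reconcile` only compares desired and observed state, it can be restarted at any point and still converge, which is the property the control-loop bullet describes.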
Core abstractions every Kubernetes user must internalize:
- Pod — the smallest deployable unit; one or more containers sharing a network and storage namespace.
- Deployment — declarative manager of stateless replicas with rolling updates.
- Service — stable virtual IP / DNS name that load-balances traffic to a set of pods.
- ConfigMap — non-secret key/value configuration injected as env vars, files, or CLI args.
- Secret — base64-encoded sensitive data (not encrypted at rest by default — see Section 8).
- PersistentVolume / PersistentVolumeClaim — durable storage decoupled from pod lifetime.
- Namespace — soft multi-tenancy boundary for naming and quotas.
- Node — a worker machine (VM or bare-metal) that runs pods.
As of mid-2024, Kubernetes 1.30 is the stable release line; clusters in the field typically run 1.27 – 1.30, with a new minor release roughly every 4 months and a roughly 14-month support window per release. Every major cloud provider ships managed offerings (AWS EKS, Google GKE Autopilot/Standard, Azure AKS, DigitalOcean DOKS, Linode LKE, Oracle OKE, IBM IKS). Red Hat OpenShift is the dominant enterprise distribution. Lightweight distributions (k3s, k0s, MicroK8s, kind, minikube) span IoT/edge through dev laptops. Kubernetes is now the de-facto default substrate for cloud-native applications; the question for new platform builds is usually which Kubernetes, not whether.
2. Architecture
A Kubernetes cluster consists of a control plane and a set of worker nodes. Production clusters run multiple control-plane replicas behind a load balancer for HA; managed services hide this entirely.
2.1 Control plane
- kube-apiserver — the REST gateway and only component that talks to etcd. All other components (controllers, scheduler, kubelet, kubectl, operators, CI/CD) read and write cluster state through it. Horizontally scalable: stateless behind a load balancer, fronted by mTLS. Implements admission control (see Section 11), validation, defaulting, and conversion between API versions. The watch protocol streams change events to clients, enabling the controller pattern.
- etcd — a strongly-consistent distributed key-value store built on the Raft consensus algorithm. Stores all cluster state: objects, secrets, leases, events. Typically run as a 3- or 5-node odd-sized cluster (Raft tolerates floor((n-1)/2) failures, so an even member count adds no extra fault tolerance). etcd is the single most performance-sensitive component; slow disks or network hops directly throttle the API. Production etcd needs SSD storage, dedicated nodes separate from the API servers at scale, regular defragmentation, and encryption at rest.
- kube-scheduler — watches for unscheduled pods and binds each to a node by evaluating node fitness (filtering) and ranking (scoring). Extensible via the Scheduling Framework — plugins implement Filter, Score, PreBind, Bind extension points. Multiple schedulers can run side-by-side; pods opt in via spec.schedulerName.
- kube-controller-manager — a single binary hosting dozens of built-in controllers: Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Node, Endpoint, ServiceAccount, Namespace, PV/PVC, ResourceQuota, HPA, Lease, GC, TTL. Each watches its resource type via the API server and reconciles. Leader election via Lease objects ensures only one replica is active at a time.
- cloud-controller-manager — separates cloud-vendor-specific code (LoadBalancer provisioning, Node lifecycle tied to VM lifecycle, Route management) from the core kube-controller-manager. Each cloud (AWS, GCP, Azure, vSphere, OpenStack) ships its own implementation. The split was finalized in 1.21 and enables cloud-agnostic core releases.
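The etcd sizing advice above follows from Raft's majority rule. The arithmetic can be sketched as below; the function names are illustrative and not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(n, quorum(n), tolerated_failures(n))
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd sizes are the usual recommendation.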
2.2 Node components
- kubelet — the per-node agent. Watches the API server for pods assigned to its node, instructs the container runtime to start/stop containers, mounts volumes, runs probes (liveness/readiness/startup), reports node and pod status back to the API server. Communicates with the runtime via the Container Runtime Interface (CRI) gRPC API.
- kube-proxy — programs the node’s networking dataplane (iptables, IPVS, or nftables rules) to implement Service abstraction: virtual ClusterIPs become DNAT rules that round-robin to pod endpoints. Increasingly displaced by eBPF-based dataplanes (Cilium) that bypass iptables entirely.
- Container runtime — implements CRI. The historical default was Docker, but Kubernetes deprecated the Docker shim (dockershim) in 1.20 (Dec 2020) and removed it in 1.24 (May 2022). Modern clusters run containerd (the runtime extracted from Docker, donated to CNCF) or CRI-O (Red Hat / OpenShift). Both invoke runc (or alternative low-level runtimes — gVisor, Kata Containers, Firecracker via Kata) to actually start containers per the OCI runtime spec.
2.3 Add-ons
These are technically optional but every real cluster runs them:
- CoreDNS — cluster DNS server; resolves Service names to ClusterIPs. Replaced kube-dns since 1.13. Configurable via the Corefile ConfigMap; supports plugins (rewrite, forward, cache, prometheus).
- CNI plugin — implements pod networking per the Container Networking Interface spec. Popular choices: Calico (BGP-based, mature, NetworkPolicy support), Cilium (eBPF dataplane, advanced NetworkPolicy + service mesh + observability), Flannel (simple VXLAN overlay), Weave Net (legacy), AWS VPC CNI / Azure CNI / GKE Dataplane V2 (cloud-native, pod IPs from VPC).
- metrics-server — lightweight aggregator for resource metrics (CPU/memory) from kubelets, feeds HPA and kubectl top.
- CSI drivers — pluggable storage (Section 6).
- Ingress controller — HTTP/HTTPS routing (Section 5).
- Cluster autoscaler / Karpenter — node-level scaling (Section 12).
3. The Pod abstraction
A Pod is the smallest unit Kubernetes schedules. It is one or more containers that share:
- A network namespace — same IP, same port space, communicate over localhost.
- A set of volumes — mounted into one or more containers.
- A PID namespace (optional via shareProcessNamespace: true).
- A lifecycle — all containers in a pod live and die together on the same node.
The “1+ containers” wording matters: most pods are single-container, but multi-container pods are a critical pattern for sidecars and adapters.
3.1 Multi-container patterns
- Init containers — run sequentially before app containers start, to completion. Used for schema migrations, secret bootstrapping, DNS warm-up, fetching config from a remote source, waiting for dependencies. Each init container must exit 0 before the next runs; failure restarts the whole sequence (per restart policy).
- Sidecar containers — long-running helpers alongside the main app: log shippers (Fluent Bit), metrics exporters, service-mesh proxies (Envoy/istio-proxy/linkerd-proxy), TLS terminators. Kubernetes 1.28 introduced native sidecar containers (restartPolicy: Always on an init container) so they start before and stop after the main containers — fixing long-standing shutdown ordering bugs.
- Ambassador pattern — a proxy container brokers connections from the app to an external system (e.g. an oauth2-proxy in front of an unauthenticated app, or a connection-pooled DB proxy like PgBouncer).
- Adapter pattern — a container reshapes the output of the main container so it fits a standard interface (e.g. translating app-specific metrics into Prometheus format).
3.2 Lifecycle and phases
A pod’s status.phase progresses through:
- Pending — accepted by the API but not yet scheduled, or scheduled but containers still pulling images / running init.
- Running — at least one container is up.
- Succeeded — all containers exited 0 (Job semantics).
- Failed — at least one container exited non-zero with no further restarts.
- Unknown — node lost contact.
Pod-level conditions (PodScheduled, Initialized, ContainersReady, Ready) give finer-grained state. Containers within a pod have their own state (Waiting / Running / Terminated) with reason fields (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Completed).
3.3 Restart policies
Set on spec.restartPolicy:
- Always (default for Deployment/ReplicaSet/DaemonSet) — restart containers on any exit.
- OnFailure — restart only on non-zero exit. Jobs must set either OnFailure or Never explicitly; Always is not accepted in a Job's pod template.
- Never — exit codes are terminal.
Restart is a kubelet-local concept: the kubelet restarts containers in-place. Crash-loop backoff is exponential, capped at 5 minutes.
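The crash-loop backoff described above can be sketched as a capped doubling sequence. The 5-minute cap comes from the text; the 10-second initial delay and the doubling factor are assumptions based on the commonly documented kubelet defaults, and `crash_loop_delays` is a hypothetical helper name.

```python
def crash_loop_delays(restarts: int, initial: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each successive restart (assumed: doubling from `initial`)."""
    delays = []
    delay = initial
    for _ in range(restarts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= 2
    return delays
```

For example, `crash_loop_delays(7)` returns `[10, 20, 40, 80, 160, 300, 300]` under these assumptions.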
3.4 QoS classes
Kubernetes assigns each pod a Quality-of-Service class based on resource specs:
- Guaranteed — every container sets requests equal to limits for both CPU and memory. Last to be evicted; eligible for static CPU pinning under the CPU Manager.
- Burstable — at least one container sets a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria. Can use more than its request when the node is uncontended.
- BestEffort — no requests or limits anywhere. First to be evicted under memory pressure; treated as cattle.
QoS feeds the eviction manager and the OOM scoring on Linux: BestEffort pods get the highest oom_score_adj, Guaranteed the lowest.
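The classification rules above can be expressed as a small function. This is a simplified sketch under stated assumptions: containers are plain dicts with optional `requests` and `limits` maps, quantities are compared as strings (so "1" and "1000m" would count as different), and a missing request is treated as equal to the limit when a limit is set.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort from its containers' resources."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Assumption: an unset request defaults to the limit when a limit exists.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod with a single container that sets only `requests.cpu` comes out as Burstable, while a pod whose containers set nothing at all is BestEffort.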
4. Higher-level workloads
You almost never create raw Pods. You create a controller that creates Pods on your behalf.
- Deployment — the workhorse for stateless services. Owns a ReplicaSet, which owns Pods. Supports rolling updates (maxSurge, maxUnavailable), rollback (kubectl rollout undo), pause/resume, history limits. The right primitive for ~90% of web apps and APIs.
- StatefulSet — provides stable network identity (ordinal-suffixed pod names: mysql-0, mysql-1), stable persistent storage (one PVC per ordinal, retained on rescheduling), ordered deployment, scaling, and termination. Required for databases, distributed consensus systems (Cassandra, Kafka, Elasticsearch, etcd), and anything where individual pod identity matters. The volumes outlive the pods.
- DaemonSet — runs exactly one pod per matching node. Used for per-node daemons: log shippers (Fluent Bit), metric exporters (node-exporter), CNI agents (Calico, Cilium), CSI node plugins, GPU drivers, security agents (Falco), service-mesh sidecars in some topologies.
- Job — runs a pod (or parallel pods) to completion. Configurable parallelism, completion count, backoff limit, active deadline. Use for batch workloads, schema migrations, data processing.
- CronJob — schedules Jobs on a cron expression. Concurrency policy (Allow/Forbid/Replace), starting deadline, history limits. Note: missed jobs and clock skew are footguns — set startingDeadlineSeconds.
- ReplicaSet — the primitive Deployment owns. Maintains n pod replicas matching a label selector. You should not create ReplicaSets directly; always use Deployment so you get rollout history and update strategy.
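To make the Deployment rolling-update knobs concrete, the sketch below computes how many pods may exist or be unavailable during a rollout. It assumes the documented conventions that percentage values of maxSurge round up and percentage values of maxUnavailable round down, and that both default to 25%; `resolve` and `rollout_bounds` are hypothetical helper names.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge
```

With 10 replicas and the assumed defaults, `rollout_bounds(10)` gives `(8, 13)`: at least 8 pods stay available and at most 13 exist at once.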
5. Service, ingress, and networking
Pod IPs are ephemeral and unpredictable. The Service abstraction provides stable virtual endpoints.
5.1 Service types
- ClusterIP (default) — virtual IP routable only inside the cluster. The basic L4 service: traffic to <svc>.<ns>.svc.cluster.local is load-balanced across selected pods.
- NodePort — exposes the service on every node at a static port in the 30000–32767 range. Used for bare-metal clusters without external load balancers, or as a stepping stone for an Ingress.
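- LoadBalancer — provisions an external L4 cloud load balancer (for example an AWS NLB or Classic Load Balancer, a GCP network load balancer, or an Azure Load Balancer) that typically points at the NodePort. Requires the cloud-controller-manager or an equivalent provider controller. L7 load balancers such as AWS ALB are normally provisioned through Ingress (Section 5.4) instead.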
- ExternalName — DNS CNAME alias to an external host; pure DNS, no proxying.
- Headless (clusterIP: None) — no virtual IP; DNS returns A records for each pod directly. Used by StatefulSets so each pod gets its own DNS name.
5.2 kube-proxy modes
- iptables (default) — kube-proxy programs iptables rules; a backend is picked at random using per-rule probabilities. Scales to thousands of services, but rule-evaluation cost grows linearly with the number of rules.
- IPVS — Linux Virtual Server kernel module; hash-table-based, O(1) lookup, multiple LB algorithms (rr, lc, dh, sh). Better at very high service counts.
- eBPF — Cilium replaces kube-proxy entirely; service lookup is an eBPF program attached to socket / TC hooks. Avoids iptables conntrack altogether.
- nftables — the newer Linux netfilter front-end and successor to iptables; available as a kube-proxy mode since 1.29 (alpha), with beta following in 1.31.
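The probability-based selection in iptables mode can be checked with a little arithmetic. The sketch assumes the usual scheme in which rules are evaluated in order and rule i (counting from zero) over n endpoints matches with probability 1/(n-i); the function names are illustrative only.

```python
from fractions import Fraction


def rule_probabilities(endpoints: int) -> list[Fraction]:
    """Per-rule match probability when rules are evaluated in order."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_shares(endpoints: int) -> list[Fraction]:
    """Overall share of traffic each endpoint receives."""
    shares = []
    remaining = Fraction(1)  # traffic that has not matched an earlier rule
    for probability in rule_probabilities(endpoints):
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares
```

For three endpoints the rule probabilities are 1/3, 1/2, and 1, yet every endpoint ends up with an equal 1/3 share, which is why the sequential rule chain still balances evenly.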
5.3 DNS
CoreDNS is mandatory in modern clusters. Resolution rules:
- <svc>.<ns>.svc.cluster.local — Service A record (ClusterIP) or pod A records (headless).
- <pod-ip-dashed>.<ns>.pod.cluster.local — per-pod records (rarely used).
- Short names resolve via search-path expansion configured by kubelet (<svc> → <svc>.<current-ns>.svc.cluster.local).
- External names forwarded to upstream resolvers via CoreDNS forward plugin.
NodeLocal DNSCache is a DaemonSet that runs a per-node CoreDNS to cut DNS latency and reduce kube-dns load.
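The search-path expansion mentioned in the resolution rules can be sketched as follows. The example assumes the search list and `ndots:5` option that the kubelet typically writes into a pod's resolv.conf, and it ignores any search domains inherited from the node; `candidate_names` is a hypothetical helper, not a resolver API.

```python
def candidate_names(name: str, namespace: str,
                    cluster_domain: str = "cluster.local", ndots: int = 5) -> list[str]:
    """Names a stub resolver would try, in order, for a lookup from a pod in `namespace`."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name as fully qualified
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as given first
    return expanded + [name]
```

Looking up `db` from namespace `prod` tries `db.prod.svc.cluster.local` first, while an external name such as `example.com` is only tried as-is after the three cluster suffixes fail, which is one source of the extra DNS traffic that NodeLocal DNSCache helps absorb.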
5.4 Ingress
The Ingress resource provides L7 HTTP/HTTPS routing into the cluster — hostnames, paths, TLS termination — implemented by a separately-deployed Ingress controller. Popular controllers:
- ingress-nginx (Kubernetes project) — NGINX-based, most widely deployed.
- Traefik — Go-based, auto-discovers services.
- HAProxy Ingress — HAProxy-based.
- Envoy-based: Contour (VMware), Emissary-ingress (formerly Ambassador).
- Cloud-native: AWS Load Balancer Controller (ALB), GCE Ingress, Azure Application Gateway Ingress Controller.
5.5 Gateway API
The Ingress resource was always under-specified — many vendor annotations, weak multi-protocol support, no clear separation between infrastructure and application owners. The Gateway API (GA in 2023) supersedes it with a richer model: GatewayClass (infra type), Gateway (the listener), HTTPRoute / TCPRoute / GRPCRoute / TLSRoute / UDPRoute (the rules). It cleanly separates cluster-admin concerns from app-team concerns, supports cross-namespace routing, and is the direction most projects are moving. Existing Ingress is still supported for the foreseeable future but new designs should start with Gateway API.
5.6 NetworkPolicy
L3/L4 firewall rules between pods. Default-allow (no policy) is the K8s default; you opt into restrictiveness by defining policies. Selectors specify source and destination pods or namespaces; ports specify allowed protocols and ports. NetworkPolicy is a spec; enforcement requires a CNI that implements it — Calico, Cilium, Antrea, Weave. Flannel does not implement NetworkPolicy. Cilium adds an extended CiliumNetworkPolicy that supports L7 rules (HTTP methods/paths, Kafka topics, DNS names).
5.7 Service mesh
A service mesh adds a sidecar (or eBPF-based) proxy to every pod, providing:
- mTLS everywhere with automatic cert rotation.
- Identity-based authorization (SPIFFE / SPIRE).
- Traffic management: traffic splitting, retries, timeouts, circuit breakers, fault injection.
- Observability: per-request metrics, distributed tracing headers, access logs.
Major implementations:
- Istio — Envoy sidecars; control plane (istiod) handles config, certs, telemetry. Sidecarless mode (Ambient Mesh) introduced 2022.
- Linkerd — Rust-based micro-proxy (linkerd2-proxy); designed for simplicity and low overhead.
- Consul Connect — HashiCorp; Envoy-based.
- Cilium Service Mesh — eBPF-based, no sidecars; integrates with Cilium CNI.
- AWS App Mesh — Envoy-based managed.
6. Storage
6.1 PV / PVC
The PersistentVolume / PersistentVolumeClaim pair decouples storage provisioning from consumption:
- A PersistentVolume (PV) is a cluster-level resource representing a piece of durable storage — backed by an EBS volume, GCE PD, Azure Disk, NFS share, Ceph RBD, etc. Has capacity, access modes (RWO, ROX, RWX, RWOP), reclaim policy (Retain, Delete, Recycle-deprecated).
- A PersistentVolumeClaim (PVC) is a namespace-scoped request for storage with a size and access mode. The PV controller binds claims to volumes.
- A pod mounts PVCs as volumes — never references a PV directly.
6.2 StorageClass
A StorageClass parameterizes dynamic provisioning. Setting storageClassName on a PVC and pointing it at a class triggers the corresponding CSI driver to provision a PV on demand. Parameters per class: filesystem type, replication, IOPS tier, encryption, zone constraints, reclaim policy. Each cluster typically has a default class marked with the storageclass.kubernetes.io/is-default-class annotation.
6.3 CSI — Container Storage Interface
CSI is the pluggable storage standard (replaced the old in-tree volume plugins, fully removed by 1.27+ in waves). A CSI driver runs as a controller Deployment + per-node DaemonSet implementing the CSI gRPC API (CreateVolume, DeleteVolume, ControllerPublish, NodeStage, NodePublish, etc.).
Notable drivers:
- AWS EBS CSI (block) and AWS EFS CSI (NFS) — official AWS drivers.
- GCP PD CSI and GCP Filestore CSI — Google.
- Azure Disk CSI and Azure File CSI — Microsoft.
- OpenEBS — cloud-native block / local-PV provisioner with Mayastor, Jiva, cStor backends.
- Rook + Ceph — Rook is an operator that runs Ceph on Kubernetes; provides RWO/RWX block, filesystem, and object storage.
- Longhorn (Rancher) — distributed block storage with replication, snapshots, backup to S3.
- Portworx — commercial, enterprise-focused.
- TopoLVM — local LVM provisioner with capacity-aware scheduling.
6.4 Volume types
- emptyDir — ephemeral scratch space tied to pod lifetime; lives on node disk or tmpfs.
- hostPath — mounts a path from the node filesystem; security risk, use sparingly (privileged operators, log shippers).
- ConfigMap / Secret — projected as files into the pod.
- projected — combines multiple sources (ConfigMap, Secret, downwardAPI, serviceAccountToken) into one volume mount.
- Generic Ephemeral Volumes — CSI-backed ephemeral volumes; like emptyDir but on a storage class.
- CSI Inline Ephemeral — driver-managed ephemeral storage (secrets-store-csi-driver mounts secrets from Vault/SM here).
6.5 StatefulSet integration
A StatefulSet’s volumeClaimTemplates automatically creates a PVC per pod ordinal, retained across pod restarts and rescheduling. Pods are bound to their volumes by name (data-mysql-0 always attaches to mysql-0). PVCs survive pod deletion; deleting a StatefulSet does not delete PVCs (use kubectl delete pvc or set retention policy). This is the right primitive for storage-stateful systems.
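The pod-to-volume naming convention described above is mechanical enough to sketch. The helper below is hypothetical and only reproduces the naming pattern from the text (`data-mysql-0` for pod `mysql-0`); it does not talk to a cluster.

```python
def statefulset_objects(name: str, replicas: int, claim_template: str) -> list[tuple[str, str]]:
    """Return (pod name, PVC name) pairs for each ordinal of a StatefulSet."""
    return [
        (f"{name}-{ordinal}", f"{claim_template}-{name}-{ordinal}")
        for ordinal in range(replicas)
    ]
```

Calling `statefulset_objects("mysql", 3, "data")` yields `("mysql-0", "data-mysql-0")` through `("mysql-2", "data-mysql-2")`, so a rescheduled pod with the same ordinal finds the same claim again.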
7. Scheduling
7.1 The default scheduler
The kube-scheduler runs two phases:
- Filtering — eliminate nodes that cannot run the pod (resource fit, taint tolerations, node selector, affinity rules, volume topology).
- Scoring — rank remaining nodes by plugins (least-allocated, balanced-allocation, image-locality, inter-pod affinity, topology-spread).
The result is a Pod-to-Node binding written via the API. The scheduler is pluggable via the Scheduling Framework (1.19+) with extension points: PreFilter, Filter, PostFilter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, PostBind. Multiple schedulers can coexist; pods select via spec.schedulerName.
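The two phases can be illustrated with a toy scheduler. This is a simplified sketch, not the real plugin code: the `Node` type is hypothetical, filtering checks only CPU and memory fit, and scoring approximates a least-allocated strategy by averaging the free fraction of each resource after placement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_free: int       # millicores still unreserved
    mem_free: int       # MiB still unreserved
    cpu_capacity: int
    mem_capacity: int


def filter_nodes(nodes: list[Node], cpu: int, mem: int) -> list[Node]:
    """Filtering: keep only nodes with enough unreserved CPU and memory."""
    return [n for n in nodes if n.cpu_free >= cpu and n.mem_free >= mem]


def least_allocated_score(node: Node, cpu: int, mem: int) -> float:
    """Scoring: higher when more of the node stays free after placing the pod."""
    cpu_fraction = (node.cpu_free - cpu) / node.cpu_capacity
    mem_fraction = (node.mem_free - mem) / node.mem_capacity
    return 100 * (cpu_fraction + mem_fraction) / 2


def schedule(nodes: list[Node], cpu: int, mem: int) -> Optional[str]:
    """Return the chosen node name, or None if the pod stays Pending."""
    feasible = filter_nodes(nodes, cpu, mem)
    if not feasible:
        return None
    return max(feasible, key=lambda n: least_allocated_score(n, cpu, mem)).name
```

The real scheduler runs many filter and score plugins and combines weighted scores, but the shape is the same: eliminate infeasible nodes first, then rank the remainder.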
7.2 Node selectors and affinities
- nodeSelector — simplest form; require labels on the node (nodeSelector: {disktype: ssd}).
- nodeAffinity — richer expressions with operators (In, NotIn, Exists, Gt, Lt) and requiredDuringSchedulingIgnoredDuringExecution / preferredDuringSchedulingIgnoredDuringExecution modes (hard vs soft).
- podAffinity / podAntiAffinity — place pods near or away from other pods based on label selectors. Topology key (e.g. topology.kubernetes.io/zone, kubernetes.io/hostname) defines the unit of “near”. Anti-affinity is critical for spreading replicas across nodes and zones.
7.3 Taints and tolerations
A taint on a node repels pods that don’t tolerate it: key=value:effect where effect is NoSchedule, PreferNoSchedule, or NoExecute (also evicts running pods). Used for dedicated nodes (GPU, large-memory), node-problem isolation, and graceful drain (node.kubernetes.io/unschedulable). Pods set tolerations to bypass.
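The matching rule between taints and tolerations can be sketched as below. Taints and tolerations are modeled as plain dicts for illustration; the sketch assumes the documented semantics that an empty toleration effect matches any effect and that operator Exists with an empty key tolerates every taint, and it treats PreferNoSchedule as non-blocking because it is only a soft preference.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if this toleration matches this taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list[dict], taints: list[dict]) -> bool:
    """True if every hard taint on the node is tolerated by the pod."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)
```

A pod with no tolerations is therefore kept off a node tainted `gpu=true:NoSchedule`, while a pod tolerating that key and value can land there.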
7.4 Topology spread constraints
topologySpreadConstraints lets you spread pods evenly across zones, nodes, or any topology label, with a maxSkew parameter. Cleaner than pod anti-affinity for the common case of “spread my replicas across zones”.
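The maxSkew check can be sketched numerically. The example assumes a hard constraint (whenUnsatisfiable: DoNotSchedule) and a `counts` map that lists every eligible topology domain, including empty ones; the helper names are hypothetical.

```python
def skew_if_placed(counts: dict[str, int], domain: str) -> int:
    """Skew of `domain` after adding one pod, relative to the current global minimum."""
    return counts[domain] + 1 - min(counts.values())


def eligible_domains(counts: dict[str, int], max_skew: int = 1) -> list[str]:
    """Domains where placing one more matching pod keeps skew within max_skew."""
    return [d for d in counts if skew_if_placed(counts, d) <= max_skew]
```

With two pods in zone-a and one each in zone-b and zone-c, a new pod with `maxSkew: 1` may only go to zone-b or zone-c, because placing it in zone-a would raise the skew to 2.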
7.5 Resource requests and limits
Every container should set:
- requests.cpu / requests.memory — what the scheduler reserves on a node.
- limits.cpu / limits.memory — kernel-enforced ceilings (cgroup v1 / v2).
CPU limits are typically enforced via CFS bandwidth; this can introduce latency due to throttling, leading to the contentious debate around dropping CPU limits in favor of requests-only. Memory limits trigger OOMKill at the cgroup; OOM-killed containers restart per restart policy. Burstable vs Guaranteed QoS depends on requests/limits.
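The latency effect of CFS throttling is easier to see with numbers. This sketch assumes the common 100 ms CFS period, a single-threaded burst that starts at the beginning of a period, and no other contention; both helper names are hypothetical.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time in microseconds the cgroup may use per CFS period."""
    return limit_millicores * period_us // 1000


def wall_clock_ms(work_ms: int, limit_millicores: int, period_ms: int = 100) -> int:
    """Wall-clock time for one thread to finish `work_ms` of CPU work under the limit."""
    quota_ms = limit_millicores * period_ms // 1000
    if quota_ms <= 0:
        raise ValueError("limit too small for this simplified model")
    if work_ms <= 0:
        return 0
    if quota_ms >= period_ms:
        return work_ms  # one thread cannot exhaust a quota of a full core or more
    full_periods, remainder = divmod(work_ms, quota_ms)
    if remainder == 0:
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder
```

Under a 500m limit, 120 ms of CPU work takes 220 ms of wall-clock time in this model, because the thread is paused for the second half of each of the first two periods.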
7.6 Priority and preemption
PriorityClass objects define numeric priority values. Higher-priority pods that can’t be scheduled trigger preemption — the scheduler evicts lower-priority pods to make room. Use sparingly; standard advice is system-critical workloads at high priority, default workloads at zero, and best-effort batch at negative.
7.7 Pod Disruption Budgets
A PDB constrains how many pods of a selected set can be voluntarily disrupted (minAvailable / maxUnavailable) — drains, evictions, node upgrades. Voluntary disruptions respect PDB; involuntary (node crash, kernel panic) do not. Always set PDBs for any workload where availability matters, especially when running cluster autoscalers or node-upgrade controllers.
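The budget arithmetic can be sketched as follows. It is a simplified model that takes absolute numbers only (percentages are not handled) and assumes the number of currently healthy pods is known; `allowed_disruptions` is a hypothetical helper.

```python
from typing import Optional


def allowed_disruptions(healthy: int, desired: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """Voluntary evictions permitted right now under a PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = desired - max_unavailable
    return max(0, healthy - required_healthy)
```

With three desired replicas and `minAvailable: 2`, one eviction is allowed while all three are healthy and none once a pod is already down, which is what makes a node drain wait.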
8. Configuration and secrets
8.1 ConfigMap and Secret
- ConfigMap — non-sensitive key/value config; injected as env vars (envFrom), single env values (valueFrom.configMapKeyRef), or mounted as files. File mounts update live as the ConfigMap changes (with kubelet sync delay).
- Secret — same shape, but base64-encoded and tagged as “secret” for tooling. By default Secrets are stored in etcd in plaintext (base64 is encoding, not encryption). Encryption at rest must be enabled explicitly via EncryptionConfiguration on the API server (AES-CBC, AES-GCM, KMS provider). Managed clusters usually enable this by default — verify on yours.
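The point that base64 is an encoding rather than encryption is easy to demonstrate with the standard library. The helper names and the sample value are made up for the example.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret manifest's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is involved."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the Secret object can call the equivalent of `decode_secret_value`, so access control on Secrets and encryption at rest carry the actual protection.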
8.2 External secret management
For real production:
- External Secrets Operator (ESO) — syncs secrets from HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, AWS Parameter Store, etc., into Kubernetes Secret objects, with refresh intervals.
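- Sealed Secrets (Bitnami) — encrypts a Secret into a SealedSecret custom resource using the controller's public key; the result is safe to commit to git. Only the in-cluster controller, which holds the private key, can decrypt it and create the Secret.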
- SOPS (Mozilla) + kustomize-sops / helm-secrets — encrypt files at rest with age / PGP / KMS; decrypt at deploy time.
- Secrets Store CSI Driver — mounts secrets directly from external stores into pods as files via inline CSI ephemeral volumes; avoids Kubernetes Secret object entirely.
- Vault Agent Injector — HashiCorp Vault injects a sidecar that templates secrets into shared emptyDir at startup.
8.3 Downward API
Pods can read their own metadata (name, namespace, labels, annotations, IP, requests/limits, service account) via:
- Env vars: valueFrom.fieldRef / valueFrom.resourceFieldRef.
- Volumes: downwardAPI volume source.
Used for self-aware logging, metrics labeling, and cluster-aware bootstrapping.
8.4 Projected volumes
A single mount that combines ConfigMap, Secret, downwardAPI, and serviceAccountToken sources. ServiceAccountToken projection is critical: it produces short-lived, audience-scoped JWTs (rather than the legacy long-lived secrets in default-token-*), which underpins workload identity (Section 11).
9. Observability and reliability
9.1 Logging
Container stdout/stderr is captured by the runtime and written to node files (typically /var/log/containers/). A DaemonSet log shipper reads these and forwards to a backend:
- Fluent Bit — lightweight C-based; the de-facto cloud-native shipper.
- Fluentd — Ruby/C; richer plugins, heavier.
- Vector (Datadog) — Rust-based; high throughput.
- Promtail — Loki-specific shipper.
Backends:
- Loki (Grafana) — log aggregation by labels, cheap object storage backend.
- Elasticsearch / OpenSearch — full-text search at scale.
- Cloud logging — CloudWatch Logs, Google Cloud Logging, Azure Monitor.
9.2 Metrics
The dominant stack:
- Prometheus — pull-based time series DB, scrapes /metrics endpoints.
- kube-state-metrics — exposes object state (Deployment status, PVC phase, Pod conditions) as metrics.
- node-exporter — node-level CPU/memory/disk/network.
- cAdvisor (built into kubelet) — per-container resource metrics.
- Grafana — dashboards.
- Alertmanager — alert routing, deduplication, silencing.
For scale: Thanos or Cortex / Mimir provide HA, long-term storage, and multi-cluster query federation. Victoria Metrics is a popular Prometheus-compatible alternative with better storage efficiency.
9.3 Tracing
- OpenTelemetry — vendor-neutral instrumentation + SDK + Collector. Auto-instrumentation operators inject SDK into JVM, Python, Node, .NET pods.
- Jaeger — CNCF-graduated tracing backend.
- Grafana Tempo — object-storage-backed traces.
- Zipkin — older, Twitter-origin.
9.4 Probes
Three kubelet-driven health checks per container:
- Liveness — failure restarts the container. Don’t confuse with “is the app correct”; this is “is it deadlocked / wedged”.
- Readiness — failure removes the pod from Service endpoints (and from Ingress backend pools) without restarting. Used for startup warm-up, dependency loss, deliberate drain.
- Startup — disables liveness/readiness until it succeeds; gives slow-starting apps room before liveness kicks in.
Probe handlers: httpGet, tcpSocket, exec, and grpc (1.27+).
9.5 Resource governance per namespace
- ResourceQuota — caps total requests/limits, object counts (PVCs, Services, Pods), and storage class allotments within a namespace.
- LimitRange — sets default and min/max for individual container/pod requests and limits.
Both are essential for multi-tenant clusters to prevent noisy-neighbor accidents.
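The ResourceQuota check amounts to comparing current usage plus the new request against each hard cap. The sketch below models quantities as plain integers and uses hypothetical key names and a hypothetical helper; the real admission plugin also handles scopes and quantity parsing.

```python
def exceeded_quotas(hard: dict[str, int], used: dict[str, int],
                    requested: dict[str, int]) -> list[str]:
    """Quota keys that would be exceeded if the request were admitted."""
    return [
        key
        for key, amount in requested.items()
        if key in hard and used.get(key, 0) + amount > hard[key]
    ]
```

An empty result means the object can be admitted; any returned key is a reason to reject the request.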
10. Operators and custom resources
10.1 CRDs
A CustomResourceDefinition declares a new API type — schema, validation (OpenAPI v3), versions, subresources (status, scale), conversion strategies. Once installed, kubectl get <kind> works as if it were native. CRDs are the foundation of the operator pattern.
10.2 The operator pattern
Coined by CoreOS in 2016, an Operator is a controller that reconciles a CRD-defined custom resource. The CR represents a high-level desired state (“I want a 3-node Postgres cluster with backup to S3”), and the controller orchestrates the underlying Pods, Services, PVCs, Secrets, Jobs to make it real — and to handle ongoing operations (failover, backup, upgrade, scale). The pattern encodes operational expertise as software.
10.3 Operator frameworks
- Operator SDK (Red Hat) — scaffolding for Go, Ansible, and Helm-based operators.
- Kubebuilder — Go-only, lower-level, the framework Operator SDK Go is built on.
- Metacontroller (Google) — write controllers as webhooks in any language; the framework handles caching, queues, leader election.
- KUDO — declarative operator framework.
- kopf (Python) — Pythonic operator framework, popular in ML/data infra.
10.4 Notable operators
- Prometheus Operator — manages Prometheus, Alertmanager, ServiceMonitor, PodMonitor as CRDs.
- cert-manager — automated X.509 certificate issuance via ACME (Let’s Encrypt), HashiCorp Vault, Venafi, AWS PCA.
- ArgoCD — GitOps continuous delivery (Section 12).
- Flux — GitOps toolkit (Source, Kustomize, Helm, Notification controllers).
- KEDA — event-driven autoscaling on dozens of sources (Kafka lag, SQS depth, Redis list size, Prometheus query).
- Istio Operator — service mesh lifecycle.
- Strimzi — Kafka on Kubernetes.
- PostgreSQL operators — Crunchy PGO, Zalando Postgres Operator, CloudNativePG, StackGres.
- MongoDB operator, Elastic Cloud on Kubernetes (ECK), CockroachDB operator, Vitess operator, TiDB operator.
11. Security
11.1 RBAC
Kubernetes RBAC controls API authorization:
- Role / ClusterRole — verb-on-resource permissions (get, list, watch, create, update, patch, delete, deletecollection, plus arbitrary verbs on subresources like pods/exec).
- RoleBinding / ClusterRoleBinding — bind a Role/ClusterRole to a subject (User, Group, ServiceAccount).
Built-in ClusterRoles (cluster-admin, admin, edit, view) are aggregated and meant as starting points — production should write tight custom roles.
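The verb-on-resource model can be sketched as a purely additive rule check. This is a simplified illustration: rules are plain dicts, API groups and resourceNames are ignored, and the only wildcard handled is a bare `*`; the function names are hypothetical.

```python
def rule_allows(rule: dict, verb: str, resource: str) -> bool:
    """True if a single rule grants `verb` on `resource`."""
    verbs, resources = rule["verbs"], rule["resources"]
    return ("*" in verbs or verb in verbs) and ("*" in resources or resource in resources)


def authorized(rules: list[dict], verb: str, resource: str) -> bool:
    """RBAC is additive: access is granted if any bound rule allows it, and there are no deny rules."""
    return any(rule_allows(rule, verb, resource) for rule in rules)
```

A read-only rule on pods allows `get` on `pods` but not on the `pods/exec` subresource, which has to be granted explicitly; this is one reason tight custom roles are easier to reason about than broad ones.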
11.2 Pod Security Standards
PodSecurityPolicy (PSP) was deprecated in 1.21 and removed in 1.25. The replacement is Pod Security Admission with three standards:
- Privileged — unrestricted; for system workloads.
- Baseline — minimal restrictions; prevents known privilege escalations.
- Restricted — heavily restricted; current Pod-hardening best practices (run as non-root, no privileged containers, no privilege escalation, drop all capabilities, seccomp RuntimeDefault, etc.).
Each namespace can enforce / audit / warn at one of these levels via labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
For more sophisticated policy, install OPA Gatekeeper (Rego policies) or Kyverno (YAML/JSON policies) — both validate and mutate via admission webhooks.
11.3 NetworkPolicy
Already covered in Section 5.6 — pod-level firewalling. Default-deny ingress and egress per namespace is a strong baseline.
11.4 Image scanning and supply chain
- Trivy (Aqua) — broad scanner (CVEs, IaC, secrets, SBOM).
- Grype (Anchore) — vulnerability scanner; pairs with Syft for SBOM.
- Snyk Container, Aqua, Sysdig Secure, Twistlock/Prisma — commercial.
- Clair — registry-side scanner (Quay).
Supply chain hardening:
- SLSA (Supply-chain Levels for Software Artifacts) — provenance, build-integrity standard.
- Sigstore / cosign — sign container images with short-lived keys; verify in admission with Connaisseur, Kyverno verify-images, or cosigned. Backed by Rekor transparency log.
- Notary v2 / ORAS — OCI artifact signing.
- Bill of materials: Syft, CycloneDX, SPDX formats.
11.5 Admission controllers
The API server runs every request through a chain:
- Built-in: NamespaceLifecycle, LimitRanger, ServiceAccount, ResourceQuota, MutatingAdmissionWebhook, ValidatingAdmissionWebhook, PodSecurity.
- Dynamic: MutatingAdmissionWebhook (rewrite objects on the fly — sidecar injection, defaults), ValidatingAdmissionWebhook (accept/reject — policy engines).
- ValidatingAdmissionPolicy (1.30 GA) — CEL-based in-process validation; faster than webhooks for simple rules.
OPA Gatekeeper and Kyverno are the dominant policy engines built on these primitives.
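For simple rules, a ValidatingAdmissionPolicy can stand in for a webhook. The sketch below assumes Kubernetes 1.30 or later (where the admissionregistration.k8s.io/v1 API is available); the policy name, replica cap, and namespace label are hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deployment-replica-cap      # hypothetical policy name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    # CEL expression evaluated in-process by the API server.
    - expression: "object.spec.replicas <= 5"
      message: "Deployments in test namespaces may not exceed 5 replicas."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deployment-replica-cap-test
spec:
  policyName: deployment-replica-cap
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test           # hypothetical namespace label
```

The policy does nothing until a binding selects where it applies, which lets one policy be reused with different scopes.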
11.6 Runtime security
- Falco — eBPF / kernel-module runtime threat detection; rules engine on syscall events.
- Tetragon (Cilium) — eBPF-based observability and enforcement, can block syscalls in-line.
- Tracee (Aqua) — eBPF event tracing.
- Sysdig Secure — commercial Falco-driven platform.
11.7 Workload identity
- SPIFFE / SPIRE — universal workload identity standard; X.509 SVIDs or JWT-SVIDs.
- IRSA (IAM Roles for Service Accounts) — AWS; ServiceAccount token projection lets pods assume IAM roles via OIDC federation.
- GKE Workload Identity — same idea on GCP via Google service account binding.
- Azure AD Workload Identity — OIDC federation to Azure AD; replaces deprecated pod-managed identity.
These remove the need for long-lived cloud credentials in pods.
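On the Kubernetes side, both IRSA and GKE Workload Identity are wired up through a ServiceAccount annotation. A sketch with placeholder account IDs, role names, and namespaces (all hypothetical); the matching IAM trust policy or Google service account binding must be created separately on the cloud side:

```yaml
# AWS (IRSA): pods using this ServiceAccount can assume the referenced IAM role.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: app
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/s3-reader
---
# GCP (GKE Workload Identity): the ServiceAccount is mapped to a Google service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gcs-reader
  namespace: app
  annotations:
    iam.gke.io/gcp-service-account: gcs-reader@example-project.iam.gserviceaccount.com
```

A pod opts in simply by setting spec.serviceAccountName to one of these accounts.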
12. Production patterns
12.1 GitOps
GitOps treats a Git repository as the single source of truth for cluster state. An in-cluster controller watches the repository and reconciles the cluster toward it. Benefits: auditable change log, easy rollback, peer review via PRs, drift detection.
- Argo CD (Intuit / CNCF) — pull-model GitOps for app delivery; UI, multi-cluster, app-of-apps, ApplicationSet, Sync waves, hooks.
- Flux v2 (Weaveworks / CNCF) — toolkit-style: Source-controller, Kustomize-controller, Helm-controller, Notification-controller, Image-automation-controller.
Both pair well with Kustomize and Helm for templating.
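As an illustration of the pull model, an Argo CD Application ties a repository path to a destination cluster and namespace. The repository URL, path, and names below are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd                 # namespace where Argo CD is installed
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git   # hypothetical repo
    targetRevision: main
    path: apps/web/overlays/prod    # a Kustomize overlay or plain manifests
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes (drift) back to the Git state
```

Leaving out syncPolicy.automated keeps drift detection but requires a manual sync, which some teams prefer for production.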
12.2 Multi-cluster
- Karmada — multi-cluster scheduler and propagator; CRDs describe placement policies.
- Cluster API (CAPI) — declarative cluster lifecycle; CRDs for Machine, MachineSet, Cluster; provider-specific implementations (CAPA for AWS, CAPG for GCP, CAPZ for Azure, CAPV for vSphere, CAPM3 / Metal3 for bare metal, formerly CAPBM).
- Crossplane — provisions cloud resources (databases, queues, buckets) declaratively from Kubernetes; the “control plane for everything”.
- Liqo — multi-cluster federation with virtual-node abstraction.
- Submariner — cross-cluster networking.
12.3 Autoscaling
- HPA (Horizontal Pod Autoscaler) — scales replicas based on CPU, memory, custom Prometheus metrics, or external metrics. Configurable behavior windows for stable up/down scaling.
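- VPA (Vertical Pod Autoscaler) — recommends or sets pod requests based on observed usage. Running VPA and HPA on the same resource metric (CPU or memory) for one workload makes them fight each other; a common pattern is VPA for memory plus HPA for CPU or QPS.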
- KEDA — event-driven autoscaling on dozens of sources (Kafka consumer lag, SQS queue depth, Redis Streams, Prometheus query, RabbitMQ, Azure Service Bus, GCP Pub/Sub). Wraps HPA under the hood and supports scale-to-zero.
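A sketch of an HPA using the autoscaling/v2 API, with a behavior block that slows scale-down. The Deployment name and all thresholds are assumptions chosen for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # percent of the pods' CPU requests
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # act on the highest recommendation of the last 5 minutes
      policies:
        - type: Percent
          value: 50                     # remove at most 50% of replicas per minute
          periodSeconds: 60
```

Utilization targets are computed against container requests, so the HPA cannot act on CPU for pods that do not set a CPU request.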
12.4 Node autoscaling
- Cluster Autoscaler — the standard project; watches for unschedulable pods and scales up node groups, and removes nodes that stay underutilized once their pods can be rescheduled elsewhere. Works through cloud-specific node-group APIs.
- Karpenter (AWS, open-source) — just-in-time provisioner that bypasses node groups; reads pod requirements and provisions exactly-sized instances directly. Faster, better bin-packing, native spot support. Originally AWS-only; the Karpenter project is now multi-cloud via provider abstractions.
12.5 Cost optimization
- Spot / Preemptible nodes — often 60–90% cheaper than on-demand, depending on provider, instance type, and region; use with PDBs, graceful-shutdown hooks, and stateless workloads. Karpenter handles spot interruption well.
- Right-sizing — Goldilocks, KRR, or VPA recommendations.
- Cluster-utilization dashboards via kube-state-metrics + Grafana.
- OpenCost (CNCF; originated at Kubecost) — cost allocation per namespace/workload.
12.6 Progressive delivery
- Argo Rollouts — Deployment replacement with blue/green, canary, and experiment strategies; integrates with service meshes and Ingress controllers for traffic shifting; metrics-driven analysis with Prometheus, Datadog, New Relic.
- Flagger (Flux family) — similar canary/blue-green workflow with metric analysis.
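A canary strategy in Argo Rollouts is expressed as a list of steps. A minimal sketch with a hypothetical image and labels; no traffic router is configured here, so the canary weight is approximated by the ratio of canary to stable replicas:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 5
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # hypothetical image
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 5m}   # wait five minutes, then continue automatically
        - setWeight: 50
        - pause: {}               # wait indefinitely until promoted manually
```

The pod template is the same shape as in a Deployment, which is what makes a Rollout usable as a replacement for one.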
12.7 Backup and disaster recovery
- Velero — cluster-state and PV backup to object storage; uses CSI snapshots or file-system backup (restic, or Kopia in newer releases). Schedules, retention, selective restores.
- Kasten K10 (Veeam) — commercial enterprise backup.
- Stash (AppsCode), TrilioVault, Portworx PX-Backup — others in the space.
GitOps complements backup: cluster manifests can be redeployed from git; PVs need Velero/Kasten.
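Scheduled backups in Velero are declared with a Schedule resource. A sketch assuming Velero is installed in the velero namespace with a default backup storage location; the schedule, namespace, and retention values are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: payments-nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"        # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - payments               # hypothetical namespace
    snapshotVolumes: true      # also snapshot persistent volumes
    ttl: 168h0m0s              # keep each backup for 7 days
```

Restores should be rehearsed periodically; a backup that has never been restored is an untested assumption.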
13. Common pitfalls
- etcd performance. Slow disks (rotational, oversubscribed cloud disks) cause API request latency, leader elections, and cascading controller failures. Monitor etcd_disk_wal_fsync_duration_seconds and etcd_server_leader_changes_seen_total. Defragment regularly. In very large clusters, separate etcd nodes from API servers.
- Memory limits without tuning. Setting resources.limits.memory without understanding JVM / Node.js / Python heap behavior causes OOMKilled loops. Either set limits well above observed peak, or run with VPA recommendations, or remove limits and rely on requests + node-pressure eviction.
- Stale Service endpoints. Recreating a Service with the same name but a different selector, or running a version-skewed kube-proxy, can leave stale iptables rules briefly. Endpoints / EndpointSlices are the source of truth; inspect them.
- Image registry rate limits. Docker Hub throttles pulls (historically 100 per IP per 6 hours for anonymous users and 200 per 6 hours for authenticated free accounts; the limits have changed over time, so check Docker's current policy). Heavy clusters hit this. Mirror to ECR, GCR/Artifact Registry, ACR, Harbor, or a pull-through cache.
- Secrets in ConfigMaps. An easy mistake: a developer drops a DB password into a ConfigMap because it "works". RBAC should grant get secrets differently from get configmaps; admission policies (Kyverno) can block string patterns in ConfigMaps.
- API deprecation breaks during upgrade. PodSecurityPolicy and the policy/v1beta1 PodDisruptionBudget were removed in 1.25, and the beta API versions of HPA, Ingress, and NetworkPolicy have also been removed over successive releases. Run kubent (Kube No Trouble) or pluto before every minor upgrade. Always read the release notes.
- Missing PDBs. With no PDB in place, the Cluster Autoscaler or an admin running kubectl drain on several nodes can evict every replica of a Deployment at the same time. Set a PodDisruptionBudget with minAvailable: 1 (or higher) for every workload that matters.
- DNS-resolution startup races. Pod starts, queries a Service that has zero ready endpoints, fails. Solutions: init container that blocks on DNS+TCP, readiness gates, or app-level retry. Separately, setting the ndots option to "1" under dnsConfig.options reduces search-path expansion.
- CrashLoopBackOff with empty logs. The container exited too fast or its own logging hasn't initialized. Use kubectl logs --previous and kubectl describe pod, look at events, and increase the startup probe's initialDelaySeconds.
- Cluster runs out of pod IPs. AWS VPC CNI in particular assigns ENIs and pod IPs from the VPC subnet; subnet exhaustion is invisible until new pods fail to start. Use prefix delegation, custom networking, or Cilium overlay.
- hostPath and privileged escapes. Workloads with hostPath mounts or privileged: true can break out of the container. Restrict via PSS, Kyverno, Gatekeeper.
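Two of the pitfalls above, missing PDBs and DNS search-path expansion, have small manifest-level mitigations. The following is a sketch only; the workload name, namespace, image, and memory figures are hypothetical:

```yaml
# Keep at least one replica running during voluntary disruptions such as drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: shop
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: shop
spec:
  replicas: 2                 # a PDB with minAvailable: 1 needs more than one replica to allow drains
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      dnsConfig:
        options:
          - name: ndots       # names containing a dot are tried as absolute names first
            value: "1"
      containers:
        - name: web
          image: registry.example.com/web:1.2.3
          resources:
            requests:
              memory: 256Mi
            limits:
              memory: 512Mi   # set above the observed peak, not at it
```

Lowering ndots changes how partially qualified names such as service.namespace are looked up, so it is worth testing against the names the application actually resolves.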
14. Kubernetes distributions
14.1 Managed cloud
- Amazon EKS — AWS managed control plane; nodes via EC2, Fargate (serverless pods), or self-managed. Tight IAM integration via IRSA / Pod Identity.
- Google GKE — Standard or Autopilot mode (Google manages nodes entirely, you pay per-pod). Strong default networking (Dataplane V2 / Cilium), workload identity, multi-cluster fleet.
- Azure AKS — managed control plane; AKS-managed addons (CNI, Ingress, monitoring). Azure CNI Overlay, integration with Azure AD and Workload Identity.
- DigitalOcean DOKS, Linode LKE, Oracle OKE, Scaleway Kapsule, IBM IKS, Alibaba ACK — regional / niche options.
14.2 Self-managed
- kubeadm — official cluster bootstrapper; you handle nodes, OS, upgrades, HA.
- kOps — opinionated cluster lifecycle on AWS/GCP/Azure/DO; pre-Cluster-API approach.
- Cluster API (CAPI) — declarative cluster lifecycle as Kubernetes resources, multi-provider.
- Kubespray — Ansible-based, supports many platforms.
14.3 Opinionated enterprise
- Red Hat OpenShift / OKD — adds opinionated authentication, image streams, BuildConfigs, integrated registry, web console, OperatorHub, Service Mesh, Pipelines (Tekton), Serverless (Knative). Workloads run under the restricted SCC (SecurityContextConstraints) by default. Widely adopted in regulated industries.
- Rancher / RKE2 — multi-cluster management UI on top of upstream Kubernetes (RKE2 is the supported distro). Now part of SUSE.
- Mirantis Kubernetes Engine (formerly Docker Enterprise).
- VMware Tanzu Kubernetes Grid (TKG) — VMware ecosystem integration.
14.4 Edge and lightweight
- k3s (Rancher / CNCF Sandbox) — full Kubernetes in a single ~50 MB binary; SQLite default datastore (etcd optional); intended for edge, IoT, CI, dev. Strips out cloud-controller and most alpha features.
- k0s (Mirantis) — single-binary distro.
- MicroK8s (Canonical) — snap-based, batteries-included for Ubuntu.
- kind (Kubernetes IN Docker) — control plane and nodes as Docker containers; the standard for CI testing and dev clusters.
- minikube — single-node local cluster; supports multiple drivers (Docker, VirtualBox, hyperkit, KVM).
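For CI and local testing, a multi-node kind cluster is described in a short config file. A sketch, assuming the file is saved as kind-cluster.yaml (a hypothetical filename):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

The cluster is then created with kind create cluster --config kind-cluster.yaml; each node runs as a container on the local Docker daemon.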
14.5 Bare-metal-native
- Talos Linux (Sidero Labs) — minimal immutable OS designed only to run Kubernetes; API-driven, no SSH; uses Cluster API for provisioning.
- Sidero Metal — bare-metal provisioning operator that pairs with Talos.
- Equinix Metal, Hetzner Cloud — common bare-metal substrates with Cluster API providers.
- MetalLB — LoadBalancer implementation for bare-metal clusters (ARP / BGP).
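In ARP (layer 2) mode, MetalLB needs an address pool and an advertisement that announces it. A sketch assuming MetalLB 0.13 or later installed in the metallb-system namespace; the address range is a placeholder that must be replaced with unused addresses on the node network:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lan-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.240-192.168.10.250   # placeholder range outside the DHCP scope
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lan-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lan-pool
```

Services of type LoadBalancer then receive an external IP from the pool; BGP mode uses a BGPAdvertisement and peer configuration instead.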
15. Cross-references
- _index
- distributed-systems-fundamentals — etcd’s Raft consensus; controller-loop pattern as eventual-consistency reconciliation.
- service-mesh — Istio, Linkerd, Cilium service mesh deep dive (TBD).
- observability-stack — Prometheus, Grafana, OpenTelemetry, Loki, Jaeger (TBD).
- oci-cloud-native — Helm, Kustomize, manifest DSLs, CDK8s, Pulumi.
16. Citations
- Burns, Brendan; Beda, Joe; Hightower, Kelsey; Evenson, Lachlan. Kubernetes: Up & Running, 3rd ed. O’Reilly, 2022. The canonical introduction, co-written by two of the project co-creators.
- Verma, Abhishek; Pedrosa, Luis; Korupolu, Madhukar; Oppenheimer, David; Tune, Eric; Wilkes, John. “Large-scale cluster management at Google with Borg.” EuroSys ’15. The Borg paper that motivated Kubernetes’ design.
- Kubernetes official documentation — https://kubernetes.io/docs/
- Kubernetes Enhancement Proposals (KEPs) — https://github.com/kubernetes/enhancements
- CNCF Landscape — https://landscape.cncf.io/
- Operator Framework documentation — https://operatorframework.io/
- Bondi, Andre B. and others. “Containers and Cluster Management.” ACM Queue, 2016.
- Burns, Brendan; Grant, Brian; Oppenheimer, David; Brewer, Eric; Wilkes, John. “Borg, Omega, and Kubernetes.” ACM Queue, 2016. The lineage paper.
- Hightower, Kelsey. Kubernetes the Hard Way — https://github.com/kelseyhightower/kubernetes-the-hard-way. The from-scratch bootstrap walkthrough every operator should do once.
- The Kubernetes Book (Nigel Poulton), 2024 edition.
- OpenShift documentation — https://docs.openshift.com/
- Argo Project documentation — https://argo-cd.readthedocs.io/, https://argo-rollouts.readthedocs.io/
- Flux documentation — https://fluxcd.io/
- Cilium documentation — https://docs.cilium.io/
- Calico documentation — https://docs.tigera.io/calico/
- Karpenter documentation — https://karpenter.sh/
- Cluster API documentation — https://cluster-api.sigs.k8s.io/
- Sigstore project — https://www.sigstore.dev/
- SLSA framework — https://slsa.dev/