Microservices Patterns — Compute Reference

1. At a glance

Microservices are small, independently-deployable services aligned to business capabilities that communicate over a network. The term was popularized by James Lewis and Martin Fowler in their March 2014 essay “Microservices” (martinfowler.com), which codified patterns already practiced at Netflix, Amazon, SoundCloud, and Gilt.

Defining characteristics:

  • Independent deployability — one service ships without coordinating release with siblings.
  • Business-capability alignment — services map to domains (orders, payments, inventory), not technical layers (UI, DB).
  • Decentralized data ownership — each service owns its persistent store.
  • Smart endpoints, dumb pipes — logic in services, transport stays simple (HTTP, message bus).
  • Polyglot tolerance — teams pick the language/runtime that fits.
  • Failure isolation — one service’s outage degrades, not destroys.

The core trade-off: deployment + scaling + team independence vs operational complexity + distributed-systems tax. You exchange a single deploy and in-process function calls for a fleet of services, a network between every interaction, and the eventual-consistency, partial-failure, observability, and CI/CD investment that comes with it. Conway’s law (1968) is the deep driver: software architecture mirrors team structure, so when team count and product surface area both grow past a threshold, micro-decomposition is often a consequence rather than a goal.

The 2014–2020 wave saw heavy adoption, sometimes premature; the 2022–2026 correction has been the modular-monolith comeback (Shopify, GitHub, Stack Overflow continuing the pattern) and the rise of cell-based architectures that group services into bounded units.

2. When NOT to use microservices

Microservices have a real fixed cost. Start with a modular monolith and extract services only when a bottleneck justifies the overhead. Anti-indicators:

  • Team size <10 engineers — you don’t have enough humans to staff per-service ownership; everyone ends up touching every service anyway, which gives you a distributed monolith with extra steps.
  • Unclear domain boundaries — if you can’t draw bounded contexts on a whiteboard, you’ll cut services in the wrong places and pay the refactor tax to re-cut them.
  • No CI/CD investment — manually deploying N services is N× the toil.
  • No observability platform — logs, metrics, and distributed tracing are non-negotiable; without them you cannot debug a multi-service request path.
  • No distributed-systems experience on the team — partial failure, retries, idempotency, eventual consistency, and timeout-tuning are skills, not switches you flip.
  • Greenfield product with unstable requirements — moving a model object between two services is expensive; moving a function between two files in a monolith is free.
  • Low traffic — at around 100 RPS, a single well-tuned Postgres instance behind a monolith has ample headroom.

Sam Newman (“Monolith to Microservices” 2019) explicitly recommends extracting from a monolith rather than starting micro. Shopify’s 2024 retrospective on their Rails majestic monolith and GitHub’s documentation of their >15-year Rails app reinforce that monolith does not mean unscalable.

3. Decomposition strategies

Domain-Driven Design (Evans 2003)

Eric Evans’s Domain-Driven Design: Tackling Complexity in the Heart of Software (2003) supplies the vocabulary almost every microservices org now uses. Key concepts:

  • Ubiquitous language — engineers and domain experts share one vocabulary; concepts and class names match the business.
  • Bounded context — an explicit scope within which a model is consistent and unambiguous; outside it, terms may mean different things.
  • Aggregate — a cluster of domain objects treated as a single transactional unit (e.g., Order + OrderLine + ShippingAddress); has a root entity, and external code only mutates via the root.
  • Entity vs value object — entities have identity over time; value objects are immutable by value.
  • Repository — persistence boundary for an aggregate.
  • Domain event — a fact that happened in the domain (OrderPlaced, PaymentCaptured) that other contexts can react to.

One bounded context per service is the cleanest mapping. Cross-context references are by ID only; you do not share types or tables.

Strategic patterns (context map)

Once you have multiple contexts, the relationships between them are themselves patterns:

  • Shared kernel — two teams agree to share a small piece of model; high coordination cost.
  • Customer-supplier — downstream team has influence over upstream priorities.
  • Conformist — downstream takes the upstream model as given (cheap, lossy).
  • Anticorruption layer (ACL) — downstream translates upstream model into its own; insulates against upstream churn. Almost mandatory at integration boundaries with legacy systems.
  • Separate ways — two contexts deliberately do not integrate.
  • Partnership — two teams co-evolve a context together.
  • Published language — a stable shared schema (e.g., a public event format).
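The anticorruption layer in the list above can be sketched as a single translation function. This is a minimal illustration, not a prescribed design: the legacy field names (CUST_NO, NM, STAT), the status codes, and the Customer model are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical upstream (legacy) status codes; illustrative only.
LEGACY_STATUS = {"A": "active", "S": "suspended", "X": "closed"}


@dataclass(frozen=True)
class Customer:
    """Downstream model, expressed in the downstream context's own language."""
    customer_id: str
    display_name: str
    status: str


def translate_legacy_customer(record: dict) -> Customer:
    """Anticorruption layer: the only place that knows the upstream field names."""
    return Customer(
        customer_id=str(record["CUST_NO"]),
        display_name=record["NM"].strip().title(),
        status=LEGACY_STATUS.get(record["STAT"], "unknown"),
    )
```

If the upstream renames a field or adds a status code, only this function changes; the rest of the downstream service keeps working against Customer.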

Subdomain types

Vaughn Vernon (“Implementing Domain-Driven Design” 2013) and Evans classify subdomains:

  • Core domain — the competitive advantage; build with your best people, build deliberately, never outsource.
  • Supporting subdomain — necessary but not differentiating (e.g., notifications, reporting). Build pragmatically; consider OSS.
  • Generic subdomain — undifferentiated commodity (auth, billing, search infra). Buy or use OSS (Auth0, Stripe, Elastic).

Service investment should follow this hierarchy: heavy in core, lean in supporting, vendor in generic.

Conway’s law and the inverse Conway maneuver

Melvin Conway’s 1968 Datamation paper observed that “organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.” More than fifty years of evidence holds it up. The pragmatic move (Skelton + Pais, Team Topologies 2019):

  1. Design the team boundaries you want. Stream-aligned teams own services end-to-end. Enabling, complicated-subsystem, and platform teams support.
  2. Architecture will follow. You don’t have to enforce service boundaries — the org chart will.

Trying to micro-decompose without addressing team boundaries produces shared ownership, which produces shared schemas, which produces a distributed monolith.

4. Service granularity

Too small: every business operation fans out to a dozen RPC calls, latency suffers, and the schema-coupling between adjacent services is so tight you redeploy them together. Uber’s documented explosion during the 2010s to thousands of services (4000+ by some counts) and the consolidation that followed is the canonical cautionary tale.

Too big: you’ve reproduced the monolith with extra HTTP between modules.

Useful heuristics:

  • 2-pizza team owns one service (Amazon’s rule; a team you can feed with two pizzas, roughly 6–10 people). If multiple teams co-own, the boundary is wrong.
  • One bounded context per service (DDD heuristic).
  • One database per service — if two services share a DB, they’re one service.
  • Cell-based architecture (AWS, DoorDash) groups services into independent cells that fail and scale together; cross-cell traffic is the exception. A cell is the unit of regional failover; service boundaries inside a cell can be looser.
  • The cohesion-coupling test — high cohesion inside the service (the methods relate), low coupling to other services (few cross-calls per business operation). If a feature change touches three services, the seam is in the wrong place.
  • Change frequency test — services that change together should be one service; services that change at very different rates (a once-a-month invoice generator vs a continuously-tuned recommendation engine) belong apart.
  • Failure-blast-radius test — if service A failing always also fails service B, they live and die together; consider whether they need to be separate.
  • Stable-dependency principle (Robert Martin) — depend in the direction of stability. Frequently-changing services should depend on stable ones, never the reverse. Where you find an unstable service being depended on, introduce an anticorruption layer.

5. Communication patterns

Synchronous

REST / JSON over HTTP — universal, text-debuggable, weak contract. Resource-oriented, uses HTTP verbs (GET, POST, PUT, PATCH, DELETE), status codes, content negotiation. Hypermedia (HATEOAS, Fielding’s 2000 dissertation) is rarely implemented; in practice “REST” means JSON-over-HTTP with reasonable URLs.

gRPC (Google 2015) — HTTP/2 + Protocol Buffers + code generation. Payloads are often several times smaller than the equivalent JSON (around 6× is a commonly quoted figure, but the ratio depends on message shape). Also provides header compression via HPACK, multiplexed streams, native bi-directional streaming, deadline propagation, and server reflection. Strong contract; client and server stubs are generated from the .proto file. Costs: harder to inspect on the wire, requires tooling for browsers (gRPC-Web), and Protobuf schema evolution rules must be followed (never re-use a tag number).

GraphQL (Facebook; used internally from 2012, open-sourced 2015) — single endpoint, client specifies the query shape. Good for BFFs that aggregate many backends for diverse clients. Costs: N+1 resolver problems, query-cost limiting needed, caching is per-query rather than per-URL.

Latency: synchronous gives lower per-call latency but tighter coupling — every dependent service becomes a hard dependency of your availability budget.

Asynchronous messaging

Kafka (LinkedIn 2011, Apache) — partitioned, replicated, append-only log; durable; consumers track offset. Strong ordering within a partition; horizontal scale by adding partitions. The backbone of event-driven architectures.

RabbitMQ — AMQP broker; rich routing (direct, topic, fanout, headers exchanges); good for traditional work queues.

NATS — lightweight, low-latency pub/sub; JetStream adds persistence.

AWS SQS / SNS — managed queue (SQS) and pub/sub (SNS); often paired as the SNS-to-SQS fanout pattern.

GCP Pub/Sub — managed, global; at-least-once with optional ordering keys.

Azure Service Bus / Event Hubs — Service Bus for queues + topics; Event Hubs for log streams (Kafka-compatible API).

Async decouples in time, in space, and in failure: producer can publish even if no consumer is alive; consumers process at their own pace; one slow consumer doesn’t block others. Costs: eventual consistency, out-of-order delivery, deduplication burden, harder debugging.

Request-response vs event-driven

Orchestration — a central coordinator drives the workflow, calling services in order, handling failures. Tools: Temporal (a fork of Uber’s Cadence, started in 2019), Camunda (BPMN engine), AWS Step Functions, Netflix Conductor. Pros: visible flow, easier to reason about, easier rollback. Cons: coordinator is a coupling point.

Choreography — each service reacts to events published by peers; no central brain. The order-flow is implicit in the topology of subscriptions. Pros: maximally decoupled. Cons: emergent behavior; hard to answer “what happens when an order is placed?” without tracing tools.

Practical guidance: orchestrate when transactions span 3+ services with compensations; choreograph for fanout-style downstream reactions (analytics, notifications, cache invalidation).

gRPC vs REST

Aspect             gRPC                     REST
Wire format        Protobuf binary          JSON text
Transport          HTTP/2                   HTTP/1.1 or HTTP/2
Payload size       ~6× smaller              larger
Contract           strict .proto            OpenAPI optional
Browser support    needs gRPC-Web           native
Streaming          bi-directional native    SSE or WebSockets bolted on
Tooling            code-gen for ~10 langs   curl, Postman, every lang
Debugging          needs decoder            text

Service-to-service inside the data center: gRPC. Public APIs and browser-facing: REST/JSON or GraphQL. Mixing is fine.

6. Data ownership

Database per service

The non-negotiable: each service owns its persistent state, exposes it only through its API or events, and other services do not read its tables. Shared databases violate independence (schema changes propagate as breaking changes), violate encapsulation (sibling services depend on internal representation), and re-create the monolith.

Consequences:

  • Cross-service joins are impossible at the DB layer. Aggregate at the API layer (orchestration / BFF), or pre-materialize a read model.
  • Reports that span services are hard. Use a separate analytics store fed by event stream (Debezium CDC → Kafka → warehouse).
  • Referential integrity across services is soft. A foreign key from Order to Customer is just a UUID string; the Customer might be deleted; the service must tolerate dangling references.

Event sourcing

Instead of storing current state, store the immutable sequence of events that produced it (OrderPlaced, ItemAdded, Discounted, Paid, Shipped). State is a fold over events.
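"State is a fold over events" can be made concrete with a reducer. The sketch below is illustrative only: the event payload fields (sku, cents) and the OrderState shape are assumptions, not part of any particular event store.

```python
from dataclasses import dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class OrderState:
    items: tuple = ()
    discount_cents: int = 0
    status: str = "new"


def apply(state: OrderState, event: dict) -> OrderState:
    """Pure transition function: (state, event) -> new state."""
    kind = event["type"]
    if kind == "OrderPlaced":
        return replace(state, status="placed")
    if kind == "ItemAdded":
        return replace(state, items=state.items + (event["sku"],))
    if kind == "Discounted":
        return replace(state, discount_cents=state.discount_cents + event["cents"])
    if kind == "Paid":
        return replace(state, status="paid")
    if kind == "Shipped":
        return replace(state, status="shipped")
    return state  # unknown event types are ignored in this sketch


def rehydrate(events) -> OrderState:
    """Current state is the left fold of apply over the event history."""
    return reduce(apply, events, OrderState())
```

Replaying a prefix of the history yields the state as of that point, which is where the time-travel property comes from.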

Benefits: complete audit log; time-travel; multiple read models from same source; natural fit with event-driven integration.

Costs: query needs read-model projection; schema evolution requires upcasters; not every problem benefits — keep it focused on aggregates where history matters.

Tooling: EventStoreDB (the reference impl from Greg Young’s team), Axon (JVM), Marten (.NET on Postgres), homegrown on Postgres or Kafka. Snapshots periodically materialize current state to bound replay cost.

CQRS (Command Query Responsibility Segregation)

Split write model (handles commands, produces events) from read model (denormalized, optimized for queries). Often paired with event sourcing but doesn’t require it. Greg Young’s 2010 essay popularized it.

Outbox pattern

The atomic-publish problem: you want to update the DB and publish an event; without 2PC you can fail between them. Outbox pattern:

  1. In the same DB transaction as the business write, insert a row into an outbox table containing the event payload.
  2. A separate process polls the outbox (or reads its WAL via CDC like Debezium) and publishes to the message bus, marking the row as sent.

This guarantees at-least-once event publication tied to business state.
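The two steps can be sketched with SQLite standing in for the service database. Table names, column names, and the publish callback are hypothetical; a production relay would also need batching, locking between relay instances, and cleanup of sent rows.

```python
import json
import sqlite3


def create_schema(db):
    db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    db.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
    )


def place_order(db, order_id):
    # Step 1: the business write and the outbox row commit or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'pending')", (order_id,))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def relay_once(db, publish):
    # Step 2: publish unsent rows in insertion order, then mark each one sent.
    rows = db.execute(
        "SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays unsent
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A crash between publish and the UPDATE causes the row to be published again on the next pass, which is why the guarantee is at-least-once and consumers must deduplicate.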

Saga pattern

Garcia-Molina and Salem coined “Sagas” in their 1987 SIGMOD paper. Revived in the 2010s by Caitie McCaffrey, Chris Richardson, and others as the way to do long-running distributed transactions.

A saga is a sequence of local transactions, each in a different service, where each step has a compensating action that undoes its effect. If step 4 fails, you run compensations for steps 1–3 in reverse order.

Two flavors:

  • Orchestrated saga — a coordinator (Temporal, Camunda, custom) drives the sequence and the compensations. Cleaner.
  • Choreographed saga — each service listens for the previous step’s success event and emits its own; compensations are also event-driven. More decoupled, harder to follow.

Compensations are semantic undo, not technical rollback — once an email is sent, you cannot un-send it; the compensation is “send a correction email.”

Worked example — placing an order:

  1. OrderService.createOrder (status: pending) → emits OrderCreated.
  2. InventoryService.reserveStock → on success emits StockReserved; on failure emits StockReservationFailed.
  3. PaymentService.charge → on success emits PaymentCaptured; on failure emits PaymentFailed.
  4. ShippingService.dispatch → emits Dispatched.
  5. OrderService.complete on Dispatched.

Compensations on PaymentFailed at step 3: InventoryService.releaseStock; OrderService.cancel. The semantics are visible end-to-end, every step is restartable, and the system tolerates partial failure at any point.
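An orchestrated version of this flow can be sketched as a list of (action, compensation) pairs. This is a toy in-process coordinator under stated assumptions: the step functions only append to a log, the refund compensation is a hypothetical addition, and nothing here is durable across a coordinator crash.

```python
class SagaFailed(Exception):
    pass


def run_saga(steps, ctx):
    """Run (action, compensation) pairs in order. On failure, run the
    compensations of the completed steps in reverse order, then raise."""
    completed = []
    for action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            for undo in reversed(completed):
                undo(ctx)
            raise SagaFailed(str(exc)) from exc
        completed.append(compensate)


def make_order_saga(log, payment_ok=True):
    """Build the order flow; each step just records its name in log."""
    def charge(ctx):
        if not payment_ok:
            raise RuntimeError("PaymentFailed")
        log.append("charge")

    return [
        (lambda c: log.append("createOrder"), lambda c: log.append("cancelOrder")),
        (lambda c: log.append("reserveStock"), lambda c: log.append("releaseStock")),
        (charge, lambda c: log.append("refund")),  # refund is hypothetical
        (lambda c: log.append("dispatch"), lambda c: None),
    ]
```

With payment_ok=False the log ends with releaseStock followed by cancelOrder, matching the compensation order described above. A durable runtime adds persistence of the completed list so compensation survives restarts.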

Temporal in particular has become a common choice of orchestration runtime (Maxim Fateev’s team built Cadence at Uber, which open-sourced it; the project was forked as Temporal in 2019); its programming model presents the saga as ordinary code with timers, retries, signals, and compensations expressed as durable activities.

Two-phase commit

XA / 2PC across services: do not. It requires a coordinator, holds locks across the network, and any participant outage stalls the whole transaction. Doesn’t scale; doesn’t tolerate partial failure. Use sagas + outbox + idempotency instead. The classic Pat Helland essay “Life beyond Distributed Transactions” (2007) is the canonical justification.

Strong consistency between services is rare

Build for eventual consistency. Make UIs show “processing” states. Make APIs idempotent. Make stale reads acceptable for most queries. When you truly need strong consistency, the right unit of strong consistency is a single service’s database, not the system as a whole.

7. Service mesh

A service mesh moves cross-cutting concerns (mTLS, retries, timeouts, circuit-breaking, traffic-splitting, observability) out of application code into a sidecar (or sidecar-less) data plane. See [[Compute/containers-service-mesh]] for the deep dive.

Sidecar-based: Istio (Envoy proxy + Istiod control plane), Linkerd (Rust micro-proxy, simpler).

Sidecar-less: Cilium with eBPF + Envoy, Istio Ambient (split into ztunnel L4 + waypoint L7).

Cost: sidecar adds memory + CPU per pod; sidecar-less ambient model reduces that significantly. For under ~50 services, the mesh tax may not pay off; rely on a library or gateway instead.

8. Resilience patterns

Distributed systems fail in ways that monoliths don’t. Michael Nygard’s Release It! (1st ed 2007, 2nd ed 2018) catalogues the canonical stability patterns that became Hystrix, Resilience4j, Polly, and now mesh-level defaults.

Circuit breaker

(Nygard 2007; popularized by Netflix Hystrix 2012, now in maintenance mode in favor of Resilience4j.)

A circuit breaker wraps a remote call. States:

  • Closed — calls pass through; failures are counted.
  • Open — after N failures in window, the breaker opens; subsequent calls fail fast without hitting the dependency. Prevents cascading failure.
  • Half-open — after a cooldown, a few probes are allowed; on success, close; on failure, re-open.

Tunable parameters: failure threshold, window size, cooldown, probe count. Apply per-dependency.
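The three states can be sketched as a small class. This is a simplified illustration, not how any named library implements it: it counts consecutive failures rather than failures in a sliding window, and it does not limit the number of concurrent half-open probes. The injectable clock parameter is an assumption made for testability.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast without calling the dependency")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed half-open probe, or too many failures while closed, (re)opens.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

One breaker instance per dependency keeps a failing service from tripping calls to healthy ones.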

Bulkhead

Isolate resources per dependency (separate thread pools, connection pools, semaphores). If dependency X gets slow, only its bulkhead’s threads block; the rest of the service keeps serving. Named after ship bulkheads — a leak in one compartment doesn’t sink the vessel.

Implementations: Hystrix’s per-command thread pools (now legacy); Resilience4j Bulkhead module (semaphore or thread-pool flavor); Polly (.NET) Bulkhead Isolation; mesh-level concurrency limits in Envoy/Istio. Sizing: enough threads to absorb p99 latency × target QPS for that dependency, capped to prevent the bulkhead itself becoming an OOM source.
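A semaphore-flavored bulkhead and the sizing rule above can be sketched as follows. The class and function names are hypothetical, and the sizing helper is only the p99 latency × QPS product from the text (Little's law), without the safety cap.

```python
import math
import threading


class BulkheadFull(Exception):
    pass


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


def required_slots(p99_latency_seconds, target_qps):
    """Concurrent calls in flight if every call took the p99 latency."""
    return math.ceil(p99_latency_seconds * target_qps)
```

For example, a dependency with a 250 ms p99 at 40 requests per second needs about 10 slots.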

Retry with exponential backoff and jitter

  • Exponential backoff — wait 100ms, 200ms, 400ms, 800ms, … doubling.
  • Jitter — randomize each wait to avoid the thundering herd when many clients retry in sync. AWS’s “Exponential Backoff And Jitter” post (2015) showed full jitter (uniform random between 0 and capped exponential) outperforms equal jitter for most workloads.
  • Cap retries. Retrying forever amplifies overload.
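  • Retry only on retryable errors — 5xx, network timeout, connection refused. Do not retry a 4xx without changing the request; the usual exceptions are 408 and 429, where the server is asking you to try again later (honor Retry-After when present).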
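The rules above can be combined into a short helper. This is a sketch under stated assumptions: RetryableError is a hypothetical exception standing in for whatever the HTTP or RPC client raises on retryable failures, and the sleep and rng parameters exist only so the behavior can be tested deterministically.

```python
import random
import time


class RetryableError(Exception):
    """Stands in for 5xx responses, timeouts, and refused connections."""


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Full jitter: uniform random delay between 0 and the capped exponential."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(fn, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep, rng=random.random):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # retries are capped; surface the last error
            sleep(full_jitter_delay(attempt, base, cap, rng))
```

Any other exception type propagates immediately, which matches the rule of not retrying errors that will fail the same way again.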

Timeout at every dependency call

No call to a remote dependency should be allowed to hang indefinitely. Set explicit timeouts at every layer: connect, read, total request. Default-no-timeout is the most common production landmine.

Deadline propagation

A client request has an overall deadline. Each downstream call should carry the remaining budget so that a service deep in the call chain doesn’t keep working on a request the client has already given up on. gRPC supports this natively (the grpc-timeout header carries the time remaining until the caller’s deadline, not a fresh per-hop timeout); HTTP needs explicit propagation (e.g., a Deadline or X-Request-Deadline header).
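For plain HTTP, propagation can be sketched as "compute what is left, forward it, fail fast if nothing is left". The header below carries remaining milliseconds rather than an absolute timestamp because clocks differ across hosts; that encoding, the min_useful threshold, and the send callback are assumptions for illustration, not a standard.

```python
import time


class DeadlineExceeded(Exception):
    pass


def remaining_budget(deadline, clock=time.monotonic):
    """Seconds left before the caller's absolute (local-clock) deadline."""
    return deadline - clock()


def call_downstream(send, deadline, min_useful=0.005, clock=time.monotonic):
    """Forward the remaining budget as header and per-call timeout, or fail fast."""
    budget = remaining_budget(deadline, clock)
    if budget <= min_useful:
        raise DeadlineExceeded("caller has already given up; skipping the call")
    # Illustrative header carrying the remaining time in milliseconds.
    headers = {"X-Request-Deadline": str(int(budget * 1000))}
    return send(headers=headers, timeout=budget)
```

The receiving service would turn the header back into a local deadline on arrival and repeat the same check before each of its own outbound calls.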

Graceful degradation

When a non-critical dependency is unavailable, return a degraded but useful response: cached data, default recommendations, a “we couldn’t load reviews” placeholder. Netflix’s classic example: when the personalization service fails, show the most-popular list instead of erroring.

Idempotency keys

Any write operation must be safely retryable. Client supplies a unique key per logical operation (Idempotency-Key header, Stripe’s convention since 2015); server records first response and returns it on retry. Required for at-least-once event consumers and for client-side retries over unreliable networks.
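Server-side handling can be sketched as a lookup before the real handler runs. The dictionary is an in-memory stand-in for a durable store; a real implementation would insert the key atomically (for example under a unique constraint) to cope with concurrent duplicates, and would expire old keys. All names here are hypothetical.

```python
class IdempotentHandler:
    def __init__(self, handler):
        self.handler = handler
        self.responses = {}  # idempotency key -> first response (stand-in store)

    def handle(self, idempotency_key, request):
        # A retry with a known key gets the recorded response; no side effects rerun.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        response = self.handler(request)
        self.responses[idempotency_key] = response
        return response
```

The key identifies the logical operation, so the client must reuse the same key across retries and generate a new one for each new operation.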

Hedged requests

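Send the same request to more than one replica and use whichever response arrives first (Dean & Barroso, “The Tail at Scale,” CACM 2013). To keep the extra load small, the paper defers the second request until the first has been outstanding longer than a high percentile (e.g., p95) of expected latency. Trades extra load for latency-tail reduction. Effective on read-heavy, latency-sensitive paths; unsafe for writes unless they are idempotent.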
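A delayed-hedge variant, where the backup request is sent only after the first has been outstanding for a while, can be sketched with asyncio. The call coroutine, replica identifiers, and hedge_after value are assumptions; the sketch returns (or raises) whatever the first finished attempt produced and does not retry on errors.

```python
import asyncio


async def hedged(call, replicas, hedge_after):
    """Call replicas[0]; if there is no reply within hedge_after seconds, also
    call the next replica, and so on. Return the first result to arrive."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(call(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_after, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # Every replica has been tried; wait for whichever finishes first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # no-op for finished tasks; stops the losers
```

Setting hedge_after near the dependency's p95 latency means only about one request in twenty triggers a backup, which bounds the extra load.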

Load shedding

When at capacity, reject excess requests rather than letting everything time out. Google’s Doorman (2016) and Facebook’s Concord are classic implementations. The simple rule: it’s better to serve 90% of requests well than 100% poorly.

Rate limiting

Per-tenant, per-endpoint, per-IP. Algorithms:

  • Token bucket — refill at rate R, capacity C; allows bursts up to C.
  • Leaky bucket — fixed-rate drain; smooths bursts.
  • Fixed window — N requests per minute; cliff edge at boundary.
  • Sliding window log — exact but memory-hungry.
  • Sliding window counter — approximate, low-memory.
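Of the algorithms above, the token bucket is the easiest to sketch. The version below refills lazily on each check rather than on a timer; the injectable clock and the cost parameter are assumptions for illustration, and the class is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (R)
        self.capacity = capacity  # maximum burst size (C)
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Lazy refill: credit the time elapsed since the last check, capped at C.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per tenant, endpoint, or IP gives the per-key limits described above; a shared store is needed once more than one process enforces the same limit.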

Enforce at the gateway and within services. Stripe, GitHub, and Cloudflare publish their rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After).

Adaptive concurrency

Static limits are fragile — too low wastes capacity, too high invites collapse. Netflix’s concurrency-limits library and Envoy’s adaptive concurrency filter implement TCP-Vegas-style algorithms: probe the limit by measuring latency; back off when latency rises faster than throughput. The system finds its own ceiling under shifting load, hardware, and downstream conditions.

Backpressure

Async pipelines need explicit pressure signaling. Reactive Streams (the JVM spec adopted into java.util.concurrent.Flow), Project Reactor, RxJava, Akka Streams all implement demand-based pull semantics so a slow consumer slows the producer instead of OOMing it. Kafka consumers get a similar effect from the pull model: a consumer fetches records only as fast as it can process them, so a slow consumer falls behind rather than being overrun; consumer lag is the backpressure signal.

9. API design

  • REST + OpenAPI 3.x (Swagger) — the de facto contract format; tooling generates clients, docs, mock servers. Spec the API, then enforce it.
  • gRPC + Protobuf — schema-first; wire compatibility holds as long as you follow the evolution rules (never reuse a tag number, never change a field’s type).
  • GraphQL — flexible queries; pair with persisted queries to control cost; consider Apollo Federation for multi-service composition.
  • Backend-for-Frontend (BFF) — Sam Newman’s pattern; separate API tailored per consumer (web BFF, iOS BFF, Android BFF). The BFF aggregates from underlying services so client logic stays simple. SoundCloud is the classic case.
  • Versioning — URI versioning (/v1/orders) is most common; header versioning (Accept: application/vnd.acme.v2+json) is purer; feature toggles allow per-tenant rollout without version bumps. Pick one and be consistent.
  • Pagination — cursor (opaque, stable) beats offset (drift under concurrent inserts) for any non-trivial volume.
  • Error envelopes — RFC 9457 Problem Details for HTTP APIs (which obsoletes RFC 7807) is the standard structure; many orgs ignore it and define their own. Pick a shape and use it everywhere.
  • Consumer-driven contracts — Pact (Pactflow), Spring Cloud Contract. Consumers declare what they need; the broker stores those expectations; the producer’s CI verifies it hasn’t broken any consumer. Catches integration breaks before deploy.
  • Tolerant readers (Martin Fowler’s Tolerant Reader pattern) — clients should parse only the fields they need and ignore the rest. Combined with additive-only schema evolution on the producer side, this makes most field-level changes non-breaking.
  • Schema registry — Confluent Schema Registry, Apicurio. For event payloads (Avro, Protobuf, JSON Schema); registry enforces compatibility rules (backward, forward, full) on every publish.
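The tolerant-reader idea from the list above can be sketched as a parser that picks out only the fields it needs. The payload shape, the field names, and the "USD" default are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderSummary:
    order_id: str
    total_cents: int
    currency: str


def read_order(payload: str) -> OrderSummary:
    """Extract only the needed fields; unknown fields are ignored, and an
    optional field that is missing falls back to a default."""
    doc = json.loads(payload)
    return OrderSummary(
        order_id=doc["id"],
        total_cents=int(doc["total_cents"]),
        currency=doc.get("currency", "USD"),  # assumed default for this sketch
    )
```

A producer can add fields such as loyalty_tier without breaking this consumer; removing or retyping id or total_cents would still be a breaking change.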

10. Distributed tracing and correlation

A user request fans out to N services; without tracing, you cannot reconstruct what happened.

  • W3C Trace Context — the traceparent and tracestate headers; the modern standard for propagation.
  • OpenTelemetry — CNCF, the merger of OpenTracing and OpenCensus (2019). Provides SDKs for most languages, exporters for most backends (Jaeger, Tempo, Zipkin, Honeycomb, Datadog).
  • Spans — each unit of work; spans nest into traces. Add attributes: user ID, tenant, status code, retry count, decision branch.
  • Sampling — head-based (decide at request entry) or tail-based (collect all, then decide based on errors/slowness). Tail-based catches the interesting traces.
  • Correlation IDs — even where tracing isn’t deployed, propagate a request-ID header (X-Request-Id, X-Correlation-Id) through every log line. Without this, multi-service log diving is grep-and-pray.
  • Baggage — OTel baggage propagates application-level context (tenant ID, feature flag bucket, deployment cell) across hops without baking it into URLs or bodies.
  • Exemplars — link metric data points back to individual traces (Prometheus exemplars + Grafana drill-down). Click a latency spike on a histogram; jump to a trace that caused it.
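Propagation with the traceparent header can be sketched by hand, although in practice an OpenTelemetry SDK does this for you. The sketch handles only version 00 of the header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and ignores tracestate; the function names are hypothetical.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def child_traceparent(incoming=None):
    """Header for an outbound call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed:
        trace_id, _, flags = parsed
    else:
        trace_id, flags = secrets.token_hex(16), "01"  # start a new sampled trace
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Every hop keeps the same trace ID and replaces the parent ID with its own span ID, which is what lets a backend stitch the spans into one trace.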

See [[Compute/observability-stack]] for the full pillar treatment (logs, metrics, traces, profiles).

Service catalog and ownership

A microservices org without a catalog cannot answer “who owns service X, what does it depend on, who do I page at 2am?” Backstage’s Software Catalog (Spotify OSS) and competitors (Cortex, OpsLevel, Port.io) tag each service with owner, on-call, runbook, SLOs, dependencies, tier (T0/T1/T2), data classification, and version. Service-level objectives become first-class artifacts measured against burn rates.

11. Caching layers

  • In-process — Caffeine (Java, Window-TinyLFU), Guava Cache, Ristretto (Go). Sub-microsecond latency; not shared across instances.
  • Distributed — Redis (in-memory store with single-threaded command execution; AOF + RDB persistence; clustering; Streams; pub/sub; relicensed from BSD to dual RSALv2/SSPLv1 in 2024, with Valkey forked under the Linux Foundation), Memcached (simpler, multi-threaded, no persistence), Dragonfly (multi-threaded Redis-compatible, 2022), KeyDB.
  • CDN — CloudFront, Cloudflare, Fastly, Akamai, BunnyCDN; cache static assets and cacheable API responses at the edge. Soft-purge, stale-while-revalidate, surrogate keys (Fastly) for fine-grained invalidation.

Patterns:

  • Cache-aside — app checks cache, falls back to DB, populates cache. Most common.
  • Read-through / write-through — cache library reads from DB on miss / writes to DB on put. Hides DB from app code.
  • Write-back / write-behind — writes go to cache first, async to DB. Risks data loss; rarely worth it.
  • TTL + stale-while-revalidate — serve stale while a background refresh runs.
  • Cache stampede prevention — request coalescing (Caffeine’s AsyncLoadingCache, Go singleflight), probabilistic early expiration (Vattani et al, “Optimal Probabilistic Cache Stampede Prevention,” 2015), lock-based recompute.
  • Negative caching — cache “not found” results too, with shorter TTL; prevents repeated DB hits for the same missing key.
  • Hot-key sharding — split a hot key across N suffixed shards on the client; aggregate on read. Mitigates the single-key bottleneck.
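Cache-aside with TTLs and negative caching, from the patterns above, can be sketched in-process. The load callback (returning None for a missing row), the TTL values, and the injectable clock are assumptions; there is no eviction or stampede protection in this sketch.

```python
import time

_MISSING = object()  # sentinel stored for "not found" results


class CacheAside:
    def __init__(self, load, ttl=60.0, negative_ttl=5.0, clock=time.monotonic):
        self.load = load
        self.ttl = ttl
        self.negative_ttl = negative_ttl
        self.clock = clock
        self.entries = {}  # key -> (value or _MISSING, expires_at)

    def get(self, key):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            return None if entry[0] is _MISSING else entry[0]
        value = self.load(key)  # cache miss or expired entry: fall back to the DB
        if value is None:
            # Negative caching: remember the miss, but for a shorter time.
            self.entries[key] = (_MISSING, now + self.negative_ttl)
        else:
            self.entries[key] = (value, now + self.ttl)
        return value
```

The shorter negative TTL bounds how long a newly created row can appear missing.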

12. Eventual consistency strategies

  • Event-driven replication — capture changes via CDC (Debezium reading Postgres WAL or MySQL binlog → Kafka), project into downstream service-local read models. Strong durability guarantee at the source; downstream lag is observable.
  • Saga with compensations — already covered; appropriate when multiple services must agree on the outcome of a business transaction.
  • CRDTs (Conflict-Free Replicated Data Types, Shapiro et al 2011) — data structures designed so independent replicas converge after merge: G-Counter, PN-Counter, LWW-Register, OR-Set, RGA. Libraries such as Yjs and Automerge package them for application use. Used by or influential on Figma, Linear, Riak, Redis CRDTs, and collaborative-editor stacks.
  • Vector clocks / version vectors — track causal history so the system can detect concurrent writes and surface conflicts.
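The simplest CRDT, a grow-only counter (G-Counter), shows why replicas converge: each replica increments only its own slot, and merge takes the per-replica maximum. This is a minimal state-based sketch with hypothetical names, not any library's API.

```python
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, n=1):
        # A replica only ever advances its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merge order and repeated merges do not change the result.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)
```

A PN-Counter is two G-Counters, one for increments and one for decrements, with the value being their difference.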

13. Deployment

Containerized: Docker images, OCI registries, Kubernetes for orchestration. See [[Compute/kubernetes-deep]] for K8s, [[Compute/containers-service-mesh]] for the runtime layer.

Per-service CI/CD pipelines — each service has its own pipeline, build, test, deploy. Trunk-based development; small batches; automated rollout. GitHub Actions, GitLab CI, Buildkite, Tekton, Argo Workflows.

Progressive delivery:

  • Canary — route 1% → 5% → 25% → 100% to new version while monitoring error rate, latency, business metrics. Flagger, Argo Rollouts automate this.
  • Blue-green — full duplicate environment; cut traffic over; keep old around for rollback.
  • Feature flags — decouple deploy from release. LaunchDarkly, Unleash (OSS), Statsig, Split, ConfigCat. Run experiments, gate features per cohort, kill switches.
  • Shadow traffic — mirror prod traffic to a new version, discard response. Validates without user impact.

GitOps — declarative cluster state in Git; reconciler (ArgoCD, Flux) makes the cluster match. PR reviews become deploy approvals. Drift is detected and corrected.

Multi-cluster / multi-region:

  • Active-active — traffic to all regions; lower latency for users; complex data consistency.
  • Active-passive — one region serves; others standby. Simpler; longer failover.
  • Cell-based — multiple isolated cells per region; failover is per-cell, not per-region.

Image supply chain: distroless or minimal base images (Google Distroless, Chainguard, Wolfi); SBOMs (Syft, CycloneDX, SPDX); image signing (cosign + Sigstore, in-toto attestations); admission control (Kyverno, OPA Gatekeeper) to enforce signature + provenance at the cluster door. Post-Log4Shell and SolarWinds, this is no longer optional.

Internal developer platform (IDP): Backstage (Spotify OSS 2020, CNCF) for service catalogs, scorecards, golden paths, software templates; Humanitec, Port.io as commercial alternatives. A platform team curates paved roads so product teams don’t reinvent CI/CD, observability, and ops per service.

14. Anti-patterns

  • Distributed monolith — services deployed independently in theory but co-deployed in practice; shared schema; one service’s release requires another’s. You paid for microservices and got monolith disadvantages plus network costs.
  • Shared database — multiple services read/write the same tables; defeats independence; every schema change is cross-team coordination.
  • Chatty interfaces — a single user action triggers N+1 inter-service calls; latency multiplies; failures multiply. Solve with coarser APIs, batching, BFFs, or by rethinking the service boundary.
  • Microservice envy — copying FAANG architecture without FAANG scale or team count. Premature decomposition at a 20-person startup.
  • Distributed transactions across services — XA, 2PC; don’t. Use sagas.
  • Sync-only call chains — every service depends synchronously on every other; one slow leaf blocks everyone. Mix in async where the user doesn’t need an immediate confirmation.
  • No observability — no distributed tracing, no correlation IDs, no structured logs. You cannot debug a distributed system without it.
  • Versioning chaos — no published contract, breaking changes shipped without warning, clients break in production. Use OpenAPI / Protobuf + CI contract tests (Pact for consumer-driven contracts).
  • Per-service bespoke standards — every team picks a different language, framework, logging format, deploy tool. Sounds free but produces a maintenance swamp.
  • God service — one service everyone depends on (auth, user-profile, “core”); its availability becomes the system’s availability ceiling and its team becomes the bottleneck. Decompose or replicate.
  • Chatty events — every state change emitted as a separate event; downstream consumers drown in volume. Coarsen the events; emit business-meaningful facts, not table changes.
  • Synchronous fan-in — service A calls B, which calls C, which calls back into B, which calls D. Latency accumulates hop by hop; failures correlate. Flatten with orchestrators or event-driven flows.
  • Shared client libraries that hide RPC — clients pretend a remote call is a local function; teams forget about timeouts, retries, idempotency. Make remote-ness explicit at call sites.
  • Database per service except for the shared “lookup tables” — the slippery slope to a shared DB. The right answer is a Reference Data service or replicated read models, not a shared schema.
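The availability cost behind the sync-only chain and god-service items above can be made concrete with a little arithmetic. This is a minimal sketch that assumes each dependency fails independently (real failures are often correlated, so treat the numbers as illustrative); the function name and the 99.9% figures are assumptions for the example, not measurements.

```python
import math


def chain_availability(availabilities):
    """Availability of a request that needs every listed service to succeed.

    Assumes independent failures, which is an idealization.
    """
    return math.prod(availabilities)


# Ten synchronous hops, each at 99.9%, compound to roughly 99.0%.
ten_hops = chain_availability([0.999] * 10)
print(f"{ten_hops:.4f}")
```

The same product explains the god-service ceiling: no request path that includes a service can be more available than that service.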

15. Resilience testing — chaos engineering

You cannot wait for production failures to teach you. Inject them.

  • Chaos Monkey — Netflix 2011; randomly terminates EC2 instances during business hours. First broadly-known chaos tool; lives on as Chaos Monkey for Spinnaker and the wider Simian Army (Chaos Gorilla — AZ outage; Latency Monkey — network slowness; Janitor Monkey — unused resources).
  • Gremlin — commercial chaos platform; CPU, memory, IO, network attacks; safe headers; targeting; blast-radius limits.
  • Chaos Mesh — CNCF, Kubernetes-native chaos; rich fault types (pod kill, network delay/loss, IO, time skew, JVM faults).
  • LitmusChaos — CNCF, Kubernetes-native; experiment hub; integration with GitOps.
  • AWS Fault Injection Service — managed fault injection on AWS resources.
  • Toxiproxy (Shopify) — TCP proxy for simulating network conditions in tests.

Game days: scheduled, scoped exercises where the team intentionally fails parts of the system and verifies that monitoring, alerts, runbooks, and people respond correctly. Document the unknowns surfaced.

Resilience SLOs: measure not just uptime but recovery — MTTD (detect), MTTR (recover), and burn-rate alerts on error budgets (Google SRE workbook).
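Burn rate is the observed error ratio divided by the error budget implied by the SLO. The sketch below shows that arithmetic; the function names and the 30-day budget period are assumptions chosen for illustration, and the 14.4 figure mirrors the fast-burn example commonly quoted from the SRE workbook.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are arriving."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


# 1.44% errors against a 99.9% SLO burns budget about 14.4x too fast;
# sustained for one hour, that spends about 2% of a 30-day budget.
rate = burn_rate(0.0144, 0.999)
print(round(rate, 1), round(budget_consumed(rate, 1), 3))
```

A burn rate of 1.0 means the budget would be exactly exhausted at the end of the period; alerting on high rates over short windows catches fast regressions without paging on slow drift.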

Principles of chaos engineering (principlesofchaos.org, Casey Rosenthal et al., Netflix 2017): (1) build a hypothesis around steady-state behavior; (2) vary real-world events; (3) run experiments in production; (4) automate experiments to run continuously; (5) minimize blast radius. The discipline matured into the O’Reilly book Chaos Engineering (Rosenthal + Jones, 2020).
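The hypothesis, fault, and blast-radius ideas above can be sketched in-process, without any of the tools listed earlier. Everything here is hypothetical: `FaultInjector`, `fetch_price`, and `price_with_fallback` are illustrative names, and a real experiment would inject faults at the network or infrastructure level rather than in a wrapper.

```python
import random


class FaultInjector:
    """Wraps a callable and fails a configurable fraction of calls."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate  # blast radius: share of calls affected
        self.rng = random.Random(seed)    # seeded so the experiment is repeatable

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_price(sku):
    """Hypothetical downstream dependency."""
    return 100


def price_with_fallback(dependency, sku, default=0):
    """System under test: degrades to a default instead of failing."""
    try:
        return dependency(sku)
    except ConnectionError:
        return default


def run_experiment(calls=1000, failure_rate=0.1, min_success=0.99):
    """Hypothesis: the caller still answers at least `min_success` of requests."""
    flaky = FaultInjector(fetch_price, failure_rate)
    answered = 0
    for _ in range(calls):
        try:
            price_with_fallback(flaky, "sku-1")
            answered += 1
        except Exception:
            pass  # an unanswered request counts against the hypothesis
    return answered / calls >= min_success


print(run_experiment())
```

Keeping `failure_rate` small and the run seeded is the in-miniature version of minimizing blast radius and making the experiment repeatable.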

15a. Security in microservices

Each network hop is an attack surface. Defaults that should be table stakes:

  • mTLS everywhere — SPIFFE/SPIRE for workload identity (CNCF); SVIDs (SPIFFE Verifiable Identity Documents) as X.509 or JWT; mesh-managed cert rotation. Eliminates “is this caller really service X?” questions.
  • Zero-trust — never trust because of network position; authenticate every request. BeyondCorp (Google), BeyondProd (Google’s services analog, 2019 whitepaper).
  • Authorization — OAuth 2.0 + OIDC at the edge; service-to-service via JWTs with short TTLs; policy-as-code via Open Policy Agent (OPA, CNCF) or Cedar (AWS, 2023).
  • Secrets management — HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, External Secrets Operator; never commit secrets to image or repo; rotate aggressively.
  • Supply-chain integrity — see image supply chain above; SLSA (Supply-chain Levels for Software Artifacts) provenance levels.
  • Per-service threat model — STRIDE or PASTA per service; data classification informs encryption-at-rest and network egress controls.

16. Alternatives and emerging directions

  • Modular monolith comeback — Shopify (Rails monolith with strict module boundaries enforced via the packwerk tool); GitHub continuing to run its Rails monolith (in the spirit of DHH’s “Majestic Monolith”); Stack Overflow’s well-known monolithic Q&A platform. One deploy unit, internal bounded contexts, the full developer-experience advantages of a monolith, and an extraction path that stays open when scale demands.
  • Self-contained systems (SCS) — an INNOQ-promoted style; each SCS is a vertical slice (UI + logic + data) of a domain; SCSes integrate via UI composition and async events; no synchronous inter-SCS calls.
  • Functions / serverless — AWS Lambda; GCP Cloud Functions and Cloud Run; Azure Functions; Cloudflare Workers (V8 isolates, <1ms cold start, edge-native); Vercel Functions; Deno Deploy. Micro-microservices for spiky workloads, glue, and edge logic.
  • Cell-based architecture — AWS internal blueprint, DoorDash 2022 post, Slack 2023 post. Group services into independent cells; each cell handles a slice of traffic; cell failure is bounded.
  • Edge computing — push logic to CDN edges. Cloudflare Workers, Fastly Compute@Edge, Deno Deploy, Vercel Edge Functions. Latency under 50ms globally; storage primitives (KV, D1, Durable Objects) follow.
  • Service Weaver (Google OSS 2023) — write a monolith in Go, deploy as microservices by annotating component boundaries; runtime decides what’s in-process vs RPC. Direction-of-travel: dev experience of monolith, ops shape of microservices.
  • WebAssembly components — Wasm + WASI 0.2 + the component model promise polyglot microservices that share an embedded runtime; Fermyon Spin, wasmCloud lean into this.
  • AI-augmented service development — Copilot-style assistants generate OpenAPI specs from intent, scaffold services, generate test suites; AgentDB and LLM-driven runbooks reshape on-call.
  • eBPF-everywhere data plane — Cilium’s mesh, Tetragon for runtime security, Pixie for observability; kernel-level visibility and policy without sidecars.
  • OpenTelemetry as the universal wire format — single instrumentation library across logs, metrics, traces, profiles; vendors compete on backends instead of agents.
  • Postgres-as-a-platform pushback — orgs consolidating onto a beefy Postgres with logical replication, partitioning, and pgvector rather than fragmenting state across N stores per service. “Just use Postgres” memes reflect real architectural simplification.
  • Polyrepo to monorepo migration — Bazel, Pants, Nx, Turborepo. Microservices in a monorepo retain shared tooling, atomic cross-service refactors, and one source of truth while preserving independent deployment.
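The routing step in the cell-based architecture item above (each cell handles a slice of traffic) can be sketched with a stable hash. This is an assumption-laden illustration, not how any of the named companies route traffic: `cell_for`, the cell names, and the tenant IDs are all hypothetical.

```python
import hashlib


def cell_for(tenant_id, cells):
    """Map a tenant to one cell with a stable hash (same input, same cell)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


CELLS = ["cell-a", "cell-b", "cell-c"]  # hypothetical cell names
print(cell_for("shop-42", CELLS))
```

Plain modulo hashing remaps many tenants whenever the number of cells changes, so production routers more plausibly keep an explicit tenant-to-cell table or use consistent hashing; the sketch only shows that routing must be deterministic so a tenant's blast radius stays within one cell.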

16a. Testing strategies

Distributed systems break the assumptions of in-process testing. The pyramid stretches:

  • Unit tests — pure-logic; per-service; fast; the bulk of the suite. No network, no DB.
  • Component tests — one service started up with its dependencies stubbed; tests the service through its public API. Testcontainers, WireMock, MockServer for fake collaborators.
  • Contract tests — Pact / Spring Cloud Contract; producer verifies it satisfies all consumer expectations; broker mediates. The replacement for fragile end-to-end tests across services.
  • Integration tests — a small set of services running together; verify the seams. Docker Compose or Testcontainers.
  • End-to-end tests — full deployment; few in number; slow; flaky if abused. Reserve for golden-path scenarios.
  • Production tests — synthetic monitors, canary analysis, chaos experiments. Where the real signals live.

The microservices-friendly shift is away from heavy E2E suites and toward contract tests + observability + canaries. Spotify’s “honeycomb model” and Honeycomb’s own “observability-driven development” push the same direction: test in prod safely, instrument heavily, ship small.
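The idea behind consumer-driven contract tests can be shown without any framework. The sketch below is not Pact's or Spring Cloud Contract's API; `satisfies`, the field names, and the sample payloads are hypothetical and only illustrate the check a provider's CI would run against each consumer's recorded expectations.

```python
def satisfies(contract, response):
    """True if `response` has every field the consumer relies on, with the right type.

    Extra provider fields are allowed; consumers only pin what they use.
    """
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )


# What a hypothetical checkout consumer needs from the orders service.
CHECKOUT_EXPECTS = {"order_id": str, "total_cents": int}

ok = {"order_id": "o-1", "total_cents": 1299, "currency": "EUR"}
renamed = {"id": "o-1", "total_cents": 1299}  # breaking rename on the provider side

print(satisfies(CHECKOUT_EXPECTS, ok), satisfies(CHECKOUT_EXPECTS, renamed))
```

Adding a field passes while renaming one fails, which is the asymmetry that lets providers evolve freely until they touch something a consumer actually reads.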

17. Cost reality

As a rough rule of thumb, microservices run 2–3× the operational cost of an equivalent monolith. The bill items:

  • Kubernetes + control-plane SRE — at minimum one platform engineer per ~50 services.
  • Observability stack — Prometheus + Grafana + Loki + Tempo or commercial (Datadog, Honeycomb, New Relic); cardinality bills can dwarf compute bills.
  • CI/CD infrastructure — runners, artifact registries, signing, environments.
  • Service mesh / gateway — Istio, Linkerd, Envoy, Kong; ops and tuning.
  • Network egress — inter-service traffic in cloud is metered.
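  • Serialization + RPC overhead — gRPC is cheap per call, but every hop adds serialization CPU, bytes on the wire, and a network round trip (typically hundreds of microseconds inside a datacenter, more across zones).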
  • Coordination overhead — humans spending time on cross-team contracts, schema reviews, dependency-graph awareness.

Justify the spend with: independent team velocity (5+ teams shipping in parallel without stepping on each other) and scale bottlenecks (>100k QPS where a hot path needs to scale separately from cold paths). Below both thresholds, a well-engineered monolith wins on cost and velocity.

18. Famous case studies

  • Netflix — pioneer of the cloud microservices movement. Open-sourced Hystrix (2012, circuit breakers), Eureka (service registry), Ribbon (client-side LB), Zuul (gateway), Chaos Monkey (announced 2011, open-sourced 2012), Spinnaker (deploy). Approximately 1000 services circa 2020. Migrated off Hystrix to Resilience4j-style libraries plus Envoy/mesh.
  • Amazon — Bezos’s 2002 “Service-Oriented Architecture” memo mandated that all teams expose their data and functionality through service interfaces, with no other communication allowed; failure meant termination. The decree that arguably created modern cloud computing — AWS itself is the externalization of those internal services.
  • Uber — grew to 4000+ microservices by ~2020; documented the operational pain in posts on tracing, deployment, and on-call; has since emphasized platform tooling and grouping services into domains (its Domain-Oriented Microservice Architecture, DOMA).
  • Shopify — Rails monolith, very deliberately. The packwerk tool enforces module boundaries inside the monolith; cells (called “pods” internally) partition the platform by shop. A reference for “modular monolith + cells” done well.
  • Stripe — primarily a Ruby monolith with targeted services where the workload demands (e.g., webhooks, ML, fraud). Famous for engineering rigor on the monolith path.
  • Monzo — UK challenger bank; ~2000 Go microservices; an extreme case at relatively modest engineer-count, made workable by heavy investment in shared frameworks and Kubernetes tooling. Documented openly in conference talks.
  • GitHub — Rails monolith (in the “Majestic Monolith” tradition, to borrow DHH’s term) for >15 years; extracted a small number of services (e.g., search) only where genuinely necessary. Continues to scale.
  • Spotify — squad model + microservices; documented operational complexity and tooling (Backstage, OSS 2020, now CNCF — internal developer portal).
  • SoundCloud — early BFF case (Phil Calçado’s posts coined “Backend-for-Frontend” as a public term).
  • Gilt Groupe — early adopter (2011 onward); Adrian Cockcroft / Yoni Goldberg materials.
  • DoorDash — public cell-based architecture series (2022–2024); migration off a Python/Django monolith into Kotlin services; OLTP partitioning by region.
  • Slack — cell-based “Service Frontier” architecture (2023 engineering posts); Vitess-sharded MySQL underlying it.
  • Twitter / X — Finagle (RPC + service discovery, Scala, OSS 2011), one of the early microservice frameworks; later platform consolidation under the X rebrand has been less publicly documented.
  • Airbnb — migrated from its original Rails monolith to a Service-Oriented Architecture from 2018 onward; documented the cost in deploy time and developer experience; partially walked back to a “macroservice” granularity.
  • Etsy — held the line as a PHP monolith for years; later Scala/Java services for specific load profiles. Etsy’s deployment culture (Continuous Deployment, 50+ deploys/day pre-microservices) is itself a case study.
  • The Guardian — Scala monolith for a decade, gradual extraction; documented in QCon talks and the Guardian engineering blog.
  • Capital One — large-bank microservices migration (2014–2020); 2000+ services running on AWS; one of the most public enterprise reference architectures.

18a. Decision matrix — monolith vs modular monolith vs microservices

Factor              | Monolith              | Modular monolith                  | Microservices
Team count          | 1–3                   | 3–10                              | 10+
Deploy independence | none                  | low (single artifact)             | high
Cognitive load      | one codebase          | one codebase, enforced modules    | many codebases, contracts
Observability bar   | logs + APM            | logs + APM                        | logs + metrics + traces + SLOs
Failure isolation   | none                  | module-level (logical)            | service-level (physical)
Polyglot            | one stack             | one stack                         | per-service choice
Refactor cost       | low (in-IDE)          | low                               | high (cross-repo)
Operational cost    | 1×                    | 1.1×                              | 2–3×
Bottleneck scaling  | whole app             | whole app                         | per-service
Right when          | startup, MVP, <10 eng | medium, clear domain, single team | many teams, varied load, mature ops

The progression is one-way only by convention; teams routinely migrate back from microservices to modular monoliths once the cost is felt without the benefit.

19. Cross-references

  • [[Compute/distributed-systems-fundamentals]] — CAP, consensus, partial failure, time and ordering.
  • [[Compute/_index]] — Compute domain index.
  • [[Compute/kubernetes-deep]] — control plane, pods, services, networking, operators.
  • [[Compute/containers-service-mesh]] — sidecar / sidecar-less mesh, mTLS, traffic management.
  • [[Compute/observability-stack]] — logs, metrics, traces, profiles; OpenTelemetry.
  • [[Compute/networking-foundations]] — L4/L7, TLS, HTTP/2, HTTP/3.
  • [[Compute/consensus-protocols]] — Raft, Paxos, ZAB; the basis of strongly-consistent coordination.
  • [[Compute/database-internals]] — replication, isolation levels, CDC.

20. Citations

  • Newman, Sam. Building Microservices. 2nd ed. O’Reilly, 2021.
  • Newman, Sam. Monolith to Microservices. O’Reilly, 2019.
  • Richardson, Chris. Microservices Patterns. Manning, 2018.
  • Evans, Eric. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley, 2003.
  • Vernon, Vaughn. Implementing Domain-Driven Design. Addison-Wesley, 2013.
  • Nygard, Michael. Release It! Design and Deploy Production-Ready Software. 2nd ed. Pragmatic Bookshelf, 2018.
  • Hohpe, Gregor and Bobby Woolf. Enterprise Integration Patterns. Addison-Wesley, 2003.
  • Skelton, Matthew and Manuel Pais. Team Topologies. IT Revolution, 2019.
  • Lewis, James and Martin Fowler. “Microservices.” martinfowler.com, March 2014.
  • Conway, Melvin. “How Do Committees Invent?” Datamation, April 1968.
  • Brewer, Eric. “Towards Robust Distributed Systems” (CAP theorem). PODC keynote, 2000.
  • Bezos, Jeff. “Service-Oriented Architecture” internal memo, Amazon, 2002.
  • Garcia-Molina, Hector and Kenneth Salem. “Sagas.” SIGMOD 1987.
  • Helland, Pat. “Life beyond Distributed Transactions: An Apostate’s Opinion.” CIDR 2007.
  • Young, Greg. “CQRS Documents.” 2010.
  • Shapiro, Marc et al. “Conflict-Free Replicated Data Types.” 2011.
  • Dean, Jeffrey and Luiz André Barroso. “The Tail at Scale.” Communications of the ACM, February 2013.
  • Fielding, Roy. Architectural Styles and the Design of Network-based Software Architectures (REST). PhD dissertation, UC Irvine, 2000.
  • Brooker, Marc. “Exponential Backoff And Jitter.” AWS Architecture Blog, 2015.
  • Beyer, Betsy et al. Site Reliability Engineering (Google SRE book). O’Reilly, 2016; The Site Reliability Workbook, 2018.
  • Calçado, Phil. “The Back-end for Front-end Pattern (BFF).” 2015.
  • Netflix Tech Blog — Hystrix, Eureka, Chaos Monkey, Spinnaker series, 2011–2020.
  • DoorDash Engineering. “Building Faster Indexing with Apache Kafka and Elasticsearch” and the cell-based architecture series, 2022–2024.
  • Shopify Engineering. “Deconstructing the Monolith” and packwerk posts, 2020–2024.
  • Monzo. “The Modular Monolith and the Microservices: Lessons from Banking” and Go microservices talks, 2017–2023.
  • CNCF Landscape — Linkerd, Istio, Envoy, OpenTelemetry, Argo, Flagger, Chaos Mesh, LitmusChaos, Backstage.
  • Google. Service Weaver project documentation, 2023.