Core Concepts

The fundamental building blocks of Kubernetes — containers, clusters, pods, and how they connect.

Container

What is it: A container is a standardized, lightweight, executable software package that bundles an application with all its dependencies (code, runtime, libraries, system tools, config) into a single unit that runs isolated from the host OS. Unlike a VM, a container shares the host kernel and relies on Linux primitives: namespaces (isolating the PID, network, mount, UTS, IPC, and user views), cgroups (resource limits on CPU, memory, and I/O), and union filesystems (overlayfs for layered images). Containers typically start in well under a second, often weigh only tens of megabytes, and behave the same across laptops, CI servers, and production clusters.
Key features
  • Process-level isolation: Each container is just a Linux process (or tree) with restricted visibility — no separate kernel, no hypervisor.
  • Immutable images: A container image is a read-only snapshot; changes happen in a thin writable layer that is discarded when the container dies.
  • Portability: Build once, run anywhere that speaks the OCI runtime spec (Linux x86_64/arm64, Windows containers on Server 2019+).
  • Declarative builds: A Dockerfile or Containerfile describes the image deterministically, enabling reproducible artifacts.
  • Density: Hundreds of containers per host vs. dozens of VMs — no duplicated kernel and no 2 GB guest OS footprint.
How it differs
  • vs Virtual Machines: VMs virtualize hardware via a hypervisor (KVM, Hyper-V, ESXi) and run a full guest OS. Containers virtualize the OS. VMs take 30+ seconds to boot and 1-4 GB RAM each; containers boot in <1 second with <50 MB overhead.
  • vs chroot/jails: Containers combine cgroups (resource limits) with network and other namespaces, all of which plain chroot lacks; BSD jails offer comparable isolation but are tied to the BSD family.
  • vs PaaS (Heroku, GAE): You control the entire image — OS packages, runtime version, every binary. PaaS hides this and imposes opinionated stacks.
  • vs Unikernels: Unikernels compile app + kernel into a single binary; containers keep the host kernel but isolate userspace.
Why use it: Containers address the classic "works on my machine" problem by bundling the environment with the code. They enable microservices (each service is its own image), fast CI/CD (images build and ship quickly), elastic scaling (new replicas start in seconds), and high density (more workloads per host, which cuts cloud spend). They are the unit of deployment for modern cloud-native systems and the workload that Kubernetes schedules.
Common gotchas: Containers are not a strong security boundary by default: a kernel exploit can escape them, hence tools like gVisor, Kata Containers, and Firecracker for stronger isolation. PID 1 semantics are tricky: the kernel applies no default signal actions to PID 1, so an app that installs no SIGTERM handler ignores the signal and graceful shutdown breaks (use tini or --init). Writing to the container filesystem at runtime defeats immutability; use volumes for data that must persist. Image bloat (1+ GB images full of build tools) is common; use multi-stage builds and distroless bases.
Real-world examples: Google has run its workloads in containers on its internal Borg system since roughly 2003 and has reported launching 2+ billion containers per week. Netflix runs containers on its Titus platform. Spotify, Uber, Airbnb, Shopify, and GitHub ship containerized microservices to production many times per day. Containers are the universal packaging format for cloud workloads in the 2020s.
Simple Definition: A container is a lightweight, standalone package that includes your code and everything it needs to run — runtime, libraries, settings. It runs the same way everywhere.

Deep Dive:

Containers use two Linux kernel features to work:

  • Namespaces — Give each container its own isolated view of the system (process tree, network, filesystem, users). A container can't see other containers or the host processes.
  • Cgroups (Control Groups) — Limit how much CPU, memory, disk I/O, and network a container can use. Prevents one container from starving others.
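
To make the cgroup bullet concrete, here is a minimal sketch of the arithmetic behind a CPU limit. It assumes the cgroup v2 cpu.max format ("<quota> <period>", both in microseconds) and the common 100 ms period; the function name cpu_max_for_millicores is hypothetical and not part of any tool.

```python
def cpu_max_for_millicores(millicores: int, period_us: int = 100_000) -> str:
    """Return a cgroup v2 cpu.max style value for a CPU limit in millicores.

    1000 millicores = one full CPU, so the quota is the share of each
    period the container may run before it is throttled.
    """
    if millicores <= 0:
        raise ValueError("millicores must be positive")
    quota_us = millicores * period_us // 1000
    return f"{quota_us} {period_us}"


if __name__ == "__main__":
    # Half a CPU: 50 ms of runtime in every 100 ms window.
    print(cpu_max_for_millicores(500))
```

Under these assumptions a limit of half a CPU means the container's processes can run for 50 ms of every 100 ms window, which is how one container is prevented from starving the others.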

Unlike Virtual Machines, containers share the host OS kernel. This makes them:

  • Start in well under a second (VMs typically take tens of seconds or longer)
  • Use MBs of memory (VMs use GBs)
  • Near-native performance (no hypervisor overhead)

A container image is built in layers. Each filesystem-changing instruction in a Dockerfile (such as COPY and RUN) adds a layer on top of the base image named in FROM. Layers are cached and shared: if 10 services use the same base image, that layer is stored only once on a given host or registry.
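
The layer-sharing claim can be illustrated with a toy content-addressed store. This is a sketch only: LayerStore and demo are hypothetical names, and real registries and runtimes hash compressed layer archives rather than the short byte strings used here.

```python
import hashlib


class LayerStore:
    """Toy content-addressed store: identical layers are kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> layer bytes

    def add_image(self, layers):
        """Store each layer under its SHA-256 digest; return the list of digests."""
        manifest = []
        for content in layers:
            digest = "sha256:" + hashlib.sha256(content).hexdigest()
            self.blobs.setdefault(digest, content)  # no-op if already stored
            manifest.append(digest)
        return manifest


def demo():
    """Ten services that share one base layer and add one layer each."""
    store = LayerStore()
    base = b"base image filesystem"
    manifests = [
        store.add_image([base, f"service-{i} code".encode()]) for i in range(10)
    ]
    return store, manifests


if __name__ == "__main__":
    store, manifests = demo()
    print(len(store.blobs), "blobs stored for", len(manifests), "images")
```

Because a layer is addressed by the hash of its content, the ten images reference the same base digest and only the per-service layers take extra space.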

Interview Answer: "A container is an isolated process that packages an application with all its dependencies using Linux namespaces for isolation and cgroups for resource limits. Unlike VMs, containers share the host kernel, making them lightweight, portable, and fast to start. They're the fundamental deployment unit in cloud-native architectures."
Real-World Usage: Large operators such as Netflix run thousands of containers. Each microservice is packaged as a container image, stored in a registry, and deployed to an orchestrator such as Kubernetes. When a new version is ready, a new image is built and the running containers are replaced, never patched in place. This immutability eliminates configuration drift.

Docker

What is it: Docker is the container platform that popularized containers in 2013. It is a set of tools that build, ship, and run container images: the docker CLI, the dockerd daemon, the BuildKit builder, Docker Desktop, Docker Compose, and Docker Hub (the largest public image registry). Docker didn't invent containers (chroot, Solaris Zones, and Linux's LXC had existed for years), but it gave them a developer-friendly UX, a standardized image format (now the OCI image spec), and a network effect via Docker Hub.
Key features
  • Dockerfile: Build file made of instructions such as FROM, RUN, COPY, and CMD; each step's result is cached, so a rebuild reruns only the first changed step and the steps after it.
  • BuildKit: Modern builder with parallelism, secret mounts, cache mounts, and multi-platform (amd64+arm64) support.
  • Docker Compose: Declarative multi-container local dev via docker-compose.yml — databases, caches, app in one file.
  • Registry protocol: Push/pull to Docker Hub, GHCR, ECR, GAR, Quay, Harbor — all speak the same OCI distribution spec.
  • Volumes and networks: First-class data volumes and user-defined bridge/overlay networks for local orchestration.
How it differs
  • vs Podman: Podman is daemonless, rootless by default, and drop-in compatible (alias docker=podman). Red Hat's answer to the Docker daemon's historical root privilege concerns.
  • vs containerd: containerd is the low-level runtime that Docker itself uses under the hood. Since Kubernetes v1.24 removed dockershim, the kubelet talks to containerd (or another CRI runtime) directly and Docker Engine is no longer in the path.
  • vs CRI-O: CRI-O is a minimal Kubernetes-only runtime from Red Hat, built specifically for the Kubernetes Container Runtime Interface.
  • vs Buildah/Kaniko/Jib: Alternative image builders — Kaniko builds inside a Kubernetes pod without Docker daemon; Jib builds JVM images without a Dockerfile.
Why use it: Docker is the default developer experience for containers: nearly every tutorial, CI system, and cloud provider supports it out of the box. It's ideal for local development (Compose), for CI/CD pipelines (build + push), and for learning containers before moving to Kubernetes. In production, teams increasingly use containerd or CRI-O directly, but Docker remains dominant for building images.
Common gotchas: Kubernetes no longer uses Docker as a runtime (as of v1.24, 2022). This confuses many, but images built with Docker still run fine on K8s because they are OCI-compliant. Docker Desktop requires a paid subscription for companies with 250+ employees or $10M+ in annual revenue (enforced since 2022), pushing some to Podman or Rancher Desktop. The Docker daemon historically ran as root, a security red flag; rootless mode exists but has limitations.
Real-world examples: Nearly every engineering team uses Docker for local dev. GitHub Actions, GitLab CI, CircleCI, and Jenkins all rely on Docker images as the build environment. Docker Hub serves billions of image pulls per month. Docker the company was valued at $2B+ at peak, though its business pivoted several times (initially Swarm/orchestration, now dev tooling).
Simple Definition: Docker is the most popular tool for building and running containers. You write a Dockerfile, build an image from it, and run containers from that image.

Deep Dive:

Docker consists of several components:

  • Docker CLI — Command-line interface (docker build, docker run, docker push)
  • Docker Daemon (dockerd) — Background service that manages images and containers
  • containerd — The actual container runtime that Docker uses internally
  • Dockerfile — A text file with step-by-step instructions to build an image
# Example Dockerfile
# Base image
FROM node:20-alpine
# Set the working directory
WORKDIR /app
# Copy dependency manifests first so the install layer stays cached
COPY package*.json ./
# Install production dependencies only
RUN npm ci --omit=dev
# Copy application code
COPY . .
# Document the listening port
EXPOSE 3000
# Start command
CMD ["node", "server.js"]

Important distinction: Kubernetes dropped Docker as its container runtime in v1.24. But this does NOT mean Docker is dead. Docker is still the standard tool for building images. K8s just uses containerd directly to run them (cutting out the Docker daemon middleman). Your Docker-built images work perfectly on K8s.

Interview Answer: "Docker popularized containers by making them easy to build and use. The core workflow is: write a Dockerfile, build an image, push to a registry, run as a container. While Kubernetes no longer uses Docker as its runtime (it uses containerd directly), Docker remains the primary tool for building container images. The images are OCI-compliant and work everywhere."
Real-World Usage: CI pipelines at companies like Shopify, GitHub, and Uber use Docker to build container images. Developers write Dockerfiles; CI builds and scans the image and pushes it to a registry such as ECR or GAR; Kubernetes then runs it using containerd.

Kubernetes (K8s)

What is it: Kubernetes (abbreviated "K8s" for the eight letters between K and s) is an open-source container orchestration platform that automates the deployment, scaling, networking, and lifecycle management of containerized applications across a cluster of machines. Google built and open-sourced it in 2014, drawing on 10+ years of internal experience running Borg and Omega, and then donated it to the CNCF (Cloud Native Computing Foundation), where it has grown into one of the largest open-source projects, often ranked second only to Linux. At its core, Kubernetes is a declarative control system: you describe the desired state in YAML, and controllers continuously reconcile reality toward that state through a closed feedback loop.
Key features
  • Declarative API: You submit desired state (YAML manifests) to the API server, which persists it in etcd; controllers then make it happen.
  • Self-healing: Crashed containers are restarted, pods on failed nodes are evicted and recreated elsewhere, and containers that fail liveness probes are killed and restarted.
  • Scaling: Scale workloads with a single command or automatically via HPA (horizontal), VPA (vertical), KEDA (event-driven), and Cluster Autoscaler (nodes).
  • Service discovery + load balancing: Built-in DNS (CoreDNS) plus virtual IPs (Services) that survive pod churn.
  • Rolling updates and rollbacks: Zero-downtime deployments with configurable surge/unavailability budgets.
  • Extensibility: CRDs + Operators let you teach Kubernetes about new object types (databases, queues, certificates, cloud resources).
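
The surge and unavailability budgets mentioned under rolling updates come down to simple arithmetic. The sketch below assumes the Deployment defaults of 25% for both maxSurge and maxUnavailable, with percentages rounded up for surge and down for unavailability; rolling_update_bounds is a hypothetical helper, not a Kubernetes API.

```python
def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (max total pods, min available pods) during a rolling update."""

    def resolve(value, round_up):
        # Accept either an absolute count (3) or a percentage string ("25%").
        if isinstance(value, str) and value.endswith("%"):
            scaled = replicas * int(value[:-1])
            return (scaled + 99) // 100 if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    most, least = rolling_update_bounds(10)
    print(f"10 replicas: at most {most} pods exist, at least {least} stay available")
```

Setting max_unavailable to 0 with a positive max_surge keeps full capacity available throughout the rollout, at the cost of temporarily running extra pods.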
How it differs
  • vs Docker Swarm: Simpler to learn but far less capable. Swarm has basically lost the orchestration war — Docker Inc. now promotes K8s.
  • vs HashiCorp Nomad: Nomad is simpler, supports non-container workloads (VMs, raw binaries, Java), but has a much smaller ecosystem.
  • vs AWS ECS: ECS is simpler and deeply AWS-integrated, but locks you to AWS and lacks K8s's declarative extensibility.
  • vs Mesos/Marathon: Mesos was the orchestration leader circa 2015 (Twitter, Airbnb, Apple Siri all used it) but lost to K8s; DC/OS is effectively dead.
  • vs OpenShift: OpenShift is Red Hat's opinionated K8s distribution — adds developer UX, builds, routes, stricter security, paid support.
Why use it: K8s is the de facto standard for running containers at scale. Benefits: portability across clouds (AWS, GCP, Azure, on-prem), a massive ecosystem (Helm charts, operators, service meshes), proven at planet-scale, and a huge talent pool. It enables microservices architectures, zero-downtime deploys, efficient bin-packing of workloads, and self-service developer platforms.
Common gotchas: Kubernetes is famously complex: hundreds of concepts, YAML sprawl, subtle networking and RBAC pitfalls. Debugging requires knowing about kubelet, kube-proxy, CNI, DNS, and more. Running etcd and the control plane yourself is hard (backups, upgrades, certificate rotation), so most teams use managed offerings (EKS, GKE, AKS). Cost can balloon if you oversize resources. Stateful workloads (databases) are trickier than stateless ones.
Real-world examples: Spotify migrated 150+ microservices from its in-house Helios orchestrator to K8s. Pinterest, Airbnb, Shopify, The New York Times, Capital One, CERN, Bloomberg, Reddit, and Tinder all run K8s at large scale. Google offers managed K8s (GKE) while continuing to run Borg internally. Kubernetes underpins the managed offerings of all three major cloud providers: EKS, GKE, and AKS.
Simple Definition: Kubernetes is an open-source platform that automates deploying, scaling, and managing containerized applications across a cluster of machines.

Deep Dive:

Kubernetes was designed by Google based on more than a decade of running production workloads on its internal system, Borg. It was open-sourced in 2014 and is now maintained under the CNCF.

What Kubernetes actually does:

  • Scheduling — Decides which machine runs which container based on resource needs, constraints, and policies
  • Self-healing — Restarts crashed containers, replaces unresponsive Pods, kills containers failing health checks
  • Scaling — Horizontally (more replicas) or vertically (more CPU/memory), automatically or manually
  • Service discovery & load balancing — Gives Pods DNS names, distributes traffic
  • Rolling updates & rollbacks — Deploy new versions with zero downtime, roll back if something breaks
  • Secret & config management — Inject configuration and credentials without baking them into images
  • Storage orchestration — Automatically attach cloud disks, NFS, or other storage to Pods

The Declarative Model — the most important concept:

You tell K8s what you want (desired state in YAML), not how to do it. K8s continuously reconciles actual state with desired state. If you say "I want 5 replicas" and one crashes, K8s creates a new one automatically. This reconciliation loop runs forever.

How the K8s Reconciliation Loop Works
You write YAML → kubectl apply → API Server → stored in etcd → controller watches the API server → compares desired vs actual state → takes action (create/delete Pods) → loop runs forever (self-healing)
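
The loop above can be sketched as a single function that is called over and over. This is a toy model under stated assumptions: real controllers watch the API server and act on many object types, whereas reconcile here (a hypothetical name) only compares a replica count with a list of pod names.

```python
def reconcile(desired_replicas, running_pods, next_id):
    """One pass of a toy controller.

    Returns (pods, next_id, actions), where actions lists what was done
    to move the actual state toward the desired state.
    """
    pods = list(running_pods)
    actions = []
    while len(pods) < desired_replicas:  # too few: create replacements
        name = f"pod-{next_id}"
        next_id += 1
        pods.append(name)
        actions.append(("create", name))
    while len(pods) > desired_replicas:  # too many: remove the extras
        actions.append(("delete", pods.pop()))
    return pods, next_id, actions


if __name__ == "__main__":
    pods, next_id, _ = reconcile(5, [], 0)
    pods.remove("pod-2")  # simulate a crash
    pods, next_id, actions = reconcile(5, pods, next_id)
    print(actions)
```

Running the same function again when nothing has changed produces no actions, which is the property that makes it safe to run the loop forever.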
Interview Answer: "Kubernetes is a container orchestration platform that manages containerized workloads across a cluster of machines. Its core principle is declarative: you define desired state in YAML, and controllers continuously reconcile actual state to match. It handles scheduling, self-healing, scaling, service discovery, rolling updates, and storage orchestration. It was created by Google based on their internal Borg system."
Real-World Usage: Spotify runs 100+ K8s clusters serving 400M+ users. Airbnb migrated from EC2 instances to K8s to achieve consistent deployments across 1000+ services. Pinterest uses K8s to handle 1B+ daily API requests with auto-scaling. Most major tech companies now run at least part of their infrastructure on Kubernetes.

Cluster

What is it: A Kubernetes cluster is the top-level unit of deployment: a set of machines (physical or virtual) organized into a control plane (the brain) and worker nodes (the muscle) that work together as one logical system. The control plane makes global decisions (scheduling, reacting to events, rolling out updates), while worker nodes run the actual application containers via the kubelet agent. A single cluster can span thousands of nodes (the officially supported limits are 5,000 nodes and 150,000 pods) and multiple availability zones, behaving as one unified compute fabric.
Key features
  • Control plane components: kube-apiserver (the front door), etcd (source of truth), kube-scheduler (places pods), kube-controller-manager (runs core controllers), cloud-controller-manager (talks to cloud APIs).
  • Worker node components: kubelet (runs pods), kube-proxy (networking), a container runtime (containerd/CRI-O).
  • Flat networking: Every pod gets its own routable IP, every pod can talk to every other pod without NAT (the "Kubernetes networking model").
  • HA control plane: Production clusters run 3 or 5 control-plane replicas for quorum on etcd.
How it differs
  • vs a single Docker host: A cluster provides scheduling, failover, and multi-node networking — a single host has none of these.
  • vs Nomad cluster: Nomad clusters are simpler to stand up (single binary) but lack K8s's rich object model.
  • vs ECS cluster: An ECS cluster is really just a logical grouping of EC2/Fargate capacity; a K8s cluster is a full API-driven system.
  • vs a Borg cell: Google's internal Borg uses "cells" of 10,000+ machines — K8s was intentionally scoped smaller per cluster, favoring multi-cluster federation.
Why use it: A cluster gives you a single pane of glass for running workloads across many machines. Schedule once, let K8s decide where. Resources are bin-packed efficiently. Failures are recovered automatically. Teams share compute via Namespaces. The cluster becomes the unit of capacity planning, security, and upgrade.
Common gotchas: Upgrading a cluster in place is risky, so many teams use blue-green clusters (spin up the new version, migrate workloads, tear down the old one). etcd backups are critical and often neglected. Noisy neighbors can starve other tenants without resource limits. Networking issues across zones or clusters are hard to debug. Single-cluster blast radius is real: many orgs run 10-50 clusters instead of one giant cluster for isolation.
Real-world examples: Alibaba runs clusters with 10,000+ nodes for Double 11 shopping events. OpenAI scaled K8s to 7,500 nodes to train large models. JD.com reported 30,000-node K8s deployments. Typical enterprise: 20-200 clusters across dev/staging/prod and regions, managed via tools like Cluster API, Rancher, or Anthos.
Simple Definition: A cluster is the entire Kubernetes deployment — all the machines (nodes) working together, managed by a control plane.

Deep Dive:

A cluster has two types of machines:

Control Plane Nodes (Masters):

  • Run the "brain" of Kubernetes — API Server, etcd, Scheduler, Controller Manager
  • Typically 3 or 5 nodes for high availability (an odd count, because adding an even-numbered member does not raise etcd's fault tolerance)
  • Should NOT run application workloads in production
  • In managed K8s (EKS, GKE, AKS), the cloud provider manages these entirely
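
The preference for 3 or 5 control plane nodes follows from majority-quorum arithmetic. A minimal sketch, assuming the usual majority rule for a consensus-based store such as etcd; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} members: quorum {quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure, so the fourth member adds cost without adding resilience.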

Worker Nodes:

  • Run your actual application Pods
  • Each runs: kubelet (agent), kube-proxy (networking), container runtime (containerd)
  • Can be physical servers, VMs, or cloud instances
  • Can be added/removed dynamically (auto-scaling)
  • Can have different sizes (mix of large and small instances)
Interview Answer: "A Kubernetes cluster consists of control plane nodes (running API Server, etcd, Scheduler, Controller Manager) and worker nodes (running kubelet, kube-proxy, and containerd). The control plane makes decisions about the cluster, while workers run the actual workloads. For HA, you run 3 or 5 control plane nodes. Managed services like EKS/GKE handle the control plane for you."
Real-World Usage: A typical startup runs 1-3 clusters (dev, staging, prod) on EKS/GKE with 5-50 worker nodes each. Large enterprises like PayPal run hundreds of clusters across multiple regions. The trend is multi-cluster architectures where each cluster has a specific purpose or serves a specific region.

Node

What is it: A Node is a single worker machine (a physical server, a VM, or a cloud instance) that runs pods. Each node runs three essential services: the kubelet (which talks to the API server, receives pod specs, and supervises containers via the container runtime), kube-proxy (which implements Service networking using iptables or IPVS, unless an eBPF-based CNI replaces it), and a container runtime such as containerd or CRI-O. Nodes register themselves with the control plane and report health, capacity, and running pods via periodic heartbeats.
Key features
  • Capacity and allocatable resources: Each node advertises CPU, memory, ephemeral storage, and GPUs; the scheduler uses this for bin-packing.
  • Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable — reported by kubelet.
  • Labels and taints: Labels (zone=us-east-1a, gpu=nvidia-a100) steer workloads; taints repel them unless pods have matching tolerations.
  • Drains and cordons: kubectl drain safely evicts pods; cordon marks a node unschedulable for maintenance.
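
How taints repel pods that lack a matching toleration can be sketched in a few lines. This is a simplified model, not the scheduler's code: the Taint and Toleration classes are hypothetical stand-ins for the API objects, and only the Equal and Exists operators are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str = ""
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with Exists tolerates every taint
    operator: str = "Equal"   # "Equal" or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """True if this toleration matches the taint's effect, key, and value."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.key != "" and toleration.value == taint.value


def can_schedule(taints, tolerations) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


if __name__ == "__main__":
    gpu = Taint("gpu", "nvidia-a100")
    print(can_schedule([gpu], []))
    print(can_schedule([gpu], [Toleration("gpu", "Equal", "nvidia-a100")]))
```

Note that a toleration only permits placement on a tainted node; steering a pod toward that node still requires labels with a node selector or affinity.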
How it differs
  • vs a VM in isolation: A node is a fungible unit of capacity — K8s expects nodes to come and go (spot instances, autoscaling). Traditional VMs are pets, nodes are cattle.
  • vs a Nomad client: Functionally similar — both register with a control plane and run scheduled work.
  • vs an ECS container instance: Same concept, AWS-specific terminology.
Why use it: Nodes abstract hardware from workloads. You no longer care which server runs your app: you declare "I need 3 replicas" and K8s picks nodes that satisfy resource, label, taint, and affinity constraints. Adding/removing capacity is painless: nodes autoscale (Cluster Autoscaler, Karpenter) in response to pod demand.
Common gotchas: When a node goes NotReady, its pods are not rescheduled immediately: there is a grace period (default 5 min) controlled by tolerationSeconds on the pods' not-ready and unreachable tolerations. Kubelet bugs can orphan containers. Running too many pods per node (default limit: 110 per node) causes IP exhaustion and scheduling issues. Node upgrades require draining, which can break apps without proper PodDisruptionBudgets. Disk pressure from log volume is a classic production incident.
Real-world examples: At Spotify, nodes are grouped into "node pools" by workload class (batch, online, GPU). Lyft uses mixed instance types per node pool to exploit spot market pricing. Uber uses Peloton (its internal scheduler, since open-sourced), but the node concept is similar. Managed K8s services (EKS/GKE/AKS) abstract node provisioning: GKE Autopilot hides nodes entirely, billing you per pod instead.
Simple Definition: A node is a single machine (physical or virtual) in a Kubernetes cluster that runs containerized workloads.

Deep Dive:

Every worker node runs three essential components:

  • kubelet — The agent that communicates with the control plane. It receives Pod specifications and ensures the described containers are running and healthy.
  • kube-proxy — Maintains network rules for Service routing. Implements iptables or IPVS rules.
  • Container runtime — The software that actually runs containers (containerd or CRI-O).

Node lifecycle:

  • Registration: The node joins the cluster and registers with the API server
  • Heartbeat: kubelet renews a Lease object in the kube-node-lease namespace
  • NotReady: If heartbeats stop, the node is marked NotReady after about 40s (the long-standing default; the exact value varies by release)
  • Eviction: After 5 minutes of NotReady, Pods are evicted and their controllers recreate them on other nodes
  • Drain: kubectl drain gracefully evicts all Pods before maintenance
  • Cordon: kubectl cordon marks a node as unschedulable without evicting existing Pods
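
The heartbeat, NotReady, and eviction steps form a simple timeline. The sketch below is a simplified model that assumes the 40s grace period and 5-minute eviction delay quoted above and measures everything from the last heartbeat; node_state is a hypothetical helper.

```python
def node_state(seconds_since_heartbeat, grace_period_s=40, eviction_after_s=300):
    """Classify a node by how long ago its last heartbeat arrived.

    Simplification: eviction is assumed to start eviction_after_s seconds
    after the node was marked NotReady.
    """
    if seconds_since_heartbeat < grace_period_s:
        return "Ready"
    if seconds_since_heartbeat < grace_period_s + eviction_after_s:
        return "NotReady"
    return "NotReady, Pods evicted"


if __name__ == "__main__":
    for elapsed in (10, 60, 400):
        print(f"{elapsed:>3}s since heartbeat: {node_state(elapsed)}")
```

With these defaults a workload on a failed node can be unavailable for several minutes, which is why latency-sensitive services often run multiple replicas across nodes.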
Interview Answer: "A node is a machine in a K8s cluster running kubelet, kube-proxy, and a container runtime. Kubelet ensures Pods are running as specified, kube-proxy handles networking, and the runtime (containerd) runs containers. Nodes send heartbeats to the control plane. If a node goes down, its Pods are automatically rescheduled. You can drain a node for maintenance or cordon it to prevent new scheduling."

Pod

What is it: A Pod is the smallest deployable unit in Kubernetes: not a container, but a wrapper around one or more tightly coupled containers that share the same network namespace (same IP, same localhost, same port space), the same IPC namespace, and optionally volumes. Containers in a pod are co-scheduled on the same node and treated as a single logical "application instance." A pod gets its own cluster-internal IP from the CNI plugin and is ephemeral by design: the kubelet may restart its containers in place, but once the pod itself is gone it is never resurrected; the controlling workload creates a new pod with a new IP in its place.
Key features
  • Shared network: All containers in a pod share localhost — a sidecar proxy on port 15001 can intercept the main app's traffic on port 8080 transparently.
  • Init containers: Run to completion sequentially before app containers start — used for setup (DB schema migrations, secret fetching, permission fixes).
  • Sidecar pattern: Helper containers running alongside the main app (log shippers, proxies, secret rotators). Native sidecars became beta and enabled by default in K8s 1.29, and reached stable in 1.33.
  • Lifecycle hooks: postStart and preStop run code when containers start/stop for graceful handling.
  • Restart policies: Always (the default, and the only value Deployments allow), OnFailure (Jobs), Never (run-once debugging).
How it differs
  • vs a container: A pod is a logical host; containers within share resources like processes inside the same VM.
  • vs a Docker Compose service: Compose runs containers on one host but doesn't group them into a shared-network unit the way pods do.
  • vs Nomad task group: Nomad's "task group" is the closest analogue — a co-located set of tasks.
  • vs an ECS task: Very similar concept — ECS tasks also group containers with shared network.
Why use it: The pod abstraction supports multi-container patterns that don't make sense as a single container: a web server + log sidecar, an app + a service mesh envoy proxy, a main process + a secret injector. It's also the scheduling unit: the scheduler finds one node with enough resources for all containers in the pod.
Common gotchas: A pod's IP is not stable, so always reach pods through Services, never raw IPs. Deleting a managed pod directly only makes its controller create a replacement; to remove the workload, delete or scale down the controlling Deployment. Pods can be stuck in Terminating if finalizers hang. Cross-container communication within a pod uses localhost, not the container name. The scheduler places a pod using its effective request: the sum of its app containers' requests, or the largest init container request if that is higher.
Real-world examples: Istio's sidecar injection is the canonical multi-container pod use case: every pod gets an istio-proxy envoy container. Vault Agent injector adds a sidecar that pulls secrets. Fluent Bit DaemonSet ships logs. Netflix and Airbnb run tens of thousands of pods per cluster at peak traffic. A typical mesh-enabled microservice pod has 2-3 containers (app + sidecar + init).
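
The pod-level request the scheduler works with can be computed by hand. A minimal sketch, assuming classic init containers that run one at a time before the app containers start (restartable sidecar init containers follow a different rule); effective_request is a hypothetical name and the values are CPU millicores.

```python
def effective_request(app_containers, init_containers=()):
    """Effective pod request: max(sum of app requests, largest init request).

    App containers run together, so their requests add up. Init containers
    run one at a time before the app starts, so only the largest matters.
    """
    return max(sum(app_containers), max(init_containers, default=0))


if __name__ == "__main__":
    # App (250m) + sidecar (100m), with a migration init container (500m).
    print(effective_request([250, 100], [500]))
```

A heavy init container can therefore make a pod harder to schedule than its steady-state containers suggest.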
Simple Definition: A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share the same network (IP address) and storage.

Deep Dive:

Pod Lifecycle Flow
Pending → Init Containers → Running
↓ termination signal
Removed from Service endpoints → preStop hook → SIGTERM → grace period (30s default) → SIGKILL
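
The termination half of the flow can be modeled as a small timeline. This sketch makes simplifying assumptions: the grace period countdown starts when deletion begins (so preStop time counts against it), SIGTERM follows the preStop hook, and SIGKILL arrives exactly when the grace period ends. TerminationResult and simulate_termination are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TerminationResult:
    sigterm_at: float  # seconds after deletion when SIGTERM is sent
    exited_at: float   # seconds after deletion when the container is gone
    killed: bool       # True if SIGKILL was needed


def simulate_termination(pre_stop_s, shutdown_s, grace_period_s=30.0):
    """Model preStop -> SIGTERM -> clean exit, or SIGKILL at the deadline."""
    sigterm_at = min(pre_stop_s, grace_period_s)
    clean_exit_at = sigterm_at + shutdown_s
    if pre_stop_s < grace_period_s and clean_exit_at <= grace_period_s:
        return TerminationResult(sigterm_at, clean_exit_at, killed=False)
    return TerminationResult(sigterm_at, grace_period_s, killed=True)


if __name__ == "__main__":
    print(simulate_termination(pre_stop_s=5, shutdown_s=10))
    print(simulate_termination(pre_stop_s=5, shutdown_s=40))
```

In this model an app that needs 40 seconds to drain is killed mid-shutdown unless terminationGracePeriodSeconds is raised to cover both the preStop hook and the drain time.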

Key properties:

  • Shared network — All containers in a Pod share one IP address and port space. They talk to each other via localhost
  • Shared storage — Containers can mount the same volumes to share files
  • Co-scheduled — All containers in a Pod always run on the same node
  • Ephemeral — Pods are disposable. Created, destroyed, and replaced, never "repaired"

Multi-container patterns:

  • Sidecar — Helper container (Envoy proxy, log shipper, monitoring agent)
  • Init Container — Runs before the main container to set up prerequisites
  • Ambassador — Proxy for outbound connections
  • Adapter — Transforms output (log format conversion)
Interview Answer: "A Pod is the smallest deployable unit in K8s, wrapping one or more tightly coupled containers that share network and storage. Pods are ephemeral — designed to be replaced, not repaired. The most common pattern is one container per Pod. Multi-container Pods are used for sidecars (proxy, logging) and init containers (setup tasks). You manage Pods through higher-level resources like Deployments."
Real-World Usage: At Lyft, each microservice Pod has a main application container plus an Envoy sidecar for service mesh traffic. Init containers wait for database migrations to complete before the app starts. When Istio is enabled, a sidecar proxy is automatically injected into every Pod to handle mTLS and traffic management.

Namespace

What is it: A Namespace is a virtual sub-cluster: a way to partition a single physical cluster into multiple logical environments. It provides a scope for names (two deployments can both be named api if they live in different namespaces) and is the natural unit for RBAC, resource quotas, network policies, and billing chargebacks. Every K8s cluster ships with four default namespaces: default, kube-system (control plane components), kube-public (cluster info readable by all), and kube-node-lease (node heartbeats).
Key features
  • Name scoping: Namespaced objects (Pods, Services, Deployments, ConfigMaps, Secrets) live in exactly one namespace.
  • Cluster-scoped exceptions: Nodes, PersistentVolumes, StorageClasses, CRDs, ClusterRoles are NOT namespaced — they span the whole cluster.
  • DNS convention: Services are reachable at svc-name.namespace.svc.cluster.local.
  • ResourceQuotas: Cap CPU/memory/storage/object counts per namespace.
  • LimitRange: Default or cap container-level requests/limits to prevent runaway pods.
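
The DNS convention above can be sketched as string construction. The code assumes the default cluster domain cluster.local and the usual pod DNS search path, under which a namespace-qualified short name also resolves; both function names are hypothetical.

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Fully qualified DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def names_resolvable_from(service, service_ns, caller_ns):
    """DNS names a pod in caller_ns can use to reach the Service."""
    names = [service_fqdn(service, service_ns), f"{service}.{service_ns}"]
    if caller_ns == service_ns:
        # The bare name only resolves from inside the same namespace.
        names.append(service)
    return names


if __name__ == "__main__":
    print(names_resolvable_from("api", "team-payments", "team-search"))
```

This is why a hard-coded bare Service name works in one namespace and fails when the same manifest is deployed into another.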
How it differs
  • vs a separate cluster: Namespaces share the same control plane, nodes, and network — cheaper but weaker isolation. For hard multi-tenancy, use separate clusters or vClusters.
  • vs Linux namespaces: Unrelated — Linux namespaces isolate processes; K8s namespaces are an API-object partition.
  • vs OpenShift projects: OpenShift "projects" are namespaces with extra metadata and a default quota/RBAC bundle.
Why use it: Namespaces enable multi-tenancy within a cluster: one namespace per team (team-payments, team-search), per environment (dev, staging, prod), or per customer. They provide the boundary for RBAC (who can do what), resource quotas (how much a team can consume), and network policies (who can talk to whom). Without namespaces, a cluster becomes a free-for-all.
Common gotchas: Namespaces are not a security boundary by default: a pod in namespace A can typically reach pods in namespace B unless you apply NetworkPolicies. Deleting a namespace cascades to all resources inside it, which can be catastrophic. kubectl operates on the default namespace unless you pass -n or change the namespace in your context. Cross-namespace references (e.g., a Service in another namespace) need the namespace-qualified name (svc-name.namespace) or the full FQDN.
Real-world examples: At Pinterest, each team gets a namespace with quotas enforced by a custom admission webhook. Shopify uses namespaces to separate apps like storefront, checkout, and shipping. Capital One uses namespaces per application for PCI compliance boundaries. Tooling like Hierarchical Namespaces (HNC) and vCluster extends namespaces toward deeper tenancy.
Simple Definition: A namespace is a virtual partition within a cluster that isolates resources. It's like a folder that separates different teams, projects, or environments.

Deep Dive:

Default namespaces: default, kube-system, kube-public, kube-node-lease.

What namespaces scope: Pods, Services, Deployments, ConfigMaps, Secrets, Roles, ServiceAccounts, PVCs.

What namespaces DON'T scope (cluster-wide): Nodes, PersistentVolumes, ClusterRoles, StorageClasses, Namespaces themselves.

Common strategies: per-team (team-payments), per-app (app-checkout), per-environment (staging), or hybrid.

Governance tools: ResourceQuota (cap resources), LimitRange (set defaults), NetworkPolicy (firewall), RBAC (access control).

Interview Answer: "Namespaces partition a cluster for resource isolation. They scope most resources (Pods, Services) but not cluster-wide ones (Nodes, PVs). You apply RBAC, ResourceQuotas, NetworkPolicies per namespace. Common strategies: per-team or per-application."
Real-World Usage: At Stripe, each team gets their own namespace with pre-configured RBAC, ResourceQuotas, and NetworkPolicies. A namespace provisioning controller automates this when a new team onboards.

Labels, Selectors & Annotations

What is it: Labels are key/value pairs attached to Kubernetes objects (app=payments, env=prod, version=v2) that are queryable: selectors use them to group or find objects. Selectors are the query syntax: equality (app=payments), set-based (env in (prod, staging)), or existence (canary to require a key, !canary to require its absence). Annotations are also key/value but not queryable; they carry descriptive metadata (build hash, Git commit, owner email, last-modified timestamp) consumed by tools, controllers, and humans rather than selectors.
Key features
  • Labels glue workloads together: A Service uses a label selector to find its backing pods; a Deployment uses one to own its ReplicaSet pods.
  • Multi-dimensional: You can slice across any axis — tier, release, owner, canary, shard — without rigid hierarchies.
  • Standard labels: K8s recommends a set of common labels: app.kubernetes.io/name, app.kubernetes.io/instance, app.kubernetes.io/version, app.kubernetes.io/part-of, app.kubernetes.io/managed-by.
  • Annotations for tools: kubectl.kubernetes.io/last-applied-configuration, prometheus.io/scrape: "true", cert-manager.io/cluster-issuer.
How it differs
  • Labels vs Annotations: Labels are queryable and used by the control plane for grouping; annotations are free-form metadata for tools and humans only. Never put a cert or ConfigMap key in a label.
  • Labels vs Tags (AWS/GCP): Conceptually similar — but K8s labels drive actual scheduling and selection, whereas cloud tags are mostly for billing/organization.
  • Selectors vs Nomad constraints: Nomad uses HCL constraints for placement; K8s uses label selectors throughout the API.
Why use it: Labels and selectors are the universal glue of Kubernetes. They power Services finding pods, Deployments managing pods, NetworkPolicies targeting pods, HPAs scaling deployments, PodDisruptionBudgets protecting groups. Good label hygiene is the difference between a cluster you can query/debug and one that's a black box. Annotations let operators and tools attach metadata without polluting the selector namespace.
Common gotchas: A label key has an optional DNS-subdomain prefix (up to 253 characters) followed by a slash and a name of at most 63 characters; values are also limited to 63 characters. kubectl label --overwrite is needed to change existing labels. Mismatched selector vs pod labels silently produce empty Services (no endpoints). Renaming labels on a live Deployment can cause orphaned ReplicaSets. Labels should be stable: volatile values (timestamps) in pod template labels change the template on every apply and trigger a new rollout each time.
Real-world examples: Spotify labels every pod with squad=... (team) and cost-center=... to attribute infra spend. GitHub uses labels like shard=1..64 to implement horizontal sharding. cert-manager, Prometheus Operator, and Istio all lean heavily on labels for discovery and annotations for configuration.
Simple Definition: Labels are key-value tags for identifying and grouping resources. Selectors query by labels. Annotations store non-identifying metadata.

Deep Dive:

Labels are the glue of K8s. A Service finds Pods, a Deployment manages ReplicaSets, NetworkPolicies target Pods — all through label selectors.

metadata:
  labels:
    app: payment-service
    team: payments
    env: production

Selector types: Equality-based (app = my-api) and Set-based (env in (production, staging)).

Annotations store larger metadata: build timestamps, Git SHAs, monitoring config (prometheus.io/scrape: "true").
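To make the two selector types concrete, here is a minimal sketch of a Service using an equality-based selector and a Deployment using a set-based one. The image name, port numbers, and the example.com/git-sha annotation key are assumptions made for illustration.

```yaml
# Service: equality-based selector (a plain map; every pair must match).
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service
    env: production
  ports:
  - port: 80
    targetPort: 8080            # assumed container port
---
# Deployment: matchLabels and matchExpressions are ANDed together.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  annotations:
    example.com/git-sha: "0000000"   # hypothetical annotation key, placeholder value
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
    matchExpressions:
    - { key: env, operator: In, values: [production, staging] }
    - { key: canary, operator: DoesNotExist }
  template:
    metadata:
      labels:                   # must satisfy the selector above
        app: payment-service
        team: payments
        env: production
    spec:
      containers:
      - name: app
        image: payment-service:1.0.0   # hypothetical image
        ports:
        - containerPort: 8080
```

The same set-based query works on the command line, for example kubectl get pods -l 'env in (production,staging),!canary'.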

Interview Answer: "Labels are key-value pairs for identification and selection. Services find Pods, Deployments manage ReplicaSets through selectors. Annotations store non-selecting metadata. A consistent labeling convention is essential for cost tracking, monitoring, and governance."
Real-World Usage: Kubecost uses team: payments labels to attribute costs. Prometheus discovers scrape targets via annotations. Kyverno enforces that every Deployment must have team and app labels.

kubectl

What is it: kubectl (pronounced "kube-control," "kube-C-T-L," or "kube-cuddle") is the official command-line interface to the Kubernetes API. It reads a kubeconfig file (default: ~/.kube/config) containing cluster endpoints and credentials, translates user commands into HTTPS requests against the kube-apiserver, and formats responses into tables, YAML, or JSON. Every operation you can do through a dashboard, Helm chart, or CI pipeline ultimately hits the same REST API that kubectl hits, which makes it the universal debugging and operating tool for K8s.
Key features
  • Imperative commands: kubectl run, kubectl expose, kubectl scale, kubectl delete for quick actions.
  • Declarative apply: kubectl apply -f submits YAML manifests and uses a three-way merge to reconcile state.
  • Debugging: kubectl logs, kubectl exec, kubectl describe, kubectl port-forward, kubectl debug.
  • Context switching: kubectl config use-context jumps between clusters; tools like kubectx/kubens make this faster.
  • Plugins: Krew (kubectl krew) is the plugin manager; hundreds of plugins exist: neat, tree, who-can, stern, rakkess, view-secret.
How it differs
  • vs helm: Helm is a package manager that generates YAML and then calls the same API; kubectl is the raw interface.
  • vs k9s: k9s is a terminal UI that talks to the same API and shows live resources, which is often faster for interactive debugging.
  • vs client libraries: Go/Python/Java/Rust clients talk the same API directly, useful for building controllers and operators.
  • vs Docker CLI: Docker CLI talks to dockerd on one host; kubectl talks to a whole cluster API.
Why use it: kubectl is the lingua franca of K8s operations. Every tutorial assumes it, every incident response starts with kubectl get pods -A, every CI pipeline eventually shells out to it. Mastering kubectl (and its JSONPath output, selectors, custom columns) is the single biggest productivity boost for anyone working with Kubernetes.
Common gotchas: kubectl apply and kubectl create behave differently on re-runs: create fails if the object already exists, while apply updates it. Forgetting -n namespace is the #1 source of "why doesn't my command find anything." Running a command against the wrong context on a prod cluster is a classic disaster; use tools like kube-ps1 or kubie to show the active context in your prompt. Imperative edits (kubectl edit) diverge from Git-tracked state, so prefer GitOps. Version skew: kubectl should be within ±1 minor version of the cluster.
Real-world examples: Every SRE team has a shared collection of kubectl one-liners: kubectl top pods for CPU/RAM, kubectl get events --sort-by=.lastTimestamp for recent cluster activity, kubectl rollout history for audit, kubectl cp for file transfer. Most team wikis contain "kubectl cheatsheets" that grow into hundreds of entries over time.
Simple Definition: kubectl is the command-line tool for interacting with Kubernetes clusters. It sends requests to the API server.
# Viewing
kubectl get pods -A                       # All namespaces
kubectl describe pod my-pod               # Detailed info
# Debugging
kubectl logs my-pod -f                    # Stream logs
kubectl exec -it my-pod -- /bin/sh        # Shell in
kubectl debug -it my-pod --image=busybox  # Ephemeral debug
# Managing
kubectl apply -f manifest.yaml            # Create/update
kubectl rollout undo deploy/my-app        # Rollback
kubectl scale deploy my-app --replicas=5  # Scale
# Networking
kubectl port-forward svc/my-app 8080:80   # Local tunnel
# Context
kubectl config use-context prod           # Switch cluster
Interview Answer: "kubectl is the K8s CLI. Key commands: get/describe for viewing, logs/exec for debugging, apply/delete for managing, rollout for deployments, port-forward for networking. In production, direct kubectl should be limited — GitOps tools handle deployments."
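The kubeconfig file that kubectl reads, and that kubectl config use-context switches within, has roughly the following shape. This is a sketch under assumptions: the server URL, file paths, and all names are placeholders, and real files often use embedded certificate data or an exec-based credential plugin instead of file paths.

```yaml
apiVersion: v1
kind: Config
current-context: staging          # the context kubectl uses by default
clusters:
- name: staging-cluster
  cluster:
    server: https://staging.example.com:6443   # hypothetical API server endpoint
    certificate-authority: /path/to/ca.crt     # placeholder path
users:
- name: dev-user
  user:
    client-certificate: /path/to/client.crt    # placeholder path
    client-key: /path/to/client.key            # placeholder path
contexts:
- name: staging
  context:                        # a context ties a cluster, a user, and a default namespace together
    cluster: staging-cluster
    user: dev-user
    namespace: team-payments
```

kubectl config get-contexts lists the contexts defined here, and kubectl config set-context --current --namespace=team-payments changes the default namespace so that -n is not needed on every command.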

Workloads

The resources that run your applications — Deployments, StatefulSets, DaemonSets, Jobs, and CronJobs.

Deployment & ReplicaSet

What is it: A Deployment is the workhorse workload controller for stateless applications. You declare "I want N replicas of this pod template at this image version," and the Deployment controller creates and manages an underlying ReplicaSet (the low-level object that actually ensures N pods exist). During updates, the Deployment creates a new ReplicaSet with the new pod template and gradually scales it up while scaling the old one down. This is the rolling update. The old ReplicaSet is retained (up to revisionHistoryLimit, default 10) so you can kubectl rollout undo to any prior version.
Key features
  • Rolling update strategy: maxSurge (how many extra pods during update) and maxUnavailable (how many pods can be down) control cadence.
  • Recreate strategy: Kill all old pods first, then start new — for apps that can't run two versions simultaneously.
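  • Rollback: kubectl rollout undo deployment/foo rolls back to the previous revision by scaling its retained ReplicaSet back up; the rollback is itself a rolling update, not an instant switch.
  • Pause and resume: Stage multiple changes while paused, then roll them out together via kubectl rollout resume.
  • Progress deadline: If the rollout makes no progress within progressDeadlineSeconds (default 600), the Deployment's Progressing condition turns False (reason ProgressDeadlineExceeded), which you can alert on.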
How it differs
  • vs ReplicaSet alone: ReplicaSet only maintains pod count — it has no update strategy. You almost never create ReplicaSets directly.
  • vs StatefulSet: StatefulSets give stable identities and ordered rollout; Deployments treat all pods as interchangeable.
  • vs DaemonSet: DaemonSets run one pod per node; Deployments run N pods anywhere.
  • vs the old ReplicationController: Deployments (introduced in 2016) superseded ReplicationControllers; RCs still exist in the API but are considered legacy.
Why use it: Deployments are the default choice for any stateless microservice: web servers, APIs, workers, gRPC services, background processors. They give you declarative desired state, rolling updates, one-command rollback, and horizontal scaling with a single YAML file. Combined with Services, HPAs, and Ingresses, they make up the bulk of typical cluster workloads.
Common gotchas: The selector of an apps/v1 Deployment is immutable; changing it is rejected, so you must delete and recreate the Deployment. A failed rollout (CrashLoopBackOff) leaves the cluster in a mixed state, so always pair rollouts with proper health probes and a progress deadline. kubectl apply on a Deployment doesn't restart pods unless the pod template changes; use kubectl rollout restart to force a rollover. Two Deployments with overlapping selectors will fight each other.
Real-world examples: Spotify, Shopify, Airbnb run thousands of Deployment objects per cluster. GitHub uses Deployments as the target of their custom deployment-service canary tooling. Tooling like Argo Rollouts extends Deployments with progressive delivery (canary, blue-green, analysis-driven rollouts). A typical mid-size company has 200-2000 Deployment objects in production.
Simple Definition: A Deployment manages identical Pod replicas with rolling updates, rollbacks, and scaling. ReplicaSet is the mechanism underneath.
Deployment → ReplicaSet → Pods chain:
Deployment → ReplicaSet (v2) → Pod, Pod, Pod
↓ on update
Deployment → ReplicaSet (v3, scaling up) → new Pods
The old ReplicaSet scales down while the new one scales up; that is the rolling update.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: api
        image: payment-api:2.1.0
        resources:
          requests: { cpu: "250m", memory: "256Mi" }
          limits: { cpu: "1", memory: "1Gi" }

Rolling update: New ReplicaSet created → new Pods start → pass readiness probes → old Pods terminated. maxSurge: 1 = at most 4 total. maxUnavailable: 0 = always 3 ready.

Rollback: kubectl rollout undo deployment/payment-api (K8s keeps 10 revisions by default).
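The rolling update described above depends on readiness probes, which the manifest earlier in this section does not define. A minimal sketch of the rollout-related fields follows; the port, the /health path, and the probe timings are assumptions, while the revisionHistoryLimit and progressDeadlineSeconds values shown are the Kubernetes defaults.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  revisionHistoryLimit: 10        # default; old ReplicaSets kept for rollback
  progressDeadlineSeconds: 600    # default; a stalled rollout is flagged after this
  minReadySeconds: 10             # a pod must stay ready this long to count as available
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
      - name: api
        image: payment-api:2.1.0
        ports:
        - containerPort: 8080     # assumed port
        readinessProbe:           # old pods are only removed once new ones pass this
          httpGet:
            path: /health         # assumed endpoint
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
```

kubectl rollout status deployment/payment-api follows the rollout until it completes or the deadline is exceeded, and kubectl rollout undo deployment/payment-api --to-revision=2 returns to a specific entry from kubectl rollout history.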

Interview Answer: "A Deployment manages ReplicaSets which manage Pods. Rolling updates are controlled by maxSurge/maxUnavailable. New Pods must pass readiness probes before old ones terminate. Keeps revision history for rollbacks. Apps must be stateless — any replica handles any request."
Real-World Usage: Every stateless microservice runs as a Deployment. Typical config: 3-10 replicas, maxSurge=25% maxUnavailable=25%, VPA-recommended requests, readiness probes on /health.

StatefulSet

What is it: A StatefulSet is the workload controller for stateful applications that need stable, unique identities and persistent storage per replica. Unlike Deployments (where pods are interchangeable cattle), StatefulSet pods get predictable names (mysql-0, mysql-1, mysql-2), stable DNS (mysql-0.mysql.default.svc.cluster.local), and ordinal-indexed PersistentVolumeClaims that persist across pod restarts and rescheduling. Pods are created, updated, and deleted in order (mysql-0 before mysql-1) to support bootstrap protocols like leader election, replica initialization, and quorum-based systems.
Key features
  • Stable network identity: Each pod has a fixed DNS name backed by a Headless Service.
  • Stable storage: volumeClaimTemplates auto-create a dedicated PVC per pod, retained even if the pod is deleted.
  • Ordered deployment: Pods come up 0, 1, 2... and shut down in reverse.
  • Rolling updates with partitions: Canary updates by setting partition: N — only pods with ordinal >= N get the new version.
How it differs
  • vs Deployment: Deployments are for stateless services; StatefulSets are for databases, queues, and consensus systems.
  • vs running a DB on a VM: StatefulSets give you the automation of K8s (scaling, upgrades, self-healing) but add complexity — many teams still prefer managed DBs (RDS, Cloud SQL, Aurora) instead.
  • vs Operators: A StatefulSet alone doesn't know how to run a database safely — Operators (PostgreSQL Operator, MongoDB Operator, Vitess) wrap StatefulSets with domain logic.
Why use it: StatefulSets power databases, message brokers, and any system where replica identity matters: MySQL, PostgreSQL, MongoDB, Cassandra, Kafka, Elasticsearch, ZooKeeper, etcd, Redis (with replicas), RabbitMQ. They're essential for sharded systems where each shard must stick to its data volume.
Common gotchas: Deleting a StatefulSet does not delete the PVCs; that's a feature (data preservation) but can surprise you. Scaling down is orderly but rescheduling failures can leave a pod stuck in Pending. StatefulSets alone don't handle backup, failover, or split-brain, so you need an Operator or manual ops. Upgrades restart pods one at a time, so a bad version can stall partway through a rollout; always test in staging.
Real-world examples: Wikimedia runs MediaWiki's databases via StatefulSets. Lyft runs Envoy control plane components in StatefulSets. Zalando built the Postgres Operator which uses StatefulSets under the hood for 1000+ Postgres clusters. Strimzi (Kafka Operator) originally managed Kafka brokers via StatefulSets; newer releases use its own StrimziPodSet resource instead.
Simple Definition: Like a Deployment but for apps needing stable identity and persistent storage — databases, Kafka, ZooKeeper.

What makes it special:

  • Stable Pod names: mysql-0, mysql-1, mysql-2
  • Stable DNS: mysql-0.mysql-headless.default.svc.cluster.local
  • Persistent storage: Each Pod gets its own PVC that survives restarts
  • Ordered operations: Created 0→1→2, deleted 2→1→0

Requires a headless Service (clusterIP: None) and volumeClaimTemplates for per-Pod storage.
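The two requirements above can be sketched together as follows. This is a structural illustration only: it does not configure MySQL replication, the Secret name is hypothetical, and the storage size and partition value are arbitrary assumptions.

```yaml
# Headless Service: gives each pod a DNS record such as
# mysql-0.mysql-headless.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None
  selector:
    app: mysql
  ports:
  - name: mysql
    port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql-headless     # must name the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2                # only ordinals >= 2 (mysql-2) receive template changes
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef: { name: mysql-root, key: password }   # hypothetical Secret
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:           # one PVC per pod: data-mysql-0, data-mysql-1, ...
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi           # assumed size
```

Lowering partition step by step (2, then 1, then 0) rolls the new template out one ordinal at a time, which is the canary pattern described in the key features.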

Interview Answer: "StatefulSets provide stable names (ordinal indices), stable DNS (via headless Service), persistent storage (PVC per Pod), and ordered deployment. Used for databases, Kafka, ZooKeeper. Each Pod gets its own PVC through volumeClaimTemplates."
Real-World Usage: LinkedIn runs Kafka on StatefulSets using Strimzi Operator. Each broker (kafka-0, kafka-1, kafka-2) has stable identity and persistent volume. CloudNativePG manages PostgreSQL StatefulSets with automated failover.

DaemonSet

What is it: A DaemonSet ensures that a copy of a specific pod runs on every node (or a filtered subset via nodeSelector/affinity). When a new node joins the cluster, the DaemonSet controller automatically schedules its pod there; when a node leaves, the pod is garbage collected. DaemonSets are used for infrastructure agents that must be present on every host: log shippers, metric collectors, network plugins, storage drivers, CNI agents, and security scanners.
Key features
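  • One-per-node guarantee: The DaemonSet controller creates one pod per eligible node and pins it to that node with node affinity; the default scheduler then binds it (this has been the default since Kubernetes 1.12; earlier versions bypassed the scheduler).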
  • Toleration of taints: DaemonSets often tolerate NoSchedule taints so they run even on control-plane nodes.
  • HostPort/HostNetwork: Many DaemonSets use host networking to intercept host-level traffic (kube-proxy, CNI).
  • Rolling updates: Updated pod by pod, with maxUnavailable to control blast radius.
How it differs
  • vs Deployment: Deployment places N pods anywhere; DaemonSet places one per node.
  • vs static pods: Static pods are managed by kubelet directly from a file on disk — used for the control plane itself, not general workloads.
  • vs systemd services: On traditional hosts, you'd install a log shipper via systemd. DaemonSets bring that model into K8s declaratively.
Why use it: Any workload that needs host-level access or per-node presence is a DaemonSet: Fluent Bit for logs, Node Exporter for Prometheus metrics, kube-proxy for networking, Calico/Cilium/Flannel for CNI, CSI node drivers for storage, Falco for runtime security, NVIDIA device plugin for GPU exposure.
Common gotchas: A buggy DaemonSet can take down every node simultaneously, so always test and roll out carefully. Resource usage is multiplied by node count (50 nodes × 200 MB agent = 10 GB wasted if unoptimized). DaemonSets often need hostPath volumes for /var/log or /proc, which can break on read-only root filesystems. Updates to DaemonSets running as critical infra (CNI) can briefly disrupt pod networking.
Real-world examples: Every K8s cluster has multiple DaemonSets: kube-proxy, CNI plugin (Cilium, Calico), log agent (Fluent Bit, Vector), metrics (Node Exporter, DCGM Exporter for GPU), security (Falco, Aqua Enforcer). Datadog distributes its agent as a DaemonSet to every K8s node for host-level observability.
Simple Definition: Ensures a Pod runs on every node (or a selected subset). When new nodes join, Pods are auto-added.

What runs as DaemonSets: Log collection (Fluent Bit), monitoring (Node Exporter, Datadog), network plugins (Calico, Cilium), storage drivers (CSI), security agents (Falco).

Use nodeSelector or tolerations to restrict to specific node types (e.g., GPU nodes only).
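A log-collection DaemonSet along these lines might look like the sketch below. The object name, image tag, and resource figures are assumptions, and a real Fluent Bit deployment would also need a configuration (typically a ConfigMap) that is omitted here.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1           # update one node's agent at a time
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      nodeSelector:
        kubernetes.io/os: linux   # restrict to a subset of nodes
      tolerations:                # also run on tainted control-plane nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      containers:
      - name: agent
        image: fluent/fluent-bit:2.2.0   # tag is an assumption
        resources:                # multiplied by node count, so keep it small
          requests: { cpu: 50m, memory: 64Mi }
          limits: { memory: 200Mi }
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```

To target GPU nodes only, the nodeSelector would instead use whatever label those nodes carry in your cluster; that label is cluster-specific and not shown here.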

Interview Answer: "DaemonSets ensure one Pod per node for infrastructure plumbing: log collectors, monitoring agents, CNI plugins, storage drivers. Use nodeSelector and tolerations for targeting."

Job & CronJob

What is it: A Job runs one or more pods to completion; unlike a Deployment, the Job is "done" once its pods exit successfully. Jobs are used for batch processing, one-shot tasks, data migrations, backup runs, and ML training. A CronJob wraps a Job template with a cron-style schedule ("0 2 * * *") and creates a new Job at each tick. It is the K8s equivalent of crontab, but distributed and declarative. Both support retries (backoffLimit), timeouts (activeDeadlineSeconds), and parallelism.
Key features
  • Parallelism and completions: parallelism: 5 with completions: 100 runs up to 5 pods at a time until 100 have succeeded.
  • Indexed Jobs: Each pod gets a unique index, enabling embarrassingly parallel batch processing (e.g., "process shard 37").
  • BackoffLimit: Number of failures before marking the Job as failed.
  • TTL after finished: ttlSecondsAfterFinished auto-cleans up completed jobs to avoid clutter.
  • CronJob history limits: successfulJobsHistoryLimit / failedJobsHistoryLimit control how many old Jobs to keep.
How it differs
  • vs Deployment: Deployments keep pods alive; Jobs finish and stop.
  • vs Linux cron: CronJobs don't depend on a single host's crontab; the CronJob controller in the control plane (part of kube-controller-manager) creates the Jobs, and the resulting pods can run on any node.
  • vs Airflow/Argo Workflows: Airflow and Argo Workflows are full DAG engines for complex pipelines; K8s Jobs are primitives you build workflows on top of.
  • vs AWS Batch: AWS Batch is similar but cloud-specific; K8s Jobs are portable.
Why use it: Jobs and CronJobs bring batch workloads into the same declarative K8s model as services. Typical uses: nightly database backups, report generation, image/video processing, ETL pipelines, certificate renewal, cache warmup, test suite runs in CI. They integrate with the same monitoring, logging, and RBAC as the rest of the cluster.
Common gotchas: CronJobs can suffer from clock skew, missed runs if the control plane was down, and overlap if a prior run didn't finish; use concurrencyPolicy: Forbid or Replace to control overlap. Old completed Jobs pile up without TTL, burdening etcd. A Job's pods are not immediately cleaned up on completion; set a TTL or use a CI tool that cleans them. Jobs with PodDisruptionBudgets can block node drains.
Real-world examples: Spotify runs nightly data-pipeline jobs on K8s as CronJobs feeding Hadoop. Zalando uses Jobs to run database migration scripts during deploys. Argo Workflows (CNCF) builds complex DAGs on top of Jobs. Many CI systems (Tekton, Jenkins X) compile pipelines into K8s Jobs.
Simple Definition: Jobs run Pods to completion (one-time tasks). CronJobs run Jobs on a schedule.

Key settings: backoffLimit (retries), activeDeadlineSeconds (timeout), concurrencyPolicy (Allow/Forbid/Replace for CronJobs), restartPolicy (Never or OnFailure).

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"           # 2 AM daily
  concurrencyPolicy: Forbid       # Skip if previous still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report
            image: report-gen:1.0
Interview Answer: "Jobs run to completion (migrations, batch). Key settings: backoffLimit, activeDeadlineSeconds, restartPolicy=Never|OnFailure. CronJobs are scheduled Jobs with concurrencyPolicy controlling overlap. For complex DAGs, use Argo Workflows."
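For the parallelism, completions, and Indexed Job settings described in the key features, a minimal sketch could look like this. The Job name, image, and the specific numbers are assumptions; the JOB_COMPLETION_INDEX environment variable is the one Kubernetes injects into pods of Indexed Jobs.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: shard-processor
spec:
  completionMode: Indexed         # each pod gets a unique index 0..99
  completions: 100                # done after 100 successful pods
  parallelism: 5                  # at most 5 pods run at once
  backoffLimit: 4                 # failures tolerated before the Job is marked failed
  activeDeadlineSeconds: 3600     # hard timeout for the whole Job
  ttlSecondsAfterFinished: 86400  # delete the Job and its pods a day after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36       # placeholder workload
        command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

The same backoffLimit, activeDeadlineSeconds, and ttlSecondsAfterFinished fields can be set under jobTemplate.spec in a CronJob, so scheduled runs get the same retry and cleanup behaviour.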

ReplicaSet

What is it: A ReplicaSet is the low-level controller whose sole job is to ensure that a specified number of pod replicas matching a label selector are running at any given time. If a pod dies, the ReplicaSet creates a new one; if there are too many, it deletes the excess. ReplicaSets support set-based selectors (unlike their predecessor ReplicationController, which only supports equality-based ones). You almost never create ReplicaSets directly; they are created and managed by Deployments, which layer rolling update semantics on top of them.
Key features
  • Replica count enforcement: A reconciliation loop continuously compares actual vs desired pod count.
  • Pod template: Defines the blueprint for pods it creates (image, env vars, volumes).
  • Selector: Set-based or equality-based matching to adopt existing pods with matching labels.
  • Owner references: Pods link back to their owning ReplicaSet via ownerReferences, enabling cascading deletes.
How it differs
  • vs Deployment: Deployments are a higher-level abstraction that manages ReplicaSets for you, adding rolling updates, rollback, and pause/resume.
  • vs ReplicationController (legacy): ReplicaSets replaced RCs in 2016 by adding set-based label selectors.
  • vs StatefulSet: ReplicaSets treat pods as fungible; StatefulSets give each pod an identity.
Why use it: ReplicaSets are the foundation for horizontal scaling and self-healing in K8s. They embody the reconciliation loop pattern: declare desired state, let the controller continuously drive reality toward it. Every Deployment you create implicitly creates a ReplicaSet.
Common gotchas: Never create ReplicaSets directly; always use a Deployment. If you delete a Deployment with --cascade=orphan, the ReplicaSets become standalone and hard to manage. Multiple ReplicaSets with overlapping selectors will fight over pods. Old ReplicaSets from previous rollouts accumulate unless you lower revisionHistoryLimit.
Real-world examples: You see ReplicaSets listed via kubectl get rs; their names are the Deployment name plus a pod-template hash (e.g., nginx-7c5d5d6b9f). When you check rollout history, each entry is a ReplicaSet. Understanding ReplicaSets is key to debugging stuck Deployments and interpreting rollout state.
Simple Definition: Maintains a specified number of identical Pod replicas. Managed by Deployments — you almost never create one directly.

Deployments create new ReplicaSets on updates and scale the old one down. Old ReplicaSets are kept (default 10) for rollback capability.
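The link between a pod and its ReplicaSet is visible in the pod's metadata. The excerpt below is an illustrative sketch of what that metadata looks like, not output from a real cluster: the pod name suffix, uid, and image are placeholders.

```yaml
# Illustrative excerpt of a Pod owned by a ReplicaSet (placeholder values).
apiVersion: v1
kind: Pod
metadata:
  name: nginx-7c5d5d6b9f-abcde          # <ReplicaSet name>-<random suffix>
  labels:
    app: nginx
    pod-template-hash: 7c5d5d6b9f       # added by the Deployment controller
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: nginx-7c5d5d6b9f
    uid: "00000000-0000-0000-0000-000000000000"   # placeholder
    controller: true
    blockOwnerDeletion: true
spec:
  containers:
  - name: nginx
    image: nginx:1.25                   # assumed image
```

The pod-template-hash label is what keeps the old and new ReplicaSets of one Deployment from selecting each other's pods during a rollout, and the ownerReferences entry is what makes deleting the ReplicaSet cascade to its pods.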

Interview Answer: "ReplicaSet ensures the right number of Pods. It's the backend for Deployments — never created directly. On update, a new RS scales up while the old scales down. Old RS retained for rollbacks."

Networking

How Pods communicate — Services, Ingress, Gateway API, DNS, CNI, Network Policies, and Service Mesh.

Service

What is it: A Service is a stable network abstraction that provides a single virtual IP (ClusterIP) and DNS name fronting a dynamic set of pods. Because pods are ephemeral and their IPs change, clients need a fixed target, and the Service provides it. Services are typically implemented by kube-proxy on each node, which programs iptables or IPVS rules (some CNIs such as Cilium replace kube-proxy with eBPF) to load-balance traffic to the current set of ready backend pods (recorded in Endpoints/EndpointSlice objects). Services use label selectors to find pods, and the endpoint list is updated in near-real-time by the endpoints controllers.
Key features
  • Types: ClusterIP (internal only), NodePort (opens a port on every node), LoadBalancer (provisions a cloud LB), ExternalName (DNS CNAME alias).
  • Headless (clusterIP: None): No virtual IP — DNS returns pod IPs directly, used for StatefulSets and client-side load balancing.
  • Session affinity: sessionAffinity: ClientIP pins a client to a pod (sticky sessions).
  • Multi-port: A Service can expose multiple ports (HTTP + metrics, for example).
  • EndpointSlices: Scalable replacement for monolithic Endpoints, supporting clusters with tens of thousands of pods per Service.
How it differs
  • vs direct pod IPs: Pod IPs change with restarts; Service IPs are stable for the lifetime of the Service.
  • vs Ingress: Services are Layer 4 (TCP/UDP); Ingress is Layer 7 (HTTP host/path routing). They work together.
  • vs a cloud load balancer: LoadBalancer-type Services provision a cloud LB automatically (ELB, GLB, Azure LB), but bill per load balancer.
  • vs a service mesh: Istio/Linkerd layer smart routing, retries, mTLS, and observability on top of kube-proxy's simple connection-level balancing (random in iptables mode, round-robin by default in IPVS).
Why use it: Services are the primary mechanism for service discovery and internal load balancing within a cluster. Combined with CoreDNS, a pod can reach any microservice via http://payments.default.svc.cluster.local. Services decouple clients from pod lifecycles, enabling seamless rolling updates.
Common gotchas: A Service whose selector matches no pods has an empty endpoints list, so traffic goes nowhere. iptables mode slows down as rule counts grow into the thousands of Services (consider IPVS or Cilium eBPF for large clusters). NodePort opens the port on every node, which you rarely want in prod; use LoadBalancer or Ingress instead. The ClusterIP range is set at cluster creation and has traditionally been hard to change, so size it generously.
Real-world examples: Every microservice in a K8s cluster has a Service in front of it. Spotify operates tens of thousands of Services. Lyft built Envoy as a service proxy for its own microservices; it has since become the data plane for many meshes and ingress controllers. Tools like MetalLB provide LoadBalancer functionality on bare metal (no cloud integration).
Simple Definition: A Service provides a stable network endpoint (DNS name + IP) for ephemeral Pods. Since Pods get new IPs when recreated, a Service gives a permanent address.
Service routing flow:
Client request → Service IP (ClusterIP) → kube-proxy (iptables/IPVS)
↓ routes to ready Pods
Pod A (10.0.1.5)
Pod B (10.0.2.3)
Pod C (not ready): skipped

Types: ClusterIP (internal, default), NodePort (expose on node ports 30000-32767), LoadBalancer (cloud LB, costs money), ExternalName (CNAME alias), Headless (clusterIP: None, returns Pod IPs directly for StatefulSets).
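Two of these types can be sketched as manifests. The Service names, ports, and the external hostname are assumptions for illustration; the first Service also shows the multi-port and session-affinity features mentioned earlier.

```yaml
# ClusterIP Service with two named ports and sticky sessions.
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: ClusterIP                 # the default; shown for clarity
  selector:
    app: payment-api
  sessionAffinity: ClientIP       # pin each client IP to one backend pod
  ports:
  - name: http
    port: 80                      # port clients connect to
    targetPort: 8080              # assumed container port
  - name: metrics
    port: 9090
    targetPort: 9090
---
# ExternalName Service: in-cluster DNS alias (CNAME) for an external host.
apiVersion: v1
kind: Service
metadata:
  name: orders-db
spec:
  type: ExternalName
  externalName: db.example.com    # hypothetical external hostname
```

To check which pods currently back a Service, kubectl get endpointslices -l kubernetes.io/service-name=payment-api lists its EndpointSlices; an empty result usually means the selector matches no ready pods.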

Interview Answer: "Services give stable DNS/IP to ephemeral Pods. ClusterIP=internal, NodePort=node exposure, LoadBalancer=cloud LB, Headless=direct Pod IPs. Routing via kube-proxy iptables/IPVS. Use Ingress to consolidate external traffic through one LB."
Real-World Usage: Every microservice has a ClusterIP Service (http://payment-api/charge). One or two LoadBalancer Services for the Ingress Controller. ExternalName Services wrap external databases so app code uses K8s DNS everywhere.

Ingress & Ingress Controller

What is it: Ingress is a Kubernetes API object describing how external HTTP(S) traffic should be routed to internal Services based on host names and URL paths. On its own, an Ingress object does nothing; you need an Ingress Controller (NGINX, Traefik, HAProxy, Envoy-based, or cloud-native like AWS ALB Controller) running as pods in the cluster. The controller watches Ingress objects via the API server and programs its own data plane (NGINX config, Envoy xDS, cloud LB rules) to realize the declared routing.
Key features
  • Host-based routing: api.example.com → service A, web.example.com → service B.
  • Path-based routing: /api → backend Service; /static → static-assets Service.
  • TLS termination: Decrypt HTTPS at the ingress, forward plain HTTP internally. Integrates with cert-manager for automatic Let's Encrypt certs.
  • Annotations: Controller-specific features (rate limits, auth, rewrites) are attached via annotations, a historical design that makes manifests non-portable between controllers.
  • IngressClass: Lets multiple ingress controllers coexist in one cluster.
How it differs
  • vs Service (LoadBalancer): LoadBalancer type creates one cloud LB per service — expensive at scale. Ingress shares one LB across many services via host/path routing.
  • vs Gateway API: Gateway API is the modern replacement with richer, more expressive routing and better role separation. Ingress is feature-frozen.
  • vs Service Mesh: Ingress handles north-south (external-to-cluster) traffic; mesh handles east-west (pod-to-pod).
Why use it: Ingress is the cheapest and simplest way to expose many HTTP services through a single entrypoint: you pay for one load balancer and get TLS, host-based routing, and caching/rate limiting depending on the controller. Essential for any web-facing K8s cluster.
Common gotchas: Annotation semantics vary between controllers, so a config that works for NGINX Ingress may not work for Traefik. TLS certificate renewal via cert-manager sometimes silently fails. Ingress has no native support for gRPC streaming in older versions. Misconfigured pathType (Exact vs Prefix vs ImplementationSpecific) is a common source of 404s. The original Ingress API was criticized for being underspecified, which is why Gateway API was designed.
Real-world examples: NGINX Ingress Controller is the most popular by far. Traefik is popular in dev and smaller setups. AWS Load Balancer Controller translates Ingress objects into ALB rules. Contour (from VMware, uses Envoy) is common in larger orgs. Cloudflare, Shopify, GitLab all rely heavily on Ingress for exposing SaaS services.
Simple Definition: Ingress routes external HTTP/HTTPS traffic to internal Services based on hostname and URL path. An Ingress Controller (NGINX, Traefik) implements these rules.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx
  tls:
  - hosts: [api.myapp.com]
    secretName: tls-secret
  rules:
  - host: api.myapp.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service: { name: api-v1, port: { number: 80 } }
      - path: /
        pathType: Prefix
        backend:
          service: { name: frontend, port: { number: 80 } }

Controllers: NGINX (most popular), Traefik, HAProxy, AWS ALB, Istio Gateway.
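Two of the gotchas above, controller-specific annotations and pathType, are easier to see in a manifest. The sketch below assumes the ingress-nginx controller and a cert-manager ClusterIssuer named letsencrypt-prod; both the issuer name and the rate-limit value are illustrative assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumed ClusterIssuer name
    nginx.ingress.kubernetes.io/limit-rps: "10"        # ingress-nginx only; other controllers ignore it
spec:
  ingressClassName: nginx
  tls:
  - hosts: [api.myapp.com]
    secretName: api-myapp-tls     # cert-manager stores the issued certificate here
  rules:
  - host: api.myapp.com
    http:
      paths:
      - path: /healthz
        pathType: Exact           # matches /healthz only, not /healthz/ or /healthz/x
        backend:
          service: { name: api-v1, port: { number: 80 } }
      - path: /v1
        pathType: Prefix          # matches /v1 and /v1/..., but not /v10
        backend:
          service: { name: api-v1, port: { number: 80 } }
```

Prefix matching works on whole path segments, which is why /v1 does not match /v10; expecting plain string-prefix behaviour is a frequent cause of unexpected 404s.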

Interview Answer: "Ingress manages external HTTP/HTTPS with host/path routing, TLS termination. Requires an Ingress Controller (NGINX, Traefik). One Ingress Controller behind a single LoadBalancer replaces many individual LB Services, saving cost. Gateway API is the newer replacement."

Gateway API

What is itThe Gateway API is the modern, role-oriented, expressive successor to Kubernetes Ingress. It splits responsibilities into three layers: GatewayClass (what implementation to use, managed by infra teams), Gateway (a listener with IP/port/TLS, managed by platform teams), and HTTPRoute/TCPRoute/GRPCRoute (routing rules managed by application teams). Gateway API v1.0 reached GA in October 2023, promoting GatewayClass, Gateway, and HTTPRoute to v1. It ships as CRDs versioned independently of Kubernetes, so it is not tied to a particular Kubernetes release. Designed to fix Ingress's limitations: poor expressiveness, annotation chaos, no role separation, weak L4 support.
Key features
  • Role-based layering: Infra, platform, and app teams each own a distinct resource — clear ownership boundaries.
  • Rich routing: Header matching, query param matching, method matching, weighted traffic splitting, header/URL rewrites, mirroring — all native, no annotations.
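  • Cross-namespace routing: Routes in one namespace can attach to Gateways in another when the Gateway listener's allowedRoutes permits it. ReferenceGrant separately controls cross-namespace object references, such as a route's backendRefs pointing at a Service in another namespace.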
  • Protocol support: HTTP, HTTPS, gRPC, TLS passthrough, TCP, UDP.
  • Extension points: Policy attachment (the upstream BackendTLSPolicy, plus implementation-specific policies such as rate limiting) adds behavior without annotation overload.
How it differs
  • vs Ingress: Ingress has vague semantics, requires annotations for everything, and lacks L4 support. Gateway API is typed, expressive, portable.
  • vs Istio VirtualService: Istio's API is more powerful but mesh-specific. Gateway API is the common denominator across implementations.
  • vs SMI: Service Mesh Interface was an earlier attempt at standardization that is now effectively replaced by Gateway API + GAMMA initiative.
Why use itGateway API is where the Kubernetes networking ecosystem is moving. New clusters should plan for it. Benefits: portability (same YAML works across implementations), better security (explicit cross-namespace permissions), cleaner team separation, and uniform policy attachment.
Common gotchasStill has limited adoption in tooling compared to Ingress. Controllers vary in what features they support — check conformance. Migration from Ingress is a big rewrite, not a drop-in. Some advanced features are still alpha or experimental.
Real-world examplesImplemented by Istio, Envoy Gateway, Contour, Cilium, Traefik, Kong, HAProxy, NGINX Gateway Fabric, Google Cloud GKE Gateway, AWS Gateway API Controller. Google Cloud and Red Hat are major investors. Used at companies adopting service mesh + ingress unification, like Salesforce and Intuit.
Simple Definition: Next-gen Ingress replacement. Role-based separation: GatewayClass (infra), Gateway (ops), HTTPRoute (devs). Supports TCP, gRPC, traffic splitting natively.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payment-route
spec:
  parentRefs:
  - name: production-gateway
  hostnames: ["api.myapp.com"]
  rules:
  - matches:
    - path: { type: PathPrefix, value: /payments }
    backendRefs:
    - name: payment-v2
      port: 80
      weight: 90          # 90% to v2
    - name: payment-v3
      port: 80
      weight: 10          # 10% canary
Interview Answer: "Gateway API succeeds Ingress with role-based resources: GatewayClass (infra), Gateway (ops), HTTPRoute (devs). Supports TCP/UDP/gRPC natively, traffic splitting, header-based routing. Portable across implementations (Istio, Cilium, NGINX). New projects should use Gateway API over Ingress."
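
The HTTPRoute above attaches to a Gateway named production-gateway. A minimal sketch of what that Gateway might look like follows. The gatewayClassName example-gateway-class is a hypothetical name (the real one depends on the installed controller), and reusing tls-secret from the Ingress example is an assumption.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
spec:
  gatewayClassName: example-gateway-class   # hypothetical; set by the installed controller
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    hostname: api.myapp.com
    tls:
      mode: Terminate                       # terminate TLS at the Gateway
      certificateRefs:
      - kind: Secret
        name: tls-secret                    # assumed to exist in the Gateway's namespace
    allowedRoutes:
      namespaces:
        from: Same                          # only routes in this namespace may attach
```

Changing allowedRoutes.namespaces.from to Selector or All is what opens the listener to routes from other namespaces.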

DNS (CoreDNS) & kube-proxy

What is itCoreDNS is the cluster DNS server: a flexible, plugin-based DNS server written in Go that became the default cluster DNS (replacing kube-dns) in Kubernetes 1.13. It resolves Service names like payments.default.svc.cluster.local to ClusterIPs, and headless Service names to pod IPs. kube-proxy is the node-level agent that implements Services: it watches the API server for Service and EndpointSlice changes and programs node-level iptables, IPVS, or nftables rules that transparently load-balance traffic destined for ClusterIPs to the actual pod IPs. Increasingly, eBPF datapaths such as Cilium take over this role and replace kube-proxy entirely.
Key features
  • CoreDNS plugins: kubernetes, forward, cache, loadbalance, hosts, rewrite, autopath, prometheus.
  • DNS search paths: Pods get <namespace>.svc.cluster.local, svc.cluster.local, and cluster.local as search domains in /etc/resolv.conf, so short names resolve.
  • kube-proxy modes: iptables (default, O(n) rules), IPVS (hash-based, scales better), nftables (newer), eBPF via Cilium (replaces kube-proxy entirely).
  • Topology-aware routing: Prefer endpoints in the same zone for latency and egress-cost savings.
How it differs
  • CoreDNS vs kube-dns: kube-dns was a multi-container hack (dnsmasq + sidecar); CoreDNS is a single binary with a clean plugin model.
  • iptables vs IPVS vs eBPF: iptables is fine below ~1000 services. IPVS uses kernel hash tables — better at 5000+. eBPF (Cilium) is fastest and most feature-rich.
  • kube-proxy vs service mesh: kube-proxy does simple L4 load balancing (random backend selection in iptables mode, round-robin by default in IPVS); a mesh layers sophisticated L7 routing on top via sidecar proxies.
Why use itWithout CoreDNS and kube-proxy, there is no service discovery and no internal load balancing in the cluster — two of the most fundamental K8s features. They're invisible when working, critical when broken.
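Common gotchasCoreDNS under load can be a cluster bottleneck; tune the cache plugin and add NodeLocal DNSCache. The default ndots:5 in a pod's /etc/resolv.conf means any name with fewer than five dots is tried against every search domain first, so an external lookup generates several failed queries before the real one. kube-proxy iptables reconciliation can be slow in large clusters. Conntrack table overflow causes mysterious packet drops. Lookups for external names still go through CoreDNS, which forwards them to upstream resolvers via the forward plugin, so an overloaded or misconfigured CoreDNS breaks external resolution too.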
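
One common mitigation for the ndots:5 query amplification is to lower ndots for pods that mostly call external hostnames. The sketch below assumes a hypothetical pod name and placeholder image; only the dnsConfig block matters.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-client                               # hypothetical name
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"                                 # names with 2+ dots are tried as absolute names first
  containers:
  - name: app
    image: registry.example.com/api-client:1.0   # placeholder image
```

With this setting, short names such as my-service or my-service.my-namespace still go through the search domains, while api.example.com is resolved directly.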
Real-world examplesLyft famously migrated from kube-proxy iptables to Envoy sidecars for better observability. Datadog has detailed blog posts about CoreDNS tuning for thousand-node clusters. Airbnb replaced kube-proxy with Cilium eBPF for performance and policy enforcement.
Simple Definition: CoreDNS is the cluster's DNS server (Services reachable by name). kube-proxy handles the actual network routing from Service IPs to Pod IPs.

DNS format: my-service.my-namespace.svc.cluster.local (or just my-service within same namespace).

kube-proxy modes: iptables (O(n), default), IPVS (O(1), better at scale), eBPF/Cilium (highest performance, replaces kube-proxy).
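
As a sketch of how the mode is selected, kube-proxy reads a KubeProxyConfiguration file. How that file is delivered varies by distribution (kubeadm clusters typically keep it in a ConfigMap in kube-system), and IPVS mode assumes the IPVS kernel modules are available on every node.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs            # falls back or fails if the IPVS kernel modules are missing
ipvs:
  scheduler: rr       # round-robin; other IPVS schedulers such as lc (least connection) exist
```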

East-West = service-to-service within cluster. North-South = external traffic.

Interview Answer: "CoreDNS provides service discovery by name. kube-proxy implements routing via iptables (O(n)), IPVS (O(1)), or eBPF (Cilium, fastest). East-west is internal, north-south is external."

CNI (Container Network Interface)

What is itThe Container Network Interface (CNI) is a specification, hosted as a CNCF project, for configuring network interfaces in Linux containers. When a pod is created, the container runtime (containerd or CRI-O, acting on the kubelet's CRI request) invokes the configured CNI plugin, which attaches the pod's network namespace to the cluster network and assigns it an IP. CNI plugins (Calico, Cilium, Flannel, Weave, AWS VPC CNI, Azure CNI, GKE native) implement the K8s networking model: every pod gets a unique routable IP, and every pod can reach every other pod without NAT.
Key features
  • IPAM: IP address management — allocating pod IPs from a configured range.
  • Overlay networks: VXLAN, Geneve, or IPinIP encapsulation for cross-node pod-to-pod traffic (Flannel, Calico IPIP).
  • Native routing: BGP (Calico) or cloud routing tables for non-encapsulated traffic.
  • NetworkPolicy enforcement: Most CNI plugins implement NetworkPolicy via iptables or eBPF.
  • eBPF: Cilium uses eBPF hooks in the kernel for fast, programmable datapath and policy enforcement.
How it differs
  • Calico vs Cilium vs Flannel: Flannel is simple but limited. Calico adds policy and BGP. Cilium adds eBPF, L7 policy, Hubble observability, and a kube-proxy replacement.
  • Cloud CNI vs overlay: AWS VPC CNI assigns real VPC IPs to pods (no overlay overhead, but limited by ENI IP capacity); Flannel/Calico overlay is universal but has encap overhead.
  • CNI vs Docker networking: Docker uses libnetwork and the CNM spec; K8s rejected CNM in favor of the simpler CNI.
Why use itCNI is the pluggable networking layer of K8s — you cannot run a cluster without a CNI. Choice of CNI affects performance, scalability, observability, policy capability, and cloud integration. It's one of the most impactful platform decisions.
Common gotchasAWS VPC CNI has a pod-density limit per node based on ENI capacity. Overlay MTU must be lower than underlay MTU (commonly 1450 for VXLAN over a 1500-byte underlay) to avoid fragmentation. CNI plugin crashes on a node cripple networking for its pods. Migrating between CNIs is disruptive and in practice often means rebuilding the cluster or replacing nodes one by one. eBPF-based datapaths need a recent Linux kernel (5.4+).
Real-world examplesGoogle GKE uses native cloud routing. AWS EKS uses VPC CNI by default, Cilium increasingly popular. Datadog, Adobe, Sky, Bell Canada run Cilium at scale. Capital One famously runs Calico. Cilium is part of Google Anthos and AWS EKS Anywhere.
Simple Definition: CNI is the plugin standard for Pod networking. K8s doesn't do networking itself — it delegates to a CNI plugin.

Major plugins:

  • Cilium — eBPF-based, highest performance, built-in observability (Hubble), service mesh. The rising star. CNCF graduated
  • Calico — Most widely deployed. BGP routing, excellent Network Policy support
  • Flannel — Simple overlay, NO Network Policy support. Dev only
  • AWS VPC CNI — Real VPC IPs on EKS. Limited by ENI capacity per node
Interview Answer: "CNI is the plugin standard. Key choices: Cilium (eBPF, highest performance, CNCF graduated), Calico (BGP, mature), AWS VPC CNI (real VPC IPs). Choice affects performance, security features, and cloud integration."

Network Policy

What is itA NetworkPolicy is a Kubernetes firewall object that restricts which pods can communicate with which. It uses label selectors to define source and destination pods, then specifies allowed ingress and egress rules. By default, K8s has a flat, fully open network — any pod can reach any other. NetworkPolicy lets you implement zero-trust networking: deny everything, then explicitly allow the connections each app needs. NetworkPolicy is enforced by the CNI plugin (Calico, Cilium, Weave) — plugins like Flannel don't support it.
Key features
  • Ingress rules: "Allow traffic to pods with label app=db only from pods with label app=api."
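  • Egress rules: "Pods in namespace frontend may only call out to pods labeled app=api in namespace default." Rules select pods, namespaces, or CIDRs; the base API cannot match Service or DNS names.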
  • Namespace selectors: Policies can match by namespace labels, enabling cross-namespace rules.
  • IP blocks: Allow by CIDR, using except to carve out ranges (e.g., allow 0.0.0.0/0 except the cloud metadata endpoint 169.254.169.254/32). The base API has no explicit deny rules.
  • Deny-all baseline: Apply a policy selecting all pods with no rules to default-deny.
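
The deny-all baseline can be sketched as a single policy per namespace. The namespace name payments is a hypothetical example.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments               # hypothetical namespace
spec:
  podSelector: {}                   # empty selector matches every pod in the namespace
  policyTypes: [Ingress, Egress]    # no ingress/egress rules listed, so both directions are denied
```

Further policies in the same namespace then add the specific allows each app needs.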
How it differs
  • vs cloud security groups: Security groups work at instance level; NetworkPolicy works at pod level and moves with the pod.
  • vs Calico GlobalNetworkPolicy: Calico's CRD adds richer rules (L7, priority, deny) beyond the base K8s NetworkPolicy API.
  • vs Cilium NetworkPolicy: Cilium adds L7 (HTTP, gRPC, Kafka, DNS) and identity-based policy.
  • vs service mesh authz: Istio's AuthorizationPolicy operates at the mesh layer with mTLS identity, complementary to NetworkPolicy.
Why use itNetworkPolicies are essential for compliance (PCI, HIPAA, SOC2), blast radius reduction in case of pod compromise, and multi-tenancy. Without them, a compromised pod in namespace A can scan the whole cluster.
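Common gotchasNetworkPolicies are additive allow-lists: once any policy selects a pod for a given direction (ingress or egress), all traffic in that direction that no policy allows is denied. There are no deny rules and no ordering. Silent failures happen if the CNI doesn't enforce NetworkPolicy or a particular feature, because the API server accepts the object anyway. DNS egress must be explicitly allowed or everything breaks. Pods using hostNetwork generally bypass NetworkPolicy. Debugging denied traffic requires CNI-specific tools like Cilium Hubble.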
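
Because default-deny egress also blocks DNS, an allow rule for cluster DNS usually accompanies it. This sketch assumes the DNS pods run in kube-system with the label k8s-app: kube-dns, which is common (for example in kubeadm clusters) but should be checked against the actual cluster.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  podSelector: {}                   # applies to every pod in the policy's namespace
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:                  # same list item as namespaceSelector, so both must match
        matchLabels:
          k8s-app: kube-dns         # assumed label on the CoreDNS pods
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

If NodeLocal DNSCache is in use, the DNS target is a node-local address instead, and the rule would need an ipBlock rather than a pod selector.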
Real-world examplesCapital One enforces NetworkPolicies for PCI compliance. Shopify uses default-deny NetworkPolicies in every namespace. Zalando publishes tooling to auto-generate policies from observed traffic. Modern platforms use Cilium Hubble or Inspektor Gadget for policy dry-runs and visualization.
Simple Definition: Firewall rules for Pods. Controls which Pods can talk to which. Default is allow-all; once a policy selects a Pod, that Pod becomes default-deny in the direction(s) the policy covers.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Ingress]
  ingress:
  - from:
    - podSelector:
        matchLabels: { app: frontend }
    ports:
    - port: 8080
Important: Requires a CNI that supports it. Calico, Cilium = yes. Flannel = no.
Interview Answer: "Network Policies are Pod-level firewall rules using label selectors. Default is allow-all; applying a policy makes that direction default-deny. Best practice: default-deny per namespace, then explicit allows. Requires a CNI that enforces it, such as Calico or Cilium. Essential for micro-segmentation and compliance (PCI-DSS)."

Service Mesh (Istio, Linkerd)

What is itA service mesh is an infrastructure layer that handles service-to-service communication for microservices (routing, load balancing, retries, timeouts, mTLS, observability, and traffic splitting) without requiring application code changes. A classic mesh injects a sidecar proxy (Envoy for Istio, linkerd2-proxy for Linkerd) into every pod, intercepting all inbound and outbound traffic transparently. A central control plane configures the proxies (in Istio, via Envoy's xDS APIs). Modern sidecar-less designs (Istio ambient mode, Cilium Service Mesh) replace sidecars with node-level proxies or eBPF.
Key features
  • mTLS: Automatic mutual TLS between every pod — zero-trust networking by default.
  • Traffic management: Canary, weighted splits, fault injection, circuit breakers, retries with jitter, timeouts.
  • Observability: Automatic distributed tracing, RED metrics (rate/errors/duration), and access logs for every service.
  • Authorization policy: L7 authz using pod identity (SPIFFE) instead of IPs.
  • Multi-cluster: Connect services across clusters and clouds via the same mesh.
How it differs
  • Istio vs Linkerd: Istio is feature-rich and complex; Linkerd is minimal and lighter, with a purpose-built data-plane proxy written in Rust. Both are CNCF graduated projects; Linkerd is the one known for simplicity.
  • Istio vs Cilium Service Mesh: Cilium uses eBPF to avoid sidecars entirely, promising lower overhead.
  • vs Consul Connect: HashiCorp's mesh, often used in Nomad/VM environments.
  • vs app-level libraries: Prior approach was Netflix OSS (Ribbon/Hystrix/Eureka) baked into the app — meshes externalize this.
Why use itMeshes standardize cross-cutting concerns for polyglot microservices: a Go service, a Python ML model, and a Java legacy app all get the same retries, mTLS, and telemetry automatically. Essential for regulated industries (zero-trust) and large microservice estates (observability).
Common gotchasSidecars add latency to every hop (typically on the order of a few milliseconds, varying with configuration and load) and significant memory overhead (each sidecar is ~50 MB, more in large meshes). Istio has historically had a steep learning curve and operational burden. Debugging why traffic is being retried/dropped involves reading Envoy access logs. mTLS bootstrap ordering issues can cause startup deadlocks. Upgrades are risky because meshes touch every pod.
Real-world examplesAirbnb, HP, Splunk, Atlassian run Istio at large scale. Microsoft adopted Linkerd for Azure internal platforms. Salesforce runs a custom Envoy-based mesh. Lyft created Envoy specifically for their service mesh needs before open-sourcing it.
Simple Definition: Infrastructure layer handling service-to-service communication transparently: mTLS encryption, traffic management, observability, and authorization — without code changes.

Capabilities: mTLS (zero-trust), retries/timeouts/circuit breaking, traffic splitting (canary), request-level metrics and traces, fine-grained AuthorizationPolicies.
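
As one concrete illustration of mandatory mTLS, Istio expresses it with a PeerAuthentication resource. This is a sketch for Istio only; the namespace payments is hypothetical, and older Istio releases serve this resource as security.istio.io/v1beta1 rather than v1.

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments     # hypothetical; placing it in the mesh root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT          # reject plaintext traffic to workloads in this namespace
```

PERMISSIVE mode accepts both plaintext and mTLS and is the usual stepping stone during migration.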

Options: Istio (most features, CNCF graduated, Ambient mode = sidecar-less), Linkerd (simpler, Rust proxy), Cilium (eBPF-based, no sidecars).

When to adopt: 10+ microservices AND you need mandatory mTLS, advanced traffic management, or request-level observability. Don't adopt for 2-3 services.

Interview Answer: "Service mesh handles service-to-service communication via sidecar proxies (Envoy). Provides mTLS, traffic management, observability, authorization. Istio=most features, Linkerd=simpler, Cilium=eBPF no sidecars. Adopt when microservices complexity justifies the overhead."

Configuration & Storage

ConfigMaps, Secrets, Volumes, PV/PVC, StorageClasses, and CSI.

ConfigMap

What is itA ConfigMap is a Kubernetes object for storing non-sensitive configuration data as key/value pairs, separate from the container image. This enforces the 12-factor app principle of externalizing config. ConfigMaps can hold plain strings, entire files, or binary data (up to 1 MB total), and are surfaced to pods as environment variables, command-line arguments, or mounted files in a volume. The same image can run in dev/staging/prod with different ConfigMaps, eliminating per-environment images.
Key features
  • Multiple consumption patterns: envFrom for bulk env import, valueFrom.configMapKeyRef for individual keys, volume mounts for files.
  • Live updates: Mounted ConfigMaps update in the pod filesystem automatically (via symlink swap) — apps can reload on change.
  • Immutable ConfigMaps: Marking immutable: true prevents accidental edits and reduces API server load.
  • Binary data: For certificates or non-UTF8 files, use binaryData.
How it differs
  • vs Secret: ConfigMaps are plaintext in etcd; Secrets are base64-encoded (not encrypted by default) and handled more carefully. Use Secrets for credentials, ConfigMaps for everything else.
  • vs env vars on VM: ConfigMaps are version-controlled, templated (via Helm/Kustomize), and managed via the same RBAC as everything else.
  • vs external config services (Consul, etcd): ConfigMaps are native to K8s; external stores require extra infra but can offer hot reloads and audit.
Why use itConfigMaps enable configuration as code stored in Git, applied via GitOps, and versioned with the rest of the manifest. They keep environment-specific settings out of images, support per-environment overrides, and integrate with every config-consumption pattern.
Common gotchasEnv-var ConfigMaps are not updated live; pods must be restarted to see changes. Volume-mounted ConfigMaps update, but apps must watch for file changes, and files mounted via subPath never receive updates. ConfigMap changes don't trigger rolling restarts automatically; use checksum annotations (Helm idiom) or tools like Reloader. Base64-encoding a large file into binaryData can push you toward the 1 MB limit. A missing ConfigMap referenced by a pod prevents pod startup unless the reference is marked optional.
Real-world examplesMost apps use ConfigMaps for things like LOG_LEVEL, FEATURE_FLAGS, DB_HOST, application.yaml (Spring Boot), settings.py overrides (Django), Nginx configuration files, Prometheus scrape rules, fluentd parser configs.
Simple Definition: Stores non-sensitive configuration as key-value pairs, separate from images. Consumed as env vars or mounted as files.

Volume-mounted ConfigMaps support hot-reload (~60s). Env vars do NOT — need Pod restart. immutable: true for better performance at scale. 1 MB size limit.
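
The two consumption patterns can be sketched side by side. All names, keys, and the image below are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: info
  app.properties: |
    feature.newCheckout=false
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0   # placeholder image
    env:
    - name: LOG_LEVEL                      # read once at container start; needs a restart to change
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: LOG_LEVEL
    volumeMounts:
    - name: config
      mountPath: /etc/app                  # app.properties appears here and is refreshed by the kubelet
      readOnly: true
  volumes:
  - name: config
    configMap:
      name: app-config
      items:
      - key: app.properties
        path: app.properties
```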

Interview Answer: "ConfigMaps decouple config from images. Consumed as env vars or files. Volume mounts support hot-reload; env vars need restart. Set immutable:true for performance. Use Secrets for sensitive data."

Secret

What is itA Secret is a Kubernetes object for storing sensitive data — passwords, API tokens, TLS certificates, SSH keys, OAuth tokens, image pull credentials. Secrets are mostly structured like ConfigMaps but are base64 encoded (which is encoding, NOT encryption) and treated with special RBAC. By default, Secrets are stored unencrypted in etcd, which is a well-known footgun — you MUST enable encryption at rest (via the EncryptionConfiguration API, ideally with a KMS provider like AWS KMS, GCP KMS, Vault Transit) to meet compliance.
Key features
  • Secret types: Opaque (generic), kubernetes.io/tls, kubernetes.io/dockerconfigjson, kubernetes.io/service-account-token, kubernetes.io/ssh-auth.
  • Consumption: Env vars, volume mounts, imagePullSecrets on a pod, or referenced by ServiceAccounts.
  • Encryption at rest: Requires explicit config on the API server with aescbc, kms, or secretbox providers.
  • Immutable Secrets: Same idea as immutable ConfigMaps for safety and performance.
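
Encryption at rest is configured through a file passed to the API server with --encryption-provider-config. A minimal sketch using the aescbc provider follows; the key value is a placeholder, and managed clusters usually expose this through a KMS setting instead of a raw file.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: [secrets]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <BASE64_ENCODED_32_BYTE_KEY>   # placeholder; generate a random 32-byte key and base64-encode it
  - identity: {}                               # fallback so existing unencrypted Secrets stay readable
```

Provider order matters: the first provider encrypts new writes, and every listed provider is tried when reading. Existing Secrets stay unencrypted until they are rewritten.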
How it differs
  • vs ConfigMap: Same API shape, different intent and RBAC treatment. Secrets should be restricted to the minimum set of users/service accounts.
  • vs Vault / AWS Secrets Manager: External secret stores provide rotation, audit, dynamic secrets, and true encryption. Tools like External Secrets Operator sync external stores into K8s Secrets.
  • vs SealedSecrets / SOPS: SealedSecrets (Bitnami) encrypts Secrets for safe storage in Git; SOPS (originally from Mozilla, now a CNCF project) does the same with more key providers (cloud KMS, PGP, age).
Why use itSecrets keep credentials out of container images and source control, managed via RBAC, and injectable into pods without hardcoding. Combined with an external secret manager, they power secure credential distribution across thousands of pods.
Common gotchasBase64 is NOT encryption — anyone with read access to etcd or the Secret API can see the plaintext. Secrets stored in Git (even with Kustomize) without sealing are a security incident waiting to happen. Env-var Secrets can leak into crash dumps and logs. Rotating a Secret requires pod restart unless the app watches the mount. Default Secret size limit is 1 MB.
Real-world examplesExternal Secrets Operator is the most popular pattern — sync from AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault. Vault Agent Injector mutates pods to pull secrets at startup. cert-manager creates TLS Secrets automatically. Most production clusters use KMS-backed encryption at rest.
Simple Definition: Stores sensitive data (passwords, API keys, TLS certs). Base64-encoded (NOT encrypted by default).

Types: Opaque, TLS, dockerconfigjson, service-account-token, basic-auth.

Security: Base64 ≠ encryption! For production: enable etcd encryption at rest, use External Secrets Operator (Vault, AWS Secrets Manager), restrict RBAC, use Sealed Secrets for GitOps.
Interview Answer: "Secrets store sensitive data, base64-encoded (not encrypted). Types: Opaque, TLS, dockerconfigjson. For production: encrypt etcd at rest, use External Secrets Operator to sync from Vault/AWS, restrict RBAC, use Sealed Secrets for GitOps."

PV, PVC & StorageClass

What is itKubernetes storage uses three coordinated abstractions. A PersistentVolume (PV) is a piece of storage (EBS volume, GCE disk, Azure Disk, NFS share, Ceph RBD) provisioned in the cluster. A PersistentVolumeClaim (PVC) is a pod's request for storage with specific size and access mode. A StorageClass is a profile describing how to dynamically provision new PVs — the "dynamic" flow: create a PVC referencing a StorageClass, and the controller creates a matching PV on demand via the CSI driver. This decouples application authors (who ask for storage) from infra (who define storage types).
Key features
  • Access modes: ReadWriteOnce (single node, block/file), ReadOnlyMany, ReadWriteMany (multi-node, rare), ReadWriteOncePod (1.22+).
  • Reclaim policies: Delete (PV and underlying disk deleted when PVC is deleted), Retain (keep for manual cleanup).
  • Volume expansion: Grow a PVC by raising spec.resources.requests.storage; the StorageClass must set allowVolumeExpansion: true and the CSI driver must support it.
  • CSI (Container Storage Interface): Vendor-agnostic plugin model — AWS EBS CSI, GCE PD CSI, Azure Disk CSI, Rook/Ceph, Portworx, Longhorn, OpenEBS.
  • Snapshots and clones: Native CSI snapshot API for point-in-time backups.
How it differs
  • vs emptyDir: emptyDir lives with the pod and is destroyed on pod deletion; PVs survive pod lifecycle.
  • vs hostPath: hostPath ties a pod to a specific node's disk; PVs are location-independent (ideally).
  • vs Docker volumes: Docker volumes are host-local; PVs can be networked, zonal, and dynamically provisioned.
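  • CSI vs old in-tree drivers: CSI has been GA since Kubernetes 1.13, and the old in-tree cloud volume plugins have since been migrated to out-of-tree CSI drivers over several releases. The result is a cleaner plugin API, with vendors shipping drivers independently of Kubernetes.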
Why use itPVs/PVCs enable running stateful workloads on K8s: databases, message brokers, search engines, user uploads, caches. StorageClasses let platform teams offer tiered storage (gp3-ssd, slow-hdd, local-nvme) as self-service.
Common gotchasAWS EBS and GCE PD are zonal: a pod can only attach to a PV in its own zone, which constrains scheduling. ReadWriteMany is not supported by most block storage; use EFS, NFS, CephFS. Dynamically provisioned PVs inherit the StorageClass reclaimPolicy, which defaults to Delete and has caused accidental data loss; set it to Retain for critical data. Expansion requires the filesystem to support online growth. PVC-to-PV binding is exclusive: deleting a PVC releases its PV (or destroys it under Delete), and a recreated PVC does not automatically rebind to the old data.
Real-world examplesRook/Ceph provides distributed block, file, and object storage inside K8s. Longhorn (Rancher) offers simple distributed block storage. Portworx is a commercial option popular in enterprise. AWS EKS, GKE, AKS all offer their own CSI drivers as the default. Airbnb and Pinterest run large Cassandra clusters on local-SSD PVs.
Simple Definition: PV = provisioned storage (disk). PVC = request for storage. StorageClass = category/provisioner enabling dynamic provisioning via CSI drivers.
Storage Provisioning Flow
Pod needs storage → PVC created → StorageClass → CSI Driver → Cloud disk provisioned (EBS) → PV bound to PVC → Mounted in Pod

Access Modes: RWO (single node), ROX (read-only many), RWX (read-write many, needs EFS/NFS), RWOP (single Pod).

Reclaim: Delete (default) or Retain (keep data). VolumeSnapshots for backups.
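
The dynamic provisioning flow can be sketched with a StorageClass and a PVC that references it. This assumes the AWS EBS CSI driver is installed; the names gp3-retain and data-postgres are hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain                          # hypothetical class name
provisioner: ebs.csi.aws.com                # AWS EBS CSI driver (assumed installed)
parameters:
  type: gp3
reclaimPolicy: Retain                       # keep the disk when the PVC is deleted
allowVolumeExpansion: true                  # permit growing PVCs later
volumeBindingMode: WaitForFirstConsumer     # provision in the zone where the pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres                       # hypothetical claim name
spec:
  storageClassName: gp3-retain
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
```

The PV itself is never written by hand in this flow; the driver creates it once a pod that uses the claim is scheduled.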

Interview Answer: "PV=storage, PVC=request, StorageClass=provisioner. Dynamic provisioning: PVC triggers CSI driver to create disk. Access modes: RWO (most common), RWX (shared, needs EFS). WaitForFirstConsumer binding ensures correct AZ. Volume Snapshots for backups."

emptyDir, hostPath & Ephemeral Volumes

What is itThese are non-persistent volume types in K8s. emptyDir is a temporary directory created when a pod is scheduled and deleted when the pod is removed — used for scratch space and inter-container sharing within a pod. hostPath mounts a directory from the node's filesystem into the pod — powerful but dangerous (ties the pod to a specific node). Ephemeral volumes (generic ephemeral, CSI ephemeral) are pod-scoped but provisioned via CSI drivers like regular PVs — useful for short-lived scratch volumes backed by a real storage system.
Key features
  • emptyDir memory medium: medium: Memory uses tmpfs (RAM) — fast but counts against pod memory limit.
  • emptyDir sizeLimit: Cap how much disk/memory the pod can consume.
  • hostPath types: Directory, File, Socket, DirectoryOrCreate — with type validation.
  • Generic ephemeral volumes: Inline PVC-like specs that are created per-pod and deleted with it.
  • projected volumes: Combine secrets, configmaps, downward API, and service account tokens into a single mount point.
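
A sketch of the inter-container sharing pattern with emptyDir: one container writes, a sidecar reads the same directory. Pod and container names are illustrative, and the busybox commands stand in for a real app and log shipper.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo                # hypothetical name
spec:
  containers:
  - name: writer
    image: busybox:1.36
    command: ["sh", "-c", "while true; do date >> /scratch/out.log; sleep 5; done"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  - name: reader
    image: busybox:1.36
    command: ["sh", "-c", "touch /scratch/out.log && tail -f /scratch/out.log"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 256Mi              # the pod is evicted if usage exceeds this
      # medium: Memory              # uncomment for tmpfs; usage then counts against memory limits
```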
How it differs
  • emptyDir vs PVC: emptyDir lives and dies with the pod; PVC survives.
  • hostPath vs local PV: Local PVs are the sanctioned way to use local disks — they're scheduler-aware. hostPath bypasses scheduling and is generally discouraged outside DaemonSets and system pods.
  • Generic ephemeral vs emptyDir: Generic ephemeral gives you a real storage class (SSD-backed, replicated) in a pod-scoped lifecycle.
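
For comparison, a generic ephemeral volume embeds a PVC template in the pod spec. The sketch assumes a StorageClass named fast-ssd exists; that name, the pod name, and the image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                            # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:1.0      # placeholder image
    volumeMounts:
    - name: scratch
      mountPath: /work
  volumes:
  - name: scratch
    ephemeral:
      volumeClaimTemplate:                      # a PVC is created with the pod and deleted with it
        spec:
          accessModes: [ReadWriteOnce]
          storageClassName: fast-ssd            # hypothetical StorageClass
          resources:
            requests:
              storage: 10Gi
```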
Why use itUse emptyDir for scratch space, inter-container sharing (app writes, sidecar reads), or tmpfs-backed caches. Use hostPath only in DaemonSets or when mounting host logs/sockets (e.g., /var/run/docker.sock). Use ephemeral CSI volumes when you need real storage performance but pod-scoped lifecycle.
Common gotchashostPath is a massive security risk: a compromised pod can read/write host files, potentially escalating to root via /etc/kubernetes. emptyDir can fill the node's ephemeral storage, triggering eviction. Memory-backed emptyDir counts against the pod memory limit (OOM kills). hostPath data is node-local: a pod rescheduled to another node finds a different (or empty) directory unless it is pinned to the original node. Admission policies often ban hostPath in production.
Real-world examplesLog shipper DaemonSets (Fluent Bit) mount /var/log via hostPath. Init containers commonly use emptyDir to pass generated configs to app containers. CI/CD jobs use emptyDir for build output. Large Spark/ML workloads use ephemeral local SSDs via generic ephemeral volumes.
Simple Definition: emptyDir = temp storage tied to Pod lifetime (shared between containers). hostPath = mounts host filesystem (security risk, DaemonSets only).
Interview Answer: "emptyDir is Pod-lifetime temp storage for scratch space and inter-container sharing. hostPath mounts host filesystem — security risk, only for DaemonSets. Neither persists beyond Pod life."

Scheduling & Scaling

How K8s places Pods, and how it auto-scales workloads and infrastructure.

Resource Requests & Limits

What is itEvery container can declare resource requests (minimum guaranteed CPU/memory) and limits (maximum allowed). Requests are used by the scheduler for bin-packing: a node must have enough unreserved CPU and memory to cover the pod's requests. Limits are enforced at runtime by the kernel via cgroups: CPU above the limit is throttled, memory above the limit triggers an OOM kill. Requests and limits determine three Quality of Service (QoS) classes: Guaranteed (every container sets CPU and memory requests equal to limits), Burstable (at least one request or limit is set, but the pod does not qualify as Guaranteed), BestEffort (no requests or limits). QoS determines eviction priority under node pressure.
Key features
  • CPU units: 1 CPU = 1 vCPU/core. Fractional: 500m = 0.5 CPU.
  • Memory units: Bytes, or suffixes Ki/Mi/Gi (binary) or K/M/G (decimal).
  • CPU limits cause throttling: Never killed for CPU, just throttled in cgroup CPU quota.
  • Memory limits cause OOM kill: Kernel OOM killer terminates the container if it exceeds its memory limit.
  • LimitRange: Per-namespace object that sets default requests/limits and min/max bounds.
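
A sketch of a LimitRange that gives every container in a namespace sane defaults. The namespace and the specific numbers are illustrative assumptions, not recommendations.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments         # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container sets no limits
      memory: 256Mi
    max:                      # containers asking for more are rejected at admission
      memory: 1Gi
```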
How it differs
  • vs VM sizing: VMs have fixed sizes; K8s requests/limits are per-container and flexible across shared nodes.
  • vs Docker --memory: K8s requests have no direct Docker equivalent — they're scheduler hints.
  • vs Nomad resources: Nomad has a similar requests/limits model.
Why use itWithout requests, the scheduler can't bin-pack intelligently, and noisy neighbors starve each other. Without limits, a runaway pod can take down the whole node. Good resource tuning is the #1 lever for cost optimization, reliability, and multi-tenancy fairness.
Common gotchasCPU limits cause throttling that hurts latency, so many production playbooks recommend omitting CPU limits entirely. Setting memory limits too low causes OOM kills under load; setting memory requests too low overcommits the node and invites eviction under memory pressure. Over-requesting wastes node capacity (low density → high cost). Java/JVM apps need careful tuning because the JVM heap must fit inside the container memory limit with headroom for off-heap memory. Leaving requests unset (BestEffort) is dangerous in production.
Real-world examplesNetflix, Adobe, Zalando publish detailed writeups about their CPU limit removal experiments. Tools like Vertical Pod Autoscaler, Goldilocks, Kubecost, and Robusta automatically recommend right-sized requests. Google Borg (K8s's ancestor) heavily influenced the requests/limits model.
Simple Definition: Requests = guaranteed minimum (scheduler uses for placement). Limits = max allowed (CPU throttled, memory OOMKilled).

QoS Classes: Guaranteed (requests==limits for CPU and memory in every container, highest priority), Burstable (some requests or limits set, but not Guaranteed), BestEffort (none set, first evicted).

Units: CPU: 1=1 core, 100m=0.1 core. Memory: 128Mi, 1Gi.
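
Put together, a container's resources block might look like the sketch below. The pod name, image, and values are placeholders; it follows the common pattern of setting a memory limit but no CPU limit.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api                                 # hypothetical name
spec:
  containers:
  - name: api
    image: registry.example.com/api:1.0     # placeholder image
    resources:
      requests:
        cpu: 250m                           # a quarter of a core reserved for scheduling
        memory: 256Mi
      limits:
        memory: 256Mi                       # exceeding this gets the container OOM-killed
        # no CPU limit, so no throttling; the pod's QoS class is Burstable
```

Adding a CPU limit equal to the CPU request would make this pod Guaranteed.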

Tip: Many teams remove CPU limits to avoid CFS throttling. Always set memory limits to prevent node-level OOM.
Interview Answer: "Requests=scheduler input (guaranteed min). Limits=max: CPU throttles, memory OOMKills. QoS: Guaranteed (requests==limits), Burstable, BestEffort. Many teams omit CPU limits to avoid throttling but always set memory limits. VPA helps right-size."

Node Affinity, Pod Affinity & Anti-Affinity

What is itAffinity rules steer pod placement beyond basic scheduler choices. Node Affinity attracts pods to nodes with certain labels (e.g., gpu=nvidia-a100, zone=us-east-1a). Pod Affinity attracts pods to nodes where other pods (matching a label selector) already run — useful for co-location to reduce latency. Pod Anti-Affinity does the opposite — spread replicas across nodes, racks, or zones to survive failures. All three come in two flavors: requiredDuringSchedulingIgnoredDuringExecution (hard rule) and preferredDuringSchedulingIgnoredDuringExecution (soft preference with weight).
Key features
  • Topology key: Defines the "domain" of spreading — kubernetes.io/hostname (per node), topology.kubernetes.io/zone (per AZ), topology.kubernetes.io/region.
  • Label expressions: Set-based selectors (In, NotIn, Exists, DoesNotExist); node affinity additionally supports Gt and Lt for numeric label values.
  • TopologySpreadConstraints: Newer, higher-level API that achieves spread with simpler semantics than anti-affinity.
How it differs
  • vs nodeSelector: nodeSelector is simple equality; affinity supports richer expressions and soft preferences.
  • vs taints/tolerations: Taints repel pods (nodes say "stay away"); affinity attracts them (pods say "I want this node"). Use them together.
  • vs TopologySpreadConstraints: TSC is more concise for the common "spread my replicas across zones" case.
Why use itEssential for high availability (spread replicas across zones), hardware targeting (GPU, NVMe, specific CPU arch), data locality (co-locate cache with app), and isolation (dedicated nodes for noisy workloads).
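Common gotchasHard affinity leaves pods Pending if no matching node exists, so prefer soft rules unless placement is a strict requirement. Pod affinity and anti-affinity are expensive to compute, and the Kubernetes documentation advises against them in clusters larger than several hundred nodes. Anti-affinity with too-tight constraints (for example, one replica per node with more replicas than nodes) leaves pods pending. Affinity rules don't rebalance: as the IgnoredDuringExecution suffix implies, new pods follow them but existing pods don't migrate when rules or labels change.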
Real-world examplesCassandra/Elasticsearch deployments use anti-affinity to spread nodes across racks. GPU ML workloads use node affinity for GPU nodes. Spark drivers and executors use affinity to co-locate for low-latency shuffles. Istio control plane uses anti-affinity so a single node failure doesn't take down the mesh control plane.
Simple Definition: Node Affinity = which nodes a Pod can run on. Pod Anti-Affinity = spread replicas apart. Pod Affinity = co-locate for performance. Topology Spread = distribute evenly across zones.

Hard vs Soft: requiredDuring... (must) vs preferredDuring... (try).
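Example: a sketch that combines a hard node affinity rule, a soft pod anti-affinity rule, and a topology spread constraint in one Deployment. The app label, image, and the choice of kubernetes.io/arch as the node label are assumptions made for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:    # hard: must match
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft: try to honor
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname            # one replica per node if possible
              labelSelector:
                matchLabels: { app: web }
      topologySpreadConstraints:
      - maxSkew: 1                                           # zones may differ by at most 1 replica
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway                    # soft; DoNotSchedule makes it hard
        labelSelector:
          matchLabels: { app: web }
      containers:
      - name: web
        image: registry.example.com/web:1.0.0
```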

Interview Answer: "Node Affinity uses node labels for placement (required=hard, preferred=soft). Pod Anti-Affinity spreads replicas across nodes/zones for HA. Pod Affinity co-locates for low latency. Topology Spread Constraints distribute evenly across zones with maxSkew."

Taints & Tolerations

What is itTaints are key-value markers on nodes that repel pods that don't tolerate them. A taint has a key, optional value, and an effect: NoSchedule (new pods can't land), PreferNoSchedule (scheduler tries to avoid), NoExecute (evict existing pods that don't tolerate). Tolerations are the counterpart on pods: a pod with a matching toleration can be scheduled onto a tainted node. This is the opposite model of affinity — affinity attracts, taints repel. Often combined: taint a node to dedicate it, tolerate on the specific workload that should use it.
Key features
  • Node taints set via kubectl taint: kubectl taint nodes gpu-1 dedicated=gpu:NoSchedule.
  • Automatic taints: Control plane taints like node-role.kubernetes.io/control-plane:NoSchedule, or dynamic node.kubernetes.io/not-ready, node.kubernetes.io/unreachable.
  • tolerationSeconds: How long a pod stays bound to a node after a matching NoExecute taint is added before it is evicted.
  • Taint-based eviction: When a node goes unreachable, the node lifecycle controller taints it, and pods are evicted after tolerationSeconds.
How it differs
  • vs Node Affinity: Affinity is pull (pods choose nodes); taints are push (nodes filter pods). Best combined.
  • vs nodeSelector: nodeSelector is a simple "must match" — doesn't exclude other pods from the node. Taints actively exclude.
Why use itUse taints to dedicate nodes to specific workloads (GPU, memory-optimized, spot instances, PCI-compliant), to drain nodes gracefully for maintenance, and to evict pods from failed nodes. DaemonSets typically tolerate common taints so they run everywhere.
Common gotchasA taint alone doesn't attract matching pods: a toleration only permits scheduling, so you still need nodeSelector or affinity to make sure the right pods actually land on the dedicated nodes. Over-tainting leaves nodes idle. Forgetting tolerations on DaemonSets means they skip tainted nodes. By default, pods receive a 300s toleration for the not-ready and unreachable taints, so recovery from a dead node takes about five minutes; lower tolerationSeconds on workloads that need faster failover.
Real-world examplesSpot instance node pools are tainted spot=true:NoSchedule so only interruption-tolerant workloads land there. GPU nodes use nvidia.com/gpu:NoSchedule with the NVIDIA device plugin. Control plane nodes in managed clusters are tainted to prevent user workloads. Dedicated team nodes use taints for hard chargeback boundaries.
Simple Definition: Taints repel Pods from nodes. Tolerations let specific Pods ignore taints. Used to dedicate nodes (GPU, Spot).

Effects: NoSchedule, PreferNoSchedule, NoExecute (evict existing).
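Example: a sketch of the "taint to dedicate, tolerate plus select to land" pattern. It assumes the node was tainted with the kubectl taint command shown earlier and was also given a matching dedicated=gpu label; that label, the pod name, and the image are hypothetical.

```yaml
# Node side (assumed):
#   kubectl taint nodes gpu-1 dedicated=gpu:NoSchedule
#   kubectl label nodes gpu-1 dedicated=gpu
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  nodeSelector:
    dedicated: gpu                 # attracts the pod to the dedicated nodes
  tolerations:
  - key: dedicated                 # permits scheduling onto the tainted nodes
    operator: Equal
    value: gpu
    effect: NoSchedule
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30          # evict after 30s instead of the default 300s
  containers:
  - name: trainer
    image: registry.example.com/trainer:1.0.0
```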

Interview Answer: "Taints repel Pods; Tolerations let specific Pods ignore taints. Effects: NoSchedule, PreferNoSchedule, NoExecute. Use case: dedicate GPU nodes for ML, Spot nodes for batch jobs. K8s auto-taints unhealthy nodes."

HPA (Horizontal Pod Autoscaler)

What is itThe Horizontal Pod Autoscaler automatically scales the number of replicas in a Deployment, StatefulSet, or ReplicaSet based on observed metrics — CPU utilization, memory, or custom/external metrics. Every 15 seconds (default --horizontal-pod-autoscaler-sync-period), the HPA controller queries the metrics server (or a custom metrics API), compares current values against the target, and adjusts the replica count. The current version is autoscaling/v2, which supports multi-metric scaling and scaling behavior policies.
Key features
  • Metric types: Resource (CPU/mem), Pods (per-pod custom metric), Object (metric from another object like Ingress RPS), External (SQS depth, Kafka lag).
  • Target types: Utilization (% of request), AverageValue, Value.
  • Scale behavior (v2): Separate scale-up and scale-down policies with stabilization windows to prevent flapping.
  • Scale to zero: Not supported by HPA directly — use KEDA for that.
How it differs
  • vs VPA: HPA adds replicas; VPA resizes the individual pod's CPU/memory. Orthogonal, though overlapping use is tricky.
  • vs KEDA: KEDA extends HPA with event-driven sources (Kafka, RabbitMQ, Azure Service Bus, Prometheus, cron) and scale-to-zero.
  • vs Cluster Autoscaler: HPA scales pods; CA scales nodes. They compose: HPA adds pods, which triggers CA to add nodes.
Why use itHPA handles traffic spikes automatically without overprovisioning. You define a target (e.g., "70% CPU"), and HPA keeps the fleet right-sized. Essential for cost efficiency on bursty workloads — web, APIs, anything seasonal.
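Common gotchasResource-metric scaling requires metrics-server (not all clusters ship it); custom and external metrics need an adapter such as Prometheus Adapter or KEDA. Pods must have resource requests or utilization-based scaling won't work. Scaling decisions are made on metrics averaged across pods, so a few noisy pods can skew the result and cause flapping. Java apps with slow JVM warmup take minutes to absorb load after scale-up. The scale-down stabilization window defaults to 300s (configurable via behavior.scaleDown.stabilizationWindowSeconds), so plan for that delay before replicas are removed.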
Real-world examplesShopify runs HPA on every Rails service. Expedia combines HPA with KEDA for event-driven scaling. Zalando uses custom metrics from Prometheus (requests per second) as scaling signals. Snapchat scales to millions of pods across regions via HPA + Cluster Autoscaler.
Simple Definition: Automatically scales Pod replicas based on CPU, memory, or custom metrics.

v2 supports multiple metrics. behavior section controls scaling speed (scale up fast, down slow to prevent flapping). Requires Metrics Server.
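Example: a sketch of an autoscaling/v2 HPA with a CPU utilization target and a behavior section. The controller computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), so 4 replicas averaging 90% CPU against a 70% target become ceil(4 × 90 / 70) = 6. The Deployment name and all thresholds are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # percent of each pod's CPU request
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately
      policies:
      - type: Percent
        value: 100                    # at most double the replicas per 60s
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # the default, shown explicitly
      policies:
      - type: Pods
        value: 2                      # remove at most 2 pods per 60s
        periodSeconds: 60
```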

Interview Answer: "HPA scales replicas on metrics: CPU, memory, custom (Prometheus Adapter). v2 supports multiple metrics. Behavior section controls speed — fast up, slow down. Needs Metrics Server. Can't share CPU metric with VPA."

VPA, KEDA & Cluster Autoscaler / Karpenter

What is itThe autoscaling ecosystem extends HPA with specialized tools. VPA (Vertical Pod Autoscaler) right-sizes individual pods by adjusting their CPU/memory requests based on historical usage. KEDA (Kubernetes Event-Driven Autoscaling) is a CNCF graduated project that drives HPA using event sources (Kafka lag, SQS depth, Redis queue length, cron, Prometheus queries) and supports scale-to-zero. Cluster Autoscaler adds/removes nodes from cloud node pools when pods are Pending or nodes are underused. Karpenter (created by AWS, now maintained under Kubernetes SIG Autoscaling) is a node provisioner that bypasses cloud-provider node groups and picks instance types on the fly to fit pending pods.
Key features
  • VPA modes: Off (recommend only), Initial (set requests at pod creation only), Recreate/Auto (evict and recreate pods with new sizes).
  • KEDA ScaledObjects: Declare event source, metric, thresholds; KEDA creates/manages an HPA for you.
  • KEDA scale-to-zero: Idle deployments drop to 0 pods and wake on first event.
  • Cluster Autoscaler: Node-pool-based, conservative, requires pre-defined instance types.
  • Karpenter: Provisions raw EC2 instances matched to pending pod specs; consolidates workloads for efficiency.
How it differs
  • VPA vs HPA: VPA resizes, HPA multiplies. VPA shouldn't be combined with HPA on CPU/mem.
  • KEDA vs HPA: KEDA is a superset — handles sources HPA can't, like Kafka/SQS backlog, custom Prometheus queries.
  • Karpenter vs Cluster Autoscaler: Karpenter calls the cloud API directly instead of resizing a node group, so nodes typically arrive faster; it also picks an instance type per batch of pending pods and consolidates underused nodes aggressively, which often improves cost efficiency.
Why use itComplete autoscaling covers three axes: how many pods (HPA/KEDA), how big each pod (VPA), and how many nodes (CA/Karpenter). Teams that master all three often cut cloud spend 40-60%.
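Common gotchasVPA Auto mode recreates pods, which disrupts service, so pair it with PDBs. KEDA's scale-to-zero has a cold-start penalty. Cluster Autoscaler won't remove a node that hosts pods it can't evict, such as pods blocked by a restrictive PDB, bare pods with no controller, or pods annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false". Karpenter requires careful spot instance handling to avoid cascading interruptions.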
Real-world examplesAWS built Karpenter to address node-provisioning bottlenecks in large clusters; used by Snap, Intuit, Adobe. KEDA was created by Microsoft and Red Hat and is used heavily for Azure Functions on Kubernetes. Pinterest uses VPA for memory right-sizing in their Python services. Zalando publishes detailed VPA war stories.
Simple Definition: VPA = right-size resource requests. KEDA = scale on external events + scale to zero. Karpenter = smart node provisioning replacing Cluster Autoscaler.
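# KEDA: Scale Kafka consumer to zero
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler                    # illustrative name
spec:
  scaleTargetRef: { name: kafka-consumer }       # Deployment in the same namespace
  minReplicaCount: 0                             # scale to zero when there is no lag
  maxReplicaCount: 50
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.svc:9092   # placeholder broker address
      consumerGroup: orders-consumer             # placeholder consumer group
      topic: orders
      lagThreshold: "100"                        # target lag per replica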
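Example: a sketch of a VPA in recommendation-only mode, restricted to memory so that it could coexist with an HPA that scales on CPU. It assumes the VPA components are installed in the cluster; the target name and bounds are placeholders.

```yaml
# VPA: recommend memory sizes without evicting pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"                  # recommend only; inspect with kubectl describe vpa api-vpa
  resourcePolicy:
    containerPolicies:
    - containerName: "*"               # apply to every container in the pod
      controlledResources: ["memory"]  # leave CPU alone
      minAllowed:
        memory: 128Mi
      maxAllowed:
        memory: 2Gi
```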
Interview Answer: "VPA right-sizes requests (recommender mode=safe). KEDA scales on 60+ event sources (Kafka lag, SQS, Prometheus) with scale-to-zero. Karpenter replaces Cluster Autoscaler: faster provisioning, auto instance-type selection, Spot handling, node consolidation."

PDB (Pod Disruption Budget)

What is itA PodDisruptionBudget tells Kubernetes how many pods of a given workload can be voluntarily evicted at the same time. "Voluntary disruption" means admin actions: kubectl drain for node upgrades, cluster autoscaler scale-down, Karpenter consolidation. Involuntary disruptions (node crash, hardware failure) are NOT governed by PDBs. A PDB specifies minAvailable or maxUnavailable as a count or percentage, and the eviction API will block drains that would violate it.
Key features
  • minAvailable: "At least N pods (or X%) must remain running."
  • maxUnavailable: "At most N pods (or X%) may be unavailable."
  • Selector: Targets pods via label selector — must match the deployment's pods.
  • Eviction API: PDBs only enforce against the eviction API, not direct pod deletes.
How it differs
  • vs replicas: replicas sets the desired steady-state count; a PDB guarantees a floor during voluntary disruptions.
  • vs HPA: Independent — HPA adjusts replicas, PDB protects them during drains.
  • vs anti-affinity: Anti-affinity spreads pods for availability; PDB prevents too many being drained simultaneously.
Why use itPDBs are the contract between app owners and cluster operators — "you can upgrade nodes, but please keep at least this many of my pods alive." Essential for stateful workloads (quorum databases), high-availability services, and any workload that can't afford full outage during node churn.
Common gotchasA PDB with minAvailable: 100% (or maxUnavailable: 0) blocks drains entirely and stalls cluster upgrades. PDBs don't prevent node crashes; for that you need multi-zone replication. Stuck drains due to PDBs are a common incident. A single-replica Deployment with minAvailable: 1 can never be evicted, so kubectl drain hangs until someone intervenes.
Real-world examplesEvery etcd, Zookeeper, Kafka, Cassandra deployment needs a PDB to preserve quorum. Cluster operators use kubectl get pdb -A before major upgrades to identify risky workloads. Lyft's incident reports mention PDB-related drain failures as routine ops hazards.
Simple Definition: Limits how many Pods can be down during voluntary disruptions (node drains, upgrades). Specifies minAvailable or maxUnavailable.

Applies to voluntary disruptions (drain, autoscaler), NOT involuntary (node crash, OOM).
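Example: a minimal PDB sketch. The app: web label is an assumption and must match the pod labels of the workload being protected.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 1        # drains may evict one matching pod at a time
  selector:
    matchLabels:
      app: web             # must match the Deployment's pod template labels
```

maxUnavailable tends to be the safer choice for scaled workloads because it keeps allowing evictions as the replica count changes, whereas a fixed minAvailable can silently become unsatisfiable after a scale-down.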

Interview Answer: "PDB sets minimum available or maximum unavailable Pods during voluntary disruptions. Protects against all-replicas-down during node drains and upgrades. Every production Deployment should have one."

Security

RBAC, Pod Security, admission control, policy engines, secrets management, and runtime protection.

RBAC (Role-Based Access Control)

What is itRBAC is Kubernetes's built-in authorization system that controls who can do what on which resources. It is based on four object types: Role (a namespaced set of permissions: verbs × resources), ClusterRole (cluster-scoped version), RoleBinding (attach a Role to a user/group/ServiceAccount in a namespace), ClusterRoleBinding (attach a ClusterRole across the cluster). When a request hits the API server, it authenticates the identity, then checks RBAC to see if the subject is allowed to perform the verb (get, list, create, update, patch, delete, watch) on the resource.
Key features
  • Fine-grained verbs: Control individual operations, not just read/write.
  • Resource names: Restrict actions to specific named resources (e.g., only the prod-db secret).
  • Aggregation: ClusterRoles can aggregate from labeled sub-roles for modular permissions.
  • Subresources: Control sub-APIs like pods/exec, pods/portforward, deployments/scale.
  • ServiceAccount binding: Pod service accounts get RBAC via bindings — enabling controllers to call the API with minimum privileges.
How it differs
  • vs ABAC (legacy): ABAC reads policies from a file on the API server and needs a restart to change them; RBAC uses API objects that can be managed declaratively and updated live.
  • vs cloud IAM: Cloud IAM controls cloud resources; RBAC controls K8s API. Managed K8s (EKS/GKE/AKS) bridges the two.
  • vs OPA/Gatekeeper: RBAC decides "can you do this?"; Gatekeeper decides "is this object valid?"
Why use itRBAC enforces least privilege — developers can only touch their team's namespaces, CI can only deploy to staging, controllers only watch what they need. Essential for security audits, compliance, and multi-tenant safety.
Common gotchasOverly broad ClusterRoleBindings (using the built-in cluster-admin role) are the single most common security mistake. The system:masters group bypasses RBAC entirely. Permissions needed for CRDs are often forgotten — a new CRD requires new ClusterRoles. Debugging "forbidden" errors means running kubectl auth can-i and kubectl-who-can.
Real-world examplesGKE lets RBAC bindings reference Google accounts and Google Groups, while Workload Identity maps Kubernetes ServiceAccounts to Google IAM service accounts for cloud API access. AWS EKS uses the aws-auth ConfigMap or EKS access entries to map IAM principals to RBAC. Tools like rakkess, rbac-lookup, and kubectl-who-can audit permissions. Every K8s security review starts with an RBAC audit.
Simple Definition: Controls who can do what. Role (namespace permissions) + ClusterRole (cluster-wide) bound to users/groups/ServiceAccounts via RoleBinding/ClusterRoleBinding.
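apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
- apiGroups: [""]                        # "" is the core API group
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-reader
  namespace: dev
subjects:
- kind: Group
  name: dev-team
  apiGroup: rbac.authorization.k8s.io    # required for User and Group subjects
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io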

Test: kubectl auth can-i create pods --namespace dev
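Example: a sketch of the resource-name and subresource features listed under Key features, bound to a ServiceAccount. The prod namespace, the deployer account, and the role names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: scoped-deployer
  namespace: prod
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["prod-db"]             # only this one Secret
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments/scale"]       # subresource: scale only, no spec edits
  verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: scoped-deployer-binding
  namespace: prod
subjects:
- kind: ServiceAccount                   # ServiceAccount subjects take a namespace, not an apiGroup
  name: deployer
  namespace: prod
roleRef:
  kind: Role
  name: scoped-deployer
  apiGroup: rbac.authorization.k8s.io
# Check: kubectl auth can-i get secrets/prod-db -n prod --as system:serviceaccount:prod:deployer
```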

Interview Answer: "RBAC: Role (namespace) + ClusterRole (global), bound via RoleBinding/ClusterRoleBinding to Users, Groups, or ServiceAccounts. Best practice: least privilege, namespace-scoped Roles, dedicated ServiceAccounts per workload, audit with 'kubectl auth can-i'."

SecurityContext & Pod Security Standards

What is itSecurityContext defines privilege and access control settings for a pod or individual container: which UID/GID to run as, whether to allow privilege escalation, which Linux capabilities to drop/add, whether the filesystem is read-only, SELinux/AppArmor/seccomp profiles, and more. Pod Security Standards (PSS) are three predefined policy levels: Privileged (no restrictions), Baseline (blocks known privilege escalations such as privileged containers, host namespaces, and hostPath volumes), and Restricted (hardened: non-root, no privilege escalation, all capabilities dropped, seccomp profile required). They are enforced by the Pod Security Admission controller, which went stable in 1.25, the same release that removed PodSecurityPolicy (deprecated since 1.21).
Key features
  • runAsNonRoot + runAsUser: Force non-root execution.
  • readOnlyRootFilesystem: App can't write to its own rootfs — defeats many exploits.
  • capabilities: Drop all Linux caps by default, add only the ones you need (e.g., NET_BIND_SERVICE).
  • seccomp profiles: Restrict which syscalls the container can make via seccompProfile: RuntimeDefault or custom.
  • AppArmor/SELinux: Mandatory access control for stronger isolation.
How it differs
  • vs PodSecurityPolicy (removed): PSS is simpler, enforced via admission labels on namespaces rather than PSP objects.
  • vs OPA/Kyverno: PSS covers the common cases; policy engines handle custom rules.
  • vs Docker --privileged: SecurityContext is more granular, with individual switches per capability.
Why use itDefault container execution is dangerously permissive — runs as root, with many capabilities, writable rootfs. SecurityContext hardens each workload. Pod Security Standards enforce hardening cluster-wide with a simple namespace label like pod-security.kubernetes.io/enforce: restricted.
Common gotchasLegacy images often won't start as non-root (they need a rebuild or an init script). Read-only rootfs breaks apps that write to /tmp; mount an emptyDir for writable scratch space. seccomp profiles can block syscalls your app needs, causing mysterious failures. Pod Security Admission only validates: it rejects or warns about non-compliant pods but never mutates them into compliance. Migration from PSPs to PSS requires planning.
Real-world examplesGoogle requires restricted PSS on internal GKE clusters. Red Hat OpenShift uses SCCs (Security Context Constraints), a stricter ancestor to PSS. Kubernetes hardening guides from the NSA and CISA strongly recommend SecurityContext hygiene. Compliance frameworks (SOC2, PCI-DSS) expect non-root, read-only, dropped capabilities.
Simple Definition: SecurityContext sets container security (runAsNonRoot, readOnlyRootFilesystem, drop capabilities). PSS defines three levels: Privileged, Baseline, Restricted.

Pod Security Admission enforces PSS per namespace via labels: enforce, audit, warn.
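Example: a sketch of a namespace that enforces the Restricted level, with a pod hardened to pass it. The namespace, pod name, image, and UID are placeholders.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # also warn the client
    pod-security.kubernetes.io/audit: restricted     # also annotate audit events
---
apiVersion: v1
kind: Pod
metadata:
  name: api
  namespace: payments
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: api
    image: registry.example.com/api:1.0.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - name: tmp
      mountPath: /tmp                                # writable scratch despite read-only rootfs
  volumes:
  - name: tmp
    emptyDir: {}
```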

Interview Answer: "SecurityContext: runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities, allowPrivilegeEscalation=false. PSS levels: Privileged (system), Baseline (minimum), Restricted (target for apps). Enforced via namespace labels."

Admission Controllers & Webhooks

What is itAdmission controllers are gatekeeper hooks in the API server request pipeline. After authentication and RBAC authorization, every write request (create, update, delete) passes through admission: a chain of mutating admission controllers (which can modify the object) followed by validating admission controllers (which can only accept or reject). Some are compiled into the API server (NamespaceLifecycle, ResourceQuota, LimitRanger, DefaultStorageClass). Others are external via webhooks: MutatingWebhookConfiguration and ValidatingWebhookConfiguration objects point to HTTPS endpoints that implement custom logic. This is how cert-manager, the Istio and Linkerd sidecar injectors, OPA Gatekeeper, and Kyverno plug in.
Key features
  • Mutating vs validating: Mutating can change the object (inject sidecars, set defaults); validating can only reject.
  • Order: All mutators run first, then all validators — so validation sees the final merged object.
  • Dynamic registration: Webhook configs are K8s objects — install a chart, register a webhook at runtime.
  • failurePolicy: Fail rejects matching requests when the webhook is unreachable or times out; Ignore lets them through unchecked.
  • objectSelector / namespaceSelector: Limit which objects trigger the webhook.
How it differs
  • vs RBAC: RBAC is yes/no on verbs; admission inspects the actual object content and can modify or reject based on arbitrary logic.
  • vs CRD validation schemas: Schemas (OpenAPI) catch simple type errors; admission handles complex multi-field or cross-object rules.
  • vs audit hooks: Audit records what happened; admission prevents what shouldn't happen.
Why use itAdmission is the policy enforcement point for Kubernetes — the only place to inject sidecars, set cluster-wide defaults, enforce naming conventions, require labels, block privileged pods, or mandate a specific image registry. Every serious cluster runs several webhooks.
Common gotchasA broken webhook with failurePolicy: Fail blocks every API request it matches, which can effectively freeze the cluster (imagine Istio's injector webhook crashing during a cluster upgrade, so no new pods can be created). Mutating webhooks that conflict cause non-deterministic results. Webhook latency adds to every matching API request. Webhooks should exclude kube-system (via namespaceSelector), or a webhook outage can stop the control plane's own pods from being recreated.
Real-world examplesIstio uses a mutating webhook to inject Envoy sidecars. cert-manager validates Certificate and Issuer objects. Kyverno and OPA Gatekeeper run as webhooks to enforce org policies. Vault Agent Injector mutates pods with secret-fetching sidecars. Most K8s platforms have 5-15 webhooks.
Simple Definition: Plugins intercepting API requests before storage. Mutating webhooks modify requests (inject sidecars). Validating webhooks accept/reject (enforce policies).
API Server Request Flow: API Request → Authentication → Authorization (RBAC) → Mutating Admission → Schema Validation → Validating Admission → Persist to etcd
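Example: a sketch of registering a validating webhook with the safety settings discussed above (explicit failurePolicy, a short timeout, and a namespaceSelector that skips system namespaces). The webhook name, the policy namespace, and the label-validator Service are hypothetical; the Service must serve HTTPS with a certificate the API server trusts.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: require-team-label
webhooks:
- name: require-team-label.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail                 # block matching writes if the webhook is down
  timeoutSeconds: 5                   # keep added API latency bounded
  clientConfig:
    service:
      namespace: policy
      name: label-validator
      path: /validate
      port: 443
    # caBundle: base64 CA certificate, often injected by a tool such as cert-manager
  rules:
  - apiGroups: ["apps"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["deployments"]
  namespaceSelector:                  # never gate system namespaces or the webhook's own
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "policy"]
```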
Interview Answer: "Admission controllers intercept after auth, before persist. Mutating runs first (inject sidecars, add labels), then Validating (enforce policies, reject violations). Policy engines (Kyverno, OPA) work via these webhooks."

OPA/Gatekeeper & Kyverno

What is itPolicy-as-code tools enforce cluster-wide rules. Open Policy Agent (OPA) is a CNCF graduated, general-purpose policy engine using the declarative Rego language. Gatekeeper is its Kubernetes adaptation: it runs as a validating/mutating admission webhook and enforces ConstraintTemplates (policy definitions) and Constraints (instances). Kyverno is a newer, Kubernetes-native CNCF policy engine that expresses policies in YAML instead of Rego, which is easier for K8s users who don't want to learn a new language. Kyverno can validate, mutate, generate, and clean up resources.
Key features
  • Validation: Block deployments from the latest tag, enforce required labels, forbid privileged pods, require resource limits.
  • Mutation: Auto-add sidecars, inject default labels/annotations, set security context fields.
  • Audit mode: Report violations without blocking — useful for rolling out new policies.
  • Policy libraries: Community-maintained policies (pod-security-standards, supply chain, ingress hardening).
  • Kyverno-only: resource generation — create default ConfigMaps/NetworkPolicies when a namespace is created.
How it differs
  • OPA/Gatekeeper vs Kyverno: OPA is more powerful (arbitrary Rego, non-K8s use cases) but has a steeper learning curve; Kyverno is simpler, YAML-based, and K8s-only.
  • vs Pod Security Admission: PSA handles the common PSS rules; policy engines handle everything beyond that.
  • vs validating webhooks you write: Policy engines save you from building and maintaining custom webhooks.
Why use itPolicy engines codify organizational guardrails — "every pod must have a cost-center label," "no images from Docker Hub," "all ingresses must use HTTPS" — and enforce them on every write. Essential for compliance, security, and platform engineering at scale.
Common gotchasRego is hard to learn and debug. Overly strict policies block legitimate work and cause friction. Policies that require cluster state (list all existing resources) are slow and must use caching (OPA's sync). Policy drift between clusters is hard to audit without GitOps.
Real-world examplesCapital One, Visa, PayPal use OPA for compliance. Nirmata (creators of Kyverno) and Nvidia use Kyverno at scale. Styra offers a commercial OPA management plane. The CNCF security TAG published guidance recommending one of these tools for any production K8s cluster.
Simple Definition: Policy engines that auto-enforce rules. Kyverno = YAML policies (simple). OPA/Gatekeeper = Rego language (powerful but complex).
# Kyverno: Block :latest tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-tag
    match:
      any:
      - resources: { kinds: [Pod] }
    validate:
      message: "Using ':latest' is not allowed."
      pattern:
        spec:
          containers:
          - image: "!*:latest"
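For comparison, a sketch of the Gatekeeper equivalent of a "required labels" rule: a ConstraintTemplate carrying the Rego, plus a Constraint that instantiates it. It follows the shape of the required-labels example in the Gatekeeper documentation; the constraint name and the cost-center label are illustrative.

```yaml
# Gatekeeper: require a cost-center label on namespaces
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8srequiredlabels

      violation[{"msg": msg}] {
        provided := {label | input.review.object.metadata.labels[label]}
        required := {label | label := input.parameters.labels[_]}
        missing := required - provided
        count(missing) > 0
        msg := sprintf("missing required labels: %v", [missing])
      }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-cost-center
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["cost-center"]
```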
Interview Answer: "Kyverno: K8s-native YAML policies, can validate, mutate, generate. OPA/Gatekeeper: Rego language, more powerful. Common policies: require labels, block latest tag, enforce resource limits, restrict registries, mandate security contexts."

Image Security & Runtime Protection

What is itContainer security spans two phases: image security (what's inside the image: CVEs, supply-chain provenance, signatures) and runtime security (what the container does at execution: syscalls, file writes, network connections, privilege escalation attempts). Tools like Trivy, Grype, Clair, Snyk Container scan images for known vulnerabilities. Cosign, part of the Sigstore project, signs and verifies images. Falco, Tetragon, Aqua Enforcer, and Sysdig Secure monitor runtime behavior and detect suspicious activity.
Key features
  • SBOM (Software Bill of Materials): Machine-readable list of every package in an image, produced by tools like Syft.
  • SLSA provenance: Cryptographic attestation of how an image was built (build pipeline, source commit, build platform).
  • Image signing: Cosign uses keyless signing via Fulcio/Rekor (OIDC identity) — no long-lived keys to manage.
  • Admission enforcement: Kyverno/Gatekeeper can require signed images from approved registries.
  • Runtime anomaly detection: Falco fires alerts on syscall-level rules (shell in container, unexpected file writes, crypto miners).
How it differs
  • vs traditional VM antivirus: Containers are immutable — the approach is "verify at build time, detect at runtime" rather than continuous file scanning.
  • Falco vs Tetragon: Falco uses a kernel module or eBPF; Tetragon is pure eBPF and integrates with Cilium for network + syscall observability.
  • vs network-based IDS: Runtime container security sees inside the pod; network IDS only sees traffic on the wire.
Why use itSoftware supply chain attacks (SolarWinds, Codecov, xz-utils) make image provenance non-optional. Container escapes and kernel CVEs make runtime detection essential. Many compliance regimes in regulated industries expect both image scanning and runtime monitoring.
Common gotchasScanners produce noisy CVE lists that must be triaged, since not every CVE is exploitable in context. Runtime rules generate false positives that drown on-call engineers. Base images (Alpine, Debian slim) still ship with many package CVEs. Distroless images dramatically reduce attack surface but complicate debugging because they ship no shell; kubectl debug with an ephemeral container is the usual workaround.
Real-world examplesGoogle Distroless images remove shells and package managers from runtime images. Chainguard Images provide continuously rebuilt, near-zero-CVE images. GitHub uses Cosign for signing Actions runner images. Shopify, Adobe, Intuit run Falco in production. Sigstore is a cross-industry initiative backed by Google, Red Hat, and the Linux Foundation.
Simple Definition: Scanning (Trivy for CVEs), signing (Cosign for provenance), SBOM generation, and runtime monitoring (Falco for anomalous behavior).
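Example: a sketch of a custom Falco rule for the "shell in container" case mentioned above. It assumes Falco's default ruleset is loaded first, since the spawned_process and container macros come from there; the rule name, shell list, and priority are illustrative choices.

```yaml
# Falco: alert when a shell starts inside a container (custom rules file)
- rule: Shell spawned in container (sketch)
  desc: Detect a shell process starting inside any container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh)
  output: >
    Shell started in container
    (user=%user.name container=%container.name
    image=%container.image.repository cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Expect to add exceptions for legitimate cases, such as images whose entrypoint is a shell script, before routing this alert to on-call.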
Interview Answer: "Defense-in-depth: scan in CI (Trivy blocks critical CVEs), sign images (Cosign/Sigstore), verify at admission (Kyverno checks signatures), monitor runtime (Falco detects unexpected processes/network). SBOM for regulatory compliance."

Observability

Metrics, logging, tracing, probes, alerting, and the three pillars of understanding your systems.

Prometheus & Grafana

What is itPrometheus is the de facto monitoring system for Kubernetes: a CNCF graduated project with a pull-based model, time-series database, and its own query language (PromQL). It scrapes metrics from HTTP /metrics endpoints exposed by every component (kubelet, node-exporter, apps) at a regular interval, stores them in a local TSDB, and evaluates alerting rules, handing firing alerts to Alertmanager for routing. Grafana is the visualization layer: dashboards, alerts, and multi-data-source federation. Together they are the universal pair for cloud-native observability. The Prometheus Operator and kube-prometheus-stack Helm chart deploy them + exporters + default dashboards with one command.
Key features
  • Service discovery: Auto-discover scrape targets via K8s API (pod, service, endpoint annotations).
  • PromQL: Powerful query language with rate, histogram_quantile, aggregation, label manipulation.
  • Alerting: PrometheusRule CRD defines alerts; Alertmanager routes and deduplicates them to Slack, PagerDuty, email.
  • Exemplars and histograms: Histograms capture latency distributions; exemplars attach trace IDs to individual samples so a metric spike can be linked to a trace.
  • Remote write: Long-term storage via Thanos, Cortex, Mimir, VictoriaMetrics.
How it differs
  • vs Datadog/New Relic: SaaS alternatives are simpler but expensive at scale; Prometheus is free but you operate it.
  • vs InfluxDB: Influx used to compete but lost ground to Prometheus in the K8s space.
  • vs OpenTelemetry Metrics: OTel is the emerging standard for metric collection; Prometheus remains the default backend.
  • Push vs pull: Prometheus pulls (better for reliability), unlike StatsD/Graphite push model.
Why use itPrometheus is the observability backbone of K8s — it's how you know whether your cluster and apps are healthy. The standard metrics format (/metrics with text exposition) is implemented by every modern CNCF project.
Common gotchasPrometheus storage is local to a single instance: there is no clustering out of the box, so HA means running duplicate replicas, and a deduplicated global view requires Thanos, Cortex, or Mimir. Cardinality explosions (labels with unbounded values like user IDs) can exhaust memory and crash the server. Recording rules are essential for expensive queries. Retention is limited by disk; use remote_write for long-term storage.
Real-world examplesSoundCloud created Prometheus in 2012 (inspired by Google's Borgmon). DigitalOcean, Grafana Labs, GitLab, Shopify, Uber run large Prometheus deployments. Red Hat OpenShift bundles Prometheus as the default monitoring stack. Grafana dashboards for K8s (like the famous "K8s cluster overview") have millions of downloads.
Simple Definition: Prometheus collects metrics (pull-based, PromQL queries). Grafana visualizes dashboards. Together = standard K8s monitoring stack.

Key sources: kube-state-metrics (K8s object state), Node Exporter (node hardware and OS), app /metrics endpoints. Metrics Server is a separate in-memory pipeline that feeds HPA and kubectl top; it is not a Prometheus data source.

# PromQL examples
rate(http_requests_total{status="500"}[5m])     # HTTP 500 responses per second (a rate, not a ratio)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # P99 latency aggregated across series
sum by (service)(rate(http_requests_total[5m]))  # RPS per service
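
Building on the PromQL examples above, the following is a minimal sketch of how a recording rule and an alert might be declared through the PrometheusRule CRD. The rule names, namespace, the release label, and the 1% threshold are assumptions chosen for illustration; the label must match whatever ruleSelector your Prometheus instance actually uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-error-rules              # hypothetical name
  namespace: monitoring               # assumed namespace
  labels:
    release: kube-prometheus-stack    # assumed; must match the Prometheus ruleSelector
spec:
  groups:
  - name: http.rules
    rules:
    # Recording rule: precompute per-service RPS so dashboards stay cheap.
    - record: service:http_requests:rate5m
      expr: sum by (service) (rate(http_requests_total[5m]))
    # Alert: ratio of 5xx responses to all responses, per service.
    - alert: HighErrorRatio
      expr: |
        sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
        sum by (service) (rate(http_requests_total[5m])) > 0.01
      for: 10m                        # must hold for 10 minutes before firing
      labels:
        severity: page                # Alertmanager can route on this label
      annotations:
        summary: "5xx ratio above 1% for {{ $labels.service }}"
```

The alert uses a ratio rather than a raw error rate so that it scales with traffic; Alertmanager then handles routing and deduplication.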

Long-term: Thanos or Grafana Mimir for multi-cluster, long-retention storage.

Interview Answer: "Prometheus: pull-based metrics with PromQL. Alertmanager for routing alerts. kube-state-metrics + Node Exporter for infrastructure. Grafana for dashboards. Thanos/Mimir for long-term multi-cluster storage. CNCF graduated, industry standard."

Logging (Loki, EFK, Fluent Bit)

What is itKubernetes logging follows a pattern: apps write to stdout/stderr, the container runtime captures output to files on the node, and a log shipper DaemonSet tails those files and forwards them to a backend. The most common shippers are Fluent Bit (lightweight, written in C, the smaller sibling of Fluentd) and Vector (Rust, modern, fast). Popular backends: Loki (Grafana Labs, indexes labels not contents, cheap), Elasticsearch (full-text search, "EFK" stack: Elasticsearch + Fluentd + Kibana), OpenSearch, Splunk, Datadog, CloudWatch Logs.
Key features
  • Node-level shipping: One agent per node reading /var/log/pods/*.
  • Metadata enrichment: Add pod, namespace, labels, container name to every log line automatically.
  • Parsing: Regex, JSON, logfmt, multi-line (stack traces).
  • Backpressure handling: Buffering and retries when the backend is slow.
  • Loki label-based indexing: Cheap storage (S3/GCS) by indexing only labels, not log contents.
How it differs
  • Loki vs Elasticsearch: Loki is 10-100× cheaper but slower for full-text queries; ES is faster but requires large clusters.
  • Fluent Bit vs Fluentd: Fluent Bit has a far smaller footprint (roughly 1 MB vs tens of MB of memory per instance), which makes it the preferred choice for K8s.
  • Vector vs Fluent Bit: Vector is more powerful with transforms but less battle-tested at massive scale.
Why use itLogs are the universal debugging tool — every incident starts with "show me the logs." Centralized logging makes investigations possible across hundreds of pods and nodes, supports compliance audit trails, and powers alerting on log patterns.
Common gotchasHigh log volume can dominate cluster network/storage costs — log sampling or level tuning is essential. Missing multi-line parsing creates per-line spam from stack traces. Node disk fills up if the shipper falls behind. PII in logs is a compliance risk — scrub at the shipper. The log format should be structured (JSON) for queryability.
Real-world examplesGrafana Labs created Loki specifically for cost-effective K8s logs. Netflix built Mantis for high-volume streaming logs. Shopify uses Fluent Bit + BigQuery. Slack runs Elasticsearch at petabyte scale. Fluentd and Fluent Bit are both maintained under the CNCF's Fluentd project umbrella.
Simple Definition: Containers write to stdout/stderr. A DaemonSet collector (Fluent Bit) ships logs to a backend (Loki or Elasticsearch) for search and analysis.
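
As a rough illustration of the tail, enrich, and forward pipeline described above, here is a sketch of a Fluent Bit configuration in its YAML format (supported in recent Fluent Bit releases; older versions use the classic INI-style format). The Loki host name and the label value are assumptions, not defaults.

```yaml
service:
  flush: 1
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log   # node files written by the container runtime
      tag: kube.*
      multiline.parser: docker, cri      # built-in parsers for runtime log formats
  filters:
    - name: kubernetes                   # adds pod, namespace, labels, container name
      match: kube.*
      merge_log: on                      # lift JSON log bodies into structured fields
  outputs:
    - name: loki
      match: kube.*
      host: loki.logging.svc             # hypothetical Service name
      port: 3100
      labels: job=fluent-bit             # keep label cardinality low for Loki
```

In a cluster this file would typically be mounted from a ConfigMap into the Fluent Bit DaemonSet, alongside a hostPath mount of /var/log.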

Loki (Grafana) = label-based, cost-effective, Grafana integration. Elasticsearch = full-text search, powerful but heavy. Loki is increasingly replacing EFK.

Interview Answer: "Containers → stdout/stderr → node files → Fluent Bit DaemonSet → backend. Loki: label-based, cheap. Elasticsearch: full-text, heavy. Structured JSON logging enables better filtering. Correlate logs with metrics on same Grafana dashboard."

Distributed Tracing & OpenTelemetry

What is itDistributed tracing follows a request as it propagates through many microservices, producing a tree of spans with timing, metadata, and causal relationships. OpenTelemetry (OTel) is the CNCF standard that unifies tracing, metrics, and logs into a single observability framework with SDKs for most major languages and a wire protocol (OTLP). It was formed by merging the OpenTracing and OpenCensus projects, which it supersedes. Backends like Jaeger, Tempo (Grafana), Zipkin, Honeycomb, Datadog APM, and Lightstep ingest and visualize traces.
Key features
  • Context propagation: W3C Trace Context headers flow through HTTP/gRPC calls.
  • Auto-instrumentation: OTel agents can instrument Java, Python, Node.js without code changes.
  • Sampling: Head-based (decide at request start) or tail-based (decide after seeing the full trace — requires OTel Collector).
  • OTel Collector: A vendor-agnostic middleman that receives, transforms, batches, and exports telemetry to any backend.
  • Exemplars: Link metrics (Prometheus) to traces (Jaeger) for instant drill-down.
How it differs
  • vs logs: Logs tell you what happened at one point; traces tell you the full causal chain across services.
  • vs metrics: Metrics aggregate; traces capture individual requests with full detail.
  • Jaeger vs Zipkin: Jaeger (created at Uber, CNCF) is newer and more feature-rich; Zipkin (Twitter) was the pioneer.
  • Tempo vs Jaeger: Tempo uses object storage for cheap, high-volume trace retention.
Why use itIn a microservice architecture, a single user request may hit 50+ services — you cannot debug latency or errors without tracing. Tracing is how SREs answer "why was this request slow?" within minutes instead of days.
Common gotchasSampling too aggressively loses rare errors; sampling too little blows up storage. Instrumentation gaps (unpropagated context) break trace trees. Clock skew between services causes confusing span ordering. Trace data can leak sensitive payloads — scrub carefully. OTel SDKs are still maturing in some languages.
Real-world examplesUber built Jaeger to trace 10K+ microservices. Google's Dapper paper (2010) kicked off the whole field. Honeycomb popularized high-cardinality, column-oriented observability. Datadog, New Relic, Dynatrace all now support OTLP ingest. OTel has become the default instrumentation choice for new microservice architectures.
Simple Definition: Tracing follows a request across microservices showing where time is spent. OpenTelemetry (OTel) is the CNCF standard unifying metrics, logs, and traces.

How: Request gets trace ID → each service creates a span → propagated via headers → assembled into full trace.

Backends: Jaeger (CNCF), Grafana Tempo, Zipkin. OTel Collector = vendor-neutral pipeline.
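
To make the Collector's role as a vendor-neutral pipeline concrete, the following is a sketch of an OTel Collector configuration that receives OTLP, applies tail-based sampling, and exports to a trace backend. It assumes the contrib distribution (the tail_sampling processor is not in the core build); the backend endpoint and the 5% sampling rate are illustrative assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  tail_sampling:                  # contrib distribution only
    decision_wait: 10s            # buffer spans this long before deciding
    policies:
      - name: keep-errors         # always keep traces that contain an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-rest         # keep a small share of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  batch: {}
exporters:
  otlp:
    endpoint: tempo.observability.svc:4317   # hypothetical backend address
    tls:
      insecure: true              # assumption: plaintext inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]
```

Tail sampling needs all spans of a trace to reach the same Collector instance, which is why it is usually run in a gateway tier rather than in per-node agents.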

Interview Answer: "Tracing uses trace IDs and spans across services. Backends: Jaeger, Tempo. OpenTelemetry = CNCF standard unifying metrics/logs/traces with vendor-neutral SDKs and Collector. Auto-instrumentation adds tracing without code changes."

Probes: Liveness, Readiness & Startup

What is itKubernetes supports three types of health probes configurable per container: liveness ("is the process alive or stuck?" — fail restarts the container), readiness ("is it ready to receive traffic?" — fail removes the pod from Service endpoints), and startup ("has it finished initializing?" — disables liveness/readiness until it passes once). Each probe can be implemented as HTTP GET, TCP socket check, gRPC health check, or exec command. Probes run at configurable intervals (periodSeconds) with thresholds for success/failure transitions.
Key features
  • Liveness: Restart deadlocked or memory-leaking apps.
  • Readiness: Avoid sending traffic to warming-up pods; temporarily remove unhealthy pods during dependency outages.
  • Startup probes: For slow-starting apps (Java with 60+ second warmup) — prevents liveness from killing them during init.
  • Probe parameters: initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold.
  • gRPC probes: Native (no grpc_health_probe binary needed); enabled by default since K8s 1.24 and GA in 1.27.
How it differs
  • Liveness vs readiness: Critical distinction — liveness restarts; readiness just removes from Service. A full DB outage should fail readiness, NOT liveness (restarts don't help).
  • Startup vs initialDelaySeconds: Startup probes are better for slow apps — they adapt to variable warmup times.
  • vs ELB health checks: K8s probes run inside the cluster from kubelet; cloud LB health checks run from outside.
Why use itProbes power K8s's self-healing — they're how K8s knows which pods to route traffic to and which to restart. Done right, they make the cluster resilient; done wrong, they cause cascading failures.
Common gotchasTying liveness to a downstream dependency (DB, cache) causes cascade restarts during DB outages — if the DB is down, ALL app pods restart simultaneously, making recovery worse. Liveness should check internal process health only. Probe endpoints that are expensive (full SQL queries) burden the app. Too-aggressive thresholds cause flapping. Missing startup probes on slow-start apps causes infinite restart loops.
Real-world examplesZalando publishes a famous "don't put your DB in liveness" blog post. Monzo had a major incident from overly-aggressive liveness probes cascading. Spring Boot Actuator's /actuator/health/liveness and /actuator/health/readiness are standard for Java K8s apps. Go's /healthz convention is widespread.
Simple Definition: Liveness = alive? (kill if not). Readiness = ready for traffic? (remove from endpoints if not). Startup = still booting? (gate other probes).

Methods: HTTP GET, TCP Socket, gRPC, Exec command.
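
A minimal sketch of all three probes on one container follows. The image, port, and the /healthz and /ready paths are assumptions for illustration; the timing values are examples, not recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                          # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0     # hypothetical image
    ports:
    - containerPort: 8080
    startupProbe:                           # gates the other two probes until it passes once
      httpGet: {path: /healthz, port: 8080}
      periodSeconds: 5
      failureThreshold: 24                  # 24 x 5s = up to 120s allowed for startup
    livenessProbe:                          # failure restarts the container
      httpGet: {path: /healthz, port: 8080} # should check internal process health only
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:                         # failure removes the pod from Service endpoints
      httpGet: {path: /ready, port: 8080}   # may include dependency checks
      periodSeconds: 5
      failureThreshold: 2
```

Keeping liveness and readiness on separate endpoints makes it possible to fail readiness during a dependency outage without triggering restarts.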

Common mistake: Aggressive liveness probes restart healthy containers under load, causing cascading failures. Keep liveness lenient, readiness responsive.
Interview Answer: "Liveness: restart dead containers. Readiness: remove unready from traffic. Startup: gate probes for slow starters. Methods: HTTP, TCP, gRPC, exec. Key: lenient liveness, responsive readiness. Every production service needs both."

SLI, SLO & SLA

What is itThese are the SRE triad for reliability. SLI (Service Level Indicator) is a concrete measurement, e.g., "the fraction of HTTP requests completing in under 200ms with status < 500." SLO (Service Level Objective) is a target on that SLI, e.g., "99.9% of requests complete in under 200ms over a 30-day window." SLA (Service Level Agreement) is the contractual promise to users with consequences if violated (refunds, SLA credits). Keep SLA < SLO < 100%: the SLA is looser than the internal SLO, so you miss your own target before you breach the contract, and the gap between the SLO and 100% is the error budget. Popularized by Google's SRE book, now standard practice in cloud-native ops.
Key features
  • Error budget: The difference between 100% and the SLO — "we're allowed 43 minutes of downtime per month at 99.9%."
  • Burn rate alerts: Fire when error budget is being consumed too fast (e.g., 10× normal).
  • Multi-window multi-burn-rate: Modern alerting technique combining short and long windows to reduce noise.
  • SLO tooling: Sloth, Pyrra, OpenSLO, Nobl9 turn SLO specs into Prometheus rules and dashboards.
How it differs
  • SLI vs metric: All SLIs are metrics; not all metrics are SLIs. SLIs must reflect user experience.
  • SLO vs KPI: KPIs drive business; SLOs drive reliability engineering priorities.
  • SLO vs alert threshold: A "5xx rate > 1%" alert is usually worse than a burn-rate alert because it doesn't account for error budget.
Why use itSLOs let teams decide when to slow feature work and invest in reliability, and when to ship faster because there's budget to spare. They create a shared language between product and SRE, and focus alerts on what users actually feel.
Common gotchasSLOs too tight cause alert fatigue and burn out the team. SLOs too loose hide real problems. Measuring the wrong SLI (availability of the LB, not the app) misleads. SLAs that exceed SLOs set you up for financial pain. Consumer-facing SLOs should use end-user-perceived latency, not internal service latency.
Real-world examplesGoogle's SRE book formalized SLOs in 2016. AWS publishes SLAs (e.g., EC2 99.99%) with service-credit refunds. Monzo, GitHub, Stripe run public status pages reporting availability and incidents. Nobl9 is a commercial SLO platform. Open-source OpenSLO standardizes the spec.
Simple Definition: SLI = metric (availability, latency). SLO = target (99.9% availability). SLA = customer contract with penalties. Error budget = 100% - SLO.

Four Golden Signals: Latency, Traffic, Errors, Saturation.

SLO 99.9% = 43 min downtime/month. When error budget exhausted → freeze features, fix reliability.
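
The 99.9% arithmetic above can be turned into a burn-rate alert. The sketch below follows the multi-window pattern from Google's SRE Workbook for a 30-day window; the metric name, status label, and alert name are assumptions for illustration.

```yaml
# Error budget for a 99.9% SLO = 1 - 0.999 = 0.001 (about 43.2 min per 30 days).
# A burn rate of 14.4 sustained for 1h spends 14.4 / 720h = 2% of the 30-day budget.
groups:
- name: slo.burnrate
  rules:
  - alert: ErrorBudgetFastBurn        # hypothetical alert name
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[1h]))
          / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
      )
      and
      (
        sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
      )
    labels:
      severity: page
    annotations:
      summary: "Burning the 30-day error budget at more than 14.4x"
```

The long window confirms that the burn is significant, and the short window makes the alert stop firing soon after the problem is fixed.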

Interview Answer: "SLI=measurable indicator, SLO=internal target, SLA=external contract. Error budget (100%-SLO) drives prioritization. Four golden signals: latency, traffic, errors, saturation. SLOs bridge platform and product teams."

Deployment Strategies & GitOps

Rolling, blue-green, canary, Argo CD, Flux, Helm, and Kustomize.

Rolling Update

What is itThe rolling update is Kubernetes's default deployment strategy: gradually replace old pods with new ones, ensuring at least some pods are always serving traffic. The Deployment controller creates a new ReplicaSet for the new image, incrementally scales it up while scaling the old one down, respecting maxSurge (how many extra pods allowed temporarily) and maxUnavailable (how many pods can be down at once). The result is a zero-downtime deployment for any stateless app with working readiness probes.
Key features
  • maxSurge: Default 25% — adds up to 25% more pods during the update.
  • maxUnavailable: Default 25% — can have up to 25% fewer pods during the update.
  • Progress deadline: If no progress for progressDeadlineSeconds, rollout is marked failed.
  • Pause and resume: Stage multiple edits, then apply together.
  • Instant rollback: kubectl rollout undo reverts to the previous ReplicaSet.
How it differs
  • vs Blue-Green: Rolling is gradual; blue-green is instant switch. Rolling uses less capacity but takes longer.
  • vs Canary: Canary sends a small fraction of traffic to the new version for verification; rolling replaces based on pod count.
  • vs Recreate strategy: Recreate tears down all old pods first — causes downtime, only used for incompatible versions.
  • vs VM-based rolling: Same concept at a different layer — auto-scaling groups cycle instances; K8s cycles pods.
Why use itRolling updates are the default because they give zero-downtime with no extra tooling. Combined with readiness probes and PDBs, most services deploy safely without human intervention.
Common gotchasWithout proper readiness probes, rolling updates can send traffic to not-yet-ready pods. DB schema migrations require backward-compatible changes or a two-phase rollout. maxUnavailable: 0 with low replica counts can get stuck if the cluster is tight on capacity. Sticky sessions break mid-rollout unless using affinity.
Real-world examplesShopify does hundreds of rolling updates per day across production. GitHub rolls out their monolith (still Rails) via rolling updates to their K8s cluster. Every kubectl apply with an image change kicks off a rolling update by default.
Simple Definition: K8s default. Replaces Pods incrementally via maxSurge/maxUnavailable. Both versions run simultaneously during transition.
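
A sketch of a Deployment with an explicit rolling update strategy follows. The names, image, and readiness path are assumptions; the percentages are the Kubernetes defaults written out.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                                  # hypothetical name
spec:
  replicas: 10
  progressDeadlineSeconds: 600               # marks the rollout failed if it stalls this long
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%          # rounds up: ceil(2.5) = 3 extra pods, so at most 13 total
      maxUnavailable: 25%    # rounds down: floor(2.5) = 2, so at least 8 stay available
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
      - name: web
        image: registry.example.com/web:2.0  # hypothetical image; changing it triggers a rollout
        readinessProbe:                      # new pods receive traffic only after this passes
          httpGet: {path: /ready, port: 8080}
```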
Interview Answer: "Rolling update replaces Pods gradually. maxSurge=extra Pods allowed, maxUnavailable=Pods that can be down. New Pods must pass readiness before old ones are terminated. Rollout stalls if new Pods fail, so production stays safe."

Blue-Green Deployment

What is itBlue-green deployment maintains two identical production environments — "blue" (current live) and "green" (new version, fully deployed but idle). Once green is verified, a single switch (Service label, DNS, load balancer) cuts all traffic from blue to green. If anything goes wrong, flip back instantly. This is not built into K8s as a first-class feature, but is trivially implemented by running two Deployments with different labels and updating the Service selector — or by using tools like Argo Rollouts or Flagger that automate the flip.
Key features
  • Instant cutover: Switch happens in seconds (label swap).
  • Instant rollback: Flip back with the same mechanism.
  • Full testing on green: Run smoke tests, load tests before the switch.
  • Dual capacity: Requires running both versions simultaneously — 2× the resource footprint during deploy.
How it differs
  • vs Rolling: Instant rather than gradual; more capacity needed but faster rollback.
  • vs Canary: Blue-green is all-or-nothing; canary ramps up gradually with traffic splitting.
  • vs A/B testing: Blue-green is a deployment strategy; A/B testing splits by user segment for product experiments.
Why use itBlue-green is ideal when you need fast, reliable rollback (single DB cutover, financial services), or when the new and old versions cannot coexist (incompatible data formats). It's also great for critical upgrades where you want to run smoke tests against a live-configured environment before exposing it.
Common gotchasDouble capacity is expensive — 2× compute and 2× database connections during deploy. Stateful data sync between blue and green is tricky — DB migrations must be backward-compatible. In-flight long-running requests are cut off at the switch. Not suitable for deployments with session state in memory.
Real-world examplesNetflix famously pioneered blue-green at scale in the Spinnaker era. Amazon, Etsy use blue-green for critical services. Argo Rollouts provides blue-green as a built-in strategy. Often combined with load testing tools to validate green before the switch.
Simple Definition: Two environments (Blue=old, Green=new). Switch all traffic at once by changing Service selector. Instant rollback = switch back.
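
The selector switch can be sketched with a single Service, assuming two Deployments already exist whose pods carry version: blue and version: green labels respectively (the label key and names are assumptions for illustration).

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical name
spec:
  selector:
    app: web
    version: blue        # change to "green" to cut all traffic over; change back to roll back
  ports:
  - port: 80
    targetPort: 8080
```

Because only the selector changes, the cutover applies to new connections; requests already in flight to blue pods are not migrated.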

Pros: Atomic switchover, instant rollback. Cons: Double resources, DB schema compatibility needed.

Interview Answer: "Blue-green: two identical environments, atomic traffic switch via Service selector. Instant rollback. Cons: double infrastructure, DB compatibility. Argo Rollouts automates with health analysis."

Canary Deployment

What is itCanary deployment releases a new version to a small subset of traffic (e.g., 1%), observes its behavior (error rates, latency, business metrics), and gradually promotes it (5%, 10%, 25%, 50%, 100%) if metrics stay healthy — otherwise rolls back. Named after the "canary in a coal mine." Traffic splitting can be done via replica ratios (crude), ingress weights (better), service mesh routing (precise), or dedicated tools like Argo Rollouts and Flagger that automate the promotion based on Prometheus metrics.
Key features
  • Automated analysis: Argo Rollouts and Flagger each query Prometheus for error rate / latency / custom SLIs at each step and promote or roll back automatically.
  • Progressive traffic shifting: Weighted routing via a service mesh (Istio, Linkerd), NGINX Ingress, or the Gateway API.
  • Manual gates: Pause between steps for human approval.
  • Shadow/mirror traffic: Duplicate traffic to canary without affecting users.
  • Header-based canary: Route internal testers or beta users by HTTP header.
How it differs
  • vs Rolling: Rolling changes pod count; canary changes traffic percentage. Canary is safer for high-impact changes.
  • vs Blue-Green: Blue-green flips instantly; canary ramps gradually and exposes fewer users to bad deploys.
  • vs feature flags: Feature flags control behavior inside the same binary; canary controls which binary runs.
Why use itCanary is the gold standard for high-risk deployments — payment systems, authentication, core APIs. It catches regressions before they reach most users and provides data-driven rollback decisions.
Common gotchasAnalysis metrics must be representative — canary metrics at 1% traffic may not show long-tail issues. Sticky sessions defeat traffic splitting. Database schemas must be backward-compatible throughout the promotion. Canary tooling adds operational complexity.
Real-world examplesNetflix, Google, Facebook all use canary deployments for critical rollouts. Argo Rollouts and Flagger are the two dominant open-source canary controllers. Weaveworks built Flagger. Intuit built Argo Rollouts. Both integrate with Istio, Linkerd, NGINX, AWS ALB, and Contour.
Simple Definition: Route small % of traffic to new version, monitor, gradually increase. Gold standard for risk-sensitive deployments.
# Argo Rollouts canary
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates: [{ templateName: success-rate }]
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100
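
The rollout above references an analysis template named success-rate. One possible shape for it is sketched below; the Prometheus address, the metric and label names, and the 99% threshold are assumptions, not values taken from the rollout.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m                    # take one measurement per minute
    count: 5                        # five measurements in total
    successCondition: result[0] >= 0.99
    failureLimit: 1                 # a second failed measurement aborts the rollout
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # hypothetical address
        query: |
          sum(rate(http_requests_total{app="web",status!~"5.."}[2m]))
            /
          sum(rate(http_requests_total{app="web"}[2m]))
```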
Interview Answer: "Canary routes small traffic % to new version with automated analysis (Prometheus error rate/latency). Auto-rollback on failure. Argo Rollouts or Gateway API traffic splitting. Safest strategy — catches issues before affecting all users."

GitOps (Argo CD & Flux)

What is itGitOps is an operational pattern where Git is the single source of truth for cluster state. Manifests (or Helm values, or Kustomize overlays) live in a Git repo; a controller running in the cluster continuously pulls and applies them. Any drift (someone ran kubectl edit) is detected and reconciled back to Git state. Deploys happen by merging PRs, giving free audit, review, rollback, and traceability. The term was coined by Weaveworks in 2017. The two dominant tools are Argo CD (Intuit, now CNCF graduated) and Flux (Weaveworks, CNCF graduated).
Key features
  • Declarative desired state: Everything is a file in Git.
  • Reconciliation loop: Controller polls Git, detects diffs, applies changes.
  • Drift detection: Manual changes are reverted or flagged.
  • App-of-apps pattern: One Argo Application manages many child Applications — scales to hundreds of microservices.
  • Multi-cluster: Single Argo/Flux installation manages many clusters from a central control plane.
  • Secret integration: SealedSecrets, SOPS, External Secrets to safely store secrets in Git.
How it differs
  • vs kubectl apply from CI: Push-based CI gives CI credentials to the cluster; GitOps inverts this (cluster pulls) — cleaner security boundary.
  • Argo CD vs Flux: Argo CD has a rich UI, "Application" CRD, and SSO; Flux is lighter, more composable, with better Helm lifecycle support. Both are mature.
  • vs Spinnaker: Spinnaker is heavier, multi-cloud-focused, predates GitOps; adoption has shrunk.
Why use itGitOps brings software engineering practices to operations: PR review, branch protection, audit trail, easy rollback (git revert), disaster recovery (rebuild cluster from repo). It's the modern default for K8s delivery.
Common gotchasSecrets in Git need careful handling. The "who owns main branch" question becomes political. Emergency changes still need a break-glass path. Drift from external systems (Helm releases installed manually) confuses controllers. Argo Application sync loops can thrash if manifests contain unstable fields.
Real-world examplesIntuit created Argo CD and runs 15+ clusters via it. Weaveworks created Flux and popularized GitOps. Red Hat OpenShift GitOps ships Argo CD as the default. Codefresh built a commercial platform around Argo. Nearly every new K8s shop in 2024+ uses GitOps.
Simple Definition: Git as single source of truth. Agents in-cluster pull state from Git and reconcile continuously. All changes via PRs.
GitOps CI/CD Pipeline Flow: Code Commit → CI (Build & Test) → Build Image → Scan (Trivy) → Push to Registry → Update GitOps Repo → Argo CD detects → Deploy to Cluster.
Rollback = revert the Git commit.

Argo CD: UI, SSO, ApplicationSets, auto-sync, self-healing. Flux: Toolkit approach, deeper Helm/Kustomize integration, no built-in UI.
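
A sketch of an Argo CD Application that pulls from a Git repository with auto-sync and self-healing enabled follows. The repository URL, path, and namespaces are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                        # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/gitops.git   # hypothetical repo
    targetRevision: main
    path: apps/web/overlays/prod   # hypothetical path to manifests or a Kustomize overlay
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true                  # delete resources that were removed from Git
      selfHeal: true               # revert manual drift back to the Git state
```

With selfHeal enabled, a manual kubectl edit is reverted on the next reconciliation, which is why an explicit break-glass procedure matters.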

Interview Answer: "GitOps: Git=source of truth, agents pull+reconcile (not push). Benefits: audit trail, easy rollback (revert commit), no cluster creds externally, drift correction. Argo CD has UI+ApplicationSets. Flux has deeper Helm/Kustomize integration. Industry standard."

Helm & Kustomize

What is itKubernetes manifests are verbose YAML — two tools emerged to tame the sprawl. Helm is the "package manager for Kubernetes," distributing applications as charts — a templated bundle of YAMLs + a values.yaml file with configurable defaults. Charts are versioned, shareable via repositories, and lifecycle-managed via helm install/upgrade/rollback. Kustomize is a templating-free alternative built into kubectl since v1.14 — it uses a layered overlay model (base + dev/staging/prod overlays) with strategic merge patches. Each tool has passionate advocates; many teams use both.
Key features
  • Helm charts: Templated YAML with Go templates, named releases, rollbacks, dependency charts.
  • Artifact Hub (successor to Helm Hub): Thousands of pre-built charts for common apps (Postgres, Redis, Prometheus, cert-manager).
  • Helm values: Hierarchical YAML config injected into templates.
  • Kustomize overlays: Layer patches on a base manifest for environment-specific customization.
  • Kustomize components: Reusable patch snippets composed into overlays.
How it differs
  • Helm vs Kustomize: Helm uses templates (Go template syntax, can be messy); Kustomize is patch-based (no templating — pure YAML manipulation).
  • Helm: Better for installing third-party apps (packaged by authors) and full lifecycle management.
  • Kustomize: Better for managing your own manifests with small per-env differences.
  • vs Pulumi/Crossplane: These use real programming languages; Helm/Kustomize are YAML-centric.
Why use itWithout Helm or Kustomize, manifests become unmanageable at ~10+ services. Helm is the lingua franca for distributing third-party K8s apps. Kustomize is simpler for in-house codebases where you control the manifests.
Common gotchasHelm's Go templates are notoriously hard to debug with nested conditionals. Helm state lives in Secrets in the cluster — complicates cluster rebuild. Kustomize patches can silently fail to match target fields. CRDs in Helm charts are tricky due to ordering. Helm 2 used Tiller (removed in Helm 3).
Real-world exampleskube-prometheus-stack, Ingress NGINX, cert-manager, Istio, Argo CD are all distributed as Helm charts. Google, IBM, Microsoft sponsor Helm. Kustomize was created by Google and integrated into kubectl. Argo CD supports both natively.
Simple Definition: Helm = package manager (templated Charts + Values). Kustomize = YAML overlays without templates (built into kubectl).

Use Helm for: third-party packages, complex parameterization. Use Kustomize for: your own apps, simple environment overrides. Many teams use both.
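
For the "simple environment overrides" case, a Kustomize overlay might look like the sketch below. It assumes a base directory containing a Deployment named web; the directory layout, image name, and replica count are illustrative.

```yaml
# overlays/prod/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                       # shared manifests
images:
- name: registry.example.com/web   # hypothetical image referenced in the base
  newTag: "2.0"                    # prod pins its own tag
patches:
- target:
    kind: Deployment
    name: web
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 6
```

Rendering with kubectl kustomize overlays/prod before applying helps catch patches that no longer match their target.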

Interview Answer: "Helm: Charts with Go templates and Values for parameterization. Kustomize: overlays and patches, no template language, built into kubectl. Helm for external tools (Prometheus, NGINX), Kustomize for internal apps. Most teams use both."

Cluster Architecture

Control plane, etcd, multi-cluster, multi-tenancy, managed vs. self-managed.

Control Plane Components

What is itThe control plane is the brain of Kubernetes — the set of components that make global decisions about the cluster, respond to events, and drive state reconciliation. It consists of: kube-apiserver (the only component that talks to etcd, exposes REST API), etcd (distributed key-value store, the source of truth), kube-scheduler (decides which node runs each pod), kube-controller-manager (runs built-in controllers: Deployment, ReplicaSet, Node, Namespace, etc.), cloud-controller-manager (integrates with cloud APIs for LBs, volumes, and node lifecycle). In production, these components run as HA with 3 or 5 replicas.
Key features
  • API server: Stateless, horizontally scalable, the only write path to etcd.
  • etcd: Raft consensus, runs as a 3- or 5-node cluster, stores all cluster state.
  • Scheduler: Runs filter and score phases (formerly called predicates and priorities) to place pods; extensible via scheduler framework plugins.
  • Controller manager: Runs built-in control loops; each controller watches and reconciles its resource type.
  • Cloud controller manager: Extracted from kube-controller-manager so cloud providers can evolve independently.
How it differs
  • vs worker nodes: Control plane manages; workers execute.
  • vs Nomad servers: Similar concept — Nomad "servers" are analogous but a simpler implementation.
  • vs Borg master: Borg had one monolithic Borgmaster per cell (replicated, with its own Paxos-backed store); K8s learned from that and uses a stateless API server atop etcd.
  • Managed vs self-hosted: EKS/GKE/AKS hide the control plane entirely; self-hosted (kubeadm, kops) requires operating it.
Why use itUnderstanding control plane internals is essential for debugging (why isn't my pod scheduled? why is my PVC stuck?) and for operating self-managed clusters. Every K8s concept — declarative state, reconciliation, eventual consistency — flows from the control plane's architecture.
Common gotchasetcd backups and disaster recovery are critical and often neglected. API server load can spike with chatty controllers or list-watch storms. Scheduler extender bugs cause pod placement anomalies. Certificate rotation failures lock you out of the control plane. Upgrading the control plane must precede worker node upgrades (version skew policy).
Real-world exampleskubeadm bootstraps a standard control plane. kops, Cluster API, Kubespray, Talos, Rancher offer managed-ish approaches. Google GKE runs the control plane on their own infrastructure; users never see it. AWS EKS charges $0.10/hour per cluster for the managed control plane. Most incidents in self-managed clusters trace back to etcd issues.
Simple Definition: The brain: API Server (front door), etcd (memory/state), Scheduler (Pod placement), Controller Manager (reconciliation loops).

API Server: AuthN → AuthZ → Mutating admission → Schema validation → Validating admission → Persist to etcd. Stateless, run multiple replicas.

etcd: Raft consensus, 3 or 5 nodes. Fast SSDs required. Losing etcd without backup = losing everything.

Scheduler: Filter (which nodes CAN) → Score (which is BEST). Considers resources, affinity, taints, topology.
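
The inputs to the filter and score phases live in the Pod spec. The sketch below marks which fields act as hard filters and which only influence scoring; the taint key, node label, and image are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sched-demo                          # hypothetical name
  labels: {app: web}
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0     # hypothetical image
    resources:
      requests: {cpu: 500m, memory: 256Mi}  # filter: node must have this much unreserved
  tolerations:                              # filter: allows nodes tainted dedicated=web:NoSchedule
  - key: dedicated
    operator: Equal
    value: web
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:   # score: a preference, not a requirement
      - weight: 50
        preference:
          matchExpressions:
          - {key: disktype, operator: In, values: [ssd]}
  topologySpreadConstraints:                # filter, because whenUnsatisfiable is DoNotSchedule
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: {app: web}
```

If no node survives the filter phase, the pod stays Pending and the scheduler records the reasons in the pod's events.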

Controller Manager: Runs Deployment, ReplicaSet, Node, Job, Endpoint controllers. Watch → Compare → Act.

Interview Answer: "API Server: RESTful front door (auth/authz/admission). etcd: distributed KV store (needs backup). Scheduler: filter→score for Pod placement. Controller Manager: reconciliation loops. For HA: multiple API servers, 3-5 etcd nodes, leader election."

Multi-Cluster & Multi-Tenancy

What is itAt scale, most organizations run many Kubernetes clusters rather than one giant cluster. Reasons: blast-radius containment, regulatory isolation (EU vs US data), multi-cloud portability, per-environment separation, upgrade safety, and scale limits. Multi-tenancy is the orthogonal question: how many teams/customers share one cluster? Soft multi-tenancy (namespace + RBAC + NetworkPolicy) works for trusted tenants. Hard multi-tenancy (per-tenant cluster, vClusters, or gVisor/Kata Containers for kernel isolation) is needed for untrusted ones. Tools like Cluster API, Rancher Fleet, Argo CD ApplicationSet, Karmada, Red Hat ACM, and Google Anthos manage multi-cluster workloads.
Key features
  • Cluster API: Declarative K8s API for creating and managing clusters as custom resources.
  • vCluster: Virtual clusters running inside a host cluster — cheap isolation.
  • Karmada: CNCF project for scheduling workloads across clusters.
  • Multi-cluster Services API: Cross-cluster service discovery using shared DNS.
  • Fleet management: GitOps-based deployment of identical workloads to many clusters.
How it differs
  • vs single giant cluster: Single cluster is cheaper/simpler at small scale; multi-cluster is necessary for blast-radius reasons or when approaching upstream scale limits (about 5,000 nodes and 150,000 pods per cluster).
  • vs multi-cloud via different clusters: Multi-cluster naturally supports multi-cloud if you don't rely on cloud-specific APIs.
  • vs vCluster: vClusters give strong API-level isolation at lower cost than full clusters.
Why use itMulti-cluster is essential for high availability (survive a regional outage), compliance (PCI cluster, HIPAA cluster), scale (no single etcd bottleneck), and team autonomy. Multi-tenancy determines how you pay for those clusters and how isolation is enforced.
Common gotchasManaging 50+ clusters without tooling is a full-time job. Service discovery across clusters is non-trivial. Authentication unification (single SSO, per-cluster kubeconfigs) is painful. Cost of duplicated control planes adds up. Cross-cluster networking usually requires a mesh like Istio multi-cluster or Submariner.
Real-world examplesAirbnb operates hundreds of clusters. Spotify uses Cluster API. Netflix runs per-region clusters with Titus + K8s. Google Anthos manages fleets across GCP, on-prem, and other clouds. Shopify runs clusters across cells for blast-radius reduction. Loft Labs built vCluster for cheap tenant isolation.
Simple Definition: Multi-cluster = separate clusters for blast radius, compliance, scale. Multi-tenancy = sharing: namespace-per-tenant (cheap), vCluster (strong isolation), cluster-per-tenant (strongest).
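The namespace-per-tenant option can be sketched as a small generator that emits the objects soft multi-tenancy usually starts with: a Namespace, a ResourceQuota, and a default-deny NetworkPolicy. This is a minimal sketch; the function name `tenant_manifests`, the `tenant-` prefix, and the quota values are illustrative assumptions, and RBAC bindings are omitted.

```python
import json

def tenant_manifests(tenant: str, cpu: str = "10", memory: str = "20Gi") -> list[dict]:
    """Build Namespace, ResourceQuota and default-deny NetworkPolicy for one tenant."""
    ns = f"tenant-{tenant}"  # hypothetical naming convention
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": ns, "labels": {"tenant": tenant}}},
        # Cap the total CPU/memory the tenant's pods may request.
        {"apiVersion": "v1", "kind": "ResourceQuota",
         "metadata": {"name": "tenant-quota", "namespace": ns},
         "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}}},
        # Empty podSelector selects every pod; no rules means all traffic is denied.
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": {"name": "default-deny", "namespace": ns},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}},
    ]

# Print one JSON document per object for the example tenant.
for obj in tenant_manifests("acme"):
    print(json.dumps(obj))
```

Each printed JSON document can be saved to a file and applied with kubectl, which accepts JSON as well as YAML.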

Management: Cluster API (declarative lifecycle), Rancher (UI), Argo CD ApplicationSets (multi-cluster deploys).

Interview Answer: "Multi-cluster for blast radius, compliance, scale limits. Tenancy: namespace-per-tenant (RBAC+quotas), vCluster (virtual clusters), cluster-per-tenant (full isolation). Managed via Cluster API and Argo CD ApplicationSets."

Managed vs. Self-Managed Kubernetes

What is itA managed Kubernetes service hides the control plane entirely — the cloud provider runs etcd, the API server, scheduler, and controllers, handling upgrades, backups, and HA. You only see and manage worker nodes (or not, with "serverless" modes like GKE Autopilot and EKS Fargate). The big three are Amazon EKS, Google GKE, Azure AKS, plus DigitalOcean DOKS, Linode LKE, Oracle OKE, and Red Hat OpenShift Dedicated/ROSA. Self-managed means you run the control plane yourself — tools: kubeadm, kops, kubespray, Talos, Cluster API, Rancher RKE2.
Key features
  • Managed: Pay per hour for the control plane (~$72/month for EKS), provider handles upgrades and HA, integrates with cloud IAM/networking.
  • Serverless managed (Autopilot, Fargate): Pay per pod-second, no node management — highest abstraction.
  • Self-managed: Full control, no per-cluster fee, runs on any infrastructure (on-prem, edge, air-gapped).
  • Distributions: k3s (Rancher, lightweight for edge), Talos (immutable OS designed for K8s), OKD (upstream OpenShift).
How it differs
  • Managed: Faster to start, less ops burden, but locked to the provider and less flexible.
  • Self-managed: Full control, portable, but requires deep K8s ops expertise and operational investment.
  • GKE Autopilot vs EKS Fargate: Both bill per pod; Fargate additionally requires Fargate profiles and separate networking config, has pod-size restrictions, and cannot run DaemonSets.
Why use itMost teams should start with managed K8s — operating etcd and the control plane is genuinely hard and not a differentiator. Self-managed makes sense for regulated industries, air-gapped environments, edge computing, or organizations with deep K8s expertise.
Common gotchasManaged doesn't mean zero work — you still handle node upgrades, addons, monitoring, RBAC, cost management. Provider-specific features create lock-in over time. Self-managed on bare metal requires solving load balancer provisioning (MetalLB), storage (Ceph/Longhorn), and upgrades yourself. Upgrades on self-managed clusters have broken plenty of production setups.
Real-world examplesMost startups use EKS/GKE/AKS. Shopify originally ran self-managed, moved heavily toward managed. Bloomberg runs massive self-managed clusters. SpaceX, Apple, large banks often prefer self-managed for control. CERN runs self-managed on OpenStack.
Simple Definition: Managed (EKS/GKE/AKS) = cloud handles control plane. Self-managed (kubeadm, k3s, RKE2) = you handle everything.

Managed: EKS (AWS, most popular), GKE (Google, most mature), AKS (Azure, free-tier control plane available).
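The ~$72/month control-plane figure quoted above, and the way duplicated control planes add up across a fleet, is simple arithmetic. The sketch below assumes a flat $0.10 per cluster-hour and a 30-day month; both numbers are assumptions chosen to reproduce the ~$72 figure, not a quote of any provider's current price list.

```python
HOURLY_FEE_USD = 0.10        # assumed per-cluster control-plane rate
HOURS_PER_MONTH = 24 * 30    # assumed 30-day month

def control_plane_cost(clusters: int, hourly_fee: float = HOURLY_FEE_USD) -> float:
    """Monthly control-plane fee for a fleet of managed clusters."""
    return round(clusters * hourly_fee * HOURS_PER_MONTH, 2)

print(control_plane_cost(1))    # one cluster
print(control_plane_cost(50))   # a 50-cluster fleet
```

Worker nodes, load balancers, and egress are billed separately and usually dominate the total.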

Self-managed: kubeadm (official), k3s (edge/IoT, lightweight), RKE2 (FIPS, government), OpenShift (enterprise).

Interview Answer: "Managed: cloud handles control plane + etcd. Most orgs should use managed. Self-managed for edge (k3s), regulated (RKE2), on-prem, massive scale cost optimization. Cluster API standardizes management across both."

Operators & CRDs

Extending Kubernetes with custom resources and the Operator pattern.

CRD (Custom Resource Definition)

What is itA Custom Resource Definition (CRD) teaches the Kubernetes API server about a new object type. Once installed, you can create Custom Resources (CRs) of that type with kubectl, subject to the same RBAC, validation, and admission webhooks as built-in objects. CRDs define their schema in OpenAPI v3 (for validation) and optionally declare subresources like /status and /scale. CRDs alone don't do anything — they just store data; pair them with a controller (an Operator) to turn them into active automation. This is how thousands of tools extend K8s: Certificate, Kafka, PrometheusRule, VirtualService, Application.
Key features
  • OpenAPI v3 schemas: Type and range validation at the API server.
  • Multiple versions: Support schema evolution with conversion webhooks.
  • Subresources: /status separates user intent from controller-managed state; /scale enables HPA on your custom object.
  • Printer columns: Customize kubectl get foo output.
  • CEL validation: Cross-field validation rules without webhooks (beta in K8s 1.25, GA in 1.29).
How it differs
  • vs built-in resources: CRDs behave identically to built-ins from the API surface perspective.
  • vs API aggregation: Aggregated API servers are the other extension mechanism (metrics-server uses it): more powerful but far more complex; CRDs are the standard choice for most extensions.
  • vs ConfigMaps: ConfigMaps are untyped blobs; CRDs provide typed, validated, RBAC-aware objects.
Why use itCRDs are how K8s becomes extensible — a platform for platforms. You can model databases, message queues, certificates, cloud resources, feature flags, or business objects in the K8s API and get declarative management for free.
Common gotchasCRDs can bloat etcd if used for high-churn data (don't store per-request objects). Schema changes require careful versioning with conversion webhooks. CRDs are cluster-scoped resources themselves — require cluster-admin to install. Upgrading controllers without upgrading CRDs causes compatibility breaks. Validation schemas are often underspecified, letting bad data through.
Real-world examplesA typical production cluster has 30-100+ CRDs from installed tools. cert-manager (Certificate, Issuer), Istio (VirtualService, DestinationRule, 15+ CRDs), Argo CD (Application, AppProject), Prometheus Operator (Prometheus, ServiceMonitor, PrometheusRule), Crossplane (cloud resources as CRDs).
Simple Definition: Extends the K8s API with your own resource types. After creating a CRD, manage custom resources with kubectl like built-in ones.

Examples: Certificate (cert-manager), VirtualService (Istio), Cluster (CloudNativePG).

A CRD alone just stores data. A Custom Controller watches it and takes action = the Operator pattern.
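To illustrate what "type and range validation at the API server" means for a custom resource, here is a deliberately simplified validator over an OpenAPI v3-style schema. It is a sketch, not the API server's implementation: the `Backup`-like fields (`schedule`, `retentionDays`) are hypothetical, and only `type`, `required`, `properties`, `minimum`, and `maximum` are handled.

```python
# Hypothetical schema for the spec of an imaginary "Backup" custom resource.
SCHEMA = {
    "type": "object",
    "required": ["schedule", "retentionDays"],
    "properties": {
        "schedule": {"type": "string"},
        "retentionDays": {"type": "integer", "minimum": 1, "maximum": 365},
    },
}

TYPES = {"object": dict, "string": str, "integer": int}

def validate(value, schema, path="spec"):
    """Return a list of validation errors (empty list means valid)."""
    expected = TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for integers.
    if not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: must be >= {schema['minimum']}")
    if "maximum" in schema and value > schema["maximum"]:
        errors.append(f"{path}: must be <= {schema['maximum']}")
    for key in schema.get("required", []):
        if key not in value:
            errors.append(f"{path}.{key}: required")
    # Recurse into declared properties that are present.
    for key, sub in schema.get("properties", {}).items():
        if key in value:
            errors.extend(validate(value[key], sub, f"{path}.{key}"))
    return errors

print(validate({"schedule": "0 2 * * *", "retentionDays": 30}, SCHEMA))
print(validate({"retentionDays": 0}, SCHEMA))
```

An underspecified schema (for example, one that omits `minimum`) would let the second object through, which is the "bad data" gotcha mentioned above.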

Interview Answer: "CRDs extend the K8s API with custom types. Managed via kubectl like built-in resources. CRD alone is just storage — a Custom Controller makes it actionable (Operator pattern). Key concepts: Finalizers (cleanup before delete), Owner References (cascading delete)."

Operator Pattern

What is itAn Operator is a Kubernetes-native way to automate the operational knowledge of a specific application. Technically, it's a custom controller that watches CRDs and reconciles the real world toward the declared state — but the term implies that the controller encodes domain-specific logic. For example, a PostgreSQL Operator knows how to do online backup, primary failover, streaming replication setup, minor version upgrades, TLS rotation, connection pooling. The pattern was coined by CoreOS in 2016. Operators are typically built with Operator SDK, Kubebuilder, or Metacontroller.
Key features
  • Reconciliation loop: Watch CR, compare to cluster state, take actions, update status.
  • Level-triggered logic: Always converges toward desired state, idempotent, tolerant of missed events.
  • Domain knowledge: Encodes SRE/DBA playbooks as code — backups, upgrades, failover, scaling.
  • OperatorHub.io: Public catalog of 300+ operators, launched by Red Hat with AWS, Google Cloud, and Microsoft, installable via Operator Lifecycle Manager (OLM).
  • Capability levels: Basic install → seamless upgrades → full lifecycle → deep insights → auto-pilot.
How it differs
  • vs Helm chart: A chart is a static install; an operator is a living process that reacts to changes.
  • vs shell scripts: Operators use the K8s reconciliation model — idempotent, event-driven, declarative.
  • vs traditional automation (Ansible): Operators are continuous; Ansible runs and exits.
Why use itOperators bring Day 2 automation to stateful apps that Kubernetes itself doesn't know how to handle. Instead of writing runbooks for "how to upgrade Kafka," the operator does it. They turn K8s into a universal control plane for any application.
Common gotchasWriting a robust operator is hard — must handle failures, retries, finalizers, leader election, multi-version support. Poorly written operators create incidents (delete PVCs, loop endlessly). Operator version upgrades are risky because they touch CRDs. Over-operatoring for simple problems adds unnecessary complexity.
Real-world examplesZalando Postgres Operator, CrunchyData PGO, Strimzi Kafka Operator (used by IBM, Red Hat), MongoDB Operator, Elastic Cloud on K8s, Vitess Operator (YouTube's sharded MySQL), Rook Ceph. Cluster API is itself an operator that manages clusters. Crossplane uses operators to manage cloud resources.
Simple Definition: CRDs + Custom Controllers encoding human operational knowledge. Automates Day 2 ops: deploy, upgrade, backup, scale, failover for complex applications.

Major Operators: cert-manager (TLS), Prometheus Operator (monitoring), CloudNativePG (PostgreSQL), Strimzi (Kafka), Crossplane (cloud infra), External Secrets Operator (secrets sync).

Build with: Kubebuilder (Go), Operator SDK (Go/Ansible/Helm), Metacontroller (any language).

Core pattern: Watch → Compare desired vs actual → Act. Runs continuously. Idempotent.
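The watch, compare, act cycle can be sketched without any Kubernetes client. In this toy model the "cluster" is just a set of instance names and the desired state is a dict; the names `reconcile`, `db`, and `stale-9` are illustrative assumptions. The point is that the function works from full state rather than from individual events (level-triggered), so running it again after convergence does nothing (idempotent).

```python
def reconcile(desired: dict, actual: set[str]) -> list[str]:
    """One reconciliation pass: converge `actual` toward `desired`, return actions taken."""
    wanted = {f"{desired['name']}-{i}" for i in range(desired["replicas"])}
    actions = []
    # Create whatever is declared but missing.
    for name in sorted(wanted - actual):
        actual.add(name)
        actions.append(f"create {name}")
    # Remove whatever exists but is no longer declared.
    for name in sorted(actual - wanted):
        actual.remove(name)
        actions.append(f"delete {name}")
    return actions

cluster = {"db-0", "stale-9"}            # observed state
spec = {"name": "db", "replicas": 3}     # declared state
print(reconcile(spec, cluster))          # first pass takes actions
print(reconcile(spec, cluster))          # second pass finds nothing to do
```

A real operator runs this pass whenever a watch fires or a resync timer expires, and writes the outcome to the resource's `/status`.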

Interview Answer: "Operator = CRDs + Controller encoding ops knowledge. Automates Day 2: deploy, upgrade, backup, failover. Key operators: cert-manager, Prometheus Operator, CloudNativePG, Strimzi, Crossplane. Built with Kubebuilder/Operator SDK. Core: reconciliation loop (watch→compare→act)."
Real-World Usage: Instead of a 47-step runbook for PostgreSQL HA, apply a 20-line Cluster CR. CloudNativePG handles StatefulSet creation, replication, automated backup to S3, and automatic failover. What took a DBA hours now takes seconds.

Platform Engineering, Cost & Strategy

IDPs, FinOps, DR, cloud strategy, DORA metrics, and compliance.

Internal Developer Platform (IDP)

What is itAn Internal Developer Platform (IDP) is an opinionated abstraction layer built on top of Kubernetes that hides its complexity from application developers. Instead of asking devs to understand Deployments, Services, Ingress, PDBs, HPAs, ConfigMaps, Secrets, and RBAC, an IDP provides a simplified self-service interface: a CLI, portal, or higher-level CRD where devs say "I need a new service" and the platform provisions everything. The approach was popularized by the DevOps Topologies / Team Topologies concept of "platform teams as product teams." Tools: Backstage (Spotify, CNCF), Crossplane, Humanitec, Port, Kratix.
Key features
  • Service catalog: Browse/search all services, owners, docs, SLOs, dependencies.
  • Golden paths: Templated "new service" workflows that scaffold repo, CI, manifests, dashboards.
  • Self-service provisioning: Request a database, queue, or cache as a simple form → platform handles K8s CRDs + Terraform.
  • Cost and SLO visibility: Dashboards per service for cost, errors, latency.
  • Platform abstractions: Higher-level CRDs like App, Environment, Database instead of raw K8s resources.
How it differs
  • vs raw K8s: Raw K8s is too low-level for most developers — too many concepts, too much YAML.
  • vs PaaS (Heroku): IDPs are your own Heroku — built on K8s, extensible, cost-controlled internally.
  • vs DIY scripts: IDPs centralize tribal knowledge into a sanctioned platform.
Why use itDeveloper productivity suffers if every team reinvents the wheel on K8s. An IDP lets one platform team serve many app teams, enforcing best practices (security, observability, cost) automatically and freeing developers to focus on business logic.
Common gotchasBuilding an IDP is expensive — 5-20+ engineers for 1-2 years before it's useful. "Platform team as product team" requires discipline and user research. Abstractions that leak (you still need to understand K8s to debug) create frustration. Backstage plugins require Node.js/React skills. Over-engineering platforms before there are enough users creates shelfware.
Real-world examplesSpotify Backstage (2020 open-source release) is the dominant portal. Other examples include Netflix Spinnaker + Paved Road, Airbnb Sparrow, Uber uDeploy, Monzo shipper, Twilio, and Zalando Stups. The 2023 CNCF Platforms White Paper formalized common patterns.
Simple Definition: Self-service layer on K8s. Developers deploy without deep K8s knowledge. Golden paths, templates, guardrails.
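A platform abstraction can be pictured as a function that expands a short developer-facing spec into the raw Kubernetes objects behind it. The sketch below is one possible shape, not how any particular IDP does it: the `App` fields (`name`, `image`, `port`, `replicas`) and the defaults are hypothetical, and a real golden path would also add probes, resource requests, HPAs, and dashboards.

```python
def expand_app(app: dict) -> list[dict]:
    """Expand a hypothetical high-level App spec into a Deployment and a Service."""
    name = app["name"]
    port = app.get("port", 8080)          # platform default (assumption)
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": app.get("replicas", 2),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": app["image"],
                    "ports": [{"containerPort": port}],
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    return [deployment, service]

# The developer writes three fields; the platform fills in the rest.
manifests = expand_app({"name": "checkout", "image": "registry.example.com/checkout:1.4.2"})
print([m["kind"] for m in manifests])
```

The leak risk noted above shows up here too: when the Deployment misbehaves, someone still has to read the generated objects.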

Tools: Backstage (CNCF portal), Crossplane (cloud infra as CRDs), Tilt/Skaffold (inner-loop dev).

Interview Answer: "IDP abstracts K8s complexity for developers. Backstage (service catalog+templates), Crossplane (cloud infra as K8s resources), golden paths (opinionated templates with CI/CD, monitoring pre-configured). Requires a platform team. Goal: developers focus on code."

FinOps & Cost Optimization

What is itFinOps (Cloud Financial Operations) is the operational practice of bringing financial accountability to the variable spend model of cloud — forecasting, allocating, optimizing, and governing cloud costs collaboratively across engineering, finance, and business teams. In a Kubernetes context, FinOps means making K8s spend visible per namespace, per team, per workload, and systematically reducing waste. The FinOps Foundation (under Linux Foundation) publishes frameworks, and tools like Kubecost, OpenCost (CNCF), CloudHealth, Spot by NetApp, and Densify bring K8s cost visibility.
Key features
  • Cost allocation: Attribute cloud spend to namespaces, labels, teams using resource requests as the key.
  • Right-sizing: Identify over-requested resources via VPA, Goldilocks, Kubecost recommendations.
  • Spot/preemptible instances: Cheap compute for fault-tolerant workloads (60-90% savings).
  • Karpenter consolidation: Automatically pick cheaper instance types and pack workloads densely.
  • Reserved / Savings Plans: Commit to baseline usage for 20-72% discounts.
  • Idle/orphaned resource cleanup: Unused PVs, unattached PVCs, old ReplicaSets.
How it differs
  • vs traditional capacity planning: Cloud is elastic and variable — traditional fixed-budget IT breaks.
  • vs showback: Showback is visibility only; FinOps drives accountability and change.
  • vs raw cloud cost tools: Cloud bills show instance costs, not pod costs — need K8s-aware tools.
Why use itUnchecked K8s costs balloon fast — engineers over-request, nodes sit idle, dev clusters run 24/7. FinOps practices routinely save 30-60%.
Common gotchasCost allocation using requests (not actual usage) penalizes conservative teams. Spot instances require workload redesign to tolerate interruption. Egress costs (data leaving a zone/region) are often invisible until they hurt. Shared services (logging, monitoring) are hard to charge back fairly. Cost tooling adds operational complexity of its own.
Real-world examplesSpotify saved millions by tagging every pod with cost-center and enforcing quotas. Adobe runs Kubecost across hundreds of clusters. Intuit built automated spot interruption handling. Pinterest uses VPA for right-sizing. The FinOps Foundation counts major cloud consumers as members.
Simple Definition: Managing K8s spend. Levers: right-sizing (VPA), Spot instances, KEDA scale-to-zero, Karpenter, eliminating idle resources.

Tools: Kubecost, OpenCost (CNCF), Goldilocks (VPA dashboard). Chargeback via labels attributes costs to teams.
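Request-based chargeback reduces to a proportional split. The sketch below allocates a monthly cluster bill by CPU requests only and reports how far requests exceed usage; the namespace names and numbers are made up, and real tools such as those named above also weigh memory, storage, and idle capacity.

```python
def allocate_by_requests(cluster_cost: float, cpu_requests: dict[str, float]) -> dict[str, float]:
    """Split a cluster bill across namespaces in proportion to requested CPU cores."""
    total = sum(cpu_requests.values())
    if total == 0:
        raise ValueError("no CPU requests to allocate against")
    return {ns: round(cluster_cost * cpu / total, 2) for ns, cpu in cpu_requests.items()}

def overprovision_pct(requested: float, used: float) -> int:
    """Share of requested capacity that is not actually used, as a percentage."""
    return round(100 * (1 - used / requested))

# Illustrative numbers only.
print(allocate_by_requests(10_000, {"payments": 30, "search": 50, "batch": 20}))
print(overprovision_pct(requested=50, used=15))
```

Because the split keys on requests rather than usage, a team that requests 50 cores and uses 15 pays for all 50, which is exactly the incentive right-sizing relies on.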

Interview Answer: "Right-size with VPA (30-50% savings), Spot for stateless (60-90%), KEDA scale-to-zero, Karpenter node consolidation. Kubecost/OpenCost for visibility. ResourceQuotas cap namespaces. Chargeback via labels. Most clusters are 60-80% over-provisioned."
Real-World Usage: Company's $180K/mo K8s bill dropped to $70K (61% reduction): right-sizing saved $50K, Karpenter+Spot saved $35K, KEDA scale-to-zero saved $15K, shutdown dev off-hours saved $10K.
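The figures in that example can be checked with a few lines of arithmetic, using only the numbers quoted in the text:

```python
baseline = 180_000  # monthly bill before optimization, USD

savings = {
    "right-sizing": 50_000,
    "Karpenter + Spot": 35_000,
    "KEDA scale-to-zero": 15_000,
    "dev off-hours shutdown": 10_000,
}

total_saved = sum(savings.values())
new_bill = baseline - total_saved
reduction_pct = round(100 * total_saved / baseline)

print(total_saved, new_bill, reduction_pct)
```

The four levers sum to $110K, leaving a $70K bill, a reduction of about 61%.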

Disaster Recovery & Chaos Engineering

What is itDisaster Recovery (DR) is the plan and tooling for restoring service after catastrophic failures: region outage, cluster corruption, ransomware, accidental delete. Key metrics: RTO (Recovery Time Objective) — how fast you recover — and RPO (Recovery Point Objective) — how much data you can afford to lose. Chaos Engineering is the discipline of proactively breaking systems in controlled ways to verify resilience before real incidents. Pioneered by Netflix with the Chaos Monkey family of tools. In K8s, DR tools include Velero (backup/restore of K8s objects + PVs), Kasten K10 (Veeam), and Portworx PX-Backup; chaos tools include Chaos Mesh (CNCF), LitmusChaos, and Gremlin.
Key features
  • Velero backup: Backup K8s manifests + PVs (via CSI snapshots) to S3/GCS/Azure.
  • Cluster rebuild: GitOps + Velero = recover a lost cluster from Git and storage.
  • etcd snapshots: Point-in-time backups of the control plane state.
  • Chaos Mesh experiments: Inject pod kills, network partitions, CPU stress, DNS failures, clock skew.
  • GameDays: Scheduled live-fire drills where teams practice incident response.
How it differs
  • DR vs HA: HA handles expected failures (single node death); DR handles catastrophic failures (entire region).
  • vs traditional tape backups: K8s DR is manifest-centric plus PV snapshots, more complex than file backups.
  • Chaos Engineering vs fault injection in tests: Chaos runs in production or prod-like environments, continuously.
Why use itThe question isn't if you'll have an incident, it's when. DR plans that haven't been tested don't work. Chaos Engineering builds organizational confidence that failover actually works — you discover the untested paths before customers do.
Common gotchasUntested backups are common and catastrophic. PV snapshots may be crash-consistent but not application-consistent (DB corruption). DNS TTLs can delay failover. Chaos experiments without proper scope cause real outages. Restoring to a new cluster often reveals hardcoded environment assumptions.
Real-world examplesNetflix pioneered Chaos Monkey and Simian Army (2011). Amazon runs GameDays quarterly across services. Alibaba tests zonal failover ahead of Double 11. Slack published post-mortems on region failover tests. Velero (from Heptio, now VMware) is the default K8s backup tool.
Simple Definition: DR: Velero (backup/restore), etcd snapshots, Volume Snapshots. Chaos engineering (Chaos Mesh, Litmus) validates resilience by injecting failures.

RTO = max downtime. RPO = max data loss. Lower values = higher cost (multi-region).
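One way to make RTO and RPO concrete is to compare them against what the backup setup can actually deliver. In this simplified sketch the worst-case data loss is taken to be the interval between backups and the downtime is the measured restore time; the function name and the sample values are assumptions.

```python
from datetime import timedelta

def meets_objectives(backup_interval: timedelta, restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict[str, bool]:
    """Check a backup schedule and a measured restore time against RPO and RTO."""
    return {
        # Worst case: failure strikes just before the next backup would have run.
        "rpo_ok": backup_interval <= rpo,
        # Restore time should come from an actual drill, not an estimate.
        "rto_ok": restore_time <= rto,
    }

result = meets_objectives(
    backup_interval=timedelta(hours=1),
    restore_time=timedelta(hours=3),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=4),
)
print(result)
```

Here hourly backups cannot satisfy a 15-minute RPO, so the options are more frequent snapshots or continuous replication, both of which cost more.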

Patterns: Multi-AZ (minimum), Active-Active (lowest RTO), Active-Passive (simpler).

Interview Answer: "DR: Velero for K8s+volume backup, etcd snapshots for cluster state. RTO/RPO drive architecture and cost. Multi-AZ minimum for prod. Chaos engineering (Chaos Mesh, Litmus) validates DR before real incidents."

Cloud Strategy & Vendor Lock-In

What is itOrganizations choose between single-cloud, multi-cloud, hybrid cloud, and on-prem. Kubernetes is often sold as the "portability layer" that lets you run anywhere — and it is, for the core workloads. But every cloud has its own IAM, load balancers, managed databases, secrets managers, observability. Using those creates vendor lock-in. Avoiding them entirely is expensive and gives up benefits. Modern thinking balances intentional lock-in for differentiated services vs portable abstractions (K8s, Terraform, OpenTelemetry, Postgres) for the commodity layer. Tools like Crossplane, Cluster API, Terraform, and Istio multi-cluster support multi-cloud where needed.
Key features
  • Commodity vs differentiation: Use managed services where they're commodities (storage, DNS, K8s control plane) and portable stacks where you need flexibility.
  • Exit strategies: Document how to leave each cloud; test by running workloads in a second cloud.
  • Abstraction layers: Crossplane exposes cloud resources as K8s CRDs, giving cross-cloud provisioning a Terraform-like declarative workflow through the Kubernetes API.
  • Data gravity: Large datasets (petabytes) are expensive to move — data is often the real lock-in.
How it differs
  • Single-cloud: Simpler ops, deeper managed service usage, higher lock-in risk.
  • Multi-cloud: Flexibility, redundancy, pricing leverage — but operational complexity and reduced access to cloud-specific features.
  • Hybrid: On-prem + cloud, often regulatory-driven.
  • Cloud-agnostic K8s stack: Portable compute but loses managed-service productivity.
Why use itCloud strategy is ultimately a business decision about risk, cost, and agility. K8s gives engineering teams a common substrate regardless of cloud, which reduces (but doesn't eliminate) lock-in.
Common gotchas"Cloud-agnostic" often means "least-common-denominator — worst of both worlds." Multi-cloud networking and identity federation are hard. Egress between clouds is expensive. Multi-cloud as a reflex goal wastes money; do it when there's a business reason (regulation, M&A risk, pricing).
Real-world examplesApple is famously multi-cloud (AWS + GCP + own data centers). Dropbox famously moved off AWS to save money. Capital One migrated entirely to AWS. Netflix is all-in on AWS. Snap uses both GCP and AWS. Basecamp/37signals (2023) very publicly left the cloud for on-prem, projecting savings of about $7M over five years.
Simple Definition: K8s provides compute portability. Real lock-in: managed services (RDS, BigQuery), data gravity, IAM. Hybrid via Anthos, Azure Arc, EKS Anywhere.
Interview Answer: "K8s gives compute portability. Real lock-in: managed services, data gravity, IAM. Strategy: abstract where practical (PostgreSQL over Aurora), accept lock-in where value is high. Multi-cloud is expensive — most orgs are hybrid by circumstance."

DORA Metrics, Conway's Law & Organizational Impact

What is itDORA metrics are four key software delivery performance indicators derived from the Google DORA research (State of DevOps Report): deployment frequency, lead time for changes, change failure rate, and mean time to restore (MTTR). Elite performers deploy multiple times per day with <1 hour lead time and <15% change failure rate. Conway's Law (1968) states: "any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure." In other words, your software architecture mirrors your org chart, and K8s platforms encode organizational boundaries in namespaces, RBAC, and CRDs.
Key features
  • DORA metric levels: Elite, High, Medium, Low performers — with multi-year gaps between them.
  • Inverse Conway Maneuver: Redesign the org chart to get the architecture you want.
  • Team Topologies (2019): Book formalizing Stream-aligned, Platform, Complicated Subsystem, and Enabling team types.
  • Platform team as enablers: Build the IDP, serve dev teams as internal customers.
How it differs
  • vs vanity metrics: DORA is outcome-focused, not activity-based (unlike "commits per dev").
  • vs ITIL: ITIL emphasizes process and change control; DORA's research found that heavyweight change approval slows delivery without lowering failure rates.
  • Conway's Law vs architecture-first: You cannot fix architecture without fixing the org — technical change alone fails.
Why use itDORA metrics give leadership a data-driven view of engineering effectiveness. Conway's Law is a reminder that introducing K8s won't help if your team structure creates bottlenecks. The best K8s migrations are accompanied by org restructuring toward small, autonomous product teams supported by a central platform team.
Common gotchasMeasuring DORA naively (deploys to staging count as deploys?) gives misleading numbers. Pushing teams to improve metrics without changing structure leads to gaming. Treating platform teams as gatekeepers rather than enablers kills adoption. Conway's Law is usually ignored until it breaks the system.
Real-world examplesGoogle, Amazon, Netflix, Shopify are elite DORA performers. Capital One's transformation is documented in "Accelerate" (Forsgren, Humble, Kim). Allianz, Nationwide, Gov.uk restructured teams alongside K8s adoption. Team Topologies influences nearly every modern platform engineering org.
Simple Definition: DORA = deployment frequency, lead time, MTTR, change failure rate. Conway's Law = systems mirror team structure. Platform teams enable autonomous product teams.
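The four metrics can be computed from a plain list of production deployment records. This is a minimal sketch under stated assumptions: the record fields (`committed`, `deployed`, `restored`) and the sample data are invented, a deployment counts as failed when it has a `restored` timestamp, and lead time is reported as a median in hours.

```python
from datetime import datetime, timedelta
from statistics import mean, median

def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600

def dora_metrics(deploys: list[dict], period_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over a non-empty list of production deploys."""
    lead = [hours(d["deployed"] - d["committed"]) for d in deploys]
    restores = [hours(d["restored"] - d["deployed"]) for d in deploys if d["restored"] is not None]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_h": median(lead),
        "change_failure_rate": len(restores) / len(deploys),
        "mean_time_to_restore_h": mean(restores) if restores else 0.0,
    }

t0 = datetime(2024, 1, 1, 9, 0)
h = lambda n: t0 + timedelta(hours=n)
# Four production deploys over two days; the second one caused an incident.
DEPLOYS = [
    {"committed": h(0),  "deployed": h(1),  "restored": None},
    {"committed": h(2),  "deployed": h(5),  "restored": h(5.5)},
    {"committed": h(24), "deployed": h(26), "restored": None},
    {"committed": h(30), "deployed": h(34), "restored": None},
]
print(dora_metrics(DEPLOYS, period_days=2))
```

Feeding staging deploys into the same function would inflate deployment frequency, which is the naive-measurement gotcha described above.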

Team Topologies: Platform team (builds K8s platform), Stream-aligned (product features), Enabling (adoption help).

TCO: Compute + Storage + Networking + Licensing + People (biggest cost) + Opportunity cost.

Interview Answer: "DORA metrics measure engineering effectiveness. K8s+GitOps should improve all four. Conway's Law: architecture mirrors teams. Inverse Conway Maneuver: shape teams for desired architecture. Platform team of 5 can support 80+ developers. TCO: people cost often exceeds infrastructure."

Compliance, CIS Benchmarks & Zero Trust

What is itRegulated industries must demonstrate their K8s clusters meet compliance frameworks: SOC2, ISO 27001, PCI-DSS, HIPAA, FedRAMP, GDPR. The CIS Kubernetes Benchmark is a widely-used checklist of hardening configurations — over 100 controls covering etcd encryption, API server auth, RBAC, NetworkPolicy, Pod Security Standards, and more. Zero Trust is a security model that assumes no implicit trust — every request is authenticated and authorized, even from inside the network. In K8s, zero-trust is implemented via mTLS (service mesh), strict NetworkPolicies, admission controllers, and identity-based authorization.
Key features
  • kube-bench: Open-source tool that runs CIS Benchmark checks against your cluster.
  • Audit logging: Every API request is logged with user identity — essential for compliance evidence.
  • Encryption at rest: Secrets encrypted in etcd via KMS providers.
  • TLS everywhere: Control plane communication and pod-to-pod traffic encrypted.
  • Workload identity: SPIFFE/SPIRE, GKE Workload Identity, EKS IRSA replace long-lived credentials with short-lived tokens.
How it differs
  • Zero Trust vs traditional perimeter security: Perimeter assumes inside-the-firewall is safe; zero-trust verifies every hop.
  • vs check-the-box compliance: Real security requires continuous validation, not annual audits.
  • CIS Benchmark vs PCI-DSS: CIS is technical hardening; PCI is industry-specific data protection.
Why use itCompliance is a business requirement for finance, healthcare, government, and public companies. CIS Benchmark provides a concrete baseline. Zero Trust minimizes blast radius when (not if) an attacker gets inside.
Common gotchasCIS Benchmark has well over 100 controls and many are impractical; you'll rationalize some with "compensating controls." Zero-trust adds complexity; incomplete implementations (mTLS in mesh but no NetworkPolicy) leave gaps. Audit log volume can blow up storage. Compliance scope creep (treating dev clusters like prod) wastes effort.
Real-world examplesUS DoD Platform One runs FedRAMP-high K8s (Iron Bank registry). Capital One, JPMorgan, Goldman Sachs run PCI-compliant K8s clusters. BeyondCorp (Google's internal zero-trust architecture) inspired the industry. NIST SP 800-190 documents container security. Tools like kube-bench, kube-hunter, Trivy, and Starboard automate compliance checks.
Simple Definition: CIS Benchmarks (kube-bench) for hardening. Compliance (SOC2/PCI/HIPAA) via RBAC + NetworkPolicy + mTLS + audit logs + policy engines. Zero Trust = never trust, always verify.

Zero Trust stack: mTLS (service mesh), least-privilege RBAC, default-deny NetworkPolicies, PSS Restricted, short-lived credentials (Vault/IRSA), signed images only.
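As a small taste of compliance-as-code, the default-deny item in that stack can be audited mechanically. The sketch below takes NetworkPolicy objects as plain dicts (for example, parsed from `kubectl get networkpolicy -A -o json`) and lists namespaces with no default-deny ingress policy. The helper names are assumptions, and the check is intentionally strict: it requires an explicit `policyTypes` entry even though Kubernetes would default it.

```python
def is_default_deny_ingress(policy: dict) -> bool:
    """True if the policy selects all pods and allows no ingress traffic."""
    spec = policy.get("spec", {})
    return (
        spec.get("podSelector") == {}                  # empty selector = every pod
        and "Ingress" in spec.get("policyTypes", [])   # explicitly covers ingress
        and not spec.get("ingress")                    # no allow rules
    )

def namespaces_missing_default_deny(namespaces: list[str], policies: list[dict]) -> list[str]:
    """Return namespaces that have no default-deny ingress NetworkPolicy."""
    covered = {p["metadata"]["namespace"] for p in policies if is_default_deny_ingress(p)}
    return sorted(set(namespaces) - covered)

# Illustrative input: one namespace is locked down, one only has an allow rule.
POLICIES = [
    {"metadata": {"name": "default-deny", "namespace": "payments"},
     "spec": {"podSelector": {}, "policyTypes": ["Ingress"]}},
    {"metadata": {"name": "allow-web", "namespace": "search"},
     "spec": {"podSelector": {"matchLabels": {"app": "web"}},
              "policyTypes": ["Ingress"], "ingress": [{}]}},
]
print(namespaces_missing_default_deny(["payments", "search", "batch"], POLICIES))
```

Running a check like this on a schedule, or enforcing the same rule through a policy engine at admission time, is what turns an annual audit item into continuous validation.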

Interview Answer: "Compliance: RBAC (access), NetworkPolicy (segmentation), mTLS (encryption), audit logging (traceability), policy engines (automated enforcement). CIS benchmarks via kube-bench. Zero Trust: mTLS, least-privilege, default-deny, short-lived creds, signed images. Compliance-as-code = continuous enforcement."