Core Concepts
The fundamental building blocks of Kubernetes — containers, clusters, pods, and how they connect.
Container
namespaces (for isolation of PID, network, mount, UTS, IPC, user), cgroups (for resource limits: CPU, memory, I/O), and union filesystems (overlay/overlayfs for layered images). Containers start in milliseconds, weigh tens of megabytes, and behave identically across laptops, CI servers, and production clusters.
- Process-level isolation: Each container is just a Linux process (or tree) with restricted visibility — no separate kernel, no hypervisor.
- Immutable images: A container image is a read-only snapshot; changes happen in a thin writable layer that is discarded when the container dies.
- Portability: Build once, run anywhere that speaks the OCI runtime spec (Linux x86_64/arm64, Windows containers on Server 2019+).
- Declarative builds: A Dockerfile or Containerfile describes the image deterministically, enabling reproducible artifacts.
- Density: Hundreds of containers per host vs. dozens of VMs — no duplicated kernel and no 2 GB guest OS footprint.
- vs Virtual Machines: VMs virtualize hardware via a hypervisor (KVM, Hyper-V, ESXi) and run a full guest OS. Containers virtualize the OS. VMs take 30+ seconds to boot and 1-4 GB RAM each; containers boot in <1 second with <50 MB overhead.
- vs chroot/jails: Containers add cgroups (resource limits) and network namespaces that classic chroot or BSD jails lack.
- vs PaaS (Heroku, GAE): You control the entire image — OS packages, runtime version, every binary. PaaS hides this and imposes opinionated stacks.
- vs Unikernels: Unikernels compile app + kernel into a single binary; containers keep the host kernel but isolate userspace.
tini or --init). Writing to the container filesystem at runtime defeats immutability — use volumes instead. Image bloat (1+ GB images full of build tools) is common; use multi-stage builds and distroless bases.
Deep Dive:
Containers use two Linux kernel features to work:
- Namespaces — Give each container its own isolated view of the system (process tree, network, filesystem, users). A container can't see other containers or the host processes.
- Cgroups (Control Groups) — Limit how much CPU, memory, and disk I/O a container can use (network bandwidth is shaped separately, for example with traffic control). Prevents one container from starving others.
Unlike Virtual Machines, containers share the host OS kernel. This makes them:
- Start in milliseconds (VMs take tens of seconds to minutes)
- Use MBs of memory (VMs use GBs)
- Near-native performance (no hypervisor overhead)
A container image is built in layers. Each instruction in a Dockerfile (FROM, COPY, RUN) creates a layer. Layers are cached and shared — if 10 services use the same base image, that layer is stored only once.
Docker
docker CLI, the dockerd daemon, the BuildKit builder, Docker Desktop, Docker Compose, and Docker Hub (the largest public image registry). Docker didn't invent containers — chroot, Solaris Zones, and LXC on Linux had existed for years — but Docker gave them a developer-friendly UX, a standardized image format (now the OCI image spec), and a network effect via Docker Hub.
- Dockerfile: Declarative build file with FROM, RUN, COPY, CMD instructions; layers are cached independently for fast rebuilds.
- BuildKit: Modern builder with parallelism, secret mounts, cache mounts, and multi-platform (amd64+arm64) support.
- Docker Compose: Declarative multi-container local dev via docker-compose.yml — databases, caches, app in one file.
- Registry protocol: Push/pull to Docker Hub, GHCR, ECR, GAR, Quay, Harbor — all speak the same OCI distribution spec.
- Volumes and networks: First-class data volumes and user-defined bridge/overlay networks for local orchestration.
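As a rough illustration of the Docker Compose bullet above, a compose file for an app with a database and a cache might look like the following. This is a minimal sketch: the service names, credentials, port, and image tags are placeholders chosen for the example, and it assumes a Dockerfile in the current directory.

```yaml
# docker-compose.yml (illustrative; all names and credentials are placeholders)
services:
  app:
    build: .                      # build from the Dockerfile in this directory
    ports:
      - "3000:3000"               # host:container
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    depends_on:
      - db
      - cache
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app      # for local development only
      POSTGRES_DB: app
    volumes:
      - db-data:/var/lib/postgresql/data   # named volume survives container restarts
  cache:
    image: redis:7
volumes:
  db-data:
```

Running docker compose up -d starts all three containers on a shared user-defined network, where each service is reachable by its service name (db, cache).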
- vs Podman: Podman is daemonless, rootless by default, and drop-in compatible (alias docker=podman). Red Hat's answer to the Docker daemon's historical root privilege concerns.
- vs containerd: containerd is the low-level runtime that Docker itself now uses under the hood. Kubernetes since v1.24 talks directly to containerd, skipping Docker entirely ("dockershim removed").
- vs CRI-O: CRI-O is a minimal Kubernetes-only runtime from Red Hat, built specifically for the Kubernetes Container Runtime Interface.
- vs Buildah/Kaniko/Jib: Alternative image builders — Kaniko builds inside a Kubernetes pod without Docker daemon; Jib builds JVM images without a Dockerfile.
containerd or CRI-O directly, but Docker remains dominant for building images.
Deep Dive:
Docker consists of several components:
- Docker CLI — Command-line interface (docker build, docker run, docker push)
- Docker Daemon (dockerd) — Background service that manages images and containers
- containerd — The actual container runtime that Docker uses internally
- Dockerfile — A text file with step-by-step instructions to build an image
# Example Dockerfile
# Comments must sit on their own line; Dockerfiles do not support trailing comments.
# Base image
FROM node:20-alpine
# Set working directory
WORKDIR /app
# Copy dependency manifests first so the install layer stays cached
COPY package*.json ./
# Install production dependencies only
RUN npm ci --omit=dev
# Copy application code
COPY . .
# Document the port
EXPOSE 3000
# Start command
CMD ["node", "server.js"]
Important distinction: Kubernetes dropped Docker as its container runtime in v1.24. But this does NOT mean Docker is dead. Docker is still the standard tool for building images. K8s just uses containerd directly to run them (cutting out the Docker daemon middleman). Your Docker-built images work perfectly on K8s.
Kubernetes (K8s)
- Declarative API: You submit desired state (YAML manifests) to the API server, stored in etcd; controllers make it happen.
- Self-healing: Crashed pods are restarted, failed nodes are drained and their pods rescheduled, unresponsive containers are killed.
- Horizontal scaling: Scale workloads with a single command or automatically via HPA, VPA, KEDA, Cluster Autoscaler.
- Service discovery + load balancing: Built-in DNS (CoreDNS) plus virtual IPs (Services) that survive pod churn.
- Rolling updates and rollbacks: Zero-downtime deployments with configurable surge/unavailability budgets.
- Extensibility: CRDs + Operators let you teach Kubernetes about new object types (databases, queues, certificates, cloud resources).
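To make the automatic scaling bullet above concrete, here is a minimal HorizontalPodAutoscaler sketch. It assumes a Deployment named payment-api exists, that its containers declare CPU requests, and that a metrics source such as metrics-server is installed; the replica bounds and the 70% target are illustrative values, not recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api              # assumed name, matching the target Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 3                 # never scale below this
  maxReplicas: 10                # never scale above this
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # percent of the pods' CPU requests
```

The controller adjusts the Deployment's replica count so that average CPU utilization across pods trends toward the target.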
- vs Docker Swarm: Simpler to learn but far less capable. Swarm has basically lost the orchestration war — Docker Inc. now promotes K8s.
- vs HashiCorp Nomad: Nomad is simpler, supports non-container workloads (VMs, raw binaries, Java), but has a much smaller ecosystem.
- vs AWS ECS: ECS is simpler and deeply AWS-integrated, but locks you to AWS and lacks K8s's declarative extensibility.
- vs Mesos/Marathon: Mesos was the orchestration leader circa 2015 (Twitter, Airbnb, Apple Siri all used it) but lost to K8s; DC/OS is effectively dead.
- vs OpenShift: OpenShift is Red Hat's opinionated K8s distribution — adds developer UX, builds, routes, stricter security, paid support.
kubelet, kube-proxy, CNI, DNS, and more. Running etcd and the control plane yourself is hard (backups, upgrades, certificate rotation) — most teams use managed offerings (EKS, GKE, AKS). Cost can balloon if you oversize resources. Stateful workloads (databases) are trickier than stateless.
Deep Dive:
Kubernetes was designed by Google based on 15 years of running production workloads on their internal system called Borg. Open-sourced in 2014, now maintained by the CNCF.
What Kubernetes actually does:
- Scheduling — Decides which machine runs which container based on resource needs, constraints, and policies
- Self-healing — Restarts crashed containers, replaces unresponsive Pods, kills containers failing health checks
- Scaling — Horizontally (more replicas) or vertically (more CPU/memory), automatically or manually
- Service discovery & load balancing — Gives Pods DNS names, distributes traffic
- Rolling updates & rollbacks — Deploy new versions with zero downtime, roll back if something breaks
- Secret & config management — Inject configuration and credentials without baking them into images
- Storage orchestration — Automatically attach cloud disks, NFS, or other storage to Pods
The Declarative Model — the most important concept:
You tell K8s what you want (desired state in YAML), not how to do it. K8s continuously reconciles actual state with desired state. If you say "I want 5 replicas" and one crashes, K8s creates a new one automatically. This reconciliation loop runs forever.
Cluster
kubelet agent. A single cluster can span thousands of nodes (the official upper limit is 5,000 nodes, 150,000 pods) and dozens of availability zones, behaving as one unified compute fabric.
- Control plane components: kube-apiserver (the front door), etcd (source of truth), kube-scheduler (places pods), kube-controller-manager (runs core controllers), cloud-controller-manager (talks to cloud APIs).
- Worker node components: kubelet (runs pods), kube-proxy (networking), a container runtime (containerd/CRI-O).
- Flat networking: Every pod gets its own routable IP, every pod can talk to every other pod without NAT (the "Kubernetes networking model").
- HA control plane: Production clusters run 3 or 5 control-plane replicas for quorum on etcd.
- vs a single Docker host: A cluster provides scheduling, failover, and multi-node networking — a single host has none of these.
- vs Nomad cluster: Nomad clusters are simpler to stand up (single binary) but lack K8s's rich object model.
- vs ECS cluster: An ECS cluster is really just a logical grouping of EC2/Fargate capacity; a K8s cluster is a full API-driven system.
- vs a Borg cell: Google's internal Borg uses "cells" of 10,000+ machines — K8s was intentionally scoped smaller per cluster, favoring multi-cluster federation.
Namespaces. The cluster becomes the unit of capacity planning, security, and upgrade.
etcd backups are critical and often neglected. Noisy neighbors can starve other tenants without resource limits. Networking issues across zones or clusters are hard to debug. Single-cluster blast radius is real — many orgs run 10-50 clusters instead of one giant cluster for isolation.
Deep Dive:
A cluster has two types of machines:
Control Plane Nodes (Masters):
- Run the "brain" of Kubernetes — API Server, etcd, Scheduler, Controller Manager
- Typically 3 or 5 nodes for high availability (an odd number is recommended: etcd needs a majority, so a 4th member adds no extra failure tolerance over 3)
- Should NOT run application workloads in production
- In managed K8s (EKS, GKE, AKS), the cloud provider manages these entirely
Worker Nodes:
- Run your actual application Pods
- Each runs: kubelet (agent), kube-proxy (networking), container runtime (containerd)
- Can be physical servers, VMs, or cloud instances
- Can be added/removed dynamically (auto-scaling)
- Can have different sizes (mix of large and small instances)
Node
kubelet (which talks to the API server, pulls pod specs, and supervises containers via the container runtime), kube-proxy (which implements service-level networking using iptables, IPVS, or eBPF), and a container runtime like containerd or CRI-O. Nodes register themselves with the control plane and report health, capacity, and running pods via periodic heartbeats.
- Capacity and allocatable resources: Each node advertises CPU, memory, ephemeral storage, and GPUs; the scheduler uses this for bin-packing.
- Conditions: Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable — reported by kubelet.
- Labels and taints: Labels (zone=us-east-1a, gpu=nvidia-a100) steer workloads; taints repel them unless pods have matching tolerations.
- Drains and cordons: kubectl drain safely evicts pods; cordon marks a node unschedulable for maintenance.
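The labels-and-taints bullet above can be sketched as a Pod that both selects a labeled node and tolerates a taint. This is an illustration under stated assumptions: the taint key dedicated=gpu, the Pod name, and the image are hypothetical, and the node is assumed to have been tainted beforehand with kubectl taint nodes <node-name> dedicated=gpu:NoSchedule.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer                  # hypothetical name
spec:
  nodeSelector:
    gpu: nvidia-a100                 # only nodes carrying this label are candidates
  tolerations:
    - key: "dedicated"               # hypothetical taint key
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"           # allows scheduling onto nodes with this taint
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.0   # placeholder image
```

The label attracts the Pod to GPU nodes, while the toleration only permits it to land there; a toleration alone does not steer a Pod toward a tainted node.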
- vs a VM in isolation: A node is a fungible unit of capacity — K8s expects nodes to come and go (spot instances, autoscaling). Traditional VMs are pets, nodes are cattle.
- vs a Nomad client: Functionally similar — both register with a control plane and run scheduled work.
- vs an ECS container instance: Same concept, AWS-specific terminology.
tolerationSeconds. Kubelet bugs can orphan containers. Running too many pods per node (default limit: 110 per node) causes IP exhaustion and scheduling issues. Node upgrades require draining, which can break apps without proper PodDisruptionBudgets. Disk pressure from log volume is a classic production incident.
Deep Dive:
Every worker node runs three essential components:
- kubelet — The agent that communicates with the control plane. It receives Pod specifications and ensures the described containers are running and healthy.
- kube-proxy — Maintains network rules for Service routing. Implements iptables or IPVS rules.
- Container runtime — The software that actually runs containers (containerd or CRI-O).
Node lifecycle:
- Registration — Node joins the cluster and registers with the API server
- Heartbeat — kubelet sends heartbeats via Lease objects in the kube-node-lease namespace
- NotReady — If heartbeats stop, the node is marked NotReady after a grace period (40s by default; the exact default varies by release)
- Eviction — After 5 minutes of NotReady (the default tolerationSeconds of 300), Pods are evicted and replacements are scheduled on other nodes
- Drain — kubectl drain gracefully evicts all Pods before maintenance
- Cordon — kubectl cordon marks a node as unschedulable without evicting existing Pods
Pod
- Shared network: All containers in a pod share localhost — a sidecar proxy on port 15001 can intercept the main app's traffic on port 8080 transparently.
- Init containers: Run to completion sequentially before app containers start — used for setup (DB schema migrations, secret fetching, permission fixes).
- Sidecar pattern: Helper containers running alongside the main app (log shippers, proxies, secret rotators). Native sidecars (init containers with restartPolicy: Always) have been enabled by default since K8s 1.29 and became stable in 1.33.
- Lifecycle hooks: postStart and preStop run code when containers start/stop for graceful handling.
- Restart policies: Always (default for Deployments), OnFailure (Jobs), Never (run-once debugging).
- vs a container: A pod is a logical host; containers within share resources like processes inside the same VM.
- vs a Docker Compose service: Compose runs containers on one host but doesn't group them into a shared-network unit the way pods do.
- vs Nomad task group: Nomad's "task group" is the closest analogue — a co-located set of tasks.
- vs an ECS task: Very similar concept — ECS tasks also group containers with shared network.
Services, never raw IPs. Deleting a pod directly is rarely what you want — delete the controlling Deployment instead. Pods can be stuck in Terminating if finalizers hang. Cross-container communication within a pod uses localhost, not the container name. Pod-level resource requests are the sum of container requests, used by the scheduler.
istio-proxy envoy container. Vault Agent injector adds a sidecar that pulls secrets. Fluent Bit DaemonSet ships logs. Netflix and Airbnb run tens of thousands of pods per cluster at peak traffic. The average microservice pod has 2-3 containers (app + sidecar + init).
Deep Dive:
Key properties:
- Shared network — All containers in a Pod share one IP address and port space. They talk to each other via localhost
- Co-scheduled — All containers in a Pod always run on the same node
- Ephemeral — Pods are disposable. Created, destroyed, and replaced, never "repaired"
Multi-container patterns:
- Sidecar — Helper container (Envoy proxy, log shipper, monitoring agent)
- Init Container — Runs before the main container to set up prerequisites
- Ambassador — Proxy for outbound connections
- Adapter — Transforms output (log format conversion)
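The shared-storage property and the init/sidecar patterns above can be combined in one small Pod. The following is a sketch rather than a production manifest: the Pod name, image tags, and paths are assumptions for illustration. An init container writes a marker file, nginx writes its access log into a shared emptyDir volume, and a second container tails that log.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar           # hypothetical name
  labels:
    app: web
spec:
  volumes:
    - name: logs
      emptyDir: {}                 # scratch volume that lives as long as the Pod
  initContainers:
    - name: prepare                # runs to completion before the app containers start
      image: busybox:1.36
      command: ["sh", "-c", "echo ready > /var/log/app/init.txt"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: nginx:1.27
      ports:
        - containerPort: 80
      volumeMounts:
        - name: logs
          mountPath: /var/log/nginx   # nginx writes access.log into the shared volume
    - name: log-tailer             # sidecar reading what the app writes
      image: busybox:1.36
      command: ["sh", "-c", "tail -F /logs/access.log"]
      volumeMounts:
        - name: logs
          mountPath: /logs
          readOnly: true
```

Because both containers share the Pod's network namespace, the sidecar could also reach nginx at localhost:80 without any Service.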
Namespace
api if they live in different namespaces) and is the natural unit for RBAC, resource quotas, network policies, and billing chargebacks. Every K8s cluster ships with four default namespaces: default, kube-system (control plane components), kube-public (cluster info readable by all), and kube-node-lease (node heartbeats).
- Name scoping: Namespaced objects (Pods, Services, Deployments, ConfigMaps, Secrets) live in exactly one namespace.
- Cluster-scoped exceptions: Nodes, PersistentVolumes, StorageClasses, CRDs, ClusterRoles are NOT namespaced — they span the whole cluster.
- DNS convention: Services are reachable at svc-name.namespace.svc.cluster.local.
- ResourceQuotas: Cap CPU/memory/storage/object counts per namespace.
- LimitRange: Default or cap container-level requests/limits to prevent runaway pods.
- vs a separate cluster: Namespaces share the same control plane, nodes, and network — cheaper but weaker isolation. For hard multi-tenancy, use separate clusters or vClusters.
- vs Linux namespaces: Unrelated — Linux namespaces isolate processes; K8s namespaces are an API-object partition.
- vs OpenShift projects: OpenShift "projects" are namespaces with extra metadata and a default quota/RBAC bundle.
team-payments, team-search), per environment (dev, staging, prod), or per customer. They provide the boundary for RBAC (who can do what), resource quotas (how much a team can consume), and network policies (who can talk to whom). Without namespaces, a cluster becomes a free-for-all.
kubectl operates on default unless you pass -n or switch context. Cross-namespace references (e.g., a Service in another namespace) require a namespace-qualified name such as svc-name.namespace or the full FQDN.
Deep Dive:
Default namespaces: default, kube-system, kube-public, kube-node-lease.
What namespaces scope: Pods, Services, Deployments, ConfigMaps, Secrets, Roles, ServiceAccounts, PVCs.
What namespaces DON'T scope (cluster-wide): Nodes, PersistentVolumes, ClusterRoles, StorageClasses, Namespaces themselves.
Common strategies: per-team (team-payments), per-app (app-checkout), per-environment (staging), or hybrid.
Governance tools: ResourceQuota (cap resources), LimitRange (set defaults), NetworkPolicy (firewall), RBAC (access control).
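The governance tools listed above might be wired together as follows for a per-team namespace. This is a minimal sketch: the namespace name reuses the team-payments example, while the object names and all numeric limits are arbitrary illustrative values.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota        # hypothetical name
  namespace: team-payments
spec:
  hard:
    requests.cpu: "10"             # total CPU requests allowed in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                     # cap on the number of Pods
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults         # hypothetical name
  namespace: team-payments
spec:
  limits:
    - type: Container
      defaultRequest:              # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                     # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```

Pairing the two matters: once a quota covers requests and limits, Pods that omit them are rejected unless a LimitRange supplies defaults.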
Labels, Selectors & Annotations
app=payments, env=prod, version=v2) that are queryable — selectors use them to group or find objects. Selectors are the query syntax: equality (app=payments), set-based (env in (prod, staging)), or existence (canary to require a key, !canary to require its absence). Annotations are also key/value but not queryable — they carry descriptive metadata (build hash, Git commit, owner email, last-modified timestamp) consumed by tools, controllers, and humans rather than selectors.
- Labels glue workloads together: A Service uses a label selector to find its backing pods; a Deployment uses one to own its ReplicaSet pods.
- Multi-dimensional: You can slice across any axis — tier, release, owner, canary, shard — without rigid hierarchies.
- Standard labels: K8s recommends a set of common labels: app.kubernetes.io/name, app.kubernetes.io/instance, app.kubernetes.io/version, app.kubernetes.io/part-of, app.kubernetes.io/managed-by.
- Annotations for tools: kubectl.kubernetes.io/last-applied-configuration, prometheus.io/scrape: "true", cert-manager.io/cluster-issuer.
- Labels vs Annotations: Labels are queryable and used by the control plane for grouping; annotations are free-form metadata for tools and humans only. Never put a cert or ConfigMap key in a label.
- Labels vs Tags (AWS/GCP): Conceptually similar — but K8s labels drive actual scheduling and selection, whereas cloud tags are mostly for billing/organization.
- Selectors vs Nomad constraints: Nomad uses HCL constraints for placement; K8s uses label selectors throughout the API.
kubectl label --overwrite is needed to change existing labels. Mismatched selector vs pod labels silently produce empty Services (no endpoints). Renaming labels on a live Deployment can cause orphaned ReplicaSets. Labels should be stable — putting volatile values (timestamps) in labels breaks rolling updates.
squad=... (team) and cost-center=... to attribute infra spend. GitHub uses labels like shard=1..64 to implement horizontal sharding. cert-manager, Prometheus Operator, and Istio all lean heavily on labels for discovery and annotations for configuration.
Deep Dive:
Labels are the glue of K8s. A Service finds Pods, a Deployment manages ReplicaSets, NetworkPolicies target Pods — all through label selectors.
metadata:
  labels:
    app: payment-service
    team: payments
    env: production
Selector types: Equality-based (app = my-api) and Set-based (env in (production, staging)).
Annotations store larger metadata: build timestamps, Git SHAs, monitoring config (prometheus.io/scrape: "true").
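The two selector types can be seen side by side in the sketch below. It assumes Pods labeled as in the metadata example above; the object names and the container port 8080 are hypothetical. The Service uses an equality-based selector (the only kind Services support), and the NetworkPolicy uses a set-based one.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:                        # equality-based: all pairs must match
    app: payment-service
    env: production
  ports:
    - port: 80
      targetPort: 8080             # assumed container port
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-payments-team   # hypothetical name
spec:
  podSelector:
    matchExpressions:              # set-based: env in (production, staging)
      - key: env
        operator: In
        values: ["production", "staging"]
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              team: payments       # only Pods with this label may connect
```

If no Pod carries both app=payment-service and env=production, the Service simply has no endpoints, which is the silent failure mode that mismatched labels produce.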
team: payments labels to attribute costs. Prometheus discovers scrape targets via annotations. Kyverno enforces that every Deployment must have team and app labels.
kubectl
kubeconfig file (default: ~/.kube/config) containing cluster endpoints and credentials, translates user commands into HTTPS requests against the kube-apiserver, and formats responses into tables, YAML, or JSON. Every operation you can do through a dashboard, Helm chart, or CI pipeline ultimately hits the same REST API that kubectl hits — it is the universal debugging and operating tool for K8s.
- Imperative commands: kubectl run, kubectl expose, kubectl scale, kubectl delete for quick actions.
- Declarative apply: kubectl apply -f submits YAML manifests and uses a three-way merge to reconcile state.
- Debugging: kubectl logs, kubectl exec, kubectl describe, kubectl port-forward, kubectl debug.
- Context switching: kubectl config use-context jumps between clusters; tools like kubectx/kubens make this faster.
- Plugins: krew (invoked as kubectl krew) is the plugin manager — hundreds exist: neat, tree, who-can, stern, rakkess, view-secret.
- vs helm: Helm is a package manager that generates YAML and then calls the same API; kubectl is the raw interface.
- vs k9s: k9s is a terminal UI over the same API that shows live resources — much faster for interactive debugging.
- vs client libraries: Go/Python/Java/Rust clients talk the same API directly, useful for building controllers and operators.
- vs Docker CLI: Docker CLI talks to dockerd on one host; kubectl talks to a whole cluster API.
kubectl get pods -A, every CI pipeline eventually shells out to it. Mastering kubectl (and its JSONPath output, selectors, custom columns) is the single biggest productivity boost for anyone working with Kubernetes.
kubectl apply vs kubectl create behave differently on re-runs. Forgetting -n namespace is the #1 source of "why doesn't my command find anything." Running the wrong context on a prod cluster is a classic disaster — use tools like kube-ps1 or kubie to show the active context in your prompt. Imperative edits (kubectl edit) diverge from Git-tracked state — always prefer GitOps. Version skew: kubectl should be within ±1 minor version of the cluster.
kubectl one-liners: kubectl top pods for CPU/RAM, kubectl get events --sort-by=.lastTimestamp for recent cluster activity, kubectl rollout history for audit, kubectl cp for file transfer. Most team wikis contain "kubectl cheatsheets" that grow into hundreds of entries over time.
# Viewing
kubectl get pods -A # All namespaces
kubectl describe pod my-pod # Detailed info
# Debugging
kubectl logs my-pod -f # Stream logs
kubectl exec -it my-pod -- /bin/sh # Shell in
kubectl debug -it my-pod --image=busybox # Ephemeral debug
# Managing
kubectl apply -f manifest.yaml # Create/update
kubectl rollout undo deploy/my-app # Rollback
kubectl scale deploy my-app --replicas=5 # Scale
# Networking
kubectl port-forward svc/my-app 8080:80 # Local tunnel
# Context
kubectl config use-context prod # Switch cluster
Workloads
The resources that run your applications — Deployments, StatefulSets, DaemonSets, Jobs, and CronJobs.
Deployment & ReplicaSet
ReplicaSet (the low-level object that actually ensures N pods exist). During updates, the Deployment creates a new ReplicaSet with the new pod template, gradually scales it up while scaling the old one down — this is the rolling update. The old ReplicaSet is retained (up to revisionHistoryLimit, default 10) so you can kubectl rollout undo to any prior version.
- Rolling update strategy: maxSurge (how many extra pods during update) and maxUnavailable (how many pods can be down) control cadence.
- Recreate strategy: Kill all old pods first, then start new — for apps that can't run two versions simultaneously.
- Rollback: kubectl rollout undo deployment/foo reverts to the prior ReplicaSet's pod template; the change rolls out using the same update strategy rather than instantly.
- Pause and resume: Stage multiple changes while paused, then roll them out together via kubectl rollout resume.
- Progress deadline: If the rollout stalls, the Deployment is marked Progressing=False for alerting.
- vs ReplicaSet alone: ReplicaSet only maintains pod count — it has no update strategy. You almost never create ReplicaSets directly.
- vs StatefulSet: StatefulSets give stable identities and ordered rollout; Deployments treat all pods as interchangeable.
- vs DaemonSet: DaemonSets run one pod per node; Deployments run N pods anywhere.
- vs the old ReplicationController: Deployments replaced ReplicationController in 2016 — RCs are deprecated.
kubectl apply on a Deployment doesn't restart pods unless the pod template hash changes — use kubectl rollout restart to force a rollover. Two Deployments with overlapping selectors will fight each other.
deployment-service canary tooling. Tooling like Argo Rollouts extends Deployments with progressive delivery (canary, blue-green, analysis-driven rollouts). A typical mid-size company has 200-2000 Deployment objects in production.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: api
          image: payment-api:2.1.0
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits: { cpu: "1", memory: "1Gi" }
Rolling update: New ReplicaSet created → new Pods start → pass readiness probes → old Pods terminated. maxSurge: 1 = at most 4 total. maxUnavailable: 0 = always 3 ready.
Rollback: kubectl rollout undo deployment/payment-api (K8s keeps 10 revisions by default).
StatefulSet
mysql-0, mysql-1, mysql-2), stable DNS (mysql-0.mysql.default.svc.cluster.local), and ordinal-indexed PersistentVolumeClaims that persist across pod restarts and rescheduling. Pods are created, updated, and deleted in order (mysql-0 before mysql-1) to support bootstrap protocols like leader election, replica initialization, and quorum-based systems.
- Stable network identity: Each pod has a fixed DNS name backed by a Headless Service.
- Stable storage: volumeClaimTemplates auto-create a dedicated PVC per pod, retained even if the pod is deleted.
- Ordered deployment: Pods come up 0, 1, 2... and shut down in reverse.
- Rolling updates with partitions: Canary updates by setting partition: N — only pods with ordinal >= N get the new version.
- vs Deployment: Deployments are for stateless services; StatefulSets are for databases, queues, and consensus systems.
- vs running a DB on a VM: StatefulSets give you the automation of K8s (scaling, upgrades, self-healing) but add complexity — many teams still prefer managed DBs (RDS, Cloud SQL, Aurora) instead.
- vs Operators: A StatefulSet alone doesn't know how to run a database safely — Operators (PostgreSQL Operator, MongoDB Operator, Vitess) wrap StatefulSets with domain logic.
Pending. StatefulSets alone don't handle backup, failover, or split-brain — you need an Operator or manual ops. Upgrades are risky because pods restart one at a time; always test in staging.
What makes it special:
- Stable Pod names — mysql-0, mysql-1, mysql-2
- Stable DNS — mysql-0.mysql-headless.default.svc.cluster.local
- Persistent storage — Each Pod gets its own PVC that survives restarts
- Ordered operations — Created 0→1→2, deleted 2→1→0
Requires a headless Service (clusterIP: None) and volumeClaimTemplates for per-Pod storage.
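The two requirements just named, a headless Service and volumeClaimTemplates, fit together roughly as shown below. This is a sketch under stated assumptions: the Secret mysql-root, the image tag, and the 10Gi size are placeholders, and the manifest does not configure MySQL replication, which would need an Operator or extra setup.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None                  # headless: DNS returns Pod IPs directly
  selector:
    app: mysql
  ports:
    - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql-headless      # gives Pods names like mysql-0.mysql-headless
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.4
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-root   # hypothetical Secret, created separately
                  key: password
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:            # one PVC per Pod: data-mysql-0, data-mysql-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Scaling the StatefulSet down leaves the PVCs in place by default, so scaling back up reattaches each Pod to its previous data.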
DaemonSet
nodeSelector/affinity). When a new node joins the cluster, the DaemonSet controller automatically schedules its pod there; when a node leaves, the pod is garbage collected. DaemonSets are used for infrastructure agents that must be present on every host: log shippers, metric collectors, network plugins, storage drivers, CNI agents, and security scanners.
- One-per-node guarantee: The DaemonSet controller creates one pod per eligible node and pins it there with node affinity; since K8s 1.12 the default scheduler performs the actual binding.
- Toleration of taints: DaemonSets often tolerate NoSchedule taints so they run even on control-plane nodes.
- HostPort/HostNetwork: Many DaemonSets use host networking to intercept host-level traffic (kube-proxy, CNI).
- Rolling updates: Updated pod by pod, with maxUnavailable to control blast radius.
- vs Deployment: Deployment places N pods anywhere; DaemonSet places one per node.
- vs static pods: Static pods are managed by kubelet directly from a file on disk — used for the control plane itself, not general workloads.
- vs systemd services: On traditional hosts, you'd install a log shipper via systemd. DaemonSets bring that model into K8s declaratively.
Fluent Bit for logs, Node Exporter for Prometheus metrics, kube-proxy for networking, Calico/Cilium/Flannel for CNI, CSI node drivers for storage, Falco for runtime security, NVIDIA device plugin for GPU exposure.
hostPath volumes for /var/log or /proc, which can break on read-only root filesystems. Updates to DaemonSets running as critical infra (CNI) can briefly disrupt pod networking.
kube-proxy, CNI plugin (Cilium, Calico), log agent (Fluent Bit, Vector), metrics (Node Exporter, DCGM Exporter for GPU), security (Falco, Aqua Enforcer). Datadog distributes its agent as a DaemonSet to every K8s node for host-level observability.
What runs as DaemonSets: Log collection (Fluent Bit), monitoring (Node Exporter, Datadog), network plugins (Calico, Cilium), storage drivers (CSI), security agents (Falco).
Use nodeSelector or tolerations to restrict to specific node types (e.g., GPU nodes only).
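A minimal DaemonSet combining a nodeSelector and a toleration could look like this. It is a sketch only: the monitoring namespace is assumed to exist already, the image tag is a placeholder to pin as appropriate, and a real Node Exporter deployment would usually add host mounts and security settings omitted here.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring            # assumed to exist
spec:
  selector:
    matchLabels:
      app: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # update one node's Pod at a time
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      nodeSelector:
        kubernetes.io/os: linux    # restrict to Linux nodes
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule       # also run on control-plane nodes
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2   # placeholder tag
          ports:
            - containerPort: 9100
```

Swapping the nodeSelector for a GPU label would restrict the agent to GPU nodes, as described above.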
Job & CronJob
"0 2 * * *") and creates a new Job at each tick — the K8s equivalent of crontab, but distributed and declarative. Both support retries (backoffLimit), timeouts (activeDeadlineSeconds), and parallelism.
- Parallelism and completions: parallelism: 5 with completions: 100 runs up to 5 pods at a time until 100 have succeeded.
- Indexed Jobs: Each pod gets a unique index, enabling embarrassingly parallel batch processing (e.g., "process shard 37").
- BackoffLimit: Number of failures before marking the Job as failed.
- TTL after finished: ttlSecondsAfterFinished auto-cleans up completed jobs to avoid clutter.
- CronJob history limits: successfulJobsHistoryLimit/failedJobsHistoryLimit control how many old Jobs to keep.
- vs Deployment: Deployments keep pods alive; Jobs finish and stop.
- vs Linux cron: CronJobs are HA — the CronJob controller in the control plane manages them, not a single host's crontab.
- vs Airflow/Argo Workflows: Airflow and Argo Workflows are full DAG engines for complex pipelines; K8s Jobs are primitives you build workflows on top of.
- vs AWS Batch: AWS Batch is similar but cloud-specific; K8s Jobs are portable.
concurrencyPolicy: Forbid or Replace. Old completed Jobs pile up without TTL, burdening etcd. A Job's pods are not immediately cleaned up on completion — set TTL or use a CI tool that cleans them. Jobs with PodDisruptionBudgets can block node drains.
Key settings: backoffLimit (retries), activeDeadlineSeconds (timeout), concurrencyPolicy (Allow/Forbid/Replace for CronJobs), restartPolicy (Never or OnFailure).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *" # 2 AM daily
  concurrencyPolicy: Forbid # Skip if previous still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: report-gen:1.0
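For the one-off case, the parallelism, completions, and indexed-Job settings described earlier can be sketched as a plain Job. The Job name, the busybox image, and the echo command are stand-ins for a real worker; the numbers mirror the "5 at a time until 100 succeed" example.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-shards             # hypothetical name
spec:
  completions: 100                 # 100 successful Pods in total
  parallelism: 5                   # at most 5 running at once
  completionMode: Indexed          # each Pod gets an index from 0 to 99
  backoffLimit: 4                  # failures tolerated before the Job is marked failed
  activeDeadlineSeconds: 3600      # overall timeout for the Job
  ttlSecondsAfterFinished: 600     # delete the Job 10 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          # JOB_COMPLETION_INDEX is set by Kubernetes for Indexed Jobs
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
```

A real worker would use the index to pick its slice of the input, for example a shard number or a file offset.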
ReplicaSet
- Replica count enforcement: A reconciliation loop continuously compares actual vs desired pod count.
- Pod template: Defines the blueprint for pods it creates (image, env vars, volumes).
- Selector: Set-based or equality-based matching to adopt existing pods with matching labels.
- Owner references: Pods link back to their owning ReplicaSet via ownerReferences, enabling cascading deletes.
- vs Deployment: Deployments are a higher-level abstraction that manages ReplicaSets for you, adding rolling updates, rollback, and pause/resume.
- vs ReplicationController (deprecated): ReplicaSets replaced RCs in 2016 by adding set-based label selectors.
- vs StatefulSet: ReplicaSets treat pods as fungible; StatefulSets give each pod an identity.
--cascade=orphan, the ReplicaSets become standalone and hard to manage. Multiple ReplicaSets with overlapping selectors will fight over pods. Old ReplicaSets from previous rollouts accumulate unless you lower revisionHistoryLimit.
kubectl get rs: their names are the Deployment name plus a pod-template hash (e.g., nginx-7c5d5d6b9f). When you check rollout history, each entry is a ReplicaSet. Understanding ReplicaSets is key to debugging stuck Deployments and interpreting rollout state.
Deployments create new ReplicaSets on updates and scale the old one down. Old ReplicaSets are kept (default 10) for rollback capability.
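As a rough illustration of capping retained ReplicaSets, a Deployment can lower revisionHistoryLimit from its default of 10. The name, labels, and image tag below are assumptions chosen for the sketch.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx                  # illustrative name
spec:
  replicas: 3
  revisionHistoryLimit: 3      # keep only the 3 most recent old ReplicaSets for rollback
  selector:
    matchLabels: { app: nginx }
  template:
    metadata:
      labels: { app: nginx }   # must match the selector above
    spec:
      containers:
        - name: nginx
          image: nginx:1.27    # illustrative tag
```

A lower limit trades rollback depth for fewer stale objects in etcd.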
Networking
How Pods communicate — Services, Ingress, Gateway API, DNS, CNI, Network Policies, and Service Mesh.
Service
kube-proxy on each node using iptables, IPVS, or eBPF rules that load-balance traffic to the current list of healthy backend pods (the Endpoints/EndpointSlices object). Services use label selectors to find pods, updated in near-real-time by the endpoints controller.
- Types: ClusterIP (internal only), NodePort (opens a port on every node), LoadBalancer (provisions a cloud LB), ExternalName (DNS CNAME alias).
- Headless (clusterIP: None): No virtual IP; DNS returns pod IPs directly, used for StatefulSets and client-side load balancing.
- Session affinity: sessionAffinity: ClientIP pins a client to a pod (sticky sessions).
- Multi-port: A Service can expose multiple ports (HTTP + metrics, for example).
- EndpointSlices: Scalable replacement for monolithic Endpoints, supporting clusters with tens of thousands of pods per Service.
- vs direct pod IPs: Pod IPs change with restarts; Service IPs are stable for the lifetime of the Service.
- vs Ingress: Services are Layer 4 (TCP/UDP); Ingress is Layer 7 (HTTP host/path routing). They work together.
- vs a cloud load balancer: LoadBalancer-type Services provision a cloud LB automatically (ELB, GLB, Azure LB), but bill per load balancer.
- vs a service mesh: Istio/Linkerd layer smart routing, retries, mTLS, and observability on top of kube-proxy's basic connection-level load balancing.
http://payments.default.svc.cluster.local. Services decouple clients from pod lifecycles, enabling seamless rolling updates.
Types: ClusterIP (internal, default), NodePort (expose on node ports 30000-32767), LoadBalancer (cloud LB, costs money), ExternalName (CNAME alias), Headless (clusterIP: None, returns Pod IPs directly for StatefulSets).
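A minimal sketch of a multi-port ClusterIP Service follows. It assumes backend pods labeled app: payments that listen on 8080 (HTTP) and 9090 (metrics); those labels and ports are illustrative.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payments
  namespace: default
spec:
  type: ClusterIP              # the default type; internal to the cluster
  selector:
    app: payments              # assumed pod label; pods with it become endpoints
  ports:
    - name: http
      port: 80                 # port clients connect to on the Service
      targetPort: 8080         # assumed container port
    - name: metrics
      port: 9090
      targetPort: 9090
```

With this in place, in-cluster clients would reach the pods at payments.default.svc.cluster.local on port 80.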
http://payment-api/charge). One or two LoadBalancer Services for the Ingress Controller. ExternalName Services wrap external databases so app code uses K8s DNS everywhere.
Ingress & Ingress Controller
- Host-based routing: api.example.com → service A, web.example.com → service B.
- Path-based routing: /api → backend; /static → CDN.
- TLS termination: Decrypt HTTPS at the ingress, forward plain HTTP internally. Integrates with cert-manager for automatic Let's Encrypt certs.
- Annotations: Controller-specific features (rate limits, auth, rewrites) attached via annotations, a historical warts-and-all design.
- IngressClass: Lets multiple ingress controllers coexist in one cluster.
- vs Service (LoadBalancer): LoadBalancer type creates one cloud LB per service — expensive at scale. Ingress shares one LB across many services via host/path routing.
- vs Gateway API: Gateway API is the modern replacement with richer, more expressive routing and better role separation. Ingress is feature-frozen.
- vs Service Mesh: Ingress handles north-south (external-to-cluster) traffic; mesh handles east-west (pod-to-pod).
pathType (Exact vs Prefix vs ImplementationSpecific) is a common source of 404s. The original Ingress API was criticized for being underspecified, which is why Gateway API was designed.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.myapp.com]
      secretName: tls-secret
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service: { name: api-v1, port: { number: 80 } }
          - path: /
            pathType: Prefix
            backend:
              service: { name: frontend, port: { number: 80 } }
Controllers: NGINX (most popular), Traefik, HAProxy, AWS ALB, Istio Gateway.
Gateway API
- Role-based layering: Infra, platform, and app teams each own a distinct resource — clear ownership boundaries.
- Rich routing: Header matching, query param matching, method matching, weighted traffic splitting, header/URL rewrites, mirroring — all native, no annotations.
- Cross-namespace routing: Routes in one namespace can attach to Gateways in another, controlled via ReferenceGrant.
- Protocol support: HTTP, HTTPS, gRPC, TLS passthrough, TCP, UDP.
- Extension points: Policies (RateLimitPolicy, BackendTLSPolicy) attach without annotation overload.
- vs Ingress: Ingress has vague semantics, requires annotations for everything, and lacks L4 support. Gateway API is typed, expressive, portable.
- vs Istio VirtualService: Istio's API is more powerful but mesh-specific. Gateway API is the common denominator across implementations.
- vs SMI: Service Mesh Interface was an earlier attempt at standardization that is now effectively replaced by Gateway API + GAMMA initiative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payment-route
spec:
  parentRefs:
    - name: production-gateway
  hostnames: ["api.myapp.com"]
  rules:
    - matches:
        - path: { type: PathPrefix, value: /payments }
      backendRefs:
        - name: payment-v2
          port: 80
          weight: 90   # 90% to v2
        - name: payment-v3
          port: 80
          weight: 10   # 10% canary
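The HTTPRoute above attaches to a Gateway named production-gateway, which would typically be owned by the platform team. A sketch of such a Gateway might look like this; the gatewayClassName and the TLS Secret name are assumptions that depend on the installed implementation.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
spec:
  gatewayClassName: example-class   # hypothetical; set to the class your controller provides
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "api.myapp.com"
      tls:
        mode: Terminate             # terminate TLS at the gateway
        certificateRefs:
          - name: tls-secret        # assumed Secret of type kubernetes.io/tls
      allowedRoutes:
        namespaces:
          from: Same                # only routes in this namespace may attach
```

Splitting Gateway and HTTPRoute into separate objects is what gives infra and app teams their separate ownership boundaries.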
DNS (CoreDNS) & kube-proxy
payments.default.svc.cluster.local to ClusterIPs, and headless Service names to pod IPs. kube-proxy is the node-level agent that implements Services: it watches the API server for Service and Endpoints changes and programs node-level iptables, IPVS, or nftables rules (or, increasingly, eBPF via Cilium) that transparently load-balance traffic destined for ClusterIPs to the actual pod IPs.
- CoreDNS plugins: kubernetes, forward, cache, loadbalance, hosts, rewrite, autopath, prometheus.
- DNS search paths: Pods get .svc.cluster.local, .cluster.local etc. in /etc/resolv.conf for short names.
- kube-proxy modes: iptables (default, O(n) rules), IPVS (hash-based, scales better), nftables (newer), eBPF via Cilium (replaces kube-proxy entirely).
- Topology-aware routing: Prefer endpoints in the same zone for latency and egress-cost savings.
- CoreDNS vs kube-dns: kube-dns was a multi-container hack (dnsmasq + sidecar); CoreDNS is a single binary with a clean plugin model.
- iptables vs IPVS vs eBPF: iptables is fine below ~1000 services. IPVS uses kernel hash tables — better at 5000+. eBPF (Cilium) is fastest and most feature-rich.
- kube-proxy vs service mesh: A mesh adds sophisticated L7 routing via sidecar proxies on top of kube-proxy's simple L4 load balancing.
cache plugin, add NodeLocal DNSCache. The default ndots:5 in a pod's /etc/resolv.conf causes excessive DNS queries for external names. kube-proxy iptables reconciliation can be slow in large clusters. Conntrack table overflow causes mysterious packet drops. DNS lookups for external names bypass cluster resolution unless you configure rewrite.
DNS format: my-service.my-namespace.svc.cluster.local (or just my-service within same namespace).
kube-proxy modes: iptables (O(n), default), IPVS (O(1), better at scale), eBPF/Cilium (highest performance, replaces kube-proxy).
East-West = service-to-service within cluster. North-South = external traffic.
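One way to reduce the extra lookups caused by ndots:5 is to override it per pod through dnsConfig. This is a sketch under the assumption that the workload mostly resolves fully qualified external names; the pod name, image, and the value 2 are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-tuned               # hypothetical name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"              # names with 2+ dots are tried as absolute first
  containers:
    - name: app
      image: busybox:1.36       # illustrative image
      command: ["sleep", "3600"]
```

Short names such as my-service still resolve through the search path, since they contain fewer dots than the threshold.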
CNI (Container Network Interface)
- IPAM: IP address management — allocating pod IPs from a configured range.
- Overlay networks: VXLAN, Geneve, or IPinIP encapsulation for cross-node pod-to-pod traffic (Flannel, Calico IPIP).
- Native routing: BGP (Calico) or cloud routing tables for non-encapsulated traffic.
- NetworkPolicy enforcement: Most CNI plugins implement NetworkPolicy via iptables or eBPF.
- eBPF: Cilium uses eBPF hooks in the kernel for fast, programmable datapath and policy enforcement.
- Calico vs Cilium vs Flannel: Flannel is simple but limited. Calico adds policy and BGP. Cilium adds eBPF, L7 policy, Hubble observability, and a kube-proxy replacement.
- Cloud CNI vs overlay: AWS VPC CNI assigns real VPC IPs to pods (no overlay overhead, but limited by ENI IP capacity); Flannel/Calico overlay is universal but has encap overhead.
- CNI vs Docker networking: Docker uses libnetwork and the CNM spec; K8s rejected CNM in favor of the simpler CNI.
Major plugins:
- Cilium — eBPF-based, highest performance, built-in observability (Hubble), service mesh. The rising star. CNCF graduated
- Calico — Most widely deployed. BGP routing, excellent Network Policy support
- Flannel — Simple overlay, NO Network Policy support. Dev only
- AWS VPC CNI — Real VPC IPs on EKS. Limited by ENI capacity per node
Network Policy
- Ingress rules: "Allow traffic to pods with label app=db only from pods with label app=api."
- Egress rules: "Pods in namespace frontend may only call out to api.default.svc."
- Namespace selectors: Policies can match by namespace labels, enabling cross-namespace rules.
- IP blocks: Allow/deny by CIDR (e.g., block access to cloud metadata endpoint 169.254.169.254).
- Deny-all baseline: Apply a policy selecting all pods with no rules to default-deny.
- vs cloud security groups: Security groups work at instance level; NetworkPolicy works at pod level and moves with the pod.
- vs Calico GlobalNetworkPolicy: Calico's CRD adds richer rules (L7, priority, deny) beyond the base K8s NetworkPolicy API.
- vs Cilium NetworkPolicy: Cilium adds L7 (HTTP, gRPC, Kafka, DNS) and identity-based policy.
- vs service mesh authz: Istio's AuthorizationPolicy operates at the mesh layer with mTLS identity, complementary to NetworkPolicy.
hostNetwork and bypass NetworkPolicy. Debugging denied traffic requires CNI-specific tools like Cilium Hubble.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels: { app: api }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: frontend }
      ports:
        - port: 8080
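The deny-all baseline mentioned earlier can be sketched as a policy that selects every pod in a namespace and lists no allow rules. The namespace below is an assumption.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: default            # assumed namespace; apply one per namespace
spec:
  podSelector: {}               # empty selector matches all pods in the namespace
  policyTypes: [Ingress, Egress]
  # no ingress or egress rules are listed, so all traffic in both directions is denied
```

Because egress is denied too, DNS lookups stop working until a separate policy allows traffic to the cluster DNS; allow policies such as api-allow-frontend are then added on top of this baseline.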
Service Mesh (Istio, Linkerd)
- mTLS: Automatic mutual TLS between every pod — zero-trust networking by default.
- Traffic management: Canary, weighted splits, fault injection, circuit breakers, retries with jitter, timeouts.
- Observability: Automatic distributed tracing, RED metrics (rate/errors/duration), and access logs for every service.
- Authorization policy: L7 authz using pod identity (SPIFFE) instead of IPs.
- Multi-cluster: Connect services across clusters and clouds via the same mesh.
- Istio vs Linkerd: Istio is feature-rich and complex; Linkerd is minimal, faster, and lighter, with a data-plane proxy written in Rust. Linkerd is also a CNCF graduated project, known for simplicity.
- Istio vs Cilium Service Mesh: Cilium uses eBPF to avoid sidecars entirely, promising lower overhead.
- vs Consul Connect: HashiCorp's mesh, often used in Nomad/VM environments.
- vs app-level libraries: Prior approach was Netflix OSS (Ribbon/Hystrix/Eureka) baked into the app — meshes externalize this.
Capabilities: mTLS (zero-trust), retries/timeouts/circuit breaking, traffic splitting (canary), request-level metrics and traces, fine-grained AuthorizationPolicies.
Options: Istio (most features, CNCF graduated, Ambient mode = sidecar-less), Linkerd (simpler, Rust proxy), Cilium (eBPF-based, no sidecars).
When to adopt: 10+ microservices AND you need mandatory mTLS, advanced traffic management, or request-level observability. Don't adopt for 2-3 services.
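As an illustration of mandatory mTLS, Istio exposes a PeerAuthentication resource. The sketch below assumes Istio is installed with istio-system as its root namespace, in which case a policy named default there applies mesh-wide.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system       # assumed Istio root namespace
spec:
  mtls:
    mode: STRICT                # reject plaintext traffic between mesh workloads
```

Teams often start with mode: PERMISSIVE during migration and switch to STRICT once all workloads have sidecars.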
Configuration & Storage
ConfigMaps, Secrets, Volumes, PV/PVC, StorageClasses, and CSI.
ConfigMap
- Multiple consumption patterns: envFrom for bulk env import, valueFrom.configMapKeyRef for individual keys, volume mounts for files.
- Live updates: Mounted ConfigMaps update in the pod filesystem automatically (via symlink swap), so apps can reload on change.
- Immutable ConfigMaps: Marking immutable: true prevents accidental edits and reduces API server load.
- Binary data: For certificates or non-UTF8 files, use binaryData.
- vs Secret: ConfigMaps are plaintext in etcd; Secrets are base64-encoded (not encrypted by default) and handled more carefully. Use Secrets for credentials, ConfigMaps for everything else.
- vs env vars on VM: ConfigMaps are version-controlled, templated (via Helm/Kustomize), and managed via the same RBAC as everything else.
- vs external config services (Consul, etcd): ConfigMaps are native to K8s; external stores require extra infra but can offer hot reloads and audit.
Reloader. Base64 encoding a large file can push you toward the 1 MB limit. A missing ConfigMap referenced by a pod prevents pod startup.
LOG_LEVEL, FEATURE_FLAGS, DB_HOST, application.yaml (Spring Boot), settings.py overrides (Django), Nginx configuration files, Prometheus scrape rules, fluentd parser configs.
Volume-mounted ConfigMaps support hot-reload (~60s). Env vars do NOT; they need a Pod restart. immutable: true for better performance at scale. 1 MB size limit.
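The two main consumption patterns can be sketched together: one key read as an environment variable and one key mounted as a file. The ConfigMap name, image, and mount path are illustrative assumptions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config              # hypothetical name
data:
  LOG_LEVEL: "info"
  application.yaml: |
    server:
      port: 8080
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: my-app:1.0         # placeholder image
      env:
        - name: LOG_LEVEL       # env var: fixed at container start, needs a restart to change
          valueFrom:
            configMapKeyRef: { name: app-config, key: LOG_LEVEL }
      volumeMounts:
        - name: config
          mountPath: /etc/app   # file appears as /etc/app/application.yaml
  volumes:
    - name: config
      configMap:
        name: app-config
        items:
          - { key: application.yaml, path: application.yaml }
```

Only the mounted file picks up later edits to the ConfigMap; the environment variable keeps its start-time value.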
Secret
etcd, which is a well-known footgun: you MUST enable encryption at rest (via the EncryptionConfiguration API, ideally with a KMS provider like AWS KMS, GCP KMS, Vault Transit) to meet compliance.
- Secret types: Opaque (generic), kubernetes.io/tls, kubernetes.io/dockerconfigjson, kubernetes.io/service-account-token, kubernetes.io/ssh-auth.
- Consumption: Env vars, volume mounts, imagePullSecrets on a pod, or referenced by ServiceAccounts.
- Encryption at rest: Requires explicit config on the API server with aescbc, kms, or secretbox providers.
- Immutable Secrets: Same idea as immutable ConfigMaps for safety and performance.
- vs ConfigMap: Same API shape, different intent and RBAC treatment. Secrets should be restricted to the minimum set of users/service accounts.
- vs Vault / AWS Secrets Manager: External secret stores provide rotation, audit, dynamic secrets, and true encryption. Tools like External Secrets Operator sync external stores into K8s Secrets.
- vs SealedSecrets / SOPS: SealedSecrets (Bitnami) encrypts Secrets for safe storage in Git; SOPS (Mozilla) does the same with more providers.
Types: Opaque, TLS, dockerconfigjson, service-account-token, basic-auth.
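Encryption at rest is configured through a file passed to the API server with --encryption-provider-config. A minimal sketch using the aescbc provider might look like this; the key name is arbitrary and the key material is a placeholder you would generate yourself. On managed clusters this is usually set through the provider's own settings instead.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:                 # first provider is used for writes
          keys:
            - name: key1        # arbitrary key name
              secret: <BASE64_ENCODED_32_BYTE_KEY>   # placeholder, not a real key
      - identity: {}            # fallback so existing unencrypted Secrets remain readable
```

Existing Secrets are only encrypted when they are next written, so a rewrite of all Secrets is normally needed after enabling this.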
PV, PVC & StorageClass
- Access modes: ReadWriteOnce (single node, block/file), ReadOnlyMany, ReadWriteMany (multi-node, rare), ReadWriteOncePod (1.22+).
- Reclaim policies: Delete (PV and underlying disk deleted when PVC is deleted), Retain (keep for manual cleanup).
- Volume expansion: Grow a PVC online by updating spec.resources.requests.storage.
- CSI (Container Storage Interface): Vendor-agnostic plugin model, e.g. AWS EBS CSI, GCE PD CSI, Azure Disk CSI, Rook/Ceph, Portworx, Longhorn, OpenEBS.
- Snapshots and clones: Native CSI snapshot API for point-in-time backups.
- vs emptyDir: emptyDir lives with the pod and is destroyed on pod deletion; PVs survive pod lifecycle.
- vs hostPath: hostPath ties a pod to a specific node's disk; PVs are location-independent (ideally).
- vs Docker volumes: Docker volumes are host-local; PVs can be networked, zonal, and dynamically provisioned.
- CSI vs old in-tree drivers: CSI (GA since Kubernetes 1.13) has progressively replaced the old in-tree cloud volume plugins through CSI migration, giving a simpler API and cleaner out-of-tree development.
Delete, which has caused accidental data loss; change it to Retain for critical data. Expansion requires the filesystem to support online growth. Pod-PVC binding is sticky; deleting and recreating a PVC loses the data binding.
Access Modes: RWO (single node), ROX (read-only many), RWX (read-write many, needs EFS/NFS), RWOP (single Pod).
Reclaim: Delete (default) or Retain (keep data). VolumeSnapshots for backups.
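A sketch of dynamic provisioning with a safer reclaim policy follows. It assumes the AWS EBS CSI driver is installed; the class name, claim name, and size are illustrative.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retain                       # hypothetical name
provisioner: ebs.csi.aws.com              # assumes the AWS EBS CSI driver
reclaimPolicy: Retain                     # keep the disk when the PVC is deleted
allowVolumeExpansion: true                # permit growing PVCs later
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: fast-retain
  resources:
    requests:
      storage: 20Gi                       # raise this value later to expand the volume
```

With Retain, the released PV and its disk must be cleaned up manually once the data is no longer needed.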
emptyDir, hostPath & Ephemeral Volumes
- emptyDir memory medium: medium: Memory uses tmpfs (RAM), which is fast but counts against the pod memory limit.
- emptyDir sizeLimit: Cap how much disk/memory the pod can consume.
- hostPath types: Directory, File, Socket, DirectoryOrCreate, with type validation.
- Generic ephemeral volumes: Inline PVC-like specs that are created per-pod and deleted with it.
- projected volumes: Combine secrets, configmaps, downward API, and service account tokens into a single mount point.
- emptyDir vs PVC: emptyDir lives and dies with the pod; PVC survives.
- hostPath vs local PV: Local PVs are the sanctioned way to use local disks — they're scheduler-aware. hostPath bypasses scheduling and is generally discouraged outside DaemonSets and system pods.
- Generic ephemeral vs emptyDir: Generic ephemeral gives you a real storage class (SSD-backed, replicated) in a pod-scoped lifecycle.
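The init-container hand-off pattern with emptyDir can be sketched as follows. The pod name, image, and file contents are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-render           # hypothetical name
spec:
  initContainers:
    - name: render
      image: busybox:1.36       # illustrative image
      command: ["sh", "-c", "echo 'generated=true' > /work/app.conf"]
      volumeMounts:
        - { name: scratch, mountPath: /work }
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "cat /work/app.conf && sleep 3600"]
      volumeMounts:
        - { name: scratch, mountPath: /work }
  volumes:
    - name: scratch
      emptyDir:
        medium: Memory          # tmpfs; usage counts against the pod's memory limit
        sizeLimit: 64Mi         # cap on how much the volume may hold
```

The volume and its contents disappear when the pod is deleted, which is the intended behavior for scratch data.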
/var/run/docker.sock). Use ephemeral CSI volumes when you need real storage performance but pod-scoped lifecycle.
/var/log via hostPath. Init containers commonly use emptyDir to pass generated configs to app containers. CI/CD jobs use emptyDir for build output. Large Spark/ML workloads use ephemeral local SSDs via generic ephemeral volumes.
Scheduling & Scaling
How K8s places Pods, and how it auto-scales workloads and infrastructure.
Resource Requests & Limits
Guaranteed (requests == limits), Burstable (limits > requests), BestEffort (no requests or limits). QoS determines eviction priority under node pressure.
- CPU units: 1 CPU = 1 vCPU/core. Fractional: 500m = 0.5 CPU.
- Memory units: Bytes, or suffixes Ki/Mi/Gi (binary) or K/M/G (decimal).
- CPU limits cause throttling: Never killed for CPU, just throttled by the cgroup CPU quota.
- Memory limits cause OOM kill: Kernel OOM killer terminates the container if it exceeds its memory limit.
- LimitRange: Per-namespace object that sets default requests/limits and min/max bounds.
- vs VM sizing: VMs have fixed sizes; K8s requests/limits are per-container and flexible across shared nodes.
- vs Docker --memory: K8s requests have no direct Docker equivalent; they're scheduler hints.
- vs Nomad resources: Nomad has a similar requests/limits model.
QoS Classes: Guaranteed (requests==limits, highest priority), Burstable (requests<limits), BestEffort (none set, first evicted).
Units: CPU: 1=1 core, 100m=0.1 core. Memory: 128Mi, 1Gi.
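To make the QoS classes concrete, the sketch below shows a pod that would be classed as Guaranteed, alongside a LimitRange that supplies defaults for containers that declare nothing. Names, namespace, and values are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod          # hypothetical name
spec:
  containers:
    - name: app
      image: my-app:1.0         # placeholder image
      resources:
        requests: { cpu: 500m, memory: 256Mi }
        limits:   { cpu: 500m, memory: 256Mi }   # equal to requests, so QoS is Guaranteed
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: dev                # assumed namespace
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # applied when a container sets no request
      default:        { cpu: 500m, memory: 512Mi }   # applied when a container sets no limit
```

Raising only the limits above the requests would move the pod to Burstable.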
Node Affinity, Pod Affinity & Anti-Affinity
gpu=nvidia-a100, zone=us-east-1a). Pod Affinity attracts pods to nodes where other pods (matching a label selector) already run, which is useful for co-location to reduce latency. Pod Anti-Affinity does the opposite: spread replicas across nodes, racks, or zones to survive failures. All three come in two flavors: requiredDuringSchedulingIgnoredDuringExecution (hard rule) and preferredDuringSchedulingIgnoredDuringExecution (soft preference with weight).
- Topology key: Defines the "domain" of spreading: kubernetes.io/hostname (per node), topology.kubernetes.io/zone (per AZ), topology.kubernetes.io/region.
- Label expressions: Set-based selectors (In, NotIn, Exists, DoesNotExist).
- TopologySpreadConstraints: Newer, higher-level API that achieves spread with simpler semantics than anti-affinity.
- vs nodeSelector: nodeSelector is simple equality; affinity supports richer expressions and soft preferences.
- vs taints/tolerations: Taints repel pods (nodes say "stay away"); affinity attracts them (pods say "I want this node"). Use them together.
- vs TopologySpreadConstraints: TSC is more concise for the common "spread my replicas across zones" case.
Hard vs Soft: requiredDuring... (must) vs preferredDuring... (try).
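The hard and soft flavors can be combined in one pod template. This sketch prefers a zone (soft) while requiring that no two replicas share a node (hard); the app label, zone value, and image are assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                     # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft: try, but don't block
            - weight: 50
              preference:
                matchExpressions:
                  - { key: topology.kubernetes.io/zone, operator: In, values: [us-east-1a] }
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:    # hard: one replica per node
            - labelSelector:
                matchLabels: { app: web }
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.27     # illustrative image
```

With the hard anti-affinity rule, a fourth replica on a three-node cluster would stay Pending, which is the usual trade-off of required rules.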
Taints & Tolerations
NoSchedule (new pods can't land), PreferNoSchedule (scheduler tries to avoid), NoExecute (evict existing pods that don't tolerate). Tolerations are the counterpart on pods: a pod with a matching toleration can be scheduled onto a tainted node. This is the opposite model of affinity: affinity attracts, taints repel. Often combined: taint a node to dedicate it, tolerate on the specific workload that should use it.
- Node taints set via kubectl taint: kubectl taint nodes gpu-1 dedicated=gpu:NoSchedule.
- Automatic taints: Control plane taints like node-role.kubernetes.io/control-plane:NoSchedule, or dynamic node.kubernetes.io/not-ready, node.kubernetes.io/unreachable.
- TolerationSeconds: How long a pod tolerates a NoExecute taint before eviction.
- Taint-based eviction: When a node goes unreachable, the node lifecycle controller taints it, and pods are evicted after tolerationSeconds.
- vs Node Affinity: Affinity is pull (pods choose nodes); taints are push (nodes filter pods). Best combined.
- vs nodeSelector: nodeSelector is a simple "must match" — doesn't exclude other pods from the node. Taints actively exclude.
spot=true:NoSchedule so only interruption-tolerant workloads land there. GPU nodes use nvidia.com/gpu:NoSchedule with the NVIDIA device plugin. Control plane nodes in managed clusters are tainted to prevent user workloads. Dedicated team nodes use taints for hard chargeback boundaries.
Effects: NoSchedule, PreferNoSchedule, NoExecute (evict existing).
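A pod meant for the node tainted with dedicated=gpu:NoSchedule in the kubectl example could carry tolerations like the following. The nodeSelector assumes the node also has a matching dedicated=gpu label, which the taint alone does not provide; the pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job                 # hypothetical name
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule        # allows scheduling onto the tainted node
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 60     # evicted 60s after the node becomes unreachable
  nodeSelector:
    dedicated: gpu              # assumed node label; the toleration only permits, it does not attract
  containers:
    - name: trainer
      image: my-trainer:1.0     # placeholder image
```

The toleration lets the pod in and the selector pulls it there, which is the taint-plus-affinity combination described above.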
HPA (Horizontal Pod Autoscaler)
--horizontal-pod-autoscaler-sync-period), the HPA controller queries the metrics server (or a custom metrics API), compares current values against the target, and adjusts the replica count. The current version is autoscaling/v2, which supports multi-metric scaling and scaling behavior policies.
- Metric types: Resource (CPU/mem), Pods (per-pod custom metric), Object (metric from another object like Ingress RPS), External (SQS depth, Kafka lag).
- Target types: Utilization (% of request), AverageValue, Value.
- Scale behavior (v2): Separate scale-up and scale-down policies with stabilization windows to prevent flapping.
- Scale to zero: Not supported by HPA directly — use KEDA for that.
- vs VPA: HPA adds replicas; VPA resizes the individual pod's CPU/memory. Orthogonal, though overlapping use is tricky.
- vs KEDA: KEDA extends HPA with event-driven sources (Kafka, RabbitMQ, Azure Service Bus, Prometheus, cron) and scale-to-zero.
- vs Cluster Autoscaler: HPA scales pods; CA scales nodes. They compose: HPA adds pods, which triggers CA to add nodes.
v2 supports multiple metrics. behavior section controls scaling speed (scale up fast, down slow to prevent flapping). Requires Metrics Server.
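A minimal autoscaling/v2 sketch with a behavior section is shown below. It assumes a Deployment named web and Metrics Server installed; the thresholds and windows are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                   # assumed target Deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70            # percent of the CPU request, averaged across pods
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0         # react to load immediately
    scaleDown:
      stabilizationWindowSeconds: 300       # wait 5 minutes before shrinking
      policies:
        - type: Percent
          value: 50                         # remove at most 50% of pods per minute
          periodSeconds: 60
```

Utilization targets only work when the pods declare CPU requests, since the percentage is computed against them.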
VPA, KEDA & Cluster Autoscaler / Karpenter
- VPA modes: Off (recommend only), Initial (set at creation), Auto (evict + recreate pod with new sizes).
- KEDA ScaledObjects: Declare event source, metric, thresholds; KEDA creates/manages an HPA for you.
- KEDA scale-to-zero: Idle deployments drop to 0 pods and wake on first event.
- Cluster Autoscaler: Node-pool-based, conservative, requires pre-defined instance types.
- Karpenter: Provisions raw EC2 instances matched to pending pod specs; consolidates workloads for efficiency.
- VPA vs HPA: VPA resizes, HPA multiplies. VPA shouldn't be combined with HPA on CPU/mem.
- KEDA vs HPA: KEDA is a superset — handles sources HPA can't, like Kafka/SQS backlog, custom Prometheus queries.
- Karpenter vs Cluster Autoscaler: Karpenter launches nodes in seconds (vs minutes), picks optimal instance type per workload, consolidates idle nodes aggressively. Much better cost efficiency.
Auto mode recreates pods, which disrupts service; pair with PDBs. KEDA's scale-to-zero has a cold-start penalty. Cluster Autoscaler can't downsize if even one pod without a PDB is stuck. Karpenter requires careful spot instance handling to avoid cascading interruptions.
# KEDA: Scale Kafka consumer to zero
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler   # illustrative name
spec:
  scaleTargetRef: { name: kafka-consumer }
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.default.svc:9092   # placeholder broker address
        consumerGroup: orders-consumer             # placeholder consumer group
        topic: orders
        lagThreshold: "100"
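For VPA, a low-risk starting point is recommendation-only mode. The sketch assumes the VPA components are installed in the cluster and that a Deployment named web exists.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                 # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                   # assumed target Deployment
  updatePolicy:
    updateMode: "Off"           # only publish recommendations; never evict pods
```

The recommendations appear in the object's status and can be reviewed before switching to Initial or Auto.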
PDB (Pod Disruption Budget)
kubectl drain for node upgrades, cluster autoscaler scale-down, Karpenter consolidation. Involuntary disruptions (node crash, hardware failure) are NOT governed by PDBs. A PDB specifies minAvailable or maxUnavailable as a count or percentage, and the eviction API will block drains that would violate it.
- minAvailable: "At least N pods (or X%) must remain running."
- maxUnavailable: "At most N pods (or X%) may be unavailable."
- Selector: Targets pods via label selector — must match the deployment's pods.
- Eviction API: PDBs only enforce against the eviction API, not direct pod deletes.
- vs replicas: Replicas guarantee target count on average; PDBs guarantee minimum during disruption windows.
- vs HPA: Independent — HPA adjusts replicas, PDB protects them during drains.
- vs anti-affinity: Anti-affinity spreads pods for availability; PDB prevents too many being drained simultaneously.
minAvailable: 100% blocks drains entirely and stalls cluster upgrades. PDBs don't prevent node crashes; for that you need multi-zone replication. Stuck drains due to PDBs are a common incident. A single-replica Deployment with a PDB that allows no disruptions means the drain never completes.
kubectl get pdb -A before major upgrades to identify risky workloads. Lyft's incident reports mention PDB-related drain failures as routine ops hazards.
Applies to voluntary disruptions (drain, autoscaler), NOT involuntary (node crash, OOM).
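A PDB that leaves room for drains can be sketched as follows, assuming a multi-replica workload whose pods are labeled app: web.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # hypothetical name
spec:
  maxUnavailable: 1             # evictions proceed one pod at a time
  selector:
    matchLabels: { app: web }   # must match the Deployment's pod labels
```

Using maxUnavailable rather than a high minAvailable tends to keep drains moving as the replica count changes.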
Security
RBAC, Pod Security, admission control, policy engines, secrets management, and runtime protection.
RBAC (Role-Based Access Control)
get, list, create, update, patch, delete, watch) on the resource.
- Fine-grained verbs: Control individual operations, not just read/write.
- Resource names: Restrict actions to specific named resources (e.g., only the prod-db secret).
- Aggregation: ClusterRoles can aggregate from labeled sub-roles for modular permissions.
- Subresources: Control sub-APIs like pods/exec, pods/portforward, deployments/scale.
- ServiceAccount binding: Pod service accounts get RBAC via bindings, enabling controllers to call the API with minimum privileges.
- vs ABAC (deprecated): ABAC used policy files; RBAC uses API objects that can be managed declaratively.
- vs cloud IAM: Cloud IAM controls cloud resources; RBAC controls K8s API. Managed K8s (EKS/GKE/AKS) bridges the two.
- vs OPA/Gatekeeper: RBAC decides "can you do this?"; Gatekeeper decides "is this object valid?"
staging, controllers only watch what they need. Essential for security audits, compliance, and multi-tenant safety.
cluster-admin role) are the single most common security mistake. The system:masters group bypasses RBAC entirely. Permissions needed for CRDs are often forgotten; a new CRD requires new ClusterRoles. Debugging "forbidden" errors means running kubectl auth can-i and kubectl-who-can.
aws-auth ConfigMap or the EKS API to map IAM principals to RBAC. Tools like rakkess, rbac-lookup, and kubectl-who-can audit permissions. Every K8s security review starts with an RBAC audit.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: dev
rules:
  - apiGroups: [""]               # "" is the core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-team-reader
  namespace: dev
subjects:
  - kind: Group
    name: dev-team
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Test: kubectl auth can-i create pods --namespace dev
SecurityContext & Pod Security Standards
Privileged (no restrictions), Baseline (sensible minimums), Restricted (hardened, enforces non-root, no host networking, etc.). Enforced by the Pod Security Admission controller (replaced the deprecated PodSecurityPolicy in 1.25).
- runAsNonRoot + runAsUser: Force non-root execution.
- readOnlyRootFilesystem: App can't write to its own rootfs, which defeats many exploits.
- capabilities: Drop all Linux caps by default, add only the ones you need (e.g., NET_BIND_SERVICE).
- seccomp profiles: Restrict which syscalls the container can make via seccompProfile: RuntimeDefault or custom.
- AppArmor/SELinux: Mandatory access control for stronger isolation.
- vs PodSecurityPolicy (removed): PSS is simpler, enforced via admission labels on namespaces rather than PSP objects.
- vs OPA/Kyverno: PSS covers the common cases; policy engines handle custom rules.
- vs Docker --privileged: SecurityContext is more granular, with individual switches per capability.
pod-security.kubernetes.io/enforce: restricted.
Pod Security Admission enforces PSS per namespace via labels: enforce, audit, warn.
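Putting the pieces together, a namespace labeled for the Restricted standard and a pod written to pass it might look like this. The namespace, pod name, image, and UID are illustrative assumptions.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod                    # assumed namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # also warn on kubectl apply
---
apiVersion: v1
kind: Pod
metadata:
  name: hardened
  namespace: prod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001            # arbitrary non-root UID
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: app
      image: my-app:1.0         # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities: { drop: ["ALL"] }
```

Rolling out with the audit and warn labels first, then adding enforce, is a common way to find violations before they block deployments.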
Admission Controllers & Webhooks
MutatingWebhookConfiguration and ValidatingWebhookConfiguration objects point to HTTPS endpoints that implement custom logic. This is how tools like cert-manager, the Istio sidecar injector, OPA, Kyverno, and Linkerd's proxy auto-injector plug in.
- Mutating vs validating: Mutating can change the object (inject sidecars, set defaults); validating can only reject.
- Order: All mutators run first, then all validators, so validation sees the final merged object.
- Dynamic registration: Webhook configs are K8s objects: install a chart, register a webhook at runtime.
- failurePolicy: Fail rejects matching requests if the webhook is down; Ignore lets them through.
- objectSelector / namespaceSelector: Limit which objects trigger the webhook.
- vs RBAC: RBAC is yes/no on verbs; admission inspects the actual object content and can modify or reject based on arbitrary logic.
- vs CRD validation schemas: Schemas (OpenAPI) catch simple type errors; admission handles complex multi-field or cross-object rules.
- vs audit hooks: Audit records what happened; admission prevents what shouldn't happen.
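A validating webhook registration that bounds its blast radius can be sketched as below. The webhook name, backing Service, path, and CA bundle are hypothetical placeholders; the fields themselves belong to the admissionregistration.k8s.io/v1 API.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy                  # hypothetical name
webhooks:
  - name: validate.example.com          # hypothetical; must be a fully qualified name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore               # let requests through if the webhook is down
    timeoutSeconds: 5                   # cap the latency added to each request
    clientConfig:
      service:
        name: policy-webhook            # assumed Service fronting the webhook
        namespace: policy-system
        path: /validate
      caBundle: <BASE64_CA_CERT>        # placeholder for the CA that signed the webhook cert
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]       # never gate control-plane pods
```

Choosing between Fail and Ignore is a trade-off between policy strictness and availability during webhook outages.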
failurePolicy: Fail can take down the API server; imagine Istio's injector webhook crashing during cluster upgrade. Mutating webhooks that conflict cause non-deterministic results. Webhook latency adds to every API request. Webhooks must ignore kube-system or you'll break the control plane itself.
OPA/Gatekeeper & Kyverno
ConstraintTemplates (policy definitions) and Constraints (instances). Kyverno is a newer CNCF graduated Kubernetes-native policy engine that uses YAML instead of Rego, which is easier for K8s users who don't want to learn Rego. Kyverno can validate, mutate, generate, and clean up resources.
- Validation: Block deployments that use the latest tag, enforce required labels, forbid privileged pods, require resource limits.
- Mutation: Auto-add sidecars, inject default labels/annotations, set security context fields.
- Audit mode: Report violations without blocking — useful for rolling out new policies.
- Policy libraries: Community-maintained policies (pod-security-standards, supply chain, ingress hardening).
- Kyverno-only: resource generation — create default ConfigMaps/NetworkPolicies when a namespace is created.
- OPA/Gatekeeper vs Kyverno: OPA is more powerful (arbitrary Rego, non-K8s use cases) but steeper; Kyverno is simpler, YAML-based, K8s-only.
- vs Pod Security Admission: PSA handles the common PSS rules; policy engines handle everything beyond that.
- vs validating webhooks you write: Policy engines save you from building and maintaining custom webhooks.
# Kyverno: Block :latest tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-tag
      match:
        any:
          - resources: { kinds: [Pod] }
      validate:
        message: "Using ':latest' is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
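For comparison, the Gatekeeper side of the same idea uses a ConstraintTemplate (the Rego definition) plus a Constraint (an instance of it). The sketch below follows the commonly used required-labels pattern and assumes Gatekeeper is installed; the owner label is an illustrative choice.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        # Report a violation when any required label is missing from the object.
        violation[{"msg": msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner        # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]             # illustrative required label
```

The template is written once and reused by many Constraints, which is the main structural difference from Kyverno's single-object policies.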
Image Security & Runtime Protection
- SBOM (Software Bill of Materials): Machine-readable list of every package in an image, produced by tools like Syft.
- SLSA provenance: Cryptographic attestation of how an image was built (build pipeline, source commit, build platform).
- Image signing: Cosign uses keyless signing via Fulcio/Rekor (OIDC identity) — no long-lived keys to manage.
- Admission enforcement: Kyverno/Gatekeeper can require signed images from approved registries.
- Runtime anomaly detection: Falco fires alerts on syscall-level rules (shell in container, unexpected file writes, crypto miners).
- vs traditional VM antivirus: Containers are immutable — the approach is "verify at build time, detect at runtime" rather than continuous file scanning.
- Falco vs Tetragon: Falco uses a kernel module or eBPF; Tetragon is pure eBPF and integrates with Cilium for network + syscall observability.
- vs network-based IDS: Runtime container security sees inside the pod; network IDS only sees traffic on the wire.
Observability
Metrics, logging, tracing, probes, alerting, and the three pillars of understanding your systems.
Prometheus & Grafana
/metrics endpoints exposed by every component (kubelet, node-exporter, apps) at a regular interval, stores them in a local TSDB, and serves alerts via Alertmanager. Grafana is the visualization layer: dashboards, alerts, and multi-data-source federation. Together they are the universal pair for cloud-native observability. The Prometheus Operator and kube-prometheus-stack Helm chart deploy them + exporters + default dashboards with one command.
- Service discovery: Auto-discover scrape targets via K8s API (pod, service, endpoint annotations).
- PromQL: Powerful query language with rate, histogram_quantile, aggregation, label manipulation.
- Alerting: PrometheusRule CRD defines alerts; Alertmanager routes and deduplicates them to Slack, PagerDuty, email.
- Exemplars and histograms: Link metrics to traces for correlation.
- Remote write: Long-term storage via Thanos, Cortex, Mimir, VictoriaMetrics.
- vs Datadog/New Relic: SaaS alternatives are simpler but expensive at scale; Prometheus is free but you operate it.
- vs InfluxDB: Influx used to compete but lost ground to Prometheus in the K8s space.
- vs OpenTelemetry Metrics: OTel is the emerging standard for metric collection; Prometheus remains the default backend.
- Push vs pull: Prometheus pulls (better for reliability), unlike StatsD/Graphite push model.
/metrics with text exposition) is implemented by every modern CNCF project.
Key sources: kube-state-metrics (K8s object state), Node Exporter (node hardware), Metrics Server (HPA/kubectl top), app /metrics endpoints.
# PromQL examples
rate(http_requests_total{status="500"}[5m]) # Error rate
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # P99
sum by (service)(rate(http_requests_total[5m])) # RPS per service
Long-term: Thanos or Grafana Mimir for multi-cluster, long-retention storage.
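The PrometheusRule CRD mentioned under alerting might look like the following sketch. It assumes the Prometheus Operator is installed and that the app exposes `http_requests_total` with a `status` label; the `release` label, threshold, and names are hypothetical and depend on how your Prometheus selects rules.

```yaml
# Sketch only: alert when more than 1% of requests return 5xx for 10 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-error-rate
  labels:
    release: kube-prometheus-stack   # assumption: must match your Prometheus ruleSelector
spec:
  groups:
    - name: http.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 10m
          labels:
            severity: page           # Alertmanager can route on this label
          annotations:
            summary: "5xx ratio above 1% for 10 minutes"
```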
Logging (Loki, EFK, Fluent Bit)
- Node-level shipping: One agent per node reading /var/log/pods/*.
- Metadata enrichment: Add pod, namespace, labels, container name to every log line automatically.
- Parsing: Regex, JSON, logfmt, multi-line (stack traces).
- Backpressure handling: Buffering and retries when the backend is slow.
- Loki label-based indexing: Cheap storage (S3/GCS) by indexing only labels, not log contents.
- Loki vs Elasticsearch: Loki is 10-100× cheaper but slower for full-text queries; ES is faster but requires large clusters.
- Fluent Bit vs Fluentd: Fluent Bit uses far less memory (roughly 1 MB vs. roughly 40 MB per instance, about 40× less), which makes it the usual choice for K8s node agents.
- Vector vs Fluent Bit: Vector is more powerful with transforms but less battle-tested at massive scale.
Loki (Grafana) = label-based, cost-effective, Grafana integration. Elasticsearch = full-text search, powerful but heavy. Loki is increasingly replacing EFK.
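As a rough illustration of the node-level pipeline (tail, enrich, ship), here is a minimal Fluent Bit configuration in its YAML format. The Loki host, buffer limit, and label are assumptions; the path uses /var/log/containers/*.log, which on most nodes are symlinks into /var/log/pods.

```yaml
# Sketch only: one Fluent Bit agent per node (typically a DaemonSet).
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri   # handle both container log formats
      mem_buf_limit: 50MB             # backpressure: pause reading when the buffer fills
  filters:
    - name: kubernetes                # adds pod, namespace, labels, container name
      match: kube.*
      merge_log: on                   # parse JSON log bodies into structured fields
  outputs:
    - name: loki
      match: kube.*
      host: loki.logging.svc          # hypothetical in-cluster Loki service
      port: 3100
      labels: job=fluent-bit
```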
Distributed Tracing & OpenTelemetry
- Context propagation: W3C Trace Context headers flow through HTTP/gRPC calls.
- Auto-instrumentation: OTel agents can instrument Java, Python, Node.js without code changes.
- Sampling: Head-based (decide at request start) or tail-based (decide after seeing the full trace — requires OTel Collector).
- OTel Collector: A vendor-agnostic middleman that receives, transforms, batches, and exports telemetry to any backend.
- Exemplars: Link metrics (Prometheus) to traces (Jaeger) for instant drill-down.
- vs logs: Logs tell you what happened at one point; traces tell you the full causal chain across services.
- vs metrics: Metrics aggregate; traces capture individual requests with full detail.
- Jaeger vs Zipkin: Jaeger (created at Uber, CNCF) is newer and more feature-rich; Zipkin (Twitter) was the pioneer.
- Tempo vs Jaeger: Tempo uses object storage for cheap, high-volume trace retention.
How: Request gets trace ID → each service creates a span → propagated via headers → assembled into full trace.
Backends: Jaeger (CNCF), Grafana Tempo, Zipkin. OTel Collector = vendor-neutral pipeline.
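The "vendor-neutral pipeline" role of the OTel Collector can be sketched with a minimal configuration: receive OTLP, batch, and export to a trace backend. The exporter endpoint is a hypothetical Tempo service, and `insecure: true` is only appropriate for an in-cluster test setup.

```yaml
# Sketch only: OTLP in, batch, OTLP out to a tracing backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch: {}                                # group spans before export
exporters:
  otlp:
    endpoint: tempo.observability.svc:4317 # hypothetical backend address
    tls:
      insecure: true                       # assumption: plaintext inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

Swapping the backend then means changing only the exporter section, not the instrumented applications.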
Probes: Liveness, Readiness & Startup
periodSeconds) with thresholds for success/failure transitions.
- Liveness: Restart deadlocked or memory-leaking apps.
- Readiness: Avoid sending traffic to warming-up pods; temporarily remove unhealthy pods during dependency outages.
- Startup probes: For slow-starting apps (Java with 60+ second warmup) — prevents liveness from killing them during init.
- Probe parameters: initialDelaySeconds, periodSeconds, timeoutSeconds, successThreshold, failureThreshold.
- gRPC probes: Native (no grpc_health_probe sidecar needed) since K8s 1.24.
- Liveness vs readiness: Critical distinction — liveness restarts; readiness just removes from Service. A full DB outage should fail readiness, NOT liveness (restarts don't help).
- Startup vs initialDelaySeconds: Startup probes are better for slow apps — they adapt to variable warmup times.
- vs ELB health checks: K8s probes run inside the cluster from kubelet; cloud LB health checks run from outside.
/actuator/health/liveness and /actuator/health/readiness are standard for Java K8s apps. Go's /healthz convention is widespread.
Methods: HTTP GET, TCP Socket, gRPC, Exec command.
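A minimal sketch of how the three probe types combine on one container. The image, port, and the /healthz and /ready paths are assumptions about the app; the timings are illustrative, not recommendations.

```yaml
# Sketch only: startup probe guards slow init, then liveness and readiness take over.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet: {path: /healthz, port: 8080}
        periodSeconds: 5
        failureThreshold: 30    # allows up to 5s * 30 = 150s of warmup
      livenessProbe:
        httpGet: {path: /healthz, port: 8080}
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3     # restart after 3 consecutive failures
      readinessProbe:
        httpGet: {path: /ready, port: 8080}
        periodSeconds: 5
        failureThreshold: 2     # remove from Service endpoints, no restart
```

Keeping dependency checks (such as the database) in /ready rather than /healthz follows the liveness-versus-readiness distinction above.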
SLI, SLO & SLA
- Error budget: The difference between 100% and the SLO — "we're allowed 43 minutes of downtime per month at 99.9%."
- Burn rate alerts: Fire when error budget is being consumed too fast (e.g., 10× normal).
- Multi-window multi-burn-rate: Modern alerting technique combining short and long windows to reduce noise.
- SLO tooling: Sloth, Pyrra, OpenSLO, Nobl9 turn SLO specs into Prometheus rules and dashboards.
- SLI vs metric: All SLIs are metrics; not all metrics are SLIs. SLIs must reflect user experience.
- SLO vs KPI: KPIs drive business; SLOs drive reliability engineering priorities.
- SLO vs alert threshold: A "5xx rate > 1%" alert is usually worse than a burn-rate alert because it doesn't account for error budget.
Four Golden Signals: Latency, Traffic, Errors, Saturation.
SLO 99.9% = 43 min downtime/month. When error budget exhausted → freeze features, fix reliability.
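A burn-rate alert for a 99.9% SLO might be sketched as the Prometheus rule below. It assumes an `http_requests_total` counter with a `status` label. The 14.4 factor is a commonly used fast-burn threshold: at 14.4× the allowed error rate, one hour consumes 14.4 / 720 = 2% of a 30-day budget.

```yaml
# Sketch only: multi-window fast-burn alert for a 99.9% availability SLO.
# Error budget = 1 - 0.999 = 0.001 (about 43 minutes per 30 days).
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
          )
        labels:
          severity: page
        annotations:
          summary: "Burning the 30-day error budget at more than 14.4x"
```

The long window shows the burn is sustained; the short window lets the alert clear quickly once the problem stops.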
Deployment Strategies & GitOps
Rolling, blue-green, canary, Argo CD, Flux, Helm, and Kustomize.
Rolling Update
maxSurge (how many extra pods allowed temporarily) and maxUnavailable (how many pods can be down at once). The result is a zero-downtime deployment for any stateless app with working readiness probes.
- maxSurge: Default 25%, adds up to 25% more pods during the update.
- maxUnavailable: Default 25% — can have up to 25% fewer pods during the update.
- Progress deadline: If no progress for progressDeadlineSeconds, rollout is marked failed.
- Pause and resume: Stage multiple edits, then apply together.
- Instant rollback: kubectl rollout undo reverts to the previous ReplicaSet.
- vs Blue-Green: Rolling is gradual; blue-green is instant switch. Rolling uses less capacity but takes longer.
- vs Canary: Canary sends a small fraction of traffic to the new version for verification; rolling replaces based on pod count.
- vs Recreate strategy: Recreate tears down all old pods first — causes downtime, only used for incompatible versions.
- vs VM-based rolling: Same concept at a different layer — auto-scaling groups cycle instances; K8s cycles pods.
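The rolling-update knobs can be seen together in a minimal Deployment sketch. The app name, image, and /ready path are hypothetical; the percentages shown are the Kubernetes defaults written out explicitly.

```yaml
# Sketch only: explicit rolling-update settings on a 4-replica Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  progressDeadlineSeconds: 600   # mark the rollout failed after 10 min without progress
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%              # rounds up: 1 extra pod for 4 replicas
      maxUnavailable: 25%        # rounds down: 1 pod may be unavailable
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.4.2   # hypothetical image
          readinessProbe:        # gates traffic to new pods during the rollout
            httpGet: {path: /ready, port: 8080}
```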
maxUnavailable: 0 with low replica counts can get stuck if the cluster is tight on capacity. Sticky sessions break mid-rollout unless using affinity.
kubectl apply with an image change kicks off a rolling update by default.
Blue-Green Deployment
- Instant cutover: Switch happens in seconds (label swap).
- Instant rollback: Flip back with the same mechanism.
- Full testing on green: Run smoke tests, load tests before the switch.
- Dual capacity: Requires running both versions simultaneously — 2× the resource footprint during deploy.
- vs Rolling: Instant rather than gradual; more capacity needed but faster rollback.
- vs Canary: Blue-green is all-or-nothing; canary ramps up gradually with traffic splitting.
- vs A/B testing: Blue-green is a deployment strategy; A/B testing splits by user segment for product experiments.
Pros: Atomic switchover, instant rollback. Cons: Double resources, DB schema compatibility needed.
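The label-swap cutover can be sketched with a plain Service. This assumes two Deployments whose pods carry `version: blue` and `version: green` labels; the label keys and names are illustrative.

```yaml
# Sketch only: the Service selector decides which color receives traffic.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: blue      # change to "green" to cut over; change back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Because only the selector changes, cutover and rollback are the same operation in opposite directions.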
Canary Deployment
- Automated analysis: Argo Rollouts or Flagger queries Prometheus for error rate / latency / custom SLIs at each step and promotes or rolls back automatically.
- Progressive traffic shifting: Weighted routing via a service mesh (Istio, Linkerd) or an ingress controller (NGINX Ingress).
- Manual gates: Pause between steps for human approval.
- Shadow/mirror traffic: Duplicate traffic to canary without affecting users.
- Header-based canary: Route internal testers or beta users by HTTP header.
- vs Rolling: Rolling changes pod count; canary changes traffic percentage. Canary is safer for high-impact changes.
- vs Blue-Green: Blue-green flips instantly; canary ramps gradually and exposes fewer users to bad deploys.
- vs feature flags: Feature flags control behavior inside the same binary; canary controls which binary runs.
# Argo Rollouts canary
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates: [{ templateName: success-rate }]
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 100
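The `success-rate` template referenced by the analysis step is not shown in the original. One plausible shape, assuming a Prometheus provider and an `http_requests_total` counter, is sketched below; the address, query, and 99% threshold are assumptions.

```yaml
# Sketch only: a possible AnalysisTemplate behind "templateName: success-rate".
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m                        # run the query every minute
      count: 5                            # take 5 measurements in total
      successCondition: result[0] >= 0.99
      failureLimit: 1                     # abort the rollout after more than 1 failed measurement
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # hypothetical address
          query: |
            sum(rate(http_requests_total{status!~"5.."}[2m]))
              / sum(rate(http_requests_total[2m]))
```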
GitOps (Argo CD & Flux)
kubectl edit) is detected and reconciled back to Git state. Deploys happen by merging PRs, giving free audit, review, rollback, and traceability. The term was coined by Weaveworks in 2017. The two dominant tools are Argo CD (Intuit, now CNCF graduated) and Flux (Weaveworks, CNCF graduated).
- Declarative desired state: Everything is a file in Git.
- Reconciliation loop: Controller polls Git, detects diffs, applies changes.
- Drift detection: Manual changes are reverted or flagged.
- App-of-apps pattern: One Argo Application manages many child Applications — scales to hundreds of microservices.
- Multi-cluster: Single Argo/Flux installation manages many clusters from a central control plane.
- Secret integration: SealedSecrets and SOPS encrypt secrets so they can be stored in Git safely; External Secrets keeps them in an external store and syncs them into the cluster.
- vs kubectl apply from CI: Push-based CI requires handing cluster credentials to the CI system; GitOps inverts this (the cluster pulls), which gives a cleaner security boundary.
- Argo CD vs Flux: Argo CD has a rich UI, "Application" CRD, and SSO; Flux is lighter, more composable, with better Helm lifecycle support. Both are mature.
- vs Spinnaker: Spinnaker is heavier, multi-cloud-focused, predates GitOps; adoption has shrunk.
git revert), disaster recovery (rebuild cluster from repo). It's the modern default for K8s delivery.
Argo CD: UI, SSO, ApplicationSets, auto-sync, self-healing. Flux: Toolkit approach, deeper Helm/Kustomize integration, no built-in UI.
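A minimal Argo CD Application illustrating auto-sync and self-healing. The repository URL, path, and namespaces are hypothetical.

```yaml
# Sketch only: Argo CD keeps the "web" namespace in sync with a Git path.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git   # hypothetical repo
    targetRevision: main
    path: apps/web/overlays/prod
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: web
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```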
Helm & Kustomize
values.yaml file with configurable defaults. Charts are versioned, shareable via repositories, and lifecycle-managed via helm install/upgrade/rollback. Kustomize is a templating-free alternative built into kubectl since v1.14; it uses a layered overlay model (base + dev/staging/prod overlays) with strategic merge patches. Each tool has passionate advocates; many teams use both.
- Helm charts: Templated YAML with Go templates, named releases, rollbacks, dependency charts.
- Helm Hub / Artifact Hub: Thousands of pre-built charts for common apps (Postgres, Redis, Prometheus, Cert-Manager).
- Helm values: Hierarchical YAML config injected into templates.
- Kustomize overlays: Layer patches on a base manifest for environment-specific customization.
- Kustomize components: Reusable patch snippets composed into overlays.
- Helm vs Kustomize: Helm uses templates (Go template syntax, can be messy); Kustomize is patch-based (no templating — pure YAML manipulation).
- Helm: Better for installing third-party apps (packaged by authors) and full lifecycle management.
- Kustomize: Better for managing your own manifests with small per-env differences.
- vs Pulumi/Crossplane: These use real programming languages; Helm/Kustomize are YAML-centric.
Use Helm for: third-party packages, complex parameterization. Use Kustomize for: your own apps, simple environment overrides. Many teams use both.
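The overlay model can be sketched with a small production kustomization. It assumes a `base` directory containing a Deployment named `web`; the file path, replica count, and image tag are illustrative.

```yaml
# Sketch only: overlays/prod/kustomization.yaml (hypothetical layout).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared manifests for all environments
patches:
  - target:
      kind: Deployment
      name: web
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5               # prod runs more replicas than the base
images:
  - name: registry.example.com/web
    newTag: 1.4.2              # pin the image tag for this environment
```

Rendering with `kubectl kustomize overlays/prod` should print the base manifests with these changes applied, with no templating involved.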
Cluster Architecture
Control plane, etcd, multi-cluster, multi-tenancy, managed vs. self-managed.
Control Plane Components
- API server: Stateless, horizontally scalable, the only write path to etcd.
- etcd: Raft consensus, runs as a 3- or 5-node cluster, stores all cluster state.
- Scheduler: Runs filter and score phases (formerly called predicates and priorities) to place pods; extensible via scheduler framework plugins.
- Controller manager: Runs built-in control loops; each controller watches and reconciles its resource type.
- Cloud controller manager: Extracted from the kube-controller-manager so cloud providers can evolve independently.
- vs worker nodes: Control plane manages; workers execute.
- vs Nomad servers: Similar concept; Nomad "servers" play an analogous role with a simpler implementation.
- vs Borg master: Borg had a single master per cell; K8s learned from that and uses a stateless API server atop etcd.
- Managed vs self-hosted: EKS/GKE/AKS hide the control plane entirely; self-hosted (kubeadm, kops) requires operating it.
etcd backups and disaster recovery are critical and often neglected. API server load can spike with chatty controllers or list-watch storms. Scheduler extender bugs cause pod placement anomalies. Certificate rotation failures lock you out of the control plane. Upgrading the control plane must precede worker node upgrades (version skew policy).
API Server: Auth → AuthZ → Admission → Validation → Persist. Stateless, run multiple replicas.
etcd: Raft consensus, 3 or 5 nodes. Fast SSDs required. Losing etcd without backup = losing everything.
Scheduler: Filter (which nodes CAN) → Score (which is BEST). Considers resources, affinity, taints, topology.
Controller Manager: Runs Deployment, ReplicaSet, Node, Job, Endpoint controllers. Watch → Compare → Act.
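To show which pod fields feed the scheduler's filter and score phases, here is a minimal sketch. The taint key, instance type, and image are hypothetical; hard requirements act as filters, while preferences and soft spread constraints influence scoring.

```yaml
# Sketch only: fields the scheduler considers when placing a pod.
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo
  labels:
    app: scheduling-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      resources:
        requests: {cpu: 500m, memory: 256Mi}   # filter: node must have this much free
  tolerations:
    - key: dedicated                           # filter: allows nodes tainted dedicated=batch
      operator: Equal
      value: batch
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:    # filter: hard requirement
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: [amd64]
      preferredDuringSchedulingIgnoredDuringExecution:   # score: soft preference
        - weight: 50
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: [m5.large]
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway                  # score: spread across zones if possible
      labelSelector:
        matchLabels:
          app: scheduling-demo
```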
Multi-Cluster & Multi-Tenancy
- Cluster API: Declarative K8s API for creating and managing clusters as custom resources.
- vCluster: Virtual clusters running inside a host cluster — cheap isolation.
- Karmada: CNCF project for scheduling workloads across clusters.
- Multi-cluster Services API: Cross-cluster service discovery using shared DNS.
- Fleet management: GitOps-based deployment of identical workloads to many clusters.
- vs single giant cluster: Single cluster is cheaper/simpler at small scale; multi-cluster becomes necessary as you approach the documented per-cluster limits (about 5,000 nodes or 150,000 pods) or for blast-radius reasons.
- vs multi-cloud via different clusters: Multi-cluster naturally supports multi-cloud if you don't rely on cloud-specific APIs.
- vs vCluster: vClusters give strong API-level isolation at lower cost than full clusters.
Management: Cluster API (declarative lifecycle), Rancher (UI), Argo CD ApplicationSets (multi-cluster deploys).
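Multi-cluster deploys with Argo CD ApplicationSets might look like the sketch below. It assumes clusters are already registered in Argo CD and labeled `env: prod`; the repository URL and paths are hypothetical.

```yaml
# Sketch only: one Application per registered cluster labeled env=prod.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: web-fleet
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            env: prod              # assumption: label set on the cluster secrets
  template:
    metadata:
      name: "web-{{name}}"         # {{name}} and {{server}} come from the cluster generator
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/deploy.git
        targetRevision: main
        path: apps/web
      destination:
        server: "{{server}}"
        namespace: web
```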
Managed vs. Self-Managed Kubernetes
kubeadm, kops, kubespray, Talos, Cluster API, Rancher RKE2.
- Managed: Pay per hour for the control plane (~$72/month for EKS), provider handles upgrades and HA, integrates with cloud IAM/networking.
- Serverless managed (Autopilot, Fargate): Pay per pod-second, no node management — highest abstraction.
- Self-managed: Full control, no per-cluster fee, runs on any infrastructure (on-prem, edge, air-gapped).
- Distributions: k3s (Rancher, lightweight for edge), Talos (immutable OS designed for K8s), OKD (upstream OpenShift).
- Managed: Faster to start, less ops burden, but locked to the provider and less flexible.
- Self-managed: Full control, portable, but requires deep K8s ops expertise and operational investment.
- GKE Autopilot vs EKS Fargate: Autopilot is per-pod pricing; Fargate requires separate networking config and has pod-size restrictions.
Managed: EKS (AWS, most popular), GKE (Google, most mature), AKS (Azure, free control plane).
Self-managed: kubeadm (official), k3s (edge/IoT, lightweight), RKE2 (FIPS, government), OpenShift (enterprise).
Operators & CRDs
Extending Kubernetes with custom resources and the Operator pattern.
CRD (Custom Resource Definition)
kubectl, subject to the same RBAC, validation, and admission webhooks as built-in objects. CRDs define their schema in OpenAPI v3 (for validation) and optionally declare subresources like /status and /scale. CRDs alone don't do anything: they just store data; pair them with a controller (an Operator) to turn them into active automation. This is how thousands of tools extend K8s: Certificate, Kafka, PrometheusRule, VirtualService, Application.
- OpenAPI v3 schemas: Type and range validation at the API server.
- Multiple versions: Support schema evolution with conversion webhooks.
- Subresources: /status separates user intent from controller-managed state; /scale enables HPA on your custom object.
- Printer columns: Customize kubectl get foo output.
- CEL validation: Cross-field validation rules without webhooks (K8s 1.25+).
- vs built-in resources: CRDs behave identically to built-ins from the API surface perspective.
- vs API aggregation: Aggregated API servers were the old extension method — more powerful but far more complex; CRDs are now the standard.
- vs ConfigMaps: ConfigMaps are untyped blobs; CRDs provide typed, validated, RBAC-aware objects.
Examples: Certificate (cert-manager), VirtualService (Istio), Cluster (CloudNativePG).
A CRD alone just stores data. A Custom Controller watches it and takes action = the Operator pattern.
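The CRD features listed above (schema validation, subresources, printer columns, CEL rules) can be combined in one small definition. The `Widget` kind and `example.com` group are hypothetical.

```yaml
# Sketch only: a hypothetical Widget CRD exercising the features described above.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [replicas]
              properties:
                replicas: {type: integer, minimum: 1}
                maxReplicas: {type: integer}
              x-kubernetes-validations:   # CEL cross-field rule, no webhook needed
                - rule: "!has(self.maxReplicas) || self.replicas <= self.maxReplicas"
                  message: "replicas must not exceed maxReplicas"
            status:
              type: object
              properties:
                readyReplicas: {type: integer}
      subresources:
        status: {}                        # controller writes here, users write spec
        scale:                            # lets HPA and kubectl scale target Widgets
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.readyReplicas
      additionalPrinterColumns:
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
```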
Operator Pattern
- Reconciliation loop: Watch CR, compare to cluster state, take actions, update status.
- Level-triggered logic: Always converges toward desired state, idempotent, tolerant of missed events.
- Domain knowledge: Encodes SRE/DBA playbooks as code — backups, upgrades, failover, scaling.
- OperatorHub.io: CNCF-backed catalog of 300+ operators, installable with one click via Operator Lifecycle Manager (OLM).
- Capability levels: Basic install → seamless upgrades → full lifecycle → deep insights → auto-pilot.
- vs Helm chart: A chart is a static install; an operator is a living process that reacts to changes.
- vs shell scripts: Operators use the K8s reconciliation model — idempotent, event-driven, declarative.
- vs traditional automation (Ansible): Operators are continuous; Ansible runs and exits.
Major Operators: cert-manager (TLS), Prometheus Operator (monitoring), CloudNativePG (PostgreSQL), Strimzi (Kafka), Crossplane (cloud infra), External Secrets Operator (secrets sync).
Build with: Kubebuilder (Go), Operator SDK (Go/Ansible/Helm), Metacontroller (any language).
Core pattern: Watch → Compare desired vs actual → Act. Runs continuously. Idempotent.
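From the user's side, the Operator pattern looks like declaring intent in a custom resource and letting the controller do the rest. A minimal cert-manager example, assuming a ClusterIssuer named `letsencrypt-prod` already exists (the issuer and DNS name are hypothetical):

```yaml
# Sketch only: desired state for a TLS certificate; cert-manager reconciles it.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: web-tls
  namespace: web
spec:
  secretName: web-tls          # the operator writes the key pair into this Secret
  dnsNames: [web.example.com]
  issuerRef:
    name: letsencrypt-prod     # assumption: an existing ClusterIssuer
    kind: ClusterIssuer
  duration: 2160h              # 90 days
  renewBefore: 360h            # renew 15 days before expiry
```

Issuance, renewal, and status updates are handled by the controller's reconcile loop rather than by a one-off install step.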
Platform Engineering, Cost & Strategy
IDPs, FinOps, DR, cloud strategy, DORA metrics, and compliance.
Internal Developer Platform (IDP)
- Service catalog: Browse/search all services, owners, docs, SLOs, dependencies.
- Golden paths: Templated "new service" workflows that scaffold repo, CI, manifests, dashboards.
- Self-service provisioning: Request a database, queue, or cache as a simple form → platform handles K8s CRDs + Terraform.
- Cost and SLO visibility: Dashboards per service for cost, errors, latency.
- Platform abstractions: Higher-level CRDs like App, Environment, Database instead of raw K8s resources.
- vs raw K8s: Raw K8s is too low-level for most developers — too many concepts, too much YAML.
- vs PaaS (Heroku): IDPs are your own Heroku — built on K8s, extensible, cost-controlled internally.
- vs DIY scripts: IDPs centralize tribal knowledge into a sanctioned platform.
Tools: Backstage (CNCF portal), Crossplane (cloud infra as CRDs), Tilt/Skaffold (inner-loop dev).
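A service catalog entry in Backstage is itself a small YAML file stored next to the code. A minimal sketch follows; the team, system, and repository names are hypothetical.

```yaml
# Sketch only: catalog-info.yaml registering a service in a Backstage catalog.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles card payments
  annotations:
    github.com/project-slug: example-org/payments-api   # hypothetical repo
  tags: [go]
spec:
  type: service
  lifecycle: production
  owner: team-payments          # drives ownership views in the catalog
  system: payments
  dependsOn:
    - resource:default/payments-db
```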
FinOps & Cost Optimization
- Cost allocation: Attribute cloud spend to namespaces, labels, teams using resource requests as the key.
- Right-sizing: Identify over-requested resources via VPA, Goldilocks, Kubecost recommendations.
- Spot/preemptible instances: Cheap compute for fault-tolerant workloads (80-90% savings).
- Karpenter consolidation: Automatically pick cheaper instance types and pack workloads densely.
- Reserved / Savings Plans: Commit to baseline usage for 20-72% discounts.
- Idle/orphaned resource cleanup: Unused PVs, unattached PVCs, old ReplicaSets.
- vs traditional capacity planning: Cloud is elastic and variable — traditional fixed-budget IT breaks.
- vs showback: Showback is visibility only; FinOps drives accountability and change.
- vs raw cloud cost tools: Cloud bills show instance costs, not pod costs — need K8s-aware tools.
Tools: Kubecost, OpenCost (CNCF), Goldilocks (VPA dashboard). Chargeback via labels attributes costs to teams.
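Right-sizing recommendations can be collected without touching running pods by using a VPA in recommendation-only mode. This sketch assumes the VPA components are installed and that a Deployment named `web` exists.

```yaml
# Sketch only: VPA computes recommendations but never evicts or resizes pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical workload
  updatePolicy:
    updateMode: "Off"        # recommend only; read results from the VPA status
```

Tools such as Goldilocks build their dashboards on top of VPA objects like this one.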
Disaster Recovery & Chaos Engineering
- Velero backup: Backup K8s manifests + PVs (via CSI snapshots) to S3/GCS/Azure.
- Cluster rebuild: GitOps + Velero = recover a lost cluster from Git and storage.
- etcd snapshots: Point-in-time backups of the control plane state.
- Chaos Mesh experiments: Inject pod kills, network partitions, CPU stress, DNS failures, clock skew.
- GameDays: Scheduled live-fire drills where teams practice incident response.
- DR vs HA: HA handles expected failures (single node death); DR handles catastrophic failures (entire region).
- vs traditional tape backups: K8s DR is manifest-centric plus PV snapshots, more complex than file backups.
- Chaos Engineering vs fault injection in tests: Chaos runs in production or prod-like environments, continuously.
RTO = max downtime. RPO = max data loss. Lower values = higher cost (multi-region).
Patterns: Multi-AZ (minimum), Active-Active (lowest RTO), Active-Passive (simpler).
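A recurring Velero backup might be declared as follows. The schedule, retention, and namespace scope are illustrative, and volume snapshots assume a configured snapshot provider.

```yaml
# Sketch only: nightly backup of all namespaces, kept for 30 days.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"          # cron: every day at 02:00
  template:
    includedNamespaces: ["*"]
    snapshotVolumes: true        # assumption: a volume snapshot location is configured
    ttl: 720h0m0s                # retention: 30 days
```

A daily schedule implies an RPO of up to 24 hours for anything captured only by these backups.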
Cloud Strategy & Vendor Lock-In
- Commodity vs differentiation: Use managed services where they're commodities (storage, DNS, K8s control plane) and portable stacks where you need flexibility.
- Exit strategies: Document how to leave each cloud; test by running workloads in a second cloud.
- Abstraction layers: Crossplane exposes cloud resources as K8s CRDs, giving a Terraform-like declarative workflow that works across clouds.
- Data gravity: Large datasets (petabytes) are expensive to move — data is often the real lock-in.
- Single-cloud: Simpler ops, deeper managed service usage, higher lock-in risk.
- Multi-cloud: Flexibility, redundancy, pricing leverage — but operational complexity and reduced access to cloud-specific features.
- Hybrid: On-prem + cloud, often regulatory-driven.
- Cloud-agnostic K8s stack: Portable compute but loses managed-service productivity.
DORA Metrics, Conway's Law & Organizational Impact
- DORA metric levels: Elite, High, Medium, Low performers — with multi-year gaps between them.
- Reverse Conway maneuver: Redesign the org chart to get the architecture you want.
- Team Topologies (2019): Book formalizing Stream-aligned, Platform, Complicated Subsystem, and Enabling team types.
- Platform team as enablers: Build the IDP, serve dev teams as internal customers.
- vs vanity metrics: DORA is outcome-focused, not activity-based (unlike "commits per dev").
- vs ITIL: ITIL emphasizes process and change control; DORA research finds that heavyweight change-approval processes correlate with worse delivery performance.
- Conway's Law vs architecture-first: You cannot fix architecture without fixing the org — technical change alone fails.
Team Topologies: Platform team (builds K8s platform), Stream-aligned (product features), Enabling (adoption help).
TCO: Compute + Storage + Networking + Licensing + People (biggest cost) + Opportunity cost.
Compliance, CIS Benchmarks & Zero Trust
- kube-bench: Open-source tool that runs CIS Benchmark checks against your cluster.
- Audit logging: Every API request is logged with user identity — essential for compliance evidence.
- Encryption at rest: Secrets encrypted in etcd via KMS providers.
- TLS everywhere: Control plane communication and pod-to-pod traffic encrypted.
- Workload identity: SPIFFE/SPIRE, GKE Workload Identity, EKS IRSA replace long-lived credentials with short-lived tokens.
- Zero Trust vs traditional perimeter security: Perimeter assumes inside-the-firewall is safe; zero-trust verifies every hop.
- vs check-the-box compliance: Real security requires continuous validation, not annual audits.
- CIS Benchmark vs PCI-DSS: CIS is technical hardening; PCI is industry-specific data protection.
Zero Trust stack: mTLS (service mesh), least-privilege RBAC, default-deny NetworkPolicies, PSS Restricted, short-lived credentials (Vault/IRSA), signed images only.
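Two pieces of the Zero Trust stack, PSS Restricted and default-deny NetworkPolicies, can be sketched per namespace as shown below. The namespace name is hypothetical, and NetworkPolicy enforcement assumes a CNI plugin that supports it.

```yaml
# Sketch only: enforce the Restricted Pod Security Standard and deny all traffic by default.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # Pod Security Admission label
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}                  # empty selector matches every pod in the namespace
  policyTypes: [Ingress, Egress]   # no rules listed, so all traffic is denied
```

Workloads then need explicit allow policies (including DNS egress) before they can communicate.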