The 7 Kubernetes Failures I See Every Week (And How to Fix Them)

Kubernetes failures are rarely random. I've been running production clusters for years, and most incidents repeat a small set of patterns. Here's my troubleshooting playbook.

1. Image Pull Backoff

This is the most common failure. Your pod is stuck in ImagePullBackOff, and the events say "Failed to pull image." Nine times out of ten it's a typo in the image tag, or the image lives in a private registry and you forgot to set imagePullSecrets.

Fix: Double-check the image name and tag. Then verify your pull secret exists and lives in the same namespace as the pod. kubectl get secrets -n <namespace> is your friend.
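The fix usually lands in the pod spec. A minimal sketch, assuming a secret named regcred already exists in the pod's namespace (every name here is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                # placeholder pod name
spec:
  imagePullSecrets:
    - name: regcred           # must exist in this same namespace
  containers:
    - name: app
      image: registry.example.com/team/my-app:1.2.3  # double-check the tag spelling
```

If the secret is missing, kubectl create secret docker-registry can create it; just remember to create it in the namespace the pod runs in.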

2. CrashLoopBackOff

Pod starts, then crashes, then restarts, then crashes again. This loop is maddening. Usually, it's an application error — a missing config file, a bad environment variable, or a port conflict.

Fix: Check the logs with kubectl logs <pod> --previous, which shows the output from the last crash. Don't just stare at the current logs; they're often empty because the pod just restarted.
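If the logs aren't enough, one trick is to run a throwaway copy of the pod with the entrypoint overridden so the container stays up and you can poke around inside it. A sketch, with placeholder names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-debug          # throwaway copy of the crashing pod
spec:
  containers:
    - name: app
      image: registry.example.com/team/my-app:1.2.3  # same image as the crashing pod
      command: ["sleep", "infinity"]   # replace the crashing entrypoint
```

Then kubectl exec -it my-app-debug -- sh and check the config file, environment variables, and ports by hand. Delete the pod when you're done.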

3. Node Not Ready

One of your nodes goes NotReady. The cluster keeps running, but your pods are at risk. Often it's a resource issue — disk pressure, memory pressure, or the node simply lost contact with the control plane.

Fix: Start with kubectl describe node <node> to see which condition tripped. Then SSH into the node, run journalctl -u kubelet, and look for errors. If it's disk pressure, clean up old images and logs. If it's network, check the node's connectivity to the API server.
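Before SSHing anywhere, it helps to spot the unhealthy nodes quickly. A minimal sketch: not_ready_nodes is a hypothetical helper that filters kubectl get nodes output; it's demonstrated here on captured sample output so it runs standalone.

```shell
# Hypothetical helper: print nodes whose STATUS column is not "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Demo on captured sample output; in real use: kubectl get nodes | not_ready_nodes
printf 'NAME STATUS ROLES AGE VERSION\nnode-a Ready worker 10d v1.29.0\nnode-b NotReady worker 10d v1.29.0\n' | not_ready_nodes
```

The same awk pattern works for any kubectl table output: skip the header row, filter on a column, print the name.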

4. DNS Resolution Failures

Your app can't reach other services by name. curl http://my-service fails. But curl http://10.0.0.5 works. Classic DNS problem.

Fix: Check CoreDNS pods. kubectl -n kube-system get pods -l k8s-app=kube-dns. If they're running, check CoreDNS logs. Often it's a misconfigured ConfigMap or a network policy blocking DNS traffic.
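If a network policy turns out to be the culprit, pods need explicit egress to CoreDNS on port 53. A sketch of an allow rule for every pod in a namespace, assuming your cluster labels kube-system with the standard kubernetes.io/metadata.name label:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns             # placeholder name
spec:
  podSelector: {}             # applies to every pod in this namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53            # DNS falls back to TCP for large responses
```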

5. PersistentVolume Claims Stuck in Pending

You create a PVC, and it never binds. It just sits there, pending. Usually no StorageClass matches the claim, the provisioner can't satisfy the request, or the volume is in the wrong zone.

Fix: Check the StorageClass and the PVC's access modes. If you're using a cloud provider, make sure the PV is in the same availability zone as the node.
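For the zone mismatch in particular, setting volumeBindingMode to WaitForFirstConsumer delays binding until a pod is actually scheduled, so the volume gets provisioned in whatever zone the pod landed in. A sketch, with a placeholder class name (swap in the provisioner your cluster actually uses):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                          # placeholder name
provisioner: ebs.csi.aws.com               # your provisioner here
volumeBindingMode: WaitForFirstConsumer    # bind after scheduling, not before
```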

6. OOMKilled

Your container gets killed with reason OOMKilled: it exceeded its memory limit. This is often a memory leak in your app, but sometimes you just need to raise the limit.

Fix: Monitor memory usage over time with kubectl top pod (this needs the metrics server installed). If it's a steady climb, you've got a leak. If it spikes, tune your resources.requests and resources.limits.
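Requests and limits live on the container spec. A sketch with placeholder numbers; size them from what kubectl top actually shows, not from guesses:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                # placeholder name
spec:
  containers:
    - name: app
      image: registry.example.com/team/my-app:1.2.3  # placeholder image
      resources:
        requests:
          memory: "256Mi"     # what the scheduler reserves for this container
          cpu: "250m"
        limits:
          memory: "512Mi"     # exceeding this is what triggers the OOM kill
```

Setting the memory request equal to the limit is a common choice for predictable workloads; it avoids overcommitting nodes.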

7. Network Policy Blocking Traffic

Your app can't talk to another service, and you've checked everything — DNS, endpoints, service name. Then you remember: network policies. They're silently dropping packets.

Fix: Use kubectl describe networkpolicy to see what's allowed. Temporarily disable the policy to test. Then craft a rule that allows the traffic you need.
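Once you know which traffic is being dropped, write a rule that names it explicitly. A sketch allowing one app to reach another on its service port, with placeholder labels and port number:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # placeholder names throughout
spec:
  podSelector:
    matchLabels:
      app: backend                  # the pods receiving traffic
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend         # the pods allowed to connect
      ports:
        - protocol: TCP
          port: 8080                # your service's targetPort
```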

The Real Developer Take

Look, Kubernetes is great until it's not. And when it's not, you'll spend hours chasing ghosts. My advice: automate your troubleshooting. Write scripts that check these seven things. Run them every minute. When something breaks, you'll know exactly where to look. Or, you know, just use a managed service and let someone else deal with it. But where's the fun in that?
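A script like that can start as small as a lookup table from pod state to first check. A toy sketch mirroring the cheat sheet; in real use the state would come from kubectl get pods:

```shell
#!/bin/sh
# Toy triage table: pod state -> first thing to check.
first_check() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "check image name and pull secret" ;;
    CrashLoopBackOff)              echo "check logs with --previous" ;;
    OOMKilled)                     echo "check memory usage and limits" ;;
    Pending)                       echo "check StorageClass and scheduling events" ;;
    *)                             echo "start with kubectl describe pod" ;;
  esac
}

first_check CrashLoopBackOff   # prints: check logs with --previous
```

From there, loop over kubectl get pods output and page yourself only on states you haven't automated yet.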

Quick Reference: Troubleshooting Cheat Sheet

  • ImagePullBackOff → Check image name and pull secret
  • CrashLoopBackOff → Check logs with --previous
  • Node NotReady → Check kubelet logs
  • DNS failures → Check CoreDNS pods and ConfigMap
  • PVC pending → Check StorageClass and zone
  • OOMKilled → Check memory usage and limits
  • Network policy → Check policies and test without them

Final Thoughts

Kubernetes failures aren't random. They follow patterns. Learn these seven, and you'll spend less time debugging and more time building. And when you do hit something new, you'll have a solid foundation to figure it out.

Now go fix that cluster.