Troubleshooting Pods

This chapter covers troubleshooting Pods, the units in which applications are deployed into Kubernetes.

No matter which error you run into, the first step is usually to get the Pod's current state and its logs:

kubectl describe pod <pod-name>
kubectl logs <pod-name>

The Pod's events and logs are usually enough to identify the issue.

Pod stuck in Pending

The Pending state indicates that the Pod has not been scheduled yet. The Pod's events will show why it cannot be scheduled:

$ kubectl describe pod mypod
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x6 over 27s)  default-scheduler  0/4 nodes are available: 2 Insufficient cpu.

Generally this is because insufficient resources of one type or another prevent scheduling. An incomplete list of things that could go wrong includes (a quick check follows the list):

  • The cluster doesn't have enough resources, e.g. CPU, memory or GPU. Reduce the Pod's resource requests or add new nodes to the cluster

  • The Pod requests more resources than any single node's capacity. Reduce the Pod's resource requests or add larger nodes to the cluster

  • The Pod uses a hostPort, but the port has already been taken by another Pod. Use a Service instead where possible
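
To narrow a FailedScheduling down, it can help to compare the Pod's resource requests against what each node can still allocate; a quick check (mypod is a placeholder):

$ kubectl describe nodes | grep -A 5 "Allocated resources"
$ kubectl get pod mypod -o jsonpath='{.spec.containers[*].resources.requests}'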

Pod stuck in Waiting or ContainerCreating

In this case, the Pod has been scheduled to a worker node, but it cannot start on that machine.

Again, get information from kubectl describe pod <pod-name> and check the events for what went wrong.

If the events show that the sandbox for this Pod fails to start, check the kubelet's logs on that node for the detailed reason:
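
A sketch of checking the kubelet logs on the affected node, assuming a systemd-managed kubelet:

$ journalctl -u kubelet --no-pager | tail -n 30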

Suppose the kubelet logs show that the "cni0" bridge has been configured with an unexpected IP address. The simplest way to fix this is to delete the "cni0" bridge; the network plugin will recreate it when required:
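
A sketch of removing the bridge on that node (run as root; the interface name cni0 comes from the kubelet error):

$ ip link set cni0 down
$ ip link delete cni0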

The above is an example of a network configuration issue. Many other things may go wrong as well. An incomplete list includes

  • Failed to pull the image, e.g.

    • the image name is wrong

    • the registry is not accessible

    • the image hasn't been pushed to the registry

    • the image pull secret is wrong or not configured for a private image

    • the pull times out because of a large image size (adjusting kubelet's --image-pull-progress-deadline and --runtime-request-timeout options could help in this case)

  • Network setup error for the Pod's sandbox (see the checks after this list), e.g.

    • the Pod's network namespace can't be set up because of a CNI configuration error

    • no IP address can be allocated because the podCIDR is exhausted

  • Failed to start the container, e.g.

    • command or args are misconfigured

    • the image itself contains the wrong binary
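
For the network-related items above, two quick node-side checks (the node name is a placeholder) are whether a CNI configuration is present on the node and whether the node still has a podCIDR assigned:

$ ls /etc/cni/net.d/
$ kubectl get node <node-name> -o jsonpath='{.spec.podCIDR}'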

Pod stuck in ImagePullBackOff

ImagePullBackOff means the image could not be pulled after several retries. It is usually caused by a wrong image name or an incorrect image pull secret. In this case, docker pull <image> on the node can be used to verify whether the image can be pulled at all.

For private images, a docker registry secret should be created
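
A sketch of creating such a secret (the secret name regsecret and the credential values are placeholders):

$ kubectl create secret docker-registry regsecret \
    --docker-server=<registry-server> \
    --docker-username=<username> \
    --docker-password=<password> \
    --docker-email=<email>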

and then the secret referenced via imagePullSecrets in the Pod's spec:
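
A minimal sketch of such a Pod spec (the Pod, container and image names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: private-pod
spec:
  containers:
  - name: app
    image: <registry-server>/<repo>/<image>:<tag>
  imagePullSecrets:
  - name: regsecret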

Pod stuck in CrashLoopBackOff

In this case, the Pod started and then exited abnormally (its restartCount will be greater than 0). First, take a look at the container logs:
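
kubectl logs <pod-name>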

If your container has previously crashed, you can access the previous container’s crash log with:
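
kubectl logs --previous <pod-name>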

From the container logs, we may find the reason for the crash, e.g.

  • Container process exited

  • Health check failed

  • OOMKilled

Alternatively, you can run commands inside the container with kubectl exec (the -c flag is only needed when the Pod has more than one container):
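
kubectl exec <pod-name> -c <container-name> -- <command>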

If none of these approaches works, SSH to the Pod's host and check the kubelet or docker logs. The host running the Pod can be found with:
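
kubectl get pod <pod-name> -o wide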

Pod stuck in Error

In this case, the Pod has been scheduled but failed to start. Again, get information from kubectl describe pod <pod-name> and check what went wrong. Common reasons include:

  • referencing a non-existent ConfigMap, Secret or PersistentVolume

  • exceeding resource limits (e.g. a LimitRange)

  • violating a PodSecurityPolicy

  • not authorized to access cluster resources (e.g. with RBAC enabled, a RoleBinding must be created for the ServiceAccount; see the sketch after this list)
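
For the last item, a RoleBinding that grants an existing ClusterRole to a ServiceAccount could be created like this (the binding name, role, namespace and ServiceAccount are placeholders):

$ kubectl create rolebinding sa-view \
    --clusterrole=view \
    --serviceaccount=default:my-serviceaccount \
    --namespace=default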

Pod stuck in Terminating or Unknown

Since v1.5, kube-controller-manager no longer deletes Pods just because a Node becomes unready. Instead, those Pods are marked with Terminating or Unknown status. If you are sure those Pods are no longer wanted, there are three ways to delete them permanently

  • Delete the node from the cluster, e.g. kubectl delete node <node-name>. If you are running with a cloud provider, the node should be removed automatically after the VM is deleted from the cloud provider.

  • Recover the node. After the kubelet restarts, it will check the Pods' status with kube-apiserver and restart or delete those Pods.

  • Force delete the Pods, e.g. kubectl delete pods <pod> --grace-period=0 --force. This is not recommended unless you know what you are doing. For Pods belonging to a StatefulSet, force deletion may result in data loss or a split-brain problem.

For a kubelet running inside a Docker container, an UnmountVolume.TearDown failed error may show up in the kubelet logs.

In this case, the kubelet should be started with the --containerized option and its container should be run with the proper volume mounts:
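
A sketch of such a setup (the kubelet image and the remaining kubelet flags are placeholders; the key point is usually that /var/lib/kubelet is mounted with shared propagation):

$ docker run \
    --net=host --pid=host --privileged \
    -v /:/rootfs:ro \
    -v /sys:/sys:ro \
    -v /dev:/dev \
    -v /var/lib/docker:/var/lib/docker:rw \
    -v /var/lib/kubelet:/var/lib/kubelet:rw,shared \
    -v /var/run:/var/run:rw \
    <kubelet-image> \
    kubelet --containerized <other-kubelet-flags>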

Pods in the Terminating state should be removed after the kubelet recovers. But sometimes Pods are not deleted automatically, and even force deletion (kubectl delete pods <pod> --grace-period=0 --force) doesn't work. In this case, a finalizer is probably the cause, and removing it with kubectl edit can mitigate the problem.
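
One way to clear the finalizers is a patch like the following (make sure the Pod really is unwanted first):

$ kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":null}}'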

Pod is running but not doing what it should do

If the Pod is running but not behaving as you expect, there may be errors in your Pod manifest. Often a section of the manifest is nested incorrectly, or a key name is mistyped, and so the key is ignored.

Try recreating the Pod with the --validate option, which reports unknown or mis-nested keys.
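
For example, with mypod.yaml standing for your own manifest:

kubectl delete pod mypod
kubectl create --validate -f mypod.yaml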

or check whether the Pod on the apiserver matches the Pod you intended to create by reading its manifest back and comparing.
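
A sketch of that comparison, reusing the mypod.yaml placeholder from above:

kubectl get pod mypod -o yaml > mypod-on-apiserver.yaml
diff mypod.yaml mypod-on-apiserver.yaml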

Static Pod not recreated after manifest changed

The kubelet watches the /etc/kubernetes/manifests directory (configured by the kubelet's --pod-manifest-path option) via inotify. It is possible for the kubelet to miss some events, in which case a changed static Pod is not recreated automatically. Restarting the kubelet should solve the problem.
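
On a systemd-managed node, that is typically:

$ systemctl restart kubelet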
