Troubleshooting Pods
This chapter covers troubleshooting Pods, the units in which applications are deployed to Kubernetes.
No matter which error you run into, the first step is usually to get the Pod's current state and its logs:
kubectl describe pod <pod-name>
kubectl logs <pod-name>
The Pod's events and logs are usually enough to identify the issue.
Pod stuck in Pending
The Pending state indicates that the Pod hasn't been scheduled yet. Check the Pod's events; they will show why the Pod is not being scheduled:
$ kubectl describe pod mypod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x6 over 27s) default-scheduler 0/4 nodes are available: 2 Insufficient cpu.
Generally this means that insufficient resources of one type or another are preventing scheduling. An incomplete list of things that could go wrong includes:
The cluster doesn't have enough resources, e.g. CPU, memory or GPU. Reduce the Pod's resource requests or add new nodes to the cluster (see the sketch after this list).
The Pod requests more resources than any node's capacity. Reduce the Pod's resource requests or add larger nodes with more resources to the cluster.
The Pod uses a hostPort, but the port has already been taken by another service. Consider using a Service instead in such a scenario.
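As a rough sketch of the first two cases, assuming a single-container Pod (the name web and the values below are only placeholders), the requests can be lowered so they fit on an existing node:
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        cpu: 250m        # lowered so the scheduler can find a node with enough free CPU
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi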
Pod stuck in Waiting or ContainerCreating
In this case, the Pod has been scheduled to a worker node but cannot start on that machine.
Again, get information from kubectl describe pod <pod-name> and check what's wrong.
If the events show that the sandbox for this Pod isn't able to start, check kubelet's logs on that node for the detailed reason:
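For example, on a node where kubelet runs under systemd (an assumption; adjust to however kubelet is managed on your nodes):
$ journalctl -u kubelet -f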
Suppose the kubelet logs show that the "cni0" bridge has been configured with an unexpected IP address. The simplest way to fix this is to delete the "cni0" bridge (the network plugin will recreate it when required):
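A sketch using standard iproute2 commands on the affected node (the bridge name cni0 comes from this particular CNI setup):
$ ip link set cni0 down
$ ip link delete cni0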
The above is an example of a network configuration issue. Many other things may go wrong as well. An incomplete list includes:
Failed to pull the image, e.g.
the image name is wrong
the registry is not accessible
the image hasn't been pushed to the registry
the docker registry secret is wrong or not configured for a private image
timeout because of a large image size (adjusting kubelet's --image-pull-progress-deadline and --runtime-request-timeout could help in this case)
Network setup error for the Pod's sandbox, e.g.
can't set up the network for the Pod's netns because of a CNI configuration error
can't allocate an IP address because the podCIDR is exhausted
Failed to start the container, e.g.
command or args configuration error
the image itself contains the wrong binary
Pod stuck in ImagePullBackOff
ImagePullBackOff means the image couldn't be pulled after several retries. It could be caused by a wrong image name or an incorrect docker registry secret. In such a case, docker pull <image> can be used on the node to verify whether the image can be pulled.
For private images, a docker registry secret should be created first:
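For example (the secret name regsecret and the server and credential values are placeholders to replace with your own):
$ kubectl create secret docker-registry regsecret \
    --docker-server=<registry-server> \
    --docker-username=<username> \
    --docker-password=<password> \
    --docker-email=<email>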
and then the secret referenced via imagePullSecrets in the Pod's spec:
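A minimal sketch, assuming the regsecret created above and a placeholder private image:
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: app
    image: <registry-server>/<image>:<tag>
  imagePullSecrets:
  - name: regsecret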
Pod stuck in CrashLoopBackOff
In this case, the Pod has started and then exited abnormally (its restartCount should be > 0). Take a look at the container logs:
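$ kubectl logs <pod-name>
# add -c <container-name> if the Pod has more than one container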
If your container has previously crashed, you can access the previous container’s crash log with:
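$ kubectl logs --previous <pod-name>
# --previous (or -p) shows the logs of the last terminated container instance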
From the container logs, we may find the reason for the crash, e.g.
Container process exited
Health check failed
OOMKilled
Alternatively, you can run commands inside that container with exec:
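$ kubectl exec <pod-name> -c <container-name> -- ps aux
$ kubectl exec -it <pod-name> -c <container-name> -- sh
# -c may be omitted for single-container Pods; whether sh (or ps) exists depends on the image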
If none of these approaches work, SSH into the Pod's host and check kubelet's or docker's logs. The host running the Pod can be found with:
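$ kubectl get pod <pod-name> -o wide
# the NODE column shows which host the Pod is running on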
Pod stuck in Error
In this case, the Pod has been scheduled but failed to start. Again, get information from kubectl describe pod <pod-name> and check what's wrong. Reasons include:
referencing a non-existent ConfigMap, Secret or PV
exceeding resource limits (e.g. a LimitRange)
violating a PodSecurityPolicy
not authorized to access cluster resources (e.g. with RBAC enabled, a RoleBinding must be created for the service account; see the sketch after this list)
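A sketch for the RBAC case, assuming the default service account in a namespace named demo and the built-in view ClusterRole (all placeholders to adapt):
$ kubectl create rolebinding sa-view \
    --clusterrole=view \
    --serviceaccount=demo:default \
    --namespace=demo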
Pod stuck in Terminating or Unknown
Since v1.5, kube-controller-manager no longer deletes Pods just because a Node is unready. Instead, those Pods are marked with Terminating or Unknown status. If you are sure those Pods are no longer wanted, there are three ways to delete them permanently:
Delete the node from the cluster, e.g. kubectl delete node <node-name>. If you are running with a cloud provider, the node should be removed automatically after the VM is deleted from the cloud provider.
Recover the node. After kubelet restarts, it checks the Pods' status with kube-apiserver and restarts or deletes those Pods.
Force delete the Pods, e.g. kubectl delete pods <pod> --grace-period=0 --force. This is not recommended unless you know what you are doing; for Pods belonging to a StatefulSet, force deletion may result in data loss or split-brain problems.
For a kubelet running in a Docker container, an UnmountVolume.TearDown failed error may be found in the kubelet logs.
In such a case, kubelet should be configured with the --containerized option and its container should be started with the host paths it needs mounted as volumes:
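A partial sketch of how such a kubelet container might be started (the image name and remaining flags are assumptions that depend on your deployment; the key point is mounting /var/lib/kubelet with shared propagation):
$ docker run --privileged --net=host --pid=host \
    -v /:/rootfs:ro,rslave \
    -v /var/run:/var/run:rw \
    -v /var/lib/docker:/var/lib/docker:rw \
    -v /var/lib/kubelet:/var/lib/kubelet:shared \
    -v /var/log:/var/log:rw \
    <kubelet-image> kubelet --containerized ...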
Pods in the Terminating state should be removed after kubelet recovers. Sometimes, however, the Pods are not deleted automatically, and even force deletion (kubectl delete pods <pod> --grace-period=0 --force) doesn't work. In that case, finalizers are probably the cause, and removing them with kubectl edit (or kubectl patch) should mitigate the problem.
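For example, a sketch that clears the finalizers with a merge patch (be sure you understand what the finalizer was protecting before removing it):
$ kubectl patch pod <pod> -p '{"metadata":{"finalizers":null}}'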
Pod is running but not doing what it should do
If the pod has been running but not behaving as you expected, there may be errors in your pod description. Often a section of the pod description is nested incorrectly, or a key name is typed incorrectly, and so the key is ignored.
Try to recreate the Pod with the --validate option:
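For example, assuming the Pod was created from a manifest file named mypod.yaml (a hypothetical file name):
$ kubectl delete pod mypod
$ kubectl create --validate -f mypod.yaml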
or check whether the Pod created on the API server matches what you intended by reading its spec back:
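A sketch that compares the object stored on the API server with the original manifest (file names are placeholders):
$ kubectl get pod mypod -o yaml > mypod-on-apiserver.yaml
$ diff mypod.yaml mypod-on-apiserver.yaml
# extra lines on the API server side (defaults filled in by the server) are expected;
# lines missing from the server-side version point to a problem in the original spec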
Static Pod not recreated after manifest changed
Kubelet watches the /etc/kubernetes/manifests directory (configured by kubelet's --pod-manifest-path option) with inotify. It is possible for kubelet to miss some events, in which case the static Pod is not recreated automatically. Restarting kubelet should solve the problem.
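For example, on a systemd-managed node (an assumption about how kubelet is installed):
$ systemctl restart kubelet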