Troubleshooting Azure
Azure Load Balancer (ALB)
When the Azure cloud provider is configured (kube-controller-manager --cloud-provider=azure --cloud-config=/etc/kubernetes/azure.json), an Azure load balancer (ALB) is created automatically for Services of type LoadBalancer. Note that only the Basic SKU ALB is supported at the moment, which has some limitations compared to the Standard ALB:
| Load Balancer | Basic | Standard |
| --- | --- | --- |
| Back-end pool size | up to 100 | up to 1,000 |
| Back-end pool boundary | Availability Set | Virtual network, region |
| Back-end pool design | VMs in Availability Set, virtual machine scale set in Availability Set | Any VM instance in the virtual network |
| HA Ports | Not supported | Available |
| Diagnostics | Limited, public only | Available |
| VIP Availability | Not supported | Available |
| Fast IP Mobility | Not supported | Available |
| Availability Zones scenarios | Zonal only | Zonal, Zone-redundant, Cross-zone load-balancing |
| Outbound SNAT algorithm | On-demand | Preallocated |
| Outbound SNAT front-end selection | Not configurable, multiple candidates | Optional configuration to reduce candidates |
| Network Security Group | Optional on NIC/subnet | Required |
The associated Public IP is also Basic SKU, which has some limitations compared to the Standard Public IP:
| Public IP | Basic | Standard |
| --- | --- | --- |
| Availability Zones scenarios | Zonal only | Zone-redundant (default), zonal (optional) |
| Fast IP Mobility | Not supported | Available |
| VIP Availability | Not supported | Available |
| Counters | Not supported | Available |
| Network Security Group | Optional on NIC | Required |
When creating a LoadBalancer Service, a set of annotations can be set to customize the ALB (an example Service follows the table):
| Annotation | Comments |
| --- | --- |
| service.beta.kubernetes.io/azure-load-balancer-internal | If set, create an internal load balancer |
| service.beta.kubernetes.io/azure-load-balancer-internal-subnet | Set the subnet for the internal load balancer |
| service.beta.kubernetes.io/azure-load-balancer-mode | Determines how the ALB is selected based on availability sets. Candidate values are: 1) not set or empty: use the primaryAvailabilitySet configured in /etc/kubernetes/azure.json; 2) auto: select the ALB with the minimum number of associated rules; 3) as1,as2: specify a list of availability sets |
| service.beta.kubernetes.io/azure-dns-label-name | Set the DNS label name |
| service.beta.kubernetes.io/azure-shared-securityrule | If set, the NSG will be shared with other services. This relies on Augmented Security Rules |
| service.beta.kubernetes.io/azure-load-balancer-resource-group | Specify the resource group of load balancer objects that are not in the same resource group as the cluster |
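For example, a minimal sketch of a Service that uses the internal load balancer annotation (the Service name, selector and ports below are placeholders):

```bash
# Sketch: create an internal LoadBalancer Service; name, selector and ports are placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: internal-app
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
  - port: 80
    targetPort: 8080
EOF
```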
Checking logs and events
The ALB is managed by kube-controller-manager or cloud-controller-manager in Kubernetes, so whenever there are problems with the Azure cloud provider, their logs should be checked first:
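For example, on a kubeadm-style cluster where the controller manager runs as a static pod in kube-system (the pod name below is a placeholder):

```bash
# Sketch: inspect Azure cloud provider activity in the controller manager logs.
# Find the actual pod name first, then grep for Azure/load balancer related messages.
kubectl -n kube-system get pods | grep controller-manager
kubectl -n kube-system logs kube-controller-manager-<node-name> | grep -iE 'azure|load'
```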
Resource events are also helpful, e.g. for a Service:
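For example (replace <service-name> with the Service in question):

```bash
# Show the Service and the events recorded for it.
kubectl describe service <service-name>
# Or list recent events in the namespace, oldest first.
kubectl get events --sort-by=.metadata.creationTimestamp
```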
LoadBalancer Service stuck in Pending
When checking a Service with kubectl describe service <service-name>, there are no error events, but its external IP is stuck in <pending>. This indicates something went wrong when provisioning the ALB/PublicIP/NSG. In such cases, the kube-controller-manager logs should be checked.
An incomplete list of things that could go wrong includes:
- Authorization failed because cloud-config is misconfigured, e.g. clientId, clientSecret, tenantId, subscriptionId or resourceGroup. Fixing the configuration in /etc/kubernetes/azure.json should solve the problem (see the sketch after this list).
- The configured client is not authorized to manage the ALB/PublicIP/NSG. Add the authorization in the Azure portal, or create a new service principal (az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/<subscriptionID>/resourceGroups/<resourceGroupName>") and update /etc/kubernetes/azure.json on all nodes; this should solve the problem.
- There is also an NSG issue in Kubernetes v1.8.X: Security rule must specify SourceAddressPrefixes, SourceAddressPrefix, or SourceApplicationSecurityGroups. To get rid of this issue, either upgrade the cluster to v1.9.X/v1.10.X or replace the SourceAddressPrefixes rule with multiple SourceAddressPrefix rules.
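A minimal sketch of the credential-related part of /etc/kubernetes/azure.json (values are placeholders, and the exact field names can differ between cloud provider versions):

```bash
# Sketch: credential-related fields in /etc/kubernetes/azure.json (placeholders only;
# field names may differ between cloud provider versions).
cat /etc/kubernetes/azure.json
# {
#   "cloud": "AzurePublicCloud",
#   "tenantId": "<tenant-id>",
#   "subscriptionId": "<subscription-id>",
#   "aadClientId": "<client-id>",
#   "aadClientSecret": "<client-secret>",
#   "resourceGroup": "<resource-group>",
#   "location": "<location>"
# }
```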
Service external IP is not accessible
The Azure cloud provider creates a health probe for each Kubernetes Service, and only backend VMs that pass the probe are added to the Azure Load Balancer (ALB). If the external IP is not accessible, the cause is most likely a failing health probe.
An incomplete list of such cases includes:
- Backend VMs are unhealthy (solution: log in to the VM to check, or restart the VM)
- Containers are not listening on the configured ports (solution: correct the container port configuration)
- A firewall or network security group blocks the port on the ALB (solution: add a new rule to expose the port)
- An ILB VIP is configured inside a VNet and one of the participating backend VMs tries to access the internal load balancer VIP; this fails because it is an unsupported scenario (solution: use the Service's clusterIP instead)
- Some or all containers are not responding to requests. Note that even when only some containers are unresponsive the Service can become inaccessible, because:
  - Azure probes the Service periodically via NodeIP:NodePort.
  - On the node, kube-proxy load balances the probe to one of the backend containers.
  - If the probe happens to hit an unhealthy container, it fails and the node VM may be removed from the ALB backends.
  - Eventually, the ALB may consider all backends unhealthy.

  The solution is to use readiness probes (a sketch follows this list), which ensure that unhealthy containers are removed from the Service's endpoints.
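A minimal sketch of a readiness probe, assuming an HTTP application that listens on port 8080 and exposes a /healthz endpoint (image, port and path are placeholders):

```bash
# Sketch: a Pod with a readiness probe so unhealthy containers are removed from the
# Service's endpoints. Image, port and /healthz path are placeholders.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
EOF
```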
No target backends present for the internal load balancer (ILB)
This is a known bug (kubernetes#59746 kubernetes#60060 acs-engine#2151) in Kubernetes v1.9.0-v1.9.3, which is caused by an error when matching the ILB's AvailabilitySet.
The fix for this issue (kubernetes#59747 kubernetes#59083) is included in v1.9.4+ and v1.10+.
No target backends present for the external load balancer
If kubelet is not configured with the cloud provider (e.g. --cloud-provider=azure --cloud-config=/etc/kubernetes/cloud-config is missing), then the node will not be joined to any Azure load balancer. This is because the node registers itself with its hostname as externalID, which is not recognized by kube-controller-manager.
A simple way to check is to compare the node's externalID and name (they should be different):
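For example (the node name is a placeholder; spec.externalID only exists on the older releases where this issue applies):

```bash
# Sketch: print the node name and externalID side by side; they should differ on a
# correctly configured node.
kubectl get node <node-name> -o jsonpath='{.metadata.name} {.spec.externalID}{"\n"}'
```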
To fix this issue:
- Delete the node object: kubectl delete node <node-name>
- Configure kubelet with the options --cloud-provider=azure --cloud-config=/etc/kubernetes/cloud-config
- Finally, restart kubelet (a sketch of these steps follows)
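A sketch of these steps, assuming kubelet runs under systemd with its flags in a drop-in file (the exact path varies by distribution):

```bash
# Sketch of the fix; the drop-in path is an assumption and varies by distribution.
kubectl delete node <node-name>
# Add the following flags to the kubelet unit, e.g. in
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
#   --cloud-provider=azure --cloud-config=/etc/kubernetes/cloud-config
systemctl daemon-reload
systemctl restart kubelet
```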
PublicIP not removed after deleting LoadBalancer service
This is a known issue (kubernetes#59255) in v1.9.0-v1.9.3: the Basic ALB has a default quota of 10 FrontendIPConfigurations. When this quota is exceeded, the ALB FrontendIPConfiguration won't be created, but the cloud provider continues to create PublicIPs for those services. After deleting the services, those PublicIPs are not removed either.
The fix for this issue (kubernetes#59340) is included in v1.9.4+ and v1.10+.
Besides the fix, if more than 10 LoadBalancer services are required in your cluster, you should also increase the FrontendIPConfigurations quota to make it work. Check Azure subscription and service limits, quotas, and constraints for how to do this.
No credentials provided for AAD application with MSI
When the Azure cloud provider is configured with "useManagedIdentityExtension": true, Managed Service Identity (MSI) is used to authorize Azure API calls. This is broken in v1.10.0-beta because of a refactor: Config.UseManagedIdentityExtension overrides auth.AzureAuthConfig.UseManagedIdentityExtension (kubernetes#60691).
The fix for this issue (kubernetes#60775) is included in v1.10.
Azure ARM calls rejected because of too many requests
Sometimes, kube-controller-manager or kubelet may fail to call the Azure ARM APIs because too many requests were sent in a given period.
Since v1.9.2 and v1.10, the Azure cloud provider caches various resources (e.g. VM, VMSS, NSG and RouteTable).
Ways to mitigate the issue:
- Ensure instance metadata is used, e.g. set useInstanceMetadata to true in /etc/kubernetes/azure.json on all nodes and restart kubelet.
- Increase --route-reconciliation-period on kube-controller-manager and restart it, e.g. set the option in /etc/kubernetes/manifests/kube-controller-manager.yaml; kubelet will then recreate the kube-controller-manager pod automatically. (A sketch of both changes follows this list.)
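A sketch of both mitigations, assuming a static-pod control plane (the 5m value is only an example):

```bash
# Sketch of both mitigations; values and paths are examples.
# 1) On every node: edit /etc/kubernetes/azure.json and set
#      "useInstanceMetadata": true
#    then restart kubelet:
systemctl restart kubelet

# 2) On the master: add the flag to the kube-controller-manager static pod manifest
#    /etc/kubernetes/manifests/kube-controller-manager.yaml, e.g.
#      - --route-reconciliation-period=5m
#    kubelet will recreate the pod automatically.
```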
AKS kubectl logs/exec connection timed out
kubectl logs reports a getsockopt: connection timed out error (AKS#232).
In AKS, kubectl logs, exec, and attach all require the master <-> node tunnels to be established. Check that the tunnelfront and kube-svc-redirect pods are up and running:
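For example:

```bash
# Check the tunnel components in the kube-system namespace.
kubectl -n kube-system get pods | grep -E 'tunnelfront|kube-svc-redirect'
```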
If the pods are not running, or a net/http: TLS handshake timeout error occurs, delete the tunnelfront pod and wait a while; a new pod will be recreated after a few seconds:
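For example (using the pod name found with the previous command):

```bash
# Delete the tunnelfront pod; a replacement is created automatically.
kubectl -n kube-system delete pod <tunnelfront-pod-name>
```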
LoadBalancer Service stuck in Pending after Virtual Kubelet deployed
After Virtual Kubelet is deployed, a LoadBalancer Service may get stuck in the Pending state and no public IP can be allocated. Checking the Service's events (e.g. with kubectl describe service <service-name>), you may find the error CreatingLoadBalancerFailed 4m (x15 over 45m) service-controller Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: ensure(default/nginx): lb(kubernetes) - failed to ensure host in pool: "instance not found". This is because the virtual node created by Virtual Kubelet does not actually exist on the Azure cloud platform, so it cannot be added to the backends of the Azure Load Balancer.
Kubernetes 1.9 introduces a new feature gate, ServiceNodeExclusion, for the control plane's controller manager. Enabling this gate in the controller manager's manifest (kube-controller-manager --feature-gates=ServiceNodeExclusion=true) allows Kubernetes to exclude Virtual Kubelet nodes (carrying the label alpha.service-controller.kubernetes.io/exclude-balancer) from being added to load balancer pools, so you can create public-facing services with external IPs without issue.
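A sketch of both steps, assuming a static-pod control plane; Virtual Kubelet usually applies the exclusion label itself, so labeling by hand is only needed if it is missing:

```bash
# Sketch: enable the feature gate and ensure the virtual node carries the exclusion label.
# 1) In /etc/kubernetes/manifests/kube-controller-manager.yaml add:
#      - --feature-gates=ServiceNodeExclusion=true
# 2) Label the virtual node if the label is not already present:
kubectl label node <virtual-kubelet-node> alpha.service-controller.kubernetes.io/exclude-balancer=true
```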
No GPU is found in Node's capacity
This may happen when deploying GPU workloads: the node's capacity nvidia.com/gpu is always zero. This is caused by a problem in the device plugin. The workaround is to redeploy the nvidia-gpu add-on:
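A sketch of the workaround, assuming the plugin runs as a DaemonSet in kube-system (pod names vary by deployment):

```bash
# Find and delete the NVIDIA device plugin pod so the DaemonSet recreates it and the
# GPUs are re-registered. The pod name is a placeholder.
kubectl -n kube-system get pods | grep nvidia
kubectl -n kube-system delete pod <nvidia-device-plugin-pod-name>
```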