Troubleshooting Azure

Azure Load Balancer (ALB)

When the Azure cloud provider is configured (kube-controller-manager --cloud-provider=azure --cloud-config=/etc/kubernetes/azure.json), an Azure Load Balancer (ALB) is created automatically for each Service of type LoadBalancer. Note that only the Basic SKU ALB is supported now, which has some limitations compared to the Standard SKU:

| Load Balancer | Basic | Standard |
| --- | --- | --- |
| Back-end pool size | up to 100 | up to 1,000 |
| Back-end pool boundary | Availability Set | virtual network, region |
| Back-end pool design | VMs in Availability Set, virtual machine scale set in Availability Set | Any VM instance in the virtual network |
| HA Ports | Not supported | Available |
| Diagnostics | Limited, public only | Available |
| VIP Availability | Not supported | Available |
| Fast IP Mobility | Not supported | Available |
| Availability Zones scenarios | Zonal only | Zonal, Zone-redundant, Cross-zone load-balancing |
| Outbound SNAT algorithm | On-demand | Preallocated |
| Outbound SNAT front-end selection | Not configurable, multiple candidates | Optional configuration to reduce candidates |
| Network Security Group | Optional on NIC/subnet | Required |

The associated public IP is also Basic SKU, which has some limitations compared to a Standard public IP:

| Public IP | Basic | Standard |
| --- | --- | --- |
| Availability Zones scenarios | Zonal only | Zone-redundant (default), zonal (optional) |
| Fast IP Mobility | Not supported | Available |
| VIP Availability | Not supported | Available |
| Counters | Not supported | Available |
| Network Security Group | Optional on NIC | Required |

When creating a LoadBalancer Service, a set of annotations can be set to customize the ALB:

| Annotation | Comments |
| --- | --- |
| service.beta.kubernetes.io/azure-load-balancer-internal | If set, create internal load balancer |
| service.beta.kubernetes.io/azure-load-balancer-internal-subnet | Set subnet for internal load balancer |
| service.beta.kubernetes.io/azure-load-balancer-mode | Determine how to select ALB based on availability sets. Candidate values are: 1) Not set or empty, use primaryAvailabilitySet set in /etc/kubernetes/azure.json; 2) auto, select ALB which has the minimum rules associated; 3) as1,as2, specify a list of availability sets |
| service.beta.kubernetes.io/azure-dns-label-name | Set DNS label name |
| service.beta.kubernetes.io/azure-shared-securityrule | If set, NSG will be shared with other services. This relies on Augmented Security Rules |
| service.beta.kubernetes.io/azure-load-balancer-resource-group | Specify the resource group of load balancer objects that are not in the same resource group as the cluster |
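
As an illustration of how these annotations are attached, the sketch below prints a Service manifest that requests an internal load balancer. The Service name, selector, and ports are hypothetical, and the value "true" is an assumption (the table only says the annotation must be set).

```shell
# Print an example Service manifest that asks for an internal ALB.
# The name "internal-app", its selector, and the ports are placeholders.
internal_lb_service_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: internal-app
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: internal-app
  ports:
  - port: 80
    targetPort: 8080
EOF
}
```

The manifest could then be applied with internal_lb_service_manifest | kubectl apply -f -.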

Checking logs and events

The ALB is managed by kube-controller-manager or cloud-controller-manager in Kubernetes, so whenever there are problems with the Azure cloud provider, their logs should be checked first.
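
A minimal sketch for fetching those logs, assuming the component runs as a pod in kube-system and carries a component=<name> label (typical for kubeadm-style static pods, but not guaranteed for every deployment). The helper name component_logs is hypothetical.

```shell
# Print the last 100 log lines of a control-plane component pod.
# Assumption: the pod is labeled component=<name> in the kube-system namespace.
component_logs() {
  local component="$1"   # e.g. kube-controller-manager or cloud-controller-manager
  local pod
  pod="$(kubectl -n kube-system get pod -l "component=${component}" \
    -o jsonpath='{.items[0].metadata.name}')" || return 1
  kubectl -n kube-system logs "$pod" --tail=100
}
```

For example, component_logs kube-controller-manager. If the component runs as a host service rather than a pod, its logs have to be read from the host instead.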

Resource events are also helpful, e.g. for a Service.
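
One way to list only the events of a single Service is a field selector on the event's involved object. This is a sketch; the helper name service_events is hypothetical, and kubectl describe service <service-name> shows the same events at the end of its output.

```shell
# Show the events recorded for one Service (namespace defaults to "default").
service_events() {
  local name="$1"
  local namespace="${2:-default}"
  kubectl -n "$namespace" get events \
    --field-selector "involvedObject.kind=Service,involvedObject.name=${name}"
}
```

For example, service_events nginx lists the events for a Service named nginx in the default namespace.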

LoadBalancer Service stuck in Pending

When checking a service with kubectl describe service <service-name>, there are no error events, but its external IP is stuck in <pending>. This indicates that something went wrong while provisioning the ALB, public IP, or NSG. In such cases, check the kube-controller-manager logs.

An incomplete list of things that could go wrong:

  • Authorization failed because the cloud config is misconfigured, e.g. clientId, clientSecret, tenantId, subscriptionId, resourceGroup. Fixing the configuration in /etc/kubernetes/azure.json should solve the problem.

  • The configured client is not authorized to manage ALB/PublicIP/NSG. Adding the authorization in the Azure portal, or creating a new service principal (az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/<subscriptionID>/resourceGroups/<resourceGroupName>") and updating /etc/kubernetes/azure.json on all nodes, should solve the problem.

  • There is also an NSG issue in Kubernetes v1.8.X: Security rule must specify SourceAddressPrefixes, SourceAddressPrefix, or SourceApplicationSecurityGroups. To work around it, either upgrade the cluster to v1.9.X/v1.10.X or replace the SourceAddressPrefixes rule with multiple SourceAddressPrefix rules.
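
For the first case, a quick sanity check is to confirm that the credential-related keys are present and non-empty. The sketch below assumes the key names tenantId, subscriptionId, aadClientId, aadClientSecret, and resourceGroup; adjust the list to the keys your cloud config actually uses. The helper name check_cloud_config is hypothetical, and the check only detects missing values, not wrong ones.

```shell
# Report keys that are missing or empty in the cloud-config file.
# Returns non-zero if any key is missing. Key names are assumptions.
check_cloud_config() {
  local file="${1:-/etc/kubernetes/azure.json}"
  local key
  local missing=0
  for key in tenantId subscriptionId aadClientId aadClientSecret resourceGroup; do
    # Look for "key": "<non-empty value>" on a single line.
    if ! grep -Eq "\"${key}\"[[:space:]]*:[[:space:]]*\"[^\"]+\"" "$file"; then
      echo "missing or empty: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_cloud_config with no argument inspects /etc/kubernetes/azure.json on the current node.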

Service external IP is not accessible

The Azure cloud provider creates a health probe for each Kubernetes Service, and only backend VMs whose probes succeed are added to the Azure Load Balancer (ALB). If the external IP is not accessible, the cause is probably a failing health probe.

An incomplete list of such cases:

  • Backend VMs are unhealthy (Solution: log in to the VM and check, or restart the VM)

  • Containers are not listening on the configured ports (Solution: correct the container port configuration)

  • A firewall or network security group blocks the port on the ALB (Solution: add a new rule to expose the port)

  • If an ILB VIP is configured inside a VNet and one of the participating backend VMs tries to access the Internal Load Balancer VIP, the access fails. This is an unsupported scenario. (Solution: use the service's clusterIP instead)

  • Some or all containers are not responding to requests. Note that even if only some of the containers are unresponsive, the whole service can become inaccessible, because:

    • Azure probes the service periodically via NodeIP:NodePort

    • On the Node, kube-proxy load balances the probe to the backend containers

    • If the probe is load balanced to an abnormal container, it fails and the Node VM may be removed from the ALB backends

    • Eventually, the ALB may consider all backends unhealthy

    • The solution is to use Readiness Probes, which ensure that unhealthy containers are removed from the service's endpoints
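
To make the last point concrete, the sketch below prints a Deployment manifest whose container declares an HTTP readiness probe. The name web, the nginx image, the probe path, and the timing values are all hypothetical placeholders.

```shell
# Print an example Deployment with a readiness probe.
# Pods failing the probe are dropped from the Service's endpoints,
# so kube-proxy stops forwarding Azure's NodePort probes to them.
readiness_probe_example() {
  cat <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
EOF
}
```

After applying it (readiness_probe_example | kubectl apply -f -), kubectl get endpoints <service-name> should list only the pods that currently pass the probe.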

No target backends present for the internal load balancer (ILB)

This is a known bug (kubernetes#59746 kubernetes#60060 acs-engine#2151) in Kubernetes v1.9.0-v1.9.3, which is caused by an error when matching the ILB's AvailabilitySet.

The fix of this issue (kubernetes#59747 kubernetes#59083) will be included in v1.9.4+ and v1.10+.

No target backends present for the external load balancer

If kubelet is not configured with the cloud provider (e.g. without --cloud-provider=azure --cloud-config=/etc/kubernetes/cloud-config), the node will not join any Azure load balancer. This is because the node registers itself with its hostname as externalID, which is not recognized by kube-controller-manager.

A simple way to check is to compare each node's externalID and name (they should be different).
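
The comparison can be sketched as a small helper that prints the nodes whose externalID equals their name. It assumes a Kubernetes version that still exposes spec.externalID on Node objects; the helper name is hypothetical.

```shell
# List nodes whose spec.externalID equals the node name, i.e. nodes that
# registered without the cloud provider.
nodes_missing_cloud_provider() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.externalID}{"\n"}{end}' |
    awk -F '\t' '$1 == $2 { print $1 }'
}
```

An empty result means every node reported an externalID that differs from its name.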

To fix this issue:

  • Delete the node object: kubectl delete node <node-name>

  • Configure kubelet with the options --cloud-provider=azure --cloud-config=/etc/kubernetes/cloud-config

  • Finally, restart kubelet

PublicIP not removed after deleting LoadBalancer service

This is a known issue (kubernetes#59255) in v1.9.0-1.9.3: a Basic ALB has a default quota of 10 FrontendIPConfigurations. When this quota is exceeded, the ALB FrontendIPConfiguration is not created, but the cloud provider continues to create PublicIPs for those services. After the services are deleted, those PublicIPs are not removed together with them.

The fix of this issue (kubernetes#59340) will be included in v1.9.4+ and v1.10+.

Besides the fix, if more than 10 LoadBalancer services are required in your cluster, you should also increase the FrontendIPConfigurations quota. See Azure subscription and service limits, quotas, and constraints for how to do this.
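
To see how close a cluster is to that quota, the number of LoadBalancer services can be counted. This is a sketch with a hypothetical helper name; it counts Services across all namespaces and does not distinguish internal from public load balancers.

```shell
# Count Services of type LoadBalancer in all namespaces.
count_loadbalancer_services() {
  kubectl get services --all-namespaces \
    -o jsonpath='{range .items[*]}{.spec.type}{"\n"}{end}' |
    grep -c '^LoadBalancer$'
}
```

Note that grep -c prints 0 but exits non-zero when nothing matches, which matters if the helper is used under set -e.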

No credentials provided for AAD application with MSI

When the Azure cloud provider is configured with "useManagedIdentityExtension": true, Managed Service Identity (MSI) is used to authorize Azure APIs. It is broken in v1.10.0-beta because of a refactor: Config.UseManagedIdentityExtension overrides auth.AzureAuthConfig.UseManagedIdentityExtension (kubernetes#60691).

The fix of this issue (kubernetes#60775) will be included in v1.10.

Azure ARM calls rejected because of too many requests

Sometimes, kube-controller-manager or kubelet may fail to call Azure ARM APIs because too many requests were sent within a short period.

Since v1.9.2 and v1.10, the Azure cloud provider caches various resources (e.g. VM, VMSS, NSG and RouteTable), which reduces the number of ARM calls.

Ways to mitigate the issue:

  • Ensure instance metadata is used, e.g. set useInstanceMetadata to true in /etc/kubernetes/azure.json for all nodes and restart kubelet

  • Increase --route-reconciliation-period on kube-controller-manager and restart it, e.g. set the option in /etc/kubernetes/manifests/kube-controller-manager.yaml and kubelet will recreate kube-controller-manager pods automatically
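
Whether a node already uses instance metadata can be checked with a simple pattern match on its cloud config. This is a sketch with a hypothetical helper name; it assumes the key and its value sit on the same line of the JSON file.

```shell
# Succeed only if useInstanceMetadata is set to true in the cloud config.
uses_instance_metadata() {
  local file="${1:-/etc/kubernetes/azure.json}"
  grep -Eq '"useInstanceMetadata"[[:space:]]*:[[:space:]]*true' "$file"
}
```

For example: uses_instance_metadata || echo "set useInstanceMetadata to true and restart kubelet". For the second mitigation, the value passed to --route-reconciliation-period (for instance 1m, an illustrative value rather than a recommendation) has to be chosen for your cluster.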

AKS kubectl logs/exec connection timed out

kubectl logs reports a getsockopt: connection timed out error (AKS#232).

In AKS, kubectl logs, exec, and attach all require the master <-> node tunnels to be established. Check that the tunnelfront and kube-svc-redirect pods are up and running.
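
A minimal sketch of that check, matching on pod names rather than labels (it assumes the pod names contain tunnelfront and kube-svc-redirect, as the text above suggests). The helper name is hypothetical.

```shell
# Show the status of the tunnel-related pods in kube-system.
tunnel_pods() {
  kubectl -n kube-system get pods -o wide |
    grep -E 'tunnelfront|kube-svc-redirect'
}
```

Both kinds of pod should appear with STATUS Running; no output at all means neither was found.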

If the pods are not running, or a net/http: TLS handshake timeout error occurs, delete the tunnelfront pod. A new pod will be recreated after a few seconds.
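
The deletion can be sketched as follows. The helper looks the pod up by name, on the assumption that its name contains tunnelfront; restart_tunnelfront is a hypothetical name.

```shell
# Delete the tunnelfront pod(s) so that the controller recreates them.
restart_tunnelfront() {
  local pods
  pods="$(kubectl -n kube-system get pods -o name | grep tunnelfront)" || {
    echo "no tunnelfront pod found" >&2
    return 1
  }
  # $pods is intentionally unquoted so each pod name becomes its own argument.
  kubectl -n kube-system delete $pods
}
```

Afterwards, kubectl -n kube-system get pods should show a fresh tunnelfront pod reaching the Running state.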

LoadBalancer Service stuck in Pending after Virtual Kubelet deployed

After Virtual Kubelet is deployed, a LoadBalancer Service may get stuck in the Pending state and no public IP can be allocated. Checking the service's events (e.g. with kubectl describe service <service-name>) shows the error CreatingLoadBalancerFailed 4m (x15 over 45m) service-controller Error creating load balancer (will retry): failed to ensure load balancer for service default/nginx: ensure(default/nginx): lb(kubernetes) - failed to ensure host in pool: "instance not found". This is because the virtual Node created by Virtual Kubelet does not actually exist on the Azure cloud platform, so it cannot be added to the backends of the Azure Load Balancer.

Kubernetes 1.9 introduces a new feature gate, ServiceNodeExclusion, for the control plane's Controller Manager. Enabling it in the Controller Manager's manifest (kube-controller-manager --feature-gates=ServiceNodeExclusion=true) allows Kubernetes to exclude Virtual Kubelet nodes (labeled alpha.service-controller.kubernetes.io/exclude-balancer) from Load Balancer pools, so that you can create public-facing services with external IPs without issue.
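
If the virtual node does not carry that label yet, it can be added by hand. This is a sketch; the helper name is hypothetical, and the label value true is an assumption, since the text only requires the label to be present.

```shell
# Label a node so the service controller skips it when building LB pools.
exclude_node_from_load_balancers() {
  local node="$1"
  kubectl label node "$node" \
    alpha.service-controller.kubernetes.io/exclude-balancer=true --overwrite
}
```

The label only takes effect once the ServiceNodeExclusion feature gate is enabled on kube-controller-manager.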

No GPU is found in Node's capacity

This may happen when deploying GPU workloads: the Node's nvidia.com/gpu capacity is always zero. This is caused by a problem in the device plugin. The workaround is to redeploy the nvidia-gpu add-on.
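
A sketch of both steps: first confirm the reported capacity, then re-apply the device plugin manifest. The manifest location is deliberately left as a parameter because it depends on the device plugin and the Kubernetes version in use; both helper names are hypothetical.

```shell
# Show the nvidia.com/gpu capacity reported by every node.
gpu_capacity() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
}

# Re-apply the device plugin manifest (URL or file path supplied by the caller).
redeploy_gpu_plugin() {
  local manifest="$1"
  kubectl apply -f "$manifest"
}
```

After redeploy_gpu_plugin <manifest> has finished and the plugin pods are running, gpu_capacity should report a non-zero GPU count on the GPU nodes.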
