Last week I was working on my Azure Kubernetes Service cluster when I ran into a rather odd issue. I’d created a service with a type of LoadBalancer in order to get an external IP to connect to SQL Server running in a pod from my local machine.
I’ve done this quite a few times at this point so wasn’t expecting anything out of the ordinary.
However, my service never got its external IP address. It remained in a state of pending: –
N.B. – The images in this post are taken after the issue was resolved as I didn’t think at the time to screenshot everything 😦
I knew something was wrong after about 20 minutes as the IP should have definitely come up by then.
So I delved into the service by running: –
kubectl describe service sqlserver-service
And was greeted with the following: –
Error creating load balancer (will retry): failed to ensure load balancer for service default/sqlserver-service: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/subscriptionID/resourceGroups/MC_containers1_SQLK8sCluster1_eastus/providers/Microsoft.Network/loadBalancers?api-version=2017-09-01: StatusCode=0 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS70002: Error validating credentials. AADSTS50012: Invalid client secret is provided.\r\nTrace ID: 17d1f0ce-6c11-4f8e-895d-29194d973900\r\nCorrelation ID: 3e11d85c-77bf-4041-a41d-267bfd5f066c\r\nTimestamp: 2019-01-23 18:58:59Z","error_codes":[70002,50012],"timestamp":"2019-01-23 18:58:59Z","trace_id":"17d1f0ce-6c11-4f8e-895d-29194d973900","correlation_id":"3e11d85c-77bf-4041-a41d-267bfd5f066c"}
Yikes! What’s happened there?
I logged a case with MS Support and when they came back to me, they advised that the service principal that is spun up in the background had expired. This service principal is required to allow the cluster to interact with the Azure APIs in order to create other Azure resources.
When a service is created within AKS with a type of LoadBalancer, a Load Balancer is created in the background which provides the external IP I was waiting on to allow me to connect to the cluster.
Because this principal had expired, the cluster was unable to create the Load Balancer and the external IP of the service remained in the pending state.
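For reference, the service I’d deployed looked something like this (a minimal sketch; the selector label is illustrative and 1433 is SQL Server’s default port): –
apiVersion: v1
kind: Service
metadata:
  name: sqlserver-service
spec:
  type: LoadBalancer
  selector:
    app: sqlserver
  ports:
  - port: 1433
    targetPort: 1433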
So I needed to reset the service principal’s credentials so that it was no longer expired. To do that I needed two pieces of information: the clientId of the cluster and the secret used as the service principal’s password. This wasn’t the easiest process in the world, so I’ll run through how to do it here.
First, log into Azure: –
az login
Then get the clientId of your cluster: –
az aks show --resource-group RESOURCEGROUPNAME --name CLUSTERNAME --query "servicePrincipalProfile.clientId" --output tsv
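If you’re scripting this, the same command can be captured into a variable (a bash sketch): –
CLIENT_ID=$(az aks show --resource-group RESOURCEGROUPNAME --name CLUSTERNAME --query "servicePrincipalProfile.clientId" --output tsv)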
To confirm that the principal is expired: –
az ad sp credential list --id CLIENTID
Check the endDate value in the output. If it’s past the current date, that’s your issue!
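If you’d rather not eyeball the full JSON, a JMESPath query will pull out just the expiry dates: –
az ad sp credential list --id CLIENTID --query "[].endDate" --output tsv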
You may have noticed that mine is set 10 years from now. This is because I’m running these commands to get screenshots after I’ve fixed the issue…I figured 10 years should be long enough 🙂
The way I got the secret was to ssh onto one of the nodes in my cluster. This is a little involved but I’ll go through it step-by-step.
Resources for an AKS cluster are created in a separate resource group (for…reasons). To get that resource group name run: –
az aks show --resource-group RESOURCEGROUPNAME --name CLUSTERNAME --query nodeResourceGroup -o tsv
Then grab the nodes in the cluster (RESOURCEGROUPNAME2 is the output of the above command): –
az vm list --resource-group RESOURCEGROUPNAME2 -o table
And then get the IP address of each node: –
az vm list-ip-addresses --resource-group RESOURCEGROUPNAME2 -o table
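The nodes only have private IPs (they’re not exposed externally), so those are the addresses you want. To pull just the IPs, something like this should work (a sketch, assuming the usual output shape of az vm list-ip-addresses): –
az vm list-ip-addresses --resource-group RESOURCEGROUPNAME2 --query "[].virtualMachine.network.privateIpAddresses[0]" --output tsv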
OK, now that I had the node details I could copy my SSH public key into one of them.
The ssh keys were generated when I created the cluster using the --generate-ssh-keys flag. If you didn’t specify this you’ll need to generate the keys before continuing.
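Something like this will generate a key pair (id_rsa and id_rsa.pub land in ~/.ssh by default): –
ssh-keygen -t rsa -b 4096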
So I copied my public key into one of the nodes: –
az vm user update \
  --resource-group RESOURCEGROUPNAME2 \
  --name NODENAME \
  --username azureuser \
  --ssh-key-value id_rsa.pub
N.B. – I found it easiest to navigate to the directory that held my ssh keys before running this command
Then I spun up a pod with openssh-client installed so that I could ssh into one of the nodes from within the cluster (the nodes aren’t accessible externally).
To do this I created a Docker image from the alpine:latest image with the client installed, pushed it to Docker Hub, and then ran: –
kubectl run -it --rm aks-ssh --image=dbafromthecold/alpine_ssh:latest
N.B. – the dbafromthecold/alpine_ssh:latest image is public so this will work for you as well
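If you’d rather build your own, the Dockerfile only needs a couple of lines (a sketch of the equivalent; the tag is up to you): –
FROM alpine:latest
RUN apk add --no-cache openssh-client
CMD ["/bin/sh"]
Build it with docker build -t YOURREPO/alpine_ssh:latest . and push it to your own repository.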
In a separate command prompt I got the name of the pod: –
kubectl get pods
And then copied my private ssh key into the pod: –
kubectl cp id_rsa PODNAME:/id_rsa
Once the key was copied in, I closed that window and went back to the original window where I had run the pod and changed the permissions on the private key: –
chmod 0600 id_rsa
And then I was able to ssh into one of the nodes: –
ssh -i id_rsa azureuser@NODEIPADDRESS
The secret is contained in a JSON file on the node (as aadClientSecret). To grab it I ran: –
sudo cat /etc/kubernetes/azure.json
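If you only want the secret rather than the whole file, filtering works too: –
sudo cat /etc/kubernetes/azure.json | grep aadClientSecret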
Once I had that information I could exit the node, then the pod, and update the service principal: –
az ad sp credential reset --name CLIENTID --password SECRET --years 10
I confirmed that the service principal had been updated: –
az ad sp credential list --id CLIENTID
And was then able to deploy a LoadBalancer type service and get an external IP!
kubectl get services
Phew 🙂
Hope that helps anyone who runs into the same issue!
Nowadays it is possible to run "sudo cat /etc/kubernetes/azure.json" via the Azure Portal by running RunShellScript on the selected VM (Run Commands), so there is no need to run the "aks-ssh" pod.
Nice, thanks for sharing that.
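For anyone who wants to script it rather than use the portal, the CLI equivalent would be something along these lines (using the node resource group and VM name from earlier in the post): –
az vm run-command invoke --resource-group RESOURCEGROUPNAME2 --name NODENAME --command-id RunShellScript --scripts "sudo cat /etc/kubernetes/azure.json"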
This article is perfect. I had a similar issue when I was upgrading the Kubernetes version and the load balancer service status went to Pending. I ran the below script (with actual values) and it started working: –
az aks update-credentials \
  --resource-group resource-group-name \
  --name cluster-name \
  --reset-service-principal \
  --service-principal service-principal-id \
  --client-secret service-principal-secret
One issue I found: we already had a load balancer which was working before the Kubernetes version upgrade, but after the upgrade and updating the service principal, a new load balancer was created with a different IP. I’m not sure why this happened; I was expecting the old load balancer IP to be used, but it wasn’t, so we had to update the nginx services to point to the new IP.
I’m not sure if it was the version upgrade or the service principal update that created the new load balancer. Can we keep the old load balancer IP for the service? If not, after every version upgrade we might need to update the IP address at the NGINX level.
Thanks
Mansoor
You’d need to configure a static IP address for the load balancer by dropping the following into your service yaml file: –
loadBalancerIP: 40.121.183.52
Otherwise a load balanced service will get a different IP when it is recreated.
Full info is here: –
https://docs.microsoft.com/en-us/azure/aks/static-ip
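For context, that setting sits in the service spec like so (a sketch; the IP needs to be a static public IP, typically created in the cluster’s node resource group): –
apiVersion: v1
kind: Service
metadata:
  name: sqlserver-service
spec:
  type: LoadBalancer
  loadBalancerIP: 40.121.183.52
  selector:
    app: sqlserver
  ports:
  - port: 1433
    targetPort: 1433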
Thank you for your response. I will try creating a static IP address and see if this works after the version upgrade.