Creating a STONITH resource for a pacemaker cluster on VMWare virtual machines

Nov 3, 2022 ~ dbafromthecold

A while back I wrote a post about creating a pacemaker cluster to run SQL Server availability group using the new Ubuntu images in Azure.

Recently I had to create another pacemaker cluster, this time on-premises using VMWare virtual machines. The steps to create the pacemaker cluster and deploy an availability group where pretty much the same as in my original post (minus any Azure marlarkey) but one step was different, creating the STONITH resource.

A STONITH resource is needed in a pacemaker cluster as this is what prevents the dreaded split brain scenario…two nodes thinking that they’re the primary node. If the resource detects a failed node in the cluster it’ll restart that node, hopefully allowing it to come up in the correct state.

There are different types of STONITH resources, in my original post I used a fence_azure_arm type, not available to me for my on-premises cluster.

So which type do you use and how do you configure it?

N.B. – This was a three node cluster running Ubuntu 20.04 and I configured it using crmsh

In order to list which types are available, run:-

crm ra list stonith

There are a few ones related to VMWare, I ended up going with the fence_vmware_rest type.

To test the resource before deploying: –

fence_vmware_rest -a <VSPHERE IP ADDRESS> -l <LOGIN> -p <PASSWORD> --ssl-insecure -z -o list | egrep "(<NODE1>|<NODE2>|<NODE3>)"
fence_vmware_rest -a <VSPHERE IP ADDRESS> -l <LOGIN> -p <PASSWORD> --ssl-insecure -z -o status -n <NODE1>

Now we can create the resource: –

sudo crm configure primitive fence_vmware stonith:fence_vmware_rest \
params \
ipaddr="<VSPHERE IP ADDRESS>" \
action=reboot \
login="<LOGIN>" \
passwd="<PASSWORD>" \
ssl=1 ssl_insecure=1 \
pcmk_reboot_timeout=900 \
power_timeout=60 \
op monitor \
interval=3600 \
timeout=120

There are a whole load of properties that can be set, to check them out run: –

crm ra info stonith:fence_vmware_rest

We can also configure additional properties: –

sudo crm configure property cluster-recheck-interval=2min
sudo crm configure property start-failure-is-fatal=true
sudo crm configure property stonith-timeout=900

A good explanation of these properties can be found here.

Now enable the STONITH resource: –

sudo crm configure property stonith-enabled=true

Now that the resource has been created and enabled, confirm the cluster status: –

sudo crm status

Awesome, we have our STONITH resource up and running in the cluster!

If you want to test the resource, this will fence a node: –

sudo crm node fence <NODE>

So that’s how to deploy a STONITH resource for a pacemaker cluster on VMWare virtual machines. If you want to see the whole process of creating the cluster, the code is available here.

One word of caution, there are a lot of STONITH and cluster properties that can be set…please remember to test your configuration fully before deploying to production!

Thanks for reading!

Overcoming a storage failover issue with SQL Server on Kubernetes using Portworx

Oct 11, 2022 ~ dbafromthecold

In a previous post I talked about an issue with storage failover when a Kubernetes node goes offline.

In a nutshell, the issue is that the attachdetach-controller in Kubernetes won’t detach storage from an offline node until that node is either brought back online or is removed from the cluster. What this means is that a pod spinning up on a new node that requires that storage can’t come online.

UPDATE – This issue has been resolved in Kubernetes version 1.26. Details are on this github issue: –
https://github.com/kubernetes/kubernetes/issues/65392

And there’s more on the official Kubernetes blog (when a feature called non-graceful node shutdown when into beta): –
https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/

Aka, if you’re running SQL Server in Kubernetes and a node fails, SQL won’t be able to come back online until someone manually brings the node online or deletes the node.

Not great tbh, and it’s been a blocker for my PoC testing.

However, there are ways around this…one of them is by a product called Portworx which I’m going to demo here.

DISCLAIMER – I now work for Pure Storage who own Portworx…there are other options to get around this issue (Azure Arc Data Services being one) but I’ve found Portworx to be the simplest solution and everything I’m demoing here is completely free so you can go and test it for yourself

So let’s run through using Portworx to first create storage in Kubernetes for SQL Server, and then test a node failure.

First thing to do is log into Azure using the azure-cli: –

az login

Then create a role for portworx in Azure: –

az role definition create --role-definition '{
    "Name": "portworx",
    "Description": "",
    "AssignableScopes": [
        "/subscriptions/<subscription-id>"
    ],
    "Actions": [
        "Microsoft.ContainerService/managedClusters/agentPools/read",
        "Microsoft.Compute/disks/delete",
        "Microsoft.Compute/disks/write",
        "Microsoft.Compute/disks/read",
        "Microsoft.Compute/virtualMachines/write",
        "Microsoft.Compute/virtualMachines/read",
        "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/write",
        "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read"
    ],
    "NotActions": [],
    "DataActions": [],
    "NotDataActions": []
}'

Then grab the resource group that the Kubernetes cluster resources are in: –

az aks show --name <cluster-name> --resource-group <cluster-resource-group> | jq -r '.nodeResourceGroup'

Then create a service principal in Azure dropping the subscription ID and Kubernetes resource group in: –

az ad sp create-for-rbac --role="portworx" --scopes="/subscriptions/<subscription-id>/resourceGroups/<resource-group>"

Now create secret for portworx to access Azure APIs using the service principal’s details: –

kubectl create secret generic -n kube-system px-azure --from-literal=AZURE_TENANT_ID=<tenant> \
                                                      --from-literal=AZURE_CLIENT_ID=<appId> \
                                                      --from-literal=AZURE_CLIENT_SECRET=<password>

Now we need to go to the Portworx Portal to generate a yaml file to deploy to the AKS cluster: –
https://central.portworx.com

Create an account and once in, select Portworx Enterprise (don’t worry there’s a free option coming): –

Now select Portworx Essentials and click Continue: –

On the next page, leave everything as the default and click Next: –

Then on the Storage page, select the Cloud option, then Azure, and pick a size for the disks (I went with 50GB): –

Leave everything as the default settings on the next page (the Network page) and click Next.

And on the final, Customise page, select AKS and hit Finish: –

N.B. – you can also go into the Advanced Settings section and choose what options to deploy for Portworx here. I disabled pretty much everything for this demo but you can leave the defaults.

Ok, on the next page we have some instructions. The first thing to do is deploy the Portworx operator: –

kubectl apply -f 'https://install.portworx.com/2.10?comp=pxoperator'

Once that’s deployed we can download the spec generated: –

Great! We are ready to deploy Portworx to the cluster: –

kubectl apply -f portworx_essentials_sql_demo.yaml

Let’s have a look at some of the resource that deployed. View the Portworx pods: –

kubectl get pods -n=kube-system -l name=portworx -o wide

N.B. – These may take around 5 minutes to come online

To confirm Portworx’s status: –

PX_POD=$(kubectl get pods -l name=portworx -n kube-system -o jsonpath='{.items[0].metadata.name}')
kubectl exec $PX_POD -n kube-system -- /opt/pwx/bin/pxctl status

Now we can create a Portworx storage class. Create the following px_storageclass.yaml file: –

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: px-mssql-sc
provisioner: kubernetes.io/portworx-volume
parameters:
   repl: "3"
   io_profile: "db_remote"
   priority_io: "high"
allowVolumeExpansion: true

Then deploy to the cluster: –

kubectl apply -f .\px_storageclass.yaml

Confirm the storage class is now available: –

kubectl get storageclass

Now create a px_pvc.yaml file referencing the Portworx storage class: –

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: mssql-system
   annotations:
     volume.beta.kubernetes.io/storage-class: px-mssql-sc
spec:
   accessModes:
     - ReadWriteOnce
   resources:
     requests:
       storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: mssql-data
   annotations:
     volume.beta.kubernetes.io/storage-class: px-mssql-sc
spec:
   accessModes:
     - ReadWriteOnce
   resources:
     requests:
       storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
   name: mssql-log
   annotations:
     volume.beta.kubernetes.io/storage-class: px-mssql-sc
spec:
   accessModes:
     - ReadWriteOnce
   resources:
     requests:
       storage: 1Gi

Create the PVCs: –

kubectl apply -f .\px_pvc.yaml.yaml

Confirm the PVCs: –

kubectl get pvc

And the corresponding PVs: –

kubectl get pv

Now we can deploy SQL Server using the PVCs. Create the following px_sqlserver.yaml file: –

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: sqlserver
  name: sqlserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sqlserver
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: sqlserver
    spec:
      securityContext:
        fsGroup: 10001
      containers:
      - image: mcr.microsoft.com/mssql/server:2019-CU11-ubuntu-18.04
        name: sqlserver
        env:
        - name: ACCEPT_EULA
          value: "Y"
        - name: MSSQL_SA_PASSWORD
          value: "Testing1122"
        resources:
          requests:
            memory: "1024Mi"
            cpu: "1000m"
          limits:
            memory: "2048Mi"
            cpu: "2000m"
        volumeMounts:
        - name: system
          mountPath: /var/opt/mssql
        - name: user
          mountPath: /var/opt/sqlserver/data
        - name: log
          mountPath: /var/opt/sqlserver/log
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      volumes:
      - name: system
        persistentVolumeClaim:
          claimName: mssql-system
      - name: user
        persistentVolumeClaim:
          claimName: mssql-data
      - name: log
        persistentVolumeClaim:
          claimName: mssql-log
  strategy:
    type: Recreate

And deploy: –

kubectl apply -f .\px_sqlserver.yaml

Confirm the deployment: –

kubectl get deployments

View pod created: –

kubectl get pods

View pod events: –

kubectl describe pods

N.B. – Note that we have the PVCs attached to the pod but there are no events for the attachdetach-controller in the pod events

Ok, now that we have a pod running SQL Server in the cluster…let’s see what happens when we shut down the node the pod is running on.

Confirm the node that the pod is running on: –

kubectl get pods -o wide

And then shut that node down in the portal: –

Confirm the node is down: –

kubectl get nodes

And then watch the pods: –

kubectl get pods --watch

A new pod is spun up and goes into the running status on another node: –

kubectl get pods -o wide

Ok, the old pod will stay in the terminated status until the offline node is either brought back online or removed from the cluster…but we have SQL Server up and running!

So Portworx fixes the issue that we saw in my previous post…which was a show stopper when I was running a proof of concept for SQL Server in Azure Kubernetes Service.

Thanks for reading!

Running SQL Server in Windows Subsystem for Linux (WSL)

Sep 27, 2022 ~ dbafromthecold

I previously wrote a post on how to convert a SQL Server Docker image to a Windows Subsystem for Linux distribution.

I did this because if you tried to run SQL Server in WSL before now, you’ll be presented with this error: –

This happens because up until now, WSL did not support systemd. However recently Microsoft announced systemd support for WSL here: –
https://devblogs.microsoft.com/commandline/systemd-support-is-now-available-in-wsl/

This is pretty cool and gives us another option for running SQL Server locally on linux (great for testing and getting to grips with the Linux platform).

So how can we get this up and running?

Before going any further, the minimum version required for to get this running is OS version 10.0.22000.0 (a recent version of Windows 11).

I tried getting this to work on Windows 10, but no joy I’m afraid

UPDATE – 2022-11-23 – Microsoft have now made this available on Windows 10 (but I have not tested it I’m afraid) – the announcement is here: –
https://devblogs.microsoft.com/commandline/the-windows-subsystem-for-linux-in-the-microsoft-store-is-now-generally-available-on-windows-10-and-11/

First thing to do is get WSL up to the version that supports systemd. It’s only currently available through the store to Windows Insiders but you can download the installer from here: –
https://github.com/microsoft/WSL/releases

Run the installer once downloaded and then confirm the version of WSL: –

Now install a distro to run SQL Server on from the Microsoft Store: –

N.B. – I’m using Ubuntu 20.04.5 for this…I did try with Ubuntu 22.04 but couldn’t get it to work.

Once installed and log into WSL…update and upgrade: –

sudo apt update
sudo apt upgrade

Cool, ok now we are going to enable systemd in WSL. Create a /etc/wsl.conf file and drop in the following: –

[boot]
systemd=true

Exit out of WSL and then run: –

wsl --shutdown

Jump straight back into WSL and run the following to confirm systemd is running: –

systemctl list-unit-files --type=service

Great stuff, now we can run through the usual SQL install process (detailed here)

Import the GPG keys: –

wget -qO- https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -

sudo add-apt-repository "$(wget -qO- https://packages.microsoft.com/config/ubuntu/20.04/mssql-server-preview.list)"

Update and install SQL Server: –

sudo apt-get update
sudo apt-get install -y mssql-server

Once that’s complete, you’ll be able to configure SQL Server with mssql-conf: –

sudo /opt/mssql/bin/mssql-conf setup

Confirm SQL Server is running: –

We can also confirm the processes running by: –

ps aux | grep mssql

Finally, connect to SQL Server in SSMS (using 127.0.0.1 not localhost): –

And there we have it, SQL Server running in a Windows Subsystem for Linux distro!

Thanks for reading!

EightKB August 2022

Jul 13, 2022 ~ dbafromthecold

EightkB is back!

Today we announced the schedule for the next EightKB, which is happening at 1pm UTC on the 3rd of August.

We have five more amazing, mind-melting sessions from 7 amazing speakers.

Really looking forward to this one!

I want to say a massive thank you to all the speakers who submitted. Selection was difficult as always, with only 5 slots available meant us having to make difficult decisions. Thank you so much for submitting and I hope that you will submit to our future events.

Registration for EightKB is completely free and you can sign up here: – https://eightkb.online/

Hope to see you on the 3rd!

Shutting down nodes in Azure Kubernetes Service

Jun 24, 2022 ~ dbafromthecold

A while back I wrote a post on Adjusting Pod Eviction Timings in Kubernetes. To test the changes made in that post I had to shut down nodes in an Azure Kubernetes Service cluster.

This can be done easily in the Azure portal: –

However I did a presentation recently and didn’t want to have to keep jumping into the portal from VS Code…so I wanted to be able to shut down the nodes in code.

So here’s how to use the azure-cli to shut down a node in an Azure Kubernetes Service cluster.

DISCLAIMER – the following code should only be run against a test cluster!

Firstly, to test, let’s deployment a simple application to the cluster: –

kubectl create deployment test --image=nginx

Confirm: –

kubectl get all

To check the node that the pod is running on: –

kubectl get pods -o wide

kubectl get pods -o jsonpath="{.items[0].spec.nodeName}"

Assign the node that the pod is running on to a variable (we’ll use this in a minute): –

NODE=$(kubectl get pods -o jsonpath="{.items[0].spec.nodeName}" | sed 's/.$/\U&/')  && echo $NODE

N.B. – the sed command at the end of the statement above is to make sure that the last character of the node name is in upper case, which is needed to get the instance ID of the VM in a later statement.

OK now we can look at shutting down the node that the pod is running on.

Nodes in AKS run in a nodepool which is a virtual machine scale set that is in a different resource group that the kubernetes cluster itself. The naming convention of that resource group is: –

MC_resourcegroup_clustername_location

The cluster in the examples here is called kubernetes1 in the resource group kubernetes in EASTUS.

So set the resource group name: –

RESOURCEGROUP="MC_kubernetes_kubernetes1_eastus"

Now we can grab the VMSS name in two ways, firstly by running: –

VMSSNAME=$(az vmss list --resource-group $RESOURCEGROUP --query "[].name" -o tsv) && echo $VMSSNAME

N.B. – AKS clusters can have multiple nodepools in which case the query above will return multiple values and won’t work.

Or we use the $NODE variable we set earlier and strip out the last few characters: –

VMSSNAME=${NODE:0:27} && echo $VMSSNAME

Once we have the NODE name and the VMSS name, we need to get the instance ID of the VM in the scale set:-

INSTANCEID=$(az vmss list-instances --name $VMSSNAME --resource-group $RESOURCEGROUP --query "[?osProfile.computerName=='$NODE'].[instanceId]" -o tsv) && echo $INSTANCEID

And now we can shut down the node: –

az vmss deallocate --name $VMSSNAME --instance-ids $INSTANCEID --resource-group $RESOURCEGROUP

Confirm that the node is offline: –

kubectl get nodes

Great! The node is offline! Our test pod didn’t have any tolerations set so it’ll take 5 minutes for a new pod to be created on a healthy node. You can check out how to adjust this in my previous post.

Finally, to restart the node: –

az vmss start --name $VMSSNAME --instance-ids $INSTANCEID --resource-group $RESOURCEGROUP

And that’s how to shutdown a node in AKS to test pod eviction!

Thanks for reading!

DBA From The Cold

Ramblings on working as a SQL Server DBA

Creating a STONITH resource for a pacemaker cluster on VMWare virtual machines

Overcoming a storage failover issue with SQL Server on Kubernetes using Portworx

Running SQL Server in Windows Subsystem for Linux (WSL)

EightKB August 2022

Shutting down nodes in Azure Kubernetes Service