Adjusting pod eviction time in Kubernetes

One of the best features of Kubernetes is the built-in high availability.

When a node goes offline, all pods on that node are terminated and new ones spun up on a healthy node.

The default time that it takes from a node being reported as not-ready to the pods being moved is 5 minutes.

This really isn’t a problem if you have multiple pods running under a single deployment. The pods on the healthy nodes will handle any requests made whilst the pod(s) on the downed node are waiting to be moved.

But what happens when you only have one pod in a deployment? Say, when you’re running SQL Server in Kubernetes? Five minutes really isn’t an acceptable time for your SQL instance to be offline.

The simplest way to adjust this is to add the following tolerations to your deployment: –

      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10

N.B.- You can read more about taints and tolerations in Kubernetes here

This will move any pods in the deployment to a healthy node 10 seconds after a node is reported as either not-ready or unreachable.
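For context, here's a minimal sketch of a deployment showing where those tolerations sit — they go in the pod template's spec, alongside containers (the names and image here are just placeholders, not from my actual deployment):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sqlserver            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sqlserver
  template:
    metadata:
      labels:
        app: sqlserver
    spec:
      containers:
      - name: sqlserver
        image: mcr.microsoft.com/mssql/server:2019-latest   # example image
      # tolerations live at the pod spec level, not under the container
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
```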

But what if you wanted to change the default setting across the cluster?

I was trying to work out how to do this last week and the official docs here reference a flag for the controller manager: –

--pod-eviction-timeout duration     Default: 5m0s
The grace period for deleting pods on failed nodes.

Great stuff! That’s exactly what I was looking for!

Unfortunately, it seems that this flag no longer works.

The way to set the eviction timeout value now is to set the flags on the api-server.

Now, this is done differently depending on how you installed Kubernetes. I installed this cluster with kubeadm, so I needed to create a kubeadm-apiserver-update.yaml file: –

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.18.0
apiServer:
  extraArgs:
    enable-admission-plugins: DefaultTolerationSeconds
    default-not-ready-toleration-seconds: "10"
    default-unreachable-toleration-seconds: "10"

N.B.- Make sure the kubernetesVersion is correct

And then apply: –

sudo kubeadm init phase control-plane apiserver --config=kubeadm-apiserver-update.yaml

You can verify that the change has been applied by checking the api-server pods in the kube-system namespace (they should refresh) and by checking here: –

cat /etc/kubernetes/manifests/kube-apiserver.yaml

Let’s see it in action! I’ve got one pod running SQL Server in a K8s cluster on node kubernetes-w1. Let’s shut down that node…

Alright, that’s not exactly 10 seconds…there’s a couple of other things going on. But it’s a lot better than 5 mins!

The full deployment yaml that I used is here.

Ok, it is a bit of a contrived test, I’ll admit. The node was shut down gracefully and I haven’t configured any persistent storage BUT this is still better than having to wait 5 minutes for the pod to be spun up on the healthy node.

N.B.- If you’re working with a managed Kubernetes service, such as AKS or EKS, you won’t be able to do this. You’ll need to add the tolerations to your deployment.

Thanks for reading!

Data Céilí 2020 Cancelled

Yesterday we announced that Data Céilí 2020 has been cancelled due to the continuing threat of COVID-19.

As much as we wanted to put this event on, the safety of all attendees is paramount and we can’t guarantee that at this moment in time.

I want to thank all the speakers who submitted to our event. I’m pretty sure a lot of you got sick of me badgering you to submit 🙂

This would have been the first year we were to run and the response we had absolutely blew us away. We wanted Data Céilí to be the biggest and best MS Data Platform event in Ireland and I’m certain with the quality of submissions that we had, that would have been the case.

We’re going to regroup and start planning, because Data Céilí will be back in 2021.

In the meantime, the Irish SQL User Groups have come together to run regular virtual meetups.

If you would like to present, please contact me on Twitter @dbafromthecold or at dbafromthecold@gmail.com

Thank you, and stay safe.

Andrew

Two years of working remotely

I’ve been working remotely for just over 2 years now and my current position is my first remote post.

Before joining my current company, I (and they) had concerns about me working from home. Would I like it? Would I find it isolating? Would I go completely mad?

Actually that last question is one I’ve been asked a lot when I tell people that I work from home. My usual response is….”If I’ve gone mad, how would I know??”

My office buddy, Spuddy!

Anyway as it turns out, I absolutely love working from home!

I do want to say that I think remote working is not for everyone though. If you enjoy the social aspect of working in an office you probably won’t enjoy working from home. It also really depends on the team that you work with. If your team can communicate effectively out of the office, then you’re in good stead.

I’ve been lucky in that the team I work with was used to having purely remote members. I wasn’t the first DBA they’d hired from Ireland (the majority of my team is based in the U.S.) and they are really good at using virtual meetings and Google Chat.

To be honest it doesn’t really matter which chat program you use, as long as you use one, and use it well! I’ve often thought about how my job would be if I had to communicate solely via email and I just…shudder.

So, I don’t want this post to be a list of guidelines for what you should do when working from home, nor will it be a list of equipment that you should buy. What I want to do is talk about what works for me, and you can then decide if any of the things mentioned here will work for you.

For example, I don’t listen to music when I work. I prefer to work in pretty much complete silence…I can concentrate better. I’ve only ever really listened to music when I’ve been working on something pretty boring and repetitive…something that (thankfully) doesn’t happen much in my current role.

There are probably people who read that and couldn’t imagine the thought of working in complete silence…what works for me doesn’t work for them, we’re all different so I don’t think that there could ever be a definitive guide to “how to work from home”. This post is just to talk about my experiences.

One thing that people told me when I first started working from home was that they would have difficulty focusing as there are too many distractions. This was something that I was concerned about as I’m definitely not the most disciplined of people…would I just while away the hours surfing online?

To be honest, it really hasn’t been an issue. Once I’ve started working…I’m working. Ok, there have been times where I’ve spent half an hour on chores that I really should have done in the evening, but I’ve used that time to try and step away from a problem that I was stuck on, and come back to it fresh(er). Ack, I’ve had mixed results…sometimes it works and sometimes it doesn’t…but hey, at least I’ve done my washing! One thing it does do is stop me getting overly frustrated and ending up in a real tangle.

With regards to work areas, a lot of people have a separate office…which I’d LOVE to have. I don’t have a separate room for work, my work desk is in a corner of my living room. This is fine for me as I live on my own but I do make efforts to ensure that I don’t start to feel like I spend all of my life in one room.

My work area

So I try to get out of my flat every chance I get. I go to the gym, go for a walk along the beach…hit the pub (AFTER work, honestly). One thing that I always do no matter what, is go and buy a coffee down the street at around 10am. It gives me a break an hour into the day, gets me some fresh air, and then I can get back to work.

Lunch time is another chance to get away from the desk, even if it’s just a stroll around the block, just something to get out. Then I’ll pretty much work straight through to 6pm but then, again, I try to get out. Of course, this is all weather permitting so with me being in Ireland, this doesn’t happen every day.

It can also be difficult just to tear myself away some days. I’ll get deep into something, stay glued to the desk but I’d say I get out of the flat at least twice a day. Just to make sure those walls don’t start closing in on me!

Another great way of breaking up the day is a standing desk. Although this’ll work no matter where you are (office or at home). I bought a cheap converter desk and would highly recommend it to anyone who’s looking at buying their first standing desk.

I’ll stand up for the first part of the morning, sit down after I get my coffee, back standing after lunch, and then sit down for the rest of the day. It seems to break the day up nicely.

Then, when the day’s over, I get away from my work area. If I have to continue using a computer to work on blog posts, sessions etc. I’ll get out of the flat for a bit and then when I come back, I’ll work from either my settee, or my dining table.

Some of the things I’ve talked about here won’t work for you but the BEST thing about working remotely is that you get to find out what DOES work for you. You have (almost?) total control over your work environment and you can tailor it to how you want it.

Thanks for reading!

Merge kubectl config files on Windows

When working with multiple Kubernetes clusters, at some point you’ll want to merge your kubectl config files.

I’ve seen a few blogs on how to merge kubectl config files but haven’t seen any on how to do it on Windows. It’s pretty much the same process, just adapted for PowerShell on Windows.

In this example, I’ll merge a new config file in C:\Temp to my existing config file in C:\users\andrew.pruski\.kube

N.B.- If you’re working with AKS, az aks get-credentials will do this for you

Firstly, backup the existing config file:-

Copy-Item C:\users\andrew.pruski\.kube\config C:\users\andrew.pruski\.kube\config_backup

Copy the new config file into the .kube directory: –

Copy-Item C:\Temp\config C:\users\andrew.pruski\.kube\config2

Set the KUBECONFIG environment variable to point at both config files (note that on Windows the separator is a semicolon, rather than the colon used on Linux/macOS):-

$env:KUBECONFIG="C:\users\andrew.pruski\.kube\config;C:\users\andrew.pruski\.kube\config2"

Export the output of the config view command (which references both config files) to a config_tmp file: –

kubectl config view --raw > C:\users\andrew.pruski\.kube\config_tmp

Check all is working as expected (all clusters can be seen):-

kubectl config get-clusters --kubeconfig=C:\users\andrew.pruski\.kube\config_tmp

If all is working as expected, replace the old config file with the config_tmp file: –

Remove-Item C:\users\andrew.pruski\.kube\config
Move-Item C:\users\andrew.pruski\.kube\config_tmp C:\users\andrew.pruski\.kube\config

Finally, confirm it’s working: –

kubectl config get-clusters

Thanks for reading!

Chaos Engineering and SQL Server

Recently I’ve been delving into Chaos Engineering, reading books, watching videos, listening to podcasts etc. and I find it really intriguing….I mean, it just sounds exciting, right?
CHAOS Engineering!

N.B.- if you want a great resource for how to get into Chaos Engineering, I’d recommend Learning Chaos Engineering by Russ Miles. I’m using concepts and methods from that book to base this (hopefully) series of posts focusing on SQL Server but if you want a more in-depth dive…grab a copy of the book.

OK, before we move onto applying to SQL Server…first, a bit of history.

Back in 2010 Netflix migrated their platform to the cloud. When they did so they decided to adopt a mindset of: –

The best way to avoid failure is to fail constantly

The idea behind this is that if the platform cannot withstand a (semi)controlled outage, how will it react to an uncontrolled outage?

Out of that mindset came Chaos Monkey. A tool that’s designed to randomly terminate instances within their environment. Sounds nuts, right?

This is where Chaos Engineering comes from. So what exactly is it?

Principlesofchaos.org defines Chaos Engineering as: –

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Chaos Engineering is a practice in which we run experiments against a system to see if it reacts the way we expect it to when it encounters a failure.

We’re not trying to break things here…Chaos Engineering is not breaking things in production.

If I said to my boss that we’re now going to be experiencing an increased amount of outages because, “I AM AN ENGINEER OF CHAOS”, I’d be marched out the front door pretty quickly.

What we’re doing is investigating our systems to see if they will fail when faced with certain conditions.

We don’t even have to touch production with our tests, and to be honest, I’d recommend running in a development or sandbox environment for your first few experiments. As long as the configuration of the SQL instances in those environments mirrors your production servers then you can definitely get some benefit from running chaos experiments in them.

Now, I know what you’re thinking. “Is Chaos Engineering really just a buzz phrase for resilience testing?”.

Well, yep. Resilience testing is pretty much what we’re doing here but hey, Chaos Engineering sounds cooler.

Anyway, moving on….So how can we apply Chaos Engineering to SQL Server?

The first thing we need to do is identify a potential weakness in SQL Server and the best way to do that is by performing a Past Incident Analysis.

Performing a past incident analysis is a great way to start looking for potential weaknesses/failures in your environment. The main reason being, we want to run a Chaos experiment for a condition that is likely to happen. There’s really no point in running an experiment against a perceived failure/weakness that’s never going to happen (or is extremely unlikely) because we want to get some actionable results from these tests.

The end goal here is to increase our confidence in our systems so that we know that they will react as we expect them to when they encounter failure.

So we want to identify a potential failure that’s pretty likely to happen and could potentially have a significant impact.

If an incident analysis hasn’t thrown up any candidates another good method is to perform a Likelihood-Impact analysis.

You sit your team down and think about all the ways SQL Server (and the systems around it) can possibly fail.

N.B. – this is really good fun

Then you rank each failure in terms of how likely it is and how much of an impact it would have. After doing this, you’ll end up with a couple (few?) failures in the red areas of the graphs…your first candidates for your Chaos experiments 🙂

OK, let’s think about some failures…

High Availability
We have a two node cluster hosting an availability group. One test we could run is to fail over the availability group to make sure that it’s working as we expect it to. Now we could run

ALTER AVAILABILITY GROUP [NAME] FAILOVER

but that’s a very sanitised way of failing over the AG. How about running a Chaos experiment that shuts down the primary node? Wouldn’t that be a more realistic test of how the AG could fail out in the “wild”?

Monitoring
We don’t just have to test SQL Server…we can test the systems around it. So how about our monitoring systems? Say we run a query against a (test) database that fills the transaction log. When do we get alerted? Do we only get an alert once the log has filled up or do we get preemptive alerts? Do we only get an alert when there’s an issue? Is that how we want our monitoring systems to behave? Monitoring systems are vital to our production environments so testing them is an absolute must.
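As a sketch of that kind of experiment, something like this will steadily grow the transaction log so you can watch when (and whether) the alerts fire. The database and table names here are hypothetical — run it against a throwaway test database only:

```sql
-- TEST database only! This deliberately bloats the transaction log.
-- Assumes a database called [ChaosTest] in FULL recovery with no log backups running.
USE [ChaosTest];
GO
CREATE TABLE dbo.LogFiller (ID INT IDENTITY, Padding CHAR(8000) DEFAULT 'x');
GO
-- Each insert generates log records; the GO N syntax repeats the batch.
-- Keep an eye on your monitoring while this runs...
INSERT INTO dbo.LogFiller DEFAULT VALUES;
GO 100000
```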

Backups
When was the last time we tested our backups? If we needed to perform a point-in-time restore of a production database right now, would we be able to do it quickly and easily? Or would we be scrambling round getting scripts together? A restore strategy is absolutely something that we want to work when we need it to so we can run experiments to test it on a regular basis (dbatools is awesome for this).
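Since dbatools got a mention, its Test-DbaLastBackup command automates exactly this kind of experiment — it restores each database’s most recent backups onto a test instance and runs DBCC CHECKDB against the result. A quick sketch (the instance names are placeholders):

```powershell
# Restore the last backups of each database onto a test instance and verify them
# SQLPROD01 / SQLTEST01 are hypothetical instance names - use your own
Test-DbaLastBackup -SqlInstance SQLPROD01 -Destination SQLTEST01
```

Scheduling something like this regularly turns “do our backups restore?” from a hope into a repeatable experiment.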

Disaster recovery
OK, let’s go nuclear for the last one. Do we have a DR solution in place? When was the last time we tested failing over to it? We really don’t want to be enacting our DR strategy for the first time when production is down (seriously).

Those are just a few examples of areas that we can test…there are hundreds of others that can be run. Literally any system or process in production can have a Chaos Engineering experiment run against it.

So now that we’ve identified some failures, we need to pick one and run an experiment…which I’ll discuss in an upcoming post.

Thanks for reading!