I made the mistake of upgrading my home Kubernetes cluster, expecting that all the move-pods-around goodness would just work and nothing would go down.
Well, hope is not a strategy. Here is my postmortem of the event, following the Google SRE book's postmortem template.
Status#
Mitigated, but not resolved.
Summary#
A Kubernetes upgrade across a set of worker hosts took down the only instance of pi-hole on my network. The upgrade of the worker node running pi-hole hung after its pods were taken down but before they were brought back up. The upgrade could not continue until the worker node was rebooted and the upgrade restarted, after which it completed successfully.
Impact#
While pi-hole was down, external DNS resolution was impossible. It appeared as if “the Internet was down,” and no internal connections to the outside Internet worked. Errors were all of the type “Cannot resolve DNS host name.”
Root Causes#
The single instance of pi-hole was down due to a hung kubelet upgrade. The root cause of the hang is still unknown.
Trigger#
A Kubernetes version upgrade.
Resolution#
The worker host was restarted, which caused the pi-hole pod to restart.
Detection#
A user on the network was unable to access a Zoom conference. DNS resolution in the Chrome browser failed.
Action Items#
- Run multiple pi-hole instances on different worker hosts, to provide HA when a worker node containing a pi-hole pod is lost.
- Understand Deployments and why the pod wasn’t moved to another host during the initial upgrade drain.
- Configure my underlying DNS resolver to use an external DNS server as a fallback to pi-hole (see the sketch after this list).
- Monitoring improvements: Detect when pi-hole is down. Detect when we’re using fallback DNS instead of pi-hole. Detect when my Internet connection is truly down.
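As a rough sketch of the fallback idea: assuming the gateway’s resolver is dnsmasq (an assumption on my part; the same idea applies to unbound or systemd-resolved), strict-order makes it try upstream servers in the order they are listed, so pi-hole is preferred and the public resolver is only consulted when pi-hole stops answering. The addresses here are hypothetical, and the exact failover behavior is worth verifying against the dnsmasq documentation.

# /etc/dnsmasq.conf (hypothetical addresses)
# Try upstream servers in the order listed below
strict-order
# pi-hole, the preferred upstream
server=192.168.1.10
# Public resolver, consulted when pi-hole stops answering
server=1.1.1.1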
Timeline#
- A Kubernetes upgrade was started.
- A worker node drain was initiated. After draining the node, the upgrade hung. pi-hole was on the drained node and was never started on another node.
- DNS stops resolving. IMPACT BEGINS
- User on the network is unable to resolve any domains in Google Chrome.
- Debugging begins. External IPv4/IPv6 addresses can be pinged, but FQDNs do not resolve.
- The in-progress Kubernetes upgrade is remembered and checked, eventually revealing the hung worker node.
- Worker node is rebooted. MITIGATION
- pi-hole restarts. IMPACT ENDS
Note: Times are relative in the chronology. Exact times are unavailable.
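In hindsight, a few stock kubectl commands would have shortened the hunt for the hung node. This is a generic sketch for next time, not what I ran during the incident; <node-name> is a placeholder:

# List nodes; a cordoned node shows SchedulingDisabled after a drain
kubectl get nodes
# Show which pods are still sitting on the suspect node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# Inspect node conditions and events for eviction or drain failures
kubectl describe node <node-name>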
Lessons Learned#
What went well#
- I was able to get to the console of the firewall/gateway and ping both IPv4 and IPv6 hosts and IPs out on the Internet.
What went wrong#
- The Kubernetes upgrade hung, and I don’t know why. My best guess is that the pods on the worker host did not drain fully and correctly, leaving the upgrade stuck.
Where we got lucky#
- I could easily re-point my DNS at an external DNS resolver, and the only functionality lost was ad blocking.
- I happened to walk back to my desk, where I had the upgrade running in a terminal. Knowing I run pi-hole as a Kubernetes pod, I suspected something had happened with the upgrade.
Supporting Information#
Inter-pod affinity and anti-affinity to separate pods on different nodes: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity
YAML for making sure pi-hole pods (label app=pihole) are not scheduled on the same node:
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - pihole
            topologyKey: "kubernetes.io/hostname"
Read up on Pod Disruption Budgets and see whether they can help always keep at least one pod replica up.
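A minimal sketch of what that could look like, assuming the pi-hole pods carry the app=pihole label used above (the name pihole-pdb is arbitrary). Note that a PodDisruptionBudget only helps alongside multiple replicas: with a single replica and minAvailable: 1, a drain would simply block rather than keep DNS up.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pihole-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: pihole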