08d. Operational Runbooks
Audience: Site Reliability Engineers, Operations Teams, On-Call Engineers
Prerequisites: Access to sovereign infrastructure, familiarity with Kubernetes and observability stack
This section provides step-by-step operational procedures for maintaining sovereign cloud infrastructure, including incident response, disaster recovery, and routine maintenance tasks.
On-Call Principle: All runbooks follow the "five whys" approach. Document the root cause after every incident to prevent recurrence.
Incident Severity Classification
| Severity | Definition | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete service outage, data breach, or sovereignty violation | Immediate (15 min) | CTO, Security Officer, Programme Director |
| P2 - High | Partial outage, significant degradation, single jurisdiction affected | 30 minutes | On-call lead, Service Owner |
| P3 - Medium | Minor degradation, non-critical service affected | 4 hours | Team lead |
| P4 - Low | Cosmetic issues, documentation updates | Next business day | None |
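The response-time targets above can also be encoded in alert-routing tooling; a minimal sketch, assuming a hypothetical `response_minutes` helper (not part of the stack) that mirrors the classification table:

```shell
#!/usr/bin/env sh
# Map an incident severity to its acknowledgement target in minutes,
# mirroring the severity classification table above.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;     # Critical: immediate response
    P2) echo 30 ;;     # High
    P3) echo 240 ;;    # Medium: 4 hours
    P4) echo 1440 ;;   # Low: next business day (approximated as 24h)
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_minutes P1
response_minutes P3
```

A router can then compare time-since-alert against this target to decide when to escalate.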
Runbook Categories
Incident Response
Critical Path
- Service outage response
- Security incident handling
- Data breach procedures
- Sovereignty violation response
- Communication templates
Disaster Recovery
High Priority
- Cross-jurisdiction failover
- Database restoration
- Kubernetes cluster recovery
- OpenBao unsealing procedures
- Network partition recovery
Routine Maintenance
Scheduled
- Certificate rotation
- Kubernetes upgrades
- Database maintenance
- Backup verification
- Capacity planning reviews
Scaling Operations
On-Demand
- Horizontal pod autoscaling
- Cluster node scaling
- Database read replica addition
- Storage expansion
- Load balancer configuration
Runbook: Service Outage Response
Severity: P1 - Critical
Time to Acknowledge: 15 minutes
Time to Mitigate: 1 hour target
Symptoms
- Alertmanager firing critical alerts for service availability
- User reports of inability to access services
- Synthetic monitoring failures across multiple probes
- Error rate exceeds 5% threshold
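The 5% error-rate threshold can be checked outside Grafana as well; a sketch using awk over raw request counts (the counts here are illustrative literals; in production they would come from the Prometheus query shown later in this runbook):

```shell
#!/usr/bin/env sh
# Compare the 5xx error ratio against the 5% paging threshold.
errors=620
total=10000
awk -v e="$errors" -v t="$total" 'BEGIN {
  ratio = e / t
  printf "error rate: %.2f%%\n", ratio * 100
  # exit status 0 means "threshold breached"
  if (ratio > 0.05) exit 0; else exit 1
}' && echo "threshold breached: page on-call" || echo "within threshold"
```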
Procedure
1. Acknowledge the incident
# Acknowledge in PagerDuty/Opsgenie
# Join incident channel: #incident-YYYY-MM-DD-HH
# Initial status update template:
"INCIDENT DECLARED: [Service] unavailable
Impact: [Description of user impact]
Current status: Investigating
Next update: 15 minutes"
2. Assess scope and impact
# Check service health across jurisdictions
for ctx in uk-prod eu-prod ca-prod au-prod; do
  echo "=== $ctx ==="
  kubectl --context=$ctx get pods -l app=affected-service
  kubectl --context=$ctx top pods -l app=affected-service
done
# Check recent deployments
kubectl --context=uk-prod rollout history deployment/affected-service
# Review error logs
kubectl --context=uk-prod logs -l app=affected-service --tail=100 --since=10m
3. Identify root cause
Common causes:
- Recent deployment: Check rollout history, consider rollback
- Resource exhaustion: Check CPU/memory limits, scale if needed
- Database issues: Check connection pool, query performance
- External dependency: Check upstream service health
- Certificate expiry: Check TLS certificates
4. Mitigate (choose appropriate action)
# Option A: Rollback deployment
kubectl --context=uk-prod rollout undo deployment/affected-service
# Option B: Scale up
kubectl --context=uk-prod scale deployment/affected-service --replicas=10
# Option C: Restart pods (if transient issue)
kubectl --context=uk-prod rollout restart deployment/affected-service
# Option D: Failover to another jurisdiction
# Update DNS/load balancer to route away from affected region
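Choosing between rollback and the other options often comes down to whether a deployment landed just before the incident; a sketch of that check (the epoch timestamps are illustrative; in practice they would come from `kubectl rollout history` and the alert firing time):

```shell
#!/usr/bin/env sh
# Flag rollback as the first mitigation when the last deployment
# happened within 30 minutes of the incident start.
last_deploy_epoch=1705312800   # illustrative: from rollout history
incident_epoch=1705313700      # illustrative: alert firing time
window=$((30 * 60))

if [ $((incident_epoch - last_deploy_epoch)) -le "$window" ]; then
  echo "recent deployment detected: try rollback first"
else
  echo "no recent deployment: investigate resources/dependencies"
fi
```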
5. Verify recovery
# Check pod status
kubectl --context=uk-prod get pods -l app=affected-service -w
# Verify health endpoints
curl -s https://service.sovereign.gov.uk/health | jq .
# Check error rate in Grafana
# Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
6. Communicate resolution
# Resolution update template:
"INCIDENT RESOLVED: [Service] restored
Duration: [X hours Y minutes]
Root cause: [Brief description]
Impact: [Number of users/requests affected]
Follow-up: Post-incident review scheduled for [date]"
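Status templates like these are easier to keep consistent when stamped out by a small helper; a sketch (the `resolution_update` function is hypothetical, not part of the incident tooling):

```shell
#!/usr/bin/env sh
# Render the resolution template from shell variables so every
# incident update has the same shape.
resolution_update() {
  service=$1; duration=$2; cause=$3
  printf 'INCIDENT RESOLVED: %s restored\nDuration: %s\nRoot cause: %s\n' \
    "$service" "$duration" "$cause"
}

resolution_update "payments-api" "1 hour 12 minutes" "exhausted DB connection pool"
```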
Runbook: Kubernetes Cluster Recovery
Severity: P1 - Critical
Scenario: Control plane failure or etcd data loss
Control Plane Recovery
# 1. Check control plane component status
# (`kubectl get componentstatuses` is deprecated since Kubernetes 1.19; prefer the readyz endpoint)
kubectl get --raw='/readyz?verbose'
kubectl get nodes
kubectl get pods -n kube-system
# 2. If etcd is unhealthy, check etcd cluster
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
endpoint health
# 3. Check etcd member list (requires the same TLS flags as the health check above)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
# 4. If etcd data is corrupted, restore from backup
# Stop etcd on all nodes first
systemctl stop etcd
# Restore from snapshot (on each node with unique initial-cluster)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380 \
--data-dir=/var/lib/etcd-restored
# Update etcd data directory and restart
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd
# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces
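Before restoring snapshots it is worth confirming that quorum is genuinely lost: etcd only needs a restore when fewer than a strict majority of members remain healthy. A sketch of the arithmetic (member counts are illustrative; in practice they come from `etcdctl endpoint health`):

```shell
#!/usr/bin/env sh
# etcd tolerates member failures only while a strict majority remains
# healthy: quorum = floor(n/2) + 1.
members=3
healthy=1   # illustrative: from `etcdctl endpoint health`
quorum=$(( members / 2 + 1 ))

if [ "$healthy" -ge "$quorum" ]; then
  echo "quorum intact ($healthy/$members): no snapshot restore needed"
else
  echo "quorum lost ($healthy/$members, need $quorum): restore from snapshot"
fi
```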
Runbook: Database Backup and Restoration
Scheduled Backup Verification
# Daily backup verification procedure
# Run from sovereign backup server
# 1. List available backups
restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql
# 2. Verify backup integrity
restic -r s3:s3.sovereign.gov.uk/backups check --read-data-subset=10%
# 3. Test restore to staging environment
SNAPSHOT_ID=$(restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql --json | jq -r '.[0].id')
restic -r s3:s3.sovereign.gov.uk/backups restore $SNAPSHOT_ID \
--target /tmp/restore-test
# 4. Verify restored data
pg_restore --list /tmp/restore-test/backup.dump | head -20
# 5. Optional: Full restore to test database
createdb restore_test
pg_restore -d restore_test /tmp/restore-test/backup.dump
# 6. Verify row counts match production
psql -d restore_test -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"
# 7. Cleanup
dropdb restore_test
rm -rf /tmp/restore-test
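Backup verification should also catch backups that have silently stopped running; a sketch that flags a snapshot older than 24 hours (GNU `date` assumed; the timestamps are illustrative, and in practice the snapshot time would come from the restic `--json` output above via `jq -r '.[0].time'`):

```shell
#!/usr/bin/env sh
# Flag a backup as stale when its timestamp is more than 24h old.
snapshot_epoch=$(date -d '2024-01-14 03:00:00 UTC' +%s)
now_epoch=$(date -d '2024-01-15 09:00:00 UTC' +%s)   # pinned for the example
age_hours=$(( (now_epoch - snapshot_epoch) / 3600 ))

if [ "$age_hours" -gt 24 ]; then
  echo "STALE: last postgresql snapshot is ${age_hours}h old"
else
  echo "OK: last snapshot is ${age_hours}h old"
fi
```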
Emergency Database Restoration
# Point-in-time recovery procedure
# 1. Identify recovery target time
# Check when corruption/deletion occurred in audit logs
# 2. Stop application writes
kubectl scale deployment/app --replicas=0
# 3. Create restoration database
createdb production_restored
# 4. Restore base backup
pg_restore -d production_restored /backups/base/latest.dump
# 5. Apply WAL logs up to target time
# Note: WAL replay operates on the whole cluster and requires a physical base
# backup (e.g. from pg_basebackup) in the data directory; the logical dump
# restored in step 4 cannot replay WAL on its own
pg_ctl -D /var/lib/postgresql/data stop
# Configure recovery (PostgreSQL 12+: recovery.signal + postgresql.conf, not recovery.conf)
cat > /var/lib/postgresql/data/postgresql.auto.conf << EOF
restore_command = 'restic -r s3:s3.sovereign.gov.uk/wal-archive restore latest --target /tmp/wal && cp /tmp/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00 UTC'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start
# 6. Verify recovery completed
psql -d production_restored -c "SELECT pg_is_in_recovery();"
# Should return 'f' (false) after promotion
# 7. Swap databases
psql -c "ALTER DATABASE production RENAME TO production_corrupted;"
psql -c "ALTER DATABASE production_restored RENAME TO production;"
# 8. Restart application
kubectl scale deployment/app --replicas=3
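A mistyped recovery_target_time silently recovers to the wrong point, so validating it before writing the config is cheap; a sketch assuming GNU `date` (the target value matches the example above):

```shell
#!/usr/bin/env sh
# Sanity-check a recovery target before writing it into
# postgresql.auto.conf: it must parse, and it must lie in the past.
target='2024-01-15 14:30:00 UTC'

target_epoch=$(date -d "$target" +%s) || { echo "unparseable target" >&2; exit 1; }
if [ "$target_epoch" -lt "$(date +%s)" ]; then
  echo "recovery target OK: $target"
else
  echo "recovery target is in the future: refusing" >&2
  exit 1
fi
```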
Runbook: Certificate Rotation
Scheduled TLS Certificate Renewal
# Certificate rotation using cert-manager and OpenBao PKI
# 1. Check certificate expiry status
kubectl get certificates --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter'
# 2. Force certificate renewal (if needed before expiry)
cmctl renew app-tls -n production
# (alternatively, deleting the Secret triggers reissuance:
#  kubectl delete secret app-tls -n production)
# 3. Wait for new certificate
kubectl wait --for=condition=Ready certificate/app-tls -n production --timeout=120s
# 4. Verify new certificate
kubectl get secret app-tls -n production -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# 5. Rolling restart to pick up new certificate (if not using sidecar)
kubectl rollout restart deployment/app -n production
# 6. Verify application is using new certificate
openssl s_client -connect app.sovereign.gov.uk:443 -servername app.sovereign.gov.uk < /dev/null 2>/dev/null | openssl x509 -noout -dates
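The `-dates` output can feed a days-until-expiry check so rotation happens with margin; a sketch assuming GNU `date` (the notAfter value is illustrative; in practice it comes from `openssl x509 -noout -enddate | cut -d= -f2`):

```shell
#!/usr/bin/env sh
# Compute days until a certificate's notAfter and warn inside a
# 30-day renewal window.
not_after='Mar 20 12:00:00 2034 GMT'
days_left=$(( ($(date -d "$not_after" +%s) - $(date +%s)) / 86400 ))

if [ "$days_left" -lt 30 ]; then
  echo "RENEW NOW: certificate expires in ${days_left} days"
else
  echo "OK: ${days_left} days of validity remaining"
fi
```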
Runbook: OpenBao Emergency Unsealing
Critical: This procedure requires access to unseal keys stored in separate secure locations. Follow key custodian procedures.
# OpenBao manual unseal procedure (if HSM auto-unseal fails)
# Note: the OpenBao CLI binary is `bao`, command-compatible with `vault`
# 1. Check OpenBao status
bao status
# 2. If sealed, begin unseal process (requires the threshold number of key shares)
# Key 1 (Custodian A)
bao operator unseal # Enter key share 1
# Key 2 (Custodian B)
bao operator unseal # Enter key share 2
# Key 3 (Custodian C)
bao operator unseal # Enter key share 3
# 3. Verify unsealed status
bao status
# Sealed: false
# 4. Authenticate and verify health
bao login -method=oidc
bao read sys/health
# 5. If HSM connection was the issue, check HSM status
# Verify HSM network connectivity
ping hsm.sovereign.internal
# Check PKCS#11 library
pkcs11-tool --module /usr/lib/libCryptoki2_64.so --list-slots
# 6. Restart OpenBao to re-establish HSM auto-unseal
kubectl rollout restart statefulset/vault -n vault-system
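The unseal ceremony completes only once the Shamir threshold is met; a sketch of tracking that progress across custodians (the threshold of 3 matches the procedure above; the loop body stands in for the actual `bao operator unseal` calls):

```shell
#!/usr/bin/env sh
# Track unseal progress: OpenBao stays sealed until `threshold`
# distinct key shares have been entered.
threshold=3
entered=0
for custodian in A B C; do
  entered=$((entered + 1))   # stands in for one `bao operator unseal` call
  echo "share from custodian $custodian accepted ($entered/$threshold)"
done
[ "$entered" -ge "$threshold" ] && echo "unsealed"
```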
On-Call Handover Checklist
End of Shift Handover
- Document any ongoing incidents or investigations
- Note any scheduled maintenance windows
- List any alerts that were silenced and why
- Highlight any capacity concerns
- Mention any pending changes awaiting approval
- Confirm backup completion status
Related Documentation
- Security Hardening - Security incident response context
- Infrastructure Templates - Infrastructure recovery templates
- Governance Model - Escalation procedures and decision authority