Level 2 Technical Implementation Documentation

08d. Operational Runbooks

Audience: Site Reliability Engineers, Operations Teams, On-Call Engineers
Prerequisites: Access to sovereign infrastructure, familiarity with Kubernetes and observability stack

This section provides step-by-step operational procedures for maintaining sovereign cloud infrastructure, including incident response, disaster recovery, and routine maintenance tasks.

On-Call Principle: Post-incident analysis follows the "five whys" approach. Document the root cause after every incident to prevent recurrence.

Incident Severity Classification

| Severity | Definition | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete service outage, data breach, or sovereignty violation | Immediate (15 min) | CTO, Security Officer, Programme Director |
| P2 - High | Partial outage, significant degradation, single jurisdiction affected | 30 minutes | On-call lead, Service Owner |
| P3 - Medium | Minor degradation, non-critical service affected | 4 hours | Team lead |
| P4 - Low | Cosmetic issues, documentation updates | Next business day | None |
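The acknowledgement targets above can be encoded in alerting tooling so that pages carry their SLA. A minimal sketch (the `ack_sla` helper is hypothetical; the values mirror the table):

```shell
# ack_sla SEVERITY -> prints the acknowledgement SLA for that severity
# (hypothetical helper; values taken from the severity table above)
ack_sla() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "30 minutes" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

ack_sla P2   # prints: 30 minutes
```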

Runbook Categories

Incident Response

Critical Path
  • Service outage response
  • Security incident handling
  • Data breach procedures
  • Sovereignty violation response
  • Communication templates

Disaster Recovery

High Priority
  • Cross-jurisdiction failover
  • Database restoration
  • Kubernetes cluster recovery
  • OpenBao unsealing procedures
  • Network partition recovery

Routine Maintenance

Scheduled
  • Certificate rotation
  • Kubernetes upgrades
  • Database maintenance
  • Backup verification
  • Capacity planning reviews

Scaling Operations

On-Demand
  • Horizontal pod autoscaling
  • Cluster node scaling
  • Database read replica addition
  • Storage expansion
  • Load balancer configuration

Runbook: Service Outage Response

Severity: P1 - Critical
Time to Acknowledge: 15 minutes
Time to Mitigate: 1 hour target

Symptoms

  • Monitoring alerts firing for failed health checks or elevated 5xx error rates
  • Users report the service is unreachable, or requests time out
  • Pods not Ready or in CrashLoopBackOff in one or more jurisdictions

Procedure

  1. Acknowledge the incident
    # Acknowledge in PagerDuty/Opsgenie
    # Join incident channel: #incident-YYYY-MM-DD-HH
    
    # Initial status update template:
    "INCIDENT DECLARED: [Service] unavailable
    Impact: [Description of user impact]
    Current status: Investigating
    Next update: 15 minutes"
  2. Assess scope and impact
    # Check service health across jurisdictions
    for ctx in uk-prod eu-prod ca-prod au-prod; do
      echo "=== $ctx ==="
      kubectl --context=$ctx get pods -l app=affected-service
      kubectl --context=$ctx top pods -l app=affected-service
    done
    
    # Check recent deployments
    kubectl --context=uk-prod rollout history deployment/affected-service
    
    # Review error logs
    kubectl --context=uk-prod logs -l app=affected-service --tail=100 --since=10m
  3. Identify root cause

    Common causes:

    • Recent deployment: Check rollout history, consider rollback
    • Resource exhaustion: Check CPU/memory limits, scale if needed
    • Database issues: Check connection pool, query performance
    • External dependency: Check upstream service health
    • Certificate expiry: Check TLS certificates
  4. Mitigate (choose appropriate action)
    # Option A: Rollback deployment
    kubectl --context=uk-prod rollout undo deployment/affected-service
    
    # Option B: Scale up
    kubectl --context=uk-prod scale deployment/affected-service --replicas=10
    
    # Option C: Restart pods (if transient issue)
    kubectl --context=uk-prod rollout restart deployment/affected-service
    
    # Option D: Failover to another jurisdiction
    # Update DNS/load balancer to route away from affected region
  5. Verify recovery
    # Check pod status
    kubectl --context=uk-prod get pods -l app=affected-service -w
    
    # Verify health endpoints
    curl -s https://service.sovereign.gov.uk/health | jq .
    
    # Check error rate in Grafana
    # Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  6. Communicate resolution
    # Resolution update template:
    "INCIDENT RESOLVED: [Service] restored
    Duration: [X hours Y minutes]
    Root cause: [Brief description]
    Impact: [Number of users/requests affected]
    Follow-up: Post-incident review scheduled for [date]"
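The resolution template asks for the incident duration in hours and minutes; a small helper (hypothetical, operating on epoch timestamps) keeps that consistent across updates:

```shell
# fmt_duration START_EPOCH END_EPOCH -> "X hours Y minutes" for the
# resolution template (hypothetical helper)
fmt_duration() {
  local secs=$(( $2 - $1 ))
  printf '%d hours %d minutes\n' $(( secs / 3600 )) $(( secs % 3600 / 60 ))
}

fmt_duration 1705312800 1705318500   # prints: 1 hours 35 minutes
```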

Runbook: Kubernetes Cluster Recovery

Severity: P1 - Critical
Scenario: Control plane failure or etcd data loss

Control Plane Recovery

# 1. Check control plane component status
# (componentstatuses is deprecated in recent Kubernetes releases but still
#  useful where available; also probe kube-system pods directly)
kubectl get componentstatuses
kubectl get nodes
kubectl get pods -n kube-system

# 2. If etcd is unhealthy, check etcd cluster
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint health

# 3. Check etcd member list (same TLS flags as above)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list

# 4. If etcd data is corrupted, restore from backup
# Stop etcd on all nodes first
systemctl stop etcd

# Restore from snapshot (run on each node; --name and
# --initial-advertise-peer-urls are unique per node, --initial-cluster is identical on all)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored

# Update etcd data directory and restart
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd

# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces
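Before committing to a restore, it is worth confirming whether the cluster has genuinely lost quorum: etcd tolerates the loss of floor((n-1)/2) members. A tiny helper to sanity-check this (hypothetical):

```shell
# failure_tolerance N_MEMBERS -> number of etcd members that can fail
# while the cluster still retains quorum (quorum = floor(n/2) + 1)
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

failure_tolerance 3   # prints: 1
failure_tolerance 5   # prints: 2
```

If fewer members have failed than this tolerance, prefer replacing the failed member over a full snapshot restore.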

Runbook: Database Backup and Restoration

Scheduled Backup Verification

# Daily backup verification procedure
# Run from sovereign backup server

# 1. List available backups
restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql

# 2. Verify backup integrity
restic -r s3:s3.sovereign.gov.uk/backups check --read-data-subset=10%

# 3. Test restore to staging environment
# (restic lists snapshots oldest-first, so the most recent is the last array entry)
SNAPSHOT_ID=$(restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql --json | jq -r '.[-1].id')

restic -r s3:s3.sovereign.gov.uk/backups restore $SNAPSHOT_ID \
  --target /tmp/restore-test

# 4. Verify restored data
pg_restore --list /tmp/restore-test/backup.dump | head -20

# 5. Optional: Full restore to test database
createdb restore_test
pg_restore -d restore_test /tmp/restore-test/backup.dump

# 6. Verify row counts match production
psql -d restore_test -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

# 7. Cleanup
dropdb restore_test
rm -rf /tmp/restore-test
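Verification should also alert when the newest snapshot is simply too old. A minimal freshness check, assuming the snapshot's creation time is available as an epoch timestamp (helper name hypothetical):

```shell
# backup_is_fresh SNAPSHOT_EPOCH MAX_AGE_HOURS -> exit 0 if the snapshot
# is newer than the threshold (hypothetical helper)
backup_is_fresh() {
  local age=$(( $(date +%s) - $1 ))
  [ "$age" -le $(( $2 * 3600 )) ]
}

# example: a snapshot taken just now is fresh within 24 hours
backup_is_fresh "$(date +%s)" 24 && echo "backups fresh"
```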

Emergency Database Restoration

# Point-in-time recovery (PITR) procedure
# Note: PITR replays WAL on top of a *physical* base backup
# (e.g. taken with pg_basebackup); a logical pg_dump archive cannot be used.

# 1. Identify recovery target time
# Check when corruption/deletion occurred in audit logs

# 2. Stop application writes
kubectl scale deployment/app --replicas=0

# 3. Stop PostgreSQL and preserve the corrupted data directory
pg_ctl -D /var/lib/postgresql/data stop
mv /var/lib/postgresql/data /var/lib/postgresql/data-corrupted

# 4. Restore the physical base backup into a fresh data directory
restic -r s3:s3.sovereign.gov.uk/backups restore latest --tag basebackup \
  --target /var/lib/postgresql/data
chown -R postgres:postgres /var/lib/postgresql/data
chmod 700 /var/lib/postgresql/data

# 5. Restore the archived WAL once, then configure recovery
# (PostgreSQL 12+: recovery.signal + postgresql.auto.conf)
restic -r s3:s3.sovereign.gov.uk/wal-archive restore latest \
  --target /var/lib/postgresql/wal-restore

cat > /var/lib/postgresql/data/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal-restore/%f %p'
recovery_target_time = '2024-01-15 14:30:00 UTC'
recovery_target_action = 'promote'
EOF

touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start

# 6. Verify recovery completed
psql -c "SELECT pg_is_in_recovery();"
# Should return 'f' (false) after promotion

# 7. Restart application
kubectl scale deployment/app --replicas=3

Runbook: Certificate Rotation

Scheduled TLS Certificate Renewal

# Certificate rotation using cert-manager and OpenBao PKI

# 1. Check certificate expiry status
kubectl get certificates --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter'

# 2. Force certificate renewal (if needed before expiry)
cmctl renew app-tls -n production
# (alternatively, delete the TLS secret to trigger reissuance:
#  kubectl delete secret app-tls -n production)

# 3. Wait for new certificate
kubectl wait --for=condition=Ready certificate/app-tls -n production --timeout=120s

# 4. Verify new certificate
kubectl get secret app-tls -n production -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# 5. Rolling restart to pick up new certificate (if not using sidecar)
kubectl rollout restart deployment/app -n production

# 6. Verify application is using new certificate
openssl s_client -connect app.sovereign.gov.uk:443 -servername app.sovereign.gov.uk < /dev/null 2>/dev/null | openssl x509 -noout -dates
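For scheduled checks, a days-until-expiry helper operating on a PEM file is handy (helper name and threshold hypothetical; assumes GNU `date` for `-d` parsing):

```shell
# cert_days_left CERT_PEM -> whole days until the certificate expires
# (hypothetical helper; requires GNU date for -d parsing)
cert_days_left() {
  local end
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# example: warn when fewer than 14 days remain
# [ "$(cert_days_left /etc/ssl/app.crt)" -lt 14 ] && echo "renew soon"
```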

Runbook: OpenBao Emergency Unsealing

Critical: This procedure requires access to unseal keys stored in separate secure locations. Follow key custodian procedures.

# OpenBao manual unseal procedure (if HSM auto-unseal fails)
# Note: OpenBao's CLI binary is `bao`; it mirrors the Vault CLI commands.

# 1. Check OpenBao status
bao status

# 2. If sealed, begin unseal process (requires threshold of key shares)
# Key 1 (Custodian A)
bao operator unseal # Enter key share 1

# Key 2 (Custodian B)
bao operator unseal # Enter key share 2

# Key 3 (Custodian C)
bao operator unseal # Enter key share 3

# 3. Verify unsealed status
bao status
# Sealed: false

# 4. Authenticate and verify health
bao login -method=oidc

bao read sys/health

# 5. If HSM connection was the issue, check HSM status
# Verify HSM network connectivity
ping hsm.sovereign.internal

# Check PKCS#11 library
pkcs11-tool --module /usr/lib/libCryptoki2_64.so --list-slots

# 6. Restart OpenBao to re-establish HSM auto-unseal
kubectl rollout restart statefulset/vault -n vault-system
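During the unseal ceremony it is easy to lose track of how many key shares are still required. A small parser over the CLI status output, assuming it contains an "Unseal Progress n/m" line (format assumed; helper name hypothetical):

```shell
# unseal_remaining STATUS_TEXT -> key shares still needed, parsed from an
# "Unseal Progress n/m" line of the CLI status output (assumed format)
unseal_remaining() {
  echo "$1" | awk '/Unseal Progress/ { split($3, p, "/"); print p[2] - p[1] }'
}

unseal_remaining "Unseal Progress    2/3"   # prints: 1
```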

On-Call Handover Checklist

End of Shift Handover

  • Document any ongoing incidents or investigations
  • Note any scheduled maintenance windows
  • List any alerts that were silenced and why
  • Highlight any capacity concerns
  • Mention any pending changes awaiting approval
  • Confirm backup completion status

Related Documentation