Level 2 Technical Implementation Documentation

08d. Operational Runbooks

Audience: Site Reliability Engineers, Operations Teams, On-Call Engineers
Prerequisites: Access to sovereign infrastructure, familiarity with Kubernetes and observability stack

This section provides step-by-step operational procedures for maintaining sovereign cloud infrastructure, including incident response, disaster recovery, and routine maintenance tasks.

On-Call Principle: Post-incident analysis follows the "five whys" approach. Document the root cause after every incident to prevent recurrence.

Incident Severity Classification

| Severity | Definition | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Complete service outage, data breach, or sovereignty violation | Immediate (15 min) | CTO, Security Officer, Programme Director |
| P2 - High | Partial outage, significant degradation, single jurisdiction affected | 30 minutes | On-call lead, Service Owner |
| P3 - Medium | Minor degradation, non-critical service affected | 4 hours | Team lead |
| P4 - Low | Cosmetic issues, documentation updates | Next business day | None |
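The acknowledgement targets above can be encoded in alerting tooling so that pages carry their SLA. A minimal sketch (the `ack_sla` helper is hypothetical; the values mirror the table):

```shell
# ack_sla SEVERITY -> prints the acknowledgement SLA for that severity
# (hypothetical helper; values taken from the severity table above)
ack_sla() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "30 minutes" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

ack_sla P2   # prints: 30 minutes
```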

Runbook Categories

Incident Response

Critical Path
  • Service outage response
  • Security incident handling
  • Data breach procedures
  • Sovereignty violation response
  • Communication templates

Disaster Recovery

High Priority
  • Cross-jurisdiction failover
  • Database restoration
  • Kubernetes cluster recovery
  • OpenBao unsealing procedures
  • Network partition recovery

Routine Maintenance

Scheduled
  • Certificate rotation
  • Kubernetes upgrades
  • Database maintenance
  • Backup verification
  • Capacity planning reviews

Scaling Operations

On-Demand
  • Horizontal pod autoscaling
  • Cluster node scaling
  • Database read replica addition
  • Storage expansion
  • Load balancer configuration

Runbook: Service Outage Response

Severity: P1 - Critical
Time to Acknowledge: 15 minutes
Time to Mitigate: 1 hour target

Symptoms

  • Monitoring alerts firing for failed health checks or elevated 5xx error rates
  • Users report the service is unreachable, or requests time out
  • Pods not Ready or in CrashLoopBackOff in one or more jurisdictions

Procedure

  1. Acknowledge the incident
    # Acknowledge in PagerDuty/Opsgenie
    # Join incident channel: #incident-YYYY-MM-DD-HH
    
    # Initial status update template:
    "INCIDENT DECLARED: [Service] unavailable
    Impact: [Description of user impact]
    Current status: Investigating
    Next update: 15 minutes"
  2. Assess scope and impact
    # Check service health across jurisdictions
    for ctx in uk-prod eu-prod ca-prod au-prod; do
      echo "=== $ctx ==="
      kubectl --context=$ctx get pods -l app=affected-service
      kubectl --context=$ctx top pods -l app=affected-service
    done
    
    # Check recent deployments
    kubectl --context=uk-prod rollout history deployment/affected-service
    
    # Review error logs
    kubectl --context=uk-prod logs -l app=affected-service --tail=100 --since=10m
  3. Identify root cause

    Common causes:

    • Recent deployment: Check rollout history, consider rollback
    • Resource exhaustion: Check CPU/memory limits, scale if needed
    • Database issues: Check connection pool, query performance
    • External dependency: Check upstream service health
    • Certificate expiry: Check TLS certificates
  4. Mitigate (choose appropriate action)
    # Option A: Rollback deployment
    kubectl --context=uk-prod rollout undo deployment/affected-service
    
    # Option B: Scale up
    kubectl --context=uk-prod scale deployment/affected-service --replicas=10
    
    # Option C: Restart pods (if transient issue)
    kubectl --context=uk-prod rollout restart deployment/affected-service
    
    # Option D: Failover to another jurisdiction
    # Update DNS/load balancer to route away from affected region
  5. Verify recovery
    # Check pod status
    kubectl --context=uk-prod get pods -l app=affected-service -w
    
    # Verify health endpoints
    curl -s https://service.sovereign.gov.uk/health | jq .
    
    # Check error rate in Grafana
    # Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  6. Communicate resolution
    # Resolution update template:
    "INCIDENT RESOLVED: [Service] restored
    Duration: [X hours Y minutes]
    Root cause: [Brief description]
    Impact: [Number of users/requests affected]
    Follow-up: Post-incident review scheduled for [date]"
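The resolution template asks for the incident duration in hours and minutes; a small helper (hypothetical, operating on epoch timestamps) keeps that consistent across updates:

```shell
# fmt_duration START_EPOCH END_EPOCH -> "X hours Y minutes" for the
# resolution template (hypothetical helper)
fmt_duration() {
  local secs=$(( $2 - $1 ))
  printf '%d hours %d minutes\n' $(( secs / 3600 )) $(( secs % 3600 / 60 ))
}

fmt_duration 1705312800 1705318500   # prints: 1 hours 35 minutes
```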

Runbook: Kubernetes Cluster Recovery

Severity: P1 - Critical
Scenario: Control plane failure or etcd data loss

Control Plane Recovery

# 1. Check control plane component status
# (componentstatuses is deprecated in recent Kubernetes releases but still
#  useful where available; also probe kube-system pods directly)
kubectl get componentstatuses
kubectl get nodes
kubectl get pods -n kube-system

# 2. If etcd is unhealthy, check etcd cluster
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  endpoint health

# 3. Check etcd member list (same TLS flags as above)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list

# 4. If etcd data is corrupted, restore from backup
# Stop etcd on all nodes first
systemctl stop etcd

# Restore from snapshot (run on each node; --name and
# --initial-advertise-peer-urls are unique per node, --initial-cluster is identical on all)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored

# Update etcd data directory and restart
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd
systemctl start etcd

# 5. Verify cluster recovery
kubectl get nodes
kubectl get pods --all-namespaces
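Before committing to a restore, it is worth confirming whether the cluster has genuinely lost quorum: etcd tolerates the loss of floor((n-1)/2) members. A tiny helper to sanity-check this (hypothetical):

```shell
# failure_tolerance N_MEMBERS -> number of etcd members that can fail
# while the cluster still retains quorum (quorum = floor(n/2) + 1)
failure_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

failure_tolerance 3   # prints: 1
failure_tolerance 5   # prints: 2
```

If fewer members have failed than this tolerance, prefer replacing the failed member over a full snapshot restore.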

Runbook: Database Backup and Restoration

Scheduled Backup Verification

# Daily backup verification procedure
# Run from sovereign backup server

# 1. List available backups
restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql

# 2. Verify backup integrity
restic -r s3:s3.sovereign.gov.uk/backups check --read-data-subset=10%

# 3. Test restore to staging environment
# (restic lists snapshots oldest-first, so the most recent is the last array entry)
SNAPSHOT_ID=$(restic -r s3:s3.sovereign.gov.uk/backups snapshots --tag postgresql --json | jq -r '.[-1].id')

restic -r s3:s3.sovereign.gov.uk/backups restore $SNAPSHOT_ID \
  --target /tmp/restore-test

# 4. Verify restored data
pg_restore --list /tmp/restore-test/backup.dump | head -20

# 5. Optional: Full restore to test database
createdb restore_test
pg_restore -d restore_test /tmp/restore-test/backup.dump

# 6. Verify row counts match production
psql -d restore_test -c "SELECT schemaname, relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

# 7. Cleanup
dropdb restore_test
rm -rf /tmp/restore-test
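Verification should also alert when the newest snapshot is simply too old. A minimal freshness check, assuming the snapshot's creation time is available as an epoch timestamp (helper name hypothetical):

```shell
# backup_is_fresh SNAPSHOT_EPOCH MAX_AGE_HOURS -> exit 0 if the snapshot
# is newer than the threshold (hypothetical helper)
backup_is_fresh() {
  local age=$(( $(date +%s) - $1 ))
  [ "$age" -le $(( $2 * 3600 )) ]
}

# example: a snapshot taken just now is fresh within 24 hours
backup_is_fresh "$(date +%s)" 24 && echo "backups fresh"
```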

Emergency Database Restoration

# Point-in-time recovery (PITR) procedure
# Note: PITR replays WAL on top of a *physical* base backup
# (e.g. taken with pg_basebackup); a logical pg_dump archive cannot be used.

# 1. Identify recovery target time
# Check when corruption/deletion occurred in audit logs

# 2. Stop application writes
kubectl scale deployment/app --replicas=0

# 3. Stop PostgreSQL and preserve the corrupted data directory
pg_ctl -D /var/lib/postgresql/data stop
mv /var/lib/postgresql/data /var/lib/postgresql/data-corrupted

# 4. Restore the physical base backup into a fresh data directory
restic -r s3:s3.sovereign.gov.uk/backups restore latest --tag basebackup \
  --target /var/lib/postgresql/data
chown -R postgres:postgres /var/lib/postgresql/data
chmod 700 /var/lib/postgresql/data

# 5. Restore the archived WAL once, then configure recovery
# (PostgreSQL 12+: recovery.signal + postgresql.auto.conf)
restic -r s3:s3.sovereign.gov.uk/wal-archive restore latest \
  --target /var/lib/postgresql/wal-restore

cat > /var/lib/postgresql/data/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal-restore/%f %p'
recovery_target_time = '2024-01-15 14:30:00 UTC'
recovery_target_action = 'promote'
EOF

touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start

# 6. Verify recovery completed
psql -c "SELECT pg_is_in_recovery();"
# Should return 'f' (false) after promotion

# 7. Restart application
kubectl scale deployment/app --replicas=3

Runbook: Certificate Rotation

Scheduled TLS Certificate Renewal

# Certificate rotation using cert-manager and OpenBao PKI

# 1. Check certificate expiry status
kubectl get certificates --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[0].status,EXPIRY:.status.notAfter'

# 2. Force certificate renewal (if needed before expiry)
cmctl renew app-tls -n production
# (alternatively, delete the TLS secret to trigger reissuance:
#  kubectl delete secret app-tls -n production)

# 3. Wait for new certificate
kubectl wait --for=condition=Ready certificate/app-tls -n production --timeout=120s

# 4. Verify new certificate
kubectl get secret app-tls -n production -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates

# 5. Rolling restart to pick up new certificate (if not using sidecar)
kubectl rollout restart deployment/app -n production

# 6. Verify application is using new certificate
openssl s_client -connect app.sovereign.gov.uk:443 -servername app.sovereign.gov.uk < /dev/null 2>/dev/null | openssl x509 -noout -dates
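For scheduled checks, a days-until-expiry helper operating on a PEM file is handy (helper name and threshold hypothetical; assumes GNU `date` for `-d` parsing):

```shell
# cert_days_left CERT_PEM -> whole days until the certificate expires
# (hypothetical helper; requires GNU date for -d parsing)
cert_days_left() {
  local end
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
}

# example: warn when fewer than 14 days remain
# [ "$(cert_days_left /etc/ssl/app.crt)" -lt 14 ] && echo "renew soon"
```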

Runbook: OpenBao Emergency Unsealing

Critical: This procedure requires access to unseal keys stored in separate secure locations. Follow key custodian procedures.

# OpenBao manual unseal procedure (if HSM auto-unseal fails)
# Note: OpenBao's CLI binary is `bao`; it mirrors the Vault CLI commands.

# 1. Check OpenBao status
bao status

# 2. If sealed, begin unseal process (requires threshold of key shares)
# Key 1 (Custodian A)
bao operator unseal # Enter key share 1

# Key 2 (Custodian B)
bao operator unseal # Enter key share 2

# Key 3 (Custodian C)
bao operator unseal # Enter key share 3

# 3. Verify unsealed status
bao status
# Sealed: false

# 4. Authenticate and verify health
bao login -method=oidc

bao read sys/health

# 5. If HSM connection was the issue, check HSM status
# Verify HSM network connectivity
ping hsm.sovereign.internal

# Check PKCS#11 library
pkcs11-tool --module /usr/lib/libCryptoki2_64.so --list-slots

# 6. Restart OpenBao to re-establish HSM auto-unseal
kubectl rollout restart statefulset/vault -n vault-system
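During the unseal ceremony it is easy to lose track of how many key shares are still required. A small parser over the CLI status output, assuming it contains an "Unseal Progress n/m" line (format assumed; helper name hypothetical):

```shell
# unseal_remaining STATUS_TEXT -> key shares still needed, parsed from an
# "Unseal Progress n/m" line of the CLI status output (assumed format)
unseal_remaining() {
  echo "$1" | awk '/Unseal Progress/ { split($3, p, "/"); print p[2] - p[1] }'
}

unseal_remaining "Unseal Progress    2/3"   # prints: 1
```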

On-Call Handover Checklist

End of Shift Handover

  • Document any ongoing incidents or investigations
  • Note any scheduled maintenance windows
  • List any alerts that were silenced and why
  • Highlight any capacity concerns
  • Mention any pending changes awaiting approval
  • Confirm backup completion status

Related Documentation