Transition Management: Coexistence Strategies
Purpose: Managing the transition period where systems operate across both US hyperscale and sovereign infrastructure
Large-scale cloud migrations do not happen overnight. Government departments will operate in a hybrid state for 2-5 years, with systems running simultaneously on US cloud providers and sovereign infrastructure. This page addresses the practical realities of this coexistence period.
1. Projects In-Flight Decision Framework
At any given moment, government departments have dozens of cloud projects at various stages: procurement, development, testing, or recently launched. Each requires a decision about how to proceed.
Project Status Categories
| Category | Definition | Typical Examples |
|---|---|---|
| Pre-Procurement | Requirements defined, no contracts signed | New citizen services, modernisation initiatives |
| In Procurement | ITT issued or contract negotiations underway | Major transformation programmes |
| In Development | Actively being built, not yet live | Digital services in beta |
| Recently Launched | Live <12 months, still stabilising | New platforms, API services |
| Established | Live >12 months, stable operation | Core departmental systems |
| Legacy/End-of-Life | Scheduled for retirement within 24 months | Systems being replaced |
Decision Matrix: What To Do With Each Project
Pre-Procurement Projects
REDIRECT to sovereign infrastructure
- Update requirements to mandate sovereign-compatible architecture
- Add sovereign cloud deployment as primary target environment
- Require open standards (Kubernetes, S3-compatible, PostgreSQL)
- No additional cost if done before procurement
In Procurement Projects
PAUSE & ASSESS
- If ITT not yet issued: Add sovereign requirements, may delay 2-4 weeks
- If in evaluation: Score sovereign-readiness as weighted criterion
- If in contract negotiation: Add migration clause and exit provisions
- Cost: £50k-200k in delays, but avoids £2-10M migration later
In Development Projects
ASSESS ARCHITECTURE
- Cloud-agnostic architecture: Continue, plan sovereign deployment post-launch
- Light proprietary services: Continue with migration plan for specific services
- Deep proprietary lock-in: Pause and re-architect if <40% complete
- Near completion: Launch on US cloud, immediate migration planning
Recently Launched Projects (<12 months)
STABILISE THEN MIGRATE
- Do not migrate during stabilisation period (creates additional risk)
- Begin migration planning and architecture assessment immediately
- Target migration window: 6-18 months post-launch
- Add telemetry to understand actual usage patterns for migration planning
Established Systems
PRIORITISE BY RISK
- Classify by data sensitivity and criticality
- High sensitivity + high criticality = Priority 1 migration
- Schedule in migration waves per overall programme
- May operate in hybrid state for extended period
Legacy/End-of-Life Systems
DO NOT MIGRATE
- Continue on current platform until retirement
- Ensure replacement system targets sovereign infrastructure
- Exception: If retirement date slips past 24 months, reassess
- Maintain enhanced monitoring for security incidents
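The matrix above reduces to a small lookup, which can be handy when triaging a project portfolio. A minimal sketch (the function and its signature are illustrative, not part of any official tooling; category keys and actions mirror the tables in this section):

```python
# Illustrative triage helper mirroring the decision matrix above.

DECISIONS = {
    "pre-procurement": "REDIRECT to sovereign infrastructure",
    "in-procurement": "PAUSE & ASSESS",
    "in-development": "ASSESS ARCHITECTURE",
    "recently-launched": "STABILISE THEN MIGRATE",
    "established": "PRIORITISE BY RISK",
    "legacy": "DO NOT MIGRATE",
}

def triage(category: str, deep_lock_in: bool = False,
           percent_complete: int = 0) -> str:
    """Recommended action for a project, per the matrix above.

    The only data-dependent rule is for in-development projects with
    deep proprietary lock-in: pause and re-architect if <40% complete,
    otherwise launch on US cloud and plan immediate migration.
    """
    if category == "in-development" and deep_lock_in:
        if percent_complete < 40:
            return "PAUSE and re-architect"
        return "Launch on US cloud, plan immediate migration"
    return DECISIONS[category]
```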
2. Extended Parallel Operation Patterns
For complex systems, the transition period may extend to 12-36 months, during which the system operates in both environments simultaneously. This is not a failure mode: deliberate parallel operation reduces risk and allows operational confidence to build gradually.
Parallel Operation Models
Model A: Read Replica
Low Risk
US cloud remains primary. Sovereign infrastructure receives read-only replica of data. Used for reporting, analytics, and building operational confidence.
- One-way data flow (US → Sovereign)
- No sovereignty benefit until cutover
- Lowest risk, easiest rollback
- Good for: Initial validation phase
Model B: Traffic Split
Medium Risk
Both environments serve live traffic. Percentage gradually shifts from US to sovereign. Both write to their own data stores with reconciliation.
- Requires robust load balancing
- Data reconciliation complexity
- Partial sovereignty benefit during transition
- Good for: Stateless services, APIs
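For the traffic-split model, deterministic hash-based routing is usually preferable to random routing: it pins each user to one environment, which simplifies the reconciliation between the two data stores. A sketch assuming a percentage-based rollout keyed on user ID (names are illustrative):

```python
import hashlib

def route(user_id: str, sovereign_percent: int) -> str:
    """Deterministically assign a request to an environment.

    Hashing the user ID into one of 100 buckets keeps each user
    pinned to the same environment as the split percentage grows,
    reducing cross-environment write conflicts.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "sovereign" if bucket < sovereign_percent else "us"
```

Shifting the split is then a one-line config change (e.g. 10% to 25% to 50%), and any individual user only ever moves in one direction.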
Model C: Active-Active
High Complexity
Both environments are fully operational with bidirectional data synchronisation. Either can serve any request. True multi-cloud operation.
- Complex conflict resolution required
- Highest operational overhead
- Maximum resilience during transition
- Good for: Critical 24/7 services
Model D: Strangler Fig
Recommended
New features built on sovereign. Existing features migrated incrementally. Old system gradually "strangled" as functionality moves.
- No big-bang cutover
- Each component migrates independently
- Can take 2-3 years for complex systems
- Good for: Monolithic applications
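At its core, the strangler-fig pattern is a routing table in front of both systems: migrated paths go to the sovereign environment, everything else falls through to the legacy deployment. A minimal sketch (the path prefixes are hypothetical examples):

```python
# Paths already migrated to sovereign infrastructure; everything else
# still falls through to the legacy US-cloud system. As each component
# migrates, its prefix is added here - no big-bang cutover required.
MIGRATED_PREFIXES = ["/api/v2/payments", "/api/v2/profiles"]

def upstream(path: str) -> str:
    """Return which environment should serve this request path."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return "sovereign"
    return "us-legacy"
```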
Extended Parallel Operation Timeline
```
MONTH:      1    3    6    9    12   18   24   30   36
US CLOUD:   ████████████████████████░░░░░░░░░░░░░░░░   (100% → 0%)
SOVEREIGN:  ░░░░░░░░░░░░████████████████████████████   (0% → 100%)

PHASE:      | PREP | PILOT   | RAMP-UP | PRIMARY   | COMPLETE
DATA:       |      | Read    | Bi-dir  | Sovereign | US cloud
            |      | replica | sync    | primary   | decommissioned
```
3. Data Synchronisation Strategies
Maintaining data consistency across two cloud environments is the most technically challenging aspect of extended parallel operation. The strategy depends on data characteristics and consistency requirements.
Synchronisation Patterns
| Pattern | Latency | Consistency | Complexity | Use Case |
|---|---|---|---|---|
| Change Data Capture (CDC) | Seconds | Eventual | Medium | Database replication |
| Event Sourcing | Seconds | Eventual | High | Event-driven systems |
| Dual-Write | Milliseconds | Strong (if sync) | Very High | Critical transactions |
| Batch Sync | Hours | Point-in-time | Low | Analytics, reporting |
| Message Queue | Seconds | At-least-once | Medium | Async workflows |
Change Data Capture (CDC) Implementation
CDC is the recommended pattern for most database synchronisation scenarios. It captures changes at the database level and streams them to the target environment.
Example: PostgreSQL CDC with Debezium to sovereign infrastructure.

```sql
-- 1. Source database (AWS RDS): enable logical replication
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;
```

```yaml
# 2. Debezium source connector configuration (runs in sovereign Kubernetes)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: cdc-source-connector
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  config:
    database.hostname: source-db.xxx.eu-west-2.rds.amazonaws.com
    database.port: 5432
    database.user: cdc_user
    database.password: ${CDC_PASSWORD}
    database.dbname: production
    database.server.name: aws-source
    plugin.name: pgoutput
    slot.name: debezium_slot
    publication.name: dbz_publication
    # Route through secure tunnel - NOT public internet
    database.sslmode: verify-full
```

```yaml
# 3. Sink connector (writes to sovereign PostgreSQL)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: cdc-sink-connector
spec:
  class: io.confluent.connect.jdbc.JdbcSinkConnector
  config:
    connection.url: jdbc:postgresql://sovereign-db:5432/production
    connection.user: app_user
    topics.regex: aws-source.*
    insert.mode: upsert
    pk.mode: record_key
    auto.create: false
    auto.evolve: false
```
Conflict Resolution for Bidirectional Sync
When both environments can write, conflicts will occur. Define resolution rules upfront:
| Conflict Type | Resolution Strategy | Example |
|---|---|---|
| Simultaneous update | Last-write-wins with vector clock | User profile updates |
| Delete vs update | Delete wins (or soft-delete only) | Record removal |
| Constraint violation | Reject and alert, manual resolution | Unique key conflict |
| Schema mismatch | Queue for review, do not auto-apply | New column in one env |
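The last-write-wins-with-vector-clock strategy in the first row can be sketched as follows. Each record carries a vector clock of per-replica counters; a write whose clock strictly dominates wins outright, and genuinely concurrent writes fall back to a wall-clock tie-break. This is a simplified illustration, not a production resolver:

```python
# Records are tuples of (value, vector_clock, wall_clock_ts), where
# vector_clock is a dict like {"us": 2, "sov": 1}.

def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything b has (a >= b componentwise)."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def resolve(rec_a, rec_b):
    """Return the winning record under last-write-wins with vector clocks."""
    _, clock_a, ts_a = rec_a
    _, clock_b, ts_b = rec_b
    if dominates(clock_a, clock_b) and not dominates(clock_b, clock_a):
        return rec_a  # a causally newer: it saw b's write
    if dominates(clock_b, clock_a) and not dominates(clock_a, clock_b):
        return rec_b  # b causally newer
    # Truly concurrent writes: deterministic wall-clock tie-break
    return rec_a if ts_a >= ts_b else rec_b
```

Note the wall-clock tie-break only fires for genuinely concurrent writes; causally ordered updates are resolved by the clocks alone, regardless of timestamp skew between environments.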
4. Contract & Commercial Management
Government departments have existing contractual commitments with AWS, Azure, and GCP, often multi-year enterprise agreements with committed spend. The transition must account for these commercial realities.
Contract Situation Assessment
| Contract Type | Typical Terms | Exit Considerations |
|---|---|---|
| Enterprise Discount Programme (EDP) | 3-5 years, committed annual spend | Early termination penalties; negotiate wind-down |
| Reserved Instances | 1-3 years, specific capacity | Non-refundable; use until expiry or sell on marketplace |
| Savings Plans | 1-3 years, flexible capacity | Use for remaining workloads; cannot transfer |
| G-Cloud Call-offs | Up to 24 months per call-off | Standard termination clauses; 30-90 day notice |
| Direct Award | Variable | Review specific terms; may have break clauses |
Commercial Transition Strategies
Strategy 1: Run Down Commitments
Continue paying committed spend while migrating workloads. Use remaining capacity for non-sensitive workloads, dev/test, or disaster recovery until commitment expires.
- Pros: No penalty payments; maintains vendor relationship
- Cons: Continued dependency; dual running costs
- Timeline: Aligned to contract expiry (1-5 years)
Strategy 2: Negotiate Early Exit
Approach vendor to negotiate termination. May involve paying portion of remaining commitment (typically 50-80%) in exchange for immediate release.
- Pros: Clean break; faster transition
- Cons: Significant one-time cost; difficult negotiation
- When: Emergency scenario; strategic imperative
Strategy 3: Renegotiate Terms
Use upcoming renewal as leverage to negotiate flexibility. Add migration clauses, reduce committed spend, or convert to pay-as-you-go for new workloads.
- Pros: No immediate cost; improved terms
- Cons: Requires negotiating leverage; vendor may resist
- When: 6-12 months before renewal
Data Egress Cost Planning
Cloud providers charge for data leaving their networks. At government scale, egress costs can be substantial:
| Data Volume | AWS Egress Cost | Azure Egress Cost | GCP Egress Cost |
|---|---|---|---|
| 100 TB | ~$8,500 | ~$8,500 | ~$8,000 |
| 1 PB | ~$50,000 | ~$50,000 | ~$45,000 |
| 10 PB | ~$250,000 | ~$250,000 | ~$200,000 |
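For budgeting, the table points can be interpolated to give order-of-magnitude estimates at other volumes. A sketch using the AWS column (real pricing is tiered and usually negotiated at these volumes, so treat the output as indicative only):

```python
# Known (volume_tb, usd) points from the AWS column of the table above.
AWS_POINTS = [(100, 8_500), (1_000, 50_000), (10_000, 250_000)]

def estimate_egress_usd(volume_tb: float, points=AWS_POINTS) -> float:
    """Linearly interpolate egress cost between known table points.

    Below the first point and above the last, scale proportionally
    from the nearest known point.
    """
    if volume_tb <= points[0][0]:
        return points[0][1] * volume_tb / points[0][0]
    for (v0, c0), (v1, c1) in zip(points, points[1:]):
        if volume_tb <= v1:
            return c0 + (c1 - c0) * (volume_tb - v0) / (v1 - v0)
    v_last, c_last = points[-1]
    return c_last * volume_tb / v_last
```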
5. Industry Lessons Learned
Several major organisations have undertaken large-scale cloud migrations or repatriations. Their experiences provide valuable lessons.
Dropbox: AWS to Private Infrastructure (2016-2018)
Scale: ~500 PB of data, serving 500M users
Duration: 2.5 years
Approach:
- Built custom infrastructure called "Magic Pocket" while still on AWS
- Ran in parallel for 18+ months before cutover
- Migrated metadata first, then gradually shifted block storage
- Kept some services on AWS (non-core functionality)
Key Lessons:
- Extended parallel operation is essential: Dropbox ran dual for nearly 2 years
- Build the destination fully before starting migration
- Migrate in order of increasing criticality (test with less critical first)
- Savings of ~$75M over 2 years justified the investment
37signals (Basecamp/Hey): AWS to Private Cloud (2022-2023)
Scale: ~$3.2M annual AWS spend, tens of servers
Duration: ~18 months planning to completion
Approach:
- Purchased physical servers, colocated in datacentres
- Used their own container deployment tooling (MRSK, later renamed Kamal) rather than Kubernetes
- Migrated application-by-application over several months
- Maintained AWS for specific services (S3 for some assets)
Key Lessons:
- Smaller scale made "big bang" per-application feasible
- 5-year payback on hardware investment
- Operational complexity increased, requiring more in-house expertise
- Some hybrid state may be permanent (pragmatic approach)
Capital One: Data Centre to AWS (2012-2020)
Scale: Large US bank, 1000+ applications
Duration: 8 years (full exit from data centres)
Approach:
- Started with non-critical workloads in 2012
- Gradually moved more sensitive workloads as confidence grew
- Closed last data centre in 2020
- Heavy investment in cloud-native transformation (not lift-and-shift)
Key Lessons (Reverse-applicable):
- 8-year timeline for complete migration: government should plan for similar durations
- Regulatory complexity (banking) extended timelines significantly
- Cultural change was as important as technical migration
- Some applications were retired rather than migrated
Danish Government: Microsoft to Open Source (2017-Ongoing)
Scale: National government IT infrastructure
Duration: Ongoing, multi-year programme
Approach:
- Phased replacement of Microsoft Office with LibreOffice
- Migration of email systems to open platforms
- Development of shared open-source components
- Parallel operation during extended transition
Key Lessons:
- User training and change management as important as technology
- Document format compatibility requires long parallel period
- Departmental autonomy created inconsistent adoption
- Central mandate with local flexibility worked best
6. Hybrid Steady-State: Systems That May Never Fully Migrate
Some systems may remain on US cloud infrastructure indefinitely due to technical, commercial, or practical constraints. This is acceptable if properly managed.
Candidates for Permanent Hybrid State
| Category | Examples | Rationale | Mitigation |
|---|---|---|---|
| Deep Vendor Lock-in | Systems using AWS Lambda extensively, Azure Cosmos DB, GCP BigQuery | Refactoring cost exceeds benefit; 2-5 year rewrite required | Scheduled replacement with sovereign-native; enhanced monitoring |
| Third-Party SaaS | Salesforce, ServiceNow, Workday (hosted on US cloud) | Vendor choice, not government's; no sovereign equivalent | Data minimisation; API abstraction layer; evaluate alternatives at renewal |
| External Integration | Systems that must integrate with US-based partners | Partner systems are on US cloud; latency requirements | Gateway/proxy architecture; data classification review |
| Niche Services | Specialised AI/ML services, specific compliance tools | No sovereign equivalent exists or is immature | Isolate sensitive data; use for processing only, not storage |
| End-of-Life Systems | Legacy applications scheduled for retirement | Migration investment not justified for remaining lifespan | Enhanced security monitoring; accelerate replacement if possible |
Managing Permanent Hybrid State
A system may remain on US cloud infrastructure indefinitely only if all of the following conditions hold:
- System does not process Tier 1 (TOP SECRET) or Tier 2 (SECRET) data
- System is not critical national infrastructure
- Data can be reconstituted from sovereign sources if access is lost
- Business impact of 72-hour outage is manageable
- System is documented in risk register with ministerial acceptance
Hybrid Architecture Pattern
```
SOVEREIGN INFRASTRUCTURE
  ├── Core Systems (Kubernetes)
  ├── Citizen Data (PostgreSQL)
  ├── Sensitive Processing (isolated)
  └── API Gateway (Kong/APISIX)  <- all external traffic routes here
          │
          │  Secure tunnel (WireGuard/IPsec)
          │  Encrypted, logged, monitored
          ▼
US CLOUD (RESIDUAL)
  ├── Proxy/Cache (no direct citizen access)
  ├── Legacy App A (locked-in)
  ├── SaaS Integration (Salesforce)
  └── ML Processing (non-sensitive data only)
```
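The gateway rule implied by this pattern, that residual US-cloud services are reachable only through the sovereign proxy and never directly by citizen-facing requests, can be sketched as follows (route names are hypothetical):

```python
# Allow-listed backend routes that may be proxied over the secure
# tunnel to residual US-cloud services. Everything else stays sovereign.
RESIDUAL_ROUTES = {"/integrations/salesforce", "/ml/batch-score"}

def forward(path: str, is_citizen_request: bool) -> str:
    """Decide where the sovereign gateway forwards a request."""
    if path in RESIDUAL_ROUTES:
        if is_citizen_request:
            # Citizens never talk to the residual US cloud directly
            raise PermissionError("no direct citizen access to US residual")
        return "tunnel:us-residual"
    return "sovereign"
```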
7. Operational Considerations During Transition
Monitoring & Observability
During coexistence, unified monitoring across both environments is essential:
- Single pane of glass: Grafana dashboards showing both environments
- Unified alerting: PagerDuty/Opsgenie routing regardless of source
- Distributed tracing: Jaeger/Tempo tracing requests across environments
- Log aggregation: All logs to sovereign infrastructure (even from US cloud)
- Synthetic monitoring: External probes testing both environments
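A synthetic probe for the last item can be as small as hitting the same health endpoint in both environments and comparing results. A standard-library-only sketch (endpoint URLs are placeholders; a real probe would run from an external vantage point and feed the unified alerting pipeline):

```python
import urllib.request

# Placeholder endpoints for the two environments.
ENDPOINTS = {
    "us": "https://app.us.example.gov/healthz",
    "sovereign": "https://app.sov.example.gov/healthz",
}

def probe(urls: dict = ENDPOINTS, timeout: float = 5.0) -> dict:
    """Return {environment: 'up' | 'down'} for each endpoint."""
    results = {}
    for env, url in urls.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[env] = "up" if resp.status == 200 else "down"
        except OSError:  # DNS failure, timeout, connection refused, TLS error
            results[env] = "down"
    return results
```

Divergence between the two results ("us" up, "sovereign" down, or vice versa) is itself a useful alert condition during coexistence, distinct from a total outage.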
Incident Response
| Scenario | Response |
|---|---|
| US cloud component fails | Route traffic to sovereign if available; standard incident process |
| Sovereign component fails | Route traffic to US cloud; investigate root cause; no different from any failover |
| Data sync failure | Alert immediately; assess data divergence; may need to pause writes |
| US cloud access revoked | Execute emergency cutover plan; accept data loss from last sync point |
| Security incident in US cloud | Isolate immediately; do not replicate potentially compromised data |
Team Structure During Transition
Related Documentation
- Migration Patterns - Technical patterns for individual workload migration
- Emergency Migration Strategy - Accelerated timelines for crisis scenarios
- Risk Register - Risks associated with extended parallel operation
- Cost Model - Financial implications of dual-running
- Migration Strategy (Level 1) - Strategic overview