Level 2 Technical Implementation Documentation

Transition Management: Coexistence Strategies

Audience: Programme Directors, Enterprise Architects, Migration Teams, Procurement
Purpose: Managing the transition period where systems operate across both US hyperscale and sovereign infrastructure

Large-scale cloud migrations do not happen overnight. Government departments will operate in a hybrid state for 2-5 years, with systems running simultaneously on US cloud providers and sovereign infrastructure. This page addresses the practical realities of this coexistence period.

Reality Check: The average enterprise cloud migration takes 3-5 years. Government systems—with their additional compliance requirements, legacy integrations, and procurement constraints—may take longer. Planning for extended coexistence is not defeatism; it's realism.

1. Projects In-Flight Decision Framework

At any given moment, government departments have dozens of cloud projects at various stages: procurement, development, testing, or recently launched. Each requires a decision about how to proceed.

Project Status Categories

Category           | Definition                                   | Typical Examples
-------------------|----------------------------------------------|-------------------------------------------------
Pre-Procurement    | Requirements defined, no contracts signed    | New citizen services, modernisation initiatives
In Procurement     | ITT issued or contract negotiations underway | Major transformation programmes
In Development     | Actively being built, not yet live           | Digital services in beta
Recently Launched  | Live <12 months, still stabilising           | New platforms, API services
Established        | Live >12 months, stable operation            | Core departmental systems
Legacy/End-of-Life | Scheduled for retirement within 24 months    | Systems being replaced

Decision Matrix: What To Do With Each Project

Pre-Procurement Projects

REDIRECT to sovereign infrastructure

  • Update requirements to mandate sovereign-compatible architecture
  • Add sovereign cloud deployment as primary target environment
  • Require open standards (Kubernetes, S3-compatible, PostgreSQL)
  • No additional cost if done before procurement

In Procurement Projects

PAUSE & ASSESS

  • If ITT not yet issued: Add sovereign requirements, may delay 2-4 weeks
  • If in evaluation: Score sovereign-readiness as weighted criterion
  • If in contract negotiation: Add migration clause and exit provisions
  • Cost: £50k-200k in delays, but avoids £2-10M migration later

In Development Projects

ASSESS ARCHITECTURE

  • Cloud-agnostic architecture: Continue, plan sovereign deployment post-launch
  • Light proprietary services: Continue with migration plan for specific services
  • Deep proprietary lock-in: Pause and re-architect if <40% complete
  • Near completion: Launch on US cloud, immediate migration planning

Recently Launched Projects (<12 months)

STABILISE THEN MIGRATE

  • Do not migrate during stabilisation period (creates additional risk)
  • Begin migration planning and architecture assessment immediately
  • Target migration window: 6-18 months post-launch
  • Add telemetry to understand actual usage patterns for migration planning

Established Systems

PRIORITISE BY RISK

  • Classify by data sensitivity and criticality
  • High sensitivity + high criticality = Priority 1 migration
  • Schedule in migration waves per overall programme
  • May operate in hybrid state for extended period

Legacy/End-of-Life Systems

DO NOT MIGRATE

  • Continue on current platform until retirement
  • Ensure replacement system targets sovereign infrastructure
  • Exception: If retirement date slips past 24 months, reassess
  • Maintain enhanced monitoring for security incidents
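The decision matrix above can be sketched as a small lookup function. This is an illustrative sketch only, not an official policy engine: the category names and headline actions mirror this page, while the function signature and threshold parameters are assumptions made for the example.

```python
# Sketch of the in-flight decision matrix. Thresholds (40% complete,
# 24-month retirement horizon) come from the sections above.

def recommend(category: str, *, lock_in: str = "none",
              percent_complete: int = 0, months_to_retirement: int = 0) -> str:
    """Return the headline action for a project, per the decision matrix."""
    if category == "Pre-Procurement":
        return "REDIRECT to sovereign infrastructure"
    if category == "In Procurement":
        return "PAUSE & ASSESS"
    if category == "In Development":
        # Deep lock-in is only worth re-architecting while the build is young.
        if lock_in == "deep" and percent_complete < 40:
            return "PAUSE and re-architect"
        return "ASSESS ARCHITECTURE (continue with migration plan)"
    if category == "Recently Launched":
        return "STABILISE THEN MIGRATE (6-18 months post-launch)"
    if category == "Established":
        return "PRIORITISE BY RISK (migration waves)"
    if category == "Legacy/End-of-Life":
        # Reassess if the retirement date slips past the 24-month horizon.
        if months_to_retirement > 24:
            return "REASSESS (retirement slipped)"
        return "DO NOT MIGRATE (retire in place)"
    raise ValueError(f"unknown category: {category}")

print(recommend("In Development", lock_in="deep", percent_complete=30))
# prints: PAUSE and re-architect
```

Encoding the matrix this way also makes the portfolio review repeatable: the same inputs always yield the same recommendation, which matters when dozens of projects are assessed by different teams.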

2. Extended Parallel Operation Patterns

For complex systems, the transition period may extend 12-36 months. During this time, the system operates in both environments simultaneously. This is not a bug—it's a feature that reduces risk and enables gradual confidence building.

Parallel Operation Models

Model A: Read Replica

Low Risk

US cloud remains primary. Sovereign infrastructure receives read-only replica of data. Used for reporting, analytics, and building operational confidence.

  • One-way data flow (US → Sovereign)
  • No sovereignty benefit until cutover
  • Lowest risk, easiest rollback
  • Good for: Initial validation phase

Model B: Traffic Split

Medium Risk

Both environments serve live traffic. Percentage gradually shifts from US to sovereign. Both write to their own data stores with reconciliation.

  • Requires robust load balancing
  • Data reconciliation complexity
  • Partial sovereignty benefit during transition
  • Good for: Stateless services, APIs
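A minimal sketch of Model B's percentage split, assuming traffic can be keyed on a stable request attribute such as a user ID. Hashing that attribute means each user sticks to one environment as the sovereign share ramps up, keeping sessions and caches coherent; the 0-99 bucket scheme is an illustrative convention, not a standard.

```python
# Deterministic percentage routing for a traffic-split migration.
import hashlib

def route(user_id: str, sovereign_percent: int) -> str:
    """Assign a user a stable bucket in 0..99; below the threshold = sovereign."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "sovereign" if bucket < sovereign_percent else "us-cloud"

# Ramping 10% -> 50% only *adds* users to sovereign; nobody flips back,
# because each user's bucket is fixed and only the threshold moves.
assert route("alice", 0) == "us-cloud"
assert route("alice", 100) == "sovereign"
```

In practice the same idea would live in the load balancer (weighted upstreams or a consistent-hash policy), but the property to preserve is the one shown: raising the percentage must never move an already-migrated user back.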

Model C: Active-Active

High Complexity

Both environments are fully operational with bidirectional data synchronisation. Either can serve any request. True multi-cloud operation.

  • Complex conflict resolution required
  • Highest operational overhead
  • Maximum resilience during transition
  • Good for: Critical 24/7 services

Model D: Strangler Fig

Recommended

New features built on sovereign. Existing features migrated incrementally. Old system gradually "strangled" as functionality moves.

  • No big-bang cutover
  • Each component migrates independently
  • Can take 2-3 years for complex systems
  • Good for: Monolithic applications
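The strangler fig pattern reduces, at the routing layer, to a prefix table that grows over time: migrated paths go to the sovereign build, everything else falls through to the legacy monolith. The paths below are illustrative assumptions.

```python
# Sketch of a strangler-fig routing table. As each component migrates,
# its path prefix is added here; the legacy system serves the remainder.

MIGRATED_PREFIXES = ["/api/payments", "/api/profile"]   # grows over 2-3 years

def upstream(path: str) -> str:
    """Route a request path to the sovereign build or the legacy system."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return "sovereign"
    return "legacy-us-cloud"    # default: everything not yet strangled

assert upstream("/api/payments/123") == "sovereign"
assert upstream("/api/search") == "legacy-us-cloud"
```

The attraction is that rollback per component is one list entry: removing a prefix sends that traffic back to the legacy system without touching anything else.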

Extended Parallel Operation Timeline

MONTH    1    3    6    9    12   18   24   30   36
         │    │    │    │    │    │    │    │    │
US CLOUD ████████████████████████████████░░░░░░░░░░  (100% → 0%)
         │    │    │    │    │    │    │    │    │
SOVEREIGN ░░░░░░░░░░░░░░████████████████████████████  (0% → 100%)
         │    │    │    │    │    │    │    │    │
         │    │    │    │    │    │    │    │    │
PHASE:   │PREP│PILOT    │RAMP-UP    │PRIMARY      │COMPLETE
         │    │         │           │             │
DATA:    │    │ Read    │ Bi-dir    │ Sovereign   │ US
         │    │ Replica │ Sync      │ Primary     │ Decomm
        
Key Principle: The sovereign environment should be capable of running 100% of traffic before any cutover begins. The parallel period is for building confidence and validating operations—not for completing the technical migration.

3. Data Synchronisation Strategies

Maintaining data consistency across two cloud environments is the most technically challenging aspect of extended parallel operation. The strategy depends on data characteristics and consistency requirements.

Synchronisation Patterns

Pattern                   | Latency      | Consistency      | Complexity | Use Case
--------------------------|--------------|------------------|------------|----------------------
Change Data Capture (CDC) | Seconds      | Eventual         | Medium     | Database replication
Event Sourcing            | Seconds      | Eventual         | High       | Event-driven systems
Dual-Write                | Milliseconds | Strong (if sync) | Very High  | Critical transactions
Batch Sync                | Hours        | Point-in-time    | Low        | Analytics, reporting
Message Queue             | Seconds      | At-least-once    | Medium     | Async workflows

Change Data Capture (CDC) Implementation

CDC is the recommended pattern for most database synchronisation scenarios. It captures changes at the database level and streams them to the target environment.

# Example: PostgreSQL CDC with Debezium to sovereign infrastructure

-- 1. Source database: enable logical replication.
-- Note: AWS RDS does not permit ALTER SYSTEM; set the equivalent via a DB
-- parameter group (rds.logical_replication = 1 implies wal_level = logical).
-- On self-managed PostgreSQL:
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 4;
ALTER SYSTEM SET max_wal_senders = 4;

# 2. Debezium source connector (runs in sovereign Kubernetes, Strimzi CRD)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: cdc-source-connector
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  config:
    database.hostname: source-db.xxx.eu-west-2.rds.amazonaws.com
    database.port: 5432
    database.user: cdc_user
    database.password: ${CDC_PASSWORD}
    database.dbname: production
    database.server.name: aws-source
    plugin.name: pgoutput
    slot.name: debezium_slot
    publication.name: dbz_publication
    # Route through a secure tunnel - NOT the public internet
    database.sslmode: verify-full

# 3. Sink connector (writes to sovereign PostgreSQL)
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: cdc-sink-connector
spec:
  class: io.confluent.connect.jdbc.JdbcSinkConnector
  config:
    connection.url: jdbc:postgresql://sovereign-db:5432/production
    connection.user: app_user
    connection.password: ${SINK_PASSWORD}
    topics.regex: aws-source\..*
    insert.mode: upsert
    pk.mode: record_key
    auto.create: false
    auto.evolve: false
    # Debezium topic names carry schema prefixes (e.g. aws-source.public.orders);
    # a RegexRouter transform is typically needed to map topics to table names.

Conflict Resolution for Bidirectional Sync

When both environments can write, conflicts will occur. Define resolution rules upfront:

Conflict Type        | Resolution Strategy                  | Example
---------------------|--------------------------------------|-----------------------
Simultaneous update  | Last-write-wins with vector clock    | User profile updates
Delete vs update     | Delete wins (or soft-delete only)    | Record removal
Constraint violation | Reject and alert, manual resolution  | Unique key conflict
Schema mismatch      | Queue for review, do not auto-apply  | New column in one env

Critical Warning: Bidirectional synchronisation with strong consistency across geographic regions and cloud providers is extremely complex. Consider whether you truly need it, or whether a simpler model (sovereign-primary with US read-replica) would suffice during transition.
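As a simplified sketch of the last-write-wins strategy in the table, the fragment below uses a (timestamp, site_id) pair as a total order. The table recommends vector clocks for production use; a bare timestamp needs synchronised clocks and a deterministic tie-break like this one, or the two sites can resolve the same conflict differently. All field names here are illustrative.

```python
# Last-write-wins with a deterministic tie-break (simplified sketch).
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    timestamp: float   # wall-clock seconds; assumes NTP-synchronised clocks
    site_id: str       # "sovereign" or "us-cloud" breaks exact timestamp ties

def resolve(a: tuple[Version, dict], b: tuple[Version, dict]) -> dict:
    """Pick the winning record; (timestamp, site_id) gives a total order,
    so both sites resolve the same conflict identically."""
    (va, ra), (vb, rb) = a, b
    return ra if (va.timestamp, va.site_id) >= (vb.timestamp, vb.site_id) else rb

us = (Version(100.0, "us-cloud"), {"name": "old"})
sov = (Version(100.5, "sovereign"), {"name": "new"})
assert resolve(us, sov) == {"name": "new"}   # later write wins
```

The key property is symmetry: resolve(a, b) and resolve(b, a) return the same record, which is what keeps the two environments from diverging after independent conflict resolution.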

4. Contract & Commercial Management

Government departments have existing contractual commitments with AWS, Azure, and GCP—often multi-year enterprise agreements with committed spend. The transition must account for these commercial realities.

Contract Situation Assessment

Contract Type                       | Typical Terms                     | Exit Considerations
------------------------------------|-----------------------------------|-----------------------------------------------------
Enterprise Discount Programme (EDP) | 3-5 years, committed annual spend | Early termination penalties; negotiate wind-down
Reserved Instances                  | 1-3 years, specific capacity      | Non-refundable; use until expiry or sell on marketplace
Savings Plans                       | 1-3 years, flexible capacity      | Use for remaining workloads; cannot transfer
G-Cloud Call-offs                   | Up to 24 months per call-off      | Standard termination clauses; 30-90 day notice
Direct Award                        | Variable                          | Review specific terms; may have break clauses

Commercial Transition Strategies

Strategy 1: Run Down Commitments

Continue paying committed spend while migrating workloads. Use remaining capacity for non-sensitive workloads, dev/test, or disaster recovery until commitment expires.

  • Pros: No penalty payments; maintains vendor relationship
  • Cons: Continued dependency; dual running costs
  • Timeline: Aligned to contract expiry (1-5 years)

Strategy 2: Negotiate Early Exit

Approach vendor to negotiate termination. May involve paying portion of remaining commitment (typically 50-80%) in exchange for immediate release.

  • Pros: Clean break; faster transition
  • Cons: Significant one-time cost; difficult negotiation
  • When: Emergency scenario; strategic imperative

Strategy 3: Renegotiate Terms

Use upcoming renewal as leverage to negotiate flexibility. Add migration clauses, reduce committed spend, or convert to pay-as-you-go for new workloads.

  • Pros: No immediate cost; improved terms
  • Cons: Requires negotiating leverage; vendor may resist
  • When: 6-12 months before renewal
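A back-of-envelope comparison of Strategy 1 (run down) against Strategy 2 (negotiate early exit) can be sketched as below. The 50-80% buyout range comes from the text above; all monetary figures are placeholders, and the model deliberately ignores costs common to both paths (sovereign running costs are incurred either way).

```python
# Rough run-down vs early-exit comparison; figures are illustrative.

def run_down_cost(annual_commit: float, years_left: float,
                  dual_run_annual: float) -> float:
    """Keep paying the commitment plus dual-running overhead until expiry."""
    return (annual_commit + dual_run_annual) * years_left

def early_exit_cost(annual_commit: float, years_left: float,
                    buyout_fraction: float = 0.65) -> float:
    """One-off payment of a negotiated share of the remaining commitment."""
    return annual_commit * years_left * buyout_fraction

remaining = run_down_cost(2_000_000, 3, dual_run_annual=500_000)  # £7.5M
exit_now = early_exit_cost(2_000_000, 3)                           # ~£3.9M
print(f"run down: £{remaining:,.0f}  early exit: £{exit_now:,.0f}")
```

The comparison usually turns on the dual-running overhead: with little overlap between old and new estates, running down the commitment tends to win; with heavy duplication, an early exit can pay for itself quickly.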

Data Egress Cost Planning

Cloud providers charge for data leaving their networks. At government scale, egress costs can be substantial:

Data Volume | AWS Egress Cost | Azure Egress Cost | GCP Egress Cost
------------|-----------------|-------------------|----------------
100 TB      | ~$8,500         | ~$8,500           | ~$8,000
1 PB        | ~$50,000        | ~$50,000          | ~$45,000
10 PB       | ~$250,000       | ~$250,000         | ~$200,000

Cost Mitigation: For very large data migrations, consider AWS Snowball/Azure Data Box for physical transfer (avoids egress charges), or negotiate egress fee waivers as part of contract exit discussions.
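The figures in the table follow from tiered per-GB pricing, which can be sketched with a marginal-rate calculation. The tier boundaries and $/GB rates below are illustrative assumptions chosen to reproduce the table's order of magnitude; real pricing varies by provider, region, and negotiated discounts.

```python
# Rough egress estimator with illustrative marginal tiers.

TIERS = [  # (up_to_TB, usd_per_GB) - applied marginally, like tax bands
    (10, 0.09),
    (150, 0.085),
    (500, 0.06),
    (1000, 0.04),
    (float("inf"), 0.02),
]

def egress_cost_usd(total_tb: float) -> float:
    """Sum cost across tiers, charging each band at its own rate."""
    cost, prev_cap = 0.0, 0.0
    for cap, rate in TIERS:
        band_tb = min(total_tb, cap) - prev_cap
        if band_tb <= 0:
            break
        cost += band_tb * 1000 * rate   # 1 TB ~ 1000 GB for estimation
        prev_cap = cap
    return cost

print(f"100 TB ≈ ${egress_cost_usd(100):,.0f}")
# prints: 100 TB ≈ $8,550
```

The marginal structure explains why the effective per-GB rate falls with volume: the 10 PB figure works out near $0.025/GB against $0.085/GB at 100 TB, which is also why egress waivers are worth negotiating at the top end.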

5. Industry Lessons Learned

Several major organisations have undertaken large-scale cloud migrations or repatriations. Their experiences provide valuable lessons.

Dropbox: AWS to Private Infrastructure (2016-2018)

Scale: ~500 PB of data, serving 500M users

Duration: 2.5 years

Approach:

  • Built custom infrastructure called "Magic Pocket" while still on AWS
  • Ran in parallel for 18+ months before cutover
  • Migrated metadata first, then gradually shifted block storage
  • Kept some services on AWS (non-core functionality)

Key Lessons:

  • Extended parallel operation is essential—Dropbox ran dual for nearly 2 years
  • Build the destination fully before starting migration
  • Migrate in order of increasing criticality (test with less critical first)
  • Savings of ~$75M over 2 years justified the investment

37signals (Basecamp/Hey): AWS to Private Cloud (2022-2023)

Scale: ~$3.2M annual AWS spend, tens of servers

Duration: ~18 months planning to completion

Approach:

  • Purchased physical servers, colocated in datacentres
  • Used their own deployment tooling (mrsk, later renamed Kamal) rather than Kubernetes
  • Migrated application-by-application over several months
  • Maintained AWS for specific services (S3 for some assets)

Key Lessons:

  • Smaller scale made "big bang" per-application feasible
  • 5-year payback on hardware investment
  • Operational complexity increased—needed more in-house expertise
  • Some hybrid state may be permanent (pragmatic approach)

Capital One: Data Centre to AWS (2012-2020)

Scale: Large US bank, 1000+ applications

Duration: 8 years (full exit from data centres)

Approach:

  • Started with non-critical workloads in 2012
  • Gradually moved more sensitive workloads as confidence grew
  • Closed last data centre in 2020
  • Heavy investment in cloud-native transformation (not lift-and-shift)

Key Lessons (Reverse-applicable):

  • 8-year timeline for complete migration—government should plan similarly
  • Regulatory complexity (banking) extended timelines significantly
  • Cultural change was as important as technical migration
  • Some applications were retired rather than migrated

Danish Government: Microsoft to Open Source (2017-Ongoing)

Scale: National government IT infrastructure

Duration: Ongoing, multi-year programme

Approach:

  • Phased replacement of Microsoft Office with LibreOffice
  • Migration of email systems to open platforms
  • Development of shared open-source components
  • Parallel operation during extended transition

Key Lessons:

  • User training and change management as important as technology
  • Document format compatibility requires long parallel period
  • Departmental autonomy created inconsistent adoption
  • Central mandate with local flexibility worked best

6. Hybrid Steady-State: Systems That May Never Fully Migrate

Some systems may remain on US cloud infrastructure indefinitely due to technical, commercial, or practical constraints. This is acceptable if properly managed.

Candidates for Permanent Hybrid State

Category             | Examples                                                          | Rationale                                                  | Mitigation
---------------------|-------------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------
Deep Vendor Lock-in  | Systems using AWS Lambda extensively, Azure Cosmos DB, GCP BigQuery | Refactoring cost exceeds benefit; 2-5 year rewrite required | Scheduled replacement with sovereign-native; enhanced monitoring
Third-Party SaaS     | Salesforce, ServiceNow, Workday (hosted on US cloud)               | Vendor choice, not government's; no sovereign equivalent    | Data minimisation; API abstraction layer; evaluate alternatives at renewal
External Integration | Systems that must integrate with US-based partners                 | Partner systems are on US cloud; latency requirements       | Gateway/proxy architecture; data classification review
Niche Services       | Specialised AI/ML services, specific compliance tools              | No sovereign equivalent exists or is immature               | Isolate sensitive data; use for processing only, not storage
End-of-Life Systems  | Legacy applications scheduled for retirement                       | Migration investment not justified for remaining lifespan   | Enhanced security monitoring; accelerate replacement if possible

Managing Permanent Hybrid State

Acceptable Hybrid Criteria:
  • System does not process Tier 1 (TOP SECRET) or Tier 2 (SECRET) data
  • System is not critical national infrastructure
  • Data can be reconstituted from sovereign sources if access is lost
  • Business impact of 72-hour outage is manageable
  • System is documented in risk register with ministerial acceptance
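The checklist above can be expressed as a single gate, useful when reviewing many candidate systems consistently. This is an illustrative sketch: the field names are assumptions, and the tier numbering follows this page's scheme (Tier 1 = TOP SECRET, Tier 2 = SECRET).

```python
# Sketch: evaluate the "acceptable hybrid" checklist for one system.
from dataclasses import dataclass

@dataclass
class HybridCandidate:
    data_tier: int                  # 1 = TOP SECRET, 2 = SECRET, 3+ = lower
    critical_national_infra: bool
    data_reconstitutable: bool      # recoverable from sovereign sources
    outage_72h_manageable: bool
    in_risk_register: bool          # with ministerial acceptance

def acceptable_hybrid(c: HybridCandidate) -> bool:
    """All five criteria must hold; any single failure vetoes hybrid state."""
    return (c.data_tier > 2
            and not c.critical_national_infra
            and c.data_reconstitutable
            and c.outage_72h_manageable
            and c.in_risk_register)

assert acceptable_hybrid(HybridCandidate(3, False, True, True, True))
assert not acceptable_hybrid(HybridCandidate(2, False, True, True, True))
```

The point of the conjunction is that the criteria are not trade-offs: a system processing SECRET data cannot compensate with a good risk register entry.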

Hybrid Architecture Pattern

SOVEREIGN INFRASTRUCTURE

+------------------+  +------------------+  +----------------------+
| Core Systems     |  | Citizen Data     |  | Sensitive Processing |
| (Kubernetes)     |  | (PostgreSQL)     |  | (Isolated)           |
+------------------+  +------------------+  +----------------------+
          |
+-----------------------------------------------------+
| API Gateway (Kong/APISIX)                            |
| All external traffic routes here                     |
+-----------------------------------------------------+
          |
  Secure Tunnel (WireGuard/IPsec)
  Encrypted, logged, monitored
          |
US CLOUD (Residual)

+-----------------------------------------------------+
| Proxy/Cache - no direct citizen access               |
+-----------------------------------------------------+
+------------------+  +------------------+  +------------------+
| Legacy App A     |  | SaaS Integration |  | ML Processing    |
| (Locked-in)      |  | (Salesforce)     |  | (Non-sensitive)  |
+------------------+  +------------------+  +------------------+

Constraints: No citizen PII • No classified data • Logged access

7. Operational Considerations During Transition

Monitoring & Observability

During coexistence, unified monitoring across both environments is essential: metrics, logs, and traces from both clouds should feed a single observability stack, hosted on the sovereign side, so that incidents spanning the two environments can be correlated in one place.
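One common way to achieve this with the tooling already implied by this page's Kubernetes examples is Prometheus federation: a sovereign-side Prometheus scrapes aggregate series from a Prometheus running in the US cloud environment. The hostnames below are illustrative placeholders, and in line with the rest of this page the federation scrape should travel over the secure tunnel, not the public internet.

```yaml
# Sovereign Prometheus: federate metrics from the US-cloud Prometheus.
scrape_configs:
  - job_name: 'federate-us-cloud'
    scrape_interval: 30s
    honor_labels: true            # keep the source environment's labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'           # illustrative: pull all jobs; narrow in practice
    static_configs:
      - targets:
          - 'prometheus-us.internal.example:9090'   # reached via the tunnel
```

Federating into the sovereign side (rather than the reverse) means observability survives the emergency-cutover scenario in the incident table below: losing US cloud access loses fresh US-side metrics, but not the monitoring stack itself.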

Incident Response

Scenario Response
US cloud component fails Route traffic to sovereign if available; standard incident process
Sovereign component fails Route traffic to US cloud; investigate root cause; no different from any failover
Data sync failure Alert immediately; assess data divergence; may need to pause writes
US cloud access revoked Execute emergency cutover plan; accept data loss from last sync point
Security incident in US cloud Isolate immediately; do not replicate potentially compromised data

Team Structure During Transition

Recommendation: Establish a dedicated "Migration Ops" team responsible for the coexistence infrastructure. This team owns the sync mechanisms, monitoring, and cutover procedures—separate from teams running either environment day-to-day.

Related Documentation