Technology Validation
Evidence of technology readiness for CloudStack, open-source FaaS, and supporting components. Includes existing deployments, performance benchmarks, and go/no-go criteria.
Technical Expert Challenge
"You're betting €25B on technology that has never been proven at government scale. CloudStack has 200+ deployments but NO government deployment at AWS/Azure scale. OpenFaaS/Knative cold starts are 5-10x worse than Lambda. What's your rollback plan?"
1. Apache CloudStack - Technology Maturity Assessment
Platform Overview
| Attribute | Value |
|---|---|
| Project maturity | Apache Top-Level Project since 2013 (12+ years) |
| Known deployments | 150+ organisations in production |
| Scale capability | Manages tens of thousands of physical servers |
| Government deployments | Limited public documentation; some known, but none at AWS scale |
| Security certifications | No FedRAMP-equivalent; ISO 27001 achievable |
| Hypervisor support | KVM, VMware, XenServer, Hyper-V |
Enterprise Case Studies
Case study 1
- Scale: Enterprise-wide IaaS platform
- Use case: Multi-tenant infrastructure for application teams; VMware and KVM workloads
- Key features used: OpenTofu provider, GitOps management, over-provisioning
- Source: CloudStack Collaboration Conference 2023

Case study 2
- Scale: Public cloud provider
- Use case: Web hosting, dedicated servers, cloud computing
- Key features used: CloudStack + XCP-ng (open-source hypervisor)
- Source: ShapeBlue Case Studies

Case study 3
- Scale: Regional telecoms provider
- Use case: Cloud and CDN services from Baltic datacenters
- Key features used: Modern datacenter deployment
- Source: Apache CloudStack Wiki
CloudStack Gaps and Mitigations
| Gap | Mitigation |
|---|---|
| No FedRAMP-equivalent certification | Commission UK NCSC assessment; work towards ISO 27001 plus government-specific controls |
| Limited documented government-scale deployments | Pilot programme specifically validates government scale; AT&T proves enterprise scale |
| Smaller vendor ecosystem than VMware/AWS | ShapeBlue provides enterprise support; consider a consortium support model |
| Some advanced features (e.g. AI acceleration) less mature | AI/ML workloads are not an initial priority; can be developed as needed |
2. Serverless (FaaS) - Performance Validation
Cold Start Performance Comparison
Platform Comparison
| Platform | Cold Start | Warm Performance | Scalability |
|---|---|---|---|
| AWS Lambda | 50-100ms | Excellent | Excellent (managed) |
| OpenFaaS | 1-2s when scale-to-zero is enabled (opt-in) | Good | Good (gradual) |
| Knative | 4-6s | Good | Excellent (burst) |
| Fission | 100-200ms (pooled containers) | Good | Good |
Cold Start Mitigation Strategies
- OpenFaaS default: scale-to-zero is disabled by default, so functions stay warm and latency-sensitive paths avoid cold starts entirely
- Container pooling: pre-warmed container pools (the Fission approach) cut cold starts to milliseconds
- Predictive scaling: ML models forecast traffic and pre-warm containers (reported tail-latency reductions of up to 85%)
- Hybrid approach: use OpenFaaS for latency-sensitive functions, Knative for batch/async workloads
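A minimal sketch of the predictive pre-warming idea above. A simple moving average stands in for the ML forecast, and all numbers (per-container throughput, headroom) are illustrative assumptions, not measured traffic; a real deployment would drive the platform's own autoscaler.

```python
# Size a pre-warmed container pool from a moving-average traffic forecast,
# so most requests land on a warm container. Illustrative sketch only.
import math

def warm_pool_size(recent_rps, per_container_rps=10.0, headroom=1.5, window=5):
    """Containers to keep warm for the next interval.

    recent_rps        -- recent requests-per-second samples (newest last)
    per_container_rps -- throughput one warm container sustains (assumed)
    headroom          -- over-provisioning factor to absorb bursts
    """
    samples = recent_rps[-window:]
    forecast = sum(samples) / len(samples)  # simple moving-average forecast
    return max(1, math.ceil(forecast * headroom / per_container_rps))

# Traffic ramping 20 -> 60 rps: forecast 40 rps, keep 6 containers warm
print(warm_pool_size([20, 30, 40, 50, 60]))  # -> 6
```

A production system would replace the moving average with a seasonal forecast (government traffic is strongly diurnal) and clamp pool size to a budget cap.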
Honest Assessment
In the worst case, open-source FaaS cold starts are 10-50x slower than on AWS Lambda. However:
- Cold starts can be avoided by keeping functions warm (trade-off: higher base cost)
- Many government workloads are async/batch where cold start is acceptable
- Fission with pooling approaches Lambda-like performance
- Not all workloads require sub-100ms response times
Recommendation: Accept higher latency for initial release; invest in optimisation R&D as part of capability development (Workstream A).
Cold Start User Impact Quantification
Which services would be degraded?
| Service Type | Current Latency Requirement | Cold Start Impact | User Experience Effect |
|---|---|---|---|
| API Gateways / Authentication | <100ms expected | HIGH - 2s cold start noticeable | First login of day may feel slow. Mitigation: keep auth functions warm (£50K/year). |
| Form Submission Processing | <500ms typical | MEDIUM - 2s acceptable | Users expect brief wait on form submit. 2s within tolerance. |
| Document Generation | 1-5s typical | LOW - cold start hidden | Users already expect delay. No perceptible impact. |
| Batch/Async Processing | Minutes to hours | NONE | No user-facing impact. Ideal for open-source FaaS. |
| Real-time Chat/Notifications | <200ms required | HIGH - unacceptable | Must use persistent containers, not FaaS. ~15% of serverless workloads. |
| Static Content / CDN | <50ms expected | N/A - not FaaS | Object storage (MinIO) handles this. No FaaS involvement. |
Summary: Affected Services
- ~15% of serverless workloads require <200ms and cannot tolerate cold starts → use persistent containers
- ~25% of serverless workloads would benefit from warm functions → budget £100-200K/year for always-on
- ~60% of serverless workloads are async/batch/document-gen → cold starts acceptable
Citizen experience: For the average user on GOV.UK completing a transaction:
- Page loads: No impact (static content via CDN)
- Form submission: Possible 1-2s additional wait on first submission of session (acceptable)
- PDF generation: No impact (already async)
- Login: Possible 1-2s on first login; subsequent logins cached (mitigation: warm auth functions)
Net assessment: Minimal user-facing impact for most services. Budget £200K/year for warm functions on latency-sensitive paths. Real-time services (~15%) must use containers, not FaaS.
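As a sanity check on the warm-function budget, a back-of-envelope model: keeping N functions always on costs replicas x vCPU x price x hours. The prices, replica counts, and vCPU sizes below are illustrative assumptions, not procurement figures.

```python
# Back-of-envelope annual cost of keeping latency-sensitive functions warm.
# All inputs are illustrative assumptions.

def annual_warm_cost_gbp(functions, replicas_each=2, vcpu_per_replica=0.5,
                         gbp_per_vcpu_hour=0.03):
    hours_per_year = 24 * 365
    vcpus_always_on = functions * replicas_each * vcpu_per_replica
    return vcpus_always_on * gbp_per_vcpu_hour * hours_per_year

# 150 warm functions at 2 replicas x 0.5 vCPU each:
print(round(annual_warm_cost_gbp(150)))  # -> 39420 (~£39K/year)
```

Under these assumptions, 150 warm functions land near the £50K auth-warming figure quoted above, and a few hundred warm functions sit inside the £100-200K budget band.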
3. Supporting Technology Stack - Maturity Assessment
| Component | Open Source Option | Production Readiness | Government Use |
|---|---|---|---|
| Object Storage (S3-compatible) | MinIO | Yes (S3-compatible, enterprise support) | Yes - multiple government deployments |
| Relational Database | PostgreSQL | Yes (30+ years, enterprise standard) | Yes - GDS standard; NHS use |
| Message Queue | Apache Kafka / RabbitMQ | Yes - widely deployed | — |
| Identity/IAM | Keycloak | Yes (Red Hat supported) | Yes - UK government use cases |
| Secrets Management | OpenBao | Yes - common in secure deployments | — |
| Container Orchestration | Kubernetes | Yes (CNCF graduated) | Yes - standard across governments |
| Service Mesh | Istio / Linkerd | Yes - enterprise adoption | — |
| Monitoring | Prometheus + Grafana | Yes (CNCF graduated) | Yes - GDS standard monitoring |
| Caching | Redis / Valkey | Yes - standard component | — |
| Data Warehouse | Apache Spark / Trino | Some - requires specialist skills | — |
| ML/AI Platform | Kubeflow / MLflow | Limited - emerging use cases | — |
4. Technology Go/No-Go Gates
The following gates must be passed before committing to each technology component:
Gate 1: Security Assessment (Month 6)
| Criterion | Required Evidence | Assessor |
|---|---|---|
| CloudStack security architecture review | Independent pen test + architecture review | NCSC / accredited assessor |
| Open source component security | SBOM for all components; CVE response process | Security team |
| HSM integration validation | Successful HSM integration for key management | Crypto specialists |
| Network isolation verification | Multi-tenant isolation demonstrated | Network security |
Gate 2: Performance Validation (Month 12)
| Criterion | Required Evidence | Threshold |
|---|---|---|
| Compute performance | Benchmark suite (SPECint, etc.) | Within 10% of equivalent AWS EC2 |
| Storage performance | MinIO IOPS/throughput testing | Within 20% of S3 equivalent |
| Network performance | Inter-zone latency, throughput | <1ms same-zone; <5ms cross-zone |
| FaaS latency | Cold/warm start measurements | Warm: <200ms P95; Cold: acceptable for use case |
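The Gate 2 FaaS criterion ("Warm: <200ms P95") can be checked mechanically. A sketch using the nearest-rank percentile method; the sample data is hypothetical.

```python
# Gate 2 FaaS latency check: P95 of warm-invocation samples (nearest-rank
# method) compared against the 200 ms threshold. Sample data hypothetical.
import math

def p95_ms(samples):
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank, 1-indexed
    return ordered[rank - 1]

def passes_gate2_warm(samples_ms, threshold_ms=200):
    return p95_ms(samples_ms) <= threshold_ms

warm = [120] * 95 + [900] * 5                  # 5% slow outliers
print(p95_ms(warm), passes_gate2_warm(warm))   # -> 120 True
```

P95 deliberately tolerates a small outlier tail; the gate should also record P99 so occasional cold-start contamination of the warm measurement is visible.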
Gate 3: Scalability Validation (Month 18)
| Criterion | Required Evidence | Threshold |
|---|---|---|
| Horizontal scaling | Scale from 100 to 1,000 VMs in controlled test | Linear scaling demonstrated |
| Management plane resilience | CloudStack management failover test | RTO <15 minutes |
| Multi-zone operation | 3-zone deployment operational | Cross-zone DR functional |
| 10x load stress test | Pilot workloads at 10x expected scale | No degradation beyond 20% |
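One way to make the Gate 3 thresholds testable: read "linear scaling demonstrated" and "no degradation beyond 20%" together as requiring throughput at the larger scale to reach at least 80% of the linearly extrapolated figure. This interpretation, and the inputs below, are assumptions for illustration.

```python
# Gate 3 scaling check: throughput at the larger scale must be >= 80% of
# linear extrapolation from the baseline. Inputs are hypothetical.

def scaling_efficiency(base_vms, base_tps, scaled_vms, scaled_tps):
    """1.0 = perfectly linear; 0.86 means 14% lost to overhead."""
    expected_tps = base_tps * (scaled_vms / base_vms)
    return scaled_tps / expected_tps

def passes_gate3(base_vms, base_tps, scaled_vms, scaled_tps, min_eff=0.8):
    return scaling_efficiency(base_vms, base_tps, scaled_vms, scaled_tps) >= min_eff

# 100 VMs -> 5,000 tps; 1,000 VMs -> 43,000 tps (86% of linear): pass
print(passes_gate3(100, 5_000, 1_000, 43_000))  # -> True
```

Throughput-per-VM is only one axis; the same check should be run on management-plane API latency, which often degrades before data-plane throughput does.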
5. Fallback and Rollback Options
If CloudStack or open-source components fail validation, the following fallbacks exist:
| Scenario | Fallback Option | Trade-off |
|---|---|---|
| CloudStack fails security assessment | Option 1: OpenStack (more complex but larger community). Option 2: Commercial sovereign cloud (OVHcloud, Scaleway). Option 3: OCI Dedicated Region (Oracle, but on-prem/dedicated) | OpenStack: higher complexity, more skills needed. Commercial: less open source, vendor dependency. OCI: US company, but fully dedicated |
| FaaS performance unacceptable | Option 1: accept higher latency for non-critical functions. Option 2: keep functions warm (higher cost). Option 3: use a container-based approach instead of FaaS | Option 1: UX impact on some services. Option 2: 2-3x compute cost increase. Option 3: developer experience changes |
| Integration failures between components | Option 1: extend the pilot timeline. Option 2: use managed services from European providers. Option 3: reduce scope to proven components only | Option 1: cost increase, delay. Option 2: less sovereignty, vendor dependency. Option 3: reduced capability set |
| Complete technology approach fails | Return to status quo (AWS/Azure/GCP) with enhanced contractual protections and data localisation requirements | Sovereignty risk remains; pilot costs sunk |
OCI Dedicated Region as Parallel Path
Consideration: Oracle Cloud Infrastructure (OCI) offers "Dedicated Region" deployment where Oracle hardware is deployed in customer/government premises, managed by Oracle but fully isolated. This provides:
- Near-complete feature parity with AWS/Azure
- Data remains on-premise under government control
- Trade-off: Still US company; operational dependency on Oracle
Recommendation: Include OCI Dedicated Region as a benchmark/comparison during pilot. If open-source approach fails, this provides a fallback that's more sovereign than public cloud but less than full open-source independence.
Summary Assessment
| Component | Key Risk | Recommendation |
|---|---|---|
| CloudStack IaaS | Government-scale operation unproven | Validate in pilot; enterprise scale already proven |
| Open-source FaaS | Cold start performance | Accept the trade-off or use pooling; consider Fission |
| Data services (MinIO, PostgreSQL, etc.) | Low - mature technologies | Proceed with confidence |
| Identity (Keycloak) | Low - proven in government | Proceed with confidence |
| Container platform (Kubernetes) | Low - industry standard | Proceed with confidence |
Document Status
This technology validation provides honest assessment including gaps and risks. Key message: core technologies are mature; CloudStack and FaaS require validation at government scale through the pilot programme.
Version: 1.0 | Last updated: January 2026