Architectural Design Document: Company Inc.
Cloud Infrastructure for Web Application Deployment
Version: 1.0
Date: February 2026
1. Executive Summary
This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages Google Cloud Platform (GCP) with GKE (Google Kubernetes Engine) as the primary compute platform.
Key Design Principles: Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.
2. Cloud Provider and Environment Structure
2.1 Provider Choice: GCP
Rationale: GCP offers strong managed Kubernetes (GKE) with an Autopilot option, smooth MongoDB Atlas integration (or Firestore as a GCP-native document-database alternative), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.
2.2 Project Structure (Cost-Optimised)
For a startup, fewer projects mean lower overhead and simpler billing. Start with 3 projects and add more only when traffic or compliance demands it.
| Project | Purpose | Isolation |
|---|---|---|
| company-inc-prod | Production workloads | High; sensitive data |
| company-inc-staging | Staging, QA, and dev experimentation | Medium |
| company-inc-shared | CI/CD, Artifact Registry, DNS | Low; no PII |
Why not 4+ projects?
- A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
- Developers can use Kubernetes namespaces within the staging cluster for experimentation.
- A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.
Benefits:
- Billing separation (prod costs are clearly visible)
- Blast-radius containment (prod issues do not affect staging)
- IAM isolation between environments
- Minimal fixed cost — only 3 projects to manage
3. Network Design
3.1 VPC Architecture
- One VPC per project (or Shared VPC from company-inc-shared for centralised control)
- Regional subnets in at least 2 zones for HA
- Private subnets for workloads (no public IPs on nodes)
- Public subnets only for load balancers and NAT gateways
3.2 Security Layers
| Layer | Controls |
|---|---|
| VPC Firewall | Default deny; allow only required CIDRs and ports |
| GKE node pools | Private nodes; no public IPs |
| Security groups | Kubernetes Network Policies + GKE-native security |
| Ingress | HTTPS only; TLS termination at load balancer |
| Egress | Cloud NAT for outbound; restrict to necessary destinations |
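The Kubernetes Network Policy layer above typically starts from a default-deny baseline, with later policies allow-listing only required traffic. A minimal sketch in Python that builds such a manifest (the policy name and namespace are illustrative):

```python
import json

def default_deny_policy(namespace: str) -> dict:
    """Build a Kubernetes NetworkPolicy manifest that denies all
    ingress and egress for every pod in the given namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing both policy types with no allow rules blocks all
            # traffic; subsequent policies then open only what is needed.
            "policyTypes": ["Ingress", "Egress"],
        },
    }

if __name__ == "__main__":
    print(json.dumps(default_deny_policy("prod"), indent=2))
```

In practice this manifest would live in the GitOps repository and be applied by Argo CD alongside the workload manifests.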
3.3 Network Topology (High-Level)
```mermaid
flowchart TD
    Internet((Internet))
    Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
    LB --> Ingress[GKE Ingress Controller]
    subgraph VPC["VPC — Private Subnets"]
        Ingress --> API[API Pods<br/>Python / Flask]
        Ingress --> SPA[Frontend Pods<br/>React SPA]
        API --> DB[(MongoDB<br/>Private Endpoint)]
    end
```
4. Compute Platform: GKE
4.1 Cluster Strategy
- GKE Autopilot for production and staging to minimise node management
- Single regional cluster per environment initially; consider multi-region as scale demands
- Private cluster with no public control-plane endpoint; admin access via Identity-Aware Proxy (IAP) or a bastion host if needed
4.2 Node Configuration
| Setting | Initial | Growth Phase |
|---|---|---|
| Node type | Autopilot (no manual sizing) | Same |
| Min nodes | 0 (scale to zero when idle) | 2 |
| Max nodes | 5 | 50+ |
| Scaling | Pod-based (HPA, cluster autoscaler) | Same |
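The pod-based scaling in the table follows the standard Kubernetes HPA rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the configured bounds. A small sketch of that calculation, using illustrative min/max values:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_utilisation: float,
                         target_utilisation: float,
                         min_replicas: int = 2,
                         max_replicas: int = 50) -> int:
    """Kubernetes HPA scaling rule: scale replicas proportionally to
    the ratio of observed to target utilisation, then clamp the result
    to the [min_replicas, max_replicas] range."""
    desired = math.ceil(
        current_replicas * current_utilisation / target_utilisation
    )
    return max(min_replicas, min(max_replicas, desired))

# Three pods running at 90% CPU against a 60% target scale out to 5.
print(hpa_desired_replicas(3, 90, 60))  # 5
```

The cluster autoscaler then adds or removes capacity to fit the resulting pod count, which under Autopilot happens without any node management.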
4.3 Workload Layout
- Backend (Python/Flask): Deployment with HPA (CPU/memory); target 2–3 replicas initially
- Frontend (React): Static assets served via CDN or container; 1–2 replicas
- Ingress: GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
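For the Deployments above, Kubernetes liveness and readiness probes need cheap health endpoints on the backend. A framework-agnostic sketch using the WSGI interface that Flask itself implements (the route paths are conventional, not mandated):

```python
def health_app(environ, start_response):
    """Minimal WSGI app exposing /healthz (liveness) and /readyz
    (readiness); any other path returns 404. In a Flask service these
    would be ordinary @app.route handlers on the main application."""
    path = environ.get("PATH_INFO", "/")
    if path in ("/healthz", "/readyz"):
        body = b"ok"
        status = "200 OK"
    else:
        body = b"not found"
        status = "404 Not Found"
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]
```

In a real service the readiness handler should also verify downstream dependencies (for example, a cheap MongoDB ping) so the Ingress only routes traffic to pods that can actually serve it.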
4.4 Blue-Green Deployment
Zero-downtime releases without duplicating infrastructure. Both versions run inside the same GKE cluster; the load balancer switches traffic atomically.
```mermaid
flowchart LR
    LB[Load Balancer]
    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
    Blue -.->|smoke tests pass| LB
```
| Phase | Action |
|---|---|
| Deploy | New version deployed to the idle slot (blue) |
| Test | Run smoke tests / synthetic checks against blue |
| Switch | Update Service selector or Ingress to point to blue |
| Rollback | Instant — revert selector back to green (old version still running) |
| Cleanup | Scale down old slot after confirmation period |
Cost impact: Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts can automate the full lifecycle alongside Argo CD.
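The Switch and Rollback phases in the table both reduce to changing one label selector on the Kubernetes Service. A sketch that models the cutover as a pure function (the label keys and slot names are illustrative):

```python
def switch_traffic(service: dict, target_slot: str) -> dict:
    """Return a copy of the Service manifest pointed at the given slot
    ('blue' or 'green') by rewriting its label selector. Rollback is
    the same call with the previous slot, which is why it is
    effectively instant: the old pods are still running."""
    if target_slot not in ("blue", "green"):
        raise ValueError(f"unknown slot: {target_slot}")
    updated = dict(service)
    updated["spec"] = dict(service["spec"])
    updated["spec"]["selector"] = {"app": "api", "slot": target_slot}
    return updated

svc = {"kind": "Service",
       "spec": {"selector": {"app": "api", "slot": "green"}}}
svc = switch_traffic(svc, "blue")  # cut over once smoke tests pass
print(svc["spec"]["selector"]["slot"])  # blue
```

Argo Rollouts performs the same selector rewrite automatically via its `activeService`/`previewService` fields, so this logic never needs to be hand-rolled in production.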
4.5 Containerisation and CI/CD
| Aspect | Approach |
|---|---|
| Image build | Dockerfile per service; multi-stage builds; non-root user |
| Registry | Artifact Registry in company-inc-shared |
| CI | GitHub/Gitea Actions — build, test, security scan |
| CD | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| Secrets | External Secrets Operator + GCP Secret Manager |
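With External Secrets Operator, application pods never call Secret Manager directly: secrets are synced into Kubernetes Secrets and arrive as mounted files or environment variables. A sketch of the consuming side, preferring a mounted file with an env-var fallback (the mount path and naming convention are assumptions, not ESO requirements):

```python
import os
from pathlib import Path

def read_secret(name: str,
                mount_dir: str = "/var/run/secrets/app") -> str:
    """Read a secret synced into the pod: first try the mounted
    volume, then fall back to an environment variable derived from
    the secret name (db-password -> DB_PASSWORD)."""
    path = Path(mount_dir) / name
    if path.is_file():
        return path.read_text().strip()
    value = os.environ.get(name.upper().replace("-", "_"))
    if value is None:
        raise KeyError(f"secret {name!r} not found in {mount_dir} or env")
    return value
```

Keeping this lookup behind one function means the application code is unchanged if the delivery mechanism later moves from env vars to volume mounts.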
5. Database: MongoDB
5.1 Service Choice
MongoDB Atlas (or Firestore, GCP's native document database, if a strict GCP-only stack is required) is recommended for:
- Fully managed, automated backups
- Multi-region replication
- Strong security (encryption at rest, VPC peering)
- Easy scaling
Atlas on GCP provides native VPC peering and private connectivity.
5.2 High Availability and DR
| Topic | Strategy |
|---|---|
| Replicas | 3-node replica set; multi-AZ |
| Backups | Continuous backup; point-in-time recovery |
| Disaster recovery | Cross-region replica (e.g. us-central1 + europe-west1) |
| Restore testing | Quarterly DR drills |
5.3 Security
- Private endpoint (no public IP)
- TLS for all connections
- IAM-based access; principle of least privilege
- Encryption at rest (default in Atlas)
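The TLS and private-endpoint requirements above show up directly in the connection string the backend uses. A sketch that assembles a `mongodb+srv` URI with TLS enforced and credentials URL-escaped (the hostname and option set are illustrative; credentials would come from Secret Manager, never from source code):

```python
from urllib.parse import quote_plus, urlencode

def build_mongo_uri(user: str, password: str, host: str,
                    database: str) -> str:
    """Assemble a MongoDB SRV connection string with TLS enabled and
    credentials URL-escaped. The host should resolve to the Atlas
    private endpoint, never a public IP."""
    options = urlencode({
        "tls": "true",          # encrypt every connection
        "retryWrites": "true",  # retry transient write failures once
        "w": "majority",        # ack by a replica-set majority
        "authSource": "admin",
    })
    return (f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}/{database}?{options}")

uri = build_mongo_uri("api", "p@ss/word",
                      "cluster0.example.mongodb.net", "app")
```

Escaping via `quote_plus` matters: characters such as `@` or `/` in a password would otherwise corrupt the URI.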
6. Cost Optimisation Strategy
| Lever | Approach | Estimated Savings |
|---|---|---|
| 3 projects, not 4 | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| GKE Autopilot | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
| Blue-green in-cluster | No duplicate environments for releases | Near-zero deployment cost |
| Spot/preemptible pods | Use for staging and non-critical workloads | Up to 60–80% off compute |
| Committed use discounts | 1-year CUDs once baseline is established | 20–30% off sustained use |
| CDN for frontend | Offload SPA traffic from GKE | Fewer pod replicas needed |
| MongoDB Atlas auto-scale | Start M10; scale up only when needed | Avoid over-provisioning |
| Cloud NAT shared | Single NAT in shared project | Avoid per-project NAT cost |
Monthly cost estimate (early stage):
- GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
- MongoDB Atlas M10: ~$60
- Load Balancer + Cloud NAT: ~$30
- Artifact Registry + Secret Manager: ~$5
- Total: ~$175–245/month
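The monthly total is simply the sum of the per-service ranges; a quick sketch that recomputes it from the line items above:

```python
# Per-service monthly estimates in USD, as (low, high) ranges.
line_items = {
    "GKE Autopilot (2-3 API pods + 1 SPA)": (80, 150),
    "MongoDB Atlas M10": (60, 60),
    "Load Balancer + Cloud NAT": (30, 30),
    "Artifact Registry + Secret Manager": (5, 5),
}

low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
print(f"Total: ~${low}-{high}/month")  # Total: ~$175-245/month
```

Keeping the estimate as data like this makes it trivial to re-run the total when a line item (say, the Atlas tier) changes.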
6.1 What Would Be Overkill at This Stage
Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.
| Component | Why it's overkill now | When to introduce |
|---|---|---|
| Multi-region GKE | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| Service mesh (Istio/Linkerd) | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| Cross-region MongoDB replica | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| Dedicated observability stack | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When team has > 2 SREs and needs custom dashboards |
| 4+ GCP projects | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| API Gateway (Apigee, Kong) | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| Vault for secrets | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |
Rule of thumb: if a component doesn't solve a problem you have today, defer it. Every added piece increases the monthly bill and the on-call surface area.
7. High-Level Architecture Diagram
```mermaid
flowchart TD
    Users((Users))
    Users --> CDN[Cloud CDN<br/>Static Assets]
    Users --> LB[Cloud Load Balancer<br/>HTTPS]
    subgraph GKE["GKE Cluster — Private"]
        LB --> Ingress[Ingress Controller]
        Ingress --> API[Backend — Flask<br/>HPA 2–3 replicas]
        Ingress --> SPA[Frontend — React SPA<br/>Nginx]
        CDN --> SPA
        API --> Redis[Redis<br/>Memorystore]
        API --> Obs[Observability<br/>Prometheus / Grafana]
    end
    subgraph Data["Managed Services"]
        Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
        Secrets[Secret Manager<br/>App & DB credentials]
        Registry[Artifact Registry<br/>Container images]
    end
    API --> Mongo
    API --> Secrets
    GKE ----> Registry
```
8. Summary of Recommendations
| Area | Recommendation |
|---|---|
| Cloud | GCP with 3 projects (prod, staging, shared) |
| Compute | GKE Autopilot, private nodes, HPA |
| Deployments | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| Database | MongoDB Atlas on GCP with multi-AZ, automated backups |
| CI/CD | GitHub/Gitea Actions + ArgoCD |
| Security | Private VPC, TLS everywhere, Secret Manager, least privilege |
| Cost | ~$175–245/month early stage; spot pods, CUDs as traffic grows |
See architecture-hld.md for the standalone HLD diagram.