From edc552413e9600bba1c4161a860bff30034c87e9 Mon Sep 17 00:00:00 2001
From: Andriy Oblivantsev
Date: Thu, 19 Feb 2026 20:32:30 +0000
Subject: [PATCH] Architecture: cost optimisation, blue-green deployment, reduce to 3 projects

- Reduce from 4 to 3 GCP projects (drop sandbox, use staging namespaces)
- Add blue-green deployment strategy via Argo Rollouts
- Add cost optimisation section with monthly estimate (~$175-245)
- Add blue-green flow diagram and cost pie chart to HLD

Co-authored-by: Cursor
---
 docs/architecture-design-company-inc.md | 85 ++++++++++++++++++++-----
 docs/architecture-hld.md                | 50 ++++++++++++---
 2 files changed, 110 insertions(+), 25 deletions(-)

diff --git a/docs/architecture-design-company-inc.md b/docs/architecture-design-company-inc.md
index 17596d6..1eeeea4 100644
--- a/docs/architecture-design-company-inc.md
+++ b/docs/architecture-design-company-inc.md
@@ -10,7 +10,7 @@
 
 This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages **Google Cloud Platform (GCP)** with **GKE (Google Kubernetes Engine)** as the primary compute platform.
 
-**Key Design Principles:** Security-by-default, scalability from day one, cost optimization for early stage, and GitOps-based operations.
+**Key Design Principles:** Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.
 
 ---
 
@@ -20,20 +20,26 @@ This document outlines a robust, scalable, secure, and cost-effective infrastruc
 
 **Rationale:** GCP offers strong managed Kubernetes (GKE) with autopilot options, excellent MongoDB Atlas integration (or GCP-native DocumentDB alternatives), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.
 
-### 2.2 Multi-Project Structure
+### 2.2 Project Structure (Cost-Optimised)
+
+For a startup, fewer projects mean lower overhead and simpler billing. Start with **3 projects** and add more only when traffic or compliance demands it.
 
 | Project | Purpose | Isolation |
 |---------|---------|-----------|
 | **company-inc-prod** | Production workloads | High; sensitive data |
-| **company-inc-staging** | Staging / pre-production | Medium |
-| **company-inc-shared** | CI/CD, shared tooling, DNS | Low; no PII |
-| **company-inc-sandbox** | Dev experimentation | Lowest |
+| **company-inc-staging** | Staging, QA, and dev experimentation | Medium |
+| **company-inc-shared** | CI/CD, Artifact Registry, DNS | Low; no PII |
+
+**Why not 4+ projects?**
+- A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
+- Developers can use Kubernetes namespaces within the staging cluster for experimentation.
+- A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.
 
 **Benefits:**
-- Billing separation per environment
+- Billing separation (prod costs are clearly visible)
 - Blast-radius containment (prod issues do not affect staging)
-- IAM and network isolation
-- Aligns with GCP best practices for multi-tenant or multi-env setups
+- IAM isolation between environments
+- Minimal fixed cost — only 3 projects to manage
 
 ---
 
@@ -96,14 +102,36 @@ flowchart TD
 - **Frontend (React):** Static assets served via CDN or container; 1–2 replicas
 - **Ingress:** GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
 
-### 4.4 Containerisation and CI/CD
+### 4.4 Blue-Green Deployment
+
+Zero-downtime releases without duplicating infrastructure. Both versions run inside the **same GKE cluster**; the load balancer switches traffic atomically.
+
+```mermaid
+flowchart LR
+    LB[Load Balancer]
+    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
+    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
+    Blue -.->|smoke tests pass| LB
+```
+
+| Phase | Action |
+|-------|--------|
+| **Deploy** | New version deployed to the idle slot (blue) |
+| **Test** | Run smoke tests / synthetic checks against blue |
+| **Switch** | Update Service selector or Ingress to point to blue |
+| **Rollback** | Instant — revert selector back to green (old version still running) |
+| **Cleanup** | Scale down old slot after confirmation period |
+
+**Cost impact:** Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
+
+### 4.5 Containerisation and CI/CD
 
 | Aspect | Approach |
 |-------|----------|
 | **Image build** | Dockerfile per service; multi-stage builds; non-root user |
-| **Registry** | Artifact Registry (GCR) in `company-inc-shared` |
-| **CI** | GitHub Actions (or GitLab CI) — build, test, security scan |
-| **CD** | ArgoCD or Flux — GitOps; app of apps pattern |
+| **Registry** | Artifact Registry in `company-inc-shared` |
+| **CI** | GitHub/Gitea Actions — build, test, security scan |
+| **CD** | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
 | **Secrets** | External Secrets Operator + GCP Secret Manager |
 
 ---
@@ -138,7 +166,29 @@ flowchart TD
 
 ---
 
-## 6. High-Level Architecture Diagram
+## 6. Cost Optimisation Strategy
+
+| Lever | Approach | Estimated Savings |
+|-------|----------|-------------------|
+| **3 projects, not 4** | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
+| **GKE Autopilot** | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
+| **Blue-green in-cluster** | No duplicate environments for releases | Near-zero deployment cost |
+| **Spot/preemptible pods** | Use for staging and non-critical workloads | Up to 60–80% off compute |
+| **Committed use discounts** | 1-year CUDs once baseline is established | 20–30% off sustained use |
+| **CDN for frontend** | Offload SPA traffic from GKE | Fewer pod replicas needed |
+| **MongoDB Atlas auto-scale** | Start M10; scale up only when needed | Avoid over-provisioning |
+| **Cloud NAT shared** | Single NAT in shared project | Avoid per-project NAT cost |
+
+**Monthly cost estimate (early stage):**
+- GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
+- MongoDB Atlas M10: ~$60
+- Load Balancer + Cloud NAT: ~$30
+- Artifact Registry + Secret Manager: ~$5
+- **Total: ~$175–245/month**
+
+---
+
+## 7. High-Level Architecture Diagram
 
 ```mermaid
 flowchart TB
@@ -169,16 +219,17 @@ flowchart TB
 
 ---
 
-## 7. Summary of Recommendations
+## 8. Summary of Recommendations
 
 | Area | Recommendation |
 |------|----------------|
-| **Cloud** | GCP with 4 projects (prod, staging, shared, sandbox) |
+| **Cloud** | GCP with 3 projects (prod, staging, shared) |
 | **Compute** | GKE Autopilot, private nodes, HPA |
+| **Deployments** | Blue-green via Argo Rollouts — zero downtime, instant rollback |
 | **Database** | MongoDB Atlas on GCP with multi-AZ, automated backups |
-| **CI/CD** | GitHub Actions + ArgoCD/Flux |
+| **CI/CD** | GitHub/Gitea Actions + ArgoCD |
 | **Security** | Private VPC, TLS everywhere, Secret Manager, least privilege |
-| **Cost** | Start small; use committed use discounts as usage grows |
+| **Cost** | ~$175–245/month early stage; spot pods, CUDs as traffic grows |
 
 ---
 
diff --git a/docs/architecture-hld.md b/docs/architecture-hld.md
index c255209..9734b79 100644
--- a/docs/architecture-hld.md
+++ b/docs/architecture-hld.md
@@ -9,22 +9,25 @@ flowchart TB
     end
 
     subgraph GCP["Google Cloud Platform"]
-        subgraph Projects["Project Structure"]
+        subgraph Projects["Project Structure (3 projects)"]
             Prod[company-inc-prod]
-            Staging[company-inc-staging]
+            Staging[company-inc-staging<br/>QA + dev namespaces]
             Shared[company-inc-shared]
-            Sandbox[company-inc-sandbox]
         end
 
         subgraph Edge["Edge / Networking"]
             LB[Cloud Load Balancer<br/>HTTPS · TLS termination]
             CDN[Cloud CDN<br/>Static Assets]
-            NAT[Cloud NAT<br/>Egress]
+            NAT[Cloud NAT<br/>Egress · shared]
         end
 
         subgraph VPC["VPC — Private Subnets"]
             subgraph GKE["GKE Autopilot Cluster"]
                 Ingress[Ingress Controller]
+                subgraph BlueGreen["Blue-Green Deployment"]
+                    Green[Green — stable<br/>receives traffic]
+                    Blue[Blue — new release<br/>smoke tests]
+                end
                 subgraph Workloads
                     API[Backend — Python / Flask<br/>HPA · 2–3 replicas]
                     SPA[Frontend — React SPA<br/>Nginx]
@@ -44,14 +47,17 @@ flowchart TB
     subgraph CICD["CI / CD"]
         Git[Git Repository]
         Actions[Gitea / GitHub Actions<br/>Build · Test · Scan]
-        Argo[ArgoCD / Flux<br/>GitOps Deploy]
+        Argo[ArgoCD + Argo Rollouts<br/>GitOps · Blue-Green]
     end
 
     Users --> LB
     Users --> CDN
     LB --> Ingress
     CDN --> SPA
-    Ingress --> API
+    Ingress -->|traffic| Green
+    Ingress -.->|after switch| Blue
+    Green --> API
+    Blue --> API
     Ingress --> SPA
     API --> Redis
     API --> Mongo
@@ -64,6 +70,24 @@ flowchart TB
     Argo --> GKE
 ```
 
+## Blue-Green Deployment Flow
+
+```mermaid
+flowchart LR
+    subgraph Cluster["GKE Cluster"]
+        LB[Load Balancer<br/>Service Selector]
+        Green[Green — v1.2.0<br/>current stable]
+        Blue[Blue — v1.3.0<br/>new release]
+    end
+
+    Deploy[ArgoCD<br/>Argo Rollouts] -->|deploy new version| Blue
+    Blue -->|smoke tests| Check{Tests pass?}
+    Check -->|yes| LB
+    LB -->|switch 100%| Blue
+    Check -->|no| Rollback[Rollback<br/>keep Green]
+    LB -.->|instant rollback| Green
+```
+
 ## CI / CD Pipeline
 
 ```mermaid
@@ -72,8 +96,8 @@ flowchart LR
     Repo -->|webhook| CI[CI Pipeline<br/>lint · test · build]
     CI -->|push image| Registry[Artifact Registry]
     CI -->|update manifests| GitOps[GitOps Repo]
-    GitOps -->|sync| Argo[ArgoCD / Flux]
-    Argo -->|deploy| GKE[GKE Cluster]
+    GitOps -->|sync| Argo[ArgoCD]
+    Argo -->|blue-green deploy| GKE[GKE Cluster]
 ```
 
 ## Network Security Layers
@@ -86,3 +110,13 @@ flowchart TD
     NP --> Pods[Application Pods<br/>Private IPs only]
     Pods --> PE[Private Endpoint<br/>MongoDB Atlas]
 ```
+
+## Cost Profile (Early Stage)
+
+```mermaid
+pie title Monthly Cost Breakdown (~$200)
+    "GKE Autopilot" : 120
+    "MongoDB Atlas M10" : 60
+    "LB + NAT" : 30
+    "Registry + Secrets" : 5
+```
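
Review note: the patch states its cost figures in two places, the "Monthly cost estimate (early stage)" list in `docs/architecture-design-company-inc.md` and the pie chart in `docs/architecture-hld.md`. The sketch below is one possible way to cross-check the arithmetic. It is not part of the patch; the names `COST_ITEMS`, `PIE_SLICES`, and `total_range` are hypothetical, and the numbers are simply copied from the added lines above.

```python
# Line items copied from the "Monthly cost estimate (early stage)" list.
# Each value is a (low, high) range in USD per month; single-price items
# repeat the same number for both bounds.
COST_ITEMS = {
    "GKE Autopilot (2-3 API pods + 1 SPA)": (80, 150),
    "MongoDB Atlas M10": (60, 60),
    "Load Balancer + Cloud NAT": (30, 30),
    "Artifact Registry + Secret Manager": (5, 5),
}

# Slice values copied from the "Cost Profile (Early Stage)" pie chart.
PIE_SLICES = {
    "GKE Autopilot": 120,
    "MongoDB Atlas M10": 60,
    "LB + NAT": 30,
    "Registry + Secrets": 5,
}


def total_range(items):
    """Sum the low and high bounds of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(COST_ITEMS)
    pie_total = sum(PIE_SLICES.values())
    print(f"estimate range: ${low}-{high}/month")
    print(f"pie chart total: ${pie_total}/month")
    print(f"pie total inside range: {low <= pie_total <= high}")
```

Under these copied figures the line items sum to the stated $175–245 range, and the pie slices sum to 215, which sits inside that range and is consistent with the chart's rounded "~$200" title.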