Add 'what would be overkill' section to architecture doc

Pragmatic analysis of components that add cost/complexity without value at startup scale, with guidance on when to introduce each.

Co-authored-by: Cursor <cursoragent@cursor.com>
Architecture: cost optimisation, blue-green deployment, reduce to 3 projects
2026-02-19 20:33:18 +00:00 · 2026-02-19 20:32:30 +00:00
2 changed files with 126 additions and 25 deletions
@@ -10,7 +10,7 @@
 This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages **Google Cloud Platform (GCP)** with **GKE (Google Kubernetes Engine)** as the primary compute platform.
-**Key Design Principles:** Security-by-default, scalability from day one, cost optimization for early stage, and GitOps-based operations.
+**Key Design Principles:** Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.
 ---
@@ -20,20 +20,26 @@ This document outlines a robust, scalable, secure, and cost-effective infrastruc
 **Rationale:** GCP offers strong managed Kubernetes (GKE) with Autopilot options, solid MongoDB Atlas integration (or GCP-native alternatives such as Firestore with MongoDB compatibility), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.
-### 2.2 Multi-Project Structure
+### 2.2 Project Structure (Cost-Optimised)
 For a startup, fewer projects mean lower overhead and simpler billing. Start with **3 projects** and add more only when traffic or compliance demands it.
 | Project | Purpose | Isolation |
 |---------|---------|-----------|
 | **company-inc-prod** | Production workloads | High; sensitive data |
-| **company-inc-staging** | Staging / pre-production | Medium |
+| **company-inc-staging** | Staging, QA, and dev experimentation | Medium |
-| **company-inc-shared** | CI/CD, shared tooling, DNS | Low; no PII |
+| **company-inc-shared** | CI/CD, Artifact Registry, DNS | Low; no PII |
-| **company-inc-sandbox** | Dev experimentation | Lowest |
+
 **Why not 4+ projects?**
 - A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
 - Developers can use Kubernetes namespaces within the staging cluster for experimentation.
 - A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.
 **Benefits:**
- Billing separation per environment
+- Billing separation (prod costs are clearly visible)
 - Blast-radius containment (prod issues do not affect staging)
- IAM and network isolation
+- IAM isolation between environments
- Aligns with GCP best practices for multi-tenant or multi-env setups
+- Minimal fixed cost — only 3 projects to manage
 ---
@@ -96,14 +102,36 @@ flowchart TD
 - **Frontend (React):** Static assets served via CDN or container; 1–2 replicas
 - **Ingress:** GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
-### 4.4 Containerisation and CI/CD
+### 4.4 Blue-Green Deployment
 Zero-downtime releases without duplicating infrastructure. Both versions run inside the **same GKE cluster**; the load balancer switches traffic atomically.
 ```mermaid
 flowchart LR
    LB[Load Balancer]
    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
    Blue -.->|smoke tests pass| LB
 ```
 | Phase | Action |
 |-------|--------|
 | **Deploy** | New version deployed to the idle slot (blue) |
 | **Test** | Run smoke tests / synthetic checks against blue |
 | **Switch** | Update Service selector or Ingress to point to blue |
 | **Rollback** | Instant — revert selector back to green (old version still running) |
 | **Cleanup** | Scale down old slot after confirmation period |
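 **Cost impact:** Near-zero extra infrastructure: both slots run in the same cluster, so no duplicate environment is needed. On Autopilot the idle slot is still billed for its pods' resource requests while it runs, so keep it at minimal replicas until traffic is switched and scale it down after the confirmation period. Argo Rollouts, installed alongside ArgoCD, automates the full lifecycle.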
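
The phase table above can be read as a small state machine. The following is a minimal Python sketch of that logic only; it is not Argo Rollouts code, and the class name `BlueGreen`, its methods, and the `smoke_test` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class BlueGreen:
    """Toy model of two slots behind one selector (not an Argo Rollouts API)."""

    # Version running in each slot; None means the slot is scaled down.
    slots: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"green": "v1.2.0", "blue": None}
    )
    active: str = "green"  # slot currently receiving 100% of traffic

    @property
    def idle(self) -> str:
        return "blue" if self.active == "green" else "green"

    def release(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Deploy to the idle slot, test it, and switch only if tests pass."""
        target = self.idle
        self.slots[target] = version       # Deploy phase
        if not smoke_test(version):        # Test phase
            self.slots[target] = None      # a failed release never gets traffic
            return False
        self.active = target               # Switch phase
        return True

    def rollback(self) -> None:
        """Point the selector back at the previous slot if it is still running."""
        if self.slots[self.idle] is None:
            raise RuntimeError("previous slot already cleaned up")
        self.active = self.idle

    def cleanup(self) -> None:
        """Scale down the old slot after the confirmation period."""
        self.slots[self.idle] = None


bg = BlueGreen()
switched = bg.release("v1.3.0", smoke_test=lambda version: True)
live_after_switch = bg.slots[bg.active]
bg.rollback()
live_after_rollback = bg.slots[bg.active]
```

The sketch makes one property of the table explicit: rollback is only instant while the old slot is still running, which is why cleanup waits for a confirmation period.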
 ### 4.5 Containerisation and CI/CD
 | Aspect | Approach |
 |-------|----------|
 | **Image build** | Dockerfile per service; multi-stage builds; non-root user |
-| **Registry** | Artifact Registry (GCR) in `company-inc-shared` |
+| **Registry** | Artifact Registry in `company-inc-shared` |
-| **CI** | GitHub Actions (or GitLab CI) — build, test, security scan |
+| **CI** | GitHub/Gitea Actions — build, test, security scan |
-| **CD** | ArgoCD or Flux — GitOps; app of apps pattern |
+| **CD** | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
 | **Secrets** | External Secrets Operator + GCP Secret Manager |
 ---
@@ -138,7 +166,45 @@ flowchart TD
 ---
-## 6. High-Level Architecture Diagram
+## 6. Cost Optimisation Strategy
 | Lever | Approach | Estimated Savings |
 |-------|----------|-------------------|
 | **3 projects, not 4** | Drop sandbox; use staging namespaces | One fewer project's fixed overhead (~25%) |
 | **GKE Autopilot** | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
 | **Blue-green in-cluster** | No duplicate environments for releases | Near-zero deployment cost |
 | **Spot/preemptible pods** | Use for staging and non-critical workloads | Up to 60–80% off compute |
 | **Committed use discounts** | 1-year CUDs once baseline is established | 20–30% off sustained use |
 | **CDN for frontend** | Offload SPA traffic from GKE | Fewer pod replicas needed |
 | **MongoDB Atlas auto-scale** | Start M10; scale up only when needed | Avoid over-provisioning |
 | **Cloud NAT shared** | Single NAT in shared project | Avoid per-project NAT cost |
 **Monthly cost estimate (early stage):**
 - GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
 - MongoDB Atlas M10: ~$60
 - Load Balancer + Cloud NAT: ~$30
 - Artifact Registry + Secret Manager: ~$5
 - **Total: ~$175–245/month**
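
As a quick check of the arithmetic, the line items above can be summed in a few lines of Python. The figures are copied from the estimate; the dictionary layout and variable names are only illustrative.

```python
# (low, high) monthly USD figures copied from the estimate above
line_items = {
    "GKE Autopilot": (80, 150),
    "MongoDB Atlas M10": (60, 60),
    "Load Balancer + Cloud NAT": (30, 30),
    "Artifact Registry + Secret Manager": (5, 5),
}

low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
print(f"Total: ~${low}-{high}/month")
```

Only the GKE Autopilot line carries a range, so it accounts for the whole spread in the total.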
 ### 6.1 What Would Be Overkill at This Stage
 Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.
 | Component | Why it's overkill now | When to introduce |
 |-----------|----------------------|-------------------|
 | **Multi-region GKE** | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
 | **Service mesh (Istio/Linkerd)** | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
 | **Cross-region MongoDB replica** | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
 | **Dedicated observability stack** | GKE built-in monitoring + Cloud Logging costs little or nothing at this volume; Prometheus/Grafana adds ops burden | When team has > 2 SREs and needs custom dashboards |
 | **4+ GCP projects** | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
 | **API Gateway (Apigee, Kong)** | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
 | **Vault for secrets** | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |
 **Rule of thumb:** if a component doesn't solve a problem you have *today*, defer it. Every added piece increases the monthly bill and the on-call surface area.
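
The "When to introduce" column can also be kept as an executable checklist and re-run as the team and traffic change. This is a minimal sketch under stated assumptions: the thresholds are taken from the table above, while the `Situation` fields and the `components_due` function are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Situation:
    """Hypothetical field names; the thresholds come from the table above."""

    sla_target_pct: float = 99.9
    users_span_continents: bool = False
    microservices: int = 2
    needs_mtls: bool = False
    sre_count: int = 0
    needs_custom_dashboards: bool = False
    needs_gateway_features: bool = False      # rate limiting, API keys, monetisation
    needs_dynamic_or_multicloud_secrets: bool = False
    compliance_requires_separation: bool = False  # e.g. SOC2, HIPAA
    compliance_rpo_under_1h: bool = False


def components_due(s: Situation) -> List[str]:
    """Return the deferred components whose trigger condition is now met."""
    rules = [
        ("Multi-region GKE", s.sla_target_pct >= 99.99 or s.users_span_continents),
        ("Service mesh", s.microservices >= 10 and s.needs_mtls),
        ("Cross-region MongoDB replica", s.compliance_rpo_under_1h),
        ("Dedicated observability stack",
         s.sre_count > 2 and s.needs_custom_dashboards),
        ("4+ GCP projects", s.compliance_requires_separation),
        ("API Gateway", s.needs_gateway_features),
        ("Vault for secrets", s.needs_dynamic_or_multicloud_secrets),
    ]
    return [name for name, due in rules if due]


day_one = components_due(Situation())
later = components_due(Situation(microservices=12, needs_mtls=True))
```

With the default values, which stand in for the early-stage startup described here, nothing is due; the list only grows once a concrete trigger from the table is met.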
 ---
 ## 7. High-Level Architecture Diagram
 ```mermaid
 flowchart TB
@@ -169,16 +235,17 @@ flowchart TB
 ---
-## 7. Summary of Recommendations
+## 8. Summary of Recommendations
 | Area | Recommendation |
 |------|----------------|
-| **Cloud** | GCP with 4 projects (prod, staging, shared, sandbox) |
+| **Cloud** | GCP with 3 projects (prod, staging, shared) |
 | **Compute** | GKE Autopilot, private nodes, HPA |
 | **Deployments** | Blue-green via Argo Rollouts — zero downtime, instant rollback |
 | **Database** | MongoDB Atlas on GCP with multi-AZ, automated backups |
-| **CI/CD** | GitHub Actions + ArgoCD/Flux |
+| **CI/CD** | GitHub/Gitea Actions + ArgoCD |
 | **Security** | Private VPC, TLS everywhere, Secret Manager, least privilege |
-| **Cost** | Start small; use committed use discounts as usage grows |
+| **Cost** | ~$175–245/month early stage; spot pods, CUDs as traffic grows |
 ---
@@ -9,22 +9,25 @@ flowchart TB
    end
    subgraph GCP["Google Cloud Platform"]
-        subgraph Projects["Project Structure"]
+        subgraph Projects["Project Structure (3 projects)"]
            Prod[company-inc-prod]
-            Staging[company-inc-staging]
+            Staging[company-inc-staging<br/>QA + dev namespaces]
            Shared[company-inc-shared]
            Sandbox[company-inc-sandbox]
        end
        subgraph Edge["Edge / Networking"]
            LB[Cloud Load Balancer<br/>HTTPS · TLS termination]
            CDN[Cloud CDN<br/>Static Assets]
-            NAT[Cloud NAT<br/>Egress]
+            NAT[Cloud NAT<br/>Egress · shared]
        end
        subgraph VPC["VPC — Private Subnets"]
            subgraph GKE["GKE Autopilot Cluster"]
                Ingress[Ingress Controller]
                subgraph BlueGreen["Blue-Green Deployment"]
                    Green[Green — stable<br/>receives traffic]
                    Blue[Blue — new release<br/>smoke tests]
                end
                subgraph Workloads
                    API[Backend — Python / Flask<br/>HPA · 2–3 replicas]
                    SPA[Frontend — React SPA<br/>Nginx]
@@ -44,14 +47,17 @@ flowchart TB
    subgraph CICD["CI / CD"]
        Git[Git Repository]
        Actions[Gitea / GitHub Actions<br/>Build · Test · Scan]
-        Argo[ArgoCD / Flux<br/>GitOps Deploy]
+        Argo[ArgoCD + Argo Rollouts<br/>GitOps · Blue-Green]
    end
    Users --> LB
    Users --> CDN
    LB --> Ingress
    CDN --> SPA
-    Ingress --> API
+    Ingress -->|traffic| Green
    Ingress -.->|after switch| Blue
    Green --> API
    Blue --> API
    Ingress --> SPA
    API --> Redis
    API --> Mongo
@@ -64,6 +70,24 @@ flowchart TB
    Argo --> GKE
 ```
 ## Blue-Green Deployment Flow
 ```mermaid
 flowchart LR
    subgraph Cluster["GKE Cluster"]
        LB[Load Balancer<br/>Service Selector]
        Green[Green — v1.2.0<br/>current stable]
        Blue[Blue — v1.3.0<br/>new release]
    end
    Deploy[ArgoCD<br/>Argo Rollouts] -->|deploy new version| Blue
    Blue -->|smoke tests| Check{Tests pass?}
    Check -->|yes| LB
    LB -->|switch 100%| Blue
    Check -->|no| Rollback[Rollback<br/>keep Green]
    LB -.->|instant rollback| Green
 ```
 ## CI / CD Pipeline
 ```mermaid
@@ -72,8 +96,8 @@ flowchart LR
    Repo -->|webhook| CI[CI Pipeline<br/>lint · test · build]
    CI -->|push image| Registry[Artifact Registry]
    CI -->|update manifests| GitOps[GitOps Repo]
-    GitOps -->|sync| Argo[ArgoCD / Flux]
+    GitOps -->|sync| Argo[ArgoCD]
-    Argo -->|deploy| GKE[GKE Cluster]
+    Argo -->|blue-green deploy| GKE[GKE Cluster]
 ```
 ## Network Security Layers
@@ -86,3 +110,13 @@ flowchart TD
    NP --> Pods[Application Pods<br/>Private IPs only]
    Pods --> PE[Private Endpoint<br/>MongoDB Atlas]
 ```
 ## Cost Profile (Early Stage)
 ```mermaid
 pie title Monthly Cost Breakdown (~$200)
    "GKE Autopilot" : 120
    "MongoDB Atlas M10" : 60
    "LB + NAT" : 30
    "Registry + Secrets" : 5
 ```
| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Andriy Oblivantsev | ce0851dc3c | Add 'what would be overkill' section to architecture doc. Pragmatic analysis of components that add cost/complexity without value at startup scale, with guidance on when to introduce each. Co-authored-by: Cursor <cursoragent@cursor.com> | 2026-02-19 20:33:18 +00:00 |
| Andriy Oblivantsev | edc552413e | Architecture: cost optimisation, blue-green deployment, reduce to 3 projects. Reduce from 4 to 3 GCP projects (drop sandbox, use staging namespaces); add blue-green deployment strategy via Argo Rollouts; add cost optimisation section with monthly estimate (~$175-245); add blue-green flow diagram and cost pie chart to HLD. Co-authored-by: Cursor <cursoragent@cursor.com> | 2026-02-19 20:32:30 +00:00 |