# Architectural Design Document: Company Inc.
**Cloud Infrastructure for Web Application Deployment**
**Version:** 1.0
**Date:** February 2026
---
## 1. Executive Summary
This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages **Google Cloud Platform (GCP)** with **GKE (Google Kubernetes Engine)** as the primary compute platform.
**Key Design Principles:** Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.
---
## 2. Cloud Provider and Environment Structure
### 2.1 Provider Choice: GCP
**Rationale:** GCP offers strong managed Kubernetes (GKE) with Autopilot options, excellent MongoDB Atlas integration (or GCP-native alternatives such as Firestore), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.
### 2.2 Project Structure (Cost-Optimised)
For a startup, fewer projects mean lower overhead and simpler billing. Start with **3 projects** and add more only when traffic or compliance demands it.
| Project | Purpose | Isolation |
|---------|---------|-----------|
| **company-inc-prod** | Production workloads | High; sensitive data |
| **company-inc-staging** | Staging, QA, and dev experimentation | Medium |
| **company-inc-shared** | CI/CD, Artifact Registry, DNS | Low; no PII |
**Why not 4+ projects?**
- A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
- Developers can use Kubernetes namespaces within the staging cluster for experimentation.
- A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.
**Benefits:**
- Billing separation (prod costs are clearly visible)
- Blast-radius containment (prod issues do not affect staging)
- IAM isolation between environments
- Minimal fixed cost — only 3 projects to manage
---
## 3. Network Design
### 3.1 VPC Architecture
- **One VPC per project** (or Shared VPC from `company-inc-shared` for centralised control)
- **Regional subnets** in at least 2 zones for HA
- **Private subnets** for workloads (no public IPs on nodes)
- **Public subnets** only for load balancers and NAT gateways
### 3.2 Security Layers
| Layer | Controls |
|-------|----------|
| **VPC Firewall** | Default deny; allow only required CIDRs and ports |
| **GKE node pools** | Private nodes; no public IPs |
| **Pod-level policy** | Kubernetes Network Policies + GKE-native security controls |
| **Ingress** | HTTPS only; TLS termination at load balancer |
| **Egress** | Cloud NAT for outbound; restrict to necessary destinations |
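As a sketch, the default-deny posture in the table above can be expressed as Kubernetes NetworkPolicies applied per namespace; the namespace, labels, and port below are illustrative assumptions, not part of the design:

```yaml
# Default-deny all ingress and egress for pods in the namespace;
# additional policies then allow only the traffic each workload needs.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production        # illustrative namespace
spec:
  podSelector: {}              # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# Example allow rule: let the ingress controller reach the API pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api                 # assumed pod label on the Flask backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress   # assumed ingress namespace
      ports:
        - protocol: TCP
          port: 8080           # assumed Flask container port
```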
### 3.3 Network Topology (High-Level)
```mermaid
flowchart TD
Internet((Internet))
Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
LB --> Ingress[GKE Ingress Controller]
subgraph VPC["VPC — Private Subnets"]
Ingress --> API[API Pods<br/>Python / Flask]
Ingress --> SPA[Frontend Pods<br/>React SPA]
API --> DB[(MongoDB<br/>Private Endpoint)]
end
```
---
## 4. Compute Platform: GKE
### 4.1 Cluster Strategy
- **GKE Autopilot** for production and staging to minimise node management
- **Single regional cluster** per environment initially; consider multi-region as scale demands
- **Private cluster** with no public endpoint; access via IAP or Bastion if needed
### 4.2 Node Configuration
| Setting | Initial | Growth Phase |
|---------|---------|--------------|
| **Node type** | Autopilot (no manual sizing) | Same |
| **Min nodes** | 0 (scale to zero when idle) | 2 |
| **Max nodes** | 5 | 50+ |
| **Scaling** | Pod-based (HPA, cluster autoscaler) | Same |
### 4.3 Workload Layout
- **Backend (Python/Flask):** Deployment with HPA (CPU/memory); target 2-3 replicas initially
- **Frontend (React):** Static assets served via CDN or container; 1-2 replicas
- **Ingress:** GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
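The backend scaling above can be sketched as a minimal HPA manifest; the Deployment name and utilisation thresholds are assumptions for illustration:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                      # assumed Deployment name for the Flask backend
  minReplicas: 2                   # matches the initial 2-3 replica target
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```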
### 4.4 Blue-Green Deployment
Zero-downtime releases without duplicating infrastructure. Both versions run inside the **same GKE cluster**; the load balancer switches traffic atomically.
```mermaid
flowchart LR
LB[Load Balancer]
LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
Blue -.->|smoke tests pass| LB
```
| Phase | Action |
|-------|--------|
| **Deploy** | New version deployed to the idle slot (blue) |
| **Test** | Run smoke tests / synthetic checks against blue |
| **Switch** | Update Service selector or Ingress to point to blue |
| **Rollback** | Instant — revert selector back to green (old version still running) |
| **Cleanup** | Scale down old slot after confirmation period |
**Cost impact:** Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
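With Argo Rollouts, the deploy/test/switch/rollback phases above map onto a single `Rollout` resource. A minimal sketch, in which the Service names, image path, and replica count are illustrative assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          # Assumed Artifact Registry path in the shared project
          image: europe-docker.pkg.dev/company-inc-shared/images/api:v1.3.0
  strategy:
    blueGreen:
      activeService: api-active    # Service receiving live traffic (green slot)
      previewService: api-preview  # Service pointing at the new version (blue slot)
      autoPromotionEnabled: false  # hold traffic until smoke tests / manual gate pass
      scaleDownDelaySeconds: 300   # keep the old slot 5 min for instant rollback
```

Promotion flips the active Service's selector to the new ReplicaSet; rollback reverts it, matching the phase table above.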
### 4.5 Containerisation and CI/CD
| Aspect | Approach |
|-------|----------|
| **Image build** | Dockerfile per service; multi-stage builds; non-root user |
| **Registry** | Artifact Registry in `company-inc-shared` |
| **CI** | GitHub/Gitea Actions — build, test, security scan |
| **CD** | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| **Secrets** | External Secrets Operator + GCP Secret Manager |
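The secrets row above can be sketched as an External Secrets Operator resource that syncs a GCP Secret Manager entry into a Kubernetes Secret; the store name and secret key are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mongodb-credentials
spec:
  refreshInterval: 1h              # re-sync from Secret Manager hourly
  secretStoreRef:
    name: gcp-secret-manager       # assumed ClusterSecretStore configured for GCP
    kind: ClusterSecretStore
  target:
    name: mongodb-credentials      # Kubernetes Secret created by the operator
  data:
    - secretKey: MONGODB_URI       # key exposed to pods via envFrom/secretKeyRef
      remoteRef:
        key: prod-mongodb-uri      # assumed secret name in GCP Secret Manager
```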
---
## 5. Database: MongoDB
### 5.1 Service Choice
**MongoDB Atlas** (or **Firestore** if strict GCP-only) recommended for:
- Fully managed, automated backups
- Multi-region replication
- Strong security (encryption at rest, VPC peering)
- Easy scaling
**Atlas on GCP** provides native VPC peering and private connectivity.
### 5.2 High Availability and DR
| Topic | Strategy |
|-------|----------|
| **Replicas** | 3-node replica set; multi-AZ |
| **Backups** | Continuous backup; point-in-time recovery |
| **Disaster recovery** | Cross-region replica (e.g. `us-central1` + `europe-west1`) |
| **Restore testing** | Quarterly DR drills |
### 5.3 Security
- Private endpoint (no public IP)
- TLS for all connections
- IAM-based access; principle of least privilege
- Encryption at rest (default in Atlas)
---
## 6. Cost Optimisation Strategy
| Lever | Approach | Estimated Savings |
|-------|----------|-------------------|
| **3 projects, not 4** | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| **GKE Autopilot** | Pay per pod, not per node; no idle nodes | 30-60% vs standard GKE |
| **Blue-green in-cluster** | No duplicate environments for releases | Near-zero deployment cost |
| **Spot/preemptible pods** | Use for staging and non-critical workloads | Up to 60-80% off compute |
| **Committed use discounts** | 1-year CUDs once baseline is established | 20-30% off sustained use |
| **CDN for frontend** | Offload SPA traffic from GKE | Fewer pod replicas needed |
| **MongoDB Atlas auto-scale** | Start M10; scale up only when needed | Avoid over-provisioning |
| **Cloud NAT shared** | Single NAT in shared project | Avoid per-project NAT cost |
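On GKE Autopilot, the spot-pods lever above is a one-line scheduling change. A sketch for a staging workload, with illustrative names and image path:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-staging
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-staging
  template:
    metadata:
      labels:
        app: api-staging
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # request Spot capacity on Autopilot
      terminationGracePeriodSeconds: 25     # Spot preemption allows at most 25s grace
      containers:
        - name: api
          # Assumed Artifact Registry path in the shared project
          image: europe-docker.pkg.dev/company-inc-shared/images/api:latest
```

Because Spot pods can be preempted at any time, this selector belongs on staging and other interruption-tolerant workloads only, as the table notes.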
**Monthly cost estimate (early stage):**
- GKE Autopilot (2-3 API pods + 1 SPA): ~$80-150
- MongoDB Atlas M10: ~$60
- Load Balancer + Cloud NAT: ~$30
- Artifact Registry + Secret Manager: ~$5
- **Total: ~$175-245/month**
### 6.1 What Would Be Overkill at This Stage
Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.
| Component | Why it's overkill now | When to introduce |
|-----------|----------------------|-------------------|
| **Multi-region GKE** | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| **Service mesh (Istio/Linkerd)** | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| **Cross-region MongoDB replica** | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| **Dedicated observability stack** | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When team has > 2 SREs and needs custom dashboards |
| **4+ GCP projects** | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| **API Gateway (Apigee, Kong)** | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| **Vault for secrets** | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |
**Rule of thumb:** if a component doesn't solve a problem you have *today*, defer it. Every added piece increases the monthly bill and the on-call surface area.
---
## 7. High-Level Architecture Diagram
```mermaid
flowchart TB
Users((Users))
Users --> CDN[Cloud CDN<br/>Static Assets]
Users --> LB[Cloud Load Balancer<br/>HTTPS]
subgraph GKE["GKE Cluster — Private"]
LB --> Ingress[Ingress Controller]
Ingress --> API[Backend — Flask<br/>HPA 2-3 replicas]
Ingress --> SPA[Frontend — React SPA<br/>Nginx]
CDN --> SPA
API --> Redis[Redis<br/>Memorystore]
API --> Obs[Observability<br/>Prometheus / Grafana]
end
subgraph Data["Managed Services"]
Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
Secrets[Secret Manager<br/>App & DB credentials]
Registry[Artifact Registry<br/>Container images]
end
API --> Mongo
API --> Secrets
GKE --> Registry
```
---
## 8. Summary of Recommendations
| Area | Recommendation |
|------|----------------|
| **Cloud** | GCP with 3 projects (prod, staging, shared) |
| **Compute** | GKE Autopilot, private nodes, HPA |
| **Deployments** | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| **Database** | MongoDB Atlas on GCP with multi-AZ, automated backups |
| **CI/CD** | GitHub/Gitea Actions + ArgoCD |
| **Security** | Private VPC, TLS everywhere, Secret Manager, least privilege |
| **Cost** | ~$175-245/month early stage; spot pods, CUDs as traffic grows |
---
*See [architecture-hld.md](architecture-hld.md) for the standalone HLD diagram.*