# Architectural Design Document: Company Inc.

**Cloud Infrastructure for Web Application Deployment**
**Version:** 1.0
**Date:** February 2026

---

## 1. Executive Summary

This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages **Google Cloud Platform (GCP)** with **GKE (Google Kubernetes Engine)** as the primary compute platform.

**Key Design Principles:** Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.

---
## 2. Cloud Provider and Environment Structure

### 2.1 Provider Choice: GCP

**Rationale:** GCP offers strong managed Kubernetes (GKE) with Autopilot options, excellent MongoDB Atlas integration (or Firestore as a GCP-native document-database alternative), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.

### 2.2 Project Structure (Cost-Optimised)

For a startup, fewer projects mean lower overhead and simpler billing. Start with **3 projects** and add more only when traffic or compliance demands it.

| Project | Purpose | Isolation |
|---------|---------|-----------|
| **company-inc-prod** | Production workloads | High; sensitive data |
| **company-inc-staging** | Staging, QA, and dev experimentation | Medium |
| **company-inc-shared** | CI/CD, Artifact Registry, DNS | Low; no PII |

**Why not 4+ projects?**

- A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
- Developers can use Kubernetes namespaces within the staging cluster for experimentation.
- A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.

**Benefits:**

- Billing separation (prod costs are clearly visible)
- Blast-radius containment (prod issues do not affect staging)
- IAM isolation between environments
- Minimal fixed cost — only 3 projects to manage

---
## 3. Network Design

### 3.1 VPC Architecture

- **One VPC per project** (or Shared VPC from `company-inc-shared` for centralised control)
- **Regional subnets** in at least 2 zones for HA
- **Private subnets** for workloads (no public IPs on nodes)
- **Public subnets** only for load balancers and NAT gateways
### 3.2 Security Layers

| Layer | Controls |
|-------|----------|
| **VPC Firewall** | Default deny; allow only required CIDRs and ports |
| **GKE node pools** | Private nodes; no public IPs |
| **Pod-to-pod traffic** | Kubernetes Network Policies + GKE-native security |
| **Ingress** | HTTPS only; TLS termination at load balancer |
| **Egress** | Cloud NAT for outbound; restrict to necessary destinations |
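As a concrete sketch of the pod-level controls, a namespace-wide default-deny ingress policy paired with one narrow allow rule might look as follows. The namespace, pod labels, and port below are illustrative placeholders, not settled names:

```yaml
# Default-deny: block all ingress to every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app                  # placeholder namespace
spec:
  podSelector: {}                 # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Explicitly allow the ingress controller to reach the API pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: flask-api              # assumed label on the backend pods
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress   # assumed ingress namespace
      ports:
        - protocol: TCP
          port: 8000              # assumed Flask container port
  policyTypes:
    - Ingress
```

With the default-deny policy in place, every additional traffic path (e.g. metrics scraping) must be allowed explicitly, which keeps the east-west attack surface auditable.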
### 3.3 Network Topology (High-Level)

```mermaid
flowchart TD
    Internet((Internet))
    Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
    LB --> Ingress[GKE Ingress Controller]

    subgraph VPC["VPC — Private Subnets"]
        Ingress --> API[API Pods<br/>Python / Flask]
        Ingress --> SPA[Frontend Pods<br/>React SPA]
        API --> DB[(MongoDB<br/>Private Endpoint)]
    end
```

---
## 4. Compute Platform: GKE

### 4.1 Cluster Strategy

- **GKE Autopilot** for production and staging to minimise node management
- **Single regional cluster** per environment initially; consider multi-region as scale demands
- **Private cluster** with no public endpoint; access via IAP or a bastion host if needed
### 4.2 Node Configuration

| Setting | Initial | Growth Phase |
|---------|---------|--------------|
| **Node type** | Autopilot (no manual sizing) | Same |
| **Min nodes** | 0 (scale to zero when idle) | 2 |
| **Max nodes** | 5 | 50+ |
| **Scaling** | Pod-based (HPA, cluster autoscaler) | Same |
### 4.3 Workload Layout

- **Backend (Python/Flask):** Deployment with HPA (CPU/memory); target 2–3 replicas initially
- **Frontend (React):** Static assets served via CDN or container; 1–2 replicas
- **Ingress:** GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
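The backend HPA described above can be sketched as a minimal manifest. The Deployment name, replica bounds, and utilisation targets are illustrative assumptions, not settled values:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-api                    # assumed Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-api
  minReplicas: 2                     # matches the 2–3 replica baseline
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # scale out when average CPU exceeds 70%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

On Autopilot the cluster autoscaler provisions capacity automatically as the HPA adds pods, so only pod-level bounds need tuning.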
### 4.4 Blue-Green Deployment

Zero-downtime releases without duplicating infrastructure. Both versions run inside the **same GKE cluster**; the load balancer switches traffic atomically.

```mermaid
flowchart LR
    LB[Load Balancer]
    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
    Blue -.->|smoke tests pass| LB
```

| Phase | Action |
|-------|--------|
| **Deploy** | New version deployed to the idle slot (blue) |
| **Test** | Run smoke tests / synthetic checks against blue |
| **Switch** | Update Service selector or Ingress to point to blue |
| **Rollback** | Instant — revert selector back to green (old version still running) |
| **Cleanup** | Scale down old slot after confirmation period |

**Cost impact:** Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
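With Argo Rollouts, the phase table above maps onto a single `Rollout` resource. A minimal sketch, where the service names, container port, and image path are placeholder assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: flask-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
        - name: api
          image: "<registry>/flask-api:<git-sha>"   # tag injected by CI
          ports:
            - containerPort: 8000
  strategy:
    blueGreen:
      activeService: flask-api-active      # Service receiving 100% of live traffic
      previewService: flask-api-preview    # Service used for smoke tests on the idle slot
      autoPromotionEnabled: false          # gate the switch on tests / manual approval
      scaleDownDelaySeconds: 300           # keep the old slot alive for instant rollback
```

Promoting the rollout flips the `activeService` selector to the new ReplicaSet; aborting reverts it, which implements the Switch and Rollback phases without any infrastructure duplication.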
### 4.5 Containerisation Strategy

#### Image Building Process

Each service (Flask backend, React frontend) has its own **multi-stage Dockerfile**:

1. **Build stage** — installs dependencies and compiles artefacts in a full SDK image (e.g. `python:3.12`, `node:20`).
2. **Runtime stage** — copies only the built artefacts into a minimal base image (e.g. `python:3.12-slim`, `nginx:alpine`). This cuts image size by 60–80% and removes build tools from the attack surface.
3. **Non-root user** — the runtime stage runs as a dedicated unprivileged user (`appuser`), never as root.
4. **Reproducible builds** — dependency lock files (`requirements.txt` / `package-lock.json`) are copied and installed before application code to maximise Docker layer caching.

**Tagging convention:** images are tagged with the **git SHA** for traceability and a `latest` alias for convenience. Semantic version tags (e.g. `v1.3.0`) are added on release.
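For the Flask backend, the four points above might combine into a Dockerfile like the following sketch. The gunicorn entry point `app:app` and port 8000 are assumptions, not settled choices, and gunicorn is assumed to be listed in `requirements.txt`:

```dockerfile
# --- Build stage: full SDK image with compilers and build tools ---
FROM python:3.12 AS build
WORKDIR /app
# Copy the lock file first so dependency layers are cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
COPY . .

# --- Runtime stage: minimal base image, non-root user ---
FROM python:3.12-slim
WORKDIR /app
RUN useradd --create-home --uid 10001 appuser
COPY --from=build /install /usr/local
COPY --from=build /app /app
USER appuser
EXPOSE 8000
# Assumed entry point; adjust the module path to the actual Flask app object
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```

The frontend follows the same shape with a `node:20` build stage and an `nginx:alpine` runtime stage serving the compiled bundle.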
#### Container Registry Management

All container images are stored in **GCP Artifact Registry** in the `company-inc-shared` project:

- **Single source of truth** — one registry serves both staging and production via cross-project IAM pull permissions.
- **Vulnerability scanning** — Artifact Registry's built-in scanning is enabled; CI fails if critical CVEs are detected.
- **Image retention policy** — keep the latest 10 tagged images per service; automatically garbage-collect untagged manifests older than 30 days.
- **Access control** — CI service account has `roles/artifactregistry.writer`; GKE node service accounts have `roles/artifactregistry.reader`. No human push access.

*For self-hosted Git platforms (e.g. Gitea), the built-in OCI container registry can serve the same role at zero additional cost, with Trivy added as a CI step for vulnerability scanning.*
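The retention rules above can be expressed as an Artifact Registry cleanup-policy file and applied with `gcloud artifacts repositories set-cleanup-policies`. The sketch below is indicative only; verify field names against the current cleanup-policy schema before use:

```json
[
  {
    "name": "keep-latest-ten",
    "action": {"type": "Keep"},
    "mostRecentVersions": {"keepCount": 10}
  },
  {
    "name": "delete-stale-untagged",
    "action": {"type": "Delete"},
    "condition": {"tagState": "UNTAGGED", "olderThan": "30d"}
  }
]
```

Keep rules take precedence over Delete rules, so the ten most recent versions survive even if they also match a delete condition.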
#### Deployment Pipelines (CI/CD Integration)

The pipeline follows a **GitOps** model with clear separation between CI and CD:

| Phase | Tool | What happens |
|-------|------|--------------|
| **Lint & Test** | Gitea / GitHub Actions | Unit tests, linting, Helm lint on every push |
| **Build & Push** | Gitea / GitHub Actions | `docker build` → tag with git SHA → push to registry |
| **Security Scan** | Trivy (in CI) | Scan image for OS and library CVEs; block on critical findings |
| **Manifest Update** | CI job | Update image tag in the GitOps manifests repo (or Helm values) |
| **Sync & Deploy** | ArgoCD | Detects manifest drift → triggers blue-green rollout via Argo Rollouts |
| **Promotion** | Argo Rollouts | Automated analysis (metrics, health checks) → promote or rollback |

**Key properties:**

- **CI never touches the cluster directly** — it only builds images and updates manifests. ArgoCD is the sole deployer.
- **Rollback is instant** — revert the manifest repo to the previous commit; ArgoCD syncs automatically.
- **Audit trail** — every deployment maps to a git commit in the manifests repo.
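A minimal Build & Push workflow, in a shape compatible with both Gitea Actions and GitHub Actions, might look as follows. The registry host, repository path, build context, and secret/variable names are placeholder assumptions:

```yaml
name: build-and-push
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to the container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ vars.REGISTRY_HOST }}          # Artifact Registry or Gitea OCI host
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}

      - name: Build image tagged with the git SHA
        run: docker build -t ${{ vars.REGISTRY_HOST }}/company-inc/flask-api:${{ github.sha }} ./backend

      - name: Scan for critical CVEs with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ vars.REGISTRY_HOST }}/company-inc/flask-api:${{ github.sha }}
          severity: CRITICAL
          exit-code: "1"                               # fail the job on critical findings

      - name: Push
        run: docker push ${{ vars.REGISTRY_HOST }}/company-inc/flask-api:${{ github.sha }}
```

Note the workflow pushes the image and stops there; a follow-up job updates the manifests repo, and ArgoCD performs the actual deployment.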
### 4.6 CI/CD Summary

| Aspect | Approach |
|--------|----------|
| **Image build** | Multi-stage Dockerfile; layer caching; non-root; git-SHA tags |
| **Registry** | Artifact Registry in `company-inc-shared` (or Gitea built-in OCI registry) |
| **CI** | Gitea / GitHub Actions — lint, test, build, scan, push |
| **CD** | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| **Secrets** | External Secrets Operator + GCP Secret Manager |
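The secrets approach can be illustrated with an `ExternalSecret` sketch that materialises a Secret Manager entry as a Kubernetes Secret; the store name and remote key below are placeholder assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mongodb-credentials
spec:
  refreshInterval: 1h             # re-sync from Secret Manager hourly
  secretStoreRef:
    name: gcp-secret-manager      # assumed ClusterSecretStore backed by GCP
    kind: ClusterSecretStore
  target:
    name: mongodb-credentials     # Kubernetes Secret created in-cluster
  data:
    - secretKey: MONGODB_URI
      remoteRef:
        key: prod-mongodb-uri     # assumed Secret Manager secret name
```

No credential is ever committed to git; pods mount the generated Secret, and rotation in Secret Manager propagates automatically on the next refresh.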

---
## 5. Database: MongoDB

### 5.1 Service Choice

**MongoDB Atlas** (or **Firestore** as the GCP-native document database, at the cost of MongoDB API compatibility) is recommended for:

- Fully managed, automated backups
- Multi-region replication
- Strong security (encryption at rest, VPC peering)
- Easy scaling

**Atlas on GCP** provides native VPC peering and private connectivity.
### 5.2 High Availability and DR

| Topic | Strategy |
|-------|----------|
| **Replicas** | 3-node replica set; multi-AZ |
| **Backups** | Continuous backup; point-in-time recovery |
| **Disaster recovery** | Cross-region replica (e.g. `us-central1` + `europe-west1`) |
| **Restore testing** | Quarterly DR drills |
### 5.3 Security

- Private endpoint (no public IP)
- TLS for all connections
- IAM-based access; principle of least privilege
- Encryption at rest (default in Atlas)

---
## 6. Cost Optimisation Strategy

| Lever | Approach | Estimated Savings |
|-------|----------|-------------------|
| **3 projects, not 4** | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| **GKE Autopilot** | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
| **Blue-green in-cluster** | No duplicate environments for releases | Near-zero deployment cost |
| **Spot/preemptible pods** | Use for staging and non-critical workloads | Up to 60–80% off compute |
| **Committed use discounts** | 1-year CUDs once baseline is established | 20–30% off sustained use |
| **CDN for frontend** | Offload SPA traffic from GKE | Fewer pod replicas needed |
| **MongoDB Atlas auto-scale** | Start M10; scale up only when needed | Avoid over-provisioning |
| **Cloud NAT shared** | Single NAT in shared project | Avoid per-project NAT cost |

**Monthly cost estimate (early stage):**

- GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
- MongoDB Atlas M10: ~$60
- Load Balancer + Cloud NAT: ~$30
- Artifact Registry + Secret Manager: ~$5
- **Total: ~$175–245/month**
### 6.1 What Would Be Overkill at This Stage

Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.

| Component | Why it's overkill now | When to introduce |
|-----------|----------------------|-------------------|
| **Multi-region GKE** | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| **Service mesh (Istio/Linkerd)** | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| **Cross-region MongoDB replica** | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| **Dedicated observability stack** | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When team has > 2 SREs and needs custom dashboards |
| **4+ GCP projects** | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| **API Gateway (Apigee, Kong)** | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| **Vault for secrets** | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |

**Rule of thumb:** if a component doesn't solve a problem you have *today*, defer it. Every added piece increases the monthly bill and the on-call surface area.

---
## 7. High-Level Architecture Diagram

```mermaid
flowchart TD
    Users((Users))

    Users --> CDN[Cloud CDN<br/>Static Assets]
    Users --> LB[Cloud Load Balancer<br/>HTTPS]

    subgraph GKE["GKE Cluster — Private"]
        LB --> Ingress[Ingress Controller]
        Ingress --> API[Backend — Flask<br/>HPA 2–3 replicas]
        Ingress --> SPA[Frontend — React SPA<br/>Nginx]
        CDN --> SPA
        API --> Redis[Redis<br/>Memorystore]
        API --> Obs[Observability<br/>Prometheus / Grafana]
    end

    subgraph Data["Managed Services"]
        Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
        Secrets[Secret Manager<br/>App & DB credentials]
        Registry[Artifact Registry<br/>Container images]
    end

    API --> Mongo
    API --> Secrets
    GKE ----> Registry
```

---
## 8. Summary of Recommendations

| Area | Recommendation |
|------|----------------|
| **Cloud** | GCP with 3 projects (prod, staging, shared) |
| **Compute** | GKE Autopilot, private nodes, HPA |
| **Deployments** | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| **Database** | MongoDB Atlas on GCP with multi-AZ, automated backups |
| **CI/CD** | GitHub/Gitea Actions + ArgoCD |
| **Security** | Private VPC, TLS everywhere, Secret Manager, least privilege |
| **Cost** | ~$175–245/month early stage; spot pods, CUDs as traffic grows |

---

*See [architecture-hld.md](architecture-hld.md) for the standalone HLD diagram.*