# Architectural Design Document: Company Inc.
**Cloud Infrastructure for Web Application Deployment**
**Version:** 1.0
**Date:** February 2026
---
## 1. Executive Summary
This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages **Google Cloud Platform (GCP)** with **GKE (Google Kubernetes Engine)** as the primary compute platform.
**Key Design Principles:** Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.
---
## 2. Cloud Provider and Environment Structure
### 2.1 Provider Choice: GCP
**Rationale:** GCP offers strong managed Kubernetes (GKE) with an Autopilot option, first-class MongoDB Atlas integration (or GCP-native document-database alternatives such as Firestore), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.
### 2.2 Project Structure (Cost-Optimised)
For a startup, fewer projects mean lower overhead and simpler billing. Start with **3 projects** and add more only when traffic or compliance demands it.
| Project | Purpose | Isolation |
|---------|---------|-----------|
| **company-inc-prod** | Production workloads | High; sensitive data |
| **company-inc-staging** | Staging, QA, and dev experimentation | Medium |
| **company-inc-shared** | CI/CD, Artifact Registry, DNS | Low; no PII |
**Why not 4+ projects?**
- A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
- Developers can use Kubernetes namespaces within the staging cluster for experimentation.
- A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.
**Benefits:**
- Billing separation (prod costs are clearly visible)
- Blast-radius containment (prod issues do not affect staging)
- IAM isolation between environments
- Minimal fixed cost — only 3 projects to manage
---
## 3. Network Design
### 3.1 VPC Architecture
- **One VPC per project** (or Shared VPC from `company-inc-shared` for centralised control)
- **Regional subnets** spanning multiple zones for HA
- **Private subnets** for workloads (no public IPs on nodes)
- **Public subnets** only for load balancers and NAT gateways
### 3.2 Security Layers
| Layer | Controls |
|-------|----------|
| **VPC Firewall** | Default deny; allow only required CIDRs and ports |
| **GKE node pools** | Private nodes; no public IPs |
| **Workload policies** | Kubernetes Network Policies + GKE-native security |
| **Ingress** | HTTPS only; TLS termination at load balancer |
| **Egress** | Cloud NAT for outbound; restrict to necessary destinations |
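The deny-by-default posture at the Kubernetes layer can be sketched with two Network Policies: deny all ingress in the application namespace, then allow only the ingress controller to reach the API pods. Namespace names, labels, and the port are illustrative assumptions, not prescribed values.

```yaml
# Default-deny all ingress in the app namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: app
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes: ["Ingress"]
---
# Explicitly allow only the ingress controller namespace to reach the API
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 5000
```

Everything else in the namespace stays unreachable from outside, which complements (rather than replaces) the VPC firewall rules above.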
### 3.3 Network Topology (High-Level)
```mermaid
flowchart TD
Internet((Internet))
Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
LB --> Ingress[GKE Ingress Controller]
subgraph VPC["VPC — Private Subnets"]
Ingress --> API[API Pods<br/>Python / Flask]
Ingress --> SPA[Frontend Pods<br/>React SPA]
API --> DB[(MongoDB<br/>Private Endpoint)]
end
```
---
## 4. Compute Platform: GKE
### 4.1 Cluster Strategy
- **GKE Autopilot** for production and staging to minimise node management
- **Single regional cluster** per environment initially; consider multi-region as scale demands
- **Private cluster** with no public endpoint; access via IAP or Bastion if needed
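The cluster strategy above roughly translates to a single `gcloud` invocation; project, region, and CIDR values here are placeholders, not final choices.

```shell
# Regional Autopilot cluster with private nodes; authorised networks restrict
# control-plane access to internal ranges (e.g. via IAP or a bastion).
gcloud container clusters create-auto prod-cluster \
  --project=company-inc-prod \
  --region=us-central1 \
  --enable-private-nodes \
  --enable-master-authorized-networks \
  --master-authorized-networks=10.0.0.0/8
```
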
### 4.2 Node Configuration
| Setting | Initial | Growth Phase |
|---------|---------|--------------|
| **Node type** | Autopilot (no manual sizing) | Same |
| **Min nodes** | 0 (scale to zero when idle) | 2 |
| **Max nodes** | 5 | 50+ |
| **Scaling** | Pod-based (HPA, cluster autoscaler) | Same |
### 4.3 Workload Layout
- **Backend (Python/Flask):** Deployment with HPA (CPU/memory); target 2–3 replicas initially
- **Frontend (React):** Static assets served via CDN or container; 1–2 replicas
- **Ingress:** GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
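A minimal HPA sketch for the Flask backend, matching the layout above; the CPU threshold and replica bounds are illustrative starting points to tune against real traffic.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```
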
### 4.4 Blue-Green Deployment
Zero-downtime releases without duplicating infrastructure. Both versions run inside the **same GKE cluster**; the load balancer switches traffic atomically.
```mermaid
flowchart LR
LB[Load Balancer]
LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
Blue -.->|smoke tests pass| LB
```
| Phase | Action |
|-------|--------|
| **Deploy** | New version deployed to the idle slot (blue) |
| **Test** | Run smoke tests / synthetic checks against blue |
| **Switch** | Update Service selector or Ingress to point to blue |
| **Rollback** | Instant — revert selector back to green (old version still running) |
| **Cleanup** | Scale down old slot after confirmation period |
**Cost impact:** Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
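With Argo Rollouts, the phase table above maps onto a single `Rollout` resource; service names, image path, and the scale-down delay below are illustrative assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    blueGreen:
      activeService: api-active    # Service receiving 100% of live traffic
      previewService: api-preview  # Service used for smoke tests on the new slot
      autoPromotionEnabled: false  # gate the switch on tests / manual approval
      scaleDownDelaySeconds: 600   # keep the old slot alive for instant rollback
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: europe-docker.pkg.dev/company-inc-shared/images/api:GIT_SHA
```

Promotion then swaps the pod selector on `api-active`; rollback is the same swap in reverse while the old ReplicaSet is still running.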
### 4.5 Containerisation Strategy
#### Image Building Process
Each service (Flask backend, React frontend) has its own **multi-stage Dockerfile**:
1. **Build stage** — installs dependencies and compiles artefacts in a full SDK image (e.g. `python:3.12`, `node:20`).
2. **Runtime stage** — copies only the built artefacts into a minimal base image (e.g. `python:3.12-slim`, `nginx:alpine`). This cuts image size by 60–80% and removes build tools from the attack surface.
3. **Non-root user** — the runtime stage runs as a dedicated unprivileged user (`appuser`), never as root.
4. **Reproducible builds** — dependency lock files (`requirements.txt` / `package-lock.json`) are copied and installed before application code to maximise Docker layer caching.
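The four points above can be sketched for the Flask backend as follows; paths, the `gunicorn` entrypoint, and the module name `app:app` are assumptions for illustration.

```dockerfile
# --- Build stage: full SDK image with build tools ---
FROM python:3.12 AS build
WORKDIR /app
# Copy the lock file first so dependency layers cache across code changes
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
COPY . .

# --- Runtime stage: minimal image, no build tools ---
FROM python:3.12-slim
WORKDIR /app
COPY --from=build /install /usr/local
COPY --from=build /app .
# Run as a dedicated unprivileged user, never root
RUN useradd --no-create-home appuser
USER appuser
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
```
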
**Tagging convention:** images are tagged with the **git SHA** for traceability and a `latest` alias for convenience. Semantic version tags (e.g. `v1.3.0`) are added on release.
#### Container Registry Management
All container images are stored in **GCP Artifact Registry** in the `company-inc-shared` project:
- **Single source of truth** — one registry serves both staging and production via cross-project IAM pull permissions.
- **Vulnerability scanning** — Artifact Registry's built-in scanning is enabled; CI fails if critical CVEs are detected.
- **Image retention policy** — keep the latest 10 tagged images per service; automatically garbage-collect untagged manifests older than 30 days.
- **Access control** — CI service account has `roles/artifactregistry.writer`; GKE node service accounts have `roles/artifactregistry.reader`. No human push access.
*For self-hosted Git platforms (e.g. Gitea), the built-in OCI container registry can serve the same role at zero additional cost. In the practical part of this project, this is demonstrated: the CI pipeline mirrors the upstream FleetDM image to the Gitea OCI registry using `crane` (a daemonless image tool), then scans it with **Trivy** for HIGH/CRITICAL CVEs before publishing the release.*
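The mirror-and-scan step described above comes down to two commands; the registry host, repository path, and tag are placeholders.

```shell
# Mirror the upstream image registry-to-registry, no Docker daemon required
crane copy docker.io/fleetdm/fleet:TAG \
  gitea.example.com/company-inc/fleet:TAG

# Warn-only scan: report HIGH/CRITICAL CVEs without failing the pipeline
trivy image --severity HIGH,CRITICAL --exit-code 0 \
  gitea.example.com/company-inc/fleet:TAG
```

Switching `--exit-code` to `1` turns the same scan into a hard gate once the baseline of findings is under control.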
#### Deployment Pipelines (CI/CD Integration)
The pipeline follows a **GitOps** model with clear separation between CI and CD:
| Phase | Tool | What happens |
|-------|------|-------------|
| **Lint & Test** | Gitea / GitHub Actions | Unit tests, linting, Helm lint on every push |
| **Build & Push** | Gitea / GitHub Actions | `docker build` → tag with git SHA → push to registry |
| **Security Scan** | Trivy (in CI) | Scan image for OS and library CVEs; block on critical findings |
| **Manifest Update** | CI job | Update image tag in the GitOps manifests repo (or Helm values) |
| **Sync & Deploy** | ArgoCD | Detects manifest drift → triggers blue-green rollout via Argo Rollouts |
| **Promotion** | Argo Rollouts | Automated analysis (metrics, health checks) → promote or rollback |
**Key properties:**
- **CI never touches the cluster directly** — it only builds images and updates manifests. ArgoCD is the sole deployer.
- **Rollback is instant** — revert the manifest repo to the previous commit; ArgoCD syncs automatically.
- **Audit trail** — every deployment maps to a git commit in the manifests repo.
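The CI half of the table can be sketched as a single workflow (Gitea/GitHub Actions syntax); the registry variable and the `bump-image-tag.sh` helper are hypothetical stand-ins for the manifest-update mechanism.

```yaml
name: build-and-publish
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and tag with git SHA
        run: docker build -t $REGISTRY/api:${GITHUB_SHA::7} .
      - name: Scan for critical CVEs
        run: trivy image --severity HIGH,CRITICAL --exit-code 1 $REGISTRY/api:${GITHUB_SHA::7}
      - name: Push
        run: docker push $REGISTRY/api:${GITHUB_SHA::7}
      - name: Update GitOps manifests
        run: |
          # Bump the image tag in the manifests repo; ArgoCD detects the commit
          # and drives the blue-green rollout — CI never touches the cluster.
          ./scripts/bump-image-tag.sh api ${GITHUB_SHA::7}
```
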
### 4.6 CI/CD Summary
| Aspect | Approach |
|-------|----------|
| **Image build** | Multi-stage Dockerfile; layer caching; non-root; git-SHA tags |
| **Registry** | Artifact Registry in `company-inc-shared` (or Gitea built-in OCI registry) |
| **CI** | Gitea / GitHub Actions — lint, test, build, scan, push |
| **CD** | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| **Secrets** | External Secrets Operator + GCP Secret Manager |
---
## 5. Database: MongoDB
### 5.1 Service Choice
**MongoDB Atlas** (or **Firestore with MongoDB compatibility** if strict GCP-only) is recommended for:
- Fully managed, automated backups
- Multi-region replication
- Strong security (encryption at rest, VPC peering)
- Easy scaling
**Atlas on GCP** provides native VPC peering and private connectivity.
### 5.2 High Availability and DR
| Topic | Strategy |
|-------|----------|
| **Replicas** | 3-node replica set; multi-AZ |
| **Backups** | Continuous backup; point-in-time recovery |
| **Disaster recovery** | Cross-region replica (e.g. `us-central1` + `europe-west1`) |
| **Restore testing** | Quarterly DR drills |
### 5.3 Security
- Private endpoint (no public IP)
- TLS for all connections
- IAM-based access; principle of least privilege
- Encryption at rest (default in Atlas)
---
## 6. Cost Optimisation Strategy
| Lever | Approach | Estimated Savings |
|-------|----------|-------------------|
| **3 projects, not 4** | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| **GKE Autopilot** | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
| **Blue-green in-cluster** | No duplicate environments for releases | Near-zero deployment cost |
| **Spot/preemptible pods** | Use for staging and non-critical workloads | Up to 60–80% off compute |
| **Committed use discounts** | 1-year CUDs once baseline is established | 20–30% off sustained use |
| **CDN for frontend** | Offload SPA traffic from GKE | Fewer pod replicas needed |
| **MongoDB Atlas auto-scale** | Start M10; scale up only when needed | Avoid over-provisioning |
| **Cloud NAT shared** | Single NAT in shared project | Avoid per-project NAT cost |
**Monthly cost estimate (early stage):**
- GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
- MongoDB Atlas M10: ~$60
- Load Balancer + Cloud NAT: ~$30
- Artifact Registry + Secret Manager: ~$5
- **Total: ~$175–245/month**
### 6.1 What Would Be Overkill at This Stage
Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.
| Component | Why it's overkill now | When to introduce |
|-----------|----------------------|-------------------|
| **Multi-region GKE** | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| **Service mesh (Istio/Linkerd)** | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| **Cross-region MongoDB replica** | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| **Dedicated observability stack** | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When team has > 2 SREs and needs custom dashboards |
| **4+ GCP projects** | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| **API Gateway (Apigee, Kong)** | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| **Vault for secrets** | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |
**Rule of thumb:** if a component doesn't solve a problem you have *today*, defer it. Every added piece increases the monthly bill and the on-call surface area.
---
## 7. High-Level Architecture Diagram
```mermaid
flowchart TD
Users((Users))
Users --> CDN[Cloud CDN<br/>Static Assets]
Users --> LB[Cloud Load Balancer<br/>HTTPS]
subgraph GKE["GKE Cluster — Private"]
LB --> Ingress[Ingress Controller]
Ingress --> API[Backend — Flask<br/>HPA 2–3 replicas]
Ingress --> SPA[Frontend — React SPA<br/>Nginx]
CDN --> SPA
API --> Redis[Redis<br/>Memorystore]
API --> Obs[Observability<br/>Prometheus / Grafana]
end
subgraph Data["Managed Services"]
Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
Secrets[Secret Manager<br/>App & DB credentials]
Registry[Artifact Registry<br/>Container images]
end
API --> Mongo
API --> Secrets
GKE ----> Registry
```
---
## 8. Summary of Recommendations
| Area | Recommendation |
|------|----------------|
| **Cloud** | GCP with 3 projects (prod, staging, shared) |
| **Compute** | GKE Autopilot, private nodes, HPA |
| **Deployments** | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| **Database** | MongoDB Atlas on GCP with multi-AZ, automated backups |
| **CI/CD** | GitHub/Gitea Actions + ArgoCD |
| **Security** | Private VPC, TLS everywhere, Secret Manager, least privilege |
| **Cost** | ~$175–245/month early stage; spot pods, CUDs as traffic grows |
---
*See [architecture-hld.md](architecture-hld.md) for the standalone HLD diagram.*