Architectural Design Document: Company Inc.

Cloud Infrastructure for Web Application Deployment
Version: 1.0
Date: February 2026


1. Executive Summary

This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages Google Cloud Platform (GCP) with GKE (Google Kubernetes Engine) as the primary compute platform.

Key Design Principles: Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.


2. Cloud Provider and Environment Structure

2.1 Provider Choice: GCP

Rationale: GCP offers strong managed Kubernetes (GKE) with Autopilot options, excellent MongoDB Atlas integration (or Firestore with MongoDB compatibility as a GCP-native alternative), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.

2.2 Project Structure (Cost-Optimised)

For a startup, fewer projects mean lower overhead and simpler billing. Start with 3 projects and add more only when traffic or compliance demands it.

| Project | Purpose | Isolation |
|---|---|---|
| company-inc-prod | Production workloads | High; sensitive data |
| company-inc-staging | Staging, QA, and dev experimentation | Medium |
| company-inc-shared | CI/CD, Artifact Registry, DNS | Low; no PII |

Why not 4+ projects?

  • A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
  • Developers can use Kubernetes namespaces within the staging cluster for experimentation.
  • A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.

Benefits:

  • Billing separation (prod costs are clearly visible)
  • Blast-radius containment (prod issues do not affect staging)
  • IAM isolation between environments
  • Minimal fixed cost — only 3 projects to manage

3. Network Design

3.1 VPC Architecture

  • One VPC per project (or Shared VPC from company-inc-shared for centralised control)
  • Regional subnets in at least 2 zones for HA
  • Private subnets for workloads (no public IPs on nodes)
  • Public subnets only for load balancers and NAT gateways

3.2 Security Layers

| Layer | Controls |
|---|---|
| VPC Firewall | Default deny; allow only required CIDRs and ports |
| GKE node pools | Private nodes; no public IPs |
| Pod-to-pod traffic | Kubernetes Network Policies + GKE-native security |
| Ingress | HTTPS only; TLS termination at load balancer |
| Egress | Cloud NAT for outbound; restrict to necessary destinations |
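
The pod-to-pod layer above can be sketched with a default-deny policy plus an explicit allow. This is an illustrative example, not a prescribed manifest: the `backend` namespace, `app: flask-api` labels, ingress-controller namespace, and port 8000 are all assumptions.

```yaml
# Default deny: no pod in the namespace accepts ingress traffic
# unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: backend            # illustrative namespace
spec:
  podSelector: {}               # selects all pods in the namespace
  policyTypes:
    - Ingress
---
# Allow only the ingress controller's namespace to reach the API pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: flask-api            # illustrative label
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8000            # assumed Flask container port
```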

3.3 Network Topology (High-Level)

```mermaid
flowchart TD
    Internet((Internet))
    Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
    LB --> Ingress[GKE Ingress Controller]

    subgraph VPC["VPC — Private Subnets"]
        Ingress --> API[API Pods<br/>Python / Flask]
        Ingress --> SPA[Frontend Pods<br/>React SPA]
        API --> DB[(MongoDB<br/>Private Endpoint)]
    end
```

4. Compute Platform: GKE

4.1 Cluster Strategy

  • GKE Autopilot for production and staging to minimise node management
  • Single regional cluster per environment initially; consider multi-region as scale demands
  • Private cluster with no public endpoint; access via IAP or a bastion host if needed

4.2 Node Configuration

| Setting | Initial | Growth Phase |
|---|---|---|
| Node type | Autopilot (no manual sizing) | Same |
| Min nodes | 0 (scale to zero when idle) | 2 |
| Max nodes | 5 | 50+ |
| Scaling | Pod-based (HPA, cluster autoscaler) | Same |

4.3 Workload Layout

  • Backend (Python/Flask): Deployment with HPA (CPU/memory); target 2–3 replicas initially
  • Frontend (React): Static assets served via CDN or container; 1–2 replicas
  • Ingress: GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
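
The backend layout above can be sketched as an HPA bound to the Flask Deployment. The name `flask-api`, replica ceiling, and utilisation thresholds are illustrative assumptions, not values fixed by this document:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-api               # assumed Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-api
  minReplicas: 2                # matches the 2-3 replica starting target
  maxReplicas: 10               # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # illustrative threshold
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```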

4.4 Blue-Green Deployment

Zero-downtime releases without duplicating infrastructure. Both versions run inside the same GKE cluster; the load balancer switches traffic atomically.

```mermaid
flowchart LR
    LB[Load Balancer]
    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
    Blue -.->|smoke tests pass| LB
```

| Phase | Action |
|---|---|
| Deploy | New version deployed to the idle slot (blue) |
| Test | Run smoke tests / synthetic checks against blue |
| Switch | Update Service selector or Ingress to point to blue |
| Rollback | Instant — revert selector back to green (old version still running) |
| Cleanup | Scale down old slot after confirmation period |

Cost impact: Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
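
A minimal Argo Rollouts sketch of the blue-green lifecycle above; the Rollout name, Service names, image reference, and delay are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: flask-api                      # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
        - name: api
          image: example-registry/flask-api:abc1234   # git-SHA tag
  strategy:
    blueGreen:
      activeService: flask-api-active    # receives live traffic (green)
      previewService: flask-api-preview  # target for smoke tests (blue)
      autoPromotionEnabled: false        # switch only after tests pass
      scaleDownDelaySeconds: 300         # keep old slot for instant rollback
```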

4.5 Containerisation Strategy

Image Building Process

Each service (Flask backend, React frontend) has its own multi-stage Dockerfile:

  1. Build stage — installs dependencies and compiles artefacts in a full SDK image (e.g. python:3.12, node:20).
  2. Runtime stage — copies only the built artefacts into a minimal base image (e.g. python:3.12-slim, nginx:alpine). This cuts image size by 60–80% and removes build tools from the attack surface.
  3. Non-root user — the runtime stage runs as a dedicated unprivileged user (appuser), never as root.
  4. Reproducible builds — dependency lock files (requirements.txt / package-lock.json) are copied and installed before application code to maximise Docker layer caching.
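
The four steps above can be sketched for the Flask backend as follows. This is an illustrative Dockerfile, not the project's actual one: the paths, port, `appuser` name, and the gunicorn entrypoint `app:app` (assumed to be listed in requirements.txt) are assumptions.

```dockerfile
# --- Build stage: full SDK image with build tools ---
FROM python:3.12 AS build
WORKDIR /app
# Copy the lock file first so dependency layers are cached
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
COPY . .

# --- Runtime stage: slim image, no build tools ---
FROM python:3.12-slim
WORKDIR /app
COPY --from=build /install /usr/local
COPY --from=build /app .
# Run as a dedicated unprivileged user, never root
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```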

Tagging convention: images are tagged with the git SHA for traceability and a latest alias for convenience. Semantic version tags (e.g. v1.3.0) are added on release.

Container Registry Management

All container images are stored in GCP Artifact Registry in the company-inc-shared project:

  • Single source of truth — one registry serves both staging and production via cross-project IAM pull permissions.
  • Vulnerability scanning — Artifact Registry's built-in scanning is enabled; CI fails if critical CVEs are detected.
  • Image retention policy — keep the latest 10 tagged images per service; automatically garbage-collect untagged manifests older than 30 days.
  • Access control — CI service account has roles/artifactregistry.writer; GKE node service accounts have roles/artifactregistry.reader. No human push access.

For self-hosted Git platforms (e.g. Gitea), the built-in OCI container registry can serve the same role at zero additional cost, with Trivy added as a CI step for vulnerability scanning.

Deployment Pipelines (CI/CD Integration)

The pipeline follows a GitOps model with clear separation between CI and CD:

| Phase | Tool | What happens |
|---|---|---|
| Lint & Test | Gitea / GitHub Actions | Unit tests, linting, Helm lint on every push |
| Build & Push | Gitea / GitHub Actions | docker build → tag with git SHA → push to registry |
| Security Scan | Trivy (in CI) | Scan image for OS and library CVEs; block on critical findings |
| Manifest Update | CI job | Update image tag in the GitOps manifests repo (or Helm values) |
| Sync & Deploy | ArgoCD | Detects manifest drift → triggers blue-green rollout via Argo Rollouts |
| Promotion | Argo Rollouts | Automated analysis (metrics, health checks) → promote or rollback |

Key properties:

  • CI never touches the cluster directly — it only builds images and updates manifests. ArgoCD is the sole deployer.
  • Rollback is instant — revert the manifest repo to the previous commit; ArgoCD syncs automatically.
  • Audit trail — every deployment maps to a git commit in the manifests repo.
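
Under these constraints, the CI half (build, scan, push, manifest bump) might look like the following Gitea/GitHub Actions sketch. The registry URL, secret name, and helper script are placeholders, and the job assumes Trivy is available on the runner:

```yaml
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image tagged with the git SHA
        run: docker build -t registry.example.com/company-inc/flask-api:${GITHUB_SHA} .
      - name: Scan for critical CVEs before pushing
        run: trivy image --exit-code 1 --severity CRITICAL registry.example.com/company-inc/flask-api:${GITHUB_SHA}
      - name: Push to the registry
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login registry.example.com -u ci --password-stdin
          docker push registry.example.com/company-inc/flask-api:${GITHUB_SHA}
      # CI stops here: it only updates the manifests repo; ArgoCD deploys.
      - name: Bump image tag in the GitOps manifests repo
        run: ./scripts/update-manifest.sh ${GITHUB_SHA}   # illustrative helper script
```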

4.6 CI/CD Summary

| Aspect | Approach |
|---|---|
| Image build | Multi-stage Dockerfile; layer caching; non-root; git-SHA tags |
| Registry | Artifact Registry in company-inc-shared (or Gitea built-in OCI registry) |
| CI | Gitea / GitHub Actions — lint, test, build, scan, push |
| CD | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| Secrets | External Secrets Operator + GCP Secret Manager |
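
The secrets row above (External Secrets Operator backed by GCP Secret Manager) can be sketched as an ExternalSecret; the store, secret, and key names are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mongodb-credentials        # illustrative name
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager       # a ClusterSecretStore configured for GCP
    kind: ClusterSecretStore
  target:
    name: mongodb-credentials      # Kubernetes Secret created by the operator
  data:
    - secretKey: MONGODB_URI
      remoteRef:
        key: mongodb-uri           # secret name in GCP Secret Manager
```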

5. Database: MongoDB

5.1 Service Choice

MongoDB Atlas (or Firestore with MongoDB compatibility if strict GCP-only) is recommended for:

  • Fully managed, automated backups
  • Multi-region replication
  • Strong security (encryption at rest, VPC peering)
  • Easy scaling

Atlas on GCP provides native VPC peering and private connectivity.

5.2 High Availability and DR

| Topic | Strategy |
|---|---|
| Replicas | 3-node replica set; multi-AZ |
| Backups | Continuous backup; point-in-time recovery |
| Disaster recovery | Cross-region replica (e.g. us-central1 + europe-west1) |
| Restore testing | Quarterly DR drills |

5.3 Security

  • Private endpoint (no public IP)
  • TLS for all connections
  • IAM-based access; principle of least privilege
  • Encryption at rest (default in Atlas)

6. Cost Optimisation Strategy

| Lever | Approach | Estimated Savings |
|---|---|---|
| 3 projects, not 4 | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| GKE Autopilot | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
| Blue-green in-cluster | No duplicate environments for releases | Near-zero deployment cost |
| Spot/preemptible pods | Use for staging and non-critical workloads | Up to 60–80% off compute |
| Committed use discounts | 1-year CUDs once baseline is established | 20–30% off sustained use |
| CDN for frontend | Offload SPA traffic from GKE | Fewer pod replicas needed |
| MongoDB Atlas auto-scale | Start M10; scale up only when needed | Avoid over-provisioning |
| Cloud NAT shared | Single NAT in shared project | Avoid per-project NAT cost |
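
The spot-pod lever can be applied per workload in GKE: a staging Deployment opts in with the standard `cloud.google.com/gke-spot` node selector and toleration. The Deployment itself is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api-staging          # illustrative staging workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api-staging
  template:
    metadata:
      labels:
        app: flask-api-staging
    spec:
      # Schedule onto GKE Spot capacity (cheaper, but preemptible)
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: api
          image: example-registry/flask-api:latest
```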

Monthly cost estimate (early stage):

  • GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
  • MongoDB Atlas M10: ~$60
  • Load Balancer + Cloud NAT: ~$30
  • Artifact Registry + Secret Manager: ~$5
  • Total: ~$175–245/month

6.1 What Would Be Overkill at This Stage

Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.

| Component | Why it's overkill now | When to introduce |
|---|---|---|
| Multi-region GKE | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| Service mesh (Istio/Linkerd) | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| Cross-region MongoDB replica | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| Dedicated observability stack | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When the team has > 2 SREs and needs custom dashboards |
| 4+ GCP projects | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| API Gateway (Apigee, Kong) | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| Vault for secrets | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |

Rule of thumb: if a component doesn't solve a problem you have today, defer it. Every added piece increases the monthly bill and the on-call surface area.


7. High-Level Architecture Diagram

```mermaid
flowchart TD
    Users((Users))

    Users --> CDN[Cloud CDN<br/>Static Assets]
    Users --> LB[Cloud Load Balancer<br/>HTTPS]

    subgraph GKE["GKE Cluster — Private"]
        LB --> Ingress[Ingress Controller]
        Ingress --> API[Backend — Flask<br/>HPA 2–3 replicas]
        Ingress --> SPA[Frontend — React SPA<br/>Nginx]
        CDN --> SPA
        API --> Redis[Redis<br/>Memorystore]
        API --> Obs[Observability<br/>Prometheus / Grafana]
    end

    subgraph Data["Managed Services"]
        Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
        Secrets[Secret Manager<br/>App & DB credentials]
        Registry[Artifact Registry<br/>Container images]
    end

    API --> Mongo
    API --> Secrets
    GKE ----> Registry
```

8. Summary of Recommendations

| Area | Recommendation |
|---|---|
| Cloud | GCP with 3 projects (prod, staging, shared) |
| Compute | GKE Autopilot, private nodes, HPA |
| Deployments | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| Database | MongoDB Atlas on GCP with multi-AZ, automated backups |
| CI/CD | GitHub/Gitea Actions + ArgoCD |
| Security | Private VPC, TLS everywhere, Secret Manager, least privilege |
| Cost | ~$175–245/month early stage; spot pods, CUDs as traffic grows |

See architecture-hld.md for the standalone HLD diagram.