Architectural Design Document: Company Inc.

Cloud Infrastructure for Web Application Deployment
Version: 1.0
Date: February 2026


1. Executive Summary

This document outlines a robust, scalable, secure, and cost-effective infrastructure design for Company Inc., a startup deploying a web application with a Python/Flask REST API backend, React SPA frontend, and MongoDB database. The design leverages Google Cloud Platform (GCP) with GKE (Google Kubernetes Engine) as the primary compute platform.

Key Design Principles: Cost awareness from day one, security-by-default, scalability when needed, and GitOps-based operations.


2. Cloud Provider and Environment Structure

2.1 Provider Choice: GCP

Rationale: GCP offers strong managed Kubernetes (GKE) with Autopilot options, excellent MongoDB Atlas integration (or Firestore with MongoDB compatibility as a GCP-native alternative), competitive pricing for startups, and simplified networking. GKE Autopilot reduces operational overhead for a small team with limited Kubernetes expertise.

2.2 Project Structure (Cost-Optimised)

For a startup, fewer projects mean lower overhead and simpler billing. Start with 3 projects and add more only when traffic or compliance demands it.

| Project | Purpose | Isolation |
|---|---|---|
| company-inc-prod | Production workloads | High; sensitive data |
| company-inc-staging | Staging, QA, and dev experimentation | Medium |
| company-inc-shared | CI/CD, Artifact Registry, DNS | Low; no PII |

Why not 4+ projects?

  • A dedicated sandbox project adds billing, IAM, and networking overhead with little benefit at startup scale.
  • Developers can use Kubernetes namespaces within the staging cluster for experimentation.
  • A fourth project can be introduced later when team size or compliance (SOC2, HIPAA) requires it.

Benefits:

  • Billing separation (prod costs are clearly visible)
  • Blast-radius containment (prod issues do not affect staging)
  • IAM isolation between environments
  • Minimal fixed cost — only 3 projects to manage

3. Network Design

3.1 VPC Architecture

  • One VPC per project (or Shared VPC from company-inc-shared for centralised control)
  • Regional subnets in at least 2 zones for HA
  • Private subnets for workloads (no public IPs on nodes)
  • Public subnets only for load balancers and NAT gateways

3.2 Security Layers

| Layer | Controls |
|---|---|
| VPC Firewall | Default deny; allow only required CIDRs and ports |
| GKE node pools | Private nodes; no public IPs |
| Pod-to-pod traffic | Kubernetes Network Policies + GKE-native security |
| Ingress | HTTPS only; TLS termination at load balancer |
| Egress | Cloud NAT for outbound; restrict to necessary destinations |
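
The pod-to-pod layer above can be sketched with a default-deny policy plus an explicit allow. This is an illustrative example, not a prescribed manifest: the `backend` namespace, `app: flask-api` labels, ingress-controller namespace, and port 8000 are all assumptions.

```yaml
# Default deny: no pod in the namespace accepts ingress traffic
# unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: backend            # illustrative namespace
spec:
  podSelector: {}               # selects all pods in the namespace
  policyTypes:
    - Ingress
---
# Allow only the ingress controller's namespace to reach the API pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: flask-api            # illustrative label
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8000            # assumed Flask container port
```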

3.3 Network Topology (High-Level)

```mermaid
flowchart TD
    Internet((Internet))
    Internet --> LB[Cloud Load Balancer<br/>HTTPS termination]
    LB --> Ingress[GKE Ingress Controller]

    subgraph VPC["VPC — Private Subnets"]
        Ingress --> API[API Pods<br/>Python / Flask]
        Ingress --> SPA[Frontend Pods<br/>React SPA]
        API --> DB[(MongoDB<br/>Private Endpoint)]
    end
```

4. Compute Platform: GKE

4.1 Cluster Strategy

  • GKE Autopilot for production and staging to minimise node management
  • Single regional cluster per environment initially; consider multi-region as scale demands
  • Private cluster with no public endpoint; access via IAP or a bastion host if needed

4.2 Node Configuration

| Setting | Initial | Growth Phase |
|---|---|---|
| Node type | Autopilot (no manual sizing) | Same |
| Min nodes | 0 (scale to zero when idle) | 2 |
| Max nodes | 5 | 50+ |
| Scaling | Pod-based (HPA, cluster autoscaler) | Same |

4.3 Workload Layout

  • Backend (Python/Flask): Deployment with HPA (CPU/memory); target 2–3 replicas initially
  • Frontend (React): Static assets served via CDN or container; 1–2 replicas
  • Ingress: GKE Ingress for HTTP(S) routing; consider GKE Gateway API for advanced use
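
The backend layout above can be sketched as an HPA bound to the Flask Deployment. The name `flask-api`, replica ceiling, and utilisation thresholds are illustrative assumptions, not values fixed by this document:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-api               # assumed Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-api
  minReplicas: 2                # matches the 2-3 replica starting target
  maxReplicas: 10               # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # illustrative threshold
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```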

4.4 Blue-Green Deployment

Zero-downtime releases without duplicating infrastructure. Both versions run inside the same GKE cluster; the load balancer switches traffic atomically.

```mermaid
flowchart LR
    LB[Load Balancer]
    LB -->|100% traffic| Green[Green — v1.2.0<br/>current stable]
    LB -.->|0% traffic| Blue[Blue — v1.3.0<br/>new release]
    Blue -.->|smoke tests pass| LB
```

| Phase | Action |
|---|---|
| Deploy | New version deployed to the idle slot (blue) |
| Test | Run smoke tests / synthetic checks against blue |
| Switch | Update Service selector or Ingress to point to blue |
| Rollback | Instant — revert selector back to green (old version still running) |
| Cleanup | Scale down old slot after confirmation period |

Cost impact: Near-zero — both slots share the same node pool; the idle slot consumes minimal resources until traffic is switched. Argo Rollouts automates the full lifecycle within ArgoCD.
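
A minimal Argo Rollouts sketch of the blue-green lifecycle above; the Rollout name, Service names, image reference, and delay are assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: flask-api                      # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-api
  template:
    metadata:
      labels:
        app: flask-api
    spec:
      containers:
        - name: api
          image: example-registry/flask-api:abc1234   # git-SHA tag
  strategy:
    blueGreen:
      activeService: flask-api-active    # receives live traffic (green)
      previewService: flask-api-preview  # target for smoke tests (blue)
      autoPromotionEnabled: false        # switch only after tests pass
      scaleDownDelaySeconds: 300         # keep old slot for instant rollback
```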

4.5 Containerisation Strategy

Image Building Process

Each service (Flask backend, React frontend) has its own multi-stage Dockerfile:

  1. Build stage — installs dependencies and compiles artefacts in a full SDK image (e.g. python:3.12, node:20).
  2. Runtime stage — copies only the built artefacts into a minimal base image (e.g. python:3.12-slim, nginx:alpine). This cuts image size by 60–80% and removes build tools from the attack surface.
  3. Non-root user — the runtime stage runs as a dedicated unprivileged user (appuser), never as root.
  4. Reproducible builds — dependency lock files (requirements.txt / package-lock.json) are copied and installed before application code to maximise Docker layer caching.
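
The four steps above can be sketched for the Flask backend as follows. This is an illustrative Dockerfile, not the project's actual one: the paths, port, `appuser` name, and the gunicorn entrypoint `app:app` (assumed to be listed in requirements.txt) are assumptions.

```dockerfile
# --- Build stage: full SDK image with build tools ---
FROM python:3.12 AS build
WORKDIR /app
# Copy the lock file first so dependency layers are cached
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
COPY . .

# --- Runtime stage: slim image, no build tools ---
FROM python:3.12-slim
WORKDIR /app
COPY --from=build /install /usr/local
COPY --from=build /app .
# Run as a dedicated unprivileged user, never root
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```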

Tagging convention: images are tagged with the git SHA for traceability and a latest alias for convenience. Semantic version tags (e.g. v1.3.0) are added on release.

Container Registry Management

All container images are stored in GCP Artifact Registry in the company-inc-shared project:

  • Single source of truth — one registry serves both staging and production via cross-project IAM pull permissions.
  • Vulnerability scanning — Artifact Registry's built-in scanning is enabled; CI fails if critical CVEs are detected.
  • Image retention policy — keep the latest 10 tagged images per service; automatically garbage-collect untagged manifests older than 30 days.
  • Access control — CI service account has roles/artifactregistry.writer; GKE node service accounts have roles/artifactregistry.reader. No human push access.

For self-hosted Git platforms (e.g. Gitea), the built-in OCI container registry can serve the same role at zero additional cost, with Trivy added as a CI step for vulnerability scanning.

Deployment Pipelines (CI/CD Integration)

The pipeline follows a GitOps model with clear separation between CI and CD:

| Phase | Tool | What happens |
|---|---|---|
| Lint & Test | Gitea / GitHub Actions | Unit tests, linting, Helm lint on every push |
| Build & Push | Gitea / GitHub Actions | docker build → tag with git SHA → push to registry |
| Security Scan | Trivy (in CI) | Scan image for OS and library CVEs; block on critical findings |
| Manifest Update | CI job | Update image tag in the GitOps manifests repo (or Helm values) |
| Sync & Deploy | ArgoCD | Detects manifest drift → triggers blue-green rollout via Argo Rollouts |
| Promotion | Argo Rollouts | Automated analysis (metrics, health checks) → promote or rollback |

Key properties:

  • CI never touches the cluster directly — it only builds images and updates manifests. ArgoCD is the sole deployer.
  • Rollback is instant — revert the manifest repo to the previous commit; ArgoCD syncs automatically.
  • Audit trail — every deployment maps to a git commit in the manifests repo.
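
Under these constraints, the CI half (build, scan, push, manifest bump) might look like the following Gitea/GitHub Actions sketch. The registry URL, secret name, and helper script are placeholders, and the job assumes Trivy is available on the runner:

```yaml
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image tagged with the git SHA
        run: docker build -t registry.example.com/company-inc/flask-api:${GITHUB_SHA} .
      - name: Scan for critical CVEs before pushing
        run: trivy image --exit-code 1 --severity CRITICAL registry.example.com/company-inc/flask-api:${GITHUB_SHA}
      - name: Push to the registry
        run: |
          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login registry.example.com -u ci --password-stdin
          docker push registry.example.com/company-inc/flask-api:${GITHUB_SHA}
      # CI stops here: it only updates the manifests repo; ArgoCD deploys.
      - name: Bump image tag in the GitOps manifests repo
        run: ./scripts/update-manifest.sh ${GITHUB_SHA}   # illustrative helper script
```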

4.6 CI/CD Summary

| Aspect | Approach |
|---|---|
| Image build | Multi-stage Dockerfile; layer caching; non-root; git-SHA tags |
| Registry | Artifact Registry in company-inc-shared (or Gitea built-in OCI registry) |
| CI | Gitea / GitHub Actions — lint, test, build, scan, push |
| CD | ArgoCD + Argo Rollouts — GitOps with blue-green strategy |
| Secrets | External Secrets Operator + GCP Secret Manager |
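
The secrets row above (External Secrets Operator backed by GCP Secret Manager) can be sketched as an ExternalSecret; the store, secret, and key names are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mongodb-credentials        # illustrative name
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager       # a ClusterSecretStore configured for GCP
    kind: ClusterSecretStore
  target:
    name: mongodb-credentials      # Kubernetes Secret created by the operator
  data:
    - secretKey: MONGODB_URI
      remoteRef:
        key: mongodb-uri           # secret name in GCP Secret Manager
```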

5. Database: MongoDB

5.1 Service Choice

MongoDB Atlas (or Firestore with MongoDB compatibility if strict GCP-only) is recommended for:

  • Fully managed, automated backups
  • Multi-region replication
  • Strong security (encryption at rest, VPC peering)
  • Easy scaling

Atlas on GCP provides native VPC peering and private connectivity.

5.2 High Availability and DR

| Topic | Strategy |
|---|---|
| Replicas | 3-node replica set; multi-AZ |
| Backups | Continuous backup; point-in-time recovery |
| Disaster recovery | Cross-region replica (e.g. us-central1 + europe-west1) |
| Restore testing | Quarterly DR drills |

5.3 Security

  • Private endpoint (no public IP)
  • TLS for all connections
  • IAM-based access; principle of least privilege
  • Encryption at rest (default in Atlas)

6. Cost Optimisation Strategy

| Lever | Approach | Estimated Savings |
|---|---|---|
| 3 projects, not 4 | Drop sandbox; use staging namespaces | ~25% fewer fixed project costs |
| GKE Autopilot | Pay per pod, not per node; no idle nodes | 30–60% vs standard GKE |
| Blue-green in-cluster | No duplicate environments for releases | Near-zero deployment cost |
| Spot/preemptible pods | Use for staging and non-critical workloads | Up to 60–80% off compute |
| Committed use discounts | 1-year CUDs once baseline is established | 20–30% off sustained use |
| CDN for frontend | Offload SPA traffic from GKE | Fewer pod replicas needed |
| MongoDB Atlas auto-scale | Start M10; scale up only when needed | Avoid over-provisioning |
| Cloud NAT shared | Single NAT in shared project | Avoid per-project NAT cost |
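
The spot-pod lever can be applied per workload in GKE: a staging Deployment opts in with the standard `cloud.google.com/gke-spot` node selector and toleration. The Deployment itself is illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-api-staging          # illustrative staging workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flask-api-staging
  template:
    metadata:
      labels:
        app: flask-api-staging
    spec:
      # Schedule onto GKE Spot capacity (cheaper, but preemptible)
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: api
          image: example-registry/flask-api:latest
```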

Monthly cost estimate (early stage):

  • GKE Autopilot (2–3 API pods + 1 SPA): ~$80–150
  • MongoDB Atlas M10: ~$60
  • Load Balancer + Cloud NAT: ~$30
  • Artifact Registry + Secret Manager: ~$5
  • Total: ~$175–245/month

6.1 What Would Be Overkill at This Stage

Not everything in a "best practices" architecture is worth implementing on day one. The following are valuable at scale but add cost and complexity that a startup with a few hundred users/day does not need yet.

| Component | Why it's overkill now | When to introduce |
|---|---|---|
| Multi-region GKE | Single region handles millions of req/day; multi-region doubles cost | When SLA requires 99.99% or users span continents |
| Service mesh (Istio/Linkerd) | Adds sidecar overhead, complexity, and debugging difficulty | When you have 10+ microservices with mTLS requirements |
| Cross-region MongoDB replica | Atlas M10 with multi-AZ is sufficient; cross-region adds ~2x DB cost | When RPO < 1 hour is a compliance requirement |
| Dedicated observability stack | GKE built-in monitoring + Cloud Logging is free; Prometheus/Grafana adds ops burden | When the team has > 2 SREs and needs custom dashboards |
| 4+ GCP projects | 3 projects cover prod/staging/shared; more adds IAM and billing complexity | When compliance (SOC2, HIPAA) requires strict separation |
| API Gateway (Apigee, Kong) | GKE Ingress handles routing; a gateway adds cost and latency | When you need rate limiting, API keys, or monetisation |
| Vault for secrets | GCP Secret Manager is cheaper, simpler, and natively integrated | When you need dynamic secrets or multi-cloud secret federation |

Rule of thumb: if a component doesn't solve a problem you have today, defer it. Every added piece increases the monthly bill and the on-call surface area.


7. High-Level Architecture Diagram

```mermaid
flowchart TD
    Users((Users))

    Users --> CDN[Cloud CDN<br/>Static Assets]
    Users --> LB[Cloud Load Balancer<br/>HTTPS]

    subgraph GKE["GKE Cluster — Private"]
        LB --> Ingress[Ingress Controller]
        Ingress --> API[Backend — Flask<br/>HPA 2–3 replicas]
        Ingress --> SPA[Frontend — React SPA<br/>Nginx]
        CDN --> SPA
        API --> Redis[Redis<br/>Memorystore]
        API --> Obs[Observability<br/>Prometheus / Grafana]
    end

    subgraph Data["Managed Services"]
        Mongo[(MongoDB Atlas<br/>Replica Set · Private Endpoint)]
        Secrets[Secret Manager<br/>App & DB credentials]
        Registry[Artifact Registry<br/>Container images]
    end

    API --> Mongo
    API --> Secrets
    GKE ----> Registry
```

8. Summary of Recommendations

| Area | Recommendation |
|---|---|
| Cloud | GCP with 3 projects (prod, staging, shared) |
| Compute | GKE Autopilot, private nodes, HPA |
| Deployments | Blue-green via Argo Rollouts — zero downtime, instant rollback |
| Database | MongoDB Atlas on GCP with multi-AZ, automated backups |
| CI/CD | GitHub/Gitea Actions + ArgoCD |
| Security | Private VPC, TLS everywhere, Secret Manager, least privilege |
| Cost | ~$175–245/month early stage; spot pods, CUDs as traffic grows |

See architecture-hld.md for the standalone HLD diagram.