I Built a Production Food Delivery Platform on AWS EKS — Here's Everything I Learned

I build AI+DevOps tools that run 100% locally — no API costs, no cloud required. Currently building a series of 12 projects combining Ollama, Python, and core DevOps tools like Docker, Kubernetes, Terraform, and GitHub Actions. YouTube: ThinkWithOps

Why I Built This

Most Kubernetes tutorials stop at kubectl apply -f deployment.yaml. They don't show you how a VPC is laid out, why you need two availability zones, what IAM roles EKS nodes actually need, or how to debug a live failure using Prometheus metrics.

I wanted to build something that forced me to make every decision a senior DevOps engineer would make on a real project. So I built a food delivery platform — four independent microservices, a React frontend, full Terraform infrastructure on AWS, a GitHub Actions pipeline, and a Grafana dashboard — and recorded the whole thing.

This is what I learned.


How It Works

The Application Layer

Four FastAPI microservices, each completely independent with its own SQLite database:

  • user-service (port 8001): Registration, JWT login, user profiles. Seeds 3 users on startup.

  • restaurant-service (port 8002): Restaurant listing + full menus. Seeds 5 restaurants with 10 menu items each — real food names, USD prices.

  • order-service (port 8003): Order placement. Makes a synchronous HTTP call to restaurant-service to validate menu items before placing the order. Has a built-in ORDER_SERVICE_FAILURE_MODE env var for the observability demo.

  • delivery-service (port 8004): Agent assignment and delivery tracking. Seeds 5 delivery agents.

Each service exposes /health (returns {"status":"healthy","service":"<name>","version":"1.0.0"}) and /metrics (auto-generated by prometheus-fastapi-instrumentator).
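
A smoke test can lean on that /health contract. Here is a minimal, stdlib-only sketch of checking the response body; the function name is_healthy is hypothetical and not taken from the repo.

```python
import json


def is_healthy(body: str, expected_service: str) -> bool:
    """Return True if a /health response body reports the expected service as healthy."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return (
        payload.get("status") == "healthy"
        and payload.get("service") == expected_service
    )


# Sample body in the shape described above.
sample = '{"status":"healthy","service":"order-service","version":"1.0.0"}'
print(is_healthy(sample, "order-service"))
```

Checking the service name as well as the status catches the case where the gateway routes a request to the wrong backend.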

An NGINX gateway (port 8080 locally) routes /api/users, /api/restaurants, /api/orders, /api/delivery to the right service and serves the React frontend at /.
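
To make the routing concrete, here is a rough Python approximation of the gateway's prefix routing. It is not the NGINX config and does not reproduce NGINX's exact location-matching rules; the upstream hostnames and the resolve_upstream helper are assumptions, and only the paths and ports come from the description above.

```python
# Upstream hostnames are assumed; the ports match the service list above.
ROUTES = {
    "/api/users": "http://user-service:8001",
    "/api/restaurants": "http://restaurant-service:8002",
    "/api/orders": "http://order-service:8003",
    "/api/delivery": "http://delivery-service:8004",
}
FRONTEND = "frontend"  # placeholder for the static React build served at /


def resolve_upstream(path: str) -> str:
    """Pick the upstream for a request path using simple prefix matching."""
    for prefix, upstream in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    # Anything that is not an API path falls through to the frontend.
    return FRONTEND


print(resolve_upstream("/api/orders/42"))
print(resolve_upstream("/"))
```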

The Infrastructure

Terraform is split into four modules:

modules/vpc: VPC (10.0.0.0/16), 2 public + 2 private subnets across us-east-1a and us-east-1b, Internet Gateway, 1 NAT Gateway (single point of failure — intentional cost trade-off for a demo, documented in comments), route tables.
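
The subnet layout can be sketched with Python's ipaddress module. The /24 blocks and their ordering below are my illustrative assumption; only the VPC CIDR, the two AZs, and the 2 public + 2 private split come from the module description.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AZS = ["us-east-1a", "us-east-1b"]


def plan_subnets(vpc, azs, new_prefix=24):
    """Assign one public and one private subnet per AZ from consecutive blocks."""
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of /24 networks
    plan = []
    for tier in ("public", "private"):
        for az in azs:
            plan.append({"tier": tier, "az": az, "cidr": str(next(blocks))})
    return plan


for subnet in plan_subnets(VPC_CIDR, AZS):
    print(subnet)
```

Spreading each tier across both AZs is what lets the node group keep running if one zone has an outage.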

modules/eks: EKS 1.32 cluster, managed node group with t3.small instances (min=1, desired=2, max=4 in private subnets), cluster IAM role, node IAM role with three AWS-managed policies, launch template to name EC2 instances in the console.

modules/ecr: Five repositories (food-delivery/user-service, food-delivery/frontend, etc.), image scan on push, lifecycle policy keeping last 10 images.
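
For reference, a "keep the last 10 images" rule in ECR's lifecycle policy JSON format looks roughly like the sketch below. The description text and rule priority are assumptions; the actual policy in the Terraform module may differ.

```python
import json

KEEP_LAST = 10

# One rule: expire images once the repository holds more than KEEP_LAST of them.
lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": f"Keep only the last {KEEP_LAST} images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": KEEP_LAST,
            },
            "action": {"type": "expire"},
        }
    ]
}

print(json.dumps(lifecycle_policy, indent=2))
```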

modules/iam: GitHub Actions IAM user with an inline policy scoped to ECR push/pull and EKS describe — nothing else.
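
A policy scoped like that could look something like the sketch below. The exact action list and the "*" resources are my assumptions for illustration, not a copy of the module; the allowed_services helper is hypothetical and only there to check the scope.

```python
import json

# ecr:GetAuthorizationToken only works with Resource "*"; the other statements
# use "*" here for brevity and could be narrowed to specific ARNs.
github_actions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EcrLogin",
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken"],
            "Resource": "*",
        },
        {
            "Sid": "EcrPushPull",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:InitiateLayerUpload",
                "ecr:UploadLayerPart",
                "ecr:CompleteLayerUpload",
                "ecr:PutImage",
            ],
            "Resource": "*",
        },
        {
            "Sid": "EksDescribe",
            "Effect": "Allow",
            "Action": ["eks:DescribeCluster"],
            "Resource": "*",
        },
    ],
}


def allowed_services(policy):
    """List the AWS service prefixes a policy grants any action on."""
    return sorted(
        {action.split(":")[0] for stmt in policy["Statement"] for action in stmt["Action"]}
    )


print(allowed_services(github_actions_policy))
print(json.dumps(github_actions_policy, indent=2))
```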

The CI/CD Pipeline

deploy.yml triggers on push to main. It:

  1. Applies Kubernetes manifests, ingress-nginx, and kube-prometheus-stack

  2. Uses a matrix job for user-service, restaurant-service, order-service, delivery-service, and frontend

  3. Logs into ECR

  4. Builds and tags each image with $GITHUB_SHA and latest

  5. Runs aws eks update-kubeconfig

  6. Does kubectl set image with the SHA tag

  7. Waits for kubectl rollout status
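
The per-service part of those steps boils down to a handful of shell commands. The sketch below only assembles the command strings for one matrix entry; the build context path, the deployment and container names, and the registry value are assumptions, not copied from deploy.yml.

```python
SERVICES = [
    "user-service",
    "restaurant-service",
    "order-service",
    "delivery-service",
    "frontend",
]


def deploy_commands(registry, service, sha):
    """Return the build, push, and rollout commands for one matrix entry."""
    repo = f"{registry}/food-delivery/{service}"
    return [
        # Build once, tag twice: the immutable SHA tag and the moving latest tag.
        f"docker build -t {repo}:{sha} -t {repo}:latest services/{service}",
        f"docker push {repo}:{sha}",
        f"docker push {repo}:latest",
        # Deploy by SHA so every rollout maps to exactly one commit.
        f"kubectl set image deployment/{service} {service}={repo}:{sha}",
        f"kubectl rollout status deployment/{service}",
    ]


# Placeholder account ID and commit SHA, for illustration only.
for cmd in deploy_commands("123456789012.dkr.ecr.us-east-1.amazonaws.com", "order-service", "abc1234"):
    print(cmd)
```

Deploying the SHA tag rather than latest is what makes kubectl set image trigger a rollout, since the image reference changes on every commit.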

pr-checks.yml runs flake8, pytest, terraform fmt -check, and terraform validate on every pull request.

destroy.yml is a manual workflow_dispatch workflow that requires a typed confirmation, a safeguard against an accidental terraform destroy.


The Observability Demo

This is the part that makes the project worth recording.

Set ORDER_SERVICE_FAILURE_MODE=true in Docker Compose and restart order-service. Now 50% of POST /orders requests return HTTP 500. Run scripts/load-test.sh, which fires 300 requests across 10 concurrent workers over 3 minutes.
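
A failure switch like this can be tiny. The sketch below is a stdlib simulation of the idea, not the actual order-service code: the function names and the 201 success status are assumptions, and requests run sequentially rather than through 10 workers.

```python
import os
import random


def failure_mode_enabled(env=os.environ) -> bool:
    """Read the ORDER_SERVICE_FAILURE_MODE flag; anything but "true" means off."""
    return env.get("ORDER_SERVICE_FAILURE_MODE", "false").strip().lower() == "true"


def handle_order(failure_mode: bool, rng: random.Random) -> int:
    """Return the HTTP status a simulated POST /orders call would produce."""
    if failure_mode and rng.random() < 0.5:
        return 500  # injected failure, roughly half of all requests
    return 201


def run_load(total_requests: int, failure_mode: bool, seed: int = 0) -> float:
    """Send simulated requests and return the observed error rate."""
    rng = random.Random(seed)
    statuses = [handle_order(failure_mode, rng) for _ in range(total_requests)]
    return statuses.count(500) / total_requests


enabled = failure_mode_enabled({"ORDER_SERVICE_FAILURE_MODE": "true"})
print(f"error rate: {run_load(300, enabled):.0%}")
```

With the flag off the error rate is exactly zero, which is why the Grafana panel is flat before the demo starts.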

In Grafana, the "Error rate per service" panel spikes immediately from 0% to ~50% for order-service. The failed_orders_total counter climbs. P95 latency creeps up because failed requests still go through the restaurant-service validation call before failing.

Meanwhile, on the cluster, the HPA detects elevated CPU and scales replicas from 2 to 6. More pods, same error rate: the bug is in code, not capacity.
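
The scale-up follows the documented HPA formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the min and max replica counts. A quick sketch, where the 70% CPU target, the 210% observed utilization, and the 2..6 replica bounds are illustrative numbers of mine rather than values from the manifests:

```python
import math


def desired_replicas(current, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Apply the HPA scaling formula, including the default 10% tolerance band."""
    ratio = current_util / target_util
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 2 replicas at 210% of requested CPU against a 70% target.
print(desired_replicas(2, 210, 70, min_replicas=2, max_replicas=6))
```

Nothing in that formula looks at error rate, which is exactly why scaling out does not help here.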

kubectl logs on any order-service pod shows the failure mode immediately. Fix: set ORDER_SERVICE_FAILURE_MODE=false, redeploy. Grafana recovers in under 30 seconds.

That recovery graph — the spike, the plateau, the drop — is the money shot of the video.


What I Learned

1. EKS nodes don't get Name tags by default. The aws_eks_node_group resource tags the node group, not the individual EC2 instances. You need a launch_template with tag_specifications { resource_type = "instance" } to see names in the EC2 console. Lost 20 minutes on this.

2. One NAT Gateway is a trade-off, not a mistake. The prompt called for cost saving. With the single NAT Gateway sitting in us-east-1a, an outage in that AZ leaves the private subnets in us-east-1b without outbound internet access. I documented this in a comment on the resource. Production would use one NAT Gateway per AZ. That trade-off is worth explaining explicitly.

3. The IAM roles for EKS are the biggest footgun. You need up to three kinds of IAM role: a cluster role (for the control plane), a node role (for the EC2 instances in the node group), and optionally an IRSA role per service. Mixing them up silently breaks things. The AmazonEKS_CNI_Policy on the node role is what makes pod networking work: without it the VPC CNI plugin cannot assign IP addresses to pods, so pods end up with no network connectivity.
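
One way to keep the roles straight is to write down who is allowed to assume each one and what the node role must carry. In the sketch below, the two policy names besides AmazonEKS_CNI_Policy are the usual managed node policies and are my assumption about what the module attaches; the helper names are hypothetical.

```python
# Which AWS service assumes each role.
TRUST_PRINCIPALS = {
    "cluster": "eks.amazonaws.com",  # control plane
    "node": "ec2.amazonaws.com",     # worker instances
}

# Managed policies commonly attached to the node role (assumed set).
REQUIRED_NODE_POLICIES = {
    "AmazonEKSWorkerNodePolicy",
    "AmazonEKS_CNI_Policy",
    "AmazonEC2ContainerRegistryReadOnly",
}


def trust_policy(role_kind):
    """Build the assume-role policy document for a cluster or node role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": TRUST_PRINCIPALS[role_kind]},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def missing_node_policies(attached):
    """Return the required node policies that are not attached yet."""
    return sorted(REQUIRED_NODE_POLICIES - set(attached))


print(missing_node_policies(["AmazonEKSWorkerNodePolicy", "AmazonEC2ContainerRegistryReadOnly"]))
```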

4. prometheus-fastapi-instrumentator is one line of code.

from prometheus_fastapi_instrumentator import Instrumentator
Instrumentator().instrument(app).expose(app)

That's it. You get request count, latency histograms, and HTTP status breakdown per endpoint, all at /metrics. The custom counters (orders_total, failed_orders_total, order_processing_seconds) are 5 more lines.

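5. Service-to-service calls need explicit timeouts. order-service calls restaurant-service with httpx.AsyncClient(timeout=5.0). httpx already defaults to a 5-second timeout, but other clients (requests, for example) default to none, so spelling it out keeps the limit visible and stops it from being disabled by accident. With no timeout at all, a slow restaurant-service will hold an order-service worker indefinitely, causing cascading failures that look like order-service bugs in the logs.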
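
To see the effect of a timeout without any dependencies, here is a sketch that swaps httpx for asyncio.wait_for around a simulated downstream call. fetch_menu, place_order, and the 201/503 status codes are hypothetical stand-ins, not the real handlers.

```python
import asyncio


async def fetch_menu(delay: float) -> dict:
    """Stand-in for the HTTP call to restaurant-service; sleeps to simulate latency."""
    await asyncio.sleep(delay)
    return {"items": ["example-item"]}


async def place_order(downstream_delay: float, timeout: float = 5.0) -> int:
    """Return 201 if validation finished in time, 503 if the downstream was too slow."""
    try:
        await asyncio.wait_for(fetch_menu(downstream_delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail fast instead of holding the worker while the downstream hangs.
        return 503
    return 201


# A 0.2 s downstream call against a 0.05 s timeout budget.
print(asyncio.run(place_order(downstream_delay=0.2, timeout=0.05)))
```

Returning a clear upstream-timeout status also makes the logs point at restaurant-service instead of order-service.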

6. maxUnavailable=0 in rolling updates protects you more than you think. With maxSurge=1, maxUnavailable=0, Kubernetes brings up the new pod and waits for it to pass its readiness check before terminating the old one. The /health readinessProbe with initialDelaySeconds=15 means the new pod gets 15 seconds to initialize SQLite and seed data before the first probe, and it receives no traffic until that probe passes. Without this, users hit 503s during every deploy.
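
The arithmetic behind that guarantee is simple enough to simulate. This is a simplified model of a surge-first rollout, assuming new pods become ready within the step they start in; the function names are hypothetical.

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """Pod-count limits Kubernetes enforces during a rolling update."""
    return {
        "min_available": replicas - max_unavailable,
        "max_total": replicas + max_surge,
    }


def simulate_rollout(replicas, max_surge=1, max_unavailable=0):
    """Return (old_ready, new_ready) after each step of the rollout."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    bounds = rollout_bounds(replicas, max_surge, max_unavailable)
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0 or new < replicas:
        # Start as many new pods as the surge budget allows.
        headroom = bounds["max_total"] - (old + new)
        new += min(headroom, replicas - new)
        # Remove old pods only while staying at or above min_available.
        removable = (old + new) - bounds["min_available"]
        old -= min(old, removable)
        history.append((old, new))
    return history


print(simulate_rollout(2))
```

With 2 replicas, the ready pod count never drops below 2 at any step, which is the whole point of maxUnavailable=0.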


Limitations (honest)

  • SQLite is fine for local dev and demos. This would use RDS or Aurora in production.

  • Single NAT Gateway is a cost optimization, not production-ready.

  • The React frontend hardcodes http://localhost:8080 — a real app would use environment injection at build time.

  • No secrets management — passwords and JWT secret are env vars. Production would use AWS Secrets Manager + Kubernetes Secrets.

  • The GitHub Actions IAM user uses long-lived access keys. Production would use OIDC federation (no keys at all).

  • The Grafana dashboard started as a local Docker Compose dashboard. Kubernetes metrics need their own PromQL queries and dashboard panels.


Try It

# Local — everything runs in Docker
git clone https://github.com/vijayb-aiops/devops-production-projects
cd devops-production-projects/projects/01-food-delivery-eks-platform
bash scripts/bootstrap.sh

# Trigger the observability demo
ORDER_SERVICE_FAILURE_MODE=true docker compose up -d order-service
bash scripts/load-test.sh
# Open Grafana at http://localhost:3000 (admin/foodrush123)

# Deploy to AWS
cd infra/terraform
terraform init
terraform apply
cd ../..
bash scripts/deploy-eks.sh

Estimated AWS cost while recording: ~$0.19/hr. Run terraform destroy when done.

📺 Full build-along: https://youtube.com/@ThinkWithOps
📁 GitHub: https://github.com/vijayb-aiops/devops-production-projects/tree/main/projects/01-food-delivery-eks-platform