Platform Starter Kit
Terraform + Kubernetes + ArgoCD + OPA Gatekeeper + Veeam Kasten
Problem: Platform teams provisioning Kubernetes clusters manually end up with snowflake environments, inconsistent RBAC, no backup governance, and configuration drift between dev and prod.
Approach: Built a production-ready starter kit that provisions AKS clusters via Terraform with ArgoCD handling all post-cluster configuration as GitOps. Every change, from namespace onboarding to backup policy, flows through Git. OPA/Gatekeeper enforces guardrails at the admission layer so teams can self-serve without bypassing security.
- Terraform modules for AKS with configurable node pools, networking, RBAC, and remote state on Azure Storage; separate dev/prod environment configs
- ArgoCD GitOps delivery with automated sync, self-heal to revert drift, and prune to delete resources removed from Git
- OPA/Gatekeeper policy requiring StatefulSets to carry a backup label, enforcing backup governance at admission
- Kasten K10 backup with daily snapshots, 7/4/3 GFS retention, and automated export to Azure Blob via managed identity
- Pod Security Standards (restricted), default-deny NetworkPolicies, Gateway API routing, Kustomize-based team onboarding with namespace isolation, quotas, and LimitRanges
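To make the backup-label guardrail concrete, here is a minimal Python sketch of the check such a policy performs at admission. It is illustrative only: real Gatekeeper constraints are written in Rego, and the label key `backup-policy` and the function name `check_statefulset` are hypothetical, since the kit's actual label is not stated here.

```python
from __future__ import annotations

# Hypothetical label key; the kit's real key is not given in the text.
BACKUP_LABEL = "backup-policy"


def check_statefulset(obj: dict, required_label: str = BACKUP_LABEL) -> list[str]:
    """Return violation messages for a Kubernetes object (empty list means admit)."""
    if obj.get("kind") != "StatefulSet":
        return []  # the rule only targets StatefulSets

    metadata = obj.get("metadata") or {}
    labels = metadata.get("labels") or {}

    # A missing or empty label value counts as a violation.
    if not labels.get(required_label):
        name = metadata.get("name", "<unnamed>")
        return [f"StatefulSet {name} must carry the '{required_label}' label"]
    return []


if __name__ == "__main__":
    unlabeled = {"kind": "StatefulSet", "metadata": {"name": "postgres"}}
    for message in check_statefulset(unlabeled):
        print("DENY:", message)
```

Rejecting the object at admission, rather than auditing later, is what lets teams self-serve: an unprotected StatefulSet never reaches the cluster.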
UpDog Monitor
FastAPI + Prometheus + Grafana + React/TypeScript
Problem: Simple uptime checks tell you a service is down but not whether you're burning through your error budget. Teams need SLO-driven monitoring with real-time visibility into availability and latency against defined targets.
Approach: Built a full-stack monitoring platform with an SLO engine that computes availability (99.5% target) and latency (p95 < 500ms) from Prometheus metrics. Error budget burn rate tracking surfaces problems before an SLO breach, not after. The entire stack deploys to Kubernetes, and CI includes container image scanning.
- FastAPI backend with PostgreSQL, React/TypeScript frontend, and background worker performing health checks on configurable intervals
- Custom Prometheus instrumentation (histograms, counters, gauges) with Grafana dashboards provisioned via JSON
- Alerting rules for SLO breaches and burn rate using histogram_quantile and rate queries
- CI/CD via GitHub Actions: lint, test, build, push to GHCR, and Trivy container image scanning
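The burn-rate idea behind the alerting rules can be sketched in a few lines of Python. This is only the arithmetic, under the assumption that burn rate is defined in the usual way (observed error rate divided by the error budget); the project itself evaluates this in Prometheus, and the function names below are hypothetical.

```python
SLO_TARGET = 0.995  # availability target stated above


def availability(good: int, total: int) -> float:
    """Fraction of successful checks; treat an empty window as fully available."""
    if total == 0:
        return 1.0
    return good / total


def burn_rate(good: int, total: int, slo: float = SLO_TARGET) -> float:
    """Observed error rate divided by the error budget (1 - SLO).

    A value of 1.0 spends the budget exactly as fast as the SLO allows;
    values above 1.0 exhaust it before the SLO window ends.
    """
    error_budget = 1.0 - slo
    error_rate = 1.0 - availability(good, total)
    return error_rate / error_budget


if __name__ == "__main__":
    # 990 successful checks out of 1000 is a 1% error rate against a 0.5% budget,
    # so the budget is being spent about twice as fast as allowed.
    print(f"burn rate: {burn_rate(990, 1000):.2f}")
```

Alerting on burn rate rather than raw availability is what gives early warning: a service can still be above 99.5% for the window while consuming its budget fast enough to breach soon.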