Mistral LLM Inference Platform on AWS EKS
Designed and deployed a production-grade LLM inference platform serving Mistral 7B Instruct on AWS EKS with GPU-accelerated inference via vLLM. A FastAPI gateway enforces bearer-token authentication and per-IP rate limiting, then proxies OpenAI-compatible requests to a vLLM backend running on NVIDIA A10G GPUs (Spot instances, for 60-70% cost savings over On-Demand). Infrastructure is provisioned with Terraform (VPC, EKS, ECR, managed node groups with the AL2023 NVIDIA AMI), deployed with Helm charts, and monitored with Prometheus metrics and Grafana dashboards. Autoscaling is driven by an HPA on a custom metric: the depth of the pending inference request queue. CI/CD pipelines implement canary deployments with automated smoke tests and rollback. Load testing with Locust showed a 0% failure rate at ~3.4 s p50 latency for 100-token completions.
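The gateway's admission step (bearer token check, then per-IP rate limit, then proxy) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the names SlidingWindowLimiter, token_ok, and admit are hypothetical, the sliding-window algorithm and the limit values are assumed (the summary does not say which limiter is used), and the HTTP status codes 401 and 429 are the conventional choices rather than confirmed behavior.

```python
import hmac
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client IP in any `window_s`-second window.

    Hypothetical helper: the real gateway's limiter algorithm is not documented.
    """

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        now = self.clock()
        q = self.hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


def token_ok(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header value.

    hmac.compare_digest avoids leaking the token through timing differences.
    """
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.encode(), expected_token.encode())


def admit(authorization_header, client_ip, limiter, expected_token):
    """Return the HTTP status the gateway would answer with before proxying."""
    if not token_ok(authorization_header, expected_token):
        return 401  # missing or wrong bearer token
    if not limiter.allow(client_ip):
        return 429  # this IP exceeded its request budget
    return 200  # request may be forwarded to the vLLM backend
```

In a FastAPI service, logic like this would typically sit in a dependency or middleware that runs before the request is forwarded. Note that an in-memory limiter such as this one is per-process, so a gateway running several replicas would need a shared store to enforce a global per-IP limit.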