About the Role
We're building an AI platform that's moving from a working MVP to secure, production-grade deployments inside the cloud environments of large enterprise customers. Each customer requires the platform to be deployed inside their own VPC, meet their detailed infrastructure security requirements, and pass a rigorous security review before going live.
We're looking for a senior DevOps / Platform Engineer to own the infrastructure, deployment, and security side of this work. You'll design and build the deployment package, infrastructure-as-code, network and security posture, and observability stack that make the platform deployable across multiple enterprise customers, starting with AWS and extending to GCP and Azure as the customer base grows.
This is hands-on work. You'll write Terraform, Helm charts, IAM policies, and security documentation. You'll sit alongside the founding engineering team, talk to enterprise security reviewers, and make the architectural decisions that determine whether the platform passes review.
What You'll Build
Cloud-agnostic deployment package. Containerized application (Next.js frontend, FastAPI/LangGraph backend) running on managed Kubernetes, deployable inside any major cloud (AWS first, then GCP and Azure). Includes Helm charts, Docker images, image registries, and deployment workflows.
Infrastructure as code. Terraform modules that provision the entire stack (networking, compute, database, cache, object storage, secrets, logging) inside a customer's cloud account. Designed to be reusable across customers with minimal per-customer changes.
Secure deployment posture. Network isolation with VPC endpoints, encryption in transit (TLS 1.2+) and at rest (SSE-KMS with customer-managed keys), least-privilege IAM with no wildcards, comprehensive CloudTrail audit logging, and lifecycle management for data retention and deletion.
In-VPC data ingestion pipeline. A pipeline that reads scheduled data drops from a designated S3 bucket inside a customer's VPC, validates the files, and loads them into the application's database. All ingestion happens inside the customer's cloud account.
LLM access within the customer network. Routing Claude calls through Amazon Bedrock (and Vertex AI / Azure AI Foundry as needed) so all LLM traffic stays inside the customer's cloud network rather than calling external APIs over the public internet.
Observability and monitoring. Tracing, monitoring, and alerting across the application and infrastructure. Tooling for agent-level observability (LangSmith or equivalent) integrated with the broader observability stack.
Security documentation. Architecture diagrams, data flow diagrams, security questionnaire responses, runbooks. The documentation that enterprise security reviewers will scrutinize line by line.
Customer onboarding playbook. Repeatable patterns and scripts so each new customer deployment takes days of configuration rather than weeks of engineering.
What You Should Have Done Before
We're not looking for someone who will learn these skills on the job. We need someone who has done this kind of work before:
Production Kubernetes experience on AWS (EKS), ideally also GKE or AKS. You should know how to design, deploy, and troubleshoot Kubernetes workloads in production, not just run minikube locally.
Strong Terraform / infrastructure-as-code skills. You've written and maintained substantial Terraform codebases. You know how to structure modules, handle state, manage secrets, and avoid the common pitfalls.
Deep AWS knowledge. VPC design, private subnets, VPC endpoints, security groups, IAM (policies, roles, conditions, least-privilege), KMS (customer-managed keys, key policies), S3 (security, lifecycle, encryption), RDS (Postgres preferred), CloudTrail, Secrets Manager, ALB.
Experience meeting enterprise security baselines. You've worked through enterprise security reviews before. You understand what "no wildcards in IAM," "VPC endpoint enforcement," "SSE-KMS with customer-managed keys," and "CloudTrail data events" actually mean and how to implement them correctly.
Hands-on with CI/CD pipelines for containerized applications. GitHub Actions, GitLab CI, ArgoCD, or similar.
Comfort with Linux, Bash, Python. You can debug a failing container, write a deployment script, or hack together a data ingestion job in Python when needed.
Nice to have, not required:
Experience with BYOC (Bring Your Own Cloud) deployments for SaaS products
Familiarity with LangGraph, LLM observability tools (LangSmith, Langfuse), or other AI/ML infrastructure
SOC 2 readiness experience (Vanta, Drata, audit prep); not the current focus, but useful context
GCP (GKE, Vertex AI) and Azure (AKS, AI Foundry) experience for future customer deployments
Prior experience in startups working with enterprise customers, especially in retail, ad-tech, or fintech