Overview
About the Role
We are hiring a mid-level DevOps Engineer to help own the infrastructure and delivery surface of a leading technology marketplace that has been running successfully for the last 10 years. The infrastructure is a multi-region cloud estate: dozens of microservices on Kubernetes, an edge router, a Node.js proxy fleet, a Kafka event backbone, a data plane built on ClickHouse, Spark, Snowflake, and Airflow, and modern CI/CD pipelines. The job is to keep that estate reliable, cheap, secure, and pleasant to ship into.
As the platform adds agent-native traffic, you will help ensure the infrastructure scales for machine-speed, high-frequency agent calls alongside human traffic — without separate handling or runaway cost.
Roles and Responsibilities
• AWS infrastructure — Operate and improve the multi-region AWS estate across compute, streaming, databases, caching, and storage. Diagnose issues across regions and accounts.
• Infrastructure-as-Code — Maintain and extend the Terraform modules that define the platform. Treat infra changes like code: reviewed, tested, deployed via pipeline.
• Kubernetes — Operate workloads on Kubernetes: deployments, autoscaling, ingress, networking, secrets, resource limits. Help service teams adopt safe defaults.
• CI/CD — Own and improve the CI/CD pipelines that ship frontend, backend, and data-platform code. Keep build times short, test suites fast and reliable in CI, and deploys boring.
• Observability platform — Own the observability and alerting platform and the standards service teams instrument against. Service teams own their own instrumentation — you own the platform that makes it possible, the conventions that keep it consistent, and the gap analysis that catches what nobody instrumented. Take part in the on-call rotation.
• Incident response — Lead or co-lead incidents; run blameless post-mortems; turn each incident into automation or improved runbooks.
• Security and access — Manage secrets, rotation, access policies, and network policies. Partner with the architect on zero-trust patterns, OAuth2 / JWT / mTLS, and security compliance.
• Cost discipline — Hunt down idle capacity, oversized clusters, runaway warehouse queries, expensive streaming shards, and CDN waste. Report monthly on what came down and why.
• Vendor consolidation — Help simplify the observability, CDN, CDP, and supporting vendor stack. Carry through the migrations.
What Success Looks Like
• Uptime ≥ 99.99% across the platform; incidents are rarer, shorter, and better understood than before you joined.
• Deploys to every part of the stack are routine — service teams ship multiple times a day without your help.
• Terraform is the source of truth: nothing important is configured by hand in the console.
• Alert volume is down and alert quality is up — on-call does not dread the pager.
• AWS and vendor spend per call trends down as traffic grows.
• Security and compliance audits go smoothly because the controls are real and the evidence is automated.
What You Bring
• 5+ years of hands-on DevOps / SRE / Platform Engineering experience in a production environment that mattered.
• Strong hands-on AWS across compute, networking, storage, and managed databases.
• Kubernetes in production — deployments, autoscaling, ingress, Helm, secrets, RBAC, troubleshooting.
• Terraform in production — modules, state hygiene, code review discipline.
• Solid scripting in Python, Bash, or Go.
• CI/CD experience with modern pipeline tooling.
• Real observability experience — instrumentation, dashboards, alerts.
• Comfortable on-call. Comfortable running an incident. Comfortable telling the team that something needs to be redesigned.
• Security mindset: least-privilege, secrets hygiene, supply-chain awareness.
• Bonus: experience operating Kafka, ClickHouse, Snowflake, or Airflow; exposure to agentic / LLM-driven ops workflows for incident triage or remediation.
• You automate before you complain. You leave the cluster cheaper and quieter than you found it.