Overview
DevOps Engineer
About Us
We build a collaborative, real-time workspace platform enabling teams to organize content, manage projects, and communicate at scale. Our platform is a cloud-native SaaS product running on AWS, serving users across multiple regions through a microservices architecture.
Our engineering team moves fast. We ship continuously to a Kubernetes-based infrastructure with a fully automated CI/CD pipeline and take infrastructure quality as seriously as product quality. We value engineers who treat infrastructure as code, own reliability end-to-end, and proactively improve the systems they work on.
About the Role
Experience Level: Mid to Senior | Minimum: 5+ years in DevOps / Platform Engineering
We are looking for an experienced DevOps Engineer to own and evolve the infrastructure that powers our platform. You will work closely with our backend and frontend engineers to keep our systems reliable, secure, observable, and cost-efficient.
You will manage a production-grade AWS environment spanning 16+ microservices (Go, Node.js) on Kubernetes (EKS), with infrastructure provisioned entirely through Terraform and deployments managed via Helm and GitHub Actions. The role covers everything from infrastructure design and CI/CD pipelines to monitoring, incident response, and security hardening.
This is a hands-on engineering role — you will write real Terraform, maintain Helm charts, build and improve CI/CD pipelines, and debug production issues from CloudWatch logs to Kubernetes pod events.
What You'll Do
Infrastructure as Code (Terraform)
Own, maintain, and evolve a large library of Terraform modules that provision the entire AWS environment across development and production accounts
Manage EKS cluster configurations including managed node groups and spot/fleet instance node groups (cost-optimized, achieving up to 70% savings vs on-demand)
Provision and maintain supporting infrastructure: VPC, subnets, security groups, ALB, ACM certificates, Route53 DNS, SQS queues, SES email, and EFS volumes
Add new modules for evolving infrastructure requirements and ensure all resources are reproducible and version-controlled
Apply Terraform changes safely across environments using Terraform workspaces and remote state backends
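For context on the "up to 70% savings" figure quoted for spot/fleet node groups above, here is a minimal sketch of how such a percentage is typically computed. The hourly prices are hypothetical placeholders, not real AWS rates or figures from our environment.

```python
def savings_percent(on_demand_hourly: float, spot_hourly: float) -> float:
    """Return the percentage saved by paying the spot price instead of on-demand."""
    if on_demand_hourly <= 0:
        raise ValueError("on-demand price must be positive")
    return (on_demand_hourly - spot_hourly) / on_demand_hourly * 100


# Hypothetical prices, for illustration only (USD per instance-hour).
example_on_demand = 0.20
example_spot = 0.06
print(f"{savings_percent(example_on_demand, example_spot):.0f}% saved")
```

Actual savings vary with instance type, region, and spot market conditions, which is why the figure is stated as an upper bound.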
Kubernetes & Container Orchestration
Operate and maintain the AWS EKS cluster with both spot/fleet and on-demand worker node groups
Deploy and manage 16+ microservices on Kubernetes using Helm charts (4 custom charts: generic deployments, one-time jobs, cron jobs, and ingress)
Configure and tune Horizontal Pod Autoscalers (HPA), Pod Disruption Budgets (PDB), and Persistent Volume Claims (PVC) per service
Manage Kubernetes ingress, service accounts, RBAC, and ConfigMaps/Secrets
Maintain the Helm chart repository (versioning, publishing, GitHub Actions pipeline)
Debug pod failures, resource constraints, and node scheduling issues
CI/CD Pipeline Management
Own multiple GitHub Actions workflows covering PR validation, auto-deployment to dev, and production releases
Enforce a gated release flow: (1) PR checks (build, unit tests, commit linting, manual approvals); (2) auto-deploy to the dev environment on merge to the development branch; (3) production releases triggered by semver tags (vx.y.z)
Maintain build pipelines for Go microservices (multi-stage Docker builds), Node.js services, and Helm charts
Manage AWS ECR image repositories — pushing, tagging, lifecycle policies
Configure Slack notifications for deployment failures and pipeline events
Build and improve deployment automation, reducing manual intervention in release processes
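As a rough illustration of the release flow described above, the routing decision can be sketched as a small function that maps a Git ref to a target environment. The function name, the strict `vX.Y.Z` pattern (no pre-release suffixes), and the environment labels are assumptions made for this example, not a description of our actual workflows.

```python
import re
from typing import Optional

# Assumption: production tags are plain semver with a leading "v", e.g. v1.4.0.
SEMVER_TAG = re.compile(r"^v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

TAG_PREFIX = "refs/tags/"


def target_environment(git_ref: str) -> Optional[str]:
    """Map a Git ref to the environment it should deploy to, or None for no deploy."""
    # Merges to the development branch auto-deploy to the dev environment.
    if git_ref == "refs/heads/development":
        return "dev"
    # Semver tags promote a release to production.
    if git_ref.startswith(TAG_PREFIX) and SEMVER_TAG.match(git_ref[len(TAG_PREFIX):]):
        return "production"
    # Everything else (feature branches, malformed tags) only runs PR checks.
    return None
```

In a GitHub Actions workflow the same decision is usually expressed with branch and tag filters on the trigger; the function form is shown here only because it is easy to read and test.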
Monitoring & Observability
Operate SigNoz for APM — configure service traces, metrics dashboards, and alerts across all microservices
Manage CloudWatch log groups per service (integrated via Fluent Bit log shipping from Kubernetes)
Maintain Grafana dashboards for infrastructure-level metrics
Monitor Prometheus metrics exposed by backend services
Maintain StatusPage.io public status pages for our services
Define alerting rules and on-call runbooks; own incident response and post-mortems
Security & Secrets Management
Manage AWS Secrets Manager for all service credentials (MongoDB, Wasabi, application configs)
Administer AWS Client VPN with SSO integration for secure developer access to private infrastructure
Maintain IAM roles, policies, and service accounts following least-privilege principles
Manage ACM certificates and ensure TLS is enforced across all ingress endpoints
Operate ClamAV for malware scanning of user-uploaded files
Support the SpiceDB fine-grained authorization service and its migration tooling
Participate in compliance reviews and apply security best practices across the AWS account
Networking & Cloud Architecture
Manage multi-VPC architecture: separate VPCs for dev and production environments with VPC peering for controlled cross-environment access
Configure MongoDB Atlas PrivateLink connectivity ensuring database clusters are accessible only from within the designated VPC
Maintain bastion host configuration for emergency database access
Design and implement network segmentation, security group rules, and NACLs
Manage DNS via Route53 and ALB routing rules
Collaboration with Engineering Teams
Partner with Go and Node.js backend engineers to containerize new services and onboard them to the deployment pipeline
Work with frontend engineers on AWS Amplify deployments for the Nuxt.js / Vue 3 PWA
Provide runbooks and documentation for common debugging workflows (e.g., CloudWatch log tailing, VPN access, EKS pod debugging)
Define and enforce infrastructure standards, naming conventions, and tagging strategies across environments
Our Stack — You'll Work With These Every Day
Cloud Platform — AWS
EKS (managed Kubernetes control plane)
EC2 (managed and custom/fleet node groups)
ECR (container image registry)
ALB (Application Load Balancer)
CloudWatch (logging and metrics)
Secrets Manager
SQS (message queues)
SES (transactional email)
ACM (SSL/TLS certificates)
Route53 (DNS)
EFS (persistent storage for Kubernetes)
Client VPN (developer access)
AWS IAM Identity Center, formerly AWS SSO (identity federation)
AWS FIS (Fault Injection Service, used for chaos engineering)
AWS Amplify (frontend CI/CD and hosting)
Container Orchestration & Packaging
Kubernetes (EKS) — fleet/spot + on-demand node groups
Helm (4 custom charts: generic deployments, one-time jobs, cron jobs, ingress)
Docker (multi-stage builds for Go and Node.js services)
HPA, PDB, PVC, Ingress, RBAC
Infrastructure as Code
Terraform — modular components, multi-environment (dev + prod), remote state backend
CI/CD & Automation
GitHub Actions (multiple workflows)
Semver-based release tagging (vx.y.z) for production promotions
Slack for pipeline notifications
Monitoring & Observability
SigNoz (APM, distributed tracing, dashboards, alerts)
CloudWatch (log aggregation — per-service log streams)
Fluent Bit (Kubernetes log shipping to CloudWatch)
Grafana (infrastructure dashboards)
Prometheus (per-service metrics)
StatusPage.io (public incident communication)
Data & Storage
MongoDB Atlas (cloud MongoDB with PrivateLink, per-environment isolation)
Aurora PostgreSQL and MySQL (via Amazon RDS)
Redis (ElastiCache — single-instance and cluster mode)
Wasabi (S3-compatible object storage with HA configuration)
EFS (Elastic File System for Kubernetes PVCs)
Security & Access
AWS Secrets Manager
AWS Client VPN + AWS SSO
IAM (service roles, least-privilege policies)
ACM (TLS certificates)
Security Groups and NACLs
SpiceDB (fine-grained authorization service)
ClamAV (antivirus scanning)
Services Architecture
14 Go microservices (gRPC inter-service communication via Protocol Buffers)
1 Node.js service (document generation)
gRPC (primary inter-service transport)
REST/HTTP (client-facing APIs)
MongoDB change streams (event-driven data sync)
Asynq/Redis (async task queues)
Frontend Deployment
AWS Amplify (Nuxt.js 3 / Vue 3 PWA — web application frontend)
Node.js 22+
GitHub Actions for Amplify CI/CD
What We're Looking For
Minimum Experience Requirements at a Glance
Area | Minimum
DevOps / Platform Engineering (overall) | 5+ years
Terraform (module-level IaC) | 2+ years
Kubernetes in production | 2+ years
AWS (EKS, ECR, CloudWatch, IAM, etc.) | 2+ years
CI/CD pipeline ownership (GitHub Actions or equivalent) | 1+ year
Must Have
Experience & General Skills
5+ years of hands-on DevOps or Platform Engineering experience in a production environment
Strong ownership mentality — you don't wait to be asked to fix something that's broken
Comfortable working in a fast-moving startup environment with evolving infrastructure requirements
Clear written communication (runbooks, post-mortems, documentation)
Cloud — AWS (2+ years)
Solid experience with AWS core services: EKS, EC2, ALB, ECR, CloudWatch, Secrets Manager, IAM, SQS, Route53, ACM
Understanding of AWS networking: VPC design, subnets, security groups, VPC peering, PrivateLink
Experience managing multi-environment AWS accounts (dev / prod separation)
Kubernetes & Containers (2+ years)
Production Kubernetes experience — deploying, scaling, and debugging workloads
Helm chart authoring and maintenance (not just helm install)
Docker — writing efficient multi-stage Dockerfiles for compiled (Go) and interpreted (Node.js) applications
Familiarity with HPA, PDB, resource limits/requests, and pod scheduling
Infrastructure as Code — Terraform (2+ years)
2+ years writing and maintaining Terraform at module level
Experience with remote state, workspaces, and multi-environment Terraform layouts
Ability to read existing module code, understand dependencies, and extend it safely
CI/CD (1+ year)
GitHub Actions — building and maintaining workflows (jobs, steps, secrets, environments, reusable workflows)
Experience implementing gated release pipelines with automated checks and manual approval gates
Container build and push pipelines to ECR or similar registries
Monitoring & Observability
Practical experience with log aggregation (CloudWatch, Fluent Bit, or similar)
Alerting configuration: defining meaningful, actionable alerts that avoid alert fatigue
Experience debugging production issues from logs and metrics
Security
Secrets management best practices (Secrets Manager or Vault)
IAM least-privilege design
VPN and SSO administration basics
Nice to Have
Experience with SigNoz or OpenTelemetry-based APM platforms
Experience with MongoDB Atlas including PrivateLink and cluster management
Familiarity with SpiceDB or Zanzibar-style authorization systems
Experience with AWS FIS or other chaos engineering tools
Knowledge of Wasabi or S3-compatible storage beyond AWS native S3
Experience with AWS Amplify for frontend deployments
Exposure to gRPC service-based architectures (understanding of how Protocol Buffer services are deployed and scaled)
Experience running cost optimization programs on EKS using spot/fleet instances
Familiarity with ClamAV integration in Kubernetes environments
Go or Node.js — enough to read service code, identify issues in Dockerfiles, and help debug build failures
What We Offer
Ownership over a production-grade, cloud-native infrastructure stack — not just ticket execution
Exposure to a modern, well-structured microservices architecture with 16+ services
A team that treats infrastructure quality as a first-class concern
Flexible, asynchronous-friendly work culture
Opportunity to shape DevOps practices and tooling from an early stage
To apply, please send your resume and a short note about a complex infrastructure problem you've solved — specifically what the challenge was, what you built or changed, and what the outcome was.