Overview
This is one of the first site reliability engineering positions to open at potpie 🥧, and we are excited to find and onboard a great mind to work alongside us. You'll be instrumental in building the robust, scalable, and resilient infrastructure required to build and deploy AI agents for engineering use cases like debugging, system design, and testing.
What is potpie?
Potpie 🥧 is an open-source platform that understands your codebase and helps you build use-case-specific AI agents for your developer workflows. We also provide users with ready-to-use agents for engineering use cases like debugging, low-level design, and testing.
Responsibilities
As a Site Reliability Engineer at potpie, you will:
Design, implement, and maintain the core infrastructure and CI/CD pipelines to ensure high availability, scalability, and performance of the potpie platform and its AI agents.
Be responsible for observability (logging, monitoring, alerting) across the stack to proactively identify and resolve issues.
Automate deployment, scaling, and operational tasks using infrastructure-as-code (IaC) principles.
Collaborate closely with the backend and product teams to plan features and ensure new deployments meet reliability and performance standards.
Conduct system design reviews with a focus on reliability, fault tolerance, and disaster recovery.
Participate in the on-call rotation to respond to and resolve critical production incidents efficiently.
Drive the adoption of best practices for security, performance optimization, and cost management within the cloud infrastructure.
Proof of Work & Qualifications
At potpie, we accept any meaningful project as proof of work. If you don't have work experience but have an open-source project, that counts. We don't measure your ability by the years of experience you have, though years of experience can sometimes be a good proxy for your capabilities.
Must-haves
Expertise in Cloud Infrastructure (e.g., AWS, GCP, Azure), particularly managing and deploying applications at scale.
Strong practical experience with Kubernetes (e.g., GKE, EKS) and containerization technologies (Docker).
Solid understanding of SRE principles and practices, including SLOs, SLIs, error budgets, and post-mortem analysis.
Experience with Infrastructure-as-Code tools (e.g., Terraform, Ansible).
Proficiency in setting up and managing monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack, Datadog).
Strong scripting and automation skills in a language like Python or Go.
Familiarity with database operations and reliability (e.g., PostgreSQL, Redis, MongoDB).
Excellent debugging and problem-solving skills and attention to detail, especially in distributed systems.
Good to have
Experience with AI/ML infrastructure or deploying LLM-based applications.
Tangible open-source contributions related to infrastructure, DevOps, or reliability.
Experience in a startup environment or building infrastructure from the ground up.
Strong knowledge of networking and security best practices for cloud-native applications.
Why should you consider working at potpie?
Working at potpie is the right bet for you if you can relate to the following:
You are tired of building the same standard infrastructure setups and are eager to tackle challenging technical problems related to scaling AI agents.
You want to do impactful work that brings positive change to the lives of fellow developers.
You have out-of-the-box ideas and want the autonomy to chase them.
You want to work across the stack on a fast-paced project from day one.
You want the opportunity to build the company culture you always wanted at work.
You aspire to build something of your own one day.
🧩 How do we hire engineers at potpie?
Introductory call: A brief call to understand your background, expectations, and ambitions.
Assessment: A take-home assignment (48 hours) relevant to the SRE role, followed by a discussion with an engineering interviewer. This step may be skipped if there is substantial proof of work.
Technical Interview: The core technical round, assessing your system design, SRE knowledge, and algorithmic thinking. We will collaborate on designing a solution in real time to evaluate your fundamentals.
Pro-Tip: If you want to grab our attention, the best way is to find a bug in the application and raise a PR: https://github.com/potpie-ai/potpie