Overview
Are you a seasoned SRE looking to make a significant impact on large-scale cloud environments? Want to deepen your GCP expertise while being mentored by an ex-Google SRE leader?
Aviato Consulting is seeking an experienced Senior Site Reliability Engineer to join our growing team. This isn't just another SRE role; it's an opportunity to own critical infrastructure, drive technical strategy, and shape the reliability culture for major Australian and EU clients, all within a supportive, Google-inspired environment built on transparency and collaboration.
What's In It For You?
- Learn from the Best: Report directly to and receive mentorship from our Head of SRE, an experienced ex-Google SRE Manager. Gain invaluable insights into scaling, reliability, and leadership honed at one of the world's tech giants.
- High-Impact Projects: Take ownership of complex GCP environments for diverse, significant clients across Australia and the EU. Your work directly influences the stability and performance of critical systems.
- Drive Innovation, Not Just Tickets: We empower our Senior SREs to think strategically. You'll architect solutions, implement cutting-edge practices (SLOs, error budgets, advanced automation), and proactively improve systems, not just react to issues.
- A Culture That Works: Founded by ex-Googlers, we foster a transparent, collaborative, and low-bureaucracy environment where doing the right thing matters. We value SRE principles and give you the autonomy to implement them effectively.
- Cutting-Edge Tech: Deepen your expertise with GCP, Kubernetes, Terraform, modern observability tooling (Grafana, Dynatrace, Sentry), and sophisticated CI/CD pipelines.
What You'll Do (Your Impact):
- Own & Architect Reliability: Design, implement, and manage highly available, scalable, and resilient architectures on Google Cloud Platform (GCP) for key customer environments.
- Lead GCP Expertise: Serve as a subject matter expert for GCP within the team and potentially the wider organisation, driving best practices for security, cost optimisation, and performance.
- Master Kubernetes at Scale: Architect, deploy, secure, and manage production-grade Kubernetes clusters (GKE preferred), ensuring optimal performance and reliability for critical applications (including API platforms like Apigee, though prior Apigee experience isn't mandatory).
- Drive Automation & IaC: Lead the design and implementation of robust automation strategies using Terraform, Ansible, and scripting (Python, Go, Bash) for provisioning, configuration management, and CI/CD pipelines (Jenkins, GitHub Actions, etc.).
- Elevate Observability: Architect and refine comprehensive monitoring, logging, and alerting strategies using tools like Grafana, Dynatrace, and Sentry to ensure proactive issue detection and rapid response.
- Lead Incident Response & Prevention: Spearhead incident management efforts, conduct blameless post-mortems, and drive the implementation of preventative measures to continuously improve system resilience.
- Champion SRE Principles: Actively promote and embed SRE best practices (SLOs, SLIs, error budgets) within delivery teams and operational processes.
- Mentor & Collaborate: Share your expertise, mentor junior team members where the opportunity arises, and collaborate effectively across teams to foster a strong reliability culture.
What You'll Bring (Your Expertise):
- Proven SRE Experience: 5+ years of hands-on experience in a Site Reliability Engineering, DevOps, or Cloud Engineering role, with a significant focus on production systems.
- Deep GCP Knowledge: Demonstrable, in-depth expertise in designing, deploying, and managing services within Google Cloud Platform (Compute Engine, GKE, Networking, IAM, Cloud SQL/Spanner, Pub/Sub, Monitoring/Logging, etc.). GCP certifications are a plus.
- Strong Kubernetes Skills: Proven experience managing Kubernetes clusters in production environments (GKE highly desirable). Understanding of networking, security, and operational best practices within Kubernetes.
- Infrastructure as Code Mastery: Significant experience using Terraform in complex environments. Proficiency with configuration management tools (Ansible, Puppet, Chef) is beneficial.
- Automation & Scripting Prowess: Strong proficiency in scripting languages like Python or Go, with experience in automating operational tasks and building tooling.
- Observability Expertise: Experience implementing and leveraging monitoring, logging, and tracing tools (e.g., Prometheus, Grafana, ELK Stack, Dynatrace, Datadog, Sentry).
- Problem-Solving Acumen: Strong analytical and troubleshooting skills, with experience leading incident response for critical systems.
- Collaboration & Communication: Excellent communication skills and a collaborative mindset, with the ability to explain complex technical concepts clearly. Experience mentoring others is advantageous.
- Desirable: Experience with API management platforms (Apigee, Kong, etc.), advanced networking concepts, or security hardening in cloud environments.
Technologies We Use (You'll Master):
- Cloud: Google Cloud Platform (GCP)
- Containerisation & Orchestration: Kubernetes (GKE), Docker
- Infrastructure & Automation: Terraform, Ansible
- Monitoring & Observability: Grafana, Dynatrace, Sentry, Google Cloud Operations Suite
- CI/CD: Jenkins, GitHub Actions, Bamboo (or similar)
- Scripting: Python, Go, Bash
- Collaboration: Jira, Confluence, Slack
Ready to Elevate Your SRE Career?
If you're a passionate Senior SRE ready to tackle complex challenges on GCP, work with leading clients, and benefit from exceptional mentorship in a fantastic culture, Aviato is the place for you. Apply now and help us build the future of reliable cloud infrastructure!