Pune, Maharashtra, India
Information Technology
Other
Cloudangles

Overview
Site Reliability Engineer I
Job Summary
Site Reliability Engineers (SREs) work at the intersection of software engineering and systems administration. In other words, they can both write code and manage the infrastructure on which that code runs. This is a very broad skill set, but the end goal of an SRE is always the same: to ensure that all SLAs are met, but not exceeded, so as to balance performance and reliability against operational costs.
As a Site Reliability Engineer I, you will learn our systems, improve your craft as an engineer, and take on tasks that improve the overall reliability of the VP platform.
Key Responsibilities:
- Design, implement, and maintain robust monitoring and alerting systems.
- Lead observability initiatives by improving metrics, logging, and tracing across services and infrastructure.
- Collaborate with development and infrastructure teams to instrument applications and ensure visibility into system health and performance.
- Write Python scripts and tools for automation, infrastructure management, and incident response.
- Participate in and improve the incident management and on-call process, driving down Mean Time to Resolution (MTTR).
- Conduct root cause analysis and postmortems following incidents, and champion efforts to prevent recurrence.
- Optimize systems for scalability, performance, and cost-efficiency in cloud and containerized environments.
- Advocate for and implement SRE best practices, including SLOs/SLIs, capacity planning, and reliability reviews.
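To give a sense of the Python tooling these responsibilities involve, here is a minimal sketch that computes Mean Time to Resolution from incident timestamps and the share of an SLO error budget still unspent. The function names, the 99.9% target, and the sample figures are hypothetical illustrations, not taken from our platform.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left; negative means it is overspent."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        (start, start + timedelta(minutes=30)),
        (start + timedelta(days=1), start + timedelta(days=1, minutes=90)),
    ]
    print("MTTR:", mean_time_to_resolution(incidents))  # prints "MTTR: 1:00:00"

    # Hypothetical SLO: 99.9% of 1,000,000 requests succeed; 250 failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

In this sketch MTTR is measured from detection to resolution; teams that measure from the start of impact would pass different timestamps.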
Required Skills & Qualifications:
- 3+ years of experience in a Site Reliability Engineer or similar role.
- Proficiency in Python for automation and tooling.
- Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, New Relic, or OpenTelemetry.
- Experience with log aggregation and analysis tools like ELK Stack (Elasticsearch, Logstash, Kibana) or Fluentd.
- Good understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
- Familiarity with infrastructure-as-code (Terraform, Ansible, or similar).
- Strong debugging and incident response skills.
- Knowledge of CI/CD pipelines and release engineering practices.
Talk to us
Feel free to call, email, or hit us up on our social media accounts.
Email
info@antaltechjobs.in