Bangalore, Karnataka, India
Information Technology
Full-Time
GSPANN Technologies, Inc
Overview
Change Management, Incident Response, Dynatrace, Grafana, Splunk, Datadog, New Relic, Azure, Python, CI/CD/CT Pipeline, Kubernetes, Docker, Ansible, DevOps, Terraform, Root Cause Analysis (RCA), SLO/SLA Monitoring, E2E Implementation
Description
GSPANN is hiring a Principal Engineer – Technical Lead for Site Reliability Engineering (SRE) to lead reliability engineering initiatives in Pune or Hyderabad. This full-time role focuses on driving enterprise-wide observability, automation, and infrastructure optimization across global production systems.
Location: Pune / Hyderabad
Role Type: Full Time
Published On: 2 June 2025
Experience: 12 - 15 Years
Role and Responsibilities
- Demonstrate deep expertise in monitoring and observability tools such as Dynatrace, Splunk, Datadog, Grafana, and New Relic.
- Apply modern observability practices and tools across enterprise environments.
- Resolve organizational gaps in SRE implementation by designing scalable, long-term solutions.
- Lead cross-functional initiatives to adopt emerging technologies and reliability frameworks.
- Influence senior leadership on strategic decisions related to tooling, observability, and transformation.
- Analyze complex system issues, uncover performance bottlenecks, and drive root cause resolution.
- Drive automation and foster a culture of continuous improvement aligned with evolving technology trends.
- Manage cloud infrastructure efficiently, with a strong preference for Microsoft Azure experience.
- Write automation scripts proficiently, preferably using Python.
- Work with cloud deployment tools including Ansible, Terraform, and Azure DevOps.
- Architect and operate containerized environments using Kubernetes and Docker.
- Utilize configuration management solutions such as Chef, Ansible, and AWS CodeDeploy.
- Implement and optimize Continuous Integration/Continuous Deployment (CI/CD) pipelines using tools like GitLab, Jenkins, Bamboo, Travis CI, and CircleCI.
- Solve technical issues independently and deliver sustainable solutions with minimal supervision.
- Lead change and incident management processes, while driving strategic SRE transformation at scale.
- Standardize observability across teams with end-to-end (E2E) implementation and innovative approaches.
- Champion enterprise-grade monitoring strategies using industry-leading tools.
- Build scalable infrastructure using Infrastructure as Code (IaC) principles and technologies.
- Exhibit visionary thinking, proactive leadership, and deep troubleshooting expertise.
- Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Coordinate and lead incident response while conducting thorough Root Cause Analysis (RCA).
- Hold a bachelor's degree in Computer Science, Information Science, Engineering, or a related field.
- Bring 12+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a strong focus on managing production systems.
- Ensure high availability, low latency, optimal performance, and cost-efficient operations for global e-commerce platforms.
- Spearhead change and incident management across business-critical systems.
- Mentor and guide product teams in embedding observability and operational excellence throughout the delivery pipeline.
- Architect and deploy unified, end-to-end observability dashboards tailored for engineering and business stakeholders.
- Define instrumentation standards and build reusable patterns to scale best practices across teams.
- Collaborate with cross-functional stakeholders to integrate reliability into every stage of product development.
- Develop proprietary tools that close gaps in software delivery and incident response.
- Lead the adoption of SRE best practices to systematically improve resilience and uptime.
- Automate key operations to ensure rapid and effective incident handling.
- Monitor and enforce compliance with SLOs and ensure uninterrupted availability of mission-critical services.
- Continuously optimize infrastructure to lower operational costs and seamlessly manage demand surges.
Talk to us
Feel free to call, email, or hit us up on our social media accounts.
Email
info@antaltechjobs.in