600,000 - 1,800,000 INR - Yearly
Pune, Maharashtra, India
Information Technology
Full-Time
Neverinstall
Overview
What You'll Do
- Design and implement comprehensive monitoring and alerting systems
- Build observability for our distributed architecture (streaming servers, microservices, orchestration)
- Respond to and resolve service outages with a focus on rapid recovery
- Create runbooks, incident response procedures, and post-mortem processes
- Implement SLI/SLO frameworks for enterprise customer SLA compliance
- Monitor and optimize performance across multiple cloud providers (Azure, OCI)
- Build automation for deployment, scaling, and recovery processes
- Design disaster recovery and business continuity procedures
- Work with engineering teams to improve system reliability and reduce MTTR
- Implement chaos engineering and reliability testing practices
What We're Looking For
- 3+ years of SRE, DevOps, or production systems experience
- Strong experience with monitoring tools (Prometheus, Grafana, ELK, or similar)
- Experience with incident management and on-call responsibilities
- Knowledge of distributed systems reliability patterns and practices
- Understanding of cloud platforms (Azure, AWS, GCP, OCI) and their monitoring tools
- Experience with Kubernetes, container orchestration, and microservices
- Scripting and automation skills (Python, Go, Bash, or similar)
- Strong troubleshooting and debugging skills across the full stack
Nice to Have
- Experience with enterprise SLA management and reporting
- Knowledge of streaming protocols, real-time systems, or VDI platforms
- Experience with multi-cloud architectures and failover strategies
- Understanding of network protocols and performance optimization
- Background in high-availability systems or financial services
- Experience with infrastructure as code and GitOps practices
- Knowledge of security monitoring and compliance frameworks
Key Responsibilities Include
- Incident Response: Lead major incident response and coordinate cross-team resolution
- Monitoring & Alerting: Build comprehensive observability across our streaming and orchestration services
- Reliability Engineering: Work with development teams to build reliability into new features
- Performance Optimization: Monitor and optimize system performance for enterprise workloads
- Documentation: Create and maintain runbooks, troubleshooting guides, and operational procedures