Pune, Maharashtra, India
Information Technology
Full-Time
Wits Innovation Lab
Overview
Key Responsibilities
- Design, implement, and maintain comprehensive monitoring, logging, and alerting solutions across our production and other environments
- Lead incident response and post-mortem analyses, establishing best practices for problem resolution
- Design and implement disaster recovery strategies and ensure regular testing
- Collaborate with development teams and other stakeholders to implement SLAs for critical services
- Optimize cloud infrastructure for performance, reliability, and cost efficiency
- Develop and maintain automation for deployment, scaling, and recovery procedures
- Run and maintain our infrastructure with cookbooks using Terraform, GitLab CI/CD, and Kubernetes
- Respond to on-call incidents
Requirements
- 6+ years of experience in SRE, DevOps, or similar roles
- Ability to work with a variety of languages and tools: Shell, Python, Chef (recipes, cookbooks), and Ansible (basic syntax, tasks, playbooks)
- Strong experience with AWS services such as Cognito, EC2, EKS, RDS, and CloudWatch
- Proficient in Kubernetes administration and operations in production environments
- Experience with infrastructure as code using tools like Terraform or CloudFormation
- Strong scripting skills with Python, Bash, or similar languages
- Deep understanding of observability tools such as Prometheus, Grafana, and the ELK stack
- Experience provisioning and setting up metrics and alerts in Prometheus and Grafana, and provisioning and setting up logs
- Experience with PostgreSQL or similar database systems, including replication strategies
- Knowledge of network protocols, load balancing, and security best practices
- Experience with CI/CD pipelines and GitOps workflows
Talk to us
Feel free to call, email, or hit us up on our social media accounts.
Email
info@antaltechjobs.in