Overview
SRE Key Skills
- GCP: BigQuery, Airflow, Cloud Storage
- Observability: ELK + Grafana
- DevOps: CI/CD with GitLab and Jenkins
- Integration background
Key Responsibilities
● Implement and manage the observability stack (metrics, logs, traces and alerts)
to ensure optimal performance and availability
● Analyze observability data to proactively identify performance bottlenecks and
drive reliability improvements
● Define, track and report on Service Level Objectives (SLOs) and Service Level
Indicators (SLIs) for key services
● Identify, develop and implement automation tools to reduce operational toil and
improve system reliability
● Conduct blameless incident postmortems and drive preventive measures
● Collaborate with developers to improve service reliability through better design,
testing, and deployment practices
● Assist developers in troubleshooting complex issues by delving into the available
observability data
● Advocate for SRE best practices within the embedded team and contribute to
wider company SRE initiatives
Your profile
● Hands-on experience managing and using monitoring tools (e.g. ELK,
Grafana, Prometheus)
● 3+ years of experience in a Site Reliability Engineering, DevOps, or Systems
Engineering role
● Experience with CI/CD tooling (e.g. Jenkins, GitLab CI, Argo CD)
● Experience with cloud platforms (preferably GCP)
● Comfortable with at least one scripting language
● Experience working in large, complex production environments
● Excellent problem-solving, communication, and collaboration skills
● GCP: BigQuery, Airflow, Cloud Storage
● Observability: ELK + Grafana