
Overview
We are seeking a motivated Software/Platform Engineer with experience in Databricks observability to join our dynamic team in D&A Platforms SRE. The ideal candidate will play a role in maintaining the reliability, availability, and performance of our data infrastructure and applications, using Databricks to ensure smooth operations and efficient performance. You will collaborate closely with development, operations, and data teams to implement best practices in observability and monitoring, enabling a proactive approach to incident management and system optimization.
Key Responsibilities
Reliability and Performance:
- Design, implement, and maintain scalable and reliable systems and services.
- Monitor system performance, availability, and reliability, proactively identifying and resolving issues.
Observability Implementation:
- Apply Databricks observability tools to develop and maintain dashboards, alerts, and reporting mechanisms that provide insights into system performance and usage.
- Establish and improve observability frameworks to monitor key performance indicators (KPIs) and service-level objectives (SLOs).
Incident Management:
- Respond to and resolve production incidents, performing root cause analysis and implementing corrective actions to prevent recurrence.
- Collaborate with cross-functional teams to ensure effective incident response processes and documentation.
Automation and Efficiency:
- Develop automation scripts and tools to streamline operational tasks, improve deployment processes, and enhance system reliability.
- Contribute to the continuous improvement of deployment pipelines and infrastructure as code (IaC) practices.
Collaboration and Documentation:
- Work closely with development teams to understand application architectures and contribute to system design discussions.
- Document processes, best practices, and system architecture to facilitate knowledge sharing and onboarding.
Performance Optimization:
- Analyze system performance and application usage patterns to recommend and implement optimizations that improve efficiency and reduce costs.