Lead Consultant - HPC DevOps Engineer | Antal Tech Jobs
10 Weeks ago

Lead Consultant - HPC DevOps Engineer

Information Technology
AstraZeneca

Overview

Job Title: Lead Consultant - HPC DevOps Engineer

Career level: E

Introduction to role

The Research Data & Analytics Team in R&D IT comprises skilled data and AI engineers and professionals who are dedicated to delivering innovative services and products. Our mission is to transform the way R&D discovers and develops medicine through data, analytics, and AI. We partner with scientific teams to deliver groundbreaking capabilities, products, and platforms that enable scientists to accelerate medicines that are safe and effective for patients.

The Scientific Computing platform (SCP) is a foundational capability for HPC and scaled computing solutions. Embedded within the Research D&A organization, it is central to analytics products focused on computational chemistry, imaging, multi-OMICs, structural biology, data science, and AI. The SCP team is accountable for the end-to-end delivery of high-performance analytics products, with an emphasis on augmenting the HPC experience. We combine modern HPC with a powerful DevOps stack and cloud-native technologies to power research and development at AstraZeneca.

Accountabilities

The Observability Engineer will be responsible for designing, implementing, and managing monitoring and logging systems that ensure high availability, performance, and visibility across the platform’s infrastructure and applications. The ideal candidate should have deep expertise in Prometheus, Grafana, ELK (Elastic Stack), or a similar stack, with a strong understanding of short-term and long-term storage solutions for metrics and logs. Equally important is leadership and coaching experience, to promote best practices throughout the platform.

What you'll do:

Prometheus: Metrics Collection and Storage: Design and manage Prometheus architecture, including identifying high-cardinality metrics and troubleshooting performance issues. Configure short-term and long-term storage solutions using Prometheus-compatible systems (e.g., Thanos, Cortex, or VictoriaMetrics). Implement and optimize Prometheus exporters for collecting custom application metrics. Establish alerting rules in Prometheus and route notifications through Alertmanager.

Grafana: Visualization and Dashboarding: Develop and maintain Grafana dashboards for real-time observability. Integrate Grafana with other systems for unified visualization. Identify key metrics and insights through dashboards for both internal and external consumption.

Management and Insights: Set up and manage logging solutions, and develop relevant dashboards and queries to provide actionable insights. Integrate logging solutions with other observability tools for cohesive monitoring.

Cross-Tool Integration: Implement integrations between Prometheus, Grafana, and logging solutions to create a unified observability platform. Design solutions for correlation of metrics and logs to streamline root cause analysis.

Performance Tuning and Maintenance: Monitor the performance of observability tools and optimize resource utilization. Conduct regular upgrades and maintenance of all observability components.

Collaboration and Documentation: Work with SCP teams and users to define monitoring and logging requirements. Provide leadership and coaching on observability best practices while aiming for simplification. Focus on offering observability as an easy-to-consume service for everyone else. Document observability architecture, workflows, and troubleshooting guides.

Essential Skills/Experience

Technical skills

  • Prometheus: Expertise in Prometheus setup, scaling, and federation. Knowledge of Thanos, Cortex, or VictoriaMetrics for long-term storage. Hands-on experience with PromQL for writing complex queries.

  • Grafana: Proficiency in creating dashboards and integrating with multiple data sources.

  • Logging: In-depth experience with ELK, Splunk, Loki, or similar, covering both query languages and dashboarding.

  • Infrastructure: Hands-on experience managing observability infrastructure in Kubernetes, Docker, or other container technologies.

  • Scripting and Automation: Proficiency in Python, Bash, or similar scripting languages. Experience with Infrastructure as Code tools like Terraform or Ansible.

Soft skills

  • Strong problem-solving and analytical abilities.

  • Excellent communication and collaboration skills to work across teams and with end users.

  • Ability to streamline complex processes and requirements into simple and elegant solutions.

  • Ability to document complex systems clearly and concisely.

Desirable Skills/Experience

  • Familiarity with other observability tools (e.g., Loki, VictoriaMetrics).

  • Certifications: Prometheus Certified Associate.

When we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That's why we work, on average, a minimum of three days per week from the office. But that doesn't mean we're not flexible. We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world.

At AstraZeneca, our work has a direct impact on patients by transforming our ability to develop life-changing medicines. We empower the business to perform at its peak by combining groundbreaking science with leading digital technology platforms and data. Here you can innovate, take ownership, explore new solutions, experiment with innovative technology, and tackle challenges in a modern technology environment.

Ready to make a difference? Apply now!
