
Overview
Opportunity
We are looking for a highly skilled and experienced Site Reliability Engineer (SRE) who will play a key role in transforming reliability engineering through AI-based innovation—while bringing deep expertise in core SRE practices.
This role is not just about applying AI; it’s about being a hands-on SRE first—someone who understands real-world operational pain points and knows how to drive systemic improvements through automation, observability, and intelligent tooling. You'll play a key role in institutionalizing SRE best practices and routines and embedding intelligence-driven operations into the engineering culture.
Your efforts will directly contribute to unifying reliability efforts across teams, enabling consistent engineering standards, and fostering a shared accountability model for service health. By driving operational discipline and aligning reliability goals with business priorities, you will help create a culture where platform stability, developer productivity, and customer experience go hand in hand. These contributions will play a vital role in supporting the organization's broader strategy—enabling faster innovation, scalable growth, and a resilient technology foundation aligned with long-term business outcomes.
Key Responsibilities
- Drive strategic initiatives to transform SRE capabilities through AI/ML innovation—while setting the vision for reliability engineering and operational excellence.
- Leverage AI and machine learning technologies to architect and oversee solutions that advance the overall SRE agenda—improving reliability, automation, observability, and operational efficiency across complex systems.
- Own, govern, and continuously improve incident management, change management, and release processes to ensure the highest levels of stability, safety, and velocity.
- Lead and champion key SRE practices and routines—driving organization-wide adoption of SRE Community of Practice (CoP), SLA/SLO alignment, error budget governance, and data-driven process optimization.
- Guide and influence cross-functional teams, including SREs and platform engineers, to develop reliable, scalable AI/ML tools and frameworks.
- Oversee engineering strategies that improve service reliability, availability, and performance at scale.
- Define, build, and evangelize internal frameworks and tooling to accelerate AI/ML adoption across all reliability domains.
- Lead Zero-Touch Operations initiatives and roadmap, empowering platforms for autonomous issue detection and resolution.
- Leverage advanced metrics, telemetry, and incident data analytics to inform strategic decisions and build enterprise-grade reliability dashboards.
- Own on-call strategy, escalation policies, and incident response governance across teams.
- Drive security integration across all reliability workflows, leading vulnerability management, compliance, and collaboration with security leadership.
- Shape and own the AI-in-SRE strategic vision—serving as a thought leader and mentor to the entire SRE organization.
Required Skills & Qualifications
- Extensive experience (5+ years) as a senior SRE, Platform Engineer, or DevOps Engineer responsible for large-scale, complex distributed systems, with a strong understanding of AI/ML fundamentals and hands-on experience applying AI-powered tools.
- Automation-First Mindset: Demonstrated ability to drive end-to-end automation across incident response, change/release workflows, observability, and daily operations. Strong “never do it twice manually” attitude with a proven track record of eliminating toil through intelligent tooling, scripting, and systematic process optimization.
- Expert-level programming and scripting skills (Python, Go, or similar) with experience designing automation at scale.
- AI-Accelerated Mindset: You actively leverage modern AI tools (e.g., LLMs) to boost productivity, streamline development workflows, and augment traditional engineering tasks—demonstrating a willingness to adapt and innovate with evolving technologies.
- Mastery of core SRE principles including SLIs/SLOs, incident management governance, root cause analysis, scalability, fault tolerance, and capacity planning.
- Proven leadership in incident, change, and release management, driving automation, auditability, and continuous service reliability improvements.
- Strategic ability to establish and evangelize reliability frameworks, rituals, and operational excellence aligned with enterprise-wide goals.
- Deep expertise in cloud architectures (AWS, GCP, Azure), container ecosystems (Docker), and orchestration platforms (Kubernetes).
- Advanced knowledge of observability systems and the ability to architect enterprise-grade monitoring and alerting solutions.
- In-depth understanding of Linux/Unix internals, performance optimization, and complex OS-level production troubleshooting.
- Strong grasp of networking, security best practices, vulnerability management, and compliance requirements.
- Experience influencing cross-team collaboration and mentoring junior engineers in SRE practices.
Preferred Skills
- LLM-Native Development Approach: Proficiency in using LLM-powered tools for research, automation, or code generation. Experience building custom AI-assisted automations or tools that deliver measurable engineering efficiency gains.
- Statistical Quality Verification: Hands-on experience with experimental design, statistical analysis, and scripting to measure the impact of system changes. Familiarity with confidence intervals, significance testing, and frameworks for validating probabilistic AI/ML models.
Maersk is committed to a diverse and inclusive workplace, and we embrace different styles of thinking. Maersk is an equal opportunities employer and welcomes applicants without regard to race, colour, gender, sex, age, religion, creed, national origin, ancestry, citizenship, marital status, sexual orientation, physical or mental disability, medical condition, pregnancy or parental leave, veteran status, gender identity, genetic information, or any other characteristic protected by applicable law. We will consider qualified applicants with criminal histories in a manner consistent with all legal requirements.
We are happy to support your need for any adjustments during the application and hiring process. If you need special assistance or an accommodation to use our website, apply for a position, or to perform a job, please contact us by emailing