Overview
Introduction
At IBM Software, we transform client challenges into solutions, building the world’s leading AI-powered, cloud-native products that shape the future of business and society. Our legacy of innovation creates endless opportunities for IBMers to learn, grow, and make an impact on a global scale. Working in Software means joining a team fueled by curiosity and collaboration. You’ll work with diverse technologies, partners, and industries to design, develop, and deliver solutions that power digital transformation. With a culture that values innovation, growth, and continuous learning, IBM Software places you at the heart of IBM’s product and technology landscape. Here, you’ll have the tools and opportunities to advance your career while creating software that changes the world.
Your role and responsibilities
We are seeking a highly skilled AI Engineer to architect, build, and scale intelligent, agentic AI platforms that solve complex infrastructure and operational challenges. This role requires deep expertise in Generative AI, multi-agent orchestration, and production-grade ML systems, with strong emphasis on reliability, performance, and operational excellence.
Key Responsibilities
- Develop Intelligent AI Solutions - Design and deliver advanced NLP and Generative AI solutions, including Retrieval-Augmented Generation (RAG) pipelines, prompt-driven systems, and agentic workflows to solve real-world infrastructure and operational problems.
- Own Critical AI Features End-to-End - Lead the complete lifecycle of LLM-powered applications—such as chatbots, optimization engines, and autonomous assistants—from prompt design and model integration to deployment, monitoring, and iteration.
- Build Agentic & Multi-Agent AI Platforms - Architect and deploy scalable multi-agent systems capable of autonomous reasoning, tool usage, task decomposition, and orchestration across multiple operational domains.
- Multi-Operational Tool Integration - Design and integrate AI agents with multiple operational tools and APIs (monitoring, CI/CD, ticketing, infrastructure, and platform services) to enable autonomous execution, remediation, and decision-making.
- AI-Powered Observability & Autoscaling - Develop AI-driven observability, anomaly detection, root-cause analysis, and autoscaling frameworks for large-scale distributed systems and microservices platforms.
- CI/CD & Platform Integration - Embed AI/ML solutions into CI/CD pipelines, monitoring systems, and platform APIs to support continuous delivery, experimentation, and operational intelligence.
- Cross-Functional Collaboration - Collaborate closely with platform, infrastructure, SRE, and product engineering teams to deliver high-impact, AI-enabled operational experiences.
- Technical Leadership & SME Role - Serve as a subject matter expert across Generative AI, agentic architectures, ML optimization techniques, and large-scale system design.
- Mentorship & Best Practices - Mentor engineers on ML system design, prompt engineering strategies, Python best practices, code quality, testing, and experimentation methodologies.
Required Technical and Professional Expertise
- Agentic AI & Multi-Agent Systems - Strong hands-on experience building agentic and multi-agent systems using frameworks such as LangChain, LangGraph, or equivalent, with a deep understanding of multi-step reasoning, planning, and tool invocation.
- Prompt Engineering & LLM Control - Expertise in prompt engineering, prompt templating, structured outputs, evaluation, and guardrails to ensure reliable, controllable, and secure LLM behavior in production.
- Python & Production Engineering Excellence - Advanced Python skills with strong adherence to best practices, including modular design, async programming, type hints, testing frameworks, performance optimization, and secure coding standards.
- LLM Inference & Performance Optimization - Proven experience optimizing LLM inference at scale using techniques such as KV caching, quantization, batching, and efficient model serving.
- End-to-End ML Systems Ownership - Demonstrated ownership of ML systems across the full lifecycle—from data ingestion and feature pipelines to deployment, monitoring, and continuous improvement.
- Large-Scale Distributed Systems - Experience designing and operating AI systems for large-scale distributed platforms, including multi-agent operations across infrastructure and cloud environments (e.g., MSCP servers and complex operational ecosystems).
Preferred Technical and Professional Expertise
- Experience with AIOps, autonomous remediation, and self-healing systems
- Familiarity with Kubernetes and cloud-native observability stacks
- Prior experience delivering production-grade Generative AI or AI platform services