Overview & Expectations
Role Summary
Design, build, lead, and deliver production-grade AI solutions on Azure. Own execution end to end, with accountability for measurable business value, technical depth, governance, and reliability.
Key Outcomes (6–12 months)
• Ship production-grade AI/GenAI solutions with clear ROI, reliability (SLOs), and security.
• Establish engineering standards, CI/CD pipelines, observability, and repeatable delivery patterns.
• Build a reusable AI platform that enables AI applications across multiple domains (paved paths, templates, guardrails).
• Mentor engineers via reviews, playbooks, and hands-on guidance.
Responsibilities
• Translate business problems into well-posed technical specifications and architectures.
• Lead design reviews, prototype quickly, and harden solutions for scale (high QPS / 1M+ users).
• Build automated pipelines (CI/CD) and model/data governance across environments (dev/test/prod).
• Define and track KPIs: accuracy, latency, cost, adoption, and compliance readiness.
• Partner with Product, Security, Compliance, and Ops to land safe-by-default systems.
GenAI + Agentic AI on Azure (must-have focus)
• Implement Azure OpenAI solutions (prompting, evals, fine-tuning where applicable, safety filters).
• Build RAG architectures using Azure AI Search (vector) + curated data sources (SharePoint, SQL, Blob/ADLS, APIs).
• Design agentic workflows (tool use, multi-step orchestration, human-in-the-loop) using combinations of:
o Azure Functions / Durable Functions, Logic Apps, Event Grid, Service Bus
o Frameworks like Semantic Kernel / LangChain (as orchestration layer)
• Implement observability for agent workflows (traces, latency breakdown, failure modes, cost per run).
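As an illustration of the agent observability item above, here is a minimal Python sketch of per-run tracing that records a latency breakdown, failed steps, and cost per run. It uses only the standard library. The names `RunTrace` and `StepRecord` and the flat `usd_per_1k_tokens` rate are hypothetical and chosen for illustration; a production setup would more likely export this data to Application Insights or a similar backend.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    name: str
    latency_s: float
    tokens: int
    ok: bool


@dataclass
class RunTrace:
    # Hypothetical flat rate per 1,000 tokens; real pricing varies by model.
    usd_per_1k_tokens: float = 0.01
    steps: list = field(default_factory=list)

    @contextmanager
    def step(self, name):
        """Time one agent step; the caller reports token usage via the yielded dict."""
        usage = {"tokens": 0}
        start = time.perf_counter()
        ok = True
        try:
            yield usage
        except Exception:
            ok = False  # record the failure, then let the caller handle it
            raise
        finally:
            elapsed = time.perf_counter() - start
            self.steps.append(StepRecord(name, elapsed, usage["tokens"], ok))

    def cost_usd(self):
        """Cost of the whole run under the assumed flat token rate."""
        total_tokens = sum(s.tokens for s in self.steps)
        return total_tokens / 1000 * self.usd_per_1k_tokens

    def latency_breakdown(self):
        """Map each step name to its measured latency in seconds."""
        return {s.name: s.latency_s for s in self.steps}

    def failed_steps(self):
        """Names of steps that raised an exception."""
        return [s.name for s in self.steps if not s.ok]


# Example run with two successful steps and one failing tool call.
trace = RunTrace()
with trace.step("retrieve") as usage:
    usage["tokens"] = 500
with trace.step("generate") as usage:
    usage["tokens"] = 1500
try:
    with trace.step("tool_call"):
        raise RuntimeError("simulated tool failure")
except RuntimeError:
    pass
```

Recording the step in a `finally` clause means failed steps still appear in the latency breakdown, which is usually what is wanted when diagnosing failure modes.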
Technical Skills (Azure-focused)
Platform & Runtime
• Azure Kubernetes Service (AKS), Docker, Helm; Azure Container Registry (ACR)
• API Management, ingress patterns, autoscaling, secure networking (VNet, Private Link)
MLOps
• Azure Machine Learning (pipelines, registries, endpoints), MLflow (tracking/registry)
• CI/CD with Azure DevOps or GitHub Actions, environment promotion, canary/champion-challenger patterns
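To make the canary/champion-challenger pattern concrete, the following Python sketch shows one common way to split traffic deterministically: hash a request or user ID into a bucket from 0 to 99 and send the lowest buckets to the challenger. The function name `route` and the labels `champion` and `challenger` are illustrative assumptions, not part of any Azure API; managed online endpoints offer their own traffic-split settings.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Return 'challenger' for roughly canary_percent% of IDs, else 'champion'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Stable bucket in 0..99 derived from the first 8 bytes of the hash.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "challenger" if bucket < canary_percent else "champion"


def challenger_share(ids, canary_percent):
    """Fraction of the given IDs that would be routed to the challenger."""
    ids = list(ids)
    hits = sum(route(i, canary_percent) == "challenger" for i in ids)
    return hits / len(ids)
```

Hashing the ID rather than drawing a random number keeps each caller on the same model version across requests, which makes comparisons between the two versions easier to interpret.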
Serving
• Azure ML managed online endpoints and/or AKS-based inference
• FastAPI/gRPC-based services; performance tuning for low-latency inference
Data & Feature
• ADLS Gen2, Azure Data Factory, Synapse/Databricks (as applicable)
• Feature store approach (Feast/managed equivalents), batch vs streaming (Event Hubs/Stream Analytics)
Monitoring & Observability
• Azure Monitor, Application Insights, Log Analytics; Prometheus/Grafana where needed
• Model/data drift monitoring and alerting (Azure ML monitoring patterns)
Security & Compliance
• Microsoft Entra ID (formerly Azure AD), RBAC, Managed Identities, Key Vault
• Encryption at rest/in transit, network isolation, audit logging, policy controls
Hands-on programming
• Strong applied coding in Python (plus scripting/automation).
Architecture & Tooling Stack
• Git, branching standards, PR reviews, trunk-based delivery
• IaC: Bicep / Terraform (preferred), policy-as-code, reusable modules
• Registries/lineage/versioning and staged promotions for data/models
• Must have designed and built at least 3 Agentic AI solutions on Azure (end-to-end, production-grade).
Performance, Reliability & Cost
• Define SLAs/SLOs for accuracy, tail latency, throughput, availability
• Capacity planning, autoscaling, load tests, caching, graceful degradation
• Cost controls: instance sizing, reserved/spot strategies, storage tiering
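As a worked illustration of the SLO items above, this Python sketch computes a tail-latency percentile with the nearest-rank method and the remaining error budget for an availability target. The function names and the example thresholds are assumptions for illustration only; actual targets would be set per service.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(total_requests, failed_requests, availability_slo):
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    allowed_failures = total_requests * (1 - availability_slo)
    if allowed_failures == 0:
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


def meets_latency_slo(samples_ms, p, target_ms):
    """True if the p-th percentile latency is at or below the target."""
    return percentile(samples_ms, p) <= target_ms
```

For example, with a 99.9% availability target over 1,000,000 requests, about 1,000 failures are allowed, so 250 failures leave roughly three quarters of the budget.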
Qualifications
• Bachelor’s/Master’s degree or equivalent practical experience
• Proven track record of shipping and operating systems in production
• Must have strong platform engineering experience