Overview
We are seeking a Senior Software Developer to help build and scale an AI/ML platform that powers large-scale model training, evaluation, deployment, and inference. This role is ideal for engineers who enjoy working at the intersection of backend systems, data infrastructure, and machine learning workflows.
You will design platform services that support model lifecycle management, large-scale data processing, and low-latency inference, while collaborating closely with ML researchers, data scientists, and product teams.
Key Responsibilities
• Design and build AI/ML platform services using Java and Python
• Develop backend systems for:
  - Model training pipelines
  - Feature stores
  - Model registry and versioning (a minimal registry sketch follows this list)
  - Experiment tracking
  - Batch and real-time inference
• Build and maintain scalable microservices and APIs for ML workflows
• Enable MLOps best practices: reproducibility, monitoring, and governance
• Optimize systems for performance, scalability, and cost efficiency
• Collaborate with ML engineers to productionize models
• Mentor engineers and lead design reviews
• Ensure platform reliability, security, and observability
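To give a concrete sense of the platform work above, here is a minimal, illustrative Python sketch of a model registry with versioning. It is not a prescribed design: the class names, model name, and artifact URIs are hypothetical, and a production registry on this platform would persist versions in a database and store artifacts in object storage rather than in process memory.

```python
"""Minimal in-memory model registry sketch (illustrative; names are hypothetical)."""
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    artifact_uri: str   # e.g. an object-store path; illustrative only
    metrics: dict
    created_at: datetime


class ModelRegistry:
    """Tracks versioned model artifacts and their evaluation metrics."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ModelVersion]] = {}

    def register(self, name: str, artifact_uri: str, metrics: dict) -> ModelVersion:
        # Each registration appends a new, immutable version record.
        versions = self._versions.setdefault(name, [])
        record = ModelVersion(
            name=name,
            version=len(versions) + 1,
            artifact_uri=artifact_uri,
            metrics=metrics,
            created_at=datetime.now(timezone.utc),
        )
        versions.append(record)
        return record

    def latest(self, name: str) -> ModelVersion:
        return self._versions[name][-1]


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("churn-classifier", "s3://models/churn/1", {"auc": 0.91})
    second = registry.register("churn-classifier", "s3://models/churn/2", {"auc": 0.93})
    print(second.version, registry.latest("churn-classifier").metrics)
```

In practice this behavior would sit behind a versioned service API and integrate with experiment tracking, but the append-only versioning pattern is the core idea.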
Required Qualifications
• 7+ years of software engineering experience
• Strong expertise in Python (ML pipelines, data processing, APIs)
• Strong expertise in Java (platform services, orchestration, scalability)
• Experience building ML platforms or data platforms
• Solid understanding of ML lifecycle (training → validation → deployment → monitoring)
• Experience with distributed systems and microservices
• Strong understanding of data structures, algorithms, and system design
• Experience with SQL and NoSQL databases
• Hands-on experience with cloud infrastructure (AWS / GCP / Azure)
• Experience with Docker and container-based deployments
Preferred / Nice-to-Have Skills
• Experience with MLOps tools (MLflow, Kubeflow, Airflow, Flyte, Argo)
• Knowledge of LLM systems (fine-tuning, inference optimization, embeddings)
• Experience with vector databases (FAISS, Milvus, Weaviate, OpenSearch); a minimal search sketch follows this list
• Familiarity with model monitoring and drift detection
• Experience with streaming systems (Kafka, Pulsar)
• Experience with Kubernetes
• Exposure to GPU workloads and performance optimization
• Understanding of data privacy, model governance, and explainability
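For the vector-database item in this list, the following is a minimal similarity-search sketch using FAISS. The 384-dimensional size and random vectors are assumptions standing in for real embeddings; a managed store such as Milvus, Weaviate, or OpenSearch would wrap the same index-and-query pattern behind a service API.

```python
# Minimal FAISS similarity-search sketch (illustrative only).
# Assumes 384-dimensional embeddings; random vectors stand in for real data.
import numpy as np
import faiss

dim = 384
index = faiss.IndexFlatL2(dim)           # exact L2 search; no training required

corpus = np.random.rand(10_000, dim).astype("float32")
index.add(corpus)                        # index the corpus embeddings

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)  # top-5 nearest neighbours
print(ids[0], distances[0])
```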