Gurugram, Haryana, India
Information Technology
Full-Time
MyCareernet
Overview
Company: Indian / Global Engineering & Manufacturing Organization
Key Skills: Machine Learning (ML), Artificial Intelligence (AI), Python, TensorFlow, PyTorch.
Roles and Responsibilities:
- Design, build, and rigorously optimize the complete stack required for large-scale model training, fine-tuning, and inference, including data loading, distributed training, and model deployment, to maximize Model FLOPs Utilization (MFU) on compute clusters.
- Collaborate closely with research scientists to translate state-of-the-art models and algorithms into production-grade, high-performance code and scalable infrastructure.
- Implement, integrate, and test advancements from recent research publications and open-source contributions into enterprise-grade systems.
- Profile training workflows to identify and resolve bottlenecks across all layers of the training stack, from input pipelines to inference, enhancing speed and resource efficiency.
- Contribute to the evaluation and selection of hardware, software, and cloud platforms that will define the future of the AI infrastructure stack.
- Use MLOps tools (e.g., MLflow, Weights & Biases) to establish best practices across the entire AI model lifecycle, including development, validation, deployment, and monitoring.
- Maintain extensive documentation of infrastructure architecture, pipelines, and training processes to ensure reproducibility and smooth knowledge transfer.
- Continuously research and implement improvements in large-scale training strategies and data engineering workflows to keep the organization at the cutting edge.
- Demonstrate initiative and ownership in developing rapid prototypes and production-scale systems for AI applications in the energy sector.
Experience Requirement:
- 5-9 years of experience building and optimizing large-scale machine learning infrastructure, including distributed training and data pipelines.
- Proven hands-on expertise with deep learning frameworks such as PyTorch, JAX, or PyTorch Lightning in multi-node GPU environments.
- Experience in scaling models trained on large datasets across distributed computing systems.
- Familiarity with writing and optimizing CUDA, Triton, or CUTLASS kernels for performance enhancement is preferred.
- Hands-on experience with AI/ML lifecycle management using MLOps frameworks and performance profiling tools.
- Demonstrated collaboration with AI researchers and data scientists to integrate models into production environments.
- Track record of open-source contributions in AI infrastructure or data engineering is a significant plus.
Education: B.E., B.Tech, B.Tech + M.Tech (Dual), BCA, M.E., M.Tech, MCA.