Requirements
• 3+ years of applied or academic experience in speech, multimodal, or LLM research
• Bachelor’s or Master’s degree in Computer Science, AI, or Electrical Engineering
• Strong in Python and scientific computing, including JupyterHub environments
• Deep understanding of LLMs, transformer architectures, and multimodal embeddings
• Experience in speech modeling pipelines: ASR, TTS, speech-to-speech, or audio-language models
• Knowledge of turn-taking systems, VAD, prosody modeling, and real-time voice synthesis
• Familiarity with self-supervised learning, contrastive learning, and Agentic Reinforcement Training (ART)
• Skilled in dataset curation, experimental design, and model evaluation
• Comfortable with tools like Agno, Pipecat, HuggingFace, and PyTorch
• Exposure to LangChain, vector databases, and memory systems for agentic research
• Strong written communication and clarity in presenting research insights
• High research curiosity, independent ownership, and mission-driven mindset
• Currently employed at a product-based organisation
Responsibilities
• Research and develop direct speech-to-speech modeling using LLMs and audio encoders/decoders
• Model and evaluate conversational turn-taking, latency, and VAD for real-time AI
• Explore Agentic Reinforcement Training (ART) and self-learning mechanisms
• Design memory-augmented multimodal architectures for context-aware interactions
• Create expressive speech generation systems with emotion conditioning and speaker preservation
• Contribute to state-of-the-art (SOTA) research in multimodal learning, audio-language alignment, and agentic reasoning
• Define long-term AI research roadmap with the Research Director
• Collaborate with MLEs on model training and evaluation, while leading dataset and experimentation design
Job Details
Location: Hybrid — Mumbai, Bengaluru, Chennai, India
Interview process
• Screening / HR round
• Technical round(s) — coding, system design, ML case studies
• ML / research deep dive
• Final / leadership round