Overview
We are looking for a Senior Voice AI Engineer to design, build, and optimize the core audio processing pipeline for our voice AI platform. You will work at the intersection of speech processing, real-time systems, and conversational AI—building intelligent voice applications that understand when to listen, who is speaking, and how to manage natural conversations.
This role is ideal for engineers passionate about voice technology who want to shape how humans interact with AI through speech.
What You'll Do
VAD Integration: Implement and fine-tune Voice Activity Detection models to accurately detect speech segments, silence, and noise in real-time audio streams
Speaker Diarization: Build and optimize speaker diarization pipelines to identify and segment multiple speakers in conversations
Smart Turn-Taking: Design intelligent turn-taking mechanisms that enable natural, fluid conversations between users and AI—minimizing interruptions and awkward pauses
Audio Pipeline Architecture: Own the end-to-end audio processing pipeline including noise suppression, echo cancellation, audio segmentation, and streaming inference
Context Management: Develop robust context management systems to maintain conversational state, handle multi-turn dialogues, and enable coherent long-form interactions
Latency Optimization: Optimize pipeline components for low-latency, real-time performance across varying network conditions and device capabilities
Model Evaluation: Establish benchmarks and evaluation frameworks for VAD accuracy, diarization error rate (DER), and turn-taking performance
Cross-Functional Collaboration: Work closely with ML, backend, and product teams to integrate voice capabilities into the broader platform
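To give a flavor of the VAD work described above, here is a deliberately minimal energy-based frame classifier in NumPy. It is only an illustration of the frame-and-threshold flow; production systems in this role would use trained models such as Silero VAD or WebRTC VAD, and the function name, frame size, and threshold below are illustrative choices, not part of our stack.

```python
import numpy as np

def simple_vad(audio, sample_rate=16000, frame_ms=30, threshold=0.02):
    """Label each fixed-size frame as speech (True) or silence (False)
    by comparing its RMS energy against a fixed threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    flags = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))  # per-frame RMS energy
        flags.append(bool(rms > threshold))
    return flags

# 1 s of silence followed by 1 s of a 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = simple_vad(signal, sr)
```

A trained model replaces the RMS threshold with a learned speech-probability score, but the surrounding pipeline (framing, per-frame decisions, segment merging) is the same shape.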
What You Bring
Required
4+ years of hands-on experience in speech/audio processing, voice AI, or conversational AI systems
Strong proficiency in Python; experience with C++ or Rust is a plus
Deep understanding of Voice Activity Detection algorithms and models (WebRTC VAD, Silero VAD, or similar)
Experience with speaker diarization systems and embeddings (x-vectors, d-vectors, ECAPA-TDNN)
Familiarity with speech processing fundamentals: audio signal processing, spectral analysis, feature extraction (MFCCs, mel-spectrograms)
Experience building real-time streaming audio pipelines with low latency requirements
Knowledge of ASR systems (Whisper, Deepgram, AssemblyAI, or similar) and their integration
Understanding of context management patterns for multi-turn conversational systems
Experience with audio frameworks and libraries (PyAudio, librosa, torchaudio, WebRTC)
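As a rough sketch of the speech-processing fundamentals listed above (framing, spectral analysis, mel filterbanks), a minimal log-mel feature extractor in plain NumPy might look like the following. The function names and parameter values are illustrative; in practice candidates would reach for librosa or torchaudio rather than hand-rolling the filterbank.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Frame the signal with a Hann window and take each frame's power spectrum
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular mel filterbank spanning 0 Hz .. Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    # Log compression; small epsilon avoids log(0)
    return np.log(power @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)
```

MFCCs would follow by applying a DCT to each log-mel frame; the framing and filterbank steps shown here are the shared foundation.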
Nice to Have
Experience with end-of-turn detection and barge-in handling
Familiarity with LLM-powered voice agents and orchestration frameworks
Background in telephony systems (SIP, VoIP, Twilio, etc.)
Experience with edge deployment and on-device inference
Contributions to open-source speech/audio projects
Published research in speech processing or conversational AI