Overview
TechGrove is the Centre of Excellence for Banyan Software, based in Chennai, India. It plays a key role in supporting Banyan’s global businesses through technology, security, and software development. TechGrove brings together India’s deep pool of technical talent with Banyan’s long-term approach to growth, creating a trusted, developer-focused environment where people can do their best work.
Senior Site Reliability Engineer (SRE) & Support Lead (Touchstream)
Location: Chennai, India
Reports to: Head of Integrations
Role Type: Hands-on senior individual contributor with support leadership responsibilities
Touchstream is the OTT Operations Hub: a cloud-native SaaS platform for independent, end-to-end monitoring of streaming video systems (CDNs, origin, delivery chain). We serve some of the world’s largest broadcasters, telco/OTT services, and streaming platforms—monitoring tens of thousands of live streams in real time.
Touchstream now unifies its best-selling CDN Monitoring and VirtualNOC into a single platform delivering:
- Unified data & end-to-end visibility across the streaming workflow
- Best-in-class incident intelligence and RCA tooling (including timestamped evidence packs)
- Operating-model improvements via shared views, collaboration, AI MCP Servers and rich knowledge bases
- Business value and ROI reporting for capacity optimization and performance insights
As Senior SRE & Support Lead, you will own production health for Touchstream’s customer-facing platform and data plane, while also leading the global technical support function as part of your SRE responsibilities. Your mission is twofold:
- Reliability ownership: ensure high availability, performance, and change safety across the system (UI/API and ingest, process & query pipelines), with strong SLO discipline and continuous improvement.
- Support leadership: run and evolve the support operation—triage, escalation, incident response coordination, tooling, and (over time) building a strong support team in Chennai to deliver world-class customer outcomes.
This is a highly impactful role at the intersection of SRE, incident management, observability engineering, and customer-facing support.
Responsibilities:
1) Reliability Ownership (Primary)
- Define and maintain SLOs, error budgets, and service health reporting.
- Own availability and performance of:
  - Customer-facing system: UI/API
  - Data plane: ingest, process & query pipelines
- Drive capacity planning for live-event spikes, load testing, and scaling strategies.
- Prevent recurring issues through high-quality RCAs and rigorous follow-through.
- Build and evolve the on-call operating model: severity levels, paging rules, escalation paths, comms templates.
- Lead high-severity incidents end-to-end: triage, mitigation, rollback, “stop the bleeding” decisions, stakeholder comms.
- Track MTTA/MTTR and implement systemic improvements over time.
- Own “who watches the watcher?”—monitoring and alerting for Touchstream’s monitoring pipeline itself.
- Standardize telemetry conventions (logs/metrics/traces) across services.
- Build and maintain dashboards for:
  - ingest health (per customer / per source)
  - pipeline lag
  - query performance
  - alerting health
- Tune alerting to reduce noise: dedupe, routing, “symptom vs cause,” threshold hygiene.
- Implement guardrails: feature flags, progressive delivery/canaries, automated rollback triggers.
- Maintain release readiness practices: migration checks, backfills, customer impact assessment, capacity impacts.
- Drive change metrics: deploy frequency, change failure rate, recovery time from deploys.
- Monitor and optimize cost per GB ingested/stored/queried.
- Enforce retention policies, tiering, sampling, and query limits without breaking customer value.
- Make explicit capacity vs. cost tradeoffs—especially around large live events and heavy dashboards.
- Baseline controls: access reviews, secrets management, least privilege, dependency scanning.
- Rate limiting / abuse guardrails, audit logging, security incident response readiness.
- Backup/restore and lightweight-but-real disaster recovery drills.
- Serve as the senior escalation point for critical customer issues and high-impact outages.
- Own the support operating model:
  - ticket triage, prioritization, SLAs, escalation paths, and shift handovers
  - runbooks, playbooks, FAQs, and knowledge base (including formats suitable for AI-assisted support / RAG)
- Establish and monitor support KPIs (SLA compliance, backlog, customer satisfaction, MTTx) and implement process improvements.
- Partner with Engineering/Product/Integrations to turn support learnings into reliability fixes and product improvements.
- Over time: help build, mentor, and lead a team of support/NOC engineers in Chennai.
- Maintain per-tenant “customer health views”: SLO compliance, noisy sources, top offenders, recurring incident patterns.
- Collaborate with Product on operator workflows: service health panels, incident summaries, status updates.
Requirements:
- 8+ years in SRE, production operations, technical support for SaaS, or NOC/ops roles with strong reliability ownership.
- Strong Linux fundamentals; comfort with debugging distributed systems.
- Strong understanding of cloud infrastructure (AWS and/or GCP) and service operations.
- Experience with monitoring/alerting/logging stacks, incident management, and RCA practices.
- Ability to automate operational work (Python and/or shell scripting); comfort with APIs and CLI tooling.
- Strong understanding of video streaming and delivery concepts: HLS, DASH, CMAF, ABR, CDNs, origin, HTTP, caching, DNS, SSL/TLS. Familiarity with AWS Media Services is a big plus.
- Proven ability to run escalations and communicate clearly in high-pressure incidents.
- Experience designing support workflows, SLAs, escalation paths, and operational KPIs.
- Strong written and verbal English; confidence presenting incident status and RCAs to customers.
- Comfortable with flexible hours to support global customers (overlap with Europe/US time zones as needed).
- Bias for action, continuous improvement mindset, and strong ownership.
Nice to have:
- Prior experience supporting high-scale, always-on streaming events and live operations.
- Experience with progressive delivery, canarying, feature-flag platforms, and release automation.
- Familiarity with IT service management frameworks (e.g., ITIL).
- Security operations exposure (secrets management, vulnerability management, audit logging).
What we offer:
- A senior, high-ownership role shaping reliability + support for a mission-critical observability platform in OTT streaming.
- Direct impact on global broadcasters and streaming services—improving viewer experience at scale.
- Opportunity to build the SRE/support operating model and grow the Chennai support function over time.
- Collaboration with a globally distributed team across engineering, integrations, operations, and product.
Beware of Recruitment Scams
We have been made aware of individuals fraudulently posing as members of our Talent Acquisition team and extending fake job offers. These scams may involve requests for personal information or payment for equipment.
Protect yourself by following these steps:
- Verify that all communications from our recruiting team come from an @banyansoftware.com email address.
- Remember, employers will never request payment or banking information during the hiring process.
- If you receive a suspicious message, do not respond — instead, forward it to careers@banyansoftware.com and/or report it to the platform where you received it.
Your safety and security are important to us. Thank you for staying vigilant.