Overview
Lexsi Labs is one of the leading frontier labs focused on building aligned, interpretable, and safe Superintelligence. Our work spans efficient alignment methods, interpretability-led system design, and scalable AI platforms that operate reliably across enterprise and regulated environments. Our mission is to build AI systems that are powerful, transparent, and production-grade by design.
Our team operates with deep technical ownership, minimal hierarchy, and a strong bias toward building systems that work at scale. At Lexsi.ai, infrastructure is not support work. It is a core product capability.
As a Senior Infrastructure Software Engineer, you will architect, build, and operate the core backend and deployment systems that power the Lexsi AI platform. You will own multi-cloud, serverless, and stateless AI deployments, ensuring our systems scale seamlessly across environments of any size while maintaining correctness, reliability, and cost efficiency.
This role is ideal for someone who thinks like an SDE, a platform engineer, and a DevOps engineer in one, and takes pride in owning systems end-to-end.
Responsibilities
- Design and build Python-based backend services that power core platform functionality and AI workflows.
- Architect and operate AI/LLM inference and serving infrastructure at production scale.
- Build stateless, serverless, and horizontally scalable systems that can run across environments of varying sizes.
- Design multi-cloud infrastructure across AWS, Azure, and GCP with portability and reliability as first-class goals.
- Deploy and manage containerized workloads using Docker, Kubernetes, ECS, or equivalent systems.
- Build and operate distributed compute systems for AI workloads, including inference-heavy and RL-style execution patterns.
- Implement Infrastructure as Code using Terraform, CloudFormation, Pulumi, or similar tools.
- Own CI/CD pipelines for backend services, infrastructure, and AI workloads.
- Optimize GPU and compute usage for performance, cost, batching, and autoscaling.
- Define and enforce reliability standards targeting 99%+ uptime across critical services.
- Build observability systems for latency, throughput, failures, and resource utilization.
- Implement security best practices across IAM, networking, secrets, and encryption.
- Support compliance requirements (SOC 2, ISO, HIPAA) through system design and evidence-ready infrastructure.
- Lead incident response, root cause analysis, and long-term reliability improvements.
- Collaborate closely with ML engineers, product, and leadership to translate AI requirements into infrastructure design.
Required Qualifications
- 3+ years of hands-on experience in backend engineering, platform engineering, DevOps, or infrastructure-focused SDE roles.
- Strong Python expertise with experience building and running production backend services.
- Experience with Python frameworks such as FastAPI, Django, Flask, or equivalent.
- Deep hands-on experience with Docker and Kubernetes in production environments.
- Practical experience designing and operating multi-cloud infrastructure.
- Strong understanding of Infrastructure as Code and declarative infrastructure workflows.
- Experience building and maintaining CI/CD pipelines for complex systems.
- Solid understanding of distributed systems, async processing, and cloud networking.
- Strong ownership mindset with the ability to build, run, debug, and improve systems end-to-end.
Nice to Have
- Experience deploying AI or LLM workloads in production environments.
- Familiarity with model serving frameworks such as KServe, Kubeflow, or Ray Serve.
- Experience running GPU workloads, inference batching, and rollout strategies.
- Exposure to serverless or hybrid serverless architectures for AI systems.
- Experience implementing SLO/SLA-driven reliability and monitoring strategies.
- Prior involvement in security audits or compliance-driven infrastructure work.
- Contributions to open-source infrastructure or platform projects.
- Strong system design documentation and architectural reasoning skills.
What Success Looks Like
- Lexsi’s AI platform scales cleanly across cloud providers and deployment sizes.
- Inference systems are reliable, observable, and cost-efficient under real load.
- Engineering teams ship faster because infrastructure is predictable and well-designed.
- Incidents are rare, understood deeply when they occur, and lead to durable fixes.
- Infrastructure decisions support long-term platform scalability, not short-term hacks.
Next Steps & Interview Process
- Take-home assignment focused on real infrastructure and scaling problems.
- Deep technical interview covering system design, failure modes, and trade-offs.
- Final discussion focused on ownership, reliability mindset, and execution style.
We avoid process theatre. If you can design systems that don’t fall apart under pressure, we’ll move fast.