
Senior Staff Machine Learning Engineer - DevOps/Site Reliability Engineer
- Santa Clara, CA
- Permanent
- Full-time
- Contribute to the design, development and implementation of infrastructure, platform, deployment and observability features that power AI workloads.
- Collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.
- Contribute to the continuous improvement of the SRE practice by turning operational use cases into requirements for software tooling.
- Contribute to the execution of deployment and support activities for AI/ML developers.
- Build high-quality, clean, scalable and reusable code by enforcing best practices around software engineering architecture and processes (code reviews, unit testing, etc.).
- Work with the product owners to understand detailed requirements and own your code from design and implementation through test automation and delivery of a high-quality product to our users.
- Experience with operating LLMs on NVIDIA GPUs.
- Be a mentor for colleagues and help promote knowledge-sharing.
- Experience leveraging AI, or thinking critically about how to integrate it, in work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- Proficiency in prompt engineering and in developing LLM-based features.
- Experience with methods of training and fine-tuning large language models, such as distillation, supervised fine-tuning and policy optimization.
- 6+ years of experience operating highly-available distributed workloads on Kubernetes following a DevOps approach.
- 6+ years of development experience with Python, Go, Java or similar languages.
- Experience with DevOps tooling (e.g. Helm / Ansible / Kubernetes / Prometheus / Splunk / GitLab CI).
- Strong working experience operating distributed systems built on Linux and J2EE.
- Experience with software-defined networking, infrastructure as code and configuration management.
- Experience building software for compliance and security in regulated environments.
- Ability to drive outcomes in projects with material technical risk.