
Senior System Software Engineer - DevOps and Infrastructure Automation
- Santa Clara, CA
- Permanent
- Full-time
- Build and maintain, from first principles, the infrastructure needed to deliver our growing family of AI inference products, including Dynamo and NIXL.
- Maintain CI/CD pipelines that automate the build, test, and deployment process, and build improvements that address their bottlenecks. Manage tools and automate repetitive manual workflows using GitHub Actions, GitLab, Terraform, etc.
- Enable scanning for and handling of security CVEs in infrastructure components.
- Collaborate extensively with cross-functional teams to integrate pipelines from deep learning frameworks and components, ensuring seamless deployment and inference of deep learning models on our platform.
- Master's degree or equivalent experience
- 3+ years of experience in computer science, computer architecture, or a related field
- Ability to work in a fast-paced, agile team environment
- Excellent Bash and Python programming, CI/CD, and software design skills, including debugging, performance analysis, and test design.
- Experience administering, monitoring, and deploying systems and services on GitHub and cloud platforms. Support other technical teams in monitoring the operating efficiency of the platform and respond as needs arise.
- Highly skilled in Kubernetes and Docker/containerd. Automation expert with hands-on skills in frameworks like Ansible and Terraform. Experience with AWS, Azure, or GCP.
- Knowledge of distributed systems programming.
- Experience contributing to a large open-source deep learning community: use of GitHub, bug tracking, branching and merging code, OSS licensing issues, handling patches, etc.
- Experience in defining and leading the DevOps strategy (design patterns, reliability and scaling) for a team or organization.
- Experience driving efficiencies in software architecture, creating metrics, implementing infrastructure as code and other automation improvements.
- Deep understanding of test automation infrastructure, frameworks, and test analysis.
- Excellent problem-solving abilities spanning multiple areas of software (storage systems, kernels, and containers), employing advanced troubleshooting and debugging techniques to resolve complex technical issues. Experience collaborating within an agile team environment to prioritize deep learning-specific features and capabilities within Triton Inference Server.