Aumni - Site Reliability Engineer III - MLOps
JPMorgan Chase
- Salt Lake City, UT
- Permanent
- Full-time
- Guides and assists others in the areas of designing and deploying new AI/ML models in the cloud, gaining consensus from peers where appropriate
- Designs and implements automated continuous integration and continuous delivery pipelines for the Data Science teams to develop and train AI/ML models
- Writes and deploys infrastructure as code for the models and pipelines you support
- Collaborates with technical experts, key stakeholders, and team members to resolve complex technical problems
- Understands the importance of monitoring and observability in the AI/ML space, including service level indicators, and utilizes service level objectives
- Proactively resolves issues before they impact internal and external stakeholders of deployed models
- Supports the adoption of MLOps best practices within your team
- Formal training or certification in site reliability engineering concepts and 3+ years of applied experience
- Understanding of MLOps culture and principles and familiarity with how to implement associated concepts at scale
- Domain knowledge of machine learning applications and technical processes within the AWS ecosystem
- Experience with infrastructure as code tooling such as Terraform or CloudFormation
- Experience with containers and container orchestration tools such as Docker, ECS, and Kubernetes
- Knowledge of continuous integration and continuous delivery tools like Jenkins, GitLab, or GitHub Actions
- Proficiency in the following programming languages: Python, Bash
- Hands-on knowledge of Linux and networking internals
- Understanding of the different roles served by data engineers, data scientists, machine learning engineers, and system architects, and how MLOps contributes to each of these workstreams
- Ability to identify new technologies and relevant solutions to ensure design constraints are met by the Data Science and Machine Learning teams
- Experience with model training and deployment pipelines, and with managing scoring endpoints
- Familiarity with observability concepts and telemetry collection using tools such as Datadog, Grafana, Prometheus, Splunk, and others
- Understanding of data engineering platforms such as Databricks or Snowflake, and machine learning platforms such as AWS SageMaker
- Comfortable troubleshooting common containerization technologies and issues
- Ability to proactively recognize roadblocks and a demonstrated interest in learning technology that facilitates innovation