
Staff Software Engineer, Machine Learning Operations
- Chicago, IL
- Permanent
- Full-time
- Medical, dental, vision, and life insurance plans with coverage starting on day one of employment, plus 6 free sessions each year with a licensed therapist to support your emotional wellbeing.
- 18 days of paid time off (PTO) annually for full-time employees (accrual prorated based on employment start date) and 6 company holidays per year.
- 6% company contribution to a 401(k) Retirement Savings Plan each pay period, no employee contribution required.
- Employee discounts, tuition reimbursement, student loan refinancing and free access to financial counseling, education, and tools.
- Maternity support programs, nursing benefits, and up to 14 weeks paid leave for birth parents and up to 4 weeks paid leave for non-birth parents.
- Machine Learning Operations & Infrastructure: Build and maintain core infrastructure components (e.g., Kubernetes clusters) and tooling that enables self-service development and deployment of a variety of applications using GitOps practices.
- Machine Learning: Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale. Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance trade‑offs for training and inference.
- Architect multi-cluster/region topologies (e.g., high availability (HA), disaster recovery (DR), failover/federation, blue/green) for ML workloads and lead progressive delivery patterns (canary, auto-rollback) in CI/CD.
- Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor platform users in software development best practices. Evolve CI/CD from repo-local workflows to reusable pipeline templates with quality and performance gates; standardize GitOps objects and guardrails (e.g., Argo CD Applications/Projects, policy-as-code).
- Define org‑wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
- Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
- Institute compatibility, deprecation, and versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
- Own multi‑component roadmap initiatives that measurably move platform & reliability OKRs; communicate major changes and incidents to org‑wide forums and host cross‑team design sessions.
- Partner with teams across the business to enable reliable adoption of ML by hosting internal workshops, publishing playbooks/templates, and advising teams on adopting platform patterns safely.
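
As a concrete illustration of the GitOps responsibilities above, a declarative Argo CD Application is a typical building block; the sketch below is illustrative only, and the application name, repository URL, paths, and project name are placeholders rather than details from this posting.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-inference-service        # placeholder name
  namespace: argocd
spec:
  project: ml-platform              # Argo CD Project scoping RBAC and destinations
  source:
    repoURL: https://example.com/org/ml-platform.git   # placeholder repository
    targetRevision: main
    path: deploy/inference          # placeholder path to manifests/Helm chart
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving           # placeholder target namespace
  syncPolicy:
    automated:
      prune: true                   # remove resources deleted from Git
      selfHeal: true                # revert out-of-band cluster changes
```

An "app-of-apps" pattern extends this by having one Application render a directory of child Applications, which is one common way to standardize the guardrails mentioned above across teams.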
- Bachelor’s degree and 7+ years’ relevant work experience or equivalent staff-level impact in platform / infrastructure roles.
- Possess strong software engineering fundamentals and experience developing production-grade software; experience with Python, Go, or a similar language preferred.
- Experience leading org-wide platform initiatives (e.g., multi‑cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers.
- Strong working knowledge of cloud services, their capabilities, and their usage; AWS preferred.
- Expertise with Infrastructure as Code (IaC) tools and patterns to provision, manage, and deploy applications across multiple environments (e.g., Terraform, Ansible, Helm).
- Deep expertise with GitOps practices and tools (Argo CD app‑of‑apps, RBAC, sync policies) as well as policy‑as‑code (OPA/Kyverno) for safe rollouts.
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK).
- Deep, hands‑on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns).
- Ability to work collaboratively and empathetically in a team environment.
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
- Experience driving machine learning system reliability and awareness of associated requirements (e.g., model/feature drift telemetry, evaluation services, and model‑routing layers integrated with CI/CD).
- You’ve built pragmatic Kubernetes extensions (think small CRDs or admission webhooks), helped teams adopt OpenTelemetry to standardize traces/metrics/logs, and led safe, multi-cluster Kubernetes upgrades with staged rollouts, thorough testing, and clean rollback.
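
To illustrate the model/feature drift telemetry mentioned above, one widely used signal is the population stability index (PSI) between a baseline and a live feature distribution. This is a minimal sketch, not the employer's implementation; the function name and the 0.2 alert threshold are illustrative assumptions.

```python
import math

def population_stability_index(expected, actual):
    """Compute PSI between two binned distributions (lists of bin counts).

    PSI = sum((a_pct - e_pct) * ln(a_pct / e_pct)) over bins.
    A common rule of thumb treats PSI > 0.2 as significant drift
    (an assumed threshold for this sketch, not a universal standard).
    """
    e_total, a_total = sum(expected), sum(actual)
    psi = 0.0
    for e, a in zip(expected, actual):
        # Floor each proportion to avoid log(0) on empty bins.
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return psi

baseline = [100, 200, 300, 400]
# Identical distributions: PSI is 0 (no drift).
print(population_stability_index(baseline, baseline))  # 0.0
# Reversed distribution: large positive PSI, well above 0.2.
print(population_stability_index(baseline, [400, 300, 200, 100]))
```

In a platform setting, a metric like this would typically be exported per feature (e.g., as a Prometheus gauge) and alerted on, rather than printed.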