
Staff, Site Reliability Engineer
- Bentonville, AR
- Permanent
- Full-time
About Team
The CES team builds best-in-class customer service experiences for hundreds of millions of Walmart customers and customer service agents globally. We are a group of software engineers, data scientists, and machine learning experts pushing the boundaries of GenAI technology in complex enterprise applications. The CES Technology team is part of the Enterprise Business Systems organization in Walmart Global Tech. We partner with our product, business and UX teams to drive significant measurable business impact. Our mission is to help customers save money and live better.
What You Will Do
- Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
- Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
- Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
- Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
- Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
- Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
- Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
- Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
What You Will Bring
- 8+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
- A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
- Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
- Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
- Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
- Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
- A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Benefits
- Multiple health plan options, including vision & dental plans for you & dependents
- Financial benefits including 401(k), stock purchase plans, life insurance and more
- Associate discounts in-store and online
- Education assistance for associates and dependents
- Parental Leave
- Pay during military service
- Paid time off, including vacation, sick leave, and parental leave
- Short-term and long-term disability for when you can't work because of injury, illness, or childbirth