
Senior Site Reliability Engineer
- Austin, TX
- Permanent
- Full-time
- Design and maintain resilient, automated infrastructure in private cloud environments.
- Lead incident response efforts, including communication and follow-ups during major incidents.
- Drive initiatives to reduce recurring issues and enhance both availability and recovery.
- Define and enforce best practices for observability, incident management, and production readiness to ensure optimal performance and reliability.
- Lead improvements in Infrastructure as Code and CI/CD tooling.
- Collaborate with product and application teams to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) while addressing reliability risks.
- Advocate for automation and operational efficiency.
- Contribute to the reliability roadmap and engage in architecture discussions.
- Mentor engineers and promote a culture of learning and ownership.
- Participate in the on-call rotation and work to improve its effectiveness.
- 5+ years of experience as a Site Reliability Engineer (SRE).
- Mastery of managing Linux systems in production environments, with the ability to teach others effectively.
- Proficient with Kubernetes and similar container orchestration systems, at a depth of expertise that matches the Linux administration skills described above.
- Advanced proficiency in scripting and programming, particularly in Python, Bash, and Golang.
- Prior experience mentoring engineers or leading reliability initiatives.
- Experience in building automation tools to reduce operational toil and improve service availability.
- Proven experience leading incident response and conducting root cause analysis.
- Experience maintaining and using Prometheus (or VictoriaMetrics), Grafana, and other observability tools for metrics-based alerting (including SLOs and error budgets) to support incident resolution.
- Strong understanding of advanced networking and security practices.
- Ability to identify reliability gaps and implement scalable solutions.
- Experience improving on-call rotations and alert systems.
- Understanding of system-level performance tuning and capacity planning.
- Medical Benefits: We offer a competitive medical plan, and the company offsets premiums.
- Taco Tuesdays: Like breakfast tacos? You’re in the right place, because breakfast tacos are provided weekly.
- 401k plan with company match!
- Weekly Lunch: Food is love. Especially when it is free.
- Snacks: You will never go hungry.
- Culture: Innovation drives our vibe.
- Diversity: We embrace our global presence and the diverse ideas and backgrounds of our team to improve our culture and our products and to grow our people and our business.
- Unlimited PTO: We value our employees’ work/life balance and want you to take the time off you need.