
Principal Site Reliability Engineer
- Cambridge, MA
- Permanent
- Full-time
- Architecting, developing, testing, and distributing changes to software, services, and tools the VHP team is responsible for.
- Designing and implementing enhancements to VHP observability infrastructure to identify and correct problems before they impact our customers.
- Developing subject matter expertise in VHP components and mentoring the team.
- Identifying and implementing automation best practices for existing products and processes.
- Collaborating with our support, operations, and engineering teams to investigate and troubleshoot complex problems.
- Participating in on-call rotations, guiding restoration and repair of service-impacting issues.
- Have 12 years of relevant experience and a Bachelor's degree in Computer Science or its equivalent
- Possess expertise in Linux internals, a deep understanding of hardware, and knowledge of best practices for enabling hardware features in Linux.
- Possess advanced-level experience with the Linux kernel and OS, and with optimizing their configurations for KVM/QEMU virtualization.
- Possess expert-level experience with designing, developing, and deploying software and infrastructure at scale
- Have expertise in a DevOps, Development, or SysAdmin role, working with large-scale distributed systems
- Have experience with tools like SaltStack and Ansible for managing infrastructure at scale
- Have excellent communication and interpersonal skills
- Your health
- Your finances
- Your family
- Your time at work
- Your time pursuing other endeavors