
Platform Site Reliability Engineer
- Phoenix, AZ
- Permanent
- Full-time
- Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
- Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
- Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery.
- Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
- Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning.
- Experience programming platform tools for automation, monitoring, and provisioning.
- Solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, virtual subnets, NACLs, NSGs, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.).
- Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, and Grafana.
- Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR). Experience coordinating teams and individuals to maintain an SLA.
- Ability to troubleshoot, narrow down, and fix incidents with minimal intervention from other functions.
- Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
- Work closely with software engineers to embed reliability and observability into every service.
- Develop automated runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
- Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
- Contribute to security best practices, compliance automation, and cost optimization.
- Minimum of a BS in Computer Science/Engineering.
- 5+ years in an SRE/platform engineering role supporting SaaS platforms.
- Strong hands-on experience with public cloud services (AWS, GCP, Azure).
- Proficiency with Kubernetes, container-based deployment and its related ecosystem (e.g., Helm), and containerized microservices.
- Strong programming or scripting skills (e.g., Python, Go, Bash).
- Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD).
- Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.).
- Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews.
- Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
- Deep understanding of Linux systems, networking, and common troubleshooting practices.
- Experience supporting multi-tenant microservices architectures.
- Familiarity with service mesh, e.g., Istio.
- Knowledge of zero-downtime deployment strategies, such as blue/green and canary releases.
- Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
- Experience with chaos engineering or resilience testing practices.
- 🏖️ Flexible Hours and unlimited vacation (employees have unlimited paid time off on top of the 15 days of holidays we offer), 11 company-paid holidays, and 3 extra days for volunteering.
- 🏡 Hybrid work model that balances office and remote work, with structured onboarding to foster connections and team integration.
- 📚 Free access to professional training platforms to explore your interests and enhance your skills.
- 🍼 Up to 16 weeks of paid leave for birthing parents/primary caregivers, 6 weeks for secondary caregivers.
- 💰 Plan for the future with a 401(k) plan featuring up to 4% company matching contributions, vesting immediately, to grow your retirement savings.
- 📣 Bonuses for referring successful hires after three months of continuous employment.