
Principal Site Reliability Engineer
- Miami, FL
- Permanent
- Full-time
- Reliability Strategy & Resilience Engineering: Design and implement fault-tolerant, scalable, and highly available systems across our AWS-hosted platform to ensure reliability under load and failure conditions.
- Service Ownership & Runbook Maturity: Partner with engineering teams to define and uphold SLIs/SLOs, perform root cause analyses, and drive post-incident reviews with a focus on long-term systemic improvements. Run recurring reliability reviews and mature incident response practices, including alert quality, runbooks, and failure simulations.
- Automation & Tooling: Build and maintain automation for deployment, incident response, and remediation workflows to reduce manual toil and increase operational efficiency.
- Secure Systems Design: Implement DevSecOps practices hands-on, including secure IaC, policy-as-code, and controls embedded in pipelines or platform abstractions.
- Observability & Monitoring: Champion the development of comprehensive observability solutions (metrics, logging, tracing, and alerting) to enable proactive detection and resolution of issues.
- Infrastructure as Code: Contribute to and improve our Terraform-based infrastructure management, enabling consistent, auditable, and repeatable infrastructure deployments.
- Capacity Planning, FinOps & Performance: Lead system tuning, load testing, and capacity forecasting to support our scaling platform and prevent bottlenecks before they occur. Monitor and optimize cloud costs across environments. Design and advocate for architectural trade-offs that balance cost, performance, and reliability.
- Cross-Functional Reliability Coaching: Embed reliability thinking into engineering and product workflows. Run architecture reviews, failure simulations, and training to elevate operational discipline.
- Mentorship & Leadership: Mentor engineers across the organization in SRE best practices, incident response, and reliability design patterns, helping build a culture of ownership and operational excellence across the company.
- Experience: 10+ years in Site Reliability Engineering, DevOps, Infrastructure, or related roles, with a proven track record of improving system reliability and scaling distributed systems in cloud environments (preferably AWS).
- Technical Proficiency: Deep expertise in Infrastructure as Code (Terraform strongly preferred), Kubernetes, and container orchestration at scale; strong background in automation, scripting (e.g., Python, Go, or Bash), and CI/CD pipelines.
- Reliability Engineering Mindset: Experience defining and maintaining SLOs/SLIs, leading incident response and postmortems, and applying SRE principles to reduce toil and improve system reliability. Deep familiarity with chaos engineering, failure mode analysis, and designing systems for graceful degradation under partial failure.
- Observability & Performance: Strong understanding of modern observability stacks (e.g., Datadog, Prometheus, Grafana, OpenTelemetry) and performance tuning for distributed systems.
- Security & Compliance Awareness: Solid understanding of security and compliance in cloud environments, including cloud compliance requirements (SOC 2, ISO 27001, ISO 42001), with experience implementing secure-by-default infrastructure patterns and embedding DevSecOps into delivery workflows.
- Problem Solving: Skilled in diagnosing complex, multi-layered production issues and implementing pragmatic, long-term solutions.
- Influence & Communication: Excellent written and verbal communication skills, with the ability to clearly articulate reliability trade-offs and influence engineering teams toward better operational outcomes. Trusted collaborator with product, infrastructure, security, and go-to-market (GTM) leaders.
- Location: Required to work on-site five days a week in our Miami office (Coral Gables).
- Competitive salary
- 100% individual and dependent medical + dental + vision coverage
- 401(k) with a 4% company match
- 20 days PTO
- Kandji Wellness Week the first week in July
- Equity for full-time employees
- Up to 16 weeks of paid leave for new parents
- Paid Family and Medical Leave
- Modern Health mental health benefits for individuals and dependents
- Fertility Benefits
- Working Advantage Employee Discounts
- Free onsite fitness center
- Free parking
- Lunch 5 days/week
- Exciting opportunities for career growth
- An outstanding, inclusive culture