Principal Site Reliability Engineer

Miami, FL
Permanent
Full-time

About Kandji

Kandji is the Apple device management and security platform that empowers secure and productive global work. With Kandji, Apple devices transform themselves into enterprise-ready endpoints, with all the right apps, settings, and security systems in place. Through advanced automation and thoughtful experiences, we're bringing much-needed harmony to the way IT, InfoSec, and Apple device users work today and tomorrow.

Some of the smartest money in tech has partnered with Kandji to realize our vision, including Tiger Global, Felicis, Greycroft, First Round Capital, and Okta Ventures. In July 2024, Kandji raised $100 million in capital from General Catalyst, bringing Kandji's valuation to $850 million.

Since Kandji's Series C in 2021, the company has seen a 600%+ increase in annual recurring revenue, and its customer base has grown nearly 4X across 40+ industries. Notable customers include Allbirds, Canva, and Notion, and the company has partnerships with such industry giants as ServiceNow, AWS, and Okta.

Kandji was also named to Forbes' Next Billion Dollar Startup List 2023 and recognized as a top venture-backed startup with the potential to reach unicorn status.

As a Principal Site Reliability Engineer at Kandji, you will play a critical role in ensuring the reliability, scalability, and performance of our platform. In this strategic position, you'll work cross-functionally to build and evolve the systems, tools, and processes that keep our services resilient and performant, especially as we scale to meet the demands of a growing customer base.

You'll bring a deep understanding of distributed systems, incident management, observability, and automation. Your experience with AWS, Kubernetes, and Infrastructure-as-Code (Terraform preferred) will help drive efforts to proactively identify and eliminate reliability risks, reduce toil through automation, and establish engineering best practices across teams.

We're looking for a seasoned engineer with both technical depth and a strategic mindset: someone who can guide long-term reliability efforts, lead postmortems and systemic remediation, and mentor others in SRE principles. This role provides the opportunity to shape the culture and architecture of reliability at Kandji, partnering closely with engineering, infrastructure, and product teams to build systems that are not only functional, but fault-tolerant and maintainable.

How You Will Make a Difference Day to Day:

Reliability Strategy & Resilience Engineering: Design and implement fault-tolerant, scalable, and highly available systems across our AWS-hosted platform to ensure reliability under load and failure conditions.
Service Ownership & Runbook Maturity: Partner with engineering teams to define and uphold SLIs/SLOs, perform root cause analyses, and drive post-incident reviews with a focus on long-term systemic improvements. Run recurring reliability reviews, and mature incident response practices including alert quality, runbooks, and failure simulations.
Automation & Tooling: Build and maintain automation for deployment, incident response, and remediation workflows to reduce manual toil and increase operational efficiency.
Secure Systems Design: Implement DevSecOps practices hands-on, including secure IaC, policy-as-code, and controls embedded in pipelines or platform abstractions.
Observability & Monitoring: Champion the development of comprehensive observability solutions (including metrics, logging, tracing, and alerting) to enable proactive detection and resolution of issues.
Infrastructure as Code: Contribute to and improve our Terraform-based infrastructure management, enabling consistent, auditable, and repeatable infrastructure deployments.
Capacity Planning, FinOps & Performance: Lead system tuning, load testing, and capacity forecasting to support our scaling platform and prevent bottlenecks before they occur. Monitor and optimize cloud costs across environments. Design and advocate for architectural trade-offs that balance cost, performance, and reliability.
Cross-Functional Reliability Coaching: Embed reliability thinking into engineering and product workflows. Run architecture reviews, failure simulations, and training to elevate operational discipline.
Mentorship & Leadership: Mentor engineers across the organization in SRE best practices, incident response, and reliability design patterns, helping build a culture of ownership and operational excellence across the company.

We'd love to hear from you if you have:

Experience: 10+ years in Site Reliability Engineering, DevOps, Infrastructure or related roles, with a proven track record of improving system reliability and scaling distributed systems in cloud environments (preferably AWS).
Technical Proficiency: Deep expertise in Infrastructure as Code (Terraform strongly preferred), Kubernetes, and container orchestration at scale; strong background in automation, scripting (e.g., Python, Go, or Bash), and CI/CD pipelines.
Reliability Engineering Mindset: Experience defining and maintaining SLOs/SLIs, leading incident response and postmortems, and applying SRE principles to reduce toil and improve system reliability. Deep familiarity with chaos engineering, failure mode analysis, and designing systems for graceful degradation under partial failure.
Observability & Performance: Strong understanding of modern observability stacks (e.g., Datadog, Prometheus, Grafana, OpenTelemetry) and performance tuning for distributed systems.
Security & Compliance Awareness: Solid understanding of security and compliance in cloud environments, with experience implementing secure-by-default infrastructure patterns. Familiar with secure infrastructure design, cloud compliance requirements (SOC 2, ISO 27001, ISO 42001), and embedding DevSecOps into delivery workflows.
Problem Solving: Skilled in diagnosing complex, multi-layered production issues and implementing pragmatic, long-term solutions.
Influence & Communication: Excellent written and verbal communication skills with the ability to clearly articulate reliability trade-offs and influence engineering teams toward better operational outcomes. Trusted collaborator with product, infra, security, and GTM leaders.
Location: Required to work on-site five days a week in our Miami office (Coral Gables).

Benefits & Perks

Competitive salary

100% individual and dependent medical + dental + vision coverage

401(k) with a 4% company match

20 days PTO

Kandji Wellness Week the first week in July

Equity for full-time employees

Up to 16 weeks of paid leave for new parents

Paid Family and Medical Leave

Modern Health - Mental Health Benefits - Individual and Dependents

Fertility Benefits

Working Advantage Employee Discounts

Free onsite fitness center

Free parking

Lunch 5 days/week

Exciting opportunities for career growth

An outstanding, inclusive culture

We are excited to be serving a significant need for a fast-growing market, and are proud of the high-performing team we have brought together so far. If you're someone who wants to engage in new, exciting projects that will challenge your skills in the best way possible, we would love to connect with you.

At Kandji we believe in fostering an inclusive environment in which employees feel encouraged to share their unique perspectives, leverage their strengths, and act authentically. We know that diverse teams are strong teams, and welcome those from all backgrounds and varying experiences.

Kandji is proud to be an equal opportunity employer committed to diversity and inclusion in the workplace. Qualified applicants will be considered for employment without regard to race, color, religion, national origin, age, sex, sexual orientation, gender identity, physical or mental disability, protected veteran or military status or any other status protected by applicable law.
