
Principal Site Reliability Engineer
- New York City, NY
- Permanent
- Full-time
- Lead the design and evolution of Kubernetes-based infrastructure to support multi-tenant, high-scale applications with strong isolation, resilience, and security.
- Architect and optimize CI/CD pipelines to support fast and reliable build, test, and deploy cycles across a polyglot environment.
- Establish and evangelize best practices for GitOps, canary deployments, rollback strategies, and progressive delivery.
- Define and implement scalable Infrastructure as Code (IaC) patterns using tools such as Terraform, Helm, and Crossplane.
- Drive the adoption of automated testing throughout the delivery lifecycle (unit, integration, load, and chaos testing) to ensure high confidence in production changes.
- Guide teams in designing for observability, SLOs, and alerting, ensuring actionable signals and minimizing alert fatigue.
- Partner with security, compliance, and development teams to ensure infrastructure and delivery systems meet modern security and governance standards.
- Lead incident response retrospectives and foster a blameless culture of continuous improvement.
- Mentor and influence senior engineers across multiple teams, helping to up-level platform reliability capabilities organization-wide.
- 8+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles, with 2+ years in a technical leadership or principal capacity.
- Deep expertise with Kubernetes internals (controllers, networking, autoscaling, operators, etc.) and production-grade clusters on cloud providers (EKS, GKE, or AKS).
- Proven experience designing and scaling CI/CD systems using tools such as GitHub Actions, Argo CD, Tekton, Spinnaker, or similar.
- Strong proficiency in Terraform and modern IaC practices.
- Advanced knowledge of automated testing strategies, including performance, load, and failure testing.
- Proficiency in one or more programming or scripting languages (Python, Go, Bash, etc.).
- Deep experience with monitoring and observability stacks such as Prometheus, Grafana, OpenTelemetry, and Datadog.
- Strong communication skills, with the ability to align technical initiatives with business objectives and to influence across engineering teams.
- Experience implementing multi-cluster or multi-region Kubernetes strategies.
- Exposure to chaos engineering and building resilient distributed systems.
- Familiarity with compliance frameworks (SOC 2, HIPAA, etc.) as they relate to infrastructure and deployment.
- Contributions to open-source Kubernetes tooling or SRE frameworks.
- Familiarity with JVM- or Node-based application stacks.