Principal Site Reliability Engineer

Lowell, MA
Permanent
Full-time

17 days ago

- Architect, develop, and maintain scalable automation, internal tools, health checks, monitoring, and auto-remediation to improve service availability, reliability, latency, scalability, and system resiliency, ensuring services withstand failures and recover gracefully to maintain high availability.
- Lead incident response efforts to minimize customer impact and reduce MTTx, including leading post-incident reviews to identify root causes and implement long-term solutions.
- Provide strategic guidance and design consultation throughout the full service lifecycle, from architecture and capacity planning to production readiness, while establishing and enforcing SRE standards for system architecture, observability, incident response, and reliability metrics.
- Partner closely with product, infrastructure, and engineering teams to integrate reliability goals into the development process.
- Mentor and guide engineers across the organization on reliability principles and best practices, and serve as a reliability evangelist to drive cultural and operational changes that improve engineering velocity.
- Leverage generative AI agents and automation tools to enhance operational efficiency, automate health checks and incident detection and resolution, and drive innovative solutions in site reliability engineering.
- Define, implement, and measure SLIs and SLOs to guide reliability-focused engineering decisions.

- Minimum 8 years of engineering experience, including 5+ years in Site Reliability, DevOps, or Production Engineering roles.
- Advanced proficiency in one or more programming languages (e.g., Python, Go, Java, or C++) with the ability to write production-grade software.
- Strong Linux systems expertise, including scripting, performance tuning, and debugging.
- Hands-on experience operating large-scale distributed systems in public cloud environments, preferably GCP.
- Deep knowledge of Kubernetes and container orchestration patterns in production environments.
- Experience with GitHub Actions and modern CI/CD practices.
- Deep experience with SLI/SLO design, service health instrumentation, and production telemetry.
- Proven ability to build dashboards and alerts using Splunk and Grafana.
- Strong understanding of observability systems, including metrics pipelines, distributed tracing, log aggregation, and alerting strategies and incident triage.
- Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Experience building and supporting highly available, customer-facing systems.
- Experience working with generative AI agents or AI-driven automation tools to support incident management, monitoring, or operational workflows.
- Broad grounding in at least two of the following: Cloud Architecture, Nginx, Security, or Database Technologies.
- Strong troubleshooting skills for complex system issues, with proven experience leading incident response efforts.
- Excellent communication and collaboration skills, with experience mentoring and guiding engineers.
- Experience implementing chaos engineering, load testing, and resilience modeling.
- Google Cloud Professional Architect Certification is a plus.
- Understanding of OpenTelemetry (metrics, tracing, logs) and its integration into observability pipelines.

Ultimate Kronos Group
