Site Reliability Engineer (SRE)
Lovelace AI
- Pittsburgh, PA
- Permanent
- Full-time
- Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users.
- Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures.
- Develop and maintain automation scripts, build systems (e.g., Bazel), and infrastructure as code (IaC) using tools like Terraform, Ansible, or CloudFormation to eliminate manual tasks and improve system reliability and efficiency.
- Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset.
- Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime.
- Analyze system performance and recommend optimizations for scalability, reliability, and efficiency.
- Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime.
- Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication.
- Conduct post-incident reviews, documenting problems and their solutions in a shared knowledge base to prevent similar issues from recurring.
- Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.
- Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments.
- Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go).
- Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes).
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack).
- Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation.
- Familiarity with distributed systems and microservices architecture.
- Excellent problem-solving and troubleshooting skills.
- Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives.
- Ability to balance both development and support roles effectively.
- Strong interpersonal and communication skills, with the ability to collaborate effectively across teams.
- Experience working on projects that involve business segments.