
Site Reliability Engineer, Cloud Infrastructure - USDS
- Seattle, WA
- Permanent
- Full-time
- Drive infrastructure automation and tooling: Design, develop, and maintain solutions for efficient operation, optimization, and comprehensive monitoring of global infrastructure, minimizing manual intervention.
- Collaborate on service lifecycle management: Partner with engineering teams to design, deploy, operate, and continuously improve robust and scalable systems and services, from inception to refinement.
- Ensure service reliability and performance: Proactively monitor system health, conduct performance testing, and manage incidents to maximize uptime, availability, and adherence to defined SLAs/SLOs.
- Execute core SRE practices: Perform on-call duties and production operations, including change management, capacity planning, and disaster recovery, while contributing to documentation and process improvements across teams.
Qualifications:
Minimum Qualifications
- Proficient in one or more programming languages (e.g., Python, Go, Java, C++).
- Strong understanding of Linux operating systems and open-source technologies.
- Experience in network architecture and troubleshooting, database modeling, cloud systems, and large-scale distributed systems.
- Knowledge of monitoring tools and methodologies (such as Prometheus and Grafana), AIOps, APM, and disaster recovery.
- Experience in designing, analyzing, and building automation and tools for large-scale systems.
- Experience in building solutions with AWS, GCP, Azure, and other cloud services.
Preferred Qualifications
- Expertise in any of these tech stacks: Kubernetes, Elasticsearch, ClickHouse, Message Queue, OpenTSDB, Service Mesh, MySQL, Redis, etc.
- Master's degree in Computer Science, Engineering, or a related field.
As a condition of employment, all successful candidates must be able to establish authorization to work in the United States. For this position, the Company does not provide sponsorship for any immigration-related benefits.