Foundations Site Reliability Engineer

Vinsys Information Technology Inc

Seattle, WA
Permanent
Full-time

12 hours ago

SRE | Foundations | Site Reliability Engineer (Contract)

About this team

Site Reliability Engineering

We are looking for a motivated engineer to join the Foundations team, which is responsible for observability and monitoring in Site Reliability Engineering and guides the digital organization in improving the practice of reliability here. We are a consultative enablement team providing guidance and support to product engineering teams for the development of high-quality and resilient software systems through the use of monitoring tools and practices. SRE partners with many product engineering teams across digital and beyond to infuse the concepts and practices of reliability into engineering processes and deliverables. The Foundations team owns the management of our monitoring tools and the best practices for using those tools to provide total visibility into our systems. This role requires a vision and strategy for monitoring and how to manage it across a disparate organization.

As an SRE Engineer you will be responsible for designing, implementing, and maintaining robust monitoring solutions, creating insightful dashboards, identifying relevant metrics, and driving efficient problem management practices. You will help identify observability maturity opportunities and roadblocks to success for digital teams, and help clear those roadblocks. You will partner closely with Product Owners and Scrum Masters to manage scope and strike a balance between support and investment work. You are expected to clearly communicate risks to deliverables to your partners.

A day in the life

As an Engineer II on the SRE Foundations team, you are a technical contributor and domain leader in observability and reliability.
Your day-to-day responsibilities include:

Observability & Monitoring
Design, implement, and optimize observability solutions across metrics, logging, and tracing.
Build and maintain dashboards and alerts (e.g., Datadog) that provide meaningful insight into system health and performance.
Define and support adoption of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

Incident & Problem Management
Participate in and lead incident response efforts during major outages and critical events.
Support on-call rotations, particularly during key business events (e.g., product launches, holiday traffic).
Conduct and contribute to Root Cause Analyses (RCAs) and post-incident reviews, driving follow-up actions and long-term remediation plans.
Collaborate with partner teams to enhance incident playbooks, reduce mean time to detect (MTTD) and mean time to resolve (MTTR), and improve operational readiness.
Apply principles of the ITIL framework in areas such as incident, problem, and change management, ensuring alignment with organizational reliability goals.

Team Collaboration & Enablement
Partner with digital product teams to integrate observability best practices into their development and deployment workflows.
Identify tooling and knowledge gaps; champion improvements and automation initiatives that reduce toil and increase visibility.
Support product owners and engineering leads with prioritization between support, investment, and innovation work.
Mentor junior team members and advocate for team-wide knowledge sharing and continuous improvement.

Continuous Improvement & Strategic Contribution
Stay up to date with SRE and observability trends, helping to evaluate and adopt new tools and approaches.
Contribute to domain-level standards and practices within the broader technology organization.
Influence reliability strategy by sharing insights, performance metrics, and "what's working/what's not" feedback with senior engineers and technical leadership.
Qualifications
Bachelor's degree in Computer Science, Engineering, or equivalent experience.
8-12 years of experience in software engineering or SRE, with deep exposure to observability and monitoring.
Strong experience with observability tools such as Datadog, Splunk, and distributed tracing frameworks.
Proven track record in incident management, RCA facilitation, and on-call response - especially during critical peak traffic events.
Understanding of ITIL concepts including Incident, Problem, and Change Management.
Experience building and maintaining dashboards, alerts, and SLOs/SLIs.
Strong debugging and root cause analysis skills across complex distributed systems.
Excellent collaboration, documentation, and communication skills.
Familiarity with infrastructure-as-code (e.g., Terraform), Kubernetes, and cloud-native systems.
Relevant certifications (e.g., Certified Kubernetes Administrator, Terraform Associate) are a plus.

Bonus
Deep expertise in observability tooling (Datadog, Splunk).
Prior experience in e-commerce or high-availability digital platforms.
Background in product ownership or leading reliability-focused initiatives.

Must haves
Acknowledges the presence of choice in every moment and takes personal responsibility for their life.

Required Skills: Terraform, Kubernetes, Splunk
Basic Qualification:
Additional Skills:
Background Check: No
Drug Screen: No