
Site Reliability Engineer
- Wilmington, DE
- Permanent
- Full-time
- Lead and conduct detailed Root Cause Analysis (RCA) for incidents, identifying underlying issues and recommending corrective actions.
- Document and communicate findings from RCA processes, ensuring transparency and knowledge sharing across the organization.
- Develop and maintain incident postmortem reports, providing insights and actionable recommendations to stakeholders.
- Monitor system performance and reliability metrics, proactively identifying potential issues before they escalate.
- Contribute to the design and implementation of automated monitoring and alerting systems to improve incident detection and response times.
- Continuously improve the incident management process, incorporating feedback and lessons learned from RCA activities.
- Participate in incident response activities.
- Bachelor's degree or equivalent experience in a software engineering discipline
- 6+ years of software engineering experience
- Excellent communication skills, with the ability to convey technical findings to both technical and non-technical audiences
- Excellent debugging and troubleshooting skills
- Experience in Site Reliability Engineering, DevOps, or a similar role, with a focus on incident management and RCA.
- Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Dynatrace).
- Familiarity with containerization technologies (e.g., Docker, Kubernetes).
Sr. Tech Recruiter
Email:
Address: 505 Knolle Court, Saint Augustine, FL 32092
Telephone: +1 321-641-0093