
Reliability Engineer - Production Support
- Hartford, CT
- $90,320-135,480 per year
- Permanent
- Full-time
- Assists with instrumenting code/application technology stack to enable the generation of relevant metrics on overall technology health - availability, performance, quality, currency, and resiliency. (10%)
- Contributes to the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization. (10%)
- DevSecOps Solution Responsibilities: (10%)
- Build the necessary tooling, alerts, and response mechanisms to identify and address reliability risks leveraging automation to support problem prevention, detection, mitigation, and resolution.
- Enhance the delivery flow by building the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
- Progressively implement preventative controls and build increased automation and self-healing capabilities. Continue to improve cost efficiency baselines.
- Promote and implement innovative solutions
- IT / Data Engineering Responsibilities: (35%)
- Participate in the elimination of toil by creating automation or engineering autonomous solutions that require minimal manual effort (e.g., spanning OS patching, CI/CD, and infrastructure configuration management).
- Ability to build reliable and performant data systems to support data delivery
- Ability to build scalable SDLC environments using COTS, SaaS, and PaaS products to support data pipeline needs
- IT Ops Responsibilities: (35%)
- Promote operational excellence. Participate in triaging and restoring service for all high-impact incidents to minimize mean time to service restoration and impact to the business. Demonstrate end-to-end ownership.
- Partner with infrastructure product teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities, and automated service restoration processes. Take proactive measures to prevent high-impact incidents.
- Achieve and maintain the continuity of Hartford and third-party assets that support a business function. Accountable for keeping the IT application and infrastructure metadata repositories current.
- Promote the reliability (such as availability, capacity, performance) of the solution. Participate in on-call activities to mitigate incidents as quickly as possible. (10%)
- Participate in the development of effective tooling, alerts, and response to both identify and address reliability risks including automatic problem detection and mitigation. (55%)
- Build and operate reliable and performant data systems and services that enable the business to make data-driven decisions
- Engage with the service consumers to define functional and non-functional requirements for the solutions. (10%)
- Contribute training, best practices, and sample code to help consumers get the most out of the solution. (10%)
- Partner with the RE and Software Engineering teams to collect ongoing feedback and improvement backlog items for the Infrastructure Product teams. (10%)
- Leverage analyst reviews, vendor offerings, and client success stories to evolve the portfolio. Participate in relevant vendor / community / industry conferences. (5%)
- Build and maintain governance policies, especially for data masking (PII management) and data lifecycle management
- DevOps Mindset
- Enjoy solving difficult engineering problems and don’t mind getting your hands dirty
- Maintains personal responsibility and commitment to respond to and address incidents quickly
- Good software engineering skills, ideally with experience in Java, Python, .NET, and/or Go.
- Understanding of Linux system internals and familiarity with the TCP/IP stack, network routing, and load balancing
- Approach troubleshooting systematically and have a deep sense of ownership for whatever you work on
- Ability to identify the root causes of instability in a high-traffic, distributed system
- Experience configuring and troubleshooting Linux, Java/Scala, and Docker/Kubernetes systems
- Understanding of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate them going forward
- Knowledge of performance and observability tools such as Dynatrace, Sumo Logic, TrueSight, CloudWatch, CloudTrail, AWS X-Ray, Splunk, and related tools.
- Willingness to work in an ever-changing environment
- Passionate about automation and innovations that improve productivity
- Experience with IaC tools such as Terraform and CloudFormation.
- Degree in Computer Science or related discipline with a minimum of 3-5 years of work experience in IT systems operations and/or application development.
- Some experience in an RE role. Experience building and supporting enterprise contact center platforms, systems, and omnichannel applications (voice, chat, SMS, email, social, etc.)
- Certifications/Licenses/Badges (as applicable): AWS Certified Cloud Practitioner, AWS Certified Developer, Microsoft Certified Azure Fundamentals, Microsoft Certified Azure Developer, AWS Amazon Connect Fundamentals, AWS Amazon Connect Communications Specialist, AWS Amazon Connect Developer,