
Lead Software Engineer - Remote
- Minnetonka, MN
- $110,200-$188,800 per year
- Permanent
- Full-time
- Ensure the availability, performance, and scalability of critical systems by implementing best practices in site reliability engineering
- Drive the design and evolution of observability systems by building scalable, extensible solutions using OpenTelemetry (OTel) and other modern observability tools
- Champion innovation in monitoring, distributed tracing, and logging strategies to provide deep visibility into system behavior. Continuously evaluate and integrate emerging technologies to improve observability maturity and reduce mean time to detect (MTTD) and mean time to resolve (MTTR)
- Lead and contribute to projects such as performance testing, CI/CD tooling, and infrastructure/application migrations, with a focus on migrating from on-premises to cloud solutions
- Participate in incident response, troubleshooting, and post-mortem analysis to identify root causes and prevent future occurrences
- Develop and maintain automation tools to reduce manual effort, streamline processes, and enhance system reliability
- Work closely with other SREs, engineers, and stakeholders across time zones to align on goals and strategies and to ensure smooth project execution
- Identify opportunities to improve system reliability, performance, and operational efficiency, and implement changes as needed
- Provide guidance and mentorship to junior engineers on the team, fostering a culture of learning and growth
- Leverage AI-powered tools and platforms to enhance observability, incident response, and operational efficiency
- Bachelor's degree in computer science or engineering, or equivalent experience
- 12+ years of experience in Site Reliability Engineering, DevOps, or a similar function
- Experience architecting and implementing observability platforms using OpenTelemetry, Splunk, Grafana, or similar tools
- Experience with CI/CD tools like Jenkins, GitHub Actions, and related automation pipelines
- Knowledge of public cloud platforms, preferably Azure, and expertise in on-premises-to-cloud migrations
- Deep understanding of systems architecture, cloud infrastructure, networking, and automation tools
- Demonstrated ability to innovate in this space, whether by building custom telemetry pipelines, integrating AI/ML for anomaly detection, or developing new approaches to visualize and interpret system health
- Payer claims domain experience with EDI transactions 837, 835, 277, and 271
- AI Tools: Proven exposure to AI tools and their application in SRE workflows for faster delivery and smarter operations
- Proven, solid scripting/programming skills (Python, Go, PowerShell, Bash, etc.) and experience with infrastructure-as-code tools like Terraform and Ansible
- Proven excellent communication and collaboration skills, with the ability to work effectively in a distributed team across time zones