
Software Engineer
- Atlanta, GA
- Permanent
- Full-time
- Contributes to defining system reliability goals through Service Level Objectives (SLOs) and enhancing production posture with targeted improvements in observability and operability (telemetry, alerting, incident/change management, safe deployment practices).
- Builds reusable automation and processes that help multiple teams meet their reliability goals. With guidance, influences product architecture and roadmaps to ensure customer-experienced reliability is a core design principle.
- Works directly on product code to achieve reliability outcomes. Leverages AI to proactively detect anomalies, predict incidents, and automate operational workflows, scaling reliability efforts across complex systems.
- With guidance, supports the design and development of large-scale distributed software services and solutions. Delivers “best-in-class” engineering by ensuring services are modular, secure, reliable, testable, diagnosable, observable, and reusable.
- Collaborates with internal and external partners to support team goals. Balances pragmatism with vision, driving continuous improvements in process and codebase. Builds automation to prevent or remediate service issues before they impact users.
- Applies cutting-edge AI tools and techniques to reduce operational toil and scale reliability engineering across complex systems.
- Gains a working understanding of Microsoft businesses and contributes to cohesive, end-to-end user experiences.
- Bachelor's Degree in Computer Science or a related technical discipline with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Familiarity with modern distributed software design patterns and cloud systems architecture, including microservices, containers, load balancing, queuing, and caching.
- Experience coding in Python and/or C#, following SOLID principles and leveraging unit/integration testing frameworks.
- Experience deploying cloud-native solutions using Azure or similar cloud service provider technologies.
- Experience in building, shipping, and operating reliable solutions.
- Experience with automated infrastructure provisioning and configuration using IaC tools (e.g., Bicep, Terraform).
- Experience applying prompt engineering to optimize LLM-based workflows for summarization, classification, and decision support scenarios.
- Experience leveraging Azure OpenAI or similar LLMs to build intelligent applications and/or integrate them into services.