Principal Observability Architect

Orlando, FL
Permanent
Full-time

1 day ago

Company Description

It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today: ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500®. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. But this is just the beginning of our journey. Join us as we pursue our purpose to make the world work better for everyone.

Job Description

We are seeking a Principal Observability Architect to lead the strategic architecture, evolution, and operationalization of a modern, multi-tenant Observability Platform-as-a-Service (OPaaS) tailored for a hybrid on-prem and cloud-native SaaS product.

You will architect a cloud-agnostic, federated observability platform that supports real-time monitoring, advanced telemetry pipelines, and AI-powered insights to ensure platform reliability, developer productivity, and exceptional customer experiences. This role combines deep technical leadership with a strong focus on developer enablement, platform resiliency, and data governance.

What you get to do in this role:

Platform Architecture & Strategy

Lead architecture and roadmap for a multi-region, multi-cloud, multi-tenant observability platform scalable across diverse customer environments and service boundaries.
Architect near real-time telemetry ingestion pipelines with low-latency guarantees (seconds) using a mix of streaming and batch processing technologies.
Define observability blueprints including telemetry SLAs, data contracts, tenant data isolation, and cost-aware retention strategies for high-cardinality data.
Ensure observability systems are cloud-native and container-aware, supporting environments built on Kubernetes, service meshes, and serverless components.

Real-Time Monitoring & Detection

Design and implement real-time metrics, logs, traces, and event pipelines with technologies such as:

VictoriaMetrics, Prometheus, Grafana, Alertmanager
Cribl Stream and Edge for dynamic routing and filtering
VictoriaLogs for structured log analysis
Embed real-time anomaly detection and signal correlation, with context-aware alerting to reduce noise and MTTR.
Integrate with alerting and incident response tools (PagerDuty, Slack, ServiceNow) for automated incident routing and contextual enrichment.
Ensure observability of synthetic probes, end-user transactions, and critical SLOs with per-tenant granularity.

Instrumentation, Developer Enablement & CI/CD Integration

Standardize OpenTelemetry instrumentation across all services with prebuilt SDKs, language libraries, and semantic conventions.
Architect OpenTelemetry deployment patterns (agent-based, sidecar, collector pipelines) with support for Kubernetes, Lambda, and edge environments.
Embed observability validation gates into CI/CD workflows (e.g., GitHub Actions, GitLab CI) to enforce telemetry compliance before production rollout.
Provide self-service tools, templates, and training to enable developer teams to adopt observability by default.

AI for Observability & Productivity

Leverage AI/ML for:

Real-time anomaly detection and noise suppression
Predictive incident detection and impact forecasting
Auto-summarization of alert storms and telemetry bursts
Multi-tenant root cause and blast radius correlation
Build or integrate LLM-powered tools that support:

Natural language querying of live telemetry
AI-assisted debugging and dashboard generation
Generative runbooks and incident summaries

Data Platform Architecture

Architect hot and cold telemetry storage pipelines using:

VictoriaMetrics and Cribl for hot-path observability
Long-term retention in object storage (e.g., S3, GCS) using open formats (Parquet, JSON)
Federated querying engines like Trino for historical and cross-service analytics
Implement cost-aware ETL strategies, balancing real-time visibility with storage and ingestion optimization.
Incorporate data governance, PII handling, and regional data compliance (e.g., GDPR, SOC 2) into telemetry architecture.

SaaS Operations & ITSM Integration

Integrate observability into ITSM and incident response systems (e.g., ServiceNow, Jira):

Auto-create incidents enriched with correlated traces, logs, and metrics
Provide real-time telemetry context in change and problem management flows
Deliver customer-facing health dashboards, SLA monitoring, and per-tenant observability insights to support operational excellence and transparency.

Technical Leadership

Lead cross-functional collaboration with SRE, Platform, Security, and Engineering teams to evolve observability maturity.
Define and document observability patterns, anti-patterns, and escalation workflows.
Drive internal R&D around OpenTelemetry, AI in observability, high-cardinality telemetry, and eBPF-based observability tooling.

Qualifications

To be successful in this role you have:

Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
10+ years in DevOps, SRE, or Observability roles, including 5+ years in architecture or platform engineering.
Proven experience designing and operating near real-time observability systems in global-scale SaaS environments.
Deep expertise in OpenTelemetry (including collector deployment, semantic conventions, sampling strategies).
Experience integrating observability in Kubernetes, microservices, and serverless ecosystems.
Hands-on experience with telemetry data pipelines using Cribl, Prometheus/VictoriaMetrics, and log/trace platforms.
Experience embedding telemetry validation in CI/CD workflows.
Familiarity with AI/ML for observability (anomaly detection, summarization, impact correlation).
Working knowledge of data privacy, retention, and compliance practices in observability.

Nice to Have:

Experience with Trino, S3 data lakes, and long-term observability analysis.
Experience building customer-facing observability features (dashboards, SLAs, health status pages).
Contributions to open-source observability tools or standards.
Knowledge of or hands-on experience with Agentic AI systems to drive autonomous remediation, telemetry analysis, or incident response.
Relevant certifications (e.g., AWS, GCP, Azure, OpenTelemetry, Observability Practitioner).

GCS-23

Additional Information

Work Personas

We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.

Equal Opportunity Employer

ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements.

Accommodations

We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact for assistance.

Export Control Regulations

For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities.

From Fortune. ©2025 Fortune Media IP Limited. All rights reserved. Used under license.

ServiceNow