
Principal Observability Architect
- Orlando, FL
- Permanent
- Full-time
- Lead architecture and roadmap for a multi-region, multi-cloud, multi-tenant observability platform scalable across diverse customer environments and service boundaries.
- Architect near real-time telemetry ingestion pipelines with end-to-end latency guarantees measured in seconds, using a mix of streaming and batch processing technologies.
- Define observability blueprints including telemetry SLAs, data contracts, tenant data isolation, and cost-aware retention strategies for high-cardinality data.
- Ensure observability systems are cloud-native and container-aware, supporting environments built on Kubernetes, service meshes, and serverless components.
- Design and implement real-time metrics, logs, traces, and event pipelines with technologies such as:
  - VictoriaMetrics, Prometheus, Grafana, Alertmanager
  - Cribl Stream and Edge for dynamic routing and filtering
  - VictoriaLogs for structured log analysis
- Embed real-time anomaly detection and signal correlation, with context-aware alerting to reduce noise and MTTR.
- Integrate with alerting and incident response tools (PagerDuty, Slack, ServiceNow) for automated incident routing and contextual enrichment.
- Ensure observability of synthetic probes, end-user transactions, and critical SLOs with per-tenant granularity.
- Standardize OpenTelemetry instrumentation across all services with prebuilt SDKs, language libraries, and semantic conventions.
- Architect OpenTelemetry deployment patterns (agent-based, sidecar, collector pipelines) with support for Kubernetes, Lambda, and edge environments.
- Embed observability validation gates into CI/CD workflows (e.g., GitHub Actions, GitLab CI) to enforce telemetry compliance before production rollout.
- Provide self-service tools, templates, and training to enable developer teams to adopt observability by default.
- Leverage AI/ML for:
  - Real-time anomaly detection and noise suppression
  - Predictive incident detection and impact forecasting
  - Auto-summarization of alert storms and telemetry bursts
  - Multi-tenant root cause and blast radius correlation
- Build or integrate LLM-powered tools that support:
  - Natural language querying of live telemetry
  - AI-assisted debugging and dashboard generation
  - Generative runbooks and incident summaries
- Architect hot and cold telemetry storage pipelines using:
  - VictoriaMetrics and Cribl for hot-path observability
  - Long-term retention in object storage (e.g., S3, GCS) using open formats (Parquet, JSON)
  - Federated querying engines like Trino for historical and cross-service analytics
- Implement cost-aware ETL strategies, balancing real-time visibility with storage and ingestion optimization.
- Incorporate data governance, PII handling, and regulatory and regional data compliance (e.g., GDPR, SOC 2) into telemetry architecture.
- Integrate observability into ITSM and incident response systems (e.g., ServiceNow, Jira):
  - Auto-create incidents enriched with correlated traces, logs, and metrics
  - Provide real-time telemetry context in change and problem management flows
- Deliver customer-facing health dashboards, SLA monitoring, and per-tenant observability insights to support operational excellence and transparency.
- Lead cross-functional collaboration with SRE, Platform, Security, and Engineering teams to evolve observability maturity.
- Define and document observability patterns, anti-patterns, and escalation workflows.
- Drive internal R&D around OpenTelemetry, AI in observability, high-cardinality telemetry, and eBPF-based observability tooling.
- Experience using AI, or thinking critically about how to integrate it, in work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- 10+ years in DevOps, SRE, or Observability roles, including 5+ years in architecture or platform engineering.
- Proven experience designing and operating near real-time observability systems in global-scale SaaS environments.
- Deep expertise in OpenTelemetry (including collector deployment, semantic conventions, sampling strategies).
- Experience integrating observability in Kubernetes, microservices, and serverless ecosystems.
- Hands-on experience with telemetry data pipelines using Cribl, Prometheus/VictoriaMetrics, and log/trace platforms.
- Experience embedding telemetry validation in CI/CD workflows.
- Familiarity with AI/ML for observability (anomaly detection, summarization, impact correlation).
- Working knowledge of data privacy, retention, and compliance practices in observability.
- Experience with Trino, S3 data lakes, and long-term observability analysis.
- Experience building customer-facing observability features (dashboards, SLAs, health status pages).
- Contributions to open-source observability tools or standards.
- Knowledge of, or hands-on experience with, agentic AI systems that drive autonomous remediation, telemetry analysis, or incident response.
- Relevant certifications (e.g., AWS, GCP, Azure, OpenTelemetry, Observability Practitioner).