
Senior Staff DevOps Engineer - Cloud Analytics & FinOps Engineering Platform
- Pleasanton, CA
- Permanent
- Full-time
- Design and implement secure, scalable Kubernetes clusters across AWS EKS, GCP GKE, and Azure AKS supporting complex data platform workloads.
- Architect hybrid cloud infrastructure with unified management and governance, building infrastructure-as-code solutions using Terraform, AWS CDK, and CloudFormation for repeatable deployments (see the CDK sketch at the end of this posting).
- Establish multi-cloud networking including VPC design, cross-cloud connectivity, Transit Gateway configurations, and secure service mesh implementations while navigating ServiceNow enterprise standards and approval processes.
- Implement comprehensive security frameworks across the multi-cloud data platform stack, adhering to enterprise security standards.
- Design identity and access management across cloud providers following the principle of least privilege, orchestrate secrets management using cloud-native solutions, and establish security scanning for container images and infrastructure.
- Ensure compliance with SOC2, FedRAMP, and regulatory requirements while working with security teams to implement platform controls and data governance.
- Design sophisticated CI/CD pipelines using Jenkins, GitHub Actions, TeamCity, and Argo CD for GitOps workflows.
- Manage artifact repositories with automated image scanning and promotion, create Helm charts for complex data platform services (Trino, Airflow, Lightdash, Grafana), and establish automated testing pipelines for infrastructure changes with drift detection and remediation (drift-detection sketch below).
- Architect comprehensive monitoring using Grafana, Prometheus, and CloudWatch with advanced alerting and incident response frameworks.
- Design SLIs/SLOs/SLAs for data platform services with error budget management (error-budget sketch below), establish SRE practices including toil reduction and capacity planning, and create operational dashboards for platform health and performance metrics.
- Implement automated remediation workflows and capacity forecasting with predictive analytics.
- Design secure data ingestion pipelines from disparate systems across multi-cloud and on-premises environments.
- Implement data source connectors for billing systems, ServiceNow internal systems, SaaS platforms, and ML platforms.
- Manage hybrid cloud connectivity and orchestrate complex data workflows using Apache Airflow with high availability across multiple cloud environments (Airflow sketch below).
- Implement automated scaling and resource management across cloud providers.
- Establish Cloud Development Environment (CDE) platform using Coder to provision on-demand development workspaces via Terraform templates for global distributed teams, with enterprise compliance and cost optimization.
- Work within ServiceNow enterprise processes for technology approvals and infrastructure changes.
- Mentor junior engineers across global time zones on SRE best practices, establish operational runbooks for 24/7 platform support with automated incident response, and implement SRE organizational practices including error budget policies and reliability reviews.
- Experience leveraging AI in work processes, decision-making, or problem-solving, or thinking critically about how to integrate it. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- 10+ years of DevOps/Platform engineering experience with large-scale distributed systems in enterprise environments
- Expert-level Kubernetes knowledge across multiple cloud providers (EKS, GKE, AKS) including service mesh and cluster management
- Multi-cloud expertise across AWS, GCP, and Azure with deep understanding of platform strengths and cost models
- Advanced Infrastructure-as-Code experience with Terraform, CloudFormation, and AWS CDK
- Proven CI/CD pipeline management using GitHub Actions, Jenkins, Argo CD, and GitOps workflows in enterprise environments
- Strong security background with cloud security best practices and compliance frameworks (SOC2, FedRAMP)
- Expertise in network security for cloud and Kubernetes environments, including VPC design, zero-trust networking, security policies, firewall rules, VPNs, and intrusion detection/prevention systems
- Enterprise navigation skills with large organization processes and cross-team collaboration
- Bachelor's degree in Computer Science, Engineering, or related technical field
- Full professional proficiency in English
- Multi-Cloud & Container Orchestration: Docker, Kubernetes, and Helm across AWS, GCP, and Azure at enterprise scale with hybrid cloud networking including VPN, Direct Connect, ExpressRoute, and cross-cloud connectivity.
- DevOps & Automation: CI/CD automation, GitOps workflows, Infrastructure-as-Code mastery, and scripting proficiency in Python, Bash, and Go for infrastructure management and toil reduction.
- Data Platform Operations: Experience with Trino/Presto, Apache Airflow, dbt, analytics databases, and Cloud Development Environment platforms including Coder for workspace provisioning.
- Site Reliability Engineering: SLI/SLO design, error budgets, chaos engineering, automated remediation, monitoring with Grafana/Prometheus/ELK stack, and performance engineering for distributed systems.
- Database & Integration: PostgreSQL operations, on-premises integration with legacy systems, and hybrid cloud architectures across cloud and on-prem environments.
- Multi-cloud security architecture with expertise in cloud-native security services, identity and access management across providers (AWS IAM, GCP IAM, Azure AD), enterprise compliance frameworks (SOC2, FedRAMP, PCI-DSS), secrets management, security scanning, and data privacy frameworks including GDPR and CCPA compliance.
- Multi-cloud cost optimization strategies, resource rightsizing with automated scaling, cost allocation models for multi-tenant platforms, FinOps tooling with enterprise budget constraints, and cloud cost negotiation experience with procurement teams.
- Data engineering background with modern data stack technologies
- Service mesh experience with Istio, Linkerd, or cloud-native solutions
- Enterprise platform experience at Fortune 500 companies
- Global team leadership across multiple time zones
- SRE certification or formal training from Google, AWS, or similar programs
- Chaos engineering experience with tools like Chaos Monkey, Litmus, or Gremlin
- Open-source contributions to DevOps, Kubernetes, or SRE tools
- Multi-cloud certifications (AWS Solutions Architect, Google Cloud Architect, Azure Architect, CKA/CKAD, Terraform Associate)
- Performance engineering experience with large-scale distributed systems
- FinOps Certified Practitioner or AWS Certified Cloud Practitioner
- Open-source contributions to data engineering or FinOps tools
- Experience with additional query engines (Spark, Snowflake, BigQuery)
- Data visualization experience with tools like Lightdash, Tableau, or similar
- Patent applications or publications in data systems or cloud technologies
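The sketches below are illustrative only, not additional requirements; they gesture at the kind of work described in the responsibilities above, and every name, version, and figure in them is an assumption. First, a minimal AWS CDK (v2) stack in Python of the sort the infrastructure-as-code responsibility refers to; the stack and construct names (DataPlatformStack, AnalyticsCluster) are hypothetical, and the capacity and Kubernetes version are placeholders.

```python
# Minimal, hypothetical AWS CDK (v2, Python) stack sketching a repeatable
# EKS deployment; names, version, and capacity are illustrative only.
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from constructs import Construct


class DataPlatformStack(Stack):  # hypothetical stack name
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Dedicated VPC spanning up to three availability zones.
        vpc = ec2.Vpc(self, "PlatformVpc", max_azs=3)

        # EKS cluster for data platform workloads; version and default
        # node capacity are placeholder values, not prescriptions.
        eks.Cluster(
            self,
            "AnalyticsCluster",
            vpc=vpc,
            version=eks.KubernetesVersion.V1_28,
            default_capacity=3,
        )


app = App()
DataPlatformStack(app, "data-platform-dev")
app.synth()
```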
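A drift-detection step like the one mentioned in the CI/CD responsibility can be sketched as a thin wrapper around `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 2 when changes (drift) are found, and 1 on error; the remediation hook here is a placeholder.

```python
# Sketch of a drift-detection check built on terraform plan -detailed-exitcode.
# Exit codes: 0 = no changes, 2 = changes (drift) detected, 1 = error.
import subprocess
import sys


def detect_drift(workdir: str) -> bool:
    """Return True if the Terraform state has drifted from the configuration."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        return False   # no changes
    if result.returncode == 2:
        return True    # drift detected
    raise RuntimeError(f"terraform plan failed:\n{result.stderr}")


if __name__ == "__main__":
    workdir = sys.argv[1] if len(sys.argv) > 1 else "."
    if detect_drift(workdir):
        print("Drift detected; triggering remediation workflow (placeholder).")
        sys.exit(2)
    print("No drift detected.")
```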
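The error-budget management mentioned in the SLI/SLO responsibility rests on simple arithmetic; the helper below is a minimal sketch, and the 99.9% target and request counts in the example are made-up numbers.

```python
# Minimal error-budget arithmetic for an availability SLO.
# The target and request counts below are illustrative, not real figures.

def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window (can go negative)."""
    allowed_failures = (1.0 - slo_target) * total  # failures the budget permits
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    # Example: a 99.9% SLO over 10,000,000 requests permits 10,000 failures;
    # 4,000 observed failures leaves 60% of the budget.
    remaining = error_budget_remaining(0.999, 10_000_000, 4_000)
    print(f"Error budget remaining: {remaining:.0%}")
```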
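Finally, a skeleton of an Airflow 2.x DAG of the kind the orchestration responsibility refers to; the DAG id, schedule, and task commands are placeholders.

```python
# Skeleton Airflow 2.x DAG sketching a cross-cloud ingestion workflow;
# dag_id, schedule, and commands are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="billing_ingest_example",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = BashOperator(
        task_id="extract_billing_export",
        bash_command="echo 'pull billing export (placeholder)'",
    )
    load = BashOperator(
        task_id="load_to_warehouse",
        bash_command="echo 'load into analytics store (placeholder)'",
    )

    extract >> load  # simple linear dependency
```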