
Backend AI Platform Engineer
- San Mateo, CA
- Permanent
- Full-time
- Experience building agent runtime systems, orchestration layers, or reasoning engines.
- Implement agent runtime orchestration services for reasoning, planning, and tool invocation (integrating LangChain/LangGraph).
- Develop APIs and SDKs for building, deploying, and managing agent workflows and tools.
- Build stateful dialog management and memory services to support multi-turn, context-aware agents.
- Implement agent-to-agent (A2A) communication protocols and shared context for collaborative multi-agent behavior.
- Work with the Architect to build cloud-native, event-driven backend services using microservices or service mesh architecture.
- Design task scheduling and workflow orchestration (e.g., Temporal/Airflow) for agent planning and long-running operations.
- Develop real-time message buses (Kafka/Pub-Sub) for agent communication and event propagation.
- Optimize LLM orchestration performance and cost efficiency with caching, batching, and token usage monitoring.
- Implement integrations with enterprise systems, APIs, and knowledge sources (REST, GraphQL, gRPC).
- Work with vector databases (e.g., Elasticsearch, Pinecone, Weaviate, FAISS) and RAG pipelines for knowledge-augmented reasoning agents.
- Support metadata and telemetry pipelines for LangSmith-based evaluation and observability.
- Build secure multi-tenant backend services with RBAC, API authentication, and tenant isolation.
- Ensure agent action auditability and reproducibility for compliance and governance.
- Collaborate on implementing platform guardrails (rate limiting, content filtering, safe tool usage).
- Write automated test suites (unit, integration, load testing) for backend services.
- Implement logging, tracing, and metrics (OpenTelemetry, Prometheus, Grafana) for full observability.
- Participate in on-call rotations and improve platform resiliency and fault tolerance.
- 7+ years of backend engineering experience, including at least 2 years in AI/ML platform development.
- Strong programming skills in Java and Python
- Hands-on experience with LLM/agent frameworks (LangChain, LangGraph, LangSmith)
- Expertise in distributed systems, event-driven architectures, and microservices
- Solid experience with cloud platforms (AWS), Kubernetes, and IaC (Terraform/Helm)
- Familiarity with databases (PostgreSQL, Redis, MongoDB) and vector databases (Elasticsearch, Pinecone, FAISS, Weaviate)
- Understanding of workflow orchestration (Temporal/Airflow) and messaging systems (Kafka, RabbitMQ)
- Knowledge of AI/LLM providers (OpenAI, Anthropic, Azure OpenAI, etc.) and prompt/tool orchestration
- Exposure to conversation orchestration, dialog state tracking, and AI-driven planning systems
- Familiarity with enterprise security standards (RBAC, SSO/OAuth2, data encryption)
- Prior experience in startup or high-growth environments building SaaS platforms
- Contributions to open-source AI agentic frameworks or backend libraries