
Reliability Engineer, AI & Data Platforms
- Austin, TX
- Permanent
- Full-time
- Develop and operate large-scale big data platforms using open source and other solutions.
- Support critical applications including analytics, reporting, and AI/ML apps.
- Optimize platform performance and cost efficiency.
- Automate operational tasks for big data systems.
- Identify and resolve production errors and issues to ensure platform reliability and a good user experience.
- 3+ years of professional software engineering experience, with strong programming skills in languages such as Java, Scala, Python, or Go, preferably with critical, large-scale distributed systems.
- Expertise in designing, building, and operating critical, large-scale distributed systems with a focus on low latency, fault-tolerance, and high availability.
- Proven experience with data processing ecosystems and distributed computing frameworks like Spark or Flink, as well as MPP query engines such as Trino or StarRocks.
- Experience designing and developing stateless APIs (e.g., HTTP) for service-oriented architectures across multi-cloud environments.
- Proficiency with container orchestration (e.g., Kubernetes, Helm), CI/CD pipelines (e.g., GitHub Actions, Jenkins), and infrastructure as code tools (e.g., Terraform, Pulumi).
- Strong troubleshooting and performance analysis skills in complex production environments, with proficiency in Unix/Linux operating systems and command-line tools.
- Experience contributing to open source projects is a plus.
- Experience with multiple public cloud platforms, managing multi-tenant Kubernetes clusters at scale, and debugging Kubernetes/Spark issues.
- Experience with workflow and data pipeline orchestration tools (e.g., Airflow, dbt).
- Understanding of data modeling and data warehousing concepts.
- Familiarity with the AI/ML stack, including GPUs, MLflow, or Large Language Models (LLMs).
- A learning mindset and a drive to continuously improve yourself, the team, and the organization.
- Solid understanding of software engineering best practices, including the full development lifecycle, secure coding, and experience building reusable frameworks or libraries.