
AI Cluster Validations For Distributed Training and Inference Engineer
- San Jose, CA
- Permanent
- Full-time
- Work with AMD’s architecture specialists to validate AI solutions for distributed training and inference workloads on AMD's ROCm software
- Build cluster scale automation for distributed training and inference workloads
- Publish reference designs and benchmark numbers for AI workloads
- Apply a data-driven approach to target optimization efforts
- Design and develop groundbreaking new AMD technologies
- Participate in new ASIC and hardware bring-ups
- Develop technical relationships with peers and partners
- Strong experience with complex compute systems used in AI and HPC deployments, including backend network designs for RDMA clusters
- Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmark tests such as IB perf and RCCL/NCCL benchmarks
- Experience running training of LLMs, MoE models, image-generation models, and recommendation models with frameworks such as PyTorch, TensorFlow, Megatron-LM, and JAX, and running training performance benchmarks
- Experience running inference workloads on AI clusters with inference frameworks such as vLLM and SGLang, and running inference performance benchmarks
- Experience with distributed systems and schedulers such as Kubernetes and Slurm
- Ability to write high-quality automation frameworks and scripts in Python or Golang
- Experience with performance profiling of CPUs and GPUs, and with debugging complex compute, network, and storage problems
- Experience with AMD ROCm is an added advantage
- Experience with Linux and Windows operating systems
- Effective communication and problem-solving skills
- Bachelor’s or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent