
Member of Engineering, Pre-training/data (Remote)
- USA
- Permanent
- Full-time
Location: Remote (East Coast or EMEA preferred)
Company Stage of Funding: Series B (Series C closing soon, $600M+ raised)
Office Type: Remote
Salary: $240,000 – $400,000 base + highly competitive equity

Company Description
We are representing a frontier AI lab focused on building some of the most capable foundational models in the world. Unlike many AI startups, this company both trains its own large-scale models and ships a developer-facing product, backed by hundreds of millions in venture funding. The team is engineering-first, led by proven leaders from top-tier technology companies, and dedicated to pushing the boundaries of what AI can do for software development.

The company is growing rapidly and seeking world-class engineers to join its data team, where you will help shape the future of AI-powered development.

What You Will Do
- Build and optimize massive-scale pretraining datasets of natural language and source code to improve LLM performance.
- Design, experiment with, and analyze data ablations, data mix optimization, and synthetic data generation techniques.
- Collaborate closely with pre-training, fine-tuning, and product teams to ensure short feedback loops on model quality.
- Stay at the forefront of the latest research in dataset design and LLM pretraining, rapidly iterating on experiments to improve quality.
- Deploy solutions into high-performance distributed data pipelines running on large GPU clusters.

Ideal Candidate Background
- 3+ years of industry experience as a research scientist or engineer.
- Strong background in machine learning AND engineering.
- Proven experience building large-scale pretraining datasets and running experiments such as ablations or mixture modeling.
- Prior hands-on involvement in LLM pretraining, including training models from scratch.
- Familiarity with distributed systems, data pipelines, and large GPU cluster operations.
- Passion for data quality and applied experimentation.

Preferred
- Degree in Computer Science or related technical field.
- Strong programming skills, including Python, plus low-level languages such as C/C++, CUDA, or Triton.
- Experience with DevOps tooling (Git, Docker, Kubernetes, Terraform).
- Author of published research in ML/LLMs.
- Experience generating and working with synthetic data.
- Willingness to travel occasionally (e.g., to Europe for team sessions).

Compensation and Benefits
- Base Salary: $240,000 – $400,000 depending on experience.
- Equity: Highly competitive package.
- Visa Sponsorship: Available for exceptional candidates.
- Remote Work: Flexible, with preference for East Coast U.S. or EMEA time zones.
- Work Environment: Join an engineering-first culture (over 75% of the team are engineers) working alongside leaders from GitHub, Snap, and other top companies.
- Impact: Architect the data pipelines powering foundational models that will define the future of AI-assisted software development.