Human Data Evals Lead
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Human Data Evals Lead based in the United States.
About this role
This role sits at the core of frontier AI data operations, owning how high-quality evaluation datasets and benchmarks are designed, validated, and delivered to leading AI labs. You will be responsible for translating ambiguous evaluation needs into structured, high-signal data proposals and production-ready sample packages that demonstrate model performance with rigor and clarity. The work blends technical judgment, quality design, and commercial awareness, requiring close collaboration with subject-matter experts and research stakeholders. You will shape how “frontier-grade” quality is defined and enforced, ensuring every dataset meets the standards expected by advanced model developers. Acting as a key interface with AI lab partners, you will help convert pilots into scaled production engagements. This is a high-ownership role at the intersection of AI evaluation, data quality, and applied research operations.
Accountabilities:
Own the design, development, and delivery of high-quality AI evaluation data initiatives, from initial proposals through pilot execution and production readiness.
- Develop data proposals and sample packages based on lab requests, benchmarks, and evaluation targets, translating them into structured, high-signal datasets.
- Design frontier-grade evaluation samples across reasoning, coding, agents, tool use, and multimodal tasks, ensuring measurable model discrimination and headroom.
- Define and enforce rigorous quality control frameworks, including expert verification, calibration layers, rubrics, and deterministic validation approaches.
- Recruit, onboard, and manage subject-matter experts across technical domains, ensuring consistent output quality aligned with benchmark standards.
- Own pilot engagements end-to-end, including scoping, staffing, SOW definition, QC execution, and final delivery to AI lab partners.
- Act as a key point of contact for lab stakeholders, aligning expectations and surfacing technical requirements in collaboration with internal leadership.
- Continuously refine evaluation methodologies and sample design standards to improve signal quality and benchmark reliability.
Requirements:
You are an experienced operator in AI evaluation or technical delivery, with strong expertise in building structured, high-quality data systems for model assessment.
- 5+ years of experience in technical program management, data operations, quality engineering, or ML evaluation roles.
- Proven experience working with AI labs or enterprise ML teams, delivering datasets, benchmarks, or evaluation frameworks.
- Strong understanding of LLM evaluation concepts such as benchmarks, rubrics, pass rates, headroom, and model discrimination.
- Hands-on experience designing or managing QC processes and ensuring high-quality annotated or evaluated datasets.
- Demonstrated ability to recruit, manage, and calibrate subject-matter experts or external contributor pools.
- Strong problem-solving skills in ambiguous environments with evolving requirements and fast iteration cycles.
- Excellent English communication skills; Spanish is a plus.
Benefits:
- Competitive compensation aligned with senior-level AI and data roles
- Remote-first setup with flexibility across LATAM and US time zones
- Opportunity to work directly with leading AI labs and frontier model development teams
- High-ownership role with significant influence over evaluation standards and methodologies
- Collaboration with top-tier subject-matter experts across technical domains
- Exposure to cutting-edge AI benchmarking and evaluation practices
- Fast-paced, research-driven environment with strong learning potential
- Opportunity to shape how frontier model quality is measured and improved
