Jobgether

Posted 2 days ago

Open

Human Data Evals Lead

United States | Remote | Full-time

About this role

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Human Data Evals Lead based in the United States.

This role sits at the core of frontier AI data operations, owning how high-quality evaluation datasets and benchmarks are designed, validated, and delivered to leading AI labs. You will be responsible for translating ambiguous evaluation needs into structured, high-signal data proposals and production-ready sample packages that demonstrate model performance with rigor and clarity. The work blends technical judgment, quality design, and commercial awareness, requiring close collaboration with subject-matter experts and research stakeholders. You will shape how “frontier-grade” quality is defined and enforced, ensuring every dataset meets the standards expected by advanced model developers. Acting as a key interface with AI lab partners, you will help convert pilots into scaled production engagements. This is a high-ownership role at the intersection of AI evaluation, data quality, and applied research operations.

Accountabilities:

Own the design, development, and delivery of high-quality AI evaluation data initiatives, from initial proposals through pilot execution and production readiness.

  • Develop data proposals and sample packages based on lab requests, benchmarks, and evaluation targets, translating them into structured, high-signal datasets.
  • Design frontier-grade evaluation samples across reasoning, coding, agents, tool use, and multimodal tasks, ensuring measurable model discrimination and headroom.
  • Define and enforce rigorous quality control frameworks, including expert verification, calibration layers, rubrics, and deterministic validation approaches.
  • Recruit, onboard, and manage subject-matter experts across technical domains, ensuring consistent output quality aligned with benchmark standards.
  • Own pilot engagements end-to-end, including scoping, staffing, SOW definition, QC execution, and final delivery to AI lab partners.
  • Act as a key point of contact for lab stakeholders, aligning expectations and surfacing technical requirements in collaboration with internal leadership.
  • Continuously refine evaluation methodologies and sample design standards to improve signal quality and benchmark reliability.

Requirements:

You are an experienced operator in AI evaluation or technical delivery, with strong expertise in building structured, high-quality data systems for model assessment.

  • 5+ years of experience in technical program management, data operations, quality engineering, or ML evaluation roles.
  • Proven experience working with AI labs or enterprise ML teams, delivering datasets, benchmarks, or evaluation frameworks.
  • Strong understanding of LLM evaluation concepts such as benchmarks, rubrics, pass rates, headroom, and model discrimination.
  • Hands-on experience designing or managing QC processes and ensuring high-quality annotated or evaluated datasets.
  • Demonstrated ability to recruit, manage, and calibrate subject-matter experts or external contributor pools.
  • Strong problem-solving skills in ambiguous environments with evolving requirements and fast iteration cycles.
  • Excellent English communication skills; Spanish is a plus.

Benefits:

  • Competitive compensation aligned with senior-level AI and data roles
  • Remote-first setup with flexibility across LATAM and US time zones
  • Opportunity to work directly with leading AI labs and frontier model development teams
  • High-ownership role with significant influence over evaluation standards and methodologies
  • Collaboration with top-tier subject-matter experts across technical domains
  • Exposure to cutting-edge AI benchmarking and evaluation practices
  • Fast-paced, research-driven environment with strong learning potential
  • Opportunity to shape how frontier model quality is measured and improved
