SAIGroup

Posted 5 months ago

Open

Data Manager — Multimodal Medical Foundation Models

Bangalore, On-site, Full-time

AI Summary

Leads end-to-end data operations for multimodal medical foundation models, overseeing ingestion, cleaning, versioning, labeling, governance, and delivery of complex 3D medical data to research teams.

About the Role

You will lead data operations for a research group developing 3D medical multimodal foundation models and agentic clinical AI systems. These models rely on high-quality, well-structured, and compliant datasets, including 3D medical imaging volumes (MRI, CT, PET), clinical text corpora, annotations, and multimodal metadata.

Your job is to own the end-to-end data lifecycle: acquisition, ingestion, cleaning, versioning, labeling, quality control, governance, and delivery to researchers. You are the central point of contact ensuring that our foundation model teams and medical agent teams have clean, scalable, well-documented data pipelines.

This is a foundational role: without great data, large models cannot be great.

What You Will Work On

Multimodal Medical Data Ops

  • Oversee ingestion and processing of 3D medical volumes (DICOM, NIfTI, MHA) and associated clinical texts.
  • Build automated pipelines for metadata extraction, de-identification, slice/series validation, and cohort structuring.
  • Manage large-scale internal datasets and external research datasets (BraTS, LiTS, MIMIC-CXR, CheXpert, MosMed, etc.).

Data Infrastructure & Versioning

  • Implement scalable data storage, cataloging, and retrieval systems for multimodal training data.
  • Own dataset version control, lineage tracking, reproducibility, and dataset documentation.
  • Collaborate with ML systems engineers on high-throughput data loaders, sharding strategies, and caching mechanisms.

Annotation & Labeling Programs

  • Lead medical annotation workflows with radiologists, medical students, and labeling vendors.
  • Create guidelines for ROI labeling, segmentation, captioning, report alignment, and case-level curation.
  • Build semi-automated labeling pipelines using model-assisted tools.

Data Quality, Compliance & Governance

  • Enforce strict standards on data quality, completeness, consistency, and bias control.
  • Ensure adherence to medical data privacy, HIPAA-equivalent frameworks, and institutional data-sharing rules.
  • Manage PHI de-identification, audit logs, access control, and compliance approvals.

Collaboration with Research & Engineering

  • Work closely with foundation-model researchers to understand data needs for model training.
  • Partner with agentic system designers to supply structured datasets for clinical reasoning tasks.
  • Collaborate with foundational engineers on data access layers, performance bottlenecks, and dataset optimization.

Why This Role Is Critical

  • The foundation model relies on high-quality 3D and textual data at scale.
  • You shape the data pipelines enabling next-generation medical AI agents.
  • You ensure clinical-grade governance, safety, reproducibility, and trust.
  • Your systems become the backbone for research, experiments, and deployments.

For candidates motivated by the intersection of data, healthcare, and machine learning, this is a high-impact opportunity.

What We’re Looking For

  • Strong experience managing large multimodal or imaging datasets, ideally medical imaging.
  • Proficiency with DICOM/DICOMweb, NIfTI, PACS, and medical imaging toolkits (dicompyler, pydicom, MONAI, ITK).
  • Experience with ETL pipelines, distributed data systems, and cloud/on-prem storage.
  • Knowledge of metadata standards, ontologies, and text–image linking strategies.
  • Comfort working with Python, SQL, and data tooling (Airflow, Prefect, Dagster, dbt, Delta Lake, etc.).
  • Understanding of data privacy, de-identification, and compliance requirements in healthcare.
  • Strong communication skills and the ability to coordinate between engineers, researchers, clinicians, and data partners.

Nice to Have

  • Experience with vector databases, multimodal retrieval, or embedding store design.
  • Familiarity with annotation tools (Labelbox, CVAT, iMerit, custom MONAI Label pipelines).
  • Prior work with clinical NLP datasets or multilingual Indian medical corpora.
  • Experience conducting bias audits, dataset characterization, or quality scoring at scale.
  • Contributions to open datasets, benchmarks, or data documentation frameworks.

What We Offer

  • Competitive compensation.
  • Access to one of the most ambitious medical multimodal datasets in the region.
  • Collaboration with scientists building India’s first 3D multimodal medical foundation model.
  • Autonomy to design data systems from the ground up.
  • A mission-driven team working to transform clinical care with agentic AI.

Skills

Access Control, Annotation Workflows, Audit Logs, Bias Audits, Caching Mechanisms, Captioning, Case-level Curation, Clinical NLP Datasets, Cloud Storage, Compliance Approvals, CVAT, Data Cataloging, Data Documentation Frameworks, Data Loading Pipelines, Dataset Documentation, Dataset Quality Scoring, Dataset Versioning, DICOM, dicompyler, DICOMweb, Distributed Data Systems, Embedding Stores, ETL Pipelines, HIPAA-equivalent Compliance, iMerit, ITK, Labelbox, Lineage Tracking, MHA, Model-assisted Labeling, MONAI, MONAI Label, Multimodal Retrieval, NIfTI, On-prem Storage, PACS, PHI De-identification, pydicom, Report Alignment, Reproducibility, ROI Labeling, Segmentation, Sharding Strategies, Vector Databases
