Generalist

Posted 3 months ago

Open

Software Engineer: ML Infra

San Francisco · On-site · Full-time

AI Summary

Owns GPU compute fleets and orchestration for large-scale ML workloads, focusing on research and inference pipelines across distributed on-prem and cloud infrastructure.

About the Role

Generalist trains very large robot foundation models. This requires very large numbers of latest-generation GPUs (currently NVIDIA) and the supporting infrastructure to run distributed training jobs and researcher experiments. We have extreme requirements for storage and data-loading infrastructure, which means getting the most out of both cloud infrastructure and custom solutions.

You will also own inference infrastructure. For our robots, this is a fleet of on-prem GPUs attached to the robots, operating under strict real-time and latency budgets in compute-constrained environments.

You’ll be responsible for:

  • Owning our GPU compute fleets

  • Ensuring our GPUs are easy for researchers to use and maximally utilized

  • Optimizing and improving ML data loading, transport, and storage in highly distributed, fully utilized environments

  • Orchestrating robot inference fleets

You might thrive in this role if you:

  • Have managed large fleets of GPUs doing large-scale, long-term, highly distributed training runs or inference

  • Have deep experience with Slurm or Kubernetes for ML workload orchestration

  • Have built high-scale ML data loaders and preparation systems

  • Deeply understand every layer of the ML hardware, storage, and networking stacks

  • Have experience in the NVIDIA GPU ecosystem


About Generalist

At Generalist, we are on a mission to make general-purpose robots a reality. We believe the industries and homes of the future will depend on humans and machines working together in new ways. Robots can help us build more and get more done.

We build embodied foundation models, starting with a focus on dexterity. This requires advancing the frontiers of data, models, and hardware, to enable robots to intelligently interact with the physical world.

The company embraces both large-scale AI and robotics as core to its DNA. Our team of researchers, roboticists, and company builders comes from OpenAI, Boston Dynamics, Google DeepMind, and other frontier labs, with a track record of shipping AI breakthroughs. Before Generalist, we pioneered large embodied multimodal models and vision-language-action models (PaLM-E, RT-2, Gemini Robotics), launched and scaled ChatGPT and GPT-4 to hundreds of millions of users, engineered the foundations of autonomous driving, built next-generation robots (Atlas, Spot, Stretch), and pushed the limits of what they can do (from parkour to manipulation and robustness testing).

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

Skills

Cloud Infrastructure, Distributed Training, GPU Fleet Management, GPU Utilization Optimization, Hardware/software Co-design, Inference Orchestration, Kubernetes, Large-scale Data Loaders, ML Data Loading, ML Hardware Stack Understanding, NVIDIA GPU Ecosystem, On-prem GPU Infrastructure, Robotics Compute Pipelines, Runtime Optimization, Slurm, SLURM Or Kubernetes For ML Workloads, Storage And Data Loading
