Machine Learning Infrastructure Engineer

Palo Alto, Full-time

AI Summary

Build and operate large-scale ML infrastructure for distributed training. Own core ML systems, sharding, parallelism, and performance across hundreds of GPUs to reduce friction for researchers and improve training reliability and deployment.

About this role

We’re hiring Machine Learning Infrastructure Engineers to build the systems that make large-scale model training actually work. This role is for people who enjoy operating at scale: owning distributed training, core ML infrastructure, and fast iteration loops across hundreds of GPUs. If you’ve built or run large training systems in PyTorch or JAX and care about things like sharding, parallelism, and performance, you’ll feel at home here. You’ll work closely with researchers to remove friction, improve reliability, and make it easier to train, evaluate, and deploy models that show up in real systems.

Skills

Distributed Training, GPU Systems, JAX, Model Training Infrastructure, Parallelism, Performance Optimization, PyTorch, Sharding

More jobs at Mind Robotics