Jobless Developer
Mind Robotics

Posted 4 months ago

Open

Machine Learning Infrastructure Engineer

Palo Alto, Remote, Full-time

AI Summary

Builds and maintains ML infrastructure for large-scale model training, with a focus on distributed training, core systems, and fast iteration across hundreds of GPUs. Collaborates with researchers to remove friction and improve reliability in training, evaluation, and deployment.

About this role

We’re hiring Machine Learning Infrastructure Engineers to build the systems that make large-scale model training actually work. This role is for people who enjoy operating at scale—owning distributed training, core ML infrastructure, and fast iteration loops across hundreds of GPUs. If you’ve built or run large training systems in PyTorch or JAX and care about things like sharding, parallelism, and performance, you’ll feel at home here. You’ll work closely with researchers to remove friction, improve reliability, and make it easier to train, evaluate, and deploy models that show up in real systems.

Skills

Big Data Workflows, Deployment Pipelines, Distributed Training, GPU Compute, GPU Scheduling, High-performance Computing, JAX, Kernel / Driver Level Optimizations, ML Infrastructure, Model Training Systems, Monitoring / Observability, Performance Optimization, PyTorch, Reliability Engineering, Research Collaboration, Scalability, Sharding, Systems Programming
