Posted 10 months ago

Software Engineer, Distributed Systems

San Francisco, On-site, Full-time

AI Summary

Software engineer focused on building distributed systems platforms for high-traffic, data-intensive workloads. Responsible for the design, implementation, and tuning of core compute and orchestration components.

About this role

You are an experienced software engineer who thrives on building large-scale computing platforms. You have deep expertise in large-scale distributed systems that handle high complexity, heavy traffic, and large volumes of data. You know how to achieve reliability and scale with minimal operational load.

Key responsibilities

Build our core Python/Rust platform: request routing, AI workload orchestration, scheduling, GPU autoscaling, large-scale file storage, queueing, etc.
Produce forward-looking designs for platform evolution as we scale to 100x current traffic and need to provide low latency across the world
Use AI extensively to automate the mundane parts of building complex but reliable systems
Profile and tune low-level CPU and memory performance

Requirements

3+ years of experience building distributed compute and orchestration platforms in Python or Rust
Strong understanding of distributed systems fundamentals: consensus, scheduling, fault tolerance, capacity planning
Deep understanding of computational complexity and memory allocation
Track record of designing systems that scale under real production load
Experience building and using observability to drive performance and reliability decisions
Excellent communication and ability to drive technical decisions across teams
Self-starter who executes quickly, takes ownership, and constantly seeks improvement

Nice to have

Experience with AI/ML inference or training infrastructure
Experience with high-performance systems programming (async runtimes, zero-copy, memory-safe concurrency)
Background in building multi-tenant compute platforms
Understanding of networking fundamentals and performance characteristics
Familiarity with GPU workload characteristics and scheduling constraints

Compensation

$180,000-$250,000 plus equity and benefits (this range spans all three levels: Mid, Senior, and Staff)

Location

San Francisco, CA (willing to consider remote for Senior and Staff levels)

What we offer at fal

Interesting and challenging work
A lot of learning and growth opportunities
We are currently hiring in downtown San Francisco.
We offer relocation assistance to San Francisco.
Health, dental, and vision insurance (US)
Regular team events and offsites

Skills

AI/ML Inference Infrastructure, Async Runtimes, Capacity Planning, Consensus, CPU/Memory Performance Tuning, Distributed Systems, Fault Tolerance, GPU Scheduling, GPU Workload Characteristics, Memory-safe Concurrency, Multi-tenant Compute, Networking Fundamentals, Observability, Python, Rust, Scheduling, Zero-copy
