Jobless Developer
Embedding VC

Posted 6 months ago

Open

Member of Technical Staff - ML Infrastructure & Performance

San Mateo, CA · On-site · Full-time

AI Summary

A senior technical role focusing on ML infrastructure and performance optimization for real-time content generation, including GPU kernels, serving systems, and distributed training/processing. The role drives throughput, latency, and cost improvements.

About this role

Introducing Moonlake, AI for creating real-time interactive content

Mission: Improve throughput, latency, and cost by deploying our models 2–10× faster and cheaper without quality regressions.

Scope of Work:

- GPU performance: CUDA/Triton kernels, FlashAttention family, paged attention, CUDA Graphs.

- Serving stack: TensorRT-LLM/Triton Inference Server, vLLM/TGI; continuous batching; on-GPU KV reuse; speculative decoding/Medusa; mixture-of-agents routing.

- Parallelism: FSDP/ZeRO, TP/PP/expert parallel; NCCL tuning.

- Quantization/PEFT: AWQ/GPTQ/FP8; LoRA/DoRA serving.

- Systems: Ray/k8s/Argo, observability (Prom/Grafana/OpenTelemetry), autoscaling, A/B infra, canary + rollback.

Tech signals:

Previous experience at infra-heavy startups such as Databricks or Roblox.

We are committed to being an on-site, in-person team, currently based in San Mateo.

Skills

A/B Testing, Argo, Autoscaling, AWQ, Canary, Continuous Batching, CUDA, CUDA Graphs, CUDA Kernels, DoRA, FlashAttention, FP8, FSDP, GPTQ, Grafana, Kubernetes, LoRA, Medusa, Mixture-of-Agents Routing, NCCL Tuning, Observability, On-GPU KV Reuse, OpenTelemetry, Paged Attention, PEFT, Prometheus, Quantization, Ray, Rollback, Speculative Decoding, TensorRT-LLM, TGI, TP/PP/Expert Parallel, Triton, Triton Inference Server, vLLM, ZeRO
