Research Engineer
About this role
About Kog
Kog builds the fastest LLM inference engine on standard datacenter GPUs. Our Kog Inference Engine generates 3,000 output tokens per second per request on a single 8× AMD MI300X node and 2,100 on an 8× NVIDIA H200 node (FP16, batch size 1, no speculative decoding).
We co-design the model architecture and the execution engine together. Our Laneformer model uses Delayed Tensor Parallelism (DTP), a novel architecture that restructures the Transformer dependency graph so inter-GPU communication overlaps with computation rather than blocking it.
We pretrained a 2B-parameter DTP model on 6T tokens on 256 H100 GPUs.
We are a team of 11 people, including 10 engineers and 4 PhDs.
Test it at playground.kog.ai. Read the technical details on the Kog Labs blog.
What you will work on
You will conceive, design, and run experiments to understand how architectural decisions propagate through inference behavior. You will morph existing open-weight models into architecture variants optimized for speed, and turn your findings into measurable gains in generation speed and model quality.
Design new model architecture variants, including routing strategies, attention mechanisms, and MoE structure, with execution constraints as a first-order design input.
Extend the Laneformer thesis by exploring inference-aware architectural variants such as DTP, Ladder Residual, and PT-Transformer, and finding what compounds at scale.
Own the post-training pipeline across fine-tuning, evaluation methodology, and adaptation of existing open-weight models toward architecture variants optimized for inference speed.
Scale the stack to large MoE models such as DeepSeek v4 and Qwen 3, working through routing, expert parallelism, and communication patterns at inference time.
Write up findings as research papers, submit them to top venues, and present them at conferences.
Contribute to building AI agents that will perform architecture research and training experiments autonomously, starting from the research foundations we are building now.
What we look for
You are rigorous, curious, and comfortable working at the intersection of model design and hardware constraints.
You have worked on complex AI problems and have something concrete to show for it. A paper, a repository, a thesis, or a side project with evidence of serious technical thinking is what we want to see.
Strong signals include experience adapting or modifying existing model architectures, understanding of how communication structure and layer dependencies affect inference behavior, and fluency in Transformers and MoE with enough depth to reason across trade-offs.
Experience in post-training methods such as fine-tuning, preference optimization, or quantization is a plus, even without production-scale exposure.
What we offer
Direct access to AMD and NVIDIA datacenter GPUs from day one
A team where creativity and technical judgment carry weight and where the people closest to the problem shape the key decisions
Problems that sit on the critical path of model execution speed and that directly influence what the system can become
A remote-friendly working model, though you'll spend at least 50% of your time in our Paris office
