Work at a Frontier Lab

The ML Knowledge That Never Gets Written Down

Every frontier lab has tribal knowledge that takes years to absorb — the memory math, the failure modes, the things senior engineers just know. It's also what they test for in interviews.

We wrote it all down.

Free interactive course with executable Python.

See What You're Missing→
9 lessons · 2.7+ hours · Run code in browser

Built from knowledge earned at Meta AI, DeepMind, and Anthropic

Trusted by professionals from top institutions

Carnegie Mellon University
Columbia University
New York University
UC San Diego
University of Washington
Arizona State University
Indian Institute of Science
IIT Roorkee
IIT Hyderabad
BITS Pilani
IIIT Delhi
University of Amsterdam
LinkedIn
Oregon State University

You're closer than you think.

But there are gaps you can't see.

  • Your model OOMs during backward, not forward — and you don't know why backward doubles the memory
  • You can recite the attention formula, but can't explain why head_dim controls your inference memory budget
  • You've read the Chinchilla paper, but don't know that Kaplan got a different answer — or why
  • You're optimizing GPU utilization without knowing whether your operation is memory-bound or compute-bound

These aren't gaps in your intelligence. They're gaps in your exposure. This is the knowledge you'd absorb after two years inside a frontier lab — and exactly what comes up in their interviews. We compressed it into lessons you can run in your browser.

Things you'd learn in your first year at a frontier lab

Each of these has shown up in real interviews. Here's what's actually inside.

From: KV Cache & Memory

KV cache memory = batch × seq × kv_heads × head_dim × 2 (K and V) × 2 (fp16 bytes) × layers. That head_dim you chose during training? It just locked your inference budget.

See this lesson →
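The formula is easy to sanity-check yourself. A quick sketch — the config numbers below are the published Llama 2 shapes (7B: 32 layers, 32 KV heads, head_dim 128; 70B: 80 layers, 8 KV heads via GQA), assuming fp16 and, for the 70B case, a hypothetical 128k context:

```python
def kv_cache_bytes(batch, seq, kv_heads, head_dim, layers, dtype_bytes=2):
    # batch × seq × kv_heads × head_dim × 2 (K and V) × dtype_bytes × layers
    return batch * seq * kv_heads * head_dim * 2 * dtype_bytes * layers

# Llama 2 7B: 32 layers, 32 KV heads (full MHA), head_dim 128, fp16
per_seq_7b = kv_cache_bytes(batch=1, seq=4096, kv_heads=32, head_dim=128, layers=32)
print(f"7B  @ 4k context:   {per_seq_7b / 2**30:.1f} GiB per sequence")   # 2.0 GiB

# Llama 2 70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16, 128k context
per_seq_70b = kv_cache_bytes(batch=1, seq=131072, kv_heads=8, head_dim=128, layers=80)
print(f"70B @ 128k context: {per_seq_70b / 2**30:.1f} GiB per sequence")  # 40.0 GiB
```

Two gibibytes per sequence at 4k context — multiply by batch size and it's clear why batch=32 OOMs.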
From: The Roofline Model

Before you optimize a single kernel, check if your op is memory-bound or compute-bound. The roofline model tells you in 30 seconds. Most engineers skip this and waste weeks.

See this lesson →
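Here's a back-of-envelope version you can run yourself. The peak numbers are approximate A100 80GB specs (~312 TFLOP/s fp16 tensor cores, ~2 TB/s HBM), and the GEMM byte count assumes fp16 with each operand read or written once:

```python
PEAK_FLOPS = 312e12   # A100 fp16 tensor-core peak (FLOP/s), approximate
PEAK_BW = 2.0e12      # A100 80GB HBM bandwidth (bytes/s), approximate

def roofline(flops, bytes_moved):
    """Return attainable FLOP/s and the binding resource for one op."""
    intensity = flops / bytes_moved                    # FLOPs per byte
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
    bound = "compute" if intensity * PEAK_BW >= PEAK_FLOPS else "memory"
    return attainable, bound

# Ridge point: ops below ~156 FLOPs/byte are memory-bound on this GPU
print(f"ridge = {PEAK_FLOPS / PEAK_BW:.0f} FLOPs/byte")

# Example: fp16 GEMM (M,K) @ (K,N) -> 2*M*K*N FLOPs, 2*(M*K + K*N + M*N) bytes
M, K, N = 1, 4096, 4096   # a decode-style matvec
attainable, bound = roofline(2*M*K*N, 2*(M*K + K*N + M*N))
print(f"matvec: {attainable/1e12:.1f} TFLOP/s attainable, {bound}-bound")
```

Run it and you'll see a matvec tops out near 2 TFLOP/s on hardware rated for 312 — not a bug, just physics.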
From: Distributed Training

A 70B model needs ~1,120 GB of training state per GPU in vanilla DDP — 16 bytes per parameter for mixed-precision Adam. With FSDP ZeRO-3 across 8 GPUs, that drops to ~140 GB. If you can't do this math on a whiteboard, you'll fumble the systems design interview.

See this lesson →
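The arithmetic is a one-liner. This sketch assumes the standard mixed-precision Adam accounting (fp16 params + fp16 grads + fp32 master weights, momentum, and variance = 16 bytes/param) and an 8-GPU node; activations and buffers are extra:

```python
def training_state_gb(params, dtype_bytes=2, shards=1):
    # fp16 param + fp16 grad + fp32 master / momentum / variance = 16 B/param
    bytes_per_param = dtype_bytes + dtype_bytes + 4 + 4 + 4
    return params * bytes_per_param / shards / 1e9

print(f"DDP (replicated): {training_state_gb(70e9):.0f} GB per GPU")           # 1120
print(f"ZeRO-3, 8 GPUs:   {training_state_gb(70e9, shards=8):.0f} GB per GPU") # 140
```

ZeRO-3 shards all three components across the data-parallel group, which is exactly why the per-GPU number divides by the world size.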
From: Inference Systems

Prefill and decode are completely different computational regimes — one is compute-bound, the other is memory-bound. Every serving optimization you build is a tradeoff between them.

See this lesson →
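You can see the two regimes in the arithmetic intensity of a single weight matmul. A toy calculation, assuming fp16 and counting only weight and activation traffic:

```python
def gemm_intensity(M, K, N, dtype_bytes=2):
    """FLOPs per byte for a (M,K) @ (K,N) matmul, counting all operand traffic."""
    flops = 2 * M * K * N
    bytes_moved = dtype_bytes * (M * K + K * N + M * N)
    return flops / bytes_moved

K = N = 4096                                  # one projection in a ~7B model
prefill = gemm_intensity(M=2048, K=K, N=N)    # whole prompt in one matmul
decode  = gemm_intensity(M=1,    K=K, N=N)    # one new token per step
print(f"prefill intensity: {prefill:7.1f} FLOPs/byte")  # compute-bound territory
print(f"decode  intensity: {decode:7.1f} FLOPs/byte")   # ~1 FLOP/byte: memory-bound
```

Same weights, same layer — three orders of magnitude apart in intensity. That gap is why batching decode requests helps and why prefill doesn't need it.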
Free Course

Track 0: Foundations

Start here. 9 lessons — each one ends with something that breaks.

The Memory Wall

Understand why your 70B model OOMs

Gradient Flow Under Pressure

Debug NaN losses like a pro

Adam, Warmup & Scheduling

Know when AdamW actually helps

The Debugging Flowchart

Systematic methodology, not guesswork

See all 9 lessons→

Overheard in the Valley

Byte-sized insights on what actually matters in ML interviews and research engineering — straight from the trenches.

reinforcement-learning · verl

A Gentle Introduction to verl — Part 1

Wrangle and implement RL algorithms with confidence. A deep dive into verl's architecture — from master-worker design to the PPO training loop — so you can go beyond config files.

12 min read · Read →
reinforcement-learning · reasoning

Absolute Zero Reasoner: Walkthrough, Implementation and No Jargon

How to make your LLM learn math and code using *no* data. A no-jargon deep dive into AZR — the paper that eliminates alignment data by having the model propose and solve its own problems.

15 min read · Read →
interviews · system-design

You Have Been Doing ML System Design Interviews Wrong

The questions you should ask, before you even begin. ML System Design interviews aren't about showcasing the latest research — they're a sophisticated vibe check.

8 min read · Read →
Read all posts→

The questions that separate ‘ML engineer’ from ‘research engineer’

A frontier lab interviewer might ask any of these. Could you answer them on a whiteboard?

1

Your 70B model training is at 40% MFU. Walk me through where the other 60% is going.

2

We need to serve this model at 200 tokens/sec per user. What's your KV cache memory budget and how does it constrain batch_size?

3

Why does Chinchilla recommend a different compute-optimal ratio than Kaplan's original scaling laws?

4

Your model's loss spikes at step 50k. Here's the training log. Diagnose it.

5

Explain why speculative decoding gives exact samples from the target distribution, not approximate ones.

6

This operation runs at 2 TFLOPS on an A100 rated for 312 TFLOPS. Is that a problem? Why or why not?

Every one of these is covered in the course. Not as trivia — as understanding you build yourself.
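Take question 5: the acceptance rule can be verified in a few lines. This toy sketch applies the standard speculative sampling rule (accept draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual max(p − q, 0)) and checks, via the law of total probability, that the emitted distribution is exactly the target:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # target model distribution (toy, 3 tokens)
q = np.array([0.2, 0.5, 0.3])   # draft model distribution

accept = np.minimum(1.0, p / q)          # P(accept | draft proposed x)
residual = np.maximum(p - q, 0.0)
residual /= residual.sum()               # resample distribution on rejection

# Law of total probability: P(emit x) = q(x)·accept(x) + P(reject)·residual(x)
p_reject = 1.0 - np.sum(q * accept)
emitted = q * accept + p_reject * residual
print(emitted)   # exactly p — no approximation anywhere
```

The toy distributions are arbitrary; swap in any p and q and the identity still holds, which is the whole point of the proof.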

What ‘depth’ actually means

The left column passes a tutorial quiz. The right column passes an interview.

Other Courses

Memorize the attention formula

Work at a Frontier Lab

Know that head_dim chosen at training time locks your inference memory budget

Other Courses

Here's how to use FSDP

Work at a Frontier Lab

Here's the memory math — 1,120 GB → ~140 GB with ZeRO-3, and why

Other Courses

Scaling laws say bigger is better

Work at a Frontier Lab

Kaplan and Chinchilla disagreed. Here's why, and when to break the rules

Other Courses

Watch this video about transformers

Work at a Frontier Lab

Run this 80-line attention implementation, then break it by removing the scaling factor

Other Courses

Speculative decoding speeds up inference

Work at a Frontier Lab

Here's the mathematical proof that it preserves the exact target distribution
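The "remove the scaling factor" exercise is worth trying even outside the course. This standalone sketch (not the course's implementation) shows what breaks: without the 1/√head_dim, logit variance grows with head_dim and softmax collapses toward one-hot:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_weights(q, k, scale=True):
    logits = q @ k.T
    if scale:
        logits /= np.sqrt(q.shape[-1])                  # the 1/sqrt(d) factor
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

head_dim, seq = 128, 16
q = rng.standard_normal((seq, head_dim))
k = rng.standard_normal((seq, head_dim))

for scale in (True, False):
    w = attention_weights(q, k, scale)
    # Entropy near log(seq) ≈ 2.77 means soft mixing; near 0 means one-hot
    entropy = -(w * np.log(w + 1e-12)).sum(-1).mean()
    print(f"scale={scale}: mean max weight {w.max(-1).mean():.2f}, entropy {entropy:.2f}")
```

With random N(0,1) inputs, unscaled logits have standard deviation ≈ √128 ≈ 11, so one key dominates every row — which is exactly the saturated, gradient-starved regime the scaling factor prevents.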

Every lesson follows the same arc

Here's what it looks like for one lesson — KV Cache & Memory.

1

Motivation

You deploy a chat model. Works at batch=1. At batch=32, it OOMs. Why?

2

Mental Model

A diagram of KV cache growing with sequence length. The formula: batch × seq × kv_heads × head_dim × 2 (K and V) × 2 (fp16 bytes) × layers.

3

Toy Code

80 lines of Python. You compute KV cache size for Llama 2 7B. Run it right here in your browser.

4

Break It

Double the sequence length. Watch the memory explode. Now you understand why long context is expensive.

5

Scale Thinking

What happens at 70B? At 128k context? When does KV cache dominate your entire GPU memory?

6

Production

How vLLM uses PagedAttention to solve this. Why GQA exists. What Meta actually ships.

What Engineers Say

“I spent three months optimizing a training run before realizing the bottleneck was memory bandwidth, not compute. The roofline model lesson would have saved me those three months.”

Sourav Bose

Research Engineer, Observo AI (Now Acquired)

“My team burned $40k on a training run with the wrong parallelism strategy. The FSDP memory math in Track 2 is the kind of thing you only learn after making expensive mistakes — or taking this course.”

Osaid Rehman

Senior Research Engineer, LinkedIn

“I've taken every ML course on the internet. This is the first one where the 'Break It' exercise actually taught me something I couldn't have gotten from reading the docs.”

Tushar Kadam

Senior ML Engineer, EarnIn

Two years of tribal knowledge. Nine lessons. Zero signup.

The same understanding that frontier labs test for — now free and interactive.

See What You're Missing→

Free. No account needed. Start reading in 30 seconds.

Work at a Frontier Lab
Courses · Blog · Jobs · Discuss · About · Contact · Privacy · Terms · GitHub

Built with Next.js + Pyodide


Paper Breakdowns

Read the papers that actually matter.

Key insights and interview-ready takeaways from the most influential ML papers — no fluff, just the important bits.

Moonshot AI · advanced

Attention Residuals

Attention Residuals (AttnRes) replaces the fixed unit-weight accumulation in standard residual connections with learned softmax attention over depth, letting each layer selectively aggregate earlier representations. Block AttnRes partitions layers into blocks and attends over block-level summaries, reducing memory from O(Ld) to O(Nd). Integrated into the 48B Kimi Linear architecture, Block AttnRes mitigates PreNorm dilution and improves downstream performance across all benchmarks.

residual-connections · depth-attention · scaling-laws · 20 min read
Read breakdown →
Carnegie Mellon University · intermediate

Midtraining Bridges Pretraining and Posttraining Distributions

This paper provides the first systematic study of midtraining — mixing specialized data with general pretraining data during an intermediate phase. Through controlled experiments at 70M-1B scale, the authors show midtraining functions as distributional bridging, providing better initialization for posttraining. Benefits scale with domain distance, are predicted by a simple proximity metric, and depend critically on a timing-weight interaction governed by a plasticity window.

midtraining · curriculum-learning · catastrophic-forgetting · 18 min read
Read breakdown →
Google Brain · foundational

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

attention · transformers · architecture · 15 min read
Read breakdown →
Browse all paper breakdowns→
Community

Learn together. Build together.

Ask questions, share insights, and discuss ML research engineering with people who are on the same path.

just now

How do you estimate KV cache memory for a 70B model at 128k context?

I'm trying to work through the memory math for deploying Llama 2 70B with long context. What's the right formula and where do most people get tripped up?

memory · inference
just now

FSDP vs DeepSpeed ZeRO-3 — which do you actually use in production?

Curious what people are actually running at scale. The docs make them sound interchangeable but the tradeoffs seem real when you hit multi-node.

distributed · training
just now

What questions came up in your research engineer interview?

Preparing for interviews at frontier labs. Would love to hear what systems-level questions people actually got asked and what they wished they'd studied.

interviews · career

These are the kinds of conversations waiting to happen. Be the first to start one.

Start the first discussion→
Now Hiring

Learn it. Then go build it.

Frontier labs are hiring research engineers right now. Master the skills here, then apply to the roles below.

Anthropic·San Francisco, CA

Research Scientist, Frontier Red Team (Emerging Risks)

The Frontier Red Team (FRT) is a technical research team within Anthropic’s Policy organization. Our goal is to make the entire world safer in this era of advanced AI by understanding what these systems can do and building the defenses that matter. In 2026, we're focused on researching and ensuring safety with self-improving, highly autonomous AI systems—especially ones with cyber-physical capabilities. See our previous related work on cyberdefense, robotics, and Project Vend. This is…

Mid · Onsite · 1 month ago
Apply →
Nous Research·Remote

Full Time · Research Scientist

Work with the Fundamental AI Research Team to produce high-impact AI research at Nous Research — an open-source AI research lab focused on advancing open AI models and reinforcement learning.

Mid · Remote · 1 month ago
Apply →
Anthropic·San Francisco, CA | New York City, NY

ML/Research Engineer, Safeguards

We are looking for ML Engineers and Research Engineers to help detect and mitigate misuse of our AI systems. As a member of the Safeguards ML team, you will build systems that identify harmful use—from individual policy violations to sophisticated, coordinated attacks—and develop defenses that keep our products safe as capabilities advance. You will also work on systems that protect user wellbeing and ensure our models behave appropriately across a wide range of contexts. This work feeds…

Mid · Onsite · 1 month ago
Apply →
View all open roles→