Paper Breakdowns
Frontier reports, broken down for your next ML interview.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer +6 more
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
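The attention mechanism at the core of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)·V. A minimal single-head NumPy sketch (no masking, no learned projections, no multi-head split):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the Transformer's core operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L_q, L_k) similarity matrix
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ V                                    # weighted sum of value vectors

rng = np.random.default_rng(0)
L, d = 4, 8
Q, K, V = rng.normal(size=(3, L, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because every query attends to every key, the score matrix makes attention O(L²) in sequence length, which is exactly what the sparse-attention work below targets.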
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI, Aixin Liu +3 more
DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), an efficient attention mechanism that reduces computational complexity from O(L²) to O(Lk) while preserving performance in long-context scenarios. Through a robust reinforcement learning protocol using improved GRPO and scaling post-training compute to over 10% of pre-training cost, V3.2 performs comparably to GPT-5. The high-compute variant DeepSeek-V3.2-Speciale surpasses GPT-5 on math and achieves gold medals at IMO 2025, IOI 2025, and ICPC World Finals 2025. A novel large-scale agentic task synthesis pipeline generates 85,000+ training prompts across 1,800+ environments.
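The O(L²)→O(Lk) reduction comes from letting each query attend to only k selected keys instead of all L. The sketch below uses a generic top-k-by-score selection purely to illustrate the complexity argument; DSA's actual key-selection mechanism is not described here, and a real implementation would avoid materializing the full score matrix (which this toy version computes for clarity):

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k):
    """Illustrative sparse attention: each query keeps only its k
    highest-scoring keys, so softmax/weighting costs O(L*k), not O(L^2)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # full scores, for clarity only
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]  # top-k key indices per query
    masked = np.full_like(scores, -np.inf)              # -inf => zero attention weight
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # softmax over the k survivors
    return w @ V

rng = np.random.default_rng(1)
L, d, k = 16, 8, 4
Q, K, V = rng.normal(size=(3, L, d))
out = topk_sparse_attention(Q, K, V, k)
print(out.shape)  # (16, 8)
```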
GLM-5: From Vibe Coding to Agentic Engineering
Aohan Zeng, Xin Lv +11 more
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks.
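"Decoupling generation from training" means rollout workers and the learner no longer run in lockstep: generators keep producing trajectories while the trainer consumes whatever is ready. The real system is distributed; this is only a toy producer/consumer sketch of that decoupling, with a list append standing in for a gradient step:

```python
import queue
import threading

def async_rl_sketch(n_rollouts=8):
    """Toy decoupled rollout/training loop: a generator thread streams
    trajectories into a bounded queue while a trainer thread consumes them."""
    rollouts = queue.Queue(maxsize=4)

    def generator():                        # actor: produces trajectories
        for i in range(n_rollouts):
            rollouts.put({"traj": i, "reward": i % 2})
        rollouts.put(None)                  # sentinel: generation finished

    def trainer(results):                   # learner: consumes asynchronously
        while (item := rollouts.get()) is not None:
            results.append(item["traj"])    # stand-in for a gradient update

    results = []
    gen = threading.Thread(target=generator)
    trn = threading.Thread(target=trainer, args=(results,))
    gen.start(); trn.start(); gen.join(); trn.join()
    return results

print(async_rl_sketch())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The bounded queue is the key design point: neither side blocks on the other except under backpressure, which is what lets post-training throughput scale.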
GPT-OSS-120B & GPT-OSS-20B: OpenAI's First Open-Weight Reasoning Models
OpenAI, Sandhini Agarwal +5 more
OpenAI releases gpt-oss-120b (116.8B total, 5.1B active) and gpt-oss-20b (20.9B total, 3.6B active), their first open-weight reasoning models under Apache 2.0. Both use efficient MoE transformer architectures with 128 and 32 experts respectively, trained via large-scale distillation and reinforcement learning similar to o3. MXFP4 quantization enables the 120B model to run on a single 80GB GPU. The models feature variable-effort reasoning (low/medium/high), agentic tool use (browsing, Python, function calling), and a novel Harmony chat format with instruction hierarchy. On AIME 2025, gpt-oss-120b scores 97.9% and gpt-oss-20b scores 98.7%, competitive with o3 and o4-mini.
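The total-vs-active parameter gap (e.g. 20.9B total but 3.6B active) comes from MoE routing: a gate scores all experts per token but only the top few actually run. A minimal illustrative MoE layer (not OpenAI's code; experts reduced to single matrices, and top_k=4 is an assumption for the example):

```python
import numpy as np

def moe_forward(x, gate_W, experts_W, top_k):
    """Minimal mixture-of-experts layer: route each token to its top_k
    experts, so only those experts' weights are 'active' for that token."""
    logits = x @ gate_W                                  # (n_tokens, n_experts)
    chosen = np.argpartition(-logits, top_k - 1, axis=-1)[:, :top_k]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = logits[t, chosen[t]]
        g = np.exp(g - g.max()); g /= g.sum()            # softmax over chosen experts
        for weight, e in zip(g, chosen[t]):
            out[t] += weight * (x[t] @ experts_W[e])     # only top_k experts execute
    return out

rng = np.random.default_rng(2)
n_tokens, d, n_experts, top_k = 5, 8, 32, 4              # 32 experts, as in gpt-oss-20b
x = rng.normal(size=(n_tokens, d))
gate_W = rng.normal(size=(d, n_experts))
experts_W = rng.normal(size=(n_experts, d, d))
out = moe_forward(x, gate_W, experts_W, top_k)
print(out.shape)  # (5, 8)
```

Per-token compute scales with top_k × expert size rather than total expert count, which is why a 1T-total model like Kimi K2 below can run with only 32B activated parameters.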
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai +6 more
We introduce Kimi K2, a Mixture-of-Experts large language model with 32 billion activated parameters and 1 trillion total parameters. K2 uses the MuonClip optimizer and was pre-trained on 15.5 trillion tokens with zero loss spikes. Post-training involves a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage. K2 achieves state-of-the-art performance among open-source non-thinking models with scores including 65.8 on SWE-Bench Verified and 53.7 on LiveCodeBench v6.