Paper Breakdowns
Frontier reports, broken down for your next ML interview.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer +6 more
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
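The attention mechanism at the core of the Transformer is scaled dot-product attention, softmax(QKᵀ/√d_k)·V. A minimal single-head NumPy sketch (no masking, no learned projections, no multi-head split):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the Transformer's core operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L_q, L_k) similarity matrix
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ V                                    # weighted sum of value vectors

rng = np.random.default_rng(0)
L, d = 4, 8
Q, K, V = rng.normal(size=(3, L, d))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because every query attends to every key, the score matrix makes attention O(L²) in sequence length, which is exactly what the sparse-attention work below targets.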
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-AI, Aixin Liu +3 more
DeepSeek-V3.2 introduces DeepSeek Sparse Attention (DSA), an efficient attention mechanism that reduces computational complexity from O(L²) to O(Lk) while preserving performance in long-context scenarios. Through a robust reinforcement learning protocol using improved GRPO and scaling post-training compute to over 10% of pre-training cost, V3.2 performs comparably to GPT-5. The high-compute variant DeepSeek-V3.2-Speciale surpasses GPT-5 on math and achieves gold medals at IMO 2025, IOI 2025, and ICPC World Finals 2025. A novel large-scale agentic task synthesis pipeline generates 85,000+ training prompts across 1,800+ environments.
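The O(L²)→O(Lk) reduction comes from letting each query attend to only k selected keys instead of all L. The sketch below uses a generic top-k-by-score selection purely to illustrate the complexity argument; DSA's actual key-selection mechanism is not described here, and a real implementation would avoid materializing the full score matrix (which this toy version computes for clarity):

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k):
    """Illustrative sparse attention: each query keeps only its k
    highest-scoring keys, so softmax/weighting costs O(L*k), not O(L^2)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # full scores, for clarity only
    idx = np.argpartition(-scores, k - 1, axis=-1)[:, :k]  # top-k key indices per query
    masked = np.full_like(scores, -np.inf)              # -inf => zero attention weight
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # softmax over the k survivors
    return w @ V

rng = np.random.default_rng(1)
L, d, k = 16, 8, 4
Q, K, V = rng.normal(size=(3, L, d))
out = topk_sparse_attention(Q, K, V, k)
print(out.shape)  # (16, 8)
```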
GLM-5: From Vibe Coding to Agentic Engineering
Aohan Zeng, Xin Lv +11 more
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks.
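"Decoupling generation from training" means rollout workers and the learner no longer run in lockstep: generators keep producing trajectories while the trainer consumes whatever is ready. The real system is distributed; this is only a toy producer/consumer sketch of that decoupling, with a list append standing in for a gradient step:

```python
import queue
import threading

def async_rl_sketch(n_rollouts=8):
    """Toy decoupled rollout/training loop: a generator thread streams
    trajectories into a bounded queue while a trainer thread consumes them."""
    rollouts = queue.Queue(maxsize=4)

    def generator():                        # actor: produces trajectories
        for i in range(n_rollouts):
            rollouts.put({"traj": i, "reward": i % 2})
        rollouts.put(None)                  # sentinel: generation finished

    def trainer(results):                   # learner: consumes asynchronously
        while (item := rollouts.get()) is not None:
            results.append(item["traj"])    # stand-in for a gradient update

    results = []
    gen = threading.Thread(target=generator)
    trn = threading.Thread(target=trainer, args=(results,))
    gen.start(); trn.start(); gen.join(); trn.join()
    return results

print(async_rl_sketch())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The bounded queue is the key design point: neither side blocks on the other except under backpressure, which is what lets post-training throughput scale.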
GPT-OSS-120B & GPT-OSS-20B: OpenAI's First Open-Weight Reasoning Models
OpenAI, Sandhini Agarwal +5 more
OpenAI releases gpt-oss-120b (116.8B total, 5.1B active) and gpt-oss-20b (20.9B total, 3.6B active), their first open-weight reasoning models under Apache 2.0. Both use efficient MoE transformer architectures with 128 and 32 experts respectively, trained via large-scale distillation and reinforcement learning similar to o3. MXFP4 quantization enables the 120B model to run on a single 80GB GPU. The models feature variable-effort reasoning (low/medium/high), agentic tool use (browsing, Python, function calling), and a novel Harmony chat format with instruction hierarchy. On AIME 2025, gpt-oss-120b scores 97.9% and gpt-oss-20b scores 98.7%, competitive with o3 and o4-mini.
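The total-vs-active parameter gap (e.g. 20.9B total but 3.6B active) comes from MoE routing: a gate scores all experts per token but only the top few actually run. A minimal illustrative MoE layer (not OpenAI's code; experts reduced to single matrices, and top_k=4 is an assumption for the example):

```python
import numpy as np

def moe_forward(x, gate_W, experts_W, top_k):
    """Minimal mixture-of-experts layer: route each token to its top_k
    experts, so only those experts' weights are 'active' for that token."""
    logits = x @ gate_W                                  # (n_tokens, n_experts)
    chosen = np.argpartition(-logits, top_k - 1, axis=-1)[:, :top_k]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = logits[t, chosen[t]]
        g = np.exp(g - g.max()); g /= g.sum()            # softmax over chosen experts
        for weight, e in zip(g, chosen[t]):
            out[t] += weight * (x[t] @ experts_W[e])     # only top_k experts execute
    return out

rng = np.random.default_rng(2)
n_tokens, d, n_experts, top_k = 5, 8, 32, 4              # 32 experts, as in gpt-oss-20b
x = rng.normal(size=(n_tokens, d))
gate_W = rng.normal(size=(d, n_experts))
experts_W = rng.normal(size=(n_experts, d, d))
out = moe_forward(x, gate_W, experts_W, top_k)
print(out.shape)  # (5, 8)
```

Per-token compute scales with top_k × expert size rather than total expert count, which is why a 1T-total model like Kimi K2 below can run with only 32B activated parameters.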
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai +6 more
We introduce Kimi K2, a Mixture-of-Experts large language model with 32 billion activated parameters and 1 trillion total parameters. K2 uses the MuonClip optimizer and was pre-trained on 15.5 trillion tokens with zero loss spikes. Post-training involves a large-scale agentic data synthesis pipeline and a joint reinforcement learning stage. K2 achieves state-of-the-art performance among open-source non-thinking models with scores including 65.8 on SWE-Bench Verified and 53.7 on LiveCodeBench v6.