SFT, RL, and On-Policy Distillation Through a Distributional Lens
I have been thinking about post-training methods in terms of distributions. A language model is a distribution over sequences. When we post-train it to teach it a task, we are reshaping this distribution. Post-training methods differ in how they reshape it: what they treat as the target, and how directly they define that target. This is neither a precise statement nor meant to be fully rigorous.
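To make the framing slightly more concrete, here is one rough way to write it down in standard notation (a sketch, not a rigorous treatment; the symbols $p_\theta$, $\mathcal{D}$, $R$, $p_{\text{ref}}$, and $p_{\text{teacher}}$ are my notation, not definitions from any particular paper). The model defines a conditional distribution $p_\theta(y \mid x)$ over output sequences $y$ given a prompt $x$, and each method can be read as pulling $p_\theta$ toward a different target:

$$
\begin{aligned}
\text{SFT:} \quad & \min_\theta \; \mathbb{E}_{(x,y)\sim \mathcal{D}}\!\left[-\log p_\theta(y \mid x)\right]
\;\;\equiv\;\; \min_\theta \; \mathrm{KL}\!\left(p_{\text{data}} \,\|\, p_\theta\right) + \text{const} \\[4pt]
\text{RL (with KL penalty):} \quad & \max_\theta \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[R(x, y)\right] - \beta\, \mathrm{KL}\!\left(p_\theta \,\|\, p_{\text{ref}}\right) \\[4pt]
\text{On-policy distillation:} \quad & \min_\theta \; \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\log p_\theta(y \mid x) - \log p_{\text{teacher}}(y \mid x)\right]
\end{aligned}
$$

Read this way, SFT matches a fixed data distribution (forward KL on data samples), RL moves mass toward high-reward sequences under the model's own samples, and on-policy distillation minimizes a reverse KL to a teacher on sequences the student itself generates. The rest of the distinctions this article draws follow from where the samples come from and how explicitly the target distribution is specified.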