Ah, now I see the issue!
When processing token N, attention doesn't just receive information from tokens 1…N as inputs; it also receives information from the intermediate representations of tokens 1…N at prior layers. With autoregression, this allows a model to "set up" useful representations for future generation steps.
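Here's a minimal single-head sketch of why that's true (names and shapes are my own, not from any particular codebase): the keys and values that token N reads at a given layer are built from the previous layer's hidden states of every earlier position, so whatever token i computed at the layer below is visible to token N here.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(h_prev, w_q, w_k, w_v):
    """One attention head at layer L, reading layer L-1 hidden states.

    h_prev: (seq_len, d_model) hidden states from the previous layer.
    Row N of the output mixes values from positions 1..N only, so each
    earlier position's intermediate representation is readable by every
    later position.
    """
    q, k, v = h_prev @ w_q, h_prev @ w_k, h_prev @ w_v   # (seq_len, d_head) each
    scores = (q @ k.T) / k.shape[-1] ** 0.5              # (seq_len, seq_len)
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))   # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v                 # (seq_len, d_head)

# Toy usage: 5 tokens, d_model=16, d_head=8
h_prev = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
out = causal_self_attention(h_prev, w_q, w_k, w_v)       # out[4] depends on h_prev[0..4]
```

Because position i's previous-layer state feeds the keys and values that later positions read, computation done at position i can serve tokens that haven't been generated yet.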
This also helps explain how meaningless output tokens can still do work. If a model can balance an intermediate representation between outputting the right token (".") and doing other useful computation, the forward pass is not wasted. With enough pressure, a model can learn to do this kind of doublethink.
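A toy illustration of that balance (entirely my own construction, and idealized): if the residual stream has more dimensions than the unembedding actually reads, a hidden state can carry extra information along directions the output layer ignores, without changing the emitted token. In a real model the separation is soft rather than exactly orthogonal.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 8, 4                       # toy sizes, chosen for illustration
W_U = torch.randn(vocab, d_model)           # toy unembedding matrix

h_out = W_U[0] * 3.0                        # hidden state that strongly predicts token 0 (".")

# Build a "scratch" direction orthogonal to every unembedding row:
# a complete QR of W_U's row space gives its orthogonal complement.
Q, _ = torch.linalg.qr(W_U.T, mode="complete")    # Q: (d_model, d_model)
scratch = Q[:, vocab]                       # lies outside the space the readout sees

h_doublethink = h_out + 5.0 * scratch       # extra hidden info, same logits
print(torch.allclose(W_U @ h_out, W_U @ h_doublethink, atol=1e-5))  # True
```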
Oct 7, 2025 22:03