This also helps explain how meaningless output tokens can work. If a model can balance an intermediate representation between outputting the right token (“.”) and doing other useful work, the forward pass is not wasted. With enough pressure, a model can learn to do this doublethink.
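A minimal numpy sketch of the idea, under simplifying assumptions (a single linear unembedding `W_U`, a made-up token id `period_id` for “.”, and hand-picked component weights rather than anything learned): one hidden state can simultaneously drive the “.” logit hard enough to win the argmax while carrying extra information in a direction orthogonal to the “.” readout, where a later layer could pick it up.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000
W_U = rng.normal(size=(d_model, vocab))       # toy unembedding matrix (assumption)
period_id = 42                                # hypothetical token id for "."

# Unit direction that drives the "." logit.
period_dir = W_U[:, period_id]
period_dir = period_dir / np.linalg.norm(period_dir)

# A "payload" direction orthogonal to the "." readout: information the model
# wants to carry forward without disturbing which token gets sampled.
payload = rng.normal(size=d_model)
payload -= (payload @ period_dir) * period_dir   # project out the "." component
payload /= np.linalg.norm(payload)

# Hidden state = strong "." component + smaller payload component.
h = 10.0 * period_dir + 3.0 * payload

logits = h @ W_U
print("argmax token:", logits.argmax())       # still the "." token
print("payload readout:", h @ payload)        # ~3.0, recoverable by a later layer
```

This is only an illustration of the geometry (orthogonal directions in the residual stream), not a claim about how any particular model actually implements it.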