A different approach to interpretability: instead of probing neurons and circuits, *tune* the models to provide "accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making," then show that this ability generalizes.
#MLSky 🤖

Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning t...
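A minimal sketch of the kind of setup the post and abstract gesture at: instill a known, quantitative decision process into a model via fine-tuning on choices consistent with it, then probe whether the tuned model can report that process. The attribute names, weight ranges, prompt wording, and weighted-sum decision rule below are illustrative assumptions, not details taken from the paper.

```python
# Sketch only (not the authors' code): build fine-tuning data whose "correct" choices
# follow one fixed, hidden set of attribute weights, so there is a ground-truth
# internal process to compare the model's later self-report against.
import json
import random

ATTRIBUTES = ["price", "quality", "convenience"]  # hypothetical decision attributes


def make_choice_example(weights: dict, rng: random.Random) -> dict:
    """One fine-tuning example whose target choice follows the hidden weights."""
    options = [
        {a: round(rng.uniform(0.0, 1.0), 2) for a in ATTRIBUTES} for _ in range(2)
    ]
    scores = [sum(weights[a] * opt[a] for a in ATTRIBUTES) for opt in options]
    choice = "A" if scores[0] >= scores[1] else "B"
    prompt = (
        "Choose the option you prefer.\n"
        f"A: {options[0]}\nB: {options[1]}\n"
        "Answer with A or B."
    )
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": choice},
    ]}


if __name__ == "__main__":
    rng = random.Random(0)
    # The ground-truth "internal process": fixed weights the fine-tuned model
    # should come to use when making these decisions.
    weights = {a: round(rng.uniform(-1.0, 1.0), 2) for a in ATTRIBUTES}

    with open("decision_finetune.jsonl", "w") as f:
        for _ in range(2000):
            f.write(json.dumps(make_choice_example(weights, rng)) + "\n")

    # Held-out probe: after fine-tuning, ask the model to describe its own process
    # and score its reported weights against `weights` (e.g., by correlation).
    probe = (
        "When you chose between options like these, how much weight did you place on "
        + ", ".join(ATTRIBUTES)
        + "? Give a number between -1 and 1 for each."
    )
    print(json.dumps({"ground_truth_weights": weights, "probe_prompt": probe}, indent=2))
```

Generalization would then be tested by probing self-reports on decision domains or attribute sets the model was never tuned to describe; the scoring and probe wording above are placeholders for whatever the paper actually uses.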