can't wait until they actually upload the dataset to go with this one
arxiv.org/abs/2506.08300
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to t...