Ted Underwood: Hey, the dataset is out now. A million books: huggingface.co/datasets/ins... H/t @naitian.org for the link

Ted Underwood tedunderwood.com
institutional/institutional-books-1.0 · Datasets at Hugging Face

huggingface.co
- SE Gyges segyges.bsky.social · Jun 11, 2025
  can't wait til they actually upload the dataset to go with this one arxiv.org/abs/2506.08300
  Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
  
  Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to t...
  
  arxiv.org
Jun 16, 2025 19:02