- NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They’re also lying to publishers who have asked for material to be removed. “The robots are people too,” CC’s exec director told us when we asked about this.
Nov 4, 2025 12:15
- Ask Common Crawl whether they remove child sexual abuse material from their archives. The dataset is full of it (given that LAION-5B comes from Common Crawl).