Common Crawl Foundation

handle.invalid

Common Crawl is a non-profit foundation dedicated to the Open Web.

Joined November 2024

Common Crawl Foundation handle.invalid · Feb 2
The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges. www.commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

Common Crawl Foundation handle.invalid · Feb 2
We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content. www.commoncrawl.org/blog/january...
Common Crawl - Blog - January 2026 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Feb 2
Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work. www.commoncrawl.org/blog/web-arc...
Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol

Common Crawl Foundation handle.invalid · Jan 21
As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph. commoncrawl.org/blog/how-seo...
Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

Common Crawl Foundation handle.invalid · Jan 16
GneissWeb Annotations Examples: A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket. commoncrawl.org/blog/gneissw...
Common Crawl - Blog - GneissWeb Annotations Examples

Common Crawl Foundation handle.invalid · Jan 8
From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events. www.commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025

Common Crawl Foundation handle.invalid · Jan 2
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

Common Crawl Foundation handle.invalid · Jan 2
The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content). commoncrawl.org/blog/decembe...
Common Crawl - Blog - December 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Dec 18, 2025
As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced. commoncrawl.org/blog/a-sampl...
Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl

Common Crawl Foundation handle.invalid · Nov 24, 2025
We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

Common Crawl Foundation handle.invalid · Nov 24, 2025
We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content. commoncrawl.org/blog/novembe...
Common Crawl - Blog - November 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Nov 6, 2025
Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve? commoncrawl.org/blog/common-...

Common Crawl Foundation handle.invalid · Nov 4, 2025
Setting the Record Straight: A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities. commoncrawl.org/blog/setting...
Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

Common Crawl Foundation handle.invalid · Nov 4, 2025
Check out our newsletter for October/November 2025, with updates on what we've been up to commoncrawl.org/blog/october...
Common Crawl - Blog - October/November 2025 Newsletter

Common Crawl Foundation handle.invalid · Oct 29, 2025
The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI

Common Crawl Foundation handle.invalid · Oct 29, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.
Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

Common Crawl Foundation handle.invalid · Oct 29, 2025
We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content. commoncrawl.org/blog/october...
Common Crawl - Blog - October 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Oct 21, 2025
The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community. commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at COLM 2025

Common Crawl Foundation handle.invalid · Oct 6, 2025
Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology. commoncrawl.org/blog/announc...
Common Crawl - Blog - Announcing GneissWeb Annotations

Common Crawl Foundation handle.invalid · Oct 2, 2025
Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality. commoncrawl.org/blog/web-lan...
Common Crawl - Blog - Web Languages Needing Review by Native Speakers

Common Crawl Foundation handle.invalid · Oct 2, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.
Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

Common Crawl Foundation handle.invalid · Oct 2, 2025
The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers. commoncrawl.org/blog/from-se...
Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

Common Crawl Foundation handle.invalid · Sep 23, 2025
We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content. www.commoncrawl.org/blog/septemb...
Common Crawl - Blog - September 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Sep 18, 2025
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received. commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation Opt-Out Registry

Common Crawl Foundation handle.invalid · Sep 18, 2025
On the 28th and 29th of August 2025, Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen attended the Linux Foundation’s AI_dev event in Amsterdam. commoncrawl.org/blog/trip-re...
Common Crawl - Blog - Trip Report: AI_dev (Linux Foundation) August 2025

Common Crawl Foundation handle.invalid · Sep 18, 2025
On October 22, the Common Crawl team will lead a seminar at Stanford HAI. Our topic of discussion is “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. Please register at: hai.stanford.edu/events/commo...
Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI

Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.

hai.stanford.edu

Common Crawl Foundation handle.invalid · Sep 9, 2025
We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else www.techdirt.com/2025/09/08/w...
We’re Walling Off The Open Internet To Stop AI—And It May End Up Breaking Everything Else

A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard—until I realized what h…

techdirt.com

Common Crawl Foundation handle.invalid · Sep 9, 2025
Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new ways to preserve and share humanity’s knowledge. www.commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

Common Crawl Foundation handle.invalid · Aug 26, 2025
We are pleased to release our newsletter for July and August 2025, with updates on our team's activities. commoncrawl.org/blog/july-au...
Common Crawl - Blog - July/August 2025 Newsletter

Common Crawl Foundation handle.invalid · Aug 22, 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The host-level graph consists of 691.1 million nodes and 5.0 bill...

commoncrawl.org

Common Crawl Foundation handle.invalid · Aug 19, 2025
We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content). commoncrawl.org/blog/august-...
Common Crawl - Blog - August 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Aug 14, 2025
Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO. commoncrawl.org/blog/ai-opti...
Common Crawl - Blog - AI Optimization Is Here: Are You Ready for Search 2.0?
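
For readers who want to check whether their own robots.txt restricts CCBot, here is a minimal sketch using Python's standard `urllib.robotparser`. The two sample robots.txt bodies, the `example.com` URLs, and the helper name `ccbot_allowed` are illustrative assumptions and are not taken from the linked post.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that blocks CCBot from the whole site.
BLOCKING = """\
User-agent: CCBot
Disallow: /
"""

# Hypothetical robots.txt that only closes off one directory for all agents.
OPEN = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    page = "https://example.com/blog/post"
    print("blocking file allows CCBot:", ccbot_allowed(BLOCKING, page))
    print("open file allows CCBot:", ccbot_allowed(OPEN, page))
```

To test a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the file over the network.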

Common Crawl Foundation handle.invalid · Aug 14, 2025
The Enclosure Of The Open Web And The Open Internet Toll Booth: What’s Behind Pay-By-Crawl digitalmedusa.org/the-enclosur...
The Enclosure of the Open Web and the Open Internet Toll booth: What’s Behind Pay-By-Crawl - Digital Medusa

Cloudflare recently proposed a system where AI companies and crawlers would pay websites for the right to crawl their content, a move framed as “content independence day”, a response to growing concer...

digitalmedusa.org

Common Crawl Foundation handle.invalid · Aug 4, 2025
A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement. commoncrawl.org/blog/ietf-12...
Common Crawl - Blog - IETF 123 Report

Common Crawl Foundation handle.invalid · Jul 27, 2025
Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2025

Common Crawl Foundation handle.invalid · Jul 27, 2025
The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content. commoncrawl.org/blog/july-20...
Common Crawl - Blog - July 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Jul 21, 2025
The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data. commoncrawl.org/blog/wmdqs-s...
Common Crawl - Blog - WMDQS Shared Task on Language Identification

Common Crawl Foundation handle.invalid · Jul 21, 2025
"MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common Crawl data set."
Unlocking data to advance European commerce and culture - Microsoft On the Issues

Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.

blogs.microsoft.com

Common Crawl Foundation handle.invalid · Jul 8, 2025
In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages. commoncrawl.org/blog/the-fir...
Common Crawl - Blog - The First WMDQS-Masakhane LangID Hackathon

Common Crawl Foundation handle.invalid · Jul 2, 2025
We are pleased to announce that the Web Graph for June 2025 is now available. The graph consists of 371.6 million nodes and 3.1 billion edges at the host level, and 161.8 million nodes and 2.2 billion edges at the domain level. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2025

Common Crawl Foundation handle.invalid · Jul 1, 2025
The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI. commoncrawl.org/blog/common-...
Common Crawl - Blog - Common Crawl at the United Nations Open Source Week, June 2025

Common Crawl Foundation handle.invalid · Jun 27, 2025
We are pleased to announce that the crawl archive for June 2025 is now available. www.commoncrawl.org/blog/june-20...
Common Crawl - Blog - June 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · Jun 24, 2025
We're happy to share our newsletter for May/June 2025 with updates from our team. commoncrawl.org/blog/may-jun...
Common Crawl - Blog - May/June 2025 Newsletter

Common Crawl Foundation handle.invalid · Jun 23, 2025
The deadline for paper submissions has been extended! The new deadline is July 3, 2025 (AoE). For more information, please visit: wmdqs.org
1st Workshop on Multilingual Data Quality Signals

wmdqs.org
- Common Crawl Foundation handle.invalid · May 29, 2025
  Call for papers! We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality! Submission deadline is 23 June, more info: wmdqs.org
  1st Workshop on Multilingual Data Quality Signals
  
  wmdqs.org

Common Crawl Foundation handle.invalid · Jun 20, 2025
The AI Alliance Forms Non-profit AI Lab and AI Technology & Advocacy Association to Scale Open-Source Innovation www.prnewswire.com/news-release...
The AI Alliance Forms Non-profit AI Lab and AI Technology & Advocacy Association to Scale Open-Source Innovation

/PRNewswire/ -- Today, the AI Alliance, a global collaboration of more than 180 organizations committed to AI open innovation, announced it has incorporated...

prnewswire.com

Common Crawl Foundation handle.invalid · Jun 12, 2025
Announcing a refreshed version of the Whirlwind Tour in Python. Get to know how to make the most of our crawl data. commoncrawl.org/blog/announc...
Common Crawl - Blog - Announcing the Whirlwind Tour of Common Crawl's Datasets using Python

Common Crawl Foundation handle.invalid · Jun 10, 2025
The Common Crawl Foundation, together with IBM, the AI Alliance, and BrightQuery will be hosting an "UN Conference" at IBM's new flagship NYC HQ at One Madison Avenue on Friday, June 20, from 12:30-5pm. If you are in NYC, it would be great to see you there! lu.ma/p0a1scde
AI Alliance @ IBM One Madison (UN Open Source Week 2025) · Luma

This year’s UN Open Source Week 2025 (June 16-20) will once again bring together a global “who is who” of Open Source leaders. As part of the official…

lu.ma

Common Crawl Foundation handle.invalid · Jun 7, 2025
We are pleased to announce that the Web Graph for May 2025 is now available. The graph consists of 326.8 million nodes and 2.9 billion edges at the host level, and 156.1 million nodes and 2.1 billion edges at the domain level. commoncrawl.org/blog/host--a...
Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2025

Common Crawl Foundation handle.invalid · Jun 3, 2025
We are pleased to announce that the crawl archive for May 2025 is now available. The data was crawled between May 11th and May 25th, and contains 2.47 billion web pages, or 429 TiB of uncompressed content. commoncrawl.org/blog/may-202...
Common Crawl - Blog - May 2025 Crawl Archive Now Available

Common Crawl Foundation handle.invalid · May 29, 2025
Call for papers! We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality! Submission deadline is 23 June, more info: wmdqs.org
1st Workshop on Multilingual Data Quality Signals
