Common Crawl and the AI Web Scraping Crisis: What You Need to Know
How a nonprofit archive became AI's backdoor to paywalled content
The internet’s infrastructure for data collection is at a breaking point. A nonprofit organization called Common Crawl, which has been quietly archiving the web since 2008, has become the unexpected flashpoint in a larger battle over AI companies’ rights to scrape content without permission. The latest investigations reveal troubling practices that could fundamentally reshape how we think about web scraping ethics and compliance in 2025.
What Is Common Crawl, Really?
Common Crawl is a nonprofit organization that maintains one of the largest archives of the internet ever created. Its petabyte-scale database contains snapshots of billions of web pages, freely available for research and development. On the surface, this sounds noble—democratizing access to the internet’s public data.
But there’s a catch. In recent years, AI companies like OpenAI, Google, Meta, and Amazon have become the primary users of Common Crawl’s archives. These organizations use the data to train large language models. That might seem fine, except for one critical detail: Common Crawl’s archives contain millions of paywalled news articles that shouldn’t be freely accessible at all.
The Paywall Bypass Problem
Here’s where things get murky. Most modern news websites use client-side paywalls, which work like this: the server sends the full article text to your browser, and then JavaScript runs to check whether you’re a subscriber. If you’re not, the script hides the article behind a subscription prompt. It’s a common pattern across publications like The New York Times, The Washington Post, and countless others.
But Common Crawl’s scraper, known as CCBot, never executes that JavaScript. It simply stores the raw HTML response, in which the full article is already present, so the subscription check never runs. In effect, Common Crawl is bypassing paywalls: not necessarily through malicious intent, but through architectural simplicity.
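To make the mechanics concrete, here is a minimal Python sketch of what any non-JavaScript crawler sees when it fetches a client-side-gated page. The URL, the generic bot name, and the page structure are hypothetical placeholders; the point is simply that a plain HTTP GET returns whatever HTML the server sends, and the hiding script never runs.

```python
# Minimal sketch: why a crawler that never executes JavaScript sees
# paywalled text on a client-side-gated page. URL and selectors are
# hypothetical.
import requests
from bs4 import BeautifulSoup

URL = "https://news.example.com/premium-article"  # hypothetical page
HEADERS = {"User-Agent": "DemoBot/1.0"}           # generic, not CCBot

# A plain HTTP GET returns the server's HTML response as-is.
resp = requests.get(URL, headers=HEADERS, timeout=10)

# On a client-side-gated page, the full article body is already in the
# markup; the <script> that would hide it is inert text to this crawler.
soup = BeautifulSoup(resp.text, "html.parser")
article = soup.find("article")
if article is not None:
    print(article.get_text("\n", strip=True))  # full text, no paywall check
```

A headless browser that executed the page’s JavaScript would see the gated version; a raw-HTML archiver never does.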
This creates a backdoor for AI companies. They can access premium journalism that was never meant to be free, train their models on this content without licensing fees, and ultimately profit from publishers’ work without compensation.
The Investigation That Changed Everything
In November 2024, The Atlantic published an investigation that revealed the full scope of the problem. Researchers found that despite publishers submitting takedown requests—including The New York Times in July 2023—Common Crawl’s archives still contained the supposedly removed content.
When asked about this discrepancy, Common Crawl’s executive director Rich Skrenta revealed something shocking: the organization’s file format is “immutable.” Once data is archived, nothing can be truly removed. Content files showed no modifications since at least 2016, suggesting that removal requests have been essentially ignored for nearly a decade.
This isn’t accidental. It’s architectural. Common Crawl’s design prioritizes permanence over flexibility.
Financial Ties to the AI Industry
The situation becomes even more troubling when you examine Common Crawl’s funding. The organization received donations of $250,000 each from OpenAI and Anthropic in 2023, and it collaborates with NVIDIA on AI training datasets. These financial relationships create obvious incentives to maintain unfettered access to comprehensive web data.
The Danish Rights Alliance and other publishers have filed removal requests, only to discover that Common Crawl misrepresented the completion status of these removals. The organization claimed to be “50% complete,” then “80% complete,” all while the underlying archive remained unchanged.
The Broader Web Scraping Landscape in 2025
Common Crawl isn’t operating in isolation. The entire web scraping ecosystem has shifted dramatically toward supporting AI companies over content creators. According to recent research, AI crawlers have become increasingly aggressive:
In Q1 2025, retrieval-augmented generation (RAG) bots, which fetch live web pages to ground chatbot answers, increased their scraping activity by 49 percent, compared with just 18 percent growth for training-focused bots. This indicates that AI companies now see continuous content access as essential to their business model, not just a one-time training need.
Worse, bot compliance with industry standards has deteriorated. TollBit reported that as of Q1 2025, 12.9 percent of bots now ignore robots.txt files entirely—up from 3.3 percent. Companies like Perplexity have been caught deliberately obscuring their identity and rotating user agents to evade blocking attempts.
Meta’s leaked scraping list revealed that the company harvested data from 6 million unique websites to train its AI models. That includes copyrighted content, pirated material, and adult websites—much of it obtained without explicit consent.
Fighting Back: New Tools and Regulations
The pushback has begun. In July 2025, Cloudflare made a historic shift by flipping the default setting for AI scraping from “opt-out” to “opt-in.” Every new domain now starts with AI crawlers blocked by default. Over one million customers have already activated this protection.
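Cloudflare’s block operates at the network edge, but sites can state the same policy themselves in a plain robots.txt file. The sketch below uses user-agent tokens the major crawlers have published (CCBot for Common Crawl, GPTBot for OpenAI, Google-Extended for Google’s AI training, ClaudeBot for Anthropic); the list is illustrative, tokens change over time, and compliance is entirely voluntary, which is exactly the weakness the TollBit numbers above expose.

```
# Block AI training crawlers while leaving ordinary indexing alone.
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
```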
But Cloudflare went further. The company launched “Pay Per Crawl,” a marketplace where publishers can set their own rates and AI companies can choose whether to pay for access. This creates a potential revenue stream for content creators—though adoption remains limited.
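Cloudflare has described Pay Per Crawl as reviving the long-dormant HTTP 402 Payment Required status code: a crawler without a billing arrangement gets a 402 instead of the page, and a paying crawler retries with a declared price. The Python sketch below shows the general shape of that exchange; the header names and URL are illustrative assumptions, not Cloudflare’s documented API.

```python
# Sketch of an HTTP 402 pay-per-crawl exchange. Header names
# (X-Crawler-Price, X-Crawler-Max-Price) are illustrative assumptions.
import requests

URL = "https://publisher.example.com/article"  # hypothetical
HEADERS = {"User-Agent": "DemoBot/1.0"}

resp = requests.get(URL, headers=HEADERS, timeout=10)

if resp.status_code == 402:
    # The origin quotes its price; the crawler decides whether to pay.
    quoted = resp.headers.get("X-Crawler-Price", "unknown")
    print(f"Access requires payment: {quoted} per request")

    # A crawler with a billing relationship retries, declaring the
    # maximum price it is willing to pay for this fetch.
    resp = requests.get(
        URL,
        headers={**HEADERS, "X-Crawler-Max-Price": quoted},
        timeout=10,
    )

print(resp.status_code, len(resp.content))
```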
Other infrastructure providers and individual websites are following suit. Many major publishers have begun aggressive bot blocking, and court rulings have shown mixed but somewhat encouraging results for publishers. In June 2025, Reddit filed a lawsuit against Anthropic over unauthorized Claude AI training.
What This Means for Web Scraping Ethics Going Forward
The Common Crawl controversy has forced a reckoning. Web scraping in 2025 isn’t just a technical issue anymore—it’s a legal, ethical, and compliance issue.
The old internet operated on assumptions of openness and good faith. Crawlers would respect robots.txt files. Publishers would permit reasonable indexing for search engines. Data collection would remain proportional to its value to users.
Those assumptions are dead. The new reality requires explicit permission frameworks, contractual relationships, and reputation management. Your crawler’s identity matters. Your compliance history matters. Your stated purpose matters.
If you’re building legitimate scraping infrastructure, follow these core principles: respect robots.txt files strictly, implement crawl delays to minimize server load, seek explicit permission when possible, and document your compliance efforts meticulously. The days of scraping “politely” without clear authorization are ending.
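As a concrete baseline, here is a minimal “polite fetch” sketch in Python that applies those principles: it consults robots.txt before every request, honors any declared crawl delay, and identifies itself with a stable user agent. The bot name and contact URL are hypothetical placeholders.

```python
# Minimal polite-crawler sketch: consult robots.txt, honor crawl
# delays, and identify yourself. Bot name and contact URL are
# hypothetical placeholders.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "DemoBot/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> str | None:
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(root + "/robots.txt")
    rp.read()

    # Respect robots.txt strictly: skip the fetch if disallowed.
    if not rp.can_fetch(USER_AGENT, url):
        return None

    # Honor the site's declared crawl delay; default to a gentle pause.
    time.sleep(rp.crawl_delay(USER_AGENT) or 1.0)

    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    resp.raise_for_status()
    return resp.text

# Usage: returns None when robots.txt forbids the fetch.
html = polite_fetch("https://example.com/some-page")
```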
For companies building AI infrastructure, the message is clearer: free data pipelines built on ambiguous consent are increasingly untenable. Licensing arrangements, transparent attribution, and compensation frameworks are becoming competitive necessities rather than ethical niceties.
The Bigger Picture
Common Crawl’s crisis represents a fundamental tension in the modern internet. AI companies need vast datasets to build competitive models. Publishers need revenue to fund journalism. Researchers need accessible data. These interests aren’t easily reconciled.
But Common Crawl has chosen one side, and that choice has consequences. By maintaining immutable archives and accepting funding from the companies that benefit most from unrestricted data access, the organization has positioned itself as infrastructure for AI model training rather than a neutral research resource.
The question now isn’t whether the current system will change—it clearly will. The question is whether that change will be shaped by formal regulation, market mechanisms like Cloudflare’s Pay Per Crawl, individual website blocking, or some combination of all three.
For web scraping professionals, publishers, and AI companies, one thing is certain: 2025 marks the end of an era. The permissionless internet is being replaced by a permission-based one. The only question is how quickly, and who gets to write the rules.
What’s your take on Common Crawl’s approach? Are AI companies justified in using archived data, or is this exploitation of content creators? Share your thoughts in the comments.


