Wikipedia vs. the Scraping Surge
What the Bot Battle Reveals About the Future of the Open Web
Wikipedia has always been the backbone of the open web — a digital commons built by humans, for humans. But as IBM’s recent Think feature revealed, the encyclopedia is facing an unprecedented identity crisis: bots now generate roughly two-thirds of its most resource-intensive traffic, consuming far more server capacity per request than human readers ever do.
At its core, this is not just a bandwidth problem. It’s a philosophical one. Can humans and bots really share the internet? And if so, under what rules?
The Scraping Surge: When AI Models Start Acting Like Readers
As IBM’s report highlights, platforms like SearchGPT, Perplexity, and countless AI startups lean heavily on Wikipedia’s structured, continuously updated data. The result is a constant stream of automated queries that mimic the behavior of human readers, but at superhuman scale.
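To make that scale gap concrete, here is a minimal Python sketch of the kind of automated read a crawler performs, using Wikipedia’s public REST API summary endpoint. The endpoint and the identify-yourself header are real conventions; the bot name and the tiny article list are stand-ins for the millions of titles a real scraper would iterate over.

```python
# Illustrative only: the kind of automated read an AI crawler performs,
# here against Wikipedia's public REST API summary endpoint.
import requests

HEADERS = {
    # Wikimedia asks bots to identify themselves with contact information.
    "User-Agent": "ExampleResearchBot/0.1 (contact@example.org)",
}

def fetch_summary(title: str) -> dict:
    """Fetch one article's lead summary as structured JSON."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

# A human reader triggers a handful of these per visit; a training-data
# crawler runs this loop over millions of titles, continuously.
for title in ["Jimmy_Carter", "Large_language_model", "Wikipedia"]:
    data = fetch_summary(title)
    print(data["title"], "->", data["extract"][:80])
```

Nothing in that loop is abusive on its own; the strain comes from running it at machine speed across every language edition, around the clock.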
The irony here is palpable. Wikipedia was built on openness. Its entire premise is that anyone can read, copy, and remix information freely. But in the AI era, this same openness is now being weaponized against it.
According to Lane Becker, President of Wikimedia LLC, “The AI companies are here, and they are particularly voracious.” That’s putting it lightly. During breaking news events, Wikimedia’s servers see traffic spikes not only from curious readers, but from automated scrapers hoovering up text and images to train AI models.
One striking example: the slowdowns that followed Jimmy Carter’s death weren’t caused by human curiosity alone. Scraper bots bulk-downloading Wikimedia Commons’ vast catalog of images and video had already eaten up so much bandwidth that an entirely human surge of readers was enough to push the infrastructure to its limits.
For the first time in history, the web’s largest volunteer project is being overloaded by machines, not humans.
Beyond Bandwidth: The “Garbage In, Garbage Out” Problem
While infrastructure strain is one issue, IBM researchers are focusing on another: data quality.
Nirmit Desai and Rosario Uceda-Sosa, two IBM AI researchers, have worked with Wikimedia on improving data annotations and structure — because large language models (LLMs) are only as good as their inputs. “It really does pay to have clean data,” Uceda-Sosa told IBM.
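In practice, “clean data” often just means asking for structure instead of scraping around it. Below is a small sketch, assuming the standard MediaWiki Action API with its TextExtracts parameters, of pulling a plain-text, markup-free extract rather than stripping rendered HTML after the fact; the bot name is illustrative.

```python
# Sketch: requesting a clean, plain-text extract through the MediaWiki
# Action API (TextExtracts) instead of scraping and stripping HTML.
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "ExampleResearchBot/0.1 (contact@example.org)"}

def clean_extract(title: str) -> str:
    """Return the article's lead section as plain text, free of markup."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,      # lead section only
        "explaintext": 1,  # plain text rather than HTML
        "titles": title,
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values()))["extract"]

print(clean_extract("Jimmy Carter")[:200])
```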
And this is where the story deepens. Every AI company — from OpenAI to Anthropic — depends on Wikipedia data in some form. But as automated scrapers flood the web, there’s a growing risk that the quality signal gets lost in the noise.
Think of it this way: AI models scraping low-quality derivatives of Wikipedia content can easily amplify errors, distort facts, and reinforce feedback loops of misinformation. As IBM put it, “Garbage in, garbage out” remains the rule — even in the era of trillion-parameter models.
Wikimedia Enterprise: Monetizing the Commons
To balance openness with sustainability, Wikimedia launched Wikimedia Enterprise in 2021 — a paid API for commercial users that provides guaranteed uptime, curated datasets, and structured, machine-readable data.
This isn’t Wikipedia selling out. It’s survival.
The model gives AI companies legitimate access to the encyclopedia’s data at scale while helping fund its infrastructure. It’s also an early test case for a future where data provenance and fair compensation become the norm.
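From the consumer’s side, the trade looks roughly like the sketch below: authenticate, pay, and receive whole articles as clean, machine-readable objects with an uptime guarantee. The base URL, token handling, and field names here are placeholders, not the actual Wikimedia Enterprise API surface, which the article does not spell out.

```python
# Hypothetical sketch of consuming a paid, structured article feed.
# The base URL, token handling, and field names are placeholders, not
# the actual Wikimedia Enterprise API.
import os
import requests

BASE = "https://api.example-enterprise-feed.org/v2"  # placeholder, not a real host
TOKEN = os.environ["ENTERPRISE_API_TOKEN"]           # issued with a paid plan

def get_article(name: str) -> dict:
    """Fetch one article as a structured, machine-readable object."""
    resp = requests.get(
        f"{BASE}/articles/{name}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Imagined payload: title, body, license, revision metadata.
    return resp.json()

article = get_article("Jimmy_Carter")
print(article.get("title"), article.get("revision"))
```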
Compare this to what’s happening elsewhere: Gannett’s revenue-sharing deal with Perplexity, or Cloudflare’s recent decision to block AI crawlers by default. The web’s foundational layers are being redrawn — access, fairness, and monetization are now inseparable.
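Default blocking still depends on crawlers honoring the signals publishers expose, the oldest of which is robots.txt. Here is a minimal check using Python’s standard-library parser, with OpenAI’s published GPTBot token as the example crawler and an invented site.

```python
# Sketch: a crawler checking robots.txt before fetching, using Python's
# standard-library parser. "GPTBot" is OpenAI's published crawler token;
# the target site is invented.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

page = "https://example.org/articles/some-page"
if rp.can_fetch("GPTBot", page):
    print("Allowed: fetch the page and record the source for provenance.")
else:
    print("Disallowed: skip it, or negotiate licensed access instead.")
```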
The New Scraping Economy
The fight over scraping isn’t really about code or crawling — it’s about ownership and reciprocity.
Scraping once meant pulling prices from e-commerce sites or indexing news feeds. Now it means fueling AI intelligence itself. The stakes are higher, the players bigger, and the ethics blurrier.
Startups like TollBit, ScalePost, and ProRata.ai are trying to broker that balance — offering systems where publishers get paid when AI systems use their data. This may sound idealistic, but it’s a necessary evolution. Without compensation models, the open web risks becoming a one-way extraction economy.
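None of these brokers publish a common interface yet, so the following is a purely hypothetical Python sketch of the underlying idea: meter each AI fetch against the publisher it came from, then settle up. Every name, rate, and rule here is invented; it shows the shape of a pay-per-use scheme, not any vendor’s actual system.

```python
# Purely hypothetical sketch of a pay-per-use metering ledger: every AI
# fetch is logged against the publisher so usage can be settled later.
# Rates, names, and the settlement logic are all invented.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CrawlEvent:
    bot: str          # who fetched
    publisher: str    # whose content
    tokens_used: int  # how much of it entered the model pipeline

RATE_PER_1K_TOKENS = 0.002  # invented rate, in dollars

class MeteringLedger:
    def __init__(self):
        self.usage = defaultdict(int)   # (bot, publisher) -> tokens

    def record(self, event: CrawlEvent) -> None:
        self.usage[(event.bot, event.publisher)] += event.tokens_used

    def invoice(self, bot: str) -> dict:
        """What this bot owes each publisher for the period."""
        return {
            pub: round(tokens / 1000 * RATE_PER_1K_TOKENS, 4)
            for (b, pub), tokens in self.usage.items()
            if b == bot
        }

ledger = MeteringLedger()
ledger.record(CrawlEvent("search-bot", "encyclopedia.example", 12_000))
ledger.record(CrawlEvent("search-bot", "news.example", 4_500))
print(ledger.invoice("search-bot"))  # {'encyclopedia.example': 0.024, 'news.example': 0.009}
```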
Can Humans and Bots Coexist?
That question — the one IBM’s title subtly implies — cuts to the heart of the issue.
Yes, humans and bots can share the internet. But not like this. Not when automated crawlers consume bandwidth, drain infrastructure, and exploit open licenses without returning value to the ecosystem that sustains them.
The future of the web depends on transparent data agreements, traceable training sets, and shared economic value between content creators and AI systems.
Wikipedia’s move toward a structured, monetized API might just be the blueprint for that balance. It’s a reminder that open data doesn’t have to mean free-for-all — it can mean fair-for-all.
What’s happening to Wikipedia is not an isolated event. It’s a mirror for the entire internet.
If we want AI to coexist with the human web, we need to rebuild the architecture of trust — where data is open, but responsibly used; models are powerful, but accountable; and scrapers are transparent, not parasitic.
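In engineering terms, a transparent scraper mostly comes down to etiquette: identify yourself with a real contact, pace your requests, and use conditional requests so unchanged pages cost the server almost nothing. A hedged sketch of such a fetch loop follows; the bot name, delay, and cache are illustrative choices rather than any standard, though the If-None-Match/ETag mechanism is plain HTTP.

```python
# Sketch of a "good citizen" fetch loop: an honest User-Agent with contact
# details, a pause between requests, and conditional requests so unchanged
# pages cost the server almost nothing. The bot name and delay are
# illustrative, not a standard.
import time
from typing import Optional

import requests

HEADERS = {"User-Agent": "ExampleResearchBot/0.1 (contact@example.org)"}
CRAWL_DELAY = 2.0        # seconds between requests
etag_cache = {}          # url -> last seen ETag

def polite_get(url: str) -> Optional[requests.Response]:
    headers = dict(HEADERS)
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]  # standard HTTP conditional request
    resp = requests.get(url, headers=headers, timeout=10)
    time.sleep(CRAWL_DELAY)                         # pace requests instead of hammering
    if resp.status_code == 304:                     # not modified: reuse the cached copy
        return None
    if "ETag" in resp.headers:
        etag_cache[url] = resp.headers["ETag"]
    return resp

page = polite_get("https://en.wikipedia.org/api/rest_v1/page/summary/Wikipedia")
print("fresh copy" if page is not None else "cached copy still valid")
```

None of this is exotic; it is the same handful of HTTP conventions browsers already follow, applied by machines that say what they are.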
The future of scraping isn’t about blocking bots.
It’s about teaching them to be better citizens of the web.