
When Bots Are the Audience: What the Rise of AI Crawlers Means for Content Creators

Written by Tom Fry | Apr 11, 2025 11:17:14 AM

How much of the internet’s traffic is real? Probably less than you think. In fact, in 2024 the share of traffic driven by a person at a browser dropped below half for the first time.

This raises an obvious question - where does the other half come from? One word: bots.

Bots have (almost) always been part of web traffic. One of the first crawlers, WebCrawler, was created in 1994 to index web pages. It was quickly followed by others - most famously Googlebot, which arrived in 1996.

The internet as we know it today wouldn’t be possible without bots - search engines rely on them to crawl the web, link by link, and uncover previously hidden corners of the World Wide Web. When a new public website is launched, one of the first actions a webmaster takes is to submit it to Google for indexing.

As the web has grown, so too has the use of bots. But recently bot traffic has surged, and it’s all because of generative AI.

A brief history: Common Crawl data

Not all web crawler bots are working for search engines. A well-known crawler called CCBot belongs to Common Crawl (https://commoncrawl.org/), a non-profit organisation. This web crawler has been quietly scouring the internet for more than 18 years, not to index the web but to create a copy of it.

The Common Crawl dataset is a broad scrape of the internet - an open repository of web crawl data consisting of 250 billion pages collected over 18 years, made freely available to researchers.
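For a sense of how accessible this data is, Common Crawl runs a public index server that lets anyone look up captures of a given URL. Below is a minimal sketch in Python; the crawl ID is illustrative, as each crawl release has its own, and the endpoint shown assumes Common Crawl’s public index at index.commoncrawl.org.

    import json
    import urllib.request

    # Look up captures of a URL in one Common Crawl release.
    # The crawl ID (CC-MAIN-2024-33) is illustrative - each release has its own.
    INDEX_URL = (
        "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
        "?url=example.com&output=json"
    )

    with urllib.request.urlopen(INDEX_URL) as response:
        # The server returns one JSON record per line, one per capture
        for line in response.read().decode().splitlines():
            record = json.loads(line)
            print(record.get("timestamp"), record.get("url"))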

This dataset was crucial in the development of generative AI as we know it today. For GPT-3, the precursor to ChatGPT and the whole generative AI era, Common Crawl data made up 60% of the training dataset.

It’s no exaggeration to say that we wouldn’t have the advanced LLMs we have today without the Common Crawl dataset to help train them. But AI years are short, and having your own copy of the internet is probably better than using someone else’s - so in 2023, OpenAI launched its own web crawler, “GPTBot”.

Several other bots designed for generative AI have followed. Meta’s bot, the “Meta External Agent”, launched in August 2024; Mistral AI has “MistralAI-User”; Anthropic has “ClaudeBot”. Even Google launched a new bot last year, the “Google-CloudVertexBot”, of which it says: “unlike other Google bots linked with search or advertising, CloudVertexBot collects data particularly for AI-related services.”

Bots are expensive guests

Every time a bot visits a site, it consumes bandwidth, server resources, and processing power. For large sites, this may be manageable. But for smaller publishers, academic platforms, and independent blogs, the burden can be significant - especially when tens of thousands of requests come from AI bots that bring zero traffic in return. Publishers can exclude specific bots with a configuration file called robots.txt (sketched below), but the onus is on them to make the change.
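For illustration, a minimal robots.txt that opts out of the AI crawlers named above might look like the following. The user-agent tokens shown are the ones the vendors have published, but they do change, so check each vendor’s documentation for the current values (Meta’s documented token, for instance, is “meta-externalagent”).

    # Disallow common AI training crawlers site-wide
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: meta-externalagent
    Disallow: /

    # Search crawlers remain welcome
    User-agent: Googlebot
    Allow: /

It’s worth remembering that robots.txt is a voluntary convention: well-behaved crawlers honour it, but nothing technically stops a bot from ignoring it.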

The problem is partly one of ethics. Unlike search engine crawlers, which help drive traffic back to a site by surfacing it in search results, GenAI bots offer no such benefit (at least, that's not their primary function). They scrape, train, and leave - and the payoff goes entirely to the AI companies behind them. This has led to a growing chorus of frustration among publishers and website owners who feel their work is being siphoned off without consent or compensation.

Some have responded by blocking these bots entirely. Sites like The New York Times, CNN, and Reuters have already updated their robots.txt files to keep out crawlers like GPTBot. Others are exploring ways to monetise their data – for instance, Reddit recently partnered with OpenAI. As the value of high-quality content rises, a battle is brewing over who gets to use it, and on what terms.
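Before blocking anything, a site owner may want to know how much of their traffic already comes from AI crawlers. Here is a minimal sketch in Python, assuming a standard combined-format access log where the user agent is the final quoted field; the file name and the bot list are illustrative:

    import re
    from collections import Counter

    # User-agent substrings of known AI crawlers (illustrative list)
    AI_BOTS = ["GPTBot", "CCBot", "ClaudeBot", "meta-externalagent", "MistralAI-User"]

    # In the combined log format, the user agent is the last quoted field
    UA_PATTERN = re.compile(r'"([^"]*)"$')

    def count_ai_requests(log_path: str) -> Counter:
        """Tally requests per AI crawler found in an access log."""
        hits = Counter()
        with open(log_path) as log:
            for line in log:
                match = UA_PATTERN.search(line.strip())
                if not match:
                    continue
                user_agent = match.group(1).lower()
                for bot in AI_BOTS:
                    if bot.lower() in user_agent:
                        hits[bot] += 1
        return hits

    if __name__ == "__main__":
        for bot, count in count_ai_requests("access.log").most_common():
            print(f"{bot}: {count} requests")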

So what happens next?

The web is at a crossroads. With more than half of traffic now coming from bots - and a growing share of that driven by generative AI - we’re facing a fundamental shift in how the internet works. Websites are no longer just publishing for human readers. They’re also feeding machines.

That raises thorny questions: Who owns web content in the age of AI? Should publishers be paid when their work is used to train models? Is it time to rethink the norms of open access that have defined the internet for decades?

One thing’s clear: the rise of GenAI bots is reshaping not just web traffic, but the web itself. And whether you run a website, read the news, or build AI tools, it’s going to affect you.

In this new era, the line between reader and scraper is blurrier than ever. And if half the internet is now bots, maybe it’s time to ask: who are we really building the web for?