
Cloudflare rolls out ‘pay-per-crawl’ feature to constrain AI’s limitless hunger for data 

The move is the result of customer feedback: publishers neither wanted to grant AI web crawlers unrestricted access to their data nor to block the practice entirely.
Cloudflare headquarters in San Francisco. (Getty Images)

Cloudflare announced Tuesday it will allow customers to block or charge fees for web crawlers deployed to scrape their websites and data on behalf of AI systems.

In a blog post on its corporate website, Will Allen, Cloudflare’s vice president of product, and Simon Newton, an engineering manager, said the company is establishing a new system to limit AI web crawlers after hearing feedback from its customers.

The beta feature, called “pay-per-crawl,” builds on existing web infrastructure, HTTP status codes and authentication mechanisms to enable paid access to content on customer websites.
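
The announcement centers that exchange on the long-reserved HTTP 402 “Payment Required” status code. Below is a minimal, hypothetical sketch of how an origin server could implement the pattern; the crawler-price, crawler-max-price and crawler-charged header names are modeled on those described in the announcement but should be treated as illustrative, and the server, pricing table and logic are invented for demonstration.

```python
# Minimal sketch of a pay-per-crawl style response built on HTTP 402.
# Hypothetical: header names are modeled on the announcement; this server,
# its flat price and its decision logic are illustrative only.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_PER_REQUEST_USD = "0.01"  # flat, domain-wide price set by the publisher

class PayPerCrawlHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        max_price = self.headers.get("crawler-max-price")
        if max_price is not None and float(max_price) >= float(PRICE_PER_REQUEST_USD):
            # The crawler's offer covers the quoted price:
            # serve the content and confirm the charge.
            body = b"<html>paid content</html>"
            self.send_response(200)
            self.send_header("crawler-charged", PRICE_PER_REQUEST_USD)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No offer, or an offer below the price: quote the price and deny.
            self.send_response(402)  # Payment Required
            self.send_header("crawler-price", PRICE_PER_REQUEST_USD)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8402), PayPerCrawlHandler).serve_forever()
```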

The move is the result of feedback from customers of the company, which provides hosting and cybersecurity services for roughly 1 out of every 5 websites in the world. Those customers wanted neither to grant AI web crawlers unrestricted access to their data nor to block the practice entirely.


“After hundreds of conversations with news organizations, publishers, and large-scale social media platforms, we heard a consistent desire for a third path: They’d like to allow AI crawlers to access their content, but they’d like to get compensated,” Allen and Newton wrote.

Domain owners can designate a flat fee for every request, with the option to block a crawler entirely, allow it free access, or charge it a domain-wide price. Cloudflare will act as the merchant of record for the exchanges and provide the underlying technical infrastructure to run pay-per-crawl.
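
As a rough mental model of those three options (a hypothetical illustration, not Cloudflare’s configuration API), a publisher’s per-crawler choice reduces to a block/allow/charge policy plus one domain-wide price:

```python
# Hypothetical model of the three publisher-side options described above:
# block a crawler, allow it for free, or charge a flat domain-wide price.
# An illustration of the policy, not Cloudflare's configuration API.
from enum import Enum

class CrawlerPolicy(Enum):
    BLOCK = "block"    # deny the crawler outright
    ALLOW = "allow"    # grant free access
    CHARGE = "charge"  # require payment at the domain-wide price

DOMAIN_PRICE_USD = 0.01  # one flat fee applies to every request on the domain

# Example per-crawler policies; the crawler names are invented.
policies = {
    "example-ai-crawler": CrawlerPolicy.CHARGE,
    "verified-search-bot": CrawlerPolicy.ALLOW,
    "unknown-scraper": CrawlerPolicy.BLOCK,
}

def decide(crawler_name: str) -> tuple[int, float | None]:
    """Return (HTTP status, quoted price) for a crawler's request."""
    policy = policies.get(crawler_name, CrawlerPolicy.BLOCK)
    if policy is CrawlerPolicy.ALLOW:
        return 200, None
    if policy is CrawlerPolicy.CHARGE:
        return 402, DOMAIN_PRICE_USD  # Payment Required, with the price quoted
    return 403, None  # Forbidden
```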

AI crawlers, meanwhile, will be able to register with the system, see pricing for different resources and set a maximum price they are willing to pay, letting the system decide whether a given request is worth the cost, according to Cloudflare’s sign-up page for the beta program. To prevent bad actors from spoofing legitimate crawlers and triggering fraudulent charges, crawlers must also register with Cloudflare and provide the URLs of their key directories and their user agent information.
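
From the crawler’s side, that negotiation might look something like the hypothetical client below. It assumes the same illustrative header names as the sketch above; the budget, logging and control flow are invented for demonstration.

```python
# Hypothetical crawler-side view of a pay-per-crawl negotiation.
# Header names are illustrative; budget and control flow are invented.
import requests

MAX_PRICE_USD = "0.05"  # the most this crawler is willing to pay per request

def fetch(url: str) -> bytes | None:
    # Declare the maximum acceptable price up front.
    resp = requests.get(url, headers={"crawler-max-price": MAX_PRICE_USD})

    if resp.status_code == 200:
        charged = resp.headers.get("crawler-charged")
        if charged:
            print(f"Paid {charged} USD for {url}")
        return resp.content

    if resp.status_code == 402:
        # The publisher quoted a price above our limit; log it and move on.
        quoted = resp.headers.get("crawler-price")
        print(f"Skipping {url}: quoted price {quoted} USD exceeds budget")
        return None

    resp.raise_for_status()
    return None

if __name__ == "__main__":
    fetch("https://example.com/article")
```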

An entry in the frequently asked questions section says the company — which handles trillions of requests and beats back automated denial-of-service attacks every day — has “the world’s most advanced bot management solutions,” using a combination of machine learning, behavioral analysis and digital fingerprinting to separate AI crawlers from search engine bots, verified bot programs and other “good” forms of automated web scraping.

The announcement marks a potentially substantial blow to one of the primary ways that AI models feed and train their systems: by collecting every scrap of publicly available data they can through web-scraping technology.


Web scraping is far from new, but the data-hungry needs of large language models have pushed the practice to potentially unsustainable levels, eating up bandwidth, slowing page loads and causing other service disruptions. The Wikimedia Foundation said that since January 2024, 65% of its most expensive traffic has come from bots. It has also seen bandwidth used for downloading multimedia content grow by 50%, noting that the expansion of AI scrapers “is causing a significant load on the underlying infrastructure that keeps our sites available for everyone.”

“We are observing a significant increase in request volume, with most of this traffic being driven by scraping bots collecting training data for large language models (LLMs) and other use cases,” members of the foundation wrote in April. “Automated requests for our content have grown exponentially, alongside the broader technology economy, via mechanisms including scraping, APIs, and bulk downloads.”

Allen and Newton wrote that features like pay-per-crawl are part of a larger shift in “how content is controlled online” as AI systems gobble up data from every available source. They expect the program to evolve “significantly” over the years to cover different types of transactions and marketplaces.

“For example, a publisher or news organization might want to charge different rates for different paths or content types,” the authors wrote. “How do you introduce dynamic pricing based not only upon demand, but also how many users your AI application has? How do you introduce granular licenses at internet scale, whether for training, inference, search, or something entirely new?”
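
To make the first of those questions concrete, a purely speculative sketch of per-path pricing might map URL prefixes to rates; nothing of the sort is confirmed for the current beta.

```python
# Purely speculative sketch of the per-path pricing the authors float as a
# possible future direction; none of this exists in the current beta.
PATH_RATES_USD = {
    "/news/": 0.02,      # premium reporting
    "/archive/": 0.005,  # older content at a discount
    "/blog/": 0.0,       # free to crawl
}
DEFAULT_RATE_USD = 0.01

def rate_for(path: str) -> float:
    """Return the per-request price for the longest matching path prefix."""
    matches = [p for p in PATH_RATES_USD if path.startswith(p)]
    if not matches:
        return DEFAULT_RATE_USD
    return PATH_RATES_USD[max(matches, key=len)]

assert rate_for("/news/2025/cloudflare.html") == 0.02
assert rate_for("/about") == DEFAULT_RATE_USD
```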

The move to establish a system of compensation for websites and data owners comes as AI companies like OpenAI face numerous copyright lawsuits from artists, writers, publishers and other content creators, who argue the companies are training their systems on, and profiting from, content produced by others.


Written by Derek B. Johnson

Derek B. Johnson is a reporter at CyberScoop, where his beat includes cybersecurity, elections and the federal government. Prior to that, he has provided award-winning coverage of cybersecurity news across the public and private sectors for various publications since 2017. Derek has a bachelor’s degree in print journalism from Hofstra University in New York and a master’s degree in public policy from George Mason University in Virginia.
