New Internet Rules Will Block AI Training Bots

New standards are being developed to extend the Robots Exclusion Protocol and Meta Robots tags so that publishers can block AI crawlers from using publicly available web content for training purposes. The proposal, drafted by Krishna Madhavan, Principal Product Manager at Microsoft AI, and Fabrice Canel, Principal Product Manager at Microsoft Bing, would make it easy to block all mainstream AI training crawlers with one simple rule, which can also be applied to individual crawlers.

Virtually all legitimate crawlers obey Robots.txt and Meta Robots tags, which makes this proposal a dream come true for publishers who don’t want their content used for AI training purposes.

Internet Engineering Task Force (IETF)

The Internet Engineering Task Force (IETF) is an international Internet standards-making group, founded in 1986, that coordinates the development and codification of standards that everyone can voluntarily agree on. For example, the Robots Exclusion Protocol was independently created in 1994, and in 2019 Google proposed that the IETF adopt it as an official standard with agreed-upon definitions. In 2022 the IETF published an official Robots Exclusion Protocol that defines what it is and extends the original protocol.

Three Ways To Block AI Training Bots

The draft proposal for blocking AI training bots suggests three ways to block the bots:

Robots.txt Protocols
Meta Robots HTML Elements
Application Layer Response Header

1. Robots.txt For Blocking AI Robots

The draft proposal seeks to create additional rules that will extend the Robots Exclusion Protocol (Robots.txt) to AI training robots. This will bring about some order and give publishers a choice in which robots are allowed to crawl their websites.

Adherence to the Robots.txt protocol is voluntary, but legitimate crawlers tend to obey it.

The draft explains the purpose of the new Robots.txt rules:

“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288], the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models.

Application developers are requested to honor these tags. The tags are not a form of access authorization however.”

An important quality of the new robots.txt rules and the meta robots HTML elements is that legitimate AI training crawlers can be expected to follow them voluntarily, as legitimate bots generally do with the existing protocols. This will simplify bot blocking for publishers.

The following are the proposed Robots.txt rules:

DisallowAITraining – instructs the parser to not use the data for AI training language model.

AllowAITraining – instructs the parser that the data can be used for AI training language model.
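To illustrate how a crawler might honor these rules, here is a minimal Python sketch. It assumes the new rules take a path value and are grouped under User-Agent lines in the same way as the existing Allow and Disallow rules; the function name, the sample file, and the "longest matching prefix wins" behavior are illustrative assumptions, not details taken from the draft.

```python
def ai_training_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return False if a DisallowAITraining rule covers `path` for `user_agent`.

    Simplified on purpose: exact user-agent matching, plain prefix matching,
    no wildcards. Anything not covered by a rule is treated as allowed.
    """
    rules = {}            # lowercase user-agent -> list of (directive, path prefix)
    current_agents = []   # user-agents of the group currently being read
    last_was_agent = False

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not last_was_agent:            # a new group starts here
                current_agents = []
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
            last_was_agent = True
            continue
        last_was_agent = False
        if key in ("allowaitraining", "disallowaitraining"):
            for agent in current_agents:
                rules[agent].append((key, value))

    # Prefer the group that names this crawler; otherwise fall back to "*".
    group = rules.get(user_agent.lower())
    if group is None:
        group = rules.get("*", [])

    best = None  # (directive, prefix) with the longest matching prefix
    for directive, prefix in group:
        if prefix and path.startswith(prefix):
            if best is None or len(prefix) > len(best[1]):
                best = (directive, prefix)
    return best is None or best[0] == "allowaitraining"


# Hypothetical robots.txt: block AI training for everyone except "examplebot".
EXAMPLE = """\
User-Agent: *
Allow: /
DisallowAITraining: /

User-Agent: examplebot
AllowAITraining: /
"""

print(ai_training_allowed(EXAMPLE, "somebot", "/article"))     # False
print(ai_training_allowed(EXAMPLE, "examplebot", "/article"))  # True
```

In this sketch a single DisallowAITraining line under the wildcard user-agent covers every crawler that has no group of its own, which is the "one simple rule" scenario the proposal describes.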

2. HTML Element (Robots Meta Tag)

The following are the proposed meta robots directives:

<meta name="robots" content="DisallowAITraining">

<meta name="examplebot" content="AllowAITraining">
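As a rough illustration of how a crawler could read these directives, the following Python sketch uses the standard library's html.parser module. The class and function names are hypothetical, and the precedence shown (a tag that names the bot overrides the generic "robots" tag) is an assumption made for the example rather than something the article attributes to the draft.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the comma-separated content tokens of every named <meta> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercase meta name -> set of lowercase tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if not name:
            return
        tokens = {t.strip().lower() for t in attr.get("content", "").split(",") if t.strip()}
        self.directives.setdefault(name, set()).update(tokens)


def page_allows_ai_training(html: str, bot_name: str) -> bool:
    """Return False if the page opts out of AI training for `bot_name`."""
    parser = MetaRobotsParser()
    parser.feed(html)
    specific = parser.directives.get(bot_name.lower(), set())
    generic = parser.directives.get("robots", set())
    # Assumed precedence: a bot-specific tag beats the generic "robots" tag.
    if "allowaitraining" in specific:
        return True
    if "disallowaitraining" in specific:
        return False
    return "disallowaitraining" not in generic


PAGE = (
    '<html><head>'
    '<meta name="robots" content="DisallowAITraining">'
    '<meta name="examplebot" content="AllowAITraining">'
    '</head><body>Hello</body></html>'
)

print(page_allows_ai_training(PAGE, "examplebot"))  # True
print(page_allows_ai_training(PAGE, "otherbot"))    # False
```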

3. Application Layer Response Header

Application Layer Response Headers are sent by a server in response to a browser’s request for a web page. The proposal suggests adding new rules to the application layer response headers for robots:

“DisallowAITraining – instructs the parser to not use the data for AI training language model.

AllowAITraining – instructs the parser that the data can be used for AI training language model.”
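A crawler-side check of such a header could look like the Python sketch below. The article does not name the header that would carry these rules, so the sketch borrows the existing X-Robots-Tag header name purely as a placeholder; that name, the function name, and the comma-separated value format are all assumptions.

```python
def header_allows_ai_training(headers: dict, header_name: str = "X-Robots-Tag") -> bool:
    """Return False if the response headers carry a DisallowAITraining rule.

    `header_name` is a placeholder assumption; HTTP header names are
    case-insensitive, so they are compared in lowercase.
    """
    for name, value in headers.items():
        if name.lower() != header_name.lower():
            continue
        tokens = {token.strip().lower() for token in value.split(",")}
        if "disallowaitraining" in tokens:
            return False
    return True


# Hypothetical response headers for a page that opts out of AI training.
response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "x-robots-tag": "noarchive, DisallowAITraining",
}

print(header_allows_ai_training(response_headers))                 # False
print(header_allows_ai_training({"Content-Type": "text/html"}))   # True
```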

Provides Greater Control

AI companies have been sued for using publicly available data, so far without success. They have asserted that crawling publicly available websites is fair use, comparable to what search engines have done for decades.

These new protocols give web publishers control over crawlers that collect training data, bringing those crawlers into alignment with search crawlers.

Read the proposal at the IETF:

Robots Exclusion Protocol Extension to manage AI content use

