The 32 AI Crawlers Every Site Owner Should Know About

Your robots.txt stopped being a Google-only document years ago. Today more than 30 named AI crawlers fetch the web for training, grounding, or live answering. Block the wrong one and you vanish from the engine that would have cited you. Allow them all and every one of those engines can find and cite your pages.

This is the full working list LLCrawler tracks, grouped by parent company.

OpenAI

User-Agent	Purpose
`GPTBot`	Training data for GPT models
`ChatGPT-User`	Live browsing from inside ChatGPT
`OAI-SearchBot`	ChatGPT Search index

Anthropic

User-Agent	Purpose
`ClaudeBot`	Training and Claude Search
`Claude-Web`	Legacy Claude fetcher
`anthropic-ai`	Legacy identifier, still honored

Google

User-Agent	Purpose
`Googlebot`	Core web index (feeds AI Overviews)
`Google-Extended`	Gemini / Vertex AI training opt-out lever
`Googlebot-News`	News-specific index
`Googlebot-Image`	Image index

Perplexity

User-Agent	Purpose
`PerplexityBot`	Index used for answers
`Perplexity-User`	Live fetch when a user follows a citation

Apple

User-Agent	Purpose
`Applebot`	Siri and Apple search
`Applebot-Extended`	Apple Intelligence training opt-out

Meta, ByteDance, Amazon, Others

User-Agent	Owner / purpose
`FacebookBot`	Meta training
`Meta-ExternalAgent`	Meta AI live fetch
`Meta-ExternalFetcher`	Meta AI link previews
`Bytespider`	ByteDance / Doubao
`Amazonbot`	Alexa and Amazon AI
`cohere-ai`	Cohere model grounding
`cohere-training-data-crawler`	Cohere training
`CCBot`	Common Crawl (feeds dozens of models)
`DuckAssistBot`	DuckDuckGo AI assist
`MistralAI-User`	Mistral Le Chat
`PanguBot`	Huawei Pangu

Secondary AI data sources

User-Agent	Purpose
`Diffbot`	Knowledge graph feeding multiple AI products
`ImagesiftBot`	Image AI training
`Omgilibot`	Public forum aggregator
`PiplBot`	People-data index
`Timpibot`	AI search startup
`YouBot`	You.com search
`Bingbot`	Feeds Microsoft Copilot

That is 32 distinct crawlers LLCrawler checks against your robots.txt.

The three policies that actually make sense

1. Allow everything (most sites): you want AI citations, not training opt-outs. This takes two lines.

User-agent: *
Allow: /

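2. Block training, allow answering: common for news and premium publishers. Allow the live-fetch and search agents (the ones whose names end in -User or -SearchBot, such as ChatGPT-User and OAI-SearchBot); block the training-only ones such as GPTBot, CCBot, and cohere-training-data-crawler. robots.txt has no partial wildcard for agent names, so each bot needs its own User-agent line.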

3. Block everything: only if you have a clear business reason. You are opting out of AI answer surfaces entirely.
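
Policy 2 can be written and sanity-checked in a few lines. The sketch below uses Python's standard urllib.robotparser. The two bot lists are illustrative picks from the tables above, build_policy and is_allowed are hypothetical helper names, and example.com is a placeholder. The standard-library parser only approximates how each real crawler reads the file, so treat the result as a first check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Illustrative picks from the tables above; extend both lists for your site.
TRAINING_BOTS = ["GPTBot", "CCBot", "cohere-training-data-crawler"]
ANSWERING_BOTS = ["ChatGPT-User", "OAI-SearchBot", "Perplexity-User"]


def build_policy(training_bots):
    """Return robots.txt text that blocks the given bots and allows all others."""
    # One group listing every blocked bot, followed by a catch-all allow group.
    lines = [f"User-agent: {bot}" for bot in training_bots]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, bot, url="https://example.com/article"):
    """Check one bot against robots.txt text using the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


policy = build_policy(TRAINING_BOTS)
print(policy)
for bot in TRAINING_BOTS + ANSWERING_BOTS:
    print(f"{bot}: {'allowed' if is_allowed(policy, bot) else 'blocked'}")
```

The printed policy is what you would save as robots.txt. The loop should report the three training bots as blocked and the three answering bots as allowed.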

How to check what you are blocking today

Paste your URL into LLCrawler. The AI Crawler Access section shows how many of the 32 tracked bots can reach your site and lists the ones you are blocking. A "32 / 32" result means every tracked engine can reach you. Combine that with a valid llms.txt and JSON-LD basics and you have the foundation for AI visibility.
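
If you want a rough local equivalent of that count, here is a minimal sketch using Python's standard urllib.robotparser. It is not how LLCrawler works internally (that is not documented here); crawler_access is a hypothetical helper, example.com is a placeholder, and the bot list is copied from the tables above. One caveat: the standard-library parser matches agent names by substring, so a rule for Googlebot also applies to Googlebot-News in this check, which may differ from how a given crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The 32 user agents listed in this article.
TRACKED_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Googlebot", "Google-Extended", "Googlebot-News", "Googlebot-Image",
    "PerplexityBot", "Perplexity-User",
    "Applebot", "Applebot-Extended",
    "FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Bytespider", "Amazonbot", "cohere-ai", "cohere-training-data-crawler",
    "CCBot", "DuckAssistBot", "MistralAI-User", "PanguBot",
    "Diffbot", "ImagesiftBot", "Omgilibot", "PiplBot",
    "Timpibot", "YouBot", "Bingbot",
]


def crawler_access(robots_txt, bots=TRACKED_BOTS, url="https://example.com/"):
    """Return (allowed_count, blocked_bots) for robots.txt text and a test URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [bot for bot in bots if not parser.can_fetch(bot, url)]
    return len(bots) - len(blocked), blocked


# Sample file that blocks two training crawlers and says nothing about the rest.
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
allowed, blocked = crawler_access(sample)
print(f"{allowed} / {len(TRACKED_BOTS)} allowed; blocked: {blocked}")
```

For the sample file this reports 30 of 32 allowed, with GPTBot and CCBot blocked. To test a live site, download its robots.txt and pass the text to crawler_access.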

Sources

platform.openai.com/docs/gptbot: official GPTBot / ChatGPT-User / OAI-SearchBot docs
support.anthropic.com: Anthropic's crawler policy
robotstxt.org: the base specification
commoncrawl.org: background on CCBot, whose crawl data is reused by many model builders