The 32 AI Crawlers Every Site Owner Should Know About
Your robots.txt stopped being a Google-only document years ago. Today 30+ named AI crawlers fetch the web for training, grounding, or live answering. Block the wrong one and you vanish from the engine that would have cited you. Allow them all and you get free distribution.
This is the full working list LLCrawler tracks, grouped by parent company.
OpenAI
| User-Agent | Purpose |
|---|---|
| GPTBot | Training data for GPT models |
| ChatGPT-User | Live browsing from inside ChatGPT |
| OAI-SearchBot | ChatGPT Search index |
Anthropic
| User-Agent | Purpose |
|---|---|
| ClaudeBot | Training and Claude Search |
| Claude-Web | Legacy Claude fetcher |
| anthropic-ai | Legacy identifier, still honored |
Google
| User-Agent | Purpose |
|---|---|
| Googlebot | Core web index (feeds AI Overviews) |
| Google-Extended | Gemini / Vertex AI training opt-out lever |
| Googlebot-News | News-specific index |
| Googlebot-Image | Image index |
Perplexity
| User-Agent | Purpose |
|---|---|
| PerplexityBot | Index used for answers |
| Perplexity-User | Live fetch when a user follows a citation |
Apple
| User-Agent | Purpose |
|---|---|
| Applebot | Siri and Apple search |
| Applebot-Extended | Apple Intelligence training opt-out |
Meta, ByteDance, Amazon, Others
| User-Agent | Owner / purpose |
|---|---|
| FacebookBot | Meta training |
| Meta-ExternalAgent | Meta AI live fetch |
| Meta-ExternalFetcher | Meta AI link previews |
| Bytespider | ByteDance / Doubao |
| Amazonbot | Alexa and Amazon AI |
| cohere-ai | Cohere model grounding |
| cohere-training-data-crawler | Cohere training |
| CCBot | Common Crawl (feeds dozens of models) |
| DuckAssistBot | DuckDuckGo AI assist |
| MistralAI-User | Mistral Le Chat |
| PanguBot | Huawei Pangu |
Secondary AI data sources
| User-Agent | Purpose |
|---|---|
| Diffbot | Knowledge graph feeding multiple AI products |
| ImagesiftBot | Image AI training |
| Omgilibot | Public forum aggregator |
| PiplBot | People-data index |
| Timpibot | AI search startup |
| YouBot | You.com search |
| Bingbot | Feeds Microsoft Copilot |
That is 32 distinct crawlers LLCrawler checks against your robots.txt.
The three policies that actually make sense
1. Allow everything (most sites): you want AI citations, not training opt-outs. This takes two lines:
User-agent: *
Allow: /
2. Block training, allow answering: common for news and premium publishers. Allow the live-fetch and search agents (the ones whose names end in -User or -SearchBot) and block the training-only ones, such as GPTBot, CCBot, and cohere-training-data-crawler.
3. Block everything: only if you have a clear business reason. You are opting out of AI answer surfaces entirely.
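Policy 2 is the one that is easiest to get subtly wrong, so here is a minimal Python sketch of it. It builds a robots.txt that blocks only the three training agents named above, then checks the result with the standard library's `urllib.robotparser`. The helper names `build_robots_txt` and `is_allowed` are illustrative, not part of LLCrawler, and the blocked list is only the examples from this article; extend it to match your own policy.

```python
from urllib.robotparser import RobotFileParser

# Training-only agents named in the article; an illustrative list, not exhaustive.
TRAINING_ONLY = ["GPTBot", "CCBot", "cohere-training-data-crawler"]


def build_robots_txt(blocked):
    """Return robots.txt text that blocks `blocked` and allows everyone else."""
    # Several User-agent lines may share one group of rules.
    lines = [f"User-agent: {agent}" for agent in blocked]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, path="/"):
    """Check whether `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


robots_txt = build_robots_txt(TRAINING_ONLY)
print(robots_txt)

for agent in ["GPTBot", "CCBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]:
    status = "allowed" if is_allowed(robots_txt, agent) else "blocked"
    print(f"{agent}: {status}")
```

With this file, GPTBot and CCBot come back as blocked while ChatGPT-User, OAI-SearchBot, and PerplexityBot stay allowed, which is the split policy 2 describes.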
How to check what you are blocking today
Paste your URL into LLCrawler. The AI Crawler Access section shows how many of the 32 tracked bots can reach your site, and lists the ones you are blocking. If you see a "32 / 32" result, you are well positioned for every major engine — combine that with a valid llms.txt and JSON-LD basics and you have the foundation for AI visibility.
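If you want a rough local approximation of that count, the sketch below runs the 32 user agents from the tables above against a robots.txt string using Python's `urllib.robotparser`. It is not how LLCrawler works internally (that is not documented here); the function name `crawler_access` and the sample file are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The 32 user agents listed in this article's tables.
TRACKED = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Googlebot", "Google-Extended", "Googlebot-News", "Googlebot-Image",
    "PerplexityBot", "Perplexity-User",
    "Applebot", "Applebot-Extended",
    "FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Bytespider", "Amazonbot", "cohere-ai", "cohere-training-data-crawler",
    "CCBot", "DuckAssistBot", "MistralAI-User", "PanguBot",
    "Diffbot", "ImagesiftBot", "Omgilibot", "PiplBot", "Timpibot",
    "YouBot", "Bingbot",
]


def crawler_access(robots_txt, agents=TRACKED, path="/"):
    """Return (number of agents allowed to fetch `path`, list of blocked agents)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [agent for agent in agents if not parser.can_fetch(agent, path)]
    return len(agents) - len(blocked), blocked


# Hypothetical robots.txt that blocks a single training crawler.
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

allowed, blocked = crawler_access(sample)
print(f"{allowed} / {len(TRACKED)}")  # prints "31 / 32"
print("Blocked:", blocked)            # prints "Blocked: ['GPTBot']"
```

To test a live site rather than a string, the same class offers `set_url()` and `read()` to fetch the file over HTTP. Python's parser matches user-agent names loosely (a rule for Googlebot also applies to Googlebot-News), and each crawler applies its own matching, so treat the number as an estimate rather than a guarantee.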
Sources
- platform.openai.com/docs/gptbot: official GPTBot / ChatGPT-User / OAI-SearchBot docs
- support.anthropic.com: Anthropic's crawler policy
- robotstxt.org: the base specification
- commoncrawl.org: background on CCBot, whose Common Crawl corpus feeds dozens of models