The 32 AI Crawlers Every Site Owner Should Know About

LLCrawler

Your robots.txt stopped being a Google-only document years ago. Today 30+ named AI crawlers fetch the web for training, grounding, or live answering. Block the wrong one and you vanish from the engine that would have cited you. Allow them all and you get free distribution.

This is the full working list LLCrawler tracks, grouped by parent company.

OpenAI

| User-Agent | Purpose |
| --- | --- |
| GPTBot | Training data for GPT models |
| ChatGPT-User | Live browsing from inside ChatGPT |
| OAI-SearchBot | ChatGPT Search index |

Anthropic

| User-Agent | Purpose |
| --- | --- |
| ClaudeBot | Training and Claude Search |
| Claude-Web | Legacy Claude fetcher |
| anthropic-ai | Legacy identifier, still honored |

Google

| User-Agent | Purpose |
| --- | --- |
| Googlebot | Core web index (feeds AI Overviews) |
| Google-Extended | Gemini / Vertex AI training opt-out lever |
| Googlebot-News | News-specific index |
| Googlebot-Image | Image index |

Perplexity

| User-Agent | Purpose |
| --- | --- |
| PerplexityBot | Index used for answers |
| Perplexity-User | Live fetch triggered by a user's question |

Apple

| User-Agent | Purpose |
| --- | --- |
| Applebot | Siri and Apple search |
| Applebot-Extended | Apple Intelligence training opt-out |

Meta, ByteDance, Amazon, Others

| User-Agent | Owner / Purpose |
| --- | --- |
| FacebookBot | Meta training |
| Meta-ExternalAgent | Meta AI training and content indexing |
| Meta-ExternalFetcher | Meta AI user-initiated link fetches |
| Bytespider | ByteDance / Doubao |
| Amazonbot | Alexa and Amazon AI |
| cohere-ai | Cohere model grounding |
| cohere-training-data-crawler | Cohere training |
| CCBot | Common Crawl (feeds dozens of models) |
| DuckAssistBot | DuckDuckGo AI assist |
| MistralAI-User | Mistral Le Chat |
| PanguBot | Huawei Pangu |

Secondary AI data sources

| User-Agent | Purpose |
| --- | --- |
| Diffbot | Knowledge graph feeding multiple AI products |
| ImagesiftBot | Image AI training |
| Omgilibot | Public forum aggregator |
| PiplBot | People-data index |
| Timpibot | AI search startup |
| YouBot | You.com search |
| Bingbot | Feeds Microsoft Copilot |

That is 32 distinct crawlers LLCrawler checks against your robots.txt.

The three policies that actually make sense

1. Allow everything (most sites): you want AI citations, not training opt-outs. This takes two lines:

User-agent: *
Allow: /

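2. Block training, allow answering: common for news and premium publishers. Allow the live-fetch and search agents (those whose names end in -User or -SearchBot, such as ChatGPT-User and OAI-SearchBot) and block the training-only ones, such as GPTBot, CCBot, and cohere-training-data-crawler. List each agent by its full name: robots.txt does not support partial wildcards in a User-agent value, so "*-User" is shorthand here, not valid syntax.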

3. Block everything: only if you have a clear business reason. You are opting out of AI answer surfaces entirely.
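Policy 2 can be sketched in a few lines of Python. This is a minimal illustration, not LLCrawler's implementation: `build_policy` and `is_allowed` are hypothetical helper names, the training-bot list holds only the three examples named above, and the check relies on Python's standard-library `urllib.robotparser`, whose matching rules may differ from how a given crawler actually reads the file.

```python
from urllib.robotparser import RobotFileParser

# Training-only crawlers named in the article; extend this list as needed.
TRAINING_BOTS = ["GPTBot", "CCBot", "cohere-training-data-crawler"]


def build_policy(training_bots: list[str]) -> str:
    """Return robots.txt text that blocks the given bots and allows everyone else."""
    # Several User-agent lines may share one rule group.
    lines = [f"User-agent: {bot}" for bot in training_bots]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Check a user agent against robots.txt text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


policy = build_policy(TRAINING_BOTS)
print(policy)
for agent in ["GPTBot", "CCBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]:
    print(agent, "allowed" if is_allowed(policy, agent) else "blocked")
```

Generating the file from a list keeps the blocked set in one place, so adding another training crawler is a one-line change.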

How to check what you are blocking today

Paste your URL into LLCrawler. The AI Crawler Access section shows how many of the 32 tracked bots can reach your site and lists the ones you are blocking. If you see a "32 / 32" result, you are well positioned for every major engine. Combine that with a valid llms.txt and basic JSON-LD markup and you have the foundation for AI visibility.
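For a rough local version of the same check, the sketch below counts how many user agents a given robots.txt lets through. It assumes Python's standard-library `urllib.robotparser`; `blocked_bots`, `TRACKED_BOTS`, and the `SAMPLE` file are hypothetical names and data made up for illustration, and the bot list is only a subset of the 32 above. It is not how LLCrawler itself works.

```python
from urllib.robotparser import RobotFileParser

# A subset of the user agents listed in this article; add the rest as needed.
TRACKED_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Google-Extended", "PerplexityBot", "Applebot-Extended",
    "Bytespider", "CCBot", "Bingbot",
]


def blocked_bots(robots_txt: str, bots: list[str], path: str = "/") -> list[str]:
    """Return the bots that the given robots.txt text forbids from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


# Hypothetical robots.txt, inlined so the example runs offline.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

blocked = blocked_bots(SAMPLE, TRACKED_BOTS)
print(f"{len(TRACKED_BOTS) - len(blocked)} / {len(TRACKED_BOTS)} can reach the site")
print("Blocked:", ", ".join(blocked))
```

To check a live site instead of an inline string, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network before `can_fetch()` is called.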
