The Blanket Blocking Mistake
After OpenAI released GPTBot documentation in 2023, a significant wave of websites added blanket AI bot blocks to their robots.txt files. The intent was to protect content from being used in AI training without compensation. The unintended consequence was destroying AI search visibility, because blanket rules also shut out the retrieval agents that AI platforms use for real-time citation.
In our audit of 80+ enterprise websites, 23% had robots.txt configurations that blocked at least one major AI citation crawler. Of those, 91% were receiving zero citations from the blocked platform. The correlation is direct: if you block the crawler a platform uses for retrieval (for ChatGPT search, OpenAI's documentation names OAI-SearchBot), that platform cannot retrieve or cite your content, regardless of how well-optimized it is.
AI Crawler User Agents Reference
This is a reference list of AI crawler user agent tokens as of March 2026. The landscape evolves rapidly, so verify each entry against the vendor's official documentation, especially when new crawlers emerge:
- GPTBot - OpenAI's crawler for collecting content that may be used to train its models, per OpenAI's crawler documentation
- OAI-SearchBot - OpenAI's search crawler; used to surface and link sites in ChatGPT search results
- Google-Extended - Google's robots.txt control token for Gemini training and grounding; not a separate crawler, and independent of Googlebot and Google Search indexing
- PerplexityBot - Perplexity AI's primary crawler
- anthropic-ai - an Anthropic token found in older robots.txt guidance; confirm its current status in Anthropic's documentation
- ClaudeBot - Anthropic's documented web crawler
- YouBot - You.com's AI search crawler
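- Applebot-Extended - Apple's robots.txt token for opting content out of AI training; it does not crawl on its own, since fetching is done by Applebot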
- Bytespider - ByteDance's crawler (used for AI training; minimal citation relevance currently)
The Correct Configuration for AI Search Visibility
Which configuration is right depends on whether you want maximum AI search visibility, control over training data exposure, or both:
To allow all AI citation crawlers (recommended if you want AI search visibility): Make no specific entries for the AI user agents above — an absent rule defaults to allow. Verify there's no wildcard User-agent: * rule with a broad Disallow that would catch these bots.
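To allow citation retrieval but opt out of AI training: This requires platform-specific action. For OpenAI, the crawler documentation describes GPTBot as the training crawler and OAI-SearchBot as the search crawler, so the two can be given different robots.txt rules. For Google, the Google-Extended token in robots.txt controls Gemini training use without affecting Google Search. Avoid the noindex tag for this purpose: it removes the page from search results altogether rather than opting it out of training. Check each platform's webmaster documentation for its current mechanism before relying on any of these.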
To block specific crawlers entirely: Use a specific User-agent rule:
- Add User-agent: [CrawlerName] followed by Disallow: / on the next line
- Be precise: only block crawlers you've made a deliberate, informed decision to exclude
- Document your reasoning; robots.txt decisions compound over time and get forgotten
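A quick way to see how a given robots.txt treats each crawler is to run it through a parser. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the example.com URL, and the check_agents helper are all hypothetical and only illustrate the "absent rule defaults to allow" behavior and a named block. Python's parser is one implementation, so real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler blocked by name, a narrow wildcard rule,
# and no entries at all for the other AI user agents.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Agent names taken from the reference list above.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot", "Bytespider"]


def check_agents(robots_txt, agents, url="https://example.com/blog/post"):
    """Return {agent: True/False} for whether each agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


results = check_agents(ROBOTS_TXT, AI_AGENTS)
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")

# A blanket rule, by contrast, catches every agent in the list.
blanket = check_agents("User-agent: *\nDisallow: /\n", AI_AGENTS)
```

Running the same check against your live file (read it from disk or fetch it first) before and after each change gives a simple regression test for accidental blocks.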
What You Should Actually Block
Not all AI crawlers are equal. Some provide citation value; others primarily consume bandwidth without providing visibility in return. Here's a reasonable blocking framework:
- Allow by default: GPTBot, OAI-SearchBot, PerplexityBot, Google-Extended, anthropic-ai, ClaudeBot. These feed the AI platforms your audience uses, so block one only as a deliberate, documented training opt-out
- Block if bandwidth is a concern: Bytespider, Diffbot, CCBot (Common Crawl) — these crawl heavily for training data with limited citation value currently
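- Block always: Scrapers without documented user agents; bots making unusually rapid requests; user agents that don't appear in official documentation from any legitimate AI company. Because robots.txt is advisory and such bots often ignore it, enforce these blocks at the server or firewall level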
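As an illustration of the bandwidth tier, the following sketch generates the per-crawler rules with the reasoning recorded as a comment, then parses the result to confirm that only the listed agents are affected. The render_block_rules helper and the comment text are hypothetical; the agent names come from the framework above.

```python
from urllib.robotparser import RobotFileParser

# Crawlers from the "block if bandwidth is a concern" tier above.
BANDWIDTH_TIER = ["Bytespider", "Diffbot", "CCBot"]


def render_block_rules(agents, reason):
    """Build robots.txt text with one Disallow-all group per agent."""
    lines = [f"# {reason}"]  # record why the block exists
    for agent in agents:
        lines.extend([f"User-agent: {agent}", "Disallow: /", ""])
    return "\n".join(lines)


rules = render_block_rules(
    BANDWIDTH_TIER,
    "Blocked to save bandwidth; revisit if citation value changes",
)
print(rules)

# Sanity check: listed agents are blocked, unlisted agents stay allowed.
parser = RobotFileParser()
parser.parse(rules.splitlines())
blocked_ok = all(
    not parser.can_fetch(agent, "https://example.com/") for agent in BANDWIDTH_TIER
)
gptbot_ok = parser.can_fetch("GPTBot", "https://example.com/page")
```

Keeping the reason in a comment next to the rule is one way to stop a blocking decision from being forgotten later.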
Testing Your Configuration
After any robots.txt change, verify your configuration doesn't inadvertently block important crawlers:
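- Google Search Console's robots.txt report (which replaced the older robots.txt Tester) shows whether Google can fetch and parse your robots.txt and flags errors; to test specific user agents and URLs, run the file through a robots.txt parser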
- Server log analysis: After making changes, monitor your logs for 2–4 weeks. If GPTBot or PerplexityBot visits drop to zero, your new configuration may have accidentally blocked them
- Manual verification: Use the robots.txt specification tester at robotstxt.checker.org to validate syntax
- Citation monitoring: Query your target keywords in ChatGPT and Perplexity 4–6 weeks after any configuration change to verify your pages are still appearing
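The server log check can be sketched in a few lines. This example assumes access logs in the common "combined" format, where the user agent is the last double-quoted field; the sample log lines, IP addresses, and user agent strings are invented for illustration, and count_ai_hits is a hypothetical helper. Note that user agent strings can be spoofed, so counts are an indicator rather than proof.

```python
import re
from collections import Counter

# Crawlers to watch; names taken from the reference list above.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"]

# In combined log format the user agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines, agents=AI_AGENTS):
    """Count requests per AI crawler by substring match on the user agent."""
    counts = Counter({agent: 0 for agent in agents})
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for agent in agents:
            if agent.lower() in user_agent:
                counts[agent] += 1
    return counts


# Made-up sample lines; real user agent strings will differ.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /blog/a HTTP/1.1" 200 5120 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Mar/2026:10:05:00 +0000] "GET /blog/b HTTP/1.1" 200 4096 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Mar/2026:11:00:00 +0000] "GET /blog/a HTTP/1.1" 200 5120 "-" "ExampleFetcher (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [01/Mar/2026:12:00:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = count_ai_hits(SAMPLE_LOG)
silent = [agent for agent in AI_AGENTS if counts[agent] == 0]
print(dict(counts))
print("No visits from:", silent)
```

Comparing the counts for the weeks before and after a robots.txt change makes a drop to zero easy to spot.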
