robots.txt for AI Crawlers — GPTBot, PerplexityBot, ClaudeBot
The AI Crawlers You Need to Allow
Each major AI system operates its own web crawler with a distinct user agent string. Blocking a crawler removes your content from that AI platform entirely — not just from training, but from search results and citation. Understanding which crawler powers which product lets you make informed access decisions.
| Crawler | User Agent String | Powers | Default Behavior |
|---|---|---|---|
| GPTBot | GPTBot | ChatGPT training data and web browsing | Respects robots.txt |
| OAI-SearchBot | OAI-SearchBot | OpenAI search results specifically | Respects robots.txt |
| PerplexityBot | PerplexityBot | Perplexity AI answer citations | Respects robots.txt |
| ClaudeBot | ClaudeBot | Anthropic Claude training and responses | Respects robots.txt |
| Google-Extended | Google-Extended | Google AI Overviews and Gemini training | Respects robots.txt |
| Bingbot | bingbot/2.0 | Bing search and Bing Copilot answers | Respects robots.txt |
Note the distinction between GPTBot and OAI-SearchBot. OpenAI separated these in 2024 to let publishers allow search citations while blocking training data usage. If you want ChatGPT to cite your content in its search feature but do not want your content used for model training, allow OAI-SearchBot and block GPTBot. This granularity is available only because OpenAI maintains two separate crawlers.
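For example, a minimal robots.txt fragment that permits OpenAI's search crawler while opting out of training collection looks like this (a sketch; combine it with your other rules rather than using it standalone):

```
# Allow search citations, block training data collection
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```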
A Complete robots.txt for AEO
The following robots.txt allows all major AI crawlers to access your content while blocking access to administrative, private, and non-content paths. Place this file at the root of your domain — it must be accessible at yoursite.com/robots.txt.
# =============================================
# robots.txt — AEO-optimized configuration
# Allow AI crawlers to access content pages
# Block admin, API, and non-content paths
# =============================================
# Default: allow all crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /_next/ # Next.js build artifacts
Disallow: /static/ # Static assets that are not content
# GPTBot — powers ChatGPT web browsing and training
User-agent: GPTBot
Allow: /
# OAI-SearchBot — powers OpenAI search results
User-agent: OAI-SearchBot
Allow: /
# PerplexityBot — powers Perplexity AI answers
User-agent: PerplexityBot
Allow: /
# ClaudeBot — powers Anthropic Claude
User-agent: ClaudeBot
Allow: /
# Google-Extended — powers AI Overviews and Gemini
User-agent: Google-Extended
Allow: /
# Bingbot — powers Bing search and Bing Copilot
User-agent: bingbot
Allow: /
# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml
Explicit Allow directives for each AI crawler are not strictly necessary when the default rule already allows all agents. However, listing them explicitly serves two purposes: it makes your intention unambiguous to anyone reading the file, and it provides a named target you can change to Disallow if you later decide to block a specific crawler. Defensive clarity is worth the extra lines.
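For instance, opting a single crawler out later is a one-directive change under its existing block, with every other rule left untouched:

```
# Example: later revoking access for one crawler only
User-agent: ClaudeBot
Disallow: /
```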
The Cloudflare Problem — The Most Common Silent AEO Failure
Cloudflare Bot Fight Mode has been the single most common reason that correctly configured robots.txt files still result in blocked AI crawlers. Since 2024, Cloudflare has enabled bot management features by default on many plan tiers. These features identify GPTBot, PerplexityBot, ClaudeBot, and other AI crawlers as automated traffic and return 403 Forbidden responses before the request ever reaches your server — or your robots.txt file.
The failure is silent. Your robots.txt says Allow. Your content is well-optimized. But no AI system can access it because Cloudflare intercepts the request at the CDN edge. You will not see any error in your application logs because the request never arrives at your application. The only symptoms are absence — your content does not appear in AI search results, and you have no idea why.
To detect this: open the Cloudflare dashboard, navigate to Security, then Bots. Check whether Bot Fight Mode is enabled. Review your WAF rules for any pattern that blocks user agents matching known AI crawler strings. Then verify with a direct test — use curl to simulate a GPTBot request and check the response code. A 200 means the crawler can reach your content. A 403 means Cloudflare is blocking it regardless of your robots.txt.
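One way to script that comparison is a small helper that contrasts the status code a browser user agent receives with the one a crawler user agent receives. This is a sketch: the function name is ours, and yoursite.com is a placeholder.

```shell
# Decide whether a block is user-agent-based. A browser UA getting 200
# while a crawler UA gets anything else points at Bot Fight Mode or a
# WAF rule, not at robots.txt (robots.txt cannot produce a 403 by itself).
compare_codes() {
  browser_code=$1
  crawler_code=$2
  if [ "$browser_code" = "200" ] && [ "$crawler_code" != "200" ]; then
    echo "UA-based block: check Bot Fight Mode and WAF rules"
  else
    echo "no UA-specific block detected"
  fi
}

# Usage (requires network; yoursite.com is a placeholder):
# b=$(curl -A "Mozilla/5.0" -s -o /dev/null -w "%{http_code}" https://yoursite.com/)
# c=$(curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/)
# compare_codes "$b" "$c"
```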
Testing Whether AI Crawlers Can Actually Access Your Content
Do not assume that a correct robots.txt means crawlers can access your site. Test it directly. The following workflow confirms whether AI crawlers receive a 200 response with your actual content, not a block page or challenge screen.
# Test GPTBot access — should return 200
curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# Test PerplexityBot access
curl -A "PerplexityBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# Test ClaudeBot access
curl -A "ClaudeBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# If any return 403, Cloudflare or your WAF is blocking
# If any return 503, you may be getting a challenge page
# Verify the actual response content (not just status code)
curl -A "GPTBot" -s https://yoursite.com/ | head -50
A 200 status code alone is not sufficient verification. Some WAF configurations return a 200 with a JavaScript challenge page instead of your actual content. The last command in the sequence checks the actual response body. If you see your HTML content, the crawler can access it. If you see a Cloudflare challenge page or an empty response, access is blocked despite the 200 status.
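To automate the body check, a small filter can flag markers that commonly appear in Cloudflare-style challenge pages. The two marker strings below are assumptions based on commonly observed challenge markup, not an exhaustive or guaranteed list; adjust them to match what your WAF actually serves.

```shell
# Reads an HTML response on stdin; exits 0 if it looks like a
# Cloudflare-style challenge page rather than real content.
is_challenge_page() {
  grep -Eqi 'challenge-platform|cf-browser-verification'
}

# Usage (requires network; yoursite.com is a placeholder):
# if curl -A "GPTBot" -s https://yoursite.com/ | is_challenge_page; then
#   echo "BLOCKED: challenge page served despite the status code"
# else
#   echo "OK: crawler receives real content"
# fi
```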
For ongoing monitoring, check your server access logs for requests from AI crawler user agents. If you see Googlebot but never GPTBot, something is blocking GPTBot before it reaches your server. Google Search Console also reports crawl issues, but it only covers Googlebot — it will not show you GPTBot or PerplexityBot access problems.
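A quick way to run that log check from a shell, assuming an nginx-style access log (the log path in the usage comment is a placeholder):

```shell
# Tallies requests per AI crawler from log lines on stdin, so you can
# see at a glance which crawlers actually reach your origin.
count_ai_crawlers() {
  grep -oE 'GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|Google-Extended|bingbot' \
    | sort | uniq -c | sort -rn
}

# Usage:
# count_ai_crawlers < /var/log/nginx/access.log
```

A crawler that appears in robots.txt but never in this tally is being stopped upstream of your server.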
The Difference Between robots.txt and Actual Crawler Access
Robots.txt is a protocol-level request — it tells crawlers what you want them to do. It has no enforcement mechanism. A well-behaved crawler reads robots.txt and respects it. A misbehaving crawler ignores it. But the larger issue for AEO practitioners is the opposite direction: your robots.txt may welcome crawlers that your infrastructure actively blocks.
Access control happens at multiple layers, and a block at any layer overrides a permission at another. Your robots.txt allows GPTBot. Your Cloudflare WAF blocks it. GPTBot is blocked. Your robots.txt allows PerplexityBot. Your hosting provider rate-limits unknown user agents. PerplexityBot gets throttled into uselessness. Both layers — the policy layer (robots.txt) and the enforcement layer (CDN, WAF, server config) — must be aligned.
| Symptom | Root Cause | Fix |
|---|---|---|
| robots.txt allows GPTBot but it never crawls | Cloudflare Bot Fight Mode intercepting the request | Disable Bot Fight Mode or add a WAF allow rule for GPTBot user agent |
| AI crawlers get 403 responses | WAF rule blocking non-browser user agents | Create an allow rule for AI crawler user agents above the blocking rule |
| AI crawlers get 503 responses | Cloudflare JavaScript challenge being served | Add AI crawler IPs to the Cloudflare allowlist or reduce security level for known bots |
| Content appears in Google but not in ChatGPT | GPTBot blocked while Googlebot is allowed | Ensure GPTBot and OAI-SearchBot have explicit Allow directives and no server-side blocks |
| Crawlers get 200 but with empty or wrong content | JavaScript rendering required but crawler does not execute JS | Implement server-side rendering or pre-rendering for content pages |
| Intermittent access — sometimes 200, sometimes 403 | Rate limiting triggering on crawler request frequency | Increase rate limit thresholds for known AI crawler user agents |
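As a concrete starting point for the WAF fixes above, a Cloudflare custom rule can match AI crawler user agents and skip the blocking features. The sketch below uses Cloudflare's rule-expression syntax; verify the field name and available Skip options against your plan's dashboard. Note that user agent strings can be spoofed, so an allow rule this broad trades some bot protection for crawler access.

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "PerplexityBot")
or (http.user_agent contains "ClaudeBot")
```

Set the rule's action to Skip, select the bot-protection features to bypass, and place it above any rule that blocks automated traffic.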