robots.txt for AI Crawlers — GPTBot, PerplexityBot, ClaudeBot

Robert McDonough · Web Content Architect & AEO Systems Builder
Direct Answer
A robots.txt optimized for AEO explicitly allows AI crawlers including GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Google-Extended, and Bingbot. However, robots.txt alone is not enough — Cloudflare Bot Fight Mode and WAF rules can block these crawlers at the server level regardless of robots.txt directives. Both the robots.txt file and server-side access controls must permit AI crawlers for your content to be cited.

The AI Crawlers You Need to Allow

Each major AI system operates its own web crawler with a distinct user agent string. Blocking a crawler removes your content from that AI platform entirely — not just from training, but from search results and citation. Understanding which crawler powers which product lets you make informed access decisions.

AI crawlers, their user agent strings, and the products they power

Crawler         | User Agent String | Powers                                                          | Default Behavior
GPTBot          | GPTBot            | ChatGPT training data and web browsing                          | Respects robots.txt
OAI-SearchBot   | OAI-SearchBot     | OpenAI search results specifically                              | Respects robots.txt
PerplexityBot   | PerplexityBot     | Perplexity AI answer citations                                  | Respects robots.txt
ClaudeBot       | ClaudeBot         | Anthropic Claude training and responses                         | Respects robots.txt
Google-Extended | Google-Extended   | Gemini training (a control token; does not govern AI Overviews) | Respects robots.txt
Bingbot         | bingbot/2.0       | Bing search and Bing Copilot answers                            | Respects robots.txt

Note the distinction between GPTBot and OAI-SearchBot. OpenAI separated these in 2024 to let publishers allow search citations while blocking training data usage. If you want ChatGPT to cite your content in its search feature but do not want your content used for model training, allow OAI-SearchBot and block GPTBot. This granularity is available only because OpenAI maintains two separate crawlers.
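In robots.txt terms, that split is just two stanzas. A minimal sketch (combine with the rest of your configuration):

```text
# Allow ChatGPT search citations, block training collection
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```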

A Complete robots.txt for AEO

The following robots.txt allows all major AI crawlers to access your content while blocking access to administrative, private, and non-content paths. Place this file at the root of your domain — it must be accessible at yoursite.com/robots.txt.

robots.txt — full AEO configuration
# =============================================
# robots.txt — AEO-optimized configuration
# Allow AI crawlers to access content pages
# Block admin, API, and non-content paths
# =============================================

# Default: allow all crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /_next/      # Next.js build artifacts
Disallow: /static/     # Static assets that are not content

# GPTBot — powers ChatGPT web browsing and training
User-agent: GPTBot
Allow: /

# OAI-SearchBot — powers OpenAI search results
User-agent: OAI-SearchBot
Allow: /

# PerplexityBot — powers Perplexity AI answers
User-agent: PerplexityBot
Allow: /

# ClaudeBot — powers Anthropic Claude
User-agent: ClaudeBot
Allow: /

# Google-Extended — control token for Gemini training (fetching is done by Googlebot)
User-agent: Google-Extended
Allow: /

# Bingbot — powers Bing search and Bing Copilot
User-agent: bingbot
Allow: /

# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml

Explicit Allow directives for each AI crawler are not strictly necessary when the default rule already allows all agents. However, listing them explicitly serves two purposes: it makes your intention unambiguous to anyone reading the file, and it provides a named target you can change to Disallow if you later decide to block a specific crawler. Defensive clarity is worth the extra lines.

The Cloudflare Problem — The Most Common Silent AEO Failure

Cloudflare Bot Fight Mode has been the single most common reason that correctly configured robots.txt files still result in blocked AI crawlers. Since 2024, Cloudflare has enabled bot management features by default on many plan tiers. These features identify GPTBot, PerplexityBot, ClaudeBot, and other AI crawlers as automated traffic and return 403 Forbidden responses before the request ever reaches your server — or your robots.txt file.

The failure is silent. Your robots.txt says Allow. Your content is well-optimized. But no AI system can access it because Cloudflare intercepts the request at the CDN edge. You will not see any error in your application logs because the request never arrives at your application. The only symptoms are absence — your content does not appear in AI search results, and you have no idea why.

To detect this: open the Cloudflare dashboard, navigate to Security, then Bots. Check whether Bot Fight Mode is enabled. Review your WAF rules for any pattern that blocks user agents matching known AI crawler strings. Then verify with a direct test — use curl to simulate a GPTBot request and check the response code. A 200 means the crawler can reach your content. A 403 means Cloudflare is blocking it regardless of your robots.txt.

Testing Whether AI Crawlers Can Actually Access Your Content

Do not assume that a correct robots.txt means crawlers can access your site. Test it directly. The following workflow confirms whether AI crawlers receive a 200 response with your actual content, not a block page or challenge screen.

Testing AI crawler access with curl
# Test GPTBot access — should return 200
curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/

# Test PerplexityBot access
curl -A "PerplexityBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/

# Test ClaudeBot access
curl -A "ClaudeBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/

# If any return 403, Cloudflare or your WAF is blocking
# If any return 503, you may be getting a challenge page

# Verify the actual response content (not just status code)
curl -A "GPTBot" -s https://yoursite.com/ | head -50

A 200 status code alone is not sufficient verification. Some WAF configurations return a 200 with a JavaScript challenge page instead of your actual content. The last command in the sequence checks the actual response body. If you see your HTML content, the crawler can access it. If you see a Cloudflare challenge page or an empty response, access is blocked despite the 200 status.
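The status-plus-body check can be automated. The sketch below classifies a fetch as blocked, challenged, or accessible; the challenge markers are strings commonly seen in Cloudflare interstitial pages, but treat them as illustrative assumptions rather than an exhaustive list:

```python
# Classify a crawler fetch by status code and response body.
# CHALLENGE_MARKERS are assumed/common Cloudflare interstitial strings,
# not an official or complete list.
CHALLENGE_MARKERS = ("just a moment", "challenge-platform", "cf-chl")

def classify_response(status: int, body: str) -> str:
    """Return 'blocked', 'challenge', or 'accessible' for a crawler fetch."""
    if status in (401, 403):
        return "blocked"
    lowered = body.lower()
    if status in (429, 503) or any(m in lowered for m in CHALLENGE_MARKERS):
        return "challenge"
    if status == 200 and lowered.strip():
        return "accessible"
    return "blocked"
```

Feed it the status code and body from the curl commands above: a 200 with your real HTML classifies as accessible, while a 200 that carries a challenge-page marker is correctly flagged as a challenge.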

For ongoing monitoring, check your server access logs for requests from AI crawler user agents. If you see Googlebot but never GPTBot, something is blocking GPTBot before it reaches your server. Google Search Console also reports crawl issues, but it only covers Googlebot — it will not show you GPTBot or PerplexityBot access problems.
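A quick way to do that log check is to count lines whose user agent field mentions a known AI crawler. This sketch assumes a standard combined-format access log where the user agent appears somewhere in each line; adjust the crawler list to match the bots you care about:

```python
from collections import Counter

# User agent substrings to look for; extend as new crawlers appear.
AI_CRAWLERS = ("GPTBot", "OAI-SearchBot", "PerplexityBot",
               "ClaudeBot", "Google-Extended", "bingbot")

def count_ai_crawler_hits(log_lines):
    """Count access-log lines mentioning each known AI crawler (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits
```

Run it over your log file, e.g. `count_ai_crawler_hits(open("/var/log/nginx/access.log"))`. A zero count for GPTBot alongside healthy Googlebot traffic is the signature of an edge-level block.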

The Difference Between robots.txt and Actual Crawler Access

Robots.txt is a protocol-level request — it tells crawlers what you want them to do. It has no enforcement mechanism. A well-behaved crawler reads robots.txt and respects it. A misbehaving crawler ignores it. But the larger issue for AEO practitioners is the opposite direction: your robots.txt may welcome crawlers that your infrastructure actively blocks.

Access control happens at multiple layers, and a block at any layer overrides a permission at another. Your robots.txt allows GPTBot. Your Cloudflare WAF blocks it. GPTBot is blocked. Your robots.txt allows PerplexityBot. Your hosting provider rate-limits unknown user agents. PerplexityBot gets throttled into uselessness. Both layers — the policy layer (robots.txt) and the enforcement layer (CDN, WAF, server config) — must be aligned.

Common blocking patterns and how to fix them

Symptom | Root Cause | Fix
robots.txt allows GPTBot but it never crawls | Cloudflare Bot Fight Mode intercepting the request | Disable Bot Fight Mode or add a WAF allow rule for the GPTBot user agent
AI crawlers get 403 responses | WAF rule blocking non-browser user agents | Create an allow rule for AI crawler user agents above the blocking rule
AI crawlers get 503 responses | Cloudflare JavaScript challenge being served | Add AI crawler IPs to the Cloudflare allowlist or reduce the security level for known bots
Content appears in Google but not in ChatGPT | GPTBot blocked while Googlebot is allowed | Ensure GPTBot and OAI-SearchBot have explicit Allow directives and no server-side blocks
Crawlers get 200 but with empty or wrong content | JavaScript rendering required but crawler does not execute JS | Implement server-side rendering or pre-rendering for content pages
Intermittent access (sometimes 200, sometimes 403) | Rate limiting triggering on crawler request frequency | Increase rate limit thresholds for known AI crawler user agents
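For the WAF fixes above, Cloudflare custom rules are written in its filter expression language. A rule that exempts AI crawlers might look like the following sketch (the `http.user_agent` field is Cloudflare's; verify the exact actions available on your plan):

```text
(http.user_agent contains "GPTBot")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "PerplexityBot")
or (http.user_agent contains "ClaudeBot")
```

Pair the expression with a Skip action ordered above any blocking rules. Because user agent strings can be spoofed, consider also validating requests against the IP ranges the crawler operators publish.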
