robots.txt for AI Crawlers — GPTBot, PerplexityBot, ClaudeBot
The AI Crawlers You Need to Allow
Each major AI system operates its own web crawler with a distinct user agent string. Blocking a crawler removes your content from that AI platform entirely — not just from training, but from search results and citation. Understanding which crawler powers which product lets you make informed access decisions.
| Crawler | User Agent String | Powers | Default Behavior |
|---|---|---|---|
| GPTBot | GPTBot | ChatGPT training data and web browsing | Respects robots.txt |
| OAI-SearchBot | OAI-SearchBot | OpenAI search results specifically | Respects robots.txt |
| PerplexityBot | PerplexityBot | Perplexity AI answer citations | Respects robots.txt |
| ClaudeBot | ClaudeBot | Anthropic Claude training and responses | Respects robots.txt |
| Google-Extended | Google-Extended | Google AI Overviews and Gemini training | Respects robots.txt |
| Bingbot | bingbot/2.0 | Bing search and Bing Copilot answers | Respects robots.txt |
Note the distinction between GPTBot and OAI-SearchBot. OpenAI separated these in 2024 to let publishers allow search citations while blocking training data usage. If you want ChatGPT to cite your content in its search feature but do not want your content used for model training, allow OAI-SearchBot and block GPTBot. This granularity is available only because OpenAI maintains two separate crawlers.
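For example, a minimal robots.txt fragment that permits OpenAI's search crawler while opting out of training collection looks like this (a sketch; combine it with your other rules rather than using it standalone):

```
# Allow search citations, block training data collection
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```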
A Complete robots.txt for AEO
The following robots.txt allows all major AI crawlers to access your content while blocking access to administrative, private, and non-content paths. Place this file at the root of your domain — it must be accessible at yoursite.com/robots.txt.
# =============================================
# robots.txt — AEO-optimized configuration
# Allow AI crawlers to access content pages
# Block admin, API, and non-content paths
# =============================================
# Default: allow all crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /_next/ # Next.js build artifacts
Disallow: /static/ # Static assets that are not content
# GPTBot — powers ChatGPT web browsing and training
User-agent: GPTBot
Allow: /
# OAI-SearchBot — powers OpenAI search results
User-agent: OAI-SearchBot
Allow: /
# PerplexityBot — powers Perplexity AI answers
User-agent: PerplexityBot
Allow: /
# ClaudeBot — powers Anthropic Claude
User-agent: ClaudeBot
Allow: /
# Google-Extended — powers AI Overviews and Gemini
User-agent: Google-Extended
Allow: /
# Bingbot — powers Bing search and Bing Copilot
User-agent: bingbot
Allow: /
# Sitemap location
Sitemap: https://yoursite.com/sitemap.xml
Explicit Allow directives for each AI crawler are not strictly necessary when the default rule already allows all agents. However, listing them explicitly serves two purposes: it makes your intention unambiguous to anyone reading the file, and it provides a named target you can change to Disallow if you later decide to block a specific crawler. Defensive clarity is worth the extra lines.
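For instance, opting a single crawler out later is a one-directive change under its existing block, with every other rule left untouched:

```
# Example: later revoking access for one crawler only
User-agent: ClaudeBot
Disallow: /
```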
The Cloudflare Problem — The Most Common Silent AEO Failure
Cloudflare Bot Fight Mode has been the single most common reason that correctly configured robots.txt files still result in blocked AI crawlers. Since 2024, Cloudflare has enabled bot management features by default on many plan tiers. These features identify GPTBot, PerplexityBot, ClaudeBot, and other AI crawlers as automated traffic and return 403 Forbidden responses before the request ever reaches your server — or your robots.txt file.
The failure is silent. Your robots.txt says Allow. Your content is well-optimized. But no AI system can access it because Cloudflare intercepts the request at the CDN edge. You will not see any error in your application logs because the request never arrives at your application. The only symptoms are absence — your content does not appear in AI search results, and you have no idea why.
To detect this: open the Cloudflare dashboard, navigate to Security, then Bots. Check whether Bot Fight Mode is enabled. Review your WAF rules for any pattern that blocks user agents matching known AI crawler strings. Then verify with a direct test — use curl to simulate a GPTBot request and check the response code. A 200 means the crawler can reach your content. A 403 means Cloudflare is blocking it regardless of your robots.txt.
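One way to script that comparison is a small helper that contrasts the status code a browser user agent receives with the one a crawler user agent receives. This is a sketch: the function name is ours, and yoursite.com is a placeholder.

```shell
# Decide whether a block is user-agent-based. A browser UA getting 200
# while a crawler UA gets anything else points at Bot Fight Mode or a
# WAF rule, not at robots.txt (robots.txt cannot produce a 403 by itself).
compare_codes() {
  browser_code=$1
  crawler_code=$2
  if [ "$browser_code" = "200" ] && [ "$crawler_code" != "200" ]; then
    echo "UA-based block: check Bot Fight Mode and WAF rules"
  else
    echo "no UA-specific block detected"
  fi
}

# Usage (requires network; yoursite.com is a placeholder):
# b=$(curl -A "Mozilla/5.0" -s -o /dev/null -w "%{http_code}" https://yoursite.com/)
# c=$(curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/)
# compare_codes "$b" "$c"
```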
Testing Whether AI Crawlers Can Actually Access Your Content
Do not assume that a correct robots.txt means crawlers can access your site. Test it directly. The following workflow confirms whether AI crawlers receive a 200 response with your actual content, not a block page or challenge screen.
# Test GPTBot access — should return 200
curl -A "GPTBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# Test PerplexityBot access
curl -A "PerplexityBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# Test ClaudeBot access
curl -A "ClaudeBot" -s -o /dev/null -w "%{http_code}" https://yoursite.com/
# If any return 403, Cloudflare or your WAF is blocking
# If any return 503, you may be getting a challenge page
# Verify the actual response content (not just status code)
curl -A "GPTBot" -s https://yoursite.com/ | head -50
A 200 status code alone is not sufficient verification. Some WAF configurations return a 200 with a JavaScript challenge page instead of your actual content. The last command in the sequence checks the actual response body. If you see your HTML content, the crawler can access it. If you see a Cloudflare challenge page or an empty response, access is blocked despite the 200 status.
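To automate the body check, a small filter can flag markers that commonly appear in Cloudflare-style challenge pages. The two marker strings below are assumptions based on commonly observed challenge markup, not an exhaustive or guaranteed list; adjust them to match what your WAF actually serves.

```shell
# Reads an HTML response on stdin; exits 0 if it looks like a
# Cloudflare-style challenge page rather than real content.
is_challenge_page() {
  grep -Eqi 'challenge-platform|cf-browser-verification'
}

# Usage (requires network; yoursite.com is a placeholder):
# if curl -A "GPTBot" -s https://yoursite.com/ | is_challenge_page; then
#   echo "BLOCKED: challenge page served despite the status code"
# else
#   echo "OK: crawler receives real content"
# fi
```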
For ongoing monitoring, check your server access logs for requests from AI crawler user agents. If you see Googlebot but never GPTBot, something is blocking GPTBot before it reaches your server. Google Search Console also reports crawl issues, but it only covers Googlebot — it will not show you GPTBot or PerplexityBot access problems.
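A quick way to run that log check from a shell, assuming an nginx-style access log (the log path in the usage comment is a placeholder):

```shell
# Tallies requests per AI crawler from log lines on stdin, so you can
# see at a glance which crawlers actually reach your origin.
count_ai_crawlers() {
  grep -oE 'GPTBot|OAI-SearchBot|PerplexityBot|ClaudeBot|Google-Extended|bingbot' \
    | sort | uniq -c | sort -rn
}

# Usage:
# count_ai_crawlers < /var/log/nginx/access.log
```

A crawler that appears in robots.txt but never in this tally is being stopped upstream of your server.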
The Difference Between robots.txt and Actual Crawler Access
Robots.txt is a protocol-level request — it tells crawlers what you want them to do. It has no enforcement mechanism. A well-behaved crawler reads robots.txt and respects it. A misbehaving crawler ignores it. But the larger issue for AEO practitioners is the opposite direction: your robots.txt may welcome crawlers that your infrastructure actively blocks.
Access control happens at multiple layers, and a block at any layer overrides a permission at another. Your robots.txt allows GPTBot. Your Cloudflare WAF blocks it. GPTBot is blocked. Your robots.txt allows PerplexityBot. Your hosting provider rate-limits unknown user agents. PerplexityBot gets throttled into uselessness. Both layers — the policy layer (robots.txt) and the enforcement layer (CDN, WAF, server config) — must be aligned.
| Symptom | Root Cause | Fix |
|---|---|---|
| robots.txt allows GPTBot but it never crawls | Cloudflare Bot Fight Mode intercepting the request | Disable Bot Fight Mode or add a WAF allow rule for GPTBot user agent |
| AI crawlers get 403 responses | WAF rule blocking non-browser user agents | Create an allow rule for AI crawler user agents above the blocking rule |
| AI crawlers get 503 responses | Cloudflare JavaScript challenge being served | Add AI crawler IPs to the Cloudflare allowlist or reduce security level for known bots |
| Content appears in Google but not in ChatGPT | GPTBot blocked while Googlebot is allowed | Ensure GPTBot and OAI-SearchBot have explicit Allow directives and no server-side blocks |
| Crawlers get 200 but with empty or wrong content | JavaScript rendering required but crawler does not execute JS | Implement server-side rendering or pre-rendering for content pages |
| Intermittent access — sometimes 200, sometimes 403 | Rate limiting triggering on crawler request frequency | Increase rate limit thresholds for known AI crawler user agents |
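As a concrete starting point for the WAF fixes above, a Cloudflare custom rule can match AI crawler user agents and skip the blocking features. The sketch below uses Cloudflare's rule-expression syntax; verify the field name and available Skip options against your plan's dashboard. Note that user agent strings can be spoofed, so an allow rule this broad trades some bot protection for crawler access.

```
(http.user_agent contains "GPTBot")
or (http.user_agent contains "OAI-SearchBot")
or (http.user_agent contains "PerplexityBot")
or (http.user_agent contains "ClaudeBot")
```

Set the rule's action to Skip, select the bot-protection features to bypass, and place it above any rule that blocks automated traffic.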