The number of AI training crawlers visiting websites has exploded since 2023. OpenAI, Anthropic, ByteDance, Common Crawl, Google, Meta, Apple, Cohere — they all operate crawlers harvesting web content for AI model training.
If you want to control which of these bots can access your content, this is your complete guide. We’ll cover robots.txt, server-level blocking, and CDN-level blocking — with ready-to-copy templates.
What Are AI Training Crawlers?
AI training crawlers collect text from websites to build datasets for training large language models. Unlike search engine crawlers (Googlebot, Bingbot), they don’t send you traffic — they only take your content.
The major AI training crawlers in 2026:
| Bot | Company | User Agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ClaudeBot | Anthropic | ClaudeBot |
| anthropic-ai | Anthropic | anthropic-ai |
| CCBot | Common Crawl | CCBot |
| Bytespider | ByteDance | Bytespider |
| Google-Extended | Google | Google-Extended |
| Applebot-Extended | Apple | Applebot-Extended |
| Meta-ExternalAgent | Meta | Meta-ExternalAgent |
| cohere-ai | Cohere | cohere-ai |
| Diffbot | Diffbot | Diffbot |
Method 1: robots.txt (Easiest)
robots.txt is the standard way to tell crawlers which parts of your site they can access. Most legitimate AI crawlers respect it.
Minimal block (OpenAI + Anthropic only)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
Block all major AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
User-agent: Diffbot
Disallow: /
Block AI training but allow AI search (recommended for most sites)
This keeps you visible in Perplexity, SearchGPT, and Siri while blocking training data collection:
# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
Disallow: /
# Keep search engine crawlers allowed (important for SEO)
# Googlebot, Bingbot, YandexBot, BaiduSpider — allowed by default
# Keep AI search bots allowed (they drive traffic)
# PerplexityBot, OAI-SearchBot, Applebot — allowed by default
Block specific sections only
If you want to protect premium or sensitive content but allow crawling of public pages:
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /paid-content/
Disallow: /private/
Allow: /blog/
Allow: /
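If you want to sanity-check how section-level rules like these are interpreted before deploying, Python's standard `urllib.robotparser` can evaluate them offline (a minimal sketch; the paths are the illustrative ones from the template above):

```python
# Sketch: evaluate section-level robots.txt rules offline with the
# standard-library parser before publishing them.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "/premium/article"))  # False
print(parser.can_fetch("GPTBot", "/blog/post"))        # True
```

Note that Python's parser uses first-match-wins semantics rather than Google's longest-match rule, but for simple prefix rules like these the results agree.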
Method 2: Nginx Server-Level Blocking
robots.txt relies on crawlers choosing to follow it. For stronger enforcement, especially against Bytespider, which has inconsistent robots.txt compliance, use server-level rules.
Block all AI training crawlers (Nginx)
# Place this map in the http context of nginx.conf
# (the map directive cannot go inside a server block)
map $http_user_agent $block_ai_crawler {
    default 0;
    "~*GPTBot" 1;
    "~*ClaudeBot" 1;
    "~*anthropic-ai" 1;
    "~*CCBot" 1;
    "~*Bytespider" 1;
    "~*Google-Extended" 1;   # robots.txt token only; Google sends no such user agent
    "~*Applebot-Extended" 1; # robots.txt token only; Apple sends no such user agent
    "~*Meta-ExternalAgent" 1;
    "~*cohere-ai" 1;
}

server {
    # ...
    if ($block_ai_crawler) {
        return 403;
    }
}
Block Bytespider specifically (highest priority)
Bytespider is the most aggressive and least robots.txt-compliant crawler. Block it at the server level:
if ($http_user_agent ~* "Bytespider") {
    return 403;
}
Method 3: Apache .htaccess
RewriteEngine On
# Block AI training crawlers
RewriteCond %{HTTP_USER_AGENT} "GPTBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Google-Extended|Applebot-Extended|Meta-ExternalAgent|cohere-ai" [NC]
RewriteRule .* - [F,L]
Or block them one by one for more control:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC]
RewriteRule .* - [F,L]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule .* - [F,L]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]
Method 4: Cloudflare WAF Rules (Recommended for High-Traffic Sites)
Cloudflare’s WAF blocks requests before they reach your server — the most effective approach for aggressive crawlers.
Create a WAF Custom Rule
- Go to Cloudflare Dashboard → Security → WAF → Custom Rules
- Click “Create rule”
- Configure:
Rule name: Block AI Training Crawlers
Expression:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "cohere-ai")
Action: Block
Cloudflare Bot Fight Mode
Cloudflare also offers Bot Fight Mode (free) and Super Bot Fight Mode (paid) which can automatically identify and challenge known bots. This doesn’t specifically target AI crawlers but provides additional protection.
Does robots.txt Actually Work?
Most legitimate AI crawlers respect robots.txt — but not all, and not always.
| Bot | robots.txt compliance |
|---|---|
| GPTBot | Good — OpenAI explicitly committed to honoring it |
| ClaudeBot | Good — Anthropic follows standard |
| CCBot | Good — Common Crawl respects it |
| Bytespider | Inconsistent — multiple reports of ignoring Disallow rules |
| Google-Extended | Good |
| Applebot-Extended | Good |
Recommendation: Use robots.txt as your first line of defense for well-behaved crawlers. Add server-level or Cloudflare blocking for Bytespider and any bot you need to reliably block.
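Before deciding where to enforce, it helps to know which crawlers actually hit your site. A minimal sketch that tallies AI-bot user agents in an access log (the sample lines and user-agent strings below are illustrative):

```python
# Sketch: count AI-crawler hits in an access log by user-agent substring.
AI_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
           "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent",
           "cohere-ai"]

# Sample lines in combined log format; in practice read your real log,
# e.g. open("/var/log/nginx/access.log").
log_lines = [
    '1.2.3.4 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 123 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [01/Jan/2026:00:00:02 +0000] "GET /blog HTTP/1.1" 200 456 "-" '
    '"Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"',
    '9.9.9.9 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 789 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

counts = {}
for line in log_lines:
    for bot in AI_BOTS:
        if bot.lower() in line.lower():
            counts[bot] = counts.get(bot, 0) + 1

print(counts)  # {'GPTBot': 1, 'Bytespider': 1}
```

If Bytespider dominates the counts, prioritize the server-level or CDN-level blocks above.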
Important: Don’t Block AI Search Bots
AI search bots are different from training crawlers — they index your site to show it in AI search results and send referral traffic. Don’t accidentally block them.
AI search bots to keep allowed:
- PerplexityBot — Perplexity AI search (allow for visibility and traffic)
- OAI-SearchBot — SearchGPT / ChatGPT Search
- Applebot — Siri, Spotlight, Safari Suggestions
- Googlebot — Google Search (obviously keep allowed)
Verify Your robots.txt is Working
After updating your robots.txt, verify it’s correctly formatted:
- Visit yourdomain.com/robots.txt to confirm it's accessible
- Check it in Google Search Console's robots.txt report (which replaced the retired robots.txt Tester)
- Use our AI Bot Checker to test which AI bots can access your site
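The checks above can also be scripted: feed your robots.txt to Python's standard `urllib.robotparser` and loop over every bot you care about, mirroring what a robots.txt tester does (a sketch with an inline sample; in practice load your live file):

```python
# Sketch: verify which bots a robots.txt blocks, using the stdlib parser.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Training bots should come back "blocked"; search bots "allowed".
for bot in ["GPTBot", "CCBot", "PerplexityBot", "Googlebot"]:
    print(bot, "blocked" if not parser.can_fetch(bot, "/") else "allowed")
```

Bots with no matching User-agent group (and no `*` group) are allowed by default, which is exactly the behavior the template above relies on for search crawlers.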
Complete robots.txt Template for 2026
Copy and paste this into your robots.txt to block all major AI training crawlers while keeping search engines and AI search bots allowed:
# ============================================
# AI Training Bot Blocking (2026)
# ============================================
# OpenAI training crawler
User-agent: GPTBot
Disallow: /
# Anthropic training crawlers
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Common Crawl (used by OpenAI, Meta, Google, and many others)
User-agent: CCBot
Disallow: /
# ByteDance / TikTok AI training
User-agent: Bytespider
Disallow: /
# Google AI training (separate from regular Googlebot)
User-agent: Google-Extended
Disallow: /
# Apple Intelligence training (separate from regular Applebot)
User-agent: Applebot-Extended
Disallow: /
# Meta AI training
User-agent: Meta-ExternalAgent
Disallow: /
# Cohere AI training
User-agent: cohere-ai
Disallow: /
# ============================================
# Search engines — KEEP ALLOWED (drives traffic)
# ============================================
# Googlebot — allowed by default
# Bingbot — allowed by default
# YandexBot — allowed by default
# BaiduSpider — allowed by default
# Applebot — allowed by default (Siri/Spotlight)
# DuckDuckBot — allowed by default
# ============================================
# AI Search Bots — KEEP ALLOWED (drives traffic)
# ============================================
# PerplexityBot — allowed by default
# OAI-SearchBot — allowed by default
# ============================================
# Sitemap
# ============================================
Sitemap: https://yourdomain.com/sitemap.xml
Replace yourdomain.com with your actual domain.
Test Your Bot Blocking
Use our AI Bot Checker to verify which AI training bots can currently access your site and whether your blocking is working correctly.
Related guides:
- AI Training Bots vs AI Search Bots: What’s the Difference?
- Do AI Bots Respect robots.txt?
- GPTBot - OpenAI’s training crawler
- CCBot - Common Crawl, highest-impact block
- Bytespider - Most aggressive AI crawler