The number of AI training crawlers visiting websites has exploded since 2023. OpenAI, Anthropic, ByteDance, Common Crawl, Google, Meta, Apple, Cohere — they all operate crawlers harvesting web content for AI model training.

If you want to control which of these bots can access your content, this is your complete guide. We’ll cover robots.txt, server-level blocking, and CDN-level blocking — with ready-to-copy templates.

What Are AI Training Crawlers?

AI training crawlers collect text from websites to build datasets for training large language models. Unlike search engine crawlers (Googlebot, Bingbot), they don’t send you traffic — they only take your content.

The major AI training crawlers in 2026:

Bot                  Company        User Agent
GPTBot               OpenAI         GPTBot
ClaudeBot            Anthropic      ClaudeBot
anthropic-ai         Anthropic      anthropic-ai
CCBot                Common Crawl   CCBot
Bytespider           ByteDance      Bytespider
Google-Extended      Google         Google-Extended
Applebot-Extended    Apple          Applebot-Extended
Meta-ExternalAgent   Meta           Meta-ExternalAgent
cohere-ai            Cohere         cohere-ai
Diffbot              Diffbot        Diffbot

Method 1: robots.txt (Easiest)

robots.txt is the standard way to tell crawlers which parts of your site they can access. Most legitimate AI crawlers respect it.
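Before deploying, you can sanity-check robots.txt rules locally with Python’s built-in urllib.robotparser module (a quick sketch; the sample rules and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# A minimal rule set in the same shape as the templates below
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; unlisted agents fall through to "allowed"
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

The same parse()/can_fetch() calls work against any of the templates in this guide.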

Minimal block (OpenAI + Anthropic only)

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Block all major AI training crawlers

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
User-agent: Diffbot
Disallow: /

Recommended: block training crawlers, keep search visible

This keeps you visible in Perplexity, SearchGPT, and Siri while blocking training-data collection:

# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
Disallow: /

# Keep search engine crawlers allowed (important for SEO)
# Googlebot, Bingbot, YandexBot, BaiduSpider — allowed by default

# Keep AI search bots allowed (they drive traffic)
# PerplexityBot, OAI-SearchBot, Applebot — allowed by default

Block specific sections only

If you want to protect premium or sensitive content but allow crawling of public pages:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /paid-content/
Disallow: /private/
Allow: /blog/
Allow: /
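Path-level rules like these are easy to get subtly wrong, so they are worth checking with urllib.robotparser too (a sketch; a single user agent is shown for brevity):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Premium content is off-limits, public blog posts are not
print(parser.can_fetch("GPTBot", "https://example.com/premium/article"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))        # True
```

One caveat: Python’s parser applies rules in file order (first match wins), while Google’s parser prefers the most specific (longest) matching path; for this rule set both interpretations agree.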

Method 2: Nginx Server-Level Blocking

robots.txt relies on crawlers choosing to follow it. For stronger enforcement (especially against Bytespider, which is inconsistent about robots.txt compliance), use server-level rules.

Block all AI training crawlers (Nginx)

# Put this map in the http context (outside any server block)
map $http_user_agent $block_ai_crawler {
    default 0;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*anthropic-ai"        1;
    "~*CCBot"               1;
    "~*Bytespider"          1;
    "~*Google-Extended"     1;
    "~*Applebot-Extended"   1;
    "~*Meta-ExternalAgent"  1;
    "~*cohere-ai"           1;
}

server {
    # ...
    if ($block_ai_crawler) {
        return 403;
    }
}
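The ~* patterns in the map are case-insensitive substring matches. You can mirror that logic in Python to spot-check which User-Agent strings would be caught (the regex and function here are our sketch, not part of Nginx):

```python
import re

# Same tokens as the Nginx map, joined into one case-insensitive regex
AI_CRAWLER_RE = re.compile(
    r"GPTBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Google-Extended"
    r"|Applebot-Extended|Meta-ExternalAgent|cohere-ai",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would get a 403 from the map above."""
    return bool(AI_CRAWLER_RE.search(user_agent))

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.1)"))     # True
print(is_blocked("bytespider"))                               # True (case-insensitive)
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```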

Block Bytespider specifically (highest priority)

Bytespider is widely reported to be the most aggressive and least robots.txt-compliant major crawler. Block it at the server level:

if ($http_user_agent ~* "Bytespider") {
    return 403;
}

Method 3: Apache .htaccess

If your site runs on Apache, add these rules to your .htaccess file (mod_rewrite must be enabled):

RewriteEngine On

# Block AI training crawlers
RewriteCond %{HTTP_USER_AGENT} "GPTBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Google-Extended|Applebot-Extended|Meta-ExternalAgent|cohere-ai" [NC]
RewriteRule .* - [F,L]

Or block them one by one for more control (the F flag returns 403 Forbidden; L stops further rule processing):

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]

Method 4: Cloudflare WAF Blocking

Cloudflare’s WAF blocks requests before they reach your server, making it the most effective approach for aggressive crawlers.

Create a WAF Custom Rule

  1. Go to Cloudflare Dashboard → Security → WAF → Custom Rules
  2. Click “Create rule”
  3. Configure:

Rule name: Block AI Training Crawlers

Expression:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "cohere-ai")

Action: Block
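To double-check the expression’s coverage before saving the rule, you can mirror its chained contains clauses in Python (the token list and helper function are our sketch; Cloudflare evaluates the real expression against the raw User-Agent header):

```python
# One token per "contains" clause in the WAF expression above
BLOCKED_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent", "cohere-ai",
]

def waf_would_block(user_agent: str) -> bool:
    """Mirror the chained 'or' of substring checks in the rule expression."""
    return any(token in user_agent for token in BLOCKED_TOKENS)

print(waf_would_block("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))      # True
print(waf_would_block("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))  # False
```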

Cloudflare Bot Fight Mode

Cloudflare also offers Bot Fight Mode (free) and Super Bot Fight Mode (paid) which can automatically identify and challenge known bots. This doesn’t specifically target AI crawlers but provides additional protection.

Does robots.txt Actually Work?

Most legitimate AI crawlers respect robots.txt — but not all, and not always.

Bot                  robots.txt compliance
GPTBot               Good — OpenAI explicitly committed to honoring it
ClaudeBot            Good — Anthropic follows the standard
CCBot                Good — Common Crawl respects it
Bytespider           Inconsistent — multiple reports of ignored Disallow rules
Google-Extended      Good
Applebot-Extended    Good

Recommendation: Use robots.txt as your first line of defense for well-behaved crawlers. Add server-level or Cloudflare blocking for Bytespider and any bot you need to reliably block.

Important: Don’t Block AI Search Bots

AI search bots are different from training crawlers — they index your site to show it in AI search results and send referral traffic. Don’t accidentally block them.

AI search bots to keep allowed:

  • PerplexityBot — Perplexity AI search (allow for visibility and traffic)
  • OAI-SearchBot — SearchGPT / ChatGPT Search
  • Applebot — Siri, Spotlight, Safari Suggestions
  • Googlebot — Google Search (obviously keep allowed)

Verify Your robots.txt is Working

After updating your robots.txt, verify it’s correctly formatted:

  1. Visit yourdomain.com/robots.txt to confirm it’s accessible
  2. Check Google Search Console’s robots.txt report to see how Google parses it (the standalone robots.txt Tester tool has been retired)
  3. Use our AI Bot Checker to test which AI bots can access your site

Complete robots.txt Template for 2026

Copy and paste this into your robots.txt to block all major AI training crawlers while keeping search engines and AI search bots allowed:

# ============================================
# AI Training Bot Blocking (2026)
# ============================================

# OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Anthropic training crawlers
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Common Crawl (used by OpenAI, Meta, Google, and many others)
User-agent: CCBot
Disallow: /

# ByteDance / TikTok AI training
User-agent: Bytespider
Disallow: /

# Google AI training (separate from regular Googlebot)
User-agent: Google-Extended
Disallow: /

# Apple Intelligence training (separate from regular Applebot)
User-agent: Applebot-Extended
Disallow: /

# Meta AI training
User-agent: Meta-ExternalAgent
Disallow: /

# Cohere AI training
User-agent: cohere-ai
Disallow: /

# ============================================
# Search engines — KEEP ALLOWED (drives traffic)
# ============================================
# Googlebot — allowed by default
# Bingbot — allowed by default
# YandexBot — allowed by default
# BaiduSpider — allowed by default
# Applebot — allowed by default (Siri/Spotlight)
# DuckDuckBot — allowed by default

# ============================================
# AI Search Bots — KEEP ALLOWED (drives traffic)
# ============================================
# PerplexityBot — allowed by default
# OAI-SearchBot — allowed by default

# ============================================
# Sitemap
# ============================================
Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com with your actual domain.
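As a final check, the template’s behavior can be verified end to end with urllib.robotparser (a sketch; example.com stands in for your domain, and the groups are generated programmatically to match the template above):

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent", "cohere-ai",
]
ALLOWED_BOTS = ["Googlebot", "Bingbot", "PerplexityBot", "OAI-SearchBot", "Applebot"]

# One "User-agent: <bot> / Disallow: /" group per training bot, as in the template
robots_txt = "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/any-page"
for bot in TRAINING_BOTS:
    assert not parser.can_fetch(bot, url), bot  # training bots blocked
for bot in ALLOWED_BOTS:
    assert parser.can_fetch(bot, url), bot      # search bots still allowed
print("template OK")
```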


Test Your Bot Blocking

Use our AI Bot Checker to verify which AI training bots can currently access your site and whether your blocking is working correctly.

Related guides: