The number of AI training crawlers visiting websites has exploded since 2023. OpenAI, Anthropic, ByteDance, Common Crawl, Google, Meta, Apple, Cohere — they all operate crawlers harvesting web content for AI model training.

If you want to control which of these bots can access your content, this is your complete guide. We’ll cover robots.txt, server-level blocking, and CDN-level blocking — with ready-to-copy templates.

What Are AI Training Crawlers?

AI training crawlers collect text from websites to build datasets for training large language models. Unlike search engine crawlers (Googlebot, Bingbot), they don’t send you traffic — they only take your content.

The major AI training crawlers in 2026:

Bot                  Company        User Agent
GPTBot               OpenAI         GPTBot
ClaudeBot            Anthropic      ClaudeBot
anthropic-ai         Anthropic      anthropic-ai
CCBot                Common Crawl   CCBot
Bytespider           ByteDance      Bytespider
Google-Extended      Google         Google-Extended
Applebot-Extended    Apple          Applebot-Extended
Meta-ExternalAgent   Meta           Meta-ExternalAgent
cohere-ai            Cohere         cohere-ai
Diffbot              Diffbot        Diffbot

Method 1: robots.txt (Easiest)

robots.txt is the standard way to tell crawlers which parts of your site they can access. Most legitimate AI crawlers respect it.
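Before deploying, you can sanity-check robots.txt rules locally with Python’s built-in urllib.robotparser module (a quick sketch; the sample rules and URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# A minimal rule set in the same shape as the templates below
rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is blocked everywhere; unlisted agents fall through to "allowed"
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

The same parse()/can_fetch() calls work against any of the templates in this guide.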

Minimal block (OpenAI + Anthropic only)

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Block all major AI training crawlers

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
User-agent: Diffbot
Disallow: /

Recommended: block training crawlers, keep search visible

This keeps you visible in Perplexity, SearchGPT, and Siri while blocking training-data collection:

# Block AI training crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: cohere-ai
Disallow: /

# Keep search engine crawlers allowed (important for SEO)
# Googlebot, Bingbot, YandexBot, BaiduSpider — allowed by default

# Keep AI search bots allowed (they drive traffic)
# PerplexityBot, OAI-SearchBot, Applebot — allowed by default

Block specific sections only

If you want to protect premium or sensitive content but allow crawling of public pages:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /paid-content/
Disallow: /private/
Allow: /blog/
Allow: /
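Path-level rules like these are easy to get subtly wrong, so they are worth checking with urllib.robotparser too (a sketch; a single user agent is shown for brevity):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Premium content is off-limits, public blog posts are not
print(parser.can_fetch("GPTBot", "https://example.com/premium/article"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))        # True
```

One caveat: Python’s parser applies rules in file order (first match wins), while Google’s parser prefers the most specific (longest) matching path; for this rule set both interpretations agree.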

Method 2: Nginx Server-Level Blocking

robots.txt relies on crawlers choosing to follow it. For stronger enforcement (especially against Bytespider, which is inconsistent about robots.txt compliance), use server-level rules.

Block all AI training crawlers (Nginx)

# Put this map in the http context (outside any server block)
map $http_user_agent $block_ai_crawler {
    default 0;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*anthropic-ai"        1;
    "~*CCBot"               1;
    "~*Bytespider"          1;
    "~*Google-Extended"     1;
    "~*Applebot-Extended"   1;
    "~*Meta-ExternalAgent"  1;
    "~*cohere-ai"           1;
}

server {
    # ...
    if ($block_ai_crawler) {
        return 403;
    }
}
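The ~* patterns in the map are case-insensitive substring matches. You can mirror that logic in Python to spot-check which User-Agent strings would be caught (the regex and function here are our sketch, not part of Nginx):

```python
import re

# Same tokens as the Nginx map, joined into one case-insensitive regex
AI_CRAWLER_RE = re.compile(
    r"GPTBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Google-Extended"
    r"|Applebot-Extended|Meta-ExternalAgent|cohere-ai",
    re.IGNORECASE,
)

def is_blocked(user_agent: str) -> bool:
    """Return True if this User-Agent would get a 403 from the map above."""
    return bool(AI_CRAWLER_RE.search(user_agent))

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.1)"))     # True
print(is_blocked("bytespider"))                               # True (case-insensitive)
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```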

Block Bytespider specifically (highest priority)

Bytespider is widely reported to be the most aggressive and least robots.txt-compliant major crawler. Block it at the server level:

if ($http_user_agent ~* "Bytespider") {
    return 403;
}

Method 3: Apache .htaccess

If your site runs on Apache, add these rules to your .htaccess file (mod_rewrite must be enabled):

RewriteEngine On

# Block AI training crawlers
RewriteCond %{HTTP_USER_AGENT} "GPTBot|ClaudeBot|anthropic-ai|CCBot|Bytespider|Google-Extended|Applebot-Extended|Meta-ExternalAgent|cohere-ai" [NC]
RewriteRule .* - [F,L]

Or block them one by one for more control (the F flag returns 403 Forbidden; L stops further rule processing):

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule .* - [F,L]

RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]

Method 4: Cloudflare WAF Blocking

Cloudflare’s WAF blocks requests before they reach your server, making it the most effective approach for aggressive crawlers.

Create a WAF Custom Rule

  1. Go to Cloudflare Dashboard → Security → WAF → Custom Rules
  2. Click “Create rule”
  3. Configure:

Rule name: Block AI Training Crawlers

Expression:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "cohere-ai")

Action: Block
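To double-check the expression’s coverage before saving the rule, you can mirror its chained contains clauses in Python (the token list and helper function are our sketch; Cloudflare evaluates the real expression against the raw User-Agent header):

```python
# One token per "contains" clause in the WAF expression above
BLOCKED_TOKENS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent", "cohere-ai",
]

def waf_would_block(user_agent: str) -> bool:
    """Mirror the chained 'or' of substring checks in the rule expression."""
    return any(token in user_agent for token in BLOCKED_TOKENS)

print(waf_would_block("Mozilla/5.0 (compatible; ClaudeBot/1.0)"))      # True
print(waf_would_block("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))  # False
```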

Cloudflare Bot Fight Mode

Cloudflare also offers Bot Fight Mode (free) and Super Bot Fight Mode (paid) which can automatically identify and challenge known bots. This doesn’t specifically target AI crawlers but provides additional protection.

Does robots.txt Actually Work?

Most legitimate AI crawlers respect robots.txt — but not all, and not always.

Bot                  robots.txt compliance
GPTBot               Good — OpenAI explicitly committed to honoring it
ClaudeBot            Good — Anthropic follows the standard
CCBot                Good — Common Crawl respects it
Bytespider           Inconsistent — multiple reports of ignored Disallow rules
Google-Extended      Good
Applebot-Extended    Good

Recommendation: Use robots.txt as your first line of defense for well-behaved crawlers. Add server-level or Cloudflare blocking for Bytespider and any bot you need to reliably block.

Important: Don’t Block AI Search Bots

AI search bots are different from training crawlers — they index your site to show it in AI search results and send referral traffic. Don’t accidentally block them.

AI search bots to keep allowed:

  • PerplexityBot — Perplexity AI search (allow for visibility and traffic)
  • OAI-SearchBot — SearchGPT / ChatGPT Search
  • Applebot — Siri, Spotlight, Safari Suggestions
  • Googlebot — Google Search (obviously keep allowed)

Verify Your robots.txt is Working

After updating your robots.txt, verify it’s correctly formatted:

  1. Visit yourdomain.com/robots.txt to confirm it’s accessible
  2. Check Google Search Console’s robots.txt report to see how Google parses it (the standalone robots.txt Tester tool has been retired)
  3. Use our AI Bot Checker to test which AI bots can access your site

Complete robots.txt Template for 2026

Copy and paste this into your robots.txt to block all major AI training crawlers while keeping search engines and AI search bots allowed:

# ============================================
# AI Training Bot Blocking (2026)
# ============================================

# OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Anthropic training crawlers
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Common Crawl (used by OpenAI, Meta, Google, and many others)
User-agent: CCBot
Disallow: /

# ByteDance / TikTok AI training
User-agent: Bytespider
Disallow: /

# Google AI training (separate from regular Googlebot)
User-agent: Google-Extended
Disallow: /

# Apple Intelligence training (separate from regular Applebot)
User-agent: Applebot-Extended
Disallow: /

# Meta AI training
User-agent: Meta-ExternalAgent
Disallow: /

# Cohere AI training
User-agent: cohere-ai
Disallow: /

# ============================================
# Search engines — KEEP ALLOWED (drives traffic)
# ============================================
# Googlebot — allowed by default
# Bingbot — allowed by default
# YandexBot — allowed by default
# BaiduSpider — allowed by default
# Applebot — allowed by default (Siri/Spotlight)
# DuckDuckBot — allowed by default

# ============================================
# AI Search Bots — KEEP ALLOWED (drives traffic)
# ============================================
# PerplexityBot — allowed by default
# OAI-SearchBot — allowed by default

# ============================================
# Sitemap
# ============================================
Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com with your actual domain.
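As a final check, the template’s behavior can be verified end to end with urllib.robotparser (a sketch; example.com stands in for your domain, and the groups are generated programmatically to match the template above):

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended", "Meta-ExternalAgent", "cohere-ai",
]
ALLOWED_BOTS = ["Googlebot", "Bingbot", "PerplexityBot", "OAI-SearchBot", "Applebot"]

# One "User-agent: <bot> / Disallow: /" group per training bot, as in the template
robots_txt = "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in TRAINING_BOTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://example.com/any-page"
for bot in TRAINING_BOTS:
    assert not parser.can_fetch(bot, url), bot  # training bots blocked
for bot in ALLOWED_BOTS:
    assert parser.can_fetch(bot, url), bot      # search bots still allowed
print("template OK")
```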


Test Your Bot Blocking

Use our AI Bot Checker to verify which AI training bots can currently access your site and whether your blocking is working correctly.

Related guides: