When you add an AI bot to your robots.txt Disallow list, does it actually stop? The answer varies significantly by crawler — and for at least one major AI bot, the evidence suggests it sometimes doesn’t.

Here’s a practical breakdown of robots.txt compliance across all major AI crawlers, based on official statements and real-world reports.

How robots.txt Works

robots.txt is a plain text file at the root of your domain (yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access.

User-agent: GPTBot
Disallow: /

The critical point: robots.txt is voluntary. There is no technical enforcement. A bot can choose to ignore it entirely. The only enforcement mechanism is reputational and legal — companies that ignore robots.txt risk lawsuits and public backlash.
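This voluntary check is exactly what a well-behaved crawler implements. A minimal sketch using Python's standard urllib.robotparser shows the mechanics: the parser only answers the question, and nothing stops a bot from fetching anyway.

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules above (a real crawler would fetch them
# from https://yourdomain.com/robots.txt first).
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# A compliant crawler asks before every fetch; complying is its choice.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/article"))     # False: blocked
print(rp.can_fetch("Googlebot", "https://yourdomain.com/article"))  # True: no rule applies
```

Note that `can_fetch` returns True for Googlebot here because no rule targets it; a group only applies to the user agents it names.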

For decades, search engines (Googlebot, Bingbot) have reliably honored robots.txt. The AI bot era has tested this norm.

Compliance by Bot

GPTBot (OpenAI) — Good

Official stance: OpenAI explicitly stated when launching GPTBot in August 2023 that it respects robots.txt Disallow rules.

Real-world reports: Generally consistent with this claim. Webmasters who have blocked GPTBot report that crawling stops within a few days of adding the rule.

Caveat: Data already collected before you added the block may have been used in previous training runs. robots.txt only prevents future crawling.

Recommendation: robots.txt is sufficient for GPTBot.

User-agent: GPTBot
Disallow: /

ClaudeBot / anthropic-ai (Anthropic) — Good

Official stance: Anthropic has stated that ClaudeBot honors robots.txt.

Real-world reports: Consistent with this — blocking via robots.txt appears effective.

Recommendation: robots.txt is sufficient.

User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /

CCBot (Common Crawl) — Good

Official stance: Common Crawl explicitly follows the Robots Exclusion Protocol.

Real-world reports: Generally reliable. Common Crawl has been operating for years and has a track record of compliance.

Caveat: Common Crawl’s historical archive (going back to 2007) contains data from before widespread blocking. Historical data is already out there — blocking now only affects future crawls.

Recommendation: robots.txt is sufficient, but note the historical data caveat.

User-agent: CCBot
Disallow: /

Bytespider (ByteDance) — Inconsistent ⚠️

Official stance: ByteDance states that Bytespider respects robots.txt.

Real-world reports: Multiple webmasters have reported continued crawling from Bytespider after adding Disallow rules. The pattern suggests:

  • Some Bytespider variants honor robots.txt, others may not
  • Compliance may be delayed — Bytespider may take longer to update its crawl schedule
  • Different IP ranges may behave differently

Recommendation: Do NOT rely solely on robots.txt for Bytespider. Use server-level blocking.

# Nginx — add in addition to robots.txt
if ($http_user_agent ~* "Bytespider") {
    return 403;
}

Google-Extended (Google) — Good

Official stance: Google created Google-Extended specifically as a separate user agent so webmasters could opt out of AI training without affecting search. Compliance is part of the design.

Real-world reports: Reliable. This is Google — robots.txt compliance is fundamental to their operations.

Recommendation: robots.txt is sufficient.

User-agent: Google-Extended
Disallow: /

Applebot-Extended (Apple) — Good

Official stance: Apple explicitly states that Applebot-Extended honors robots.txt and can be controlled separately from Applebot.

Real-world reports: Consistent with this claim.

Recommendation: robots.txt is sufficient.

User-agent: Applebot-Extended
Disallow: /

Meta-ExternalAgent (Meta) — Good

Official stance: Meta states compliance with robots.txt.

Real-world reports: Generally consistent, though Meta’s crawler is newer and has less of a track record than Google’s or OpenAI’s.

Recommendation: robots.txt is sufficient.

User-agent: Meta-ExternalAgent
Disallow: /

PerplexityBot (Perplexity AI) — Mostly Good, with controversy

Official stance: Perplexity has committed to honoring robots.txt.

Real-world controversy: In 2024, reports surfaced that Perplexity’s real-time browsing feature (Perplexity-User) was sometimes visiting pages that had Disallow rules — not by violating robots.txt for the indexing bot, but by using a different user agent for real-time fetching.

Practical reality: PerplexityBot (the indexing crawler) is generally robots.txt compliant. The real-time fetching behavior has been disputed.

Recommendation: robots.txt works for the indexing crawler. Be aware that real-time queries may use different user agents.
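If you want to stop the real-time fetching at the server level as well, the same Nginx pattern used elsewhere in this guide can cover both user agents. The "Perplexity-User" string here is the one named in the reports above, not an officially guaranteed identifier:

```nginx
# Nginx: match both the indexing crawler and the reported
# real-time fetcher user agent, and refuse the request
if ($http_user_agent ~* "(PerplexityBot|Perplexity-User)") {
    return 403;
}
```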


Summary Table

Bot                | Company       | robots.txt Compliance | Server-level needed?
-------------------|---------------|-----------------------|---------------------
GPTBot             | OpenAI        | Good ✓                | Optional
ClaudeBot          | Anthropic     | Good ✓                | Optional
anthropic-ai       | Anthropic     | Good ✓                | Optional
CCBot              | Common Crawl  | Good ✓                | Optional
Bytespider         | ByteDance     | Inconsistent ⚠️       | Yes
Google-Extended    | Google        | Good ✓                | Optional
Applebot-Extended  | Apple         | Good ✓                | Optional
Meta-ExternalAgent | Meta          | Good ✓                | Optional
PerplexityBot      | Perplexity AI | Mostly Good           | Optional
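Because a robots.txt group may list several User-agent lines before a shared Disallow (as in the Anthropic example earlier), all of the compliant bots in this table can be blocked with a single stanza:

```
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: PerplexityBot
Disallow: /
```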

Why Bots Might Ignore robots.txt

Even for bots with good compliance records, there are scenarios where robots.txt might not be honored immediately or completely:

  1. Cached robots.txt and crawl schedules — bots cache your robots.txt and precompute crawl schedules. A new rule may take effect only after the cached copy expires, which can be a day or longer.

  2. Multiple crawler variants — large companies operate many crawler variants. A robots.txt rule targeting one user agent string may not block related variants.

  3. Race conditions — bots may fetch a page before checking robots.txt for that domain, especially for first-time visits.

  4. Ignoring by design — some bots (especially scrapers and bad actors) deliberately ignore robots.txt.

When to Use Server-Level Blocking

For reliable blocking that doesn’t depend on bot compliance:

Nginx

if ($http_user_agent ~* "Bytespider") {
    return 403;
}

Apache

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]

Cloudflare WAF

For maximum protection — blocks at the network edge before requests reach your server:

  1. Security → WAF → Custom Rules
  2. User Agent contains target bot name
  3. Action: Block
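The three steps above correspond to a custom rule whose filter expression, written in Cloudflare's rules language, looks like this (Bytespider shown as an example target):

```
(http.user_agent contains "Bytespider")
```

With the rule's action set to Block, matching requests are refused at the edge and never reach your origin server.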

Legal Pressure

robots.txt compliance is increasingly backed by legal precedent:

  • hiQ v. LinkedIn (2022) — the case's final resolution found that scraping in breach of a site's User Agreement can constitute breach of contract
  • NYT v. OpenAI (2023) — copyright lawsuit touching on web scraping practices
  • EU AI Act — being implemented with provisions around training data transparency

Companies operating AI crawlers that ignore robots.txt face growing legal risk, which creates additional pressure for compliance beyond just reputation.

Practical Recommendations

  1. Always add robots.txt rules — it works for most crawlers and establishes clear intent
  2. Add server-level blocking for Bytespider — don’t rely on robots.txt alone
  3. Use Cloudflare for high-traffic sites — stops aggressive crawlers at the network edge
  4. Monitor your logs — verify that bots you’ve blocked aren’t still appearing
  5. Keep robots.txt updated — new AI crawlers launch regularly
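For recommendation 4, a short script can tally which AI bots still appear in your access logs. This is a minimal sketch: the sample lines below are placeholders, and in practice you would read your real Nginx or Apache access log from disk.

```python
from collections import Counter

# Hypothetical sample lines in combined log format; replace with
# open("/var/log/nginx/access.log") (path varies by setup).
sample_log = [
    '1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [01/Jan/2025:00:00:02 +0000] "GET /page HTTP/1.1" 403 0 "-" "Bytespider"',
    '9.9.9.9 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Google-Extended",
           "Applebot-Extended", "Meta-ExternalAgent", "PerplexityBot"]

# Count hits per bot; a 403 next to a bot name means the server-level
# block fired, while a 200 means the bot got through.
hits = Counter()
for line in sample_log:
    for bot in AI_BOTS:
        if bot.lower() in line.lower():
            hits[bot] += 1

print(dict(hits))
```

Run this periodically after changing robots.txt: a bot that keeps appearing with 200 responses after its Disallow rule went live is a candidate for server-level blocking.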

Test Your Bot Access

Use our AI Bot Checker to see which AI crawlers can currently access your site, and verify your robots.txt is correctly blocking the bots you want to stop.

Related guides: