When you add an AI bot to your robots.txt Disallow list, does it actually stop? The answer varies significantly by crawler — and for at least one major AI bot, the evidence suggests it sometimes doesn’t.
Here’s a practical breakdown of robots.txt compliance across all major AI crawlers, based on official statements and real-world reports.
## How robots.txt Works

robots.txt is a plain text file at the root of your domain (yourdomain.com/robots.txt) that tells web crawlers which pages they can and cannot access.

```
User-agent: GPTBot
Disallow: /
```
The critical point: robots.txt is voluntary. There is no technical enforcement. A bot can choose to ignore it entirely. The only enforcement mechanism is reputational and legal — companies that ignore robots.txt risk lawsuits and public backlash.
For decades, search engines (Googlebot, Bingbot) have reliably honored robots.txt. The AI bot era has tested this norm.
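On the crawler side, honoring robots.txt is typically just a library call before each fetch. Python's standard-library `urllib.robotparser` sketches what a well-behaved bot does (the rules and URLs here are illustrative, not from any real site):

```python
from urllib.robotparser import RobotFileParser

# Rules as they would appear in a site's robots.txt
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# A compliant crawler checks before every fetch -- nothing forces it to.
print(parser.can_fetch("GPTBot", "https://example.com/article"))    # False
print(parser.can_fetch("Googlebot", "https://example.com/article")) # True
```

The check is entirely on the honor system: a crawler that skips the `can_fetch` call fetches the page anyway, which is exactly why the compliance records below matter.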
## Compliance by Bot

### GPTBot (OpenAI) — Good

Official stance: OpenAI explicitly stated when launching GPTBot in August 2023 that it respects robots.txt Disallow rules.

Real-world reports: Generally consistent with this claim. Webmasters who have blocked GPTBot report that crawling stops within a few days of adding the rule.

Caveat: Data already collected before you added the block may have been used in previous training runs. robots.txt only prevents future crawling.

Recommendation: robots.txt is sufficient for GPTBot.

```
User-agent: GPTBot
Disallow: /
```
### ClaudeBot / anthropic-ai (Anthropic) — Good

Official stance: Anthropic has stated that ClaudeBot honors robots.txt.

Real-world reports: Consistent with this — blocking via robots.txt appears effective.

Recommendation: robots.txt is sufficient.

```
User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /
```
### CCBot (Common Crawl) — Good

Official stance: Common Crawl explicitly follows the Robots Exclusion Protocol.

Real-world reports: Generally reliable. Common Crawl has been operating for years and has a track record of compliance.

Caveat: Common Crawl’s historical archive (going back to 2007) contains data from before widespread blocking. Historical data is already out there — blocking now only affects future crawls.

Recommendation: robots.txt is sufficient, but note the historical data caveat.

```
User-agent: CCBot
Disallow: /
```
### Bytespider (ByteDance) — Inconsistent ⚠️

Official stance: ByteDance states that Bytespider respects robots.txt.

Real-world reports: Multiple webmasters have reported continued crawling from Bytespider after adding Disallow rules. The pattern suggests:
- Some Bytespider variants honor robots.txt, others may not
- Compliance may be delayed — Bytespider may take longer to update its crawl schedule
- Different IP ranges may behave differently

Recommendation: Do NOT rely solely on robots.txt for Bytespider. Use server-level blocking.

```
# Nginx — add in addition to robots.txt
if ($http_user_agent ~* "Bytespider") {
    return 403;
}
```
### Google-Extended (Google) — Good

Official stance: Google created Google-Extended specifically as a separate user agent so webmasters could opt out of AI training without affecting search. Compliance is part of the design.

Real-world reports: Reliable. This is Google — robots.txt compliance is fundamental to their operations.

Recommendation: robots.txt is sufficient.

```
User-agent: Google-Extended
Disallow: /
```
### Applebot-Extended (Apple) — Good

Official stance: Apple explicitly states that Applebot-Extended honors robots.txt and can be controlled separately from Applebot.

Real-world reports: Consistent with this claim.

Recommendation: robots.txt is sufficient.

```
User-agent: Applebot-Extended
Disallow: /
```
### Meta-ExternalAgent (Meta) — Good

Official stance: Meta states compliance with robots.txt.

Real-world reports: Generally consistent, though Meta’s crawler is newer and has less of a track record than Google’s or OpenAI’s.

Recommendation: robots.txt is sufficient.

```
User-agent: Meta-ExternalAgent
Disallow: /
```
### PerplexityBot (Perplexity AI) — Mostly Good, with controversy
Official stance: Perplexity has committed to honoring robots.txt.
Real-world controversy: In 2024, reports surfaced that Perplexity’s real-time browsing feature (Perplexity-User) was sometimes visiting pages that had Disallow rules — not by violating robots.txt for the indexing bot, but by using a different user agent for real-time fetching.
Practical reality: PerplexityBot (the indexing crawler) is generally robots.txt compliant. The real-time fetching behavior has been disputed.
Recommendation: robots.txt works for the indexing crawler. Be aware that real-time queries may use different user agents.
## Summary Table
| Bot | Company | robots.txt Compliance | Server-level needed? |
|---|---|---|---|
| GPTBot | OpenAI | Good ✓ | Optional |
| ClaudeBot | Anthropic | Good ✓ | Optional |
| anthropic-ai | Anthropic | Good ✓ | Optional |
| CCBot | Common Crawl | Good ✓ | Optional |
| Bytespider | ByteDance | Inconsistent ⚠️ | Yes |
| Google-Extended | Google | Good ✓ | Optional |
| Applebot-Extended | Apple | Good ✓ | Optional |
| Meta-ExternalAgent | Meta | Good ✓ | Optional |
| PerplexityBot | Perplexity AI | Mostly Good | Optional |
## Why Bots Might Ignore robots.txt
Even for bots with good compliance records, there are scenarios where robots.txt might not be honored immediately or completely:
- Cached crawl schedules — bots precompute crawl schedules. A new robots.txt rule may take days to take effect as the cached schedule expires.
- Multiple crawler variants — large companies operate many crawler variants. A robots.txt rule targeting one user agent string may not block related variants.
- Race conditions — bots may fetch a page before checking robots.txt for that domain, especially for first-time visits.
- Ignoring by design — some bots (especially scrapers and bad actors) deliberately ignore robots.txt.
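The variant problem is easy to demonstrate with Python's stdlib `urllib.robotparser`: a rule naming one user agent does nothing for sibling agents from the same company. The example below uses OpenAI's published agent names (GPTBot, ChatGPT-User, OAI-SearchBot); the URL is illustrative:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks only OpenAI's training crawler
parser = RobotFileParser()
parser.parse(["User-agent: GPTBot", "Disallow: /"])

url = "https://example.com/post"
for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot"):
    # Agents with no matching rule (and no "*" fallback) are allowed through
    print(agent, parser.can_fetch(agent, url))
# GPTBot False
# ChatGPT-User True
# OAI-SearchBot True
```

To block a company completely, list every agent it operates, or add a `User-agent: *` fallback group.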
## When to Use Server-Level Blocking
For reliable blocking that doesn’t depend on bot compliance:
### Nginx

```
if ($http_user_agent ~* "Bytespider") {
    return 403;
}
```

### Apache

```
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]
```
### Cloudflare WAF
For maximum protection — blocks at the network edge before requests reach your server:
- Security → WAF → Custom Rules
- User Agent contains target bot name
- Action: Block
## Legal Landscape

robots.txt compliance is increasingly backed by legal precedent:
- hiQ v. LinkedIn (2022) — the litigation's final rulings found that scraping in breach of a site's Terms of Service can constitute breach of contract
- NYT v. OpenAI (2023) — copyright lawsuit touching on web scraping practices
- EU AI Act — being implemented with provisions around training data transparency
Companies operating AI crawlers that ignore robots.txt face growing legal risk, which creates additional pressure for compliance beyond just reputation.
## Practical Recommendations
- Always add robots.txt rules — it works for most crawlers and establishes clear intent
- Add server-level blocking for Bytespider — don’t rely on robots.txt alone
- Use Cloudflare for high-traffic sites — stops aggressive crawlers at the network edge
- Monitor your logs — verify that bots you’ve blocked aren’t still appearing
- Keep robots.txt updated — new AI crawlers launch regularly
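Monitoring can be as simple as counting AI-bot user agents in your access log. A minimal sketch for Combined Log Format lines, where the user agent is the last quoted field (the bot list mirrors this article; the sample log lines are made up):

```python
import re
from collections import Counter

# Substrings identifying the AI crawlers discussed above
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
           "Google-Extended", "Applebot-Extended",
           "Meta-ExternalAgent", "PerplexityBot"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI bot by matching the user-agent field."""
    hits = Counter()
    for line in log_lines:
        # Combined Log Format: the user agent is the last quoted field
        fields = re.findall(r'"([^"]*)"', line)
        if not fields:
            continue
        user_agent = fields[-1]
        for bot in AI_BOTS:
            if bot.lower() in user_agent.lower():
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [10/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2026:00:00:02 +0000] "GET /a HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '9.9.9.9 - - [10/Jan/2026:00:00:03 +0000] "GET /b HTTP/1.1" 200 256 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_ai_bot_hits(sample))  # Counter({'GPTBot': 1, 'Bytespider': 1})
```

If a bot you blocked in robots.txt still shows up with 200 responses days later, that is your signal to add server-level blocking for it.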
## Test Your Bot Access
Use our AI Bot Checker to see which AI crawlers can currently access your site, and verify your robots.txt is correctly blocking the bots you want to stop.
Related guides:
- How to Block All AI Crawlers in 2026 - Complete blocking templates
- Bytespider - The most inconsistently compliant crawler
- GPTBot - OpenAI’s compliant training crawler
- AI Training Bots vs AI Search Bots - Understanding the difference