The robots.txt file is a simple but powerful way to communicate with web crawlers and bots about which parts of your site they can access. It has been in use since 1994 and was formalized as an official internet standard (RFC 9309) in 2022.
Think of robots.txt as a “No Trespassing” sign for bots - well-behaved bots will respect it, but it’s not a security feature. For real protection, you need authentication and access controls.
What is robots.txt?
robots.txt is a plain text file that you place at the root of your website (https://yoursite.com/robots.txt). It tells web crawlers like Googlebot which pages they should and shouldn’t visit.
Important: robots.txt works on the honor system. Respectful bots follow it, but malicious bots ignore it completely.
Why Use robots.txt?
Good Reasons:
- Save crawl budget: Direct search engines to your important pages
- Prevent duplicate content: Block parameter URLs and filter pages
- Hide low-value pages: Keep admin interfaces out of search results
- Reduce server load: Limit aggressive crawlers
Bad Reasons (Won’t Work):
- ❌ Hiding sensitive data (bots can still access it!)
- ❌ Preventing indexing (pages can still appear in search)
- ❌ Security (use authentication instead)
The Basics: How It Works
robots.txt contains groups of rules. Each group has:
- User-agent: Which bot the rules apply to
- Rules: What paths to allow or block
User-agent: Googlebot
Disallow: /private/
Allow: /public/
This tells Googlebot: “Don’t crawl /private/ but you can crawl /public/”
Key Directives
User-agent
Specifies which bot the rules apply to. Not case-sensitive.
User-agent: Googlebot # Google's bot
User-agent: * # All bots
Multiple user-agents:
User-agent: Googlebot
User-agent: Bingbot
Disallow: /expensive-to-crawl/
Both Googlebot and Bingbot will follow the same rules.
Disallow
Blocks access to a path.
Disallow: /admin/ # Block /admin/ and everything under it
Disallow: /secret.html # Block specific file
Disallow: / # Block everything
Allow
Explicitly allows access to a path. Useful to override broader blocks.
User-agent: *
Disallow: /folder/
Allow: /folder/public.html # Exception: allow this one file
Important rule: If both Allow and Disallow match the same URL, the longest (most specific) rule wins. If they’re equally long, Allow wins.
Common Examples
1. Allow Everything (Default)
User-agent: *
Allow: /
This explicitly allows all bots to crawl everything.
Or even simpler - just don’t have a robots.txt file! No file means “everything is allowed.”
2. Block Everything
User-agent: *
Disallow: /
⚠️ Warning: This blocks all search engines. Only use for staging/development sites!
3. Block Specific Folders
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
4. Block Specific Bot
# Block one specific bot
User-agent: BadBot
Disallow: /
# Allow all others
User-agent: *
Allow: /
5. Mixed Rules with Exceptions
User-agent: *
Disallow: /members/ # Block members area
Allow: /members/public/ # Except public section
Since /members/public/ is more specific (longer path), it overrides the /members/ block.
Pattern Matching
Wildcards (*)
Matches zero or more characters:
# Block all PDFs
Disallow: /*.pdf$
# Block URLs with parameters
Disallow: /*?
# Block all paths containing "private"
Disallow: /*/private/*
End of Path ($)
Matches the end of the URL:
# Block .php files only (not /admin.php/folder/)
Disallow: /*.php$
# Block URLs ending with =
Disallow: /*=$
Examples:
# Block URL parameters
Disallow: /*?
Disallow: /*?*
# Block session IDs
Disallow: /*sessionid=
Disallow: /*PHPSESSID=
# Block sort/filter URLs
Disallow: /*?sort=
Disallow: /*?filter=
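If you want to test a pattern outside of a crawler, the wildcard semantics are easy to reproduce. Here is a minimal Node.js sketch (patternToRegExp is a hypothetical helper written for this guide, not a standard API) that turns a robots.txt path pattern into a regular expression:
// Convert a robots.txt path pattern (* and $ supported) into a RegExp.
// Hypothetical helper for illustration; real crawlers have their own matchers.
function patternToRegExp(pattern) {
  // A trailing $ means "end of URL"; remember it, then escape the rest.
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, except * which we expand below
  let source = body.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  source = source.replace(/\*/g, '.*'); // * matches zero or more characters
  return new RegExp('^' + source + (anchored ? '$' : ''));
}
console.log(patternToRegExp('/*.pdf$').test('/files/report.pdf'));     // true
console.log(patternToRegExp('/*.pdf$').test('/files/report.pdf?v=2')); // false
console.log(patternToRegExp('/*?').test('/products?sort=price'));      // true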
Real-World Examples
WordPress Site
User-agent: *
# Block WordPress admin
Disallow: /wp-admin/
Disallow: /wp-includes/
# Block plugins and themes
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
# Allow uploads (images)
Allow: /wp-content/uploads/
# Block URL parameters
Disallow: /*?
# Tell bots about sitemap
Sitemap: https://yourdomain.com/sitemap.xml
E-commerce Site
User-agent: *
# Block cart and checkout
Disallow: /cart
Disallow: /checkout
Disallow: /my-account
# Block search and filters
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Block admin
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml
SaaS Application
User-agent: *
# Allow public marketing pages
Allow: /
Allow: /blog/
Allow: /pricing/
Allow: /features/
# Block app and API
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/
# Block SEO crawlers to save resources
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
Disallow: /
Sitemap: https://yourdomain.com/sitemap.xml
Managing Specific Bots
Block All Except Search Engines
# Allow Google
User-agent: Googlebot
Allow: /
# Allow Bing
User-agent: Bingbot
Allow: /
# Block everything else
User-agent: *
Disallow: /
Managing SEO Tool Crawlers (Optional)
SEO tool crawlers like AhrefsBot and SemrushBot provide valuable backlink data but consume server resources.
Consider blocking if:
- You have limited bandwidth or server resources
- You want to keep your link building strategy private from competitors
- You don’t use these SEO tools yourself
- You’ve experienced performance issues from aggressive crawling
Consider allowing if:
- You use these tools for your own SEO research
- You want your backlinks visible in industry tools
- You have sufficient server capacity
- You value comprehensive SEO data visibility
Example - Block SEO crawlers:
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /
# Allow search engines
User-agent: *
Allow: /
Block AI Training Bots
# Block OpenAI
User-agent: GPTBot
Disallow: /
# Block Common Crawl
User-agent: CCBot
Disallow: /
# Block Anthropic
User-agent: ClaudeBot
Disallow: /
Sitemaps
While not officially part of RFC 9309, all major search engines support the Sitemap directive:
User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-news.xml
You can list multiple sitemaps. This helps search engines find your content faster.
How Bots Process robots.txt
Here’s what happens when a bot visits your site:
- Fetch robots.txt first: the bot requests https://yoursite.com/robots.txt
- Find matching rules: it looks for the group that matches its user-agent
- Apply longest match: If multiple rules match a URL, use the most specific one
- Check Allow vs Disallow: If both match equally, Allow wins
- Crawl or skip: Follow the rules (if they’re well-behaved)
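To make the "find matching rules" step concrete, here is a minimal Node.js sketch of selecting a group for a bot (the data shape and the pickGroup helper are simplified assumptions for this guide; real parsers match product tokens more carefully):
// Simplified: each group lists the user-agents it targets plus its rules.
const groups = [
  { agents: ['googlebot'], rules: [{ type: 'disallow', path: '/private/' }] },
  { agents: ['*'],         rules: [{ type: 'disallow', path: '/tmp/' }] },
];
// User-agent names are matched case-insensitively; * is the fallback group.
function pickGroup(groups, botName) {
  const name = botName.toLowerCase();
  return groups.find((g) => g.agents.includes(name))
      || groups.find((g) => g.agents.includes('*'))
      || null;
}
console.log(pickGroup(groups, 'Googlebot').rules); // the Googlebot rules
console.log(pickGroup(groups, 'Bingbot').rules);   // falls back to the * rules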
Important Matching Rules:
Longest match wins:
User-agent: *
Disallow: /folder/
Allow: /folder/exception/file.html
For /folder/exception/file.html, the Allow rule wins because it’s more specific (longer).
Allow beats Disallow when equal length:
Allow: /page
Disallow: /page
Both match /page equally, so Allow wins.
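The same precedence is straightforward to implement. A minimal Node.js sketch (isAllowed is a hypothetical helper; wildcards are ignored here for brevity, so rules are treated as plain path prefixes):
// Longest matching rule wins; on a tie, Allow wins. No match means allowed.
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = best === null || rule.path.length > best.path.length;
    const tieButAllow = best !== null && rule.path.length === best.path.length && rule.type === 'allow';
    if (longer || tieButAllow) best = rule;
  }
  return best === null || best.type === 'allow';
}
const rules = [
  { type: 'disallow', path: '/folder/' },
  { type: 'allow',    path: '/folder/exception/file.html' },
];
console.log(isAllowed(rules, '/folder/exception/file.html')); // true  (longer Allow wins)
console.log(isAllowed(rules, '/folder/other.html'));          // false (only Disallow matches)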
Common Mistakes to Avoid
1. Blocking CSS and JavaScript
❌ Don’t do this:
Disallow: /css/
Disallow: /js/
Search engines need CSS and JS to render your pages properly. Blocking these can hurt your SEO.
2. Using robots.txt for Security
❌ Wrong:
Disallow: /secret-admin-panel/
This doesn’t hide anything! The URL is now public in robots.txt, and bots can still access it. Use real authentication instead.
3. Trying to Remove Pages from Search
❌ Wrong approach:
Disallow: /page-i-dont-want-indexed/
This tells bots “don’t crawl it” but pages can still appear in search results if other sites link to them.
✅ Correct approach:
Use a meta tag in the page itself:
<meta name="robots" content="noindex, nofollow">
Note: for noindex to work, the page must stay crawlable; if robots.txt blocks it, bots will never see the tag.
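The same signal can also be sent as the X-Robots-Tag HTTP response header, which is handy for non-HTML files such as PDFs where you can't add a meta tag. A minimal Express sketch, assuming a hypothetical /private-reports/ path:
const express = require('express');
const app = express();
// Mark everything under /private-reports/ (hypothetical path) as noindex
app.use('/private-reports', (req, res, next) => {
  res.set('X-Robots-Tag', 'noindex, nofollow');
  next();
});
app.use('/private-reports', express.static('private-reports'));
app.listen(3000);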
4. Forgetting the Slash
❌ Blocks more than intended:
Disallow: /admin
This blocks /admin, /administrator, /admin-panel/, etc.
✅ Be specific:
Disallow: /admin/
5. Case Sensitivity
Paths are case-sensitive!
Disallow: /Admin/ # Doesn't block /admin/
Disallow: /admin/ # Correct
Testing Your robots.txt
1. Check It Works
Visit: https://yoursite.com/robots.txt
You should see the following (a quick check script follows this list):
- Plain text (not HTML)
- HTTP 200 status code
- Content-Type: text/plain
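These checks are easy to automate. A minimal Node.js sketch (Node 18+ for the built-in fetch; replace the URL with your own domain):
// check-robots.mjs - run with: node check-robots.mjs
const url = 'https://yoursite.com/robots.txt'; // replace with your own domain
const res = await fetch(url);
console.log('Status:', res.status);                            // expect 200
console.log('Content-Type:', res.headers.get('content-type')); // expect text/plain
const body = await res.text();
console.log('Plain text (not HTML):', !body.trimStart().startsWith('<'));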
2. Use Search Console
Google Search Console:
- URL Inspection Tool
- Test specific URLs against your robots.txt
Bing Webmaster Tools:
- Similar testing functionality
3. Online Validators
Many free tools can validate your robots.txt syntax.
Status Codes and Error Handling
What bots do when they can’t fetch robots.txt:
| Status Code | What Bots Do |
|---|---|
| 200 (Success) | Follow the rules |
| 404 (Not Found) | Assume everything is allowed |
| 403 (Forbidden) | Assume everything is allowed |
| 500 (Server Error) | Assume everything is blocked |
| Timeout | Assume everything is blocked |
Key point: 404 and 403 mean “allow everything”, but 500+ errors mean “block everything” (for safety).
After a server error, bots should retry. If it’s still broken after 30 days, they may treat it as “allow all.”
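Expressed as code, the decision table looks like this. A minimal Node.js sketch (crawlPolicyFor is an illustrative helper, not taken from any real crawler; a fetch timeout would be handled by the caller as "block-all" too):
// Map the robots.txt fetch result to a crawl policy, per the table above.
function crawlPolicyFor(status) {
  if (status >= 200 && status < 300) return 'use-rules'; // parse and follow the file
  if (status >= 400 && status < 500) return 'allow-all'; // 404/403: no restrictions
  return 'block-all';                                    // 5xx server errors: stay away for now
}
console.log(crawlPolicyFor(200)); // 'use-rules'
console.log(crawlPolicyFor(404)); // 'allow-all'
console.log(crawlPolicyFor(503)); // 'block-all'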
Best Practices
1. Keep It Simple
Start with basic rules and add more only when needed.
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
2. Use Comments
# Block admin area (added 2025-11-18)
User-agent: *
Disallow: /admin/
# Block aggressive crawlers (bandwidth concerns)
User-agent: AhrefsBot
Disallow: /
3. Always Include Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
This helps search engines find your content faster.
4. Be Specific with Paths
# Vague - blocks too much
Disallow: /temp
# Better - specific folders
Disallow: /temp/
Disallow: /temporary/
Disallow: /_temp/
5. Review Regularly
- Check quarterly for outdated rules
- Remove blocks for deleted sections
- Add new protections as site grows
6. Monitor Compliance
Check server logs to see if bots respect your rules:
# Check for bots accessing blocked areas
grep "Googlebot" access.log | grep "/admin/"
What robots.txt Can’t Do
Can’t:
- ❌ Provide security (use authentication)
- ❌ Remove pages from search (use noindex meta tags)
- ❌ Force bots to obey (it’s voluntary)
- ❌ Block determined scrapers (they ignore it)
- ❌ Hide sensitive info (everything in robots.txt is public)
Can:
- ✅ Guide well-behaved bots
- ✅ Optimize crawl budget
- ✅ Reduce server load
- ✅ Prevent duplicate content in search
Advanced: Dynamic robots.txt
For complex sites, generate robots.txt dynamically:
PHP Example
<?php
header("Content-Type: text/plain");

echo "User-agent: *\n";
echo "Disallow: /admin/\n";

// Also block everything on the staging host
if ($_SERVER['HTTP_HOST'] === 'staging.example.com') {
    echo "Disallow: /\n";
}

echo "\nSitemap: https://" . $_SERVER['HTTP_HOST'] . "/sitemap.xml\n";
Node.js Example
app.get('/robots.txt', (req, res) => {
  res.type('text/plain');
  let content = 'User-agent: *\n';
  if (process.env.NODE_ENV === 'production') {
    content += 'Disallow: /admin/\n'; // production: block only the admin area
  } else {
    content += 'Disallow: /\n';       // staging/dev: block everything
  }
  content += `\nSitemap: ${req.protocol}://${req.hostname}/sitemap.xml\n`;
  res.send(content);
});
Caching
Bots cache your robots.txt file to avoid refetching it constantly.
Standard behavior:
- Bots should refetch at least every 24 hours
- Use HTTP Cache-Control headers to suggest caching
Cache-Control: public, max-age=3600
This says “you can cache this for 1 hour.”
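If you generate robots.txt dynamically, as in the Node.js example above, you can add the header on the response. A minimal sketch reusing that Express route:
app.get('/robots.txt', (req, res) => {
  res.type('text/plain');
  res.set('Cache-Control', 'public, max-age=3600'); // allow bots to cache for 1 hour
  res.send('User-agent: *\nDisallow: /admin/\n');
});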
Quick Reference
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specify bot | User-agent: Googlebot |
| Disallow | Block path | Disallow: /admin/ |
| Allow | Allow path (override) | Allow: /public/ |
| Sitemap | Declare sitemap | Sitemap: https://site.com/sitemap.xml |
| # | Comment | # This is a comment |
| * | Wildcard | Disallow: /*.pdf$ |
| $ | End of URL | Disallow: /page$ |
Complete Example
# robots.txt for example.com
# Updated: 2025-11-18
# Allow all major search engines
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Bingbot
Allow: /
# Block for everyone else
User-agent: *
# Core blocks
Disallow: /admin/
Disallow: /api/
Disallow: /private/
# Block URL parameters (duplicate content)
Disallow: /*?
Disallow: /*?*
# Allow certain exceptions
Allow: /api/public/
# Block SEO crawlers (save bandwidth)
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /
# Block AI training bots
User-agent: GPTBot
User-agent: CCBot
Disallow: /
# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Conclusion
robots.txt is a powerful tool for guiding well-behaved bots, but remember:
- It’s not security - use authentication for sensitive areas
- It’s voluntary - bad bots will ignore it
- It’s simple - plain text, easy to understand
- It’s effective - for managing search engine crawling
Keep your robots.txt file simple, test it regularly, and combine it with other SEO and security tools for best results.
The key is understanding that robots.txt is a communication tool, not a lock. It tells respectful bots what you prefer, but it’s up to them to follow it.
Test Your robots.txt Configuration
After configuring robots.txt, verify which bots can actually access your website with our free bot testing tools:
- SEO Bot Checker — Test Googlebot, Bingbot, and 4 search engines
- AI Bot Checker — Scan 28 AI crawlers (GPTBot, ClaudeBot, CCBot)
- Social Bot Checker — Test Facebook, Twitter, LinkedIn previews
- SEO Tools Bot Checker — Verify Ahrefs, SEMrush, Moz, Majestic
Each tool compares your robots.txt rules against actual bot access to help you identify configuration issues.
Related Guides:
- Understanding Bot Traffic — Learn about different bot types
- SEO Bots Guide — Optimize for search engine crawlers
Bot Configuration Examples:
- Block AhrefsBot — Ahrefs crawler management
- Block SemrushBot — SEMrush crawler control
- Block MJ12bot — Majestic crawler configuration
- Block AI Scrapers — GPTBot, ClaudeBot, and more