The robots.txt file is a simple but powerful way to communicate with web crawlers and bots about which parts of your site they can access. It has been in use since 1994 and was formalized as an official internet standard (RFC 9309) in 2022.

Think of robots.txt as a “No Trespassing” sign for bots - well-behaved bots will respect it, but it’s not a security feature. For real protection, you need authentication and access controls.

What is robots.txt?

robots.txt is a plain text file that you place at the root of your website (https://yoursite.com/robots.txt). It tells web crawlers like Googlebot which pages they should and shouldn’t visit.

Important: robots.txt works on the honor system. Respectful bots follow it, but malicious bots ignore it completely.

Why Use robots.txt?

Good Reasons:

  • Save crawl budget: Direct search engines to your important pages
  • Prevent duplicate content: Block parameter URLs and filter pages
  • Hide low-value pages: Keep admin interfaces out of search results
  • Reduce server load: Limit aggressive crawlers

Bad Reasons (Won’t Work):

  • ❌ Hiding sensitive data (bots can still access it!)
  • ❌ Preventing indexing (pages can still appear in search)
  • ❌ Security (use authentication instead)

The Basics: How It Works

robots.txt contains groups of rules. Each group has:

  1. User-agent: Which bot the rules apply to
  2. Rules: What paths to allow or block

User-agent: Googlebot
Disallow: /private/
Allow: /public/

This tells Googlebot: “Don’t crawl /private/, but you can crawl /public/.”

Key Directives

User-agent

Specifies which bot the rules apply to. Not case-sensitive.

User-agent: Googlebot      # Google's bot
User-agent: *              # All bots

Multiple user-agents:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /expensive-to-crawl/

Both Googlebot and Bingbot will follow the same rules.

Disallow

Blocks access to a path.

Disallow: /admin/          # Block /admin/ and everything under it
Disallow: /secret.html     # Block specific file
Disallow: /                # Block everything

Allow

Explicitly allows access to a path. Useful to override broader blocks.

User-agent: *
Disallow: /folder/
Allow: /folder/public.html  # Exception: allow this one file

Important rule: If both Allow and Disallow match the same URL, the longest (most specific) rule wins. If they’re equally long, Allow wins.

Common Examples

1. Allow Everything (Default)

User-agent: *
Allow: /

This explicitly allows all bots to crawl everything.

Or even simpler - just don’t have a robots.txt file! No file means “everything is allowed.”

2. Block Everything

User-agent: *
Disallow: /

⚠️ Warning: This blocks all search engines. Only use for staging/development sites!

3. Block Specific Folders

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

4. Block Specific Bot

# Block one specific bot
User-agent: BadBot
Disallow: /

# Allow all others
User-agent: *
Allow: /

5. Mixed Rules with Exceptions

User-agent: *
Disallow: /members/        # Block members area
Allow: /members/public/    # Except public section

Since /members/public/ is more specific (longer path), it overrides the /members/ block.

Pattern Matching

Wildcards (*)

Matches zero or more characters:

# Block all PDFs
Disallow: /*.pdf$

# Block URLs with parameters
Disallow: /*?

# Block any nested "private" folder (e.g. /blog/private/)
Disallow: /*/private/*

End of Path ($)

Matches the end of the URL:

# Block .php files only (not /admin.php/folder/)
Disallow: /*.php$

# Block URLs ending with =
Disallow: /*=$

Examples:

# Block URL parameters
Disallow: /*?
Disallow: /*?*

# Block session IDs
Disallow: /*sessionid=
Disallow: /*PHPSESSID=

# Block sort/filter URLs
Disallow: /*?sort=
Disallow: /*?filter=

Real-World Examples

WordPress Site

User-agent: *
# Block WordPress admin
Disallow: /wp-admin/
Disallow: /wp-includes/

# Block plugins and themes
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

# Allow uploads (images)
Allow: /wp-content/uploads/

# Block URL parameters
Disallow: /*?

# Tell bots about sitemap
Sitemap: https://yourdomain.com/sitemap.xml

E-commerce Site

User-agent: *
# Block cart and checkout
Disallow: /cart
Disallow: /checkout
Disallow: /my-account

# Block search and filters
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=

# Block admin
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

SaaS Application

User-agent: *
# Allow public marketing pages
Allow: /
Allow: /blog/
Allow: /pricing/
Allow: /features/

# Block app and API
Disallow: /app/
Disallow: /dashboard/
Disallow: /api/

# Block SEO crawlers to save resources
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

Managing Specific Bots

Block All Except Search Engines

# Allow Google
User-agent: Googlebot
Allow: /

# Allow Bing
User-agent: Bingbot
Allow: /

# Block everything else
User-agent: *
Disallow: /

Managing SEO Tool Crawlers (Optional)

SEO tool crawlers like AhrefsBot and SemrushBot provide valuable backlink data but consume server resources.

Consider blocking if:

  • You have limited bandwidth or server resources
  • You want to keep your link building strategy private from competitors
  • You don’t use these SEO tools yourself
  • You’ve experienced performance issues from aggressive crawling

Consider allowing if:

  • You use these tools for your own SEO research
  • You want your backlinks visible in industry tools
  • You have sufficient server capacity
  • You value comprehensive SEO data visibility

Example - Block SEO crawlers:

User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /

# Allow search engines
User-agent: *
Allow: /

Block AI Training Bots

# Block OpenAI
User-agent: GPTBot
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

# Block Anthropic
User-agent: Claude-Web
Disallow: /

Sitemaps

While not officially part of RFC 9309, all major search engines support the Sitemap directive:

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-news.xml

You can list multiple sitemaps. This helps search engines find your content faster.
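
For reference, the file a Sitemap line points to is a plain XML document in the Sitemaps protocol format. A minimal example (the URLs and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/</loc>
    <lastmod>2025-11-18</lastmod>
  </url>
  <url>
    <loc>https://yourdomain.com/blog/</loc>
  </url>
</urlset>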

How Bots Process robots.txt

Here’s what happens when a bot visits your site:

  1. Fetch robots.txt first: https://yoursite.com/robots.txt
  2. Find matching rules: Look for rules for their user-agent
  3. Apply longest match: If multiple rules match a URL, use the most specific one
  4. Check Allow vs Disallow: If both match equally, Allow wins
  5. Crawl or skip: Follow the rules (if they’re well-behaved)

Important Matching Rules:

Longest match wins:

User-agent: *
Disallow: /folder/
Allow: /folder/exception/file.html

For /folder/exception/file.html, the Allow rule wins because it’s more specific (longer).

Allow beats Disallow when equal length:

Allow: /page
Disallow: /page

Both match /page equally, so Allow wins.
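
If you want to sanity-check these precedence rules, here is a minimal Node.js sketch of “longest rule wins, Allow wins ties” using plain path prefixes - wildcards, multiple groups, and the other details of a real RFC 9309 parser are deliberately omitted:

// Minimal precedence check: the longest matching rule wins, Allow wins ties.
// Rules are plain path prefixes; wildcard (*) and end-of-URL ($) support is omitted.
function isAllowed(rules, path) {
    let best = null;
    for (const rule of rules) {
        if (rule.path && path.startsWith(rule.path)) {
            if (best === null ||
                rule.path.length > best.path.length ||
                (rule.path.length === best.path.length && rule.type === 'allow')) {
                best = rule;
            }
        }
    }
    return best === null || best.type === 'allow'; // no matching rule means allowed
}

const rules = [
    { type: 'disallow', path: '/folder/' },
    { type: 'allow', path: '/folder/exception/file.html' },
];

console.log(isAllowed(rules, '/folder/exception/file.html')); // true  (the longer Allow wins)
console.log(isAllowed(rules, '/folder/other.html'));          // false (only the Disallow matches)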

Common Mistakes to Avoid

1. Blocking CSS and JavaScript

❌ Don’t do this:

Disallow: /css/
Disallow: /js/

Search engines need CSS and JS to render your pages properly. Blocking these can hurt your SEO.
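
If your stylesheets and scripts live inside a folder you do want to block for other reasons, one option is to carve them out with more specific Allow rules (the /assets/ path here is hypothetical):

User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$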

2. Using robots.txt for Security

❌ Wrong:

Disallow: /secret-admin-panel/

This doesn’t hide anything! The URL is now public in robots.txt, and bots can still access it. Use real authentication instead.

3. Using robots.txt to Prevent Indexing

❌ Wrong approach:

Disallow: /page-i-dont-want-indexed/

This tells bots “don’t crawl it” but pages can still appear in search results if other sites link to them.

✅ Correct approach:

Use a meta tag in the page itself:

<meta name="robots" content="noindex, nofollow">
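
Note that for the noindex tag to work, the page must not also be blocked in robots.txt - a bot has to crawl the page to see the tag. For non-HTML files (PDFs, images) that can’t carry a meta tag, the equivalent is the X-Robots-Tag HTTP response header (how you set it depends on your server):

X-Robots-Tag: noindex, nofollow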

4. Forgetting the Slash

❌ Blocks more than intended:

Disallow: /admin

This blocks /admin, /administrator, /admin-panel/, etc.

✅ Be specific:

Disallow: /admin/

5. Case Sensitivity

Paths are case-sensitive!

Disallow: /Admin/      # Doesn't block /admin/
Disallow: /admin/      # Correct

Testing Your robots.txt

1. Check It Works

Visit: https://yoursite.com/robots.txt

You should see:

  • Plain text (not HTML)
  • HTTP 200 status code
  • Content-Type: text/plain
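
A quick way to check all three at once is a small script; here is a sketch using Node 18+’s built-in fetch (replace the domain with your own):

// Fetch robots.txt and report what a bot would see.
(async () => {
    const res = await fetch('https://yoursite.com/robots.txt'); // your domain here
    console.log('Status:', res.status);                              // expect 200
    console.log('Content-Type:', res.headers.get('content-type'));   // expect text/plain
    console.log((await res.text()).split('\n').slice(0, 5).join('\n')); // first few rules
})();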

2. Use Search Console

Google Search Console:

  • URL Inspection Tool
  • Test specific URLs against your robots.txt

Bing Webmaster Tools:

  • Similar testing functionality

3. Online Validators

Many free tools can validate your robots.txt syntax.

Status Codes and Error Handling

What bots do when they can’t fetch robots.txt:

Status Code          What Bots Do
200 (Success)        Follow the rules
404 (Not Found)      Assume everything is allowed
403 (Forbidden)      Assume everything is allowed
500 (Server Error)   Assume everything is blocked
Timeout              Assume everything is blocked

Key point: 404 and 403 mean “allow everything”, but 500+ errors mean “block everything” (for safety).

After a server error, bots should retry. If it’s still broken after 30 days, they may treat it as “allow all.”
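
In code, a polite crawler’s decision roughly mirrors that table; here is a sketch (not any particular crawler’s real logic):

// Map the robots.txt fetch result to a crawl policy, mirroring the table above.
async function robotsPolicy(origin) {
    try {
        const res = await fetch(origin + '/robots.txt');
        if (res.ok) return { rules: await res.text() };       // 200: parse and follow the rules
        if (res.status >= 500) return { disallowAll: true };  // 5xx: assume everything is blocked
        return { allowAll: true };                            // 403/404 and other 4xx: no restrictions
    } catch (err) {
        return { disallowAll: true };                         // timeout or network error: play it safe
    }
}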

Best Practices

1. Keep It Simple

Start with basic rules and add more only when needed.

User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml

2. Use Comments

# Block admin area (added 2025-11-18)
User-agent: *
Disallow: /admin/

# Block aggressive crawlers (bandwidth concerns)
User-agent: AhrefsBot
Disallow: /

3. Always Include Sitemap

Sitemap: https://yourdomain.com/sitemap.xml

This helps search engines find your content faster.

4. Be Specific with Paths

# Vague - blocks too much
Disallow: /temp

# Better - specific folders
Disallow: /temp/
Disallow: /temporary/
Disallow: /_temp/

5. Review Regularly

  • Check quarterly for outdated rules
  • Remove blocks for deleted sections
  • Add new protections as site grows

6. Monitor Compliance

Check server logs to see if bots respect your rules:

# Check for bots accessing blocked areas
grep "Googlebot" access.log | grep "/admin/"

What robots.txt Can’t Do

Can’t:

  • ❌ Provide security (use authentication)
  • ❌ Remove pages from search (use noindex meta tags)
  • ❌ Force bots to obey (it’s voluntary)
  • ❌ Block determined scrapers (they ignore it)
  • ❌ Hide sensitive info (everything in robots.txt is public)

Can:

  • ✅ Guide well-behaved bots
  • ✅ Optimize crawl budget
  • ✅ Reduce server load
  • ✅ Prevent duplicate content in search

Advanced: Dynamic robots.txt

For complex sites, generate robots.txt dynamically:

PHP Example

<?php
header("Content-Type: text/plain");

echo "User-agent: *\n";
echo "Disallow: /admin/\n\n";

// Block on staging
if ($_SERVER['HTTP_HOST'] === 'staging.example.com') {
    echo "Disallow: /\n";
}

echo "Sitemap: https://" . $_SERVER['HTTP_HOST'] . "/sitemap.xml\n";

Node.js Example

app.get('/robots.txt', (req, res) => {
    res.type('text/plain');

    let content = 'User-agent: *\n';

    if (process.env.NODE_ENV === 'production') {
        content += 'Disallow: /admin/\n';
    } else {
        content += 'Disallow: /\n';
    }

    content += `\nSitemap: ${req.protocol}://${req.hostname}/sitemap.xml\n`;

    res.send(content);
});

Caching

Bots cache your robots.txt file to avoid refetching it constantly.

Standard behavior:

  • Bots should refetch at least every 24 hours
  • Use HTTP Cache-Control headers to suggest caching

Cache-Control: public, max-age=3600

This says “you can cache this for 1 hour.”
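
If you generate robots.txt dynamically (as in the Node.js example earlier), you can set the header in the same route; with Express, for example:

app.get('/robots.txt', (req, res) => {
    res.type('text/plain');
    res.set('Cache-Control', 'public, max-age=3600'); // allow caching for 1 hour
    res.send('User-agent: *\nDisallow: /admin/\n');
});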

Quick Reference

Directive    Purpose                 Example
User-agent   Specify bot             User-agent: Googlebot
Disallow     Block path              Disallow: /admin/
Allow        Allow path (override)   Allow: /public/
Sitemap      Declare sitemap         Sitemap: https://site.com/sitemap.xml
#            Comment                 # This is a comment
*            Wildcard                Disallow: /*.pdf$
$            End of URL              Disallow: /page$

Complete Example

# robots.txt for example.com
# Updated: 2025-11-18

# Allow all major search engines
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Bingbot
Allow: /

# Rules for everyone else
User-agent: *

# Core blocks
Disallow: /admin/
Disallow: /api/
Disallow: /private/

# Block URL parameters (duplicate content)
Disallow: /*?
Disallow: /*?*

# Allow certain exceptions
Allow: /api/public/

# Block SEO crawlers (save bandwidth)
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /

# Block AI training bots
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Conclusion

robots.txt is a powerful tool for guiding well-behaved bots, but remember:

  • It’s not security - use authentication for sensitive areas
  • It’s voluntary - bad bots will ignore it
  • It’s simple - plain text, easy to understand
  • It’s effective - for managing search engine crawling

Keep your robots.txt file simple, test it regularly, and combine it with other SEO and security tools for best results.

The key is understanding that robots.txt is a communication tool, not a lock. It tells respectful bots what you prefer, but it’s up to them to follow it.


Test Your robots.txt Configuration

After configuring robots.txt, verify which bots can actually access your website with our free bot testing tools.

Each tool compares your robots.txt rules against actual bot access to help you identify configuration issues.
