What is Scrapy?

Scrapy is an open-source Python framework for building web scrapers. While it has legitimate uses, it’s also commonly used for unauthorized data harvesting, content theft, and competitive intelligence gathering.

How Scrapy Bots Work

Scrapy bots (a minimal spider sketch follows the list):

  • Send HTTP requests to fetch web pages
  • Parse HTML to extract specific data
  • Follow links to crawl multiple pages
  • Can rotate user agents and IPs
  • May use delays to avoid detection
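
For context, a minimal spider is only a few lines of Python. The domain and CSS selectors below are placeholders, not a working target:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalog"]  # placeholder start page

    def parse(self, response):
        # Parse the HTML and extract specific data
        for item in response.css("div.product"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css(".price::text").get(),
            }
        # Follow pagination links to crawl additional pages
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run with scrapy runspider, a spider like this sends requests, parses responses, and follows links with almost no extra code.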

Default User Agent

Out of the box, Scrapy identifies itself as:

Scrapy/VERSION (+https://scrapy.org)

Example:

Scrapy/2.8.0 (+https://scrapy.org)

However, most scrapers override this default user agent to masquerade as real browsers.
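
Overriding it takes one line in a project's settings.py; the browser string below is purely illustrative:

# settings.py -- replaces Scrapy's default user agent
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)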

Legitimate vs Malicious Use

Legitimate Uses:

  • Price monitoring for business intelligence
  • Research and data analysis
  • Testing your own websites
  • Authorized data collection with permission

Malicious Uses:

  • Stealing copyrighted content
  • Harvesting email addresses for spam
  • Scraping product data from competitors
  • Collecting user-generated content without permission
  • Bypassing API rate limits

How to Detect Scrapy Bots

1. User Agent Analysis

Check for:

  • Default Scrapy user agent
  • Suspicious or outdated browser strings
  • Missing or malformed headers

2. Behavior Patterns

Scrapy bots often (a timing-regularity sketch follows the list):

  • Request pages very rapidly
  • Don’t load images, CSS, or JavaScript
  • Follow links in unusual patterns
  • Never submit forms or interact
  • Make requests at perfectly regular intervals
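
The last trait is measurable: compute the spread of inter-request intervals per IP and flag near-zero variance. A sketch, with the threshold chosen arbitrarily:

import statistics

def looks_scripted(request_times, max_stdev=0.25):
    """request_times: sorted UNIX timestamps of one IP's recent requests."""
    if len(request_times) < 5:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    # Human browsing produces irregular gaps; bots often use fixed delays
    return statistics.stdev(gaps) < max_stdev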

3. Technical Fingerprints

  • Missing browser headers (Accept-Language, DNT, etc.)
  • No JavaScript execution capability
  • Unusual request header order
  • TLS fingerprints that don't match any real browser

4. Rate Analysis

  • Abnormally high request rates from single IP
  • Requests outside normal user behavior
  • Perfect timing patterns (bots often use exact delays)

Protection Strategies

Basic Protection

1. Robots.txt

User-agent: *
Disallow: /api/
Disallow: /admin/
Crawl-delay: 10

Note: malicious scrapers usually ignore robots.txt; treat it as guidance for well-behaved crawlers, not as protection.

2. Rate Limiting

Limit requests per IP (a minimal counter sketch follows the list):

  • 100 requests per hour for unknown IPs
  • Stricter limits for suspicious patterns
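
A minimal fixed-window counter shows the idea; real deployments usually keep the counters in Redis or let a WAF handle it:

import time
from collections import defaultdict

WINDOW = 3600   # one hour, in seconds
LIMIT = 100     # requests per window for unknown IPs
hits = defaultdict(list)

def allow(ip):
    now = time.time()
    # Discard timestamps that have fallen out of the current window
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    if len(hits[ip]) >= LIMIT:
        return False  # over the limit: reject or challenge
    hits[ip].append(now)
    return True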

3. CAPTCHA

Show a CAPTCHA when detecting (a simple scoring sketch follows the list):

  • High request rates
  • Suspicious user agents
  • Unusual behavior patterns
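
Those triggers can be combined into a simple score; the thresholds and weights below are illustrative only:

SUSPICIOUS_AGENTS = ("scrapy", "python-requests", "curl", "wget")

def needs_captcha(requests_per_minute, user_agent, regular_timing):
    score = 0
    if requests_per_minute > 60:
        score += 2  # high request rate
    ua = (user_agent or "").lower()
    if not ua or any(agent in ua for agent in SUSPICIOUS_AGENTS):
        score += 2  # missing or suspicious user agent
    if regular_timing:
        score += 1  # machine-like, perfectly regular request intervals
    return score >= 2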

Advanced Protection

1. JavaScript Challenges

Require JavaScript execution:

<script>
// Minimal sketch: set a token that later requests must carry.
// Scrapy can't execute JavaScript by default, so it never obtains the token.
document.cookie = "js_token=" + btoa(String(Date.now())) + "; path=/";
</script>
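
On the server side, requests to protected paths that arrive without the token can be rejected. A Flask-style sketch, where the cookie name matches the snippet above and the "/protected" prefix is an arbitrary placeholder:

from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def require_js_token():
    # Clients that never executed the page's JavaScript won't carry the cookie
    if request.path.startswith("/protected") and "js_token" not in request.cookies:
        abort(403)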

2. Honeypot Links

Add hidden links that humans won't see but bots will follow:

<a href="/trap" style="display:none">Trap</a>

Block IPs that access honeypot URLs.
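
A Flask-style sketch of that blocking logic; the path and in-memory set are placeholders:

from flask import Flask, request, abort

app = Flask(__name__)
blocked_ips = set()  # in-memory for illustration; persist this in production

@app.route("/trap")
def honeypot():
    # Only crawlers following the hidden link ever reach this URL
    blocked_ips.add(request.remote_addr)
    return "", 403

@app.before_request
def reject_blocked():
    if request.remote_addr in blocked_ips:
        abort(403)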

3. Request Header Validation

Check for complete and realistic headers (a completeness-check sketch follows the list):

  • Accept-Language
  • Accept-Encoding
  • Connection
  • Upgrade-Insecure-Requests
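
A rough completeness check, treated as one signal among several since some proxies and privacy tools strip headers too:

EXPECTED_HEADERS = ("Accept-Language", "Accept-Encoding", "Connection")

def headers_look_browser_like(headers):
    # Count how many typical browser headers are present
    present = sum(1 for name in EXPECTED_HEADERS if name in headers)
    return present >= 2

# In a Flask app, call it as headers_look_browser_like(request.headers)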

4. TLS Fingerprinting

Analyze the TLS handshake to distinguish automated HTTP clients from real browsers.

5. Behavioral Analysis

Track:

  • Mouse movements (bots don’t move mice)
  • Scroll patterns
  • Time on page
  • Click patterns

Blocking Scrapy Bots

Nginx Configuration

# Block by user agent (server or location context)
if ($http_user_agent ~* (scrapy|python|curl|wget)) {
    return 403;
}

# Rate limiting: declare the zone in the http block,
# then apply the limit in a server or location block
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;
limit_req zone=general burst=10 nodelay;

Application-Level

# Flask-style example: reject requests whose user agent matches a known scraping tool
from flask import Flask, request, abort

app = Flask(__name__)
BANNED_AGENTS = ('scrapy', 'python-requests', 'curl', 'wget')

@app.before_request
def block_scraper_agents():
    user_agent = request.headers.get('User-Agent', '').lower()
    if any(agent in user_agent for agent in BANNED_AGENTS):
        abort(403)

Legal Considerations

Before implementing aggressive anti-scraping measures:

  • Check your terms of service
  • Understand CFAA and relevant laws
  • Consider if data is publicly available
  • Evaluate business impact vs protection cost
  • Consult legal counsel for your jurisdiction

Cat and Mouse Game

Remember: Determined scrapers will:

  • Rotate user agents to mimic browsers
  • Use residential proxies
  • Add random delays
  • Solve CAPTCHAs (manually or via services)
  • Render JavaScript with headless browsers

Effective protection requires layered defenses and continuous monitoring.

Monitoring and Detection

Track these metrics (a log-analysis sketch follows the list):

  • Requests per IP over time
  • Failed CAPTCHA attempts
  • Honeypot hits
  • Unusual traffic patterns
  • 403/429 response rates
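
A quick offline pass over the access log covers the first metric. The sketch assumes the common "combined" log format, where the client IP is the first field:

from collections import Counter

def top_talkers(log_path, n=10):
    """Count requests per client IP in an access log."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.split(" ", 1)[0]] += 1
    return counts.most_common(n)

# e.g. top_talkers("/var/log/nginx/access.log") -> [("203.0.113.5", 4212), ...]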

Use tools like:

  • Web Application Firewalls (WAF)
  • Bot management platforms
  • Log analysis tools
  • Traffic anomaly detection

Scrapy itself is just a tool; focus on detecting scraper behavior patterns rather than only blocking specific user agents.


Test Bot Access to Your Site

Want to understand which bots can access your website? Use our free bot detection tools to test robots.txt rules and actual bot access across different crawler types.
