What is Scrapy?
Scrapy is an open-source Python framework for building web scrapers. While it has legitimate uses, it’s also commonly used for unauthorized data harvesting, content theft, and competitive intelligence gathering.
How Scrapy Bots Work
Scrapy bots (a minimal spider sketch follows this list):
- Send HTTP requests to fetch web pages
- Parse HTML to extract specific data
- Follow links to crawl multiple pages
- Can rotate user agents and IPs
- May use delays to avoid detection
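For context, here is roughly what such a bot looks like. A minimal spider sketch (the URL and CSS selectors are placeholders, not a real target):

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        # Parse HTML to extract specific data
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Follow pagination links to crawl multiple pages
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)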
Default User Agent
Out of the box, Scrapy identifies itself as:
Scrapy/VERSION (+https://scrapy.org)
Example:
Scrapy/2.8.0 (+https://scrapy.org)
However, most scrapers override the default user agent to masquerade as a real browser.
Legitimate vs Malicious Use
Legitimate Uses:
- Price monitoring for business intelligence
- Research and data analysis
- Testing your own websites
- Authorized data collection with permission
Malicious Uses:
- Stealing copyrighted content
- Harvesting email addresses for spam
- Scraping product data from competitors
- Collecting user-generated content without permission
- Bypassing API rate limits
How to Detect Scrapy Bots
1. User Agent Analysis
Check for the following (a simple filter sketch follows the list):
- Default Scrapy user agent
- Suspicious or outdated browser strings
- Missing or malformed headers
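As a first pass, a server-side filter can run these checks on every request. A minimal sketch (the marker list is illustrative, not exhaustive):

import re

# Substrings that commonly appear in automation user agents (illustrative)
AUTOMATION_MARKERS = re.compile(r"scrapy|python-requests|python-urllib|curl|wget", re.I)

def is_suspicious_user_agent(user_agent):
    if not user_agent:
        return True  # real browsers always send a User-Agent header
    return bool(AUTOMATION_MARKERS.search(user_agent))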
2. Behavior Patterns
Scrapy bots often (a detection sketch follows the list):
- Request pages very rapidly
- Don’t load images, CSS, or JavaScript
- Follow links in unusual patterns
- Never submit forms or interact
- Make requests at perfectly regular intervals
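The missing-assets signal is easy to check server-side. A sketch using in-memory counters (production code would use a shared store with expiry; the threshold is an arbitrary example):

from collections import defaultdict

ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".webp", ".woff2")

html_requests = defaultdict(int)
asset_requests = defaultdict(int)

def record_request(ip, path):
    if path.lower().endswith(ASSET_EXTENSIONS):
        asset_requests[ip] += 1
    else:
        html_requests[ip] += 1

def html_only_client(ip, min_pages=20):
    # Many page loads but zero assets is a strong scraper signal
    return html_requests[ip] >= min_pages and asset_requests[ip] == 0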
3. Technical Fingerprints
- Missing browser headers (Accept-Language, DNT, etc.)
- No JavaScript execution capability
- Unusual request header order
- TLS fingerprints that match automation libraries rather than real browsers
4. Rate Analysis
- Abnormally high request rates from a single IP
- Request patterns that fall outside normal user behavior (e.g., sustained 24/7 activity)
- Perfect timing patterns (bots often use exact delays; see the sketch below)
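The timing signal can be quantified as the spread of inter-request gaps. A sketch (the tolerance value is an arbitrary example):

import statistics

def perfectly_timed(timestamps, tolerance=0.05):
    # Humans produce irregular gaps; a loop with a fixed delay does not.
    # `tolerance` is the standard deviation, in seconds, below which the
    # spacing counts as suspiciously uniform.
    if len(timestamps) < 10:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.pstdev(gaps) < tolerance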
Protection Strategies
Basic Protection
1. Robots.txt
User-agent: *
Disallow: /api/
Disallow: /admin/
Crawl-delay: 10
Note: Crawl-delay is non-standard (Googlebot ignores it), and malicious scrapers usually ignore robots.txt entirely.
2. Rate Limiting
Limit requests per IP (a sliding-window sketch follows the list):
- 100 requests per hour for unknown IPs
- Stricter limits for suspicious patterns
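A sliding-window limiter covers both cases. An in-memory sketch (production setups usually lean on Redis or the web server's own limiter):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_REQUESTS = 100  # budget for unknown IPs, matching the example above

recent = defaultdict(deque)  # IP -> request timestamps inside the window

def allow_request(ip):
    now = time.monotonic()
    window = recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop timestamps that left the window
    if len(window) >= MAX_REQUESTS:
        return False  # over budget: respond with 429
    window.append(now)
    return True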
3. CAPTCHA
Show a CAPTCHA when detecting (a decision sketch follows the list):
- High request rates
- Suspicious user agents
- Unusual behavior patterns
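The trigger itself can be a small decision function over those signals. A sketch with arbitrary example thresholds:

def should_show_captcha(requests_last_minute, user_agent_suspicious, honeypot_hit):
    # Thresholds are illustrative; tune them against real traffic
    if honeypot_hit:
        return True
    return requests_last_minute > 60 or user_agent_suspicious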
Advanced Protection
1. JavaScript Challenges
Require JavaScript execution:
<script>
  // Scrapy can't execute JavaScript by default, so only a real browser
  // will run this and send the cookie on subsequent requests
  document.cookie = "js_token=" + Date.now().toString(36) + "; path=/";
</script>
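The server side then refuses requests that lack the token. A minimal Flask sketch, assuming the js_token cookie name from the snippet above (a plain timestamp token is trivially forged, so real deployments sign tokens server-side):

from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def require_js_token():
    # The landing page and static assets must stay exempt, otherwise
    # real browsers could never obtain the token in the first place
    if request.path == "/" or request.path.startswith("/static/"):
        return
    if "js_token" not in request.cookies:
        abort(403)  # no evidence of JavaScript execution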
2. Honeypot Links
Add hidden links that humans won't see but bots will follow:
<a href="/trap" style="display:none">Trap</a>
Block IPs that access honeypot URLs. Disallow the trap path in robots.txt as well, so legitimate crawlers that honor it aren't caught by accident.
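A sketch of the server side in Flask (in-memory set for illustration; persist blocks in production):

from flask import Flask, request, abort

app = Flask(__name__)
blocked_ips = set()

@app.before_request
def reject_blocked():
    if request.remote_addr in blocked_ips:
        abort(403)

@app.route("/trap")
def honeypot():
    # Only a bot following the hidden link ever reaches this route
    blocked_ips.add(request.remote_addr)
    abort(403)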
3. Request Header Validation
Check that requests carry complete and realistic headers (a sketch follows the list):
- Accept-Language
- Accept-Encoding
- Connection
- Upgrade-Insecure-Requests
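A sketch of such a check (EXPECTED_HEADERS is a conservative example set; real browsers send more):

# Headers virtually every real browser sends on page navigations
EXPECTED_HEADERS = ("Accept", "Accept-Language", "Accept-Encoding")

def headers_look_real(headers):
    # `headers` is any case-insensitive mapping, e.g. Flask's request.headers
    if any(h not in headers for h in EXPECTED_HEADERS):
        return False
    # Libraries like python-requests default to "Accept: */*",
    # unlike browser navigation requests
    return "text/html" in headers.get("Accept", "")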
4. TLS Fingerprinting
Analyze the TLS handshake (e.g., JA3 fingerprints) to distinguish automated tools from real browsers.
5. Behavioral Analysis
Track:
- Mouse movements (bots don’t move mice)
- Scroll patterns
- Time on page
- Click patterns
Blocking Scrapy Bots
Nginx Configuration
# In a server block: reject known automation user agents
if ($http_user_agent ~* "(scrapy|python|curl|wget)") {
    return 403;
}

# In the http block: per-IP rate limiting (30 requests/minute)
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;

# In a server or location block: apply the zone
limit_req zone=general burst=10 nodelay;
Application-Level
# Example: reject known automation user agents (assumes a Flask `app`)
from flask import request, abort

banned_agents = ['scrapy', 'python-requests', 'curl', 'wget']

@app.before_request
def block_banned_agents():
    user_agent = request.headers.get('User-Agent', '').lower()
    if any(agent in user_agent for agent in banned_agents):
        abort(403)
Legal Considerations
Before implementing aggressive anti-scraping measures:
- Check your terms of service
- Understand the CFAA (in the US) and other applicable laws
- Consider if data is publicly available
- Evaluate business impact vs protection cost
- Consult legal counsel for your jurisdiction
Cat and Mouse Game
Remember: Determined scrapers will:
- Rotate user agents to mimic browsers
- Use residential proxies
- Add random delays
- Solve CAPTCHAs (manually or via services)
- Render JavaScript with headless browsers
Effective protection requires layered defenses and continuous monitoring.
Monitoring and Detection
Track metrics:
- Requests per IP over time
- Failed CAPTCHA attempts
- Honeypot hits
- Unusual traffic patterns
- 403/429 response rates
Use tools like:
- Web Application Firewalls (WAF)
- Bot management platforms
- Log analysis tools (see the sketch below)
- Traffic anomaly detection
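Before adopting dedicated tooling, even a short script over the access log surfaces the basic metrics above. A sketch, assuming nginx's combined log format and default log path:

import re
from collections import Counter

# Matches the start of nginx's "combined" format: IP ... "REQUEST" STATUS
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})')

requests_per_ip, blocked_per_ip = Counter(), Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        match = LINE.match(line)
        if not match:
            continue
        ip, status = match.groups()
        requests_per_ip[ip] += 1
        if status in ("403", "429"):
            blocked_per_ip[ip] += 1

for ip, count in requests_per_ip.most_common(10):
    print(ip, count, "requests,", blocked_per_ip[ip], "blocked")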
Scrapy itself is just a tool. Focus on detecting scraper behavior patterns rather than just blocking specific user agents.
Test Bot Access to Your Site
Want to understand which bots can access your website? Use our free bot detection tools to test robots.txt rules and actual bot access across different crawler types:
- SEO Bot Checker - Test search engine crawlers
- SEO Tools Bot Checker - Verify SEO tool access
- AI Bot Checker - Scan for AI crawlers
Related Bot Topics:
- Understanding Bot Traffic - Learn to distinguish good bots from bad
- robots.txt Guide - Configure bot access properly