What is Scrapy?

Scrapy is an open-source Python framework for building web scrapers. While it has legitimate uses, it’s also commonly used for unauthorized data harvesting, content theft, and competitive intelligence gathering.

How Scrapy Bots Work

Scrapy bots (a minimal spider sketch follows the list):

  • Send HTTP requests to fetch web pages
  • Parse HTML to extract specific data
  • Follow links to crawl multiple pages
  • Can rotate user agents and IPs
  • May use delays to avoid detection
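
For context, a minimal spider is only a few lines of Python. The domain and CSS selectors below are placeholders, not a working target:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalog"]  # placeholder start page

    def parse(self, response):
        # Parse the HTML and extract specific data
        for item in response.css("div.product"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css(".price::text").get(),
            }
        # Follow pagination links to crawl additional pages
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run with scrapy runspider, a spider like this sends requests, parses responses, and follows links with almost no extra code.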

Default User Agent

Out of the box, Scrapy identifies itself as:

Scrapy/VERSION (+https://scrapy.org)

Example:

Scrapy/2.8.0 (+https://scrapy.org)

However, most scrapers override this default user agent to masquerade as real browsers.
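
Overriding it takes one line in a project's settings.py; the browser string below is purely illustrative:

# settings.py -- replaces Scrapy's default user agent
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)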

Legitimate vs Malicious Use

Legitimate Uses:

  • Price monitoring for business intelligence
  • Research and data analysis
  • Testing your own websites
  • Authorized data collection with permission

Malicious Uses:

  • Stealing copyrighted content
  • Harvesting email addresses for spam
  • Scraping product data from competitors
  • Collecting user-generated content without permission
  • Bypassing API rate limits

How to Detect Scrapy Bots

1. User Agent Analysis

Check for:

  • Default Scrapy user agent
  • Suspicious or outdated browser strings
  • Missing or malformed headers

2. Behavior Patterns

Scrapy bots often (a timing-regularity sketch follows the list):

  • Request pages very rapidly
  • Don’t load images, CSS, or JavaScript
  • Follow links in unusual patterns
  • Never submit forms or interact
  • Make requests at perfectly regular intervals
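
The last trait is measurable: compute the spread of inter-request intervals per IP and flag near-zero variance. A sketch, with the threshold chosen arbitrarily:

import statistics

def looks_scripted(request_times, max_stdev=0.25):
    """request_times: sorted UNIX timestamps of one IP's recent requests."""
    if len(request_times) < 5:
        return False  # not enough data to judge
    gaps = [b - a for a, b in zip(request_times, request_times[1:])]
    # Human browsing produces irregular gaps; bots often use fixed delays
    return statistics.stdev(gaps) < max_stdev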

3. Technical Fingerprints

  • Missing browser headers (Accept-Language, DNT, etc.)
  • No JavaScript execution capability
  • Unusual request header order
  • TLS fingerprints that don't match any real browser

4. Rate Analysis

  • Abnormally high request rates from single IP
  • Requests outside normal user behavior
  • Perfect timing patterns (bots often use exact delays)

Protection Strategies

Basic Protection

1. Robots.txt

User-agent: *
Disallow: /api/
Disallow: /admin/
Crawl-delay: 10

Note: malicious scrapers usually ignore robots.txt; treat it as guidance for well-behaved crawlers, not as protection.

2. Rate Limiting

Limit requests per IP (a minimal counter sketch follows the list):

  • 100 requests per hour for unknown IPs
  • Stricter limits for suspicious patterns
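
A minimal fixed-window counter shows the idea; real deployments usually keep the counters in Redis or let a WAF handle it:

import time
from collections import defaultdict

WINDOW = 3600   # one hour, in seconds
LIMIT = 100     # requests per window for unknown IPs
hits = defaultdict(list)

def allow(ip):
    now = time.time()
    # Discard timestamps that have fallen out of the current window
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
    if len(hits[ip]) >= LIMIT:
        return False  # over the limit: reject or challenge
    hits[ip].append(now)
    return True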

3. CAPTCHA

Show a CAPTCHA when detecting (a simple scoring sketch follows the list):

  • High request rates
  • Suspicious user agents
  • Unusual behavior patterns
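
Those triggers can be combined into a simple score; the thresholds and weights below are illustrative only:

SUSPICIOUS_AGENTS = ("scrapy", "python-requests", "curl", "wget")

def needs_captcha(requests_per_minute, user_agent, regular_timing):
    score = 0
    if requests_per_minute > 60:
        score += 2  # high request rate
    ua = (user_agent or "").lower()
    if not ua or any(agent in ua for agent in SUSPICIOUS_AGENTS):
        score += 2  # missing or suspicious user agent
    if regular_timing:
        score += 1  # machine-like, perfectly regular request intervals
    return score >= 2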

Advanced Protection

1. JavaScript Challenges

Require JavaScript execution:

<script>
// Minimal sketch: set a token that later requests must carry.
// Scrapy can't execute JavaScript by default, so it never obtains the token.
document.cookie = "js_token=" + btoa(String(Date.now())) + "; path=/";
</script>
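
On the server side, requests to protected paths that arrive without the token can be rejected. A Flask-style sketch, where the cookie name matches the snippet above and the "/protected" prefix is an arbitrary placeholder:

from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def require_js_token():
    # Clients that never executed the page's JavaScript won't carry the cookie
    if request.path.startswith("/protected") and "js_token" not in request.cookies:
        abort(403)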

2. Honeypot Links

Add hidden links that humans won't see but bots will follow:

<a href="/trap" style="display:none">Trap</a>

Block IPs that access honeypot URLs.
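
A Flask-style sketch of that blocking logic; the path and in-memory set are placeholders:

from flask import Flask, request, abort

app = Flask(__name__)
blocked_ips = set()  # in-memory for illustration; persist this in production

@app.route("/trap")
def honeypot():
    # Only crawlers following the hidden link ever reach this URL
    blocked_ips.add(request.remote_addr)
    return "", 403

@app.before_request
def reject_blocked():
    if request.remote_addr in blocked_ips:
        abort(403)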

3. Request Header Validation

Check for complete and realistic headers (a completeness-check sketch follows the list):

  • Accept-Language
  • Accept-Encoding
  • Connection
  • Upgrade-Insecure-Requests
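
A rough completeness check, treated as one signal among several since some proxies and privacy tools strip headers too:

EXPECTED_HEADERS = ("Accept-Language", "Accept-Encoding", "Connection")

def headers_look_browser_like(headers):
    # Count how many typical browser headers are present
    present = sum(1 for name in EXPECTED_HEADERS if name in headers)
    return present >= 2

# In a Flask app, call it as headers_look_browser_like(request.headers)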

4. TLS Fingerprinting

Analyze the TLS handshake to distinguish automated HTTP clients from real browsers.

5. Behavioral Analysis

Track:

  • Mouse movements (bots don’t move mice)
  • Scroll patterns
  • Time on page
  • Click patterns

Blocking Scrapy Bots

Nginx Configuration

# Block by user agent (server or location context)
if ($http_user_agent ~* (scrapy|python|curl|wget)) {
    return 403;
}

# Rate limiting: declare the zone in the http block,
# then apply the limit in a server or location block
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;
limit_req zone=general burst=10 nodelay;

Application-Level

# Flask-style example: reject requests whose user agent matches a known scraping tool
from flask import Flask, request, abort

app = Flask(__name__)
BANNED_AGENTS = ('scrapy', 'python-requests', 'curl', 'wget')

@app.before_request
def block_scraper_agents():
    user_agent = request.headers.get('User-Agent', '').lower()
    if any(agent in user_agent for agent in BANNED_AGENTS):
        abort(403)

Legal Considerations

Before implementing aggressive anti-scraping measures:

  • Check your terms of service
  • Understand CFAA and relevant laws
  • Consider if data is publicly available
  • Evaluate business impact vs protection cost
  • Consult legal counsel for your jurisdiction

Cat and Mouse Game

Remember: Determined scrapers will:

  • Rotate user agents to mimic browsers
  • Use residential proxies
  • Add random delays
  • Solve CAPTCHAs (manually or via services)
  • Render JavaScript with headless browsers

Effective protection requires layered defenses and continuous monitoring.

Monitoring and Detection

Track these metrics (a log-analysis sketch follows the list):

  • Requests per IP over time
  • Failed CAPTCHA attempts
  • Honeypot hits
  • Unusual traffic patterns
  • 403/429 response rates
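
A quick offline pass over the access log covers the first metric. The sketch assumes the common "combined" log format, where the client IP is the first field:

from collections import Counter

def top_talkers(log_path, n=10):
    """Count requests per client IP in an access log."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.split(" ", 1)[0]] += 1
    return counts.most_common(n)

# e.g. top_talkers("/var/log/nginx/access.log") -> [("203.0.113.5", 4212), ...]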

Use tools like:

  • Web Application Firewalls (WAF)
  • Bot management platforms
  • Log analysis tools
  • Traffic anomaly detection

Scrapy itself is just a tool; focus on detecting scraper behavior patterns rather than only blocking specific user agents.


Test Bot Access to Your Site

Want to understand which bots can access your website? Use our free bot detection tools to test robots.txt rules and actual bot access across different crawler types.
