| name | scrapy |
| description | Build web crawlers and spiders using the Scrapy framework. Use this skill when building large-scale web crawlers, following links across multiple pages, handling request throttling, or creating production scraping pipelines. NOT needed for parsing single HTML files or processing already-fetched content. |
# Scrapy

Framework for building large-scale web crawlers.
## Installation

```bash
pip install scrapy
```
## Quick Start

Create a new Scrapy project:

```bash
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
```
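The generated project has roughly this layout (details can vary slightly by Scrapy version):

```
myproject/
├── scrapy.cfg            # deploy/run configuration
└── myproject/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── example.py    # created by genspider
```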
## Basic Spider

```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
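To crawl a listing and then visit each product's detail page, pass a different callback to `response.follow`. A minimal sketch, where the `div.product a`, `h1`, and `.description` selectors are assumptions about the target site's markup:

```python
import scrapy


class ProductDetailSpider(scrapy.Spider):
    name = 'product_details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Follow each product link to its detail page
        for href in response.css('div.product a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_product)

        # Keep paginating through the listing
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        # Detail pages usually expose richer data than listings
        yield {
            'name': response.css('h1::text').get(),
            'description': response.css('.description::text').get(),
            'url': response.url,
        }
```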
## Selectors

```python
# CSS selectors
response.css('div.content')
response.css('h1::text').get()        # Get text
response.css('a::attr(href)').get()   # Get attribute
response.css('p::text').getall()      # Get all matches

# XPath selectors
response.xpath('//div[@class="content"]')
response.xpath('//h1/text()').get()
response.xpath('//a/@href').get()
```
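Selectors can also be chained to scope queries, and combined with regular expressions via `re()`/`re_first()`. A short sketch; the `'$19.99'`-style price format is an assumption:

```python
# Extract just the numeric part of a price like '$19.99'
response.css('.price::text').re_first(r'[\d.]+')

# Chain selectors: run an XPath query relative to a CSS selection
content = response.css('div.content')
content.xpath('.//a/@href').getall()

# Read attributes directly from the first matched element
link = response.css('a.next')
if link:
    href = link.attrib['href']
```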
## Item Pipeline

```python
# pipelines.py
import json


class CleaningPipeline:
    def process_item(self, item, spider):
        item['price'] = float(item['price'].replace('$', ''))
        item['name'] = item['name'].strip()
        return item


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```
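A pipeline can also discard bad records by raising `DropItem` (from `scrapy.exceptions`); dropped items never reach later pipelines. A minimal validation sketch, where the required fields are assumptions:

```python
# pipelines.py
from scrapy.exceptions import DropItem


class ValidationPipeline:
    def process_item(self, item, spider):
        # Drop items missing fields that later stages depend on
        if not item.get('name') or not item.get('price'):
            raise DropItem(f"Missing required field in {item!r}")
        return item
```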
## Settings

```python
# settings.py
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5   # seconds to wait between requests
ROBOTSTXT_OBEY = True

# Enable pipelines (lower numbers run first)
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
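For throttling that adapts to server load, Scrapy ships an AutoThrottle extension; these settings are part of Scrapy itself, though the values below are only illustrative:

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial download delay
AUTOTHROTTLE_MAX_DELAY = 30.0           # cap when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # avg parallel requests per remote site

# Identify your crawler politely
USER_AGENT = 'mybot (+https://example.com/bot-info)'
```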
## Running Spiders

```bash
# Run a spider by its `name` attribute
scrapy crawl products

# Output to file
scrapy crawl products -o products.json

# Output formats: json, jsonl, csv, xml
scrapy crawl products -o products.csv
```
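Spiders can also run from a plain Python script via `CrawlerProcess`, which is part of Scrapy's public API. A sketch; the import path and feed filename are assumptions about your project:

```python
# run.py
from scrapy.crawler import CrawlerProcess

from myproject.spiders.products import ProductSpider  # assumed module path

process = CrawlerProcess(settings={
    'FEEDS': {'products.json': {'format': 'json'}},
    'ROBOTSTXT_OBEY': True,
})
process.crawl(ProductSpider)
process.start()  # blocks until the crawl finishes
```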
## When to Use Scrapy
- Large-scale crawling (thousands of pages)
- Following links across a site
- Crawls that need built-in request throttling and politeness controls
- Complex extraction pipelines
- Production scraping systems
## When NOT to Use Scrapy
- Single page extraction (use requests + BeautifulSoup)
- JavaScript-rendered content (use Playwright/Selenium)
- Processing already downloaded HTML files
- Simple one-off scraping tasks