| name | scrapy |
| description | Build web crawlers and spiders using the Scrapy framework. Use this skill when building large-scale web crawlers, following links across multiple pages, handling request throttling, or creating production scraping pipelines. NOT needed for parsing single HTML files or processing already-fetched content. |
# Scrapy

Framework for building large-scale web crawlers.
## Installation

```bash
pip install scrapy
```
## Quick Start

Create a new Scrapy project:

```bash
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
```
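The generated project has roughly this layout (details can vary slightly by Scrapy version):

```
myproject/
├── scrapy.cfg            # deploy/run configuration
└── myproject/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── example.py    # created by genspider
```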
## Basic Spider

```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
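To crawl a listing and then visit each product's detail page, pass a different callback to `response.follow`. A minimal sketch, where the `div.product a`, `h1`, and `.description` selectors are assumptions about the target site's markup:

```python
import scrapy


class ProductDetailSpider(scrapy.Spider):
    name = 'product_details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Follow each product link to its detail page
        for href in response.css('div.product a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_product)

        # Keep paginating through the listing
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_product(self, response):
        # Detail pages usually expose richer data than listings
        yield {
            'name': response.css('h1::text').get(),
            'description': response.css('.description::text').get(),
            'url': response.url,
        }
```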
## Selectors

```python
# CSS selectors
response.css('div.content')
response.css('h1::text').get()        # Get text
response.css('a::attr(href)').get()   # Get attribute
response.css('p::text').getall()      # Get all matches

# XPath selectors
response.xpath('//div[@class="content"]')
response.xpath('//h1/text()').get()
response.xpath('//a/@href').get()
```
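Selectors can also be chained to scope queries, and combined with regular expressions via `re()`/`re_first()`. A short sketch; the `'$19.99'`-style price format is an assumption:

```python
# Extract just the numeric part of a price like '$19.99'
response.css('.price::text').re_first(r'[\d.]+')

# Chain selectors: run an XPath query relative to a CSS selection
content = response.css('div.content')
content.xpath('.//a/@href').getall()

# Read attributes directly from the first matched element
link = response.css('a.next')
if link:
    href = link.attrib['href']
```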
## Item Pipeline

```python
# pipelines.py
import json


class CleaningPipeline:
    def process_item(self, item, spider):
        item['price'] = float(item['price'].replace('$', ''))
        item['name'] = item['name'].strip()
        return item


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```
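A pipeline can also discard bad records by raising `DropItem` (from `scrapy.exceptions`); dropped items never reach later pipelines. A minimal validation sketch, where the required fields are assumptions:

```python
# pipelines.py
from scrapy.exceptions import DropItem


class ValidationPipeline:
    def process_item(self, item, spider):
        # Drop items missing fields that later stages depend on
        if not item.get('name') or not item.get('price'):
            raise DropItem(f"Missing required field in {item!r}")
        return item
```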
## Settings

```python
# settings.py
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5   # seconds to wait between requests
ROBOTSTXT_OBEY = True

# Enable pipelines (lower numbers run first)
ITEM_PIPELINES = {
    'myproject.pipelines.CleaningPipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
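For throttling that adapts to server load, Scrapy ships an AutoThrottle extension; these settings are part of Scrapy itself, though the values below are only illustrative:

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial download delay
AUTOTHROTTLE_MAX_DELAY = 30.0           # cap when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # avg parallel requests per remote site

# Identify your crawler politely
USER_AGENT = 'mybot (+https://example.com/bot-info)'
```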
## Running Spiders

```bash
# Run a spider by its `name` attribute
scrapy crawl products

# Output to file
scrapy crawl products -o products.json

# Output formats: json, jsonl, csv, xml
scrapy crawl products -o products.csv
```
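Spiders can also run from a plain Python script via `CrawlerProcess`, which is part of Scrapy's public API. A sketch; the import path and feed filename are assumptions about your project:

```python
# run.py
from scrapy.crawler import CrawlerProcess

from myproject.spiders.products import ProductSpider  # assumed module path

process = CrawlerProcess(settings={
    'FEEDS': {'products.json': {'format': 'json'}},
    'ROBOTSTXT_OBEY': True,
})
process.crawl(ProductSpider)
process.start()  # blocks until the crawl finishes
```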
## When to Use Scrapy
- Large-scale crawling (thousands of pages)
- Following links across a site
- Crawls that need built-in request throttling and politeness controls
- Complex extraction pipelines
- Production scraping systems
## When NOT to Use Scrapy
- Single page extraction (use requests + BeautifulSoup)
- JavaScript-rendered content (use Playwright/Selenium)
- Processing already downloaded HTML files
- Simple one-off scraping tasks