| name | Scraping Best Practices |
| description | Ethical and effective web scraping techniques, anti-bot evasion, legal compliance, and data extraction strategies |
Scraping Best Practices
You are an expert in ethical web scraping, data extraction, and bot detection evasion. You help users scrape websites effectively while respecting legal boundaries, rate limits, and ethical considerations.
Core Principles
1. Legal and Ethical Compliance
Always Check First:
- Review the website's robots.txt file
- Read the Terms of Service (ToS)
- Check for API alternatives (always prefer official APIs)
- Consider GDPR, CCPA, and other privacy regulations
- Respect copyright and intellectual property rights
Legal Considerations:
import urllib.robotparser
from urllib.parse import urlparse
def check_robots_txt(url, user_agent='*'):
    """Check if scraping is allowed by robots.txt"""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")  # robots.txt lives at the site root
    rp.read()
    return rp.can_fetch(user_agent, url)
# Example usage
if not check_robots_txt("https://example.com/data"):
print("Scraping disallowed by robots.txt")
exit()
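robots.txt can also advertise a Crawl-delay or Request-rate directive, and the same standard-library parser exposes both. A minimal sketch building on check_robots_txt above (get_crawl_delay is a hypothetical helper; the 1-second default is an assumption):
def get_crawl_delay(url, user_agent='*', default=1.0):
    """Return the crawl delay advertised in robots.txt, or a default."""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    delay = rp.crawl_delay(user_agent)      # Crawl-delay directive, if present
    rate = rp.request_rate(user_agent)      # Request-rate directive, if present
    if delay:
        return float(delay)
    if rate:
        return rate.seconds / rate.requests
    return default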
When Scraping Is Generally Acceptable:
- Public data that's freely available
- Data for personal use (non-commercial)
- Academic research (with proper citations)
- Facts and non-copyrightable content
When to Avoid:
- Data behind authentication (without permission)
- Personal/private information
- Copyrighted creative content
- Explicitly forbidden by ToS
2. Rate Limiting and Politeness
Respect Server Resources:
import time
import random
from datetime import datetime
class PoliteScraperMixin:
def __init__(self):
self.min_delay = 1.0 # Minimum 1 second between requests
self.max_delay = 3.0
self.last_request_time = None
def polite_wait(self):
"""Add random delay between requests"""
if self.last_request_time:
elapsed = (datetime.now() - self.last_request_time).total_seconds()
delay = random.uniform(self.min_delay, self.max_delay)
if elapsed < delay:
time.sleep(delay - elapsed)
self.last_request_time = datetime.now()
def respect_retry_after(self, response):
"""Respect HTTP 429 Retry-After header"""
if response.status_code == 429:
retry_after = response.headers.get('Retry-After')
if retry_after:
wait_time = int(retry_after)
print(f"Rate limited. Waiting {wait_time} seconds...")
time.sleep(wait_time)
return True
return False
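A usage sketch showing how the mixin is meant to be combined with a concrete scraper (MyScraper, fetch_page, and the URL are illustrative placeholders):
import requests

class MyScraper(PoliteScraperMixin):
    def fetch_page(self, url):
        """Fetch a page, pausing between requests and honoring 429 responses."""
        self.polite_wait()
        response = requests.get(url, timeout=15)
        if self.respect_retry_after(response):
            response = requests.get(url, timeout=15)  # One retry after the cool-down
        return response

scraper = MyScraper()
page = scraper.fetch_page("https://example.com/catalog")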
Implement Exponential Backoff:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def get_session_with_retries():
"""Create session with automatic retry logic"""
session = requests.Session()
retry_strategy = Retry(
total=3,
backoff_factor=1, # Wait 1, 2, 4 seconds
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["HEAD", "GET", "OPTIONS"]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
Anti-Bot Detection Evasion
1. User-Agent Rotation
Realistic User-Agent Strings:
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
]
import random
def get_random_headers():
"""Generate realistic HTTP headers"""
return {
'User-Agent': random.choice(USER_AGENTS),
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Cache-Control': 'max-age=0',
}
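These pieces compose naturally. A usage sketch that sends each request through the retrying session from the previous section with freshly randomized headers (fetch, the URL, and the 15-second timeout are illustrative):
session = get_session_with_retries()

def fetch(url):
    """GET a page with retry logic and randomized, realistic headers."""
    response = session.get(url, headers=get_random_headers(), timeout=15)
    response.raise_for_status()
    return response.text

html = fetch("https://example.com/products")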
2. JavaScript Rendering
For Dynamic Content:
from playwright.sync_api import sync_playwright
def scrape_dynamic_page(url):
"""Scrape JavaScript-rendered content"""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={'width': 1920, 'height': 1080},
locale='en-US',
timezone_id='America/New_York'
)
page = context.new_page()
# Block unnecessary resources for speed
page.route("**/*.{png,jpg,jpeg,gif,svg,mp4,mp3,css,font}",
lambda route: route.abort())
page.goto(url, wait_until='networkidle')
# Wait for content to load
page.wait_for_selector('#main-content', timeout=10000)
# Extract data
content = page.content()
browser.close()
return content
3. Session Management
Maintain Cookies and Sessions:
import json
import requests
class SessionScraper:
def __init__(self):
self.session = requests.Session()
self.session.headers.update(get_random_headers())
def login(self, login_url, credentials):
"""Handle login and maintain session"""
response = self.session.post(login_url, data=credentials)
if response.status_code == 200:
            # Persist cookies as JSON so they can be reloaded safely
            with open('session_cookies.json', 'w') as f:
                json.dump(self.session.cookies.get_dict(), f)
return response
def load_session(self):
"""Load saved session cookies"""
try:
            with open('session_cookies.json', 'r') as f:
                self.session.cookies.update(json.load(f))
except FileNotFoundError:
pass
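Intended lifecycle of the class above (the login URL, credential fields, and account page are placeholders):
scraper = SessionScraper()
scraper.load_session()                      # Reuse cookies from a previous run, if any
if not scraper.session.cookies:             # Otherwise authenticate once
    scraper.login("https://example.com/login",
                  {"username": "user", "password": "secret"})
profile_page = scraper.session.get("https://example.com/account")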
Data Extraction Strategies
1. Robust Selectors
CSS Selectors (Fast, Readable):
from bs4 import BeautifulSoup
def extract_with_css(html):
soup = BeautifulSoup(html, 'lxml')
# Multiple fallback selectors
selectors = [
'article.product h2.title',
'div.product-info h2',
'h2[itemprop="name"]',
]
for selector in selectors:
element = soup.select_one(selector)
if element:
return element.text.strip()
return None
XPath (More Powerful):
from lxml import html as lxml_html
def extract_with_xpath(html):
tree = lxml_html.fromstring(html)
# Complex XPath with fallbacks
xpaths = [
'//article[@class="product"]//h2[@class="title"]/text()',
'//h2[contains(@class, "product-title")]/text()',
'//div[@data-testid="product-name"]/text()',
]
for xpath in xpaths:
result = tree.xpath(xpath)
if result:
return result[0].strip()
return None
Regular Expressions (Last Resort):
import re
def extract_with_regex(html):
"""Use only when structure is very unpredictable"""
# Extract price patterns
price_pattern = r'\$\s*(\d+(?:\.\d{2})?)'
match = re.search(price_pattern, html)
if match:
return float(match.group(1))
return None
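Whichever technique fits the page, it helps to chain extractors from most to least preferred so a layout change degrades gracefully instead of failing outright. A sketch (extract_with_fallbacks is a hypothetical helper around the functions above):
def extract_with_fallbacks(html, extractors):
    """Run extractor functions in order of preference; return the first hit."""
    for extractor in extractors:
        try:
            value = extractor(html)
        except Exception:
            continue                  # A parser failure should not abort the whole run
        if value is not None:
            return value
    return None

# e.g. title = extract_with_fallbacks(html, [extract_with_css, extract_with_xpath])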
2. Data Validation and Cleaning
Clean Extracted Data:
import re
from decimal import Decimal, InvalidOperation
def clean_text(text):
"""Normalize whitespace and remove unwanted characters"""
if not text:
return None
    # Decode HTML entities (&amp;, &#39;, ...) before normalizing
    from html import unescape
    text = unescape(text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text
def parse_price(price_str):
"""Extract numeric price from string"""
if not price_str:
return None
# Remove currency symbols and commas
cleaned = re.sub(r'[^\d.]', '', price_str)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
def validate_url(url, base_url=None):
    """Ensure URL is valid and absolute"""
    from urllib.parse import urljoin, urlparse
    if not url:
        return None
    # Convert relative URLs to absolute when a base URL is provided
    if base_url and not url.startswith(('http://', 'https://')):
        url = urljoin(base_url, url)
    # Validate scheme and host
    parsed = urlparse(url)
    if parsed.scheme and parsed.netloc:
        return url
    return None
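Applied together on a single scraped record (clean_product, the field names, and the catalog URL are illustrative):
def clean_product(raw, page_url):
    """Normalize one scraped product dict using the helpers above."""
    return {
        'name': clean_text(raw.get('name')),
        'price': parse_price(raw.get('price')),
        'url': validate_url(raw.get('url'), base_url=page_url),
    }

record = clean_product(
    {'name': '  Widget\n Pro ', 'price': '$1,299.00', 'url': '/products/42'},
    page_url='https://example.com/catalog',
)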
3. Pagination Handling
Different Pagination Patterns:
def scrape_paginated(base_url, max_pages=10):
"""Handle various pagination patterns"""
all_items = []
# Pattern 1: Query parameter pagination
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
items = scrape_page(url)
if not items:
break
all_items.extend(items)
time.sleep(random.uniform(1, 2))
return all_items
def scrape_infinite_scroll(url):
"""Handle infinite scroll pagination"""
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(url)
items = []
previous_height = 0
while True:
# Scroll to bottom
page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
page.wait_for_timeout(2000)
# Check if new content loaded
current_height = page.evaluate('document.body.scrollHeight')
if current_height == previous_height:
break
previous_height = current_height
        # Extract item text before closing; element handles die with the browser
        items = [el.inner_text() for el in page.query_selector_all('.item')]
        browser.close()
        return items
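A third common pattern is a "next" link that must be followed until it disappears. A sketch assuming a rel="next" (or .next-page) anchor; extract_items stands in for a site-specific parser and session for something like the retrying session from earlier:
import random
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def scrape_next_links(start_url, session, max_pages=50):
    """Follow next-page links until none are left or the page cap is hit."""
    all_items, url, pages = [], start_url, 0
    while url and pages < max_pages:
        response = session.get(url, headers=get_random_headers(), timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
        all_items.extend(extract_items(soup))            # Site-specific extraction
        next_link = soup.select_one('a[rel="next"], a.next-page')
        url = urljoin(url, next_link['href']) if next_link else None
        pages += 1
        time.sleep(random.uniform(1, 2))                 # Stay polite between pages
    return all_items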
Advanced Techniques
1. Proxy Rotation
Using Proxy Services:
import requests
class ProxyRotator:
def __init__(self, proxy_list):
self.proxies = proxy_list
self.current_index = 0
def get_next_proxy(self):
"""Round-robin proxy selection"""
proxy = self.proxies[self.current_index]
self.current_index = (self.current_index + 1) % len(self.proxies)
return proxy
def scrape_with_proxy(self, url):
"""Scrape using rotating proxies"""
for attempt in range(len(self.proxies)):
proxy = self.get_next_proxy()
try:
response = requests.get(
url,
proxies={'http': proxy, 'https': proxy},
timeout=10
)
if response.status_code == 200:
return response
except requests.RequestException as e:
print(f"Proxy {proxy} failed: {e}")
continue
return None
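Usage sketch (the proxy URLs are placeholders; real pools usually come from a provider and embed credentials in the URL):
rotator = ProxyRotator([
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
])
response = rotator.scrape_with_proxy('https://example.com/catalog')
if response is not None:
    print(response.status_code)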
2. CAPTCHA Handling
Detection and Strategies:
def detect_captcha(html):
"""Detect common CAPTCHA patterns"""
captcha_indicators = [
'g-recaptcha',
'hcaptcha',
'captcha-container',
'cloudflare-challenge',
]
for indicator in captcha_indicators:
if indicator in html.lower():
return True
return False
def handle_captcha_strategy():
"""Strategies for CAPTCHA challenges"""
strategies = {
'slow_down': 'Reduce request rate significantly',
'wait_and_retry': 'Wait 5-10 minutes before retrying',
'use_service': 'Use 2captcha or Anti-Captcha service ($$)',
'manual_solve': 'Alert user for manual intervention',
'avoid_detection': 'Improve stealth techniques',
}
return strategies
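A sketch wiring detection to the slow-down and wait-and-retry strategies above (the cool-down lengths are arbitrary; fetch is any callable returning a requests response):
import time

def fetch_with_captcha_backoff(fetch, url, max_attempts=3):
    """Retry with growing cool-down periods when a CAPTCHA page appears."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if not detect_captcha(response.text):
            return response
        cooldown = 300 * (attempt + 1)          # 5, 10, then 15 minutes
        print(f"CAPTCHA detected, cooling down for {cooldown // 60} minutes...")
        time.sleep(cooldown)
    return None                                  # Give up; consider manual solving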
3. Data Storage
Efficient Storage Patterns:
import json
import csv
from datetime import datetime
class DataStorage:
@staticmethod
def save_to_json(data, filename):
"""Save data to JSON with metadata"""
output = {
'scraped_at': datetime.now().isoformat(),
'count': len(data),
'data': data
}
with open(filename, 'w', encoding='utf-8') as f:
json.dump(output, f, indent=2, ensure_ascii=False)
@staticmethod
def save_to_csv(data, filename):
"""Save data to CSV"""
if not data:
return
keys = data[0].keys()
with open(filename, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=keys)
writer.writeheader()
writer.writerows(data)
@staticmethod
def incremental_save(item, filename):
"""Append items incrementally to avoid memory issues"""
with open(filename, 'a', encoding='utf-8') as f:
f.write(json.dumps(item) + '\n')
Error Handling
Robust Error Management
import time
import logging
from requests.exceptions import RequestException
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ScraperErrors:
@staticmethod
def handle_request_error(url, error, retries=3):
"""Handle various request errors"""
error_handlers = {
'ConnectionError': 'Network issue, check connectivity',
'Timeout': 'Request timed out, increase timeout',
'TooManyRedirects': 'Redirect loop detected',
'HTTPError': 'HTTP error occurred',
}
error_type = type(error).__name__
message = error_handlers.get(error_type, 'Unknown error')
logger.error(f"Error scraping {url}: {message} - {error}")
if retries > 0:
logger.info(f"Retrying... ({retries} attempts left)")
time.sleep(5)
return True
return False
@staticmethod
def handle_parsing_error(html, selector):
"""Handle data extraction errors"""
logger.warning(f"Failed to extract data with selector: {selector}")
# Try alternative extraction methods
return None
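Usage sketch: a fetch helper that leans on the handler above to decide whether another attempt is worthwhile (the URL, timeout, and retry count are illustrative):
import requests
from requests.exceptions import RequestException

def fetch_with_handling(url, retries=3):
    """Fetch a URL, logging failures and retrying via ScraperErrors."""
    for attempt in range(retries, 0, -1):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response
        except RequestException as error:
            if not ScraperErrors.handle_request_error(url, error, retries=attempt - 1):
                break                            # Handler says no attempts remain
    return None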
Best Practices Checklist
Before Scraping:
- Check robots.txt and ToS
- Look for official API
- Verify data is public
- Plan rate limiting strategy
- Set up error handling
During Scraping:
- Use realistic user agents
- Implement random delays
- Respect rate limits (429 errors)
- Handle errors gracefully
- Monitor for blocks/CAPTCHAs
After Scraping:
- Validate extracted data
- Clean and normalize data
- Store with metadata (timestamp, source)
- Log any issues encountered
- Delete unnecessary data
Anti-Patterns to Avoid
DON'T:
- Scrape faster than 1 request per second
- Ignore robots.txt
- Use a generic user agent like "python-requests/2.x"
- Scrape during peak hours
- Store personal/sensitive data
- Resell scraped data without rights
- Overwhelm small websites with traffic
- Ignore 429 rate limit responses
- Use scraping for malicious purposes
- Violate Terms of Service
DO:
- Use official APIs when available
- Respect rate limits generously
- Implement exponential backoff
- Cache responses to avoid re-scraping (see the caching sketch after this list)
- Clean up after yourself
- Monitor your impact on servers
- Be transparent about your purpose
- Consider ethical implications
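For the caching point above, a minimal on-disk response cache (the directory name and 24-hour TTL are arbitrary choices; session is any requests.Session, and libraries such as requests-cache do this more robustly):
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path('.scrape_cache')
CACHE_TTL = 24 * 3600                            # Re-fetch after 24 hours

def cached_get(session, url):
    """Return cached HTML when fresh, otherwise fetch and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        entry = json.loads(path.read_text(encoding='utf-8'))
        if time.time() - entry['fetched_at'] < CACHE_TTL:
            return entry['body']
    response = session.get(url, timeout=15)
    response.raise_for_status()
    path.write_text(json.dumps({'fetched_at': time.time(), 'body': response.text}),
                    encoding='utf-8')
    return response.text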
Related Skills
- HTML Parsing: Understanding DOM structure and selectors
- Regular Expressions: Pattern matching for data extraction
- HTTP Protocol: Headers, cookies, sessions, status codes
- JavaScript Rendering: Browser automation with Playwright/Selenium
- Data Validation: Ensuring data quality and integrity
- API Design: Preferred alternative to web scraping
- Legal Compliance: GDPR, CCPA, ToS understanding