| name | airflow-etl |
| description | Generate Apache Airflow ETL pipelines for government websites and document sources. Explores websites to find downloadable documents, verifies commercial use licenses, and creates complete Airflow DAG assets with daily scheduling. Use when user wants to create ETL pipelines, scrape government documents, or automate document collection workflows. |
Airflow ETL Pipeline Generator
Generate production-ready Apache Airflow ETL pipelines that automatically discover, download, and transform documents from government websites and other data sources into structured markdown files.
Workflow
Phase 1: Website Exploration and Discovery
Initial Analysis:
- Use WebFetch to explore the provided website URL
- Identify document sections (downloads, archives, publications, meetings, etc.)
- Look for API endpoints, RSS feeds, or structured data sources
- Note pagination patterns and document organization
License Verification:
- Search for license information (Creative Commons, Open Government License, etc.)
- Look for terms of use or copyright notices
- Check for explicit commercial use permissions
- If the license is unclear, ask the user about its status
Document Inventory:
- Identify document types (PDF, DOC, DOCX, etc.)
- Understand the URL patterns for documents
- Determine how to detect new documents
- Note any metadata available (dates, categories, titles)
User Confirmation:
- Present findings in a clear summary
- Show example document URLs
- Describe the discovered structure
- Ask user to confirm this is the correct data source
Phase 2: Generate Airflow Pipeline Assets
Create a complete, production-ready Airflow project structure:
```
airflow_pipelines/
├── dags/
│   └── [source_name]_etl_dag.py
├── operators/
│   ├── __init__.py
│   ├── document_scraper.py
│   └── document_converter.py
├── utils/
│   ├── __init__.py
│   ├── license_checker.py
│   └── file_manager.py
├── config/
│   └── [source_name]_config.yaml
├── requirements.txt
└── README.md
```
File Generation Requirements:
1. DAG File (dags/[source_name]_etl_dag.py):
- Daily schedule (adjustable)
- Clear task dependencies
- Error handling and retries
- Sensor for checking new documents
- Download task
- Conversion task to markdown
- File organization task
- Use Airflow best practices (XComs, task groups, dynamic task mapping); a minimal skeleton is sketched below
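A minimal TaskFlow-style skeleton, assuming Airflow 2.x; the DAG id, paths, and placeholder task bodies are illustrative and would be replaced by the generated scraper, converter, and file-manager logic.
```python
# Hypothetical skeleton for dags/[source_name]_etl_dag.py; task bodies are
# placeholders for the generated scraper, converter, and file-manager calls.
from datetime import timedelta

import pendulum
from airflow.decorators import dag, task

default_args = {
    "owner": "data-engineering",
    "retries": 3,                        # retry transient failures
    "retry_delay": timedelta(minutes=5),
}


@dag(
    dag_id="example_source_etl",
    schedule="0 0 * * *",                # daily at midnight, adjustable
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    default_args=default_args,
    tags=["etl", "documents"],
)
def example_source_etl():
    @task
    def discover_new_documents() -> list:
        # Would ask the scraper for documents not yet in the local index.
        return [{"url": "https://example.com/doc-1.pdf", "title": "Doc 1"}]

    @task
    def download_document(doc: dict) -> str:
        # Would download the file and return its local path.
        return f"/data/documents/raw/{doc['title']}.pdf"

    @task
    def convert_to_markdown(path: str) -> str:
        # Would convert the downloaded file and return the markdown path.
        return path.replace("/raw/", "/markdown/").replace(".pdf", ".md")

    @task
    def organize(markdown_path: str) -> None:
        # Would move the file into the year/month/category layout.
        print(f"Stored {markdown_path}")

    docs = discover_new_documents()
    # Dynamic task mapping: one download/convert/organize task per document.
    downloaded = download_document.expand(doc=docs)
    converted = convert_to_markdown.expand(path=downloaded)
    organize.expand(markdown_path=converted)


example_source_etl()
```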
2. Document Scraper (operators/document_scraper.py):
- BeautifulSoup or Scrapy for web scraping
- Request handling with retries
- Respect robots.txt
- User-agent configuration
- Rate limiting
- Checksum/hash tracking to avoid re-downloading
- State management for incremental updates (see the scraper sketch below)
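A sketch of the request and deduplication plumbing, assuming requests and BeautifulSoup; the state-file path, link filter, and User-Agent are placeholder assumptions rather than the final operator.
```python
# Illustrative helpers for document_scraper.py; STATE_FILE and the link filter
# are assumptions to be replaced by the generated config.
import hashlib
import json
import time
import urllib.robotparser
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter, Retry

USER_AGENT = "ETL Pipeline Bot"
STATE_FILE = Path("/data/documents/.seen_hashes.json")  # hypothetical state store


def build_session() -> requests.Session:
    """Session with automatic retries and a polite User-Agent."""
    session = requests.Session()
    retries = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503])
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.headers["User-Agent"] = USER_AGENT
    return session


def allowed_by_robots(url: str) -> bool:
    """Respect robots.txt before fetching a document URL."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(USER_AGENT, url)


def find_document_links(html: str, base_url: str) -> list:
    """Collect PDF/DOC/DOCX links from a listing page."""
    soup = BeautifulSoup(html, "lxml")
    return [
        urljoin(base_url, a["href"])
        for a in soup.find_all("a", href=True)
        if a["href"].lower().endswith((".pdf", ".doc", ".docx"))
    ]


def is_new_document(content: bytes) -> bool:
    """Hash-based tracking so re-runs skip files that were already processed."""
    digest = hashlib.sha256(content).hexdigest()
    seen = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else []
    if digest in seen:
        return False
    STATE_FILE.write_text(json.dumps(seen + [digest]))
    return True


def fetch(url: str, session: requests.Session, rate_limit: float = 1.0) -> bytes:
    """Fetch one URL, sleeping to honour the configured requests-per-second limit."""
    time.sleep(1.0 / rate_limit)
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.content
```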
3. Document Converter (operators/document_converter.py):
- Support for PDF, DOC, DOCX conversion to markdown
- Use libraries like pypandoc, pdfplumber, or python-docx
- Preserve document structure (headings, lists, tables)
- Extract metadata
- Handle encoding issues
- Clean and normalize the output (a conversion sketch follows)
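One possible shape for the converter, assuming pdfplumber for PDFs and python-docx for DOCX; heading detection is simplified and legacy .doc files are assumed to go through pypandoc instead.
```python
# Sketch for document_converter.py; heading mapping and output cleaning are
# simplified assumptions, not a complete converter.
from pathlib import Path

import pdfplumber
from docx import Document


def pdf_to_markdown(path: Path) -> str:
    """Extract page text from a PDF; tables and headings need extra handling."""
    with pdfplumber.open(path) as pdf:
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n\n".join(pages)


def docx_to_markdown(path: Path) -> str:
    """Map DOCX heading styles ('Heading 1' .. 'Heading 6') to '#' prefixes."""
    lines = []
    for para in Document(path).paragraphs:
        style = para.style.name
        if style.startswith("Heading") and style[-1].isdigit():
            lines.append("#" * int(style[-1]) + " " + para.text)
        else:
            lines.append(para.text)
    return "\n\n".join(lines)


def convert(path: Path) -> str:
    """Dispatch by extension and normalize the text to clean UTF-8."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        text = pdf_to_markdown(path)
    elif suffix == ".docx":
        text = docx_to_markdown(path)
    else:
        # Legacy .doc (and anything else) would be routed through pypandoc.
        raise ValueError(f"Unsupported format: {suffix}")
    return text.encode("utf-8", errors="replace").decode("utf-8")
```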
4. License Checker (utils/license_checker.py):
- Validate license information
- Check for commercial use permission
- Log license status
- Skip non-compliant documents (a heuristic sketch follows)
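A conservative heuristic sketch; the marker phrases are assumptions, and ambiguous licenses should still be confirmed with the user.
```python
# Sketch for utils/license_checker.py; the marker lists are illustrative and
# deliberately conservative (unknown licenses are not treated as permitted).
import logging

logger = logging.getLogger(__name__)

PERMISSIVE_MARKERS = (
    "creative commons attribution",
    "cc by",
    "open government licence",
    "open government license",
    "public domain",
)
RESTRICTIVE_MARKERS = ("non-commercial", "noncommercial", "cc by-nc")


def allows_commercial_use(license_text: str) -> bool:
    """Return True only if a known permissive marker is present and no
    non-commercial restriction appears in the same text."""
    text = license_text.lower()
    if any(marker in text for marker in RESTRICTIVE_MARKERS):
        logger.warning("License restricts commercial use; document will be skipped")
        return False
    permitted = any(marker in text for marker in PERMISSIVE_MARKERS)
    logger.info("License check: %s", "permitted" if permitted else "unknown, skipping")
    return permitted
```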
5. File Manager (utils/file_manager.py):
- Create meaningful directory structure
- Organize by date, category, or document type
- Generate consistent filenames
- Handle duplicates
- Maintain an index of processed documents (see the sketch below)
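A sketch of the naming and layout conventions, assuming the year/month/category structure from the example config; the base path and slug rules are placeholders.
```python
# Sketch for utils/file_manager.py; BASE_PATH mirrors storage.base_path in the
# example config, and the slug rules are an assumption.
import re
from datetime import date
from pathlib import Path

BASE_PATH = Path("/data/documents")


def slugify(title: str) -> str:
    """Normalize a title into a safe, consistent filename component."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def target_path(title: str, category: str, published: date) -> Path:
    """Build the year/month/category layout and avoid overwriting duplicates."""
    directory = BASE_PATH / f"{published:%Y}" / f"{published:%m}" / slugify(category)
    directory.mkdir(parents=True, exist_ok=True)
    candidate = directory / f"{published:%Y-%m-%d}_{slugify(title)}.md"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{published:%Y-%m-%d}_{slugify(title)}_{counter}.md"
        counter += 1
    return candidate
```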
6. Configuration (config/[source_name]_config.yaml):
```yaml
source:
  name: "Source Name"
  url: "https://example.com"
  document_section: "/documents"
schedule:
  interval: "0 0 * * *"  # Daily at midnight
storage:
  base_path: "/data/documents"
  structure: "year/month/category"
scraping:
  rate_limit: 1  # requests per second
  user_agent: "ETL Pipeline Bot"
  retry_attempts: 3
conversion:
  format: "markdown"
  preserve_structure: true
  extract_metadata: true
```
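A small loader such as the following could be shared by the DAG and operators; the default path and required sections are assumptions tied to the example above.
```python
# Hypothetical config loader; the default path is illustrative.
from pathlib import Path

import yaml


def load_config(path: str = "config/example_source_config.yaml") -> dict:
    """Read the YAML config and fail fast if a required section is missing."""
    config = yaml.safe_load(Path(path).read_text())
    for section in ("source", "schedule", "storage", "scraping", "conversion"):
        if section not in config:
            raise KeyError(f"Missing required config section: {section}")
    return config
```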
7. Requirements (requirements.txt):
```
apache-airflow>=2.7.0
beautifulsoup4>=4.12.0
requests>=2.31.0
pypandoc>=1.12
pdfplumber>=0.10.0
python-docx>=1.0.0
pyyaml>=6.0
lxml>=4.9.0
```
8. Documentation (README.md):
- Pipeline overview
- Setup instructions
- Configuration guide
- Airflow connection requirements
- Monitoring and troubleshooting
- Example usage
Phase 3: Implementation Notes
Important Considerations:
- Include comprehensive error handling
- Log all operations for debugging
- Add data quality checks
- Implement idempotency (safe to re-run)
- Use Airflow variables for sensitive config
- Add email/Slack alerts for failures (a failure-callback sketch follows this list)
- Document the directory structure created
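A sketch of how retries, alerts, and Airflow Variables could be wired into the DAG's default_args; the Variable key etl_alert_email and the callback body are assumptions (a real callback would post to Slack or send email).
```python
# Illustrative default_args wiring for retries, failure alerts, and
# Airflow Variables; "etl_alert_email" is a hypothetical Variable key.
from datetime import timedelta

from airflow.models import Variable


def notify_failure(context: dict) -> None:
    """on_failure_callback: surface the failed task; swap in Slack/email here."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} failed in DAG {ti.dag_id} (try {ti.try_number})")


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
    "email_on_failure": True,
    "email": [Variable.get("etl_alert_email", default_var="alerts@example.com")],
}
```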
Code Quality:
- Follow PEP 8 style guidelines
- Add docstrings to all functions
- Include type hints
- Write modular, reusable code
- Add comments for complex logic
Testing Recommendations (optional):
- Suggest basic unit tests for utilities
- Recommend integration testing approach
- Provide example test cases (one is sketched below)
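For instance, a small pytest case against the file-manager helper sketched in Phase 2 (the import path assumes the project layout above):
```python
# Example pytest case; utils.file_manager and slugify() refer to the
# hypothetical helper sketched earlier.
from utils.file_manager import slugify


def test_slugify_normalizes_titles():
    assert slugify("Committee Report (2024)!") == "committee-report-2024"
    assert slugify("   ") == "untitled"
```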
Phase 4: Delivery
- Generate all files using Write tool
- Provide summary of created assets
- Explain how to deploy to Airflow:
- Copy files to Airflow home directory
- Install requirements
- Enable the DAG in Airflow UI
- Configure connections if needed
- Suggest next steps (testing, scheduling, monitoring)
Examples
Example 1: German Bundestag Documents
User: "Create an ETL pipeline for https://www.bundestag.de/digitales to collect committee meeting documents"
Skill Response:
- Explores the digital committee section
- Finds document sections (agendas, protocols, reports)
- Checks copyright notice
- Confirms findings with user
- Generates complete Airflow pipeline
- Creates scraper for committee documents
- Sets up markdown conversion
- Organizes by committee and date
Example 2: EU Open Data Portal
User: "Build an Airflow pipeline for EU legislation documents from data.europa.eu"
Skill Response:
- Discovers API endpoints
- Verifies open data license
- Generates API-based scraper
- Creates pipeline with API operators
- Includes rate limiting
- Organizes by document type and year
Key Success Criteria
- Pipeline runs successfully in Airflow
- Documents are correctly downloaded
- Markdown conversion preserves structure
- File organization is logical and scalable
- License compliance is enforced
- New documents are detected automatically
- Pipeline is idempotent and fault-tolerant
Tips for Users
- Provide the main URL of the data source
- Mention any specific document types needed
- Specify preferred organization structure
- Note any special requirements (date ranges, categories)
- Test with a small sample before full deployment