name	deployment-engineer
description	Expert deployment automation for cloud platforms. Handles CI/CD pipelines, container orchestration, infrastructure setup, and production deployments with battle-tested configurations. Specializes in GitHub Actions, Docker, HuggingFace Spaces, and GitHub Pages.
category	devops
version	1.0.0

Deployment Engineer Skill

Purpose

Automate and manage production deployments across multiple platforms with zero-downtime, proper monitoring, and rollback capabilities. This skill encapsulates hard-won lessons from real-world deployment scenarios.

When to Use This Skill

Use this skill when:

Setting up CI/CD pipelines for web applications
Deploying to HuggingFace Spaces, Vercel, Netlify, or GitHub Pages
Configuring Docker containers and orchestration
Implementing environment-specific configurations
Troubleshooting deployment failures
Setting up monitoring and health checks

Core Deployment Patterns

1. Multi-Platform Deployment Strategy

Lesson Learned: Always verify platform-specific requirements before deployment.

# .github/workflows/deploy-backend.yml
# Critical patterns discovered through painful debugging:

# 1. Branch Name Consistency
on:
  push:
    # NEVER assume 'main' - always verify actual branch name
    branches: [master]  # Fixed from 'main' after repo inspection

# 2. Authentication for External Services
- name: Deploy to HuggingFace
  env:
    HF_TOKEN: ${{ secrets.HF_TOKEN }}
  run: |
    # Pattern: Use credential helper for Git auth
    git config credential.helper store
    echo "https://hf:$HF_TOKEN@huggingface.co" > ~/.git-credentials
    git remote set-url origin https://hf:$HF_TOKEN@huggingface.co/spaces/${{ env.HF_SPACE_NAME }}

# 3. Error Handling and Verification
- name: Verify Deployment
  run: |
    # Always add post-deployment verification
    curl -f "${{ env.DEPLOY_URL }}/health" || echo "Health check failed - space might still be starting"

2. Docker Configuration Best Practices

Lesson Learned: Order of operations in Dockerfile is critical for build success.

# backend/Dockerfile - Battle-tested pattern

# 1. Use specific Python version
FROM python:3.11-slim

# 2. Install system dependencies FIRST
RUN apt-get update && apt-get install -y \
    gcc \
    curl \
    && rm -rf /var/lib/apt/lists/*

# 3. Set working directory early
WORKDIR /app

# 4. Copy requirements BEFORE source code (leverages Docker cache)
COPY pyproject.toml requirements.txt README.md ./

# 5. Install Python dependencies
RUN pip install uv
RUN uv pip install --system -e .

# 6. Copy application code
COPY . .

# 7. Create non-root user AFTER installation
RUN useradd -m -u 1000 user && chown -R user:user /app
USER user

# 8. Expose port and health check
EXPOSE 7860
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

# 9. CMD must be last
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]

3. Environment Variables Management

Lesson Learned: Different platforms require different environment variable strategies.

# backend/main.py - Environment loading pattern

from dotenv import load_dotenv

# Load .env for local development
load_dotenv()

class Settings(BaseSettings):
    """Always provide defaults for critical settings"""

    # OpenAI Configuration
    openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
    openai_model: str = os.getenv("OPENAI_MODEL", "gpt-4o-mini")  # Default to stable model

    # Platform Detection
    is_hf_spaces: bool = os.getenv("SPACE_ID") is not None
    is_production: bool = os.getenv("NODE_ENV") == "production"

    @property
    def api_endpoint(self) -> str:
        """Auto-detect API endpoint based on platform"""
        if self.is_hf_spaces:
            # HuggingFace Spaces
            space_name = os.getenv("SPACE_ID", "")
            return f"https://{space_name.replace(' ', '-').lower()}.hf.space"
        elif self.is_production:
            # Production environment
            return os.getenv("API_URL", "")
        else:
            # Local development
            return "http://localhost:7860"

4. CORS Configuration for Cross-Origin Requests

Lesson Learned: Frontend and backend on different domains require explicit CORS setup.

# backend/main.py - CORS configuration

app = FastAPI()

# Dynamic CORS origins based on environment
cors_origins = []
if os.getenv("NODE_ENV") == "production":
    cors_origins = [
        "https://yourusername.github.io",
        "https://yourdomain.com"
    ]
else:
    cors_origins = ["http://localhost:3000", "http://localhost:7860"]

app.add_middleware(
    CORSMiddleware,
    allow_origins=cors_origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

5. Frontend Configuration Pattern

Lesson Learned: Frontend must adapt to different deployment environments.

// src/theme/Root.tsx - Dynamic API endpoint detection
const getChatkitEndpoint = () => {
  // Check environment variable first
  if (process.env.REACT_APP_CHAT_API_URL) {
    return process.env.REACT_APP_CHAT_API_URL;
  }

  const hostname = window.location.hostname;
  if (hostname === 'localhost' || hostname === '127.0.0.1') {
    return 'http://localhost:7860/chat';
  }

  // Production URLs
  if (hostname.includes('github.io')) {
    // GitHub Pages
    return 'https://your-space.hf.space/chat';
  } else if (hostname.includes('hf.space')) {
    // HuggingFace Spaces
    return `https://${hostname}/chat`;
  }

  return '/chat'; // Same domain deployment
};

Common Pitfalls & Solutions

1. Branch Name Mismatch

Problem: GitHub Actions configured for 'main' but repo uses 'master'

# NEVER hard-code branch names
branches: [master]  # Verify with `git branch` first

2. Docker Build Failures

Problem: Permission errors during package installation

# Install dependencies BEFORE switching to non-root user
RUN uv pip install --system -e .  # As root
USER user  # Switch AFTER installation

3. Model Compatibility Issues

Problem: Using models that require different APIs

# Wrong: gpt-5-nano requires Responses API, not Chat Completions
# Correct: Use compatible models
openai_model: str = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

4. Query Validation Errors

Problem: Backend crashes on short queries like "hi"

# Allow single character queries
if not query or len(query.strip()) < 1:
    raise ValueError("Query must be at least 1 character long")

5. Missing Health Checks

Problem: No way to verify deployment success

@app.get("/health")
async def health_check():
    """Always implement health endpoints"""
    return {
        "status": "healthy",
        "version": "1.0.0",
        "timestamp": datetime.utcnow().isoformat(),
        "services": {
            "database": await check_database(),
            "openai": bool(os.getenv("OPENAI_API_KEY"))
        }
    }

6. Hatchling README.md Not Found Error

Problem: pip install -e . fails with OSError: Readme file does not exist: README.md

# ❌ Wrong - README.md not copied before pip install
COPY pyproject.toml ./
RUN pip install --no-cache-dir -e .

# ✅ Correct - Copy README.md with pyproject.toml
COPY pyproject.toml README.md ./
RUN pip install --no-cache-dir -e .

Root Cause: pyproject.toml has readme = "README.md" but hatchling can't find it during install. Files Affected: Dockerfile, Dockerfile.hf

7. Multiple Dockerfiles Confusion

Problem: HuggingFace Spaces uses Dockerfile by default, not Dockerfile.hf

# You have TWO files:
Dockerfile       # Used by HF Spaces by default
Dockerfile.hf     # IGNORED by HF Spaces unless specified

# Solution: Keep BOTH files in sync or use one file
# Or specify in README.md frontmatter:
# sdk: docker
# dockerfile: Dockerfile.hf  # Optional override

Lesson Learned: When you have multiple Dockerfiles, HuggingFace uses Dockerfile by default. Either keep them synchronized or explicitly specify which one to use.

8. Docusaurus SSR Build Errors

Problem: ReferenceError: window is not defined or localStorage is not defined during build

// ❌ Wrong - Runs during SSR build
function setupAPIConfig() {
  window.__API_BASE_URL__ = 'http://localhost:7860';
}
setupAPIConfig(); // Runs immediately at module load

// ✅ Correct - SSR guard
function setupAPIConfig() {
  window.__API_BASE_URL__ = 'http://localhost:7860';
}
if (typeof window !== 'undefined') {
  setupAPIConfig(); // Only runs in browser
}

For AuthContext with localStorage:

// ❌ Wrong - getInitialState accesses localStorage during SSR
const getInitialState = (): AuthState => {
  const tokens = tokenManager.getTokens(); // Uses localStorage
  return { token: tokens.token, ... };
};

// ✅ Correct - SSR guard in init function
const getInitialState = (): AuthState => {
  // Return default state during SSR
  if (typeof window === 'undefined') {
    return {
      user: null,
      token: null,
      refreshToken: null,
      isLoading: false,
      isAuthenticated: false,
      error: null,
    };
  }

  const tokens = tokenManager.getTokens();
  return { token: tokens.token, ... };
};

Files Affected: src/clientModules/apiConfig.ts, src/context/AuthContext.tsx

9. HuggingFace Spaces Missing Configuration

Problem: "Missing configuration in README" error

# ❌ Wrong - README.md missing YAML frontmatter
# My Backend

FastAPI backend...

# ✅ Correct - YAML frontmatter at TOP of README.md
---
title: AI Book Backend
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: "3.11"
app_file: main.py
pinned: false
license: mit
---

# AI Book Backend

FastAPI backend...

Root Cause: HuggingFace Spaces requires YAML configuration in README.md at the ROOT of the repository. Files Affected: backend/README.md

10. Client Module SSR Execution

Problem: Client modules execute during SSR build in Docusaurus

// ❌ Wrong - Immediately executes code that needs browser APIs
// src/clientModules/apiConfig.ts
const config = window.location.hostname; // FAILS during build

// ✅ Correct - Lazy execution with guard
// src/clientModules/apiConfig.ts
function setupAPIConfig() {
  if (typeof window !== 'undefined') {
    window.__API_BASE_URL__ = 'http://localhost:7860';
  }
}
// Only call if in browser
if (typeof window !== 'undefined') {
  setupAPIConfig();
}
export default {};

Key Insight: Docusaurus client modules are bundled server-side. Always check typeof window !== 'undefined' before accessing browser APIs.

11. Outdated Import Paths After Code Refactoring

Problem: Module import errors after code reorganization

# ❌ Old import paths from refactored code
from database.config import get_db, SessionLocal, create_tables
from auth.auth import verify_token

# ✅ Fix: Update to new module structure
from src.core.database import get_async_db, SessionLocal, create_all_tables
from src.core.security import verify_token

# For sync operations in tests/migrations:
from src.core.database import get_sync_db

Common Patterns:

get_db → get_async_db (async) or get_sync_db (sync)
Session → AsyncSession (async type hints)
create_tables → create_all_tables
database.config → src.core.database

Files Affected: All files referencing old database modules after refactoring

12. Missing Configuration Attributes

Problem: AttributeError: 'Settings' object has no attribute 'X'

# ❌ Settings class missing required attributes
class Settings(BaseSettings):
    database_url: str
    jwt_secret_key: str
    # Missing: openai_api_key, qdrant_url, etc.

# ✅ Fix: Add all required attributes with defaults
class Settings(BaseSettings):
    # Core
    database_url: str = "sqlite:///./database/auth.db"
    jwt_secret_key: str = "your-secret-key"

    # OpenAI (for RAG features)
    openai_api_key: Optional[str] = Field(default=None)
    openai_model: str = "gpt-4o-mini"
    openai_embedding_model: str = "text-embedding-3-small"

    # Qdrant (for vector search)
    qdrant_url: Optional[str] = Field(default=None)
    qdrant_api_key: Optional[str] = Field(default=None)

    # RAG settings
    chunk_size: int = 512
    chunk_overlap: int = 50
    batch_size: int = 32
    max_context_messages: int = 10

Root Cause: Settings class refactored but main.py still references old attributes.

Files Affected: src/core/config.py, main.py

13. Undefined Global Variables in Scripts

Problem: NameError: name 'DATABASE_URL' is not defined in init scripts

# ❌ Using undefined global variable
print(f"Initializing database at: {DATABASE_URL}")

# ✅ Fix: Use Settings object
from src.core.config import settings
print(f"Initializing database at: {settings.database_url_sync}")

Files Affected: init_database.py, startup scripts

14. HuggingFace Spaces Docker Build Issues

Problem: Docker build fails with various errors on HuggingFace Spaces

Error	Cause	Solution
`OSError: Readme file does not exist: README.md`	pyproject.toml references README.md but Dockerfile doesn't copy it	`COPY pyproject.toml README.md ./` before `pip install -e .`
`ModuleNotFoundError: No module named 'X'`	Outdated import paths after refactoring	Update all imports to new module structure
`AttributeError: 'Settings' object has no attribute 'X'`	Settings class missing attributes	Add all required attributes to Settings class
`NameError: name 'VAR' is not defined`	Using undefined global variables	Use `from src.core.config import settings` and access via settings object
`Config file '.env' not found`	Missing .env file (warning only)	Ensure all required env vars set in HF Space secrets
`AttributeError: 'AsyncSession' object has no attribute 'query'`	Using sync query() with AsyncSession	Use `await db.execute(select(Model))` instead of `db.query(Model)`
`asyncpg.exceptions._base.InterfaceError: connection is closed`	Database connection pool giving stale connections	Add `pool_pre_ping=True` and reduce `pool_recycle` to 1800 for Neon

15. Database Initialization in Async Context

Problem: Trying to use async functions in sync context during startup

# ❌ Wrong: Calling async function without await
async def create_all_tables():
    await conn.run_sync(Base.metadata.create_all)

# In startup sync context:
create_all_tables()  # Doesn't actually create tables!

# ✅ Fix: Use sync engine for startup
from src.core.database import sync_engine, Base
Base.metadata.create_all(sync_engine)

# OR use async properly:
import asyncio
asyncio.create_task(create_all_tables())  # Fire and forget

Files Affected: main.py lifespan function, init_database.py

16. AsyncSession Query Method Error (Runtime)

Problem: AttributeError: 'AsyncSession' object has no attribute 'query'

# ❌ Wrong: Using sync query() method with AsyncSession
@router.get("/users")
async def get_users(db: AsyncSession = Depends(get_async_db)):
    users = db.query(User).all()  # Error: AsyncSession has no 'query'
    return users

# ✅ Fix: Use select() with execute() for async
from sqlalchemy import select

@router.get("/users")
async def get_users(db: AsyncSession = Depends(get_async_db)):
    result = await db.execute(select(User))
    users = result.scalars().all()
    return users

# For single record:
result = await db.execute(select(User).filter(User.id == user_id))
user = result.scalar_one_or_none()

# For filtering:
result = await db.execute(
    select(User).filter(User.email == email)
)
user = result.scalar_one_or_none()

Common Async Patterns:

db.query(Model).filter(...).first() → result = await db.execute(select(Model).filter(...)); user = result.scalar_one_or_none()
db.query(Model).all() → result = await db.execute(select(Model)); users = result.scalars().all()
db.commit() → await db.commit()
db.refresh(obj) → await db.refresh(obj)
db.add(obj) → db.add(obj) (no await needed)
db.delete(obj) → await db.delete(obj) (if iterating) or use delete statement

Files Affected: All files using AsyncSession (routes, services, auth modules)

Critical: When converting from sync to async SQLAlchemy, ALL database operations must use the async pattern.

17. Database Connection Closed Error (Runtime)

Problem: asyncpg.exceptions._base.InterfaceError: connection is closed

Database session error: (sqlalchemy.dialects.postgresql.asyncpg.InterfaceError)
<class 'asyncpg.exceptions._base.InterfaceError'>: connection is closed

Cause: Database connection pool giving stale/closed connections, especially after idle periods.

Fix: Configure async engine with proper pool settings:

# ❌ Wrong: Missing pool_pre_ping and incorrect pool settings
async_engine = create_async_engine(
    settings.database_url_async,
    pool_size=5,
    max_overflow=10,
    pool_recycle=3600,  # Too long for Neon's 5-min idle timeout
    # Missing pool_pre_ping
)

# ✅ Fix: Add pool_pre_ping and optimize settings for async
async_engine = create_async_engine(
    settings.database_url_async,
    echo=settings.debug,
    pool_size=3,  # Reduced from 5 for async
    max_overflow=5,  # Reduced from 10
    pool_timeout=30,
    pool_recycle=1800,  # 30 min (reduced from 3600 for Neon's idle timeout)
    pool_pre_ping=True,  # CRITICAL: Verify connections before use
    connect_args={
        "server_settings": {
            "application_name": "ai_book_backend",
            "timezone": "utc"
        },
        "command_timeout": 60,
        # Note: SSL configured via DATABASE_URL (sslmode=require)
    },
    pool_use_lifo=True,  # Use LIFO to reduce stale connections
    pool_drop_on_rollback=False,
)

Common Issues:

Neon PostgreSQL idle timeout: Free tier closes connections after 5 minutes of inactivity
Missing pool_pre_ping: Connections become stale but pool reuses them
SSL misconfiguration: Setting ssl directly in connect_args doesn't work with asyncpg
Pool too large: Async connections use more resources, keep pool smaller

Files Affected: src/core/database.py

Critical for Neon PostgreSQL: Reduce pool_recycle to 1800 (30 min) or less, and always use pool_pre_ping=True.

HuggingFace Spaces Deployment: Complete Guide

Critical Requirements

1. README.md with YAML Frontmatter (REQUIRED)

---
title: AI Book Backend
emoji: 🤖
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: "3.11"
app_file: main.py
pinned: false
license: mit
---

Must be at ROOT of repository with YAML at the TOP.

2. Dockerfile Requirements

# MUST copy README.md before pip install
COPY pyproject.toml README.md ./
RUN pip install --no-cache-dir -e .

# Not just:
COPY pyproject.toml ./  # ❌ Will fail if pyproject.toml has readme field

3. Environment Variables (Set in Space Settings)

JWT_SECRET_KEY=your-super-secret-jwt-key-at-least-32-chars
DATABASE_URL=sqlite:///./database/auth.db
ALLOWED_ORIGINS=https://your-frontend.github.io,https://huggingface.co

4. Import Path Consistency All Python imports must use the new module structure:

# Old (broken):
from database.config import get_db
from auth.auth import verify_token

# New (working):
from src.core.database import get_async_db
from src.core.security import verify_token

5. Database Session Types

# For async endpoints (most FastAPI routes):
from src.core.database import get_async_db
from sqlalchemy.ext.asyncio import AsyncSession

@router.get("/users")
async def get_users(db: AsyncSession = Depends(get_async_db)):
    result = await db.execute(select(User))
    return result.scalars().all()

# For sync operations (migrations, scripts):
from src.core.database import get_sync_db, sync_engine
from sqlalchemy.orm import Session

def run_migration():
    Base.metadata.create_all(sync_engine)

Common Startup Sequence Failures

Pattern 1: Import Errors

File "/app/main.py", line 56, in <module>
    from routes import auth
File "/app/routes/auth.py", line 9, in <module>
    from database.config import get_db
ModuleNotFoundError: No module named 'database.config'

Solution: Update ALL import paths across the codebase.

Pattern 2: Attribute Errors

AttributeError: 'Settings' object has no attribute 'openai_api_key'

Solution: Add missing attributes to src/core/config.py Settings class.

Pattern 3: Database Initialization Errors

NameError: name 'DATABASE_URL' is not defined

Solution: Import settings and use settings.database_url_sync.

Pattern 4: AsyncSession Query Errors (Runtime)

AttributeError: 'AsyncSession' object has no attribute 'query'

Solution: Convert all database queries to async pattern using select():

# Replace db.query() with:
from sqlalchemy import select
result = await db.execute(select(Model).filter(...))
item = result.scalar_one_or_none()

Pattern 5: Database Connection Closed (Runtime)

asyncpg.exceptions._base.InterfaceError: connection is closed

Solution: Add pool_pre_ping=True to async engine and reduce pool_recycle:

async_engine = create_async_engine(
    settings.database_url_async,
    pool_pre_ping=True,  # Verify connections before use
    pool_recycle=1800,   # 30 min (for Neon's idle timeout)
    pool_use_lifo=True,  # Use most recent connections first
)

Production Deployment Checklist for HuggingFace Spaces

Before Pushing:

README.md has YAML frontmatter at ROOT
Dockerfile copies README.md before pip install
All import paths updated to new structure
Settings class has all required attributes
Environment variables documented in .env.hf-template

In HuggingFace Space Settings:

Set JWT_SECRET_KEY (required)
Set DATABASE_URL (defaults to sqlite if not set)
Set ALLOWED_ORIGINS (your frontend domain)
Set OPENAI_API_KEY (if using RAG features)
Set QDRANT_URL and QDRANT_API_KEY (if using vector search)

After Deployment:

Check logs for startup errors
Test /health endpoint
Visit /docs for Swagger UI
Test authentication endpoints
Verify CORS with frontend requests

Troubleshooting HuggingFace Spaces

Issue: "Config error" in Space UI

Fix: Add YAML frontmatter to README.md

Issue: Build fails at pip install

Fix: Ensure Dockerfile copies README.md with pyproject.toml

Issue: Module import errors

Fix: Update all import paths from old structure to new src.core.* structure

Issue: AttributeError on startup

Fix: Add missing configuration to Settings class

Issue: Database initialization fails

Fix: Use sync operations in init scripts, ensure proper imports

Deployment Checklist

Pre-Deployment

Verify branch names in workflows match actual branches
Test Docker build locally: docker build -t test .
Run container locally: docker run -p 7860:7860 test
Check all environment variables are documented
Validate API endpoints with health checks
Test CORS configuration in browser dev tools

Deployment

Ensure secrets are configured in GitHub
Monitor build logs for errors
Verify deployment URL accessibility
Test critical user flows
Check error logs in production

Post-Deployment

Set up monitoring/alerting
Document rollback procedure
Update API documentation
Notify stakeholders of deployment

Platform-Specific Configurations

HuggingFace Spaces

# README.md frontmatter for HF Spaces
---
title: Your App Title
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
---

GitHub Pages

# docusaurus.config.ts for GitHub Pages
baseUrl: '/your-repo-name/',
organizationName: 'your-username',
projectName: 'your-repo',
deploymentBranch: 'gh-pages',

Environment Variables

Create .env.example:

# Required
OPENAI_API_KEY=your_key_here
QDRANT_URL=your_qdrant_url

# Optional
OPENAI_MODEL=gpt-4o-mini
NODE_ENV=production

Troubleshooting Guide

"HF_TOKEN not provided"

Check GitHub repository settings > Secrets
Verify secret name matches exactly: HF_TOKEN
Ensure workflow has permissions to access secrets

Docker "Permission denied"

Install packages before creating non-root user
Use --system flag with uv/pip
Set proper file ownership: chown -R user:user /app

CORS Errors

Add frontend domain to CORS origins
Check browser network tab for preflight requests
Verify API endpoint URLs are correct

Application won't start

Check health endpoint: curl /health
Verify all environment variables
Check application logs for startup errors

Scripts Directory

Include deployment helper scripts:

# scripts/deploy.sh
#!/bin/bash
set -e

echo "Starting deployment..."

# Build and test locally
docker build -t app .
docker run -d -p 7860:7860 --name test-app app
sleep 5
curl -f http://localhost:7860/health || exit 1
docker stop test-app

# Push to registry
echo "Deployment test passed!"

Monitoring Setup

Always include basic monitoring:

# Add to main.py
import structlog

logger = structlog.get_logger()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time

    logger.info(
        "request_processed",
        method=request.method,
        url=str(request.url),
        status_code=response.status_code,
        process_time=process_time
    )

    return response

Security Considerations

Never commit secrets: Use environment variables
Use HTTPS in production: Configure SSL certificates
Implement rate limiting: Prevent abuse
Validate inputs: Sanitize all user inputs
Regular updates: Keep dependencies updated

Rolling Back Deployments

# Git rollback
git revert <commit-hash>
git push origin master

# Or if using tags
git checkout previous-tag
git push -f origin master

Remember: The goal is not just to deploy, but to deploy reliably and maintainably. Test thoroughly, monitor continuously, and always have a rollback plan.

Install Skill

SKILL.md