| name | ml-api-endpoint |
| description | ML API expert. Use for model serving, inference endpoints, FastAPI, and ML deployment. |
# ML API Endpoint Expert
Expert in designing and deploying machine learning API endpoints.
## Core Principles

### API Design
- Stateless Design: Each request contains all necessary information
- Consistent Response Format: Standardize success/error structures (see the envelope sketch after this list)
- Versioning Strategy: Plan for model updates
- Input Validation: Rigorous validation before inference
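A minimal sketch of such a response envelope, assuming a Pydantic model; the `APIResponse`/`APIError` names and fields are illustrative, not a fixed contract:

```python
from typing import Any

from pydantic import BaseModel


class APIError(BaseModel):
    code: str       # machine-readable, e.g. "VALIDATION_ERROR"
    message: str    # human-readable explanation for clients


class APIResponse(BaseModel):
    success: bool
    data: Any = None               # payload on success
    error: APIError | None = None  # populated only on failure
    model_version: str             # makes the serving model explicit in every reply
```

Every endpoint, including error paths, then returns the same shape, so clients and monitoring code only have to parse one structure.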
## FastAPI Implementation

### Basic ML Endpoint
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, validator
import joblib
import numpy as np
import uuid

app = FastAPI(title="ML Model API", version="1.0.0")

model = None


@app.on_event("startup")
async def load_model():
    # Load the serialized model once at startup instead of on every request
    global model
    model = joblib.load("model.pkl")


class PredictionInput(BaseModel):
    features: list[float]

    @validator('features')
    def validate_features(cls, v):
        if len(v) != 10:
            raise ValueError('Expected 10 features')
        return v


class PredictionResponse(BaseModel):
    prediction: float
    confidence: float | None = None
    model_version: str
    request_id: str


@app.post("/predict", response_model=PredictionResponse)
async def predict(input_data: PredictionInput):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    features = np.array([input_data.features])
    prediction = model.predict(features)[0]
    return PredictionResponse(
        prediction=float(prediction),
        model_version="v1",
        request_id=str(uuid.uuid4()),  # unique id for tracing this request
    )
```
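One way to exercise the endpoint locally is FastAPI's `TestClient`. This sketch assumes the code above lives in `main.py` and that a serialized `model.pkl` trained on 10 features is available:

```python
from fastapi.testclient import TestClient

from main import app  # assumes the endpoint code above lives in main.py


def test_predict_contract():
    # The context manager runs the startup event, so the model gets loaded
    with TestClient(app) as client:
        payload = {"features": [0.1] * 10}  # exactly 10 features, as validated
        response = client.post("/predict", json=payload)
        assert response.status_code == 200
        body = response.json()
        assert "prediction" in body and "request_id" in body

        # Wrong feature count is rejected by the Pydantic validator
        bad = client.post("/predict", json={"features": [0.1, 0.2]})
        assert bad.status_code == 422
```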
### Batch Prediction
```python
class BatchInput(BaseModel):
    instances: list[list[float]]

    @validator('instances')
    def validate_batch_size(cls, v):
        if len(v) > 100:
            raise ValueError('Batch size cannot exceed 100')
        return v


@app.post("/predict/batch")
async def batch_predict(input_data: BatchInput):
    features = np.array(input_data.instances)
    predictions = model.predict(features)
    return {
        "predictions": predictions.tolist(),
        "count": len(predictions),
    }
```
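Because the endpoint caps batches at 100 instances, callers with larger workloads need to chunk on the client side. A hypothetical helper using `httpx` (the `api_url` parameter and chunking policy are assumptions, not part of the API above):

```python
import httpx


def predict_in_chunks(instances: list[list[float]], api_url: str, chunk_size: int = 100) -> list[float]:
    """Split a large workload into batches the API will accept."""
    predictions: list[float] = []
    with httpx.Client(timeout=30.0) as client:
        for start in range(0, len(instances), chunk_size):
            chunk = instances[start:start + chunk_size]  # respects the 100-instance limit
            response = client.post(f"{api_url}/predict/batch", json={"instances": chunk})
            response.raise_for_status()
            predictions.extend(response.json()["predictions"])
    return predictions
```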
## Performance Optimization

### Model Caching
```python
import hashlib
import time


class ModelCache:
    """In-memory TTL cache keyed by a hash of the input features."""

    def __init__(self, ttl_seconds=300):
        self.cache = {}
        self.ttl = ttl_seconds

    def get(self, features):
        key = hashlib.md5(str(features).encode()).hexdigest()
        if key in self.cache:
            result, timestamp = self.cache[key]
            if time.time() - timestamp < self.ttl:
                return result
        return None

    def set(self, features, prediction):
        key = hashlib.md5(str(features).encode()).hexdigest()
        self.cache[key] = (prediction, time.time())
```
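A sketch of wiring this cache into a prediction route, reusing `app`, `model`, and the Pydantic models from the basic endpoint above; the `/predict/cached` path is illustrative, and a production setup might prefer Redis or `functools.lru_cache`:

```python
prediction_cache = ModelCache(ttl_seconds=300)


@app.post("/predict/cached", response_model=PredictionResponse)
async def predict_cached(input_data: PredictionInput):
    cached = prediction_cache.get(input_data.features)
    if cached is None:
        features = np.array([input_data.features])
        cached = float(model.predict(features)[0])
        prediction_cache.set(input_data.features, cached)  # store for later hits
    return PredictionResponse(
        prediction=cached,
        model_version="v1",
        request_id=str(uuid.uuid4()),
    )
```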
## Health Checks
@app.get("/health")
async def health_check():
return {
"status": "healthy",
"model_loaded": model is not None
}
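Orchestrators often want a separate readiness signal that fails until the model is loaded. A minimal sketch; the `/ready` path and the 503 convention are assumptions, not part of the original API:

```python
from fastapi.responses import JSONResponse


@app.get("/ready")
async def readiness_check():
    # Report not-ready until the startup hook has loaded the model,
    # so load balancers stop routing traffic to a cold instance.
    if model is None:
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {"status": "ready"}
```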
@app.get("/metrics")
async def get_metrics():
return {
"requests_total": request_counter,
"prediction_latency_avg": avg_latency,
"error_rate": error_rate
}
## Docker Deployment
```dockerfile
# Python 3.10+ is required for the `float | None` annotations used above
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
## Best Practices
- Use async/await for I/O operations
- Validate data types, ranges, and business rules
- Cache predictions for deterministic models
- Handle model failures with fallback responses (see the sketch after this list)
- Log predictions, latencies, and errors
- Support multiple model versions
- Set memory and CPU limits
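For the fallback point above, a sketch of wrapping inference so a model failure degrades gracefully instead of surfacing a bare 500. It reuses the definitions from the basic endpoint, and the fallback value and route name are illustrative:

```python
import logging

logger = logging.getLogger("ml-api")

FALLBACK_PREDICTION = 0.0  # e.g. a historical mean or a business-approved default


@app.post("/predict/safe", response_model=PredictionResponse)
async def predict_with_fallback(input_data: PredictionInput):
    features = np.array([input_data.features])
    try:
        prediction = float(model.predict(features)[0])
    except Exception:
        # Log the failure and serve a degraded-but-valid response
        logger.exception("model inference failed, returning fallback")
        prediction = FALLBACK_PREDICTION
    return PredictionResponse(
        prediction=prediction,
        model_version="v1",
        request_id=str(uuid.uuid4()),
    )
```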