using-ml-production

@tachyon-beep/skillpacks
Router skill directing to deployment, optimization, MLOps, and monitoring guides.

SKILL.md

name: using-ml-production
description: Router skill directing to deployment, optimization, MLOps, and monitoring guides.
mode: true

Using ML Production

Overview

This meta-skill routes you to the right production deployment skill based on your concern. Load this when you need to move ML models to production but aren't sure which specific aspect to address.

Core Principle: Production concerns fall into four categories. Identify the concern first, then route to the appropriate skill. Tools and infrastructure choices are implementation details, not routing criteria.

When to Use

Load this skill when:

  • Deploying ML models to production
  • Optimizing model inference (speed, size, cost)
  • Setting up MLOps workflows (tracking, automation, CI/CD)
  • Monitoring or debugging production models
  • User mentions: "production", "deploy", "serve model", "MLOps", "monitoring", "optimize inference"

Don't use for: Training optimization (use training-optimization), model architecture selection (use neural-architectures), PyTorch infrastructure (use pytorch-engineering)

Routing by Concern

Category 1: Model Optimization

Symptoms: "Model too slow", "inference latency high", "model too large", "need to optimize for edge", "reduce model size", "speed up inference"

When to route here:

  • Model itself is the bottleneck (not infrastructure)
  • Need to reduce model size or increase inference speed
  • Deploying to resource-constrained hardware (edge, mobile)
  • Cost optimization through model efficiency

Routes to:

  • quantization-for-inference - Reduce precision (INT8/INT4), speed up inference
  • model-compression-techniques - Pruning, distillation, architecture optimization
  • hardware-optimization-strategies - GPU/CPU/edge tuning, batch sizing

Key question to ask: "Is the MODEL the bottleneck, or is it infrastructure/serving?"
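
One way to answer that question is to time the model call separately from the rest of the request path. The sketch below uses only the standard library. The stage names and the 50% threshold are assumptions chosen for illustration, not part of this skill pack.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], None]], repeats: int = 5) -> Dict[str, float]:
    """Return the average wall-clock seconds spent in each named stage."""
    totals = {name: 0.0 for name in stages}
    for _ in range(repeats):
        for name, fn in stages.items():
            start = time.perf_counter()
            fn()
            totals[name] += time.perf_counter() - start
    return {name: total / repeats for name, total in totals.items()}


def model_share(timings: Dict[str, float], model_stage: str = "model") -> float:
    """Fraction of total request time spent inside the model stage."""
    total = sum(timings.values())
    return timings[model_stage] / total if total > 0 else 0.0


def suggest_category(share: float, threshold: float = 0.5) -> str:
    """Assumed rule of thumb: the model is the bottleneck if it dominates the request."""
    if share >= threshold:
        return "Category 1: Model Optimization"
    return "Category 2: Serving Infrastructure"


if __name__ == "__main__":
    # Hypothetical stages; replace the sleeps with real preprocessing, inference, and postprocessing.
    timings = time_stages({
        "preprocess": lambda: time.sleep(0.001),
        "model": lambda: time.sleep(0.01),
        "postprocess": lambda: time.sleep(0.001),
    })
    print(suggest_category(model_share(timings)))
```

A timing split like this is only a first pass. Queueing, network, and batching delays sit outside the process and need profiling at the serving layer.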


Category 2: Serving Infrastructure

Symptoms: "How to serve model", "need API endpoint", "deploy to production", "containerize model", "scale serving", "load balancing", "traffic management"

When to route here:

  • Need to expose model as API or service
  • Questions about serving patterns (REST, gRPC, batch)
  • Deployment strategies (gradual rollout, A/B testing)
  • Scaling concerns (traffic, replicas, autoscaling)

Routes to:

  • model-serving-patterns - FastAPI, TorchServe, gRPC, ONNX, batching, containerization
  • deployment-strategies - A/B testing, canary, shadow mode, rollback procedures
  • scaling-and-load-balancing - Horizontal scaling, autoscaling, load balancing, cost optimization

Key distinction:

  • Serving patterns = HOW to expose model (API, container, batching)
  • Deployment strategies = HOW to roll out safely (gradual, testing, rollback)
  • Scaling = HOW to handle traffic (replicas, autoscaling, balancing)
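
To make the "deployment strategies" side of this distinction concrete, here is a minimal sketch of a canary split that works regardless of how the model is served. The function names and the hash-based bucketing are illustrative assumptions, not the approach prescribed by deployment-strategies.

```python
import hashlib


def bucket(request_id: str, buckets: int = 100) -> int:
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def choose_version(request_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of requests to the canary, the rest to stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(request_id) < canary_percent else "stable"


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(1000)]
    canary = sum(choose_version(r, 10) == "canary" for r in ids)
    print(f"{canary} of {len(ids)} requests routed to canary")
```

Hashing the request ID keeps each caller on the same version between requests, and setting `canary_percent` to 0 acts as an immediate rollback.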

Category 3: MLOps Tooling

Symptoms: "Track experiments", "version models", "automate deployment", "reproducibility", "CI/CD for ML", "feature store", "model registry", "experiment management"

When to route here:

  • Need workflow/process improvements
  • Want to track experiments or version models
  • Need to automate training-to-deployment pipeline
  • Team collaboration and reproducibility concerns

Routes to:

  • experiment-tracking-and-versioning - MLflow, Weights & Biases, model registries, reproducibility, lineage
  • mlops-pipeline-automation - CI/CD for ML, feature stores, data validation, automated retraining, orchestration

Key distinction:

  • Experiment tracking = Research/development phase (track runs, version models)
  • Pipeline automation = Production phase (automate workflows, CI/CD)
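
The split between the two skills can be sketched with plain Python: a run record is what tracking stores, and a promotion gate is what a pipeline checks before deploying. This is a hypothetical illustration and does not use the MLflow or Weights & Biases APIs. All function and field names here are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Stable short hash of a run configuration (key order does not matter)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def make_run_record(config: Dict[str, Any], metrics: Dict[str, float], model_version: str) -> Dict[str, Any]:
    """Tracking side: bundle what would be stored for one experiment run."""
    return {
        "config": config,
        "config_fingerprint": config_fingerprint(config),
        "metrics": metrics,
        "model_version": model_version,
    }


def ready_to_promote(record: Dict[str, Any], metric: str, minimum: float) -> bool:
    """Automation side: promote only if the tracked metric clears the bar."""
    return record["metrics"].get(metric, float("-inf")) >= minimum


if __name__ == "__main__":
    record = make_run_record({"lr": 0.001, "seed": 7}, {"val_accuracy": 0.93}, "v3")
    print(record["config_fingerprint"], ready_to_promote(record, "val_accuracy", 0.9))
```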

Multi-concern: Queries like "track experiments AND automate deployment" → route to BOTH skills


Category 4: Observability

Symptoms: "Monitor production", "model degrading", "detect drift", "production debugging", "alert on failures", "model not working in prod", "performance issues in production"

When to route here:

  • Model already deployed, need to monitor or debug
  • Detecting production issues (drift, errors, degradation)
  • Setting up alerts and dashboards
  • Root cause analysis for production failures

Routes to:

  • production-monitoring-and-alerting - Metrics, drift detection, dashboards, alerts, SLAs
  • production-debugging-techniques - Error analysis, profiling, rollback procedures, post-mortems

Key distinction:

  • Monitoring = Proactive (set up metrics, alerts, detect issues early)
  • Debugging = Reactive (diagnose and fix existing issues)
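
As an example of the proactive side, a drift check compares live inputs against a reference sample and raises an alert before accuracy visibly drops. The sketch below uses the Population Stability Index (PSI) as one possible statistic. This document does not say which metric production-monitoring-and-alerting recommends, so the choice of PSI, the bin edges, and the alert threshold are all assumptions.

```python
import math
from typing import List, Sequence


def histogram_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin; values outside the edges are not counted."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            # Bins are half-open [lo, hi), except the last one, which includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    e_shares = histogram_shares(expected, edges)
    a_shares = histogram_shares(actual, edges)
    score = 0.0
    for e_i, a_i in zip(e_shares, a_shares):
        e_i, a_i = max(e_i, eps), max(a_i, eps)  # avoid log(0) for empty bins
        score += (a_i - e_i) * math.log(a_i / e_i)
    return score


def should_alert(score: float, threshold: float = 0.25) -> bool:
    """Hypothetical alert rule; tune the threshold for your own data."""
    return score >= threshold
```

Identical distributions give a score of zero, and the score grows as the live sample moves away from the reference.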

"Performance" ambiguity:

  • If "performance" = speed/latency → might be Category 1 (optimization) or Category 2 (serving/scaling)
  • If "performance" = accuracy degradation → Category 4 (observability - drift detection)
  • Ask clarifying question: "By performance, do you mean inference speed or model accuracy?"

Routing Decision Tree

User query → Identify primary concern

Is model THE problem (size/speed)?
  YES → Category 1: Model Optimization
  NO → Continue

Is it about HOW to expose/deploy model?
  YES → Category 2: Serving Infrastructure
  NO → Continue

Is it about workflow/process/automation?
  YES → Category 3: MLOps Tooling
  NO → Continue

Is it about monitoring/debugging in production?
  YES → Category 4: Observability
  NO → Ask clarifying question

Ambiguous? → Ask ONE question to clarify concern category
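
The tree above can be sketched as a small function that checks the four questions in order and stops at the first YES. The `Concern` class and its field names are hypothetical. They simply encode the answers to the tree's questions.

```python
from dataclasses import dataclass


@dataclass
class Concern:
    """Answers to the four decision-tree questions for one user query."""
    model_is_problem: bool = False         # size/speed of the model itself
    exposing_or_deploying: bool = False    # how to expose/deploy the model
    workflow_or_automation: bool = False   # process, tracking, CI/CD
    monitoring_or_debugging: bool = False  # issues with a model already in production


def route(concern: Concern) -> str:
    """Walk the questions in the order the tree lists them; the first YES wins."""
    if concern.model_is_problem:
        return "Category 1: Model Optimization"
    if concern.exposing_or_deploying:
        return "Category 2: Serving Infrastructure"
    if concern.workflow_or_automation:
        return "Category 3: MLOps Tooling"
    if concern.monitoring_or_debugging:
        return "Category 4: Observability"
    return "Ask clarifying question"


if __name__ == "__main__":
    print(route(Concern(exposing_or_deploying=True)))
    print(route(Concern()))
```

Like the tree, this returns only the primary concern. A query that answers YES to several questions still needs every relevant skill.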

Clarification Questions for Ambiguous Queries

Query: "My model is too slow"

Ask: "Is this inference latency (how fast predictions are), or training time?"

  • Training → Route to training-optimization (a different pack, outside ml-production)
  • Inference → Follow-up: "Have you profiled to find bottlenecks?"
    • Model is bottleneck → Category 1 (optimization)
    • Infrastructure/batching issue → Category 2 (serving)

Query: "I need to deploy my model"

Ask: "What's your deployment target - cloud server, edge device, or batch processing?"

  • Cloud/server → Category 2 (model-serving-patterns, then deployment-strategies if a gradual rollout is needed)
  • Edge/mobile → Category 1 (optimization first for size/speed) + Category 2 (serving)
  • Batch → Category 2 (model-serving-patterns, batch processing)

Query: "My model isn't performing well in production"

Ask: "By performance, do you mean inference speed or prediction accuracy?"

  • Speed → Category 1 (optimization) or Category 2 (serving/scaling)
  • Accuracy → Category 4 (observability - drift detection, monitoring)

Query: "Set up MLOps for my team"

Ask: "What's the current pain point - experiment tracking, automated deployment, or both?"

  • Tracking/versioning → Category 3 (experiment-tracking-and-versioning)
  • Automation/CI/CD → Category 3 (mlops-pipeline-automation)
  • Both → Route to BOTH skills

Multi-Concern Scenarios

Some queries span multiple categories. Route to ALL relevant skills in logical order:

| Scenario | Route Order | Why |
|---|---|---|
| "Optimize and deploy model" | 1. Optimization → 2. Serving | Optimize BEFORE deploying |
| "Deploy and monitor model" | 1. Serving → 2. Observability | Deploy BEFORE monitoring |
| "Track experiments and automate deployment" | 1. Experiment tracking → 2. Pipeline automation | Track BEFORE automating |
| "Quantize model and serve with TorchServe" | 1. Quantization → 2. Serving patterns | Optimize BEFORE serving |
| "Deploy with A/B testing and monitor" | 1. Deployment strategies → 2. Monitoring | Deploy strategy BEFORE monitoring |

Principle: Route in execution order (what needs to happen first).
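
A minimal sketch of this principle: give each skill a rank and sort the matched skills by it. The pairwise orderings are taken from the scenarios above (optimization before serving, serving before monitoring, tracking before automation). Where the MLOps skills sit relative to the rest is an assumption made only for illustration, as is the `RANK` table itself.

```python
from typing import Iterable, List

# Lower rank runs first. Skills sharing a rank keep the order they were given in.
RANK = {
    "experiment-tracking-and-versioning": 0,
    "mlops-pipeline-automation": 1,
    "quantization-for-inference": 2,
    "model-compression-techniques": 2,
    "hardware-optimization-strategies": 2,
    "model-serving-patterns": 3,
    "deployment-strategies": 4,
    "scaling-and-load-balancing": 4,
    "production-monitoring-and-alerting": 5,
    "production-debugging-techniques": 5,
}


def order_routes(skills: Iterable[str]) -> List[str]:
    """Deduplicate the matched skills and sort them into execution order."""
    unique = list(dict.fromkeys(skills))  # drops repeats, keeps first-seen order
    unknown = [s for s in unique if s not in RANK]
    if unknown:
        raise ValueError(f"unknown skill(s): {unknown}")
    return sorted(unique, key=RANK.__getitem__)  # sorted() is stable for ties


if __name__ == "__main__":
    print(order_routes([
        "production-monitoring-and-alerting",
        "model-serving-patterns",
        "quantization-for-inference",
    ]))
```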


Relationship with Other Packs

With llm-specialist

ml-production covers: General serving, quantization, deployment, monitoring (universal patterns)

llm-specialist covers: LLM-specific optimization (KV cache, prompt caching, speculative decoding, token streaming)

When to use both:

  • "Deploy LLM to production" → llm-specialist (for inference-optimization) + ml-production (for serving, monitoring)
  • "Quantize LLM" → llm-specialist (LLM-specific quantization patterns) OR ml-production (general quantization)

Rule of thumb: LLM-specific optimization stays in llm-specialist. General production patterns use ml-production.

With training-optimization

Clear boundary:

  • training-optimization = Training phase (convergence, hyperparameters, training speed)
  • ml-production = Inference phase (deployment, serving, monitoring)

"Too slow" disambiguation:

  • Training slow → training-optimization
  • Inference slow → ml-production

With pytorch-engineering

pytorch-engineering covers: Foundation (distributed training, profiling, memory management)

ml-production covers: Production-specific (serving APIs, deployment patterns, MLOps)

When to use both:

  • "Profile production inference" → pytorch-engineering (profiling techniques) + ml-production (production context)
  • "Optimize serving performance" → ml-production (serving patterns) + pytorch-engineering (if need low-level profiling)

Common Routing Mistakes

| Query | Wrong Route | Correct Route | Why |
|---|---|---|---|
| "Model too slow in production" | Immediately to quantization | Ask: inference or training? Then model vs infrastructure? | Could be serving/batching issue, not model |
| "Deploy with Kubernetes" | Defer to Kubernetes docs | Category 2: model-serving-patterns or deployment-strategies | Kubernetes is tool choice, not routing concern |
| "Set up MLOps" | Route to one skill | Ask about specific pain point, might be both tracking AND automation | MLOps spans multiple skills |
| "Performance issues" | Assume accuracy | Ask: speed or accuracy? | Performance is ambiguous |
| "We use TorchServe" | Skip routing | Still route to model-serving-patterns | Tool choice doesn't change routing |

Common Rationalizations (Don't Do These)

| Excuse | Reality |
|---|---|
| "User mentioned Kubernetes, route to deployment" | Tools are implementation details. Route by concern first. |
| "Slow = optimization, route to quantization" | Slow could be infrastructure. Clarify model vs serving bottleneck. |
| "They said deploy, must be model-serving-patterns" | Could need serving + deployment-strategies + monitoring. Don't assume single concern. |
| "MLOps = experiment tracking" | MLOps spans tracking AND automation. Ask which pain point. |
| "Performance obviously means speed" | Could mean accuracy. Clarify inference speed vs prediction quality. |
| "They're technical, skip clarification" | Technical users still benefit from clarifying questions. |

Red Flags Checklist

If you catch yourself thinking ANY of these, STOP and clarify:

  • "I'll guess optimization vs serving" → ASK which is the bottleneck
  • "Performance probably means speed" → ASK speed or accuracy
  • "Deploy = model-serving-patterns only" → Consider deployment-strategies and monitoring too
  • "They mentioned [tool], route based on tool" → Route by CONCERN, not tool
  • "MLOps = one skill" → Could span experiment tracking AND automation
  • "Skip question to save time" → Clarifying prevents wrong routing

When in doubt: Ask ONE clarifying question. 10 seconds of clarification prevents minutes of wrong-skill loading.


Routing Summary Table

| User Concern | Clarifying Question | Route To | Also Consider |
|---|---|---|---|
| Model slow/large | Inference or training? | Optimization skills | If inference, check serving too |
| Deploy model | Target (cloud/edge/batch)? | Serving patterns | Deployment strategies for gradual rollout |
| Production monitoring | Proactive or reactive? | Monitoring OR debugging | Both if setting up + fixing issues |
| MLOps setup | Tracking or automation? | Experiment tracking AND/OR automation | Often both needed |
| Performance issues | Speed or accuracy? | Optimization OR observability | Depends on clarification |
| Scale serving | Traffic pattern? | scaling-and-load-balancing | Serving patterns if not set up yet |

Integration Examples

Example 1: Full Production Pipeline

Query: "I trained a model, now I need to put it in production"

Routing:

  1. Ask: "What's your deployment target and are there performance concerns?"
  2. If "cloud deployment, model is fast enough":
    • model-serving-patterns (expose as API)
    • deployment-strategies (if gradual rollout needed)
    • production-monitoring-and-alerting (set up observability)
  3. If "edge device, model too large":
    • quantization-for-inference (reduce size first)
    • model-serving-patterns (edge deployment pattern)
    • production-monitoring-and-alerting (if possible on edge)

Example 2: Optimization Decision

Query: "My inference is slow"

Routing:

  1. Ask: "Have you profiled to find the bottleneck - is it the model or serving infrastructure?"
  2. If "not profiled yet":
    • production-debugging-techniques (profile first to diagnose)
    • Then route based on findings
  3. If "model is bottleneck":
    • hardware-optimization-strategies (check if hardware tuning helps)
    • If not enough → quantization-for-inference or model-compression-techniques
  4. If "infrastructure/batching is bottleneck":
    • model-serving-patterns (batching strategies)
    • scaling-and-load-balancing (if traffic-related)

Example 3: MLOps Maturity

Query: "We need better ML workflows"

Routing:

  1. Ask: "What's the current pain point - can't reproduce experiments, manual deployment, or both?"
  2. If "can't reproduce, need to track experiments":
    • experiment-tracking-and-versioning
  3. If "manual deployment is slow":
    • mlops-pipeline-automation
  4. If "both reproducibility and automation":
    • experiment-tracking-and-versioning (establish tracking first)
    • mlops-pipeline-automation (then automate workflow)

When NOT to Use ml-production Skills

Skip ml-production when:

  • Still designing/training model → Use neural-architectures, training-optimization
  • PyTorch infrastructure issues → Use pytorch-engineering
  • LLM-specific optimization only → Use llm-specialist (unless also need serving)
  • Classical ML deployment → ml-production still applies, but first confirm whether the model is gradient boosting/sklearn rather than a neural network

Red flag: If model isn't trained yet, probably don't need ml-production. Finish training first.


Success Criteria

You've routed correctly when:

  • ✅ Identified concern category (optimization, serving, MLOps, observability)
  • ✅ Asked clarifying question for ambiguous queries
  • ✅ Routed to appropriate skill(s) in logical order
  • ✅ Didn't let tool choices (Kubernetes, TorchServe) dictate routing
  • ✅ Recognized multi-concern scenarios and routed to multiple skills

References

  • See design doc: docs/plans/2025-10-30-ml-production-pack-design.md
  • Primary router: yzmir/ai-engineering-expert/using-ai-engineering
  • Related packs: llm-specialist/using-llm-specialist, training-optimization/using-training-optimization