
SKILL.md

```yaml
name: embedding-engine
description: Embedding backends (InsightFace/PyTorch+ONNXRuntime vs TensorRT). Use when optimizing embedding throughput or debugging drift/fallbacks.
```

Embedding Engine Skill

Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.

When to Use

  • Embedding pipeline running slowly
  • Need to switch between PyTorch and TensorRT
  • Debugging embedding drift between backends
  • Building/caching TensorRT engines
  • Verifying ONNXRuntime/CoreML provider selection on macOS (see the sketch after this list)
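
For the provider-selection check, here is a minimal sketch against the public onnxruntime API (the model path is a placeholder):

```python
# List the execution providers this ONNXRuntime build can use
# (CoreMLExecutionProvider should appear on capable macOS wheels).
import onnxruntime as ort

print(ort.get_available_providers())

# A session's effective providers can differ from what was requested,
# so check them after construction:
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # falls back to CPU if CoreML is unavailable
```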

Sub-agents

| Sub-agent | Purpose |
| --- | --- |
| PyTorchEmbeddingSubagent | Reference ArcFace (training/validation) |
| TensorRTEmbeddingSubagent | GPU-optimized TensorRT inference |
| ONNXEmbeddingSubagent | Future ONNXRuntime C++ service (planned) |

Current Backends

  • pytorch (default): ArcFace via the insightface Python package (used by tools/episode_run.py)
  • tensorrt (optional): TensorRT engine build + inference via FEATURES/arcface_tensorrt/

Key Skills

Embed faces with the configured backend

Run embedding with the configured backend (same interface as the pipeline).

```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512), L2-normalized
```
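
Because encode() returns L2-normalized rows, a plain dot product is already a cosine similarity. Continuing from the snippet above, a hypothetical follow-up matching the embeddings against a reference gallery (the .npy filename is made up):

```python
import numpy as np

gallery = np.load("gallery_embeddings.npy")  # (M, 512), hypothetical reference set
scores = embeddings @ gallery.T              # (N, M) cosine similarities (unit rows)
best_match = scores.argmax(axis=1)           # gallery index per face crop
```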

Build a TensorRT engine from ONNX

```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```
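
For context on what a build involves, here is a generic TensorRT 8.x sketch of the ONNX parse → FP16 build → serialize flow. This is not the FEATURES/arcface_tensorrt code; the input tensor name "input" and the 112x112 shape are assumptions to verify against the ONNX model.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("models/arcface_r100_v1.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1024 << 20)  # 1024 MB
config.set_flag(trt.BuilderFlag.FP16)

# Dynamic batch profile: min 1, opt/max 32 (input name and 112x112 shape assumed)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 112, 112), (32, 3, 112, 112), (32, 3, 112, 112))
config.add_optimization_profile(profile)

with open("arcface_r100-fp16.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```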

Compare TensorRT vs PyTorch embeddings (parity + speedup)

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

This uses FEATURES/arcface_tensorrt/src/embedding_compare.py and reports cosine similarity + L2 distance stats.
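
For reference, the parity math behind that report reduces to something like this sketch (the function name is mine; both arrays are assumed to be (N, 512) and L2-normalized, as the pipeline guarantees):

```python
import numpy as np

def parity_stats(emb_ref: np.ndarray, emb_trt: np.ndarray) -> dict:
    cos = np.sum(emb_ref * emb_trt, axis=1)         # rowwise cosine (unit rows)
    l2 = np.linalg.norm(emb_ref - emb_trt, axis=1)  # rowwise L2 distance
    return {
        "cosine_mean": float(cos.mean()),
        "cosine_min": float(cos.min()),
        "l2_mean": float(l2.mean()),
        "l2_max": float(l2.max()),
    }
```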

Config Reference

File: config/pipeline/embedding.yaml

| Key | Default | Description |
| --- | --- | --- |
| embedding.backend | pytorch | Backend: pytorch or tensorrt |
| embedding.tensorrt_config | config/pipeline/arcface_tensorrt.yaml | TensorRT config path |
| validation.max_drift_cosine | 0.001 | Drift tolerance (behavior depends on runtime) |
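
As a sketch of how a caller might read these knobs, assuming the file is plain YAML with the keys above (requires PyYAML; verify the structure against the actual file):

```python
import yaml

with open("config/pipeline/embedding.yaml") as f:
    cfg = yaml.safe_load(f)

backend = cfg["embedding"]["backend"]              # "pytorch" or "tensorrt"
trt_config = cfg["embedding"]["tensorrt_config"]   # path to arcface_tensorrt.yaml
max_drift = cfg["validation"]["max_drift_cosine"]  # e.g. 0.001
```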

File: config/pipeline/arcface_tensorrt.yaml

| Key | Default | Description |
| --- | --- | --- |
| arcface_tensorrt.enabled | false | Sandbox feature flag (engine must exist) |
| tensorrt.precision | fp16 | Engine precision |
| tensorrt.max_batch_size | 32 | Max batch size for engine build |
| tensorrt.workspace_size_mb | 1024 | TensorRT workspace size (MB) |
| tensorrt.engine_s3_bucket | null | Optional S3 bucket for engines |

Engine Storage

TensorRT engines are specific to the GPU architecture they were built for, so one plan file is stored per SM version in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```

Naming convention: {model_name}-{precision}-sm{arch}.plan
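
A sketch of deriving the engine key for the current GPU from that convention; engine_key is a hypothetical helper, while torch.cuda.get_device_capability() is the real PyTorch call for the SM version:

```python
import torch

def engine_key(model_name: str = "arcface_r100", precision: str = "fp16") -> str:
    major, minor = torch.cuda.get_device_capability()  # e.g. (8, 6) on RTX 30xx
    return f"{model_name}-{precision}-sm{major}{minor}.plan"

# engine_key() -> "arcface_r100-fp16-sm86.plan" on an RTX 3090
```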

Common Issues

"Engine not found" / TensorRT backend won’t load

Cause: No engine has been built for the current GPU architecture, or the engine config does not match

Fix: Build the engine locally:

```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

Embedding drift too high

Cause: FP16 quantization or TensorRT graph optimizations

Check: Run the parity comparison:

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

Fix: Switch to FP32 precision:

```yaml
tensorrt:
  precision: fp32  # default is fp16
```

TensorRT slower than expected / falling back

Cause: Inputs are not batched, the engine was built with suboptimal shapes or precision, or the backend silently fell back

Check: Ensure config/pipeline/embedding.yaml has embedding.backend: tensorrt and re-run with --mode benchmark.

Fix: Increase the batch sizes and confirm the GPU backend is active:

```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```

Out of GPU memory

Cause: Engine workspace too large

Check: Watch nvidia-smi during inference

Fix: Reduce the workspace size:

```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```

Benchmark Reference

| Backend | Batch | Throughput | Latency (per batch) | VRAM |
| --- | --- | --- | --- | --- |
| PyTorch | 32 | ~50 fps | ~640 ms | 2 GB |
| TensorRT FP16 | 32 | ~250 fps | ~128 ms | 1 GB |
| TensorRT FP32 | 32 | ~180 fps | ~178 ms | 1.5 GB |
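
These figures are hardware- and build-specific. To reproduce them on your own GPU, a minimal timing harness along these lines works with any backend returned by get_embedding_backend (face_crops is assumed to be a preloaded batch of 32 aligned crops):

```python
import time

def throughput_fps(embedder, face_crops, iters: int = 50) -> float:
    embedder.encode(face_crops)  # warm-up: engine/context initialization
    start = time.perf_counter()
    for _ in range(iters):
        embedder.encode(face_crops)
    elapsed = time.perf_counter() - start
    return iters * len(face_crops) / elapsed  # faces per second
```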

Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```
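
The 0.9995 drift_vs_pytorch value reads like a cosine similarity rather than a drift (confirm against embedding_compare.py). Under that assumption, a hypothetical gate against validation.max_drift_cosine would look like:

```python
import json

with open("diagnostic.json") as f:  # placeholder path for the output above
    diag = json.load(f)

# Assumes drift_vs_pytorch is a cosine similarity, so drift = 1 - value
drift = 1.0 - diag["validation"]["drift_vs_pytorch"]
if drift > 0.001:  # validation.max_drift_cosine
    raise RuntimeError(f"embedding drift {drift:.4f} exceeds tolerance")
```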

Key Files

| File | Purpose |
| --- | --- |
| tools/episode_run.py | Pipeline embedding backend selection (get_embedding_backend) |
| FEATURES/arcface_tensorrt/src/tensorrt_builder.py | Engine build/cache + optional S3 |
| FEATURES/arcface_tensorrt/src/tensorrt_inference.py | TensorRT inference wrapper |
| FEATURES/arcface_tensorrt/src/embedding_compare.py | Parity + speedup comparison utilities |
| config/pipeline/embedding.yaml | Backend selection + validation knobs |
| config/pipeline/arcface_tensorrt.yaml | TensorRT builder/runtime config |
| FEATURES/arcface_tensorrt/tests/test_tensorrt_embedding.py | Unit tests (synthetic) |
| tests/ml/test_arcface_embeddings.py | ML-gated embedding invariants |

Related Skills