---
name: embedding-engine
description: Embedding backends (InsightFace/PyTorch+ONNXRuntime vs TensorRT). Use when optimizing embedding throughput or debugging drift/fallbacks.
---
# Embedding Engine Skill

Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.
## When to Use
- Embedding pipeline running slowly
- Need to switch between PyTorch and TensorRT
- Debugging embedding drift between backends
- Building/caching TensorRT engines
- Verifying ONNXRuntime/CoreML provider selection (macOS)
## Sub-agents
| Sub-agent | Purpose |
|---|---|
| PyTorchEmbeddingSubagent | Reference ArcFace (training/validation) |
| TensorRTEmbeddingSubagent | GPU-optimized TRT inference |
| ONNXEmbeddingSubagent | Future ONNXRuntime C++ service (planned) |
## Current Backends

- `pytorch` (default): ArcFace via the `insightface` Python package (used by `tools/episode_run.py`)
- `tensorrt` (optional): TensorRT engine build + inference via `FEATURES/arcface_tensorrt/`
## Key Skills
### Embed faces with the configured backend

Run embedding with the configured backend (same interface as the pipeline):
```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512) L2-normalized
```
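Downstream code can rely on the unit-norm rows that `encode` returns. A quick sanity check of that invariant, using random vectors as a stand-in for real embedding output (NumPy only, no pipeline dependencies):

```python
import numpy as np

# Stand-in for embedder.encode(face_crops): random (N, 512) rows, L2-normalized
embeddings = np.random.randn(4, 512).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Every row has unit L2 norm, so cosine similarity reduces to a dot product
norms = np.linalg.norm(embeddings, axis=1)
cos_01 = float(embeddings[0] @ embeddings[1])  # scalar in [-1, 1]
```

Because the rows are unit-length, downstream similarity search never needs a separate normalization step.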
### Build a TensorRT engine from ONNX

```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```
### Compare TensorRT vs PyTorch embeddings (parity + speedup)

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

This uses `FEATURES/arcface_tensorrt/src/embedding_compare.py` and reports cosine similarity + L2 distance stats.
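The two statistics can be reproduced by hand. A minimal sketch, assuming two `(N, 512)` L2-normalized arrays; this mirrors the idea, not the actual `embedding_compare.py` implementation:

```python
import numpy as np

def parity_stats(a: np.ndarray, b: np.ndarray) -> dict:
    """Rowwise cosine similarity and L2 distance; rows assumed L2-normalized."""
    cos = np.sum(a * b, axis=1)          # rowwise dot product == cosine for unit vectors
    l2 = np.linalg.norm(a - b, axis=1)   # rowwise Euclidean distance
    return {
        "cos_mean": float(cos.mean()),
        "cos_min": float(cos.min()),
        "l2_mean": float(l2.mean()),
        "l2_max": float(l2.max()),
    }

# Identical inputs give cosine 1.0 and L2 distance 0.0
x = np.random.randn(8, 512)
x /= np.linalg.norm(x, axis=1, keepdims=True)
stats = parity_stats(x, x)
```

For unit vectors the two metrics are linked (`l2**2 == 2 * (1 - cos)`), so either one alone is enough to bound the other.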
## Config Reference

File: `config/pipeline/embedding.yaml`

| Key | Default | Description |
|---|---|---|
| `embedding.backend` | `pytorch` | Backend: `pytorch` or `tensorrt` |
| `embedding.tensorrt_config` | `config/pipeline/arcface_tensorrt.yaml` | TensorRT config path |
| `validation.max_drift_cosine` | `0.001` | Drift tolerance (behavior depends on runtime) |
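`max_drift_cosine` reads naturally as a bound on `1 - cosine_similarity`. A hypothetical check under that assumption (illustrative only, not the pipeline's actual validation code):

```python
def drift_ok(cos_sim: float, max_drift_cosine: float = 0.001) -> bool:
    """Drift measured as 1 - cosine similarity must stay under the tolerance."""
    return (1.0 - cos_sim) <= max_drift_cosine

# A backend pair at cosine 0.9995 has drift 0.0005, within the 0.001 default;
# cosine 0.998 (drift 0.002) exceeds it.
```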
File: `config/pipeline/arcface_tensorrt.yaml`

| Key | Default | Description |
|---|---|---|
| `arcface_tensorrt.enabled` | `false` | Sandbox feature flag (engine must exist) |
| `tensorrt.precision` | `fp16` | Engine precision |
| `tensorrt.max_batch_size` | `32` | Max batch for engine build |
| `tensorrt.workspace_size_mb` | `1024` | TRT workspace size (MB) |
| `tensorrt.engine_s3_bucket` | `null` | Optional S3 bucket for cached engines |
## Engine Storage

TensorRT engines are GPU-architecture specific and are stored in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```

Naming convention: `{model_name}-{precision}-sm{arch}.plan`
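Given that convention, the expected filename can be derived from the GPU's compute capability (on CUDA machines, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair). `engine_filename` below is a hypothetical helper, not part of the codebase:

```python
def engine_filename(model_name: str, precision: str, capability: tuple) -> str:
    """Build the {model_name}-{precision}-sm{arch}.plan name from a compute capability."""
    major, minor = capability
    return f"{model_name}-{precision}-sm{major}{minor}.plan"

# On an RTX 30xx (compute capability 8.6):
# engine_filename("arcface_r100", "fp16", (8, 6)) -> "arcface_r100-fp16-sm86.plan"
```

Resolving the name at startup makes the "engine not found" failure mode below easy to diagnose: log the expected filename and check whether it exists in the bucket.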
## Common Issues
"Engine not found" / TensorRT backend won’t load
Cause: No engine built for the current GPU / config mismatch
Fix: Build locally:
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
### Embedding drift too high

Cause: FP16 quantization or TensorRT optimization changes.

Check: run the parity compare:

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

Fix: use FP32 precision:

```yaml
tensorrt:
  precision: fp32  # default is fp16
```
### TensorRT slower than expected / falling back

Cause: not batching, an engine built with suboptimal shapes/precision, or the backend silently fell back to PyTorch.

Check: ensure `config/pipeline/embedding.yaml` has `embedding.backend: tensorrt` and re-run with `--mode benchmark`.

Fix: increase the batch sizes and confirm the GPU backend is actually in use:

```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```
### Out of GPU memory

Cause: engine workspace too large.

Check: `nvidia-smi` during inference.

Fix: reduce the workspace:

```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```
## Benchmark Reference
| Backend | Batch | Throughput | Latency | VRAM |
|---|---|---|---|---|
| PyTorch | 32 | ~50 fps | ~640ms | 2GB |
| TensorRT FP16 | 32 | ~250 fps | ~128ms | 1GB |
| TensorRT FP32 | 32 | ~180 fps | ~178ms | 1.5GB |
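The latency column follows from batch size over throughput. A quick check of the arithmetic, using the numbers from the table above:

```python
# latency_ms ≈ batch_size / throughput_fps * 1000, per row of the benchmark table
rows = {
    "PyTorch":       (32, 50),
    "TensorRT FP16": (32, 250),
    "TensorRT FP32": (32, 180),
}
derived = {name: round(batch / fps * 1000) for name, (batch, fps) in rows.items()}
# derived == {"PyTorch": 640, "TensorRT FP16": 128, "TensorRT FP32": 178}
```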
## Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```
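A caller consuming this JSON might gate on the `validation` block. A hedged sketch, assuming `drift_vs_pytorch` is reported as a cosine similarity (as the `0.9995` value suggests) and that the schema matches the example; `passed_validation` is a hypothetical helper:

```python
import json

def passed_validation(diag_json: str, min_similarity: float = 0.999) -> bool:
    """Check the diagnostic's validation block; keys mirror the example above."""
    diag = json.loads(diag_json)
    val = diag.get("validation", {})
    return (val.get("regression_test") == "passed"
            and val.get("drift_vs_pytorch", 0.0) >= min_similarity)
```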
## Key Files

| File | Purpose |
|---|---|
| `tools/episode_run.py` | Pipeline embedding backend selection (`get_embedding_backend`) |
| `FEATURES/arcface_tensorrt/src/tensorrt_builder.py` | Engine build/cache + optional S3 |
| `FEATURES/arcface_tensorrt/src/tensorrt_inference.py` | TensorRT inference wrapper |
| `FEATURES/arcface_tensorrt/src/embedding_compare.py` | Parity + speedup compare utilities |
| `config/pipeline/embedding.yaml` | Backend selection + validation knobs |
| `config/pipeline/arcface_tensorrt.yaml` | TensorRT builder/runtime config |
| `FEATURES/arcface_tensorrt/tests/test_tensorrt_embedding.py` | Unit tests (synthetic) |
| `tests/ml/test_arcface_embeddings.py` | ML-gated embedding invariants |
## Related Skills
- pipeline-insights - General pipeline debugging
- face-alignment - Alignment before embedding