Claude Code Plugins

hugging-face-space-deployer

@48Nauts-Operator/opencode-baseline

Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: hugging-face-space-deployer
description: Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.

Hugging Face Space Deployer

A skill for AI engineers to create, configure, and deploy interactive ML demos on Hugging Face Spaces.

CRITICAL: Pre-Deployment Checklist

Before writing ANY code, gather this information about the model:

1. Check Model Type (LoRA Adapter vs Full Model)

Use the HF MCP tool to inspect the model files:

hf-skills - Hub Repo Details (repo_ids: ["username/model"], repo_type: "model")

Look for these indicators:

| Files Present | Model Type | Action Required |
| --- | --- | --- |
| model.safetensors or pytorch_model.bin | Full model | Load directly with AutoModelForCausalLM |
| adapter_model.safetensors + adapter_config.json | LoRA/PEFT adapter | Must load the base model first, then apply the adapter with peft |
| Only config files, no weights | Broken/incomplete | Ask the user to verify |

If adapter_config.json exists, check for base_model_name_or_path to identify the base model.
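
If the HF MCP tool is not available, the same check can be done with the huggingface_hub Python API. A minimal sketch (the repo id is a placeholder for the model you are inspecting):

import json

from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "username/model"  # placeholder

files = list_repo_files(repo_id, repo_type="model")

FULL_WEIGHT_FILES = {
    "model.safetensors", "pytorch_model.bin",
    "model.safetensors.index.json", "pytorch_model.bin.index.json",  # sharded checkpoints
}
has_full_weights = any(f in FULL_WEIGHT_FILES for f in files)
is_adapter = "adapter_config.json" in files

if is_adapter:
    # LoRA/PEFT adapter: read which base model it was trained on
    config_path = hf_hub_download(repo_id, "adapter_config.json")
    with open(config_path) as fp:
        base_model = json.load(fp).get("base_model_name_or_path")
    print(f"LoRA adapter on base model {base_model} -> Template 3")
elif has_full_weights:
    print("Full model -> Template 1 (Inference API) or Template 2 (ZeroGPU)")
else:
    print("No weight files found -> ask the user to verify the model")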

2. Check Inference API Availability

Visit the model page on the HF Hub and look for the "Inference Providers" widget on the right side.

Indicators that model HAS Inference API:

  • Inference widget visible on model page
  • Model from known provider: meta-llama, mistralai, HuggingFaceH4, google, stabilityai, Qwen
  • High download count (>10,000) with standard architecture

Indicators that model DOES NOT have Inference API:

  • Personal namespace (e.g., GhostScientist/my-model)
  • LoRA/PEFT adapter (adapters never have direct Inference API)
  • Missing pipeline_tag in model metadata
  • No inference widget on model page

3. Check Model Metadata

  • Ensure pipeline_tag is set (e.g., text-generation)
  • Add conversational tag for chat models
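
Checks 2 and 3 can also be done programmatically. A minimal sketch using huggingface_hub's model_info (the repo id is a placeholder; treat the results as heuristics, since HF ultimately decides which models get serverless inference):

from huggingface_hub import model_info

info = model_info("username/model")  # placeholder

print("pipeline_tag:", info.pipeline_tag)                        # None means the metadata is missing
print("conversational:", "conversational" in (info.tags or []))  # chat-model tag
print("downloads:", info.downloads)

if info.pipeline_tag is None:
    print("Missing pipeline_tag -> Inference API unlikely; see 'Fixing Missing pipeline_tag' below")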

4. Determine Hardware Needs

| Model Size | Recommended Hardware |
| --- | --- |
| < 3B parameters | ZeroGPU (free) or CPU |
| 3B - 7B parameters | ZeroGPU or T4 |
| > 7B parameters | A10G or A100 |

5. Ask User If Unclear

If you cannot determine the model type, ASK THE USER:

"I'm analyzing your model to determine the best deployment strategy. I found:

  • [what you found about files]
  • [what you found about inference API]

Is this model:

  1. A full model you trained/uploaded?
  2. A LoRA/PEFT adapter on top of another model?
  3. Something else?

Also, would you prefer:

  A. Free deployment with ZeroGPU (may have queue times)
  B. A paid GPU for faster responses (~$0.60/hr)"

Hardware Options

| Hardware | Use Case | Cost |
| --- | --- | --- |
| cpu-basic | Simple demos, Inference API apps | Free |
| cpu-upgrade | Faster CPU inference | ~$0.03/hr |
| zero-a10g | Models needing GPU on-demand (recommended for most) | Free (with quota) |
| t4-small | Small GPU models (<7B) | ~$0.60/hr |
| t4-medium | Medium GPU models | ~$0.90/hr |
| a10g-small | Large models (7B-13B) | ~$1.50/hr |
| a10g-large | Very large models (30B+) | ~$3.15/hr |
| a100-large | Largest models | ~$4.50/hr |

ZeroGPU Note: ZeroGPU (zero-a10g) provides free GPU access on-demand. The Space runs on CPU, and when a user triggers inference, a GPU is allocated temporarily (~60-120 seconds). After deployment, you must manually set the runtime to "ZeroGPU" in Space Settings > Hardware.

Deployment Decision Tree

Analyze Model
│
├── Does it have adapter_config.json?
│   └── YES → It's a LoRA adapter
│       ├── Find base_model_name_or_path in adapter_config.json
│       └── Use Template 3 (LoRA + ZeroGPU)
│
├── Does it have model.safetensors or pytorch_model.bin?
│   └── YES → It's a full model
│       ├── Is it from a major provider with inference widget?
│       │   ├── YES → Use Inference API (Template 1)
│       │   └── NO → Use ZeroGPU (Template 2)
│
└── Neither found?
    └── ASK USER - model may be incomplete

Dependencies

For Inference API (cpu-basic, free):

gradio>=5.0.0
huggingface_hub>=0.26.0

For ZeroGPU full models (zero-a10g, free with quota):

gradio>=5.0.0
torch
transformers
accelerate
spaces

For ZeroGPU LoRA adapters (zero-a10g, free with quota):

gradio>=5.0.0
torch
transformers
accelerate
spaces
peft

CLI Commands (CORRECT Syntax)

# Create Space
hf repo create my-space-name --repo-type space --space-sdk gradio

# Upload files
hf upload username/space-name ./local-folder --repo-type space

# Download model files to inspect
hf download username/model-name --local-dir ./model-check --dry-run

# Check what files exist in a model
hf download username/model-name --local-dir /tmp/check --dry-run 2>&1 | grep -E '\.(safetensors|bin|json)'

Template 1: Inference API (For Supported Models)

Use when: The model has an inference widget, is from a major provider, or explicitly supports the serverless Inference API.

import gradio as gr
from huggingface_hub import InferenceClient

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"  # Must support Inference API!
client = InferenceClient(MODEL_ID)

def respond(message, history, system_message, max_tokens, temperature, top_p):
    messages = [{"role": "system", "content": system_message}]

    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    response = ""
    for token in client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        delta = token.choices[0].delta.content or ""
        response += delta
        yield response

demo = gr.ChatInterface(
    respond,
    title="Chat Assistant",
    description="Powered by Hugging Face Inference API",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message"),
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Write a Python function to sort a list"],
    ],
)

if __name__ == "__main__":
    demo.launch()

requirements.txt:

gradio>=5.0.0
huggingface_hub>=0.26.0

README.md:

---
title: My Chat App
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
---
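
A possible deploy sequence for this template using the huggingface_hub Python API, as an alternative to the CLI commands shown earlier (the Space name and local folder are illustrative; app.py, requirements.txt, and README.md are assumed to be in ./my-chat-app):

from huggingface_hub import create_repo, upload_folder

# Create the Space (no-op if it already exists, thanks to exist_ok=True)
space = create_repo(
    "username/my-chat-app",
    repo_type="space",
    space_sdk="gradio",
    exist_ok=True,
)

# Push app.py, requirements.txt, and README.md in one commit
upload_folder(repo_id=space.repo_id, folder_path="./my-chat-app", repo_type="space")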

Template 2: ZeroGPU Full Model (For Models Without Inference API)

Use when: Full model (has model.safetensors) but no Inference API support.

import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "username/my-full-model"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Global model - loaded lazily on first GPU call for faster Space startup
model = None

def load_model():
    global model
    if model is None:
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
    return model

@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()

    messages = [{"role": "system", "content": system_message}]

    for user_msg, assistant_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )
    return response

demo = gr.ChatInterface(
    generate_response,
    title="My Model",
    description="Powered by ZeroGPU (free!)",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
        gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Help me write some code"],
    ],
)

if __name__ == "__main__":
    demo.launch()

requirements.txt:

gradio>=5.0.0
torch
transformers
accelerate
spaces

README.md:

---
title: My Model
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---

Template 3: ZeroGPU LoRA Adapter (CRITICAL FOR FINE-TUNED MODELS)

Use when: The model has adapter_config.json and adapter_model.safetensors (NOT model.safetensors).

You MUST identify the base model from the base_model_name_or_path field in adapter_config.json.

import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Your LoRA adapter
ADAPTER_ID = "username/my-lora-adapter"
# Base model (from adapter_config.json -> base_model_name_or_path)
BASE_MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B-Instruct"

# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# Global model - loaded lazily on first GPU call
model = None

def load_model():
    global model
    if model is None:
        base_model = AutoModelForCausalLM.from_pretrained(
            BASE_MODEL_ID,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
        model = model.merge_and_unload()  # Merge for faster inference
    return model

@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
    model = load_model()

    messages = [{"role": "system", "content": system_message}]

    for item in history:
        if isinstance(item, (list, tuple)) and len(item) == 2:
            user_msg, assistant_msg = item
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if assistant_msg:
                messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=int(max_tokens),
            temperature=float(temperature),
            top_p=float(top_p),
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )
    return response

demo = gr.ChatInterface(
    generate_response,
    title="My Fine-Tuned Model",
    description="LoRA fine-tuned model powered by ZeroGPU (free!)",
    additional_inputs=[
        gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
        gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
        gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
    examples=[
        ["Hello! How are you?"],
        ["Help me with a coding task"],
    ],
)

if __name__ == "__main__":
    demo.launch()

requirements.txt (MUST include peft):

gradio>=5.0.0
torch
transformers
accelerate
spaces
peft

README.md:

---
title: My Fine-Tuned Model
emoji: 🔧
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---

Post-Deployment Steps

After uploading your Space files:

1. Set the Runtime Hardware (REQUIRED for GPU models)

  • Go to: https://huggingface.co/spaces/USERNAME/SPACE_NAME/settings
  • Under "Space Hardware", select the appropriate option:
    • ZeroGPU for free on-demand GPU (recommended)
    • Or a dedicated GPU tier if needed (a programmatic alternative is sketched below)
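
If you prefer to set the hardware from code rather than through the Settings UI, huggingface_hub exposes request_space_hardware; a minimal sketch (ZeroGPU assignment may also depend on your account's eligibility, so verify the change in Settings afterwards):

from huggingface_hub import HfApi, SpaceHardware

api = HfApi()  # uses your HF token from the environment or local cache

# Request ZeroGPU; pass e.g. SpaceHardware.T4_SMALL for a dedicated GPU instead
api.request_space_hardware("USERNAME/SPACE_NAME", SpaceHardware.ZERO_A10G)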

2. Verify the Space is Running

  • Check the Space URL for any build errors
  • Review container logs in Settings if issues occur

3. Common Post-Deploy Fixes

| Issue | Cause | Fix |
| --- | --- | --- |
| "No API found" error | Hardware mismatch | Set the runtime to ZeroGPU in Settings |
| Model not loading | LoRA vs full model confusion | Check whether it is an adapter and use the correct template |
| Inference API errors | Model not on serverless | Load directly with transformers instead |

Detecting Model Type - Quick Reference

Full Model

Files include: model.safetensors, pytorch_model.bin, or sharded versions

# Can load directly
model = AutoModelForCausalLM.from_pretrained("username/model")

LoRA/PEFT Adapter

Files include: adapter_config.json, adapter_model.safetensors

# Must load base model first, then apply adapter
base_model = AutoModelForCausalLM.from_pretrained("base-model-id")
model = PeftModel.from_pretrained(base_model, "username/adapter")
model = model.merge_and_unload()  # Optional: merge for faster inference

Inference API Available

The model page shows the "Inference Providers" widget on the right side.

# Can use InferenceClient (simplest approach)
from huggingface_hub import InferenceClient
client = InferenceClient("username/model")

Fixing Missing pipeline_tag (To Enable Inference API)

If a model doesn't have an inference widget but should, it may be missing metadata:

# Download the README
hf download username/model-name README.md --local-dir /tmp/fix

# Edit to add pipeline_tag in YAML frontmatter:
# ---
# pipeline_tag: text-generation
# tags:
# - conversational
# ---

# Upload the fix
hf upload username/model-name /tmp/fix/README.md README.md
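
A Python alternative to the CLI fix above is huggingface_hub's metadata_update helper (repo id is a placeholder; pass overwrite=True if the field already exists with a different value):

from huggingface_hub import metadata_update

metadata_update(
    "username/model-name",  # placeholder
    {"pipeline_tag": "text-generation", "tags": ["conversational"]},
    repo_type="model",
)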

Note: Even with correct tags, custom models may not get Inference API - it depends on HF's infrastructure decisions.

CRITICAL: Gradio 5.x Requirements

Examples Format (MUST be nested lists)

# CORRECT:
examples=[
    ["Example 1"],
    ["Example 2"],
]

# WRONG (causes ValueError):
examples=[
    "Example 1",
    "Example 2",
]

Version Requirements

gradio>=5.0.0
huggingface_hub>=0.26.0

Do NOT use gradio==4.44.0; it causes ImportError: cannot import name 'HfFolder'.

Troubleshooting

"No API found" Error

Cause: The Gradio app isn't exposing its API correctly, often due to a hardware mismatch.
Fix: Go to Space Settings and set the runtime to "ZeroGPU" or an appropriate GPU tier.

"OSError: does not appear to have a file named pytorch_model.bin, model.safetensors"

Cause: Trying to load a LoRA adapter as a full model.
Fix: Check for adapter_config.json; if present, use PEFT to load:

from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base_model, "adapter-id")

Inference API Not Available

Cause: The model doesn't have a pipeline_tag or isn't deployed to serverless.
Fix: Either add pipeline_tag: text-generation to the model's README.md, or load the model directly with transformers instead of InferenceClient.

ImportError: cannot import name 'HfFolder'

Cause: gradio/huggingface_hub version mismatch.
Fix: Use gradio>=5.0.0 and huggingface_hub>=0.26.0.

ValueError: examples must be nested list

Cause: Gradio 5.x format change.
Fix: Use [["ex1"], ["ex2"]], not ["ex1", "ex2"].

Space builds but model doesn't load

Cause: Missing peft for adapters, or the wrong base model.
Fix: Check adapter_config.json for the correct base_model_name_or_path.

Workflow Summary

  1. Analyze model (check for adapter_config.json, model files, inference widget)
  2. Determine strategy (Inference API vs ZeroGPU, full model vs LoRA)
  3. Ask user if unclear about model type or cost preferences
  4. Generate correct template based on analysis
  5. Create Space with correct requirements and README
  6. Upload files using hf upload
  7. Set hardware in Space Settings (ZeroGPU for free GPU access)
  8. Monitor build logs for any issues