---
name: hugging-face-space-deployer
description: Create, configure, and deploy Hugging Face Spaces for showcasing ML models. Supports Gradio, Streamlit, and Docker SDKs with templates for common use cases like chat interfaces, image generation, and model comparisons.
---
Hugging Face Space Deployer
A skill for AI engineers to create, configure, and deploy interactive ML demos on Hugging Face Spaces.
CRITICAL: Pre-Deployment Checklist
Before writing ANY code, gather this information about the model:
1. Check Model Type (LoRA Adapter vs Full Model)
Use the HF MCP tool to inspect the model files:
hf-skills - Hub Repo Details (repo_ids: ["username/model"], repo_type: "model")
Look for these indicators:
| Files Present | Model Type | Action Required |
|---|---|---|
| `model.safetensors` or `pytorch_model.bin` | Full model | Load directly with `AutoModelForCausalLM` |
| `adapter_model.safetensors` + `adapter_config.json` | LoRA/PEFT adapter | Must load the base model first, then apply the adapter with `peft` |
| Only config files, no weights | Broken/incomplete | Ask the user to verify |
If `adapter_config.json` exists, check its `base_model_name_or_path` field to identify the base model.
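A minimal sketch of this check from Python, assuming the `huggingface_hub` client and a placeholder repo id:

```python
import json

from huggingface_hub import hf_hub_download

ADAPTER_ID = "username/my-lora-adapter"  # placeholder - the repo being inspected

# Fetch only adapter_config.json and read the base model the adapter targets
config_path = hf_hub_download(ADAPTER_ID, "adapter_config.json")
with open(config_path) as f:
    adapter_config = json.load(f)

print("Base model:", adapter_config["base_model_name_or_path"])
```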
2. Check Inference API Availability
Visit the model page on HF Hub and look for "Inference Providers" widget on the right side.
Indicators that model HAS Inference API:
- Inference widget visible on model page
- Model from a known provider: `meta-llama`, `mistralai`, `HuggingFaceH4`, `google`, `stabilityai`, `Qwen`
- High download count (>10,000) with a standard architecture
Indicators that model DOES NOT have Inference API:
- Personal namespace (e.g., `GhostScientist/my-model`)
- LoRA/PEFT adapter (adapters never have a direct Inference API)
- Missing `pipeline_tag` in model metadata
- No inference widget on model page
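If the widget alone is inconclusive, one way to probe is a tiny request wrapped in a try/except (a sketch; the exact exception raised when a model isn't deployed serverless varies by `huggingface_hub` version):

```python
from huggingface_hub import InferenceClient

MODEL_ID = "username/model"  # placeholder

client = InferenceClient(MODEL_ID)
try:
    client.chat_completion([{"role": "user", "content": "ping"}], max_tokens=1)
    print("Serverless Inference API appears available")
except Exception as err:  # typically an HTTP error if the model isn't deployed serverless
    print(f"No serverless Inference API: {err}")
```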
3. Check Model Metadata
- Ensure `pipeline_tag` is set (e.g., `text-generation`)
- Add the `conversational` tag for chat models
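Both fields can be read programmatically (a small sketch with a placeholder repo id):

```python
from huggingface_hub import model_info

info = model_info("username/model")  # placeholder
print("pipeline_tag:", info.pipeline_tag)  # e.g. "text-generation"; None if unset
print("tags:", info.tags)  # chat models should include "conversational"
```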
4. Determine Hardware Needs
| Model Size | Recommended Hardware |
|---|---|
| < 3B parameters | ZeroGPU (free) or CPU |
| 3B - 7B parameters | ZeroGPU or T4 |
| > 7B parameters | A10G or A100 |
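When the model card doesn't state a size, the safetensors metadata often does (a sketch; this only works for safetensors repos, and assumes a `huggingface_hub` version that exposes `parameter_count`):

```python
from huggingface_hub import get_safetensors_metadata

meta = get_safetensors_metadata("username/model")  # placeholder
total = sum(meta.parameter_count.values())  # maps dtype -> parameter count
print(f"~{total / 1e9:.1f}B parameters")
```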
5. Ask User If Unclear
If you cannot determine the model type, ASK THE USER:
"I'm analyzing your model to determine the best deployment strategy. I found:
- [what you found about files]
- [what you found about inference API]
Is this model:
- A full model you trained/uploaded?
- A LoRA/PEFT adapter on top of another model?
- Something else?
Also, would you prefer:
A. Free deployment with ZeroGPU (may have queue times)
B. Paid GPU for faster response (~$0.60/hr)"
Hardware Options
| Hardware | Use Case | Cost |
|---|---|---|
| `cpu-basic` | Simple demos, Inference API apps | Free |
| `cpu-upgrade` | Faster CPU inference | ~$0.03/hr |
| `zero-a10g` | Models needing GPU on demand (recommended for most) | Free (with quota) |
| `t4-small` | Small GPU models (<7B) | ~$0.60/hr |
| `t4-medium` | Medium GPU models | ~$0.90/hr |
| `a10g-small` | Large models (7B-13B) | ~$1.50/hr |
| `a10g-large` | Very large models (30B+) | ~$3.15/hr |
| `a100-large` | Largest models | ~$4.50/hr |
ZeroGPU Note: ZeroGPU (zero-a10g) provides free GPU access on-demand. The Space runs on CPU, and when a user triggers inference, a GPU is allocated temporarily (~60-120 seconds). After deployment, you must manually set the runtime to "ZeroGPU" in Space Settings > Hardware.
Deployment Decision Tree
Analyze Model
│
├── Does it have adapter_config.json?
│ └── YES → It's a LoRA adapter
│ ├── Find base_model_name_or_path in adapter_config.json
│ └── Use Template 3 (LoRA + ZeroGPU)
│
├── Does it have model.safetensors or pytorch_model.bin?
│ └── YES → It's a full model
│ ├── Is it from a major provider with inference widget?
│ │ ├── YES → Use Inference API (Template 1)
│ │ └── NO → Use ZeroGPU (Template 2)
│
└── Neither found?
└── ASK USER - model may be incomplete
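The same tree as a file-name heuristic in Python (a sketch; it cannot detect Inference API support, so full models still need the widget check above):

```python
from huggingface_hub import list_repo_files

def pick_template(repo_id: str) -> str:
    files = set(list_repo_files(repo_id))
    if "adapter_config.json" in files:
        return "Template 3 (LoRA + ZeroGPU)"
    if any(f.endswith((".safetensors", ".bin")) for f in files):
        # Full model: Template 1 if it has an inference widget, else Template 2
        return "Template 1 or Template 2 (full model)"
    return "ASK USER - model may be incomplete"

print(pick_template("username/model"))  # placeholder
```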
Dependencies
For Inference API (cpu-basic, free):
gradio>=5.0.0
huggingface_hub>=0.26.0
For ZeroGPU full models (zero-a10g, free with quota):
gradio>=5.0.0
torch
transformers
accelerate
spaces
For ZeroGPU LoRA adapters (zero-a10g, free with quota):
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
CLI Commands (CORRECT Syntax)
# Create Space
hf repo create my-space-name --repo-type space --space-sdk gradio
# Upload files
hf upload username/space-name ./local-folder --repo-type space
# Download model files to inspect
hf download username/model-name --local-dir ./model-check --dry-run
# Check what files exist in a model
hf download username/model-name --local-dir /tmp/check --dry-run 2>&1 | grep -E '\.(safetensors|bin|json)'
Template 1: Inference API (For Supported Models)
Use when: Model has inference widget, is from major provider, or explicitly supports serverless API.
import gradio as gr
from huggingface_hub import InferenceClient
MODEL_ID = "HuggingFaceH4/zephyr-7b-beta" # Must support Inference API!
client = InferenceClient(MODEL_ID)
def respond(message, history, system_message, max_tokens, temperature, top_p):
messages = [{"role": "system", "content": system_message}]
for user_msg, assistant_msg in history:
if user_msg:
messages.append({"role": "user", "content": user_msg})
if assistant_msg:
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({"role": "user", "content": message})
response = ""
for token in client.chat_completion(
messages,
max_tokens=max_tokens,
stream=True,
temperature=temperature,
top_p=top_p,
):
delta = token.choices[0].delta.content or ""
response += delta
yield response
demo = gr.ChatInterface(
respond,
title="Chat Assistant",
description="Powered by Hugging Face Inference API",
additional_inputs=[
gr.Textbox(value="You are a helpful assistant.", label="System message"),
gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max tokens"),
gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
],
examples=[
["Hello! How are you?"],
["Write a Python function to sort a list"],
],
)
if __name__ == "__main__":
demo.launch()
requirements.txt:
gradio>=5.0.0
huggingface_hub>=0.26.0
README.md:
---
title: My Chat App
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
---
Template 2: ZeroGPU Full Model (For Models Without Inference API)
Use when: Full model (has model.safetensors) but no Inference API support.
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "username/my-full-model"
# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Global model - loaded lazily on first GPU call for faster Space startup
model = None
def load_model():
global model
if model is None:
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID,
torch_dtype=torch.float16,
device_map="auto",
)
return model
@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
model = load_model()
messages = [{"role": "system", "content": system_message}]
for user_msg, assistant_msg in history:
if user_msg:
messages.append({"role": "user", "content": user_msg})
if assistant_msg:
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({"role": "user", "content": message})
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=int(max_tokens),
temperature=float(temperature),
top_p=float(top_p),
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(
outputs[0][inputs['input_ids'].shape[1]:],
skip_special_tokens=True
)
return response
demo = gr.ChatInterface(
generate_response,
title="My Model",
description="Powered by ZeroGPU (free!)",
additional_inputs=[
gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
],
examples=[
["Hello! How are you?"],
["Help me write some code"],
],
)
if __name__ == "__main__":
demo.launch()
requirements.txt:
gradio>=5.0.0
torch
transformers
accelerate
spaces
README.md:
---
title: My Model
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---
Template 3: ZeroGPU LoRA Adapter (CRITICAL FOR FINE-TUNED MODELS)
Use when: Model has `adapter_config.json` and `adapter_model.safetensors` (NOT `model.safetensors`).
You MUST identify the base model from the `adapter_config.json` field `base_model_name_or_path`.
import gradio as gr
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Your LoRA adapter
ADAPTER_ID = "username/my-lora-adapter"
# Base model (from adapter_config.json -> base_model_name_or_path)
BASE_MODEL_ID = "Qwen/Qwen2.5-Coder-1.5B-Instruct"
# Load tokenizer at startup
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
# Global model - loaded lazily on first GPU call
model = None
def load_model():
global model
if model is None:
base_model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL_ID,
torch_dtype=torch.float16,
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model = model.merge_and_unload() # Merge for faster inference
return model
@spaces.GPU(duration=120)
def generate_response(message, history, system_message, max_tokens, temperature, top_p):
model = load_model()
messages = [{"role": "system", "content": system_message}]
for item in history:
if isinstance(item, (list, tuple)) and len(item) == 2:
user_msg, assistant_msg = item
if user_msg:
messages.append({"role": "user", "content": user_msg})
if assistant_msg:
messages.append({"role": "assistant", "content": assistant_msg})
messages.append({"role": "user", "content": message})
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=int(max_tokens),
temperature=float(temperature),
top_p=float(top_p),
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(
outputs[0][inputs['input_ids'].shape[1]:],
skip_special_tokens=True
)
return response
demo = gr.ChatInterface(
generate_response,
title="My Fine-Tuned Model",
description="LoRA fine-tuned model powered by ZeroGPU (free!)",
additional_inputs=[
gr.Textbox(value="You are a helpful assistant.", label="System message", lines=2),
gr.Slider(minimum=64, maximum=2048, value=512, step=64, label="Max tokens"),
gr.Slider(minimum=0.1, maximum=1.5, value=0.7, step=0.1, label="Temperature"),
gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
],
examples=[
["Hello! How are you?"],
["Help me with a coding task"],
],
)
if __name__ == "__main__":
demo.launch()
requirements.txt (MUST include peft):
gradio>=5.0.0
torch
transformers
accelerate
spaces
peft
README.md:
---
title: My Fine-Tuned Model
emoji: 🔧
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: zero-a10g
---
Post-Deployment Steps
After uploading your Space files:
1. Set the Runtime Hardware (REQUIRED for GPU models)
- Go to: `https://huggingface.co/spaces/USERNAME/SPACE_NAME/settings`
- Under "Space Hardware", select the appropriate option:
- ZeroGPU for free on-demand GPU (recommended)
- Or a dedicated GPU tier if needed
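Hardware can also be requested from Python via `HfApi.request_space_hardware` (a sketch; whether ZeroGPU can be assigned this way may depend on your account, so fall back to the Settings page if it fails):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes a write token, e.g. via the HF_TOKEN env var
api.request_space_hardware("USERNAME/SPACE_NAME", hardware="zero-a10g")

# Confirm what the Space is actually running on
print(api.get_space_runtime("USERNAME/SPACE_NAME").hardware)
```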
2. Verify the Space is Running
- Check the Space URL for any build errors
- Review container logs in Settings if issues occur
3. Common Post-Deploy Fixes
| Issue | Cause | Fix |
|---|---|---|
| "No API found" error | Hardware mismatch | Set runtime to ZeroGPU in Settings |
| Model not loading | LoRA vs full model confusion | Check if it's an adapter, use correct template |
| Inference API errors | Model not on serverless | Load directly with transformers instead |
Detecting Model Type - Quick Reference
Full Model
Files include: `model.safetensors`, `pytorch_model.bin`, or sharded versions
from transformers import AutoModelForCausalLM

# Can load directly
model = AutoModelForCausalLM.from_pretrained("username/model")
LoRA/PEFT Adapter
Files include: `adapter_config.json`, `adapter_model.safetensors`
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Must load the base model first, then apply the adapter
base_model = AutoModelForCausalLM.from_pretrained("base-model-id")
model = PeftModel.from_pretrained(base_model, "username/adapter")
model = model.merge_and_unload()  # Optional: merge for faster inference
Inference API Available
Model page shows "Inference Providers" widget on the right side
# Can use InferenceClient (simplest approach)
from huggingface_hub import InferenceClient
client = InferenceClient("username/model")
Fixing Missing pipeline_tag (To Enable Inference API)
If a model doesn't have an inference widget but should, it may be missing metadata:
# Download the README
hf download username/model-name README.md --local-dir /tmp/fix
# Edit to add pipeline_tag in YAML frontmatter:
# ---
# pipeline_tag: text-generation
# tags:
# - conversational
# ---
# Upload the fix
hf upload username/model-name /tmp/fix/README.md README.md
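The same fix from Python, using `huggingface_hub.metadata_update` (a sketch; `overwrite=True` is needed if the field already exists):

```python
from huggingface_hub import metadata_update

metadata_update(
    "username/model-name",  # placeholder
    {"pipeline_tag": "text-generation", "tags": ["conversational"]},
    overwrite=True,
)
```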
Note: Even with correct tags, custom models may not get Inference API - it depends on HF's infrastructure decisions.
CRITICAL: Gradio 5.x Requirements
Examples Format (MUST be nested lists)
# CORRECT:
examples=[
["Example 1"],
["Example 2"],
]
# WRONG (causes ValueError):
examples=[
"Example 1",
"Example 2",
]
Version Requirements
gradio>=5.0.0
huggingface_hub>=0.26.0
Do NOT use gradio==4.44.0 - causes ImportError: cannot import name 'HfFolder'
Troubleshooting
"No API found" Error
Cause: The Gradio app isn't exposing its API correctly, often due to a hardware mismatch
Fix: Go to Space Settings and set the runtime to "ZeroGPU" or an appropriate GPU tier
"OSError: does not appear to have a file named pytorch_model.bin, model.safetensors"
Cause: Trying to load a LoRA adapter as a full model
Fix: Check for adapter_config.json - if present, use PEFT to load:
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("base-model")
model = PeftModel.from_pretrained(base_model, "adapter-id")
Inference API Not Available
Cause: Model doesn't have pipeline_tag or isn't deployed to serverless
Fix: Either:
a. Add pipeline_tag: text-generation to the model's README.md, or
b. Load the model directly with transformers instead of InferenceClient
ImportError: cannot import name 'HfFolder'
Cause: gradio/huggingface_hub version mismatch
Fix: Use gradio>=5.0.0 and huggingface_hub>=0.26.0
ValueError: examples must be nested list
Cause: Gradio 5.x format change
Fix: Use [["ex1"], ["ex2"]] not ["ex1", "ex2"]
Space builds but model doesn't load
Cause: Missing peft for adapters, or wrong base model
Fix: Check adapter_config.json for correct base_model_name_or_path
Workflow Summary
- Analyze model (check for adapter_config.json, model files, inference widget)
- Determine strategy (Inference API vs ZeroGPU, full model vs LoRA)
- Ask user if unclear about model type or cost preferences
- Generate correct template based on analysis
- Create Space with correct requirements and README
- Upload files using `hf upload`
- Set hardware in Space Settings (ZeroGPU for free GPU access)
- Monitor build logs for any issues
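For reference, steps 5-7 can also be scripted (a sketch with placeholder ids, assuming the app files live in `./my-space` and a write token is configured):

```python
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("username/my-space", repo_type="space", space_sdk="gradio", exist_ok=True)
api.upload_folder(repo_id="username/my-space", folder_path="./my-space", repo_type="space")
# Then set hardware in Space Settings (ZeroGPU) and watch the build logs.
```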