| name | hugging-face-dataset-creator |
| description | Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, and streaming row updates. Designed to work alongside HF MCP server for comprehensive dataset workflows. |
Overview
This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, and content management. It is designed to complement the existing Hugging Face MCP server by providing dataset editing capabilities that the MCP server doesn't offer.
Integration with HF MCP Server
- Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
- Use This Skill for: Dataset creation, content editing, configuration management, and structured data formatting
Version
2.0.0
Dependencies
- huggingface_hub
- json (built-in)
- time (built-in)
Core Capabilities
1. Dataset Lifecycle Management
- Initialize: Create new dataset repositories with proper structure
- Configure: Store detailed configuration including system prompts and metadata
- Stream Updates: Add rows efficiently without downloading entire datasets
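One way to add rows without downloading the dataset is to upload each batch as a new JSONL shard. The sketch below assumes that approach; this document does not confirm that scripts/dataset_manager.py works this way. The data/ shard layout and the names rows_to_jsonl, shard_path, and append_rows are hypothetical. HfApi.upload_file is the real huggingface_hub call.

```python
import json
import time


def rows_to_jsonl(rows):
    """Serialize a list of dict rows to JSONL bytes (one JSON object per line)."""
    lines = (json.dumps(row, ensure_ascii=False) for row in rows)
    return "\n".join(lines).encode("utf-8")


def shard_path(timestamp=None):
    """Build a shard file name from a Unix timestamp (the data/ layout is an assumption)."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"data/rows_{ts}.jsonl"


def append_rows(repo_id, rows, token=None):
    """Upload rows as a new shard so existing files are never downloaded."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    return api.upload_file(
        path_or_fileobj=rows_to_jsonl(rows),
        path_in_repo=shard_path(),
        repo_id=repo_id,
        repo_type="dataset",
        commit_message=f"Add {len(rows)} rows",
    )
```

Because every call writes a new file, two uploads never edit the same file. The trade-off is many small shards, and a second-resolution timestamp can collide if two batches are uploaded within the same second.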
2. Multi-Format Dataset Support
Supports diverse dataset types through a template system:
- Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
- Text Classification: Sentiment analysis, intent detection, topic classification
- Question-Answering: Reading comprehension, factual QA, knowledge bases
- Text Completion: Language modeling, code completion, creative writing
- Tabular Data: Structured data for regression/classification tasks
- Custom Formats: Flexible schema definition for specialized needs
3. Quality Assurance Features
- JSON Validation: Ensures data integrity during uploads
- Batch Processing: Efficient handling of large datasets
- Error Recovery: Graceful handling of upload failures and conflicts
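The JSON validation and batch processing features above could look roughly like the following. This is a minimal sketch, not the script's actual code: parse_rows and batched are hypothetical names, and the default batch size of 1000 is an arbitrary assumption.

```python
import json


def parse_rows(rows_json):
    """Parse a --rows_json style string and require a JSON array of objects."""
    try:
        rows = json.loads(rows_json)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
        ) from exc
    if not isinstance(rows, list):
        raise ValueError("Expected a JSON array of row objects")
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"Row {index} is not a JSON object")
    return rows


def batched(rows, batch_size=1000):
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

Validating the whole payload before the first upload means a malformed row is reported before any batch is written.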
Usage Instructions
The skill includes a Python script, scripts/dataset_manager.py, that performs these operations.
Prerequisites
- huggingface_hub library must be installed via uv add huggingface_hub
- HF_TOKEN environment variable must be set with a Write-access token
- Activate virtual environment:
source .venv/bin/activate
Recommended Workflow
1. Discovery (Use HF MCP Server):
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
2. Creation (Use This Skill):
# Initialize new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
# Configure with detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
3. Content Management (Use This Skill):
# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
--repo_id "your-username/dataset-name" \
--template classification
# Add data with template validation
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json "$(cat your_qa_data.json)"
Template-Based Data Structures
1. Chat Template (--template chat)
{
"messages": [
{"role": "user", "content": "Natural user request"},
{"role": "assistant", "content": "Response with tool usage"},
{"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
],
"scenario": "Description of use case",
"complexity": "simple|intermediate|advanced"
}
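A row in the chat shape above can be checked before upload. The rules in this sketch are assumptions inferred from the example, not the script's documented validation: the accepted roles (including "system", which the example does not show), the requirement that tool messages carry tool_call_id, and the function name chat_row_errors are all hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}  # "system" is an assumption
ALLOWED_COMPLEXITY = {"simple", "intermediate", "advanced"}


def chat_row_errors(row):
    """Return a list of problems found in one chat-template row (empty if none)."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    errors = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"messages[{i}] is not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            errors.append(f"messages[{i}] has unknown role {role!r}")
        if not isinstance(msg.get("content"), str):
            errors.append(f"messages[{i}] is missing string 'content'")
        if role == "tool" and "tool_call_id" not in msg:
            errors.append(f"messages[{i}] is a tool message without 'tool_call_id'")
    if "complexity" in row and row["complexity"] not in ALLOWED_COMPLEXITY:
        errors.append(f"unknown complexity {row['complexity']!r}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a row at once.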
2. Classification Template (--template classification)
{
"text": "Input text to be classified",
"label": "classification_label",
"confidence": 0.95,
"metadata": {"domain": "technology", "language": "en"}
}
3. QA Template (--template qa)
{
"question": "What is the question being asked?",
"answer": "The complete answer",
"context": "Additional context if needed",
"answer_type": "factual|explanatory|opinion",
"difficulty": "easy|medium|hard"
}
4. Completion Template (--template completion)
{
"prompt": "The beginning text or context",
"completion": "The expected continuation",
"domain": "code|creative|technical|conversational",
"style": "description of writing style"
}
5. Tabular Template (--template tabular)
{
"columns": [
{"name": "feature1", "type": "numeric", "description": "First feature"},
{"name": "target", "type": "categorical", "description": "Target variable"}
],
"data": [
{"feature1": 123, "target": "class_a"},
{"feature1": 456, "target": "class_b"}
]
}
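For the tabular shape, a useful check is that every entry in "data" uses exactly the column names declared in "columns". The sketch below illustrates that idea under the assumption that both keys are present as shown above; tabular_errors is a hypothetical helper, not part of the script.

```python
def tabular_errors(table):
    """Check that every data row uses exactly the declared column names."""
    declared = [col["name"] for col in table.get("columns", [])]
    expected = set(declared)
    errors = []
    if len(expected) != len(declared):
        errors.append("duplicate column names in 'columns'")
    for i, row in enumerate(table.get("data", [])):
        present = set(row)
        missing = expected - present
        extra = present - expected
        if missing:
            errors.append(f"data[{i}] is missing {sorted(missing)}")
        if extra:
            errors.append(f"data[{i}] has undeclared {sorted(extra)}")
    return errors
```

This sketch compares names only; checking values against the declared "type" would need a mapping from "numeric" and "categorical" to concrete checks, which this document does not define.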
Advanced System Prompt Template
For high-quality training data generation:
You are an AI assistant expert at using MCP tools effectively.
## MCP SERVER DEFINITIONS
[Define available servers and tools]
## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]
## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]
## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]
Example Categories & Templates
The skill includes diverse training examples beyond just MCP usage:
Available Example Sets:
- training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
- diverse_training_examples.json - Broader scenarios including:
  - Educational Chat - Explaining programming concepts, tutorials
  - Git Workflows - Feature branches, version control guidance
  - Code Analysis - Performance optimization, architecture review
  - Content Generation - Professional writing, creative brainstorming
  - Codebase Navigation - Legacy code exploration, systematic analysis
  - Conversational Support - Problem-solving, technical discussions
Using Different Example Sets:
# Add MCP-focused examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/training_examples.json)"
# Add diverse conversational examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(cat examples/diverse_training_examples.json)"
# Mix both for comprehensive training data
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
--rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
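Where jq is not available, the same merge can be done in Python. This is a sketch that assumes each example file holds a top-level JSON array, as the jq expression above also requires; merge_example_files is a hypothetical helper name.

```python
import json
from pathlib import Path


def merge_example_files(*paths):
    """Concatenate the JSON arrays stored in each file, preserving order."""
    merged = []
    for path in paths:
        rows = json.loads(Path(path).read_text(encoding="utf-8"))
        if not isinstance(rows, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(rows)
    return merged
```

The result of json.dumps(merge_example_files("examples/training_examples.json", "examples/diverse_training_examples.json")) can then be passed as the --rows_json value.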
Commands Reference
List Available Templates:
python scripts/dataset_manager.py list_templates
Quick Setup (Recommended):
python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
Manual Setup:
# Initialize repository
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]
# Configure with system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"
# Add data with validation
python scripts/dataset_manager.py add_rows \
--repo_id "your-username/dataset-name" \
--template qa \
--rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
View Dataset Statistics:
python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
Error Handling
- Repository exists: Script will notify and continue with configuration
- Invalid JSON: Clear error message with parsing details
- Network issues: Automatic retry for transient failures
- Token permissions: Validation before operations begin
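The retry behaviour for transient network failures is not specified here. A common pattern is exponential backoff, sketched below under assumed defaults: three attempts, a one-second base delay, and ConnectionError/TimeoutError treated as retryable. The name with_retries is hypothetical and the script's real policy may differ.

```python
import time


def with_retries(operation, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error wait base_delay * 2**n, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
```

Errors outside retry_on, such as a permission failure from an invalid token, are raised immediately, since repeating the call would not help. The sleep parameter exists so the delay can be replaced in tests.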