name	hugging-face-dataset-creator
description	Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, and streaming row updates. Designed to work alongside HF MCP server for comprehensive dataset workflows.

Overview

This skill provides tools to manage datasets on the Hugging Face Hub with a focus on creation, configuration, and content management. It is designed to complement the existing Hugging Face MCP server by providing dataset editing capabilities that the MCP server doesn't offer.

Integration with HF MCP Server

Use HF MCP Server for: Dataset discovery, search, and metadata retrieval
Use This Skill for: Dataset creation, content editing, configuration management, and structured data formatting

Version

2.0.0

Dependencies

huggingface_hub
json (built-in)
time (built-in)

Core Capabilities

1. Dataset Lifecycle Management

Initialize: Create new dataset repositories with proper structure
Configure: Store detailed configuration including system prompts and metadata
Stream Updates: Add rows efficiently without downloading entire datasets

2. Multi-Format Dataset Support

Supports diverse dataset types through template system:

Chat/Conversational: Chat templating, multi-turn dialogues, tool usage examples
Text Classification: Sentiment analysis, intent detection, topic classification
Question-Answering: Reading comprehension, factual QA, knowledge bases
Text Completion: Language modeling, code completion, creative writing
Tabular Data: Structured data for regression/classification tasks
Custom Formats: Flexible schema definition for specialized needs

3. Quality Assurance Features

JSON Validation: Ensures data integrity during uploads
Batch Processing: Efficient handling of large datasets
Error Recovery: Graceful handling of upload failures and conflicts

Usage Instructions

The skill includes a Python script scripts/dataset_manager.py to perform operations.

Prerequisites

huggingface_hub library must be installed via uv add huggingface_hub
HF_TOKEN environment variable must be set with a Write-access token
Activate virtual environment: source .venv/bin/activate

Recommended Workflow

1. Discovery (Use HF MCP Server):

# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")

2. Creation (Use This Skill):

# Initialize new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"

3. Content Management (Use This Skill):

# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"

Template-Based Data Structures

1. Chat Template (--template chat)

{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}

2. Classification Template (--template classification)

{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}

3. QA Template (--template qa)

{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}

4. Completion Template (--template completion)

{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}

5. Tabular Template (--template tabular)

{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}

Advanced System Prompt Template

For high-quality training data generation:

You are an AI assistant expert at using MCP tools effectively.

## MCP SERVER DEFINITIONS
[Define available servers and tools]

## TRAINING EXAMPLE STRUCTURE
[Specify exact JSON schema for chat templating]

## QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, proper tool usage]

## EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, data management tasks]

Example Categories & Templates

The skill includes diverse training examples beyond just MCP usage:

Available Example Sets:

training_examples.json - MCP tool usage examples (debugging, project setup, database analysis)
diverse_training_examples.json - Broader scenarios including:
- Educational Chat - Explaining programming concepts, tutorials
- Git Workflows - Feature branches, version control guidance
- Code Analysis - Performance optimization, architecture review
- Content Generation - Professional writing, creative brainstorming
- Codebase Navigation - Legacy code exploration, systematic analysis
- Conversational Support - Problem-solving, technical discussions

Using Different Example Sets:

# Add MCP-focused examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"

Commands Reference

List Available Templates:

python scripts/dataset_manager.py list_templates

Quick Setup (Recommended):

python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification

Manual Setup:

# Initialize repository
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'

View Dataset Statistics:

python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"

Error Handling

Repository exists: Script will notify and continue with configuration
Invalid JSON: Clear error message with parsing details
Network issues: Automatic retry for transient failures
Token permissions: Validation before operations begin

hugging-face-dataset-creator

Install Skill

SKILL.md