| name | count-dataset-tokens |
| description | This skill provides guidance for counting tokens in datasets using specific tokenizers. It should be used when tasks involve tokenizing dataset content, filtering data by domain or category, and aggregating token counts. Common triggers include requests to count tokens in HuggingFace datasets, filter datasets by specific fields, or use particular tokenizers (e.g., Qwen, DeepSeek, GPT). |
# Count Dataset Tokens

## Overview
This skill guides the process of counting tokens in datasets, typically from HuggingFace Hub or similar sources. These tasks involve loading datasets, filtering by specific criteria (domains, categories, splits), tokenizing text fields, and computing aggregate statistics.
## Workflow

### Phase 1: Understand the Dataset Structure
Before writing any code, thoroughly examine the dataset documentation and structure:
**Read the README/dataset card completely** - Look for:
- Available splits (train, test, validation, etc.)
- Column/field definitions and their exact names
- Domain or category definitions (exact values used)
- Data types for each field
- Any metadata subsets available
**Explore the actual data** - Write exploratory code (a sketch follows this list) to:
- List all available columns
- Check unique values for categorical fields (especially filter fields like "domain")
- Verify field names match documentation
- Examine sample records to understand data format
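A minimal exploration sketch (the dataset name and `domain` field are placeholders, not values from any specific task):

```python
from datasets import load_dataset

# Placeholder names - substitute the actual dataset and fields from the task.
dataset = load_dataset("dataset-name", split="train")
print(dataset.column_names)      # all available columns
print(dataset.unique("domain"))  # exact values of the intended filter field
print(dataset[0])                # one sample record, to inspect the format
```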
**Document findings before proceeding** - Note:
- Exact field names to use
- Exact categorical values for filtering
- Any discrepancies between documentation and actual data
### Phase 2: Clarify Task Requirements
When task wording is ambiguous, especially regarding filter criteria:
**Use exact terminology from the dataset** - If the task asks for "science" tokens but the dataset has only "biology", "chemistry", and "physics" domains:
- First check if a "science" domain actually exists
- Do NOT assume "science" means combining related domains unless explicitly documented
- Report when exact matches are not found
**Handle missing categories correctly** (a sketch follows this list):
- If a requested filter value returns zero results, report this finding
- Do not reinterpret or expand the filter criteria without explicit justification
- Check if the requested value might be a typo or variant spelling
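A sketch of that check, assuming the filter field is named "domain" and the requested value is "science" (both placeholders):

```python
# Report rather than silently reinterpret when the requested value is absent.
available = set(dataset.unique("domain"))
if "science" not in available:
    print(f'No "science" domain found; available domains: {sorted(available)}')
    # Stop and report, or check for variant spellings (e.g., "Science").
```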
**Distinguish between** (both interpretations are illustrated after this list):
- Exact domain/category matches (e.g., domain == "science")
- Aggregated categories (e.g., domain in ["biology", "chemistry", "physics"])
The task wording should guide which approach to use.
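For illustration, the two interpretations look like this (the domain values are the hypothetical ones from above):

```python
# Exact match: only records whose domain is literally "science".
exact = dataset.filter(lambda x: x["domain"] == "science")

# Aggregated: several related domains combined into one count.
aggregated = dataset.filter(lambda x: x["domain"] in {"biology", "chemistry", "physics"})
```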
### Phase 3: Implement Token Counting
**Load the correct tokenizer** (a loading check follows this list):
- Use the exact tokenizer specified in the task
- Verify the tokenizer loads correctly before processing data
- Handle any authentication requirements for gated models
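A minimal loading check (a sketch; the model name is only an example of a gated model, not one any task prescribes):

```python
from transformers import AutoTokenizer

try:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example gated model
except OSError as err:
    # Gated or private models require authentication, e.g. `huggingface-cli login`
    # or setting the HF_TOKEN environment variable, before retrying.
    raise SystemExit(f"Tokenizer failed to load: {err}")

# Smoke-test encoding before processing the full dataset.
assert len(tokenizer.encode("hello world")) > 0
```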
**Write robust counting code:**

```python
# Example structure
from transformers import AutoTokenizer
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tokenizer-name")

# Load dataset (use streaming for large datasets)
dataset = load_dataset("dataset-name", split="train")

# Filter by exact criteria
filtered = dataset.filter(lambda x: x["domain"] == "exact_value")

# Count tokens with null/empty handling
total_tokens = 0
for item in filtered:
    text = item.get("text_field")
    if text:  # Handle None and empty strings
        tokens = tokenizer.encode(text)
        total_tokens += len(tokens)
```

**Handle edge cases** (a counting helper follows this list):
- None/null values in text fields
- Empty strings
- Special characters or encoding issues
- Very long texts that may need truncation handling
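One way to fold these cases into a helper (a sketch assuming a `transformers` tokenizer; whether special tokens should count toward the total is a per-task decision):

```python
def count_tokens(text, tokenizer):
    # Treat None, non-string, and whitespace-only values as zero tokens.
    if not isinstance(text, str) or not text.strip():
        return 0
    # No truncation is applied, so very long texts are counted in full;
    # verbose=False silences the model-max-length warning they would trigger.
    # add_special_tokens=False excludes BOS/EOS markers from the count.
    return len(tokenizer.encode(text, add_special_tokens=False, verbose=False))
```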
### Phase 4: Verify Results
**Validate filter results** (a spot-check follows this list):
- Check the count of filtered records
- Verify sample records match expected criteria
- Confirm no unexpected filtering occurred
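A spot-check sketch, assuming `filtered` and the placeholder field/value names from Phase 3:

```python
print(f"Filtered records: {len(filtered)}")
# Verify a few records actually satisfy the intended criterion.
for record in filtered.select(range(min(3, len(filtered)))):
    assert record["domain"] == "exact_value", f"Unexpected record: {record}"
```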
**Sanity check token counts** (a sketch follows this list):
- Compare against expected magnitudes
- Verify counts are non-zero when data exists
- Check for reasonable tokens-per-record ratios
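These checks take only a few lines (assuming `filtered` and `total_tokens` from Phase 3):

```python
n_records = len(filtered)
print(f"Records: {n_records}, total tokens: {total_tokens}")
if n_records > 0:
    # Natural-language records typically run tens to thousands of tokens
    # each; a ratio far outside that range deserves a closer look.
    print(f"Tokens per record: {total_tokens / n_records:.1f}")
    assert total_tokens > 0, "Records exist but token count is zero - check the text field"
```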
**Document assumptions:**
- Note any interpretation decisions made
- Flag uncertainty in the results if criteria were ambiguous
## Common Pitfalls

### 1. Misinterpreting Filter Criteria
**Problem:** Assuming "science" means biology + chemistry + physics without documentation support.

**Solution:** Always use exact filter values from the dataset. When exact matches fail, report the discrepancy rather than reinterpreting.
### 2. Insufficient Dataset Exploration
**Problem:** Writing filtering code before understanding available field values.

**Solution:** Always run exploratory analysis first:
```python
# Check unique values before filtering
print(dataset.unique("domain"))  # or equivalent
```
### 3. Incomplete Documentation Review

**Problem:** Overlooking domain definitions or field specifications in the README.

**Solution:** Read the entire dataset card, including any linked documentation. Look for schema definitions, data dictionaries, or field value enumerations.
### 4. Silent Filter Failures

**Problem:** A filter returns zero results but the code continues without warning.

**Solution:** Always check and report filter result counts:
```python
filtered = dataset.filter(condition)
print(f"Filtered count: {len(filtered)}")
if len(filtered) == 0:
    print("WARNING: No records match filter criteria")
```
### 5. Ignoring Null/Empty Values

**Problem:** Tokenizing None or empty strings causes errors or incorrect counts.

**Solution:** Always validate text content before tokenizing:
```python
if text and isinstance(text, str) and len(text.strip()) > 0:
    tokens = tokenizer.encode(text)
```
### 6. Overconfident Assumptions

**Problem:** Proceeding with reinterpreted criteria without expressing uncertainty.

**Solution:** When making interpretation decisions, document them clearly and flag the uncertainty in results.
## Verification Checklist
Before finalizing results, verify:
- Dataset README/card was fully reviewed
- Field names match exactly what's in the data
- Filter values are exact matches from the dataset
- Exploratory analysis confirmed available categories
- Filter produced expected number of records
- Null/empty handling is implemented
- Token counts pass sanity checks
- Any assumptions or interpretations are documented
- Results file is correctly formatted and written
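For the final item, a minimal output sketch (the filename and JSON schema here are assumptions; use whatever format the task specifies):

```python
import json

results = {
    "filter": {"domain": "exact_value"},  # placeholder criterion from Phase 3
    "record_count": len(filtered),
    "total_tokens": total_tokens,
}
with open("token_counts.json", "w") as f:
    json.dump(results, f, indent=2)
```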