| name | count-dataset-tokens |
| description | This skill provides guidance for counting tokens in datasets using specific tokenizers. It should be used when tasks involve tokenizing dataset content, filtering data by domain or category, and aggregating token counts. Common triggers include requests to count tokens in HuggingFace datasets, filter datasets by specific fields, or use particular tokenizers (e.g., Qwen, DeepSeek, GPT). |
# Count Dataset Tokens

## Overview
This skill guides the process of counting tokens in datasets, typically from HuggingFace Hub or similar sources. These tasks involve loading datasets, filtering by specific criteria (domains, categories, splits), tokenizing text fields, and computing aggregate statistics.
## Workflow

### Phase 1: Understand the Dataset Structure
Before writing any code, thoroughly examine the dataset documentation and structure:
**Read the README/dataset card completely** - Look for:
- Available splits (train, test, validation, etc.)
- Column/field definitions and their exact names
- Domain or category definitions (exact values used)
- Data types for each field
- Any metadata subsets available
**Explore the actual data** - Write exploratory code (a sketch follows this list) to:
- List all available columns
- Check unique values for categorical fields (especially filter fields like "domain")
- Verify field names match documentation
- Examine sample records to understand data format
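A minimal exploration sketch (the dataset name and `domain` field are placeholders, not values from any specific task):

```python
from datasets import load_dataset

# Placeholder names - substitute the actual dataset and fields from the task.
dataset = load_dataset("dataset-name", split="train")
print(dataset.column_names)      # all available columns
print(dataset.unique("domain"))  # exact values of the intended filter field
print(dataset[0])                # one sample record, to inspect the format
```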
**Document findings before proceeding** - Note:
- Exact field names to use
- Exact categorical values for filtering
- Any discrepancies between documentation and actual data
### Phase 2: Clarify Task Requirements
When task wording is ambiguous, especially regarding filter criteria:
**Use exact terminology from the dataset** - If the task asks for "science" tokens but the dataset has only "biology", "chemistry", and "physics" domains:
- First check if a "science" domain actually exists
- Do NOT assume "science" means combining related domains unless explicitly documented
- Report when exact matches are not found
**Handle missing categories correctly** (a sketch follows this list):
- If a requested filter value returns zero results, report this finding
- Do not reinterpret or expand the filter criteria without explicit justification
- Check if the requested value might be a typo or variant spelling
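A sketch of that check, assuming the filter field is named "domain" and the requested value is "science" (both placeholders):

```python
# Report rather than silently reinterpret when the requested value is absent.
available = set(dataset.unique("domain"))
if "science" not in available:
    print(f'No "science" domain found; available domains: {sorted(available)}')
    # Stop and report, or check for variant spellings (e.g., "Science").
```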
**Distinguish between** (both interpretations are illustrated after this list):
- Exact domain/category matches (e.g., domain == "science")
- Aggregated categories (e.g., domain in ["biology", "chemistry", "physics"])
The task wording should guide which approach to use.
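For illustration, the two interpretations look like this (the domain values are the hypothetical ones from above):

```python
# Exact match: only records whose domain is literally "science".
exact = dataset.filter(lambda x: x["domain"] == "science")

# Aggregated: several related domains combined into one count.
aggregated = dataset.filter(lambda x: x["domain"] in {"biology", "chemistry", "physics"})
```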
### Phase 3: Implement Token Counting
**Load the correct tokenizer** (a loading check follows this list):
- Use the exact tokenizer specified in the task
- Verify the tokenizer loads correctly before processing data
- Handle any authentication requirements for gated models
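A minimal loading check (a sketch; the model name is only an example of a gated model, not one any task prescribes):

```python
from transformers import AutoTokenizer

try:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # example gated model
except OSError as err:
    # Gated or private models require authentication, e.g. `huggingface-cli login`
    # or setting the HF_TOKEN environment variable, before retrying.
    raise SystemExit(f"Tokenizer failed to load: {err}")

# Smoke-test encoding before processing the full dataset.
assert len(tokenizer.encode("hello world")) > 0
```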
**Write robust counting code:**

```python
# Example structure
from transformers import AutoTokenizer
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("tokenizer-name")

# Load dataset (use streaming for large datasets)
dataset = load_dataset("dataset-name", split="train")

# Filter by exact criteria
filtered = dataset.filter(lambda x: x["domain"] == "exact_value")

# Count tokens with null/empty handling
total_tokens = 0
for item in filtered:
    text = item.get("text_field")
    if text:  # Handle None and empty strings
        tokens = tokenizer.encode(text)
        total_tokens += len(tokens)
```

**Handle edge cases** (a counting helper follows this list):
- None/null values in text fields
- Empty strings
- Special characters or encoding issues
- Very long texts that may need truncation handling
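One way to fold these cases into a helper (a sketch assuming a `transformers` tokenizer; whether special tokens should count toward the total is a per-task decision):

```python
def count_tokens(text, tokenizer):
    # Treat None, non-string, and whitespace-only values as zero tokens.
    if not isinstance(text, str) or not text.strip():
        return 0
    # No truncation is applied, so very long texts are counted in full;
    # verbose=False silences the model-max-length warning they would trigger.
    # add_special_tokens=False excludes BOS/EOS markers from the count.
    return len(tokenizer.encode(text, add_special_tokens=False, verbose=False))
```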
### Phase 4: Verify Results
**Validate filter results** (a spot-check follows this list):
- Check the count of filtered records
- Verify sample records match expected criteria
- Confirm no unexpected filtering occurred
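A spot-check sketch, assuming `filtered` and the placeholder field/value names from Phase 3:

```python
print(f"Filtered records: {len(filtered)}")
# Verify a few records actually satisfy the intended criterion.
for record in filtered.select(range(min(3, len(filtered)))):
    assert record["domain"] == "exact_value", f"Unexpected record: {record}"
```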
**Sanity check token counts** (a sketch follows this list):
- Compare against expected magnitudes
- Verify counts are non-zero when data exists
- Check for reasonable tokens-per-record ratios
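These checks take only a few lines (assuming `filtered` and `total_tokens` from Phase 3):

```python
n_records = len(filtered)
print(f"Records: {n_records}, total tokens: {total_tokens}")
if n_records > 0:
    # Natural-language records typically run tens to thousands of tokens
    # each; a ratio far outside that range deserves a closer look.
    print(f"Tokens per record: {total_tokens / n_records:.1f}")
    assert total_tokens > 0, "Records exist but token count is zero - check the text field"
```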
**Document assumptions:**
- Note any interpretation decisions made
- Flag uncertainty in the results if criteria were ambiguous
## Common Pitfalls

### 1. Misinterpreting Filter Criteria
**Problem:** Assuming "science" means biology + chemistry + physics without documentation support.

**Solution:** Always use exact filter values from the dataset. When exact matches fail, report the discrepancy rather than reinterpreting.
### 2. Insufficient Dataset Exploration
**Problem:** Writing filtering code before understanding available field values.

**Solution:** Always run exploratory analysis first:
```python
# Check unique values before filtering
print(dataset.unique("domain"))  # or equivalent
```
### 3. Incomplete Documentation Review

**Problem:** Overlooking domain definitions or field specifications in the README.

**Solution:** Read the entire dataset card, including any linked documentation. Look for schema definitions, data dictionaries, or field value enumerations.
### 4. Silent Filter Failures

**Problem:** A filter returns zero results but the code continues without warning.

**Solution:** Always check and report filter result counts:
```python
filtered = dataset.filter(condition)
print(f"Filtered count: {len(filtered)}")
if len(filtered) == 0:
    print("WARNING: No records match filter criteria")
```
### 5. Ignoring Null/Empty Values

**Problem:** Tokenizing None or empty strings causes errors or incorrect counts.

**Solution:** Always validate text content before tokenizing:
```python
if text and isinstance(text, str) and len(text.strip()) > 0:
    tokens = tokenizer.encode(text)
```
### 6. Overconfident Assumptions

**Problem:** Proceeding with reinterpreted criteria without expressing uncertainty.

**Solution:** When making interpretation decisions, document them clearly and flag the uncertainty in results.
## Verification Checklist
Before finalizing results, verify:
- Dataset README/card was fully reviewed
- Field names match exactly what's in the data
- Filter values are exact matches from the dataset
- Exploratory analysis confirmed available categories
- Filter produced expected number of records
- Null/empty handling is implemented
- Token counts pass sanity checks
- Any assumptions or interpretations are documented
- Results file is correctly formatted and written
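For the final item, a minimal output sketch (the filename and JSON schema here are assumptions; use whatever format the task specifies):

```python
import json

results = {
    "filter": {"domain": "exact_value"},  # placeholder criterion from Phase 3
    "record_count": len(filtered),
    "total_tokens": total_tokens,
}
with open("token_counts.json", "w") as f:
    json.dump(results, f, indent=2)
```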