---
name: named-entity-extractor
description: Extract named entities (people, organizations, locations, dates) from text using NLP. Use for document analysis, information extraction, or data enrichment.
---
# Named Entity Extractor

Extract named entities from text, including people, organizations, locations, dates, and more.
## Features

- **Entity Types**: People, organizations, locations, dates, money, percentages
- **Multiple Models**: spaCy for accuracy, regex for speed
- **Batch Processing**: Process multiple documents in one run
- **Entity Linking**: Group mentions of the same entity across a text
- **Export**: JSON and CSV output formats
- **Visualization**: HTML entity highlighting
## Quick Start

```python
from entity_extractor import EntityExtractor

extractor = EntityExtractor()

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
entities = extractor.extract(text)

for entity in entities:
    print(f"{entity['text']}: {entity['type']}")

# Output:
# Apple Inc.: ORG
# Steve Jobs: PERSON
# Cupertino: GPE
# California: GPE
# 1976: DATE
```
## CLI Usage

```bash
# Extract from text
python entity_extractor.py --text "Steve Jobs founded Apple in California."

# Extract from a file
python entity_extractor.py --input document.txt

# Batch process a folder
python entity_extractor.py --input ./documents/ --output entities.csv

# Filter by entity type
python entity_extractor.py --input document.txt --types PERSON,ORG

# Use regex mode (faster, less accurate)
python entity_extractor.py --input document.txt --mode regex

# JSON output
python entity_extractor.py --input document.txt --json
```
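One way the flags above could be wired together with `argparse` — this is an illustrative sketch based only on the examples shown, not the tool's actual source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser matching the CLI flags shown above (assumed layout)."""
    parser = argparse.ArgumentParser(description="Extract named entities from text.")
    # Exactly one input source: raw text or a file/folder path
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="Raw text to process")
    source.add_argument("--input", help="Input file or folder")
    parser.add_argument("--output", help="Write results to this CSV file")
    parser.add_argument("--types", help="Comma-separated entity types, e.g. PERSON,ORG")
    parser.add_argument("--mode", choices=["spacy", "regex"], default="spacy")
    parser.add_argument("--json", action="store_true", help="Emit JSON instead of plain text")
    return parser

args = build_parser().parse_args(["--input", "document.txt", "--types", "PERSON,ORG"])
print(args.types.split(","))  # ['PERSON', 'ORG']
```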
## API Reference

### EntityExtractor Class

```python
class EntityExtractor:
    def __init__(self, mode: str = "spacy", model: str = "en_core_web_sm")

    # Extraction
    def extract(self, text: str) -> list
    def extract_file(self, filepath: str) -> list
    def extract_batch(self, folder: str) -> dict

    # Filtering
    def filter_entities(self, entities: list, types: list) -> list
    def get_unique_entities(self, entities: list) -> list
    def group_by_type(self, entities: list) -> dict

    # Analysis
    def entity_frequency(self, text: str) -> dict
    def find_relationships(self, text: str) -> list

    # Export
    def to_csv(self, entities: list, output: str) -> str
    def to_json(self, entities: list, output: str) -> str
    def highlight_text(self, text: str) -> str
```
## Entity Types

### Standard Entity Types (spaCy)
| Type | Description | Example |
|---|---|---|
| PERSON | People, including fictional | "Steve Jobs" |
| ORG | Companies, agencies, institutions | "Apple Inc." |
| GPE | Countries, cities, states | "California" |
| LOC | Non-GPE locations, mountains, water | "Pacific Ocean" |
| DATE | Dates, periods | "January 2024" |
| TIME | Times | "3:30 PM" |
| MONEY | Monetary values | "$1.5 million" |
| PERCENT | Percentages | "20%" |
| PRODUCT | Products | "iPhone" |
| EVENT | Events | "World Cup" |
| WORK_OF_ART | Books, songs, etc. | "The Great Gatsby" |
| LAW | Laws, regulations | "GDPR" |
| LANGUAGE | Languages | "English" |
| NORP | Nationalities, groups | "American" |
### Regex Mode Entities

Faster extraction using regular-expression patterns:
| Type | Description |
|---|---|
| EMAIL | Email addresses |
| PHONE | Phone numbers |
| URL | Web URLs |
| DATE | Common date formats |
| MONEY | Currency amounts |
| PERCENTAGE | Percentages |
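The patterns below are a simplified sketch of how regex-mode extraction for a few of these types might look; the tool's actual patterns are not shown in this document and will differ:

```python
import re

# Illustrative patterns only -- deliberately simple, not production-grade.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
    "URL": re.compile(r"https?://\S+"),
    "PERCENTAGE": re.compile(r"\d+(?:\.\d+)?%"),
}

def regex_extract(text: str) -> list:
    """Return entity dicts in the same shape as the spaCy mode."""
    entities = []
    for etype, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append({
                "text": match.group(),
                "type": etype,
                "start": match.start(),
                "end": match.end(),
            })
    return entities
```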
## Output Format

### Entity Result

```json
{
  "text": "Steve Jobs",
  "type": "PERSON",
  "start": 10,
  "end": 20,
  "confidence": 0.95
}
```
### Full Extraction Result

```json
{
  "text": "Original text...",
  "entities": [
    {"text": "Steve Jobs", "type": "PERSON", "start": 10, "end": 20},
    {"text": "Apple Inc.", "type": "ORG", "start": 30, "end": 40}
  ],
  "summary": {
    "total_entities": 5,
    "unique_entities": 4,
    "by_type": {
      "PERSON": 2,
      "ORG": 1,
      "GPE": 2
    }
  }
}
```
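The summary block can be derived mechanically from the entity list. A minimal sketch, assuming entities are dicts with `text` and `type` keys as shown above:

```python
from collections import Counter

def build_summary(entities: list) -> dict:
    """Derive the summary block from a list of entity dicts."""
    by_type = Counter(e["type"] for e in entities)
    # Treat two entities as the same if text and type both match
    unique = {(e["text"], e["type"]) for e in entities}
    return {
        "total_entities": len(entities),
        "unique_entities": len(unique),
        "by_type": dict(by_type),
    }
```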
## Filtering and Grouping

### Filter by Type

```python
entities = extractor.extract(text)

# Get only people and organizations
filtered = extractor.filter_entities(entities, ["PERSON", "ORG"])
```
### Get Unique Entities

```python
# Remove duplicates, keeping the first occurrence
unique = extractor.get_unique_entities(entities)
```
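Deduplication that keeps the first occurrence can be sketched like this (an illustration of the behavior described above, not the library's implementation):

```python
def get_unique_entities(entities: list) -> list:
    """Drop repeated (text, type) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for entity in entities:
        key = (entity["text"], entity["type"])
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique
```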
### Group by Type

```python
grouped = extractor.group_by_type(entities)
# Returns:
# {
#     "PERSON": ["Steve Jobs", "Tim Cook"],
#     "ORG": ["Apple Inc."],
#     "GPE": ["California", "Cupertino"]
# }
```
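Grouping entity texts under their type label is a natural fit for `defaultdict`; a sketch of the behavior shown above:

```python
from collections import defaultdict

def group_by_type(entities: list) -> dict:
    """Collect entity texts under their type label, preserving order."""
    grouped = defaultdict(list)
    for entity in entities:
        # Avoid listing the same text twice under one type
        if entity["text"] not in grouped[entity["type"]]:
            grouped[entity["type"]].append(entity["text"])
    return dict(grouped)
```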
### Entity Frequency

```python
frequency = extractor.entity_frequency(text)
# Returns:
# {
#     "Steve Jobs": {"count": 5, "type": "PERSON"},
#     "Apple": {"count": 8, "type": "ORG"},
#     "California": {"count": 2, "type": "GPE"}
# }
```
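The frequency table above can be built with `collections.Counter` once the entities are extracted. Note the real `entity_frequency` takes raw text; this sketch starts from an already-extracted entity list for illustration:

```python
from collections import Counter

def frequency_from_entities(entities: list) -> dict:
    """Count mentions per entity text, remembering each entity's type."""
    counts = Counter(e["text"] for e in entities)
    types = {e["text"]: e["type"] for e in entities}
    return {text: {"count": n, "type": types[text]} for text, n in counts.items()}
```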
## Batch Processing

### Process Folder

```python
results = extractor.extract_batch("./documents/")
# Returns:
# {
#     "doc1.txt": {"entities": [...], "summary": {...}},
#     "doc2.txt": {"entities": [...], "summary": {...}}
# }
```
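Batch processing is essentially a fold over the files in a folder. A minimal sketch with `pathlib`, assuming `.txt` inputs and an `extract(text)` callable (the real method may accept more file types):

```python
from pathlib import Path

def extract_batch(folder: str, extract) -> dict:
    """Run an extract(text) callable over every .txt file in a folder."""
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):
        entities = extract(path.read_text(encoding="utf-8"))
        results[path.name] = {"entities": entities}
    return results
```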
### Export to CSV

```python
extractor.to_csv(results, "entities.csv")
# Creates a CSV with columns:
# filename, entity_text, entity_type, start, end
```
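Flattening batch results into one row per entity can be done with the standard `csv` module. A sketch with the column layout listed above (the helper name is hypothetical):

```python
import csv

def batch_to_csv(results: dict, output: str) -> str:
    """Write batch results as one CSV row per extracted entity."""
    with open(output, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "entity_text", "entity_type", "start", "end"])
        for filename, data in results.items():
            for e in data["entities"]:
                writer.writerow([filename, e["text"], e["type"], e["start"], e["end"]])
    return output
```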
## Text Highlighting

Generate HTML with highlighted entities:

```python
html = extractor.highlight_text(text)
# Returns HTML with colored spans for each entity type
```
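Highlighting amounts to splicing tags into the text at entity offsets, escaping everything else. A sketch using `<mark>` tags whose class is the entity type (the actual markup the tool emits may differ):

```python
import html

def highlight(text: str, entities: list) -> str:
    """Wrap each non-overlapping entity span in a <mark> tag labeled with its type."""
    out = []
    last = 0
    for e in sorted(entities, key=lambda e: e["start"]):
        out.append(html.escape(text[last:e["start"]]))
        span = html.escape(text[e["start"]:e["end"]])
        out.append(f'<mark class="{e["type"]}">{span}</mark>')
        last = e["end"]
    out.append(html.escape(text[last:]))
    return "".join(out)
```

CSS can then color each entity type via its class (e.g. `mark.PERSON { background: #ffd; }`).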
## Example Workflows

### Document Analysis

```python
extractor = EntityExtractor()

# Analyze a document
with open("article.txt", encoding="utf-8") as f:
    text = f.read()
result = extractor.extract(text)

# Get the key people mentioned
people = extractor.filter_entities(result, ["PERSON"])
print(f"People mentioned: {len(people)}")

# Rank entities by mention count
freq = extractor.entity_frequency(text)
top_entities = sorted(freq.items(), key=lambda x: x[1]["count"], reverse=True)[:10]
```
### Contact Information Extraction

```python
extractor = EntityExtractor(mode="regex")

text = """
Contact John Smith at john.smith@example.com
or call (555) 123-4567.
"""

entities = extractor.extract(text)
# Finds EMAIL and PHONE entities
```
### Content Tagging

```python
extractor = EntityExtractor()

articles = ["article1.txt", "article2.txt", "article3.txt"]
tags = {}
for article in articles:
    entities = extractor.extract_file(article)
    tags[article] = extractor.get_unique_entities(entities)
```
## Dependencies

- `spacy>=3.7.0`
- `pandas>=2.0.0`
- `en_core_web_sm` (spaCy model)

Note: Run `python -m spacy download en_core_web_sm` to install the model.