name	neo4j-integration
description	Graph database schema management for structured complex data at scale with relationship modeling and querying patterns

Neo4j Integration Skill

What This Skill Provides

This skill provides Neo4j graph database integration for the SRS AI Systems project, which manages regulatory orphan designation data. It includes:

Graph Schema Design: Pre-built schema for regulatory data (Products, Indications, Sponsors, Designations, Regulatory Authorities)
Data Ingestion Pipelines: Production-ready scripts to load regulatory data into Neo4j
Query Patterns: Optimized Cypher queries for competitive intelligence, relationship analysis, and regulatory insights
Performance Optimization: Indexing strategies, constraint management, and query optimization
Testing Framework: Comprehensive test patterns for graph queries and data validation

When to Use This Skill

Perfect For:

Building knowledge graphs for regulatory and pharmaceutical data
Mapping relationships between products, indications, and sponsors
Competitive intelligence analysis (finding competing therapies)
Regulatory pathway analysis and designation tracking
Complex multi-hop queries (e.g., "Find all sponsors with products treating similar indications")
Visualizing and navigating interconnected regulatory data
Temporal analysis of designation trends and market landscapes

Not For:

Simple CRUD operations better suited for relational databases
High-frequency transactional systems (OLTP at massive scale)
Document storage (use document databases instead)
Time-series data with simple aggregations (use time-series databases)
Projects without complex relationship requirements

Quick Start Workflows

Workflow 1: Initial Setup and Schema Creation

Goal: Set up a Neo4j database with the schema, constraints, and indexes needed for regulatory data.

# Step 1: Install Neo4j dependencies
pip install neo4j pandas python-dotenv

# Step 2: Set up environment variables
cat > .env << EOF
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
EOF

# Step 3: Initialize schema with constraints and indexes
./scripts/setup_neo4j_schema.py --env-file .env --create-constraints --create-indexes

# Step 4: Verify schema
./scripts/setup_neo4j_schema.py --env-file .env --verify

Expected Output: Database with proper node types, relationships, unique constraints, and performance indexes.
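
The contents of setup_neo4j_schema.py are not shown here, so the following is only a minimal sketch of the kind of statements such a setup step might issue. The labels come from this skill's schema; the key properties and the constraint and index names are illustrative assumptions, and the `REQUIRE` syntax assumes Neo4j 4.4 or later.

```python
# Sketch only: labels come from the skill's schema; the key properties and
# constraint/index names below are illustrative assumptions.
UNIQUE_KEYS = {
    "Product": "id",
    "Indication": "name",
    "Sponsor": "name",
    "OrphanDesignation": "designation_id",
    "RegulatoryAuthority": "name",
}
LOOKUP_INDEXES = [("Product", "generic_name")]


def schema_statements():
    """Return idempotent Cypher DDL (Neo4j 4.4+ syntax) for constraints and indexes."""
    statements = []
    for label, prop in UNIQUE_KEYS.items():
        statements.append(
            f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    for label, prop in LOOKUP_INDEXES:
        statements.append(
            f"CREATE INDEX {label.lower()}_{prop} IF NOT EXISTS "
            f"FOR (n:{label}) ON (n.{prop})"
        )
    return statements


def apply_schema(session):
    """Run each DDL statement; `session` is any object with a run(query) method."""
    for statement in schema_statements():
        session.run(statement)
```

Because every statement uses `IF NOT EXISTS`, running the setup twice should leave the schema unchanged.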

Workflow 2: Ingest Regulatory Data

Goal: Load orphan designation data from CSV/JSON into the graph database.

# Step 1: Prepare data files (CSV format)
# Required files: products.csv, indications.csv, sponsors.csv, designations.csv

# Step 2: Ingest data with validation
./scripts/ingest_regulatory_data.py \
    --env-file .env \
    --products-file data/products.csv \
    --indications-file data/indications.csv \
    --sponsors-file data/sponsors.csv \
    --designations-file data/designations.csv \
    --batch-size 1000 \
    --validate

# Step 3: Create derived relationships (competitive analysis)
./scripts/ingest_regulatory_data.py \
    --env-file .env \
    --create-derived-relationships \
    --similarity-threshold 0.8

Expected Output: Populated graph with nodes, relationships, and competitive intelligence links.
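
What `--batch-size 1000` implies can be sketched as follows: rows are sent to the server in fixed-size chunks and merged with a single `UNWIND` per chunk. This is an assumption about how batching could work, not the ingestion script's actual code; the `id` key mirrors the APOC example in the Pro Tips, and `chunked` and `ingest_products` are hypothetical names.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Assumption: products are keyed by an `id` column.
MERGE_PRODUCTS = """
UNWIND $rows AS row
MERGE (p:Product {id: row.id})
SET p += row
"""


def ingest_products(session, rows, batch_size=1000):
    """Send rows in batches; returns the number of batches submitted."""
    batches = 0
    for batch in chunked(rows, batch_size):
        # One query per batch keeps each transaction small and bounded.
        session.run(MERGE_PRODUCTS, rows=batch)
        batches += 1
    return batches
```

Using `MERGE` on the key makes a re-run of the same file update existing nodes rather than duplicate them.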

Workflow 3: Query Knowledge Graph for Insights

Goal: Extract regulatory insights and competitive intelligence from the graph.

# Find competing products for a specific indication
./scripts/query_knowledge_graph.py \
    --env-file .env \
    --query-type competing-products \
    --indication "Duchenne Muscular Dystrophy" \
    --output-format json

# Find all designations for a sponsor
./scripts/query_knowledge_graph.py \
    --env-file .env \
    --query-type sponsor-portfolio \
    --sponsor "Sarepta Therapeutics" \
    --include-timeline

# Find similar indications based on product overlap
./scripts/query_knowledge_graph.py \
    --env-file .env \
    --query-type similar-indications \
    --indication "Cystic Fibrosis" \
    --min-shared-products 2

Expected Output: Structured data revealing competitive landscapes, sponsor portfolios, and market insights.
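
The query behind `--query-type competing-products` is not shown in this document. A plausible shape, using only the labels and relationship types from the schema, is sketched below; `COMPETING_PRODUCTS` and `competing_products` are hypothetical names.

```python
# Assumption: this mirrors what a `competing-products` query might run; the
# actual query inside query_knowledge_graph.py is not shown in this document.
COMPETING_PRODUCTS = """
MATCH (i:Indication {name: $indication})<-[:TREATS]-(p:Product)
OPTIONAL MATCH (s:Sponsor)-[:OWNS]->(p)
RETURN p.generic_name AS product, s.name AS sponsor
ORDER BY product
LIMIT $limit
"""


def competing_products(session, indication, limit=100):
    """Return one dict per product treating `indication`."""
    result = session.run(COMPETING_PRODUCTS, indication=indication, limit=limit)
    return [dict(record) for record in result]
```

`OPTIONAL MATCH` keeps products that have no recorded sponsor, and `session` can be a real driver session or a stub in unit tests.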

Decision Trees

When to Create a New Node Type vs. Property

Does the entity have multiple independent relationships?
├─ YES: Create a separate node type
│   Example: Sponsor (has relationships to Products, Designations, etc.)
└─ NO: Is the data queried independently or needs indexing?
    ├─ YES: Create a separate node type
    │   Example: Indication (queried for competitive analysis)
    └─ NO: Store as property on existing node
        Example: designation_date (property of OrphanDesignation)

Choosing Between Cypher Query Approaches

Query involves multiple relationship hops (>2)?
├─ YES: Use OPTIONAL MATCH for hops that may be absent
│   └─ Need aggregation? Use WITH clause for staged processing
└─ NO: Simple relationship traversal
    └─ Performance critical?
        ├─ YES: Ensure indexes exist, use parameters, consider query hints
        └─ NO: Standard MATCH patterns sufficient

Indexing Strategy Decision

Field used in WHERE clauses frequently?
├─ YES: Is it unique identifier?
│   ├─ YES: Create UNIQUE constraint (auto-creates index)
│   └─ NO: Create standard index
└─ NO: Used in relationship traversal?
    └─ YES: Consider composite index if multi-property filtering

Quality Checklist

Schema Design

All node types have at least one unique constraint
Frequently queried properties have indexes
Relationship types are semantically clear (verb-based)
No redundant relationships (single direction unless bidirectional needed)
Property types are consistent across nodes

Data Ingestion

Batch processing used for large datasets (>10,000 nodes)
Duplicate prevention via MERGE instead of CREATE
Validation runs before full ingestion
Data cleaning applied (null handling, type conversion)
Transaction boundaries properly defined
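
The data cleaning item can be illustrated with a small row-level helper. This is a sketch under assumptions: `clean_row` is a hypothetical name, and the field names should be replaced with the real CSV headers.

```python
def clean_row(row, int_fields=(), required=("id",)):
    """Trim strings, drop empty values, and convert the listed fields to int.

    Field names are illustrative assumptions; adapt them to the real CSV headers.
    """
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue  # omit empty values so blank strings are never written
        cleaned[key] = int(value) if key in int_fields else value
    missing = [field for field in required if field not in cleaned]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return cleaned
```

Running this over every row before ingestion turns malformed input into an early `ValueError` instead of a partially loaded graph.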

Query Performance

Queries use indexed properties in WHERE clauses
Parameterized queries prevent query plan cache pollution
LIMIT used for potentially large result sets
EXPLAIN/PROFILE used to verify query plans
Avoid Cartesian products (missing relationship definitions)

Code Quality

Connection pooling implemented
Error handling with specific Neo4j exceptions
Logging for debugging and monitoring
Environment-based configuration (dev/staging/prod)
Unit tests for query functions
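
Several of these items fit in one small connection module. The sketch below assumes the `NEO4J_*` variables from Workflow 1 and the official `neo4j` Python driver; `load_config`, `connect`, and the logger name are hypothetical.

```python
import logging
import os

logger = logging.getLogger("neo4j_integration")  # hypothetical logger name


def load_config(environ=None):
    """Read connection settings from environment-style variables."""
    environ = os.environ if environ is None else environ
    keys = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
    missing = [k for k in keys if not environ.get(k)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "uri": environ["NEO4J_URI"],
        "auth": (environ["NEO4J_USER"], environ["NEO4J_PASSWORD"]),
    }


def connect(config, pool_size=50):
    """Create one driver per process; the driver manages its own connection pool."""
    # Imported lazily so load_config() stays usable where the driver is not installed.
    from neo4j import GraphDatabase
    from neo4j.exceptions import AuthError, ServiceUnavailable

    driver = GraphDatabase.driver(
        config["uri"], auth=config["auth"], max_connection_pool_size=pool_size
    )
    try:
        driver.verify_connectivity()
    except (AuthError, ServiceUnavailable):
        logger.exception("could not connect to %s", config["uri"])
        driver.close()
        raise
    return driver
```

Calling python-dotenv's `load_dotenv()` before `load_config()` would populate the environment from the `.env` file, and separate `.env` files per environment cover the dev/staging/prod item.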

Documentation

Cypher queries have inline comments
Schema diagram available or generated
Example queries documented with expected results
Data model versioning tracked

Common Pitfalls & Solutions

Pitfall 1: Cartesian Products (Accidental Cross Joins)

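Problem: Query returns far more rows than expected, one for every combination of the disconnected patterns (number of Product nodes × number of Sponsor nodes in the example below).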

// BAD: Missing relationship creates Cartesian product
MATCH (p:Product), (s:Sponsor)
WHERE p.name CONTAINS 'Drug'
RETURN p, s

Solution: Always define relationships between nodes.

// GOOD: Explicit relationship
MATCH (p:Product)<-[:OWNS]-(s:Sponsor)
WHERE p.name CONTAINS 'Drug'
RETURN p, s

Pitfall 2: Not Using Parameterized Queries

Problem: Query plan cache pollution and Cypher injection vulnerabilities.

# BAD: String concatenation
query = f"MATCH (p:Product {{name: '{product_name}'}}) RETURN p"
session.run(query)

Solution: Use parameterized queries.

# GOOD: Parameterized query
query = "MATCH (p:Product {name: $name}) RETURN p"
session.run(query, name=product_name)

Pitfall 3: Missing Indexes on Frequently Queried Properties

Problem: Slow queries even on small datasets.

Solution: Profile query and add indexes.

// Check query performance
PROFILE MATCH (p:Product {generic_name: 'Eteplirsen'}) RETURN p

// Add index if missing
CREATE INDEX product_generic_name FOR (p:Product) ON (p.generic_name)

Pitfall 4: Unbounded Relationship Traversal

Problem: Query explores entire graph, causing timeout.

// BAD: Unbounded depth
MATCH (p:Product)-[*]-(related)
RETURN p, related

Solution: Limit relationship depth.

// GOOD: Bounded depth
MATCH (p:Product)-[*1..3]-(related)
WHERE p.name = 'Exondys 51'
RETURN p, related
LIMIT 100
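
When the depth has to be configurable from Python, one option is to validate it and format it into the query text, while all values stay as parameters. This sketch assumes, as in Neo4j 4.x and 5.x, that the bounds of a variable-length pattern cannot be supplied as a query parameter; the function name and the cap of 5 hops are arbitrary assumptions.

```python
def bounded_neighbors_query(max_depth):
    """Build a traversal with a validated upper bound on depth.

    The bound is checked as an integer and formatted into the text;
    the product name and row limit remain query parameters.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or not 1 <= max_depth <= 5:
        raise ValueError("max_depth must be an integer between 1 and 5")
    return (
        f"MATCH (p:Product {{name: $name}})-[*1..{max_depth}]-(related) "
        "RETURN DISTINCT related LIMIT $limit"
    )
```

A call such as `session.run(bounded_neighbors_query(3), name="Exondys 51", limit=100)` would then run the bounded traversal.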

Pitfall 5: Creating Duplicate Nodes

Problem: Multiple nodes for same entity due to CREATE instead of MERGE.

// BAD: Creates duplicate if exists
CREATE (s:Sponsor {name: 'Sarepta Therapeutics'})

Solution: Use MERGE with ON CREATE/ON MATCH.

// GOOD: Ensures single node
MERGE (s:Sponsor {name: 'Sarepta Therapeutics'})
ON CREATE SET s.created_at = datetime()
ON MATCH SET s.updated_at = datetime()

Pro Tips

Tip 1: Use APOC for Advanced Operations

APOC (Awesome Procedures on Cypher) provides essential utilities for production systems.

// Batch processing with APOC
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row RETURN row",
  "MERGE (p:Product {id: row.id}) SET p += row",
  {batchSize: 1000, parallel: false}
)

Tip 2: Virtual Relationships for Derived Insights

Create temporary relationships for analysis without modifying the graph.

// Create virtual COMPETES_WITH relationships based on shared indications
MATCH (p1:Product)-[:TREATS]->(i:Indication)<-[:TREATS]-(p2:Product)
WHERE p1 <> p2
RETURN p1, p2, i,
       apoc.create.vRelationship(p1, 'COMPETES_WITH', {indication: i.name}, p2) AS competition

Tip 3: Graph Algorithms for Network Analysis

Use Neo4j Graph Data Science library for advanced analytics.

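// Find most central sponsors by betweenness centrality
// (assumes a graph named 'regulatory-network' is already projected in the GDS catalog)
CALL gds.betweenness.stream('regulatory-network')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS node, score
// The stream covers every projected node, so keep only sponsors
WHERE node:Sponsor
RETURN node.name AS sponsor, score
ORDER BY score DESC
LIMIT 10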
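
The query above needs a named in-memory graph before it can run. A minimal sketch of creating one from Python follows, assuming GDS 2.x (where the procedure is `gds.graph.project`) and assuming that sponsors, products, and indications linked by OWNS and TREATS are the relevant subgraph; the helper names are hypothetical.

```python
def projection_params(name="regulatory-network"):
    """Parameters for a native projection; label and type choices are assumptions."""
    rel_types = ["OWNS", "TREATS"]
    return {
        "name": name,
        "labels": ["Sponsor", "Product", "Indication"],
        # Undirected so that centrality is not limited by edge direction.
        "rels": {t: {"orientation": "UNDIRECTED"} for t in rel_types},
    }


PROJECT_GRAPH = """
CALL gds.graph.project($name, $labels, $rels)
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
"""


def project_graph(session, name="regulatory-network"):
    """Create the in-memory graph and return the summary record."""
    return session.run(PROJECT_GRAPH, projection_params(name)).single()
```

The projection lives in memory until it is dropped or the database restarts, so it has to be recreated after data changes.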

Tip 4: Temporal Queries with Date Properties

Track designation history and market dynamics over time.

// Find products with designations in last 12 months
MATCH (p:Product)-[:HAS_DESIGNATION]->(d:OrphanDesignation)
WHERE d.designation_date > date() - duration({months: 12})
RETURN p.name, COUNT(d) AS recent_designations
ORDER BY recent_designations DESC
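
The same window can be computed in Python and passed as a parameter, which makes the query reproducible for a fixed reference date. This is a sketch: `months_ago` and `recent_designations` are hypothetical helpers, and it relies on the Python driver mapping `datetime.date` values to Cypher dates.

```python
import calendar
from datetime import date


def months_ago(today, months):
    """Return the date `months` calendar months before `today`, clamping the day."""
    index = today.year * 12 + (today.month - 1) - months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


RECENT = """
MATCH (p:Product)-[:HAS_DESIGNATION]->(d:OrphanDesignation)
WHERE d.designation_date > $cutoff
RETURN p.name AS product, count(d) AS recent_designations
ORDER BY recent_designations DESC
"""


def recent_designations(session, today=None, months=12):
    """Count designations per product granted after the cutoff date."""
    cutoff = months_ago(today or date.today(), months)
    return [dict(record) for record in session.run(RECENT, cutoff=cutoff)]
```

Passing `today` explicitly keeps tests deterministic; for example, one month before 31 March 2024 is clamped to 29 February 2024.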

Tip 5: Export Subgraphs for Visualization

Extract relevant portions of the graph for external tools (Gephi, Cytoscape).

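// Export competitive landscape for specific indication
MATCH path = (p:Product)-[:TREATS]->(i:Indication {name: 'Duchenne Muscular Dystrophy'})
              <-[:TREATS]-(competitor:Product)
// nodes() and relationships() take a single path, so flatten each path
// before collecting the distinct nodes and relationships to export
UNWIND nodes(path) AS n
UNWIND relationships(path) AS r
WITH collect(DISTINCT n) AS exportNodes, collect(DISTINCT r) AS exportRels
CALL apoc.export.cypher.data(exportNodes, exportRels,
     "duchenne-landscape.cypher", {format: 'plain'})
YIELD file, nodes, relationships
RETURN file, nodes, relationships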

Tip 6: Monitoring and Performance Tuning

Regularly check database statistics and query performance.

// Check database statistics
CALL db.stats.retrieve('GRAPH COUNTS')

// Inspect query-related JMX beans (this reads JMX metrics, not the query log)
CALL dbms.queryJmx('org.neo4j:*') YIELD name, attributes
WHERE name CONTAINS 'Queries'
RETURN name, attributes

// Clear query plan cache after schema changes
CALL db.clearQueryCaches()

Additional Resources

Neo4j Cypher Manual: https://neo4j.com/docs/cypher-manual/current/
Graph Data Science Library: https://neo4j.com/docs/graph-data-science/current/
APOC Documentation: https://neo4j.com/labs/apoc/
Neo4j Python Driver: https://neo4j.com/docs/python-manual/current/

Integration with SRS AI Systems

This skill is specifically designed for the SRS AI Systems project schema:

Node Types:

Product: Therapeutic products (generic_name, brand_name, mechanism_of_action)
Indication: Disease indications (name, icd_code, prevalence)
Sponsor: Pharmaceutical companies (name, country, company_type)
OrphanDesignation: Regulatory designations (designation_id, designation_date, status)
RegulatoryAuthority: FDA, EMA, etc. (name, region, authority_type)

Relationship Types:

TREATS: Product → Indication
HAS_DESIGNATION: Product → OrphanDesignation
COMPETES_WITH: Product → Product (derived)
SIMILAR_TO: Indication → Indication (derived)
FOR_INDICATION: OrphanDesignation → Indication
GRANTED_BY: OrphanDesignation → RegulatoryAuthority
OWNS: Sponsor → Product

This schema enables powerful competitive intelligence queries and regulatory pathway analysis essential for the SRS AI Systems mission.
