name: protein-assembly

description: Guidance for designing and assembling multi-component fusion protein sequences, particularly for FRET biosensors and tagged constructs. This skill applies when tasks involve identifying proteins by spectral properties (excitation/emission wavelengths), assembling fusion proteins from multiple domains, codon optimization with GC content constraints, working with PDB sequences and fluorescent protein databases, or generating gBlock sequences for gene synthesis.

Protein Assembly Skill

This skill provides systematic approaches for designing multi-component fusion protein sequences, with emphasis on FRET biosensor construction, spectral matching, and codon optimization.

Task Decomposition Strategy

Complex protein assembly tasks require systematic decomposition into discrete, verifiable sub-tasks. Establish clear success criteria for each phase before proceeding.

Phase 1: Component Identification

Break the assembly into individual components that must be identified:

Fluorescent proteins - Identify by spectral properties (excitation/emission wavelengths)
Binding domains - Identify by ligand/substrate specificity (e.g., SNAP-tag for O6-benzylguanine)
Target proteins - Identify by function (e.g., antibody targets, enzymes)
Linker sequences - Determine type and length requirements

For each component, document:

The identification criteria (wavelength, ligand, function)
The source (PDB ID, database entry, plasmid file)
The extracted sequence (verify before proceeding)

Phase 2: Sequence Extraction

Extract and verify each protein sequence before assembly:

PDB sequences: Use RCSB PDB FASTA endpoint (https://www.rcsb.org/fasta/entry/{PDB_ID})
Fluorescent proteins: Query FPbase API with wavelength filters
Plasmid sequences: Parse GenBank format files completely (handle truncation)
Validate extraction: Confirm sequence length and expected features
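
The extraction and length check above can be sketched as follows, assuming the RCSB endpoint returns ordinary multi-record FASTA text. The helper names `fetch_pdb_fasta` and `parse_fasta` are illustrative and not part of any library, and the example records are made up rather than real PDB data.

```python
from urllib.request import urlopen

RCSB_FASTA_URL = "https://www.rcsb.org/fasta/entry/{pdb_id}"  # endpoint named above


def fetch_pdb_fasta(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the FASTA text for one PDB entry (needs network access)."""
    url = RCSB_FASTA_URL.format(pdb_id=pdb_id.upper())
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Split FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records: list[tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up two-record input used only to show the record structure.
example = (
    ">XXXX_1|Chains A, B|example protein\nMSKGE\nELFTG\n"
    ">XXXX_2|Chain C|example peptide\nGGSGGS\n"
)
for header, seq in parse_fasta(example):
    print(header.split("|")[0], len(seq))
```

Comparing each printed length with the length expected for that component is a cheap way to catch a truncated download before assembly.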

Phase 3: Assembly and Optimization

Assemble components in specified order with linkers, then optimize:

Apply N-terminal methionine rules: keep one initiator Met at the start of the construct and remove the initiator Met from each downstream domain
Insert linker sequences between domains
Perform codon optimization for target organism
Validate constraints (length, GC content windows)
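
A minimal sketch of the protein-level assembly, assuming that only the first domain keeps its initiator Met and that a leading `M` on any later domain is a start residue to drop. The function name `assemble_fusion` and the short sequences are hypothetical.

```python
def assemble_fusion(domains: list[str], linker: str) -> str:
    """Join domains from N- to C-terminus with a linker between neighbours.

    Assumptions: a leading 'M' on domains after the first is an initiator Met
    and is removed; if the first domain lacks a leading 'M', one is added.
    """
    if not domains:
        raise ValueError("at least one domain is required")
    parts = [domains[0]]
    for domain in domains[1:]:
        parts.append(domain[1:] if domain.startswith("M") else domain)
    fusion = linker.join(parts)
    return fusion if fusion.startswith("M") else "M" + fusion


protein = assemble_fusion(["MKT", "MVS", "AEL"], linker="GGS")
print(protein)                  # MKTGGSVSGGSAEL
print(3 * (len(protein) + 1))   # coding length in nt, including one stop codon
```

Methionines inside a domain are left untouched; only the first residue of each downstream domain is inspected.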

Critical Verification Checkpoints

Establish verification checkpoints to prevent cascading errors:

Checkpoint 1: Spectral Matching

For FRET pairs, verify donor emission overlaps with acceptor excitation
Match filter cube specifications exactly (not approximately)
Document the specific wavelength values being matched
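
One way to make the filter check explicit is sketched below, assuming filters are written in the common center/bandwidth notation (for example `525/50`, meaning a 50 nm band centered on 525 nm). The function names and the wavelengths are illustrative only.

```python
def passband(spec: str) -> tuple[float, float]:
    """Convert a 'center/width' filter spec such as '525/50' into (low, high) in nm."""
    center, width = (float(part) for part in spec.split("/"))
    return center - width / 2, center + width / 2


def peak_in_filter(peak_nm: float, spec: str) -> bool:
    """True when the peak wavelength lies inside the filter passband."""
    low, high = passband(spec)
    return low <= peak_nm <= high


print(passband("525/50"))             # (500.0, 550.0)
print(peak_in_filter(510, "525/50"))  # True
print(peak_in_filter(560, "525/50"))  # False
```

Recording the passband limits next to each candidate's peak values documents exactly which numbers were matched.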

Checkpoint 2: Sequence Completeness

Verify API responses are not truncated
If data exceeds buffer limits, implement pagination or filtering
Cross-reference sequence lengths with expected values

Checkpoint 3: Assembly Validation

Verify component order matches requirements
Confirm linker lengths are within the specified constraints
Check total nucleotide count against limits

Checkpoint 4: Optimization Validation

Calculate GC content in sliding windows (typically 50 nt)
Verify all windows fall within acceptable range (e.g., 30-70%)
Confirm codon usage matches target organism
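
The sliding-window check can be sketched as below. The defaults mirror the example values above (50 nt windows, 30-70% GC); the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def failing_gc_windows(seq: str, window: int = 50,
                       low: float = 0.30, high: float = 0.70):
    """Return (start_index, gc_fraction) for each window outside [low, high].

    Windows advance one nucleotide at a time. A sequence shorter than one
    window is checked as a single span.
    """
    last_start = max(len(seq) - window, 0)
    failures = []
    for start in range(last_start + 1):
        gc = gc_fraction(seq[start:start + window])
        if not low <= gc <= high:
            failures.append((start, gc))
    return failures


print(failing_gc_windows("ATGC" * 25))      # no failing windows
print(len(failing_gc_windows("AT" * 50)))   # every window fails
```

Returning the failing positions, rather than a single pass/fail flag, shows where synonymous codon swaps are needed.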

Common Pitfalls and Mitigations

Data Truncation

Problem: API responses or file reads may be truncated at character limits.

Mitigation:

Check response completeness before parsing
Use targeted queries with filters rather than downloading entire databases
Request specific ranges when reading large files
Implement pagination for large result sets
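
A generic pagination loop might look like the sketch below. It assumes the data source can be wrapped in a callable `fetch_page(offset, limit)` that returns a list; that callable is hypothetical and stands in for whatever API or file reader is in use.

```python
from typing import Callable, Iterator


def iter_pages(fetch_page: Callable[[int, int], list],
               page_size: int = 100, max_pages: int = 1000) -> Iterator:
    """Yield items page by page until a short or empty page marks the end."""
    for page in range(max_pages):
        items = fetch_page(page * page_size, page_size)
        yield from items
        if len(items) < page_size:
            return
    # Reaching this point means the source never signalled the end.
    raise RuntimeError("max_pages reached; result set may be incomplete")


# In-memory stand-in for a remote source.
data = list(range(250))
collected = list(iter_pages(lambda offset, limit: data[offset:offset + limit]))
print(len(collected))  # 250
```

Raising when `max_pages` is exhausted, instead of returning silently, keeps an incomplete result set from being mistaken for a complete one.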

Imprecise Wavelength Matching

Problem: Selecting proteins with "close enough" wavelengths instead of exact matches.

Mitigation:

Use FPbase API filtering parameters for excitation/emission ranges
Use ranges only to narrow the candidate list, then compare each candidate's peak values against the exact required wavelengths
Verify FRET compatibility (donor emission must overlap acceptor excitation)

Premature Assembly

Problem: Attempting to assemble sequences before all components are verified.

Mitigation:

Create explicit checkpoints after each component is identified
Document each sequence extraction with source and verification
Do not proceed to assembly until all sequences are confirmed

Missing Constraint Validation

Problem: Generating sequences without validating requirements.

Mitigation:

Build validation logic early in the process
Check constraints incrementally during codon optimization
Run a final validation pass before generating output

Incomplete File Parsing

Problem: Assuming sequence identity without extracting from source files.

Mitigation:

Parse GenBank files to extract CDS features explicitly
Do not assume sequence identity based on file names
Verify extracted sequences against annotations
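
A standard-library sketch of explicit CDS extraction follows. It assumes a plain GenBank flat file and handles only simple `start..end` and `complement(start..end)` locations; `join(...)` and partial locations are skipped. The function names and the tiny record are made up, and a dedicated parser such as Biopython's `SeqIO.parse(handle, "genbank")` is likely more robust for real plasmid files.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Matches feature lines like "     CDS             1..720" (simple locations only).
CDS_RE = re.compile(
    r"^ {5}CDS[ \t]+(complement\()?(\d+)\.\.(\d+)\)?[ \t]*$", re.M
)


def genbank_origin(text: str) -> str:
    """Return the nucleotide sequence between ORIGIN and the closing '//'."""
    match = re.search(r"^ORIGIN.*?\n(.*?)^//", text, flags=re.S | re.M)
    if match is None:
        raise ValueError("no complete ORIGIN ... // block (file truncated?)")
    return re.sub(r"[^A-Za-z]", "", match.group(1)).upper()


def simple_cds_sequences(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, coding-strand sequence) for each simple CDS feature."""
    genome = genbank_origin(text)
    found = []
    for m in CDS_RE.finditer(text):
        start, end = int(m.group(2)), int(m.group(3))
        seq = genome[start - 1:end]           # GenBank positions are 1-based
        if m.group(1):                        # reverse strand
            seq = seq.translate(COMPLEMENT)[::-1]
        found.append((start, end, seq))
    return found


sample = (
    "LOCUS       demo                      12 bp    DNA\n"
    "FEATURES             Location/Qualifiers\n"
    "     CDS             1..6\n"
    "     CDS             complement(7..12)\n"
    "ORIGIN\n"
    "        1 atgaaa tttcat\n"
    "//\n"
)
print(simple_cds_sequences(sample))
```

The missing-terminator check doubles as a truncation test: a file cut off before `//` raises instead of yielding a short sequence.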

Systematic Workflow

Step 1: Parse Requirements

Extract all constraints from the task specification:

Required spectral properties (exact wavelengths)
Component ordering requirements
Linker specifications (type, length range)
Optimization constraints (GC content, length limits)
Output format requirements

Step 2: Identify Components Individually

For each required protein component:

Determine identification criteria from requirements
Query appropriate database/source
Retrieve the complete response (paginate if needed)
Extract candidate sequences
Verify match against criteria
Document source and sequence

Step 3: Validate FRET Compatibility (if applicable)

For fluorescent protein pairs:

Confirm donor excitation matches the excitation source (laser line or excitation filter)
Confirm acceptor emission falls within the detector's emission filter
Verify spectral overlap for energy transfer
Document the FRET pair selection rationale
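
Spectral overlap is usually quantified with the Förster overlap integral, J = ∫ F_D(λ) ε_A(λ) λ^4 dλ, where F_D is the donor emission normalized to unit area and ε_A is the acceptor extinction coefficient. The sketch below estimates J on a uniform wavelength grid. The Gaussian spectra, peak positions, and the 50,000 M^-1 cm^-1 scale are synthetic stand-ins for illustration; real spectra should come from the database.

```python
import math


def gaussian_spectrum(peak_nm, fwhm_nm, grid):
    """Synthetic bell-shaped spectrum with maximum 1.0 at peak_nm."""
    sigma = fwhm_nm / (2 * math.sqrt(2 * math.log(2)))
    return [math.exp(-0.5 * ((w - peak_nm) / sigma) ** 2) for w in grid]


def overlap_integral(grid, donor_emission, acceptor_ext_coeff):
    """Riemann-sum estimate of J on a uniform grid (nm).

    donor_emission is normalized to unit area internally; the result has
    units of M^-1 cm^-1 nm^4 when the extinction is in M^-1 cm^-1.
    """
    step = grid[1] - grid[0]
    donor_area = sum(donor_emission) * step
    return step * sum(
        (f / donor_area) * eps * w ** 4
        for f, eps, w in zip(donor_emission, acceptor_ext_coeff, grid)
    )


grid = list(range(350, 751))  # 1 nm steps
donor = gaussian_spectrum(475, 40, grid)
near = [50_000 * v for v in gaussian_spectrum(515, 40, grid)]
far = [50_000 * v for v in gaussian_spectrum(600, 40, grid)]
print(overlap_integral(grid, donor, near) > overlap_integral(grid, donor, far))
```

Comparing J across candidate acceptors gives a ranking that can be recorded as part of the pair selection rationale.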

Step 4: Assemble Construct

Order components per specification
Remove the initiator methionine from downstream (non-N-terminal) domains as required
Insert appropriate linkers
Generate nucleotide sequence

Step 5: Optimize Codons

Select codon table for target organism
Optimize with GC content constraints
Apply sliding window validation
Iterate until all windows pass
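
The sketch below shows one simple way to steer GC content during reverse translation. It is a greedy stand-in, not organism-specific optimization: it uses the standard genetic code and picks, for each residue, the synonymous codon that keeps the trailing window closest to a target GC fraction. A real workflow would also weight choices by the target organism's codon usage table. All names and the test protein are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS: dict[str, list[str]] = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def gc_balanced_codons(protein: str, window: int = 50, target: float = 0.5) -> str:
    """Greedy reverse translation that balances GC in the trailing window."""
    dna = ""
    for aa in protein:
        def distance(codon: str) -> float:
            tail = (dna + codon)[-window:]
            gc = (tail.count("G") + tail.count("C")) / len(tail)
            return abs(gc - target)
        dna += min(SYNONYMS[aa], key=distance)
    return dna


def translate(dna: str) -> str:
    """Translate complete codons with the standard genetic code."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary test sequence
dna = gc_balanced_codons(protein)
print(len(dna), translate(dna) == protein)
```

Because the greedy pass only inspects windows that end on a codon boundary, a full sliding-window check is still needed afterwards, with another iteration if any window fails.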

Step 6: Final Validation

Verify total length is within limits
Confirm GC content is within range in all windows
Translate back to verify protein sequence
Save to specified output location

Database Query Strategies

FPbase Queries

Use wavelength range parameters for initial filtering
Query individual proteins for detailed spectral data
Cross-reference excitation AND emission requirements
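
A range query might be built as in the sketch below. The endpoint path and the Django-style filter names (`default_state__ex_max__gte` and similar) are assumptions based on the FPbase REST API as commonly documented; verify them against the current FPbase API reference before relying on them. The wavelength ranges are illustrative.

```python
from urllib.parse import urlencode

FPBASE_PROTEINS = "https://www.fpbase.org/api/proteins/"  # assumed endpoint


def fpbase_query_url(ex_range: tuple[int, int], em_range: tuple[int, int]) -> str:
    """Build a query URL filtering on excitation AND emission maxima (nm).

    The parameter names are assumptions; confirm them in the FPbase docs.
    """
    params = {
        "default_state__ex_max__gte": ex_range[0],
        "default_state__ex_max__lte": ex_range[1],
        "default_state__em_max__gte": em_range[0],
        "default_state__em_max__lte": em_range[1],
        "format": "json",
    }
    return FPBASE_PROTEINS + "?" + urlencode(params)


print(fpbase_query_url((480, 495), (505, 515)))
```

Filtering on both maxima in one request keeps the response small, which also reduces the risk of truncation.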

PDB Queries

Fetch FASTA sequences via REST API
Verify chain identifiers when multiple chains present
For homo-multimeric structures, use a single copy of the repeated chain sequence rather than concatenating every chain

SMILES/Chemical Structure Identification

Use PubChem or ChEBI for structure identification
Cross-reference with protein binding databases
Verify binding protein identity through literature

Output Generation

gBlock Sequences

Verify sequence is within synthesis limits (typically under 3000 nt)
Check for problematic sequences (extreme GC, repeats)
Include 5' and 3' sequences if required for cloning
Save to specified output file path
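
A pre-submission check along these lines can be sketched as follows. The thresholds (3000 nt, homopolymer runs longer than 8, overall GC of 25-75%) are placeholder assumptions; the synthesis vendor's published rules take precedence. The function name is hypothetical.

```python
import re


def gblock_issues(seq: str, max_len: int = 3000, max_run: int = 8,
                  gc_low: float = 0.25, gc_high: float = 0.75) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    seq = seq.upper()
    issues = []
    if re.search(r"[^ACGT]", seq):
        issues.append("contains characters other than A, C, G, T")
    if len(seq) > max_len:
        issues.append(f"length {len(seq)} nt exceeds {max_len} nt")
    # A base followed by max_run or more copies of itself, i.e. a run > max_run.
    run = re.search(r"(.)\1{%d,}" % max_run, seq)
    if run:
        issues.append(
            f"homopolymer run of {len(run.group(0))} at position {run.start() + 1}"
        )
    if seq:
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        if not gc_low <= gc <= gc_high:
            issues.append(f"overall GC {gc:.0%} outside {gc_low:.0%}-{gc_high:.0%}")
    return issues


print(gblock_issues("ATGC" * 10))
print(gblock_issues("ATGC" * 5 + "A" * 9 + "GCGC" * 3))
```

Only the first offending homopolymer run is reported; writing the sequence to the output path is best done after this list comes back empty.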

Documentation

Record all component sources
Document selection rationale for each protein
Note any assumptions or approximations made