name: multi-modal
description: Multi-modal prompting with vision, audio, and document understanding
sasmp_version: 1.3.0
bonded_agent: 07-advanced-techniques-agent
bond_type: PRIMARY_BOND

Multi-Modal Prompting Skill

Name: multi-modal
Author: pluginagentmarketplace

Bonded to: advanced-techniques-agent

Quick Start

Skill("custom-plugin-prompt-engineering:multi-modal")

Parameter Schema

parameters:
  modality:
    type: enum
    values: [vision, document, audio, video]
    required: true

  task_type:
    type: enum
    values: [analysis, extraction, generation, qa]
    default: analysis

  detail_level:
    type: enum
    values: [low, medium, high]
    default: medium
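
The schema above can be checked before the skill is invoked. The following is a minimal sketch in Python, assuming a hypothetical helper named `resolve_parameters` (it is not part of the skill) that applies the defaults and rejects values outside each enum:

```python
# Hypothetical helper: mirrors the parameter schema above; not part of the skill.
SCHEMA = {
    "modality": {"values": ["vision", "document", "audio", "video"], "required": True},
    "task_type": {"values": ["analysis", "extraction", "generation", "qa"], "default": "analysis"},
    "detail_level": {"values": ["low", "medium", "high"], "default": "medium"},
}


def resolve_parameters(params: dict) -> dict:
    """Apply schema defaults and reject unknown keys or out-of-enum values."""
    unknown = set(params) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, spec in SCHEMA.items():
        if name in params:
            value = params[name]
        elif spec.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        else:
            value = spec["default"]
        if value not in spec["values"]:
            raise ValueError(f"{name} must be one of {spec['values']}, got {value!r}")
        resolved[name] = value
    return resolved
```

Calling `resolve_parameters({"modality": "vision"})` fills in `task_type` and `detail_level` from the defaults, while an empty dict raises `ValueError` because `modality` is required.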

Vision Prompting

Image Analysis Template

Analyze this image and provide:
1. Main subjects and objects
2. Actions or activities
3. Setting and context
4. Notable details
5. Overall interpretation

Be specific and descriptive.

Visual Q&A Pattern

Look at the image carefully.

Question: {question}

Provide a detailed answer based only on what you can see in the image.
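
The `{question}` placeholder can be filled with ordinary string formatting. This is a small sketch; the function name `build_visual_qa_prompt` is an assumption made for illustration, and attaching the image itself is left to whichever model client is in use:

```python
VISUAL_QA_TEMPLATE = (
    "Look at the image carefully.\n\n"
    "Question: {question}\n\n"
    "Provide a detailed answer based only on what you can see in the image."
)


def build_visual_qa_prompt(question: str) -> str:
    """Fill the Visual Q&A pattern with a single, non-empty question."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return VISUAL_QA_TEMPLATE.format(question=question)
```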

Chart/Graph Analysis

Analyze this chart/graph:
1. Type of visualization
2. Axes and labels
3. Key data points
4. Trends or patterns
5. Main insights
6. Limitations or caveats

Document Processing

PDF Extraction

Extract the following from this document:
- Title and headers
- Key information: {fields}
- Tables (if any)
- Important dates/numbers

Output as structured JSON.
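
Because the template asks for JSON, the reply usually needs to be parsed and checked before use. The sketch below assumes two hypothetical helpers: `build_extraction_prompt` fills `{fields}`, and `parse_json_reply` tolerates a reply wrapped in a Markdown code fence, which some models add even when asked for raw JSON:

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

PDF_EXTRACTION_TEMPLATE = """Extract the following from this document:
- Title and headers
- Key information: {fields}
- Tables (if any)
- Important dates/numbers

Output as structured JSON."""


def build_extraction_prompt(fields: list[str]) -> str:
    """Fill the PDF extraction template with a comma-separated field list."""
    if not fields:
        raise ValueError("at least one field is required")
    return PDF_EXTRACTION_TEMPLATE.format(fields=", ".join(fields))


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as a JSON object, stripping a surrounding fence if present."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data
```

A reply that is not valid JSON raises `json.JSONDecodeError`, which is a reasonable point to retry the request or log the raw text.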

Form/Invoice Processing

extraction_schema:
  document_type: "invoice|form|contract"
  fields:
    - name: vendor
      type: string
    - name: date
      type: date
    - name: total
      type: currency
    - name: line_items
      type: array
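
Extracted records can be validated against this schema before they are stored. The following sketch makes two assumptions that the schema itself does not state: dates arrive as ISO 8601 strings (`YYYY-MM-DD`), and currency amounts arrive as plain numbers or decimal strings without symbols or thousands separators. The name `validate_invoice` is hypothetical:

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Field names and types mirror the extraction_schema above.
FIELDS = {"vendor": "string", "date": "date", "total": "currency", "line_items": "array"}


def _check(value, kind: str) -> bool:
    if kind == "string":
        return isinstance(value, str) and bool(value.strip())
    if kind == "date":
        # Assumption: dates are ISO 8601 strings (YYYY-MM-DD).
        try:
            date.fromisoformat(value)
            return True
        except (TypeError, ValueError):
            return False
    if kind == "currency":
        # Assumption: plain numbers or decimal strings, no currency symbols.
        if isinstance(value, bool):
            return False
        try:
            return Decimal(str(value)).is_finite()
        except InvalidOperation:
            return False
    if kind == "array":
        return isinstance(value, list)
    return False


def validate_invoice(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, kind in FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not _check(record[name], kind):
            problems.append(f"invalid {kind} for field: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to feed all failures back to the model in a single follow-up prompt.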

Audio Integration

Transcription Enhancement

Transcribe and enhance:
1. Accurate transcription
2. Speaker identification
3. Timestamps for key points
4. Summary of main topics
5. Action items (if applicable)
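
The numbered template above can be assembled programmatically so that an optional language hint is appended when the language is known. This is a sketch only: `build_transcription_prompt` and the wording of the language line are assumptions, not something the skill defines:

```python
TRANSCRIPTION_STEPS = [
    "Accurate transcription",
    "Speaker identification",
    "Timestamps for key points",
    "Summary of main topics",
    "Action items (if applicable)",
]


def build_transcription_prompt(language=None) -> str:
    """Build the transcription prompt, optionally naming the audio language."""
    lines = ["Transcribe and enhance:"]
    lines += [f"{i}. {step}" for i, step in enumerate(TRANSCRIPTION_STEPS, start=1)]
    if language:
        # Assumption: a plain-text hint; the skill has no dedicated language field.
        lines += ["", f"The audio is in {language}."]
    return "\n".join(lines)
```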

Best Practices

best_practices:
  image_prompts:
    - Be specific about what to look for
    - Request structured output
    - Ask for confidence levels

  document_prompts:
    - Define extraction schema
    - Handle multi-page documents
    - Validate extracted data

  audio_prompts:
    - Specify language if known
    - Request speaker diarization
    - Ask for timestamps

Troubleshooting

Issue	Cause	Solution
Hallucinated details	Over-interpretation	Ask for only what is visible in the image
Missed text in images	Low resolution	Request higher detail (detail_level: high)
Wrong document parsing	Complex layout	Break the document into sections
Inaccurate transcription	Poor audio quality	Ask the model to flag unclear segments instead of guessing

References

See: GPT-4V documentation, Claude Vision Guide
