Claude Code Plugins

Community-maintained marketplace


axiom-vision-ref

@CharlesWiltgen/Axiom

Vision framework API, VNDetectHumanHandPoseRequest, VNDetectHumanBodyPoseRequest, person segmentation, face detection, VNImageRequestHandler, recognized points, joint landmarks, VNRecognizeTextRequest, VNDetectBarcodesRequest, DataScannerViewController, VNDocumentCameraViewController, RecognizeDocumentsRequest

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name axiom-vision-ref
description Vision framework API, VNDetectHumanHandPoseRequest, VNDetectHumanBodyPoseRequest, person segmentation, face detection, VNImageRequestHandler, recognized points, joint landmarks, VNRecognizeTextRequest, VNDetectBarcodesRequest, DataScannerViewController, VNDocumentCameraViewController, RecognizeDocumentsRequest
skill_type reference
version 1.1.0
last_updated 2026-01-03
apple_platforms iOS 11+, iPadOS 11+, macOS 10.13+, tvOS 11+, visionOS 1+

Vision Framework API Reference

Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.

When to Use This Reference

  • Implementing subject lifting using VisionKit or Vision
  • Detecting hand/body poses for gesture recognition or fitness apps
  • Segmenting people from backgrounds or separating multiple individuals
  • Face detection and landmarks for AR effects or authentication
  • Combining Vision APIs to solve complex computer vision problems
  • Looking up specific API signatures and parameter meanings
  • Recognizing text in images (OCR) with VNRecognizeTextRequest
  • Detecting barcodes and QR codes with VNDetectBarcodesRequest
  • Building live scanners with DataScannerViewController
  • Scanning documents with VNDocumentCameraViewController
  • Extracting structured document data with RecognizeDocumentsRequest (iOS 26+)

Related skills: See axiom-vision for decision trees and patterns, axiom-vision-diag for troubleshooting

Vision Framework Overview

Vision provides computer vision algorithms for still images and video:

Core workflow:

  1. Create request (e.g., VNDetectHumanHandPoseRequest())
  2. Create handler with image (VNImageRequestHandler(cgImage: image))
  3. Perform request (try handler.perform([request]))
  4. Access observations from request.results

Coordinate system: Lower-left origin, normalized (0.0-1.0) coordinates

Performance: Run requests on a background queue; they are resource-intensive and will block the UI if performed on the main thread
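
A minimal sketch of this workflow, run off the main thread (the request type and queue QoS are just examples):

import Vision

func detectHandPose(in image: CGImage) {
    // Perform Vision work off the main thread
    DispatchQueue.global(qos: .userInitiated).async {
        let request = VNDetectHumanHandPoseRequest()          // 1. Create request
        let handler = VNImageRequestHandler(cgImage: image)   // 2. Create handler

        do {
            try handler.perform([request])                    // 3. Perform request
        } catch {
            print("Vision request failed: \(error)")
            return
        }

        // 4. Access observations
        let observations = request.results as? [VNHumanHandPoseObservation] ?? []

        DispatchQueue.main.async {
            // Hop back to the main thread for UI updates
            print("Detected \(observations.count) hand(s)")
        }
    }
}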

Subject Segmentation APIs

VNGenerateForegroundInstanceMaskRequest

Availability: iOS 17+, macOS 14+, tvOS 17+, visionOS 1+

Generates class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.)

Basic Usage

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)

try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

InstanceMaskObservation

allInstances: IndexSet containing all foreground instance indices (excludes background 0)

instanceMask: CVPixelBuffer with UInt8 labels (0 = background, 1+ = instance indices)

instanceAtPoint(_:): Returns instance index at normalized point

let point = CGPoint(x: 0.5, y: 0.5)  // Center of image
let instance = observation.instanceAtPoint(point)

if instance == 0 {
    print("Background tapped")
} else {
    print("Instance \(instance) tapped")
}

Generating Masks

createScaledMask(for:croppedToInstancesContent:)

Parameters:

  • for: IndexSet of instances to include
  • croppedToInstancesContent:
    • false = Output matches input resolution (for compositing)
    • true = Tight crop around selected instances

Returns: Single-channel floating-point CVPixelBuffer (soft segmentation mask)

// All instances, full resolution
let mask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false
)

// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
    for: instances,
    croppedToInstancesContent: true
)

Instance Mask Hit Testing

Access raw pixel buffer to map tap coordinates to instance labels:

let instanceMask = observation.instanceMask

CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }

let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let height = CVPixelBufferGetHeight(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)

// Convert the normalized tap point into the mask buffer's pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
    CGPoint(x: normalizedX, y: normalizedY),
    width,
    height
)

// Calculate byte offset
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)

// Read instance label
let label = UnsafeRawPointer(baseAddress!).load(
    fromByteOffset: offset,
    as: UInt8.self
)

let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))

VisionKit Subject Lifting

ImageAnalysisInteraction (iOS)

Availability: iOS 16+, iPadOS 16+

Adds system-like subject lifting UI to views:

let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject  // Or .automatic
imageView.addInteraction(interaction)

Interaction types:

  • .automatic: Subject lifting + Live Text + data detectors
  • .imageSubject: Subject lifting only (no interactive text)

ImageAnalysisOverlayView (macOS)

Availability: macOS 13+

let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)

Programmatic Access

ImageAnalyzer

let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])

let analysis = try await analyzer.analyze(image, configuration: configuration)

Subject Access (ImageAnalysisInteraction / ImageAnalysisOverlayView)

subjects: Set<Subject> - All subjects in the analyzed image (async property)

highlightedSubjects: Set<Subject> - Currently highlighted subjects (user long-pressed)

subject(at:): Async lookup of the subject at a point (returns nil if none)

// Hand the analysis to the interaction, then query subjects
interaction.analysis = analysis
let subjects = await interaction.subjects

// Look up subject at tap
if let subject = await interaction.subject(at: tapPoint) {
    // Process subject
}

// Change highlight state
interaction.highlightedSubjects = subjects

Subject Struct

image: UIImage/NSImage - Extracted subject with transparency (async, throwing accessor)

bounds: CGRect - Subject boundaries in image coordinates

// Single subject image
let subjectImage = try await subject.image

// Composite multiple subjects
let compositeImage = try await interaction.image(for: [subject1, subject2])

Out-of-process: VisionKit analysis happens out-of-process (performance benefit, image size limited)

Person Segmentation APIs

VNGeneratePersonSegmentationRequest

Availability: iOS 15+, macOS 12+

Returns single mask containing all people in image:

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced  // Or .fast / .accurate
try handler.perform([request])

guard let observation = request.results?.first as? VNPixelBufferObservation else {
    return
}

let personMask = observation.pixelBuffer  // CVPixelBuffer

VNGeneratePersonInstanceMaskRequest

Availability: iOS 17+, macOS 14+

Returns separate masks for up to 4 people:

let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances  // Up to 4 people (1-4)

// Get mask for person 1
let person1Mask = try observation.createScaledMask(
    for: IndexSet(integer: 1),
    croppedToInstancesContent: false
)

Limitations:

  • Segments up to 4 people
  • With >4 people: may miss people or combine them (typically background people)
  • Use VNDetectFaceRectanglesRequest to count faces if you need to handle crowded scenes (see the sketch below)
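
A sketch of that fallback: count faces first, then choose between per-person instance masks and the single combined mask (the threshold and error handling are illustrative):

import Vision

func personMasks(for image: CGImage) throws {
    let handler = VNImageRequestHandler(cgImage: image)

    // Count faces to estimate how many people are present
    let faceRequest = VNDetectFaceRectanglesRequest()
    try handler.perform([faceRequest])
    let faceCount = faceRequest.results?.count ?? 0

    if faceCount <= 4 {
        // Separate mask per person (up to 4)
        let instanceRequest = VNGeneratePersonInstanceMaskRequest()
        try handler.perform([instanceRequest])
        if let observation = instanceRequest.results?.first as? VNInstanceMaskObservation {
            print("Person instance indices: \(Array(observation.allInstances))")
        }
    } else {
        // Crowded scene: fall back to one mask covering everyone
        let segmentationRequest = VNGeneratePersonSegmentationRequest()
        try handler.perform([segmentationRequest])
        if let observation = segmentationRequest.results?.first as? VNPixelBufferObservation {
            // Single CVPixelBuffer mask; composite with CIBlendWithMask as shown later
            let combinedMask = observation.pixelBuffer
            _ = combinedMask
        }
    }
}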

Hand Pose Detection

VNDetectHumanHandPoseRequest

Availability: iOS 14+, macOS 11+

Detects 21 hand landmarks per hand:

let request = VNDetectHumanHandPoseRequest()
request.maximumHandCount = 2  // Default: 2, increase if needed

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

for observation in request.results as? [VNHumanHandPoseObservation] ?? [] {
    // Process each hand
}

Performance note: maximumHandCount affects latency. Pose is computed for at most that many hands, so set it to the lowest value your use case needs.

Hand Landmarks (21 points)

Wrist: 1 landmark

Thumb (4 landmarks):

  • .thumbTip
  • .thumbIP (interphalangeal joint)
  • .thumbMP (metacarpophalangeal joint)
  • .thumbCMC (carpometacarpal joint)

Fingers (4 landmarks each):

  • Tip (.indexTip, .middleTip, .ringTip, .littleTip)
  • DIP (distal interphalangeal joint)
  • PIP (proximal interphalangeal joint)
  • MCP (metacarpophalangeal joint)

Group Keys

Access landmark groups:

Group Key Points
.all All 21 landmarks
.thumb 4 thumb joints
.indexFinger 4 index finger joints
.middleFinger 4 middle finger joints
.ringFinger 4 ring finger joints
.littleFinger 4 little finger joints

// Get all points
let allPoints = try observation.recognizedPoints(.all)

// Get index finger points only
let indexPoints = try observation.recognizedPoints(.indexFinger)

// Get specific point
let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)

// Check confidence
guard thumbTip.confidence > 0.5 else { return }

// Access location (normalized coordinates, lower-left origin)
let location = thumbTip.location  // CGPoint

Gesture Recognition Example (Pinch)

let thumbTip = try observation.recognizedPoint(.thumbTip)
let indexTip = try observation.recognizedPoint(.indexTip)

guard thumbTip.confidence > 0.5, indexTip.confidence > 0.5 else {
    return
}

let distance = hypot(
    thumbTip.location.x - indexTip.location.x,
    thumbTip.location.y - indexTip.location.y
)

let isPinching = distance < 0.05  // Normalized threshold

Chirality (Handedness)

let chirality = observation.chirality  // .left or .right or .unknown

Body Pose Detection

VNDetectHumanBodyPoseRequest (2D)

Availability: iOS 14+, macOS 11+

Detects 19 body landmarks (2D normalized coordinates):

let request = VNDetectHumanBodyPoseRequest()
try handler.perform([request])

for observation in request.results as? [VNHumanBodyPoseObservation] ?? [] {
    // Process each person
}

Body Landmarks (19 points)

Face (5 landmarks):

  • .nose, .leftEye, .rightEye, .leftEar, .rightEar

Arms (6 landmarks):

  • Left: .leftShoulder, .leftElbow, .leftWrist
  • Right: .rightShoulder, .rightElbow, .rightWrist

Torso (6 landmarks):

  • .neck (between shoulders)
  • .leftShoulder, .rightShoulder (also in arm groups)
  • .leftHip, .rightHip
  • .root (between hips)

Legs (6 landmarks):

  • Left: .leftHip, .leftKnee, .leftAnkle
  • Right: .rightHip, .rightKnee, .rightAnkle

Note: Shoulders and hips appear in multiple groups

Group Keys (Body)

Group Key Points
.all All 19 landmarks
.face 5 face landmarks
.leftArm shoulder, elbow, wrist
.rightArm shoulder, elbow, wrist
.torso neck, shoulders, hips, root
.leftLeg hip, knee, ankle
.rightLeg hip, knee, ankle

// Get all body points
let allPoints = try observation.recognizedPoints(.all)

// Get left arm only
let leftArmPoints = try observation.recognizedPoints(.leftArm)

// Get specific joint
let leftWrist = try observation.recognizedPoint(.leftWrist)

VNDetectHumanBodyPose3DRequest (3D)

Availability: iOS 17+, macOS 14+

Returns 3D skeleton with 17 joints in meters (real-world coordinates):

let request = VNDetectHumanBodyPose3DRequest()
try handler.perform([request])

guard let observation = request.results?.first as? VNHumanBodyPose3DObservation else {
    return
}

// Get 3D joint position
let leftWrist = try observation.recognizedPoint(.leftWrist)
let position = leftWrist.position  // simd_float4x4 matrix
let localPosition = leftWrist.localPosition  // Relative to parent joint

3D Body Landmarks (17 points): Not identical to the 2D set; the face landmarks (nose, eyes, ears) are dropped, and head, spine, and center-shoulder joints are added.

3D Observation Properties

bodyHeight: Estimated height in meters

  • With depth data: Measured height
  • Without depth data: Reference height (1.8m)

heightEstimation: .measured or .reference

cameraOriginMatrix: simd_float4x4 camera position/orientation relative to subject

pointInImage(_:): Project 3D joint back to 2D image coordinates

let wrist2D = try observation.pointInImage(leftWrist)
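
A short sketch reading the other 3D observation properties (assumes observation is the VNHumanBodyPose3DObservation from above):

// Estimated body height in meters
let height = observation.bodyHeight

// Was the height measured from depth data or is it the 1.8 m reference?
switch observation.heightEstimation {
case .measured:
    print("Measured height: \(height) m")
case .reference:
    print("Reference height used: \(height) m")
@unknown default:
    break
}

// Camera pose relative to the subject
let cameraMatrix = observation.cameraOriginMatrix  // simd_float4x4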

3D Point Classes

VNPoint3D: Base class with simd_float4x4 position matrix

VNRecognizedPoint3D: Adds identifier (joint name)

VNHumanBodyRecognizedPoint3D: Adds localPosition and parentJoint

// Position relative to skeleton root (center of hip)
let modelPosition = leftWrist.position

// Position relative to parent joint (left elbow)
let relativePosition = leftWrist.localPosition

Depth Input

Vision accepts depth data alongside images:

// From AVDepthData
let handler = VNImageRequestHandler(
    cvPixelBuffer: imageBuffer,
    depthData: depthData,
    orientation: orientation
)

// From file (automatic depth extraction)
let handler = VNImageRequestHandler(url: imageURL)  // Depth auto-fetched

Depth formats: Disparity or Depth (interchangeable via AVFoundation)

LiDAR: Use in live capture sessions for accurate scale/measurement

Face Detection & Landmarks

VNDetectFaceRectanglesRequest

Availability: iOS 11+

Detects face bounding boxes:

let request = VNDetectFaceRectanglesRequest()
try handler.perform([request])

for observation in request.results as? [VNFaceObservation] ?? [] {
    let faceBounds = observation.boundingBox  // Normalized rect
}

VNDetectFaceLandmarksRequest

Availability: iOS 11+

Detects face with detailed landmarks:

let request = VNDetectFaceLandmarksRequest()
try handler.perform([request])

for observation in request.results as? [VNFaceObservation] ?? [] {
    if let landmarks = observation.landmarks {
        let leftEye = landmarks.leftEye
        let nose = landmarks.nose
        let leftPupil = landmarks.leftPupil  // Revision 3+
    }
}

Revisions:

  • Revision 1: Basic landmarks
  • Revision 2: Detects upside-down faces
  • Revision 3+: Pupil locations (see the sketch below)
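
If pupil locations matter, a sketch of pinning the request to revision 3 (assuming it is available on the deployment target):

let request = VNDetectFaceLandmarksRequest()

// Pin the revision so behavior doesn't change when the default revision moves
if VNDetectFaceLandmarksRequest.supportedRevisions.contains(VNDetectFaceLandmarksRequestRevision3) {
    request.revision = VNDetectFaceLandmarksRequestRevision3
}

try handler.perform([request])

for observation in request.results as? [VNFaceObservation] ?? [] {
    if let pupil = observation.landmarks?.leftPupil {
        // Landmark points are normalized to the face bounding box
        print("Left pupil points: \(pupil.normalizedPoints)")
    }
}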

Person Detection

VNDetectHumanRectanglesRequest

Availability: iOS 13+

Detects human bounding boxes (torso detection):

let request = VNDetectHumanRectanglesRequest()
try handler.perform([request])

for observation in request.results as? [VNHumanObservation] ?? [] {
    let humanBounds = observation.boundingBox  // Normalized rect
}

Use case: Faster than pose detection when you only need location

CoreImage Integration

CIBlendWithMask Filter

Composite subject on new background using Vision mask:

// 1. Get mask from Vision
guard let observation = request.results?.first as? VNInstanceMaskObservation else { return }
let visionMask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false
)

// 2. Convert to CIImage
let maskImage = CIImage(cvPixelBuffer: visionMask)

// 3. Apply filter
let filter = CIFilter(name: "CIBlendWithMask")!
filter.setValue(sourceImage, forKey: kCIInputImageKey)
filter.setValue(maskImage, forKey: kCIInputMaskImageKey)
filter.setValue(newBackground, forKey: kCIInputBackgroundImageKey)

let output = filter.outputImage  // Composited result

Parameters:

  • Input image: Original image to mask
  • Mask image: Vision's soft segmentation mask
  • Background image: New background (or empty image for transparency)

HDR preservation: CoreImage preserves high dynamic range from input (Vision/VisionKit output is SDR)

Text Recognition APIs

VNRecognizeTextRequest

Availability: iOS 13+, macOS 10.15+

Recognizes text in images with configurable accuracy/speed trade-off.

Basic Usage

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate  // Or .fast
request.recognitionLanguages = ["en-US", "de-DE"]  // Order matters
request.usesLanguageCorrection = true

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

for observation in request.results as? [VNRecognizedTextObservation] ?? [] {
    // Get top candidates
    let candidates = observation.topCandidates(3)
    let bestText = candidates.first?.string ?? ""
}

Recognition Levels

Level Performance Accuracy Best For
.fast Real-time Good Camera feed, large text, signs
.accurate Slower Excellent Documents, receipts, handwriting

Fast path: Character detection followed by character-by-character recognition

Accurate path: Neural network that finds and recognizes whole lines and words

Properties

Property Type Description
recognitionLevel VNRequestTextRecognitionLevel .fast or .accurate
recognitionLanguages [String] BCP 47 language codes, order = priority
usesLanguageCorrection Bool Use language model for correction
customWords [String] Domain-specific vocabulary
automaticallyDetectsLanguage Bool Auto-detect language (iOS 16+)
minimumTextHeight Float Min text height as fraction of image (0-1)
revision Int API version (affects supported languages)

Language Support

// Check supported languages for current settings
let languages = try VNRecognizeTextRequest.supportedRecognitionLanguages(
    for: .accurate,
    revision: VNRecognizeTextRequestRevision3
)

Language correction: Improves accuracy but takes processing time. Disable for codes/serial numbers.

Custom words: Add domain-specific vocabulary for better recognition (medical terms, product codes).
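
For example, a sketch of settings for scanning serial numbers versus domain vocabulary (the word list is illustrative):

let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate

// Serial numbers / part codes: the language model tends to "correct" them, so turn it off
request.usesLanguageCorrection = false

// Or keep correction on and teach it domain vocabulary instead
// request.usesLanguageCorrection = true
// request.customWords = ["ACME-42X", "ibuprofen", "mAh"]

// Ignore tiny background text (fraction of image height, 0-1)
request.minimumTextHeight = 0.05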

VNRecognizedTextObservation

boundingBox: Normalized rect containing recognized text

topCandidates(_:): Returns [VNRecognizedText] ordered by confidence

VNRecognizedText

Property Type Description
string String Recognized text
confidence VNConfidence 0.0-1.0
boundingBox(for:) VNRectangleObservation? Box for substring range

// Get bounding box for substring
let text = candidate.string
if let range = text.range(of: "invoice") {
    let box = try candidate.boundingBox(for: range)
}

Barcode Detection APIs

VNDetectBarcodesRequest

Availability: iOS 11+, macOS 10.13+

Detects and decodes barcodes and QR codes.

Basic Usage

let request = VNDetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128]  // Specific codes

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

for barcode in request.results as? [VNBarcodeObservation] ?? [] {
    let payload = barcode.payloadStringValue
    let type = barcode.symbology
    let bounds = barcode.boundingBox
}

Symbologies

1D Barcodes:

  • .codabar (iOS 15+)
  • .code39, .code39Checksum, .code39FullASCII, .code39FullASCIIChecksum
  • .code93, .code93i
  • .code128
  • .ean8, .ean13
  • .gs1DataBar, .gs1DataBarExpanded, .gs1DataBarLimited (iOS 15+)
  • .i2of5, .i2of5Checksum
  • .itf14
  • .upce

2D Codes:

  • .aztec
  • .dataMatrix
  • .microPDF417 (iOS 15+)
  • .microQR (iOS 15+)
  • .pdf417
  • .qr

Performance: Specifying fewer symbologies = faster detection

Revisions

Revision iOS Features
1 11+ Basic detection, one code at a time
2 15+ Codabar, GS1, MicroPDF, MicroQR, better ROI
3 16+ ML-based, multiple codes, better bounding boxes

VNBarcodeObservation

Property Type Description
payloadStringValue String? Decoded content
symbology VNBarcodeSymbology Barcode type
boundingBox CGRect Normalized bounds
topLeft/topRight/bottomLeft/bottomRight CGPoint Corner points
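
A sketch combining these: pin revision 3 when available and read the corner points (drawing/overlay code omitted):

let request = VNDetectBarcodesRequest()
request.symbologies = [.qr]

// Revision 3 (iOS 16+) detects multiple codes and returns tighter boxes
if VNDetectBarcodesRequest.supportedRevisions.contains(VNDetectBarcodesRequestRevision3) {
    request.revision = VNDetectBarcodesRequestRevision3
}

try handler.perform([request])

for barcode in request.results as? [VNBarcodeObservation] ?? [] {
    // Corner points are normalized, lower-left origin
    let corners = [barcode.topLeft, barcode.topRight,
                   barcode.bottomRight, barcode.bottomLeft]
    print(barcode.payloadStringValue ?? "<binary payload>", corners)
}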

VisionKit Scanner APIs

DataScannerViewController

Availability: iOS 16+

Camera-based live scanner with built-in UI for text and barcodes.

Check Availability

// Hardware support
DataScannerViewController.isSupported

// Runtime availability (camera access, parental controls)
DataScannerViewController.isAvailable

Configuration

import VisionKit

let dataTypes: Set<DataScannerViewController.RecognizedDataType> = [
    .barcode(symbologies: [.qr, .ean13]),
    .text(textContentType: .URL),  // Or nil for all text
    // .text(languages: ["ja"])  // Filter by language
]

let scanner = DataScannerViewController(
    recognizedDataTypes: dataTypes,
    qualityLevel: .balanced,  // .fast, .balanced, .accurate
    recognizesMultipleItems: true,
    isHighFrameRateTrackingEnabled: true,
    isPinchToZoomEnabled: true,
    isGuidanceEnabled: true,
    isHighlightingEnabled: true
)

scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}

RecognizedDataType

Type Description
.barcode(symbologies:) Specific barcode types
.text() All text
.text(languages:) Text filtered by language
.text(textContentType:) Text filtered by type (URL, phone, email)

Delegate Protocol

protocol DataScannerViewControllerDelegate {
    func dataScanner(_ dataScanner: DataScannerViewController,
                     didTapOn item: RecognizedItem)

    func dataScanner(_ dataScanner: DataScannerViewController,
                     didAdd addedItems: [RecognizedItem],
                     allItems: [RecognizedItem])

    func dataScanner(_ dataScanner: DataScannerViewController,
                     didUpdate updatedItems: [RecognizedItem],
                     allItems: [RecognizedItem])

    func dataScanner(_ dataScanner: DataScannerViewController,
                     didRemove removedItems: [RecognizedItem],
                     allItems: [RecognizedItem])

    func dataScanner(_ dataScanner: DataScannerViewController,
                     becameUnavailableWithError error: DataScannerViewController.ScanningUnavailable)
}

RecognizedItem

enum RecognizedItem {
    case text(RecognizedItem.Text)
    case barcode(RecognizedItem.Barcode)

    var id: UUID { get }
    var bounds: RecognizedItem.Bounds { get }
}

// Text item
struct Text {
    let transcript: String
}

// Barcode item
struct Barcode {
    let payloadStringValue: String?
    let observation: VNBarcodeObservation
}
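
A sketch of handling new items in the didAdd callback (same delegate signature as above):

func dataScanner(_ dataScanner: DataScannerViewController,
                 didAdd addedItems: [RecognizedItem],
                 allItems: [RecognizedItem]) {
    for item in addedItems {
        switch item {
        case .text(let text):
            print("Text: \(text.transcript)")
        case .barcode(let barcode):
            print("Barcode: \(barcode.payloadStringValue ?? "<no payload>")")
        @unknown default:
            break
        }
    }
}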

Async Stream

// Alternative to delegate
for await items in scanner.recognizedItems {
    // Current recognized items
}

Custom Highlights

// Add custom views over recognized items
scanner.overlayContainerView.addSubview(customHighlight)

// Capture still photo
let photo = try await scanner.capturePhoto()

VNDocumentCameraViewController

Availability: iOS 13+

Document scanning with automatic edge detection, perspective correction, and lighting adjustment.

Basic Usage

import VisionKit

let camera = VNDocumentCameraViewController()
camera.delegate = self
present(camera, animated: true)

Delegate Protocol

protocol VNDocumentCameraViewControllerDelegate {
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                       didFinishWith scan: VNDocumentCameraScan)

    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController)

    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                       didFailWithError error: Error)
}

VNDocumentCameraScan

Property Type Description
pageCount Int Number of scanned pages
imageOfPage(at:) UIImage Get page image at index
title String User-editable title

func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                   didFinishWith scan: VNDocumentCameraScan) {
    controller.dismiss(animated: true)

    for i in 0..<scan.pageCount {
        let pageImage = scan.imageOfPage(at: i)
        // Process with VNRecognizeTextRequest
    }
}
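
For example, a sketch that runs VNRecognizeTextRequest on each scanned page (error handling kept minimal):

import UIKit
import Vision
import VisionKit

func recognizeText(in scan: VNDocumentCameraScan) {
    for pageIndex in 0..<scan.pageCount {
        guard let cgImage = scan.imageOfPage(at: pageIndex).cgImage else { continue }

        let request = VNRecognizeTextRequest()
        request.recognitionLevel = .accurate

        let handler = VNImageRequestHandler(cgImage: cgImage)
        try? handler.perform([request])

        let lines = (request.results as? [VNRecognizedTextObservation] ?? [])
            .compactMap { $0.topCandidates(1).first?.string }
        print("Page \(pageIndex):\n\(lines.joined(separator: "\n"))")
    }
}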

Document Analysis APIs

VNDetectDocumentSegmentationRequest

Availability: iOS 15+, macOS 12+

Detects document boundaries for custom camera UIs or post-processing.

let request = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: image)
try handler.perform([request])

guard let observation = request.results?.first as? VNRectangleObservation else {
    return  // No document found
}

// Get corner points (normalized)
let corners = [
    observation.topLeft,
    observation.topRight,
    observation.bottomLeft,
    observation.bottomRight
]

vs VNDetectRectanglesRequest:

  • Document: ML-based, trained specifically on documents
  • Rectangle: Edge-based, finds any quadrilateral
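
The detected corners can feed Core Image's CIPerspectiveCorrection filter to deskew the document. A sketch, assuming image is the same CIImage given to the handler (Vision's normalized, lower-left-origin corners map directly to Core Image's coordinate space after scaling):

import CoreImage
import Vision

func deskew(_ image: CIImage, using observation: VNRectangleObservation) -> CIImage? {
    let size = image.extent.size

    // Convert normalized (lower-left origin) corners to pixel coordinates
    func scaled(_ point: CGPoint) -> CIVector {
        CIVector(x: point.x * size.width, y: point.y * size.height)
    }

    guard let filter = CIFilter(name: "CIPerspectiveCorrection") else { return nil }
    filter.setValue(image, forKey: kCIInputImageKey)
    filter.setValue(scaled(observation.topLeft), forKey: "inputTopLeft")
    filter.setValue(scaled(observation.topRight), forKey: "inputTopRight")
    filter.setValue(scaled(observation.bottomLeft), forKey: "inputBottomLeft")
    filter.setValue(scaled(observation.bottomRight), forKey: "inputBottomRight")
    return filter.outputImage
}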

RecognizeDocumentsRequest (iOS 26+)

Availability: iOS 26+, macOS 26+

Structured document understanding with semantic parsing.

Basic Usage

let request = RecognizeDocumentsRequest()
let observations = try await request.perform(on: imageData)

guard let document = observations.first?.document else {
    return
}

DocumentObservation Hierarchy

DocumentObservation
└── document: DocumentObservation.Document
    ├── text: TextObservation
    ├── tables: [Container.Table]
    ├── lists: [Container.List]
    └── barcodes: [Container.Barcode]

Table Extraction

for table in document.tables {
    for row in table.rows {
        for cell in row {
            let text = cell.content.text.transcript
            let detectedData = cell.content.text.detectedData
        }
    }
}

Detected Data Types

for data in document.text.detectedData {
    switch data.match.details {
    case .emailAddress(let email):
        let address = email.emailAddress
    case .phoneNumber(let phone):
        let number = phone.phoneNumber
    case .link(let url):
        let link = url
    case .address(let address):
        let components = address
    case .date(let date):
        let dateValue = date
    default:
        break
    }
}

TextObservation Hierarchy

TextObservation
├── transcript: String
├── lines: [TextObservation.Line]
├── paragraphs: [TextObservation.Paragraph]
├── words: [TextObservation.Word]
└── detectedData: [DetectedDataObservation]

API Quick Reference

Subject Segmentation

API Platform Purpose
VNGenerateForegroundInstanceMaskRequest iOS 17+ Class-agnostic subject instances
VNGeneratePersonInstanceMaskRequest iOS 17+ Up to 4 people separately
VNGeneratePersonSegmentationRequest iOS 15+ All people (single mask)
ImageAnalysisInteraction (VisionKit) iOS 16+ UI for subject lifting

Pose Detection

API Platform Landmarks Coordinates
VNDetectHumanHandPoseRequest iOS 14+ 21 per hand 2D normalized
VNDetectHumanBodyPoseRequest iOS 14+ 19 body joints 2D normalized
VNDetectHumanBodyPose3DRequest iOS 17+ 17 body joints 3D meters

Face & Person Detection

API Platform Purpose
VNDetectFaceRectanglesRequest iOS 11+ Face bounding boxes
VNDetectFaceLandmarksRequest iOS 11+ Face with detailed landmarks
VNDetectHumanRectanglesRequest iOS 13+ Human torso bounding boxes

Text & Barcode

API Platform Purpose
VNRecognizeTextRequest iOS 13+ Text recognition (OCR)
VNDetectBarcodesRequest iOS 11+ Barcode/QR detection
DataScannerViewController iOS 16+ Live camera scanner (text + barcodes)
VNDocumentCameraViewController iOS 13+ Document scanning with perspective correction
VNDetectDocumentSegmentationRequest iOS 15+ Programmatic document edge detection
RecognizeDocumentsRequest iOS 26+ Structured document extraction

Observation Types

Observation Returned By
VNInstanceMaskObservation Foreground/person instance masks
VNPixelBufferObservation Person segmentation (single mask)
VNHumanHandPoseObservation Hand pose
VNHumanBodyPoseObservation Body pose (2D)
VNHumanBodyPose3DObservation Body pose (3D)
VNFaceObservation Face detection/landmarks
VNHumanObservation Human rectangles
VNRecognizedTextObservation Text recognition
VNBarcodeObservation Barcode detection
VNRectangleObservation Document segmentation
DocumentObservation Structured document (iOS 26+)

Resources

WWDC: 2019-234, 2021-10041, 2022-10024, 2022-10025, 2025-272, 2023-10176, 2023-111241, 2023-10048, 2020-10653, 2020-10043, 2020-10099

Docs: /vision, /visionkit, /vision/vnrecognizetextrequest, /vision/vndetectbarcodesrequest

Skills: axiom-vision, axiom-vision-diag