Claude Code Plugins

Community-maintained marketplace

Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: TTS
description: Implement text-to-speech (TTS) capabilities using the z-ai-web-dev-sdk. Use this skill when the user needs to convert text into natural-sounding speech, create audio content, build voice-enabled applications, or generate spoken audio files. Supports multiple voices, adjustable speed, and various audio formats.
license: MIT

TTS (Text to Speech) Skill

This skill guides the implementation of text-to-speech (TTS) functionality using the z-ai-web-dev-sdk package, enabling conversion of text into natural-sounding speech audio.

Skills Path

Skill Location: {project_path}/skills/TTS

This skill is located at the above path in your project.

Reference Scripts: Example test scripts are available in the {Skill Location}/scripts/ directory for quick testing and reference. See {Skill Location}/scripts/tts.ts for a working example.

Overview

Text-to-Speech allows you to build applications that generate spoken audio from text input, supporting various voices, speeds, and output formats for diverse use cases.

IMPORTANT: z-ai-web-dev-sdk MUST be used in backend code only. Never use it in client-side code.

API Limitations and Constraints

Before implementing TTS functionality, be aware of these important limitations:

Input Text Constraints

  • Maximum length: 1024 characters per request
  • Text exceeding this limit must be split into smaller chunks

Audio Parameters

  • Speed range: 0.5 to 2.0
    • 0.5 = half speed (slower)
    • 1.0 = normal speed (default)
    • 2.0 = double speed (faster)
  • Volume range: Greater than 0, up to 10
    • Default: 1.0
    • Values must be greater than 0 (exclusive) and up to 10 (inclusive)

Format and Streaming

  • Streaming limitation: When stream: true is enabled, only pcm format is supported
  • Non-streaming: Supports wav, pcm, and mp3 formats
  • Sample rate: 24000 Hz (recommended)
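
Streaming is therefore only practical when you want raw pcm output. Below is a minimal sketch of streaming to a file; it assumes the SDK's standard Response exposes its body as a web ReadableStream (see the Response Object Format section), which is piped to disk as the PCM bytes arrive.

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

// Sketch: stream PCM audio to disk as it is generated.
// Assumes response.body is a standard web ReadableStream.
async function streamToPcmFile(text, outputPath) {
  const zai = await ZAI.create();

  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: 1.0,
    response_format: 'pcm', // streaming only supports pcm
    stream: true
  });

  // Convert the web stream to a Node stream and pipe it into the output file
  await pipeline(Readable.fromWeb(response.body), fs.createWriteStream(outputPath));
  return outputPath;
}

// Usage
await streamToPcmFile('This text is streamed as raw PCM', './stream.pcm');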

Best Practice for Long Text

function splitTextIntoChunks(text, maxLength = 1000) {
  const chunks = [];
  // Split on sentence boundaries; the trailing [.!?]* also keeps a final
  // fragment that has no closing punctuation.
  const sentences = text.match(/[^.!?]+[.!?]*/g) || [text];

  let currentChunk = '';
  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxLength) {
      currentChunk += sentence;
    } else {
      if (currentChunk) chunks.push(currentChunk.trim());
      // Note: a single sentence longer than maxLength will still exceed the
      // limit and should be split further (e.g. on commas) before sending.
      currentChunk = sentence;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());

  return chunks;
}
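
After splitting, each chunk can be synthesized on its own. The sketch below reuses splitTextIntoChunks and joins the results; raw pcm buffers can simply be concatenated, whereas wav and mp3 files each carry their own header and are easier to keep as separate files.

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

// Sketch: synthesize long text chunk by chunk and join the raw PCM output.
async function longTextToPcm(longText, outputPath) {
  const zai = await ZAI.create();
  const chunks = splitTextIntoChunks(longText, 1000);
  const buffers = [];

  for (const chunk of chunks) {
    const response = await zai.audio.tts.create({
      input: chunk,
      voice: 'tongtong',
      speed: 1.0,
      response_format: 'pcm', // raw PCM segments concatenate cleanly
      stream: false
    });
    const arrayBuffer = await response.arrayBuffer();
    buffers.push(Buffer.from(new Uint8Array(arrayBuffer)));
  }

  fs.writeFileSync(outputPath, Buffer.concat(buffers));
  return outputPath;
}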

Prerequisites

The z-ai-web-dev-sdk package is already installed. Import it as shown in the examples below.

CLI Usage (For Simple Tasks)

For simple text-to-speech conversions, you can use the z-ai CLI instead of writing code. This is ideal for quick audio generation, testing voices, or simple automation.

Basic TTS

# Convert text to speech (default WAV format)
z-ai tts --input "Hello, world" --output ./hello.wav

# Using short options
z-ai tts -i "Hello, world" -o ./hello.wav

Different Voices and Speed

# Use specific voice
z-ai tts -i "Welcome to our service" -o ./welcome.wav --voice tongtong

# Adjust speech speed (0.5-2.0)
z-ai tts -i "This is faster speech" -o ./fast.wav --speed 1.5

# Slower speech
z-ai tts -i "This is slower speech" -o ./slow.wav --speed 0.8

Different Output Formats

# MP3 format
z-ai tts -i "Hello World" -o ./hello.mp3 --format mp3

# WAV format (default)
z-ai tts -i "Hello World" -o ./hello.wav --format wav

# PCM format
z-ai tts -i "Hello World" -o ./hello.pcm --format pcm

Streaming Output

# Stream audio generation (streaming output only supports PCM)
z-ai tts -i "This is a longer text that will be streamed" -o ./stream.pcm --format pcm --stream

CLI Parameters

  • --input, -i <text>: Required - Text to convert to speech (max 1024 characters)
  • --output, -o <path>: Required - Output audio file path
  • --voice, -v <voice>: Optional - Voice type (default: tongtong)
  • --speed, -s <number>: Optional - Speech speed, 0.5-2.0 (default: 1.0)
  • --format, -f <format>: Optional - Output format: wav, mp3, pcm (default: wav)
  • --stream: Optional - Enable streaming output (only supports pcm format)

When to Use CLI vs SDK

Use CLI for:

  • Quick text-to-speech conversions
  • Testing different voices and speeds
  • Simple batch audio generation
  • Command-line automation scripts

Use SDK for:

  • Dynamic audio generation in applications
  • Integration with web services
  • Custom audio processing pipelines
  • Production applications with complex requirements

Basic TTS Implementation

Simple Text to Speech

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function textToSpeech(text, outputPath) {
  const zai = await ZAI.create();

  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  console.log(`Audio saved to ${outputPath}`);
  return outputPath;
}

// Usage
await textToSpeech('Hello, world!', './output.wav');

Multiple Voice Options

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithVoice(text, voice, outputPath) {
  const zai = await ZAI.create();

  const response = await zai.audio.tts.create({
    input: text,
    voice: voice, // Available voices: tongtong, chuichui, xiaochen, jam, kazi, douji, luodo
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage
await generateWithVoice('Welcome to our service', 'tongtong', './welcome.wav');

Adjustable Speed

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithSpeed(text, speed, outputPath) {
  const zai = await ZAI.create();

  // Speed range: 0.5 to 2.0 (API constraint)
  // 0.5 = half speed (slower)
  // 1.0 = normal speed (default)
  // 2.0 = double speed (faster)
  // Values outside this range will cause API errors

  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: speed,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage - slower narration
await generateWithSpeed('This is an important announcement', 0.8, './slow.wav');

// Usage - faster narration
await generateWithSpeed('Quick update', 1.3, './fast.wav');

Adjustable Volume

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithVolume(text, volume, outputPath) {
  const zai = await ZAI.create();

  // Volume range: greater than 0, up to 10 (API constraint)
  // Values must be > 0 (exclusive) and <= 10 (inclusive)
  // Default: 1.0 (normal volume)

  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: 1.0,
    volume: volume, // Optional parameter
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage - louder audio
await generateWithVolume('This is an announcement', 5.0, './loud.wav');

// Usage - quieter audio
await generateWithVolume('Whispered message', 0.5, './quiet.wav');

Advanced Use Cases

Batch Processing

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

async function batchTextToSpeech(textArray, outputDir) {
  const zai = await ZAI.create();
  const results = [];

  // Ensure output directory exists
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir, { recursive: true });
  }

  for (let i = 0; i < textArray.length; i++) {
    try {
      const text = textArray[i];
      const outputPath = path.join(outputDir, `audio_${i + 1}.wav`);

      const response = await zai.audio.tts.create({
        input: text,
        voice: 'tongtong',
        speed: 1.0,
        response_format: 'wav',
        stream: false
      });

      // Get array buffer from Response object
      const arrayBuffer = await response.arrayBuffer();
      const buffer = Buffer.from(new Uint8Array(arrayBuffer));

      fs.writeFileSync(outputPath, buffer);
      results.push({
        success: true,
        text,
        path: outputPath
      });
    } catch (error) {
      results.push({
        success: false,
        text: textArray[i],
        error: error.message
      });
    }
  }

  return results;
}

// Usage
const texts = [
  'Welcome to chapter one',
  'Welcome to chapter two',
  'Welcome to chapter three'
];

const results = await batchTextToSpeech(texts, './audio-output');
console.log('Generated:', results.filter(r => r.success).length, 'audio files');

Dynamic Content Generation

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

class TTSGenerator {
  constructor() {
    this.zai = null;
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async generateAudio(text, options = {}) {
    if (!this.zai) {
      throw new Error('TTSGenerator is not initialized - call initialize() first');
    }

    const {
      voice = 'tongtong',
      speed = 1.0,
      format = 'wav'
    } = options;

    const response = await this.zai.audio.tts.create({
      input: text,
      voice: voice,
      speed: speed,
      response_format: format,
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    return Buffer.from(new Uint8Array(arrayBuffer));
  }

  async saveAudio(text, outputPath, options = {}) {
    const buffer = await this.generateAudio(text, options);
    if (buffer) {
      fs.writeFileSync(outputPath, buffer);
      return outputPath;
    }
    return null;
  }
}

// Usage
const generator = new TTSGenerator();
await generator.initialize();

await generator.saveAudio(
  'Hello, this is a test',
  './output.wav',
  { speed: 1.2 }
);

Next.js API Route Example

import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const { text, voice = 'tongtong', speed = 1.0 } = await req.json();

    if (!text || typeof text !== 'string' || text.trim().length === 0) {
      return NextResponse.json({ error: 'Text is required' }, { status: 400 });
    }

    // Import ZAI SDK
    const ZAI = (await import('z-ai-web-dev-sdk')).default;

    // Create SDK instance
    const zai = await ZAI.create();

    // Generate TTS audio
    const response = await zai.audio.tts.create({
      input: text.trim(),
      voice: voice,
      speed: speed,
      response_format: 'wav',
      stream: false,
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    // Return audio as response
    return new NextResponse(buffer, {
      status: 200,
      headers: {
        'Content-Type': 'audio/wav',
        'Content-Length': buffer.length.toString(),
        'Cache-Control': 'no-cache',
      },
    });
  } catch (error) {
    console.error('TTS API Error:', error);

    return NextResponse.json(
      {
        error: error instanceof Error ? error.message : 'Failed to generate speech, please try again later',
      },
      { status: 500 }
    );
  }
}
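
For reference, the route above can be called from browser code with a plain fetch; the binary response is read as a Blob and handed to an Audio element. The /api/tts path is illustrative and depends on where the route file is placed.

// Client-side usage sketch (no SDK import here - the SDK stays on the server)
async function playTTS(text) {
  const res = await fetch('/api/tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voice: 'tongtong', speed: 1.0 })
  });

  if (!res.ok) {
    const { error } = await res.json();
    throw new Error(error || 'TTS request failed');
  }

  // Turn the WAV bytes into an object URL and play it
  const blob = await res.blob();
  const audio = new Audio(URL.createObjectURL(blob));
  await audio.play();
}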

Best Practices

1. Text Preparation

function prepareTextForTTS(text) {
  // Remove excessive whitespace
  text = text.replace(/\s+/g, ' ').trim();

  // Expand common abbreviations for better pronunciation
  const abbreviations = {
    'Dr.': 'Doctor',
    'Mr.': 'Mister',
    'Mrs.': 'Misses',
    'etc.': 'et cetera'
  };

  for (const [abbr, full] of Object.entries(abbreviations)) {
    // Escape the period so it is matched literally rather than as a regex wildcard
    const pattern = new RegExp(abbr.replace(/\./g, '\\.'), 'g');
    text = text.replace(pattern, full);
  }

  return text;
}

2. Error Handling

import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function safeTTS(text, outputPath) {
  try {
    // Validate input
    if (!text || text.trim().length === 0) {
      throw new Error('Text input cannot be empty');
    }

    if (text.length > 1024) {
      throw new Error('Text input exceeds maximum length of 1024 characters');
    }

    const zai = await ZAI.create();

    const response = await zai.audio.tts.create({
      input: text,
      voice: 'tongtong',
      speed: 1.0,
      response_format: 'wav',
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    fs.writeFileSync(outputPath, buffer);

    return {
      success: true,
      path: outputPath,
      size: buffer.length
    };
  } catch (error) {
    console.error('TTS Error:', error);
    return {
      success: false,
      error: error.message
    };
  }
}

3. SDK Instance Reuse

import ZAI from 'z-ai-web-dev-sdk';

// Create a singleton instance
let zaiInstance = null;

async function getZAIInstance() {
  if (!zaiInstance) {
    zaiInstance = await ZAI.create();
  }
  return zaiInstance;
}

// Usage
const zai = await getZAIInstance();
const response = await zai.audio.tts.create({ ... });

Common Use Cases

  1. Audiobooks & Podcasts: Convert written content to audio format
  2. E-learning: Create narration for educational content
  3. Accessibility: Provide audio versions of text content
  4. Voice Assistants: Generate dynamic responses
  5. Announcements: Create automated audio notifications
  6. IVR Systems: Generate phone system prompts
  7. Content Localization: Create audio in different languages

Integration Examples

Express.js API Endpoint

import express from 'express';
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

const app = express();
app.use(express.json());

let zaiInstance;
const outputDir = './audio-output';

async function initZAI() {
  zaiInstance = await ZAI.create();
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir, { recursive: true });
  }
}

app.post('/api/tts', async (req, res) => {
  try {
    const { text, voice = 'tongtong', speed = 1.0 } = req.body;

    if (!text) {
      return res.status(400).json({ error: 'Text is required' });
    }

    const filename = `tts_${Date.now()}.wav`;
    const outputPath = path.join(outputDir, filename);

    const response = await zaiInstance.audio.tts.create({
      input: text,
      voice: voice,
      speed: speed,
      response_format: 'wav',
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    fs.writeFileSync(outputPath, buffer);

    res.json({
      success: true,
      audioUrl: `/audio/${filename}`,
      size: buffer.length
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.use('/audio', express.static('audio-output'));

initZAI().then(() => {
  app.listen(3000, () => {
    console.log('TTS API running on port 3000');
  });
});

Troubleshooting

Issue: "Input text exceeds maximum length"

  • Solution: Text input is limited to 1024 characters. Split longer text into chunks using the splitTextIntoChunks function shown in the API Limitations section

Issue: "Invalid speed parameter" or unexpected speed behavior

  • Solution: Speed must be between 0.5 and 2.0. Check your speed value is within this range

Issue: "Invalid volume parameter"

  • Solution: Volume must be greater than 0 and up to 10. Ensure volume value is in range (0, 10]

Issue: "Stream format not supported" with WAV/MP3

  • Solution: Streaming mode only supports PCM format. Either use response_format: 'pcm' with streaming, or disable streaming (stream: false) for WAV/MP3 output

Issue: "SDK must be used in backend"

  • Solution: Ensure z-ai-web-dev-sdk is only imported in server-side code

Issue: "TypeError: response.audio is undefined"

  • Solution: The SDK returns a standard Response object, use await response.arrayBuffer() instead of accessing response.audio

Issue: Generated audio file is empty or corrupted

  • Solution: Ensure you're calling await response.arrayBuffer() and properly converting to Buffer: Buffer.from(new Uint8Array(arrayBuffer))

Issue: Audio sounds unnatural

  • Solution: Prepare text properly (remove special characters, expand abbreviations)

Issue: Long processing times

  • Solution: Break long text into smaller chunks and process them in parallel (see the sketch after this list)

Issue: Next.js caching old API route

  • Solution: Create a new API route endpoint or restart the dev server
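
For the parallel-processing suggestion above, a minimal sketch using Promise.all; each chunk must still respect the 1024-character limit, and very large batches may need throttling:

import ZAI from 'z-ai-web-dev-sdk';

// Sketch: synthesize several chunks concurrently and return one Buffer per chunk.
async function synthesizeChunksInParallel(chunks) {
  const zai = await ZAI.create();

  const buffers = await Promise.all(
    chunks.map(async (chunk) => {
      const response = await zai.audio.tts.create({
        input: chunk,
        voice: 'tongtong',
        speed: 1.0,
        response_format: 'wav',
        stream: false
      });
      const arrayBuffer = await response.arrayBuffer();
      return Buffer.from(new Uint8Array(arrayBuffer));
    })
  );

  return buffers; // in the same order as the input chunks
}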

Performance Tips

  1. Reuse SDK Instance: Create ZAI instance once and reuse
  2. Implement Caching: Cache generated audio for repeated text (see the sketch below)
  3. Batch Processing: Process multiple texts efficiently
  4. Optimize Text: Remove unnecessary content before generation
  5. Async Processing: Use queues for handling multiple requests
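
As a sketch of tip 2, a simple file-based cache keyed on a hash of the text and generation options; the cache directory and eviction policy are application choices:

import ZAI from 'z-ai-web-dev-sdk';
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';

const CACHE_DIR = './tts-cache';

// Sketch: return cached audio when the same text and options were generated before.
async function cachedTTS(text, options = {}) {
  const { voice = 'tongtong', speed = 1.0, format = 'wav' } = options;

  // Key the cache on everything that affects the generated audio
  const key = crypto
    .createHash('sha256')
    .update(JSON.stringify({ text, voice, speed, format }))
    .digest('hex');
  const cachePath = path.join(CACHE_DIR, `${key}.${format}`);

  if (fs.existsSync(cachePath)) {
    return fs.readFileSync(cachePath); // cache hit
  }

  const zai = await ZAI.create();
  const response = await zai.audio.tts.create({
    input: text,
    voice,
    speed,
    response_format: format,
    stream: false
  });
  const buffer = Buffer.from(new Uint8Array(await response.arrayBuffer()));

  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(cachePath, buffer);
  return buffer;
}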

Important Notes

API Constraints

Input Text Length: Maximum 1024 characters per request. For longer text:

// Split long text into chunks
const longText = "..."; // Your long text here
const chunks = splitTextIntoChunks(longText, 1000);

for (const chunk of chunks) {
  const response = await zai.audio.tts.create({
    input: chunk,
    voice: 'tongtong',
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });
  // Process each chunk...
}

Streaming Format Limitation: When using stream: true, only pcm format is supported. For wav or mp3 output, use stream: false.

Sample Rate: Audio is generated at 24000 Hz sample rate (recommended setting for playback).

Response Object Format

The zai.audio.tts.create() method returns a standard Response object (not a custom object with an audio property). Always use:

// ✅ CORRECT
const response = await zai.audio.tts.create({ ... });
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));

// ❌ WRONG - This will not work
const response = await zai.audio.tts.create({ ... });
const buffer = Buffer.from(response.audio); // response.audio is undefined

Available Voices

  • tongtong - warm and friendly
  • chuichui - lively and cute
  • xiaochen - calm and professional
  • jam - British gentleman
  • kazi - clear and standard
  • douji - natural and fluent
  • luodo - expressive and engaging

Speed Range

  • Minimum: 0.5 (half speed)
  • Default: 1.0 (normal speed)
  • Maximum: 2.0 (double speed)

Important: Speed values outside the range [0.5, 2.0] will result in API errors.

Volume Range

  • Minimum: Greater than 0 (exclusive)
  • Default: 1.0 (normal volume)
  • Maximum: 10 (inclusive)

Note: Volume parameter is optional. When not specified, defaults to 1.0.
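
A small helper that checks these documented constraints before calling the API (the function name is illustrative):

// Sketch: validate TTS options against the documented API constraints.
function validateTTSOptions({ text, speed = 1.0, volume = 1.0 }) {
  if (!text || text.trim().length === 0) {
    throw new Error('Text input cannot be empty');
  }
  if (text.length > 1024) {
    throw new Error('Text exceeds the 1024-character limit - split it into chunks');
  }
  if (speed < 0.5 || speed > 2.0) {
    throw new Error('Speed must be between 0.5 and 2.0');
  }
  if (volume <= 0 || volume > 10) {
    throw new Error('Volume must be greater than 0 and at most 10');
  }
}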

Remember

  • Always use z-ai-web-dev-sdk in backend code only
  • Input text is limited to 1024 characters maximum - split longer text into chunks
  • Speed must be between 0.5 and 2.0 - values outside this range will cause errors
  • Volume must be greater than 0 and up to 10 - optional parameter with default 1.0
  • Streaming only supports PCM format - use non-streaming for WAV or MP3 output
  • The SDK returns a standard Response object - use await response.arrayBuffer()
  • Convert ArrayBuffer to Buffer using Buffer.from(new Uint8Array(arrayBuffer))
  • Handle audio buffers properly when saving to files
  • Implement error handling for production applications
  • Consider caching for frequently generated content
  • Clean up old audio files periodically to manage storage