
Deploy a Text-to-Speech Model

This example demonstrates how to build a text-to-speech service with Kokoro, covering CPU-efficient deployment, multi-language support, multiple endpoints that share generation logic, and generator-based audio processing.

Source Code: Find the complete implementation at fal-ai-community/fal_demos

Key Features

  • Multi-Language Support: American English, British English, Japanese with native voices
  • CPU-Efficient Deployment: Lightweight 82M parameter model runs efficiently on CPU
  • Multiple Endpoints: Language-specific endpoints with shared generation logic
  • Voice Variety: Multiple voice options for each supported language
  • Audio Streaming: Generator-based audio processing for memory efficiency
  • Character-Based Billing: Usage-based pricing tied to text length
  • Advanced Validation: Custom error handling with user-friendly messages
  • Audio File Management: Temporary file handling and CDN integration

When to Use CPU Deployment

CPU deployment is ideal when:

  • Models are lightweight (< 100M parameters)
  • Inference is fast enough on CPU
  • Cost optimization is important
  • GPU resources are not required
  • Multiple concurrent requests can share CPU resources efficiently

Project Setup

from typing import Literal
import fal
from fal.exceptions import FieldException
from fal.toolkit import File
from fastapi import Response
from pydantic import BaseModel, Field

Language-Specific Input Models

Define input models for each supported language with appropriate voice options:

class AmEnglishRequest(BaseModel):
    prompt: str = Field(
        default="",
        examples=[
            "The future belongs to those who believe in the beauty of their dreams. So, dream big, work hard, and make it happen!"
        ],
        ui={"important": True},
    )
    text: str = Field(
        default="",
        examples=[
            "The future belongs to those who believe in the beauty of their dreams. So, dream big, work hard, and make it happen!"
        ],
    )
    voice: Literal[
        "af_heart",  # American Female voices
        "af_alloy",
        "af_aoede",
        "af_bella",
        "af_jessica",
        "af_kore",
        "af_nicole",
        "af_nova",
        "af_river",
        "af_sarah",
        "af_sky",
        "am_adam",  # American Male voices
        "am_echo",
        "am_eric",
        "am_fenrir",
        "am_liam",
        "am_michael",
        "am_onyx",
        "am_puck",
        "am_santa",
    ] = Field(
        examples=["af_heart"],
        default="af_heart",
        description="Voice ID for the desired voice.",
    )
    speed: float = Field(
        default=1.0,
        ge=0.1,
        le=5.0,
        description="Speed of the generated audio. Default is 1.0.",
    )

class BrEnglishRequest(BaseModel):
    prompt: str = Field(
        examples=[
            "Ladies and gentlemen, welcome aboard. Please ensure your seatbelt is fastened and your tray table is stowed as we prepare for takeoff."
        ]
    )
    voice: Literal[
        "bf_alice",  # British Female voices
        "bf_emma",
        "bf_isabella",
        "bf_lily",
        "bm_daniel",  # British Male voices
        "bm_fable",
        "bm_george",
        "bm_lewis",
    ] = Field(
        examples=["bf_alice"],
        description="Voice ID for the desired voice.",
    )
    speed: float = Field(
        default=1.0,
        ge=0.1,
        le=5.0,
        description="Speed of the generated audio. Default is 1.0.",
    )

class JapaneseRequest(BaseModel):
    prompt: str = Field(
        examples=["夢を追いかけることを恐れないでください。努力すれば、必ず道は開けます!"]
    )
    voice: Literal[
        "jf_alpha",  # Japanese Female voices
        "jf_gongitsune",
        "jf_nezumi",
        "jf_tebukuro",
        "jm_kumo",  # Japanese Male voices
    ] = Field(
        examples=["jf_alpha"],
        description="Voice ID for the desired voice.",
    )
    speed: float = Field(
        default=1.0,
        ge=0.1,
        le=5.0,
        description="Speed of the generated audio. Default is 1.0.",
    )

Language-Specific Output Models

class AmEngOutput(BaseModel):
    audio: File = Field(
        description="The generated audio",
        examples=[
            File._from_url(
                "https://fal.media/files/elephant/dXVMqWsBDG9yan3kaOT0Z_tmp0vvkha3s.wav"
            )
        ],
    )

class BrEngOutput(BaseModel):
    audio: File = Field(
        description="The generated audio",
        examples=[
            File._from_url(
                "https://fal.media/files/kangaroo/4wpA60Kum6UjOVBKJoNyL_tmpxfrkn95k.wav"
            )
        ],
    )

class JapaneseOutput(BaseModel):
    audio: File = Field(
        description="The generated audio",
        examples=[
            File._from_url(
                "https://fal.media/files/lion/piLhqKO8LJxrWaNg2dVUv_tmpp6eff6zl.wav"
            )
        ],
    )

Application Configuration for CPU Deployment

class Kokoro(
    fal.App,
    min_concurrency=0,
    max_concurrency=1,
    keep_alive=3000,  # Longer keep-alive for TTS services
    name="kokoro",
):
    requirements = [
        "kokoro==0.8.4",
        "soundfile==0.13.1",
        "misaki[en]==0.8.4",  # English language support
        "misaki[ja]==0.8.4",  # Japanese language support
        "misaki[zh]==0.8.4",  # Chinese language support
        "numpy==1.26.4",
        # Spacy model for English NLP
        "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl",
    ]
    machine_type = "L"  # CPU machine - efficient for lightweight models

    async def setup(self):
        from kokoro import KPipeline

        # Initialize pipelines for each supported language
        self.pipelines = {}
        self.pipelines["American English"] = KPipeline(lang_code="a")
        self.pipelines["British English"] = KPipeline(lang_code="b")
        self.pipelines["Japanese"] = KPipeline(lang_code="j")

Shared Generation Logic

Create a reusable generation method that handles all languages:

async def _generate(
    self,
    request: AmEnglishRequest,
    response: Response,
    language: str = "American English",
):
    # Handle both 'prompt' and 'text' fields for backwards compatibility.
    # Only the American English model defines 'text', so fall back safely.
    prompt = request.prompt or getattr(request, "text", "")
    # Custom validation with user-friendly error messages
    if len(prompt) >= 20000:
        raise FieldException(
            field="prompt",
            message="Prompt must be less than 20000 characters.",
        )

    import tempfile

    import numpy as np
    import soundfile as sf

    # Get the appropriate pipeline for the language
    pipeline = self.pipelines[language]
    # Generate audio using streaming approach
    generator = pipeline(
        prompt,
        voice=request.voice,
        speed=request.speed,
        split_pattern=r"\n+",  # Split on line breaks for better pacing
    )
    # Process audio chunks (graphemes, phonemes, audio tensor) and concatenate
    for i, (gs, ps, audio) in enumerate(generator):
        if i == 0:
            final_audio = audio.detach().cpu().numpy()
        else:
            audio = audio.detach().cpu().numpy()
            final_audio = np.concatenate((final_audio, audio), axis=0)
    # Character-based billing calculation
    response.headers["x-fal-billable-units"] = str(max(1, len(prompt) // 1000))
    # Save audio to a temporary file and upload to the CDN
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        sf.write(f.name, final_audio, 24000)  # 24kHz sample rate
        return AmEngOutput(
            audio=File.from_path(
                f.name,
                content_type="audio/wav",
                repository="cdn",  # Upload to CDN for fast access
            )
        )

Multiple Endpoint Definitions

Define language-specific endpoints using the shared generation logic:

@fal.endpoint("/")
async def generate(
self, request: AmEnglishRequest, response: Response
) -> AmEngOutput:
return await self._generate(request, response, language="American English")
@fal.endpoint("/american-english")
async def generate_am_english(
self, request: AmEnglishRequest, response: Response
) -> AmEngOutput:
return await self._generate(request, response, language="American English")
@fal.endpoint("/british-english")
async def generate_br_english(
self, request: BrEnglishRequest, response: Response
) -> BrEngOutput:
return await self._generate(request, response, language="British English")
@fal.endpoint("/japanese")
async def generate_japanese(
self, request: JapaneseRequest, response: Response
) -> JapaneseOutput:
return await self._generate(request, response, language="Japanese")

Key Concepts and Best Practices

CPU-Efficient Deployment

Why CPU for TTS:

  • Kokoro is only 82M parameters - runs efficiently on CPU
  • Lower cost compared to GPU instances
  • Sufficient performance for real-time TTS
  • Better resource utilization for multiple concurrent requests

Audio Streaming and Memory Management

Generator-based processing:

# Stream audio generation to handle long texts efficiently
generator = pipeline(prompt, voice=request.voice, speed=request.speed)

# Process chunks incrementally
for i, (gs, ps, audio) in enumerate(generator):
    if i == 0:
        final_audio = audio.detach().cpu().numpy()
    else:
        audio = audio.detach().cpu().numpy()
        final_audio = np.concatenate((final_audio, audio), axis=0)
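When a long text splits into many chunks, an equivalent variant (a sketch, not from the source) collects the chunks in a list and concatenates once at the end, avoiding repeated reallocation:

chunks = [audio.detach().cpu().numpy() for _, _, audio in generator]
final_audio = np.concatenate(chunks, axis=0)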

Character-Based Billing

# Scale billing with text length (per 1000 characters)
response.headers["x-fal-billable-units"] = str(max(1, len(prompt) // 1000))
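For example:

max(1, 500 // 1000)  # -> 1: short prompts still bill a minimum of one unit
max(1, 2500 // 1000)  # -> 2: billing scales per 1000 characters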

Audio File Handling

# Use temporary files for audio processing
with tempfile.NamedTemporaryFile(suffix=".wav") as f:
    sf.write(f.name, final_audio, 24000)  # Save with proper sample rate
    return Output(
        audio=File.from_path(
            f.name,
            content_type="audio/wav",
            repository="cdn",  # Auto-upload to CDN
        )
    )
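Note that NamedTemporaryFile deletes the file when the with block exits, so the File.from_path upload must happen inside the block, as shown above.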

Multi-Language Architecture

Pipeline initialization:

self.pipelines = {
    "American English": KPipeline(lang_code="a"),
    "British English": KPipeline(lang_code="b"),
    "Japanese": KPipeline(lang_code="j"),
}

Language-specific voice options:

# American English voices
voice: Literal[
    "af_heart", "af_alloy", "af_aoede",  # Female
    "am_adam", "am_echo", "am_eric",  # Male
]

# British English voices
voice: Literal[
    "bf_alice", "bf_emma", "bf_lily",  # Female
    "bm_daniel", "bm_george", "bm_lewis",  # Male
]

Advanced Features

Custom Validation

if len(prompt) >= 20000:
    raise FieldException(
        field="prompt",
        message="Prompt must be less than 20000 characters.",
    )
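A caller can mirror the same limit before submitting to fail fast on the client (a hypothetical helper, not part of the service):

MAX_PROMPT_CHARS = 20_000  # mirrors the server-side limit above

def check_prompt(prompt: str) -> None:
    # Hypothetical client-side pre-check; the server remains the authority
    if len(prompt) >= MAX_PROMPT_CHARS:
        raise ValueError("Prompt must be less than 20000 characters.")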

Backwards Compatibility

# Support both 'prompt' and 'text' field names; only the American
# English model defines 'text', so fall back safely
prompt = request.prompt or getattr(request, "text", "")

Flexible Text Processing

generator = pipeline(
    prompt,
    voice=request.voice,
    speed=request.speed,
    split_pattern=r"\n+",  # Split on line breaks for natural pacing
)
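The split pattern is a plain regular expression applied to the input text, and the generator yields one audio chunk per resulting segment. For example:

import re

text = "First paragraph.\n\nSecond paragraph."
re.split(r"\n+", text)  # -> ['First paragraph.', 'Second paragraph.']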

Deployment and Usage

Running the Service

# Development
fal run fal_demos/tts/kokoro.py::Kokoro
# Production deployment
fal deploy kokoro

Making Requests

American English:

import fal_client

result = await fal_client.submit_async(
    "your-username/kokoro/american-english",
    arguments={
        "prompt": "Hello, this is a test of American English text-to-speech!",
        "voice": "af_heart",
        "speed": 1.2,
    },
)

British English:

result = await fal_client.submit_async(
    "your-username/kokoro/british-english",
    arguments={
        "prompt": "Cheerio! This is British English text-to-speech.",
        "voice": "bf_alice",
        "speed": 1.0,
    },
)

Japanese:

result = await fal_client.submit_async(
    "your-username/kokoro/japanese",
    arguments={
        "prompt": "こんにちは、これは日本語の音声合成です。",
        "voice": "jf_alpha",
        "speed": 0.9,
    },
)
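Once a request completes, the payload mirrors the output model, so the generated file's location can be read from the result (assuming the toolkit's default File serialization, which exposes a url field):

audio_url = result["audio"]["url"]  # CDN URL of the generated .wav file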

Use Cases

  • Content Creation: Generate voiceovers for videos and podcasts
  • Accessibility: Convert text content to audio for visually impaired users
  • E-Learning: Create educational content with natural-sounding narration
  • Customer Service: Generate dynamic audio responses for chatbots
  • Multilingual Applications: Support global audiences with native-sounding voices
  • Book Reading: Convert written content to audiobooks

Performance Optimizations

Memory Efficiency

# Stream processing prevents memory buildup for long texts
for i, (gs, ps, audio) in enumerate(generator):
    ...  # process each chunk incrementally rather than loading everything at once

Cost Optimization

machine_type = "L" # CPU is sufficient and cost-effective
keep_alive=3000 # Longer keep-alive reduces cold starts

Key Takeaways

  • CPU deployment is ideal for lightweight models like Kokoro (82M parameters)
  • Multi-language support requires separate pipelines and voice models
  • Character-based billing aligns costs with resource usage
  • Audio streaming handles long texts efficiently without memory issues
  • Temporary file handling with CDN upload provides fast, reliable audio delivery
  • Multiple endpoints with shared logic offer flexibility while maintaining DRY principles
  • Custom validation provides better user experience with clear error messages

This pattern is perfect for building production-ready TTS services that need to support multiple languages and voices while maintaining cost efficiency and high performance through CPU-optimized deployment.