tts

OpenAI text-to-speech service implementation.

This module provides integration with OpenAI’s text-to-speech API for generating high-quality synthetic speech from text input.

class pipecat.services.openai.tts.OpenAITTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, instructions: str | None | _NotGiven = <factory>, speed: float | None | _NotGiven = <factory>)

Bases: TTSSettings

Settings for OpenAITTSService.

Parameters:
  • instructions – Instructions to guide voice synthesis behavior.

  • speed – Voice speed control (0.25 to 4.0, default 1.0).

instructions: str | None | _NotGiven
speed: float | None | _NotGiven
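
The documented speed range can be checked before a value is handed to the settings object. The following is a minimal standalone sketch, not part of pipecat: `validate_speed`, `MIN_SPEED`, `MAX_SPEED`, and `DEFAULT_SPEED` are hypothetical names, and whether the library performs its own range validation is not stated here.

```python
from __future__ import annotations

# Hypothetical helper: not part of pipecat. The bounds come from the
# documented range for `speed` (0.25 to 4.0, default 1.0).
MIN_SPEED = 0.25
MAX_SPEED = 4.0
DEFAULT_SPEED = 1.0


def validate_speed(speed: float | None) -> float:
    """Return a usable speed, falling back to the default when None."""
    if speed is None:
        return DEFAULT_SPEED
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(
            f"speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}"
        )
    return float(speed)
```

A value that passes this check could then be supplied as `speed=` when building the settings.
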
class pipecat.services.openai.tts.OpenAITTSService(*, api_key: str | None = None, base_url: str | None = None, voice: str | None = None, model: str | None = None, sample_rate: int | None = None, instructions: str | None = None, speed: float | None = None, params: InputParams | None = None, settings: OpenAITTSSettings | None = None, **kwargs)

Bases: TTSService

OpenAI Text-to-Speech service that generates audio from text.

This service uses the OpenAI TTS API to generate PCM-encoded audio at 24kHz. It supports multiple voice models and configurable synthesis parameters, and it streams the generated audio as output.
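
Because the output is raw PCM at a fixed rate, the duration of an audio chunk follows directly from its byte length. The sketch below assumes 16-bit mono samples; the sample width and channel count are assumptions that this page does not state, and `pcm_duration_seconds` is a hypothetical helper.

```python
OPENAI_SAMPLE_RATE = 24000  # Hz, same value as OpenAITTSService.OPENAI_SAMPLE_RATE


def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int = OPENAI_SAMPLE_RATE,
    sample_width: int = 2,  # bytes per sample; 16-bit audio is an assumption
    channels: int = 1,  # mono is an assumption
) -> float:
    """Return the playback duration of a raw PCM buffer of `num_bytes` bytes."""
    bytes_per_second = sample_rate * sample_width * channels
    return num_bytes / bytes_per_second
```

Under these assumptions one second of audio occupies 48000 bytes, which can be useful when sizing buffers for streamed output.
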

Settings

alias of OpenAITTSSettings

OPENAI_SAMPLE_RATE = 24000
class InputParams(*, instructions: str | None = None, speed: float | None = None)

Bases: BaseModel

Input parameters for OpenAI TTS configuration.

Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(...) instead.

Parameters:
  • instructions – Instructions to guide voice synthesis behavior.

  • speed – Voice speed control (0.25 to 4.0, default 1.0).

instructions: str | None
speed: float | None
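
Moving off the deprecated InputParams amounts to passing the same two fields through Settings. The following is a sketch that assumes pipecat is installed and that an API key is available in the `OPENAI_API_KEY` environment variable; `legacy_to_settings_kwargs` and `build_service` are hypothetical helper names.

```python
from __future__ import annotations

import os
from typing import Any


def legacy_to_settings_kwargs(
    instructions: str | None = None, speed: float | None = None
) -> dict[str, Any]:
    """Collect the former InputParams fields, dropping any left as None."""
    fields = {"instructions": instructions, "speed": speed}
    return {name: value for name, value in fields.items() if value is not None}


def build_service(**settings_kwargs: Any):
    """Create the service through the non-deprecated `settings=` argument."""
    # Imported lazily so the helper above can be used without pipecat installed.
    from pipecat.services.openai.tts import OpenAITTSService

    return OpenAITTSService(
        api_key=os.environ.get("OPENAI_API_KEY"),
        settings=OpenAITTSService.Settings(**settings_kwargs),
    )
```

For example, `build_service(**legacy_to_settings_kwargs(instructions="Speak calmly.", speed=1.1))` would replace a call that previously passed `params=InputParams(...)`; the instruction text is only a placeholder.
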
__init__(*, api_key: str | None = None, base_url: str | None = None, voice: str | None = None, model: str | None = None, sample_rate: int | None = None, instructions: str | None = None, speed: float | None = None, params: InputParams | None = None, settings: OpenAITTSSettings | None = None, **kwargs)

Initialize OpenAI TTS service.

Parameters:
  • api_key – OpenAI API key for authentication. If None, the key is read from the OPENAI_API_KEY environment variable.

  • base_url – Custom base URL for the OpenAI API. If None, uses the default endpoint.

  • voice

    Voice ID to use for synthesis. Defaults to “alloy”.

    Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(voice=...) instead.

  • model

    TTS model to use. Defaults to “gpt-4o-mini-tts”.

    Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(model=...) instead.

  • sample_rate – Output audio sample rate in Hz. If None, uses OpenAI’s default 24kHz.

  • instructions

    Optional instructions to guide voice synthesis behavior.

    Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(instructions=...) instead.

  • speed

    Voice speed control (0.25 to 4.0, default 1.0).

    Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(speed=...) instead.

  • params

    Optional synthesis controls (acting instructions, speed, …).

    Deprecated since version 0.0.105: Use settings=OpenAITTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional keyword arguments passed to TTSService.
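
The precedence rule for settings can be modeled without the library. The sketch below uses stand-in types only: `NOT_GIVEN`, `SketchSettings`, and `resolve` are hypothetical, and pipecat's actual merge logic may differ. It illustrates just the stated behavior, where a value set in settings wins over the matching deprecated argument and the deprecated argument is used when the setting is left unset.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Any


class _NotGivenType:
    """Stand-in sentinel meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


@dataclass
class SketchSettings:
    """Hypothetical stand-in for a settings object with unset-by-default fields."""

    voice: Any = NOT_GIVEN
    model: Any = NOT_GIVEN
    speed: Any = NOT_GIVEN


def resolve(settings: SketchSettings | None, **deprecated: Any) -> dict[str, Any]:
    """Merge settings with deprecated keyword arguments; settings win."""
    resolved: dict[str, Any] = {}
    for field in fields(SketchSettings):
        value = getattr(settings, field.name) if settings is not None else NOT_GIVEN
        if value is NOT_GIVEN:
            # Fall back to the deprecated argument, or None if that is absent too.
            value = deprecated.get(field.name)
        resolved[field.name] = value
    return resolved
```

With this model, `resolve(SketchSettings(voice="nova"), voice="alloy", speed=1.5)` keeps the voice from settings and takes the speed from the deprecated argument.
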

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True, as the OpenAI TTS service supports metrics generation.

async start(frame: StartFrame)

Start the OpenAI TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]

Generate speech from text using OpenAI’s TTS API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech data.
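
Since run_tts is an async generator, audio arrives frame by frame rather than as a single buffer. The sketch below shows that consumption pattern with stand-ins: `FakeAudioFrame`, `fake_run_tts`, and `collect_audio` are hypothetical and only mimic the signature above. It is not a description of how pipecat invokes the method internally.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class FakeAudioFrame:
    """Hypothetical stand-in for an audio frame carrying raw PCM bytes."""

    audio: bytes
    context_id: str


async def fake_run_tts(
    text: str, context_id: str
) -> AsyncGenerator[FakeAudioFrame, None]:
    """Mimic the run_tts signature by yielding one silent chunk per word."""
    for _ in text.split():
        # 240 silent 16-bit samples (480 bytes) per chunk.
        yield FakeAudioFrame(audio=b"\x00\x00" * 240, context_id=context_id)
        await asyncio.sleep(0)  # hand control back to the event loop


async def collect_audio(text: str, context_id: str) -> bytes:
    """Drain the generator and join the audio payloads in arrival order."""
    chunks = []
    async for frame in fake_run_tts(text, context_id):
        chunks.append(frame.audio)
    return b"".join(chunks)
```

Calling `asyncio.run(collect_audio("hello there world", "ctx-1"))` drains the generator and returns the concatenated bytes; a streaming consumer would instead forward each frame as soon as it is yielded.
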