tts

ElevenLabs text-to-speech service implementations.

This module provides WebSocket-based and HTTP-based TTS services that use the ElevenLabs API, with support for streaming audio, word timestamps, and voice customization.

pipecat.services.elevenlabs.tts.language_to_elevenlabs_language(language: Language) → str | None[source]

Convert a Language enum value to an ElevenLabs language code.

Parameters: language – The Language enum value to convert.
Returns: The corresponding ElevenLabs language code, or None if not supported.

pipecat.services.elevenlabs.tts.output_format_from_sample_rate(sample_rate: int) → str[source]

Get the appropriate output format string for a given sample rate.

Parameters: sample_rate – The audio sample rate in Hz.
Returns: The ElevenLabs output format string.
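
As a rough illustration of this kind of mapping, the sketch below assumes ElevenLabs PCM format names of the form pcm_<rate> and falls back to pcm_24000 for rates it does not list. The set of supported rates, the fallback value, and the name sketch_output_format are all assumptions made for illustration; this is not the library function.

```python
# Hypothetical stand-in for output_format_from_sample_rate. The supported
# rates and the fallback value are assumptions, not taken from the library.
ASSUMED_PCM_RATES = (8000, 16000, 22050, 24000, 44100)
ASSUMED_FALLBACK = "pcm_24000"


def sketch_output_format(sample_rate: int) -> str:
    """Return a PCM output format string for a sample rate given in Hz."""
    if sample_rate in ASSUMED_PCM_RATES:
        return f"pcm_{sample_rate}"
    # Unlisted rates fall back to a fixed format (assumed behavior).
    return ASSUMED_FALLBACK
```

A caller would pass the pipeline's sample rate, for example sketch_output_format(16000), and forward the resulting string to the service.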

pipecat.services.elevenlabs.tts.build_elevenlabs_voice_settings(settings: dict[str, Any] | TTSSettings) → dict[str, float | bool] | None[source]

Build voice settings dictionary for ElevenLabs based on provided settings.

Parameters: settings – Dictionary or TTSSettings object containing voice settings parameters.
Returns: Dictionary of voice settings, or None if no valid settings are provided.
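
The filtering described above might look roughly like the following sketch. It handles only the dictionary form, takes its key names from VOICE_SETTINGS_FIELDS as documented in this module, and treats None as "not set"; that last point and the name sketch_voice_settings are assumptions, and the real function may treat unset values differently.

```python
from typing import Any, Optional

# Key names as listed in VOICE_SETTINGS_FIELDS in this module's documentation.
VOICE_KEYS = ("stability", "similarity_boost", "style", "use_speaker_boost", "speed")


def sketch_voice_settings(settings: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Keep only voice-related keys that have a value; None if nothing remains."""
    voice = {key: settings[key] for key in VOICE_KEYS if settings.get(key) is not None}
    # An empty dict means no valid voice settings were provided.
    return voice or None
```

Unrelated keys such as voice or model are ignored, so the same settings mapping can be passed in unchanged.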

class pipecat.services.elevenlabs.tts.PronunciationDictionaryLocator(*, pronunciation_dictionary_id: str, version_id: str)[source]

Bases: BaseModel

Locator for a pronunciation dictionary.

Parameters:

pronunciation_dictionary_id – The ID of the pronunciation dictionary.
version_id – The version ID of the pronunciation dictionary.

pronunciation_dictionary_id: str

version_id: str

class pipecat.services.elevenlabs.tts.ElevenLabsTTSSettings

Bases: TTSSettings

Settings for ElevenLabsTTSService.

Fields that appear in the WebSocket URL (voice, model, language) require a full reconnect when changed. Fields that affect the voice character (stability, similarity_boost, style, use_speaker_boost, speed) can be applied by closing the current audio context so a new one is opened with updated settings.

Parameters:

stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.7 to 1.2).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).

stability: float | None | _NotGiven

similarity_boost: float | None | _NotGiven

style: float | None | _NotGiven

use_speaker_boost: bool | None | _NotGiven

speed: float | None | _NotGiven

apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven

URL_FIELDS: ClassVar[frozenset[str]] = frozenset({'language', 'model', 'voice'})

Fields in the WebSocket URL. Changing any of these requires a reconnect.

VOICE_SETTINGS_FIELDS: ClassVar[frozenset[str]] = frozenset({'similarity_boost', 'speed', 'stability', 'style', 'use_speaker_boost'})

Fields affecting voice character. Changing any of these requires closing the current audio context so that the next one picks up the new settings.
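
To make the distinction between the two field sets concrete, here is a minimal sketch that classifies a settings update by the names of the fields that changed. The function name sketch_required_action and the rule that a reconnect takes priority when both kinds of field change are assumptions for illustration, not the service's actual update logic.

```python
# Field sets copied from the class attributes documented above.
URL_FIELDS = frozenset({"language", "model", "voice"})
VOICE_SETTINGS_FIELDS = frozenset(
    {"similarity_boost", "speed", "stability", "style", "use_speaker_boost"}
)


def sketch_required_action(changed: set[str]) -> str:
    """Classify an update as 'reconnect', 'close_context', or 'none'."""
    if changed & URL_FIELDS:
        # Assumed priority: a reconnect also starts a fresh audio context.
        return "reconnect"
    if changed & VOICE_SETTINGS_FIELDS:
        return "close_context"
    return "none"
```

For example, changing only speed would yield 'close_context', while changing voice and speed together would yield 'reconnect'.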

class pipecat.services.elevenlabs.tts.ElevenLabsHttpTTSSettings

Bases: TTSSettings

Settings for ElevenLabsHttpTTSService.

Parameters:

optimize_streaming_latency – Latency optimization level (0-4).
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.25 to 4.0).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).

optimize_streaming_latency: int | None | _NotGiven

stability: float | None | _NotGiven

similarity_boost: float | None | _NotGiven

style: float | None | _NotGiven

use_speaker_boost: bool | None | _NotGiven

speed: float | None | _NotGiven

apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven

pipecat.services.elevenlabs.tts.calculate_word_times(alignment_info: Mapping[str, Any], cumulative_time: float, partial_word: str = '', partial_word_start_time: float = 0.0) → tuple[list[tuple[str, float]], str, float][source]

Calculate word timestamps from character alignment information.

Parameters:

alignment_info – Character alignment data from ElevenLabs API.
cumulative_time – Base time offset for this chunk.
partial_word – Partial word carried over from previous chunk.
partial_word_start_time – Start time of the partial word.

Returns:

word_times: List of (word, timestamp) tuples for complete words
new_partial_word: Incomplete word at end of chunk (empty if chunk ends with space)
new_partial_word_start_time: Start time of the incomplete word

Return type:

Tuple of (word_times, new_partial_word, new_partial_word_start_time)
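
The carry-over of a partial word between chunks can be sketched as follows. This is not the library implementation: sketch_word_times is a hypothetical name, and the key names characters and character_start_times_seconds are borrowed from the HTTP example elsewhere in this module, so the WebSocket payload may well use different keys and units.

```python
from typing import Any, Mapping


def sketch_word_times(
    alignment_info: Mapping[str, Any],
    cumulative_time: float,
    partial_word: str = "",
    partial_word_start_time: float = 0.0,
) -> tuple[list[tuple[str, float]], str, float]:
    """Split character alignment into words, carrying an unfinished word over."""
    chars = alignment_info["characters"]  # assumed key name
    starts = alignment_info["character_start_times_seconds"]  # assumed key name

    word_times: list[tuple[str, float]] = []
    current = partial_word
    current_start = partial_word_start_time

    for char, start in zip(chars, starts):
        if char == " ":
            # A space closes the word in progress, if there is one.
            if current:
                word_times.append((current, current_start))
                current = ""
        else:
            if not current:
                # First character of a new word: offset by the chunk's base time.
                current_start = cumulative_time + start
            current += char

    # Whatever is left has not been terminated by a space yet.
    return word_times, current, current_start
```

Feeding the returned partial word and its start time into the next call lets a word that is split across two chunks be reported once, with the timestamp of its first character.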

class pipecat.services.elevenlabs.tts.ElevenLabsTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.elevenlabs.io', sample_rate: int | None = None, auto_mode: bool | None = None, enable_ssml_parsing: bool | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Bases: WebsocketTTSService

ElevenLabs WebSocket-based TTS service with word timestamps.

Provides real-time text-to-speech using ElevenLabs’ WebSocket streaming API. Supports word-level timestamps, audio context management, and various voice customization options including stability, similarity boost, and speed controls.

Settings: alias of ElevenLabsTTSSettings

class InputParams

Bases: BaseModel

Input parameters for ElevenLabs TTS configuration.

Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(...) instead.

Parameters:

language – Language to use for synthesis.
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.7 to 1.2).
auto_mode – Whether to enable automatic mode optimization.
enable_ssml_parsing – Whether to parse SSML tags in text.
enable_logging – Whether to enable ElevenLabs logging.
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.

language: Language | None

stability: float | None

similarity_boost: float | None

style: float | None

use_speaker_boost: bool | None

speed: float | None

auto_mode: bool | None

enable_ssml_parsing: bool | None

enable_logging: bool | None

apply_text_normalization: Literal['auto', 'on', 'off'] | None

pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None

__init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.elevenlabs.io', sample_rate: int | None = None, auto_mode: bool | None = None, enable_ssml_parsing: bool | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Initialize the ElevenLabs TTS service.

Parameters:

api_key – ElevenLabs API key for authentication.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(voice=...) instead.
model –
TTS model to use (e.g., “eleven_turbo_v2_5”).

Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(model=...) instead.
url – WebSocket URL for ElevenLabs TTS API.
sample_rate – Audio sample rate. If None, uses default.
auto_mode – Whether to enable ElevenLabs’ auto mode, which reduces latency by disabling server-side chunk scheduling and buffering. Recommended when sending complete sentences or phrases. When None (default), auto mode is enabled for SENTENCE aggregation and disabled for TOKEN aggregation, because token streaming relies on the server-side chunk scheduler to accumulate enough text for natural-sounding synthesis.
enable_ssml_parsing – Whether to parse SSML tags in text.
enable_logging – Whether to enable ElevenLabs server-side logging.
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
params –
Additional input parameters for voice customization.

Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences –
Whether to aggregate sentences within the TTSService.

Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
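
The default described for auto_mode can be sketched as a small resolution step. AggregationMode below is a hypothetical stand-in for TextAggregationMode, and sketch_resolve_auto_mode is an illustrative name; only the SENTENCE and TOKEN behavior comes from the parameter description above.

```python
from enum import Enum
from typing import Optional


class AggregationMode(Enum):
    """Hypothetical stand-in for TextAggregationMode."""

    SENTENCE = "sentence"
    TOKEN = "token"


def sketch_resolve_auto_mode(auto_mode: Optional[bool], mode: AggregationMode) -> bool:
    """Return the explicit value if given, else derive it from the aggregation mode."""
    if auto_mode is not None:
        return auto_mode
    # Default: enabled for sentence aggregation, disabled for token aggregation.
    return mode is AggregationMode.SENTENCE
```

An explicit True or False always wins, so callers who stream tokens can still opt in to auto mode if they accept the trade-off.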

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as the ElevenLabs service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to ElevenLabs language format.

Parameters: language – The language to convert.
Returns: The ElevenLabs-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the ElevenLabs TTS service.

Parameters: frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the ElevenLabs TTS service.

Parameters: frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the ElevenLabs TTS service.

Parameters: frame – The cancel frame.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio and finalize the current context.

Parameters: context_id – The specific context to flush. If None, falls back to the currently active context.

async on_audio_context_interrupted(context_id: str)[source]

Close the ElevenLabs context when the bot is interrupted.

async on_audio_context_completed(context_id: str)[source]

Close the ElevenLabs context after all audio has been played.

ElevenLabs does not send a server-side signal when a context is exhausted, so Pipecat must explicitly close it with close_context: True to free server-side resources.
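
A close message of the kind described above might be built as in the sketch below. The close_context flag comes from the text; the context_id key and the helper name sketch_close_context_message are assumptions about the message shape, and nothing is sent over a socket here.

```python
import json


def sketch_close_context_message(context_id: str) -> str:
    """Serialize a message asking the server to close one audio context."""
    # "context_id" is an assumed key name; "close_context" is from the docs above.
    return json.dumps({"context_id": context_id, "close_context": True})
```

The resulting string would be sent on the open WebSocket once the last audio for that context has been played.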

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using ElevenLabs’ streaming WebSocket API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.

class pipecat.services.elevenlabs.tts.ElevenLabsHttpTTSService(*, api_key: str, voice_id: str | None = None, aiohttp_session: ClientSession, model: str | None = None, base_url: str = 'https://api.elevenlabs.io', sample_rate: int | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsHttpTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Bases: TTSService

ElevenLabs HTTP-based TTS service with word timestamps.

Provides text-to-speech using ElevenLabs’ HTTP streaming API for simpler, non-WebSocket integration. Suitable for use cases where a streaming WebSocket connection is not required or desired.

Settings: alias of ElevenLabsHttpTTSSettings

class InputParams(*, language: Language | None = None, optimize_streaming_latency: int | None = None, stability: float | None = None, similarity_boost: float | None = None, style: float | None = None, use_speaker_boost: bool | None = None, speed: float | None = None, apply_text_normalization: Literal['auto', 'on', 'off'] | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None)[source]

Bases: BaseModel

Input parameters for ElevenLabs HTTP TTS configuration.

Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(...) instead.

Parameters:

language – Language to use for synthesis.
optimize_streaming_latency – Latency optimization level (0-4).
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.25 to 4.0).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.

language: Language | None

optimize_streaming_latency: int | None

stability: float | None

similarity_boost: float | None

style: float | None

use_speaker_boost: bool | None

speed: float | None

apply_text_normalization: Literal['auto', 'on', 'off'] | None

pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None

__init__(*, api_key: str, voice_id: str | None = None, aiohttp_session: ClientSession, model: str | None = None, base_url: str = 'https://api.elevenlabs.io', sample_rate: int | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsHttpTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Initialize the ElevenLabs HTTP TTS service.

Parameters:

api_key – ElevenLabs API key for authentication.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(voice=...) instead.
aiohttp_session – aiohttp ClientSession for HTTP requests.
model –
TTS model to use (e.g., “eleven_turbo_v2_5”).

Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(model=...) instead.
base_url – Base URL for ElevenLabs HTTP API.
sample_rate – Audio sample rate. If None, uses default.
enable_logging – Whether to enable ElevenLabs server-side logging. Set to False for zero retention mode (enterprise only).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
params –
Additional input parameters for voice customization.

Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences –
Whether to aggregate sentences within the TTSService.

Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
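
The precedence rule for settings over deprecated parameters can be illustrated with plain dictionaries. This is only a sketch: the dictionaries stand in for the deprecated constructor arguments and the Settings object, sketch_merge_settings is a hypothetical helper, and treating None as "not provided" is an assumption.

```python
from typing import Any, Optional


def sketch_merge_settings(
    deprecated: dict[str, Any], settings: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Merge both sources; values from settings win over deprecated ones."""
    merged = {key: value for key, value in deprecated.items() if value is not None}
    if settings:
        # Applied last, so settings values override deprecated parameters.
        merged.update({key: value for key, value in settings.items() if value is not None})
    return merged
```

With a deprecated voice_id mapped to voice and a Settings value for the same field, the Settings value is the one that ends up in the merged result.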

language_to_service_language(language: Language) → str | None[source]

Convert a Pipecat Language enum value to an ElevenLabs language code.

Parameters: language – The language to convert.
Returns: The ElevenLabs-specific language code, or None if not supported.

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as the ElevenLabs HTTP service supports metrics generation.

async start(frame: StartFrame)[source]

Start the ElevenLabs HTTP TTS service.

Parameters: frame – The start frame containing initialization parameters.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame and handle state changes.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

calculate_word_times(alignment_info: Mapping[str, Any]) → list[tuple[str, float]][source]

Calculate word timing from character alignment data.

This method handles partial words that may span across multiple alignment chunks.

Parameters: alignment_info – Character timing data from ElevenLabs.
Returns: List of (word, timestamp) pairs for complete words in this chunk.

Example input data:

{
    "characters": [" ", "H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"],
    "character_start_times_seconds": [0.0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "character_end_times_seconds": [0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}

Would produce word times (with cumulative_time=0):

[("Hello", 0.1), ("world", 0.5)]

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using ElevenLabs streaming API with timestamps.

Makes a request to the ElevenLabs API to generate audio and timing data. Tracks the duration of each utterance to ensure correct sequencing. Includes previous text as context for better prosody continuity.

Parameters:

text – Text to convert to speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio and control frames containing the synthesized speech.
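
As a sketch of how previous text might be passed along for prosody continuity, the helper below assembles a request URL and JSON payload without performing any network call. The endpoint path, the output_format query parameter, and the model_id and previous_text payload keys are assumptions about the ElevenLabs HTTP API, and sketch_tts_request is a hypothetical helper rather than part of this service.

```python
from typing import Any, Optional


def sketch_tts_request(
    base_url: str,
    voice_id: str,
    text: str,
    model: str,
    previous_text: Optional[str] = None,
    output_format: str = "pcm_24000",
) -> tuple[str, dict[str, Any]]:
    """Build the URL and JSON payload for a streaming request with timestamps."""
    # Assumed endpoint path for streaming audio together with alignment data.
    url = (
        f"{base_url}/v1/text-to-speech/{voice_id}/stream/with-timestamps"
        f"?output_format={output_format}"
    )
    payload: dict[str, Any] = {"text": text, "model_id": model}
    if previous_text:
        # Earlier text is sent only as context; it is not synthesized again.
        payload["previous_text"] = previous_text
    return url, payload
```

The first utterance of a turn would omit previous_text, and later utterances would pass the text already spoken so that intonation carries over between requests.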