tts
ElevenLabs text-to-speech service implementations.
This module provides WebSocket- and HTTP-based TTS services using the ElevenLabs API, with support for streaming audio, word timestamps, and voice customization.
- pipecat.services.elevenlabs.tts.language_to_elevenlabs_language(language: Language) -> str | None
Convert a Language enum to ElevenLabs language code.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding ElevenLabs language code, or None if not supported.
- pipecat.services.elevenlabs.tts.output_format_from_sample_rate(sample_rate: int) -> str
Get the appropriate output format string for a given sample rate.
- Parameters:
sample_rate – The audio sample rate in Hz.
- Returns:
The ElevenLabs output format string.
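As a rough illustration of the mapping this function performs, the sketch below turns a sample rate into a PCM format string. The function name `sketch_output_format`, the set of supported rates, and the fallback value are assumptions made for this example; they are not taken from the library's source.

```python
# Hypothetical stand-in, NOT the library's implementation.
# Assumption: format strings follow a "pcm_<rate>" naming pattern.
_PCM_FORMATS = {
    8000: "pcm_8000",
    16000: "pcm_16000",
    22050: "pcm_22050",
    24000: "pcm_24000",
    44100: "pcm_44100",
}


def sketch_output_format(sample_rate: int, default: str = "pcm_24000") -> str:
    """Return a PCM format string for the rate, or the assumed default."""
    return _PCM_FORMATS.get(sample_rate, default)
```

How the real function treats an unsupported rate (fallback, warning, or error) should be checked against the source.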
- pipecat.services.elevenlabs.tts.build_elevenlabs_voice_settings(settings: dict[str, Any] | TTSSettings) -> dict[str, float | bool] | None
Build voice settings dictionary for ElevenLabs based on provided settings.
- Parameters:
settings – Dictionary or settings containing voice settings parameters.
- Returns:
Dictionary of voice settings or None if no valid settings are provided.
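The behaviour described here (keep only the voice-related values, return None when nothing is set) can be sketched as follows. The helper name `sketch_voice_settings` is hypothetical, and the sketch handles only the plain-dictionary case; the real function may apply extra rules, such as requiring certain keys together.

```python
from typing import Any, Mapping, Optional, Union

# The five voice-character fields named in this module's settings classes.
VOICE_SETTING_KEYS = (
    "stability",
    "similarity_boost",
    "style",
    "use_speaker_boost",
    "speed",
)


def sketch_voice_settings(
    settings: Mapping[str, Any],
) -> Optional[dict[str, Union[float, bool]]]:
    """Pick the voice settings that are present and not None."""
    picked = {
        key: settings[key]
        for key in VOICE_SETTING_KEYS
        if settings.get(key) is not None
    }
    # An empty result is reported as None, mirroring the documented return.
    return picked or None
```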
- class pipecat.services.elevenlabs.tts.PronunciationDictionaryLocator(*, pronunciation_dictionary_id: str, version_id: str)
Bases: BaseModel
Locator for a pronunciation dictionary.
- Parameters:
pronunciation_dictionary_id – The ID of the pronunciation dictionary.
version_id – The version ID of the pronunciation dictionary.
- pronunciation_dictionary_id: str
- version_id: str
- class pipecat.services.elevenlabs.tts.ElevenLabsTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, stability: float | None | _NotGiven = <factory>, similarity_boost: float | None | _NotGiven = <factory>, style: float | None | _NotGiven = <factory>, use_speaker_boost: bool | None | _NotGiven = <factory>, speed: float | None | _NotGiven = <factory>, apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven = <factory>)
Bases: TTSSettings
Settings for ElevenLabsTTSService.
Fields that appear in the WebSocket URL (voice, model, language) require a full reconnect when changed. Fields that affect the voice character (stability, similarity_boost, style, use_speaker_boost, speed) can be applied by closing the current audio context so a new one is opened with updated settings.
- Parameters:
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.7 to 1.2).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
- stability: float | None | _NotGiven
- similarity_boost: float | None | _NotGiven
- style: float | None | _NotGiven
- use_speaker_boost: bool | None | _NotGiven
- speed: float | None | _NotGiven
- apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven
- URL_FIELDS: ClassVar[frozenset[str]] = frozenset({'language', 'model', 'voice'})
Fields in the WS URL — changing any of these requires a reconnect.
- VOICE_SETTINGS_FIELDS: ClassVar[frozenset[str]] = frozenset({'similarity_boost', 'speed', 'stability', 'style', 'use_speaker_boost'})
Fields affecting voice character — changing these requires closing the current audio context so the next one picks up new settings.
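The two field sets imply a simple decision when settings change at runtime. The sketch below shows one way to express it; the function name `action_for_changes` and the returned labels are invented for this example, and the service's internal logic may differ.

```python
# Field sets copied from the class attributes documented above.
URL_FIELDS = frozenset({"language", "model", "voice"})
VOICE_SETTINGS_FIELDS = frozenset(
    {"similarity_boost", "speed", "stability", "style", "use_speaker_boost"}
)


def action_for_changes(changed_fields) -> str:
    """Classify a set of changed field names (labels are hypothetical)."""
    changed = set(changed_fields)
    if changed & URL_FIELDS:
        # URL fields are baked into the WebSocket URL: reconnect.
        return "reconnect"
    if changed & VOICE_SETTINGS_FIELDS:
        # Voice fields apply when the next audio context is opened.
        return "close_context"
    return "none"
```

A reconnect takes priority here because a fresh connection would also open a fresh context.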
- class pipecat.services.elevenlabs.tts.ElevenLabsHttpTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, optimize_streaming_latency: int | None | _NotGiven = <factory>, stability: float | None | _NotGiven = <factory>, similarity_boost: float | None | _NotGiven = <factory>, style: float | None | _NotGiven = <factory>, use_speaker_boost: bool | None | _NotGiven = <factory>, speed: float | None | _NotGiven = <factory>, apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven = <factory>)
Bases: TTSSettings
Settings for ElevenLabsHttpTTSService.
- Parameters:
optimize_streaming_latency – Latency optimization level (0-4).
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.25 to 4.0).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
- optimize_streaming_latency: int | None | _NotGiven
- stability: float | None | _NotGiven
- similarity_boost: float | None | _NotGiven
- style: float | None | _NotGiven
- use_speaker_boost: bool | None | _NotGiven
- speed: float | None | _NotGiven
- apply_text_normalization: Literal['auto', 'on', 'off'] | None | _NotGiven
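The documented ranges can be checked before values are passed to the service. This is a minimal sketch under the assumption that such validation is done by the caller; the helper `out_of_range` is hypothetical, and nothing in this section says whether the library validates these values itself.

```python
# Ranges copied from the parameter descriptions above.
HTTP_RANGES = {
    "optimize_streaming_latency": (0, 4),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
    "speed": (0.25, 4.0),
}


def out_of_range(settings: dict) -> list:
    """Return names of numeric settings outside their documented range."""
    problems = []
    for name, (low, high) in HTTP_RANGES.items():
        value = settings.get(name)
        # Unset (None or missing) values are skipped, not flagged.
        if value is not None and not (low <= value <= high):
            problems.append(name)
    return problems
```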
- pipecat.services.elevenlabs.tts.calculate_word_times(alignment_info: Mapping[str, Any], cumulative_time: float, partial_word: str = '', partial_word_start_time: float = 0.0) -> tuple[list[tuple[str, float]], str, float]
Calculate word timestamps from character alignment information.
- Parameters:
alignment_info – Character alignment data from ElevenLabs API.
cumulative_time – Base time offset for this chunk.
partial_word – Partial word carried over from previous chunk.
partial_word_start_time – Start time of the partial word.
- Returns:
A tuple (word_times, new_partial_word, new_partial_word_start_time):
word_times – List of (word, timestamp) tuples for complete words.
new_partial_word – Incomplete word at the end of the chunk (empty if the chunk ends with a space).
new_partial_word_start_time – Start time of the incomplete word.
- Return type:
tuple[list[tuple[str, float]], str, float]
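The carry-over of a partial word between chunks is easier to follow in code. The sketch below is an illustrative re-implementation, not the library's code. It assumes the alignment keys `characters` and `character_start_times_seconds` (borrowed from the example input shown for the HTTP service), and it returns a start time of 0.0 when no partial word remains; the real function may expect different keys or units.

```python
from typing import Any, Mapping


def sketch_word_times(
    alignment_info: Mapping[str, Any],
    cumulative_time: float,
    partial_word: str = "",
    partial_word_start_time: float = 0.0,
):
    """Split character alignment into (word, start_time) pairs."""
    chars = alignment_info.get("characters", [])
    starts = alignment_info.get("character_start_times_seconds", [])

    word_times = []
    current_word = partial_word
    current_start = partial_word_start_time

    for char, start in zip(chars, starts):
        if char == " ":
            # A space closes the word in progress, if there is one.
            if current_word:
                word_times.append((current_word, current_start))
                current_word = ""
        else:
            # The first character of a new word fixes its start time.
            if not current_word:
                current_start = cumulative_time + start
            current_word += char

    if not current_word:
        current_start = 0.0
    return word_times, current_word, current_start


# Usage: the second call completes the word left open by the first.
first = sketch_word_times(
    {
        "characters": [" ", "H", "e", "l", "l", "o", " ", "w", "o"],
        "character_start_times_seconds": [0.0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6],
    },
    cumulative_time=0.0,
)
second = sketch_word_times(
    {
        "characters": ["r", "l", "d", " "],
        "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3],
    },
    cumulative_time=1.0,
    partial_word=first[1],
    partial_word_start_time=first[2],
)
```

The completed word keeps the start time recorded in the earlier chunk, so a word split across chunks is reported once with its true onset.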
- class pipecat.services.elevenlabs.tts.ElevenLabsTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.elevenlabs.io', sample_rate: int | None = None, auto_mode: bool | None = None, enable_ssml_parsing: bool | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)
Bases: WebsocketTTSService
ElevenLabs WebSocket-based TTS service with word timestamps.
Provides real-time text-to-speech using ElevenLabs’ WebSocket streaming API. Supports word-level timestamps, audio context management, and various voice customization options including stability, similarity boost, and speed controls.
- Settings
alias of ElevenLabsTTSSettings
- class InputParams(*, language: Language | None = None, stability: float | None = None, similarity_boost: float | None = None, style: float | None = None, use_speaker_boost: bool | None = None, speed: float | None = None, auto_mode: bool | None = True, enable_ssml_parsing: bool | None = None, enable_logging: bool | None = None, apply_text_normalization: Literal['auto', 'on', 'off'] | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None)
Bases: BaseModel
Input parameters for ElevenLabs TTS configuration.
Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(...) instead.
- Parameters:
language – Language to use for synthesis.
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.7 to 1.2).
auto_mode – Whether to enable automatic mode optimization.
enable_ssml_parsing – Whether to parse SSML tags in text.
enable_logging – Whether to enable ElevenLabs logging.
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
- stability: float | None
- similarity_boost: float | None
- style: float | None
- use_speaker_boost: bool | None
- speed: float | None
- auto_mode: bool | None
- enable_ssml_parsing: bool | None
- enable_logging: bool | None
- apply_text_normalization: Literal['auto', 'on', 'off'] | None
- pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None
- __init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.elevenlabs.io', sample_rate: int | None = None, auto_mode: bool | None = None, enable_ssml_parsing: bool | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)
Initialize the ElevenLabs TTS service.
- Parameters:
api_key – ElevenLabs API key for authentication.
voice_id – ID of the voice to use for synthesis. Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(voice=...) instead.
model – TTS model to use (e.g., “eleven_turbo_v2_5”). Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(model=...) instead.
url – WebSocket URL for ElevenLabs TTS API.
sample_rate – Audio sample rate. If None, uses default.
auto_mode – Whether to enable ElevenLabs’ auto mode, which reduces latency by disabling server-side chunk scheduling and buffering. Recommended when sending complete sentences or phrases. When None (default), auto mode is enabled for SENTENCE aggregation and disabled for TOKEN aggregation, because token streaming relies on the server-side chunk scheduler to accumulate enough text for natural-sounding synthesis.
enable_ssml_parsing – Whether to parse SSML tags in text.
enable_logging – Whether to enable ElevenLabs server-side logging.
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
params – Additional input parameters for voice customization. Deprecated since version 0.0.105: Use settings=ElevenLabsTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences – Whether to aggregate sentences within the TTSService. Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
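The default for auto_mode described in the parameter list can be sketched as a small resolution function. The enum `AggregationMode` is a hypothetical stand-in for TextAggregationMode, and `resolve_auto_mode` is a name invented for this example; only the SENTENCE and TOKEN behaviour comes from the description above.

```python
from enum import Enum
from typing import Optional


class AggregationMode(Enum):
    """Hypothetical stand-in for TextAggregationMode."""

    SENTENCE = "sentence"
    TOKEN = "token"


def resolve_auto_mode(auto_mode: Optional[bool], mode: AggregationMode) -> bool:
    """Return the effective auto mode for a given aggregation mode."""
    if auto_mode is not None:
        # An explicit value from the caller always wins.
        return auto_mode
    # Default: enabled for whole sentences, disabled for token streaming.
    return mode is AggregationMode.SENTENCE
```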
- can_generate_metrics() -> bool
Check if this service can generate processing metrics.
- Returns:
True, as ElevenLabs service supports metrics generation.
- language_to_service_language(language: Language) -> str | None
Convert a Language enum to ElevenLabs language format.
- Parameters:
language – The language to convert.
- Returns:
The ElevenLabs-specific language code, or None if not supported.
- async start(frame: StartFrame)
Start the ElevenLabs TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)
Stop the ElevenLabs TTS service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)
Cancel the ElevenLabs TTS service.
- Parameters:
frame – The cancel frame.
- async flush_audio(context_id: str | None = None)
Flush any pending audio and finalize the current context.
- Parameters:
context_id – The specific context to flush. If None, falls back to the currently active context.
- async on_audio_context_interrupted(context_id: str)
Close the ElevenLabs context when the bot is interrupted.
- async on_audio_context_completed(context_id: str)
Close the ElevenLabs context after all audio has been played.
ElevenLabs does not send a server-side signal when a context is exhausted, so Pipecat must explicitly close it with close_context: True to free server-side resources.
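A close request of this kind is typically a small JSON message sent over the WebSocket. The sketch below builds one. The `close_context` key comes from the description above, while the `context_id` key and the helper name `close_context_message` are assumptions for illustration; check the ElevenLabs WebSocket API reference for the exact message shape.

```python
import json


def close_context_message(context_id: str) -> str:
    """Serialize a hypothetical close-context message for one context."""
    # "context_id" as the key name is an assumption, not confirmed here.
    return json.dumps({"context_id": context_id, "close_context": True})
```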
- async run_tts(text: str, context_id: str) -> AsyncGenerator[Frame | None, None]
Generate speech from text using ElevenLabs’ streaming WebSocket API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech.
- class pipecat.services.elevenlabs.tts.ElevenLabsHttpTTSService(*, api_key: str, voice_id: str | None = None, aiohttp_session: ClientSession, model: str | None = None, base_url: str = 'https://api.elevenlabs.io', sample_rate: int | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsHttpTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)
Bases: TTSService
ElevenLabs HTTP-based TTS service with word timestamps.
Provides text-to-speech using ElevenLabs’ HTTP streaming API for simpler, non-WebSocket integration. Suitable for use cases where a streaming WebSocket connection is not required or desired.
- Settings
alias of ElevenLabsHttpTTSSettings
- class InputParams(*, language: Language | None = None, optimize_streaming_latency: int | None = None, stability: float | None = None, similarity_boost: float | None = None, style: float | None = None, use_speaker_boost: bool | None = None, speed: float | None = None, apply_text_normalization: Literal['auto', 'on', 'off'] | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None)
Bases: BaseModel
Input parameters for ElevenLabs HTTP TTS configuration.
Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(...) instead.
- Parameters:
language – Language to use for synthesis.
optimize_streaming_latency – Latency optimization level (0-4).
stability – Voice stability control (0.0 to 1.0).
similarity_boost – Similarity boost control (0.0 to 1.0).
style – Style control for voice expression (0.0 to 1.0).
use_speaker_boost – Whether to use speaker boost enhancement.
speed – Voice speed control (0.25 to 4.0).
apply_text_normalization – Text normalization mode (“auto”, “on”, “off”).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
- optimize_streaming_latency: int | None
- stability: float | None
- similarity_boost: float | None
- style: float | None
- use_speaker_boost: bool | None
- speed: float | None
- apply_text_normalization: Literal['auto', 'on', 'off'] | None
- pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None
- __init__(*, api_key: str, voice_id: str | None = None, aiohttp_session: ClientSession, model: str | None = None, base_url: str = 'https://api.elevenlabs.io', sample_rate: int | None = None, enable_logging: bool | None = None, pronunciation_dictionary_locators: list[PronunciationDictionaryLocator] | None = None, params: InputParams | None = None, settings: ElevenLabsHttpTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)
Initialize the ElevenLabs HTTP TTS service.
- Parameters:
api_key – ElevenLabs API key for authentication.
voice_id – ID of the voice to use for synthesis. Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(voice=...) instead.
aiohttp_session – aiohttp ClientSession for HTTP requests.
model – TTS model to use (e.g., “eleven_turbo_v2_5”). Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(model=...) instead.
base_url – Base URL for ElevenLabs HTTP API.
sample_rate – Audio sample rate. If None, uses default.
enable_logging – Whether to enable ElevenLabs server-side logging. Set to False for zero retention mode (enterprise only).
pronunciation_dictionary_locators – List of pronunciation dictionary locators to use.
params – Additional input parameters for voice customization. Deprecated since version 0.0.105: Use settings=ElevenLabsHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences – Whether to aggregate sentences within the TTSService. Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
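The precedence rule for settings over deprecated parameters can be illustrated with plain dictionaries. This is a simplified sketch: the helper `merge_settings` is hypothetical, and None stands in for "not provided", whereas the settings classes in this module use a `_NotGiven` sentinel for that purpose.

```python
def merge_settings(deprecated: dict, settings: dict) -> dict:
    """Merge two sources of values; entries from `settings` win."""
    # Start from the deprecated arguments that were actually supplied.
    merged = {k: v for k, v in deprecated.items() if v is not None}
    # Overlay the supplied settings values, which take precedence.
    merged.update({k: v for k, v in settings.items() if v is not None})
    return merged
```

For example, a voice passed both as voice_id and in settings resolves to the settings value, while a model passed only through the deprecated argument is kept.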
- language_to_service_language(language: Language) -> str | None
Convert pipecat Language to ElevenLabs language code.
- Parameters:
language – The language to convert.
- Returns:
The ElevenLabs-specific language code, or None if not supported.
- can_generate_metrics() -> bool
Check if this service can generate processing metrics.
- Returns:
True, as ElevenLabs HTTP service supports metrics generation.
- async start(frame: StartFrame)
Start the ElevenLabs HTTP TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)
Push a frame and handle state changes.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
- calculate_word_times(alignment_info: Mapping[str, Any]) -> list[tuple[str, float]]
Calculate word timing from character alignment data.
This method handles partial words that may span across multiple alignment chunks.
- Parameters:
alignment_info – Character timing data from ElevenLabs.
- Returns:
List of (word, timestamp) pairs for complete words in this chunk.
Example input data:
{
    "characters": [" ", "H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"],
    "character_start_times_seconds": [0.0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "character_end_times_seconds": [0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
}
Would produce word times (with cumulative_time=0):
[("Hello", 0.1), ("world", 0.5)]
- async run_tts(text: str, context_id: str) -> AsyncGenerator[Frame | None, None]
Generate speech from text using ElevenLabs streaming API with timestamps.
Makes a request to the ElevenLabs API to generate audio and timing data. Tracks the duration of each utterance to ensure correct sequencing. Includes previous text as context for better prosody continuity.
- Parameters:
text – Text to convert to speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio and control frames containing the synthesized speech.
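The two pieces of per-utterance state mentioned above (elapsed duration and previous text) could be tracked roughly as follows. The class `UtteranceTracker`, its method names, and the payload keys `text` and `previous_text` are assumptions for this sketch; the request body the service actually sends should be checked against the source and the ElevenLabs API reference.

```python
class UtteranceTracker:
    """Hypothetical helper tracking state between HTTP TTS requests."""

    def __init__(self) -> None:
        self.cumulative_time = 0.0  # seconds of audio produced so far
        self.previous_text = ""     # text already synthesized

    def build_payload(self, text: str) -> dict:
        """Build a request body, adding prior text when there is any."""
        payload = {"text": text}
        if self.previous_text:
            payload["previous_text"] = self.previous_text
        return payload

    def finish(self, text: str, duration: float) -> None:
        """Record a completed utterance and its audio duration."""
        self.cumulative_time += duration
        self.previous_text = (self.previous_text + " " + text).strip()
```

Adding each utterance's duration to a running total is what lets word timestamps from later requests be offset so they follow earlier ones.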