tts_service

Base classes for Text-to-speech services.

class pipecat.services.tts_service.TTSContext(append_to_context: bool = True, push_assistant_aggregation: bool | None = False)[source]

Bases: object

Context information for a TTS request.

Parameters:

append_to_context – Whether this TTS output should be appended to the conversation context after it is spoken.
push_assistant_aggregation – Whether to push an LLMAssistantPushAggregationFrame after the TTS has finished speaking, forcing the assistant aggregator to commit its current text buffer to the conversation context.

append_to_context: bool = True

push_assistant_aggregation: bool | None = False

class pipecat.services.tts_service.TextAggregationMode(*values)[source]

Bases: StrEnum

Controls how incoming text is aggregated before TTS synthesis.

Parameters:

SENTENCE – Buffer text until sentence boundaries are detected before synthesis. Produces more natural speech but adds latency (~200-300ms per sentence).
TOKEN – Stream text tokens directly to TTS as they arrive. Reduces latency but may affect speech quality depending on the TTS provider.

SENTENCE = 'sentence'

TOKEN = 'token'
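
The difference between the two modes can be illustrated with a small self-contained sketch. It does not use pipecat's own aggregator: `AggregationMode`, `aggregate`, and the boundary rule (`.`, `!` or `?` followed by whitespace) are stand-ins assumed for illustration only.

```python
import re
from enum import Enum


class AggregationMode(str, Enum):
    """Local stand-in mirroring the two documented modes (not the pipecat class)."""

    SENTENCE = "sentence"
    TOKEN = "token"


# Simplified boundary rule (an assumption): ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def aggregate(tokens, mode):
    """Return the text chunks that would be handed to synthesis, in order."""
    if mode is AggregationMode.TOKEN:
        # Token mode: forward each token as soon as it arrives.
        return list(tokens)

    chunks, buffer = [], ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently held in the buffer.
        parts = _SENTENCE_END.split(buffer)
        chunks.extend(parts[:-1])
        buffer = parts[-1]
    if buffer.strip():
        # Flush whatever is left once the response ends.
        chunks.append(buffer)
    return chunks
```

In this sketch a sentence is only released once the text that follows it arrives (or the response ends), which is one way to see where the extra latency of sentence aggregation comes from.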

class pipecat.services.tts_service.TTSService(*, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, push_text_frames: bool = True, push_stop_frames: bool = False, push_start_frame: bool = False, stop_frame_timeout_s: float = 3.0, push_silence_after_stop: bool = False, silence_time_s: float = 2.0, pause_frame_processing: bool = False, append_trailing_space: bool = False, sample_rate: int | None = None, skip_aggregator_types: list[str] | None = [], text_transforms: list[tuple[AggregationType | str, Callable[[str, str | AggregationType], Awaitable[str]]]] | None = None, text_filters: Sequence[BaseTextFilter] | None = None, transport_destination: str | None = None, settings: TTSSettings | None = None, reuse_context_id_within_turn: bool = True, **kwargs)[source]

Bases: AIService

Base class for text-to-speech services.

Provides common functionality for TTS services including text aggregation, filtering, audio generation, and frame management. Supports configurable sentence aggregation, silence insertion, and frame processing control.

Event handlers:

on_connected: Called when connected to the TTS service.
on_disconnected: Called when disconnected from the TTS service.
on_connection_error: Called when a connection error to the TTS service occurs.
on_tts_request: Called before a TTS request is made, with the context ID and text.

Example:

@tts.event_handler("on_connected")
async def on_connected(tts: TTSService):
    logger.debug("TTS connected")

@tts.event_handler("on_disconnected")
async def on_disconnected(tts: TTSService):
    logger.debug("TTS disconnected")

@tts.event_handler("on_connection_error")
async def on_connection_error(tts: TTSService, error: str):
    logger.error(f"TTS connection error: {error}")

@tts.event_handler("on_tts_request")
async def on_tts_request(tts: TTSService, context_id: str, text: str):
    logger.debug(f"TTS request: {context_id} - {text}")

__init__(*, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, push_text_frames: bool = True, push_stop_frames: bool = False, push_start_frame: bool = False, stop_frame_timeout_s: float = 3.0, push_silence_after_stop: bool = False, silence_time_s: float = 2.0, pause_frame_processing: bool = False, append_trailing_space: bool = False, sample_rate: int | None = None, skip_aggregator_types: list[str] | None = [], text_transforms: list[tuple[AggregationType | str, Callable[[str, str | AggregationType], Awaitable[str]]]] | None = None, text_filters: Sequence[BaseTextFilter] | None = None, transport_destination: str | None = None, settings: TTSSettings | None = None, reuse_context_id_within_turn: bool = True, **kwargs)[source]

Initialize the TTS service.

Parameters:

text_aggregation_mode – How to aggregate incoming text before synthesis. TextAggregationMode.SENTENCE (default) buffers until sentence boundaries, TextAggregationMode.TOKEN streams tokens directly for lower latency.
aggregate_sentences –
Whether to aggregate text into sentences before synthesis.

Deprecated since version 0.0.104: Use text_aggregation_mode instead. Set to TextAggregationMode.SENTENCE to aggregate text into sentences before synthesis, or TextAggregationMode.TOKEN to stream tokens directly for lower latency.
push_text_frames – Whether to push TextFrames and LLMFullResponseEndFrames.
push_stop_frames – Whether to automatically push TTSStoppedFrames.
push_start_frame – Whether to automatically create audio contexts and push TTSStartedFrames. When True, the base class handles create_audio_context and yields TTSStartedFrame before each synthesis call, so run_tts implementations do not need to.
stop_frame_timeout_s – Idle time before pushing TTSStoppedFrame when push_stop_frames is True.
push_silence_after_stop – Whether to push silence audio after TTSStoppedFrame.
silence_time_s – Duration of silence to push when push_silence_after_stop is True.
pause_frame_processing – Whether to pause frame processing during audio generation.
append_trailing_space – Whether to append a trailing space to text before sending to TTS. This helps prevent some TTS services from vocalizing trailing punctuation (e.g., “dot”).
sample_rate – Output sample rate for generated audio.
skip_aggregator_types – List of aggregation types that should not be spoken.
text_transforms – A list of callables that transform text just before it is sent to TTS. Each callable takes the aggregated text and its type, and returns the transformed text. To register, provide a list of tuples of (aggregation_type | ‘*’, transform_function).
text_filters – Sequence of text filters to apply after aggregation.
transport_destination – Destination for generated audio frames.
settings – The runtime-updatable settings for the TTS service.
reuse_context_id_within_turn – Whether the service should reuse context IDs within the same turn.
**kwargs – Additional arguments passed to the parent AIService.

async start_tts_usage_metrics(text: str)[source]

Record TTS usage metrics.

When streaming tokens, usage metrics are aggregated and reported at flush time instead of per token, so individual calls are skipped.

Parameters:: text – The text being processed by TTS.

async start_text_aggregation_metrics()[source]

Start text aggregation metrics if not already started.

Only starts the metric once per LLM response. Skipped when streaming tokens since per-token aggregation time is not meaningful.

async stop_text_aggregation_metrics()[source]: Stop text aggregation metrics and reset the started flag.

property sample_rate: int

Get the current sample rate for audio output.

Returns:: The sample rate in Hz.

property chunk_size: int

Get the recommended chunk size for audio streaming.

This property indicates how much audio is downloaded (from TTS services that require chunking) before the first audio frame is pushed. Buffering this much up front lets the rest of the audio download while playback is under way, which avoids audio glitches (especially at the beginning). The effect also depends on how fast the TTS service generates bytes.

Returns:: The recommended chunk size in bytes.
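
As a rough worked example of how a byte count relates to playback time, the sketch below computes the size of a raw PCM buffer. The formula pipecat actually uses is not stated in this reference; the half-second default, 16-bit samples, mono audio, and the helper name `pcm_chunk_size` are all assumptions made for illustration.

```python
def pcm_chunk_size(
    sample_rate: int,
    seconds: float = 0.5,       # assumed buffering window
    channels: int = 1,          # assumed mono
    bytes_per_sample: int = 2,  # assumed 16-bit PCM
) -> int:
    """Bytes of raw PCM audio covering `seconds` of playback."""
    return int(sample_rate * seconds) * channels * bytes_per_sample
```

Under these assumptions, half a second of 24 kHz audio is 12,000 samples, or 24,000 bytes.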

async set_model(model: str)[source]

Set the TTS model to use.

Deprecated since version 0.0.104: Use TTSUpdateSettingsFrame(model=...) instead.

Parameters:: model – The name of the TTS model.

async set_voice(voice: str)[source]

Set the voice for speech synthesis.

Deprecated since version 0.0.104: Use TTSUpdateSettingsFrame(voice=...) instead.

Parameters:: voice – The voice identifier or name.

create_context_id() → str[source]

Generate or reuse a context ID based on concurrent TTS support.

Returns:: A context ID string for the TTS request.

abstractmethod async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Run text-to-speech synthesis on the provided text.

This method must be implemented by subclasses to provide actual TTS functionality.

Parameters:

text – The text to synthesize into speech.
context_id – Unique identifier for this TTS context.

Yields:

Frame – Audio frames containing the synthesized speech.

language_to_service_language(language: Language) → str | None[source]

Convert a language to the service-specific language format.

Parameters:: language – The language to convert.
Returns:: The service-specific language identifier, or None if not supported.

async flush_audio(context_id: str | None = None)[source]

Flush any buffered audio data.

Parameters:: context_id – The specific context to flush. If None, falls back to the currently active context (for non-concurrent services).

async start(frame: StartFrame)[source]

Start the TTS service.

Parameters:: frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the TTS service.

Parameters:: frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the TTS service.

Parameters:: frame – The cancel frame.

add_text_transformer(transform_function: Callable[[str, AggregationType | str], Awaitable[str]], aggregation_type: AggregationType | str = '*')[source]

Register a text transformer for a specific aggregation type.

Parameters:

transform_function – The async function to apply for transformation. It takes the text and aggregation type as input and returns the transformed text, for example a function with the signature async def my_transform(text: str, aggregation_type: str) -> str.
aggregation_type – The type of aggregation to transform. This value defaults to “*” indicating the function should handle all text before sending to TTS.
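
A minimal sketch of what such transform functions might look like, and of how a list of (aggregation_type, function) pairs with a "*" wildcard could be applied. `apply_transforms` is a hypothetical helper written for this example, not the base class's internal logic, and the "heading" aggregation type is invented for illustration.

```python
import asyncio


async def expand_abbreviations(text: str, aggregation_type: str) -> str:
    """Hypothetical transform: spell out an abbreviation before synthesis."""
    return text.replace("Dr.", "Doctor")


async def shout(text: str, aggregation_type: str) -> str:
    """Hypothetical transform: upper-case the text."""
    return text.upper()


async def apply_transforms(text, aggregation_type, transforms):
    """Run each (type, function) pair whose type is '*' or matches exactly."""
    for registered_type, transform in transforms:
        if registered_type in ("*", aggregation_type):
            text = await transform(text, aggregation_type)
    return text


# "*" applies to all text; "heading" is a made-up aggregation type.
transforms = [("*", expand_abbreviations), ("heading", shout)]
```

With these definitions, `asyncio.run(apply_transforms("Dr. Smith", "sentence", transforms))` runs only the wildcard transform, while passing "heading" runs both in registration order.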

remove_text_transformer(transform_function: Callable[[str, AggregationType | str], Awaitable[str]], aggregation_type: AggregationType | str = '*')[source]

Remove a text transformer for a specific aggregation type.

Parameters:

transform_function – The function to remove.
aggregation_type – The type of aggregation to remove the transformer for.

async on_turn_context_created(context_id: str)[source]

Called when a new turn context ID has been created.

Override to perform provider-specific setup (e.g., eagerly opening a server-side context) before text starts flowing. This is called from process_frame when an LLMFullResponseStartFrame or TTSSpeakFrame arrives.

Parameters:: context_id – The newly created turn context ID.

async on_turn_context_completed()[source]: Handle the completion of a turn.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames for text-to-speech conversion.

Handles TextFrames for synthesis, interruption frames, settings updates, and various control frames.

Parameters:

frame – The frame to process.
direction – The direction of frame processing.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame downstream with TTS-specific handling.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async tts_process_generator(context_id: str, generator: AsyncGenerator[Frame | None, None]) → bool[source]

Process frames from an async generator, routing them through the audio context.

All non-None frames yielded by the generator are appended to the audio context identified by context_id. The audio context must be created by run_tts (via create_audio_context) before the first frame is yielded.

WebSocket services yield None to signal that audio will arrive via a separate receive loop; those services manage context lifetime themselves (via remove_audio_context in the receive loop on “done”). HTTP services never yield None and do NOT call remove_audio_context in run_tts — the caller (_synthesize_text) closes the context after appending any remaining frames (e.g. TTSTextFrame).

Parameters:

context_id – The audio context to route frames to.
generator – An async generator yielding Frame objects or None.
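
The two generator styles described above can be sketched without pipecat. Everything here is a stand-in: `AudioChunk` replaces a real audio frame, `route` is a simplified model of the routing step, and it returns a frame count because the meaning of the real method's bool return value is not described in this reference.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioChunk:
    """Stand-in for an audio frame (hypothetical, not a pipecat type)."""

    data: bytes


async def http_style_tts(text: str):
    """HTTP-style generator: yields every audio frame itself, never None."""
    for word in text.split():
        yield AudioChunk(word.encode())


async def websocket_style_tts(text: str):
    """WebSocket-style generator: audio arrives elsewhere, so it yields None."""
    yield None


async def route(context_id, generator, contexts):
    """Append non-None frames to contexts[context_id]; return how many were routed."""
    routed = 0
    async for frame in generator:
        if frame is None:
            # Nothing to route: a separate receive loop will deliver the audio.
            continue
        contexts[context_id].append(frame)
        routed += 1
    return routed
```

In this model the HTTP-style generator fills the context directly, while the WebSocket-style generator leaves it empty for a receive loop to fill later.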

async start_word_timestamps()[source]: Start tracking word timestamps from the current time.

async reset_word_timestamps()[source]: Reset word timestamp tracking.

async add_word_timestamps(word_times: list[tuple[str, float]], context_id: str | None = None, includes_inter_frame_spaces: bool | None = None)[source]

Add word timestamps for processing.

When an audio context exists for this context_id, timestamps are routed into the per-context audio queue alongside audio frames so they are processed in strict playback order by _handle_audio_context. Otherwise they are processed immediately via _add_word_timestamps.

Parameters:

word_times – List of (word, timestamp) tuples where timestamp is in seconds.
context_id – Unique identifier for the TTS context.
includes_inter_frame_spaces – When True, the tokens already embed inter-word spacing (spaces and punctuation are part of the token text). Downstream consumers must not inject additional spaces between tokens. None leaves the frame’s own default unchanged.
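
To make the effect of includes_inter_frame_spaces concrete, the sketch below rebuilds text from (word, timestamp) tuples as a downstream consumer might. `join_tokens` is a hypothetical helper for illustration and is not part of pipecat.

```python
def join_tokens(word_times, includes_inter_frame_spaces=False):
    """Rebuild the spoken text from (word, timestamp_seconds) tuples."""
    # When tokens already carry their own spacing, adding more would double it.
    separator = "" if includes_inter_frame_spaces else " "
    return separator.join(word for word, _ in word_times)
```

For plain words such as `[("Hello", 0.0), ("world", 0.42)]` the helper inserts spaces; for tokens such as `[("Hello,", 0.0), (" world", 0.42), ("!", 0.8)]` with the flag set to True it concatenates them unchanged.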

async create_audio_context(context_id: str)[source]

Create a new audio context for grouping related audio.

Parameters:: context_id – Unique identifier for the audio context.

async append_to_audio_context(context_id: str, frame: Frame | _WordTimestampEntry | None)[source]

Append a frame or word-timestamp entry to an existing audio context queue.

Passing None signals end-of-context (used by remove_audio_context to mark the queue for deletion). If the context no longer exists but the context_id matches the active turn, the context is transparently recreated before appending.

Parameters:

context_id – The context to append to.
frame – The frame, word-timestamp entry, or None (end-of-context sentinel) to append.
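
The None end-of-context sentinel follows a common queue pattern, sketched here with a plain `asyncio.Queue`. The names `drain` and `demo` and the placeholder items are assumptions for illustration; the real per-context queue and its consumer are internal to TTSService.

```python
import asyncio


async def drain(queue: asyncio.Queue) -> list:
    """Consume items in order until the None end-of-context sentinel."""
    played = []
    while True:
        item = await queue.get()
        if item is None:
            # Sentinel reached: the context is finished.
            break
        played.append(item)
    return played


async def demo() -> list:
    queue = asyncio.Queue()
    # Audio bytes and a word-timestamp tuple share one queue, then the sentinel.
    for item in (b"audio-1", ("hello", 0.0), b"audio-2", None):
        await queue.put(item)
    return await drain(queue)
```

Because audio and timestamp entries travel through the same queue, the consumer sees them in the order they were appended.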

async remove_audio_context(context_id: str)[source]

Remove an existing audio context.

Parameters:: context_id – The context to remove.

has_active_audio_context() → bool[source]

Check if there is an active audio context.

Returns:: True if an active audio context exists, False otherwise.

get_audio_contexts() → list[str][source]: Get a list of all available audio contexts.

get_active_audio_context_id() → str | None[source]

Get the active audio context ID.

Returns:: The active context ID, or None if no context is active.

async remove_active_audio_context()[source]: Remove the active audio context.

reset_active_audio_context()[source]: Reset the active audio context.

audio_context_available(context_id: str) → bool[source]

Check whether the given audio context is registered.

Parameters:: context_id – The context ID to check.
Returns:: True if the context exists and is available.

async on_audio_context_interrupted(context_id: str)[source]

Called when an audio context is cancelled due to an interruption.

Override this in a subclass to perform provider-specific cleanup (e.g. sending a cancel/close message over the WebSocket) when the bot is interrupted mid-speech. The audio context task has already been stopped and the active context has not yet been reset when this is called, so context_id reflects the context that was cut short.

Parameters:: context_id – The ID of the audio context that was interrupted, or None if no context was active at the time.

async on_audio_context_completed(context_id: str)[source]

Called after an audio context has finished playing all of its audio.

Override this in a subclass to perform provider-specific cleanup (e.g. sending a close-context message to free server-side resources) once an audio context has been fully processed. The context entry has already been removed from the internal context map, and the active context has not yet been reset when this is called.

Parameters:: context_id – The ID of the audio context that finished processing.

class pipecat.services.tts_service.WordTTSService(**kwargs)[source]

Bases: TTSService

Deprecated. Use TTSService directly instead.

Deprecated since version 0.0.105: Word timestamp functionality is now always active in TTSService.

__init__(**kwargs)[source]

Initialize the Word TTS service.

Parameters:: **kwargs – Additional arguments passed to the parent TTSService.

class pipecat.services.tts_service.WebsocketTTSService(*, reconnect_on_error: bool = True, **kwargs)[source]

Bases: TTSService, WebsocketService

Base class for websocket-based TTS services.

Combines TTS functionality with websocket connectivity, providing automatic error handling and reconnection capabilities.

Event handlers:: on_connection_error: Called when a websocket connection error occurs.

Example:

@tts.event_handler("on_connection_error")
async def on_connection_error(tts: TTSService, error: str):
    logger.error(f"TTS connection error: {error}")

__init__(*, reconnect_on_error: bool = True, **kwargs)[source]

Initialize the Websocket TTS service.

Parameters:

reconnect_on_error – Whether to automatically reconnect on websocket errors.
**kwargs – Additional arguments passed to parent classes.

class pipecat.services.tts_service.InterruptibleTTSService(**kwargs)[source]

Bases: WebsocketTTSService

Websocket-based TTS service that handles interruptions without word timestamps.

Designed for TTS services that don’t support word timestamps. Handles interruptions by reconnecting the websocket when the bot is speaking and gets interrupted.

__init__(**kwargs)[source]

Initialize the Interruptible TTS service.

Parameters:: **kwargs – Additional arguments passed to the parent WebsocketTTSService.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame downstream with TTS-specific handling.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames with bot speaking state tracking.

Parameters:

frame – The frame to process.
direction – The direction of frame processing.

class pipecat.services.tts_service.WebsocketWordTTSService(*, reconnect_on_error: bool = True, **kwargs)[source]

Bases: WebsocketTTSService

Deprecated. Use WebsocketTTSService directly instead.

Deprecated since version 0.0.105: Word timestamp functionality is now always active in TTSService.

__init__(*, reconnect_on_error: bool = True, **kwargs)[source]

Initialize the Websocket Word TTS service.

Parameters:

reconnect_on_error – Whether to automatically reconnect on websocket errors.
**kwargs – Additional arguments passed to parent classes.

class pipecat.services.tts_service.InterruptibleWordTTSService(**kwargs)[source]

Bases: InterruptibleTTSService

Deprecated. Use InterruptibleTTSService directly instead.

Deprecated since version 0.0.105: Word timestamp functionality is now always active in TTSService.

__init__(**kwargs)[source]

Initialize the Interruptible Word TTS service.

Parameters:: **kwargs – Additional arguments passed to the parent InterruptibleTTSService.

class pipecat.services.tts_service.AudioContextTTSService(*, reuse_context_id_within_turn: bool = True, reconnect_on_error: bool = True, **kwargs)[source]

Bases: WebsocketTTSService

Deprecated. Inherit from WebsocketTTSService directly instead.

Audio context management (previously the main purpose of this class) is now built into TTSService. This class is kept only for backwards compatibility.

Deprecated since version 0.0.105: Subclass WebsocketTTSService directly and pass reuse_context_id_within_turn as a keyword argument to its __init__.

__init__(*, reuse_context_id_within_turn: bool = True, reconnect_on_error: bool = True, **kwargs)[source]

Initialize the Audio Context TTS service.

Parameters:

reuse_context_id_within_turn – Whether the service should reuse context IDs within the same turn.
reconnect_on_error – Whether to automatically reconnect on websocket errors.
**kwargs – Additional arguments passed to the parent WebsocketTTSService.

class pipecat.services.tts_service.AudioContextWordTTSService(*, reconnect_on_error: bool = True, **kwargs)[source]

Bases: AudioContextTTSService

Deprecated. Use WebsocketTTSService directly instead.

Deprecated since version 0.0.105: Subclass WebsocketTTSService directly.

__init__(*, reconnect_on_error: bool = True, **kwargs)[source]

Initialize the Audio Context Word TTS service.

Parameters:

reconnect_on_error – Whether to automatically reconnect on websocket errors.
**kwargs – Additional arguments passed to parent classes.