tts

Inworld AI Text-to-Speech Service Implementation.

Contains two TTS services:

- InworldTTSService: WebSocket-based TTS service.
- InworldHttpTTSService: HTTP-based TTS service.

Inworld’s text-to-speech (TTS) models offer ultra-realistic, context-aware speech synthesis and precise voice cloning capabilities, enabling developers to build natural and engaging experiences with human-like speech quality at an accessible price point.

class pipecat.services.inworld.tts.InworldTTSSettings

Bases: TTSSettings

Settings for InworldTTSService and InworldHttpTTSService.

Parameters:

speaking_rate – Speaking rate for speech synthesis.
temperature – Temperature for speech synthesis.

speaking_rate: float | None | _NotGiven

temperature: float | None | _NotGiven

classmethod from_mapping(settings: Mapping[str, Any]) → Self[source]

Construct settings from a plain dict, destructuring the legacy nested audioConfig.
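As an illustration of what destructuring a nested audioConfig could look like, here is a minimal standalone sketch. It does not reproduce pipecat's implementation: the helper name flatten_legacy_settings and the camelCase key speakingRate inside audioConfig are assumptions made for this example, as is the rule that top-level values win over nested ones.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical mapping from legacy nested keys to flat field names.
# These key names are assumptions for illustration only.
_LEGACY_AUDIO_KEYS = {"speakingRate": "speaking_rate"}


def flatten_legacy_settings(settings: Mapping[str, Any]) -> dict[str, Any]:
    """Lift known keys out of a nested ``audioConfig`` dict into a flat dict."""
    # Copy every top-level key except the nested legacy container.
    flat = {k: v for k, v in settings.items() if k != "audioConfig"}
    audio_config = settings.get("audioConfig") or {}
    for legacy_key, field_name in _LEGACY_AUDIO_KEYS.items():
        # In this sketch, values already present at the top level win.
        if legacy_key in audio_config and field_name not in flat:
            flat[field_name] = audio_config[legacy_key]
    return flat


legacy = {"temperature": 0.7, "audioConfig": {"speakingRate": 1.2}}
flat = flatten_legacy_settings(legacy)
```

With the real class, the equivalent call would be InworldTTSSettings.from_mapping(legacy); which legacy keys it recognizes is not stated here.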

class pipecat.services.inworld.tts.InworldHttpTTSService(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, streaming: bool = True, sample_rate: int | None = None, encoding: str = 'LINEAR16', timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, **kwargs)[source]

Bases: TTSService

Inworld AI HTTP-based TTS service.

Supports both streaming and non-streaming modes via the streaming parameter. Outputs LINEAR16 audio at configurable sample rates with word-level timestamps.

Settings: alias of InworldTTSSettings

class InputParams(*, temperature: float | None = None, speaking_rate: float | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC')[source]

Bases: BaseModel

Input parameters for Inworld TTS configuration.

Deprecated since version 0.0.105: Use InworldHttpTTSService.Settings directly via the settings parameter instead.

Parameters:

temperature – Temperature for speech synthesis.
speaking_rate – Speaking rate for speech synthesis.
timestamp_transport_strategy – The strategy to use for timestamp transport.

temperature: float | None

speaking_rate: float | None

timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None

__init__(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, streaming: bool = True, sample_rate: int | None = None, encoding: str = 'LINEAR16', timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, **kwargs)[source]

Initialize the Inworld TTS service.

Parameters:

api_key – Inworld API key.
aiohttp_session – aiohttp ClientSession for HTTP requests.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(voice=...) instead.
model –
ID of the model to use for synthesis.

Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(model=...) instead.
streaming – Whether to use streaming mode.
sample_rate – Audio sample rate in Hz.
encoding – Audio encoding format.
timestamp_transport_strategy – Strategy for timestamp transport (“ASYNC” or “SYNC”). Defaults to “ASYNC”.
params –
Input parameters for Inworld TTS configuration.

Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to the parent class.
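The following sketch shows one way the constructor might be called using the non-deprecated settings parameter. The environment variable name INWORLD_API_KEY, the placeholder voice and model IDs, and the numeric values are assumptions for illustration; valid values depend on your Inworld account and are not specified in this reference.

```python
import os


async def run_with_http_tts() -> None:
    # Imports live inside the function so the sketch can be loaded
    # without pipecat or aiohttp installed.
    import aiohttp
    from pipecat.services.inworld.tts import InworldHttpTTSService

    async with aiohttp.ClientSession() as session:
        tts = InworldHttpTTSService(
            api_key=os.environ["INWORLD_API_KEY"],  # assumed env var name
            aiohttp_session=session,
            streaming=True,
            settings=InworldHttpTTSService.Settings(
                voice="your-voice-id",  # placeholder, not a real voice
                model="your-model-id",  # placeholder, not a real model
                speaking_rate=1.0,  # illustrative value
                temperature=0.7,  # illustrative value
            ),
        )
        # Hand `tts` to a pipeline here, while the session is still open.
        ...
```

Creating the service inside the async with block keeps the ClientSession open for as long as the service needs it.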

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as the Inworld TTS service supports metrics generation.

async start(frame: StartFrame)[source]

Start the Inworld TTS service.

Parameters:

frame – The start frame.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame and handle state changes.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate TTS audio for the given text.

Parameters:

text – The text to generate TTS audio for.
context_id – Unique identifier for this TTS context.

Returns:

An asynchronous generator of frames.
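Because run_tts returns an asynchronous generator typed as Frame | None, a consumer iterates it with async for and may need to skip None values. In a pipeline the framework normally drives this generator for you. The sketch below uses a stand-in generator, fake_run_tts, that yields placeholder strings instead of real pipecat frames; both the stand-in and the string values are hypothetical.

```python
import asyncio
from collections.abc import AsyncGenerator
from typing import Optional


async def fake_run_tts(text: str, context_id: str) -> AsyncGenerator[Optional[str], None]:
    """Stand-in for run_tts: yields placeholder 'frames', sometimes None."""
    yield f"started:{context_id}"
    yield None  # a generator typed `Frame | None` may yield None
    for word in text.split():
        yield f"audio:{word}"
    yield f"stopped:{context_id}"


async def collect_frames(text: str, context_id: str) -> list[str]:
    frames = []
    async for frame in fake_run_tts(text, context_id):
        if frame is None:
            continue  # skip empty yields
        frames.append(frame)
    return frames


frames = asyncio.run(collect_frames("hello world", "ctx-1"))
```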

class pipecat.services.inworld.tts.InworldTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.inworld.ai/tts/v1/voice:streamBidirectional', sample_rate: int | None = None, encoding: str = 'LINEAR16', auto_mode: bool | None = None, apply_text_normalization: str | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, append_trailing_space: bool = True, **kwargs: Any)[source]

Bases: WebsocketTTSService

Inworld AI WebSocket-based TTS service.

Uses a bidirectional WebSocket for lower-latency streaming. Supports up to 5 independent audio contexts per connection. Outputs LINEAR16 audio with word-level timestamps.

Settings: alias of InworldTTSSettings

class InputParams

Bases: BaseModel

Input parameters for Inworld WebSocket TTS configuration.

Deprecated since version 0.0.105: Use InworldTTSService.Settings directly via the settings parameter instead.

Parameters:

temperature – Temperature for speech synthesis.
speaking_rate – Speaking rate for speech synthesis.
apply_text_normalization – Text normalization setting, passed as a string.
max_buffer_delay_ms – Maximum buffer delay in milliseconds.
buffer_char_threshold – Buffer character threshold.
auto_mode – Whether to use auto mode. Recommended when text is sent in full sentences or phrases. When enabled, the server controls flushing of buffered text to minimize latency while maintaining high-quality audio output. If None (default), it is set automatically based on aggregate_sentences.
timestamp_transport_strategy – The strategy to use for timestamp transport.
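The None default for auto_mode can be read as "decide from the aggregation setting". A minimal sketch of that resolution follows. The helper resolve_auto_mode is hypothetical, and the rule that auto mode follows aggregate_sentences directly is an assumption drawn from the recommendation above, not a statement of the library's exact logic.

```python
from typing import Optional


def resolve_auto_mode(auto_mode: Optional[bool], aggregate_sentences: bool) -> bool:
    """Return the explicit auto_mode if given, else derive it from aggregation."""
    if auto_mode is not None:
        return auto_mode
    # Assumption: auto mode is enabled when full sentences are being sent.
    return aggregate_sentences


explicit_off = resolve_auto_mode(False, aggregate_sentences=True)
derived_on = resolve_auto_mode(None, aggregate_sentences=True)
derived_off = resolve_auto_mode(None, aggregate_sentences=False)
```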

temperature: float | None

speaking_rate: float | None

apply_text_normalization: str | None

max_buffer_delay_ms: int | None

buffer_char_threshold: int | None

auto_mode: bool | None

timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None

__init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.inworld.ai/tts/v1/voice:streamBidirectional', sample_rate: int | None = None, encoding: str = 'LINEAR16', auto_mode: bool | None = None, apply_text_normalization: str | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, append_trailing_space: bool = True, **kwargs: Any)[source]

Initialize the Inworld WebSocket TTS service.

Parameters:

api_key – Inworld API key.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(voice=...) instead.
model –
ID of the model to use for synthesis.

Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(model=...) instead.
url – URL of the Inworld WebSocket API.
sample_rate – Audio sample rate in Hz.
encoding – Audio encoding format.
auto_mode – Whether to use auto mode. When enabled, the server controls flushing of buffered text. If None (default), automatically set based on aggregate_sentences.
apply_text_normalization – Text normalization setting, passed as a string.
timestamp_transport_strategy – Strategy for timestamp transport (“ASYNC” or “SYNC”). Defaults to “ASYNC”.
params –
Input parameters for Inworld WebSocket TTS configuration.

Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
aggregate_sentences –
Whether to aggregate text into sentences before synthesis.

Deprecated since version 0.0.104: Use text_aggregation_mode instead.
text_aggregation_mode – How to aggregate text before synthesis.
append_trailing_space – Whether to append a trailing space to text before sending to TTS.
**kwargs – Additional arguments passed to the parent class.
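A minimal sketch of constructing the WebSocket service with the settings parameter rather than the deprecated voice_id, model, and params arguments. The environment variable name INWORLD_API_KEY and the placeholder voice and model IDs are assumptions for illustration.

```python
import os


def build_websocket_tts():
    # Imported inside the function so the sketch can be loaded
    # without pipecat installed.
    from pipecat.services.inworld.tts import InworldTTSService

    return InworldTTSService(
        api_key=os.environ["INWORLD_API_KEY"],  # assumed env var name
        settings=InworldTTSService.Settings(
            voice="your-voice-id",  # placeholder, not a real voice
            model="your-model-id",  # placeholder, not a real model
        ),
        timestamp_transport_strategy="ASYNC",  # the documented default
    )
```

Leaving auto_mode and text_aggregation_mode unset keeps their documented None defaults.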

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as the Inworld WebSocket TTS service supports metrics generation.

async start(frame: StartFrame)[source]

Start the Inworld WebSocket TTS service.

Parameters:

frame – The start frame.

async stop(frame: EndFrame)[source]

Stop the Inworld WebSocket TTS service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Inworld WebSocket TTS service.

Parameters:

frame – The cancel frame.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio without closing the context.

This triggers synthesis of all accumulated text in the buffer while keeping the context open for subsequent text. The context is only closed on interruption, disconnect, or end of session.
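The difference between flushing and closing can be pictured with a toy model. The ToyContext class below is purely illustrative and is not how the service or the Inworld server is implemented. In particular, the choice to flush remaining text on close is an assumption of this sketch.

```python
class ToyContext:
    """Toy model of a buffered TTS context (not the library's implementation)."""

    def __init__(self) -> None:
        self.buffer: list[str] = []
        self.synthesized: list[str] = []
        self.is_open = True

    def send_text(self, text: str) -> None:
        if not self.is_open:
            raise RuntimeError("context is closed")
        self.buffer.append(text)

    def flush(self) -> None:
        # Synthesize everything buffered so far, but keep the context open.
        if self.buffer:
            self.synthesized.append("".join(self.buffer))
            self.buffer.clear()

    def close(self) -> None:
        # In this toy, closing flushes what is left and rejects further text.
        self.flush()
        self.is_open = False


ctx = ToyContext()
ctx.send_text("Hello, ")
ctx.send_text("world. ")
ctx.flush()  # audio for the first two chunks; context stays open
ctx.send_text("Still open.")
ctx.close()
```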

async on_turn_context_created(context_id: str)[source]

Eagerly open the context on the server when a new turn starts.

This overlaps server-side context creation with sentence aggregation time, so the context is ready by the time text arrives in run_tts.
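The overlap described above can be sketched with plain asyncio: the context is opened in a background task while text is still being aggregated, and text is sent only once the context is ready. The functions open_context and aggregate_sentence are hypothetical stand-ins, and asyncio.sleep(0) stands in for real network and aggregation delays.

```python
import asyncio


async def simulate_turn() -> list[str]:
    events: list[str] = []
    ready = asyncio.Event()

    async def open_context() -> None:
        events.append("open requested")
        await asyncio.sleep(0)  # stand-in for the server round trip
        events.append("context ready")
        ready.set()

    async def aggregate_sentence() -> str:
        events.append("aggregating")
        await asyncio.sleep(0)  # stand-in for waiting on more text
        return "Hello there."

    # Start opening the context without waiting for it to finish.
    open_task = asyncio.create_task(open_context())
    sentence = await aggregate_sentence()
    # Only send text once the context is known to be open.
    await ready.wait()
    await open_task
    events.append(f"send: {sentence}")
    return events


events = asyncio.run(simulate_turn())
```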

async on_turn_context_completed()[source]

Close the server-side context at end of turn.

Sends close_context so contextClosed arrives immediately after the last audio byte.

async on_audio_context_interrupted(context_id: str)[source]

Callback invoked when an audio context has been interrupted.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate TTS audio for the given text using the Inworld WebSocket TTS service.

Parameters:

text – The text to generate TTS audio for.
context_id – Unique identifier for this TTS context.

Returns:

An asynchronous generator of frames.