tts
Inworld AI Text-to-Speech Service Implementation.
Contains two TTS services:
- InworldTTSService: WebSocket-based TTS service.
- InworldHttpTTSService: HTTP-based TTS service.
Inworld’s text-to-speech (TTS) models offer ultra-realistic, context-aware speech synthesis and precise voice cloning capabilities, enabling developers to build natural and engaging experiences with human-like speech quality at an accessible price point.
- class pipecat.services.inworld.tts.InworldTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, speaking_rate: float | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)[source]
Bases: TTSSettings
Settings for InworldTTSService and InworldHttpTTSService.
- Parameters:
speaking_rate – Speaking rate for speech synthesis.
temperature – Temperature for speech synthesis.
- speaking_rate: float | None | _NotGiven
- temperature: float | None | _NotGiven
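The `float | None | _NotGiven` annotations suggest three states per field: not provided, explicitly `None`, or a concrete value. The sketch below illustrates that general sentinel pattern only. `NOT_GIVEN`, `SketchSettings`, and `given_fields` are hypothetical stand-ins and not part of pipecat; the library's own handling may differ.

```python
from dataclasses import dataclass, fields
from typing import Any


class _NotGivenSketch:
    """Stand-in sentinel type (hypothetical; not pipecat's _NotGiven)."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenSketch()


@dataclass
class SketchSettings:
    """Hypothetical settings object reusing the documented field names."""

    voice: Any = NOT_GIVEN
    speaking_rate: Any = NOT_GIVEN
    temperature: Any = NOT_GIVEN


def given_fields(settings: SketchSettings) -> dict[str, Any]:
    """Return only the fields the caller set; an explicit None counts as set."""
    return {
        f.name: getattr(settings, f.name)
        for f in fields(settings)
        if getattr(settings, f.name) is not NOT_GIVEN
    }


# Only speaking_rate and temperature were provided; voice stays untouched.
update = SketchSettings(speaking_rate=1.2, temperature=None)
changed = given_fields(update)
```

A sentinel of this kind lets a partial update leave unmentioned fields alone while still allowing a field to be reset to `None`.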
- class pipecat.services.inworld.tts.InworldHttpTTSService(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, streaming: bool = True, sample_rate: int | None = None, encoding: str = 'LINEAR16', timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, **kwargs)[source]
Bases: TTSService
Inworld AI HTTP-based TTS service.
Supports both streaming and non-streaming modes via the streaming parameter. Outputs LINEAR16 audio at configurable sample rates with word-level timestamps.
- Settings
alias of InworldTTSSettings
- class InputParams(*, temperature: float | None = None, speaking_rate: float | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC')[source]
Bases: BaseModel
Input parameters for Inworld TTS configuration.
Deprecated since version 0.0.105: Use InworldHttpTTSService.Settings directly via the settings parameter instead.
- Parameters:
temperature – Temperature for speech synthesis.
speaking_rate – Speaking rate for speech synthesis.
timestamp_transport_strategy – The strategy to use for timestamp transport.
- temperature: float | None
- speaking_rate: float | None
- timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None
- __init__(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, streaming: bool = True, sample_rate: int | None = None, encoding: str = 'LINEAR16', timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, **kwargs)[source]
Initialize the Inworld TTS service.
- Parameters:
api_key – Inworld API key.
aiohttp_session – aiohttp ClientSession for HTTP requests.
voice_id – ID of the voice to use for synthesis.
Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(voice=...) instead.
model – ID of the model to use for synthesis.
Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(model=...) instead.
streaming – Whether to use streaming mode.
sample_rate – Audio sample rate in Hz.
encoding – Audio encoding format.
timestamp_transport_strategy – Strategy for timestamp transport (“ASYNC” or “SYNC”). Defaults to “ASYNC”.
params – Input parameters for Inworld TTS configuration.
Deprecated since version 0.0.105: Use settings=InworldHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to the parent class.
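The precedence rule for deprecated arguments versus settings can be pictured with a small helper. This is a minimal sketch of the stated rule only. `effective_value` and the voice names are hypothetical, and the service's real merge logic is not shown in this reference.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def effective_value(deprecated_arg: Optional[T], settings_value: Optional[T]) -> Optional[T]:
    """Pick the value to use: the settings value wins whenever it is provided.

    Hypothetical helper; it treats None as "not provided" for simplicity.
    """
    return settings_value if settings_value is not None else deprecated_arg


# Both given: the settings value takes precedence over the deprecated argument.
both = effective_value("voice-from-voice_id", "voice-from-settings")

# Only the deprecated argument given: it is still honored.
legacy_only = effective_value("voice-from-voice_id", None)
```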
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Inworld TTS service supports metrics generation.
- async start(frame: StartFrame)[source]
Start the Inworld TTS service.
- Parameters:
frame – The start frame.
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]
Push a frame and handle state changes.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
- class pipecat.services.inworld.tts.InworldTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.inworld.ai/tts/v1/voice:streamBidirectional', sample_rate: int | None = None, encoding: str = 'LINEAR16', auto_mode: bool | None = None, apply_text_normalization: str | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, append_trailing_space: bool = True, **kwargs: Any)[source]
Bases: WebsocketTTSService
Inworld AI WebSocket-based TTS service.
Uses bidirectional WebSocket for lower latency streaming. Supports multiple independent audio contexts per connection (max 5). Outputs LINEAR16 audio with word-level timestamps.
- Settings
alias of InworldTTSSettings
- class InputParams(*, temperature: float | None = None, speaking_rate: float | None = None, apply_text_normalization: str | None = None, max_buffer_delay_ms: int | None = None, buffer_char_threshold: int | None = None, auto_mode: bool | None = True, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC')[source]
Bases: BaseModel
Input parameters for Inworld WebSocket TTS configuration.
Deprecated since version 0.0.105: Use InworldTTSService.Settings directly via the settings parameter instead.
- Parameters:
temperature – Temperature for speech synthesis.
speaking_rate – Speaking rate for speech synthesis.
apply_text_normalization – Whether to apply text normalization.
max_buffer_delay_ms – Maximum buffer delay in milliseconds.
buffer_char_threshold – Buffer character threshold.
auto_mode – Whether to use auto mode. Recommended when texts are sent in full sentences/phrases. When enabled, the server controls flushing of buffered text to achieve minimal latency while maintaining high quality audio output. If None (default), automatically set based on aggregate_sentences.
timestamp_transport_strategy – The strategy to use for timestamp transport.
- temperature: float | None
- speaking_rate: float | None
- apply_text_normalization: str | None
- max_buffer_delay_ms: int | None
- buffer_char_threshold: int | None
- auto_mode: bool | None
- timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None
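The fallback described for auto_mode can be sketched as follows. The exact mapping from aggregate_sentences is not spelled out here, so this assumes the simplest reading (auto mode follows sentence aggregation, since auto mode is recommended for full sentences). `resolve_auto_mode` is a hypothetical helper, not a pipecat function.

```python
from typing import Optional


def resolve_auto_mode(auto_mode: Optional[bool], aggregate_sentences: bool) -> bool:
    """Return the effective auto mode.

    An explicit True/False is used as-is. When auto_mode is None, this sketch
    ASSUMES the value simply mirrors aggregate_sentences.
    """
    if auto_mode is not None:
        return auto_mode
    return aggregate_sentences


# Unset auto_mode follows sentence aggregation (assumed mapping).
derived_on = resolve_auto_mode(None, aggregate_sentences=True)
derived_off = resolve_auto_mode(None, aggregate_sentences=False)

# An explicit choice always wins.
forced_off = resolve_auto_mode(False, aggregate_sentences=True)
```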
- __init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, url: str = 'wss://api.inworld.ai/tts/v1/voice:streamBidirectional', sample_rate: int | None = None, encoding: str = 'LINEAR16', auto_mode: bool | None = None, apply_text_normalization: str | None = None, timestamp_transport_strategy: Literal['ASYNC', 'SYNC'] | None = 'ASYNC', params: InputParams | None = None, settings: InworldTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, append_trailing_space: bool = True, **kwargs: Any)[source]
Initialize the Inworld WebSocket TTS service.
- Parameters:
api_key – Inworld API key.
voice_id – ID of the voice to use for synthesis.
Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(voice=...) instead.
model – ID of the model to use for synthesis.
Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(model=...) instead.
url – URL of the Inworld WebSocket API.
sample_rate – Audio sample rate in Hz.
encoding – Audio encoding format.
auto_mode – Whether to use auto mode. When enabled, the server controls flushing of buffered text. If None (default), automatically set based on aggregate_sentences.
apply_text_normalization – Whether to apply text normalization.
timestamp_transport_strategy – Strategy for timestamp transport (“ASYNC” or “SYNC”). Defaults to “ASYNC”.
params – Input parameters for Inworld WebSocket TTS configuration.
Deprecated since version 0.0.105: Use settings=InworldTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
aggregate_sentences – Deprecated since version 0.0.104: Use text_aggregation_mode instead.
text_aggregation_mode – How to aggregate text before synthesis.
append_trailing_space – Whether to append a trailing space to text before sending to TTS.
**kwargs – Additional arguments passed to the parent class.
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Inworld WebSocket TTS service supports metrics generation.
- async start(frame: StartFrame)[source]
Start the Inworld WebSocket TTS service.
- Parameters:
frame – The start frame.
- async stop(frame: EndFrame)[source]
Stop the Inworld WebSocket TTS service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the Inworld WebSocket TTS service.
- Parameters:
frame – The cancel frame.
- async flush_audio(context_id: str | None = None)[source]
Flush any pending audio without closing the context.
This triggers synthesis of all accumulated text in the buffer while keeping the context open for subsequent text. The context is only closed on interruption, disconnect, or end of session.
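The difference between flushing and closing can be illustrated with an in-memory model. This is a toy sketch of the described behavior, not the service's implementation: `SketchContext` and its methods are hypothetical, and no network traffic is involved.

```python
class SketchContext:
    """Hypothetical in-memory stand-in for a server-side TTS context."""

    def __init__(self) -> None:
        self.buffer: list[str] = []       # text accumulated but not yet synthesized
        self.synthesized: list[str] = []  # text that has been turned into "audio"
        self.is_open = True

    def send_text(self, text: str) -> None:
        """Buffer text; only allowed while the context is open."""
        if not self.is_open:
            raise RuntimeError("context is closed")
        self.buffer.append(text)

    def flush(self) -> None:
        """Synthesize everything buffered so far; the context stays open."""
        if self.buffer:
            self.synthesized.append("".join(self.buffer))
            self.buffer.clear()

    def close(self) -> None:
        """Mark the context closed (e.g. on interruption or end of session)."""
        self.is_open = False


ctx = SketchContext()
ctx.send_text("Hello, ")
ctx.send_text("world.")
ctx.flush()              # pending text is synthesized
ctx.send_text("Next.")   # the same context still accepts text afterwards
```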
- async on_turn_context_created(context_id: str)[source]
Eagerly open the context on the server when a new turn starts.
This overlaps server-side context creation with sentence aggregation time, so the context is ready by the time text arrives in run_tts.
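The overlap described above is a general asyncio pattern: start the slow setup as a background task, do other waiting in the meantime, then await the task. The sketch below shows that pattern with hypothetical coroutines (`open_context`, `aggregate_sentence`) and short sleeps standing in for the real work; it does not reproduce the service's code.

```python
import asyncio


async def open_context(context_id: str, events: list[str]) -> str:
    """Hypothetical stand-in for the server round trip that creates a context."""
    events.append("open:start")
    await asyncio.sleep(0.05)
    events.append("open:done")
    return context_id


async def aggregate_sentence(events: list[str]) -> str:
    """Hypothetical stand-in for waiting until a full sentence is available."""
    events.append("aggregate:start")
    await asyncio.sleep(0.05)
    events.append("aggregate:done")
    return "Hello there."


async def run_turn() -> tuple[str, str, list[str]]:
    events: list[str] = []
    # Kick off context creation immediately instead of waiting for text.
    opening = asyncio.create_task(open_context("ctx-1", events))
    # Sentence aggregation proceeds while the context is being opened.
    text = await aggregate_sentence(events)
    # By now the context is usually ready (or nearly so).
    context_id = await opening
    return context_id, text, events


context_id, text, events = asyncio.run(run_turn())
```

Because both waits run concurrently, the total time is close to the longer of the two rather than their sum.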
- async on_turn_context_completed()[source]
Close the server-side context at end of turn.
Sends close_context so contextClosed arrives immediately after the last audio byte.
- async on_audio_context_interrupted(context_id: str)[source]
Callback invoked when an audio context has been interrupted.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]
Generate TTS audio for the given text using the Inworld WebSocket TTS service.
- Parameters:
text – The text to generate TTS audio for.
context_id – Unique identifier for this TTS context.
- Returns:
An asynchronous generator of frames.
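Since the return type is `AsyncGenerator[Frame | None, None]`, a consumer iterates with `async for` and should be prepared for `None` items. The sketch below shows that consumption pattern with stand-ins: `SketchAudioFrame`, `sketch_run_tts`, and `collect` are hypothetical, and what a `None` yield means in the real service is not specified here.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Optional


@dataclass
class SketchAudioFrame:
    """Hypothetical stand-in for an audio frame; not a pipecat class."""

    audio: bytes
    context_id: str


async def sketch_run_tts(
    text: str, context_id: str
) -> AsyncGenerator[Optional[SketchAudioFrame], None]:
    """Yield one fake frame per word, preceded by a None placeholder."""
    yield None  # the signature permits None, so consumers must tolerate it
    for word in text.split():
        yield SketchAudioFrame(audio=word.encode("utf-8"), context_id=context_id)


async def collect(text: str, context_id: str) -> list[SketchAudioFrame]:
    """Drain the generator, skipping None items."""
    frames: list[SketchAudioFrame] = []
    async for frame in sketch_run_tts(text, context_id):
        if frame is not None:
            frames.append(frame)
    return frames


frames = asyncio.run(collect("hello world", "ctx-1"))
```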