tts

Soniox text-to-speech service implementation.

This module provides a WebSocket-based TTS service using the Soniox real-time Text-to-Speech API. It streams text to the server incrementally and receives audio back as base64-encoded chunks, multiplexed across multiple concurrent streams by stream_id.

Soniox API reference: https://soniox.com/docs/tts/api-reference/websocket-api

pipecat.services.soniox.tts.language_to_soniox_tts_language(language: Language) → str | None

Convert a Pipecat Language to a Soniox TTS language code.

For the full list of supported languages, see: https://soniox.com/docs/tts/concepts/languages
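One way to picture this conversion is a lookup that reduces a regional tag to its base code and returns None when the code is unsupported. This is a minimal sketch, not Pipecat's implementation. It assumes that Soniox TTS uses base codes such as "en". The Language stand-in, the SUPPORTED_CODES subset, and the language_to_code name are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Hypothetical stand-in for Pipecat's Language enum (a few members only)."""

    EN = "en"
    EN_US = "en-US"
    DE_DE = "de-DE"
    XX = "xx"  # deliberately unsupported


# Hypothetical subset of Soniox TTS codes; see the languages page for the real list.
SUPPORTED_CODES = {"en", "de", "es", "fr"}


def language_to_code(language: Language) -> Optional[str]:
    """Map a Language to a base language code, or None if unsupported."""
    # Keep only the primary subtag, e.g. "en-US" -> "en".
    base = language.value.split("-")[0].lower()
    return base if base in SUPPORTED_CODES else None
```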

class pipecat.services.soniox.tts.SonioxTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>)

Bases: TTSSettings

Settings for SonioxTTSService.

The voice, model, and language settings are sent in the per-stream config message, so changing any of them does not require reconnecting the WebSocket. Instead, the current context is flushed so that the next stream opens with the new values.
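The following minimal sketch models that behavior. Each stream captures a snapshot of the settings when it opens, and an update flushes the open stream instead of reconnecting. StreamSettings and SettingsTracker are hypothetical names for illustration only. They are not part of Pipecat.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StreamSettings:
    """Hypothetical per-stream values mirroring voice, model, and language."""

    voice: str
    model: str
    language: str


class SettingsTracker:
    """Sketch: a settings change flushes the open stream; the socket stays up."""

    def __init__(self, settings: StreamSettings) -> None:
        self.settings = settings
        self.active_stream: Optional[str] = None
        self.flushed: List[str] = []
        self.configs: Dict[str, StreamSettings] = {}

    def open_stream(self, stream_id: str) -> None:
        # Each new stream snapshots the current settings into its config message.
        self.configs[stream_id] = self.settings
        self.active_stream = stream_id

    def update(self, **changes: str) -> None:
        self.settings = replace(self.settings, **changes)
        # Flush the current stream so the next one opens with the new values.
        if self.active_stream is not None:
            self.flushed.append(self.active_stream)
            self.active_stream = None
```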

class pipecat.services.soniox.tts.SonioxTTSService(*, api_key: str, url: str = 'wss://tts-rt.soniox.com/tts-websocket', sample_rate: int | None = None, audio_format: str = 'pcm_s16le', settings: SonioxTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)

Bases: WebsocketTTSService

Soniox WebSocket TTS service with streaming text-in, streaming audio-out.

Streams text incrementally to Soniox’s real-time TTS endpoint and routes the returned base64-encoded audio back as TTSAudioRawFrame frames. Multiple concurrent streams are multiplexed over a single WebSocket connection via Pipecat’s audio-context mechanism (mapped to Soniox’s stream_id). Supports up to 5 concurrent streams per connection.
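The receive side of this multiplexing can be sketched as a router that decodes base64 audio and appends it to a per-stream buffer keyed by stream_id. The router also enforces the five-stream limit. This is a sketch under stated assumptions. The {"stream_id": ..., "audio": ...} message shape and the StreamRouter name are hypothetical, so check the Soniox API reference for the real schema.

```python
import base64
import json
from collections import defaultdict
from typing import Dict, List, Set

MAX_STREAMS = 5  # per-connection limit stated in the docs


class StreamRouter:
    """Sketch: route base64 audio messages to per-stream buffers by stream_id."""

    def __init__(self) -> None:
        self.open_streams: Set[str] = set()
        self.audio: Dict[str, List[bytes]] = defaultdict(list)

    def open(self, stream_id: str) -> None:
        if len(self.open_streams) >= MAX_STREAMS:
            raise RuntimeError("too many concurrent streams on this connection")
        self.open_streams.add(stream_id)

    def close(self, stream_id: str) -> None:
        self.open_streams.discard(stream_id)

    def on_message(self, raw: str) -> None:
        # Hypothetical message shape: {"stream_id": ..., "audio": "<base64>"}.
        msg = json.loads(raw)
        stream_id = msg.get("stream_id")
        # Drop audio for streams that are not (or no longer) open.
        if stream_id in self.open_streams and "audio" in msg:
            self.audio[stream_id].append(base64.b64decode(msg["audio"]))
```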

For complete API documentation, see: https://soniox.com/docs/tts/api-reference/websocket-api

Settings

alias of SonioxTTSSettings

__init__(*, api_key: str, url: str = 'wss://tts-rt.soniox.com/tts-websocket', sample_rate: int | None = None, audio_format: str = 'pcm_s16le', settings: SonioxTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)

Initialize the Soniox TTS service.

Parameters:
  • api_key – Soniox API key for authentication. Create API keys at https://console.soniox.com.

  • url – WebSocket URL for the Soniox TTS endpoint.

  • sample_rate – Output sample rate in Hz. Must be one of {8000, 16000, 24000, 44100, 48000} when using a raw PCM audio format. If None, inherits from the pipeline.

  • audio_format – Output audio format. Defaults to "pcm_s16le", which matches Pipecat’s downstream audio pipeline.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • text_aggregation_mode – How to aggregate incoming text before synthesis. Defaults to TextAggregationMode.SENTENCE.

  • **kwargs – Additional arguments passed to the parent service.
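The sample_rate rule above can be expressed as a small resolution step. The service uses an explicit value if one is given and otherwise falls back to the pipeline's rate. For raw PCM output, the rate must then be one of the listed rates. This is a hypothetical helper for illustration only. The docs do not say where Pipecat performs this check, and treating formats named "pcm_*" as the raw PCM formats is an assumption.

```python
from typing import Optional

# Sample rates listed in the docs for raw PCM output.
PCM_SAMPLE_RATES = {8000, 16000, 24000, 44100, 48000}


def resolve_sample_rate(
    sample_rate: Optional[int], pipeline_rate: int, audio_format: str = "pcm_s16le"
) -> int:
    """Pick the output rate: the explicit value, else the pipeline's rate."""
    rate = sample_rate if sample_rate is not None else pipeline_rate
    # Assumption: formats named "pcm_*" are the raw PCM formats the constraint applies to.
    if audio_format.startswith("pcm_") and rate not in PCM_SAMPLE_RATES:
        raise ValueError(f"unsupported PCM sample rate: {rate}")
    return rate
```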

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True, as Soniox TTS supports metrics generation.

language_to_service_language(language: Language) → str | None

Convert a Language enum to a Soniox TTS language code.

Parameters:

language – The language to convert.

Returns:

The Soniox-specific language code, or None if not supported.

async start(frame: StartFrame)

Start the Soniox TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)

Stop the Soniox TTS service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)

Cancel the Soniox TTS service.

Parameters:

frame – The cancel frame.

async flush_audio(context_id: str | None = None)

Flush any pending audio and finalize the current stream.

Parameters:

context_id – The specific context to flush. If None, falls back to the currently active context.
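The fallback rule can be sketched as choosing a target stream and then emitting an end-of-text message for it. The text_end: true flag appears elsewhere in these docs. The "text" key and the build_flush_message name are assumptions made for illustration.

```python
import json
from typing import Optional


def build_flush_message(
    context_id: Optional[str], active_context_id: Optional[str]
) -> Optional[str]:
    """Return a JSON end-of-text message for the target stream, or None if there is none."""
    target = context_id if context_id is not None else active_context_id
    if target is None:
        return None  # nothing to flush
    # text_end: true finalizes the stream; the "text" key is an assumption.
    return json.dumps({"stream_id": target, "text": "", "text_end": True})
```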

async on_turn_context_created(context_id: str)

Eagerly open the Soniox stream when a new turn context is created.

Overlaps Soniox-side stream creation with sentence aggregation so the stream is ready by the time text reaches run_tts.
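The following asyncio sketch shows the overlap. The open is launched as a background task when the context is created, and run_tts awaits that task only before sending text, so aggregation and stream setup proceed concurrently. EagerOpener and its methods are hypothetical. The sleep calls stand in for the config send and for sentence aggregation.

```python
import asyncio
from typing import Dict, List


class EagerOpener:
    """Sketch: start opening a stream as soon as the turn context exists."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.pending: Dict[str, asyncio.Task] = {}

    async def _open_stream(self, context_id: str) -> None:
        await asyncio.sleep(0)  # stand-in for sending the config message
        self.events.append(f"opened {context_id}")

    async def on_turn_context_created(self, context_id: str) -> None:
        # Do not await here: the open overlaps with sentence aggregation.
        self.pending[context_id] = asyncio.create_task(self._open_stream(context_id))

    async def run_tts(self, text: str, context_id: str) -> None:
        # Wait for the eager open (usually already finished) before sending text.
        task = self.pending.pop(context_id, None)
        if task is not None:
            await task
        self.events.append(f"text {context_id}: {text}")


async def demo() -> List[str]:
    opener = EagerOpener()
    await opener.on_turn_context_created("t1")
    await asyncio.sleep(0.01)  # stand-in for sentence aggregation
    await opener.run_tts("Hello there.", "t1")
    return opener.events
```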

async on_turn_context_completed()

Cancel any eagerly-opened Soniox stream that never received text.

For streams that received text, the base class sends text_end: true (via flush_audio), which already terminates the stream. For an empty turn (e.g., the LLM produced only tool calls), no text reaches run_tts, and the eagerly opened stream would otherwise sit idle until Soniox's per-stream idle timer fires. This method cancels such a stream instead.
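The decision can be sketched as tracking which streams received text and sending a cancel only for a stream that received none. TurnStreams is a hypothetical name, and the {"stream_id": ..., "cancel": True} message shape is an assumption, not the documented Soniox schema.

```python
from typing import List, Optional, Set


class TurnStreams:
    """Sketch: cancel an eagerly opened stream if the turn produced no text."""

    def __init__(self) -> None:
        self.sent: List[dict] = []
        self.received_text: Set[str] = set()
        self.current: Optional[str] = None

    def open(self, stream_id: str) -> None:
        self.current = stream_id

    def send_text(self, stream_id: str, text: str) -> None:
        self.received_text.add(stream_id)
        self.sent.append({"stream_id": stream_id, "text": text})

    def on_turn_completed(self) -> None:
        stream_id, self.current = self.current, None
        # Streams with text are ended by text_end via flush_audio; only empty ones need a cancel.
        if stream_id is not None and stream_id not in self.received_text:
            # Hypothetical cancel message shape.
            self.sent.append({"stream_id": stream_id, "cancel": True})
```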

async on_audio_context_interrupted(context_id: str)

Cancel the active Soniox stream when the bot is interrupted.
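A minimal sketch of interruption handling follows. It sends a cancel for the interrupted stream, discards its buffered audio, and ignores chunks that arrive after the cancel. The docs state only that the stream is cancelled. Dropping buffered and late audio here is an assumption about what an interruption implies, and InterruptibleStreams and the cancel message shape are hypothetical.

```python
from typing import Dict, List, Set


class InterruptibleStreams:
    """Sketch: on interruption, cancel the stream and discard its audio."""

    def __init__(self) -> None:
        self.buffers: Dict[str, List[bytes]] = {}
        self.outbox: List[dict] = []
        self.cancelled: Set[str] = set()

    def on_audio(self, stream_id: str, chunk: bytes) -> None:
        if stream_id in self.cancelled:
            return  # ignore chunks that arrive after the cancel was sent
        self.buffers.setdefault(stream_id, []).append(chunk)

    def on_audio_context_interrupted(self, context_id: str) -> None:
        # Ask the server to stop generating (hypothetical message shape).
        self.outbox.append({"stream_id": context_id, "cancel": True})
        self.cancelled.add(context_id)
        # Drop audio that has not been played yet.
        self.buffers.pop(context_id, None)
```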

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None]

Stream text to Soniox and deliver synthesized audio asynchronously.

The first run_tts call for a given context_id sends the per-stream config message; subsequent calls within the same stream send only text chunks. Audio arrives via the receive loop and is appended to the matching audio context.

Parameters:
  • text – The text to synthesize.

  • context_id – The audio context (maps to Soniox stream_id).

Yields:

None. Audio frames are delivered out of band via the receive task and the audio-context queue.
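The following synchronous sketch illustrates the send-side rule: the first call for a context_id emits the config message, and later calls emit only text. The real method is an async generator that yields None, so this is a simplification. StreamWriter is a hypothetical name, and the JSON field names are assumptions rather than the documented Soniox schema.

```python
import json
from typing import List, Set


class StreamWriter:
    """Sketch: the first chunk of a stream carries the config; later chunks carry only text."""

    def __init__(self, voice: str, model: str, language: str) -> None:
        self.config = {"voice": voice, "model": model, "language": language}
        self.configured: Set[str] = set()
        self.sent: List[str] = []

    def run_tts(self, text: str, context_id: str) -> None:
        if context_id not in self.configured:
            # Per-stream config message (field names are assumptions).
            self.sent.append(json.dumps({"stream_id": context_id, **self.config}))
            self.configured.add(context_id)
        self.sent.append(json.dumps({"stream_id": context_id, "text": text}))
```

Keying the "already configured" check on context_id lets each concurrent stream carry its own settings snapshot over the shared connection.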