stt

Soniox speech-to-text service implementation.

class pipecat.services.soniox.stt.SonioxContextGeneralItem(*, key: str, value: str)[source]

Bases: BaseModel

Represents a key-value pair for structured general context information.

key: str
value: str

class pipecat.services.soniox.stt.SonioxContextTranslationTerm(*, source: str, target: str)[source]

Bases: BaseModel

Represents a custom translation mapping for ambiguous or domain-specific terms.

source: str
target: str

class pipecat.services.soniox.stt.SonioxContextObject(*, general: list[SonioxContextGeneralItem] | None = None, text: str | None = None, terms: list[str] | None = None, translation_terms: list[SonioxContextTranslationTerm] | None = None)[source]

Bases: BaseModel

Context object for models with context_version 2 (Soniox stt-rt-v3-preview and later).

Learn more about context in the documentation: https://soniox.com/docs/stt/concepts/context
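As a rough illustration of how the four fields fit together, the sketch below mirrors the documented shape with plain standard-library dataclasses and serializes only the fields that are set. The dataclasses are stand-ins so the example runs without pipecat installed (the real classes are pydantic models), and the helper name `context_to_dict` and all sample values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GeneralItem:  # stand-in for SonioxContextGeneralItem
    key: str
    value: str


@dataclass
class TranslationTerm:  # stand-in for SonioxContextTranslationTerm
    source: str
    target: str


@dataclass
class ContextObject:  # stand-in for SonioxContextObject
    general: Optional[list] = None
    text: Optional[str] = None
    terms: Optional[list] = None
    translation_terms: Optional[list] = None


def context_to_dict(context: ContextObject) -> dict:
    """Hypothetical helper: drop unset (None) fields before serializing."""
    return {k: v for k, v in asdict(context).items() if v is not None}


# Sample values are illustrative only.
context = ContextObject(
    general=[GeneralItem(key="domain", value="Healthcare")],
    terms=["Celebrex", "Zyrtec"],
    translation_terms=[TranslationTerm(source="MRI", target="RM")],
)
payload = context_to_dict(context)
```

Here `text` was never set, so it is absent from `payload`, while the nested items are flattened to plain dictionaries.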

general: list[SonioxContextGeneralItem] | None
text: str | None
terms: list[str] | None
translation_terms: list[SonioxContextTranslationTerm] | None

class pipecat.services.soniox.stt.SonioxInputParams(*, model: str = 'stt-rt-v4', audio_format: str | None = 'pcm_s16le', num_channels: int | None = 1, language_hints: list[Language] | None = None, language_hints_strict: bool | None = None, context: SonioxContextObject | str | None = None, enable_speaker_diarization: bool | None = False, enable_language_identification: bool | None = False, client_reference_id: str | None = None)[source]

Bases: BaseModel

Real-time transcription settings.

Deprecated since version 0.0.105: Use settings=SonioxSTTService.Settings(...) instead.

See Soniox WebSocket API documentation for more details: https://soniox.com/docs/speech-to-text/api-reference/websocket-api#configuration-parameters

Parameters:
  • model – Model to use for transcription.

  • audio_format – Audio format to use for transcription.

  • num_channels – Number of channels to use for transcription.

  • language_hints – List of language hints to use for transcription.

  • language_hints_strict – If true, strictly enforce language hints (only transcribe in provided languages).

  • context – Customization for transcription. String for models with context_version 1 and SonioxContextObject for models with context_version 2.

  • enable_speaker_diarization – Whether to enable speaker diarization. Tokens are annotated with speaker IDs.

  • enable_language_identification – Whether to enable language identification. Tokens are annotated with language IDs.

  • client_reference_id – Client reference ID to use for transcription.
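To make the parameter list concrete, here is a minimal sketch of how these options could be assembled into a JSON configuration message for the WebSocket connection. It assumes the wire field names match the parameter names above and that unset options are simply omitted. `build_config`, the `api_key` and `sample_rate` fields, and the sample values are hypothetical, so treat the Soniox WebSocket API reference as the authoritative schema.

```python
import json
from typing import Optional


def build_config(
    api_key: str,
    model: str = "stt-rt-v4",
    audio_format: Optional[str] = "pcm_s16le",
    num_channels: Optional[int] = 1,
    sample_rate: Optional[int] = 16000,  # assumed field, not listed above
    language_hints: Optional[list] = None,
    language_hints_strict: Optional[bool] = None,
    enable_speaker_diarization: Optional[bool] = False,
    enable_language_identification: Optional[bool] = False,
    client_reference_id: Optional[str] = None,
) -> str:
    """Hypothetical helper: serialize the options, skipping unset ones."""
    config = {
        "api_key": api_key,
        "model": model,
        "audio_format": audio_format,
        "num_channels": num_channels,
        "sample_rate": sample_rate,
        "language_hints": language_hints,
        "language_hints_strict": language_hints_strict,
        "enable_speaker_diarization": enable_speaker_diarization,
        "enable_language_identification": enable_language_identification,
        "client_reference_id": client_reference_id,
    }
    # None means "not configured", so leave the key out entirely.
    return json.dumps({k: v for k, v in config.items() if v is not None})


message = build_config("YOUR_API_KEY", language_hints=["en", "es"])
decoded = json.loads(message)
```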

model: str
audio_format: str | None
num_channels: int | None
language_hints: list[Language] | None
language_hints_strict: bool | None
context: SonioxContextObject | str | None
enable_speaker_diarization: bool | None
enable_language_identification: bool | None
client_reference_id: str | None

pipecat.services.soniox.stt.is_end_token(token: dict) → bool[source]

Determine if a token is an end token.
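A minimal sketch of such a check, assuming end markers arrive as tokens whose `text` is `<end>` (endpoint detected) or `<fin>` (finalization complete). Those marker strings are an assumption, not something this page states, and the function is deliberately named differently to make clear it is not the library's code.

```python
# Assumed marker texts; verify against the Soniox documentation.
END_MARKERS = {"<end>", "<fin>"}


def looks_like_end_token(token: dict) -> bool:
    """Return True when the token's text is one of the assumed end markers."""
    return token.get("text") in END_MARKERS


# Illustrative token stream: two word tokens followed by an end marker.
tokens = [
    {"text": "Hello", "is_final": True},
    {"text": " world", "is_final": True},
    {"text": "<end>", "is_final": True},
]

# End markers carry no speech, so they are filtered out of the transcript.
transcript = "".join(t["text"] for t in tokens if not looks_like_end_token(t))
```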

pipecat.services.soniox.stt.language_to_soniox_language(language: Language) → str[source]

Convert a Pipecat Language to a Soniox language code.

For a list of all supported languages, see: https://soniox.com/docs/speech-to-text/core-concepts/supported-languages
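One plausible shape for this conversion, assuming Pipecat language values are BCP 47 style strings such as `en-US` and that Soniox expects the bare language subtag (`en`). Plain strings stand in for the `Language` enum so the sketch runs on its own, and `to_base_language_code` is a hypothetical name.

```python
def to_base_language_code(language_value: str) -> str:
    """Lowercase the tag and keep only the part before the first hyphen."""
    return language_value.lower().split("-")[0]


codes = [to_base_language_code(v) for v in ["en-US", "pt-BR", "de"]]
```

A value without a region subtag, such as `de`, passes through unchanged.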

class pipecat.services.soniox.stt.SonioxSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, language_hints: list[Language] | None | _NotGiven = <factory>, language_hints_strict: bool | None | _NotGiven = <factory>, context: SonioxContextObject | str | None | _NotGiven = <factory>, enable_speaker_diarization: bool | None | _NotGiven = <factory>, enable_language_identification: bool | None | _NotGiven = <factory>, client_reference_id: str | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for SonioxSTTService.

Parameters:
  • language_hints – List of language hints to use for transcription.

  • language_hints_strict – If true, strictly enforce language hints.

  • context – Customization for transcription. String for models with context_version 1 and SonioxContextObject for models with context_version 2.

  • enable_speaker_diarization – Whether to enable speaker diarization.

  • enable_language_identification – Whether to enable language identification.

  • client_reference_id – Client reference ID to use for transcription.
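Every field accepts `_NotGiven` in addition to `None`, which suggests the usual sentinel pattern for partial updates: a field left unset is skipped, while an explicit `None` clears the value. The sketch below illustrates that pattern with standard-library code only. `NOT_GIVEN`, `PartialSettings`, and `apply_update` are hypothetical names, not pipecat's implementation.

```python
from dataclasses import dataclass, fields
from typing import Any


class _NotGivenType:
    """Sentinel type that marks a field as 'not provided'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


@dataclass
class PartialSettings:  # hypothetical, loosely modelled on the fields above
    model: Any = NOT_GIVEN
    language_hints: Any = NOT_GIVEN
    client_reference_id: Any = NOT_GIVEN


def apply_update(current: dict, update: PartialSettings) -> dict:
    """Copy `current`, overwriting only the fields the update provides."""
    merged = dict(current)
    for field in fields(update):
        value = getattr(update, field.name)
        if value is not NOT_GIVEN:  # an explicit None still counts as given
            merged[field.name] = value
    return merged


current = {
    "model": "stt-rt-v4",
    "language_hints": ["en"],
    "client_reference_id": "call-42",
}
updated = apply_update(
    current,
    PartialSettings(language_hints=["en", "es"], client_reference_id=None),
)
```

In this run `model` is untouched because it was never given, `language_hints` is replaced, and `client_reference_id` is cleared by the explicit `None`.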

language_hints: list[Language] | None | _NotGiven
language_hints_strict: bool | None | _NotGiven
context: SonioxContextObject | str | None | _NotGiven
enable_speaker_diarization: bool | None | _NotGiven
enable_language_identification: bool | None | _NotGiven
client_reference_id: str | None | _NotGiven

class pipecat.services.soniox.stt.SonioxSTTService(*, api_key: str, url: str = 'wss://stt-rt.soniox.com/transcribe-websocket', sample_rate: int | None = None, model: str | None = None, audio_format: str = 'pcm_s16le', num_channels: int = 1, params: SonioxInputParams | None = None, vad_force_turn_endpoint: bool = True, settings: SonioxSTTSettings | None = None, ttfs_p99_latency: float | None = 0.35, **kwargs)[source]

Bases: WebsocketSTTService

Speech-to-Text service using Soniox’s WebSocket API.

This service connects to Soniox’s WebSocket API for real-time transcription with support for multiple languages, custom context, speaker diarization, and more.

For complete API documentation, see: https://soniox.com/docs/speech-to-text/api-reference/websocket-api

Settings

alias of SonioxSTTSettings

__init__(*, api_key: str, url: str = 'wss://stt-rt.soniox.com/transcribe-websocket', sample_rate: int | None = None, model: str | None = None, audio_format: str = 'pcm_s16le', num_channels: int = 1, params: SonioxInputParams | None = None, vad_force_turn_endpoint: bool = True, settings: SonioxSTTSettings | None = None, ttfs_p99_latency: float | None = 0.35, **kwargs)[source]

Initialize the Soniox STT service.

Parameters:
  • api_key – Soniox API key.

  • url – Soniox WebSocket API URL.

  • sample_rate – Audio sample rate.

  • model

    Soniox model to use for transcription.

    Deprecated since version 0.0.105: Use settings=SonioxSTTService.Settings(model=...) instead.

  • audio_format – Audio format for transcription. Defaults to "pcm_s16le".

  • num_channels – Number of audio channels. Defaults to 1.

  • params

    Additional configuration parameters, such as language hints, context and speaker diarization.

    Deprecated since version 0.0.105: Use settings=SonioxSTTService.Settings(...) instead.

  • vad_force_turn_endpoint – If True, send a finalize message to Soniox when a VADUserStoppedSpeakingFrame is received. If False, Soniox detects the end of speech itself. Defaults to True.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to the STTService.
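The interaction between `vad_force_turn_endpoint` and server-side end-of-speech detection can be sketched as follows. This assumes that finalization is requested with a `{"type": "finalize"}` control message and that the connection config carries an `enable_endpoint_detection` flag set to the opposite of `vad_force_turn_endpoint`. Both are assumptions about the wire protocol, and `TurnEndpointPolicy` is a hypothetical helper rather than part of the service.

```python
import json
from typing import Optional


class TurnEndpointPolicy:
    """Hypothetical helper deciding who marks the end of a user turn."""

    def __init__(self, vad_force_turn_endpoint: bool = True):
        self.vad_force_turn_endpoint = vad_force_turn_endpoint

    def config_fragment(self) -> dict:
        # Assumed field name: let the server detect endpoints only when
        # the local VAD is not forcing them.
        return {"enable_endpoint_detection": not self.vad_force_turn_endpoint}

    def on_vad_user_stopped_speaking(self) -> Optional[str]:
        # Assumed control message: ask the server to finalize pending tokens.
        if self.vad_force_turn_endpoint:
            return json.dumps({"type": "finalize"})
        return None


vad_driven = TurnEndpointPolicy(vad_force_turn_endpoint=True)
server_driven = TurnEndpointPolicy(vad_force_turn_endpoint=False)
```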

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as Soniox STT supports metrics generation.

async start(frame: StartFrame)[source]

Start the Soniox STT websocket connection.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Soniox STT websocket connection.

Stopping waits for the server to close the connection, because additional final tokens may arrive after the stop-recording message is sent.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Soniox STT websocket connection.

Unlike stop, this method closes the connection immediately, without waiting for the server to close it or to send any remaining final tokens.

Parameters:

frame – The cancel frame.
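The difference between stop and cancel can be illustrated with a small simulation. `FakeConnection` is a hypothetical in-memory stand-in for the websocket, and the empty message used to signal end of audio is an assumption. The point is only the ordering: a graceful stop drains remaining messages until the server closes, whereas a cancel closes right away and discards them.

```python
import asyncio
from collections import deque


class FakeConnection:
    """Hypothetical in-memory stand-in for a websocket connection."""

    def __init__(self, pending_finals):
        self._pending = deque(pending_finals)
        self.sent = []
        self.closed = False

    async def send(self, message):
        self.sent.append(message)

    async def recv(self):
        await asyncio.sleep(0)  # yield control, as a real socket read would
        # None models the server closing the connection.
        return self._pending.popleft() if self._pending else None

    async def close(self):
        self.closed = True


async def graceful_stop(conn):
    await conn.send(b"")  # assumed end-of-audio marker
    finals = []
    while (message := await conn.recv()) is not None:
        finals.append(message)  # late final tokens are still collected
    await conn.close()
    return finals


async def immediate_cancel(conn):
    await conn.close()  # no draining: pending finals are discarded
    return []


stopped = asyncio.run(graceful_stop(FakeConnection(["final: hello", "final: world"])))
cancel_conn = FakeConnection(["final: hello"])
cancelled = asyncio.run(immediate_cancel(cancel_conn))
```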

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Send audio data to Soniox STT Service.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

Frame – None (transcription results come via WebSocket callbacks).
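The `None` yield reflects a split between sending and receiving: `run_stt` only pushes audio, and a separate receive path produces the transcription frames. A minimal sketch of that shape, with a hypothetical `send` callable in place of the real websocket:

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable


async def run_stt_sketch(
    audio: bytes, send: Callable[[bytes], Awaitable[None]]
) -> AsyncGenerator[None, None]:
    """Forward the audio chunk, then yield None because no result is ready."""
    await send(audio)
    yield None


async def _demo():
    sent = []

    async def send(chunk: bytes) -> None:
        sent.append(chunk)  # a real service would write to the websocket

    yielded = [item async for item in run_stt_sketch(b"\x00\x01", send)]
    return sent, yielded


sent, yielded = asyncio.run(_demo())
```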

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process a frame of audio data, either buffering or transcribing it.

Parameters:
  • frame – The frame to process.

  • direction – The direction of frame processing.