stt

Gladia Speech-to-Text (STT) service implementation.

This module provides a Speech-to-Text service using Gladia’s real-time WebSocket API, supporting multiple languages, custom vocabulary, and various audio processing options.

pipecat.services.gladia.stt.language_to_gladia_language(language: Language) -> str | None

Convert a Language enum to Gladia’s language code format.

Parameters:

language – The Language enum value to convert.

Returns:

The Gladia language code string or None if not supported.
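As a rough illustration of the contract described above (a code string for supported languages, None otherwise), the following is a minimal sketch. It does not reproduce pipecat's implementation: the stand-in Language enum, the SUPPORTED_CODES set, and the rule of reducing a regional variant to its base code are all assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (hypothetical subset)."""

    EN = "en"
    EN_US = "en-US"
    FR = "fr"
    XX = "xx"  # hypothetical language the service does not support


# Illustrative subset only; the real list of supported codes is defined by Gladia.
SUPPORTED_CODES = {"en", "fr"}


def to_service_code(language: Language) -> Optional[str]:
    """Return a base language code, or None if the language is not supported.

    Assumption: regional variants such as "en-US" map to their base code.
    """
    base = language.value.split("-")[0].lower()
    return base if base in SUPPORTED_CODES else None
```

Callers should handle the None case explicitly, for example by falling back to a default language.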

class pipecat.services.gladia.stt.GladiaSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, language_config: LanguageConfig | None | _NotGiven = <factory>, custom_metadata: dict[str, Any] | None | _NotGiven = <factory>, endpointing: float | None | _NotGiven = <factory>, maximum_duration_without_endpointing: int | None | _NotGiven = <factory>, pre_processing: PreProcessingConfig | None | _NotGiven = <factory>, realtime_processing: RealtimeProcessingConfig | None | _NotGiven = <factory>, messages_config: MessagesConfig | None | _NotGiven = <factory>, enable_vad: bool | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for GladiaSTTService.

Parameters:
  • language_config – Language detection and handling configuration.

  • custom_metadata – Additional metadata to include with requests.

  • endpointing – Silence duration in seconds to mark end of speech.

  • maximum_duration_without_endpointing – Maximum utterance duration without silence.

  • pre_processing – Audio pre-processing options.

  • realtime_processing – Real-time processing features.

  • messages_config – WebSocket message filtering options.

  • enable_vad – Enable VAD to trigger end of utterance detection.

language_config: LanguageConfig | None | _NotGiven
custom_metadata: dict[str, Any] | None | _NotGiven
endpointing: float | None | _NotGiven
maximum_duration_without_endpointing: int | None | _NotGiven
pre_processing: PreProcessingConfig | None | _NotGiven
realtime_processing: RealtimeProcessingConfig | None | _NotGiven
messages_config: MessagesConfig | None | _NotGiven
enable_vad: bool | None | _NotGiven
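
Each field above accepts a value, None, or _NotGiven. The sketch below shows one common reason for such a three-way type: a sentinel lets an update distinguish "leave this field alone" from "explicitly set it to None". The names NOT_GIVEN, SketchSettings, and apply_update are hypothetical, and how pipecat actually applies settings updates is not described in this reference.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Union


class _NotGiven:
    """Sentinel type meaning 'no value was supplied' (stand-in for pipecat's)."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSettings:
    """Hypothetical two-field subset of GladiaSTTSettings."""

    endpointing: Union[float, None, _NotGiven] = field(default_factory=lambda: NOT_GIVEN)
    enable_vad: Union[bool, None, _NotGiven] = field(default_factory=lambda: NOT_GIVEN)


def apply_update(current: Dict[str, Any], update: SketchSettings) -> Dict[str, Any]:
    """Copy `current`, overwriting only the fields that `update` actually supplies."""
    merged = dict(current)
    for f in fields(update):
        value = getattr(update, f.name)
        if not isinstance(value, _NotGiven):
            merged[f.name] = value  # an explicit None is a real value and is kept
    return merged
```

With this pattern, SketchSettings(enable_vad=None) clears enable_vad while leaving endpointing untouched.
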
class pipecat.services.gladia.stt.GladiaSTTService(*, api_key: str, region: Literal['us-west', 'eu-west'] | None = None, url: str = 'https://api.gladia.io/v2/live', encoding: str = 'wav/pcm', bit_depth: int = 16, channels: int = 1, sample_rate: int | None = None, model: str | None = None, params: GladiaInputParams | None = None, max_buffer_size: int = 20971520, should_interrupt: bool = True, settings: GladiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.49, **kwargs)

Bases: WebsocketSTTService

Speech-to-Text service using Gladia’s API.

This service connects to Gladia’s WebSocket API for real-time transcription, with support for multiple languages, custom vocabulary, and various processing options. It also provides automatic reconnection, audio buffering, and error handling.

For complete API documentation, see: https://docs.gladia.io/api-reference/v2/live/init
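
To make the audio buffering mentioned above more concrete, here is a minimal sketch of a size-capped byte buffer together with a helper that converts a buffer size into seconds of PCM audio. The class name BoundedAudioBuffer, the drop-oldest policy, and the 16000 Hz sample rate used in the usage note are assumptions for illustration; the service's actual buffering strategy may differ.

```python
class BoundedAudioBuffer:
    """Byte buffer holding at most max_size bytes; the oldest bytes are dropped first."""

    def __init__(self, max_size: int = 20 * 1024 * 1024) -> None:
        self.max_size = max_size
        self._data = bytearray()

    def append(self, chunk: bytes) -> None:
        """Add a chunk, then trim from the front if the cap is exceeded."""
        self._data.extend(chunk)
        overflow = len(self._data) - self.max_size
        if overflow > 0:
            del self._data[:overflow]

    def snapshot(self) -> bytes:
        """Return an immutable copy of the buffered audio."""
        return bytes(self._data)

    def __len__(self) -> int:
        return len(self._data)


def buffer_seconds(max_size: int, sample_rate: int, bit_depth: int = 16, channels: int = 1) -> float:
    """Seconds of uncompressed PCM audio that fit in max_size bytes."""
    bytes_per_second = sample_rate * (bit_depth // 8) * channels
    return max_size / bytes_per_second
```

For example, with the default max_buffer_size of 20971520 bytes, 16-bit mono audio at an assumed 16000 Hz works out to buffer_seconds(20971520, 16000), roughly 655 seconds.
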

Settings

alias of GladiaSTTSettings

__init__(*, api_key: str, region: Literal['us-west', 'eu-west'] | None = None, url: str = 'https://api.gladia.io/v2/live', encoding: str = 'wav/pcm', bit_depth: int = 16, channels: int = 1, sample_rate: int | None = None, model: str | None = None, params: GladiaInputParams | None = None, max_buffer_size: int = 20971520, should_interrupt: bool = True, settings: GladiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.49, **kwargs)

Initialize the Gladia STT service.

Parameters:
  • api_key – Gladia API key for authentication.

  • region – Region in which audio is processed, either "eu-west" or "us-west". Defaults to eu-west.

  • url – Gladia API URL. Defaults to "https://api.gladia.io/v2/live".

  • encoding – Audio encoding format. Defaults to "wav/pcm".

  • bit_depth – Audio bit depth. Defaults to 16.

  • channels – Number of audio channels. Defaults to 1.

  • sample_rate – Audio sample rate in Hz. If None, uses service default.

  • model

    Model to use for transcription.

    Deprecated since version 0.0.105: Use settings=GladiaSTTService.Settings(model=...) instead.

  • params

    Additional configuration parameters for Gladia service.

    Deprecated since version 0.0.105: Use settings=GladiaSTTService.Settings(...) for runtime-updatable fields and direct init parameters for encoding/bit_depth/channels.

  • max_buffer_size – Maximum size of the audio buffer in bytes. Defaults to 20971520 (20 MiB).

  • should_interrupt – Whether the bot should be interrupted when Gladia VAD detects user speech. Defaults to True.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency, in seconds, from the end of speech to the final transcript. Defaults to 1.49; override it for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to the STTService parent class.
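
The parameters above can be combined as in the following sketch, which uses the non-deprecated settings= path. The environment variable name GLADIA_API_KEY and all numeric values are illustrative assumptions, not recommended defaults. pipecat is imported lazily inside make_gladia_stt so that the helper module can be imported even where pipecat is not installed.

```python
import os
from typing import Any, Dict


def gladia_settings_kwargs() -> Dict[str, Any]:
    """Keyword arguments for GladiaSTTService.Settings (values are examples only)."""
    return {
        "endpointing": 0.3,  # silence, in seconds, that marks end of speech
        "maximum_duration_without_endpointing": 10,  # example value
        "enable_vad": True,
    }


def make_gladia_stt():
    """Construct the service using settings= rather than the deprecated params=."""
    # Imported here so this module loads without pipecat installed.
    from pipecat.services.gladia.stt import GladiaSTTService

    return GladiaSTTService(
        api_key=os.environ["GLADIA_API_KEY"],  # assumed environment variable name
        region="eu-west",
        settings=GladiaSTTService.Settings(**gladia_settings_kwargs()),
    )
```

Audio format options such as encoding, bit_depth, and channels are passed directly to the constructor rather than through Settings.
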

can_generate_metrics() -> bool

Check if the service can generate performance metrics.

Returns:

True, indicating this service supports metrics generation.

language_to_service_language(language: Language) -> str | None

Convert pipecat Language enum to Gladia’s language code.

Parameters:

language – The Language enum value to convert.

Returns:

The Gladia language code string or None if not supported.

async start(frame: StartFrame)

Start the Gladia STT websocket connection.

Parameters:

frame – The start frame triggering service startup.

async stop(frame: EndFrame)

Stop the Gladia STT websocket connection.

Parameters:

frame – The end frame triggering service shutdown.

async cancel(frame: CancelFrame)

Cancel the Gladia STT websocket connection.

Parameters:

frame – The cancel frame triggering service cancellation.

async run_stt(audio: bytes) -> AsyncGenerator[Frame | None, None]

Run speech-to-text on audio data.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

None (processing is handled asynchronously via WebSocket).
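
The "yields None" behavior can look odd at first, so here is a minimal, self-contained sketch of the pattern: the generator only hands audio off for sending and yields None, while results would arrive separately on the WebSocket. SketchSTT, its outbound queue, and feed are hypothetical stand-ins; the real service sends over its WebSocket connection rather than to a queue.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class SketchSTT:
    """Stand-in service: run_stt queues audio and yields None instead of a transcript."""

    def __init__(self) -> None:
        # Chunks waiting to be sent; a real service would write to its WebSocket.
        self.outbound = asyncio.Queue()

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        await self.outbound.put(audio)
        # No transcript is produced here; results arrive on a separate receive path.
        yield None


async def feed(stt: SketchSTT, chunks: List[bytes]) -> List[Optional[str]]:
    """Push each chunk through run_stt and collect whatever it yields."""
    yielded: List[Optional[str]] = []
    for chunk in chunks:
        async for frame in stt.run_stt(chunk):
            yielded.append(frame)
    return yielded
```

Consumers of run_stt should therefore not expect transcripts from the generator itself.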