stt

Gradium’s speech-to-text service implementation.

This module provides integration with Gradium’s real-time speech-to-text WebSocket API for streaming audio transcription.

pipecat.services.gradium.stt.language_to_gradium_language(language: Language) → str | None

Convert a Language enum to Gradium’s language code format.

Parameters:

language – The Language enum value to convert.

Returns:

The Gradium language code string or None if not supported.
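The "code string or None" contract can be illustrated with a small standalone sketch. Everything in it is an assumption made for illustration: the Language enum is a stand-in with a handful of members, the set of supported codes is hypothetical (this reference does not list them), and stripping the region suffix is one plausible normalization, not necessarily what the library does.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (only a few members shown)."""

    EN = "en"
    EN_US = "en-US"
    ES = "es"
    FR = "fr"
    JA = "ja"


# Hypothetical set of supported codes; the real list is not given in this page.
_SUPPORTED_CODES = {"en", "es", "fr"}


def to_service_code(language: Language) -> Optional[str]:
    """Return a bare language code, or None when the language is unsupported."""
    # Assumption: region suffixes such as "-US" are dropped before the lookup.
    base = language.value.split("-")[0].lower()
    return base if base in _SUPPORTED_CODES else None


if __name__ == "__main__":
    print(to_service_code(Language.EN_US))
    print(to_service_code(Language.JA))
```

Callers should handle the None case explicitly, for example by falling back to a default language or omitting the language hint.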

class pipecat.services.gradium.stt.GradiumSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, delay_in_frames: int | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for GradiumSTTService.

Parameters:

delay_in_frames – Delay in audio frames (80ms each) before text is generated. Higher delays allow more context but increase latency. Allowed values: 7, 8, 10, 12, 14, 16, 20, 24, 36, 48. Default is 10 (800ms). Lower values like 7-8 give faster response.
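The relationship between delay_in_frames and added latency is simple arithmetic: each frame is 80 ms, so the default of 10 frames corresponds to 800 ms. The helper below is a minimal sketch of that calculation; the function name and the client-side validation are assumptions for illustration, since this reference does not say whether the library itself rejects other values.

```python
from typing import Optional

FRAME_MS = 80  # duration of one audio frame, per the description above
ALLOWED_DELAYS = (7, 8, 10, 12, 14, 16, 20, 24, 36, 48)
DEFAULT_DELAY = 10


def delay_to_ms(delay_in_frames: Optional[int] = None) -> int:
    """Convert a delay in frames to milliseconds, using the default when None."""
    frames = DEFAULT_DELAY if delay_in_frames is None else delay_in_frames
    if frames not in ALLOWED_DELAYS:
        # Hypothetical check: mirrors the documented list of allowed values.
        raise ValueError(
            f"delay_in_frames must be one of {ALLOWED_DELAYS}, got {frames}"
        )
    return frames * FRAME_MS


if __name__ == "__main__":
    for frames in (7, 10, 48):
        print(frames, "frames ->", delay_to_ms(frames), "ms")
```

With these numbers, the lowest setting (7) adds 560 ms and the highest (48) adds 3840 ms.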

delay_in_frames: int | None | _NotGiven

class pipecat.services.gradium.stt.GradiumSTTService(*, api_key: str, api_endpoint_base_url: str = 'wss://eu.api.gradium.ai/api/speech/asr', encoding: str = 'pcm', sample_rate: int | None = None, params: InputParams | None = None, json_config: str | None = None, settings: GradiumSTTSettings | None = None, ttfs_p99_latency: float | None = 1.61, **kwargs)

Bases: WebsocketSTTService

Gradium real-time speech-to-text service.

Provides real-time speech transcription using Gradium’s WebSocket API. Supports both interim and final transcriptions with configurable parameters for audio processing and connection management.

Settings

alias of GradiumSTTSettings

class InputParams(*, language: Language | None = None, delay_in_frames: int | None = None)

Bases: BaseModel

Configuration parameters for Gradium STT API.

Deprecated since version 0.0.105: Use settings=GradiumSTTService.Settings(...) instead.

Parameters:
  • language – Expected language of the audio (e.g., “en”, “es”, “fr”). This helps ground the model to a specific language and improve transcription quality.

  • delay_in_frames – Delay in audio frames (80ms each) before text is generated. Higher delays allow more context but increase latency. Allowed values: 7, 8, 10, 12, 14, 16, 20, 24, 36, 48. Default is 10 (800ms). Lower values like 7-8 give faster response.

language: Language | None
delay_in_frames: int | None

__init__(*, api_key: str, api_endpoint_base_url: str = 'wss://eu.api.gradium.ai/api/speech/asr', encoding: str = 'pcm', sample_rate: int | None = None, params: InputParams | None = None, json_config: str | None = None, settings: GradiumSTTSettings | None = None, ttfs_p99_latency: float | None = 1.61, **kwargs)

Initialize the Gradium STT service.

Parameters:
  • api_key – Gradium API key for authentication.

  • api_endpoint_base_url – WebSocket endpoint URL. Defaults to wss://eu.api.gradium.ai/api/speech/asr, Gradium’s streaming endpoint.

  • encoding – Base audio encoding type. One of “pcm”, “wav”, or “opus”. For PCM, the sample rate is appended automatically from the pipeline’s audio_in_sample_rate (e.g., “pcm” becomes “pcm_16000”). Defaults to “pcm”.

  • sample_rate – Audio sample rate in Hz. If None, uses the pipeline sample rate.

  • params

    Configuration parameters for language and delay settings.

    Deprecated since version 0.0.105: Use settings=GradiumSTTService.Settings(...) instead.

  • json_config

    Optional JSON configuration string for additional model settings.

    Deprecated since version 0.0.101: Use params instead for type-safe configuration. Note that params is itself deprecated since version 0.0.105, so new code should use settings.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to the parent WebsocketSTTService class.
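The encoding rule described above (for PCM, the sample rate is appended; "wav" and "opus" are used as given) can be sketched as a small standalone function. The function name is hypothetical and this is not the library's internal code; only the "pcm" to "pcm_16000" behavior comes from the parameter description.

```python
VALID_ENCODINGS = ("pcm", "wav", "opus")


def resolve_encoding(encoding: str, sample_rate: int) -> str:
    """Build the encoding string sent to the service from a base encoding."""
    if encoding not in VALID_ENCODINGS:
        raise ValueError(
            f"encoding must be one of {VALID_ENCODINGS}, got {encoding!r}"
        )
    if encoding == "pcm":
        # PCM carries no header, so the sample rate is appended to the name.
        return f"pcm_{sample_rate}"
    # Assumption: non-PCM encodings are passed through unchanged.
    return encoding


if __name__ == "__main__":
    print(resolve_encoding("pcm", 16000))
    print(resolve_encoding("opus", 48000))
```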

can_generate_metrics() → bool

Check if the service can generate metrics.

Returns:

True if metrics generation is supported.

async start(frame: StartFrame)

Start the speech-to-text service.

Parameters:

frame – Start frame to begin processing.

async stop(frame: EndFrame)

Stop the speech-to-text service.

Parameters:

frame – End frame to stop processing.

async cancel(frame: CancelFrame)

Cancel the speech-to-text service.

Parameters:

frame – Cancel frame to abort processing.

async process_frame(frame: Frame, direction: FrameDirection)

Process incoming frames and handle speech events.

Parameters:
  • frame – The frame to process.

  • direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text conversion.

Parameters:

audio – Raw audio bytes to process.

Yields:

None (processing handled via WebSocket messages).
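The pattern of an async generator that yields None while results arrive on a separate path can be sketched with the standard library alone. In this illustration an asyncio.Queue stands in for the WebSocket connection, and the class, the receive loop, and the transcript strings are all hypothetical; the sketch only shows the shape of the control flow, not Gradium's protocol.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class SketchSTT:
    """Stand-in service; an asyncio.Queue plays the role of the WebSocket."""

    def __init__(self) -> None:
        self._outbound: asyncio.Queue = asyncio.Queue()
        self.transcripts: List[str] = []

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Hand the audio to the connection and yield None: no transcript is
        # produced synchronously for this chunk.
        await self._outbound.put(audio)
        yield None

    async def receive_loop(self) -> None:
        # Hypothetical receiver: pretends the server answers each chunk with
        # a text message. An empty chunk ends the loop.
        while True:
            chunk = await self._outbound.get()
            if not chunk:
                break
            self.transcripts.append(f"transcript for {len(chunk)} bytes")


async def demo() -> List[str]:
    stt = SketchSTT()
    receiver = asyncio.create_task(stt.receive_loop())
    for chunk in (b"\x00" * 320, b"\x00" * 640, b""):
        async for frame in stt.run_stt(chunk):
            assert frame is None  # transcripts never come back through run_stt
    await receiver
    return stt.transcripts


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Separating the send path from the receive path keeps audio flowing without waiting for each transcript, which suits a streaming API where interim and final results arrive at their own pace.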