stt

Sarvam AI Speech-to-Text service implementation.

This module provides a streaming Speech-to-Text service using Sarvam AI’s WebSocket-based API. It supports real-time transcription with Voice Activity Detection (VAD) and can handle multiple audio formats for Indian language speech recognition.

pipecat.services.sarvam.stt.language_to_sarvam_language(language: Language) → str

Convert a Language enum to Sarvam’s language code format.

Parameters:

language – The Language enum value to convert.

Returns:

The Sarvam language code string.
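
The conversion can be pictured as a table lookup. The sketch below is a minimal illustration only: the Language enum here is a hypothetical stand-in covering a handful of values, the region-tagged codes are assumed for illustration, and the "unknown" fallback is an assumption rather than documented behavior of this function.

```python
from enum import Enum


class Language(str, Enum):
    """Hypothetical stand-in for pipecat's Language enum (small subset)."""

    EN = "en"
    HI = "hi"
    HI_IN = "hi-IN"
    TA = "ta"
    FR = "fr"


# Assumed mapping: base language codes resolve to an India-region tag.
_TO_SARVAM = {
    Language.EN: "en-IN",
    Language.HI: "hi-IN",
    Language.HI_IN: "hi-IN",
    Language.TA: "ta-IN",
}


def to_sarvam_code(language: Language, fallback: str = "unknown") -> str:
    """Return the mapped code, or `fallback` when the language is not in the table."""
    return _TO_SARVAM.get(language, fallback)
```

With this table, to_sarvam_code(Language.HI) and to_sarvam_code(Language.HI_IN) resolve to the same string, while an unmapped value such as Language.FR falls through to the fallback.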

class pipecat.services.sarvam.stt.ModelConfig(supports_prompt: bool, supports_mode: bool, supports_language: bool, supports_vad_params: bool, default_language: str | None, default_mode: str | None, use_translate_endpoint: bool, use_translate_method: bool)

Bases: object

Immutable configuration for a Sarvam STT model.

Parameters:
  • supports_prompt – Whether the model accepts the prompt parameter.

  • supports_mode – Whether the model accepts the mode parameter.

  • supports_language – Whether the model accepts the language parameter.

  • supports_vad_params – Whether the model accepts fine-grained VAD parameters.

  • default_language – Default language code (None = auto-detect).

  • default_mode – Default mode (None = not applicable).

  • use_translate_endpoint – Whether to use the speech_to_text_translate_streaming endpoint.

  • use_translate_method – Whether to use the translate() method instead of transcribe().

supports_prompt: bool
supports_mode: bool
supports_language: bool
supports_vad_params: bool
default_language: str | None
default_mode: str | None
use_translate_endpoint: bool
use_translate_method: bool
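
One way to picture an immutable per-model configuration is a frozen dataclass whose capability flags decide which options are forwarded. The sketch below is an illustration under assumptions: ModelConfigSketch, EXAMPLE, and build_request_options are hypothetical names, and the flag values in EXAMPLE are guesses loosely based on the saaras:v3 notes elsewhere on this page, not the library's actual table.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfigSketch:
    """Hypothetical mirror of the fields documented for ModelConfig."""

    supports_prompt: bool
    supports_mode: bool
    supports_language: bool
    supports_vad_params: bool
    default_language: Optional[str]
    default_mode: Optional[str]
    use_translate_endpoint: bool
    use_translate_method: bool


# Assumed values for illustration only.
EXAMPLE = ModelConfigSketch(
    supports_prompt=False,
    supports_mode=True,
    supports_language=True,
    supports_vad_params=True,
    default_language="unknown",
    default_mode="transcribe",
    use_translate_endpoint=False,
    use_translate_method=False,
)


def build_request_options(config, *, language=None, mode=None, prompt=None) -> dict:
    """Keep only the options the model accepts, filling in its defaults."""
    options = {}
    if config.supports_language:
        options["language"] = language or config.default_language
    if config.supports_mode:
        options["mode"] = mode or config.default_mode
    if config.supports_prompt and prompt is not None:
        options["prompt"] = prompt
    return options


try:
    EXAMPLE.default_mode = "translate"  # frozen dataclasses reject assignment
except FrozenInstanceError:
    print("config is immutable")
```

Freezing the dataclass means a shared config object cannot be modified by accident after it is created; selecting a different model means selecting a different config instance.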

class pipecat.services.sarvam.stt.SarvamSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, vad_signals: bool | None | _NotGiven = <factory>, high_vad_sensitivity: bool | None | _NotGiven = <factory>, positive_speech_threshold: float | None | _NotGiven = <factory>, negative_speech_threshold: float | None | _NotGiven = <factory>, min_speech_frames: int | None | _NotGiven = <factory>, first_turn_min_speech_frames: int | None | _NotGiven = <factory>, negative_frames_count: int | None | _NotGiven = <factory>, negative_frames_window: int | None | _NotGiven = <factory>, start_speech_volume_threshold: float | None | _NotGiven = <factory>, interrupt_min_speech_frames: int | None | _NotGiven = <factory>, pre_speech_pad_frames: int | None | _NotGiven = <factory>, num_initial_ignored_frames: int | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for SarvamSTTService.

Parameters:
  • prompt – Optional prompt to guide transcription/translation style/context. Only applicable to models that support prompts (e.g., saaras:v2.5).

  • vad_signals – Enable VAD signals in response.

  • high_vad_sensitivity – Enable high VAD sensitivity.

  • positive_speech_threshold – VAD probability threshold (0.0-1.0) above which a frame is considered speech. Only for saaras:v3.

  • negative_speech_threshold – VAD probability threshold (0.0-1.0) below which a frame is considered silence. Only for saaras:v3.

  • min_speech_frames – Minimum consecutive speech frames to start a speech segment. Only for saaras:v3.

  • first_turn_min_speech_frames – Minimum speech frames for the first user turn. Only for saaras:v3.

  • negative_frames_count – Number of silence frames within the window to end a speech segment. Only for saaras:v3.

  • negative_frames_window – Sliding window size (in frames) for counting negative frames. Only for saaras:v3.

  • start_speech_volume_threshold – Volume level (dB) below which audio is too quiet to be speech. Only for saaras:v3.

  • interrupt_min_speech_frames – Minimum speech frames to register a barge-in/interruption. Only for saaras:v3.

  • pre_speech_pad_frames – Number of audio frames to prepend before detected speech onset. Only for saaras:v3.

  • num_initial_ignored_frames – Number of leading audio frames to skip at connection start. Only for saaras:v3.
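
To make the interaction between the threshold and frame-count parameters more concrete, the following sketch shows a generic hysteresis-style detector. It is not Sarvam's server-side algorithm, which this page does not describe; the class name, the default values, and the exact start/end rules are all assumptions chosen only to illustrate how parameters with these names commonly work together.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HysteresisVADSketch:
    """Generic two-threshold VAD state machine (illustrative defaults)."""

    positive_speech_threshold: float = 0.5
    negative_speech_threshold: float = 0.35
    min_speech_frames: int = 3
    negative_frames_count: int = 4
    negative_frames_window: int = 6
    speaking: bool = field(default=False, init=False)
    _speech_run: int = field(default=0, init=False, repr=False)
    _recent: deque = field(default_factory=deque, init=False, repr=False)

    def __post_init__(self) -> None:
        # Sliding window of "was this frame silence?" flags.
        self._recent = deque(maxlen=self.negative_frames_window)

    def push(self, prob: float) -> Optional[str]:
        """Feed one frame's speech probability; return 'start', 'end', or None."""
        if not self.speaking:
            # Count consecutive frames above the positive threshold.
            if prob > self.positive_speech_threshold:
                self._speech_run += 1
            else:
                self._speech_run = 0
            if self._speech_run >= self.min_speech_frames:
                self.speaking = True
                self._recent.clear()
                return "start"
            return None
        # While speaking, end the segment once enough silence frames
        # accumulate inside the sliding window.
        self._recent.append(prob < self.negative_speech_threshold)
        if sum(self._recent) >= self.negative_frames_count:
            self.speaking = False
            self._speech_run = 0
            return "end"
        return None
```

Using two thresholds instead of one keeps probabilities that hover between them from toggling the state on every frame, and the sliding window tolerates brief pauses inside an utterance.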

prompt: str | None | _NotGiven
vad_signals: bool | None | _NotGiven
high_vad_sensitivity: bool | None | _NotGiven
positive_speech_threshold: float | None | _NotGiven
negative_speech_threshold: float | None | _NotGiven
min_speech_frames: int | None | _NotGiven
first_turn_min_speech_frames: int | None | _NotGiven
negative_frames_count: int | None | _NotGiven
negative_frames_window: int | None | _NotGiven
start_speech_volume_threshold: float | None | _NotGiven
interrupt_min_speech_frames: int | None | _NotGiven
pre_speech_pad_frames: int | None | _NotGiven
num_initial_ignored_frames: int | None | _NotGiven
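
Each field accepts a value, None, or a not-given sentinel, which lets an update distinguish "set this field to None" from "leave this field alone". The sketch below illustrates that idea with hypothetical stand-ins (SettingsSketch, NOT_GIVEN, apply_update); it is not the library's implementation and covers only three of the fields.

```python
from dataclasses import dataclass, fields, replace


class _NotGiven:
    """Sentinel type meaning 'the caller did not mention this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SettingsSketch:
    """Hypothetical subset of the documented settings fields."""

    prompt: object = NOT_GIVEN
    vad_signals: object = NOT_GIVEN
    min_speech_frames: object = NOT_GIVEN


def apply_update(current: SettingsSketch, delta: SettingsSketch) -> SettingsSketch:
    """Return a copy of `current` with every explicitly given field of `delta` applied."""
    changes = {
        f.name: getattr(delta, f.name)
        for f in fields(delta)
        if not isinstance(getattr(delta, f.name), _NotGiven)
    }
    return replace(current, **changes)


current = SettingsSketch(prompt="Medical dictation", vad_signals=True, min_speech_frames=3)
# Clearing the prompt is an explicit None; the other fields stay untouched.
updated = apply_update(current, SettingsSketch(prompt=None))
```

After the update, updated.prompt is None while vad_signals and min_speech_frames keep their earlier values, which is the behavior a plain None default could not express.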

class pipecat.services.sarvam.stt.SarvamSTTService(*, api_key: str, model: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, sample_rate: int | None = None, input_audio_codec: str = 'wav', params: InputParams | None = None, settings: SarvamSTTSettings | None = None, ttfs_p99_latency: float | None = 1.17, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, **kwargs)

Bases: STTService

Sarvam speech-to-text service.

Provides real-time speech recognition using Sarvam’s WebSocket API.

Event handlers available (in addition to STTService events):

  • on_connected(service): Connected to Sarvam WebSocket

  • on_disconnected(service): Disconnected from Sarvam WebSocket

  • on_connection_error(service, error): Connection error occurred

Example:

@stt.event_handler("on_connected")
async def on_connected(service):
    ...

Settings

alias of SarvamSTTSettings

class InputParams(*, language: Language | None = None, prompt: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, vad_signals: bool | None = None, high_vad_sensitivity: bool | None = None)

Bases: BaseModel

Configuration parameters for Sarvam STT service.

Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(...) instead.

Parameters:
  • language – Target language for transcription.

      - saarika:v2.5: Defaults to “unknown” (auto-detect supported)
      - saaras:v2.5: Not used (auto-detects language)
      - saaras:v3: Defaults to “unknown” (auto-detect supported)

  • prompt – Optional prompt to guide transcription/translation style/context. Only applicable to saaras:v2.5. Defaults to None.

  • mode – Mode of operation for saaras:v3 models only. Options: transcribe, translate, verbatim, translit, codemix. Defaults to “transcribe” for saaras:v3.

  • vad_signals – Enable VAD signals in response. Defaults to None.

  • high_vad_sensitivity – Enable high VAD sensitivity. Defaults to None.

language: Language | None
prompt: str | None
mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None
vad_signals: bool | None
high_vad_sensitivity: bool | None

__init__(*, api_key: str, model: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, sample_rate: int | None = None, input_audio_codec: str = 'wav', params: InputParams | None = None, settings: SarvamSTTSettings | None = None, ttfs_p99_latency: float | None = 1.17, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, **kwargs)

Initialize the Sarvam STT service.

Parameters:
  • api_key – Sarvam API key for authentication.

  • model

    Sarvam model to use for transcription.

    Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(model=...) instead.

  • mode – Mode of operation. Options: transcribe, translate, verbatim, translit, codemix. Only applicable to models that support it (e.g., saaras:v3). Defaults to the model’s default mode.

  • sample_rate – Audio sample rate in Hz. Defaults to 16000 if not specified.

  • input_audio_codec – Audio codec/format of the input audio. Defaults to “wav”.

  • params

    Configuration parameters for Sarvam STT service.

    Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • keepalive_timeout – Seconds of no audio before sending silence to keep the connection alive. None disables keepalive.

  • keepalive_interval – Seconds between idle checks when keepalive is enabled.

  • **kwargs – Additional arguments passed to the parent STTService.
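
The relationship between keepalive_timeout and keepalive_interval can be sketched as a periodic idle check. The code below is a simplified illustration, not the service's implementation: KeepaliveSketch, its send callback, the 20 ms silence chunk, and the 16-bit mono PCM format are all assumptions.

```python
import asyncio
import time


class KeepaliveSketch:
    """Send a short silence chunk when no audio has gone out for a while."""

    def __init__(self, send, keepalive_timeout, keepalive_interval=5.0,
                 sample_rate=16000, silence_ms=20):
        self._send = send                    # async callable taking bytes
        self._timeout = keepalive_timeout    # None disables keepalive
        self._interval = keepalive_interval
        # Assumed format: 16-bit mono PCM, so 2 bytes per sample.
        self._silence = b"\x00" * (sample_rate * 2 * silence_ms // 1000)
        self._last_audio = time.monotonic()

    def note_audio_sent(self) -> None:
        """Call whenever real audio is sent, to reset the idle timer."""
        self._last_audio = time.monotonic()

    async def run(self) -> None:
        if self._timeout is None:
            return
        while True:
            # Wake up every `keepalive_interval` seconds to check idleness.
            await asyncio.sleep(self._interval)
            if time.monotonic() - self._last_audio >= self._timeout:
                await self._send(self._silence)
                self._last_audio = time.monotonic()
```

In this sketch the interval only controls how often idleness is checked, so silence may be sent up to one interval later than the timeout itself would suggest.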

language_to_service_language(language: Language) → str

Convert pipecat Language enum to Sarvam’s language code.

Parameters:

language – The Language enum value to convert.

Returns:

The Sarvam language code string.

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True, as the Sarvam service supports metrics generation.

async process_frame(frame: Frame, direction: FrameDirection)

Process incoming frames.

Handles VAD frames for TTFB tracking when using Pipecat’s VAD instead of Sarvam’s built-in VAD.
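
As a rough illustration of what tracking latency from VAD events involves, the sketch below measures the time between an end-of-speech event and the next transcript. TTFBTrackerSketch and its method names are hypothetical, and treating the measurement as "speech end to next transcript" is an assumption; the injectable clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class TTFBTrackerSketch:
    """Measure elapsed time from end of speech to the next transcript."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._speech_end: Optional[float] = None

    def on_user_stopped_speaking(self) -> None:
        # Would be called when a "user stopped speaking" VAD frame arrives.
        self._speech_end = self._clock()

    def on_transcript(self) -> Optional[float]:
        """Return seconds since speech ended, or None if no speech end was seen."""
        if self._speech_end is None:
            return None
        elapsed = self._clock() - self._speech_end
        self._speech_end = None  # report each utterance only once
        return elapsed
```

A transcript that arrives without a preceding end-of-speech event yields None here instead of a misleading measurement.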

async set_prompt(prompt: str | None)

Set the transcription/translation prompt and reconnect.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame(SarvamSTTService.Settings(prompt=...)) instead.

Parameters:

prompt – Prompt text to guide transcription/translation style/context. Pass None to clear/disable prompt. Only applicable to models that support prompts.

async start(frame: StartFrame)

Start the Sarvam STT service.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)

Stop the Sarvam STT service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)

Cancel the Sarvam STT service.

Parameters:

frame – The cancel frame.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Send audio data to Sarvam for transcription.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

Frame – Always None; transcription results are delivered separately through WebSocket callbacks.
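
The split between sending audio and receiving results can be sketched with plain asyncio. In the illustration below, an in-memory queue stands in for the WebSocket, and StreamingSTTSketch, receive_loop, and close are hypothetical names; the point is only that the generator yields None while results surface on a separate path.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class StreamingSTTSketch:
    """Audio goes out through run_stt; results arrive on a separate receive loop."""

    def __init__(self) -> None:
        # Stand-in for the outbound WebSocket; create inside a running loop.
        self._outbound: asyncio.Queue = asyncio.Queue()
        self.transcripts: List[str] = []

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Hand the audio to the transport and yield nothing useful.
        await self._outbound.put(audio)
        yield None

    async def receive_loop(self) -> None:
        # Stand-in for the callback path that handles server messages.
        while True:
            chunk = await self._outbound.get()
            if chunk is None:  # sentinel pushed by close()
                break
            self.transcripts.append(f"{len(chunk)} bytes received")

    async def close(self) -> None:
        await self._outbound.put(None)


async def main():
    stt = StreamingSTTSketch()
    receiver = asyncio.create_task(stt.receive_loop())
    yielded = [item async for item in stt.run_stt(b"\x00" * 320)]
    await stt.close()
    await receiver
    return yielded, stt.transcripts
```

Running asyncio.run(main()) returns the single None yielded by the generator together with the result collected by the receive loop, mirroring how a caller of run_stt should not expect transcripts from the generator itself.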