stt
Sarvam AI Speech-to-Text service implementation.
This module provides a streaming Speech-to-Text service using Sarvam AI’s WebSocket-based API. It supports real-time transcription with Voice Activity Detection (VAD) and can handle multiple audio formats for Indian language speech recognition.
- pipecat.services.sarvam.stt.language_to_sarvam_language(language: Language) -> str
Convert a Language enum to Sarvam’s language code format.
- Parameters:
language – The Language enum value to convert.
- Returns:
The Sarvam language code string.
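As a rough illustration of this kind of conversion, the sketch below maps a stand-in enum to region-qualified codes such as hi-IN. The `Language` class, the `_SARVAM_CODES` table, and `to_sarvam_code` are all hypothetical names used for illustration; the actual mapping and the fallback behaviour of `language_to_sarvam_language` may differ.

```python
from enum import Enum


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (only a few members shown)."""

    EN = "en"
    EN_IN = "en-IN"
    HI = "hi"
    HI_IN = "hi-IN"
    TA = "ta"


# Hypothetical lookup table: base language -> region-qualified code.
_SARVAM_CODES = {"en": "en-IN", "hi": "hi-IN", "ta": "ta-IN"}


def to_sarvam_code(language: Language) -> str:
    """Return a region-qualified code for the given language.

    Raising on unknown languages is an assumption of this sketch; the real
    function may fall back to a default instead.
    """
    base = language.value.split("-")[0].lower()
    try:
        return _SARVAM_CODES[base]
    except KeyError:
        raise ValueError(f"No code known for {language!r}") from None
```

With this sketch, both `Language.HI` and `Language.HI_IN` resolve to the same code, since only the base language is used for the lookup.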
- class pipecat.services.sarvam.stt.ModelConfig(supports_prompt: bool, supports_mode: bool, supports_language: bool, supports_vad_params: bool, default_language: str | None, default_mode: str | None, use_translate_endpoint: bool, use_translate_method: bool)
Bases: object
Immutable configuration for a Sarvam STT model.
- Parameters:
supports_prompt – Whether the model accepts prompt parameter.
supports_mode – Whether the model accepts mode parameter.
supports_language – Whether the model accepts language parameter.
supports_vad_params – Whether the model accepts fine-grained VAD parameters.
default_language – Default language code (None = auto-detect).
default_mode – Default mode (None = not applicable).
use_translate_endpoint – Whether to use speech_to_text_translate_streaming endpoint.
use_translate_method – Whether to use translate() method instead of transcribe().
- supports_prompt: bool
- supports_mode: bool
- supports_language: bool
- supports_vad_params: bool
- default_language: str | None
- default_mode: str | None
- use_translate_endpoint: bool
- use_translate_method: bool
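To make the role of these flags concrete, here is a minimal sketch of an immutable per-model configuration table. `ModelConfigSketch`, `SKETCH_CONFIGS`, and `resolve_mode` are hypothetical names, not part of pipecat. The language, prompt, mode, and VAD values follow the parameter descriptions in this module; the two translate flags are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes instances immutable
class ModelConfigSketch:
    supports_prompt: bool
    supports_mode: bool
    supports_language: bool
    supports_vad_params: bool
    default_language: Optional[str]  # None = auto-detect
    default_mode: Optional[str]  # None = not applicable
    use_translate_endpoint: bool
    use_translate_method: bool


# Illustrative table only; the translate flags are guesses, not documented values.
SKETCH_CONFIGS = {
    "saarika:v2.5": ModelConfigSketch(
        False, False, True, False, "unknown", None, False, False
    ),
    "saaras:v2.5": ModelConfigSketch(
        True, False, False, False, None, None, True, True
    ),
    "saaras:v3": ModelConfigSketch(
        False, True, True, True, "unknown", "transcribe", False, False
    ),
}


def resolve_mode(model: str, requested: Optional[str]) -> Optional[str]:
    """Pick the effective mode, rejecting a mode on models that lack one.

    Raising here is an assumption of this sketch; the real service may
    ignore the value or log a warning instead.
    """
    config = SKETCH_CONFIGS[model]
    if requested is not None and not config.supports_mode:
        raise ValueError(f"{model} does not accept a mode")
    return requested if requested is not None else config.default_mode
```

Keeping the capabilities in one frozen record per model means the service can branch on flags rather than on model-name strings.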
- class pipecat.services.sarvam.stt.SarvamSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, vad_signals: bool | None | _NotGiven = <factory>, high_vad_sensitivity: bool | None | _NotGiven = <factory>, positive_speech_threshold: float | None | _NotGiven = <factory>, negative_speech_threshold: float | None | _NotGiven = <factory>, min_speech_frames: int | None | _NotGiven = <factory>, first_turn_min_speech_frames: int | None | _NotGiven = <factory>, negative_frames_count: int | None | _NotGiven = <factory>, negative_frames_window: int | None | _NotGiven = <factory>, start_speech_volume_threshold: float | None | _NotGiven = <factory>, interrupt_min_speech_frames: int | None | _NotGiven = <factory>, pre_speech_pad_frames: int | None | _NotGiven = <factory>, num_initial_ignored_frames: int | None | _NotGiven = <factory>)
Bases: STTSettings
Settings for SarvamSTTService.
- Parameters:
prompt – Optional prompt to guide transcription/translation style/context. Only applicable to models that support prompts (e.g., saaras:v2.5).
vad_signals – Enable VAD signals in response.
high_vad_sensitivity – Enable high VAD sensitivity.
positive_speech_threshold – VAD probability threshold (0.0-1.0) above which a frame is considered speech. Only for saaras:v3.
negative_speech_threshold – VAD probability threshold (0.0-1.0) below which a frame is considered silence. Only for saaras:v3.
min_speech_frames – Minimum consecutive speech frames to start a speech segment. Only for saaras:v3.
first_turn_min_speech_frames – Minimum speech frames for the first user turn. Only for saaras:v3.
negative_frames_count – Number of silence frames within the window to end a speech segment. Only for saaras:v3.
negative_frames_window – Sliding window size (in frames) for counting negative frames. Only for saaras:v3.
start_speech_volume_threshold – Volume level (dB) below which audio is too quiet to be speech. Only for saaras:v3.
interrupt_min_speech_frames – Minimum speech frames to register a barge-in/interruption. Only for saaras:v3.
pre_speech_pad_frames – Number of audio frames to prepend before detected speech onset. Only for saaras:v3.
num_initial_ignored_frames – Number of leading audio frames to skip at connection start. Only for saaras:v3.
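The interaction between negative_speech_threshold, negative_frames_count, and negative_frames_window can be pictured with a small sliding-window sketch. This is only one plausible reading of the parameter descriptions above: the detection itself runs on Sarvam's side and its exact algorithm is not described here. `EndOfSpeechWindow` is a hypothetical name.

```python
from collections import deque


class EndOfSpeechWindow:
    """Signal end of speech once enough recent frames look like silence."""

    def __init__(
        self,
        negative_frames_count: int,
        negative_frames_window: int,
        negative_speech_threshold: float,
    ):
        if negative_frames_count > negative_frames_window:
            raise ValueError("count cannot exceed the window size")
        self._needed = negative_frames_count
        self._threshold = negative_speech_threshold
        # Holds one bool per frame; old frames fall out automatically.
        self._recent = deque(maxlen=negative_frames_window)

    def push(self, speech_probability: float) -> bool:
        """Record one frame; return True if the segment should end."""
        self._recent.append(speech_probability < self._threshold)
        return sum(self._recent) >= self._needed
```

For example, with a count of 3 and a window of 5, three silent frames anywhere among the last five are enough to end the segment, so a single noisy frame in the middle of a pause does not reset the decision.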
- prompt: str | None | _NotGiven
- vad_signals: bool | None | _NotGiven
- high_vad_sensitivity: bool | None | _NotGiven
- positive_speech_threshold: float | None | _NotGiven
- negative_speech_threshold: float | None | _NotGiven
- min_speech_frames: int | None | _NotGiven
- first_turn_min_speech_frames: int | None | _NotGiven
- negative_frames_count: int | None | _NotGiven
- negative_frames_window: int | None | _NotGiven
- start_speech_volume_threshold: float | None | _NotGiven
- interrupt_min_speech_frames: int | None | _NotGiven
- pre_speech_pad_frames: int | None | _NotGiven
- num_initial_ignored_frames: int | None | _NotGiven
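Every field above admits three states: a value, None, or "not given". The sketch below shows why a separate sentinel is useful when settings are applied as partial updates: None can mean "clear this value", while "not given" means "leave it alone". `_NotGivenSketch`, `SettingsSketch`, and `apply_delta` are hypothetical stand-ins, and pipecat's own merge logic may work differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


class _NotGivenSketch:
    """Sentinel type distinguishing 'unset' from an explicit None."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenSketch()


@dataclass
class SettingsSketch:
    # Each field defaults to the sentinel, mirroring the <factory> defaults.
    prompt: Any = field(default_factory=lambda: NOT_GIVEN)
    vad_signals: Any = field(default_factory=lambda: NOT_GIVEN)

    def given(self) -> Dict[str, Any]:
        """Return only the fields the caller actually set."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not NOT_GIVEN
        }


def apply_delta(current: Dict[str, Any], delta: SettingsSketch) -> Dict[str, Any]:
    """Merge a partial update into the current settings."""
    merged = dict(current)
    merged.update(delta.given())
    return merged
```

Here `apply_delta(current, SettingsSketch(prompt=None))` clears the prompt but leaves `vad_signals` untouched, which would be impossible to express if None were the only "empty" value.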
- class pipecat.services.sarvam.stt.SarvamSTTService(*, api_key: str, model: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, sample_rate: int | None = None, input_audio_codec: str = 'wav', params: InputParams | None = None, settings: SarvamSTTSettings | None = None, ttfs_p99_latency: float | None = 1.17, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, **kwargs)
Bases: STTService
Sarvam speech-to-text service.
Provides real-time speech recognition using Sarvam’s WebSocket API.
Event handlers available (in addition to STTService events):
on_connected(service): Connected to Sarvam WebSocket
on_disconnected(service): Disconnected from Sarvam WebSocket
on_connection_error(service, error): Connection error occurred
Example:
    @stt.event_handler("on_connected")
    async def on_connected(service):
        ...
- Settings
alias of SarvamSTTSettings
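A minimal construction sketch using the Settings alias rather than the deprecated model and params arguments might look like the following. It uses only the constructor and settings fields documented in this module. The `SARVAM_API_KEY` environment variable name, the `build_sarvam_stt` helper, and the chosen values are assumptions for illustration.

```python
import os


def build_sarvam_stt():
    """Construct a SarvamSTTService via the non-deprecated `settings` argument."""
    # Imported lazily so this sketch can be loaded without pipecat installed.
    from pipecat.services.sarvam.stt import SarvamSTTService

    settings = SarvamSTTService.Settings(
        model="saaras:v3",
        vad_signals=True,
    )
    return SarvamSTTService(
        api_key=os.environ["SARVAM_API_KEY"],  # assumed variable name
        mode="transcribe",  # only meaningful on models that support a mode
        settings=settings,
        keepalive_timeout=10.0,  # illustrative value; None disables keepalive
    )
```

Because `mode` is a constructor argument rather than a settings field in the signatures above, it is passed alongside `settings` instead of inside it.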
- class InputParams(*, language: Language | None = None, prompt: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, vad_signals: bool | None = None, high_vad_sensitivity: bool | None = None)
Bases: BaseModel
Configuration parameters for Sarvam STT service.
Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(...) instead.
- Parameters:
language – Target language for transcription.
  - saarika:v2.5: Defaults to “unknown” (auto-detect supported)
  - saaras:v2.5: Not used (auto-detects language)
  - saaras:v3: Defaults to “unknown” (auto-detect supported)
prompt – Optional prompt to guide transcription/translation style/context. Only applicable to saaras:v2.5. Defaults to None.
mode – Mode of operation for saaras:v3 models only. Options: transcribe, translate, verbatim, translit, codemix. Defaults to “transcribe” for saaras:v3.
vad_signals – Enable VAD signals in response. Defaults to None.
high_vad_sensitivity – Enable high VAD sensitivity. Defaults to None.
- prompt: str | None
- mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None
- vad_signals: bool | None
- high_vad_sensitivity: bool | None
- __init__(*, api_key: str, model: str | None = None, mode: Literal['transcribe', 'translate', 'verbatim', 'translit', 'codemix'] | None = None, sample_rate: int | None = None, input_audio_codec: str = 'wav', params: InputParams | None = None, settings: SarvamSTTSettings | None = None, ttfs_p99_latency: float | None = 1.17, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, **kwargs)
Initialize the Sarvam STT service.
- Parameters:
api_key – Sarvam API key for authentication.
model – Sarvam model to use for transcription.
Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(model=...) instead.
mode – Mode of operation. Options: transcribe, translate, verbatim, translit, codemix. Only applicable to models that support it (e.g., saaras:v3). Defaults to the model’s default mode.
sample_rate – Audio sample rate. Defaults to 16000 if not specified.
input_audio_codec – Audio codec/format of the input file. Defaults to “wav”.
params – Configuration parameters for Sarvam STT service.
Deprecated since version 0.0.105: Use settings=SarvamSTTService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
keepalive_timeout – Seconds of no audio before sending silence to keep the connection alive. None disables keepalive.
keepalive_interval – Seconds between idle checks when keepalive is enabled.
**kwargs – Additional arguments passed to the parent STTService.
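The relationship between keepalive_timeout and keepalive_interval can be sketched as a periodic idle check that sends silence once no audio has gone out for the timeout period. This is an illustrative model, not the service's implementation: `KeepaliveSketch`, `silence_chunk`, the 16-bit mono PCM format, and the 100 ms chunk length are all assumptions.

```python
import asyncio
import time


def silence_chunk(sample_rate: int = 16000, duration_ms: int = 100) -> bytes:
    """Return 16-bit mono PCM silence (all-zero samples)."""
    return b"\x00\x00" * (sample_rate * duration_ms // 1000)


class KeepaliveSketch:
    def __init__(self, send, keepalive_timeout, keepalive_interval=5.0,
                 clock=time.monotonic):
        self._send = send  # async callable taking bytes
        self._timeout = keepalive_timeout  # None disables keepalive
        self._interval = keepalive_interval
        self._clock = clock
        self._last_audio = clock()

    def note_audio_sent(self) -> None:
        """Call whenever real audio is sent, resetting the idle timer."""
        self._last_audio = self._clock()

    async def check_once(self) -> bool:
        """Send silence if idle for at least the timeout; report whether sent."""
        if self._timeout is None:
            return False
        if self._clock() - self._last_audio >= self._timeout:
            await self._send(silence_chunk())
            self._last_audio = self._clock()
            return True
        return False

    async def run(self) -> None:
        """Check for idleness every keepalive_interval seconds until cancelled."""
        if self._timeout is None:
            return
        while True:
            await asyncio.sleep(self._interval)
            await self.check_once()
```

Under this model the interval bounds how late a keepalive can be: silence is sent at most keepalive_interval seconds after the timeout elapses.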
- language_to_service_language(language: Language) -> str
Convert pipecat Language enum to Sarvam’s language code.
- Parameters:
language – The Language enum value to convert.
- Returns:
The Sarvam language code string.
- can_generate_metrics() -> bool
Check if this service can generate processing metrics.
- Returns:
True, as Sarvam service supports metrics generation.
- async process_frame(frame: Frame, direction: FrameDirection)
Process incoming frames.
Handles VAD frames for TTFB tracking when using Pipecat’s VAD instead of Sarvam’s built-in VAD.
- async set_prompt(prompt: str | None)
Set the transcription/translation prompt and reconnect.
Deprecated since version 0.0.104: Use STTUpdateSettingsFrame(SarvamSTTService.Settings(prompt=...)) instead.
- Parameters:
prompt – Prompt text to guide transcription/translation style/context. Pass None to clear/disable prompt. Only applicable to models that support prompts.
- async start(frame: StartFrame)
Start the Sarvam STT service.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)
Stop the Sarvam STT service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)
Cancel the Sarvam STT service.
- Parameters:
frame – The cancel frame.