stt

Google Cloud Speech-to-Text V2 service implementation for Pipecat.

This module provides a Google Cloud Speech-to-Text V2 service with streaming support, enabling real-time speech recognition with features like automatic punctuation, voice activity detection, and multi-language support.

pipecat.services.google.stt.language_to_google_stt_language(language: Language) → str | None

Maps Language enum to Google Speech-to-Text V2 language codes.

Parameters: language – Language enum value.
Returns: Google STT language code, or None if the language is not supported.
Return type: Optional[str]
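
Because the mapping can return None, callers usually need to filter unsupported languages before building a request. The following is a minimal sketch of that pattern, not the library's implementation: `Lang`, `_SUPPORTED`, `to_google_code`, and `supported_codes` are hypothetical stand-ins, and the table is illustrative only.

```python
from enum import Enum
from typing import Optional


class Lang(Enum):
    """Hypothetical stand-in for Pipecat's Language enum."""

    EN_US = "en-US"
    FR_FR = "fr-FR"
    TLH = "tlh"  # deliberately missing from the illustrative table below


# Illustrative table only; it is an assumption, not the real mapping.
_SUPPORTED = {
    Lang.EN_US: "en-US",
    Lang.FR_FR: "fr-FR",
}


def to_google_code(language: Lang) -> Optional[str]:
    """Return the mapped code, or None when the language is not in the table."""
    return _SUPPORTED.get(language)


def supported_codes(languages: list[Lang]) -> list[str]:
    """Drop languages that map to None so only usable codes remain."""
    codes = (to_google_code(lang) for lang in languages)
    return [code for code in codes if code is not None]
```

Filtering up front keeps a None value from reaching the recognition request.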

class pipecat.services.google.stt.GoogleSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, languages: list[Language] | _NotGiven = <factory>, language_codes: list[str] | None | _NotGiven = <factory>, use_separate_recognition_per_channel: bool | _NotGiven = <factory>, enable_automatic_punctuation: bool | _NotGiven = <factory>, enable_spoken_punctuation: bool | _NotGiven = <factory>, enable_spoken_emojis: bool | _NotGiven = <factory>, profanity_filter: bool | _NotGiven = <factory>, enable_word_time_offsets: bool | _NotGiven = <factory>, enable_word_confidence: bool | _NotGiven = <factory>, enable_interim_results: bool | _NotGiven = <factory>, enable_voice_activity_events: bool | _NotGiven = <factory>)

Bases: STTSettings

Settings for GoogleSTTService.

Parameters:

languages – List of Language enums for recognition (e.g. [Language.EN_US]). Preferred over language_codes.
language_codes – List of Google STT language code strings (e.g. ["en-US"]). Deprecated since version 0.0.104: Use languages instead. If both are provided, languages takes precedence. This field exists only for backward compatibility with dict-based settings updates.
use_separate_recognition_per_channel – Process each audio channel separately.
enable_automatic_punctuation – Add punctuation to transcripts.
enable_spoken_punctuation – Include spoken punctuation in transcript.
enable_spoken_emojis – Include spoken emojis in transcript.
profanity_filter – Filter profanity from transcript.
enable_word_time_offsets – Include timing information for each word.
enable_word_confidence – Include confidence scores for each word.
enable_interim_results – Stream partial recognition results.
enable_voice_activity_events – Detect voice activity in audio.

languages: list[Language] | _NotGiven

language_codes: list[str] | None | _NotGiven

use_separate_recognition_per_channel: bool | _NotGiven

enable_automatic_punctuation: bool | _NotGiven

enable_spoken_punctuation: bool | _NotGiven

enable_spoken_emojis: bool | _NotGiven

profanity_filter: bool | _NotGiven

enable_word_time_offsets: bool | _NotGiven

enable_word_confidence: bool | _NotGiven

enable_interim_results: bool | _NotGiven

enable_voice_activity_events: bool | _NotGiven
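
The rule that languages takes precedence over the deprecated language_codes, combined with a "not given" sentinel, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `_Unset`, `UNSET`, `SketchSettings`, and `resolve_codes` are hypothetical names, and plain strings stand in for Language enum values to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any


class _Unset:
    """Hypothetical sentinel type standing in for _NotGiven."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


@dataclass
class SketchSettings:
    """Hypothetical two-field subset of the settings class."""

    languages: Any = field(default_factory=lambda: UNSET)
    language_codes: Any = field(default_factory=lambda: UNSET)


def resolve_codes(settings: SketchSettings, fallback: list[str]) -> list[str]:
    """Pick the effective language list: languages first, then language_codes."""
    if not isinstance(settings.languages, _Unset):
        return list(settings.languages)
    codes = settings.language_codes
    if not isinstance(codes, _Unset) and codes is not None:
        return list(codes)
    # Neither field was supplied, so keep whatever was configured before.
    return fallback
```

A sentinel distinct from None lets an update distinguish "leave this field alone" from "explicitly set this field to None".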

class pipecat.services.google.stt.GoogleSTTService(*, credentials: str | None = None, credentials_path: str | None = None, location: str = 'global', sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleSTTSettings | None = None, ttfs_p99_latency: float | None = 1.57, **kwargs)

Bases: STTService

Google Cloud Speech-to-Text V2 service implementation.

Provides real-time speech recognition using Google Cloud’s Speech-to-Text V2 API with streaming support. Handles audio transcription and optional voice activity detection. Implements automatic stream reconnection to handle Google’s 4-minute streaming limit.

Parameters:

InputParams – Configuration parameters for the STT service.
STREAMING_LIMIT – Google Cloud’s streaming limit in milliseconds (4 minutes).

Raises:

ValueError – If neither credentials nor credentials_path is provided.
ValueError – If project ID is not found in credentials.

Settings: alias of GoogleSTTSettings

STREAMING_LIMIT = 240000
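
The value 240000 is 4 × 60 × 1000 milliseconds. A reconnection check against that limit might look like the sketch below. The helper names are hypothetical, and the service's actual trigger condition is not documented here, so treat the comparison as an assumption.

```python
STREAMING_LIMIT_MS = 240_000  # 4 minutes, the same value as STREAMING_LIMIT


def stream_age_ms(started_at_s: float, now_s: float) -> int:
    """Convert elapsed wall-clock seconds into whole milliseconds."""
    return int((now_s - started_at_s) * 1000)


def needs_reconnect(
    started_at_s: float,
    now_s: float,
    limit_ms: int = STREAMING_LIMIT_MS,
) -> bool:
    """Return True once the stream has been open for at least limit_ms.

    Assumption: reconnecting exactly at the limit. A real service may
    choose to reconnect earlier to leave a safety margin.
    """
    return stream_age_ms(started_at_s, now_s) >= limit_ms
```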

class InputParams(*, languages: Language | list[Language] = <factory>, model: str | None = 'latest_long', use_separate_recognition_per_channel: bool | None = False, enable_automatic_punctuation: bool | None = True, enable_spoken_punctuation: bool | None = False, enable_spoken_emojis: bool | None = False, profanity_filter: bool | None = False, enable_word_time_offsets: bool | None = False, enable_word_confidence: bool | None = False, enable_interim_results: bool | None = True, enable_voice_activity_events: bool | None = False)

Bases: BaseModel

Configuration parameters for Google Speech-to-Text.

Deprecated since version 0.0.105: Use settings=GoogleSTTService.Settings(...) instead.

Parameters:

languages – Single language or list of recognition languages. First language is primary.
model – Speech recognition model to use.
use_separate_recognition_per_channel – Process each audio channel separately.
enable_automatic_punctuation – Add punctuation to transcripts.
enable_spoken_punctuation – Include spoken punctuation in transcript.
enable_spoken_emojis – Include spoken emojis in transcript.
profanity_filter – Filter profanity from transcript.
enable_word_time_offsets – Include timing information for each word.
enable_word_confidence – Include confidence scores for each word.
enable_interim_results – Stream partial recognition results.
enable_voice_activity_events – Detect voice activity in audio.

languages: Language | list[Language]

model: str | None

use_separate_recognition_per_channel: bool | None

enable_automatic_punctuation: bool | None

enable_spoken_punctuation: bool | None

enable_spoken_emojis: bool | None

profanity_filter: bool | None

enable_word_time_offsets: bool | None

enable_word_confidence: bool | None

enable_interim_results: bool | None

enable_voice_activity_events: bool | None

classmethod validate_languages(v) → list[Language]

Ensure languages is always a list.

Parameters: v – Single Language enum or list of Language enums.
Returns: List of configured languages.
Return type: List[Language]

property language_list: list[Language]

Get languages as a guaranteed list.

Returns: List of configured languages.
Return type: List[Language]
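
The single-value-or-list normalization that validate_languages and language_list describe can be sketched without any validation framework. `Lang` and `ensure_list` are hypothetical names; the real class performs this step inside a Pydantic validator whose body is not shown here.

```python
from enum import Enum
from typing import Union


class Lang(Enum):
    """Hypothetical stand-in for Pipecat's Language enum."""

    EN_US = "en-US"
    ES_ES = "es-ES"


def ensure_list(value: Union[Lang, list[Lang]]) -> list[Lang]:
    """Wrap a single language in a list; pass a list through unchanged."""
    if isinstance(value, list):
        return value
    return [value]


# The first element is treated as the primary language.
primary = ensure_list([Lang.ES_ES, Lang.EN_US])[0]
```

Normalizing once at the boundary means downstream code can always index or iterate without checking the type.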

Initialize the Google STT service.

Parameters:

credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to service account credentials JSON file.
location – Google Cloud location (e.g., “global”, “us-central1”).
sample_rate – Audio sample rate in Hertz.
params – Configuration parameters for the service. Deprecated since version 0.0.105: Use settings=GoogleSTTService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated params, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to STTService.
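
The precedence between the deprecated params and settings can be pictured as a layered merge in which later layers win. This is only a sketch: `DEFAULTS` and `merge_config` are hypothetical, plain dicts stand in for the real parameter objects, and the assumption that defaults are applied first is mine rather than something this page states.

```python
from typing import Any, Optional

# Illustrative defaults, copied from the InputParams signature above.
DEFAULTS: dict[str, Any] = {
    "model": "latest_long",
    "enable_automatic_punctuation": True,
    "enable_interim_results": True,
}


def merge_config(
    params: Optional[dict[str, Any]] = None,
    settings: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Layer defaults, then deprecated params, then settings (settings win)."""
    merged = dict(DEFAULTS)
    merged.update(params or {})
    merged.update(settings or {})
    return merged
```

Only keys that a layer actually supplies override the layer beneath it, so a field set in params but absent from settings still takes effect.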

can_generate_metrics() → bool

Check if the service can generate metrics.

Returns: True, as this service supports metrics generation.
Return type: bool

language_to_service_language(language: Language | list[Language]) → str | list[str]

Convert Language enum(s) to Google STT language code(s).

Parameters: language – Single Language enum or list of Language enums.
Returns: Google STT language code(s).
Return type: str | List[str]

async set_languages(languages: list[Language])

Update the service’s recognition languages.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame with GoogleSTTService.Settings(languages=...) instead.

Parameters: languages – List of languages for recognition. First language is primary.

async start(frame: StartFrame)

Start the STT service and establish connection.

Parameters: frame – The start frame triggering the service start.

async stop(frame: EndFrame)

Stop the STT service and clean up resources.

Parameters: frame – The end frame triggering the service stop.

async cancel(frame: CancelFrame)

Cancel the STT service and clean up resources.

Parameters: frame – The cancel frame triggering the service cancellation.

Update service options dynamically.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame with GoogleSTTService.Settings(...) instead.

Parameters:

languages – New list of recognition languages.
model – New recognition model.
enable_automatic_punctuation – Enable/disable automatic punctuation.
enable_spoken_punctuation – Enable/disable spoken punctuation.
enable_spoken_emojis – Enable/disable spoken emojis.
profanity_filter – Enable/disable profanity filter.
enable_word_time_offsets – Enable/disable word timing info.
enable_word_confidence – Enable/disable word confidence scores.
enable_interim_results – Enable/disable interim results.
enable_voice_activity_events – Enable/disable voice activity detection.
location – New Google Cloud location.

Note

Changes that affect the streaming configuration will cause the stream to be reconnected.
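
The note above implies a comparison between current and updated options to decide whether a reconnect is needed. One way to sketch that decision is shown below. `changed_keys` and `apply_updates` are hypothetical, and which options count as part of the streaming configuration is an assumption the caller passes in, since this page does not list them.

```python
from typing import Any


def changed_keys(current: dict[str, Any], updates: dict[str, Any]) -> set[str]:
    """Return the option names whose values actually differ."""
    return {key for key, value in updates.items() if current.get(key) != value}


def apply_updates(
    current: dict[str, Any],
    updates: dict[str, Any],
    streaming_keys: set[str],
) -> tuple[dict[str, Any], bool]:
    """Merge updates and report whether a stream reconnect is needed.

    streaming_keys is an assumption supplied by the caller: the set of
    options that are baked into the streaming configuration.
    """
    changed = changed_keys(current, updates)
    merged = {**current, **updates}
    return merged, bool(changed & streaming_keys)
```

Comparing values rather than keys avoids a needless reconnect when an update repeats the current setting.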

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process an audio chunk for STT transcription.

Parameters: audio – Raw audio bytes to transcribe.
Yields: Frame – Always None. Transcription frames are pushed by the service's internal stream processing rather than yielded from this method.
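
The "yield None, deliver results elsewhere" shape can be hard to picture, so here is a minimal standalone sketch of it. `SketchSTT` and `feed` are hypothetical; the queue stands in for whatever internal mechanism forwards audio to the streaming request, which this page does not describe.

```python
import asyncio
from typing import AsyncGenerator, Optional


class SketchSTT:
    """Hypothetical service that hands audio to a background consumer."""

    def __init__(self) -> None:
        # Audio waiting to be sent on the streaming connection.
        self.outbound = asyncio.Queue()

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Queue the chunk for the streaming request; results arrive later
        # on a separate path, so nothing useful is yielded here.
        await self.outbound.put(audio)
        yield None


async def feed(chunks: list[bytes]) -> tuple[list[Optional[str]], int]:
    """Push chunks through run_stt and report what was yielded and queued."""
    stt = SketchSTT()
    yielded = []
    for chunk in chunks:
        async for item in stt.run_stt(chunk):
            yielded.append(item)
    return yielded, stt.outbound.qsize()


yielded, queued = asyncio.run(feed([b"\x00\x01", b"\x02\x03"]))
```

Each call yields a single None while the audio itself accumulates in the queue for the background consumer.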