stt

Google Cloud Speech-to-Text V2 service implementation for Pipecat.

This module provides a Google Cloud Speech-to-Text V2 service with streaming support, enabling real-time speech recognition with features like automatic punctuation, voice activity detection, and multi-language support.

pipecat.services.google.stt.language_to_google_stt_language(language: Language) → str | None

Maps Language enum to Google Speech-to-Text V2 language codes.

Parameters:

language – Language enum value.

Returns:

Google STT language code or None if not supported.

Return type:

Optional[str]
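
The contract described above (a supported enum value maps to a code, anything else yields None) can be sketched as follows. This is a stand-alone illustration only: the `Language` enum here is a hypothetical three-member stand-in for Pipecat's own enum, and both the mapping table and the helper name `to_google_stt_code` are assumptions made for the example, not the library's actual table or function.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Hypothetical stand-in for Pipecat's Language enum (illustration only)."""
    EN_US = "en-US"
    FR_FR = "fr-FR"
    XX_XX = "xx-XX"  # made-up value, deliberately missing from the table below


# Assumed mapping for this example; the real table is defined by the library.
_GOOGLE_STT_CODES = {
    Language.EN_US: "en-US",
    Language.FR_FR: "fr-FR",
}


def to_google_stt_code(language: Language) -> Optional[str]:
    """Return the Google STT code for `language`, or None if it is unsupported."""
    return _GOOGLE_STT_CODES.get(language)


supported = to_google_stt_code(Language.EN_US)
unsupported = to_google_stt_code(Language.XX_XX)
```

Callers should be prepared for the None case, for example by falling back to a default language.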

class pipecat.services.google.stt.GoogleSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, languages: list[Language] | _NotGiven = <factory>, language_codes: list[str] | None | _NotGiven = <factory>, use_separate_recognition_per_channel: bool | _NotGiven = <factory>, enable_automatic_punctuation: bool | _NotGiven = <factory>, enable_spoken_punctuation: bool | _NotGiven = <factory>, enable_spoken_emojis: bool | _NotGiven = <factory>, profanity_filter: bool | _NotGiven = <factory>, enable_word_time_offsets: bool | _NotGiven = <factory>, enable_word_confidence: bool | _NotGiven = <factory>, enable_interim_results: bool | _NotGiven = <factory>, enable_voice_activity_events: bool | _NotGiven = <factory>)

Bases: STTSettings

Settings for GoogleSTTService.

Parameters:
  • languages – List of Language enums for recognition (e.g. [Language.EN_US]). Preferred over language_codes.

  • language_codes

    List of Google STT language code strings (e.g. ["en-US"]).

    Deprecated since version 0.0.104: Use languages instead. If both are provided, languages takes precedence. This field exists only for backward compatibility with dict-based settings updates.

  • use_separate_recognition_per_channel – Process each audio channel separately.

  • enable_automatic_punctuation – Add punctuation to transcripts.

  • enable_spoken_punctuation – Include spoken punctuation in transcript.

  • enable_spoken_emojis – Include spoken emojis in transcript.

  • profanity_filter – Filter profanity from transcript.

  • enable_word_time_offsets – Include timing information for each word.

  • enable_word_confidence – Include confidence scores for each word.

  • enable_interim_results – Stream partial recognition results.

  • enable_voice_activity_events – Detect voice activity in audio.
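
The precedence rule between languages and the deprecated language_codes can be illustrated with a small stand-alone sketch. It assumes that `_NotGiven` acts as a "field was not set" sentinel; the sentinel class, the `NOT_GIVEN` instance, and the helper `resolve_language_codes` are all hypothetical names written for this example, and plain strings stand in for Language enum values.

```python
from typing import List, Optional


class _NotGiven:
    """Sentinel type meaning "the caller did not set this field" (assumed semantics)."""


NOT_GIVEN = _NotGiven()


def resolve_language_codes(languages=NOT_GIVEN, language_codes=NOT_GIVEN) -> Optional[List[str]]:
    """Pick the effective list of codes; `languages` wins when both are given."""
    if not isinstance(languages, _NotGiven):
        return list(languages)
    if not isinstance(language_codes, _NotGiven) and language_codes is not None:
        return list(language_codes)
    return None  # neither field was set


both = resolve_language_codes(languages=["en-US"], language_codes=["fr-FR"])
legacy_only = resolve_language_codes(language_codes=["fr-FR"])
neither = resolve_language_codes()
```

Using a dedicated sentinel rather than None lets a settings update distinguish "leave this field alone" from "explicitly set it to None".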

languages: list[Language] | _NotGiven
language_codes: list[str] | None | _NotGiven
use_separate_recognition_per_channel: bool | _NotGiven
enable_automatic_punctuation: bool | _NotGiven
enable_spoken_punctuation: bool | _NotGiven
enable_spoken_emojis: bool | _NotGiven
profanity_filter: bool | _NotGiven
enable_word_time_offsets: bool | _NotGiven
enable_word_confidence: bool | _NotGiven
enable_interim_results: bool | _NotGiven
enable_voice_activity_events: bool | _NotGiven

class pipecat.services.google.stt.GoogleSTTService(*, credentials: str | None = None, credentials_path: str | None = None, location: str = 'global', sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleSTTSettings | None = None, ttfs_p99_latency: float | None = 1.57, **kwargs)

Bases: STTService

Google Cloud Speech-to-Text V2 service implementation.

Provides real-time speech recognition using Google Cloud’s Speech-to-Text V2 API with streaming support. Handles audio transcription and optional voice activity detection. Implements automatic stream reconnection to handle Google’s 4-minute streaming limit.

Parameters:
  • InputParams – Configuration parameters for the STT service.

  • STREAMING_LIMIT – Google Cloud’s streaming limit in milliseconds (4 minutes).

Raises:
  • ValueError – If neither credentials nor credentials_path is provided.

  • ValueError – If project ID is not found in credentials.

Settings

alias of GoogleSTTSettings

STREAMING_LIMIT = 240000
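
To make the role of the 240000 ms limit concrete, here is a rough sketch of how a stream's age could be tracked so that a reconnect is triggered once the limit is reached. The `StreamTimer` class and its method names are hypothetical; this reference does not describe the service's actual reconnection logic, which may differ (for example by reconnecting earlier).

```python
import time

STREAMING_LIMIT_MS = 240_000  # same value as STREAMING_LIMIT (4 minutes)


class StreamTimer:
    """Tracks how long a stream has been open (hypothetical helper)."""

    def __init__(self, limit_ms=STREAMING_LIMIT_MS, clock=time.monotonic):
        self._limit_ms = limit_ms
        self._clock = clock  # injectable so the timer can be tested without waiting
        self._started = clock()

    def elapsed_ms(self) -> float:
        """Milliseconds since the stream was (re)started."""
        return (self._clock() - self._started) * 1000.0

    def should_reconnect(self) -> bool:
        """True once the stream has been open for at least the limit."""
        return self.elapsed_ms() >= self._limit_ms

    def restart(self) -> None:
        """Call after opening a fresh stream."""
        self._started = self._clock()
```

A streaming loop could check `should_reconnect()` before sending each chunk and call `restart()` after opening the new stream.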
class InputParams(*, languages: Language | list[Language] = <factory>, model: str | None = 'latest_long', use_separate_recognition_per_channel: bool | None = False, enable_automatic_punctuation: bool | None = True, enable_spoken_punctuation: bool | None = False, enable_spoken_emojis: bool | None = False, profanity_filter: bool | None = False, enable_word_time_offsets: bool | None = False, enable_word_confidence: bool | None = False, enable_interim_results: bool | None = True, enable_voice_activity_events: bool | None = False)

Bases: BaseModel

Configuration parameters for Google Speech-to-Text.

Deprecated since version 0.0.105: Use settings=GoogleSTTService.Settings(...) instead.

Parameters:
  • languages – Single language or list of recognition languages. First language is primary.

  • model – Speech recognition model to use.

  • use_separate_recognition_per_channel – Process each audio channel separately.

  • enable_automatic_punctuation – Add punctuation to transcripts.

  • enable_spoken_punctuation – Include spoken punctuation in transcript.

  • enable_spoken_emojis – Include spoken emojis in transcript.

  • profanity_filter – Filter profanity from transcript.

  • enable_word_time_offsets – Include timing information for each word.

  • enable_word_confidence – Include confidence scores for each word.

  • enable_interim_results – Stream partial recognition results.

  • enable_voice_activity_events – Detect voice activity in audio.

languages: Language | list[Language]
model: str | None
use_separate_recognition_per_channel: bool | None
enable_automatic_punctuation: bool | None
enable_spoken_punctuation: bool | None
enable_spoken_emojis: bool | None
profanity_filter: bool | None
enable_word_time_offsets: bool | None
enable_word_confidence: bool | None
enable_interim_results: bool | None
enable_voice_activity_events: bool | None

classmethod validate_languages(v) → list[Language]

Ensure languages is always a list.

Parameters:

v – Single Language enum or list of Language enums.

Returns:

List of configured languages.

Return type:

List[Language]
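
The normalization this validator performs (a single value becomes a one-element list, a list stays a list) might look roughly like the sketch below. It is written as a plain function so it runs without Pydantic; the `Language` enum is a hypothetical two-member stand-in and `ensure_language_list` is an illustrative name, not the library's implementation.

```python
from enum import Enum
from typing import List, Union


class Language(str, Enum):
    """Hypothetical stand-in for Pipecat's Language enum (illustration only)."""
    EN_US = "en-US"
    ES_ES = "es-ES"


def ensure_language_list(v: Union[Language, List[Language]]) -> List[Language]:
    """Wrap a single Language in a list; return a copy of an existing list."""
    # Check for the enum first: a str-based enum member is itself iterable,
    # so calling list() on it would split it into characters.
    if isinstance(v, Language):
        return [v]
    return list(v)


single = ensure_language_list(Language.EN_US)
several = ensure_language_list([Language.ES_ES, Language.EN_US])
```

In the second call the first element, `Language.ES_ES`, would be treated as the primary language.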

property language_list: list[Language]

Get languages as a guaranteed list.

Returns:

List of configured languages.

Return type:

List[Language]

__init__(*, credentials: str | None = None, credentials_path: str | None = None, location: str = 'global', sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleSTTSettings | None = None, ttfs_p99_latency: float | None = 1.57, **kwargs)

Initialize the Google STT service.

Parameters:
  • credentials – JSON string containing Google Cloud service account credentials.

  • credentials_path – Path to service account credentials JSON file.

  • location – Google Cloud location (e.g., “global”, “us-central1”).

  • sample_rate – Audio sample rate in Hertz.

  • params

    Configuration parameters for the service.

    Deprecated since version 0.0.105: Use settings=GoogleSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated params, settings values take precedence.

  • ttfs_p99_latency – P99 latency, in seconds, from the end of speech to the final transcript. Override this value to match your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to STTService.
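
The two ValueError cases documented for this class can be illustrated with a stand-alone sketch of credential resolution. It relies on the fact that a Google Cloud service account JSON file carries a `project_id` key; the helper name `resolve_project_id` and the choice to prefer credentials over credentials_path when both are given are assumptions made for this example, not confirmed behavior of the service.

```python
import json
from typing import Optional


def resolve_project_id(credentials: Optional[str] = None,
                       credentials_path: Optional[str] = None) -> str:
    """Load service account info and return its project ID (illustrative only)."""
    if credentials:
        # JSON string passed directly (assumed to take priority in this sketch)
        info = json.loads(credentials)
    elif credentials_path:
        with open(credentials_path, encoding="utf-8") as f:
            info = json.load(f)
    else:
        raise ValueError("Provide either credentials or credentials_path")

    project_id = info.get("project_id")
    if not project_id:
        raise ValueError("Project ID not found in credentials")
    return project_id


project = resolve_project_id(credentials=json.dumps({"project_id": "demo-project"}))
```

Validating both conditions up front means configuration mistakes surface at construction time rather than when the first audio chunk is sent.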

can_generate_metrics() → bool

Check if the service can generate metrics.

Returns:

True, as this service supports metrics generation.

Return type:

bool

language_to_service_language(language: Language | list[Language]) → str | list[str]

Convert Language enum(s) to Google STT language code(s).

Parameters:

language – Single Language enum or list of Language enums.

Returns:

Google STT language code(s).

Return type:

str | List[str]

async set_languages(languages: list[Language])

Update the service’s recognition languages.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame with GoogleSTTService.Settings(languages=...) instead.

Parameters:

languages – List of languages for recognition. First language is primary.

async start(frame: StartFrame)

Start the STT service and establish connection.

Parameters:

frame – The start frame triggering the service start.

async stop(frame: EndFrame)

Stop the STT service and clean up resources.

Parameters:

frame – The end frame triggering the service stop.

async cancel(frame: CancelFrame)

Cancel the STT service and clean up resources.

Parameters:

frame – The cancel frame triggering the service cancellation.

async update_options(*, languages: list[Language] | None = None, model: str | None = None, enable_automatic_punctuation: bool | None = None, enable_spoken_punctuation: bool | None = None, enable_spoken_emojis: bool | None = None, profanity_filter: bool | None = None, enable_word_time_offsets: bool | None = None, enable_word_confidence: bool | None = None, enable_interim_results: bool | None = None, enable_voice_activity_events: bool | None = None, location: str | None = None) → None

Update service options dynamically.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame with GoogleSTTService.Settings(...) instead.

Parameters:
  • languages – New list of recognition languages.

  • model – New recognition model.

  • enable_automatic_punctuation – Enable/disable automatic punctuation.

  • enable_spoken_punctuation – Enable/disable spoken punctuation.

  • enable_spoken_emojis – Enable/disable spoken emojis.

  • profanity_filter – Enable/disable profanity filter.

  • enable_word_time_offsets – Enable/disable word timing info.

  • enable_word_confidence – Enable/disable word confidence scores.

  • enable_interim_results – Enable/disable interim results.

  • enable_voice_activity_events – Enable/disable voice activity detection.

  • location – New Google Cloud location.

Note

Changes that affect the streaming configuration will cause the stream to be reconnected.
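
The update pattern implied by the signature (every option defaults to None) and by the note above can be sketched as follows. The sketch assumes that None means "leave this option unchanged" and uses a returned flag to signal that a reconnect is needed; `apply_option_updates` is a hypothetical helper, and the option values are illustrative.

```python
from typing import Any, Dict


def apply_option_updates(current: Dict[str, Any], **updates: Any) -> bool:
    """Merge non-None updates into `current`; return True if anything changed."""
    changed = False
    for name, value in updates.items():
        if value is None:
            continue  # None is treated as "keep the current value" (assumption)
        if current.get(name) != value:
            current[name] = value
            changed = True
    return changed


options = {"model": "latest_long", "enable_interim_results": True}
unchanged = apply_option_updates(options, model=None, enable_interim_results=True)
needs_reconnect = apply_option_updates(options, model="latest_short", profanity_filter=None)
```

Only the second call changes the configuration, so only it would warrant reopening the stream.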

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process an audio chunk for STT transcription.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

Frame – Always None. This generator yields no transcription frames itself; they are pushed via internal processing.
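
The "accept audio, yield None" shape of this method can be illustrated with a stand-alone asyncio sketch. `QueueingSTT` and `feed` are hypothetical names; how the real service forwards audio to Google and pushes transcription frames is not covered by this reference, so the list-based hand-off below is only an assumption for demonstration.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class QueueingSTT:
    """Hypothetical stand-in showing the 'store audio, yield None' pattern."""

    def __init__(self) -> None:
        self.pending: List[bytes] = []  # chunks waiting for the streaming side

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        self.pending.append(audio)  # hand the chunk off for later processing
        yield None                  # no transcription is produced here


async def feed(chunks: List[bytes]) -> List[Optional[str]]:
    """Send each chunk through run_stt and collect whatever it yields."""
    stt = QueueingSTT()
    yielded: List[Optional[str]] = []
    for chunk in chunks:
        async for item in stt.run_stt(chunk):
            yielded.append(item)
    return yielded


results = asyncio.run(feed([b"\x00\x01", b"\x02\x03"]))
```

Each call yields exactly one None, so a consumer iterating over `run_stt` should not expect transcripts from it.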