tts
Google Cloud Text-to-Speech service implementations.
This module provides integration with Google Cloud Text-to-Speech API, offering both HTTP-based synthesis with SSML support and streaming synthesis for real-time applications.
It also includes GeminiTTSService, which uses Gemini’s TTS-specific models for natural voice control and multi-speaker conversations.
- pipecat.services.google.tts.language_to_google_tts_language(language: Language) → str | None
Convert a Language enum to Google TTS language code.
Source: https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding Google TTS language code, or None if not supported.
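The lookup-or-None contract described here can be sketched without pipecat installed. Everything in this block is a hypothetical stand-in: `Lang` is not pipecat's `Language` enum, `_SUPPORTED` is not the real mapping table, and `pick_language` is just one way a caller might handle the `None` case.

```python
from enum import Enum
from typing import Optional


class Lang(Enum):
    """Hypothetical stand-in for pipecat's Language enum."""

    EN_US = "en-US"
    FR_FR = "fr-FR"
    TLH = "tlh"  # a language the made-up table below does not cover


# Hypothetical table; the real mapping lives in pipecat and follows Google's docs.
_SUPPORTED = {Lang.EN_US: "en-US", Lang.FR_FR: "fr-FR"}


def to_service_language(language: Lang) -> Optional[str]:
    """Return the service language code, or None when it is not supported."""
    return _SUPPORTED.get(language)


def pick_language(language: Lang, default: str = "en-US") -> str:
    """Caller-side handling of the None case: fall back to a default code."""
    code = to_service_language(language)
    return code if code is not None else default
```

Returning `None` instead of raising leaves the fallback policy to the caller, which is why the sketch keeps `pick_language` separate from the lookup.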
- pipecat.services.google.tts.language_to_gemini_tts_language(language: Language) → str | None
Convert a Language enum to Gemini TTS language code.
Source: https://docs.cloud.google.com/text-to-speech/docs/gemini-tts#available_languages
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding Gemini TTS language code, or None if not supported.
- class pipecat.services.google.tts.GoogleHttpTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, pitch: str | None | _NotGiven = <factory>, rate: str | None | _NotGiven = <factory>, speaking_rate: float | None | _NotGiven = <factory>, volume: str | None | _NotGiven = <factory>, emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None | _NotGiven = <factory>, gender: Literal['male', 'female', 'neutral'] | None | _NotGiven = <factory>, google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None | _NotGiven = <factory>)
Bases: TTSSettings
Settings for GoogleHttpTTSService.
- Parameters:
pitch – Voice pitch adjustment (e.g., “+2st”, “-50%”).
rate – Speaking rate adjustment (e.g., “slow”, “fast”, “125%”). Used for SSML prosody tags (non-Chirp voices).
speaking_rate – Speaking rate for AudioConfig (Chirp/Journey voices). Range [0.25, 2.0].
volume – Volume adjustment (e.g., “loud”, “soft”, “+6dB”).
emphasis – Emphasis level for the text.
gender – Voice gender preference.
google_style – Google-specific voice style.
- pitch: str | None | _NotGiven
- rate: str | None | _NotGiven
- speaking_rate: float | None | _NotGiven
- volume: str | None | _NotGiven
- emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None | _NotGiven
- gender: Literal['male', 'female', 'neutral'] | None | _NotGiven
- google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None | _NotGiven
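To illustrate how string-valued options such as pitch, rate, volume, and emphasis can translate into SSML, here is a minimal sketch using only the standard library. `build_ssml` is a hypothetical helper, not pipecat's implementation; the service's actual markup, including how it handles gender and google_style, may differ.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    pitch: Optional[str] = None,
    rate: Optional[str] = None,
    volume: Optional[str] = None,
    emphasis: Optional[str] = None,
) -> str:
    """Wrap text in SSML emphasis/prosody tags for the options that are set."""
    # Escape &, <, > so user text cannot break the surrounding markup.
    body = escape(text)
    if emphasis is not None:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    # Only emit prosody attributes that were actually provided.
    attrs = [
        f"{name}={quoteattr(value)}"
        for name, value in (("pitch", pitch), ("rate", rate), ("volume", volume))
        if value is not None
    ]
    if attrs:
        body = f"<prosody {' '.join(attrs)}>{body}</prosody>"
    return f"<speak>{body}</speak>"
```

For example, `build_ssml("Hello", pitch="+2st", rate="slow")` wraps the text in a single `<prosody>` element carrying both attributes, while a call with no options yields a bare `<speak>` wrapper.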
- class pipecat.services.google.tts.GoogleTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, speaking_rate: float | None | _NotGiven = <factory>)
Bases: TTSSettings
Settings for GoogleTTSService.
- Parameters:
speaking_rate – The speaking rate, in the range [0.25, 2.0].
- speaking_rate: float | None | _NotGiven
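Because speaking_rate is documented as lying in [0.25, 2.0], a caller may want to check a value before passing it in. The helper below is a hypothetical sketch; whether the service itself validates, clamps, or forwards out-of-range values is not stated here.

```python
from typing import Optional

MIN_SPEAKING_RATE = 0.25  # lower bound from the documented range
MAX_SPEAKING_RATE = 2.0   # upper bound from the documented range


def validate_speaking_rate(rate: Optional[float]) -> Optional[float]:
    """Return the rate unchanged if it is None or within range, else raise."""
    if rate is None:
        return None  # None means "not set"
    if not MIN_SPEAKING_RATE <= rate <= MAX_SPEAKING_RATE:
        raise ValueError(
            f"speaking_rate must be in [{MIN_SPEAKING_RATE}, {MAX_SPEAKING_RATE}], got {rate}"
        )
    return float(rate)
```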
- pipecat.services.google.tts.GoogleStreamTTSSettings
Deprecated since version 0.0.105: Use GoogleTTSService.Settings instead.
- class pipecat.services.google.tts.GeminiTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, multi_speaker: bool | _NotGiven = <factory>, speaker_configs: list[dict[str, Any]] | None | _NotGiven = <factory>)
Bases: TTSSettings
Settings for GeminiTTSService.
- Parameters:
prompt – Optional style instructions for how to synthesize the content.
multi_speaker – Whether to enable multi-speaker support.
speaker_configs – List of speaker configurations for multi-speaker mode.
- prompt: str | None | _NotGiven
- multi_speaker: bool | _NotGiven
- speaker_configs: list[dict[str, Any]] | None | _NotGiven
- class pipecat.services.google.tts.GoogleHttpTTSService(*, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleHttpTTSSettings | None = None, **kwargs)
Bases: TTSService
Google Cloud Text-to-Speech HTTP service with SSML support.
Provides text-to-speech synthesis using Google Cloud’s HTTP API with comprehensive SSML support for voice customization, prosody control, and styling options. Ideal for applications requiring fine-grained control over speech output.
Note
Requires Google Cloud credentials via service account JSON, credentials file, or default application credentials (GOOGLE_APPLICATION_CREDENTIALS). Chirp and Journey voices don’t support SSML and will use plain text input.
- Settings
alias of GoogleHttpTTSSettings
- class InputParams(*, pitch: str | None = None, rate: str | None = None, speaking_rate: float | None = None, volume: str | None = None, emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None = None, language: Language | None = Language.EN, gender: Literal['male', 'female', 'neutral'] | None = None, google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None = None)
Bases: BaseModel
Input parameters for Google HTTP TTS voice customization.
Deprecated since version 0.0.105: Use GoogleHttpTTSService.Settings directly via the settings parameter instead.
- Parameters:
pitch – Voice pitch adjustment (e.g., “+2st”, “-50%”).
rate – Speaking rate adjustment (e.g., “slow”, “fast”, “125%”). Used for SSML prosody tags (non-Chirp voices).
speaking_rate – Speaking rate for AudioConfig (Chirp/Journey voices). Range [0.25, 2.0].
volume – Volume adjustment (e.g., “loud”, “soft”, “+6dB”).
emphasis – Emphasis level for the text.
language – Language for synthesis. Defaults to English.
gender – Voice gender preference.
google_style – Google-specific voice style.
- pitch: str | None
- rate: str | None
- speaking_rate: float | None
- volume: str | None
- emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None
- gender: Literal['male', 'female', 'neutral'] | None
- google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None
- __init__(*, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleHttpTTSSettings | None = None, **kwargs)
Initializes the Google HTTP TTS service.
- Parameters:
credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Google TTS voice identifier (e.g., “en-US-Standard-A”).
Deprecated since version 0.0.105: Use settings=GoogleHttpTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses default.
params – Voice customization parameters including pitch, rate, volume, etc.
Deprecated since version 0.0.105: Use settings=GoogleHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
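The precedence rule for settings versus the deprecated parameters can be pictured as a layered merge in which "not given" fields never overwrite earlier layers. The sketch below is an assumption about the general pattern, not pipecat's code: `NOT_GIVEN`, `SketchSettings`, and `resolve` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


class _NotGivenType:
    """Sentinel type, distinct from None so that None stays a valid value."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


@dataclass
class SketchSettings:
    """Hypothetical two-field settings object."""

    voice: Any = NOT_GIVEN
    pitch: Any = NOT_GIVEN


def resolve(
    defaults: SketchSettings,
    deprecated: SketchSettings,
    settings: Optional[SketchSettings],
) -> SketchSettings:
    """Apply layers in order: defaults, deprecated params, then settings."""
    result = SketchSettings(**vars(defaults))
    for layer in (deprecated, settings):
        if layer is None:
            continue
        for f in fields(layer):
            value = getattr(layer, f.name)
            if value is not NOT_GIVEN:  # skip fields the caller never set
                setattr(result, f.name, value)
    return result
```

With a deprecated layer of `voice="en-US-Standard-A", pitch="+2st"` and a settings layer that sets only `voice`, the merged result takes its voice from settings and keeps the pitch from the deprecated layer.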
- can_generate_metrics() → bool
Check if this service can generate processing metrics.
- Returns:
True, as Google HTTP TTS service supports metrics generation.
- language_to_service_language(language: Language) → str | None
Convert a Language enum to Google TTS language format.
- Parameters:
language – The language to convert.
- Returns:
The Google TTS-specific language code, or None if not supported.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]
Generate speech from text using Google’s HTTP TTS API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech.
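Since run_tts is an async generator, its frames are consumed with `async for`. The sketch below shows that consumption pattern with a stand-in generator so that it runs without Google credentials; `FakeAudioFrame`, `fake_run_tts`, and `collect` are hypothetical and do not reproduce the real frame types or synthesis.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class FakeAudioFrame:
    """Hypothetical stand-in for an audio frame."""

    audio: bytes
    context_id: str


async def fake_run_tts(text: str, context_id: str) -> AsyncGenerator[FakeAudioFrame, None]:
    """Yield one fake frame per word, mimicking incremental audio delivery."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield FakeAudioFrame(audio=word.encode("utf-8"), context_id=context_id)


async def collect(text: str, context_id: str) -> List[FakeAudioFrame]:
    """Drain the generator into a list; real code would forward frames instead."""
    return [frame async for frame in fake_run_tts(text, context_id)]
```

Calling `asyncio.run(collect("hello there world", "ctx-1"))` returns three frames, each tagged with the same context ID.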
- class pipecat.services.google.tts.GoogleBaseTTSService(*, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, push_text_frames: bool = True, push_stop_frames: bool = False, push_start_frame: bool = False, stop_frame_timeout_s: float = 3.0, push_silence_after_stop: bool = False, silence_time_s: float = 2.0, pause_frame_processing: bool = False, append_trailing_space: bool = False, sample_rate: int | None = None, skip_aggregator_types: list[str] | None = [], text_transforms: list[tuple[AggregationType | str, Callable[[str, str | AggregationType], Awaitable[str]]]] | None = None, text_filters: Sequence[BaseTextFilter] | None = None, transport_destination: str | None = None, settings: TTSSettings | None = None, reuse_context_id_within_turn: bool = True, **kwargs)
Bases: TTSService
Base class for Google Cloud Text-to-Speech streaming services.
Provides shared streaming synthesis logic for Google TTS services. This is an abstract base class. Use GoogleTTSService or GeminiTTSService instead.
- class pipecat.services.google.tts.GoogleTTSService(*, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, voice_cloning_key: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleTTSSettings | None = None, **kwargs)
Bases: GoogleBaseTTSService
Google Cloud Text-to-Speech streaming service.
Provides real-time text-to-speech synthesis using Google Cloud’s streaming API for low-latency applications. Optimized for Chirp 3 HD and Journey voices with continuous audio streaming capabilities.
Note
Requires Google Cloud credentials via service account JSON, file path, or default application credentials (GOOGLE_APPLICATION_CREDENTIALS env var). Only Chirp 3 HD and Journey voices are supported. Use GoogleHttpTTSService for other voices.
Example:
tts = GoogleTTSService(
    credentials_path="/path/to/service-account.json",
    settings=GoogleTTSService.Settings(
        voice="en-US-Chirp3-HD-Charon",
        language=Language.EN_US,
    ),
)
- Settings
alias of GoogleTTSSettings
- class InputParams(*, language: Language | None = Language.EN, speaking_rate: float | None = None)
Bases: BaseModel
Input parameters for Google streaming TTS configuration.
Deprecated since version 0.0.105: Use GoogleTTSService.Settings directly via the settings parameter instead.
- Parameters:
language – Language for synthesis. Defaults to English.
speaking_rate – The speaking rate, in the range [0.25, 2.0].
- speaking_rate: float | None
- __init__(*, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, voice_cloning_key: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GoogleTTSSettings | None = None, **kwargs)
Initializes the Google streaming TTS service.
- Parameters:
credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Google TTS voice identifier (e.g., “en-US-Chirp3-HD-Charon”).
Deprecated since version 0.0.105: Use settings=GoogleTTSService.Settings(voice=...) instead.
voice_cloning_key – The voice cloning key for Chirp 3 custom voices.
sample_rate – Audio sample rate in Hz. If None, uses default.
params – Language configuration parameters.
Deprecated since version 0.0.105: Use settings=GoogleTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]
Generate streaming speech from text using Google’s streaming API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech as it’s generated.
- class pipecat.services.google.tts.GeminiTTSService(*, model: str | None = None, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GeminiTTSSettings | None = None, **kwargs)
Bases: GoogleBaseTTSService
Gemini Text-to-Speech streaming service using Gemini TTS models.
Provides real-time text-to-speech synthesis using Gemini’s TTS-specific models (gemini-2.5-flash-tts and gemini-2.5-pro-tts) with support for natural voice control, prompts for style instructions, expressive markup tags, and multi-speaker conversations.
Note
Requires Google Cloud credentials via service account JSON, credentials file, or default application credentials (GOOGLE_APPLICATION_CREDENTIALS).
Uses the Google Cloud Text-to-Speech streaming API for low-latency synthesis.
Example:
tts = GeminiTTSService(
    credentials_path="/path/to/service-account.json",
    settings=GeminiTTSService.Settings(
        model="gemini-2.5-flash-tts",
        voice="Kore",
        language=Language.EN_US,
        prompt="Say this in a friendly and helpful tone",
    ),
)
- Settings
alias of GeminiTTSSettings
- GOOGLE_SAMPLE_RATE = 24000
- AVAILABLE_VOICES = ['Achernar', 'Achird', 'Algenib', 'Algieba', 'Alnilam', 'Aoede', 'Autonoe', 'Callirhoe', 'Charon', 'Despina', 'Enceladus', 'Erinome', 'Fenrir', 'Gacrux', 'Iapetus', 'Kore', 'Laomedeia', 'Leda', 'Orus', 'Puck', 'Pulcherrima', 'Rasalgethi', 'Sadachbia', 'Sadaltager', 'Schedar', 'Sulafar', 'Umbriel', 'Vindemiatrix', 'Zephyr', 'Zubenelgenubi']
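As a worked example of what a 24000 Hz sample rate means for buffer sizes, the sketch below converts a byte count into a duration. The 16-bit mono PCM format is an assumption made for the arithmetic and is not stated in this reference; `pcm_duration_seconds` and `is_known_voice` are hypothetical helpers, and the voice tuple is only a small subset of the list above.

```python
from typing import Sequence

GOOGLE_SAMPLE_RATE = 24000  # value documented for GeminiTTSService
BYTES_PER_SAMPLE = 2        # assumption: 16-bit PCM
CHANNELS = 1                # assumption: mono audio


def pcm_duration_seconds(num_bytes: int, sample_rate: int = GOOGLE_SAMPLE_RATE) -> float:
    """Duration of a raw PCM buffer under the assumptions above."""
    return num_bytes / (sample_rate * BYTES_PER_SAMPLE * CHANNELS)


def is_known_voice(voice: str, available: Sequence[str]) -> bool:
    """Case-sensitive membership check against a list of voice names."""
    return voice in available


SOME_VOICES = ("Charon", "Kore", "Puck")  # subset of AVAILABLE_VOICES
```

Under these assumptions one second of audio occupies 24000 × 2 × 1 = 48000 bytes, so a 24000-byte chunk lasts half a second.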
- class InputParams(*, language: Language | None = Language.EN, prompt: str | None = None, multi_speaker: bool = False, speaker_configs: list[dict] | None = None)
Bases: BaseModel
Input parameters for Gemini TTS configuration.
Deprecated since version 0.0.105: Use GeminiTTSService.Settings directly via the settings parameter instead.
- Parameters:
language – Language for synthesis. Defaults to English.
prompt – Optional style instructions for how to synthesize the content.
multi_speaker – Whether to enable multi-speaker support.
speaker_configs – List of speaker configurations for multi-speaker mode.
- prompt: str | None
- multi_speaker: bool
- speaker_configs: list[dict] | None
- __init__(*, model: str | None = None, credentials: str | None = None, credentials_path: str | None = None, location: str | None = None, voice_id: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: GeminiTTSSettings | None = None, **kwargs)
Initializes the Gemini TTS service.
- Parameters:
model – Gemini TTS model to use. Must be a TTS model like “gemini-2.5-flash-tts” or “gemini-2.5-pro-tts”.
Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(model=...) instead.
credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Voice name from the available Gemini voices.
Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses Google’s default of 24 kHz.
params – TTS configuration parameters.
Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
- language_to_service_language(language: Language) → str | None
Convert a Language enum to Gemini TTS language format.
- Parameters:
language – The language to convert.
- Returns:
The Gemini TTS-specific language code, or None if not supported.
- async start(frame: StartFrame)
Start the Gemini TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]
Generate streaming speech from text using Gemini TTS models.
- Parameters:
text – The text to synthesize into speech. Can include markup tags like [sigh], [laughing], [whispering] for expressive control.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech as it’s generated.
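The bracketed markup tags mentioned for run_tts, such as [sigh] or [whispering], are embedded directly in the text. The sketch below shows one way to locate or strip them before logging or displaying the text. The tag grammar it uses (lowercase words inside square brackets) is an assumption inferred from the three examples, and both helpers are hypothetical.

```python
import re
from typing import List

# Assumption: a tag is one or more lowercase words inside square brackets.
_TAG = re.compile(r"\[([a-z][a-z ]*)\]")


def find_markup_tags(text: str) -> List[str]:
    """Return the tag names found in the text, in order of appearance."""
    return _TAG.findall(text)


def strip_markup_tags(text: str) -> str:
    """Remove tags and collapse leftover whitespace, e.g. for transcripts."""
    without_tags = _TAG.sub("", text)
    return " ".join(without_tags.split())
```

The text sent for synthesis would keep the tags; the stripped form is only useful where the tags should not be shown, such as a transcript.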