tts

Google Cloud Text-to-Speech service implementations.

This module provides integration with Google Cloud Text-to-Speech API, offering both HTTP-based synthesis with SSML support and streaming synthesis for real-time applications.

It also includes GeminiTTSService, which uses Gemini’s TTS-specific models for natural voice control and multi-speaker conversations.

pipecat.services.google.tts.language_to_google_tts_language(language: Language) → str | None[source]

Convert a Language enum to Google TTS language code.

Source: https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd

Parameters: language – The Language enum value to convert.
Returns: The corresponding Google TTS language code, or None if not supported.
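Because the function returns None for unsupported languages, callers need a fallback. The sketch below illustrates that contract with a stand-in Language enum and a tiny, hypothetical mapping table; it is not pipecat's actual table, and the names to_tts_language and _CODES are invented for the example.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (a few members only)."""

    EN = "en"
    EN_US = "en-US"
    FR = "fr"
    TLH = "tlh"  # hypothetical member with no TTS support


# Hypothetical subset of a language table; the real one is larger.
_CODES = {
    Language.EN: "en-US",
    Language.EN_US: "en-US",
    Language.FR: "fr-FR",
}


def to_tts_language(language: Language) -> Optional[str]:
    """Return the TTS language code, or None if the language is unsupported."""
    return _CODES.get(language)


def language_or_default(language: Language, default: str = "en-US") -> str:
    """Fall back to a default code when the lookup returns None."""
    return to_tts_language(language) or default
```

Handling the None case explicitly keeps an unsupported language from being passed to the API as an empty value.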

pipecat.services.google.tts.language_to_gemini_tts_language(language: Language) → str | None[source]

Convert a Language enum to Gemini TTS language code.

Source: https://docs.cloud.google.com/text-to-speech/docs/gemini-tts#available_languages

Parameters: language – The Language enum value to convert.
Returns: The corresponding Gemini TTS language code, or None if not supported.

Bases: TTSSettings

Settings for GoogleHttpTTSService.

Parameters:

pitch – Voice pitch adjustment (e.g., “+2st”, “-50%”).
rate – Speaking rate adjustment (e.g., “slow”, “fast”, “125%”). Used for SSML prosody tags (non-Chirp voices).
speaking_rate – Speaking rate for AudioConfig (Chirp/Journey voices). Range [0.25, 2.0].
volume – Volume adjustment (e.g., “loud”, “soft”, “+6dB”).
emphasis – Emphasis level for the text.
gender – Voice gender preference.
google_style – Google-specific voice style.

pitch: str | None | _NotGiven

rate: str | None | _NotGiven

speaking_rate: float | None | _NotGiven

volume: str | None | _NotGiven

emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None | _NotGiven

gender: Literal['male', 'female', 'neutral'] | None | _NotGiven

google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None | _NotGiven
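To make the prosody-related settings concrete, here is a minimal sketch of how pitch, rate, volume, and emphasis values could be turned into standard SSML prosody and emphasis elements. The build_ssml helper is hypothetical and written for illustration only; the service's real SSML generation may differ in structure and in how it handles gender and google_style.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    pitch: Optional[str] = None,
    rate: Optional[str] = None,
    volume: Optional[str] = None,
    emphasis: Optional[str] = None,
) -> str:
    """Wrap text in SSML emphasis/prosody elements for the values that are set."""
    body = escape(text)  # escape &, <, > so the text is valid XML
    if emphasis:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    attrs = {"pitch": pitch, "rate": rate, "volume": volume}
    attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items() if v)
    if attr_str:
        body = f"<prosody{attr_str}>{body}</prosody>"
    return f"<speak>{body}</speak>"
```

With no settings given, the helper returns the escaped text inside a bare speak element.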

Bases: TTSSettings

Settings for GoogleTTSService.

Parameters: speaking_rate – The speaking rate, in the range [0.25, 2.0].

speaking_rate: float | None | _NotGiven
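The documented range for speaking_rate can be checked before a value is passed to the service. This is a small sketch with a hypothetical helper name; whether the library itself validates or clamps the value is not stated here, so the check is an assumption about what a caller might want.

```python
from typing import Optional

MIN_SPEAKING_RATE = 0.25
MAX_SPEAKING_RATE = 2.0


def check_speaking_rate(rate: Optional[float]) -> Optional[float]:
    """Return rate unchanged if it is None or within [0.25, 2.0], else raise."""
    if rate is None:
        return None  # None means "use the service default"
    if not (MIN_SPEAKING_RATE <= rate <= MAX_SPEAKING_RATE):
        raise ValueError(
            f"speaking_rate must be in [{MIN_SPEAKING_RATE}, {MAX_SPEAKING_RATE}], got {rate}"
        )
    return rate
```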

pipecat.services.google.tts.GoogleStreamTTSSettings: Deprecated since 0.0.105: Use GoogleTTSService.Settings instead.

Bases: TTSSettings

Settings for GeminiTTSService.

Parameters:

prompt – Optional style instructions for how to synthesize the content.
multi_speaker – Whether to enable multi-speaker support.
speaker_configs – List of speaker configurations for multi-speaker mode.

prompt: str | None | _NotGiven

multi_speaker: bool | _NotGiven

speaker_configs: list[dict[str, Any]] | None | _NotGiven
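Since multi_speaker and speaker_configs work together, a caller may want to check that they are consistent before building the settings. The sketch below assumes that enabling multi-speaker mode requires at least one speaker configuration; that rule and the check_speaker_settings helper are assumptions, and the dictionary contents are treated as opaque because their keys are not documented here.

```python
from typing import Any, Optional


def check_speaker_settings(
    multi_speaker: bool,
    speaker_configs: Optional[list[dict[str, Any]]],
) -> None:
    """Raise if multi-speaker mode is enabled without any speaker configs."""
    if multi_speaker and not speaker_configs:
        raise ValueError(
            "multi_speaker=True needs at least one entry in speaker_configs"
        )
```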

Bases: TTSService

Google Cloud Text-to-Speech HTTP service with SSML support.

Provides text-to-speech synthesis using Google Cloud’s HTTP API with comprehensive SSML support for voice customization, prosody control, and styling options. Ideal for applications requiring fine-grained control over speech output.

Note

Requires Google Cloud credentials, supplied as a service account JSON string, a credentials file path, or Application Default Credentials (for example via the GOOGLE_APPLICATION_CREDENTIALS environment variable). Chirp and Journey voices do not support SSML; for these voices the service sends plain text input.
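The SSML restriction for Chirp and Journey voices can be illustrated with a simple name check. The sketch assumes the voice family can be recognized from the voice identifier (as in “en-US-Chirp3-HD-Charon”); the supports_ssml helper is hypothetical, and the service's own detection logic may differ.

```python
def supports_ssml(voice: str) -> bool:
    """Guess SSML support from the voice name (assumed naming convention)."""
    name = voice.lower()
    return "chirp" not in name and "journey" not in name
```

A caller could use such a check to decide whether prosody settings like pitch or emphasis will have any effect for the selected voice.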

Settings: alias of GoogleHttpTTSSettings

class InputParams(*, pitch: str | None = None, rate: str | None = None, speaking_rate: float | None = None, volume: str | None = None, emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None = None, language: Language | None = Language.EN, gender: Literal['male', 'female', 'neutral'] | None = None, google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None = None)[source]

Bases: BaseModel

Input parameters for Google HTTP TTS voice customization.

Deprecated since version 0.0.105: Use GoogleHttpTTSService.Settings directly via the settings parameter instead.

Parameters:

pitch – Voice pitch adjustment (e.g., “+2st”, “-50%”).
rate – Speaking rate adjustment (e.g., “slow”, “fast”, “125%”). Used for SSML prosody tags (non-Chirp voices).
speaking_rate – Speaking rate for AudioConfig (Chirp/Journey voices). Range [0.25, 2.0].
volume – Volume adjustment (e.g., “loud”, “soft”, “+6dB”).
emphasis – Emphasis level for the text.
language – Language for synthesis. Defaults to English.
gender – Voice gender preference.
google_style – Google-specific voice style.

pitch: str | None

rate: str | None

speaking_rate: float | None

volume: str | None

emphasis: Literal['strong', 'moderate', 'reduced', 'none'] | None

language: Language | None

gender: Literal['male', 'female', 'neutral'] | None

google_style: Literal['apologetic', 'calm', 'empathetic', 'firm', 'lively'] | None

Initializes the Google HTTP TTS service.

Parameters:

credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Google TTS voice identifier (e.g., “en-US-Standard-A”). Deprecated since version 0.0.105: Use settings=GoogleHttpTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses default.
params – Voice customization parameters including pitch, rate, volume, etc. Deprecated since version 0.0.105: Use settings=GoogleHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
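The precedence rule for settings over deprecated parameters can be sketched with a sentinel that distinguishes "not given" from an explicit None. Everything below (the _NotGiven stand-in, the Settings dataclass, the resolve helper, and the "default-voice" value) is hypothetical and only illustrates the idea; it is not the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Any


class _NotGiven:
    """Sentinel type meaning "the caller did not set this field"."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class Settings:
    voice: Any = NOT_GIVEN
    pitch: Any = NOT_GIVEN


def resolve(settings: Settings, deprecated: dict, defaults: dict) -> dict:
    """Merge values: defaults, then deprecated params, then settings on top."""
    resolved = dict(defaults)
    resolved.update({k: v for k, v in deprecated.items() if v is not None})
    for f in fields(settings):
        value = getattr(settings, f.name)
        if value is not NOT_GIVEN:  # an explicit None still overrides
            resolved[f.name] = value
    return resolved
```

Using a sentinel rather than None lets a caller reset a field to None deliberately, while untouched fields keep whatever the deprecated parameters or defaults supplied.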

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as Google HTTP TTS service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Google TTS language format.

Parameters: language – The language to convert.
Returns: The Google TTS-specific language code, or None if not supported.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]

Generate speech from text using Google’s HTTP TTS API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.
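Since run_tts is an async generator, frames are consumed with async iteration. In a pipeline the framework normally drives this method, so the sketch below uses a stand-in generator and a hypothetical AudioFrame type purely to show the consumption pattern, without calling Google's API.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class AudioFrame:
    """Hypothetical frame type carrying audio bytes and a context ID."""

    audio: bytes
    context_id: str


async def fake_run_tts(text: str, context_id: str) -> AsyncGenerator[AudioFrame, None]:
    """Stand-in for run_tts: yields one fake frame per word."""
    for word in text.split():
        await asyncio.sleep(0)  # give control back to the event loop
        yield AudioFrame(audio=word.encode(), context_id=context_id)


async def collect(text: str, context_id: str) -> list:
    """Gather all frames produced for one piece of text."""
    return [frame async for frame in fake_run_tts(text, context_id)]


if __name__ == "__main__":
    for frame in asyncio.run(collect("hello there", "ctx-1")):
        print(frame.context_id, len(frame.audio))
```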

class pipecat.services.google.tts.GoogleBaseTTSService(*, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, push_text_frames: bool = True, push_stop_frames: bool = False, push_start_frame: bool = False, stop_frame_timeout_s: float = 3.0, push_silence_after_stop: bool = False, silence_time_s: float = 2.0, pause_frame_processing: bool = False, append_trailing_space: bool = False, sample_rate: int | None = None, skip_aggregator_types: list[str] | None = [], text_transforms: list[tuple[AggregationType | str, Callable[[str, str | AggregationType], Awaitable[str]]]] | None = None, text_filters: Sequence[BaseTextFilter] | None = None, transport_destination: str | None = None, settings: TTSSettings | None = None, reuse_context_id_within_turn: bool = True, **kwargs)[source]

Bases: TTSService

Base class for Google Cloud Text-to-Speech streaming services.

Provides shared streaming synthesis logic for Google TTS services. This is an abstract base class. Use GoogleTTSService or GeminiTTSService instead.

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as Google streaming TTS services support metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Google TTS language format.

Parameters: language – The language to convert.
Returns: The Google TTS-specific language code, or None if not supported.

Bases: GoogleBaseTTSService

Google Cloud Text-to-Speech streaming service.

Provides real-time text-to-speech synthesis using Google Cloud’s streaming API for low-latency applications. Optimized for Chirp 3 HD and Journey voices with continuous audio streaming capabilities.

Note

Requires Google Cloud credentials, supplied as a service account JSON string, a credentials file path, or Application Default Credentials (for example via the GOOGLE_APPLICATION_CREDENTIALS environment variable). Only Chirp 3 HD and Journey voices are supported; use GoogleHttpTTSService for other voices.

Example:

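from pipecat.services.google.tts import GoogleTTSService
from pipecat.transcriptions.language import Language

tts = GoogleTTSService(
    credentials_path="/path/to/service-account.json",
    settings=GoogleTTSService.Settings(
        voice="en-US-Chirp3-HD-Charon",
        language=Language.EN_US,
    ),
)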

Settings: alias of GoogleTTSSettings

class InputParams(*, language: Language | None = Language.EN, speaking_rate: float | None = None)[source]

Bases: BaseModel

Input parameters for Google streaming TTS configuration.

Deprecated since version 0.0.105: Use GoogleTTSService.Settings directly via the settings parameter instead.

Parameters:

language – Language for synthesis. Defaults to English.
speaking_rate – The speaking rate, in the range [0.25, 2.0].

language: Language | None

speaking_rate: float | None

Initializes the Google streaming TTS service.

Parameters:

credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Google TTS voice identifier (e.g., “en-US-Chirp3-HD-Charon”). Deprecated since version 0.0.105: Use settings=GoogleTTSService.Settings(voice=...) instead.
voice_cloning_key – The voice cloning key for Chirp 3 custom voices.
sample_rate – Audio sample rate in Hz. If None, uses default.
params – Language configuration parameters. Deprecated since version 0.0.105: Use settings=GoogleTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]

Generate streaming speech from text using Google’s streaming API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech as it’s generated.

Bases: GoogleBaseTTSService

Gemini Text-to-Speech streaming service using Gemini TTS models.

Provides real-time text-to-speech synthesis using Gemini’s TTS-specific models (gemini-2.5-flash-tts and gemini-2.5-pro-tts) with support for natural voice control, prompts for style instructions, expressive markup tags, and multi-speaker conversations.

Note

Requires Google Cloud credentials, supplied as a service account JSON string, a credentials file path, or Application Default Credentials (for example via the GOOGLE_APPLICATION_CREDENTIALS environment variable).

Uses the Google Cloud Text-to-Speech streaming API for low-latency synthesis.

Example:

from pipecat.services.google.tts import GeminiTTSService
from pipecat.transcriptions.language import Language

tts = GeminiTTSService(
    credentials_path="/path/to/service-account.json",
    settings=GeminiTTSService.Settings(
        model="gemini-2.5-flash-tts",
        voice="Kore",
        language=Language.EN_US,
        prompt="Say this in a friendly and helpful tone",
    ),
)

Settings: alias of GeminiTTSSettings

GOOGLE_SAMPLE_RATE = 24000

AVAILABLE_VOICES = ['Achernar', 'Achird', 'Algenib', 'Algieba', 'Alnilam', 'Aoede', 'Autonoe', 'Callirhoe', 'Charon', 'Despina', 'Enceladus', 'Erinome', 'Fenrir', 'Gacrux', 'Iapetus', 'Kore', 'Laomedeia', 'Leda', 'Orus', 'Puck', 'Pulcherrima', 'Rasalgethi', 'Sadachbia', 'Sadaltager', 'Schedar', 'Sulafar', 'Umbriel', 'Vindemiatrix', 'Zephyr', 'Zubenelgenubi']
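A caller can check a voice name against this list before constructing the service. The sketch copies the list as documented above and adds a hypothetical check_voice helper; whether the service itself rejects unknown names, and whether matching is case-sensitive, are assumptions.

```python
AVAILABLE_VOICES = [
    "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede",
    "Autonoe", "Callirhoe", "Charon", "Despina", "Enceladus", "Erinome",
    "Fenrir", "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus",
    "Puck", "Pulcherrima", "Rasalgethi", "Sadachbia", "Sadaltager",
    "Schedar", "Sulafar", "Umbriel", "Vindemiatrix", "Zephyr",
    "Zubenelgenubi",
]


def check_voice(voice: str) -> str:
    """Return the voice name if it is listed, else raise ValueError."""
    if voice not in AVAILABLE_VOICES:  # exact, case-sensitive match (assumed)
        raise ValueError(f"Unknown Gemini TTS voice: {voice!r}")
    return voice
```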

class InputParams(*, language: Language | None = Language.EN, prompt: str | None = None, multi_speaker: bool = False, speaker_configs: list[dict] | None = None)[source]

Bases: BaseModel

Input parameters for Gemini TTS configuration.

Deprecated since version 0.0.105: Use GeminiTTSService.Settings directly via the settings parameter instead.

Parameters:

language – Language for synthesis. Defaults to English.
prompt – Optional style instructions for how to synthesize the content.
multi_speaker – Whether to enable multi-speaker support.
speaker_configs – List of speaker configurations for multi-speaker mode.

language: Language | None

prompt: str | None

multi_speaker: bool

speaker_configs: list[dict] | None

Initializes the Gemini TTS service.

Parameters:

model – Gemini TTS model to use. Must be a TTS model like “gemini-2.5-flash-tts” or “gemini-2.5-pro-tts”. Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(model=...) instead.
credentials – JSON string containing Google Cloud service account credentials.
credentials_path – Path to Google Cloud service account JSON file.
location – Google Cloud location for regional endpoint (e.g., “us-central1”).
voice_id – Voice name from the available Gemini voices. Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses Google’s default 24kHz.
params – TTS configuration parameters. Deprecated since version 0.0.105: Use settings=GeminiTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Gemini TTS language format.

Parameters: language – The language to convert.
Returns: The Gemini TTS-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the Gemini TTS service.

Parameters: frame – The start frame containing initialization parameters.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]

Generate streaming speech from text using Gemini TTS models.

Parameters:

text – The text to synthesize into speech. Can include markup tags like [sigh], [laughing], [whispering] for expressive control.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech as it’s generated.
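Expressive markup tags such as [sigh], [laughing], and [whispering] travel inline with the text passed for synthesis. When the same text is also shown to users (for example in captions or logs), it can help to list or remove the tags first. The sketch below uses a simple regular expression and hypothetical helper names; the bracket pattern is an assumption based on the three tags named above, not a specification of the tag syntax.

```python
import re

# Assumed tag shape: lowercase words in square brackets, e.g. [sigh].
TAG_RE = re.compile(r"\[([a-z][a-z ]*)\]")


def find_markup_tags(text: str) -> list:
    """Return the tag names found in the text, in order of appearance."""
    return TAG_RE.findall(text)


def strip_markup_tags(text: str) -> str:
    """Remove tags and collapse the whitespace they leave behind."""
    without_tags = TAG_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()
```

The original, tagged text is what would be sent for synthesis; the stripped version is only for display.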