tts

Azure Cognitive Services Text-to-Speech service implementations.

pipecat.services.azure.tts.sample_rate_to_output_format(sample_rate: int) → SpeechSynthesisOutputFormat[source]

Convert sample rate to Azure speech synthesis output format.

Parameters: sample_rate – Sample rate in Hz.
Returns: Corresponding Azure SpeechSynthesisOutputFormat enum value. Defaults to Raw24Khz16BitMonoPcm if the sample rate has no matching format.
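The lookup-with-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the library's code: OutputFormat is a local stand-in for the Azure SDK's SpeechSynthesisOutputFormat enum, and the set of rates in the table is an assumption.

```python
from enum import Enum


class OutputFormat(Enum):
    """Local stand-in for the Azure SDK's SpeechSynthesisOutputFormat enum."""

    Raw8Khz16BitMonoPcm = "raw-8khz-16bit-mono-pcm"
    Raw16Khz16BitMonoPcm = "raw-16khz-16bit-mono-pcm"
    Raw24Khz16BitMonoPcm = "raw-24khz-16bit-mono-pcm"
    Raw48Khz16BitMonoPcm = "raw-48khz-16bit-mono-pcm"


# Assumed table: the rates recognized by the real function may differ.
_FORMATS_BY_RATE = {
    8000: OutputFormat.Raw8Khz16BitMonoPcm,
    16000: OutputFormat.Raw16Khz16BitMonoPcm,
    24000: OutputFormat.Raw24Khz16BitMonoPcm,
    48000: OutputFormat.Raw48Khz16BitMonoPcm,
}


def sample_rate_to_output_format_sketch(sample_rate: int) -> OutputFormat:
    """Return the format for a known rate, else the documented 24 kHz default."""
    return _FORMATS_BY_RATE.get(sample_rate, OutputFormat.Raw24Khz16BitMonoPcm)
```

Because an unknown rate silently falls back to 24 kHz, callers that need a specific rate may want to confirm the returned format matches what they asked for.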

class pipecat.services.azure.tts.AzureTTSSettings

Bases: TTSSettings

Settings for AzureTTSService and AzureHttpTTSService.

Parameters:

emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).
pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).
rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).
role – Voice role for expression (e.g., “YoungAdultFemale”).
style – Speaking style (e.g., “cheerful”, “sad”, “excited”).
style_degree – Intensity of the speaking style (0.01 to 2.0).
volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).

emphasis: str | None | _NotGiven

pitch: str | None | _NotGiven

rate: str | None | _NotGiven

role: str | None | _NotGiven

style: str | None | _NotGiven

style_degree: str | None | _NotGiven

volume: str | None | _NotGiven
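To show how these fields relate to Azure SSML, here is a hedged sketch that maps them onto the standard prosody, emphasis, and mstts:express-as elements. VoiceSettings and build_ssml are hypothetical names invented for this illustration, and the nesting order is an assumption; the SSML that the service actually emits may differ.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class VoiceSettings:
    """Hypothetical stand-in mirroring the documented settings fields."""

    voice: str = "en-US-JennyNeural"
    emphasis: Optional[str] = None
    pitch: Optional[str] = None
    rate: Optional[str] = None
    role: Optional[str] = None
    style: Optional[str] = None
    style_degree: Optional[str] = None
    volume: Optional[str] = None


def _attrs(pairs) -> str:
    """Render only the attributes that have a value, safely quoted."""
    return "".join(f" {name}={quoteattr(value)}" for name, value in pairs if value)


def build_ssml(text: str, s: VoiceSettings, language: str = "en-US") -> str:
    """Wrap escaped text in the SSML elements implied by the settings."""
    body = escape(text)  # escapes &, < and > in the spoken text
    if s.emphasis:
        body = f"<emphasis level={quoteattr(s.emphasis)}>{body}</emphasis>"
    prosody = _attrs([("rate", s.rate), ("pitch", s.pitch), ("volume", s.volume)])
    if prosody:
        body = f"<prosody{prosody}>{body}</prosody>"
    if s.style or s.role:
        express = _attrs(
            [("style", s.style), ("styledegree", s.style_degree), ("role", s.role)]
        )
        body = f"<mstts:express-as{express}>{body}</mstts:express-as>"
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(language)}>"
        f"<voice name={quoteattr(s.voice)}>{body}</voice></speak>"
    )


ssml = build_ssml("Tom & Jerry", VoiceSettings(rate="1.25", style="cheerful"))
```

Fields left as None produce no element at all, so a default VoiceSettings yields just the speak and voice wrappers around the text.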

class pipecat.services.azure.tts.AzureBaseTTSService[source]

Bases: object

Base mixin class for Azure Cognitive Services text-to-speech implementations.

Provides common functionality for Azure TTS services including SSML construction, voice configuration, and parameter management. This is a mixin class and should be used alongside TTSService or its subclasses.

Settings: alias of AzureTTSSettings

SSML_ESCAPE_CHARS = {'"': '&quot;', '&': '&amp;', "'": '&apos;', '<': '&lt;', '>': '&gt;'}
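A table like this maps each XML-reserved character to its entity. The sketch below shows one way such a table could be applied; escape_ssml is a hypothetical helper name, and the single-pass approach is an assumption rather than the library's confirmed implementation.

```python
# The five XML-reserved characters and their standard entities.
SSML_ESCAPE_CHARS = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def escape_ssml(text: str) -> str:
    """Escape reserved characters in a single pass over the input.

    Mapping character by character avoids double-escaping: the "&" inside an
    entity that was just produced is never visited again.
    """
    return "".join(SSML_ESCAPE_CHARS.get(ch, ch) for ch in text)


escaped = escape_ssml('Tom & "Jerry" <3')
```

A chain of str.replace calls would also work, but only if "&" is replaced first; the single pass sidesteps that ordering concern.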

class InputParams

Bases: BaseModel

Input parameters for Azure TTS voice configuration.

Deprecated since version 0.0.105: Use settings=AzureBaseTTSService.Settings(...) instead.

Parameters:

emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).
language – Language for synthesis. Defaults to English (US).
pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).
rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).
role – Voice role for expression (e.g., “YoungAdultFemale”).
style – Speaking style (e.g., “cheerful”, “sad”, “excited”).
style_degree – Intensity of the speaking style (0.01 to 2.0).
volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).

emphasis: str | None

language: Language | None

pitch: str | None

rate: str | None

role: str | None

style: str | None

style_degree: str | None

volume: str | None

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Azure language format.

Parameters: language – The language to convert.
Returns: The Azure-specific language code, or None if not supported.
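The contract here differs from the sample-rate helper: an unsupported language yields None rather than a default. A minimal sketch of that contract follows, assuming a local Lang enum as a stand-in for pipecat's Language and an illustrative two-entry table.

```python
from enum import Enum
from typing import Optional


class Lang(Enum):
    """Hypothetical stand-in for pipecat's Language enum."""

    EN_US = "en-US"
    ES = "es"
    LA = "la"  # deliberately left out of the table below


# Assumed mapping for illustration; the real table covers many more locales.
_AZURE_LOCALES = {
    Lang.EN_US: "en-US",
    Lang.ES: "es-ES",
}


def to_azure_language(language: Lang) -> Optional[str]:
    """Return the Azure locale code, or None when there is no mapping."""
    return _AZURE_LOCALES.get(language)
```

Callers should therefore handle None explicitly, for example by keeping the previously configured language.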

class pipecat.services.azure.tts.AzureTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)[source]

Bases: TTSService, AzureBaseTTSService

Azure Cognitive Services streaming TTS service with word timestamps.

Provides real-time text-to-speech synthesis using Azure’s WebSocket-based streaming API. Audio chunks and word boundaries are streamed as they become available for lower latency playback and accurate word-level synchronization.

Settings: alias of AzureTTSSettings

__init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)[source]

Initialize the Azure streaming TTS service.

Parameters:

api_key – Azure Cognitive Services subscription key.
region – Azure region identifier (e.g., “eastus”, “westus2”).
voice –
Voice name to use for synthesis.

Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses service default.
params –
Voice and synthesis parameters configuration.

Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
aggregate_sentences –
Deprecated. Use text_aggregation_mode instead.

Deprecated since version 0.0.104: Use text_aggregation_mode instead.
text_aggregation_mode – How to aggregate text before synthesis.
**kwargs – Additional arguments passed to parent TTSService.
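The precedence rule for settings versus the deprecated voice and params arguments can be modeled with a "not given" sentinel, which distinguishes "leave unchanged" from an explicit None. The sketch below is an illustration of that rule only: Settings, NOT_GIVEN, and resolve are hypothetical names, not pipecat's implementation.

```python
from dataclasses import dataclass, fields, replace
from typing import Any


class _NotGiven:
    """Sentinel type: the caller did not supply this field."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class Settings:
    """Hypothetical three-field settings object for illustration."""

    voice: Any = NOT_GIVEN
    rate: Any = NOT_GIVEN
    style: Any = NOT_GIVEN


def resolve(defaults: Settings, deprecated: Settings, settings: Settings) -> Settings:
    """Layer values so later layers win: defaults < deprecated args < settings."""
    result = defaults
    for layer in (deprecated, settings):
        given = {
            f.name: getattr(layer, f.name)
            for f in fields(layer)
            if getattr(layer, f.name) is not NOT_GIVEN
        }
        result = replace(result, **given)
    return result


resolved = resolve(
    defaults=Settings(voice="en-US-SaraNeural", rate=None, style=None),
    deprecated=Settings(voice="en-US-JennyNeural", rate="1.1"),
    settings=Settings(voice="en-US-AriaNeural"),
)
```

In this example the voice from settings wins, the rate from the deprecated arguments survives because settings did not supply one, and style stays at its default.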

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as Azure TTS service supports metrics generation.

async start(frame: StartFrame)[source]

Start the Azure TTS service and initialize speech synthesizer.

Parameters: frame – Start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Azure TTS service.

Parameters: frame – End frame signaling service stop.

async cancel(frame: CancelFrame)[source]

Cancel the Azure TTS service.

Parameters: frame – Cancel frame signaling service cancellation.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame and handle state changes.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio data.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]

Generate speech from text using Azure’s streaming synthesis.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing synthesized speech data.
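Since run_tts is an async generator, callers consume it with async for and receive frames as they are produced. The sketch below shows that consumption pattern with stand-ins only: AudioFrame and run_tts_sketch are hypothetical, and each word is treated as one "chunk" where a real service would yield PCM audio from Azure.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class AudioFrame:
    """Hypothetical stand-in for an audio frame type."""

    audio: bytes
    context_id: str


async def run_tts_sketch(text: str, context_id: str) -> AsyncGenerator[AudioFrame, None]:
    """Yield one fake chunk per word, tagging each with the context ID."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the loop, as a network wait would
        yield AudioFrame(audio=word.encode("utf-8"), context_id=context_id)


async def collect(text: str, context_id: str) -> List[AudioFrame]:
    """Drain the generator; a real consumer would forward each frame instead."""
    return [frame async for frame in run_tts_sketch(text, context_id)]


frames = asyncio.run(collect("hello streaming world", "ctx-1"))
```

Each frame carries the context_id it was generated for, which is what lets downstream code associate audio with the request that produced it.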

class pipecat.services.azure.tts.AzureHttpTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)[source]

Bases: TTSService, AzureBaseTTSService

Azure Cognitive Services HTTP-based TTS service.

Provides text-to-speech synthesis through Azure’s HTTP API. Synthesis is not streamed, which suits use cases where streaming is not required and a simpler integration is preferred.

Settings: alias of AzureTTSSettings

__init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)[source]

Initialize the Azure HTTP TTS service.

Parameters:

api_key – Azure Cognitive Services subscription key.
region – Azure region identifier (e.g., “eastus”, “westus2”).
voice –
Voice name to use for synthesis.

Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses service default.
params –
Voice and synthesis parameters configuration.

Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as Azure TTS service supports metrics generation.

async start(frame: StartFrame)[source]

Start the Azure HTTP TTS service and initialize speech synthesizer.

Parameters: frame – Start frame containing initialization parameters.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]

Generate speech from text using Azure’s HTTP synthesis API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the complete synthesized speech.
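When synthesis returns one complete buffer, a consumer may still want to split it into fixed-duration pieces for playback. The sketch below shows the arithmetic for raw 16-bit mono PCM. Whether and how the service itself chunks its output is not stated here, so chunk_pcm and the 20 ms default are purely illustrative assumptions.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def chunk_pcm(audio: bytes, sample_rate: int, chunk_ms: int = 20) -> Iterator[bytes]:
    """Split a complete PCM buffer into chunks of roughly chunk_ms each."""
    bytes_per_chunk = sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), bytes_per_chunk):
        yield audio[start : start + bytes_per_chunk]


# One second of silence at 24 kHz: 24000 samples * 2 bytes = 48000 bytes.
one_second = bytes(24000 * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm(one_second, sample_rate=24000))
```

At 24 kHz a 20 ms chunk is 960 bytes, so one second of audio splits into 50 equal chunks; a buffer that is not a multiple of the chunk size ends with a shorter final chunk.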