tts
Azure Cognitive Services Text-to-Speech service implementations.
- pipecat.services.azure.tts.sample_rate_to_output_format(sample_rate: int) → SpeechSynthesisOutputFormat[source]
Convert sample rate to Azure speech synthesis output format.
- Parameters:
sample_rate – Sample rate in Hz.
- Returns:
Corresponding Azure SpeechSynthesisOutputFormat enum value. Defaults to Raw24Khz16BitMonoPcm if the sample rate is not recognized.
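The lookup-with-default behavior described for sample_rate_to_output_format can be sketched as follows. This is a minimal illustration, not the library's code: OutputFormat is a hypothetical stand-in for Azure's SpeechSynthesisOutputFormat, and the set of recognized rates is an assumption. Only the 24 kHz fallback comes from the description above.

```python
from enum import Enum


class OutputFormat(Enum):
    """Hypothetical stand-in for Azure's SpeechSynthesisOutputFormat."""

    Raw8Khz16BitMonoPcm = "raw-8khz-16bit-mono-pcm"
    Raw16Khz16BitMonoPcm = "raw-16khz-16bit-mono-pcm"
    Raw24Khz16BitMonoPcm = "raw-24khz-16bit-mono-pcm"
    Raw48Khz16BitMonoPcm = "raw-48khz-16bit-mono-pcm"


# Assumed table; the rates the real function recognizes may differ.
_FORMATS = {
    8000: OutputFormat.Raw8Khz16BitMonoPcm,
    16000: OutputFormat.Raw16Khz16BitMonoPcm,
    24000: OutputFormat.Raw24Khz16BitMonoPcm,
    48000: OutputFormat.Raw48Khz16BitMonoPcm,
}


def sample_rate_to_format_sketch(sample_rate: int) -> OutputFormat:
    """Return the format for sample_rate, or the 24 kHz format if unknown."""
    return _FORMATS.get(sample_rate, OutputFormat.Raw24Khz16BitMonoPcm)
```

Because an unknown rate silently falls back to 24 kHz, a caller that requests an unsupported rate should probably check that the rest of the audio pipeline expects 24 kHz audio.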
- class pipecat.services.azure.tts.AzureTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, emphasis: str | None | _NotGiven = <factory>, pitch: str | None | _NotGiven = <factory>, rate: str | None | _NotGiven = <factory>, role: str | None | _NotGiven = <factory>, style: str | None | _NotGiven = <factory>, style_degree: str | None | _NotGiven = <factory>, volume: str | None | _NotGiven = <factory>)[source]
Bases: TTSSettings
Settings for AzureTTSService and AzureHttpTTSService.
- Parameters:
emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).
pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).
rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).
role – Voice role for expression (e.g., “YoungAdultFemale”).
style – Speaking style (e.g., “cheerful”, “sad”, “excited”).
style_degree – Intensity of the speaking style (0.01 to 2.0).
volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).
- emphasis: str | None | _NotGiven
- pitch: str | None | _NotGiven
- rate: str | None | _NotGiven
- role: str | None | _NotGiven
- style: str | None | _NotGiven
- style_degree: str | None | _NotGiven
- volume: str | None | _NotGiven
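To show how settings such as pitch, rate, volume, style and style_degree might end up in an SSML request, here is a rough sketch. VoiceSettingsSketch and build_ssml_sketch are hypothetical names, and the element nesting is an assumption based on Azure's public SSML elements (prosody, mstts:express-as), not a copy of what the service generates.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass
class VoiceSettingsSketch:
    """Hypothetical stand-in mirroring a few AzureTTSSettings fields."""

    voice: str
    language: str = "en-US"
    pitch: Optional[str] = None
    rate: Optional[str] = None
    volume: Optional[str] = None
    style: Optional[str] = None
    style_degree: Optional[str] = None


def build_ssml_sketch(text: str, s: VoiceSettingsSketch) -> str:
    """Wrap text in prosody and express-as elements for the fields that are set."""
    body = escape(text)  # escape &, < and > in the spoken text

    # Only emit prosody attributes that were actually configured.
    prosody = {"pitch": s.pitch, "rate": s.rate, "volume": s.volume}
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in prosody.items() if v)
    if attrs:
        body = f"<prosody{attrs}>{body}</prosody>"

    # Speaking style, with an optional intensity.
    if s.style:
        degree = f" styledegree={quoteattr(s.style_degree)}" if s.style_degree else ""
        body = f"<mstts:express-as style={quoteattr(s.style)}{degree}>{body}</mstts:express-as>"

    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="http://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(s.language)}>"
        f"<voice name={quoteattr(s.voice)}>{body}</voice></speak>"
    )


# Example voice name; substitute any voice available in your region.
ssml = build_ssml_sketch(
    "Tom & Jerry",
    VoiceSettingsSketch(voice="en-US-JennyNeural", rate="fast", style="cheerful"),
)
```

Fields left as None produce no markup, so the same builder covers both a plain voice and a fully styled one.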
- class pipecat.services.azure.tts.AzureBaseTTSService[source]
Bases: object
Base mixin class for Azure Cognitive Services text-to-speech implementations.
Provides common functionality for Azure TTS services including SSML construction, voice configuration, and parameter management. This is a mixin class and should be used alongside TTSService or its subclasses.
- Settings
alias of AzureTTSSettings
- SSML_ESCAPE_CHARS = {'"': '&quot;', '&': '&amp;', "'": '&apos;', '<': '&lt;', '>': '&gt;'}
- class InputParams(*, emphasis: str | None = None, language: Language | None = Language.EN_US, pitch: str | None = None, rate: str | None = None, role: str | None = None, style: str | None = None, style_degree: str | None = None, volume: str | None = None)[source]
Bases: BaseModel
Input parameters for Azure TTS voice configuration.
Deprecated since version 0.0.105: Use settings=AzureBaseTTSService.Settings(...) instead.
- Parameters:
emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).
language – Language for synthesis. Defaults to English (US).
pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).
rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).
role – Voice role for expression (e.g., “YoungAdultFemale”).
style – Speaking style (e.g., “cheerful”, “sad”, “excited”).
style_degree – Intensity of the speaking style (0.01 to 2.0).
volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).
- emphasis: str | None
- pitch: str | None
- rate: str | None
- role: str | None
- style: str | None
- style_degree: str | None
- volume: str | None
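As an illustration of how a table like SSML_ESCAPE_CHARS could be applied before text is embedded in SSML, consider the sketch below. The mapping values are assumed to be the five predefined XML entities, and escape_ssml_sketch is a hypothetical helper, not a method of AzureBaseTTSService.

```python
# Assumed contents: the five predefined XML entities.
SSML_ESCAPE_CHARS = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&apos;",
}


def escape_ssml_sketch(text: str) -> str:
    """Replace each XML-special character in a single pass over the text."""
    return "".join(SSML_ESCAPE_CHARS.get(ch, ch) for ch in text)
```

A single pass avoids the ordering problem of chained str.replace calls, where replacing "&" after another entity has been inserted would corrupt it. Input that is already escaped is escaped again, so this helper should only be applied to raw text.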
- class pipecat.services.azure.tts.AzureTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)[source]
Bases: TTSService, AzureBaseTTSService
Azure Cognitive Services streaming TTS service with word timestamps.
Provides real-time text-to-speech synthesis using Azure’s WebSocket-based streaming API. Audio chunks and word boundaries are streamed as they become available for lower latency playback and accurate word-level synchronization.
- Settings
alias of AzureTTSSettings
- __init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)[source]
Initialize the Azure streaming TTS service.
- Parameters:
api_key – Azure Cognitive Services subscription key.
region – Azure region identifier (e.g., “eastus”, “westus2”).
voice – Voice name to use for synthesis.
Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses service default.
params – Voice and synthesis parameters configuration.
Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
aggregate_sentences – Deprecated since version 0.0.104: Use text_aggregation_mode instead.
text_aggregation_mode – How to aggregate text before synthesis.
**kwargs – Additional arguments passed to parent TTSService.
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Azure TTS service supports metrics generation.
- async start(frame: StartFrame)[source]
Start the Azure TTS service and initialize speech synthesizer.
- Parameters:
frame – Start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the Azure TTS service.
- Parameters:
frame – End frame signaling service stop.
- async cancel(frame: CancelFrame)[source]
Cancel the Azure TTS service.
- Parameters:
frame – Cancel frame signaling service cancellation.
- async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]
Push a frame and handle state changes.
- Parameters:
frame – The frame to push.
direction – The direction to push the frame.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]
Generate speech from text using Azure’s streaming synthesis.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing synthesized speech data.
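Since run_tts is an async generator, a caller receives frames one at a time as they are produced. The sketch below shows that consumption pattern only. AudioChunkSketch and run_tts_sketch are hypothetical stand-ins that fabricate one chunk per word, and they make no calls to Azure or to the real service.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class AudioChunkSketch:
    """Hypothetical stand-in for an audio frame."""

    audio: bytes
    context_id: str


async def run_tts_sketch(
    text: str, context_id: str
) -> AsyncGenerator[AudioChunkSketch, None]:
    """Yield one fake chunk per word to mimic incremental synthesis."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield AudioChunkSketch(audio=word.encode("utf-8"), context_id=context_id)


async def collect(text: str, context_id: str) -> List[AudioChunkSketch]:
    """Consume the generator; a real pipeline would forward each chunk immediately."""
    return [chunk async for chunk in run_tts_sketch(text, context_id)]


chunks = asyncio.run(collect("hello streaming world", "ctx-1"))
```

Each chunk carries the context_id it was requested with, which is what lets downstream code associate audio with the text that produced it.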
- class pipecat.services.azure.tts.AzureHttpTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)[source]
Bases: TTSService, AzureBaseTTSService
Azure Cognitive Services HTTP-based TTS service.
Provides text-to-speech synthesis using Azure’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.
- Settings
alias of AzureTTSSettings
- __init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)[source]
Initialize the Azure HTTP TTS service.
- Parameters:
api_key – Azure Cognitive Services subscription key.
region – Azure region identifier (e.g., “eastus”, “westus2”).
voice – Voice name to use for synthesis.
Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses service default.
params – Voice and synthesis parameters configuration.
Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Azure TTS service supports metrics generation.
- async start(frame: StartFrame)[source]
Start the Azure HTTP TTS service and initialize speech synthesizer.
- Parameters:
frame – Start frame containing initialization parameters.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None][source]
Generate speech from text using Azure’s HTTP synthesis API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the complete synthesized speech.