tts

Azure Cognitive Services Text-to-Speech service implementations.

pipecat.services.azure.tts.sample_rate_to_output_format(sample_rate: int) → SpeechSynthesisOutputFormat

Convert sample rate to Azure speech synthesis output format.

Parameters:

sample_rate – Sample rate in Hz.

Returns:

The corresponding Azure SpeechSynthesisOutputFormat enum value. Falls back to Raw24Khz16BitMonoPcm if the sample rate has no matching format.
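The lookup-with-fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it returns format names as strings rather than SpeechSynthesisOutputFormat members so that it runs without the Azure SDK, and the set of sample rates in the table is an assumption. The helper name output_format_name is hypothetical.

```python
# Hypothetical table: sample rate in Hz -> Azure output format name.
# The names mirror SpeechSynthesisOutputFormat members; which rates pipecat
# actually maps is an assumption made for this sketch.
_FORMAT_NAMES = {
    8000: "Raw8Khz16BitMonoPcm",
    16000: "Raw16Khz16BitMonoPcm",
    22050: "Raw22050Hz16BitMonoPcm",
    24000: "Raw24Khz16BitMonoPcm",
    44100: "Raw44100Hz16BitMonoPcm",
    48000: "Raw48Khz16BitMonoPcm",
}

# Documented fallback when the sample rate is not in the table.
DEFAULT_FORMAT_NAME = "Raw24Khz16BitMonoPcm"


def output_format_name(sample_rate: int) -> str:
    """Return the format name for sample_rate, or the 24 kHz default."""
    return _FORMAT_NAMES.get(sample_rate, DEFAULT_FORMAT_NAME)
```

An unknown rate such as 12345 resolves to the 24 kHz format rather than raising, which matches the documented default.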

class pipecat.services.azure.tts.AzureTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, emphasis: str | None | _NotGiven = <factory>, pitch: str | None | _NotGiven = <factory>, rate: str | None | _NotGiven = <factory>, role: str | None | _NotGiven = <factory>, style: str | None | _NotGiven = <factory>, style_degree: str | None | _NotGiven = <factory>, volume: str | None | _NotGiven = <factory>)

Bases: TTSSettings

Settings for AzureTTSService and AzureHttpTTSService.

Parameters:
  • emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).

  • pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).

  • rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).

  • role – Voice role for expression (e.g., “YoungAdultFemale”).

  • style – Speaking style (e.g., “cheerful”, “sad”, “excited”).

  • style_degree – Intensity of the speaking style (0.01 to 2.0).

  • volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).

emphasis: str | None | _NotGiven
pitch: str | None | _NotGiven
rate: str | None | _NotGiven
role: str | None | _NotGiven
style: str | None | _NotGiven
style_degree: str | None | _NotGiven
volume: str | None | _NotGiven
class pipecat.services.azure.tts.AzureBaseTTSService

Bases: object

Base mixin class for Azure Cognitive Services text-to-speech implementations.

Provides functionality shared by the Azure TTS services, including SSML construction, voice configuration, and parameter management. As a mixin, it is meant to be combined with TTSService or one of its subclasses.
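To make the SSML construction step concrete, here is a minimal sketch of how the voice settings might map onto Azure SSML elements. The function name build_ssml, the nesting order of the elements, and the example voice name are assumptions for illustration; the mixin's actual output may differ. The text is assumed to be escaped already.

```python
from typing import Optional


def build_ssml(
    text: str,
    voice: str,
    language: str = "en-US",
    *,
    emphasis: Optional[str] = None,
    pitch: Optional[str] = None,
    rate: Optional[str] = None,
    role: Optional[str] = None,
    style: Optional[str] = None,
    style_degree: Optional[str] = None,
    volume: Optional[str] = None,
) -> str:
    """Wrap already-escaped text in SSML elements for the given settings."""
    body = text
    # Innermost: emphasis level, if requested.
    if emphasis:
        body = f'<emphasis level="{emphasis}">{body}</emphasis>'
    # Prosody carries rate, pitch and volume as attributes.
    prosody = [
        f'{name}="{value}"'
        for name, value in (("rate", rate), ("pitch", pitch), ("volume", volume))
        if value
    ]
    if prosody:
        body = "<prosody " + " ".join(prosody) + ">" + body + "</prosody>"
    # Azure-specific expression element for style, style degree and role.
    express = [
        f'{name}="{value}"'
        for name, value in (
            ("style", style),
            ("styledegree", style_degree),
            ("role", role),
        )
        if value
    ]
    if express:
        body = (
            "<mstts:express-as " + " ".join(express) + ">"
            + body
            + "</mstts:express-as>"
        )
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{language}">'
        f'<voice name="{voice}">{body}</voice></speak>'
    )
```

Calling build_ssml("Hello", "en-US-JennyNeural", style="cheerful", rate="1.25") wraps the text in a prosody element nested inside an mstts:express-as element; settings left as None add no markup.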

Settings

alias of AzureTTSSettings

SSML_ESCAPE_CHARS = {'"': '&quot;', '&': '&amp;', "'": '&apos;', '<': '&lt;', '>': '&gt;'}
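As an illustration of how a table like SSML_ESCAPE_CHARS can be applied, the sketch below escapes text one character at a time. The helper name escape_ssml is hypothetical and this is not necessarily how the mixin applies the table. A single pass avoids double-escaping the ampersands that the replacements themselves introduce.

```python
# Same mapping as the class attribute above, repeated so the sketch runs alone.
SSML_ESCAPE_CHARS = {
    '"': "&quot;",
    "&": "&amp;",
    "'": "&apos;",
    "<": "&lt;",
    ">": "&gt;",
}


def escape_ssml(text: str) -> str:
    """Replace each SSML-reserved character with its entity in one pass."""
    # Each input character is looked up exactly once, so the '&' inside an
    # inserted entity such as '&lt;' is never escaped a second time.
    return "".join(SSML_ESCAPE_CHARS.get(ch, ch) for ch in text)
```

Chained str.replace calls would need "&" to be handled first to get the same result; the per-character lookup sidesteps that ordering concern.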
class InputParams(*, emphasis: str | None = None, language: Language | None = Language.EN_US, pitch: str | None = None, rate: str | None = None, role: str | None = None, style: str | None = None, style_degree: str | None = None, volume: str | None = None)

Bases: BaseModel

Input parameters for Azure TTS voice configuration.

Deprecated since version 0.0.105: Use settings=AzureBaseTTSService.Settings(...) instead.

Parameters:
  • emphasis – Emphasis level for speech (“strong”, “moderate”, “reduced”).

  • language – Language for synthesis. Defaults to English (US).

  • pitch – Voice pitch adjustment (e.g., “+10%”, “-5Hz”, “high”).

  • rate – Speech rate adjustment (e.g., “1.0”, “1.25”, “slow”, “fast”).

  • role – Voice role for expression (e.g., “YoungAdultFemale”).

  • style – Speaking style (e.g., “cheerful”, “sad”, “excited”).

  • style_degree – Intensity of the speaking style (0.01 to 2.0).

  • volume – Volume level (e.g., “+20%”, “loud”, “x-soft”).

emphasis: str | None
language: Language | None
pitch: str | None
rate: str | None
role: str | None
style: str | None
style_degree: str | None
volume: str | None
language_to_service_language(language: Language) → str | None

Convert a Language enum to Azure language format.

Parameters:

language – The language to convert.

Returns:

The Azure-specific language code, or None if not supported.
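The conversion contract (a code for supported languages, None otherwise) can be sketched with a small lookup table. The Language enum below is a stand-in with a handful of members, not pipecat's real enum, and the table contents and the name to_azure_language are assumptions for illustration only.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (illustrative members only)."""

    EN = "en"
    EN_US = "en-US"
    FR_FR = "fr-FR"
    DE_DE = "de-DE"  # left out of the table below only to exercise the None path


# Hypothetical mapping from Language members to Azure locale codes.
_AZURE_LANGUAGES = {
    Language.EN: "en-US",  # bare language falls back to a default region
    Language.EN_US: "en-US",
    Language.FR_FR: "fr-FR",
}


def to_azure_language(language: Language) -> Optional[str]:
    """Return the Azure locale code, or None if the language is unmapped."""
    return _AZURE_LANGUAGES.get(language)
```

Returning None instead of raising lets the caller decide how to handle an unsupported language, for example by keeping the previously configured one.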

class pipecat.services.azure.tts.AzureTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)

Bases: TTSService, AzureBaseTTSService

Azure Cognitive Services streaming TTS service with word timestamps.

Provides real-time text-to-speech synthesis using Azure’s WebSocket-based streaming API. Audio chunks and word boundaries are streamed as they become available, which lowers playback latency and allows accurate word-level synchronization.
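For word-level synchronization, the Azure Speech SDK reports each word boundary's audio offset in ticks of 100 nanoseconds. The sketch below converts such offsets to seconds. The helper names and the (word, offset) tuple shape are assumptions for illustration, and whether the service performs the conversion exactly this way is not stated here.

```python
from typing import List, Tuple

# One tick is 100 nanoseconds, so there are 10 million ticks per second.
TICKS_PER_SECOND = 10_000_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert an offset in 100 ns ticks to seconds."""
    return ticks / TICKS_PER_SECOND


def word_times(boundaries: List[Tuple[str, int]]) -> List[Tuple[str, float]]:
    """Map (word, offset_in_ticks) pairs to (word, offset_in_seconds) pairs."""
    return [(word, ticks_to_seconds(offset)) for word, offset in boundaries]
```

For example, a boundary at 4,250,000 ticks corresponds to 0.425 seconds into the synthesized audio.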

Settings

alias of AzureTTSSettings

__init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, **kwargs)

Initialize the Azure streaming TTS service.

Parameters:
  • api_key – Azure Cognitive Services subscription key.

  • region – Azure region identifier (e.g., “eastus”, “westus2”).

  • voice

    Voice name to use for synthesis.

    Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(voice=...) instead.

  • sample_rate – Audio sample rate in Hz. If None, uses service default.

  • params

    Voice and synthesis parameters configuration.

    Deprecated since version 0.0.105: Use settings=AzureTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • aggregate_sentences

    Deprecated since version 0.0.104: Use text_aggregation_mode instead.

  • text_aggregation_mode – How to aggregate text before synthesis.

  • **kwargs – Additional arguments passed to parent TTSService.
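The precedence rule for settings over the deprecated voice and params arguments can be sketched as below. SketchSettings and resolve are hypothetical stand-ins, and None is used here to mean "not provided", which simplifies the _NotGiven sentinel that the real settings class uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchSettings:
    """Hypothetical stand-in for AzureTTSSettings with two fields."""

    voice: Optional[str] = None
    rate: Optional[str] = None


def resolve(
    voice: Optional[str] = None,
    params: Optional[dict] = None,
    settings: Optional[SketchSettings] = None,
) -> SketchSettings:
    """Merge deprecated arguments with settings; settings values win."""
    # Start from the deprecated arguments.
    resolved = SketchSettings(voice=voice, rate=(params or {}).get("rate"))
    # Overlay every value that the settings object actually provides.
    if settings is not None:
        for name in ("voice", "rate"):
            value = getattr(settings, name)
            if value is not None:
                setattr(resolved, name, value)
    return resolved
```

With voice="old", params={"rate": "slow"} and settings=SketchSettings(voice="new"), the result keeps the rate from params but takes the voice from settings.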

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True; the Azure TTS service supports metrics generation.

async start(frame: StartFrame)

Start the Azure TTS service and initialize the speech synthesizer.

Parameters:

frame – Start frame containing initialization parameters.

async stop(frame: EndFrame)

Stop the Azure TTS service.

Parameters:

frame – End frame signaling service stop.

async cancel(frame: CancelFrame)

Cancel the Azure TTS service.

Parameters:

frame – Cancel frame signaling service cancellation.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)

Push a frame and handle state changes.

Parameters:
  • frame – The frame to push.

  • direction – The direction to push the frame.

async flush_audio(context_id: str | None = None)

Flush any pending audio data.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]

Generate speech from text using Azure’s streaming synthesis.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing synthesized speech data.
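The streaming pattern behind an async generator like run_tts can be sketched with a queue that a producer fills while the generator yields chunks as they arrive. Everything below is an assumption for illustration (stream_chunks, demo, the None end marker, and the use of raw bytes instead of Frame objects); it does not reproduce the service's internals.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


async def stream_chunks(
    queue: "asyncio.Queue[Optional[bytes]]",
) -> AsyncGenerator[bytes, None]:
    """Yield audio chunks from the queue until a None end marker arrives."""
    while True:
        chunk = await queue.get()
        if chunk is None:  # producer signals that synthesis is complete
            break
        yield chunk


async def demo() -> List[bytes]:
    """Run a fake producer alongside the generator and collect the output."""
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()

    async def producer() -> None:
        # Stand-in for synthesizer callbacks delivering audio incrementally.
        for chunk in (b"\x00\x01", b"\x02\x03", None):
            await queue.put(chunk)
            await asyncio.sleep(0)  # let the consumer run between chunks

    task = asyncio.create_task(producer())
    received = [chunk async for chunk in stream_chunks(queue)]
    await task
    return received
```

Because the consumer yields each chunk as soon as it is dequeued, playback can begin before synthesis of the full utterance has finished.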

class pipecat.services.azure.tts.AzureHttpTTSService(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)

Bases: TTSService, AzureBaseTTSService

Azure Cognitive Services HTTP-based TTS service.

Provides non-streaming text-to-speech synthesis using Azure’s HTTP API. Suitable when streaming is not required and a simpler integration is preferred.

Settings

alias of AzureTTSSettings

__init__(*, api_key: str, region: str, voice: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: AzureTTSSettings | None = None, **kwargs)

Initialize the Azure HTTP TTS service.

Parameters:
  • api_key – Azure Cognitive Services subscription key.

  • region – Azure region identifier (e.g., “eastus”, “westus2”).

  • voice

    Voice name to use for synthesis.

    Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(voice=...) instead.

  • sample_rate – Audio sample rate in Hz. If None, uses service default.

  • params

    Voice and synthesis parameters configuration.

    Deprecated since version 0.0.105: Use settings=AzureHttpTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional arguments passed to parent TTSService.

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True; the Azure HTTP TTS service supports metrics generation.

async start(frame: StartFrame)

Start the Azure HTTP TTS service and initialize the speech synthesizer.

Parameters:

frame – Start frame containing initialization parameters.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]

Generate speech from text using Azure’s HTTP synthesis API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the complete synthesized speech.
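When a complete utterance comes back as one buffer of raw 16-bit mono PCM, a consumer may want to know its duration or split it into fixed-size pieces. The sketch below shows both calculations. The helper names are hypothetical, and whether or how the HTTP service chunks its output is not stated here.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM: two bytes per sample, one channel


def pcm_duration_seconds(num_bytes: int, sample_rate: int) -> float:
    """Duration of a raw 16-bit mono PCM buffer of the given size."""
    return num_bytes / (sample_rate * BYTES_PER_SAMPLE)


def split_audio(audio: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

At 24000 Hz, for instance, a 48000-byte buffer holds one second of audio, and a 960-byte chunk holds 20 milliseconds.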