stt

Azure Speech-to-Text service implementation for Pipecat.

This module provides speech-to-text functionality using Azure Cognitive Services Speech SDK for real-time audio transcription.

class pipecat.services.azure.stt.AzureSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for AzureSTTService.

class pipecat.services.azure.stt.AzureSTTService(*, api_key: str, region: str | None = None, language: Language | None = Language.EN_US, sample_rate: int | None = None, private_endpoint: str | None = None, endpoint_id: str | None = None, settings: AzureSTTSettings | None = None, ttfs_p99_latency: float | None = 1.8, **kwargs)

Bases: STTService

Azure Speech-to-Text service for real-time audio transcription.

This service uses Azure Cognitive Services Speech SDK to convert speech audio into text transcriptions. It supports continuous recognition and provides real-time transcription results with timing information.

Settings

alias of AzureSTTSettings

__init__(*, api_key: str, region: str | None = None, language: Language | None = Language.EN_US, sample_rate: int | None = None, private_endpoint: str | None = None, endpoint_id: str | None = None, settings: AzureSTTSettings | None = None, ttfs_p99_latency: float | None = 1.8, **kwargs)

Initialize the Azure STT service.

Parameters:
  • api_key – Azure Cognitive Services subscription key.

  • region – Azure region for the Speech service (e.g., ‘eastus’). Required unless private_endpoint is provided.

  • language – Language for speech recognition. Defaults to English (US).

    Deprecated since version 0.0.105: Use settings=AzureSTTService.Settings(language=...) instead.

  • sample_rate – Audio sample rate in Hz. If None, uses service default.

  • private_endpoint – Private endpoint for STT behind firewall. See https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-private-link?tabs=portal

  • endpoint_id – Custom model endpoint id.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to parent STTService.
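The rule that region is required unless private_endpoint is provided can be checked before the service is constructed. The following is a minimal sketch under stated assumptions: `build_azure_stt_kwargs` is a hypothetical helper, not part of Pipecat, and it only collects keyword arguments whose names match the constructor signature documented above.

```python
from typing import Any, Optional


def build_azure_stt_kwargs(
    api_key: str,
    region: Optional[str] = None,
    private_endpoint: Optional[str] = None,
    endpoint_id: Optional[str] = None,
    sample_rate: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate and collect AzureSTTService keyword arguments."""
    if not api_key:
        raise ValueError("api_key is required")
    # Documented rule: region is required unless private_endpoint is provided.
    if region is None and private_endpoint is None:
        raise ValueError("provide either region or private_endpoint")

    kwargs: dict[str, Any] = {"api_key": api_key}
    # Forward only the optional values that were actually set, so the
    # constructor's own defaults apply to everything else.
    optional = {
        "region": region,
        "private_endpoint": private_endpoint,
        "endpoint_id": endpoint_id,
        "sample_rate": sample_rate,
    }
    kwargs.update({name: value for name, value in optional.items() if value is not None})
    return kwargs


kwargs = build_azure_stt_kwargs(api_key="YOUR_KEY", region="eastus")
# With Pipecat installed, these would be passed on as:
#   stt = AzureSTTService(**kwargs)
```

Validating up front keeps a misconfigured deployment from failing later, when the recognizer is first started.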

can_generate_metrics() → bool

Check if this service can generate performance metrics.

Returns:

True, as this service supports metrics generation.

language_to_service_language(language: Language) → str | None

Convert a Language enum to Azure service-specific language code.

Parameters:

language – The language to convert.

Returns:

The Azure-specific language identifier, or None if not supported.
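The conversion described above amounts to a lookup that returns None for anything without an Azure equivalent. The sketch below illustrates that shape only: the `Language` class is a small stand-in for Pipecat's enum, the mapping table is an assumed subset, and `language_to_azure_locale` is a hypothetical name. The actual table used by the service may differ.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for Pipecat's Language enum (illustrative subset only)."""

    EN = "en"
    EN_US = "en-US"
    FR_FR = "fr-FR"
    UNSUPPORTED = "zz-ZZ"  # hypothetical member with no Azure equivalent


# Assumed mapping for illustration; not the service's real table.
_AZURE_LOCALES = {
    Language.EN: "en-US",  # bare language code falls back to a default locale
    Language.EN_US: "en-US",
    Language.FR_FR: "fr-FR",
}


def language_to_azure_locale(language: Language) -> Optional[str]:
    """Return the Azure locale string, or None if the language is not mapped."""
    return _AZURE_LOCALES.get(language)
```

Returning None rather than raising lets the caller decide whether to fall back to a default language or report a configuration error.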

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text conversion.

Feeds audio data to the Azure speech recognizer for processing. Recognition results are handled asynchronously through callbacks.

Parameters:

audio – Raw audio bytes to process.

Yields:

Frame – Either None for successful processing or ErrorFrame on failure.
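The push-then-yield pattern described above can be sketched without the Azure SDK. Everything here is a stand-in: `FakeAudioStream` plays the role of the stream that feeds the recognizer, `ErrorFrame` is a simplified substitute for Pipecat's frame class, and the extra `stream` parameter exists only to keep the example self-contained. Transcripts themselves would arrive later through recognizer callbacks, which this sketch does not model.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Optional


@dataclass
class ErrorFrame:
    """Stand-in for Pipecat's ErrorFrame (illustrative only)."""

    error: str


class FakeAudioStream:
    """Stand-in for the audio stream that feeds the recognizer."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.closed = False

    def write(self, audio: bytes) -> None:
        if self.closed:
            raise RuntimeError("audio stream is closed")
        self.buffer.extend(audio)


async def run_stt(
    stream: FakeAudioStream, audio: bytes
) -> AsyncGenerator[Optional[ErrorFrame], None]:
    """Feed one audio chunk; yield None on success or an ErrorFrame on failure."""
    try:
        stream.write(audio)
    except Exception as exc:
        yield ErrorFrame(error=str(exc))
        return
    # Nothing to emit here: results are delivered asynchronously elsewhere.
    yield None


async def collect(stream: FakeAudioStream, audio: bytes) -> list:
    """Drain the generator into a list so the outcome is easy to inspect."""
    return [frame async for frame in run_stt(stream, audio)]
```

Calling `asyncio.run(collect(FakeAudioStream(), b"\x00\x01"))` drives the generator once and returns the frames it yielded.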

async start(frame: StartFrame)

Start the speech recognition service.

Parameters:

frame – Frame indicating the start of processing.

async stop(frame: EndFrame)

Stop the speech recognition service.

Parameters:

frame – Frame indicating the end of processing.

async cancel(frame: CancelFrame)

Cancel the speech recognition service.

Parameters:

frame – Frame indicating cancellation.
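The three lifecycle methods can be pictured as one start path and a shared shutdown path. The sketch below is an assumption about that shape, not the service's implementation: `FakeRecognizer` and `LifecycleSketch` are hypothetical stand-ins, the frame arguments are omitted, and the idea that stop and cancel release the recognizer in the same way is a guess the source does not confirm.

```python
import asyncio
from typing import Optional


class FakeRecognizer:
    """Stand-in for a continuous speech recognizer (illustrative only)."""

    def __init__(self) -> None:
        self.running = False

    def start_continuous_recognition(self) -> None:
        self.running = True

    def stop_continuous_recognition(self) -> None:
        self.running = False


class LifecycleSketch:
    """Hypothetical service skeleton showing start, stop, and cancel."""

    def __init__(self) -> None:
        self._recognizer: Optional[FakeRecognizer] = None

    async def start(self) -> None:
        if self._recognizer is not None:
            return  # already started; avoid creating a second recognizer
        self._recognizer = FakeRecognizer()
        self._recognizer.start_continuous_recognition()

    async def _shutdown(self) -> None:
        # Safe to call repeatedly: does nothing once the recognizer is released.
        if self._recognizer is not None:
            self._recognizer.stop_continuous_recognition()
            self._recognizer = None

    async def stop(self) -> None:
        await self._shutdown()

    async def cancel(self) -> None:
        await self._shutdown()
```

Making the shutdown path idempotent means a cancel that arrives after a stop does not raise.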