stt

Azure Speech-to-Text service implementation for Pipecat.

This module provides speech-to-text functionality using the Azure Cognitive Services Speech SDK for real-time audio transcription.

Bases: STTSettings

Settings for AzureSTTService.

class pipecat.services.azure.stt.AzureSTTService(*, api_key: str, region: str | None = None, language: Language | None = Language.EN_US, sample_rate: int | None = None, private_endpoint: str | None = None, endpoint_id: str | None = None, settings: AzureSTTSettings | None = None, ttfs_p99_latency: float | None = 1.8, **kwargs)

Bases: STTService

Azure Speech-to-Text service for real-time audio transcription.

This service uses the Azure Cognitive Services Speech SDK to convert speech audio into text transcriptions. It supports continuous recognition and provides real-time transcription results with timing information.

Settings: alias of AzureSTTSettings

Initialize the Azure STT service.

Parameters:

api_key – Azure Cognitive Services subscription key.
region – Azure region for the Speech service (e.g., ‘eastus’). Required unless private_endpoint is provided.
language – Language for speech recognition. Defaults to English (US).
Deprecated since version 0.0.105: Use settings=AzureSTTService.Settings(language=...) instead.
sample_rate – Audio sample rate in Hz. If None, uses service default.
private_endpoint – Private endpoint for a Speech service behind a firewall. See https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-private-link?tabs=portal
endpoint_id – Custom model endpoint ID.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to parent STTService.
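The rule that region is required unless private_endpoint is provided can be checked before the service is constructed. The following is a minimal sketch, not part of Pipecat: the helper name azure_stt_kwargs and the environment variable names are assumptions made for illustration, and only the keyword names api_key, region, and private_endpoint come from the signature above.

```python
import os
from typing import Mapping, Optional


def azure_stt_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect AzureSTTService keyword arguments from environment variables.

    The variable names used here are assumptions for this sketch,
    not names that Pipecat or Azure define.
    """
    env = os.environ if env is None else env

    api_key = env.get("AZURE_SPEECH_API_KEY")
    if not api_key:
        raise ValueError("AZURE_SPEECH_API_KEY is not set")

    region = env.get("AZURE_SPEECH_REGION")
    private_endpoint = env.get("AZURE_SPEECH_PRIVATE_ENDPOINT")

    # Documented rule: region is required unless private_endpoint is provided.
    if not region and not private_endpoint:
        raise ValueError(
            "set AZURE_SPEECH_REGION or AZURE_SPEECH_PRIVATE_ENDPOINT"
        )

    # Only pass the optional arguments that were actually configured.
    kwargs = {"api_key": api_key}
    if region:
        kwargs["region"] = region
    if private_endpoint:
        kwargs["private_endpoint"] = private_endpoint
    return kwargs
```

With Pipecat installed, the result would be passed as AzureSTTService(**azure_stt_kwargs()). To select a language without the deprecated language parameter, add settings=AzureSTTService.Settings(language=...) as described in the deprecation note.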

can_generate_metrics() → bool

Check if this service can generate performance metrics.

Returns: True, as this service supports metrics generation.

language_to_service_language(language: Language) → str | None

Convert a Language enum value to the Azure service-specific language code.

Parameters: language – The language to convert.
Returns: The Azure-specific language identifier, or None if not supported.
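The contract described above (a language code on success, None for an unsupported language) can be sketched as a table lookup. This is an illustration only: Lang is a stand-in for Pipecat's Language enum, and the table contents and the function name to_azure_language are hypothetical, not the mapping the service actually uses.

```python
from enum import Enum
from typing import Optional


class Lang(str, Enum):
    """Stand-in for Pipecat's Language enum (members chosen for illustration)."""

    EN_US = "en-US"
    FR_FR = "fr-FR"
    TLH = "tlh"  # deliberately left out of the table below


# Hypothetical subset of a language-to-Azure-locale table.
AZURE_LOCALES = {
    Lang.EN_US: "en-US",
    Lang.FR_FR: "fr-FR",
}


def to_azure_language(language: Lang) -> Optional[str]:
    """Return the Azure locale for `language`, or None if it is not mapped."""
    return AZURE_LOCALES.get(language)
```

Because the return type is str | None, callers would need to handle the None case explicitly instead of passing the result straight to the recognizer configuration.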

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text conversion.

Feeds audio data to the Azure speech recognizer for processing. Recognition results are handled asynchronously through callbacks.

Parameters: audio – Raw audio bytes to process.
Yields: Frame – None on successful processing, or an ErrorFrame on failure.
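The yield behaviour described above can be sketched with stand-in types. Nothing here is Pipecat or Azure code: ErrorFrame is a simplified stand-in, FakeAudioStream is a hypothetical replacement for the stream that feeds the recognizer, and push_audio only mirrors the documented shape of run_stt (None when the audio was accepted, an error frame otherwise). Transcripts would arrive separately through recognizer callbacks.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional


@dataclass
class ErrorFrame:
    """Simplified stand-in for Pipecat's ErrorFrame."""

    error: str


class FakeAudioStream:
    """Hypothetical push stream; a real one would feed the Azure recognizer."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.chunks: List[bytes] = []

    def write(self, audio: bytes) -> None:
        if self.fail:
            raise RuntimeError("stream closed")
        self.chunks.append(audio)


async def push_audio(
    stream: FakeAudioStream, audio: bytes
) -> AsyncGenerator[Optional[ErrorFrame], None]:
    """Feed one audio chunk; yield None on success or an ErrorFrame on failure."""
    try:
        stream.write(audio)
    except Exception as exc:
        yield ErrorFrame(error=str(exc))
        return
    # No transcript is produced here; results are delivered by callbacks.
    yield None


async def collect(stream: FakeAudioStream, audio: bytes) -> list:
    """Drain the generator into a list so the result is easy to inspect."""
    return [frame async for frame in push_audio(stream, audio)]
```

Running asyncio.run(collect(FakeAudioStream(), b"...")) produces a single None, while a stream constructed with fail=True produces a single ErrorFrame carrying the exception message.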

async start(frame: StartFrame)

Start the speech recognition service.

Parameters: frame – Frame indicating the start of processing.

async stop(frame: EndFrame)

Stop the speech recognition service.

Parameters: frame – Frame indicating the end of processing.

async cancel(frame: CancelFrame)

Cancel the speech recognition service.

Parameters: frame – Frame indicating cancellation.