stt

AssemblyAI speech-to-text service implementation.

This module provides integration with AssemblyAI’s real-time speech-to-text WebSocket API for streaming audio transcription.

pipecat.services.assemblyai.stt.map_language_from_assemblyai(language_code: str) → Language[source]

Map AssemblyAI language codes to Pipecat Language enum.

AssemblyAI returns simple language codes like “es”, “fr”, etc. This function maps them to the corresponding Language enum values.

Parameters: language_code – AssemblyAI language code (e.g., “es”, “fr”, “de”)
Returns: Corresponding Language enum value, defaulting to Language.EN if not found.
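
The lookup-with-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the Language enum here is a hypothetical stand-in with only four members, and map_language_sketch and its table are names invented for this example.

```python
from enum import Enum


class Language(Enum):
    """Hypothetical stand-in for Pipecat's Language enum (a few members only)."""

    EN = "en"
    ES = "es"
    FR = "fr"
    DE = "de"


# Illustrative table; the real function may recognize more codes than these.
_CODE_TO_LANGUAGE = {
    "en": Language.EN,
    "es": Language.ES,
    "fr": Language.FR,
    "de": Language.DE,
}


def map_language_sketch(language_code: str) -> Language:
    """Return the matching Language, falling back to Language.EN if unknown."""
    return _CODE_TO_LANGUAGE.get(language_code, Language.EN)


spanish = map_language_sketch("es")   # a known code maps to its enum member
fallback = map_language_sketch("xx")  # an unknown code falls back to Language.EN
```

A dictionary lookup with a default keeps the fallback rule in one place, so an unrecognized code never raises.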

class pipecat.services.assemblyai.stt.AssemblyAISTTSettings

Bases: STTSettings

Settings for AssemblyAISTTService.

Parameters:

formatted_finals – Whether to enable transcript formatting.
word_finalization_max_wait_time – Maximum time to wait for word finalization in milliseconds.
end_of_turn_confidence_threshold – Confidence threshold for end-of-turn detection.
min_turn_silence – Minimum silence duration when confident about end-of-turn.
max_turn_silence – Maximum silence duration before forcing end-of-turn.
keyterms_prompt – List of key terms to guide transcription.
prompt – Optional text prompt to guide the transcription. Only used when model is “u3-rt-pro”.
language_detection – Enable automatic language detection.
format_turns – Whether to format transcript turns.
speaker_labels – Enable speaker diarization.
vad_threshold – VAD confidence threshold (0.0–1.0) for classifying audio frames as silence. Only applicable to u3-rt-pro.
domain – Optional domain for specialized recognition modes. For example, set to “medical-v1” to enable Medical Mode for healthcare transcription.

formatted_finals: bool | _NotGiven

word_finalization_max_wait_time: int | None | _NotGiven

end_of_turn_confidence_threshold: float | None | _NotGiven

min_turn_silence: int | None | _NotGiven

max_turn_silence: int | None | _NotGiven

keyterms_prompt: list[str] | None | _NotGiven

prompt: str | None | _NotGiven

language_detection: bool | None | _NotGiven

format_turns: bool | _NotGiven

speaker_labels: bool | None | _NotGiven

vad_threshold: float | None | _NotGiven

domain: str | None | _NotGiven
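
The `| _NotGiven` part of each field type suggests a sentinel that distinguishes "not set by the caller" from an explicit None. The sketch below shows that general pattern with a stand-in dataclass. The names _NotGivenSketch, NOT_GIVEN, SettingsSketch, and given() are hypothetical, and whether Pipecat's own sentinel behaves exactly this way is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


class _NotGivenSketch:
    """Hypothetical sentinel type meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenSketch()


@dataclass
class SettingsSketch:
    """Stand-in that reuses three of the documented field names."""

    formatted_finals: bool | _NotGivenSketch = NOT_GIVEN
    min_turn_silence: int | None | _NotGivenSketch = NOT_GIVEN
    keyterms_prompt: list[str] | None | _NotGivenSketch = NOT_GIVEN

    def given(self) -> dict:
        """Return only the fields that were explicitly set (None included)."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not NOT_GIVEN
        }


# An explicit None is kept as a real value; untouched fields are skipped.
update = SettingsSketch(min_turn_silence=None, keyterms_prompt=["Pipecat"])
explicit_fields = update.given()
```

With this pattern, a partial settings update can be applied field by field without overwriting values the caller never mentioned.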

class pipecat.services.assemblyai.stt.AssemblyAISTTService(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)[source]

Bases: WebsocketSTTService

AssemblyAI real-time speech-to-text service.

Provides real-time speech transcription using AssemblyAI’s WebSocket API. Supports both interim and final transcriptions with configurable parameters for audio processing and connection management.

Event handlers available (in addition to WebsocketSTTService events):

on_end_of_turn(service, transcript): Called when AssemblyAI detects end of turn.

Example:

@service.event_handler("on_end_of_turn")
async def on_end_of_turn(service, transcript):
    ...

Settings: alias of AssemblyAISTTSettings

__init__(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)[source]

Initialize the AssemblyAI STT service.

Parameters:

api_key – AssemblyAI API key for authentication.
language –
Language code for transcription. Defaults to English (Language.EN).

Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(language=...) instead.
api_endpoint_base_url – WebSocket endpoint URL. Defaults to AssemblyAI’s streaming endpoint.
sample_rate – Audio sample rate in Hz. Defaults to 16000.
encoding – Audio encoding format. Defaults to “pcm_s16le”.
connection_params –
Connection configuration parameters.

Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(...) instead.
vad_force_turn_endpoint –
Controls turn detection mode.

When True (Pipecat mode, default): forces AssemblyAI to return finals as soon as possible so that Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done.
- min_turn_silence defaults to 100 ms (user can override).
- max_turn_silence is always set equal to min_turn_silence.
- VAD stop sends ForceEndpoint as a ceiling.
- No UserStartedSpeakingFrame or UserStoppedSpeakingFrame is emitted from STT.

When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using its built-in turn detection.
- Uses AssemblyAI API defaults for all parameters (unless the user explicitly sets them).
- Emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame from STT.
- No ForceEndpoint is sent on VAD stop.
should_interrupt – Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection. Defaults to True.
speaker_format – Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Example: “<{speaker}>{text}</{speaker}>” or “{speaker}: {text}”. If None, transcript text is not modified. Defaults to None.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to parent STTService class.
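
Two of the parameters above describe rules that are easier to follow in code: how vad_force_turn_endpoint affects the turn-silence values, and how speaker_format is applied. The sketch below restates those documented rules with standalone functions. The function names are hypothetical, and the use of str.format for substitution is an assumption about how the placeholders are filled in.

```python
from __future__ import annotations

# Documented default for min_turn_silence in Pipecat mode, in milliseconds.
DEFAULT_PIPECAT_MODE_SILENCE_MS = 100


def resolve_turn_silence(
    vad_force_turn_endpoint: bool,
    min_turn_silence: int | None = None,
    max_turn_silence: int | None = None,
) -> tuple[int | None, int | None]:
    """Return (min, max) turn silence in ms following the documented rules."""
    if vad_force_turn_endpoint:
        # Pipecat mode: min defaults to 100 ms, and max always mirrors min.
        resolved_min = (
            DEFAULT_PIPECAT_MODE_SILENCE_MS
            if min_turn_silence is None
            else min_turn_silence
        )
        return resolved_min, resolved_min
    # AssemblyAI mode: pass user values through; None means "use the API default".
    return min_turn_silence, max_turn_silence


def apply_speaker_format(speaker_format: str | None, speaker: str, text: str) -> str:
    """Wrap transcript text with a speaker label, or return it unchanged."""
    if speaker_format is None:
        return text
    return speaker_format.format(speaker=speaker, text=text)


pipecat_mode = resolve_turn_silence(True, min_turn_silence=250, max_turn_silence=900)
labelled = apply_speaker_format("<{speaker}>{text}</{speaker}>", "A", "hello")
```

In the first call the supplied max_turn_silence is ignored, because Pipecat mode always sets it equal to min_turn_silence.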

can_generate_metrics() → bool[source]

Check if the service can generate metrics.

Returns: True if metrics generation is supported.

async start(frame: StartFrame)[source]

Start the speech-to-text service.

Parameters: frame – Start frame to begin processing.

async stop(frame: EndFrame)[source]

Stop the speech-to-text service.

Parameters: frame – End frame to stop processing.

async cancel(frame: CancelFrame)[source]

Cancel the speech-to-text service.

Parameters: frame – Cancel frame to abort processing.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Process audio data for speech-to-text conversion.

Parameters: audio – Raw audio bytes to process.
Yields: None (processing handled via WebSocket messages).
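
The shape of a run_stt method that yields None while results arrive elsewhere can be sketched as below. This is only an illustration of the pattern: FakeSocket and SketchSTT are hypothetical stand-ins, and the assumption that audio is forwarded to the socket inside run_stt is not confirmed by this page.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FakeSocket:
    """Hypothetical stand-in for an open WebSocket connection."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    async def send(self, data: bytes) -> None:
        # Record the payload instead of writing to a network connection.
        self.sent.append(data)


class SketchSTT:
    """Minimal service-like object exposing only a run_stt method."""

    def __init__(self, socket: FakeSocket) -> None:
        self._socket = socket

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[object], None]:
        # Forward the audio; transcripts would be handled by a separate
        # receive loop, so there is no frame to hand back from this call.
        await self._socket.send(audio)
        yield None


async def demo() -> List[bytes]:
    socket = FakeSocket()
    stt = SketchSTT(socket)
    async for frame in stt.run_stt(b"\x00\x01"):
        assert frame is None  # the generator yields None, never a transcript
    return socket.sent


sent = asyncio.run(demo())
```

Keeping the generator signature lets a streaming service fit the same interface as services that return transcripts directly from run_stt.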

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames for VAD and metrics handling.

Parameters:

frame – Frame to process.
direction – Direction of frame processing.