stt
AssemblyAI speech-to-text service implementation.
This module provides integration with AssemblyAI’s real-time speech-to-text WebSocket API for streaming audio transcription.
- pipecat.services.assemblyai.stt.map_language_from_assemblyai(language_code: str) → Language
Map AssemblyAI language codes to Pipecat Language enum.
AssemblyAI returns simple language codes like “es”, “fr”, etc. This function maps them to the corresponding Language enum values.
- Parameters:
language_code – AssemblyAI language code (e.g., “es”, “fr”, “de”)
- Returns:
Corresponding Language enum value, defaulting to Language.EN if not found.
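The lookup-with-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Language` enum here is a small hypothetical stand-in for Pipecat's own enum, and `map_language_sketch` is an illustrative name.

```python
from enum import Enum


class Language(str, Enum):
    """Hypothetical stand-in for Pipecat's Language enum (only a few members)."""

    EN = "en"
    ES = "es"
    FR = "fr"
    DE = "de"


def map_language_sketch(language_code: str) -> Language:
    """Return the matching Language member, or Language.EN if the code is unknown."""
    try:
        # Enum lookup by value: Language("es") returns Language.ES.
        return Language(language_code)
    except ValueError:
        # Unknown codes fall back to English, as the Returns note describes.
        return Language.EN
```

With this sketch, a known code such as "es" resolves to its member, while an unrecognized code falls back to `Language.EN` rather than raising.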
- class pipecat.services.assemblyai.stt.AssemblyAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, formatted_finals: bool | _NotGiven = <factory>, word_finalization_max_wait_time: int | None | _NotGiven = <factory>, end_of_turn_confidence_threshold: float | None | _NotGiven = <factory>, min_turn_silence: int | None | _NotGiven = <factory>, max_turn_silence: int | None | _NotGiven = <factory>, keyterms_prompt: list[str] | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, language_detection: bool | None | _NotGiven = <factory>, format_turns: bool | _NotGiven = <factory>, speaker_labels: bool | None | _NotGiven = <factory>, vad_threshold: float | None | _NotGiven = <factory>, domain: str | None | _NotGiven = <factory>)
Bases: STTSettings
Settings for AssemblyAISTTService.
- Parameters:
formatted_finals – Whether to enable transcript formatting.
word_finalization_max_wait_time – Maximum time to wait for word finalization in milliseconds.
end_of_turn_confidence_threshold – Confidence threshold for end-of-turn detection.
min_turn_silence – Minimum silence duration when confident about end-of-turn.
max_turn_silence – Maximum silence duration before forcing end-of-turn.
keyterms_prompt – List of key terms to guide transcription.
prompt – Optional text prompt to guide the transcription. Only used when model is “u3-rt-pro”.
language_detection – Enable automatic language detection.
format_turns – Whether to format transcript turns.
speaker_labels – Enable speaker diarization.
vad_threshold – VAD confidence threshold (0.0–1.0) for classifying audio frames as silence. Only applicable to u3-rt-pro.
domain – Optional domain for specialized recognition modes. For example, set to “medical-v1” to enable Medical Mode for healthcare transcription.
- formatted_finals: bool | _NotGiven
- word_finalization_max_wait_time: int | None | _NotGiven
- end_of_turn_confidence_threshold: float | None | _NotGiven
- min_turn_silence: int | None | _NotGiven
- max_turn_silence: int | None | _NotGiven
- keyterms_prompt: list[str] | None | _NotGiven
- prompt: str | None | _NotGiven
- language_detection: bool | None | _NotGiven
- format_turns: bool | _NotGiven
- speaker_labels: bool | None | _NotGiven
- vad_threshold: float | None | _NotGiven
- domain: str | None | _NotGiven
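Each field above accepts `_NotGiven` in addition to its value type (and, for most fields, `None`). One common reason for such a sentinel is to tell "the caller did not set this field" apart from "the caller explicitly set it to None". The sketch below illustrates that pattern under stated assumptions: `_NotGivenSketch`, `NOT_GIVEN`, `SettingsSketch`, and `apply_update` are hypothetical names, and Pipecat's actual merge logic may differ.

```python
from dataclasses import dataclass, field, fields
from typing import Union


class _NotGivenSketch:
    """Hypothetical sentinel type meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenSketch()


@dataclass
class SettingsSketch:
    # Stand-ins for two of the documented fields; this is not the real class.
    min_turn_silence: Union[int, None, _NotGivenSketch] = field(
        default_factory=lambda: NOT_GIVEN
    )
    prompt: Union[str, None, _NotGivenSketch] = field(
        default_factory=lambda: NOT_GIVEN
    )


def apply_update(current: SettingsSketch, update: SettingsSketch) -> SettingsSketch:
    """Return a copy of `current` with only the explicitly given fields of `update` applied."""
    merged = SettingsSketch(
        **{f.name: getattr(current, f.name) for f in fields(current)}
    )
    for f in fields(update):
        value = getattr(update, f.name)
        # An explicit None is a real value and overrides; NOT_GIVEN is skipped.
        if value is not NOT_GIVEN:
            setattr(merged, f.name, value)
    return merged
```

In this sketch, updating with `SettingsSketch(prompt=None)` clears the prompt but leaves `min_turn_silence` untouched, which would not be expressible if `None` doubled as the "unset" marker.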
- class pipecat.services.assemblyai.stt.AssemblyAISTTService(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)
Bases: WebsocketSTTService
AssemblyAI real-time speech-to-text service.
Provides real-time speech transcription using AssemblyAI’s WebSocket API. Supports both interim and final transcriptions with configurable parameters for audio processing and connection management.
Event handlers available (in addition to WebsocketSTTService events):
on_end_of_turn(service, transcript): Called when AssemblyAI detects end of turn.
Example:
    @service.event_handler("on_end_of_turn")
    async def on_end_of_turn(service, transcript):
        ...
- Settings
alias of AssemblyAISTTSettings
- __init__(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)
Initialize the AssemblyAI STT service.
- Parameters:
api_key – AssemblyAI API key for authentication.
language – Language code for transcription. Defaults to English (Language.EN).
Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(language=...) instead.
api_endpoint_base_url – WebSocket endpoint URL. Defaults to AssemblyAI’s streaming endpoint.
sample_rate – Audio sample rate in Hz. Defaults to 16000.
encoding – Audio encoding format. Defaults to “pcm_s16le”.
connection_params – Connection configuration parameters.
Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(...) instead.
vad_force_turn_endpoint – Controls turn detection mode.
When True (Pipecat mode, default): Forces AssemblyAI to return finals as soon as possible so that Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done.
  - min_turn_silence defaults to 100 ms (user can override).
  - max_turn_silence is always set equal to min_turn_silence.
  - VAD stop sends ForceEndpoint as a ceiling.
  - No UserStarted/StoppedSpeakingFrame is emitted from STT.
When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using its built-in turn detection.
  - Uses AssemblyAI API defaults for all parameters (unless the user explicitly sets them).
  - Emits UserStarted/StoppedSpeakingFrame from STT.
  - No ForceEndpoint on VAD stop.
should_interrupt – Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection. Defaults to True.
speaker_format – Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Example: “<{speaker}>{text}</{speaker}>” or “{speaker}: {text}”. If None, the transcript text is not modified. Defaults to None.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to parent STTService class.
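Two of the parameter behaviors above lend themselves to a short illustration: how the turn-silence values resolve in each turn detection mode, and how a speaker_format template is applied. The sketch below is a hedged reading of the descriptions, not the service's code. `resolve_turn_silence` and `apply_speaker_format` are hypothetical helper names, and the use of `str.format` for the template is an assumption based on the `{speaker}` and `{text}` placeholders.

```python
from typing import Optional, Tuple

# Default stated in the vad_force_turn_endpoint description (Pipecat mode).
DEFAULT_MIN_TURN_SILENCE_MS = 100


def resolve_turn_silence(
    vad_force_turn_endpoint: bool,
    min_turn_silence: Optional[int] = None,
    max_turn_silence: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (min_turn_silence, max_turn_silence) as the two modes describe them."""
    if vad_force_turn_endpoint:
        # Pipecat mode: min defaults to 100 ms unless overridden,
        # and max is always set equal to min.
        resolved_min = (
            DEFAULT_MIN_TURN_SILENCE_MS if min_turn_silence is None else min_turn_silence
        )
        return resolved_min, resolved_min
    # AssemblyAI mode: pass through only what the user set; None here
    # stands for "leave it to the AssemblyAI API default".
    return min_turn_silence, max_turn_silence


def apply_speaker_format(speaker_format: Optional[str], speaker: str, text: str) -> str:
    """Apply a speaker_format template, or return the text unmodified if it is None."""
    if speaker_format is None:
        return text
    return speaker_format.format(speaker=speaker, text=text)
```

Under these assumptions, a user-supplied max_turn_silence has no effect in Pipecat mode, which is the practical consequence of the rule that max always equals min there.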
- can_generate_metrics() → bool
Check if the service can generate metrics.
- Returns:
True if metrics generation is supported.
- async start(frame: StartFrame)
Start the speech-to-text service.
- Parameters:
frame – Start frame to begin processing.
- async stop(frame: EndFrame)
Stop the speech-to-text service.
- Parameters:
frame – End frame to stop processing.
- async cancel(frame: CancelFrame)
Cancel the speech-to-text service.
- Parameters:
frame – Cancel frame to abort processing.
- async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]
Process audio data for speech-to-text conversion.
- Parameters:
audio – Raw audio bytes to process.
- Yields:
None (processing handled via WebSocket messages).
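The "yields None" contract can look odd at first: the method is an async generator, but transcripts arrive separately over the WebSocket rather than as yielded frames. A minimal sketch of that shape follows, assuming a hypothetical `FakeSocket` stand-in for the connection. None of these names come from Pipecat, and the real method does more than this.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FakeSocket:
    """Hypothetical stand-in for a WebSocket connection; it only records sent audio."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    async def send(self, data: bytes) -> None:
        self.sent.append(data)


async def run_stt_sketch(
    socket: FakeSocket, audio: bytes
) -> AsyncGenerator[Optional[object], None]:
    """Forward audio to the socket and yield None, since results arrive elsewhere."""
    await socket.send(audio)
    # No frame is produced here; a separate receive loop would handle transcripts.
    yield None


async def demo() -> list:
    socket = FakeSocket()
    yielded = [frame async for frame in run_stt_sketch(socket, b"\x00\x01")]
    return [yielded, socket.sent]
```

Running `asyncio.run(demo())` in this sketch shows the generator yielding a single `None` while the audio bytes end up on the socket.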
- async process_frame(frame: Frame, direction: FrameDirection)
Process frames for VAD and metrics handling.
- Parameters:
frame – Frame to process.
direction – Direction of frame processing.