stt

AssemblyAI speech-to-text service implementation.

This module provides integration with AssemblyAI’s real-time speech-to-text WebSocket API for streaming audio transcription.

pipecat.services.assemblyai.stt.map_language_from_assemblyai(language_code: str) → Language

Map AssemblyAI language codes to Pipecat Language enum.

AssemblyAI returns simple language codes like “es”, “fr”, etc. This function maps them to the corresponding Language enum values.

Parameters:

language_code – AssemblyAI language code (e.g., “es”, “fr”, “de”)

Returns:

Corresponding Language enum value, defaulting to Language.EN if not found.
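The lookup-with-fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Language` enum here is a small stand-in for Pipecat's enum, and `map_language_sketch` and `_CODE_TO_LANGUAGE` are hypothetical names.

```python
from enum import Enum


class Language(Enum):
    """Stand-in for Pipecat's Language enum (only a few members shown)."""

    EN = "en"
    ES = "es"
    FR = "fr"
    DE = "de"


# Hypothetical lookup table from AssemblyAI codes to enum members.
_CODE_TO_LANGUAGE = {member.value: member for member in Language}


def map_language_sketch(language_code: str) -> Language:
    """Return the matching enum member, or Language.EN for unknown codes."""
    return _CODE_TO_LANGUAGE.get(language_code, Language.EN)
```

A known code such as "es" resolves to its enum member, while an unrecognised code falls back to English, matching the documented default.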

class pipecat.services.assemblyai.stt.AssemblyAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, formatted_finals: bool | _NotGiven = <factory>, word_finalization_max_wait_time: int | None | _NotGiven = <factory>, end_of_turn_confidence_threshold: float | None | _NotGiven = <factory>, min_turn_silence: int | None | _NotGiven = <factory>, max_turn_silence: int | None | _NotGiven = <factory>, keyterms_prompt: list[str] | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, language_detection: bool | None | _NotGiven = <factory>, format_turns: bool | _NotGiven = <factory>, speaker_labels: bool | None | _NotGiven = <factory>, vad_threshold: float | None | _NotGiven = <factory>, domain: str | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for AssemblyAISTTService.

Parameters:
  • formatted_finals – Whether to enable transcript formatting.

  • word_finalization_max_wait_time – Maximum time to wait for word finalization in milliseconds.

  • end_of_turn_confidence_threshold – Confidence threshold for end-of-turn detection.

  • min_turn_silence – Minimum silence duration, in milliseconds, when confident about end-of-turn.

  • max_turn_silence – Maximum silence duration, in milliseconds, before forcing end-of-turn.

  • keyterms_prompt – List of key terms to guide transcription.

  • prompt – Optional text prompt to guide the transcription. Only used when model is “u3-rt-pro”.

  • language_detection – Enable automatic language detection.

  • format_turns – Whether to format transcript turns.

  • speaker_labels – Enable speaker diarization.

  • vad_threshold – VAD confidence threshold (0.0–1.0) for classifying audio frames as silence. Only applicable to u3-rt-pro.

  • domain – Optional domain for specialized recognition modes. For example, set to “medical-v1” to enable Medical Mode for healthcare transcription.

formatted_finals: bool | _NotGiven
word_finalization_max_wait_time: int | None | _NotGiven
end_of_turn_confidence_threshold: float | None | _NotGiven
min_turn_silence: int | None | _NotGiven
max_turn_silence: int | None | _NotGiven
keyterms_prompt: list[str] | None | _NotGiven
prompt: str | None | _NotGiven
language_detection: bool | None | _NotGiven
format_turns: bool | _NotGiven
speaker_labels: bool | None | _NotGiven
vad_threshold: float | None | _NotGiven
domain: str | None | _NotGiven

class pipecat.services.assemblyai.stt.AssemblyAISTTService(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)

Bases: WebsocketSTTService

AssemblyAI real-time speech-to-text service.

Provides real-time speech transcription using AssemblyAI’s WebSocket API. Supports both interim and final transcriptions with configurable parameters for audio processing and connection management.

Event handlers available (in addition to WebsocketSTTService events):

  • on_end_of_turn(service, transcript): Called when AssemblyAI detects end of turn.

Example:

@service.event_handler("on_end_of_turn")
async def on_end_of_turn(service, transcript):
    ...

Settings

alias of AssemblyAISTTSettings

__init__(*, api_key: str, language: Language | None = None, api_endpoint_base_url: str = 'wss://streaming.assemblyai.com/v3/ws', sample_rate: int = 16000, encoding: str = 'pcm_s16le', connection_params: AssemblyAIConnectionParams | None = None, vad_force_turn_endpoint: bool = True, should_interrupt: bool = True, speaker_format: str | None = None, settings: AssemblyAISTTSettings | None = None, ttfs_p99_latency: float | None = 0.42, **kwargs)

Initialize the AssemblyAI STT service.

Parameters:
  • api_key – AssemblyAI API key for authentication.

  • language

    Language code for transcription. Defaults to English (Language.EN).

    Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(language=...) instead.

  • api_endpoint_base_url – WebSocket endpoint URL. Defaults to AssemblyAI’s streaming endpoint.

  • sample_rate – Audio sample rate in Hz. Defaults to 16000.

  • encoding – Audio encoding format. Defaults to “pcm_s16le”.

  • connection_params

    Connection configuration parameters.

    Deprecated since version 0.0.105: Use settings=AssemblyAISTTService.Settings(...) instead.

  • vad_force_turn_endpoint – Controls turn detection mode.

    When True (Pipecat mode, default): forces AssemblyAI to return finals as soon as possible so that Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done.

      - min_turn_silence defaults to 100 ms (user can override).
      - max_turn_silence is always set equal to min_turn_silence.
      - VAD stop sends ForceEndpoint as a ceiling.
      - No UserStarted/StoppedSpeakingFrame is emitted from STT.

    When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using its built-in turn detection.

      - Uses AssemblyAI API defaults for all parameters (unless the user explicitly sets them).
      - Emits UserStarted/StoppedSpeakingFrame from STT.
      - No ForceEndpoint is sent on VAD stop.

  • should_interrupt – Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection. Defaults to True.

  • speaker_format – Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Example: “<{speaker}>{text}</{speaker}>” or “{speaker}: {text}”. If None, transcript text is not modified. Defaults to None.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to parent STTService class.
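Two behaviours from the parameter list above are easier to follow in code: how vad_force_turn_endpoint affects the turn-silence values, and how a speaker_format template is applied. The sketch below is illustrative only. `resolve_turn_silence` and `apply_speaker_format` are hypothetical helpers, and using `str.format` for the template is an assumption rather than the confirmed implementation.

```python
from typing import Optional, Tuple

# Default taken from the vad_force_turn_endpoint description (100 ms).
DEFAULT_MIN_TURN_SILENCE_MS = 100


def resolve_turn_silence(
    vad_force_turn_endpoint: bool,
    min_turn_silence: Optional[int] = None,
    max_turn_silence: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (min, max) turn silence in ms; None means 'use the API default'."""
    if vad_force_turn_endpoint:
        # Pipecat mode: the user may override min, but max always mirrors min.
        minimum = (
            DEFAULT_MIN_TURN_SILENCE_MS
            if min_turn_silence is None
            else min_turn_silence
        )
        return minimum, minimum
    # AssemblyAI mode: pass through only what the user set explicitly.
    return min_turn_silence, max_turn_silence


def apply_speaker_format(speaker_format: Optional[str], speaker: str, text: str) -> str:
    """Fill {speaker} and {text}; leave the text unchanged when no format is set."""
    if speaker_format is None:
        return text
    # Assumption: the template is filled with str.format-style substitution.
    return speaker_format.format(speaker=speaker, text=text)
```

In Pipecat mode a user-supplied max_turn_silence is ignored in this sketch, which mirrors the statement that max_turn_silence is always set equal to min_turn_silence.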

can_generate_metrics() → bool

Check if the service can generate metrics.

Returns:

True if metrics generation is supported.

async start(frame: StartFrame)

Start the speech-to-text service.

Parameters:

frame – Start frame to begin processing.

async stop(frame: EndFrame)

Stop the speech-to-text service.

Parameters:

frame – End frame to stop processing.

async cancel(frame: CancelFrame)

Cancel the speech-to-text service.

Parameters:

frame – Cancel frame to abort processing.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text conversion.

Parameters:

audio – Raw audio bytes to process.

Yields:

None (processing handled via WebSocket messages).
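An async generator that yields only None can look odd at first. The sketch below shows one plausible shape for it, assuming each audio chunk is forwarded over the socket while transcripts arrive separately through WebSocket messages. `FakeWebSocket`, `SketchSTT`, and `collect` are hypothetical stand-ins, not Pipecat classes.

```python
import asyncio
from typing import AsyncGenerator, List, Optional, Tuple


class FakeWebSocket:
    """Hypothetical in-memory socket that records what was sent."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    async def send(self, data: bytes) -> None:
        self.sent.append(data)


class SketchSTT:
    """Stand-in service: forwards audio and produces no frames directly."""

    def __init__(self, websocket: FakeWebSocket) -> None:
        self._websocket = websocket

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[object], None]:
        # Forward the raw audio; results would be handled by a separate
        # receive path (an assumption in this sketch).
        await self._websocket.send(audio)
        yield None


async def collect(audio: bytes) -> Tuple[list, List[bytes]]:
    """Drive the generator once and return what it yielded and what was sent."""
    websocket = FakeWebSocket()
    service = SketchSTT(websocket)
    frames = [frame async for frame in service.run_stt(audio)]
    return frames, websocket.sent
```

Calling `asyncio.run(collect(chunk))` drives the generator once: the chunk ends up in the socket's send log, and the only value yielded is None.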

async process_frame(frame: Frame, direction: FrameDirection)

Process frames for VAD and metrics handling.

Parameters:
  • frame – Frame to process.

  • direction – Direction of frame processing.