stt_service

Base classes for Speech-to-Text services with continuous and segmented processing.

class pipecat.services.stt_service.STTService(*, audio_passthrough=True, sample_rate: int | None = None, stt_ttfb_timeout: float = 2.0, ttfs_p99_latency: float | None = None, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, settings: STTSettings | None = None, **kwargs)[source]

Bases: AIService

Base class for speech-to-text services.

Provides common functionality for STT services including audio passthrough, muting, settings management, and audio processing. Subclasses must implement the run_stt method to provide actual speech recognition.

Includes an optional keepalive mechanism that sends silent audio when no real audio has been sent for a configurable timeout, preventing servers from closing idle connections (e.g. when behind a ServiceSwitcher). Subclasses that enable keepalive must override _send_keepalive() to deliver the silence in the appropriate service-specific protocol.
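
The idle check described above can be sketched in plain Python. This is a minimal illustration, not pipecat's implementation: `make_silence` and `KeepaliveTimer` are hypothetical names, and 16-bit mono PCM is an assumed audio format.

```python
from typing import Optional


def make_silence(sample_rate: int, duration_ms: int = 100) -> bytes:
    """Return zeroed PCM audio (assumes 16-bit mono, i.e. 2 bytes per sample)."""
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00" * (num_samples * 2)


class KeepaliveTimer:
    """Hypothetical helper that decides when silence should be sent."""

    def __init__(self, keepalive_timeout: Optional[float]):
        self.keepalive_timeout = keepalive_timeout  # None disables keepalive
        self.last_audio_time = 0.0

    def on_audio_sent(self, now: float) -> None:
        # Real audio resets the idle clock.
        self.last_audio_time = now

    def should_send_keepalive(self, now: float) -> bool:
        if self.keepalive_timeout is None:
            return False
        return now - self.last_audio_time >= self.keepalive_timeout
```

In this sketch, a periodic task would call `should_send_keepalive()` every `keepalive_interval` seconds and, when it returns True, hand the silence to the service-specific sender.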

Event handlers:

on_connected – Called when connected to the STT service.
on_disconnected – Called when disconnected from the STT service.
on_connection_error – Called when a connection error occurs with the STT service.

Example:

@stt.event_handler("on_connected")
async def on_connected(stt: STTService):
    logger.debug("STT connected")

@stt.event_handler("on_disconnected")
async def on_disconnected(stt: STTService):
    logger.debug("STT disconnected")

@stt.event_handler("on_connection_error")
async def on_connection_error(stt: STTService, error: str):
    logger.error(f"STT connection error: {error}")

__init__(*, audio_passthrough=True, sample_rate: int | None = None, stt_ttfb_timeout: float = 2.0, ttfs_p99_latency: float | None = None, keepalive_timeout: float | None = None, keepalive_interval: float = 5.0, settings: STTSettings | None = None, **kwargs)[source]

Initialize the STT service.

Parameters:

audio_passthrough – Whether to pass audio frames downstream after processing. Defaults to True.
sample_rate – The sample rate for audio input. If None, will be determined from the start frame.
stt_ttfb_timeout – Time in seconds to wait after VAD stop before reporting TTFB. This delay allows the final transcription to arrive. Defaults to 2.0. Note: STT “TTFB” differs from traditional TTFB, which measures from a discrete request to the first response byte. Because STT receives continuous audio, TTFB is measured from when the user stops speaking to when the final transcript arrives, which captures the latency that matters for voice AI applications.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. This is broadcast via STTMetadataFrame at pipeline start for downstream processors (e.g., turn strategies) to optimize timing. Subclasses provide measured defaults; pass a value here to override for your deployment.
keepalive_timeout – Seconds of no audio before sending silence to keep the connection alive. None disables keepalive. Useful for services that close idle connections (e.g. behind a ServiceSwitcher).
keepalive_interval – Seconds between idle checks when keepalive is enabled.
settings – The runtime-updatable settings for the STT service.
**kwargs – Additional arguments passed to the parent AIService.

property is_muted: bool

Check if the STT service is currently muted.

Returns: True if the service is muted and will not process audio.

request_finalize()[source]

Mark that a finalize request has been sent, awaiting server confirmation.

For providers that have explicit server confirmation of finalization (e.g., Deepgram’s from_finalize field), call this when sending the finalize request. Then call confirm_finalize() when the server confirms.

For providers without server confirmation, do not call this method. Send the finalize/flush/commit command and rely on the TTFB timeout instead.

confirm_finalize()[source]

Confirm that the server has acknowledged the finalize request.

Call this when the server response confirms finalization (e.g., Deepgram’s from_finalize=True). The next TranscriptionFrame pushed will be marked as finalized.

Only has effect if request_finalize() was previously called.
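
The request/confirm handshake can be modeled as a small state machine. The sketch below is an illustration under assumptions, not the library's code: `FinalizeTracker` and `consume_for_next_transcription` are hypothetical names.

```python
class FinalizeTracker:
    """Hypothetical model of the request/confirm finalize handshake."""

    def __init__(self) -> None:
        self._requested = False
        self._confirmed = False

    def request_finalize(self) -> None:
        # Called when the finalize command is sent to the server.
        self._requested = True

    def confirm_finalize(self) -> None:
        # Has an effect only if a request is pending.
        if self._requested:
            self._confirmed = True
            self._requested = False

    def consume_for_next_transcription(self) -> bool:
        """Return True once, for the first transcription after confirmation."""
        finalized = self._confirmed
        self._confirmed = False
        return finalized
```

A confirmation without a prior request is ignored, and the finalized flag applies to only one transcription, matching the behavior described above.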

property sample_rate: int

Get the current sample rate for audio processing.

Returns: The sample rate in Hz.

async set_model(model: str)[source]

Set the speech recognition model.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame(model=...) instead.

Parameters: model – The name of the model to use for speech recognition.

async set_language(language: Language)[source]

Set the language for speech recognition.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame(language=...) instead.

Parameters: language – The language to use for speech recognition.

language_to_service_language(language: Language) → str | None[source]

Convert a language to the service-specific language format.

Parameters: language – The language to convert.
Returns: The service-specific language identifier, or None if not supported.
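
A subclass override might look roughly like the following. This is a self-contained sketch: the `Language` enum here is a small stand-in for pipecat's own, and the mapping table belongs to a made-up provider.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (illustrative members only)."""

    EN_US = "en-US"
    FR = "fr"
    XH = "xh"


# Hypothetical table of codes that a made-up provider accepts.
_SERVICE_LANGUAGES = {Language.EN_US: "en", Language.FR: "fr"}


def language_to_service_language(language: Language) -> Optional[str]:
    """Return the provider's code, or None when the language is unsupported."""
    return _SERVICE_LANGUAGES.get(language)
```

Returning None instead of raising lets the caller decide how to handle an unsupported language.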

abstractmethod async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Run speech-to-text on the provided audio data.

This method must be implemented by subclasses to provide actual speech recognition functionality.

Parameters: audio – Raw audio bytes to transcribe.
Yields: Frame – Frames containing transcription results (typically TextFrame).
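
The async-generator shape of run_stt can be illustrated without any pipecat imports. In this sketch, `FakeTranscriptionFrame` and `collect` are hypothetical stand-ins, and the "recognizer" is a placeholder that only reports the input size.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional


@dataclass
class FakeTranscriptionFrame:
    """Stand-in for a pipecat frame carrying transcribed text."""

    text: str


async def run_stt(audio: bytes) -> AsyncGenerator[Optional[FakeTranscriptionFrame], None]:
    # A real service would send `audio` to a recognizer here.
    await asyncio.sleep(0)  # placeholder for network I/O
    if not audio:
        yield None  # nothing to report for empty input
        return
    yield FakeTranscriptionFrame(text=f"<{len(audio)} bytes transcribed>")


async def collect(audio: bytes) -> List[Optional[FakeTranscriptionFrame]]:
    # Drain the generator the way a caller would consume results.
    return [frame async for frame in run_stt(audio)]
```

Because the method is an async generator, a single call may yield zero, one, or several frames as results become available.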

async start(frame: StartFrame)[source]

Start the STT service.

Parameters: frame – The start frame containing initialization parameters.

async cleanup()[source]: Clean up STT service resources.

async process_audio_frame(frame: AudioRawFrame, direction: FrameDirection)[source]

Process an audio frame for speech recognition.

If a reconnect is in progress, the frame is buffered and replayed once the connection is restored. If the service is muted, the frame is dropped. Otherwise the frame is sent to the STT service and, if a user_id is present, it is stored for use in transcription results.

Parameters:

frame – The audio frame to process.
direction – The direction of frame processing.
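
The order of checks described for process_audio_frame (reconnecting first, then muted, then send) can be sketched as follows. `AudioGate` is a hypothetical class, and replaying buffered audio regardless of mute state is an assumption of this sketch.

```python
from typing import Callable, List


class AudioGate:
    """Hypothetical model of the buffer / drop / send decision."""

    def __init__(self, send: Callable[[bytes], None]):
        self._send = send
        self.reconnecting = False
        self.muted = False
        self._pending: List[bytes] = []

    def process_audio(self, audio: bytes) -> str:
        if self.reconnecting:
            self._pending.append(audio)  # keep for replay after reconnect
            return "buffered"
        if self.muted:
            return "dropped"
        self._send(audio)
        return "sent"

    def on_reconnected(self) -> None:
        # Replay buffered audio in arrival order, then resume normal flow.
        self.reconnecting = False
        pending, self._pending = self._pending, []
        for chunk in pending:
            self._send(chunk)
```

Checking the reconnect state before the mute state means audio that arrives during a reconnect is buffered even if the service is muted at that moment.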

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames, handling VAD events and audio segmentation.

Parameters:

frame – The frame to process.
direction – The direction of frame processing.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame downstream, tracking TranscriptionFrame timestamps for TTFB.

Stores the timestamp of each TranscriptionFrame for TTFB calculation. If the frame is marked as finalized (via request_finalize/confirm_finalize), reports TTFB immediately and cancels any pending timeout. Otherwise, TTFB is reported after a timeout.

Parameters:

frame – The frame to push.
direction – The direction to push the frame.
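
The two reporting paths (immediate for finalized transcripts, delayed otherwise) can be modeled with a small tracker. This is a sketch under assumptions: `TTFBTracker` and its method names are hypothetical, and timestamps are passed in explicitly rather than read from a clock.

```python
from typing import Optional


class TTFBTracker:
    """Hypothetical model: TTFB = last transcript time - speech end time."""

    def __init__(self, stt_ttfb_timeout: float = 2.0):
        self.stt_ttfb_timeout = stt_ttfb_timeout
        self._speech_end: Optional[float] = None
        self._last_transcript: Optional[float] = None

    def on_speech_end(self, now: float) -> None:
        self._speech_end = now
        self._last_transcript = None

    def on_transcription(self, now: float, finalized: bool = False) -> Optional[float]:
        self._last_transcript = now
        # A finalized transcript is reported right away.
        return self._report() if finalized else None

    def on_timeout(self, now: float) -> Optional[float]:
        # Called by a timer; reports only once the wait has elapsed.
        if self._speech_end is None or now - self._speech_end < self.stt_ttfb_timeout:
            return None
        return self._report()

    def _report(self) -> Optional[float]:
        if self._speech_end is None or self._last_transcript is None:
            return None
        ttfb = self._last_transcript - self._speech_end
        self._speech_end = None  # report at most once per utterance
        return ttfb
```

Clearing the speech-end timestamp after a report stands in for cancelling the pending timeout: a later timer callback finds nothing to report.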

class pipecat.services.stt_service.SegmentedSTTService(*, sample_rate: int | None = None, **kwargs)[source]

Bases: STTService

STT service that processes speech in segments using VAD events.

Uses Voice Activity Detection (VAD) events to detect speech segments and runs speech-to-text only on those segments, rather than continuously.

Requires VAD to be enabled in the pipeline to function properly. Maintains a small audio buffer to account for the delay between actual speech start and VAD detection.

__init__(*, sample_rate: int | None = None, **kwargs)[source]

Initialize the segmented STT service.

Parameters:

sample_rate – The sample rate for audio input. If None, will be determined from the start frame.
**kwargs – Additional arguments passed to the parent STTService.

async start(frame: StartFrame)[source]

Start the segmented STT service and initialize audio buffer.

Parameters: frame – The start frame containing initialization parameters.

async push_frame(frame: Frame, direction: FrameDirection = FrameDirection.DOWNSTREAM)[source]

Push a frame, marking TranscriptionFrames as finalized.

Segmented STT services process complete speech segments and return a single TranscriptionFrame per segment, so every transcription is inherently finalized.

Parameters:

frame – The frame to push.
direction – The direction of frame flow in the pipeline.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames, handling VAD events and audio segmentation.

async process_audio_frame(frame: AudioRawFrame, direction: FrameDirection)[source]

Process audio frames by buffering them for segmented transcription.

Continuously buffers audio, growing the buffer while the user is speaking and keeping only a small buffer when the user is not speaking, to account for VAD delay.

If the frame has a user_id, it is stored for later use in transcription.

Parameters:

frame – The audio frame to process.
direction – The direction of frame processing.
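
The buffering strategy can be sketched as a pre-roll buffer that is trimmed only while the user is silent. `SegmentBuffer` is a hypothetical class; the one-second pre-roll default and the 16-bit mono PCM format are assumptions of this sketch, not documented values.

```python
class SegmentBuffer:
    """Hypothetical pre-roll buffer for segmented transcription."""

    def __init__(self, sample_rate: int, pre_roll_secs: float = 1.0):
        # Assumes 16-bit mono PCM: 2 bytes per sample.
        self._max_idle_bytes = int(sample_rate * 2 * pre_roll_secs)
        self._buffer = bytearray()
        self.user_speaking = False

    def add_audio(self, audio: bytes) -> None:
        self._buffer += audio
        if not self.user_speaking and len(self._buffer) > self._max_idle_bytes:
            # While idle, keep only the most recent audio.
            del self._buffer[: len(self._buffer) - self._max_idle_bytes]

    def end_segment(self) -> bytes:
        """Return the buffered segment and reset for the next one."""
        segment = bytes(self._buffer)
        self._buffer.clear()
        self.user_speaking = False
        return segment
```

The retained pre-roll means the segment handed to the recognizer starts slightly before the VAD start event, so the first syllables of speech are not cut off.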

class pipecat.services.stt_service.WebsocketSTTService(*, reconnect_on_error: bool = True, **kwargs)[source]

Bases: STTService, WebsocketService

Base class for websocket-based STT services.

Combines STT functionality with websocket connectivity, providing automatic error handling, reconnection capabilities, and optional silence-based keepalive.

The keepalive feature (inherited from STTService) sends silent audio when no real audio has been sent for a configurable timeout, preventing servers from closing idle connections (e.g. when behind a ServiceSwitcher). Subclasses can override _send_keepalive() to wrap the silence in a service-specific protocol.

__init__(*, reconnect_on_error: bool = True, **kwargs)[source]

Initialize the Websocket STT service.

Parameters:

reconnect_on_error – Whether to automatically reconnect on websocket errors.
**kwargs – Additional arguments passed to parent classes (including keepalive_timeout and keepalive_interval for STTService).