stt

ElevenLabs speech-to-text service implementation.

This module provides integration with ElevenLabs’ Speech-to-Text API for transcription using segmented audio processing. The service uploads audio files and receives transcription results directly.

pipecat.services.elevenlabs.stt.language_to_elevenlabs_language(language: Language) → str | None

Convert a Language enum to ElevenLabs language code.

Source: https://elevenlabs.io/docs/capabilities/speech-to-text

Parameters: language – The Language enum value to convert.
Returns: The corresponding ElevenLabs language code, or None if the language is not supported.
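
The lookup-with-None-fallback behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the Language class here is a hypothetical stand-in for pipecat's enum, and the three-letter codes and the to_service_code name are assumptions chosen for the example.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Hypothetical stand-in for pipecat's Language enum."""

    EN = "en"
    ES = "es"
    XX = "xx"  # made-up member with no mapping, to show the None case


# Assumed mapping; the real table lives in the library.
_LANGUAGE_MAP = {
    Language.EN: "eng",
    Language.ES: "spa",
}


def to_service_code(language: Language) -> Optional[str]:
    """Return the mapped code, or None when the language has no mapping."""
    return _LANGUAGE_MAP.get(language)
```

Returning None instead of raising lets the caller decide whether to fall back to auto-detection or report an error.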

class pipecat.services.elevenlabs.stt.CommitStrategy(*values)

Bases: StrEnum

Commit strategies for transcript segmentation.

MANUAL = 'manual'

VAD = 'vad'
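
Because the members are string-valued, a strategy can be parsed from configuration text and compared against plain strings. The sketch below re-declares the enum locally with a str/Enum mixin so it runs on Python versions without StrEnum; parse_strategy is a hypothetical helper, not part of the module.

```python
from enum import Enum


class CommitStrategy(str, Enum):
    """Local stand-in mirroring the two documented members."""

    MANUAL = "manual"
    VAD = "vad"


def parse_strategy(raw: str) -> CommitStrategy:
    """Turn user-supplied text into a member; raises ValueError if unknown."""
    return CommitStrategy(raw.strip().lower())
```

For example, parse_strategy(" VAD ") yields CommitStrategy.VAD, and CommitStrategy.MANUAL == "manual" evaluates to True.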

class pipecat.services.elevenlabs.stt.ElevenLabsSTTSettings

Bases: STTSettings

Settings for ElevenLabsSTTService.

Parameters: tag_audio_events – Whether to include audio events such as (laughter) or (coughing) in the transcription.

tag_audio_events: bool | None | _NotGiven

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTSettings

Bases: STTSettings

Settings for ElevenLabsRealtimeSTTService.

See ElevenLabsRealtimeSTTService.InputParams for detailed descriptions.

Parameters:

vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0).
vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive).
min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms).
min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms).

vad_silence_threshold_secs: float | None | _NotGiven

vad_threshold: float | None | _NotGiven

min_speech_duration_ms: int | None | _NotGiven

min_silence_duration_ms: int | None | _NotGiven
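
The documented ranges can be checked before values are handed to the service. This is a minimal sketch under stated assumptions: the reference does not say whether the service validates these values itself, the bounds are treated as inclusive, and out_of_range is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple

# Ranges copied from the parameter descriptions; inclusive bounds are an assumption.
VAD_RANGES: Dict[str, Tuple[float, float]] = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def out_of_range(settings: Dict[str, Optional[float]]) -> List[str]:
    """Return the names of settings whose values fall outside the documented range."""
    problems = []
    for name, (low, high) in VAD_RANGES.items():
        value = settings.get(name)
        if value is None:  # unset values are skipped, not rejected
            continue
        if not low <= value <= high:
            problems.append(name)
    return problems
```

Calling out_of_range({"vad_threshold": 0.95, "min_speech_duration_ms": 100}) flags only vad_threshold.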

class pipecat.services.elevenlabs.stt.ElevenLabsSTTService(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)

Bases: SegmentedSTTService

Speech-to-text service using ElevenLabs’ file-based API.

This service uses ElevenLabs’ Speech-to-Text API to perform transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection. The service uploads audio files to ElevenLabs and receives transcription results directly.

Settings: alias of ElevenLabsSTTSettings

class InputParams(*, language: Language | None = None, tag_audio_events: bool = True)

Bases: BaseModel

Configuration parameters for ElevenLabs STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.

Parameters:

language – Target language for transcription.
tag_audio_events – Whether to include audio events such as (laughter) or (coughing) in the transcription.

language: Language | None

tag_audio_events: bool

__init__(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)

Initialize the ElevenLabs STT service.

Parameters:

api_key – ElevenLabs API key for authentication.
aiohttp_session – aiohttp ClientSession for HTTP requests.
base_url – Base URL for ElevenLabs API.
model – Model ID for transcription. Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(model=...) instead.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
params – Configuration parameters for the STT service. Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to SegmentedSTTService.
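
A possible way to construct the service with the non-deprecated settings argument is sketched below. The make_elevenlabs_stt helper is hypothetical, the model ID is left to the caller because this reference does not list valid values, and the import is deferred into the function so the snippet loads even where pipecat is not installed.

```python
def make_elevenlabs_stt(session, api_key: str, model_id: str):
    """Build an ElevenLabsSTTService from an existing aiohttp ClientSession.

    `session` is expected to be an aiohttp.ClientSession owned by the caller.
    """
    # Deferred import: only needed when the helper is actually called.
    from pipecat.services.elevenlabs.stt import ElevenLabsSTTService

    return ElevenLabsSTTService(
        api_key=api_key,
        aiohttp_session=session,
        # Settings replaces the deprecated `model` and `params` arguments.
        settings=ElevenLabsSTTService.Settings(
            model=model_id,
            tag_audio_events=False,
        ),
    )
```

The caller keeps ownership of the session, so it should stay open for as long as the service is in use.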

can_generate_metrics() → bool

Check if the service can generate processing metrics.

Returns: True, as the ElevenLabs STT service supports metrics generation.

language_to_service_language(language: Language) → str | None

Convert a Language enum to ElevenLabs service-specific language code.

Parameters: language – The language to convert.
Returns: The ElevenLabs-specific language code, or None if not supported.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Transcribe an audio segment using ElevenLabs’ STT API.

Parameters: audio – Raw audio bytes in WAV format (already converted by the base class).
Yields: Frame – A TranscriptionFrame containing the transcribed text, or an ErrorFrame on failure.

Note

The audio is already in WAV format from the SegmentedSTTService. Only non-empty transcriptions are yielded.
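
To make the WAV expectation concrete, the sketch below wraps raw 16-bit PCM in a WAV container using the standard library. It only illustrates what such a payload looks like; it is not the conversion SegmentedSTTService performs, and pcm_to_wav is a hypothetical helper that assumes 16-bit mono input.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV (RIFF) container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    # wave does not close a file object it did not open, so the buffer is still readable.
    return buffer.getvalue()
```

The resulting bytes begin with the RIFF and WAVE markers that identify a WAV file.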

pipecat.services.elevenlabs.stt.audio_format_from_sample_rate(sample_rate: int) → str

Get the appropriate audio format string for a given sample rate.

Parameters: sample_rate – The audio sample rate in Hz.
Returns: The ElevenLabs audio format string.
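
One plausible shape for such a mapping is sketched here. The "pcm_<rate>" naming, the list of accepted rates, and the fallback to 16 kHz are all assumptions made for illustration; this reference does not state the actual format strings or how unsupported rates are handled.

```python
# Assumed set of accepted PCM rates; not confirmed by this reference.
ASSUMED_PCM_RATES = (8000, 16000, 22050, 24000, 44100, 48000)


def audio_format_for(sample_rate: int, fallback_rate: int = 16000) -> str:
    """Map a sample rate in Hz to a format string, falling back for unknown rates."""
    if sample_rate in ASSUMED_PCM_RATES:
        return f"pcm_{sample_rate}"
    # Hypothetical fallback policy: substitute a default rate instead of raising.
    return f"pcm_{fallback_rate}"
```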

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTService(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)

Bases: WebsocketSTTService

Speech-to-text service using ElevenLabs’ Realtime WebSocket API.

This service uses ElevenLabs’ Realtime Speech-to-Text API to perform transcription with ultra-low latency. It supports both partial (interim) and committed (final) transcripts, and can use either manual commit control or automatic Voice Activity Detection (VAD) for segment boundaries.

By default, the service uses the manual commit strategy, in which Pipecat’s VAD controls when transcript segments are committed. This keeps its behavior consistent with other STT services.

Settings: alias of ElevenLabsRealtimeSTTSettings

class InputParams(*, language_code: str | None = None, commit_strategy: CommitStrategy = CommitStrategy.MANUAL, vad_silence_threshold_secs: float | None = None, vad_threshold: float | None = None, min_speech_duration_ms: int | None = None, min_silence_duration_ms: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False)

Bases: BaseModel

Configuration parameters for ElevenLabs Realtime STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.

Parameters:

language_code – ISO-639-1 or ISO-639-3 language code. Leave None for auto-detection.
commit_strategy – How to segment speech: manual (Pipecat’s VAD decides when to commit) or vad (ElevenLabs’ VAD decides).
vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0). Only used when commit_strategy is VAD. None uses ElevenLabs default.
vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive). Only used when commit_strategy is VAD. None uses ElevenLabs default.
min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.
min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.
include_timestamps – Whether to include word-level timestamps in transcripts.
enable_logging – Whether to enable logging on ElevenLabs’ side.
include_language_detection – Whether to include language detection in transcripts.

language_code: str | None

commit_strategy: CommitStrategy

vad_silence_threshold_secs: float | None

vad_threshold: float | None

min_speech_duration_ms: int | None

min_silence_duration_ms: int | None

include_timestamps: bool

enable_logging: bool

include_language_detection: bool
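
The rules stated for these parameters (None means "use the ElevenLabs default", and the VAD fields apply only under the vad strategy) can be sketched as a filter that builds a query string. Whether the service actually sends these options as query-string parameters is an assumption, and build_query is a hypothetical helper.

```python
from typing import Any, Dict
from urllib.parse import urlencode

# Fields the parameter descriptions mark as used only with the vad strategy.
VAD_ONLY_FIELDS = (
    "vad_silence_threshold_secs",
    "vad_threshold",
    "min_speech_duration_ms",
    "min_silence_duration_ms",
)


def build_query(params: Dict[str, Any]) -> str:
    """Drop unset values and VAD-only fields that do not apply, then URL-encode."""
    strategy = params.get("commit_strategy", "manual")
    query: Dict[str, Any] = {}
    for name, value in params.items():
        if value is None:  # None: leave the server-side default in place
            continue
        if name in VAD_ONLY_FIELDS and strategy != "vad":
            continue
        if isinstance(value, bool):
            value = str(value).lower()  # encode booleans as "true" / "false"
        query[name] = value
    return urlencode(query)
```

Omitting a field entirely, instead of sending an empty value, is what lets the remote default apply.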

__init__(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)

Initialize the ElevenLabs Realtime STT service.

Parameters:

api_key – ElevenLabs API key for authentication.
base_url – Base URL for ElevenLabs WebSocket API.
commit_strategy – How to segment speech: CommitStrategy.MANUAL (Pipecat VAD) or CommitStrategy.VAD (ElevenLabs VAD). Defaults to CommitStrategy.MANUAL.
model – Model ID for transcription. Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(model=...) instead.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
include_timestamps – Whether to include word-level timestamps in transcripts.
enable_logging – Whether to enable logging on ElevenLabs’ side.
include_language_detection – Whether to include language detection in transcripts.
params – Configuration parameters for the STT service. Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to WebsocketSTTService.

can_generate_metrics() → bool

Check if the service can generate processing metrics.

Returns: True, as the ElevenLabs Realtime STT service supports metrics generation.

async start(frame: StartFrame)

Start the STT service and establish the WebSocket connection.

Parameters: frame – Frame indicating the service should start.

async stop(frame: EndFrame)

Stop the STT service and close the WebSocket connection.

Parameters: frame – Frame indicating the service should stop.

async cancel(frame: CancelFrame)

Cancel the STT service and close the WebSocket connection.

Parameters: frame – Frame indicating the service should be cancelled.

async process_frame(frame: Frame, direction: FrameDirection)

Process incoming frames and handle speech events.

Parameters:

frame – The frame to process.
direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text transcription.

Parameters: audio – Raw audio bytes to transcribe.
Yields: None. Transcription results are delivered through WebSocket responses rather than yielded here.
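
The split between sending audio and receiving results can be sketched with plain asyncio. In this illustration a queue stands in for the WebSocket, and SketchRealtimeSTT with its receive_loop is hypothetical: it shows the pattern of a generator that yields None while results arrive on a separate path, not the service's actual internals.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class SketchRealtimeSTT:
    """Hypothetical stand-in: audio goes out on one path, results arrive on another."""

    def __init__(self) -> None:
        self._outgoing = asyncio.Queue()  # stands in for the WebSocket send path
        self.transcripts: List[str] = []

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Hand the audio off and yield nothing useful to the caller.
        await self._outgoing.put(audio)
        yield None

    async def receive_loop(self, expected_messages: int) -> None:
        # A separate task would normally read server messages; here we fake one per chunk.
        for _ in range(expected_messages):
            chunk = await self._outgoing.get()
            self.transcripts.append(f"received {len(chunk)} bytes")


async def demo():
    stt = SketchRealtimeSTT()
    yielded = [item async for item in stt.run_stt(b"\x00" * 320)]
    await stt.receive_loop(expected_messages=1)
    return yielded, stt.transcripts
```

Running asyncio.run(demo()) shows that the generator produced only None, while the result surfaced through the receive path.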