stt

ElevenLabs speech-to-text service implementation.

This module integrates with ElevenLabs’ Speech-to-Text API. ElevenLabsSTTService transcribes buffered audio segments by uploading them as files and receiving the transcription in the response. ElevenLabsRealtimeSTTService streams audio over the Realtime WebSocket API.

pipecat.services.elevenlabs.stt.language_to_elevenlabs_language(language: Language) → str | None

Convert a Language enum to ElevenLabs language code.

Source:

https://elevenlabs.io/docs/capabilities/speech-to-text

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding ElevenLabs language code, or None if not supported.
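As a rough illustration of the lookup-with-None-fallback behaviour described above, here is a minimal sketch. The Language enum below is a small stand-in, and the codes in the table and the name to_language_code are hypothetical placeholders; the real mapping lives in the library.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for Pipecat's Language enum (hypothetical subset)."""

    EN = "en"
    FR = "fr"
    TLH = "tlh"  # deliberately missing from the table below


# Illustrative table only; these codes are assumptions, not the library's data.
_LANGUAGE_CODES = {
    Language.EN: "eng",
    Language.FR: "fra",
}


def to_language_code(language: Language) -> Optional[str]:
    """Return the mapped code, or None when the language is not in the table."""
    return _LANGUAGE_CODES.get(language)
```

Callers would typically check for None before sending a language code to the service.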

class pipecat.services.elevenlabs.stt.CommitStrategy(*values)

Bases: StrEnum

Commit strategies for transcript segmentation.

MANUAL = 'manual'
VAD = 'vad'

class pipecat.services.elevenlabs.stt.ElevenLabsSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, tag_audio_events: bool | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for ElevenLabsSTTService.

Parameters:

tag_audio_events – Whether to include audio events like (laughter), (coughing) in the transcription.

tag_audio_events: bool | None | _NotGiven

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, vad_silence_threshold_secs: float | None | _NotGiven = <factory>, vad_threshold: float | None | _NotGiven = <factory>, min_speech_duration_ms: int | None | _NotGiven = <factory>, min_silence_duration_ms: int | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for ElevenLabsRealtimeSTTService.

See ElevenLabsRealtimeSTTService.InputParams for detailed descriptions.

Parameters:
  • vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0).

  • vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive).

  • min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms).

  • min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms).
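The ranges listed above can be checked before a value is passed to the service. The following is a minimal sketch, not part of Pipecat: check_vad_settings and VAD_RANGES are hypothetical names, and treating the bounds as inclusive is an assumption.

```python
from typing import List, Optional

# (min, max) bounds copied from the parameter descriptions above.
# Inclusive bounds are an assumption.
VAD_RANGES = {
    "vad_silence_threshold_secs": (0.3, 3.0),
    "vad_threshold": (0.1, 0.9),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}


def check_vad_settings(**values: Optional[float]) -> List[str]:
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, value in values.items():
        if name not in VAD_RANGES:
            problems.append(f"{name}: unknown setting")
            continue
        if value is None:
            continue  # None is skipped: no value to validate
        low, high = VAD_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems
```

For example, check_vad_settings(vad_threshold=1.5) reports one problem, while check_vad_settings(vad_threshold=0.5) reports none.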

vad_silence_threshold_secs: float | None | _NotGiven
vad_threshold: float | None | _NotGiven
min_speech_duration_ms: int | None | _NotGiven
min_silence_duration_ms: int | None | _NotGiven

class pipecat.services.elevenlabs.stt.ElevenLabsSTTService(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)

Bases: SegmentedSTTService

Speech-to-text service using ElevenLabs’ file-based API.

This service uses ElevenLabs’ Speech-to-Text API to perform transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection. The service uploads audio files to ElevenLabs and receives transcription results directly.

Settings

alias of ElevenLabsSTTSettings

class InputParams(*, language: Language | None = None, tag_audio_events: bool = True)

Bases: BaseModel

Configuration parameters for ElevenLabs STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.

Parameters:
  • language – Target language for transcription.

  • tag_audio_events – Whether to include audio events such as (laughter) or (coughing) in the transcription.

language: Language | None
tag_audio_events: bool

__init__(*, api_key: str, aiohttp_session: ClientSession, base_url: str = 'https://api.elevenlabs.io', model: str | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: ElevenLabsSTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)

Initialize the ElevenLabs STT service.

Parameters:
  • api_key – ElevenLabs API key for authentication.

  • aiohttp_session – aiohttp ClientSession for HTTP requests.

  • base_url – Base URL for ElevenLabs API.

  • model

    Model ID for transcription.

    Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(model=...) instead.

  • sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.

  • params

    Configuration parameters for the STT service.

    Deprecated since version 0.0.105: Use settings=ElevenLabsSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to SegmentedSTTService.
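The precedence rule for settings (values in settings win over the deprecated parameters) can be pictured with the sketch below. It is not the library's implementation: NOT_GIVEN, DemoSettings, resolve_settings, and the model strings are hypothetical stand-ins used only to show the layering.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


class _NotGiven:
    """Sentinel type meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class DemoSettings:
    """Hypothetical settings object; every field defaults to NOT_GIVEN."""

    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN
    tag_audio_events: Any = NOT_GIVEN


def resolve_settings(
    defaults: DemoSettings,
    deprecated: Optional[DemoSettings],
    settings: Optional[DemoSettings],
) -> DemoSettings:
    """Apply layers in order, so later layers override earlier ones."""
    resolved = defaults
    for layer in (deprecated, settings):
        if layer is None:
            continue
        # Only fields the caller explicitly set are allowed to override.
        given = {
            f.name: getattr(layer, f.name)
            for f in fields(layer)
            if getattr(layer, f.name) is not NOT_GIVEN
        }
        resolved = replace(resolved, **given)
    return resolved
```

In this sketch a field set only through the deprecated layer still takes effect, while a field set in both layers resolves to the settings value.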

can_generate_metrics() → bool

Check if the service can generate processing metrics.

Returns:

True, as ElevenLabs STT service supports metrics generation.

language_to_service_language(language: Language) → str | None

Convert a Language enum to ElevenLabs service-specific language code.

Parameters:

language – The language to convert.

Returns:

The ElevenLabs-specific language code, or None if not supported.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Transcribe an audio segment using ElevenLabs’ STT API.

Parameters:

audio – Raw audio bytes in WAV format (already converted by base class).

Yields:

Frame – TranscriptionFrame containing the transcribed text, or ErrorFrame on failure.

Note

The audio is already in WAV format from the SegmentedSTTService. Only non-empty transcriptions are yielded.
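To exercise code against WAV-formatted input outside a pipeline, raw PCM can be wrapped in a WAV container with the standard library. This is a minimal sketch: pcm_to_wav is a hypothetical helper, and 16-bit samples are an assumption. Inside a pipeline the base class performs its own conversion.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in an in-memory WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    # wave does not close a file object it did not open, so the buffer is still readable.
    return buffer.getvalue()
```

The resulting bytes start with a RIFF/WAVE header followed by the original samples.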

pipecat.services.elevenlabs.stt.audio_format_from_sample_rate(sample_rate: int) → str

Get the appropriate audio format string for a given sample rate.

Parameters:

sample_rate – The audio sample rate in Hz.

Returns:

The ElevenLabs audio format string.
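This reference does not list the strings the function returns. Purely as an illustration of the idea, the sketch below assumes format strings of the form pcm_<rate>, a fixed set of supported rates, and a fallback to 16000 Hz for anything else. All three are assumptions, and sketch_audio_format is a hypothetical name, not the library function.

```python
# Assumed set of PCM sample rates; not confirmed by this reference.
_ASSUMED_PCM_RATES = (8000, 16000, 22050, 24000, 44100, 48000)

# Assumed fallback when the sample rate is not in the set above.
_ASSUMED_DEFAULT_RATE = 16000


def sketch_audio_format(sample_rate: int) -> str:
    """Map a sample rate in Hz to an assumed 'pcm_<rate>' format string."""
    if sample_rate in _ASSUMED_PCM_RATES:
        return f"pcm_{sample_rate}"
    return f"pcm_{_ASSUMED_DEFAULT_RATE}"
```

Consult the library source for the actual supported rates and fallback behaviour.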

class pipecat.services.elevenlabs.stt.ElevenLabsRealtimeSTTService(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)

Bases: WebsocketSTTService

Speech-to-text service using ElevenLabs’ Realtime WebSocket API.

This service uses ElevenLabs’ Realtime Speech-to-Text API to transcribe streamed audio with low latency. It supports both partial (interim) and committed (final) transcripts, and segment boundaries can be set either by manual commit control or by automatic Voice Activity Detection (VAD).

By default, the service uses the manual commit strategy, in which Pipecat’s VAD controls when transcript segments are committed. This keeps its behavior consistent with other STT services.

Settings

alias of ElevenLabsRealtimeSTTSettings

class InputParams(*, language_code: str | None = None, commit_strategy: CommitStrategy = CommitStrategy.MANUAL, vad_silence_threshold_secs: float | None = None, vad_threshold: float | None = None, min_speech_duration_ms: int | None = None, min_silence_duration_ms: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False)

Bases: BaseModel

Configuration parameters for ElevenLabs Realtime STT API.

Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.

Parameters:
  • language_code – ISO-639-1 or ISO-639-3 language code. Leave None for auto-detection.

  • commit_strategy – How to segment speech: manual (Pipecat VAD) or vad (ElevenLabs VAD).

  • vad_silence_threshold_secs – Seconds of silence before VAD commits (0.3-3.0). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • vad_threshold – VAD sensitivity (0.1-0.9, lower is more sensitive). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • min_speech_duration_ms – Minimum speech duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • min_silence_duration_ms – Minimum silence duration for VAD (50-2000ms). Only used when commit_strategy is VAD. None uses ElevenLabs default.

  • include_timestamps – Whether to include word-level timestamps in transcripts.

  • enable_logging – Whether to enable logging on ElevenLabs’ side.

  • include_language_detection – Whether to include language detection in transcripts.
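The rule that the VAD parameters apply only under the VAD commit strategy, and that None defers to the ElevenLabs default, can be sketched as follows. The enum below is a stand-in for CommitStrategy (written with str and Enum so it runs on Python versions without StrEnum), and effective_options is a hypothetical helper, not how the service builds its request.

```python
from enum import Enum
from typing import Any, Dict, Optional


class CommitStrategy(str, Enum):
    """Stand-in for the CommitStrategy enum documented on this page."""

    MANUAL = "manual"
    VAD = "vad"


def effective_options(
    commit_strategy: CommitStrategy,
    *,
    vad_silence_threshold_secs: Optional[float] = None,
    vad_threshold: Optional[float] = None,
    min_speech_duration_ms: Optional[int] = None,
    min_silence_duration_ms: Optional[int] = None,
) -> Dict[str, Any]:
    """Return only the options that would have an effect for this strategy."""
    options: Dict[str, Any] = {"commit_strategy": commit_strategy.value}
    if commit_strategy is CommitStrategy.VAD:
        vad_options = {
            "vad_silence_threshold_secs": vad_silence_threshold_secs,
            "vad_threshold": vad_threshold,
            "min_speech_duration_ms": min_speech_duration_ms,
            "min_silence_duration_ms": min_silence_duration_ms,
        }
        # None means "use the service default", so the key is left out entirely.
        options.update({k: v for k, v in vad_options.items() if v is not None})
    return options
```

With the manual strategy the VAD values are dropped, so passing them has no effect in this sketch.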

language_code: str | None
commit_strategy: CommitStrategy
vad_silence_threshold_secs: float | None
vad_threshold: float | None
min_speech_duration_ms: int | None
min_silence_duration_ms: int | None
include_timestamps: bool
enable_logging: bool
include_language_detection: bool

__init__(*, api_key: str, base_url: str = 'api.elevenlabs.io', commit_strategy: CommitStrategy = CommitStrategy.MANUAL, model: str | None = None, sample_rate: int | None = None, include_timestamps: bool = False, enable_logging: bool = False, include_language_detection: bool = False, params: InputParams | None = None, settings: ElevenLabsRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 0.41, **kwargs)

Initialize the ElevenLabs Realtime STT service.

Parameters:
  • api_key – ElevenLabs API key for authentication.

  • base_url – Base URL for ElevenLabs WebSocket API.

  • commit_strategy – How to segment speech: CommitStrategy.MANUAL (Pipecat VAD) or CommitStrategy.VAD (ElevenLabs VAD). Defaults to CommitStrategy.MANUAL.

  • model

    Model ID for transcription.

    Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(model=...) instead.

  • sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.

  • include_timestamps – Whether to include word-level timestamps in transcripts.

  • enable_logging – Whether to enable logging on ElevenLabs’ side.

  • include_language_detection – Whether to include language detection in transcripts.

  • params

    Configuration parameters for the STT service.

    Deprecated since version 0.0.105: Use settings=ElevenLabsRealtimeSTTService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to WebsocketSTTService.

can_generate_metrics() → bool

Check if the service can generate processing metrics.

Returns:

True, as ElevenLabs Realtime STT service supports metrics generation.

async start(frame: StartFrame)

Start the STT service and establish WebSocket connection.

Parameters:

frame – Frame indicating service should start.

async stop(frame: EndFrame)

Stop the STT service and close WebSocket connection.

Parameters:

frame – Frame indicating service should stop.

async cancel(frame: CancelFrame)

Cancel the STT service and close WebSocket connection.

Parameters:

frame – Frame indicating service should be cancelled.

async process_frame(frame: Frame, direction: FrameDirection)

Process incoming frames and handle speech events.

Parameters:
  • frame – The frame to process.

  • direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Process audio data for speech-to-text transcription.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

None - transcription results are handled via WebSocket responses.
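The pattern of a generator that forwards audio and yields None, with results arriving on a separate channel, can be sketched in isolation. DemoStreamingSTT is a hypothetical class that records the audio in a list in place of a WebSocket send. It only mirrors the shape of the method, not the service's internals.

```python
import asyncio
from typing import AsyncGenerator, List, Optional, Tuple


class DemoStreamingSTT:
    """Hypothetical stand-in that records audio in place of a WebSocket send."""

    def __init__(self) -> None:
        self.sent: List[bytes] = []

    async def _send(self, audio: bytes) -> None:
        # A real service would write to its WebSocket connection here.
        self.sent.append(audio)

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        await self._send(audio)
        # No transcript is produced here; results would arrive on a receive path.
        yield None


async def main() -> Tuple[List[bytes], List[Optional[str]]]:
    stt = DemoStreamingSTT()
    results = [item async for item in stt.run_stt(b"\x00\x01")]
    return stt.sent, results
```

Running asyncio.run(main()) returns the recorded audio chunk together with the single None that the generator yields.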