stt
Whisper speech-to-text services with locally-downloaded models.
This module implements Whisper transcription using locally-downloaded models, supporting both Faster Whisper and MLX Whisper backends for efficient inference.
- class pipecat.services.whisper.stt.Model(*values)[source]
Bases: Enum
Whisper model selection options for Faster Whisper.
Provides various model sizes and specializations for speech recognition, balancing quality and performance based on use case requirements.
- Parameters:
TINY – Smallest multilingual model, fastest inference.
BASE – Basic multilingual model, good speed/quality balance.
SMALL – Small multilingual model, better speed/quality balance than BASE.
MEDIUM – Medium-sized multilingual model, better quality.
LARGE – Best quality multilingual model, slower inference.
LARGE_V3_TURBO – Fast multilingual model, slightly lower quality than LARGE.
DISTIL_LARGE_V2 – Fast multilingual distilled model.
DISTIL_MEDIUM_EN – Fast English-only distilled model.
- TINY = 'tiny'
- BASE = 'base'
- SMALL = 'small'
- MEDIUM = 'medium'
- LARGE = 'large-v3'
- LARGE_V3_TURBO = 'deepdml/faster-whisper-large-v3-turbo-ct2'
- DISTIL_LARGE_V2 = 'Systran/faster-distil-whisper-large-v2'
- DISTIL_MEDIUM_EN = 'Systran/faster-distil-whisper-medium.en'
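Because the services accept either an enum member or a plain string for the model, the resolution step can be sketched as below. This is a minimal illustration, not pipecat's code: the `Model` class here is a local stand-in that copies three of the values listed above, and `resolve_model_name` is a hypothetical helper.

```python
from enum import Enum


class Model(Enum):
    """Local stand-in mirroring a subset of the values documented above."""

    TINY = "tiny"
    LARGE = "large-v3"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"


def resolve_model_name(model: "str | Model") -> str:
    """Return the string identifier for an enum member or a plain string.

    Hypothetical helper for illustration only.
    """
    return model.value if isinstance(model, Model) else model


print(resolve_model_name(Model.LARGE))
print(resolve_model_name("Systran/faster-distil-whisper-large-v2"))
```

Passing a plain string leaves room for model identifiers that the enum does not list.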
- class pipecat.services.whisper.stt.MLXModel(*values)[source]
Bases: Enum
MLX Whisper model selection options for Apple Silicon.
Provides various model sizes optimized for Apple Silicon hardware, including quantized variants for improved performance.
- Parameters:
TINY – Smallest multilingual model for MLX.
MEDIUM – Medium-sized multilingual model for MLX.
LARGE_V3 – Best quality multilingual model for MLX.
LARGE_V3_TURBO – Finetuned, pruned Whisper large-v3, much faster with slightly lower quality.
DISTIL_LARGE_V3 – Fast multilingual distilled model for MLX.
LARGE_V3_TURBO_Q4 – LARGE_V3_TURBO quantized to Q4 for reduced memory usage.
- TINY = 'mlx-community/whisper-tiny'
- MEDIUM = 'mlx-community/whisper-medium-mlx'
- LARGE_V3 = 'mlx-community/whisper-large-v3-mlx'
- LARGE_V3_TURBO = 'mlx-community/whisper-large-v3-turbo'
- DISTIL_LARGE_V3 = 'mlx-community/distil-whisper-large-v3'
- LARGE_V3_TURBO_Q4 = 'mlx-community/whisper-large-v3-turbo-q4'
- pipecat.services.whisper.stt.language_to_whisper_language(language: Language) → str | None[source]
Maps pipecat Language enum to Whisper language codes.
- Parameters:
language – A Language enum value representing the input language.
- Returns:
The corresponding Whisper language code, or None if not supported.
- Return type:
str or None
Note
Only includes languages officially supported by Whisper.
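The lookup described above (a Whisper code for supported languages, `None` otherwise) can be sketched with a dictionary. Everything here is assumed for illustration: the `Language` class is a small stand-in for pipecat's enum, the mapping table is a made-up subset, and `to_whisper_code` is a hypothetical name.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Hypothetical stand-in for pipecat's Language enum (subset only)."""

    EN = "en"
    EN_US = "en-US"
    FR = "fr"
    XX = "xx"  # placeholder for a language with no Whisper code


# Assumed mapping for illustration; the real table lives in pipecat.
_WHISPER_CODES = {
    Language.EN: "en",
    Language.EN_US: "en",  # assumption: regional variants share the base code
    Language.FR: "fr",
}


def to_whisper_code(language: Language) -> Optional[str]:
    """Return the Whisper language code, or None when unsupported."""
    return _WHISPER_CODES.get(language)


print(to_whisper_code(Language.EN_US))
print(to_whisper_code(Language.XX))
```

Callers should handle the `None` case explicitly rather than assume every `Language` value has a Whisper equivalent.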
- class pipecat.services.whisper.stt.WhisperSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, no_speech_prob: float | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for WhisperSTTService.
- Parameters:
no_speech_prob – Probability threshold for filtering non-speech segments.
- no_speech_prob: float | _NotGiven
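To illustrate what a `no_speech_prob` threshold does, the sketch below drops segments whose non-speech probability reaches the threshold and joins the rest. The `Segment` dataclass is a stand-in, `join_speech_segments` is a hypothetical helper, and the strict `<` comparison and space-joined output are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Stand-in for a transcription segment with a non-speech probability."""

    text: str
    no_speech_prob: float


def join_speech_segments(segments: Iterable[Segment], threshold: float) -> str:
    """Concatenate text from segments whose no_speech_prob is below threshold."""
    kept = [s.text.strip() for s in segments if s.no_speech_prob < threshold]
    return " ".join(kept)


segments = [
    Segment(" hello", 0.1),
    Segment(" [noise]", 0.9),  # likely non-speech, filtered out at 0.4
    Segment(" world", 0.2),
]
print(join_speech_segments(segments, threshold=0.4))
```

A lower threshold filters more aggressively, at the risk of discarding quiet speech.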
- class pipecat.services.whisper.stt.WhisperMLXSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, no_speech_prob: float | _NotGiven = <factory>, temperature: float | _NotGiven = <factory>, engine: str | _NotGiven = <factory>)[source]
Bases: STTSettings
Settings for WhisperMLXSTTSettings's service, WhisperSTTServiceMLX.
- Parameters:
no_speech_prob – Probability threshold for filtering non-speech segments.
temperature – Sampling temperature (0.0-1.0).
engine – Whisper engine identifier.
- no_speech_prob: float | _NotGiven
- temperature: float | _NotGiven
- engine: str | _NotGiven
- class pipecat.services.whisper.stt.WhisperSTTService(*, model: str | Model | None = None, device: str = 'auto', compute_type: str = 'default', no_speech_prob: float | None = None, language: Language | None = None, settings: WhisperSTTSettings | None = None, **kwargs)[source]
Bases: SegmentedSTTService
Class to transcribe audio with a locally-downloaded Whisper model.
This service uses Faster Whisper to perform speech-to-text transcription on audio segments. It supports multiple languages and various model sizes.
- Settings
alias of WhisperSTTSettings
- __init__(*, model: str | Model | None = None, device: str = 'auto', compute_type: str = 'default', no_speech_prob: float | None = None, language: Language | None = None, settings: WhisperSTTSettings | None = None, **kwargs)[source]
Initialize the Whisper STT service.
- Parameters:
model – The Whisper model to use for transcription. Can be a Model enum or string.
Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(model=...) instead.
device – The device to run inference on (‘cpu’, ‘cuda’, or ‘auto’). Defaults to "auto".
compute_type – The compute type for inference (‘default’, ‘int8’, ‘int8_float16’, etc.). Defaults to "default".
no_speech_prob – Probability threshold for filtering out non-speech segments.
Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(no_speech_prob=...) instead.
language – The default language for transcription.
Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(language=...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to SegmentedSTTService.
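The precedence rule stated above (values in `settings` win over deprecated keyword arguments) can be sketched as a small resolver. This is an illustration under assumptions, not pipecat's code: `resolve_option` is a hypothetical helper, the warning text is invented, and the default of 0.4 is a placeholder.

```python
import warnings
from typing import Optional


def resolve_option(
    name: str,
    deprecated_value: Optional[float],
    settings_value: Optional[float],
    default: float,
) -> float:
    """Pick a value: settings first, then the deprecated kwarg, then the default."""
    if deprecated_value is not None:
        # Tell callers to migrate even when the settings value ends up winning.
        warnings.warn(
            f"'{name}' is deprecated; pass it via settings instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    if settings_value is not None:
        return settings_value
    if deprecated_value is not None:
        return deprecated_value
    return default


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Both given: the settings value takes precedence.
    print(resolve_option("no_speech_prob", 0.3, 0.5, default=0.4))
```

New code can avoid the deprecated path entirely by using the form named in the deprecation notes, for example `WhisperSTTService(settings=WhisperSTTService.Settings(no_speech_prob=...))`.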
- can_generate_metrics() → bool[source]
Indicates whether this service can generate metrics.
- Returns:
True, as this service supports metric generation.
- Return type:
bool
- language_to_service_language(language: Language) → str | None[source]
Convert from pipecat Language to Whisper language code.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding Whisper language code, or None if not supported.
- Return type:
str or None
- async run_stt(audio: bytes) → AsyncGenerator[Frame, None][source]
Transcribe audio data using Whisper.
- Parameters:
audio – Raw audio bytes in 16-bit PCM format.
- Yields:
Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.
Note
The audio is expected to be 16-bit signed PCM data. The service will normalize it to float32 in the range [-1, 1].
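The normalization described in the note can be sketched with the standard library alone. The function name `pcm16_to_float32` is hypothetical, and little-endian input is an assumption (it is the usual layout for raw PCM); dividing by 32768 maps the int16 range onto [-1, 1).

```python
import sys
from array import array


def pcm16_to_float32(audio: bytes) -> array:
    """Convert little-endian 16-bit signed PCM bytes to float32 samples in [-1, 1]."""
    samples = array("h")  # signed 16-bit integers in native byte order
    samples.frombytes(audio)
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed little-endian
    # array("f") stores single-precision (float32) values.
    return array("f", (s / 32768.0 for s in samples))


pcm = (0).to_bytes(2, "little", signed=True) + (16384).to_bytes(2, "little", signed=True)
print(list(pcm16_to_float32(pcm)))
```

With NumPy the same conversion is commonly written as `np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0`.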
- class pipecat.services.whisper.stt.WhisperSTTServiceMLX(*, model: str | MLXModel | None = None, no_speech_prob: float | None = None, language: Language | None = None, temperature: float | None = None, settings: WhisperMLXSTTSettings | None = None, **kwargs)[source]
Bases: WhisperSTTService
Subclass of WhisperSTTService with MLX Whisper model support.
This service uses MLX Whisper to perform speech-to-text transcription on audio segments. It’s optimized for Apple Silicon and supports multiple languages and quantizations.
- Settings
alias of WhisperMLXSTTSettings
- __init__(*, model: str | MLXModel | None = None, no_speech_prob: float | None = None, language: Language | None = None, temperature: float | None = None, settings: WhisperMLXSTTSettings | None = None, **kwargs)[source]
Initialize the MLX Whisper STT service.
- Parameters:
model – The MLX Whisper model to use for transcription. Can be an MLXModel enum or string.
Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(model=...) instead.
no_speech_prob – Probability threshold for filtering out non-speech segments.
Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(no_speech_prob=...) instead.
language – The default language for transcription.
Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(language=...) instead.
temperature – Temperature for sampling. Can be a float or tuple of floats.
Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(temperature=...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to SegmentedSTTService.
- async run_stt(audio: bytes) → AsyncGenerator[Frame, None][source]
Transcribe audio data using MLX Whisper.
The audio is expected to be 16-bit signed PCM data. MLX Whisper will handle the conversion internally.
- Parameters:
audio – Raw audio bytes in 16-bit PCM format.
- Yields:
Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.
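The "yield a transcription frame or an error frame" contract can be sketched as an async generator. This is an illustration under assumptions: the two frame classes are simplified stand-ins for pipecat's types, `run_stt_sketch` and `collect` are hypothetical names, and running the blocking call through `asyncio.to_thread` is one plausible approach, not a statement about the library's internals.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, Callable, List, Union


@dataclass
class TranscriptionFrame:
    """Simplified stand-in; the real frame carries more fields."""

    text: str


@dataclass
class ErrorFrame:
    """Simplified stand-in for an error frame."""

    error: str


Frame = Union[TranscriptionFrame, ErrorFrame]


async def run_stt_sketch(
    audio: bytes, transcribe: Callable[[bytes], str]
) -> AsyncGenerator[Frame, None]:
    """Yield one TranscriptionFrame on success, or one ErrorFrame on failure."""
    try:
        # Run the blocking transcription call off the event loop.
        text = await asyncio.to_thread(transcribe, audio)
    except Exception as exc:
        yield ErrorFrame(error=f"Transcription failed: {exc}")
        return
    yield TranscriptionFrame(text=text)


async def collect(audio: bytes, transcribe: Callable[[bytes], str]) -> List[Frame]:
    """Drain the generator into a list, as a consumer might."""
    return [frame async for frame in run_stt_sketch(audio, transcribe)]


print(asyncio.run(collect(b"\x00\x00", lambda _audio: "hello")))
```

Reporting failures as frames rather than raised exceptions lets a downstream consumer treat errors like any other item in the stream.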