stt

Whisper speech-to-text services with locally-downloaded models.

This module implements Whisper transcription using locally-downloaded models, supporting both Faster Whisper and MLX Whisper backends for efficient inference.

class pipecat.services.whisper.stt.Model(*values)

Bases: Enum

Whisper model selection options for Faster Whisper.

Provides various model sizes and specializations for speech recognition, balancing quality and performance based on use case requirements.

Parameters:
  • TINY – Smallest multilingual model, fastest inference.

  • BASE – Basic multilingual model, good speed/quality balance.

  • SMALL – Small multilingual model, better speed/quality balance than BASE.

  • MEDIUM – Medium-sized multilingual model, better quality.

  • LARGE – Best quality multilingual model, slower inference.

  • LARGE_V3_TURBO – Fast multilingual model, slightly lower quality than LARGE.

  • DISTIL_LARGE_V2 – Fast multilingual distilled model.

  • DISTIL_MEDIUM_EN – Fast English-only distilled model.

TINY = 'tiny'
BASE = 'base'
SMALL = 'small'
MEDIUM = 'medium'
LARGE = 'large-v3'
LARGE_V3_TURBO = 'deepdml/faster-whisper-large-v3-turbo-ct2'
DISTIL_LARGE_V2 = 'Systran/faster-distil-whisper-large-v2'
DISTIL_MEDIUM_EN = 'Systran/faster-distil-whisper-medium.en'

class pipecat.services.whisper.stt.MLXModel(*values)

Bases: Enum

MLX Whisper model selection options for Apple Silicon.

Provides various model sizes optimized for Apple Silicon hardware, including quantized variants for improved performance.

Parameters:
  • TINY – Smallest multilingual model for MLX.

  • MEDIUM – Medium-sized multilingual model for MLX.

  • LARGE_V3 – Best quality multilingual model for MLX.

  • LARGE_V3_TURBO – Finetuned, pruned Whisper large-v3, much faster with slightly lower quality.

  • DISTIL_LARGE_V3 – Fast multilingual distilled model for MLX.

  • LARGE_V3_TURBO_Q4 – LARGE_V3_TURBO quantized to Q4 for reduced memory usage.

TINY = 'mlx-community/whisper-tiny'
MEDIUM = 'mlx-community/whisper-medium-mlx'
LARGE_V3 = 'mlx-community/whisper-large-v3-mlx'
LARGE_V3_TURBO = 'mlx-community/whisper-large-v3-turbo'
DISTIL_LARGE_V3 = 'mlx-community/distil-whisper-large-v3'
LARGE_V3_TURBO_Q4 = 'mlx-community/whisper-large-v3-turbo-q4'
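
Members of both enums carry a model identifier string as their value, and the service constructors documented further down accept either an enum member or a plain string. The sketch below shows how the two forms can reduce to the same kind of string. The enums here are stand-ins that copy a few of the values listed above, and `resolve_model_name` is a hypothetical helper, not a pipecat function.

```python
from enum import Enum
from typing import Union


class Model(Enum):
    """Stand-in copying three values from the Model listing above."""

    TINY = "tiny"
    LARGE = "large-v3"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"


class MLXModel(Enum):
    """Stand-in copying two values from the MLXModel listing above."""

    TINY = "mlx-community/whisper-tiny"
    LARGE_V3_TURBO_Q4 = "mlx-community/whisper-large-v3-turbo-q4"


def resolve_model_name(model: Union[str, Enum]) -> str:
    """Hypothetical helper: return the identifier for a member or a plain string."""
    return model.value if isinstance(model, Enum) else model


print(resolve_model_name(Model.LARGE))
print(resolve_model_name(MLXModel.LARGE_V3_TURBO_Q4))
print(resolve_model_name("my-org/custom-whisper"))  # made-up identifier
```

Passing a plain string is how a model outside these enums would be selected; whether a given identifier actually loads depends on the backend.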
pipecat.services.whisper.stt.language_to_whisper_language(language: Language) → str | None

Maps pipecat Language enum to Whisper language codes.

Parameters:

language – A Language enum value representing the input language.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

Note

Only includes languages officially supported by Whisper.
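
Because the function returns None for unsupported languages, callers need to handle that case. The sketch below illustrates the contract with a dictionary lookup. The `Language` enum, its members, and the lookup table are stand-ins invented for illustration; the real table inside pipecat is not reproduced here.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Stand-in for pipecat's Language enum; members are illustrative only."""

    EN = "en"
    FR = "fr"
    XX = "xx"  # placeholder for a language with no Whisper code


# Hypothetical lookup table covering only the languages this sketch "supports".
_WHISPER_CODES = {
    Language.EN: "en",
    Language.FR: "fr",
}


def to_whisper_code(language: Language) -> Optional[str]:
    """Return the Whisper code for `language`, or None when it has no entry."""
    return _WHISPER_CODES.get(language)


for lang in (Language.FR, Language.XX):
    code = to_whisper_code(lang)
    if code is None:
        print(f"{lang.name}: not supported")
    else:
        print(f"{lang.name}: {code}")
```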

class pipecat.services.whisper.stt.WhisperSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, no_speech_prob: float | _NotGiven = <factory>)

Bases: STTSettings

Settings for WhisperSTTService.

Parameters:

no_speech_prob – Probability threshold for filtering non-speech segments.
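
A minimal sketch of what threshold filtering can look like. It assumes each decoded segment exposes `text` and `no_speech_prob`, as Faster Whisper segments do; the `Segment` class is a stand-in, and the strict `<` comparison and the 0.4 threshold are assumptions for illustration rather than pipecat's confirmed behavior.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Stand-in for a decoded segment with the two fields the filter needs."""

    text: str
    no_speech_prob: float


def join_speech_segments(segments: Iterable[Segment], threshold: float) -> str:
    """Keep segments whose no_speech_prob is below `threshold` and join their text."""
    kept = [s.text.strip() for s in segments if s.no_speech_prob < threshold]
    return " ".join(kept)


segments = [
    Segment("Hello there.", 0.02),
    Segment("[background hum]", 0.91),  # likely not speech, dropped
    Segment("How are you?", 0.10),
]
print(join_speech_segments(segments, threshold=0.4))  # Hello there. How are you?
```

A lower threshold drops more segments, which reduces spurious text from silence or noise at the risk of discarding quiet speech.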

no_speech_prob: float | _NotGiven

class pipecat.services.whisper.stt.WhisperMLXSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, no_speech_prob: float | _NotGiven = <factory>, temperature: float | _NotGiven = <factory>, engine: str | _NotGiven = <factory>)

Bases: STTSettings

Settings for WhisperMLXSTTService.

Parameters:
  • no_speech_prob – Probability threshold for filtering non-speech segments.

  • temperature – Sampling temperature (0.0-1.0).

  • engine – Whisper engine identifier.

no_speech_prob: float | _NotGiven
temperature: float | _NotGiven
engine: str | _NotGiven

class pipecat.services.whisper.stt.WhisperSTTService(*, model: str | Model | None = None, device: str = 'auto', compute_type: str = 'default', no_speech_prob: float | None = None, language: Language | None = None, settings: WhisperSTTSettings | None = None, **kwargs)

Bases: SegmentedSTTService

Class to transcribe audio with a locally-downloaded Whisper model.

This service uses Faster Whisper to perform speech-to-text transcription on audio segments. It supports multiple languages and various model sizes.

Settings

alias of WhisperSTTSettings

__init__(*, model: str | Model | None = None, device: str = 'auto', compute_type: str = 'default', no_speech_prob: float | None = None, language: Language | None = None, settings: WhisperSTTSettings | None = None, **kwargs)

Initialize the Whisper STT service.

Parameters:
  • model

    The Whisper model to use for transcription. Can be a Model enum or string.

    Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(model=...) instead.

  • device – The device to run inference on ('cpu', 'cuda', or 'auto'). Defaults to 'auto'.

  • compute_type – The compute type for inference ('default', 'int8', 'int8_float16', etc.). Defaults to 'default'.

  • no_speech_prob

    Probability threshold for filtering out non-speech segments.

    Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(no_speech_prob=...) instead.

  • language

    The default language for transcription.

    Deprecated since version 0.0.105: Use settings=WhisperSTTService.Settings(language=...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional arguments passed to SegmentedSTTService.
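
The precedence rule for `settings` can be pictured as a two-step merge: start from the deprecated keyword arguments, then let every field that was explicitly given in `settings` override them. The sketch below is one way to express that rule under stated assumptions. The sentinel, the `Settings` dataclass, and `merge` are all stand-ins written for this example and do not mirror pipecat's internals.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


class _NotGiven:
    """Stand-in sentinel type marking a field the caller left unset."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class Settings:
    """Stand-in settings object; every field defaults to NOT_GIVEN."""

    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN
    no_speech_prob: Any = NOT_GIVEN


def merge(deprecated: Dict[str, Any], settings: Settings) -> Dict[str, Any]:
    """Apply deprecated keyword values first, then override with given settings fields."""
    merged = {k: v for k, v in deprecated.items() if v is not None}
    for f in fields(settings):
        value = getattr(settings, f.name)
        if not isinstance(value, _NotGiven):
            merged[f.name] = value
    return merged


result = merge(
    deprecated={"model": "base", "no_speech_prob": 0.4, "language": None},
    settings=Settings(model="large-v3"),
)
print(result)  # {'model': 'large-v3', 'no_speech_prob': 0.4}
```

Here `model` comes from `settings` because it was given there, while `no_speech_prob` falls back to the deprecated argument because `settings` left it unset.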

can_generate_metrics() → bool

Indicates whether this service can generate metrics.

Returns:

True, as this service supports metric generation.

Return type:

bool

language_to_service_language(language: Language) → str | None

Convert from pipecat Language to Whisper language code.

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

async run_stt(audio: bytes) → AsyncGenerator[Frame, None]

Transcribe audio data using Whisper.

Parameters:

audio – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.

Note

The audio is expected to be 16-bit signed PCM data. The service will normalize it to float32 in the range [-1, 1].
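
The normalization described in the note can be sketched with the standard library alone. This example assumes little-endian samples and a divisor of 32768, which maps the int16 range onto [-1, 1); both are assumptions for illustration, and `pcm16_to_float32` is a hypothetical helper rather than the service's own code.

```python
import array
import struct
import sys


def pcm16_to_float32(audio: bytes) -> array.array:
    """Convert 16-bit signed PCM bytes to float32 samples in [-1, 1)."""
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(audio)    # raises ValueError on an odd byte count
    if sys.byteorder == "big":
        samples.byteswap()      # input is assumed to be little-endian
    return array.array("f", (s / 32768.0 for s in samples))


# Three samples: minimum value, silence, and half of full scale.
raw = struct.pack("<3h", -32768, 0, 16384)
print(list(pcm16_to_float32(raw)))  # [-1.0, 0.0, 0.5]
```

With NumPy available, the same conversion is commonly written as `np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0`.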

class pipecat.services.whisper.stt.WhisperSTTServiceMLX(*, model: str | MLXModel | None = None, no_speech_prob: float | None = None, language: Language | None = None, temperature: float | None = None, settings: WhisperMLXSTTSettings | None = None, **kwargs)

Bases: WhisperSTTService

Subclass of WhisperSTTService with MLX Whisper model support.

This service uses MLX Whisper to perform speech-to-text transcription on audio segments. It’s optimized for Apple Silicon and supports multiple languages and quantizations.

Settings

alias of WhisperMLXSTTSettings

__init__(*, model: str | MLXModel | None = None, no_speech_prob: float | None = None, language: Language | None = None, temperature: float | None = None, settings: WhisperMLXSTTSettings | None = None, **kwargs)

Initialize the MLX Whisper STT service.

Parameters:
  • model

    The MLX Whisper model to use for transcription. Can be an MLXModel enum or string.

    Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(model=...) instead.

  • no_speech_prob

    Probability threshold for filtering out non-speech segments.

    Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(no_speech_prob=...) instead.

  • language

    The default language for transcription.

    Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(language=...) instead.

  • temperature

    Temperature for sampling. Can be a float or tuple of floats.

    Deprecated since version 0.0.105: Use settings=WhisperSTTServiceMLX.Settings(temperature=...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional arguments passed to SegmentedSTTService.

async run_stt(audio: bytes) → AsyncGenerator[Frame, None]

Transcribe audio data using MLX Whisper.

The audio is expected to be 16-bit signed PCM data. MLX Whisper will handle the conversion internally.

Parameters:

audio – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.