stt

Speechmatics STT service integration.

class pipecat.services.speechmatics.stt.TurnDetectionMode(*values)[source]

Bases: StrEnum

Endpoint and turn detection handling mode.

Controls how the end of speech (endpointing) is detected. When using Pipecat’s built-in endpointing, use TurnDetectionMode.EXTERNAL (the default).

To use the STT engine’s built-in endpointing instead, use TurnDetectionMode.ADAPTIVE for simple voice activity detection or TurnDetectionMode.SMART_TURN for more advanced ML-based endpointing.

FIXED = 'fixed'

EXTERNAL = 'external'

ADAPTIVE = 'adaptive'

SMART_TURN = 'smart_turn'
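Because the members carry plain string values, a mode read from a configuration file or environment variable can be looked up by value. The sketch below is illustrative only: it defines a local stand-in enum that mirrors the values listed above instead of importing Pipecat, it subclasses str and Enum rather than StrEnum so that it also runs on Python versions before 3.11, and mode_from_config is a hypothetical helper, not part of the library.

```python
from enum import Enum
from typing import Optional


class TurnDetectionMode(str, Enum):
    """Local stand-in mirroring the documented member values."""

    FIXED = "fixed"
    EXTERNAL = "external"
    ADAPTIVE = "adaptive"
    SMART_TURN = "smart_turn"


def mode_from_config(raw: Optional[str]) -> TurnDetectionMode:
    """Hypothetical helper: map a config string to a mode, defaulting to EXTERNAL."""
    if raw is None or not raw.strip():
        return TurnDetectionMode.EXTERNAL
    # Lookup by value; an unknown string raises ValueError.
    return TurnDetectionMode(raw.strip().lower())


mode = mode_from_config("smart_turn")
print(mode.value)
```

Falling back to EXTERNAL for a missing value matches the documented default; rejecting unknown strings with ValueError is a design choice of this sketch.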

class pipecat.services.speechmatics.stt.SpeechmaticsSTTSettings

Bases: STTSettings

Settings for SpeechmaticsSTTService.

See SpeechmaticsSTTService.InputParams for detailed descriptions of each field.

Parameters:

domain – Domain for Speechmatics API.
turn_detection_mode – Endpoint handling mode.
speaker_active_format – Formatter for active speaker ID.
speaker_passive_format – Formatter for passive speaker ID.
focus_speakers – List of speaker IDs to focus on.
ignore_speakers – List of speaker IDs to ignore.
focus_mode – Speaker focus mode for diarization.
known_speakers – List of known speaker labels and identifiers.
additional_vocab – List of additional vocabulary entries.
operating_point – Operating point for accuracy vs. latency.
max_delay – Maximum delay in seconds for transcription.
end_of_utterance_silence_trigger – Maximum delay for end of utterance trigger.
end_of_utterance_max_delay – Maximum delay for end of utterance.
punctuation_overrides – Punctuation overrides.
include_partials – Include partial segment fragments.
split_sentences – Emit finalized sentences mid-turn.
enable_diarization – Enable speaker diarization.
speaker_sensitivity – Diarization sensitivity.
max_speakers – Maximum number of speakers to detect.
prefer_current_speaker – Prefer current speaker ID.
extra_params – Extra parameters for the STT engine.

domain: str | None | _NotGiven

turn_detection_mode: TurnDetectionMode | _NotGiven

speaker_active_format: str | _NotGiven

speaker_passive_format: str | _NotGiven

focus_speakers: list[str] | _NotGiven

ignore_speakers: list[str] | _NotGiven

focus_mode: SpeakerFocusMode | _NotGiven

known_speakers: list[SpeakerIdentifier] | _NotGiven

additional_vocab: list[AdditionalVocabEntry] | _NotGiven

operating_point: OperatingPoint | None | _NotGiven

max_delay: float | None | _NotGiven

end_of_utterance_silence_trigger: float | None | _NotGiven

end_of_utterance_max_delay: float | None | _NotGiven

punctuation_overrides: dict[str, Any] | None | _NotGiven

include_partials: bool | None | _NotGiven

split_sentences: bool | None | _NotGiven

enable_diarization: bool | None | _NotGiven

speaker_sensitivity: float | None | _NotGiven

max_speakers: int | None | _NotGiven

prefer_current_speaker: bool | None | _NotGiven

extra_params: dict[str, Any] | None | _NotGiven
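Each field type above includes _NotGiven, which suggests a sentinel that distinguishes "field not supplied" from an explicit None. How Pipecat implements this is not described here, so the following is only a sketch of the general pattern: _NotGivenType, NOT_GIVEN and apply_delta are stand-in names invented for illustration.

```python
class _NotGivenType:
    """Stand-in sentinel type; not Pipecat's actual _NotGiven."""

    def __repr__(self):
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


def apply_delta(current: dict, delta: dict) -> dict:
    """Return a copy of `current` with every supplied field of `delta` applied."""
    merged = dict(current)
    for name, value in delta.items():
        if value is NOT_GIVEN:
            continue  # field was not supplied: keep the current value
        merged[name] = value  # an explicit None is applied like any other value
    return merged


current = {"max_delay": 1.0, "max_speakers": 2}
delta = {"max_delay": None, "max_speakers": NOT_GIVEN}
updated = apply_delta(current, delta)
print(updated)
```

With a sentinel, an update can reset max_delay to None while leaving max_speakers untouched, which a plain None default could not express.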

HOT_FIELDS: ClassVar[frozenset[str]] = frozenset({'focus_mode', 'focus_speakers', 'ignore_speakers'}): Fields that can be updated on a live connection via the Speechmatics diarization-config API — no reconnect needed.

LOCAL_FIELDS: ClassVar[frozenset[str]] = frozenset({'speaker_active_format', 'speaker_passive_format'}): Fields that are purely local (formatting templates) — no reconnect and no API call needed.
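The two sets above can be read as a small decision table for settings updates. The sketch below copies the documented set contents into local constants and classifies a set of changed field names; required_action is a hypothetical helper, and treating every other field as requiring a reconnect is an assumption inferred from the wording above, not confirmed behaviour.

```python
HOT_FIELDS = frozenset({"focus_mode", "focus_speakers", "ignore_speakers"})
LOCAL_FIELDS = frozenset({"speaker_active_format", "speaker_passive_format"})


def required_action(changed_fields):
    """Return the heaviest action needed to apply the changed fields.

    Assumption: any field outside HOT_FIELDS and LOCAL_FIELDS needs a reconnect.
    """
    changed = set(changed_fields)
    if not changed:
        return "none"
    if changed - HOT_FIELDS - LOCAL_FIELDS:
        return "reconnect"
    if changed & HOT_FIELDS:
        return "api_update"  # live diarization-config update, no reconnect
    return "local_only"  # formatting templates only, no API call


print(required_action({"focus_speakers", "speaker_active_format"}))
```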

class pipecat.services.speechmatics.stt.SpeechmaticsSTTService(*, api_key: str | None = None, base_url: str | None = None, sample_rate: int | None = None, encoding: AudioEncoding = AudioEncoding.PCM_S16LE, params: InputParams | None = None, should_interrupt: bool = True, settings: SpeechmaticsSTTSettings | None = None, ttfs_p99_latency: float | None = 0.74, **kwargs)[source]

Bases: STTService

Speechmatics STT service implementation.

This service provides real-time speech-to-text transcription using the Speechmatics API. It supports partial and final transcriptions, multiple languages, various audio formats, and speaker diarization.

Event handlers available (in addition to STTService events):

on_speakers_result(service, speakers): Speaker diarization results received

Example:

@stt.event_handler("on_speakers_result")
async def on_speakers_result(service, speakers):
    ...

Settings: alias of SpeechmaticsSTTSettings

class TurnDetectionMode(*values)

Bases: StrEnum

Endpoint and turn detection handling mode.

Controls how the end of speech (endpointing) is detected. When using Pipecat’s built-in endpointing, use TurnDetectionMode.EXTERNAL (the default).

To use the STT engine’s built-in endpointing instead, use TurnDetectionMode.ADAPTIVE for simple voice activity detection or TurnDetectionMode.SMART_TURN for more advanced ML-based endpointing.

FIXED = 'fixed'

EXTERNAL = 'external'

ADAPTIVE = 'adaptive'

SMART_TURN = 'smart_turn'

class AudioEncoding(*values)

Bases: str, Enum

Supported audio encoding formats for real-time transcription.

The Speechmatics RT API supports several audio encoding formats for optimal compatibility with different audio sources and quality requirements.

PCM_F32LE: 32-bit float PCM used in the WAV audio format, little-endian architecture. 4 bytes per sample.

PCM_S16LE: 16-bit signed integer PCM used in the WAV audio format, little-endian architecture. 2 bytes per sample.

MULAW: 8 bit μ-law (mu-law) encoding. 1 byte per sample.

Examples

>>> encoding = AudioEncoding.PCM_S16LE

PCM_F32LE = 'pcm_f32le'

PCM_S16LE = 'pcm_s16le'

MULAW = 'mulaw'
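The bytes-per-sample figures above determine how large an audio chunk is for a given duration. A minimal sketch, assuming mono audio by default and using a local stand-in enum with the documented values; chunk_size_bytes is a hypothetical helper, not a library function.

```python
from enum import Enum


class AudioEncoding(str, Enum):
    """Local stand-in mirroring the documented member values."""

    PCM_F32LE = "pcm_f32le"
    PCM_S16LE = "pcm_s16le"
    MULAW = "mulaw"


# Bytes per sample as stated in the encoding descriptions.
BYTES_PER_SAMPLE = {
    AudioEncoding.PCM_F32LE: 4,
    AudioEncoding.PCM_S16LE: 2,
    AudioEncoding.MULAW: 1,
}


def chunk_size_bytes(encoding, sample_rate, duration_ms, channels=1):
    """Size in bytes of `duration_ms` of audio at `sample_rate` Hz."""
    samples = sample_rate * duration_ms // 1000
    return samples * channels * BYTES_PER_SAMPLE[encoding]


# 20 ms of 16 kHz mono audio in the default encoding.
print(chunk_size_bytes(AudioEncoding.PCM_S16LE, 16000, 20))
```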

class OperatingPoint(*values)

Bases: str, Enum

Operating point options for transcription.

ENHANCED = 'enhanced'

STANDARD = 'standard'

class SpeakerFocusMode(*values)

Bases: str, Enum

Speaker focus mode for diarization.

RETAIN: Retain words spoken by other speakers (not listed in ignore_speakers) and process them as passive speaker frames.
IGNORE: Ignore words spoken by other speakers; they are not processed.

Examples

Retain all speakers but mark focus:

>>> config = SpeakerFocusConfig(
...     focus_speakers=["S1"],
...     focus_mode=SpeakerFocusMode.RETAIN
... )

Ignore non-focus speakers completely:

>>> config = SpeakerFocusConfig(
...     focus_speakers=["S1", "S2"],
...     focus_mode=SpeakerFocusMode.IGNORE
... )

RETAIN = 'retain'

IGNORE = 'ignore'

class SpeakerFocusConfig(*, focus_speakers: list[str] = <factory>, ignore_speakers: list[str] = <factory>, focus_mode: SpeakerFocusMode = SpeakerFocusMode.RETAIN)

Bases: BaseModel

Speaker Focus Config.

Specifies which speakers to focus on, which to ignore, and how to handle speakers that are not in focus. These settings can be changed during a session; other changes may require a new session.

Parameters:

focus_speakers – List of speaker IDs to focus on. When enabled, only these speakers are emitted as finalized frames and other speakers are considered passive. Words from other speakers are still processed, but only emitted when a focussed speaker has also said new words. A list of labels (e.g. S1, S2) or identifiers of known speakers (e.g. speaker_1, speaker_2) can be used. Defaults to [].
ignore_speakers – List of speaker IDs to ignore. When enabled, these speakers are excluded from the transcription and their words are not processed. Their speech will not trigger any VAD or end of utterance detection. By default, any speaker with a label starting and ending with double underscores will be excluded (e.g. __ASSISTANT__). Defaults to [].
focus_mode – Speaker focus mode for diarization. When set to SpeakerFocusMode.RETAIN, the STT engine will retain words spoken by other speakers (not listed in ignore_speakers) and process them as passive speaker frames. When set to SpeakerFocusMode.IGNORE, the STT engine will ignore words spoken by other speakers and they will not be processed. Defaults to SpeakerFocusMode.RETAIN.

focus_speakers: list[str]

ignore_speakers: list[str]

focus_mode: SpeakerFocusMode
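The interaction of focus_speakers, ignore_speakers and focus_mode can be summarised as a per-speaker decision. The sketch below is one reading of the descriptions above, not the engine's implementation: classify_speaker and its return labels are hypothetical, and treating an empty focus_speakers list as "every speaker is active" is an assumption.

```python
from enum import Enum


class SpeakerFocusMode(str, Enum):
    """Local stand-in mirroring the documented member values."""

    RETAIN = "retain"
    IGNORE = "ignore"


def classify_speaker(speaker_id, focus_speakers, ignore_speakers, focus_mode):
    """Return 'ignored', 'active', 'passive' or 'dropped' for one speaker."""
    # Labels wrapped in double underscores (e.g. __ASSISTANT__) are excluded by default.
    wrapped = (
        len(speaker_id) > 4
        and speaker_id.startswith("__")
        and speaker_id.endswith("__")
    )
    if wrapped or speaker_id in ignore_speakers:
        return "ignored"
    # Assumption: an empty focus list means focus is not enabled.
    if not focus_speakers or speaker_id in focus_speakers:
        return "active"
    if SpeakerFocusMode(focus_mode) is SpeakerFocusMode.RETAIN:
        return "passive"  # kept and processed as passive speaker frames
    return "dropped"  # words from non-focus speakers are not processed


for speaker in ("S1", "S2", "__ASSISTANT__"):
    print(speaker, classify_speaker(speaker, ["S1"], [], SpeakerFocusMode.RETAIN))
```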

class SpeakerIdentifier(label: str = '', speaker_identifiers: list[str] = <factory>)

Bases: object

Labeled speaker identifier for guided speaker diarization.

Use this to map one or more known speaker identifiers to a human-readable label. When provided in SpeakerDiarizationConfig.speakers, the engine can use these identifiers as hints to consistently assign the specified label.

label

Human-readable label to assign to this speaker or group (e.g., “Agent”, “Customer”, “Alice”).

Type: str

speaker_identifiers

A list of string identifiers associated with this speaker. These can be any stable identifiers relevant to your application (for example device IDs, prior session speaker IDs, channel tags, etc.).

Type: list[str]

Examples

>>> config = SpeakerDiarizationConfig(
...     max_speakers=2,
...     speakers=[
...         SpeakerIdentifier(label="Agent", speaker_identifiers=["agent_1"]),
...         SpeakerIdentifier(label="Customer", speaker_identifiers=["cust_1"]),
...     ],
... )

label: str = ''

speaker_identifiers: list[str]

class AdditionalVocabEntry(*, content: str, sounds_like: list[str] | None = None)

Bases: BaseModel

Additional vocabulary entry.

Parameters:

content – The word to add to the dictionary.
sounds_like – Words or phrases that sound like the word. Defaults to None.

Examples

Adding a brand name:

>>> vocab = AdditionalVocabEntry(
...     content="Speechmatics",
...     sounds_like=["speech mattics", "speech matics"]
... )

Adding technical terms:

>>> vocab_list = [
...     AdditionalVocabEntry(content="API", sounds_like=["A P I"]),
...     AdditionalVocabEntry(content="WebSocket", sounds_like=["web socket"])
... ]
>>> config = VoiceAgentConfig(
...     language="en",
...     additional_vocab=vocab_list
... )

content: str

sounds_like: list[str] | None

class InputParams(*, domain: str | None = None, language: Language | str = Language.EN, turn_detection_mode: TurnDetectionMode = TurnDetectionMode.EXTERNAL, speaker_active_format: str | None = None, speaker_passive_format: str | None = None, focus_speakers: list[str] = [], ignore_speakers: list[str] = [], focus_mode: SpeakerFocusMode = SpeakerFocusMode.RETAIN, known_speakers: list[SpeakerIdentifier] = [], additional_vocab: list[AdditionalVocabEntry] = [], audio_encoding: AudioEncoding = AudioEncoding.PCM_S16LE, operating_point: OperatingPoint | None = None, max_delay: float | None = None, end_of_utterance_silence_trigger: float | None = None, end_of_utterance_max_delay: float | None = None, punctuation_overrides: dict | None = None, include_partials: bool | None = None, split_sentences: bool | None = None, enable_diarization: bool | None = None, speaker_sensitivity: float | None = None, max_speakers: int | None = None, prefer_current_speaker: bool | None = None, extra_params: dict | None = None)[source]

Bases: BaseModel

Configuration parameters for Speechmatics STT service.

Parameters:

domain – Domain for Speechmatics API. Defaults to None.
language – Language code for transcription. Defaults to Language.EN.
turn_detection_mode – Endpoint handling, one of TurnDetectionMode.FIXED, TurnDetectionMode.EXTERNAL, TurnDetectionMode.ADAPTIVE and TurnDetectionMode.SMART_TURN. Defaults to TurnDetectionMode.EXTERNAL.
speaker_active_format – Formatter for active speaker ID. This formatter is used to format the text output for individual speakers and ensures that the context is clear for language models further down the pipeline. The attributes text and speaker_id are available. The system instructions for the language model may need to include any necessary instructions to handle the formatting. Example: @{speaker_id}: {text}. Defaults to None.
speaker_passive_format – Formatter for passive speaker ID. As with the speaker_active_format, the attributes text and speaker_id are available. Example: @{speaker_id} [background]: {text}. Defaults to None.
focus_speakers – List of speaker IDs to focus on. When enabled, only these speakers are emitted as finalized frames and other speakers are considered passive. Words from other speakers are still processed, but only emitted when a focussed speaker has also said new words. A list of labels (e.g. S1, S2) or identifiers of known speakers (e.g. speaker_1, speaker_2) can be used. Defaults to [].
ignore_speakers – List of speaker IDs to ignore. When enabled, these speakers are excluded from the transcription and their words are not processed. Their speech will not trigger any VAD or end of utterance detection. By default, any speaker with a label starting and ending with double underscores will be excluded (e.g. __ASSISTANT__). Defaults to [].
focus_mode – Speaker focus mode for diarization. When set to SpeakerFocusMode.RETAIN, the STT engine will retain words spoken by other speakers (not listed in ignore_speakers) and process them as passive speaker frames. When set to SpeakerFocusMode.IGNORE, the STT engine will ignore words spoken by other speakers and they will not be processed. Defaults to SpeakerFocusMode.RETAIN.
known_speakers – List of known speaker labels and identifiers. If you supply a list of labels and identifiers for speakers, then the STT engine will use them to attribute any spoken words to that speaker. This is useful when you want to attribute words to a specific speaker, such as the assistant or a specific user. Labels and identifiers can be obtained from a running STT session and then used in subsequent sessions. Identifiers are unique to each Speechmatics account and cannot be used across accounts. Refer to our examples on the format of the known_speakers parameter. Defaults to [].
additional_vocab – List of additional vocabulary entries. If you supply a list of additional vocabulary entries, this will increase the weight of the words in the vocabulary and help the STT engine to better transcribe the words. Defaults to [].
audio_encoding – Audio encoding format. Defaults to AudioEncoding.PCM_S16LE.
operating_point – Operating point for transcription accuracy vs. latency tradeoff. It is recommended to use OperatingPoint.ENHANCED for most use cases. Defaults to enhanced.
max_delay – Maximum delay in seconds for transcription. This forces the STT engine to speed up the processing of transcribed words and reduces the interval between partial and final results. Lower values can have an impact on accuracy.
end_of_utterance_silence_trigger – Maximum delay in seconds for end of utterance trigger. The delay is used to wait for any further transcribed words before emitting the final word frames. The value must be lower than max_delay.
end_of_utterance_max_delay – Maximum delay in seconds for end of utterance. The delay is used to wait for any further transcribed words before emitting the final word frames. The value must be greater than end_of_utterance_silence_trigger.
punctuation_overrides – Punctuation overrides. This allows you to override the punctuation in the STT engine. This is useful for languages that use different punctuation than English. See documentation for more information.
include_partials – Include partial segment fragments (words) in the output of AddPartialSegment messages. Partial fragments from the STT will always be used for speaker activity detection. This setting is used only for the formatted text output of individual segments.
split_sentences – Emit finalized sentences mid-turn. When enabled, as soon as a sentence is finalized, it will be emitted as a final segment. This is useful for applications that need to process sentences as they are finalized. Defaults to False.
enable_diarization – Enable speaker diarization. When enabled, the STT engine will determine and attribute words to unique speakers. The speaker_sensitivity parameter can be used to adjust the sensitivity of diarization.
speaker_sensitivity – Diarization sensitivity. A higher value increases the sensitivity of diarization and helps when two or more speakers have similar voices.
max_speakers – Maximum number of speakers to detect. This forces the STT engine to cluster words into a fixed number of speakers. It should not be used to limit the number of speakers, unless it is clear that there will only be a known number of speakers.
prefer_current_speaker – Prefer current speaker ID. When set to true, groups of words close together are given extra weight to be identified as the same speaker.
extra_params – Extra parameters to pass to the STT engine. This is a dictionary of additional parameters that can be used to configure the STT engine. Defaults to None.
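The speaker_active_format and speaker_passive_format templates described above expose the attributes text and speaker_id. A minimal sketch of how such a template could be rendered, assuming Python str.format-style substitution; format_segment is a hypothetical helper, and returning the bare text when no template is set is an assumption.

```python
def format_segment(template, speaker_id, text):
    """Render one transcript segment with the given template."""
    if template is None:
        return text  # assumption: no template means unformatted text
    return template.format(speaker_id=speaker_id, text=text)


active_format = "@{speaker_id}: {text}"
passive_format = "@{speaker_id} [background]: {text}"

print(format_segment(active_format, "S1", "Book a table for two."))
print(format_segment(passive_format, "S2", "Is that the door?"))
```

If templates like these are used, the language model's system instructions should explain the @speaker prefix and the [background] marker so that it can tell speakers apart.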

domain: str | None

language: Language | str

turn_detection_mode: TurnDetectionMode

speaker_active_format: str | None

speaker_passive_format: str | None

focus_speakers: list[str]

ignore_speakers: list[str]

focus_mode: SpeakerFocusMode

known_speakers: list[SpeakerIdentifier]

additional_vocab: list[AdditionalVocabEntry]

audio_encoding: AudioEncoding

operating_point: OperatingPoint | None

max_delay: float | None

end_of_utterance_silence_trigger: float | None

end_of_utterance_max_delay: float | None

punctuation_overrides: dict | None

include_partials: bool | None

split_sentences: bool | None

enable_diarization: bool | None

speaker_sensitivity: float | None

max_speakers: int | None

prefer_current_speaker: bool | None

extra_params: dict | None
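The parameter descriptions state two ordering constraints: end_of_utterance_silence_trigger must be lower than max_delay, and end_of_utterance_max_delay must be greater than end_of_utterance_silence_trigger. The sketch below checks them before a configuration is sent; check_delays is a hypothetical helper, and skipping a check when either value is None is an assumption.

```python
def check_delays(max_delay=None, silence_trigger=None, eou_max_delay=None):
    """Return a list of violated constraints (empty when the values are consistent)."""
    problems = []
    if (
        silence_trigger is not None
        and max_delay is not None
        and not silence_trigger < max_delay
    ):
        problems.append(
            "end_of_utterance_silence_trigger must be lower than max_delay"
        )
    if (
        eou_max_delay is not None
        and silence_trigger is not None
        and not eou_max_delay > silence_trigger
    ):
        problems.append(
            "end_of_utterance_max_delay must be greater than "
            "end_of_utterance_silence_trigger"
        )
    return problems


print(check_delays(max_delay=1.0, silence_trigger=0.5, eou_max_delay=2.0))
print(check_delays(max_delay=0.5, silence_trigger=0.7, eou_max_delay=1.0))
```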

class UpdateParams(*, focus_speakers: list[str] = [], ignore_speakers: list[str] = [], focus_mode: SpeakerFocusMode = SpeakerFocusMode.RETAIN)[source]

Bases: BaseModel

Update parameters for Speechmatics STT service.

Deprecated since version 0.0.104: Use SpeechmaticsSTTService.Settings with STTUpdateSettingsFrame instead.

Parameters:

focus_speakers – List of speaker IDs to focus on. When enabled, only these speakers are emitted as finalized frames and other speakers are considered passive. Words from other speakers are still processed, but only emitted when a focussed speaker has also said new words. A list of labels (e.g. S1, S2) or identifiers of known speakers (e.g. speaker_1, speaker_2) can be used. Defaults to [].
ignore_speakers – List of speaker IDs to ignore. When enabled, these speakers are excluded from the transcription and their words are not processed. Their speech will not trigger any VAD or end of utterance detection. By default, any speaker with a label starting and ending with double underscores will be excluded (e.g. __ASSISTANT__). Defaults to [].
focus_mode – Speaker focus mode for diarization. When set to SpeakerFocusMode.RETAIN, the STT engine will retain words spoken by other speakers (not listed in ignore_speakers) and process them as passive speaker frames. When set to SpeakerFocusMode.IGNORE, the STT engine will ignore words spoken by other speakers and they will not be processed. Defaults to SpeakerFocusMode.RETAIN.

focus_speakers: list[str]

ignore_speakers: list[str]

focus_mode: SpeakerFocusMode

__init__(*, api_key: str | None = None, base_url: str | None = None, sample_rate: int | None = None, encoding: AudioEncoding = AudioEncoding.PCM_S16LE, params: InputParams | None = None, should_interrupt: bool = True, settings: SpeechmaticsSTTSettings | None = None, ttfs_p99_latency: float | None = 0.74, **kwargs)[source]

Initialize the Speechmatics STT service.

Parameters:

api_key – Speechmatics API key for authentication. Uses environment variable SPEECHMATICS_API_KEY if not provided.
base_url – Base URL for Speechmatics API. Uses environment variable SPEECHMATICS_RT_URL or defaults to wss://eu2.rt.speechmatics.com/v2.
sample_rate – Optional audio sample rate in Hz.
encoding – Audio encoding format. Defaults to AudioEncoding.PCM_S16LE.
params – Input parameters for the service. Deprecated since version 0.0.105: Use settings=SpeechmaticsSTTService.Settings(...) instead.
should_interrupt – Whether the bot should be interrupted when Speechmatics turn_detection_mode is configured to detect user speech.
settings – Runtime-updatable settings. When provided alongside deprecated params, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to STTService.
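The fallback order described for api_key and base_url (explicit argument, then environment variable, then built-in default) can be sketched as follows. resolve_connection is a hypothetical helper written for illustration; raising ValueError when no key is found is an assumption about error handling, not documented behaviour.

```python
import os

DEFAULT_RT_URL = "wss://eu2.rt.speechmatics.com/v2"


def resolve_connection(api_key=None, base_url=None, env=None):
    """Return (api_key, base_url) using argument > environment > default."""
    env = os.environ if env is None else env
    key = api_key or env.get("SPEECHMATICS_API_KEY")
    url = base_url or env.get("SPEECHMATICS_RT_URL") or DEFAULT_RT_URL
    if not key:
        # Assumption: a missing key is treated as a configuration error.
        raise ValueError("No API key: pass api_key or set SPEECHMATICS_API_KEY")
    return key, url


# A plain dict stands in for the process environment in this demonstration.
fake_env = {"SPEECHMATICS_API_KEY": "key-from-env"}
print(resolve_connection(env=fake_env))
```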

async start(frame: StartFrame)[source]: Called when a new session starts.

async stop(frame: EndFrame)[source]: Called when the session ends.

async cancel(frame: CancelFrame)[source]: Called when the session is cancelled.

update_params(params: UpdateParams) → None[source]

Updates the speaker configuration.

Deprecated since version 0.0.104: Use STTUpdateSettingsFrame with SpeechmaticsSTTService.Settings(...) instead.

This can update the speakers to listen to or ignore during an in-flight transcription. Only available if diarization is enabled.

Parameters:

params – Update parameters for the service.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames for VAD and metrics handling.

Parameters:

frame – Frame to process.
direction – Direction of frame processing.

async send_message(message: ClientMessageType | str, **kwargs: Any) → None[source]

Send a message to the STT service.

This sends a message to the STT service via the underlying transport. If the session is not running, this will raise an exception. Messages in the wrong format will also cause an error.

Parameters:

message – Message to send to the STT service.
**kwargs – Additional arguments passed to the underlying transport.

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as Speechmatics STT supports generation of metrics.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]: Adds audio to the audio buffer and yields None.
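The "buffer the audio, yield None" shape of run_stt can be illustrated with a small stand-in. BufferingSTT below is not the Pipecat service; it only mirrors the described behaviour, on the assumption that buffered audio is consumed elsewhere and that transcripts are delivered through a separate path.

```python
import asyncio


class BufferingSTT:
    """Stand-in class: buffers audio and produces no frame per chunk."""

    def __init__(self):
        self.buffer = bytearray()

    async def run_stt(self, audio: bytes):
        self.buffer.extend(audio)  # audio is queued for later consumption
        yield None  # no transcription frame is produced for this chunk


async def demo():
    stt = BufferingSTT()
    frames = []
    for chunk in (b"\x00\x01", b"\x02\x03"):
        async for frame in stt.run_stt(chunk):
            frames.append(frame)
    return frames, bytes(stt.buffer)


frames, buffered = asyncio.run(demo())
print(frames, buffered)
```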