stt

Cartesia Speech-to-Text service implementation.

This module provides a WebSocket-based STT service that integrates with the Cartesia Live transcription API for real-time speech recognition.

class pipecat.services.cartesia.stt.CartesiaSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, language: Language | str | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for CartesiaSTTService.

class pipecat.services.cartesia.stt.CartesiaLiveOptions(*, model: str = 'ink-whisper', language: str = 'en', encoding: str = 'pcm_s16le', sample_rate: int = 16000, **kwargs)[source]

Bases: object

Configuration options for Cartesia Live STT service.

Deprecated since version 0.0.105: Use settings=CartesiaSTTService.Settings(...) for model/language and direct __init__ parameters for encoding/sample_rate instead.

__init__(*, model: str = 'ink-whisper', language: str = 'en', encoding: str = 'pcm_s16le', sample_rate: int = 16000, **kwargs)[source]

Initialize CartesiaLiveOptions with default or provided parameters.

Parameters:
  • model – The transcription model to use. Defaults to “ink-whisper”.

  • language – Target language for transcription. Defaults to “en” (English).

  • encoding – Audio encoding format. Defaults to “pcm_s16le”.

  • sample_rate – Audio sample rate in Hz. Defaults to 16000.

  • **kwargs – Additional parameters for the transcription service.

to_dict()[source]

Convert options to dictionary format.

Returns:

Dictionary containing all configuration parameters.

items()[source]

Get configuration items as key-value pairs.

Returns:

Iterator of (key, value) tuples for all configuration parameters.

get(key, default=None)[source]

Get a configuration value by key.

Parameters:
  • key – The configuration parameter name to retrieve.

  • default – Default value if key is not found.

Returns:

The configuration value or default if not found.

classmethod from_json(json_str: str) CartesiaLiveOptions[source]

Create options from JSON string.

Parameters:

json_str – JSON string containing configuration parameters.

Returns:

New CartesiaLiveOptions instance with parsed parameters.
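The methods above describe a small dictionary-like interface. The sketch below is a hypothetical stand-in (the class name LiveOptionsSketch and its internal storage are assumptions, not pipecat code) that mirrors the documented signatures, so the behaviour of to_dict(), items(), get() and from_json() can be tried without installing the library. Because CartesiaLiveOptions is deprecated, treat it as an aid for reading older code rather than a pattern for new code.

```python
import json
from typing import Any, Dict, Iterator, Tuple


class LiveOptionsSketch:
    """Hypothetical stand-in mirroring the documented CartesiaLiveOptions interface."""

    def __init__(self, *, model: str = "ink-whisper", language: str = "en",
                 encoding: str = "pcm_s16le", sample_rate: int = 16000,
                 **kwargs: Any) -> None:
        # Assumption: all parameters, including extra kwargs, live in one dict.
        self._options: Dict[str, Any] = {
            "model": model,
            "language": language,
            "encoding": encoding,
            "sample_rate": sample_rate,
            **kwargs,
        }

    def to_dict(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate the stored options.
        return dict(self._options)

    def items(self) -> Iterator[Tuple[str, Any]]:
        return iter(self._options.items())

    def get(self, key: str, default: Any = None) -> Any:
        return self._options.get(key, default)

    @classmethod
    def from_json(cls, json_str: str) -> "LiveOptionsSketch":
        # Keys in the JSON object become keyword arguments.
        return cls(**json.loads(json_str))


opts = LiveOptionsSketch.from_json('{"language": "fr", "sample_rate": 8000}')
print(opts.to_dict())
```

Keys missing from the JSON string fall back to the constructor defaults, which is why model and encoding are still present in the printed dictionary.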

class pipecat.services.cartesia.stt.CartesiaSTTService(*, api_key: str, base_url: str = '', encoding: str = 'pcm_s16le', sample_rate: int | None = None, live_options: CartesiaLiveOptions | None = None, settings: CartesiaSTTSettings | None = None, ttfs_p99_latency: float | None = 0.81, **kwargs)[source]

Bases: WebsocketSTTService

Speech-to-text service using the Cartesia Live API.

Provides real-time speech transcription through a WebSocket connection to Cartesia’s Live transcription service. Supports both interim and final transcriptions with configurable models and languages.

Cartesia disconnects WebSocket connections after 3 minutes of inactivity. The timeout resets with each message (audio data or text command) sent to the server. Silence-based keepalive is enabled by default to prevent this. See: https://docs.cartesia.ai/api-reference/stt/stt
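To make the keepalive idea concrete, here is a minimal sketch of what a silence-based keepalive payload might look like for the default pcm_s16le encoding. The helper names, the mono-channel assumption, the 100 ms chunk length and the 30-second interval are all illustrative assumptions; they are not taken from the pipecat implementation or the Cartesia documentation.

```python
IDLE_TIMEOUT_S = 3 * 60  # inactivity limit stated above: 3 minutes


def silence_chunk(sample_rate: int = 16000, duration_ms: int = 100) -> bytes:
    """Return zero-valued pcm_s16le samples (assumed mono) covering duration_ms."""
    bytes_per_sample = 2  # signed 16-bit little-endian
    num_samples = sample_rate * duration_ms // 1000
    return b"\x00" * (num_samples * bytes_per_sample)


def keepalive_due(seconds_since_last_send: float, interval_s: float = 30.0) -> bool:
    """Decide whether a silence chunk should be sent (hypothetical policy)."""
    # The interval must stay well below the server timeout to be useful.
    assert interval_s < IDLE_TIMEOUT_S
    return seconds_since_last_send >= interval_s


chunk = silence_chunk()
print(len(chunk), keepalive_due(45.0))
```

Since any message resets the server-side timer, sending a short run of zero samples during long pauses should be enough to keep the connection open without affecting transcripts.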

Settings

alias of CartesiaSTTSettings

__init__(*, api_key: str, base_url: str = '', encoding: str = 'pcm_s16le', sample_rate: int | None = None, live_options: CartesiaLiveOptions | None = None, settings: CartesiaSTTSettings | None = None, ttfs_p99_latency: float | None = 0.81, **kwargs)[source]

Initialize CartesiaSTTService with API key and options.

Parameters:
  • api_key – Authentication key for Cartesia API.

  • base_url – Custom API endpoint URL. If empty, uses default.

  • encoding – Audio encoding format. Defaults to “pcm_s16le”.

  • sample_rate – Audio sample rate in Hz. If None, uses the pipeline sample rate.

  • live_options

    Configuration options for transcription service.

    Deprecated since version 0.0.105: Use settings=CartesiaSTTService.Settings(...) for model/language and direct __init__ parameters for encoding/sample_rate instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency, in seconds, from the end of speech to the final transcript. Override this with a value measured for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to parent STTService.
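The precedence and fallback rules in the parameter list can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the NOT_GIVEN sentinel, the resolve and effective_sample_rate helpers, the plain dictionaries standing in for settings and live_options, and the model name "older-model" are all hypothetical.

```python
from typing import Any, Dict, Optional

NOT_GIVEN = object()  # hypothetical sentinel: the caller did not set this field


def resolve(field: str, settings: Dict[str, Any],
            deprecated: Dict[str, Any], default: Any) -> Any:
    """Pick a value: settings first, then deprecated options, then a default."""
    if settings.get(field, NOT_GIVEN) is not NOT_GIVEN:
        return settings[field]
    if field in deprecated:
        return deprecated[field]
    return default


def effective_sample_rate(sample_rate: Optional[int],
                          pipeline_sample_rate: int) -> int:
    """Use the explicit rate if given, otherwise the pipeline's rate."""
    return pipeline_sample_rate if sample_rate is None else sample_rate


settings = {"model": "ink-whisper", "language": NOT_GIVEN}
live_options = {"model": "older-model", "language": "fr"}  # made-up values

model = resolve("model", settings, live_options, "ink-whisper")
language = resolve("language", settings, live_options, "en")
rate = effective_sample_rate(None, 24000)
print(model, language, rate)
```

Here model comes from settings because it was set there, while language falls through to the deprecated options because settings left it unset.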

can_generate_metrics() bool[source]

Check if the service can generate processing metrics.

Returns:

True, indicating metrics are supported.

async start(frame: StartFrame)[source]

Start the STT service and establish connection.

Parameters:

frame – Frame indicating service should start.

async stop(frame: EndFrame)[source]

Stop the STT service and close connection.

Parameters:

frame – Frame indicating service should stop.

async cancel(frame: CancelFrame)[source]

Cancel the STT service and close connection.

Parameters:

frame – Frame indicating service should be cancelled.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process incoming frames and handle speech events.

Parameters:
  • frame – The frame to process.

  • direction – Direction of frame flow in the pipeline.

async run_stt(audio: bytes) AsyncGenerator[Frame | None, None][source]

Process audio data for speech-to-text transcription.

Parameters:

audio – Raw audio bytes to transcribe.

Yields:

None. Transcription results are not yielded here; they are handled via WebSocket responses.
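The split between sending audio in run_stt and receiving transcripts elsewhere can be illustrated with a small asyncio sketch. Everything here is an assumption made for illustration: FakeSocket is an in-memory stand-in for a WebSocket, and receive_loop is a hypothetical name for whatever task consumes server messages; neither reflects the actual pipecat internals.

```python
import asyncio
from typing import AsyncGenerator, List, Tuple


class FakeSocket:
    """Hypothetical in-memory stand-in for a WebSocket connection."""

    def __init__(self) -> None:
        self.inbox: asyncio.Queue = asyncio.Queue()

    async def send(self, audio: bytes) -> None:
        # Pretend the server answers every audio chunk with one message.
        await self.inbox.put(f"transcript for {len(audio)} bytes")

    async def recv(self) -> str:
        return await self.inbox.get()


async def run_stt(sock: FakeSocket, audio: bytes) -> AsyncGenerator[None, None]:
    # Only forward the audio; no transcript is produced on this path.
    await sock.send(audio)
    yield None


async def receive_loop(sock: FakeSocket, results: List[str], expected: int) -> None:
    # Transcripts arrive on a separate task, decoupled from run_stt.
    for _ in range(expected):
        results.append(await sock.recv())


async def demo() -> Tuple[List[None], List[str]]:
    sock = FakeSocket()
    results: List[str] = []
    receiver = asyncio.create_task(receive_loop(sock, results, expected=2))
    yielded: List[None] = []
    for chunk in (b"\x00" * 320, b"\x00" * 640):
        async for item in run_stt(sock, chunk):
            yielded.append(item)
    await receiver
    return yielded, results


yielded, results = asyncio.run(demo())
print(yielded, results)
```

Keeping the send path free of transcript handling means audio can be streamed continuously while results arrive whenever the server produces them.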