tts

Cartesia text-to-speech service implementations.

class pipecat.services.cartesia.tts.GenerationConfig(*, volume: float | None = None, speed: float | None = None, emotion: str | None = None)[source]

Bases: BaseModel

Configuration for Cartesia Sonic-3 generation parameters.

Sonic-3 interprets these parameters as guidance to ensure natural speech. Test against your content for best results.

Parameters:
  • volume – Volume multiplier for generated speech. Valid range: [0.5, 2.0]. Default is 1.0.

  • speed – Speed multiplier for generated speech. Valid range: [0.6, 1.5]. Default is 1.0.

  • emotion – Single emotion string to guide the emotional tone. Examples include neutral, angry, excited, content, sad, scared. Over 60 emotions are supported. For best results, use with recommended voices: Leo, Jace, Kyle, Gavin, Maya, Tessa, Dana, and Marian.

volume: float | None
speed: float | None
emotion: str | None
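
The documented ranges can be checked before a request is built. The following is a minimal sketch that uses a stand-in dataclass rather than the real Pydantic model. `GenerationConfigSketch` and its `validate` method are hypothetical names, and raising on out-of-range values is an assumption: this page does not say whether the real class or the Cartesia API rejects or clamps them.

```python
from dataclasses import dataclass
from typing import Optional

# Ranges as documented for the Sonic-3 generation parameters.
VOLUME_RANGE = (0.5, 2.0)
SPEED_RANGE = (0.6, 1.5)


@dataclass
class GenerationConfigSketch:
    """Hypothetical stand-in mirroring the fields of GenerationConfig."""

    volume: Optional[float] = None
    speed: Optional[float] = None
    emotion: Optional[str] = None

    def validate(self) -> None:
        """Raise ValueError if a provided value is outside its documented range."""
        for name, value, (low, high) in (
            ("volume", self.volume, VOLUME_RANGE),
            ("speed", self.speed, SPEED_RANGE),
        ):
            # None means "not set", so only provided values are checked.
            if value is not None and not (low <= value <= high):
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")


config = GenerationConfigSketch(volume=1.2, speed=0.9, emotion="excited")
config.validate()  # passes silently: both values are in range
```

Fields left as None are skipped, which matches the optional typing of all three parameters.
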
pipecat.services.cartesia.tts.language_to_cartesia_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language code.

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding Cartesia language code, or None if not supported.

class pipecat.services.cartesia.tts.CartesiaEmotion(*values)[source]

Bases: StrEnum

Predefined emotions supported by Cartesia.

NEUTRAL = 'neutral'
ANGRY = 'angry'
EXCITED = 'excited'
CONTENT = 'content'
SAD = 'sad'
SCARED = 'scared'
HAPPY = 'happy'
ENTHUSIASTIC = 'enthusiastic'
ELATED = 'elated'
EUPHORIC = 'euphoric'
TRIUMPHANT = 'triumphant'
AMAZED = 'amazed'
SURPRISED = 'surprised'
FLIRTATIOUS = 'flirtatious'
JOKING_COMEDIC = 'joking/comedic'
CURIOUS = 'curious'
PEACEFUL = 'peaceful'
SERENE = 'serene'
CALM = 'calm'
GRATEFUL = 'grateful'
AFFECTIONATE = 'affectionate'
TRUST = 'trust'
SYMPATHETIC = 'sympathetic'
ANTICIPATION = 'anticipation'
MYSTERIOUS = 'mysterious'
MAD = 'mad'
OUTRAGED = 'outraged'
FRUSTRATED = 'frustrated'
AGITATED = 'agitated'
THREATENED = 'threatened'
DISGUSTED = 'disgusted'
CONTEMPT = 'contempt'
ENVIOUS = 'envious'
SARCASTIC = 'sarcastic'
IRONIC = 'ironic'
DEJECTED = 'dejected'
MELANCHOLIC = 'melancholic'
DISAPPOINTED = 'disappointed'
HURT = 'hurt'
GUILTY = 'guilty'
BORED = 'bored'
TIRED = 'tired'
REJECTED = 'rejected'
NOSTALGIC = 'nostalgic'
WISTFUL = 'wistful'
APOLOGETIC = 'apologetic'
HESITANT = 'hesitant'
INSECURE = 'insecure'
CONFUSED = 'confused'
RESIGNED = 'resigned'
ANXIOUS = 'anxious'
PANICKED = 'panicked'
ALARMED = 'alarmed'
PROUD = 'proud'
CONFIDENT = 'confident'
DISTANT = 'distant'
SKEPTICAL = 'skeptical'
CONTEMPLATIVE = 'contemplative'
DETERMINED = 'determined'
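
Because the enum is string-based, a member can be used wherever a plain emotion string is expected. The sketch below illustrates this with a hypothetical three-member subset. `EmotionSketch` and `emotion_value` are illustrative names, and the sketch uses `str` plus `Enum` instead of `StrEnum` only so that it also runs on Python versions before 3.11.

```python
from enum import Enum
from typing import Union


class EmotionSketch(str, Enum):
    """Hypothetical subset of CartesiaEmotion, built on str + Enum."""

    NEUTRAL = "neutral"
    EXCITED = "excited"
    JOKING_COMEDIC = "joking/comedic"


def emotion_value(emotion: Union[EmotionSketch, str]) -> str:
    """Accept an enum member or a plain string and return the raw string."""
    return emotion.value if isinstance(emotion, Enum) else str(emotion)


# Members compare equal to their string values.
assert EmotionSketch.EXCITED == "excited"

# The member name and the string value can differ, as with 'joking/comedic'.
print(emotion_value(EmotionSketch.JOKING_COMEDIC))  # joking/comedic
print(emotion_value("sad"))  # sad
```
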
class pipecat.services.cartesia.tts.CartesiaTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, generation_config: GenerationConfig | None | _NotGiven = <factory>, pronunciation_dict_id: str | None | _NotGiven = <factory>)[source]

Bases: TTSSettings

Settings for CartesiaTTSService and CartesiaHttpTTSService.

Parameters:
  • generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.

  • pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

generation_config: GenerationConfig | None | _NotGiven
pronunciation_dict_id: str | None | _NotGiven
class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Bases: WebsocketTTSService

Cartesia TTS service with WebSocket streaming and word timestamps.

Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including generation configuration.

Settings

alias of CartesiaTTSSettings

class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]

Bases: BaseModel

Input parameters for Cartesia TTS configuration.

Parameters:
  • language – Language to use for synthesis.

  • generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.

  • pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

language: Language | None
generation_config: GenerationConfig | None
pronunciation_dict_id: str | None
__init__(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Initialize the Cartesia TTS service.

Parameters:
  • api_key – Cartesia API key for authentication.

  • voice_id

    ID of the voice to use for synthesis.

    Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(voice=...) instead.

  • cartesia_version – API version string for Cartesia service.

  • url – WebSocket URL for Cartesia TTS API.

  • model

    TTS model to use (e.g., “sonic-3”).

    Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(model=...) instead.

  • sample_rate – Audio sample rate. If None, uses default.

  • encoding – Audio encoding format.

  • container – Audio container format.

  • max_buffer_delay_ms – Server-side buffering window, in milliseconds, before generation starts. 0 disables server buffering (custom buffering); any value in (0, 5000] enables managed buffering. If None, the value is derived from text_aggregation_mode: 0 for SENTENCE (to avoid stacking client and server buffering), or left unset for TOKEN (so that Cartesia’s 3000 ms default applies).

  • params

    Additional input parameters for voice customization.

    Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • text_aggregation_mode – How to aggregate incoming text before synthesis.

  • aggregate_sentences

    Whether to aggregate sentences within the TTSService.

    Deprecated since version 0.0.104: Use text_aggregation_mode instead.

  • **kwargs – Additional arguments passed to the parent service.
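
The rule for deriving max_buffer_delay_ms when it is None can be sketched as a small function. This is an illustration of the documented behavior, not the service's actual code: `AggregationModeSketch` and `resolve_max_buffer_delay_ms` are hypothetical names, and raising on values outside [0, 5000] is an assumption.

```python
from enum import Enum
from typing import Optional


class AggregationModeSketch(Enum):
    """Hypothetical stand-in for TextAggregationMode."""

    SENTENCE = "sentence"
    TOKEN = "token"


def resolve_max_buffer_delay_ms(
    max_buffer_delay_ms: Optional[int], mode: AggregationModeSketch
) -> Optional[int]:
    """Return the value to send, or None to leave the field unset."""
    if max_buffer_delay_ms is not None:
        # An explicit value wins; 0 disables server buffering entirely.
        if not 0 <= max_buffer_delay_ms <= 5000:
            raise ValueError("max_buffer_delay_ms must be in [0, 5000]")
        return max_buffer_delay_ms
    if mode is AggregationModeSketch.SENTENCE:
        # Sentences are already buffered client-side, so skip server buffering.
        return 0
    # TOKEN mode: leave unset so the server-side default applies.
    return None


print(resolve_max_buffer_delay_ms(None, AggregationModeSketch.SENTENCE))  # 0
print(resolve_max_buffer_delay_ms(None, AggregationModeSketch.TOKEN))  # None
```
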

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as the Cartesia service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language format.

Parameters:

language – The language to convert.

Returns:

The Cartesia-specific language code, or None if not supported.

static SPELL(text: str) → str[source]

Wrap text in Cartesia spell tag.

static EMOTION_TAG(emotion: CartesiaEmotion) → str[source]

Convenience method to create an emotion tag.

static PAUSE_TAG(seconds: float) → str[source]

Convenience method to create a pause tag.

static VOLUME_TAG(volume: float) → str[source]

Convenience method to create a volume tag.

static SPEED_TAG(speed: float) → str[source]

Convenience method to create a speed tag.
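
These helpers return strings that are embedded in the text sent for synthesis. The sketch below shows the idea for the spell tag only. `spell_sketch` is a hypothetical stand-in, and the `<spell>` markup it produces is an assumption about the tag format; call the real static methods rather than hard-coding markup.

```python
def spell_sketch(text: str) -> str:
    """Hypothetical stand-in for SPELL; the <spell> markup is an assumption."""
    return f"<spell>{text}</spell>"


# Wrap only the part that should be read out character by character.
order_id = "A7X9"
sentence = f"Your confirmation code is {spell_sketch(order_id)}."
print(sentence)
```
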

async start(frame: StartFrame)[source]

Start the Cartesia TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Cartesia TTS service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Cartesia TTS service.

Parameters:

frame – The cancel frame.

async on_audio_context_interrupted(context_id: str)[source]

Cancel the active Cartesia context when the bot is interrupted.

async on_audio_context_completed(context_id: str)[source]

Close the Cartesia context after all audio has been played.

No close message is needed: the server already considers the context done once it has sent its done message, which is handled in _process_messages.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio and finalize the current context.

Parameters:

context_id – The specific context to flush. If None, falls back to the currently active context.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using Cartesia’s streaming API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.
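
The return type is an async generator that may yield either frames or None, so a consumer iterates with `async for` and skips None entries. The sketch below shows that consumption pattern against a fake generator. `AudioFrameSketch`, `run_tts_sketch`, and `collect` are hypothetical names, and the fake generator does not reflect how the real service produces audio.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional


@dataclass
class AudioFrameSketch:
    """Hypothetical stand-in for an audio frame."""

    audio: bytes
    context_id: str


async def run_tts_sketch(
    text: str, context_id: str
) -> AsyncGenerator[Optional[AudioFrameSketch], None]:
    """Yield one fake frame per word, then a trailing None."""
    for word in text.split():
        yield AudioFrameSketch(audio=word.encode("utf-8"), context_id=context_id)
    yield None


async def collect(text: str, context_id: str) -> List[AudioFrameSketch]:
    """Gather every non-None frame produced for one context."""
    frames: List[AudioFrameSketch] = []
    async for frame in run_tts_sketch(text, context_id):
        if frame is not None:  # the generator type allows None entries
            frames.append(frame)
    return frames


frames = asyncio.run(collect("hello there", "ctx-1"))
print(len(frames))  # 2
```
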

class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]

Bases: TTSService

Cartesia HTTP-based TTS service.

Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.

Settings

alias of CartesiaTTSSettings

class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]

Bases: BaseModel

Input parameters for Cartesia HTTP TTS configuration.

Parameters:
  • language – Language to use for synthesis.

  • generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.

  • pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

language: Language | None
generation_config: GenerationConfig | None
pronunciation_dict_id: str | None
__init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]

Initialize the Cartesia HTTP TTS service.

Parameters:
  • api_key – Cartesia API key for authentication.

  • voice_id

    ID of the voice to use for synthesis.

    Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(voice=...) instead.

  • model

    TTS model to use (e.g., “sonic-3”).

    Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(model=...) instead.

  • base_url – Base URL for Cartesia HTTP API.

  • cartesia_version – API version string for Cartesia service.

  • aiohttp_session – Optional aiohttp ClientSession for HTTP requests. If not provided, a session will be created and managed internally.

  • sample_rate – Audio sample rate. If None, uses default.

  • encoding – Audio encoding format.

  • container – Audio container format.

  • params

    Additional input parameters for voice customization.

    Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional arguments passed to the parent TTSService.
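
The precedence rule for settings over the deprecated voice_id and model parameters can be sketched as follows. This is a simplified illustration: `SettingsSketch` and `resolve_settings` are hypothetical names, and None stands in for "not given", whereas the real settings class distinguishes None from a separate _NotGiven sentinel.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SettingsSketch:
    """Hypothetical stand-in for CartesiaTTSSettings with two fields."""

    model: Optional[str] = None
    voice: Optional[str] = None


def resolve_settings(
    *,
    voice_id: Optional[str] = None,
    model: Optional[str] = None,
    settings: Optional[SettingsSketch] = None,
) -> SettingsSketch:
    """Merge deprecated keyword arguments with settings; settings win."""
    # Start from the deprecated keyword arguments.
    resolved = SettingsSketch(model=model, voice=voice_id)
    if settings is not None:
        for field in fields(settings):
            value = getattr(settings, field.name)
            if value is not None:  # only fields that were actually provided
                setattr(resolved, field.name, value)
    return resolved


merged = resolve_settings(
    voice_id="old-voice",
    model="sonic-3",
    settings=SettingsSketch(voice="new-voice"),
)
print(merged.voice)  # new-voice
print(merged.model)  # sonic-3
```
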

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as the Cartesia HTTP service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language format.

Parameters:

language – The language to convert.

Returns:

The Cartesia-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the Cartesia HTTP TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Cartesia HTTP TTS service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Cartesia HTTP TTS service.

Parameters:

frame – The cancel frame.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using Cartesia’s HTTP API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.