tts

Cartesia text-to-speech service implementations.

class pipecat.services.cartesia.tts.GenerationConfig(*, volume: float | None = None, speed: float | None = None, emotion: str | None = None)[source]

Bases: BaseModel

Configuration for Cartesia Sonic-3 generation parameters.

Sonic-3 interprets these parameters as guidance to ensure natural speech. Test against your content for best results.

Parameters:

volume – Volume multiplier for generated speech. Valid range: [0.5, 2.0]. Default is 1.0.
speed – Speed multiplier for generated speech. Valid range: [0.6, 1.5]. Default is 1.0.
emotion – Single emotion string to guide the emotional tone. Examples include neutral, angry, excited, content, sad, scared. Over 60 emotions are supported. For best results, use with recommended voices: Leo, Jace, Kyle, Gavin, Maya, Tessa, Dana, and Marian.

volume: float | None

speed: float | None

emotion: str | None
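
The documented ranges can be enforced before values reach the service. The following is a minimal sketch, assuming you prefer to clamp out-of-range values rather than reject them. `clamp_generation_params` and the two range constants are hypothetical helpers, not part of pipecat, and this reference does not state how pipecat or Cartesia treats out-of-range values.

```python
from typing import Optional

# Ranges copied from the parameter descriptions above.
VOLUME_RANGE = (0.5, 2.0)
SPEED_RANGE = (0.6, 1.5)


def _clamp(value: Optional[float], low: float, high: float) -> Optional[float]:
    """Clamp value into [low, high]; None is passed through unchanged."""
    if value is None:
        return None
    return max(low, min(high, value))


def clamp_generation_params(volume: Optional[float] = None,
                            speed: Optional[float] = None) -> dict:
    """Hypothetical helper: return kwargs that fall inside the documented ranges."""
    return {
        "volume": _clamp(volume, *VOLUME_RANGE),
        "speed": _clamp(speed, *SPEED_RANGE),
    }
```

The resulting dict could then be unpacked into the model, for example `GenerationConfig(**clamp_generation_params(volume=3.0, speed=1.2))`.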

pipecat.services.cartesia.tts.language_to_cartesia_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language code.

Parameters: language – The Language enum value to convert.
Returns: The corresponding Cartesia language code, or None if not supported.

class pipecat.services.cartesia.tts.CartesiaEmotion(*values)[source]

Bases: StrEnum

Predefined emotions supported by Cartesia.

NEUTRAL = 'neutral'

ANGRY = 'angry'

EXCITED = 'excited'

CONTENT = 'content'

SAD = 'sad'

SCARED = 'scared'

HAPPY = 'happy'

ENTHUSIASTIC = 'enthusiastic'

ELATED = 'elated'

EUPHORIC = 'euphoric'

TRIUMPHANT = 'triumphant'

AMAZED = 'amazed'

SURPRISED = 'surprised'

FLIRTATIOUS = 'flirtatious'

JOKING_COMEDIC = 'joking/comedic'

CURIOUS = 'curious'

PEACEFUL = 'peaceful'

SERENE = 'serene'

CALM = 'calm'

GRATEFUL = 'grateful'

AFFECTIONATE = 'affectionate'

TRUST = 'trust'

SYMPATHETIC = 'sympathetic'

ANTICIPATION = 'anticipation'

MYSTERIOUS = 'mysterious'

MAD = 'mad'

OUTRAGED = 'outraged'

FRUSTRATED = 'frustrated'

AGITATED = 'agitated'

THREATENED = 'threatened'

DISGUSTED = 'disgusted'

CONTEMPT = 'contempt'

ENVIOUS = 'envious'

SARCASTIC = 'sarcastic'

IRONIC = 'ironic'

DEJECTED = 'dejected'

MELANCHOLIC = 'melancholic'

DISAPPOINTED = 'disappointed'

HURT = 'hurt'

GUILTY = 'guilty'

BORED = 'bored'

TIRED = 'tired'

REJECTED = 'rejected'

NOSTALGIC = 'nostalgic'

WISTFUL = 'wistful'

APOLOGETIC = 'apologetic'

HESITANT = 'hesitant'

INSECURE = 'insecure'

CONFUSED = 'confused'

RESIGNED = 'resigned'

ANXIOUS = 'anxious'

PANICKED = 'panicked'

ALARMED = 'alarmed'

PROUD = 'proud'

CONFIDENT = 'confident'

DISTANT = 'distant'

SKEPTICAL = 'skeptical'

CONTEMPLATIVE = 'contemplative'

DETERMINED = 'determined'
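
Because every member carries a string value, free-form text can be mapped back to a member by value. The sketch below uses a small stand-in enum with three of the values listed above rather than the real `CartesiaEmotion`. `parse_emotion` and its fallback to neutral are assumptions made for illustration. The stand-in derives from `(str, Enum)` so it also runs on Python versions older than 3.11, whereas the documented class derives from StrEnum.

```python
from enum import Enum
from typing import Optional


class Emotion(str, Enum):
    """Stand-in with a subset of the CartesiaEmotion values listed above."""
    NEUTRAL = "neutral"
    EXCITED = "excited"
    JOKING_COMEDIC = "joking/comedic"


def parse_emotion(raw: Optional[str], default: Emotion = Emotion.NEUTRAL) -> Emotion:
    """Look up a member by its string value, falling back to default."""
    if raw is None:
        return default
    try:
        # Calling the enum with a value returns the matching member.
        return Emotion(raw.strip().lower())
    except ValueError:
        return default
```

Note that the member name and the value can differ: `JOKING_COMEDIC` has the value `joking/comedic`, so lookups should use the value, not the name.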

class pipecat.services.cartesia.tts.CartesiaTTSSettings

Bases: TTSSettings

Settings for CartesiaTTSService and CartesiaHttpTTSService.

Parameters:

generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

generation_config: GenerationConfig | None | _NotGiven

pronunciation_dict_id: str | None | _NotGiven
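
A service can be configured through these settings instead of the deprecated constructor arguments. The sketch below is an assumption-laden illustration: the `voice` and `model` field names are taken from the deprecation notes in this reference, `CARTESIA_API_KEY` is an assumed environment variable name, and `build_cartesia_tts` is a hypothetical helper. The import sits inside the function only so that the snippet loads without pipecat installed.

```python
import os


def build_cartesia_tts(voice_id: str):
    """Create a CartesiaTTSService configured through settings."""
    # Imported inside the function so this sketch loads without pipecat installed.
    from pipecat.services.cartesia.tts import CartesiaTTSService, GenerationConfig

    settings = CartesiaTTSService.Settings(
        voice=voice_id,
        model="sonic-3",
        # Values chosen inside the documented ranges for GenerationConfig.
        generation_config=GenerationConfig(speed=1.1, emotion="content"),
    )
    # CARTESIA_API_KEY is an assumed environment variable name.
    return CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"], settings=settings)
```

The same pattern should apply to CartesiaHttpTTSService, whose Settings attribute is documented as an alias of the same class.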

class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Bases: WebsocketTTSService

Cartesia TTS service with WebSocket streaming and word timestamps.

Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including generation configuration.

Settings: alias of CartesiaTTSSettings

class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]

Bases: BaseModel

Input parameters for Cartesia TTS configuration.

Parameters:

language – Language to use for synthesis.
generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

language: Language | None

generation_config: GenerationConfig | None

pronunciation_dict_id: str | None

__init__(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]

Initialize the Cartesia TTS service.

Parameters:

api_key – Cartesia API key for authentication.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(voice=...) instead.
cartesia_version – API version string for Cartesia service.
url – WebSocket URL for Cartesia TTS API.
model –
TTS model to use (e.g., “sonic-3”).

Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(model=...) instead.
sample_rate – Audio sample rate. If None, uses default.
encoding – Audio encoding format.
container – Audio container format.
max_buffer_delay_ms – Server-side buffering window, in milliseconds, before generation starts. A value of 0 disables server buffering so that the client controls buffering itself; any value in (0, 5000] enables Cartesia-managed buffering. If None, the value is derived from text_aggregation_mode: 0 for SENTENCE (to avoid stacking client and server buffering), and left unset for TOKEN (so Cartesia’s 3000 ms default applies).
params –
Additional input parameters for voice customization.

Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences –
Whether to aggregate sentences within the TTSService.

Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
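
The rule for deriving max_buffer_delay_ms when it is left as None can be sketched as a small function. This is not pipecat's implementation: `AggregationMode` is a stand-in for TextAggregationMode limited to the two modes named above, `resolve_max_buffer_delay_ms` is a hypothetical name, and rejecting out-of-range values with ValueError is an assumption, since the description does not say what happens to them.

```python
from enum import Enum
from typing import Optional


class AggregationMode(Enum):
    """Stand-in for TextAggregationMode, limited to the two modes named above."""
    SENTENCE = "sentence"
    TOKEN = "token"


def resolve_max_buffer_delay_ms(value: Optional[int],
                                mode: AggregationMode) -> Optional[int]:
    """Return the delay to send, or None to leave it unset (server default)."""
    if value is None:
        # SENTENCE mode already buffers on the client, so server buffering is disabled.
        return 0 if mode is AggregationMode.SENTENCE else None
    if not 0 <= value <= 5000:
        # Assumption: out-of-range values are rejected; the description does not say.
        raise ValueError("max_buffer_delay_ms must be in [0, 5000]")
    return value
```

An explicit value always wins over the derived one, so passing 0 in TOKEN mode still disables server buffering.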

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as the Cartesia service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language format.

Parameters: language – The language to convert.
Returns: The Cartesia-specific language code, or None if not supported.

static SPELL(text: str) → str[source]: Wrap text in Cartesia spell tag.

static EMOTION_TAG(emotion: CartesiaEmotion) → str[source]: Convenience method to create an emotion tag.

static PAUSE_TAG(seconds: float) → str[source]: Convenience method to create a pause tag.

static VOLUME_TAG(volume: float) → str[source]: Convenience method to create a volume tag.

static SPEED_TAG(speed: float) → str[source]: Convenience method to create a speed tag.

async start(frame: StartFrame)[source]

Start the Cartesia TTS service.

Parameters: frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Cartesia TTS service.

Parameters: frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Cartesia TTS service.

Parameters: frame – The cancel frame.

async on_audio_context_interrupted(context_id: str)[source]: Cancel the active Cartesia context when the bot is interrupted.

async on_audio_context_completed(context_id: str)[source]

Close the Cartesia context after all audio has been played.

No close message is needed: the server already considers the context done once it has sent its done message, which is handled in _process_messages.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio and finalize the current context.

Parameters: context_id – The specific context to flush. If None, falls back to the currently active context.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using Cartesia’s streaming API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.
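
The return type is an async generator whose items may be None, so a consumer iterates with `async for` and skips empty items. The sketch below shows only that consumption pattern: `fake_run_tts` is a stand-in that yields byte strings instead of real Frame objects, and `collect_frames` is a hypothetical helper. In a running pipeline the service is typically driven by the framework rather than called this way.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


async def fake_run_tts(text: str, context_id: str) -> AsyncGenerator[Optional[bytes], None]:
    """Stand-in generator: yields a None, then one fake audio chunk per word."""
    yield None  # placeholder item that a consumer should skip
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop
        yield word.encode("utf-8")


async def collect_frames(text: str, context_id: str) -> List[bytes]:
    """Drain the generator, ignoring None items."""
    frames: List[bytes] = []
    async for frame in fake_run_tts(text, context_id):
        if frame is not None:
            frames.append(frame)
    return frames
```

Calling `asyncio.run(collect_frames("hello there", "ctx-1"))` drains the stand-in generator and returns the two non-None chunks in order.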

class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]

Bases: TTSService

Cartesia HTTP-based TTS service.

Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.

Settings: alias of CartesiaTTSSettings

class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]

Bases: BaseModel

Input parameters for Cartesia HTTP TTS configuration.

Parameters:

language – Language to use for synthesis.
generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.

language: Language | None

generation_config: GenerationConfig | None

pronunciation_dict_id: str | None

__init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]

Initialize the Cartesia HTTP TTS service.

Parameters:

api_key – Cartesia API key for authentication.
voice_id –
ID of the voice to use for synthesis.

Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(voice=...) instead.
model –
TTS model to use (e.g., “sonic-3”).

Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(model=...) instead.
base_url – Base URL for Cartesia HTTP API.
cartesia_version – API version string for Cartesia service.
aiohttp_session – Optional aiohttp ClientSession for HTTP requests. If not provided, a session will be created and managed internally.
sample_rate – Audio sample rate. If None, uses default.
encoding – Audio encoding format.
container – Audio container format.
params –
Additional input parameters for voice customization.

Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to the parent TTSService.
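
The precedence rule for settings over deprecated parameters can be pictured as an overlay in which only fields that were actually given replace the older values. The sketch below is an assumption about the mechanism, not pipecat's code: `NOT_GIVEN` is a stand-in for the `_NotGiven` marker that appears in the settings field types, and `merge_settings` is a hypothetical function operating on plain dicts.

```python
from typing import Any, Dict


class _NotGivenType:
    """Stand-in sentinel meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


def merge_settings(deprecated: Dict[str, Any], settings: Dict[str, Any]) -> Dict[str, Any]:
    """Start from deprecated arguments, then overlay every settings field that was given."""
    merged = {key: value for key, value in deprecated.items() if value is not None}
    for key, value in settings.items():
        if value is not NOT_GIVEN:
            merged[key] = value  # settings win, even when the value is None
    return merged
```

A separate sentinel lets None remain a meaningful value (for example, "no pronunciation dictionary") while still distinguishing fields the caller never touched.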

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, as the Cartesia HTTP service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Cartesia language format.

Parameters: language – The language to convert.
Returns: The Cartesia-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the Cartesia HTTP TTS service.

Parameters: frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Cartesia HTTP TTS service.

Parameters: frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Cartesia HTTP TTS service.

Parameters: frame – The cancel frame.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using Cartesia’s HTTP API.

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.