tts
Cartesia text-to-speech service implementations.
- class pipecat.services.cartesia.tts.GenerationConfig(*, volume: float | None = None, speed: float | None = None, emotion: str | None = None)[source]
Bases: BaseModel
Configuration for Cartesia Sonic-3 generation parameters.
Sonic-3 interprets these parameters as guidance to ensure natural speech. Test against your content for best results.
- Parameters:
volume – Volume multiplier for generated speech. Valid range: [0.5, 2.0]. Default is 1.0.
speed – Speed multiplier for generated speech. Valid range: [0.6, 1.5]. Default is 1.0.
emotion – Single emotion string to guide the emotional tone. Examples include neutral, angry, excited, content, sad, scared. Over 60 emotions are supported. For best results, use with recommended voices: Leo, Jace, Kyle, Gavin, Maya, Tessa, Dana, and Marian.
- volume: float | None
- speed: float | None
- emotion: str | None
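The documented ranges can be checked before a request is built. The following is a minimal sketch and not part of pipecat: `clamp_generation_params` and the two range constants are hypothetical names, and the bounds are copied from the parameter descriptions above.

```python
from typing import Optional, Tuple

# Ranges copied from the parameter descriptions above.
VOLUME_RANGE = (0.5, 2.0)
SPEED_RANGE = (0.6, 1.5)


def _clamp(value: Optional[float], bounds: Tuple[float, float]) -> Optional[float]:
    """Clamp value into bounds; None means "use the service default"."""
    if value is None:
        return None
    low, high = bounds
    return max(low, min(high, value))


def clamp_generation_params(
    volume: Optional[float] = None, speed: Optional[float] = None
) -> dict:
    """Hypothetical helper: return only the parameters that were set, clamped."""
    params = {
        "volume": _clamp(volume, VOLUME_RANGE),
        "speed": _clamp(speed, SPEED_RANGE),
    }
    return {name: value for name, value in params.items() if value is not None}
```

Whether GenerationConfig itself rejects out-of-range values is not stated above; clamping is a choice made for this sketch only.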
- pipecat.services.cartesia.tts.language_to_cartesia_language(language: Language) → str | None[source]
Convert a Language enum to Cartesia language code.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding Cartesia language code, or None if not supported.
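The "code or None" contract can be pictured as a table lookup. This is only a sketch under stated assumptions: the `Language` class below is a three-member stand-in for pipecat's enum, the table contents are illustrative, and `to_service_language` is a hypothetical name. It does not show how the real function is implemented.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Hypothetical stand-in for pipecat's Language enum (three members only)."""
    EN = "en"
    FR = "fr"
    TLH = "tlh"  # deliberately left out of the table below


# Illustrative table; the real list of supported languages is not given here.
_SUPPORTED = {Language.EN: "en", Language.FR: "fr"}


def to_service_language(language: Language) -> Optional[str]:
    """Return the service code for language, or None if it is not supported."""
    return _SUPPORTED.get(language)
```

Callers should handle the None case explicitly instead of passing it on as a language code.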
- class pipecat.services.cartesia.tts.CartesiaEmotion(*values)[source]
Bases: StrEnum
Predefined emotions supported by Cartesia.
- NEUTRAL = 'neutral'
- ANGRY = 'angry'
- EXCITED = 'excited'
- CONTENT = 'content'
- SAD = 'sad'
- SCARED = 'scared'
- HAPPY = 'happy'
- ENTHUSIASTIC = 'enthusiastic'
- ELATED = 'elated'
- EUPHORIC = 'euphoric'
- TRIUMPHANT = 'triumphant'
- AMAZED = 'amazed'
- SURPRISED = 'surprised'
- FLIRTATIOUS = 'flirtatious'
- JOKING_COMEDIC = 'joking/comedic'
- CURIOUS = 'curious'
- PEACEFUL = 'peaceful'
- SERENE = 'serene'
- CALM = 'calm'
- GRATEFUL = 'grateful'
- AFFECTIONATE = 'affectionate'
- TRUST = 'trust'
- SYMPATHETIC = 'sympathetic'
- ANTICIPATION = 'anticipation'
- MYSTERIOUS = 'mysterious'
- MAD = 'mad'
- OUTRAGED = 'outraged'
- FRUSTRATED = 'frustrated'
- AGITATED = 'agitated'
- THREATENED = 'threatened'
- DISGUSTED = 'disgusted'
- CONTEMPT = 'contempt'
- ENVIOUS = 'envious'
- SARCASTIC = 'sarcastic'
- IRONIC = 'ironic'
- DEJECTED = 'dejected'
- MELANCHOLIC = 'melancholic'
- DISAPPOINTED = 'disappointed'
- HURT = 'hurt'
- GUILTY = 'guilty'
- BORED = 'bored'
- TIRED = 'tired'
- REJECTED = 'rejected'
- NOSTALGIC = 'nostalgic'
- WISTFUL = 'wistful'
- APOLOGETIC = 'apologetic'
- HESITANT = 'hesitant'
- INSECURE = 'insecure'
- CONFUSED = 'confused'
- RESIGNED = 'resigned'
- ANXIOUS = 'anxious'
- PANICKED = 'panicked'
- ALARMED = 'alarmed'
- PROUD = 'proud'
- CONFIDENT = 'confident'
- DISTANT = 'distant'
- SKEPTICAL = 'skeptical'
- CONTEMPLATIVE = 'contemplative'
- DETERMINED = 'determined'
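Because each member carries its string value, an emotion can be looked up from untrusted input by value. The sketch below uses a four-member stand-in built on `(str, Enum)` so that it also runs on Python versions without `enum.StrEnum`; `Emotion` and `parse_emotion` are hypothetical names, not pipecat APIs.

```python
from enum import Enum
from typing import Optional


class Emotion(str, Enum):
    """Stand-in with four of the values listed above."""
    NEUTRAL = "neutral"
    EXCITED = "excited"
    JOKING_COMEDIC = "joking/comedic"
    CALM = "calm"


def parse_emotion(raw: str) -> Optional[Emotion]:
    """Look an emotion up by its string value; return None if unknown."""
    try:
        return Emotion(raw.strip().lower())
    except ValueError:
        return None
```

Note that the member name and the value can differ: `JOKING_COMEDIC` maps to the string `joking/comedic`.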
- class pipecat.services.cartesia.tts.CartesiaTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, generation_config: GenerationConfig | None | _NotGiven = <factory>, pronunciation_dict_id: str | None | _NotGiven = <factory>)[source]
Bases: TTSSettings
Settings for CartesiaTTSService and CartesiaHttpTTSService.
- Parameters:
generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.
- generation_config: GenerationConfig | None | _NotGiven
- pronunciation_dict_id: str | None | _NotGiven
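The `_NotGiven` alternative in each field type suggests a sentinel that is distinct from None, so that a partial update can tell "leave this field alone" apart from "set this field to None". That reading is an assumption, and the sketch below is a generic illustration of the pattern: `SketchSettings`, `NOT_GIVEN`, and `apply_update` are hypothetical and are not pipecat's implementation.

```python
from dataclasses import dataclass, field, fields
from typing import Any


class _NotGiven:
    """Sentinel type meaning "this field was not specified"."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSettings:
    """Hypothetical two-field stand-in for a settings object."""
    voice: Any = field(default_factory=lambda: NOT_GIVEN)
    pronunciation_dict_id: Any = field(default_factory=lambda: NOT_GIVEN)


def apply_update(current: SketchSettings, update: SketchSettings) -> SketchSettings:
    """Copy over only the fields that the update actually specifies."""
    merged = SketchSettings(
        **{f.name: getattr(current, f.name) for f in fields(current)}
    )
    for f in fields(update):
        value = getattr(update, f.name)
        if value is not NOT_GIVEN:  # None is a real value and does get applied
            setattr(merged, f.name, value)
    return merged
```

In this sketch, updating with `SketchSettings(pronunciation_dict_id=None)` clears the dictionary ID while leaving `voice` untouched.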
- class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]
Bases: WebsocketTTSService
Cartesia TTS service with WebSocket streaming and word timestamps.
Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including generation configuration.
- Settings
alias of CartesiaTTSSettings
- class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]
Bases: BaseModel
Input parameters for Cartesia TTS configuration.
- Parameters:
language – Language to use for synthesis.
generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.
- generation_config: GenerationConfig | None
- pronunciation_dict_id: str | None
- __init__(*, api_key: str, voice_id: str | None = None, cartesia_version: str = '2026-03-01', url: str = 'wss://api.cartesia.ai/tts/websocket', model: str | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', max_buffer_delay_ms: int | None = None, params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, text_aggregation_mode: TextAggregationMode | None = None, aggregate_sentences: bool | None = None, **kwargs)[source]
Initialize the Cartesia TTS service.
- Parameters:
api_key – Cartesia API key for authentication.
voice_id – ID of the voice to use for synthesis.
Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(voice=...) instead.
cartesia_version – API version string for Cartesia service.
url – WebSocket URL for Cartesia TTS API.
model – TTS model to use (e.g., “sonic-3”).
Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(model=...) instead.
sample_rate – Audio sample rate. If None, uses default.
encoding – Audio encoding format.
container – Audio container format.
max_buffer_delay_ms – Server-side buffering window before generation starts. 0 disables server buffering (custom buffering); any value in (0, 5000] enables managed buffering. If None, derived from text_aggregation_mode: 0 for SENTENCE (avoids stacking client and server buffering), unset for TOKEN (uses Cartesia’s 3000ms default).
params – Additional input parameters for voice customization.
Deprecated since version 0.0.105: Use settings=CartesiaTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
text_aggregation_mode – How to aggregate incoming text before synthesis.
aggregate_sentences – Whether to aggregate sentences within the TTSService.
Deprecated since version 0.0.104: Use text_aggregation_mode instead.
**kwargs – Additional arguments passed to the parent service.
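The derivation rule for max_buffer_delay_ms can be written out as a small function. This is a sketch of the documented behavior only: `resolve_max_buffer_delay_ms` is a hypothetical helper, the `TextAggregationMode` enum is a two-member stand-in, and the range check is an assumption based on the stated (0, 5000] interval plus the special value 0.

```python
from enum import Enum
from typing import Optional


class TextAggregationMode(Enum):
    """Stand-in for the two modes named in the description above."""
    SENTENCE = "sentence"
    TOKEN = "token"


def resolve_max_buffer_delay_ms(
    max_buffer_delay_ms: Optional[int], mode: TextAggregationMode
) -> Optional[int]:
    """Hypothetical helper mirroring the documented derivation.

    Returns the value to send, or None to leave the field unset so that the
    server-side default (documented as 3000 ms) applies.
    """
    if max_buffer_delay_ms is not None:
        if not 0 <= max_buffer_delay_ms <= 5000:
            raise ValueError("max_buffer_delay_ms must be in [0, 5000]")
        return max_buffer_delay_ms
    # Not specified: sentences are already buffered client-side, so turn
    # server buffering off; token streaming keeps the server default.
    return 0 if mode is TextAggregationMode.SENTENCE else None
```

An explicit value always wins over the derived one, which matches the "If None, derived from ..." wording.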
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Cartesia service supports metrics generation.
- language_to_service_language(language: Language) → str | None[source]
Convert a Language enum to Cartesia language format.
- Parameters:
language – The language to convert.
- Returns:
The Cartesia-specific language code, or None if not supported.
- static EMOTION_TAG(emotion: CartesiaEmotion) → str[source]
Convenience method to create an emotion tag.
- async start(frame: StartFrame)[source]
Start the Cartesia TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the Cartesia TTS service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the Cartesia TTS service.
- Parameters:
frame – The cancel frame.
- async on_audio_context_interrupted(context_id: str)[source]
Cancel the active Cartesia context when the bot is interrupted.
- async on_audio_context_completed(context_id: str)[source]
Close the Cartesia context after all audio has been played.
No close message is needed: the server already considers the context done once it has sent its done message, which is handled in _process_messages.
- async flush_audio(context_id: str | None = None)[source]
Flush any pending audio and finalize the current context.
- Parameters:
context_id – The specific context to flush. If None, falls back to the currently active context.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]
Generate speech from text using Cartesia’s streaming API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech.
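The return type `AsyncGenerator[Frame | None, None]` means a consumer has to iterate with `async for` and be prepared for None items. The sketch below illustrates that shape with a stand-in generator that yields bytes instead of Frame objects; `fake_run_tts` and `collect_frames` are hypothetical names. In a pipeline the framework normally drives run_tts itself, so this only illustrates the yield type.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


async def fake_run_tts(
    text: str, context_id: str
) -> AsyncGenerator[Optional[bytes], None]:
    """Hypothetical stand-in: yields one chunk per word, plus a None placeholder."""
    yield None  # nothing to emit yet
    for word in text.split():
        yield word.encode("utf-8")


async def collect_frames(text: str, context_id: str) -> List[bytes]:
    """Drain the generator, skipping None items."""
    frames = []
    async for frame in fake_run_tts(text, context_id):
        if frame is not None:
            frames.append(frame)
    return frames
```

Running `asyncio.run(collect_frames("hello world", "ctx-1"))` drains the stand-in generator and returns the two non-None chunks.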
- class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]
Bases: TTSService
Cartesia HTTP-based TTS service.
Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.
- Settings
alias of CartesiaTTSSettings
- class InputParams(*, language: Language | None = Language.EN, generation_config: GenerationConfig | None = None, pronunciation_dict_id: str | None = None)[source]
Bases: BaseModel
Input parameters for Cartesia HTTP TTS configuration.
- Parameters:
language – Language to use for synthesis.
generation_config – Generation configuration for Sonic-3 models. Includes volume, speed (numeric), and emotion (string) parameters.
pronunciation_dict_id – The ID of the pronunciation dictionary to use for custom pronunciations.
- generation_config: GenerationConfig | None
- pronunciation_dict_id: str | None
- __init__(*, api_key: str, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.cartesia.ai', cartesia_version: str = '2026-03-01', aiohttp_session: ClientSession | None = None, sample_rate: int | None = None, encoding: str = 'pcm_s16le', container: str = 'raw', params: InputParams | None = None, settings: CartesiaTTSSettings | None = None, **kwargs)[source]
Initialize the Cartesia HTTP TTS service.
- Parameters:
api_key – Cartesia API key for authentication.
voice_id – ID of the voice to use for synthesis.
Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(voice=...) instead.
model – TTS model to use (e.g., “sonic-3”).
Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(model=...) instead.
base_url – Base URL for Cartesia HTTP API.
cartesia_version – API version string for Cartesia service.
aiohttp_session – Optional aiohttp ClientSession for HTTP requests. If not provided, a session will be created and managed internally.
sample_rate – Audio sample rate. If None, uses default.
encoding – Audio encoding format.
container – Audio container format.
params – Additional input parameters for voice customization.
Deprecated since version 0.0.105: Use settings=CartesiaHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to the parent TTSService.
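The precedence rule between a deprecated argument and its settings replacement can be sketched for a single field. `resolve_voice` is a hypothetical helper written for illustration; it treats None as "not provided" for simplicity and is not how pipecat resolves its arguments internally.

```python
import warnings
from typing import Optional


def resolve_voice(
    voice_id: Optional[str], settings_voice: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: pick the voice when both old and new arguments exist."""
    if voice_id is not None:
        warnings.warn(
            "voice_id is deprecated; pass it through settings instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # The settings value wins whenever it is provided.
    return settings_voice if settings_voice is not None else voice_id
```

The deprecated value is still honored when no settings value exists, so older call sites keep working while they migrate.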
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as Cartesia HTTP service supports metrics generation.
- language_to_service_language(language: Language) → str | None[source]
Convert a Language enum to Cartesia language format.
- Parameters:
language – The language to convert.
- Returns:
The Cartesia-specific language code, or None if not supported.
- async start(frame: StartFrame)[source]
Start the Cartesia HTTP TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the Cartesia HTTP TTS service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the Cartesia HTTP TTS service.
- Parameters:
frame – The cancel frame.
- async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]
Generate speech from text using Cartesia’s HTTP API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech.
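With the default encoding pcm_s16le in a raw container, each sample is a 16-bit little-endian integer, so the playback duration follows directly from the byte count. The helper below is a small sketch with a hypothetical name; mono output is an assumption (it is the `channels` default here, not something stated above), and the sample rate is whatever the service was configured with.

```python
def pcm_s16le_duration_seconds(
    num_bytes: int, sample_rate: int, channels: int = 1
) -> float:
    """Duration of raw pcm_s16le audio: 2 bytes per sample per channel."""
    bytes_per_frame = 2 * channels
    if num_bytes % bytes_per_frame != 0:
        raise ValueError("byte count is not a whole number of samples")
    return num_bytes / (bytes_per_frame * sample_rate)
```

For example, 48000 bytes of mono audio at an assumed sample rate of 24000 Hz correspond to one second of speech.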