tts

Mistral text-to-speech service implementation.

This module provides integration with Mistral’s Voxtral TTS API for generating speech from text input using HTTP streaming with Server-Sent Events.
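As a rough illustration of the Server-Sent Events transport mentioned above, the sketch below extracts the data payload of each event from a stream of text lines. It follows the generic SSE framing rules (events separated by blank lines, payload carried in data: fields) and is not taken from the module's source. The helper name iter_sse_data is hypothetical, and the actual structure of Mistral's event payloads is not described here, so the payload is returned as an opaque string.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event.

    Hypothetical helper for illustration; not part of pipecat.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    # An event not followed by a blank line is incomplete and is dropped.
```

Multiple data: lines within one event are joined with newlines, which matches how SSE clients normally reassemble a payload.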

class pipecat.services.mistral.tts.MistralTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>)

Bases: TTSSettings

Settings for MistralTTSService.

Parameters:
  • model – TTS model identifier.

  • voice – Voice identifier.

  • language – Language for speech synthesis.

class pipecat.services.mistral.tts.MistralTTSService(*, api_key: str | None = None, sample_rate: int | None = None, settings: MistralTTSSettings | None = None, **kwargs)

Bases: TTSService

Mistral Text-to-Speech service using the Voxtral TTS API.

This service uses Mistral’s streaming TTS API to generate PCM-encoded audio at 24kHz. The API returns base64-encoded float32 PCM chunks via Server-Sent Events, which are converted to int16 for the Pipecat pipeline.
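The float32-to-int16 conversion described above can be sketched as follows. This is a minimal illustration and not the service's actual implementation: the function name float32_b64_to_int16 is hypothetical, and the code assumes little-endian float32 samples nominally in the range [-1.0, 1.0], which the description above does not state explicitly.

```python
import base64
import struct


def float32_b64_to_int16(chunk_b64: str) -> bytes:
    """Decode a base64 float32 PCM chunk into little-endian int16 PCM bytes.

    Hypothetical helper; byte order and sample range are assumptions.
    """
    raw = base64.b64decode(chunk_b64)
    count = len(raw) // 4  # each float32 sample occupies 4 bytes
    floats = struct.unpack(f"<{count}f", raw[: count * 4])
    # Clamp to [-1.0, 1.0] so out-of-range samples cannot overflow int16.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in floats]
    return struct.pack(f"<{count}h", *ints)
```

Scaling by 32767 rather than 32768 keeps the output symmetric around zero; either convention is common, and which one the service uses is not specified here.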

Settings

alias of MistralTTSSettings

MISTRAL_SAMPLE_RATE = 24000

__init__(*, api_key: str | None = None, sample_rate: int | None = None, settings: MistralTTSSettings | None = None, **kwargs)

Initialize Mistral TTS service.

Parameters:
  • api_key – Mistral API key for authentication.

  • sample_rate – Output audio sample rate in Hz. Audio is resampled from Mistral’s native 24kHz when a different rate is requested.

  • settings – Runtime-updatable settings.

  • **kwargs – Additional keyword arguments passed to TTSService.
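A possible way to construct the service from the parameters above is sketched below. The model and voice strings are hypothetical placeholders, not real identifiers, and reading the key from a MISTRAL_API_KEY environment variable is an assumption of this example. The import is done lazily inside the function, so the snippet only needs pipecat installed when the function is actually called.

```python
import os


def make_mistral_tts():
    """Build a MistralTTSService that outputs 16 kHz audio (illustrative only)."""
    # Imported lazily so this snippet can be loaded without pipecat present.
    from pipecat.services.mistral.tts import MistralTTSService, MistralTTSSettings

    settings = MistralTTSSettings(
        model="your-voxtral-tts-model",  # hypothetical placeholder
        voice="your-voice-id",  # hypothetical placeholder
    )
    return MistralTTSService(
        api_key=os.getenv("MISTRAL_API_KEY"),  # env var name is an assumption
        sample_rate=16000,  # differs from the native 24 kHz, so audio is resampled
        settings=settings,
    )
```

Leaving sample_rate as None defers the choice of output rate to the service and the surrounding pipeline.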

can_generate_metrics() → bool

Check if this service can generate processing metrics.

Returns:

True, because the Mistral TTS service supports metrics generation.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame, None]

Generate speech from text using Mistral’s TTS API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech data.