tts
Sarvam AI text-to-speech service implementation.
This module provides TTS services using Sarvam AI’s API with support for multiple Indian languages and three model variants:
Model Variants:
- bulbul:v2 (default): Standard TTS model
Supports: pitch, loudness, pace (0.3-3.0)
Default sample rate: 22050 Hz
Speakers: anushka (default), abhilash, manisha, vidya, arya, karun, hitesh
- bulbul:v3-beta: Advanced TTS model with temperature control
Does NOT support: pitch, loudness
Supports: pace (0.5-2.0), temperature (0.01-1.0)
Default sample rate: 24000 Hz
Preprocessing is always enabled
Speakers: aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia
- bulbul:v3: Advanced TTS model with temperature control
Does NOT support: pitch, loudness
Supports: pace (0.5-2.0), temperature (0.01-1.0)
Default sample rate: 24000 Hz
Preprocessing is always enabled
Speakers: aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia
See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream for full API details.
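The per-model default sample rates listed above can be pictured as a simple lookup. The following is a minimal sketch, not Pipecat's implementation: `resolve_sample_rate` and both constants are hypothetical names, the set of accepted rates is taken from the `sample_rate` constructor argument documented later in this reference, and rejecting other rates with `ValueError` is an assumption.

```python
from typing import Optional

# Default sample rates per model, transcribed from the overview above.
DEFAULT_SAMPLE_RATES = {
    "bulbul:v2": 22050,
    "bulbul:v3-beta": 24000,
    "bulbul:v3": 24000,
}

# Rates listed for the `sample_rate` constructor argument in this reference.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000)


def resolve_sample_rate(model: str, sample_rate: Optional[int] = None) -> int:
    """Return the explicit rate if given, else the model's documented default."""
    if model not in DEFAULT_SAMPLE_RATES:
        raise ValueError(f"unknown model: {model!r}")
    if sample_rate is None:
        return DEFAULT_SAMPLE_RATES[model]
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        # Assumption: other rates are rejected; the real service may differ.
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return sample_rate
```

Calling `resolve_sample_rate("bulbul:v3")` falls back to the documented 24000 Hz default, while an explicit supported rate is returned unchanged.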
- class pipecat.services.sarvam.tts.SarvamTTSModel(*values)[source]
Bases: StrEnum
Available Sarvam TTS models.
- Parameters:
BULBUL_V2 – Standard TTS model with pitch/loudness control.
- Supports pitch, loudness, pace (0.3-3.0)
- Default sample rate: 22050 Hz
BULBUL_V3_BETA – Advanced model with temperature control.
- Does NOT support pitch/loudness
- Pace range: 0.5-2.0
- Supports temperature parameter
- Default sample rate: 24000 Hz
- Preprocessing is always enabled
BULBUL_V3 – Advanced model with temperature control. Its capabilities match BULBUL_V3_BETA as listed in the module overview.
- BULBUL_V2 = 'bulbul:v2'
- BULBUL_V3_BETA = 'bulbul:v3-beta'
- BULBUL_V3 = 'bulbul:v3'
- class pipecat.services.sarvam.tts.SarvamTTSSpeakerV2(*values)[source]
Bases: StrEnum
Available speakers for bulbul:v2 model.
Female voices: anushka, manisha, vidya, arya Male voices: abhilash, karun, hitesh
- ANUSHKA = 'anushka'
- ABHILASH = 'abhilash'
- MANISHA = 'manisha'
- VIDYA = 'vidya'
- ARYA = 'arya'
- KARUN = 'karun'
- HITESH = 'hitesh'
- class pipecat.services.sarvam.tts.SarvamTTSSpeakerV3(*values)[source]
Bases: StrEnum
Available speakers for bulbul:v3-beta model.
Includes a wider variety of voices with different characteristics.
- ADITYA = 'aditya'
- RITU = 'ritu'
- PRIYA = 'priya'
- NEHA = 'neha'
- RAHUL = 'rahul'
- POOJA = 'pooja'
- ROHAN = 'rohan'
- SIMRAN = 'simran'
- KAVYA = 'kavya'
- AMIT = 'amit'
- DEV = 'dev'
- ISHITA = 'ishita'
- SHREYA = 'shreya'
- RATAN = 'ratan'
- VARUN = 'varun'
- MANAN = 'manan'
- SUMIT = 'sumit'
- ROOPA = 'roopa'
- KABIR = 'kabir'
- AAYAN = 'aayan'
- SHUBH = 'shubh'
- ASHUTOSH = 'ashutosh'
- ADVAIT = 'advait'
- AMELIA = 'amelia'
- SOPHIA = 'sophia'
- class pipecat.services.sarvam.tts.TTSModelConfig(supports_pitch: bool, supports_loudness: bool, supports_temperature: bool, default_sample_rate: int, default_speaker: str, pace_range: tuple[float, float], preprocessing_always_enabled: bool, speakers: tuple[str, ...])[source]
Bases: object
Immutable configuration for a Sarvam TTS model.
- Parameters:
supports_pitch – Whether the model accepts pitch parameter.
supports_loudness – Whether the model accepts loudness parameter.
supports_temperature – Whether the model accepts temperature parameter.
default_sample_rate – Default audio sample rate in Hz.
default_speaker – Default speaker voice ID.
pace_range – Valid range for pace parameter (min, max).
preprocessing_always_enabled – Whether preprocessing is always enabled.
speakers – Tuple of available speaker names for this model.
- supports_pitch: bool
- supports_loudness: bool
- supports_temperature: bool
- default_sample_rate: int
- default_speaker: str
- pace_range: tuple[float, float]
- preprocessing_always_enabled: bool
- speakers: tuple[str, ...]
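An immutable configuration with these fields can be modeled as a frozen dataclass. The sketch below is illustrative only: `ModelConfig`, `BULBUL_V2_CONFIG`, and `pace_is_valid` are hypothetical names, and the values are the bulbul:v2 figures from the module overview (with `preprocessing_always_enabled=False` inferred from the note that preprocessing is only forced on for the v3 models).

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical stand-in mirroring the documented TTSModelConfig fields."""

    supports_pitch: bool
    supports_loudness: bool
    supports_temperature: bool
    default_sample_rate: int
    default_speaker: str
    pace_range: tuple[float, float]
    preprocessing_always_enabled: bool
    speakers: tuple[str, ...]


# Values transcribed from the bulbul:v2 entry in the module overview.
BULBUL_V2_CONFIG = ModelConfig(
    supports_pitch=True,
    supports_loudness=True,
    supports_temperature=False,
    default_sample_rate=22050,
    default_speaker="anushka",
    pace_range=(0.3, 3.0),
    preprocessing_always_enabled=False,
    speakers=("anushka", "abhilash", "manisha", "vidya", "arya", "karun", "hitesh"),
)


def pace_is_valid(config: ModelConfig, pace: float) -> bool:
    """Check a pace value against the model's (min, max) range, inclusive."""
    low, high = config.pace_range
    return low <= pace <= high


def is_frozen(config: ModelConfig) -> bool:
    """Return True if assigning to a field raises FrozenInstanceError."""
    try:
        config.default_sample_rate = 24000  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False
```

Freezing the dataclass means a shared per-model configuration cannot be mutated by accident at runtime.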
- pipecat.services.sarvam.tts.get_speakers_for_model(model: str) list[str][source]
Get the list of available speakers for a given model.
- Parameters:
model – The model name (e.g., “bulbul:v2” or “bulbul:v3-beta”).
- Returns:
List of speaker names available for the model.
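As a rough illustration of the lookup this function performs, the sketch below maps each documented model name to its speaker list. It is not the library's code: `speakers_for_model` is a hypothetical name, and raising `ValueError` for an unknown model is an assumption, since this reference does not say how the real function handles that case.

```python
# Speaker names transcribed from the module overview.
_V2_SPEAKERS = ("anushka", "abhilash", "manisha", "vidya", "arya", "karun", "hitesh")
_V3_SPEAKERS = (
    "aditya", "ritu", "priya", "neha", "rahul", "pooja", "rohan", "simran",
    "kavya", "amit", "dev", "ishita", "shreya", "ratan", "varun", "manan",
    "sumit", "roopa", "kabir", "aayan", "shubh", "ashutosh", "advait",
    "amelia", "sophia",
)

_SPEAKERS_BY_MODEL = {
    "bulbul:v2": _V2_SPEAKERS,
    "bulbul:v3-beta": _V3_SPEAKERS,
    "bulbul:v3": _V3_SPEAKERS,
}


def speakers_for_model(model: str) -> list[str]:
    """Return a fresh list of speaker names for the given model name."""
    try:
        return list(_SPEAKERS_BY_MODEL[model])
    except KeyError:
        # Assumption: unknown models are an error in this sketch.
        raise ValueError(f"unknown model: {model!r}") from None
```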
- pipecat.services.sarvam.tts.language_to_sarvam_language(language: Language) str | None[source]
Convert Pipecat Language enum to Sarvam AI language codes.
- Parameters:
language – The Language enum value to convert.
- Returns:
The corresponding Sarvam AI language code, or None if not supported.
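The conversion can be thought of as a dictionary lookup that returns None for unsupported languages. The sketch below uses a hypothetical `Lang` enum in place of Pipecat's `Language`, and the `xx-IN` style codes are an assumption about Sarvam's format; consult the Sarvam API reference for the authoritative list.

```python
from enum import Enum
from typing import Optional


class Lang(Enum):
    """Hypothetical stand-in for Pipecat's Language enum."""

    EN = "en"
    HI = "hi"
    TA = "ta"
    FR = "fr"


# Assumed Sarvam codes; only a few entries are shown for illustration.
_SARVAM_CODES = {
    Lang.EN: "en-IN",
    Lang.HI: "hi-IN",
    Lang.TA: "ta-IN",
}


def to_sarvam_code(language: Lang) -> Optional[str]:
    """Return the assumed Sarvam code, or None if the language is unmapped."""
    return _SARVAM_CODES.get(language)
```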
- class pipecat.services.sarvam.tts.SarvamHttpTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, enable_preprocessing: bool | None | _NotGiven = <factory>, pace: float | None | _NotGiven = <factory>, pitch: float | None | _NotGiven = <factory>, loudness: float | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)[source]
Bases: TTSSettings
Settings for SarvamHttpTTSService.
- Parameters:
enable_preprocessing – Whether to enable text preprocessing. Defaults to False. Note: Always enabled for bulbul:v3-beta (cannot be disabled).
pace – Speech pace multiplier. Defaults to 1.0.
- bulbul:v2: Range 0.3 to 3.0
- bulbul:v3-beta: Range 0.5 to 2.0
pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
temperature – Controls output randomness for bulbul:v3-beta (0.01 to 1.0). Lower values = more deterministic, higher = more random. Defaults to 0.6. Note: Only supported for bulbul:v3-beta. Ignored for v2.
- enable_preprocessing: bool | None | _NotGiven
- pace: float | None | _NotGiven
- pitch: float | None | _NotGiven
- loudness: float | None | _NotGiven
- temperature: float | None | _NotGiven
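The per-model notes on these settings can be summarized in code. This is a hedged sketch, not the service's actual request builder: `effective_request_params` and `V3_MODELS` are hypothetical names, and treating bulbul:v3 the same as bulbul:v3-beta follows the module overview.

```python
from typing import Any

# Models that ignore pitch/loudness and always preprocess, per the overview.
V3_MODELS = ("bulbul:v3-beta", "bulbul:v3")


def effective_request_params(
    model: str,
    *,
    pace: float = 1.0,
    pitch: float = 0.0,
    loudness: float = 1.0,
    temperature: float = 0.6,
    enable_preprocessing: bool = False,
) -> dict[str, Any]:
    """Keep only the settings the given model is documented to honor."""
    params: dict[str, Any] = {"pace": pace}
    if model in V3_MODELS:
        # v3 models: temperature applies, preprocessing cannot be disabled.
        params["temperature"] = temperature
        params["enable_preprocessing"] = True
    else:
        # v2: pitch and loudness apply, temperature is ignored.
        params["pitch"] = pitch
        params["loudness"] = loudness
        params["enable_preprocessing"] = enable_preprocessing
    return params
```

For example, passing `pitch=0.1` with a v3 model leaves no `pitch` key in the result, mirroring the "ignored for v3 models" note.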
- class pipecat.services.sarvam.tts.SarvamTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, enable_preprocessing: bool | None | _NotGiven = <factory>, pace: float | None | _NotGiven = <factory>, pitch: float | None | _NotGiven = <factory>, loudness: float | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>, min_buffer_size: int | None | _NotGiven = <factory>, max_chunk_length: int | None | _NotGiven = <factory>)[source]
Bases: SarvamHttpTTSSettings
Settings for SarvamTTSService.
Extends SarvamHttpTTSService.Settings with WebSocket-specific buffering parameters.
- Parameters:
min_buffer_size – Minimum characters to buffer before generating audio. Lower values reduce latency but may affect quality. Defaults to 50.
max_chunk_length – Maximum characters processed in a single chunk. Controls memory usage and processing efficiency. Defaults to 150.
- min_buffer_size: int | None | _NotGiven
- max_chunk_length: int | None | _NotGiven
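One way to picture how a minimum buffer size and a maximum chunk length interact is the toy buffer below. It is purely illustrative: `TextBuffer` is a hypothetical class, splitting is done on raw character counts, and where the real buffering happens (client or Sarvam backend) and how it splits text are not specified in this reference.

```python
class TextBuffer:
    """Toy buffer: hold text until a minimum size, then emit bounded chunks."""

    def __init__(self, min_buffer_size: int = 50, max_chunk_length: int = 150) -> None:
        if min_buffer_size < 1 or max_chunk_length < 1:
            raise ValueError("sizes must be positive")
        self.min_buffer_size = min_buffer_size
        self.max_chunk_length = max_chunk_length
        self._pending = ""

    def add(self, text: str) -> list[str]:
        """Append text; return chunks once the minimum size is reached."""
        self._pending += text
        if len(self._pending) < self.min_buffer_size:
            return []
        return self._drain()

    def flush(self) -> list[str]:
        """Emit whatever is buffered, regardless of the minimum size."""
        return self._drain()

    def _drain(self) -> list[str]:
        text, self._pending = self._pending, ""
        step = self.max_chunk_length
        return [text[i : i + step] for i in range(0, len(text), step)]
```

A smaller `min_buffer_size` releases text sooner (lower latency), while `max_chunk_length` caps how much text any single chunk carries.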
- class pipecat.services.sarvam.tts.SarvamHttpTTSService(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.sarvam.ai', sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamHttpTTSSettings | None = None, **kwargs)[source]
Bases: TTSService
Text-to-Speech service using Sarvam AI’s API.
Converts text to speech using Sarvam AI’s TTS models with support for multiple Indian languages. Provides control over voice characteristics.
Model Differences:
- bulbul:v2 (default):
Supports: pitch (-0.75 to 0.75), loudness (0.3 to 3.0), pace (0.3 to 3.0)
Default sample rate: 22050 Hz
Speakers: anushka, abhilash, manisha, vidya, arya, karun, hitesh
- bulbul:v3-beta:
Does NOT support: pitch, loudness (will be ignored)
Supports: pace (0.5 to 2.0), temperature (0.01 to 1.0)
Default sample rate: 24000 Hz
Preprocessing is always enabled
Speakers: aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia
Example:
    # Using bulbul:v2 (default)
    tts = SarvamHttpTTSService(
        api_key="your-api-key",
        aiohttp_session=session,
        settings=SarvamHttpTTSService.Settings(
            voice="anushka",
            model="bulbul:v2",
            language=Language.HI,
            pitch=0.1,
            pace=1.2,
            loudness=1.5,
        ),
    )

    # Using bulbul:v3-beta with temperature control
    tts_v3 = SarvamHttpTTSService(
        api_key="your-api-key",
        aiohttp_session=session,
        settings=SarvamHttpTTSService.Settings(
            voice="aditya",  # Use v3 speaker
            model="bulbul:v3-beta",
            language=Language.HI,
            pace=1.2,  # Range: 0.5-2.0 for v3
            temperature=0.8,
        ),
    )
- Settings
alias of SarvamHttpTTSSettings
- class InputParams(*, language: Language | None = Language.EN, pitch: Annotated[float | None, Ge(ge=-0.75), Le(le=0.75)] = 0.0, pace: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, loudness: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, enable_preprocessing: bool | None = False, temperature: Annotated[float | None, Ge(ge=0.01), Le(le=1.0)] = 0.6)[source]
Bases: BaseModel
Input parameters for Sarvam TTS configuration.
Deprecated since version 0.0.105: Use SarvamHttpTTSService.Settings directly via the settings parameter instead.
- Parameters:
language – Language for synthesis. Defaults to English (India).
pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
pace – Speech pace multiplier. Defaults to 1.0.
- bulbul:v2: Range 0.3 to 3.0
- bulbul:v3-beta: Range 0.5 to 2.0
loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
enable_preprocessing – Whether to enable text preprocessing. Defaults to False. Note: Always enabled for bulbul:v3-beta (cannot be disabled).
temperature – Controls output randomness for bulbul:v3-beta (0.01 to 1.0). Lower values = more deterministic, higher = more random. Defaults to 0.6. Note: Only supported for bulbul:v3-beta. Ignored for v2.
- pitch: float | None
- pace: float | None
- loudness: float | None
- enable_preprocessing: bool | None
- temperature: float | None
- __init__(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.sarvam.ai', sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamHttpTTSSettings | None = None, **kwargs)[source]
Initialize the Sarvam TTS service.
- Parameters:
api_key – Sarvam AI API subscription key.
aiohttp_session – Shared aiohttp session for making requests.
voice_id – Speaker voice ID. If None, uses model-appropriate default.
Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(voice=...) instead.
model – TTS model to use. Options:
- “bulbul:v2” (default): Standard model with pitch/loudness support
- “bulbul:v3-beta”: Advanced model with temperature control
Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(model=...) instead.
base_url – Sarvam AI API base URL. Defaults to “https://api.sarvam.ai”.
sample_rate – Audio sample rate in Hz (8000, 16000, 22050, 24000). If None, uses model-specific default.
params – Additional voice and preprocessing parameters. If None, uses defaults.
Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional arguments passed to parent TTSService.
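The precedence rule for settings versus deprecated arguments can be sketched as a layered merge. This is an assumption-laden illustration rather than the constructor's real logic: `NOT_GIVEN`, `_NotGivenType`, and `merge_settings` are hypothetical names modeled on the `_NotGiven` sentinel visible in the settings signatures.

```python
from typing import Any


class _NotGivenType:
    """Hypothetical sentinel type marking a field the caller did not set."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


def merge_settings(
    defaults: dict[str, Any],
    deprecated: dict[str, Any],
    settings: dict[str, Any],
) -> dict[str, Any]:
    """Apply layers in order so later layers win: defaults < deprecated < settings."""
    merged = dict(defaults)
    for layer in (deprecated, settings):
        for key, value in layer.items():
            if value is not NOT_GIVEN:
                merged[key] = value
    return merged
```

With this ordering, a voice supplied through `settings` overrides one supplied through the deprecated `voice_id` argument, while fields left as `NOT_GIVEN` in `settings` fall back to the deprecated value or the default.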
- can_generate_metrics() bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as the Sarvam service supports metrics generation.
- language_to_service_language(language: Language) str | None[source]
Convert a Language enum to Sarvam AI language format.
- Parameters:
language – The language to convert.
- Returns:
The Sarvam AI-specific language code, or None if not supported.
- async start(frame: StartFrame)[source]
Start the Sarvam TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async run_tts(text: str, context_id: str) AsyncGenerator[Frame | None, None][source]
Generate speech from text using Sarvam AI’s API.
- Parameters:
text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Audio frames containing the synthesized speech.
- class pipecat.services.sarvam.tts.SarvamTTSService(*, api_key: str, model: str | None = None, voice_id: str | None = None, url: str = 'wss://api.sarvam.ai/text-to-speech/ws', aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamTTSSettings | None = None, **kwargs)[source]
Bases: InterruptibleTTSService
WebSocket-based text-to-speech service using Sarvam AI.
Provides streaming TTS with real-time audio generation for multiple Indian languages. Uses WebSocket for low-latency streaming audio synthesis.
Model Differences:
- bulbul:v2 (default):
Supports: pitch (-0.75 to 0.75), loudness (0.3 to 3.0), pace (0.3 to 3.0)
Default sample rate: 22050 Hz
Speakers: anushka, abhilash, manisha, vidya, arya, karun, hitesh
- bulbul:v3-beta / bulbul:v3:
Does NOT support: pitch, loudness (will be ignored)
Supports: pace (0.5 to 2.0), temperature (0.01 to 1.0)
Default sample rate: 24000 Hz
Preprocessing is always enabled
Speakers: aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia
WebSocket Protocol: The service uses a WebSocket connection for real-time streaming. Messages include:
- config: Initial configuration with voice settings
- text: Text chunks for synthesis
- flush: Signal to process remaining buffered text
- ping: Keepalive signal
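The four message kinds named above could be serialized as small JSON objects. The sketch below only illustrates the ordering (config first, then text, then flush); the key names `type`, `data`, and `text`, and the contents of the config payload, are assumptions and should be checked against the Sarvam streaming API reference.

```python
import json
from typing import Any


def config_message(settings: dict[str, Any]) -> str:
    """Initial configuration message carrying voice settings (assumed shape)."""
    return json.dumps({"type": "config", "data": settings})


def text_message(text: str) -> str:
    """A chunk of text to synthesize (assumed shape)."""
    return json.dumps({"type": "text", "data": {"text": text}})


def flush_message() -> str:
    """Ask the backend to process any remaining buffered text."""
    return json.dumps({"type": "flush"})


def ping_message() -> str:
    """Keepalive signal."""
    return json.dumps({"type": "ping"})


def session_messages(settings: dict[str, Any], chunks: list[str]) -> list[str]:
    """Build the message sequence for one utterance: config, text..., flush."""
    return [
        config_message(settings),
        *(text_message(chunk) for chunk in chunks),
        flush_message(),
    ]
```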
Example:
    # Using bulbul:v2 (default)
    tts = SarvamTTSService(
        api_key="your-api-key",
        settings=SarvamTTSService.Settings(
            voice="anushka",
            model="bulbul:v2",
            language=Language.HI,
            pitch=0.1,
            pace=1.2,
            loudness=1.5,
        ),
    )

    # Using bulbul:v3-beta with temperature control
    tts_v3 = SarvamTTSService(
        api_key="your-api-key",
        settings=SarvamTTSService.Settings(
            voice="aditya",  # Use v3 speaker
            model="bulbul:v3-beta",
            language=Language.HI,
            pace=1.2,  # Range: 0.5-2.0 for v3
            temperature=0.8,
        ),
    )
See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream for API details.
- Settings
alias of SarvamTTSSettings
- class InputParams(*, pitch: Annotated[float | None, Ge(ge=-0.75), Le(le=0.75)] = 0.0, pace: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, loudness: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, enable_preprocessing: bool | None = False, min_buffer_size: int | None = 50, max_chunk_length: int | None = 150, output_audio_codec: str | None = 'linear16', output_audio_bitrate: str | None = '128k', language: Language | None = Language.EN, temperature: Annotated[float | None, Ge(ge=0.01), Le(le=1.0)] = 0.6)[source]
Bases: BaseModel
Configuration parameters for Sarvam TTS WebSocket service.
Deprecated since version 0.0.105: Use SarvamTTSService.Settings directly via the settings parameter instead.
- Parameters:
pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
pace – Speech pace multiplier. Defaults to 1.0.
- bulbul:v2: Range 0.3 to 3.0
- bulbul:v3-beta: Range 0.5 to 2.0
loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.
enable_preprocessing – Enable text preprocessing. Defaults to False. Note: Always enabled for bulbul:v3-beta.
min_buffer_size – Minimum characters to buffer before generating audio. Lower values reduce latency but may affect quality. Defaults to 50.
max_chunk_length – Maximum characters processed in a single chunk. Controls memory usage and processing efficiency. Defaults to 150.
output_audio_codec – Audio codec format. Options: linear16, mulaw, alaw, opus, flac, aac, wav, mp3. Defaults to “linear16”.
output_audio_bitrate – Audio bitrate (32k, 64k, 96k, 128k, 192k). Defaults to “128k”.
language – Target language for synthesis. Supports Indian languages.
temperature – Controls output randomness for bulbul:v3-beta (0.01 to 1.0). Lower = more deterministic, higher = more random. Defaults to 0.6. Note: Only supported for bulbul:v3-beta. Ignored for v2.
Speakers by Model:
- bulbul:v2:
Female: anushka (default), manisha, vidya, arya
Male: abhilash, karun, hitesh
- bulbul:v3-beta:
aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia
- pitch: float | None
- pace: float | None
- loudness: float | None
- enable_preprocessing: bool | None
- min_buffer_size: int | None
- max_chunk_length: int | None
- output_audio_codec: str | None
- output_audio_bitrate: str | None
- temperature: float | None
- __init__(*, api_key: str, model: str | None = None, voice_id: str | None = None, url: str = 'wss://api.sarvam.ai/text-to-speech/ws', aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamTTSSettings | None = None, **kwargs)[source]
Initialize the Sarvam TTS service with voice and transport configuration.
- Parameters:
api_key – Sarvam API key for authenticating TTS requests.
model – TTS model to use. Options:
- “bulbul:v2” (default): Standard model with pitch/loudness support
- “bulbul:v3-beta”: Advanced model with temperature control
Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(model=...) instead.
voice_id – Speaker voice ID. If None, uses model-appropriate default.
Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(voice=...) instead.
url – WebSocket URL for the TTS backend (default production URL).
aggregate_sentences – Deprecated since version 0.0.104: Use text_aggregation_mode instead.
text_aggregation_mode – How to aggregate text before synthesis.
sample_rate – Output audio sample rate in Hz (8000, 16000, 22050, 24000). If None, uses model-specific default.
params – Optional input parameters to override defaults.
Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Arguments forwarded to InterruptibleTTSService.
See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream
- can_generate_metrics() bool[source]
Check if this service can generate processing metrics.
- Returns:
True, as the Sarvam service supports metrics generation.
- language_to_service_language(language: Language) str | None[source]
Convert a Language enum to Sarvam AI language format.
- Parameters:
language – The language to convert.
- Returns:
The Sarvam AI-specific language code, or None if not supported.
- async start(frame: StartFrame)[source]
Start the Sarvam TTS service.
- Parameters:
frame – The start frame containing initialization parameters.
- async stop(frame: EndFrame)[source]
Stop the Sarvam TTS service.
- Parameters:
frame – The end frame.
- async cancel(frame: CancelFrame)[source]
Cancel the Sarvam TTS service.
- Parameters:
frame – The cancel frame.
- async flush_audio(context_id: str | None = None)[source]
Flush any pending audio synthesis by sending a flush command.
- async run_tts(text: str, context_id: str) AsyncGenerator[Frame | None, None][source]
Generate speech audio frames from input text using Sarvam TTS.
Sends text over WebSocket for synthesis and yields corresponding audio or status frames.
- Parameters:
text – The text input to synthesize.
context_id – The context ID for tracking audio frames.
- Yields:
Frame – Frame objects including TTSStartedFrame, one or more TTSAudioRawFrame objects carrying the given context_id, or TTSStoppedFrame.