tts

Sarvam AI text-to-speech service implementation.

This module provides TTS services using Sarvam AI’s API with support for multiple Indian languages and three model variants:

Model Variants:

  • bulbul:v2 (default): Standard TTS model
    • Supports: pitch, loudness, pace (0.3-3.0)

    • Default sample rate: 22050 Hz

    • Speakers: anushka (default), abhilash, manisha, vidya, arya, karun, hitesh

  • bulbul:v3-beta: Advanced TTS model with temperature control
    • Does NOT support: pitch, loudness

    • Supports: pace (0.5-2.0), temperature (0.01-1.0)

    • Default sample rate: 24000 Hz

    • Preprocessing is always enabled

    • Speakers: aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

  • bulbul:v3: Advanced TTS model with temperature control
    • Does NOT support: pitch, loudness

    • Supports: pace (0.5-2.0), temperature (0.01-1.0)

    • Default sample rate: 24000 Hz

    • Preprocessing is always enabled

    • Speakers: aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream for full API details.

class pipecat.services.sarvam.tts.SarvamTTSModel(*values)[source]

Bases: StrEnum

Available Sarvam TTS models.

Parameters:
  • BULBUL_V2 – Standard TTS model with pitch/loudness control. Supports pitch, loudness, and pace (0.3-3.0). Default sample rate: 22050 Hz.

  • BULBUL_V3_BETA – Advanced model with temperature control. Does NOT support pitch or loudness. Pace range: 0.5-2.0. Supports the temperature parameter. Default sample rate: 24000 Hz. Preprocessing is always enabled.

  • BULBUL_V3 – Advanced model with temperature control. Same capabilities and defaults as BULBUL_V3_BETA.

BULBUL_V2 = 'bulbul:v2'
BULBUL_V3_BETA = 'bulbul:v3-beta'
BULBUL_V3 = 'bulbul:v3'
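
Because these enums derive from StrEnum, each member is also a plain string, so a member and its raw value can be used interchangeably wherever a model name is expected. A minimal sketch using a local stand-in enum; ModelName and is_v3_family are hypothetical names that only mirror the values listed above and are not part of the module:

```python
try:
    from enum import StrEnum  # Python 3.11+
except ImportError:  # older interpreters: minimal equivalent
    from enum import Enum

    class StrEnum(str, Enum):
        def __str__(self) -> str:
            return str(self.value)


class ModelName(StrEnum):
    """Local stand-in mirroring the documented SarvamTTSModel values."""

    BULBUL_V2 = "bulbul:v2"
    BULBUL_V3_BETA = "bulbul:v3-beta"
    BULBUL_V3 = "bulbul:v3"


def is_v3_family(model: str) -> bool:
    """Return True for either v3 variant, given a member or a raw string."""
    # ModelName(...) accepts both a member and its string value;
    # an unknown model name raises ValueError.
    return ModelName(model) in (ModelName.BULBUL_V3_BETA, ModelName.BULBUL_V3)
```

Passing either ModelName.BULBUL_V3 or the string "bulbul:v3" to is_v3_family gives the same result.
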
class pipecat.services.sarvam.tts.SarvamTTSSpeakerV2(*values)[source]

Bases: StrEnum

Available speakers for bulbul:v2 model.

Female voices: anushka, manisha, vidya, arya

Male voices: abhilash, karun, hitesh

ANUSHKA = 'anushka'
ABHILASH = 'abhilash'
MANISHA = 'manisha'
VIDYA = 'vidya'
ARYA = 'arya'
KARUN = 'karun'
HITESH = 'hitesh'
class pipecat.services.sarvam.tts.SarvamTTSSpeakerV3(*values)[source]

Bases: StrEnum

Available speakers for the bulbul:v3-beta and bulbul:v3 models.

Includes a wider variety of voices with different characteristics.

ADITYA = 'aditya'
RITU = 'ritu'
PRIYA = 'priya'
NEHA = 'neha'
RAHUL = 'rahul'
POOJA = 'pooja'
ROHAN = 'rohan'
SIMRAN = 'simran'
KAVYA = 'kavya'
AMIT = 'amit'
DEV = 'dev'
ISHITA = 'ishita'
SHREYA = 'shreya'
RATAN = 'ratan'
VARUN = 'varun'
MANAN = 'manan'
SUMIT = 'sumit'
ROOPA = 'roopa'
KABIR = 'kabir'
AAYAN = 'aayan'
SHUBH = 'shubh'
ASHUTOSH = 'ashutosh'
ADVAIT = 'advait'
AMELIA = 'amelia'
SOPHIA = 'sophia'
class pipecat.services.sarvam.tts.TTSModelConfig(supports_pitch: bool, supports_loudness: bool, supports_temperature: bool, default_sample_rate: int, default_speaker: str, pace_range: tuple[float, float], preprocessing_always_enabled: bool, speakers: tuple[str, ...])[source]

Bases: object

Immutable configuration for a Sarvam TTS model.

Parameters:
  • supports_pitch – Whether the model accepts pitch parameter.

  • supports_loudness – Whether the model accepts loudness parameter.

  • supports_temperature – Whether the model accepts temperature parameter.

  • default_sample_rate – Default audio sample rate in Hz.

  • default_speaker – Default speaker voice ID.

  • pace_range – Valid range for pace parameter (min, max).

  • preprocessing_always_enabled – Whether preprocessing is always enabled.

  • speakers – Tuple of available speaker names for this model.

supports_pitch: bool
supports_loudness: bool
supports_temperature: bool
default_sample_rate: int
default_speaker: str
pace_range: tuple[float, float]
preprocessing_always_enabled: bool
speakers: tuple[str, ...]
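
A per-model configuration of this shape lets calling code look up limits instead of hard-coding them. The following is a sketch under stated assumptions: ModelConfig is a local stand-in with a subset of the fields above, CONFIGS is filled from the model descriptions in this module's overview, and clamp_pace is a hypothetical helper. Clamping is an illustrative choice; the service itself may validate out-of-range values differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Local stand-in with a subset of the TTSModelConfig fields."""

    supports_pitch: bool
    supports_temperature: bool
    default_sample_rate: int
    pace_range: tuple[float, float]


# Values copied from the model descriptions in the module overview.
CONFIGS = {
    "bulbul:v2": ModelConfig(True, False, 22050, (0.3, 3.0)),
    "bulbul:v3-beta": ModelConfig(False, True, 24000, (0.5, 2.0)),
    "bulbul:v3": ModelConfig(False, True, 24000, (0.5, 2.0)),
}


def clamp_pace(model: str, pace: float) -> float:
    """Clamp pace into the documented range for the given model."""
    low, high = CONFIGS[model].pace_range
    return max(low, min(high, pace))
```

With these values, a pace of 2.5 is left unchanged for bulbul:v2 but reduced to 2.0 for either v3 variant.
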
pipecat.services.sarvam.tts.get_speakers_for_model(model: str) → list[str][source]

Get the list of available speakers for a given model.

Parameters:

model – The model name (e.g., “bulbul:v2” or “bulbul:v3-beta”).

Returns:

List of speaker names available for the model.
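
The returned list is useful for checking a requested voice before building settings. So that the block runs without pipecat installed, this sketch hard-codes the bulbul:v2 list from this page instead of calling get_speakers_for_model; pick_voice is a hypothetical helper, not part of the module.

```python
from typing import Optional

# Stand-in for get_speakers_for_model("bulbul:v2"), copied from this page.
V2_SPEAKERS = ["anushka", "abhilash", "manisha", "vidya", "arya", "karun", "hitesh"]
V2_DEFAULT = "anushka"


def pick_voice(requested: Optional[str], available: list[str], default: str) -> str:
    """Return the requested voice if the model offers it, else the default."""
    if requested is not None and requested in available:
        return requested
    return default
```

For example, requesting the v3 speaker "aditya" against the v2 list falls back to "anushka".
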

pipecat.services.sarvam.tts.language_to_sarvam_language(language: Language) → str | None[source]

Convert Pipecat Language enum to Sarvam AI language codes.

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding Sarvam AI language code, or None if not supported.

class pipecat.services.sarvam.tts.SarvamHttpTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, enable_preprocessing: bool | None | _NotGiven = <factory>, pace: float | None | _NotGiven = <factory>, pitch: float | None | _NotGiven = <factory>, loudness: float | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)[source]

Bases: TTSSettings

Settings for SarvamHttpTTSService.

Parameters:
  • enable_preprocessing – Whether to enable text preprocessing. Defaults to False. Note: Always enabled for v3 models (cannot be disabled).

  • pace – Speech pace multiplier. Defaults to 1.0. Range: 0.3 to 3.0 for bulbul:v2; 0.5 to 2.0 for v3 models.

  • pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • temperature – Controls output randomness (0.01 to 1.0). Lower values are more deterministic; higher values are more random. Defaults to 0.6. Note: Only supported for v3 models. Ignored for bulbul:v2.

enable_preprocessing: bool | None | _NotGiven
pace: float | None | _NotGiven
pitch: float | None | _NotGiven
loudness: float | None | _NotGiven
temperature: float | None | _NotGiven
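
The notes above mean that the same settings produce different effective parameters depending on the model. A minimal sketch of that filtering, assuming a hypothetical helper effective_voice_params with illustrative key names; it is not the service's actual request-building code:

```python
from typing import Any, Optional


def effective_voice_params(
    model: str,
    pace: float = 1.0,
    pitch: Optional[float] = 0.0,
    loudness: Optional[float] = 1.0,
    temperature: Optional[float] = 0.6,
) -> dict[str, Any]:
    """Keep only the parameters the documented model variant accepts."""
    params: dict[str, Any] = {"pace": pace}
    if model == "bulbul:v2":
        # v2 accepts pitch and loudness; temperature is ignored.
        params["pitch"] = pitch
        params["loudness"] = loudness
    else:
        # v3 variants accept temperature; pitch and loudness are ignored.
        params["temperature"] = temperature
    return params
```

Calling it with pitch=0.1 for "bulbul:v3-beta" drops the pitch value, matching the "Ignored for v3 models" note.
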
class pipecat.services.sarvam.tts.SarvamTTSSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, voice: str | None | _NotGiven = <factory>, language: Language | str | None | _NotGiven = <factory>, enable_preprocessing: bool | None | _NotGiven = <factory>, pace: float | None | _NotGiven = <factory>, pitch: float | None | _NotGiven = <factory>, loudness: float | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>, min_buffer_size: int | None | _NotGiven = <factory>, max_chunk_length: int | None | _NotGiven = <factory>)[source]

Bases: SarvamHttpTTSSettings

Settings for SarvamTTSService.

Extends SarvamHttpTTSService.Settings with WebSocket-specific buffering parameters.

Parameters:
  • min_buffer_size – Minimum characters to buffer before generating audio. Lower values reduce latency but may affect quality. Defaults to 50.

  • max_chunk_length – Maximum characters processed in a single chunk. Controls memory usage and processing efficiency. Defaults to 150.

min_buffer_size: int | None | _NotGiven
max_chunk_length: int | None | _NotGiven
class pipecat.services.sarvam.tts.SarvamHttpTTSService(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.sarvam.ai', sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamHttpTTSSettings | None = None, **kwargs)[source]

Bases: TTSService

Text-to-Speech service using Sarvam AI’s API.

Converts text to speech using Sarvam AI’s TTS models with support for multiple Indian languages. Provides control over voice characteristics.

Model Differences:

  • bulbul:v2 (default):
    • Supports: pitch (-0.75 to 0.75), loudness (0.3 to 3.0), pace (0.3 to 3.0)

    • Default sample rate: 22050 Hz

    • Speakers: anushka, abhilash, manisha, vidya, arya, karun, hitesh

  • bulbul:v3-beta / bulbul:v3:
    • Does NOT support: pitch, loudness (will be ignored)

    • Supports: pace (0.5 to 2.0), temperature (0.01 to 1.0)

    • Default sample rate: 24000 Hz

    • Preprocessing is always enabled

    • Speakers: aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

Example:

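from pipecat.services.sarvam.tts import SarvamHttpTTSService
from pipecat.transcriptions.language import Language

# `session` is an existing aiohttp.ClientSession.

# Using bulbul:v2 (default)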
tts = SarvamHttpTTSService(
    api_key="your-api-key",
    aiohttp_session=session,
    settings=SarvamHttpTTSService.Settings(
        voice="anushka",
        model="bulbul:v2",
        language=Language.HI,
        pitch=0.1,
        pace=1.2,
        loudness=1.5,
    ),
)

# Using bulbul:v3-beta with temperature control
tts_v3 = SarvamHttpTTSService(
    api_key="your-api-key",
    aiohttp_session=session,
    settings=SarvamHttpTTSService.Settings(
        voice="aditya",  # Use v3 speaker
        model="bulbul:v3-beta",
        language=Language.HI,
        pace=1.2,  # Range: 0.5-2.0 for v3
        temperature=0.8,
    ),
)

Settings

alias of SarvamHttpTTSSettings

class InputParams(*, language: Language | None = Language.EN, pitch: Annotated[float | None, Ge(ge=-0.75), Le(le=0.75)] = 0.0, pace: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, loudness: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, enable_preprocessing: bool | None = False, temperature: Annotated[float | None, Ge(ge=0.01), Le(le=1.0)] = 0.6)[source]

Bases: BaseModel

Input parameters for Sarvam TTS configuration.

Deprecated since version 0.0.105: Use SarvamHttpTTSService.Settings directly via the settings parameter instead.

Parameters:
  • language – Language for synthesis. Defaults to English (India).

  • pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • pace – Speech pace multiplier. Defaults to 1.0. Range: 0.3 to 3.0 for bulbul:v2; 0.5 to 2.0 for v3 models.

  • loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • enable_preprocessing – Whether to enable text preprocessing. Defaults to False. Note: Always enabled for v3 models (cannot be disabled).

  • temperature – Controls output randomness (0.01 to 1.0). Lower values are more deterministic; higher values are more random. Defaults to 0.6. Note: Only supported for v3 models. Ignored for bulbul:v2.

language: Language | None
pitch: float | None
pace: float | None
loudness: float | None
enable_preprocessing: bool | None
temperature: float | None
__init__(*, api_key: str, aiohttp_session: ClientSession, voice_id: str | None = None, model: str | None = None, base_url: str = 'https://api.sarvam.ai', sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamHttpTTSSettings | None = None, **kwargs)[source]

Initialize the Sarvam TTS service.

Parameters:
  • api_key – Sarvam AI API subscription key.

  • aiohttp_session – Shared aiohttp session for making requests.

  • voice_id

    Speaker voice ID. If None, uses model-appropriate default.

    Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(voice=...) instead.

  • model

    TTS model to use. Options: “bulbul:v2” (default), the standard model with pitch/loudness support; “bulbul:v3-beta”, the advanced model with temperature control.

    Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(model=...) instead.

  • base_url – Sarvam AI API base URL. Defaults to “https://api.sarvam.ai”.

  • sample_rate – Audio sample rate in Hz (8000, 16000, 22050, 24000). If None, uses model-specific default.

  • params

    Additional voice and preprocessing parameters. If None, uses defaults.

    Deprecated since version 0.0.105: Use settings=SarvamHttpTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional arguments passed to parent TTSService.
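
The precedence rule for settings versus the deprecated arguments can be pictured as a three-step fallback. This is a sketch only: NOT_GIVEN is a hypothetical sentinel standing in for _NotGiven, resolve is a hypothetical helper, and treating an explicit None in settings as a real value is an assumption rather than documented behavior.

```python
from typing import Any

NOT_GIVEN = object()  # hypothetical sentinel standing in for _NotGiven


def resolve(deprecated_value: Any, settings_value: Any = NOT_GIVEN, default: Any = None) -> Any:
    """Pick the settings value if given, else the deprecated argument, else the default."""
    if settings_value is not NOT_GIVEN:
        return settings_value  # settings take precedence
    if deprecated_value is not None:
        return deprecated_value  # fall back to e.g. voice_id / model
    return default  # model-appropriate default
```

For instance, when voice_id="anushka" is passed together with settings that set voice="vidya", the resolved voice is "vidya".
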

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as Sarvam service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Sarvam AI language format.

Parameters:

language – The language to convert.

Returns:

The Sarvam AI-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the Sarvam TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech from text using Sarvam AI’s API.

Parameters:
  • text – The text to synthesize into speech.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Audio frames containing the synthesized speech.

class pipecat.services.sarvam.tts.SarvamTTSService(*, api_key: str, model: str | None = None, voice_id: str | None = None, url: str = 'wss://api.sarvam.ai/text-to-speech/ws', aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamTTSSettings | None = None, **kwargs)[source]

Bases: InterruptibleTTSService

WebSocket-based text-to-speech service using Sarvam AI.

Provides streaming TTS with real-time audio generation for multiple Indian languages. Uses WebSocket for low-latency streaming audio synthesis.

Model Differences:

  • bulbul:v2 (default):
    • Supports: pitch (-0.75 to 0.75), loudness (0.3 to 3.0), pace (0.3 to 3.0)

    • Default sample rate: 22050 Hz

    • Speakers: anushka, abhilash, manisha, vidya, arya, karun, hitesh

  • bulbul:v3-beta / bulbul:v3:
    • Does NOT support: pitch, loudness (will be ignored)

    • Supports: pace (0.5 to 2.0), temperature (0.01 to 1.0)

    • Default sample rate: 24000 Hz

    • Preprocessing is always enabled

    • Speakers: aditya, ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

WebSocket Protocol: The service uses a WebSocket connection for real-time streaming. Messages include:

  • config: Initial configuration with voice settings

  • text: Text chunks for synthesis

  • flush: Signal to process remaining buffered text

  • ping: Keepalive signal
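
To make the four message types concrete, here is a rough sketch of how such messages could be serialized. Only the type names come from the description above; the {"type", "data"} envelope, the field names inside data, and the make_message helper are assumptions for illustration, so consult the Sarvam API reference for the actual wire format.

```python
import json
from typing import Any, Optional

MESSAGE_TYPES = {"config", "text", "flush", "ping"}


def make_message(kind: str, data: Optional[dict[str, Any]] = None) -> str:
    """Serialize one protocol message (envelope shape is an assumption)."""
    if kind not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {kind}")
    message: dict[str, Any] = {"type": kind}
    if data is not None:
        message["data"] = data
    return json.dumps(message)


# A plausible session order: configure once, stream text, then flush.
session_messages = [
    make_message("config", {"speaker": "anushka", "pace": 1.2}),  # assumed fields
    make_message("text", {"text": "Namaste"}),
    make_message("flush"),
]
```

A ping message built this way carries no data and would be sent periodically to keep the connection alive.
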

Example:

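from pipecat.services.sarvam.tts import SarvamTTSService
from pipecat.transcriptions.language import Language

# Using bulbul:v2 (default)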
tts = SarvamTTSService(
    api_key="your-api-key",
    settings=SarvamTTSService.Settings(
        voice="anushka",
        model="bulbul:v2",
        language=Language.HI,
        pitch=0.1,
        pace=1.2,
        loudness=1.5,
    ),
)

# Using bulbul:v3-beta with temperature control
tts_v3 = SarvamTTSService(
    api_key="your-api-key",
    settings=SarvamTTSService.Settings(
        voice="aditya",  # Use v3 speaker
        model="bulbul:v3-beta",
        language=Language.HI,
        pace=1.2,  # Range: 0.5-2.0 for v3
        temperature=0.8,
    ),
)

See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream for API details.

Settings

alias of SarvamTTSSettings

class InputParams(*, pitch: Annotated[float | None, Ge(ge=-0.75), Le(le=0.75)] = 0.0, pace: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, loudness: Annotated[float | None, Ge(ge=0.3), Le(le=3.0)] = 1.0, enable_preprocessing: bool | None = False, min_buffer_size: int | None = 50, max_chunk_length: int | None = 150, output_audio_codec: str | None = 'linear16', output_audio_bitrate: str | None = '128k', language: Language | None = Language.EN, temperature: Annotated[float | None, Ge(ge=0.01), Le(le=1.0)] = 0.6)[source]

Bases: BaseModel

Configuration parameters for Sarvam TTS WebSocket service.

Deprecated since version 0.0.105: Use SarvamTTSService.Settings directly via the settings parameter instead.

Parameters:
  • pitch – Voice pitch adjustment (-0.75 to 0.75). Defaults to 0.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • pace – Speech pace multiplier. Defaults to 1.0. Range: 0.3 to 3.0 for bulbul:v2; 0.5 to 2.0 for v3 models.

  • loudness – Volume multiplier (0.3 to 3.0). Defaults to 1.0. Note: Only supported for bulbul:v2. Ignored for v3 models.

  • enable_preprocessing – Whether to enable text preprocessing. Defaults to False. Note: Always enabled for v3 models (cannot be disabled).

  • min_buffer_size – Minimum characters to buffer before generating audio. Lower values reduce latency but may affect quality. Defaults to 50.

  • max_chunk_length – Maximum characters processed in a single chunk. Controls memory usage and processing efficiency. Defaults to 150.

  • output_audio_codec – Audio codec format. Options: linear16, mulaw, alaw, opus, flac, aac, wav, mp3. Defaults to “linear16”.

  • output_audio_bitrate – Audio bitrate (32k, 64k, 96k, 128k, 192k). Defaults to “128k”.

  • language – Target language for synthesis. Supports Indian languages.

  • temperature – Controls output randomness (0.01 to 1.0). Lower values are more deterministic; higher values are more random. Defaults to 0.6. Note: Only supported for v3 models. Ignored for bulbul:v2.

Speakers by Model:

bulbul:v2:
  • Female: anushka (default), manisha, vidya, arya

  • Male: abhilash, karun, hitesh

bulbul:v3-beta:
  • aditya (default), ritu, priya, neha, rahul, pooja, rohan, simran, kavya, amit, dev, ishita, shreya, ratan, varun, manan, sumit, roopa, kabir, aayan, shubh, ashutosh, advait, amelia, sophia

pitch: float | None
pace: float | None
loudness: float | None
enable_preprocessing: bool | None
min_buffer_size: int | None
max_chunk_length: int | None
output_audio_codec: str | None
output_audio_bitrate: str | None
language: Language | None
temperature: float | None
__init__(*, api_key: str, model: str | None = None, voice_id: str | None = None, url: str = 'wss://api.sarvam.ai/text-to-speech/ws', aggregate_sentences: bool | None = None, text_aggregation_mode: TextAggregationMode | None = None, sample_rate: int | None = None, params: InputParams | None = None, settings: SarvamTTSSettings | None = None, **kwargs)[source]

Initialize the Sarvam TTS service with voice and transport configuration.

Parameters:
  • api_key – Sarvam API key for authenticating TTS requests.

  • model

    TTS model to use. Options: “bulbul:v2” (default), the standard model with pitch/loudness support; “bulbul:v3-beta”, the advanced model with temperature control.

    Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(model=...) instead.

  • voice_id

    Speaker voice ID. If None, uses model-appropriate default.

    Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(voice=...) instead.

  • url – WebSocket URL for the TTS backend (default production URL).

  • aggregate_sentences

    Deprecated since version 0.0.104: Use text_aggregation_mode instead.

  • text_aggregation_mode – How to aggregate text before synthesis.

  • sample_rate – Output audio sample rate in Hz (8000, 16000, 22050, 24000). If None, uses model-specific default.

  • params

    Optional input parameters to override defaults.

    Deprecated since version 0.0.105: Use settings=SarvamTTSService.Settings(...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Arguments forwarded to InterruptibleTTSService.

See https://docs.sarvam.ai/api-reference-docs/text-to-speech/stream

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns:

True, as Sarvam service supports metrics generation.

language_to_service_language(language: Language) → str | None[source]

Convert a Language enum to Sarvam AI language format.

Parameters:

language – The language to convert.

Returns:

The Sarvam AI-specific language code, or None if not supported.

async start(frame: StartFrame)[source]

Start the Sarvam TTS service.

Parameters:

frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)[source]

Stop the Sarvam TTS service.

Parameters:

frame – The end frame.

async cancel(frame: CancelFrame)[source]

Cancel the Sarvam TTS service.

Parameters:

frame – The cancel frame.

async flush_audio(context_id: str | None = None)[source]

Flush any pending audio synthesis by sending flush command.

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None][source]

Generate speech audio frames from input text using Sarvam TTS.

Sends text over WebSocket for synthesis and yields corresponding audio or status frames.

Parameters:
  • text – The text input to synthesize.

  • context_id – The context ID for tracking audio frames.

Yields:

Frame – Frame objects, including TTSStartedFrame, TTSAudioRawFrame, and TTSStoppedFrame.