tts

NVIDIA Nemotron Speech text-to-speech service implementation.

This module integrates with NVIDIA Nemotron Speech’s TTS services through their gRPC API for high-quality speech synthesis.

Refer to the NVIDIA TTS NIM documentation for usage, customization, and local deployment steps: https://docs.nvidia.com/nim/speech/latest/tts/

Bases: TTSSettings

Settings for NvidiaTTSService.

Parameters: quality – Audio quality setting (0-100).

quality: int | _NotGiven

class pipecat.services.nvidia.tts.NvidiaTTSService(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', voice_id: str | None = None, sample_rate: int | None = None, model_function_map: Mapping[str, str] = {'function_id': '877104f7-e885-42b9-8de8-f6e4c6303969', 'model_name': 'magpie-tts-multilingual'}, params: InputParams | None = None, settings: NvidiaTTSSettings | None = None, use_ssl: bool = True, custom_dictionary: dict | None = None, encoding: EnumTypeWrapper | None = 1, **kwargs)

Bases: TTSService

NVIDIA Nemotron Speech text-to-speech service.

Provides high-quality text-to-speech synthesis using NVIDIA Nemotron Speech’s cloud-based TTS models. Supports multiple voices, languages, and configurable quality settings.

Settings: alias of NvidiaTTSSettings

class InputParams(*, language: Language | None = Language.EN_US, quality: int | None = 20)

Bases: BaseModel

Input parameters for Nemotron Speech TTS configuration.

Deprecated since version 0.0.105: Use NvidiaTTSService.Settings directly via the settings parameter instead.

Parameters:

language – Language code for synthesis. Defaults to US English.
quality – Audio quality setting (0-100). Defaults to 20.

language: Language | None

quality: int | None
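Because InputParams is deprecated in favor of settings, and the constructor documentation states that settings values take precedence over deprecated parameters, the resolution order can be pictured with the minimal sketch below. It does not use pipecat at all: the `_NotGiven` class here is a local stand-in for a "not given" sentinel, and `resolve_setting` is a hypothetical helper, so treat the details as assumptions rather than the library's implementation.

```python
from typing import Any


class _NotGiven:
    """Local stand-in for a "not given" sentinel (not pipecat's own class)."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


def resolve_setting(settings_value: Any, deprecated_value: Any, default: Any) -> Any:
    """Pick the effective value: settings first, then deprecated params, then default."""
    if not isinstance(settings_value, _NotGiven):
        return settings_value
    if deprecated_value is not None:
        return deprecated_value
    return default


# quality supplied both ways: the settings value wins.
both = resolve_setting(settings_value=80, deprecated_value=20, default=20)
# quality supplied only through the deprecated params object.
legacy_only = resolve_setting(settings_value=NOT_GIVEN, deprecated_value=40, default=20)
# quality supplied nowhere: fall back to the default.
neither = resolve_setting(settings_value=NOT_GIVEN, deprecated_value=None, default=20)
print(both, legacy_only, neither)
```

A sentinel is used instead of None so that "field not set" stays distinguishable from an explicit None.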

__init__(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', voice_id: str | None = None, sample_rate: int | None = None, model_function_map: Mapping[str, str] = {'function_id': '877104f7-e885-42b9-8de8-f6e4c6303969', 'model_name': 'magpie-tts-multilingual'}, params: InputParams | None = None, settings: NvidiaTTSSettings | None = None, use_ssl: bool = True, custom_dictionary: dict | None = None, encoding: EnumTypeWrapper | None = 1, **kwargs)

Initialize the NVIDIA Nemotron Speech TTS service.

Parameters:

api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
server – gRPC server endpoint. Defaults to NVIDIA’s cloud endpoint. For local deployments, pass the local address (e.g. localhost:50051).
voice_id – Voice model identifier. Defaults to the multilingual Aria voice. Deprecated since version 0.0.105: Use settings=NvidiaTTSService.Settings(voice=...) instead.
sample_rate – Audio sample rate. If None, uses service default.
model_function_map – Dictionary containing function_id and model_name for the TTS model.
params – Additional configuration parameters for TTS synthesis. Deprecated since version 0.0.105: Use settings=NvidiaTTSService.Settings(...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
use_ssl – Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.
custom_dictionary – Custom pronunciation dictionary mapping words (graphemes) to IPA phonetic representations (phonemes), e.g. {"NVIDIA": "ɛn.vɪ.diː.ʌ"}. See https://docs.nvidia.com/nim/speech/latest/tts/phoneme-support.html for the list of supported IPA phonemes.
encoding – Output audio encoding format. Defaults to AudioEncoding.LINEAR_PCM.
**kwargs – Additional arguments passed to parent TTSService.
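The cloud and local configurations described in the parameter list differ in three arguments: api_key, server, and use_ssl. The sketch below collects those choices in a hypothetical helper, `nvidia_tts_kwargs`, which only builds a keyword-argument dictionary and does not import pipecat. The API key string is a placeholder, and the helper name and its validation are assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Default cloud endpoint, taken from the constructor signature.
CLOUD_SERVER = "grpc.nvcf.nvidia.com:443"


def nvidia_tts_kwargs(
    *,
    api_key: Optional[str] = None,
    local_server: Optional[str] = None,
    custom_dictionary: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build constructor keyword arguments (hypothetical helper).

    Pass local_server (e.g. "localhost:50051") for a local deployment;
    otherwise the cloud endpoint is used and api_key is required.
    """
    if local_server is not None:
        # Local deployment: no API key needed, SSL turned off.
        kwargs: Dict[str, Any] = {"server": local_server, "use_ssl": False}
    else:
        if not api_key:
            raise ValueError("api_key is required for the cloud endpoint")
        kwargs = {"api_key": api_key, "server": CLOUD_SERVER, "use_ssl": True}
    if custom_dictionary:
        # Grapheme -> IPA phoneme mapping, copied so the caller's dict is untouched.
        kwargs["custom_dictionary"] = dict(custom_dictionary)
    return kwargs


cloud = nvidia_tts_kwargs(api_key="nvapi-placeholder")
local = nvidia_tts_kwargs(
    local_server="localhost:50051",
    custom_dictionary={"NVIDIA": "ɛn.vɪ.diː.ʌ"},
)
# With pipecat installed, either dict could be unpacked into the constructor:
#   from pipecat.services.nvidia.tts import NvidiaTTSService
#   tts = NvidiaTTSService(**local)
```

Keeping the endpoint choice in one place avoids mismatches such as pointing at a local server with SSL still enabled.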

can_generate_metrics() → bool

Check if this service can generate metrics.

Returns: True, as this service supports metric generation.

async set_model(model: str)

Set the TTS model.

Deprecated since version 0.0.104: Model cannot be changed after initialization for NVIDIA Nemotron Speech TTS. Set model and function id in the constructor instead.

Example:

NvidiaTTSService(
    api_key=...,
    model_function_map={"function_id": "<UUID>", "model_name": "<model_name>"},
)

Parameters: model – The model name to set.

async start(frame: StartFrame)

Start the NVIDIA Nemotron Speech TTS service.

Parameters: frame – The start frame containing initialization parameters.

async stop(frame: EndFrame)

Stop the NVIDIA Nemotron Speech TTS service.

Parameters: frame – The end frame.

async cancel(frame: CancelFrame)

Cancel the NVIDIA Nemotron Speech TTS service.

Parameters: frame – The cancel frame.

async flush_audio(context_id: str | None = None)

Flush any pending audio and finalize the current context.

Parameters: context_id – The specific context to flush. If None, falls back to the currently active context.

async on_audio_context_interrupted(context_id: str)

Cancel the active gRPC synthesis stream when the bot is interrupted.

Parameters: context_id – The ID of the audio context that was interrupted.
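Cancelling an in-flight stream on interruption follows a common asyncio pattern: cancel the task that drives the stream, then await it so cleanup finishes. The sketch below shows that pattern with a toy coroutine in place of the gRPC stream. `fake_stream` and `interrupt_demo` are hypothetical names, and nothing here reflects how the service itself is implemented.

```python
import asyncio
import contextlib
from typing import List, Tuple


async def fake_stream(sent: List[int]) -> None:
    # Stands in for a long-running synthesis stream.
    for i in range(100):
        sent.append(i)
        await asyncio.sleep(0.01)


async def interrupt_demo() -> Tuple[bool, int]:
    sent: List[int] = []
    task = asyncio.create_task(fake_stream(sent))
    await asyncio.sleep(0.03)  # let a few chunks through before interrupting
    task.cancel()
    # Await the task so its cancellation completes before moving on.
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return task.cancelled(), len(sent)


cancelled, count = asyncio.run(interrupt_demo())
print(cancelled, count)
```

Awaiting the cancelled task matters: without it, the stream may still be winding down when the next turn begins.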

async run_tts(text: str, context_id: str) → AsyncGenerator[Frame | None, None]

Generate speech from text using NVIDIA Nemotron Speech TTS.

On the first call for a turn, starts a persistent synthesize_online gRPC stream. Subsequent calls within the same turn feed text into the existing stream, enabling Magpie’s cross-sentence stitching.

Text is split into chunks respecting Magpie’s per-request limits. Each chunk becomes a separate request in the gRPC stream, stitched seamlessly by Magpie.
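Example:

The chunking step can be illustrated with the minimal sketch below, which prefers sentence boundaries and falls back to word boundaries. The 400-character limit and the `split_text` helper are assumptions made for illustration; the actual per-request limit and splitting rules used by the service are not stated here.

```python
import re
from typing import List

# Hypothetical limit; the real per-request limit is not given in this document.
MAX_CHARS = 400


def split_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence over the limit is wrapped on word boundaries.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars  # no space found: cut mid-word
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("Hello world. This is a test. Another sentence here!", max_chars=30)
print(chunks)
# ['Hello world. This is a test.', 'Another sentence here!']
```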

Parameters:

text – The text to synthesize into speech.
context_id – The context ID for tracking audio frames.

Yields:

None on success; audio is delivered asynchronously via the response consumer. ErrorFrame on failure.
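The split between a generator that yields None and a separate consumer that delivers audio can be pictured with the toy model below. It uses an asyncio queue in place of the gRPC request stream and encodes text to bytes in place of synthesis. `StreamSketch` and all of its members are hypothetical and are not the pipecat implementation.

```python
import asyncio
from typing import AsyncGenerator, List, Optional

_END = object()  # sentinel that closes the request stream


class StreamSketch:
    """Toy stand-in for a persistent synthesis stream (not the pipecat class)."""

    def __init__(self) -> None:
        self._requests: Optional[asyncio.Queue] = None
        self._consumer: Optional[asyncio.Task] = None
        self.audio: List[bytes] = []

    async def _consume(self, requests: asyncio.Queue) -> None:
        # Stands in for the response consumer: turns each request into "audio".
        while True:
            text = await requests.get()
            if text is _END:
                break
            self.audio.append(text.encode("utf-8"))

    async def run_tts(self, text: str) -> AsyncGenerator[None, None]:
        if self._requests is None:
            # First call of the turn: open the stream and start the consumer.
            self._requests = asyncio.Queue()
            self._consumer = asyncio.create_task(self._consume(self._requests))
        await self._requests.put(text)
        yield None  # audio arrives via the consumer, not via this generator

    async def flush(self) -> None:
        # End of turn: close the stream and wait for the remaining audio.
        if self._requests is not None and self._consumer is not None:
            await self._requests.put(_END)
            await self._consumer
            self._requests = None
            self._consumer = None


async def demo() -> List[bytes]:
    stream = StreamSketch()
    for sentence in ("Hello there.", "How are you?"):
        async for _ in stream.run_tts(sentence):
            pass
    await stream.flush()
    return stream.audio


result = asyncio.run(demo())
print(result)
```

Both sentences go through the same queue, which mirrors the idea of feeding later text into an existing stream rather than opening a new one per sentence.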