stt

NVIDIA Nemotron Speech-to-Text service implementations for real-time and batch transcription.

Refer to the NVIDIA ASR NIM documentation for usage, customization, and local deployment steps: https://docs.nvidia.com/nim/speech/latest/asr/

pipecat.services.nvidia.stt.language_to_nvidia_nemotron_speech_language(language: Language) → str | None[source]

Maps Language enum to NVIDIA Nemotron Speech ASR language codes.

Source: https://docs.nvidia.com/nim/speech/latest/reference/support-matrix/asr.html#supported-languages-by-model-type

Parameters: language – Language enum value.
Returns: NVIDIA Nemotron Speech language code, or None if the language is not supported.
Return type: str | None

class pipecat.services.nvidia.stt.NvidiaSTTSettings

Bases: _NvidiaBaseSTTSettings

Settings for NvidiaSTTService.

Parameters: interim_results – Whether to return interim (partial) results.

interim_results: bool | _NotGiven

class pipecat.services.nvidia.stt.NvidiaSegmentedSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, profanity_filter: bool | _NotGiven = <factory>, automatic_punctuation: bool | _NotGiven = <factory>, verbatim_transcripts: bool | _NotGiven = <factory>, boosted_lm_words: list[str] | None | _NotGiven = <factory>, boosted_lm_score: float | _NotGiven = <factory>, max_alternatives: int | _NotGiven = <factory>, word_time_offsets: bool | _NotGiven = <factory>, speaker_diarization: bool | _NotGiven = <factory>, diarization_max_speakers: int | _NotGiven = <factory>)[source]

Bases: _NvidiaBaseSTTSettings

Settings for NvidiaSegmentedSTTService.
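The settings classes type every field as `... | _NotGiven`, which presumably lets the service tell "not specified" apart from an explicit None or False when settings are merged or updated at runtime. The following is a minimal sketch of that sentinel pattern, not pipecat's implementation: `NOT_GIVEN`, `DemoSettings`, and `merge_settings` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any


class _NotGivenType:
    """Sentinel type marking a field the caller did not set."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenType()


@dataclass
class DemoSettings:
    # Hypothetical subset of fields; each defaults to the sentinel.
    language: Any = NOT_GIVEN
    profanity_filter: Any = NOT_GIVEN
    automatic_punctuation: Any = NOT_GIVEN


def merge_settings(base: dict[str, Any], override: DemoSettings) -> dict[str, Any]:
    """Copy `base`, then apply only the fields that were explicitly given."""
    merged = dict(base)
    for f in fields(override):
        value = getattr(override, f.name)
        if value is not NOT_GIVEN:
            merged[f.name] = value
    return merged


defaults = {"language": "en-US", "profanity_filter": False, "automatic_punctuation": True}
update = DemoSettings(profanity_filter=True, language=None)
print(merge_settings(defaults, update))
```

Here the explicit `language=None` overrides the default, while `automatic_punctuation`, which was never given, keeps its existing value.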

class pipecat.services.nvidia.stt.NvidiaSTTService(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'bb0837de-8c7b-481f-9ec8-ef5663e9c1fa', 'model_name': 'nemotron-asr-streaming'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, audio_channel_count: int = 1, start_history: int = -1, start_threshold: float = -1.0, stop_history: int = 320, stop_threshold: float = -1.0, stop_history_eou: int = -1, stop_threshold_eou: float = -1.0, custom_configuration: str = '', settings: NvidiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]

Bases: STTService

Real-time speech-to-text service using NVIDIA Nemotron Speech streaming ASR.

Transcribes audio in real time by streaming it to NVIDIA’s Nemotron Speech ASR models. Supports interim results and continuous audio processing for low-latency applications.

Settings: alias of NvidiaSTTSettings

class InputParams(*, language: Language | None = Language.EN_US)[source]

Bases: BaseModel

Configuration parameters for NVIDIA Nemotron Speech STT service.

Deprecated since version 0.0.105: Use settings=NvidiaSTTService.Settings(...) instead.

Parameters: language – Target language for transcription. Defaults to EN_US.

language: Language | None

__init__(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'bb0837de-8c7b-481f-9ec8-ef5663e9c1fa', 'model_name': 'nemotron-asr-streaming'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, audio_channel_count: int = 1, start_history: int = -1, start_threshold: float = -1.0, stop_history: int = 320, stop_threshold: float = -1.0, stop_history_eou: int = -1, stop_threshold_eou: float = -1.0, custom_configuration: str = '', settings: NvidiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]

Initialize the NVIDIA Nemotron Speech STT service.

Parameters:

api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
server – NVIDIA Nemotron Speech server address. Defaults to NVIDIA Cloud Function endpoint. For local deployments, pass the local address (e.g. localhost:50051).
model_function_map – Mapping containing ‘function_id’ and ‘model_name’ for the ASR model.
sample_rate – Audio sample rate in Hz. If None, uses pipeline default.
params – Additional configuration parameters for NVIDIA Nemotron Speech. Deprecated since version 0.0.105: Use settings=NvidiaSTTService.Settings(...) instead.
use_ssl – Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.
audio_channel_count – Number of audio channels.
start_history – VAD start history in frames. Use -1 for Nemotron Speech default.
start_threshold – VAD start threshold. Use -1.0 for Nemotron Speech default.
stop_history – VAD stop history in frames. Defaults to 320. Use -1 for Nemotron Speech default.
stop_threshold – VAD stop threshold. Use -1.0 for Nemotron Speech default.
stop_history_eou – End-of-utterance stop history in frames. Use -1 for Nemotron Speech default.
stop_threshold_eou – End-of-utterance stop threshold. Use -1.0 for Nemotron Speech default.
custom_configuration – Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to STTService.
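The parameter descriptions above imply two connection profiles: the NVIDIA cloud endpoint (API key required, SSL on) and a local deployment (no API key, SSL off). A minimal sketch of selecting between them follows. `nvidia_stt_kwargs` and `CLOUD_SERVER` are hypothetical helpers, not part of pipecat; only the keyword names and default values come from the signature above.

```python
from typing import Any, Optional

CLOUD_SERVER = "grpc.nvcf.nvidia.com:443"  # default `server` from the signature


def nvidia_stt_kwargs(
    api_key: Optional[str] = None,
    local_address: Optional[str] = None,
) -> dict[str, Any]:
    """Build constructor keyword arguments for a cloud or local deployment."""
    if local_address is not None:
        # Local deployment: no API key needed, and SSL is turned off.
        return {"server": local_address, "use_ssl": False}
    if not api_key:
        raise ValueError("api_key is required for the NVIDIA cloud endpoint")
    return {"api_key": api_key, "server": CLOUD_SERVER, "use_ssl": True}


print(nvidia_stt_kwargs(local_address="localhost:50051"))
```

The resulting dictionary can be unpacked into the constructor, for example `NvidiaSTTService(**nvidia_stt_kwargs(local_address="localhost:50051"))`.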

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, since this service supports metrics generation.

async set_model(model: str)[source]

Set the ASR model for transcription.

Deprecated since version 0.0.104: The model cannot be changed after initialization for NVIDIA Nemotron Speech streaming STT. Set the model name and function ID in the constructor through model_function_map instead.

Example:

NvidiaSTTService(
    api_key=...,
    model_function_map={"function_id": "<UUID>", "model_name": "<model_name>"},
)

Parameters: model – Model name to set.

async start(frame: StartFrame)[source]

Start the NVIDIA Nemotron Speech STT service and initialize streaming configuration.

Parameters: frame – StartFrame indicating pipeline start.

async stop(frame: EndFrame)[source]

Stop the NVIDIA Nemotron Speech STT service and clean up resources.

Parameters: frame – EndFrame indicating pipeline stop.

async cancel(frame: CancelFrame)[source]

Cancel the NVIDIA Nemotron Speech STT service operation.

Parameters: frame – CancelFrame indicating operation cancellation.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Process audio data for speech-to-text transcription.

Parameters: audio – Raw audio bytes to transcribe.
Yields: None. Transcription results are pushed to the pipeline as frames rather than yielded from this generator.
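A generator that yields only None can look odd at first. The toy sketch below illustrates the general shape of such a design: audio is handed to a queue, and a separate task delivers results out of band. It is an illustrative pattern only and does not reproduce pipecat's internals; `DemoStreamingSTT`, `response_task`, and the `pushed` list (standing in for frames pushed downstream) are all hypothetical.

```python
import asyncio
from typing import AsyncGenerator, Optional


class DemoStreamingSTT:
    """Toy stand-in: audio goes into a queue, results arrive on a separate task."""

    def __init__(self) -> None:
        self.audio_queue: asyncio.Queue = asyncio.Queue()
        self.pushed: list[str] = []  # stands in for frames pushed downstream

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Hand the chunk to the background consumer; nothing is returned here.
        await self.audio_queue.put(audio)
        yield None

    async def response_task(self) -> None:
        # Pretend each chunk produces one transcript, delivered out of band.
        while True:
            chunk = await self.audio_queue.get()
            if chunk is None:
                break
            self.pushed.append(f"transcript for {len(chunk)} bytes")


async def demo() -> list[str]:
    stt = DemoStreamingSTT()
    consumer = asyncio.create_task(stt.response_task())
    for chunk in (b"\x00" * 320, b"\x00" * 640):
        async for result in stt.run_stt(chunk):
            assert result is None  # the generator itself carries no transcript
    await stt.audio_queue.put(None)  # sentinel: stop the consumer
    await consumer
    return stt.pushed


print(asyncio.run(demo()))
```

Decoupling audio submission from result delivery is what allows interim and final transcripts to arrive at any time, independent of when a given audio chunk was sent.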

class pipecat.services.nvidia.stt.NvidiaSegmentedSTTService(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'ee8dc628-76de-4acc-8595-1836e7e857bd', 'model_name': 'canary-1b-asr'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, custom_configuration: str = '', settings: NvidiaSegmentedSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]

Bases: SegmentedSTTService

Speech-to-text service using NVIDIA Nemotron Speech’s offline/batch models.

By default, this service uses NVIDIA’s Nemotron Speech Canary ASR API to perform speech-to-text transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection.

Settings: alias of NvidiaSegmentedSTTSettings

class InputParams(*, language: Language | None = Language.EN_US, profanity_filter: bool = False, automatic_punctuation: bool = True, verbatim_transcripts: bool = False, boosted_lm_words: list[str] | None = None, boosted_lm_score: float = 4.0)[source]

Bases: BaseModel

Configuration parameters for NVIDIA Nemotron Speech segmented STT service.

Deprecated since version 0.0.105: Use settings=NvidiaSegmentedSTTService.Settings(...) instead.

Parameters:

language – Target language for transcription. Defaults to EN_US.
profanity_filter – Whether to filter profanity from results.
automatic_punctuation – Whether to add automatic punctuation.
verbatim_transcripts – Whether to return verbatim transcripts.
boosted_lm_words – List of words to boost in language model.
boosted_lm_score – Score boost for specified words.

language: Language | None

profanity_filter: bool

automatic_punctuation: bool

verbatim_transcripts: bool

boosted_lm_words: list[str] | None

boosted_lm_score: float

__init__(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'ee8dc628-76de-4acc-8595-1836e7e857bd', 'model_name': 'canary-1b-asr'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, custom_configuration: str = '', settings: NvidiaSegmentedSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]

Initialize the NVIDIA Nemotron Speech segmented STT service.

Parameters:

api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
server – NVIDIA Nemotron Speech server address. Defaults to NVIDIA Cloud Function endpoint. For local deployments, pass the local address (e.g. localhost:50051).
model_function_map – Mapping of model name and its corresponding NVIDIA Cloud Function ID.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
params – Additional configuration parameters for NVIDIA Nemotron Speech. Deprecated since version 0.0.105: Use settings=NvidiaSegmentedSTTService.Settings(...) instead.
use_ssl – Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.
custom_configuration – Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to SegmentedSTTService.
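The custom_configuration parameter takes a comma-separated string of key:value pairs, as in the example given above. A small helper can assemble that string from a dictionary. `build_custom_configuration` is a hypothetical convenience function, not part of pipecat, and the lowercase rendering of booleans is inferred from the documented example.

```python
from typing import Mapping, Union

ConfigValue = Union[bool, int, float, str]


def build_custom_configuration(options: Mapping[str, ConfigValue]) -> str:
    """Join options into the comma-separated key:value form."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Render booleans in lowercase, matching the documented example.
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        parts.append(f"{key}:{rendered}")
    return ",".join(parts)


config = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
print(config)  # enable_vad_endpointing:true,neural_vad.onset:0.65
```

The resulting string can be passed as `custom_configuration=config` to either service constructor.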

language_to_service_language(language: Language) → str | None[source]

Convert pipecat Language enum to NVIDIA Nemotron Speech’s language code.

Parameters: language – Language enum value.
Returns: NVIDIA Nemotron Speech language code, or None if the language is not supported.

can_generate_metrics() → bool[source]

Check if this service can generate processing metrics.

Returns: True, since this service supports metrics generation.

async start(frame: StartFrame)[source]

Initialize the service when the pipeline starts.

Parameters: frame – StartFrame indicating pipeline start.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Transcribe an audio segment.

Parameters: audio – Raw audio bytes in WAV format (already converted by base class).
Yields: Frame – TranscriptionFrame containing the transcribed text.
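For context on the WAV input, the sketch below shows how raw PCM samples can be wrapped in a WAV container with Python's standard library. It is not the base class's actual conversion code. `pcm_to_wav` and `wav_duration_seconds` are hypothetical helpers, and the 16 kHz, mono, 16-bit format is an assumption chosen for the example.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header back and compute the clip length."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


one_second_of_silence = b"\x00\x00" * 16000  # 16000 samples of 16-bit silence
wav_bytes = pcm_to_wav(one_second_of_silence)
print(wav_bytes[:4], wav_duration_seconds(wav_bytes))
```

Bytes in this form carry their own sample rate and channel count in the header, which is why a segmented service can send each buffered utterance as a self-describing unit.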