stt
NVIDIA Nemotron Speech-to-Text service implementations for real-time and batch transcription.
Refer to the NVIDIA ASR NIM documentation for usage, customization, and local deployment steps: https://docs.nvidia.com/nim/speech/latest/asr/
- pipecat.services.nvidia.stt.language_to_nvidia_nemotron_speech_language(language: Language) → str | None[source]
Maps Language enum to NVIDIA Nemotron Speech ASR language codes.
- Parameters:
language – Language enum value.
- Returns:
NVIDIA Nemotron Speech language code or None if not supported.
- Return type:
str | None
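The lookup pattern behind this function can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Language` enum here is a small stand-in for pipecat's own enum, the mapping table is hypothetical, and `resolve_language` is an invented helper showing one way a caller might handle the `None` case.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Stand-in for pipecat's Language enum (hypothetical subset)."""

    EN_US = "en-US"
    ES_US = "es-US"
    TLH = "tlh"  # deliberately left unmapped to show the None path


# Hypothetical mapping table; the set of languages the service
# actually supports is not listed in this reference.
_LANGUAGE_MAP = {
    Language.EN_US: "en-US",
    Language.ES_US: "es-US",
}


def to_service_language(language: Language) -> Optional[str]:
    """Return the service language code, or None when unsupported."""
    return _LANGUAGE_MAP.get(language)


def resolve_language(language: Language, fallback: str = "en-US") -> str:
    """Hypothetical caller-side helper: fall back when the lookup returns None."""
    code = to_service_language(language)
    return code if code is not None else fallback
```

Returning `None` rather than raising leaves the fallback decision to the caller.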
- class pipecat.services.nvidia.stt.NvidiaSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, profanity_filter: bool | _NotGiven = <factory>, automatic_punctuation: bool | _NotGiven = <factory>, verbatim_transcripts: bool | _NotGiven = <factory>, boosted_lm_words: list[str] | None | _NotGiven = <factory>, boosted_lm_score: float | _NotGiven = <factory>, max_alternatives: int | _NotGiven = <factory>, word_time_offsets: bool | _NotGiven = <factory>, speaker_diarization: bool | _NotGiven = <factory>, diarization_max_speakers: int | _NotGiven = <factory>, interim_results: bool | _NotGiven = <factory>)[source]
Bases: _NvidiaBaseSTTSettings
Settings for NvidiaSTTService.
- Parameters:
interim_results – Whether to return interim (partial) results.
- interim_results: bool | _NotGiven
- class pipecat.services.nvidia.stt.NvidiaSegmentedSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, ~typing.Any]=<factory>, language: Language | str | None | _NotGiven = <factory>, profanity_filter: bool | _NotGiven = <factory>, automatic_punctuation: bool | _NotGiven = <factory>, verbatim_transcripts: bool | _NotGiven = <factory>, boosted_lm_words: list[str] | None | _NotGiven = <factory>, boosted_lm_score: float | _NotGiven = <factory>, max_alternatives: int | _NotGiven = <factory>, word_time_offsets: bool | _NotGiven = <factory>, speaker_diarization: bool | _NotGiven = <factory>, diarization_max_speakers: int | _NotGiven = <factory>)[source]
Bases: _NvidiaBaseSTTSettings
Settings for NvidiaSegmentedSTTService.
- class pipecat.services.nvidia.stt.NvidiaSTTService(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'bb0837de-8c7b-481f-9ec8-ef5663e9c1fa', 'model_name': 'nemotron-asr-streaming'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, audio_channel_count: int = 1, start_history: int = -1, start_threshold: float = -1.0, stop_history: int = 320, stop_threshold: float = -1.0, stop_history_eou: int = -1, stop_threshold_eou: float = -1.0, custom_configuration: str = '', settings: NvidiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Bases: STTService
Real-time speech-to-text service using NVIDIA Nemotron Speech streaming ASR.
Provides real-time transcription capabilities using NVIDIA’s Nemotron Speech ASR models through streaming recognition. Supports interim results and continuous audio processing for low-latency applications.
- Settings
alias of NvidiaSTTSettings
- class InputParams(*, language: Language | None = Language.EN_US)[source]
Bases: BaseModel
Configuration parameters for NVIDIA Nemotron Speech STT service.
Deprecated since version 0.0.105: Use settings=NvidiaSTTService.Settings(...) instead.
- Parameters:
language – Target language for transcription. Defaults to EN_US.
- __init__(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'bb0837de-8c7b-481f-9ec8-ef5663e9c1fa', 'model_name': 'nemotron-asr-streaming'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, audio_channel_count: int = 1, start_history: int = -1, start_threshold: float = -1.0, stop_history: int = 320, stop_threshold: float = -1.0, stop_history_eou: int = -1, stop_threshold_eou: float = -1.0, custom_configuration: str = '', settings: NvidiaSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Initialize the NVIDIA Nemotron Speech STT service.
- Parameters:
api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
server – NVIDIA Nemotron Speech server address. Defaults to the NVIDIA Cloud Function endpoint. For local deployments, pass the local address (e.g. localhost:50051).
model_function_map – Mapping containing ‘function_id’ and ‘model_name’ for the ASR model.
sample_rate – Audio sample rate in Hz. If None, uses pipeline default.
params – Additional configuration parameters for NVIDIA Nemotron Speech. Deprecated since version 0.0.105: Use settings=NvidiaSTTService.Settings(...) instead.
use_ssl – Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.
audio_channel_count – Number of audio channels.
start_history – VAD start history in frames. Use -1 for Nemotron Speech default.
start_threshold – VAD start threshold. Use -1.0 for Nemotron Speech default.
stop_history – VAD stop history in frames. Use -1 for Nemotron Speech default.
stop_threshold – VAD stop threshold. Use -1.0 for Nemotron Speech default.
stop_history_eou – End-of-utterance stop history in frames. Use -1 for Nemotron Speech default.
stop_threshold_eou – End-of-utterance stop threshold. Use -1.0 for Nemotron Speech default.
custom_configuration – Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to STTService.
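The cloud and local cases described for api_key, server, and use_ssl, along with the custom_configuration string format, can be sketched with two small helpers. Both `connection_kwargs` and `build_custom_configuration` are hypothetical names introduced for illustration and are not part of pipecat. The `key:value,key:value` format is inferred only from the example string in the parameter description.

```python
from typing import Any, Dict, Mapping, Optional


def build_custom_configuration(options: Mapping[str, Any]) -> str:
    """Join options into a 'key:value,key:value' string (format assumed
    from the documented example)."""

    def render(value: Any) -> str:
        # Booleans are lowercased to match the documented example ("true").
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return ",".join(f"{key}:{render(value)}" for key, value in options.items())


def connection_kwargs(
    *, api_key: Optional[str] = None, local_address: Optional[str] = None
) -> Dict[str, Any]:
    """Build constructor keyword arguments for a cloud or local deployment."""
    if local_address is not None:
        # Local deployment: no API key needed, SSL disabled.
        return {"server": local_address, "use_ssl": False}
    if not api_key:
        raise ValueError("api_key is required for the NVIDIA cloud endpoint")
    # Cloud deployment: rely on the default server and use_ssl=True.
    return {"api_key": api_key}


kwargs = connection_kwargs(local_address="localhost:50051")
kwargs["custom_configuration"] = build_custom_configuration(
    {"enable_vad_endpointing": True, "neural_vad.onset": 0.65}
)
# With pipecat installed, these could be passed on as:
#   stt = NvidiaSTTService(**kwargs)
```

Keeping the deployment choice in one helper avoids mismatched combinations, such as a local address with SSL left enabled.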
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True - this service supports metrics generation.
- async set_model(model: str)[source]
Set the ASR model for transcription.
Deprecated since version 0.0.104: Model cannot be changed after initialization for NVIDIA Nemotron Speech streaming STT. Set model and function id in the constructor instead.
Example:
    NvidiaSTTService(
        api_key=...,
        model_function_map={"function_id": "<UUID>", "model_name": "<model_name>"},
    )
- Parameters:
model – Model name to set.
- async start(frame: StartFrame)[source]
Start the NVIDIA Nemotron Speech STT service and initialize streaming configuration.
- Parameters:
frame – StartFrame indicating pipeline start.
- async stop(frame: EndFrame)[source]
Stop the NVIDIA Nemotron Speech STT service and clean up resources.
- Parameters:
frame – EndFrame indicating pipeline stop.
- async cancel(frame: CancelFrame)[source]
Cancel the NVIDIA Nemotron Speech STT service operation.
- Parameters:
frame – CancelFrame indicating operation cancellation.
- class pipecat.services.nvidia.stt.NvidiaSegmentedSTTService(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'ee8dc628-76de-4acc-8595-1836e7e857bd', 'model_name': 'canary-1b-asr'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, custom_configuration: str = '', settings: NvidiaSegmentedSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Bases: SegmentedSTTService
Speech-to-text service using NVIDIA Nemotron Speech’s offline/batch models.
By default, this service uses NVIDIA’s Nemotron Speech Canary ASR API to perform speech-to-text transcription on audio segments. It inherits from SegmentedSTTService to handle audio buffering and speech detection.
- Settings
alias of NvidiaSegmentedSTTSettings
- class InputParams(*, language: Language | None = Language.EN_US, profanity_filter: bool = False, automatic_punctuation: bool = True, verbatim_transcripts: bool = False, boosted_lm_words: list[str] | None = None, boosted_lm_score: float = 4.0)[source]
Bases: BaseModel
Configuration parameters for NVIDIA Nemotron Speech segmented STT service.
Deprecated since version 0.0.105: Use settings=NvidiaSegmentedSTTService.Settings(...) instead.
- Parameters:
language – Target language for transcription. Defaults to EN_US.
profanity_filter – Whether to filter profanity from results.
automatic_punctuation – Whether to add automatic punctuation.
verbatim_transcripts – Whether to return verbatim transcripts.
boosted_lm_words – List of words to boost in language model.
boosted_lm_score – Score boost for specified words.
- profanity_filter: bool
- automatic_punctuation: bool
- verbatim_transcripts: bool
- boosted_lm_words: list[str] | None
- boosted_lm_score: float
- __init__(*, api_key: str | None = None, server: str = 'grpc.nvcf.nvidia.com:443', model_function_map: Mapping[str, str] = {'function_id': 'ee8dc628-76de-4acc-8595-1836e7e857bd', 'model_name': 'canary-1b-asr'}, sample_rate: int | None = None, params: InputParams | None = None, use_ssl: bool = True, custom_configuration: str = '', settings: NvidiaSegmentedSTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)[source]
Initialize the NVIDIA Nemotron Speech segmented STT service.
- Parameters:
api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.
server – NVIDIA Nemotron Speech server address. Defaults to the NVIDIA Cloud Function endpoint. For local deployments, pass the local address (e.g. localhost:50051).
model_function_map – Mapping of model name and its corresponding NVIDIA Cloud Function ID.
sample_rate – Audio sample rate in Hz. If not provided, uses the pipeline’s rate.
params – Additional configuration parameters for NVIDIA Nemotron Speech. Deprecated since version 0.0.105: Use settings=NvidiaSegmentedSTTService.Settings(...) instead.
use_ssl – Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.
custom_configuration – Custom Nemotron Speech configuration string (e.g. "enable_vad_endpointing:true,neural_vad.onset:0.65").
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to SegmentedSTTService.
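The rule that settings values take precedence over the deprecated params can be illustrated with a sentinel-based merge. This is a sketch of the general pattern only, not pipecat's implementation: `SketchSettings`, `NOT_GIVEN`, and `resolve` are hypothetical names, and the `_NotGiven` class below is a stand-in for the sentinel type that appears in the settings signatures.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


class _NotGiven:
    """Stand-in sentinel type meaning 'the caller did not set this field'."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSettings:
    """Hypothetical settings object; every field defaults to NOT_GIVEN."""

    language: Any = field(default_factory=lambda: NOT_GIVEN)
    profanity_filter: Any = field(default_factory=lambda: NOT_GIVEN)
    automatic_punctuation: Any = field(default_factory=lambda: NOT_GIVEN)


def resolve(
    defaults: Dict[str, Any],
    params: Optional[Dict[str, Any]],
    settings: Optional[SketchSettings],
) -> Dict[str, Any]:
    """Layer values: defaults, then deprecated params, then settings."""
    resolved = dict(defaults)
    resolved.update(params or {})
    if settings is not None:
        for f in fields(settings):
            value = getattr(settings, f.name)
            # Only fields the caller actually set override earlier layers.
            if not isinstance(value, _NotGiven):
                resolved[f.name] = value
    return resolved
```

A sentinel distinct from `None` lets a caller explicitly pass `None` (for example, for language) without it being mistaken for "not provided".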
- language_to_service_language(language: Language) → str | None[source]
Convert pipecat Language enum to NVIDIA Nemotron Speech’s language code.
- Parameters:
language – Language enum value.
- Returns:
NVIDIA Nemotron Speech language code or None if not supported.
- can_generate_metrics() → bool[source]
Check if this service can generate processing metrics.
- Returns:
True - this service supports metrics generation.
- async start(frame: StartFrame)[source]
Initialize the service when the pipeline starts.
- Parameters:
frame – StartFrame indicating pipeline start.