llm
NVIDIA NIM API service implementation.
This module provides a service for interacting with NVIDIA’s NIM (NVIDIA Inference Microservice) API while maintaining compatibility with the OpenAI-style interface.
Refer to the NVIDIA NIM LLM API documentation for available models and usage: https://docs.api.nvidia.com/nim/reference/llm-apis
- class pipecat.services.nvidia.llm.NvidiaLLMSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any]=<factory>, system_instruction: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven | NotGiven = <factory>, max_tokens: int | None | _NotGiven | NotGiven = <factory>, top_p: float | None | _NotGiven | NotGiven = <factory>, top_k: int | None | _NotGiven = <factory>, frequency_penalty: float | None | _NotGiven | NotGiven = <factory>, presence_penalty: float | None | _NotGiven | NotGiven = <factory>, seed: int | None | _NotGiven | NotGiven = <factory>, filter_incomplete_user_turns: bool | None | _NotGiven = <factory>, user_turn_completion_config: UserTurnCompletionConfig | None | _NotGiven = <factory>, max_completion_tokens: int | _NotGiven | NotGiven = <factory>)[source]
Bases: OpenAILLMSettings
Settings for NvidiaLLMService.
- class pipecat.services.nvidia.llm.NvidiaLLMService(*, api_key: str | None = None, base_url: str = 'https://integrate.api.nvidia.com/v1', model: str | None = None, settings: NvidiaLLMSettings | None = None, **kwargs)[source]
Bases: OpenAILLMService
A service for interacting with NVIDIA’s NIM (NVIDIA Inference Microservice) API.
This service extends OpenAILLMService to work with NVIDIA’s NIM API while maintaining compatibility with the OpenAI-style interface. It handles:
- Incremental token usage reporting (NIM sends per-chunk counts instead of a final summary)
- Detection and filtering of leading <think>/</think> content for models that emit reasoning inline before visible output (e.g. DeepSeek-R1, some Nemotron models)
- Extraction of reasoning_content from the streaming delta for models with API-level reasoning separation (e.g. Nemotron Nano models)
Reasoning content is emitted as LLMThought*Frame objects, keeping it accessible to observers and logging without sending it to TTS.
- Settings
alias of NvidiaLLMSettings
- __init__(*, api_key: str | None = None, base_url: str = 'https://integrate.api.nvidia.com/v1', model: str | None = None, settings: NvidiaLLMSettings | None = None, **kwargs)[source]
Initialize the NvidiaLLMService.
- Parameters:
api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local NIM deployments.
base_url – The base URL for the NIM API. Defaults to NVIDIA’s cloud endpoint. For local deployments, pass the local address (e.g. http://localhost:8000/v1).
model – The model identifier to use. Defaults to “nvidia/nemotron-3-nano-30b-a3b”.
Deprecated since version 0.0.105: Use settings=NvidiaLLMService.Settings(model=...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
**kwargs – Additional keyword arguments passed to OpenAILLMService.
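The precedence rule in the parameter list above (settings values win over the deprecated model argument, which in turn wins over the default) can be sketched in plain Python. This is only an illustration under stated assumptions: SketchSettings and resolve_model are hypothetical names that do not exist in pipecat, and the real constructor may resolve values differently.

```python
from dataclasses import dataclass
from typing import Optional

# Default model identifier as named in the parameter description above.
DEFAULT_MODEL = "nvidia/nemotron-3-nano-30b-a3b"


@dataclass
class SketchSettings:
    """Hypothetical stand-in for NvidiaLLMSettings; only `model` is modeled."""

    model: Optional[str] = None


def resolve_model(
    model: Optional[str] = None,
    settings: Optional[SketchSettings] = None,
) -> str:
    """Pick the effective model: settings first, then the deprecated argument, then the default."""
    if settings is not None and settings.model is not None:
        return settings.model
    if model is not None:
        return model
    return DEFAULT_MODEL
```

With the actual service, the documented non-deprecated form is to pass settings=NvidiaLLMService.Settings(model=...) to the constructor instead of model=....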
- async get_chat_completions(context: LLMContext) AsyncIterator[ChatCompletionChunk][source]
Wrap the chat completion stream to handle reasoning_content.
Models with API-level reasoning separation (e.g. Nemotron Nano) include a reasoning_content field on the streaming delta. This wrapper extracts those chunks and emits them as LLMThought*Frame objects. It also rewrites streamed delta.content so leading <think> sections are removed before the base OpenAI loop processes visible content.
- Parameters:
context – The LLM context for the completion request.
- Returns:
An async iterator of chat completion chunks where reasoning_content has been emitted as LLMThought*Frame side effects.
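The removal of a leading <think> section from streamed text can be illustrated with a small standalone generator. This is a minimal sketch, not the service's code: split_leading_think is a hypothetical helper, it works on plain strings rather than ChatCompletionChunk objects, and it does not model the separate reasoning_content field. It assumes tags may be split across chunk boundaries and that only a block at the very start of the stream counts as reasoning.

```python
from typing import Iterable, Iterator, Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def split_leading_think(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield ("thought", text) or ("content", text) pairs from streamed text chunks."""
    buffer = ""
    state = "start"  # "start" -> "thinking" -> "content"
    for chunk in chunks:
        if state == "content":
            if chunk:
                yield ("content", chunk)
            continue
        buffer += chunk
        if state == "start":
            stripped = buffer.lstrip()
            if stripped.startswith(OPEN_TAG):
                buffer = stripped[len(OPEN_TAG):]
                state = "thinking"
            elif OPEN_TAG.startswith(stripped):
                continue  # still ambiguous (e.g. "<thi"); wait for more text
            else:
                state = "content"
                yield ("content", buffer)
                buffer = ""
                continue
        # state == "thinking": look for the closing tag in what we have so far.
        end = buffer.find(CLOSE_TAG)
        if end != -1:
            if buffer[:end]:
                yield ("thought", buffer[:end])
            rest = buffer[end + len(CLOSE_TAG):]
            buffer = ""
            state = "content"
            if rest:
                yield ("content", rest)
        else:
            # Hold back a tail that could be the start of a split closing tag.
            safe_len = max(0, len(buffer) - (len(CLOSE_TAG) - 1))
            if safe_len:
                yield ("thought", buffer[:safe_len])
                buffer = buffer[safe_len:]
    # Flush whatever is left when the stream ends.
    if buffer:
        yield ("thought" if state == "thinking" else "content", buffer)
```

In this sketch the "thought" pairs play the role that LLMThought*Frame objects play in the service: they stay available to a consumer but are kept apart from the visible content.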
- async start_llm_usage_metrics(tokens: LLMTokenUsage)[source]
Accumulate token usage metrics during processing.
This method intercepts the incremental token updates from NVIDIA’s API and accumulates them instead of passing each update to the metrics system. The final accumulated totals are reported at the end of processing.
- Parameters:
tokens – The token usage metrics for the current chunk of processing, containing prompt_tokens and completion_tokens counts.
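One way to picture the accumulation described above is a small collector that absorbs each incremental update and exposes a single total at the end. This is a sketch under assumptions, not the service's implementation: Usage and UsageAccumulator are hypothetical names, and the code assumes each update carries a running completion count, so it keeps the largest value seen. If an endpoint sent per-chunk deltas instead, the counts would need to be summed.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for LLMTokenUsage with only the two counts named above."""

    prompt_tokens: int = 0
    completion_tokens: int = 0


class UsageAccumulator:
    """Collect incremental usage updates and report one final total."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def update(self, usage: Usage) -> None:
        # Assumption: the prompt count does not change, so record it once.
        if self.prompt_tokens == 0 and usage.prompt_tokens > 0:
            self.prompt_tokens = usage.prompt_tokens
        # Assumption: completion counts are running totals, so keep the largest.
        self.completion_tokens = max(self.completion_tokens, usage.completion_tokens)

    def final(self) -> Usage:
        return Usage(self.prompt_tokens, self.completion_tokens)
```

A caller would invoke update() for every streamed usage report and pass the result of final() to the metrics system once processing ends, which mirrors the "accumulate, then report once" behaviour described for this method.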