llm

NVIDIA NIM API service implementation.

This module provides a service for interacting with NVIDIA’s NIM (NVIDIA Inference Microservice) API while maintaining compatibility with the OpenAI-style interface.

Refer to the NVIDIA NIM LLM API documentation for available models and usage: https://docs.api.nvidia.com/nim/reference/llm-apis

class pipecat.services.nvidia.llm.NvidiaLLMSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, system_instruction: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven | NotGiven = <factory>, max_tokens: int | None | _NotGiven | NotGiven = <factory>, top_p: float | None | _NotGiven | NotGiven = <factory>, top_k: int | None | _NotGiven = <factory>, frequency_penalty: float | None | _NotGiven | NotGiven = <factory>, presence_penalty: float | None | _NotGiven | NotGiven = <factory>, seed: int | None | _NotGiven | NotGiven = <factory>, filter_incomplete_user_turns: bool | None | _NotGiven = <factory>, user_turn_completion_config: UserTurnCompletionConfig | None | _NotGiven = <factory>, max_completion_tokens: int | _NotGiven | NotGiven = <factory>)

Bases: OpenAILLMSettings

Settings for NvidiaLLMService.

class pipecat.services.nvidia.llm.NvidiaLLMService(*, api_key: str | None = None, base_url: str = 'https://integrate.api.nvidia.com/v1', model: str | None = None, settings: NvidiaLLMSettings | None = None, **kwargs)

Bases: OpenAILLMService

A service for interacting with NVIDIA’s NIM (NVIDIA Inference Microservice) API.

This service extends OpenAILLMService to work with NVIDIA’s NIM API while maintaining compatibility with the OpenAI-style interface. It handles:

  • Incremental token usage reporting (NIM sends per-chunk counts instead of a final summary)

  • Detection and filtering of leading <think>/</think> content for models that emit reasoning inline before visible output (e.g. DeepSeek-R1, some Nemotron models)

  • Extraction of reasoning_content from the streaming delta for models with API-level reasoning separation (e.g. Nemotron Nano models)

Reasoning content is emitted as LLMThought*Frame objects, keeping it accessible to observers and logging without sending it to TTS.
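The leading <think> filtering described above can be illustrated with a small, self-contained sketch. It is not the service's actual implementation: the LeadingThinkFilter class and the split_stream helper are hypothetical names, and the sketch only assumes that a tag may be split across streamed chunks.

```python
class LeadingThinkFilter:
    """Split a streamed reply into leading <think> reasoning and visible text.

    Hypothetical helper for illustration; not part of pipecat.
    """

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self._state = "start"  # "start" -> "thinking" -> "visible"
        self._buf = ""

    def feed(self, text: str) -> tuple[str, str]:
        """Return the (reasoning, visible) text released by this chunk."""
        if self._state == "visible":
            return "", text
        self._buf += text
        if self._state == "start":
            head = self._buf.lstrip()
            if head.startswith(self.OPEN):
                self._buf = head[len(self.OPEN):]
                self._state = "thinking"
            elif self.OPEN.startswith(head):
                return "", ""  # may still grow into "<think>"; wait for more
            else:
                self._state = "visible"
                visible, self._buf = self._buf, ""
                return "", visible
        end = self._buf.find(self.CLOSE)
        if end == -1:
            # Hold back a tail that could be the start of "</think>".
            keep = next((n for n in range(len(self.CLOSE) - 1, 0, -1)
                         if self._buf.endswith(self.CLOSE[:n])), 0)
            cut = len(self._buf) - keep
            reasoning, self._buf = self._buf[:cut], self._buf[cut:]
            return reasoning, ""
        reasoning, visible = self._buf[:end], self._buf[end + len(self.CLOSE):]
        self._buf, self._state = "", "visible"
        return reasoning, visible


def split_stream(chunks: list[str]) -> tuple[str, str]:
    """Run all chunks through the filter and join the two outputs."""
    f = LeadingThinkFilter()
    parts = [f.feed(c) for c in chunks]
    return "".join(r for r, _ in parts), "".join(v for _, v in parts)
```

In this sketch the reasoning half would be routed to thought frames and the visible half to the normal text path. Text still buffered when the stream ends is not flushed; a fuller version would need an end-of-stream step.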

Settings

alias of NvidiaLLMSettings

__init__(*, api_key: str | None = None, base_url: str = 'https://integrate.api.nvidia.com/v1', model: str | None = None, settings: NvidiaLLMSettings | None = None, **kwargs)

Initialize the NvidiaLLMService.

Parameters:
  • api_key – NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local NIM deployments.

  • base_url – The base URL for the NIM API. Defaults to NVIDIA’s cloud endpoint. For local deployments, pass the local address (e.g. http://localhost:8000/v1).

  • model

    The model identifier to use. Defaults to “nvidia/nemotron-3-nano-30b-a3b”.

    Deprecated since version 0.0.105: Use settings=NvidiaLLMService.Settings(model=...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • **kwargs – Additional keyword arguments passed to OpenAILLMService.
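As a minimal usage sketch of the parameters above: the helper names nim_service_kwargs and build_nvidia_llm are hypothetical, and reading the key from an NVIDIA_API_KEY environment variable is an assumption rather than something this page specifies. The pipecat import is deferred so the helper that picks the endpoint can be used without pipecat installed.

```python
import os

# Values taken from the signature and parameter notes above.
CLOUD_BASE_URL = "https://integrate.api.nvidia.com/v1"
LOCAL_BASE_URL = "http://localhost:8000/v1"  # example local NIM address


def nim_service_kwargs(local: bool = False) -> dict:
    """Pick endpoint arguments: local NIM needs no key, the cloud does."""
    if local:
        return {"base_url": LOCAL_BASE_URL}
    # Assumption: the key is stored in the NVIDIA_API_KEY environment variable.
    return {"api_key": os.getenv("NVIDIA_API_KEY"), "base_url": CLOUD_BASE_URL}


def build_nvidia_llm(local: bool = False,
                     model: str = "nvidia/nemotron-3-nano-30b-a3b"):
    """Construct the service, passing the model via settings (not model=)."""
    from pipecat.services.nvidia.llm import NvidiaLLMService

    return NvidiaLLMService(
        settings=NvidiaLLMService.Settings(model=model),
        **nim_service_kwargs(local),
    )
```

Passing the model through settings avoids the deprecated model parameter; calling build_nvidia_llm(local=True) would target a NIM container at the example local address.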

async get_chat_completions(context: LLMContext) -> AsyncIterator[ChatCompletionChunk]

Wrap the chat completion stream to handle reasoning_content.

Models with API-level reasoning separation (e.g. Nemotron Nano) include a reasoning_content field on the streaming delta. This wrapper extracts those chunks and emits them as LLMThought*Frame objects. It also rewrites streamed delta.content so leading <think> sections are removed before the base OpenAI loop processes visible content.

Parameters:

context – The LLM context for the completion request.

Returns:

An async iterator of chat completion chunks. Any reasoning_content found on a chunk is emitted as LLMThought*Frame objects as a side effect of iteration.
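The wrapping pattern can be sketched with plain Python objects. This is an illustration only: Delta is a hypothetical stand-in for a streaming delta, and the on_thought callback stands in for pushing LLMThought*Frame objects; neither is the service's real code.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, Optional


@dataclass
class Delta:
    """Hypothetical stand-in for a streaming delta."""
    content: Optional[str] = None
    reasoning_content: Optional[str] = None


async def split_reasoning(
    chunks: AsyncIterator[Delta],
    on_thought: Callable[[str], None],
) -> AsyncIterator[Delta]:
    """Pass chunks through, reporting reasoning_content as a side effect."""
    async for delta in chunks:
        if delta.reasoning_content:
            on_thought(delta.reasoning_content)
        yield delta


async def fake_stream() -> AsyncIterator[Delta]:
    """Tiny fake stream: one reasoning chunk, then visible text."""
    for d in [Delta(reasoning_content="think A"),
              Delta(content="Hello"),
              Delta(content=" world")]:
        yield d


async def demo() -> tuple[list[str], str]:
    thoughts: list[str] = []
    visible: list[str] = []
    async for d in split_reasoning(fake_stream(), thoughts.append):
        if d.content:
            visible.append(d.content)
    return thoughts, "".join(visible)
```

Running asyncio.run(demo()) collects the reasoning text separately from the visible text, which is what keeps reasoning available to observers without sending it to TTS.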

async start_llm_usage_metrics(tokens: LLMTokenUsage)

Accumulate token usage metrics during processing.

This method intercepts the incremental token updates from NVIDIA’s API and accumulates them instead of passing each update to the metrics system. The final accumulated totals are reported at the end of processing.

Parameters:

tokens – The token usage metrics for the current chunk of processing, containing prompt_tokens and completion_tokens counts.
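A minimal sketch of the accumulate-then-report idea follows. Usage and UsageAccumulator are hypothetical names, and the sketch assumes each update carries a running count, so it keeps the first non-zero prompt count and the largest completion count seen. That assumption is not stated on this page; if the updates were true per-chunk deltas, the counts would be summed instead.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for LLMTokenUsage."""
    prompt_tokens: int = 0
    completion_tokens: int = 0


class UsageAccumulator:
    """Collect incremental usage updates and expose one final total."""

    def __init__(self) -> None:
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def update(self, usage: Usage) -> None:
        # Assumption: updates are running counts, not deltas.
        if self.prompt_tokens == 0 and usage.prompt_tokens > 0:
            self.prompt_tokens = usage.prompt_tokens
        self.completion_tokens = max(self.completion_tokens,
                                     usage.completion_tokens)

    def total(self) -> Usage:
        """Totals to report once, at the end of processing."""
        return Usage(self.prompt_tokens, self.completion_tokens)


def accumulate(updates: list[Usage]) -> Usage:
    acc = UsageAccumulator()
    for u in updates:
        acc.update(u)
    return acc.total()
```

Reporting only the final total avoids sending one metrics event per streamed chunk.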