llm

Google Gemini integration for Pipecat.

This module provides Google Gemini integration for the Pipecat framework, including LLM services, context management, and message aggregation.

class pipecat.services.google.llm.GoogleThinkingConfig(*, thinking_budget: int | None = None, thinking_level: Literal['low', 'high', 'medium', 'minimal'] | str | None = None, include_thoughts: bool | None = None)[source]

Bases: BaseModel

Configuration for controlling the model’s internal “thinking” process used before generating a response.

Gemini 2.5 and 3 series models have this thinking process.

Parameters:
  • thinking_level – Thinking level for Gemini 3 models. For Gemini 3 Pro, this can be “low” or “high”. For Gemini 3 Flash, this can be “minimal”, “low”, “medium”, or “high”. If not provided, Gemini 3 models default to “high”. Note: Gemini 2.5 series must use thinking_budget instead.

  • thinking_budget – Token budget for thinking, for Gemini 2.5 series. -1 for dynamic thinking (model decides), 0 to disable thinking, or a specific token count (e.g., 128-32768 for 2.5 Pro). If not provided, most models today default to dynamic thinking. See https://ai.google.dev/gemini-api/docs/thinking#set-budget for default values and allowed ranges. Note: Gemini 3 models must use thinking_level instead.

  • include_thoughts – Whether to include thought summaries in the response. Today’s models default to not including thoughts (False).

thinking_budget: int | None
thinking_level: Literal['low', 'high', 'medium', 'minimal'] | str | None
include_thoughts: bool | None
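
The split between thinking_budget (Gemini 2.5) and thinking_level (Gemini 3) can be sketched as a small helper. This is illustrative only: thinking_kwargs_for is a hypothetical function, not part of Pipecat, and the prefix check on the model name is an assumption about how model identifiers are spelled. It returns a plain dict so that it runs without Pipecat installed.

```python
from typing import Any


def thinking_kwargs_for(model: str, *, fast: bool = True) -> dict[str, Any]:
    """Return keyword arguments for GoogleThinkingConfig (hypothetical helper).

    Assumes model names start with "gemini-2.5" or "gemini-3"; adjust the
    checks if your model identifiers differ.
    """
    if model.startswith("gemini-2.5"):
        if not fast:
            # -1 requests dynamic thinking: the model decides how much to think.
            return {"thinking_budget": -1}
        # 2.5 Pro cannot disable thinking, so use 128, the low end of the
        # example range given above; other 2.5 models accept 0 (disabled).
        return {"thinking_budget": 128 if "pro" in model else 0}
    if model.startswith("gemini-3"):
        # Gemini 3 takes a named level; "low" and "high" are valid for
        # both Pro and Flash.
        return {"thinking_level": "low" if fast else "high"}
    # Unknown family: pass nothing and keep the service defaults.
    return {}
```

The result would then be unpacked into the config, for example GoogleThinkingConfig(**thinking_kwargs_for("gemini-2.5-flash")).
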
class pipecat.services.google.llm.GoogleLLMSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, system_instruction: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>, max_tokens: int | None | _NotGiven = <factory>, top_p: float | None | _NotGiven = <factory>, top_k: int | None | _NotGiven = <factory>, frequency_penalty: float | None | _NotGiven = <factory>, presence_penalty: float | None | _NotGiven = <factory>, seed: int | None | _NotGiven = <factory>, filter_incomplete_user_turns: bool | None | _NotGiven = <factory>, user_turn_completion_config: UserTurnCompletionConfig | None | _NotGiven = <factory>, thinking: GoogleLLMService.ThinkingConfig | None | _NotGiven = <factory>)[source]

Bases: LLMSettings

Settings for GoogleLLMService.

Parameters:
  • thinking – Thinking configuration.

thinking: GoogleLLMService.ThinkingConfig | None | _NotGiven
classmethod from_mapping(settings)[source]

Convert a plain dict to settings, coercing thinking dicts.

For backward compatibility, a thinking value that is a plain dict is converted to a GoogleLLMService.ThinkingConfig.
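
The coercion described above can be illustrated with stand-in types. This is a sketch of the idea, not Pipecat's implementation: ThinkingConfigSketch and coerce_thinking are hypothetical names, and a standard-library dataclass is used in place of the real config class so the example runs without Pipecat installed.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class ThinkingConfigSketch:
    """Stand-in for GoogleThinkingConfig (not the real class)."""

    thinking_budget: Optional[int] = None
    thinking_level: Optional[str] = None
    include_thoughts: Optional[bool] = None


def coerce_thinking(settings: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of settings with a dict-valued "thinking" coerced."""
    result = dict(settings)
    thinking = result.get("thinking")
    if isinstance(thinking, dict):
        # Legacy callers passed a plain dict; turn it into a config object.
        result["thinking"] = ThinkingConfigSketch(**thinking)
    # Anything else (a config object, None, or a missing key) is left as-is.
    return result
```

With the real class, the equivalent call would be GoogleLLMSettings.from_mapping({"thinking": {"thinking_budget": 0}}).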

class pipecat.services.google.llm.GoogleLLMService(*, api_key: str, model: str | None = None, params: InputParams | None = None, settings: GoogleLLMSettings | None = None, system_instruction: str | None = None, tools: list[dict[str, Any]] | None = None, tool_config: dict[str, Any] | None = None, http_options: HttpOptions | None = None, **kwargs)[source]

Bases: LLMService

Google AI (Gemini) LLM service implementation.

This class implements inference with Google’s AI models, translating internally from an LLMContext to the messages format expected by the Google AI model.

Settings

alias of GoogleLLMSettings

adapter_class

alias of GeminiLLMAdapter

ThinkingConfig

alias of GoogleThinkingConfig

class InputParams(**data: Any)[source]

Bases: BaseModel

Input parameters for Google AI models.

Deprecated since version 0.0.105: Use settings=GoogleLLMService.Settings(...) instead.

Parameters:
  • max_tokens – Maximum number of tokens to generate.

  • temperature – Sampling temperature between 0.0 and 2.0.

  • top_k – Top-k sampling parameter.

  • top_p – Top-p sampling parameter between 0.0 and 1.0.

  • thinking – Thinking configuration with thinking_budget, thinking_level, and include_thoughts. Used to control the model’s internal “thinking” process used before generating a response. Gemini 2.5 series models use thinking_budget; Gemini 3 models use thinking_level. If this is not provided, Pipecat disables thinking for all models where that’s possible (the 2.5 series, except 2.5 Pro), to reduce latency.

  • extra – Additional parameters as a dictionary.

max_tokens: int | None
temperature: float | None
top_k: int | None
top_p: float | None
thinking: GoogleLLMService.ThinkingConfig | None
extra: dict[str, Any] | None
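
Because the InputParams fields share their names with the corresponding GoogleLLMSettings fields, moving off the deprecated arguments is mostly a matter of regrouping keyword arguments. The sketch below shows one way to do that; settings_kwargs_from_legacy is a hypothetical helper, and it takes the old params as a plain dict so that it runs without Pipecat installed.

```python
from typing import Any, Optional


def settings_kwargs_from_legacy(
    model: Optional[str] = None,
    system_instruction: Optional[str] = None,
    params: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Collect deprecated constructor arguments into Settings keyword arguments.

    Hypothetical helper: `params` is a plain dict holding InputParams fields.
    """
    kwargs: dict[str, Any] = {}
    if model is not None:
        kwargs["model"] = model
    if system_instruction is not None:
        kwargs["system_instruction"] = system_instruction
    # InputParams field names match the Settings field names one to one.
    for name in ("max_tokens", "temperature", "top_k", "top_p", "thinking", "extra"):
        value = (params or {}).get(name)
        if value is not None:
            # Unset values are skipped so the Settings defaults still apply.
            kwargs[name] = value
    return kwargs
```

The result would be passed on as GoogleLLMService.Settings(**settings_kwargs_from_legacy(...)).
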
__init__(*, api_key: str, model: str | None = None, params: InputParams | None = None, settings: GoogleLLMSettings | None = None, system_instruction: str | None = None, tools: list[dict[str, Any]] | None = None, tool_config: dict[str, Any] | None = None, http_options: HttpOptions | None = None, **kwargs)[source]

Initialize the Google LLM service.

Parameters:
  • api_key – Google AI API key for authentication.

  • model

    Model name to use.

    Deprecated since version 0.0.105: Use settings=GoogleLLMService.Settings(model=...) instead.

  • params

    Optional model parameters for inference.

    Deprecated since version 0.0.105: Use settings=GoogleLLMService.Settings(...) instead.

  • settings – Runtime-updatable settings for this service. When both deprecated parameters and settings are provided, settings values take precedence.

  • system_instruction

    System instruction/prompt for the model.

    Deprecated since version 0.0.105: Use settings=GoogleLLMService.Settings(system_instruction=...) instead.

  • tools – List of available tools/functions.

  • tool_config – Configuration for tool usage.

  • http_options – HTTP options for the client.

  • **kwargs – Additional arguments passed to parent class.
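
A construction call that avoids the deprecated model, params, and system_instruction arguments might look like the following. Treat it as a sketch under stated assumptions: the GOOGLE_API_KEY environment variable, the model name, and the prompt text are illustrative choices, and build_gemini_service is a hypothetical wrapper. The Pipecat import is done lazily inside the function so that the module still loads where Pipecat is not installed.

```python
import os


def build_gemini_service():
    """Build a GoogleLLMService using the non-deprecated settings argument."""
    # Imported lazily so this module loads even where Pipecat is not installed.
    from pipecat.services.google.llm import GoogleLLMService

    return GoogleLLMService(
        api_key=os.environ["GOOGLE_API_KEY"],  # assumed environment variable name
        settings=GoogleLLMService.Settings(
            model="gemini-2.5-flash",  # assumed model name
            system_instruction="You are a concise voice assistant.",
            temperature=0.7,
            # Disable thinking explicitly (a budget of 0 on the 2.5 series).
            thinking=GoogleLLMService.ThinkingConfig(thinking_budget=0),
        ),
    )
```

Grouping everything under settings keeps the configuration in the one object that the service treats as runtime-updatable.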

can_generate_metrics() → bool[source]

Check if the service can generate usage metrics.

Returns:

True, as Google AI provides token usage metrics.

create_client()[source]

Create the Gemini client instance. Subclasses can override this.

async run_inference(context: LLMContext, max_tokens: int | None = None, system_instruction: str | None = None) → str | None[source]

Run a one-shot, out-of-band (i.e. out-of-pipeline) inference with the given LLM context.

Parameters:
  • context – The LLM context containing conversation history.

  • max_tokens – Optional maximum number of tokens to generate. If provided, overrides the service’s default max_tokens setting.

  • system_instruction – Optional system instruction to use for this inference. If provided, overrides any system instruction in the context.

Returns:

The LLM’s response as a string, or None if no response is generated.

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process incoming frames and handle different frame types.

Parameters:
  • frame – The frame to process.

  • direction – Direction of frame processing.

async stop(frame)[source]

Override stop to gracefully close the client.

async cancel(frame)[source]

Override cancel to gracefully close the client.