stt

OpenAI Speech-to-Text service implementations.

Provides two STT services:

  • OpenAISTTService: REST-based transcription using the Audio API (Whisper / GPT-4o).

  • OpenAIRealtimeSTTService: WebSocket-based streaming transcription using the Realtime API in transcription-only mode.

class pipecat.services.openai.stt.OpenAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)[source]

Bases: BaseWhisperSTTSettings

Settings for the OpenAI STT service.

class pipecat.services.openai.stt.OpenAISTTService(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = Language.EN, prompt: str | None = None, temperature: float | None = None, settings: OpenAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Bases: BaseWhisperSTTService

OpenAI Speech-to-Text service that generates text from audio.

Uses OpenAI’s transcription API to convert audio to text. Requires an OpenAI API key, supplied either through the api_key parameter or through the OPENAI_API_KEY environment variable.

Settings

alias of OpenAISTTSettings

__init__(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = Language.EN, prompt: str | None = None, temperature: float | None = None, settings: OpenAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)[source]

Initialize OpenAI STT service.

Parameters:
  • model

    Model to use: either a GPT-4o transcription model or Whisper.

    Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(model=...) instead.

  • api_key – OpenAI API key. Defaults to None.

  • base_url – API base URL. Defaults to None.

  • language

    Language of the audio input. Defaults to English.

    Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(language=...) instead.

  • prompt

    Optional text to guide the model’s style or continue a previous segment.

    Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(prompt=...) instead.

  • temperature

    Optional sampling temperature between 0 and 1. Defaults to 0.0.

    Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(temperature=...) instead.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to BaseWhisperSTTService.
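The precedence rule for settings and the deprecated parameters can be illustrated with a small self-contained sketch. It does not import pipecat: SketchSettings, NOT_GIVEN, and resolve_settings are hypothetical stand-ins, and the merge logic is an assumption about how "settings values take precedence" might behave, not the library's actual implementation.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


class _NotGiven:
    """Sentinel type marking a field the caller did not set (hypothetical stand-in)."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGiven()


@dataclass
class SketchSettings:
    """Hypothetical settings object; not the real OpenAISTTSettings."""

    model: Any = NOT_GIVEN
    language: Any = NOT_GIVEN
    prompt: Any = NOT_GIVEN
    temperature: Any = NOT_GIVEN


def resolve_settings(settings: Optional[SketchSettings] = None, **deprecated: Any) -> Dict[str, Any]:
    """Merge deprecated keyword arguments with a settings object.

    Deprecated arguments act as a fallback. Any field explicitly set on
    `settings` overrides the deprecated argument of the same name.
    """
    # Start from the deprecated arguments that were actually supplied.
    resolved = {name: value for name, value in deprecated.items() if value is not None}
    if settings is not None:
        for field in fields(settings):
            value = getattr(settings, field.name)
            if not isinstance(value, _NotGiven):
                resolved[field.name] = value  # settings value wins
    return resolved


merged = resolve_settings(
    SketchSettings(model="gpt-4o-transcribe"),
    model="whisper-1",          # deprecated argument, overridden by settings
    prompt="Pipecat, Daily",    # deprecated argument, kept as a fallback
)
print(merged)  # {'model': 'gpt-4o-transcribe', 'prompt': 'Pipecat, Daily'}
```

The sentinel distinguishes "not set" from an explicit None, which is presumably why the real settings classes use a _NotGiven type in their signatures.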

class pipecat.services.openai.stt.OpenAIRealtimeSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, noise_reduction: Literal['near_field', 'far_field'] | None | _NotGiven = <factory>)[source]

Bases: STTSettings

Settings for OpenAIRealtimeSTTService.

Parameters:
  • prompt – Optional prompt text to guide transcription style.

  • noise_reduction – Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.

prompt: str | None | _NotGiven

noise_reduction: Literal['near_field', 'far_field'] | None | _NotGiven

class pipecat.services.openai.stt.OpenAIRealtimeSTTService(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)[source]

Bases: WebsocketSTTService

OpenAI Realtime Speech-to-Text service using WebSocket transcription sessions.

Uses OpenAI’s Realtime API in transcription-only mode for real-time streaming speech recognition with optional server-side VAD and noise reduction. The model does not generate conversational responses — only transcription output.

This service supports two VAD modes:

Local VAD (default, turn_detection=False): Server-side VAD is disabled and a local VAD processor in the pipeline detects speech boundaries. When a VADUserStoppedSpeakingFrame is received, the service commits the audio buffer so that the server begins transcription for the completed speech segment.

Server-side VAD (turn_detection=None or a configuration dict): The OpenAI server performs voice-activity detection. The service broadcasts UserStartedSpeakingFrame and UserStoppedSpeakingFrame when the server detects speech boundaries. Do not use a separate VAD processor in the pipeline in this mode.
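How the three possible turn_detection values (False, None, or a dict) map onto the two VAD modes can be sketched as a small pure function. The function name vad_plan and the keys of the returned dict are hypothetical and exist only for illustration; the real service may structure this decision differently.

```python
from typing import Any, Dict, Literal, Optional, Union

TurnDetection = Union[Dict[str, Any], Literal[False], None]


def vad_plan(turn_detection: TurnDetection) -> Dict[str, Any]:
    """Describe which side detects speech for a given turn_detection value.

    The identity check against False must come first: both None and an
    empty dict are falsy, but they select server-side VAD.
    """
    if turn_detection is False:
        # Local VAD: the pipeline's VAD processor marks the end of speech,
        # and the client commits the audio buffer itself.
        return {"mode": "local", "commit_on_local_vad_stop": True, "server_config": None}
    if turn_detection is None:
        # Server-side VAD with whatever defaults the server applies.
        return {"mode": "server", "commit_on_local_vad_stop": False, "server_config": "server defaults"}
    # Server-side VAD with caller-supplied configuration.
    return {"mode": "server", "commit_on_local_vad_stop": False, "server_config": dict(turn_detection)}


for value in (False, None, {"type": "server_vad", "threshold": 0.5}):
    print(repr(value), "->", vad_plan(value)["mode"])
```

The point of the sketch is the asymmetry between False and None: False turns server VAD off, while None leaves it on with server defaults.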

Audio is sent as 24 kHz 16-bit mono PCM, as required by the OpenAI Realtime API. If the pipeline runs at a different sample rate (e.g. 16 kHz for Silero VAD compatibility), audio is automatically resampled to 24 kHz before sending.
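To make the sample-rate conversion concrete, the following sketch resamples 16-bit mono PCM with plain linear interpolation, using only the standard library. It is an illustration under stated assumptions: resample_pcm16 is a hypothetical helper, and the service itself most likely uses a higher-quality resampler than linear interpolation.

```python
import array
import struct
import sys


def resample_pcm16(audio: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample little-endian 16-bit mono PCM by linear interpolation."""
    if src_rate == dst_rate or not audio:
        return audio
    samples = array.array("h")
    samples.frombytes(audio)
    if sys.byteorder == "big":
        samples.byteswap()  # PCM input is little-endian
    n_out = len(samples) * dst_rate // src_rate
    out = array.array("h", bytes(2 * n_out))  # zero-filled output buffer
    last = len(samples) - 1
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional position in the input
        left = int(pos)
        right = min(left + 1, last)        # clamp at the final input sample
        frac = pos - left
        out[i] = round(samples[left] * (1 - frac) + samples[right] * frac)
    if sys.byteorder == "big":
        out.byteswap()  # restore little-endian byte order for output
    return out.tobytes()


# 20 ms of audio at 16 kHz is 320 samples (640 bytes).
chunk_16k = struct.pack("<320h", *([1000] * 320))
chunk_24k = resample_pcm16(chunk_16k, 16_000, 24_000)
print(len(chunk_16k), len(chunk_24k))  # 640 960
```

Going from 16 kHz to 24 kHz multiplies the sample count by 1.5, so a 20 ms chunk grows from 640 to 960 bytes.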

Example:

from pipecat.services.openai.stt import OpenAIRealtimeSTTService

stt = OpenAIRealtimeSTTService(
    api_key="sk-...",
    settings=OpenAIRealtimeSTTService.Settings(
        model="gpt-4o-transcribe",
        noise_reduction="near_field",
    ),
)

Settings

alias of OpenAIRealtimeSTTSettings

__init__(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)[source]

Initialize the OpenAI Realtime STT service.

Parameters:
  • api_key – OpenAI API key for authentication.

  • model

    Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe".

    Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(model=...) instead.

  • base_url – WebSocket base URL for the Realtime API. Defaults to "wss://api.openai.com/v1/realtime".

  • language

    Language of the audio input. Defaults to English.

    Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(language=...) instead.

  • prompt

    Optional prompt text to guide transcription style or provide keyword hints.

    Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(prompt=...) instead.

  • turn_detection – Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).

  • noise_reduction

    Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.

    Deprecated since version 0.0.106: Use settings=OpenAIRealtimeSTTService.Settings(noise_reduction=...) instead.

  • should_interrupt – Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled. Defaults to True.

  • settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark

  • **kwargs – Additional arguments passed to parent WebsocketSTTService.
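When overriding ttfs_p99_latency for a specific deployment, the value has to come from measurements. The sketch below computes a 99th percentile from a list of measured latencies using the nearest-rank method. The function name p99_latency and the choice of nearest-rank are assumptions for illustration; the linked benchmark may compute its percentile differently.

```python
from typing import Sequence


def p99_latency(latencies_s: Sequence[float]) -> float:
    """Return the 99th-percentile latency using the nearest-rank method."""
    if not latencies_s:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_s)
    # Nearest rank is ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding at exact boundaries.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical measurements: time from end of speech to final transcript, in seconds.
measured = [0.8, 0.9, 1.1, 0.7, 1.4, 1.0, 0.95, 1.2, 2.3, 0.85]
print(p99_latency(measured))
```

The resulting number could then be passed as ttfs_p99_latency when constructing the service. With only a handful of samples, P99 is simply the maximum, so a meaningful estimate needs a few hundred measurements.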

can_generate_metrics() → bool[source]

Check if the service can generate processing metrics.

Returns:

True, as this service supports metrics generation.

async start(frame: StartFrame)[source]

Start the service and establish WebSocket connection.

Parameters:

frame – The start frame triggering service initialization.

async stop(frame: EndFrame)[source]

Stop the service and close WebSocket connection.

Parameters:

frame – The end frame triggering service shutdown.

async cancel(frame: CancelFrame)[source]

Cancel the service and close WebSocket connection.

Parameters:

frame – The cancel frame triggering service cancellation.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None][source]

Send audio data to the transcription session.

Audio is streamed over the WebSocket. Transcription results arrive asynchronously via the receive task and are pushed as InterimTranscriptionFrame or TranscriptionFrame.

Parameters:

audio – Raw audio bytes (16-bit mono PCM at the pipeline sample rate). Automatically resampled to 24 kHz.

Yields:

None — results are delivered via the WebSocket receive task.
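The pattern described here, an async generator that only sends audio and yields None while results arrive on a separate task, can be sketched without pipecat or a real WebSocket. Everything below is a hypothetical stand-in: SketchStreamingSTT is not the real service, and an asyncio.Queue plays the role of the network round trip.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class SketchStreamingSTT:
    """Hypothetical stand-in showing the send/receive split of a streaming STT service."""

    def __init__(self) -> None:
        self._wire: asyncio.Queue = asyncio.Queue()  # stands in for the WebSocket round trip
        self.transcripts: List[str] = []             # stands in for frames pushed downstream

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Send the audio and return immediately; no transcript is produced here.
        await self._wire.put(audio)
        yield None

    async def receive_task(self) -> None:
        # Results arrive independently of run_stt calls.
        while True:
            chunk = await self._wire.get()
            if chunk is None:  # shutdown signal
                break
            self.transcripts.append(f"transcript for {len(chunk)} bytes")


async def main():
    stt = SketchStreamingSTT()
    receiver = asyncio.create_task(stt.receive_task())
    yielded = []
    for chunk in (b"\x00" * 960, b"\x00" * 480):
        async for item in stt.run_stt(chunk):
            yielded.append(item)
    await stt._wire.put(None)  # ask the receiver to stop
    await receiver
    return yielded, stt.transcripts


yielded, transcripts = asyncio.run(main())
print(yielded)      # [None, None]
print(transcripts)  # ['transcript for 960 bytes', 'transcript for 480 bytes']
```

Separating sending from receiving keeps audio flowing to the server at a steady rate, regardless of how long each transcription takes.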

async process_frame(frame: Frame, direction: FrameDirection)[source]

Process frames from the pipeline.

Extends the base STT service to handle local VAD events when server-side VAD is disabled. On VADUserStoppedSpeakingFrame, commits the audio buffer so the server begins transcription for the completed speech segment.

Parameters:
  • frame – The frame to process.

  • direction – The direction of frame flow in the pipeline.
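The commit step can be pictured as a pair of JSON messages. The event types input_audio_buffer.append and input_audio_buffer.commit are client events of the OpenAI Realtime API. The helper names below and the assumption that the service sends exactly these payloads are illustrative only.

```python
import base64
import json
from typing import List


def append_event(pcm_24k: bytes) -> str:
    """Build a message that appends base64-encoded audio to the server's input buffer."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_24k).decode("ascii"),
        }
    )


def commit_event() -> str:
    """Build a message that commits the buffered audio as one finished speech segment."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def on_local_vad_stop(server_vad_enabled: bool) -> List[str]:
    """Return the messages to send when the local VAD reports end of speech.

    With server-side VAD enabled, the server decides where segments end,
    so the client sends no manual commit.
    """
    return [] if server_vad_enabled else [commit_event()]


print(append_event(b"\x00\x01"))
print(on_local_vad_stop(server_vad_enabled=False))
print(on_local_vad_stop(server_vad_enabled=True))
```

Without the commit, a server with turn detection disabled has no signal that the speech segment is complete, so it would keep buffering audio instead of transcribing it.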