stt

OpenAI Speech-to-Text service implementations.

Provides two STT services:

OpenAISTTService: REST-based transcription using the Audio API (Whisper / GPT-4o).
OpenAIRealtimeSTTService: WebSocket-based streaming transcription using the Realtime API in transcription-only mode.

Bases: BaseWhisperSTTSettings

Settings for the OpenAI STT service.

Bases: BaseWhisperSTTService

OpenAI Speech-to-Text service that generates text from audio.

Uses OpenAI’s transcription API to convert audio to text. Requires an OpenAI API key, supplied via the api_key parameter or the OPENAI_API_KEY environment variable.

Settings: alias of OpenAISTTSettings

Initialize OpenAI STT service.

Parameters:

model – Model to use: either gpt-4o or Whisper. Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(model=...) instead.
api_key – OpenAI API key. Defaults to None.
base_url – API base URL. Defaults to None.
language – Language of the audio input. Defaults to English. Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(language=...) instead.
prompt – Optional text to guide the model’s style or continue a previous segment. Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(prompt=...) instead.
temperature – Optional sampling temperature between 0 and 1. Defaults to 0.0. Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(temperature=...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to BaseWhisperSTTService.
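
The precedence rule between settings and the deprecated keyword arguments can be illustrated with a small stand-alone sketch. SketchSettings and resolve_settings are hypothetical names invented for this illustration; they are not part of pipecat and only mimic the documented behavior (deprecated arguments still work, but values set on settings win).

```python
import warnings
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchSettings:
    """Hypothetical stand-in for OpenAISTTService.Settings (illustration only)."""

    model: Optional[str] = None
    language: Optional[str] = None
    prompt: Optional[str] = None
    temperature: Optional[float] = None


def resolve_settings(
    settings: Optional[SketchSettings] = None,
    *,
    model: Optional[str] = None,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    temperature: Optional[float] = None,
) -> SketchSettings:
    """Merge deprecated keyword arguments with a settings object.

    Deprecated arguments are applied first; any field set on `settings`
    then overwrites them, so settings values take precedence.
    """
    resolved = SketchSettings()
    deprecated = {
        "model": model,
        "language": language,
        "prompt": prompt,
        "temperature": temperature,
    }
    for name, value in deprecated.items():
        if value is not None:
            warnings.warn(
                f"'{name}' is deprecated; pass it via settings instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            setattr(resolved, name, value)
    if settings is not None:
        for field in fields(settings):
            value = getattr(settings, field.name)
            if value is not None:
                setattr(resolved, field.name, value)
    return resolved
```

In this sketch, calling resolve_settings(SketchSettings(model="gpt-4o-transcribe"), model="whisper-1") emits a DeprecationWarning for model and resolves to the value from settings. New code can avoid the warning entirely by passing everything through settings.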

Bases: STTSettings

Settings for OpenAIRealtimeSTTService.

Parameters:

prompt – Optional prompt text to guide transcription style.
noise_reduction – Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.

prompt: str | None | _NotGiven

noise_reduction: Literal['near_field', 'far_field'] | None | _NotGiven
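
The str | None | _NotGiven annotations suggest three states per field: a concrete value, an explicit None, and "not supplied". The sketch below shows one way such a sentinel could behave during a runtime update. _NotGivenSketch, NOT_GIVEN, RealtimeSettingsSketch, and apply_update are hypothetical names, and the rule that not-given fields are left unchanged is an assumption, not something this page states.

```python
from dataclasses import dataclass, fields, replace
from typing import Literal, Union


class _NotGivenSketch:
    """Hypothetical sentinel type meaning "this field was not supplied"."""

    def __repr__(self) -> str:
        return "NOT_GIVEN"


NOT_GIVEN = _NotGivenSketch()

NoiseReduction = Union[Literal["near_field", "far_field"], None, _NotGivenSketch]


@dataclass(frozen=True)
class RealtimeSettingsSketch:
    """Hypothetical stand-in for OpenAIRealtimeSTTSettings (illustration only)."""

    prompt: Union[str, None, _NotGivenSketch] = NOT_GIVEN
    noise_reduction: NoiseReduction = NOT_GIVEN


def apply_update(
    current: RealtimeSettingsSketch, update: RealtimeSettingsSketch
) -> RealtimeSettingsSketch:
    """Return a copy of `current` with every supplied field of `update` applied.

    Fields left as NOT_GIVEN are skipped; an explicit None is applied,
    which here means "disable" for noise_reduction.
    """
    changes = {
        f.name: getattr(update, f.name)
        for f in fields(update)
        if getattr(update, f.name) is not NOT_GIVEN
    }
    return replace(current, **changes)
```

The sentinel is what lets an update say "turn noise reduction off" (None) without also clearing the prompt, which a plain None default could not express.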

class pipecat.services.openai.stt.OpenAIRealtimeSTTService(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)

Bases: WebsocketSTTService

OpenAI Realtime Speech-to-Text service using WebSocket transcription sessions.

Uses OpenAI’s Realtime API in transcription-only mode for real-time streaming speech recognition with optional server-side VAD and noise reduction. The model does not generate conversational responses — only transcription output.

This service supports two VAD modes:

Local VAD (turn_detection=False, the default): Server-side VAD is disabled and a local VAD processor in the pipeline is used instead. When a VADUserStoppedSpeakingFrame is received, the service commits the audio buffer so that the server begins transcription for the completed speech segment.

Server-side VAD (turn_detection=None): The OpenAI server performs voice-activity detection. The service broadcasts UserStartedSpeakingFrame and UserStoppedSpeakingFrame when the server detects speech boundaries. Do not use a separate VAD processor in the pipeline in this mode.
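
The turn_detection argument therefore takes one of three kinds of value: False, None, or a dict. The helper below is a minimal sketch of picking the right one; choose_turn_detection and needs_local_vad are hypothetical names that are not part of pipecat, and the dict shape follows the example given in the parameter description.

```python
from typing import Literal, Optional, Union

# The three kinds of value accepted by the turn_detection argument.
TurnDetection = Union[dict, Literal[False], None]


def choose_turn_detection(
    use_server_vad: bool, threshold: Optional[float] = None
) -> TurnDetection:
    """Pick a turn_detection value for the desired VAD mode."""
    if not use_server_vad:
        # Local VAD: server-side detection is disabled.
        return False
    if threshold is None:
        # Server-side VAD with the server's default settings.
        return None
    # Server-side VAD with a custom threshold.
    return {"type": "server_vad", "threshold": threshold}


def needs_local_vad(turn_detection: TurnDetection) -> bool:
    """True when the pipeline must contain its own VAD processor."""
    return turn_detection is False
```

Note that needs_local_vad tests identity with False rather than truthiness: None is falsy too, but it selects server-side VAD.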

Audio is sent as 24 kHz 16-bit mono PCM as required by the OpenAI Realtime API. If the pipeline runs at a different sample rate (e.g. 16 kHz for Silero VAD compatibility), audio is automatically upsampled before sending.
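
To make the upsampling step concrete, here is a minimal sketch that converts 16-bit mono PCM to 24 kHz by linear interpolation using only the standard library. resample_pcm16_linear is a hypothetical helper; this page does not say which resampler the service uses, and production resamplers typically apply proper filtering rather than plain interpolation.

```python
from array import array


def resample_pcm16_linear(pcm: bytes, src_rate: int, dst_rate: int = 24_000) -> bytes:
    """Resample 16-bit mono PCM (native byte order) by linear interpolation."""
    samples = array("h")
    samples.frombytes(pcm)
    if src_rate == dst_rate or len(samples) == 0:
        return pcm

    out_len = len(samples) * dst_rate // src_rate
    out = array("h", bytes(2 * out_len))  # zero-filled output buffer
    last = len(samples) - 1
    for i in range(out_len):
        # Fractional position of output sample i on the input timeline.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, last)
        frac = pos - left
        out[i] = round(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out.tobytes()
```

For example, a 10 ms chunk at 16 kHz (160 samples, 320 bytes) becomes 240 samples (480 bytes) at 24 kHz.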

Example:

stt = OpenAIRealtimeSTTService(
    api_key="sk-...",
    settings=OpenAIRealtimeSTTService.Settings(
        model="gpt-4o-transcribe",
        noise_reduction="near_field",
    ),
)

Settings: alias of OpenAIRealtimeSTTSettings

__init__(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)

Initialize the OpenAI Realtime STT service.

Parameters:

api_key – OpenAI API key for authentication.
model – Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe". Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(model=...) instead.
base_url – WebSocket base URL for the Realtime API. Defaults to "wss://api.openai.com/v1/realtime".
language – Language of the audio input. Defaults to English. Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(language=...) instead.
prompt – Optional prompt text to guide transcription style or provide keyword hints. Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(prompt=...) instead.
turn_detection – Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).
noise_reduction – Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable. Deprecated since version 0.0.106: Use settings=OpenAIRealtimeSTTService.Settings(noise_reduction=...) instead.
should_interrupt – Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled. Defaults to True.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to parent WebsocketSTTService.
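
Overriding ttfs_p99_latency requires a P99 figure measured on your own deployment. The sketch below computes a nearest-rank 99th percentile from latency samples in seconds. p99_latency is a hypothetical helper, and nearest-rank is only one of several percentile definitions; the linked benchmark may use a different one.

```python
import math
from typing import List


def p99_latency(latencies_s: List[float]) -> float:
    """Nearest-rank 99th percentile of latency samples, in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank: smallest value with at least 99% of samples at or below it.
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]
```

The result can then be passed as ttfs_p99_latency=p99_latency(samples) when constructing the service, in place of the 1.66 default shown in the signature.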

can_generate_metrics() → bool

Check if the service can generate processing metrics.

Returns: True, as this service supports metrics generation.

async start(frame: StartFrame)

Start the service and establish WebSocket connection.

Parameters: frame – The start frame triggering service initialization.

async stop(frame: EndFrame)

Stop the service and close WebSocket connection.

Parameters: frame – The end frame triggering service shutdown.

async cancel(frame: CancelFrame)

Cancel the service and close WebSocket connection.

Parameters: frame – The cancel frame triggering service cancellation.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Send audio data to the transcription session.

Audio is streamed over the WebSocket. Transcription results arrive asynchronously via the receive task and are pushed as InterimTranscriptionFrame or TranscriptionFrame.

Parameters: audio – Raw audio bytes (16-bit mono PCM at the pipeline sample rate). Automatically resampled to 24 kHz.
Yields: None. Results are delivered via the WebSocket receive task.
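
The split between sending audio in run_stt and receiving transcripts in a separate task can be sketched with plain asyncio. StreamingSTTSketch is a hypothetical stand-in with no network I/O: a queue plays the role of the WebSocket, strings play the role of transcription frames, and the empty-chunk shutdown signal exists only in this sketch.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class StreamingSTTSketch:
    """Hypothetical stand-in for a WebSocket STT service (no network I/O)."""

    def __init__(self) -> None:
        self._outbound: asyncio.Queue = asyncio.Queue()  # stands in for the socket
        self.pushed: List[str] = []  # stands in for frames pushed downstream

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Only send the audio; no transcript is produced here.
        await self._outbound.put(audio)
        yield None

    async def receive_task(self) -> None:
        # Fake "server": turns every received chunk into one transcript.
        while True:
            chunk = await self._outbound.get()
            if chunk == b"":  # empty chunk is this sketch's shutdown signal
                break
            self.pushed.append(f"transcript for {len(chunk)} bytes")


async def demo() -> List[str]:
    service = StreamingSTTSketch()
    receiver = asyncio.create_task(service.receive_task())
    for chunk in (b"\x00" * 320, b"\x00" * 640):
        async for frame in service.run_stt(chunk):
            assert frame is None  # run_stt itself never yields a transcript
    await service._outbound.put(b"")
    await receiver
    return service.pushed
```

Running asyncio.run(demo()) collects both transcripts from the receive task, while every value yielded by run_stt is None. Decoupling the two sides this way keeps audio flowing even when transcripts lag behind.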

async process_frame(frame: Frame, direction: FrameDirection)

Process frames from the pipeline.

Extends the base STT service to handle local VAD events when server-side VAD is disabled. On VADUserStoppedSpeakingFrame, commits the audio buffer so the server begins transcription for the completed speech segment.

Parameters:

frame – The frame to process.
direction – The direction of frame flow in the pipeline.
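
For orientation, the sketch below builds the kind of JSON messages a client could send to append audio and then commit the buffer at the end of a speech segment. The event names input_audio_buffer.append and input_audio_buffer.commit come from OpenAI's Realtime API; that this service sends exactly these messages is an assumption, and append_event, commit_event, and events_for_segment are hypothetical helpers.

```python
import base64
import json
from typing import List


def append_event(pcm_24k: bytes) -> str:
    """JSON message that appends base64-encoded PCM audio to the input buffer."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_24k).decode("ascii"),
        }
    )


def commit_event() -> str:
    """JSON message that commits the buffered audio for transcription."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def events_for_segment(chunks: List[bytes]) -> List[str]:
    """Messages for one speech segment: every chunk, then a single commit."""
    return [append_event(chunk) for chunk in chunks] + [commit_event()]
```

With local VAD, the commit message corresponds to the moment a VADUserStoppedSpeakingFrame arrives; with server-side VAD the server decides segment boundaries itself, so no explicit commit would be needed.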