stt
OpenAI Speech-to-Text service implementations.
Provides two STT services:
OpenAISTTService: REST-based transcription using the Audio API (Whisper / GPT-4o).
OpenAIRealtimeSTTService: WebSocket-based streaming transcription using the Realtime API in transcription-only mode.
- class pipecat.services.openai.stt.OpenAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, temperature: float | None | _NotGiven = <factory>)
Bases: BaseWhisperSTTSettings
Settings for the OpenAI STT service.
- class pipecat.services.openai.stt.OpenAISTTService(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = Language.EN, prompt: str | None = None, temperature: float | None = None, settings: OpenAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)
Bases: BaseWhisperSTTService
OpenAI Speech-to-Text service that generates text from audio.
Uses OpenAI’s transcription API to convert audio to text. Requires an OpenAI API key set via the api_key parameter or OPENAI_API_KEY environment variable.
- Settings
alias of OpenAISTTSettings
- __init__(*, model: str | None = None, api_key: str | None = None, base_url: str | None = None, language: Language | None = Language.EN, prompt: str | None = None, temperature: float | None = None, settings: OpenAISTTSettings | None = None, ttfs_p99_latency: float | None = 2.01, **kwargs)
Initialize OpenAI STT service.
- Parameters:
model – Model to use: either gpt-4o or Whisper.
Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(model=...) instead.
api_key – OpenAI API key. Defaults to None.
base_url – API base URL. Defaults to None.
language – Language of the audio input. Defaults to English.
Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(language=...) instead.
prompt – Optional text to guide the model’s style or continue a previous segment.
Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(prompt=...) instead.
temperature – Optional sampling temperature between 0 and 1. Defaults to 0.0.
Deprecated since version 0.0.105: Use settings=OpenAISTTService.Settings(temperature=...) instead.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to BaseWhisperSTTService.
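The deprecation notes above direct callers to pass model, language, prompt, and temperature through settings. A minimal sketch of that migration follows, assuming pipecat is installed with its OpenAI dependencies and that Settings accepts the keyword fields shown in the OpenAISTTSettings signature. The helper names stt_settings_kwargs and build_openai_stt are hypothetical, and the pipecat import is deferred so the module loads even where the package is absent.

```python
import os


def stt_settings_kwargs(model="gpt-4o-transcribe", language="en", prompt=None, temperature=None):
    """Collect only the settings that were explicitly chosen (hypothetical helper)."""
    fields = {"model": model, "language": language, "prompt": prompt, "temperature": temperature}
    # Fields left as None are omitted so the service keeps its own defaults for them.
    return {name: value for name, value in fields.items() if value is not None}


def build_openai_stt(api_key=None, **settings_kwargs):
    """Create the service using settings rather than the deprecated keyword arguments."""
    # Deferred import: needs the pipecat package to be installed.
    from pipecat.services.openai.stt import OpenAISTTService

    return OpenAISTTService(
        api_key=api_key or os.environ.get("OPENAI_API_KEY"),
        settings=OpenAISTTService.Settings(**stt_settings_kwargs(**settings_kwargs)),
    )
```

Calling build_openai_stt(prompt="Product names: Pipecat, Daily") would then replace the older OpenAISTTService(prompt=...) form.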
- class pipecat.services.openai.stt.OpenAIRealtimeSTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, prompt: str | None | _NotGiven = <factory>, noise_reduction: Literal['near_field', 'far_field'] | None | _NotGiven = <factory>)
Bases: STTSettings
Settings for OpenAIRealtimeSTTService.
- Parameters:
prompt – Optional prompt text to guide transcription style.
noise_reduction – Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.
- prompt: str | None | _NotGiven
- noise_reduction: Literal['near_field', 'far_field'] | None | _NotGiven
- class pipecat.services.openai.stt.OpenAIRealtimeSTTService(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)
Bases: WebsocketSTTService
OpenAI Realtime Speech-to-Text service using WebSocket transcription sessions.
Uses OpenAI’s Realtime API in transcription-only mode for real-time streaming speech recognition with optional server-side VAD and noise reduction. The model does not generate conversational responses — only transcription output.
This service supports two VAD modes:
Local VAD (default, turn_detection=False): Server-side VAD is disabled and a local VAD processor in the pipeline is used instead. When a VADUserStoppedSpeakingFrame is received, the service commits the audio buffer so that the server begins transcription for the completed speech segment.
Server-side VAD (turn_detection=None or a dict): The OpenAI server performs voice-activity detection. The service broadcasts UserStartedSpeakingFrame and UserStoppedSpeakingFrame when the server detects speech boundaries. Do not use a separate VAD processor in the pipeline in this mode.
Audio is sent as 24 kHz 16-bit mono PCM as required by the OpenAI Realtime API. If the pipeline runs at a different sample rate (e.g. 16 kHz for Silero VAD compatibility), audio is automatically resampled before sending.
Example:
    stt = OpenAIRealtimeSTTService(
        api_key="sk-...",
        settings=OpenAIRealtimeSTTService.Settings(
            model="gpt-4o-transcribe",
            noise_reduction="near_field",
        ),
    )
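The 16 kHz to 24 kHz case mentioned above is a 3:2 rate change. The sketch below illustrates that step with plain linear interpolation over 16-bit samples, using only the standard library. It is an illustration of the idea, not pipecat's actual resampler, and the function name upsample_pcm16 is hypothetical.

```python
from array import array


def upsample_pcm16(pcm: bytes, src_rate: int = 16000, dst_rate: int = 24000) -> bytes:
    """Linearly interpolate 16-bit mono PCM from src_rate to dst_rate."""
    # array("h") reads native-endian signed 16-bit samples (little-endian on most hosts).
    samples = array("h")
    samples.frombytes(pcm)
    if not samples or src_rate == dst_rate:
        return pcm
    out_len = len(samples) * dst_rate // src_rate
    out = array("h")
    for i in range(out_len):
        # Position of output sample i on the input timeline.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out.tobytes()
```

A production resampler would normally apply a low-pass filter as well; linear interpolation is used here only to keep the rate arithmetic visible.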
- Settings
alias of OpenAIRealtimeSTTSettings
- __init__(*, api_key: str, model: str | None = None, base_url: str = 'wss://api.openai.com/v1/realtime', language: Language | None = Language.EN, prompt: str | None = None, turn_detection: dict | Literal[False] | None = False, noise_reduction: Literal['near_field', 'far_field'] | None = None, should_interrupt: bool = True, settings: OpenAIRealtimeSTTSettings | None = None, ttfs_p99_latency: float | None = 1.66, **kwargs)
Initialize the OpenAI Realtime STT service.
- Parameters:
api_key – OpenAI API key for authentication.
model – Transcription model. Supported values are "gpt-4o-transcribe" and "gpt-4o-mini-transcribe".
Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(model=...) instead.
base_url – WebSocket base URL for the Realtime API. Defaults to "wss://api.openai.com/v1/realtime".
language – Language of the audio input. Defaults to English.
Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(language=...) instead.
prompt – Optional prompt text to guide transcription style or provide keyword hints.
Deprecated since version 0.0.105: Use settings=OpenAIRealtimeSTTService.Settings(prompt=...) instead.
turn_detection – Server-side VAD configuration. Defaults to False (disabled), which relies on a local VAD processor in the pipeline. Pass None to use server defaults (server_vad), or a dict with custom settings (e.g. {"type": "server_vad", "threshold": 0.5}).
noise_reduction – Noise reduction mode. "near_field" for close microphones, "far_field" for distant microphones, or None to disable.
Deprecated since version 0.0.106: Use settings=OpenAIRealtimeSTTService.Settings(noise_reduction=...) instead.
should_interrupt – Whether to interrupt bot output when speech is detected by server-side VAD. Only applies when turn detection is enabled. Defaults to True.
settings – Runtime-updatable settings. When provided alongside deprecated parameters, settings values take precedence.
ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. Override for your deployment. See https://github.com/pipecat-ai/stt-benchmark
**kwargs – Additional arguments passed to parent WebsocketSTTService.
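Because turn_detection distinguishes False from None, and both are falsy in Python, a plain truthiness test cannot tell the two modes apart. The sketch below spells out the three cases described above using identity checks. The helper vad_mode and the mode labels it returns are hypothetical and are not part of pipecat.

```python
from typing import Literal, Union

TurnDetection = Union[dict, Literal[False], None]


def vad_mode(turn_detection: TurnDetection = False) -> str:
    """Name the VAD mode implied by a turn_detection value (hypothetical helper)."""
    if turn_detection is False:
        # Server-side VAD disabled: a local VAD processor must be in the pipeline.
        return "local"
    if turn_detection is None:
        # Server applies its default server_vad configuration.
        return "server-default"
    if isinstance(turn_detection, dict):
        # Caller-supplied server-side VAD configuration.
        return "server-custom"
    raise TypeError("turn_detection must be False, None, or a dict")
```

Code that wraps the service can use the same identity checks before deciding whether to add a local VAD processor to the pipeline.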
- can_generate_metrics() → bool
Check if the service can generate processing metrics.
- Returns:
True, as this service supports metrics generation.
- async start(frame: StartFrame)
Start the service and establish WebSocket connection.
- Parameters:
frame – The start frame triggering service initialization.
- async stop(frame: EndFrame)
Stop the service and close WebSocket connection.
- Parameters:
frame – The end frame triggering service shutdown.
- async cancel(frame: CancelFrame)
Cancel the service and close WebSocket connection.
- Parameters:
frame – The cancel frame triggering service cancellation.
- async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]
Send audio data to the transcription session.
Audio is streamed over the WebSocket. Transcription results arrive asynchronously via the receive task and are pushed as InterimTranscriptionFrame or TranscriptionFrame.
- Parameters:
audio – Raw audio bytes (16-bit mono PCM at the pipeline sample rate). Automatically resampled to 24 kHz.
- Yields:
None. Results are delivered via the WebSocket receive task.
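For orientation, the OpenAI Realtime API accepts audio through the input_audio_buffer.append client event, which carries base64-encoded bytes. The sketch below shows how one chunk could be serialized. It assumes the service sends this standard event; the exact message construction inside pipecat is not shown on this page, and the function name audio_append_event is hypothetical.

```python
import base64
import json


def audio_append_event(pcm_24k: bytes) -> str:
    """Serialize one chunk of 24 kHz 16-bit mono PCM as a Realtime API client event."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            # The Realtime API expects audio bytes as base64 text inside the JSON message.
            "audio": base64.b64encode(pcm_24k).decode("ascii"),
        }
    )
```

Nothing is returned to the caller for each chunk, which matches the generator yielding None while transcripts arrive separately.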
- async process_frame(frame: Frame, direction: FrameDirection)
Process frames from the pipeline.
Extends the base STT service to handle local VAD events when server-side VAD is disabled. On VADUserStoppedSpeakingFrame, commits the audio buffer so the server begins transcription for the completed speech segment.
- Parameters:
frame – The frame to process.
direction – The direction of frame flow in the pipeline.
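The commit step described above can be sketched as a small decision function. It assumes the commit is sent as the Realtime API's input_audio_buffer.commit client event. The VADUserStoppedSpeakingFrame class below is a local stand-in for pipecat's frame of the same name, and commit_event_for is a hypothetical helper, not a method of the service.

```python
import json
from typing import Optional


class VADUserStoppedSpeakingFrame:
    """Stand-in for pipecat's frame of the same name (illustration only)."""


def commit_event_for(frame: object, server_vad_enabled: bool) -> Optional[str]:
    """Return the commit event to send for this frame, or None if nothing should be sent."""
    if server_vad_enabled:
        # With server-side VAD the server decides segment boundaries itself.
        return None
    if isinstance(frame, VADUserStoppedSpeakingFrame):
        # Local VAD reported end of speech: ask the server to transcribe the buffer.
        return json.dumps({"type": "input_audio_buffer.commit"})
    return None
```

All other frames pass through without triggering a commit, so only a local end-of-speech event starts transcription of the buffered segment.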