stt

xAI speech-to-text service implementation.

This module provides integration with xAI’s real-time speech-to-text WebSocket API documented at https://docs.x.ai/developers/rest-api-reference/inference/voice.

pipecat.services.xai.stt.language_to_xai_stt_language(language: Language) → str | None

Convert a Language enum to the xAI STT language code.

xAI STT accepts two-letter language codes (e.g. en, fr, de, ja). When a language is set, the server applies inverse text normalization (ITN), which renders spoken forms such as numbers and dates in their written form.

Parameters:

language – The Language enum value to convert.

Returns:

The corresponding xAI STT language code, or None if not supported.
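The conversion described above can be sketched without the library. This is a minimal illustration only: it works on plain locale strings rather than the Language enum, the helper name to_two_letter_code is hypothetical, and the set of supported codes is an assumption built from the four examples listed above, not the service's actual list.

```python
from typing import Optional

# Assumed subset of supported codes, taken from the examples in the text.
# The real service may accept more (or different) codes.
SUPPORTED_CODES = {"en", "fr", "de", "ja"}


def to_two_letter_code(tag: str) -> Optional[str]:
    """Reduce a locale tag such as "en-US" to its two-letter base code.

    Returns None when the base code is not in the assumed supported set,
    mirroring the "None if not supported" contract described above.
    """
    base = tag.replace("_", "-").split("-")[0].lower()
    return base if base in SUPPORTED_CODES else None
```

Returning None rather than raising lets the caller decide whether to omit the language parameter or fall back to a default.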

class pipecat.services.xai.stt.XAISTTSettings(model: str | None | _NotGiven = <factory>, extra: dict[str, Any] = <factory>, language: Language | str | None | _NotGiven = <factory>, interim_results: bool | _NotGiven = <factory>, endpointing: int | None | _NotGiven = <factory>, multichannel: bool | None | _NotGiven = <factory>, channels: int | None | _NotGiven = <factory>, diarize: bool | None | _NotGiven = <factory>)

Bases: STTSettings

Settings for XAISTTService.

Parameters:
  • interim_results – When True, partial transcripts are emitted approximately every 500ms.

  • endpointing – Silence duration in milliseconds that triggers a speech-final event. Range 0-5000. Server default is 10ms.

  • multichannel – When True, transcribes each interleaved channel independently. Requires channels >= 2.

  • channels – Number of interleaved channels (2-8). Required when multichannel is True.

  • diarize – When True, the server attaches a speaker field to each word identifying the detected speaker.

interim_results: bool | _NotGiven
endpointing: int | None | _NotGiven
multichannel: bool | None | _NotGiven
channels: int | None | _NotGiven
diarize: bool | None | _NotGiven
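
The constraints documented for these settings (endpointing within 0-5000 ms, channels within 2-8, and channels required when multichannel is enabled) could be checked before connecting. The sketch below uses a stand-in dataclass, SketchSettings, and a hypothetical validate helper; neither is part of pipecat, and whether XAISTTSettings performs any such validation itself is not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchSettings:
    """Stand-in for the documented fields; not the real XAISTTSettings."""

    interim_results: bool = True
    endpointing: Optional[int] = None    # silence duration in milliseconds
    multichannel: Optional[bool] = None
    channels: Optional[int] = None
    diarize: Optional[bool] = None


def validate(settings: SketchSettings) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    problems: List[str] = []
    if settings.endpointing is not None and not 0 <= settings.endpointing <= 5000:
        problems.append("endpointing must be within 0-5000 ms")
    if settings.channels is not None and not 2 <= settings.channels <= 8:
        problems.append("channels must be within 2-8")
    if settings.multichannel and settings.channels is None:
        problems.append("multichannel requires channels")
    return problems
```

For example, validate(SketchSettings(multichannel=True)) reports the missing channels value, while adding channels=2 yields an empty list.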
class pipecat.services.xai.stt.XAISTTService(*, api_key: str, ws_url: str = 'wss://api.x.ai/v1/stt', sample_rate: int = 16000, encoding: str = 'pcm', settings: XAISTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)

Bases: WebsocketSTTService

xAI real-time speech-to-text service.

Streams audio to xAI’s WebSocket STT endpoint and emits interim and final transcription frames. The XAI_API_KEY is passed directly as a Bearer token on the WebSocket handshake.
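
As an illustration of the Bearer scheme mentioned above, the handshake header could be built as follows. The helper name bearer_headers is hypothetical; how the header is attached to the handshake depends on the WebSocket client library in use and is not shown.

```python
from typing import Dict


def bearer_headers(api_key: str) -> Dict[str, str]:
    """Build the Authorization header for a Bearer-token handshake."""
    if not api_key:
        raise ValueError("api_key must be non-empty")
    return {"Authorization": f"Bearer {api_key}"}
```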

The connection is persistent: audio is streamed continuously and the server emits transcript.partial events with is_final and speech_final flags to mark utterance boundaries. If the connection drops mid-session, the base class reconnects automatically.
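
One way to interpret the two flags is sketched below. The wire format is an assumption: the sketch supposes JSON messages with a "type" field and a "text" field, which this page does not confirm; only the transcript.partial event name and the is_final and speech_final flags come from the description above. The classify_event helper and its labels are hypothetical.

```python
import json
from typing import Optional, Tuple


def classify_event(raw: str) -> Optional[Tuple[str, str]]:
    """Map a server message to (label, text), or None for other events.

    Assumed message shape (not confirmed by the documentation):
    {"type": "transcript.partial", "text": "...",
     "is_final": bool, "speech_final": bool}
    """
    msg = json.loads(raw)
    if msg.get("type") != "transcript.partial":
        return None
    text = msg.get("text", "")
    if msg.get("speech_final"):
        # The utterance has ended; downstream code can close the turn.
        return ("utterance_end", text)
    if msg.get("is_final"):
        # This segment will not be revised, but the utterance continues.
        return ("final", text)
    return ("interim", text)
```

Checking speech_final before is_final means a message carrying both flags is treated as the end of the utterance.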

Settings

alias of XAISTTSettings

__init__(*, api_key: str, ws_url: str = 'wss://api.x.ai/v1/stt', sample_rate: int = 16000, encoding: str = 'pcm', settings: XAISTTSettings | None = None, ttfs_p99_latency: float | None = 1.0, **kwargs)

Initialize the xAI STT service.

Parameters:
  • api_key – xAI API key (sent as a Bearer token in the WebSocket handshake).

  • ws_url – WebSocket endpoint URL. Defaults to wss://api.x.ai/v1/stt.

  • sample_rate – Audio sample rate in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000. Defaults to 16000.

  • encoding – Audio encoding. One of "pcm" (signed 16-bit LE), "mulaw", or "alaw". Defaults to "pcm".

  • settings – Runtime-updatable settings overriding defaults.

  • ttfs_p99_latency – P99 latency from speech end to final transcript in seconds. See https://github.com/pipecat-ai/stt-benchmark.

  • **kwargs – Additional arguments passed to WebsocketSTTService.
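
The sample_rate and encoding parameters together determine how many bytes a chunk of audio occupies: 16-bit PCM uses two bytes per sample, while mu-law and A-law use one. The sketch below computes the chunk size for mono audio; the helper name chunk_size_bytes and the 20 ms default chunk duration are assumptions for illustration, not values taken from the service.

```python
# Values listed in the parameter documentation above.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 22050, 24000, 44100, 48000)
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def chunk_size_bytes(sample_rate: int, encoding: str = "pcm", chunk_ms: int = 20) -> int:
    """Return the byte length of one mono audio chunk of chunk_ms milliseconds."""
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding}")
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]
```

With the defaults documented above (16000 Hz, "pcm"), a 20 ms chunk is 320 samples, or 640 bytes.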

can_generate_metrics() → bool

Check if the service can generate metrics.

Returns:

True if metrics generation is supported.

language_to_service_language(language: Language) → str | None

Convert a Language enum to the xAI STT language code.

async start(frame: StartFrame)

Start the speech-to-text service.

async stop(frame: EndFrame)

Stop the speech-to-text service.

async cancel(frame: CancelFrame)

Cancel the speech-to-text service.

async run_stt(audio: bytes) → AsyncGenerator[Frame | None, None]

Forward raw audio bytes to the xAI STT WebSocket.

Transcription frames are pushed from the receive task rather than yielded from this async generator.
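
The split between sending and receiving can be illustrated with plain asyncio. Everything in this sketch is a stand-in: FakeSocket replaces the real WebSocket, SketchSTT is not the XAISTTService implementation, and strings take the place of pipecat frames. It shows only the pattern of a generator that forwards audio and yields None while a separate task collects results.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FakeSocket:
    """Hypothetical stand-in for a WebSocket: one transcript per audio chunk."""

    def __init__(self) -> None:
        self._incoming: asyncio.Queue = asyncio.Queue()

    async def send(self, data: bytes) -> None:
        await self._incoming.put(f"transcript for {len(data)} bytes")

    async def recv(self) -> str:
        return await self._incoming.get()


class SketchSTT:
    def __init__(self, socket: FakeSocket) -> None:
        self._socket = socket
        self.pushed: List[str] = []  # stands in for frames pushed downstream

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Optional[str], None]:
        # Forward the audio and yield nothing useful: results arrive elsewhere.
        await self._socket.send(audio)
        yield None

    async def receive_loop(self, expected: int) -> None:
        # Separate task: reads server messages and pushes them downstream.
        for _ in range(expected):
            self.pushed.append(await self._socket.recv())


async def demo() -> List[str]:
    stt = SketchSTT(FakeSocket())
    receiver = asyncio.create_task(stt.receive_loop(expected=2))
    for chunk in (b"\x00" * 320, b"\x00" * 640):
        async for frame in stt.run_stt(chunk):
            assert frame is None  # nothing is yielded from the send path
    await receiver
    return stt.pushed
```

Running asyncio.run(demo()) returns the two transcripts collected by the receive loop, even though run_stt itself only ever yielded None.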