string

Text processing utilities for sentence boundary detection and tag parsing.

This module provides utilities for natural language text processing including sentence boundary detection, email and number pattern handling, and XML-style tag parsing for structured text content.

Dependencies:

This module uses NLTK (Natural Language Toolkit) for robust sentence tokenization. NLTK is licensed under the Apache License 2.0.
See: https://www.nltk.org/
Source: https://www.nltk.org/api/nltk.tokenize.punkt.html

pipecat.utils.string.replace_match(text: str, match: Match, old: str, new: str) → str

Replace occurrences of a substring within a matched section of text.

Parameters:

text – The input text in which replacements will be made.
match – A regex match object representing the section of text to modify.
old – The substring to be replaced.
new – The substring to replace old with.

Returns:

The modified text with the specified replacements made within the matched section.
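
As an illustration of the behavior described above, here is a minimal sketch rather than the library's source. It assumes the replacement is confined to the span between match.start() and match.end(); the helper name replace_within_match, the email pattern, and the sample text are all hypothetical.

```python
import re


def replace_within_match(text: str, match: re.Match, old: str, new: str) -> str:
    """Hypothetical stand-in: replace `old` with `new` only inside the matched span."""
    start, end = match.start(), match.end()
    # Text outside [start, end) is copied through untouched.
    return text[:start] + text[start:end].replace(old, new) + text[end:]


text = "mail a.b@c.com now. ok."
# Illustrative email pattern, not the one used by the library.
email = re.search(r"\w+(?:\.\w+)*@\w+(?:\.\w+)+", text)
masked = replace_within_match(text, email, ".", "&")
print(masked)
```

Only the periods inside the matched address are replaced; the sentence-ending periods elsewhere in the text are left alone. Masking of this kind is one plausible way to keep email addresses from being mistaken for sentence boundaries.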

pipecat.utils.string.match_endofsentence(text: str) → int

Find the position of the end of a sentence in the provided text.

This function uses NLTK’s sentence tokenizer to detect sentence boundaries in the input text, combined with punctuation verification to ensure that single tokens without proper sentence endings aren’t considered complete sentences.

Parameters:

text – The input text in which to find the end of the sentence.

Returns:

The position of the end of the sentence if found, otherwise 0.
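
The return contract (a position when a sentence is complete, 0 otherwise) lends itself to buffering streamed text. The sketch below illustrates that pattern only. It swaps NLTK's tokenizer for a plain regular expression, so the function end_of_sentence_sketch is a hypothetical stand-in, and it assumes the returned position is the index just past the sentence terminator.

```python
import re

# Assumption: a plain regex stands in for NLTK's Punkt tokenizer here.
_SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def end_of_sentence_sketch(text: str) -> int:
    """Return the index just past the first sentence terminator, or 0 if none."""
    found = _SENTENCE_END.search(text)
    return found.end() if found else 0


# Accumulate streamed chunks until a complete sentence is available.
buffer = ""
sentences = []
for chunk in ["Hello ", "there. How", " are you"]:
    buffer += chunk
    end = end_of_sentence_sketch(buffer)
    if end:  # 0 is falsy, meaning "no complete sentence yet"
        sentences.append(buffer[:end])
        buffer = buffer[end:].lstrip()
```

Unlike a trained tokenizer, this regex would also split after abbreviations such as "Dr.", which is the kind of case a tokenizer-based approach is meant to handle.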

pipecat.utils.string.parse_start_end_tags(text: str, tags: Sequence[tuple[str, str]], current_tag: tuple[str, str] | None, current_tag_index: int) → tuple[tuple[str, str] | None, int]

Parse text to identify start and end tag pairs.

If a start tag was previously found (i.e., current_tag is valid), wait for the corresponding end tag. Otherwise, wait for a start tag.

This function returns the index in the text where parsing should continue in the next call and the current or new tags.

Parameters:

text – The text to be parsed.
tags – List of tuples containing start and end tags.
current_tag – The currently active tags, if any.
current_tag_index – The current index in the text.

Returns:

A tuple containing the currently active tag pair (or None if no tag is open) and the index in the text where parsing should continue on the next call.
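
The two-state behavior described above (waiting for a start tag, or waiting for the matching end tag) can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: parse_tags_sketch is a hypothetical name, the choice of which index to return in each state is a guess, and the `<spell>` tag pair is made up for the example.

```python
from typing import Optional, Sequence, Tuple

Tag = Tuple[str, str]


def parse_tags_sketch(
    text: str,
    tags: Sequence[Tag],
    current_tag: Optional[Tag],
    current_tag_index: int,
) -> Tuple[Optional[Tag], int]:
    """Hypothetical simplification of the start/end tag state machine."""
    rest = text[current_tag_index:]
    if current_tag is not None:
        # A tag is open: only its end tag can change the state.
        if current_tag[1] in rest:
            return None, len(text)
        return current_tag, current_tag_index
    for start, end in tags:
        pos = rest.find(start)
        if pos == -1:
            continue
        if end in rest[pos + len(start):]:
            return None, len(text)  # opened and closed within this text
        return (start, end), len(text)  # opened, still waiting for the end tag
    return None, current_tag_index


tags = [("<spell>", "</spell>")]
first = "Say <spell>AB"
state, index = parse_tags_sketch(first, tags, None, 0)

# More text arrives; pass the previous state back in.
second = first + "</spell> ok"
closed, next_index = parse_tags_sketch(second, tags, state, index)
```

The caller threads the returned tag and index back into the next call, so the function itself holds no state between calls.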

class pipecat.utils.string.TextPartForConcatenation(text: str, includes_inter_part_spaces: bool)

Bases: object

Class representing a part of text for concatenation with concatenate_aggregated_text.

Parameters:

text – The text content.
includes_inter_part_spaces – Whether any necessary inter-part (leading/trailing) spaces are already included in the text.

text: str

includes_inter_part_spaces: bool

pipecat.utils.string.concatenate_aggregated_text(text_parts: list[TextPartForConcatenation]) → str

Concatenate a list of text parts into a single string.

The join takes into account whether each part already contains its own inter-part spacing, as indicated by the part's includes_inter_part_spaces flag.

This function is useful for aggregating text segments received from LLMs or transcription services.

Parameters:

text_parts – A list of text parts to concatenate.

Returns:

A single concatenated string.
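
To make the role of the spacing flag concrete, here is a minimal sketch under stated assumptions. TextPart and join_parts_sketch are hypothetical stand-ins for TextPartForConcatenation and concatenate_aggregated_text, and the rule used here (insert a space only when neither adjacent part carries its own spacing) is a guess at the intended behavior, particularly for lists that mix both kinds of part.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextPart:
    """Hypothetical stand-in for TextPartForConcatenation."""

    text: str
    includes_inter_part_spaces: bool


def join_parts_sketch(parts: List[TextPart]) -> str:
    """Join parts, adding a space only between parts that lack their own spacing."""
    result = ""
    previous: Optional[TextPart] = None
    for part in parts:
        # Assumed rule: separate two parts only if neither includes its spacing.
        needs_space = (
            previous is not None
            and not previous.includes_inter_part_spaces
            and not part.includes_inter_part_spaces
        )
        result += (" " if needs_space else "") + part.text
        previous = part
    return result


# Word-like segments without spacing, e.g. from a transcription service.
words = [TextPart("Hello", False), TextPart("world.", False)]
# Token-like segments that already carry their spacing, e.g. from an LLM.
tokens = [TextPart("Hel", True), TextPart("lo wor", True), TextPart("ld.", True)]
```

Under this rule, both lists join to the same sentence: the first gets a separating space inserted, while the second is concatenated as-is.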