2.3. uniseg.wordbreak — Word Break

Unicode word boundaries.

UAX #29: Unicode Text Segmentation (Unicode 15.0.0) https://www.unicode.org/reports/tr29/tr29-41.html

uniseg.wordbreak.WB

alias of WordBreak

class uniseg.wordbreak.WordBreak(value)

Word_Break property values.

uniseg.wordbreak.word_boundaries(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) → Iterator[int]

Iterate indices of the word boundaries of s

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
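The relationship between these indices and the per-position flags of word_breakables can be sketched in plain Python. The helper boundaries_from_flags below is hypothetical and not part of uniseg; it only illustrates the documented behaviour (every flagged index greater than 0, then the end of the string), using the flags documented for word_breakables('Hello, world.').

```python
from typing import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices (> 0) derived from per-position break flags.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    length = 0
    for index, flag in enumerate(flags):
        length = index + 1
        # Index 0 is always a break opportunity but is not reported.
        if flag and index > 0:
            yield index
    if length:
        # The end of the string is the final boundary.
        yield length


# Flags documented for word_breakables('Hello, world.')
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(boundaries_from_flags(flags)))  # [5, 6, 7, 12, 13]
```

Under this reading, the last index produced equals len(s), which is 13 for 'Hello, world.'.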

uniseg.wordbreak.word_break(c: str, index: int = 0, /) → WordBreak

Return the Word_Break property of c

c must be a single Unicode code point string.

>>> word_break('\x0d')
<WordBreak.CR: 'CR'>
>>> word_break('\x0b')
<WordBreak.NEWLINE: 'Newline'>
>>> word_break('\u30a2')
<WordBreak.KATAKANA: 'Katakana'>

If index is specified, this function treats c as a Unicode string and returns the Word_Break property of the code point at c[index].

>>> word_break('A\u30a2', 1)
<WordBreak.KATAKANA: 'Katakana'>

uniseg.wordbreak.word_breakables(s: str, /) → Iterator[Literal[0, 1]]

Iterate word breaking opportunities for every position of s

Each value is 1 for “break” or 0 for “do not break”. The iterator yields exactly len(s) values, one for each position in s.

>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
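As a rough illustration of how these flags map onto segments, the sketch below cuts a string at every flagged position. The function split_at_flags is a hypothetical helper, not part of uniseg, and the flags are copied from the 'Hello, world.' example above rather than computed.

```python
from typing import Iterable, Iterator


def split_at_flags(s: str, flags: Iterable[int]) -> Iterator[str]:
    """Split s before every position whose flag is 1.

    Hypothetical helper for illustration only; not part of uniseg.
    Assumes flags holds one value per position of s.
    """
    start = 0
    for index, flag in enumerate(flags):
        # A flag at the current segment start would give an empty piece.
        if flag and index > start:
            yield s[start:index]
            start = index
    if start < len(s):
        yield s[start:]


s = 'Hello, world.'
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(split_at_flags(s, flags)))  # ['Hello', ',', ' ', 'world', '.']
```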

uniseg.wordbreak.words(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) → Iterator[str]

Iterate user-perceived words of s

The example below is from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(''))
[]
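The tailor parameter is typed as a callable that receives the string and an iterator of break flags and returns a new iterator of flags. A minimal sketch of such a callable follows. The name keep_hyphenated and its rule (no break on either side of a hyphen between two alphanumeric characters) are assumptions made for illustration. The sample flags are written by hand to match what the default UAX #29 rules are expected to give for 'well-known'; they are not output captured from uniseg.

```python
from typing import Iterable, Iterator


def keep_hyphenated(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Clear the break flags around a hyphen joining two alphanumerics.

    Hypothetical tailor function; its shape follows the documented
    Callable[[str, Iterator], Iterator] signature.
    Assumes breakables holds one value per position of s.
    """
    last = len(s) - 1
    for i, flag in enumerate(breakables):
        # Position i is the hyphen itself: suppress the break before it.
        at_hyphen = (
            s[i] == '-' and 0 < i < last
            and s[i - 1].isalnum() and s[i + 1].isalnum()
        )
        # Position i follows the hyphen: suppress the break after it.
        after_hyphen = (
            i >= 2 and s[i - 1] == '-'
            and s[i - 2].isalnum() and s[i].isalnum()
        )
        yield 0 if (at_hyphen or after_hyphen) else flag


s = 'well-known'
# Hand-written flags (assumed default segmentation: well | - | known)
default_flags = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
print(list(keep_hyphenated(s, default_flags)))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Going by the signature, a function of this shape would be passed as the second positional argument, for example words(s, keep_hyphenated). The sketch calls it directly so that it runs without the library installed.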