2.3. uniseg.wordbreak
— Word Break
Unicode word boundaries.
UAX #29: Unicode Text Segmentation (Unicode 15.0.0) https://www.unicode.org/reports/tr29/tr29-41.html
- class uniseg.wordbreak.WordBreak(value)
Word_Break property values.
- uniseg.wordbreak.word_boundaries(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) -> Iterator[int]
Iterate indices of the word boundaries of s.
This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
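The index convention described above can be sketched in plain Python. The helper boundaries_from_flags below is hypothetical and not part of uniseg; it only illustrates one plausible way to read boundary indices off a list of 0/1 break flags, assuming (this is not stated in the text) that an empty input yields nothing. The flag list is copied from the word_breakables('Hello, world.') example in this section.

```python
from typing import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices (> 0) for a sequence of 0/1 break flags.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    length = 0
    for i, flag in enumerate(flags):
        length = i + 1
        # Position 0 is skipped: the first reported boundary must be > 0.
        if flag and i > 0:
            yield i
    # For non-empty input, the end of the string is the last boundary.
    if length:
        yield length


# Flags for 'Hello, world.' as listed in the word_breakables example.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(boundaries_from_flags(flags)))  # [5, 6, 7, 12, 13]
```

The last value equals len('Hello, world.'), which matches the statement that iteration ends at len(s).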
- uniseg.wordbreak.word_break(c: str, index: int = 0, /) -> WordBreak
Return the Word_Break property of c.
c must be a single Unicode code point string.
>>> word_break('\x0d')
<WordBreak.CR: 'CR'>
>>> word_break('\x0b')
<WordBreak.NEWLINE: 'Newline'>
>>> word_break('\u30a2')
<WordBreak.KATAKANA: 'Katakana'>
If index is specified, this function treats c as a Unicode string and returns the Word_Break property of the code point at c[index].
>>> word_break('A\u30a2', 1)
<WordBreak.KATAKANA: 'Katakana'>
- uniseg.wordbreak.word_breakables(s: str, /) -> Iterator[Literal[0, 1]]
Iterate word breaking opportunities for every position of s.
1 for “break” and 0 for “do not break”. The length of the iteration is the same as len(s).
>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
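To make the meaning of the flags concrete, here is a minimal sketch that cuts a string at every position flagged with 1. The function split_at_flags is a hypothetical name introduced for illustration and is not a uniseg API; the flag list is the one shown for 'Hello, world.' above.

```python
from typing import Iterable, Iterator


def split_at_flags(s: str, flags: Iterable[int]) -> Iterator[str]:
    """Split s before every position whose flag is 1.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    start = 0
    for i, flag in enumerate(flags):
        # A flag at position i means "a segment may start here".
        if flag and i > start:
            yield s[start:i]
            start = i
    # Emit whatever remains after the last flagged position.
    if start < len(s):
        yield s[start:]


s = 'Hello, world.'
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print('|'.join(split_at_flags(s, flags)))  # Hello|,| |world|.
```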
- uniseg.wordbreak.words(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) -> Iterator[str]
Iterate user-perceived words of s.
The examples below are taken from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries
>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(''))
[]
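The tailor parameter is not described in this section beyond its type: a callable that takes the string and an iterator of 0/1 flags and returns an iterator of 0/1 flags. The sketch below shows a function with that shape. The name keep_hyphenated and its behavior are hypothetical, and the input flags for 'well-known' are written by hand for illustration rather than produced by uniseg.

```python
from typing import Iterable, Iterator


def _joining_hyphen(s: str, i: int) -> bool:
    """True if s[i] is '-' with an alphanumeric character on both sides."""
    return (
        0 < i < len(s) - 1
        and s[i] == '-'
        and s[i - 1].isalnum()
        and s[i + 1].isalnum()
    )


def keep_hyphenated(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Hypothetical tailor: clear the break flags around a joining hyphen."""
    for i, flag in enumerate(breakables):
        # Suppress the break before the hyphen and the break after it.
        if _joining_hyphen(s, i) or (i > 0 and _joining_hyphen(s, i - 1)):
            yield 0
        else:
            yield flag


# Hand-written flags (an assumption): breaks before '-' and before 'k'.
default_flags = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
print(list(keep_hyphenated('well-known', default_flags)))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Assuming the tailor is applied to the default flags before segmentation, such a function would be passed positionally, for example words(s, keep_hyphenated), since the signatures above mark all parameters as positional-only.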