3.3. uniseg.wordbreak — Word Break

Unicode word boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.wordbreak.WB

alias of Word_Break

class uniseg.wordbreak.Word_Break(value)

Word_Break property values.

ALetter = 'ALetter'

Word_Break property value ALetter

CR = 'CR'

Word_Break property value CR

Double_Quote = 'Double_Quote'

Word_Break property value Double_Quote

Extend = 'Extend'

Word_Break property value Extend

ExtendNumLet = 'ExtendNumLet'

Word_Break property value ExtendNumLet

Format = 'Format'

Word_Break property value Format

Hebrew_Letter = 'Hebrew_Letter'

Word_Break property value Hebrew_Letter

Katakana = 'Katakana'

Word_Break property value Katakana

LF = 'LF'

Word_Break property value LF

MidLetter = 'MidLetter'

Word_Break property value MidLetter

MidNum = 'MidNum'

Word_Break property value MidNum

MidNumLet = 'MidNumLet'

Word_Break property value MidNumLet

Newline = 'Newline'

Word_Break property value Newline

Numeric = 'Numeric'

Word_Break property value Numeric

Other = 'Other'

Word_Break property value Other

Regional_Indicator = 'Regional_Indicator'

Word_Break property value Regional_Indicator

Single_Quote = 'Single_Quote'

Word_Break property value Single_Quote

WSegSpace = 'WSegSpace'

Word_Break property value WSegSpace

ZWJ = 'ZWJ'

Word_Break property value ZWJ

uniseg.wordbreak.word_boundaries(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[int]

Iterate indices of the word boundaries of s

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
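
To make the index convention concrete, the following is a minimal sketch of how boundary indices could be derived from a sequence of break flags. The helper name boundaries_from_flags is hypothetical and is not part of uniseg; the flags are copied from the word_breakables('Hello, world.') example in this section, and the handling of empty input is an assumption.

```python
from typing import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Yield every boundary index > 0, ending with the total length.

    Hypothetical helper for illustration; not part of uniseg.
    """
    length = 0
    for i, flag in enumerate(flags):
        # Position 0 is always a boundary, so it is skipped here.
        if flag and i > 0:
            yield i
        length = i + 1
    # The end of a non-empty string is the last boundary (== len(s)).
    if length > 0:
        yield length


# Flags copied from the word_breakables('Hello, world.') example.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
indices = list(boundaries_from_flags(flags))
print(indices)  # [5, 6, 7, 12, 13]
```

Each index can be used directly as a slice bound, so s[0:5] is 'Hello' and s[12:13] is '.'.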

uniseg.wordbreak.word_break(c: str, /) → Word_Break

Return the Word_Break property of c

c must be a single Unicode code point string.

>>> word_break('\x0d')
Word_Break.CR
>>> word_break('\x0b')
Word_Break.Newline
>>> word_break('\u30a2')
Word_Break.Katakana

uniseg.wordbreak.word_breakables(s: str, /) → Iterable[Literal[0, 1]]

Iterate word breaking opportunities for every position of s

Each value is 1 for “break” or 0 for “do not break” before the character at that position. The iteration yields exactly len(s) values.

>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
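
As an illustration of how these flags can be read, the sketch below cuts a string before every position whose flag is 1. The helper split_by_flags is a hypothetical name and not part of uniseg; the flag lists are copied by hand from the examples above rather than computed.

```python
from typing import Iterable, Iterator


def split_by_flags(s: str, flags: Iterable[int]) -> Iterator[str]:
    """Cut s before each position whose flag is 1.

    Hypothetical helper for illustration; not part of uniseg.
    """
    start = 0
    for i, flag in enumerate(flags):
        # A flag at position i means a break between s[i - 1] and s[i].
        if flag and i > start:
            yield s[start:i]
            start = i
    # Emit the trailing segment, if any.
    if start < len(s):
        yield s[start:]


# Flags copied from the 'Hello, world.' example above.
hello_flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
segments = list(split_by_flags('Hello, world.', hello_flags))
print(segments)  # ['Hello', ',', ' ', 'world', '.']
```

With the flags [1, 0, 1] from the last example, the combining mark U+0308 stays attached to the preceding character, giving two segments.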

uniseg.wordbreak.words(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[str]

Iterate user-perceived words of s

The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(''))
[]
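
The tailor parameter is typed as a callable that receives the string and the default break flags and returns adjusted flags. The sketch below shows what such a callable might look like; keep_hyphenated is a hypothetical name, the default flags for 'well-known' are written by hand as an assumption rather than computed by uniseg, and the example runs the callable on its own instead of passing it to words().

```python
from typing import Iterable, Iterator


def keep_hyphenated(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Hypothetical tailor: suppress breaks directly before and after '-'."""
    for i, flag in enumerate(breakables):
        if i > 0 and (s[i] == '-' or s[i - 1] == '-'):
            # Glue the hyphen to both of its neighbours.
            yield 0
        else:
            # Leave every other position as the default rules decided.
            yield flag


s = 'well-known'
# Assumed default flags (breaks before '-' and before 'k'); hand-written.
default_flags = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
tailored = list(keep_hyphenated(s, default_flags))
print(tailored)  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Given the signature above, a callable of this shape would presumably be passed as the second positional argument, for example words(s, keep_hyphenated).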