3.3. `uniseg.wordbreak` — Word Break

Unicode word boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.wordbreak.WB: alias of Word_Break

class uniseg.wordbreak.Word_Break(value)

Word_Break property values.

ALetter = 'ALetter': Word_Break property value ALetter

CR = 'CR': Word_Break property value CR

Double_Quote = 'Double_Quote': Word_Break property value Double_Quote

Extend = 'Extend': Word_Break property value Extend

ExtendNumLet = 'ExtendNumLet': Word_Break property value ExtendNumLet

Format = 'Format': Word_Break property value Format

Hebrew_Letter = 'Hebrew_Letter': Word_Break property value Hebrew_Letter

Katakana = 'Katakana': Word_Break property value Katakana

LF = 'LF': Word_Break property value LF

MidLetter = 'MidLetter': Word_Break property value MidLetter

MidNum = 'MidNum': Word_Break property value MidNum

MidNumLet = 'MidNumLet': Word_Break property value MidNumLet

Newline = 'Newline': Word_Break property value Newline

Numeric = 'Numeric': Word_Break property value Numeric

Other = 'Other': Word_Break property value Other

Regional_Indicator = 'Regional_Indicator': Word_Break property value Regional_Indicator

Single_Quote = 'Single_Quote': Word_Break property value Single_Quote

WSegSpace = 'WSegSpace': Word_Break property value WSegSpace

ZWJ = 'ZWJ': Word_Break property value ZWJ

uniseg.wordbreak.word_boundaries(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[int]

Iterate over the indices of the word boundaries of s.

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
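Because the yielded indices end at len(s), each index closes the word that started at the previous one. The following is a minimal sketch of turning such indices into substrings. `split_at` is a hypothetical helper and not part of uniseg, and the index list is written by hand for illustration rather than produced by the library.

```python
from typing import Iterable, Iterator


def split_at(s: str, boundaries: Iterable[int]) -> Iterator[str]:
    """Yield the slices of s delimited by ascending boundary indices.

    Hypothetical helper for illustration; it is not a uniseg function.
    """
    start = 0
    for end in boundaries:
        yield s[start:end]
        start = end


# Hand-written boundary indices (assumed, not computed by uniseg here).
print(list(split_at('Hello, world.', [5, 6, 7, 12, 13])))
# → ['Hello', ',', ' ', 'world', '.']
```

With uniseg installed, the hand-written list would presumably be replaced by the iterator returned from `word_boundaries(s)`.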

uniseg.wordbreak.word_break(c: str, /) → Word_Break

Return the Word_Break property of c.

c must be a string consisting of a single Unicode code point.

>>> word_break('\x0d')
Word_Break.CR
>>> word_break('\x0b')
Word_Break.Newline
>>> word_break('\u30a2')
Word_Break.Katakana

uniseg.wordbreak.word_breakables(s: str, /) → Iterable[Literal[0, 1]]

Iterate over the word breaking opportunities for every position of s.

Each value is 1 for “break” or 0 for “do not break”, and refers to the position before the corresponding character. The iteration yields exactly len(s) values.

>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
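The per-position flags and the boundary indices described for word_boundaries carry the same information in two shapes. A minimal sketch of the conversion follows, assuming only what this section states: one flag per character, and boundaries that start after position 0 and end at len(s). `boundaries_from_flags` is a hypothetical name and is not presented as how uniseg implements word_boundaries.

```python
from typing import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Convert per-position break flags into boundary indices.

    Hypothetical helper for illustration; not a uniseg function.
    """
    length = 0
    for i, flag in enumerate(flags):
        # Position 0 is skipped: boundaries start after the first character.
        if flag and i > 0:
            yield i
        length = i + 1
    # The end of a non-empty string is always the final boundary.
    if length:
        yield length


# Flags copied from the documented output for 'Hello, world.'.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(boundaries_from_flags(flags)))
# → [5, 6, 7, 12, 13]
```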

uniseg.wordbreak.words(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[str]

Iterate over the user-perceived words of s.

The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> print('|'.join(words(s)))
The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?
>>> list(words(''))
[]
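The tailor parameter accepts a callable that receives the string and its break flags and returns adjusted flags. As a hedged sketch of what such a callable could look like, the function below clears the breaks around a hyphen that sits between two alphanumeric characters. `keep_hyphenated` is a hypothetical name, the input flags are written by hand, and the exact way uniseg applies a tailor is assumed from the signature above rather than confirmed.

```python
from typing import Iterable, List


def keep_hyphenated(s: str, breakables: Iterable[int]) -> List[int]:
    """Hypothetical tailor: do not break around a hyphen inside a word.

    Assumes breakables has one flag per character of s, as documented
    for word_breakables.
    """
    flags = list(breakables)
    for i in range(1, len(s) - 1):
        if s[i] == '-' and s[i - 1].isalnum() and s[i + 1].isalnum():
            flags[i] = 0      # no break before the hyphen
            flags[i + 1] = 0  # no break after the hyphen
    return flags


# Hand-written flags that break before and after the hyphen (assumed input).
print(keep_hyphenated('well-known', [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]))
# → [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Since the signature marks both parameters as positional-only, such a function would be passed as the second positional argument, for example `words(s, keep_hyphenated)`.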
