3.3. uniseg.wordbreak — Word Break

Unicode word boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.wordbreak.WB

alias of Word_Break

class uniseg.wordbreak.Word_Break(value)

Word_Break property values.

ALetter = 'ALetter'

Word_Break property value ALetter

CR = 'CR'

Word_Break property value CR

Double_Quote = 'Double_Quote'

Word_Break property value Double_Quote

Extend = 'Extend'

Word_Break property value Extend

ExtendNumLet = 'ExtendNumLet'

Word_Break property value ExtendNumLet

Format = 'Format'

Word_Break property value Format

Hebrew_Letter = 'Hebrew_Letter'

Word_Break property value Hebrew_Letter

Katakana = 'Katakana'

Word_Break property value Katakana

LF = 'LF'

Word_Break property value LF

MidLetter = 'MidLetter'

Word_Break property value MidLetter

MidNum = 'MidNum'

Word_Break property value MidNum

MidNumLet = 'MidNumLet'

Word_Break property value MidNumLet

Newline = 'Newline'

Word_Break property value Newline

Numeric = 'Numeric'

Word_Break property value Numeric

Other = 'Other'

Word_Break property value Other

Regional_Indicator = 'Regional_Indicator'

Word_Break property value Regional_Indicator

Single_Quote = 'Single_Quote'

Word_Break property value Single_Quote

WSegSpace = 'WSegSpace'

Word_Break property value WSegSpace

ZWJ = 'ZWJ'

Word_Break property value ZWJ

uniseg.wordbreak.word_boundaries(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[int]

Iterate over the indices of the word boundaries of s.

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
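As a minimal sketch of how such boundary indices can be consumed, the snippet below slices a string at a list of ascending indices. The helper name split_at is hypothetical and not part of uniseg, and the index list is an assumption derived from the word_breakables('Hello, world.') example in this section rather than captured library output.

```python
from typing import Iterable, Iterator


def split_at(s: str, boundaries: Iterable[int]) -> Iterator[str]:
    """Yield the slices of s delimited by ascending boundary indices."""
    start = 0
    for end in boundaries:
        yield s[start:end]
        start = end


text = 'Hello, world.'
# Assumed boundary indices: the first is > 0 and the last equals len(text),
# matching the description of word_boundaries above.
indices = [5, 6, 7, 12, 13]
segments = list(split_at(text, indices))
print(segments)
```

With uniseg installed, the hard-coded list could presumably be replaced by word_boundaries(text).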

uniseg.wordbreak.word_break(c: str, /) → Word_Break

Return the Word_Break property of c.

c must be a single Unicode code point string.

>>> word_break('\r')
Word_Break.CR
>>> word_break('\x0b')
Word_Break.Newline
>>> word_break('ア')
Word_Break.Katakana

uniseg.wordbreak.word_breakables(s: str, /) → Iterable[Literal[0, 1]]

Iterate over the word breaking opportunities for every position of s.

1 for “break” and 0 for “do not break”. The length of iteration will be the same as len(s).

>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
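
The relationship between these flags and the indices described for word_boundaries can be sketched in plain Python. The function flags_to_boundaries is a hypothetical stand-in written for illustration; it assumes that a boundary is reported at every flagged position after the first, plus one final boundary at the end of a non-empty input. It is not the library's actual implementation.

```python
from typing import Iterable, Iterator


def flags_to_boundaries(flags: Iterable[int]) -> Iterator[int]:
    """Convert per-position break flags into boundary indices."""
    last = None
    for i, flag in enumerate(flags):
        # Position 0 is always a break, so it is not reported as a boundary.
        if flag and i > 0:
            yield i
        last = i
    # A non-empty input always ends with a boundary at len(flags).
    if last is not None:
        yield last + 1


# Flags copied from the 'Hello, world.' example above.
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
boundaries = list(flags_to_boundaries(flags))
print(boundaries)
```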

uniseg.wordbreak.words(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[str]

Iterate over the user-perceived words of s.

The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> '|'.join(words(s))
'The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?'
>>> list(words(''))
[]
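
The tailor parameter accepts a callable that receives the string and its break flags and returns adjusted flags. The following is a sketch of such a callable under that assumption. The names keep_comma_attached and cut_at_flags are hypothetical, and cut_at_flags is only a plain-Python stand-in used to show the effect without calling the library.

```python
from typing import Iterable, Iterator


def keep_comma_attached(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Hypothetical tailor: suppress the break before every ','."""
    for ch, flag in zip(s, breakables):
        yield 0 if ch == ',' else flag


def cut_at_flags(s: str, breakables: Iterable[int]) -> Iterator[str]:
    """Stand-in segmenter: cut s before every position flagged 1."""
    start = 0
    for i, flag in enumerate(breakables):
        if flag and i > start:
            yield s[start:i]
            start = i
    if start < len(s):
        yield s[start:]


text = 'Hello, world.'
# Default flags copied from the word_breakables example in this section.
default_flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
tailored_flags = list(keep_comma_attached(text, default_flags))
pieces = list(cut_at_flags(text, tailored_flags))
print(pieces)
```

Given the signature above, the same callable could presumably be passed as words(text, tailor=keep_comma_attached); how the library applies it internally is not stated here.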