3.3. uniseg.wordbreak — Word Break

Unicode word boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.wordbreak.WB

alias of WordBreak

class uniseg.wordbreak.WordBreak(value)

Word_Break property values.

ALETTER = 'ALetter'

Word_Break property value ALetter

CR = 'CR'

Word_Break property value CR

DOUBLE_QUOTE = 'Double_Quote'

Word_Break property value Double_Quote

EXTEND = 'Extend'

Word_Break property value Extend

EXTENDNUMLET = 'ExtendNumLet'

Word_Break property value ExtendNumLet

FORMAT = 'Format'

Word_Break property value Format

HEBREW_LETTER = 'Hebrew_Letter'

Word_Break property value Hebrew_Letter

KATAKANA = 'Katakana'

Word_Break property value Katakana

LF = 'LF'

Word_Break property value LF

MIDLETTER = 'MidLetter'

Word_Break property value MidLetter

MIDNUM = 'MidNum'

Word_Break property value MidNum

MIDNUMLET = 'MidNumLet'

Word_Break property value MidNumLet

NEWLINE = 'Newline'

Word_Break property value Newline

NUMERIC = 'Numeric'

Word_Break property value Numeric

OTHER = 'Other'

Word_Break property value Other

REGIONAL_INDICATOR = 'Regional_Indicator'

Word_Break property value Regional_Indicator

SINGLE_QUOTE = 'Single_Quote'

Word_Break property value Single_Quote

WSEGSPACE = 'WSegSpace'

Word_Break property value WSegSpace

ZWJ = 'ZWJ'

Word_Break property value ZWJ

uniseg.wordbreak.word_boundaries(s: str, /, *, property: Callable[[str], WordBreak] = word_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[int]

Iterate over the indices of the word boundaries of s.

This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
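The relationship between these boundary indices and the 0/1 flags produced by word_breakables can be sketched in plain Python. This is only an illustration under stated assumptions, not uniseg's implementation: boundaries_from_breakables is a hypothetical helper, and the flag list is copied from the word_breakables doctest for 'Hello, world.' further down in this section.

```python
from collections.abc import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices (> 0) from a 0/1 breakables sequence.

    Hypothetical helper for illustration; not part of uniseg.
    """
    length = 0
    for i, flag in enumerate(breakables):
        # Index 0 is flagged in the doctests but is not reported as a boundary.
        if flag and i > 0:
            yield i
        length = i + 1
    # The end of a non-empty string is the final boundary (== len(s)).
    if length > 0:
        yield length


# Flags for 'Hello, world.' copied from the word_breakables doctest.
s = 'Hello, world.'
breakables = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
boundaries = list(boundaries_from_breakables(breakables))
print(boundaries)  # [5, 6, 7, 12, 13]
```

Slicing s at these indices gives 'Hello', ',', ' ', 'world' and '.', with the last index equal to len(s).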

uniseg.wordbreak.word_break(c: str, /) → WordBreak

Return the Word_Break value assigned to the code point c.

c must be a string containing exactly one Unicode code point.

>>> word_break('\r')
WordBreak.CR
>>> word_break('\x0b')
WordBreak.NEWLINE
>>> word_break('ア')
WordBreak.KATAKANA

uniseg.wordbreak.word_breakables(s: str, /, *, property: Callable[[str], WordBreak] = word_break) → Iterable[Literal[0, 1]]

Iterate over the word-breaking opportunities for every position of s.

Each item is 1 for “break” or 0 for “do not break” before the corresponding position. The iteration yields exactly len(s) items.

>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]

uniseg.wordbreak.words(s: str, /, *, property: Callable[[str], WordBreak] = word_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[str]

Iterate over the user-perceived words of s.

The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> '|'.join(words(s))
'The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?'
>>> list(words(''))
[]
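
The tailor parameter is described only by its type: a callable that receives the string and the default 0/1 flags and returns adjusted flags. A minimal sketch of what such a callable might look like follows. Both keep_comma_attached and split_by_breakables are hypothetical names introduced for illustration, the flag list is copied from the word_breakables doctest above, and the code runs without uniseg.

```python
from collections.abc import Iterable, Iterator


def keep_comma_attached(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Hypothetical tailor: never break immediately before a comma."""
    for ch, flag in zip(s, breakables):
        yield 0 if ch == ',' else flag


def split_by_breakables(s: str, breakables: Iterable[int]) -> list[str]:
    """Hypothetical helper: start a new piece at every position flagged 1."""
    pieces: list[str] = []
    for ch, flag in zip(s, breakables):
        if flag or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces


# Default flags for 'Hello, world.' copied from the word_breakables doctest.
s = 'Hello, world.'
default = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
tailored = list(keep_comma_attached(s, default))

print(split_by_breakables(s, default))   # ['Hello', ',', ' ', 'world', '.']
print(split_by_breakables(s, tailored))  # ['Hello,', ' ', 'world', '.']
```

Going by the signature above, a function of this shape would presumably be passed as words(s, tailor=keep_comma_attached); that call is an assumption based on the keyword-only tailor parameter and is not exercised here.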