3.3. uniseg.wordbreak — Word Break
Unicode word boundaries.
UAX #29: Unicode Text Segmentation (Unicode 16.0.0)
- class uniseg.wordbreak.WordBreak(value)
Word_Break property values.
- ALETTER = 'ALetter'
Word_Break property value ALetter
- CR = 'CR'
Word_Break property value CR
- DOUBLE_QUOTE = 'Double_Quote'
Word_Break property value Double_Quote
- EXTEND = 'Extend'
Word_Break property value Extend
- EXTENDNUMLET = 'ExtendNumLet'
Word_Break property value ExtendNumLet
- FORMAT = 'Format'
Word_Break property value Format
- HEBREW_LETTER = 'Hebrew_Letter'
Word_Break property value Hebrew_Letter
- KATAKANA = 'Katakana'
Word_Break property value Katakana
- LF = 'LF'
Word_Break property value LF
- MIDLETTER = 'MidLetter'
Word_Break property value MidLetter
- MIDNUM = 'MidNum'
Word_Break property value MidNum
- MIDNUMLET = 'MidNumLet'
Word_Break property value MidNumLet
- NEWLINE = 'Newline'
Word_Break property value Newline
- NUMERIC = 'Numeric'
Word_Break property value Numeric
- OTHER = 'Other'
Word_Break property value Other
- REGIONAL_INDICATOR = 'Regional_Indicator'
Word_Break property value Regional_Indicator
- SINGLE_QUOTE = 'Single_Quote'
Word_Break property value Single_Quote
- WSEGSPACE = 'WSegSpace'
Word_Break property value WSegSpace
- ZWJ = 'ZWJ'
Word_Break property value ZWJ
- uniseg.wordbreak.word_boundaries(s: str, /, *, property: Callable[[str], WordBreak] = word_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[int]
Iterate over the indices of the word boundaries of s.
This function yields indices from the first boundary position (> 0) to the end of the string (== len(s)).
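The relationship between boundary indices and per-position break flags can be sketched in plain Python. The helper name boundaries_from_breakables is hypothetical and is not part of uniseg; it only follows the description above (skip position 0, finish with len(s)), and the flags are the ones documented for word_breakables('Hello, world.').

```python
from typing import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices greater than 0, ending with the total length.

    Hypothetical helper for illustration; not a uniseg function.
    """
    length = 0
    for i, flag in enumerate(breakables):
        if flag and i > 0:
            yield i  # a break opportunity before position i
        length = i + 1
    if length:
        yield length  # the end of the string is always a boundary


# Flags documented for word_breakables('Hello, world.')
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(boundaries_from_breakables(flags)))  # [5, 6, 7, 12, 13]
```

For an empty input this sketch yields nothing; whether word_boundaries('') behaves the same way is an assumption here.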
- uniseg.wordbreak.word_break(c: str, /) -> WordBreak
Return the Word_Break value assigned to the code point c.
c must be a single Unicode code point string.
>>> word_break('\r')
WordBreak.CR
>>> word_break('\x0b')
WordBreak.NEWLINE
>>> word_break('ア')
WordBreak.KATAKANA
- uniseg.wordbreak.word_breakables(s: str, /, *, property: Callable[[str], WordBreak] = word_break) -> Iterable[Literal[0, 1]]
Iterate over the word breaking opportunities for every position of s.
Each yielded value is 1 for “break” or 0 for “do not break”, and refers to the position immediately before the corresponding character. The iteration yields exactly len(s) values.
>>> list(word_breakables('ABC'))
[1, 0, 0]
>>> list(word_breakables('Hello, world.'))
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
>>> list(word_breakables('\x01\u0308\x01'))
[1, 0, 1]
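One way to read the 0/1 flags is to cut the string before every position flagged 1. The sketch below does this in plain Python; split_at_breaks is a hypothetical helper, not a uniseg function, and the flags are copied from the 'Hello, world.' example above.

```python
from typing import Iterable, Iterator


def split_at_breaks(s: str, breakables: Iterable[int]) -> Iterator[str]:
    """Cut s before each position whose flag is 1.

    Hypothetical helper for illustration; not a uniseg function.
    """
    start = 0
    for i, flag in enumerate(breakables):
        if flag and i > 0:
            yield s[start:i]  # emit the segment that ends at this break
            start = i
    if start < len(s):
        yield s[start:]  # emit the trailing segment


text = 'Hello, world.'
flags = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(list(split_at_breaks(text, flags)))
# ['Hello', ',', ' ', 'world', '.']
```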
- uniseg.wordbreak.words(s: str, /, *, property: Callable[[str], WordBreak] = word_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[str]
Iterate over the user-perceived words of s.
The examples below are from http://www.unicode.org/reports/tr29/tr29-15.html#Word_Boundaries
>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> '|'.join(words(s))
'The| |quick| |(|“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet|,| |right|?'
>>> list(words(''))
[]
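The tailor parameter is typed as a callable that receives the string and its break flags and returns adjusted flags. A minimal sketch of such a callable follows. The name keep_hyphenated and the rule it applies are hypothetical, the input flags are hand-written for illustration rather than produced by uniseg, and the exact way uniseg invokes a tailor is assumed from the type signature only.

```python
from typing import Iterable, List


def keep_hyphenated(s: str, breakables: Iterable[int]) -> List[int]:
    """Suppress breaks on both sides of a hyphen between alphanumerics.

    Hypothetical tailor for illustration; it matches the documented
    (str, Iterable[0|1]) -> Iterable[0|1] shape but is not part of uniseg.
    """
    flags = list(breakables)
    for i, ch in enumerate(s):
        inside = 0 < i < len(s) - 1
        if ch == '-' and inside and s[i - 1].isalnum() and s[i + 1].isalnum():
            flags[i] = 0      # no break before the hyphen
            flags[i + 1] = 0  # no break after the hyphen
    return flags


# Hand-written flags that break before '-' and before 'k' (assumed input)
text = 'well-known'
flags = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
print(keep_hyphenated(text, flags))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Going by the signature above, a function like this would be supplied through the tailor keyword, for example words(s, tailor=keep_hyphenated).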