3.2. uniseg.graphemecluster — Grapheme Cluster
Unicode grapheme cluster boundaries.
UAX #29: Unicode Text Segmentation (Unicode 16.0.0)
- uniseg.graphemecluster.GCB
alias of Grapheme_Cluster_Break
- class uniseg.graphemecluster.Grapheme_Cluster_Break(value)
Grapheme_Cluster_Break property values in UAX #29.
- CR = 'CR'
Grapheme_Cluster_Break property value CR
- Control = 'Control'
Grapheme_Cluster_Break property value Control
- Extend = 'Extend'
Grapheme_Cluster_Break property value Extend
- L = 'L'
Grapheme_Cluster_Break property value L
- LF = 'LF'
Grapheme_Cluster_Break property value LF
- LV = 'LV'
Grapheme_Cluster_Break property value LV
- LVT = 'LVT'
Grapheme_Cluster_Break property value LVT
- Other = 'Other'
Grapheme_Cluster_Break property value Other
- Prepend = 'Prepend'
Grapheme_Cluster_Break property value Prepend
- Regional_Indicator = 'Regional_Indicator'
Grapheme_Cluster_Break property value Regional_Indicator
- SpacingMark = 'SpacingMark'
Grapheme_Cluster_Break property value SpacingMark
- T = 'T'
Grapheme_Cluster_Break property value T
- V = 'V'
Grapheme_Cluster_Break property value V
- ZWJ = 'ZWJ'
Grapheme_Cluster_Break property value ZWJ
- uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) -> Iterator[int]
Iterate indices of the grapheme cluster boundaries of s.
For a non-empty string, the yielded indices run from 0 to the end of the string (len(s)), inclusive. An empty string yields nothing.
>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
- uniseg.graphemecluster.grapheme_cluster_break(c: str, /) -> Grapheme_Cluster_Break
Return the Grapheme_Cluster_Break property of c.
c must be a single Unicode code point (a string of length 1).
>>> grapheme_cluster_break('a')
Grapheme_Cluster_Break.Other
>>> grapheme_cluster_break('\x0d')
Grapheme_Cluster_Break.CR
>>> print(grapheme_cluster_break('\x0a'))
LF
- uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /) -> Iterable[Literal[0, 1]]
Iterate over the grapheme cluster breaking opportunities at every position of s.
Each item is 1 for “break” or 0 for “do not break”. The number of items yielded equals len(s).
>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈'))
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
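The relationship between a breakables sequence and the boundary indices yielded by grapheme_cluster_boundaries can be sketched in plain Python. The helper name boundaries_from_breakables is hypothetical and not part of uniseg. It assumes only the 0/1 convention described here, and it illustrates the relationship, not the library's actual implementation.

```python
from typing import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices for a 0/1 breakables sequence (hypothetical helper)."""
    length = 0
    for i, breakable in enumerate(breakables):
        if breakable:
            yield i  # a grapheme cluster starts at position i
        length = i + 1
    if length:
        yield length  # the end of a non-empty string is always a boundary


# Breakables copied from the doctests: 'ABC' gives [1, 1, 1], 'g' + U+0308 gives [1, 0].
abc = list(boundaries_from_breakables([1, 1, 1]))
g_diaeresis = list(boundaries_from_breakables([1, 0]))
empty = list(boundaries_from_breakables([]))
```

With these inputs the results agree with the grapheme_cluster_boundaries doctests for the same strings.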
- uniseg.graphemecluster.grapheme_clusters(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) -> Iterator[str]
Iterate every grapheme cluster token of s.
Grapheme clusters (both legacy and extended):
>>> list(grapheme_clusters('g\u0308')) == ['g\u0308']
True
>>> list(grapheme_clusters('\uac01')) == ['\uac01']
True
>>> list(grapheme_clusters('\u1100\u1161\u11a8')) == ['\u1100\u1161\u11a8']
True
Extended grapheme clusters:
>>> list(grapheme_clusters('\u0ba8\u0bbf')) == ['\u0ba8\u0bbf']
True
>>> list(grapheme_clusters('\u0937\u093f')) == ['\u0937\u093f']
True
An empty string yields an empty sequence:
>>> list(grapheme_clusters('')) == []
True
You can customize the default breaking behavior by supplying a tailor function that adjusts the breakable sequence to fit a specific locale. It receives s and the default breaking sequence (an iterator) as its arguments and returns the sequence of customized breaking opportunities:
>>> def tailor_grapheme_cluster_breakables(s, breakables):
...
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> s = 'Czech'
>>> list(grapheme_clusters(s)) == ['C', 'z', 'e', 'c', 'h']
True
>>> list(grapheme_clusters(
...     s, tailor_grapheme_cluster_breakables)) == ['C', 'z', 'e', 'ch']
True
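To see how a tailor function reshapes the result, the sketch below applies the same 'ch' rule to a hand-written breakables list and splits the string with a small helper. Both split_by_breakables and keep_ch_together are hypothetical names used for illustration and are not part of uniseg. The default list [1, 1, 1, 1, 1] is an assumption standing in for what grapheme_cluster_breakables would produce for plain ASCII letters.

```python
from typing import Iterable, Iterator


def split_by_breakables(s: str, breakables: Iterable[int]) -> Iterator[str]:
    """Split s before every position whose flag is 1 (hypothetical helper)."""
    start = 0
    for i, breakable in enumerate(breakables):
        if breakable and i > start:
            yield s[start:i]
            start = i
    if start < len(s):
        yield s[start:]  # emit the final cluster


def keep_ch_together(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Tailor sketch: suppress the break between 'c' and 'h'."""
    for i, breakable in enumerate(breakables):
        if s.endswith('c', 0, i) and s.startswith('h', i):
            yield 0
        else:
            yield breakable


s = 'Czech'
default = [1, 1, 1, 1, 1]  # assumed default flags for five ASCII letters
plain = list(split_by_breakables(s, default))
tailored = list(split_by_breakables(s, keep_ch_together(s, default)))
```

Because the tailor is a generator over the default sequence, it can pass most flags through unchanged and override only the positions it cares about.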