3.2. uniseg.graphemecluster — Grapheme Cluster
Unicode grapheme cluster boundaries.
UAX #29: Unicode Text Segmentation (Unicode 16.0.0)
- uniseg.graphemecluster.GCB
alias of GraphemeClusterBreak
- class uniseg.graphemecluster.GraphemeClusterBreak(value)
Grapheme_Cluster_Break property values in UAX #29.
- CONTROL = 'Control'
Grapheme_Cluster_Break property value Control
- CR = 'CR'
Grapheme_Cluster_Break property value CR
- EXTEND = 'Extend'
Grapheme_Cluster_Break property value Extend
- L = 'L'
Grapheme_Cluster_Break property value L
- LF = 'LF'
Grapheme_Cluster_Break property value LF
- LV = 'LV'
Grapheme_Cluster_Break property value LV
- LVT = 'LVT'
Grapheme_Cluster_Break property value LVT
- OTHER = 'Other'
Grapheme_Cluster_Break property value Other
- SPACINGMARK = 'SpacingMark'
Grapheme_Cluster_Break property value SpacingMark
- PREPEND = 'Prepend'
Grapheme_Cluster_Break property value Prepend
- REGIONAL_INDICATOR = 'Regional_Indicator'
Grapheme_Cluster_Break property value Regional_Indicator
- T = 'T'
Grapheme_Cluster_Break property value T
- V = 'V'
Grapheme_Cluster_Break property value V
- ZWJ = 'ZWJ'
Grapheme_Cluster_Break property value ZWJ
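The `(value)` constructor signature above suggests that GraphemeClusterBreak follows Python's standard enum pattern, although this section does not say so explicitly. Under that assumption, the sketch below uses a small stand-in class (`DemoBreak`, hypothetical and not part of uniseg) to show how lookup by value, lookup by name, and a module-level alias such as GCB would behave.

```python
from enum import Enum


class DemoBreak(Enum):
    """Stand-in for illustration only; this is NOT uniseg's class."""
    CR = "CR"
    LF = "LF"
    REGIONAL_INDICATOR = "Regional_Indicator"


# A module-level alias, analogous to how GCB refers to GraphemeClusterBreak.
Demo = DemoBreak

# Lookup by property value (call syntax) and by member name (subscript syntax).
by_value = DemoBreak("Regional_Indicator")
by_name = DemoBreak["REGIONAL_INDICATOR"]
```

Both lookups return the same member object, and the alias is the same class, so `Demo.CR is DemoBreak.CR` holds.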
- uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, /, *, property: ~collections.abc.Callable[[str], ~uniseg.graphemecluster.GraphemeClusterBreak] = <function grapheme_cluster_break>, tailor: ~collections.abc.Callable[[str, ~collections.abc.Iterable[~typing.Literal[0, 1]]], ~collections.abc.Iterable[~typing.Literal[0, 1]]] | None = None) -> Iterator[int]
Iterate over the indices of the grapheme cluster boundaries of s.
For a non-empty string, the yielded indices run from 0 through the end of the string (== len(s)), inclusive. An empty string yields nothing.
>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))  # (== '\u0067\u0308')
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
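The boundary indices relate to the 0/1 break flags documented for grapheme_cluster_breakables in a simple way, which can be sketched in plain Python. The helper name `boundaries_from_flags` is hypothetical and not part of uniseg, and the flag lists are copied by hand from the doctests in this section rather than computed by the library.

```python
from typing import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices for a sequence of 0/1 break flags.

    Hypothetical helper (not part of uniseg): position i is a boundary
    when its flag is 1, and the end of a non-empty input is always one.
    """
    length = 0
    for i, flag in enumerate(flags):
        if flag:
            yield i
        length = i + 1
    if length:  # an empty input has no boundaries at all
        yield length


# Flags written by hand to match the documented doctest outputs.
abc_boundaries = list(boundaries_from_flags([1, 1, 1]))  # 'ABC'
g_boundaries = list(boundaries_from_flags([1, 0]))       # 'g' + U+0308
empty_boundaries = list(boundaries_from_flags([]))       # ''
```

This is only a model of the documented behavior, not a description of how the library computes its result.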
- uniseg.graphemecluster.grapheme_cluster_break(c: str, /) -> GraphemeClusterBreak
Return the Grapheme_Cluster_Break value assigned to the code point c.
c must be a single Unicode character (code point).
>>> grapheme_cluster_break('a')
GraphemeClusterBreak.OTHER
>>> grapheme_cluster_break('\r')
GraphemeClusterBreak.CR
>>> print(grapheme_cluster_break('\n'))
LF
- uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /, *, property: ~collections.abc.Callable[[str], ~uniseg.graphemecluster.GraphemeClusterBreak] = <function grapheme_cluster_break>) -> Iterable[Literal[0, 1]]
Iterate grapheme cluster breaking opportunities for every position of s.
Each item is 1 for “break” or 0 for “do not break”. The iteration yields exactly len(s) items.
>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈'))  # (== '\u0067\u0308')
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
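One way to picture these flags is that cutting s before every position flagged 1 produces the grapheme clusters. The sketch below models that idea in plain Python. `split_by_flags` is a hypothetical helper, not a uniseg function, and the flag lists are hand-copied from the doctests above.

```python
from typing import Iterable, Iterator


def split_by_flags(s: str, flags: Iterable[int]) -> Iterator[str]:
    """Cut s before every position whose flag is 1 (hypothetical helper)."""
    start = 0
    for i, flag in enumerate(flags):
        # A flag at position 0 marks the start of the text, so no cut there.
        if flag and i > start:
            yield s[start:i]
            start = i
    if start < len(s):
        yield s[start:]  # the final piece runs to the end of the string


g_diaeresis = "g\u0308"  # 'g' followed by COMBINING DIAERESIS
clusters = list(split_by_flags(g_diaeresis, [1, 0]))
letters = list(split_by_flags("ABC", [1, 1, 1]))
nothing = list(split_by_flags("", []))
```

With the flags `[1, 0]` the base letter and its combining mark stay together as one piece.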
- uniseg.graphemecluster.grapheme_clusters(s: str, /, *, property: ~collections.abc.Callable[[str], ~uniseg.graphemecluster.GraphemeClusterBreak] = <function grapheme_cluster_break>, tailor: ~collections.abc.Callable[[str, ~collections.abc.Iterable[~typing.Literal[0, 1]]], ~collections.abc.Iterable[~typing.Literal[0, 1]]] | None = None) -> Iterator[str]
Iterate every grapheme cluster token of s.
Grapheme clusters (both legacy and extended):
>>> list(grapheme_clusters('g̈'))  # (== '\u0067\u0308')
['g̈']
>>> list(grapheme_clusters('각'))  # (== '\uac01')
['각']
>>> list(grapheme_clusters('각'))  # (== '\u1100\u1161\u11a8')
['각']
Extended grapheme clusters:
>>> list(grapheme_clusters('நி'))  # (== '\u0ba8\u0bbf')
['நி']
>>> list(grapheme_clusters('षि'))  # (== '\u0937\u093f')
['षि']
An empty string yields an empty sequence:
>>> list(grapheme_clusters(''))
[]
You can customize the default breaking behavior to fit a specific locale by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterator) as its arguments, and returns the customized sequence of breaking opportunities:
>>> def tailor_gcb_breakables(s, breakables) -> Breakables:
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> list(grapheme_clusters('Czech'))
['C', 'z', 'e', 'c', 'h']
>>> list(grapheme_clusters('Czech', tailor=tailor_gcb_breakables))
['C', 'z', 'e', 'ch']
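When several letter pairs need the same treatment, the tailor function can be produced by a small factory. The following is a minimal sketch under stated assumptions: `keep_together` is a hypothetical helper, not part of uniseg, and the default flags for 'Czech' are written out by hand so that the block runs with the standard library alone.

```python
from typing import Callable, Iterable, Iterator

Tailor = Callable[[str, Iterable[int]], Iterator[int]]


def keep_together(*digraphs: str) -> Tailor:
    """Build a tailor function that forbids breaks inside two-letter sequences.

    Hypothetical helper, not part of uniseg. Matching is case-sensitive.
    """
    def tailor(s: str, breakables: Iterable[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # Position i sits between s[i - 1] and s[i].
            if i > 0 and s[i - 1:i + 1] in digraphs:
                yield 0
            else:
                yield breakable
    return tailor


czech_tailor = keep_together("ch")
# Default flags for 'Czech' written by hand: every position is a break.
tailored = list(czech_tailor("Czech", [1, 1, 1, 1, 1]))
```

The returned function has the `(s, breakables)` shape described above, so it should be usable as `tailor=keep_together('ch')`. That call is not executed here because it requires the library itself.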