3.2. uniseg.graphemecluster — Grapheme Cluster
Unicode grapheme cluster boundaries.
UAX #29: Unicode Text Segmentation (Unicode 16.0.0)
- uniseg.graphemecluster.GCB
alias of Grapheme_Cluster_Break
- class uniseg.graphemecluster.Grapheme_Cluster_Break(value)
Grapheme_Cluster_Break property values in UAX #29.
- CR = 'CR'
Grapheme_Cluster_Break property value CR
- Control = 'Control'
Grapheme_Cluster_Break property value Control
- Extend = 'Extend'
Grapheme_Cluster_Break property value Extend
- L = 'L'
Grapheme_Cluster_Break property value L
- LF = 'LF'
Grapheme_Cluster_Break property value LF
- LV = 'LV'
Grapheme_Cluster_Break property value LV
- LVT = 'LVT'
Grapheme_Cluster_Break property value LVT
- Other = 'Other'
Grapheme_Cluster_Break property value Other
- Prepend = 'Prepend'
Grapheme_Cluster_Break property value Prepend
- Regional_Indicator = 'Regional_Indicator'
Grapheme_Cluster_Break property value Regional_Indicator
- SpacingMark = 'SpacingMark'
Grapheme_Cluster_Break property value SpacingMark
- T = 'T'
Grapheme_Cluster_Break property value T
- V = 'V'
Grapheme_Cluster_Break property value V
- ZWJ = 'ZWJ'
Grapheme_Cluster_Break property value ZWJ
- uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[int]
Iterate indices of the grapheme cluster boundaries of s.
For a non-empty string, the yielded indices run from 0 through the end of the string (== len(s)). An empty string yields nothing.
>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))  # (== '\u0067\u0308')
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
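Since the boundaries are plain string indices, consecutive pairs can be used to slice s into its clusters. The following is a minimal sketch of that idea, not the library's implementation. `split_at_boundaries` is a hypothetical helper, and the boundary lists are copied from the examples above rather than computed.

```python
from typing import Iterable, List


def split_at_boundaries(s: str, boundaries: Iterable[int]) -> List[str]:
    """Slice s between consecutive boundary indices (hypothetical helper)."""
    indices = list(boundaries)
    # Pair each boundary with the next one: (0, 1), (1, 2), ...
    return [s[start:end] for start, end in zip(indices, indices[1:])]


# Boundary lists taken from the documented examples, not computed here.
abc_pieces = split_at_boundaries('ABC', [0, 1, 2, 3])
g_pieces = split_at_boundaries('g\u0308', [0, 2])
empty_pieces = split_at_boundaries('', [])
```

With real input, the second argument would be the iterator returned by grapheme_cluster_boundaries(s).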
- uniseg.graphemecluster.grapheme_cluster_break(c: str, /) -> Grapheme_Cluster_Break
Return the Grapheme_Cluster_Break property of c.
c must be a string consisting of a single Unicode code point.
>>> grapheme_cluster_break('a')
Grapheme_Cluster_Break.Other
>>> grapheme_cluster_break('\r')
Grapheme_Cluster_Break.CR
>>> print(grapheme_cluster_break('\n'))
LF
- uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /) -> Iterable[Literal[0, 1]]
Iterate grapheme cluster breaking opportunities for every position of s.
1 for “break” and 0 for “do not break”. The length of iteration will be the same as len(s).
>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈'))  # (== '\u0067\u0308')
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
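The breakables sequence and the boundary indices describe the same information in two forms. The sketch below shows one way to turn the first into the second. It is an illustration of the relationship inferred from the documented examples, not a statement of how the library computes boundaries, and `breakables_to_boundaries` is a hypothetical name.

```python
from typing import Iterable, List


def breakables_to_boundaries(s: str, breakables: Iterable[int]) -> List[int]:
    """Convert a 0/1 breakables sequence into boundary indices (hypothetical helper)."""
    if not s:
        # Matches the documented behavior: an empty string has no boundaries.
        return []
    # Every position flagged 1 starts a new cluster.
    boundaries = [i for i, flag in enumerate(breakables) if flag]
    # The end of the string closes the last cluster.
    boundaries.append(len(s))
    return boundaries


# Breakables lists taken from the documented examples, not computed here.
abc_boundaries = breakables_to_boundaries('ABC', [1, 1, 1])
g_boundaries = breakables_to_boundaries('g\u0308', [1, 0])
```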
- uniseg.graphemecluster.grapheme_clusters(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[str]
Iterate every grapheme cluster token of s.
Grapheme clusters (both legacy and extended):
>>> list(grapheme_clusters('g̈'))  # (== '\u0067\u0308')
['g̈']
>>> list(grapheme_clusters('각'))  # (== '\uac01')
['각']
>>> list(grapheme_clusters('각'))  # (== '\u1100\u1161\u11a8')
['각']
Extended grapheme clusters:
>>> list(grapheme_clusters('நி'))  # (== '\u0ba8\u0bbf')
['நி']
>>> list(grapheme_clusters('षि'))  # (== '\u0937\u093f')
['षि']
An empty string yields an empty sequence:
>>> list(grapheme_clusters(''))
[]
You can customize the default breaking behavior to suit a specific locale by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterator) as its arguments and returns the customized sequence of breaking opportunities:
>>> def tailor_grapheme_cluster_breakables(s, breakables):
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> list(grapheme_clusters('Czech'))
['C', 'z', 'e', 'c', 'h']
>>> list(grapheme_clusters('Czech', tailor_grapheme_cluster_breakables))
['C', 'z', 'e', 'ch']
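When several digraphs need the same treatment, the tailor can be generated rather than written by hand. The following is a sketch under stated assumptions: `make_digraph_tailor` is a hypothetical factory, not part of uniseg, it only handles two-character sequences, and it matches case-insensitively. The tailor is called directly here with hand-written default breakables, so the example runs without the library.

```python
from typing import Callable, Iterable, Iterator, Sequence

Tailor = Callable[[str, Iterable[int]], Iterator[int]]


def make_digraph_tailor(digraphs: Sequence[str]) -> Tailor:
    """Build a tailor that suppresses breaks inside the given two-character
    sequences (hypothetical factory, not a uniseg function)."""
    wanted = {d.lower() for d in digraphs}

    def tailor(s: str, breakables: Iterable[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # Position i sits between s[i - 1] and s[i].
            if i > 0 and s[i - 1:i + 1].lower() in wanted:
                yield 0
            else:
                yield breakable

    return tailor


czech_tailor = make_digraph_tailor(['ch'])
# Default breakables for 'Czech' written out by hand: a break before every letter.
tailored = list(czech_tailor('Czech', [1, 1, 1, 1, 1]))
```

The returned function has the same shape as the tailor parameter, so it should be usable as the second argument of grapheme_clusters in the same way as the hand-written tailor shown above.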