2.2. uniseg.graphemecluster — Grapheme Cluster

Unicode grapheme cluster boundaries.

UAX #29: Unicode Text Segmentation (Unicode 15.0.0) https://www.unicode.org/reports/tr29/tr29-41.html

uniseg.graphemecluster.GCB

alias of GraphemeClusterBreak

class uniseg.graphemecluster.GraphemeClusterBreak(value)

Grapheme_Cluster_Break property values in UAX #29.

uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) → Iterator[int]

Iterate over the indices of the grapheme cluster boundaries of s.

For a non-empty string, the yielded indices run from 0 to the end of the string (== len(s)), inclusive. An empty string yields nothing.

>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
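
The yielded indices can be used to slice s into its clusters. The following is a minimal sketch, not part of uniseg: split_at_boundaries is a hypothetical helper, and the boundary lists are copied from the examples above rather than computed by the library.

```python
from typing import Iterable, Iterator


def split_at_boundaries(s: str, boundaries: Iterable[int]) -> Iterator[str]:
    """Yield the substrings of s delimited by consecutive boundary indices.

    Hypothetical helper for illustration; not part of uniseg.
    """
    previous = None
    for index in boundaries:
        if previous is not None:
            # Each pair of neighbouring boundaries delimits one cluster.
            yield s[previous:index]
        previous = index


# Boundary lists copied from the documented examples (assumed, not recomputed).
abc_parts = list(split_at_boundaries('ABC', [0, 1, 2, 3]))
g_parts = list(split_at_boundaries('g\u0308', [0, 2]))
empty_parts = list(split_at_boundaries('', []))
```

Under these assumptions, each pair of neighbouring indices delimits exactly one cluster, so an empty boundary sequence produces no substrings.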
uniseg.graphemecluster.grapheme_cluster_break(c: str, index: int = 0, /) → GraphemeClusterBreak

Return the Grapheme_Cluster_Break property of c.

c must be a single Unicode code point string.

>>> grapheme_cluster_break('a')
<GraphemeClusterBreak.OTHER: 'Other'>
>>> grapheme_cluster_break('\x0d')
<GraphemeClusterBreak.CR: 'CR'>
>>> grapheme_cluster_break('\x0a').name
'LF'

If index is specified, this function treats c as a Unicode string and returns the Grapheme_Cluster_Break property of the code point at c[index].

>>> grapheme_cluster_break('a\x0d', 1).name
'CR'
uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /) → Iterator[Literal[0, 1]]

Iterate over the grapheme cluster breaking opportunities for every position of s.

Each value is 1 for “break” or 0 for “do not break”. The iteration yields exactly len(s) values.

>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈'))
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
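
Comparing these outputs with those of grapheme_cluster_boundaries suggests how the two views relate: every position flagged 1 is a boundary, and the end of a non-empty string is one more. The sketch below expresses that reading. boundaries_from_breakables is a hypothetical helper, not part of uniseg, and the library's actual implementation may differ.

```python
from typing import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield the boundary indices implied by a sequence of 0/1 break flags.

    Hypothetical helper for illustration; not part of uniseg.
    """
    length = 0
    for index, breakable in enumerate(breakables):
        if breakable:
            yield index
        length = index + 1
    # The end of a non-empty string is always a boundary.
    if length:
        yield length


# Flag lists copied from the documented examples (assumed, not recomputed).
abc_boundaries = list(boundaries_from_breakables([1, 1, 1]))
g_boundaries = list(boundaries_from_breakables([1, 0]))
empty_boundaries = list(boundaries_from_breakables([]))
```

With the flag lists shown above as input, this reproduces the boundary lists documented for 'ABC', 'g̈' and the empty string.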
uniseg.graphemecluster.grapheme_clusters(s: str, tailor: Callable[[str, Iterator[Literal[0, 1]]], Iterator[Literal[0, 1]]] | None = None, /) → Iterator[str]

Iterate over every grapheme cluster token of s.

Grapheme clusters (both legacy and extended):

>>> list(grapheme_clusters('g\u0308')) == ['g\u0308']
True
>>> list(grapheme_clusters('\uac01')) == ['\uac01']
True
>>> list(grapheme_clusters('\u1100\u1161\u11a8')) == ['\u1100\u1161\u11a8']
True

Extended grapheme clusters:

>>> list(grapheme_clusters('\u0ba8\u0bbf')) == ['\u0ba8\u0bbf']
True
>>> list(grapheme_clusters('\u0937\u093f')) == ['\u0937\u093f']
True

An empty string yields an empty sequence:

>>> list(grapheme_clusters('')) == []
True

You can customize the default breaking behavior for a specific locale by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterator) as its arguments, and returns the customized sequence of breaking opportunities:

>>> def tailor_grapheme_cluster_breakables(s, breakables):
...
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> s = 'Czech'
>>> list(grapheme_clusters(s)) == ['C', 'z', 'e', 'c', 'h']
True
>>> list(grapheme_clusters(
...     s, tailor_grapheme_cluster_breakables)) == ['C', 'z', 'e', 'ch']
True
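
When several digraphs need the same treatment, the tailor can be built from a list. The following is a minimal sketch under stated assumptions: make_digraph_tailor is a hypothetical helper, not part of uniseg, and the default flags for 'Czech' are hard-coded as all 1, as implied by the default result in the example above.

```python
from typing import Callable, Iterable, Iterator


def make_digraph_tailor(
    digraphs: Iterable[str],
) -> Callable[[str, Iterator[int]], Iterator[int]]:
    """Return a tailor that suppresses the break inside each given digraph.

    Hypothetical helper for illustration; not part of uniseg.
    """
    # Only two-code-point strings are considered; matching is case-sensitive.
    pairs = {d for d in digraphs if len(d) == 2}

    def tailor(s: str, breakables: Iterator[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # s[i - 1:i + 1] is the pair of code points around position i.
            if i > 0 and s[i - 1:i + 1] in pairs:
                yield 0
            else:
                yield breakable

    return tailor


tailor = make_digraph_tailor(['ch', 'dz'])
# Default flags for 'Czech' assumed to be all 1 (one cluster per letter).
tailored = list(tailor('Czech', iter([1, 1, 1, 1, 1])))
```

The returned function follows the tailor signature documented above (it takes s and an iterator of flags and yields flags), so it should be usable as the tailor argument of grapheme_clusters in the same way as tailor_grapheme_cluster_breakables.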