3.2. `uniseg.graphemecluster` — Grapheme Cluster

Unicode grapheme cluster boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.graphemecluster.GCB: alias of Grapheme_Cluster_Break

class uniseg.graphemecluster.Grapheme_Cluster_Break(value)

Grapheme_Cluster_Break property values in UAX #29.

CR = 'CR': Grapheme_Cluster_Break property value CR

Control = 'Control': Grapheme_Cluster_Break property value Control

Extend = 'Extend': Grapheme_Cluster_Break property value Extend

L = 'L': Grapheme_Cluster_Break property value L

LF = 'LF': Grapheme_Cluster_Break property value LF

LV = 'LV': Grapheme_Cluster_Break property value LV

LVT = 'LVT': Grapheme_Cluster_Break property value LVT

Other = 'Other': Grapheme_Cluster_Break property value Other

Prepend = 'Prepend': Grapheme_Cluster_Break property value Prepend

Regional_Indicator = 'Regional_Indicator': Grapheme_Cluster_Break property value Regional_Indicator

SpacingMark = 'SpacingMark': Grapheme_Cluster_Break property value SpacingMark

T = 'T': Grapheme_Cluster_Break property value T

V = 'V': Grapheme_Cluster_Break property value V

ZWJ = 'ZWJ': Grapheme_Cluster_Break property value ZWJ

uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[int]

Iterate indices of the grapheme cluster boundaries of s.

The yielded indices run from 0 to the end of the string (== len(s)), inclusive. An empty string yields nothing.

>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈')) # (== '\u0067\u0308')
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
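
The relationship between the per-position break flags and the boundary indices can be sketched without the library. The helper below, `boundaries_from_breakables`, is a hypothetical name used for illustration only and is not part of `uniseg`. It assumes the flag convention shown for `grapheme_cluster_breakables()` in this section: a 1 at position i means a break is allowed immediately before `s[i]`.

```python
from typing import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices from a sequence of 0/1 break flags.

    Hypothetical helper for illustration; not part of uniseg.
    """
    length = 0
    for i, flag in enumerate(breakables):
        if flag:
            yield i      # a cluster starts at position i
        length = i + 1
    if length:
        yield length     # the end of a non-empty string is always a boundary
```

Under these assumptions, the flags `[1, 0]` map to the boundaries `[0, 2]`, and an empty flag sequence maps to no boundaries at all, matching the doctests above.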

uniseg.graphemecluster.grapheme_cluster_break(c: str, /) → Grapheme_Cluster_Break

Return the Grapheme_Cluster_Break property of c.

c must be a single Unicode code point (a string of length 1).

>>> grapheme_cluster_break('a')
Grapheme_Cluster_Break.Other
>>> grapheme_cluster_break('\r')
Grapheme_Cluster_Break.CR
>>> print(grapheme_cluster_break('\n'))
LF

uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /) → Iterable[Literal[0, 1]]

Iterate grapheme cluster breaking opportunities for every position of s.

Each value is 1 for “break” or 0 for “do not break”. The iteration yields exactly len(s) values.

>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈')) # (== '\u0067\u0308')
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
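
As a rough illustration of how such flags can be consumed, the sketch below slices a string into pieces at every position flagged with 1. The function name `split_by_breakables` is hypothetical and not part of `uniseg`; the flags are plain integers supplied by the caller, assumed to follow the convention described above.

```python
from typing import Iterable, List


def split_by_breakables(s: str, breakables: Iterable[int]) -> List[str]:
    """Split s before every position whose flag is 1.

    Hypothetical helper for illustration; not part of uniseg.
    """
    clusters: List[str] = []
    start = 0
    for i, flag in enumerate(breakables):
        if flag and i > start:
            clusters.append(s[start:i])  # close the piece that ends at i
            start = i
    if start < len(s):
        clusters.append(s[start:])       # the last piece runs to the end
    return clusters
```

With the flags `[1, 0]` for `'\u0067\u0308'`, the sketch keeps both code points together in a single piece.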

uniseg.graphemecluster.grapheme_clusters(s: str, /, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[str]

Iterate every grapheme cluster token of s.

Grapheme clusters (both legacy and extended):

>>> list(grapheme_clusters('g̈')) # (== '\u0067\u0308')
['g̈']
>>> list(grapheme_clusters('각')) # (== '\uac01')
['각']
>>> list(grapheme_clusters('각')) # (== '\u1100\u1161\u11a8')
['각']

Extended grapheme clusters:

>>> list(grapheme_clusters('நி')) # (== '\u0ba8\u0bbf')
['நி']
>>> list(grapheme_clusters('षि')) # (== '\u0937\u093f')
['षि']

An empty string yields an empty sequence:

>>> list(grapheme_clusters(''))
[]

You can customize the default breaking behavior to fit a specific locale by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterator) as its arguments, and returns the customized sequence of breaking opportunities:

>>> def tailor_grapheme_cluster_breakables(s, breakables):
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> list(grapheme_clusters('Czech'))
['C', 'z', 'e', 'c', 'h']
>>> list(grapheme_clusters('Czech', tailor_grapheme_cluster_breakables))
['C', 'z', 'e', 'ch']
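
When several letter pairs should stay together, the same idea can be generalized with a small factory. This is only a sketch: `make_digraph_tailor` is a hypothetical helper, not part of `uniseg`, and it assumes the tailor signature described above (the string and its default flags in, the customized flags out). Matching is case-sensitive, as in the hand-written example.

```python
from typing import Callable, Iterable, Iterator, Sequence


def make_digraph_tailor(
    digraphs: Sequence[str],
) -> Callable[[str, Iterable[int]], Iterator[int]]:
    """Build a tailor that forbids breaks inside the given two-letter pairs.

    Hypothetical helper for illustration; not part of uniseg.
    """
    pairs = {(d[0], d[1]) for d in digraphs if len(d) == 2}

    def tailor(s: str, breakables: Iterable[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # Suppress the break before s[i] if s[i-1:i+1] is a listed pair.
            if i > 0 and (s[i - 1], s[i]) in pairs:
                yield 0
            else:
                yield breakable

    return tailor
```

A tailor built this way, for example `make_digraph_tailor(['ch'])`, could presumably be passed as the `tailor` argument of `grapheme_clusters()` in place of the hand-written function above.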
