3.2. uniseg.graphemecluster — Grapheme Cluster

Unicode grapheme cluster boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.graphemecluster.GCB

alias of Grapheme_Cluster_Break

class uniseg.graphemecluster.Grapheme_Cluster_Break(value)

Grapheme_Cluster_Break property values in UAX #29.

CR = 'CR'

Grapheme_Cluster_Break property value CR

Control = 'Control'

Grapheme_Cluster_Break property value Control

Extend = 'Extend'

Grapheme_Cluster_Break property value Extend

L = 'L'

Grapheme_Cluster_Break property value L

LF = 'LF'

Grapheme_Cluster_Break property value LF

LV = 'LV'

Grapheme_Cluster_Break property value LV

LVT = 'LVT'

Grapheme_Cluster_Break property value LVT

Other = 'Other'

Grapheme_Cluster_Break property value Other

Prepend = 'Prepend'

Grapheme_Cluster_Break property value Prepend

Regional_Indicator = 'Regional_Indicator'

Grapheme_Cluster_Break property value Regional_Indicator

SpacingMark = 'SpacingMark'

Grapheme_Cluster_Break property value SpacingMark

T = 'T'

Grapheme_Cluster_Break property value T

V = 'V'

Grapheme_Cluster_Break property value V

ZWJ = 'ZWJ'

Grapheme_Cluster_Break property value ZWJ

uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[int]

Iterate indices of the grapheme cluster boundaries of s.

For a non-empty string, the yielded indices start at 0 and end at len(s), the end of the string. An empty string yields nothing.

>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈'))
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
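
The boundary indices are meant to be used as slice positions. As a minimal sketch, assuming only the index lists shown in the doctests above (split_at is a hypothetical helper, not part of uniseg), consecutive pairs of boundaries can be turned back into substrings:

```python
from typing import List, Sequence


def split_at(s: str, boundaries: Sequence[int]) -> List[str]:
    """Slice s between each pair of consecutive boundary indices.

    Hypothetical helper for illustration; not part of uniseg.
    """
    return [s[start:end] for start, end in zip(boundaries, boundaries[1:])]


# Boundary lists copied from the doctests above.
abc_parts = split_at('ABC', [0, 1, 2, 3])
g_parts = split_at('g\u0308', [0, 2])
empty_parts = split_at('', [])
```

In real code the second argument would presumably be list(grapheme_cluster_boundaries(s)) rather than a hard-coded list.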

uniseg.graphemecluster.grapheme_cluster_break(c: str, /) → Grapheme_Cluster_Break

Return the Grapheme_Cluster_Break property of c.

c must be a string containing a single Unicode code point.

>>> grapheme_cluster_break('a')
Grapheme_Cluster_Break.Other
>>> grapheme_cluster_break('\x0d')
Grapheme_Cluster_Break.CR
>>> print(grapheme_cluster_break('\x0a'))
LF
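
Since the property is looked up one code point at a time, it may help to check how many code points a user-perceived character contains before looking it up. The sketch below uses only the standard library's unicodedata module, not uniseg, to show that 'g̈' consists of two code points:

```python
import unicodedata

# 'g' followed by a combining mark: one visible character, two code points.
s = 'g\u0308'
code_point_count = len(s)
names = [unicodedata.name(ch) for ch in s]
```

Each element of s could then be passed on its own wherever a single code point is required.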

uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /) → Iterable[Literal[0, 1]]

Iterate grapheme cluster breaking opportunities for every position of s.

Each item is 1 (“break”) or 0 (“do not break”); item i tells whether a break is allowed immediately before s[i]. The iteration yields exactly len(s) items.

>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈'))
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
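
The flag sequence and the boundary indices describe the same information in two forms. The following is a rough sketch of that relationship, using the flag lists from the doctests above; boundaries_from_breakables is a hypothetical name and this is not necessarily how uniseg computes boundaries internally:

```python
from typing import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices from a sequence of 0/1 break flags.

    Illustrative helper only; not part of uniseg.
    """
    length = 0
    for i, flag in enumerate(breakables):
        if flag:
            yield i  # a break is allowed just before position i
        length = i + 1
    if length:
        yield length  # the end of a non-empty string is a boundary


# Flags copied from the doctests above for 'ABC', 'g\u0308' and ''.
abc_boundaries = list(boundaries_from_breakables([1, 1, 1]))
g_boundaries = list(boundaries_from_breakables([1, 0]))
empty_boundaries = list(boundaries_from_breakables([]))
```

The three results match the documented output of grapheme_cluster_boundaries for the same strings: [0, 1, 2, 3], [0, 2] and [].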

uniseg.graphemecluster.grapheme_clusters(s: str, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None, /) → Iterator[str]

Iterate every grapheme cluster token of s.

Grapheme clusters (both legacy and extended):

>>> list(grapheme_clusters('g\u0308')) == ['g\u0308']
True
>>> list(grapheme_clusters('\uac01')) == ['\uac01']
True
>>> list(grapheme_clusters('\u1100\u1161\u11a8')) == ['\u1100\u1161\u11a8']
True

Extended grapheme clusters:

>>> list(grapheme_clusters('\u0ba8\u0bbf')) == ['\u0ba8\u0bbf']
True
>>> list(grapheme_clusters('\u0937\u093f')) == ['\u0937\u093f']
True

An empty string yields an empty sequence:

>>> list(grapheme_clusters('')) == []
True

You can customize the default breaking behavior, for example to fit a specific locale, by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterable) as its arguments and returns the customized sequence of breaking opportunities:

>>> def tailor_grapheme_cluster_breakables(s, breakables):
...
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> s = 'Czech'
>>> list(grapheme_clusters(s)) == ['C', 'z', 'e', 'c', 'h']
True
>>> list(grapheme_clusters(
...     s, tailor_grapheme_cluster_breakables)) == ['C', 'z', 'e', 'ch']
True
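
When several letter combinations need to stay together, the same idea can be generalized. The sketch below builds a tailor function from a list of units; keep_together is a hypothetical helper, not part of uniseg, and the default flags for 'Czech' are written out by hand (all 1, as the doctest above implies) so the block runs without the library. Matching is case-sensitive.

```python
from typing import Callable, Iterable, Iterator, Sequence


def keep_together(
    units: Sequence[str],
) -> Callable[[str, Iterable[int]], Iterator[int]]:
    """Build a tailor that suppresses breaks inside each listed unit.

    Hypothetical helper for illustration; not part of uniseg.
    """
    def tailor(s: str, breakables: Iterable[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # True if position i falls strictly inside an occurrence
            # of any unit, i.e. the unit starts k characters before i.
            inside = any(
                s.startswith(unit, i - k)
                for unit in units
                for k in range(1, len(unit))
                if i - k >= 0
            )
            yield 0 if inside else breakable
    return tailor


czech_tailor = keep_together(['ch'])
# Hand-written default flags for 'Czech' (assumed all 1).
tailored = list(czech_tailor('Czech', [1, 1, 1, 1, 1]))
longer = list(keep_together(['abc'])('xabc', [1, 1, 1, 1]))
```

A function built this way has the same (s, breakables) shape as tailor_grapheme_cluster_breakables above, so it should be usable as the tailor argument of grapheme_clusters or grapheme_cluster_boundaries.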