3.2. uniseg.graphemecluster — Grapheme Cluster

Unicode grapheme cluster boundaries.

UAX #29: Unicode Text Segmentation (Unicode 16.0.0)

uniseg.graphemecluster.GCB

alias of GraphemeClusterBreak

class uniseg.graphemecluster.GraphemeClusterBreak(value)

Grapheme_Cluster_Break property values in UAX #29.

CONTROL = 'Control'

Grapheme_Cluster_Break property value Control

CR = 'CR'

Grapheme_Cluster_Break property value CR

EXTEND = 'Extend'

Grapheme_Cluster_Break property value Extend

L = 'L'

Grapheme_Cluster_Break property value L

LF = 'LF'

Grapheme_Cluster_Break property value LF

LV = 'LV'

Grapheme_Cluster_Break property value LV

LVT = 'LVT'

Grapheme_Cluster_Break property value LVT

OTHER = 'Other'

Grapheme_Cluster_Break property value Other

SPACINGMARK = 'SpacingMark'

Grapheme_Cluster_Break property value SpacingMark

PREPEND = 'Prepend'

Grapheme_Cluster_Break property value Prepend

REGIONAL_INDICATOR = 'Regional_Indicator'

Grapheme_Cluster_Break property value Regional_Indicator

T = 'T'

Grapheme_Cluster_Break property value T

V = 'V'

Grapheme_Cluster_Break property value V

ZWJ = 'ZWJ'

Grapheme_Cluster_Break property value ZWJ

uniseg.graphemecluster.grapheme_cluster_boundaries(s: str, /, *, property: Callable[[str], GraphemeClusterBreak] = grapheme_cluster_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[int]

Iterate indices of the grapheme cluster boundaries of s.

The yielded indices start at 0 and end at len(s), the end of the string. An empty string yields no indices.

>>> list(grapheme_cluster_boundaries('ABC'))
[0, 1, 2, 3]
>>> list(grapheme_cluster_boundaries('g̈')) # (== '\u0067\u0308')
[0, 2]
>>> list(grapheme_cluster_boundaries(''))
[]
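
Because the indices include both 0 and len(s), each consecutive pair of them delimits one cluster. The following is a minimal sketch of that relationship, not uniseg's own implementation: slice_at is a hypothetical helper, and the boundary lists are copied by hand from the doctests above so that the block runs without uniseg installed.

```python
from collections.abc import Iterable, Iterator


def slice_at(s: str, boundaries: Iterable[int]) -> Iterator[str]:
    """Yield the substrings of s delimited by consecutive boundary indices.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    previous = None
    for index in boundaries:
        if previous is not None:
            yield s[previous:index]
        previous = index


# Boundary lists taken from the doctests above.
print(list(slice_at("ABC", [0, 1, 2, 3])))  # three single-letter clusters
print(list(slice_at("g\u0308", [0, 2])))    # one cluster of two code points
print(list(slice_at("", [])))               # no boundaries, no clusters
```

With uniseg available, the second argument would presumably be grapheme_cluster_boundaries(s) itself; grapheme_clusters(s), documented below, yields the clusters directly.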

uniseg.graphemecluster.grapheme_cluster_break(c: str, /) -> GraphemeClusterBreak

Return the Grapheme_Cluster_Break value assigned to the code point c.

c must be a single Unicode character (code point).

>>> grapheme_cluster_break('a')
GraphemeClusterBreak.OTHER
>>> grapheme_cluster_break('\r')
GraphemeClusterBreak.CR
>>> print(grapheme_cluster_break('\n'))
LF
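
The single-code-point requirement matters because a string that displays as one character can hold several code points. The sketch below uses only the standard unicodedata module, not uniseg, to make the individual code points of such a string visible. The General_Category it prints is a different property from Grapheme_Cluster_Break and is shown purely for orientation.

```python
import unicodedata

# 'g' followed by COMBINING DIAERESIS: one user-perceived character,
# but two code points.
s = "g\u0308"

# A function that expects a single code point has to be called once per
# element of the string, not on the string as a whole.
code_points = [
    (f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
    for c in s
]

for row in code_points:
    print(row)
```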

uniseg.graphemecluster.grapheme_cluster_breakables(s: str, /, *, property: Callable[[str], GraphemeClusterBreak] = grapheme_cluster_break) -> Iterable[Literal[0, 1]]

Iterate grapheme cluster breaking opportunities for every position of s.

Each item is 1 (“break”) or 0 (“do not break”) for the position before the corresponding code point. The iteration yields exactly len(s) items.

>>> list(grapheme_cluster_breakables('ABC'))
[1, 1, 1]
>>> list(grapheme_cluster_breakables('g̈')) # (== '\u0067\u0308')
[1, 0]
>>> list(grapheme_cluster_breakables(''))
[]
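
The flags and the indices returned by grapheme_cluster_boundaries carry the same information in two shapes. Below is a minimal sketch of the conversion, assuming (as the doctests in this section suggest) that every position flagged 1 is a boundary and that the end of a non-empty string is always one. boundaries_from_breakables is a hypothetical name and is not part of uniseg.

```python
from collections.abc import Iterable, Iterator


def boundaries_from_breakables(breakables: Iterable[int]) -> Iterator[int]:
    """Yield boundary indices for a sequence of 0/1 break flags.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    length = 0
    for index, breakable in enumerate(breakables):
        if breakable:
            yield index
        length = index + 1
    # A non-empty input always ends on a boundary; an empty one has none.
    if length:
        yield length


# Flag lists copied from the doctests above.
print(list(boundaries_from_breakables([1, 1, 1])))
print(list(boundaries_from_breakables([1, 0])))
print(list(boundaries_from_breakables([])))
```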

uniseg.graphemecluster.grapheme_clusters(s: str, /, *, property: Callable[[str], GraphemeClusterBreak] = grapheme_cluster_break, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) -> Iterator[str]

Iterate every grapheme cluster token of s.

Grapheme clusters (both legacy and extended):

>>> list(grapheme_clusters('g̈')) # (== '\u0067\u0308')
['g̈']
>>> list(grapheme_clusters('각')) # (== '\uac01')
['각']
>>> list(grapheme_clusters('각')) # (== '\u1100\u1161\u11a8')
['각']

Extended grapheme clusters:

>>> list(grapheme_clusters('நி')) # (== '\u0ba8\u0bbf')
['நி']
>>> list(grapheme_clusters('षि')) # (== '\u0937\u093f')
['षि']

An empty string yields an empty sequence:

>>> list(grapheme_clusters(''))
[]

You can customize the default breaking behavior to fit a specific locale by passing a tailor function. It receives s and the default sequence of breaking opportunities (an iterator) as its arguments, and returns the customized sequence of breaking opportunities:

>>> def tailor_gcb_breakables(s, breakables) -> Breakables:
...     for i, breakable in enumerate(breakables):
...         # don't break between 'c' and 'h'
...         if s.endswith('c', 0, i) and s.startswith('h', i):
...             yield 0
...         else:
...             yield breakable
...
>>> list(grapheme_clusters('Czech'))
['C', 'z', 'e', 'c', 'h']
>>> list(grapheme_clusters('Czech', tailor=tailor_gcb_breakables))
['C', 'z', 'e', 'ch']
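
When several letter pairs need the same treatment, the tailor can be produced by a small factory. This is a sketch only: make_digraph_tailor is a hypothetical helper, not part of uniseg, and the default flags are written out by hand (all 1 for plain ASCII letters) so that the block runs without uniseg installed.

```python
from collections.abc import Callable, Iterable, Iterator


def make_digraph_tailor(
    *digraphs: str,
) -> Callable[[str, Iterable[int]], Iterator[int]]:
    """Build a tailor function that never breaks inside the given digraphs.

    Hypothetical helper for illustration only; not part of uniseg.
    """
    pairs = [(d[0], d[1]) for d in digraphs if len(d) == 2]

    def tailor(s: str, breakables: Iterable[int]) -> Iterator[int]:
        for i, breakable in enumerate(breakables):
            # Suppress the break when position i falls inside a digraph.
            if any(s.endswith(a, 0, i) and s.startswith(b, i) for a, b in pairs):
                yield 0
            else:
                yield breakable

    return tailor


czech_tailor = make_digraph_tailor("ch")

# Hand-written default flags: one break opportunity before each letter.
tailored = list(czech_tailor("Czech", [1, 1, 1, 1, 1]))
print(tailored)
```

The returned function takes the same two arguments as tailor_gcb_breakables above, so it should be usable as the tailor argument in the same way.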