3.5. uniseg.linebreak — Line Break

Unicode line breaking algorithm.

UAX #14: Unicode Line Breaking Algorithm (Unicode 16.0.0)

uniseg.linebreak.LB

alias of LineBreak

class uniseg.linebreak.LineBreak(value)

Line_Break property values.

AI = 'AI'

Line_Break property value AI, Ambiguous (Alphabetic or Ideographic)

AK = 'AK'

Line_Break property value AK, Aksara

AL = 'AL'

Line_Break property value AL, Alphabetic

AP = 'AP'

Line_Break property value AP, Aksara Pre-Base

AS = 'AS'

Line_Break property value AS, Aksara Start

B2 = 'B2'

Line_Break property value B2, Break Opportunity Before and After

BA = 'BA'

Line_Break property value BA, Break After

BB = 'BB'

Line_Break property value BB, Break Before

BK = 'BK'

Line_Break property value BK, Mandatory Break

CB = 'CB'

Line_Break property value CB, Contingent Break Opportunity

CJ = 'CJ'

Line_Break property value CJ, Conditional Japanese Starter

CL = 'CL'

Line_Break property value CL, Close Punctuation

CM = 'CM'

Line_Break property value CM, Combining Mark

CP = 'CP'

Line_Break property value CP, Close Parenthesis

CR = 'CR'

Line_Break property value CR, Carriage Return

EB = 'EB'

Line_Break property value EB, Emoji Base

EM = 'EM'

Line_Break property value EM, Emoji Modifier

EX = 'EX'

Line_Break property value EX, Exclamation/Interrogation

GL = 'GL'

Line_Break property value GL, Non-breaking (“Glue”)

H2 = 'H2'

Line_Break property value H2, Hangul LV Syllable

H3 = 'H3'

Line_Break property value H3, Hangul LVT Syllable

HL = 'HL'

Line_Break property value HL, Hebrew Letter

HY = 'HY'

Line_Break property value HY, Hyphen

ID = 'ID'

Line_Break property value ID, Ideographic

IN = 'IN'

Line_Break property value IN, Inseparable

IS = 'IS'

Line_Break property value IS, Infix Numeric Separator

JL = 'JL'

Line_Break property value JL, Hangul L Jamo

JT = 'JT'

Line_Break property value JT, Hangul T Jamo

JV = 'JV'

Line_Break property value JV, Hangul V Jamo

LF = 'LF'

Line_Break property value LF, Line Feed

NL = 'NL'

Line_Break property value NL, Next Line

NS = 'NS'

Line_Break property value NS, Nonstarter

NU = 'NU'

Line_Break property value NU, Numeric

OP = 'OP'

Line_Break property value OP, Open Punctuation

PO = 'PO'

Line_Break property value PO, Postfix Numeric

PR = 'PR'

Line_Break property value PR, Prefix Numeric

QU = 'QU'

Line_Break property value QU, Quotation

RI = 'RI'

Line_Break property value RI, Regional Indicator

SA = 'SA'

Line_Break property value SA, Complex Context Dependent (South East Asian)

SG = 'SG'

Line_Break property value SG, Surrogate

SP = 'SP'

Line_Break property value SP, Space

SY = 'SY'

Line_Break property value SY, Symbols Allowing Break After

VF = 'VF'

Line_Break property value VF, Virama Final

VI = 'VI'

Line_Break property value VI, Virama

WJ = 'WJ'

Line_Break property value WJ, Word Joiner

XX = 'XX'

Line_Break property value XX, Unknown

ZW = 'ZW'

Line_Break property value ZW, Zero Width Space

ZWJ = 'ZWJ'

Line_Break property value ZWJ, Zero Width Joiner

uniseg.linebreak.line_break(c: str, /) → LineBreak

Return the Line_Break value assigned to the code point c.

c must be a single Unicode code point string.

>>> line_break('\r')
LineBreak.CR
>>> line_break(' ')
LineBreak.SP
>>> line_break('1')
LineBreak.NU
>>> line_break('᭄') # (== '\u1b44')
LineBreak.VI
>>> line_break('𐀀') # U+10000, LINEAR B SYLLABLE B008 A
LineBreak.AL

uniseg.linebreak.line_break_boundaries(s: str, /, *, property: Callable[[str], _T] = <function line_break>, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[int]

Iterate indices of the line breaking boundaries for s.

The yielded values are indices into s at which a line may be broken. As the examples below show, the start of the string (index 0) is not yielded, and for a non-empty string the last value is the end boundary, len(s).

>>> list(line_break_boundaries('a'))
[1]
>>> list(line_break_boundaries('a b'))
[2, 3]
>>> list(line_break_boundaries('a b\n'))
[2, 4]
>>> list(line_break_boundaries('あい、うえ、お。'))
[1, 3, 4, 6, 8]

The number of boundaries yielded equals the number of line break units in the string.
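
The correspondence between boundaries and units can be illustrated with a short sketch. It does not call the library; `split_at` is a hypothetical helper, and the boundary lists are copied from the examples above.

```python
def split_at(s: str, boundaries: list[int]) -> list[str]:
    """Slice s into consecutive pieces, each ending at a boundary index."""
    units = []
    start = 0
    for end in boundaries:
        units.append(s[start:end])
        start = end
    return units


# Boundary lists taken from the documented examples, not computed here.
latin_units = split_at('a b\n', [2, 4])
kana_units = split_at('あい、うえ、お。', [1, 3, 4, 6, 8])
```

Each boundary closes exactly one unit, which is why the two counts match.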

uniseg.linebreak.line_break_breakables(s: str, /, *, property: Callable[[str], LineBreak] = <function line_break>) → Iterable[Literal[0, 1]]

Iterate line breaking opportunities for every position of s.

1 means “break” and 0 means “do not break” BEFORE the position. The iteration yields exactly len(s) values.

>>> list(line_break_breakables('ABC'))
[0, 0, 0]
>>> list(line_break_breakables('Hello, world.'))
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
>>> list(line_break_breakables(''))
[]
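
The 0/1 flags relate to boundary indices in a simple way, sketched below. This is an illustration under the assumption that a boundary is every flagged position plus the end of a non-empty string; `boundaries_from_flags` is a hypothetical name and not necessarily how the library computes its result.

```python
from collections.abc import Iterable, Iterator


def boundaries_from_flags(flags: Iterable[int]) -> Iterator[int]:
    """Yield each index whose flag is 1, then the end index if non-empty."""
    length = 0
    for i, flag in enumerate(flags):
        if flag:
            yield i  # a break is allowed before position i
        length = i + 1
    if length:
        yield length  # the end of the string closes the last unit


# Flags copied from the 'Hello, world.' example above.
hello_flags = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
hello_boundaries = list(boundaries_from_flags(hello_flags))
```

For the flags shown, the sketch yields 7 (before the "w") and 13 (the end of the string).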
uniseg.linebreak.line_break_units(s: str, /, *, property: Callable[[str], LineBreak] = <function line_break>, tailor: Callable[[str, Iterable[Literal[0, 1]]], Iterable[Literal[0, 1]]] | None = None) → Iterator[str]

Iterate over every line breaking token (unit) of s.

>>> s = 'The quick (“brown”) fox can’t jump 32.3 feet, right?'
>>> '|'.join(line_break_units(s))
'The |quick |(“brown”) |fox |can’t |jump |32.3 |feet, |right?'
>>> list(line_break_units(''))
[]
>>> list(line_break_units('①①'))
['①①']
>>> def line_break_legacy(c: str, /) -> LineBreak:
...    return LB.ID if (lb := line_break(c)) == LB.AI else lb
...
>>> list(line_break_units('①①', property=line_break_legacy))
['①', '①']
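
The tailor parameter is typed as a callable that takes the string and an iterable of 0/1 flags and returns an iterable of 0/1 flags. That suggests it receives the default breakables and may adjust them, although this page does not spell out the calling convention. A minimal sketch under that assumption follows; `no_break_after_slash` is a hypothetical function, and the input flags are hand-written for illustration rather than produced by the library.

```python
from collections.abc import Iterable, Iterator


def no_break_after_slash(s: str, breakables: Iterable[int]) -> Iterator[int]:
    """Hypothetical tailor: suppress any break directly after a '/'."""
    for i, flag in enumerate(breakables):
        # Each flag describes the position BEFORE s[i], so inspect s[i - 1].
        if i > 0 and s[i - 1] == '/':
            yield 0
        else:
            yield flag


# Hand-written flags for illustration only (assumed, not library output).
text = 'a/b c'
flags = [0, 0, 1, 0, 1]
tailored = list(no_break_after_slash(text, flags))
```

If the assumed convention holds, such a function would be passed as `line_break_units(s, tailor=no_break_after_slash)`.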