Module `Uucp.Break`

Break properties.

These properties are mainly for the Unicode text segmentation and line breaking algorithms.

References

Mark Davis. UAX #29 Unicode Text Segmentation. (latest version)
Andy Heninger. UAX #14 Unicode Line Breaking Algorithm. (latest version)
Ken Lunde 小林劍. UAX #11 East Asian Width. (latest version)

Line break

type line = [ | `AI | `AL | `B2 | `BA | `BB | `BK | `CB | `CJ | `CL | `CM | `CP | `CR | `EX | `EB | `EM | `GL | `H2 | `H3 | `HL | `HY | `ID | `IN | `IS | `JL | `JT | `JV | `LF | `NL | `NS | `NU | `OP | `PO | `PR | `QU | `RI | `SA | `SG | `SP | `SY | `WJ | `XX | `ZW | `ZWJ ]: The type for line breaks.

val pp_line : Stdlib.Format.formatter -> line -> unit: pp_line ppf l prints an unspecified representation of l on ppf.

val line : Stdlib.Uchar.t -> line: line u is u's line break property.
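As an illustration, the following minimal sketch looks up the line break property of a few scalar values and prints it with pp_line. It assumes the uucp library is installed and linked. The helper names line_class_to_string and is_hard_break_class are hypothetical and not part of Uucp; grouping `BK, `CR, `LF and `NL as "hard break" classes is this example's reading of UAX #14, not something the module provides.

```ocaml
(* Sketch only: assumes the uucp library is available.
   [line_class_to_string] and [is_hard_break_class] are hypothetical helpers. *)

(* Render the line break class of [u]; the textual form is unspecified. *)
let line_class_to_string (u : Uchar.t) : string =
  Format.asprintf "%a" Uucp.Break.pp_line (Uucp.Break.line u)

(* True if [u] belongs to one of the classes BK, CR, LF or NL. *)
let is_hard_break_class (u : Uchar.t) : bool =
  match Uucp.Break.line u with
  | `BK | `CR | `LF | `NL -> true
  | _ -> false

let () =
  List.iter
    (fun cp ->
      let u = Uchar.of_int cp in
      Printf.printf "U+%04X: %s (hard break class: %b)\n" cp
        (line_class_to_string u) (is_hard_break_class u))
    [ 0x000A; 0x0020; 0x0041 ]
```

Note that the property is only an input to the line breaking algorithm; finding actual break opportunities requires applying the rules of UAX #14 to sequences of these classes.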

Grapheme cluster break

type grapheme_cluster = [ | `CN | `CR | `EX | `EB | `EBG | `EM | `GAZ | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX | `ZWJ ]: The type for grapheme cluster breaks.

val pp_grapheme_cluster : Stdlib.Format.formatter -> grapheme_cluster -> unit: pp_grapheme_cluster ppf g prints an unspecified representation of g on ppf.

val grapheme_cluster : Stdlib.Uchar.t -> grapheme_cluster: grapheme_cluster u is u's grapheme cluster break property.
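A short sketch of a property lookup, assuming the uucp library is installed and linked. The helper is_hangul_part is hypothetical; it only groups the Hangul-related property values listed in the type above.

```ocaml
(* Sketch only: assumes the uucp library is available.
   [is_hangul_part] is a hypothetical helper, not part of Uucp. *)

(* True if [u] has one of the Hangul-related grapheme cluster break values. *)
let is_hangul_part (u : Uchar.t) : bool =
  match Uucp.Break.grapheme_cluster u with
  | `L | `V | `T | `LV | `LVT -> true
  | _ -> false

let () =
  List.iter
    (fun cp ->
      let u = Uchar.of_int cp in
      Format.printf "U+%04X: %a (Hangul part: %b)@." cp
        Uucp.Break.pp_grapheme_cluster (Uucp.Break.grapheme_cluster u)
        (is_hangul_part u))
    [ 0x000D; 0x1100; 0xAC00; 0xAC01; 0x200D ]
```

The values returned here are per-character classifications; determining grapheme cluster boundaries still requires applying the rules of UAX #29 to adjacent values.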

Word break

type word = [ | `CR | `DQ | `EX | `EB | `EBG | `EM | `Extend | `FO | `GAZ | `HL | `KA | `LE | `LF | `MB | `ML | `MN | `NL | `NU | `RI | `SQ | `WSegSpace | `XX | `ZWJ ]: The type for word breaks.

val pp_word : Stdlib.Format.formatter -> word -> unit: pp_word ppf w prints an unspecified representation of w on ppf.

val word : Stdlib.Uchar.t -> word: word u is u's word break property.

Sentence break

type sentence = [ | `AT | `CL | `CR | `EX | `FO | `LE | `LF | `LO | `NU | `SC | `SE | `SP | `ST | `UP | `XX ]: The type for sentence breaks.

val pp_sentence : Stdlib.Format.formatter -> sentence -> unit: pp_sentence ppf s prints an unspecified representation of s on ppf.

val sentence : Stdlib.Uchar.t -> sentence: sentence u is u's sentence break property.

East Asian width

type east_asian_width = [ | `A | `F | `H | `N | `Na | `W ]: The type for East Asian widths.

val pp_east_asian_width : Stdlib.Format.formatter -> east_asian_width -> unit: pp_east_asian_width ppf w prints an unspecified representation of w on ppf.

val east_asian_width : Stdlib.Uchar.t -> east_asian_width: east_asian_width u is u's East Asian width property.

Terminal width

val tty_width_hint : Stdlib.Uchar.t -> int

tty_width_hint u approximates u's column width as rendered by a typical character terminal.

The current implementation returns 0, 1, 2 or -1. The value -1 is returned only for scalar values for which the property is nonsensical: those in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls). Clients are expected to sanitize their inputs and not to call the function on these scalar values.

Note. Converting a string to normalization form C before folding this function over its scalar values will, in general, yield better approximations (e.g. on Hangul).
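Folding the function over the scalar values of a string can be sketched as follows. This is a minimal example, assuming the uucp library is linked and OCaml 4.14 or later for String.get_utf_8_uchar. The name width_hint is hypothetical, and the sketch does not perform the normalization mentioned above, which would need a separate normalization library.

```ocaml
(* Sketch only: assumes uucp is available and OCaml >= 4.14.
   [width_hint] is a hypothetical helper, not part of Uucp. *)

(* Sum [Uucp.Break.tty_width_hint] over the scalar values of the UTF-8
   string [s]. Returns [None] as soon as a scalar value with undefined
   width (-1) is found. Invalid byte sequences decode to U+FFFD. *)
let width_hint (s : string) : int option =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then Some acc
    else
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      match Uucp.Break.tty_width_hint u with
      | -1 -> None
      | w -> loop (i + Uchar.utf_decode_length d) (acc + w)
  in
  loop 0 0

let () =
  match width_hint "hello" with
  | Some w -> Printf.printf "width hint: %d\n" w
  | None -> print_endline "input contains control characters"
```

Returning an option makes the unsanitized case explicit to the caller instead of letting a -1 silently corrupt the sum.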

Warning. This is not a normative property and only a heuristic. If you find yourself using this function please read carefully the following lines.

This function is the moral equivalent of POSIX wcwidth, in that its purpose is to help align text displayed by a character terminal. It mimics wcwidth, as widely implemented, in yet another way: it is mostly wrong.

Computing column width is a surprisingly difficult task in general. Much of the software infrastructure still carries legacy assumptions about the nature of text harking back to the ASCII era. Different terminal emulators attempt to cope with general Unicode text in different ways, creating a fundamental problem: width of text fragments will vary across terminal emulators, with no way of getting feedback from the output layer back into the text-producing layer.

For example: on a modern Linux system, a collection of terminals will disagree on some or all of U+00AD, U+0CBF, and U+2029. They will likewise disagree about unassigned characters (category Cn), sometimes contradicting the system's wcwidth (e.g. U+0378, U+0530). Terminals using bare libxft will display complex scripts differently from terminals using HarfBuzz, and the rendering on OS X will be slightly different from both.

tty_width_hint uses a simple and predictable width algorithm, based on Markus Kuhn's portable wcwidth:

Scalar values in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls) have undefined width (-1).
Characters with East Asian Width Fullwidth or Wide have a width of 2.
Characters with General Category Mn, Me, Cf and U+0000 have a width of 0.
Most other characters have a width of 1, including Cn.
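The rules above can be approximated with other Uucp properties. The following is a rough sketch, not the library's implementation: it assumes the uucp library is linked, the name width_sketch is hypothetical, and the rules are applied in the order listed, which is an assumption for characters matching more than one rule. Because the last rule says "most", the result may differ from tty_width_hint on some characters.

```ocaml
(* Sketch only: assumes the uucp library is available.
   [width_sketch] is hypothetical and may disagree with
   [Uucp.Break.tty_width_hint] on some scalar values. *)
let width_sketch (u : Uchar.t) : int =
  let cp = Uchar.to_int u in
  if (0x0001 <= cp && cp <= 0x001F) || (0x007F <= cp && cp <= 0x009F)
  then -1 (* C0 controls without U+0000, DELETE and C1 controls *)
  else if cp = 0x0000 then 0
  else
    match Uucp.Break.east_asian_width u with
    | `F | `W -> 2 (* Fullwidth or Wide *)
    | `A | `H | `N | `Na ->
      (match Uucp.Gc.general_category u with
       | `Mn | `Me | `Cf -> 0 (* non-spacing, enclosing, format *)
       | _ -> 1 (* everything else, including Cn *))

let () =
  List.iter
    (fun cp ->
      Printf.printf "U+%04X: %d\n" cp (width_sketch (Uchar.of_int cp)))
    [ 0x0000; 0x0007; 0x0041; 0x0301; 0x4E00 ]
```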

This approach works well, in that it gives results generally consistent with a wide range of terminals, for alphabetic scripts and for East Asian syllabic and logographic scripts in non-decomposed form. Support varies for abjad scripts in the presence of vowel marks, and it mostly breaks down on abugidas.

Moreover, non-text symbols like Emoji or Yijing hexagrams will be incorrectly classified as 1-wide, but this in fact agrees with their rendering on many terminals.

Clients should not over-rely on tty_width_hint. It provides a best-effort approximation which will sometimes fail in practice.

Low level interface

module Low : sig ... end: Low level interface.