Module Uucp.Break
Break properties.
These properties are mainly for the Unicode text segmentation and line breaking algorithms.
References
- Mark Davis. UAX #29 Unicode Text Segmentation. (latest version)
- Andy Heninger. UAX #14 Unicode Line Breaking Algorithm. (latest version)
- Ken Lunde 小林劍. UAX #11 East Asian width. (latest version)
Line break
type line = []
The type for line breaks.
val pp_line : Stdlib.Format.formatter -> line -> unit
pp_line ppf l prints an unspecified representation of l on ppf.
val line : Stdlib.Uchar.t -> line
line u is u's line break property.
Grapheme cluster break
type grapheme_cluster = [ `CN | `CR | `EX | `EB | `EBG | `EM | `GAZ | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX | `ZWJ ]
The type for grapheme cluster breaks.
val pp_grapheme_cluster : Stdlib.Format.formatter -> grapheme_cluster -> unit
pp_grapheme_cluster ppf g prints an unspecified representation of g on ppf.
val grapheme_cluster : Stdlib.Uchar.t -> grapheme_cluster
grapheme_cluster u is u's grapheme cluster break property.
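As a small illustration of how such a property feeds a segmentation rule, the sketch below encodes rule GB3 of UAX #29 (no break between CR and LF). The names gb3_forbids_break and toy_prop are hypothetical and not part of this module. The property lookup is passed as the prop argument so that the block only needs the standard library; the intent is that grapheme_cluster would be supplied in real code. A complete segmenter needs all the rules of UAX #29, not just this one.

```ocaml
(* [gb3_forbids_break ~prop u0 u1] is [true] if rule GB3 of UAX #29
   forbids a grapheme cluster break between [u0] and [u1], that is
   if [u0] has property `CR and [u1] has property `LF.
   [prop] is the property lookup (an assumption of this sketch). *)
let gb3_forbids_break ~prop (u0 : Uchar.t) (u1 : Uchar.t) : bool =
  match prop u0, prop u1 with
  | `CR, `LF -> true
  | _ -> false

(* Toy lookup that only knows U+000D and U+000A. It stands in for a
   real property function and is for illustration only. *)
let toy_prop (u : Uchar.t) =
  match Uchar.to_int u with
  | 0x000D -> `CR
  | 0x000A -> `LF
  | _ -> `XX
```

With the toy lookup, gb3_forbids_break ~prop:toy_prop applied to U+000D then U+000A holds, while the reverse order does not.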
Word break
type word = [ `CR | `DQ | `EX | `EB | `EBG | `EM | `Extend | `FO | `GAZ | `HL | `KA | `LE | `LF | `MB | `ML | `MN | `NL | `NU | `RI | `SQ | `WSegSpace | `XX | `ZWJ ]
The type for word breaks.
val pp_word : Stdlib.Format.formatter -> word -> unit
pp_word ppf w prints an unspecified representation of w on ppf.
val word : Stdlib.Uchar.t -> word
word u is u's word break property.
Sentence break
type sentence = [ `AT | `CL | `CR | `EX | `FO | `LE | `LF | `LO | `NU | `SC | `SE | `SP | `ST | `UP | `XX ]
The type for sentence breaks.
val pp_sentence : Stdlib.Format.formatter -> sentence -> unit
pp_sentence ppf s prints an unspecified representation of s on ppf.
val sentence : Stdlib.Uchar.t -> sentence
sentence u is u's sentence break property.
East Asian width
val pp_east_asian_width : Stdlib.Format.formatter -> east_asian_width -> unit
pp_east_asian_width ppf w prints an unspecified representation of w on ppf.
val east_asian_width : Stdlib.Uchar.t -> east_asian_width
east_asian_width u is u's East Asian width property.
Terminal width
val tty_width_hint : Stdlib.Uchar.t -> int
tty_width_hint u approximates u's column width as rendered by a typical character terminal.
The current implementation of the function returns either 0, 1, 2 or -1. The value -1 is only returned for scalar values for which the property is nonsensical; clients are expected to sanitize their inputs and not to use the function with these scalar values, which are those in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls).
Note. Converting a string to normalization form C before folding this function over its scalar values will, in general, yield better approximations (e.g. on Hangul).
Warning. This is only a heuristic, not a normative property. If you find yourself using this function, please read the following lines carefully.
This function is the moral equivalent of POSIX wcwidth, in that its purpose is to help align text displayed by a character terminal. It mimics wcwidth, as widely implemented, in yet another way: it is mostly wrong.
Computing column width is a surprisingly difficult task in general. Much of the software infrastructure still carries legacy assumptions about the nature of text harking back to the ASCII era. Different terminal emulators attempt to cope with general Unicode text in different ways, creating a fundamental problem: the width of text fragments will vary across terminal emulators, with no way of getting feedback from the output layer back into the text-producing layer.
For example: on a modern Linux system, a collection of terminals will disagree on some or all of U+00AD, U+0CBF, and U+2029. They will likewise disagree about unassigned characters (category Cn), sometimes contradicting the system's wcwidth (e.g. U+0378, U+0530). Terminals using bare libxft will display complex scripts differently from terminals using HarfBuzz, and the rendering on OS X will be slightly different from both.
tty_width_hint uses a simple and predictable width algorithm, based on Markus Kuhn's portable wcwidth:
- Scalar values in the ranges U+0001-U+001F (C0 controls without U+0000) and U+007F-U+009F (DELETE and C1 controls) have undefined width (-1).
- Characters with East Asian Width Fullwidth or Wide have a width of 2.
- Characters with General Category Mn, Me, Cf and U+0000 have a width of 0.
- Most other characters have a width of 1, including Cn.
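The rules listed above can be sketched as a small decision function. This is not the library's implementation: width_sketch and the toy predicates are hypothetical names, the two predicates stand in for real Unicode property lookups, and the order in which overlapping rules are resolved simply follows the order of the list, which is an assumption.

```ocaml
(* [width_sketch ~is_wide ~is_zero_width u] applies the listed rules
   in order.
   [is_wide] should hold for East Asian Width Fullwidth or Wide,
   [is_zero_width] for General Category Mn, Me or Cf. Both are
   hypothetical parameters of this sketch. *)
let width_sketch ~is_wide ~is_zero_width (u : Uchar.t) : int =
  let c = Uchar.to_int u in
  (* C0 controls without U+0000, DELETE and C1 controls: undefined. *)
  if (0x0001 <= c && c <= 0x001F) || (0x007F <= c && c <= 0x009F) then -1
  else if is_wide u then 2
  else if c = 0x0000 || is_zero_width u then 0
  else 1

(* Toy predicates that only cover two well-known blocks: CJK Unified
   Ideographs (Wide) and Combining Diacritical Marks (Mn). Real code
   would consult complete property data instead. *)
let toy_is_wide (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  0x4E00 <= c && c <= 0x9FFF

let toy_is_zero_width (u : Uchar.t) : bool =
  let c = Uchar.to_int u in
  0x0300 <= c && c <= 0x036F

let toy_width : Uchar.t -> int =
  width_sketch ~is_wide:toy_is_wide ~is_zero_width:toy_is_zero_width
```

With these toy predicates, toy_width gives 1 for U+0041, 2 for U+4E2D, 0 for U+0301 and -1 for U+001B.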
This approach works well, in that it gives results generally consistent with a wide range of terminals, for alphabetic scripts, and for East Asian syllabic and logographic scripts in non-decomposed form. Support varies for abjad scripts in the presence of vowel marks, and it mostly breaks down on abugidas.
Moreover, non-text symbols like Emoji or Yijing hexagrams will be incorrectly classified as 1-wide, but this in fact agrees with their rendering on many terminals.
Clients should not over-rely on tty_width_hint. It provides a best-effort approximation which will sometimes fail in practice.
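A minimal sketch of folding a width function over the scalar values of a UTF-8 string follows. The name string_width and its width parameter are hypothetical; the intent is that tty_width_hint would be passed as width. The sketch assumes OCaml 4.14 or later for String.get_utf_8_uchar, performs no normalization, and returns None as soon as a scalar value has undefined width so that the caller notices unsanitized input.

```ocaml
(* [string_width ~width s] sums [width] over the scalar values of the
   UTF-8 encoded string [s]. The result is [None] if [width] returns a
   negative value for some scalar value. Invalid byte sequences decode
   to U+FFFD (Uchar.rep), as specified by the standard library. *)
let string_width ~(width : Uchar.t -> int) (s : string) : int option =
  let len = String.length s in
  let rec loop i acc =
    if i >= len then Some acc
    else
      let d = String.get_utf_8_uchar s i in
      let w = width (Uchar.utf_decode_uchar d) in
      if w < 0 then None
      else loop (i + Uchar.utf_decode_length d) (acc + w)
  in
  loop 0 0
```

Returning an option rather than silently adding -1 keeps the undefined case explicit; a caller may instead choose to strip or escape control characters before measuring.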
Low level interface
module Low : sig ... end
Low level interface.