Module Uunf
Unicode text normalization.
Uunf normalizes Unicode text. It supports all Unicode normalization forms. The module is independent from any IO mechanism or Unicode text data structure and it can process text without a complete in-memory representation of the data.
The supported Unicode version is determined by the unicode_version value.
Consult the basics, limitations and examples of use.
Version 11.0.0, Unicode version 11.0.0
References
- The Unicode Consortium. The Unicode Standard. (latest version)
- Mark Davis. UAX #15 Unicode Normalization Forms. (latest version)
- The Unicode Consortium. Normalization charts.
Normalize
type form = [ `NFD | `NFC | `NFKD | `NFKC ]
The type for normalization forms.
- `NFD normalization form D, canonical decomposition.
- `NFC normalization form C, canonical decomposition followed by canonical composition (recommended for the www).
- `NFKD normalization form KD, compatibility decomposition.
- `NFKC normalization form KC, compatibility decomposition, followed by canonical composition.
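To make the difference between the four forms concrete, the sketch below normalizes one short UTF-8 string with each of them. It assumes the uunf library is installed together with its Uunf_string module, and that the compiler accepts `\u{...}` escapes in string literals (OCaml 4.06 or later). The names `s`, `nfd`, `nfc`, `nfkd` and `nfkc` are illustrative only.

```ocaml
(* Requires the uunf library, including its Uunf_string module. *)

(* U+FB01 LATIN SMALL LIGATURE FI followed by U+00E9 (e with acute). *)
let s = "\u{FB01}\u{00E9}"

(* Canonical decomposition: U+00E9 splits into e + U+0301, the ligature stays. *)
let nfd = Uunf_string.normalize_utf_8 `NFD s

(* Canonical decomposition then composition: the input is left unchanged. *)
let nfc = Uunf_string.normalize_utf_8 `NFC s

(* Compatibility decomposition: the ligature also splits into "f" and "i". *)
let nfkd = Uunf_string.normalize_utf_8 `NFKD s

(* Compatibility decomposition then canonical composition. *)
let nfkc = Uunf_string.normalize_utf_8 `NFKC s
```

Only the compatibility forms replace the ligature, which is why they are not suitable when the original characters must be preserved.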
type ret = [ `Uchar of Stdlib.Uchar.t | `End | `Await ]
The type for normalizer results. See add.
val add : t -> [ `Uchar of Stdlib.Uchar.t | `Await | `End ] -> ret
add n v is:
- `Uchar u if u is the next character in the normalized sequence. The client must then call add with `Await until `Await is returned.
- `Await when the normalizer is ready to add a new `Uchar or `End.
For v use `Uchar u to add a new character to the sequence to normalize and `End to signal the end of sequence. After adding one of these two values, always call add with `Await until `Await is returned.
Raises Invalid_argument if `Uchar or `End is added directly after an `Uchar was returned by the normalizer or if an `Uchar is added after `End was added.
val reset : t -> unit
reset n resets the normalizer to a state equivalent to the state of Uunf.create (Uunf.form n).
val copy : t -> t
copy n is a copy of n in its current state. Subsequent adds on n do not affect the copy.
val pp_ret : Stdlib.Format.formatter -> ret -> unit
pp_ret ppf v prints an unspecified representation of v on ppf.
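As an illustration of reset, the following sketch reuses a single normalizer for several independent inputs instead of creating a new one each time. It assumes the uunf library is installed. The helper names `drain` and `normalize_list` are hypothetical and not part of the module.

```ocaml
(* Requires the uunf library. *)

(* Feed [v] to [n] and collect every character it outputs, in reverse
   order, calling add with `Await until the normalizer stops producing. *)
let rec drain n acc v = match Uunf.add n v with
  | `Uchar u -> drain n (u :: acc) `Await
  | `Await | `End -> acc

(* Normalize one list of characters with [n], then reset [n] so that it
   can be used again for an unrelated sequence. *)
let normalize_list n us =
  let acc = List.fold_left (fun acc u -> drain n acc (`Uchar u)) [] us in
  let acc = drain n acc `End in
  Uunf.reset n;
  List.rev acc

let nfd = Uunf.create `NFD
let e_acute = Uchar.of_int 0x00E9
let first = normalize_list nfd [ e_acute ]
let second = normalize_list nfd [ e_acute ]
```

Without the call to reset, the second use would add an `Uchar after `End was added, which add rejects with Invalid_argument.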
Normalization properties
These properties are used internally to implement the normalizers. They are not needed to use the module but are exposed as they may be useful to implement other algorithms.
val ccc : Stdlib.Uchar.t -> int
ccc u is u's canonical combining class value.
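One example of an algorithm that can be built on ccc is a check that a sequence is already in canonical order, meaning that no character with a non-zero combining class directly follows a character with a higher one. This is a minimal sketch that assumes the uunf library is installed. The function name `is_canonically_ordered` is hypothetical.

```ocaml
(* Requires the uunf library. *)

(* [is_canonically_ordered us] is [true] if no character of [us] with a
   non-zero combining class directly follows one with a higher class. *)
let is_canonically_ordered us =
  let rec loop prev = function
    | [] -> true
    | u :: us ->
        let c = Uunf.ccc u in
        if c <> 0 && prev > c then false else loop c us
  in
  loop 0 us

let a = Uchar.of_int 0x0061         (* starter, ccc = 0 *)
let acute = Uchar.of_int 0x0301     (* COMBINING ACUTE ACCENT, ccc = 230 *)
let dot_below = Uchar.of_int 0x0323 (* COMBINING DOT BELOW, ccc = 220 *)

let ordered = is_canonically_ordered [ a; dot_below; acute ]
let unordered = is_canonically_ordered [ a; acute; dot_below ]
```

Here `ordered` is true and `unordered` is false, since the mark with class 220 must come before the mark with class 230.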
val decomp : Stdlib.Uchar.t -> int array
decomp u is u's decomposition mapping. If the empty array is returned, u decomposes to itself.
The first number in the array contains additional information; it cannot be used as an uchar. Use d_uchar on the number to get the actual character and d_compatibility to find out if this is a compatibility decomposition. All other characters of the array are guaranteed to be convertible using Uchar.of_int.
Warning. Do not mutate the array.
val d_uchar : int -> Stdlib.Uchar.t
See decomp.
val d_compatibility : int -> bool
See decomp.
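The encoding of the array returned by decomp can be unpacked as in the sketch below, which turns it into a compatibility flag and a plain list of characters. It assumes the uunf library is installed. The function name `decomposition` is hypothetical and not part of the module.

```ocaml
(* Requires the uunf library. *)

(* [decomposition u] is [(compat, us)] where [us] is [u]'s decomposition
   mapping and [compat] tells whether it is a compatibility mapping.
   A character that decomposes to itself yields [(false, [u])]. *)
let decomposition u =
  let d = Uunf.decomp u in
  if Array.length d = 0 then (false, [ u ]) else
  let first = d.(0) in
  (* Array.to_list copies the elements, so the array is never mutated. *)
  let rest = List.map Uchar.of_int (List.tl (Array.to_list d)) in
  (Uunf.d_compatibility first, Uunf.d_uchar first :: rest)

(* U+00E9 has a canonical mapping, U+FB01 (fi ligature) a compatibility one. *)
let e_acute_decomp = decomposition (Uchar.of_int 0x00E9)
let fi_decomp = decomposition (Uchar.of_int 0xFB01)
```

Only the first element needs d_uchar; the remaining elements go through Uchar.of_int, as the description of decomp guarantees.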
val composite : Stdlib.Uchar.t -> Stdlib.Uchar.t -> Stdlib.Uchar.t option
composite u1 u2 is the primary composite canonically equivalent to the sequence <u1,u2>, if any.
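A short sketch of composite in use, assuming the uunf library is installed: the pair made of "e" and the combining acute accent has a primary composite, while two unrelated letters do not. The value names are illustrative.

```ocaml
(* Requires the uunf library. *)
let e = Uchar.of_int 0x0065     (* LATIN SMALL LETTER E *)
let acute = Uchar.of_int 0x0301 (* COMBINING ACUTE ACCENT *)

(* <e, U+0301> composes to U+00E9. *)
let composed = Uunf.composite e acute

(* Two plain letters have no primary composite. *)
let not_composed = Uunf.composite e (Uchar.of_int 0x0062)

let () = match composed with
  | Some c -> Printf.printf "U+%04X\n" (Uchar.to_int c)
  | None -> print_endline "no primary composite"
```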
Limitations
An Uunf normalizer consumes only a small, bounded amount of memory on ordinary, meaningful text. However, on legal but degenerate text, such as a starter followed by 10,000 combining non-spacing marks, it has to buffer all the marks (a workaround is to first convert the input to the stream-safe text format).
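The workaround can be approximated with the sketch below, which inserts U+034F COMBINING GRAPHEME JOINER whenever a run of non-starters would exceed 30 characters. This is a simplification and not the full process from UAX #15: the full process counts non-starters in each character's NFKD decomposition, whereas this version only looks at the combining class of each input character. Note also that the inserted joiners change the text. The name `stream_safe_approx` is hypothetical, and the code assumes the uunf library is installed.

```ocaml
(* Requires the uunf library. *)

let cgj = Uchar.of_int 0x034F (* COMBINING GRAPHEME JOINER, a starter *)
let max_non_starters = 30

(* Simplified approximation of the stream-safe text process: break every
   run of more than [max_non_starters] characters with a non-zero
   combining class by inserting [cgj]. *)
let stream_safe_approx us =
  let rec loop count acc = function
    | [] -> List.rev acc
    | u :: us ->
        if Uunf.ccc u = 0 then loop 0 (u :: acc) us
        else if count >= max_non_starters then loop 1 (u :: cgj :: acc) us
        else loop (count + 1) (u :: acc) us
  in
  loop 0 [] us

(* A starter followed by 31 combining acute accents. *)
let degenerate =
  Uchar.of_int 0x0061 :: List.init 31 (fun _ -> Uchar.of_int 0x0301)

let safe = stream_safe_approx degenerate
```

With this input, one joiner is inserted before the 31st mark, so a normalizer reading `safe` never has to buffer more than 30 marks at once.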
Basics
A normalizer is a stateful filter that inputs a sequence of characters and outputs an equivalent sequence in the requested normal form.
The function create returns a new normalizer for a given normal form:
let nfd = Uunf.create `NFD
To add characters to the sequence to normalize, call add on nfd with `Uchar _. To end the sequence, call add on nfd with `End. The normalized sequence of characters is returned, character by character, by the successive calls to add.
The client and the normalizer must wait on each other to limit internal buffering: each time the client adds to the sequence by calling add with `Uchar or `End it must continue to call add with `Await until the normalizer returns `Await. In practice this leads to the following kind of control flow:
let rec add acc v = match Uunf.add nfd v with
| `Uchar u -> add (u :: acc) `Await
| `Await | `End -> acc
For example, to normalize the character U+00E9 (é) with nfd to a list of characters we can write:
let e_acute = Uchar.of_int 0x00E9
let e_acute_nfd = List.rev (add (add [] (`Uchar e_acute)) `End)
The next section has more examples.
Examples
UTF-8 normalization
utf_8_normalize nf s is the UTF-8 encoded normal form nf of the UTF-8 encoded string s. This example uses Uutf to fold over the characters of s and to encode the normalized sequence in a standard OCaml buffer.
let utf_8_normalize nf s =
  let b = Buffer.create (String.length s * 3) in
  let n = Uunf.create nf in
  let rec add v = match Uunf.add n v with
  | `Uchar u -> Uutf.Buffer.add_utf_8 b u; add `Await
  | `Await | `End -> ()
  in
  let add_uchar _ _ = function
  | `Malformed _ -> add (`Uchar Uutf.u_rep)
  | `Uchar _ as u -> add u
  in
  Uutf.String.fold_utf_8 add_uchar () s; add `End; Buffer.contents b
Note that this functionality is available directly through Uunf_string.normalize_utf_8.