Module Uuseg
Unicode text segmentation.
Uuseg segments Unicode text. It implements the locale independent Unicode text segmentation algorithms to detect grapheme cluster, word and sentence boundaries and the Unicode line breaking algorithm to detect line break opportunities.
The module is independent from any IO mechanism or Unicode text data structure and it can process text without a complete in-memory representation.
The supported Unicode version is determined by the unicode_version value.
Consult the basics, limitations and examples of use.
Warning. In version 11.0.0, UAX #29 grapheme cluster and word segmentation are not strictly conformant with respect to emojis; see this issue for details.
v11.0.0 — Unicode version 11.0.0 — homepage
References
- The Unicode Consortium. The Unicode Standard. (latest version)
- Mark Davis. UAX #29 Unicode Text Segmentation. (latest version)
- Andy Heninger. UAX #14 Unicode Line Breaking Algorithm. (latest version)
- Web based ICU break utility.
Segment
type custom
The type for custom segmenters. See custom.
type boundary = [ `Grapheme_cluster | `Word | `Sentence | `Line_break | `Custom of custom ]
The type for boundaries.
- `Grapheme_cluster determines extended grapheme cluster boundaries according to UAX 29 (corresponds, for most scripts, to user-perceived characters).
- `Word determines word boundaries according to UAX 29.
- `Sentence determines sentence boundaries according to UAX 29.
- `Line_break determines mandatory line breaks and line break opportunities according to UAX 14.
val pp_boundary : Stdlib.Format.formatter -> boundary -> unit
pp_boundary ppf b prints an unspecified representation of b on ppf.
type ret = [ `Boundary | `Uchar of Stdlib.Uchar.t | `Await | `End ]
The type for segmenter results. See add.
val add : t -> [ `Uchar of Stdlib.Uchar.t | `Await | `End ] -> ret
add s v is:
- `Boundary if there is a boundary at that point in the sequence of characters. The client must then call add with `Await until `Await is returned.
- `Uchar u if u is the next character in the sequence. The client must then call add with `Await until `Await is returned.
- `Await when the segmenter is ready to add a new `Uchar or `End.
- `End when `End was added and all `Boundary and `Uchar were output.
For v use `Uchar u to add a new character to the sequence to segment and `End to signal the end of sequence. After adding one of these two values always call add with `Await until `Await or `End is returned.
Raises Invalid_argument if `Uchar or `End is added while the last add did not return `Await, or if a `Uchar or `End is added after an `End was already added.
val mandatory : t -> bool
mandatory s is true if the last `Boundary returned by add was mandatory. This function only makes sense for `Line_break segmenters or `Custom segmenters that sport that notion. For other segmenters or if no `Boundary was returned so far, true is returned.
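To illustrate how mandatory combines with add, here is a minimal sketch that classifies each boundary reported by a `Line_break segmenter. The function name line_breaks and the `Mandatory and `Opportunity tags are hypothetical and only serve the illustration; the sketch assumes mandatory is queried right after add returns `Boundary.

```ocaml
(* Classify each line break boundary found in a list of characters.
   [line_breaks] and its result tags are illustrative names, not part
   of Uuseg. *)
let line_breaks (ucs : Uchar.t list) : [ `Mandatory | `Opportunity ] list =
  let seg = Uuseg.create `Line_break in
  let rec add acc v = match Uuseg.add seg v with
  | `Boundary ->
      (* Query the segmenter right after it reported the boundary. *)
      let kind = if Uuseg.mandatory seg then `Mandatory else `Opportunity in
      add (kind :: acc) `Await
  | `Uchar _ -> add acc `Await
  | `Await | `End -> acc
  in
  let acc = List.fold_left (fun acc u -> add acc (`Uchar u)) [] ucs in
  List.rev (add acc `End)
```

With this sketch, a break following a line feed should be classified as `Mandatory, while a break following a space between two letters should be an `Opportunity.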
val copy : t -> t
copy s is a copy of s in its current state. Subsequent adds on s do not affect the copy.
val pp_ret : Stdlib.Format.formatter -> [< ret ] -> unit
pp_ret ppf v prints an unspecified representation of v on ppf.
Custom segmenters
val custom : ?mandatory:('a -> bool) -> name:string -> create:(unit -> 'a) -> copy:('a -> 'a) -> add:('a -> [ `Uchar of Stdlib.Uchar.t | `Await | `End ] -> ret) -> unit -> custom
custom ?mandatory ~name ~create ~copy ~add () is a custom segmenter.
- name is a name to identify the segmenter.
- create is called when the segmenter is created; it should return a custom segmenter value.
- copy is called with the segmenter value whenever the segmenter is copied. It should return a copy of the segmenter value.
- mandatory is called with the segmenter value to define the result of the mandatory function. Defaults to a function that always returns true.
- add is called with the segmenter value to define the result of the add function. The returned value should respect the semantics of add. Use the functions err_exp_await and err_ended to raise the Invalid_argument exception in add's error cases.
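As an illustration of the contract described above, the following is a minimal sketch of a custom segmenter that reports a boundary around every character. The state record, its field names, per_char and segment are all hypothetical. The sketch assumes that err_exp_await and err_ended accept the offending added value and raise Invalid_argument, and it uses the default for mandatory.

```ocaml
(* State of a hypothetical per-character segmenter. *)
type state =
  { mutable ready : bool;  (* true if the client may add `Uchar or `End *)
    mutable ended : bool;  (* true once `End was added *)
    mutable seen : bool;   (* true once at least one character was added *)
    mutable pending : Uuseg.ret list } (* results still to be output *)

let create () = { ready = true; ended = false; seen = false; pending = [] }
let copy s = { s with ready = s.ready } (* fresh record, independent state *)

let add s v = match v with
| (`Uchar _ | `End) when s.ended -> Uuseg.err_ended v
| (`Uchar _ | `End) when not s.ready -> Uuseg.err_exp_await v
| `Uchar u ->
    (* Output a boundary now and the character on the next `Await. *)
    s.ready <- false; s.seen <- true; s.pending <- [`Uchar u]; `Boundary
| `End ->
    (* Close non-empty text with a final boundary. *)
    s.ready <- false; s.ended <- true;
    if s.seen then `Boundary else `End
| `Await ->
    match s.pending with
    | r :: rs -> s.pending <- rs; r
    | [] -> if s.ended then `End else (s.ready <- true; `Await)

let per_char : Uuseg.custom =
  Uuseg.custom ~name:"per_char" ~create ~copy ~add ()

(* Run the custom segmenter over a list of characters. *)
let segment ucs =
  let seg = Uuseg.create (`Custom per_char) in
  let rec go acc v = match Uuseg.add seg v with
  | `Uchar u -> go (`U (Uchar.to_int u) :: acc) `Await
  | `Boundary -> go (`B :: acc) `Await
  | `Await | `End -> acc
  in
  List.rev (go (List.fold_left (fun acc u -> go acc (`Uchar u)) [] ucs) `End)
```

Keeping the pending results in the state lets add hand them out one at a time on successive `Await calls, which is what the add semantics require of the client and the segmenter.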
Limitations
A `Grapheme_cluster segmenter will always consume only a small bounded amount of memory on any text. Other segmenters will also do so on non-degenerate text, but it's possible to feed them with input that will make them buffer an arbitrary amount of characters.
Basics
A segmenter is a stateful filter that inputs a sequence of characters and outputs the same sequence except characters are interleaved with `Boundary values whenever the segmenter detects a boundary.
The function create returns a new segmenter for a given boundary type:
let words = Uuseg.create `Word
To add characters to the sequence to segment, call add on words with `Uchar _. To end the sequence call add on words with `End. The segmented sequence of characters is returned character by character, interleaved with `Boundary values at the appropriate places, by the successive calls to add.
The client and the segmenter must wait on each other to limit internal buffering: each time the client adds to the sequence by calling add with `Uchar or `End it must continue to call add with `Await until the segmenter returns `Await or `End. In practice this leads to the following kind of control flow:
let rec add acc v = match Uuseg.add words v with
| `Uchar u -> add (`Uchar u :: acc) `Await
| `Boundary -> add (`B :: acc) `Await
| `Await | `End -> acc
For example, to segment the sequence <U+0041, U+0020, U+0042> ("A B") into a list of characters interleaved with `B values on word boundaries, we can write:
let uchar u = `Uchar (Uchar.of_int u)
let seq = [uchar 0x0041; uchar 0x0020; uchar 0x0042]
let seq_words = List.rev (add (List.fold_left add [] seq) `End)
Examples
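As another small illustration of the add and `Await control flow, the sketch below counts extended grapheme clusters in a list of characters. The function name grapheme_count is hypothetical. It assumes, as the segment example does, that a non-empty text yields a boundary at its start and at its end, so that the number of clusters is one less than the number of boundaries.

```ocaml
(* Count extended grapheme clusters in a list of characters.
   [grapheme_count] is an illustrative name, not part of Uuseg. *)
let grapheme_count (ucs : Uchar.t list) : int =
  let seg = Uuseg.create `Grapheme_cluster in
  let rec add n v = match Uuseg.add seg v with
  | `Boundary -> add (n + 1) `Await   (* count every boundary *)
  | `Uchar _ -> add n `Await          (* characters are ignored *)
  | `Await | `End -> n
  in
  let n = List.fold_left (fun n u -> add n (`Uchar u)) 0 ucs in
  let boundaries = add n `End in
  (* Assumption: k clusters are delimited by k + 1 boundaries. *)
  if boundaries = 0 then 0 else boundaries - 1
```

For instance, the sequence <U+0065, U+0301> (an "e" followed by a combining acute accent) is expected to count as a single cluster.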
utf_8_segments seg s is the list of UTF-8 encoded seg segments of the UTF-8 encoded string s. This example uses Uutf to fold over the characters of s and to encode the characters in a standard OCaml buffer. Note that this function can be derived directly from Uuseg_string.fold_utf_8.
let utf_8_segments seg s =
  (* Buffer accumulating the UTF-8 bytes of the current segment. *)
  let b = Buffer.create 42 in
  let flush_segment acc =
    let segment = Buffer.contents b in
    Buffer.clear b; if segment = "" then acc else segment :: acc
  in
  let seg = Uuseg.create (seg :> Uuseg.boundary) in
  let rec add acc v = match Uuseg.add seg v with
  | `Uchar u -> Uutf.Buffer.add_utf_8 b u; add acc `Await
  | `Boundary -> add (flush_segment acc) `Await
  | `Await | `End -> acc
  in
  (* Malformed input is replaced by U+FFFD before segmentation. *)
  let uchar acc _ = function
  | `Uchar _ as u -> add acc u
  | `Malformed _ -> add acc (`Uchar Uutf.u_rep)
  in
  List.rev (flush_segment (add (Uutf.String.fold_utf_8 uchar [] s) `End))
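The derivation from Uuseg_string.fold_utf_8 mentioned above can be sketched as follows. This assumes that fold_utf_8 takes the boundary kind, a folder function receiving the accumulator and each UTF-8 encoded segment, an initial accumulator and the string; the name utf_8_segments_alt is hypothetical.

```ocaml
(* Same result as utf_8_segments, delegating decoding and segmentation
   to Uuseg_string. The signature of fold_utf_8 is assumed here. *)
let utf_8_segments_alt seg s =
  (* The folder conses each segment; the list is reversed at the end. *)
  List.rev (Uuseg_string.fold_utf_8 seg (fun acc segment -> segment :: acc) [] s)
```

This variant avoids handling the `Await protocol and the intermediate buffer by hand, at the cost of depending on the Uuseg_string module.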