Module Jsonm
Non-blocking streaming JSON codec.
Jsonm is a non-blocking streaming codec to decode and encode the JSON data format. It can process JSON text without blocking on IO and without a complete in-memory representation of the data.
The uncut codec also processes whitespace and (non-standard) JSON with JavaScript comments.
Consult the data model, limitations and examples of use.
v1.0.1 - homepage
References
- T. Bray Ed. The JavaScript Object Notation (JSON) Data Interchange Format, 2014
JSON data model
type lexeme = [ `Null | `Bool of bool | `String of string | `Float of float | `Name of string | `As | `Ae | `Os | `Oe ]
The type for JSON lexemes. `As and `Ae start and end arrays and `Os and `Oe start and end objects. `Name is for the member names of objects.
A well-formed sequence of lexemes belongs to the language of the json grammar:
  json = value
object = `Os *member `Oe
member = (`Name s) value
 array = `As *value `Ae
 value = `Null / `Bool b / `Float f / `String s / object / array
A decoder returns only well-formed sequences of lexemes or `Errors are returned. The UTF-8, UTF-16, UTF-16LE and UTF-16BE encoding schemes are supported. The strings of decoded `Name and `String lexemes are however always UTF-8 encoded. In these strings, characters originally escaped in the input are in their unescaped representation.
An encoder accepts only well-formed sequences of lexemes or Invalid_argument is raised. Only the UTF-8 encoding scheme is supported. The strings of encoded `Name and `String lexemes are assumed to be immutable and must be UTF-8 encoded, this is not checked by the module. In these strings, the delimiter characters U+0022 and U+005C ('"', '\') as well as the control characters U+0000-U+001F are automatically escaped by the encoders, as mandated by the standard.
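The grammar above can be checked mechanically. The following is a minimal sketch, assuming nothing beyond the OCaml standard library: the lexeme type is copied from the signature so the code compiles without Jsonm, and well_formed is a hypothetical helper that is not part of the module.

```ocaml
(* Copied from the signature above so the sketch is self-contained. *)
type lexeme =
  [ `Null | `Bool of bool | `String of string | `Float of float
  | `Name of string | `As | `Ae | `Os | `Oe ]

(* [well_formed ls] is [true] iff [ls] is exactly one value of the json
   grammar. Hypothetical helper, not part of Jsonm. *)
let well_formed (ls : lexeme list) : bool =
  (* Each function consumes what it recognizes and returns the rest. *)
  let rec value = function
    | (`Null | `Bool _ | `Float _ | `String _) :: rest -> Some rest
    | `As :: rest -> array rest
    | `Os :: rest -> obj rest
    | _ -> None
  and array = function
    | `Ae :: rest -> Some rest
    | ls -> (match value ls with Some rest -> array rest | None -> None)
  and obj = function
    | `Oe :: rest -> Some rest
    | `Name _ :: ls ->
        (match value ls with Some rest -> obj rest | None -> None)
    | _ -> None
  in
  match value ls with Some [] -> true | _ -> false
```

An encoder would raise Invalid_argument on the sequences this function rejects, for example an object member without a `Name or an array that is never closed.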
val pp_lexeme : Stdlib.Format.formatter -> [< lexeme ] -> unit
pp_lexeme ppf l prints an unspecified non-JSON representation of l on ppf.
Decode
type error = [
  | `Illegal_BOM
  | `Illegal_escape of [ `Not_hex_uchar of Stdlib.Uchar.t | `Not_esc_uchar of Stdlib.Uchar.t | `Not_lo_surrogate of int | `Lone_lo_surrogate of int | `Lone_hi_surrogate of int ]
  | `Illegal_string_uchar of Stdlib.Uchar.t
  | `Illegal_bytes of string
  | `Illegal_literal of string
  | `Illegal_number of string
  | `Unclosed of [ `As | `Os | `String | `Comment ]
  | `Expected of [ `Comment | `Value | `Name | `Name_sep | `Json | `Eoi | `Aval of bool | `Omem of bool ]
]
val pp_error : Stdlib.Format.formatter -> [< error ] -> unit
pp_error ppf e prints an unspecified UTF-8 representation of e on ppf.
type src = [ `Channel of Stdlib.in_channel | `String of string | `Manual ]
The type for input sources. With a `Manual source the client must provide input with Manual.src.
val decoder : ?encoding:[< encoding ] -> [< src ] -> decoder
decoder encoding src is a JSON decoder that inputs from src. encoding specifies the character encoding of the data. If unspecified, the encoding is guessed as suggested by the old RFC4627 standard.
val decode : decoder -> [> `Await | `Lexeme of lexeme | `End | `Error of error ]
decode d is:
- `Await if d has a `Manual source and awaits more input. The client must use Manual.src to provide it.
- `Lexeme l if a lexeme l was decoded.
- `End if the end of input was reached.
- `Error e if a decoding error occurred. If the client is interested in a best-effort decoding it can still continue to decode after an error (see Error recovery) although the resulting sequence of `Lexemes is undefined and may not be well-formed.
The Uncut.pp_decode function can be used to inspect decode results.
Note. Repeated invocation always eventually returns `End, even in case of errors.
val decoded_range : decoder -> (int * int) * (int * int)
decoded_range d is the range of characters spanning the last `Lexeme or `Error (or `White or `Comment for an Uncut.decode) decoded by d. A pair of line and column numbers respectively one and zero based.
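Since lines are one-based and columns zero-based, a range often needs adjusting before it is shown to a user. The following is a minimal sketch using only the standard library; range_to_string and its col_offset parameter are hypothetical names, not part of Jsonm.

```ocaml
(* [range_to_string ?col_offset r] renders a range shaped like the result
   of decoded_range as "l0:c0-l1:c1". [col_offset] is added to both column
   numbers; pass 1 for tools that expect one-based columns.
   Hypothetical helper, not part of Jsonm. *)
let range_to_string ?(col_offset = 0)
    (((l0, c0), (l1, c1)) : (int * int) * (int * int)) : string =
  Printf.sprintf "%d:%d-%d:%d" l0 (c0 + col_offset) l1 (c1 + col_offset)
```

With a decoder d at hand, the intended use would be range_to_string (Jsonm.decoded_range d) when reporting an `Error.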
Encode
type dst = [ `Channel of Stdlib.out_channel | `Buffer of Stdlib.Buffer.t | `Manual ]
The type for output destinations. With a `Manual destination the client must provide output storage with Manual.dst.
val encoder : ?minify:bool -> [< dst ] -> encoder
encoder minify dst is an encoder that outputs to dst. If minify is true (default) the output is made as compact as possible, otherwise the output is indented. If you want better control over whitespace use minify = true and Uncut.encode.
val encode : encoder -> [< `Await | `End | `Lexeme of lexeme ] -> [ `Ok | `Partial ]
encode e v is:
- `Partial iff e has a `Manual destination and needs more output storage. The client must use Manual.dst to provide a new buffer and then call encode with `Await until `Ok is returned.
- `Ok when the encoder is ready to encode a new `Lexeme or `End.
For `Manual destinations, encoding `End always returns `Partial; the client should as usual use Manual.dst and continue with `Await until `Ok is returned, at which point Manual.dst_rem e is guaranteed to be the size of the last provided buffer (i.e. nothing was written).
Raises. Invalid_argument if a non well-formed sequence of lexemes is encoded or if `Lexeme or `End is encoded after a `Partial encode.
val encoder_minify : encoder -> bool
encoder_minify e is true if e's output is minified.
Manual sources and destinations
module Manual : sig ... end
Manual input sources and output destinations.
Uncut codec
module Uncut : sig ... end
Codec with comments and whitespace.
Limitations
Decode
Decoders parse valid JSON with the following limitations:
- JSON numbers are represented with OCaml float values. This means that integers can only be represented exactly in the interval [-2^53;2^53]. This is equivalent to the constraints JavaScript has.
- A superset of JSON numbers is parsed. After having seen a minus or a digit, including zero, Pervasives.float_of_string is used. In particular this parses numbers with leading zeros, which are specifically prohibited by the standard.
- Strings returned by `String, `Name, `White and `Comment are limited by Sys.max_string_length. There is no built-in protection against the fact that the internal OCaml Buffer.t value may raise Failure on Jsonm.decode. This should however only be a problem on 32-bit platforms if your strings are greater than 16 MB.
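The first two limitations can be observed with the standard library alone. The sketch below assumes OCaml 4.08 or later for Float.is_integer; max_exact and exact_int_float are hypothetical names, not part of Jsonm.

```ocaml
(* 2^53 bounds the interval in which every integer has an exact float
   representation. *)
let max_exact = 2. ** 53.

(* [exact_int_float f] is [true] iff [f] is an integer in [-2^53;2^53].
   Hypothetical helper, not part of Jsonm. *)
let exact_int_float (f : float) : bool =
  Float.is_integer f && Float.abs f <= max_exact

let () =
  (* Just above 2^53 adjacent integers collapse onto the same float. *)
  assert (max_exact +. 1. = max_exact);
  (* float_of_string accepts leading zeros, which the standard prohibits. *)
  assert (float_of_string "0012" = 12.)
```

A client that needs exact integers, for instance identifiers, could apply a check like exact_int_float to decoded `Float lexemes and reject the rest.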
Position tracking assumes that each decoded Unicode scalar value has a column width of 1. The same assumption may not be made by the display program (e.g. for emacs' compilation mode you need to set compilation-error-screen-columns to nil).
The newlines LF (U+000A), CR (U+000D), and CRLF are all normalized to LF internally. This may have an impact in some corner `Error cases. For example the invalid escape sequence <U+005C,U+000D> in a string will be reported as being `Illegal_escape (`Not_esc_uchar 0x000A).
Encode
Encoders produce valid JSON provided the client ensures that the following holds.
- All the strings given to the encoder must be valid UTF-8 and immutable. Characters that need to be escaped are automatically escaped by Jsonm.
- `Float lexemes must not be Pervasives.nan, Pervasives.infinity or Pervasives.neg_infinity. They are encoded with the format string "%.16g", which allows round-tripping all the integers that can be precisely represented in OCaml float values, i.e. the integers in the interval [-2^53;2^53]. This is equivalent to the constraints JavaScript has.
- If the uncut codec is used `White must be made of JSON whitespace and `Comment must never be encoded.
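The constraints on `Float lexemes can be checked before encoding. This is a minimal sketch using only the standard library (Float.is_finite needs OCaml 4.07 or later); encodable_float, fmt and roundtrips are hypothetical names, and fmt only mirrors the "%.16g" format string quoted above rather than calling into Jsonm.

```ocaml
(* [encodable_float f] is [true] iff [f] is neither nan nor an infinity.
   Hypothetical helper, not part of Jsonm. *)
let encodable_float (f : float) : bool = Float.is_finite f

(* [fmt f] formats [f] with the "%.16g" format string mentioned above. *)
let fmt (f : float) : string = Printf.sprintf "%.16g" f

(* [roundtrips f] is [true] iff parsing [fmt f] gives back [f]. *)
let roundtrips (f : float) : bool = float_of_string (fmt f) = f

let () =
  (* Integers up to 2^53 survive the round trip. *)
  assert (roundtrips 9007199254740992.);
  (* Some non-integer values need 17 significant digits and do not. *)
  assert (not (roundtrips (0.1 +. 0.2)))
```

The guarantee stated above covers integers only; the last assertion shows a non-integer value that 16 significant digits do not reproduce exactly.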
Error recovery
After a decoding error, if best-effort decoding is performed, the following happens before continuing:
- `Illegal_BOM, the initial BOM is skipped.
- `Illegal_bytes, `Illegal_escape, `Illegal_string_uchar, a Unicode replacement character (U+FFFD) is substituted for the illegal sequence.
- `Illegal_literal, `Illegal_number, the corresponding `Lexeme is skipped.
- `Expected r, input is discarded until a synchronizing lexeme that depends on r is found.
- `Unclosed, the end of input is reached, further decodes will be `End.
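Best-effort decoding amounts to calling decode again after each `Error. The following is a minimal sketch, assuming the jsonm library is installed; decode_best_effort is a hypothetical helper, not part of the module. It relies on the note that repeated invocation always eventually returns `End.

```ocaml
(* Requires the jsonm library. [decode_best_effort s] decodes [s] to the
   end and returns the lexemes seen together with every error and its
   range. Hypothetical helper, not part of Jsonm. *)
let decode_best_effort (s : string) =
  let d = Jsonm.decoder (`String s) in
  let rec loop ls errs = match Jsonm.decode d with
    | `Lexeme l -> loop (l :: ls) errs
    | `Error e -> loop ls ((Jsonm.decoded_range d, e) :: errs)
    | `End -> (List.rev ls, List.rev errs)
    | `Await -> assert false (* only `Manual sources await input *)
  in
  loop [] []
```

When the error list is not empty, the returned lexemes may not form a well-formed sequence and should not be handed to an encoder without checking.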
Examples
Trip
The result of trip src dst has the JSON from src written on dst.
let trip ?encoding ?minify
    (src : [`Channel of in_channel | `String of string])
    (dst : [`Channel of out_channel | `Buffer of Buffer.t])
  =
  let rec loop d e = match Jsonm.decode d with
  | `Lexeme _ as v -> ignore (Jsonm.encode e v); loop d e
  | `End -> ignore (Jsonm.encode e `End); `Ok
  | `Error err -> `Error (Jsonm.decoded_range d, err)
  | `Await -> assert false
  in
  let d = Jsonm.decoder ?encoding src in
  let e = Jsonm.encoder ?minify dst in
  loop d e
Using the `Manual interface, trip_fd does the same but between Unix file descriptors.
let trip_fd ?encoding ?minify
    (fdi : Unix.file_descr)
    (fdo : Unix.file_descr)
  =
  let rec encode fd s e v = match Jsonm.encode e v with `Ok -> ()
  | `Partial ->
      (* Write out the filled part of the buffer, retrying on EINTR. *)
      let rec unix_write fd s j l =
        let rec write fd s j l = try Unix.single_write fd s j l with
        | Unix.Unix_error (Unix.EINTR, _, _) -> write fd s j l
        in
        let wc = write fd s j l in
        if wc < l then unix_write fd s (j + wc) (l - wc) else ()
      in
      unix_write fd s 0 (Bytes.length s - Jsonm.Manual.dst_rem e);
      (* Hand the whole buffer back to the encoder and resume. *)
      Jsonm.Manual.dst e s 0 (Bytes.length s);
      encode fd s e `Await
  in
  let rec loop fdi fdo ds es d e = match Jsonm.decode d with
  | `Lexeme _ as v -> encode fdo es e v; loop fdi fdo ds es d e
  | `End -> encode fdo es e `End; `Ok
  | `Error err -> `Error (Jsonm.decoded_range d, err)
  | `Await ->
      (* Refill the decoder; a read of 0 bytes signals the end of input. *)
      let rec unix_read fd s j l = try Unix.read fd s j l with
      | Unix.Unix_error (Unix.EINTR, _, _) -> unix_read fd s j l
      in
      let rc = unix_read fdi ds 0 (Bytes.length ds) in
      Jsonm.Manual.src d ds 0 rc; loop fdi fdo ds es d e
  in
  let ds = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
  let es = Bytes.create 65536 (* UNIX_BUFFER_SIZE in 4.0.0 *) in
  let d = Jsonm.decoder ?encoding `Manual in
  let e = Jsonm.encoder ?minify `Manual in
  Jsonm.Manual.dst e es 0 (Bytes.length es);
  loop fdi fdo ds es d e
Member selection
The result of memsel names src is the list of string values of members of src that have their name in names. In this example, decoding errors are silently ignored.
let memsel ?encoding names
    (src : [`Channel of in_channel | `String of string])
  =
  let rec loop acc names d = match Jsonm.decode d with
  | `Lexeme (`Name n) when List.mem n names ->
      begin match Jsonm.decode d with
      | `Lexeme (`String s) -> loop (s :: acc) names d
      | _ -> loop acc names d
      end
  | `Lexeme _ | `Error _ -> loop acc names d
  | `End -> List.rev acc
  | `Await -> assert false
  in
  loop [] names (Jsonm.decoder ?encoding src)
Generic JSON representation
A generic OCaml representation of JSON text is the following one.
type json =
  [ `Null | `Bool of bool | `Float of float | `String of string
  | `A of json list | `O of (string * json) list ]
The result of json_of_src src is the JSON text from src in this representation. The function is tail recursive.
exception Escape of ((int * int) * (int * int)) * Jsonm.error
let json_of_src ?encoding
    (src : [`Channel of in_channel | `String of string])
  =
  let dec d = match Jsonm.decode d with
  | `Lexeme l -> l
  | `Error e -> raise (Escape (Jsonm.decoded_range d, e))
  | `End | `Await -> assert false
  in
  let rec value v k d = match v with
  | `Os -> obj [] k d | `As -> arr [] k d
  | `Null | `Bool _ | `String _ | `Float _ as v -> k v d
  | _ -> assert false
  and arr vs k d = match dec d with
  | `Ae -> k (`A (List.rev vs)) d
  | v -> value v (fun v -> arr (v :: vs) k) d
  and obj ms k d = match dec d with
  | `Oe -> k (`O (List.rev ms)) d
  | `Name n -> value (dec d) (fun v -> obj ((n, v) :: ms) k) d
  | _ -> assert false
  in
  let d = Jsonm.decoder ?encoding src in
  try `JSON (value (dec d) (fun v _ -> v) d) with
  | Escape (r, e) -> `Error (r, e)
The result of json_to_dst dst json has the JSON text json written on dst. The function is tail recursive.
let json_to_dst ~minify
    (dst : [`Channel of out_channel | `Buffer of Buffer.t ])
    (json : json)
  =
  let enc e l = ignore (Jsonm.encode e (`Lexeme l)) in
  let rec value v k e = match v with
  | `A vs -> arr vs k e
  | `O ms -> obj ms k e
  | `Null | `Bool _ | `Float _ | `String _ as v -> enc e v; k e
  and arr vs k e = enc e `As; arr_vs vs k e
  and arr_vs vs k e = match vs with
  | v :: vs' -> value v (arr_vs vs' k) e
  | [] -> enc e `Ae; k e
  and obj ms k e = enc e `Os; obj_ms ms k e
  and obj_ms ms k e = match ms with
  | (n, v) :: ms -> enc e (`Name n); value v (obj_ms ms k) e
  | [] -> enc e `Oe; k e
  in
  let e = Jsonm.encoder ~minify dst in
  let finish e = ignore (Jsonm.encode e `End) in
  value json finish e