Module B0_std.Conv
Value converters.
A value converter describes how to encode and decode OCaml values to a binary representation and to a textual, human specifiable, s-expression based representation.
Notation. Given a value v and a converter c we write [v]c for the textual encoding of v according to c.
Low-level encoders and decoders
exception
Error of int * int * string
The exception for conversion errors. This exception is raised both by encoders and decoders with raise_notrace. The integers indicate a byte index range in the input on decoding errors; they are meaningless on encoding errors.
Note. This exception is used for defining converters. High-level converting functions do not raise but use result values to report errors.
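The split between raising low-level codecs and result-returning high-level functions can be sketched as follows. This is a self-contained illustration that does not use the module itself: the local Error exception mirrors the one documented above, while dec_bool, of_txt and the error message format are hypothetical names and choices made for the example.

```ocaml
(* Local stand-in for the documented exception: a byte index range and a
   message. Not the module's own definition. *)
exception Error of int * int * string

(* A low-level decoder (hypothetical): raises Error without a backtrace. *)
let dec_bool (s : string) : bool =
  match s with
  | "true" -> true
  | "false" -> false
  | _ ->
      let last = max 0 (String.length s - 1) in
      raise_notrace (Error (0, last, "expected true or false"))

(* A high-level wrapper (hypothetical): catches Error and reports it as a
   result value. Result.error builds the result-type Error case, which the
   exception declaration above shadows. *)
let of_txt dec s =
  try Ok (dec s) with
  | Error (first, last, msg) ->
      Result.error (Printf.sprintf "bytes %d-%d: %s" first last msg)
```

Callers of of_txt never see the exception; they pattern match on the returned result instead.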
module Bin : sig ... end
Binary codecs.
module Txt : sig ... end
Textual codecs.
Converters
val v : kind:string -> docvar:string -> 'a Bin.enc -> 'a Bin.dec -> 'a Txt.enc -> 'a Txt.dec -> 'a t
v ~kind ~docvar bin_enc bin_dec txt_enc txt_dec is a value converter using bin_enc, bin_dec, txt_enc, txt_dec for binary and textual conversions. kind documents the kind of converted value and docvar a meta-variable used in documentation to stand for these values (use uppercase e.g. INT for integers).
val kind : 'a t -> string
kind c is the documented kind of value converted by c.
val docvar : 'a t -> string
docvar c is the documentation meta-variable for values converted by c.
val with_kind : ?docvar:string -> string -> 'a t -> 'a t
with_kind ~docvar k c is c with kind k and documentation meta-variable docvar (defaults to docvar c).
val with_docvar : string -> 'a t -> 'a t
with_docvar docvar c is c with documentation meta-variable docvar.
val with_conv : kind:string -> docvar:string -> ('b -> 'a) -> ('a -> 'b) -> 'a t -> 'b t
with_conv ~kind ~docvar to_t of_t t_conv is a converter for type 'b given a converter t_conv for type 'a and conversion functions from and to type 'b. The conversion functions should raise Error if they are not total.
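To illustrate how the two conversion functions are threaded through an existing converter, here is a minimal sketch built on a hypothetical, text-only model of a converter. The record type conv, the local Error exception, int_conv and percent_conv are all assumptions made for the example; the real converter type is abstract and also carries binary codecs.

```ocaml
exception Error of int * int * string

(* Hypothetical text-only model of a converter. *)
type 'a conv =
  { kind : string; docvar : string;
    enc : 'a -> string; dec : string -> 'a }

(* to_t maps 'b values to the 'a values the base converter understands,
   of_t maps decoded 'a values back to 'b. *)
let with_conv ~kind ~docvar to_t of_t c =
  { kind; docvar;
    enc = (fun b -> c.enc (to_t b));
    dec = (fun s -> of_t (c.dec s)) }

let int_conv =
  let dec s = match int_of_string_opt s with
    | Some i -> i
    | None ->
        raise_notrace (Error (0, max 0 (String.length s - 1), "not an integer"))
  in
  { kind = "integer"; docvar = "INT"; enc = string_of_int; dec }

type percent = Percent of int

(* of_int is not total, so it raises Error on out of range integers. The
   byte range given here is a placeholder. *)
let percent_conv =
  let to_int (Percent p) = p in
  let of_int i =
    if 0 <= i && i <= 100 then Percent i
    else raise_notrace (Error (0, 0, "percentage out of range"))
  in
  with_conv ~kind:"percentage" ~docvar:"PERCENT" to_int of_int int_conv
```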
Converting
val to_bin : ?buf:Stdlib.Buffer.t -> 'a t -> 'a -> (string, string) Stdlib.result
to_bin c v binary encodes v using c. buf is used as the internal buffer if specified (it is cleared with Buffer.clear before use).
val of_bin : 'a t -> string -> ('a, string) Stdlib.result
of_bin c s binary decodes a value from s using c.
val to_txt : ?buf:Stdlib.Buffer.t -> 'a t -> 'a -> (string, string) Stdlib.result
to_txt c v textually encodes v using c. buf is used as the internal buffer if specified (it is cleared with Buffer.clear before use).
val of_txt : 'a t -> string -> ('a, string) Stdlib.result
of_txt c s textually decodes a value from s using c.
Predefined converters
val bool : bool t
bool converts booleans. Textual conversions represent booleans with the atoms true and false.
val byte : int t
byte converts a byte. Textual decoding parses an atom according to the syntax of int_of_string. Conversions fail if the integer is not in the range [0;255].
val int : int t
int converts signed OCaml integers. Textual decoding parses an atom according to the syntax of int_of_string. Conversions fail if the integer is not in the range [-2^(Sys.int_size-1); 2^(Sys.int_size-1)-1].
Warning. A large integer encoded on a 64-bit platform may fail to decode on a 32-bit platform, use int31 or int64 if this is a problem.
val int31 : int t
int31 converts signed 31-bit integers. Textual decoding parses an atom according to the syntax of int_of_string. Conversions fail if the integer is not in the range [-2^30; 2^30-1].
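As a small worked check of the signed 31-bit range, that is -2^30 up to and including 2^30-1, the bounds can be computed with shifts. The names int31_min, int31_max and fits_int31 are hypothetical, and the literals in the comments assume a 64-bit platform.

```ocaml
(* Bounds of the signed 31-bit range. *)
let int31_max = (1 lsl 30) - 1   (* 1073741823 *)
let int31_min = - (1 lsl 30)     (* -1073741824 *)

(* true iff i is representable as a signed 31-bit integer. *)
let fits_int31 i = int31_min <= i && i <= int31_max
```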
val int32 : int32 t
int32 converts signed 32-bit integers. Textual decoding parses an atom according to the syntax of Int32.of_string. Conversions fail if the integer is not in the range [-2^31; 2^31-1].
val int64 : int64 t
int64 converts signed 64-bit integers. Textual decoding parses an atom according to the syntax of Int64.of_string. Conversions fail if the integer is not in the range [-2^63; 2^63-1].
val float : float t
float converts floating point numbers. Textual decoding parses an atom using float_of_string.
val string_bytes : string t
string_bytes converts OCaml strings as byte sequences. Textual conversion represents the bytes of s with the s-expression (hex [s]hex) with [s]hex the atom resulting from String.Ascii.to_hex s. See also atom and string_only.
Warning. A large string encoded on a 64-bit platform may fail to decode on a 32-bit platform.
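The shape of the textual form (hex followed by the hex digits of the bytes) can be sketched with the standard library only. The helpers to_hex and string_bytes_to_txt are hypothetical. Whether the library's String.Ascii.to_hex emits lowercase or uppercase digits is not stated here, so the lowercase output is an assumption, and the sketch does not address how an empty string is represented.

```ocaml
(* Hex-encode every byte of s as two lowercase digits (assumed casing). *)
let to_hex s =
  let b = Buffer.create (2 * String.length s) in
  String.iter
    (fun c -> Buffer.add_string b (Printf.sprintf "%02x" (Char.code c))) s;
  Buffer.contents b

(* Wrap the digits in the (hex ...) s-expression described above. *)
let string_bytes_to_txt s = Printf.sprintf "(hex %s)" (to_hex s)
```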
val atom : string t
atom converts strings assumed to represent UTF-8 encoded Unicode text; the encoding is not checked. Textual conversions represent strings as atoms. See also string_bytes and string_only.
Warning. A large atom encoded on a 64-bit platform may fail to decode on a 32-bit platform.
val option : ?kind:string -> ?docvar:string -> 'a t -> 'a option t
option c converts optional values converted with c. Textual conversions represent None with the atom none and Some v with the s-expression (some [v]c).
val some : 'a t -> 'a option t
some c wraps decodes of c with Option.some. Warning. None can't be converted in either direction, use option for this.
val result : ?kind:string -> ?docvar:string -> 'a t -> 'b t -> ('a, 'b) Stdlib.result t
result ok error converts result values with ok and error. Textual conversions represent Ok v with the s-expression (ok [v]ok) and Error e with (error [e]error).
val list : ?kind:string -> ?docvar:string -> 'a t -> 'a list t
list c converts a list of values converted with c. Textual conversions represent a list [v0; ...; vn] by the s-expression ([v0]c ... [vn]c).
Warning. A large list encoded on a 64-bit platform may fail to decode on a 32-bit platform.
val array : ?kind:string -> ?docvar:string -> 'a t -> 'a array t
array c is like list but converts arrays.
Warning. A large array encoded on a 64-bit platform may fail to decode on a 32-bit platform.
val pair : ?kind:string -> ?docvar:string -> 'a t -> 'b t -> ('a * 'b) t
pair c0 c1 converts pairs of values converted with c0 and c1. Textual conversions represent a pair (v0, v1) by the s-expression ([v0]c0 [v1]c1).
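The textual forms described for option, result, list and pair compose by nesting. The following sketch models only the encoding direction with plain functions from values to strings; the function names are hypothetical and do not belong to the module.

```ocaml
(* Text-only encoders; each takes the encoder(s) of its component(s). *)
let option_to_txt enc = function
  | None -> "none"
  | Some v -> Printf.sprintf "(some %s)" (enc v)

let result_to_txt ok error = function
  | Ok v -> Printf.sprintf "(ok %s)" (ok v)
  | Error e -> Printf.sprintf "(error %s)" (error e)

(* Elements are separated by a single space; the empty list gives (). *)
let list_to_txt enc vs =
  Printf.sprintf "(%s)" (String.concat " " (List.map enc vs))

let pair_to_txt enc0 enc1 (v0, v1) =
  Printf.sprintf "(%s %s)" (enc0 v0) (enc1 v1)
```

For instance, pair_to_txt string_of_int (list_to_txt string_of_bool) applied to (1, [true; false]) yields (1 (true false)).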
val enum : kind:string -> docvar:string -> ?eq:('a -> 'a -> bool) -> (string * 'a) list -> 'a t
enum ~kind ~docvar ~eq vs converts values present in vs. eq is used to test equality among values (defaults to ( = )). The list length should not exceed 256. Textual conversions use the strings of the pairs in vs as atoms to encode the corresponding value.
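A rough sketch of how an association list can drive both directions of an enumeration: encoding searches for the value with eq, decoding looks the atom up by name. The function enum_codec, the local Error exception and the color type are hypothetical; the byte range in encoding errors is a placeholder.

```ocaml
exception Error of int * int * string

(* Returns a textual encoder and decoder for the values listed in vs. *)
let enum_codec ?(eq = ( = )) (vs : (string * 'a) list) =
  let enc v =
    match List.find_opt (fun (_, v') -> eq v v') vs with
    | Some (atom, _) -> atom
    | None -> raise_notrace (Error (0, 0, "value not in enumeration"))
  in
  let dec atom =
    match List.assoc_opt atom vs with
    | Some v -> v
    | None ->
        let last = max 0 (String.length atom - 1) in
        raise_notrace (Error (0, last, "unknown atom: " ^ atom))
  in
  enc, dec

type color = Red | Green

let color_enc, color_dec = enum_codec ["red", Red; "green", Green]
```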
Non-composable predefined converters
Textual conversions performed by the following converters cannot be composed; they do not respect the syntax of s-expression atoms. They can be used for direct conversions when one does not want to be subject to the syntactic constraints of s-expressions. For example when parsing command line interface arguments or environment variables.
val string_only : string t
string_only converts OCaml strings. Textual conversion is not composable, use string_bytes or atom instead. Textual encoding passes the string as is, and decoding ignores the starting position and returns the whole input string.
Warning. A large string encoded on a 64-bit platform may fail to decode on a 32-bit platform.
S-expressions syntax
S-expressions are a general way of describing data via atoms (sequences of characters) and lists delimited by parentheses. Here are a few examples of s-expressions and their syntax:
this-is-an-atom
(this is a list of seven atoms)
(this list contains (a nested) list)
; This is a comment
; Anything that follows a semi-colon is ignored until the next line
(this list ; has three atoms and an embedded ()
 comment)
"this is a quoted atom, it can contain spaces ; and ()"
"quoted atoms can be split ^
 across lines or contain Unicode esc^u{0061}pes"
We define the syntax of s-expressions over a sequence of Unicode characters in which all US-ASCII control characters except whitespace are forbidden in unescaped form.
Note. This module assumes the sequence of Unicode characters is encoded as UTF-8 although it doesn't check this for now.
S-expressions and sequences thereof
An s-expression is either an atom or a list of s-expressions interspaced with whitespace and comments. A sequence of s-expressions is a succession of s-expressions interspaced with whitespace and comments.
These elements are informally described below and finally made precise via an ABNF grammar.
Whitespace
Whitespace is a sequence of whitespace characters, namely, space ' ' (U+0020), tab '\t' (U+0009), line feed '\n' (U+000A), vertical tab '\x0B' (U+000B), form feed '\x0C' (U+000C) and carriage return '\r' (U+000D).
Comments
Unless it occurs inside an atom in quoted form (see below) anything that follows a semicolon ';' (U+003B) is ignored until the next end of line, that is either a line feed '\n' (U+000A), a carriage return '\r' (U+000D) or a carriage return and a line feed "\r\n" (<U+000D,U+000A>).
(this is not a comment) ; This is a comment
(this is not a comment)
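The whitespace and comment rules can be sketched as a scanner that advances past both. This is an illustration only: is_ws and skip_ws_and_comments are hypothetical names, the scanner works on bytes, and it assumes the starting index is not inside a quoted atom.

```ocaml
(* The six whitespace characters: space, tab, line feed, vertical tab,
   form feed and carriage return. *)
let is_ws = function
  | ' ' | '\t' | '\n' | '\x0B' | '\x0C' | '\r' -> true
  | _ -> false

(* Index of the first character at or after i that is neither whitespace
   nor part of a comment; String.length s if there is none. *)
let rec skip_ws_and_comments s i =
  let len = String.length s in
  if i >= len then len else
  match s.[i] with
  | c when is_ws c -> skip_ws_and_comments s (i + 1)
  | ';' ->
      (* Skip to the end of line or end of input; the newline itself is
         then consumed as whitespace. *)
      let rec eol j =
        if j >= len || s.[j] = '\n' || s.[j] = '\r' then j else eol (j + 1)
      in
      skip_ws_and_comments s (eol (i + 1))
  | _ -> i
```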
Atoms
An atom represents ground data as a string of Unicode characters. It can, via escapes, represent any sequence of Unicode characters, including control characters and U+0000. It cannot represent an arbitrary byte sequence except via a client-defined encoding convention (e.g. Base64 or hex encoding).
Atoms can be specified either via an unquoted or a quoted form. In unquoted form the atom is written without delimiters. In quoted form the atom is delimited by double quote '\"' (U+0022) characters; this form is mandatory for atoms that contain whitespace, parentheses '(' ')', semicolons ';', quotes '\"', carets '^' or characters that need to be escaped.
abc ; a token for the atom "abc"
"abc" ; a quoted token for the atom "abc"
"abc; (d" ; a quoted token for the atom "abc; (d"
"" ; the quoted token for the atom ""
For atoms that do not need to be quoted, both their unquoted and quoted form represent the same string; e.g. the string "true" can be represented both by the atoms true and "true". The empty string can only be represented in quoted form by "".
In quoted form escapes are introduced by a caret '^'. Double quotes '\"' and carets '^' must always be escaped.
"^^" ; atom for ^
"^n" ; atom for line feed U+000A
"^u{0000}" ; atom for U+0000
"^"^u{1F42B}^"" ; atom with a quote, U+1F42B and a quote
The following escape sequences are recognized:
- "^ " (<U+005E,U+0020>) for space ' ' (U+0020)
- "^\"" (<U+005E,U+0022>) for double quote '\"' (U+0022), mandatory
- "^^" (<U+005E,U+005E>) for caret '^' (U+005E), mandatory
- "^n" (<U+005E,U+006E>) for line feed '\n' (U+000A)
- "^r" (<U+005E,U+0072>) for carriage return '\r' (U+000D)
- "^u{X}" where X is 1 to at most 6 upper or lower case hexadecimal digits standing for the corresponding Unicode character U+X.
- Any other character following a caret, except line feed '\n' (U+000A) or carriage return '\r' (U+000D), is an illegal sequence of characters. In those two cases the atom continues on the next line and whitespace is ignored.
An atom in quoted form can be split across lines by using a caret '^' (U+005E) followed by a line feed '\n' (U+000A) or a carriage return '\r' (U+000D); any subsequent whitespace is ignored.
"^
  a^
  ^ " ; the atom "a "
The character '^' (U+005E) is used as an escape character rather than the usual '\\' (U+005C) in order to make quoted Windows® file paths decently readable and, not the least, utterly please DKM.
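The escape and line continuation rules for quoted atoms can be sketched as a decoder for the characters found between the two double quotes. This is a standard-library-only illustration, not the module's implementation: unescape, is_ws and the local Error exception are hypothetical, the input is processed as bytes, and UTF-8 validity of unescaped characters is not checked.

```ocaml
exception Error of int * int * string

let is_ws = function
  | ' ' | '\t' | '\n' | '\x0B' | '\x0C' | '\r' -> true
  | _ -> false

(* Decode the contents of a quoted atom (without its delimiting quotes). *)
let unescape (s : string) : string =
  let len = String.length s in
  let b = Buffer.create len in
  let err i j msg = raise_notrace (Error (i, j, msg)) in
  let rec skip_ws i = if i < len && is_ws s.[i] then skip_ws (i + 1) else i in
  let rec loop i =
    if i >= len then Buffer.contents b else
    match s.[i] with
    | '"' -> err i i "unescaped double quote"
    | '^' when i + 1 >= len -> err i i "truncated escape"
    | '^' ->
        begin match s.[i + 1] with
        | ' ' | '"' | '^' as c -> Buffer.add_char b c; loop (i + 2)
        | 'n' -> Buffer.add_char b '\n'; loop (i + 2)
        | 'r' -> Buffer.add_char b '\r'; loop (i + 2)
        | '\n' | '\r' -> loop (skip_ws (i + 2)) (* line continuation *)
        | 'u' -> unicode i
        | _ -> err i (i + 1) "illegal escape"
        end
    | c -> Buffer.add_char b c; loop (i + 1)
  and unicode i =
    (* s.[i] is '^' and s.[i + 1] is 'u'; expect {X}, X being 1 to 6 hex
       digits denoting a Unicode scalar value. *)
    if i + 2 >= len || s.[i + 2] <> '{' then err i (i + 1) "expected {" else
    match String.index_from_opt s (i + 3) '}' with
    | None -> err i (len - 1) "unclosed Unicode escape"
    | Some close ->
        let hex = String.sub s (i + 3) (close - i - 3) in
        let n = String.length hex in
        let is_hex = function
          | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
          | _ -> false
        in
        let rec all_hex k = k >= n || (is_hex hex.[k] && all_hex (k + 1)) in
        if n < 1 || n > 6 || not (all_hex 0)
        then err i close "illegal Unicode escape" else
        let cp = int_of_string ("0x" ^ hex) in
        if not (Uchar.is_valid cp)
        then err i close "not a Unicode scalar value"
        else (Buffer.add_utf_8_uchar b (Uchar.of_int cp); loop (close + 1))
  in
  loop 0
```

Characters produced by ^u{X} are emitted as UTF-8, in line with the module's stated assumption about the input encoding.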
Lists
Lists are delimited by left '(' (U+0028) and right ')' (U+0029) parentheses. Their elements are s-expressions separated by optional whitespace and comments. For example:
(a list (of four) expressions)
(a list(of four)expressions)
("a"list("of"four)expressions)
(a list (of ; This is a comment
four) expressions)
() ; the empty list
S-expression grammar
The following RFC 5234 ABNF grammar is defined on a sequence of Unicode characters.
sexp-seq = *(ws / comment / sexp)
sexp = atom / list
list = %x0028 sexp-seq %x0029
atom = token / qtoken
token = t-char *(t-char)
qtoken = %x0022 *(q-char / escape / cont) %x0022
escape = %x005E (%x0020 / %x0022 / %x005E / %x006E / %x0072 /
         %x0075 %x007B unum %x007D)
unum = 1*6(HEXDIG)
cont = %x005E nl ws
ws = *(ws-char)
comment = %x003B *(c-char) nl
nl = %x000A / %x000D / %x000D %x000A
t-char = %x0021 / %x0023-0027 / %x002A-003A / %x003C-005D /
         %x005F-007E / %x0080-D7FF / %xE000-10FFFF
q-char = t-char / ws-char / %x0028 / %x0029 / %x003B
ws-char = %x0020 / %x0009 / %x000A / %x000B / %x000C / %x000D
c-char = %x0009 / %x000B / %x000C / %x0020-D7FF / %xE000-10FFFF
A few additional constraints not expressed by the grammar:
- unum once interpreted as a hexadecimal number must be a Unicode scalar value.
- A comment can be ended by the end of the character sequence rather than nl.
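To give a feel for the sexp-seq, sexp, list and token productions, here is a deliberately partial recursive descent sketch. It is not the module's parser: the sexp type, Parse_error and parse are hypothetical, quoted atoms are rejected rather than decoded, and token characters are approximated at the byte level (control characters and UTF-8 validity are not checked).

```ocaml
type sexp = Atom of string | List of sexp list

exception Parse_error of int * string

let is_ws = function
  | ' ' | '\t' | '\n' | '\x0B' | '\x0C' | '\r' -> true
  | _ -> false

(* Byte-level approximation of t-char. *)
let is_token_char c = not (is_ws c) && not (String.contains "()\";^" c)

let parse (s : string) : sexp list =
  let len = String.length s in
  (* Skip whitespace and comments; a comment may end at end of input. *)
  let rec skip i =
    if i >= len then len else
    match s.[i] with
    | c when is_ws c -> skip (i + 1)
    | ';' ->
        let rec eol j =
          if j >= len || s.[j] = '\n' || s.[j] = '\r' then j else eol (j + 1)
        in
        skip (eol (i + 1))
    | _ -> i
  in
  (* seq parses a sexp-seq from i and returns the expressions together
     with the index where it stopped: end of input or a ')'. *)
  let rec seq acc i =
    let i = skip i in
    if i >= len || s.[i] = ')' then (List.rev acc, i) else
    let e, i = sexp i in
    seq (e :: acc) i
  and sexp i =
    match s.[i] with
    | '(' ->
        let es, j = seq [] (i + 1) in
        if j >= len then raise (Parse_error (i, "unclosed list"))
        else (List es, j + 1)
    | c when is_token_char c ->
        let rec stop j =
          if j < len && is_token_char s.[j] then stop (j + 1) else j
        in
        let j = stop i in
        (Atom (String.sub s i (j - i)), j)
    | _ -> raise (Parse_error (i, "unsupported or illegal character"))
  in
  let es, i = seq [] 0 in
  if i < len then raise (Parse_error (i, "unexpected ')'")) else es
```

With this sketch, (a list (of four) expressions) and (a list(of four)expressions) parse to the same value, as the grammar makes whitespace between list elements optional.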