Make.Parser
The Parser
module offers an API that is a balance between sharing the common logic of parsing canonical S-expressions while allowing to write parsers that are as efficient as possible, both in terms of speed and allocations. A carefully written parser using this API will be:
caml_modify
(a slow function of the OCaml runtime that is emitted when mutating a constructed value)To parse using this API, you must first create a lexer via Lexer.create
. The lexer is responsible for scanning the input and forming tokens. The user must feed characters read from the input one by one to the lexer until it yields a token. For instance:
# let lexer = Lexer.create ();;
val lexer : Lexer.t = <abstract>
# Lexer.feed lexer '(';;
- : [ `atom | `other ] Lexer.token = Lparen
# Lexer.feed lexer ')';;
- : [ `atom | `other ] Lexer.token = Rparen
When the lexer doesn't have enough to return a token, it simply returns the special token Lexer.token.Await
:
# Lexer.feed lexer '1';;
- : [ `atom | `other ] Lexer.token = Await
Note that since atoms of canonical S-expressions do not need quoting, they are always represented as a contiguous sequence of characters that don't need further processing. To achieve maximum efficiency, the lexer only returns the length of the atom and it is the responsibility of the caller to extract the atom from the input source:
# Lexer.feed lexer '2';;
- : [ `atom | `other ] Lexer.token = Await
# Lexer.feed lexer ':';;
- : [ `atom | `other ] Lexer.token = Atom 2
When getting Atom n
, the caller should then proceed to read the next n
characters of the input as a string. For instance, if the input is an in_channel
the caller should proceed with really_input_string ic n
.
Finally, when the end of input is reached the user should call Lexer.feed_eoi
to make sure the lexer is not awaiting more input. If that is the case, Lexer.feed_eoi
will raise:
# Lexer.feed lexer '1';;
- : [ `atom | `other ] Lexer.token = Await
# Lexer.feed_eoi lexer;;
Exception: Parse_error "premature end of input".
The lexer doesn't keep track of the structure of the S-expressions. In order to construct a whole structured S-expressions, the caller must maintain a parsing stack via the Stack
module. A Stack.t
value simply represent a parsed prefix in reverse order.
For instance, the prefix "1:x((1:y1:z)" will be represented as:
Sexp (List [ Atom "y"; Atom "z" ], Open (Sexp (Atom "x", Empty)))
The Stack
module offers various primitives to open or close parentheses or insert an atom. And for convenience it provides a function Stack.add_token
that takes the output of Lexer.feed
directly:
# Stack.add_token Rparen Empty;;
- : Stack.t = Open Empty
# Stack.add_token Lparen (Open Empty);;
- : Stack.t = Sexp (List [], Empty)
Note that Stack.add_token
doesn't accept Atom _
. This is enforced at the type level by a GADT. The reason for this is that in order to insert an atom, the user must have fetched the contents of the atom themselves. In order to insert an atom into a stack, you can use the function Stack.add_atom
:
# Stack.add_atom "foo" (Open Empty);;
- : Stack.t = Sexp (Atom "foo", Open Empty)
When parsing is finished, one may call the function Stack.to_list
in order to extract all the toplevel S-expressions from the stack:
# Stack.to_list (Sexp (Atom "x", Sexp (List [Atom "y"], Empty)));;
- : sexp list = [List [Atom "y"; Atom "x"]]
If instead you want to stop parsing as soon a single full S-expression has been discovered, you can match on the structure of the stack. If the stack is of the form Sexp (_, Empty)
, then you know that exactly one S-expression has been parsed and you can stop there.
In order to reduce allocations to a minumim, parsing errors are reported via the exception Parse_error
. It is the responsibility of the caller to catch this exception and return it as an Error _
value. Functions that may raise Parse_error
are documented as such.
When extracting an atom and the input doesn't have enough characters left, the user may raise Parse_error premature_end_of_input
. This will produce an error message similar to what the various high-level functions of this library produce.
Parsing functions should always follow the following pattern:
Atom n
, fetch the next n
characters from the input to form an atomStack.add_atom
or Stack.add_token
Lexer.feed_eoi
when the end of input is reached, otherwise stop as soon as the stack is of the form Sexp (_, Empty)
-For instance, to parse a string as a list of S-expressions:
module Sexp = struct
type t =
| Atom of string
| List of t list
end
module Csexp = Csexp.Make (Sexp)
let extract_atom s pos len =
match String.sub s pos len with
| exception _ ->
(* Turn out-of-bounds errors into [Parse_error] *)
raise (Parse_error premature_end_of_input)
| s -> s
let parse_string =
let open Csexp.Parser in
let rec loop s pos len lexer stack =
if pos = len then (
Lexer.feed_eoi lexer;
Stack.to_list stack
) else
match Lexer.feed lexer (String.unsafe_get s pos) with
| Atom atom_len ->
let atom = extract_atom s (pos + 1) atom_len in
loop s (pos + 1 + atom) len lexer (Stack.add_atom atom stack)
| (Await | Lparen | Rparen) as x ->
loop s (pos + 1) len lexer (Stack.add_token x stack)
in
fun s ->
match loop s 0 (String.length s) (Lexer.create ()) Empty with
| v -> Ok v
| exception Parse_error msg -> Error msg
Error message signaling the end of input was reached prematurely. You can use this when extracting an atom from the input and the input doesn't have enough characters.
module Lexer : sig ... end
Lexical analyser
module Stack : sig ... end
Parsing stack