Module Csexp.Parser

The Parser module offers an API that is a balance between sharing the common logic of parsing canonical S-expressions while allowing to write parsers that are as efficient as possible, both in terms of speed and allocations. A carefully written parser using this API will be:

Lexers

To parse using this API, you must first create a lexer via Lexer.create. The lexer is responsible for scanning the input and forming tokens. The user must feed characters read from the input one by one to the lexer until it yields a token. For instance:

# let lexer = Lexer.create ();;
val lexer : Lexer.t = <abstract>
# Lexer.feed lexer '(';;
- : [ `atom | `other ] Lexer.token = Lparen
# Lexer.feed lexer ')';;
- : [ `atom | `other ] Lexer.token = Rparen

When the lexer doesn't have enough to return a token, it simply returns the special token Lexer.token.Await:

# Lexer.feed lexer '1';;
- : [ `atom | `other ] Lexer.token = Await

Note that since atoms of canonical S-expressions do not need quoting, they are always represented as a contiguous sequence of characters that don't need further processing. To achieve maximum efficiency, the lexer only returns the length of the atom and it is the responsibility of the caller to extract the atom from the input source:

# Lexer.feed lexer '2';;
- : [ `atom | `other ] Lexer.token = Await
# Lexer.feed lexer ':';;
- : [ `atom | `other ] Lexer.token = Atom 2

When getting Atom n, the caller should then proceed to read the next n characters of the input as a string. For instance, if the input is an in_channel the caller should proceed with really_input_string ic n.

Finally, when the end of input is reached the user should call Lexer.feed_eoi to make sure the lexer is not awaiting more input. If that is the case, Lexer.feed_eoi will raise:

# Lexer.feed lexer '1';;
- : [ `atom | `other ] Lexer.token = Await
# Lexer.feed_eoi lexer;;
Exception: Parse_error "premature end of input".

Parsing stacks

The lexer doesn't keep track of the structure of the S-expressions. In order to construct a whole structured S-expressions, the caller must maintain a parsing stack via the Stack module. A Stack.t value simply represent a parsed prefix in reverse order.

For instance, the prefix "1:x((1:y1:z)" will be represented as:

Sexp (List [ Atom "y"; Atom "z" ], Open (Sexp (Atom "x", Empty))) 

The Stack module offers various primitives to open or close parentheses or insert an atom. And for convenience it provides a function Stack.add_token that takes the output of Lexer.feed directly:

# Stack.add_token Rparen Empty;;
- : Stack.t = Open Empty
# Stack.add_token Lparen (Open Empty);;
- : Stack.t = Sexp (List [], Empty)

Note that Stack.add_token doesn't accept Atom _. This is enforced at the type level by a GADT. The reason for this is that in order to insert an atom, the user must have fetched the contents of the atom themselves. In order to insert an atom into a stack, you can use the function Stack.add_atom:

# Stack.add_atom "foo" (Open Empty);;
- : Stack.t = Sexp (Atom "foo", Open Empty)

When parsing is finished, one may call the function Stack.to_list in order to extract all the toplevel S-expressions from the stack:

# Stack.to_list (Sexp (Atom "x", Sexp (List [Atom "y"], Empty)));;
- : sexp list = [List [Atom "y"; Atom "x"]]

If instead you want to stop parsing as soon a single full S-expression has been discovered, you can match on the structure of the stack. If the stack is of the form Sexp (_, Empty), then you know that exactly one S-expression has been parsed and you can stop there.

Parsing errors

In order to reduce allocations to a minumim, parsing errors are reported via the exception Parse_error. It is the responsibility of the caller to catch this exception and return it as an Error _ value. Functions that may raise Parse_error are documented as such.

When extracting an atom and the input doesn't have enough characters left, the user may raise Parse_error premature_end_of_input. This will produce an error message similar to what the various high-level functions of this library produce.

Building a parsing function

Parsing functions should always follow the following pattern:

  1. create a lexer and start with an empty parsing stack
  2. iterate over the input, feeding the lexer characters one by one. When the lexer returns Atom n, fetch the next n characters from the input to form an atom
  3. update the stack via Stack.add_atom or Stack.add_token
  4. if parsing the whole input, call Lexer.feed_eoi when the end of input is reached, otherwise stop as soon as the stack is of the form Sexp (_, Empty) -

For instance, to parse a string as a list of S-expressions:

module Sexp = struct
  type t =
    | Atom of string
    | List of t list
end

module Csexp = Csexp.Make (Sexp)

let extract_atom s pos len =
  match String.sub s pos len with
  | exception _ ->
    (* Turn out-of-bounds errors into [Parse_error] *)
    raise (Parse_error premature_end_of_input)
  | s -> s

let parse_string =
  let open Csexp.Parser in
  let rec loop s pos len lexer stack =
    if pos = len then (
      Lexer.feed_eoi lexer;
      Stack.to_list stack
    ) else
      match Lexer.feed lexer (String.unsafe_get s pos) with
      | Atom atom_len ->
        let atom = extract_atom s (pos + 1) atom_len in
        loop s (pos + 1 + atom) len lexer (Stack.add_atom atom stack)
      | (Await | Lparen | Rparen) as x ->
        loop s (pos + 1) len lexer (Stack.add_token x stack)
  in
  fun s ->
    match loop s 0 (String.length s) (Lexer.create ()) Empty with
    | v -> Ok v
    | exception Parse_error msg -> Error msg
exception Parse_error of string
val premature_end_of_input : string

Error message signaling the end of input was reached prematurely. You can use this when extracting an atom from the input and the input doesn't have enough characters.

module Lexer : sig ... end

Lexical analyser

module Stack : sig ... end

Parsing stack