Module Markup
Error-recovering streaming HTML and XML parsers and writers.
Markup.ml is an HTML and XML parsing and serialization library. It:
- Is error-recovering, so you can get a best-effort parse of malformed input.
- Reports all errors before recovery, so you can get strict parsing instead.
- Conforms closely to the XML grammar and HTML parser from the respective specifications.
- Accepts document fragments, but can be told to accept only full documents.
- Detects character encodings automatically.
- Supports both simple synchronous (this module) and non-blocking usage (Markup_lwt).
- Is streaming and lazy. Partial input is processed as soon as received, but only as needed.
- Does one pass over the input and emits a stream of SAX-style parsing signals. A helper (tree) allows that to be easily converted into DOM-style trees.
The usage is straightforward. For example:
open Markup

(* Correct and pretty-print HTML. *)
let () =
  channel stdin
  |> parse_html |> signals |> pretty_print
  |> write_html |> to_channel stdout

(* Show up to 10 XML well-formedness errors to the user. Stop after
   the 10th, without reading more input. *)
let report =
  let count = ref 0 in
  fun location error ->
    error |> Error.to_string ~location |> prerr_endline;
    count := !count + 1;
    if !count >= 10 then raise_notrace Exit

let () =
  string "some xml" |> parse_xml ~report |> signals |> drain

(* Load HTML into a custom document tree data type. *)
type html = Text of string | Element of string * html list

let _document =
  file "some_file"
  |> fst
  |> parse_html
  |> signals
  |> tree
    ~text:(fun ss -> Text (String.concat "" ss))
    ~element:(fun (_, name) _ children -> Element (name, children))

The interface is centered around four functions. In pseudocode:
val parse_html : char stream -> signal stream
val write_html : signal stream -> char stream
val parse_xml : char stream -> signal stream
val write_xml : signal stream -> char stream
Most of the remaining functions create streams from, or write streams to, strings, files, and channels, or manipulate streams, such as next and the combinators map and fold.
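As a minimal sketch of how the parsing and writing functions compose, the following parses an HTML fragment and serializes it again. The helper name reserialize_html is hypothetical, and the `Fragment "body" context is chosen here only to avoid relying on automatic context detection.

```ocaml
(* Hypothetical helper: parse an HTML fragment, then write it back out.
   Error recovery closes the elements left open in the input. *)
let reserialize_html fragment =
  Markup.string fragment
  |> Markup.parse_html ~context:(`Fragment "body")
  |> Markup.signals
  |> Markup.write_html
  |> Markup.to_string

let () = print_endline (reserialize_html "<p>Hello <em>world")
```

Because the parsers recover from errors, the unclosed em and p elements still produce balanced start and end signals, so the writer emits closing tags for them.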
Apart from this module, Markup.ml provides two other top-level modules:
- Markup_lwt
- Markup_lwt_unix
Most of the interface of Markup_lwt is specified in signature ASYNCHRONOUS, which will be shared with a Markup_async module, should it be implemented.
Markup.ml is developed on GitHub and distributed under the BSD license. This documentation is for version 0.8.0 of the library. Documentation for older versions can be found on the releases page.
Streams
type async
type sync
Phantom types for use with ('a, 's) stream in place of 's. See explanation below.
type ('a, 's) stream
Streams of elements of type 'a.
In simple usage, when using only this module Markup, the additional type parameter 's is always sync, and there is no need to consider it further.
However, if you are using Markup_lwt, you may create some async streams. The difference between the two is that next on a sync stream retrieves an element before next "returns," while next on an async stream might not retrieve an element until later. As a result, it is not safe to pass an async stream where a sync stream is required. The phantom types are used to make the type checker catch such errors at compile time.
Errors
The parsers recover from errors automatically. If that is sufficient, you can ignore this section. However, if you want stricter behavior, or need to debug parser output, use optional argument ?report of the parsers, and look in module Error.
module Error : sig ... end
Error type and to_string function.
Encodings
The parsers detect encodings automatically. If you need to specify an encoding, use optional argument ?encoding of the parsers, and look in module Encoding.
module Encoding : sig ... end
Common Internet encodings such as UTF-8 and UTF-16; also includes some less popular encodings that are sometimes used for XML.
Signals
type xml_declaration = {
  version : string;
  encoding : string option;
  standalone : bool option;
}
Representation of an XML declaration, i.e. <?xml version="1.0" encoding="utf-8"?>.
type doctype = {
  doctype_name : string option;
  public_identifier : string option;
  system_identifier : string option;
  raw_text : string option;
  force_quirks : bool;
}
Representation of a document type declaration. The HTML parser fills in all fields besides raw_text. The XML parser reads declarations roughly, and fills only the raw_text field with the text found in the declaration.
type signal = [
  | `Start_element of name * (name * string) list
  | `End_element
  | `Text of string list
  | `Doctype of doctype
  | `Xml of xml_declaration
  | `PI of string * string
  | `Comment of string
]
Parsing signals. The parsers emit them according to the following grammar:
doc     ::= `Xml? misc* `Doctype? misc* element misc*
misc    ::= `PI | `Comment
element ::= `Start_element content* `End_element
content ::= `Text | element | `PI | `Comment
As a result, emitted `Start_element and `End_element signals are always balanced, and, if there is an XML declaration, it is the first signal.
If parsing with ~context:`Document, the signal sequence will match the doc production until the first error. If parsing with ~context:`Fragment, it will match content*. If ~context is not specified, the parser will pick one of the two by examining the input.
As an example, if the XML parser is parsing
<?xml version="1.0"?><root>text<nested>more text</nested></root>
it will emit the signal sequence
`Xml {version = "1.0"; encoding = None; standalone = None}
`Start_element (("", "root"), [])
`Text ["text"]
`Start_element (("", "nested"), [])
`Text ["more text"]
`End_element
`End_element
The `Text signal carries a string list instead of a single string because on 32-bit platforms, OCaml strings cannot be larger than 16MB. In case the parsers encounter a very long sequence of text, one whose length exceeds about Sys.max_string_length / 2, they will emit a `Text signal with several strings.
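The signal sequence described above can be inspected directly. The following is a small sketch, with the hypothetical helper signals_of_xml, that collects the signals of an XML document into a list using fold and prints them with signal_to_string.

```ocaml
(* Hypothetical helper: collect every signal emitted for an XML string.
   fold is eager, so the whole input is parsed here. *)
let signals_of_xml text =
  Markup.string text
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.fold (fun acc signal -> signal :: acc) []
  |> List.rev

let () =
  signals_of_xml
    "<?xml version=\"1.0\"?><root>text<nested>more text</nested></root>"
  |> List.iter (fun signal -> print_endline (Markup.signal_to_string signal))
```

Collecting into a list gives up the streaming behavior, so this is mainly useful for debugging and tests on small inputs.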
type content_signal = [
  | `Start_element of name * (name * string) list
  | `End_element
  | `Text of string list
]
A restriction of type signal to only elements and text, i.e. no comments, processing instructions, or declarations. This can be useful for pattern matching in applications that only care about the content and element structure of a document. See the helper content.
val signal_to_string : [< signal ] -> string
Provides a human-readable representation of signals for debugging.
Parsers
XML
val parse_xml :
  ?report:(location -> Error.t -> unit) ->
  ?encoding:Encoding.t ->
  ?namespace:(string -> string option) ->
  ?entity:(string -> string option) ->
  ?context:[< `Document | `Fragment ] ->
  (char, 's) stream -> 's parser
Creates a parser that converts an XML byte stream to a signal stream.
For simple usage, string "foo" |> parse_xml |> signals.
If ~report is provided, report is called for every error encountered. You may raise an exception in report, and it will propagate to the code reading the signal stream.
If ~encoding is not specified, the parser detects the input encoding automatically. Otherwise, the given encoding is used.
~namespace is called when the parser is unable to resolve a namespace prefix. If it evaluates to Some s, the parser maps the prefix to s. Otherwise, the parser reports `Bad_namespace.
~entity is called when the parser is unable to resolve an entity reference. If it evaluates to Some s, the parser inserts s into the text or attribute being parsed without any further parsing of s. s is assumed to be encoded in UTF-8. If entity evaluates to None instead, the parser reports `Bad_token. See xhtml_entity if you are parsing XHTML.
The meaning of ~context is described at signal, above.
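For stricter handling, one possible use of ~report is to collect every error message while the parser keeps recovering. The sketch below assumes nothing beyond the functions documented here; the helper name xml_errors is hypothetical.

```ocaml
(* Hypothetical helper: return the messages for all well-formedness errors
   found in an XML string, in the order they were reported. *)
let xml_errors text =
  let errors = ref [] in
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  Markup.string text
  |> Markup.parse_xml ~report
  |> Markup.signals
  |> Markup.iter ignore;  (* force the parser to read all input *)
  List.rev !errors

let () =
  xml_errors "<a></b>" |> List.iter prerr_endline
```

An empty result means the parser reported no errors, which is one way to get strict parsing on top of the error-recovering parser.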
val write_xml :
  ?report:((signal * int) -> Error.t -> unit) ->
  ?prefix:(string -> string option) ->
  ([< signal ], 's) stream -> (char, 's) stream
Converts an XML signal stream to a byte stream.
If ~report is provided, it is called for every error encountered. The first argument is a pair of the signal causing the error and its index in the signal stream. You may raise an exception in report, and it will propagate to the code reading the byte stream.
~prefix is called when the writer is unable to find a prefix in scope for a namespace URI. If it evaluates to Some s, the writer uses s for the URI. Otherwise, the writer reports `Bad_namespace.
HTML
val parse_html :
  ?report:(location -> Error.t -> unit) ->
  ?encoding:Encoding.t ->
  ?context:[< `Document | `Fragment of string ] ->
  (char, 's) stream -> 's parser
Similar to parse_xml, but parses HTML with embedded SVG and MathML, never emits signals `Xml or `PI, and ~context has a different type on tag `Fragment.
For HTML fragments, you should specify the enclosing element, e.g. `Fragment "body". This is because, when parsing HTML, error recovery and the interpretation of text depend on the current element. For example, the text
foo</bar>
parses differently in title elements than in p elements. In the former, it is parsed as foo</bar>, while in the latter, it is foo followed by a parse error due to unmatched tag </bar>. To get these behaviors, set ~context to `Fragment "title" and `Fragment "p", respectively.
If you use `Fragment "svg", the fragment is assumed to be SVG markup. Likewise, `Fragment "math" causes the parser to parse MathML markup.
If ~context is omitted, the parser guesses it from the input stream. For example, if the first signal would be `Doctype, the context is set to `Document, but if the first signal would be `Start_element "td", the context is set to `Fragment "tr". If the first signal would be `Start_element "g", the context is set to `Fragment "svg".
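The effect of the enclosing element can be checked with a short sketch. The helper fragment_text is a hypothetical name; it parses a fragment in the given context and returns only its text, using text and to_string from this module.

```ocaml
(* Hypothetical helper: parse [markup] as the content of [element] and
   return the text that the parser found in it. *)
let fragment_text element markup =
  Markup.string markup
  |> Markup.parse_html ~context:(`Fragment element)
  |> Markup.signals
  |> Markup.text
  |> Markup.to_string

let () =
  (* In title, the stray end tag is kept as text; in p, it is dropped. *)
  print_endline (fragment_text "title" "foo</bar>");
  print_endline (fragment_text "p" "foo</bar>")
```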
Input sources
val string : string -> (char, sync) stream
Evaluates to a stream that retrieves successive bytes from the given string.
val buffer : Stdlib.Buffer.t -> (char, sync) stream
Evaluates to a stream that retrieves successive bytes from the given buffer. Be careful of changing the buffer while it is being iterated by the stream.
val channel : Stdlib.Pervasives.in_channel -> (char, sync) stream
Evaluates to a stream that retrieves bytes from the given channel. If the channel cannot be read, the next read of the stream results in raising Sys_error.
Note that this input source is synchronous because Pervasives.in_channel reads are blocking. For non-blocking channels, see Markup_lwt_unix.
val file : string -> (char, sync) stream * (unit -> unit)
file path opens the file at path, then evaluates to a pair s, close, where reading from stream s retrieves successive bytes from the file, and calling close () closes the file.
The file is closed automatically if s is read to completion, or if reading s raises an exception. It is not necessary to call close () in these cases.
If the file cannot be opened, raises Sys_error immediately. If the file cannot be read, reading the stream raises Sys_error.
Output destinations
val to_string : (char, sync) stream -> string
Eagerly retrieves bytes from the given stream and assembles a string.
val to_buffer : (char, sync) stream -> Stdlib.Buffer.t
Eagerly retrieves bytes from the given stream and places them into a buffer.
val to_channel : Stdlib.Pervasives.out_channel -> (char, sync) stream -> unit
Eagerly retrieves bytes from the given stream and writes them to the given channel. If writing fails, raises Sys_error.
val to_file : string -> (char, sync) stream -> unit
Eagerly retrieves bytes from the given stream and writes them to the given file. If writing fails, or the file cannot be opened, raises Sys_error. Note that the file is truncated (cleared) before writing. If you wish to append to a file, open it with the appropriate flags and use to_channel on the resulting channel.
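The file functions can be combined as in the following sketch, which writes a stream with to_file and then reads only the first byte back with file, closing the file explicitly because the stream is not read to completion. The helper name first_byte_of_file and the temporary file are assumptions made for illustration.

```ocaml
(* Hypothetical helper: read just the first byte of a file. The stream is
   abandoned early, so close () is called by hand. *)
let first_byte_of_file path =
  let stream, close = Markup.file path in
  let first = Markup.next stream in
  close ();
  first

let () =
  let path = Filename.temp_file "markup_example" ".html" in
  Markup.string "<p>saved</p>" |> Markup.to_file path;
  (match first_byte_of_file path with
   | Some c -> Printf.printf "first byte: %c\n" c
   | None -> print_endline "empty file");
  Sys.remove path
```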
Stream operations
val stream : (unit -> 'a option) -> ('a, sync) stream
stream f creates a stream that repeatedly calls f (). Each time f () evaluates to Some v, the next item in the stream is v. The first time f () evaluates to None, the stream ends.
val next : ('a, sync) stream -> 'a option
Retrieves the next item in the stream, if any, and removes it from the stream.
val peek : ('a, sync) stream -> 'a option
Retrieves the next item in the stream, if any, but does not remove the item from the stream.
val transform : ('a -> 'b -> 'c list * 'a option) -> 'a -> ('b, 's) stream -> ('c, 's) stream
transform f init s lazily creates a stream by repeatedly applying f acc v, where acc is an accumulator whose initial value is init, and v is consecutive values of s. Each time, f acc v evaluates to a pair (vs, maybe_acc'). The values vs are added to the result stream. If maybe_acc' is Some acc', the accumulator is set to acc'. Otherwise, if maybe_acc' is None, the result stream ends.
val fold : ('a -> 'b -> 'a) -> 'a -> ('b, sync) stream -> 'a
fold f init s eagerly folds over the items v, v', v'', ... of s, i.e. evaluates f (f (f init v) v') v''...
val map : ('a -> 'b) -> ('a, 's) stream -> ('b, 's) stream
map f s lazily applies f to each item of s, and produces the resulting stream.
val filter : ('a -> bool) -> ('a, 's) stream -> ('a, 's) stream
filter f s is s without the items for which f evaluates to false. filter is lazy.
val filter_map : ('a -> 'b option) -> ('a, 's) stream -> ('b, 's) stream
filter_map f s lazily applies f to each item v of s. If f v evaluates to Some v', the result stream has v'. If f v evaluates to None, no item corresponding to v appears in the result stream.
val iter : ('a -> unit) -> ('a, sync) stream -> unit
iter f s eagerly applies f to each item of s, i.e. evaluates f v; f v'; f v''...
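The stream operations work on any element type, not only on bytes and signals. As a small sketch on integer streams, the following uses stream, filter, map, fold, and transform; all helper names (count_to, sum_even_squares, running_sums_up_to, to_list) are hypothetical.

```ocaml
(* A stream of the integers 1, 2, ..., n. *)
let count_to n =
  let i = ref 0 in
  Markup.stream (fun () ->
    if !i >= n then None
    else begin
      incr i;
      Some !i
    end)

(* Lazy filter and map, followed by an eager fold. *)
let sum_even_squares n =
  count_to n
  |> Markup.filter (fun x -> x mod 2 = 0)
  |> Markup.map (fun x -> x * x)
  |> Markup.fold (+) 0

(* transform with an accumulator: emit running totals, and end the result
   stream as soon as the total would exceed [limit]. *)
let running_sums_up_to limit s =
  Markup.transform
    (fun total x ->
      let total' = total + x in
      if total' > limit then ([], None) else ([total'], Some total'))
    0 s

(* Collect a synchronous stream into a list. *)
let to_list s = Markup.fold (fun acc x -> x :: acc) [] s |> List.rev
```

For example, sum_even_squares 6 adds 4 + 16 + 36, and running_sums_up_to 10 (count_to 10) stops after the total 10 because the next total, 15, exceeds the limit.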
Utility
val content : ([< signal ], 's) stream -> (content_signal, 's) stream
Converts a signal stream into a content_signal stream by filtering out all signals besides `Start_element, `End_element, and `Text.
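Because content_signal has only three cases, pattern matches over it stay short and exhaustive. The sketch below, with the hypothetical helper max_depth, computes the deepest element nesting in an XML document; comments and processing instructions are filtered out by content before the fold.

```ocaml
(* Hypothetical helper: the maximum element nesting depth of an XML string. *)
let max_depth xml =
  Markup.string xml
  |> Markup.parse_xml
  |> Markup.signals
  |> Markup.content
  |> Markup.fold
       (fun (depth, deepest) signal ->
         match signal with
         | `Start_element _ -> (depth + 1, max deepest (depth + 1))
         | `End_element -> (depth - 1, deepest)
         | `Text _ -> (depth, deepest))
       (0, 0)
  |> snd
```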
val tree :
  ?text:(string list -> 'a) ->
  ?element:(name -> (name * string) list -> 'a list -> 'a) ->
  ?comment:(string -> 'a) ->
  ?pi:(string -> string -> 'a) ->
  ?xml:(xml_declaration -> 'a) ->
  ?doctype:(doctype -> 'a) ->
  ([< signal ], sync) stream -> 'a option
This function's type signature may look intimidating, but it is actually easy to use. It is best introduced by example:
type my_dom = Text of string | Element of string * my_dom list

"<p>HTML5 is <em>easy</em> to parse"
|> string
|> parse_html
|> signals
|> tree
  ~text:(fun ss -> Text (String.concat "" ss))
  ~element:(fun (_, name) _ children -> Element (name, children))
results in the structure
Element ("p", [
  Text "HTML5 is ";
  Element ("em", [Text "easy"]);
  Text " to parse"])
Formally, tree assembles a tree data structure of type 'a from a signal stream. The stream is parsed according to the following grammar:
stream  ::= node*
node    ::= element | `Text | `Comment | `PI | `Xml | `Doctype
element ::= `Start_element node* `End_element
Each time tree matches a production of node, it calls the corresponding function to convert the node into your tree type 'a. For example, when tree matches `Text ss, it calls ~text ss, if ~text is supplied. Similarly, when tree matches element, it calls ~element name attributes children, if ~element is supplied.
See trees if the input stream might have multiple top-level trees. This function tree only retrieves the first one.
val trees :
  ?text:(string list -> 'a) ->
  ?element:(name -> (name * string) list -> 'a list -> 'a) ->
  ?comment:(string -> 'a) ->
  ?pi:(string -> string -> 'a) ->
  ?xml:(xml_declaration -> 'a) ->
  ?doctype:(doctype -> 'a) ->
  ([< signal ], 's) stream -> ('a, 's) stream
Like tree, but converts all top-level trees, not only the first one. The trees are emitted on the resulting stream, in the sequence that they appear in the input.
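When a fragment has several top-level elements, trees yields one value per element. The following sketch collects them into a list; the type node and the helper top_level_nodes are hypothetical, and the `Fragment "body" context is an assumption chosen for the example.

```ocaml
type node = Text of string | Element of string * node list

(* Hypothetical helper: build one tree per top-level node of an HTML
   fragment, keeping only local element names and text. *)
let top_level_nodes html =
  Markup.string html
  |> Markup.parse_html ~context:(`Fragment "body")
  |> Markup.signals
  |> Markup.trees
       ~text:(fun ss -> Text (String.concat "" ss))
       ~element:(fun (_, name) _ children -> Element (name, children))
  |> Markup.fold (fun acc node -> node :: acc) []
  |> List.rev
```

Applied to "<p>one</p><p>two</p>", this yields two Element values, whereas tree would stop after the first.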
type 'a node = [
  | `Element of name * (name * string) list * 'a list
  | `Text of string
  | `Doctype of doctype
  | `Xml of xml_declaration
  | `PI of string * string
  | `Comment of string
]
See from_tree below.
val from_tree : ('a -> 'a node) -> 'a -> (signal, sync) stream
Deconstructs tree data structures of type 'a into signal streams. The function argument is applied to each data structure node. For example,
type my_dom = Text of string | Element of string * my_dom list

let dom =
  Element ("p", [
    Text "HTML5 is ";
    Element ("em", [Text "easy"]);
    Text " to parse"])

dom |> from_tree (function
  | Text s -> `Text s
  | Element (name, children) -> `Element (("", name), [], children))
results in the signal stream
`Start_element (("", "p"), [])
`Text ["HTML5 is "]
`Start_element (("", "em"), [])
`Text ["easy"]
`End_element
`Text [" to parse"]
`End_element
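Since from_tree and tree are inverses for simple trees, a round trip is a convenient sanity check. The sketch below uses the hypothetical type dom and helpers to_signals and of_signals; it drops attributes and namespaces, so it is only an illustration, not a general converter.

```ocaml
type dom = Text of string | Element of string * dom list

(* Tree to signal stream. Names are placed in the empty namespace. *)
let to_signals dom =
  Markup.from_tree
    (function
      | Text s -> `Text s
      | Element (name, children) -> `Element (("", name), [], children))
    dom

(* Signal stream back to a tree, keeping only the local element name. *)
let of_signals signals =
  Markup.tree
    ~text:(fun ss -> Text (String.concat "" ss))
    ~element:(fun (_, name) _ children -> Element (name, children))
    signals
```

The stream returned by to_signals can also be passed to write_html or write_xml to serialize the tree.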
val elements :
  (name -> (name * string) list -> bool) ->
  ([< signal ] as 'a, 's) stream -> (('a, 's) stream, 's) stream
elements f s scans the signal stream s for `Start_element (name, attributes) signals that satisfy f name attributes. Each such matching signal is the beginning of a substream that ends with the corresponding `End_element signal. The result of elements f s is the stream of these substreams.
Matches don't nest. If there is a matching element contained in another matching element, only the top one results in a substream.
Code using elements does not have to read each substream to completion, or at all. However, once the using code has tried to get the next substream, it should not try to read a previous one.
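One way to respect the ordering rule above is to consume each substream fully inside map, before the next substream is requested. The following sketch does this to extract the text of every element with a given local name; the helper name texts_of_elements is hypothetical.

```ocaml
(* Hypothetical helper: the text content of each element named [tag] in an
   HTML fragment. Each substream is drained by to_string before fold asks
   for the next one. *)
let texts_of_elements tag html =
  Markup.string html
  |> Markup.parse_html ~context:(`Fragment "body")
  |> Markup.signals
  |> Markup.elements (fun (_, name) _ -> name = tag)
  |> Markup.map (fun substream ->
       substream |> Markup.text |> Markup.to_string)
  |> Markup.fold (fun acc s -> s :: acc) []
  |> List.rev
```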
val text : ([< signal ], 's) stream -> (char, 's) stream
Extracts all the text in a signal stream by discarding all markup. For each `Text ss signal, the result stream has the bytes of the strings ss, and all other signals are ignored.
val trim : ([> content_signal ] as 'a, 's) stream -> ('a, 's) stream
Trims insignificant whitespace in an HTML signal stream. Whitespace around flow ("block") content does not matter, but whitespace in phrasing ("inline") content does. So, if the input stream is
<div> <p> <em>foo</em> bar </p> </div>
passing it through Markup.trim will result in
<div><p><em>foo</em> bar</p></div>
Note that whitespace around the </em> tag was preserved.
val normalize_text : ([> `Text of string list ] as 'a, 's) stream -> ('a, 's) stream
Concatenates adjacent `Text signals, then eliminates all empty strings, then all `Text [] signals. Signals besides `Text are unaffected. Note that signal streams emitted by the parsers already have normalized text. This function is useful when you are inserting text into a signal stream after parsing, or generating streams from scratch, and would like to clean up the `Text signals.
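To illustrate on a stream generated from scratch, the sketch below builds a signal stream from a list with stream, then normalizes it. The names of_list, hand_built, and count_text_signals are hypothetical helpers written for this example.

```ocaml
(* Hypothetical helper: turn a list into a synchronous stream. *)
let of_list items =
  let remaining = ref items in
  Markup.stream (fun () ->
    match !remaining with
    | [] -> None
    | x :: rest ->
      remaining := rest;
      Some x)

(* A hand-built stream with adjacent, empty, and split `Text signals. *)
let hand_built : Markup.signal list =
  [ `Start_element (("", "p"), []);
    `Text ["foo"];
    `Text [""; "bar"];
    `Text [];
    `End_element ]

(* Count the `Text signals in a stream. *)
let count_text_signals s =
  Markup.fold (fun n signal ->
    match signal with
    | `Text _ -> n + 1
    | _ -> n) 0 s

let normalized () = of_list hand_built |> Markup.normalize_text
```

After normalization, the three adjacent `Text signals are merged into one, and the text content is unchanged.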
val pretty_print : ([> content_signal ] as 'a, 's) stream -> ('a, 's) stream
Adjusts the whitespace in the `Text signals in the given stream so that the output appears nicely-indented when the stream is converted to bytes and written.
This function is aware of the significance of whitespace in HTML, so it avoids changing the whitespace in phrasing ("inline") content. For example, pretty printing
<div><p><em>foo</em>bar</p></div>
results in
<div>
 <p>
  <em>foo</em>bar
 </p>
</div>
Note that no whitespace was inserted around <em> and </em>, because doing so would create a word break that wasn't present in the original stream.
val html5 : ([< signal ], 's) stream -> (signal, 's) stream
Converts a signal stream into an HTML5 signal stream by stripping any document type declarations, XML declarations, and processing instructions, and prefixing the HTML5 doctype declaration. This is useful when converting between XHTML and HTML.
val xhtml :
  ?dtd:[< `Strict_1_0 | `Transitional_1_0 | `Frameset_1_0 | `Strict_1_1 ] ->
  ([< signal ], 's) stream -> (signal, 's) stream
Similar to html5, but does not strip processing instructions, and prefixes an XHTML document type declaration and an XML declaration. The ~dtd argument specifies which DTD to refer to in the doctype declaration. The default is `Strict_1_1.
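A possible XHTML to HTML5 conversion combines parse_xml, xhtml_entity, html5, and write_html, as sketched below. The helper names html5_signals and xhtml_to_html5 are hypothetical, and the exact bytes produced by the writer are not shown here because they depend on the input and the library version.

```ocaml
(* Hypothetical helper: parse XHTML as XML, then rewrite the signal stream
   so that it starts with the HTML5 doctype and has no XML declaration or
   processing instructions. *)
let html5_signals xhtml =
  Markup.string xhtml
  |> Markup.parse_xml ~entity:Markup.xhtml_entity
  |> Markup.signals
  |> Markup.html5

(* Hypothetical helper: serialize the converted stream as HTML. *)
let xhtml_to_html5 xhtml =
  html5_signals xhtml |> Markup.write_html |> Markup.to_string
```

The reverse direction would use xhtml in place of html5 and write_xml in place of write_html.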
val xhtml_entity : string -> string option
Translates XHTML entities. This function is for use with the ~entity argument of parse_xml when parsing XHTML.
Namespaces
module Ns : sig ... end
Common namespace URIs.
Asynchronous interface
module type ASYNCHRONOUS = sig ... end
Markup.ml interface for monadic I/O libraries such as Lwt and Async.
Conformance status
The HTML parser seeks to implement section 8 of the HTML5 specification. That section describes a parser, part of a full-blown user agent, that is building up a DOM representation of an HTML document. Markup.ml is neither inherently part of a user agent, nor does it build up a DOM representation. With respect to section 8 of HTML5, Markup.ml is concerned with only the syntax. When that section requires that the user agent perform an action, Markup.ml emits enough information for a hypothetical user agent based on it to be able to decide to perform this action. Likewise, Markup.ml seeks to emit enough information for a hypothetical user agent to build up a conforming DOM.
The XML parser seeks to be a non-validating implementation of the XML and Namespaces in XML specifications.
The rest of this section lists known deviations from HTML5, XML, and Namespaces in XML. Some of these deviations are meant to be corrected in future versions of Markup.ml, while others will probably remain. The latter satisfy some or all of the following properties:
- They require non-local adjustment, especially of past nodes. For example, adjusting the start signal of the root node mid-way through the signal stream is difficult for a one-pass parser.
- They are minor. Users implementing less than a conforming browser typically don't care about them. They typically have to do with obscure error recovery. There are no deviations affecting the parsing of well-formed input.
- They can easily be corrected by code written over Markup.ml that builds up a DOM or maintains other auxiliary data structures during parsing.
To be corrected:
- XML: There is no attribute value normalization.
- HTML: Foster parenting is not implemented, because it requires non-local adjustments.
- HTML: Quirks mode is not honored. This affects the interaction between automatic closing of p elements and opening of table elements.
- HTML: The parser has non-standard recovery from unmatched closing form tags in some situations.
- HTML: The parser ignores interactions between form and template.
- HTML: The form translation for isindex is completely ignored. isindex is handled as an unknown element.
To remain:
- HTML: Except when detecting encodings, the parser does not try to read <meta> tags for encoding declarations. The user of Markup.ml should read these, if necessary. They are part of the emitted signal stream.
- HTML: noscript elements are always parsed, as are script elements. For conforming behavior, if the user of Markup.ml "supports scripts," the user should serialize the content of noscript to a `Text signal using write_html.
- HTML: Elements such as title that belong in head, but are found between head and body, are not moved into head.
- HTML: <html> tags found in the body do not have their attributes added to the `Start_element "html" signal emitted at the beginning of the document.