Note: This website is archived. For up-to-date information about D projects and development, please visit wiki.dlang.org.

Welcome to the XMLP project page

The SVN source code repository here is no longer the primary repository. The current release is now at https://launchpad.net/d2-xml, where the whole repository history has been imported. This means climbing the learning curve of the Bazaar source code repository toolset, but it's easy to get a bzr command line client and pull down a copy. I still update this SVN repository periodically, by bringing my local svn checkout up to date from my local Bazaar checkout. I may get confused along the way, so this SVN repository may be a little out of sync sometimes. The new project site also has bug reporting and other features. I will still be checking tickets here. The documentation here, such as it is, is already a bit out of date.

I played around with the original std.xml today, making a std.xml1. Without changing the design, I found a few ways to improve performance by close to 50%, based on what I learned making the new parsers. Custom munch functions for the frequently used cases are better than the generic munch. The main parse function was restructured to better sequence its suboptimal startsWith function calls. I was amazed to find, on reading the parse code, that attribute values surrounded by single quotes were not supported, so I fixed that. So std.xml1 is a modified std.xml, a fair bit faster (40-50%) on my limited indicative tests, and just a bit more compatible. It is not nearly as fast or compatible as the two parsers in the std.xmlp package, but it was good to get refreshed on the compactness and ideas in std.xml, even if it has limits. One problem in std.xml is the great deal of error checking code in debug compiles, which drags down performance enormously. If any further work were done on it, I would get rid of the Item category arrays in Element.

This project has a few XML parsers that can be used as an alternative to the std.xml parser. The code is for D2. There are effectively two different parsers.

  • Slice Parse. Specialised string slicing parser (SP). The text is processed from a single string. The parser returns references to the original string slice, unless replacing entity references.
  • Core Parse. A dchar input Range parser that conforms to the XML standards for input conversion, and processes DTD and external entities, and does full XML validation.
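
To make the slicing idea concrete, here is a minimal sketch that does not use the project's parsers at all. The helper firstText is hypothetical and only handles the trivial case of one element with text content; it is meant to show that a D slice of the source string refers to the original memory rather than a copy.

```d
import std.string : indexOf;

/// Hypothetical helper (not part of XMLP): returns the text between the
/// first '>' and the following '<' as a slice of src.
string firstText(string src)
{
    auto open = src.indexOf('>');
    if (open < 0)
        return null;
    auto rest = src[open + 1 .. $];
    auto close = rest.indexOf('<');
    return close < 0 ? rest : rest[0 .. close];
}

void main()
{
    string src = "<title>XML Developer's Guide</title>";
    string text = firstText(src);
    assert(text == "XML Developer's Guide");
    // The result points into the original string; nothing was copied.
    assert(text.ptr == src.ptr + 7);
}
```

A slice can only be returned unchanged when the text needs no rewriting, which fits the note above that replacing entity references is the exception.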

Both of these parsers inherit from an IXMLParser interface.

There is also a Document Object Model (DOM) module. This copies, more or less, the Node based classes, using pointer links and methods as found in Java and C++ implementations of a DOM.

The base parsers do not themselves depend on the DOM. Both parsers return a series of tokens in an XmlReturn struct.

/**
	Returns parsed fragment of XML. The type indicates what to expect.
*/
struct XmlReturn {
	enum  {
		RET_NULL,  /// nothing returned
		TAG_START, /// Element name in scratch.  Attributes in names and values. Element content expected next.
		TAG_END,   /// Element end tag.  No more element content.
		TAG_EMPTY, /// Element name in scratch.  Attributes in names and values. Element has no content.
		XML_DEC,   /// XML declaration.  Declaration attributes in names and values.
		STR_TEXT,  /// Text block content in scratch.
		STR_CDATA, /// CDATA block content in scratch.
		STR_PI,		///  Processing Instruction.  Name in scratch.  Instruction content in values[0].
		STR_XI,		///   XmlInstruction. Not implemented.
		STR_COMMENT,  /// Comment block content in scratch.
		DOC_TYPE,	/// DTD parse results contained in doctype as DtdValidate.
		XI_ENTITY,  /// EntityData. Not implemented. 
		XI_NOTATION, /// Notation data.  Not implemented.
		RET_MAX     /// Maximum value of this enum
	};

	uintptr_t		type = RET_NULL;
	string			scratch;
	string[]		names;
	string[]		values;

	Object			doctype; // maybe used to pass back a DTD or DocumentType object

	/// Retrieve attribute value by name
	string opIndex(const(char)[] name);

	/// Retrieve pointer to attribute value by name
	string* opIn_r(const(char)[] name);
}
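
As a rough illustration of how attribute lookup over the parallel names and values arrays might work, here is a self-contained sketch. MiniReturn is a hypothetical stand-in, not the project's XmlReturn, and the linear search is an assumption about the implementation. It uses opBinaryRight!"in", which is the current D spelling of the older opIn_r operator shown in the listing.

```d
/// Hypothetical stand-in for XmlReturn, reduced to the attribute arrays.
struct MiniReturn
{
    string scratch;
    string[] names;   // attribute names
    string[] values;  // attribute values, same order as names

    /// Returns the value for name, or null when the attribute is absent.
    string opIndex(const(char)[] name)
    {
        foreach (i, n; names)
            if (n == name)
                return values[i];
        return null;
    }

    /// Returns a pointer to the stored value, or null when absent.
    string* opBinaryRight(string op)(const(char)[] name) if (op == "in")
    {
        foreach (i, n; names)
            if (n == name)
                return &values[i];
        return null;
    }
}

void main()
{
    MiniReturn ret;
    ret.scratch = "book";
    ret.names = ["id"];
    ret.values = ["bk101"];

    assert(ret["id"] == "bk101");
    assert(ret["missing"] is null);

    // The pointer form distinguishes "absent" from "present but empty".
    if (auto p = "id" in ret)
        assert(*p == "bk101");
    assert(("missing" in ret) is null);
}
```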

Using the IXMLParser interface.

There are three main styles of processing XML documents with this project.

  • Low level.
  • XmlVisitor.
  • DocumentBuilder.

Compare them using books.xml and a matching struct Book.

string sourceXML =
`<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications with XML.</description>
</book> . . . </catalog>`;

struct Book
{
    string id;
    string author;
    string title;
    string genre;
    string price;
    string pubDate;
    string description;
}


Low level

Use IXMLParser.parse(ref XmlReturn) directly in a loop, and process the results into the program's data structures.

Call parse in a loop until encountering a start tag of interest. For multiple elements, use the TagStartSelect struct to pick individual handlers. There is a convenience function, textSelect, that just picks out the text content of the element being parsed.

The implementation of TagStartSelect is not sophisticated. It tracks TAG_START, TAG_EMPTY and TAG_END, keeps count of the element depth, and returns when the depth is back to zero. TAG_EMPTY indicates the element has no content. It is returned after the parser has processed the attributes and then found either a "/>" or an immediately matching end tag "</endtag>". An element name match at any depth calls the delegate selected for that element. Since the delegate is called for both TAG_START and TAG_EMPTY, it should check that XmlReturn.type is not TAG_EMPTY, to ensure that the element has some content, before nesting any further processing. A delegate should process all further nested content, using a new TagStartSelect or a call to textSelect(), if any is expected. If the structure of the XML document is not regular, or does not match the code, then unfortunate processing surprises will happen.
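
The depth tracking described above can be sketched without the real parser. MiniSelect, Token and Kind below are hypothetical names, and the token array stands in for what parse would return; unlike the real delegates, these do not consume nested content themselves. This is only an approximation of the behaviour, not the project's code.

```d
enum Kind { tagStart, tagEnd, tagEmpty, text }

struct Token
{
    Kind kind;
    string name; // element name, or the text for Kind.text
}

/// Hypothetical stand-in for TagStartSelect, driven by a token array.
struct MiniSelect
{
    void delegate(ref Token)[string] select;

    /// Starts just after the start tag of the enclosing element and
    /// returns the index just past that element's end tag.
    size_t parseContent(Token[] tokens, size_t pos)
    {
        int depth = 0;
        while (pos < tokens.length)
        {
            Token tok = tokens[pos++];
            final switch (tok.kind)
            {
            case Kind.tagStart:
                depth++;
                if (auto dg = tok.name in select) (*dg)(tok);
                break;
            case Kind.tagEmpty:
                // no content follows, so the depth does not change
                if (auto dg = tok.name in select) (*dg)(tok);
                break;
            case Kind.tagEnd:
                if (depth == 0) return pos; // end of enclosing element
                depth--;
                break;
            case Kind.text:
                break;
            }
        }
        return pos;
    }
}

void main()
{
    // Content of <book>: <author>A</author><title/></book><other>
    auto tokens = [
        Token(Kind.tagStart, "author"), Token(Kind.text, "A"),
        Token(Kind.tagEnd, "author"), Token(Kind.tagEmpty, "title"),
        Token(Kind.tagEnd, "book"), Token(Kind.tagStart, "other"),
    ];
    string[] seen;
    MiniSelect sel;
    sel.select["author"] = (ref Token t) { seen ~= t.name; };
    sel.select["title"] = (ref Token t) {
        // skip elements that have no content
        if (t.kind != Kind.tagEmpty) seen ~= t.name;
    };
    auto next = sel.parseContent(tokens, 0);
    assert(seen == ["author"]);
    assert(next == 5);
}
```

The title delegate shows the TAG_EMPTY check recommended above: it is still called for the empty element, but does nothing with it.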

/// Use IXMLParser interface
void directApproach(string s)
{
	auto parser = new XmlStringParser(s);
	XmlReturn xret;
	Book[] books;

	// call this after starting a book element
	void buildBook()
	{
		Book book;
		// in the book element: take the id attribute first, then the child element content
		book.id = xret["id"];
	
		auto xml = TagStartSelect(parser); // selector
		
		xml.select["author"]       = (ref XmlReturn item) { book.author	   = textSelect(parser); };
		xml.select["title"]        = (ref XmlReturn item) { book.title       = textSelect(parser); };
		xml.select["genre"]        = (ref XmlReturn item) { book.genre       = textSelect(parser); };
		xml.select["price"]        = (ref XmlReturn item) { book.price       = textSelect(parser); };
		xml.select["publish_date"] = (ref XmlReturn item) { book.pubDate     = textSelect(parser); };
		xml.select["description"]  = (ref XmlReturn item) { book.description = textSelect(parser); };

		xml.parseContent();
		
		books ~= book;
	}
	// more direct: call parse in a loop and react to book start tags
	while (parser.parse(xret))
	{
		switch(xret.type)
		{
		case XmlReturn.TAG_START:
			if (xret.scratch == "book")
			{
				buildBook();
			}
			break;
		default:
			break;
		}
	}
	OutputBooksXml(books);
}

XmlVisitor

This looks something like the original std.xml books example, and forces nested delegate handling, with delegates for each kind of XML token. A major difference from std.xml is that a handler for the root XML element needs to be set up.

void testExampleBooks(string s)
{
    Book[] books;

	auto xml = XmlVisitor(new XmlStringParser(s));
	/// Need to handle root element, as well as its content
	xml["catalog"] = (ref XmlVisitor b1)
	{
		b1["book"] = (ref XmlVisitor b2)
		{
			Book book;
		
			book.id = b2.attributes["id"];		
			b2["author"]       = (Element item) { book.author      = item.text(); };
			b2["title"]        = (Element item) { book.title       = item.text(); };
			b2["genre"]        = (Element item) { book.genre       = item.text(); };
			b2["price"]        = (Element item) { book.price       = item.text(); };
			b2["publish_date"] = (Element item) { book.pubDate     = item.text(); };
			b2["description"]  = (Element item) { book.description = item.text(); };
			
			b2.parseContent();
			
			books ~= book;
			
		};
		
		b1.parseContent();
	};
	xml.parseDocument();
	OutputBooksXml(books);
}

DocumentBuilder

The preceding examples do a single pass through the XML document and store items of interest. For multiple queries into the XML document, having the entire document as a DOM makes sense.

To build a Document object, create the IXMLParser from its source. Perhaps set a validation or namespaces flag. Create a DocumentBuilder structure and call the buildContent method.

void test_parse_sdom(string xml)
{
	auto parser = new XmlStringParser(xml);
	parser.validate = true;
	auto builder = DocumentBuilder(parser);
	builder.buildContent();
	Document doc = builder.document;
}
There are convenience functions to build a document from a file path or a string.
struct DocumentBuilder {
   static Document LoadString(string path, bool validate = true, bool useNamespaces = true);
   static Document LoadFile(string path, bool validate = true, bool useNamespaces = true);
}
Both of these functions will use the XmlDtdParser, as they do not know what to expect. The process of doing it all manually for the XmlDtdParser looks like this.
	// create an InputRange source from a string
	auto xinput = new SliceFill!(char)(src);
	// or, if the source is a file path, use these two lines instead:
	//   auto buffile = new BufferedFile(srcpath);
	//   auto xinput = new XmlStreamFiller(buffile);

	// create a document
	Document doc = new Document("Some name");
	// create the parser
	IXMLParser cp = new XmlDtdParser(xinput, doc, true);
	// build the document
	DocumentBuilder b = DocumentBuilder(cp);
	// need to supply the Document object, otherwise it will create another.
	b.buildContent(doc);
	

Catching Errors

The parsers implement DOMErrorHandler and DOMConfiguration classes. These can be used to set an exception handler callback during the parse. To customize this, derive a class from DOMErrorHandler and implement handleError.

DOMConfiguration uses Variant to store parameters.

This is why XmlDtdParser wants the Document as a parameter. It fetches the Document's DOMConfiguration to set up error handling and find out the parse settings. The cast(DOMErrorHandler) is essential. Of course the caller should still wrap the XML processing in try .. catch(ParseError exc).

There is an IXMLParser error handler delegate, which can be set to intercept exception throwing. XmlDtdParser uses this to pass on exceptions to a DOMErrorHandler class.

alias ParseError delegate(ParseError ex) PrepareThrowDg;

void setPrepareThrowDg(PrepareThrowDg dg);

class ParseErrorHandler : DOMErrorHandler {
	override bool handleError(DOMError error);
}
//...
	Document doc = new Document("Doc"); // path here is just a tag label.
	auto peh = new ParseErrorHandler();

	DOMConfiguration config = doc.getDomConfig();

	config.setParameter("error-handler",Variant(cast(DOMErrorHandler) peh));
	config.setParameter("namespaces", Variant(false));
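
One plausible reason the cast matters is that std.variant's Variant records the static type of the value it is given. The sketch below uses only std.variant; BaseHandler and MyHandler are hypothetical stand-ins for DOMErrorHandler and a derived handler, and it makes no claim about how DOMConfiguration reads the parameter back.

```d
import std.variant : Variant;

// Hypothetical stand-ins for DOMErrorHandler and a derived handler.
class BaseHandler
{
    bool handleError(string message) { return false; }
}

class MyHandler : BaseHandler
{
    override bool handleError(string message) { return true; }
}

void main()
{
    auto handler = new MyHandler;

    // Without the cast, the Variant records the derived type.
    auto asDerived = Variant(handler);
    assert(asDerived.type == typeid(MyHandler));

    // With the cast, it records the base type a consumer would ask for.
    auto asBase = Variant(cast(BaseHandler) handler);
    assert(asBase.type == typeid(BaseHandler));

    // The stored object is still the derived instance.
    assert(asBase.get!BaseHandler().handleError("oops"));
}
```

If the consumer compares the stored type against the base class, only the second form would match, which is consistent with the advice to always cast.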

XPath 1.0 query

There is an XPath 1.0 expression parser available. This returns a NodeList from a Document or Element argument.

/// get a NodeList out of Document using expression
NodeList xpathNodeList(Document d, string pathStr)
{
	XPathParser xpp;

	PathExpression pex = xpp.prepare(pathStr,InitState.IN_PATH);
	return run(pex,d);
}

/// get a NodeList out of Element using expression
NodeList xpathNodeList(Element e, string pathStr)
{
	XPathParser xpp;

	PathExpression pex = xpp.prepare(pathStr,InitState.IN_PATH);
	return run(pex,e);
}