XML parsing in AOLserver LG #63

AOLserver

AOLserver is an open-source, multi-threaded, high-performance web server. It is less well known than Apache, but it has a few features that put it ahead of Apache: a rich and well-thought-out extension API, a superior database connectivity API, and an embedded, tightly integrated Tcl interpreter. Read my previous LG article to learn more about AOLserver.

XML

If you’re going to do serious work with XML you’ll have to learn about it, and you’ll have to do it somewhere else. The best summary of XML I’ve seen is: XML is an (inefficient) way to represent data in tree form as text (ASCII) files. Text is good because it’s simple. A tree is good because a lot can be represented as trees (e.g., a non-circular list is just a degenerate tree, and a circular list can be described with multiple trees). Inefficiency is bad, but it usually makes engineering sense to trade efficiency for the extensibility and wide adoption that XML enjoys (lots of tools, lots of information).

XML support in AOLserver

XML processing (parsing and modification of XML documents) in AOLserver is possible thanks to the ns_xml module written by ArsDigita. This module is a wrapper around version 2.x (>2.2.5) of the libxml library and adds the ns_xml command to the embedded Tcl interpreter. You can download the source or get it directly from the CVS repository by running:

cvs -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver login
cvs -z3 -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver co nsxml

You need to press Enter after the first command, since CVS is waiting for a password (which is empty).

As of Dec. 2000 Linux distributions usually ship version 1.x of the libxml library, so chances are that you’ll need to install 2.x yourself (this will change in the future since everyone is migrating to 2.x). To install the nsxml module, go into the nsxml directory and, if necessary, edit the path in the Makefile so that it points to the AOLserver source directory. Then run make. You should get an nsxml.so module, which should be placed in the AOLserver bin directory (the same one that holds the main nsd executable). Add the following to your nsd.tcl config file:

ns_section "ns/server/${servername}/modules"
ns_param   nsxml           ${bindir}/nsxml.so

and restart AOLserver. You can verify that the module gets loaded by watching server.log. I usually keep a shell window open with:

tail -f $AOLSERVERDIR/log/server.log

This is also a great way to debug Tcl scripts since AOLserver will dump detailed debug information every time there is an error in the script.
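Besides watching the log, a script can check for the command itself. The following is a minimal sketch that relies only on the standard Tcl info commands introspection; the helper name have_command and the status messages are made up for illustration and are not part of AOLserver or nsxml.

```tcl
# Hypothetical helper: report whether a Tcl command is defined in
# the current interpreter. Uses only the standard "info commands".
proc have_command {name} {
    # "info commands" returns the list of commands matching the
    # pattern; an empty list means the command is not defined.
    return [expr {[llength [info commands $name]] > 0}]
}

# Inside an AOLserver script this tells you whether nsxml was loaded.
if { [have_command ns_xml] } {
    set status "ns_xml is available"
} else {
    set status "ns_xml is not loaded; check server.log and nsd.tcl"
}
```

In a real page you would probably write $status to the browser or to the log instead of just storing it in a variable.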

XML Quick reference

Here’s a quick reference of all commands available through ns_xml.

set doc_id [ns_xml parse ?-persist? $string]
Parse the XML document in $string and return a document id (a handle to the in-memory parsed tree). Without the -persist flag the memory is freed automatically when the script exits; with it, you have to free the memory yourself by calling ns_xml doc free. Use the -persist flag if you want to share parsed XML documents between scripts.
set doc_stats [ns_xml doc stats $doc_id]
Return document’s statistics.
ns_xml doc free $doc_id
Free a document. This should only be called on a document for which the -persist flag was passed to either ns_xml parse or ns_xml doc create.
set node_id [ns_xml doc root $doc_id]
Return the node id of the document root (this is where you start traversing the document tree).
set children_list [ns_xml node children $node_id]
Return a list of children nodes of a given node.
set node_name [ns_xml node name $node_id]
Return the name of a node.
set node_type [ns_xml node type $node_id]
Return the type of a node. Possible types: element, attribute, text, cdata_section, entity_ref, entity, pi, comment, document, document_type, document_frag, notation, html_document
set content [ns_xml node getcontent $node_id]
Get the content (text) of a given node.
set attr [ns_xml node getattr $node_id $attr_name]
Return the value of an attribute of a given node.
set doc_id [ns_xml doc create ?-persist? ?$doc_version?]
Create a new document in memory. If the -persist flag is given you’ll have to explicitly free the memory taken by the document with ns_xml doc free; otherwise it’ll be freed automatically after the script has executed. $doc_version is the version of the XML document; if not specified it’ll be "1.0".
set xml_string [ns_xml doc render $doc_id]
Generate XML from the in-memory representation of the document.
set node_id [ns_xml doc new_root $doc_id $node_name $node_content]
Create a root node for a document.
set node_id [ns_xml node new_sibling $node_id $name $content]
Create a new sibling of a given node.
set node_id [ns_xml node new_child $node_id $name $content]
Create a child of a given node.
ns_xml node setcontent $node_id $content
Set the content of a given node.
ns_xml node setattr $node_id $attr_name $value
Set the value of an attribute in a given node.
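To show how the document-building commands above fit together, here is a hedged sketch that uses only the ns_xml subcommands listed in this reference. It runs only inside AOLserver with the nsxml module loaded; the proc name build_greeting and the element and attribute names are invented for the example.

```tcl
# Sketch only: needs AOLserver with the nsxml module loaded.
# build_greeting and the element/attribute names are hypothetical.
proc build_greeting {} {
    # Create an in-memory document. Without -persist it is freed
    # automatically when the script finishes.
    set doc_id [ns_xml doc create "1.0"]

    # Every document needs a root node; here it has empty content.
    set root_id [ns_xml doc new_root $doc_id "greeting" ""]

    # Add a child element with text content and one attribute.
    set child_id [ns_xml node new_child $root_id "to" "world"]
    ns_xml node setattr $child_id "lang" "en"

    # Turn the in-memory tree back into XML text.
    return [ns_xml doc render $doc_id]
}
```

Calling build_greeting from a page returns the rendered XML as a string, which you can then send to the client or store.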

A simple example

An educational and simple thing to do is to parse a document and print out its tree structure. Stripped to bare bones the process is:

  • use ns_xml parse $xml_doc to parse the XML document in the string $xml_doc and get its document id
  • use ns_xml doc root $doc_id to get the id of the root node
  • use ns_xml node children $node_id to traverse the document tree and the ns_xml node ... commands to get node content and attributes

If you pass the -persist flag to ns_xml parse you’ll have to explicitly call ns_xml doc free $doc_id to free the memory associated with the document; otherwise it will be freed automatically after the script has executed.
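When -persist is used, it is easy to leak a document if the script fails halfway through. One possible way to guard against that is sketched below; it uses only the ns_xml commands from the quick reference plus the standard Tcl catch, runs only inside AOLserver with nsxml loaded, and the proc name count_root_children is hypothetical.

```tcl
# Sketch only: needs AOLserver with the nsxml module loaded.
# count_root_children is a hypothetical helper name.
proc count_root_children {xml} {
    # -persist means the document is NOT freed automatically.
    set doc_id [ns_xml parse -persist $xml]

    # Run the traversal inside catch so the document is freed
    # even when one of the ns_xml calls throws an error.
    set failed [catch {
        set root_id [ns_xml doc root $doc_id]
        llength [ns_xml node children $root_id]
    } result]

    # Explicitly release the memory held by the parsed tree.
    ns_xml doc free $doc_id

    if { $failed } {
        # Re-raise the original error message after cleaning up.
        error $result
    }
    return $result
}
```

For a document that is used within a single script there is no need for any of this: leave out -persist and let AOLserver free the memory.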

In code it could look like this:

# Print a single node as an HTML list item.
proc dump_node {node_id} {
    set name [ns_xml node name $node_id]
    set type [ns_xml node type $node_id]
    set content [ns_xml node getcontent $node_id]
    ns_write "<li>"
    ns_write "node id=$node_id name=$name type=$type"
    if { [string compare $type "attribute"] != 0 } {
        ns_write " content=$content\n"
    }
}

# Print a list of nodes and, recursively, their children as nested lists.
proc dump_tree_rec {children} {
    ns_write "<ul>\n"
    foreach child_id $children {
        dump_node $child_id
        set new_children [ns_xml node children $child_id]
        if { [llength $new_children] > 0 } {
            dump_tree_rec $new_children
        }
    }
    # Close the list so that the generated HTML stays well formed.
    ns_write "</ul>\n"
}

proc dump_tree {node_id} {
    dump_tree_rec [list $node_id]
}

proc dump_doc {doc_id} {
    ns_write "doc id=$doc_id<br>\n"
    set root_id [ns_xml doc root $doc_id]
    dump_tree $root_id
}

# Sample document (the element name is illustrative).
set xml_doc "<test>this is a test of xml</test>"
set doc_id [ns_xml parse $xml_doc]
dump_doc $doc_id

The ns_xml parse command will throw an error if the XML document is not valid (e.g., not well formed), so in production code we should catch it and display a meaningful error message, e.g.:

      if { [catch {set doc_id [ns_xml parse $xml_doc]} err] } {
          ns_write "There was an error parsing the following XML document: "
          ns_write [ns_quotehtml $xml_doc]
          ns_write "Error message is:"
          ns_write [ns_quotehtml $err]
          ns_write "