LINUX GAZETTE


"Linux Gazette...making Linux just a little more fun!"


XML parsing in AOLserver

By Irving Washington


AOLserver

AOLserver is an open-source, multi-threaded, high-performance web server. It is less well known than Apache but has a few features that put it ahead: a rich and well-thought-out extension API, a superior database connectivity API, and an embedded, tightly integrated Tcl interpreter. Read my previous LG article to learn more about AOLserver.

XML

If you're going to do serious work with XML you'll have to learn about it, and you'll have to do it somewhere else. The best summary of XML I've seen is: XML is an (inefficient) way to represent data in tree form as text (ASCII) files. Text is good because it's simple. Tree is good because a lot can be represented as trees (e.g., a non-circular list is just a degenerate tree and a circular list can be described with multiple trees). Inefficient is bad, but it usually makes engineering sense to trade efficiency for the extensibility and wide adoption that XML enjoys (lots of tools, lots of information).

XML support in AOLserver

XML processing (parsing and modification of XML documents) in AOLserver is possible thanks to the ns_xml module written by ArsDigita. This module is a wrapper around version 2.x (>2.2.5) of the libxml library and adds the ns_xml command to the embedded Tcl interpreter. You can download the source or get it directly from the CVS repository with:
cvs -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver login
cvs -z3 -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver co nsxml
You need to press Enter after the first command, since CVS is waiting for a password (which is empty).

As of Dec. 2000 Linux distributions usually ship version 1.x of the libxml library, so chances are you'll need to install 2.x yourself (this will change as everyone migrates to 2.x). To install the nsxml module, go into the nsxml directory and, if necessary, edit the path in the Makefile so that it points to the AOLserver source directory. Then run make. You should get an nsxml.so module, which should be placed in the AOLserver bin directory (the one that holds the main nsd executable). Add the following to your nsd.tcl config file:

ns_section "ns/server/${servername}/modules"
ns_param   nsxml           ${bindir}/nsxml.so
and restart AOLserver. You can verify that the module gets loaded by watching server.log. I usually keep a shell window open with:
tail -f $AOLSERVERDIR/log/server.log
This is also a great way to debug Tcl scripts since AOLserver will dump detailed debug information every time there is an error in the script.
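A script can also check for itself whether the command got registered, instead of failing later with a confusing error. This is only a minimal sketch: it relies on the standard Tcl info commands introspection, and the helper name command_available_p is my own invention, not part of the nsxml module.

```tcl
# Hypothetical helper (not part of nsxml): returns 1 if a command with the
# given name is registered in the current interpreter, 0 otherwise.
proc command_available_p { cmd_name } {
    # info commands returns the list of commands matching the pattern
    return [expr {[llength [info commands $cmd_name]] > 0}]
}

if { ![command_available_p ns_xml] } {
    # ns_log is AOLserver's logging command; outside the server fall back to puts
    if { [command_available_p ns_log] } {
        ns_log Warning "ns_xml is not available; check the modules section of nsd.tcl"
    } else {
        puts stderr "ns_xml is not available"
    }
}
```

Run inside AOLserver, the warning ends up in the same server.log you are already tailing.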

XML Quick reference

Here's a quick reference of all commands available through ns_xml.

set doc_id [ns_xml parse ?-persist? $string]
Parse the XML document in $string and return a document id (a handle to the in-memory parsed tree). If you don't provide the -persist flag the memory will be freed automatically when the script exits. Otherwise you'll have to free the memory by calling ns_xml doc free. You need to use the -persist flag if you want to share parsed XML docs between scripts.
set doc_stats [ns_xml doc stats $doc_id]
Return document's statistics.
ns_xml doc free $doc_id
Free a document. Should only be called on a document if the -persist flag has been passed to either ns_xml parse or ns_xml doc create.
set node_id [ns_xml doc root $doc_id]
Return the node id of the document root (you start traversal of the document tree from here.)
set children_list [ns_xml node children $node_id]
Return a list of children nodes of a given node.
set node_name [ns_xml node name $node_id]
Return the name of a node.
set node_type [ns_xml node type $node_id]
Return the type of a node. Possible types: element, attribute, text, cdata_section, entity_ref, entity, pi, comment, document, document_type, document_frag, notation, html_document
set content [ns_xml node getcontent $node_id]
Get the content (text) of a given node.
set attr [ns_xml node getattr $node_id $attr_name]
Return the value of an attribute of a given node.
set doc_id [ns_xml doc create ?-persist? ?$doc_version?]
Create a new document in memory. If the -persist flag is given you'll have to explicitly free the memory taken by the document with ns_xml doc free, otherwise it'll be freed automatically after execution of the script. $doc_version is the version of the XML doc; if not specified it'll be "1.0".
set xml_string [ns_xml doc render $doc_id]
Generate XML from the in-memory representation of the document.
set node_id [ns_xml doc new_root $doc_id $node_name $node_content]
Create a root node for a document.
set node_id [ns_xml node new_sibling $node_id $name $content]
Create a new sibling of a given node.
set node_id [ns_xml node new_child $node_id $name $content]
Create a child of a given node.
ns_xml node setcontent $node_id $content
Set the content of a given node.
ns_xml node setattr $node_id $attr_name $value
Set the value of an attribute in a given node.
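
To see how a few of these commands fit together, here is a small sketch that parses a string with -persist, reads the root node's name and one attribute, and frees the document explicitly. It uses only the commands listed above and runs only inside AOLserver with the nsxml module loaded. The proc name root_name_and_attr is made up for illustration.

```tcl
# Sketch only: needs AOLserver with the nsxml module loaded.
# Parses $xml, returns the list {root_name attribute_value} and frees the
# document itself, because -persist turns off the automatic cleanup.
proc root_name_and_attr { xml attr_name } {
    set doc_id [ns_xml parse -persist $xml]
    # catch, so that the document is freed even if a node command fails
    set failed [catch {
        set root_id [ns_xml doc root $doc_id]
        set name [ns_xml node name $root_id]
        set value [ns_xml node getattr $root_id $attr_name]
    } err]
    ns_xml doc free $doc_id
    if { $failed } {
        error $err
    }
    return [list $name $value]
}
```

A call could look like root_name_and_attr {<test version="1.0">hi</test>} version. Without -persist the explicit ns_xml doc free would not be needed, since the document is freed when the script exits.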

A simple example

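An educational and simple thing to do is to parse a document and print out its tree structure. Stripped to bare bones the process is: parse the document with ns_xml parse to get a document id, get the root node with ns_xml doc root, then walk the tree recursively with ns_xml node children, printing the name, type and content of each node. If you provide the -persist flag to ns_xml parse you'll have to explicitly call ns_xml doc free $doc_id to free the memory associated with the document, otherwise it will be freed automatically after execution of the script.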

In code it could look like this:

proc dump_node {node_id} {
    set name [ns_xml node name $node_id]
    set type [ns_xml node type $node_id]
    set content [ns_xml node getcontent $node_id]
    ns_write "<li>"
    ns_write "node id=$node_id name=$name type=$type"
    # show the content for every node type except "attribute"
    if { [string compare $type "attribute"] != 0 } {
        ns_write " content=$content\n"
    }
}

# print each node in $children, then recurse into its own children
proc dump_tree_rec {children} {
    ns_write "<ul>\n"
    foreach child_id $children {
        dump_node $child_id
        set new_children [ns_xml node children $child_id]
        if { [llength $new_children] > 0 } {
            dump_tree_rec $new_children
        }
    }
    ns_write "</ul>\n"
}

proc dump_tree {node_id} {
    dump_tree_rec [list $node_id]
}

proc dump_doc {doc_id} {
    ns_write "doc id=$doc_id<br>\n"
    set root_id [ns_xml doc root $doc_id]
    dump_tree $root_id
}

# braces keep the double quotes inside the document from ending the string
set xml_doc {<test version="1.0">this is a
<blind>test</blind> of xml</test>}
set doc_id [ns_xml parse $xml_doc]
dump_doc $doc_id    
The ns_xml parse command throws an error if the XML document cannot be parsed (e.g., it is not well formed), so in production code we should catch the error and display a meaningful message, e.g.:
if { [catch {set doc_id [ns_xml parse $xml_doc]} err] } {
    ns_write "There was an error parsing the following XML document: "
    ns_write [ns_quotehtml $xml_doc]
    ns_write "Error message is:"
    ns_write [ns_quotehtml $err]
    ns_write "\n"
    return
}
Code like this takes more time to write but some day it may save a lot of debugging time (and a day like this always comes).

See how the code works in practice [external site running AOLserver] and get the full source [included in Linux Gazette]. It's a bit more complex than the above snippet. You can see the structure of an arbitrary XML document by typing it in the provided text area. The script also shows how to parse form data and has more robust error handling.

Real life example

XML is better than other similar formats because it is a standard, it has gained wide acceptance and its usage is growing rapidly. One possible use of XML is communication between web sites (web services). The simplest scenario is one web server grabbing information in XML format from another web server. A popular example of such communication is aggregation of headlines, e.g., if you go to freshmeat.net you'll see that they provide current headlines from linuxtoday.com. We'll do the same thing (vive l'originalite!).

In the past it could've been done in a rather distasteful way, by grabbing the whole HTML page and trying to extract the relevant information. That would be hard to program and fragile (a change in the way the HTML page is generated would most likely break such parsing).

Today a site that wants to provide headlines for others can publish this data in an easy-to-parse XML format under some URL. In our case the data is provided at http://www.linuxtoday.com/backend/linuxtoday.xml. See the format of this file (using the previously developed script).

As you can see, the XML document represents the headlines on the LinuxToday site. It is a set of stories, each story having a title, url, author etc. We know that after parsing the XML document we would like to have a way to easily extract the information. Let's use the "wishful-thinking" (in other words top-down) method of writing code advocated in Structure and Interpretation of Computer Programs (a truly great CS book). Let's assume that we've converted the XML representation into an object. To build an HTML table showing the data we need the following procedures: headlines_get_stories_count (the number of stories), headlines_get_story (the story at a given position), story_get_url and story_get_title.

For simplicity I only use URL and title but extending this to more attributes should be trivial.

Having those procedures we can generate the simplest (but rather ugly) table:

proc story_to_html_table_row { story } {
    set url [story_get_url $story]
    set title [story_get_title $story]
    return "- <a href=\"$url\"><font color=#000000>$title</font></a><br>\n"
}

# given headlines generate HTML code of the table with this data
proc headlines_to_html_table { headlines } {
    set to_return "<table border=0 cellspacing=1 cellpadding=3>"
    append to_return "<tr><td><small>"

    set stories_count [headlines_get_stories_count $headlines]
    for {set i 0} {$i < $stories_count} {incr i} {
	set story [headlines_get_story $headlines $i]
	append to_return [story_to_html_table_row $story]
    }

    append to_return "</small></td></tr></table>\n"
    return $to_return
}
Tcl doesn't give us much choice for representing this object; we'll use lists.
proc headlines_get_stories_count { headlines } {
    return [llength $headlines]
}

proc headlines_get_story { headlines story_no } {
    return [lindex $headlines $story_no]
}

proc story_get_url { story } {
    return [lindex $story 0]
}

proc story_get_title { story } {
    return [lindex $story 1]
}
Note that if we forget about purity (just for a while) we can rewrite the following part of headlines_to_html_table:
set stories_count [headlines_get_stories_count $headlines]
for {set i 0} {$i < $stories_count} {incr i} {
    set story [headlines_get_story $headlines $i]
    append to_return [story_to_html_table_row $story]
}
in a bit more terse way:
foreach story $headlines {
    append to_return [story_to_html_table_row $story]
}
Now the most important part: converting XML doc into the representation we've chosen.
# does the name of the node identified by $node_id equal $name?
proc is_node_name_p { node_id name } {
    set node_name [ns_xml node name $node_id]
    # string compare returns 0 when the two strings are identical
    return [expr {[string compare $name $node_name] == 0}]
}

# does the type of the node identified by $node_id equal $type?
proc is_node_type_p { node_id type } {
    set node_type [ns_xml node type $node_id]
    return [expr {[string compare $type $node_type] == 0}]
}

# is this a node of type "attribute"?
proc is_attribute_node_p { node_id } {
    return [is_node_type_p $node_id "attribute"]
}

# raise an error if node name is different than $name
proc error_if_node_name_not {node_id name} {
    if { ![is_node_name_p $node_id $name] } {
	set node_name [ns_xml node name $node_id]
	error "node name should be $name and not $node_name"
    }
}

# raise an error if node type is different than $type
proc error_if_node_type_not {node_id type} {
    if { ![is_node_type_p $node_id $type] } {
	set node_type [ns_xml node type $node_id]
	error "node type should be $type and not $node_type"
    }
}

# given url and title construct a story object with
# those attributes
proc define_story { url title } {
    return [list $url $title]
}

# convert a node of name "story" into an object
# that represents story
proc story_node_to_story {node_id} {
    set url ""
    set title ""
    # go through all children and extract content of url and title nodes
    set children [ns_xml node children $node_id]
    foreach node_id $children {
	# we're only interested in nodes whose name is "url" or "title"
	if { [is_attribute_node_p $node_id]} {
	    if { [is_node_name_p $node_id "url"] || [is_node_name_p $node_id "title"]} {
		set node_children [ns_xml node children $node_id]
		# those should have exactly one child node with
		# the name "text" and type "cdata_section"
		if { [llength $node_children] != 1 } {
		    set name [ns_xml node name $node_id]
		    error "$name node should only have 1 child"
		}
		set one_node_id [lindex $node_children 0]
		error_if_node_type_not $one_node_id "cdata_section"
		error_if_node_name_not $one_node_id "text"
		set txt [ns_xml node getcontent $one_node_id]
		if { [is_node_name_p $node_id "url"] } {
		    set url $txt
		}
		if { [is_node_name_p $node_id "title"]} {
		    set title $txt
		}
	    }
	}
    }
    return [define_story $url $title]
}

# convert XML doc to headlines object
proc xml_to_headlines { doc_id } {
    set headlines [list]
    set root_id [ns_xml doc root $doc_id]
    # root node should be named "linuxtoday" and of type "attribute"
    error_if_node_name_not $root_id "linuxtoday"
    error_if_node_type_not $root_id "attribute"
    set children [ns_xml node children $root_id]
    foreach node_id $children {
	# only interested in attribute type nodes whose name is "story"
	if { [is_node_name_p $node_id "story"] && [is_attribute_node_p $node_id]} {
	    set story [story_node_to_story $node_id]
	    lappend headlines $story
	}
    }
    return $headlines
}
The code is rather straightforward. We use our knowledge of the structure of the XML file. In this case we know that the root node is named linuxtoday and should have children named story. Each story node should have children named url and title etc. The previous script that dumps the general structure of the tree helped me a lot in writing this function. Note the use of the error command to abort the script if the XML doesn't look right to us.

Having an intermediate representation of the data might look excessive, given that it costs us more code and some performance, but there are very good reasons to have it. We could have written a proc xml_to_html_table that would create the HTML table directly from the XML document, but such code would be more complex, buggier and harder to modify. The separation we've made provides an abstraction that reduces complexity, which is always good. It also gives us more flexibility: we can easily imagine writing another headlines_to_html_table procedure that gives us a slightly different table.
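
To make the flexibility argument concrete, here is a sketch of a second renderer that produces an unordered list instead of a table. It never touches XML, only the accessors. The name headlines_to_html_list and the sample data are made up for this illustration. The accessors are repeated so the snippet runs on its own in a plain tclsh.

```tcl
# Accessors repeated from the article so that the sketch is self-contained
proc headlines_get_stories_count { headlines } { return [llength $headlines] }
proc headlines_get_story { headlines story_no } { return [lindex $headlines $story_no] }
proc story_get_url { story } { return [lindex $story 0] }
proc story_get_title { story } { return [lindex $story 1] }

# Hypothetical alternative renderer: an unordered list instead of a table
proc headlines_to_html_list { headlines } {
    set to_return "<ul>\n"
    set stories_count [headlines_get_stories_count $headlines]
    for {set i 0} {$i < $stories_count} {incr i} {
        set story [headlines_get_story $headlines $i]
        set url [story_get_url $story]
        set title [story_get_title $story]
        append to_return "<li><a href=\"$url\">$title</a></li>\n"
    }
    append to_return "</ul>\n"
    return $to_return
}

# Made-up data, just to exercise the procedure
set sample [list [list "http://example.com/1" "First story"]]
lappend sample [list "http://example.com/2" "Second story"]
puts [headlines_to_html_list $sample]
```

Nothing in xml_to_headlines has to change for this to work, which is the point of keeping the two steps apart.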

See how it works in practice [external site running AOLserver] and get the source [included in Linux Gazette]. It should produce something like this:

linuxtoday
- Kernel Cousin Debian Hurd #73 By Paul Emsley And Zack Brown
- Zope 2.2.5 b1 released
- O'Reilly Network: Insecurities in a Nutshell: SAMBA, pine, ircd, and More
- ZDNet: Linux Laptop SuperGuide
- ComputerWorld: Think tank warns that Microsoft hack could pose national security risk

One thing missing in this code is caching. As it is, it will grab the XML file from other people's server every time it is invoked. This is not nice. It would be fairly easy to add logic to cache the XML file (or its in-memory representation) and only fetch a new version if, say, 1 hour has passed since it was last retrieved.
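
The caching logic could be sketched roughly as follows. This is not the code from the accompanying source: cached_fetch, xml_cache and the idea of passing the fetching command as an argument are all my assumptions, chosen so that the logic can be tried in a plain tclsh. The cache lives in a global array, so it is local to one interpreter. As far as I know AOLserver keeps a separate interpreter per thread, so a server-wide cache would need shared storage instead.

```tcl
# Per-interpreter cache: url -> {fetch_time body}
array set xml_cache {}

# Hypothetical helper. fetch_cmd is a command that takes a URL and returns
# the document body; max_age is in seconds. now defaults to the current
# time and is a parameter only to make the logic easy to exercise.
proc cached_fetch { url fetch_cmd max_age {now ""} } {
    global xml_cache
    if { [string length $now] == 0 } {
        set now [clock seconds]
    }
    if { [info exists xml_cache($url)] } {
        set fetched_at [lindex $xml_cache($url) 0]
        if { $now - $fetched_at < $max_age } {
            # still fresh: reuse the stored copy
            return [lindex $xml_cache($url) 1]
        }
    }
    # missing or stale: fetch again and remember when we did it
    set body [eval $fetch_cmd [list $url]]
    set xml_cache($url) [list $now $body]
    return $body
}
```

Inside AOLserver the call might look like cached_fetch $url ns_httpget 3600, assuming ns_httpget is available in your version. If the fetch raises an error nothing is stored, so the next request simply tries again.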

Conclusion about XML as a data exchange language

Is this data exchange between web servers a novel idea? No. You could do everything described here with the first generation of web servers. You would probably use different technologies (C code running inside a web server or a CGI script instead of an embedded scripting language; some ad-hoc text or binary format instead of XML) but the idea would be the same: one web server acts as a client, grabs the data from the other server using the HTTP protocol and does something useful with it. The other web server acts as a server providing data for others. It's just another implementation of the client-server paradigm. It's nothing new.

It is just a sign that web programming is maturing. After 5+ years we've finally solved most of the problems with presenting static HTML pages or generating dynamic web pages from data kept on the server (e.g., in a database). Now we enter the times of providing services and data for other web sites. Today the state of the art is pretty much limited to exchanging headlines and similar trivia, but the possibilities are bigger, ranging from simple things like providing stock quotes or dictionary definitions to executing complex (e.g., financial) transactions following an agreed-upon protocol.

Conclusion about XML parsing in AOLserver

Besides parsing, you can also create and manipulate XML documents in memory and render them back to XML text. This is not covered in this article, but it's so straightforward that you should be able to do it just by looking at the API.
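
As a rough illustration only, going the other way (from our list of stories back to XML) might look like the sketch below. It uses only commands from the quick reference and needs AOLserver with nsxml loaded. I'm assuming that an empty string is acceptable as the content of a container node. The proc name headlines_to_xml is made up.

```tcl
# Sketch only: needs AOLserver with the nsxml module loaded.
# Builds a document with a linuxtoday root and one story node per entry
# of $headlines (a list of {url title} pairs), then renders it as XML text.
proc headlines_to_xml { headlines } {
    # no -persist: the document is freed automatically after the script
    set doc_id [ns_xml doc create "1.0"]
    set root_id [ns_xml doc new_root $doc_id "linuxtoday" ""]
    foreach story $headlines {
        set story_id [ns_xml node new_child $root_id "story" ""]
        ns_xml node new_child $story_id "url" [lindex $story 0]
        ns_xml node new_child $story_id "title" [lindex $story 1]
    }
    return [ns_xml doc render $doc_id]
}
```

The result could be sent back with ns_write, which is all it takes to become a headline provider yourself.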

ns_xml module provides basics of XML processing. Although you can do quite a bit with it one could wish to do more. Things that are obviously missing:

An alternative approach to ns_xml module would be to:

Links

If you have comments or suggestions, send them in.


Copyright © 2001, Irving Washington.
Copying license http://www.linuxgazette.com/copying.html
Published in Issue 63 of Linux Gazette, Mid-February (EXTRA) 2001
