Tag: code
Generating HTML5 using XSLT
by Mike on Jan.28, 2011, under Technology, Tutorials

Recently, I have been updating some of my HTML generation tools to output valid HTML5, rather than the XHTML 1.0 standard I have been using for the last few years. The main advantage from my perspective is the ability to use the more semantic block elements, such as the nav, section and article elements.
In general this is a fairly straightforward task, as I am generating clean XHTML using XSLT and my template library works pretty well, but I ran into some problems whilst validating the output using the W3C Validator.
The first issue is to sort the DOCTYPE out. The XHTML doctype looks like this:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This is easy to generate in XSLT using the following output element.
<xsl:output encoding="UTF-8" indent="yes" method="xml"
omit-xml-declaration="yes"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" />
This unfortunately forces the document to validate against the XHTML 1.0 specification which does not include all the lovely new semantic elements – which means that my new documents are suddenly invalid!
We need to generate:
<!DOCTYPE html>
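which is awkward to do using XSLT, because in XSLT 1.0 the xsl:output element only emits a DOCTYPE that carries a public or system identifier. I have read a number of articles that suggest you output the element as text; however, this is extremely ugly and, as it turns out, incorrect.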
The correct XSLT incantation is:
<xsl:output
method="xml"
doctype-system="about:legacy-compat"
encoding="UTF-8"
indent="yes" />
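This produces <!DOCTYPE html SYSTEM "about:legacy-compat">. The HTML5 specification allows this form of the doctype specifically for generators, such as XSLT processors, that cannot output the short <!DOCTYPE html>; the about:legacy-compat identifier is a dummy that is never actually fetched.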
Now the W3C validator will happily validate against the HTML5 specification rather than the XHTML 1.0 specification.
Using libxml2 and python to scrape content from a website
by Mike on Feb.06, 2007, under Knowledge Base, Technology, Tutorials
This is a practical example, using Libxml2 to parse a real-world Web page. I chose the TV listings pages from the Guardian Website, as they are the type of page you are likely to want to scrape for useful data.
Additionally, the Guardian TV listings contain a couple of very typical HTML errors. The listings are contained within a table, and some of the rows in the table are not closed.
The Libxml2 parser recovers from these errors by closing the tags at the end of the page and then continuing parsing from the next useable opening tag. This leaves us with a tr tag containing duplicated content from later in the document, which this code handles in a simple way by splitting the broken content on the next opening tag, and using another document instance to close the unterminated tags.
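The splitting step can be tried out without the parser at all. The sketch below is only an illustration: the serialized row is a made-up string (not real Guardian markup) and the helper name first_row_fragment is hypothetical. It shows how cutting at the next opening tr tag isolates the unterminated row before it is handed to a second document.

```python
def first_row_fragment(serialized_row):
    """Return the content of a broken row up to the next opening <tr> tag."""
    # The serialized row starts with '<tr>', so splitting gives
    # ['', content of the broken row, content nested from later rows, ...].
    pieces = serialized_row.split('<tr>')
    return pieces[1]

# Hypothetical result of row.serialize() for a row whose closing tag was
# missing: the parser has nested the following row inside it.
broken = ('<tr><td>6.00am</td><td>Breakfast'
          '<tr><td>9.15am</td><td>News</td></tr></td></tr>')

fragment = first_row_fragment(broken)
# Wrapping the fragment lets a second parse close the unterminated tags.
wrapped = '<html>' + fragment + '</html>'
print(wrapped)
```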
TV Listings
import libxml2, os, sys, datetime

# This script reads the TV listings from the Guardian TV listings website
# (http://www.guardian.co.uk/TV/)

parse_options = libxml2.HTML_PARSE_RECOVER + \
                libxml2.HTML_PARSE_NOERROR + \
                libxml2.HTML_PARSE_NOWARNING

today = datetime.date.today()
tomorrow = today + datetime.timedelta(days=1)

class Channel:
    def __init__(self, name):
        self.name = name
        self.entries = []

class ListingEntry:
    def __init__(self, when):
        self.when = when
        self.title = ''
        self.content = ''

def newTime(node, entries):
    timeStr = str(node.content).strip()
    # the Guardian listings show times like 6.00am or 5.45pm, we need to
    # turn this into a more useable form. A Python datetime object will do
    # just fine. It is also worth noting that the listings run from 6.00am
    # to 6.00am, so we need to account for a date boundary at midnight.
    if (timeStr.count('am') > 0) or (timeStr.count('pm') > 0):
        t = timeStr.split('.')
        hour = int(t[0])
        minute = int(t[1][0:2])
        ampm = t[1][2:4]
        if (ampm == 'pm') and (hour < 12):
            hour += 12
        elif (ampm == 'am') and (hour == 12):
            # 12.30am is half past midnight, not half past noon
            hour = 0
        if (hour < 6):
            date = tomorrow
        else:
            date = today
        when = datetime.datetime(date.year, date.month, date.day, hour, minute)
        newEntry = ListingEntry(when)
        entries.append(newEntry)

def newProgramme(node, entries):
    # luckily for us, all the Guardian TV entries are wrapped in <font> tags
    # which is bad for accessibility, but gives us a known node to grab
    items = node.xpathEval('.//font/node()')
    for item in items:
        if not item.isBlankNode():
            if (item.type == 'text') or (item.type == 'element'):
                if entries[-1].title == '':
                    entries[-1].title += str(item.content).strip()
                else:
                    entries[-1].content += str(item.content).strip() + '\n'

def processSourceHTML(url, entries):
    doc = libxml2.htmlReadFile(url, None, parse_options)
    listingTable = doc.xpathEval('//table')[6]
    rows = listingTable.xpathEval('.//tr')
    for row in rows:
        if len(row.xpathEval('.//tr')) > 0:
            # This row is broken, tr tags should not contain more tr tags!
            # it probably is missing one or more closing tags and therefore
            # needs special handling.
            fixup = row.serialize()
            # named fragments so that the rows list being looped over
            # is not overwritten
            fragments = fixup.split('<tr>')
            # Here we load the broken HTML fragment into another document
            # to extract whatever we can from it.
            fixDoc = libxml2.htmlReadDoc('<html>' + fragments[1] + '</html>',
                                         '', None, parse_options)
            cells = fixDoc.xpathEval('//td')
            for cell in cells:
                if cell.prev is None:
                    # if the cell has no previous sibling then it is the first
                    # cell in the row, e.g. the one containing the time
                    newTime(cell, entries)
                else:
                    newProgramme(cell, entries)
            fixDoc.freeDoc()
        else:
            cells = row.xpathEval('td')
            for cell in cells:
                if cell.prev is None:
                    # if the cell has no previous sibling then it is the first
                    # cell in the row, e.g. the one containing the time
                    newTime(cell, entries)
                else:
                    newProgramme(cell, entries)
    doc.freeDoc()

channels = []

# We could do more here from an automation perspective - spider the list
# of channels, automatically populating the channel names etc...
# but this is left as an exercise for the reader
channels.append( Channel('BBC1') )
processSourceHTML( \
    'http://www.guardian.co.uk/TV/bbc1s_meridian.html', \
    channels[-1].entries)

channels.append( Channel('BBC2') )
processSourceHTML( \
    'http://www.guardian.co.uk/TV/bbc2s_meridian.html', \
    channels[-1].entries)

channels.append( Channel('ITV - Meridian') )
processSourceHTML( \
    'http://www.guardian.co.uk/TV/meridian_meridian.html', \
    channels[-1].entries)

channels.append( Channel('Channel 4') )
processSourceHTML( \
    'http://www.guardian.co.uk/TV/ch4_meridian.html', \
    channels[-1].entries)

for channel in channels:
    print channel.name
    for entry in channel.entries:
        print "----"
        print entry.when
        print entry.title
        print "----"
        print entry.content
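The time handling described in the comments of newTime is easy to check in isolation, with no page to scrape. What follows is a minimal sketch, assuming the same 6.00am / 5.45pm format; the function name parse_listing_time is hypothetical, and treating 12.30am as half past midnight is my assumption about what the listings mean.

```python
import datetime

def parse_listing_time(time_str, today):
    """Convert a listing time such as '6.00am' or '5.45pm' to a datetime.

    Listings run from 6.00am to 6.00am, so any time before 6am is
    assumed to belong to the following day.
    """
    hour_part, rest = time_str.strip().split('.')
    hour = int(hour_part)
    minute = int(rest[0:2])
    ampm = rest[2:4]
    if ampm == 'pm' and hour < 12:
        hour += 12          # 5.45pm becomes 17:45
    elif ampm == 'am' and hour == 12:
        hour = 0            # 12.30am becomes 00:30
    date = today
    if hour < 6:
        # early-morning programmes sit on the far side of midnight
        date = today + datetime.timedelta(days=1)
    return datetime.datetime(date.year, date.month, date.day, hour, minute)

day = datetime.date(2007, 2, 6)
print(parse_listing_time('5.45pm', day))
print(parse_listing_time('12.30am', day))
```

Passing the date in as an argument, rather than reading a global, makes the midnight boundary straightforward to test.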
Getting started with Libxml2 and Python (Part 1)
by Mike on Feb.06, 2007, under Knowledge Base, Technology, Tutorials
This article is an import from my old site – the original was published on 6th Feb 2007.
Getting to grips with Libxml2 and Python can be a frustrating experience,
particularly as in-depth, accurate Python documentation is hard to find
on the Web.
Many Python developers seem to dislike the Libxml2 bindings, as they are ‘un-Pythonic’
and much too C-like. This, however, misses the point of Libxml2: the
library is portable, mature, extremely full-featured and *very* fast.
In the process of writing this tutorial, I hung out in the #xml channel on
irc.gnome.org, and subscribed to the xml@gnome.org mailing list – I
was given a lot of help when things weren’t obvious! Although there’s not a massive
amount of activity on IRC, or in the mailing list on a daily basis, I would
definitely recommend spending some time browsing the archive – or using Google
to search it when you have questions. Additionally, I have found the people in
the Libxml2 community very helpful.
Manipulating XML using Libxml2 is fairly straightforward when you have a couple
of working examples, however that tends to be the problem in Python. Finding
working examples tends to be a bit of a hit-and-miss affair.
The first place to look is in the examples folder in the documentation installed
with your release (/usr/share/doc/libxml2-python-2.6.27/examples on my machine).
TODO: where are the examples on a number of distributions/platforms?
Also, take a moment to scan through libxml2.py itself – this is the Python wrapper and
is a good place to look if you are hunting for a particular function. There
is plenty of information in the wrapper, as all the docstrings have been
populated, so you can always get information with
print libxml2.parseFile.__doc__
for any particular function.
Also remember that you can list the available methods for any Python object by
using the dir function. The most immediately useful objects are xmlCore, xmlNode
and xmlDoc, so
dir(libxml2.xmlCore)
is your friend when working out what functions are available to you.
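With over 200 names coming back from dir, it helps to filter the list. The helper below is a small sketch of that idea; find_methods is a hypothetical name, and it is demonstrated on the built-in str type purely so that it runs without Libxml2 installed.

```python
def find_methods(obj, fragment):
    """List the public attribute names of obj whose name contains fragment."""
    fragment = fragment.lower()
    return [name for name in dir(obj)
            if not name.startswith('_') and fragment in name.lower()]

# Demonstrated on a built-in type; any object or class can be passed in.
print(find_methods(str, 'strip'))
```

With the bindings installed, a call such as find_methods(libxml2.xmlCore, 'xpath') should narrow the list down to the XPath-related helpers.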
I’m going to assume that you know a bit about XML, at least enough to recognise
an XML document when you see one, and hopefully enough about Python to know
where to find the documentation!
Installing Libxml2
TODO: installation examples for a number of distros/platforms.
Loading a document
The first thing you want to do in XML will be to load a document of some sort.
As a new Libxml2 user, this is where our confusion starts! It is worth remembering
that in general, the Python bindings are automatically generated – therefore
there is an equivalent Python function for every C function, and sometimes this
can lead to unnecessary, or apparently duplicated Python functions.
The library contains a number of different functions we can use to load an XML
document:
parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory,
recoverDoc and recoverFile
All of these functions return an xmlDoc object. Examples for using each of these
follow:
parseDoc(cur) – load an XML document from memory (a string)
doc = libxml2.parseDoc("""<?xml version="1.0"?>
<root>Hello world!</root>""")
parseMemory(buffer, size) – load an XML document from memory
doc = libxml2.parseMemory(xml, len(xml))
This function performs exactly the same job as parseDoc from a Python perspective.
parseFile(filename) – load an XML document from a file
doc = libxml2.parseFile('test.xml')
readDoc(cur, URL, encoding, options) – load an XML document from memory (a string)
This version of the function allows you to specify options on a per-document
basis. The parseDoc version uses the parser defaults (in practice, the
parser global settings, which can also be modified using global functions).
In most cases,
doc = libxml2.readDoc('<foo/>',None,None,0)
will be equivalent to
doc = libxml2.parseDoc('<foo/>')
When using XSL, I have found it better to force entities
to be resolved before running the transform, in which case it is useful to
use the following:
doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)
readFd(fd, URL, encoding, options) – load an XML document from a file descriptor
readFile(filename, encoding, options) – load an XML document from a file allowing
the specification of per-document options.
readMemory(buffer, size, URL, encoding, options) – for Python, equivalent to
using readDoc
recoverDoc(cur) – this is equivalent to readDoc, except that even broken XML
will result in a valid XML tree being created.
doc = libxml2.recoverDoc('<foo><broken></foo>')
will raise a parser error, but after the error has been handled, doc will
contain:
<?xml version="1.0"?>
<foo><broken/></foo>
recoverFile(filename) – same as recoverDoc, but for files.
In the simplest case, to load a file from disk you can do:
doc = libxml2.parseFile( 'test.xml' )
Managing your memory
Ugh, nasty memory management. Isn’t that why we’re using Python, to avoid all that
stuff?
The Libxml2 bindings do not clean up the memory they use automatically, so when
you finish working with your xmlDoc object, you need to remember to call freeDoc.
The same is true of XPath evaluation contexts created with xpathNewContext: you
must call xpathFreeContext on them.
OK, so what we have now is something like the following:
doc = libxml2.parseFile( 'test.xml' )
# Do some stuff with the document here!
doc.freeDoc()
It doesn’t matter which method you use to create your xmlDoc object, as each of the
functions returns the same thing, so just remember to call freeDoc on it when you
are done and all will be well.
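If remembering to call freeDoc feels error-prone, one option is to wrap the document in a context manager so that it is freed even when an exception is raised. This is only a sketch of that idea, not something the bindings provide: freeing is a hypothetical helper, and FakeDoc is a stand-in for an xmlDoc so that the example runs without Libxml2.

```python
from contextlib import contextmanager

@contextmanager
def freeing(doc):
    """Yield doc, then call doc.freeDoc() even if an exception is raised."""
    try:
        yield doc
    finally:
        doc.freeDoc()

class FakeDoc(object):
    """Stand-in for an xmlDoc, used here only to show when freeDoc runs."""
    def __init__(self):
        self.freed = False

    def freeDoc(self):
        self.freed = True

doc = FakeDoc()
with freeing(doc) as d:
    pass  # Do some stuff with the document here!
print(doc.freed)
```

With the real bindings the same helper would be used as: with freeing(libxml2.parseFile('test.xml')) as doc.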
There, that wasn’t so hard was it?
Working with the document
Now that we have a working document, and know how to dispose of it when we’re done,
it is time to look at a number of common XML operations and see how we can do
those using Libxml2 and Python.
Elements
The xmlDoc object has a large number of methods. As well as its own collection,
it inherits from xmlNode, which inherits from xmlCore; this gives you over 200
available methods to read up on! This is fairly daunting when you can’t find an
example that shows you how to perform simple tasks, but don’t worry: in practice
we can get by in most situations with a small fraction of these.
All well-formed XML documents contain a single root element, which contains all the
other nodes.
You can get a reference to the root element using getRootElement on the document
object. The root element is an xmlNode object, just like all other nodes in the
document. Working with nodes is fairly straightforward:
>>> import libxml2
>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
>>> root = doc.getRootElement()
>>> print root.name
foo
>>> print root.content
Hello world.
>>> root.setProp('bar', 'an attribute')
<xmlAttr (bar) object at 0x13c00d0>
>>> root.prop('bar')
'an attribute'
>>> print root.serialize()
<foo bar="an attribute">Hello world.</foo>
>>> doc.freeDoc()
The serialize method can be called on a single node, or on the document, and
returns a string representation of whichever it was called on.
Navigating through the document is not much more difficult – we can use the node
properties (from the xmlCore ancestor object) to find the child nodes:
child = root.children  # the children property returns the FIRST child of a node
while child is not None:
    if child.type == "element":
        # do something with the child node
        print child.name
    child = child.next
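The children/next walk is a linked-list traversal, and it can be wrapped in a generator so that the calling code reads like an ordinary for loop. The sketch below assumes only that nodes have children, next and type attributes as described above; element_children is a hypothetical helper, and FakeNode is a stand-in so the example runs without Libxml2.

```python
def element_children(node):
    """Yield each direct child of node whose type is 'element'."""
    child = node.children          # first child, or None
    while child is not None:
        if child.type == 'element':
            yield child
        child = child.next         # next sibling, or None

class FakeNode(object):
    """Stand-in for an xmlNode with just the attributes the walk needs."""
    def __init__(self, name, type='element'):
        self.name = name
        self.type = type
        self.children = None
        self.next = None

# Build <foo><a/> whitespace <b/></foo> by hand.
root = FakeNode('foo')
a, ws, b = FakeNode('a'), FakeNode('text', 'text'), FakeNode('b')
root.children = a
a.next = ws
ws.next = b

print([n.name for n in element_children(root)])
```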
Accessing the attributes of a node is possible in a similar way:
import libxml2
doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
root = doc.getRootElement()
for property in root.properties:
    if property.type=='attribute':
        # do something with the attributes
        print property.name
        print property.content
doc.freeDoc()
Notice that in both looping through the children, and looping through the
properties there is a test for the type of the node. This is because in most
documents, there is additional whitespace that shows up as well as the specific
node types we are interested in.
XPath
Navigating a document in this manner is straightforward, but tedious and requires
accessing every node in the document until you get to the specific one you need.
More often, you want to retrieve a set of nodes or a single node matching some
specific criteria. This is where XPath comes in, and Libxml2 has full support
for XPath.
XPath queries can be run against the document or a specific element in the
document, but in either case the procedure is the same.
The xmlsoft.org Python page suggests the following:
doc = libxml2.parseFile("test.xml")
ctxt = doc.xpathNewContext()
result = ctxt.xpathEval("//*")
# do something with the result
doc.freeDoc()
ctxt.xpathFreeContext()
which involves creating an XPath context, running a query against it and then
freeing the context when finished. If you have a lot of queries to run, then
this is the best way to work, as the context can be re-used for each query.
In practice, the xmlCore object provides a helper function which wraps this up
for you. For single queries running xpathEval directly on the node will suffice,
just be aware that each query creates and destroys its own context, which is
going to be slower than the above implementation.
An XPath query will return a typed result, corresponding to the four basic types mentioned in the
introduction section of the XPath Specification (node-set, boolean, number and string). Where the result is a
node-set, this will be a Python list of nodes, which makes it easy to perform an operation on many nodes at once.
import libxml2
doc = libxml2.parseFile('test.xml')
# select every element in the document
result = doc.xpathEval('//*')
for node in result:
    print node.name
doc.freeDoc()
Apart from the call to freeDoc, I can’t see how much more Pythonic it could be?
Namespaces
Dealing with XML Namespaces is possible as well.
Here we create an XML document and declare a namespace on the root element.
import libxml2
doc = libxml2.newDoc('1.0')
root = libxml2.newNode('foo')
doc.setRootElement(root)
#Register the toto namespace
ns = root.newNs('http://toto.org', 'toto')
root.setNs(ns) #put this node in the namespace
#Add to the root node a property in this namespace
root.setNsProp(ns, 'Id', str(12345))
print doc.serialize()
This produces:
<?xml version="1.0"?>
<toto:foo xmlns:toto="http://toto.org" toto:Id="12345"/>
Namespaces can also be dealt with in XPath, provided you register the namespace with the XPath context object.
import libxml2
doc = libxml2.parseDoc("""
<foo xmlns:MYNS="http://somewhere.fr">
<MYNS:a id="a1"/>
<a id="a2"/>
</foo>
""")
ctxt = doc.xpathNewContext()
#you can choose any name, the URI is the namespace identifier
ctxt.xpathRegisterNs("OtherName", "http://somewhere.fr")
# select the 'a' node in the somewhere.fr namespace
result = ctxt.xpathEval('//OtherName:a')
for node in result:
    print node.name, "id=%s"%node.prop("id") #will display "a id=a1"
ctxt.xpathFreeContext()
doc.freeDoc()
If the document declares a default namespace (one with no prefix), you will still have to register it with the XPath context under a prefix of your choice before you can use it in an XPath expression.
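The default-namespace rule is not specific to Libxml2, so it can be demonstrated with the standard library's xml.etree.ElementTree. That module is used here purely as a stand-in so that the sketch runs without Libxml2; with Libxml2 the equivalent step is the xpathRegisterNs call. The document and the prefix are made up for the example.

```python
import xml.etree.ElementTree as ET

# Here the namespace is the default one: the 'a' element carries no prefix.
source = '<foo xmlns="http://somewhere.fr"><a id="a1"/></foo>'
root = ET.fromstring(source)

# An unprefixed query looks for 'a' in no namespace and finds nothing.
unprefixed = root.findall('.//a')
print(unprefixed)

# Binding any prefix to the namespace URI makes the element reachable.
namespaces = {'OtherName': 'http://somewhere.fr'}
found = root.findall('.//OtherName:a', namespaces)
print([el.get('id') for el in found])
```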
Writing to a file
To write the contents of your XML document to a file, just use the saveTo method:
f = open('output.xml','w')
doc.saveTo(f)
f.close()
The saveTo method is also part of xmlCore, so you can use it to save the contents
of just a single node and its children as well as the whole document.
It is also worth noting that both saveTo, and serialize can accept an encoding parameter, which allows the conversion of a document from one encoding to another. Libxml2 itself uses UTF-8 internally, and will convert the document when loading and serialising.
>>> doc = libxml2.parseDoc("""<root><foo>hello</foo></root>""")
>>> xml = doc.serialize()
>>> print xml
<?xml version="1.0"?>
<root><foo>hello</foo></root>
>>> xml = doc.serialize("iso-8859-1")
>>> print xml
<?xml version="1.0" encoding="iso-8859-1"?>
<root><foo>hello</foo></root>
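With plain ASCII content such as "hello" the two serializations differ only in the XML declaration. The effect of an encoding conversion shows up once the text contains non-ASCII characters, and it can be seen using Python strings alone. A small sketch, with a made-up sample word:

```python
text = u'caf\u00e9'                        # four characters, one non-ASCII

utf8_bytes = text.encode('utf-8')          # the encoding Libxml2 uses internally
latin1_bytes = text.encode('iso-8859-1')   # the encoding requested on output

# The accented letter takes two bytes in UTF-8 but one in ISO-8859-1.
print(len(utf8_bytes))
print(len(latin1_bytes))

# Decoding each byte string with its own encoding gives back the same text.
print(utf8_bytes.decode('utf-8') == latin1_bytes.decode('iso-8859-1'))
```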
Modifying documents
To add a new node to a document, first we must create the node and then add it
as a child of the element it belongs to.
import libxml2
doc = libxml2.parseDoc('<foo/>')
root = doc.getRootElement()
newNode = libxml2.newNode('bar')
root.addChild(newNode)
At this stage, our document contains
<?xml version="1.0"?>
<foo><bar/></foo>
Using the setContent method of newNode, we can give the element some text:
newNode.setContent('Hello')
We can append some content to our element by calling addContent,
newNode.addContent(' world')
which gives us
<?xml version="1.0"?>
<foo><bar>Hello world</bar></foo>
Creating or setting an attribute is easy too: we use the setProp method.
newNode.setProp('attribute', 'the value')
If the attribute doesn’t exist, it will be created; otherwise it will just have
its content changed.
Adding nodes at a particular location in the hierarchy is possible using
addNextSibling, or addPrevSibling. These operate in the same way as addChild,
except they operate on the node you wish to add next to, rather than to the
parent.
sibling = libxml2.newNode('bar2')
newNode.addPrevSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar></foo>
whereas
sibling = libxml2.newNode('bar2')
newNode.addNextSibling(sibling)
gives
<?xml version="1.0"?>
<foo><bar attribute="the value">Hello world</bar><bar2/></foo>
To insert text into the document, you create a text node with some content and
add it in the same way
text = libxml2.newText('some text\n')
newNode.addNextSibling(text)
which leaves us with
<?xml version="1.0"?>
<foo><bar2/><bar attribute="the value">Hello world</bar>some text
</foo>
To create content and nodes, the useful Libxml2 helper functions are newComment,
newText and newNode. You can also create a new node by copying one that already
exists. The xmlNode object has copyNode and copyProp methods which can be useful
here.
To add these new nodes into a document, you need to use one of the following
methods (directly on nodes rather than on the document), addChild, addContent,
addNextSibling, addPrevSibling.
XSLT
Libxml2 has a companion library called libxslt which provides support for
XSL Transformations. I find the following example provides most of the
useful information for a Python coder:
import libxml2
import libxslt

def runTransform(xmlFile, xslFile):
    out = ''
    sourcedoc = libxml2.parseFile(xmlFile)
    styledoc = libxml2.parseFile(xslFile)
    style = libxslt.parseStylesheetDoc(styledoc)
    result = style.applyStylesheet(sourcedoc, None)
    out = style.saveResultToString(result)
    # freeing the stylesheet also frees styledoc
    style.freeStylesheet()
    result.freeDoc()
    sourcedoc.freeDoc()
    return out
Notice that there are three documents involved: the source, the stylesheet and
the result. Each needs to be explicitly freed; the stylesheet document is freed
as part of freeStylesheet. The starting point
for documentation can be found here, http://xmlsoft.org/XSLT/python.html.
Libxml2 and HTML
If you have spent any time poking around libxml2.py, you will probably have
noticed a number of functions that start with html. This is because Libxml2 has
an HTML parser built in that does a pretty good job of loading real world
(in other words horribly broken) HTML documents. You can then use the features
we have previously discussed to read or modify the HTML.
The following example will load pretty much any HTML file into an xmlDoc object
parse_options = libxml2.HTML_PARSE_RECOVER + \
                libxml2.HTML_PARSE_NOERROR + \
                libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(html, '', None, parse_options)
Here is a more complete example, which extracts all the links from the Guardian
newspaper Website home page and prints the href attribute.
import urllib2
import libxml2
# Load the page into a string
f = urllib2.urlopen('http://www.guardian.co.uk')
html = f.read()
f.close()
parse_options = libxml2.HTML_PARSE_RECOVER + \
libxml2.HTML_PARSE_NOERROR + \
libxml2.HTML_PARSE_NOWARNING
doc = libxml2.htmlReadDoc(html,'',None,parse_options)
links = doc.xpathEval('//a')
for link in links:
    href = link.xpathEval('attribute::href')
    if len(href) > 0:
        href = href[0].content
        print href
doc.freeDoc()
For a more comprehensive example, see Using libxml2 and python to scrape content from a website.
Schema
One may validate an XML instance against a W3C schema, as shown below:
# inspired from the test suite file "xstc/xstc.py"
# thanks to Kasimier Buchcik
#
import libxml2
ctxt = libxml2.schemaNewParserCtxt("my-schema.wxs")
schema = ctxt.schemaParse()
del ctxt
validationCtxt = schema.schemaNewValidCtxt()
doc = libxml2.parseFile("test.xml")
#instance_Err = validationCtxt.schemaValidateFile(filePath, 0)
instance_Err = validationCtxt.schemaValidateDoc(doc)
del validationCtxt
del schema
doc.freeDoc()
if instance_Err != 0:
    print "VALIDATION FAILED"
else:
    print "VALIDATED"
Known Problems
Node equality Problem
The usual equality test (==) does not always work. Look at this:
>>> import libxml2
>>> doc = libxml2.parseDoc('<foo/>')
>>> root1 = doc.getRootElement()
>>> root2 = doc.getRootElement()
>>> root1 == root2
False
(note: This issue affects earlier builds of Libxml2 for Python. It is referred to in http://bugzilla.gnome.org/show_bug.cgi?id=345779 and appears to be resolved in current builds)
Using libxml2-2.6.27, this produces the expected result.
>>> import libxml2
>>> doc = libxml2.parseDoc('<foo/>')
>>> root1 = doc.getRootElement()
>>> root2 = doc.getRootElement()
>>> root1 == root2
True