Tag: python

Accurate Postcode data for Great Britain and Northern Ireland

by on Jun.06, 2011, under Knowledge Base, Technology


Getting accurate postcode data for use in my programs has proved to be an interesting technical challenge, despite the availability of free services such as those from Google.

I needed a UK postcode to latitude/longitude conversion tool that would work offline, which made the Google APIs completely useless.

The Ordnance Survey publishes a free dataset under the moniker of “Code-Point Open”. However, it is missing some data (notably Northern Ireland), so it can’t really be called complete, and it gives locations as Ordnance Survey grid references rather than latitude/longitude.

If you need to do this, then hopefully the tools here will be of some help.

First we need to grab the Ordnance Survey Code-point open dataset. They don’t provide a straight link, so you will need to register to actually get access to the data. Once downloaded and extracted you will have a folder full of CSV files.

Secondly, we need to convert the grid references to more usable latitude/longitude values. Chris Veness has written some JavaScript to do just that, as well as writing some incredibly informative articles about how this stuff actually works (there’s some horrible maths involved!). I spent a couple of evenings converting his scripts to Python and wrote a little data extract utility which put all the data into a SQLite3 database.
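
As a rough illustration of the maths involved, the sketch below follows the Ordnance Survey's published series formulae for turning National Grid eastings and northings into latitude and longitude. It is not the conversion code described above: the function name osgb_to_latlon is made up for this example, and the result is on the OSGB36 datum, so a further datum shift (a Helmert transformation) would still be needed before the values line up with WGS84 coordinates. The Irish Grid uses a different ellipsoid and origin, so these constants apply to Great Britain only.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants
# (values as published by the Ordnance Survey).
A = 6377563.396          # semi-major axis, metres
B = 6356256.909          # semi-minor axis, metres
F0 = 0.9996012717        # scale factor on the central meridian
LAT0 = math.radians(49)  # latitude of true origin
LON0 = math.radians(-2)  # longitude of true origin
N0, E0 = -100000.0, 400000.0  # northing / easting of true origin, metres


def osgb_to_latlon(easting, northing):
    """Return (lat, lon) in degrees on the OSGB36 datum (hypothetical helper)."""
    e2 = 1 - (B * B) / (A * A)
    n = (A - B) / (A + B)
    n2, n3 = n * n, n * n * n

    # Iterate until the meridional arc M accounts for the whole northing.
    lat, m = LAT0, 0.0
    while True:
        lat += (northing - N0 - m) / (A * F0)
        d, s = lat - LAT0, lat + LAT0
        m = B * F0 * (
            (1 + n + 1.25 * n2 + 1.25 * n3) * d
            - (3 * n + 3 * n2 + 2.625 * n3) * math.sin(d) * math.cos(s)
            + (1.875 * n2 + 1.875 * n3) * math.sin(2 * d) * math.cos(2 * s)
            - (35.0 / 24) * n3 * math.sin(3 * d) * math.cos(3 * s)
        )
        if abs(northing - N0 - m) < 1e-5:
            break

    sin_lat, tan_lat = math.sin(lat), math.tan(lat)
    sec_lat = 1 / math.cos(lat)
    nu = A * F0 / math.sqrt(1 - e2 * sin_lat ** 2)
    rho = A * F0 * (1 - e2) / (1 - e2 * sin_lat ** 2) ** 1.5
    eta2 = nu / rho - 1
    t2, t4, t6 = tan_lat ** 2, tan_lat ** 4, tan_lat ** 6

    # Series coefficients, named after the terms in the OS guide.
    vii = tan_lat / (2 * rho * nu)
    viii = tan_lat / (24 * rho * nu ** 3) * (5 + 3 * t2 + eta2 - 9 * t2 * eta2)
    ix = tan_lat / (720 * rho * nu ** 5) * (61 + 90 * t2 + 45 * t4)
    x = sec_lat / nu
    xi = sec_lat / (6 * nu ** 3) * (nu / rho + 2 * t2)
    xii = sec_lat / (120 * nu ** 5) * (5 + 28 * t2 + 24 * t4)
    xiia = sec_lat / (5040 * nu ** 7) * (61 + 662 * t2 + 1320 * t4 + 720 * t6)

    de = easting - E0
    lat = lat - vii * de ** 2 + viii * de ** 4 - ix * de ** 6
    lon = LON0 + x * de - xi * de ** 3 + xii * de ** 5 - xiia * de ** 7
    return math.degrees(lat), math.degrees(lon)
```

Feeding in the worked example from the Ordnance Survey guide (easting 651409.903, northing 313177.270) should give roughly 52.6576 degrees north and 1.7179 degrees east.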

Another couple of hours to write a set of wrapper functions around the database and include the validation rules from the UK government data schema and hey presto! We have a usable postcode_utils module.
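
To give a flavour of what such a wrapper function might do, here is a minimal sketch of postcode normalisation. The function name and the regular expression are assumptions made for this example: the pattern only checks the general shape of a postcode and is looser than the full validation rules in the government schema.

```python
import re

# Simplified shape check (an illustrative assumption, not the official rule):
# outward code = 1-2 letters, a digit, then an optional digit or letter;
# inward code  = a digit followed by two letters.
POSTCODE_RE = re.compile(r"^([A-Z]{1,2}[0-9][A-Z0-9]?)([0-9][A-Z]{2})$")


def normalise_postcode(raw):
    """Return 'OUTWARD INWARD' in upper case, or None if the shape is wrong."""
    compact = re.sub(r"\s+", "", raw).upper()
    match = POSTCODE_RE.match(compact)
    if match is None:
        return None
    return match.group(1) + " " + match.group(2)


print(normalise_postcode("sw1a1aa"))
print(normalise_postcode("not a postcode"))
```

Normalising before the lookup means that "sw1a1aa", "SW1A 1AA" and " sw1a 1aa " all end up as the same database key.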

This just leaves Northern Ireland…

It turns out that the Northern Ireland Statistics and Research Agency (NISRA) have some GIS data available in ESRI Shapefile format. There is a useful Python library to work with ESRI Shapefiles, and it took a matter of minutes to extract the coordinates from the NISRA data, which provides postcodes in… bah! another grid system, but this time it’s the Irish Grid!

Some more hunting online to get the right values for the grid transformations and the ellipsoid dimensions used for each grid system, and a bit of tinkering with Chris Veness’ code, and we are done! Here is a SQLite3 database containing all the postcodes in Great Britain and Northern Ireland with accurate latitudes and longitudes (more accurate than some Web Services I could mention) and some Python to use with it. Also, this information is all freely available for both commercial and non-commercial use (provided you give attribution as mentioned on the various sites linked to above).
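
Querying a database like this from Python needs nothing beyond the standard library. The sketch below is only an illustration: the table name postcodes and its columns are assumptions rather than the actual layout of the download, and the single row is made-up data in an in-memory database so that the snippet runs on its own (Python 3).

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a hypothetical schema and one fake row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postcodes ("
        "postcode TEXT PRIMARY KEY, latitude REAL, longitude REAL)"
    )
    conn.execute("INSERT INTO postcodes VALUES (?, ?, ?)", ("ZZ99ZZ", 51.5, -0.12))
    return conn


def lookup(conn, postcode):
    """Return (latitude, longitude) for a postcode, or None if it is unknown."""
    key = postcode.replace(" ", "").upper()
    # A parameterised query avoids quoting problems with user input.
    cursor = conn.execute(
        "SELECT latitude, longitude FROM postcodes WHERE postcode = ?", (key,)
    )
    return cursor.fetchone()


conn = demo_connection()
print(lookup(conn, "zz9 9zz"))
print(lookup(conn, "AB1 2CD"))
conn.close()
```

With a real database file you would replace demo_connection() with sqlite3.connect() on the downloaded file and adjust the query to whatever table and column names it actually uses.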

If you find this useful, please let me know by using the comments feature. Also, if anybody has a source of data for the Channel Islands or the Isle of Man I would be very pleased to hear about it!


Getting started with Libxml2 and Python (Part 2)

by on Feb.21, 2007, under Knowledge Base, Technology, Tutorials

After I published the first part of this tutorial, John Dennis gave me some
feedback on the xml@gnome.org mailing list (http://mail.gnome.org/archives/xml/2007-February/thread.html).
He posed a couple of interesting questions

  1. how do I build complex python objects by parsing an XML doc?
  2. how can I serialize python objects into XML?

Normally I would use pickling and unpickling to serialise Python objects, but
I can see some cases in which this might come in useful. Having a bit of a play
with dynamically creating objects in Python made me realise that this is a
non-trivial challenge as well and so is probably an ideal exercise for this
tutorial: creating arbitrary objects from an XML source document.

Simple example

Let’s imagine that we have the following XML document

	<?xml version='1.0'?>
	<user>
		<name>Mike Kneller</name>
		<homepage>http://www.mikekneller.com</homepage>
	</user>

Dynamically populating an object is fairly straightforward in Python as it
allows the dynamic creation of object attributes. We just need to loop through
the document creating the properties. This is simple when we realise that
setattr(obj, 'foo', 123) is equivalent to obj.foo = 123.

	import libxml2

	class DynamicObject:
		pass

	user = DynamicObject()

	doc = libxml2.parseFile( 'userdata.xml' )
	child = doc.getRootElement().children

	while child is not None:
		if child.type == "element":
			setattr(user, child.name, child.content)
		child = child.next
	doc.freeDoc()

That was almost too easy! Examining user shows that we have a Python object
populated with the contents of the XML document as expected.

	>>> print user.__dict__
	{'homepage': 'http://www.mikekneller.com', 'name': 'Mike Kneller'}
	>>> print user.homepage
	http://www.mikekneller.com

Although this is a useful routine, simply filling an object with string values
doesn’t really count as a ‘complex’ object (although it illustrates the point).
Its main use is the generation of arbitrary data objects where the order and
naming of the data is not known in advance.

Walking the tree

A ‘complex’ object would be one containing a mixture of data types, possibly
holding other objects and maybe some code.

This XML document (people.xml) contains a group of people that we would like to
load.

	<?xml version='1.0'?>
	<people>
		<person>
			<name>Mike</name>
			<age>34</age>
			<friends>
				<friend>Steve</friend>
				<friend>Mark</friend>
				<friend>Dave</friend>
			</friends>
		</person>

		<person>
			<name>Steve</name>
			<friends>
				<friend>Mike</friend>
				<friend>Mark</friend>
				<friend>Dave</friend>
			</friends>
			<hobbies>
				<hobby>Stamp collecting</hobby>
				<hobby>Train spotting</hobby>
			</hobbies>
		</person>

		<person>
			<name>Mark</name>
			<age>28</age>
			<friends>
				<friend>Mike</friend>
				<friend>Steve</friend>
			</friends>
		</person>

		<person>
			<name>Dave</name>
			<age>30</age>
			<friends>
				<friend>Mike</friend>
				<friend>Steve</friend>
			</friends>
		</person>
	</people>

To construct arbitrary Python objects from a document like this, we will need to walk the tree. This example recurses through an XML document, printing out the node names and content, indenting as it goes.

	import libxml2

	def walkTree(xmlnode):
		child = xmlnode.children
		while child is not None:
			if not child.isBlankNode():
				if child.type == "element":
					childCount = int(child.xpathEval('count(*)'))

					# a count of the ancestor nodes tells us how deep in the
					# tree we are - lets just use it to indent our printed
					# output
					depth = int(child.xpathEval('count(ancestor::*)')) - 1
					if childCount == 0:
						# If the count of child elements is 0 then we
						# have a node only containing text
						print  depth * '\t' + child.name + ' : ' + child.content
					else:
						# If the node contains other child elements then
						# we can recurse down the tree
						print depth * '\t' + child.name
						walkTree(child)

			child = child.next

	doc = libxml2.parseFile('people.xml')
	root = doc.getRootElement()

	walkTree(root)

	doc.freeDoc()
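
The same recursion can build objects instead of printing. The sketch below is one possible approach, not a definitive one: it uses the standard library's xml.etree.ElementTree rather than Libxml2 so that it runs without the bindings installed (Python 3), and the rule "children that all share one name become a list" is a heuristic chosen for this example.

```python
import xml.etree.ElementTree as ET


class DynamicObject(object):
    pass


def build(element):
    """Turn an element into a string, a list, or a DynamicObject."""
    children = list(element)
    if not children:
        # Leaf node: keep its text, just as walkTree prints it.
        return (element.text or '').strip()
    if len(set(child.tag for child in children)) == 1:
        # Every child has the same name (friends/friend): treat as a list.
        return [build(child) for child in children]
    # Mixed child names: one attribute per child element.
    obj = DynamicObject()
    for child in children:
        setattr(obj, child.tag, build(child))
    return obj


# A cut-down version of people.xml, inlined to keep the example self-contained.
PEOPLE = """<people>
  <person><name>Mike</name><age>34</age>
    <friends><friend>Steve</friend><friend>Mark</friend></friends>
  </person>
  <person><name>Mark</name><age>28</age>
    <friends><friend>Mike</friend></friends>
  </person>
</people>"""

people = build(ET.fromstring(PEOPLE))
for person in people:
    print(person.name + ' : ' + ', '.join(person.friends))
```

All leaf values stay as strings (age is '34', not 34), and an element with a single child will be treated as a one-item list, so a real loader would need a schema or naming convention to decide types properly.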

Using libxml2 and python to scrape content from a website

by on Feb.06, 2007, under Knowledge Base, Technology, Tutorials

This is a practical example, using Libxml2 to parse a real-world Web page. I chose the TV listings pages from the Guardian Website, as it is the type of page you are likely to want to scrape for useful data.

Additionally, the Guardian TV listings contain a couple of very typical HTML errors. The listings are contained within a table, and some of the rows in the table are not closed.

The Libxml2 parser recovers from these errors by closing the tags at the end of the page and then continuing parsing from the next useable opening tag. This leaves us with a tr tag containing duplicated content from later in the document, which this code handles in a simple way by splitting the broken content on the next opening tag, and using another document instance to close the unterminated tags.

TV Listings

import libxml2, os, sys, datetime

# This script reads the TV listings from the Guardian TV listings website
# (http://www.guardian.co.uk/TV/)

parse_options = libxml2.HTML_PARSE_RECOVER + \
	libxml2.HTML_PARSE_NOERROR + \
	libxml2.HTML_PARSE_NOWARNING

today = datetime.date.today()
tomorrow = today + datetime.timedelta(days=1)

class Channel:
	def __init__(self, name):
		self.name = name
		self.entries = []

class ListingEntry:
	def __init__(self,when):
		self.when = when
		self.title = ''
		self.content = ''

def newTime(node, entries):
	timeStr = str(node.content).strip()

	# the Guardian listings show times like 6.00am or 5.45pm, we need to
	# turn this into a more useable form. A Python datetime object will do
	# just fine. It is also worth noting that the listings run from 6.00am
	# to 6.00am, so we need to account for a date boundary at midnight.
	if (timeStr.count('am') > 0) or (timeStr.count('pm') > 0):
		t = timeStr.split('.')
		hour = int(t[0])
		minute = int(t[1][0:2])
		ampm = t[1][2:4]

		if (ampm == 'pm') and (hour < 12):
			hour += 12
		elif (ampm == 'am') and (hour == 12):
			# 12.xxam is just after midnight, not midday
			hour = 0

		if (hour < 6):
			date = tomorrow
		else:
			date = today

		when = datetime.datetime(date.year, date.month, date.day, hour, minute)
		newEntry = ListingEntry(when)
		entries.append(newEntry)

def newProgramme(node, entries):
	# luckily for us, all the Guardian TV entries are wrapped in <font> tags
	# which is bad for accessibility, but gives us a known node to grab
	items = node.xpathEval('.//font/node()')
	for item in items:
		if not item.isBlankNode():
			if (item.type == 'text') or (item.type == 'element'):
				if entries[-1].title == '':
					entries[-1].title += str(item.content).strip()
				else:
					entries[-1].content += str(item.content).strip() + '\n'

def processSourceHTML(url,entries):
	doc = libxml2.htmlReadFile(url, None, parse_options)
	listingTable = doc.xpathEval('//table')[6]
	rows = listingTable.xpathEval('.//tr')
	for row in rows:
		if len(row.xpathEval('.//tr')) > 0:
			# This row is broken, tr tags should not contain more tr tags!
			# it probably is missing one or more closing tags and therefore
			# needs special handling.
			fixup = row.serialize()

			# use a new name here so we don't rebind the list being looped over
			fragments = fixup.split('<tr>')

			# Here we load the broken HTML fragment into another document
			# to extract whatever we can from it.
			fixDoc = libxml2.htmlReadDoc('<html>'+fragments[1]+'</html>', \
				'', None, parse_options)

			cells = fixDoc.xpathEval('//td')
			for cell in cells:
				if cell.prev is None:
					# if the cell has no previous sibling then it is the first
					# cell in the row, e.g. the one containing the time
					newTime(cell, entries)
				else:
					newProgramme(cell, entries)

			fixDoc.freeDoc()
		else:
			cells = row.xpathEval('td')
			for cell in cells:
				if cell.prev is None:
					# if the cell has no previous sibling then it is the first
					# cell in the row, e.g. the one containing the time
					newTime(cell, entries)
				else:
					newProgramme(cell, entries)
	doc.freeDoc()

channels = []

# We could do more here from an automation perspective - spider the list
# of channels, automatically populating the channel names etc...
# but this is left as an exercise for the reader
channels.append( Channel('BBC1') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/bbc1s_meridian.html', \
	channels[-1].entries)

channels.append( Channel('BBC2') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/bbc2s_meridian.html', \
	channels[-1].entries)

channels.append( Channel('ITV - Meridian') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/meridian_meridian.html', \
	channels[-1].entries)

channels.append( Channel('Channel 4') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/ch4_meridian.html', \
	channels[-1].entries)

for channel in channels:
	print channel.name
	for entry in channel.entries:
		print "----"
		print entry.when
		print entry.title
		print "----"
		print entry.content
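
The time handling is the fiddliest part of the script, so here it is pulled out into a standalone sketch that can be tried without fetching any pages (Python 3). The function name parse_listing_time is made up for this example; it applies the rules described in the comments above, treats 12.xxam as just after midnight, and returns None for anything that does not look like a time.

```python
import datetime


def parse_listing_time(time_str, today):
    """Convert '6.00am' or '5.45pm' into a datetime, or None if not a time."""
    text = time_str.strip().lower()
    if not (text.endswith('am') or text.endswith('pm')):
        return None
    clock, ampm = text[:-2], text[-2:]
    hour_str, _, minute_str = clock.partition('.')
    try:
        hour, minute = int(hour_str), int(minute_str or 0)
    except ValueError:
        return None
    if not (1 <= hour <= 12 and 0 <= minute < 60):
        return None

    # 12-hour to 24-hour: 12.xxam is just after midnight, 12.xxpm is midday.
    if ampm == 'pm' and hour < 12:
        hour += 12
    elif ampm == 'am' and hour == 12:
        hour = 0

    # Listings run from 6.00am to 6.00am, so early hours belong to tomorrow.
    date = today + datetime.timedelta(days=1) if hour < 6 else today
    return datetime.datetime(date.year, date.month, date.day, hour, minute)


day = datetime.date(2007, 2, 6)
for sample in ('6.00am', '5.45pm', '12.30am', 'News'):
    print(sample, parse_listing_time(sample, day))
```

Passing the date in as a parameter, rather than reading a global, makes the midnight boundary easy to check for any day.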

Getting started with Libxml2 and Python (Part 1)

by on Feb.06, 2007, under Knowledge Base, Technology, Tutorials

This article is an import from my old site – the original was published on 6th Feb 2007.

Getting to grips with Libxml2 and Python can be a frustrating experience,
particularly as in-depth, accurate Python documentation is hard to find
on the Web.

Many Python developers seem to dislike the Libxml2 bindings, as they are ‘un-Pythonic’
and much too C-like. This, however, misses the point of Libxml2, which is that
this library is portable, mature, extremely full-featured and *very* fast.

In the process of writing this tutorial, I hung out in the #xml channel on
irc.gnome.org, and subscribed to the xml@gnome.org mailing list – I
was given a lot of help when things weren’t obvious! Although there’s not a massive
amount of activity on IRC, or in the mailing list on a daily basis, I would
definitely recommend spending some time browsing the archive – or using Google
to search it when you have questions. Additionally, I have found the people in
the Libxml2 community very helpful.

Manipulating XML using Libxml2 is fairly straightforward when you have a couple
of working examples, however that tends to be the problem in Python. Finding
working examples tends to be a bit of a hit-and-miss affair.

The first place to look is in the examples folder in the documentation installed
with your release (/usr/share/doc/libxml2-python-2.6.27/examples on my machine).

TODO: where are the examples on a number of distributions/platforms?

Also, take a moment to scan through libxml2.py itself – this is the Python wrapper and
is a good place to look if you are hunting for a particular function. There
is plenty of information in the wrapper, as all the docstrings have been
populated. You can always get information with

	print libxml2.parseFile.__doc__

for any particular function.

Also remember that you can list the available methods for any Python object by
using the dir function. The most immediately useful objects are xmlCore, xmlNode
and xmlDoc, so

	dir(libxml2.xmlCore)

is your friend when working out what functions are available to you.
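
With a couple of hundred names coming back from dir, a small filter helps. This is just a convenience sketch (the helper name find_methods is made up), shown on the built-in str type so that it runs anywhere on Python 3; with the bindings installed you would pass libxml2.xmlCore or libxml2.xmlDoc instead.

```python
def find_methods(obj, fragment):
    """Return sorted public attribute names of obj that contain fragment."""
    fragment = fragment.lower()
    return sorted(
        name for name in dir(obj)
        if fragment in name.lower() and not name.startswith('_')
    )


# str stands in for a Libxml2 class here.
print(find_methods(str, 'strip'))
```

Searching for a fragment such as 'xpath' or 'sibling' is usually quicker than reading the whole listing.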

I’m going to assume that you know a bit about XML, at least enough to recognise
an XML document when you see one, and hopefully enough about Python to know
where to find the documentation!

Contents

Installing Libxml2

TODO: installation examples for a number of distros/platforms.

Loading a document

The first thing you want to do in XML will be to load a document of some sort.
As a new Libxml2 user, this is where our confusion starts! It is worth remembering
that in general, the Python bindings are automatically generated – therefore
there is an equivalent Python function for every C function, and sometimes this
can lead to unnecessary, or apparently duplicated Python functions.

The library contains a number of different functions we can use to load an XML
document:

parseDoc, parseFile, parseMemory, readDoc, readFd, readFile, readMemory,
recoverDoc and recoverFile

All of these functions return an xmlDoc object. Examples for using each of these
follow:

parseDoc(cur) – load an XML document from memory (a string)

	doc = libxml2.parseDoc("""<?xml version="1.0"?>
	<root>Hello world!</root>""")

parseMemory(buffer, size) – load an XML document from memory

	doc = libxml2.parseMemory(xml, len(xml))

This function performs exactly the same job as parseDoc from a Python perspective.

parseFile(filename) – load an XML document from a file

	doc = libxml2.parseFile('test.xml')

readDoc(cur, URL, encoding, options) – load an XML document from memory (a string)

This version of the function allows you to specify options on a per-document
basis. The parseDoc version uses the parser defaults (in practice, the
parser global settings, which can also be modified using global functions).

In most cases,

		doc = libxml2.readDoc('<foo/>',None,None,0)

will be equivalent to

		doc = libxml2.parseDoc('<foo/>')

When using XSL, I have found it better to force entities
to be resolved before running the transform, in which case it is useful to
use the following:

	doc = libxml2.readDoc(xml, None, None, libxml2.XML_PARSE_NOENT)

readFd(fd, URL, encoding, options) – load an XML document from a file descriptor

readFile(filename, encoding, options) – load an XML document from a file, allowing
the specification of per-document options.

readMemory(buffer, size, URL, encoding, options) – for Python, equivalent to
using readDoc

recoverDoc(cur) – this is equivalent to parseDoc, except that even broken XML
will result in a valid XML tree being created.

	doc = libxml2.recoverDoc('<foo><broken></foo>')

will raise a parser error, but after the error has been handled, doc will
contain:

	<?xml version="1.0"?>
	<foo><broken/></foo>

recoverFile(filename) – same as recoverDoc, but for files.

In the simplest case, to load a file from disk you can do:

	doc = libxml2.parseFile( 'test.xml' )

Managing your memory

Ugh, nasty memory management. Isn’t that why we’re using Python, to avoid all that
stuff?

Libxml2 does not explicitly handle the cleaning up of the memory it uses, so when
you finish working with your xmlDoc object, you need to remember to call freeDoc.
The same is true of XPath evaluation contexts created with xpathNewContext: you
must call xpathFreeContext on them.

OK, so what we have now is something like the following:

	doc = libxml2.parseFile( 'test.xml' )
	# Do some stuff with the document here!
	doc.freeDoc()

It doesn’t matter which function you use to create your xmlDoc object – each of
them returns the same thing, so just remember to call freeDoc on it when you
are done and all will be well.
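
If remembering the freeDoc call feels error-prone, one option is to wrap it in a context manager so the document is freed even when an exception is raised part-way through. This is a sketch of that idea rather than anything Libxml2 provides: the helper name freeing is made up, and FakeDoc is a stand-in class used only so that the example runs without the bindings (Python 3).

```python
from contextlib import contextmanager


@contextmanager
def freeing(doc):
    """Yield doc, then call its freeDoc() on exit, even after an exception."""
    try:
        yield doc
    finally:
        doc.freeDoc()


class FakeDoc(object):
    """Stand-in for an xmlDoc so the sketch runs without libxml2."""

    def __init__(self):
        self.freed = False

    def freeDoc(self):
        self.freed = True


fake = FakeDoc()
with freeing(fake) as doc:
    pass  # work with the document here
print(fake.freed)
```

With the real bindings the usage would be along the lines of "with freeing(libxml2.parseFile('test.xml')) as doc:", with the document work indented beneath it.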

There, that wasn’t so hard was it? :-)

Working with the document

Now we have a working document, and know how to dispose of it when we’re done
it is time to look at a number of common XML operations and see how we can do
those using Libxml2 and Python.

Elements

The xmlDoc object has a large number of methods. As well as its own collection,
it inherits from xmlNode, which inherits from xmlCore; this gives you over 200
available methods to read up on! This is fairly daunting when you can’t find an
example that shows you how to perform simple tasks, but don’t worry. In practice
we can get by in most situations with a small fraction of these.

All well-formed XML documents contain a single root node, which contains all the
other nodes.

You can get a reference to the root element using getRootElement on the document
object. The root element is an xmlNode object, just like all other nodes in the
document. Working with nodes is fairly straightforward:

	>>> import libxml2
	>>> doc = libxml2.parseDoc( '<foo>Hello world.</foo>' )
	>>> root = doc.getRootElement()
	>>> print root.name
	foo
	>>> print root.content
	Hello world.
	>>> root.setProp('bar', 'an attribute')
	<xmlAttr (bar) object at 0x13c00d0>
	>>> root.prop('bar')
	'an attribute'
	>>> print root.serialize()
	<foo bar="an attribute">Hello world.</foo>
	>>> doc.freeDoc()

The serialize method can be called on a single node or on the whole document,
and returns a string representation of whichever it was called on.

Navigating through the document is not much more difficult – we can use the node
properties (from the xmlCore ancestor object) to find the child nodes:

	child = root.children
	# the children property returns the FIRST child of a node
	while child is not None:
		if child.type == "element":
			# do something with the child node
			print child.name
		child = child.next

Accessing the attributes of a node is possible in a similar way

	import libxml2
	doc = libxml2.parseDoc('<foo att1="value 1" att2="value 2"/>')
	root = doc.getRootElement()
	for property in root.properties:
		if property.type=='attribute':
			# do something with the attributes
			print property.name
			print property.content
	doc.freeDoc()

Notice that in both looping through the children, and looping through the
properties, there is a test for the type of the node. This is because in most
documents the whitespace between elements shows up as text nodes, alongside
the specific node types we are interested in.

XPath

Navigating a document in this manner is straightforward, but tedious and requires
accessing every node in the document until you get to the specific one you need.
More often, you want to retrieve a set of nodes or a single node matching some
specific criteria. This is where XPath comes in, and Libxml2 has full support
for XPath.

XPath queries can be run against the document or a specific element in the
document, but in either case the procedure is the same.

The xmlsoft.org Python page suggests the following:

	doc = libxml2.parseFile("test.xml")
	ctxt = doc.xpathNewContext()
	result = ctxt.xpathEval("//*")
	# do something with the result

	doc.freeDoc()
	ctxt.xpathFreeContext()

which involves creating an XPath context, running a query against it and then
freeing the context when finished. If you have a lot of queries to run, then
this is the best way to work, as the context can be re-used for each query.

In practice, the xmlCore object provides a helper function which wraps this up
for you. For single queries running xpathEval directly on the node will suffice,
just be aware that each query creates and destroys its own context, which is
going to be slower than the above implementation.

An XPath query will return a typed result, corresponding to the four basic types mentioned in the
introduction section of the XPath Specification. Where the result is a
node-set, this will be a Python list of nodes, which makes it easy to perform an operation on many nodes at once.

	import libxml2
	doc = libxml2.parseFile('test.xml')
	# select every element in the document
	result = doc.xpathEval('//*')
	for node in result:
		print node.name
	doc.freeDoc()

Apart from the call to freeDoc, I can’t see how much more Pythonic it could be?

Namespaces

Dealing with XML Namespaces is possible as well.

Here we create an XML document and declare a namespace on the root element.

	import libxml2

	doc = libxml2.newDoc('1.0')
	root = libxml2.newNode('foo')
	doc.setRootElement(root)

	#Register the toto namespace
	ns = root.newNs('http://toto.org', 'toto')

	root.setNs(ns)  #put this node in the namespace

	#Add to the root node a property in this namespace
	root.setNsProp(ns, 'Id', str(12345))

	print doc.serialize()

This produces:

	<?xml version="1.0"?>
	<toto:foo xmlns:toto="http://toto.org" toto:Id="12345"/>

Namespaces can also be dealt with in XPath, provided you register the namespace with the XPath context object.

	import libxml2

	doc = libxml2.parseDoc("""
	<foo xmlns:MYNS="http://somewhere.fr">
	   <MYNS:a id="a1"/>
	   <a      id="a2"/>
	</foo>
	""")

	ctxt = doc.xpathNewContext()
	#you can choose any name, the URI is the namespace identifier
	ctxt.xpathRegisterNs("OtherName", "http://somewhere.fr") 

	# select the 'a' node in the somewhere.fr namespace
	result = ctxt.xpathEval('//OtherName:a')
	for node in result:
		print node.name, "id=%s"%node.prop("id")  #will display "a id=a1"

	ctxt.xpathFreeContext()
	doc.freeDoc()

If the document declares a default namespace, you will have to register it with the XPath context under a prefix of your choice before you can use it in an XPath expression.
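
The default-namespace rule is not specific to Libxml2, so it can be tried out with the standard library alone. The sketch below uses xml.etree.ElementTree as a stand-in so that it runs without the bindings (Python 3); the prefix d is an arbitrary choice, and only the namespace URI has to match the document.

```python
import xml.etree.ElementTree as ET

XML = """<foo xmlns="http://somewhere.fr">
   <a id="a1"/>
</foo>"""

root = ET.fromstring(XML)

# An unprefixed name in the query only matches elements in no namespace,
# so this finds nothing even though the document writes <a> without a prefix.
unprefixed = root.findall('a')
print(len(unprefixed))

# Binding the namespace URI to a prefix of our choosing finds the element.
matches = root.findall('d:a', {'d': 'http://somewhere.fr'})
ids = [node.get('id') for node in matches]
print(ids)
```

With Libxml2 the equivalent step is the xpathRegisterNs call shown earlier, followed by a query that uses the registered prefix.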

Writing to a file

To write the contents of your XML document to a file, just use the saveTo method:

	f = open('output.xml','w')
	doc.saveTo(f)
	f.close()

The saveTo method is also part of xmlCore, so you can use it to save the contents
of just a single node and its children as well as the whole document.

It is also worth noting that both saveTo, and serialize can accept an encoding parameter, which allows the conversion of a document from one encoding to another. Libxml2 itself uses UTF-8 internally, and will convert the document when loading and serialising.

	>>> doc = libxml2.parseDoc("""<root><foo>hello</foo></root>""")
	>>> out = doc.serialize()
	>>> print out
	<?xml version="1.0"?>
	<root><foo>hello</foo></root>

	>>> out = doc.serialize("iso-8859-1")
	>>> print out
	<?xml version="1.0" encoding="iso-8859-1"?>
	<root><foo>hello</foo></root>

Modifying documents

To add a new node to a document, first we must create the node and then add it
as a child of the element it belongs to.

	import libxml2
	doc = libxml2.parseDoc('<foo/>')
	root = doc.getRootElement()
	newNode = libxml2.newNode('bar')
	root.addChild(newNode)

At this stage, our document contains

	<?xml version="1.0"?>
	<foo><bar/></foo>

Using the setContent method of newNode, we can set its text:

	newNode.setContent('Hello')

We can append some content to our element by calling addContent,

	newNode.addContent(' world')

which gives us

	<?xml version="1.0"?>
	<foo><bar>Hello world</bar></foo>

Creating or setting an attribute is easy too; we use the setProp method.

	newNode.setProp('attribute', 'the value')

If the attribute doesn’t exist, it will be created; otherwise it will just have
its content changed.

Adding nodes at a particular location in the hierarchy is possible using
addNextSibling, or addPrevSibling. These operate in the same way as addChild,
except they operate on the node you wish to add next to, rather than to the
parent.

	sibling = libxml2.newNode('bar2')
	newNode.addPrevSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar></foo>

whereas

	sibling = libxml2.newNode('bar2')
	newNode.addNextSibling(sibling)

gives

	<?xml version="1.0"?>
	<foo><bar attribute="the value">Hello world</bar><bar2/></foo>

To insert text into the document, you create a text node with some content and
add it in the same way

	text = libxml2.newText('some text\n')
	newNode.addNextSibling(text)

which leaves us with

	<?xml version="1.0"?>
	<foo><bar2/><bar attribute="the value">Hello world</bar>some text
	</foo>

To create content and nodes, the useful Libxml2 helper functions are newComment,
newText and newNode. You can also create a new node by copying one that already
exists. The xmlNode object has copyNode and copyProp methods which can be useful
here.

To add these new nodes into a document, you need to use one of the following
methods (directly on nodes rather than on the document), addChild, addContent,
addNextSibling, addPrevSibling.

XSLT

Libxml2 has a companion library called libxslt which provides support for
XSL Transformations. I find the following example provides most of the
useful information for a Python coder:

	import libxml2
	import libxslt

	def runTransform(xmlFile,xslFile):
		out = ''
		sourcedoc = libxml2.parseFile( xmlFile )
		styledoc = libxml2.parseFile( xslFile )
		style = libxslt.parseStylesheetDoc(styledoc)
		result = style.applyStylesheet(sourcedoc, None)
		out = style.saveResultToString( result )
		style.freeStylesheet()
		result.freeDoc()
		sourcedoc.freeDoc()
		return out

Notice that there are three documents involved, each of which needs to be
explicitly freed: the source, the stylesheet and the result. (Freeing the
stylesheet with freeStylesheet also frees the document it was parsed from.)
The starting point for documentation can be found here, http://xmlsoft.org/XSLT/python.html.

Libxml2 and HTML

If you have spent any time poking around libxml2.py, you will probably have
noticed a number of functions that start with html. This is because Libxml2 has
an HTML parser built in that does a pretty good job of loading real world
(in other words horribly broken) HTML documents. You can then use the features
we have previously discussed to read or modify the HTML.

The following example will load pretty much any HTML file into an xmlDoc object

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html, '', None, parse_options)

Here is a more complete example, which extracts all the links from the Guardian
newspaper Website home page and prints the href attribute.

	import urllib2
	import libxml2

	# Load the page into a string
	f = urllib2.urlopen('http://www.guardian.co.uk')
	html = f.read()
	f.close()

	parse_options = libxml2.HTML_PARSE_RECOVER + \
		libxml2.HTML_PARSE_NOERROR + \
		libxml2.HTML_PARSE_NOWARNING
	doc = libxml2.htmlReadDoc(html,'',None,parse_options)
	links = doc.xpathEval('//a')
	for link in links:
		href = link.xpathEval('attribute::href')
		if len(href) > 0:
			href = href[0].content
			print href
	doc.freeDoc()

For a more comprehensive example, see the post on using libxml2 and Python to scrape content from a website.

Schema

One may validate an XML instance against a W3C schema, as shown below:

	# inspired from the test suite file "xstc/xstc.py"
	# thanks to Kasimier Buchcik
	#
	import libxml2

	ctxt = libxml2.schemaNewParserCtxt("my-schema.wxs")
	schema = ctxt.schemaParse()
	del ctxt

	validationCtxt = schema.schemaNewValidCtxt()

	doc = libxml2.parseFile("test.xml")

	#instance_Err = validationCtxt.schemaValidateFile(filePath, 0)
	instance_Err = validationCtxt.schemaValidateDoc(doc)

	del validationCtxt
	del schema
	doc.freeDoc()

	if instance_Err != 0:
		print "VALIDATION FAILED"
	else:
		print "VALIDATED"

Known Problems

Node equality Problem

The usual equality test (==) does not work, :-( , look at this:

	>>> import libxml2
	>>> doc = libxml2.parseDoc('<foo/>')
	>>> root1 = doc.getRootElement()
	>>> root2 = doc.getRootElement()
	>>> root1 == root2
	False

(note: This issue affects earlier builds of Libxml2 for Python. It is referred to in http://bugzilla.gnome.org/show_bug.cgi?id=345779 and appears to be resolved in current builds)

Using libxml2-2.6.27, this produces the expected result.

	>>> import libxml2
	>>> doc = libxml2.parseDoc('<foo/>')
	>>> root1 = doc.getRootElement()
	>>> root2 = doc.getRootElement()
	>>> root1 == root2
	True
