Tag: Web Development
Accurate Postcode data for the UK and Northern Ireland
by Mike on Jun.06, 2011, under Knowledge Base, Technology

Getting accurate postcode data for use in my programs has proved to be an interesting technical challenge, despite the availability of free services such as those from Google.
I needed a UK postcode to latitude/longitude conversion tool that would work offline, which ruled out the Google APIs entirely.
The Ordnance Survey publishes a free dataset under the moniker of “Code-Point Open”, but it is missing some data (notably Northern Ireland), so it can’t really be called complete. It also does not provide locations as latitude/longitude; instead it uses Ordnance Survey grid references.
If you need to do this, then hopefully the tools here will be of some help.
First we need to grab the Ordnance Survey Code-point open dataset. They don’t provide a straight link, so you will need to register to actually get access to the data. Once downloaded and extracted you will have a folder full of CSV files.
Secondly we need to convert the grid references to more usable latitude/longitude values. Chris Veness has written some JavaScript to do just that, as well as some incredibly informative articles about how this stuff actually works (there’s some horrible maths involved!). I spent a couple of evenings converting his scripts to Python and wrote a little data extract utility which puts all the data into a SQLite3 database.
Another couple of hours to write a set of wrapper functions around the database and include the validation rules from the UK government data schema and hey presto! We have a usable postcode_utils module.
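To give a flavour of what such a wrapper might look like, here is a minimal sketch. It is not the actual postcode_utils module: the table layout (postcodes with postcode, latitude and longitude columns), the function names, and the sample row are all assumptions made up for illustration, and the regular expression is a deliberately simplified shape check rather than the full government schema rules.

```python
import re
import sqlite3

# Simplified outward/inward shape check; NOT the full government schema rules.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}$")


def normalise_postcode(raw):
    """Upper-case, drop all whitespace, then put one space before the inward code."""
    compact = "".join(raw.split()).upper()
    return compact[:-3] + " " + compact[-3:]


def lookup(conn, raw):
    """Return (latitude, longitude) for a postcode, or None if malformed or unknown."""
    postcode = normalise_postcode(raw)
    if not POSTCODE_RE.match(postcode):
        return None
    return conn.execute(
        "SELECT latitude, longitude FROM postcodes WHERE postcode = ?",
        (postcode,),
    ).fetchone()


# Hypothetical schema and a made-up sample row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postcodes (postcode TEXT PRIMARY KEY, latitude REAL, longitude REAL)"
)
conn.execute("INSERT INTO postcodes VALUES (?, ?, ?)", ("AB1 2CD", 57.1, -2.1))

print(lookup(conn, "ab12cd"))
```

Normalising before the query means callers can pass postcodes however the user typed them, and the parameterised query keeps user input out of the SQL string.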
This just leaves Northern Ireland…
It turns out that the Northern Ireland Statistics and Research Agency (NISRA) has some GIS data available in ESRI Shapefile format. There is a useful Python library for working with ESRI Shapefiles, and it took a matter of minutes to extract the coordinates from the NISRA data. That data provides postcodes in… bah! another grid system, but this time it’s the Irish Grid!
Some more hunting online to get the right values for the grid transformations and the ellipsoid dimensions used for each grid system, a bit of tinkering with Chris Veness’ code, and we are done! Here is a SQLite3 database containing all the postcodes in Great Britain and Northern Ireland with accurate latitudes and longitudes (more accurate than some Web Services I could mention) and some Python to use with it. This information is all freely available and free for both commercial and non-commercial use (provided you give attribution as mentioned on the various sites linked to above).
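For the curious, the projection step can be sketched roughly as follows. This is my own condensed illustration of the standard inverse Transverse Mercator series, not the code from the download; the names Grid, OSGB, IRISH and grid_to_latlon are invented here. The constants are the published ellipsoid and origin values for the two grids. Note that the result is on each grid's own datum (OSGB36 or Ireland 1965), so a further datum shift is still needed before the values match WGS84/GPS coordinates.

```python
import math
from collections import namedtuple

# a, b: ellipsoid semi-axes (m); f0: scale factor on the central meridian;
# lat0, lon0: true origin (degrees); e0, n0: grid coordinates of that origin (m).
Grid = namedtuple("Grid", "a b f0 lat0 lon0 e0 n0")

OSGB = Grid(6377563.396, 6356256.909, 0.9996012717, 49.0, -2.0, 400000.0, -100000.0)
IRISH = Grid(6377340.189, 6356034.447, 1.000035, 53.5, -8.0, 200000.0, 250000.0)


def meridional_arc(g, lat, lat0):
    """Distance along the central meridian from lat0 to lat (radians in, metres out)."""
    n = (g.a - g.b) / (g.a + g.b)
    d, s = lat - lat0, lat + lat0
    return g.b * g.f0 * (
        (1 + n + 1.25 * n**2 + 1.25 * n**3) * d
        - (3 * n + 3 * n**2 + 2.625 * n**3) * math.sin(d) * math.cos(s)
        + (1.875 * n**2 + 1.875 * n**3) * math.sin(2 * d) * math.cos(2 * s)
        - (35.0 / 24.0) * n**3 * math.sin(3 * d) * math.cos(3 * s)
    )


def grid_to_latlon(g, easting, northing):
    """Convert grid metres to (lat, lon) in degrees on the grid's own datum."""
    lat0, lon0 = math.radians(g.lat0), math.radians(g.lon0)
    e2 = 1 - (g.b * g.b) / (g.a * g.a)

    # Iterate for the latitude whose meridional arc matches the northing.
    lat, m = lat0, 0.0
    for _ in range(50):
        lat += (northing - g.n0 - m) / (g.a * g.f0)
        m = meridional_arc(g, lat, lat0)
        if abs(northing - g.n0 - m) < 1e-5:  # within 0.01 mm
            break

    sin2 = math.sin(lat) ** 2
    nu = g.a * g.f0 / math.sqrt(1 - e2 * sin2)
    rho = g.a * g.f0 * (1 - e2) / (1 - e2 * sin2) ** 1.5
    eta2 = nu / rho - 1
    t = math.tan(lat)
    sec = 1 / math.cos(lat)
    de = easting - g.e0

    # Series coefficients for the offset from the central meridian.
    vii = t / (2 * rho * nu)
    viii = t / (24 * rho * nu**3) * (5 + 3 * t**2 + eta2 - 9 * t**2 * eta2)
    ix = t / (720 * rho * nu**5) * (61 + 90 * t**2 + 45 * t**4)
    x = sec / nu
    xi = sec / (6 * nu**3) * (nu / rho + 2 * t**2)
    xii = sec / (120 * nu**5) * (5 + 28 * t**2 + 24 * t**4)
    xiia = sec / (5040 * nu**7) * (61 + 662 * t**2 + 1320 * t**4 + 720 * t**6)

    lat = lat - vii * de**2 + viii * de**4 - ix * de**6
    lon = lon0 + x * de - xi * de**3 + xii * de**5 - xiia * de**7
    return math.degrees(lat), math.degrees(lon)


print(grid_to_latlon(OSGB, 651409.903, 313177.270))
```

The sample point is the one used in the Ordnance Survey's own worked example, which gives 52° 39′ 27.2531″ N, 1° 43′ 4.5177″ E on OSGB36. Passing IRISH instead of OSGB applies the same maths with the Irish Grid constants.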
If you find this useful, please let me know by using the comments feature. Also, if anybody has a source of data for the Channel Islands or the Isle of Man I would be very pleased to hear about it!
Generating HTML5 using XSLT
by Mike on Jan.28, 2011, under Technology, Tutorials

Recently, I have been updating some of my HTML generation tools to output valid HTML5, rather than the XHTML 1.0 standard I have been using for the last few years. The main advantage from my perspective is the ability to use the more semantic block elements, such as the nav, section and article elements.
In general this is a fairly straightforward task, as I am generating clean XHTML using XSLT and my template library works pretty well, but I ran into some problems whilst validating the output using the W3C Validator.
The first issue is to sort the DOCTYPE out. The XHTML doctype looks like this:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This is easy to generate in XSLT using the following output element.
<xsl:output encoding="UTF-8" indent="yes" method="xml"
omit-xml-declaration="yes"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" />
This unfortunately forces the document to validate against the XHTML 1.0 specification which does not include all the lovely new semantic elements – which means that my new documents are suddenly invalid!
We need to generate:
<!DOCTYPE html>
which is really hard to do using XSLT. I have read a number of articles that suggest you output the element as text, however this is extremely ugly and as it turns out, incorrect.
The correct XSLT incantation is:
<xsl:output
method="xml"
doctype-system="about:legacy-compat"
encoding="UTF-8"
indent="yes" />
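This makes the processor emit <!DOCTYPE html SYSTEM "about:legacy-compat">. The about:legacy-compat system identifier is a dummy DTD that the HTML5 specification permits specifically for tools, such as XSLT processors, that cannot output the short <!DOCTYPE html> form.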
Now the W3C validator will happily validate against the HTML5 specification rather than the XHTML 1.0 specification.
Setting up a web project environment in Visual Studio 2010 to allow debugging using both IIS7 and the development Web server
by Mike on Jan.13, 2011, under Knowledge Base, Technology, Tutorials
We have run into an issue recently when importing a number of our legacy web projects into Visual Studio 2010. It seems that some projects don’t provide the options to allow us to run and debug them on the local version of IIS. For some reason, when a project is initially configured in VS2010 it sets a flag that determines whether the project properties page is shown inside the project window (where it will contain the necessary configuration options) or the project properties page is shown as a pop-up dialog window (where it doesn’t contain the options to allow debugging on a local IIS).
Here are the notes you will need when setting up a new project to guarantee that you have access to the required configuration page.
IIS 7 Configuration
The first step is entirely optional, as it is perfectly reasonable to run multiple Web Applications within a single Web Site. However, I prefer having separate Web Sites for different projects (or clients), as this allows me to control individual settings for each of them (for example host and port bindings) without worrying about breaking a different project.
If you will use “localhost” to access the server, then you will only be able to have one Web Site instance running at a time, but this is rarely an issue, and you can easily run more than one site at a time by using a unique hostname or port for each.
Create the site in IIS Manager:

Where you create the Web Site mostly doesn’t matter: Visual Studio will add your project to the site as a virtual directory, and the physical files can exist anywhere on your filesystem.


Once the site has been created, stop any other site that is using port 80, and start the Web site from the actions panel at the right of the window.
Because you want Visual Studio to modify the IIS configuration, you will need to run Visual Studio as Administrator.
Create an empty Web Application
Create a new project in Visual Studio (File > New > Project…).

Select: Visual C# > Web > ASP.NET Empty Web Application.
If you leave the “Create directory for solution” checkbox selected, then all your project files will end up one directory below the solution file. Personally I don’t mind having the .sln file in the same folder as my project files so I uncheck it.

If you click OK to create the various project files, then immediately compile and run the application you will see that by default Visual Studio has configured the development web server.

Reconfigure the project to use IIS

Now whilst this is useful for small standalone projects, we would ideally like to be able to switch development to the full version of IIS7, and to do this, we need to perform another couple of steps.
Back in Visual Studio, stop the project and in the Solution Explorer, right-click on the project and select “Properties”.
In the window that appears, select the Web option and scroll down until you can see the Servers section. Change the selection from “Use Visual Studio Development Server” to “Use Local IIS Web server”.

Click the “Create Virtual Directory” button to complete the IIS configuration. If you see the dialog below then everything has gone as expected and you can now run (and debug) your web application in IIS. You can switch between IIS and Cassini (the development web server) simply by changing the radio button back.

Using the same web.config for both IIS and Cassini
Because IIS7 uses a new integrated pipeline, and Cassini uses the classic pipeline, some of the settings in the web.config are not compatible. Whilst it is possible to use two web.config files (and move them in and out of the application directory when you switch webservers), it is much more convenient to use a single web.config with both webservers.
.NET will complain about the legacy settings if you are using the integrated pipeline (which we normally will want to do) unless you have the following included in the system.webServer section in the web.config:
<validation validateIntegratedModeConfiguration="false" />
This line instructs .NET not to validate the settings it finds in the web.config against the integrated pipeline. This allows settings that only work in classic mode (such as httpHandlers and httpModules) to co-exist with the handlers and modules sections that are required by IIS7.
Getting started with Libxml2 and Python (Part 2)
by Mike on Feb.21, 2007, under Knowledge Base, Technology, Tutorials
After I published the first part of this tutorial, John Dennis gave me some
feedback on the xml@gnome.org mailing list (http://mail.gnome.org/archives/xml/2007-February/thread.html).
He posed a couple of interesting questions:
1. How do I build complex Python objects by parsing an XML doc?
2. How can I serialize Python objects into XML?
Normally I would use pickling and unpickling to serialise Python objects, but
I can see some cases in which this might come in useful. Having a bit of a play
with dynamically creating objects in Python made me realise that this is a
non-trivial challenge as well, and so it is probably an ideal exercise for this
tutorial: creating arbitrary objects from an XML source document.
Simple example
Let’s imagine that we have the following XML document
<?xml version='1.0'?>
<user>
  <name>Mike Kneller</name>
  <homepage>http://www.mikekneller.com</homepage>
</user>
Dynamically populating an object is fairly straightforward in Python as it
allows the dynamic creation of object attributes. We just need to loop through
the document creating the properties. This is simple when we realise that
setattr(obj, 'foo', 123) is equivalent to obj.foo = 123.
import libxml2

class DynamicObject:
    pass

user = DynamicObject()
doc = libxml2.parseFile('userdata.xml')
child = doc.getRootElement().children
while child is not None:
    # skip the whitespace text nodes between elements
    if child.type == "element":
        setattr(user, child.name, child.content)
    child = child.next
doc.freeDoc()
That was almost too easy! Examining user shows that we have a Python object
populated with the contents of the XML document as expected.
>>> print user.__dict__
{'homepage': 'http://www.mikekneller.com', 'name': 'Mike Kneller'}
>>> print user.homepage
http://www.mikekneller.com
Although this is a useful routine, simply filling an object with string values
doesn’t really count as a ‘complex’ object (although it illustrates the point).
Its main use is the generation of arbitrary data objects where the order and
naming of the data is not known in advance.
Walking the tree
A ‘complex’ object would be one containing a mixture of data types, possibly
holding other objects and maybe some code.
This XML document (people.xml) contains a group of people that we would like to
load.
<?xml version='1.0'?>
<people>
  <person>
    <name>Mike</name>
    <age>34</age>
    <friends>
      <friend>Steve</friend>
      <friend>Mark</friend>
      <friend>Dave</friend>
    </friends>
  </person>
  <person>
    <name>Steve</name>
    <friends>
      <friend>Mike</friend>
      <friend>Mark</friend>
      <friend>Dave</friend>
    </friends>
    <hobbies>
      <hobby>Stamp collecting</hobby>
      <hobby>Train spotting</hobby>
    </hobbies>
  </person>
  <person>
    <name>Mark</name>
    <age>28</age>
    <friends>
      <friend>Mike</friend>
      <friend>Steve</friend>
    </friends>
  </person>
  <person>
    <name>Dave</name>
    <age>30</age>
    <friends>
      <friend>Mike</friend>
      <friend>Steve</friend>
    </friends>
  </person>
</people>
To construct arbitrary Python objects from a document like this, we will need to walk the tree. This example recurses through an XML document, printing out the node names and content and indenting as it goes.
import libxml2

def walkTree(xmlnode):
    child = xmlnode.children
    while child is not None:
        if not child.isBlankNode():
            if child.type == "element":
                childCount = int(child.xpathEval('count(*)'))
                # a count of the ancestor nodes tells us how deep in the
                # tree we are - lets just use it to indent our printed
                # output
                depth = int(child.xpathEval('count(ancestor::*)')) - 1
                if childCount == 0:
                    # If the count of child elements is 0 then we
                    # have a node only containing text
                    print(depth * '\t' + child.name + ' : ' + child.content)
                else:
                    # If the node contains other child elements then
                    # we can recurse down the tree
                    print(depth * '\t' + child.name)
                    walkTree(child)
        child = child.next

doc = libxml2.parseFile('people.xml')
root = doc.getRootElement()
walkTree(root)
doc.freeDoc()
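The same recursion can build objects instead of printing. The following is only a sketch of one way to do it, and it swaps libxml2 for the standard library's xml.etree.ElementTree so that it runs without extra packages. The build function and the trimmed-down sample document are my own illustration. Repeated child names are collected into a list, and everything else is left as a string.

```python
import xml.etree.ElementTree as ET


class DynamicObject(object):
    pass


def build(element):
    """Return a string for a leaf element, or an object with one attribute per child name."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    obj = DynamicObject()
    for child in children:
        value = build(child)
        if hasattr(obj, child.tag):
            # Second and later occurrences of a name turn the attribute into a list.
            existing = getattr(obj, child.tag)
            if not isinstance(existing, list):
                existing = [existing]
                setattr(obj, child.tag, existing)
            existing.append(value)
        else:
            setattr(obj, child.tag, value)
    return obj


# A shortened version of people.xml, inlined so the example is self-contained.
SAMPLE = """<people>
  <person>
    <name>Mike</name><age>34</age>
    <friends><friend>Steve</friend><friend>Mark</friend></friends>
  </person>
  <person>
    <name>Steve</name>
    <friends><friend>Mike</friend><friend>Mark</friend></friends>
  </person>
</people>"""

people = build(ET.fromstring(SAMPLE))
for person in people.person:
    print(person.name, person.friends.friend, getattr(person, "age", None))
```

One limitation of this approach is that a name which occurs only once stays a plain value rather than a one-item list, so code consuming the objects has to cope with both shapes. Optional elements (Steve has no age) simply end up as missing attributes.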
Using libxml2 and python to scrape content from a website
by Mike on Feb.06, 2007, under Knowledge Base, Technology, Tutorials
This is a practical example, using Libxml2 to parse a real-world Web page. I chose the TV listings pages from the Guardian website, as they are the type of page you are likely to want to scrape for useful data.
Additionally, the Guardian TV listings contain a couple of very typical HTML errors. The listings are contained within a table, and some of the rows in the table are not closed.
The Libxml2 parser recovers from these errors by closing the tags at the end of the page and then continuing parsing from the next useable opening tag. This leaves us with a tr tag containing duplicated content from later in the document, which this code handles in a simple way by splitting the broken content on the next opening tag, and using another document instance to close the unterminated tags.
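As an aside, the "an opening tag implicitly closes the previous row" idea can be shown in isolation. This sketch does not use Libxml2 at all; it swaps in the event-based html.parser module from the Python 3 standard library, and the RowCollector class, table_rows function and sample markup are all made up for illustration. It ignores nested tables, which the real page would need to deal with.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect table rows as lists of cell text, tolerating missing </tr> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> implicitly closes any row that was left open.
            self.rows.append([])
            self.in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <font> still belongs to the open cell.
        if self.in_cell:
            self.rows[-1][-1] += data


def table_rows(html):
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Made-up markup: the first row is missing its closing </tr>.
BROKEN = (
    "<table>"
    "<tr><td>6.00am</td><td><font>Morning News</font></td>"
    "<tr><td>9.15am</td><td><font>Cookery Show</font></td></tr>"
    "</table>"
)

for row in table_rows(BROKEN):
    print(row)
```

Because the parser only reacts to events, an unclosed row never swallows the rows that follow it, which is the problem the split-and-reparse workaround in the script below has to solve after the fact.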
TV Listings
import datetime
import libxml2

# This script reads the TV listings from the Guardian TV listings website
# (http://www.guardian.co.uk/TV/)
parse_options = (libxml2.HTML_PARSE_RECOVER +
                 libxml2.HTML_PARSE_NOERROR +
                 libxml2.HTML_PARSE_NOWARNING)
today = datetime.date.today()
tomorrow = today + datetime.timedelta(days=1)

class Channel:
    def __init__(self, name):
        self.name = name
        self.entries = []

class ListingEntry:
    def __init__(self, when):
        self.when = when
        self.title = ''
        self.content = ''

def newTime(node, entries):
    timeStr = str(node.content).strip()
    # the Guardian listings show times like 6.00am or 5.45pm, we need to
    # turn this into a more useable form. A Python datetime object will do
    # just fine. It is also worth noting that the listings run from 6.00am
    # to 6.00am, so we need to account for a date boundary at midnight.
    if (timeStr.count('am') > 0) or (timeStr.count('pm') > 0):
        t = timeStr.split('.')
        hour = int(t[0])
        minute = int(t[1][0:2])
        ampm = t[1][2:4]
        if (ampm == 'pm') and (hour < 12):
            hour += 12
        if (ampm == 'am') and (hour == 12):
            # 12.xxam is just after midnight, not midday
            hour = 0
        if (hour < 6):
            date = tomorrow
        else:
            date = today
        when = datetime.datetime(date.year, date.month, date.day, hour, minute)
        newEntry = ListingEntry(when)
        entries.append(newEntry)

def newProgramme(node, entries):
    # luckily for us, all the Guardian TV entries are wrapped in <font> tags
    # which is bad for accessibility, but gives us a known node to grab
    items = node.xpathEval('.//font/node()')
    for item in items:
        if not item.isBlankNode():
            if (item.type == 'text') or (item.type == 'element'):
                if entries[-1].title == '':
                    entries[-1].title += str(item.content).strip()
                else:
                    entries[-1].content += str(item.content).strip() + '\n'

def processSourceHTML(url, entries):
    doc = libxml2.htmlReadFile(url, None, parse_options)
    listingTable = doc.xpathEval('//table')[6]
    rows = listingTable.xpathEval('.//tr')
    for row in rows:
        if len(row.xpathEval('.//tr')) > 0:
            # This row is broken, tr tags should not contain more tr tags!
            # it probably is missing one or more closing tags and therefore
            # needs special handling.
            fixup = row.serialize()
            # use a separate name so we don't rebind the list we are looping over
            fragments = fixup.split('<tr>')
            # Here we load the broken HTML fragment into another document
            # to extract whatever we can from it.
            fixDoc = libxml2.htmlReadDoc('<html>' + fragments[1] + '</html>',
                                         '', None, parse_options)
            cells = fixDoc.xpathEval('//td')
            for cell in cells:
                if cell.prev is None:
                    # if the cell has no previous sibling then it is the first
                    # cell in the row, i.e. the one containing the time
                    newTime(cell, entries)
                else:
                    newProgramme(cell, entries)
            fixDoc.freeDoc()
        else:
            cells = row.xpathEval('td')
            for cell in cells:
                if cell.prev is None:
                    # if the cell has no previous sibling then it is the first
                    # cell in the row, i.e. the one containing the time
                    newTime(cell, entries)
                else:
                    newProgramme(cell, entries)
    doc.freeDoc()

channels = []
# We could do more here from an automation perspective - spider the list
# of channels, automatically populating the channel names etc...
# but this is left as an exercise for the reader
channels.append(Channel('BBC1'))
processSourceHTML(
    'http://www.guardian.co.uk/TV/bbc1s_meridian.html',
    channels[-1].entries)
channels.append(Channel('BBC2'))
processSourceHTML(
    'http://www.guardian.co.uk/TV/bbc2s_meridian.html',
    channels[-1].entries)
channels.append(Channel('ITV - Meridian'))
processSourceHTML(
    'http://www.guardian.co.uk/TV/meridian_meridian.html',
    channels[-1].entries)
channels.append(Channel('Channel 4'))
processSourceHTML(
    'http://www.guardian.co.uk/TV/ch4_meridian.html',
    channels[-1].entries)

for channel in channels:
    print(channel.name)
    for entry in channel.entries:
        print("----")
        print(entry.when)
        print(entry.title)
        print("----")
        print(entry.content)
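The time handling is the fiddliest part of the script, so here it is again as a standalone sketch that can be tried without fetching any pages. The function name listing_time and the choice to pass the schedule date in as a parameter are my own, purely for illustration. It assumes the same 6.00am to 6.00am schedule day described in the comments above, and treats 12.xxam as just after midnight.

```python
import datetime


def listing_time(text, today):
    """Convert a listings time such as '6.00am' or '11.35pm' into a datetime.

    Times before 6am are taken to belong to the following calendar date.
    Returns None when the text does not look like a time.
    """
    text = text.strip().lower()
    ampm = text[-2:]
    if ampm not in ("am", "pm") or "." not in text:
        return None
    try:
        hour_str, minute_str = text[:-2].split(".", 1)
        hour, minute = int(hour_str), int(minute_str)
    except ValueError:
        return None
    if ampm == "pm" and hour < 12:
        hour += 12
    elif ampm == "am" and hour == 12:
        hour = 0
    if hour < 6:
        day = today + datetime.timedelta(days=1)
    else:
        day = today
    return datetime.datetime(day.year, day.month, day.day, hour, minute)


schedule_date = datetime.date(2007, 2, 6)
for sample in ("6.00am", "5.45pm", "12.30am", "Programme title"):
    print(sample, listing_time(sample, schedule_date))
```

Passing the date in, rather than reading today's date inside the function, makes the midnight rollover easy to check against fixed examples.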