Tag: Web Development

Accurate Postcode data for Great Britain and Northern Ireland

by Mike on Jun.06, 2011, under Knowledge Base, Technology


Getting accurate postcode data for use in my programs has proved to be an interesting technical challenge, despite the availability of free services such as those from Google.

I needed a UK postcode to Latitude/Longitude conversion tool that would work offline, which made the Google APIs completely useless.

The Ordnance Survey publishes a free dataset under the name “Code-Point Open”. However, it is missing some data (notably Northern Ireland), so it can’t really be called complete, and it gives locations as Ordnance Survey grid references rather than Latitude/Longitude.

If you need to do the same, then hopefully the tools here will be of some help.

First we need to grab the Ordnance Survey Code-Point Open dataset. They don’t provide a direct link, so you will need to register to get access to the data. Once downloaded and extracted you will have a folder full of CSV files.

Secondly we need to convert the grid references to more usable Latitude/Longitude values. Chris Veness has written some JavaScript to do just that, as well as some incredibly informative articles about how this stuff actually works (there’s some horrible maths involved!). I spent a couple of evenings converting his scripts to Python and wrote a little data extract utility which puts all the data into a SQLite3 database.
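To give a flavour of the maths, here is a minimal Python 3 sketch of the grid-to-latitude/longitude step for the British National Grid, using the series expansion from the Ordnance Survey's guide to coordinate systems. It is not the code from this post: the function name is made up for illustration, and the result is on the OSGB36 datum (Airy 1830 ellipsoid), so a further datum transformation is still needed to get the WGS84 values used by GPS and most web maps.

```python
import math

# Airy 1830 ellipsoid and National Grid projection constants
A = 6377563.396            # semi-major axis (m)
B = 6356256.909            # semi-minor axis (m)
F0 = 0.9996012717          # scale factor on the central meridian
LAT0 = math.radians(49)    # latitude of the true origin
LON0 = math.radians(-2)    # longitude of the true origin
E0, N0 = 400000, -100000   # grid coordinates of the true origin (m)


def os_grid_to_osgb36(easting, northing):
    """Hypothetical helper: National Grid metres -> OSGB36 (lat, lon) in degrees."""
    e2 = 1 - (B * B) / (A * A)
    n = (A - B) / (A + B)
    n2, n3 = n * n, n ** 3

    # Iterate until the meridional arc M matches the northing.
    lat, M = LAT0, 0.0
    for _ in range(100):  # safety cap; a handful of passes is normally enough
        lat += (northing - N0 - M) / (A * F0)
        d, s = lat - LAT0, lat + LAT0
        M = B * F0 * (
            (1 + n + 1.25 * n2 + 1.25 * n3) * d
            - (3 * n + 3 * n2 + 2.625 * n3) * math.sin(d) * math.cos(s)
            + (1.875 * n2 + 1.875 * n3) * math.sin(2 * d) * math.cos(2 * s)
            - (35 / 24) * n3 * math.sin(3 * d) * math.cos(3 * s)
        )
        if abs(northing - N0 - M) < 1e-5:
            break

    sin_lat, cos_lat, t = math.sin(lat), math.cos(lat), math.tan(lat)
    nu = A * F0 / math.sqrt(1 - e2 * sin_lat ** 2)               # transverse radius
    rho = A * F0 * (1 - e2) / (1 - e2 * sin_lat ** 2) ** 1.5     # meridional radius
    eta2 = nu / rho - 1
    t2, t4, t6 = t ** 2, t ** 4, t ** 6
    sec = 1 / cos_lat

    # Series coefficients, named as in the OS guide
    VII = t / (2 * rho * nu)
    VIII = t / (24 * rho * nu ** 3) * (5 + 3 * t2 + eta2 - 9 * t2 * eta2)
    IX = t / (720 * rho * nu ** 5) * (61 + 90 * t2 + 45 * t4)
    X = sec / nu
    XI = sec / (6 * nu ** 3) * (nu / rho + 2 * t2)
    XII = sec / (120 * nu ** 5) * (5 + 28 * t2 + 24 * t4)
    XIIA = sec / (5040 * nu ** 7) * (61 + 662 * t2 + 1320 * t4 + 720 * t6)

    dE = easting - E0
    lat_out = lat - VII * dE ** 2 + VIII * dE ** 4 - IX * dE ** 6
    lon_out = LON0 + X * dE - XI * dE ** 3 + XII * dE ** 5 - XIIA * dE ** 7
    return math.degrees(lat_out), math.degrees(lon_out)
```

A useful sanity check is the worked example in the OS guide: easting 651409.903, northing 313177.270, which the guide gives as roughly 52.6576°N, 1.7179°E on OSGB36.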

Another couple of hours to write a set of wrapper functions around the database and include the validation rules from the UK government data schema and hey presto! We have a usable postcode_utils module.
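As a rough illustration of what such a wrapper can look like, here is a small Python 3 sketch using only the standard library. Everything in it is an assumption made for the example: the table and column names, the function names, and the regular expression, which only checks the general shape of a postcode and is much simpler than the full government validation rules. The row inserted into the demo database is made up and is not taken from the real dataset.

```python
import re
import sqlite3

# Simplified shape check only; the real validation rules have more special cases.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? [0-9][A-Z]{2}$")


def normalise_postcode(raw):
    """Upper-case, drop whitespace and re-insert the space before the last three characters."""
    compact = "".join(raw.split()).upper()
    if len(compact) < 5:
        raise ValueError("too short to be a postcode: %r" % raw)
    candidate = compact[:-3] + " " + compact[-3:]
    if not POSTCODE_RE.match(candidate):
        raise ValueError("not a valid postcode shape: %r" % raw)
    return candidate


def lookup(conn, raw):
    """Return (latitude, longitude) for a postcode, or None if it is not in the table."""
    return conn.execute(
        "SELECT latitude, longitude FROM postcodes WHERE postcode = ?",
        (normalise_postcode(raw),),
    ).fetchone()


# Demo with an in-memory database and one made-up row (not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postcodes (postcode TEXT PRIMARY KEY, latitude REAL, longitude REAL)"
)
conn.execute("INSERT INTO postcodes VALUES ('ZZ9 9ZZ', 50.0, -1.0)")
```

Normalising before the query means that "zz99zz", "ZZ9 9ZZ" and " zz9  9zz " all hit the same row, so callers don't need to care how the user typed the postcode.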

This just leaves Northern Ireland…

It turns out that the Northern Ireland Statistics and Research Agency (NISRA) has some GIS data available in ESRI Shapefile format. There is a useful Python library for working with ESRI Shapefiles, and it took a matter of minutes to extract the coordinates from the NISRA data, which provides postcodes in… bah! Another grid system, but this time it’s the Irish Grid!

Some more hunting online to get the right values for the grid transformations and the ellipsoid dimensions used for each grid system, a bit of tinkering with Chris Veness’ code, and we are done! Here is a SQLite3 database containing all the postcodes in Great Britain and Northern Ireland with accurate Latitudes and Longitudes (more accurate than some Web Services I could mention) and some Python to use with it. This information is all freely available and free for both commercial and non-commercial use (provided you give attribution as mentioned on the various sites linked to above).

If you find this useful, please let me know by using the comments feature. Also, if anybody has a source of data for the Channel Islands or the Isle of Man, I would be very pleased to hear about it!

Generating HTML5 using XSLT

by Mike on Jan.28, 2011, under Technology, Tutorials

Recently, I have been updating some of my HTML generation tools to output valid HTML5, rather than the XHTML 1.0 standard I have been using for the last few years. The main advantage from my perspective is the ability to use the more semantic block elements, such as the nav, section and article elements.

In general this is a fairly straightforward task, as I am generating clean XHTML using XSLT and my template library works pretty well, but I ran into some problems whilst validating the output using the W3C Validator.

The first issue is to sort the DOCTYPE out. The XHTML doctype looks like this:

<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

This is easy to generate in XSLT using the following output element.

<xsl:output encoding="UTF-8" indent="yes" method="xml"
    omit-xml-declaration="yes"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" />

This unfortunately forces the document to validate against the XHTML 1.0 specification which does not include all the lovely new semantic elements – which means that my new documents are suddenly invalid!

We need to generate:

<!DOCTYPE html>

which is awkward to do using XSLT, because the output element only emits a DOCTYPE when it is given a system identifier. I have read a number of articles that suggest you output the DOCTYPE as text, however this is extremely ugly and, as it turns out, incorrect.

The correct XSLT incantation is:

<xsl:output
     method="xml"
     doctype-system="about:legacy-compat"
     encoding="UTF-8"
     indent="yes" />

This emits <!DOCTYPE html SYSTEM "about:legacy-compat">, a doctype with a dummy system identifier that the HTML5 specification explicitly allows for tools, such as XSLT processors, that cannot produce the short form.

Now the W3C validator will happily validate against the HTML5 specification rather than the XHTML 1.0 specification.

Setting up a web project environment in Visual Studio 2010 to allow debugging using both IIS7 and the development Web server

by Mike on Jan.13, 2011, under Knowledge Base, Technology, Tutorials

We have run into an issue recently when importing a number of our legacy web projects into Visual Studio 2010. It seems that some projects don’t provide the options to allow us to run and debug them on the local version of IIS. For some reason, when a project is initially configured in VS2010 it sets a flag that determines whether the project properties page is shown inside the project window (where it will contain the necessary configuration options) or the project properties page is shown as a pop-up dialog window (where it doesn’t contain the options to allow debugging on a local IIS).

Here are the notes you will need when setting up a new project to guarantee that you have access to the required configuration page.

IIS 7 Configuration

The first step is entirely optional, as it is perfectly reasonable to run multiple Web Applications within a single Web Site. However, I prefer having separate Web Sites for different projects (or clients), as this allows me to control individual settings for each of them (for example host and port bindings) without worrying about breaking a different project.

If you use “localhost” on the default port to access the server, then you will only be able to have one Web Site instance running at a time. This is rarely an issue, and you can easily run more than one site at a time by using a unique hostname or port for each.

Create the site in IIS Manager:

Where you create the Web Site mostly doesn’t matter: Visual Studio will add your project to the site as a virtual directory, and the physical files can exist anywhere on your filesystem.

Once the site has been created, stop any other site that is using port 80, and start the Web site from the actions panel at the right of the window.

Because you want Visual Studio to modify the IIS configuration, you will need to run Visual Studio as Administrator.

Create an empty Web Application

Create a new project in Visual Studio (File > New > Project…).

Select: Visual C# > Web > ASP.NET Empty Web Application.

If you leave the “Create directory for solution” checkbox selected, then all your project files will end up one directory below the solution file. Personally I don’t mind having the .sln file in the same folder as my project files so I uncheck it.

If you click OK to create the various project files, then immediately compile and run the application you will see that by default Visual Studio has configured the development web server.

Reconfigure the project to use IIS

Now whilst this is useful for small standalone projects, we would ideally like to be able to switch development to the full version of IIS7, and to do this, we need to perform another couple of steps.

Back in Visual Studio, stop the project and in the Solution Explorer, right-click on the project and select “Properties”.

In the window that appears, select the Web option and scroll down until you can see the Servers section. Change the selection from “Use Visual Studio Development Server” to “Use Local IIS Web server”.

Click the “Create Virtual Directory” button to complete the IIS configuration. If you see the dialog below then everything has gone as expected and you can now run (and debug) your web application in IIS. You can switch between IIS and Cassini (the development web server) simply by changing the radio button back.

Using the same web.config for both IIS and Cassini

Because IIS7 uses a new integrated pipeline and Cassini uses the classic pipeline, some of the settings in the web.config are not compatible between the two. Whilst it is possible to use two web.config files (and move them in and out of the application directory when you switch web servers), it is much more convenient to use a single web.config with both web servers.

IIS7 will complain about the legacy settings if you are using the integrated pipeline (which we normally will want to do) unless you include the following in the system.webServer section of the web.config:

<validation validateIntegratedModeConfiguration="false" />

This line instructs IIS7 not to validate the settings it finds in the web.config against the integrated pipeline. This allows you to have settings that only work in classic mode (such as httpHandlers and httpModules) co-existing with the handlers and modules sections that are required by IIS7.
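If you have a lot of legacy projects to convert, the change can be scripted. The following is only a sketch, written in Python 3 with the standard library; the function name and the sample configuration are invented for illustration, and a real web.config will contain far more than this. Note that ElementTree discards comments and the XML declaration when it re-serialises a document, so check the result before overwriting a file you care about.

```python
import xml.etree.ElementTree as ET


def ensure_validation_flag(config_xml):
    """Return the web.config text with integrated-mode validation switched off."""
    root = ET.fromstring(config_xml)

    # Find or create <system.webServer> directly under <configuration>.
    web_server = root.find("system.webServer")
    if web_server is None:
        web_server = ET.SubElement(root, "system.webServer")

    # Find or create <validation>, then set the flag on it.
    validation = web_server.find("validation")
    if validation is None:
        validation = ET.SubElement(web_server, "validation")
    validation.set("validateIntegratedModeConfiguration", "false")

    return ET.tostring(root, encoding="unicode")


# A deliberately tiny, made-up configuration with both styles of section.
sample = """<configuration>
  <system.web>
    <httpModules />
  </system.web>
  <system.webServer>
    <modules />
  </system.webServer>
</configuration>"""

updated = ensure_validation_flag(sample)
```

Running the function a second time on its own output leaves a single validation element in place, so it is safe to apply to projects that have already been fixed.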

Getting started with Libxml2 and Python (Part 2)

by Mike on Feb.21, 2007, under Knowledge Base, Technology, Tutorials

After I published the first part of this tutorial, John Dennis gave me some
feedback on the xml@gnome.org mailing list (http://mail.gnome.org/archives/xml/2007-February/thread.html).
He posed a couple of interesting questions:

  1. How do I build complex Python objects by parsing an XML doc?
  2. How can I serialize Python objects into XML?

Normally I would use pickling and unpickling to serialise Python objects, but
I can see some cases in which this might come in useful. Having a bit of a play
with dynamically creating objects in Python made me realise that this is a
non-trivial challenge as well, and so creating arbitrary objects from an XML
source document is probably an ideal exercise for this tutorial.

Simple example

Let’s imagine that we have the following XML document (userdata.xml):

	<?xml version='1.0'?>
	<user>
		<name>Mike Kneller</name>
		<homepage>http://www.mikekneller.com</homepage>
	</user>

Dynamically populating an object is fairly straightforward in Python as it
allows the dynamic creation of object attributes. We just need to loop through
the document creating the properties. This is simple when we realise that
setattr(obj, 'foo', 123) is equivalent to obj.foo = 123.

	import libxml2

	class DynamicObject:
		pass

	user = DynamicObject()

	doc = libxml2.parseFile( 'userdata.xml' )
	child = doc.getRootElement().children

	while child is not None:
		if child.type == "element":
			setattr(user, child.name, child.content)
		child = child.next
	doc.freeDoc()

That was almost too easy! Examining user shows that we have a Python object
populated with the contents of the XML document as expected.

	>>> print user.__dict__
	{'homepage': 'http://www.mikekneller.com', 'name': 'Mike Kneller'}
	>>> print user.homepage
	http://www.mikekneller.com

Although this is a useful routine, simply filling an object with string values
doesn’t really count as a ‘complex’ object (although it illustrates the point).
Its main use is the generation of arbitrary data objects where the order and
naming of the data is not known in advance.
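The reverse direction (the second question above) can be handled in a similar spirit. The sketch below is an assumption-laden illustration rather than part of the original tutorial: it is written for Python 3 and swaps libxml2 for the standard library's xml.etree.ElementTree so that it runs without extra packages, and the to_xml helper is a name invented here. It simply writes every attribute of an object out as a child element.

```python
import xml.etree.ElementTree as ET


class DynamicObject:
    pass


def to_xml(obj, root_name):
    """Write each attribute of obj as a child element of <root_name>."""
    root = ET.Element(root_name)
    # Sort the attribute names so the output order is predictable.
    for name, value in sorted(vars(obj).items()):
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


user = DynamicObject()
user.name = "Mike Kneller"
user.homepage = "http://www.mikekneller.com"

xml_text = to_xml(user, "user")
```

This only works while every attribute name is also a valid XML element name and every value has a sensible string form; anything richer needs the tree-walking approach described next.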

Walking the tree

A ‘complex’ object would be one containing a mixture of data types, possibly
holding other objects and maybe some code.

This XML document (people.xml) contains a group of people that we would like to
load.

	<?xml version='1.0'?>
	<people>
		<person>
			<name>Mike</name>
			<age>34</age>
			<friends>
				<friend>Steve</friend>
				<friend>Mark</friend>
				<friend>Dave</friend>
			</friends>
		</person>

		<person>
			<name>Steve</name>
			<friends>
				<friend>Mike</friend>
				<friend>Mark</friend>
				<friend>Dave</friend>
			</friends>
			<hobbies>
				<hobby>Stamp collecting</hobby>
				<hobby>Train spotting</hobby>
			</hobbies>
		</person>

		<person>
			<name>Mark</name>
			<age>28</age>
			<friends>
				<friend>Mike</friend>
				<friend>Steve</friend>
			</friends>
		</person>

		<person>
			<name>Dave</name>
			<age>30</age>
			<friends>
				<friend>Mike</friend>
				<friend>Steve</friend>
			</friends>
		</person>
	</people>

To construct arbitrary Python objects from a document like this, we will need to walk the tree. This example recurses an XML document, printing out the node names and content, indenting as it goes.

	import libxml2

	def walkTree(xmlnode):
		child = xmlnode.children
		while child is not None:
			if not child.isBlankNode():
				if child.type == "element":
					childCount = int(child.xpathEval('count(*)'))

					# a count of the ancestor nodes tells us how deep in the
					# tree we are - let's just use it to indent our printed
					# output
					depth = int(child.xpathEval('count(ancestor::*)')) - 1
					if childCount == 0:
						# If the count of child elements is 0 then we
						# have a node only containing text
						print  depth * '\t' + child.name + ' : ' + child.content
					else:
						# If the node contains other child elements then
						# we can recurse down the tree
						print depth * '\t' + child.name
						walkTree(child)

			child = child.next

	doc = libxml2.parseFile('people.xml')
	root = doc.getRootElement()

	walkTree(root)

	doc.freeDoc()
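The same recursion can build objects instead of printing. What follows is a minimal sketch under a few assumptions: it targets Python 3, it uses the standard library's xml.etree.ElementTree instead of libxml2 so that it is self-contained, the build function is a name invented for this example, and the embedded document is a cut-down version of people.xml. Leaf elements become strings, elements with children become DynamicObject instances, and a tag that repeats under the same parent is collected into a list.

```python
import xml.etree.ElementTree as ET


class DynamicObject:
    pass


def build(element):
    """Turn an element into a string (leaf) or a DynamicObject (branch)."""
    children = list(element)
    if not children:
        return (element.text or "").strip()

    obj = DynamicObject()
    for child in children:
        value = build(child)
        if hasattr(obj, child.tag):
            # A repeated tag: collect the values in a list.
            existing = getattr(obj, child.tag)
            if not isinstance(existing, list):
                existing = [existing]
            existing.append(value)
            setattr(obj, child.tag, existing)
        else:
            setattr(obj, child.tag, value)
    return obj


doc = """<people>
  <person><name>Mike</name><age>34</age>
    <friends><friend>Steve</friend><friend>Mark</friend></friends></person>
  <person><name>Steve</name>
    <friends><friend>Mike</friend></friends></person>
</people>"""

people = build(ET.fromstring(doc))
```

After this runs, people.person is a list of two objects and people.person[0].friends.friend is a list of names. One wrinkle of this simple rule is that a tag appearing only once stays a plain value, so Steve's single friend comes back as a string rather than a one-item list; a real loader would probably want a schema or naming convention to decide which tags are always lists.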

Using libxml2 and python to scrape content from a website

by Mike on Feb.06, 2007, under Knowledge Base, Technology, Tutorials

This is a practical example, using Libxml2 to parse a real-world Web page. I chose the TV listings pages from the Guardian Website as they are the type of page you are likely to want to scrape for useful data.

Additionally, the Guardian TV listings contain a couple of very typical HTML errors. The listings are contained within a table, and some of the rows in the table are not closed.

The Libxml2 parser recovers from these errors by closing the tags at the end of the page and then continuing parsing from the next usable opening tag. This leaves us with a tr tag containing duplicated content from later in the document. The code below handles this in a simple way: it splits the broken content on the next opening tag and uses another document instance to close the unterminated tags.

TV Listings

import libxml2, datetime

# This script reads the TV listings from the Guardian TV listings website
# (http://www.guardian.co.uk/TV/)

parse_options = libxml2.HTML_PARSE_RECOVER + \
	libxml2.HTML_PARSE_NOERROR + \
	libxml2.HTML_PARSE_NOWARNING

today = datetime.date.today()
tomorrow = today + datetime.timedelta(days=1)

class Channel:
	def __init__(self, name):
		self.name = name
		self.entries = []

class ListingEntry:
	def __init__(self,when):
		self.when = when
		self.title = ''
		self.content = ''

def newTime(node, entries):
	timeStr = str(node.content).strip()

	# the Guardian listings show times like 6.00am or 5.45pm, we need to
	# turn this into a more usable form. A Python datetime object will do
	# just fine. It is also worth noting that the listings run from 6.00am
	# to 6.00am, so we need to account for a date boundary at midnight.
	if (timeStr.count('am') > 0) or (timeStr.count('pm') > 0):
		t = timeStr.split('.')
		hour = int(t[0])
		minute = int(t[1][0:2])
		ampm = t[1][2:4]

		if (ampm == 'pm') and (hour < 12):
			hour += 12
		elif (ampm == 'am') and (hour == 12):
			# 12.xxam is just after midnight, so it is hour 0
			hour = 0

		if (hour < 6):
			date = tomorrow
		else:
			date = today

		when = datetime.datetime(date.year, date.month, date.day, hour, minute)
		newEntry = ListingEntry(when)
		entries.append(newEntry)

def newProgramme(node, entries):
	# luckily for us, all the Guardian TV entries are wrapped in <font> tags
	# which is bad for accessibility, but gives us a known node to grab
	items = node.xpathEval('.//font/node()')
	for item in items:
		if not item.isBlankNode():
			if (item.type == 'text') or (item.type == 'element'):
				if entries[-1].title == '':
					entries[-1].title += str(item.content).strip()
				else:
					entries[-1].content += str(item.content).strip() + '\n'

def processSourceHTML(url,entries):
	doc = libxml2.htmlReadFile(url, None, parse_options)
	listingTable = doc.xpathEval('//table')[6]
	rows = listingTable.xpathEval('.//tr')
	for row in rows:
		if len(row.xpathEval('.//tr')) > 0:
			# This row is broken, tr tags should not contain more tr tags!
			# it probably is missing one or more closing tags and therefore
			# needs special handling.
			fixup = row.serialize()

			# use a separate name so we don't shadow the rows being looped over
			fragments = fixup.split('<tr>')

			# Here we load the broken HTML fragment into another document
			# to extract whatever we can from it.
			fixDoc = libxml2.htmlReadDoc('<html>'+fragments[1]+'</html>', \
				'', None, parse_options)

			cells = fixDoc.xpathEval('//td')
			for cell in cells:
				if cell.prev == None:
					# if the cell has no previous sibling then it is the first
					# cell in the row, e.g. the one containing the time
					newTime(cell, entries)
				else:
					newProgramme(cell, entries)

			fixDoc.freeDoc()
		else:
			cells = row.xpathEval('td')
			for cell in cells:
				if cell.prev == None:
					# if the cell has no previous sibling then it is the first
					# cell in the row, e.g. the one containing the time
					newTime(cell, entries)
				else:
					newProgramme(cell, entries)
	doc.freeDoc()

channels = []

# We could do more here from an automation perspective - spider the list
# of channels, automatically populating the channel names etc...
# but this is left as an exercise for the reader
channels.append( Channel('BBC1') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/bbc1s_meridian.html', \
	channels[-1].entries)

channels.append( Channel('BBC2') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/bbc2s_meridian.html', \
	channels[-1].entries)

channels.append( Channel('ITV - Meridian') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/meridian_meridian.html', \
	channels[-1].entries)

channels.append( Channel('Channel 4') )
processSourceHTML( \
	'http://www.guardian.co.uk/TV/ch4_meridian.html', \
	channels[-1].entries)

for channel in channels:
	print channel.name
	for entry in channel.entries:
		print "----"
		print entry.when
		print entry.title
		print "----"
		print entry.content
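The time handling is the fiddliest part of the script, so it is worth looking at in isolation. Here is a standalone Python 3 sketch of the same idea; the function name is invented for this example, and it assumes (as the script does) that the listings run from 6.00am to 6.00am and that times always look like 6.00am or 5.45pm. Passing the date in as a parameter, rather than reading a global, makes the midnight boundary easy to test.

```python
import datetime


def parse_listing_time(text, today):
    """Convert a listing time such as '6.00am' or '12.30am' into a datetime.

    Anything before 6am is treated as belonging to the following day,
    because the listings are assumed to run from 6.00am to 6.00am.
    """
    text = text.strip().lower()
    hour_part, rest = text.split(".")
    hour, minute, ampm = int(hour_part), int(rest[:2]), rest[2:4]
    if ampm not in ("am", "pm"):
        raise ValueError("expected am or pm in %r" % text)

    hour = hour % 12      # 12am -> 0; 12pm -> 0 for now
    if ampm == "pm":
        hour += 12        # 1pm -> 13; 12pm -> 12

    day = today + datetime.timedelta(days=1) if hour < 6 else today
    return datetime.datetime(day.year, day.month, day.day, hour, minute)
```

Taking the hour modulo 12 before adding the pm offset deals with both awkward cases at once: 12.30am becomes half past midnight on the following day, and 12.00pm stays at noon.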