Generating HTML5 using XSLT
by Mike on Jan.28, 2011, under Technology, Tutorials

Recently, I have been updating some of my HTML generation tools to output valid HTML5, rather than the XHTML 1.0 standard I have been using for the last few years. The main advantage from my perspective is the ability to use the more semantic block elements, such as the nav, section and article elements.
In general this is a fairly straightforward task, as I am generating clean XHTML using XSLT and my template library works pretty well, but I ran into some problems whilst validating the output using the W3C Validator.
The first issue is to sort the DOCTYPE out. The XHTML doctype looks like this:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This is easy to generate in XSLT using the following output element.
<xsl:output encoding="UTF-8" indent="yes" method="xml"
omit-xml-declaration="yes"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" />
This unfortunately forces the document to validate against the XHTML 1.0 specification which does not include all the lovely new semantic elements – which means that my new documents are suddenly invalid!
We need to generate:
<!DOCTYPE html>
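which is really hard to do using XSLT, because xsl:output only writes a DOCTYPE when doctype-system is set, and the result then always carries a system identifier. I have read a number of articles that suggest you output the declaration as text; however, this is extremely ugly and, as it turns out, incorrect.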
The correct XSLT incantation is:
<xsl:output
method="xml"
doctype-system="about:legacy-compat"
encoding="UTF-8"
indent="yes" />
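This makes the processor emit <!DOCTYPE html SYSTEM "about:legacy-compat">. The about:legacy-compat system identifier is a placeholder rather than a real DTD: the HTML5 specification allows it for generators, such as XSLT processors, that cannot emit the short doctype.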
Now the W3C validator will happily validate against the HTML5 specification rather than the XHTML 1.0 specification.
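A quick local sanity check before sending output to the validator is to confirm which doctype the transform actually produced. The following is a minimal sketch in Python using only the standard library; the function name has_html5_doctype and the sample strings are hypothetical, and it is no substitute for the W3C Validator itself. It also tolerates a leading XML declaration, since the output element above does not set omit-xml-declaration.

```python
import re

# Accepts the short HTML5 doctype and the legacy-compat variant that
# doctype-system="about:legacy-compat" produces. An XML declaration may
# come first, and the system identifier may use single or double quotes.
HTML5_DOCTYPE = re.compile(
    r"""(?:<\?xml[^>]*\?>\s*)?<!DOCTYPE\s+html(?:\s+SYSTEM\s+(["'])about:legacy-compat\1)?\s*>""",
    re.IGNORECASE,
)


def has_html5_doctype(document: str) -> bool:
    """Return True if the document begins with an HTML5-compatible doctype."""
    # Skip a leading byte-order mark and whitespace before matching.
    return HTML5_DOCTYPE.match(document.lstrip("\ufeff \t\r\n")) is not None


# Hypothetical sample outputs for illustration only.
xslt_output = '<!DOCTYPE html SYSTEM "about:legacy-compat">\n<html></html>'
xhtml_output = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html></html>'
)

print(has_html5_doctype(xslt_output))   # True
print(has_html5_doctype(xhtml_output))  # False
```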
May 15th, 2011 on 10:41 pm
Hi Mike,
It's a nice article with good information on XSL and HTML5.
June 26th, 2011 on 4:23 am
Thanks. Just what I needed
June 27th, 2011 on 2:19 am
This causes errors for me as tags are now treated like which is incorrect.
June 30th, 2011 on 6:25 pm
Hi Brendan,
Using the method shown does mean the document is treated as strict XML rather than HTML, so extra care needs to be taken with things like empty tags. Still, it is the best halfway house I have come up with until the parsers have a flag to indicate that you are working in HTML5 with XML syntax.
The other alternative I have used is to use a pure XHTML 1.0 transform and then run a string replace before sending it to the client (to replace the doctype with the HTML5 one), but that seems a bit wrong IMHO.
What are your thoughts?
Mike
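The string-replace alternative mentioned in the comment above could look something like the following. This is a minimal sketch in Python, not the author's actual code: the function name to_html5_doctype and the sample input are hypothetical, and it assumes the doctype sits at the start of the output and contains no internal subset.

```python
import re

# Matches the first doctype declaration at the start of the serialized
# output. An optional XML declaration before it is captured and kept.
LEADING_DOCTYPE = re.compile(
    r"\A(\s*(?:<\?xml[^>]*\?>\s*)?)<!DOCTYPE[^>]*>",
    re.IGNORECASE,
)


def to_html5_doctype(serialized: str) -> str:
    """Swap whatever doctype the transform emitted for the HTML5 one."""
    return LEADING_DOCTYPE.sub(r"\g<1><!DOCTYPE html>", serialized, count=1)


# Hypothetical output of a pure XHTML 1.0 transform.
sample = (
    '<!DOCTYPE html\n'
    'PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
)

print(to_html5_doctype(sample))
```

Input without a doctype is returned unchanged, so the step is safe to run on every response.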
July 5th, 2011 on 3:48 pm
Mike,
Please delete my last comment. I entered the tags directly and the site hid them, so that comment won’t make sense. Here is the comment without the tag markup.
Nice post. How do you deal with elements like textarea? A textarea often has no content, and if it is treated as XML it will be written as a self-closing tag, which causes a problem…
I found a way to work around elements like style and script: putting an xsl:comment in the middle of them keeps them from self-closing. But putting a comment inside a textarea causes problems with the textarea.
Any thoughts?
November 27th, 2011 on 1:57 pm
Hi Mike,
I am trying the one you recommend with no luck. and some other tags still get an error. I am using the Umbraco CMS but I can’t make it work with html 5.
Cheers, Giorgos
May 6th, 2012 on 7:18 am
I get total garbage using this. The last method gives me the text of everything in the page, all run together with spaces in between. I will look elsewhere for a solution for now.
May 6th, 2012 on 7:27 am
this is modified as a fix from what you gave. it outputs
and it doesn’t output my html5 like a garbage pile. it looks very nicely formatted, as it should.
The difference was that I changed xml to html. That fixed it.
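The difference between the two output methods can be illustrated outside XSLT. The sketch below uses Python's standard-library ElementTree serializer as a stand-in, which is an assumption made purely for illustration: it is not the XSLT processor discussed in this thread, and the sample fragment is hypothetical. The XML method collapses the empty textarea into a self-closing tag, while the HTML method writes an explicit end tag and leaves br unclosed.

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment with an empty textarea and a void element.
fragment = ET.fromstring("<form><textarea name='notes'></textarea><br/></form>")

# XML serialization: empty elements are written in self-closing form.
as_xml = ET.tostring(fragment, encoding="unicode", method="xml")

# HTML serialization: non-void elements always get an end tag,
# and void elements such as br get none.
as_html = ET.tostring(fragment, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

A browser parsing the XML-style output as text/html ignores the trailing slash on textarea and treats the rest of the page as the textarea's content, which is the problem described in the earlier comment.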
May 8th, 2012 on 8:39 pm
That’s interesting – which XSL parser are you using? I use LibXML/LibXSLT in Python which doesn’t show that behaviour.
May 8th, 2012 on 8:45 pm
Definitely interested in which parser combination you are using! I have issues with the “html” output type, possibly because I use mixed namespaces in my output, but would be interested in seeing your transform chain to compare.
To be honest, I’m still not entirely happy with the “pure” XSL approach to HTML5 transforms. There are a couple of instances where the output isn’t right, so I still have a string replace stage in my output code (which doesn’t feel like a great solution).
May 8th, 2012 on 8:54 pm
Hi Giorgos,
I’m sorry, I’m not really familiar with how Umbraco handles XSLT, although as a .NET application I would imagine the parser would be pretty compliant with W3C specs. I’ll ask around and see if I can work it out.
Cheers, Mike
June 4th, 2012 on 7:01 am
This is just what I need, thanks a lot.