Wednesday, May 27, 2009

Stumbling over simple errors

A colleague at work asked me if I could prepare a printable version of her blog. No problem, I said, the page is XHTML, I can download the page, parse it with an XML parser, put the posts together and I'm done. A simple Python script plus a number of CSS rules should do the trick.

Sounds simple, but it ain't. Try something like this:


import urllib
from xml.dom import minidom

URL = "http://leszczynski.blogspot.com"

# Python 2: urllib.urlopen returns a file-like object that minidom can read
dom = minidom.parse(urllib.urlopen(URL))


Does it work? Of course not. The interpreter kindly presents you with a "not well-formed (invalid token): line 4" error. If you look at line 4 of the page, you see something like:


<script type="text/javascript">
//...
var x = y < 5;
</script>


The cause of the problem is the "<" sign, which confuses the XML parser because it marks the beginning of a new tag. We can deal with this by wrapping the script contents in CDATA markers:


<script type="text/javascript">
//<![CDATA[
//...
var x = y < 5;
//]]>
</script>
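
To see the difference without touching the network, here is a minimal sketch. It assumes Python 3, and the BAD and GOOD strings and the is_well_formed helper are made up for illustration, not taken from the blog page. It feeds both variants of the snippet to minidom and reports whether each one parses.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: a script body with a raw "<" in it
BAD = """<html><head><script type="text/javascript">
var x = y < 5;
</script></head></html>"""

# The same script, with its body wrapped in a CDATA section
GOOD = """<html><head><script type="text/javascript">
//<![CDATA[
var x = y < 5;
//]]>
</script></head></html>"""


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


print(is_well_formed(BAD))
print(is_well_formed(GOOD))
```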


As we have no influence on the page contents, we have to (unfortunately) clean this by hand. Regular expressions to the rescue:


import re

# Fetch the raw page, then wrap each script body in a CDATA section
content = urllib.urlopen(URL).read()
content = re.sub(r"(<script.*?>)", r"\1 <![CDATA[", content)
content = re.sub(r"(</script>)", r"]]></script>", content)

dom = minidom.parseString(content)
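
The same substitution can be tried offline on a tiny made-up page before pointing it at the real one. This is only a sketch: SAMPLE and wrap_in_cdata are hypothetical names, and it assumes Python 3 and that no script body already contains a CDATA marker or "]]>".

```python
import re
from xml.dom import minidom

# Hypothetical stand-in for the downloaded page
SAMPLE = """<html><head>
<script type="text/javascript">var x = y < 5;</script>
</head><body/></html>"""


def wrap_in_cdata(content, tag):
    """Wrap the body of every <tag>...</tag> pair in a CDATA section."""
    content = re.sub(r"(<%s.*?>)" % tag, r"\1<![CDATA[", content)
    content = re.sub(r"(</%s>)" % tag, r"]]>\1", content)
    return content


fixed = wrap_in_cdata(SAMPLE, "script")
dom = minidom.parseString(fixed)

# Collect the text of the first script element from the parsed tree
script = dom.getElementsByTagName("script")[0]
script_text = "".join(node.data for node in script.childNodes)
print(script_text)
```

Calling wrap_in_cdata(content, "style") would handle the other tag the same way.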


This should do the trick, but of course it doesn't. A similar problem appears in one of the <style> tags, so, by analogy, we have to wrap the <style> contents in CDATA. Still nothing. It turns out that some links on the page contain raw "&" characters, which XML reads as the start of an entity reference. FFFFFUUUUU... Another simple correction, and ta-da, the full parsing script:


import urllib
from xml.dom import minidom
import re

URL = "http://leszczynski.blogspot.com/"

re1 = re.compile("(<script.*?>)")
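re2 = re.compile("(</script>)")
re3 = re.compile("(<style.*?>)")
re4 = re.compile("(</style>)")

# Fetch the raw page source (Python 2 urllib)
content = urllib.urlopen(URL).read()

# Wrap script and style bodies in CDATA sections
content = re1.sub(r"\1<![CDATA[", content)
content = re2.sub(r"]]>\1", content)
content = re3.sub(r"\1<![CDATA[", content)
content = re4.sub(r"]]>\1", content)

# Blunt fix: escapes every "&", even those that already start a valid entity
content = content.replace("&", "&amp;")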

dom = minidom.parseString(content)
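
The blanket replace of "&" in the script above also rewrites ampersands that already belong to an entity such as "&amp;". A gentler variant, sketched here under the assumption of Python 3 (BARE_AMP and escape_bare_ampersands are names invented for this example), escapes only the ampersands that do not already start something that looks like an entity reference:

```python
import re

# "&" not followed by a named, decimal or hexadecimal reference ending in ";"
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace only those "&" that do not begin an entity-like reference."""
    return BARE_AMP.sub("&amp;", text)


print(escape_bare_ampersands('<a href="page?x=1&y=2">Tom &amp; Jerry</a>'))
```

Like the original replace, this runs over the whole string, so a "&&" inside a script ends up as "&amp;&amp;" in the CDATA text. That is fine for parsing, but not if you want to run the script afterwards.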


That's a lot of messing around, which could be easily avoided if the author of the page obeyed the rules of XHTML; validation using the W3C validator shows over one hundred errors!

Such errors in an XHTML page kind of defeat the purpose of the whole XHTML thing. Its main idea is that the browser can use an XML parser to read the page (XML parsers are lighter and faster than the SGML ones), but unfortunately even one error such as the ones mentioned above causes the parser to stop. So, what's really the point?

Did I mention that the URL I was reading from was actually my own blog? Oooops. Should I try to correct it? After all, I am kind of a content provider, I don't really feel at home here, yet this disturbs me a bit. I am not the first one commenting on this issue. Maybe I should try to prepare a clean template?
