Sounds simple, but it ain't. Try something like this:
import urllib
from xml.dom import minidom

URL = "http://leszczynski.blogspot.com"
dom = minidom.parse(urllib.urlopen(URL))
Of course, it doesn't work. The interpreter kindly presents you with a "not well-formed (invalid token): line 4" error. If you look at line 4, you see something like:
<script type="text/javascript">
//...
var x = y < 5;
</script>
The cause of the problem is the "<" sign, which confuses the XML parser, since it marks the beginning of a new tag. We can deal with this by wrapping the script contents in CDATA markers:
<script type="text/javascript">
//<![CDATA[
//...
var x = y < 5;
//]]>
</script>
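To see that this actually helps, here is a minimal sketch (using a made-up snippet, not the real page):

```python
from xml.dom import minidom

# The bare "<" inside the script makes the parser choke...
broken = '<div><script type="text/javascript">var x = y < 5;</script></div>'
try:
    minidom.parseString(broken)
    parsed_ok = True
except Exception:
    parsed_ok = False  # ExpatError: not well-formed (invalid token)

# ...while the CDATA-wrapped version of the same content parses fine.
fixed = ('<div><script type="text/javascript">'
         '//<![CDATA[\nvar x = y < 5;\n//]]>'
         '</script></div>')
dom = minidom.parseString(fixed)
```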
As we have no influence on the page contents, we (unfortunately) have to clean this up by hand. Regular expressions to the rescue:
import re

content = urllib.urlopen(URL).read()
content = re.sub(r"(<script.*?>)", r"\1<![CDATA[", content)
content = re.sub(r"(</script>)", r"]]>\1", content)
dom = minidom.parseString(content)
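On a small made-up snippet, the two substitutions produce exactly the markup we wanted:

```python
import re
from xml.dom import minidom

snippet = '<div><script type="text/javascript">var x = y < 5;</script></div>'
# Open a CDATA section right after the opening tag, close it before the closing tag.
wrapped = re.sub(r"(<script.*?>)", r"\1<![CDATA[", snippet)
wrapped = re.sub(r"(</script>)", r"]]>\1", wrapped)
print(wrapped)
# <div><script type="text/javascript"><![CDATA[var x = y < 5;]]></script></div>
minidom.parseString(wrapped)  # no "invalid token" error any more
```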
This should do the trick, but of course it doesn't. A similar problem appears in one of the <style> tags, so, by analogy, we have to wrap the <style> contents in CDATA as well. Still nothing. It turns out that some links on the page have bare "&" characters inside, which the XML parser expects to start an entity. FFFFFUUUUU... Another simple correction, and ta-da, the full parsing script:
import urllib
from xml.dom import minidom
import re

URL = "http://leszczynski.blogspot.com/"

re1 = re.compile(r"(<script.*?>)")
re2 = re.compile(r"(</script>)")
re3 = re.compile(r"(<style.*?>)")
re4 = re.compile(r"(</style>)")

content = urllib.urlopen(URL).read()
content = re1.sub(r"\1<![CDATA[", content)
content = re2.sub(r"]]>\1", content)
content = re3.sub(r"\1<![CDATA[", content)
content = re4.sub(r"]]>\1", content)
content = content.replace("&", "&amp;")  # naive: also re-escapes existing entities
dom = minidom.parseString(content)
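One caveat: a blanket replace of "&" with "&amp;" also mangles entities that were already escaped, turning "&amp;" into "&amp;amp;". A slightly safer variant, my own sketch rather than part of the original script, escapes only the bare ampersands:

```python
import re

def escape_bare_ampersands(text):
    # Leave existing entities (&amp;, &#38;, ...) alone and escape the rest.
    return re.sub(r"&(?![A-Za-z]+;|#[0-9]+;)", "&amp;", text)

print(escape_bare_ampersands('<a href="page?a=1&b=2">x &amp; y</a>'))
# <a href="page?a=1&amp;b=2">x &amp; y</a>
```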
That's a lot of messing around, all of which could easily be avoided if the author of the page obeyed the rules of XHTML; validating it with the W3C validator shows over one hundred errors!
Such errors in an XHTML page rather defeat the purpose of the whole XHTML thing. Its main idea is that the browser can use an XML parser to read the page (XML parsers are lighter and faster than SGML ones), but unfortunately even a single error like the ones mentioned above causes the parser to stop. So what's really the point?
Did I mention that the URL I was reading from was actually my own blog? Oooops. Should I try to correct it? After all, I am kind of a content provider here. I don't really feel at home in this area, yet it disturbs me a bit, and I am not the first one to comment on this issue. Maybe I should try to prepare a clean template?