Leszek in wonderland: May 2009

A colleague at work asked me if I could prepare a printable version of her blog. No problem, I said, the page is XHTML, I can download the page, parse it with an XML parser, put the posts together and I'm done. A simple Python script plus a number of CSS rules should do the trick.

Sounds simple, but it ain't. Try something like this:


import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
from xml.dom import minidom

URL = "http://leszczynski.blogspot.com"

dom = minidom.parse(urllib.urlopen(URL))

Does it work? Of course not. The interpreter kindly presents you with a "not well-formed (invalid token): line 4" error. If you look at line 4, you see something like:


<script type="text/javascript"> 
   //... 
   var x = y < 5; 
</script>

The cause of the problem is the "<" sign, which confuses the XML parser because it marks the beginning of a new tag. We can deal with this by wrapping the script contents in CDATA markers:


<script type="text/javascript">
  //<![CDATA[
  //... 
  var x = y < 5; 
  //]]>
</script>
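To see the difference without touching the network, here is a minimal sketch that feeds both variants to minidom. The two inline documents are made-up stand-ins for the blog page, not its actual markup, and the helper name `parses` is just for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-ins for the real page: same script, without and with CDATA.
RAW = """<html><head><script type="text/javascript">
  var x = y < 5;
</script></head></html>"""

WRAPPED = """<html><head><script type="text/javascript">
  //<![CDATA[
  var x = y < 5;
  //]]>
</script></head></html>"""


def parses(document):
    """Return True if minidom accepts the document, False on a parse error."""
    try:
        minidom.parseString(document)
        return True
    except ExpatError:
        return False


print(parses(RAW))      # the bare "<" makes the parser give up
print(parses(WRAPPED))  # the CDATA section hides it from the parser
```

The leading "//" keeps the markers inside JavaScript comments, so a browser still runs the script as before.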

As we have no influence on the page contents, we have to (unfortunately) clean this by hand. Regular expressions to the rescue:


import re

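content = urllib.urlopen(URL).read()
# Open a CDATA section right after every opening <script> tag...
content = re.sub(r"(<script.*?>)",r"\1 <![CDATA[", content)
# ...and close it just before every closing </script> tag.
content = re.sub(r"(</script>)", r"]]></script>", content)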

dom = minidom.parseString(content)

This should do the trick, but of course it doesn't. A similar problem appears in one of the <style> tags, so, by analogy, we have to wrap the <style> tags with CDATA. Still nothing. It appears that some links on the page have "&" characters inside. FFFFFUUUUU... Another simple correction, and ta-da, the full parsing script:


import urllib
from xml.dom import minidom
import re

URL = "http://leszczynski.blogspot.com/"

re1 = re.compile("(<script.*?>)")
re2 = re.compile("(</script>)")
re3 = re.compile("(<style.*?>)")
re4 = re.compile("(</style>)")

# Download the page as one string
content = urllib.urlopen(URL).read()

# Wrap the contents of every <script> and <style> element in CDATA
content = re1.sub(r"\1<![CDATA[", content)
content = re2.sub(r"]]>\1", content)
content = re3.sub(r"\1<![CDATA[", content)
content = re4.sub(r"]]>\1", content)

# Escape the ampersands found in links
content = content.replace("&", "&amp;")

dom = minidom.parseString(content)
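One caveat with the blanket replace: it also rewrites ampersands that already start an entity, so a correctly written `&amp;` turns into `&amp;amp;`. Below is a hedged sketch of a slightly more careful cleanup that escapes only bare ampersands. The function name `make_parseable`, the entity pattern and the sample markup are my own assumptions for illustration, not something taken from the page.

```python
import re
from xml.dom import minidom

# Opening and closing <script>/<style> tags; the group keeps the tag itself.
OPEN_TAG = re.compile(r"(<(?:script|style)\b[^>]*>)", re.IGNORECASE)
CLOSE_TAG = re.compile(r"(</(?:script|style)>)", re.IGNORECASE)
# An "&" that does not already start a named or numeric entity.
BARE_AMP = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)")


def make_parseable(content):
    """Escape bare ampersands and wrap script/style bodies in CDATA."""
    content = BARE_AMP.sub("&amp;", content)
    content = OPEN_TAG.sub(r"\1<![CDATA[", content)
    content = CLOSE_TAG.sub(r"]]>\1", content)
    return content


# Made-up sample standing in for the downloaded page.
SAMPLE = """<html><head>
<style type="text/css">a > b { color: red; }</style>
<script type="text/javascript">var x = y < 5 && z;</script>
</head><body>
<a href="http://example.com/?a=1&b=2">Tom &amp; Jerry &#169;</a>
</body></html>"""

dom = minidom.parseString(make_parseable(SAMPLE))
link = dom.getElementsByTagName("a")[0]
print(link.getAttribute("href"))  # http://example.com/?a=1&b=2
```

Like the original script, this is still a hack: it does not cope with self-closing script tags, and ampersands inside scripts get rewritten too, which is harmless for a printable version but would matter if the scripts had to run.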

That's a lot of messing around, which could be easily avoided if the author of the page obeyed the rules of XHTML; validation using the W3C validator shows over one hundred errors!

Such errors in an XHTML page kind of defeat the purpose of the whole XHTML thing. Its main idea is that the browser can use an XML parser to read the page (XML parsers are lighter and faster than the SGML ones), but unfortunately even one error such as the ones mentioned above causes the parser to stop. So, what's really the point?

Did I mention that the URL I was reading from was actually my own blog? Oooops. Should I try to correct it? After all, I am kind of a content provider, I don't really feel at home here, yet this disturbs me a bit. I am not the first one commenting on this issue. Maybe I should try to prepare a clean template?

Imagine having two browser windows, "parent" and "child", with "child" opened from "parent" using the window.open function. At some point you want to pass some parameters to the parent, and then close the child window. Easy:



var myData = null;

var returnValue = function(data) {
  myData = data;
}

in parent, and


  window.opener.returnValue("Hello, play with me dad");

in the child window. The returned value is assigned to the myData variable and can be used after the window is closed. This simple scheme gets a bit complicated when we deal with object variables:


  window.opener.returnValue({ name: "Jerry", surname: "Blueberry"});

Now, after the child window is closed, the interpreter behavior depends on the browser used. In Firefox (and probably any other "modern" browser) everything works smoothly. However, in IE 6.0 (not tested in other versions) the object reference is assigned to the myData variable, but the object itself is destroyed. This causes a runtime error with a message "the object invoked has disconnected from its clients" - the returned data has been lost.

The simplest way to deal with this issue is to use some form of serialization of the returned data, and the simplest way to serialize data is to use JSON representation. For example, using the Prototype library:



var myData = null;

var returnValue = function(data) {
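  // Object.toJSON also works for plain object literals, which have no toJSON method of their own
  myData = Object.toJSON(data).evalJSON();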
}

This is the simplest (even if not the fastest) way to make a deep clone of an object. By creating a deep copy of the returned object, we effectively disconnect the object from the originating window. Problem solved.
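The round trip itself is not specific to Prototype or even to JavaScript. As a rough illustration only, here is the same idea in Python using the standard json module. The extra `tags` field is made up to show that nested values are copied as well, and the trick only works for data that JSON can represent.

```python
import json

# Same sample data as in the post, plus a hypothetical nested list.
original = {"name": "Jerry", "surname": "Blueberry", "tags": ["a", "b"]}

# Serialize to a string, then parse it back: the copy shares nothing with the source.
clone = json.loads(json.dumps(original))

original["tags"].append("c")  # change the source after cloning

print(clone["tags"])     # the clone still holds the old list
print(original["tags"])  # only the source sees the change
```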


Wednesday, May 27, 2009

Stumbling over simple errors

Monday, May 11, 2009

Passing parameters between browser windows
