Wednesday, May 27, 2009

Stumbling over simple errors

A colleague at work asked me if I could prepare a printable version of her blog. No problem, I said, the page is XHTML, I can download the page, parse it with an XML parser, put the posts together and I'm done. A simple Python script plus a number of CSS rules should do the trick.

Sounds simple, but it ain't. Try something like this:


import urllib
from xml.dom import minidom

URL = "http://leszczynski.blogspot.com"

dom = minidom.parse(urllib.urlopen(URL))


Of course, it doesn't work. The interpreter kindly presents you with a "not well-formed (invalid token): line 4" error. If you look at line 4, you see something like:


<script type="text/javascript">
//...
var x = y < 5;
</script>


The cause of the problem is the "<" sign, which confuses the XML parser because it marks the beginning of a new tag. We can deal with this by wrapping the script contents in CDATA markers:


<script type="text/javascript">
//<![CDATA[
//...
var x = y < 5;
//]]>
</script>
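
To see the difference for yourself, here is a small check, sketched in Python 3 (the other Python snippets in this post are Python 2). The two markup strings and the is_well_formed helper are made up for illustration; only minidom and ExpatError come from the standard library.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page fragment with a bare "<" inside the script
RAW = """<html><head><script type="text/javascript">
var x = y < 5;
</script></head><body/></html>"""

# The same fragment with the script body wrapped in a CDATA section
WRAPPED = """<html><head><script type="text/javascript">
//<![CDATA[
var x = y < 5;
//]]>
</script></head><body/></html>"""


def is_well_formed(markup):
    """Return True if minidom can parse the markup, False otherwise."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


raw_ok = is_well_formed(RAW)          # the bare "<" stops the parser
wrapped_ok = is_well_formed(WRAPPED)  # the CDATA section hides it
```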


As we have no influence on the page contents, we have to (unfortunately) clean this by hand. Regular expressions to the rescue:


import re

content = urllib.urlopen(URL).read()
content = re.sub(r"(<script.*?>)",r"\1 <![CDATA[", content)
content = re.sub(r"(</script>)", r"]]></script>", content)

dom = minidom.parseString(content)


This should do the trick, but of course it doesn't. A similar problem appears in one of the <style> tags, so, by analogy, we have to wrap the <style> tags with CDATA. Still nothing. It appears that some links on the page have "&" characters inside. FFFFFUUUUU... Another simple correction, and ta-da, the full parsing script:


import urllib
from xml.dom import minidom
import re

URL = "http://leszczynski.blogspot.com/"

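re1 = re.compile("(<script.*?>)")
re2 = re.compile("(</script>)")
re3 = re.compile("(<style.*?>)")
re4 = re.compile("(</style>)")

content = urllib.urlopen(URL).read()

# Wrap the contents of the <script> and <style> elements in CDATA sections
content = re1.sub(r"\1<![CDATA[", content)
content = re2.sub(r"]]>\1", content)
content = re3.sub(r"\1<![CDATA[", content)
content = re4.sub(r"]]>\1", content)

# Escape ampersands (note: an already escaped &amp; becomes &amp;amp;)
content = content.replace("&", "&amp;")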

dom = minidom.parseString(content)


That's a lot of messing around, which could be easily avoided if the author of the page obeyed the rules of XHTML; validation using the W3C validator shows over one hundred errors!
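
For the record, the clean-up steps above can be folded into one function. What follows is only a sketch in Python 3, not the script I actually ran: the sanitize name, the regular expressions and the sample markup are mine, and it assumes the page has no self-closing <script/> tags and no CDATA sections of its own. Unlike the plain replace() call, it leaves ampersands that already start an entity alone.

```python
import re
from xml.dom import minidom

OPEN_TAG = re.compile(r"(<(?:script|style)\b[^>]*>)", re.IGNORECASE)
CLOSE_TAG = re.compile(r"(</(?:script|style)>)", re.IGNORECASE)
# An ampersand that does not start an entity such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!#?\w+;)")


def sanitize(content):
    """Make sloppy XHTML palatable to an XML parser (best effort)."""
    # Escape bare ampersands everywhere, including inside scripts
    content = BARE_AMP.sub("&amp;", content)
    # Wrap script and style bodies in CDATA sections
    content = OPEN_TAG.sub(r"\1<![CDATA[", content)
    content = CLOSE_TAG.sub(r"]]>\1", content)
    return content


# Made-up sample page showing all three problems
SAMPLE = """<html><head>
<script type="text/javascript">var x = y < 5;</script>
<style type="text/css">a > b { color: red; }</style>
</head><body><a href="/search?a=1&b=2">Tom &amp; Jerry</a></body></html>"""

dom = minidom.parseString(sanitize(SAMPLE))
link = dom.getElementsByTagName("a")[0]
href = link.getAttribute("href")
text = "".join(node.data for node in link.childNodes)
```

Escaping the ampersands inside the scripts changes the JavaScript itself, which does not matter for a printable version, but it would if you wanted to serve the page again.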

Such errors in an XHTML page kind of defeat the purpose of the whole XHTML thing. Its main idea is that the browser can use an XML parser to read the page (XML parsers are lighter and faster than the SGML ones), but unfortunately even one error such as the ones mentioned above causes the parser to stop. So, what's really the point?

Did I mention that the URL I was reading from was actually my own blog? Oooops. Should I try to correct it? After all, I am kind of a content provider, I don't really feel at home here, yet this disturbs me a bit. I am not the first one commenting on this issue. Maybe I should try to prepare a clean template?

Monday, May 11, 2009

Passing parameters between browser windows

Imagine having two browser windows, "parent" and "child", with the "child" opened from the "parent" using the window.open function. At some point you want to pass some parameters to the parent, and then close the child window. Easy:



var myData = null;

var returnValue = function(data) {
    myData = data;
}


in the parent, and


window.opener.returnValue("Hello, play with me dad");


in the child window. The returned value is assigned to the myData variable and can be used after the window is closed. This simple scheme gets a bit complicated when we deal with object variables:


window.opener.returnValue({ name: "Jerry", surname: "Blueberry"});


Now, after the child window is closed, the interpreter's behavior depends on the browser used. In Firefox (and probably any other "modern" browser) everything works smoothly. However, in IE 6.0 (not tested in other versions) the object reference is assigned to the myData variable, but the object itself is destroyed. Accessing it causes a runtime error with the message "the object invoked has disconnected from its clients": the returned data has been lost.

The simplest way to deal with this issue is to use some form of serialization of the returned data, and the simplest way to serialize data is to use its JSON representation. For example, using the Prototype library:



var myData = null;

var returnValue = function(data) {
    myData = data.toJSON().evalJSON();
}


This is the simplest (even if not the fastest) way to make a deep clone of an object. By creating a deep copy of the returned object, we effectively disconnect the object from the originating window. Problem solved.
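
The round-trip trick is not specific to browsers. As an illustration only (this is Python, not the code that runs in the page, and the detach name and the sample data are made up), the same idea looks like this:

```python
import json


def detach(data):
    """Deep-copy JSON-compatible data by serializing and parsing it again."""
    return json.loads(json.dumps(data))


original = {"name": "Jerry", "surname": "Blueberry", "tags": ["child", "window"]}
copy = detach(original)

# Changing the original afterwards no longer affects the copy
original["tags"].append("closed")
```

The copy shares nothing with the original, which is exactly what is needed when the original is about to disappear together with its window. The price is that only plain data survives the trip; functions and other non-JSON values do not.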

Monday, April 20, 2009

Introspection, the hard way

My current project is an Ext application with a database backend. The communication between the two is carried out by a rather thin JEE middleware, which serves JSON data to the client. The Java-to-JSON mapping is done by the JsonTools library, which makes this mapping a breeze:


String str = JSONMapper.toJSON(bean).render(true);


Really simple and effective. The library gets the bean properties by introspection and converts them to their respective JSON representations, so there is no need to write any mapping XMLs or anything, just as if you were writing in Ruby :)
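
To give a rough idea of what "getting the properties by introspection" means, here is a toy mapper sketched in Python. It is not how JsonTools works internally; the Elephant class, bean_properties and to_json are all hypothetical names, and the sketch only handles properties whose values the json module can serialize directly.

```python
import json


class Elephant:
    def __init__(self, weight):
        self._weight = weight

    @property
    def weight(self):
        return self._weight


def bean_properties(bean):
    """Collect the values of all properties found in the bean's class hierarchy."""
    result = {}
    for cls in type(bean).__mro__:
        for name, member in vars(cls).items():
            # Subclasses come first in the MRO, so their properties win
            if isinstance(member, property) and name not in result:
                result[name] = getattr(bean, name)
    return result


def to_json(bean):
    """Render the bean's properties as a JSON object."""
    return json.dumps(bean_properties(bean), sort_keys=True)


rendered = to_json(Elephant("5000 kg"))
```

The mapper never mentions weight by name; it discovers the property at runtime, which is the whole appeal.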

However, today I stumbled on a nasty surprise. I created a new class hierarchy, with an abstract class at the top and non-public concrete classes below it. The concrete classes are constructed using a Factory pattern, as in the example below:


public abstract class Animal {
    public static Animal createAnimal() {
        return new Elephant();
    }
}

class Elephant extends Animal {
    private String weight;
    public String getWeight() {...}
    public void setWeight(String weight) {...}
}


Everything nice and clean, however:


Animal dumbo = Animal.createAnimal();
System.out.println(JSONMapper.toJSON(dumbo).render(true));
//...


And KA-POW, an IllegalAccessException. What the?

The problematic snippet of code looks something like this:


Class dumboClass = dumbo.getClass();

System.out.println("dumbo is a " + dumboClass.getName()); // Elephant

PropertyDescriptor[] propDscs =
    Introspector.getBeanInfo(dumboClass, Introspector.USE_ALL_BEANINFO)
                .getPropertyDescriptors();

for (PropertyDescriptor pd : propDscs) {
    if (pd.getName().equals("class")) continue;

    Method preader = pd.getReadMethod();
    System.out.println("dumbo." + pd.getDisplayName() + " has a reader " + preader.getName());
    // --> dumbo.weight has a reader getWeight -- ok
    System.out.println("dumbo." + pd.getName() + "() = " + preader.invoke(dumbo));
    // IllegalAccessException
}


If you think a bit about this, it makes perfect sense. Because the dumbo variable is declared as an Animal, the class using it has no way of knowing about any methods other than the ones defined in the Animal class. Moreover, because the Elephant class is not public, neither it nor any of its methods are visible outside its package.

However, what is surprising is that the Introspector somehow knows about both the Elephant class and its getWeight method. What is even more surprising is that you can write:


preader.setAccessible(true);


which makes the code work without any exceptions. The setAccessible method effectively nullifies any access restrictions, which is a mixed blessing, I imagine.

Ext expandable grid rows, now with AJAX

The Ext homepage shows an example of a grid with expandable rows. However, the data in the expanded rows is loaded together with all the data in the grid. I needed to change this behaviour so that the data would be loaded with an AJAX request, only after the row is expanded. So here is what I did:

First, download the RowExpander plugin, which is used in the example. Next, replace the getBodyContent method with:

getBodyContent : function(record, index, body){
    if(!this.enableCaching){
        return this.tpl.apply(record.data);
    }

    var content = this.bodyContent[record.id];
    if(!content){
        var th = this;
        Ext.Ajax.request({
            url: this.url,
            method: "GET",
            params: { id: record.id },
            success: function(res) {
                var result = eval("("+res.responseText+")");
                var content = th.tpl.apply(result);
                th.bodyContent[record.id] = content;
                body.innerHTML = content;
            },
            failure: function(res) {
                content = "Error loading data";
                body.innerHTML = content;
            }
        });
        content = "Loading...";
        this.bodyContent[record.id] = content;
    }
    return content;
}

As you can see, the method signature has been changed, so you have to add body to the method invocation in the beforeExpand method. Also, the URL of the AJAX service has to be smuggled in somehow; I just passed it in the RowExpander initialization parameters.
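
Stripped of the Ext specifics, the method is a cache with a placeholder. The sketch below models that logic in Python, purely as an illustration: LazyRowBodies, the request callback signature and fake_request are all made up, and it departs from the JavaScript above in one respect, namely that a failed request removes the placeholder from the cache so that expanding the row again retries.

```python
class LazyRowBodies:
    """Caches row bodies and fetches missing ones on first expansion."""

    def __init__(self, request, render):
        # Hypothetical signature: request(record_id, on_success, on_failure)
        self.request = request
        # Turns the fetched data into display text (the template's job)
        self.render = render
        self.cache = {}

    def get_body(self, record_id, update):
        """Return cached content, or a placeholder while a request is pending."""
        if record_id in self.cache:
            return self.cache[record_id]

        def on_success(data):
            content = self.render(data)
            self.cache[record_id] = content
            update(content)

        def on_failure():
            # Forget the placeholder so that a later expansion retries
            self.cache.pop(record_id, None)
            update("Error loading data")

        self.cache[record_id] = "Loading..."
        self.request(record_id, on_success, on_failure)
        return self.cache.get(record_id, "Error loading data")


# Fake transport that just remembers the callbacks
pending = []

def fake_request(record_id, on_success, on_failure):
    pending.append((record_id, on_success, on_failure))


shown = []
rows = LazyRowBodies(fake_request, render=lambda data: "Details: " + data["text"])
first = rows.get_body(7, shown.append)    # placeholder, request is sent
pending[0][1]({"text": "row 7"})          # simulate the response arriving
second = rows.get_body(7, shown.append)   # served from the cache
```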

This probably is not the most elegant solution to this issue - but it has one great advantage: it works :).

Document paging with XSLT

Suppose you want to display a long XML document, with one part spanning multiple pages. This part could be displayed, for example, in a table, with the header and footer repeated on every page. How can you achieve this with XSLT? Very easily, in fact, using recursive template calls:


<xsl:template name="recursive">
    <xsl:param name="start"/>
    <xsl:param name="end"/>
    <xsl:variable name="xml-doc"
        select="/path/to/data[position() > $start
                and $end >= position()]"/>

    <xsl:if test="count($xml-doc) > 0">

        <!-- template body goes here -->

        <xsl:call-template name="recursive">
            <xsl:with-param name="start" select="$end"/>
            <xsl:with-param name="end" select="$end + $recordsPerPage"/>
        </xsl:call-template>
    </xsl:if>
</xsl:template>

That's it. The template calls itself recursively, dividing its content into separate parts. All you need to do is to define the recordsPerPage variable somewhere, and to call the template for the first time:


<xsl:call-template name="recursive">
    <xsl:with-param name="start" select="0"/>
    <xsl:with-param name="end" select="$recordsPerPage"/>
</xsl:call-template>
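
If the recursion is hard to follow in XSLT syntax, the same windowing can be sketched in Python. This is only an illustration of the idea; the paginate function and the sample records are made up.

```python
def paginate(records, start, end, per_page):
    """Recursively split records into pages, mirroring the template above."""
    # XPath positions are 1-based, so "position() > start and end >= position()"
    # corresponds to the 0-based slice [start:end]
    page = records[start:end]
    if not page:
        # Same stop condition as count($xml-doc) > 0 failing
        return []
    return [page] + paginate(records, end, end + per_page, per_page)


records_per_page = 3
pages = paginate(list("abcdefg"), 0, records_per_page, records_per_page)
```

Each call handles one window of records and then hands the next window to itself, stopping as soon as a window comes back empty.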