
| Key: |
JAXEN-31
|
| Type: |
Improvement
|
| Status: |
Open
|
| Priority: |
Major
|
| Assignee: |
Unassigned
|
| Reporter: |
Brian Ewins
|
| Votes: |
2
|
| Watchers: |
2
|
|
If you were logged in you would be able to see more operations.
|
|
|
|
Time Tracking:
|
|
Original Estimate:
|
1 week
|
|
|
Remaining Estimate:
|
1 week
|
|
|
Time Spent:
|
Not Specified
|
|
|
|
The short story - I've rewritten jaxen to use a nodeset abstraction instead of a list to accumulate results. With the unit tests at nearly 100%, the performance test on dom4j runs in ~200ms on my 1GHz powerbook, compared to ~7400ms for the nightly jaxen build. The failing unit tests are for paths not covered in the performance test.
The long story - I wrote an implementation of the draft ISO Schematron spec on top of Jaxen, instead of the more usual XSLT. The schematron spec is designed so this can be done, mainly for performance reasons. Once I had enough working I was able to compare jaxen with xalan on the 'screamatron' torture test, which is using schematron to validate the schematron 1.5 spec. I was suprised to find that jaxen was about 2x /slower/ than xalan, and marginally slower with a jdom navigator than a dom4j navigator. Looking at why it was so slow, it became clear that the problem was that throughout jaxen it's building lists of intermediate results to filter down, instead of generating the 'next' result on demand. There's comments about this in several places in the code, but the list thing is so ingrained it needs more than a little refactoring to fix.
I decided to have a go. This change produces apis incompatible with previous versions of jaxen, so this will be a lump of code not a wee patch. I've basically cloned jaxen to 'org.jaxen2' to work on it. I'm nearly done with the changes, failing tests for the non-xpath extension functions which I havent implemented yet, patterns (not touched at all yet), and a couple of minor bugs.
I'll post the code on jira once I'm done, but here's the gist of the changes I made:
- Introduce a Nodeset abstraction instead of a list...
public interface extends Iterator {
Object current();
int position();
// use of last() replaces the internal iterator with an iterator over a list caching results.
// it's basically a fallback on what jaxen used to do.
int last();
}
the reason for this is obvious, eg you can just test 'hasNext()' to check the boolean value of a nodeset.
- change the signature of Expr to allow the selection of a fast code path
public interface Expr {
// these require context because of variable/function refs only. see later.
boolean isBoolean(Context ctx);
boolean isNumber(Context ctx);
boolean isString(Context ctx);
boolean isNodeset(Context ctx);
boolean booleanValue(Context ctx);
double numberValue(Context ctx);
String stringValue(Context ctx);
Nodeset nodesetValue(Context ctx);
// etc
}
this is really for my own sanity as there's so much code to maintain. Throughout jaxen, everything is untyped, so its fairly easy to (eg) pass a nodeSet when you meant to pass a node; there's instanceof checks, == null checks, and casts everywhere to deal with this. Once you put everything behind an interface like the one above, making objects responsible for their own casts, a lot of the code is simplified, eg addition is just expr1.numberValue(ctx) + expr2.numberValue(ctx). All but a couple of classes inherit from abstract base classes that implement all but one of those methods (eg StringExprBase leaves stringValue() abstract.)
Variable/Function refs don't have a type without context in my version, and like Jaxen I look up functions, variables every time. This isn't smart - in xpath we can resolve all variable, function refs beforehand, and get faster code as a result. I'll probably change things so I have 'bound' and 'unbound' exprs, add 'BoundExpr bind(VariableContext,FunctionContext)'. I'll still need to pass these as members of Context to support the 'evaluate' extension function though.
Another approach to the same concern that I didn't try is to pass in a typed context, eg 'StringContext', 'NodesetContext', etc (think of this as allowing things like perl's 'wantarray()'. That approach would make it easier to expand the number of supported types (eg for xpath2) in future, I'll try it once the tests pass.
- moved all the functions to a single utility class
public class Functions {
String substring(String s, double start, double end);
// etc
}
this makes it easier to unit test the xpath functions, since they are mostly context-independent. Functions that need a navigator get it passed as the first arg, functions that optionally use the context node instead of an argument have that function added as a method of Context. Incidentally I changed some of the unit tests to get mine to pass - in particular I changed the substring() tests back to match the spec (they don't currently).
No function in my version returns null. (compare to eg the name, namespaceuri functions of Navigator). The xpath spec uses "" or an empty nodeset throughout, so do I, this gets rid of a lot of == null checks.
Some of the bugs against the current version of jaxen are fixed by this code - eg the use of double throughout allows gt/lt comparisons of longs (bug jaxen-28), + both of the bugs I filed previously.
I'd estimate I can finish this in the next week.
|
|
Description
|
The short story - I've rewritten jaxen to use a nodeset abstraction instead of a list to accumulate results. With the unit tests at nearly 100%, the performance test on dom4j runs in ~200ms on my 1GHz powerbook, compared to ~7400ms for the nightly jaxen build. The failing unit tests are for paths not covered in the performance test.
The long story - I wrote an implementation of the draft ISO Schematron spec on top of Jaxen, instead of the more usual XSLT. The schematron spec is designed so this can be done, mainly for performance reasons. Once I had enough working I was able to compare jaxen with xalan on the 'screamatron' torture test, which is using schematron to validate the schematron 1.5 spec. I was suprised to find that jaxen was about 2x /slower/ than xalan, and marginally slower with a jdom navigator than a dom4j navigator. Looking at why it was so slow, it became clear that the problem was that throughout jaxen it's building lists of intermediate results to filter down, instead of generating the 'next' result on demand. There's comments about this in several places in the code, but the list thing is so ingrained it needs more than a little refactoring to fix.
I decided to have a go. This change produces apis incompatible with previous versions of jaxen, so this will be a lump of code not a wee patch. I've basically cloned jaxen to 'org.jaxen2' to work on it. I'm nearly done with the changes, failing tests for the non-xpath extension functions which I havent implemented yet, patterns (not touched at all yet), and a couple of minor bugs.
I'll post the code on jira once I'm done, but here's the gist of the changes I made:
- Introduce a Nodeset abstraction instead of a list...
public interface extends Iterator {
Object current();
int position();
// use of last() replaces the internal iterator with an iterator over a list caching results.
// it's basically a fallback on what jaxen used to do.
int last();
}
the reason for this is obvious, eg you can just test 'hasNext()' to check the boolean value of a nodeset.
- change the signature of Expr to allow the selection of a fast code path
public interface Expr {
// these require context because of variable/function refs only. see later.
boolean isBoolean(Context ctx);
boolean isNumber(Context ctx);
boolean isString(Context ctx);
boolean isNodeset(Context ctx);
boolean booleanValue(Context ctx);
double numberValue(Context ctx);
String stringValue(Context ctx);
Nodeset nodesetValue(Context ctx);
// etc
}
this is really for my own sanity as there's so much code to maintain. Throughout jaxen, everything is untyped, so its fairly easy to (eg) pass a nodeSet when you meant to pass a node; there's instanceof checks, == null checks, and casts everywhere to deal with this. Once you put everything behind an interface like the one above, making objects responsible for their own casts, a lot of the code is simplified, eg addition is just expr1.numberValue(ctx) + expr2.numberValue(ctx). All but a couple of classes inherit from abstract base classes that implement all but one of those methods (eg StringExprBase leaves stringValue() abstract.)
Variable/Function refs don't have a type without context in my version, and like Jaxen I look up functions, variables every time. This isn't smart - in xpath we can resolve all variable, function refs beforehand, and get faster code as a result. I'll probably change things so I have 'bound' and 'unbound' exprs, add 'BoundExpr bind(VariableContext,FunctionContext)'. I'll still need to pass these as members of Context to support the 'evaluate' extension function though.
Another approach to the same concern that I didn't try is to pass in a typed context, eg 'StringContext', 'NodesetContext', etc (think of this as allowing things like perl's 'wantarray()'. That approach would make it easier to expand the number of supported types (eg for xpath2) in future, I'll try it once the tests pass.
- moved all the functions to a single utility class
public class Functions {
String substring(String s, double start, double end);
// etc
}
this makes it easier to unit test the xpath functions, since they are mostly context-independent. Functions that need a navigator get it passed as the first arg, functions that optionally use the context node instead of an argument have that function added as a method of Context. Incidentally I changed some of the unit tests to get mine to pass - in particular I changed the substring() tests back to match the spec (they don't currently).
No function in my version returns null. (compare to eg the name, namespaceuri functions of Navigator). The xpath spec uses "" or an empty nodeset throughout, so do I, this gets rid of a lot of == null checks.
Some of the bugs against the current version of jaxen are fixed by this code - eg the use of double throughout allows gt/lt comparisons of longs (bug jaxen-28), + both of the bugs I filed previously.
I'd estimate I can finish this in the next week. |
Show » |
|
I don't touch the saxpath stuff, and wanted to leave the navigators alone, but unfortunately Navigator has a method to return an XPath so it was simpler to copy everything. I see that method as a cyclic dependency, I think it should be dropped. If I fix this there's less duplicate code to submit, but it makes a non-backward compatible change in jaxen. Your preference.
If you have the IDEA xml file for the haus coding conventions, it would be great if it could be posted somewhere? I've got something approximately the same I'm using to format my code but it would be handy to be able to format all this the way you want. Of course, you can do that once you see the code...
My tests share their data with the jaxen tests. This means that when jaxen's gone deliberately off-spec (for substring()) one or other must break. I'm not keen on checking in broken tests but thats already happened for the 'preceding' axis tests? I'd rather comment out those tests for both, so gump works.
Any comments, better ideas appreciated.