History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: JAXEN-31
Type: Improvement Improvement
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Brian Ewins
Votes: 2
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
jaxen

if jaxen stopped using lists it would be 37x faster

Created: 11/Aug/04 06:05 AM   Updated: 25/May/05 08:45 AM
Component/s: None
Affects Version/s: 1.0
Fix Version/s: 2.0

Time Tracking:
Original Estimate: 1 week
Original Estimate - 1 week
Remaining Estimate: 1 week
Remaining Estimate - 1 week
Time Spent: Not Specified
Remaining Estimate - 1 week

File Attachments: 1. Zip Archive jaxen.zip (1.36 Mb)
2. File stuff.tgz (148 kb)



 Description  « Hide
The short story - I've rewritten jaxen to use a nodeset abstraction instead of a list to accumulate results. With the unit tests at nearly 100%, the performance test on dom4j runs in ~200ms on my 1GHz powerbook, compared to ~7400ms for the nightly jaxen build. The failing unit tests are for paths not covered in the performance test.

The long story - I wrote an implementation of the draft ISO Schematron spec on top of Jaxen, instead of the more usual XSLT. The schematron spec is designed so this can be done, mainly for performance reasons. Once I had enough working I was able to compare jaxen with xalan on the 'screamatron' torture test, which is using schematron to validate the schematron 1.5 spec. I was suprised to find that jaxen was about 2x /slower/ than xalan, and marginally slower with a jdom navigator than a dom4j navigator. Looking at why it was so slow, it became clear that the problem was that throughout jaxen it's building lists of intermediate results to filter down, instead of generating the 'next' result on demand. There's comments about this in several places in the code, but the list thing is so ingrained it needs more than a little refactoring to fix.

I decided to have a go. This change produces apis incompatible with previous versions of jaxen, so this will be a lump of code not a wee patch. I've basically cloned jaxen to 'org.jaxen2' to work on it. I'm nearly done with the changes, failing tests for the non-xpath extension functions which I havent implemented yet, patterns (not touched at all yet), and a couple of minor bugs.

I'll post the code on jira once I'm done, but here's the gist of the changes I made:

  • Introduce a Nodeset abstraction instead of a list...
    public interface extends Iterator {
    Object current();
    int position();
    // use of last() replaces the internal iterator with an iterator over a list caching results.
    // it's basically a fallback on what jaxen used to do.
    int last();
    }
    the reason for this is obvious, eg you can just test 'hasNext()' to check the boolean value of a nodeset.
  • change the signature of Expr to allow the selection of a fast code path
    public interface Expr {
    // these require context because of variable/function refs only. see later.
    boolean isBoolean(Context ctx);
    boolean isNumber(Context ctx);
    boolean isString(Context ctx);
    boolean isNodeset(Context ctx);
    boolean booleanValue(Context ctx);
    double numberValue(Context ctx);
    String stringValue(Context ctx);
    Nodeset nodesetValue(Context ctx);
    // etc
    }
    this is really for my own sanity as there's so much code to maintain. Throughout jaxen, everything is untyped, so its fairly easy to (eg) pass a nodeSet when you meant to pass a node; there's instanceof checks, == null checks, and casts everywhere to deal with this. Once you put everything behind an interface like the one above, making objects responsible for their own casts, a lot of the code is simplified, eg addition is just expr1.numberValue(ctx) + expr2.numberValue(ctx). All but a couple of classes inherit from abstract base classes that implement all but one of those methods (eg StringExprBase leaves stringValue() abstract.)

Variable/Function refs don't have a type without context in my version, and like Jaxen I look up functions, variables every time. This isn't smart - in xpath we can resolve all variable, function refs beforehand, and get faster code as a result. I'll probably change things so I have 'bound' and 'unbound' exprs, add 'BoundExpr bind(VariableContext,FunctionContext)'. I'll still need to pass these as members of Context to support the 'evaluate' extension function though.

Another approach to the same concern that I didn't try is to pass in a typed context, eg 'StringContext', 'NodesetContext', etc (think of this as allowing things like perl's 'wantarray()'. That approach would make it easier to expand the number of supported types (eg for xpath2) in future, I'll try it once the tests pass.

  • moved all the functions to a single utility class
    public class Functions {
    String substring(String s, double start, double end);
    // etc
    }
    this makes it easier to unit test the xpath functions, since they are mostly context-independent. Functions that need a navigator get it passed as the first arg, functions that optionally use the context node instead of an argument have that function added as a method of Context. Incidentally I changed some of the unit tests to get mine to pass - in particular I changed the substring() tests back to match the spec (they don't currently).

No function in my version returns null. (compare to eg the name, namespaceuri functions of Navigator). The xpath spec uses "" or an empty nodeset throughout, so do I, this gets rid of a lot of == null checks.

Some of the bugs against the current version of jaxen are fixed by this code - eg the use of double throughout allows gt/lt comparisons of longs (bug jaxen-28), + both of the bugs I filed previously.

I'd estimate I can finish this in the next week.



 All   Comments   Work Log   Change History      Sort Order: Ascending order - Click to sort in descending order
Brian Ewins - 11/Aug/04 07:15 AM
BTW if bob or peter take a look at this... as this touches all over the code, I was planning to submit as a zip of the entire tree, not patches, with the code in some distinct package (org.jaxen2 currently, but I'll probably rename that to org.jaxen.v2 to stay under the right root).

I don't touch the saxpath stuff, and wanted to leave the navigators alone, but unfortunately Navigator has a method to return an XPath so it was simpler to copy everything. I see that method as a cyclic dependency, I think it should be dropped. If I fix this there's less duplicate code to submit, but it makes a non-backward compatible change in jaxen. Your preference.

If you have the IDEA xml file for the haus coding conventions, it would be great if it could be posted somewhere? I've got something approximately the same I'm using to format my code but it would be handy to be able to format all this the way you want. Of course, you can do that once you see the code...

My tests share their data with the jaxen tests. This means that when jaxen's gone deliberately off-spec (for substring()) one or other must break. I'm not keen on checking in broken tests but thats already happened for the 'preceding' axis tests? I'd rather comment out those tests for both, so gump works.

Any comments, better ideas appreciated.


peter royal - 11/Aug/04 11:12 AM
wow, awesome! i'll certainly take a look!

if you made a zip of it as org.jaxen, then one could easily pull up both source trees in a diff program to view the changes.. we can then decide on where to house the changes


Brian Ewins - 14/Aug/04 05:46 PM
ok here it is. This passes the tests 100%. Its still just about 37x faster on the performance test, but no faster on the functional tests - I think those are largely I/O bound. I've also put in my schematron impl (org.jaxen.schematron package). this has a performance test too, which compares the xsl/jaxen performance on a more complex set of patterns. While I can pass the jaxen tests 100%, some of these tests fail (see the commented out assertion test in the performance test). So I don't think this code is quite right yet. The schematron tests run about 20% faster than xalan with this code (nearly twice as fast as with old jaxen). Also, I found several more jaxen tests that violated the spec, due to * patterns matching more than the principal node type (sec 2.3). I've put the corrected tests in but commented them out, so tests.xml will pass both jaxen and my code. Have fun!

peter royal - 20/Feb/05 11:30 AM
this is a big change. i believe brian said that this direct approach may have problems.

either way, the idea is a 2.0 idea


Brian Ewins - 15/May/05 07:34 PM
This is a revised attack on a number of v2 issues. Largely rewritten from scratch with jdk1.5 work-a-like public interface, a javacc-generated parser, an interal class structure designed around the OPTMINCONTEXT algorithm. While I havent gone as far as that algorithm in this code, its still 3x faster on the perf test, and comparable on the navigator test. It should be possible with some renames and a bit of surgery on BaseXPath to bolt this into the existing Jaxen APIs.

Brian Ewins - 19/May/05 04:14 AM
A bit more fiddling with this new build... I'm back to MUCH MUCH faster (jaxen runs the performance test in 31 seconds, the new code does it in 200 /milliseconds/), but still passing all the navigator tests; touch wood that it won't all blow up later on. Performance on the navigator tests is largely the same as jaxen's because it calls result.size() to test for a pass/fail - forcing my code to evaluate all nodes.

The differences from the previous upload were mainly in making RelativeLocationPath use an iterator instead of a loop, so it's as lazy as it can be; and relying on an optimised navigator.getNodeType() instead of navigator.isElement() and friends. I've profiled the code and don't see any more low-hanging fruit, so I'm going to work on building this as a (hefty) patch to the main code now.


peter royal - 19/May/05 06:54 AM
how about working on this in a branch in CVS?

Brian Ewins - 25/May/05 08:45 AM
A branch in CVS will be a good idea, but not quite yet. I'm doing too much messing around with packages still, cvs would be painful. Also, when I upgraded MacOS I realized that I no longer remembered the passphrase for my ssh key (which keychain had been managing for me). I am such a loser . I'll sort this out soon.

Anyhoo, I am slowly getting to enough stability to go for a branch (2.0-alpha?). I'm working on backwards compatibility at the moment, the patch doesn't need to be as radical as that attachment. Sometime this week.