jira.codehaus.org

  • Log In Access more options
    • Online Help
    • Keyboard Shortcuts
    • About JIRA
    • JIRA Credits
    • What?s New
  • Dashboards Access more options (Alt+d)
  • Projects Access more options (Alt+p)
  • Issues Access more options (Alt+i)
  • jaxen
  • JAXEN-88

Early termination with positional predicates

  • Log In
  • Views
    • XML
    • Word
    • Printable

Details

  • Type: Improvement Improvement
  • Status: Open Open
  • Priority: Major Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: 2.0
  • Component/s: core
  • Labels:
    None

Description

Michael Kay describes a Saxon optimization in this article on developerWorks:

http://www-106.ibm.com/developerworks/library/x-xslt2/

"A predicate such as para[position() <= 3] selects the first three <para> children of the current node. It is not necessary to apply this predicate explicitly to every <para> element to see if it is true, since processing can stop after the third node."

As always with this sort of stuff, we need to carefully benchmark to insure we don't spend more time doing the optimization than it actually saves.

Activity

Ascending order - Click to sort in descending order
  • All
  • Comments
  • Work Log
  • History
  • Activity
Hide
Permalink
Brian Ewins added a comment - 26/May/05 12:39 PM

Some related optimisations and thoughts on how to do it: the slow functions in xpath are last(), count(), sum(). I've actually seen someone this week write this code:

foo[count(.//bar) != 0]

When the lazy-evaluation patches land, thats a slow way of saying 'foo[.//bar]' (right now it makes no difference). More generally, we can also spot comparisons with expressions like 'count(//*)' which only need evaluated once. The transformation looks like this (pseudocode):

relev is a function returning a bitmask of CN, CP, CS; bits are set if x depends on context node, position, or size respectively.

(x = y).simplify() ->

if ((relev & (CP|CS|CN)) == 0) {
// assume if relev == 0 then the subexpression was cached
// if the lhs is an expensive function, transform:
if (y ~ count(z)) { simplify to: count-between(z, x, x+2); (extension fn with the obvious definition) } // and so on. provide position-between(), last-between() functions too.
} else if ((relev & (CP|CS|CN)) == 0) {
// use symmetry:
swap_lhs_rhs(); simplify();
} else {
// no simplification possible.
return;
}

relev() can be calculated recursively quite cheaply, just bitwise-or the relev of subexpressions; a simplifying assumption that all predicates require 'CP', you can do a bit better if you also calculate the types of all subexpressions, since boolean predicates don't necessarily need position.

I've implemented the relev() calculations once before, but ripped them out when they didn't make a difference. A way to add this into jaxen is to use a visitor-like technique with the only thing thats globally available to an evaluation - ContextSupport. Typically an operator would need coded as (pseudocode):
Object eval(Context context) {
ContextSupport cs = context.getContextSupport();
Object lhsVal = cs.eval(lhs, context);
Object rhsVal = cs.eval(rhs, context);
// etc... return something like lhsVal == rhsVal, say
}
int relev(ContextSupport cs) { return cs.relev(lhs) | cs.relev(rhs); }
while ContextSupport would contain this:
Object eval(Expr expr, Context context) {
int relev = relev(expr); // fetch cached or computed relev
if (relev == 0) { // fetch cached or computed eval } else { return expr.eval(context); }
}

That would get us the early termination/reduction of predicates optimisations michael mentions. BTW I implemented this cache stuff once already and ripped it out, its described in the OPTMINCONTEXT papers. It hardly got triggered at all by our current tests. But having seen Michaels examples and the one I mentioned at the top, I now see where triggering it would be useful. The savings could be quite large - O(document nodes) becomes O(exprs in the path), roughly speaking, but my previous tests showed a performance hit of 10-20% on the small docs used in our unit tests.

Show
Brian Ewins added a comment - 26/May/05 12:39 PM Some related optimisations and thoughts on how to do it: the slow functions in xpath are last(), count(), sum(). I've actually seen someone this week write this code: foo[count(.//bar) != 0] When the lazy-evaluation patches land, thats a slow way of saying 'foo[.//bar]' (right now it makes no difference). More generally, we can also spot comparisons with expressions like 'count(//*)' which only need evaluated once. The transformation looks like this (pseudocode): relev is a function returning a bitmask of CN, CP, CS; bits are set if x depends on context node, position, or size respectively. (x = y).simplify() -> if ((relev & (CP|CS|CN)) == 0) { // assume if relev == 0 then the subexpression was cached // if the lhs is an expensive function, transform: if (y ~ count(z)) { simplify to: count-between(z, x, x+2); (extension fn with the obvious definition) } // and so on. provide position-between(), last-between() functions too. } else if ((relev & (CP|CS|CN)) == 0) { // use symmetry: swap_lhs_rhs(); simplify(); } else { // no simplification possible. return; } relev() can be calculated recursively quite cheaply, just bitwise-or the relev of subexpressions; a simplifying assumption that all predicates require 'CP', you can do a bit better if you also calculate the types of all subexpressions, since boolean predicates don't necessarily need position. I've implemented the relev() calculations once before, but ripped them out when they didn't make a difference. A way to add this into jaxen is to use a visitor-like technique with the only thing thats globally available to an evaluation - ContextSupport. Typically an operator would need coded as (pseudocode): Object eval(Context context) { ContextSupport cs = context.getContextSupport(); Object lhsVal = cs.eval(lhs, context); Object rhsVal = cs.eval(rhs, context); // etc... return something like lhsVal == rhsVal, say } int relev(ContextSupport cs) { return cs.relev(lhs) | cs.relev(rhs); } while ContextSupport would contain this: Object eval(Expr expr, Context context) { int relev = relev(expr); // fetch cached or computed relev if (relev == 0) { // fetch cached or computed eval } else { return expr.eval(context); } } That would get us the early termination/reduction of predicates optimisations michael mentions. BTW I implemented this cache stuff once already and ripped it out, its described in the OPTMINCONTEXT papers. It hardly got triggered at all by our current tests. But having seen Michaels examples and the one I mentioned at the top, I now see where triggering it would be useful. The savings could be quite large - O(document nodes) becomes O(exprs in the path), roughly speaking, but my previous tests showed a performance hit of 10-20% on the small docs used in our unit tests.
Hide
Permalink
Brian Ewins added a comment - 15/Jun/05 6:12 PM

changing component to core (this requires access to the tree, unavailable in saxpath) and marking for 2.0 as its performance

Show
Brian Ewins added a comment - 15/Jun/05 6:12 PM changing component to core (this requires access to the tree, unavailable in saxpath) and marking for 2.0 as its performance
Hide
Permalink
Aleksander Adamowski added a comment - 30/Oct/09 8:39 AM

Hasn't JAXEN-31 addressed this?

Show
Aleksander Adamowski added a comment - 30/Oct/09 8:39 AM Hasn't JAXEN-31 addressed this?
Hide
Permalink
Brian Ewins added a comment - 30/Oct/09 9:36 AM

Not entirely. JAXEN-31 helps if the user avoids using last() (for example) since we don't have to touch all the nodes. My comment above is that this optimisation is one of family of transformations like these:

count < y is equivalent to: for (node in x) { count++; if (!(count < y)) {return false} } return true;

and

count == y is equivalent to: for (node in x) { count++; if (count > y) {return false} } return count == y;

These loops terminate early without hitting every node in x. There's no way for the user to express this intent in xpath, unless we add some extension functions:

count < y is equivalent to: count-between(x, 0, y - 1)
count == y is equivalent to: count-between(x, y, y)

Where count-between(x, ymin, ymax) is defined as:
for (node in x) { count++; if (count > ymax) {return false} } return count >= ymin;

...once we have these, we can get the optimisation by transforming the parse tree to use these functions wherever theres a count() comparison. Even after JAXEN-31 lands, this is still a win.

Re the bug title, 'early termination with positional predicates'... we /may/ have that one covered. There are still some internal uses of size() in jaxen, I'd have to check.

Show
Brian Ewins added a comment - 30/Oct/09 9:36 AM Not entirely. JAXEN-31 helps if the user avoids using last() (for example) since we don't have to touch all the nodes. My comment above is that this optimisation is one of family of transformations like these: count < y is equivalent to: for (node in x) { count++; if (!(count < y)) {return false} } return true; and count == y is equivalent to: for (node in x) { count++; if (count > y) {return false} } return count == y; These loops terminate early without hitting every node in x. There's no way for the user to express this intent in xpath, unless we add some extension functions: count < y is equivalent to: count-between(x, 0, y - 1) count == y is equivalent to: count-between(x, y, y) Where count-between(x, ymin, ymax) is defined as: for (node in x) { count++; if (count > ymax) {return false} } return count >= ymin; ...once we have these, we can get the optimisation by transforming the parse tree to use these functions wherever theres a count() comparison. Even after JAXEN-31 lands, this is still a win. Re the bug title, 'early termination with positional predicates'... we /may/ have that one covered. There are still some internal uses of size() in jaxen, I'd have to check.

People

  • Assignee:
    Unassigned
    Reporter:
    Elliotte Rusty Harold
Vote (0)
Watch (1)

Dates

  • Created:
    09/Mar/05 7:23 AM
    Updated:
    30/Oct/09 9:36 AM
  • Atlassian JIRA (v5.0.4#731-sha1:3aa7374)
  • Report a problem
  • Powered by a free Atlassian JIRA open source license for Codehaus. Try JIRA - bug tracking software for your team.