jaxen

Implicit casting of a Text node to a numeric value leads to incorrect predicate comparisons.

Details

  • Type: Bug Bug
  • Status: Resolved Resolved
  • Priority: Minor Minor
  • Resolution: Fixed
  • Affects Version/s: 1.1
  • Fix Version/s: 1.1.3
  • Component/s: None
  • Labels:
    None
  • Number of attachments :
    1

Description

If I use Jaxen to evaluate an XPath predicate and that predicate requires a text node to be implicitly cast to a numeric value, the results of the predicate are not what I expect. If, however, I do the cast explicitly, then I do get the expected results.

As an example, assume I have the following input data:

<geo_data xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
<loc id="1">
<geo:long>-122.424761</geo:long>
<geo:lat>37.776694</geo:lat>
</loc>
<loc id="2">
<geo:long>-122.539001465</geo:long>
<geo:lat>37.9743995667</geo:lat>
</loc>
<loc id="3">
<geo:long>-122.42099762</geo:long>
<geo:lat>37.7573013306</geo:lat>
</loc>
<loc id="4">
<geo:long>-122.427001953</geo:long>
<geo:lat>37.7430992126</geo:lat>
</loc>
<loc id="5">
<geo:long>-122.419998169</geo:long>
<geo:lat>37.7448005676</geo:lat>
</loc>
</geo_data>

If I try to run the following XPath expression, I get zero results:

//geo:lat[(text() >= 37.75)]

But I expect to get 3 results. And indeed, if I specify an explicit cast of the text() node to numeric, that's exactly what I get:

//geo:lat[(number(text()) >= 37.75)]

I skimmed through the XPath spec and tried out a bunch of different expressions, and based on the results, my guess (and it's only a guess) is that Jaxen has a bug where implicit casting of a Text node to a numeric value incorrectly uses the "toString()" value from the Text node (ex. "#text: 37.7448005676"), which leads to a result that is NaN.

I (perhaps erroneously) came to the above guess based on the following observations.

=================================
Sample Expressions & Observations
=================================

Observation 1:
--------------

//geo:lat[(text() >= '37.75')] ==> 0 results
//geo:lat[(text() >= 37.75)] ==> 0 results

From this it is clear that specifying the literal without quotes doesn't change the results; in either case Jaxen returns 0 results.

Observation 2:
--------------

//geo:lat[(text() >= '37.7430992126')] ==> 0 results
//geo:lat[(text() >= 37.7430992126)] ==> 0 results

Using a precise number with the same number of decimal digits as the text() node does not change the results. So this does not appear to be an issue of string padding.

Observation 3:
--------------

//geo:lat[(text() = '37.7430992126')] ==> 1 result
//geo:lat[(text() = 37.7430992126)] ==> 0 results
//geo:lat[(text() != '37.7430992126')] ==> 4 results
//geo:lat[(text() != 37.7430992126)] ==> 5 results

Doing EQUALS or NOT EQUALS instead of GREATER-THAN-OR-EQUALS does make a difference. Jaxen appears to do different things when we have "=, Unable to render embedded object: File (=" verses when we have "<, <=, >, >=". When using "=" or ") not found.=" we get the expected results if the literal is a string, but we get unexpected results when the literal is a number.

Observation 4:
--------------

//geo:lat[number(text()) >= '37.75'] ==> 3 results
//geo:lat[number(text()) >= 37.75] ==> 3 results
//geo:lat[number(text()) = '37.7430992126'] ==> 1 result
//geo:lat[number(text()) = 37.7430992126] ==> 1 result

Using the "number()" function on the text node gives us the expected results. This is true regardless of whether the literal is a string or a number.

Observation 5:
--------------

In section 3.4 of the XPath 1.0 specification (http://www.w3.org/TR/xpath) an explanation of comparison operator behavior is given. The behavior depends on two primary factors:

a) Whether or not the operands are "node-sets".
b) Whether the operator is "=, !=", or whether it is "<, <=, >, >=".

Factor "b" has a direct correlation with observation #3: i.e. the behavior of Jaxen changes depending on whether we are using "=, !=" verses "<, <=, >, >=".

Factor "a" has a direct correlation with observation #4: i.e. the behavior of Jaxen changes depending on whether we have a "node-set" (that is, "text()") verses an atomic value ("number(text())").

==============
XPath Snippets
==============

R1: "If one object to be compared is a node-set and the other is a number, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the number to be compared and on the result of converting the string-value of that node to a number using the number function is true. If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true."

R2: "When neither object to be compared is a node-set and the operator is = or !=, then the objects are compared by converting them to a common type as follows and then comparing them. [...] if at least one object to be compared is a number, then each object to be compared is converted to a number as if by applying the number function. Otherwise, both objects to be compared are converted to strings as if by applying the string function."

R3: "When neither object to be compared is a node-set and the operator is <=, <, >= or >, then the objects are compared by converting both objects to numbers and comparing the numbers according to IEEE 754."

============================
Assumptions and Explanations
============================

IF we *ASSUME* that:

a) Jaxen takes a Text node, which constitutes a node-set, and
turns it into a string or a numeric value, which is no longer
a node-set, as part of R1 processing. Then it passes that
value down to R2 or R3, along with the literal on the right
side of the comparison; AND

b) Jaxen correctly converts string literals into numeric
values; AND

c) Jaxen fails to correctly convert the Text node's string
value into a number and instead produces something along
the lines of an NaN,

THEN all of the results for the expressions that I try become (mostly) explainable. Here are the details.

Q1. //geo:lat[(text() >= '37.75')] ==> 0 results
Q2. //geo:lat[(text() >= 37.75)] ==> 0 results
Q3. //geo:lat[(text() >= '37.7430992126')] ==> 0 results
Q4. //geo:lat[(text() >= 37.7430992126)] ==> 0 results

For Q1 thru Q4, R1 dicates that we get a string value from the text node; then R3 indicates that both operands should be converted to a number. If attempts to convert the text node's string value into a numeric value returns NaN, the comparison will never return true, hence no results.

Q5. //geo:lat[(text() = 37.7430992126)] ==> 0 results
Q6. //geo:lat[(text() != 37.7430992126)] ==> 5 results

For Q5 and Q6, R1 dictates that we get a numeric value from the text node. If retrieval of the numeric value from the Text node returns NaN, then subsequent equality comparison with the numeric literal will never return true; but IN-equality comparison could perhaps return true because NaN is clearly not the same as a valid number...? (just guessing).

Q7. //geo:lat[(text() = '37.7430992126')] ==> 1 result
Q8. //geo:lat[(text() != '37.7430992126')] ==> 4 results

For Q7 and Q8, R1 dicates that we get a string value from the text node; then R2 indicates that both operands should be converted to a string--which they both already are. So the comparison runs as a string comparison and returns the expected results, because the text node string value was never converted to a numeric.

Q9. //geo:lat[number(text()) >= '37.75'] ==> 3 results
Q10. //geo:lat[number(text()) >= 37.75] ==> 3 results

For Q9 and Q10, the number() function correctly converts the text node's string value to a numeric value; then since neither operand is a node-set R1 does NOT apply. R3 dictates that all operands are converted to numeric, which is a no-op on the value returned from the number() function and which works correctly for the right-hand literals. Thus the comparison returns correct results.

So it all kind of makes a bit of sense if we assume that Jaxen is having problems with implicit conversion of the string value of a Text node returned by "text()" into a numeric value...

But all of that said, I could simply be missing something key. So if this is working as designed, then please excuse the noise...

Activity

Hide
Army added a comment -

Attaching a simple, standalone (not JUnit) repro for the reported issue. I tried it against Jaxen 1.1.1 and Jaxen 1.1.2, but am seeing the same behavior in both releases.

Extract the three files into a directory, then from within that directory...

To compile:

javac -cp C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar JaxenXPCheck.java

To run:

java -cp "C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar;." JaxenXPCheck [XPath expressions]

XPath expressions can be specified as arguments using quoted values. If you do not specify any expressions, the repro will simply run two default expressions, where the only difference between the two is implicit vs. explicit casting. Ex.

> java -cp "C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar;." JaxenXPCheck

Expr: //geo:lat[(text() >= 37.75)]

=> Found list w/ 0 matching nodes.

Expr: //geo:lat[(number(text()) >= 37.75)]

=> Found list w/ 3 matching nodes.

Show
Army added a comment - Attaching a simple, standalone (not JUnit) repro for the reported issue. I tried it against Jaxen 1.1.1 and Jaxen 1.1.2, but am seeing the same behavior in both releases. Extract the three files into a directory, then from within that directory... To compile: javac -cp C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar JaxenXPCheck.java To run: java -cp "C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar;." JaxenXPCheck [XPath expressions] XPath expressions can be specified as arguments using quoted values. If you do not specify any expressions, the repro will simply run two default expressions, where the only difference between the two is implicit vs. explicit casting. Ex. > java -cp "C:/Progs/jaxen-1.1.2/jaxen-1.1.2.jar;." JaxenXPCheck Expr: //geo:lat[(text() >= 37.75)] => Found list w/ 0 matching nodes. Expr: //geo:lat[(number(text()) >= 37.75)] => Found list w/ 3 matching nodes.
Hide
Elliotte Rusty Harold added a comment -

Thanks. This is a very good report. I'll look into it.

The first thing to do is to verify that the two expressions should produce identical output. I'll check the specs and if I don't see anything clearly on point there, then I'll contrive an XSLT stylesheet so I can check the behavior against Xerces or libxslt.

Per the spec, your expression is at least syntactically correct. Also, I see in section 3.5, "The numeric operators convert their operands to numbers as if by calling the number function." So the version with and without the explicit cast certainly should be equivalent.

Show
Elliotte Rusty Harold added a comment - Thanks. This is a very good report. I'll look into it. The first thing to do is to verify that the two expressions should produce identical output. I'll check the specs and if I don't see anything clearly on point there, then I'll contrive an XSLT stylesheet so I can check the behavior against Xerces or libxslt. Per the spec, your expression is at least syntactically correct. Also, I see in section 3.5, "The numeric operators convert their operands to numbers as if by calling the number function." So the version with and without the explicit cast certainly should be equivalent.
Hide
Elliotte Rusty Harold added a comment -

libxslt agrees with you. Let me try your unit test and see if there's an obvious solution.

Show
Elliotte Rusty Harold added a comment - libxslt agrees with you. Let me try your unit test and see if there's an obvious solution.
Hide
Elliotte Rusty Harold added a comment -

Here's a self-contained unit test that demonstrates the problem at a high level:

public void testJaxen208()
throws JaxenException, ParserConfigurationException, SAXException, IOException { XPath implicitCast = new DOMXPath("//lat[(text() >= 37)]"); XPath explicitCast = new DOMXPath("//lat[(number(text()) >= 37)]"); DocumentBuilder builder = factory.newDocumentBuilder(); ByteArrayInputStream in = new ByteArrayInputStream("<geo><lat>39</lat></geo>".getBytes("UTF-8")); Document document = builder.parse(in); List result = explicitCast.selectNodes(document); assertEquals(1, result.size()); result = implicitCast.selectNodes(document); assertEquals(1, result.size()); }

Show
Elliotte Rusty Harold added a comment - Here's a self-contained unit test that demonstrates the problem at a high level: public void testJaxen208() throws JaxenException, ParserConfigurationException, SAXException, IOException { XPath implicitCast = new DOMXPath("//lat[(text() >= 37)]"); XPath explicitCast = new DOMXPath("//lat[(number(text()) >= 37)]"); DocumentBuilder builder = factory.newDocumentBuilder(); ByteArrayInputStream in = new ByteArrayInputStream("<geo><lat>39</lat></geo>".getBytes("UTF-8")); Document document = builder.parse(in); List result = explicitCast.selectNodes(document); assertEquals(1, result.size()); result = implicitCast.selectNodes(document); assertEquals(1, result.size()); }
Hide
Elliotte Rusty Harold added a comment -

Wow. This is a weird one. It probably only occurs in DOM, and probably only with Xerces. Seems at some point Xerces uses a DeferredTextImpl for a Node instead of a String so in NumberFunction our instanceof tests miss it.

Show
Elliotte Rusty Harold added a comment - Wow. This is a weird one. It probably only occurs in DOM, and probably only with Xerces. Seems at some point Xerces uses a DeferredTextImpl for a Node instead of a String so in NumberFunction our instanceof tests miss it.
Hide
Elliotte Rusty Harold added a comment -

I have a quick fix but it's ugly since it's pushes model specific code (i.e.e the DOM text interface) into model-independent code (NumberFunction.java). I'm going to see if I can improve this.

Show
Elliotte Rusty Harold added a comment - I have a quick fix but it's ugly since it's pushes model specific code (i.e.e the DOM text interface) into model-independent code (NumberFunction.java). I'm going to see if I can improve this.
Hide
Elliotte Rusty Harold added a comment -

OK. I have a better fix now. However this bug suggests there may be other area where we've missed issues with implicit casts. e.g. if the node is a comment or processing instruction instead of a text. I'm going to look at a few of those before closing.

Show
Elliotte Rusty Harold added a comment - OK. I have a better fix now. However this bug suggests there may be other area where we've missed issues with implicit casts. e.g. if the node is a comment or processing instruction instead of a text. I'm going to look at a few of those before closing.
Hide
Elliotte Rusty Harold added a comment -

Fix is in head now. I think this is significant enough of a bug to justify releasing 1.1.3. I'll try and push that out in the near future.

Show
Elliotte Rusty Harold added a comment - Fix is in head now. I think this is significant enough of a bug to justify releasing 1.1.3. I'll try and push that out in the near future.
Hide
Elliotte Rusty Harold added a comment -

FYI, I think you can change your XPath expression to the following and it should work with 1.1.2:

//geo:lat[. >= 37.75]

There's no actual need to use the text() node test here. Furthermore there are some edge cases where . >= '37.75' will work and text() >= '37.75' won't. (If there's a comment in the middle of the number, for example).

Show
Elliotte Rusty Harold added a comment - FYI, I think you can change your XPath expression to the following and it should work with 1.1.2: //geo:lat[. >= 37.75] There's no actual need to use the text() node test here. Furthermore there are some edge cases where . >= '37.75' will work and text() >= '37.75' won't. (If there's a comment in the middle of the number, for example).
Hide
Army added a comment -

> Fix is in head now.

Wow, thanks for picking this one up--and for providing a fix so quickly!

> I think this is significant enough of a bug to justify
> releasing 1.1.3.

Ok, that's great to hear.

> I think you can change your XPath expression [...] and it
> should work with 1.1.2: [...] There's no actual need to use
> the text() node test here

Good point, thanks for mentioning that. I tried it out and confirmed that yes, this alternate expression does indeed return the results I expect. This will definitely prove to be a useful workaround in the short term...

Thank you very much, Elliotte, for your time and rapid response. Your quick turn-around on this is greatly appreciated!

Show
Army added a comment - > Fix is in head now. Wow, thanks for picking this one up--and for providing a fix so quickly! > I think this is significant enough of a bug to justify > releasing 1.1.3. Ok, that's great to hear. > I think you can change your XPath expression [...] and it > should work with 1.1.2: [...] There's no actual need to use > the text() node test here Good point, thanks for mentioning that. I tried it out and confirmed that yes, this alternate expression does indeed return the results I expect. This will definitely prove to be a useful workaround in the short term... Thank you very much, Elliotte, for your time and rapid response. Your quick turn-around on this is greatly appreciated!

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: