Woodstox / WSTX-253

WstxInputLocation.getCharacterOffset() incorrect on unicode characters

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.8
    • Fix Version/s: None
    • Labels: None
    • Testcase included: yes
    • Number of attachments: 2

      Description

      The Javadoc for javax.xml.stream.Location.getCharacterOffset() states "If the input source is a file or a byte stream then this is the byte offset into that stream". However, when the source is an input stream, the offset reported after a Unicode control character has been encountered is wrong.

      This is important to us as the XML may have embedded binaries, and we need to detect the start and end byte offsets of those binaries so we can directly seek to those locations.

      I have attached a unit test that highlights the error. Although this is a ByteArrayInputStream, the same bug occurs for FileInputStream too. (I note that getCharacterOffset() behaves differently for character media, but this is purely byte streams that we are using.)

      1. aalto-test.java
        3 kB
        Raj Nagappan
      2. TestWstxInputLocation.java
        2 kB
        Raj Nagappan

        Activity

        Tatu Saloranta added a comment -

        The behavior is due to the fact that the Woodstox parser itself sees input via a Reader, not as a byte source, so it reports a character offset rather than a byte offset. This can be considered a bug in the implementation, but in my opinion it is a design flaw in the StAX specification: there should be two separate methods, which would let the caller choose the behavior (the Stax2 extension API does expose separate methods).
        Overloading a single method with two alternative behaviors is wrong. But there is nothing we can do about that.

        Ideally it should of course be possible to get byte offsets, but with the existing Woodstox design this is very difficult to do, as byte-to-char decoding occurs before actual parsing, and thus the parser has no knowledge of the original byte offsets.
        Theoretically it would be possible to explicitly recalculate byte offsets from characters (essentially re-encoding the characters), but this requires encoding with whatever encoding was used for decoding.

        If and when you really need to know byte offsets, another possibility would be to consider the Aalto XML processor, since it handles this particular aspect a bit better and exposes actual byte offsets. It has somewhat less functionality (specifically, no DTD handling), but does implement the StAX and SAX APIs.
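
The re-encoding idea mentioned above can be sketched roughly as follows. This is only an illustration, not Woodstox code: the class and method names (ByteOffsets, byteOffsetOf) are hypothetical, and it assumes the caller still has the decoded text up to the offset, knows the charset that was used for decoding, and that the original bytes carry no byte order mark or other bytes that decoding dropped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOffsets {
    /**
     * Hypothetical helper: turns a character offset into a byte offset by
     * re-encoding the first charOffset chars with the decoding charset.
     * Assumes no BOM in the original stream and that charOffset does not
     * split a surrogate pair (a split pair would be encoded as '?').
     */
    public static long byteOffsetOf(CharSequence decoded, int charOffset, Charset charset) {
        String prefix = decoded.subSequence(0, charOffset).toString();
        return prefix.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+00E9 is one char but two bytes in UTF-8, so the offsets diverge after it.
        String xml = "<r>\u00e9<b/></r>";
        int charOffset = xml.indexOf("<b/>");
        long byteOffset = byteOffsetOf(xml, charOffset, StandardCharsets.UTF_8);
        // prints "char offset 4, byte offset 5"
        System.out.println("char offset " + charOffset + ", byte offset " + byteOffset);
    }
}
```

Re-encoding the whole prefix on every lookup is linear in the offset, so for large documents one would more likely keep a running byte count while reading; the sketch only shows why the charset has to match.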

        Raj Nagappan added a comment -

        I tried Aalto on the same test file that I submitted but the parser crashed with an exception on a missing tag. I'm hesitant to go with it anyway as it doesn't seem very mature yet.

        If we can't get the correct byte offsets from woodstox we may have to look at some sort of custom solution.

        Tatu Saloranta added a comment -

        Hmmh. Actually, regardless of whether you can use Aalto, would you mind sharing some more details? I am the author of Aalto as well, and although I haven't worked on it a lot for the past year, its core XML parsing should be very well tested... so I would want to address any bugs you might uncover.

        One question on the original issue: although it is (alas!) not possible in the short term to provide the actual byte offset via Woodstox, it might be possible to detect that the offset is character based, and hence return -1 instead of that offset when the underlying source is byte based. Would this help you at all? That is, instead of providing wrong information, it would indicate that no accurate information is available.
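
A rough sketch of what the -1 fallback could look like from the caller's side, written as a wrapper around javax.xml.stream.Location. The wrapper class (ByteSourceLocation) and its byteSource flag are hypothetical names for illustration; this is not how Woodstox implements or planned to implement it.

```java
import javax.xml.stream.Location;

/**
 * Hypothetical wrapper: hides a character-based offset when the original
 * input was a byte source, reporting -1 ("no offset available") instead.
 */
public class ByteSourceLocation implements Location {
    private final Location delegate;
    private final boolean byteSource; // assumption: the caller knows how input was supplied

    public ByteSourceLocation(Location delegate, boolean byteSource) {
        this.delegate = delegate;
        this.byteSource = byteSource;
    }

    @Override
    public int getCharacterOffset() {
        // A char offset would be misleading for byte input, so report "unknown".
        return byteSource ? -1 : delegate.getCharacterOffset();
    }

    // Line, column and identifiers are passed through unchanged.
    @Override public int getLineNumber() { return delegate.getLineNumber(); }
    @Override public int getColumnNumber() { return delegate.getColumnNumber(); }
    @Override public String getPublicId() { return delegate.getPublicId(); }
    @Override public String getSystemId() { return delegate.getSystemId(); }
}
```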

        Raj Nagappan added a comment -

        Hi Tatu, to get the error on Aalto I simply used the same test case as before and just changed the reader factory to the Aalto one. I'll attach a revised test case now. I've just added two jars to my path - the Aalto jar and the Stax2 API jar in the same download folder.

        When I run it I get:

        com.fasterxml.aalto.WFCException: Unexpected character 'm' (code 109) in epilog (unbalanced start/end tags?)
        at [row,col {unknown-source}]: [3,1]
        at com.fasterxml.aalto.in.XmlScanner.reportInputProblem(XmlScanner.java:1306)
        at com.fasterxml.aalto.in.XmlScanner.throwUnexpectedChar(XmlScanner.java:1467)
        at com.fasterxml.aalto.in.XmlScanner.reportPrologUnexpChar(XmlScanner.java:1332)
        at com.fasterxml.aalto.in.StreamScanner.nextFromProlog(StreamScanner.java:183)
        at com.fasterxml.aalto.in.StreamReaderImpl.next(StreamReaderImpl.java:748)
        at com.nuix.integration.vendor.woodstox.TestWstxInputLocation.testGetCharacterOffsetAalto(TestWstxInputLocation.java:44)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

        Raj Nagappan added a comment -

        Oh, re the second question, returning -1 doesn't really help us, we just open a UTF-8 XML file and read it. It's only going to do this if we hit a unicode control character, at which point all our offsets in the file become useless from then on. Not much point having locations to seek to if they are frequently missing.

        Raj.

        Tatu Saloranta added a comment -

        Right, that's what I thought might be the case. Thanks for the test & details, I'll need to see what is going on.

        Tatu Saloranta added a comment -

        Oh, this is easy. Your XML document is not well-formed, as it has two root elements. So while the Aalto error message is not optimal, it is correct in indicating an error.
        (Woodstox does have alternate modes in which one can parse such non-standard XML content; possibly Aalto could also support something similar.)
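
The two-root case can be checked independently of Aalto with any StAX implementation. The sketch below uses whatever XMLInputFactory.newInstance() returns (the JDK's built-in parser if nothing else is on the classpath); the class name WellFormedCheck and the sample documents are illustrative and are not the attached test case.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class WellFormedCheck {
    /** Returns true if the whole document can be read without a parse error. */
    public static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next(); // a second root element fails here, after the first root ends
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>")); // single root
        System.out.println(isWellFormed("<a/><b/>"));    // two roots
    }
}
```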

        Raj Nagappan added a comment -

        Oh sorry, my bad. I will give it another go in a day or two perhaps.

        Thanks,
        Raj.

        Tatu Saloranta added a comment -

        Ok good luck! I think we should also add some kind of note regarding the issue with Woodstox, byte vs char based input sources.


          People

          • Assignee: Tatu Saloranta
          • Reporter: Raj Nagappan
          • Votes: 0
          • Watchers: 0