Details
- Type: Bug
- Status: Open
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 4.0.8
- Fix Version/s: None
- Labels: None
- Testcase included: yes
- Number of attachments:
Description
The javadoc for javax.xml.stream.Location.getCharacterOffset() states "If the input source is a file or a byte stream then this is the byte offset into that stream". However, when the source is an input stream, the offset reported after a Unicode control character has been encountered is wrong.
This is important to us as the XML may have embedded binaries, and we need to detect the start and end byte offsets of those binaries so we can directly seek to those locations.
I have attached a unit test that highlights the error. Although this is a ByteArrayInputStream, the same bug occurs for FileInputStream too. (I note that getCharacterOffset() behaves differently for character media, but this is purely byte streams that we are using.)
This behavior is due to the fact that the Woodstox parser itself sees its input via a Reader, not as a byte source, so it reports a character offset rather than a byte offset. This can be considered a bug in the implementation, but in my opinion it is a design flaw in the Stax specification: there should be two separate methods, which would let the caller choose which offset to ask for (the stax2 extension API does expose separate methods).
Overloading a single method with two alternative behaviors is wrong, but there is nothing we can do about that.
Ideally it should of course be possible to get byte offsets, but with the existing Woodstox design this is very difficult to do: byte-to-char decoding occurs before the actual parsing, so the parser has no knowledge of the original byte offsets.
Theoretically it would be possible to explicitly re-calculate byte offsets back from characters (essentially re-encoding characters), but this requires encoding using whatever encoding was used for decoding.
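As a rough illustration of that re-encoding idea (not something Woodstox does itself), the sketch below re-encodes the characters that precede a reported character offset and counts the resulting bytes. The class and method names (ByteOffsets, byteOffsetFor) are hypothetical. It assumes the caller has the decoded text available, knows the exact encoding used for decoding, that the stream has no byte order mark, and that the character offset counts raw decoded characters and does not split a surrogate pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOffsets {

    /**
     * Hypothetical helper: re-encodes the first charOffset chars of the
     * decoded text with the given charset and returns how many bytes they
     * occupy. Assumes no byte order mark and that charOffset does not fall
     * in the middle of a surrogate pair.
     */
    static long byteOffsetFor(String decodedText, int charOffset, Charset charset) {
        return decodedText.substring(0, charOffset).getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+00E9 is one char but two bytes in UTF-8, so the offsets diverge.
        String xml = "<root>\u00e9<bin>AAAA</bin></root>";
        int charOffset = xml.indexOf("<bin>");
        long byteOffset = byteOffsetFor(xml, charOffset, StandardCharsets.UTF_8);
        // prints: char offset = 7, byte offset = 8
        System.out.println("char offset = " + charOffset + ", byte offset = " + byteOffset);
    }
}
```

For large documents, re-encoding the whole prefix on every lookup would be wasteful; keeping a running byte count while encoding incrementally would be the more practical variant of the same idea.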
If and when you really need to know byte offsets, another possibility would be to consider the Aalto XML processor, since it handles this particular aspect a bit better and exposes actual byte offsets. It has somewhat less functionality (specifically, no DTD handling), but does implement the Stax and SAX APIs.