Woodstox
  WSTX-230

Attribute Entities not affected by XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES or XMLInputFactory.SUPPORT_DTD

    Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0.7
    • Fix Version/s: None
    • Labels:
      None
    • Number of attachments :
      0

      Description

      I'm sure there is a good reason for this, but I couldn't find any documentation telling me why.

      I currently want to process an XML document and leave it exactly as it comes in, i.e. without resolving any entities.
      Setting XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES or XMLInputFactory.SUPPORT_DTD to false does this for entities within element content, but there seems to be no way of leaving entities in attributes unresolved.

      I tried setting SUPPORT_DTD to false and then using a custom "com.ctc.wstx.undeclaredEntityResolver" that just returns the entity as is:

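      import java.io.ByteArrayInputStream;
      import java.nio.charset.StandardCharsets;
      import javax.xml.stream.XMLResolver;
      import javax.xml.stream.XMLStreamException;

      public class UndeclaredEntityResolver implements XMLResolver {

      	@Override
      	public Object resolveEntity(String publicID, String systemID, String baseURI, String namespace)
      			throws XMLStreamException {

      		// The last argument is used here as the name of the undeclared entity.
      		String entity = "&" + namespace + ";";
      		// Hand the reference back unchanged, encoded explicitly rather than with the platform default.
      		return new ByteArrayInputStream(entity.getBytes(StandardCharsets.UTF_8));
      	}
      }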
      

      but this will just result in a recursive entity problem.

      Am I missing something?
      Is there any nice way I can accomplish this?

        Activity

        Andreas Veithen added a comment -

        I guess the reason for this is that the StAX API doesn't provide a way to report entity references in attributes. Attribute values are always reported as strings. This is in contrast with entities within element content: they can be reported as separate ENTITY_REFERENCE events.
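
To make the asymmetry concrete, here is a minimal sketch that uses only the standard javax.xml.stream API. The class name EntityEventDemo, the method describe, the entity foo and the sample document are all made up for illustration. It assumes the default StAX implementation on the classpath reports unexpanded references in element content as ENTITY_REFERENCE events when IS_REPLACING_ENTITY_REFERENCES is false; details may differ between implementations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class EntityEventDemo {

    // Hypothetical sample: the same internal entity is referenced in an attribute and in content.
    static final String XML =
            "<!DOCTYPE r [<!ENTITY foo \"bar\">]><r a=\"&foo;\">&foo;</r>";

    /** Returns one line per interesting event so the difference is easy to see. */
    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        List<String> seen = new ArrayList<>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // The API can only hand back a String here, so the reference is already expanded.
                        seen.add("attribute a=" + reader.getAttributeValue(0));
                        break;
                    case XMLStreamConstants.ENTITY_REFERENCE:
                        // In element content the reference can surface as its own event.
                        seen.add("entity reference " + reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        seen.add("characters " + reader.getText());
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        describe(XML).forEach(System.out::println);
    }
}
```

The attribute line shows the replacement text rather than the reference, because getAttributeValue has no other way to represent it.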

        Tatu Saloranta added a comment -

        Right, the StAX API has no way to do this. And in general there is no way to do this if the attribute value must be returned as a String (as Andreas pointed out). The problem is that it is not possible to know whether a '&' in the value belongs to an unexpanded entity reference or was expanded from something like '&amp;'.
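
A small sketch of that ambiguity, using only the standard javax.xml.stream API (the class name AttributeAmbiguityDemo and the sample documents are hypothetical): two different source spellings produce the same attribute String, so nothing in the returned value tells the caller which one was in the input.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class AttributeAmbiguityDemo {

    /** Returns the value of the first attribute on the first element of the document. */
    static String firstAttributeValue(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getAttributeValue(0);
                }
            }
            throw new IllegalArgumentException("no element found");
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Predefined entity versus numeric character reference: different bytes on the wire.
        String viaEntity = firstAttributeValue("<r a=\"x&amp;y\"/>");
        String viaCharRef = firstAttributeValue("<r a=\"x&#38;y\"/>");
        System.out.println(viaEntity.equals(viaCharRef)); // prints true
    }
}
```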

        The only possibility would be to expose the raw underlying input as is. Woodstox does not try to do this, since there is no efficient way to do it; but it does provide accurate input offsets, which the calling application may be able to use to find the actual physical string (if it has a copy of the input data).

        So I don't think there is a nice way. If you really want to leave the document as is, just make a copy?

        Tatu Saloranta added a comment -

        Also: these settings do not affect the "standard" predefined entities (amp, lt, gt, apos, quot).
        They are considered to be effectively the same as character entities (such as numeric character references), not real general entities. As such, they will always be replaced, even in regular element content. This was decided for interoperability reasons, as well as for convenience.
        I don't know which entities you are dealing with here, but I realized you might be thinking of the default ones.
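
As a rough illustration of that point, the sketch below disables IS_REPLACING_ENTITY_REFERENCES and collects what the reader reports for element content containing a predefined entity. It uses only the standard javax.xml.stream API with whichever implementation is the default on the classpath, so it is an assumption that the behaviour matches what is described for Woodstox here. The class name PredefinedEntityDemo and the "[ref:...]" marker are invented for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PredefinedEntityDemo {

    /**
     * Concatenates character data as reported by the reader. Any ENTITY_REFERENCE
     * event is rendered as "[ref:name]" so that it would be visible in the result.
     */
    static String contentAsReported(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);

        StringBuilder out = new StringBuilder();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS) {
                    out.append(reader.getText());
                } else if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                    out.append("[ref:").append(reader.getLocalName()).append(']');
                }
            }
        } finally {
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // The setting is off, yet the predefined entity is expected to arrive as plain text.
        System.out.println(contentAsReported("<r>a &amp; b</r>"));
    }
}
```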

        Ben Davies added a comment -

        Thanks for your comments, both.

        "If you really want to leave document as is, just make a copy?"

        Not applicable to my situation unfortunately, if I'm understanding you correctly.

        I'm using Woodstox with JiBX to unmarshal/marshal XML, and the service it sits within has to be able to return parts of documents exactly as they came into the system, i.e. with all entities unexpanded.

        I've got around this at the moment by replacing all occurrences of "&" in the XML string with a defined placeholder before unmarshalling, and then replacing the placeholder with "&" after marshalling on the way out. Not nice, but it works.
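
A minimal sketch of this kind of workaround, assuming the whole document is available as a String. The class name AmpersandPlaceholder, the method names and the placeholder itself are hypothetical; the placeholder is built from a private-use character on the assumption that it never occurs in real input.

```java
final class AmpersandPlaceholder {

    // Hypothetical placeholder; it must be a sequence that cannot appear in the real documents.
    static final String PLACEHOLDER = "\uE000amp\uE000";

    /** Applied to the raw XML before unmarshalling: hides every '&' from the parser. */
    static String mask(String rawXml) {
        return rawXml.replace("&", PLACEHOLDER);
    }

    /** Applied to the marshalled XML on the way out: restores the original '&' characters. */
    static String unmask(String marshalledXml) {
        return marshalledXml.replace(PLACEHOLDER, "&");
    }

    public static void main(String[] args) {
        String original = "<r a=\"&foo;\">x &amp; y</r>";
        String masked = mask(original);
        System.out.println(masked.contains("&"));            // prints false
        System.out.println(unmask(masked).equals(original)); // prints true
    }
}
```

Note that this masks predefined entities and character references too, so any code that inspects the bound values between the two steps sees the placeholder rather than the real character.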

        Thanks again.
        Ben

        Tatu Saloranta added a comment -

        Ok. I guess the fundamental question is why you need to worry about physical serialization at all, but there may be reasons (... legacy systems). Any system that assumes unexpanded entities is somewhat flawed conceptually, since XML makes no guarantees about which entities are used, if any.
        But I assume you know that, and that it's the other systems (and their designers) that don't.

        Additional escaping may indeed be the way to go. But if you just need to pass through a sub-section (an element with its contents), you could still consider using location offsets, if you have access to them (this depends on how the input is given) and to the underlying content. Woodstox has a way to pass "raw" content to the output. But given that you are using a data binder on top of Woodstox, maybe it just won't work.


          People

          • Assignee:
            Tatu Saloranta
            Reporter:
            Ben Davies
          • Votes:
            0
            Watchers:
            1
