Maven Doxia / DOXIA-226

Make XML-based parsers handle whitespace better

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Number of attachments: 0

      Description

      Regarding whitespace in XML documents, one needs to consider the following aspects:

      • ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent
      • collapsible whitespace, i.e. view "Text   Text" and "Text Text" as equivalent
      • trimmable whitespace, i.e. view "<p> Text </p>" and "<p>Text</p>" as equivalent
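
To make the three categories concrete, here is a minimal sketch of each one as a plain string operation. The class and method names are hypothetical and are not part of Doxia; the sketch only mirrors the equivalences listed above.

```java
/** Hypothetical helper, not Doxia API: one operation per whitespace category. */
final class WhitespaceModes {
    private WhitespaceModes() {
    }

    /** Ignorable: text that consists only of whitespace carries no content. */
    static boolean isIgnorable(String text) {
        return text.trim().isEmpty();
    }

    /** Collapsible: every run of whitespace is reduced to a single space. */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Trimmable: leading and trailing whitespace is dropped. */
    static String trim(String text) {
        return text.trim();
    }
}
```

Which of these operations may be applied depends on the element that contains the text, which is why the categories cannot be decided from the text alone.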

      These distinctions require a DTD/XSD in combination with a validating parser and/or application-specific knowledge. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition, so that they reliably deliver events to the sinks. I therefore suggest hard-coding the logic required for proper whitespace handling into each parser.

      Currently, whitespace handling is rather static, e.g. XhtmlBaseParser pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this, it would help if AbstractXmlParser provided a default implementation of handleText() that subclasses can control via state flags, instead of implementing handleText() from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.

      More precisely, I imagine the following changes:

      • Have AbstractXmlParser maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
      • Have AbstractXmlParser push/pop a tuple from this stack before/after calling handleStartTag()/handleEndTag()
      • Have AbstractXmlParser provide setters to allow subclasses to control the desired whitespace handling in their handleStartTag() implementation
      • Have AbstractXmlParser implement handleText() where it evaluates the top-most tuple from the stack
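
The four changes above could be sketched roughly as follows. This is a standalone illustration, not the actual AbstractXmlParser: the class names, the setter names, the startTag()/endTag()/text() entry points and the emitted list (a stand-in for a sink) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for the proposed base-class logic; not Doxia API. */
class WhitespaceAwareParser {
    /** One (ignorable, collapsible, trimmable) tuple per open element. */
    static final class Mode {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private final Deque<Mode> stack = new ArrayDeque<>();

    /** Stand-in for a sink: records the text events that get through. */
    final List<String> emitted = new ArrayList<>();

    // Setters that a subclass would call from handleStartTag().
    protected void setIgnorableWhitespace(boolean value) {
        stack.peek().ignorable = value;
    }

    protected void setCollapsibleWhitespace(boolean value) {
        stack.peek().collapsible = value;
    }

    protected void setTrimmableWhitespace(boolean value) {
        stack.peek().trimmable = value;
    }

    /** Pushes a fresh tuple (all flags off) before delegating to the subclass. */
    final void startTag(String name) {
        stack.push(new Mode());
        handleStartTag(name);
    }

    /** Pops the element's tuple after delegating to the subclass. */
    final void endTag(String name) {
        handleEndTag(name);
        stack.pop();
    }

    final void text(String text) {
        handleText(text);
    }

    protected void handleStartTag(String name) {
    }

    protected void handleEndTag(String name) {
    }

    /** Default text handling driven by the top-most tuple. */
    protected void handleText(String text) {
        Mode mode = stack.peek();
        if (mode == null) {
            emitted.add(text); // outside any element: pass through unchanged
            return;
        }
        if (mode.ignorable && text.trim().isEmpty()) {
            return; // whitespace-only text between child elements
        }
        String result = text;
        if (mode.collapsible) {
            result = result.replaceAll("\\s+", " ");
        }
        if (mode.trimmable) {
            result = result.trim();
        }
        emitted.add(result);
    }
}

/** Example subclass: only sets flags, never reimplements handleText(). */
class ExampleParser extends WhitespaceAwareParser {
    @Override
    protected void handleStartTag(String name) {
        if ("tr".equals(name)) {
            setIgnorableWhitespace(true);
        } else if ("p".equals(name)) {
            setCollapsibleWhitespace(true);
            setTrimmableWhitespace(true);
        }
    }
}
```

Because each element gets its own tuple, text that follows a closed child element is handled with the parent's flags again. A fresh tuple starts with all flags off, so an element whose subclass sets nothing keeps every whitespace character, which matches the current behaviour.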

          Activity

          Vincent Siveton added a comment -

          First implementation in r694807, which solves DOXIA-251.

          > ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent

          Right, but the space is significant in <p><b>word</b> <i>word</i></p>, so we need to preserve whitespace for some HTML inline-style tags, but not for XML or HTML table tags and others.

          Benjamin Bentmann added a comment -

          > Right but space is important in <p><b>word</b> <i>word</i></p>

          Right, that's what I intended to say with

          > maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element

          i.e. these flags should be associated with individual markup elements. They are definitely not meant to be global for a parser instance.
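
As a small illustration of flags tied to individual elements rather than to the parser, the sketch below decides whether a text node is kept based on its parent element. The class name and the set of element names are assumptions chosen for the example, not a list taken from Doxia.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-element policy, not Doxia API. */
final class ElementWhitespacePolicy {
    /** Assumed set of elements whose whitespace-only text may be dropped. */
    private static final Set<String> IGNORABLE_PARENTS =
            new HashSet<>(Arrays.asList("table", "tr", "ul", "ol"));

    private ElementWhitespacePolicy() {
    }

    /** Returns true if the text node must be passed on to the sink. */
    static boolean keepText(String parentElement, String text) {
        boolean whitespaceOnly = text.trim().isEmpty();
        return !(whitespaceOnly && IGNORABLE_PARENTS.contains(parentElement));
    }
}
```

With this policy the single space between </b> and <i> inside a <p> is kept, because "p" is not in the set, while the same space between <tr> and <td/> is dropped.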

          Lukas Theussl added a comment -

          In addition, whitespace is never ignorable/collapsible/trimmable within verbatim blocks, i.e. within <source></source> or <pre></pre> in xdocs.
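
One possible way to honour this rule is a depth counter that switches off all whitespace processing while the parser is anywhere inside a verbatim element, including nested child elements. The class below is a hypothetical sketch; collapsing and trimming outside verbatim blocks is only an assumed default for the example.

```java
/** Hypothetical verbatim tracking, not Doxia API. */
class VerbatimTracker {
    /** Number of open elements at or below the outermost verbatim element. */
    private int verbatimDepth = 0;

    private static boolean isVerbatim(String name) {
        return "source".equals(name) || "pre".equals(name);
    }

    void startTag(String name) {
        // Once inside a verbatim block, count every nested element as well.
        if (verbatimDepth > 0 || isVerbatim(name)) {
            verbatimDepth++;
        }
    }

    void endTag(String name) {
        if (verbatimDepth > 0) {
            verbatimDepth--;
        }
    }

    /** Returns the text as it should reach the sink. */
    String text(String text) {
        if (verbatimDepth > 0) {
            return text; // verbatim: every whitespace character is significant
        }
        return text.replaceAll("\\s+", " ").trim(); // assumed default elsewhere
    }
}
```

The counter relies on well-formed input, where every start tag has a matching end tag, so it returns to zero exactly when the outermost verbatim element closes.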


            People

            • Assignee: Unassigned
            • Reporter: Benjamin Bentmann
            • Votes: 0
            • Watchers: 0
