jira.codehaus.org

  • Maven Doxia
  • DOXIA-226

Make XML based parsers better handle whitespace

Details

  • Type: Improvement
  • Status: Open
  • Priority: Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None

Description

Regarding whitespace in XML documents, one needs to consider the following aspects:

  • ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent
  • collapsible whitespace, i.e. view "Text   Text" and "Text Text" as equivalent
  • trimmable whitespace, i.e. view "<p> Text </p>" and "<p>Text</p>" as equivalent
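
The three categories above can be illustrated with a minimal sketch. The class and method names here are hypothetical and are not part of Doxia; the helpers only show what "equivalent" means in each case.

```java
/** Hypothetical helpers illustrating the three whitespace categories; not Doxia API. */
final class WhitespaceKinds {
    private WhitespaceKinds() { }

    /** Ignorable: a text node made up solely of whitespace may be dropped. */
    static boolean isIgnorable(String text) {
        return text.trim().isEmpty();
    }

    /** Collapsible: every run of whitespace is replaced by a single space. */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Trimmable: leading and trailing whitespace is removed. */
    static String trim(String text) {
        return text.trim();
    }
}
```

With these helpers, the text node " " between `<tr>` and `<td/>` counts as ignorable, `collapse("Text   Text")` yields "Text Text", and `trim(" Text ")` yields "Text".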

Those distinctions require a DTD/XSD in combination with a validating parser and/or application-specific knowledge. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition, so that they deliver events to the sinks reliably. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.

Currently, whitespace handling is rather static, e.g. XhtmlBaseParser pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this issue, it would be helpful if AbstractXmlParser provided a default implementation of handleText() that subclasses can control via state flags instead of implementing handleText() from scratch in each parser. Copy & paste, which caused DOXIA-225, needs to be avoided.

More precisely, I imagine the following changes:

  • Have AbstractXmlParser maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
  • Have AbstractXmlParser push/pop a tuple from this stack before/after calling handleStartTag()/handleEndTag()
  • Have AbstractXmlParser provide setters to allow subclasses to control the desired whitespace handling in their handleStartTag() implementation
  • Have AbstractXmlParser implement handleText() where it evaluates the top-most tuple from the stack
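
A rough sketch of how these four changes could fit together is shown below. It is a standalone stand-in and does not extend the real AbstractXmlParser: the class name, the `Mode` tuple, the setter names, the simplified method signatures, and the `StringBuilder` used in place of a sink are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the proposed base-class logic; not the actual Doxia API. */
class WhitespaceStackSketch {
    /** The (ignorable, collapsible, trimmable) tuple for one open element. */
    static final class Mode {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private final Deque<Mode> modes = new ArrayDeque<>();

    /** Stands in for a Doxia sink; it simply records the text it receives. */
    final StringBuilder sink = new StringBuilder();

    /** Push a fresh tuple before delegating to the subclass hook. */
    final void startTag(String name) {
        modes.push(new Mode());
        handleStartTag(name);
    }

    /** Pop the tuple after the subclass hook has run. */
    final void endTag(String name) {
        handleEndTag(name);
        modes.pop();
    }

    /** Subclass hooks; a subclass calls the setters below from handleStartTag(). */
    protected void handleStartTag(String name) { }

    protected void handleEndTag(String name) { }

    protected final void setIgnorable(boolean value) { modes.peek().ignorable = value; }

    protected final void setCollapsible(boolean value) { modes.peek().collapsible = value; }

    protected final void setTrimmable(boolean value) { modes.peek().trimmable = value; }

    /** Default text handling driven by the top-most tuple. */
    final void handleText(String text) {
        Mode mode = modes.peek();
        if (mode == null) {           // text outside any element: pass through
            sink.append(text);
            return;
        }
        if (mode.ignorable && text.trim().isEmpty()) {
            return;                   // drop whitespace-only text
        }
        if (mode.collapsible) {
            text = text.replaceAll("\\s+", " ");
        }
        if (mode.trimmable) {
            text = text.trim();
        }
        sink.append(text);
    }
}
```

In this sketch a freshly pushed tuple has all flags off, so a subclass that sets nothing keeps the current pass-through behavior. Whether flags should instead be inherited from the parent element is a design question the sketch leaves open.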

Issue Links

is duplicated by

  • DOXIA-251 The AbstractXmlParser should take care of EOL (Bug, Major, Closed)

relates to

  • DOXIA-263 Improve validation of input documents (New Feature, Major, Closed)

Activity

Vincent Siveton added a comment - 12/Sep/08 4:26 PM

First implementation in r694807 which solves DOXIA-251

ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent

Right but space is important in <p><b>word</b> <i>word</i></p>, so we need to preserve spaces for some HTML-style tags, but not for XML or HTML table tags and the like.

Benjamin Bentmann added a comment - 13/Sep/08 1:35 AM

Right but space is important in <p><b>word</b> <i>word</i></p>

Right, that's what I intended to say with

maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element

i.e. these flags should be associated with individual markup elements. They are definitely not meant to be global for a parser instance.
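
To make the per-element point concrete, here is a small sketch that uses the JDK's StAX reader rather than Doxia's own parser. The class name and the set of element names treated as "ignorable containers" are illustrative assumptions, not something taken from the Doxia code base.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical demo: whitespace-only text is dropped only inside selected elements. */
class PerElementWhitespaceDemo {
    /** Assumption: elements whose direct whitespace-only text is ignorable. */
    static final Set<String> IGNORABLE_IN =
            new HashSet<>(Arrays.asList("table", "tr", "ul", "ol"));

    /** Returns the character data that would be forwarded to a sink. */
    static String textSeenBySink(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as a single event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            Deque<String> open = new ArrayDeque<>();   // names of currently open elements
            StringBuilder out = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.push(reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.pop();
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.SPACE) {
                    // The decision depends on the innermost open element, not on a global flag.
                    boolean drop = reader.isWhiteSpace()
                            && !open.isEmpty()
                            && IGNORABLE_IN.contains(open.peek());
                    if (!drop) {
                        out.append(reader.getText());
                    }
                }
            }
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

Under these assumptions, `textSeenBySink("<p><b>word</b> <i>word</i></p>")` keeps the space between the two inline elements and returns "word word", while `textSeenBySink("<tr> <td/> </tr>")` returns an empty string.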

Lukas Theussl added a comment - 12/Mar/09 7:56 AM

In addition, whitespace is never ignorable/collapsible/trimmable within verbatim blocks, i.e. within <source></source> or <pre></pre> in xdocs.
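
One possible way to honor this rule is a depth counter that overrides any other whitespace setting while a verbatim element is open. The sketch below is hypothetical: the class name is invented, and collapsing plus trimming outside verbatim blocks is only a placeholder for whatever the enclosing element would normally request.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical guard that leaves text untouched inside verbatim elements. */
class VerbatimGuard {
    /** Verbatim element names mentioned in the comment above. */
    private static final Set<String> VERBATIM =
            new HashSet<>(Arrays.asList("source", "pre"));

    /** Number of verbatim elements currently open. */
    private int verbatimDepth;

    void startTag(String name) {
        if (VERBATIM.contains(name)) {
            verbatimDepth++;
        }
    }

    void endTag(String name) {
        if (VERBATIM.contains(name)) {
            verbatimDepth--;
        }
    }

    /** Returns the text as it would be passed on to the sink. */
    String filter(String text) {
        if (verbatimDepth > 0) {
            return text;  // never ignore, collapse, or trim inside verbatim blocks
        }
        // Placeholder policy outside verbatim blocks: collapse, then trim.
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Because the counter stays above zero for nested elements too, markup inside a verbatim block would also keep its whitespace in this sketch.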

People

  • Assignee: Unassigned
  • Reporter: Benjamin Bentmann

Dates

  • Created: 22/Feb/08 5:36 PM
  • Updated: 11/Apr/11 4:51 AM