Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
Regarding whitespace in XML documents, one needs to consider the following aspects:
- ignorable whitespace, i.e. view "<tr> <td/> </tr>" and "<tr><td/></tr>" as equivalent
- collapsible whitespace, i.e. view "Text    Text" (several consecutive spaces) and "Text Text" (a single space) as equivalent
- trimmable whitespace, i.e. view "<p> Text </p>" and "<p>Text</p>" as equivalent
Those distinctions require a DTD/XSD in combination with a validating parser and/or application-specific knowledge. For robustness, Doxia parsers for XML-based formats should not depend on the existence of a schema definition, so that they reliably deliver events into the sinks. Hence I suggest hard-coding the required logic for proper whitespace handling into each parser.
Currently, whitespace handling is rather static, e.g. XhtmlBaseParser pushes all input whitespace into the sink. This might cause trouble with sinks that do not expect to receive ignorable whitespace. To address this issue, it would be helpful if AbstractXmlParser provided a default implementation of handleText() that subclasses can control via state flags instead of implementing handleText() from scratch in each parser. Copy and paste, which caused DOXIA-225, needs to be avoided.
More precisely, I imagine the following changes:
- Have AbstractXmlParser maintain a stack of tuples (ignorable, collapsible, trimmable) where each tuple describes the whitespace handling for the currently parsed element
- Have AbstractXmlParser push a tuple onto this stack before calling handleStartTag() and pop it after calling handleEndTag()
- Have AbstractXmlParser provide setters to allow subclasses to control the desired whitespace handling in their handleStartTag() implementation
- Have AbstractXmlParser implement handleText() where it evaluates the top-most tuple from the stack
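To make the proposal concrete, here is a minimal sketch of the stack-of-tuples idea in plain Java. It is not the actual AbstractXmlParser code: the class name WhitespaceSketch, the Mode tuple, the startElement()/endElement() hooks, the setter names, and the choice to start every element with all three flags off are assumptions made for illustration only. handleText() returns the text that would be forwarded to the sink, or null if nothing would be emitted.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the proposal; NOT the real Doxia AbstractXmlParser. */
class WhitespaceSketch {

    /** The (ignorable, collapsible, trimmable) tuple for one element. */
    static final class Mode {
        boolean ignorable;
        boolean collapsible;
        boolean trimmable;
    }

    private final Deque<Mode> stack = new ArrayDeque<>();

    /** Would run before the subclass's start-tag handler. Assumption: all flags default to false. */
    void startElement() {
        stack.push(new Mode());
    }

    /** Would run after the subclass's end-tag handler. */
    void endElement() {
        stack.pop();
    }

    // Setters a subclass would call from its start-tag handler (hypothetical names).
    void setIgnorable(boolean value) { stack.peek().ignorable = value; }
    void setCollapsible(boolean value) { stack.peek().collapsible = value; }
    void setTrimmable(boolean value) { stack.peek().trimmable = value; }

    /** Evaluates the top-most tuple; returns the text for the sink, or null to emit nothing. */
    String handleText(String text) {
        Mode mode = stack.peek();
        if (mode == null) {
            return text; // outside any element: pass the text through unchanged
        }
        if (mode.ignorable && text.trim().isEmpty()) {
            return null; // whitespace-only text between child elements
        }
        String result = text;
        if (mode.collapsible) {
            result = result.replaceAll("\\s+", " "); // runs of whitespace become one space
        }
        if (mode.trimmable) {
            result = result.trim(); // drop leading and trailing whitespace
        }
        return result.isEmpty() ? null : result;
    }

    public static void main(String[] args) {
        WhitespaceSketch parser = new WhitespaceSketch();
        parser.startElement();       // e.g. <tr>
        parser.setIgnorable(true);
        System.out.println(parser.handleText("  \n  "));
        parser.startElement();       // e.g. <td>
        parser.setCollapsible(true);
        parser.setTrimmable(true);
        System.out.println(parser.handleText("  Text    Text  "));
        parser.endElement();
        parser.endElement();
    }
}
```

In this sketch a whitespace-only text node inside a trimmable element is dropped entirely, so elements with mixed inline content would probably have to leave that flag off.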
First implementation in r694807, which solves DOXIA-251.
Right, but space is important in <p><b>word</b> <i>word</i></p>, so we need to take care of spaces for some HTML-style tags and not for XML or HTML table tags and others.