Maven Doxia
  1. Maven Doxia
  2. DOXIA-222

XhtmlBaseParser swallows significant whitespace

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: Core
    • Labels:
      None
    • Patch Submitted:
      Yes
    • Number of attachments :
      1

      Description

      The current head discards whitespace which is intended to seperate words.

        Issue Links

          Activity

          Hide
          Lukas Theussl added a comment -

          This patch breaks the identity tests in xdoc and xhtml modules as there are many more text events emitted by the parser now. Need to clarify if this should be fixed in the tests or in the corresponding sinks. Generally, we should add a XhtmlBaseParser/SinkTest.

          Show
          Lukas Theussl added a comment - This patch breaks the identity tests in xdoc and xhtml modules as there are many more text events emitted by the parser now. Need to clarify if this should be fixed in the tests or in the corresponding sinks. Generally, we should add a XhtmlBaseParser/SinkTest.
          Hide
          Benjamin Bentmann added a comment -

          To clarify the use case: The inputs

          <b>word</b> <i>word</i>
          

          and

          <b>word</b>
          <i>word</i>
          

          should render as given. The unpatched parser discards
          a) text blocks that consist only of whitespace as in the first example
          b) trailing newlines as in the second example

          Show
          Benjamin Bentmann added a comment - To clarify the use case: The inputs <b> word </b> <i> word </i> and <b> word </b> <i> word </i> should render as given. The unpatched parser discards a) text blocks that consist only of whitespace as in the first example b) trailing newlines as in the second example
          Hide
          Lukas Theussl added a comment -

          I agree that this needs to be fixed, only keep the identity tests intact.

          Show
          Lukas Theussl added a comment - I agree that this needs to be fixed, only keep the identity tests intact.
          Hide
          Benjamin Bentmann added a comment -

          I think the tests themselves are alright, the sinks would need adjustment to match the altered parser behaviour. For example, I temporarily removed the EOL writes in AbstractXmlSink.writeStartTag() and AbstractXmlSink.writeEndTag() and the XDoc/XHTML tests passed (Docbook kept failing but that seems to be another issue, I watched it sinking "Tablecaption" instead of "Table caption" ...)

          Ignorable or significant whitespace, that's the whole question here. Unless the XhtmlBaseParser can validate the input with a DTD/XSD, it can only report all whitespace as significant. As a consequence, a sink should not emit any whitespace that is intended to be ignored (as done now for some pretty printing) because the parser would erroneously classify it as significant later on.

          If the pretty printing should be kept, a parser like XdocParser could track the element hierarchy to manually maintain a flag whether the currently parsed element has mixed content or just element content. XhtmlBaseParser.handleText() could then evaluate this flag to decide how to deal with whitespace.

          Show
          Benjamin Bentmann added a comment - I think the tests themselves are alright, the sinks would need adjustment to match the altered parser behaviour. For example, I temporarily removed the EOL writes in AbstractXmlSink.writeStartTag() and AbstractXmlSink.writeEndTag() and the XDoc/XHTML tests passed (Docbook kept failing but that seems to be another issue, I watched it sinking "Tablecaption" instead of "Table caption" ...) Ignorable or significant whitespace, that's the whole question here. Unless the XhtmlBaseParser can validate the input with a DTD/XSD, it can only report all whitespace as significant. As a consequence, a sink should not emit any whitespace that is intended to be ignored (as done now for some pretty printing) because the parser would erroneously classify it as significant later on. If the pretty printing should be kept, a parser like XdocParser could track the element hierarchy to manually maintain a flag whether the currently parsed element has mixed content or just element content. XhtmlBaseParser.handleText() could then evaluate this flag to decide how to deal with whitespace.
          Hide
          Lukas Theussl added a comment -

          For your first paragraph see DOXIA-189 and the discussion that ensued on the dev list: http://tinyurl.com/2f48m8 .

          IMO the doxia core sinks are not supposed to be pretty printers, pretty printing should be done by a low-level end-user sink,eg the SiteRenderingSink. At least we need to decouple the pretty printing in order to be able to test (the docbook tests fail because the linebreaker inserts a newline which is later deleted in the test).

          Show
          Lukas Theussl added a comment - For your first paragraph see DOXIA-189 and the discussion that ensued on the dev list: http://tinyurl.com/2f48m8 . IMO the doxia core sinks are not supposed to be pretty printers, pretty printing should be done by a low-level end-user sink,eg the SiteRenderingSink. At least we need to decouple the pretty printing in order to be able to test (the docbook tests fail because the linebreaker inserts a newline which is later deleted in the test).
          Hide
          Benjamin Bentmann added a comment -

          see DOXIA-189 and the discussion that ensued on the dev list: http://tinyurl.com/2f48m8.

          I agree with you that AbstractXmlSink should not do any pretty printing by itself because XML is just too low-level and pretty printing requires knowledge about the exact semantics of the output format. For this reason, AbstractXmlSink.writeStartTag() should not emit an eol for "simple" tags because an inline element can in general not know whether its surrounding block element supports ignorable whitespace such that the current heuristic may cause errors sooner or later.

          Being a user, I consider human-friendly sink output a nice to have feature and hence have no arguments against abstract sinks that provide means for their subclasses to add ignorable whitespace to the output. However, as a matter of design, I suggest to decouple pretty printing and actual content output from one other in the sink methods. I.e. instead of offering both writeEndTag() and writeEndTagWithoutEOL() discard the former one and rename the later one to writeEndTag() such that this method does really what it name suggests to readers, write the end tag but no more. Writing the EOL (and maybe some indent?) would then be provided by a dedicated method, say writeEOL(). Subclasses that want to pretty print would invoke writeEndTag() and writeEOL() explicitly, again leading to better understandable/maintainable code IMHO.

          As for the parsing site: In concern of robustness, I believe it is a desirable feature of AbstractXmlParser and all its subclasses if their output does not differ regardless whether the underlying XML parser is equipped with a DTD/XSD when parsing or not (i.e. whether ignorable whitespace may be reported). This could be realized by

          1. have all doxia XML parsers assume their underlying XML parser cannot detect ignorable whitespace but reports it as PCDATA
          2. add a boolean flag to AbstractXmlParser that subclasses like XdocParser can use to control whether the currently parsed element has mixed content or not
          3. move handleText() from AbstractXhtmlParser up to AbstractXmlParser and have it appropriately trim the text if the ignorableWhitespace flag is set
          Show
          Benjamin Bentmann added a comment - see DOXIA-189 and the discussion that ensued on the dev list: http://tinyurl.com/2f48m8 . I agree with you that AbstractXmlSink should not do any pretty printing by itself because XML is just too low-level and pretty printing requires knowledge about the exact semantics of the output format. For this reason, AbstractXmlSink.writeStartTag() should not emit an eol for "simple" tags because an inline element can in general not know whether its surrounding block element supports ignorable whitespace such that the current heuristic may cause errors sooner or later. Being a user, I consider human-friendly sink output a nice to have feature and hence have no arguments against abstract sinks that provide means for their subclasses to add ignorable whitespace to the output. However, as a matter of design, I suggest to decouple pretty printing and actual content output from one other in the sink methods. I.e. instead of offering both writeEndTag() and writeEndTagWithoutEOL() discard the former one and rename the later one to writeEndTag() such that this method does really what it name suggests to readers, write the end tag but no more. Writing the EOL (and maybe some indent?) would then be provided by a dedicated method, say writeEOL() . Subclasses that want to pretty print would invoke writeEndTag() and writeEOL() explicitly, again leading to better understandable/maintainable code IMHO. As for the parsing site: In concern of robustness, I believe it is a desirable feature of AbstractXmlParser and all its subclasses if their output does not differ regardless whether the underlying XML parser is equipped with a DTD/XSD when parsing or not (i.e. whether ignorable whitespace may be reported). This could be realized by have all doxia XML parsers assume their underlying XML parser cannot detect ignorable whitespace but reports it as PCDATA add a boolean flag to AbstractXmlParser that subclasses like XdocParser can use to control whether the currently parsed element has mixed content or not move handleText() from AbstractXhtmlParser up to AbstractXmlParser and have it appropriately trim the text if the ignorableWhitespace flag is set
          Hide
          Lukas Theussl added a comment -

          Fixed in r630198 with a couple of necessary adjustments in various tests.

          For your parsing comments I suggest you open a separate issue for discussion.

          Thanks!

          Show
          Lukas Theussl added a comment - Fixed in r630198 with a couple of necessary adjustments in various tests. For your parsing comments I suggest you open a separate issue for discussion. Thanks!

            People

            • Assignee:
              Lukas Theussl
              Reporter:
              Benjamin Bentmann
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: