groovy
  1. groovy
  2. GROOVY-4831

New line character is not escaped in attribute when using MarkupBuilder

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.8.1
    • Component/s: XML Processing
    • Labels:
      None
    • Number of attachments :
      2

      Description

      For example:

      import groovy.xml.*
      def xml = new MarkupBuilder()
      xml.test(a:"hello\nworld"){}
      

      produces output:

      <test a='hello
      world' />
      

      But xml specification says that when parsing, new lines in attribute value should be converted to single space. So they are not preserved.

      I suggest this new method implementation (sorry that I don't provide real .patch):

          private String checkForReplacement(boolean isAttrValue, char ch) {
              switch (ch) {
                  case '&':
                      return "&amp;";
                  case '<':
                      return "&lt;";
                  case '>':
                      return "&gt;";
                  case '"':
                      // The double quote is only escaped if the value is for
                      // an attribute and the builder is configured to output
                      // attribute values inside double quotes.
                      if (isAttrValue && useDoubleQuotes) return "&quot;";
                      break;
                  case '\'':
                      // The apostrophe is only escaped if the value is for an
                      // attribute, as opposed to element content, and if the
                      // builder is configured to surround attribute values with
                      // single quotes.
                      if (isAttrValue && !useDoubleQuotes) return "&apos;";
                      break;
                  case '\n':
                      if(isAttrValue) return "&#10;";
                  case '\r':
                      if(isAttrValue) return "&#13;";
              }
              return null;
          }
      
      1. MarkupBuilder.java.patch
        0.5 kB
        Jakub Neubauer
      2. repair.txt
        1 kB
        Jakub Neubauer

        Activity

        Hide
        Jakub Neubauer added a comment -

        My previous comment was mangled, so I rather try to attach a patch.

        Show
        Jakub Neubauer added a comment - My previous comment was mangled, so I rather try to attach a patch.
        Hide
        blackdrag blackdrag added a comment -

        But if you say the spec says "new lines in attribute value should be converted to single space" and then produce , isn't that wrong? would you want to use " "?

        Show
        blackdrag blackdrag added a comment - But if you say the spec says "new lines in attribute value should be converted to single space" and then produce , isn't that wrong? would you want to use " "?
        Hide
        Paul King added a comment -

        Do you have a link to the relevant part of the spec?

        Show
        Paul King added a comment - Do you have a link to the relevant part of the spec?
        Hide
        Jakub Neubauer added a comment -

        The link to spec is: http://www.w3.org/TR/xml/#AVNormalize

        It says when parsing, white space characters #x20, #xD, #xA, #x9 should be converted to #x20 (space). Then, all leading and trailing spaces are removed and all sequences of spaces are merged to one space.

        So, now I see that my patch is not enough. The generator should also convert spaces and tabs to their escape sequences, assuming that attribute value should be preserved exactly as is after parsing it back.

        To Jochen: we probably didn't understand each other. The normalizing attribute value occurs in XML parser. I expect that generating xml and parsing it back should provide the same values. But if you generate XML with attribute value containing new line, parse it back, you get a space.

        Show
        Jakub Neubauer added a comment - The link to spec is: http://www.w3.org/TR/xml/#AVNormalize It says when parsing, white space characters #x20, #xD, #xA, #x9 should be converted to #x20 (space). Then, all leading and trailing spaces are removed and all sequences of spaces are merged to one space. So, now I see that my patch is not enough. The generator should also convert spaces and tabs to their escape sequences, assuming that attribute value should be preserved exactly as is after parsing it back. To Jochen: we probably didn't understand each other. The normalizing attribute value occurs in XML parser. I expect that generating xml and parsing it back should provide the same values. But if you generate XML with attribute value containing new line, parse it back, you get a space.
        Hide
        Paul King added a comment -

        Should be fixed in trunk. I ended up using a slightly tweaked version of your patch. If you can try both MarkupBuilder and StreamingMarkupBuilder using one of the CI server produced snapshots, that would be great. Thanks for spotting the problem.

        Show
        Paul King added a comment - Should be fixed in trunk. I ended up using a slightly tweaked version of your patch. If you can try both MarkupBuilder and StreamingMarkupBuilder using one of the CI server produced snapshots, that would be great. Thanks for spotting the problem.
        Hide
        Jakub Neubauer added a comment - - edited

        Sorry, but in later comment I stated that TAB and SPACE characters should be escaped too ('\t' -> "&#9;" and ' ' -> "&#32"). For the same reasons as \n and \r.

        Show
        Jakub Neubauer added a comment - - edited Sorry, but in later comment I stated that TAB and SPACE characters should be escaped too ('\t' -> "&#9;" and ' ' -> "&#32"). For the same reasons as \n and \r.
        Hide
        Paul King added a comment -

        I am wondering whether I am totally following what you would like. Are you wanting the first or second line:

        \t -> &#9;
        \t -> &amp;#9;
        

        because I am wondering whether even the earlier changes are want you wanted?

        Show
        Paul King added a comment - I am wondering whether I am totally following what you would like. Are you wanting the first or second line: \t -> &#9; \t -> &amp;#9; because I am wondering whether even the earlier changes are want you wanted?
        Hide
        Jakub Neubauer added a comment -

        The first line is what I meant, and similarly for SPACE. To be sure, I attach a code snippet from MarkupBuilder.java, it is most precise after all

        Show
        Jakub Neubauer added a comment - The first line is what I meant, and similarly for SPACE. To be sure, I attach a code snippet from MarkupBuilder.java, it is most precise after all
        Hide
        Paul King added a comment -

        That is what you said before - I would be worried about us producing "non-standard" looking XML. Can you elaborate a bit more on what kind of round-tripping you are trying to enable?

        Show
        Paul King added a comment - That is what you said before - I would be worried about us producing "non-standard" looking XML. Can you elaborate a bit more on what kind of round-tripping you are trying to enable?
        Hide
        Jakub Neubauer added a comment -

        Hello, I went through the spec again (few times) and tried another libraries(Java DOM, Transformer, SAX Parser). I realized that I made mistake concerning a SPACE character. It can be produced as is. I was mistaken by this part of spec:

        If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

        But, 2 paragraphs later:

        All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.

        So, SPACES don't need to be produced as escape sequences. I tried groovy.xml.XMLParser and it keeps them as they are.

        So, only TAB character should be escaped. This is test, which should pass:

        import groovy.xml.*
        
        doTest("  hello  world  ") //  (1)
        doTest("hello\nworld")     //  (2)
        doTest("hello\rworld")     //  (3)
        doTest("hello\tworld")     //  (4)
        
        def doTest(String value) {
          StringWriter w = new StringWriter()
          new MarkupBuilder(w).root(a:value)
          xml2 = new XmlParser().parseText(w.toString())
          assert value == xml2.@a
        }
        

        Original version fails in parts (2), (3) and (4), after your changes it still fails in part (4). The part (1) passes, which is confirmation of what I said in this comment.

        Show
        Jakub Neubauer added a comment - Hello, I went through the spec again (few times) and tried another libraries(Java DOM, Transformer, SAX Parser). I realized that I made mistake concerning a SPACE character. It can be produced as is. I was mistaken by this part of spec: If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character. But, 2 paragraphs later: All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA. So, SPACES don't need to be produced as escape sequences. I tried groovy.xml.XMLParser and it keeps them as they are. So, only TAB character should be escaped. This is test, which should pass: import groovy.xml.* doTest( " hello world " ) // (1) doTest( "hello\nworld" ) // (2) doTest( "hello\rworld" ) // (3) doTest( "hello\tworld" ) // (4) def doTest( String value) { StringWriter w = new StringWriter() new MarkupBuilder(w).root(a:value) xml2 = new XmlParser().parseText(w.toString()) assert value == xml2.@a } Original version fails in parts (2), (3) and (4), after your changes it still fails in part (4). The part (1) passes, which is confirmation of what I said in this comment.
        Hide
        Paul King added a comment -

        Tab is now escaped too. Thanks for all the feedback and information.

        Show
        Paul King added a comment - Tab is now escaped too. Thanks for all the feedback and information.

          People

          • Assignee:
            Paul King
            Reporter:
            Jakub Neubauer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: