groovy
  1. groovy
  2. GROOVY-4831

New line character is not escaped in attribute when using MarkupBuilder

    Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.8.1
    • Component/s: XML Processing
    • Labels:
      None
    • Number of attachments :
      2

      Description

      For example:

      import groovy.xml.*
      def xml = new MarkupBuilder()
      xml.test(a:"hello\nworld"){}
      

      produces output:

      <test a='hello
      world' />
      

      But xml specification says that when parsing, new lines in attribute value should be converted to single space. So they are not preserved.

      I suggest this new method implementation (sorry that I don't provide real .patch):

          private String checkForReplacement(boolean isAttrValue, char ch) {
              switch (ch) {
                  case '&':
                      return "&amp;";
                  case '<':
                      return "&lt;";
                  case '>':
                      return "&gt;";
                  case '"':
                      // The double quote is only escaped if the value is for
                      // an attribute and the builder is configured to output
                      // attribute values inside double quotes.
                      if (isAttrValue && useDoubleQuotes) return "&quot;";
                      break;
                  case '\'':
                      // The apostrophe is only escaped if the value is for an
                      // attribute, as opposed to element content, and if the
                      // builder is configured to surround attribute values with
                      // single quotes.
                      if (isAttrValue && !useDoubleQuotes) return "&apos;";
                      break;
                  case '\n':
                      if(isAttrValue) return "&#10;";
                  case '\r':
                      if(isAttrValue) return "&#13;";
              }
              return null;
          }
      
      1. MarkupBuilder.java.patch
        0.5 kB
        Jakub Neubauer
      2. repair.txt
        1 kB
        Jakub Neubauer

        Activity

        Hide
        Paul King added a comment -

        I am wondering whether I am totally following what you would like. Are you wanting the first or second line:

        \t -> &#9;
        \t -> &amp;#9;
        

        because I am wondering whether even the earlier changes are want you wanted?

        Show
        Paul King added a comment - I am wondering whether I am totally following what you would like. Are you wanting the first or second line: \t -> &#9; \t -> &amp;#9; because I am wondering whether even the earlier changes are want you wanted?
        Hide
        Jakub Neubauer added a comment -

        The first line is what I meant, and similarly for SPACE. To be sure, I attach a code snippet from MarkupBuilder.java, it is most precise after all

        Show
        Jakub Neubauer added a comment - The first line is what I meant, and similarly for SPACE. To be sure, I attach a code snippet from MarkupBuilder.java, it is most precise after all
        Hide
        Paul King added a comment -

        That is what you said before - I would be worried about us producing "non-standard" looking XML. Can you elaborate a bit more on what kind of round-tripping you are trying to enable?

        Show
        Paul King added a comment - That is what you said before - I would be worried about us producing "non-standard" looking XML. Can you elaborate a bit more on what kind of round-tripping you are trying to enable?
        Hide
        Jakub Neubauer added a comment -

        Hello, I went through the spec again (few times) and tried another libraries(Java DOM, Transformer, SAX Parser). I realized that I made mistake concerning a SPACE character. It can be produced as is. I was mistaken by this part of spec:

        If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

        But, 2 paragraphs later:

        All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA.

        So, SPACES don't need to be produced as escape sequences. I tried groovy.xml.XMLParser and it keeps them as they are.

        So, only TAB character should be escaped. This is test, which should pass:

        import groovy.xml.*
        
        doTest("  hello  world  ") //  (1)
        doTest("hello\nworld")     //  (2)
        doTest("hello\rworld")     //  (3)
        doTest("hello\tworld")     //  (4)
        
        def doTest(String value) {
          StringWriter w = new StringWriter()
          new MarkupBuilder(w).root(a:value)
          xml2 = new XmlParser().parseText(w.toString())
          assert value == xml2.@a
        }
        

        Original version fails in parts (2), (3) and (4), after your changes it still fails in part (4). The part (1) passes, which is confirmation of what I said in this comment.

        Show
        Jakub Neubauer added a comment - Hello, I went through the spec again (few times) and tried another libraries(Java DOM, Transformer, SAX Parser). I realized that I made mistake concerning a SPACE character. It can be produced as is. I was mistaken by this part of spec: If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character. But, 2 paragraphs later: All attributes for which no declaration has been read SHOULD be treated by a non-validating processor as if declared CDATA. So, SPACES don't need to be produced as escape sequences. I tried groovy.xml.XMLParser and it keeps them as they are. So, only TAB character should be escaped. This is test, which should pass: import groovy.xml.* doTest( " hello world " ) // (1) doTest( "hello\nworld" ) // (2) doTest( "hello\rworld" ) // (3) doTest( "hello\tworld" ) // (4) def doTest( String value) { StringWriter w = new StringWriter() new MarkupBuilder(w).root(a:value) xml2 = new XmlParser().parseText(w.toString()) assert value == xml2.@a } Original version fails in parts (2), (3) and (4), after your changes it still fails in part (4). The part (1) passes, which is confirmation of what I said in this comment.
        Hide
        Paul King added a comment -

        Tab is now escaped too. Thanks for all the feedback and information.

        Show
        Paul King added a comment - Tab is now escaped too. Thanks for all the feedback and information.

          People

          • Assignee:
            Paul King
            Reporter:
            Jakub Neubauer
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: