Maven Help Plugin
MPH-87

help:effective-pom uses platform encoding and garbles non-ascii characters, emits invalid XML

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: effective-pom
    • Labels: None
    • Environment: Windows, MacOSX, Linux, Maven 3.0.4
    • Number of attachments: 1

      Description

      As stated in http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info, XML files without a BOM and without an XML encoding declaration should be read as UTF-8.

      help:effective-pom uses the platform encoding to write the effective POM, but does not emit a matching XML encoding declaration in the resulting XML file.
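The effect can be illustrated outside Maven. The following is a minimal Python sketch, not the plugin's code: it assumes ISO-8859-1 as a stand-in for a non-UTF-8 platform encoding, and `parses_as_default_utf8` is a hypothetical helper introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

NAME = "J\u00f6rg"  # "Jörg", the developer name from the report


def parses_as_default_utf8(data: bytes) -> bool:
    """Return True if an XML parser accepts `data` without any encoding hint."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Neither document has a BOM or an encoding declaration,
# so a conforming parser has to assume UTF-8.
utf8_doc = f"<name>{NAME}</name>".encode("utf-8")
# Assumption: ISO-8859-1 stands in for the platform encoding.
latin1_doc = f"<name>{NAME}</name>".encode("iso-8859-1")

print(parses_as_default_utf8(utf8_doc))
print(parses_as_default_utf8(latin1_doc))
```

The second document contains the single byte 0xF6 for "ö", which is not a valid UTF-8 sequence, so the parser rejects it.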

      I have created a small sample project (available at https://github.com/mfriedenhagen/invalidpom, attached as ZIP) which will reproduce the issue.

      While the parent pom (https://raw.github.com/mfriedenhagen/invalidpom/master/pom.xml) has an XML encoding declaration, https://raw.github.com/mfriedenhagen/invalidpom/master/child-invalid/pom.xml has none.

      Now running:

      mvn -s settings.xml -gs settings.xml clean validate
      

      will produce an invalid character for the developer name "Jörg" in child-invalid.

      Three workarounds are:

      • to include an XML encoding declaration as done in child-valid.
      • to use JAVA_TOOL_OPTIONS on Windows as stated in http://stackoverflow.com/a/623036/49132
      • to use MAVEN_OPTS=-Dfile.encoding=utf-8 mvn -s settings.xml -gs settings.xml clean validate.
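The principle behind the first workaround, that a declared encoding has to match the bytes actually written, can be shown outside Maven. This is a minimal Python sketch under that assumption, not a description of how the plugin works internally; `serialize_with_declaration` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def serialize_with_declaration(text: str, encoding: str) -> bytes:
    """Hypothetical helper: write a <name> element and declare the encoding used."""
    element = ET.Element("name")
    element.text = text
    # xml_declaration=True forces the <?xml ... encoding=...?> header,
    # so the declared encoding and the byte encoding cannot drift apart.
    return ET.tostring(element, encoding=encoding, xml_declaration=True)


data = serialize_with_declaration("J\u00f6rg", "iso-8859-1")
roundtrip = ET.fromstring(data).text

print(data)
print(roundtrip == "J\u00f6rg")
```

Because the header names the encoding that was really used, any conforming parser recovers the original text, whatever the platform default is.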

      Nonetheless I consider this a Major bug, as it clearly violates the recommendations of W3C.

        Activity

        Mirko Friedenhagen added a comment -

        Attach sample project as ZIP.

        Herve Boutemy added a comment -

        Yes, that's expected.
        If you want to use the result, you need to use the output parameter so that the plugin writes the content directly to a file, with encoding support:

        mvn -Doutput=effective.xml help:effective-pom

        Jürgen Hermann added a comment - edited

        Writing the result to a file doesn't really help (then the file's content is broken, i.e. not well-formed XML). Consider this:

        $ head -n1 pom.xml
        <?xml version="1.0"?>
        
        $ grep -m1 name pom.xml | xxd
        0000000: 2020 2020 3c6e 616d 653e 4d75 6c74 692d      <name>Multi-
        0000010: 4172 6368 6574 7970 6573 2052 6f6f 7420  Archetypes Root 
        0000020: 504f 4d20 c3a4 c3b6 c3bc c39f 3c2f 6e61  POM ........</na
        0000030: 6d65 3e0a                                me>.
        
        $ MAVEN_OPTS="-Dfile.encoding=iso-8859-15" mvn -Doutput=effective.xml help:effective-pom 
        ...
        [INFO] Multi-Archetypes Root POM &#65533;&#65533;&#65533;
        ...
        
        $ head -n1 effective.xml 
        <?xml version="1.0" encoding="UTF-8"?>
        
        $ xmllint effective.xml 
        effective.xml:26: parser error : Input is not proper UTF-8, indicate encoding !
        Bytes: 0xE4 0xF6 0xFC 0xDF
            <name>Multi-Archetypes Root POM &#65533;&#65533;&#65533;&#65533;</name>
        
        $ mvn -version
        Apache Maven 3.0.3 (r1075438; 2011-02-28 18:31:09+0100)
        Java version: 1.6.0_26, vendor: Sun Microsystems Inc.
        Default locale: en_US, platform encoding: ANSI_X3.4-1968
        OS name: "linux", version: "3.0.0-12-generic-pae", arch: "i386", family: "unix"
        

        That is, we have a pom.xml with the default encoding (UTF-8) containing some properly encoded umlauts (c3a4...). The Maven run (simulating a system that uses Latin-9) already fails to read that correctly and emits replacement characters. The resulting XML is a mess: it explicitly declares UTF-8 while containing Latin-9 bytes.

        In summary: Maven doesn't behave deterministically here and depends on the system environment where it shouldn't, leading to hard-to-find problems that occur "out of the blue" for some developers only.
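The byte values in the transcript above can be reproduced with a short Python sketch. It assumes, as the xmllint error suggests, that the text "äöüß" was written as Latin-9 bytes under a UTF-8 declaration; it does not model Maven itself.

```python
import xml.etree.ElementTree as ET

# Text of the <name> suffix in the pom.xml above: "äöüß".
UMLAUTS = "\u00e4\u00f6\u00fc\u00df"

utf8_bytes = UMLAUTS.encode("utf-8")          # bytes in the source pom.xml
latin9_bytes = UMLAUTS.encode("iso-8859-15")  # bytes found in effective.xml

print(utf8_bytes.hex())    # compare with the xxd dump
print(latin9_bytes.hex())  # compare with the bytes in the xmllint error

# Reading the Latin-9 bytes as UTF-8 yields only replacement characters.
misread = latin9_bytes.decode("utf-8", errors="replace")

# A document that declares UTF-8 but carries Latin-9 bytes is not well-formed.
doc = b'<?xml version="1.0" encoding="UTF-8"?><name>' + latin9_bytes + b"</name>"
try:
    ET.fromstring(doc)
    well_formed = True
except ET.ParseError:
    well_formed = False

print(misread == "\ufffd" * 4, well_formed)
```

The mismatch between declaration and content is what xmllint reports; the parser has no way to guess the real encoding once the header claims UTF-8.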


          People

          • Assignee: Unassigned
          • Reporter: Mirko Friedenhagen
          • Votes: 2
          • Watchers: 2
