Maven Site Plugin / MSITE-19

Various encoding problems with InputStream and XML

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0-beta-6
    • Component/s: encoding
    • Labels: None
    • Number of attachments: 9

      Description

      There are various encoding problems with InputStream and XML in different components.

      • Property resource files are encoded in UTF-8, but Java reads bundles as ISO-8859-1.
      • In different components, a Reader is constructed with the default system encoding.
      • MXParser ignores the encoding attribute in the XML declaration.
      1. plexus-i18n.diff
        2 kB
        Vincent Siveton
      2. plexus-site-renderer.diff
        0.9 kB
        Vincent Siveton
      3. plexus-utils_2.diff
        0.6 kB
        Vincent Siveton
      4. plexus-utils.diff
        1 kB
        Vincent Siveton
      5. project-info-report_ja.properties
        18 kB
        Naoki Nose
      6. project-info-report_zh_CN.properties
        26 kB
        Yue Ni
      7. project-info-report_zh_CN.properties
        26 kB
        Yue Ni
      8. site-plugin_ja.properties
        2 kB
        Naoki Nose
      9. site-plugin_zh_CN.properties
        3 kB
        Yue Ni

          Activity

          Vincent Siveton added a comment -

          This issue currently appears with a Japanese translation and may affect other East Asian languages (CJK charsets).

          • Using a VM parameter could be a good starting point: -Dfile.encoding=UTF-8 (to add to MAVEN_OPTS).
          • Java reads bundle streams with the ISO-8859-1 charset. The PropertyResourceBundle class uses Properties internally, and the ISO 8859-1 character encoding is used to load properties. Have a look at the API:
            http://java.sun.com/j2se/1.4.2/docs/api/java/util/PropertyResourceBundle.html
            http://java.sun.com/j2se/1.4.2/docs/api/java/util/Properties.html
            So I propose to correct plexus-i18n and use it instead of ResourceBundle.getBundle() calls (I think specifically in the maven-project-info-reports-plugin subproject). See plexus-i18n.diff.
            Another solution could be to run native2ascii on each bundle, but IMHO the result is not really human readable.
          • Xpp3DomBuilder in plexus-utils does not seem to handle the encoding parameter in the XML header correctly. As a result, the plexus-site-renderer component doesn't generate a site descriptor with special characters.
            Have a look at plexus-utils.diff and plexus-site-renderer.diff.
            Another issue could be in the toString() method of the Xpp3Dom class: we need to add a default encoding. See plexus-utils_2.diff.
          • Finally, IMHO, the StringInputStream class in the plexus-utils component does not have a good implementation, because no encoding is defined. Maybe we could migrate to the StringInputStream class from the Ant project:
            http://svn.apache.org/repos/asf/ant/core/trunk/src/main/org/apache/tools/ant/filters/StringInputStream.java

          Charset problems are hard to debug and depend on several factors.
          Other ideas are welcome.

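The bundle problem described in the comment above can be reproduced with the JDK alone. The following is a minimal sketch, not code from any of the attached patches: the class name, method names and the key `report.title` are made up for illustration. It relies only on the documented behaviour that `Properties.load(InputStream)` decodes bytes as ISO-8859-1. The `load(Reader)` overload used for comparison was added in Java 6, so it was not an option on the 1.4.2 API linked above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; none of these names come from the issue.
final class PropertiesEncodingDemo {

    // Loads from a byte stream: the bytes are always decoded as ISO-8859-1.
    static String loadFromStream(byte[] content, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Loads through a Reader, so the caller chooses the charset (UTF-8 here).
    static String loadFromReader(byte[] content, String key) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(content), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // A UTF-8 encoded bundle line containing a German umlaut.
        byte[] utf8 = "report.title=\u00dcbersicht\n".getBytes(StandardCharsets.UTF_8);
        // The same line with the umlaut written as a unicode escape (native2ascii style).
        byte[] escaped = "report.title=\\u00dcbersicht\n".getBytes(StandardCharsets.ISO_8859_1);

        // Each UTF-8 byte of the umlaut becomes its own character: garbage.
        System.out.println(loadFromStream(utf8, "report.title"));
        // The escaped form survives the ISO-8859-1 decoding.
        System.out.println(loadFromStream(escaped, "report.title"));
        // Reading through an explicit UTF-8 Reader also gives the intended text.
        System.out.println(loadFromReader(utf8, "report.title"));
    }
}
```

The first call is the failure mode reported here; the other two correspond to the two proposals in this thread (escaping the bundles, or reading them with an explicit charset).
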
          Vincent Siveton added a comment -

          plexus-i18n.diff

          Vincent Siveton added a comment -

          plexus-site-renderer.diff

          Vincent Siveton added a comment -

          plexus-utils.diff

          Vincent Siveton added a comment -

          plexus-utils_2.diff

          Lukas Theussl added a comment -

          The German translation also has some problems. The properties files are UTF-8 encoded, but the HTML output is unreadable (even with -Dfile.encoding=UTF-8 and LC_ALL=en_US.UTF-8; checked with test9 of the site plugin). Strangely, the French properties files are not UTF-8 encoded (contrary to our own standards), but the HTML result is correct in UTF-8. This definitely has to be sorted out before more translations come in...

          Michael Schnake added a comment -

          While trying to get meaningful results building a default site in German (that is, without any apt files etc. of my own), it seems that this is currently impossible with Maven 2 built from SVN. My default file.encoding is UTF-8. I have maven-site-plugin configured with <outputEncoding>UTF-8</outputEncoding>.

          The situation "out of the box" with regard to the German umlauts in the generated site is:
          => Result: site content has garbage, site navigator is correct, organization name (from pom.xml) in copyright statement is correct.

          1. Despite the statement at http://maven.apache.org/plugins/maven-site-plugin/i18n.html, all Java .properties files must be encoded as "ISO-8859-1 with unicode escapes as needed" (as defined by the Java API and already stated above). So I converted site-plugin_de.properties from UTF-8 to ISO-8859-1.
          => Result: Site content is correct, site navigator has garbage, organization name is correct.

          2. Well, the component building the site navigator seems to read site-plugin_de.properties using my platform default encoding (= UTF-8), which is incorrect, or at least does not conform to the Properties API documentation. So I called "mvn site" with MAVEN_OPTS="-Dfile.encoding=ISO-8859-1".
          => Result: Site content is correct, site navigator is correct, organization name has garbage.

          So now the organization name has garbage, although it comes from my pom.xml, which explicitly states <?xml version="1.0" encoding="UTF-8"?>. The parser reading the organization name from there seems to ignore that and uses the platform encoding (= ISO-8859-1 in the step above) instead.

          The net result is that you currently have to sacrifice one of site [content | navigator | copyright]. But, hey, two out of three is not that bad. Note that the previous comments on this bug already seem to explain (and probably fix) that behaviour. But perhaps this comment helps those struggling with site i18n until this is fixed.
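The last symptom (an organization name from a UTF-8 pom.xml turning into garbage) is what happens when XML bytes are decoded with a fixed charset before the parser ever sees the declaration. The sketch below illustrates this with the JDK's own DOM parser, not with the plexus-utils parser discussed in this issue; the class name, method names and sample document are hypothetical. It relies on the documented SAX rule that a parser given a character stream disregards any encoding declaration in the document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class using the JDK parser, not the plexus-utils one.
final class XmlDeclarationDemo {

    // Hands raw bytes to the parser, which detects the encoding itself.
    static String textViaByteStream(byte[] xml) {
        try {
            Document doc = newBuilder().parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Decodes the bytes up front with an assumed charset, as a Reader built
    // from the platform default encoding would do; the declaration is ignored.
    static String textViaReader(byte[] xml, Charset assumed) {
        try {
            InputSource source = new InputSource(
                    new InputStreamReader(new ByteArrayInputStream(xml), assumed));
            Document doc = newBuilder().parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    public static void main(String[] args) {
        byte[] pom = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<organization><name>M\u00fcller GmbH</name></organization>")
                .getBytes(StandardCharsets.UTF_8);

        // Encoding taken from the declaration: the umlaut is intact.
        System.out.println(textViaByteStream(pom));
        // Bytes decoded as ISO-8859-1 first: the umlaut becomes two characters.
        System.out.println(textViaReader(pom, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the byte stream and letting the parser decide is the safer default; a Reader is only correct when its charset is known to match the document.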

          Naoki Nose added a comment -

          I've looked into the source code for the cause of the encoding problems.

          Problem 1.
          The encoding detection for input files relies heavily on the default system encoding.
          Problem 2.
          In the site generation process, String to byte array conversions occur many times.
          This leads to problems that are difficult to solve.

          For problem 1, I have some ideas for solutions.

          There are several types of input files, for example:

          • property resource files
          • XML files
          • apt files

          and there should be a way of specifying the encoding according to the file type.

          For property resource files, I would like to use native2ascii.
          Admittedly the result is not human readable, but it rarely causes encoding problems.
          The readability problem can be avoided by automating the native2ascii processing. The build lifecycle phase "process-resources" would be a good place for such a step.
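What native2ascii does to a bundle can be sketched in a few lines of Java. This is only an illustration of the conversion, not the JDK tool itself and not a proposed mojo; the class and method names are hypothetical. It assumes the text has already been decoded into a String with the correct source charset, and it escapes every character outside the ASCII range as \uXXXX, one escape per UTF-16 code unit.

```java
// Hypothetical sketch of the native2ascii conversion; not the JDK tool.
final class Native2AsciiSketch {

    // Replaces every non-ASCII UTF-16 code unit with a \uXXXX escape.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                String hex = Integer.toHexString(c);
                // Pad to four hex digits, as the properties format expects.
                out.append("\\u").append("0000".substring(hex.length())).append(hex);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped line contains only ASCII, so it reads the same under
        // ISO-8859-1, UTF-8, EUC-JP or Shift_JIS.
        System.out.println(escapeNonAscii("report.title=\u00dcbersicht"));
    }
}
```

Because the output is pure ASCII, the bundle no longer depends on which charset the reading component assumes, which is why this approach rarely causes encoding problems.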

          For XML files, I think the encoding detection should follow the W3C XML specification.
          So MXParser should be changed to support automatic encoding detection.
          http://www.w3.org/TR/REC-xml/#sec-guessing
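As a rough illustration of the detection described in the linked section of the XML specification, the sketch below looks at the first bytes of a document: a byte order mark wins, otherwise the encoding named in the XML declaration is used, otherwise UTF-8 is assumed. It is deliberately simplified and is not the MXParser change requested here: the class name is hypothetical, and UTF-32, EBCDIC and UTF-16 without a byte order mark are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified sketch of XML encoding detection.
final class XmlEncodingGuess {

    // Matches the encoding pseudo-attribute of a leading XML declaration.
    private static final Pattern ENCODING_DECL = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Guesses the encoding name from the first bytes of an XML document.
    static String guess(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // The declaration itself is ASCII in all ASCII-compatible encodings,
        // so a byte-per-character decoding is enough to read it.
        int length = Math.min(head.length, 200);
        String prefix = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher matcher = ENCODING_DECL.matcher(prefix);
        // Without a declared encoding, the XML default of UTF-8 applies.
        return matcher.find() ? matcher.group(1) : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?><site/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(doc));
    }
}
```

The returned name would then be used to construct the Reader, instead of falling back to the platform default encoding.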

          For apt files, I think the encoding should follow the POM configuration.
          The configuration would look like this:

          <configuration>
            <inputEncoding>Shift_JIS</inputEncoding>
            <outputEncoding>UTF-8</outputEncoding>
            <locales>en,ja</locales>
          </configuration>

          For problem 2, I have no idea for a good solution yet.
          String to byte array conversions occur many times in the process of getting the site descriptor.
          In that process, the characters seem to be converted wrongly.

          Brett Porter added a comment -

          The plexus-site-renderer patch is no longer required, as it has moved to parsing using the Modello-generated model, which accounts for encoding.

          Brett Porter added a comment -

          I have not applied the i18n patch. I like the idea of doing native2ascii in process-resources better.

          Do you know if there will be any negative side effects of the change to XmlWriter? What was that attempting to address?

          Is there anything else necessary to get this issue resolved other than the above patches and the native2ascii'ing?

          Vincent Siveton added a comment -

          Brett,

          I tried to generate a dummy site in Japanese and in the other available languages.
          I used the plexus-utils trunk version and converted all bundles with native2ascii.
          It works a treat for me with outputEncoding=UTF-8.
          Naoki, could you confirm too?
          Also, some Japanese translations are missing (e.g. in the dependencies page).

          From my point of view, I don't see any negative side effects.

          I think we could close this issue after native2ascii'ing all bundles (whether or not native2ascii is automated in the process-resources phase).

          Naoki Nose added a comment -

          I also tried to generate a dummy site including Japanese.
          My environment is Debian GNU/Linux and the default encoding is EUC-JP.
          I used the trunk versions of maven-site-plugin, doxia, modello and plexus, and Japanese rendered correctly.
          Thanks! Many Japanese developers will appreciate this improvement.

          > Moreover some translation in japanese are missing (eg in the dependencies page).
          Some property items have been added to the original property file since I first sent the Japanese translation.
          I will update the Japanese translation later.

          >I think we could close this issue after native2ascii'ing all bundles (automating native2ascii with process-resources phase or not)

          There are some desirable improvements related to this problem:
          1. The XML parser in plexus-utils should handle the encoding parameter in the XML declaration correctly.
          2. Doxia constructs readers with a default encoding. The encoding of the site documents should be declared explicitly.

          May I create new issues for these?

          Brett Porter added a comment -

          Naoki,
          Yes, please create new issues for your 2 points and for the updated Japanese translation. Thanks!

          Naoki Nose added a comment -

          maven-site-plugin Japanese translation update.

          Naoki Nose added a comment -

          maven-project-info-reports-plugin Japanese translation update.

          Vincent Siveton added a comment -

          Applied in SVN. Thanks for the translation!

          Vincent Siveton added a comment -

          Brett,

          Any news about potential side effects? Could we close this issue?

          Brett Porter added a comment -

          We still need to set up the native2ascii'ing.

          Yue Ni added a comment -

          I translated the site and project-info-report resource bundles into Simplified Chinese and attached them here. Could anyone help commit them to the SVN repository?

          Yue Ni added a comment -

          Corrected a term in the translation.

          Brett Porter added a comment -

          Applied the Simplified Chinese translation, thanks. Please attach new translations to a new issue!

          Jesse McConnell added a comment -

          Should we make another mojo that can be bound to process-resources and implements the native2ascii behavior we are looking for here?

          I see that some people have already gotten this functionality working by using the Ant native2ascii task...

          If this is the case, then we can probably make a mojo for this pretty quickly: create an issue over there for it, link it to this issue, and then close this issue if that is all that remains.

          Brett Porter added a comment -

          I think so

          Carlos Sanchez added a comment -

          What is the status of all these patches?

          It would be better to use the external XmlPullParser from the standard JSR 173 API than to patch the one from plexus.

          Marian Flor added a comment -

          This is a follow-up to Michael Schnake's explanation.

          Here's how I got most German umlauts working (this may apply to other languages in general):
          Running Maven 2.0.4 (bin), with maven-project-info-reports-plugin-2.0.1.jar and maven-site-plugin-2.0-beta-5.jar.

          • Default Eclipse IDE/OS encoding is UTF-8.
          • Edited the _de.properties files according to Michael's instructions and repacked the jars in the local repository with the ISO-8859-1 encoded properties files. This is the ugly part, since it will break the umlauts again whenever a new version of the plugin is propagated.
          • Plugin configuration (pom.xml):
            ...
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-site-plugin</artifactId>
            <configuration>
              <locales>de</locales>
              <outputEncoding>UTF-8</outputEncoding>
            </configuration>
            ...
          • set MAVEN_OPTS=-Dfile.encoding=UTF-8
          • pom.xml and site.xml are UTF-8 encoded.
          • *.apt files are ISO-8859-1 encoded.
          • faq.fml (UTF-8 or ISO-8859-1, it does not matter)

          Results:
          Navigation (site.xml): ok, Copyright and team-list (pom.xml): ok, Content (.apt): ok, FAQ (.fml): broken umlauts. Thus I get 3 out of 4.

          The FAQ umlauts are broken with either encoding. My workaround is to use ASCII in *.fml documents. :-/
          The FAQ umlauts do not bother me that much, but if someone has an explanation or fix for this, it will be greatly appreciated.

          regards,
          Marian

          Darius added a comment -

          When using site:run, DoxiaFilter does not set any output encoding.
          I think it should call servletResponse.setCharacterEncoding() or servletResponse.setContentType()
          with the output encoding that is specified in pom.xml before calling servletResponse.getWriter().

          Maven 2.0.4 (bin) with maven-site-plugin-2.0-beta-5.jar.
          My apt files are UTF-8 and I get correct HTML files, but in preview mode (site:run) I see "?"s.

          Darius
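The "?" characters reported above are what a Writer produces when its charset cannot represent a character. The sketch below reproduces that effect with the JDK alone rather than with the servlet API or DoxiaFilter; the class and method names are hypothetical. It assumes the container falls back to ISO-8859-1 when no charset has been set before getWriter() is called, which is the default described in the ServletResponse documentation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a response writer; no servlet classes involved.
final class WriterCharsetDemo {

    // Writes text through a Writer bound to the given charset, then decodes
    // the resulting bytes the way a client told about that charset would.
    static String roundTrip(String html, Charset responseCharset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(body, responseCharset)) {
            writer.write(html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(body.toByteArray(), responseCharset);
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three Japanese characters

        // Unmappable characters are replaced by '?' when encoding.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));
        // With UTF-8 the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));
    }
}
```

This is consistent with the observation that the generated HTML files are correct while the preview is not: the loss happens only when the response is written, so setting the character encoding before obtaining the writer should avoid it.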

          Herve Boutemy added a comment -

          XML encoding detection support is fixed in 2.0-beta-6 for site.xml, xdoc files, *.fml, and so on.
          All problems from this issue should be fixed now.
          If there is still some area needing rework, please open another issue focused on it.

          Herve Boutemy added a comment -

          see http://docs.codehaus.org/display/MAVENUSER/XML+encoding for information on XML encoding detection support in Maven and plugins


            People

            • Assignee: Herve Boutemy
            • Reporter: Naoki Nose
            • Votes: 27
            • Watchers: 16