Maven 1.x XDoc Plugin: MPXDOC-195

xDoc plugin scrambles UTF-8 source files when generating UTF-8 HTML

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.9.2
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      Maven 1.1-beta-2 and maven-xdoc-plugin-1.9.2 on a Windows XP workstation with IBM Java SDK V1.4.2
    • Testcase included:
      yes
    • Number of attachments :
      2

      Description

      When I attempt to build UTF-8 encoded HTML from UTF-8 XML source files, every special character is scrambled.

      We are using the xDoc plugin to generate the HTML for our on-line user guide. We sent the English source files and the default properties file to 9 translation centers. The translators returned valid UTF-8 source, but xDoc will not generate valid UTF-8 HTML.

      I have attached a very small subset of our product that demonstrates this problem. See the README.txt file within the ZIP archive for information about how to use the supplied scripts to build the output for German, English, French, and Traditional Chinese and view the result.
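For readers unfamiliar with this symptom, scrambling of this kind is typically what results when UTF-8 bytes are decoded with a single-byte charset. The following is a minimal sketch under that assumption; the issue does not confirm that this is the actual cause in the xDoc plugin. The class name `MojibakeDemo` and the choice of ISO-8859-1 as the wrong charset are hypothetical (a Windows XP system would more likely default to windows-1252).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from the plugin or the attached test case.
class MojibakeDemo {
    // Encode text as UTF-8, then decode the same bytes with another charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) occupies two bytes in UTF-8: C3 A9.
        String garbled = misdecode("\u00e9", StandardCharsets.ISO_8859_1);
        // Print code points rather than glyphs so the console encoding does not matter.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each non-ASCII character turns into two or more unrelated characters, which matches the general description of "scrambled" special characters.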

        Activity

        Lance Bader added a comment -

        Since I opened the issue, I copied the same archive to a workstation using the Sun JDK V1.5 Update 6 and reproduced the same problem. I think this makes it unlikely that the IBM JDK is causing the problem.

        I have also modified the velocity-1.4.jar file in the .maven\repository\velocity\jars directory. I replaced org\apache\velocity\runtime\defaults\velocity.properties after making the following changes.

        #----------------------------------------------------------------------------
        # T E M P L A T E  E N C O D I N G
        #----------------------------------------------------------------------------

        input.encoding=UTF-8
        output.encoding=UTF-8

        I reproduced the problem after making this change. I would like to think that the problem is in Velocity, but this change did not affect the outcome.

        Next I will move to Red Hat Linux and try the suggested workaround there.

        Lance

        Lance Bader added a comment -

        NOTE: Although it has no effect on this problem, I have discovered a defect in the test case I supplied. The properties files in src\i18nBundles have not been converted to the required ASCII encoding. I expected the translators to return ASCII-encoded files, but they used some native format instead. As a result, the navigation items, the section headers, and the subsection headers will appear wrong, even if the rest of the page is generated correctly.

        I will attach an updated test case when I have converted the properties files correctly. I first have to find out what encoding the translators used (it is obviously not UTF-8) and fix them with native2ascii.
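As a rough illustration of the conversion that native2ascii performs, the sketch below decodes bytes with a known source charset and rewrites every non-ASCII character as a `\uXXXX` escape, which is the form Java properties files expect. The class and method names (`AsciiEscaper`, `escape`, `convert`) are hypothetical, and this is not the utility's actual implementation; the source charset must still be determined separately, as noted above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a native2ascii-style conversion.
class AsciiEscaper {
    // Replace each UTF-16 code unit outside the ASCII range with a \\uXXXX escape.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Decode raw file bytes with the charset the translators actually used, then escape.
    static String convert(byte[] raw, Charset source) {
        return escape(new String(raw, source));
    }

    public static void main(String[] args) {
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(convert(raw, StandardCharsets.ISO_8859_1));
    }
}
```

If the wrong source charset is passed to `convert`, the escapes will be well-formed but will name the wrong characters, which is why identifying the translators' encoding comes first.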

        Lance

        Lance Bader added a comment -

        I found an old Red Hat Linux system where I could run the supplied test case. Specifically, it is Red Hat Enterprise Linux V4 update 3 for i386. I installed Maven 1.1-beta-2 with maven-xdoc-plugin-1.9.2 and the attached test case. I created a script that matches the actions in build_de.bat, build_en.bat, build_fr.bat, and build_zh_TW.bat.

        Except for the unrelated problem caused by poison properties files in src\i18nBundles (see the problem report in a previous comment), the HTML was generated CORRECTLY.

        NOTE: I did NOT have to modify the LANG or LC_CTYPE environment variables, as suggested in the xDoc plugin FAQ or in http://jira.codehaus.org/browse/MPXDOC-184 . By default, LANG was already set to "en_US.UTF-8". I dumped the Java system properties and observed that file.encoding=UTF-8 by default.
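A check like the one described can be done in a few lines. This is a minimal sketch, with the hypothetical class name `EncodingReport`; it prints the `file.encoding` system property next to the charset the JVM actually uses by default, which is useful when comparing the Linux and Windows runs.

```java
import java.nio.charset.Charset;

// Hypothetical helper for comparing encoding settings across machines.
class EncodingReport {
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once normally and once with `-Dfile.encoding=UTF-8` on the command line shows whether the override is reflected in the default charset on a given JVM.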

        So, that raises the question, "Why doesn't this work on a Windows XP system when you use -Dfile.encoding=UTF-8 to override the default file encoding?" It's a mystery.
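For contrast, code that names its charset explicitly does not depend on `file.encoding` at all. The sketch below is only an illustration of that general technique, using in-memory streams and the hypothetical class name `ExplicitCharsetIo`; it is not taken from the plugin, and nothing in this issue establishes that the plugin reads or writes files this way or any other way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of platform-independent text I/O.
class ExplicitCharsetIo {
    // Write text as UTF-8 bytes, regardless of the platform default charset.
    static byte[] write(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read UTF-8 bytes back into text, again naming the charset explicitly.
    static String read(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe \u4e2d\u6587";
        System.out.println(read(write(text)).equals(text));
    }
}
```

A reader or writer constructed without a charset argument falls back to the platform default, which is the kind of dependency that would make output differ between the Linux and Windows XP systems.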

        Lance Bader added a comment -

        This is an updated test case where the poison properties files have been converted to ASCII using the native2ascii utility. When you run the supplied batch files on a Windows system, the navigation pane, the section names, and the subsection names will appear correctly; however, special characters in the other content will still be scrambled.

        Lance


          People

          • Assignee:
            Unassigned
            Reporter:
            Lance Bader
          • Votes:
            1
            Watchers:
            0
