Jetty
  1. Jetty
  2. JETTY-244

OutputWriter handle multibyte UTF-8 chars wrong

    Details

    • Type: Bug Bug
    • Status: Resolved Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1.2rc1
    • Component/s: HTTP
    • Labels:
      None
    • Number of attachments :
      3

      Description

      There is problem in the way how multibyte UTF-8 characters are handled at end of chunk in the method org.mortbay.jetty.AbstractGenerator.OutputWriter.write(char[] s,int offset, int length).

      When multibyte UTF-8 character (for example á - \u00E1) is last character which can fit into "bytes" buffer, it is printed two times to output. One times at the end of buffer, but than this code

      if (chunk-i>buffer.length-bytes)
      chunk=buffer.length-bytes+i;

      cuts the chunk (it is right in the other places - we spend two or more bytes form "bytes" buffer, so we must shorten number of chars which can fir teh buffer). But when this cut occurs at the end of "for (int i = 0; i < chunk; i++)" cycle, this shortcuting of chunk appears like we didn't write last char into buffer. So it is written again in next cycle of OutputWrite.write() call.

      I think condition

      if (chunk-i>buffer.length-bytes)
      chunk=buffer.length-bytes+i;

      should be properly

      if (chunk-i>buffer.length-bytes && buffer.length-bytes>0)
      chunk=buffer.length-bytes+i;

      1. ServletTest2.java
        0.9 kB
        Filip Jirsák
      2. ServletTest1java
        0.9 kB
        Filip Jirsák
      3. OutputWriter-utf-8.diff
        3 kB
        Filip Jirsák

        Activity

        Hide
        Filip Jirsák added a comment -

        Testcase servlet. Should write 21 lines with "áíáíáí..." text. All lines should has the same length and chars "á" and "í" should alternate regularly.

        With default configuration (buffer length set to 2048), it now (before applying patch) prints lines 7 and 20 one character longer - there is double "íí" at line 7 and double "áá" at beginning of line 20. (You can find it by "(.)\1" RegExp.)

        Show
        Filip Jirsák added a comment - Testcase servlet. Should write 21 lines with "áíáíáí..." text. All lines should has the same length and chars "á" and "í" should alternate regularly. With default configuration (buffer length set to 2048), it now (before applying patch) prints lines 7 and 20 one character longer - there is double "íí" at line 7 and double "áá" at beginning of line 20. (You can find it by "(.)\1" RegExp.)
        Hide
        Greg Wilkins added a comment -

        Actually, I think the whole if statement is wrong... it is being too paranoid.

        The if with break before the bytes are written is more than enough.

        Just testing this and will soon commit

        Show
        Greg Wilkins added a comment - Actually, I think the whole if statement is wrong... it is being too paranoid. The if with break before the bytes are written is more than enough. Just testing this and will soon commit
        Hide
        Greg Wilkins added a comment -

        Fixed in svn

        Thanks for the report, test harness and patch! made it very easy!

        Show
        Greg Wilkins added a comment - Fixed in svn Thanks for the report, test harness and patch! made it very easy!
        Hide
        Filip Jirsák added a comment -

        Commenting out the whole if statement was first thing I tried and it doesn't work - it throws ArrayIndexOutOfBounds exception sometime. It is because there isn't test for buffer length in case of one byte unicode character

        if ((code & 0xffffff80) == 0)

        { // 1b buffer[bytes++]=(byte)(code); }

        When while if statements at multibyte cases (//2b and bigger) would be removed there must be condition for buffer length in one byte case statement

        if ((code & 0xffffff80) == 0)
        {
        // 1b
        if (bytes >= buffer.length)

        { chunk=i; break; }

        buffer[bytes++]=(byte)(code);
        }

        In my patch I choose another solution. In my code chunk is shortcuted only in case we are not on buffer end. With this option there must be one more condition within multibyte characters, but there is no need for condition within one byte character. I think in HTTP environment there will be much more onebyte characters, so I think no condition in onbyte case and two conditions in multibyte cases is more efficient.

        Show
        Filip Jirsák added a comment - Commenting out the whole if statement was first thing I tried and it doesn't work - it throws ArrayIndexOutOfBounds exception sometime. It is because there isn't test for buffer length in case of one byte unicode character if ((code & 0xffffff80) == 0) { // 1b buffer[bytes++]=(byte)(code); } When while if statements at multibyte cases (//2b and bigger) would be removed there must be condition for buffer length in one byte case statement if ((code & 0xffffff80) == 0) { // 1b if (bytes >= buffer.length) { chunk=i; break; } buffer [bytes++] =(byte)(code); } In my patch I choose another solution. In my code chunk is shortcuted only in case we are not on buffer end. With this option there must be one more condition within multibyte characters, but there is no need for condition within one byte character. I think in HTTP environment there will be much more onebyte characters, so I think no condition in onbyte case and two conditions in multibyte cases is more efficient.
        Hide
        Filip Jirsák added a comment -

        Testcase which fails with removed if conditions in multibyte characters:

        else if((code&0xfffff800)==0)
        {
        // 2b
        if (buffer.length-bytes<2)

        { chunk=i; break; }

        buffer[bytes++]=(byte)(0xc0|(code>>6));
        buffer[bytes++]=(byte)(0x80|(code&0x3f));

        //condition removed
        // if (chunk-i>buffer.length-bytes)
        // chunk=buffer.length-bytes+i;
        }

        It should output text file with pattern of spaces and "á". With condition removed it fails with java.lang.ArrayIndexOutOfBoundsException: 2048 in at org.mortbay.jetty.AbstractGenerator$OutputWriter.write

        Show
        Filip Jirsák added a comment - Testcase which fails with removed if conditions in multibyte characters: else if((code&0xfffff800)==0) { // 2b if (buffer.length-bytes<2) { chunk=i; break; } buffer [bytes++] =(byte)(0xc0|(code>>6)); buffer [bytes++] =(byte)(0x80|(code&0x3f)); //condition removed // if (chunk-i>buffer.length-bytes) // chunk=buffer.length-bytes+i; } It should output text file with pattern of spaces and "á". With condition removed it fails with java.lang.ArrayIndexOutOfBoundsException: 2048 in at org.mortbay.jetty.AbstractGenerator$OutputWriter.write
        Hide
        Greg Wilkins added a comment -

        right you are..... off my game today!

        Show
        Greg Wilkins added a comment - right you are..... off my game today!
        Hide
        Greg Wilkins added a comment -

        OK, I think i have the solution now.
        I didn't like your test as I am sure there are cases where the chars needs to be reduced when bytes < buffer.length

        So the version I have gone with is this

        if (bytes+chars-i-1>buffer.length)
        chars-=1;

        which simply detects the possibility of overflow and reduces the chars accordingly.

        It is working for your test cases... but I will do some more testing today

        Show
        Greg Wilkins added a comment - OK, I think i have the solution now. I didn't like your test as I am sure there are cases where the chars needs to be reduced when bytes < buffer.length So the version I have gone with is this if (bytes+chars-i-1>buffer.length) chars-=1; which simply detects the possibility of overflow and reduces the chars accordingly. It is working for your test cases... but I will do some more testing today

          People

          • Assignee:
            Greg Wilkins
            Reporter:
            Filip Jirsák
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: