JRuby

String#unpack("Ux") does not handle multi-bytes correctly

Details

  • Type: Bug Bug
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Fixed
  • Affects Version/s: JRuby 1.0.1
  • Fix Version/s: JRuby 1.1.4
  • Component/s: Core Classes/Modules
  • Labels:
    None
  • Patch Submitted:
    Yes
  • Number of attachments :
    3

Description

A call of b.unpack("U2") should return two characters, but returns only one.

  1. three characters 'ä'
    b = [0xC3,0xA4, 0xC3,0xA4, 0xC3,0xA4].pack("C*")
  2. should result in two charachters, currently 1
    b.unpack("U2")

A patch which solves the problem is attached

  1. 1454.patch
    01/Nov/07 4:28 PM
    1 kB
    Kjetil Ødegaard
  2. patch_03102007.txt
    03/Nov/07 10:34 AM
    4 kB
    Tom Quellenberg
  3. patch.txt
    19/Oct/07 3:28 AM
    3 kB
    Tom Quellenberg

Activity

Hide
Charles Oliver Nutter added a comment -

Two things for you:

  1. enclose code snippits in a noformat section to prevent it being formated; it kinda garbled your example. See http://jira.codehaus.org/secure/WikiRendererHelpAction.jspa?section=advanced
  2. can you attach a short test case that demonstrates the problem?
Show
Charles Oliver Nutter added a comment - Two things for you:
  1. enclose code snippits in a noformat section to prevent it being formated; it kinda garbled your example. See http://jira.codehaus.org/secure/WikiRendererHelpAction.jspa?section=advanced
  2. can you attach a short test case that demonstrates the problem?
Hide
Tom Quellenberg added a comment -

Sorry for my garbled description.

Maybe this test explains the problem:

require 'test/unit'

class UnpackTest < Test::Unit::TestCase
	def test_unpack
		b = [0xC3,0xA4, 0xC3,0xA4, 0xC3,0xA4].pack("C*")
		assert_equal 2, b.unpack("U2").length
	end
end
Show
Tom Quellenberg added a comment - Sorry for my garbled description. Maybe this test explains the problem:
require 'test/unit'

class UnpackTest < Test::Unit::TestCase
	def test_unpack
		b = [0xC3,0xA4, 0xC3,0xA4, 0xC3,0xA4].pack("C*")
		assert_equal 2, b.unpack("U2").length
	end
end
Hide
Thomas E Enebo added a comment -

The supplied patch I think is not quite right because if it encounters a an unmappable character then it will convert it to '?' instead of throwing an exception. This is one reason why the existing code uses an explicit decoder. Tom, Can you update your patch to account for this?

Show
Thomas E Enebo added a comment - The supplied patch I think is not quite right because if it encounters a an unmappable character then it will convert it to '?' instead of throwing an exception. This is one reason why the existing code uses an explicit decoder. Tom, Can you update your patch to account for this?
Hide
Wes Nakamura added a comment -

The problem occurs on this line:

byte[] toUnpack = new byte[occurrences];

in the section beginning on line 902 of Pack.java:

                case 'U' :
                    {
                        if (occurrences == IS_STAR || occurrences > encode.remaining()) {
                            occurrences = encode.remaining();
                        }
                        //get the correct substring
                        byte[] toUnpack = new byte[occurrences];
                        encode.get(toUnpack);
                        CharBuffer lUtf8 = null;

The byte array is only occurrences in size, whereas each UTF-8 char can be more than one byte.

As a result lutf8.hasRemaining() (a little further down) return false too early, as toUnpack never held enough of the original array.

As a result, using * works, but using a number in general does not.

I'm not sure what the correct fix would be - always set toUpack's size to encode.remaining()? Or perhaps use occurrences * 4 since I believe 4 bytes is the maximum length of a UTF-8 char?

Show
Wes Nakamura added a comment - The problem occurs on this line:
byte[] toUnpack = new byte[occurrences];
in the section beginning on line 902 of Pack.java:
                case 'U' :
                    {
                        if (occurrences == IS_STAR || occurrences > encode.remaining()) {
                            occurrences = encode.remaining();
                        }
                        //get the correct substring
                        byte[] toUnpack = new byte[occurrences];
                        encode.get(toUnpack);
                        CharBuffer lUtf8 = null;
The byte array is only occurrences in size, whereas each UTF-8 char can be more than one byte. As a result lutf8.hasRemaining() (a little further down) return false too early, as toUnpack never held enough of the original array. As a result, using * works, but using a number in general does not. I'm not sure what the correct fix would be - always set toUpack's size to encode.remaining()? Or perhaps use occurrences * 4 since I believe 4 bytes is the maximum length of a UTF-8 char?
Hide
Charles Oliver Nutter added a comment -

Bumping off 1.0.2.

Show
Charles Oliver Nutter added a comment - Bumping off 1.0.2.
Hide
Kjetil Ødegaard added a comment -

Here's a small patch that builds on Wes Nakamura's comments. It allocates four bytes per UTF-8 character, but not more than what's left in the encode buffer. This keeps the byte array small, even if the encode buffer is large.

Show
Kjetil Ødegaard added a comment - Here's a small patch that builds on Wes Nakamura's comments. It allocates four bytes per UTF-8 character, but not more than what's left in the encode buffer. This keeps the byte array small, even if the encode buffer is large.
Hide
Tom Quellenberg added a comment -

By reading always 4 bytes per character, the decode function in most cases reads too many bytes. When the additional bytes are not in the correct format (they do not belong to the string, the user wants to decode), the function throws wrongly an encoding exception.
I think the only way to solve the problem is to calculate the correct number of bytes. I took the information about utf-8 byte length from this page: http://en.wikipedia.org/wiki/UTF-8 and wrote a patch (patch_03102007.txt).

Show
Tom Quellenberg added a comment - By reading always 4 bytes per character, the decode function in most cases reads too many bytes. When the additional bytes are not in the correct format (they do not belong to the string, the user wants to decode), the function throws wrongly an encoding exception. I think the only way to solve the problem is to calculate the correct number of bytes. I took the information about utf-8 byte length from this page: http://en.wikipedia.org/wiki/UTF-8 and wrote a patch (patch_03102007.txt).
Hide
Charles Oliver Nutter added a comment -

Test passes in 1.1.4 and a much more extensive set of pack tests live in rubyspec. Plus 1.0 branch is dead.

Show
Charles Oliver Nutter added a comment - Test passes in 1.1.4 and a much more extensive set of pack tests live in rubyspec. Plus 1.0 branch is dead.

People

Vote (0)
Watch (3)

Dates

  • Created:
    Updated:
    Resolved: