JRuby

RubyString incorrect handling of utf-8

Details

  • Number of attachments :
    0

Description

Reproduce the bug with

import org.jruby.Ruby;
import org.jruby.RubyString;

public class MyTest { 
  public static String hello(String name) {
    Ruby r = (Ruby)Ruby.getDefaultInstance();
    return r.newString("Hello " + name).toString();
  }
}

and a test file (to prevent shell funkiness)

spinäl

jruby -e "require 'rubygems' ; include_class 'MyTest' ; p MyTest.hello(IO.readlines('test_file.txt').first)"

Run with jruby 1.1.6, the correct result is
"Hello spin\303\244l"

Run with jruby 1.3.1, the utf-8 characters are incorrectly unicoded as well
"Hello spin\303\203\302\244l"

The commit that introduced the bug http://github.com/jruby/jruby/commit/4f4595e09bcfd817a64c41b4badff5f8ebf3aa4f

This bug is related to http://jira.codehaus.org/browse/JRUBY-3732

Activity

Hide
Ken Mayer added a comment -

Related also to JRUBY-2974

Show
Ken Mayer added a comment - Related also to JRUBY-2974
Hide
Charles Oliver Nutter added a comment -

Ken: Yeah, that probably would break other things. Have you tried reverting and re-running all tests?

Show
Charles Oliver Nutter added a comment - Ken: Yeah, that probably would break other things. Have you tried reverting and re-running all tests?
Hide
Ken Mayer added a comment -

Charles: We forked jruby (http://github.com/kmayer/jruby/tree/master), reverted the commit: All the tests ran green.

Show
Ken Mayer added a comment - Charles: We forked jruby (http://github.com/kmayer/jruby/tree/master), reverted the commit: All the tests ran green.
Hide
Charles Oliver Nutter added a comment -

Ok, I see how this was probably an over-reaching change. The original problem was that Java strings coming in from exception messages were being garbled because they were getting encoded into bytes using ISO-8859-1, which destroyed the original characters. I "fixed" it by adding a String version of the RubyString construction that used the Java default encoding to encode the bytes. This fixed the exception issue, but it had the side effect that every piece of code in the system that called Ruby.newString or RubyString.newString with a java.lang.String now switched from raw encoding to Java default encoding. It was a blunt attack on a specific problem that ended up "fixing" a bunch of other things in possibly the wrong way.

I attempted to modify the code to only use the default encoding path for exception messages, to keep the original issue fixed, and I renamed the method from newString(Ruby, String) to newStringUsingDefaultEncoding(Ruby, String). The idea being that we can choose to use this where appropriate in the rest of the system, rather than forcing it on the entire codebase without examining side effects (like those reported in this bug).

This did, however introduce a failure in one location, where we test for File.expand_path with a unicode path. I think we need to look at your specific case in detail and see why it's getting garbled. In almost all cases, the right behavior for a java.lang.String is to use the Java default encoding when turning it into bytes, and after having this code out there for so long I am reluctant to revert it.

Can we try to narrow this down to a specific code path that's garbling the strings? I really want this fixed without the wholesale reversion of what should be a generally correct change.

Show
Charles Oliver Nutter added a comment - Ok, I see how this was probably an over-reaching change. The original problem was that Java strings coming in from exception messages were being garbled because they were getting encoded into bytes using ISO-8859-1, which destroyed the original characters. I "fixed" it by adding a String version of the RubyString construction that used the Java default encoding to encode the bytes. This fixed the exception issue, but it had the side effect that every piece of code in the system that called Ruby.newString or RubyString.newString with a java.lang.String now switched from raw encoding to Java default encoding. It was a blunt attack on a specific problem that ended up "fixing" a bunch of other things in possibly the wrong way. I attempted to modify the code to only use the default encoding path for exception messages, to keep the original issue fixed, and I renamed the method from newString(Ruby, String) to newStringUsingDefaultEncoding(Ruby, String). The idea being that we can choose to use this where appropriate in the rest of the system, rather than forcing it on the entire codebase without examining side effects (like those reported in this bug). This did, however introduce a failure in one location, where we test for File.expand_path with a unicode path. I think we need to look at your specific case in detail and see why it's getting garbled. In almost all cases, the right behavior for a java.lang.String is to use the Java default encoding when turning it into bytes, and after having this code out there for so long I am reluctant to revert it. Can we try to narrow this down to a specific code path that's garbling the strings? I really want this fixed without the wholesale reversion of what should be a generally correct change.

People

Vote (3)
Watch (1)

Dates

  • Created:
    Updated: