Details

    • Number of attachments :
      2

      Description

      This gist explains the issue:

      https://gist.github.com/2314158

      In a nutshell: when code in a US-ASCII source file is passed a UTF-8 string and merges it with a local string using '%', JRuby behaves differently from MRI 1.9, and also differently from '+' in JRuby. I would expect behavior consistent with MRI.

      This is causing real-world encoding bugs for me. Any library that uses '%' in a US-ASCII source file will garble UTF-8 strings passed in.
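A minimal sketch of the scenario described above. It is not the attached jruby_encoding_test.rb or the gist; the variable names and the `force_encoding` calls (standing in for literals from a US-ASCII source file) are assumptions made for illustration. On MRI 1.9.3 and later, both results are expected to be UTF-8.

```ruby
# Illustrative only: stand-ins for string literals in a US-ASCII source file.
template = 'Name: %s'.dup.force_encoding('US-ASCII')
prefix   = 'Name: '.dup.force_encoding('US-ASCII')

# A UTF-8 string containing a non-ASCII character, passed in from elsewhere.
name = "B\u00E9ige"

formatted    = template % name   # merge with '%'
concatenated = prefix + name     # merge with '+'

puts formatted.encoding
puts concatenated.encoding
puts formatted == concatenated
```

If the two encodings differ, or the final comparison prints false, the runtime is showing the inconsistency reported here.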

      1. jruby_encoding_test.rb
        0.2 kB
        Patrick Ritchie
      2. utf8_file
        0.0 kB
        Patrick Ritchie

        Activity

        Andy Lindeman added a comment -

        Seems to be inconsistent even within 1.9 in MRI:

        1.9.2p318 :001 > s = '%s'.force_encoding('US-ASCII')
        1.9.2p318 :002 > utf8 = 'foo'.force_encoding('UTF-8')
        1.9.2p318 :003 > (s % utf8).encoding
        => #<Encoding:UTF-8>

        1.9.3p125 :007 > s = '%s'.force_encoding('US-ASCII')
        1.9.3p125 :008 > utf8 = 'foo'.force_encoding('UTF-8')
        1.9.3p125 :009 > (s % utf8).encoding
        => #<Encoding:US-ASCII>

        It seems like 1.9.3 agrees with JRuby. I haven't tried ruby-head.

        Are you seeing it with 1.9.3? Or only 1.9.2? Thoughts?

        Patrick Ritchie added a comment -

        I'm using 1.9.3 ... got the same result as you under 1.9.2

        But I first noticed this because of an issue with Slim templates in Rails. This template is garbled under JRuby but comes out fine under 1.9.2 and 1.9.3:

        p= "Béige"

        I debugged it to the '%' handling, but seeing as 1.9.2 behaves differently from 1.9.3, perhaps there is something else at work here as well.

        If it would be helpful, I can package up a sample Rails app with a spec that demonstrates the issue.

        Hiro Asari added a comment -

        The problem, as it appears to me, is that the encoding of the String returned by String#% should be dictated by the format it takes, not the encoding of the receiver. (See https://github.com/ruby/ruby/blob/e95f7ea80d096cf27ea0ae5f7dc712ad72e71f3c/sprintf.c#L477 )

        For this, we need a general utility function that returns an Encoding from any object, not just a ByteList (e.g., see https://github.com/ruby/ruby/blob/57fb2199059cb55b632d093c2e64c8a3c60acfbb/encoding.c#L676 ), which I don't think we have.

        Patrick Ritchie added a comment -

        That might help explain some of the other random encoding issues I'm seeing under JRuby.

        Charles Oliver Nutter added a comment -

        I think I have it. It looks like MRI uses the format string's encoding as a base, but as it encounters string arguments it uses rb_enc_check to negotiate a common encoding. Because the incoming String actually contains non-ASCII (Unicode) characters, the result ends up UTF-8.

        Patch pending.
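The negotiation described above can be observed from Ruby code. This is a rough sketch, not the patch itself: it assumes `Encoding.compatible?` as the Ruby-level counterpart of the check (it returns nil where rb_enc_check would raise), and the variable names are made up for illustration. It also suggests why the ASCII-only 'foo' test earlier in the thread stays US-ASCII on 1.9.3 while "Béige" does not.

```ruby
# Illustrative only: a US-ASCII format string and two UTF-8 arguments.
fmt        = 'Name: %s'.dup.force_encoding('US-ASCII')
ascii_only = 'foo'.dup.force_encoding('UTF-8')  # UTF-8 tagged, 7-bit content
non_ascii  = "B\u00E9ige"                        # UTF-8 with a non-ASCII char

# For each argument, record the negotiated encoding and the actual
# encoding of the formatted result.
results = [ascii_only, non_ascii].map do |arg|
  [Encoding.compatible?(fmt, arg), (fmt % arg).encoding]
end

results.each do |negotiated, actual|
  puts "negotiated=#{negotiated} result=#{actual}"
end
```

On MRI 1.9.3 and later, the ASCII-only argument is expected to leave the result in the format string's encoding, while the non-ASCII argument should move it to UTF-8.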

        Charles Oliver Nutter added a comment -
        commit 224ff97e26c0ebd361a03fb1f066631941d45f1c
        Author: Charles Oliver Nutter <headius@headius.com>
        Date:   Tue May 15 18:34:10 2012 -0500
        
            Fix JRUBY-6582
            
            We were missing encoding negotiation that MRI does during % with
            string args.
        

          People

          • Assignee:
            Charles Oliver Nutter
            Reporter:
            Patrick Ritchie