Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: JRuby 1.6.7
    • Fix Version/s: JRuby 1.7.0.pre1
    • Component/s: None
    • Labels:
      None
    • Testcase included:
      yes
    • Number of attachments :
      0

      Description

      In Moped (my MongoDB driver), I have some code attempting to safely convert input text into UTF-8 (which all strings must be).

      See https://gist.github.com/2351047 for a simplified test case of the implementation.

      Problem: calling `encode('utf-8')` on a binary string creates garbage UTF-8 text instead of raising a conversion error as Ruby 1.9 does.

      Reason the code exists: We're tracking referrers in our app, but the header value in the rack environment is encoded as ASCII-8BIT. The code first takes the happy path of calling `encode('utf-8')`, and only when that raises an error does it fall back to forcing the encoding.
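To make the expected behavior concrete, here is a minimal sketch of what the reporter describes. It assumes a Ruby 1.9-compatible implementation; the sample bytes and the `try_encode` helper are illustrative and are not taken from the linked gist.

```ruby
# The bytes 0xC3 0xA9 spell "é" in UTF-8, but the string is tagged as binary,
# much like a header value arriving as ASCII-8BIT.
binary = "caf\xC3\xA9".dup.force_encoding('ASCII-8BIT')

# Hypothetical helper: returns the transcoded string, or the class of the
# error raised while transcoding.
def try_encode(string)
  string.encode('utf-8')
rescue EncodingError => e
  e.class
end

result = try_encode(binary)

# Expected on Ruby 1.9 semantics: an Encoding::UndefinedConversionError,
# because bytes >= 0x80 in a binary string have no defined mapping to UTF-8.
puts result.inspect
```

The bug is that the affected JRuby versions return a transcoded (garbage) string here instead of raising, so the `rescue EncodingError` fallback never runs.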

      Tested on: jruby-1.6.7 and jruby-1.7.0-dev

        Activity

        Bernerd Schaefer added a comment -

        As a work-around, replacing:

        begin
          data = string.encode('utf-8')
        rescue EncodingError
          data = string.dup
          data.force_encoding 'utf-8'
        
          raise unless data.valid_encoding?
        end
        

        with:

        if string.encoding.name == 'ASCII-8BIT'
          data = string.dup
          data.force_encoding('utf-8')
        
          unless data.valid_encoding?
            raise EncodingError, "Could not encode ASCII-8BIT data #{string.dump} as UTF-8"
          end
        else
          data = string.encode('utf-8')
        end
        

        seems to accomplish the same result. Unfortunately, this requires checking the encoding of every string, whereas the original approach lets strings that are already UTF-8, or compatible with it, be processed without the encoding check.
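The work-around above can be wrapped in a small method for experimentation. This is only a sketch: the method name `to_utf8` and the sample strings are assumptions made for illustration, not part of Moped.

```ruby
# Hypothetical helper wrapping the work-around from the comment above.
def to_utf8(string)
  if string.encoding == Encoding::ASCII_8BIT
    # Binary input: reinterpret the bytes as UTF-8 rather than transcoding.
    data = string.dup
    data.force_encoding('utf-8')

    unless data.valid_encoding?
      raise EncodingError, "Could not encode ASCII-8BIT data #{string.dump} as UTF-8"
    end

    data
  else
    # Any other encoding: a normal transcode.
    string.encode('utf-8')
  end
end

valid   = "caf\xC3\xA9".dup.force_encoding('ASCII-8BIT') # valid UTF-8 bytes, binary tag
invalid = "\xFF\xFE".dup.force_encoding('ASCII-8BIT')    # not valid UTF-8
latin1  = "caf\xE9".dup.force_encoding('ISO-8859-1')     # needs real transcoding

puts to_utf8(valid).encoding
puts to_utf8(latin1).encoding

begin
  to_utf8(invalid)
rescue EncodingError => e
  puts e.message
end
```

Comparing `string.encoding` against the `Encoding::ASCII_8BIT` constant is equivalent to the name comparison in the comment and avoids a string comparison on every call.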

        Charles Oliver Nutter added a comment -

        Partial attempt. This is probably close to doing the right thing. https://gist.github.com/2707532

        FWIW, we do have the real transcoding logic on master now, which would probably be the proper thing to wire up (rather than Java NIO transcoding).

        Charles Oliver Nutter added a comment -

        I fixed this atop the current transcoding logic. I imagine we'll have to revisit this, but I'll get your test into our suite for that day.

        commit ede8fd9417158e6762079278ed7be270ef342f12
        Author: Charles Oliver Nutter <headius@headius.com>
        Date:   Thu May 17 17:35:55 2012 -0500
        
            Fix JRUBY-6588
            
            Multiple fixes and improvements for transcoding. We are still
            passing everything through Java Charset logic, which means the
            intermediate phase is always UTF-16, but we will now error out
            properly. The test case from the bug passes now.
        

          People

          • Assignee:
            Charles Oliver Nutter
            Reporter:
            Bernerd Schaefer
          • Votes:
            0
            Watchers:
            2
