Details

    • Type: Bug Bug
    • Status: Resolved Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: JRuby 1.7.0
    • Component/s: None
    • Labels:
      None
    • Number of attachments :
      0

      Description

      This is taken as a comment after we closed https://github.com/jruby/jruby/issues/314.

      irb(main):022:0> s = "\u{00F6}"
      "\u00F6"
      irb(main):023:0> yml = YAML.dump s
      "--- \xF6\n"
      irb(main):024:0> yml.encoding
      #<Encoding:UTF-8>
      irb(main):025:0> yml.encode("windows-1252")
      Encoding::UndefinedConversionError: Input length = 2
          from org/jruby/RubyString.java:7479:in `encode'
          from (irb):25:in `evaluate'
          from org/jruby/RubyKernel.java:1065:in `eval'
          from org/jruby/RubyKernel.java:1390:in `loop'
          from org/jruby/RubyKernel.java:1173:in `catch'
          from org/jruby/RubyKernel.java:1173:in `catch'
          from C:\jruby-1.7.0.RC2\/bin/jirb_swing:54:in `(root)'
      irb(main):026:0> 
      

      I opened this because this is about dumping instead of loading. MRI will dump the chars are \xF6.

        Activity

        Hide
        Charles Oliver Nutter added a comment -

        This appears to be a case where SnakeYAML is not properly handling a non-ascii string encoded into YAML. We decode the character content from RubyString into java.lang.String correctly regardless of external encoding, but SnakeYAML takes that string and encodes it incorrectly depending on what the JVM's file.encoding is set to. If set to a non-UTF-8 encoding, it's encoding the bytes in an invalid format.

        This needs a SnakeYAML fix, as far as I can tell.

        Show
        Charles Oliver Nutter added a comment - This appears to be a case where SnakeYAML is not properly handling a non-ascii string encoded into YAML. We decode the character content from RubyString into java.lang.String correctly regardless of external encoding, but SnakeYAML takes that string and encodes it incorrectly depending on what the JVM's file.encoding is set to. If set to a non-UTF-8 encoding, it's encoding the bytes in an invalid format. This needs a SnakeYAML fix, as far as I can tell.
        Hide
        Charles Oliver Nutter added a comment -
        Show
        Charles Oliver Nutter added a comment - Filed SnakeYAML bug: http://code.google.com/p/snakeyaml/issues/detail?id=160
        Hide
        Charles Oliver Nutter added a comment -

        Ok, turns out it's not actually a SnakeYAML bug.

        In PsychEmitter we set up a Writer for it to write to using OutputStreamWriter(IOOutputStream). However, we don't give that OSW an encoding, so it defaults to Java default.

        Giving it UTF-8, as in the patch below, makes it work...but I'm not sure it's right for all cases. Should we use our default internal encoding? The write encoding of the target IO? Unclear.

        Punting to 1.7.1, since we want to get this right.

        diff --git a/src/org/jruby/ext/psych/PsychEmitter.java b/src/org/jruby/ext/psych/PsychEmitter.java
        index a3564e4..654e75b 100644
        --- a/src/org/jruby/ext/psych/PsychEmitter.java
        +++ b/src/org/jruby/ext/psych/PsychEmitter.java
        @@ -29,6 +29,7 @@ package org.jruby.ext.psych;
         
         import java.io.IOException;
         import java.io.OutputStreamWriter;
        +import java.io.UnsupportedEncodingException;
         import java.util.HashMap;
         import java.util.Map;
         import org.jruby.Ruby;
        @@ -81,7 +82,12 @@ public class PsychEmitter extends RubyObject {
             public IRubyObject initialize(ThreadContext context, IRubyObject io) {
                 options = new DumperOptions();
                 options.setIndent(2);
        -        emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io)), options);
        +        try {
        +            emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io), "UTF-8"), options);
        +        } catch (UnsupportedEncodingException uee) {
        +            // should never happen on a compliant JVM
        +            emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io)), options);
        +        }
         
                 return context.nil;
             }
        
        Show
        Charles Oliver Nutter added a comment - Ok, turns out it's not actually a SnakeYAML bug. In PsychEmitter we set up a Writer for it to write to using OutputStreamWriter(IOOutputStream). However, we don't give that OSW an encoding, so it defaults to Java default. Giving it UTF-8, as in the patch below, makes it work...but I'm not sure it's right for all cases. Should we use our default internal encoding? The write encoding of the target IO? Unclear. Punting to 1.7.1, since we want to get this right. diff --git a/src/org/jruby/ext/psych/PsychEmitter.java b/src/org/jruby/ext/psych/PsychEmitter.java index a3564e4..654e75b 100644 --- a/src/org/jruby/ext/psych/PsychEmitter.java +++ b/src/org/jruby/ext/psych/PsychEmitter.java @@ -29,6 +29,7 @@ package org.jruby.ext.psych; import java.io.IOException; import java.io.OutputStreamWriter; +import java.io.UnsupportedEncodingException; import java.util.HashMap; import java.util.Map; import org.jruby.Ruby; @@ -81,7 +82,12 @@ public class PsychEmitter extends RubyObject { public IRubyObject initialize(ThreadContext context, IRubyObject io) { options = new DumperOptions(); options.setIndent(2); - emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io)), options); + try { + emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io), "UTF-8"), options); + } catch (UnsupportedEncodingException uee) { + // should never happen on a compliant JVM + emitter = new Emitter(new OutputStreamWriter(new IOOutputStream(io)), options); + } return context.nil; }
        Hide
        Charles Oliver Nutter added a comment -

        Committed my not-quite-right-but-working fix to post17 branch:

        commit 0b8c417043e2e71a21790ae1d4ee139646e5a355
        Author: Charles Oliver Nutter <headius@headius.com>
        Date:   Wed Oct 17 18:30:07 2012 -0500
        
            Probable fix for JRUBY-6930
            
            yaml escaping of non-printable characters
            
            We need to give an encoding to the OutputStreamWriter. I chose
            UTF-8 for simplicity, but that may not be quite right. Need to
            decide what this encoding should be based on the target "io" we
            will write to.
        
        :100644 100644 a3564e4... 654e75b... M  src/org/jruby/ext/psych/PsychEmitter.java
        
        Show
        Charles Oliver Nutter added a comment - Committed my not-quite-right-but-working fix to post17 branch: commit 0b8c417043e2e71a21790ae1d4ee139646e5a355 Author: Charles Oliver Nutter <headius@headius.com> Date: Wed Oct 17 18:30:07 2012 -0500 Probable fix for JRUBY-6930 yaml escaping of non-printable characters We need to give an encoding to the OutputStreamWriter. I chose UTF-8 for simplicity, but that may not be quite right. Need to decide what this encoding should be based on the target "io" we will write to. :100644 100644 a3564e4... 654e75b... M src/org/jruby/ext/psych/PsychEmitter.java
        Hide
        Charles Oliver Nutter added a comment -

        Ok, after talking with Aaron Patterson, I think we have a good fix. And Tom Enebo convinced me this should be in 1.7.0, since it breaks unicode YAML stuff on non-UTF-8 platforms.

        commit f6d07a29bc5c1884790c163d31f4e6a90ee26a9d
        Author: Charles Oliver Nutter <headius@headius.com>
        Date:   Wed Oct 17 19:43:48 2012 -0500
        
            Fix JRUBY-6930: yaml escaping of non-printable characters
            
            Defer creating the SnakeYAML emitter until we are given an
            encoding from the start_stream event. This allows us to set up
            the OutputStreamWriter to use the proper charset, and has the
            added bonus of making other lazily-set options actually be
            reflected in the emitter's behavior.
        
        Show
        Charles Oliver Nutter added a comment - Ok, after talking with Aaron Patterson, I think we have a good fix. And Tom Enebo convinced me this should be in 1.7.0, since it breaks unicode YAML stuff on non-UTF-8 platforms. commit f6d07a29bc5c1884790c163d31f4e6a90ee26a9d Author: Charles Oliver Nutter <headius@headius.com> Date: Wed Oct 17 19:43:48 2012 -0500 Fix JRUBY-6930: yaml escaping of non-printable characters Defer creating the SnakeYAML emitter until we are given an encoding from the start_stream event. This allows us to set up the OutputStreamWriter to use the proper charset, and has the added bonus of making other lazily-set options actually be reflected in the emitter's behavior.

          People

          • Assignee:
            Charles Oliver Nutter
            Reporter:
            Thomas E Enebo
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: