JRuby (please use github issues at http://bugs.jruby.org)
JRUBY-5301

[1.9] Unicode characters in regexp in unicode-encoded file do not parse

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: JRuby 1.6
    • Component/s: Parser, Ruby 1.9.2
    • Labels: None
    • Number of attachments: 0

      Description

      File:

      # encoding: utf-8
      /こ/u
      

      Error:

      ~/projects/rails/activesupport ➔ jruby --1.9 -c multibyte_chars_test.rb 
      org.jruby.exceptions.RaiseException: (RegexpError) invalid multibyte character: /こ“/
      
      ~/projects/rails/activesupport ➔ jruby -Xbacktrace.style=raw --1.9 -c multibyte_chars_test.rb 
      org.jruby.exceptions.RaiseException: (RegexpError) invalid multibyte character: /こ“/
      	at java.lang.Thread.getStackTrace(Thread.java:1503)
      	at org.jruby.RubyException.prepareBacktrace(RubyException.java:154)
      	at org.jruby.exceptions.RaiseException.preRaise(RaiseException.java:156)
      	at org.jruby.Ruby.newRaiseException(Ruby.java:3405)
      	at org.jruby.Ruby.newRegexpError(Ruby.java:3238)
      	at org.jruby.RubyRegexp.raiseRegexpError19(RubyRegexp.java:1017)
      	at org.jruby.RubyRegexp.raisePreprocessError(RubyRegexp.java:431)
      	at org.jruby.RubyRegexp.unescapeNonAscii(RubyRegexp.java:601)
      	at org.jruby.RubyRegexp.preprocess(RubyRegexp.java:671)
      	at org.jruby.RubyRegexp.preprocessCheck(RubyRegexp.java:678)
      	at org.jruby.parser.ParserSupport.regexpFragmentCheck(ParserSupport.java:1617)
      	at org.jruby.parser.ParserSupport.newRegexpNode(ParserSupport.java:1630)
      	at org.jruby.parser.Ruby19Parser$295.execute(Ruby19Parser.java:3575)
      	at org.jruby.parser.Ruby19Parser.yyparse(Ruby19Parser.java:1477)
      	at org.jruby.parser.Ruby19Parser.yyparse(Ruby19Parser.java:1368)
      	at org.jruby.parser.Ruby19Parser.parse(Ruby19Parser.java:4237)
      	at org.jruby.parser.Parser.parse(Parser.java:112)
      	at org.jruby.parser.Parser.parse(Parser.java:94)
      	at org.jruby.Ruby.parseFile(Ruby.java:2290)
      	at org.jruby.Ruby.parseFile(Ruby.java:2295)
      	at org.jruby.Ruby.parseFromMain(Ruby.java:444)
      	at org.jruby.Main.run(Main.java:255)
      	at org.jruby.Main.run(Main.java:144)
      	at org.jruby.Main.main(Main.java:113)
      

        Activity

        Charles Oliver Nutter added a comment -

        Possible fix pushed to master. The problem appears to be that when pulling multibyte characters off the stream, StringTerm.parseStringIntoBuffer was using the first byte as a codepoint for the whole character, and appending the resulting character bytes to the buffer. It then proceeded to the next byte and did the same thing again. The result was a totally garbled string. My fix was to get the encoded length associated with the first byte and append that many bytes to the buffer. There's probably a utility method for this somewhere I don't know about.

        commit e94175ed4335d251ef21fc4ec68d46aae801380a
        Author: Charles Oliver Nutter <headius@headius.com>
        Date:   Tue Jan 4 04:19:09 2011 -0600
        
            Possible fix for JRUBY-5301: [1.9] Unicode characters in regexp in unicode-encoded file do not parse
            
            * When reading multi-byte characters from stream, use encoded length associated with first byte to read remaining bytes. Old code was using first byte as codepoint, and garbling the string.
        
        diff --git a/src/org/jruby/lexer/yacc/StringTerm.java b/src/org/jruby/lexer/yacc/StringTerm.java
        index cc6ff85..77404e8 100644
        --- a/src/org/jruby/lexer/yacc/StringTerm.java
        +++ b/src/org/jruby/lexer/yacc/StringTerm.java
        @@ -300,7 +300,13 @@ public class StringTerm extends StrTerm {
                         if (buffer.getEncoding() != encoding) {
                             mixedEscape(lexer, buffer.getEncoding(), encoding);
                         }
        -                lexer.tokenAddMBC(c, buffer, buffer.getEncoding()); // No EOF here?
        +                // read bytes for length of character
        +                int length = encoding.length((byte)c);
        +                buffer.append((byte)c);
        +                for (int off = 0; off < length - 1; off++) {
        +                    buffer.append(src.read());
        +                }
        +                continue;
                     } else if (qwords && Character.isWhitespace(c)) {
                         src.unread(c);
                         break;
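
        For illustration only, here is a standalone Java sketch (not JRuby code; the class and helper names are invented) of the mechanism described above: interpreting each byte of a UTF-8 character as its own codepoint garbles the text, while copying the number of bytes implied by the lead byte, as the patch does via jcodings' Encoding#length, keeps the character intact.

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.charset.StandardCharsets;

        public class MultibyteReadSketch {
            // Buggy behavior: treat each byte as a complete codepoint.
            // For こ (UTF-8 bytes E3 81 93) this yields U+00E3, U+0081, U+0093 -- garbage.
            static String readGarbled(byte[] input) {
                StringBuilder sb = new StringBuilder();
                for (byte b : input) {
                    sb.appendCodePoint(b & 0xFF);
                }
                return sb.toString();
            }

            // Fixed behavior: the lead byte determines how many bytes the character
            // occupies (standing in for jcodings' Encoding#length used in the patch).
            static int utf8Length(int leadByte) {
                if (leadByte < 0x80) return 1;           // ASCII
                if ((leadByte & 0xE0) == 0xC0) return 2; // 110xxxxx
                if ((leadByte & 0xF0) == 0xE0) return 3; // 1110xxxx
                return 4;                                // 11110xxx
            }

            static String readByLength(InputStream src) throws IOException {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                int c;
                while ((c = src.read()) != -1) {
                    int length = utf8Length(c);
                    buffer.write(c);
                    // copy the remaining bytes of the character through unchanged
                    for (int off = 0; off < length - 1; off++) {
                        buffer.write(src.read());
                    }
                }
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }

            public static void main(String[] args) throws IOException {
                byte[] ko = "こ".getBytes(StandardCharsets.UTF_8); // E3 81 93
                System.out.println(readGarbled(ko));                            // garbled
                System.out.println(readByLength(new ByteArrayInputStream(ko))); // こ
            }
        }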
        
        Thomas E Enebo added a comment -

        This was fixed right before 1.6.0.RC1.


          People

          • Assignee: Thomas E Enebo
          • Reporter: Charles Oliver Nutter
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
            • Updated:
            • Resolved: