Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: None
- Fix Version/s: JRuby 1.6
- Component/s: Parser, Ruby 1.9.2
- Labels: None
Description
File:
# encoding: utf-8
/こ/u
Error:
~/projects/rails/activesupport ➔ jruby --1.9 -c multibyte_chars_test.rb
org.jruby.exceptions.RaiseException: (RegexpError) invalid multibyte character: /ã““/

~/projects/rails/activesupport ➔ jruby -Xbacktrace.style=raw --1.9 -c multibyte_chars_test.rb
org.jruby.exceptions.RaiseException: (RegexpError) invalid multibyte character: /ã““/
    at java.lang.Thread.getStackTrace(Thread.java:1503)
    at org.jruby.RubyException.prepareBacktrace(RubyException.java:154)
    at org.jruby.exceptions.RaiseException.preRaise(RaiseException.java:156)
    at org.jruby.Ruby.newRaiseException(Ruby.java:3405)
    at org.jruby.Ruby.newRegexpError(Ruby.java:3238)
    at org.jruby.RubyRegexp.raiseRegexpError19(RubyRegexp.java:1017)
    at org.jruby.RubyRegexp.raisePreprocessError(RubyRegexp.java:431)
    at org.jruby.RubyRegexp.unescapeNonAscii(RubyRegexp.java:601)
    at org.jruby.RubyRegexp.preprocess(RubyRegexp.java:671)
    at org.jruby.RubyRegexp.preprocessCheck(RubyRegexp.java:678)
    at org.jruby.parser.ParserSupport.regexpFragmentCheck(ParserSupport.java:1617)
    at org.jruby.parser.ParserSupport.newRegexpNode(ParserSupport.java:1630)
    at org.jruby.parser.Ruby19Parser$295.execute(Ruby19Parser.java:3575)
    at org.jruby.parser.Ruby19Parser.yyparse(Ruby19Parser.java:1477)
    at org.jruby.parser.Ruby19Parser.yyparse(Ruby19Parser.java:1368)
    at org.jruby.parser.Ruby19Parser.parse(Ruby19Parser.java:4237)
    at org.jruby.parser.Parser.parse(Parser.java:112)
    at org.jruby.parser.Parser.parse(Parser.java:94)
    at org.jruby.Ruby.parseFile(Ruby.java:2290)
    at org.jruby.Ruby.parseFile(Ruby.java:2295)
    at org.jruby.Ruby.parseFromMain(Ruby.java:444)
    at org.jruby.Main.run(Main.java:255)
    at org.jruby.Main.run(Main.java:144)
    at org.jruby.Main.main(Main.java:113)
Possible fix pushed to master. The problem appears to be that, when pulling multibyte characters off the stream, StringTerm.parseStringIntoBuffer was using the first byte as the codepoint for the whole character and appending the resulting character's bytes to the buffer. It then proceeded to the next byte and did the same thing again, so the result was a totally garbled string. My fix was to get the encoded length associated with the first byte and append that many bytes to the buffer. There's probably a utility method for this somewhere that I don't know about.
commit e94175ed4335d251ef21fc4ec68d46aae801380a
Author: Charles Oliver Nutter <headius@headius.com>
Date:   Tue Jan 4 04:19:09 2011 -0600

    Possible fix for JRUBY-5301: [1.9] Unicode characters in regexp in unicode-encoded file do not parse

    * When reading multi-byte characters from stream, use encoded length
      associated with first byte to read remaining bytes. Old code was
      using first byte as codepoint, and garbling the string.

diff --git a/src/org/jruby/lexer/yacc/StringTerm.java b/src/org/jruby/lexer/yacc/StringTerm.java
index cc6ff85..77404e8 100644
--- a/src/org/jruby/lexer/yacc/StringTerm.java
+++ b/src/org/jruby/lexer/yacc/StringTerm.java
@@ -300,7 +300,13 @@ public class StringTerm extends StrTerm {
                 if (buffer.getEncoding() != encoding) {
                     mixedEscape(lexer, buffer.getEncoding(), encoding);
                 }
-                lexer.tokenAddMBC(c, buffer, buffer.getEncoding()); // No EOF here?
+                // read bytes for length of character
+                int length = encoding.length((byte)c);
+                buffer.append((byte)c);
+                for (int off = 0; off < length - 1; off++) {
+                    buffer.append(src.read());
+                }
+                continue;
             } else if (qwords && Character.isWhitespace(c)) {
                 src.unread(c);
                 break;
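
For anyone who wants to see the effect outside of JRuby, here is a minimal standalone sketch of the two behaviours described above. It is not the JRuby code: the class MultibyteCopyDemo, the helper utf8Length, and both copy methods are hypothetical names made up for illustration, plain java.io streams stand in for the lexer source and the ByteList buffer, and it assumes the input is UTF-8. The "buggy" variant is only one plausible reading of how using the first byte as a codepoint would garble the string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names exist in JRuby.
final class MultibyteCopyDemo {

    // Assumed stand-in for encoding.length(byte): the UTF-8 sequence
    // length implied by the lead byte.
    static int utf8Length(int leadByte) {
        if ((leadByte & 0x80) == 0x00) return 1;
        if ((leadByte & 0xE0) == 0xC0) return 2;
        if ((leadByte & 0xF0) == 0xE0) return 3;
        if ((leadByte & 0xF8) == 0xF0) return 4;
        return 1; // continuation or invalid lead byte: copy it through alone
    }

    // Old behaviour as described: every non-ASCII byte is treated as a
    // codepoint of its own and re-encoded, so one character becomes several.
    static byte[] copyPerByteAsCodepoint(byte[] source) {
        ByteArrayInputStream src = new ByteArrayInputStream(source);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int c;
        while ((c = src.read()) != -1) {
            if (c >= 0x80) {
                byte[] encoded = new String(Character.toChars(c))
                        .getBytes(StandardCharsets.UTF_8);
                buffer.write(encoded, 0, encoded.length);
            } else {
                buffer.write(c);
            }
        }
        return buffer.toByteArray();
    }

    // Fixed behaviour: take the encoded length from the first byte and
    // append that many raw bytes unchanged.
    static byte[] copyByEncodedLength(byte[] source) {
        ByteArrayInputStream src = new ByteArrayInputStream(source);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int c;
        while ((c = src.read()) != -1) {
            int length = utf8Length(c);
            buffer.write(c);
            for (int off = 0; off < length - 1; off++) {
                int next = src.read();
                if (next == -1) break; // truncated character at end of input
                buffer.write(next);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // "/\u3053/u" is the regexp literal from the report (U+3053 is こ).
        byte[] source = "/\u3053/u".getBytes(StandardCharsets.UTF_8);
        System.out.println("source bytes: " + source.length);
        System.out.println("per-byte copy: " + copyPerByteAsCodepoint(source).length);
        System.out.println("length-based copy: " + copyByEncodedLength(source).length);
    }
}
```

Unlike the committed patch, the sketch checks for end of input inside the inner loop. Whether the real lexer needs that check is not something this report settles.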