Details
Description
In irb on CRuby 1.9, String#encoding returns the correct encoding, UTF-8.
But in irb on JRuby in 1.9 mode, String#encoding returns the wrong encoding, ASCII-8BIT.
Please see the attached screenshots, because the description area can't display Japanese.
- irb_screenshots.zip (75 kB, 22/Oct/10 10:53 PM, Youhei Kondou)
  - cirb19.png 36 kB
  - __MACOSX/._cirb19.png 0.3 kB
  - jirb19.png 46 kB
  - __MACOSX/._jirb19.png 0.3 kB
- 246cc28_jruby5156_jruby160_irb.png (14 kB, 15/Feb/11 8:32 PM)
- 246cc28_jruby5156_jruby160.PNG (27 kB, 15/Feb/11 6:48 PM)
- jruby5156_cruby192.png (70 kB, 19/Jan/11 7:02 AM)
- jruby5156_jruby160.png (75 kB, 19/Jan/11 7:02 AM)
Activity
So where does Ruby 1.9 get UTF-8 from? Perhaps we're still just defaulting to ASCII because that's the standard in Ruby 1.8.
First, thank you for correcting the Component/s area.
> So where does Ruby 1.9 get UTF-8 from?
UTF-8 comes from the terminal's default encoding (on both Mac OS X and openSUSE).
> Perhaps we're still just defaulting to ASCII because that's the standard in Ruby 1.8
IMHO, that's okay if the input contains only letters and numbers. But, as the screenshot shows, it's strange that Hiragana is reported as ASCII.
According to my translation, http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html#l14, Ruby 1.9 looks at the $LANG environment variable to determine the encoding of standard I/O. On my OS X machine, "echo $LANG" prints "ja_JP.UTF-8", while Java's file.encoding is SJIS on JDK 6 and UTF-8 on JDK 5. The terminal's default encoding is usually the value of $LANG, so this matches what Youhei said.
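To compare what each runtime derives from the environment, a short script like the one below could be run under both CRuby and JRuby. It is only a sketch: `encoding_report` is a name made up for this illustration, and the values it prints depend entirely on the platform, $LANG, and any -E or -K options.

```ruby
# Hypothetical helper (not part of Ruby or JRuby): gathers the encodings
# the running interpreter has settled on, plus the raw $LANG value.
def encoding_report
  {
    lang: ENV["LANG"],                           # may be nil, e.g. on Windows
    locale: Encoding.find("locale"),             # encoding derived from the locale
    default_external: Encoding.default_external, # default for I/O
    default_internal: Encoding.default_internal, # nil unless explicitly set
    script: __ENCODING__                         # source encoding of this file
  }
end

encoding_report.each do |name, value|
  puts "#{name}: #{value.inspect}"
end
```

If the locale and default_external entries differ between the two runtimes on the same terminal, that would point at startup configuration rather than the parser.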
Perhaps we should use whatever the JDK claims as its default platform encoding for our default 1.9 mode encoding? That should be the correct encoding based on the current platform's $LANG and other settings.
Here is a summary of Ruby 1.9 encodings:
http://redmine.ruby-lang.org/wiki/ruby-19/ScriptEncoding
I guess irb falls under the "-e and stdin" case, and my sample is surely the "no -K -E, no magic comment" case, so both the "script encoding" and the "default external" encoding should be taken from the locale ($LANG).
There are two bugs here:
- We are not properly setting up the default external encoding and "kcode" (which lingers on from 1.8 mode) when starting up in 1.9 mode.
- The parser is not propagating the current encoding (or default external encoding) into strings, etc that it parses.
Most of this fix will require work in the 1.9 parser logic, so that it will create ByteList objects for our RubyStrings with the appropriate encoding. Some additional work is needed to set up the kcode and default external encoding in 1.9 mode, but it's not major.
Because of this, I'm marking this as a 1.9 and Parser bug. Also marked for fixing in 1.6.
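The second bug can be observed from Ruby code alone: a literal should carry the encoding of the source it was parsed from. The sketch below uses eval, which in CRuby parses a string using the string's own encoding as the source encoding. `literal_encoding_under` is a hypothetical helper name, and whether a given JRuby build behaves the same way is the open question in this issue.

```ruby
# Hypothetical helper: parse the literal 'a' as if the source were written
# in +source_encoding+ and return the encoding the parser gave the literal.
def literal_encoding_under(source_encoding)
  source = "'a'.encoding".encode(source_encoding)
  eval(source)
end

puts literal_encoding_under("UTF-8")
puts literal_encoding_under("Windows-31J")
```

A parser that propagates the source encoding should report a different encoding for each call, rather than a fixed ASCII-8BIT or US-ASCII.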
Recent encoding work (likely Tom's big push for internal/external and parser encoding support) appears to have fixed this. Thanks for the report!
~/projects/jruby ➔ jruby --1.9 -S irb
>> "foo".encoding
=> #<Encoding:UTF-8>
First, many thanks to everyone for implementing String#encoding with M17N.
In 1.6.0 RC1, on both my Linux machine and Mac OS X (the terminal is in UTF-8), it works completely.
irb(main):001:0> JRUBY_VERSION
"1.6.0.RC1"
irb(main):002:0> RUBY_VERSION
"1.9.2"
irb(main):003:0> 'a'.encoding
#<Encoding:UTF-8>
irb(main):004:0> 'あ'.encoding
#<Encoding:UTF-8>
But on Windows (the terminal is in Japanese), it only half works.
irb(main):001:0> JRUBY_VERSION
"1.6.0.RC1"
irb(main):002:0> RUBY_VERSION
"1.9.2"
irb(main):003:0> 'a'.encoding
#<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
SyntaxError: (irb):4: invalid multibyte char (Windows-31J)
    from org/jruby/RubyKernel.java:1096:in `eval19'
    from org/jruby/RubyKernel.java:1421:in `loop'
    from org/jruby/RubyKernel.java:1208:in `rbCatch19'
    from org/jruby/RubyKernel.java:1208:in `rbCatch19'
Adding "Windows" category, since it now appears to only fail on Windows.
I'm surprised by this. Can you provide more information about your system please?
- default external and internal encodings
- whether this character works in a script, with and without encoding: header
- full Java backtrace for the error, by passing -J-Djruby.backtrace.style=raw to JRuby
I'd like to see us fix this for 1.6, so marking as such.
Screenshots for irb: the first is CRuby 1.9.2 and the second is JRuby 1.6.0 RC1 in --1.9 mode.
I attached screenshots of both CRuby's irb and JRuby's irb.
First, CRuby's:
! jruby5156_cruby192.png !
Encoding.default_external is Windows-31J and Encoding.default_internal is nil.
Windows-31J is the default encoding of command prompt in Japanese Windows.
Next, JRuby's:
! jruby5156_jruby160.png !
Encoding.default_external is US-ASCII and Encoding.default_internal is nil.
And its Java backtrace is below:
SyntaxError: (irb):7: invalid multibyte char (Windows-31J)
    from Thread.java:1479:in `getStackTrace'
    from RubyException.java:154:in `prepareBacktrace'
    from RaiseException.java:156:in `preRaise'
    from Ruby.java:3432:in `newRaiseException'
    from Ruby.java:3261:in `newSyntaxError'
    from Parser.java:139:in `parse'
    from Parser.java:83:in `parse'
    from Parser.java:75:in `parse'
    from Ruby.java:2346:in `parseEval'
    from ASTInterpreter.java:146:in `evalWithBinding'
    from RubyKernel.java:1138:in `evalCommon'
    from RubyKernel.java:1096:in `eval19'
    from org/jruby/RubyKernel$s_method_0_3$RUBYINVOKER$eval19.gen:65535:in `call'
    from DynamicMethod.java:178:in `call'
    from CachingCallSite.java:252:in `cacheAndCall'
    from CachingCallSite.java:71:in `call'
    ... 118 levels...
    from BeginNode.java:83:in `interpret'
    from NewlineNode.java:103:in `interpret'
    from BlockNode.java:71:in `interpret'
    from ASTInterpreter.java:70:in `INTERPRET_METHOD'
    from InterpretedMethod.java:184:in `call'
    from DefaultMethod.java:179:in `call'
    from CachingCallSite.java:282:in `cacheAndCall'
    from CachingCallSite.java:139:in `call'
    from jirb_swing:65:in `__file__'
    from jirb_swing:-1:in `load'
    from Ruby.java:702:in `runScript'
    from Ruby.java:587:in `runNormally'
    from Ruby.java:421:in `runFromMain'
    from Main.java:304:in `run'
    from Main.java:144:in `run'
    from Main.java:113:in `main'
I believe I at least know what we are doing wrong. I set the default external encoding unconditionally to US-ASCII, and clearly this gets picked up by some other means. I will try and figure out how default encoding is supposed to get set and fix this.
In 1.6.0 RC2, Encoding.default_external returns Windows-31J, but the same error occurs.
Reduced test case:
# coding: utf-8
require 'readline'
line = Readline.readline('> ', true)
p line.encoding
The problem is our readline is not honoring default_external.
This issue should be fixed for irb, but it is not really a 100% fix, since it assumes default_external will match the default charset used by a Java String. Unfortunately, we need to change jline to just read in bytes, since String is not an adequate class for us to support all the encodings supported by M17N.
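The mismatch described here can be illustrated in plain Ruby. When bytes arrive in one encoding but are merely labeled with another, the result is an invalid string, which is the kind of input the parser rejects with "invalid multibyte char". This is a sketch with hypothetical helper names (`relabel`, `transcode`), not the JRuby readline code.

```ruby
# Hypothetical helpers, for illustration only.

# Keeps the bytes untouched and only changes the encoding label.
def relabel(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

# Converts the characters so that bytes and label agree.
def transcode(text, encoding)
  text.encode(encoding)
end

hiragana_a = "\u3042" # the Hiragana character from the reports, as UTF-8

relabeled  = relabel(hiragana_a, "Windows-31J")
transcoded = transcode(hiragana_a, "Windows-31J")

puts relabeled.valid_encoding?
puts transcoded.valid_encoding?
puts transcoded.bytes.map { |b| format("%02X", b) }.join(" ")
```

Reading raw bytes from the terminal and labeling them with the terminal's real encoding avoids both the lossy round trip through a Java String and the mislabeling shown by `relabel`.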
Youhei...Can you try latest master (commit 246cc28 or later) and tell me if it fixes the problem for you?
I retried with 246cc28, but the same error occurs. I have attached a screenshot of jirb_swing ! 246cc28_jruby5156_jruby160.PNG ! and the stack trace is below:
SyntaxError: (irb):8: invalid multibyte char (Windows-31J)
    from Thread.java:1479:in `getStackTrace'
    from TraceType.java:20:in `getBacktrace'
    from RubyException.java:151:in `prepareBacktrace'
    from RaiseException.java:159:in `preRaise'
    from RaiseException.java:80:in `<init>'
    from Ruby.java:3258:in `newRaiseException'
    from Ruby.java:3097:in `newSyntaxError'
    from Parser.java:139:in `parse'
    from Parser.java:83:in `parse'
    from Parser.java:75:in `parse'
    from Ruby.java:2311:in `parseEval'
    from ASTInterpreter.java:158:in `evalWithBinding'
    from RubyKernel.java:1135:in `evalCommon'
    from RubyKernel.java:1093:in `eval19'
    from org/jruby/RubyKernel$s_method_0_3$RUBYINVOKER$eval19.gen:65535:in `call'
    from DynamicMethod.java:179:in `call'
    ... 118 levels...
    from BlockNode.java:71:in `interpret'
    from ASTInterpreter.java:74:in `INTERPRET_METHOD'
    from InterpretedMethod.java:190:in `call'
    from DefaultMethod.java:179:in `call'
    from CachingCallSite.java:282:in `cacheAndCall'
    from CachingCallSite.java:139:in `call'
    from C:/jruby_latest/bin/jirb_swing:65:in `__file__'
    from C:/jruby_latest/bin/jirb_swing:-1:in `load'
    from Ruby.java:667:in `runScript'
    from Ruby.java:571:in `runNormally'
    from Ruby.java:420:in `runFromMain'
    from Main.java:278:in `doRunFromMain'
    from Main.java:198:in `internalRun'
    from Main.java:164:in `run'
    from Main.java:148:in `run'
    from Main.java:128:in `main'
Does it work with jirb? It is possible jirb_swing starts up much differently (not to say it isn't a bug as well).
Okay, it now works with jirb.
! 246cc28_jruby5156_jruby160_irb.png !
But the Japanese characters I type are garbled.
(That's why I use jirb_swing.)
Youhei: What happens when you use console IRB with --noreadline? Also, please try an updated build of JRuby after this commit and let me know if it's any better:
commit 4f3a2a2d320a1e5c4d5ba4b47948f66b61798f16
Author: Charles Oliver Nutter <headius@headius.com>
Date: Tue Apr 5 12:33:53 2011 -0500
Update to jline-1.0-SNAPSHOT to get recent merges and patches.
(1) jruby --1.9 -S jirb --noreadline (1.6.0)
Succeeded with Windows-31J strings:
irb(main):001:0> JRUBY_VERSION
=> "1.6.0"
irb(main):002:0> RUBY_VERSION
=> "1.9.2"
irb(main):003:0> 'a'.encoding
=> #<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
=> #<Encoding:Windows-31J>
(2) jruby --1.9 -S jirb_swing --noreadline (1.6.0)
No errors, but no success either: the prompt freezes.
(3) jruby --1.9 -S jirb --noreadline (4f3a2a2d)
Same as (1)
(4) jruby --1.9 -S jirb_swing --noreadline (4f3a2a2d)
Same as (2)
jirb is all OK and jirb_swing is all NG. jirb is probably enough for the JRuby ecosystem. If jirb_swing is not an important product in the JRuby ecosystem (probably true), then IMHO one option is to remove jirb_swing from JRuby.
I tried again with JRuby 1.6.6.
(1) jruby --1.9 -S jirb (1.6.6)
irb(main):001:0> JRUBY_VERSION
=> "1.6.6"
irb(main):002:0> RUBY_VERSION
=> "1.9.2"
irb(main):003:0> 'a'.encoding
=> #<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
=> #<Encoding:Windows-31J>
(2) jruby --1.9 -S jirb_swing (1.6.6)
irb(main):001:0> JRUBY_VERSION
=> "1.6.6"
irb(main):002:0> RUBY_VERSION
=> "1.9.2"
irb(main):003:0> 'a'.encoding
=> #<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
SyntaxError: (irb):4: invalid multibyte char (Windows-31J)
    from org/jruby/RubyKernel.java:1082:in `eval'
    from org/jruby/RubyKernel.java:1408:in `loop'
    from org/jruby/RubyKernel.java:1195:in `catch'
    from org/jruby/RubyKernel.java:1195:in `catch'
    from C:/jruby-1.6.6/bin/jirb_swing:54:in `(root)'
Can you try JRuby 1.7/master? We have fixed a number of encoding issues. I tried with a Chinese character and it was handled correctly as UTF-8 in my setup, but I'm not sure how to simulate your setup.
I tried with the JRuby 1.7.0 preview 1 zip from GitHub.
(1) jruby -S jirb (1.7.0 preview 1)
irb(main):001:0> JRUBY_VERSION
=> "1.7.0.preview1"
irb(main):002:0> RUBY_VERSION
=> "1.9.3"
irb(main):003:0> 'a'.encoding
=> #<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
irb(main):005:0' '
(2) jruby -S jirb_swing (1.7.0 preview 1)
irb(main):001:0> JRUBY_VERSION
"1.7.0.preview1"
irb(main):002:0> RUBY_VERSION
"1.9.3"
irb(main):003:0> 'a'.encoding
#<Encoding:Windows-31J>
irb(main):004:0> 'あ'.encoding
SyntaxError: (irb):4: invalid multibyte char (Windows-31J)
    from org/jruby/RubyKernel.java:1037:in `eval'
    from org/jruby/RubyKernel.java:1353:in `loop'
    from org/jruby/RubyKernel.java:1146:in `catch'
    from org/jruby/RubyKernel.java:1146:in `catch'
    from C:/temp/jruby-jruby-00c8c98/bin/jirb_swing:54:in `(root)'
jirb_swing still fails, and jirb now fails again (for a different reason).
Thank you for the update. We will try to solve this once and for all for 1.7.
Hello,
I have trouble with UTF-8 encoded script files. It works with a magic comment, but if I run the script without one I get a syntax error. I tried it with JRuby 1.6.7.2 and also with the new 1.7.0.preview1. I use the Windows 7 operating system.
Here is my test script (the string is in Czech):
a = "Příliš žluťoučký kůň"
puts a
puts a.encoding
If I run this console command:
jruby -Ku --1.9 test_encoding.rb
I get a syntax error:
SyntaxError: test_encoding.rb:1: invalid multibyte char (US-ASCII)
I think the -Ku parameter should be exactly for setting the file encoding, but it doesn't work.
The JRuby and OS versions I tried:
jruby 1.7.0.preview1 (ruby-1.9.3-p203) (2012-05-19 00c8c98) (Java HotSpot(TM) Client VM 1.6.0_31) [Windows 7-x86-java]
and
jruby 1.6.7.2 (ruby-1.9.2-p312) (2012-05-01 26e08ba) (Java HotSpot(TM) Client VM 1.6.0_31) [Windows 7-x86-java]
Thanks for your help.
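The error message in this report is consistent with the script's UTF-8 bytes being parsed as US-ASCII. Below is a minimal sketch of the reported workaround (the magic comment), together with a check of why the same bytes fail under US-ASCII. `SAMPLE` and `valid_as` are names introduced here for illustration only.

```ruby
# encoding: utf-8
# The magic comment above only takes effect on the first line of a file
# (or the second, after a shebang).

SAMPLE = "Příliš žluťoučký kůň"

# Hypothetical helper: are these same bytes valid when labeled as +encoding+?
def valid_as(text, encoding)
  text.dup.force_encoding(encoding).valid_encoding?
end

puts SAMPLE.encoding
puts valid_as(SAMPLE, "UTF-8")
puts valid_as(SAMPLE, "US-ASCII")
```

This does not explain why -Ku fails to set the source encoding in 1.9 mode; it only shows that the parser's complaint follows once US-ASCII has been chosen.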
In jirb and jirb_swing on 1.7.0 RC1 and RC2, the problem described in http://jira.codehaus.org/browse/JRUBY-5156?focusedCommentId=299233&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-299233 still occurs.
Can you run this and tell me your result:
jruby -e 'require "java"; puts java.lang.System.getProperty("file.encoding")'
Also does your input charset match this value? Does your input charset match default_external in Ruby?
I am assuming that your default_external is not matching file.encoding system property. I even have a comment about this being a problem in readline source code.
Yes, these two are different.
> jruby -e "require 'java'; puts java.lang.System.getProperty('file.encoding')"
MS932
> jruby -e "puts Encoding.default_external"
Windows-31J
But these two names denote the same set of code points.
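Because the JVM reports the charset as MS932 while Ruby calls it Windows-31J, a comparison between file.encoding and default_external needs a name mapping first. A minimal sketch, assuming a hand-written alias table: `JAVA_CHARSET_ALIASES` and `ruby_encoding_for` are hypothetical and not part of JRuby.

```ruby
# Hypothetical alias table from Java charset names to Ruby encoding names.
# Only the pair reported in this issue is listed.
JAVA_CHARSET_ALIASES = {
  "MS932" => "Windows-31J"
}.freeze

# Resolves a Java charset name to a Ruby Encoding, falling back to the name
# itself when no alias is known. Encoding.find raises ArgumentError for
# names Ruby does not recognize.
def ruby_encoding_for(java_charset_name)
  Encoding.find(JAVA_CHARSET_ALIASES.fetch(java_charset_name, java_charset_name))
end

puts ruby_encoding_for("MS932") == Encoding.find("Windows-31J")
puts ruby_encoding_for("UTF-8")
```

With such a mapping, the two values printed above would compare as equal, so a plain string comparison of "MS932" against "Windows-31J" would be a false mismatch.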
I selected the wrong Component/s area; please correct it.