JRuby

Problems accessing files that have Chinese characters in their names

Details

  • Type: Bug
  • Status: Resolved
  • Priority: Major
  • Resolution: Fixed
  • Affects Version/s: JRuby 1.1.2
  • Fix Version/s: JRuby 1.6.4
  • Component/s: Core Classes/Modules
  • Labels: None
  • Environment: WinXP SP2; Linux
  • Number of attachments: 3

Description

The original post is http://jira.codehaus.org/browse/JRUBY-2677 ; this issue was created when the problem turned out to go deeper.
All files involved on WinXP are encoded in GB2312, and all files involved on Linux are encoded in UTF-8.
My WinXP uses GB2312 as its locale, and my Linux uses UTF-8 as its locale.

[Dice@localhost lib]$ cat test.rb
puts ARGV[0]
puts '中文字符串'
IO.foreach(ARGV[0]) {|line| puts line}
[Dice@localhost lib]$ cat txt_en
English string from text file.
中文字符串来自text文件。
[Dice@localhost lib]$ cat txt_中文
English string from text file.
中文字符串来自text文件。

[Dice@localhost lib]$ java -jar jruby.jar -v
ruby 1.8.6 (2008-05-28 rev 6586) [i386-jruby1.1.2]
[Dice@localhost lib]$ java -jar jruby.jar test.rb txt_en
txt_en
中文字符串
English string from text file.
中文字符串来自text文件。
[Dice@localhost lib]$ java -jar jruby.jar test.rb txt_中文
txt_??
中文字符串
test.rb:3:in `initialize': No such file or directory - File not found - txt_?? (Errno::ENOENT)
from test.rb:3:in `foreach'
from test.rb:3

This error occurs because JRuby 1.1.2 can't correctly process arguments containing Chinese characters.
This bug was reported in issue JRUBY-2677, and Charles Oliver Nutter fixed it.
[Dice@localhost lib]$ java -jar jruby.jar -v
jruby 1.1.3-dev (ruby 1.8.6 patchlevel 114) (2008-07-16 rev 7193) [i386-java]
[Dice@localhost lib]$ java -jar jruby.jar test.rb txt_en
txt_en
中文字符串
English string from text file.
中文字符串来自text文件。
[Dice@localhost lib]$ java -jar jruby.jar test.rb txt_中文
txt_中文
中文字符串
English string from text file.
中文字符串来自text文件。

It works now because the locale of my Linux is UTF-8.
But when I tried it under WinXP, whose locale is GB2312, deeper problems occurred.
F:\MyStudio\jruby-1.1.2\bin>jruby.bat -v
jruby 1.1.3-dev (ruby 1.8.6 patchlevel 114) (2008-07-16 rev 7193) [x86-java]
F:\MyStudio\jruby-1.1.2\bin>jruby.bat test.rb txt_中文
txt_中文
中文字符串
test.rb:3:in `initialize': No such file or directory - File not found - txt_????
(Errno::ENOENT)
from test.rb:3:in `foreach'
from test.rb:3

Charles Oliver Nutter's explanation is here: http://jira.codehaus.org/browse/JRUBY-2677?focusedCommentId=142123#action_142123

After this weekend I'll be on summer holiday for more than a month, during which I won't have Internet access, so I won't be able to view or post comments on this issue.
I'm just a Chinese college student and a beginner in programming, so in fact I can't help you much.
When I return I will see what has happened.
Good luck!

  1. JRUBY-2812.diff
    17/Jul/08 2:54 AM
    0.6 kB
    TAKAI Naoto
  2. jruby-2812.patch
    01/Dec/08 3:00 PM
    0.9 kB
    Charles Oliver Nutter
  3. JRUBY-2812-for-trunk.diff
    17/Jul/08 3:37 AM
    1 kB
    TAKAI Naoto

Activity

TAKAI Naoto added a comment -

To get the encoding, we can use a system property. Please try this patch.
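
The attached patch is not reproduced in this thread, so the following is only a minimal sketch of the general idea of reading the platform encoding from a system property. The property name `file.encoding`, the class name `PlatformEncodingDemo`, and the helper `platformCharset` are assumptions for illustration and are not taken from the patch. Note also that on JDK 18 and later `file.encoding` defaults to UTF-8 regardless of the operating system locale.

```java
import java.nio.charset.Charset;

class PlatformEncodingDemo {
    // Hypothetical helper: resolve the charset named by the "file.encoding"
    // system property, falling back to the JVM default when the property is
    // missing, unsupported, or not a legal charset name.
    static Charset platformCharset() {
        String name = System.getProperty("file.encoding");
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: fall through to the default below.
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // The values printed depend entirely on the machine and JVM in use.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("charset in use = " + platformCharset().name());
    }
}
```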

TAKAI Naoto added a comment -

Here is one for trunk.

Tsing added a comment -

I'm a bit confused.
The first .diff file seems to apply to the standard jruby-1.1.2 source code, but the second doesn't.
So I applied the second .diff by hand.

But nothing improved.

F:\NetBeansProjects\jruby-1.1.2\lib>java -jar jruby.jar test.rb txt_中文
txt_??
中文字符串
test.rb:3:in `initialize': No such file or directory - File not found - txt_?? (Errno::ENOENT)
from test.rb:3:in `foreach'
from test.rb:3

Tsing added a comment -

In my opinion, your patch won't work, because when ".getBytes("ISO-8859-1")" is executed, all the Chinese characters are replaced with "?" in the returned byte array.
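
The loss described here can be reproduced in isolation. This is a minimal sketch, not code from JRuby or from either patch; the class and method names are made up for illustration. The Chinese characters are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Latin1LossDemo {
    // Encode to ISO-8859-1 and decode again. Characters that ISO-8859-1
    // cannot represent are replaced with '?' (0x3F) during encoding, so the
    // original text cannot be recovered afterwards.
    static String viaLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The same round trip through UTF-8 is lossless.
    static String viaUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "txt_" followed by U+4E2D U+6587, the file name used in this report.
        String name = "txt_\u4e2d\u6587";
        System.out.println(viaLatin1(name));            // prints "txt_??"
        System.out.println(viaUtf8(name).equals(name)); // prints "true"
    }
}
```

The "txt_??" produced by the ISO-8859-1 round trip matches the file name shown in the error messages above.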

TAKAI Naoto added a comment -

Hi Tsing,

The first patch is for JRuby 1.1.2; the other one is for trunk. If you have the 1.1.2 source, apply only the patch JRUBY-2812.diff and try again.

Tsing added a comment -

Hi TAKAI Naoto, my English is poor, and I still don't understand what "trunk" means.
I have applied only the first patch to JRuby 1.1.2, but the bug is still there.
Test on WinXP:
F:\NetBeansProjects\jruby-1.1.2-old\lib>java -jar jruby.jar test.rb txt_en
txt_en
中文字符串
English string from text file.
中文字符串来自text文件。
F:\NetBeansProjects\jruby-1.1.2-old\lib>java -jar jruby.jar test.rb txt_中文
txt_??
中文字符串
test.rb:3:in `initialize': No such file or directory - File not found - txt_?? (Errno::ENOENT)
from test.rb:3:in `foreach'
from test.rb:3:in `foreach'
from test.rb:3
F:\NetBeansProjects\jruby-1.1.2-old\lib>java -jar jruby.jar 测试.rb txt_en
Error opening script file: F:/NetBeansProjects/jruby-1.1.2-old/lib/??.rb (文件名、目录名或卷标语法不正确。)
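(The Windows error message above means: "The filename, directory name, or volume label syntax is incorrect.")
PS: test_en.rb == test_测试.rb == test.rb, and txt_en == txt_中文. All are encoded in GB2312.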

As for the second patch, I have applied it to "jruby 1.1.3-dev (ruby 1.8.6 patchlevel 114) (2008-07-16 rev 7193) [x86-java]", but I can't build the jruby.jar, so I can't test it.
The version of NetBeans I'm using:
Product Version: NetBeans IDE 6.0 (Build 200711261600)
Java: 1.6.0_03; Java HotSpot(TM) Client VM 1.6.0_03-b05
System: Windows XP version 5.1 running on x86; GBK; zh_CN (nb)
Userdir: C:\Documents and Settings\Administrator\.netbeans\6.0

Maybe you can tell me how to build it.

I still believe the problem occurs just after the argv strings are converted to bytes with ".getBytes("ISO-8859-1")".
The Chinese characters have already been replaced with "?" before those bytes are turned back into a string.

Tsing added a comment -

Recently I found that if I use Iconv to convert a GB2312-encoded string to UTF-8 first on WinXP, everything works. (Of course it does, because after the conversion everything behaves the same way as on my Linux, which I mentioned in my first post.)
So this bug is still a bug, but it no longer matters much.
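
The workaround above was done with Ruby's Iconv, and the exact call is not shown in this thread. As a rough Java analogue of the same transcoding step, and not the code that was actually run, GB2312 bytes can be decoded and re-encoded as UTF-8. The class name `TranscodeDemo` and the helper `transcode` are hypothetical, and the sketch assumes a JDK that ships the GB2312 charset (standard full JDK distributions do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Decode the input bytes with one charset and re-encode with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 encoded in GB2312.
        byte[] gb = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        byte[] utf8 = transcode(gb, Charset.forName("GB2312"), StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // prints "E4 B8 AD E6 96 87"
    }
}
```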

Charles Oliver Nutter added a comment -

Thanks for that update, Tsing. We'll leave it open and perhaps someone will have a chance to apply and test your patch.

Christian Seiler added a comment -

There is also http://jira.codehaus.org/browse/JRUBY-3053 (seems like a related issue or even dup)

Thomas E Enebo added a comment -

We have a patch...let's evaluate this for 1.1.6 since it will probably solve at least two issues.

Charles Oliver Nutter added a comment -

Ok, here's a patch that might solve a whole bunch of string encoding problems at once. Basically, what this does is modify the RubyString.getUnicodeValue method to just use the default encoding for the platform. That should be the correct thing to do in 90% of cases, since most people will be putting non-ASCII characters into files on their own machines. It will also allow strings to come in from the system and back out to Java APIs without being damaged.

But the problem is that all the systems I have available for testing default to UTF-8. I need people on non-UTF-8 systems to test this extensively, and I really need help figuring out a way for us to test this regularly without having easy access to systems with other encodings (like GB-1252).

Please review the patch! Try it out, try lots of stuff out.
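
To illustrate why the choice of decoding charset matters, here is a minimal sketch. It is not the body of RubyString.getUnicodeValue or of the attached patch; the class name and the `decode` helper are made up. "GBK" stands in for the platform default of a Chinese-locale Windows machine (in real code that would come from `Charset.defaultCharset()`), and the sketch assumes a JDK that ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDecodeDemo {
    // U+4E2D U+6587 as a GBK/GB2312 system would hand the bytes over.
    static final byte[] GBK_BYTES = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assuming UTF-8 for bytes that are really GBK yields U+FFFD
        // replacement characters, because the bytes are not valid UTF-8.
        String assumedUtf8 = decode(GBK_BYTES, StandardCharsets.UTF_8);
        // Decoding with the charset the bytes were written in recovers the text.
        String platform = decode(GBK_BYTES, Charset.forName("GBK"));

        System.out.println(assumedUtf8.contains("\uFFFD"));  // prints "true"
        System.out.println(platform.equals("\u4e2d\u6587")); // prints "true"
    }
}
```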

Charles Oliver Nutter added a comment -

I posted a patch with a possible fix, but it's rather dangerous to do this so late in the 1.1.6 cycle. I think it might be OK, but we need more testing than two days can give us. Punting to 1.1.7, hoping for community help.

Tsing added a comment -

Hi Charles, good news: that patch works.
The test sample works well on both my Linux (UTF-8) and my XP (GB2312).
I will test more cases with it.
Maybe I should Google some of the Chinese character problems that people complain about on the web.

Tsing added a comment -

Hi Charles:
I Googled a tough one.

File Test.java:
public class Test {
    public static String message = "你好";
    public void setMessage(String s) { message = s; }
}

File hey.rb:
include Java
include_class("java.lang.System")
include_class("Test") # Please place Test.class and hey.rb in the same directory.
System.out.println(Test.message) # Works well.
a = Test.new
a.setMessage("测试中文输出结果")
System.out.println(Test.message) # Question marks displayed.

The result (jruby 1.1.5 (ruby 1.8.6 patchlevel 114) (2008-12-02 rev 6586) [x86-java]):
你好
?????????????

That patch doesn't work.

Charles Oliver Nutter added a comment -

Ok, this is a slightly different case. Currently, when strings cross the Java/Ruby boundary, we're making the same mistake of assuming UTF-8. So the same change would need to happen there. I don't think this damns the getUnicodeValue patch, but we may need a new "getDecodedValue" method and go case-by-case to see which one should be called.

Charles Oliver Nutter added a comment -

This bug is related to JRUBY-3053. We need a comprehensive solution for handling file encodings, and it's not going to be easy.

Charles Oliver Nutter added a comment -

This is related to JRUBY-2812, which I've punted to 1.3. We'll include this bug and its test cases in that work.

Tsing added a comment -

Charles:
I have seen your post on JRUBY-2812 and I think I understand what's going on with this bug. Given "Fix Version: 1.3", it seems it will be a long time before this problem is fixed.
If you work out a fix, I'll try some test cases for you.

Charles Oliver Nutter added a comment -

Well hopefully 1.3 won't be that far out; really 1.2 was only three months, and this bug is already flagged to get worked on. I also think the Ruby 1.9 work we're doing will help this along, since it will force us to address default encoding, transcoding to Java String, and other details necessary to resolve this.

Test cases will be helpful...and I'm eager to work on this more and get it working well in 1.3. Plus if we get it fixed sooner, there's no reason you couldn't use a nightly build.

Charles Oliver Nutter added a comment -

Tsing: It's still getting pushed back...we simply don't have the cycles to work on everything. But if you are able to get some simple test cases we'll be able to fix each case in turn.

Hiro Asari added a comment - edited

With recent improvements in encoding handling, I'm fairly certain that this one is fixed.

(Original comment included output from Terminal.app, but apparently the encoding fails on JIRA.)
